# Automatic Process - Complete Data Suite

A complete platform for data ingestion, processing, and visualization with automatic lineage tracking.

---

## 🎯 Architecture

```
┌──────────────────────────────────────────────────────┐
│                 DATA PIPELINE STACK                  │
└──────────────────────────────────────────────────────┘

📅 SCHEDULING         🔄 MESSAGING          💾 STORAGE
┌──────────┐          ┌──────────┐          ┌──────────┐
│   Dagu   │────────→ │   NATS   │────────→ │PostgreSQL│
│ (Native) │          │JetStream │          │ClickHouse│
└────┬─────┘          └──────────┘          └──────────┘
     │                                            │
     │             ⚙️ TRANSFORMATIONS             │
     └────────────────────────────────────────────┘
                           │
                           ↓
📊 LINEAGE                                  📈 VISUALIZATION
┌───────────┐                               ┌──────────┐
│  Marquez  │                               │ Grafana  │
│OpenLineage│                               │ Metabase │
└───────────┘                               │   Rill   │
      ↑                                     └──────────┘
      │
🔍 MONITORING
┌──────────┐
│Prometheus│
│   Loki   │
│  Alloy   │
└──────────┘
```

---

## 🚀 Quick Start

### 1. Start All Services

```bash
# Core services
docker-compose up -d

# Analytics
docker-compose -f docker-compose-analytics.yml up -d

# Databases
docker-compose -f docker-compose-databases.yml up -d

# Lineage
docker-compose -f docker-compose-marquez.yml up -d

# Messaging
docker-compose -f docker-compose-nats.yml up -d

# Orchestration (Dagu already runs as a systemd user service)
systemctl --user status dagu.service
```

### 2. Open the Dashboard

**Homer Dashboard**: http://localhost:8080

From there you can reach every service.

---

## 📦 Available Services

### 🎨 Visualization

| Service | Port | Description |
|---------|------|-------------|
| **Grafana** | 3500 | Dashboards and alerts |
| **Metabase** | 3200 | Business Intelligence |
| **Rill** | 9009 | Modern BI dashboard |

### 📊 Monitoring

| Service | Port | Description |
|---------|------|-------------|
| **Prometheus** | 9090 | Metrics and alerts |
| **Loki** | 3100 | Log aggregation |
| **Alloy** | 12345 | Telemetry collector |

### 🔄 Orchestration & Transformations

| Service | Port | Description |
|---------|------|-------------|
| **Dagu** | 8090 | DAG Scheduler & Data Transformations (native on WSL) |

### 📨 Messaging

| Service | Port | Description |
|---------|------|-------------|
| **NATS JetStream** | 4222/8222 | Message broker |

### 💾 Databases

| Service | Port | Description |
|---------|------|-------------|
| **PostgreSQL** | 5434 | Relational database |
| **ClickHouse** | 8123/9000 | Analytical database |
| **DBGate** | 3300 | Database management UI |

### 🗺️ Data Lineage

| Service | Port | Description |
|---------|------|-------------|
| **Marquez** | 3001/5000 | OpenLineage tracking |

---

## 🏗️ Create a Data Pipeline

### Example: Ingestion from an API

#### 1. Create the collection script

```python
#!/usr/bin/env python3
# ~/dagu/scripts/fetch_api_data.py
import json
import subprocess
import sys
from pathlib import Path

import requests

OUTPUT_PATH = '/tmp/api_data.json'
LINEAGE_SCRIPT = Path.home() / 'dagu/scripts/log_lineage.py'


def log_lineage(event, source, target, job='api_ingestion'):
    # Delegate to the log_lineage.py helper through its CLI
    subprocess.run(
        [sys.executable, str(LINEAGE_SCRIPT), '--event', event,
         '--source', source, '--target', target, '--job', job],
        check=True,
    )


def fetch_data():
    response = requests.get('https://api.example.com/data', timeout=30)
    response.raise_for_status()  # Fail the step on HTTP errors
    data = response.json()

    # Save to a temporary file
    with open(OUTPUT_PATH, 'w') as f:
        json.dump(data, f)

    # Log to Marquez
    log_lineage('START', 'api.example.com', OUTPUT_PATH)


if __name__ == '__main__':
    fetch_data()
```

#### 2. Create the DAG in Dagu

```yaml
# ~/dagu/dags/api_ingestion.yaml
name: api_ingestion
description: Ingests data from the API every hour

schedule:
  - "0 * * * *"  # Every hour

env:
  - NATS_URL: nats://localhost:4222
  - POSTGRES_URL: postgresql://postgres:postgres@localhost:5434/postgres

steps:
  # 1. Fetch data from API
  - name: fetch
    command: python ~/dagu/scripts/fetch_api_data.py

  # 2. Validate data
  - name: validate
    command: |
      python ~/dagu/scripts/validate_schema.py \
        --input /tmp/api_data.json
    depends: [fetch]

  # 3. Publish to NATS
  - name: publish_nats
    command: |
      nats pub data.api.raw \
        "$(cat /tmp/api_data.json)" \
        --server=$NATS_URL
    depends: [validate]

  # 4. Consume and ingest to PostgreSQL
  - name: ingest_postgres
    command: |
      python ~/dagu/scripts/nats_to_postgres.py \
        --stream data.api.raw \
        --table api_events
    depends: [publish_nats]

  # 5. Push metrics
  - name: metrics
    command: |
      python ~/dagu/scripts/push_metrics.py \
        --job api_ingestion \
        --success true
    depends: [ingest_postgres]

# Dagu lifecycle hooks are declared under handlerOn
handlerOn:
  failure:
    # Alertmanager expects a JSON array of alerts on its v2 API
    command: |
      curl -X POST http://localhost:9093/api/v2/alerts \
        -H 'Content-Type: application/json' \
        -d '[{"labels": {"alertname": "PipelineFailed", "job": "api_ingestion"}, "annotations": {"summary": "Pipeline failed!"}}]'
```

#### 3. Monitor in Grafana

1. Go to http://localhost:3500
2. Create a dashboard with:
   - Prometheus query: `job_success{job="api_ingestion"}`
   - Loki logs: `{job="dagu"} |= "api_ingestion"`

#### 4. Verify Lineage in Marquez

1. Go to http://localhost:3001
2. Search for the job: `api_ingestion`
3. View the full data graph:

```
api.example.com → /tmp/api_data.json → NATS → PostgreSQL
```

---

## 📝 Included Helper Scripts

### `~/dagu/scripts/log_lineage.py`

Sends OpenLineage events to Marquez.

```bash
python ~/dagu/scripts/log_lineage.py \
  --event START \
  --source api.example.com \
  --target /tmp/data.json \
  --job my_pipeline
```

### `~/dagu/scripts/push_metrics.py`

Publishes metrics to Prometheus.

```bash
python ~/dagu/scripts/push_metrics.py \
  --job my_pipeline \
  --success true \
  --duration 45
```

### `~/dagu/scripts/publish_to_nats.sh`

Publishes messages to NATS JetStream.

```bash
~/dagu/scripts/publish_to_nats.sh data.stream data.json
```

### `~/dagu/scripts/nats_to_postgres.py`

Consumes from NATS and ingests into PostgreSQL.

```bash
python ~/dagu/scripts/nats_to_postgres.py \
  --stream data.raw \
  --table events \
  --batch-size 100
```

---

## 🎯 Use Cases

### 1. ETL from an API to the Warehouse

```
API → Dagu (fetch) → NATS → PostgreSQL → Grafana
         ↓
      Marquez (lineage tracking)
```

### 2. Real-Time Stream Processing

```
IoT Devices → NATS → Dagu (transform) → ClickHouse → Rill
                         ↓
                      Marquez
```

### 3. Daily Reporting

```
Dagu (schedule) → PostgreSQL (query) → Metabase (dashboard) → Email
       ↓
    Marquez
```

---

## 🔧 Configuration

### NATS JetStream

```bash
# Create the stream
# "data.>" also matches multi-token subjects such as data.api.raw
# ("data.*" would only match a single token after "data.")
nats stream add DATA_STREAM \
  --subjects "data.>" \
  --storage file \
  --retention limits \
  --max-age 7d

# Check status
nats stream ls
nats stream info DATA_STREAM
```

### PostgreSQL

```bash
# Connect (interactive session)
psql -h localhost -p 5434 -U postgres -d postgres

# Create the table
psql -h localhost -p 5434 -U postgres -d postgres <<'SQL'
CREATE TABLE events (
    id SERIAL PRIMARY KEY,
    timestamp TIMESTAMPTZ DEFAULT NOW(),
    source VARCHAR(255),
    data JSONB,
    lineage_job VARCHAR(255)
);
SQL
```

### ClickHouse

```bash
# Connect (interactive session)
clickhouse-client --host localhost --port 9000

# Create the table
clickhouse-client --host localhost --port 9000 <<'SQL'
CREATE TABLE events (
    timestamp DateTime,
    source String,
    data String,
    lineage_job String
) ENGINE = MergeTree()
ORDER BY timestamp;
SQL
```

---

## 📊 Monitoring

### View Metrics in Prometheus

```
http://localhost:9090

Useful queries:
- job_success{job=~".+"}
- job_duration_seconds{job=~".+"}
- rate(job_executions_total[5m])
```

### View Logs in Grafana

```
http://localhost:3500 → Explore → Loki

Useful queries:
- {job="dagu"}
- {job="dagu"} |= "error"
- {job="dagu"} |= "api_ingestion"
```

### View Lineage in Marquez

```
http://localhost:3001

Search for:
- Jobs: api_ingestion, data_transform
- Datasets: /tmp/api_data.json, postgres://events
- Runs: latest executions
```

---

## 🚨 Troubleshooting

### Dagu is not responding

```bash
# View logs
journalctl --user -u dagu.service -f

# Restart
systemctl --user restart dagu.service
```

### NATS does not connect

```bash
# Check status
docker logs nats

# Check connectivity to the client port
nats rtt --server nats://localhost:4222
```

### Database is not reachable

```bash
# PostgreSQL
docker logs postgres-main

# ClickHouse
docker logs clickhouse
```

### Marquez is not recording events

```bash
# View logs
docker logs marquez

# Test the API manually
curl http://localhost:5000/api/v1/namespaces
```

---

## 📚 Additional Documentation

- **CLAUDE.md**: Technical guide to operating the services
- **TRANSFORMATIONS.md**: Complete guide to transformations with Dagu
- **~/dagu/README.md**: Dagu-specific documentation
- **Dagu Docs**: https://dagu.sh/
- **OpenLineage Spec**: https://openlineage.io/
- **NATS Docs**: https://docs.nats.io/

---

## 🤝 Contributing

### Add a new pipeline

1. Create the script in `~/dagu/scripts/`
2. Create the DAG in `~/dagu/dags/`
3. Add lineage tracking
4. Create a dashboard in Grafana
5. Document it in this README

### Add a new service

1. Create `docker-compose-.yml`
2. Add it to Homer in `homer/assets/config.yml`
3. Document its ports and configuration
4. Update CLAUDE.md if it needs special handling

---

## 📊 Detailed Architecture

```
              ┌─────────────┐
              │    DAGU     │
              │  Scheduler  │
              └──────┬──────┘
                     │
      ┌──────────────┼──────────────┐
      │              │              │
      ▼              ▼              ▼
┌───────────┐  ┌───────────┐  ┌───────────┐
│  Script   │  │  Script   │  │  Script   │
│   Fetch   │  │ Transform │  │  Export   │
└─────┬─────┘  └─────┬─────┘  └─────┬─────┘
      │              │              │
      └──────────────┼──────────────┘
                     │
                ┌────▼────┐
                │  NATS   │
                │JetStream│
                └────┬────┘
                     │
      ┌──────────────┼──────────────┐
      │              │              │
      ▼              ▼              ▼
 ┌──────────┐   ┌──────────┐   ┌──────────┐
 │PostgreSQL│   │ClickHouse│   │   Dagu   │
 │          │   │          │   │Transform │
 └────┬─────┘   └────┬─────┘   └────┬─────┘
      │              │              │
      └──────────────┼──────────────┘
                     │
        ┌────────────┼────────────┐
        │            │            │
        ▼            ▼            ▼
   ┌─────────┐  ┌─────────┐  ┌─────────┐
   │ Grafana │  │Metabase │  │  Rill   │
   └─────────┘  └─────────┘  └─────────┘

         Lineage Tracking (Marquez)
    ┌────────────────────────────────┐
    │ API → NATS → DB → Visualization│
    └────────────────────────────────┘
```

---

## 🔐 Credentials

| Service | User | Password | Port |
|---------|------|----------|------|
| PostgreSQL | postgres | postgres | 5434 |
| ClickHouse | default | clickhouse | 8123 |
| Marquez DB | marquez | marquez | 5433 |
| Metabase DB | metabase | metabase | (internal) |
| NATS | nats | nats123 | 4222 |

**⚠️ IMPORTANT**: Change these passwords in production.

---

## 📈 Roadmap

- [ ] Add dbt for SQL transformations
- [ ] Integrate Airflow as an alternative to Dagu
- [ ] Add Kafka as an alternative to NATS
- [ ] Implement data quality checks with Great Expectations
- [ ] Unified lineage + monitoring dashboard
- [ ] CI/CD for data pipelines
- [ ] Disaster recovery and automatic backups

---

**Last updated**: 2026-03-23
**Version**: 1.0.0
**Maintainer**: Lucas (@egutierrez)

---

## 📞 Support

For issues and questions:

- Gitea: https://gitea-dgg044oo04woo4ggcsws4gk0.organic-machine.com/dataforge/automatic-process
- Claude Assistant: Available 24/7 for pipeline management
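---

This README documents only the command line of `~/dagu/scripts/log_lineage.py`, not what it sends. As a rough illustration, the sketch below builds an OpenLineage run event and posts it to the Marquez API on port 5000. It is not the actual helper: the `automatic-process` namespace, the `producer` URL, the schema version, and the function names `build_event` and `send_event` are all assumptions made for this example.

```python
import json
import uuid
from datetime import datetime, timezone
from urllib import request

# Assumption: the Marquez API listens on port 5000, as listed in this README.
MARQUEZ_URL = "http://localhost:5000/api/v1/lineage"


def build_event(event_type, job, source, target, run_id=None,
                namespace="automatic-process"):
    """Build an OpenLineage run event as a plain dict (hypothetical helper)."""
    return {
        "eventType": event_type,  # e.g. START, COMPLETE, FAIL
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": run_id or str(uuid.uuid4())},
        "job": {"namespace": namespace, "name": job},
        "inputs": [{"namespace": namespace, "name": source}],
        "outputs": [{"namespace": namespace, "name": target}],
        # Placeholder producer URI and assumed spec version; adjust both.
        "producer": "https://example.com/automatic-process",
        "schemaURL": "https://openlineage.io/spec/1-0-5/OpenLineage.json"
                     "#/definitions/RunEvent",
    }


def send_event(event, url=MARQUEZ_URL, timeout=10):
    """POST the event to Marquez and return the HTTP status code."""
    req = request.Request(
        url,
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with request.urlopen(req, timeout=timeout) as resp:
        return resp.status


if __name__ == "__main__":
    # Only prints the payload; call send_event(event) once Marquez is running.
    event = build_event("START", "api_ingestion",
                        "api.example.com", "/tmp/api_data.json")
    print(json.dumps(event, indent=2))
```

To record a full run, reuse one `run_id` for both a `START` and a `COMPLETE` event so that Marquez can group them under the same run.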