refactor: remove Temporal in favor of Dagu for transformations

Temporal was overkill for our typical data pipelines.

Changes:
- Removed docker-compose-temporal.yml and its configuration
- Removed Temporal from the Homer dashboard
- Updated README and CLAUDE.md to drop references to Temporal
- Added complete documentation for transformations with Dagu

Dagu is sufficient because:
- Workflows finish in minutes, not days
- Transformations are simple to medium complexity (Python/SQL)
- We do not need to pause/resume workflows
- Lower overhead and simpler to maintain

If we need long-running workflows or complex state in the future, we can bring Temporal back up.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-03-23 22:58:53 +01:00
parent aadae87a78
commit ea84a8e1f8
6 changed files with 1444 additions and 92 deletions
@@ -0,0 +1,338 @@
 # CLAUDE.md - Service Manipulation Guide
 ## 🎯 Purpose
 This document describes which services I can manipulate directly, which ones require MCPs, and how to interact with each one to build data pipelines.
 ---
 ## ✅ Services I CAN Manipulate Directly
 ### 1. **Dagu** (Easy - Full Access)
 - ✅ **Capability**: Create, modify, and delete workflows (DAGs)
 - ✅ **Location**: `~/dagu/dags/*.yaml`
 - ✅ **Use**: Scheduling, launching scripts, basic orchestration
 - **Example**:
  ```yaml
  name: ingest_data
  schedule: "0 * * * *"  # Every hour
  steps:
    - name: fetch
      command: python ~/scripts/fetch_data.py
    - name: publish
      command: ~/scripts/publish_to_nats.sh
      depends: [fetch]
  ```
 ### 2. **NATS JetStream** (Medium - REST API/CLI)
 - ✅ **Capability**: Publish messages, create streams and subscriptions
 - ⚠️ **Limitation**: I need to use the `nats` CLI or scripts that call the API
 - ✅ **Use**: Message broker, event streaming, pub/sub
 - **Access**:
  - Port 4222: NATS client
  - Port 8222: HTTP Monitoring API
 - **Example (via Dagu)**:
  ```bash
  # Publish to NATS
  nats pub data.ingested "$(cat data.json)" --server=nats://localhost:4222
  ```
 ### 3. **Databases** (Easy - Direct SQL)
 - ✅ **PostgreSQL**: Port 5434
  ```bash
  psql -h localhost -p 5434 -U postgres -d postgres -c "INSERT INTO..."
  ```
 - ✅ **ClickHouse**: Ports 8123 (HTTP), 9000 (Native)
  ```bash
  curl -X POST 'http://localhost:8123/' -d "INSERT INTO table VALUES..."
  ```
 - ✅ **Marquez DB**: Port 5433 (for metadata)
 ### 4. **Marquez (OpenLineage)** (Medium - REST API)
 - ✅ **Capability**: Send lineage events via the API
 - ✅ **Use**: Track the source and destination of data at every step
 - **Example**:
  ```bash
  curl -X POST http://localhost:5000/api/v1/lineage \
    -H "Content-Type: application/json" \
    -d @lineage_event.json
  ```
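The `lineage_event.json` file posted above is not shown in this guide. A minimal sketch of how it could be generated follows; the field layout follows the OpenLineage run event format, while the function name, the example dataset names, and the exact `schemaURL` version are assumptions to check against your Marquez release.

```python
import json
import uuid
from datetime import datetime, timezone


def build_lineage_event(event_type, job_name, source, target,
                        namespace="automatic-process"):
    """Return a minimal OpenLineage run event as a plain dict."""
    return {
        "eventType": event_type,  # START, COMPLETE or FAIL
        "eventTime": datetime.now(timezone.utc).isoformat(),
        # Every event needs a run ID; reuse one ID for all events of the same run.
        "run": {"runId": str(uuid.uuid4())},
        "job": {"namespace": namespace, "name": job_name},
        "inputs": [{"namespace": namespace, "name": source}],
        "outputs": [{"namespace": namespace, "name": target}],
        "producer": "dagu://pipeline",
        # Assumed spec version; adjust to the one your Marquez release expects.
        "schemaURL": "https://openlineage.io/spec/1-0-5/OpenLineage.json#/definitions/RunEvent",
    }


def to_json(event):
    """Serialize the event so it can be saved as lineage_event.json."""
    return json.dumps(event, indent=2)
```

Saving the output of `to_json(build_lineage_event("COMPLETE", "ingest_data", "api.example.com", "/tmp/raw_data.json"))` to `lineage_event.json` would give the `curl` call something to send.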
 ### 5. **Logs (Prometheus/Loki)** (Medium - Pushgateway/API)
 - ✅ **Prometheus**: Export metrics via the Pushgateway
 - ✅ **Loki**: Send logs via the HTTP API
 - ✅ **Use**: Monitoring, alerting, debugging
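As a rough sketch of sending logs through Loki's HTTP API, the snippet below builds the JSON body for the push endpoint and posts it with the standard library. The port 3100 and the `job="dagu"` label come from the README; the function names are hypothetical.

```python
import json
import time
import urllib.request


def build_loki_payload(line, labels, ts_ns=None):
    """Build the JSON body for Loki's push endpoint (one stream, one log line)."""
    if ts_ns is None:
        ts_ns = time.time_ns()
    # Loki expects the timestamp as a string of nanoseconds since the epoch.
    return {"streams": [{"stream": dict(labels), "values": [[str(ts_ns), line]]}]}


def push_log(line, labels, url="http://localhost:3100/loki/api/v1/push"):
    """POST a single log line to Loki and return the HTTP status code."""
    body = json.dumps(build_loki_payload(line, labels)).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}, method="POST"
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.status
```

A pipeline step could call `push_log("fetch finished", {"job": "dagu", "dag": "ingest_data"})`, which keeps the log line searchable with the `{job="dagu"}` selector.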
 ---
 ## ❌ Services that NEED an MCP
 ### 1. **Grafana** (Dashboards/Datasources)
 - ❌ **Problem**: Creating complex dashboards requires the UI or a complex API
 - 🔧 **Solution**: Grafana MCP
  - Create datasources programmatically
  - Generate dashboards from templates
  - Configure alerts
 - **Without an MCP I can**: Use existing datasources manually
 ### 2. **Metabase** (Queries/Dashboards)
 - ❌ **Problem**: Questions and dashboards are created through the UI
 - 🔧 **Solution**: Metabase MCP
  - Create SQL queries from code
  - Generate dashboards automatically
  - Configure filters and parameters
 - **Without an MCP I can**: Run queries manually in the UI
 ### 3. **Rill** (Modern Dashboards)
 - ❌ **Problem**: Models and dashboards need Rill-specific configuration
 - 🔧 **Solution**: Rill MCP, or editing the YAML files directly
 - **Without an MCP I can**: Edit files in `~/rill-data/` if I know the structure
 ---
 ## 🏗️ Proposed Data Architecture
 ### Complete Flow (ALWAYS with Lineage)
 ```
 ┌──────────┐
 │  DAGU    │ ← Scheduling (cron, manual)
 │ (Native) │
 └────┬─────┘
     │
     ├─→ [STEP 1: COLLECTION]
     │   ├─→ Python/Bash script
     │   ├─→ API calls, scraping, etc.
     │   └─→ 📝 Log to Marquez (source: API)
     │
     ├─→ [STEP 2: VALIDATION]
     │   ├─→ Schema validation
     │   ├─→ Data quality checks
     │   └─→ 📝 Log to Marquez (transformation)
     │
     ├─→ [STEP 3: PUBLISH TO NATS]
     │   ├─→ NATS JetStream (stream: raw_data)
     │   ├─→ Format: JSON events
     │   └─→ 📝 Log to Marquez (target: NATS)
     │
     ├─→ [STEP 4: CONSUMPTION AND INGESTION]
     │   ├─→ Consumer NATS → PostgreSQL
     │   ├─→ Consumer NATS → ClickHouse
     │   └─→ 📝 Log to Marquez (target: DB)
     │
     ├─→ [STEP 5: TRANSFORMATION (in Dagu)]
     │   ├─→ Python/Pandas or SQL
     │   ├─→ Aggregations, calculations
     │   └─→ 📝 Log to Marquez (transformation)
     │
     └─→ [STEP 6: LOGS & MONITORING]
         ├─→ Prometheus: Metrics (successes, failures, duration)
         ├─→ Loki: Structured logs
         └─→ Grafana: Real-time dashboards
 ```
 ---
 ## 📋 DAG Template with Lineage
 ```yaml
 name: data_pipeline_template
 description: Template for pipelines with complete lineage
 tags:
  - data-pipeline
  - lineage
  - production
 env:
  - MARQUEZ_URL: http://localhost:5000
  - NATS_URL: nats://localhost:4222
  - POSTGRES_URL: postgresql://postgres:postgres@localhost:5434/postgres
 schedule:
  - "0 */6 * * *"  # Every 6 hours
 steps:
  # 1. FETCH DATA
  - name: fetch_data
    command: |
      python ~/dagu/scripts/fetch_data.py \
        --output /tmp/raw_data.json \
        --log-lineage
  # 2. VALIDATE
  - name: validate_data
    command: |
      python ~/dagu/scripts/validate.py \
        --input /tmp/raw_data.json \
        --log-lineage
    depends: [fetch_data]
  # 3. PUBLISH TO NATS
  - name: publish_to_nats
    command: |
      nats pub data.raw \
        "$(cat /tmp/raw_data.json)" \
        --server=$NATS_URL
      # Log lineage
      python ~/dagu/scripts/log_lineage.py \
        --event COMPLETE \
        --source /tmp/raw_data.json \
        --target nats://data.raw \
        --job data_pipeline_template
    depends: [validate_data]
  # 4. INGEST TO POSTGRES
  - name: ingest_postgres
    command: |
      python ~/dagu/scripts/ingest_postgres.py \
        --nats-stream data.raw \
        --table raw_events \
        --log-lineage
    depends: [publish_to_nats]
  # 5. SEND METRICS
  - name: log_metrics
    command: |
      python ~/dagu/scripts/push_metrics.py \
        --job data_pipeline_template \
        --success true
    depends: [ingest_postgres]
 handlerOn:
  failure:
    command: |
      python ~/dagu/scripts/push_metrics.py \
        --job data_pipeline_template \
        --success false
 ```
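The `depends` fields in the template determine the order in which steps may run. As an illustration only (this is not Dagu's scheduler, just a sketch using Python's standard `graphlib`), the snippet below checks that every `depends` entry names a defined step and derives a valid execution order. The `DEPENDS` mapping is copied by hand from the template; the function name is hypothetical.

```python
from graphlib import TopologicalSorter

# Step name -> steps it depends on, copied from the template above.
DEPENDS = {
    "fetch_data": [],
    "validate_data": ["fetch_data"],
    "publish_to_nats": ["validate_data"],
    "ingest_postgres": ["publish_to_nats"],
    "log_metrics": ["ingest_postgres"],
}


def execution_order(depends):
    """Return one valid run order, rejecting references to undefined steps."""
    referenced = {dep for deps in depends.values() for dep in deps}
    unknown = referenced - set(depends)
    if unknown:
        raise ValueError(f"depends on undefined steps: {sorted(unknown)}")
    # static_order() raises graphlib.CycleError if the steps form a cycle.
    return list(TopologicalSorter(depends).static_order())
```

Running such a check before copying a DAG into `~/dagu/dags/` catches typos in step names early.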
 ---
 ## 🎯 Required Helper Scripts
 ### 1. `~/dagu/scripts/log_lineage.py`
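 ```python
 #!/usr/bin/env python3
 import uuid
 from datetime import datetime, timezone

 import requests


 def log_openlineage_event(event_type, source, target, job_name):
     """Send an OpenLineage event to Marquez."""
     event = {
         "eventType": event_type,  # START, COMPLETE, FAIL
         "eventTime": datetime.now(timezone.utc).isoformat(),
         "producer": "dagu://pipeline",
         # OpenLineage events must carry a run ID. A new ID is generated per call here;
         # pass the same ID for START and COMPLETE/FAIL if they should be linked as one run.
         "run": {"runId": str(uuid.uuid4())},
         "job": {
             "namespace": "automatic-process",
             "name": job_name
         },
         "inputs": [{"namespace": "automatic-process", "name": source}],
         "outputs": [{"namespace": "automatic-process", "name": target}]
     }
     response = requests.post(
         "http://localhost:5000/api/v1/lineage",
         json=event,
         timeout=10
     )
     # Surface HTTP errors instead of silently dropping the event
     response.raise_for_status()
 ```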
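The DAGs in this repository invoke `log_lineage.py` with `--event`, `--source`, `--target`, and `--job` flags, but the script above only defines a function. One possible way to wire up a command-line interface is sketched below with `argparse`; the default job name and the `main` helper are assumptions, not part of the original script.

```python
import argparse

VALID_EVENTS = ("START", "COMPLETE", "FAIL")


def parse_args(argv=None):
    """Parse the flags that the DAG steps pass to log_lineage.py."""
    parser = argparse.ArgumentParser(description="Send an OpenLineage event to Marquez")
    # str.upper runs before the choices check, so "--event start" is accepted.
    parser.add_argument("--event", required=True, type=str.upper, choices=VALID_EVENTS)
    parser.add_argument("--source", required=True, help="input dataset name")
    parser.add_argument("--target", required=True, help="output dataset name")
    parser.add_argument("--job", default="unnamed_job", help="job name shown in Marquez")
    return parser.parse_args(argv)


def main(argv, send):
    """Parse argv and hand the values to `send`, e.g. log_openlineage_event."""
    args = parse_args(argv)
    send(args.event, args.source, args.target, args.job)
```

At the bottom of the script, `main(sys.argv[1:], log_openlineage_event)` under an `if __name__ == "__main__":` guard would connect the flags to the function.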
 ### 2. `~/dagu/scripts/push_metrics.py`
 ```python
 #!/usr/bin/env python3
 from prometheus_client import CollectorRegistry, Gauge, push_to_gateway
 def push_metrics(job_name, success):
    """Push metrics to the Prometheus Pushgateway"""
    registry = CollectorRegistry()
    g = Gauge('job_success', 'Job success status', registry=registry)
    g.set(1 if success else 0)
    push_to_gateway(
        'localhost:9091',
        job=job_name,
        registry=registry
    )
 ```
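The DAGs call this script as `push_metrics.py --job ... --success true`, so the string `true`/`false` has to be converted explicitly; passing it through `bool()` would treat `"false"` as true. A minimal argument-parsing sketch follows. The accepted spellings and the optional `--duration` flag (seen in the README usage) are assumptions about how the script might handle them.

```python
import argparse


def str_to_bool(value):
    """Convert 'true'/'false' style strings; bool('false') would wrongly be True."""
    normalized = value.strip().lower()
    if normalized in ("true", "1", "yes"):
        return True
    if normalized in ("false", "0", "no"):
        return False
    raise argparse.ArgumentTypeError(f"expected true or false, got {value!r}")


def parse_args(argv=None):
    """Parse the flags that the DAG steps pass to push_metrics.py."""
    parser = argparse.ArgumentParser(description="Push job metrics to the Pushgateway")
    parser.add_argument("--job", required=True)
    parser.add_argument("--success", required=True, type=str_to_bool)
    parser.add_argument("--duration", type=float, default=None, help="run time in seconds")
    return parser.parse_args(argv)
```

The parsed values can then be passed to `push_metrics(args.job, args.success)`.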
 ### 3. `~/dagu/scripts/publish_to_nats.sh`
 ```bash
 #!/bin/bash
 # Publish to NATS JetStream: $1 = subject, $2 = file containing the payload
 nats pub "$1" "$(cat "$2")" --server=nats://localhost:4222
 ```
 ---
 ## 🚀 Getting Started
 ### 1. Install the required CLIs
 ```bash
 # NATS CLI
 curl -sf https://binaries.nats.dev/nats-io/natscli/nats@latest | sh
 ```
 ### 2. Create the scripts directory
 ```bash
 mkdir -p ~/dagu/scripts
 # Run this after copying the helper scripts in; chmod fails while the directory is empty
 chmod +x ~/dagu/scripts/*.{py,sh}
 ```
 ### 3. Configure environment variables
 ```bash
 # Add to ~/.bashrc
 export MARQUEZ_URL=http://localhost:5000
 export NATS_URL=nats://localhost:4222
 export POSTGRES_URL=postgresql://postgres:postgres@localhost:5434/postgres
 export CLICKHOUSE_URL=http://localhost:8123
 ```
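Helper scripts can read these variables instead of hard-coding URLs. The sketch below shows one way to do that with fallbacks equal to the values exported above; the `load_config` and `postgres_port` names are hypothetical.

```python
import os
from urllib.parse import urlsplit

# Fallbacks mirror the values exported in ~/.bashrc above.
DEFAULTS = {
    "MARQUEZ_URL": "http://localhost:5000",
    "NATS_URL": "nats://localhost:4222",
    "POSTGRES_URL": "postgresql://postgres:postgres@localhost:5434/postgres",
    "CLICKHOUSE_URL": "http://localhost:8123",
}


def load_config(environ=None):
    """Return service URLs from the environment, falling back to local defaults."""
    environ = os.environ if environ is None else environ
    return {name: environ.get(name, default) for name, default in DEFAULTS.items()}


def postgres_port(config):
    """Extract the port number from POSTGRES_URL."""
    return urlsplit(config["POSTGRES_URL"]).port
```

For example, `requests.post(load_config()["MARQUEZ_URL"] + "/api/v1/lineage", ...)` would follow whatever `MARQUEZ_URL` the DAG's `env` block sets.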
 ---
 ## 📊 Recommended MCPs (Future)
 ### High Priority
 1. **Grafana MCP** - Automate dashboards
 2. **PostgreSQL MCP** - Complex queries and migrations
 3. **ClickHouse MCP** - Analytical queries
 ### Medium Priority
 4. **Metabase MCP** - Self-service BI
 ### Low Priority
 5. **Rill MCP** - Modern dashboards
 ---
 ## 📝 Checklist for Every Pipeline
 When you create a pipeline, ALWAYS:
 - [ ] Define the schedule in Dagu
 - [ ] Log the start in Marquez (START event)
 - [ ] Validate data before processing
 - [ ] Publish to NATS to decouple producers from consumers
 - [ ] Log every transformation in Marquez
 - [ ] Ingest into the databases
 - [ ] Log the end in Marquez (COMPLETE event)
 - [ ] Push metrics to Prometheus
 - [ ] Send structured logs to Loki
 - [ ] Handle errors (FAIL event to Marquez)
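The START / COMPLETE / FAIL items of the checklist can be enforced in one place. The following sketch wraps a unit of work in a context manager; `emit` is a hypothetical callback (for instance a thin wrapper around the lineage helper) and is an assumption of this example.

```python
from contextlib import contextmanager


@contextmanager
def tracked_run(job_name, emit):
    """Emit START on entry, then COMPLETE on success or FAIL on any exception."""
    emit("START", job_name)
    try:
        yield
    except Exception:
        emit("FAIL", job_name)
        raise  # Re-raise so the Dagu step still exits with an error
    else:
        emit("COMPLETE", job_name)
```

A script would then run its work inside `with tracked_run("api_ingestion", emit): ...`, so no code path can finish without a closing event.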
 ---
 ## 🔗 Service URLs
 - **Dagu**: http://localhost:8090
 - **NATS Monitoring**: http://localhost:8222
 - **Marquez**: http://localhost:3001
 - **Grafana**: http://localhost:3500
 - **Prometheus**: http://localhost:9090
 - **DBGate**: http://localhost:3300
 ---
 **Last updated**: 2026-03-23
 **Maintainer**: Claude (Assistant)
@@ -0,0 +1,523 @@
 # Automatic Process - Complete Data Suite
 A complete platform for data ingestion, processing, and visualization with automatic lineage tracking.
 ---
 ## 🎯 Architecture
 ```
 ┌─────────────────────────────────────────────────────────────────┐
 │                    DATA PIPELINE STACK                          │
 └─────────────────────────────────────────────────────────────────┘
 📅 SCHEDULING          🔄 MESSAGING           💾 STORAGE
 ┌──────────┐          ┌──────────┐          ┌──────────┐
 │  Dagu    │────────→ │   NATS   │────────→ │PostgreSQL│
 │ (Native) │          │JetStream │          │ClickHouse│
 └────┬─────┘          └──────────┘          └──────────┘
     │                                             │
     │ ⚙️ TRANSFORMATIONS                         │
     └────────────────────────────────────────────┘
                            │
                            ↓
                    📊 LINEAGE            📈 VISUALIZATION
                    ┌──────────┐          ┌──────────┐
                    │ Marquez  │          │ Grafana  │
                    │OpenLineage│         │ Metabase │
                    └──────────┘          │   Rill   │
                            ↑             └──────────┘
                            │
                    🔍 MONITORING
                    ┌──────────┐
                    │Prometheus│
                    │   Loki   │
                    │  Alloy   │
                    └──────────┘
 ```
 ---
 ## 🚀 Quick Start
 ### 1. Start All Services
 ```bash
 # Core services
 docker-compose up -d
 # Analytics
 docker-compose -f docker-compose-analytics.yml up -d
 # Databases
 docker-compose -f docker-compose-databases.yml up -d
 # Lineage
 docker-compose -f docker-compose-marquez.yml up -d
 # Messaging
 docker-compose -f docker-compose-nats.yml up -d
 # Orchestration (Dagu already runs as a systemd user service)
 systemctl --user status dagu.service
 ```
 ### 2. Open the Dashboard
 **Homer Dashboard**: http://localhost:8080
 From there you can reach every service.
 ---
 ## 📦 Available Services
 ### 🎨 Visualization
 | Service | Port | Description |
 |----------|--------|-------------|
 | **Grafana** | 3500 | Dashboards and alerts |
 | **Metabase** | 3200 | Business Intelligence |
 | **Rill** | 9009 | Modern BI dashboard |
 ### 📊 Monitoring
 | Service | Port | Description |
 |----------|--------|-------------|
 | **Prometheus** | 9090 | Metrics and alerts |
 | **Loki** | 3100 | Log aggregation |
 | **Alloy** | 12345 | Telemetry collector |
 ### 🔄 Orchestration & Transformations
 | Service | Port | Description |
 |----------|--------|-------------|
 | **Dagu** | 8090 | DAG Scheduler & Data Transformations (native on WSL) |
 ### 📨 Messaging
 | Service | Port | Description |
 |----------|--------|-------------|
 | **NATS JetStream** | 4222/8222 | Message broker |
 ### 💾 Databases
 | Service | Port | Description |
 |----------|--------|-------------|
 | **PostgreSQL** | 5434 | Relational database |
 | **ClickHouse** | 8123/9000 | Analytical database |
 | **DBGate** | 3300 | Database management UI |
 ### 🗺️ Data Lineage
 | Service | Port | Description |
 |----------|--------|-------------|
 | **Marquez** | 3001/5000 | OpenLineage tracking |
 ---
 ## 🏗️ Creating a Data Pipeline
 ### Example: Ingestion from an API
 #### 1. Create the collection script
 ```python
 #!/usr/bin/env python3
 # ~/dagu/scripts/fetch_api_data.py
 import json

 import requests

 # Helper defined in ~/dagu/scripts/log_lineage.py (same directory as this script)
 from log_lineage import log_openlineage_event


 def fetch_data():
     response = requests.get('https://api.example.com/data', timeout=30)
     response.raise_for_status()  # Fail the Dagu step on HTTP errors
     data = response.json()
     # Save temporarily
     with open('/tmp/api_data.json', 'w') as f:
         json.dump(data, f)
     # Log to Marquez once the file has been written
     log_openlineage_event('COMPLETE', 'api.example.com', '/tmp/api_data.json', 'api_ingestion')


 if __name__ == '__main__':
     fetch_data()
 ```
 #### 2. Create the DAG in Dagu
 ```yaml
 # ~/dagu/dags/api_ingestion.yaml
 name: api_ingestion
 description: Ingests data from the API every hour
 schedule:
  - "0 * * * *"  # Every hour
 env:
  - NATS_URL: nats://localhost:4222
  - POSTGRES_URL: postgresql://postgres:postgres@localhost:5434/postgres
 steps:
  # 1. Fetch data from API
  - name: fetch
    command: python ~/dagu/scripts/fetch_api_data.py
  # 2. Validate data
  - name: validate
    command: |
      python ~/dagu/scripts/validate_schema.py \
        --input /tmp/api_data.json
    depends: [fetch]
  # 3. Publish to NATS
  - name: publish_nats
    command: |
      nats pub data.api.raw \
        "$(cat /tmp/api_data.json)" \
        --server=$NATS_URL
    depends: [validate]
  # 4. Consume and ingest to PostgreSQL
  - name: ingest_postgres
    command: |
      python ~/dagu/scripts/nats_to_postgres.py \
        --stream data.api.raw \
        --table api_events
    depends: [publish_nats]
  # 5. Push metrics
  - name: metrics
    command: |
      python ~/dagu/scripts/push_metrics.py \
        --job api_ingestion \
        --success true
    depends: [ingest_postgres]
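 handlerOn:
  failure:
    command: |
      curl -X POST http://localhost:9093/api/v2/alerts \
        -H "Content-Type: application/json" \
        -d '[{"labels": {"alertname": "PipelineFailed", "job": "api_ingestion"}, "annotations": {"summary": "Pipeline failed!"}}]'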
 ```
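The `validate` step calls `validate_schema.py`, which is not shown in this README. A minimal standard-library sketch of what such a check might look like follows; the required fields (`id`, `timestamp`) are placeholders for whatever the API actually returns.

```python
import json

# Hypothetical schema: field name -> expected Python type.
REQUIRED_FIELDS = {"id": int, "timestamp": str}


def validate_records(records, required=REQUIRED_FIELDS):
    """Return a list of error messages; an empty list means the data is valid."""
    if not isinstance(records, list):
        return ["top-level JSON value must be a list of records"]
    errors = []
    for index, record in enumerate(records):
        if not isinstance(record, dict):
            errors.append(f"record {index}: not a JSON object")
            continue
        for field, expected in required.items():
            if field not in record:
                errors.append(f"record {index}: missing field '{field}'")
            elif not isinstance(record[field], expected):
                errors.append(f"record {index}: field '{field}' is not {expected.__name__}")
    return errors


def validate_file(path):
    """Load a JSON file and validate its records."""
    with open(path) as handle:
        return validate_records(json.load(handle))
```

The script would exit with a non-zero status when `validate_file` returns errors, so that Dagu marks the step as failed and the dependent `publish_nats` step does not run.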
 #### 3. Monitor in Grafana
 1. Go to http://localhost:3500
 2. Create a dashboard with:
   - Prometheus query: `job_success{job="api_ingestion"}`
   - Loki logs: `{job="dagu"} |= "api_ingestion"`
 #### 4. Verify Lineage in Marquez
 1. Go to http://localhost:3001
 2. Search for the job: `api_ingestion`
 3. View the complete data graph:
   ```
   api.example.com → /tmp/api_data.json → NATS → PostgreSQL
   ```
 ---
 ## 📝 Included Helper Scripts
 ### `~/dagu/scripts/log_lineage.py`
 Sends OpenLineage events to Marquez
 ```bash
 python ~/dagu/scripts/log_lineage.py \
  --event START \
  --source api.example.com \
  --target /tmp/data.json \
  --job my_pipeline
 ```
 ### `~/dagu/scripts/push_metrics.py`
 Publishes metrics to Prometheus
 ```bash
 python ~/dagu/scripts/push_metrics.py \
  --job my_pipeline \
  --success true \
  --duration 45
 ```
 ### `~/dagu/scripts/publish_to_nats.sh`
 Publishes messages to NATS JetStream
 ```bash
 ~/dagu/scripts/publish_to_nats.sh data.stream data.json
 ```
 ### `~/dagu/scripts/nats_to_postgres.py`
 Consumes from NATS and ingests into PostgreSQL
 ```bash
 python ~/dagu/scripts/nats_to_postgres.py \
  --stream data.raw \
  --table events \
  --batch-size 100
 ```
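The `--batch-size` flag suggests that messages are grouped before being written to PostgreSQL. The script itself is not shown, so the following is only a sketch of that batching idea with a hypothetical `batches` helper; it works on any iterable of messages.

```python
from itertools import islice


def batches(messages, size):
    """Yield lists of up to `size` messages until the iterable is exhausted."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(messages)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk
```

Each yielded chunk could then be written with a single multi-row insert, which usually means far fewer round trips to the database than inserting one row per message.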
 ---
 ## 🎯 Use Cases
 ### 1. ETL from API to Warehouse
 ```
 API → Dagu (fetch) → NATS → PostgreSQL → Grafana
         ↓
      Marquez (lineage tracking)
 ```
 ### 2. Real-Time Stream Processing
 ```
 IoT Devices → NATS → Dagu (transform) → ClickHouse → Rill
                ↓
             Marquez
 ```
 ### 3. Daily Reporting
 ```
 Dagu (schedule) → PostgreSQL (query) → Metabase (dashboard) → Email
         ↓
      Marquez
 ```
 ---
 ## 🔧 Configuration
 ### NATS JetStream
 ```bash
 # Create stream ("data.>" also captures multi-token subjects such as data.api.raw)
 nats stream add DATA_STREAM \
  --subjects "data.>" \
  --storage file \
  --retention limits \
  --max-age 7d
 # Show status
 nats stream ls
 nats stream info DATA_STREAM
 ```
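NATS subject wildcards matter when choosing the stream's `--subjects` filter: `*` matches exactly one dot-separated token, while `>` matches one or more trailing tokens. A subject such as `data.api.raw` (used by the `api_ingestion` DAG) is therefore captured by `data.>` but not by `data.*`. The helper below is an illustrative reimplementation of those matching rules, not code from the NATS client.

```python
def subject_matches(pattern, subject):
    """Return True if a NATS-style subject pattern matches the subject."""
    pattern_tokens = pattern.split(".")
    subject_tokens = subject.split(".")
    for index, token in enumerate(pattern_tokens):
        if token == ">":
            # '>' is only valid as the last token and needs at least one token to match.
            return index == len(pattern_tokens) - 1 and len(subject_tokens) > index
        if index >= len(subject_tokens):
            return False
        if token != "*" and token != subject_tokens[index]:
            return False
    return len(pattern_tokens) == len(subject_tokens)
```

Checking the subjects used by the DAGs against the stream filter with a helper like this avoids messages that are published but never stored.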
 ### PostgreSQL
 ```bash
 # Connect interactively
 psql -h localhost -p 5434 -U postgres -d postgres
 # Create table (SQL passed to psql through a heredoc)
 psql -h localhost -p 5434 -U postgres -d postgres <<'SQL'
 CREATE TABLE events (
  id SERIAL PRIMARY KEY,
  timestamp TIMESTAMPTZ DEFAULT NOW(),
  source VARCHAR(255),
  data JSONB,
  lineage_job VARCHAR(255)
 );
 SQL
 ```
 ### ClickHouse
 ```bash
 # Connect interactively
 clickhouse-client --host localhost --port 9000
 # Create table (SQL passed to clickhouse-client on stdin)
 clickhouse-client --host localhost --port 9000 <<'SQL'
 CREATE TABLE events (
  timestamp DateTime,
  source String,
  data String,
  lineage_job String
 ) ENGINE = MergeTree()
 ORDER BY timestamp;
 SQL
 ```
 ---
 ## 📊 Monitoring
 ### View Metrics in Prometheus
 ```
 http://localhost:9090
 Useful queries:
 - job_success{job=~".+"}
 - job_duration_seconds{job=~".+"}
 - rate(job_executions_total[5m])
 ```
 ### View Logs in Grafana
 ```
 http://localhost:3500 → Explore → Loki
 Useful queries:
 - {job="dagu"}
 - {job="dagu"} |= "error"
 - {job="dagu"} |= "api_ingestion"
 ```
 ### View Lineage in Marquez
 ```
 http://localhost:3001
 Search for:
 - Jobs: api_ingestion, data_transform
 - Datasets: /tmp/api_data.json, postgres://events
 - Runs: latest executions
 ```
 ---
 ## 🚨 Troubleshooting
 ### Dagu does not respond
 ```bash
 # View logs
 journalctl --user -u dagu.service -f
 # Restart
 systemctl --user restart dagu.service
 ```
 ### NATS does not connect
 ```bash
 # View container logs
 docker logs nats
 # Check connectivity (measures round-trip time to the server)
 nats rtt --server nats://localhost:4222
 ```
 ### Database not reachable
 ```bash
 # PostgreSQL
 docker logs postgres-main
 # ClickHouse
 docker logs clickhouse
 ```
 ### Marquez does not record events
 ```bash
 # View logs
 docker logs marquez
 # Test the API manually
 curl http://localhost:5000/api/v1/namespaces
 ```
 ---
 ## 📚 Additional Documentation
 - **CLAUDE.md**: Technical guide to manipulating the services
 - **TRANSFORMATIONS.md**: Complete guide to transformations with Dagu
 - **~/dagu/README.md**: Dagu-specific documentation
 - **Dagu Docs**: https://dagu.sh/
 - **OpenLineage Spec**: https://openlineage.io/
 - **NATS Docs**: https://docs.nats.io/
 ---
 ## 🤝 Contributing
 ### Adding a new pipeline
 1. Create a script in `~/dagu/scripts/`
 2. Create a DAG in `~/dagu/dags/`
 3. Add lineage tracking
 4. Create a dashboard in Grafana
 5. Document it in this README
 ### Adding a new service
 1. Create `docker-compose-<service>.yml`
 2. Add it to Homer in `homer/assets/config.yml`
 3. Document ports and configuration
 4. Update CLAUDE.md if it needs special handling
 ---
 ## 📊 Detailed Architecture
 ```
                    ┌─────────────┐
                    │   DAGU      │
                    │  Scheduler  │
                    └──────┬──────┘
                           │
            ┌──────────────┼──────────────┐
            │              │              │
            ▼              ▼              ▼
    ┌───────────┐  ┌───────────┐  ┌───────────┐
    │  Script   │  │  Script   │  │  Script   │
    │  Fetch    │  │ Transform │  │  Export   │
    └─────┬─────┘  └─────┬─────┘  └─────┬─────┘
          │              │              │
          └──────────────┼──────────────┘
                         │
                    ┌────▼────┐
                    │  NATS   │
                    │JetStream│
                    └────┬────┘
                         │
          ┌──────────────┼──────────────┐
          │              │              │
          ▼              ▼              ▼
    ┌──────────┐  ┌──────────┐  ┌──────────┐
    │PostgreSQL│  │ClickHouse│  │   Dagu   │
    │          │  │          │  │Transform │
    └────┬─────┘  └────┬─────┘  └────┬─────┘
         │             │             │
         └─────────────┼─────────────┘
                       │
          ┌────────────┼────────────┐
          │            │            │
          ▼            ▼            ▼
    ┌─────────┐  ┌─────────┐  ┌─────────┐
    │ Grafana │  │Metabase │  │  Rill   │
    └─────────┘  └─────────┘  └─────────┘
         Lineage Tracking (Marquez)
    ┌────────────────────────────────┐
    │ API → NATS → DB → Visualization│
    └────────────────────────────────┘
 ```
 ---
 ## 🔐 Credentials
 | Service | User | Password | Port |
 |----------|---------|----------|--------|
 | PostgreSQL | postgres | postgres | 5434 |
 | ClickHouse | default | clickhouse | 8123 |
 | Marquez DB | marquez | marquez | 5433 |
 | Metabase DB | metabase | metabase | (internal) |
 | NATS | nats | nats123 | 4222 |
 **⚠️ IMPORTANT**: Change these passwords in production
 ---
 ## 📈 Roadmap
 - [ ] Add dbt for SQL transformations
 - [ ] Integrate Airflow as an alternative to Dagu
 - [ ] Add Kafka as an alternative to NATS
 - [ ] Implement data quality checks with Great Expectations
 - [ ] Unified lineage + monitoring dashboard
 - [ ] CI/CD for data pipelines
 - [ ] Disaster recovery and automatic backups
 ---
 **Last updated**: 2026-03-23
 **Version**: 1.0.0
 **Maintainer**: Lucas (@egutierrez)
 ---
 ## 📞 Support
 For issues and questions:
 - Gitea: https://gitea-dgg044oo04woo4ggcsws4gk0.organic-machine.com/dataforge/automatic-process
 - Claude Assistant: Available 24/7 for pipeline management
@@ -0,0 +1,582 @@
 # Transformations with Dagu
 A complete guide to running data transformations with Dagu.
 ---
 ## ✅ Dagu CAN run transformations
 Dagu executes **any script or command**, so it can handle:
 - ✅ SQL transformations
 - ✅ Python/Pandas transformations
 - ✅ Aggregations and calculations
 - ✅ Data cleaning
 - ✅ Data enrichment
 - ✅ Complex joins
 - ✅ Streaming transformations
 ---
 ## 🎯 Pattern 1: Python/Pandas Transformations
 ### Example: Cleaning and aggregation
 ```yaml
 # ~/dagu/dags/transform_sales.yaml
 name: transform_sales_data
 schedule: "0 2 * * *"  # Every day at 2 AM
 steps:
  # 1. Extract from PostgreSQL
  - name: extract
    command: |
      python <<EOF
      import pandas as pd
      from sqlalchemy import create_engine
      engine = create_engine('postgresql://postgres:postgres@localhost:5434/postgres')
      df = pd.read_sql('SELECT * FROM raw_sales WHERE date = CURRENT_DATE', engine)
      df.to_parquet('/tmp/raw_sales.parquet')
      EOF
  # 2. Transform - Cleaning
  - name: clean
    command: |
      python <<EOF
      import pandas as pd
      df = pd.read_parquet('/tmp/raw_sales.parquet')
      # Clean the data
      df = df.dropna(subset=['customer_id', 'amount'])
      df['amount'] = df['amount'].astype(float)
      df['date'] = pd.to_datetime(df['date'])
      # Remove duplicates
      df = df.drop_duplicates(subset=['transaction_id'])
      df.to_parquet('/tmp/clean_sales.parquet')
      print(f"Cleaned {len(df)} records")
      EOF
    depends: [extract]
  # 3. Transform - Aggregations
  - name: aggregate
    command: |
      python <<EOF
      import pandas as pd
      df = pd.read_parquet('/tmp/clean_sales.parquet')
      # Aggregation per customer
      customer_summary = df.groupby('customer_id').agg({
          'amount': ['sum', 'mean', 'count'],
          'date': 'max'
      }).reset_index()
      customer_summary.columns = ['customer_id', 'total_spent', 'avg_spent', 'num_purchases', 'last_purchase']
      customer_summary.to_parquet('/tmp/customer_summary.parquet')
      print(f"Aggregated {len(customer_summary)} customers")
      EOF
    depends: [clean]
  # 4. Load into PostgreSQL
  - name: load
    command: |
      python <<EOF
      import pandas as pd
      from sqlalchemy import create_engine
      df = pd.read_parquet('/tmp/customer_summary.parquet')
      engine = create_engine('postgresql://postgres:postgres@localhost:5434/postgres')
      df.to_sql('customer_summary', engine, if_exists='replace', index=False)
      print(f"Loaded {len(df)} records to customer_summary table")
      EOF
    depends: [aggregate]
  # 5. Log lineage
  - name: lineage
    command: |
      python ~/dagu/scripts/log_lineage.py \
        --event COMPLETE \
        --source postgres://raw_sales \
        --target postgres://customer_summary \
        --job transform_sales_data
    depends: [load]
 ```
 ---
 ## 🎯 Pattern 2: SQL Transformations (dbt-style)
 ### Example: Incremental transformation
 ```yaml
 # ~/dagu/dags/transform_orders.yaml
 name: transform_orders
 schedule: "*/15 * * * *"  # Every 15 minutes
 env:
  - DB_URL: postgresql://postgres:postgres@localhost:5434/postgres
 steps:
  # 1. Staging - Raw to Clean
  - name: stage_orders
    command: |
      psql $DB_URL <<SQL
      -- Create the staging table if it does not exist
      CREATE TABLE IF NOT EXISTS stg_orders (
        order_id BIGINT PRIMARY KEY,
        customer_id BIGINT,
        amount DECIMAL(10,2),
        status VARCHAR(50),
        created_at TIMESTAMPTZ,
        processed_at TIMESTAMPTZ DEFAULT NOW()
      );
      -- Incremental insert
      INSERT INTO stg_orders (order_id, customer_id, amount, status, created_at)
      SELECT
        order_id,
        customer_id,
        amount::DECIMAL(10,2),
        LOWER(TRIM(status)) as status,
        created_at
      FROM raw_orders
      WHERE created_at > (SELECT COALESCE(MAX(created_at), '1970-01-01') FROM stg_orders)
      ON CONFLICT (order_id) DO UPDATE SET
        amount = EXCLUDED.amount,
        status = EXCLUDED.status,
        processed_at = NOW();
      SQL
  # 2. Transform - Compute metrics
  - name: calc_metrics
    command: |
      psql $DB_URL <<SQL
      -- Daily metrics table
      CREATE TABLE IF NOT EXISTS daily_metrics (
        date DATE PRIMARY KEY,
        total_orders INT,
        total_revenue DECIMAL(12,2),
        avg_order_value DECIMAL(10,2),
        completed_orders INT,
        cancelled_orders INT,
        updated_at TIMESTAMPTZ DEFAULT NOW()
      );
      -- Upsert metrics
      INSERT INTO daily_metrics (date, total_orders, total_revenue, avg_order_value, completed_orders, cancelled_orders)
      SELECT
        DATE(created_at) as date,
        COUNT(*) as total_orders,
        SUM(amount) as total_revenue,
        AVG(amount) as avg_order_value,
        COUNT(*) FILTER (WHERE status = 'completed') as completed_orders,
        COUNT(*) FILTER (WHERE status = 'cancelled') as cancelled_orders
      FROM stg_orders
      WHERE created_at >= CURRENT_DATE - INTERVAL '7 days'
      GROUP BY DATE(created_at)
      ON CONFLICT (date) DO UPDATE SET
        total_orders = EXCLUDED.total_orders,
        total_revenue = EXCLUDED.total_revenue,
        avg_order_value = EXCLUDED.avg_order_value,
        completed_orders = EXCLUDED.completed_orders,
        cancelled_orders = EXCLUDED.cancelled_orders,
        updated_at = NOW();
      SQL
    depends: [stage_orders]
  # 3. Transform - Historical snapshot
  - name: snapshot
    command: |
      psql $DB_URL <<SQL
      -- Snapshots table
      CREATE TABLE IF NOT EXISTS order_snapshots (
        snapshot_id SERIAL PRIMARY KEY,
        order_id BIGINT,
        status VARCHAR(50),
        amount DECIMAL(10,2),
        snapshot_at TIMESTAMPTZ DEFAULT NOW()
      );
      -- Insert a snapshot of in-progress orders
      INSERT INTO order_snapshots (order_id, status, amount)
      SELECT order_id, status, amount
      FROM stg_orders
      WHERE status IN ('pending', 'processing');
      SQL
    depends: [calc_metrics]
 ```
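The staging step above combines a watermark (`MAX(created_at)` in the target) with an `ON CONFLICT` upsert. The same idea can be tried locally with Python's built-in `sqlite3` as a stand-in for PostgreSQL; this is a sketch with a reduced column set, not the production SQL. Note that because the watermark filters on `created_at`, rows that are updated in `raw_orders` after they were staged are not picked up again.

```python
import sqlite3


def sync_orders(conn):
    """Copy rows newer than the staging watermark, upserting on order_id."""
    conn.execute("""
        CREATE TABLE IF NOT EXISTS stg_orders (
            order_id INTEGER PRIMARY KEY,
            amount REAL,
            status TEXT,
            created_at TEXT
        )
    """)
    conn.execute("""
        INSERT INTO stg_orders (order_id, amount, status, created_at)
        SELECT order_id, amount, LOWER(TRIM(status)), created_at
        FROM raw_orders
        WHERE created_at > (SELECT COALESCE(MAX(created_at), '1970-01-01') FROM stg_orders)
        ON CONFLICT (order_id) DO UPDATE SET
            amount = excluded.amount,
            status = excluded.status
    """)
    conn.commit()
```

Running `sync_orders` repeatedly against an in-memory database makes it easy to see which rows a given watermark rule does and does not capture before scheduling the real DAG.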
 ---
 ## 🎯 Pattern 3: Multi-Table Transformation with Joins
 ### Example: Enriching data from multiple sources
 ```yaml
 # ~/dagu/dags/enrich_customer_data.yaml
 name: enrich_customer_data
 schedule: "0 3 * * *"
 steps:
  # 1. Extract and combine multiple sources
  - name: merge_sources
    command: |
      python <<EOF
      import pandas as pd
      from sqlalchemy import create_engine
      engine = create_engine('postgresql://postgres:postgres@localhost:5434/postgres')
      # Load multiple tables
      customers = pd.read_sql('SELECT * FROM customers', engine)
      orders = pd.read_sql('SELECT * FROM orders WHERE created_at >= CURRENT_DATE - 30', engine)
      reviews = pd.read_sql('SELECT * FROM reviews', engine)
      # Order aggregations
      order_stats = orders.groupby('customer_id').agg({
          'order_id': 'count',
          'amount': ['sum', 'mean'],
          'created_at': 'max'
      }).reset_index()
      order_stats.columns = ['customer_id', 'total_orders', 'total_spent', 'avg_order', 'last_order']
      # Review aggregations
      review_stats = reviews.groupby('customer_id').agg({
          'rating': 'mean',
          'review_id': 'count'
      }).reset_index()
      review_stats.columns = ['customer_id', 'avg_rating', 'total_reviews']
      # Merge everything
      enriched = customers.merge(order_stats, on='customer_id', how='left')
      enriched = enriched.merge(review_stats, on='customer_id', how='left')
      # Compute the segment
      enriched['segment'] = enriched.apply(lambda x:
          'VIP' if x['total_spent'] > 1000 else
          'Regular' if x['total_spent'] > 100 else
          'New', axis=1
      )
      enriched.to_parquet('/tmp/enriched_customers.parquet')
      EOF
  # 2. Load the enriched data
  - name: load_enriched
    command: |
      python <<EOF
      import pandas as pd
      from sqlalchemy import create_engine
      df = pd.read_parquet('/tmp/enriched_customers.parquet')
      engine = create_engine('postgresql://postgres:postgres@localhost:5434/postgres')
      df.to_sql('enriched_customers', engine, if_exists='replace', index=False)
      EOF
    depends: [merge_sources]
 ```
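Because the merges use `how='left'`, customers without orders in the last 30 days end up with a missing `total_spent`. Comparisons against NaN are false, so the lambda above silently puts them in `'New'`. The sketch below spells out the same thresholds as a named function with the missing-value case made explicit; the function name is hypothetical.

```python
import math


def segment(total_spent):
    """Classify a customer by total spend; missing values count as 'New'."""
    missing = total_spent is None or (
        isinstance(total_spent, float) and math.isnan(total_spent)
    )
    if missing:
        return "New"
    if total_spent > 1000:
        return "VIP"
    if total_spent > 100:
        return "Regular"
    return "New"
```

It could be applied with `enriched['total_spent'].apply(segment)`, which also avoids the row-wise `axis=1` apply.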
 ---
 ## 🎯 Pattern 4: Incremental Transformation (Changes Only)
 ### Example: Simplified CDC (Change Data Capture)
 ```yaml
 # ~/dagu/dags/incremental_transform.yaml
 name: incremental_transform
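 schedule: "*/5 * * * *"  # Every 5 minutes
 steps:
  # 1. Identify changes
  - name: detect_changes
    command: |
      python <<EOF
      import os
      import pandas as pd
      from sqlalchemy import create_engine, text
      engine = create_engine('postgresql://postgres:postgres@localhost:5434/postgres')
      # Remove leftovers from a previous run so later steps never reprocess stale files
      for path in ('/tmp/new_data.parquet', '/tmp/transformed_changes.parquet'):
          if os.path.exists(path):
              os.remove(path)
      # Last watermark (falls back to the epoch when transformed_data is empty)
      last_sync = pd.read_sql(
          "SELECT COALESCE(MAX(updated_at), '1970-01-01') AS last_sync FROM transformed_data",
          engine
      ).iloc[0]['last_sync']
      # Only new/modified records; the watermark is passed as a bound parameter
      new_data = pd.read_sql(
          text("SELECT * FROM raw_data WHERE updated_at > :last_sync"),
          engine,
          params={'last_sync': pd.Timestamp(last_sync).to_pydatetime()}
      )
      if len(new_data) > 0:
          new_data.to_parquet('/tmp/new_data.parquet')
          print(f"Found {len(new_data)} new/changed records")
      else:
          print("No changes detected")
      EOF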
  # 2. Transform only the changes
  - name: transform_changes
    command: |
      python <<EOF
      import os
      import pandas as pd
      if not os.path.exists('/tmp/new_data.parquet'):
          exit(0)
      df = pd.read_parquet('/tmp/new_data.parquet')
      # Apply transformations
      df['normalized_value'] = df['value'] / df['value'].max()
      df['category'] = df['type'].map({
          'A': 'Category 1',
          'B': 'Category 2',
          'C': 'Category 3'
      })
      df.to_parquet('/tmp/transformed_changes.parquet')
      EOF
    depends: [detect_changes]
  # 3. Upsert the changes
  - name: upsert_changes
    command: |
      python <<EOF
      import os
      import pandas as pd
      from sqlalchemy import create_engine, text
      if not os.path.exists('/tmp/transformed_changes.parquet'):
          exit(0)
      df = pd.read_parquet('/tmp/transformed_changes.parquet')
      engine = create_engine('postgresql://postgres:postgres@localhost:5434/postgres')
      # ON CONFLICT upsert with bound parameters (avoids SQL injection and quoting bugs)
      upsert = text("""
          INSERT INTO transformed_data (id, value, category, updated_at)
          VALUES (:id, :normalized_value, :category, NOW())
          ON CONFLICT (id) DO UPDATE SET
            value = EXCLUDED.value,
            category = EXCLUDED.category,
            updated_at = NOW()
      """)
      rows = df[['id', 'normalized_value', 'category']].to_dict('records')
      # engine.begin() commits on success; Engine.execute() was removed in SQLAlchemy 2.0
      with engine.begin() as conn:
          conn.execute(upsert, rows)
      print(f"Upserted {len(df)} records")
      EOF
    depends: [transform_changes]
 ```
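The watermark logic in `detect_changes` can be tried without PostgreSQL. The following is a minimal sketch that uses the standard-library `sqlite3` module as a stand-in; the table layout, the `EPOCH` fallback, and the `fetch_changes` helper are assumptions made for illustration, not part of the DAG above.

```python
import sqlite3

EPOCH = "1970-01-01 00:00:00"  # assumed fallback watermark for the very first run


def fetch_changes(conn):
    """Return raw_data rows newer than the latest transformed_data.updated_at."""
    last_sync = conn.execute(
        "SELECT MAX(updated_at) FROM transformed_data"
    ).fetchone()[0]
    if last_sync is None:  # empty target table: take everything
        last_sync = EPOCH
    # The watermark is passed as a bound parameter, never interpolated
    return conn.execute(
        "SELECT id, value, updated_at FROM raw_data "
        "WHERE updated_at > ? ORDER BY id",
        (last_sync,),
    ).fetchall()


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE raw_data (id INTEGER PRIMARY KEY, value REAL, updated_at TEXT);
    CREATE TABLE transformed_data (id INTEGER PRIMARY KEY, value REAL, updated_at TEXT);
    INSERT INTO raw_data VALUES
        (1, 10.0, '2026-03-23 10:00:00'),
        (2, 20.0, '2026-03-23 11:00:00');
""")

first_run = fetch_changes(conn)  # target is empty, so both rows qualify
conn.execute(
    "INSERT INTO transformed_data VALUES (1, 10.0, '2026-03-23 10:30:00')"
)
second_run = fetch_changes(conn)  # only the row updated after 10:30 qualifies
print(len(first_run), len(second_run))
```

The first-run fallback matters because `MAX(updated_at)` returns NULL on an empty table, and a comparison against NULL matches no rows.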
 ---
 ## 🎯 Pattern 5: Transformation with ClickHouse (Analytics)
 ### Example: Heavy aggregations
 ```yaml
 # ~/dagu/dags/analytics_clickhouse.yaml
 name: analytics_transform
 schedule: "0 4 * * *"
 steps:
  # 1. Transform and load into ClickHouse
  - name: load_to_clickhouse
    command: |
      python <<EOF
      import pandas as pd
      from sqlalchemy import create_engine
      from clickhouse_driver import Client
      # Extract from PostgreSQL
      # Note: at 04:00 this filter only covers events since midnight
      pg = create_engine('postgresql://postgres:postgres@localhost:5434/postgres')
      df = pd.read_sql('SELECT * FROM events WHERE date = CURRENT_DATE', pg)
      # Transform (parse the timestamp once and reuse it)
      ts = pd.to_datetime(df['timestamp'])
      df['timestamp'] = ts
      df['hour'] = ts.dt.hour
      df['day_of_week'] = ts.dt.dayofweek
      # Load into ClickHouse
      ch = Client('localhost', port=9000)
      ch.execute('''
          CREATE TABLE IF NOT EXISTS events_analytics (
              event_id UInt64,
              user_id UInt64,
              event_type String,
              timestamp DateTime,
              hour UInt8,
              day_of_week UInt8,
              value Float64
          ) ENGINE = MergeTree()
          ORDER BY (event_type, timestamp)
      ''')
      # Insert only the columns the target table defines, named explicitly
      cols = ['event_id', 'user_id', 'event_type', 'timestamp', 'hour', 'day_of_week', 'value']
      ch.execute(
          f"INSERT INTO events_analytics ({', '.join(cols)}) VALUES",
          df[cols].to_dict('records')
      )
      EOF
  # 2. Aggregate in ClickHouse (very fast)
  - name: aggregate
    command: |
      # Drop first: CREATE TABLE IF NOT EXISTS ... AS SELECT only fills the table on the run that creates it
      clickhouse-client --query "DROP TABLE IF EXISTS hourly_stats"
      clickhouse-client --query "
      CREATE TABLE hourly_stats
      ENGINE = MergeTree()
      ORDER BY (event_type, hour)
      AS SELECT
          event_type,
          hour,
          day_of_week,
          COUNT(*) as event_count,
          AVG(value) as avg_value,
          SUM(value) as total_value
      FROM events_analytics
      WHERE timestamp >= today()
      GROUP BY event_type, hour, day_of_week
      "
    depends: [load_to_clickhouse]
 ```
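To sanity-check the ClickHouse aggregation on a small sample, the same `GROUP BY` can be reproduced in plain Python. This is only a sketch: the `hourly_stats` function and the sample events are hypothetical, and it assumes each event carries `event_type`, `timestamp`, and `value` as in the table definition above.

```python
from collections import defaultdict
from datetime import datetime


def hourly_stats(events):
    """Group by (event_type, hour, day_of_week) and compute count, average, total."""
    groups = defaultdict(list)
    for e in events:
        ts = e["timestamp"]
        # weekday(): Monday=0 ... Sunday=6, the same convention as pandas dt.dayofweek
        groups[(e["event_type"], ts.hour, ts.weekday())].append(e["value"])
    return {
        key: {
            "event_count": len(values),
            "avg_value": sum(values) / len(values),
            "total_value": sum(values),
        }
        for key, values in groups.items()
    }


# Hypothetical sample events
events = [
    {"event_type": "click", "timestamp": datetime(2026, 3, 23, 9, 5), "value": 2.0},
    {"event_type": "click", "timestamp": datetime(2026, 3, 23, 9, 40), "value": 4.0},
    {"event_type": "view", "timestamp": datetime(2026, 3, 23, 10, 0), "value": 1.0},
]
stats = hourly_stats(events)
for key, row in sorted(stats.items()):
    print(key, row)
```

Comparing this output with a `SELECT` from `hourly_stats` for the same sample is a cheap way to catch column-order or timezone mistakes before running on full data.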
 ---
 ## 🎯 Pattern 6: Transformation with Complex Dependencies
 ### Example: DAG with multiple transformations in parallel
 ```yaml
 # ~/dagu/dags/complex_transform.yaml
 name: complex_multi_transform
 schedule: "0 1 * * *"
 steps:
  # Initial step: extraction
  - name: extract
    command: python ~/dagu/scripts/extract_data.py
  # Transformations in parallel
  - name: transform_customers
    command: python ~/dagu/scripts/transform_customers.py
    depends: [extract]
  - name: transform_products
    command: python ~/dagu/scripts/transform_products.py
    depends: [extract]
  - name: transform_orders
    command: python ~/dagu/scripts/transform_orders.py
    depends: [extract]
  # Join everything
  - name: join_all
    command: python ~/dagu/scripts/join_datasets.py
    depends: [transform_customers, transform_products, transform_orders]
  # Compute final metrics
  - name: calc_metrics
    command: python ~/dagu/scripts/calculate_metrics.py
    depends: [join_all]
  # Load into destinations
  - name: load_postgres
    command: python ~/dagu/scripts/load_postgres.py
    depends: [calc_metrics]
  - name: load_clickhouse
    command: python ~/dagu/scripts/load_clickhouse.py
    depends: [calc_metrics]
 ```
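To see which steps of this DAG can run side by side, the `depends` graph can be resolved into "waves" with the standard-library `graphlib` module (Python 3.9+). This sketch only illustrates the dependency order; it is not how Dagu schedules steps internally, and `execution_waves` is a name invented for the example.

```python
from graphlib import TopologicalSorter

# step -> steps it depends on, copied from the DAG above
DEPENDS = {
    "extract": [],
    "transform_customers": ["extract"],
    "transform_products": ["extract"],
    "transform_orders": ["extract"],
    "join_all": ["transform_customers", "transform_products", "transform_orders"],
    "calc_metrics": ["join_all"],
    "load_postgres": ["calc_metrics"],
    "load_clickhouse": ["calc_metrics"],
}


def execution_waves(depends):
    """Return lists of steps; every step in a list has all its dependencies done."""
    sorter = TopologicalSorter(depends)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


waves = execution_waves(DEPENDS)
for number, wave in enumerate(waves, start=1):
    print(number, ", ".join(wave))
```

The three `transform_*` steps land in the same wave, as do the two `load_*` steps, which is where the parallelism in this pattern comes from.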
 ---
 ## 💡 Best Practices
 ### 1. Use intermediate files
 ```bash
 /tmp/raw_data.parquet
 /tmp/clean_data.parquet
 /tmp/transformed_data.parquet
 ```
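Fixed paths under `/tmp` are shared by every run, so two overlapping runs can overwrite each other's files. One possible way around this, sketched below, is to put each run's files in its own directory. The `stage_path` helper, the `dagu_runs` directory name, and the `PIPELINE_RUN_ID` environment variable are all hypothetical names chosen for this example.

```python
import os
import tempfile
from pathlib import Path


def stage_path(stage, run_id, base=None):
    """Return <base>/dagu_runs/<run_id>/<stage>_data.parquet, creating the directory."""
    root = Path(base) if base else Path(tempfile.gettempdir())
    run_dir = root / "dagu_runs" / run_id
    run_dir.mkdir(parents=True, exist_ok=True)
    return run_dir / f"{stage}_data.parquet"


# PIPELINE_RUN_ID is an assumed variable that the DAG would have to export itself
run_id = os.environ.get("PIPELINE_RUN_ID", "manual")
raw_path = stage_path("raw", run_id)
clean_path = stage_path("clean", run_id)
print(raw_path)
print(clean_path)
```

With this layout, cleanup can remove a single run directory instead of every `.parquet` file in `/tmp`.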
 ### 2. Validate between steps
 ```python
 # Validate before continuing
 assert len(df) > 0, "No data to process"
 assert df['amount'].sum() > 0, "Invalid amounts"
 ```
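`assert` statements are skipped when Python runs with `-O`, so a validation step may prefer explicit checks and a non-zero exit code, which makes the step fail. A minimal sketch, assuming rows are dictionaries with an `amount` field; the `validate` and `main` helpers are hypothetical:

```python
import sys


def validate(rows):
    """Return a list of problems; an empty list means the batch is usable."""
    problems = []
    if not rows:
        problems.append("no data to process")
    elif sum(row["amount"] for row in rows) <= 0:
        problems.append("invalid amounts: total is not positive")
    return problems


def main(rows):
    """Print each problem to stderr and return the exit code for the step."""
    problems = validate(rows)
    for problem in problems:
        print(f"validation failed: {problem}", file=sys.stderr)
    return 1 if problems else 0  # non-zero marks the step as failed


sample = [{"amount": 12.5}, {"amount": 7.5}]  # hypothetical batch
exit_code = main(sample)
print(exit_code)  # in a real script, pass this value to sys.exit()
```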
 ### 3. Structured logs
 ```python
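 import logging
 logging.basicConfig(level=logging.INFO)  # without this, INFO messages are dropped by default
 logging.info("Processed %d records in %.2fs", len(df), elapsed)  # elapsed: seconds measured by the caller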
 ```
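If logs should be machine-readable, one option is to emit each record as a JSON line. The sketch below uses only the standard library; the `JsonFormatter` class, the `fields` key passed through `extra`, and the logger name `transform` are choices made for this example.

```python
import json
import logging
import time


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record):
        payload = {"level": record.levelname, "message": record.getMessage()}
        # "fields" is a custom attribute set through logging's `extra` argument
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
log = logging.getLogger("transform")
log.setLevel(logging.INFO)
log.addHandler(handler)
log.propagate = False  # avoid duplicate output through the root logger

start = time.perf_counter()
records = list(range(1000))  # stand-in for real work
elapsed = time.perf_counter() - start
log.info(
    "batch processed",
    extra={"fields": {"records": len(records), "elapsed_s": round(elapsed, 3)}},
)
```

Each line can then be filtered by field with a JSON-aware tool instead of parsing free text.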
 ### 4. Idempotency
 ```sql
 -- Use UPSERT instead of a plain INSERT
 INSERT INTO transformed_data (id, value)
 VALUES (1, 0.5)
 ON CONFLICT (id) DO UPDATE SET value = EXCLUDED.value;
 ```
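The effect of an upsert on retries can be checked locally. The following sketch uses SQLite (which has supported `ON CONFLICT ... DO UPDATE` since version 3.24) as a stand-in for PostgreSQL; the table and the `load` helper are assumptions for illustration.

```python
import sqlite3

UPSERT = """
    INSERT INTO transformed_data (id, value, category)
    VALUES (:id, :value, :category)
    ON CONFLICT (id) DO UPDATE SET
        value = excluded.value,
        category = excluded.category
"""


def load(conn, rows):
    """Upsert a batch inside one transaction."""
    with conn:  # commits on success, rolls back on error
        conn.executemany(UPSERT, rows)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE transformed_data (id INTEGER PRIMARY KEY, value REAL, category TEXT)"
)
batch = [
    {"id": 1, "value": 0.5, "category": "Category 1"},
    {"id": 2, "value": 1.0, "category": "Category 2"},
]
load(conn, batch)
load(conn, batch)  # simulated retry of the same step
row_count = conn.execute("SELECT COUNT(*) FROM transformed_data").fetchone()[0]
print(row_count)
```

Running the load twice leaves the table with the same two rows, which is what makes it safe to retry a failed step.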
 ### 5. Cleanup
 ```yaml
 steps:
  # ... your steps
  - name: cleanup
    command: rm -f /tmp/*.parquet
    continueOn:
      failure: true
 ```
 ---
 ## 🆚 Dagu vs dbt
 | Feature | Dagu | dbt |
 |---------|------|-----|
 | SQL transforms | ✅ Yes | ✅ Yes (better) |
 | Python transforms | ✅ Yes (better) | ⚠️ Limited |
 | Scheduling | ✅ Built-in | ❌ External |
 | Lineage | ⚠️ Manual | ✅ Automatic |
 | Testing | ⚠️ Manual | ✅ Built-in |
 | Docs | ⚠️ Manual | ✅ Automatic |
 **Recommendation**:
 - Use **Dagu** for end-to-end pipelines
 - Consider **dbt** if you write a lot of SQL and want automatic lineage
 ---
 ## 🎯 Summary
 **Dagu CAN handle transformations:**
 - ✅ Python/Pandas (cleaning, aggregations)
 - ✅ SQL (staging, metrics, joins)
 - ✅ Incremental transformations
 - ✅ Multi-table with complex joins
 - ✅ Parallel (multiple transforms at once)
 - ✅ ClickHouse (heavy analytics)
 **You do NOT need Temporal for:**
 - ❌ Simple to moderately complex transformations
 - ❌ Typical ETL (Extract → Transform → Load)
 - ❌ Pipelines that finish in < 1 hour
 - ❌ SQL or Pandas aggregations
 **You DO need Temporal only if:**
 - ✅ A transformation takes > 1 hour
 - ✅ You need to pause/resume workflows
 - ✅ The state machine is very complex
 - ✅ You need distributed compensations
 ---
 **Last updated**: 2026-03-23
@@ -1,64 +0,0 @@
 services:
  temporal-postgresql:
    image: postgres:15
    container_name: temporal-db
    environment:
      POSTGRES_USER: temporal
      POSTGRES_PASSWORD: temporal
      POSTGRES_DB: temporal
    ports:
      - "5435:5432"
    volumes:
      - temporal-postgres-data:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD", "pg_isready", "-U", "temporal"]
      interval: 5s
      timeout: 5s
      retries: 5
    restart: unless-stopped
  temporal:
    image: temporalio/auto-setup:latest
    container_name: temporal
    depends_on:
      temporal-postgresql:
        condition: service_healthy
    environment:
      - DB=postgres12
      - DB_PORT=5432
      - POSTGRES_USER=temporal
      - POSTGRES_PWD=temporal
      - POSTGRES_SEEDS=temporal-postgresql
      - DYNAMIC_CONFIG_FILE_PATH=config/dynamicconfig/development-sql.yaml
    ports:
      - "7233:7233"
    volumes:
      - ./temporal-dynamicconfig:/etc/temporal/config/dynamicconfig
    restart: unless-stopped
  temporal-ui:
    image: temporalio/ui:latest
    container_name: temporal-ui
    depends_on:
      - temporal
    environment:
      - TEMPORAL_ADDRESS=temporal:7233
      - TEMPORAL_CORS_ORIGINS=http://localhost:3400
    ports:
      - "3400:8080"
    restart: unless-stopped
  temporal-admin-tools:
    image: temporalio/admin-tools:latest
    container_name: temporal-admin-tools
    depends_on:
      - temporal
    environment:
      - TEMPORAL_ADDRESS=temporal:7233
      - TEMPORAL_CLI_ADDRESS=temporal:7233
    stdin_open: true
    tty: true
    restart: unless-stopped
 volumes:
  temporal-postgres-data:
@@ -95,16 +95,9 @@ services:
  - name: "Orchestration"
    icon: "fas fa-code-branch"
    items:
      - name: "Temporal UI"
        logo: "http://localhost:3400/favicon.ico"
        subtitle: "Workflow Orchestration"
        tag: "orchestration"
        url: "http://localhost:3400"
        target: "_blank"
      - name: "Dagu"
        logo: "http://localhost:8090/assets/favicon.ico"
-        subtitle: "DAG Scheduler - Local Scripts"
+        subtitle: "DAG Scheduler & Transformations"
        tag: "orchestration"
        url: "http://localhost:8090"
        target: "_blank"
@@ -1,20 +0,0 @@
 # Temporal dynamic configuration for development
 system.forceSearchAttributesCacheRefreshOnRead:
  - value: true
    constraints: {}
 frontend.enableUpdateWorkflowExecution:
  - value: true
    constraints: {}
 history.enableParentClosePolicyWorker:
  - value: true
    constraints: {}
 system.enableActivityEagerExecution:
  - value: true
    constraints: {}
 frontend.enableExecuteMultiOperation:
  - value: true
    constraints: {}