# marquez-cli **OpenLineage/Marquez CLI tool** para gestionar datasets, jobs y runs con lineage tracking completo. Binario escrito en Go sin dependencias externas, listo para usar en pipelines de Dagu y scripts bash. --- ## 🎯 Características - ✅ **Registrar datasets** en Marquez - ✅ **Registrar jobs** y sus runs - ✅ **Enviar eventos OpenLineage** (START, RUNNING, COMPLETE, FAIL) - ✅ **Consultar lineage** de datasets - ✅ **Listar** namespaces, jobs, datasets y runs - ✅ **Sin dependencias** externas (solo Go stdlib) - ✅ **Binario estático** compilado (~5MB) --- ## 📦 Instalación ### Desde el Código Fuente ```bash cd ~/AutomaticProyects/automatic_process/tools/marquez-cli # Compilar make build # Instalar en ~/.local/bin make install # Verificar instalación marquez-cli version ``` ### Variables de Entorno (Opcional) ```bash export MARQUEZ_URL="http://localhost:5000" export MARQUEZ_NAMESPACE="automatic-process" ``` --- ## 🚀 Uso ### 1. Gestión de Runs #### Iniciar un Run ```bash # Sintaxis básica marquez-cli run start -job my_pipeline # Con inputs y outputs marquez-cli run start \ -job fetch_api_data \ -inputs "api://jsonplaceholder.typicode.com/users" \ -outputs "file:///tmp/users.json" # Con run-id específico (para continuar después) marquez-cli run start \ -job my_pipeline \ -run-id "abc123-uuid-here" \ -inputs "file:///tmp/raw.json" ``` **Salida:** ``` ✓ Run event sent successfully Event Type: START Job: automatic-process/fetch_api_data Run ID: 3c7a4f21-1234-5678-90ab-cdef12345678 Inputs: 1 dataset(s) Outputs: 1 dataset(s) ``` #### Marcar Run como RUNNING (Progreso) ```bash marquez-cli run running \ -job my_pipeline \ -run-id "abc123-uuid-here" \ -inputs "file:///tmp/raw.json" \ -outputs "file:///tmp/processed.json" ``` #### Completar un Run Exitosamente ```bash marquez-cli run complete \ -job my_pipeline \ -run-id "abc123-uuid-here" \ -inputs "file:///tmp/raw.json" \ -outputs "postgres://localhost:5434/postgres/public/events" ``` #### Marcar Run como Fallido ```bash marquez-cli run fail \ -job my_pipeline \ -run-id "abc123-uuid-here" ``` --- ### 2. Gestión de Datasets #### Registrar un Dataset ```bash # PostgreSQL table marquez-cli dataset register \ -name "postgres://localhost:5434/postgres/public/events" # ClickHouse table marquez-cli dataset register \ -name "clickhouse://localhost:8123/default/analytics" # NATS stream marquez-cli dataset register \ -name "nats://localhost:4222/data.raw" # Archivo marquez-cli dataset register \ -name "file:///tmp/data.json" # API endpoint marquez-cli dataset register \ -name "api://example.com/users" ``` #### Listar Datasets ```bash # En el namespace por defecto (automatic-process) marquez-cli dataset get # En un namespace específico marquez-cli dataset get -namespace my-namespace ``` **Salida:** ``` Datasets in namespace 'automatic-process': • postgres://localhost:5434/postgres/public/events [DB_TABLE] • file:///tmp/users.json [FILE] • api://example.com/users [API] ``` --- ### 3. Gestión de Jobs #### Registrar un Job ```bash marquez-cli job register -name my_pipeline ``` #### Listar Jobs ```bash # En el namespace por defecto marquez-cli job get # En un namespace específico marquez-cli job get -namespace my-namespace ``` #### Ver Runs de un Job ```bash marquez-cli job runs -name my_pipeline ``` **Salida:** ``` Runs for job 'automatic-process/my_pipeline': • 3c7a4f21-1234-5678-90ab-cdef12345678 [COMPLETED] - 2026-03-23T10:30:00.000Z • 7b8c9d0e-5678-1234-90ab-cdef12345678 [FAILED] - 2026-03-23T09:15:00.000Z • 1a2b3c4d-9012-3456-78ab-cdef12345678 [RUNNING] - 2026-03-23T11:00:00.000Z ``` --- ### 4. Consultar Lineage #### Obtener Lineage de un Dataset ```bash # Formato texto (legible) marquez-cli lineage \ -name "postgres://localhost:5434/postgres/public/events" # Formato JSON (para scripts) marquez-cli lineage \ -name "postgres://localhost:5434/postgres/public/events" \ -format json # Con profundidad personalizada marquez-cli lineage \ -name "postgres://localhost:5434/postgres/public/events" \ -depth 20 ``` **Salida (formato texto):** ``` Lineage for dataset 'automatic-process/postgres://localhost:5434/postgres/public/events': 📦 Datasets (4): • api://jsonplaceholder.typicode.com/users • file:///tmp/users.json • file:///tmp/emails.json • postgres://localhost:5434/postgres/public/user_emails ⚙️ Jobs (3): • fetch_api ← Inputs: - api://jsonplaceholder.typicode.com/users → Outputs: - file:///tmp/users.json • transform ← Inputs: - file:///tmp/users.json → Outputs: - file:///tmp/emails.json • ingest_postgres ← Inputs: - file:///tmp/emails.json → Outputs: - postgres://localhost:5434/postgres/public/user_emails ``` --- ### 5. Listar Recursos #### Listar Namespaces ```bash marquez-cli list namespaces ``` #### Listar Jobs ```bash marquez-cli list jobs marquez-cli list jobs -namespace my-namespace ``` #### Listar Datasets ```bash marquez-cli list datasets marquez-cli list datasets -namespace my-namespace ``` --- ## 🔧 Integración con Dagu ### Ejemplo: DAG con Lineage Tracking ```yaml # ~/dagu/dags/example_with_lineage.yaml name: example_with_lineage description: Pipeline con lineage tracking usando marquez-cli schedule: - "0 */6 * * *" env: - RUN_ID: "" # Se genera dinámicamente steps: # PASO 1: Generar Run ID único - name: generate_run_id command: echo "RUN_ID=$(uuidgen)" >> $DAGU_ENV output: RUN_ID # PASO 2: START event - name: start_run command: | marquez-cli run start \ -job example_with_lineage \ -run-id $RUN_ID \ -inputs "api://jsonplaceholder.typicode.com/users" depends: [generate_run_id] # PASO 3: Fetch data - name: fetch_data command: | curl -s https://jsonplaceholder.typicode.com/users > /tmp/users.json marquez-cli run running \ -job example_with_lineage \ -run-id $RUN_ID \ -inputs "api://jsonplaceholder.typicode.com/users" \ -outputs "file:///tmp/users.json" depends: [start_run] # PASO 4: Transform data - name: transform_data command: | jq '[.[] | {email: .email, name: .name}]' /tmp/users.json > /tmp/emails.json marquez-cli run running \ -job example_with_lineage \ -run-id $RUN_ID \ -inputs "file:///tmp/users.json" \ -outputs "file:///tmp/emails.json" depends: [fetch_data] # PASO 5: Load to PostgreSQL - name: load_postgres command: | psql -h localhost -p 5434 -U postgres -d postgres -c \ "CREATE TABLE IF NOT EXISTS user_emails (email TEXT, name TEXT);" cat /tmp/emails.json | jq -r '.[] | [.email, .name] | @csv' | \ psql -h localhost -p 5434 -U postgres -d postgres -c \ "COPY user_emails FROM STDIN WITH CSV;" marquez-cli run running \ -job example_with_lineage \ -run-id $RUN_ID \ -inputs "file:///tmp/emails.json" \ -outputs "postgres://localhost:5434/postgres/public/user_emails" depends: [transform_data] # PASO 6: COMPLETE event - name: complete_run command: | marquez-cli run complete \ -job example_with_lineage \ -run-id $RUN_ID \ -inputs "api://jsonplaceholder.typicode.com/users" \ -outputs "postgres://localhost:5434/postgres/public/user_emails" depends: [load_postgres] handlers: failure: - name: fail_run command: | marquez-cli run fail \ -job example_with_lineage \ -run-id $RUN_ID ``` --- ## 🔁 Workflow Típico ### 1. Pipeline Simple (START → COMPLETE) ```bash #!/bin/bash # Script: ~/dagu/scripts/simple_pipeline.sh NAMESPACE="automatic-process" JOB_NAME="simple_pipeline" RUN_ID=$(marquez-cli run start -job $JOB_NAME -inputs "api://source" | grep "Run ID" | awk '{print $NF}') echo "Started run: $RUN_ID" # Hacer el trabajo curl -s https://api.example.com/data > /tmp/data.json # Completar marquez-cli run complete \ -job $JOB_NAME \ -run-id $RUN_ID \ -inputs "api://source" \ -outputs "file:///tmp/data.json" echo "Run completed: $RUN_ID" ``` ### 2. Pipeline con Manejo de Errores ```bash #!/bin/bash # Script: ~/dagu/scripts/pipeline_with_error_handling.sh set -euo pipefail JOB_NAME="pipeline_with_errors" RUN_ID=$(uuidgen) # Función de cleanup en caso de error cleanup() { marquez-cli run fail -job $JOB_NAME -run-id $RUN_ID echo "Pipeline failed, run marked as FAILED" } trap cleanup ERR # START marquez-cli run start -job $JOB_NAME -run-id $RUN_ID # Trabajo echo "Processing..." # ... tu lógica aquí ... # COMPLETE (solo si todo fue exitoso) marquez-cli run complete \ -job $JOB_NAME \ -run-id $RUN_ID \ -outputs "postgres://table" echo "Pipeline completed successfully" ``` ### 3. Pipeline Multi-Paso ```bash #!/bin/bash JOB_NAME="multi_step_pipeline" RUN_ID=$(uuidgen) # START marquez-cli run start -job $JOB_NAME -run-id $RUN_ID # PASO 1: Extract echo "Extracting..." curl -s https://api.example.com/data > /tmp/raw.json marquez-cli run running \ -job $JOB_NAME \ -run-id $RUN_ID \ -inputs "api://example.com/data" \ -outputs "file:///tmp/raw.json" # PASO 2: Transform echo "Transforming..." jq '.data' /tmp/raw.json > /tmp/clean.json marquez-cli run running \ -job $JOB_NAME \ -run-id $RUN_ID \ -inputs "file:///tmp/raw.json" \ -outputs "file:///tmp/clean.json" # PASO 3: Load echo "Loading..." psql -h localhost -p 5434 -U postgres -d postgres \ -c "COPY events FROM '/tmp/clean.json';" marquez-cli run complete \ -job $JOB_NAME \ -run-id $RUN_ID \ -inputs "file:///tmp/clean.json" \ -outputs "postgres://localhost:5434/postgres/public/events" echo "Pipeline completed" ``` --- ## 📋 Convenciones de Naming ### Dataset URIs Usa siempre URIs descriptivos: | Tipo | Formato | Ejemplo | |------|---------|---------| | PostgreSQL | `postgres://host:port/db/schema/table` | `postgres://localhost:5434/postgres/public/events` | | ClickHouse | `clickhouse://host:port/database/table` | `clickhouse://localhost:8123/default/analytics` | | NATS | `nats://host:port/subject` | `nats://localhost:4222/data.raw` | | Archivo | `file:///absolute/path` | `file:///tmp/data.json` | | API | `api://domain/endpoint` | `api://example.com/users` | | S3 | `s3://bucket/key` | `s3://my-bucket/data/file.parquet` | ### Namespaces - Usa el namespace `automatic-process` para todos tus pipelines - Puedes crear namespaces adicionales para proyectos específicos ### Job Names - Usa nombres descriptivos: `fetch_api_data`, `transform_sales`, `load_warehouse` - Evita guiones (`-`), usa guiones bajos (`_`) - Mantén consistencia con los nombres de DAGs en Dagu --- ## 🔍 Verificar Lineage Después de ejecutar tu pipeline, verifica el lineage: ```bash # 1. Ver jobs ejecutados marquez-cli list jobs # 2. Ver runs de tu job marquez-cli job runs -name my_pipeline # 3. Ver datasets creados marquez-cli list datasets # 4. Ver lineage completo de un dataset marquez-cli lineage -name "postgres://localhost:5434/postgres/public/events" ``` También puedes usar la **Web UI de Marquez**: http://localhost:3001 --- ## 🛠️ Desarrollo ### Compilar desde el Código Fuente ```bash # Clonar el proyecto cd ~/AutomaticProyects/automatic_process/tools/marquez-cli # Compilar make build # Ejecutar sin instalar ./marquez-cli help # Instalar make install # Limpiar binarios make clean # Desinstalar make uninstall ``` ### Estructura del Proyecto ``` marquez-cli/ ├── main.go # CLI principal con comandos ├── openlineage.go # Cliente HTTP y estructuras OpenLineage ├── go.mod # Módulo de Go ├── Makefile # Build automation └── README.md # Documentación ``` --- ## 📊 API de Marquez Utilizada El CLI interactúa con estos endpoints de Marquez: | Endpoint | Método | Uso | |----------|--------|-----| | `/api/v1/lineage` | POST | Enviar eventos OpenLineage | | `/api/v1/lineage` | GET | Obtener lineage de dataset | | `/api/v1/namespaces` | GET | Listar namespaces | | `/api/v1/namespaces/{ns}/jobs` | GET | Listar jobs | | `/api/v1/namespaces/{ns}/datasets` | GET | Listar datasets | | `/api/v1/namespaces/{ns}/jobs/{job}/runs` | GET | Listar runs de job | Documentación completa: https://marquezproject.github.io/marquez/openapi.html --- ## 🎯 Checklist de Lineage Usa esta checklist en cada pipeline: - [ ] **START event** al inicio del pipeline - [ ] **RUNNING events** en cada transformación intermedia - [ ] **COMPLETE event** al finalizar exitosamente - [ ] **FAIL event** si hay errores (handler) - [ ] Declarar **TODOS** los inputs (APIs, archivos, tablas) - [ ] Declarar **TODOS** los outputs (archivos, streams, tablas) - [ ] Usar **URIs bien formados** para datasets - [ ] Usar **mismo namespace** (`automatic-process`) - [ ] Usar **mismo run-id** en todos los eventos del mismo run - [ ] Verificar lineage en Marquez Web UI --- ## 🐛 Troubleshooting ### Error: "connection refused" ```bash # Verificar que Marquez esté corriendo docker ps | grep marquez # Iniciar Marquez si no está corriendo docker-compose -f docker-compose-marquez.yml up -d # Probar conexión curl http://localhost:5000/api/v1/namespaces ``` ### Error: "API error (status 400)" Revisa que: - El namespace existe - Los URIs de datasets estén bien formados - El run-id sea un UUID válido ### Run-id no es consistente entre pasos Usa una variable de entorno o archivo temporal: ```bash # Opción 1: Variable de entorno export RUN_ID=$(uuidgen) marquez-cli run start -job my_job -run-id $RUN_ID marquez-cli run complete -job my_job -run-id $RUN_ID # Opción 2: Archivo temporal uuidgen > /tmp/run_id.txt RUN_ID=$(cat /tmp/run_id.txt) marquez-cli run start -job my_job -run-id $RUN_ID ``` --- ## 📚 Recursos - **OpenLineage Spec**: https://openlineage.io/ - **Marquez Docs**: https://marquezproject.ai/ - **Marquez Web UI**: http://localhost:3001 - **Marquez API**: http://localhost:5000/api/v1 - **Dagu Docs**: https://dagu.sh/ --- ## 📝 Licencia MIT License - Proyecto Automatic Process --- **Última actualización**: 2026-03-23 **Versión**: 1.0.0 **Autor**: Lucas (@egutierrez)