| id | title | status | type | domain | scope | priority | depends | blocks | related | created | updated | tags |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0097 | data_factory app (Factorio-style data pipeline) | pendiente | feature | | app-scoped | alta | | | | 2026-05-17 | 2026-05-17 | |
0097 — data_factory app (Factorio-style data pipeline)
Status: pendiente (pending)
Created: 2026-05-15
Type: app
Priority: alta (high)
Depends: 0096 (apps/ standard), DONE
Blocks: 0098 (function dep tree), deferred until after 0097
Problem
There is no unified view of the data flow. Today:
- dag_engine_ui shows running DAGs (orchestration).
- registry_dashboard shows code (functions, dependencies).
- But there is NO view of which data I extract, where it ends up, what volume it has, or its end-to-end lineage.

User analogy: a Factorio factory. Extractors (drills), Transformers (assemblers), Databases (chests), Sinks (rocket silo), Belts (lineage).
Goal for v1
A standalone C++ ImGui app at apps/data_factory/ that shows:
- Extractors (functions tagged extractor): table with name, source, last_run, rows/min, kb/min, success_rate, schedule. Clicking a row opens the detail view + sample rows.
- Transformers (tag transformer): table with inputs/outputs, drop_ratio, last_run.
- Databases: table with kind (sqlite/postgres/bq/parquet/mongo), uri, table_count, size, freshness.
- Sinks (tag sink): Metabase cards, FastAPI endpoints, alerts.
- Health: KPIs derived from runs: rows_24h, success_rate, freshness_alerts.
- Map (placeholder in v1): flat list of nodes + connections. The visual graph arrives in v2 (not in 0097).
Out of scope for v1
- Graphical Map with imgui_node_editor: v2.
- Models tab (ML trainers): v2.
- Automatic integration with dag_engine: manual in v1 (the user decides schedules in DAGs).
- Multi-tenant: one PC, one data_factory.db.
Architecture
data_factory.db (per-PC, in apps/data_factory/)
├── nodes (kind=extractor|transformer|database|sink, function_id FK registry, config_json)
├── connections (src_node, dst_node, payload_schema)
├── runs (node_id, started_at, finished_at, status, rows_in, rows_out, kb_in, kb_out, error)
└── databases (kind, uri, label)
sqlite_api (adds a DataFactoryHub WS)
└── /api/ws/datafactory (replicates the CallMonitorHub + DagRunHub pattern)
C++ ImGui (apps/data_factory/)
├── data_http.{cpp,h} (REST to sqlite_api)
├── ws_client.{cpp,h} (reused from dag_engine_ui)
├── tabs.{cpp,h} (Extractors, Transformers, Databases, Sinks, Health, Map)
└── main.cpp (fn::run_app + cfg.panels)
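
As a rough illustration of the four tables outlined above, the following sketch creates them idempotently in SQLite. The table and column names come from the outline; the column types, primary keys, the CHECK constraint, and the helper names open_db and table_names are assumptions made for this example. The real migration is planned as 001_init.sql loaded through Go's embed.FS, so Python is used here only to exercise the DDL.

```python
import sqlite3

# Hypothetical DDL. Names follow the architecture outline; types, keys,
# and constraints are assumptions for illustration only.
SCHEMA = """
CREATE TABLE IF NOT EXISTS nodes (
    id          INTEGER PRIMARY KEY,
    kind        TEXT NOT NULL
                CHECK (kind IN ('extractor', 'transformer', 'database', 'sink')),
    function_id TEXT,                      -- points at a registry function
    config_json TEXT NOT NULL DEFAULT '{}'
);
CREATE TABLE IF NOT EXISTS connections (
    id             INTEGER PRIMARY KEY,
    src_node       INTEGER NOT NULL REFERENCES nodes(id),
    dst_node       INTEGER NOT NULL REFERENCES nodes(id),
    payload_schema TEXT
);
CREATE TABLE IF NOT EXISTS runs (
    id          INTEGER PRIMARY KEY,
    node_id     INTEGER NOT NULL REFERENCES nodes(id),
    started_at  TEXT NOT NULL,             -- ISO-8601 UTC timestamp
    finished_at TEXT,
    status      TEXT NOT NULL,
    rows_in     INTEGER,
    rows_out    INTEGER,
    kb_in       REAL,
    kb_out      REAL,
    error       TEXT
);
CREATE TABLE IF NOT EXISTS databases (
    id    INTEGER PRIMARY KEY,
    kind  TEXT NOT NULL,
    uri   TEXT NOT NULL,
    label TEXT
);
"""


def open_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open the database and apply the schema; safe to call repeatedly."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def table_names(conn: sqlite3.Connection) -> list:
    """Return the user tables in alphabetical order."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
    return [r[0] for r in rows]
```

Because every statement uses CREATE TABLE IF NOT EXISTS, applying the script a second time leaves an existing database unchanged, which is one simple way to get the idempotence that phase A asks for.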
Phases
| Phase | What goes in |
|---|---|
| A. Schema + migrations | apps/data_factory/migrations/001_init.sql with 4 tables. Idempotent (embed.FS). |
| B. Tags registry | Audit and assign the extractor/transformer/sink tags to existing functions. Target: >=3 functions per tag (capability group threshold). |
| C. Scaffold C++ app | fn run init_cpp_app data_factory --desc "Factorio-style data pipeline factory". Creates apps/data_factory/. |
| D. DataFactoryHub | projects/fn_monitoring/apps/sqlite_api/datafactory_events.go, a clone of CallMonitorHub. Endpoint /api/ws/datafactory. |
| E. Wrappers data_factory_record_run | Python pipeline: wraps fn run <id>, capturing rows/kb/duration -> INSERT into the runs table. Reuses subprocess + psutil. |
| F. Tabs Extractors+Transformers+Databases+Sinks | Reuses data_table_cpp_viz. List filterable by tag. Clicking a row opens a side panel with details (FnInfo pattern from dag_engine_ui). |
| G. Health tab | KPIs derived from runs (rows_24h, success_rate, freshness). Reuses the dag_engine_ui health pattern. |
| H. e2e_checks + deploy | build_cmake, binary_exists, self-test (HTTP probe to sqlite_api), cpp_apps_conformance. redeploy_cpp_app_windows. |
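
The Health KPIs in phase G could be derived from the runs table roughly as follows. This is a minimal sketch under several assumptions: timestamps are stored as ISO-8601 UTC strings written in one consistent format (so string comparison matches time order), a successful run has status 'ok', rows_24h counts rows_out of successful runs, and a node is "stale" when its last successful run finished more than stale_after ago. The function name health_kpis is hypothetical.

```python
import sqlite3
from datetime import datetime, timedelta, timezone


def health_kpis(conn, now, stale_after=timedelta(hours=24)):
    """Compute rows_24h, success_rate, and freshness_alerts from runs.

    Assumes ISO-8601 UTC timestamps and status == 'ok' for success.
    """
    since = (now - timedelta(hours=24)).isoformat()
    total, ok, rows_24h = conn.execute(
        """
        SELECT COUNT(*),
               COALESCE(SUM(status = 'ok'), 0),
               COALESCE(SUM(CASE WHEN status = 'ok' THEN rows_out END), 0)
        FROM runs
        WHERE started_at >= ?
        """,
        (since,),
    ).fetchone()

    # Nodes whose most recent successful run is older than the cutoff.
    # Nodes that never succeeded do not appear here in this sketch.
    cutoff = (now - stale_after).isoformat()
    stale = [
        row[0]
        for row in conn.execute(
            """
            SELECT node_id FROM runs
            WHERE status = 'ok'
            GROUP BY node_id
            HAVING MAX(finished_at) < ?
            ORDER BY node_id
            """,
            (cutoff,),
        )
    ]
    return {
        "rows_24h": rows_24h,
        "success_rate": ok / total if total else None,
        "freshness_alerts": stale,
    }
```

Passing now explicitly keeps the function deterministic in tests; a caller in production would pass datetime.now(timezone.utc).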
New functions to delegate (estimate)
| ID | Lang | What it does |
|---|---|---|
| data_factory_record_run_py_pipelines | py pipeline | Wraps fn run <id> [args]. Captures rows (parsed from stdout) + kb (size of output) + duration. INSERTs into data_factory.db.runs. |
| data_factory_list_nodes_go_infra | go infra | Lists nodes from data_factory.db, filtered by kind. |
| data_factory_open_db_go_infra | go infra | Opens data_factory.db with embed.FS migrations. |

New tags: extractor, transformer, sink, database, validator (not in v1).
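
A wrapper along the lines of the planned record_run function might look like the sketch below. It is not the final design: it uses only subprocess (no psutil), treats every non-empty stdout line as one row, measures kb as the stdout size, and labels runs 'ok' or 'error' from the exit code. All of these are assumptions for illustration, as is the function name record_run. Duration is not stored separately because it can be derived from started_at and finished_at.

```python
import sqlite3
import subprocess
import sys
from datetime import datetime, timezone


def record_run(conn, node_id, cmd):
    """Run cmd, measure its output, and INSERT one row into runs.

    Assumption: one non-empty stdout line equals one output row.
    """
    started = datetime.now(timezone.utc)
    proc = subprocess.run(cmd, capture_output=True)
    finished = datetime.now(timezone.utc)

    rows_out = sum(1 for line in proc.stdout.splitlines() if line.strip())
    kb_out = len(proc.stdout) / 1024
    status = "ok" if proc.returncode == 0 else "error"
    # Keep only the tail of stderr so a noisy failure stays small.
    error = None
    if status == "error":
        error = proc.stderr.decode(errors="replace")[-500:]

    conn.execute(
        "INSERT INTO runs (node_id, started_at, finished_at, status,"
        " rows_out, kb_out, error) VALUES (?, ?, ?, ?, ?, ?, ?)",
        (node_id, started.isoformat(), finished.isoformat(),
         status, rows_out, kb_out, error),
    )
    conn.commit()
    return status
```

In the real pipeline cmd would be the fn run <id> [args] invocation; for a quick local check, something like record_run(conn, 1, [sys.executable, "-c", "print('a')"]) exercises the same path without depending on the fn CLI.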
Acceptance criteria for v1
- apps/data_factory/ exists with canonical scaffolding (complies with cpp/PATTERNS.md).
- data_factory.db is created on first run with the initial tables.
- sqlite_api exposes /api/ws/datafactory and GET /api/datafactory/nodes.
- At least 3 functions tagged extractor and 3 tagged transformer (re-tagged from existing functions).
- The C++ app starts, connects to sqlite_api, and shows the tabs Extractors/Transformers/Databases/Sinks/Health.
- fn doctor cpp-apps is clean.
- fn doctor uses-functions is clean for data_factory.
- Build passes on Linux + Windows.
- e2e_check self_test exits 0.
Non-goals
- Visual lineage graph (deferred to 0098 or v2).
- Editing DAGs/extractors from the UI (read-only + run-now).
- ML trainers (v2).
- Multi-PC sync of data_factory.db (same as dag_engine.db: per-PC).
Risks
| Risk | Mitigation |
|---|---|
| Tags extractor/etc. overlap with existing tags (metabase, bigquery) | Accepted: a function can have several tags. Tags are flat, with no hierarchy. |
| Assigning tags breaks FTS or existing consumers | Tags are additive, not destructive. fn index validates. |
| The Python wrapper data_factory_record_run interferes with call_monitor telemetry (issue 0085) | PostToolUse telemetry stays active. The wrapper writes to a different table (data_factory.runs vs call_monitor.calls). No conflict. |
| The data_factory.db schema changes a lot during v1 | Strictly additive migrations (rule db_migrations.md). If a type changes: branch-by-abstraction. |
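
To make the "strictly additive" rule concrete, one possible shape for an additive migration step is sketched here: a column is added only if it is missing, and nothing is ever dropped or retyped. The helper name add_column_if_missing and the example column duration_ms are hypothetical; the actual rules live in db_migrations.md, which this sketch does not reproduce.

```python
import sqlite3


def add_column_if_missing(conn, table, column, decl):
    """Add a column only when it does not exist yet.

    Returns True if the column was added, False if it was already there.
    table, column, and decl are interpolated into SQL, so they must be
    trusted constants from migration code, never user input.
    """
    # PRAGMA table_info yields one row per column; index 1 is the name.
    existing = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
    if column in existing:
        return False
    conn.execute(f"ALTER TABLE {table} ADD COLUMN {column} {decl}")
    return True
```

Since existing columns are untouched, readers and writers built against the older schema keep working, and rerunning the step is a no-op.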
Target telemetry
- data_factory appears as active in pc_locations.
- 3+ new functions created + used in the same turn (good capability-growth ratio).
- Runs persist in data_factory.db and are visible from the Health tab in <500ms after fn run <extractor>.