fn_registry/dev/flows/0001-hn-top-stories.md


name: hn-top-stories
id: 0001
status: pending
created: 2026-05-16
updated: 2026-05-16
priority: high
risk: low
related_issues: 0097, 0098
apps: navegator_dashboard, dag_engine, data_factory, agents_and_robots
trigger: cron
schedule: */30 * * * *
expected_runtime_s: 30
tags: scraping, news, smoke-test, multi-app

Goal

Exercise the stack end to end: navegator AutoExtract -> recipe -> dag_engine schedule -> data_factory.runs -> matrix bot. The target page needs no auth and costs nothing to scrape. If this works, all the plumbing is solid.

Prerequisites

  • Chrome launched with --remote-debugging-port=9222 (via navegator_dashboard "Open visible browser").
  • claude CLI on PATH (auto-extract requires an LLM).
  • sqlite_api running on :8484.
  • dag_engine running on :8090.
  • (optional) Matrix bot in the #fn-registry-news room for the final sink.
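
The prerequisites above can be checked with a short script before starting the flow. This is a minimal sketch, assuming every service listens on localhost; `port_open` and `check_prerequisites` are hypothetical helper names, not functions from the registry, and a successful TCP connect only shows that something is listening on the port, not that it is the expected service.

```python
import shutil
import socket


def port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_prerequisites(host: str = "127.0.0.1") -> dict:
    """Check the required services and the claude CLI.

    Assumption: all services run on the same host (localhost by default).
    """
    return {
        "chrome_cdp_9222": port_open(host, 9222),
        "sqlite_api_8484": port_open(host, 8484),
        "dag_engine_8090": port_open(host, 8090),
        "claude_cli": shutil.which("claude") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name}: {'OK' if ok else 'MISSING'}")
```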

Flow

  1. Launch Chrome via navegator (port 9222).
  2. AutoExtract panel: URL https://news.ycombinator.com. Click "Open & Analyze".
  3. Wait ~10-20 s. Verify the proposed schema: rank, title, url, points, comments, age.
  4. Refine the selectors if the AI proposes broken ones. Run Test extraction and confirm the preview shows >= 20 rows.
  5. Save as recipe hn_top.yaml (in projects/navegator/profiles/default/recipes/).
  6. Create the DAG ~/.dagu/dags/hn-top.yaml (by hand or by copying from apps/dag_engine/dags_migrated/):
    name: hn-top-stories
    description: Scrape HN top stories every 30 min
    schedule: "*/30 * * * *"
    steps:
      - name: extract
        function: cdp_extract_recipe_py_pipelines
        args: ["projects/navegator/profiles/default/recipes/hn_top.yaml"]
    
  7. Reload dag_engine and enable the scheduler. Trigger Run Now once to test.
  8. dag_engine_ui: verify a run with status=success and the correct function_id on the step.
  9. data_factory: the Extractors tab shows node hn_top_stories (created when the recipe was saved). The "All Runs" tab shows the new runs.
  10. (optional) Add a transformer step that filters points > 100 and feeds a Matrix bot sink.
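
The optional transformer in step 10 could look like the sketch below. It assumes extracted rows arrive as dictionaries keyed by the schema fields from step 3 and that points has already been parsed to a number. `filter_top_stories` and `format_message` are hypothetical names, not registry functions, and the message layout is only an illustration of what a sink might receive.

```python
from typing import Any


def filter_top_stories(rows: list, min_points: int = 100) -> list:
    """Keep only rows whose points are strictly greater than min_points."""
    kept = []
    for row in rows:
        points = row.get("points")
        # Skip rows where points is missing or not numeric.
        if isinstance(points, bool) or not isinstance(points, (int, float)):
            continue
        if points > min_points:
            kept.append(row)
    return kept


def format_message(row: dict) -> str:
    """Render one story as a plain-text line for the sink (assumed layout)."""
    return f"#{row['rank']} {row['title']} ({row['points']} points) {row['url']}"


if __name__ == "__main__":
    sample = [
        {"rank": 1, "title": "A", "url": "https://example.com/a", "points": 250},
        {"rank": 2, "title": "B", "url": "https://example.com/b", "points": 100},
    ]
    for story in filter_top_stories(sample):
        print(format_message(story))
```

The comparison is strict, matching "points > 100", so a story with exactly 100 points is dropped.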

Acceptance

  • Recipe created and validated (validate_recipe_yaml_py_core OK).
  • DAG runs OK 2 consecutive times via the scheduler.
  • data_factory.runs has >=2 entries with node_id='hn_top_stories'.
  • cdp_extract_recipe_py_pipelines appears in call_monitor.calls.
  • Extracted schema covers 6/6 fields (rank, title, url, points, comments, age).
  • (optional) Matrix bot receives >=1 message with a filtered top story.
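
The 6/6 schema criterion can be checked mechanically. The sketch below assumes extracted rows are dictionaries and treats a field as covered when at least one row has a non-empty value for it; that threshold is an assumption, and `schema_coverage` and `covered_fields` are hypothetical helpers rather than registry functions.

```python
EXPECTED_FIELDS = ("rank", "title", "url", "points", "comments", "age")


def schema_coverage(rows, expected=EXPECTED_FIELDS):
    """Map each expected field to the fraction of rows with a non-empty value."""
    total = len(rows)
    coverage = {}
    for field in expected:
        # None and "" count as missing; 0 is a legitimate value (e.g. 0 comments).
        filled = sum(1 for row in rows if row.get(field) not in (None, ""))
        coverage[field] = filled / total if total else 0.0
    return coverage


def covered_fields(rows, expected=EXPECTED_FIELDS):
    """Return the expected fields that have a value in at least one row."""
    return [f for f, frac in schema_coverage(rows, expected).items() if frac > 0]


if __name__ == "__main__":
    rows = [{"rank": 1, "title": "A", "url": "https://example.com",
             "points": 10, "comments": 0, "age": "1 hour ago"}]
    print(f"{len(covered_fields(rows))}/{len(EXPECTED_FIELDS)} fields covered")
```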

Expected telemetry

  • function_stats.cdp_extract_recipe_py_pipelines: calls_24h += 2.
  • data_factory.runs: 2 new rows with trigger='cron'.
  • dag_engine.dag_step_results: step extract with function_id='cdp_extract_recipe_py_pipelines'.
  • call_monitor.calls: the chained function call is recorded.

Definition of Done

See README.md, section DoD + user-facing.

Generic

  • Repeatability: runs 3 consecutive times via cron without intervention.
  • Observability: call_monitor.calls records cdp_extract_recipe_py_pipelines and data_factory.runs shows node_id=hn_top_stories.
  • Error path: if Chrome on :9222 goes down, the step fails with a clear message (no silent DAG crash).
  • Idempotency: the dedup step dedup_duckdb_table_by_hash_py_pipelines runs after extract; the same HTML twice yields 0 new rows.
  • Secrets: N/A (HN is public).
  • Docs: ## Notes contains commands to reproduce plus onboarding.
  • Registry-first: extract runs with no inline code in the DAG.
  • INDEX + status: status: done + INDEX.md + moved to completed/.
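
The idempotency criterion relies on hash-based dedup. The sketch below illustrates the idea in plain Python with an in-memory set. It is not the implementation of dedup_duckdb_table_by_hash_py_pipelines, which this document does not show, and `row_hash` and `insert_new_rows` are hypothetical names.

```python
import hashlib
import json


def row_hash(row: dict) -> str:
    """Stable SHA-256 digest of a row, independent of key order."""
    payload = json.dumps(row, sort_keys=True, ensure_ascii=False, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def insert_new_rows(seen_hashes: set, table: list, rows: list) -> int:
    """Append only rows whose hash is unseen; return how many were inserted."""
    inserted = 0
    for row in rows:
        digest = row_hash(row)
        if digest in seen_hashes:
            continue  # identical row already stored
        seen_hashes.add(digest)
        table.append(row)
        inserted += 1
    return inserted


if __name__ == "__main__":
    seen, table = set(), []
    batch = [{"rank": 1, "title": "A"}, {"rank": 2, "title": "B"}]
    print(insert_new_rows(seen, table, batch))  # first pass inserts both rows
    print(insert_new_rows(seen, table, batch))  # identical batch inserts none
```

Hashing the full row means a story whose points or age change between runs counts as a new row. Which columns feed the hash is a design choice this sketch leaves open.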

User-facing

  • User-facing: the user opens data_factory.exe → "All Runs" tab, filters node_id=hn_top_stories → sees >=30 rows with rank/title/url/points.
  • User-facing repeat: the user returns the next day to the same tab and sees fresh runs (every 30 min) and an updated table.
  • User-facing onboarding: paragraph in ## Notes: "To see HN top: launch data_factory.exe → Extractors tab → hn_top_stories. DuckDB at apps/data_factory/data/hn_top_stories.duckdb, table hn_stories."
  • User-facing latency: cron */30 * * * * → fresh data in <31 min p95.

Custom

  • 7/7 fields covered in ALL runs from the last 24h (rank/title/url/points/author/age/comments).
  • Extract latency <30s p95 (cdp_extract_recipe + render).
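
The p95 latency target can be evaluated from recorded step durations. The sketch below uses the nearest-rank percentile definition, which is an assumption since this document does not say how p95 is computed; `percentile_nearest_rank` and `meets_latency_target` are hypothetical helpers.

```python
import math


def percentile_nearest_rank(values, pct: int):
    """Nearest-rank percentile: the smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def meets_latency_target(durations_s, limit_s: float = 30.0, pct: int = 95) -> bool:
    """True if the chosen percentile of durations is strictly below limit_s."""
    return percentile_nearest_rank(durations_s, pct) < limit_s


if __name__ == "__main__":
    durations = [12.0, 14.5, 11.2, 28.9, 13.3]  # made-up sample values, in seconds
    print(meets_latency_target(durations))
```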

Notes

(to be filled in after running)