fn_registry/dev/flows/0001-hn-top-stories.md at master

egutierrez 6ad82167bb docs(flows): mandatory DoD with user-facing surface + open issues 0100-0103 (taxonomy, frontmatter migration, dev_console, work dashboard)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-17 00:07:03 +02:00

name, id, status, created, updated, priority, risk, related_issues, apps, trigger, schedule, expected_runtime_s, tags

Goal

Exercise the stack end to end: navegator AutoExtract -> recipe -> dag_engine schedule -> data_factory.runs -> Matrix bot. The target page needs no authentication and costs nothing to scrape. If this works, all the plumbing is solid.

Prerequisites

Chrome launched with --remote-debugging-port=9222 (via navegator_dashboard "Open visible browser").
claude CLI on PATH (auto-extract requires an LLM).
sqlite_api running on :8484.
dag_engine running on :8090.
(optional) Matrix bot in the #fn-registry-news room for the final sink.
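
The prerequisites above can be checked before starting the flow. The following is a minimal sketch, not part of the registry: `port_open` and `check_prerequisites` are hypothetical helper names, and it assumes every service listens on localhost. It only tests that the TCP ports accept connections and that a `claude` executable is on PATH; it does not verify that the services are healthy.

```python
import shutil
import socket


def port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_prerequisites(host: str = "127.0.0.1") -> dict[str, bool]:
    """Map each prerequisite to True/False (ports taken from the list above)."""
    return {
        "chrome_cdp_9222": port_open(host, 9222),
        "claude_cli_on_path": shutil.which("claude") is not None,
        "sqlite_api_8484": port_open(host, 8484),
        "dag_engine_8090": port_open(host, 8090),
    }


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{'OK  ' if ok else 'FAIL'} {name}")
```

Any FAIL line points at the service to start before opening the AutoExtract panel.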

Flow

Launch Chrome via navegator (port 9222).
AutoExtract panel: URL https://news.ycombinator.com. Click "Open & Analyze".
Wait ~10-20s. Verify the proposed schema: rank, title, url, points, comments, age.
Refine the selectors if the AI proposes broken ones. Test extraction -> preview shows >= 20 rows.
Save as recipe hn_top.yaml (in projects/navegator/profiles/default/recipes/).

Create the DAG ~/.dagu/dags/hn-top.yaml (by hand or by copying from apps/dag_engine/dags_migrated/):

name: hn-top-stories
description: Scrape HN top stories every 30 min
schedule: "*/30 * * * *"
steps:
  - name: extract
    function: cdp_extract_recipe_py_pipelines
    args: ["projects/navegator/profiles/default/recipes/hn_top.yaml"]

Reload dag_engine and enable the scheduler. Trigger Run Now once to test.
dag_engine_ui: verify a run with status=success and the correct function_id on the step.
data_factory: the Extractors tab shows the hn_top_stories node (created when the recipe is saved). The "All Runs" tab shows the new runs.
(optional) Add a transformer step that keeps points > 100 -> Matrix bot sink.
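
The optional transformer in the last step could look roughly like this. It is a sketch under assumptions: `filter_hot_stories` is a hypothetical name (in a registry-first setup it would live as a registered function, not inline in the DAG), and rows are assumed to be dicts keyed by the schema fields, with `points` possibly arriving as a string.

```python
from typing import Iterable


def filter_hot_stories(rows: Iterable[dict], min_points: int = 100) -> list[dict]:
    """Keep rows whose points strictly exceed min_points.

    Rows with a missing or non-numeric points value are dropped,
    for example an entry that carries no score.
    """
    hot = []
    for row in rows:
        try:
            points = int(row.get("points"))
        except (TypeError, ValueError):
            continue  # no usable score: skip rather than fail the step
        if points > min_points:
            hot.append(row)
    return hot
```

The threshold is strict (`>`), matching "points > 100", so a story with exactly 100 points is not forwarded to the sink.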

Acceptance

Recipe created and validated (validate_recipe_yaml_py_core OK).
DAG runs OK 2 consecutive times via the scheduler.
data_factory.runs has >=2 entries with node_id='hn_top_stories'.
cdp_extract_recipe_py_pipelines appears in call_monitor.calls.
Extracted schema covers 6/6 fields (rank, title, url, points, comments, age).
(optional) Matrix bot receives >=1 message with a filtered top story.
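
The 6/6 schema criterion can be checked mechanically on a batch of extracted rows. This is an illustrative sketch: `field_coverage` and `covered_fields` are hypothetical helpers, and the row shape (one dict per story) is an assumption about what the recipe emits.

```python
EXPECTED_FIELDS = ("rank", "title", "url", "points", "comments", "age")


def field_coverage(rows: list[dict], fields=EXPECTED_FIELDS) -> dict[str, float]:
    """Fraction of rows that carry a non-empty value for each field."""
    if not rows:
        return {f: 0.0 for f in fields}
    return {
        f: sum(1 for r in rows if r.get(f) not in (None, "")) / len(rows)
        for f in fields
    }


def covered_fields(rows: list[dict], fields=EXPECTED_FIELDS) -> int:
    """Number of fields populated in at least one row."""
    return sum(1 for v in field_coverage(rows, fields).values() if v > 0)
```

A value of 0 (for instance a story with 0 comments) counts as populated; only `None` and the empty string count as missing.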

Expected telemetry

function_stats.cdp_extract_recipe_py_pipelines: calls_24h += 2.
data_factory.runs: 2 new rows with trigger='cron'.
dag_engine.dag_step_results: extract step with function_id='cdp_extract_recipe_py_pipelines'.
call_monitor.calls: chain function call.
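
A query along these lines could confirm the data_factory.runs expectation. The table layout here is an assumption made for illustration (columns `node_id`, `trigger`, `status` in a SQLite table named `runs`); the real schema and storage engine may differ. The demo uses an in-memory database so it does not touch any real file.

```python
import sqlite3


def count_cron_runs(conn: sqlite3.Connection, node_id: str) -> int:
    """Count runs for node_id that were started by the scheduler."""
    row = conn.execute(
        # "trigger" is quoted because TRIGGER is an SQL keyword.
        'SELECT COUNT(*) FROM runs WHERE node_id = ? AND "trigger" = ?',
        (node_id, "cron"),
    ).fetchone()
    return row[0]


def demo_db() -> sqlite3.Connection:
    """Build an in-memory table with the assumed columns and sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute('CREATE TABLE runs (node_id TEXT, "trigger" TEXT, status TEXT)')
    conn.executemany(
        "INSERT INTO runs VALUES (?, ?, ?)",
        [
            ("hn_top_stories", "manual", "success"),  # the Run Now test
            ("hn_top_stories", "cron", "success"),
            ("hn_top_stories", "cron", "success"),
        ],
    )
    return conn
```

With the sample rows, `count_cron_runs(demo_db(), "hn_top_stories")` counts the two scheduled runs and ignores the manual one.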

Definition of Done

See README.md, section DoD + user-facing.

Generic

Repeatability: runs 3 consecutive times via cron without intervention.
Observability: call_monitor.calls records cdp_extract_recipe_py_pipelines and data_factory.runs shows node_id=hn_top_stories.
Error path: if Chrome on :9222 goes down, the step fails with a clear message (no silent DAG crash).
Idempotency: dedup_duckdb_table_by_hash_py_pipelines runs after extract; the same HTML twice yields 0 new rows.
Secrets: N/A (HN is public).
Docs: ## Notes with commands to reproduce + onboarding.
Registry-first: extract with no inline code in the DAG.
INDEX + status: status: done + INDEX.md + moved to completed/.
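
The idempotency criterion can be illustrated with hash-based dedup. This is a sketch of the general technique only, not the implementation of dedup_duckdb_table_by_hash_py_pipelines: it uses a plain dict as a stand-in for the table, and `row_hash` and `insert_new_rows` are hypothetical names.

```python
import hashlib
import json


def row_hash(row: dict) -> str:
    """Stable SHA-256 over the row content; key order does not matter."""
    payload = json.dumps(row, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def insert_new_rows(table: dict[str, dict], rows: list[dict]) -> int:
    """Insert rows whose hash is not yet in table; return how many were added."""
    added = 0
    for row in rows:
        key = row_hash(row)
        if key not in table:
            table[key] = row
            added += 1
    return added
```

Feeding the same extracted rows twice adds rows the first time and 0 the second. Note that hashing the full row means a story whose points or age changed between scrapes counts as a new row; hashing a subset of fields would be a different design choice.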

User-facing

User-facing: the user opens data_factory.exe → "All Runs" tab, filters by node_id=hn_top_stories → sees >=30 rows with rank/title/url/points.
User-facing repeat: comes back tomorrow to the same tab and sees fresh runs (every 30 min) and an updated table.
User-facing onboarding: paragraph in ## Notes: "To see HN top: launch data_factory.exe → Extractors tab → hn_top_stories. DuckDB at apps/data_factory/data/hn_top_stories.duckdb, table hn_stories."
User-facing latency: cron */30 * * * * → fresh data in <31 min p95.
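
The <31 min figure follows from the schedule plus the extract budget: a 30 min interval and a run that takes under 30 s give at most about 30.5 min between scrapes landing. A small sketch of that arithmetic, with hypothetical helper names; `next_tick` assumes a `*/N` minute field where N divides 60.

```python
from datetime import datetime, timedelta


def next_tick(now: datetime, step_minutes: int = 30) -> datetime:
    """Next firing time of a '*/step * * * *' cron strictly after now.

    Assumes step_minutes divides 60 (true for */30).
    """
    base = now.replace(second=0, microsecond=0)
    minutes_to_add = step_minutes - (base.minute % step_minutes)
    return base + timedelta(minutes=minutes_to_add)


def worst_case_staleness(interval: timedelta, run_duration: timedelta) -> timedelta:
    """Upper bound on data age just before the next run's results land."""
    return interval + run_duration
```

For example, `worst_case_staleness(timedelta(minutes=30), timedelta(seconds=30))` is 30 min 30 s, which stays under the 31 min target as long as the extract meets its own 30 s budget.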

Custom

7/7 fields covered in ALL runs from the last 24h (rank/title/url/points/author/age/comments).
Extract latency <30s p95 (cdp_extract_recipe + render).
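
The p95 latency criterion can be evaluated from per-run durations. The sketch below uses the nearest-rank percentile definition, which is one of several common conventions and an assumption here; how durations are collected (for example from dag_engine step results) is left open, and the function names are hypothetical.

```python
import math


def percentile_nearest_rank(values: list[float], pct: int) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def meets_latency_target(durations_s: list[float], limit_s: float = 30.0) -> bool:
    """True if the p95 of the durations is strictly below limit_s."""
    return percentile_nearest_rank(durations_s, 95) < limit_s
```

With few samples the nearest-rank p95 is simply the maximum (for 10 runs, rank 10), so a single slow run fails the check until more runs accumulate.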

Notes

(to be filled in after running)
