---
name: hn-top-stories
id: 0001
status: pending
created: 2026-05-16
updated: 2026-05-16
priority: high
risk: low
related_issues: [0097, 0098]
apps:
  - navegator_dashboard
  - dag_engine
  - data_factory
  - agents_and_robots
trigger: cron
schedule: "*/30 * * * *"
expected_runtime_s: 30
tags: [scraping, news, smoke-test, multi-app]
---

## Goal

Test the stack end-to-end: navegator AutoExtract -> recipe -> dag_engine schedule -> data_factory.runs -> matrix bot. The target page needs no auth and costs nothing. If this works, all the plumbing is solid.

## Prerequisites

- Chrome launched with `--remote-debugging-port=9222` (via navegator_dashboard "Open visible browser").
- `claude` CLI on PATH (auto-extract requires an LLM).
- sqlite_api running on `:8484`.
- dag_engine running on `:8090`.
- (optional) Matrix bot in room `#fn-registry-news` for the final sink.

## Flow

1. Launch Chrome via navegator (port 9222).
2. AutoExtract panel: URL `https://news.ycombinator.com`. Click "Open & Analyze".
3. Wait ~10-20s. Verify the proposed schema: `rank`, `title`, `url`, `points`, `comments`, `age`.
4. Refine the selectors if the AI-proposed ones are broken. Test extraction -> preview rows >= 20.
5. Save as recipe `hn_top.yaml` (in `projects/navegator/profiles/default/recipes/`).
6. Create DAG `~/.dagu/dags/hn-top.yaml` (manually or copied from `apps/dag_engine/dags_migrated/`):

   ```yaml
   name: hn-top-stories
   description: Scrape HN top stories every 30 min
   schedule: "*/30 * * * *"
   steps:
     - name: extract
       function: cdp_extract_recipe_py_pipelines
       args: ["projects/navegator/profiles/default/recipes/hn_top.yaml"]
   ```

7. Reload dag_engine and enable the scheduler. Trigger Run Now once to test.
8. dag_engine_ui: verify the run has status=success and the correct function_id in the step.
9. data_factory: the Extractors tab shows node `hn_top_stories` (created when the recipe was saved). The "All Runs" tab shows the new runs.
10. (optional) Add a transformer step that filters `points > 100` -> sink matrix bot.

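The optional transformer in step 10 could look roughly like the sketch below. It assumes the extract step yields rows as dicts keyed by the six schema fields; `filter_top_stories` and `MIN_POINTS` are hypothetical names chosen for illustration, not existing registry functions, and the Matrix sink itself is not shown.

```python
from __future__ import annotations

from typing import Any

# Hypothetical default, taken from the `points > 100` rule in step 10.
MIN_POINTS = 100


def filter_top_stories(
    rows: list[dict[str, Any]], min_points: int = MIN_POINTS
) -> list[dict[str, Any]]:
    """Keep only rows whose `points` value is strictly greater than min_points."""
    kept = []
    for row in rows:
        try:
            # Rows without a score (missing or None) are treated as 0 points.
            points = int(row.get("points") or 0)
        except (TypeError, ValueError):
            # Skip rows whose points field is not numeric.
            continue
        if points > min_points:
            kept.append(row)
    return kept
```

The comparison is strict, so a story with exactly 100 points is dropped. Rows with a missing or non-numeric `points` value are dropped as well, which keeps a broken selector from crashing the step.
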
## Acceptance

- [ ] Recipe created and validated (`validate_recipe_yaml_py_core` OK).
- [ ] DAG runs OK 2 consecutive times via the scheduler.
- [ ] `data_factory.runs` has >=2 entries with `node_id='hn_top_stories'`.
- [ ] `cdp_extract_recipe_py_pipelines` appears in `call_monitor.calls`.
- [ ] Extracted schema covers 6/6 fields (rank, title, url, points, comments, age).
- [ ] (optional) Matrix bot receives >=1 message with a filtered top story.

## Expected telemetry

- `function_stats.cdp_extract_recipe_py_pipelines`: calls_24h += 2.
- `data_factory.runs`: 2 new rows with `trigger='cron'`.
- `dag_engine.dag_step_results`: step `extract` with `function_id='cdp_extract_recipe_py_pipelines'`.
- `call_monitor.calls`: chain function call.

## Notes (filled in after running)