Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
| name | id | status | created | updated | priority | risk | related_issues | apps | trigger | schedule | expected_runtime_s | tags |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| hn-top-stories | 0001 | pending | 2026-05-16 | 2026-05-16 | high | low | | | cron | `*/30 * * * *` | 30 | |
Goal
Test the stack end to end: navegator AutoExtract -> recipe -> dag_engine schedule -> data_factory.runs -> Matrix bot. The target page needs no auth and costs nothing. If this works, all the plumbing is solid.
Prerequisites
- Chrome launched with `--remote-debugging-port=9222` (via navegator_dashboard "Open visible browser").
- `claude` CLI on PATH (auto-extract requires an LLM).
- sqlite_api running on `:8484`.
- dag_engine running on `:8090`.
- (optional) Matrix bot in room `#fn-registry-news` for the final sink.
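A quick preflight for the list above can be sketched as follows. This is a minimal sketch, not part of the stack: the ports and the `claude` binary name come from the prerequisites, while `port_open`, `check_prerequisites` and the `REQUIRED_PORTS` labels are hypothetical names, and all services are assumed to listen on localhost.

```python
import shutil
import socket

# Ports from the prerequisites list; the keys are labels chosen for this sketch.
REQUIRED_PORTS = {"chrome_cdp": 9222, "sqlite_api": 8484, "dag_engine": 8090}


def port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_prerequisites(host: str = "127.0.0.1") -> dict[str, bool]:
    """Map each prerequisite to True/False. Only probes; starts nothing."""
    results = {name: port_open(host, port) for name, port in REQUIRED_PORTS.items()}
    # The CLI only needs to be resolvable on PATH.
    results["claude_cli"] = shutil.which("claude") is not None
    return results
```

Printing `check_prerequisites()` before starting the flow shows which piece is missing. The optional Matrix bot is left out on purpose, since it is not required for the run.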
Flow
- Launch Chrome via navegator (port 9222).
- AutoExtract panel: URL `https://news.ycombinator.com`. Click "Open & Analyze".
- Wait ~10-20s. Verify the proposed schema: `rank,title,url,points,comments,age`.
- Refine the selectors if the AI proposes broken ones. Test extraction -> preview rows >= 20.
- Save as recipe `hn_top.yaml` (in `projects/navegator/profiles/default/recipes/`).
- Create DAG `~/.dagu/dags/hn-top.yaml` (manually or copied from `apps/dag_engine/dags_migrated/`):

      name: hn-top-stories
      description: Scrape HN top stories every 30 min
      schedule: "*/30 * * * *"
      steps:
        - name: extract
          function: cdp_extract_recipe_py_pipelines
          args: ["projects/navegator/profiles/default/recipes/hn_top.yaml"]

- Reload dag_engine + enable the scheduler. Trigger Run Now once to test.
- dag_engine_ui: verify a run with status=success + the correct function_id on the step.
- data_factory: the Extractors tab shows node `hn_top_stories` (created by save recipe). The "All Runs" tab shows the new runs.
- (optional) Add a transformer step that filters `points > 100` -> sink Matrix bot.
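The optional transformer in the last step could look roughly like this. It is a sketch under assumptions: rows are taken to be dicts keyed by the schema fields, `filter_by_points` is a hypothetical name rather than a registry function, and rows without a numeric `points` value are assumed to be skipped rather than treated as errors.

```python
from typing import Iterable

Row = dict[str, object]


def filter_by_points(rows: Iterable[Row], min_points: int = 100) -> list[Row]:
    """Keep rows whose 'points' value is strictly greater than min_points."""
    kept: list[Row] = []
    for row in rows:
        points = row.get("points")
        # Skip rows with a missing or non-integer score instead of failing.
        if isinstance(points, int) and not isinstance(points, bool) and points > min_points:
            kept.append(row)
    return kept
```

The comparison is strict, matching `points > 100`, so a story with exactly 100 points is not forwarded to the sink.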
Acceptance
- Recipe created and validated (`validate_recipe_yaml_py_core` OK).
- DAG runs OK 2 consecutive times via the scheduler.
- `data_factory.runs` has >=2 entries with `node_id='hn_top_stories'`.
- `cdp_extract_recipe_py_pipelines` appears in `call_monitor.calls`.
- Extracted schema covers 6/6 fields (rank, title, url, points, comments, age).
- (optional) Matrix bot receives >=1 message with a filtered top story.
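The 6/6 field criterion can be checked mechanically. The sketch below assumes extracted rows are dicts and that a field counts as covered when at least one row carries a non-empty value for it; both the interpretation and the name `missing_fields` are assumptions for illustration.

```python
EXPECTED_FIELDS = ("rank", "title", "url", "points", "comments", "age")


def missing_fields(rows: list[dict], fields: tuple[str, ...] = EXPECTED_FIELDS) -> list[str]:
    """Return the expected fields that no row fills with a non-empty value."""
    covered = set()
    for row in rows:
        for field in fields:
            value = row.get(field)
            # None and "" count as not extracted; 0 is a valid value.
            if value is not None and value != "":
                covered.add(field)
    return [field for field in fields if field not in covered]
```

An empty result means 6/6 coverage; anything else names the selectors to refine.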
Expected telemetry
- `function_stats.cdp_extract_recipe_py_pipelines`: calls_24h += 2.
- `data_factory.runs`: 2 new rows with `trigger='cron'`.
- `dag_engine.dag_step_results`: step `extract` with `function_id='cdp_extract_recipe_py_pipelines'`.
- `call_monitor.calls`: chain function call.
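The row-count expectations above can be expressed as a query. This sketch assumes the runs table is reachable as a SQLite table named `runs` with `node_id` and `trigger` columns; the real storage and schema behind `data_factory.runs` are not specified here, so treat the table layout and the name `count_runs` as hypothetical.

```python
import sqlite3


def count_runs(conn: sqlite3.Connection, node_id: str, trigger: str = "cron") -> int:
    """Count rows in the (assumed) runs table for one node and trigger type."""
    row = conn.execute(
        # "trigger" is quoted because TRIGGER is an SQL keyword.
        'SELECT COUNT(*) FROM runs WHERE node_id = ? AND "trigger" = ?',
        (node_id, trigger),
    ).fetchone()
    return row[0]
```

Comparing the count before and after two scheduled runs should show a difference of 2 for `hn_top_stories`.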
Definition of Done
See README.md, sections DoD + user-facing.
Generic
- Repeatability: runs 3 consecutive times via cron without intervention.
- Observability: `call_monitor.calls` records `cdp_extract_recipe_py_pipelines` + `data_factory.runs` shows `node_id=hn_top_stories`.
- Error path: if Chrome :9222 goes down, the step fails with a clear message (no silent DAG crash).
- Idempotency: dedup `dedup_duckdb_table_by_hash_py_pipelines` runs after extract; same HTML 2x = 0 new rows.
- Secrets: N/A (HN is public).
- Docs: `## Notes` with commands to reproduce + onboarding.
- Registry-first: extract with no inline code in the DAG.
- INDEX + status: `status: done` + `INDEX.md` + moved to `completed/`.
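The idempotency criterion (same HTML 2x = 0 new rows) amounts to hash-based deduplication. The sketch below illustrates the idea with an in-memory dict standing in for the table; it is not the implementation of `dedup_duckdb_table_by_hash_py_pipelines`, and `row_hash` and `insert_new_rows` are hypothetical names.

```python
import hashlib
import json


def row_hash(row: dict) -> str:
    """Stable SHA-256 over the row serialized with sorted keys."""
    payload = json.dumps(row, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def insert_new_rows(table: dict[str, dict], rows: list[dict]) -> int:
    """Insert rows whose hash is not in the table yet; return how many were new."""
    added = 0
    for row in rows:
        key = row_hash(row)
        if key not in table:
            table[key] = row
            added += 1
    return added
```

Hashing the whole row means a story whose points changed between runs counts as a new row; hashing only a stable subset such as `url` would be the alternative if one row per story is wanted.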
User-facing
- User-facing: the user opens `data_factory.exe` → "All Runs" tab, filters `node_id=hn_top_stories` → sees >=30 rows with rank/title/url/points.
- User-facing repeat: comes back tomorrow to the same tab, sees fresh runs (every 30 min) and an updated table.
- User-facing onboarding: paragraph in `## Notes`: "To view HN top: launch `data_factory.exe` → Extractors tab → `hn_top_stories`. DuckDB at `apps/data_factory/data/hn_top_stories.duckdb`, table `hn_stories`."
- User-facing latency: cron `*/30 * * * *` → fresh data in <31 min p95.
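The <31 min figure follows from the schedule plus the run time. As a worked check, assuming staleness is measured from the start of one extract to the completion of the next, and taking the 30 s `expected_runtime_s` from the metadata as the run time (the function names are hypothetical):

```python
def worst_case_staleness_s(interval_s: int, runtime_s: int) -> int:
    """Oldest the data can be just before the next run finishes writing."""
    return interval_s + runtime_s


def within_budget(interval_s: int, runtime_s: int, budget_s: int) -> bool:
    """True if the worst-case staleness stays strictly under the budget."""
    return worst_case_staleness_s(interval_s, runtime_s) < budget_s


# 30 min schedule + 30 s run = 1830 s (30.5 min), under the 31 min budget.
assert worst_case_staleness_s(30 * 60, 30) == 1830
assert within_budget(30 * 60, 30, budget_s=31 * 60)
```

This leaves 30 s of slack for scheduler jitter; a run taking 60 s or more would break the target.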
Custom
- 7/7 fields covered in ALL runs of the last 24h (rank/title/url/points/author/age/comments).
- Extract latency <30s p95 (cdp_extract_recipe + render).
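To evaluate the p95 latency criterion from recorded run durations, a nearest-rank percentile is enough. This is a sketch: the nearest-rank definition is one of several percentile conventions and is an assumption here, as is the name `percentile_nearest_rank`.

```python
import math


def percentile_nearest_rank(samples: list[float], pct: int) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of samples <= it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]
```

With the 48 runs that a `*/30` schedule produces in 24h, the rank is ceil(45.6) = 46, so up to two slow runs are tolerated before the <30s criterion fails.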
Notes
(to be filled in after running)