---
name: hn-top-stories
id: 0001
status: pending
created: 2026-05-16
updated: 2026-05-16
priority: high
risk: low
related_issues: [0097, 0098]
apps:
  - navegator_dashboard
  - dag_engine
  - data_factory
  - agents_and_robots
trigger: cron
schedule: "*/30 * * * *"
expected_runtime_s: 30
tags: [scraping, news, smoke-test, multi-app]
---

## Goal

Test the stack end to end: navegator AutoExtract -> recipe -> dag_engine schedule -> data_factory.runs -> Matrix bot. The target page needs no authentication and costs nothing to scrape. If this works, the whole plumbing is solid.

## Prerequisites

- Chrome launched with `--remote-debugging-port=9222` (via navegator_dashboard "Open visible browser").
- `claude` CLI on PATH (AutoExtract requires an LLM).
- sqlite_api running on `:8484`.
- dag_engine running on `:8090`.
- (optional) Matrix bot in room `#fn-registry-news` for the final sink.

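A quick preflight check for the three ports listed above could look like the sketch below. It only tests that something accepts TCP connections on each port; it does not verify that the right service is behind it. The `SERVICES` map, the function names, and the assumption that everything runs on `127.0.0.1` are illustrative and not part of the stack.

```python
import socket

# Hypothetical name -> port map. The ports come from the prerequisites list;
# the host (localhost) is an assumption.
SERVICES = {"chrome_cdp": 9222, "sqlite_api": 8484, "dag_engine": 8090}


def port_is_open(port: int, host: str = "127.0.0.1", timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


def check_services(services: dict[str, int], host: str = "127.0.0.1") -> dict[str, bool]:
    """Map each service name to whether its port accepts connections."""
    return {name: port_is_open(port, host) for name, port in services.items()}


if __name__ == "__main__":
    for name, ok in check_services(SERVICES).items():
        print(f"{name}: {'up' if ok else 'DOWN'}")
```

Running it before step 1 of the flow gives a fast signal about which service still needs to be started.
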
## Flow

1. Launch Chrome via navegator (port 9222).
2. AutoExtract panel: enter URL `https://news.ycombinator.com` and click "Open & Analyze".
3. Wait ~10-20 s. Verify the proposed schema: `rank`, `title`, `url`, `points`, `comments`, `age`.
4. Refine the selectors if the AI proposes broken ones. Run Test extraction and check that the preview shows >= 20 rows.
5. Save as recipe `hn_top.yaml` (in `projects/navegator/profiles/default/recipes/`).
6. Create the DAG `~/.dagu/dags/hn-top.yaml` (by hand, or copy one from `apps/dag_engine/dags_migrated/`):
```yaml
name: hn-top-stories
description: Scrape HN top stories every 30 min
schedule: "*/30 * * * *"
steps:
  - name: extract
    function: cdp_extract_recipe_py_pipelines
    args: ["projects/navegator/profiles/default/recipes/hn_top.yaml"]
```
7. Reload dag_engine and enable the scheduler. Trigger Run Now once to test.
8. dag_engine_ui: verify the run has status=success and the correct function_id on the step.
9. data_factory: the Extractors tab shows the node `hn_top_stories` (created when the recipe was saved). The "All Runs" tab shows the new runs.
10. (optional) Add a transformer step that filters `points > 100` -> Matrix bot sink.

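The optional transformer in step 10 could be sketched as follows. The row shape (a dict per story with a `points` key) and the names `parse_points` and `filter_top` are assumptions for illustration; the real transformer would be a registry function, and the recipe may already emit `points` as an integer.

```python
from typing import Any

Row = dict[str, Any]


def parse_points(value: Any) -> int:
    """Parse 153 or "153 points" into an int; return 0 when no leading number exists."""
    if isinstance(value, int):
        return value
    digits = ""
    for ch in str(value).strip():
        if ch.isdigit():
            digits += ch
        else:
            break
    return int(digits) if digits else 0


def filter_top(rows: list[Row], min_points: int = 100) -> list[Row]:
    """Keep only rows whose points are strictly greater than min_points."""
    return [row for row in rows if parse_points(row.get("points")) > min_points]


if __name__ == "__main__":
    sample = [
        {"title": "a", "points": "153 points"},
        {"title": "b", "points": 100},
        {"title": "c", "points": None},
    ]
    print([row["title"] for row in filter_top(sample)])
```

The comparison is strict, matching `points > 100`: a story with exactly 100 points is dropped.
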
|
## Acceptance

- [ ] Recipe created and validated (`validate_recipe_yaml_py_core` OK).
- [ ] DAG runs OK twice in a row via the scheduler.
- [ ] `data_factory.runs` has >=2 entries with `node_id='hn_top_stories'`.
- [ ] `cdp_extract_recipe_py_pipelines` appears in `call_monitor.calls`.
- [ ] Extracted schema covers 6/6 fields (rank, title, url, points, comments, age).
- [ ] (optional) Matrix bot receives >=1 message with a filtered top story.

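The 6/6 field check can be automated once the extracted rows are available as dicts. The sketch below is one possible reading of "covers": a field counts as covered if at least one row carries a non-empty value for it. The helper names and the row shape are assumptions, not part of the registry.

```python
from typing import Any

# Field names taken from the acceptance list above.
EXPECTED_FIELDS = ("rank", "title", "url", "points", "comments", "age")


def field_coverage(rows: list[dict[str, Any]], fields=EXPECTED_FIELDS) -> dict[str, float]:
    """Return, per field, the fraction of rows holding a non-empty value."""
    total = len(rows)
    if total == 0:
        return {field: 0.0 for field in fields}
    return {
        field: sum(1 for row in rows if row.get(field) not in (None, "")) / total
        for field in fields
    }


def covered_fields(rows: list[dict[str, Any]], fields=EXPECTED_FIELDS) -> list[str]:
    """List the fields that have a non-empty value in at least one row."""
    coverage = field_coverage(rows, fields)
    return [field for field in fields if coverage[field] > 0]


if __name__ == "__main__":
    rows = [{"rank": 1, "title": "t", "url": "u", "points": 10, "comments": 0, "age": "1 hour ago"}]
    print(f"{len(covered_fields(rows))}/{len(EXPECTED_FIELDS)} fields covered")
```

A value of `0` (for example zero comments) counts as present; only `None` and the empty string count as missing.
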
## Expected telemetry

- `function_stats.cdp_extract_recipe_py_pipelines`: calls_24h += 2.
- `data_factory.runs`: 2 new rows with `trigger='cron'`.
- `dag_engine.dag_step_results`: step `extract` with `function_id='cdp_extract_recipe_py_pipelines'`.
- `call_monitor.calls`: the chained function call is recorded.

## Definition of Done

See the DoD + user-facing section of `README.md`.

### Generic

- [ ] **Repeatability**: runs 3 times in a row via cron without intervention.
- [ ] **Observability**: `call_monitor.calls` records `cdp_extract_recipe_py_pipelines` and `data_factory.runs` shows `node_id=hn_top_stories`.
- [ ] **Error path**: if Chrome on :9222 goes down, the step fails with a clear message (no silent DAG crash).
- [ ] **Idempotency**: the dedup step `dedup_duckdb_table_by_hash_py_pipelines` runs after extract; the same HTML twice = 0 new rows.
- [ ] **Secrets**: N/A (HN is public).
- [ ] **Docs**: `## Notes` with commands to reproduce + onboarding.
- [ ] **Registry-first**: extract has no inline code in the DAG.
- [ ] **INDEX + status**: `status: done` + `INDEX.md` + moved to `completed/`.

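The idempotency criterion relies on hash-based deduplication. The sketch below illustrates the general idea in plain Python with an in-memory set; it is not the implementation of `dedup_duckdb_table_by_hash_py_pipelines`, whose internals are not described here. The choice of hashing the whole row, and the names `row_hash` and `insert_new`, are assumptions.

```python
import hashlib
import json
from typing import Any

Row = dict[str, Any]


def row_hash(row: Row) -> str:
    """Hash a row's full content; key order does not affect the result."""
    payload = json.dumps(row, sort_keys=True, ensure_ascii=False, default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def insert_new(rows: list[Row], seen: set[str]) -> list[Row]:
    """Record unseen rows in `seen` and return only the rows that were new."""
    added = []
    for row in rows:
        digest = row_hash(row)
        if digest not in seen:
            seen.add(digest)
            added.append(row)
    return added


if __name__ == "__main__":
    seen: set[str] = set()
    batch = [{"rank": 1, "title": "t", "url": "u"}]
    print(len(insert_new(batch, seen)), len(insert_new(batch, seen)))
```

Feeding the same batch twice adds rows only the first time, which is the "same HTML twice = 0 new rows" behaviour the checklist asks for. Because the whole row is hashed, a story whose points or age changed between runs would count as a new row under this sketch.
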
### User-facing

- [ ] **User-facing**: the user opens `data_factory.exe` → "All Runs" tab, filters by `node_id=hn_top_stories` → sees >=30 rows with rank/title/url/points.
- [ ] **User-facing repeat**: the user comes back tomorrow to the same tab and sees fresh runs (every 30 min) and an updated table.
- [ ] **User-facing onboarding**: a paragraph in `## Notes`: "To see HN top: launch `data_factory.exe` → Extractors tab → `hn_top_stories`. DuckDB at `apps/data_factory/data/hn_top_stories.duckdb`, table `hn_stories`."
- [ ] **User-facing latency**: cron `*/30 * * * *` → fresh data in <31 min p95.

### Custom

- [ ] 7/7 fields covered in ALL runs over the last 24h (rank/title/url/points/author/age/comments).
- [ ] Extract latency <30s p95 (cdp_extract_recipe + render).

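To check the p95 latency budget, the run durations need to be reduced to a single percentile. The sketch below uses the nearest-rank definition, which is one of several common percentile conventions; the source does not say which one the stack uses. The function names and the idea of passing durations as a plain list of seconds are assumptions.

```python
import math


def percentile_nearest_rank(values: list[float], pct: int) -> float:
    """Return the pct-th percentile using the nearest-rank method."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    # Rank is 1-based: the smallest index covering pct percent of the samples.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def within_budget(durations_s: list[float], budget_s: float = 30.0, pct: int = 95) -> bool:
    """True if the pct-th percentile of the durations is strictly below the budget."""
    return percentile_nearest_rank(durations_s, pct) < budget_s


if __name__ == "__main__":
    durations = [12.0, 14.5, 18.2, 21.0]
    print(percentile_nearest_rank(durations, 95), within_budget(durations))
```

With few samples the nearest-rank p95 equals the slowest run, so a single slow extract fails the check until enough runs have accumulated.
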
## Notes

(to be filled in after running)