---
id: "0098"
title: "Navegator extractions enhancement (Pick + Network rows + AutoExtract + data_factory bridge)"
status: pendiente
type: feature
domain:
  - data-ingest
scope: multi-app
priority: alta
depends: []
blocks: []
related: []
created: 2026-05-17
updated: 2026-05-17
tags: []
---

# 0098 - Navegator extractions enhancement (Pick + Network rows + AutoExtract + data_factory bridge)

**Status:** pending
**Created:** 2026-05-16
**Type:** feature
**Priority:** high
**Depends:** 0097 (data_factory v1, DONE)

## Problem

`navegator_dashboard` already has Browsers, Tabs, Tab Detail, Network, and Agent panels, but data extraction is still manual: open DevTools, copy selectors, write JS, parse responses. There is no "URL -> schema -> recipe -> run" flow.

`data_factory` (issue 0097) has empty nodes. It needs a quick way to create them from a web page.

## Goal

Add 4 converging features:

1. **Element picker** (Pick panel / Tab Detail): click on elements -> robust CSS selector.
2. **Network -> rows**: parse JSON responses from XHR/fetch -> table -> CSV / Save recipe.
3. **AutoExtract (AI)**: URL -> open tab -> capture the accessibility tree (not the full HTML) -> `claude -p` proposes schema + selectors -> preview -> save recipe.
4. **data_factory bridge**: every executed recipe -> records a run in `data_factory.runs` + creates a node with `kind=extractor`.

## Technical decisions

| Decision | Choice |
|---|---|
| LLM API | `claude -p "<prompt>"` CLI subprocess (NO API key) |
| Model | Sonnet (default for `claude -p`) |
| Page representation for the LLM | **CDP accessibility tree** (`Accessibility.getFullAXTree`) + pagination. NOT the full HTML |
| Recipe format | YAML in `~/.dagu/dags/recipes/<slug>.yaml` or `projects/navegator/profiles/<profile>/recipes/<slug>.yaml` |
| Persistence | filesystem YAML + entry in `data_factory.nodes` on save |
| Pagination | accessibility tree split into chunks (~25 KB of characters each) with a simulated `nextPageToken` |

## Why the accessibility tree

- Semantic: roles, labels, values. No styling, no scripts, no SVG.
- Typically 5-20x smaller than the HTML for the same information.
- Chrome CDP: `Accessibility.getFullAXTree { depth: -1 }` returns an array of `AXNode`. Each node carries `role`, `name`, and `value` (each an `AXValue` whose string lives in `.value`, e.g. `role.value`) plus `childIds`.
- Fields can be identified via `role` (button/textbox/link/heading/table/cell) + `name`.
- The LLM is expected to reason better over the AX tree because it matches its training data (accessibility apps).

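
To make the node shape concrete, here is a minimal sketch that flattens an `AXNode` array into an indented `role "name"` outline. The sample nodes are hand-written for illustration, and `ax_outline` is a hypothetical helper, not one of the functions planned in this issue.

```python
from typing import Any


def ax_outline(nodes: list[dict[str, Any]]) -> list[str]:
    """Render a flat AXNode array as indented 'role "name"' lines."""
    by_id = {n["nodeId"]: n for n in nodes}
    # Roots are the nodes that no other node lists in its childIds.
    referenced = {cid for n in nodes for cid in n.get("childIds", [])}
    roots = [n for n in nodes if n["nodeId"] not in referenced]
    lines: list[str] = []

    def walk(node: dict[str, Any], depth: int) -> None:
        # role and name are AXValue objects; the string lives in ".value".
        role = node.get("role", {}).get("value", "")
        name = node.get("name", {}).get("value", "")
        lines.append("  " * depth + (f'{role} "{name}"' if name else role))
        for cid in node.get("childIds", []):
            if cid in by_id:
                walk(by_id[cid], depth + 1)

    for root in roots:
        walk(root, 0)
    return lines


# Hand-written sample, shaped like a (much shorter) getFullAXTree result.
sample = [
    {"nodeId": "1", "role": {"value": "RootWebArea"},
     "name": {"value": "Demo"}, "childIds": ["2", "3"]},
    {"nodeId": "2", "role": {"value": "heading"},
     "name": {"value": "Movements"}, "childIds": []},
    {"nodeId": "3", "role": {"value": "button"},
     "name": {"value": "Export"}},
]
print("\n".join(ax_outline(sample)))
```

An outline like this is roughly what the LLM would see in place of the page HTML.
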
## Components

### Phase A: Element picker

A "Pick" button inside the Tab Detail panel.

1. Injects JS via `Runtime.evaluate` that:
   - On hover -> highlights the element with a red outline via an overlay.
   - On click -> captures: CSS selector (ascending `nth-of-type` algorithm, truncated), XPath, `textContent`, `tagName`, `attributes`.
   - Calls `console.log({ picked: {...} })`.
2. The C++ side listens for `Runtime.consoleAPICalled` over the WebSocket and parses the payload.
3. Renders a card in the panel with the data. Buttons: "Copy selector", "Save as data_factory node".

New functions:

- `cdp_pick_element_js_browser` (JS snippet as a string, registered as a registry function for reuse).

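
Step 2 can be sketched as follows, in Python for brevity even though the issue assigns this to the C++ side. It assumes the injected snippet logs `JSON.stringify({picked: ...})` rather than the raw object: CDP normally delivers logged objects by reference (an `objectId` with an optional preview), so a string argument is the simplest way to get the whole payload in `args[0].value`. `parse_picked_event` is a hypothetical name.

```python
import json
from typing import Any, Optional


def parse_picked_event(message: dict[str, Any]) -> Optional[dict[str, Any]]:
    """Return the 'picked' payload from a Runtime.consoleAPICalled event, else None."""
    if message.get("method") != "Runtime.consoleAPICalled":
        return None
    params = message.get("params", {})
    if params.get("type") != "log":
        return None
    for arg in params.get("args", []):
        # Only string arguments can carry the JSON payload (assumption above).
        if arg.get("type") != "string":
            continue
        try:
            payload = json.loads(arg.get("value", ""))
        except json.JSONDecodeError:
            continue  # unrelated console.log output from the page
        if isinstance(payload, dict) and isinstance(payload.get("picked"), dict):
            return payload["picked"]
    return None


# Hand-written event for illustration.
event = {
    "method": "Runtime.consoleAPICalled",
    "params": {
        "type": "log",
        "args": [{
            "type": "string",
            "value": json.dumps({"picked": {"selector": "table > tbody > tr:nth-of-type(2)",
                                            "tagName": "TR"}}),
        }],
    },
}
print(parse_picked_event(event))
```

Filtering on the `picked` key keeps ordinary page logging from being mistaken for a pick.
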
### Phase B: Network -> rows

In the existing Network panel:

- Filters responses with `content_type: application/json`.
- Clicking "Parse" on a row -> attempts `JSON.parse` -> if the result is an array -> renders a table.
- If it is an object containing an array -> autodetects the path (looks for the first array with > 0 elements).
- "Save as recipe" button: generates YAML with `url_pattern`, `intercept_response`, and the inferred schema.

New functions:

- `infer_json_rows_schema_py_core` (pure, py): takes JSON, returns `{root_path, fields: [{name, type, sample}]}`.

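
A minimal sketch of the inference described above, assuming a depth-first search in key order for the first non-empty array and a `$.a.b` notation for `root_path` (both are assumptions; the issue does not fix either). The function name drops the registry suffix on purpose.

```python
from typing import Any, Optional


def find_rows(data: Any, path: str = "$") -> Optional[tuple[str, list]]:
    """Depth-first search for the first non-empty list; returns (path, rows)."""
    if isinstance(data, list):
        return (path, data) if data else None
    if isinstance(data, dict):
        for key, value in data.items():
            found = find_rows(value, f"{path}.{key}")
            if found:
                return found
    return None


def type_name(value: Any) -> str:
    # bool must be checked before int: bool is a subclass of int in Python.
    if isinstance(value, bool):
        return "bool"
    if isinstance(value, int):
        return "int"
    if isinstance(value, float):
        return "float"
    if isinstance(value, str):
        return "string"
    return "json"  # nested objects/arrays are kept as opaque JSON


def infer_json_rows_schema(data: Any) -> dict[str, Any]:
    found = find_rows(data)
    if found is None:
        return {"root_path": None, "fields": []}
    root_path, rows = found
    fields: dict[str, dict[str, Any]] = {}
    for row in rows:
        if not isinstance(row, dict):
            continue
        for name, value in row.items():
            # The first non-null value decides the type; always-null fields are skipped.
            if name not in fields and value is not None:
                fields[name] = {"name": name, "type": type_name(value), "sample": value}
    return {"root_path": root_path, "fields": list(fields.values())}


payload = {"meta": {"page": 1},
           "data": {"items": [{"id": 1, "title": "a", "price": 9.5},
                              {"id": 2, "title": "b", "price": 3.0}]}}
print(infer_json_rows_schema(payload))
```
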
### Phase C: Recipe YAML + runner

Recipe format:

```yaml
name: bbva_balance
description: Main account balance
url_pattern: "bbva.es/.*/movimientos"
trigger: manual
schedule: ""  # optional cron
steps:
  - wait_selector: "table.movimientos tbody tr"
  - js: |
      return [...document.querySelectorAll('table.movimientos tbody tr')].map(r => ({
        date: r.cells[0].innerText,
        concept: r.cells[1].innerText,
        // es-ES amounts ("1.234,56"): drop thousands dots, then use '.' as the decimal separator
        amount: parseFloat(r.cells[2].innerText.replace(/\./g, '').replace(',', '.')),
      }));
output:
  schema:
    - {field: date, type: string}
    - {field: concept, type: string}
    - {field: amount, type: float}
  format: json
  sink: data_factory.runs
```

New functions:

- `cdp_extract_recipe_py_pipelines` (impure pipeline). Args: `recipe_path, [tab_id]`. Output: dict with rows + status.
  - If no open tab matches `url_pattern` -> clear error "no tab matching, open URL manually".
  - Executes each step. `wait_selector` -> polling via CDP `Runtime.evaluate`. `js` -> `Runtime.evaluate` returning the value.
  - If `output.sink` is `data_factory.runs` -> calls `data_factory_record_run_py_pipelines`.

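
The runner logic above could look roughly like this. `evaluate` is a hypothetical callable that sends the expression through CDP `Runtime.evaluate` (by value) and returns the result, so the sketch stays independent of any WebSocket client. Wrapping the `js` step in an arrow function is also an assumption, made so that the recipe's top-level `return` is valid.

```python
import json
import re
import time
from typing import Any, Callable


def find_matching_tab(tabs: list[dict[str, Any]], url_pattern: str) -> dict[str, Any]:
    """Return the first tab whose URL matches url_pattern (regex search)."""
    for tab in tabs:
        if re.search(url_pattern, tab.get("url", "")):
            return tab
    raise LookupError("no tab matching, open URL manually")


def run_steps(steps: list[dict[str, str]], evaluate: Callable[[str], Any],
              timeout_s: float = 10.0, poll_s: float = 0.25) -> Any:
    """Execute recipe steps in order; returns the value of the last js step."""
    result = None
    for step in steps:
        if "wait_selector" in step:
            # json.dumps yields a safely quoted JS string literal.
            expr = f"document.querySelector({json.dumps(step['wait_selector'])}) !== null"
            deadline = time.monotonic() + timeout_s
            while not evaluate(expr):
                if time.monotonic() >= deadline:
                    raise TimeoutError(f"selector not found: {step['wait_selector']}")
                time.sleep(poll_s)
        elif "js" in step:
            result = evaluate(f"(() => {{ {step['js']} }})()")
        else:
            raise ValueError(f"unknown step: {step}")
    return result
```

Injecting `evaluate` also makes the runner testable with a fake, without a running Chrome.
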
### Phase D: LLM proposer (`claude -p`)

New AutoExtract panel:

UI:

- URL input.
- "Open & Analyze" button -> opens a new Chrome tab (CDP `Target.createTarget`).
- Waits for `Page.loadEventFired`.
- Captures the accessibility tree via `Accessibility.getFullAXTree`.
- Trims the tree: discards `role=generic` nodes with no useful name or children.
- If the tree is > 25 KB: paginates (splits by top-level subtrees).
- Calls `claude -p` with prompt + chunk[i].
- For each chunk, receives `{fields: [...], notes}`. Merges the results.
- Renders the proposed schema in an editable table (field name, selector, and type can be edited).
- "Test extraction" button -> runs JS built from the schema + selectors -> previews rows.
- "Save as recipe" button -> writes the YAML + creates a node in `data_factory`.

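
The trim and pagination steps above can be sketched on a nested `{role, name, children}` structure (an assumed compact form, built from the flat `AXNode` array beforehand). `trim` and `chunk` are illustrative stand-ins for `trim_ax_tree_py_core` and `chunk_ax_tree_py_core`; the 5-chunk cap comes from the risks table.

```python
import json
from typing import Any, Optional

Node = dict[str, Any]


def trim(node: Node) -> Optional[Node]:
    """Drop nameless generic nodes; one with a single child is replaced by that child."""
    kept = [t for t in (trim(c) for c in node.get("children", [])) if t is not None]
    if node.get("role") == "generic" and not node.get("name"):
        if not kept:
            return None
        if len(kept) == 1:
            return kept[0]
    out: Node = {"role": node.get("role", ""), "name": node.get("name", "")}
    if kept:
        out["children"] = kept
    return out


def chunk(root: Node, max_chars: int = 25_000, max_chunks: int = 5) -> list[str]:
    """Pack top-level subtrees into JSON chunks; each repeats the root as context."""
    header = {"role": root.get("role", ""), "name": root.get("name", "")}

    def render(children: list[Node]) -> str:
        return json.dumps({**header, "children": children}, ensure_ascii=False)

    chunks: list[str] = []
    current: list[Node] = []
    for child in root.get("children", []):
        # Flush the current chunk when adding this subtree would exceed the budget.
        if current and len(render(current + [child])) > max_chars:
            chunks.append(render(current))
            current = []
        current.append(child)
    if current:
        chunks.append(render(current))
    return chunks[:max_chunks]  # anything beyond the cap is truncated


tree = {"role": "RootWebArea", "name": "Demo", "children": [
    {"role": "generic", "name": "", "children": [{"role": "button", "name": "Export"}]},
    {"role": "generic", "name": ""},
    {"role": "heading", "name": "Movements"},
]}
trimmed = trim(tree)
print(chunk(trimmed, max_chars=100))
```

Note that a single top-level subtree larger than `max_chars` still produces one oversized chunk; a fuller version would recurse into it.
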
New functions:

- `claude_cli_prompt_py_infra` (impure, py): wrapper around `subprocess.run(["claude", "-p", prompt])`. Captures stdout. Configurable timeout. Errors if `claude` is not found in PATH.
- `cdp_get_ax_tree_py_pipelines` (impure): connects to Chrome debugging, calls `Accessibility.getFullAXTree`, returns trimmed JSON.
- `trim_ax_tree_py_core` (pure): discards generic nodes with no information, collapses single-child chains, returns a compact structure.
- `chunk_ax_tree_py_core` (pure): splits the tree into chunks of at most N characters, preserving the root context.
- `llm_propose_scraping_schema_py_infra` (impure): orchestrates trim + chunk + N `claude -p` calls + merge. Outputs the final schema.
- `cdp_open_url_and_wait_py_pipelines` (impure): opens the URL via CDP, waits for the load event, returns `tab_id`.

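
A minimal sketch of the CLI wrapper, under the behaviour listed above (stdout capture, configurable timeout, error when the binary is missing). The `ClaudeCliError` class, the `binary` parameter, and the 60 s default are assumptions added for illustration.

```python
import shutil
import subprocess


class ClaudeCliError(RuntimeError):
    """Raised when the CLI is missing, times out, or exits non-zero."""


def claude_cli_prompt(prompt: str, timeout_s: float = 60.0, binary: str = "claude") -> str:
    """Run `<binary> -p <prompt>` and return its stdout."""
    exe = shutil.which(binary)
    if exe is None:
        raise ClaudeCliError(f"'{binary}' not found in PATH; install the Claude Code CLI")
    try:
        proc = subprocess.run(
            [exe, "-p", prompt],  # argument list, no shell: the prompt needs no quoting
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired as exc:
        raise ClaudeCliError(f"'{binary} -p' timed out after {timeout_s}s") from exc
    if proc.returncode != 0:
        raise ClaudeCliError(f"'{binary} -p' exited with {proc.returncode}: {proc.stderr.strip()}")
    return proc.stdout
```

The same `shutil.which` check could back the startup test that disables the AutoExtract button when the CLI is absent.
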
### Phase E: data_factory bridge

When a recipe runs successfully:

1. `cdp_extract_recipe_py_pipelines` makes sure the node exists in `data_factory.nodes` (upsert by `name`).
2. It calls `data_factory_record_run_py_pipelines(node_id, "cdp_extract_recipe_py_pipelines", args=[recipe_path], trigger="manual")`.
3. The navegator_dashboard UI shows an "Open in data_factory" link next to the recipe (future cross-app tab navigation).

### Phase F: Recipes panel

New "Recipes" panel:

- Table: name | url_pattern | schedule | last_run_status | last_run_at | rows_last_run
- Per-row actions: Run, Edit (opens the YAML in an editable `selectable_text`), Delete, Open in data_factory.
- Filter by tag/url_pattern.

### Phase G: e2e_checks + deploy

`app.md` e2e_checks:

- `build_windows` (already exists).
- `exe_present` (already exists).
- `api_health` (already exists).
- `claude_cli_available`: `command -v claude` exits 0.

Deploy: `redeploy_cpp_app_windows navegator_dashboard projects/navegator/apps/navegator_dashboard --build`.

## Risks

| Risk | Mitigation |
|---|---|
| `claude -p` not installed in PATH | Check at app startup + tooltip "install claude code CLI". AutoExtract button disabled. |
| Huge AX tree (admin-dashboard-style pages) | Trim + chunk + max 5 chunks per URL. Clear notes when we truncate. |
| LLM proposes fragile selectors | User edits before saving. Recipe is versioned YAML. |
| Recipe runs against the wrong tab | `url_pattern` + strict match. Confirm before run. |
| Cookies/session do not persist | v1 assumes the user keeps Chrome open with the session active. v2: cookies/auth manager. |
| `claude -p` is slow (5-15 s) | UI spinner + cancel button. Does not block other panels. |
| Inconsistent recipe YAML format | Schema-check validator before save. Function `validate_recipe_yaml_py_core`. |

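
The schema check in the last row could start as simply as the sketch below. It validates an already-parsed recipe (a dict), so YAML loading stays outside; the required keys and step kinds come from the Phase C example, while the allowed field types are an assumption.

```python
import re
from typing import Any

REQUIRED_KEYS = ("name", "url_pattern", "steps", "output")
STEP_KINDS = {"wait_selector", "js"}
FIELD_TYPES = {"string", "int", "float", "bool"}  # assumed set of column types


def validate_recipe(recipe: Any) -> list[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    if not isinstance(recipe, dict):
        return ["recipe must be a mapping"]
    errors = [f"missing key: {k}" for k in REQUIRED_KEYS if k not in recipe]

    pattern = recipe.get("url_pattern")
    if isinstance(pattern, str):
        try:
            re.compile(pattern)
        except re.error as exc:
            errors.append(f"url_pattern is not a valid regex: {exc}")

    steps = recipe.get("steps")
    if isinstance(steps, list) and steps:
        for i, step in enumerate(steps):
            # Each step is a one-key mapping such as {"js": "..."}.
            if not (isinstance(step, dict) and len(step) == 1 and set(step) <= STEP_KINDS):
                errors.append(f"steps[{i}]: expected exactly one of {sorted(STEP_KINDS)}")
    elif "steps" in recipe:
        errors.append("steps must be a non-empty list")

    output = recipe.get("output")
    schema = output.get("schema") if isinstance(output, dict) else None
    for i, field in enumerate(schema if isinstance(schema, list) else []):
        if not isinstance(field, dict) or not field.get("field"):
            errors.append(f"output.schema[{i}]: missing field name")
        elif field.get("type") not in FIELD_TYPES:
            errors.append(f"output.schema[{i}]: unknown type {field.get('type')!r}")
    return errors
```

Returning every problem at once, rather than raising on the first, suits the editable YAML view in the Recipes panel.
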
## Non-goals for v1

- Cookies/auth manager (Phase F, future).
- Headless scraping (assumes a visible Chrome).
- Multi-page navigation inside a recipe (1 URL = 1 recipe).
- Scheduled recipes via cron (delegated to dag_engine: DAG step `function: cdp_extract_recipe_py_pipelines args: [recipe_path]`).

## New functions (summary, 10)

| ID | Lang | Purity |
|---|---|---|
| `claude_cli_prompt_py_infra` | py | impure |
| `cdp_pick_element_js_browser` | js (str) | n/a |
| `cdp_get_ax_tree_py_pipelines` | py | impure |
| `trim_ax_tree_py_core` | py | pure |
| `chunk_ax_tree_py_core` | py | pure |
| `llm_propose_scraping_schema_py_infra` | py | impure |
| `cdp_open_url_and_wait_py_pipelines` | py | impure |
| `cdp_extract_recipe_py_pipelines` | py | impure |
| `infer_json_rows_schema_py_core` | py | pure |
| `validate_recipe_yaml_py_core` | py | pure |

Capability group tag: `navegator` (new; >=3 functions fit). Mother page at `docs/capabilities/navegator.md`.

## Acceptance

- 4 panel changes visible: 3 new panels (Pick in Tab Detail, AutoExtract, Recipes) plus the Network panel extended with a "Parse JSON" button.
- `claude -p` can be invoked from the app when the CLI is available.
- E2E test: open URL `https://news.ycombinator.com` -> autoextract -> get schema `{title, url, points, comments}` -> save recipe -> run -> >=20 rows -> the run appears in data_factory.
- `redeploy_cpp_app_windows navegator_dashboard` passes + exe running.
- `fn doctor cpp-apps` OK for navegator_dashboard.
- 1 canonical recipe saved in the repo as an example.

## Target telemetry

- Capability group `navegator` with >=8 functions.
- `data_factory.nodes` with >=3 nodes of `kind=extractor` created via recipe.
- `data_factory.runs` with real runs.