chore: auto-commit (286 files)

- .claude/agents/fn-orquestador/SKILL.md
- .claude/commands/fn_claude.md
- .claude/rules/INDEX.md
- .claude/rules/cpp_apps.md
- .claude/rules/ids_naming.md
- CHANGELOG.md
- apps/dag_engine/README.md
- apps/dag_engine/api.go
- apps/dag_engine/dags_migrated/example.yaml
- apps/dag_engine/dags_migrated/example_lineage_tracking.yaml
- ...

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 16:33:22 +02:00
parent d6175964e4
commit 212875ed0d
290 changed files with 12703 additions and 19778 deletions
@@ -0,0 +1,69 @@
---
name: cdp_extract_recipe
kind: pipeline
lang: py
domain: pipelines
version: "1.0.0"
purity: impure
signature: "def cdp_extract_recipe(recipe_path: str, debug_port: int = 9222, tab_id: str | None = None, record_run: bool = True) -> dict"
description: "Runs a YAML recipe against a remote Chrome instance via CDP. Validates the recipe, finds the tab by url_pattern, executes the steps (wait_selector/js), and sends the result to the declared sink."
tags: [navegator, cdp, recipe, scraping, pipeline]
uses_functions: [validate_recipe_yaml_py_core, data_factory_record_run_py_pipelines]
uses_types: []
returns: []
returns_optional: false
error_type: "error_go_core"
imports: [json, re, sys, os, time, urllib.request, websocket]
params:
- name: recipe_path
  desc: "Path to the recipe .yaml file (absolute or relative to the cwd)."
- name: debug_port
  desc: "Chrome remote debugging port. Default 9222."
- name: tab_id
  desc: "ID of the tab to use. If None, looks for a tab whose URL matches the recipe's url_pattern."
- name: record_run
  desc: "If True and output.sink=='data_factory.runs', records the run in data_factory."
output: "dict {status: ok|error, rows_out: int, kb_out: float, duration_ms: int, error: str, sample_rows: list}"
tested: false
tests: []
test_file_path: ""
file_path: "python/functions/pipelines/cdp_extract_recipe.py"
---
## Example
```python
import sys
sys.path.insert(0, "python/functions")
from pipelines.cdp_extract_recipe import cdp_extract_recipe
result = cdp_extract_recipe(
    recipe_path="recipes/product_list.yaml",
    debug_port=9222,
)
print(result["status"], result["rows_out"], "rows")
# ok 42 rows
```
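The `output` field in the frontmatter documents the return value as a dict with `status`, `rows_out`, `kb_out`, `duration_ms`, `error`, and `sample_rows`. A minimal sketch of how a caller might branch on it, assuming those keys behave as documented; `summarize_result` is a hypothetical helper and not part of the pipeline:

```python
def summarize_result(result: dict) -> str:
    """Return a one-line summary of a cdp_extract_recipe result dict.

    Hypothetical helper: relies only on the keys listed in the `output` field.
    """
    if result.get("status") != "ok":
        # On failure, the `error` key is assumed to carry the message.
        return f"error: {result.get('error', 'unknown')}"
    return (
        f"ok: {result.get('rows_out', 0)} rows, "
        f"{result.get('kb_out', 0.0):.1f} KB in {result.get('duration_ms', 0)} ms"
    )


# Illustrative dicts shaped like the documented output, not data from a real run.
print(summarize_result({"status": "ok", "rows_out": 42, "kb_out": 3.5, "duration_ms": 1200}))
print(summarize_result({"status": "error", "error": "tab not found"}))
```

Using `.get` with defaults keeps the helper tolerant of keys that might be absent on the error path, which the frontmatter does not specify.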
Example recipe (`recipes/product_list.yaml`):
```yaml
name: product_list
url_pattern: "https://shop\\.example\\.com/products.*"
steps:
  - wait_selector: ".product-card"
  - js: "Array.from(document.querySelectorAll('.product-card')).map(e => ({name: e.querySelector('h2').innerText, price: e.querySelector('.price').innerText}))"
output:
  sink: stdout
```
## When to use it
Use it when you have a validated YAML recipe and Chrome running with remote debugging, and you want to extract data in a single step without assembling the pipeline by hand. Chain it with `cdp_open_url_and_wait` if you need to open the URL first.
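When `tab_id` is None, the pipeline looks for a tab whose URL matches the recipe's `url_pattern`. A rough sketch of that lookup, assuming Chrome's DevTools HTTP endpoint at `http://127.0.0.1:<debug_port>/json` and `re.search` semantics; `list_tabs` and `find_tab` are hypothetical names, and the actual implementation may match differently:

```python
import json
import re
import urllib.request
from typing import Optional


def list_tabs(debug_port: int = 9222) -> list:
    """Fetch the target list from Chrome's DevTools HTTP endpoint."""
    url = f"http://127.0.0.1:{debug_port}/json"
    with urllib.request.urlopen(url, timeout=5) as resp:
        return json.loads(resp.read().decode("utf-8"))


def find_tab(tabs: list, url_pattern: str) -> Optional[dict]:
    """Return the first page target whose URL matches url_pattern, else None."""
    pattern = re.compile(url_pattern)
    for tab in tabs:
        # Skip service workers, extensions, and other non-page targets.
        if tab.get("type") == "page" and pattern.search(tab.get("url", "")):
            return tab
    return None
```

With Chrome running, `find_tab(list_tabs(9222), r"https://shop\.example\.com/products.*")` would return the matching target dict, whose `id` could then be passed as `tab_id`.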
## Gotchas
- Chrome must be running with `--remote-debugging-port=<debug_port>`.
- `wait_selector` polls synchronously over the WebSocket (200 ms interval, 10 s timeout), so it is not suitable for pages with very long lazy loading.
- The last `js` step must return the final data (an array or a value). Intermediate steps can prepare the DOM.
- `data_factory_record_run` fails silently if no DB is configured; the data has already been extracted and is still returned.
- `websocket-client` must be installed in the venv.
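The `wait_selector` behavior noted above (synchronous polling, 200 ms interval, 10 s timeout) can be sketched as a generic polling loop. This is an illustration under assumptions, not the pipeline's code: `poll_until` is a hypothetical helper, and in practice `check` would presumably send a CDP `Runtime.evaluate` call over the WebSocket testing `document.querySelector(selector) !== null`.

```python
import time
from typing import Callable


def poll_until(check: Callable[[], bool],
               interval_s: float = 0.2,
               timeout_s: float = 10.0) -> bool:
    """Call check() every interval_s seconds until it returns True.

    Returns False if timeout_s elapses first. The defaults mirror the
    200 ms / 10 s values listed in the gotchas.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


# Stand-in for a selector check: becomes ready on the third attempt.
attempts = {"n": 0}

def selector_present() -> bool:
    attempts["n"] += 1
    return attempts["n"] >= 3

print(poll_until(selector_present, interval_s=0.01, timeout_s=1.0))  # True
```

Because the loop blocks the caller for up to the full timeout, a page that lazy-loads for longer than 10 s would make the step fail even though the element eventually appears.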