chore: auto-commit (286 files)
- .claude/agents/fn-orquestador/SKILL.md - .claude/commands/fn_claude.md - .claude/rules/INDEX.md - .claude/rules/cpp_apps.md - .claude/rules/ids_naming.md - CHANGELOG.md - apps/dag_engine/README.md - apps/dag_engine/api.go - apps/dag_engine/dags_migrated/example.yaml - apps/dag_engine/dags_migrated/example_lineage_tracking.yaml - ... Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -0,0 +1,69 @@
---
name: cdp_extract_recipe
kind: pipeline
lang: py
domain: pipelines
version: "1.0.0"
purity: impure
signature: "def cdp_extract_recipe(recipe_path: str, debug_port: int = 9222, tab_id: str | None = None, record_run: bool = True) -> dict"
description: "Runs a YAML recipe against a remote Chrome instance via CDP. Validates the recipe, finds the tab by url_pattern, runs the steps (wait_selector/js), and sends the result to the declared sink."
tags: [navegator, cdp, recipe, scraping, pipeline]
uses_functions: [validate_recipe_yaml_py_core, data_factory_record_run_py_pipelines]
uses_types: []
returns: []
returns_optional: false
error_type: "error_go_core"
imports: [json, re, sys, os, time, urllib.request, websocket]
params:
  - name: recipe_path
    desc: "Path to the recipe .yaml file (absolute or relative to the cwd)."
  - name: debug_port
    desc: "Chrome remote debugging port. Default 9222."
  - name: tab_id
    desc: "ID of the tab to use. If None, finds the tab whose URL matches the recipe's url_pattern."
  - name: record_run
    desc: "If True and output.sink=='data_factory.runs', records the run in data_factory."
output: "dict {status: ok|error, rows_out: int, kb_out: float, duration_ms: int, error: str, sample_rows: list}"
tested: false
tests: []
test_file_path: ""
file_path: "python/functions/pipelines/cdp_extract_recipe.py"
---

## Example

```python
import sys
sys.path.insert(0, "python/functions")
from pipelines.cdp_extract_recipe import cdp_extract_recipe

result = cdp_extract_recipe(
    recipe_path="recipes/product_list.yaml",
    debug_port=9222,
)
print(result["status"], result["rows_out"], "rows")
# ok 42 rows
```

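The returned dict carries a `status` field plus either counters or an `error` string, so callers will probably want to branch on it. A minimal sketch, assuming only the keys listed in the `output` field of the frontmatter; `summarize_result` and the sample values are hypothetical and used purely for illustration:

```python
def summarize_result(result: dict) -> str:
    """Turn the returned dict into a one-line summary (hypothetical helper)."""
    if result.get("status") == "ok":
        rows = result.get("rows_out", 0)
        kb = result.get("kb_out", 0.0)
        ms = result.get("duration_ms", 0)
        return f"ok: {rows} rows, {kb:.1f} KB in {ms} ms"
    # Any status other than "ok" is treated as a failure here.
    return f"error: {result.get('error') or 'unknown error'}"


# Illustrative dicts shaped like the documented output, not real run data.
ok_result = {"status": "ok", "rows_out": 42, "kb_out": 3.5,
             "duration_ms": 812, "error": "", "sample_rows": []}
err_result = {"status": "error", "rows_out": 0, "kb_out": 0.0,
              "duration_ms": 10, "error": "tab not found", "sample_rows": []}

print(summarize_result(ok_result))
print(summarize_result(err_result))
```

Checking `status` before reading `rows_out` avoids treating a failed run as an empty but successful extraction.
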
Example recipe (`recipes/product_list.yaml`):
```yaml
name: product_list
url_pattern: "https://shop\\.example\\.com/products.*"
steps:
  - wait_selector: ".product-card"
  - js: "Array.from(document.querySelectorAll('.product-card')).map(e => ({name: e.querySelector('h2').innerText, price: e.querySelector('.price').innerText}))"
output:
  sink: stdout
```

## When to use it

Use it when you have a validated YAML recipe and Chrome running with remote debugging, and you want to extract data in a single step without assembling the pipeline by hand. Chain it with `cdp_open_url_and_wait` if you need to open the URL first.

## Gotchas

- Chrome must be running with `--remote-debugging-port=<debug_port>`.
- `wait_selector` polls synchronously over the WebSocket (200 ms interval, 10 s timeout), so it is not suited to pages whose lazy loading takes longer than that.
- The last `js` step must return the final data (an array or a value). Intermediate steps can prepare the DOM.
- `data_factory_record_run` fails silently if no DB is configured; by then the data has already been extracted and is still returned.
- `websocket-client` must be installed in the venv.

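The `wait_selector` behavior described above (fixed interval, hard timeout) can be sketched as a plain polling loop. This is an illustration under stated assumptions, not the shipped code. `poll_until` and `build_selector_probe` are hypothetical names, the probe uses the CDP method `Runtime.evaluate`, and the WebSocket round trip is replaced by a stand-in callable so the example runs without a browser.

```python
import json
import time
from typing import Callable


def build_selector_probe(selector: str, msg_id: int = 1) -> str:
    """Build a CDP Runtime.evaluate message that checks whether selector exists."""
    # json.dumps quotes the selector safely for embedding in JavaScript.
    expression = f"document.querySelector({json.dumps(selector)}) !== null"
    return json.dumps({
        "id": msg_id,
        "method": "Runtime.evaluate",
        "params": {"expression": expression, "returnByValue": True},
    })


def poll_until(check: Callable[[], bool],
               interval_s: float = 0.2,
               timeout_s: float = 10.0) -> bool:
    """Call check() every interval_s seconds until it is True or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


# Stand-in for "send the probe over the WebSocket and read the boolean result":
# here the selector "appears" on the third poll.
attempts = {"count": 0}


def fake_check() -> bool:
    attempts["count"] += 1
    return attempts["count"] >= 3


found = poll_until(fake_check, interval_s=0.01, timeout_s=1.0)
print(found, attempts["count"])
```

Because the loop blocks the caller for up to `timeout_s`, a page that renders its cards later than that makes the step fail even though the data would eventually appear, which is the limitation the second gotcha points at.
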