| field | value |
|---|---|
| name | cdp_extract_recipe |
| kind | pipeline |
| lang | py |
| domain | pipelines |
| version | 1.2.0 |
| purity | impure |
| signature | def cdp_extract_recipe(recipe_path: str, debug_port: int = 9222, tab_id: str \| None = None, record_run: bool = True) -> dict |
| description | Runs a YAML recipe against a remote Chrome via CDP. Validates the recipe, finds the tab by url_pattern, executes the steps (wait_selector/js), and sends the result to the declared sink. |
| tags | |
| uses_functions | |
| uses_types | |
| returns | |
| returns_optional | false |
| error_type | error_go_core |
| imports | |
| params | |
| output | dict {status: ok\|error, rows_out: int, kb_out: float, duration_ms: int, error: str, sample_rows: list} |
| tested | false |
| tests | |
| test_file_path | |
| file_path | python/functions/pipelines/cdp_extract_recipe.py |
Example
import sys

# The path is relative, so run this from the repository root.
sys.path.insert(0, "python/functions")
from pipelines.cdp_extract_recipe import cdp_extract_recipe

result = cdp_extract_recipe(
    recipe_path="recipes/product_list.yaml",
    debug_port=9222,
)
print(result["status"], result["rows_out"], "rows")
# Example output: ok 42 rows
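The returned dict carries a status field, so a caller will usually want to branch on it before using the counters. The following is a minimal sketch of that check, assuming only the output keys listed in the metadata above. The helper name `summarize_result` and the sample values are hypothetical and are not part of the library.

```python
def summarize_result(result: dict) -> str:
    """Return a one-line summary, or raise if the run reported an error."""
    if result.get("status") != "ok":
        # `error` is one of the documented output keys.
        raise RuntimeError(f"extraction failed: {result.get('error', 'unknown error')}")
    return (
        f"{result['rows_out']} rows, "
        f"{result['kb_out']:.1f} KB in {result['duration_ms']} ms"
    )


# Hand-written sample in the documented output shape (not real output).
sample = {
    "status": "ok",
    "rows_out": 42,
    "kb_out": 3.5,
    "duration_ms": 812,
    "error": "",
    "sample_rows": [],
}
print(summarize_result(sample))
```

Raising on a non-ok status keeps a failed extraction from being mistaken for an empty one.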
Example recipe (recipes/product_list.yaml):
name: product_list
url_pattern: "https://shop\\.example\\.com/products.*"
steps:
  - wait_selector: ".product-card"
  - js: "Array.from(document.querySelectorAll('.product-card')).map(e => ({name: e.querySelector('h2').innerText, price: e.querySelector('.price').innerText}))"
output:
  sink: stdout
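The function is described as validating the recipe before running it, but the actual rules are not shown here. As a rough sketch of what such a check might look like, the snippet below validates an already-parsed recipe dict. The name `validate_recipe`, the specific rules, and the use of `re.search` for tab matching are all assumptions made for illustration.

```python
import re

ALLOWED_STEP_KEYS = {"wait_selector", "js"}  # the two step kinds named in this document


def validate_recipe(recipe: dict) -> list[str]:
    """Return a list of problems; an empty list means the recipe looks valid."""
    problems = []
    for key in ("name", "url_pattern", "steps", "output"):
        if key not in recipe:
            problems.append(f"missing key: {key}")
    try:
        re.compile(recipe.get("url_pattern", ""))
    except (re.error, TypeError) as exc:
        problems.append(f"url_pattern is not a valid regex: {exc}")
    steps = recipe.get("steps") or []
    if not isinstance(steps, list):
        problems.append("steps must be a list")
        steps = []
    for i, step in enumerate(steps):
        ok = isinstance(step, dict) and len(step) == 1 and set(step) <= ALLOWED_STEP_KEYS
        if not ok:
            problems.append(f"step {i}: expected exactly one of {sorted(ALLOWED_STEP_KEYS)}")
    # The last step is the one whose return value becomes the extracted data.
    if steps and not (isinstance(steps[-1], dict) and "js" in steps[-1]):
        problems.append("last step should be a js step that returns the data")
    output = recipe.get("output")
    if not isinstance(output, dict) or "sink" not in output:
        problems.append("output.sink is required")
    return problems


# The example recipe above, written as the dict a YAML parser would produce.
recipe = {
    "name": "product_list",
    "url_pattern": r"https://shop\.example\.com/products.*",
    "steps": [
        {"wait_selector": ".product-card"},
        {"js": "Array.from(document.querySelectorAll('.product-card')).length"},
    ],
    "output": {"sink": "stdout"},
}
print(validate_recipe(recipe))
print(bool(re.search(recipe["url_pattern"], "https://shop.example.com/products?page=2")))
```

Collecting every problem in a list, instead of raising on the first one, lets a recipe author fix a file in one pass.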
When to use it
Use it when you have a validated YAML recipe and Chrome running with remote debugging, and you want to extract data in a single step without assembling a pipeline by hand. Chain it with cdp_open_url_and_wait if you need to open the URL first.
Capability growth log
- v1.2.0 (2026-05-16): sink duckdb writes rows to a DuckDB file and registers the run in data_factory.runs with storage_db_id/storage_table for traceability.
Gotchas
- Chrome must be running with --remote-debugging-port=<debug_port>.
- wait_selector uses synchronous polling over the WebSocket (200 ms interval, 10 s timeout), so it is not suitable for pages with very long lazy loading.
- The last js step must return the final data (array or value). Intermediate steps can prepare the DOM.
- data_factory_record_run fails silently if no DB is configured; by then the data has already been extracted and returned.
- websocket-client must be installed in the venv.
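The wait_selector timing noted above (200 ms interval, 10 s timeout) amounts to a plain synchronous polling loop. A minimal sketch of that pattern follows, with the defaults taken from the gotcha. The function `poll_until` and the stand-in probe are hypothetical. In the real pipeline the probe would presumably ask Chrome over the WebSocket whether the selector matches, which is not reproduced here.

```python
import time
from typing import Callable


def poll_until(
    probe: Callable[[], bool],
    interval_s: float = 0.2,   # 200 ms between attempts
    timeout_s: float = 10.0,   # give up after 10 s
) -> bool:
    """Call probe() until it returns True or the timeout elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


# Stand-in probe that succeeds on its third call, in place of a real
# "does the selector exist yet?" check against the page.
calls = {"n": 0}


def fake_probe() -> bool:
    calls["n"] += 1
    return calls["n"] >= 3


print(poll_until(fake_probe, interval_s=0.01, timeout_s=1.0))  # True
```

Because the loop blocks the caller for up to the full timeout, a page whose content loads later than that comes back as a failure, not a longer wait.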