chore: auto-commit (286 files)
- .claude/agents/fn-orquestador/SKILL.md
- .claude/commands/fn_claude.md
- .claude/rules/INDEX.md
- .claude/rules/cpp_apps.md
- .claude/rules/ids_naming.md
- CHANGELOG.md
- apps/dag_engine/README.md
- apps/dag_engine/api.go
- apps/dag_engine/dags_migrated/example.yaml
- apps/dag_engine/dags_migrated/example_lineage_tracking.yaml
- ...

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -0,0 +1,61 @@
---
name: llm_propose_scraping_schema
kind: function
lang: py
domain: infra
version: "1.0.0"
purity: impure
signature: "def llm_propose_scraping_schema(url: str, ax_tree: list, max_chunks: int = 5, max_chars_per_chunk: int = 25000) -> dict"
description: "Orchestrates trim_ax_tree -> chunk_ax_tree -> N calls to the Claude CLI -> merge. Proposes a scraping schema (fields, selectors, types) from a page's AX tree."
tags: [navegator, ai, llm, scraping, schema]
uses_functions: [trim_ax_tree_py_core, chunk_ax_tree_py_core, claude_cli_prompt_py_infra]
uses_types: []
returns: []
returns_optional: false
error_type: "error_go_core"
imports: [json, re, sys, os]
params:
  - name: url
    desc: "URL of the page (included in the prompt to Claude for context)."
  - name: ax_tree
    desc: "AX tree as a list of dicts obtained via CDP (cdp_get_ax_tree)."
  - name: max_chunks
    desc: "Maximum number of chunks to process. Default 5. If there are more, truncated=True."
  - name: max_chars_per_chunk
    desc: "Maximum characters per AX tree chunk sent to Claude. Default 25000."
output: "dict {schema: [{field, selector, sample_value, type, source_role}], notes: str, chunks_processed: int, truncated: bool}"
tested: false
tests: []
test_file_path: ""
file_path: "python/functions/infra/llm_propose_scraping_schema.py"
---

## Example

```python
import sys
sys.path.insert(0, "python/functions")
from infra.llm_propose_scraping_schema import llm_propose_scraping_schema

# ax_tree was obtained beforehand with cdp_get_ax_tree
result = llm_propose_scraping_schema(
    url="https://shop.example.com/products",
    ax_tree=ax_tree,
    max_chunks=3,
)
# {"schema": [{"field": "price", "selector": ".product-price", ...}], "notes": "...", ...}
for field in result["schema"]:
    print(field["field"], "->", field["selector"])
```
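
Besides the field list, the documented output also carries `notes`, `chunks_processed`, and `truncated`. The sketch below shows one way to surface them after a call. `summarize_result` is a hypothetical helper, and the sample `result` is hand-written to match the documented shape: its values, including `type` and `source_role`, are made up for illustration.

```python
# Hypothetical result shaped like the documented output; all values are made up.
result = {
    "schema": [
        {
            "field": "price",
            "selector": ".product-price",
            "sample_value": "19.99",
            "type": "number",
            "source_role": "StaticText",
        },
    ],
    "notes": "",
    "chunks_processed": 3,
    "truncated": True,
}


def summarize_result(result):
    """Build a one-line, human-readable summary of a proposal result."""
    summary = f"{len(result['schema'])} field(s) from {result['chunks_processed']} chunk(s)"
    if result["truncated"]:
        # More chunks existed than max_chunks allowed, so part of the page was not seen.
        summary += "; page truncated, consider raising max_chunks"
    if result["notes"]:
        summary += f"; notes: {result['notes']}"
    return summary


print(summarize_result(result))
```

Checking `truncated` before trusting the schema is useful because fields that only appear in the unprocessed chunks will be missing from the proposal.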

## When to use it

Use it when you have a page's AX tree and want Claude to propose automatically which fields to extract and which CSS selectors to use. It is a discovery step before writing the YAML recipe, whether by hand or with assistance.

## Gotchas

- Requires the `claude` CLI to be installed and available on PATH (validated by `claude_cli_prompt`).
- Each chunk triggers one call to Claude (token cost). Use a conservative `max_chunks` on very large pages.
- Claude's response is parsed tolerating fenced code blocks (```json ... ```). If Claude returns prose without JSON, the chunk is skipped and an error note is recorded.
- Dedup by `field`: the first occurrence wins if the same field appears in several chunks.
- Does not access the network directly; it delegates to `claude_cli_prompt`.
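
The parsing and dedup behaviour described in this list can be sketched as follows. This is a minimal illustration, not the code in `llm_propose_scraping_schema.py`: `parse_chunk_response` and `merge_chunk_responses` are hypothetical names, the note wording is invented, and counting a skipped chunk towards `chunks_processed` is an assumption.

```python
import json
import re

# Three backticks, built at runtime so this example stays easy to embed in Markdown.
FENCE = "`" * 3
# Captures the body of a fenced block, with or without a "json" language tag.
FENCE_RE = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL)


def parse_chunk_response(text):
    """Return the JSON object found in one response, or None if there is none."""
    match = FENCE_RE.search(text)
    candidate = match.group(1) if match else text
    try:
        data = json.loads(candidate.strip())
    except json.JSONDecodeError:
        return None  # prose without JSON
    return data if isinstance(data, dict) else None


def merge_chunk_responses(responses, total_chunks):
    """Merge per-chunk responses; the first occurrence of each field wins.

    total_chunks is how many chunks the page produced before the max_chunks
    cap was applied (assumption: the cap is applied before calling Claude).
    """
    merged = {}  # field name -> first entry seen, in insertion order
    notes = []
    for index, text in enumerate(responses):
        data = parse_chunk_response(text)
        if data is None:
            notes.append(f"chunk {index}: no JSON found, skipped")
            continue
        for entry in data.get("schema", []):
            if isinstance(entry, dict) and "field" in entry:
                merged.setdefault(entry["field"], entry)
    return {
        "schema": list(merged.values()),
        "notes": "; ".join(notes),
        "chunks_processed": len(responses),
        "truncated": total_chunks > len(responses),
    }
```

Keying the merge on `field` with `setdefault` keeps the selector from the earliest chunk, which matches the "first occurrence wins" rule, and a response that cannot be parsed only adds a note instead of aborting the whole run.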