feat: catch-up of earlier decisions (Webpage→Url, anti-bot, 2-col UI, cross-platform tests)

Batch of changes reviewed and validated with the user in earlier sessions that had not yet landed in commits of their own. By topic:

* enrichers: web_search now uses lite.duckduckgo.com as its primary endpoint (more tolerant of bot detection from residential IPs), falling back to the html endpoint. It detects the captcha page and emits a clear error if both fail. Adds _DDGLiteParser for the lite format, plus automatic parser selection by content.
* enrichers: the Webpage type is unified into Url (the cached-body fields live in the Url's metadata). Manifests updated (applies_to: [Url]). fetch_webpage no longer converts Url->Webpage.
* enrichers/manifest: the `params` field is parsed into EnricherSpec.params (name, type, default_value, description). The UI can render a configuration dialog.
* jobs: fix path conversion for native Windows embedded Python (do not convert to /mnt/c/... when the subprocess is Windows-native; only when it is bash or python via WSL).
* main.cpp: non-modal ImGui window "Run enricher" with a 2-col layout (label left, input right). Inserts the job with typed JSON. Tighter clustering layout: children of the same anchor sit in a single ring around the parent instead of scattering across growing rings.
* views: inspector with a 2-col layout via BeginTable (Identity, Schema fields, Extras). Description is full-width below its label.
* tests: portable conftest (auto-detects REGISTRY_ROOT, PYTHON_BIN, ENRICHERS_DIR for WSL and portable Windows). The _runner.py trampoline injects the stub via sys.path because embedded Python ignores PYTHONPATH. Bash-only tests (vendor_script, freeze, dispatcher bash, resolver Linux-binary) are skipped on Windows. Existing tests adapted to Webpage->Url.

Current result: 32 passed on WSL, 21 passed + 11 skipped on Windows.
2026-05-03 14:41:28 +02:00
parent 4be5734ce5
commit 7a94160fd2
26 changed files with 973 additions and 241 deletions
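
The jobs fix described in the commit message can be illustrated with a minimal Python sketch. The function name `to_subprocess_path` and the `runner` labels are hypothetical; the real fix lives in the job launcher, which is not among the hunks shown here. The sketch only captures the rule stated in the message: rewrite a Windows path to `/mnt/<drive>/...` when the child runs under WSL, and pass it through untouched when the child is a Windows-native interpreter.

```python
import re

# Matches "C:\..." or "C:/..." and captures the drive letter and the rest.
_DRIVE_RE = re.compile(r"^([A-Za-z]):[\\/](.*)$")


def to_subprocess_path(path: str, runner: str) -> str:
    """Return the path as the child process should see it.

    `runner` is a hypothetical label: "wsl" for bash/python launched via
    WSL, anything else (e.g. "windows-native") for a native interpreter.
    """
    if runner != "wsl":
        # A Windows-native child understands the original path as-is.
        return path
    m = _DRIVE_RE.match(path)
    if not m:
        # Not a drive-letter path (already POSIX, relative, ...): leave it.
        return path
    drive = m.group(1).lower()
    rest = m.group(2).replace("\\", "/")
    return f"/mnt/{drive}/{rest}"
```

Keeping the decision in one helper keyed on the runner type avoids converting paths at call sites that do not know which interpreter will receive them.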
@@ -1,7 +1,7 @@
 id: extract_domain
 name: "Extract domain"
 description: "Saca el dominio de la url/email del nodo y crea/conecta una entidad Domain con relacion BELONGS_TO. No descarga nada."
-applies_to: [Url, Webpage, Email]
+applies_to: [Url, Email]
 emits: [Domain]
 relations: [BELONGS_TO]
 params: []
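
For the `params` field mentioned in the commit message, a rough Python sketch of parsing manifest entries into a spec object follows. The field names (name, type, default_value, description) come from the message; the `ParamSpec` class, the manifest key names, and the defaults are assumptions, and the actual EnricherSpec is defined elsewhere in the application.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any


@dataclass
class ParamSpec:
    # Field names follow the commit message; defaults are assumptions.
    name: str
    type: str = "string"
    default_value: Any = None
    description: str = ""


@dataclass
class EnricherSpec:
    id: str
    applies_to: list[str] = field(default_factory=list)
    params: list[ParamSpec] = field(default_factory=list)


def spec_from_manifest(manifest: dict) -> EnricherSpec:
    """Build a spec from an already-parsed manifest mapping."""
    params = [
        ParamSpec(
            name=p["name"],
            type=p.get("type", "string"),
            default_value=p.get("default_value"),
            description=p.get("description", ""),
        )
        # `params: []` or a missing key both yield an empty list.
        for p in (manifest.get("params") or [])
    ]
    return EnricherSpec(
        id=manifest["id"],
        applies_to=list(manifest.get("applies_to") or []),
        params=params,
    )
```

With a structure like this, a UI could iterate over `spec.params` and emit one label/input row per entry.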
@@ -1,7 +1,7 @@
 id: extract_links
 name: "Extract links"
-description: "Lee la markdown cacheada de un Webpage (metadata.markdown_path) y crea nodos Url para cada enlace encontrado, conectados con relacion LINKS_TO. Requiere haber ejecutado fetch_webpage antes."
-applies_to: [Webpage]
+description: "Lee la markdown cacheada del nodo Url (metadata.markdown_path) y crea nodos Url para cada enlace encontrado, conectados con relacion LINKS_TO. Requiere haber ejecutado fetch_webpage antes."
+applies_to: [Url]
 emits: [Url]
 relations: [LINKS_TO]
 uses_functions:
@@ -1,7 +1,7 @@
 id: extract_text_entities
 name: "Extract entities from text"
-description: "Lee la markdown cacheada de un Webpage y extrae IoCs (IPs, emails, dominios, hashes, crypto wallets, CVEs, MAC, telefonos) creando entidades + relacion EXTRACTED_FROM. Sin coste — solo regex. Modelos ML (GLiNER/GLiREL) en futura iteracion."
-applies_to: [Webpage]
+description: "Lee la markdown cacheada de un Url y extrae IoCs (IPs, emails, dominios, hashes, crypto wallets, CVEs, MAC, telefonos) creando entidades + relacion EXTRACTED_FROM. Sin coste — solo regex. Modelos ML (GLiNER/GLiREL) en futura iteracion."
+applies_to: [Url]
 emits: [Email, IPAddress, Domain, FileHash, CryptoWallet, CVE, MACAddress, Phone]
 relations: [EXTRACTED_FROM]
 uses_functions:
@@ -1,7 +1,7 @@
 id: fetch_webpage
 name: "Fetch web page"
-description: "Descarga HTML de una URL, extrae markdown limpio (readabilipy) y guarda los blobs en cache. Crea/actualiza el nodo Webpage con title/status_code/paths y crea el Domain con relacion BELONGS_TO."
-applies_to: [Url, Webpage]
+description: "Descarga HTML de una URL, extrae markdown limpio (readabilipy) y guarda los blobs en cache. Actualiza el nodo Url con title/status_code/paths/markdown en metadata y crea el Domain con relacion BELONGS_TO."
+applies_to: [Url]
 emits: [Domain]
 relations: [BELONGS_TO]
 uses_functions:
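
The `_runner.py` trampoline named in the commit message is not among the hunks shown, so the following is only a sketch of the idea under stated assumptions: since the embedded interpreter ignores PYTHONPATH, the stub directory is prepended to `sys.path` in-process before the target script runs. The function name `run_with_stub` is hypothetical.

```python
import runpy
import sys


def run_with_stub(stub_dir: str, script: str) -> dict:
    """Run `script` as __main__ with `stub_dir` first on sys.path.

    Modules in stub_dir shadow same-named modules found later on the
    path, provided they have not been imported already. Returns the
    globals of the executed script.
    """
    sys.path.insert(0, stub_dir)
    try:
        return runpy.run_path(script, run_name="__main__")
    finally:
        # Restore the path so later runs are not affected.
        sys.path.remove(stub_dir)
```

Invoking the enricher through a wrapper like this keeps the test setup identical on WSL and on Windows portable, because it does not rely on any environment variable being honored.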
@@ -3,7 +3,12 @@

 Lee JSON de stdin, descarga la URL del nodo, convierte HTML a markdown,
 guarda blobs en `<cache_dir>/<sha256[0:2]>/<sha256>.{html,md}`, actualiza el
-nodo a tipo Webpage con metadata enriquecida y crea/conecta el Domain.
+nodo (deja type_ref=Url) con metadata enriquecida y crea/conecta el Domain.
+
+Nota: historicamente fetch_webpage convertia Url -> Webpage, pero esos
+dos tipos se han unificado en Url. Los campos de cuerpo cacheado
+(html_path, markdown_path, status_code, fetched_at, text_length, ...)
+viven en metadata.

 Wire protocol (issue 0026):
  - stdin:  JSON con node_id, metadata, ops_db_path, app_dir, cache_dir,
@@ -289,7 +294,14 @@ def main() -> int:
            log(f"node {node_id} disappeared")
            return 6
        cur_type, cur_meta = row[0], row[1] or "{}"
-        new_type = "Webpage" if cur_type.lower() == "url" else cur_type or "Webpage"
+        # Webpage fue un tipo separado historicamente. Hoy se unifica en
+        # Url (mismo tipo, los campos de cuerpo cacheado viven en
+        # metadata): si el nodo entrante es Url o el legacy Webpage, lo
+        # dejamos como Url; si el nodo no tiene tipo, default Url.
+        if not cur_type or cur_type.lower() in ("url", "webpage"):
+            new_type = "Url"
+        else:
+            new_type = cur_type

        patch = {
            "url":           url,
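
The type normalization added in the hunk above can be isolated for testing. A minimal sketch, assuming a standalone helper named `normalize_node_type` (the name is hypothetical; in the diff the logic is inline in `main`):

```python
from typing import Optional


def normalize_node_type(cur_type: Optional[str]) -> str:
    """Map a node's current type to the type fetch_webpage should write.

    Url and the legacy Webpage both become Url; a missing or empty type
    defaults to Url; any other type is preserved unchanged.
    """
    if not cur_type or cur_type.lower() in ("url", "webpage"):
        return "Url"
    return cur_type
```

For example, `normalize_node_type("webpage")` returns `"Url"`, while `normalize_node_type("Domain")` returns `"Domain"`.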
@@ -8,14 +8,20 @@ Wire protocol estandar (issue 0026):
  - stdout: una linea JSON al final con resumen.
  - exit code 0 = ok, !=0 = error.

-DDG endpoint usado: https://html.duckduckgo.com/html/?q=<query>
-Devuelve HTML estatico, sin JavaScript. Los enlaces vienen envueltos en
-redireccion `//duckduckgo.com/l/?uddg=<encoded>` que hay que decodificar.
+DDG endpoints usados:
+  1. https://lite.duckduckgo.com/lite/ (POST) — endpoint primario.
+     HTML minimo (ano 2009-style), tabla con `<a class='result-link'>` y
+     `<td class='result-snippet'>`. Es el menos agresivo con bot
+     detection; suele responder 200 cuando el endpoint `html.` ya
+     devuelve un challenge "anomaly" desde IPs residenciales/Windows.
+  2. https://html.duckduckgo.com/html/ (POST) — fallback. Su parser
+     usa `result__a` / `result__snippet`. DDG envuelve los enlaces en
+     `//duckduckgo.com/l/?uddg=<encoded>` que hay que decodificar.

-Para automatizar busquedas masivas en el futuro (sesion persistente,
-cookies, JS, captchas) la fase 2 introducira un enricher `web_search_cdp`
-que controle un Chromium remoto via DevTools Protocol. Este es el
-fallback simple zero-infra.
+Si ambos endpoints devuelven la pagina anti-bot ("anomaly", challenge
+captcha), el enricher emite un error claro indicando que se necesita
+`web_search_cdp` (issue 0029) — el fallback simple zero-infra no puede
+resolver el challenge.
 """
 from __future__ import annotations

@@ -49,13 +55,33 @@ def now_ms() -> int:
    return int(time.time() * 1000)


-def fetch_ddg(query: str, timeout: int, region: str, safe: str) -> str:
-    """Descarga la pagina HTML de resultados de DuckDuckGo.
+def _ddg_post(url: str, params: dict, headers: dict, timeout: int) -> str:
+    try:
+        import requests  # type: ignore
+        r = requests.post(url, data=params, headers=headers, timeout=timeout)
+        return r.text
+    except ImportError:
+        from urllib.parse import urlencode
+        from urllib.request import Request, urlopen
+        body = urlencode(params).encode()
+        req = Request(url, data=body, headers=headers)
+        with urlopen(req, timeout=timeout) as resp:  # type: ignore
+            return resp.read().decode("utf-8", errors="replace")

-    El endpoint `html.duckduckgo.com` no requiere JS y respeta los
-    parametros `kl` (region) y `kp` (safe search: 1 strict, -1 off,
-    -2 moderate). Inyecta cookie para que el "moderate" se aplique sin
-    pantalla intermedia.
+
+def is_anomaly_page(htmltxt: str) -> bool:
+    """Detecta la pagina anti-bot de DDG (challenge captcha)."""
+    s = htmltxt.lower()
+    return "anomaly" in s and "challenge" in s
+
+
+def fetch_ddg(query: str, timeout: int, region: str, safe: str) -> tuple[str, str]:
+    """Descarga la pagina de resultados de DuckDuckGo.
+
+    Intenta primero `lite.duckduckgo.com/lite/` (HTML minimo, ano-2009
+    style, mucho menos agresivo con bot detection que `html.`). Si
+    ese endpoint devuelve la pagina anti-bot, cae al endpoint `html.`.
+    Devuelve `(html, source)` donde source ∈ {"lite", "html"}.
    """
    params = {"q": query}
    if region:
@@ -66,29 +92,22 @@ def fetch_ddg(query: str, timeout: int, region: str, safe: str) -> str:

    headers = {
        "User-Agent": (
-            "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
+            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
            "(KHTML, like Gecko) Chrome/120 Safari/537.36"
        ),
        "Accept": "text/html,application/xhtml+xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.7",
    }
-    try:
-        import requests  # type: ignore
-        r = requests.post(
-            "https://html.duckduckgo.com/html/",
-            data=params,
-            headers=headers,
-            timeout=timeout,
-        )
-        return r.text
-    except ImportError:
-        from urllib.parse import urlencode
-        from urllib.request import Request, urlopen
-        body = urlencode(params).encode()
-        req = Request("https://html.duckduckgo.com/html/", data=body,
-                      headers=headers)
-        with urlopen(req, timeout=timeout) as resp:  # type: ignore
-            return resp.read().decode("utf-8", errors="replace")
+
+    htmltxt = _ddg_post("https://lite.duckduckgo.com/lite/", params,
+                         headers, timeout)
+    if not is_anomaly_page(htmltxt):
+        return htmltxt, "lite"
+
+    log("lite endpoint devolvio challenge — fallback a html endpoint")
+    htmltxt = _ddg_post("https://html.duckduckgo.com/html/", params,
+                         headers, timeout)
+    return htmltxt, "html"


 def decode_ddg_href(href: str) -> str:
@@ -195,7 +214,7 @@ class _DDGParser(HTMLParser):


 def parse_ddg_html(htmltxt: str) -> list[dict]:
-    """Parsea el HTML de DDG y devuelve [{url, title, snippet, rank}]."""
+    """Parsea el HTML del endpoint `html.duckduckgo.com`."""
    p = _DDGParser()
    try:
        p.feed(htmltxt)
@@ -221,6 +240,100 @@ def parse_ddg_html(htmltxt: str) -> list[dict]:
    return out


+class _DDGLiteParser(HTMLParser):
+    """Parser para `lite.duckduckgo.com/lite/`.
+
+    Estructura tipica:
+      <a rel="nofollow" href="<URL>" class='result-link'>title</a>
+      ...
+      <td class='result-snippet'>snippet text</td>
+    Los snippets vienen DESPUES del enlace (no hijo del mismo elemento),
+    asi que parea por orden: cada `result-link` consume el siguiente
+    `result-snippet`.
+    """
+
+    def __init__(self) -> None:
+        super().__init__(convert_charrefs=True)
+        self.results: list[dict] = []
+        self._in_link = False
+        self._in_snippet = False
+        self._cur_href = ""
+        self._title_buf: list[str] = []
+        self._snippet_buf: list[str] = []
+        self._pending_snippet_for: int | None = None
+
+    def _attrs_dict(self, attrs):
+        return {k: (v or "") for k, v in attrs}
+
+    def handle_starttag(self, tag: str, attrs):
+        a = self._attrs_dict(attrs)
+        cls = a.get("class", "")
+        if tag == "a" and "result-link" in cls:
+            href = a.get("href", "")
+            self._in_link = True
+            self._cur_href = href
+            self._title_buf = []
+        elif tag == "td" and "result-snippet" in cls:
+            self._in_snippet = True
+            self._snippet_buf = []
+
+    def handle_endtag(self, tag: str):
+        if self._in_link and tag == "a":
+            title = " ".join("".join(self._title_buf).split())
+            self.results.append({
+                "href":    self._cur_href,
+                "title":   title,
+                "snippet": "",
+            })
+            self._pending_snippet_for = len(self.results) - 1
+            self._in_link = False
+        elif self._in_snippet and tag == "td":
+            snippet = " ".join("".join(self._snippet_buf).split())
+            if self._pending_snippet_for is not None:
+                self.results[self._pending_snippet_for]["snippet"] = snippet
+                self._pending_snippet_for = None
+            self._in_snippet = False
+
+    def handle_data(self, data: str):
+        if self._in_link:
+            self._title_buf.append(data)
+        elif self._in_snippet:
+            self._snippet_buf.append(data)
+
+
+def parse_ddg_lite(htmltxt: str) -> list[dict]:
+    """Parsea el HTML del endpoint `lite.duckduckgo.com/lite/`."""
+    p = _DDGLiteParser()
+    try:
+        p.feed(htmltxt)
+        p.close()
+    except Exception as e:
+        log(f"DDG lite parser failed: {e}")
+
+    out: list[dict] = []
+    seen: set[str] = set()
+    for r in p.results:
+        href = r.get("href") or ""
+        # lite envia URLs absolutas directas; aun asi pasamos por
+        # decode_ddg_href por si en algun caso DDG envuelve.
+        url = decode_ddg_href(href)
+        if not url or not url.startswith(("http://", "https://")):
+            continue
+        # Excluir auto-promociones de DDG (paginas de ayuda).
+        if "duckduckgo.com/duckduckgo-help-pages/" in url:
+            continue
+        if url in seen:
+            continue
+        seen.add(url)
+        out.append({
+            "url":     url,
+            "title":   r.get("title") or "",
+            "snippet": r.get("snippet") or "",
+            "rank":    len(out) + 1,
+        })
+    return out
+
+
 def find_url_entity(conn: sqlite3.Connection, url: str) -> str | None:
    """Busca un nodo Url existente con la misma url en metadata."""
    cur = conn.execute(
@@ -384,18 +497,40 @@ def main() -> int:

    progress(0.10, "fetching")
    try:
-        htmltxt = fetch_ddg(query, timeout=timeout_s, region=region, safe=safe)
+        htmltxt, source = fetch_ddg(query, timeout=timeout_s,
+                                     region=region, safe=safe)
    except Exception as e:
        log(f"DDG fetch failed: {e}")
        print(json.dumps({"error": str(e), "query": query,
                          "entities_added": 0, "relations_added": 0}))
        return 4

+    if is_anomaly_page(htmltxt):
+        log("DDG devolvio challenge captcha en ambos endpoints — "
+            "usar web_search_cdp (issue 0029) para resolver")
+        print(json.dumps({
+            "error":            "DDG bot challenge — captcha required",
+            "query":            query,
+            "engine":           "duckduckgo",
+            "source":           source,
+            "results":          0,
+            "entities_added":   0,
+            "relations_added":  0,
+        }, ensure_ascii=False))
+        return 4
+
    progress(0.55, "parsing")
-    results = parse_ddg_html(htmltxt)
+    # El parser se elige por contenido — si el endpoint y el markup no
+    # coinciden (tests con stub que sirve cualquier URL, o un cambio
+    # futuro de DDG), aun extraemos resultados. Probamos ambos y nos
+    # quedamos con el que devuelva mas.
+    results_lite = parse_ddg_lite(htmltxt) if "result-link" in htmltxt else []
+    results_html = parse_ddg_html(htmltxt) if "result__a"   in htmltxt else []
+    results = results_lite if len(results_lite) >= len(results_html) else results_html
    if limit > 0:
        results = results[:limit]
-    log(f"DDG returned {len(results)} results")
+    log(f"DDG ({source}) returned {len(results)} results "
+        f"(lite_parsed={len(results_lite)} html_parsed={len(results_html)})")

    progress(0.80, "applying")
    conn = sqlite3.connect(ops_db_path)
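
The endpoint fallback and the content-based parser choice from the web_search hunks can be exercised without network access. The sketch below mirrors the logic of the diff but injects the HTTP call and the parsers as plain callables; `fetch_with_fallback` and `pick_results` are hypothetical names introduced for illustration, while `is_anomaly_page` and the marker strings are copied from the diff.

```python
from __future__ import annotations

from typing import Callable

LITE_URL = "https://lite.duckduckgo.com/lite/"
HTML_URL = "https://html.duckduckgo.com/html/"


def is_anomaly_page(htmltxt: str) -> bool:
    """Same heuristic as in the diff: both words must appear."""
    s = htmltxt.lower()
    return "anomaly" in s and "challenge" in s


def fetch_with_fallback(post: Callable[[str], str]) -> tuple[str, str]:
    """Try the lite endpoint first; use html only if lite is challenged.

    `post` takes a URL and returns the response body. The caller still
    has to check the returned page with is_anomaly_page, because the
    html endpoint may be challenged as well.
    """
    page = post(LITE_URL)
    if not is_anomaly_page(page):
        return page, "lite"
    return post(HTML_URL), "html"


def pick_results(htmltxt: str,
                 lite_parser: Callable[[str], list],
                 html_parser: Callable[[str], list]) -> list:
    """Run each parser only if its marker is present; keep the longer list."""
    results_lite = lite_parser(htmltxt) if "result-link" in htmltxt else []
    results_html = html_parser(htmltxt) if "result__a" in htmltxt else []
    return results_lite if len(results_lite) >= len(results_html) else results_html
```

Choosing the parser by markup rather than by the endpoint that answered means a test stub serving either format from any URL still yields results.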