feat: catch-up on earlier decisions (Webpage→Url, anti-bot, 2-col UI, cross-platform tests)

Batch of changes reviewed and validated with the user in earlier sessions that had not yet landed in commits of their own. By topic:

* enrichers: web_search now uses lite.duckduckgo.com as its primary endpoint (more tolerant of bot detection from residential IPs), with a fallback to the html endpoint. It detects the captcha page and emits a clear error if both fail. Adds _DDGLiteParser for the lite format, plus automatic parser selection based on content.
* enrichers: the Webpage type is merged into Url (the cached-body fields live in the Url's metadata). Manifests updated (applies_to: [Url]). fetch_webpage no longer converts Url->Webpage.
* enrichers/manifest: the `params` field is parsed into EnricherSpec.params (name, type, default_value, description). The UI can render a configuration dialog.
* jobs: path-conversion fix for native Windows embedded Python (do not convert to /mnt/c/... when the subprocess is Windows-native; only when it is bash or python via WSL).
* main.cpp: ImGui window (non-modal) "Run enricher" with a 2-col layout (label left, input right). Inserts the job with typed JSON. Tighter layout clustering: children of the same anchor go in a single ring around the parent instead of scattering across growing rings.
* views: inspector with a 2-col layout via BeginTable (Identity, Schema fields, Extras). Description is full-width below its label.
* tests: portable conftest (auto-detects REGISTRY_ROOT, PYTHON_BIN, ENRICHERS_DIR for WSL and portable Windows). The _runner.py trampoline injects the stub via sys.path because embedded Python ignores PYTHONPATH. Bash-only tests (vendor_script, freeze, dispatcher bash, resolver Linux-binary) are skipped on Windows. Existing tests adapted to Webpage->Url.

Current result: 32 passed on WSL, 21 passed + 11 skipped on Windows.
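The path-conversion rule in the jobs item can be illustrated with a minimal sketch. The function name `to_subprocess_path` and the `via_wsl` flag are hypothetical; the commit's actual jobs code is not shown on this page, so this only mirrors the rule as stated: translate a Windows drive path to `/mnt/<drive>/...` when the subprocess runs through WSL, and leave it untouched for a Windows-native interpreter.

```python
from pathlib import PureWindowsPath


def to_subprocess_path(path: str, via_wsl: bool) -> str:
    """Hypothetical helper: adapt a path for the process that will receive it."""
    if not via_wsl:
        # Windows-native subprocess (e.g. embedded Python): pass the path as-is.
        return path
    win = PureWindowsPath(path)
    # Only plain drive paths such as "C:\..." are translated; POSIX and UNC
    # paths are returned unchanged.
    if len(win.drive) != 2:
        return path
    letter = win.drive[0].lower()
    rest = "/".join(win.parts[1:])
    return f"/mnt/{letter}/{rest}"
```

`PureWindowsPath` is used so the sketch behaves the same whether it runs on Windows or inside WSL.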
@@ -1,7 +1,7 @@
 id: extract_domain
 name: "Extract domain"
 description: "Extracts the domain from the node's url/email and creates/connects a Domain entity with a BELONGS_TO relation. Downloads nothing."
-applies_to: [Url, Webpage, Email]
+applies_to: [Url, Email]
 emits: [Domain]
 relations: [BELONGS_TO]
 params: []
@@ -1,7 +1,7 @@
 id: extract_links
 name: "Extract links"
-description: "Reads the cached markdown of a Webpage (metadata.markdown_path) and creates Url nodes for every link found, connected with a LINKS_TO relation. Requires fetch_webpage to have been run first."
-applies_to: [Webpage]
+description: "Reads the cached markdown of the Url node (metadata.markdown_path) and creates Url nodes for every link found, connected with a LINKS_TO relation. Requires fetch_webpage to have been run first."
+applies_to: [Url]
 emits: [Url]
 relations: [LINKS_TO]
 uses_functions:
@@ -1,7 +1,7 @@
 id: extract_text_entities
 name: "Extract entities from text"
-description: "Reads the cached markdown of a Webpage and extracts IoCs (IPs, emails, domains, hashes, crypto wallets, CVEs, MAC, phone numbers), creating entities + an EXTRACTED_FROM relation. No cost; regex only. ML models (GLiNER/GLiREL) in a future iteration."
-applies_to: [Webpage]
+description: "Reads the cached markdown of a Url and extracts IoCs (IPs, emails, domains, hashes, crypto wallets, CVEs, MAC, phone numbers), creating entities + an EXTRACTED_FROM relation. No cost; regex only. ML models (GLiNER/GLiREL) in a future iteration."
+applies_to: [Url]
 emits: [Email, IPAddress, Domain, FileHash, CryptoWallet, CVE, MACAddress, Phone]
 relations: [EXTRACTED_FROM]
 uses_functions:
@@ -1,7 +1,7 @@
 id: fetch_webpage
 name: "Fetch web page"
-description: "Downloads the HTML of a URL, extracts clean markdown (readabilipy) and stores the blobs in the cache. Creates/updates the Webpage node with title/status_code/paths and creates the Domain with a BELONGS_TO relation."
-applies_to: [Url, Webpage]
+description: "Downloads the HTML of a URL, extracts clean markdown (readabilipy) and stores the blobs in the cache. Updates the Url node with title/status_code/paths/markdown in metadata and creates the Domain with a BELONGS_TO relation."
+applies_to: [Url]
 emits: [Domain]
 relations: [BELONGS_TO]
 uses_functions:
@@ -3,7 +3,12 @@
 
 Reads JSON from stdin, downloads the node's URL, converts HTML to markdown,
 stores blobs in `<cache_dir>/<sha256[0:2]>/<sha256>.{html,md}`, updates the
-node to type Webpage with enriched metadata and creates/connects the Domain.
+node (keeping type_ref=Url) with enriched metadata and creates/connects the Domain.
+
+Note: historically fetch_webpage converted Url -> Webpage, but those
+two types have been merged into Url. The cached-body fields
+(html_path, markdown_path, status_code, fetched_at, text_length, ...)
+live in metadata.
 
 Wire protocol (issue 0026):
 - stdin: JSON with node_id, metadata, ops_db_path, app_dir, cache_dir,
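The sharded cache layout named in the docstring, `<cache_dir>/<sha256[0:2]>/<sha256>.{html,md}`, can be sketched as follows. This is an illustration only: the helper name `blob_paths` is hypothetical, and hashing the URL string is an assumption, since the diff does not show what the enricher actually feeds into SHA-256.

```python
import hashlib
from pathlib import PurePosixPath


def blob_paths(cache_dir: str, key: str) -> dict:
    """Hypothetical helper: derive the .html/.md blob paths for one cache key."""
    # ASSUMPTION: the key is the URL string; the real enricher may hash
    # something else (for example the response body).
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    # The first two hex characters pick the shard directory.
    base = PurePosixPath(cache_dir) / digest[:2] / digest
    return {"html": f"{base}.html", "md": f"{base}.md"}
```

Sharding on the first two hex characters keeps any single directory to a fraction of the total blob count, which is the usual reason for this kind of layout.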
@@ -289,7 +294,14 @@ def main() -> int:
         log(f"node {node_id} disappeared")
         return 6
     cur_type, cur_meta = row[0], row[1] or "{}"
-    new_type = "Webpage" if cur_type.lower() == "url" else cur_type or "Webpage"
+    # Webpage used to be a separate type. It is now merged into
+    # Url (same type; the cached-body fields live in
+    # metadata): if the incoming node is Url or the legacy Webpage, we
+    # keep it as Url; if the node has no type, default to Url.
+    if not cur_type or cur_type.lower() in ("url", "webpage"):
+        new_type = "Url"
+    else:
+        new_type = cur_type
 
     patch = {
         "url": url,
Binary file not shown.
+170
-35
@@ -8,14 +8,20 @@ Standard wire protocol (issue 0026):
 - stdout: one JSON line at the end with a summary.
 - exit code 0 = ok, !=0 = error.
 
-DDG endpoint used: https://html.duckduckgo.com/html/?q=<query>
-Returns static HTML, no JavaScript. Links come wrapped in a
-`//duckduckgo.com/l/?uddg=<encoded>` redirect that has to be decoded.
+DDG endpoints used:
+  1. https://lite.duckduckgo.com/lite/ (POST): primary endpoint.
+     Minimal HTML (2009-style), a table with `<a class='result-link'>` and
+     `<td class='result-snippet'>`. It is the least aggressive about bot
+     detection; it usually answers 200 when the `html.` endpoint already
+     returns an "anomaly" challenge from residential/Windows IPs.
+  2. https://html.duckduckgo.com/html/ (POST): fallback. Its parser
+     uses `result__a` / `result__snippet`. DDG wraps the links in
+     `//duckduckgo.com/l/?uddg=<encoded>`, which has to be decoded.
 
-To automate bulk searches in the future (persistent session,
-cookies, JS, captchas), phase 2 will introduce a `web_search_cdp` enricher
-that drives a remote Chromium via the DevTools Protocol. This is the
-simple zero-infra fallback.
+If both endpoints return the anti-bot page ("anomaly", captcha
+challenge), the enricher emits a clear error stating that
+`web_search_cdp` (issue 0029) is needed; the simple zero-infra fallback
+cannot solve the challenge.
 """
 from __future__ import annotations
 
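The docstring says the `html.` endpoint wraps result links in `//duckduckgo.com/l/?uddg=<encoded>`. The body of the real `decode_ddg_href` is not part of this diff, so the following is only a sketch of one way to unwrap such a link; the name `decode_uddg_sketch` is hypothetical.

```python
from urllib.parse import parse_qs, urlparse


def decode_uddg_sketch(href: str) -> str:
    """Return the target URL of a DDG redirect link, or href unchanged."""
    # Protocol-relative links ("//duckduckgo.com/...") need a scheme to parse.
    if href.startswith("//"):
        href = "https:" + href
    parsed = urlparse(href)
    if parsed.netloc.endswith("duckduckgo.com") and parsed.path.startswith("/l/"):
        # parse_qs percent-decodes the value of the uddg parameter.
        target = parse_qs(parsed.query).get("uddg", [""])[0]
        return target or href
    # Not a redirect wrapper: treat it as a direct URL.
    return href
```

Returning the input unchanged for non-wrapped links matches how the lite parser in this commit passes already-absolute URLs through the decoder.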
@@ -49,13 +55,33 @@ def now_ms() -> int:
     return int(time.time() * 1000)
 
 
-def fetch_ddg(query: str, timeout: int, region: str, safe: str) -> str:
-    """Downloads the DuckDuckGo HTML results page.
+def _ddg_post(url: str, params: dict, headers: dict, timeout: int) -> str:
+    try:
+        import requests  # type: ignore
+        r = requests.post(url, data=params, headers=headers, timeout=timeout)
+        return r.text
+    except ImportError:
+        from urllib.parse import urlencode
+        from urllib.request import Request, urlopen
+        body = urlencode(params).encode()
+        req = Request(url, data=body, headers=headers)
+        with urlopen(req, timeout=timeout) as resp:  # type: ignore
+            return resp.read().decode("utf-8", errors="replace")
 
-    The `html.duckduckgo.com` endpoint needs no JS and honours the
-    `kl` (region) and `kp` (safe search: 1 strict, -1 off,
-    -2 moderate) parameters. Injects a cookie so that "moderate" is
-    applied without an intermediate screen.
+
+def is_anomaly_page(htmltxt: str) -> bool:
+    """Detects DDG's anti-bot page (captcha challenge)."""
+    s = htmltxt.lower()
+    return "anomaly" in s and "challenge" in s
+
+
+def fetch_ddg(query: str, timeout: int, region: str, safe: str) -> tuple[str, str]:
+    """Downloads the DuckDuckGo results page.
+
+    Tries `lite.duckduckgo.com/lite/` first (minimal HTML, 2009-style,
+    much less aggressive about bot detection than `html.`). If that
+    endpoint returns the anti-bot page, falls back to the `html.` endpoint.
+    Returns `(html, source)` where source ∈ {"lite", "html"}.
     """
     params = {"q": query}
     if region:
@@ -66,29 +92,22 @@ def fetch_ddg(query: str, timeout: int, region: str, safe: str) -> str:
 
     headers = {
         "User-Agent": (
-            "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
+            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
             "(KHTML, like Gecko) Chrome/120 Safari/537.36"
         ),
         "Accept": "text/html,application/xhtml+xml;q=0.9,*/*;q=0.8",
         "Accept-Language": "en-US,en;q=0.7",
     }
-    try:
-        import requests  # type: ignore
-        r = requests.post(
-            "https://html.duckduckgo.com/html/",
-            data=params,
-            headers=headers,
-            timeout=timeout,
-        )
-        return r.text
-    except ImportError:
-        from urllib.parse import urlencode
-        from urllib.request import Request, urlopen
-        body = urlencode(params).encode()
-        req = Request("https://html.duckduckgo.com/html/", data=body,
-                      headers=headers)
-        with urlopen(req, timeout=timeout) as resp:  # type: ignore
-            return resp.read().decode("utf-8", errors="replace")
+
+    htmltxt = _ddg_post("https://lite.duckduckgo.com/lite/", params,
+                        headers, timeout)
+    if not is_anomaly_page(htmltxt):
+        return htmltxt, "lite"
+
+    log("lite endpoint returned a challenge; falling back to html endpoint")
+    htmltxt = _ddg_post("https://html.duckduckgo.com/html/", params,
+                        headers, timeout)
+    return htmltxt, "html"
 
 
 def decode_ddg_href(href: str) -> str:
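The lite-then-html fallback in `fetch_ddg` can be exercised without network access by injecting the POST function. The sketch below is not the commit's code: `fetch_with_fallback`, `looks_like_challenge` and the `post` callable are hypothetical names, and only the two endpoint URLs and the "anomaly" + "challenge" heuristic are taken from the diff.

```python
from __future__ import annotations

from typing import Callable

ENDPOINTS = [
    ("lite", "https://lite.duckduckgo.com/lite/"),
    ("html", "https://html.duckduckgo.com/html/"),
]


def looks_like_challenge(html: str) -> bool:
    """Same heuristic as is_anomaly_page in the diff: both words must appear."""
    s = html.lower()
    return "anomaly" in s and "challenge" in s


def fetch_with_fallback(post: Callable[[str], str]) -> tuple[str, str]:
    """Try each endpoint in order and stop at the first non-challenge page.

    If every endpoint returns a challenge, the last page is returned so the
    caller can detect it and report a clear error.
    """
    html, source = "", ""
    for source, url in ENDPOINTS:
        html = post(url)
        if not looks_like_challenge(html):
            break
    return html, source
```

A test stub can then stand in for the HTTP layer, for example `fetch_with_fallback(lambda url: "<a class='result-link'>x</a>")`, which returns the page together with the source label `"lite"`.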
@@ -195,7 +214,7 @@ class _DDGParser(HTMLParser):
 
 
 def parse_ddg_html(htmltxt: str) -> list[dict]:
-    """Parses DDG's HTML and returns [{url, title, snippet, rank}]."""
+    """Parses the HTML of the `html.duckduckgo.com` endpoint."""
     p = _DDGParser()
     try:
         p.feed(htmltxt)
@@ -221,6 +240,100 @@ def parse_ddg_html(htmltxt: str) -> list[dict]:
     return out
 
 
+class _DDGLiteParser(HTMLParser):
+    """Parser for `lite.duckduckgo.com/lite/`.
+
+    Typical structure:
+        <a rel="nofollow" href="<URL>" class='result-link'>title</a>
+        ...
+        <td class='result-snippet'>snippet text</td>
+    Snippets come AFTER the link (not as a child of the same element),
+    so it pairs them in order: each `result-link` consumes the next
+    `result-snippet`.
+    """
+
+    def __init__(self) -> None:
+        super().__init__(convert_charrefs=True)
+        self.results: list[dict] = []
+        self._in_link = False
+        self._in_snippet = False
+        self._cur_href = ""
+        self._title_buf: list[str] = []
+        self._snippet_buf: list[str] = []
+        self._pending_snippet_for: int | None = None
+
+    def _attrs_dict(self, attrs):
+        return {k: (v or "") for k, v in attrs}
+
+    def handle_starttag(self, tag: str, attrs):
+        a = self._attrs_dict(attrs)
+        cls = a.get("class", "")
+        if tag == "a" and "result-link" in cls:
+            href = a.get("href", "")
+            self._in_link = True
+            self._cur_href = href
+            self._title_buf = []
+        elif tag == "td" and "result-snippet" in cls:
+            self._in_snippet = True
+            self._snippet_buf = []
+
+    def handle_endtag(self, tag: str):
+        if self._in_link and tag == "a":
+            title = " ".join("".join(self._title_buf).split())
+            self.results.append({
+                "href": self._cur_href,
+                "title": title,
+                "snippet": "",
+            })
+            self._pending_snippet_for = len(self.results) - 1
+            self._in_link = False
+        elif self._in_snippet and tag == "td":
+            snippet = " ".join("".join(self._snippet_buf).split())
+            if self._pending_snippet_for is not None:
+                self.results[self._pending_snippet_for]["snippet"] = snippet
+                self._pending_snippet_for = None
+            self._in_snippet = False
+
+    def handle_data(self, data: str):
+        if self._in_link:
+            self._title_buf.append(data)
+        elif self._in_snippet:
+            self._snippet_buf.append(data)
+
+
+def parse_ddg_lite(htmltxt: str) -> list[dict]:
+    """Parses the HTML of the `lite.duckduckgo.com/lite/` endpoint."""
+    p = _DDGLiteParser()
+    try:
+        p.feed(htmltxt)
+        p.close()
+    except Exception as e:
+        log(f"DDG lite parser failed: {e}")
+
+    out: list[dict] = []
+    seen: set[str] = set()
+    for r in p.results:
+        href = r.get("href") or ""
+        # lite sends direct absolute URLs; even so we go through
+        # decode_ddg_href in case DDG ever wraps them.
+        url = decode_ddg_href(href)
+        if not url or not url.startswith(("http://", "https://")):
+            continue
+        # Exclude DDG self-promotion (help pages).
+        if "duckduckgo.com/duckduckgo-help-pages/" in url:
+            continue
+        if url in seen:
+            continue
+        seen.add(url)
+        out.append({
+            "url": url,
+            "title": r.get("title") or "",
+            "snippet": r.get("snippet") or "",
+            "rank": len(out) + 1,
+        })
+    return out
+
+
 def find_url_entity(conn: sqlite3.Connection, url: str) -> str | None:
     """Looks up an existing Url node with the same url in metadata."""
     cur = conn.execute(
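The order-based pairing described in the `_DDGLiteParser` docstring (each `result-link` consumes the next `result-snippet`) is easier to follow on a concrete input. The sketch below is a condensed re-statement of that idea, not the commit's class: `LitePairingSketch`, `SAMPLE` and `parse_sample` are hypothetical, and the sample markup is hand-written to resemble the structure the docstring describes rather than captured from DDG.

```python
from html.parser import HTMLParser


class LitePairingSketch(HTMLParser):
    """Pairs each result link with the first snippet cell that follows it."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.results = []
        self._mode = None      # "link", "snippet" or None
        self._buf = []         # text collected for the current element
        self._href = ""
        self._pending = None   # index of the result still waiting for a snippet

    def handle_starttag(self, tag, attrs):
        a = {k: (v or "") for k, v in attrs}
        cls = a.get("class", "")
        if tag == "a" and "result-link" in cls:
            self._mode, self._buf, self._href = "link", [], a.get("href", "")
        elif tag == "td" and "result-snippet" in cls:
            self._mode, self._buf = "snippet", []

    def handle_data(self, data):
        if self._mode:
            self._buf.append(data)

    def handle_endtag(self, tag):
        text = " ".join("".join(self._buf).split())  # collapse whitespace
        if self._mode == "link" and tag == "a":
            self.results.append({"href": self._href, "title": text, "snippet": ""})
            self._pending = len(self.results) - 1
            self._mode = None
        elif self._mode == "snippet" and tag == "td":
            if self._pending is not None:
                self.results[self._pending]["snippet"] = text
                self._pending = None
            self._mode = None


# Hand-written sample, assumed to resemble the lite markup.
SAMPLE = """
<table>
<tr><td><a rel="nofollow" href="https://example.com/a" class='result-link'>First  result</a></td></tr>
<tr><td class='result-snippet'>Snippet for the first.</td></tr>
<tr><td><a rel="nofollow" href="https://example.org/b" class='result-link'>Second</a></td></tr>
<tr><td class='result-snippet'>Snippet for the second.</td></tr>
</table>
"""


def parse_sample(html: str) -> list:
    p = LitePairingSketch()
    p.feed(html)
    p.close()
    return p.results
```

Calling `parse_sample(SAMPLE)` yields two entries, each carrying the href, the whitespace-normalised title and the snippet from the row that follows the link. A link with no following snippet simply keeps an empty `snippet`.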
@@ -384,18 +497,40 @@ def main() -> int:
 
     progress(0.10, "fetching")
     try:
-        htmltxt = fetch_ddg(query, timeout=timeout_s, region=region, safe=safe)
+        htmltxt, source = fetch_ddg(query, timeout=timeout_s,
+                                    region=region, safe=safe)
     except Exception as e:
         log(f"DDG fetch failed: {e}")
         print(json.dumps({"error": str(e), "query": query,
                           "entities_added": 0, "relations_added": 0}))
         return 4
 
+    if is_anomaly_page(htmltxt):
+        log("DDG returned a captcha challenge on both endpoints; "
+            "use web_search_cdp (issue 0029) to solve it")
+        print(json.dumps({
+            "error": "DDG bot challenge — captcha required",
+            "query": query,
+            "engine": "duckduckgo",
+            "source": source,
+            "results": 0,
+            "entities_added": 0,
+            "relations_added": 0,
+        }, ensure_ascii=False))
+        return 4
+
     progress(0.55, "parsing")
-    results = parse_ddg_html(htmltxt)
+    # The parser is chosen by content: if the endpoint and the markup do
+    # not match (tests with a stub that serves any URL, or a future
+    # DDG change), we still extract results. We try both and
+    # keep whichever returns more.
+    results_lite = parse_ddg_lite(htmltxt) if "result-link" in htmltxt else []
+    results_html = parse_ddg_html(htmltxt) if "result__a" in htmltxt else []
+    results = results_lite if len(results_lite) >= len(results_html) else results_html
     if limit > 0:
         results = results[:limit]
-    log(f"DDG returned {len(results)} results")
+    log(f"DDG ({source}) returned {len(results)} results "
+        f"(lite_parsed={len(results_lite)} html_parsed={len(results_html)})")
 
     progress(0.80, "applying")
     conn = sqlite3.connect(ops_db_path)