graph_explorer

dataforge/graph_explorer

Commit Graph

Author

SHA1

Message

Date

egutierrez

6919ebfe9c

feat(enrichers): DuckDuckGo web_search + pytest tests for the 5 enrichers

Adds a web_search enricher applicable to text/Concept/Topic nodes. It
sends a POST to html.duckduckgo.com with the node's query, parses the
results with the stdlib HTMLParser, decodes the uddg= redirect, and
creates N Url nodes with a SEARCH_RESULT_OF relation pointing to the
source node.
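The parsing and redirect-decoding step can be sketched as follows. This is a minimal illustration, not the committed code: the `result__a` CSS class, the `decode_uddg` and `ResultLinkParser` names, and the sample HTML are assumptions made for the example.

```python
from html.parser import HTMLParser
from urllib.parse import parse_qs, urlsplit


def decode_uddg(href):
    """Return the target URL carried in a redirect link's uddg= parameter."""
    query = parse_qs(urlsplit(href).query)
    # parse_qs percent-decodes values; fall back to the href if uddg is absent.
    return query.get("uddg", [href])[0]


class ResultLinkParser(HTMLParser):
    """Collect hrefs of <a> tags carrying the (assumed) result CSS class."""

    def __init__(self, css_class="result__a"):
        super().__init__()
        self.css_class = css_class
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        classes = (attr.get("class") or "").split()
        if self.css_class in classes and attr.get("href"):
            self.urls.append(decode_uddg(attr["href"]))


# Hand-written sample markup, shaped like a redirecting result link.
sample = (
    '<a class="result__a" '
    'href="//duckduckgo.com/l/?uddg=https%3A%2F%2Fexample.com%2Fdocs&amp;rut=abc">Docs</a>'
    '<a class="other" href="https://ignored.example/">x</a>'
)
parser = ResultLinkParser()
parser.feed(sample)
print(parser.urls)
```

Each decoded URL would then become one Url node linked to the source node with SEARCH_RESULT_OF.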

Chainable: after web_search, running fetch_webpage on each Url
completes the search -> fetch -> extract pipeline.

Guards against a badly resolved ops_db_path: normalizes backslashes,
resolves relative paths against app_dir, and validates that the
entities table exists before touching anything (exit codes 7/8/9 with
a JSON summary).
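A rough sketch of such a guard, assuming a SQLite operations database. The commit names exit codes 7/8/9 but not what each one means, so the mapping below is hypothetical, as are the function names and the shape of the JSON summary.

```python
import json
import sqlite3
from pathlib import Path

# Hypothetical mapping: the commit only says "exit codes 7/8/9".
EXIT_DB_MISSING, EXIT_DB_UNREADABLE, EXIT_NO_ENTITIES = 7, 8, 9


def resolve_ops_db(ops_db_path, app_dir):
    """Normalize backslashes and resolve a relative path against app_dir."""
    path = Path(ops_db_path.replace("\\", "/"))
    return path if path.is_absolute() else Path(app_dir) / path


def check_ops_db(db_path):
    """Return 0 if the database has an entities table, else an exit code."""
    if not db_path.is_file():
        return EXIT_DB_MISSING
    try:
        conn = sqlite3.connect(str(db_path))
        try:
            row = conn.execute(
                "SELECT 1 FROM sqlite_master "
                "WHERE type = 'table' AND name = 'entities'"
            ).fetchone()
        finally:
            conn.close()
    except sqlite3.Error:
        return EXIT_DB_UNREADABLE
    return 0 if row else EXIT_NO_ENTITIES


def summary(db_path, code):
    """Build a JSON summary to print before exiting (assumed shape)."""
    return json.dumps({"ok": code == 0, "exit_code": code, "ops_db_path": str(db_path)})
```

A run.py could call `check_ops_db(resolve_ops_db(raw_path, app_dir))`, print the summary, and pass a non-zero code to `sys.exit` before any write happens.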

pytest tests (16/16 green): conftest with a temporary operations.db +
minimal schema, and a requests stub injected via PYTHONPATH to mock the
network. Covers the 5 enrichers (extract_domain, fetch_webpage,
extract_links, extract_text_entities, web_search) + a sanity check of
the manifests.
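The PYTHONPATH stubbing idea can be illustrated like this: a fake `requests.py` is written to a temporary directory that is placed on PYTHONPATH of a child interpreter, so it is found before any installed package. The stub's contents and the `run_with_stub` helper are assumptions for the example, not the project's conftest.

```python
import os
import subprocess
import sys
import tempfile
import textwrap

# Hypothetical stub: just enough surface for an enricher that calls get/post.
STUB = textwrap.dedent('''
    """Stand-in for the real requests package (test use only)."""

    class Response:
        status_code = 200
        text = "<html><title>stub</title></html>"
        headers = {"Content-Type": "text/html"}

    def get(url, **kwargs):
        return Response()

    post = get
''')


def run_with_stub(code):
    """Run `code` in a child interpreter whose PYTHONPATH holds the stub."""
    with tempfile.TemporaryDirectory() as stub_dir:
        stub_file = os.path.join(stub_dir, "requests.py")
        with open(stub_file, "w", encoding="utf-8") as fh:
            fh.write(STUB)
        env = dict(os.environ, PYTHONPATH=stub_dir)
        result = subprocess.run(
            [sys.executable, "-c", code],
            env=env, capture_output=True, text=True, check=True,
        )
        return result.stdout.strip()


print(run_with_stub("import requests; print(requests.get('http://x').status_code)"))
```

Because the enrichers are run as separate run.py processes, shadowing the module through the environment avoids patching inside the child process.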

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-02 16:10:13 +02:00

egutierrez

7ec6c4e09f

feat(enrichers): four web enrichers, fetch + extract trio (issues 0028, 0028b)

Each enricher is a manifest.yaml + run.py pair in enrichers/<id>/.

1. fetch_webpage (Url, Webpage):
   HTTP GET (requests, falling back to urllib) -> html_to_markdown_py_core ->
   sha256(url) -> stores HTML+MD in cache/<aa>/<sha>.{html,md}. Converts
   Url -> Webpage with enriched metadata (title/status_code/content_type/
   paths/text_length). Creates a Domain with a BELONGS_TO relation.

2. extract_domain (Url, Webpage, Email):
   Derives the domain from metadata.url or metadata.address (no I/O).
   Creates/connects a Domain with BELONGS_TO. Useful when the user wants
   to see the domain before fetching.

3. extract_links (Webpage):
   Reads metadata.markdown_path -> extract_urls_py_cybersecurity -> dedup ->
   creates one Url node per link + a LINKS_TO relation. Param max_links (50).

4. extract_text_entities (Webpage):
   Reads metadata.markdown_path -> extract_iocs_py_cybersecurity (pure regex,
   no cost) -> creates entities per (type, value), typed in the registry:
   Email, IPAddress, Domain, FileHash, CryptoWallet, CVE, MACAddress, Phone.
   Each one gets an EXTRACTED_FROM relation to the source Webpage. v1 ships
   without GLiNER/GLiREL; those require pre-loaded models (future iteration).
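The core of each of the four enrichers can be sketched in a few lines. These are illustrative stand-ins only: the function names are hypothetical, `<aa>` is assumed to be the first two hex characters of the digest, and the regexes are deliberately simplified substitutes for extract_urls_py_cybersecurity and extract_iocs_py_cybersecurity, whose real behaviour is not shown in the commit.

```python
import hashlib
import re
from pathlib import Path
from urllib.parse import urlsplit


def cache_paths(url, cache_dir="cache"):
    """Map a URL to cache/<aa>/<sha>.{html,md}.

    Assumption: <aa> is the first two hex characters of the SHA-256 digest.
    """
    sha = hashlib.sha256(url.encode("utf-8")).hexdigest()
    base = Path(cache_dir) / sha[:2]
    return {"html": base / f"{sha}.html", "md": base / f"{sha}.md"}


def domain_of(metadata):
    """Derive a domain from metadata.url or metadata.address without I/O."""
    if metadata.get("url"):
        return urlsplit(metadata["url"]).hostname
    address = metadata.get("address", "")
    return address.rsplit("@", 1)[1].lower() if "@" in address else None


def dedup_links(urls, max_links=50):
    """Drop duplicates, keep first-seen order, cap at max_links."""
    return list(dict.fromkeys(urls))[:max_links]


# Simplified patterns covering three of the eight entity types listed above.
IOC_PATTERNS = {
    "Email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "IPAddress": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "CVE": re.compile(r"\bCVE-\d{4}-\d{4,}\b"),
}


def extract_iocs(text):
    """Return the unique (type, value) pairs found in text."""
    return {
        (kind, match.group(0))
        for kind, pattern in IOC_PATTERNS.items()
        for match in pattern.finditer(text)
    }


print(cache_paths("https://httpbin.org/html")["md"])
print(domain_of({"url": "https://httpbin.org/html"}))
print(dedup_links(["https://a.example", "https://b.example", "https://a.example"]))
```

Keying entities by (type, value) means a value seen twice in the same page yields a single node, which matches the dedup behaviour described for links.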

Tested end-to-end:
  fetch_webpage  https://httpbin.org/html -> 1 Webpage + 1 Domain
  extract_links  -> 2 Url + 2 LINKS_TO
  extract_text_entities -> 8 IoCs (Email, IP*2, CVE, Domain*2, Wallet, Phone)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-01 18:24:52 +02:00

2 Commits