7ec6c4e09f
Each enricher is a manifest.yaml + run.py pair in enrichers/<id>/.
1. fetch_webpage (Url, Webpage):
HTTP GET (requests, falling back to urllib) -> html_to_markdown_py_core ->
sha256(url) -> stores HTML+MD in cache/<aa>/<sha>.{html,md}. Converts
Url -> Webpage with enriched metadata (title/status_code/content_type/
paths/text_length). Creates a Domain with a BELONGS_TO relation.
2. extract_domain (Url, Webpage, Email):
Extracts the domain from metadata.url or metadata.address (no I/O).
Creates/connects a Domain with BELONGS_TO. Useful when the user wants to
see the domain before fetching.
3. extract_links (Webpage):
Reads metadata.markdown_path -> extract_urls_py_cybersecurity -> dedup ->
creates one Url node per link + a LINKS_TO relation. Param max_links (50).
4. extract_text_entities (Webpage):
Reads metadata.markdown_path -> extract_iocs_py_cybersecurity (pure regex,
no cost) -> creates entities per (type, value), typed in the registry:
Email, IPAddress, Domain, FileHash, CryptoWallet, CVE, MACAddress, Phone.
Each one gets an EXTRACTED_FROM relation to the source Webpage. v1 ships
without GLiNER/GLiREL; those require preloaded models (future iteration).
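A minimal sketch of two of the steps above: the sha256-based cache layout used by fetch_webpage and the dedup + max_links cap used by extract_links. The helper names cache_paths and dedup_links are hypothetical, and the sketch assumes that <aa> is the first two hex characters of the digest, that the URL is hashed as UTF-8, and that 50 is the default for max_links. None of these is confirmed by the commit.

```python
import hashlib
from pathlib import Path


def cache_paths(url: str, cache_root: str = "cache") -> dict:
    """Return the HTML and Markdown cache paths for a URL (hypothetical helper)."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    shard = digest[:2]  # assumption: <aa> is the first two hex characters
    base = Path(cache_root) / shard / digest
    # The digest contains no dots, so with_suffix simply appends the extension.
    return {ext: base.with_suffix("." + ext) for ext in ("html", "md")}


def dedup_links(urls: list, max_links: int = 50) -> list:
    """Drop repeated URLs, keep first-seen order, then cap at max_links."""
    unique = list(dict.fromkeys(urls))  # dicts preserve insertion order
    return unique[:max_links]


if __name__ == "__main__":
    print(cache_paths("https://httpbin.org/html")["md"])
    print(dedup_links(["https://a.example/", "https://b.example/", "https://a.example/"]))
```

Sharding by a digest prefix keeps any single cache directory from growing without bound; whether the real run.py caps before or after dedup is not stated, so the order here is a guess.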
Tested end-to-end:
fetch_webpage https://httpbin.org/html -> 1 Webpage + 1 Domain
extract_links -> 2 Url + 2 LINKS_TO
extract_text_entities -> 8 IoCs (Email, IP*2, CVE, Domain*2, Wallet, Phone)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
8 lines
260 B
YAML
id: extract_domain
name: "Extract domain"
description: "Extracts the domain from the node's url/email and creates/connects a Domain entity with a BELONGS_TO relation. Downloads nothing."
applies_to: [Url, Webpage, Email]
emits: [Domain]
relations: [BELONGS_TO]
params: []
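
For illustration, the logic a run.py behind this manifest might contain could look roughly like the sketch below. It is not the repository's actual code: the function name domain_from_metadata is hypothetical, and it assumes the node metadata is a plain dict with an optional url or address key. It returns the full hostname, not the registrable domain, since splitting on the public suffix list would need an extra dependency.

```python
from typing import Optional
from urllib.parse import urlsplit


def domain_from_metadata(metadata: dict) -> Optional[str]:
    """Derive a domain from metadata.url or metadata.address without any I/O."""
    url = metadata.get("url")
    if url:
        # hostname is lowercased and excludes port and credentials;
        # it is None when the URL has no scheme/netloc.
        return urlsplit(url).hostname or None
    address = metadata.get("address")
    if address and "@" in address:
        # Take everything after the last "@" as the email's domain part.
        return address.rsplit("@", 1)[1].strip().lower() or None
    return None


if __name__ == "__main__":
    print(domain_from_metadata({"url": "https://httpbin.org/html"}))
    print(domain_from_metadata({"address": "User@Example.COM"}))
```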