egutierrez dff0c0d2b7 feat(cybersecurity): 8 IoC regex extractors + pure extract_iocs pipeline
New extractors in python/functions/cybersecurity/:
- extract_ip_addresses (IPv4 + IPv6, validated with ipaddress)
- extract_emails (simplified RFC 5322)
- extract_domains (FQDNs with a valid TLD, static list)
- extract_file_hashes (MD5/SHA1/SHA256/SHA512, algorithm inferred from length)
- extract_crypto_wallets (BTC legacy + bech32, ETH 0x + 40 hex)
- extract_cve_ids (CVE-YYYY-NNNN+)
- extract_mac_addresses (xx:xx:xx + xx-xx-xx, uniform separator)
- extract_phone_numbers (E.164 + ES local 9-digit)
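
The "algorithm by length" rule listed for extract_file_hashes can be sketched as below. This is a minimal illustration, not the code from the commit: the names classify_hash and _HEX_LENGTHS are hypothetical, and the only facts relied on are the standard hex digest lengths (MD5 32, SHA-1 40, SHA-256 64, SHA-512 128).

```python
import re
from typing import Optional

# Hex digest length -> algorithm. The lengths are standard; the names of this
# table and of classify_hash are illustrative and do not come from the commit.
_HEX_LENGTHS = {32: "md5", 40: "sha1", 64: "sha256", 128: "sha512"}

_HEX_ONLY = re.compile(r"[0-9A-Fa-f]+")


def classify_hash(candidate: str) -> Optional[str]:
    """Return the algorithm implied by the digest length, or None."""
    if not _HEX_ONLY.fullmatch(candidate):
        return None
    return _HEX_LENGTHS.get(len(candidate))
```

Length alone cannot tell apart algorithms that share a digest size, so a result such as "md5" is best read as "32 hex characters, most likely MD5".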

Pipeline:
- extract_iocs runs all of them and deduplicates contained spans. It stays
  purity:pure by being declared as kind:function with a non-empty
  uses_functions, because the registry rule requires pipelines to be impure.
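
One plausible reading of "contained spans" is that a hit is dropped when its [start, end) range lies entirely inside another hit's range, for example a domain found inside an email address. The sketch below assumes that reading; drop_contained_spans is a hypothetical name and the actual extract_iocs logic may differ.

```python
def drop_contained_spans(hits: list[dict]) -> list[dict]:
    """Keep only hits whose span is not inside a span that was already kept."""
    # Sort by start ascending, then by end descending, so that a containing
    # span is always visited before anything it contains.
    ordered = sorted(hits, key=lambda h: (h["start"], -h["end"]))
    kept: list[dict] = []
    max_end = -1  # furthest end offset among the hits kept so far
    for hit in ordered:
        if hit["end"] <= max_end:
            # An earlier hit starts at or before this one and ends at or
            # after it, so this span is contained (or an exact duplicate).
            continue
        kept.append(hit)
        max_end = hit["end"]
    return kept
```

Partially overlapping spans are both kept under this rule; only full containment and exact duplicates are removed.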

All of them return list[dict] with value/start/end/type so that the
caller (issues 0038-0040) can reconcile offsets with NER spans
without reparsing.
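
To illustrate why carrying offsets helps, here is a small sketch of matching one extractor hit against NER spans. The (start, end, label) tuple format for NER spans and the helper names are assumptions made for the example; the callers in issues 0038-0040 are not shown in this commit.

```python
def spans_overlap(a_start: int, a_end: int, b_start: int, b_end: int) -> bool:
    """True when two half-open ranges [start, end) share at least one character."""
    return a_start < b_end and b_start < a_end


def ner_labels_for(hit: dict, ner_spans: list[tuple[int, int, str]]) -> list[str]:
    """Labels of the (hypothetical) NER spans that overlap an extractor hit."""
    return [
        label
        for start, end, label in ner_spans
        if spans_overlap(hit["start"], hit["end"], start, end)
    ]


text = "Contact admin@example.com today"
hit = {"value": "example.com", "start": 14, "end": 25, "type": "domain"}

# "end" is exclusive, so slicing the original text recovers the value
# without running any regex again.
assert text[hit["start"]:hit["end"]] == hit["value"]

# NER output invented for the example.
ner_spans = [(8, 25, "EMAIL"), (26, 31, "DATE")]
labels = ner_labels_for(hit, ner_spans)
```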

Refs #0037

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 16:41:30 +02:00

59 lines
2.0 KiB
Python

"""Extract valid FQDNs from text, with offsets."""
import re

# Static list of common TLDs (not exhaustive; IANA lists over 1,500).
# Includes the original gTLDs, the most widely used newer ones, and frequent ccTLDs.
_VALID_TLDS = frozenset({
    # original gTLDs
    "com", "org", "net", "edu", "gov", "mil", "int",
    # common gTLDs
    "info", "biz", "name", "pro", "mobi", "asia", "jobs", "tel", "travel",
    "xxx", "post",
    # popular new gTLDs
    "app", "dev", "tech", "cloud", "online", "site", "store",
    "xyz", "top", "shop", "club", "fun", "live", "blog", "page", "news",
    "media", "design", "studio", "agency",
    # ccTLDs commonly used as generic TLDs
    "io", "ai", "co", "me", "tv",
    # frequent ccTLDs
    "us", "uk", "de", "fr", "es", "it", "nl", "be", "se", "no", "fi", "dk",
    "ru", "ua", "pl", "cz", "ch", "at", "pt", "gr", "ie", "tr",
    "ca", "mx", "br", "ar", "cl", "pe", "ve", "uy",
    "cn", "jp", "kr", "in", "id", "th", "vn", "my", "sg", "ph", "tw", "hk",
    "au", "nz",
    "za", "eg", "ma", "ng", "ke",
    "il", "ae", "sa", "qa",
    "eu",
})
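
# Labels: letters/digits with internal hyphens, never starting or ending with a hyphen.
_LABEL = r"[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?"
_DOMAIN_REGEX = re.compile(
    rf"(?<![A-Za-z0-9.-])"  # not preceded by a label character, dot, or hyphen
    rf"(?:{_LABEL}\.)+"     # one or more labels, each followed by a dot
    rf"[A-Za-z]{{2,63}}"    # TLD candidate: letters only
    rf"(?![A-Za-z0-9.-])"   # not followed by a label character, dot, or hyphen
)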


def extract_domains(text: str) -> list[dict]:
    """Extract FQDNs whose TLD is in the static list.

    Only captures names with at least one dot and a recognized TLD. Full
    URLs are not captured (see `extract_urls`). A domain that appears inside
    an email address is still extracted; the caller can deduplicate by
    offsets if needed.
    """
    results = []
    for m in _DOMAIN_REGEX.finditer(text):
        candidate = m.group(0)
        # The TLD is whatever follows the last dot; compare case-insensitively.
        tld = candidate.rsplit(".", 1)[-1].lower()
        if tld not in _VALID_TLDS:
            continue
        results.append({
            "value": candidate,
            "start": m.start(),
            "end": m.end(),  # exclusive, so text[start:end] == value
            "type": "domain",
        })
    return results
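
A short, self-contained demonstration of how the pattern behaves on sample text. The regex is copied from the file so the snippet runs on its own; find_domains and the reduced _DEMO_TLDS set are stand-ins invented for the demo, not part of the module.

```python
import re

# Same pattern as in the module, repeated here so the snippet is standalone.
_LABEL = r"[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?"
_DOMAIN_REGEX = re.compile(
    rf"(?<![A-Za-z0-9.-])"
    rf"(?:{_LABEL}\.)+"
    rf"[A-Za-z]{{2,63}}"
    rf"(?![A-Za-z0-9.-])"
)

# Reduced TLD set for the demo only; the module uses the larger _VALID_TLDS.
_DEMO_TLDS = frozenset({"com", "org", "io"})


def find_domains(text: str, tlds: frozenset = _DEMO_TLDS) -> list[dict]:
    """Demo stand-in for extract_domains with an injectable TLD set."""
    hits = []
    for m in _DOMAIN_REGEX.finditer(text):
        tld = m.group(0).rsplit(".", 1)[-1].lower()
        if tld in tlds:
            hits.append({"value": m.group(0), "start": m.start(),
                         "end": m.end(), "type": "domain"})
    return hits


sample = "Beacon to evil.example.com and mail admin@corp.io, see report.pdf"
found = [h["value"] for h in find_domains(sample)]
# "report.pdf" matches the regex but is filtered out because "pdf" is not a
# listed TLD; the domain part of the email address is reported on its own.
assert found == ["evil.example.com", "corp.io"]

# The trailing lookahead rejects a match that is followed by ".", so a domain
# that closes a sentence is not reported.
assert find_domains("Visit example.com.") == []
```

If domains at the end of a sentence matter for a given corpus, one option is for the caller to strip trailing punctuation before extraction, at the cost of shifting offsets for any stripped text.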