fn_registry/python/functions/cybersecurity/extract_file_hashes.md
egutierrez 6526da32dc feat(cybersecurity): 8 IoC regex extractors + pure extract_iocs pipeline
New extractors in python/functions/cybersecurity/:
- extract_ip_addresses (IPv4 + IPv6, validated with ipaddress)
- extract_emails (simplified RFC 5322)
- extract_domains (FQDNs with a valid TLD, static list)
- extract_file_hashes (MD5/SHA1/SHA256/SHA512, algorithm inferred from length)
- extract_crypto_wallets (BTC legacy + bech32, ETH 0x + 40 hex)
- extract_cve_ids (CVE-YYYY-NNNN+)
- extract_mac_addresses (xx:xx:xx + xx-xx-xx, uniform separator)
- extract_phone_numbers (E.164 + ES local 9-digit)

Pipeline:
- extract_iocs runs all of them and deduplicates contained spans. It
  stays purity:pure by being declared as kind:function with a non-empty
  uses_functions, because the registry rule requires pipelines to be
  impure.

All of them return list[dict] with value/start/end/type so that the
caller (issues 0038-0040) can reconcile offsets with NER spans
without reparsing.

Refs #0037

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 16:24:11 +02:00


name: extract_file_hashes
kind: function
lang: py
domain: cybersecurity
version: 1.0.0
purity: pure
signature: def extract_file_hashes(text: str) -> list[dict]
description: Extracts MD5/SHA1/SHA256/SHA512 hashes from a text, with offsets and the algorithm inferred from length (32, 40, 64 or 128 hex characters). Useful for extracting IoCs from threat intelligence reports.
tags: ioc, hash, md5, sha1, sha256, sha512, regex, extract, cybersecurity, python
uses_functions:
uses_types:
returns:
returns_optional: false
error_type:
imports: re
params:
  - name: text
    desc: text string from which to extract hex hashes
output: list of dicts with {value, start, end, type='file_hash', algorithm} for each hash found
tested: true
tests:
  - MD5 (32 hex), SHA1 (40), SHA256 (64), SHA512 (128)
  - Intermediate lengths are ignored
  - Case-insensitive for hex
test_file_path: python/functions/cybersecurity/tests/test_extract_iocs.py
file_path: python/functions/cybersecurity/extract_file_hashes.py

Example

extract_file_hashes("MD5: 5d41402abc4b2a76b9719d911017c592 SHA1: aaf4c61ddcc5e8a2dabede0f3b482cd9aea9434d")
# [{"value": "5d41402abc4b2a76b9719d911017c592", "start": 5, "end": 37,
#   "type": "file_hash", "algorithm": "md5"},
#  {"value": "aaf4c61ddcc5e8a2dabede0f3b482cd9aea9434d", "start": 44, "end": 84,
#   "type": "file_hash", "algorithm": "sha1"}]
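
The source file extract_file_hashes.py is not shown on this page, so the following is only a minimal sketch of how an extractor with the documented behavior could be written. The names `extract_file_hashes_sketch`, `_HEX_RUN` and `_ALGORITHM_BY_LENGTH` are hypothetical, and the pattern (a whole hex run between `\b` boundaries, filtered by length) is an assumption rather than the registry's actual regex.

```python
import re

# Assumption: the algorithm is inferred from the length of the hex run alone.
_ALGORITHM_BY_LENGTH = {32: "md5", 40: "sha1", 64: "sha256", 128: "sha512"}

# Assumption: hypothetical pattern. It matches a complete run of hex digits
# delimited by word boundaries, so a slice of a longer run never matches.
_HEX_RUN = re.compile(r"\b[0-9a-fA-F]+\b")


def extract_file_hashes_sketch(text: str) -> list[dict]:
    """Return one dict per hex run whose length is 32, 40, 64 or 128."""
    results = []
    for match in _HEX_RUN.finditer(text):
        algorithm = _ALGORITHM_BY_LENGTH.get(len(match.group()))
        if algorithm is None:
            # Non-canonical length (for example 50 hex characters): skip it.
            continue
        results.append(
            {
                "value": match.group(),
                "start": match.start(),
                "end": match.end(),
                "type": "file_hash",
                "algorithm": algorithm,
            }
        )
    return results
```

Under these assumptions the sketch reproduces the offsets in the example above, and it keeps the matched value in its original case; whether the real function normalizes case is not stated here.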

Notes

Only canonical lengths (32/40/64/128 hex characters) are detected; a 50-character hex sequence is ignored. The \b word boundary prevents matching substrings of a longer hex run. ETH wallets (0x + 40 hex = 42 characters in total) do NOT match this extractor, because of the \b and because this pattern has no 0x prefix; the extract_iocs pipeline deduplicates overlaps if there were any.
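
The deduplication step in extract_iocs is not shown on this page. As a rough illustration of what dropping contained spans could look like, the sketch below removes any span that lies entirely inside another one. The function name `drop_contained_spans` and the exact rule (identical spans count as contained, partial overlaps are kept) are assumptions, not the pipeline's confirmed behavior.

```python
def drop_contained_spans(spans: list[dict]) -> list[dict]:
    """Keep only spans that are not fully covered by another span.

    Assumption: each dict carries integer "start" and "end" offsets,
    with "end" exclusive, as in the extractor output format.
    """
    # Sort by start ascending and, for equal starts, longest span first,
    # so any span that could contain the current one has already been seen.
    ordered = sorted(spans, key=lambda s: (s["start"], -s["end"]))
    kept = []
    max_end = -1  # furthest end offset among the spans kept so far
    for span in ordered:
        if span["end"] <= max_end:
            # An earlier span starts no later and ends no earlier: contained.
            continue
        kept.append(span)
        max_end = span["end"]
    return kept
```

For instance, a domain span covering `example.com` inside an email span covering `user@example.com` would be dropped, while two spans that only partially overlap would both survive.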