| name | kind | lang | domain | version | purity | signature | description | tags | uses_functions | uses_types | returns | returns_optional | error_type | imports | params | output | tested | tests | test_file_path | file_path |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| extraction_pipeline | pipeline | py | pipelines | 1.0.0 | impure | `def extraction_pipeline(file_path: str, entity_presets: list[dict], relation_types: list[str], llm_chat_json: Callable[[list[dict]], dict], chunk_size: int = 500, chunk_overlap: int = 50, confidence_threshold: float = 0.5, dedup_threshold: float = 0.85, on_progress: Callable[[str, float], None] \| None = None) -> ExtractionResult` | Complete pipeline for extracting entities and relations from a document. Orchestrates extract_text_from_file -> preprocess_text -> split_text_into_chunks -> extract_entities_llm per chunk -> deduplicate_entities -> extract_relations_llm per chunk -> deduplicate_relations. | | | | | false | error_go_core | | | ExtractionResult with the entities, relations, and statistics of the extraction process | true | | python/functions/pipelines/extraction_pipeline_test.py | python/functions/pipelines/extraction_pipeline.py |
Example
from python.functions.pipelines.extraction_pipeline import extraction_pipeline

entity_presets = [
    {
        "type_ref": "osint_person_go_cybersecurity",
        "label": "Person",
        "metadata_fields": ["full_name", "alias", "nationality"],
    },
    {
        "type_ref": "osint_domain_go_cybersecurity",
        "label": "Domain",
        "metadata_fields": ["fqdn", "registrar"],
    },
]
relation_types = ["operates", "owns", "funds", "communicates_with", "related_to"]

# Inject a real LLM client
def llm_chat_json(messages):
    # Call the chosen LLM provider here
    ...

result = extraction_pipeline(
    file_path="report.pdf",
    entity_presets=entity_presets,
    relation_types=relation_types,
    llm_chat_json=llm_chat_json,
    chunk_size=500,
    chunk_overlap=50,
    confidence_threshold=0.5,
    dedup_threshold=0.85,
    on_progress=lambda msg, pct: print(f"[{pct:.0%}] {msg}"),
)
print(f"Entities: {len(result.entities)}, Relations: {len(result.relations)}")
print(f"Stats: {result.stats}")

# Integrate with fuzzygraph / operations.db
# `db` is assumed to be an already-open database handle; it is not defined in this example.
for entity in result.entities:
    db.add_entity(
        name=entity.name,
        type_ref=entity.type_ref,
        metadata=entity.attributes,
    )
for relation in result.relations:
    db.add_relation(
        name=relation.relation_type,
        from_entity=relation.from_id,
        to_entity=relation.to_id,
    )
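The `on_progress` callback in the example only prints. When progress has to be inspected afterwards (in a test, for instance), a small recorder object can be passed instead. The sketch below assumes only the `Callable[[str, float], None]` shape from the signature; `ProgressRecorder` is a hypothetical helper and not part of the library.

```python
from dataclasses import dataclass, field


@dataclass
class ProgressRecorder:
    """Hypothetical helper: collects (message, fraction) progress events."""

    events: list[tuple[str, float]] = field(default_factory=list)

    def __call__(self, msg: str, pct: float) -> None:
        # Clamp to [0, 1] so an out-of-range value is never stored.
        self.events.append((msg, min(max(pct, 0.0), 1.0)))

    @property
    def last_fraction(self) -> float:
        # 0.0 until the first event arrives.
        return self.events[-1][1] if self.events else 0.0


recorder = ProgressRecorder()
# Simulated calls; in real use, pass on_progress=recorder to extraction_pipeline.
recorder("entities", 0.4)
recorder("relations", 0.8)
recorder("done", 1.0)
```

Because the instance is callable, it can be passed wherever the lambda was used, and `recorder.events` keeps the full history.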
Algorithm
- Extract: `extract_text_from_file(file_path)` reads raw text from PDF, TXT, or Markdown.
- Preprocess: `preprocess_text(text)` normalizes whitespace and special characters.
- Split: `split_text_into_chunks(text, chunk_size, chunk_overlap)` divides the text into overlapping windows.
- Extract entities per chunk (0-40%): for each chunk, calls `extract_entities_llm` with the preset schema and annotates `source_chunk_index` on every candidate.
- Filter: keeps candidates with `confidence >= confidence_threshold`.
- Deduplicate entities (40%): `deduplicate_entities` applies fuzzy matching and produces `entity_id_map`.
- Extract relations per chunk (40-80%): for each chunk, takes the entities found in that chunk and calls `extract_relations_llm`.
- Deduplicate relations (80-100%): `deduplicate_relations` resolves names to IDs and collapses duplicates.
- Return: `ExtractionResult` with the entities, relations, and process stats.
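The split step can be pictured as a sliding window. The sketch below is an assumption about how `chunk_size` and `chunk_overlap` interact: it counts characters, whereas the real `split_text_into_chunks` may count words or tokens. The name `split_text_into_chunks_sketch` is hypothetical.

```python
def split_text_into_chunks_sketch(
    text: str, chunk_size: int = 500, chunk_overlap: int = 50
) -> list[str]:
    """Character-based sliding window (illustrative only)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks: list[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


# 1200 characters with a repeating alphabet so that overlaps are visible.
sample = "".join(chr(65 + i % 26) for i in range(1200))
chunks = split_text_into_chunks_sketch(sample, chunk_size=500, chunk_overlap=50)
```

With these values the window advances 450 characters at a time, so the last 50 characters of one chunk reappear at the start of the next. This keeps an entity mention that straddles a boundary intact in at least one chunk.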
Notes
- The `llm_chat_json` parameter injects the LLM client, so the pipeline is not coupled to any provider (OpenAI, Anthropic, Ollama, etc.).
- The progress callback covers: 0-40% entity extraction, 40-80% relation extraction, 80-100% deduplication.
- If the file does not exist, `FileNotFoundError` is raised before any LLM call.
- If `entity_presets` is empty, `ValueError` is raised.
- Errors in individual chunks are caught and reported as warnings, and processing continues (robustness).
- The `entity_id_map` produced by `deduplicate_entities` links the original names found in the text to the final UUID IDs used by `deduplicate_relations`.
- The returned `ExtractionResult` is ready to insert into `operations.db` via `fn ops entity add` / `fn ops relation add`.
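To make the role of `entity_id_map` concrete, here is a minimal sketch of fuzzy name deduplication followed by relation resolution. It uses the standard-library `difflib.SequenceMatcher` as a stand-in similarity measure; the metric used by the real `deduplicate_entities` is not stated here. The names `dedup_names_sketch` and `resolve_relations_sketch` are hypothetical.

```python
import uuid
from difflib import SequenceMatcher


def dedup_names_sketch(
    names: list[str], threshold: float = 0.85
) -> tuple[dict[str, str], dict[str, str]]:
    """Return (canonical_by_id, entity_id_map) for a list of raw names."""
    canonical_by_id: dict[str, str] = {}  # UUID -> first name seen for that entity
    entity_id_map: dict[str, str] = {}    # every original name -> UUID
    for name in names:
        match_id = None
        for ent_id, canonical in canonical_by_id.items():
            ratio = SequenceMatcher(None, name.lower(), canonical.lower()).ratio()
            if ratio >= threshold:
                match_id = ent_id
                break
        if match_id is None:
            match_id = str(uuid.uuid4())
            canonical_by_id[match_id] = name
        entity_id_map[name] = match_id
    return canonical_by_id, entity_id_map


def resolve_relations_sketch(
    relations: list[tuple[str, str, str]], entity_id_map: dict[str, str]
) -> set[tuple[str, str, str]]:
    """Map (source_name, type, target_name) to IDs and collapse duplicates."""
    resolved = set()
    for source, rel_type, target in relations:
        # Relations that mention an unknown entity are skipped in this sketch.
        if source in entity_id_map and target in entity_id_map:
            resolved.add((entity_id_map[source], rel_type, entity_id_map[target]))
    return resolved


names = ["Acme Corp", "ACME Corp.", "example.org"]
canonical_by_id, entity_id_map = dedup_names_sketch(names)
relations = [
    ("Acme Corp", "owns", "example.org"),
    ("ACME Corp.", "owns", "example.org"),
]
resolved = resolve_relations_sketch(relations, entity_id_map)
```

Both spellings of the company map to the same UUID, so the two raw relations collapse into one after resolution.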