egutierrez 47fac22230 chore: auto-commit (799 archivos)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 00:28:20 +02:00


name, kind, lang, domain, version, purity, signature, description, tags, uses_functions, uses_types, returns, returns_optional, error_type, imports, params, output, tested, tests, test_file_path, file_path, notes
name: extract_graph_from_text
kind: pipeline
lang: py
domain: pipelines
version: 1.0.0
purity: impure
signature: def extract_graph_from_text(text: str, entity_labels: list[str], relation_labels: list | dict, allowed: dict, model: Any, threshold: float = 0.3, max_chars_per_chunk: int = 1500, overlap_sentences: int = 2) -> dict
description: End-to-end pipeline: text -> graph of entities and relations. Orchestrates chunking, per-chunk extraction with GLiNER2, aggregation, typed filtering, and alias resolution. Refactored from the playground of the gliner_glirel_tuning analysis.
tags: pipeline, graph, ner, relation-extraction, gliner2, nlp, e2e, knowledge-graph, datascience, python, pendiente-usar
uses_functions: chunk_with_overlap_py_core, extract_graph_gliner2_py_datascience, aggregate_extraction_results_py_core, filter_relations_by_entity_types_py_core, merge_entity_aliases_py_core
returns_optional: false
error_type: error_go_core
imports: time, typing.Any
params:
- text: Input text of any length. It is chunked automatically when it exceeds max_chars_per_chunk. Recommended: pre-clean it with clean_pdf_text if it comes from a PDF.
- entity_labels: Entity types for GLiNER2, e.g. ['person', 'organization', 'location']. Use snake_case (better recall according to notebook 08).
- relation_labels: Relation types. Either a list of strings or a dict {label: description}, e.g. ['works_at', 'ceo_of'] or {'ceo_of': 'person is CEO of organization'}.
- allowed: Typed filtering rules {rel_type: (head_types, tail_types)}, e.g. {'ceo_of': (['person'], ['organization'])}. Pass {} to disable the filtering.
- model: GLiNER2 instance loaded with gliner2_load_model. Injected by the caller so it can be cached across calls.
- threshold: GLiNER2 confidence threshold (0-1). 0.3 was validated empirically. Lower values give more recall but more noise.
- max_chars_per_chunk: Maximum number of characters per chunk before splitting. 1500 is the optimal value for GLiNER2-large.
- overlap_sentences: Number of sentences shared between consecutive chunks. 2 avoids losing entities at chunk boundaries.
output: Dict with 'nodes' (list of {id, type, count}), 'edges' (list of {from, to, kind}), and 'stats' ({n_chunks, n_nodes, n_edges, n_dropped_typed, elapsed_s}). Ready to serialize to JSON and visualize with Sigma/D3.
tested: true
tests:
- a short text produces the expected nodes and edges with a stub model
- stats has all the required fields
test_file_path: python/functions/pipelines/tests/test_extract_graph_from_text.py
file_path: python/functions/pipelines/extract_graph_from_text.py
notes: Direct refactor of playground/server.py from the analysis at projects/osint_graph/analysis/gliner_glirel_tuning. All recipes were validated empirically in notebooks 04, 06 and 08:
- threshold=0.3 (notebook 04)
- overlap_sentences=2 (notebook 06)
- typed filtering (notebook 08)
- coreference normalize+substring (playground/server.py)
For PDFs: run extract_pdf_text + clean_pdf_text before calling this pipeline. For OpenIE without a fixed vocabulary: use extract_triples_spacy_es as an alternative.
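
The typed filtering driven by `allowed` can be illustrated with a small standalone sketch. It is not the code of filter_relations_by_entity_types: the helper name `filter_edges_by_type`, the `node_types` lookup, and the choice to keep relation kinds that have no rule are all assumptions made for illustration. Only the rule shape {rel_type: (head_types, tail_types)} and the "{} disables filtering" behavior come from the parameter description.

```python
def filter_edges_by_type(edges, node_types, allowed):
    """Hypothetical helper: keep edges whose head/tail entity types satisfy `allowed`.

    edges      -- list of {"from", "to", "kind"} dicts
    node_types -- dict mapping node id -> entity type (assumed lookup structure)
    allowed    -- {rel_type: (head_types, tail_types)}; {} disables the filter
    Returns (kept_edges, n_dropped).
    """
    edges = list(edges)
    if not allowed:
        # An empty rule set disables typed filtering entirely.
        return edges, 0

    kept = []
    for edge in edges:
        rule = allowed.get(edge["kind"])
        if rule is None:
            # Assumption: relation kinds without a rule pass through unchanged.
            kept.append(edge)
            continue
        head_types, tail_types = rule
        head_ok = node_types.get(edge["from"]) in head_types
        tail_ok = node_types.get(edge["to"]) in tail_types
        if head_ok and tail_ok:
            kept.append(edge)
    return kept, len(edges) - len(kept)


node_types = {
    "Carlos Torres Blanco": "person",
    "BBVA": "organization",
    "Bilbao": "location",
}
edges = [
    {"from": "Carlos Torres Blanco", "to": "BBVA", "kind": "president_of"},
    # Wrong head type: a location cannot be president of an organization.
    {"from": "Bilbao", "to": "BBVA", "kind": "president_of"},
]
allowed = {"president_of": (["person"], ["organization"])}

kept, n_dropped = filter_edges_by_type(edges, node_types, allowed)
```

Here the second edge is dropped because its head is a location, so `n_dropped` is 1. That count is the kind of value the n_dropped_typed field in stats would plausibly report.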

Example

from datascience.gliner2_load_model import gliner2_load_model
from pipelines.extract_graph_from_text import extract_graph_from_text

model = gliner2_load_model(device="auto")

ENTITY_LABELS = ["person", "organization", "location"]
RELATION_LABELS = ["works_at", "ceo_of", "headquartered_in", "president_of"]
ALLOWED = {
    "ceo_of":       (["person"], ["organization"]),
    "president_of": (["person"], ["organization"]),
    "works_at":     (["person"], ["organization"]),
    "headquartered_in": (["organization"], ["location"]),
}

text = """Carlos Torres Blanco es el presidente de BBVA.
BBVA tiene su sede corporativa en Bilbao, aunque opera en mas de 30 paises."""

graph = extract_graph_from_text(
    text=text,
    entity_labels=ENTITY_LABELS,
    relation_labels=RELATION_LABELS,
    allowed=ALLOWED,
    model=model,
)
# graph["nodes"] -> [{"id": "Carlos Torres Blanco", "type": "person", "count": 1}, ...]
# graph["edges"] -> [{"from": "Carlos Torres Blanco", "to": "BBVA", "kind": "president_of"}, ...]
# graph["stats"] -> {"n_chunks": 1, "n_nodes": 3, "n_edges": 2, ...}
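
Since the output is described as ready to serialize to JSON and visualize with Sigma/D3, the following sketch shows one way to do that. The `to_d3_links` helper is hypothetical and not part of the pipeline. It renames `from`/`to` to the `source`/`target` keys that D3 force layouts conventionally expect. The `graph` dict is a hand-written stand-in with the documented shape, not real pipeline output.

```python
import json


def to_d3_links(graph: dict) -> dict:
    """Hypothetical adapter: map the pipeline's edge keys to D3's source/target."""
    return {
        "nodes": [dict(node) for node in graph["nodes"]],
        "links": [
            {"source": edge["from"], "target": edge["to"], "kind": edge["kind"]}
            for edge in graph["edges"]
        ],
    }


# Stand-in data following the documented output shape (values are illustrative).
graph = {
    "nodes": [
        {"id": "Carlos Torres Blanco", "type": "person", "count": 1},
        {"id": "BBVA", "type": "organization", "count": 1},
    ],
    "edges": [
        {"from": "Carlos Torres Blanco", "to": "BBVA", "kind": "president_of"},
    ],
    "stats": {"n_chunks": 1, "n_nodes": 2, "n_edges": 1},
}

# ensure_ascii=False keeps accented characters readable in the JSON file.
payload = json.dumps(to_d3_links(graph), ensure_ascii=False, indent=2)
restored = json.loads(payload)
```

On the JavaScript side, a D3 force layout would need `forceLink().id(d => d.id)` so that the string ids in `source` and `target` resolve to node objects.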