---
name: extract_graph_from_text
kind: pipeline
lang: py
domain: pipelines
version: "1.0.0"
purity: impure
signature: "def extract_graph_from_text(text: str, entity_labels: list[str], relation_labels: list | dict, allowed: dict, model: Any, threshold: float = 0.3, max_chars_per_chunk: int = 1500, overlap_sentences: int = 2) -> dict"
description: "End-to-end pipeline: text -> graph of entities and relations. Orchestrates chunking, per-chunk extraction with GLiNER2, aggregation, typed filtering, and alias resolution. Refactored from the playground of the gliner_glirel_tuning analysis."
tags: [pipeline, graph, ner, relation-extraction, gliner2, nlp, e2e, knowledge-graph, datascience, python]
uses_functions:
  - chunk_with_overlap_py_core
  - extract_graph_gliner2_py_datascience
  - aggregate_extraction_results_py_core
  - filter_relations_by_entity_types_py_core
  - merge_entity_aliases_py_core
uses_types: []
returns: []
returns_optional: false
error_type: "error_go_core"
imports: [time, typing.Any]
params:
  - name: text
    desc: "Input text of any length. Chunked automatically if it exceeds max_chars_per_chunk. Recommended: pre-clean with clean_pdf_text if it comes from a PDF."
  - name: entity_labels
    desc: "Entity types for GLiNER2, e.g. ['person', 'organization', 'location']. Use snake_case (better recall according to notebook 08)."
  - name: relation_labels
    desc: "Relation types. A list of strings or a dict {label: description}, e.g. ['works_at', 'ceo_of'] or {'ceo_of': 'person is CEO of organization'}."
  - name: allowed
    desc: "Typed filtering rules {rel_type: (head_types, tail_types)}. Pass {} to disable filtering. E.g. {'ceo_of': (['person'], ['organization'])}."
  - name: model
    desc: "GLiNER2 instance loaded with gliner2_load_model. Injected by the caller so it can be cached across calls."
  - name: threshold
    desc: "GLiNER2 confidence threshold (0-1). 0.3 was validated empirically. Lower values give more recall and more noise."
  - name: max_chars_per_chunk
    desc: "Maximum number of characters per chunk before splitting. 1500 is the optimal value for GLiNER2-large."
  - name: overlap_sentences
    desc: "Number of sentences shared between consecutive chunks. 2 avoids losing entities at chunk boundaries."
output: "Dict with 'nodes' (list of {id, type, count}), 'edges' (list of {from, to, kind}) and 'stats' ({n_chunks, n_nodes, n_edges, n_dropped_typed, elapsed_s}). Ready to serialize to JSON and visualize with Sigma/D3."
tested: true
tests:
  - "short text produces the expected nodes and edges with a stub model"
  - "stats has all required fields"
test_file_path: "python/functions/pipelines/tests/test_extract_graph_from_text.py"
file_path: "python/functions/pipelines/extract_graph_from_text.py"
notes: |
  Direct refactor of playground/server.py from the analysis
  projects/osint_graph/analysis/gliner_glirel_tuning.
  All recipes were validated empirically in notebooks 04, 06 and 08:
  - threshold=0.3 (notebook 04)
  - overlap_sentences=2 (notebook 06)
  - typed filtering (notebook 08)
  - coreference normalize+substring (playground/server.py)
  For PDFs: use extract_pdf_text + clean_pdf_text before calling this pipeline.
  For OpenIE without a fixed vocabulary: use extract_triples_spacy_es as an alternative.
---

## Example

```python
from datascience.gliner2_load_model import gliner2_load_model
from pipelines.extract_graph_from_text import extract_graph_from_text

# Load the model once and reuse it across calls.
model = gliner2_load_model(device="auto")

ENTITY_LABELS = ["person", "organization", "location"]
RELATION_LABELS = ["works_at", "ceo_of", "headquartered_in", "president_of"]

# Typed filtering rules: relation type -> (allowed head types, allowed tail types).
ALLOWED = {
    "ceo_of": (["person"], ["organization"]),
    "president_of": (["person"], ["organization"]),
    "works_at": (["person"], ["organization"]),
    "headquartered_in": (["organization"], ["location"]),
}

# Spanish sample input, kept verbatim.
text = """Carlos Torres Blanco es el presidente de BBVA.
BBVA tiene su sede corporativa en Bilbao, aunque opera en mas de 30 paises."""

graph = extract_graph_from_text(
    text=text,
    entity_labels=ENTITY_LABELS,
    relation_labels=RELATION_LABELS,
    allowed=ALLOWED,
    model=model,
)

# graph["nodes"] -> [{"id": "Carlos Torres Blanco", "type": "person", "count": 1}, ...]
# graph["edges"] -> [{"from": "Carlos Torres Blanco", "to": "BBVA", "kind": "president_of"}, ...]
# graph["stats"] -> {"n_chunks": 1, "n_nodes": 3, "n_edges": 2, ...}
```
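
To make the `allowed` rules more concrete, the following is a minimal sketch of how typed filtering could work on the documented output shape. It is not the actual `filter_relations_by_entity_types_py_core` implementation, which this page does not show. The helper name `filter_edges_by_type`, the `node_types` lookup, and the choice to let relation types without a rule pass through are all assumptions made for illustration.

```python
from __future__ import annotations

from typing import Any


def filter_edges_by_type(
    edges: list[dict[str, Any]],
    node_types: dict[str, str],
    allowed: dict[str, tuple[list[str], list[str]]],
) -> tuple[list[dict[str, Any]], int]:
    """Hypothetical helper: keep edges whose head/tail types satisfy `allowed`.

    Returns the kept edges and the number of dropped edges, which would
    correspond to the `n_dropped_typed` field in `stats`.
    """
    if not allowed:
        # An empty rule dict disables filtering, as documented for `allowed`.
        return list(edges), 0

    kept: list[dict[str, Any]] = []
    dropped = 0
    for edge in edges:
        rule = allowed.get(edge["kind"])
        if rule is None:
            # Assumption: relation types with no rule are not filtered.
            kept.append(edge)
            continue
        head_types, tail_types = rule
        head_ok = node_types.get(edge["from"]) in head_types
        tail_ok = node_types.get(edge["to"]) in tail_types
        if head_ok and tail_ok:
            kept.append(edge)
        else:
            dropped += 1
    return kept, dropped


# Illustrative data in the documented node/edge shape (not real model output).
node_types = {
    "Carlos Torres Blanco": "person",
    "BBVA": "organization",
    "Bilbao": "location",
}
edges = [
    {"from": "Carlos Torres Blanco", "to": "BBVA", "kind": "president_of"},
    # The head is a location, so this edge violates the ceo_of rule.
    {"from": "Bilbao", "to": "BBVA", "kind": "ceo_of"},
]
allowed = {
    "president_of": (["person"], ["organization"]),
    "ceo_of": (["person"], ["organization"]),
}

kept, n_dropped = filter_edges_by_type(edges, node_types, allowed)
```

In this sketch the `president_of` edge survives and the malformed `ceo_of` edge is counted as dropped. Passing `{}` as `allowed` returns every edge unchanged.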