| extract_graph_from_text |
pipeline |
py |
pipelines |
1.0.0 |
impure |
def extract_graph_from_text(text: str, entity_labels: list[str], relation_labels: list | dict, allowed: dict, model: Any, threshold: float = 0.3, max_chars_per_chunk: int = 1500, overlap_sentences: int = 2) -> dict |
End-to-end pipeline: text -> graph of entities and relations. Orchestrates chunking, per-chunk extraction with GLiNER2, aggregation, typed filtering, and alias resolution. Refactored from the playground of the gliner_glirel_tuning analysis. |
| pipeline |
| graph |
| ner |
| relation-extraction |
| gliner2 |
| nlp |
| e2e |
| knowledge-graph |
| datascience |
| python |
| pendiente-usar |
|
| chunk_with_overlap_py_core |
| extract_graph_gliner2_py_datascience |
| aggregate_extraction_results_py_core |
| filter_relations_by_entity_types_py_core |
| merge_entity_aliases_py_core |
|
|
|
false |
error_go_core |
|
| name |
desc |
| text |
Input text of any length. It is chunked automatically if it exceeds max_chars_per_chunk. Recommended: pre-clean with clean_pdf_text if the text comes from a PDF. |
|
| name |
desc |
| entity_labels |
Entity types for GLiNER2, e.g. ['person', 'organization', 'location']. Use snake_case (better recall according to notebook 08). |
|
| name |
desc |
| relation_labels |
Relation types. Either a list of strings or a dict {label: description}, e.g. ['works_at', 'ceo_of'] or {'ceo_of': 'person is CEO of organization'}. |
|
| name |
desc |
| allowed |
Typed filtering rules {rel_type: (head_types, tail_types)}. Pass {} to disable filtering. E.g. {'ceo_of': (['person'], ['organization'])}. |
|
| name |
desc |
| model |
GLiNER2 instance loaded with gliner2_load_model. Injected by the caller so the model can be cached across calls. |
|
| name |
desc |
| threshold |
GLiNER2 confidence threshold (0-1). 0.3 was validated empirically. Lower = more recall, more noise. |
|
| name |
desc |
| max_chars_per_chunk |
Maximum number of characters per chunk before splitting. 1500 is the optimal value for GLiNER2-large. |
|
| name |
desc |
| overlap_sentences |
Number of overlapping sentences between consecutive chunks. 2 avoids losing entities at chunk boundaries. |
|
|
Dict with 'nodes' (list of {id, type, count}), 'edges' (list of {from, to, kind}), and 'stats' ({n_chunks, n_nodes, n_edges, n_dropped_typed, elapsed_s}). Ready to serialize to JSON and visualize with Sigma/D3. |
true |
| short text produces the expected nodes and edges with a stub model |
| stats has all required fields |
|
python/functions/pipelines/tests/test_extract_graph_from_text.py |
python/functions/pipelines/extract_graph_from_text.py |
Direct refactor of playground/server.py from the analysis
projects/osint_graph/analysis/gliner_glirel_tuning.
All recipes were validated empirically in notebooks 04, 06, and 08:
- threshold=0.3 (notebook 04)
- overlap_sentences=2 (notebook 06)
- typed filtering (notebook 08)
- coreference normalize+substring (playground/server.py)
For PDFs: run extract_pdf_text + clean_pdf_text before calling this pipeline.
For OpenIE without a fixed vocabulary: use extract_triples_spacy_es as an alternative.
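
Illustration of the typed filtering step: a minimal sketch that assumes the node, edge, and `allowed` shapes described in this record. The helper name `filter_edges_by_type` is hypothetical and is not the registry's filter_relations_by_entity_types implementation. Letting relation types without a rule pass through is also an assumption.

```python
from typing import Any


def filter_edges_by_type(
    nodes: list[dict[str, Any]],
    edges: list[dict[str, Any]],
    allowed: dict[str, tuple[list[str], list[str]]],
) -> tuple[list[dict[str, Any]], int]:
    """Hypothetical helper: keep edges whose endpoint types match `allowed`.

    Returns (kept_edges, n_dropped). An empty `allowed` disables filtering.
    """
    if not allowed:
        return list(edges), 0

    # Map each node id to its entity type for O(1) lookups.
    node_type = {n["id"]: n["type"] for n in nodes}

    kept = []
    for edge in edges:
        rule = allowed.get(edge["kind"])
        if rule is None:
            # Assumption: relation types with no rule are not filtered.
            kept.append(edge)
            continue
        head_types, tail_types = rule
        head_ok = node_type.get(edge["from"]) in head_types
        tail_ok = node_type.get(edge["to"]) in tail_types
        if head_ok and tail_ok:
            kept.append(edge)
    return kept, len(edges) - len(kept)


# Made-up sample data, only to exercise the helper.
sample_nodes = [
    {"id": "Ana", "type": "person", "count": 1},
    {"id": "Acme", "type": "organization", "count": 1},
]
sample_edges = [
    {"from": "Ana", "to": "Acme", "kind": "ceo_of"},
    {"from": "Acme", "to": "Ana", "kind": "ceo_of"},  # reversed direction
]
sample_allowed = {"ceo_of": (["person"], ["organization"])}

kept_edges, n_dropped = filter_edges_by_type(sample_nodes, sample_edges, sample_allowed)
```

In this sketch the reversed edge is dropped, and the drop count plays the role of the n_dropped_typed field in 'stats'.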
|
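
The interaction between max_chars_per_chunk and overlap_sentences can be sketched as follows. This is a minimal illustration under assumptions, not the chunk_with_overlap implementation: the function name `chunk_with_sentence_overlap`, the regex-based sentence splitter, and the treatment of the character limit as a soft limit are all hypothetical choices.

```python
import re


def chunk_with_sentence_overlap(
    text: str, max_chars: int = 1500, overlap_sentences: int = 2
) -> list[str]:
    """Hypothetical sketch: split text into chunks of whole sentences,
    repeating the last `overlap_sentences` sentences at the start of
    the next chunk."""
    # Naive splitter (assumption): break after '.', '!' or '?' plus whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

    chunks: list[str] = []
    current: list[str] = []
    for sentence in sentences:
        candidate = " ".join(current + [sentence])
        if current and len(candidate) > max_chars:
            # Close the current chunk and carry its tail into the next one.
            chunks.append(" ".join(current))
            current = current[-overlap_sentences:] if overlap_sentences > 0 else []
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks


# Tiny limit so the overlap is visible on a short input.
demo_chunks = chunk_with_sentence_overlap(
    "A one. B two. C three. D four. E five.", max_chars=20, overlap_sentences=1
)
```

Because the carried-over sentences count toward the next chunk, a chunk may exceed max_chars in this sketch when individual sentences are long; a stricter variant would need to split inside sentences.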