| extraction_pipeline |
pipeline |
py |
pipelines |
1.0.0 |
impure |
def extraction_pipeline(file_path: str, entity_presets: list[dict], relation_types: list[str], llm_chat_json: Callablelist[dict, dict], chunk_size: int = 500, chunk_overlap: int = 50, confidence_threshold: float = 0.5, dedup_threshold: float = 0.85, on_progress: Callable[[str, float], None] | None = None) -> ExtractionResult |
Pipeline completa de extraccion de entidades y relaciones desde un documento. Orquesta extract_text_from_file -> preprocess_text -> split_text_into_chunks -> extract_entities_llm por chunk -> deduplicate_entities -> extract_relations_llm por chunk -> deduplicate_relations. |
| pipeline |
| extraction |
| entities |
| relations |
| llm |
| nlp |
| fuzzygraph |
| datascience |
|
| extract_text_from_file_py_core |
| preprocess_text_py_core |
| split_text_into_chunks_py_core |
| build_entity_schema_prompt_py_datascience |
| build_relation_schema_prompt_py_datascience |
| extract_entities_llm_py_datascience |
| extract_relations_llm_py_datascience |
| deduplicate_entities_py_datascience |
| deduplicate_relations_py_datascience |
|
| entity_candidate_py_datascience |
| extraction_result_py_datascience |
| extraction_stats_py_datascience |
| relation_candidate_py_datascience |
|
| extraction_result_py_datascience |
|
false |
error_go_core |
| time |
| warnings |
| typing.Callable |
|
| name |
desc |
| file_path |
Ruta del documento (PDF, TXT, Markdown) a procesar |
|
| name |
desc |
| entity_presets |
Configuración de tipos de entidades a extraer con sus metadatos |
|
| name |
desc |
| relation_types |
Tipos de relaciones a extraer (ej: 'owns', 'operates', 'communicates_with') |
|
| name |
desc |
| llm_chat_json |
Función inyectada para llamadas al LLM (sin acoplamiento a proveedor) |
|
| name |
desc |
| chunk_size |
Tamaño de chunks para procesamiento (default 500) |
|
| name |
desc |
| chunk_overlap |
Solapamiento entre chunks (default 50) |
|
| name |
desc |
| confidence_threshold |
Confianza mínima para incluir entidades (default 0.5) |
|
| name |
desc |
| dedup_threshold |
Umbral fuzzy para deduplicación (default 0.85) |
|
| name |
desc |
| on_progress |
Callback opcional para progreso (msg, percentage) |
|
|
ExtractionResult con entidades, relaciones y estadísticas del proceso de extracción |
true |
| documento con entidades y relaciones retorna ExtractionResult completo |
| documento vacio retorna ExtractionResult con listas vacias |
| documento sin entidades detectables retorna listas vacias |
| archivo no encontrado lanza FileNotFoundError |
| entity presets vacio lanza ValueError |
| progress callback se invoca durante la ejecucion |
| stats se rellenan correctamente con conteos y tiempo |
|
python/functions/pipelines/extraction_pipeline_test.py |
python/functions/pipelines/extraction_pipeline.py |