63a9cb5273
Datascience: aggregate_by_group, deduplicate_entities/relations, detect_drift, diff_entities/relations, extract_entities/relations_llm, hotness_score, melt, merge_graphs, pivot, build_entity/relation_schema_prompt. Finance: avellaneda_stoikov_quotes, generate_gbm_prices, generate_taker_order, hawkes_intensity + finance.py module. Cybersecurity: envelope_encrypt/decrypt + cybersecurity.py module. Pipelines: extraction_pipeline, monte_carlo_market, run_market_sim. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
3.3 KiB
| name | kind | lang | domain | version | purity | signature | description | tags | uses_functions | uses_types | returns | returns_optional | error_type | imports | tested | tests | test_file_path | file_path |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| deduplicate_relations | function | py | datascience | 1.0.0 | pure | def deduplicate_relations(relations: list[RelationCandidate], entity_id_map: dict[str, str]) -> list[RelationCandidate] | Deduplicates candidate relations by resolving from_name/to_name to final entity IDs via entity_id_map. Discards self-loops and relations with no match. Merges duplicates (same from_id, to_id, relation_type) by concatenating unique descriptions and taking the max confidence. | | | | | false | | | true | | python/functions/datascience/deduplicate_relations_test.py | python/functions/datascience/deduplicate_relations.py |
Example
from python.types.datascience.relation_candidate import RelationCandidate
from python.functions.datascience.deduplicate_relations import deduplicate_relations

# entity_id_map produced by deduplicate_entities
entity_id_map = {
    "john smith": "entity_001",
    "smith, john": "entity_001",  # merged alias
    "acme corp": "entity_002",
}

relations = [
    RelationCandidate(from_name="John Smith", to_name="Acme Corp",
                      relation_type="works_at", description="John is CEO",
                      confidence=0.9, source_chunk_index=0),
    RelationCandidate(from_name="Smith, John", to_name="Acme Corp",
                      relation_type="works_at", description="CEO of Acme",
                      confidence=0.7, source_chunk_index=2),
]

result = deduplicate_relations(relations, entity_id_map)
# → 1 RelationCandidate with from_id="entity_001", to_id="entity_002",
#   confidence=0.9, description="John is CEO; CEO of Acme"
Notes
The function is pure: it performs no I/O and has no side effects. Logging is at DEBUG/WARNING level; in production, configure the application's logger.
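A minimal sketch of the production logging setup mentioned above, using only the standard `logging` module. The logger name is an assumption: it supposes the module creates its logger with `logging.getLogger(__name__)`, which the source does not confirm. `configure_dedup_logging` is a hypothetical helper, not part of the project.

```python
import logging

# ASSUMPTION: the module logs under its dotted import path, as it would if it
# called logging.getLogger(__name__). Adjust if the real logger name differs.
LOGGER_NAME = "python.functions.datascience.deduplicate_relations"


def configure_dedup_logging(level: int = logging.WARNING) -> logging.Logger:
    """Set the threshold for the (assumed) deduplicate_relations logger."""
    logger = logging.getLogger(LOGGER_NAME)
    logger.setLevel(level)
    return logger


# Keep WARNING messages (discarded relations) and silence DEBUG output.
logger = configure_dedup_logging(logging.WARNING)
```

Passing `logging.DEBUG` instead would surface the DEBUG-level messages while investigating a specific extraction run.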
Name resolution:
- Exact lookup first (the name, lowercased and stripped, against the map's keys).
- If there is no exact match, fuzzy match with Levenshtein distance (threshold=3 edits).
- If there is still no match, the relation is discarded with logger.warning.
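The resolution order above can be sketched as follows. This is an illustration under stated assumptions, not the project's implementation: `resolve_name` and `levenshtein` are hypothetical helpers, the threshold is read as "at most 3 edits", and ties between fuzzy candidates go to the first key in the map.

```python
from typing import Optional


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insert, delete, substitute) via row-by-row dynamic programming."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def resolve_name(name: str, entity_id_map: dict[str, str],
                 threshold: int = 3) -> Optional[str]:
    """Return the entity ID for `name`, or None if nothing matches."""
    key = name.lower().strip()
    # 1. Exact lookup against the map's keys.
    if key in entity_id_map:
        return entity_id_map[key]
    # 2. Fuzzy fallback: closest key within `threshold` edits (ASSUMPTION: inclusive).
    best_id, best_dist = None, threshold + 1
    for candidate, entity_id in entity_id_map.items():
        dist = levenshtein(key, candidate)
        if dist < best_dist:
            best_id, best_dist = entity_id, dist
    # 3. None signals "no match"; the caller would log a warning and drop the relation.
    return best_id


id_map = {"john smith": "entity_001", "acme corp": "entity_002"}
exact = resolve_name("  John Smith ", id_map)
fuzzy = resolve_name("Jon Smith", id_map)
missing = resolve_name("Initech Holdings", id_map)
```

A fixed edit-distance threshold is more permissive for short names than for long ones, so a threshold of 3 can conflate distinct short entity names; that trade-off is worth checking against real data.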
Self-loops: relations where from_id == to_id are always discarded.
Merge: when several relations share (from_id, to_id, relation_type):
- confidence: max of the group.
- description: union of the unique (non-duplicate) descriptions, separated by '; '.
- from_name/to_name/source_chunk_index: taken from the first candidate in the group.
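The self-loop and merge rules can be sketched on already-resolved relations as below. `ResolvedRelation` is a hypothetical stand-in with only the fields needed here; it is not the project's `RelationCandidate`, and `merge_resolved` is an illustrative name.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ResolvedRelation:
    """Hypothetical stand-in for a relation whose endpoints are already entity IDs."""
    from_id: str
    to_id: str
    relation_type: str
    description: str
    confidence: float
    source_chunk_index: int


def merge_resolved(relations: list[ResolvedRelation]) -> list[ResolvedRelation]:
    # Group by (from_id, to_id, relation_type), skipping self-loops.
    groups: dict[tuple[str, str, str], list[ResolvedRelation]] = {}
    for rel in relations:
        if rel.from_id == rel.to_id:
            continue  # self-loops are always discarded
        key = (rel.from_id, rel.to_id, rel.relation_type)
        groups.setdefault(key, []).append(rel)

    merged = []
    for group in groups.values():
        first = group[0]
        # dict.fromkeys removes duplicate descriptions while keeping first-seen order.
        unique = list(dict.fromkeys(r.description for r in group if r.description))
        merged.append(replace(
            first,  # remaining fields come from the first candidate in the group
            description="; ".join(unique),
            confidence=max(r.confidence for r in group),
        ))
    return merged


merged = merge_resolved([
    ResolvedRelation("entity_001", "entity_002", "works_at", "John is CEO", 0.9, 0),
    ResolvedRelation("entity_001", "entity_002", "works_at", "CEO of Acme", 0.7, 2),
    ResolvedRelation("entity_001", "entity_002", "works_at", "John is CEO", 0.4, 5),
    ResolvedRelation("entity_001", "entity_001", "knows", "self reference", 0.5, 3),
])
```

Relying on insertion-ordered dictionaries keeps the output order stable across runs, which makes the merged list easier to diff.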
Integration with fuzzygraph:
This function is step 4 of the extraction pipeline. It receives the output of
extract_relations_llm (raw relations with textual names) and the
entity_id_map produced by deduplicate_entities. It produces the final list
of relations for ExtractionResult.