fn_registry/python/functions/datascience/deduplicate_entities.md
egutierrez 837563c3ba feat: Python functions for datascience, finance, cybersecurity, and pipelines
Datascience: aggregate_by_group, deduplicate_entities/relations, detect_drift,
diff_entities/relations, extract_entities/relations_llm, hotness_score, melt,
merge_graphs, pivot, build_entity/relation_schema_prompt.
Finance: avellaneda_stoikov_quotes, generate_gbm_prices, generate_taker_order,
hawkes_intensity + finance.py module.
Cybersecurity: envelope_encrypt/decrypt + cybersecurity.py module.
Pipelines: extraction_pipeline, monte_carlo_market, run_market_sim.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-05 17:11:32 +02:00

---
name: deduplicate_entities
kind: function
lang: py
domain: datascience
version: "1.0.0"
purity: pure
signature: "def deduplicate_entities(candidates: list[EntityCandidate], name_threshold: float = 0.85, same_type_only: bool = True) -> DeduplicationResult"
description: "Groups candidate entities that refer to the same real-world entity using fuzzy name matching (Levenshtein + Jaccard) and Union-Find for transitive clusters. Returns the merged entities with ID-resolution maps and a merge log."
tags: [deduplication, entity, fuzzy, levenshtein, jaccard, union-find, knowledge-graph, nlp, fuzzygraph, datascience]
uses_functions:
- normalize_entity_name_py_core
- merge_entity_attributes_py_core
uses_types:
- entity_candidate_py_datascience
- deduplication_result_py_datascience
returns: [deduplication_result_py_datascience]
returns_optional: false
error_type: ""
imports:
- uuid
tested: true
tests:
- "John Smith and Smith, John are merged"
- "Google and Google LLC are merged"
- "192.168.1.1 and 192.168.1.1 are merged by exact matching"
- "John Smith (person) and John Smith (organization) are NOT merged"
- "Transitive clusters: A~B, B~C -> {A, B, C} in a single cluster"
- "Entities without duplicates pass through unmodified"
- "Confidence takes the cluster max; attributes are merged"
- "Empty list returns an empty result"
- "name_to_id contains every original name in the cluster"
test_file_path: "python/functions/datascience/deduplicate_entities_test.py"
file_path: "python/functions/datascience/deduplicate_entities.py"
---
## Example
```python
from python.types.datascience.entity_candidate import EntityCandidate
from python.functions.datascience.deduplicate_entities import deduplicate_entities

candidates = [
    EntityCandidate(name="John Smith", type_ref="person", confidence=0.9),
    EntityCandidate(name="Smith, John", type_ref="person", confidence=0.85),
    EntityCandidate(name="Google", type_ref="organization", confidence=0.95),
    EntityCandidate(name="Google LLC", type_ref="organization", confidence=0.88),
]

result = deduplicate_entities(candidates, name_threshold=0.85, same_type_only=True)
# result.total_before = 4
# result.total_after = 2
# result.merge_log = [
#     {"canonical": "John Smith", "merged": ["Smith, John"], "score": 0.91, "reason": "fuzzy_name"},
#     {"canonical": "Google", "merged": ["Google LLC"], "score": 0.89, "reason": "fuzzy_name"},
# ]
```
## Algorithm
1. **Normalize names** by calling `normalize_entity_name()` on each candidate according to its `type_ref`
2. **Pairwise comparison** within the same type (if `same_type_only=True`):
   - For technical types (ip, email, domain, crypto_wallet, phone): exact match on the normalized value
   - For everything else: `score = max(levenshtein_sim, jaccard_sim)` plus a bonus for containment (+0.3) and for acronyms (+0.3)
3. **Union-Find** for transitive clusters: if A~B and B~C, then {A, B, C} form a single cluster
4. **Merge per cluster:**
   - Canonical name: the candidate with the highest `confidence`
   - Attributes: `merge_entity_attributes()` over all candidates in the cluster
   - Confidence: `max` of the cluster
   - Source chunks: union across all candidates
   - `merged_from`: union of all original names
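
The transitive clustering in step 3 can be illustrated with a minimal Union-Find sketch. This is not the module's actual code: the helper name `cluster_pairs` and the index-based interface are assumptions made for illustration. It takes the number of candidates and the list of index pairs that passed the similarity check, and returns the resulting clusters.

```python
def cluster_pairs(n: int, matches: list[tuple[int, int]]) -> list[list[int]]:
    """Group indices 0..n-1 into clusters given pairwise matches (hypothetical helper)."""
    parent = list(range(n))  # every index starts as its own root

    def find(x: int) -> int:
        # Path halving: re-point nodes at their grandparent while walking up.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a: int, b: int) -> None:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for a, b in matches:
        union(a, b)

    # Collect members under their final root, preserving input order.
    clusters: dict[int, list[int]] = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())


# A~B and B~C put A, B, and C in one cluster; D stays alone.
clusters = cluster_pairs(4, [(0, 1), (1, 2)])
```

Each returned cluster would then go through the per-cluster merge in step 4.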
## Name similarity heuristics
| Heuristic | Effect |
|---|---|
| Levenshtein | `1 - (edit_distance / max_len)` |
| Jaccard over tokens | `\|A ∩ B\| / \|A ∪ B\|` |
| Base score | `max(lev_sim, jaccard_sim)` |
| Containment (a in b or b in a) | `+0.3`, capped at 1.0 |
| Acronym ("FBI" ~ "Federal Bureau of Investigation") | `+0.3`, capped at 1.0 |
| Exact types (ip/email/domain) | exact matching only, ignores the threshold |
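
The heuristics in the table can be combined roughly as follows. This is a sketch under stated assumptions, not the function's actual internals: it assumes names are already normalized (lowercased, punctuation stripped), that both bonuses may stack before the cap at 1.0, and that acronym detection skips a small stopword list. `STOPWORDS`, `is_acronym`, and `name_similarity` are hypothetical names.

```python
STOPWORDS = {"of", "the", "and", "for"}  # hypothetical; skipped when building initials


def levenshtein_sim(a: str, b: str) -> float:
    """1 - edit_distance / max_len, using the classic two-row dynamic program."""
    if not a and not b:
        return 1.0
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return 1.0 - prev[-1] / max(len(a), len(b))


def jaccard_sim(a: str, b: str) -> float:
    """|A ∩ B| / |A ∪ B| over whitespace-separated tokens."""
    tokens_a, tokens_b = set(a.split()), set(b.split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def is_acronym(short: str, long: str) -> bool:
    """True if `short` equals the initials of the non-stopword tokens of `long`."""
    initials = "".join(w[0] for w in long.split() if w not in STOPWORDS)
    return len(short) >= 2 and short == initials


def name_similarity(a: str, b: str) -> float:
    score = max(levenshtein_sim(a, b), jaccard_sim(a, b))
    if a and b and (a in b or b in a):
        score += 0.3  # containment bonus
    if is_acronym(a, b) or is_acronym(b, a):
        score += 0.3  # acronym bonus
    return min(score, 1.0)
```

Because Jaccard works on token sets, reordered names such as `"john smith"` and `"smith john"` reach the maximum base score even though their edit distance is large, which is the reason for taking the `max` of the two measures.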
## Complexity
- Pairwise: O(N^2), acceptable for <1000 entities (typical for a single document)
- Union-Find with path compression: O(α(N)) amortized per operation
- To scale beyond 1000: pre-filter by first letter or with an n-gram index before comparing
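
The first-letter pre-filter suggested above could look like the following sketch. It is only an illustration of the blocking idea; `candidate_pairs_by_first_letter` is a hypothetical helper that is not part of the registry, and it assumes names have already been normalized.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs_by_first_letter(names: list[str]) -> list[tuple[int, int]]:
    """Return only the index pairs whose names share a first character."""
    blocks: dict[str, list[int]] = defaultdict(list)
    for idx, name in enumerate(names):
        blocks[name[:1]].append(idx)  # empty names fall into the "" block

    pairs: list[tuple[int, int]] = []
    for members in blocks.values():
        # Pairwise comparison is still quadratic, but only within each block.
        pairs.extend(combinations(members, 2))
    return pairs


names = ["google", "google llc", "john smith", "jon smith", "acme"]
pairs = candidate_pairs_by_first_letter(names)
```

The trade-off is recall: reordered variants such as `"smith john"` and `"john smith"` land in different blocks and are never compared, so an n-gram index is likely the safer choice when such variants matter.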
## Notes
Pure function. It implements Levenshtein and Jaccard internally to avoid dependencies outside this module. The registry functions `levenshtein_distance_py_cybersecurity` and `jaccard_similarity_py_cybersecurity` are equivalent but require additional imports; the inline implementation keeps the function free of dependencies beyond the standard library (it imports only `uuid`).
The result's `name_to_id` is the main resolution map for the relation-deduplication phase: it resolves any name variant of an entity to its canonical ID.
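
As a rough illustration of how `name_to_id` might be consumed downstream, the sketch below rewrites relation endpoints to canonical IDs. The relation shape (dicts with `source` and `target` name fields), the helper `resolve_relation_endpoints`, and the sample IDs are all assumptions made for this example; the real relation types in the registry may differ.

```python
def resolve_relation_endpoints(
    relations: list[dict], name_to_id: dict[str, str]
) -> list[dict]:
    """Attach canonical entity IDs to each relation (hypothetical helper)."""
    resolved = []
    for rel in relations:
        source_id = name_to_id.get(rel["source"])
        target_id = name_to_id.get(rel["target"])
        if source_id is None or target_id is None:
            continue  # skip relations that mention an unknown entity
        resolved.append({**rel, "source_id": source_id, "target_id": target_id})
    return resolved


# Hypothetical map: every name variant points at its cluster's canonical ID.
name_to_id = {
    "John Smith": "e1",
    "Smith, John": "e1",
    "Google": "e2",
    "Google LLC": "e2",
}
relations = [
    {"source": "Smith, John", "target": "Google LLC", "type": "works_at"},
    {"source": "John Smith", "target": "Unknown Corp", "type": "works_at"},
]
resolved = resolve_relation_endpoints(relations, name_to_id)
```

Once both endpoints are expressed as canonical IDs, relations that differed only in name variants become directly comparable.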