feat: bulk extraction of footprint_aurgi (41 funcs + 4 types + Docker geo stack)

Extracts functions from the internal footprint_aurgi project into the registry:
- core (6): slugify_ascii, normalize_for_join, cp_provincia_es, infer_provincia_from_cp, safe_read_csv_fallback, csv_to_parquet_duckdb
- pure geo (7): haversine_km, point_in_ring, point_in_polygon, point_in_polygons_bbox, polygon_bbox, extent_with_padding, distance_bucket
- geo I/O (4): load_geojson_polygons, load_boundary_gdf, add_basemap_osm, add_basemap_with_timeout
- valhalla client (4): valhalla_route, valhalla_isochrone, valhalla_isochrones_async, valhalla_matrix_1_to_n
- datascience stats (7): trimmed_mean, geometric_mean, detect_distribution_type, best_central_tendency, summary_stats, kde_density_levels, alpha_shape_concave_hull
- datascience fuzzy (3): fuzzy_merge_adaptive (rapidfuzz), words_to_dataset, remove_words_from_column
- datascience viz (2): plot_kde_2d, plot_heatmap_log
- infra (4): compress_pdf_ghostscript, render_table_page_pdfpages, add_header_logo, osm2pgsql_ingest
- pipelines (4): setup_geo_stack_docker, compute_centers_reachability, generate_isochrones_by_zone, count_points_per_zone
- geo types (4): LonLat, BBox, IsochroneRequest, Centro

Includes:
- apps/footprint_geo_stack/ (PostGIS + Martin + Valhalla via docker-compose)
- 131/132 tests pass (1 expected skip: osm2pgsql on PATH)
- Issue tracker dev/issues/0052-footprint-aurgi-extraction.md
- Uniform attribution: source_repo internal:footprint_aurgi, source_license internal-aurgi
- Built with 9 agents in parallel (8 in wave 1 + 1 in wave 2 for pipelines)

Also commits previously uncommitted work: aggregate_extraction_results, chunk_with_overlap, clean_pdf_text, merge_entity_aliases, extract_graph_gliner2, extract_relations_mrebel, extract_triples_spacy_es, gliner2/mrebel/marianmt/rebel/spacy_es load_model, parse_rebel_output, translate_es_to_en, issue 0050/0051.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -0,0 +1,51 @@
---
name: aggregate_extraction_results
kind: function
lang: py
domain: core
version: "1.0.0"
purity: pure
signature: "def aggregate_extraction_results(extract_results: list[dict]) -> dict"
description: "Aggregates entities and relations from N per-chunk extraction results. Deduplicates entities by (type, name_lowercased), accumulating counts. Deduplicates relations by (head, rel_type, tail) with a Counter."
tags: [nlp, aggregation, entities, relations, deduplication, chunking, ner, re, graph]
uses_functions: []
uses_types: []
returns: []
returns_optional: false
error_type: ""
imports: [collections.Counter]
params:
  - name: extract_results
    desc: "List of per-chunk results. Each element has shape {'entities': {type: [name, ...]}, 'relation_extraction': {rel_type: [(head, tail), ...]}}. This is the output of extract_graph_gliner2. Missing keys are tolerated."
output: "Dict with two fields: 'entities' -> dict keyed by (type, name_lower) with {type, name, count}; 'relations' -> Counter mapping (head, rel_type, tail) -> count. Ready to pass to filter_relations_by_entity_types and merge_entity_aliases."
tested: true
tests:
  - "empty list returns empty entities and empty relations"
  - "single result is aggregated correctly"
  - "two overlapping results accumulate counts"
  - "entities are deduplicated case-insensitively"
test_file_path: "python/functions/core/tests/test_aggregate_extraction_results.py"
file_path: "python/functions/core/aggregate_extraction_results.py"
notes: |
  Output shape chosen deliberately for composition with the pipeline:
  - entities keyed by (type, name_lower) allows O(1) lookup by type+name
  - relations as a Counter allows filtering by frequency (count >= 2)
  No coreference resolution is applied; merge_entity_aliases does that on the
  canonical names after aggregation.
---

## Example

```python
from core.aggregate_extraction_results import aggregate_extraction_results

results = [
    {"entities": {"person": ["Pablo Isla"], "organization": ["Inditex"]},
     "relation_extraction": {"ceo_of": [("Pablo Isla", "Inditex")]}},
    {"entities": {"person": ["pablo isla"], "organization": ["Inditex"]},
     "relation_extraction": {"ceo_of": [("Pablo Isla", "Inditex")]}},
]
agg = aggregate_extraction_results(results)
# agg["entities"][("person", "pablo isla")]["count"] == 2
# agg["relations"][("Pablo Isla", "ceo_of", "Inditex")] == 2
```
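
The notes above mention filtering relations by frequency and looking entities up by type and name. A minimal sketch of both follows, assuming only the documented output shape. The `agg` literal is a hand-written stand-in for a real result, and `frequent_relations` and `build_name_to_type` are hypothetical helpers, not part of the registry.

```python
from collections import Counter

# Stand-in for the dict returned by aggregate_extraction_results.
# The shape follows the documented output; the values are illustrative only.
agg = {
    "entities": {
        ("person", "pablo isla"): {"type": "person", "name": "Pablo Isla", "count": 2},
        ("organization", "inditex"): {"type": "organization", "name": "Inditex", "count": 2},
    },
    "relations": Counter({
        ("Pablo Isla", "ceo_of", "Inditex"): 2,
        ("Inditex", "located_in", "Arteixo"): 1,
    }),
}


def frequent_relations(relations: Counter, min_count: int = 2) -> dict:
    """Keep only relation triples seen at least min_count times (hypothetical helper)."""
    return {triple: n for triple, n in relations.items() if n >= min_count}


def build_name_to_type(entities: dict) -> dict:
    """Map lowercased name -> entity type; on name clashes the last type wins."""
    return {name_lower: typ for (typ, name_lower) in entities}


frequent = frequent_relations(agg["relations"])
name_to_type = build_name_to_type(agg["entities"])
print(frequent)
print(name_to_type)
```

A mapping like `name_to_type` has the `{name_lowercased: entity_type}` shape that `filter_relations_by_entity_types` documents for its second argument.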
@@ -0,0 +1,45 @@
"""Aggregate and deduplicate entities + relations from N per-chunk extraction results."""

from __future__ import annotations

from collections import Counter


def aggregate_extraction_results(extract_results: list[dict]) -> dict:
    """Aggregate entities + relations from multiple chunk-level extraction results.

    Deduplicates entities by (type, name_lowercased) and counts occurrences.
    Deduplicates relations by (head, rel_type, tail) and counts occurrences.

    Each input result is expected to have shape:
        {"entities": {type: [name, ...]}, "relation_extraction": {rel_type: [(head, tail), ...]}}
    This is the output format of extract_graph_gliner2.

    Args:
        extract_results: List of per-chunk extraction dicts. May be empty.
            Missing keys ("entities", "relation_extraction") are tolerated.

    Returns:
        {
            "entities": dict[(type, name_lower)] -> {"type": str, "name": str, "count": int},
            "relations": Counter mapping (head, rel_type, tail) -> count
        }
    """
    all_ents: dict[tuple[str, str], dict] = {}
    all_rels: Counter = Counter()

    for r in extract_results:
        for typ, names in (r.get("entities") or {}).items():
            for n in names:
                key = (typ, (n or "").strip().lower())
                if not key[1]:
                    continue
                if key not in all_ents:
                    all_ents[key] = {"type": typ, "name": n.strip(), "count": 0}
                all_ents[key]["count"] += 1

        for rt, pairs in (r.get("relation_extraction") or {}).items():
            for h, t in pairs:
                all_rels[(h.strip(), rt, t.strip())] += 1

    return {"entities": all_ents, "relations": all_rels}
@@ -0,0 +1,64 @@
---
name: chunk_with_overlap
kind: function
lang: py
domain: core
version: "1.0.0"
purity: pure
signature: "def chunk_with_overlap(text: str, max_chars: int = 1500, overlap_sentences: int = 2) -> list[dict]"
description: "Splits text into chunks at sentence boundaries with sliding-window overlap. Guarantees forced advance if a sentence exceeds max_chars (avoids an infinite loop). Each chunk is returned as a dict with 'text' and 'sentences'."
tags: [text, chunking, nlp, split, overlap, sentence, ner, gliner, sliding-window]
uses_functions: []
uses_types: []
returns: []
returns_optional: false
error_type: ""
imports: [re]
params:
  - name: text
    desc: "Text to split. Sentences are detected by [.!?] followed by whitespace. Line breaks are accepted if the text was already cleaned with clean_pdf_text."
  - name: max_chars
    desc: "Maximum number of characters per chunk (soft limit). If a single sentence exceeds max_chars it is included anyway to avoid an infinite loop."
  - name: overlap_sentences
    desc: "Number of trailing sentences of the previous chunk to prepend to the current chunk. 0 disables overlap."
output: "List of dicts [{'text': str, 'sentences': list[str]}, ...]. 'text' is the text ready to pass to GLiNER2. Empty list if the input is empty."
tested: true
tests:
  - "empty text returns empty list"
  - "one sentence shorter than max_chars produces 1 chunk"
  - "multiple sentences produce N chunks with overlap"
  - "sentence longer than max_chars is included without an infinite loop"
  - "overlap=0 does not duplicate sentences between chunks"
  - "overlap=2: chunk N+1 starts with the last 2 sentences of chunk N"
test_file_path: "python/functions/core/tests/test_chunk_with_overlap.py"
file_path: "python/functions/core/chunk_with_overlap.py"
notes: |
  Algorithm validated empirically in notebook 06 of the gliner_glirel_tuning
  analysis. Sentence-level overlap (vs character-level overlap) ensures that
  entities appearing at the end of a chunk also appear at the start of the
  next one, improving GLiNER2 recall.

  split_text_into_chunks_py_core overlaps in characters (RAG mode).
  chunk_with_overlap overlaps in whole sentences (NER/RE mode); they are
  complementary, not competing.
---

## Example

```python
from core.chunk_with_overlap import chunk_with_overlap

text = "Pablo Isla preside Inditex. La empresa opera en 93 paises. Zara es su marca principal."
chunks = chunk_with_overlap(text, max_chars=80, overlap_sentences=1)
# chunk 0: text="Pablo Isla preside Inditex. La empresa opera en 93 paises."
# chunk 1: text="La empresa opera en 93 paises. Zara es su marca principal."
#                 ^--- overlap of 1 sentence

for c in chunks:
    print(c["text"])
```

## Difference from split_text_into_chunks

- `split_text_into_chunks`: character-level overlap, aimed at RAG
- `chunk_with_overlap`: whole-sentence overlap, aimed at NER/RE (GLiNER2)
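
Sentence detection in `chunk_with_overlap` relies on a single regex split: `[.!?]` followed by whitespace. The sketch below isolates that step to show what it does and one limitation worth knowing, namely that abbreviations ending in a period are treated as sentence ends. `split_sentences` is a hypothetical helper that reuses the pattern from `chunk_with_overlap.py`.

```python
import re

# Same pattern as in chunk_with_overlap.py: split on whitespace that
# follows a sentence-ending punctuation mark.
SENTENCE_SPLIT = re.compile(r"(?<=[\.!?])\s+")


def split_sentences(text: str) -> list[str]:
    """Split text into sentences and drop empty fragments (hypothetical helper)."""
    return [s.strip() for s in SENTENCE_SPLIT.split(text) if s.strip()]


plain = split_sentences("Zara crece. Inditex opera en 93 paises!  Sigue?")
abbrev = split_sentences("BBVA, S.A. opera en 2023. Zara crece.")
print(plain)
print(abbrev)  # "S.A." is split off as if it ended a sentence
```

Because overlap is counted in sentences, a spurious split such as the one after "S.A." makes the overlap window shorter than intended; it does not lose text.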
@@ -0,0 +1,73 @@
"""Chunking at sentence boundaries with sliding-window overlap.

Validated empirically in notebook 06 (gliner_glirel_tuning) for NER+RE
pipelines with GLiNER2. Fixes the infinite-loop bug of the naive version
when a sentence exceeds max_chars.
"""

from __future__ import annotations

import re


def chunk_with_overlap(
    text: str,
    max_chars: int = 1500,
    overlap_sentences: int = 2,
) -> list[dict]:
    """Split text into chunks with sentence-level sliding window overlap.

    Each chunk has up to `max_chars` characters. Each chunk after the first
    starts with the last `overlap_sentences` sentences of the previous chunk
    if they fit. If a single sentence exceeds max_chars, it is force-included
    (chunk size will exceed max_chars rather than infinite-loop).

    Args:
        text: Input text to split. Sentences are detected by [.!?] followed by whitespace.
        max_chars: Maximum characters per chunk (soft limit; exceeded if a single
            sentence is longer than max_chars to avoid infinite loop).
        overlap_sentences: Number of trailing sentences of the previous chunk to
            prepend to the next chunk. 0 disables overlap.

    Returns:
        list of dicts: [{"text": str, "sentences": list[str]}, ...]
        Empty list if text is empty or contains only whitespace.
    """
    if not text or not text.strip():
        return []

    sentences = re.split(r"(?<=[\.!?])\s+", text)
    sentences = [s.strip() for s in sentences if s.strip()]
    if not sentences:
        return []

    chunks: list[dict] = []
    i = 0
    while i < len(sentences):
        current_sents: list[str] = []
        current_len = 0

        # Overlap from the previous chunk (only if it fits with the next sentence)
        if chunks and overlap_sentences > 0:
            prev_sents = chunks[-1]["sentences"][-overlap_sentences:]
            overlap_len = sum(len(s) + 1 for s in prev_sents)
            next_len = len(sentences[i]) + 1
            if overlap_len + next_len <= max_chars:
                current_sents = list(prev_sents)
                current_len = overlap_len

        # FORCED ADVANCE: take at least ONE sentence even if it exceeds max_chars
        # (avoids an infinite loop with very long sentences)
        current_sents.append(sentences[i])
        current_len += len(sentences[i]) + 1
        i += 1

        # Keep adding sentences while they fit
        while i < len(sentences) and current_len + len(sentences[i]) + 1 <= max_chars:
            current_sents.append(sentences[i])
            current_len += len(sentences[i]) + 1
            i += 1

        chunks.append({"text": " ".join(current_sents), "sentences": current_sents})

    return chunks
@@ -0,0 +1,53 @@
---
name: clean_pdf_text
kind: function
lang: py
domain: core
version: "1.0.0"
purity: pure
signature: "def clean_pdf_text(text: str) -> str"
description: "Cleans PyPDF2/pdfplumber artifacts: removes page markers (1/20), tabs, end-of-line hyphenation, line breaks in the middle of sentences and duplicated spaces."
tags: [pdf, text, cleaning, nlp, preprocessing, pypdf2]
uses_functions: []
uses_types: []
returns: []
returns_optional: false
error_type: ""
imports: [re]
params:
  - name: text
    desc: "Plain text extracted from a PDF (e.g. via PyPDF2.PdfReader or pdfplumber). May contain pagination artifacts, end-of-line hyphenation and spurious line breaks."
output: "Cleaned text with artifacts removed and whitespace normalized. Ready for chunking or NER extraction."
tested: true
tests:
  - "empty string returns empty string"
  - "page marker 1/20 is removed"
  - "dehyphenation exa-newline-mple -> example"
  - "duplicated spaces are collapsed"
  - "line break in the middle of a sentence is joined with a space"
  - "line break after a period is preserved"
test_file_path: "python/functions/core/tests/test_clean_pdf_text.py"
file_path: "python/functions/core/clean_pdf_text.py"
notes: |
  Pure function with no external dependencies (only re from the stdlib).
  The order of operations matters: dehyphenation runs before the line-break
  collapse to avoid false positives.
  Line breaks after a period/exclamation mark/question mark are not removed;
  they mark the end of a sentence and must be preserved for the chunker.
---

## Example

```python
from core.clean_pdf_text import clean_pdf_text

raw = "Banco Bilbao Vizcaya Argen-\ntaria, S.A. operó en 2023.\n1/20\n\nFoo Bar"
clean = clean_pdf_text(raw)
# "Banco Bilbao Vizcaya Argentaria, S.A. operó en 2023.\nFoo Bar"
```

## Notes

Designed to preprocess text before passing it to `chunk_with_overlap` +
`extract_graph_gliner2`. The full pipeline is:
`extract_pdf_text` -> `clean_pdf_text` -> `chunk_with_overlap` -> `extract_graph_gliner2`.
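
The front matter states that dehyphenation has to run before the line-break collapse. The sketch below shows why, using the two regexes from `clean_pdf_text.py` in isolation. `dehyphenate` and `join_soft_breaks` are hypothetical wrapper names introduced only for this illustration.

```python
import re


def dehyphenate(text: str) -> str:
    """Join words split by a hyphen at end of line: 'exa-\\nmple' -> 'example'."""
    return re.sub(r"-\s*\n\s*", "", text)


def join_soft_breaks(text: str) -> str:
    """Replace line breaks that do not follow [.!?] with a single space."""
    return re.sub(r"(?<![\.!?])\n+", " ", text)


raw = "exa-\nmple text"

# Documented order: the hyphen and the newline are removed together.
documented = join_soft_breaks(dehyphenate(raw))

# Reversed order: the newline becomes a space first, so the dehyphenation
# regex no longer finds a newline and the broken word survives.
reversed_order = dehyphenate(join_soft_breaks(raw))

print(repr(documented))
print(repr(reversed_order))
```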
@@ -0,0 +1,32 @@
"""Cleanup of typical PyPDF2 extraction artifacts in plain text."""

from __future__ import annotations

import re


def clean_pdf_text(text: str) -> str:
    """Clean PDF text extraction artifacts.

    Removes: page-number markers like '1/20', tabs, hyphenated line breaks
    in mid-word, duplicated spaces, line breaks not at sentence end.

    Args:
        text: Raw text extracted from a PDF (e.g. via PyPDF2 or pdfplumber).

    Returns:
        Cleaned text with artifacts removed and whitespace normalized.
    """
    # Remove page markers such as "1/20" or "3/128"
    text = re.sub(r"\b\d{1,2}/\d{1,3}\b", " ", text)
    # Tabs to spaces
    text = text.replace("\t", " ")
    # Dehyphenation: "exa-\nmple" -> "example"
    text = re.sub(r"-\s*\n\s*", "", text)
    # Line breaks that are NOT at the end of a sentence -> space
    text = re.sub(r"(?<![\.!?])\n+", " ", text)
    # Collapse multiple spaces
    text = re.sub(r" {2,}", " ", text)
    # Drop empty lines and trim each line
    text = "\n".join(line.strip() for line in text.split("\n") if line.strip())
    return text.strip()
@@ -0,0 +1,58 @@
---
id: cp_provincia_es_py_core
name: cp_provincia_es
kind: function
lang: py
domain: core
version: "1.0.0"
purity: pure
signature: "def cp_provincia_es(codigo_postal: str | int) -> str | None"
description: "Looks up the Spanish province for a postal code (CP). Accepts a full CP (5 digits) or a 2-digit prefix. Returns None if the prefix is not in the dictionary of the 52 Spanish provinces/autonomous cities."
tags: [string, normalization, spain, geography, postal-code]
uses_functions: []
uses_types: []
returns: []
returns_optional: false
error_type: ""
imports: []
example: |
  from cp_provincia_es import cp_provincia_es
  cp_provincia_es("28001")  # "Madrid"
  cp_provincia_es("28")     # "Madrid"
  cp_provincia_es(1)        # "Álava"
  cp_provincia_es("99")     # None
tested: true
tests: ["full cp returns province", "2-digit prefix returns province", "first prefix 01 returns Alava", "unknown cp returns None"]
test_file_path: "python/functions/core/tests/test_cp_provincia_es.py"
file_path: "python/functions/core/cp_provincia_es.py"
params:
  - name: codigo_postal
    desc: "Spanish postal code as a string or integer. Accepts a 5-digit CP ('28001', 28001) or a 2-digit prefix ('28', 28)."
output: "Name of the province in Spanish (with diacritics), or None if the CP prefix does not match any known province."
source_repo: "internal:footprint_aurgi"
source_license: "internal-aurgi"
source_file: "aurgi_mapas/generar_pdf_reporte.py"
---

## Example

```python
from cp_provincia_es import cp_provincia_es

cp_provincia_es("28001")  # "Madrid"
cp_provincia_es("28")     # "Madrid"
cp_provincia_es(28)       # "Madrid"
cp_provincia_es("01")     # "Álava"
cp_provincia_es(1)        # "Álava" (str(1) = "1" has <= 2 chars -> zfill(2) = "01")
cp_provincia_es("99")     # None
```

## Notes

Pure function with no dependencies. The embedded dictionary covers the 50 Spanish
provinces plus Ceuta ("51") and Melilla ("52"). Copied verbatim from
`aurgi_mapas/generar_pdf_reporte.py:CP_TO_PROVINCIA`.

Note on integers: inputs of at most 2 characters are treated as prefixes and padded with zfill(2), so `cp_provincia_es(1)` -> "01" -> "Álava".
Longer inputs are padded with zfill(5) before taking the first two characters, so an integer CP that lost its leading zero still resolves: `cp_provincia_es(1001)` -> "01001" -> "Álava".
Full numeric CPs work as well: `cp_provincia_es(28001)` -> "Madrid".
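
The prefix normalization used by `cp_provincia_es` is the part that decides how integers and short inputs behave. A minimal sketch of that step alone follows; `cp_prefix` is a hypothetical helper that mirrors the branch in `cp_provincia_es.py` without the province dictionary.

```python
def cp_prefix(codigo_postal: "str | int") -> str:
    """Return the 2-character province prefix for a postal code (hypothetical helper)."""
    cp = str(codigo_postal).strip()
    if len(cp) <= 2:
        # Already a prefix (or shorter): left-pad to 2 characters.
        return cp.zfill(2)
    # Full or truncated CP: restore leading zeros, then take the first 2 characters.
    return cp.zfill(5)[:2]


examples = {value: cp_prefix(value) for value in (1, 28, "28001", 1001, 100)}
print(examples)
```

The last case is the one to watch: a 3-digit value such as `100` is padded to "00100" and yields the prefix "00", which is not a province code, so the lookup returns None.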
@@ -0,0 +1,44 @@
"""Spanish province lookup by postal code."""

from __future__ import annotations

_CP_TO_PROVINCIA = {
    "01": "Álava", "02": "Albacete", "03": "Alicante", "04": "Almería",
    "05": "Ávila", "06": "Badajoz", "07": "Illes Balears", "08": "Barcelona",
    "09": "Burgos", "10": "Cáceres", "11": "Cádiz", "12": "Castellón",
    "13": "Ciudad Real", "14": "Córdoba", "15": "A Coruña", "16": "Cuenca",
    "17": "Girona", "18": "Granada", "19": "Guadalajara", "20": "Gipuzkoa",
    "21": "Huelva", "22": "Huesca", "23": "Jaén", "24": "León",
    "25": "Lleida", "26": "La Rioja", "27": "Lugo", "28": "Madrid",
    "29": "Málaga", "30": "Murcia", "31": "Navarra", "32": "Ourense",
    "33": "Asturias", "34": "Palencia", "35": "Las Palmas",
    "36": "Pontevedra", "37": "Salamanca", "38": "Santa Cruz de Tenerife",
    "39": "Cantabria", "40": "Segovia", "41": "Sevilla",
    "42": "Soria", "43": "Tarragona", "44": "Teruel",
    "45": "Toledo", "46": "Valencia", "47": "Valladolid",
    "48": "Bizkaia", "49": "Zamora", "50": "Zaragoza",
    "51": "Ceuta", "52": "Melilla",
}


def cp_provincia_es(codigo_postal: "str | int") -> "str | None":
    """Return the Spanish province that corresponds to a postal code.

    Accepts a full CP (5 digits) or a 2-digit prefix. Inputs of up to 2
    characters are padded with zfill(2); longer inputs are normalized with
    zfill(5)[:2] before the lookup. Returns None if the prefix is not in
    the dictionary.

    Args:
        codigo_postal: Spanish postal code as a string or integer.
            May be a full CP ("28001", 28001) or a prefix ("28", 28).

    Returns:
        Name of the province in Spanish, or None if the CP is unknown.
    """
    cp = str(codigo_postal).strip()
    # If it is already a prefix of 2 digits (or fewer), use it directly with zfill(2)
    if len(cp) <= 2:
        prefix = cp.zfill(2)
    else:
        prefix = cp.zfill(5)[:2]
    return _CP_TO_PROVINCIA.get(prefix)
@@ -0,0 +1,54 @@
---
name: csv_to_parquet_duckdb
kind: function
lang: py
domain: core
version: "1.0.0"
purity: impure
signature: "def csv_to_parquet_duckdb(csv_path: str | Path, parquet_path: str | Path, column_casts: dict[str, str] | None = None, overwrite: bool = False) -> bool"
description: "Converts a CSV to Parquet using DuckDB read_csv_auto. If overwrite=False and the parquet file already exists, it does nothing. column_casts allows overriding inferred types per column. Returns True if the file was written."
tags: [csv, parquet, duckdb, etl, core]
uses_functions: []
uses_types: []
returns: []
returns_optional: false
error_type: "error_go_core"
imports: [duckdb, pathlib]
params:
  - name: csv_path
    desc: "Path to the source CSV file."
  - name: parquet_path
    desc: "Destination path of the Parquet file. Intermediate directories are created if they do not exist."
  - name: column_casts
    desc: "Optional dict column -> DuckDB type to override inferred types (e.g. {\"cp\": \"VARCHAR\"})."
  - name: overwrite
    desc: "If False (default), does not overwrite an existing parquet file and returns False."
output: "True if the Parquet file was written, False if it was skipped because it already exists."
tested: true
tests:
  - "converts csv to parquet and duckdb can read it"
  - "overwrite=False does not overwrite an existing parquet"
test_file_path: "python/functions/core/tests/test_csv_to_parquet_duckdb.py"
file_path: "python/functions/core/csv_to_parquet_duckdb.py"
source_repo: "internal:footprint_aurgi"
source_license: "internal-aurgi"
source_file: "zonas_mapas_aurgi/scripts/prepare_parquet.py"
---

## Example

```python
written = csv_to_parquet_duckdb(
    "data/centros.csv",
    "data/centros.parquet",
    column_casts={"cp": "VARCHAR"},
)
if written:
    print("Parquet generated")
```

## Notes

Uses DuckDB read_csv_auto, which infers types automatically. For columns holding
postal codes or other numeric-looking fields that must remain strings, use column_casts.
Raises FileNotFoundError if csv_path does not exist. Other DuckDB errors propagate.
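
To make the effect of `column_casts` concrete, the sketch below reproduces only the SQL string construction, with no DuckDB dependency. `build_select_expr` is a hypothetical helper, and the column list is hard-coded here, whereas the function obtains it from a DESCRIBE query.

```python
def build_select_expr(all_cols: list[str], column_casts: dict[str, str]) -> str:
    """Cast the listed columns and pass every other column through unchanged."""
    parts = []
    for col in all_cols:
        if col in column_casts:
            parts.append(f"CAST({col} AS {column_casts[col]}) AS {col}")
        else:
            parts.append(col)
    return ", ".join(parts)


# Assumed column list, for illustration only.
select_expr = build_select_expr(["id", "cp", "nombre"], {"cp": "VARCHAR"})
sql = (
    f"COPY (SELECT {select_expr} FROM read_csv_auto('data/centros.csv', header=true)) "
    f"TO 'data/centros.parquet' (FORMAT PARQUET)"
)
print(select_expr)
print(sql)
```

Column names and paths are interpolated into the statement as plain text, so a column name containing spaces or a path containing a single quote would need quoting or escaping before this approach is safe for arbitrary input.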
@@ -0,0 +1,79 @@
"""Convert a CSV file to Parquet format using DuckDB."""
from __future__ import annotations

from pathlib import Path


def csv_to_parquet_duckdb(
    csv_path: "str | Path",
    parquet_path: "str | Path",
    column_casts: "dict[str, str] | None" = None,
    overwrite: bool = False,
) -> bool:
    """Convert a CSV file to Parquet using DuckDB's read_csv_auto.

    If overwrite is False and the parquet file already exists, the function
    does nothing and returns False. Otherwise uses DuckDB to read the CSV
    (with automatic type inference) and writes it as Parquet.

    Optional column_casts allow overriding inferred types for specific columns
    (e.g. {"codigo_postal": "VARCHAR"} to prevent numeric coercion).

    Args:
        csv_path: Path to the source CSV file.
        parquet_path: Path for the output Parquet file.
        column_casts: Optional dict mapping column names to DuckDB SQL types.
        overwrite: If False (default), skip conversion when parquet exists.

    Returns:
        True if the Parquet file was written, False if skipped.

    Raises:
        FileNotFoundError: If csv_path does not exist.
        Exception: Any DuckDB error (malformed CSV, type cast failure, etc.).
    """
    import duckdb

    csv_p = Path(csv_path)
    parquet_p = Path(parquet_path)

    if not csv_p.exists():
        raise FileNotFoundError(f"CSV not found: {csv_p}")

    if not overwrite and parquet_p.exists():
        return False

    parquet_p.parent.mkdir(parents=True, exist_ok=True)

    con = duckdb.connect()
    try:
        if column_casts:
            # Build SELECT: cast specified columns, pass the rest through.
            # DESCRIBE lists every column of the CSV so none is dropped.
            all_cols_query = f"DESCRIBE SELECT * FROM read_csv_auto('{csv_p}', header=true)"
            all_cols = [row[0] for row in con.execute(all_cols_query).fetchall()]
            select_parts = []
            for col in all_cols:
                if col in column_casts:
                    select_parts.append(f"CAST({col} AS {column_casts[col]}) AS {col}")
                else:
                    select_parts.append(col)
            select_expr = ", ".join(select_parts)
            sql = (
                f"COPY (SELECT {select_expr} FROM read_csv_auto('{csv_p}', header=true)) "
                f"TO '{parquet_p}' (FORMAT PARQUET)"
            )
        else:
            sql = (
                f"COPY (SELECT * FROM read_csv_auto('{csv_p}', header=true)) "
                f"TO '{parquet_p}' (FORMAT PARQUET)"
            )
        con.execute(sql)
    finally:
        con.close()

    return True
@@ -0,0 +1,67 @@
---
name: filter_relations_by_entity_types
kind: function
lang: py
domain: core
version: "1.0.0"
purity: pure
signature: "def filter_relations_by_entity_types(relations: dict, name_to_type: dict, allowed: dict) -> tuple[list, list]"
description: "Typed post-filtering of NER+RE relations: discards pairs whose entity types (head_type, tail_type) do not match those allowed for the relation kind. E.g. discards 'Madrid president_of Person' because Madrid is a location, not a person."
tags: [nlp, relations, filter, entity-types, graph, ner, re, post-process, gliner2]
uses_functions: []
uses_types: []
returns: []
returns_optional: false
error_type: ""
imports: []
params:
  - name: relations
    desc: "Dict {rel_type: [(head_name, tail_name), ...]}. Names must be non-empty strings. E.g. {'president_of': [('Carlos Torres', 'BBVA')]}"
  - name: name_to_type
    desc: "Dict {name_lowercased: entity_type}. Built from the result of extract_graph_gliner2 or aggregate_extraction_results. E.g. {'carlos torres': 'person', 'bbva': 'organization'}"
  - name: allowed
    desc: "Dict {rel_type: (allowed_head_types, allowed_tail_types)}. Each value is a tuple of two lists of strings. If a rel_type is not in allowed, all of its pairs are accepted. E.g. {'president_of': (['person'], ['organization'])}"
output: "Tuple (kept, dropped). Each element is a list of dicts {from, kind, to, head_type, tail_type}. kept holds the valid pairs, dropped the rejected ones (useful for debugging)."
tested: true
tests:
  - "valid pairs are included in kept"
  - "pairs with incompatible types go to dropped"
  - "rel_type not in allowed is always accepted"
  - "entity not found in name_to_type goes to dropped"
test_file_path: "python/functions/core/tests/test_filter_relations_by_entity_types.py"
file_path: "python/functions/core/filter_relations_by_entity_types.py"
notes: |
  Validated in playground/server.py of the gliner_glirel_tuning analysis.
  The (head_type, tail_type) rule avoids common false positives in knowledge
  graphs such as "Madrid presides over Santander" (Location -> Organization).
  The dropped return value makes it easy to inspect which relations were
  removed and why (head_type/tail_type None indicates an unknown entity).
---

## Example

```python
from core.filter_relations_by_entity_types import filter_relations_by_entity_types

relations = {
    "president_of": [
        ("Carlos Torres", "BBVA"),   # person -> organization: OK
        ("Madrid", "Santander"),     # location -> organization: INVALID
    ],
    "unknown_rel": [("A", "B")],     # not in allowed: accepted
}
name_to_type = {
    "carlos torres": "person",
    "bbva": "organization",
    "madrid": "location",
    "santander": "organization",
    "a": "person", "b": "person",
}
allowed = {
    "president_of": (["person"], ["organization"]),
}
kept, dropped = filter_relations_by_entity_types(relations, name_to_type, allowed)
# kept: [{"from": "Carlos Torres", "kind": "president_of", "to": "BBVA", ...},
#        {"from": "A", "kind": "unknown_rel", "to": "B", ...}]
# dropped: [{"from": "Madrid", "kind": "president_of", "to": "Santander", ...}]
```
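
The rows in `dropped` carry `head_type` and `tail_type`, which is enough to tell an unknown entity apart from a type mismatch. A minimal sketch of that inspection follows. `drop_reason` and its reason labels are hypothetical and not part of the registry function, and the sample rows are hand-written in the documented row shape.

```python
def drop_reason(row: dict, allowed: dict) -> str:
    """Classify why a row ended up in `dropped` (hypothetical helper)."""
    head_ok, tail_ok = allowed[row["kind"]]
    if row["head_type"] is None or row["tail_type"] is None:
        # At least one name was missing from name_to_type.
        return "unknown_entity"
    if row["head_type"] not in head_ok:
        return "head_type_not_allowed"
    if row["tail_type"] not in tail_ok:
        return "tail_type_not_allowed"
    return "allowed"


allowed = {"president_of": (["person"], ["organization"])}
# Hand-written rows in the documented shape, for illustration only.
dropped = [
    {"from": "Madrid", "kind": "president_of", "to": "Santander",
     "head_type": "location", "tail_type": "organization"},
    {"from": "Carlos Torres", "kind": "president_of", "to": "Acme",
     "head_type": "person", "tail_type": None},
]
reasons = [drop_reason(row, allowed) for row in dropped]
print(reasons)
```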
@@ -0,0 +1,49 @@
"""Typed post-filtering of relations: discards pairs with incompatible types."""

from __future__ import annotations


def filter_relations_by_entity_types(
    relations: dict,
    name_to_type: dict,
    allowed: dict,
) -> tuple[list, list]:
    """Filter relations by allowed (head_type, tail_type) per relation kind.

    Validates that each (head, tail) pair in a relation has the expected entity
    types. Relations with unknown types (not in name_to_type) are dropped when
    the relation_type appears in allowed.

    Args:
        relations: Dict mapping rel_type -> list of (head_name, tail_name) tuples.
            E.g. {"president_of": [("Carlos Torres", "BBVA")], ...}
        name_to_type: Dict mapping lowercased entity name -> entity type.
            E.g. {"carlos torres": "person", "bbva": "organization"}
        allowed: Dict mapping rel_type -> (allowed_head_types, allowed_tail_types).
            Each value is a tuple/list of two lists of strings.
            If a rel_type is NOT in allowed, all its pairs are kept.
            E.g. {"president_of": (["person"], ["organization"])}

    Returns:
        Tuple (kept, dropped) where each is a list of dicts:
        {"from": str, "kind": str, "to": str, "head_type": str|None, "tail_type": str|None}
    """
    kept: list[dict] = []
    dropped: list[dict] = []

    for rt, pairs in relations.items():
        rule = allowed.get(rt)
        for h, t in pairs:
            ht = name_to_type.get(h.lower().strip())
            tt = name_to_type.get(t.lower().strip())
            row = {"from": h, "kind": rt, "to": t, "head_type": ht, "tail_type": tt}
            if rule is None:
                kept.append(row)
            else:
                head_ok, tail_ok = rule
                if ht in head_ok and tt in tail_ok:
                    kept.append(row)
                else:
                    dropped.append(row)

    return kept, dropped
@@ -0,0 +1,65 @@
---
id: infer_provincia_from_cp_py_core
name: infer_provincia_from_cp
kind: function
lang: py
domain: core
version: "1.0.0"
purity: pure
signature: "def infer_provincia_from_cp(rows: list[dict], cp_col: str = \"codigo_postal\", prov_col: str = \"provincia\") -> list[str | None]"
description: "Infers the correct province for each row from the dominant postal code (CP) prefixes of its declared province. Computes the top-2 CP prefixes per province; if the row's prefix is in that top-2, the province of the row's own CP is used, otherwise the province of the dominant prefix. Pure stdlib, no pandas."
tags: [string, normalization, spain, geography, postal-code, inference]
uses_functions: [cp_provincia_es_py_core]
uses_types: []
returns: []
returns_optional: false
error_type: ""
imports: ["collections.Counter"]
example: |
  from infer_provincia_from_cp import infer_provincia_from_cp
  rows = [
      {"codigo_postal": "28001", "provincia": "Madrid"},
      {"codigo_postal": "28010", "provincia": "Madrid"},
      {"codigo_postal": "99999", "provincia": "Madrid"},
  ]
  infer_provincia_from_cp(rows)
  # ["Madrid", "Madrid", None]
  # With only two distinct prefixes, 99 is itself in Madrid's top-2,
  # and cp_provincia_es("99") is None.
tested: true
tests: ["inferencia con cp dominante madrid", "fila con cp fuera de top2 usa dominante", "fila sin provincia retorna None"]
test_file_path: "python/functions/core/tests/test_infer_provincia_from_cp.py"
file_path: "python/functions/core/infer_provincia_from_cp.py"
params:
  - name: rows
    desc: "List of dicts. Each dict must have at least cp_col (postal code) and prov_col (declared province)."
  - name: cp_col
    desc: "Key of the postal code in each dict. Defaults to 'codigo_postal'."
  - name: prov_col
    desc: "Key of the province in each dict. Defaults to 'provincia'."
output: "List of strings or None with the inferred province for each row, in the same order as rows. None when the row's province or CP is None, when the province does not have enough data, or when cp_provincia_es does not know the chosen prefix."
source_repo: "internal:footprint_aurgi"
source_license: "internal-aurgi"
source_file: "aurgi_mapas/generar_pdf_reporte.py"
---

## Example

```python
from infer_provincia_from_cp import infer_provincia_from_cp

rows = [
    {"codigo_postal": "28001", "provincia": "Madrid"},
    {"codigo_postal": "28010", "provincia": "Madrid"},
    {"codigo_postal": "28020", "provincia": "Madrid"},
    {"codigo_postal": "08001", "provincia": "Madrid"},
    {"codigo_postal": "08002", "provincia": "Madrid"},
    {"codigo_postal": "41001", "provincia": "Madrid"},  # Sevilla CP, but declared province is Madrid
]
result = infer_provincia_from_cp(rows)
# ["Madrid", "Madrid", "Madrid", "Barcelona", "Barcelona", "Madrid"]
# Prefix 41 is not in Madrid's top-2 (28, 08), so the last row uses the dominant prefix (28 -> Madrid).
```

## Notes

Pure function. Uses `cp_provincia_es` from the same domain for the final lookup.
Adapted from `add_provincia_poliza_correcta` in `aurgi_mapas/generar_pdf_reporte.py`,
removing the pandas dependency and generalizing the columns via parameters.
The algorithm keeps the original semantics: top-2 prefixes per province, with a
fallback to the dominant prefix when the row's CP does not fit that top-2.
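The top-2 selection described in the notes can be tried in isolation. What follows is a minimal sketch, not the registry implementation: `PREFIX_TO_PROVINCE` is a hypothetical three-entry stand-in for `cp_provincia_es`, and `top2_prefixes` is a name introduced only for this illustration.

```python
from collections import Counter

# Hypothetical stand-in for cp_provincia_es (assumption: only three prefixes).
PREFIX_TO_PROVINCE = {"28": "Madrid", "08": "Barcelona", "01": "Álava"}


def top2_prefixes(cps: list[str]) -> list[str]:
    """Return the two most frequent 2-digit prefixes, most frequent first."""
    counts = Counter(str(cp).strip().zfill(5)[:2] for cp in cps)
    return [prefix for prefix, _ in counts.most_common(2)]


# Postal codes of rows that all declare the same province.
cps = ["28001", "28010", "28020", "08001", "08002", "01001"]
top2 = top2_prefixes(cps)

resolved = []
for cp in cps:
    prefix = cp[:2]
    # Keep the row's own prefix if it is in the top-2, else use the dominant one.
    chosen = prefix if prefix in top2 else top2[0]
    resolved.append(PREFIX_TO_PROVINCE.get(chosen))

print(top2)      # ['28', '08']
print(resolved)  # ['Madrid', 'Madrid', 'Madrid', 'Barcelona', 'Barcelona', 'Madrid']
```

The fallback only triggers when a province has at least three distinct prefixes. With one or two, every row's prefix is in the top-2 by construction.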
@@ -0,0 +1,85 @@
"""Infer the correct province for each row from the dominant postal code per province."""

from __future__ import annotations

import os
import sys
from collections import Counter


def infer_provincia_from_cp(
    rows: list[dict],
    cp_col: str = "codigo_postal",
    prov_col: str = "provincia",
) -> list[str | None]:
    """Infer the correct province for each row using the dominant CP per province.

    For each province in the dataset, compute the two most frequent CP
    prefixes (top-2). If a row's CP belongs to that top-2 for its province,
    the province derived from the row's own CP is used; otherwise the
    province derived from the dominant (top-1) prefix of its province is used.

    Generic logic (pure stdlib, no pandas):
      1. Count prefix frequency per province.
      2. Select the top-2 prefixes per province.
      3. For each row: if its prefix is in its province's top-2,
         return cp_provincia_es(prefix); otherwise return cp_provincia_es(top1).
      4. If the row's province has no data, return None.

    Args:
        rows: List of dicts with at least the cp_col and prov_col keys.
        cp_col: Name of the key holding the postal code (default "codigo_postal").
        prov_col: Name of the key holding the declared province (default "provincia").

    Returns:
        List of strings (or None) with the inferred province for each row,
        in the same order as rows.
    """
    # Make the sibling module importable regardless of the caller's working directory.
    _here = os.path.dirname(os.path.abspath(__file__))
    if _here not in sys.path:
        sys.path.insert(0, _here)
    from cp_provincia_es import cp_provincia_es

    # Step 1: count the frequency of each (province, prefix) pair
    freq: dict[str, Counter] = {}
    for row in rows:
        prov = row.get(prov_col)
        cp_raw = row.get(cp_col)
        if prov is None or cp_raw is None:
            continue
        cp_str = str(cp_raw).strip().zfill(5)
        prefix = cp_str[:2]
        if prov not in freq:
            freq[prov] = Counter()
        freq[prov][prefix] += 1

    # Step 2: top-2 prefixes per province and dominant prefix (top-1)
    top2: dict[str, list[str]] = {}
    dominant: dict[str, str] = {}
    for prov, counter in freq.items():
        ordered = [p for p, _ in counter.most_common(2)]
        top2[prov] = ordered
        if ordered:
            dominant[prov] = ordered[0]

    # Step 3: resolve the province for each row
    result = []
    for row in rows:
        prov = row.get(prov_col)
        cp_raw = row.get(cp_col)

        if prov is None or cp_raw is None:
            result.append(None)
            continue

        cp_str = str(cp_raw).strip().zfill(5)
        prefix = cp_str[:2]

        if prov in top2 and prefix in top2[prov]:
            result.append(cp_provincia_es(prefix))
        elif prov in dominant:
            result.append(cp_provincia_es(dominant[prov]))
        else:
            result.append(None)

    return result
@@ -0,0 +1,49 @@
---
name: merge_entity_aliases
kind: function
lang: py
domain: core
version: "1.0.0"
purity: pure
signature: "def merge_entity_aliases(entity_names: list[str]) -> dict[str, str]"
description: "Simple coreference by normalization + substring: maps each entity name to its canonical form. 'BBVA' and 'bbva' -> same canonical. Short names are absorbed by longer names that contain them as a whole word (min 4 normalized chars)."
tags: [nlp, coreference, entity, alias, normalization, merge, graph, ner]
uses_functions: []
uses_types: []
returns: []
returns_optional: false
error_type: ""
imports: [re, collections.defaultdict]
params:
  - name: entity_names
    desc: "List of entity names as extracted by the NER model. May contain duplicates, casing variations (BBVA/bbva) and long/short forms (BBVA / Banco Bilbao Vizcaya Argentaria, S.A.)."
output: "Dict {original_name: canonical_name}. Identity for names that are not aliases of anything. An empty list returns an empty dict."
tested: true
tests:
  - "duplicados case-insensitive se mapean al mismo canonical"
  - "nombre corto se absorbe en nombre largo que lo contiene"
  - "siglas cortas menos de 4 chars no absorben falsamente"
  - "nombres totalmente disjuntos se mapean a si mismos"
test_file_path: "python/functions/core/tests/test_merge_entity_aliases.py"
file_path: "python/functions/core/merge_entity_aliases.py"
notes: |
  Validated in playground/server.py of the gliner_glirel_tuning analysis.
  The 4-normalized-chars threshold prevents short acronyms such as "US", "EU", "SA"
  from being absorbed into entities that merely contain those letters.
  The merge is asymmetric: the LONG name is the canonical one, not the short one.
  Useful as a post-processing step after aggregate_extraction_results, before
  building the final graph.
---

## Example

```python
from core.merge_entity_aliases import merge_entity_aliases

names = ["BBVA", "bbva", "Banco Bilbao", "Banco Bilbao Vizcaya Argentaria, S.A.", "Inditex"]
alias = merge_entity_aliases(names)
# alias["BBVA"] -> "bbva"  (same normalized form; equal length, tie broken by string order)
# alias["bbva"] -> "bbva"
# alias["Banco Bilbao"] -> "Banco Bilbao Vizcaya Argentaria, S.A."  (absorbed as whole-word substring)
# alias["Banco Bilbao Vizcaya Argentaria, S.A."] -> "Banco Bilbao Vizcaya Argentaria, S.A."
# alias["Inditex"] -> "Inditex"  (identity, no alias)
# Note: "BBVA" is not merged into the long name; an acronym is not a substring of its expansion.
```
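The whole-word rule from the description can be checked on its own. A minimal sketch, assuming the same normalization the function applies (lowercase, common punctuation stripped); `is_absorbed_by` is a hypothetical helper named only for this illustration, not part of the registry.

```python
import re


def normalize(s: str) -> str:
    """Lowercase, drop common punctuation and collapse whitespace."""
    s = re.sub(r"[\.,;:\"'`()\[\]]", "", s.strip())
    return re.sub(r"\s+", " ", s).strip().lower()


def is_absorbed_by(short: str, long: str, min_chars: int = 4) -> bool:
    """True if `short` appears as a whole word (or phrase) inside `long`."""
    short_norm, long_norm = normalize(short), normalize(long)
    if short == long or len(short_norm) < min_chars:
        return False
    return re.search(r"\b" + re.escape(short_norm) + r"\b", long_norm) is not None


print(is_absorbed_by("Inditex", "Grupo Inditex, S.A."))                 # True
print(is_absorbed_by("SA", "Grupo Inditex, S.A."))                      # False: fewer than 4 chars
print(is_absorbed_by("BBVA", "Banco Bilbao Vizcaya Argentaria, S.A."))  # False: acronym, not a substring
print(is_absorbed_by("Banco", "Bancolombia"))                           # False: not a whole word
```

The `\b` anchors are what keep "Banco" from matching inside "Bancolombia". The length threshold handles short acronyms such as "SA".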
@@ -0,0 +1,62 @@
"""Simple coreference by normalization and substring for named entities."""

from __future__ import annotations

import re
from collections import defaultdict


def merge_entity_aliases(entity_names: list[str]) -> dict[str, str]:
    """Build alias map: original_name -> canonical_name.

    Two-pass algorithm:
      Step 1 - Normalize: lowercase + strip punctuation -> cluster by normalized form.
               Canonical per cluster = longest original casing.
      Step 2 - Substring merge: short names absorbed by longer ones if short_name
               appears as whole word inside long_name (normalized) AND
               short_name has >= 4 normalized chars (prevents false positives
               like 'US' absorbing everything that contains 'us').

    Args:
        entity_names: List of entity name strings (may have duplicates or
            different casings, e.g. ["BBVA", "bbva", "Banco Bilbao..."]).

    Returns:
        Dict mapping each input name to its final canonical form.
        Identity mapping for names that are not aliases of anything else.
    """
    if not entity_names:
        return {}

    def normalize(s: str) -> str:
        s = re.sub(r"[\.,;:\"'`()\[\]]", "", s.strip())
        s = re.sub(r"\s+", " ", s)
        return s.strip().lower()

    # Step 1: group by normalized form and pick the longest name as canonical
    norm_groups: dict[str, list[str]] = defaultdict(list)
    for n in entity_names:
        norm_groups[normalize(n)].append(n)

    canonical: dict[str, str] = {}
    for nrm, group in norm_groups.items():
        winner = max(group, key=lambda x: (len(x), x))
        for n in group:
            canonical[n] = winner

    # Step 2: substring merge over the canonicals (long absorbs short if short is inside long)
    canon_set = sorted(set(canonical.values()), key=len, reverse=True)
    absorbed: dict[str, str] = {}

    for long_n in canon_set:
        long_norm = normalize(long_n)
        for short_n in canon_set:
            if short_n == long_n or short_n in absorbed:
                continue
            short_norm = normalize(short_n)
            if len(short_norm) < 4:
                continue
            if re.search(r"\b" + re.escape(short_norm) + r"\b", long_norm):
                absorbed[short_n] = long_n

    return {orig: absorbed.get(canon, canon) for orig, canon in canonical.items()}
@@ -0,0 +1,52 @@
---
id: normalize_for_join_py_core
name: normalize_for_join
kind: function
lang: py
domain: core
version: "1.0.0"
purity: pure
signature: "def normalize_for_join(values: Iterable) -> list[str]"
description: "Normalizes strings for fuzzy joins: upper + strip diacritics (NFD) + remove anything outside [A-Z0-9 ] + collapse spaces. Works with any iterable. None/NaN -> empty string."
tags: [string, normalization, join, fuzzy, spain]
uses_functions: []
uses_types: []
returns: []
returns_optional: false
error_type: ""
imports: ["re", "unicodedata", "typing.Iterable"]
example: |
  from normalize_for_join import normalize_for_join
  normalize_for_join(["Calle Mayor, 14", "avila", None])
  # ["CALLE MAYOR 14", "AVILA", ""]
tested: true
tests: ["normalize con puntuacion y diacriticos y None"]
test_file_path: "python/functions/core/tests/test_normalize_for_join.py"
file_path: "python/functions/core/normalize_for_join.py"
params:
  - name: values
    desc: "Iterable of strings or None/NaN to normalize. Accepts lists, generators, pd.Series, etc."
output: "List of normalized upper-case strings without diacritics. None and NaN become the empty string."
source_repo: "internal:footprint_aurgi"
source_license: "internal-aurgi"
source_file: "fuzzy_joins/arreglo_fuzzy.py"
---

## Example

```python
from normalize_for_join import normalize_for_join

normalize_for_join(["Calle Mayor, 14", "ávila", None])
# ["CALLE MAYOR 14", "AVILA", ""]

normalize_for_join(["José García S.L.", "BANCO DE ESPAÑA"])
# ["JOSE GARCIA SL", "BANCO DE ESPANA"]
```

## Notes

Pure function with no external dependencies (only `re` and `unicodedata` from the stdlib).
Adapted from `preparar_para_join` / `normalizar_string` in `fuzzy_joins/arreglo_fuzzy.py`,
removing the pandas dependency so that it works with any iterable.
Useful as a preliminary step for exact-equality joins over normalized data.
@@ -0,0 +1,44 @@
"""Normalize strings for joins with no external dependencies."""

import re
import unicodedata
from typing import Iterable


def normalize_for_join(values: Iterable) -> list:
    """Normalize strings for joins: upper + no diacritics + only [A-Z0-9 ] + collapsed spaces.

    For each value: convert to string, upper-case, strip diacritics (NFD),
    remove every character that is not a letter, digit or whitespace,
    collapse repeated whitespace, trim. None or NaN become the empty string.

    Does not depend on pandas; works with any iterable of strings or None.

    Args:
        values: Iterable of strings or None. Can be a list, generator, Series, etc.

    Returns:
        List of normalized strings. None/NaN become "".
    """
    result = []
    for v in values:
        if v is None:
            result.append("")
            continue
        # Detect numpy/pandas NaN without importing either library
        try:
            if v != v:  # NaN != NaN
                result.append("")
                continue
        except (TypeError, ValueError):
            pass
        texto = str(v).upper()
        texto = "".join(
            c for c in unicodedata.normalize("NFD", texto)
            if unicodedata.category(c) != "Mn"
        )
        texto = re.sub(r"[^A-Z0-9\s]", "", texto)
        texto = re.sub(r"\s+", " ", texto)
        texto = texto.strip()
        result.append(texto)
    return result
@@ -0,0 +1,43 @@
---
name: safe_read_csv_fallback
kind: function
lang: py
domain: core
version: "1.0.0"
purity: impure
signature: "def safe_read_csv_fallback(path: str | Path) -> pd.DataFrame"
description: "Reads a CSV trying utf-8 first; if that fails with UnicodeDecodeError it retries with latin-1. Covers legacy exports from Excel and Western European tools."
tags: [csv, encoding, pandas, io, core]
uses_functions: []
uses_types: []
returns: []
returns_optional: false
error_type: "error_go_core"
imports: [pandas, pathlib]
params:
  - name: path
    desc: "Path to the CSV file to read. Can be str or Path."
output: "pandas DataFrame with the contents of the CSV. The encoding is chosen automatically (utf-8, falling back to latin-1)."
tested: true
tests:
  - "lee csv utf-8 correctamente"
  - "lee csv latin-1 con fallback"
test_file_path: "python/functions/core/tests/test_safe_read_csv_fallback.py"
file_path: "python/functions/core/safe_read_csv_fallback.py"
source_repo: "internal:footprint_aurgi"
source_license: "internal-aurgi"
source_file: "ponderacion_isochronas/example/models/eda/utils.py"
---

## Example

```python
from core.safe_read_csv_fallback import safe_read_csv_fallback

df = safe_read_csv_fallback("datos_clientes.csv")
print(df.shape)
```

## Notes

The fallback only triggers on UnicodeDecodeError. Other errors (missing file,
malformed CSV) propagate normally.
latin-1 covers most Excel exports in Spanish and other Western European languages.
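The fallback works because latin-1 maps every possible byte to a character, so the second attempt cannot raise `UnicodeDecodeError`. A minimal stdlib-only sketch of the same try/except pattern, without pandas. `read_text_fallback` is a hypothetical name used only for this illustration.

```python
import tempfile
from pathlib import Path


def read_text_fallback(path: Path) -> str:
    """Read text as utf-8; on UnicodeDecodeError retry as latin-1."""
    try:
        return path.read_text(encoding="utf-8")
    except UnicodeDecodeError:
        # latin-1 decodes any byte sequence, so this retry cannot fail on decoding.
        return path.read_text(encoding="latin-1")


with tempfile.TemporaryDirectory() as tmp:
    legacy = Path(tmp) / "legacy.csv"
    # "ñ" is the single byte 0xF1 in latin-1; followed by "a" it is not valid utf-8.
    legacy.write_bytes("nombre\nEspaña\n".encode("latin-1"))
    content = read_text_fallback(legacy)

print(content.splitlines())  # ['nombre', 'España']
```

One consequence of this design is that a file in some other single-byte encoding also loads without error, but with wrong characters, so the fallback is a convenience and not an encoding detector.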
@@ -0,0 +1,34 @@
"""Read a CSV file with automatic encoding fallback from utf-8 to latin-1."""
from __future__ import annotations

from pathlib import Path
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    import pandas as pd


def safe_read_csv_fallback(path: "str | Path") -> "pd.DataFrame":
    """Read a CSV file, falling back to latin-1 if utf-8 decoding fails.

    Tries pandas read_csv with the default utf-8 encoding first. On a
    UnicodeDecodeError retries with latin-1 (ISO-8859-1), which covers most
    Western European legacy CSV exports.

    Args:
        path: Path to the CSV file.

    Returns:
        A pandas DataFrame with the CSV contents.

    Raises:
        FileNotFoundError: If the file does not exist.
        Exception: Any other pandas read error (malformed CSV, etc.).
    """
    import pandas as pd

    p = Path(path)
    try:
        return pd.read_csv(p)
    except UnicodeDecodeError:
        return pd.read_csv(p, encoding="latin-1")
@@ -0,0 +1,59 @@
---
id: slugify_ascii_py_core
name: slugify_ascii
kind: function
lang: py
domain: core
version: "1.0.0"
purity: pure
signature: "def slugify_ascii(text: str, max_len: int = 80, default: str = \"centro\") -> str"
description: "Converts text to a lowercase ASCII slug without diacritics. Strip + lower + NFD + replace non-alphanumerics with a hyphen + collapse hyphens. Returns default if the result is empty. Truncates to max_len."
tags: [string, normalization, slug, ascii, spain]
uses_functions: []
uses_types: []
returns: []
returns_optional: false
error_type: ""
imports: ["re", "unicodedata"]
example: |
  from slugify_ascii import slugify_ascii
  slugify_ascii("Calle Mayor, 14")      # "calle-mayor-14"
  slugify_ascii("Ávila")                # "avila"
  slugify_ascii("")                     # "centro"
  slugify_ascii("a" * 100, max_len=10)  # "aaaaaaaaaa"
tested: true
tests: ["slugify texto con puntuacion", "slugify diacriticos", "slugify cadena vacia retorna default", "slugify trunca a max_len"]
test_file_path: "python/functions/core/tests/test_slugify_ascii.py"
file_path: "python/functions/core/slugify_ascii.py"
params:
  - name: text
    desc: "Input text to convert to a slug. None is treated as an empty string."
  - name: max_len
    desc: "Maximum length of the resulting slug. Defaults to 80 characters."
  - name: default
    desc: "Value returned when the resulting slug is empty. Defaults to 'centro'."
output: "Lowercase ASCII slug without diacritics, at most max_len characters. Returns default if the result is empty."
source_repo: "internal:footprint_aurgi"
source_license: "internal-aurgi"
source_file: "zonas_mapas_aurgi/scripts/generate_isochrones.py"
---

## Example

```python
from slugify_ascii import slugify_ascii

slugify_ascii("Calle Mayor, 14")            # "calle-mayor-14"
slugify_ascii("Ávila")                      # "avila"
slugify_ascii("")                           # "centro"
slugify_ascii(None)                         # "centro"
slugify_ascii("a" * 100, max_len=10)        # "aaaaaaaaaa"
slugify_ascii("---", default="sin-nombre")  # "sin-nombre"
```

## Notes

Pure function with no external dependencies. Uses only `re` and `unicodedata` from the stdlib.
Adapted from `_slugify` in `zonas_mapas_aurgi/scripts/generate_isochrones.py` and
`ponderacion_isochronas/src/generar_isochronas_aurgi.py`, combining the
NFD normalization of the former with the truncation and default of the latter.
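Because truncation is applied after the hyphens are trimmed, a cut can land right after a separator. The sketch below mirrors the documented steps in a stand-alone helper (`slug_sketch` is a hypothetical name, not the registry function) to show that edge and one way a caller might handle it.

```python
import re
import unicodedata


def slug_sketch(text: str, max_len: int = 80, default: str = "centro") -> str:
    """Strip + lower + drop diacritics + hyphenate + trim + truncate."""
    text = unicodedata.normalize("NFD", str(text).strip().lower())
    text = "".join(c for c in text if unicodedata.category(c) != "Mn")
    text = re.sub(r"[^a-z0-9]+", "-", text).strip("-")
    return text[:max_len] if text else default


print(slug_sketch("Centro Comercial Ávila"))                  # centro-comercial-avila
print(slug_sketch("Calle Mayor, 14", max_len=6))              # calle-
print(slug_sketch("Calle Mayor, 14", max_len=6).rstrip("-"))  # calle
```

If a trailing hyphen matters, for instance in file names, calling `.rstrip("-")` on the result is a small fix on the caller's side.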
@@ -0,0 +1,33 @@
"""Convert text to a lowercase ASCII slug without diacritics."""

import re
import unicodedata


def slugify_ascii(text: str, max_len: int = 80, default: str = "centro") -> str:
    """Convert text to a lowercase ASCII slug without diacritics.

    Applies: strip + lower + remove diacritics (NFD) + replace
    non-alphanumerics with a hyphen + collapse hyphens + trim. If the
    result is empty, returns default. Truncates to max_len.

    Args:
        text: Input text. None is treated as empty.
        max_len: Maximum length of the resulting slug (default 80).
        default: Value returned if the slug ends up empty (default "centro").

    Returns:
        Lowercase ASCII slug, at most max_len characters.
    """
    if text is None:
        return default
    text = str(text).strip().lower()
    text = "".join(
        c for c in unicodedata.normalize("NFD", text)
        if unicodedata.category(c) != "Mn"
    )
    text = re.sub(r"[^a-z0-9]+", "-", text)
    text = text.strip("-")
    if not text:
        return default
    return text[:max_len]
@@ -0,0 +1,65 @@
"""Tests for aggregate_extraction_results."""

from __future__ import annotations

import os
import sys
from collections import Counter

sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", ".."))

from core.aggregate_extraction_results import aggregate_extraction_results


def test_lista_vacia_retorna_entities_y_relations_vacios():
    """lista vacia retorna entities vacio y relations vacio"""
    result = aggregate_extraction_results([])
    assert result["entities"] == {}
    assert result["relations"] == Counter()


def test_resultado_unico_se_agrega_correctamente():
    """resultado unico se agrega correctamente"""
    r = [
        {
            "entities": {"person": ["Pablo Isla"], "organization": ["Inditex"]},
            "relation_extraction": {"ceo_of": [("Pablo Isla", "Inditex")]},
        }
    ]
    result = aggregate_extraction_results(r)
    assert ("person", "pablo isla") in result["entities"]
    assert ("organization", "inditex") in result["entities"]
    assert result["entities"][("person", "pablo isla")]["count"] == 1
    assert result["relations"][("Pablo Isla", "ceo_of", "Inditex")] == 1


def test_dos_resultados_con_solapamiento_acumulan_counts():
    """dos resultados con solapamiento acumulan counts"""
    r = [
        {
            "entities": {"person": ["Pablo Isla"], "organization": ["Inditex"]},
            "relation_extraction": {"ceo_of": [("Pablo Isla", "Inditex")]},
        },
        {
            "entities": {"person": ["Pablo Isla"], "organization": ["Inditex"]},
            "relation_extraction": {"ceo_of": [("Pablo Isla", "Inditex")]},
        },
    ]
    result = aggregate_extraction_results(r)
    assert result["entities"][("person", "pablo isla")]["count"] == 2
    assert result["relations"][("Pablo Isla", "ceo_of", "Inditex")] == 2


def test_entidades_deduplicen_case_insensitive():
    """entidades se deduplicien case-insensitive"""
    r = [
        {"entities": {"person": ["Pablo Isla"]}, "relation_extraction": {}},
        {"entities": {"person": ["pablo isla"]}, "relation_extraction": {}},
    ]
    result = aggregate_extraction_results(r)
    # Both names map to the same key (person, pablo isla)
    assert ("person", "pablo isla") in result["entities"]
    assert result["entities"][("person", "pablo isla")]["count"] == 2
    # Only one key exists for pablo isla
    pablo_keys = [k for k in result["entities"] if k[1] == "pablo isla"]
    assert len(pablo_keys) == 1
@@ -0,0 +1,72 @@
"""Tests for chunk_with_overlap."""

from __future__ import annotations

import os
import sys

sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", ".."))

from core.chunk_with_overlap import chunk_with_overlap


def test_texto_vacio_retorna_lista_vacia():
    """texto vacio retorna lista vacia"""
    assert chunk_with_overlap("") == []
    assert chunk_with_overlap("   ") == []


def test_una_frase_menor_que_max_chars_produce_1_chunk():
    """una frase menor que max_chars produce 1 chunk"""
    text = "Esta es una frase corta."
    chunks = chunk_with_overlap(text, max_chars=500, overlap_sentences=0)
    assert len(chunks) == 1
    assert chunks[0]["text"] == text


def test_multiples_frases_producen_N_chunks_con_overlap():
    """multiples frases producen N chunks con overlap"""
    # 3 sentences of 25 chars each, max_chars=55 -> at least 2 chunks
    text = "Primera frase larga aqui. Segunda frase larga aqui. Tercera frase larga aqui."
    chunks = chunk_with_overlap(text, max_chars=55, overlap_sentences=1)
    assert len(chunks) >= 2
    # Every chunk has non-empty text
    for c in chunks:
        assert c["text"].strip()
        assert len(c["sentences"]) > 0


def test_frase_mas_larga_que_max_chars_no_bucle_infinito():
    """frase mas larga que max_chars se incluye sin bucle infinito"""
    long_sentence = "A" * 2000 + "."
    chunks = chunk_with_overlap(long_sentence, max_chars=100, overlap_sentences=0)
    # Must terminate (no infinite loop) and produce exactly 1 chunk
    assert len(chunks) == 1
    assert chunks[0]["text"] == long_sentence.strip()


def test_overlap_0_no_duplica_frases():
    """overlap=0 no duplica frases entre chunks"""
    text = "Primera frase aqui completa. Segunda frase aqui completa. Tercera frase aqui completa."
    chunks = chunk_with_overlap(text, max_chars=50, overlap_sentences=0)
    # Collect every sentence from every chunk
    all_sents = [s for c in chunks for s in c["sentences"]]
    # With overlap=0 no sentence may appear twice
    assert len(all_sents) == len(set(all_sents))


def test_overlap_2_el_chunk_N_mas_1_empieza_con_ultimas_2_frases_del_N():
    """overlap=2 el chunk N+1 empieza con las 2 ultimas frases del chunk N"""
    # 5 short sentences (85 chars in total), max_chars=80 to force at least 2 chunks
    text = (
        "Frase uno aqui. "
        "Frase dos aqui. "
        "Frase tres aqui. "
        "Frase cuatro aqui. "
        "Frase cinco aqui."
    )
    chunks = chunk_with_overlap(text, max_chars=80, overlap_sentences=2)
    if len(chunks) >= 2:
        prev_tail = chunks[0]["sentences"][-2:]
        next_head = chunks[1]["sentences"][:2]
        assert prev_tail == next_head
@@ -0,0 +1,49 @@
"""Tests for clean_pdf_text."""

from __future__ import annotations

import os
import sys

sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", ".."))

from core.clean_pdf_text import clean_pdf_text


def test_string_vacio_retorna_vacio():
    """string vacio retorna vacio"""
    assert clean_pdf_text("") == ""


def test_marca_de_pagina_1_20_se_elimina():
    """marca de pagina 1/20 se elimina"""
    result = clean_pdf_text("1/20\nfoo bar")
    assert "1/20" not in result
    assert "foo bar" in result


def test_dehyphenation_exa_newline_mple():
    """dehyphenation exa-newline-mple -> example"""
    result = clean_pdf_text("exa-\nmple")
    assert result == "example"


def test_espacios_duplicados_se_colapsan():
    """espacios duplicados se colapsan"""
    result = clean_pdf_text("ab  cd")
    assert result == "ab cd"


def test_salto_de_linea_en_mitad_de_oracion_se_une_con_espacio():
    """salto de linea en mitad de oracion se une con espacio"""
    result = clean_pdf_text("Pablo Isla es el\npresidente de Inditex")
    assert result == "Pablo Isla es el presidente de Inditex"


def test_salto_de_linea_tras_punto_se_preserva():
    """salto de linea tras punto se preserva"""
    result = clean_pdf_text("Primera oracion.\nSegunda oracion.")
    # The line break after a period must be kept (not joined with a space)
    assert "\n" in result
    assert "Primera oracion." in result
    assert "Segunda oracion." in result
@@ -0,0 +1,44 @@
"""Tests for cp_provincia_es."""

import sys
import os

sys.path.insert(0, os.path.join(os.path.dirname(__file__), ".."))

from cp_provincia_es import cp_provincia_es


def test_cp_completo_retorna_provincia():
    """cp completo retorna provincia"""
    assert cp_provincia_es("28001") == "Madrid"


def test_prefijo_2_digitos_retorna_provincia():
    """prefijo 2 digitos retorna provincia"""
    assert cp_provincia_es("28") == "Madrid"


def test_primer_prefijo_01_retorna_alava():
    """primer prefijo 01 retorna Alava"""
    assert cp_provincia_es("01") == "Álava"


def test_cp_desconocido_retorna_none():
    """cp desconocido retorna None"""
    assert cp_provincia_es("99") is None


def test_cp_entero_completo():
    assert cp_provincia_es(28001) == "Madrid"


def test_cp_ceuta():
    assert cp_provincia_es("51001") == "Ceuta"


def test_cp_melilla():
    assert cp_provincia_es("52") == "Melilla"


def test_cp_barcelona():
    assert cp_provincia_es("08") == "Barcelona"
@@ -0,0 +1,54 @@
"""Tests for csv_to_parquet_duckdb."""
from __future__ import annotations

import tempfile
from pathlib import Path

import pytest


def test_convierte_csv_a_parquet_y_duckdb_puede_leerlo():
    """convierte csv a parquet y duckdb puede leerlo"""
    import sys
    sys.path.insert(0, str(Path(__file__).resolve().parents[2]))
    from core.csv_to_parquet_duckdb import csv_to_parquet_duckdb
    import duckdb

    with tempfile.TemporaryDirectory() as tmpdir:
        csv_path = Path(tmpdir) / "test.csv"
        parquet_path = Path(tmpdir) / "test.parquet"

        csv_path.write_text("nombre,lat,lon\nMadrid,40.4,-3.7\nBarcelona,41.3,2.1\n")

        result = csv_to_parquet_duckdb(csv_path, parquet_path)
        assert result is True
        assert parquet_path.exists()
        assert parquet_path.stat().st_size > 0

        # Verify duckdb can read it back
        con = duckdb.connect()
        df = con.execute(f"SELECT * FROM read_parquet('{parquet_path}')").df()
        con.close()
        assert df.shape == (2, 3)
        assert set(df.columns) == {"nombre", "lat", "lon"}


def test_overwrite_False_no_sobreescribe_parquet_existente():
    """overwrite=False no sobreescribe parquet existente"""
    import sys
    sys.path.insert(0, str(Path(__file__).resolve().parents[2]))
    from core.csv_to_parquet_duckdb import csv_to_parquet_duckdb

    with tempfile.TemporaryDirectory() as tmpdir:
        csv_path = Path(tmpdir) / "test.csv"
        parquet_path = Path(tmpdir) / "test.parquet"

        csv_path.write_text("a,b\n1,2\n")
        # Create existing parquet with known content
        parquet_path.write_bytes(b"existing content")
        original_size = parquet_path.stat().st_size

        result = csv_to_parquet_duckdb(csv_path, parquet_path, overwrite=False)
        assert result is False
        # File must remain unchanged
        assert parquet_path.stat().st_size == original_size
@@ -0,0 +1,60 @@
|
||||
"""Tests para filter_relations_by_entity_types."""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import os
|
||||
import sys
|
||||
|
||||
sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", ".."))
|
||||
|
||||
from core.filter_relations_by_entity_types import filter_relations_by_entity_types
|
||||
|
||||
NAME_TO_TYPE = {
|
||||
"carlos torres": "person",
|
||||
"bbva": "organization",
|
||||
"madrid": "location",
|
||||
"santander": "organization",
|
||||
"ana": "person",
|
||||
}
|
||||
|
||||
ALLOWED = {
|
||||
"president_of": (["person"], ["organization"]),
|
||||
"located_in": (["organization", "person"], ["location"]),
|
||||
}
|
||||
|
||||
|
||||
def test_pares_validos_se_incluyen_en_kept():
|
||||
"""pares validos se incluyen en kept"""
|
||||
relations = {"president_of": [("Carlos Torres", "BBVA")]}
|
||||
kept, dropped = filter_relations_by_entity_types(relations, NAME_TO_TYPE, ALLOWED)
|
||||
assert len(kept) == 1
|
||||
assert kept[0]["from"] == "Carlos Torres"
|
||||
assert kept[0]["to"] == "BBVA"
|
||||
assert len(dropped) == 0
|
||||
|
||||
|
||||
def test_pares_con_tipos_incompatibles_van_a_dropped():
|
||||
"""pares con tipos incompatibles van a dropped"""
|
||||
# Madrid es location, no person -> no puede presidir nada
|
||||
relations = {"president_of": [("Madrid", "Santander")]}
|
||||
kept, dropped = filter_relations_by_entity_types(relations, NAME_TO_TYPE, ALLOWED)
|
||||
assert len(kept) == 0
|
||||
assert len(dropped) == 1
|
||||
assert dropped[0]["head_type"] == "location"
|
||||
|
||||
|
||||
def test_rel_type_no_en_allowed_se_acepta_siempre():
|
||||
"""rel_type no en allowed se acepta siempre"""
|
||||
relations = {"unknown_rel": [("Carlos Torres", "Madrid")]}
|
||||
kept, dropped = filter_relations_by_entity_types(relations, NAME_TO_TYPE, ALLOWED)
|
||||
assert len(kept) == 1
|
||||
assert len(dropped) == 0
|
||||
|
||||
|
||||
def test_entidad_no_encontrada_en_name_to_type_va_a_dropped():
|
||||
"""entidad no encontrada en name_to_type va a dropped"""
|
||||
# "Desconocido" no esta en name_to_type -> head_type es None -> dropped
|
||||
relations = {"president_of": [("Desconocido", "BBVA")]}
|
||||
kept, dropped = filter_relations_by_entity_types(relations, NAME_TO_TYPE, ALLOWED)
|
||||
assert len(dropped) == 1
|
||||
assert dropped[0]["head_type"] is None
|
||||
@@ -0,0 +1,78 @@
|
||||
"""Tests para infer_provincia_from_cp."""
|
||||
|
||||
import sys
|
||||
import os
|
||||
|
||||
sys.path.insert(0, os.path.join(os.path.dirname(__file__), ".."))
|
||||
|
||||
from infer_provincia_from_cp import infer_provincia_from_cp
|
||||
|
||||
|
||||
def test_inferencia_con_cp_dominante_madrid():
|
||||
"""inferencia con cp dominante madrid"""
|
||||
rows = [
|
||||
{"codigo_postal": "28001", "provincia": "Madrid"},
|
||||
{"codigo_postal": "28010", "provincia": "Madrid"},
|
||||
]
|
||||
result = infer_provincia_from_cp(rows)
|
||||
assert result == ["Madrid", "Madrid"]
|
||||
|
||||
|
||||
def test_fila_con_cp_fuera_de_top2_usa_dominante():
    """fila con cp fuera de top2 usa dominante"""
    # Madrid has 3 distinct prefixes: 28 (x4), 29 (x2), 41 (x1).
    # Its top-2 prefixes are therefore 28 and 29, so 41 falls outside the top-2.
    # More than 2 distinct prefixes are needed for any prefix to fall outside.
    rows = [
        {"codigo_postal": "28001", "provincia": "Madrid"},
        {"codigo_postal": "28002", "provincia": "Madrid"},
        {"codigo_postal": "28003", "provincia": "Madrid"},
        {"codigo_postal": "28004", "provincia": "Madrid"},
        {"codigo_postal": "29001", "provincia": "Madrid"},
        {"codigo_postal": "29002", "provincia": "Madrid"},
        {"codigo_postal": "41001", "provincia": "Madrid"},  # outlier: outside the top-2
    ]
    result = infer_provincia_from_cp(rows)
    # Madrid's top-2: "28" (4 occurrences) and "29" (2 occurrences).
    # "41" is not in the top-2, so the row falls back to the dominant prefix (28 -> Madrid)
    assert result[6] == "Madrid"
|
||||
|
||||
|
||||
def test_fila_sin_provincia_retorna_none():
|
||||
"""fila sin provincia retorna None"""
|
||||
rows = [
|
||||
{"codigo_postal": "28001", "provincia": None},
|
||||
]
|
||||
result = infer_provincia_from_cp(rows)
|
||||
assert result == [None]
|
||||
|
||||
|
||||
def test_fila_sin_cp_retorna_none():
|
||||
rows = [
|
||||
{"codigo_postal": None, "provincia": "Madrid"},
|
||||
]
|
||||
result = infer_provincia_from_cp(rows)
|
||||
assert result == [None]
|
||||
|
||||
|
||||
def test_columnas_custom():
|
||||
rows = [
|
||||
{"cp": "28001", "prov": "Madrid"},
|
||||
{"cp": "28010", "prov": "Madrid"},
|
||||
]
|
||||
result = infer_provincia_from_cp(rows, cp_col="cp", prov_col="prov")
|
||||
assert result == ["Madrid", "Madrid"]
|
||||
|
||||
|
||||
def test_multiples_provincias():
|
||||
rows = [
|
||||
{"codigo_postal": "28001", "provincia": "Madrid"},
|
||||
{"codigo_postal": "08001", "provincia": "Barcelona"},
|
||||
{"codigo_postal": "41001", "provincia": "Sevilla"},
|
||||
]
|
||||
result = infer_provincia_from_cp(rows)
|
||||
assert result == ["Madrid", "Barcelona", "Sevilla"]
|
||||
|
||||
|
||||
def test_lista_vacia():
|
||||
assert infer_provincia_from_cp([]) == []
|
||||
@@ -0,0 +1,58 @@
|
||||
"""Tests para merge_entity_aliases."""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import os
|
||||
import sys
|
||||
|
||||
sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", ".."))
|
||||
|
||||
from core.merge_entity_aliases import merge_entity_aliases
|
||||
|
||||
|
||||
def test_duplicados_case_insensitive_se_mapean_al_mismo_canonical():
|
||||
"""duplicados case-insensitive se mapean al mismo canonical"""
|
||||
result = merge_entity_aliases(["BBVA", "bbva", "Bbva"])
|
||||
# Todos deben apuntar al mismo canonical (el mas largo / mayor)
|
||||
vals = set(result.values())
|
||||
assert len(vals) == 1
|
||||
# El canonical debe ser la forma de mayor longitud/orden: "BBVA" (mayusculas, misma longitud)
|
||||
canon = vals.pop()
|
||||
assert canon.lower() == "bbva"
|
||||
|
||||
|
||||
def test_nombre_corto_se_absorbe_en_nombre_largo_que_lo_contiene():
|
||||
"""nombre corto se absorbe en nombre largo que lo contiene"""
|
||||
# El substring merge funciona cuando la forma corta APARECE LITERALMENTE
|
||||
# en la forma larga (normalizada). Ejemplo: "bilbao" esta en "banco bilbao vizcaya argentaria"
|
||||
names = ["Bilbao", "Banco Bilbao Vizcaya Argentaria"]
|
||||
result = merge_entity_aliases(names)
|
||||
# "bilbao" (6 chars) aparece como palabra en la forma larga normalizada
|
||||
assert result["Bilbao"] == "Banco Bilbao Vizcaya Argentaria"
|
||||
assert result["Banco Bilbao Vizcaya Argentaria"] == "Banco Bilbao Vizcaya Argentaria"
|
||||
|
||||
|
||||
def test_siglas_cortas_menos_de_4_chars_no_absorben_falsamente():
    """siglas cortas menos de 4 chars no absorben falsamente"""
    # "US" is 2 chars once normalized -> it must not absorb "USA" or "BBUSA"
    names = ["US", "USA", "Standard Chartered"]
    result = merge_entity_aliases(names)
    # "US" (2 chars) must not be able to absorb anything
    assert result["USA"] in ("USA", "Standard Chartered")
    # "US" may stay as its own identity or be absorbed by a name that contains it.
    # What matters: it does NOT absorb names that do not contain it as a whole word
    assert result["Standard Chartered"] == "Standard Chartered"
|
||||
|
||||
|
||||
def test_nombres_totalmente_disjuntos_se_mapean_a_si_mismos():
|
||||
"""nombres totalmente disjuntos se mapean a si mismos"""
|
||||
names = ["Inditex", "Santander", "Telefonica"]
|
||||
result = merge_entity_aliases(names)
|
||||
assert result["Inditex"] == "Inditex"
|
||||
assert result["Santander"] == "Santander"
|
||||
assert result["Telefonica"] == "Telefonica"
|
||||
|
||||
|
||||
def test_lista_vacia_retorna_dict_vacio():
|
||||
"""lista vacia retorna dict vacio"""
|
||||
assert merge_entity_aliases([]) == {}
|
||||
@@ -0,0 +1,42 @@
|
||||
"""Tests para normalize_for_join."""
|
||||
|
||||
import sys
|
||||
import os
|
||||
|
||||
sys.path.insert(0, os.path.join(os.path.dirname(__file__), ".."))
|
||||
|
||||
from normalize_for_join import normalize_for_join
|
||||
|
||||
|
||||
def test_normalize_con_puntuacion_y_diacriticos_y_none():
|
||||
"""normalize con puntuacion y diacriticos y None"""
|
||||
result = normalize_for_join(["Calle Mayor, 14", "ávila", None])
|
||||
assert result == ["CALLE MAYOR 14", "AVILA", ""]
|
||||
|
||||
|
||||
def test_normalize_lista_vacia():
|
||||
assert normalize_for_join([]) == []
|
||||
|
||||
|
||||
def test_normalize_upper():
|
||||
assert normalize_for_join(["madrid"]) == ["MADRID"]
|
||||
|
||||
|
||||
def test_normalize_elimina_simbolos():
|
||||
assert normalize_for_join(["José García S.L."]) == ["JOSE GARCIA SL"]
|
||||
|
||||
|
||||
def test_normalize_colapsa_espacios():
|
||||
assert normalize_for_join([" hola mundo "]) == ["HOLA MUNDO"]
|
||||
|
||||
|
||||
def test_normalize_nan_as_empty():
|
||||
# NaN de float (float('nan'))
|
||||
result = normalize_for_join([float("nan")])
|
||||
assert result == [""]
|
||||
|
||||
|
||||
def test_normalize_entero():
|
||||
# Enteros se convierten a string
|
||||
result = normalize_for_join([28001])
|
||||
assert result == ["28001"]
|
||||
@@ -0,0 +1,40 @@
|
||||
"""Tests para safe_read_csv_fallback."""
|
||||
from __future__ import annotations
|
||||
|
||||
import tempfile
|
||||
from pathlib import Path
|
||||
|
||||
import pytest
|
||||
|
||||
|
||||
def test_lee_csv_utf_8_correctamente():
|
||||
"""lee csv utf-8 correctamente"""
|
||||
import sys
|
||||
sys.path.insert(0, str(Path(__file__).resolve().parents[2]))
|
||||
from core.safe_read_csv_fallback import safe_read_csv_fallback
|
||||
|
||||
with tempfile.TemporaryDirectory() as tmpdir:
|
||||
csv_path = Path(tmpdir) / "test_utf8.csv"
|
||||
csv_path.write_text("nombre,valor\nAña,42\nBéta,99\n", encoding="utf-8")
|
||||
|
||||
df = safe_read_csv_fallback(csv_path)
|
||||
assert df.shape == (2, 2)
|
||||
assert list(df.columns) == ["nombre", "valor"]
|
||||
assert df["nombre"].tolist() == ["Aña", "Béta"]
|
||||
|
||||
|
||||
def test_lee_csv_latin_1_con_fallback():
|
||||
"""lee csv latin-1 con fallback"""
|
||||
import sys
|
||||
sys.path.insert(0, str(Path(__file__).resolve().parents[2]))
|
||||
from core.safe_read_csv_fallback import safe_read_csv_fallback
|
||||
|
||||
with tempfile.TemporaryDirectory() as tmpdir:
|
||||
csv_path = Path(tmpdir) / "test_latin1.csv"
|
||||
# Write latin-1 encoded CSV (ñ, é are 0xF1, 0xE9 in latin-1)
|
||||
csv_path.write_bytes("nombre,valor\nMad\xf1id,10\nC\xe9ntro,20\n".encode("latin-1"))
|
||||
|
||||
df = safe_read_csv_fallback(csv_path)
|
||||
assert df.shape == (2, 2)
|
||||
assert "Mad" in df["nombre"].iloc[0]
|
||||
assert df["valor"].tolist() == [10, 20]
|
||||
@@ -0,0 +1,44 @@
|
||||
"""Tests para slugify_ascii."""
|
||||
|
||||
import sys
|
||||
import os
|
||||
|
||||
sys.path.insert(0, os.path.join(os.path.dirname(__file__), ".."))
|
||||
|
||||
from slugify_ascii import slugify_ascii
|
||||
|
||||
|
||||
def test_slugify_texto_con_puntuacion():
|
||||
"""slugify texto con puntuacion"""
|
||||
assert slugify_ascii("Calle Mayor, 14") == "calle-mayor-14"
|
||||
|
||||
|
||||
def test_slugify_diacriticos():
|
||||
"""slugify diacriticos"""
|
||||
assert slugify_ascii("Ávila") == "avila"
|
||||
|
||||
|
||||
def test_slugify_cadena_vacia_retorna_default():
|
||||
"""slugify cadena vacia retorna default"""
|
||||
assert slugify_ascii("") == "centro"
|
||||
|
||||
|
||||
def test_slugify_trunca_a_max_len():
|
||||
"""slugify trunca a max_len"""
|
||||
assert slugify_ascii("a" * 100, max_len=10) == "aaaaaaaaaa"
|
||||
|
||||
|
||||
def test_slugify_none_retorna_default():
|
||||
assert slugify_ascii(None) == "centro"
|
||||
|
||||
|
||||
def test_slugify_default_custom():
|
||||
assert slugify_ascii("---", default="sin-nombre") == "sin-nombre"
|
||||
|
||||
|
||||
def test_slugify_solo_diacriticos_y_puntuacion():
|
||||
assert slugify_ascii("ñoño") == "nono"
|
||||
|
||||
|
||||
def test_slugify_numeros():
|
||||
assert slugify_ascii("28001 Madrid") == "28001-madrid"
|
||||
@@ -0,0 +1,70 @@
|
||||
---
|
||||
name: align_relations_to_entities
|
||||
kind: function
|
||||
lang: py
|
||||
domain: datascience
|
||||
version: "1.0.0"
|
||||
purity: pure
|
||||
signature: "def align_relations_to_entities(triplets: list[dict], entity_names: list[str]) -> list[dict]"
|
||||
description: "Filters and aligns REBEL/mREBEL triplets to canonical entity names. For each triplet, resolves head and tail against entity_names by exact case-insensitive match or by substring (the longest name wins). Drops triplets where either side fails to resolve or head == tail."
|
||||
tags: [rebel, mrebel, relation-extraction, nlp, align, knowledge-graph, datascience, python]
|
||||
uses_functions: []
|
||||
uses_types: []
|
||||
returns: []
|
||||
returns_optional: false
|
||||
error_type: ""
|
||||
imports: []
|
||||
params:
  - name: triplets
    desc: "list of dicts produced by parse_rebel_output, with keys head, head_type, type, tail, tail_type"
  - name: entity_names
    desc: "canonical names of the known entities to align against (e.g. [e.name for e in entities])"
output: "list of dicts with keys from (str), kind (str), to (str), head_type (str), tail_type (str). from/to are values taken verbatim from entity_names."
|
||||
tested: true
|
||||
tests:
|
||||
- "match exacto case-insensitive resuelve correctamente"
|
||||
- "substring entity en span del head"
|
||||
- "substring span dentro del nombre de entidad"
|
||||
- "gana el nombre de entidad mas largo en ambiguedad"
|
||||
- "triplet sin match se descarta"
|
||||
- "triplet con head == tail se descarta (self-loop)"
|
||||
test_file_path: "python/functions/datascience/tests/test_align_relations_to_entities.py"
|
||||
file_path: "python/functions/datascience/align_relations_to_entities.py"
|
||||
notes: |
  Pure function. Composes with parse_rebel_output: the output of parse_rebel_output is passed
  in as triplets, and entity_names comes from [e.name for e in entities] in the extraction context.
  Matching strategy:
  1. Exact case-insensitive (O(1) via dict)
  2. Bidirectional substring: entity in span OR span in entity (iterates by length DESC)
  This covers cases such as mREBEL emitting "esta en Bilbao" when the entity is "Bilbao",
  or "Banco Santander S.A." when the canonical entity is "Banco Santander".
|
||||
---
|
||||
|
||||
## Example

```python
from python.functions.datascience.parse_rebel_output import parse_rebel_output
from python.functions.datascience.align_relations_to_entities import align_relations_to_entities

decoded = "tp_XX<triplet> Pablo Isla <per> Inditex <org> employer"
triplets = parse_rebel_output(decoded)

entities = ["Pablo Isla", "Inditex", "A Coruna"]
aligned = align_relations_to_entities(triplets, entities)
# [{'from': 'Pablo Isla', 'kind': 'employer', 'to': 'Inditex',
#   'head_type': 'per', 'tail_type': 'org'}]
```
|
||||
|
||||
## Matching strategy

1. **Exact case-insensitive**: ``"inditex"`` == ``"Inditex"``.
2. **Bidirectional substring**: the entity is contained in the model's span,
   or the model's span is contained in the entity name.
   When several entities match, the longest one (the most specific) wins.
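
The two steps above can be sketched in isolation as follows. This is a minimal illustration, not the registry implementation: `resolve` is a hypothetical stand-alone helper, and the entity list is invented for the example.

```python
from __future__ import annotations


def resolve(span: str, entity_names: list[str]) -> str | None:
    """Illustrative resolver: exact match first, then longest substring match."""
    if not span:
        return None
    span_lower = span.lower()

    # Step 1: exact case-insensitive lookup.
    exact = {name.lower(): name for name in entity_names}
    if span_lower in exact:
        return exact[span_lower]

    # Step 2: bidirectional substring, trying the longest names first.
    for name in sorted(entity_names, key=len, reverse=True):
        name_lower = name.lower()
        if name_lower in span_lower or span_lower in name_lower:
            return name
    return None


entities = ["Santander", "Banco Santander", "Bilbao"]  # hypothetical entities
exact_hit = resolve("banco santander", entities)
longest_hit = resolve("Banco Santander S.A.", entities)
contained_hit = resolve("esta en Bilbao", entities)
no_hit = resolve("Inditex", entities)
```

Here `"Banco Santander S.A."` contains both `"Santander"` and `"Banco Santander"`; the longer name is tried first, so it wins.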
|
||||
|
||||
## Notes

- No fuzzy matching (Levenshtein, etc.): precision is preferred over recall
  in the knowledge-graph context.
- To improve recall, normalize entity_names before calling (strip acronyms, accents).
- Triplets with ``from == to`` (self-loops) are always dropped.
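
The accent-stripping part of the recall tip above could look like the sketch below. `strip_accents` is a hypothetical helper, not part of the registry, and the sketch assumes the model spans would be normalized the same way before matching. Because `from`/`to` are returned verbatim from `entity_names`, a reverse map is kept to recover the original spelling.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Remove diacritics by decomposing characters and dropping combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


entity_names = ["A Coruña", "Ávila", "Inditex"]  # hypothetical entities
normalized = [strip_accents(name) for name in entity_names]

# Reverse map: normalized name -> original spelling, for display after alignment.
original_by_normalized = dict(zip(normalized, entity_names))
```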
|
||||
@@ -0,0 +1,90 @@
|
||||
"""Alinea triplets REBEL / mREBEL a nombres canonicos de entidades."""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
|
||||
def align_relations_to_entities(
    triplets: list[dict],
    entity_names: list[str],
) -> list[dict]:
    """Align REBEL triplets to a set of canonical entity names.

    For each triplet produced by ``parse_rebel_output``, tries to resolve the
    ``head`` and ``tail`` spans to a canonical entity name from ``entity_names``
    using the following strategy (in order):

    1. **Exact case-insensitive match**: ``"Inditex" == "inditex"``.
    2. **Substring match**: either the span contains an entity name, or an
       entity name contains the span. When multiple entity names match, the
       *longest* one wins (most specific).

    Triplets are dropped when:
    - Either ``head`` or ``tail`` cannot be resolved to any entity name.
    - The resolved ``from`` and ``to`` are the same name (self-loop).

    Args:
        triplets: List of dicts produced by ``parse_rebel_output``, each with
            keys ``head``, ``head_type``, ``type``, ``tail``, ``tail_type``.
        entity_names: Canonical entity names to match against. Typically
            ``[e.name for e in entities]``. Order does not matter; matching
            is case-insensitive.

    Returns:
        List of dicts with keys:
            ``from`` (str), ``kind`` (str), ``to`` (str),
            ``head_type`` (str), ``tail_type`` (str).
        ``from`` and ``to`` are values taken verbatim from ``entity_names``.
        Empty list if no triplet survives alignment.
    """
    if not triplets or not entity_names:
        return []

    # Pre-build lookup: lowercased -> original for O(1) exact lookup.
    lower_to_name: dict[str, str] = {n.lower(): n for n in entity_names}
    # Sort by length DESC for substring match (longest entity wins).
    names_by_len: list[str] = sorted(entity_names, key=len, reverse=True)

    def _resolve(span: str) -> str | None:
        """Return a canonical entity name for `span`, or None if no match."""
        if not span:
            return None
        span_lower = span.lower()

        # 1. Exact case-insensitive.
        if span_lower in lower_to_name:
            return lower_to_name[span_lower]

        # 2. Substring: longest entity that is contained in span, or whose
        #    name contains span (both directions), longest-wins.
        for name in names_by_len:
            name_lower = name.lower()
            if name_lower in span_lower or span_lower in name_lower:
                return name

        return None

    aligned: list[dict] = []
    for triplet in triplets:
        head_span = triplet.get("head", "")
        tail_span = triplet.get("tail", "")
        relation = triplet.get("type", "")

        from_name = _resolve(head_span)
        to_name = _resolve(tail_span)

        # Drop the triplet if either side is unresolved or it is a self-loop.
        if from_name is None or to_name is None:
            continue
        if from_name == to_name:
            continue

        aligned.append(
            {
                "from": from_name,
                "kind": relation,
                "to": to_name,
                "head_type": triplet.get("head_type", ""),
                "tail_type": triplet.get("tail_type", ""),
            }
        )

    return aligned
|
||||
@@ -0,0 +1,42 @@
|
||||
---
|
||||
id: alpha_shape_concave_hull_py_datascience
|
||||
name: alpha_shape_concave_hull
|
||||
kind: function
|
||||
lang: py
|
||||
domain: datascience
|
||||
version: "1.0.0"
|
||||
purity: pure
|
||||
signature: "def alpha_shape_concave_hull(points: list[tuple[float, float]], alpha: float) -> shapely.geometry.base.BaseGeometry | None"
|
||||
description: "Computes the alpha-shape (concave hull) of a 2-D point set via Delaunay triangulation, filtering triangles by circumradius <= alpha and merging survivors."
|
||||
tags: [geometry, spatial, concave-hull, alpha-shape, shapely, delaunay]
|
||||
uses_functions: []
|
||||
uses_types: []
|
||||
returns: []
|
||||
returns_optional: true
|
||||
error_type: ""
|
||||
imports: [numpy, shapely]
|
||||
example: |
  from alpha_shape_concave_hull import alpha_shape_concave_hull
  pts = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
  geom = alpha_shape_concave_hull(pts, alpha=10.0)
  # shapely Polygon
|
||||
tested: true
|
||||
tests:
|
||||
- "test_alpha_shape_square_large_alpha"
|
||||
- "test_alpha_shape_too_few_points"
|
||||
- "test_alpha_shape_very_small_alpha_returns_none"
|
||||
- "test_alpha_shape_5_points_returns_geometry"
|
||||
test_file_path: "python/functions/datascience/tests/test_alpha_shape_concave_hull.py"
|
||||
file_path: "python/functions/datascience/alpha_shape_concave_hull.py"
|
||||
params:
  - name: points
    desc: "List of (x, y) coordinate pairs. Requires at least 4 points."
  - name: alpha
    desc: "Alpha radius parameter. Triangles with circumradius > alpha are discarded. Smaller alpha = more concave hull."
output: "Shapely geometry (Polygon, MultiPolygon, or GeometryCollection) of the alpha-shape, or None if there are fewer than 4 points, no triangles survive the alpha filter, or numpy/shapely cannot be imported."
|
||||
source_repo: "internal:footprint_aurgi"
|
||||
source_license: "internal-aurgi"
|
||||
source_file: "ponderacion_isochronas/src/recomendador_centros.py:408"
|
||||
---
|
||||
|
||||
Requires shapely. If shapely is not installed, the function silently returns None. The return value is optional (returns_optional=true) because there may be no valid triangles.
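
The alpha filter keeps a triangle only when its circumradius R = abc / (4 * Area) is at most alpha. The sketch below works that formula through on two triangles using only the standard library. `circumradius` is a hypothetical helper written for this illustration; returning infinity for a degenerate triangle is an assumption of the sketch (the registry function skips such triangles instead).

```python
import math


def circumradius(a, b, c):
    """Circumradius of triangle abc via R = (ab * bc * ca) / (4 * area)."""
    ab = math.dist(a, b)
    bc = math.dist(b, c)
    ca = math.dist(c, a)
    # Triangle area from the 2-D cross product.
    area = abs((b[0] - a[0]) * (c[1] - a[1]) - (c[0] - a[0]) * (b[1] - a[1])) / 2.0
    if area == 0:
        return math.inf  # degenerate (collinear) triangle: never passes the filter
    return (ab * bc * ca) / (4.0 * area)


# 3-4-5 right triangle: R is half the hypotenuse.
r_345 = circumradius((0, 0), (3, 0), (0, 4))  # 2.5

# Half of a unit square split along a diagonal: R = sqrt(2) / 2.
r_half_square = circumradius((0, 0), (1, 0), (1, 1))

kept_with_large_alpha = r_half_square <= 10.0  # triangle survives
kept_with_small_alpha = r_half_square <= 0.5   # triangle is discarded
```

This is why a unit square with `alpha=10.0` yields a polygon, while a very small alpha leaves no triangles and the result is None.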
|
||||
@@ -0,0 +1,67 @@
|
||||
"""alpha_shape_concave_hull — Concave hull via Delaunay alpha-shape filtering."""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
|
||||
def alpha_shape_concave_hull(
    points: list[tuple[float, float]],
    alpha: float,
) -> "shapely.geometry.base.BaseGeometry | None":
    """Compute the alpha-shape (concave hull) of a 2-D point set.

    Performs a Delaunay triangulation over the input points, then keeps only
    those triangles whose circumscribed circle radius is <= alpha. The
    remaining triangles are merged via unary_union.

    Args:
        points: List of (x, y) coordinate pairs. Must have >= 4 elements.
        alpha: Alpha parameter controlling concavity (smaller = more concave).
            Triangles with circumradius > alpha are discarded.

    Returns:
        A shapely geometry (Polygon, MultiPolygon, or GeometryCollection)
        representing the alpha-shape, or None if len(points) < 4, if numpy
        or shapely cannot be imported, or if no triangles survive the alpha
        filter.
    """
    if len(points) < 4:
        return None

    try:
        import numpy as np
        from shapely.geometry import MultiPoint
        from shapely.ops import triangulate, unary_union
    except ImportError:
        return None

    mp = MultiPoint(points)
    triangles = triangulate(mp)

    valid = []
    for tri in triangles:
        coords = list(tri.exterior.coords)
        a_pt = np.array(coords[0])
        b_pt = np.array(coords[1])
        c_pt = np.array(coords[2])

        # Circumradius via the formula R = (abc) / (4 * Area)
        ab = np.linalg.norm(b_pt - a_pt)
        bc = np.linalg.norm(c_pt - b_pt)
        ca = np.linalg.norm(a_pt - c_pt)

        # Area via cross product
        area = abs(
            (b_pt[0] - a_pt[0]) * (c_pt[1] - a_pt[1])
            - (c_pt[0] - a_pt[0]) * (b_pt[1] - a_pt[1])
        ) / 2.0

        # Skip degenerate (zero-area) triangles to avoid division by zero.
        if area == 0:
            continue

        circumradius = (ab * bc * ca) / (4.0 * area)
        if circumradius <= alpha:
            valid.append(tri)

    if not valid:
        return None

    return unary_union(valid)
|
||||
@@ -0,0 +1,68 @@
|
||||
---
|
||||
id: best_central_tendency_py_datascience
|
||||
name: best_central_tendency
|
||||
kind: function
|
||||
lang: py
|
||||
domain: datascience
|
||||
version: "1.0.0"
|
||||
purity: pure
|
||||
signature: "def best_central_tendency(values: list[float], dist_type: str) -> tuple[str, float]"
|
||||
description: "Selects the most appropriate central tendency measure for a given distribution type. Returns (label, value) pair."
|
||||
tags: [statistics, central-tendency, distribution, robust, mean, median]
|
||||
uses_functions:
|
||||
- geometric_mean_py_datascience
|
||||
- trimmed_mean_py_datascience
|
||||
uses_types: []
|
||||
returns: []
|
||||
returns_optional: false
|
||||
error_type: ""
|
||||
imports: [math, numpy]
|
||||
example: |
|
||||
from best_central_tendency import best_central_tendency
|
||||
label, value = best_central_tendency([1, 2, 3, 4, 5], "normal-ish")
|
||||
# ("mean", 3.0)
|
||||
tested: true
|
||||
tests:
|
||||
- "test_best_central_tendency_normal_ish"
|
||||
- "test_best_central_tendency_right_skewed"
|
||||
- "test_best_central_tendency_left_skewed"
|
||||
- "test_best_central_tendency_lognormal_ish"
|
||||
- "test_best_central_tendency_heavy_tail"
|
||||
- "test_best_central_tendency_empty"
|
||||
- "test_best_central_tendency_default"
|
||||
test_file_path: "python/functions/datascience/tests/test_best_central_tendency.py"
|
||||
file_path: "python/functions/datascience/best_central_tendency.py"
|
||||
params:
|
||||
- name: values
|
||||
desc: "List of numeric values to summarize."
|
||||
- name: dist_type
|
||||
desc: "Distribution type string, typically from detect_distribution_type. One of: normal-ish, lognormal-ish, heavy-tail, right-skewed, left-skewed, other, too_few_samples."
|
||||
output: >
|
||||
Tuple (label, value) where label is one of "mean", "median", "geometric_mean",
|
||||
"trimmed_mean_5%", and value is the computed central tendency. Returns ("median", math.nan) for empty input.
|
||||
source_repo: "internal:footprint_aurgi"
|
||||
source_license: "internal-aurgi"
|
||||
source_file: "aurgi_mapas/generar_pdf_reporte.py:196"
|
||||
---
|
||||
|
||||
## Example

```python
from best_central_tendency import best_central_tendency

best_central_tendency([1, 2, 3, 4, 5], "normal-ish")     # ("mean", 3.0)
best_central_tendency([1, 2, 3, 4, 5], "right-skewed")   # ("median", 3.0)
best_central_tendency([1, 2, 4, 8], "lognormal-ish")     # ("geometric_mean", ~2.83)
best_central_tendency([1, 2, 3, 100], "heavy-tail")      # ("trimmed_mean_5%", ...)
```
|
||||
|
||||
## Mapping of types to measures

| dist_type       | Measure          | Internal function     |
|-----------------|------------------|-----------------------|
| normal-ish      | mean             | numpy.mean            |
| lognormal-ish   | geometric_mean   | geometric_mean()      |
| heavy-tail      | trimmed_mean_5%  | trimmed_mean(0.05)    |
| right-skewed    | median           | numpy.median          |
| left-skewed     | median           | numpy.median          |
| other / default | median           | numpy.median          |
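
To see why the measure depends on the distribution type, the sketch below compares the measures on a small heavy-tailed sample using only the standard library. `trimmed_mean_sketch` is a hypothetical stand-in: it assumes the trim fraction is dropped from each end, which may not match the registry's `trimmed_mean` exactly.

```python
import statistics


def trimmed_mean_sketch(values, trim=0.05):
    """Mean after dropping a `trim` fraction of values from each end (assumed semantics)."""
    ordered = sorted(values)
    k = int(len(ordered) * trim)  # number of values dropped per side
    kept = ordered[k:len(ordered) - k] if k else ordered
    return statistics.fmean(kept)


# 1..19 plus a single extreme value: one outlier dominates the plain mean.
heavy_tail = list(range(1, 20)) + [1000]

mean_all = statistics.fmean(heavy_tail)    # pulled up by the outlier
trimmed = trimmed_mean_sketch(heavy_tail)  # drops 1 and 1000
median = statistics.median(heavy_tail)

# Geometric mean of a multiplicative series, as in the lognormal-ish row.
geo = statistics.geometric_mean([1, 2, 4, 8])
```

On this sample the plain mean is 59.5, while the trimmed mean and the median both give 10.5, which is much closer to the bulk of the data.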
|
||||
@@ -0,0 +1,45 @@
|
||||
"""best_central_tendency — Select the best central tendency measure for a distribution type."""
|
||||
|
||||
import math
|
||||
import numpy as np
|
||||
|
||||
try:
|
||||
from .geometric_mean import geometric_mean
|
||||
from .trimmed_mean import trimmed_mean
|
||||
except ImportError:
|
||||
from geometric_mean import geometric_mean # type: ignore
|
||||
from trimmed_mean import trimmed_mean # type: ignore
|
||||
|
||||
|
||||
def best_central_tendency(values: list[float], dist_type: str) -> tuple[str, float]:
    """Return the most appropriate central tendency measure given the distribution type.

    Mapping:
        "normal-ish"     -> ("mean", arithmetic mean)
        "lognormal-ish"  -> ("geometric_mean", geometric mean of positives)
        "heavy-tail"     -> ("trimmed_mean_5%", trimmed mean at 5%)
        "right-skewed"   -> ("median", median)
        "left-skewed"    -> ("median", median)
        default          -> ("median", median)

    Args:
        values: List of numeric values.
        dist_type: Distribution type string (from detect_distribution_type).

    Returns:
        Tuple (label: str, value: float). Value is math.nan if values is empty.
    """
    if not values:
        return ("median", math.nan)

    arr = np.array(values, dtype=float)

    if dist_type == "normal-ish":
        return ("mean", float(np.mean(arr)))
    elif dist_type == "lognormal-ish":
        return ("geometric_mean", geometric_mean(values))
    elif dist_type == "heavy-tail":
        return ("trimmed_mean_5%", trimmed_mean(values, trim=0.05))
    else:
        # right-skewed, left-skewed, other, too_few_samples, unknown
        return ("median", float(np.median(arr)))
|
||||
@@ -0,0 +1,67 @@
|
||||
---
|
||||
id: detect_distribution_type_py_datascience
|
||||
name: detect_distribution_type
|
||||
kind: function
|
||||
lang: py
|
||||
domain: datascience
|
||||
version: "1.0.0"
|
||||
purity: pure
|
||||
signature: "def detect_distribution_type(values: list[float]) -> dict"
|
||||
description: "Classifies the shape of a numeric distribution using skewness, excess kurtosis, tail ratio and log-skewness. Returns a type label and raw stats."
|
||||
tags: [statistics, distribution, classification, skewness, kurtosis]
|
||||
uses_functions: []
|
||||
uses_types: []
|
||||
returns: []
|
||||
returns_optional: false
|
||||
error_type: ""
|
||||
imports: [math, numpy]
|
||||
example: |
|
||||
from detect_distribution_type import detect_distribution_type
|
||||
import numpy as np
|
||||
result = detect_distribution_type(np.random.normal(0, 1, 200).tolist())
|
||||
# {"type": "normal-ish", "stats": {"n": 200, "skew": ..., ...}}
|
||||
tested: true
|
||||
tests:
|
||||
- "test_detect_too_few_samples"
|
||||
- "test_detect_normal_ish"
|
||||
- "test_detect_right_skewed"
|
||||
- "test_detect_stats_keys"
|
||||
- "test_detect_exactly_30"
|
||||
test_file_path: "python/functions/datascience/tests/test_detect_distribution_type.py"
|
||||
file_path: "python/functions/datascience/detect_distribution_type.py"
|
||||
params:
|
||||
- name: values
|
||||
desc: "List of numeric values to classify. Minimum 30 for meaningful classification."
|
||||
output: >
|
||||
Dict with "type" (str) and "stats" (dict). Type is one of: normal-ish,
|
||||
lognormal-ish, heavy-tail, right-skewed, left-skewed, other, too_few_samples.
|
||||
Stats contains: n, skew, kurtosis, tail_ratio, log_skew.
|
||||
source_repo: "internal:footprint_aurgi"
|
||||
source_license: "internal-aurgi"
|
||||
source_file: "aurgi_mapas/generar_pdf_reporte.py:133"
|
||||
---
|
||||
|
||||
## Example

```python
from detect_distribution_type import detect_distribution_type
import numpy as np

detect_distribution_type(np.random.normal(0, 1, 200).tolist())
# typically {"type": "normal-ish", "stats": {"n": 200, "skew": ..., ...}}
# (the sample is random, so the exact stats vary between runs)

detect_distribution_type([1]*5)
# {"type": "too_few_samples", "stats": {"n": 5}}
```
|
||||
|
||||
## Classification logic

Rules are evaluated in order; the first match wins:

- n < 30 → too_few_samples
- excess kurtosis > 3 → heavy-tail
- |skew| <= 0.5 AND |kurt| <= 1 → normal-ish
- skew > 0.5 AND |log_skew| <= 0.5 AND tail_ratio > 2 → lognormal-ish
- skew > 0.5 → right-skewed
- skew < -0.5 → left-skewed
- default → other

tail_ratio = p99/p50 (NaN when p50 is 0); log_skew is computed only when there are >= 30 positive values (NaN otherwise).
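
The rule order above can be sketched as a small function over precomputed statistics. This is an illustration of the decision chain only: `classify` is a hypothetical name, and the n < 30 check is left to the caller.

```python
import math


def classify(skew: float, kurt: float, tail_ratio: float, log_skew: float) -> str:
    """Apply the classification rules in order; the first match wins."""
    if kurt > 3.0:
        return "heavy-tail"
    if abs(skew) <= 0.5 and abs(kurt) <= 1.0:
        return "normal-ish"
    if (
        skew > 0.5
        and not math.isnan(log_skew)
        and abs(log_skew) <= 0.5
        and not math.isnan(tail_ratio)
        and tail_ratio > 2.0
    ):
        return "lognormal-ish"
    if skew > 0.5:
        return "right-skewed"
    if skew < -0.5:
        return "left-skewed"
    return "other"


# Strong positive skew, but the log of the data is symmetric and the tail is long.
label_lognormal = classify(skew=1.2, kurt=2.0, tail_ratio=3.0, log_skew=0.1)

# Same skew with a short tail falls through to the plain right-skewed rule.
label_right = classify(skew=1.2, kurt=2.0, tail_ratio=1.5, log_skew=0.1)

# High excess kurtosis is checked first, so it overrides everything else.
label_heavy = classify(skew=2.0, kurt=5.0, tail_ratio=3.0, log_skew=0.1)
```

Because heavy-tail is tested before the skew rules, a sample that is both skewed and heavy-tailed is labelled heavy-tail.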
|
||||
@@ -0,0 +1,89 @@
"""detect_distribution_type: classify the distribution shape of a sample."""

import math
import numpy as np


def detect_distribution_type(values: list[float]) -> dict:
    """Classify the distribution shape of a numeric sample.

    Uses skewness, excess kurtosis, tail ratio (p99/p50), and log-skewness
    to assign one of: normal-ish, lognormal-ish, heavy-tail, right-skewed,
    left-skewed, other, or too_few_samples (n < 30).

    Args:
        values: List of numeric values.

    Returns:
        Dict with keys:
            "type" (str): distribution label.
            "stats" (dict): {"n", "skew", "kurtosis", "tail_ratio", "log_skew"}.
    """
    n = len(values)
    if n < 30:
        return {"type": "too_few_samples", "stats": {"n": n}}

    arr = np.array(values, dtype=float)

    mean = float(np.mean(arr))
    std = float(np.std(arr, ddof=1))

    # Skewness
    if std == 0:
        skew = 0.0
    else:
        skew = float(np.mean(((arr - mean) / std) ** 3))

    # Excess kurtosis
    if std == 0:
        kurt = 0.0
    else:
        kurt = float(np.mean(((arr - mean) / std) ** 4)) - 3.0

    # Tail ratio: p99 / p50 (only meaningful when median != 0)
    p50 = float(np.percentile(arr, 50))
    p99 = float(np.percentile(arr, 99))
    tail_ratio = (p99 / p50) if p50 != 0 else math.nan

    # Log-skewness on positive values
    positives = arr[arr > 0]
    if len(positives) >= 30:
        log_arr = np.log(positives)
        log_mean = float(np.mean(log_arr))
        log_std = float(np.std(log_arr, ddof=1))
        if log_std == 0:
            log_skew = 0.0
        else:
            log_skew = float(np.mean(((log_arr - log_mean) / log_std) ** 3))
    else:
        log_skew = math.nan

    stats = {
        "n": n,
        "skew": skew,
        "kurtosis": kurt,
        "tail_ratio": tail_ratio,
        "log_skew": log_skew,
    }

    # Classification logic
    if kurt > 3.0:
        dist_type = "heavy-tail"
    elif abs(skew) <= 0.5 and abs(kurt) <= 1.0:
        dist_type = "normal-ish"
    elif (
        skew > 0.5
        and not math.isnan(log_skew)
        and abs(log_skew) <= 0.5
        and not math.isnan(tail_ratio)
        and tail_ratio > 2.0
    ):
        dist_type = "lognormal-ish"
    elif skew > 0.5:
        dist_type = "right-skewed"
    elif skew < -0.5:
        dist_type = "left-skewed"
    else:
        dist_type = "other"

    return {"type": dist_type, "stats": stats}
|
||||
@@ -0,0 +1,65 @@
|
||||
---
name: extract_graph_gliner2
kind: function
lang: py
domain: datascience
version: "1.0.0"
purity: impure
signature: "def extract_graph_gliner2(text: str, entity_labels: list[str], relation_labels: list | dict, model: Any, threshold: float = 0.3, include_confidence: bool = False) -> dict"
description: "Extracts entities and relations in a single pass with GLiNER2. High-level wrapper: builds the schema, runs the extraction, and normalizes the result to a flat dict. Applies no post-filtering or coreference resolution."
tags: [gliner2, ner, relation-extraction, nlp, extraction, graph, zero-shot, datascience, python, apache2]
uses_functions:
  - gliner2_load_model_py_datascience
uses_types: []
returns: []
returns_optional: false
error_type: "error_go_core"
imports: [time, typing.Any]
params:
  - name: text
    desc: "Text to analyze. Recommended length is up to 1500 chars (pre-chunked with chunk_with_overlap). Longer texts degrade GLiNER2 recall."
  - name: entity_labels
    desc: "List of entity type strings in lowercase snake_case, e.g. ['person', 'organization', 'location']. snake_case labels improve recall according to notebook 08."
  - name: relation_labels
    desc: "List of strings, or dict {label: description}, with the relation types, e.g. ['works_at', 'ceo_of'] or {'works_at': 'person works at an organization'}."
  - name: model
    desc: "GLiNER2 instance loaded with gliner2_load_model. Injected by the caller (not loaded here)."
  - name: threshold
    desc: "Confidence threshold between 0 and 1. 0.3 was validated empirically in notebook 04 (gliner_glirel_tuning). Lower values give more recall and more noise."
  - name: include_confidence
    desc: "If True, GLiNER2 returns its internal scores per entity and relation. False by default for cleaner output."
output: "Dict with three fields: 'entities' -> {type: [name, ...]}, 'relation_extraction' -> {rel_type: [(head, tail), ...]}, 'elapsed_s' -> float. Compatible with aggregate_extraction_results."
tested: true
tests:
  - "output tiene claves entities relation_extraction elapsed_s"
  - "stub model retorna shape correcto"
test_file_path: "python/functions/datascience/tests/test_extract_graph_gliner2.py"
file_path: "python/functions/datascience/extract_graph_gliner2.py"
notes: |
  LICENSE: GLiNER2 (fastino/gliner2-large-v1) is Apache 2.0, so commercial use is OK.

  impure: invokes model inference (computational side effect, variable run time).
  The model is injected externally to allow caching and reuse across calls.
  For long texts, apply chunk_with_overlap first, call this function once per chunk,
  and then combine the results with aggregate_extraction_results.
---
|
||||
|
||||
## Example

```python
from datascience.gliner2_load_model import gliner2_load_model
from datascience.extract_graph_gliner2 import extract_graph_gliner2

model = gliner2_load_model(device="auto")

result = extract_graph_gliner2(
    text="Carlos Torres es presidente de BBVA, con sede en Bilbao.",
    entity_labels=["person", "organization", "location"],
    relation_labels=["president_of", "headquartered_in"],
    model=model,
    threshold=0.3,
)
# result["entities"]            -> {"person": ["Carlos Torres"], ...}
# result["relation_extraction"] -> {"president_of": [("Carlos Torres", "BBVA")]}
# result["elapsed_s"]           -> e.g. 0.234 (depends on hardware)
```
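For texts longer than the recommended chunk size, the function is meant to be called once per chunk and the per-chunk dicts combined afterwards. The signature of `aggregate_extraction_results` is not shown in this document, so the sketch below uses a hypothetical helper, `merge_chunk_results`, purely to illustrate how dicts of the documented output shape can be unioned. The two chunk results are hand-written stand-ins for real model output.

```python
def merge_chunk_results(results: list[dict]) -> dict:
    """Hypothetical merger: order-preserving union of per-chunk outputs."""
    merged = {"entities": {}, "relation_extraction": {}, "elapsed_s": 0.0}
    for result in results:
        for field in ("entities", "relation_extraction"):
            for label, items in result.get(field, {}).items():
                bucket = merged[field].setdefault(label, [])
                for item in items:
                    if item not in bucket:  # drop duplicates from overlapping chunks
                        bucket.append(item)
        merged["elapsed_s"] += result.get("elapsed_s", 0.0)
    merged["elapsed_s"] = round(merged["elapsed_s"], 3)
    return merged


# Hand-written results standing in for two extract_graph_gliner2 calls.
chunk_a = {
    "entities": {"person": ["Carlos Torres"], "organization": ["BBVA"]},
    "relation_extraction": {"president_of": [("Carlos Torres", "BBVA")]},
    "elapsed_s": 0.2,
}
chunk_b = {
    "entities": {"organization": ["BBVA"], "location": ["Bilbao"]},
    "relation_extraction": {"headquartered_in": [("BBVA", "Bilbao")]},
    "elapsed_s": 0.3,
}

merged = merge_chunk_results([chunk_a, chunk_b])
print(merged["entities"])
print(merged["relation_extraction"])
```

Exact-string deduplication is the simplest possible choice here; alias resolution (for example "BBVA" versus "Banco BBVA") is left to a separate step such as `merge_entity_aliases`.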
|
||||
@@ -0,0 +1,60 @@
"""Entity + relation extraction in a single pass with GLiNER2."""

from __future__ import annotations

import time
from typing import Any


def extract_graph_gliner2(
    text: str,
    entity_labels: list[str],
    relation_labels: list | dict,
    model: Any,
    threshold: float = 0.3,
    include_confidence: bool = False,
) -> dict:
    """Extract entities + relations using GLiNER2 with one schema pass.

    High-level wrapper over the GLiNER2 API. Builds the schema, runs the
    extraction, and normalizes the result to a flat dict.
    Does NOT apply post-filtering or coreference resolution; the caller does
    that with filter_relations_by_entity_types and merge_entity_aliases.

    Args:
        text: Text to analyze. Recommended: <= 1500 chars (pre-chunked).
        entity_labels: List of strings with the entity types.
            E.g. ["person", "organization", "location"]
        relation_labels: List of strings or dict {label: description} with
            the relation types.
            E.g. ["works_at", "ceo_of"] or
            {"works_at": "person works at organization"}
        model: GLiNER2 instance loaded with gliner2_load_model.
        threshold: Confidence threshold (0-1). 0.3 is the value validated
            empirically in the analysis notebooks.
        include_confidence: If True, the model returns scores per entity
            and relation (GLiNER2 internal format).

    Returns:
        {
            "entities": {type: [name, ...]},
            "relation_extraction": {rel_type: [(head, tail), ...]},
            "elapsed_s": float
        }
    """
    schema = model.create_schema().entities(entity_labels).relations(relation_labels)

    t0 = time.time()
    r = model.extract(
        text,
        schema=schema,
        threshold=threshold,
        include_confidence=include_confidence,
    )
    elapsed = round(time.time() - t0, 3)

    return {
        "entities": r.get("entities", {}),
        "relation_extraction": r.get("relation_extraction", {}),
        "elapsed_s": elapsed,
    }
|
||||
@@ -0,0 +1,114 @@
|
||||
---
name: extract_relations_mrebel
kind: function
lang: py
domain: datascience
version: "1.0.0"
purity: impure
signature: "def extract_relations_mrebel(text: str, entities: list[EntityCandidate], tokenizer: Any, model: Any, src_lang: str = 'es_XX', sentence_split_re: str = r'(?<=[.!?])\\s+', min_sentence_chars: int = 20, num_beams: int = 4, max_length: int = 256) -> list[RelationCandidate]"
description: "Extracts relations between entities using mREBEL (multilingual seq2seq). Splits the text into sentences, generates triplets with mREBEL, parses them with parse_rebel_output, and aligns them to the known entities with align_relations_to_entities. Interchangeable with extract_relations_glirel in benchmarks."
tags: [mrebel, relation-extraction, nlp, extract, knowledge-graph, seq2seq, multilingual, datascience, python]
uses_functions:
  - mrebel_load_model_py_datascience
  - parse_rebel_output_py_datascience
  - align_relations_to_entities_py_datascience
uses_types:
  - entity_candidate_py_datascience
  - relation_candidate_py_datascience
returns:
  - relation_candidate_py_datascience
returns_optional: false
error_type: "error_go_core"
imports: [re]
params:
  - name: text
    desc: "source text in the language given by src_lang (the same chunk used to extract the entities)"
  - name: entities
    desc: "entities already extracted from this text (from extract_entities_gliner or similar). Only relations between entities in this list are kept."
  - name: tokenizer
    desc: "mREBEL tokenizer loaded with mrebel_load_model; injected by the caller to avoid reloading it in batch runs"
  - name: model
    desc: "mREBEL model loaded with mrebel_load_model; injected by the caller"
  - name: src_lang
    desc: "informational: the language the tokenizer was loaded with (e.g. 'es_XX'). Not used at runtime."
  - name: sentence_split_re
    desc: "regex pattern used to split the text into sentences. Default: split on whitespace that follows [.!?]."
  - name: min_sentence_chars
    desc: "minimum length in characters for a sentence to be processed. Shorter fragments are skipped (default 20)."
  - name: num_beams
    desc: "beam search width for model.generate (default 4)"
  - name: max_length
    desc: "maximum length in tokens for tokenization and generation (default 256)"
output: "list of RelationCandidate with confidence=1.0 (mREBEL produces no continuous score). from_name/to_name always match entities from the input."
tested: true
tests:
  - "flujo completo con stub produce RelationCandidate correctos"
  - "menos de 2 entidades retorna vacio"
  - "texto vacio retorna vacio"
  - "triplets no alineables se descartan"
test_file_path: "python/functions/datascience/tests/test_extract_relations_mrebel.py"
file_path: "python/functions/datascience/extract_relations_mrebel.py"
notes: |
  impure: model.generate is compute-heavy and depends on external state (the model weights).

  mREBEL produces no continuous confidence score: it returns the triplets the model
  decoded as its most likely output. confidence=1.0 is a marker meaning "the model
  emitted it", not a calibrated probability. To filter by quality, use the number of
  beams as a proxy or combine with a downstream classifier.

  Interchangeable with extract_relations_glirel in benchmarks:
    - Same core inputs (text, entities, model); mREBEL additionally takes its tokenizer
    - Same output (list[RelationCandidate])
    - Difference: mREBEL needs no relation_types (it generates relations freely),
      glirel requires relation_types (discriminative zero-shot).

  Model LICENSE: Babelscape/mrebel-large is CC BY-NC-SA 4.0 (non-commercial).
  See mrebel_load_model for details.
---
|
||||
|
||||
## Example

```python
from python.functions.datascience.mrebel_load_model import mrebel_load_model
from python.functions.datascience.extract_relations_mrebel import extract_relations_mrebel
from python.types.datascience.entity_candidate import EntityCandidate

tokenizer, model = mrebel_load_model(src_lang="es_XX")

text = "Pablo Isla es el presidente de Inditex. La empresa tiene sede en Arteixo, A Coruna."
entities = [
    EntityCandidate(name="Pablo Isla", type_label="PER", confidence=0.95),
    EntityCandidate(name="Inditex", type_label="ORG", confidence=0.92),
    EntityCandidate(name="Arteixo", type_label="LOC", confidence=0.88),
    EntityCandidate(name="A Coruna", type_label="LOC", confidence=0.85),
]

relations = extract_relations_mrebel(
    text=text,
    entities=entities,
    tokenizer=tokenizer,
    model=model,
)
# e.g. [RelationCandidate(from_name='Pablo Isla', to_name='Inditex',
#                         relation_type='employer', confidence=1.0, ...), ...]
```

## Comparison with extract_relations_glirel

| | mREBEL | GLiREL |
|---|---|---|
| Type | Generative seq2seq | Zero-shot discriminative |
| relation_types | No (free generation) | Yes (required) |
| Confidence | Fixed 1.0 (not calibrated) | 0.0-1.0 (calibrated) |
| Languages | 30+ (multilingual) | Mainly EN |
| Model license | CC BY-NC-SA (non-commercial) | Apache 2.0 |
| Speed | Slower (seq2seq) | Faster (classifier) |

## Design notes

- `parse_rebel_output` and `align_relations_to_entities` are pure functions
  composed by this impure function, so each can be tested independently.
- Tokenization/generation errors are caught per sentence and skipped (the
  sentence is ignored; the rest of the text is still processed).
- `source_chunk_index` tracks the index of the source sentence, not of a text
  chunk, which is useful for debugging. The index is recovered by relation type
  only, so when several sentences emit the same type it points at the first one.
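The sentence splitting and the entity-name ordering are plain standard-library steps and can be tried without loading the model. This is a minimal sketch using the default `sentence_split_re` and `min_sentence_chars` from the signature; the helper name `split_sentences` is hypothetical.

```python
import re

SENTENCE_SPLIT_RE = r"(?<=[.!?])\s+"  # default pattern from the signature
MIN_SENTENCE_CHARS = 20               # default minimum sentence length


def split_sentences(text: str, pattern: str = SENTENCE_SPLIT_RE,
                    min_chars: int = MIN_SENTENCE_CHARS) -> list[str]:
    """Split on whitespace that follows ., ! or ? and drop short fragments."""
    parts = (p.strip() for p in re.split(pattern, text.strip()))
    return [p for p in parts if len(p) >= min_chars]


text = "Pablo Isla es el presidente de Inditex. Si. La empresa tiene sede en Arteixo, A Coruna."
sentences = split_sentences(text)
print(sentences)  # the 3-character fragment "Si." is dropped

# Longer names first, so they win in substring matching during alignment.
entity_names = ["Inditex", "Pablo Isla", "A Coruna", "Arteixo"]
ordered = sorted(entity_names, key=len, reverse=True)
print(ordered)
```

The default pattern also splits after abbreviations such as "Sr." or "S.A.", so a custom `sentence_split_re` may be worth passing for text that contains many of them.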
|
||||
@@ -0,0 +1,136 @@
"""Extracts relations between entities using mREBEL (multilingual seq2seq)."""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import os
|
||||
import re
|
||||
import sys
|
||||
from typing import Any
|
||||
|
||||
sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "..", ".."))
|
||||
|
||||
from python.types.datascience.entity_candidate import EntityCandidate
|
||||
from python.types.datascience.relation_candidate import RelationCandidate
|
||||
from python.functions.datascience.parse_rebel_output import parse_rebel_output
|
||||
from python.functions.datascience.align_relations_to_entities import align_relations_to_entities
|
||||
|
||||
|
||||
def extract_relations_mrebel(
|
||||
text: str,
|
||||
entities: list[EntityCandidate],
|
||||
tokenizer: Any,
|
||||
model: Any,
|
||||
src_lang: str = "es_XX",
|
||||
sentence_split_re: str = r"(?<=[.!?])\s+",
|
||||
min_sentence_chars: int = 20,
|
||||
num_beams: int = 4,
|
||||
max_length: int = 256,
|
||||
) -> list[RelationCandidate]:
|
||||
"""Extract relations from text using mREBEL, sentence by sentence.
|
||||
|
||||
Orchestrates the full pipeline:
|
||||
|
||||
1. Split ``text`` into sentences using ``sentence_split_re``.
|
||||
2. Filter out sentences shorter than ``min_sentence_chars``.
|
||||
3. For each sentence: tokenize → generate → decode (with special tokens)
|
||||
→ ``parse_rebel_output`` → accumulate raw triplets.
|
||||
4. Collect all entity names from ``entities``, sorted DESC by length
|
||||
(so longer names win in substring matching).
|
||||
5. Call ``align_relations_to_entities`` to resolve head/tail spans to
|
||||
canonical entity names and drop unresolved / self-loop triplets.
|
||||
6. Wrap each aligned triplet in a ``RelationCandidate``.
|
||||
|
||||
mREBEL does not produce a continuous confidence score — ``confidence``
|
||||
is set to ``1.0`` as a marker meaning "the model emitted this triplet".
|
||||
|
||||
Args:
|
||||
text: Source text (same language as ``src_lang``).
|
||||
entities: Entities already extracted from this text (e.g. via
|
||||
``extract_entities_gliner``). Used to filter triplets to
|
||||
known entities only.
|
||||
tokenizer: mREBEL tokenizer loaded with ``mrebel_load_model``.
|
||||
model: mREBEL model loaded with ``mrebel_load_model``.
|
||||
src_lang: Informational — the language the tokenizer was loaded with.
|
||||
Not used at inference time (mBART lang tokens are set at load time).
|
||||
sentence_split_re: Regex pattern for sentence splitting. Default splits
|
||||
on whitespace that follows ``.``, ``!`` or ``?``.
|
||||
min_sentence_chars: Minimum character length for a sentence to be
|
||||
processed. Shorter fragments are skipped.
|
||||
num_beams: Beam search width for ``model.generate``. Default 4.
|
||||
max_length: Max token length for both tokenization and generation.
|
||||
|
||||
Returns:
|
||||
List of ``RelationCandidate`` where ``from_name`` and ``to_name``
|
||||
always correspond to names in ``entities``. Empty list if no aligned
|
||||
triplets are found or ``entities`` has fewer than 2 items.
|
||||
"""
|
||||
if len(entities) < 2:
|
||||
return []
|
||||
if not text or not text.strip():
|
||||
return []
|
||||
|
||||
split_re = re.compile(sentence_split_re)
|
||||
sentences = split_re.split(text.strip())
|
||||
sentences = [s.strip() for s in sentences if s.strip() and len(s.strip()) >= min_sentence_chars]
|
||||
if not sentences:
|
||||
return []
|
||||
|
||||
# Step 1-3: gather raw triplets from all sentences.
|
||||
raw_triplets: list[dict] = []
|
||||
for idx, sentence in enumerate(sentences):
|
||||
try:
|
||||
inputs = tokenizer(
|
||||
sentence,
|
||||
return_tensors="pt",
|
||||
max_length=max_length,
|
||||
truncation=True,
|
||||
)
|
||||
generated = model.generate(
|
||||
**inputs,
|
||||
num_beams=num_beams,
|
||||
length_penalty=1.0,
|
||||
max_length=max_length,
|
||||
)
|
||||
decoded = tokenizer.decode(generated[0], skip_special_tokens=False)
|
||||
except Exception:
|
||||
# Skip sentences that fail (e.g. tokenizer errors on special chars).
|
||||
continue
|
||||
|
||||
sentence_triplets = parse_rebel_output(decoded)
|
||||
# Tag each triplet with the sentence index for source_chunk_index.
|
||||
for t in sentence_triplets:
|
||||
t["_sentence_idx"] = idx
|
||||
raw_triplets.extend(sentence_triplets)
|
||||
|
||||
if not raw_triplets:
|
||||
return []
|
||||
|
||||
# Step 4-5: align to entity names (sorted DESC by length for substring match).
|
||||
entity_names = sorted([e.name for e in entities if e.name], key=len, reverse=True)
|
||||
aligned = align_relations_to_entities(raw_triplets, entity_names)
|
||||
|
||||
# Step 6: wrap in RelationCandidate.
|
||||
candidates: list[RelationCandidate] = []
|
||||
for item in aligned:
|
||||
# Recover sentence_idx from the first raw triplet with this relation type (head/tail are not compared).
|
||||
sentence_idx = -1
|
||||
for raw in raw_triplets:
|
||||
if (
|
||||
raw.get("head", "").strip() and
|
||||
raw.get("type", "").strip() == item["kind"]
|
||||
):
|
||||
sentence_idx = raw.get("_sentence_idx", -1)
|
||||
break
|
||||
|
||||
candidates.append(
|
||||
RelationCandidate(
|
||||
from_name=item["from"],
|
||||
to_name=item["to"],
|
||||
relation_type=item["kind"],
|
||||
description="",
|
||||
confidence=1.0,
|
||||
source_chunk_index=sentence_idx,
|
||||
)
|
||||
)
|
||||
|
||||
return candidates
|
||||
@@ -0,0 +1,65 @@
|
||||
---
name: extract_triples_spacy_es
kind: function
lang: py
domain: datascience
version: "1.0.0"
purity: impure
signature: "def extract_triples_spacy_es(text: str, nlp: Any) -> dict"
description: "Schema-less OpenIE extraction for Spanish via spaCy dependency rules. Detects subject-verb-object patterns and uses the verb lemma as the relation (no fixed vocabulary). Also extracts NER entities (PER, ORG, LOC, MISC)."
tags: [spacy, openie, nlp, spanish, triples, dependency, ner, schema-less, datascience, python, mit]
uses_functions:
  - spacy_es_load_model_py_datascience
uses_types: []
returns: []
returns_optional: false
error_type: "error_go_core"
imports: [time, typing.Any]
params:
  - name: text
    desc: "Spanish text to analyze. Works best with complete sentences. Multiple sentences in the same text are supported."
  - name: nlp
    desc: "spaCy Language instance loaded with spacy_es_load_model. Must include the dependency parser, POS tagger, and NER (es_core_news_md or lg)."
output: "Dict with 'text' (the input), 'triples' (list of {subject, relation, object, verb_form, object_dep, prep}), 'entities' (list of {text, label}), and 'elapsed_s'. The relation is the verb lemma, optionally suffixed with a preposition (_en, _con) or a passive marker ([pass])."
tested: true
tests:
  - "oracion simple produce tripleta con sujeto verbo objeto"
  - "carlos torres preside bbva produce tripleta president"
  - "amancio ortega fundo inditex en 1985 produce tripletas con fundar_en"
  - "texto sin verbos produce tripletas vacias"
  - "entities NER detecta PER ORG LOC"
test_file_path: "python/functions/datascience/tests/test_extract_triples_spacy_es.py"
file_path: "python/functions/datascience/extract_triples_spacy_es.py"
notes: |
  LICENSE: spaCy is MIT. The es_core_news_md model is CC BY-SA 4.0.
  Commercial use is permitted with attribution.

  Validated in notebook 09 of the gliner_glirel_tuning analysis.
  Complements extract_graph_gliner2: GLiNER2 needs a fixed relation vocabulary
  but is more precise; spaCy OpenIE uses verb lemmas (no fixed vocabulary)
  but requires manual post-filtering.

  impure: invokes model inference (computational side effect).
  The nlp object is injected externally to allow caching and reuse.

  Compound relations: 'fundar_en' (fundar + preposition 'en'),
  'ser_nombrado[pass]' (passive), 'trabajar_con' (trabajar + 'con').
---
|
||||
|
||||
## Example

```python
from datascience.spacy_es_load_model import spacy_es_load_model
from datascience.extract_triples_spacy_es import extract_triples_spacy_es

nlp = spacy_es_load_model()

result = extract_triples_spacy_es(
    "Amancio Ortega fundo Inditex en 1985 en La Coruna.",
    nlp=nlp,
)
# result["triples"]:
# [{"subject": "Amancio Ortega", "relation": "fundar", "object": "Inditex", ...},
#  {"subject": "Amancio Ortega", "relation": "fundar_en", "object": "1985", ...},
#  {"subject": "Amancio Ortega", "relation": "fundar_en", "object": "La Coruna", ...}]
```
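The relation label is built from the verb lemma plus, at most, one marker: a passive subject takes precedence and yields `[pass]`; otherwise an oblique object with a preposition yields `_<prep>`, except for agents and the preposition "a". The sketch below isolates that rule so it can be tried without spaCy. The helper name `relation_label` is hypothetical; the conditions mirror the ones in the function body.

```python
from __future__ import annotations


def relation_label(verb_lemma: str, dep: str, prep: str | None, passive: bool) -> str:
    """Build the relation label for one (verb, object) pair."""
    if passive:
        # A passive subject wins over any preposition on the object.
        return f"{verb_lemma}[pass]"
    if prep and dep != "obl:agent" and prep != "a":
        return f"{verb_lemma}_{prep}"
    return verb_lemma


print(relation_label("fundar", "obj", None, passive=False))
print(relation_label("fundar", "obl", "en", passive=False))
print(relation_label("nombrar", "obl:agent", "por", passive=True))
print(relation_label("llamar", "obl", "a", passive=False))
```

Because the passive check comes first, a passive verb never carries a preposition suffix, even when the object is an oblique with `en` or `con`.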
|
||||
@@ -0,0 +1,124 @@
"""Schema-less OpenIE triple extraction for Spanish via dependency rules.

Validated in notebook 09 of the gliner_glirel_tuning analysis.
LICENSE: spaCy MIT + es_core_news_md CC BY-SA 4.0.
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import time
|
||||
from typing import Any
|
||||
|
||||
# Determiners and pronouns that are not meaningful entities
|
||||
STOP_TOKENS = {
|
||||
"el", "la", "los", "las", "un", "una", "unos", "unas",
|
||||
"esto", "eso", "aquello", "esta", "este", "estos", "estas",
|
||||
"que", "quien", "cual", "cuales",
|
||||
}
|
||||
|
||||
|
||||
def _clean_span(span_tokens) -> str: # type: ignore[type-arg]
"""Return the text of a token span, dropping leading prepositions."""
|
||||
toks = list(span_tokens)
|
||||
while toks and toks[0].pos_ == "ADP":
|
||||
toks = toks[1:]
|
||||
return " ".join(t.text for t in toks).strip()
|
||||
|
||||
|
||||
def _is_meaningful(text: str) -> bool:
"""Check that a span is neither empty nor a stopword."""
|
||||
if not text or not text.strip():
|
||||
return False
|
||||
if text.lower() in STOP_TOKENS:
|
||||
return False
|
||||
return True
|
||||
|
||||
|
||||
def extract_triples_spacy_es(text: str, nlp: Any) -> dict:
|
||||
"""Extract OpenIE-style (subject, relation, object) triples from Spanish text.
|
||||
|
||||
Uses spaCy dependency rules to find subject-verb-object patterns.
|
||||
Schema-LESS: the relation is the verb's lemma (no fixed vocabulary).
|
||||
Also extracts spaCy NER entities (PER, ORG, LOC, MISC).
|
||||
|
||||
Args:
|
||||
text: Spanish text to analyze. Works best with complete sentences.
|
||||
nlp: spaCy Language instance loaded with spacy_es_load_model.
|
||||
|
||||
Returns:
|
||||
{
|
||||
"text": str,
|
||||
"triples": [
|
||||
{"subject": str, "relation": str, "object": str,
|
||||
"verb_form": str, "object_dep": str, "prep": str|None},
|
||||
...
|
||||
],
|
||||
"entities": [{"text": str, "label": str}, ...],
|
||||
"elapsed_s": float
|
||||
}
|
||||
"""
|
||||
t0 = time.time()
|
||||
doc = nlp(text)
|
||||
triples: list[dict] = []
|
||||
|
||||
for tok in doc:
|
||||
if tok.pos_ not in ("VERB", "AUX"):
|
||||
continue
|
||||
|
||||
verb_lemma = tok.lemma_
|
||||
verb_form = tok.text
|
||||
|
||||
subjs = [
|
||||
c for c in tok.children
|
||||
if c.dep_ in ("nsubj", "nsubj:pass", "csubj")
|
||||
]
|
||||
if not subjs:
|
||||
continue
|
||||
|
||||
objects: list[tuple] = []
|
||||
for c in tok.children:
|
||||
if c.dep_ in ("obj", "dobj", "iobj", "attr", "xcomp", "ccomp"):
|
||||
objects.append((c, c.dep_, None))
|
||||
elif c.dep_ in ("obl", "obl:agent", "nmod"):
|
||||
prep = None
|
||||
for cc in c.children:
|
||||
if cc.dep_ == "case" and cc.pos_ == "ADP":
|
||||
prep = cc.text.lower()
|
||||
break
|
||||
objects.append((c, c.dep_, prep))
|
||||
|
||||
for s in subjs:
|
||||
s_text = _clean_span(s.subtree)
|
||||
if not _is_meaningful(s_text):
|
||||
continue
|
||||
for o, dep, prep in objects:
|
||||
o_text = _clean_span(o.subtree)
|
||||
if not _is_meaningful(o_text):
|
||||
continue
|
||||
|
||||
# Build the relation label
|
||||
rel = verb_lemma
|
||||
# Passive: mark with [pass]
|
||||
if any(c.dep_ == "nsubj:pass" for c in tok.children):
|
||||
rel = f"{verb_lemma}[pass]"
|
||||
# Oblique with preposition (excluding agents and the direct-object marker "a")
|
||||
elif prep and dep != "obl:agent" and prep != "a":
|
||||
rel = f"{verb_lemma}_{prep}"
|
||||
|
||||
triples.append({
|
||||
"subject": s_text,
|
||||
"relation": rel,
|
||||
"object": o_text,
|
||||
"verb_form": verb_form,
|
||||
"object_dep": dep,
|
||||
"prep": prep,
|
||||
})
|
||||
|
||||
ents = [{"text": e.text, "label": e.label_} for e in doc.ents]
|
||||
|
||||
return {
|
||||
"text": text,
|
||||
"triples": triples,
|
||||
"entities": ents,
|
||||
"elapsed_s": round(time.time() - t0, 3),
|
||||
}
|
||||
@@ -0,0 +1,58 @@
|
||||
---
name: fuzzy_merge_adaptive
kind: function
lang: py
domain: datascience
version: "1.0.0"
purity: pure
signature: "def fuzzy_merge_adaptive(left: list[dict], right: list[dict], left_key: str, right_key: str, thresholds: list[int] | None = None, how: str = 'left') -> list[dict]"
description: "Adaptive fuzzy join between two lists of dicts using rapidfuzz.fuzz.token_sort_ratio. Takes the best-scoring candidate from right and records the highest threshold its score meets. Supports how='left' (all items of left) and how='inner' (only items with a match). Colliding field names get _left/_right suffixes."
tags: [fuzzy, matching, join, merge, rapidfuzz, string-similarity, datascience]
params:
  - name: left
    desc: "List of dicts (left side of the join)."
  - name: right
    desc: "List of dicts (right side of the join)."
  - name: left_key
    desc: "Key in the left dicts used for string matching."
  - name: right_key
    desc: "Key in the right dicts used for string matching."
  - name: thresholds
    desc: "List of integer thresholds, tried in the order given (expected descending). Default [90,80,70,60,50]."
  - name: how
    desc: "'left' includes every item of left; 'inner' only those with a match."
output: "List of merged dicts with the fields of left + right (_left/_right suffixes on collision) + fuzzy_match (str|None), match_score (float in 0-100; 0 when there is no match), threshold_used (int|None)."
uses_functions: []
uses_types: []
returns: []
returns_optional: false
error_type: ""
imports: ["rapidfuzz"]
tested: true
tests:
  - "left join con typo"
  - "inner join excluye sin match"
  - "left join sin match devuelve none"
  - "threshold adaptativo"
  - "colision de claves usa sufijos"
test_file_path: "python/functions/datascience/tests/test_fuzzy_merge_adaptive.py"
file_path: "python/functions/datascience/fuzzy_merge_adaptive.py"
source_repo: "internal:footprint_aurgi"
source_license: "internal-aurgi"
source_file: "fuzzy_joins/fuzzy_en_batches.py"
---
|
||||
|
||||
## Example

```python
from fuzzy_merge_adaptive import fuzzy_merge_adaptive

left = [{"name": "Madrid"}, {"name": "Barclona"}]
right = [{"name": "Madrid", "cp": "28"}, {"name": "Barcelona", "cp": "08"}]
result = fuzzy_merge_adaptive(left, right, left_key="name", right_key="name")
# result[1]["fuzzy_match"] == "Barcelona", result[1]["match_score"] >= 80
```

## Notes

Migrated from thefuzz to rapidfuzz (compatible API, faster). No pandas: the merge is implemented manually via a dict lookup on right_key, so if several rows of right share the same key value the last one is used. Thresholds are tried in the order given, which should be highest to lowest; the first one met is stored in threshold_used.
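The threshold step on its own is a short cascade over the best score. The sketch below shows it in isolation, assuming scores on the 0-100 scale that rapidfuzz uses; the helper name `pick_threshold` is hypothetical and the default list is the one documented above.

```python
from __future__ import annotations


def pick_threshold(score: float, thresholds: list[int] | None = None) -> int | None:
    """Return the first threshold the score meets, or None if it meets none."""
    if thresholds is None:
        thresholds = [90, 80, 70, 60, 50]
    for t in thresholds:  # iterated in the order given, so pass them descending
        if score >= t:
            return t
    return None


print(pick_threshold(94.1))  # meets 90, the strictest band
print(pick_threshold(65.0))  # falls through to 60
print(pick_threshold(42.0))  # below every band: no match
```

Keeping `threshold_used` next to the raw score makes it easy to review the weaker bands (for example everything matched at 50 or 60) separately from the near-exact matches.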
|
||||
@@ -0,0 +1,108 @@
"""Adaptive fuzzy merge with multiple thresholds using rapidfuzz."""

from __future__ import annotations

from typing import Iterable


def fuzzy_merge_adaptive(
    left: list[dict],
    right: list[dict],
    left_key: str,
    right_key: str,
    thresholds: list[int] | None = None,
    how: str = "left",
) -> list[dict]:
    """Perform an adaptive fuzzy join between two lists of dicts.

    For each item in left, finds the best match in right using
    rapidfuzz.fuzz.token_sort_ratio. Tries the thresholds from highest to
    lowest and sets threshold_used to the highest one met. If none is met,
    the match is None.

    Args:
        left: List of dicts (left side of the join).
        right: List of dicts (right side of the join).
        left_key: Key in the left dicts used for matching.
        right_key: Key in the right dicts used for matching.
        thresholds: Thresholds to try, in descending order.
            Default [90, 80, 70, 60, 50].
        how: Join type. 'left' includes every item of left
            (with None in the right-side fields when there is no match).
            'inner' includes only items with a match.

    Returns:
        List of merged dicts with the fields of left + the fields of right
        (_left/_right suffixes on collision) + fuzzy_match, match_score,
        threshold_used.
    """
    from rapidfuzz import fuzz, process

    if thresholds is None:
        thresholds = [90, 80, 70, 60, 50]

    right_values = [
        str(r[right_key]) for r in right if r.get(right_key) is not None
    ]

    def find_best_match(value: str | None) -> tuple[str | None, float, int | None]:
        if value is None:
            return None, 0, None
        result = process.extractOne(str(value), right_values, scorer=fuzz.token_sort_ratio)
        if not result:
            return None, 0, None
        match_str, score = result[0], result[1]
        for t in thresholds:
            if score >= t:
                return match_str, score, t
        return None, 0, None

    # Detect key collisions (based on the first row of each side)
    left_keys = set(left[0].keys()) if left else set()
    right_keys = set(right[0].keys()) if right else set()
    collision_keys = left_keys & right_keys

    # Build an index of right by right_key
    right_index: dict[str, dict] = {}
    for r in right:
        val = r.get(right_key)
        if val is not None:
            right_index[str(val)] = r

    result_rows = []
    for item in left:
        value = item.get(left_key)
        fuzzy_match, score, threshold_used = find_best_match(value)

        if fuzzy_match is None and how == "inner":
            continue

        row: dict = {}
        # Fields from left
        for k, v in item.items():
            if k in collision_keys:
                row[f"{k}_left"] = v
            else:
                row[k] = v

        # Fields from right
        matched_right = right_index.get(fuzzy_match) if fuzzy_match else None
        if matched_right is not None:
            for k, v in matched_right.items():
                if k in collision_keys:
                    row[f"{k}_right"] = v
                else:
                    row[k] = v
        else:
            for k in right_keys:
                if k in collision_keys:
                    row[f"{k}_right"] = None
                else:
                    row[k] = None

        row["fuzzy_match"] = fuzzy_match
        row["match_score"] = score
        row["threshold_used"] = threshold_used
        result_rows.append(row)

    return result_rows
|
@@ -0,0 +1,52 @@
---
id: geometric_mean_py_datascience
name: geometric_mean
kind: function
lang: py
domain: datascience
version: "1.0.0"
purity: pure
signature: "def geometric_mean(values: list[float]) -> float"
description: "Geometric mean of positive elements via exp(mean(log(x))). Non-positive values are filtered out. Returns math.nan if no positives."
tags: [statistics, mean, geometric, distribution, lognormal]
uses_functions: []
uses_types: []
returns: []
returns_optional: false
error_type: ""
imports: [math, numpy]
example: |
  from geometric_mean import geometric_mean
  result = geometric_mean([1, 2, 4, 8])  # ~2.828 (2^1.5)
tested: true
tests:
  - "test_geometric_mean_powers_of_two"
  - "test_geometric_mean_filters_non_positive"
  - "test_geometric_mean_empty_returns_nan"
  - "test_geometric_mean_all_negative_returns_nan"
  - "test_geometric_mean_single_positive"
test_file_path: "python/functions/datascience/tests/test_geometric_mean.py"
file_path: "python/functions/datascience/geometric_mean.py"
params:
  - name: values
    desc: "List of numeric values. Non-positive elements are silently ignored."
output: "Geometric mean as float, computed over positive elements only. Returns math.nan if there are no positive values."
source_repo: "internal:footprint_aurgi"
source_license: "internal-aurgi"
source_file: "aurgi_mapas/generar_pdf_reporte.py:126"
---

## Example

```python
from geometric_mean import geometric_mean

geometric_mean([1, 2, 4, 8])   # 2.828... (= 2^1.5)
geometric_mean([1, -2, 3])     # exp((log(1)+log(3))/2): ignores -2
geometric_mean([])             # math.nan
geometric_mean([-1, -2])       # math.nan: no positives
```

## Notes

Suitable for lognormal distributions or multiplicative data (prices, ratios, growth rates). Equivalent to the n-th root of the product, but numerically more stable because it works in log space.
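
The stability claim can be checked with a small standard-library sketch. The names `geometric_mean_logspace` and `geometric_mean_naive` are hypothetical helpers used only for this illustration (the registry function itself relies on numpy): the first restates the documented log-space formula, the second takes the n-th root of the product.

```python
import math


def geometric_mean_logspace(values):
    """Hypothetical stdlib restatement of exp(mean(log(x))) over positives."""
    positives = [v for v in values if v > 0]
    if not positives:
        return math.nan
    return math.exp(sum(math.log(v) for v in positives) / len(positives))


def geometric_mean_naive(values):
    """Hypothetical baseline: n-th root of the product of the positives."""
    positives = [v for v in values if v > 0]
    if not positives:
        return math.nan
    return math.prod(positives) ** (1.0 / len(positives))


# On small inputs both formulations agree.
small = [1, 2, 4, 8]
agree = math.isclose(geometric_mean_logspace(small), geometric_mean_naive(small))

# With large magnitudes the running product overflows to inf,
# while the log-space form stays finite.
large = [1e200, 1e200, 1e200, 1e200]
naive_large = geometric_mean_naive(large)
stable_large = geometric_mean_logspace(large)
```

The naive form breaks down as soon as the running product leaves the float range, even though the true geometric mean is well within it.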
@@ -0,0 +1,23 @@
"""geometric_mean: Geometric mean of positive values."""

import math
import numpy as np


def geometric_mean(values: list[float]) -> float:
    """Return the geometric mean of the positive elements in values.

    Filters out non-positive numbers before computing exp(mean(log(x))).
    Returns math.nan if there are no positive values.

    Args:
        values: List of numeric values (non-positive elements are ignored).

    Returns:
        Geometric mean as float, or math.nan if no positive values exist.
    """
    positives = [v for v in values if v > 0]
    if not positives:
        return math.nan
    arr = np.array(positives, dtype=float)
    return float(np.exp(np.mean(np.log(arr))))
@@ -0,0 +1,67 @@
---
name: gliner2_load_model
kind: function
lang: py
domain: datascience
version: "1.0.0"
purity: impure
signature: "def gliner2_load_model(model_name: str = 'fastino/gliner2-large-v1', device: str = 'auto') -> Any"
description: "Loads (and caches by (model_name, device)) a GLiNER2 model (joint NER+RE). GLiNER2 extracts entities and relations in a single pass with a unified schema. ~2x faster than running GLiNER + GLiREL separately. LICENSE: Apache 2.0."
tags: [gliner2, ner, relation-extraction, nlp, model, huggingface, zero-shot, joint, datascience, python, apache2]
uses_functions: []
uses_types: []
returns: []
returns_optional: false
error_type: "error_go_core"
imports: [gliner2]
params:
  - name: model_name
    desc: "HuggingFace Hub model ID. Default: fastino/gliner2-large-v1. Alternative: fastino/gliner2-base-v1 (lighter)."
  - name: device
    desc: "'auto' uses CUDA if available, otherwise CPU. Accepted values: 'cpu', 'cuda', 'cuda:0', 'cuda:1'. 'auto' is the recommended default."
output: "GLiNER2 instance cached by (model_name, device). Exposes .create_schema().entities(...).relations(...) and .extract(text, schema=schema, threshold=0.3)."
tested: true
tests:
  - "cache devuelve la misma instancia con los mismos parametros"
  - "device=auto resuelve a cpu si torch no esta instalado"
  - "ImportError si gliner2 no esta instalado"
test_file_path: "python/functions/datascience/tests/test_gliner2_load_model.py"
file_path: "python/functions/datascience/gliner2_load_model.py"
notes: |
  LICENSE: fastino/gliner2-large-v1 is Apache 2.0; commercial use is OK.
  Difference from gliner_load_model: GLiNER does NER only, GLiNER2 does NER+RE
  in a single pass (joint schema). For graph pipelines, use GLiNER2
  when both tasks are needed at the same time.

  impure: downloads from network/disk on first use and keeps state in _MODEL_CACHE.
  Size: fastino/gliner2-large-v1 ~500 MB. First load takes 15-30 s on CPU.
  CPU inference: 10-50 KB of text/s with a typical schema (3 entity + 8 relation labels).
---

## Example

```python
from datascience.gliner2_load_model import gliner2_load_model

model = gliner2_load_model(device="auto")

schema = (model.create_schema()
    .entities(["person", "organization", "location"])
    .relations(["works_at", "ceo_of", "located_in"]))

result = model.extract(
    "Pablo Isla es el CEO de Inditex, empresa con sede en Arteixo.",
    schema=schema,
    threshold=0.3,
)
# result["entities"] -> {"person": ["Pablo Isla"], "organization": ["Inditex"], ...}
# result["relation_extraction"] -> {"ceo_of": [("Pablo Isla", "Inditex")], ...}
```

## Installation

```bash
cd python && uv pip install gliner2
# or with the full NLP extra:
cd python && uv pip install -e '.[nlp]'
```
@@ -0,0 +1,62 @@
"""Loads (and caches) a GLiNER2 model (joint NER+RE in a single pass).

LICENSE: Apache 2.0 (commercial use permitted).
Default model: fastino/gliner2-large-v1
"""

from __future__ import annotations

from typing import Any

# Global cache: (model_name, device) -> GLiNER2 instance
_MODEL_CACHE: dict[tuple[str, str], Any] = {}


def _resolve_device(device: str) -> str:
    """Resolve 'auto' to 'cuda' or 'cpu' depending on torch availability."""
    if device != "auto":
        return device
    try:
        import torch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


def gliner2_load_model(
    model_name: str = "fastino/gliner2-large-v1",
    device: str = "auto",
) -> Any:
    """Load (and cache) a GLiNER2 model.

    GLiNER2 extracts entities AND relations in a single forward pass using
    a joint schema (entities + relation_labels). This is ~2x faster than
    running GLiNER + GLiREL separately for co-occurring entities.

    Returns model instance with .extract() and .create_schema() methods.

    LICENSE: Apache 2.0 (commercial use OK).

    Args:
        model_name: HuggingFace Hub model ID. Default: fastino/gliner2-large-v1.
        device: 'auto' uses CUDA if available, else CPU. 'cpu', 'cuda', 'cuda:N'.

    Returns:
        GLiNER2 instance cached by (model_name, device).
    """
    resolved = _resolve_device(device)
    key = (model_name, resolved)
    if key in _MODEL_CACHE:
        return _MODEL_CACHE[key]

    from gliner2 import GLiNER2  # type: ignore[import]

    m = GLiNER2.from_pretrained(model_name)
    if hasattr(m, "to") and resolved != "cpu":
        try:
            m.to(resolved)
        except Exception:
            pass  # Model stays on CPU but is still cached under the requested device key

    _MODEL_CACHE[key] = m
    return m
@@ -0,0 +1,52 @@
---
id: kde_density_levels_py_datascience
name: kde_density_levels
kind: function
lang: py
domain: datascience
version: "1.0.0"
purity: pure
signature: "def kde_density_levels(xs: list[float], ys: list[float], bw_adjust: float = 0.6, abs_quantile: float = 0.1, dense_quantile: float = 0.85, bins: int = 80) -> dict | None"
description: "Estimates 2-D density via KDE (scipy) or histogram fallback (numpy) and returns per-point density values plus absolute and dense quantile thresholds."
tags: [statistics, kde, density, spatial, geospatial, scipy, numpy]
uses_functions: []
uses_types: []
returns: []
returns_optional: true
error_type: ""
imports: [numpy, scipy]
example: |
  from kde_density_levels import kde_density_levels
  import numpy as np
  rng = np.random.default_rng(42)
  result = kde_density_levels(rng.normal(0,1,50).tolist(), rng.normal(0,1,50).tolist())
  # {"method": "kde", "densities": array(...), "abs_level": ..., "dense_level": ...}
tested: true
tests:
  - "test_kde_density_levels_returns_dict_for_50_points"
  - "test_kde_density_levels_none_for_few_points"
  - "test_kde_density_levels_none_for_4_points"
  - "test_kde_density_levels_levels_ordered"
  - "test_kde_density_levels_mismatched_lengths"
test_file_path: "python/functions/datascience/tests/test_kde_density_levels.py"
file_path: "python/functions/datascience/kde_density_levels.py"
params:
  - name: xs
    desc: "X-coordinates of the 2-D point cloud."
  - name: ys
    desc: "Y-coordinates of the 2-D point cloud. Must have same length as xs."
  - name: bw_adjust
    desc: "Passed to gaussian_kde as a scalar bw_method, so it is used directly as the KDE bandwidth factor. Default 0.6."
  - name: abs_quantile
    desc: "Quantile of density values used as the absolute (sparse) threshold. Default 0.1."
  - name: dense_quantile
    desc: "Quantile of density values used as the dense cluster threshold. Default 0.85."
  - name: bins
    desc: "Number of bins per axis for the histogram fallback. Default 80."
output: "Dict with method (str), densities (np.ndarray of per-point density), abs_level (float), dense_level (float). Returns None if len(xs) < 5 or lengths differ."
source_repo: "internal:footprint_aurgi"
source_license: "internal-aurgi"
source_file: "ponderacion_isochronas/src/recomendador_centros.py:305"
---

Pure function: it writes nothing to disk. returns_optional is true because it returns None when there are fewer than 5 points or when xs and ys differ in length.
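
One way to consume the two thresholds is to label each point by where its density falls. The sketch below is an assumption about downstream use, not part of the registry function: `quantile_linear` and `classify_by_density` are hypothetical helpers, and the labels `"sparse"`, `"mid"`, and `"dense"` are illustrative. The quantile uses linear interpolation between order statistics, which is also the default rule of `numpy.quantile`.

```python
def quantile_linear(values, q):
    """Quantile with linear interpolation between order statistics."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    frac = pos - lo
    return ordered[lo] * (1.0 - frac) + ordered[hi] * frac


def classify_by_density(densities, abs_quantile=0.1, dense_quantile=0.85):
    """Label each density as 'sparse', 'mid', or 'dense' (hypothetical labels)."""
    abs_level = quantile_linear(densities, abs_quantile)
    dense_level = quantile_linear(densities, dense_quantile)
    labels = []
    for d in densities:
        if d < abs_level:
            labels.append("sparse")
        elif d >= dense_level:
            labels.append("dense")
        else:
            labels.append("mid")
    return abs_level, dense_level, labels


# Toy per-point densities standing in for result["densities"].
densities = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
abs_level, dense_level, labels = classify_by_density(densities)
```

With real output, check for `None` first (fewer than 5 points or mismatched lengths) before reading `densities`, `abs_level`, or `dense_level`.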
@@ -0,0 +1,64 @@
"""kde_density_levels: Compute density levels via KDE or histogram fallback."""

import numpy as np


def kde_density_levels(
    xs: list[float],
    ys: list[float],
    bw_adjust: float = 0.6,
    abs_quantile: float = 0.1,
    dense_quantile: float = 0.85,
    bins: int = 80,
) -> dict | None:
    """Estimate 2-D density and compute absolute and dense threshold levels.

    Uses scipy.stats.gaussian_kde when available; falls back to
    numpy.histogram2d if scipy is not installed.

    Args:
        xs: X-coordinates of points.
        ys: Y-coordinates of points.
        bw_adjust: Scalar passed as gaussian_kde's bw_method, i.e. used directly as the bandwidth factor (ignored for histogram fallback).
        abs_quantile: Quantile of density values used as the absolute threshold.
        dense_quantile: Quantile of density values used as the dense threshold.
        bins: Number of bins per axis for the histogram fallback.

    Returns:
        Dict with keys:
            "method" (str): "kde" or "hist".
            "densities" (np.ndarray): 1-D array of per-point density estimates.
            "abs_level" (float): density at abs_quantile.
            "dense_level" (float): density at dense_quantile.
        Returns None if len(xs) < 5 or xs and ys have different lengths.
    """
    if len(xs) < 5 or len(xs) != len(ys):
        return None

    xs_arr = np.array(xs, dtype=float)
    ys_arr = np.array(ys, dtype=float)
    points = np.vstack([xs_arr, ys_arr])

    try:
        from scipy.stats import gaussian_kde  # type: ignore

        kde = gaussian_kde(points, bw_method=bw_adjust)
        densities = kde(points)
        method = "kde"
    except ImportError:
        # Histogram fallback: side="right" matches histogram2d's half-open bins
        h, xedges, yedges = np.histogram2d(xs_arr, ys_arr, bins=bins)
        xi = np.clip(np.searchsorted(xedges, xs_arr, side="right") - 1, 0, bins - 1)
        yi = np.clip(np.searchsorted(yedges, ys_arr, side="right") - 1, 0, bins - 1)
        densities = h[xi, yi].astype(float)
        method = "hist"

    abs_level = float(np.quantile(densities, abs_quantile))
    dense_level = float(np.quantile(densities, dense_quantile))

    return {
        "method": method,
        "densities": densities,
        "abs_level": abs_level,
        "dense_level": dense_level,
    }
@@ -0,0 +1,61 @@
---
name: marianmt_es_en_load_model
kind: function
lang: py
domain: datascience
version: "1.0.0"
purity: impure
signature: "def marianmt_es_en_load_model(model_name: str = 'Helsinki-NLP/opus-mt-es-en') -> tuple[Any, Any]"
description: "Loads (and caches) the MarianMT tokenizer and model for ES->EN translation (Helsinki-NLP, ~300 MB). Apache 2.0 license. Cached by model_name."
tags: [marianmt, translation, es-en, nlp, model, huggingface, apache2, datascience, python]
uses_functions: []
uses_types: []
returns: []
returns_optional: false
error_type: "error_go_core"
imports: [transformers]
params:
  - name: model_name
    desc: "HuggingFace Hub model ID (default: Helsinki-NLP/opus-mt-es-en, ~300 MB)"
output: "tuple (tokenizer, model) ready for inference, cached by model_name."
tested: false
tests: []
test_file_path: ""
file_path: "python/functions/datascience/marianmt_es_en_load_model.py"
notes: |
  LICENSE: Apache 2.0 (commercial use permitted).

  Useful as a step before REBEL (English-only): translate ES -> EN with MarianMT
  and then hand the text to rebel_load_model for relation extraction in English.

  impure: downloads from network/disk on first use and keeps state in _MODEL_CACHE.
  Uses MarianTokenizer and MarianMTModel instead of Auto* because Marian models
  have a specialized tokenizer with a SentencePiece (SPM) vocabulary.
---

## Example

```python
from python.functions.datascience.marianmt_es_en_load_model import marianmt_es_en_load_model
from python.functions.datascience.translate_es_to_en import translate_es_to_en

tokenizer, model = marianmt_es_en_load_model()
translated = translate_es_to_en("Pablo Isla es presidente de Inditex.", tokenizer, model)
# "Pablo Isla is president of Inditex."
```

## Size and latency

- `Helsinki-NLP/opus-mt-es-en`: ~300 MB on disk.
- First load: 5-15 s on CPU.
- CPU inference: 0.5-2 s per sentence.
- GPU: much faster.

## Use as a preprocessor for REBEL

```
ES text -> marianmt_es_en -> EN text -> rebel_load_model -> triplets
```

This pipeline makes it possible to use REBEL (Apache 2.0, EN only) with Spanish text.
Direct alternative: use mrebel_load_model (CC BY-NC-SA, multilingual).
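
The translate-then-extract flow can be sketched as a plain composition of two callables. This is only an illustration under stated assumptions: `es_text_to_triplets`, `fake_translate`, and `fake_extract` are hypothetical names, and the stubs return canned values instead of calling MarianMT or REBEL.

```python
from typing import Callable


def es_text_to_triplets(
    text_es: str,
    translate: Callable[[str], str],
    extract: Callable[[str], list],
) -> list:
    """Translate Spanish text to English, then run an English-only extractor."""
    text_en = translate(text_es)
    return extract(text_en)


# Stub standing in for MarianMT translation (canned output, no model involved).
def fake_translate(text: str) -> str:
    canned = {
        "Pablo Isla es presidente de Inditex.": "Pablo Isla is president of Inditex.",
    }
    return canned.get(text, text)


# Stub standing in for an English-only relation extractor (canned output).
def fake_extract(text_en: str) -> list:
    if "Inditex" in text_en:
        return [{"head": "Pablo Isla", "type": "president of", "tail": "Inditex"}]
    return []


triplets = es_text_to_triplets(
    "Pablo Isla es presidente de Inditex.", fake_translate, fake_extract
)
```

Passing the two steps as arguments keeps the pipeline testable without downloads, and the real loader-backed functions can be swapped in at the call site.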
@@ -0,0 +1,54 @@
"""Loads (and caches) the MarianMT model for ES -> EN translation."""

from __future__ import annotations

from typing import Any

# Global cache: model_name -> (tokenizer, model)
_MODEL_CACHE: dict[str, tuple[Any, Any]] = {}


def marianmt_es_en_load_model(
    model_name: str = "Helsinki-NLP/opus-mt-es-en",
) -> tuple[Any, Any]:
    """Loads (and caches) a MarianMT model for Spanish-to-English translation.

    MarianMT is a lightweight seq2seq translation model (~300 MB) from
    Helsinki-NLP, trained on the OPUS parallel corpus.

    LICENSE: Apache 2.0 (commercial use permitted).

    The first call downloads the model from HuggingFace Hub (~300 MB).
    Subsequent calls with the same ``model_name`` return the cached instance.

    Args:
        model_name: HuggingFace Hub model ID. Default is the ES->EN model.
            Other available models follow the pattern
            ``Helsinki-NLP/opus-mt-{src}-{tgt}``.

    Returns:
        Tuple ``(tokenizer, model)`` both ready for inference with
        ``model.generate(...)`` and ``tokenizer.decode(...)``.

    Raises:
        ImportError: if ``transformers`` is not installed.
        OSError: if the model cannot be downloaded or loaded from disk.
    """
    cached = _MODEL_CACHE.get(model_name)
    if cached is not None:
        return cached

    try:
        from transformers import MarianMTModel, MarianTokenizer
    except ImportError as exc:
        raise ImportError(
            "transformers is not installed. Install it with "
            "`uv pip install transformers` or `uv pip install -e '.[nlp]'`."
        ) from exc

    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)
    model.eval()

    _MODEL_CACHE[model_name] = (tokenizer, model)
    return tokenizer, model
@@ -0,0 +1,56 @@
---
name: mrebel_base_load_model
kind: function
lang: py
domain: datascience
version: "1.0.0"
purity: impure
signature: "def mrebel_base_load_model(model_name: str = 'Babelscape/mrebel-base', src_lang: str = 'es_XX', tgt_lang: str = 'tp_XX') -> tuple[Any, Any]"
description: "Fast variant of mrebel_load_model using the base checkpoint (250M params, ~900 MB). Delegates entirely to mrebel_load_model. Same CC BY-NC-SA 4.0 license: non-commercial use only."
tags: [mrebel, relation-extraction, nlp, model, huggingface, multilingual, seq2seq, datascience, python]
uses_functions: [mrebel_load_model_py_datascience]
uses_types: []
returns: []
returns_optional: false
error_type: "error_go_core"
imports: []
params:
  - name: model_name
    desc: "HuggingFace Hub model ID (default: Babelscape/mrebel-base, 250M params)"
  - name: src_lang
    desc: "source language code for the mBART tokenizer: 'es_XX' (ES), 'en_XX' (EN), etc."
  - name: tgt_lang
    desc: "target language token for the decoder, always 'tp_XX'"
output: "tuple (tokenizer, model) ready for inference, cached by (model_name, src_lang) in the cache shared with mrebel_load_model."
tested: false
tests: []
test_file_path: ""
file_path: "python/functions/datascience/mrebel_base_load_model.py"
notes: |
  LICENSE: Babelscape/mrebel-base is released under CC BY-NC-SA 4.0 (Creative Commons
  Non-Commercial Share-Alike). Non-commercial use only. Do NOT use it in commercial products.

  This function is a thin wrapper: it does NOT duplicate the load/cache logic. All of
  that logic lives in mrebel_load_model. Useful for benchmarks that compare
  base vs large through the same interface.

  The cache is shared with mrebel_load_model (the same module-level _MODEL_CACHE dict).
---

## Example

```python
from python.functions.datascience.mrebel_base_load_model import mrebel_base_load_model

# 250M params vs 600M, same interface
tokenizer, model = mrebel_base_load_model(src_lang="es_XX")
```

## Base vs large comparison

| Variant | Params | Size | CPU latency per sentence | Typical recall |
|---------|--------|------|--------------------------|----------------|
| mrebel-large | 600M | ~2.4 GB | 15-30 s | high |
| mrebel-base | 250M | ~900 MB | 5-10 s | medium |

For speed benchmarks in graph_explorer, use base. For final production, evaluate large.
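
Because both loaders share one interface, a benchmark can treat them as interchangeable callables. The following is a minimal sketch under that assumption: `time_loader`, `compare_loaders`, `fake_base`, and `fake_large` are hypothetical names, and the stubs stand in for the real loaders so nothing is downloaded.

```python
import time
from typing import Any, Callable


def time_loader(loader: Callable[..., tuple], **kwargs: Any) -> float:
    """Return the wall-clock seconds taken by a single loader call."""
    start = time.perf_counter()
    loader(**kwargs)
    return time.perf_counter() - start


def compare_loaders(loaders: dict, **kwargs: Any) -> dict:
    """Time each loader once, passing identical keyword arguments."""
    return {name: time_loader(fn, **kwargs) for name, fn in loaders.items()}


# Stubs standing in for mrebel_base_load_model and mrebel_load_model.
def fake_base(src_lang: str = "es_XX") -> tuple:
    return ("tokenizer-base", "model-base")


def fake_large(src_lang: str = "es_XX") -> tuple:
    return ("tokenizer-large", "model-large")


timings = compare_loaders({"base": fake_base, "large": fake_large}, src_lang="es_XX")
```

Since the real loaders cache their result, only the first call for a given key measures an actual load; later calls measure a cache hit.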
@@ -0,0 +1,41 @@
"""Loads (and caches) the mREBEL-base model (fast variant, 250M params)."""

from __future__ import annotations

from typing import Any

from python.functions.datascience.mrebel_load_model import mrebel_load_model


def mrebel_base_load_model(
    model_name: str = "Babelscape/mrebel-base",
    src_lang: str = "es_XX",
    tgt_lang: str = "tp_XX",
) -> tuple[Any, Any]:
    """Loads (and caches) the mREBEL-base tokenizer and model.

    Thin wrapper over ``mrebel_load_model`` with the base checkpoint as
    default (250M params, ~900 MB). Faster than the large variant at the
    cost of some recall on complex sentences.

    LICENSE NOTICE: Babelscape/mrebel-base is licensed under CC BY-NC-SA 4.0
    (Creative Commons Non-Commercial Share-Alike). Do NOT use in commercial
    products without replacing this model.

    Args:
        model_name: HuggingFace Hub model ID. Defaults to the base checkpoint.
        src_lang: Source language code for the mBART tokenizer.
        tgt_lang: Target language token for the decoder (always ``"tp_XX"``).

    Returns:
        Tuple ``(tokenizer, model)`` ready for inference.

    Raises:
        ImportError: if ``transformers`` is not installed.
        OSError: if the model cannot be downloaded or loaded from disk.
    """
    return mrebel_load_model(
        model_name=model_name,
        src_lang=src_lang,
        tgt_lang=tgt_lang,
    )
@@ -0,0 +1,76 @@
---
name: mrebel_load_model
kind: function
lang: py
domain: datascience
version: "1.0.0"
purity: impure
signature: "def mrebel_load_model(model_name: str = 'Babelscape/mrebel-large', src_lang: str = 'es_XX', tgt_lang: str = 'tp_XX') -> tuple[Any, Any]"
description: "Loads (and caches) the mREBEL tokenizer and model (mBART-based, ~600M params, ~2.4 GB). Multilingual, 30+ languages. Cached by (model_name, src_lang). The first call downloads from HuggingFace. LICENSE CC BY-NC-SA 4.0: non-commercial use only."
tags: [mrebel, relation-extraction, nlp, model, huggingface, multilingual, seq2seq, datascience, python]
uses_functions: []
uses_types: []
returns: []
returns_optional: false
error_type: "error_go_core"
imports: [transformers]
params:
  - name: model_name
    desc: "HuggingFace Hub model ID (default: Babelscape/mrebel-large, 600M params)"
  - name: src_lang
    desc: "source language code for the mBART tokenizer: 'es_XX' (ES), 'en_XX' (EN), 'fr_XX' (FR), etc."
  - name: tgt_lang
    desc: "target language token for the decoder, always 'tp_XX' for the mREBEL triplet format"
output: "tuple (tokenizer, model) ready for inference. Cached by (model_name, src_lang)."
tested: false
tests: []
test_file_path: ""
file_path: "python/functions/datascience/mrebel_load_model.py"
notes: |
  LICENSE: Babelscape/mrebel-large is released under CC BY-NC-SA 4.0 (Creative Commons
  Non-Commercial Share-Alike). Non-commercial use only. Do NOT use it in commercial
  products without replacing it with a commercially licensed model.

  impure: downloads from network/disk on first use and keeps state in _MODEL_CACHE.
  Does not need the glirel HF kwargs patch: AutoModelForSeq2SeqLM is the standard path.
  The cache is keyed by (model_name, src_lang): two different languages create two
  instances because the tokenizer has src_lang baked in.
---

## Example

```python
from python.functions.datascience.mrebel_load_model import mrebel_load_model
from python.functions.datascience.parse_rebel_output import parse_rebel_output

tokenizer, model = mrebel_load_model(src_lang="es_XX")

text = "Pablo Isla es el presidente de Inditex."
inputs = tokenizer(text, return_tensors="pt", max_length=512, truncation=True)
generated = model.generate(**inputs, num_beams=4, length_penalty=1.0, max_length=256)
decoded = tokenizer.decode(generated[0], skip_special_tokens=False)
triplets = parse_rebel_output(decoded)
```

## Size and latency

- `Babelscape/mrebel-large`: ~2.4 GB on disk (model + tokenizer).
- First load: 30-90 s on CPU, depending on network and disk.
- CPU inference: 5-15 s per sentence (mBART is slower than REBEL/BART).
- GPU inference (CUDA T4): 0.5-2 s per sentence.

## Supported languages

mREBEL supports the mBART-50 languages. Examples:
- `es_XX`: Spanish
- `en_XX`: English
- `fr_XX`: French
- `de_DE`: German
- `pt_XX`: Portuguese
- `it_IT`: Italian
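
Since the codes do not all share the `_XX` suffix, a small lookup avoids guessing them. The helper below is a hypothetical convenience, not part of the registry: `MBART_CODES` and `to_mbart_code` are illustration names, and the table contains only the six codes listed above (mBART-50 defines more).

```python
# Codes copied from the list above; this table is deliberately incomplete.
MBART_CODES = {
    "es": "es_XX",
    "en": "en_XX",
    "fr": "fr_XX",
    "de": "de_DE",
    "pt": "pt_XX",
    "it": "it_IT",
}


def to_mbart_code(iso_code: str) -> str:
    """Map a two-letter language code to the mBART code used as src_lang."""
    key = iso_code.strip().lower()
    try:
        return MBART_CODES[key]
    except KeyError:
        raise ValueError(f"No mBART code known for {iso_code!r}") from None


german = to_mbart_code("de")
spanish = to_mbart_code(" ES ")
```

The result can be passed as `src_lang`; failing loudly on an unknown code is preferable to silently building a tokenizer for the wrong language.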

## Notes

- For English and commercial use, use `rebel_load_model` (Apache 2.0).
- For quick benchmarks, use `mrebel_base_load_model` (250M params, same license).
- `model.eval()` is called on load to disable dropout at inference time.
@@ -0,0 +1,70 @@
"""Loads (and caches) the mREBEL model for multilingual relation extraction."""

from __future__ import annotations

from typing import Any

# Global cache: (model_name, src_lang) -> (tokenizer, model)
_MODEL_CACHE: dict[tuple[str, str], tuple[Any, Any]] = {}


def mrebel_load_model(
    model_name: str = "Babelscape/mrebel-large",
    src_lang: str = "es_XX",
    tgt_lang: str = "tp_XX",
) -> tuple[Any, Any]:
    """Loads (and caches) the mREBEL tokenizer and model.

    mREBEL is a multilingual seq2seq model (mBART-based, ~600M params, ~2.4 GB)
    for relation extraction. It supports 30+ languages via language codes
    (``src_lang``).

    LICENSE NOTICE: Babelscape/mrebel-large is licensed under CC BY-NC-SA 4.0
    (Creative Commons Non-Commercial Share-Alike). Do NOT use in commercial
    products without replacing this model with a commercially-licensed
    alternative (e.g. Babelscape/rebel-large which is Apache 2.0 but
    English-only).

    The first call downloads the model from HuggingFace Hub (~2.4 GB).
    Subsequent calls with the same ``(model_name, src_lang)`` return the
    cached instance without re-loading.

    Args:
        model_name: HuggingFace Hub model ID. Default is the large variant.
        src_lang: Source language code for the mBART tokenizer, e.g.
            ``"es_XX"`` (Spanish), ``"en_XX"`` (English), ``"fr_XX"`` (French).
        tgt_lang: Target language token for the decoder (always ``"tp_XX"``
            for the triplet format). It is not part of the cache key, so
            change it only together with ``model_name`` (custom checkpoint).

    Returns:
        Tuple ``(tokenizer, model)`` both ready for inference with
        ``model.generate(...)`` and ``tokenizer.decode(...)``.

    Raises:
        ImportError: if ``transformers`` is not installed.
        OSError: if the model cannot be downloaded or loaded from disk.
    """
    cache_key = (model_name, src_lang)
    cached = _MODEL_CACHE.get(cache_key)
    if cached is not None:
        return cached

    try:
        from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
    except ImportError as exc:
        raise ImportError(
            "transformers is not installed. Install it with "
            "`uv pip install transformers` or `uv pip install -e '.[nlp]'`."
        ) from exc

    tokenizer = AutoTokenizer.from_pretrained(
        model_name,
        src_lang=src_lang,
        tgt_lang=tgt_lang,
    )
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
    model.eval()

    _MODEL_CACHE[cache_key] = (tokenizer, model)
    return tokenizer, model
@@ -0,0 +1,65 @@
---
name: parse_rebel_output
kind: function
lang: py
domain: datascience
version: "1.0.0"
purity: pure
signature: "def parse_rebel_output(decoded_text: str) -> list[dict]"
description: "Pure parser for the REBEL / mREBEL wire format. Converts the string decoded by the tokenizer (with skip_special_tokens=False) into a list of typed triplets {head, head_type, type, tail, tail_type}. Never raises."
tags: [rebel, mrebel, relation-extraction, nlp, parser, knowledge-graph, datascience, python]
uses_functions: []
uses_types: []
returns: []
returns_optional: false
error_type: ""
imports: []
params:
  - name: decoded_text
    desc: "raw string produced by tokenizer.decode(..., skip_special_tokens=False); includes special tokens such as <triplet>, <per>, <org>, <loc>, tp_XX, etc."
output: "list of dicts with keys head (str), head_type (str), type (str), tail (str), tail_type (str). Empty list if there are no complete triplets or the input is empty."
tested: true
tests:
  - "string vacio retorna lista vacia"
  - "un triplet completo retorna un dict con campos correctos"
  - "dos triplets retorna dos dicts"
  - "triplet incompleto sin cierre no rompe"
  - "tokens angulares desconocidos no lanzan excepcion"
test_file_path: "python/functions/datascience/tests/test_parse_rebel_output.py"
file_path: "python/functions/datascience/parse_rebel_output.py"
notes: |
  Pure function. Adapts the official parser from the Babelscape/rebel README to the registry style.
  Compatible with mREBEL (tp_XX prefix, language tokens __es__, __en__) and REBEL (no language prefix).
  The wire format uses <triplet> to separate triplets and <type> tokens to close the
  head/tail spans. The state machine uses: t = reading head, s = reading tail, o = reading relation.
---

## Example

```python
from python.functions.datascience.parse_rebel_output import parse_rebel_output

decoded = "tp_XX<triplet> Pablo Isla <per> Inditex <org> employer"
triplets = parse_rebel_output(decoded)
# [{'head': 'Pablo Isla', 'head_type': 'per', 'type': 'employer',
#   'tail': 'Inditex', 'tail_type': 'org'}]
```

## REBEL / mREBEL wire format

```
tp_XX<triplet> HEAD_TOKENS <HEAD_TYPE> TAIL_TOKENS <TAIL_TYPE> RELATION_TOKENS<triplet> ...
```

- `<triplet>`: marks the start of a new triplet (and closes the previous one).
- `<HEAD_TYPE>`: closes the head span and opens the tail span.
- `<TAIL_TYPE>`: closes the tail span and opens the relation span.
- The last triplet is closed by `</s>` (already removed before the split).
||||
|
||||
## Notas
|
||||
|
||||
- No valida ni filtra los `head_type`/`tail_type` — los devuelve tal cual emite el modelo.
|
||||
- Compatible con cualquier variante seq2seq que use el mismo wire format (Babelscape/rebel,
|
||||
Babelscape/mrebel-large, Babelscape/mrebel-base).
|
||||
- Para usar el output en el grafo, pasar por `align_relations_to_entities` que resuelve
|
||||
head/tail a nombres canonicos del conjunto de entidades conocido.
|
||||
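
Because the type fields are returned unfiltered, a caller that only wants certain entity types has to filter afterwards. A minimal sketch, assuming a hypothetical helper `filter_triplets_by_type` (not part of the registry) and a hand-written sample in the shape `parse_rebel_output` returns:

```python
from __future__ import annotations


def filter_triplets_by_type(
    triplets: list[dict],
    allowed_types: set[str],
) -> list[dict]:
    """Keep triplets whose head_type and tail_type are both allowed.

    Hypothetical helper for illustration; not part of the registry.
    """
    return [
        t
        for t in triplets
        if t.get("head_type") in allowed_types
        and t.get("tail_type") in allowed_types
    ]


# Hand-written sample, not real model output.
sample = [
    {"head": "Pablo Isla", "head_type": "per", "type": "employer",
     "tail": "Inditex", "tail_type": "org"},
    {"head": "Inditex", "head_type": "org", "type": "inception",
     "tail": "1985", "tail_type": "date"},
]

kept = filter_triplets_by_type(sample, allowed_types={"per", "org", "loc"})
print(kept)  # only the Pablo Isla / Inditex triplet remains
```

Keeping this step outside the parser preserves its "never raises, never filters" contract; the allow-list is a caller decision.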
@@ -0,0 +1,105 @@
"""Pure parser for the REBEL / mREBEL wire format."""

from __future__ import annotations


def parse_rebel_output(decoded_text: str) -> list[dict]:
    """Parse REBEL / mREBEL decoded output into typed triplets.

    The input is the string produced by the HuggingFace tokenizer with
    ``skip_special_tokens=False``, e.g.::

        tp_XX<triplet> Pablo Isla <per> Inditex <org> employer <triplet> ...

    Args:
        decoded_text: Raw decoded string from the seq2seq model, including
            special tokens like ``<triplet>``, ``<relation>``, ``<per>``,
            ``<org>``, ``<loc>``, etc.

    Returns:
        List of dicts with keys:
        ``head`` (str), ``head_type`` (str),
        ``type`` (str), ``tail`` (str), ``tail_type`` (str).
        Returns an empty list on empty input or if no complete triplet is
        found. Never raises.
    """
    if not decoded_text or not decoded_text.strip():
        return []

    triplets: list[dict] = []

    # Strip language / padding tokens common to mREBEL.
    text = (
        decoded_text
        .replace("<s>", "")
        .replace("<pad>", "")
        .replace("</s>", "")
        .replace("tp_XX", "")
        .replace("__en__", "")
        .strip()
    )

    current = "x"  # x=init, t=head span, s=tail span, o=relation span
    subject = ""
    relation = ""
    object_ = ""
    object_type = ""
    subject_type = ""

    # Tokens are whitespace-separated; special tokens must stand alone.
    for token in text.split():
        if token in ("<triplet>", "<relation>"):
            current = "t"
            if relation:
                triplets.append(
                    {
                        "head": subject.strip(),
                        "head_type": subject_type,
                        "type": relation.strip(),
                        "tail": object_.strip(),
                        "tail_type": object_type,
                    }
                )
                relation = ""
            subject = ""
        elif token.startswith("<") and token.endswith(">"):
            if current in ("t", "o"):
                # Closing the head span (or a finished relation): now reading tail.
                current = "s"
                if relation:
                    triplets.append(
                        {
                            "head": subject.strip(),
                            "head_type": subject_type,
                            "type": relation.strip(),
                            "tail": object_.strip(),
                            "tail_type": object_type,
                        }
                    )
                object_ = ""
                subject_type = token[1:-1]
            else:
                # Closing the tail span: now reading relation.
                current = "o"
                object_type = token[1:-1]
                relation = ""
        else:
            if current == "t":
                subject += " " + token
            elif current == "s":
                object_ += " " + token
            elif current == "o":
                relation += " " + token

    # Flush the last triplet if all fields are present.
    if subject and relation and object_ and object_type and subject_type:
        triplets.append(
            {
                "head": subject.strip(),
                "head_type": subject_type,
                "type": relation.strip(),
                "tail": object_.strip(),
                "tail_type": object_type,
            }
        )

    return triplets
@@ -0,0 +1,64 @@
---
name: plot_heatmap_log
kind: function
lang: py
domain: datascience
version: "1.0.0"
purity: impure
signature: "def plot_heatmap_log(ax: Axes, xs: list[float] | np.ndarray, ys: list[float] | np.ndarray, extent: tuple[float, float, float, float], bins: int = 200, cmap: str = 'hot', alpha: float = 0.6) -> None"
description: "Draws a 2D heatmap with log1p scaling on a matplotlib Axes. Uses np.histogram2d with the given extent and ax.imshow to render."
tags: [visualization, heatmap, histogram, matplotlib, datascience, log]
uses_functions: []
uses_types: []
returns: []
returns_optional: false
error_type: "error_go_core"
imports: ["numpy", "matplotlib"]
params:
  - name: ax
    desc: "matplotlib Axes on which the heatmap is drawn."
  - name: xs
    desc: "X coordinates of the points."
  - name: ys
    desc: "Y coordinates of the points."
  - name: extent
    desc: "Bounding box as (minx, maxx, miny, maxy) that defines the histogram range."
  - name: bins
    desc: "Number of histogram bins along each axis. Default 200."
  - name: cmap
    desc: "Matplotlib colormap name. Default 'hot'."
  - name: alpha
    desc: "Opacity of the overlay (0-1). Default 0.6."
output: "None. Modifies the Axes in place by adding the heatmap as an image via ax.imshow."
tested: true
tests:
  - "100 points do not raise"
  - "ax has at least one image after the call"
test_file_path: "python/functions/datascience/tests/test_plot_heatmap_log.py"
file_path: "python/functions/datascience/plot_heatmap_log.py"
source_repo: "internal:footprint_aurgi"
source_license: "internal-aurgi"
source_file: "zonas_mapas_aurgi/examples/generar_reporte_madrid.py:62"
---

## Example

```python
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
import numpy as np
from datascience.plot_heatmap_log import plot_heatmap_log

rng = np.random.default_rng(42)
xs = rng.uniform(-4.0, -3.5, 500)
ys = rng.uniform(40.3, 40.6, 500)

fig, ax = plt.subplots()
plot_heatmap_log(ax, xs, ys, extent=(-4.0, -3.5, 40.3, 40.6), bins=100)
fig.savefig("heatmap.png")
```

## Notes

Applies `np.log1p` to the histogram counts to compress the dynamic range, so that both dense and sparse areas stay visible. Points outside `extent` are not counted. The histogram is transposed (`counts.T`) before being passed to imshow because `np.histogram2d` indexes its result as `[x_bin, y_bin]`, while imshow expects rows to be y and columns to be x; together with `origin="lower"` this aligns the image with the data axes. `aspect="auto"` lets the image stretch to the aspect ratio of the Axes.
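
The effect of the transpose and of `log1p` can be checked without NumPy. The following is a pure-Python stand-in for the binning step, written only for illustration (the function name `log_hist2d` is hypothetical and the registry implementation uses `np.histogram2d`):

```python
import math


def log_hist2d(xs, ys, extent, bins):
    """Pure-Python stand-in for np.histogram2d + counts.T + np.log1p.

    Returns a grid indexed [row][col] = [y_bin][x_bin], the layout that
    imshow expects (rows are y, columns are x).
    """
    minx, maxx, miny, maxy = extent
    grid = [[0] * bins for _ in range(bins)]
    for x, y in zip(xs, ys):
        if not (minx <= x <= maxx and miny <= y <= maxy):
            continue  # points outside the extent are ignored
        # min(...) places a point on the upper edge into the last bin.
        col = min(int((x - minx) / (maxx - minx) * bins), bins - 1)
        row = min(int((y - miny) / (maxy - miny) * bins), bins - 1)
        grid[row][col] += 1
    return [[math.log1p(count) for count in row] for row in grid]


# 100 points at low x / low y, and 1 point at high x / low y.
xs = [0.1] * 100 + [0.9]
ys = [0.1] * 100 + [0.1]
grid = log_hist2d(xs, ys, extent=(0.0, 1.0, 0.0, 1.0), bins=2)

dense, sparse = grid[0][0], grid[0][1]
print(dense / sparse)  # far below the raw 100:1 count ratio
print(grid[1][0])      # high-y row stays empty
```

The lone point lands in row 0 (low y), column 1 (high x), which is the orientation `counts.T` produces.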
@@ -0,0 +1,53 @@
"""Plot a log-scale 2D histogram heatmap on a matplotlib Axes."""

from __future__ import annotations


def plot_heatmap_log(
    ax: "Axes",
    xs: "list[float] | np.ndarray",
    ys: "list[float] | np.ndarray",
    extent: "tuple[float, float, float, float]",
    bins: int = 200,
    cmap: str = "hot",
    alpha: float = 0.6,
) -> None:
    """Plot a log-scale 2D density heatmap using histogram binning.

    Computes a 2D histogram over the given points within ``extent``, applies
    log1p to compress the dynamic range, and renders the result as an image
    overlay on the Axes.

    Args:
        ax: matplotlib Axes to draw on.
        xs: X coordinates (longitude or projected x).
        ys: Y coordinates (latitude or projected y).
        extent: Bounding box as (minx, maxx, miny, maxy).
        bins: Number of histogram bins along each axis. Default 200.
        cmap: Matplotlib colormap name. Default "hot".
        alpha: Opacity of the heatmap overlay (0-1). Default 0.6.
    """
    import numpy as np  # type: ignore

    xs_arr = np.asarray(xs, dtype=float)
    ys_arr = np.asarray(ys, dtype=float)

    minx, maxx, miny, maxy = extent

    counts, _xedges, _yedges = np.histogram2d(
        xs_arr,
        ys_arr,
        bins=bins,
        range=[[minx, maxx], [miny, maxy]],
    )

    # histogram2d returns [x_bin, y_bin]; imshow wants rows=y, cols=x.
    log_counts = np.log1p(counts.T)

    ax.imshow(
        log_counts,
        extent=[minx, maxx, miny, maxy],
        origin="lower",
        cmap=cmap,
        alpha=alpha,
        aspect="auto",
    )
@@ -0,0 +1,66 @@
---
name: plot_kde_2d
kind: function
lang: py
domain: datascience
version: "1.0.0"
purity: impure
signature: "def plot_kde_2d(ax: Axes, xs: list[float] | np.ndarray, ys: list[float] | np.ndarray, cmap: str = 'magma', alpha: float = 0.35, thresh: float = 0.02, levels: int = 30, bw_adjust: float = 0.6) -> None"
description: "Draws a 2D KDE as filled contours on a matplotlib Axes using seaborn.kdeplot. If the arrays are empty it returns without drawing."
tags: [visualization, kde, density, seaborn, matplotlib, datascience]
uses_functions: []
uses_types: []
returns: []
returns_optional: false
error_type: "error_go_core"
imports: ["numpy", "seaborn", "matplotlib"]
params:
  - name: ax
    desc: "matplotlib Axes on which the density is drawn."
  - name: xs
    desc: "X coordinates of the points (longitude or projected x)."
  - name: ys
    desc: "Y coordinates of the points (latitude or projected y)."
  - name: cmap
    desc: "Matplotlib colormap name for the density fill. Default 'magma'."
  - name: alpha
    desc: "Opacity of the density overlay (0-1). Default 0.35."
  - name: thresh
    desc: "Density threshold below which contours are not drawn (0-1). Default 0.02."
  - name: levels
    desc: "Number of contour levels. Default 30."
  - name: bw_adjust
    desc: "Adjustment factor for the kernel bandwidth. Values < 1 produce more detailed estimates. Default 0.6."
output: "None. Modifies the Axes in place by adding the density contours."
tested: true
tests:
  - "50 random points do not raise"
  - "empty arrays return without error"
test_file_path: "python/functions/datascience/tests/test_plot_kde_2d.py"
file_path: "python/functions/datascience/plot_kde_2d.py"
source_repo: "internal:footprint_aurgi"
source_license: "internal-aurgi"
source_file: "ponderacion_isochronas/src/recomendador_centros.py:275"
---

## Example

```python
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
import numpy as np
from datascience.plot_kde_2d import plot_kde_2d

rng = np.random.default_rng(42)
xs = rng.normal(0, 1, 200)
ys = rng.normal(0, 1, 200)

fig, ax = plt.subplots()
plot_kde_2d(ax, xs, ys, cmap="viridis", alpha=0.5)
fig.savefig("kde.png")
```

## Notes

Requires seaborn and numpy. `fill=True` is passed to seaborn.kdeplot to render filled contours (available since seaborn 0.11). Empty input is detected with `np.asarray(xs).size == 0` (and likewise for `ys`) before calling seaborn, to avoid errors inside the library.
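
To see what a `bw_adjust` below 1 does, it helps to look at a one-dimensional Gaussian KDE. The sketch below is a simplification written for illustration: it is 1D rather than 2D, it assumes Scott's rule for the base bandwidth, and the name `gaussian_kde_1d` is hypothetical. It is not how seaborn is implemented internally.

```python
import math
import statistics


def gaussian_kde_1d(data, x, bw_adjust=1.0):
    """Evaluate a 1D Gaussian KDE at x.

    Bandwidth: Scott's rule (std * n ** -0.2) multiplied by bw_adjust.
    """
    n = len(data)
    h = statistics.stdev(data) * n ** (-1 / 5) * bw_adjust
    norm = 1.0 / (n * h * math.sqrt(2 * math.pi))
    return norm * sum(math.exp(-0.5 * ((x - d) / h) ** 2) for d in data)


# Two tight clusters around 0 and 5, nothing in between.
data = [-0.2, -0.1, 0.0, 0.1, 0.2, 4.8, 4.9, 5.0, 5.1, 5.2]

results = {}
for bw in (1.0, 0.6):
    peak = gaussian_kde_1d(data, 0.0, bw_adjust=bw)    # inside a cluster
    valley = gaussian_kde_1d(data, 2.5, bw_adjust=bw)  # between clusters
    results[bw] = (peak, valley)
    print(bw, round(peak, 3), round(valley, 3))
```

With the narrower bandwidth the density rises inside the clusters and falls in the gap between them, which is the "more detailed" behaviour the parameter description refers to.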
@@ -0,0 +1,53 @@
"""Plot a 2D KDE density overlay on a matplotlib Axes using seaborn."""

from __future__ import annotations


def plot_kde_2d(
    ax: "Axes",
    xs: "list[float] | np.ndarray",
    ys: "list[float] | np.ndarray",
    cmap: str = "magma",
    alpha: float = 0.35,
    thresh: float = 0.02,
    levels: int = 30,
    bw_adjust: float = 0.6,
) -> None:
    """Plot a 2D kernel density estimate as a filled contour overlay.

    Uses seaborn.kdeplot to render a smooth density surface over the given
    scatter of (x, y) points. If either array is empty the function returns
    immediately without painting anything.

    Args:
        ax: matplotlib Axes to draw on.
        xs: X coordinates (longitude or projected x).
        ys: Y coordinates (latitude or projected y).
        cmap: Matplotlib colormap name for the density fill. Default "magma".
        alpha: Opacity of the density overlay (0-1). Default 0.35.
        thresh: Density threshold below which contours are not drawn (0-1).
            Default 0.02 removes very sparse outlier contours.
        levels: Number of contour levels. Default 30.
        bw_adjust: Bandwidth adjustment factor for the kernel. Values < 1
            produce tighter, more detailed estimates. Default 0.6.
    """
    import numpy as np  # type: ignore
    import seaborn as sns  # type: ignore

    xs_arr = np.asarray(xs)
    ys_arr = np.asarray(ys)

    # Guard against empty input before handing off to seaborn.
    if xs_arr.size == 0 or ys_arr.size == 0:
        return

    sns.kdeplot(
        x=xs_arr,
        y=ys_arr,
        ax=ax,
        cmap=cmap,
        fill=True,
        alpha=alpha,
        thresh=thresh,
        levels=levels,
        bw_adjust=bw_adjust,
    )
@@ -0,0 +1,65 @@
---
name: rebel_load_model
kind: function
lang: py
domain: datascience
version: "1.0.0"
purity: impure
signature: "def rebel_load_model(model_name: str = 'Babelscape/rebel-large') -> tuple[Any, Any]"
description: "Loads (and caches) the REBEL tokenizer and model (BART-based, ~1.5 GB). English only. Apache 2.0 license: commercial use permitted. Cached by model_name."
tags: [rebel, relation-extraction, nlp, model, huggingface, english, seq2seq, apache2, datascience, python]
uses_functions: []
uses_types: []
returns: []
returns_optional: false
error_type: "error_go_core"
imports: [transformers]
params:
  - name: model_name
    desc: "Model ID on the HuggingFace Hub (default: Babelscape/rebel-large, BART ~1.5 GB, EN only)"
output: "Tuple (tokenizer, model) ready for inference, cached by model_name."
tested: false
tests: []
test_file_path: ""
file_path: "python/functions/datascience/rebel_load_model.py"
notes: |
  LICENSE: Apache 2.0, commercial use permitted (unlike mREBEL, which is CC BY-NC-SA).
  Only works well with ENGLISH text. For Spanish use mrebel_load_model.

  REBEL uses the same wire format as mREBEL, so parse_rebel_output is compatible.
  Difference vs mREBEL: it does not emit the tp_XX language prefix in the output
  (parse_rebel_output copes either way: its .replace('tp_XX', '') is a no-op when the prefix is absent).

  impure: downloads over the network / reads from disk on the first call, keeps state in _MODEL_CACHE.
  The cache is separate from that of mrebel_load_model (different module).
---

## Example

```python
from python.functions.datascience.rebel_load_model import rebel_load_model
from python.functions.datascience.parse_rebel_output import parse_rebel_output

tokenizer, model = rebel_load_model()

text = "Pablo Isla is the CEO of Inditex, based in Arteixo."
inputs = tokenizer(text, return_tensors="pt", max_length=512, truncation=True)
generated = model.generate(**inputs, num_beams=4, length_penalty=1.0, max_length=256)
decoded = tokenizer.decode(generated[0], skip_special_tokens=False)
triplets = parse_rebel_output(decoded)
```

## REBEL vs mREBEL comparison

| | REBEL | mREBEL |
|---|---|---|
| License | Apache 2.0 (commercial OK) | CC BY-NC-SA 4.0 (non-commercial) |
| Languages | English only | 30+ (es_XX, en_XX, fr_XX...) |
| Size | ~1.5 GB | ~2.4 GB (large) / ~900 MB (base) |
| Base | BART | mBART-50 |

## Size and latency

- `Babelscape/rebel-large`: ~1.5 GB on disk.
- First load: 20-60 s on CPU.
- CPU inference: 3-10 s per sentence (faster than mREBEL because it is BART rather than mBART).
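
Given the cost of the first load, the per-`model_name` cache is what keeps repeated calls cheap. A minimal sketch of that caching pattern, assuming a hypothetical `load_cached` helper and a fake loader that stands in for the HuggingFace `from_pretrained` calls (so the snippet runs without `transformers` or network access):

```python
from __future__ import annotations

from typing import Any, Callable

# Module-level cache: name -> loaded object.
_CACHE: dict[str, Any] = {}


def load_cached(name: str, loader: Callable[[str], Any]) -> Any:
    """Return the cached object for name, calling loader only on a miss."""
    cached = _CACHE.get(name)
    if cached is not None:
        return cached
    obj = loader(name)
    _CACHE[name] = obj
    return obj


calls: list[str] = []


def fake_loader(name: str) -> tuple[str, str]:
    """Stand-in for the expensive tokenizer/model loading."""
    calls.append(name)
    return (f"tokenizer:{name}", f"model:{name}")


first = load_cached("Babelscape/rebel-large", fake_loader)
second = load_cached("Babelscape/rebel-large", fake_loader)
print(first is second, calls)  # True ['Babelscape/rebel-large']
```

The cache lives for the lifetime of the process and is never evicted, so loading several large models this way keeps all of them in memory.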
@@ -0,0 +1,52 @@
"""Loads (and caches) the REBEL model for relation extraction in English."""

from __future__ import annotations

from typing import Any

# Global cache: model_name -> (tokenizer, model)
_MODEL_CACHE: dict[str, tuple[Any, Any]] = {}


def rebel_load_model(
    model_name: str = "Babelscape/rebel-large",
) -> tuple[Any, Any]:
    """Loads (and caches) the REBEL tokenizer and model. English only.

    REBEL is a BART-based seq2seq model (~1.5 GB) for relation extraction,
    trained on the REBEL dataset (English Wikipedia abstracts aligned with
    Wikidata). It extracts triplets (head, relation, tail) from English text.

    LICENSE: Apache 2.0, commercial use permitted.

    The first call downloads the model from HuggingFace Hub (~1.5 GB).
    Subsequent calls with the same ``model_name`` return the cached instance.

    Args:
        model_name: HuggingFace Hub model ID. Default is the large variant.

    Returns:
        Tuple ``(tokenizer, model)`` both ready for inference.

    Raises:
        ImportError: if ``transformers`` is not installed.
        OSError: if the model cannot be downloaded or loaded from disk.
    """
    cached = _MODEL_CACHE.get(model_name)
    if cached is not None:
        return cached

    # Import lazily so the module can be imported without transformers.
    try:
        from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
    except ImportError as exc:
        raise ImportError(
            "transformers is not installed. Install it with "
            "`uv pip install transformers` or `uv pip install -e '.[nlp]'`."
        ) from exc

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
    model.eval()

    _MODEL_CACHE[model_name] = (tokenizer, model)
    return tokenizer, model
@@ -0,0 +1,52 @@
---
name: remove_words_from_column
kind: function
lang: py
domain: datascience
version: "1.0.0"
purity: pure
signature: "def remove_words_from_column(values: Iterable[str | None], words: list[str]) -> list[str]"
description: "Removes specific words from an iterable of strings using a whole-word regex (\\b). Case-insensitive. Collapses repeated whitespace and strips. None becomes an empty string. No pandas."
tags: [text, cleaning, regex, words, nlp, datascience]
params:
  - name: values
    desc: "Iterable of strings or None to clean."
  - name: words
    desc: "List of words to remove. Matching is case-insensitive and whole-word (not partial)."
output: "List of strings with the words removed and whitespace normalized. Same length as the input."
uses_functions: []
uses_types: []
returns: []
returns_optional: false
error_type: ""
imports: []
tested: true
tests:
  - "removes words case-insensitively"
  - "None returns an empty string"
  - "collapses repeated whitespace"
  - "empty words does not modify"
  - "whole word, not partial"
  - "empty list"
test_file_path: "python/functions/datascience/tests/test_remove_words_from_column.py"
file_path: "python/functions/datascience/remove_words_from_column.py"
source_repo: "internal:footprint_aurgi"
source_license: "internal-aurgi"
source_file: "fuzzy_joins/arreglo_fuzzy.py"
---

## Example

```python
from remove_words_from_column import remove_words_from_column

result = remove_words_from_column(
    ["Calle Mayor 14", "Avenida del Sol"],
    words=["calle", "avenida", "del"]
)
# ["Mayor 14", "Sol"]
```

## Notes

The regex pattern is compiled once for the whole iterable. It uses `\b` so that partial words are left alone ("calle" does not touch "calleja"). None in the input produces "" in the output.
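
The whole-word behaviour can be reproduced with the standard library alone. A small sketch that mirrors the pattern construction described above (the `clean` helper and the sample strings are made up for illustration):

```python
import re

words = ["calle", "del"]

# One alternation wrapped in word boundaries, compiled once.
pattern = re.compile(
    r"\b(" + "|".join(re.escape(w) for w in words) + r")\b",
    flags=re.IGNORECASE,
)


def clean(value: str) -> str:
    """Remove whole-word matches, then collapse whitespace."""
    without_words = pattern.sub("", value)
    return re.sub(r"\s+", " ", without_words).strip()


print(clean("CALLE  del   Sol"))    # Sol
print(clean("Calleja del Carmen"))  # Calleja Carmen
```

`re.escape` matters when a word contains regex metacharacters (for example a trailing dot in an abbreviation); without it the word would be interpreted as a pattern.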
@@ -0,0 +1,42 @@
"""Removes specific words from a list of strings."""

from __future__ import annotations

import re
from typing import Iterable


def remove_words_from_column(
    values: Iterable[str | None],
    words: list[str],
) -> list[str]:
    """Remove words from a list of strings using a whole-word regex.

    For each string, applies the case-insensitive pattern \\b(w1|w2|...)\\b,
    replaces matches with the empty string, collapses repeated whitespace
    and strips. None becomes an empty string.

    Args:
        values: Iterable of strings (or None) to clean.
        words: List of words to remove (case-insensitive).

    Returns:
        List of strings with the words removed and whitespace normalized.
    """
    if not words:
        return [v if v is not None else "" for v in values]

    # Compile once; re.escape keeps metacharacters in words literal.
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(w) for w in words) + r")\b",
        flags=re.IGNORECASE,
    )

    result = []
    for value in values:
        if value is None:
            result.append("")
            continue
        cleaned = pattern.sub("", str(value))
        cleaned = re.sub(r"\s+", " ", cleaned).strip()
        result.append(cleaned)
    return result
@@ -0,0 +1,61 @@
---
name: spacy_es_load_model
kind: function
lang: py
domain: datascience
version: "1.0.0"
purity: impure
signature: "def spacy_es_load_model(model_name: str = 'es_core_news_md') -> Any"
description: "Loads (and caches) a Spanish spaCy model. Provides POS, dependencies and NER (PER, ORG, LOC, MISC). Used by extract_triples_spacy_es for schema-less OpenIE. LICENSE: spaCy MIT + es_core_news_md CC BY-SA 4.0."
tags: [spacy, nlp, spanish, ner, dependency-parsing, openie, model, datascience, python, mit, cc-by-sa]
uses_functions: []
uses_types: []
returns: []
returns_optional: false
error_type: "error_go_core"
imports: [spacy]
params:
  - name: model_name
    desc: "Name of the installed spaCy model. Default: es_core_news_md (balance of accuracy and size). Alternatives: es_core_news_sm (smaller, less accurate), es_core_news_lg (larger, more accurate)."
output: "spaCy Language instance cached by model_name. Provides nlp(text) -> Doc with tokens, POS, deps and ents."
tested: true
tests:
  - "cache returns the same instance"
  - "OSError if the model is not installed"
test_file_path: "python/functions/datascience/tests/test_spacy_es_load_model.py"
file_path: "python/functions/datascience/spacy_es_load_model.py"
notes: |
  LICENSE: spaCy is MIT. The es_core_news_md model uses weights trained on
  the CoNLL-2002 corpus (CC BY-SA 4.0). Commercial use permitted with attribution.

  Install the model before use:
    python -m spacy download es_core_news_md

  impure: loads the model from disk on the first call, keeps state in _MODEL_CACHE.
  Size: es_core_news_md ~43 MB. First load ~1-3 s on CPU.
---

## Example

```python
from datascience.spacy_es_load_model import spacy_es_load_model

nlp = spacy_es_load_model()

doc = nlp("Carlos Torres preside BBVA en Bilbao.")
for ent in doc.ents:
    print(ent.text, ent.label_)
# Carlos Torres PER
# BBVA ORG
# Bilbao LOC
```

## Installation

```bash
# In the registry venv:
python/.venv/bin/python3 -m spacy download es_core_news_md

# Or via uv:
cd python && uv run python -m spacy download es_core_news_md
```
@@ -0,0 +1,40 @@
"""Loads (and caches) a Spanish spaCy model for NER and OpenIE.

LICENSE: spaCy = MIT. Model es_core_news_md = CC BY-SA 4.0 (CoNLL-2002 data).
"""

from __future__ import annotations

from typing import Any

# Global cache: model_name -> spaCy nlp instance
_MODEL_CACHE: dict[str, Any] = {}


def spacy_es_load_model(model_name: str = "es_core_news_md") -> Any:
    """Load (and cache) a spaCy Spanish language model.

    The model provides dependency parsing, POS tagging and NER (PER, ORG, LOC, MISC).
    Used by extract_triples_spacy_es for schema-less OpenIE in Spanish.

    LICENSE: spaCy = MIT. es_core_news_md = CC BY-SA 4.0 (CoNLL-2002 corpus).

    Args:
        model_name: Name of the spaCy model. Default: es_core_news_md.
            Alternatives: es_core_news_sm (smaller), es_core_news_lg (larger).

    Returns:
        spaCy Language instance cached by model_name.

    Raises:
        OSError: If the model is not installed. Install with:
            python -m spacy download es_core_news_md
    """
    if model_name in _MODEL_CACHE:
        return _MODEL_CACHE[model_name]

    # Imported lazily so the module loads even when spaCy is absent.
    import spacy  # type: ignore[import]

    nlp = spacy.load(model_name)
    _MODEL_CACHE[model_name] = nlp
    return nlp
@@ -0,0 +1,38 @@
---
id: summary_stats_py_datascience
name: summary_stats
kind: function
lang: py
domain: datascience
version: "1.0.0"
purity: pure
signature: "def summary_stats(values: list[float]) -> dict"
description: "Returns basic descriptive statistics (n, mean, median, p25, p75) for a list of floats. Empty input returns n=0 and nan for all numeric fields."
tags: [statistics, descriptive, eda, summary, percentile]
uses_functions: []
uses_types: []
returns: []
returns_optional: false
error_type: ""
imports: [math, numpy]
example: |
  from summary_stats import summary_stats
  result = summary_stats([1, 2, 3, 4, 5])
tested: true
tests:
  - "test_summary_stats_basic"
  - "test_summary_stats_empty"
  - "test_summary_stats_single"
  - "test_summary_stats_keys"
test_file_path: "python/functions/datascience/tests/test_summary_stats.py"
file_path: "python/functions/datascience/summary_stats.py"
params:
  - name: values
    desc: "List of numeric values to summarize."
output: "Dict with n (int), mean, median, p25, p75 (floats). All floats are math.nan when values is empty."
source_repo: "internal:footprint_aurgi"
source_license: "internal-aurgi"
source_file: "ponderacion_isochronas/example/models/eda/utils.py:60"
---

Minimal pure function for quick EDA. It deliberately omits std, min, max and other percentiles to keep the interface small.
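
For a quick sanity check of the expected values without NumPy, the same five fields can be computed with the standard library. This is an illustrative cross-check, not the registry implementation; the name `summary_stats_stdlib` is hypothetical, and it relies on `statistics.quantiles(..., method="inclusive")` interpolating linearly between order statistics in the same way as NumPy's default percentile method.

```python
import math
import statistics


def summary_stats_stdlib(values):
    """Stdlib-only counterpart of summary_stats, for illustration."""
    if not values:
        return {"n": 0, "mean": math.nan, "median": math.nan,
                "p25": math.nan, "p75": math.nan}
    if len(values) == 1:
        # quantiles() needs at least two points on older Python versions.
        v = float(values[0])
        return {"n": 1, "mean": v, "median": v, "p25": v, "p75": v}
    p25, _p50, p75 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "n": len(values),
        "mean": float(statistics.fmean(values)),
        "median": float(statistics.median(values)),
        "p25": float(p25),
        "p75": float(p75),
    }


print(summary_stats_stdlib([1, 2, 3, 4, 5]))
# {'n': 5, 'mean': 3.0, 'median': 3.0, 'p25': 2.0, 'p75': 4.0}
```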
@@ -0,0 +1,36 @@
"""summary_stats: compute descriptive statistics for a numeric list."""

import math
import numpy as np


def summary_stats(values: list[float]) -> dict:
    """Return basic descriptive statistics for a list of floats.

    Args:
        values: List of numeric values.

    Returns:
        Dict with keys:
            "n" (int): number of elements.
            "mean" (float): arithmetic mean, or math.nan if empty.
            "median" (float): median, or math.nan if empty.
            "p25" (float): 25th percentile, or math.nan if empty.
            "p75" (float): 75th percentile, or math.nan if empty.
    """
    if not values:
        return {
            "n": 0,
            "mean": math.nan,
            "median": math.nan,
            "p25": math.nan,
            "p75": math.nan,
        }
    arr = np.array(values, dtype=float)
    return {
        "n": int(len(arr)),
        "mean": float(np.mean(arr)),
        "median": float(np.median(arr)),
        "p25": float(np.percentile(arr, 25)),
        "p75": float(np.percentile(arr, 75)),
    }
@@ -0,0 +1,103 @@
|
||||
"""Tests para align_relations_to_entities."""
|
||||
|
||||
import os
|
||||
import sys
|
||||
|
||||
sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "..", "..", ".."))
|
||||
|
||||
from python.functions.datascience.align_relations_to_entities import align_relations_to_entities
|
||||
|
||||
|
||||
def _t(head, head_type, relation, tail, tail_type):
|
||||
return {
|
||||
"head": head,
|
||||
"head_type": head_type,
|
||||
"type": relation,
|
||||
"tail": tail,
|
||||
"tail_type": tail_type,
|
||||
}
|
||||
|
||||
|
||||
def test_match_exacto_case_insensitive_resuelve_correctamente():
|
||||
triplets = [_t("pablo isla", "per", "employer", "inditex", "org")]
|
||||
entities = ["Pablo Isla", "Inditex"]
|
||||
result = align_relations_to_entities(triplets, entities)
|
||||
assert len(result) == 1
|
||||
assert result[0]["from"] == "Pablo Isla"
|
||||
assert result[0]["to"] == "Inditex"
|
||||
assert result[0]["kind"] == "employer"
|
||||
|
||||
|
||||
def test_substring_entity_en_span_del_head():
|
||||
# mREBEL emite "esta en Bilbao" pero la entidad es "Bilbao"
|
||||
triplets = [_t("esta en Bilbao", "loc", "located in", "Espana", "loc")]
|
||||
entities = ["Bilbao", "Espana"]
|
||||
result = align_relations_to_entities(triplets, entities)
|
||||
assert len(result) == 1
|
||||
assert result[0]["from"] == "Bilbao"
|
||||
assert result[0]["to"] == "Espana"
|
||||
|
||||
|
||||
def test_substring_span_dentro_del_nombre_de_entidad():
|
||||
    # The span "Santander" is contained in the entity name "Banco Santander".
    triplets = [_t("Santander", "org", "owns", "Openbank", "org")]
    entities = ["Banco Santander", "Openbank"]
    result = align_relations_to_entities(triplets, entities)
    assert len(result) == 1
    assert result[0]["from"] == "Banco Santander"
    assert result[0]["to"] == "Openbank"


def test_gana_nombre_de_entidad_mas_largo_en_ambiguedad():
    # Two entity names contain the span "Madrid": "Madrid" and
    # "Comunidad de Madrid". The substring search is bidirectional and walks
    # names_by_len (sorted by length, descending), so on its own it would pick
    # "Comunidad de Madrid", the longer name whose lowercase form contains "madrid".
    triplets = [_t("Madrid", "loc", "capital of", "Espana", "loc")]
    entities = ["Madrid", "Comunidad de Madrid", "Espana"]
    result = align_relations_to_entities(triplets, entities)
    assert len(result) == 1
    # The exact case-insensitive match runs before the substring search and
    # resolves "Madrid" -> "Madrid" directly, so this test only checks that
    # nothing breaks and that from/to are values taken from entities.
    assert result[0]["from"] in entities
    assert result[0]["to"] in entities


def test_triplet_sin_match_se_descarta():
    triplets = [_t("Unknown Entity", "per", "works for", "Another Unknown", "org")]
    entities = ["Pablo Isla", "Inditex"]
    result = align_relations_to_entities(triplets, entities)
    assert result == []


def test_triplet_con_head_igual_tail_se_descarta_self_loop():
    triplets = [_t("Inditex", "org", "owns", "Inditex", "org")]
    entities = ["Inditex", "Zara"]
    result = align_relations_to_entities(triplets, entities)
    assert result == []


def test_lista_triplets_vacia_retorna_vacia():
    result = align_relations_to_entities([], ["Pablo Isla", "Inditex"])
    assert result == []


def test_lista_entity_names_vacia_retorna_vacia():
    triplets = [_t("Pablo Isla", "per", "employer", "Inditex", "org")]
    result = align_relations_to_entities(triplets, [])
    assert result == []


def test_multiples_triplets_con_mezcla_de_matches_y_descartes():
    triplets = [
        _t("Pablo Isla", "per", "employer", "Inditex", "org"),  # match
        _t("Ghost Entity", "per", "employer", "Inditex", "org"),  # head has no match
        _t("Pablo Isla", "per", "employer", "Pablo Isla", "per"),  # self-loop
    ]
    entities = ["Pablo Isla", "Inditex"]
    result = align_relations_to_entities(triplets, entities)
    assert len(result) == 1
    assert result[0]["from"] == "Pablo Isla"
    assert result[0]["to"] == "Inditex"
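# Illustrative sketch only: this is NOT the registry implementation of
# align_relations_to_entities. It shows one way to get the matching order the
# tests above describe (exact case-insensitive match first, then a
# bidirectional substring match that prefers the longest entity name, with
# unmatched triplets and self-loops dropped). The names resolve_name and
# align_sketch are hypothetical, and the triplet keys head/tail/type are
# assumed from the parse_rebel_output tests.

```python
from typing import Optional


def resolve_name(span: str, entity_names: list[str]) -> Optional[str]:
    """Map a raw span to a known entity name, or None if nothing matches."""
    span_l = span.strip().lower()
    if not span_l:
        return None
    # Pass 1: exact case-insensitive match.
    for name in entity_names:
        if name.lower() == span_l:
            return name
    # Pass 2: bidirectional substring match, longest entity name first.
    for name in sorted(entity_names, key=len, reverse=True):
        name_l = name.lower()
        if name_l and (span_l in name_l or name_l in span_l):
            return name
    return None


def align_sketch(triplets: list[dict], entity_names: list[str]) -> list[dict]:
    """Keep only triplets whose head and tail resolve to two different entities."""
    aligned = []
    for t in triplets:
        src = resolve_name(t["head"], entity_names)
        dst = resolve_name(t["tail"], entity_names)
        if src is None or dst is None or src == dst:
            continue  # unmatched span or self-loop: discard
        aligned.append({"from": src, "to": dst, "type": t["type"]})
    return aligned
```

# Because the exact pass runs first, "Madrid" resolves to "Madrid" even when
# "Comunidad de Madrid" is also in the list; the substring pass only decides
# when there is no exact match.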
@@ -0,0 +1,38 @@
"""Tests for alpha_shape_concave_hull."""

import sys
import os

sys.path.insert(0, os.path.join(os.path.dirname(__file__), ".."))
from alpha_shape_concave_hull import alpha_shape_concave_hull


def test_alpha_shape_square_large_alpha():
    """4 corner points with large alpha should return a geometry."""
    pts = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
    result = alpha_shape_concave_hull(pts, alpha=10.0)
    assert result is not None


def test_alpha_shape_too_few_points():
    result = alpha_shape_concave_hull([(0, 0), (1, 0), (0, 1)], alpha=10.0)
    assert result is None


def test_alpha_shape_very_small_alpha_returns_none():
    """Alpha so small that no triangle circumradius fits."""
    pts = [(0.0, 0.0), (100.0, 0.0), (100.0, 100.0), (0.0, 100.0)]
    result = alpha_shape_concave_hull(pts, alpha=0.0001)
    assert result is None


def test_alpha_shape_5_points_returns_geometry():
    pts = [
        (0.0, 0.0),
        (2.0, 0.0),
        (2.0, 2.0),
        (0.0, 2.0),
        (1.0, 1.0),
    ]
    result = alpha_shape_concave_hull(pts, alpha=5.0)
    assert result is not None
@@ -0,0 +1,47 @@
"""Tests for best_central_tendency."""

import math
import sys
import os

sys.path.insert(0, os.path.join(os.path.dirname(__file__), ".."))
from best_central_tendency import best_central_tendency


def test_best_central_tendency_normal_ish():
    label, value = best_central_tendency([1, 2, 3, 4, 5], "normal-ish")
    assert label == "mean"
    assert abs(value - 3.0) < 1e-9


def test_best_central_tendency_right_skewed():
    label, value = best_central_tendency([1, 2, 3, 4, 5], "right-skewed")
    assert label == "median"
    assert abs(value - 3.0) < 1e-9


def test_best_central_tendency_left_skewed():
    label, value = best_central_tendency([1, 2, 3, 4, 5], "left-skewed")
    assert label == "median"


def test_best_central_tendency_lognormal_ish():
    label, value = best_central_tendency([1, 2, 4, 8], "lognormal-ish")
    assert label == "geometric_mean"
    assert abs(value - 2 ** 1.5) < 1e-6


def test_best_central_tendency_heavy_tail():
    label, value = best_central_tendency([1, 2, 3, 4, 5, 100], "heavy-tail")
    assert label == "trimmed_mean_5%"
    assert not math.isnan(value)


def test_best_central_tendency_empty():
    label, value = best_central_tendency([], "normal-ish")
    assert math.isnan(value)


def test_best_central_tendency_default():
    label, value = best_central_tendency([1, 2, 3, 4, 5], "other")
    assert label == "median"
@@ -0,0 +1,45 @@
"""Tests for detect_distribution_type."""

import sys
import os

sys.path.insert(0, os.path.join(os.path.dirname(__file__), ".."))
from detect_distribution_type import detect_distribution_type

import numpy as np


def test_detect_too_few_samples():
    result = detect_distribution_type([1] * 5)
    assert result["type"] == "too_few_samples"


def test_detect_normal_ish():
    rng = np.random.default_rng(42)
    values = rng.normal(0, 1, 200).tolist()
    result = detect_distribution_type(values)
    assert result["type"] == "normal-ish", f"Got {result['type']}"


def test_detect_right_skewed():
    rng = np.random.default_rng(0)
    # Exponential distribution is heavily right-skewed
    values = rng.exponential(scale=1.0, size=200).tolist()
    result = detect_distribution_type(values)
    assert result["type"] in ("right-skewed", "lognormal-ish", "heavy-tail"), f"Got {result['type']}"


def test_detect_stats_keys():
    rng = np.random.default_rng(7)
    values = rng.normal(5, 2, 100).tolist()
    result = detect_distribution_type(values)
    assert "stats" in result
    assert "n" in result["stats"]
    assert result["stats"]["n"] == 100


def test_detect_exactly_30():
    rng = np.random.default_rng(1)
    values = rng.normal(0, 1, 30).tolist()
    result = detect_distribution_type(values)
    assert result["type"] != "too_few_samples"
@@ -0,0 +1,67 @@
"""Tests for extract_graph_gliner2.

Uses a GLiNER2 stub to validate the contract without downloading the real model.
"""

from __future__ import annotations

import os
import sys

import pytest

sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "..", "..", ".."))

from python.functions.datascience.extract_graph_gliner2 import extract_graph_gliner2


class _Schema:
    def entities(self, labels):
        self._entities = labels
        return self

    def relations(self, labels):
        self._relations = labels
        return self


class _StubModel:
    """Stub that returns known entities and relations."""

    _extract_result = {
        "entities": {"person": ["Pablo Isla"], "organization": ["Inditex"]},
        "relation_extraction": {"ceo_of": [("Pablo Isla", "Inditex")]},
    }

    def create_schema(self):
        return _Schema()

    def extract(self, text, schema=None, threshold=0.3, include_confidence=False):
        return self._extract_result


def test_output_tiene_claves_entities_relation_extraction_elapsed_s():
    """The output has the keys entities, relation_extraction and elapsed_s."""
    result = extract_graph_gliner2(
        text="Pablo Isla es CEO de Inditex.",
        entity_labels=["person", "organization"],
        relation_labels=["ceo_of"],
        model=_StubModel(),
    )
    assert "entities" in result
    assert "relation_extraction" in result
    assert "elapsed_s" in result
    assert isinstance(result["elapsed_s"], float)


def test_stub_model_retorna_shape_correcto():
    """The stub model's output is passed through with the expected shape."""
    result = extract_graph_gliner2(
        text="Texto cualquiera.",
        entity_labels=["person"],
        relation_labels=["works_at"],
        model=_StubModel(),
        threshold=0.3,
    )
    assert result["entities"] == {"person": ["Pablo Isla"], "organization": ["Inditex"]}
    assert "ceo_of" in result["relation_extraction"]
@@ -0,0 +1,112 @@
"""Tests for extract_relations_mrebel using model stubs."""

import os
import sys

sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "..", "..", ".."))

from python.functions.datascience.extract_relations_mrebel import extract_relations_mrebel
from python.types.datascience.entity_candidate import EntityCandidate
from python.types.datascience.relation_candidate import RelationCandidate


# ---------------------------------------------------------------------------
# Stubs
# ---------------------------------------------------------------------------

class _TokenizerStub:
    """Tokenizer stub that returns trivial inputs and decodes to the canonical wire format."""

    def __init__(self, decoded_output: str = ""):
        self._decoded = decoded_output

    def __call__(self, text, return_tensors=None, max_length=512, truncation=True):
        return {"input_ids": [[1, 2, 3]]}

    def decode(self, token_ids, skip_special_tokens=True):
        return self._decoded


class _ModelStub:
    """Model stub that returns trivial tokens."""

    def generate(self, input_ids=None, num_beams=4, length_penalty=1.0, max_length=256, **kwargs):
        return [[10, 11, 12]]


# ---------------------------------------------------------------------------
# Tests
# ---------------------------------------------------------------------------

def test_flujo_completo_con_stub_produce_relation_candidates_correctos():
    # Canonical wire format with one valid triplet
    decoded = "<triplet> Pablo Isla <per> Inditex <org> employer"
    tok = _TokenizerStub(decoded_output=decoded)
    model = _ModelStub()

    entities = [
        EntityCandidate(name="Pablo Isla", type_label="PER", confidence=0.95),
        EntityCandidate(name="Inditex", type_label="ORG", confidence=0.92),
    ]
    text = "Pablo Isla es el presidente de Inditex."

    result = extract_relations_mrebel(text, entities, tok, model)

    assert len(result) == 1
    rc = result[0]
    assert isinstance(rc, RelationCandidate)
    assert rc.from_name == "Pablo Isla"
    assert rc.to_name == "Inditex"
    assert rc.relation_type == "employer"
    assert rc.confidence == 1.0


def test_menos_de_2_entidades_retorna_vacio():
    tok = _TokenizerStub()
    model = _ModelStub()
    entities = [EntityCandidate(name="Pablo Isla", type_label="PER")]
    result = extract_relations_mrebel("Texto cualquiera.", entities, tok, model)
    assert result == []


def test_texto_vacio_retorna_vacio():
    tok = _TokenizerStub()
    model = _ModelStub()
    entities = [
        EntityCandidate(name="A", type_label="PER"),
        EntityCandidate(name="B", type_label="ORG"),
    ]
    assert extract_relations_mrebel("", entities, tok, model) == []


def test_triplets_no_alineables_se_descartan():
    # The stub emits entities that are not in the entity list
    decoded = "<triplet> Ghost Entity <per> Unknown Org <org> some relation"
    tok = _TokenizerStub(decoded_output=decoded)
    model = _ModelStub()

    entities = [
        EntityCandidate(name="Pablo Isla", type_label="PER"),
        EntityCandidate(name="Inditex", type_label="ORG"),
    ]
    result = extract_relations_mrebel("Texto largo suficiente.", entities, tok, model)
    assert result == []


def test_multiples_frases_generan_multiples_candidates():
    # The stub always emits the same valid triplet, once per sentence
    decoded = "<triplet> Pablo Isla <per> Inditex <org> employer"
    tok = _TokenizerStub(decoded_output=decoded)
    model = _ModelStub()

    entities = [
        EntityCandidate(name="Pablo Isla", type_label="PER"),
        EntityCandidate(name="Inditex", type_label="ORG"),
    ]
    # Two sentences separated by ". "
    text = "Pablo Isla es el presidente de Inditex. Inditex tiene sedes en todo el mundo."

    result = extract_relations_mrebel(text, entities, tok, model)
    # There may be 1 or 2 depending on dedup; what matters is that the result is not empty
    assert len(result) >= 1
    assert all(isinstance(rc, RelationCandidate) for rc in result)
@@ -0,0 +1,81 @@
"""Tests for extract_triples_spacy_es.

Requires spaCy and es_core_news_md to be installed. If they are not, the tests are skipped.
"""

from __future__ import annotations

import os
import sys

import pytest

sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "..", "..", ".."))

from python.functions.datascience.extract_triples_spacy_es import extract_triples_spacy_es

spacy = pytest.importorskip("spacy", reason="spacy not installed - skip")


def _load_nlp():
    try:
        return spacy.load("es_core_news_md")
    except OSError:
        return None


_NLP = _load_nlp()
pytestmark = pytest.mark.skipif(
    _NLP is None,
    reason="es_core_news_md not installed - run: python -m spacy download es_core_news_md",
)


def test_oracion_simple_produce_tripleta_con_sujeto_verbo_objeto():
    """A simple sentence yields a triple with subject, verb and object."""
    result = extract_triples_spacy_es("Enmanuel quiere a Ashlly.", _NLP)
    assert len(result["triples"]) >= 1
    # At least one triple whose subject contains "Enmanuel" (case-insensitive)
    subjs = [t["subject"] for t in result["triples"]]
    assert any("enmanuel" in s.lower() for s in subjs)


def test_carlos_torres_preside_bbva():
    """'Carlos Torres preside BBVA.' yields a triple whose relation contains 'presidir'."""
    result = extract_triples_spacy_es("Carlos Torres preside BBVA.", _NLP)
    triples = result["triples"]
    assert len(triples) >= 1
    rels = [t["relation"] for t in triples]
    assert any("presidir" in r.lower() for r in rels)


def test_amancio_ortega_fundo_inditex_en_1985():
    """'Amancio Ortega fundo Inditex en 1985.' yields triples with fundar_en."""
    result = extract_triples_spacy_es(
        "Amancio Ortega fundo Inditex en 1985.", _NLP
    )
    triples = result["triples"]
    assert len(triples) >= 1
    # Ideally the verb and its objects yield 2 triples (Inditex, plus 1985 as an
    # oblique); only the subject is asserted here.
    subjs = {t["subject"] for t in triples}
    assert any("Amancio" in s or "Ortega" in s for s in subjs)
    # At least one triple must have Inditex or 1985 as its object
    objects = {t["object"] for t in triples}
    assert any("Inditex" in o or "1985" in o for o in objects)


def test_texto_sin_verbos_produce_tripletas_vacias():
    """Text without verbs yields no triples."""
    result = extract_triples_spacy_es("BBVA Santander Inditex.", _NLP)
    assert result["triples"] == []


def test_entities_ner_detecta_categorias():
    """NER entities include PER, ORG or LOC labels."""
    result = extract_triples_spacy_es(
        "Carlos Torres es presidente de BBVA en Bilbao.", _NLP
    )
    ents = result["entities"]
    labels = {e["label"] for e in ents}
    # Must detect at least one of PER, ORG or LOC
    assert labels & {"PER", "ORG", "LOC"}
@@ -0,0 +1,67 @@
"""Tests for fuzzy_merge_adaptive."""

import sys
import os

sys.path.insert(0, os.path.join(os.path.dirname(__file__), ".."))

from fuzzy_merge_adaptive import fuzzy_merge_adaptive


def test_left_join_con_typo():
    left = [{"name": "Madrid"}, {"name": "Barclona"}]
    right = [{"name": "Madrid", "cp": "28"}, {"name": "Barcelona", "cp": "08"}]
    result = fuzzy_merge_adaptive(left, right, left_key="name", right_key="name")
    assert len(result) == 2
    scores = [r["match_score"] for r in result]
    assert all(s >= 80 for s in scores), f"Low scores: {scores}"
    assert result[0]["cp"] == "28"
    assert result[1]["cp"] == "08"


def test_inner_join_excluye_sin_match():
    left = [{"name": "Madrid"}, {"name": "ZZZinexistente"}]
    right = [{"name": "Madrid", "cp": "28"}]
    result = fuzzy_merge_adaptive(
        left, right, left_key="name", right_key="name",
        thresholds=[90, 80, 70], how="inner"
    )
    assert len(result) == 1
    assert result[0]["fuzzy_match"] == "Madrid"


def test_left_join_sin_match_devuelve_none():
    left = [{"name": "ZZZinexistente"}]
    right = [{"name": "Madrid", "cp": "28"}]
    result = fuzzy_merge_adaptive(
        left, right, left_key="name", right_key="name",
        thresholds=[95], how="left"
    )
    assert len(result) == 1
    assert result[0]["fuzzy_match"] is None
    assert result[0]["match_score"] == 0
    assert result[0]["threshold_used"] is None


def test_threshold_adaptativo():
    left = [{"name": "Bcn"}]
    right = [{"name": "Barcelona", "cp": "08"}]
    result = fuzzy_merge_adaptive(
        left, right, left_key="name", right_key="name",
        thresholds=[90, 80, 70, 60, 50]
    )
    assert len(result) == 1
    # It may or may not match depending on the score, but threshold_used <= 90
    if result[0]["threshold_used"] is not None:
        assert result[0]["threshold_used"] <= 90


def test_colision_de_claves_usa_sufijos():
    left = [{"name": "Madrid", "info": "left_info"}]
    right = [{"name": "Madrid", "info": "right_info"}]
    result = fuzzy_merge_adaptive(left, right, left_key="name", right_key="name")
    assert len(result) == 1
    assert "info_left" in result[0]
    assert "info_right" in result[0]
    assert result[0]["info_left"] == "left_info"
    assert result[0]["info_right"] == "right_info"
@@ -0,0 +1,35 @@
"""Tests for geometric_mean."""

import math
import sys
import os

sys.path.insert(0, os.path.join(os.path.dirname(__file__), ".."))
from geometric_mean import geometric_mean


def test_geometric_mean_powers_of_two():
    result = geometric_mean([1, 2, 4, 8])
    expected = 2 ** 1.5  # ~2.828
    assert abs(result - expected) < 1e-6, f"Expected ~{expected}, got {result}"


def test_geometric_mean_filters_non_positive():
    result = geometric_mean([1, -2, 3])
    expected = math.exp((math.log(1) + math.log(3)) / 2)
    assert abs(result - expected) < 1e-6


def test_geometric_mean_empty_returns_nan():
    result = geometric_mean([])
    assert math.isnan(result)


def test_geometric_mean_all_negative_returns_nan():
    result = geometric_mean([-1, -2, -3])
    assert math.isnan(result)


def test_geometric_mean_single_positive():
    result = geometric_mean([9.0])
    assert abs(result - 9.0) < 1e-9
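# Illustrative sketch only, not the registry's geometric_mean: the tests above
# pin down the contract (non-positive values are filtered out, and NaN is
# returned when nothing positive remains). One way to satisfy it is to average
# the logarithms of the positive values. The name geometric_mean_sketch is
# hypothetical.

```python
import math


def geometric_mean_sketch(values):
    """Geometric mean of the strictly positive values; NaN if there are none."""
    positives = [float(v) for v in values if v > 0]
    if not positives:
        return math.nan
    # exp(mean(log(x))) equals the n-th root of the product, without overflow.
    return math.exp(sum(math.log(v) for v in positives) / len(positives))
```

# For [1, 2, 4, 8] the log-mean is ln(64) / 4 = 1.5 * ln(2), so the result is
# 2 ** 1.5, the same value the first test expects.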
@@ -0,0 +1,84 @@
"""Tests for gliner2_load_model.

The real model (gliner2) is optional. The tests use a stub to validate
the cache without downloading the model. Tests that require the real model
are guarded with pytest.importorskip('gliner2').
"""

from __future__ import annotations

import os
import sys

import pytest

sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "..", "..", ".."))

from python.functions.datascience.gliner2_load_model import (
    _MODEL_CACHE,
    _resolve_device,
    gliner2_load_model,
)


class _StubGLiNER2:
    """Duck-typed stub to validate the cache without downloading the real model."""

    @classmethod
    def from_pretrained(cls, model_name: str) -> "_StubGLiNER2":
        return cls()

    def create_schema(self):
        return self

    def entities(self, labels):
        return self

    def relations(self, labels):
        return self

    def extract(self, text, **kwargs):
        return {"entities": {}, "relation_extraction": {}}


def test_cache_devuelve_la_misma_instancia(monkeypatch):
    """The cache returns the same instance for the same parameters."""
    _MODEL_CACHE.clear()
    monkeypatch.setattr(
        "python.functions.datascience.gliner2_load_model.GLiNER2",
        _StubGLiNER2,
        raising=False,
    )
    # Seed the cache directly with a stub to simulate a completed first load
    key = ("fastino/gliner2-large-v1", "cpu")
    stub = _StubGLiNER2()
    _MODEL_CACHE[key] = stub

    # A later call with the same parameters must return the same object
    result = gliner2_load_model(model_name="fastino/gliner2-large-v1", device="cpu")
    assert result is stub
    _MODEL_CACHE.clear()


def test_device_auto_resuelve_a_cpu_si_torch_no_esta(monkeypatch):
    """device=auto resolves to cpu when torch is not installed."""
    import sys
    # Simulate torch being unavailable
    monkeypatch.setitem(sys.modules, "torch", None)
    resolved = _resolve_device("auto")
    assert resolved == "cpu"


def test_import_error_si_gliner2_no_esta_instalado():
    """Skipped when gliner2 is not installed; the ImportError path is not asserted yet."""
    pytest.importorskip("gliner2", reason="gliner2 not installed - skip real model test")
@@ -0,0 +1,46 @@
"""Tests for kde_density_levels."""

import sys
import os
import numpy as np

sys.path.insert(0, os.path.join(os.path.dirname(__file__), ".."))
from kde_density_levels import kde_density_levels


def test_kde_density_levels_returns_dict_for_50_points():
    rng = np.random.default_rng(42)
    xs = rng.normal(0, 1, 50).tolist()
    ys = rng.normal(0, 1, 50).tolist()
    result = kde_density_levels(xs, ys)
    assert result is not None
    assert "method" in result
    assert result["method"] in ("kde", "hist")
    assert "densities" in result
    assert len(result["densities"]) == 50
    assert "abs_level" in result
    assert "dense_level" in result


def test_kde_density_levels_none_for_few_points():
    result = kde_density_levels([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
    assert result is None


def test_kde_density_levels_none_for_4_points():
    result = kde_density_levels([1, 2, 3, 4], [1, 2, 3, 4])
    assert result is None


def test_kde_density_levels_levels_ordered():
    rng = np.random.default_rng(0)
    xs = rng.uniform(0, 10, 100).tolist()
    ys = rng.uniform(0, 10, 100).tolist()
    result = kde_density_levels(xs, ys, abs_quantile=0.1, dense_quantile=0.85)
    assert result is not None
    assert result["abs_level"] <= result["dense_level"]


def test_kde_density_levels_mismatched_lengths():
    result = kde_density_levels([1, 2, 3, 4, 5], [1, 2, 3])
    assert result is None
@@ -0,0 +1,75 @@
"""Tests for parse_rebel_output."""

import os
import sys

sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "..", "..", ".."))

from python.functions.datascience.parse_rebel_output import parse_rebel_output


def test_string_vacio_retorna_lista_vacia():
    assert parse_rebel_output("") == []


def test_string_solo_espacios_retorna_lista_vacia():
    assert parse_rebel_output(" ") == []


def test_un_triplet_completo_retorna_un_dict_con_campos_correctos():
    decoded = "tp_XX<triplet> Pablo Isla <per> Inditex <org> employer"
    result = parse_rebel_output(decoded)
    assert len(result) == 1
    t = result[0]
    assert t["head"] == "Pablo Isla"
    assert t["head_type"] == "per"
    assert t["tail"] == "Inditex"
    assert t["tail_type"] == "org"
    assert t["type"] == "employer"


def test_dos_triplets_retorna_dos_dicts():
    decoded = (
        "tp_XX<triplet> Pablo Isla <per> Inditex <org> employer "
        "<triplet> Arteixo <loc> A Coruna <loc> located in the administrative territorial entity"
    )
    result = parse_rebel_output(decoded)
    assert len(result) == 2
    assert result[0]["head"] == "Pablo Isla"
    assert result[0]["tail"] == "Inditex"
    assert result[1]["head"] == "Arteixo"
    assert result[1]["tail"] == "A Coruna"
    assert "located" in result[1]["type"]


def test_triplet_incompleto_sin_cierre_no_rompe():
    # Head span only, with no tail and no relation
    decoded = "tp_XX<triplet> Pablo Isla"
    result = parse_rebel_output(decoded)
    # The triplet is never closed: the result may be empty or incomplete, but the parser must not raise
    assert isinstance(result, list)


def test_tokens_angulares_desconocidos_no_lanzan_excepcion():
    # An unknown type such as <unknown_type> must not break the parser
    decoded = "<triplet> Entity One <unknown_type> Entity Two <org> some relation"
    result = parse_rebel_output(decoded)
    assert isinstance(result, list)


def test_sin_prefijo_tp_xx_funciona():
    # Monolingual REBEL does not emit tp_XX
    decoded = "<triplet> Barack Obama <per> United States <org> president of"
    result = parse_rebel_output(decoded)
    assert len(result) == 1
    assert result[0]["head"] == "Barack Obama"
    assert result[0]["tail"] == "United States"
    assert result[0]["type"] == "president of"


def test_strip_tags_s_pad():
    decoded = "<s><pad>tp_XX<triplet> Ana <per> BBVA <org> works at</s>"
    result = parse_rebel_output(decoded)
    assert len(result) == 1
    assert result[0]["head"] == "Ana"
    assert result[0]["tail"] == "BBVA"
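# Illustrative sketch only, not the registry's parse_rebel_output: the tests
# above exercise a wire format of the shape
#   <triplet> head <head_type> tail <tail_type> relation
# optionally preceded by a tp_XX prefix and wrapped in <s>/<pad>/</s> tokens.
# Assuming that shape, a regex-based parser could look like the following. The
# name parse_rebel_sketch and the regex are hypothetical; the real function may
# well use a token-by-token state machine instead.

```python
import re

# One typed triplet; the lookahead stops a type tag from swallowing "<triplet>".
_TRIPLET_RE = re.compile(
    r"<triplet>\s*([^<]+?)\s*<(?!triplet>)(\w+)>"
    r"\s*([^<]+?)\s*<(?!triplet>)(\w+)>"
    r"\s*([^<]+)"
)


def parse_rebel_sketch(decoded: str) -> list[dict]:
    """Parse typed triplets from a decoded string; incomplete ones are skipped."""
    # Drop sentence and padding tokens before matching.
    cleaned = re.sub(r"</?s>|<pad>", " ", decoded)
    triplets = []
    for m in _TRIPLET_RE.finditer(cleaned):
        head, head_type, tail, tail_type, relation = (g.strip() for g in m.groups())
        if head and tail and relation:
            triplets.append({
                "head": head,
                "head_type": head_type,
                "tail": tail,
                "tail_type": tail_type,
                "type": relation,
            })
    return triplets
```

# A string with no complete triplet (for example only a head span) simply
# produces no match, so the sketch returns an empty list instead of raising.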
@@ -0,0 +1,38 @@
"""Tests for plot_heatmap_log."""

import sys
from pathlib import Path

import matplotlib
matplotlib.use("Agg")

sys.path.insert(0, str(Path(__file__).parent.parent.parent))

from datascience.plot_heatmap_log import plot_heatmap_log


def test_100_puntos_no_lanza_excepcion():
    import matplotlib.pyplot as plt
    import numpy as np

    rng = np.random.default_rng(0)
    xs = rng.uniform(-4.0, -3.5, 100)
    ys = rng.uniform(40.3, 40.6, 100)

    fig, ax = plt.subplots()
    plot_heatmap_log(ax, xs, ys, extent=(-4.0, -3.5, 40.3, 40.6), bins=50)
    plt.close(fig)


def test_ax_tiene_imagen_tras_la_llamada():
    import matplotlib.pyplot as plt
    import numpy as np

    rng = np.random.default_rng(1)
    xs = rng.uniform(-4.0, -3.5, 100)
    ys = rng.uniform(40.3, 40.6, 100)

    fig, ax = plt.subplots()
    plot_heatmap_log(ax, xs, ys, extent=(-4.0, -3.5, 40.3, 40.6), bins=50)
    assert len(ax.images) > 0, "ax should have at least one image after heatmap"
    plt.close(fig)
@@ -0,0 +1,32 @@
"""Tests for plot_kde_2d."""

import sys
from pathlib import Path

import matplotlib
matplotlib.use("Agg")

sys.path.insert(0, str(Path(__file__).parent.parent.parent))

from datascience.plot_kde_2d import plot_kde_2d


def test_50_puntos_aleatorios_no_lanza_excepcion():
    import matplotlib.pyplot as plt
    import numpy as np

    rng = np.random.default_rng(42)
    xs = rng.normal(0, 1, 50)
    ys = rng.normal(0, 1, 50)

    fig, ax = plt.subplots()
    plot_kde_2d(ax, xs, ys)
    plt.close(fig)


def test_arrays_vacios_retorna_sin_error():
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    plot_kde_2d(ax, [], [])
    plt.close(fig)
@@ -0,0 +1,42 @@
"""Tests for remove_words_from_column."""

import sys
import os

sys.path.insert(0, os.path.join(os.path.dirname(__file__), ".."))

from remove_words_from_column import remove_words_from_column


def test_elimina_palabras_case_insensitive():
    values = ["Calle Mayor 14", "Avenida del Sol"]
    result = remove_words_from_column(values, words=["calle", "avenida", "del"])
    assert result == ["Mayor 14", "Sol"]


def test_none_devuelve_string_vacio():
    result = remove_words_from_column([None, "hola mundo"], words=["hola"])
    assert result[0] == ""
    assert result[1] == "mundo"


def test_colapsa_espacios_multiples():
    result = remove_words_from_column(["uno dos tres"], words=["dos"])
    assert result[0] == "uno tres"


def test_palabras_vacias_no_modifica():
    values = ["hola mundo", "foo bar"]
    result = remove_words_from_column(values, words=[])
    assert result == ["hola mundo", "foo bar"]


def test_palabra_completa_no_parcial():
    # "calle" must not remove part of "calleja"
    result = remove_words_from_column(["calleja mayor"], words=["calle"])
    assert result[0] == "calleja mayor"


def test_lista_vacia():
    result = remove_words_from_column([], words=["foo"])
    assert result == []
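# Illustrative sketch only, not the registry's remove_words_from_column: the
# tests above describe case-insensitive, whole-word removal, with None mapped
# to "" and the leftover whitespace collapsed. Assuming the words are plain
# alphanumeric tokens, a regex with \b word boundaries gives that behaviour.
# The name remove_words_sketch is hypothetical.

```python
import re


def remove_words_sketch(values, words):
    """Remove whole words (case-insensitive) from each value; None becomes ""."""
    texts = ["" if v is None else str(v) for v in values]
    if not words:
        return texts  # nothing to remove: values are returned unchanged
    # \b keeps "calle" from matching inside "calleja".
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(w) for w in words) + r")\b",
        re.IGNORECASE,
    )
    # Replace each hit with a space, then collapse runs of whitespace.
    return [" ".join(pattern.sub(" ", text).split()) for text in texts]
```

# Substituting a space rather than an empty string avoids gluing the
# neighbouring words together before the whitespace is collapsed.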
@@ -0,0 +1,46 @@
"""Tests for spacy_es_load_model."""

from __future__ import annotations

import os
import sys

import pytest

sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "..", "..", ".."))

from python.functions.datascience.spacy_es_load_model import (
    _MODEL_CACHE,
    spacy_es_load_model,
)

spacy = pytest.importorskip("spacy", reason="spacy not installed - skip")


def _has_model(model_name: str) -> bool:
    try:
        spacy.load(model_name)
        return True
    except OSError:
        return False


@pytest.mark.skipif(
    not _has_model("es_core_news_md"),
    reason="es_core_news_md not installed",
)
def test_cache_devuelve_la_misma_instancia():
    """The cache returns the same instance."""
    _MODEL_CACHE.clear()
    m1 = spacy_es_load_model("es_core_news_md")
    m2 = spacy_es_load_model("es_core_news_md")
    assert m1 is m2
    _MODEL_CACHE.clear()


def test_oserror_si_el_modelo_no_esta_instalado():
    """OSError is raised when the model is not installed."""
    _MODEL_CACHE.clear()
    with pytest.raises(OSError):
        spacy_es_load_model("es_nonexistent_model_xyz")
    _MODEL_CACHE.clear()
@@ -0,0 +1,38 @@
"""Tests for summary_stats."""

import math
import sys
import os

sys.path.insert(0, os.path.join(os.path.dirname(__file__), ".."))
from summary_stats import summary_stats


def test_summary_stats_basic():
    result = summary_stats([1, 2, 3, 4, 5])
    assert result["n"] == 5
    assert abs(result["mean"] - 3.0) < 1e-9
    assert abs(result["median"] - 3.0) < 1e-9
    assert abs(result["p25"] - 2.0) < 0.01
    assert abs(result["p75"] - 4.0) < 0.01


def test_summary_stats_empty():
    result = summary_stats([])
    assert result["n"] == 0
    assert math.isnan(result["mean"])
    assert math.isnan(result["median"])
    assert math.isnan(result["p25"])
    assert math.isnan(result["p75"])


def test_summary_stats_single():
    result = summary_stats([7.0])
    assert result["n"] == 1
    assert abs(result["mean"] - 7.0) < 1e-9
    assert abs(result["median"] - 7.0) < 1e-9


def test_summary_stats_keys():
    result = summary_stats([1, 2, 3])
    assert set(result.keys()) == {"n", "mean", "median", "p25", "p75"}
@@ -0,0 +1,62 @@
"""Tests for translate_es_to_en: smoke tests with a stub model."""

import os
import sys

sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "..", "..", ".."))

from python.functions.datascience.translate_es_to_en import translate_es_to_en


class _StubTokenizer:
    """Stub tokenizer that returns trivial inputs."""

    def __call__(self, text, return_tensors=None, max_length=512, truncation=True):
        # Returns a dict with an 'input_ids' key that the stub model accepts.
        return {"input_ids": [[1, 2, 3]], "_text": text}

    def decode(self, token_ids, skip_special_tokens=True):
        # Always returns "translated" for testing.
        return "translated"


class _StubModel:
    """Stub model that returns trivial tokens."""

    def generate(self, input_ids=None, num_beams=4, max_length=512, **kwargs):
        return [[10, 11, 12]]


def test_texto_vacio_retorna_string_vacio():
    tok = _StubTokenizer()
    model = _StubModel()
    assert translate_es_to_en("", tok, model) == ""


def test_solo_espacios_retorna_string_vacio():
    tok = _StubTokenizer()
    model = _StubModel()
    assert translate_es_to_en(" ", tok, model) == ""


def test_una_frase_en_espanol_produce_output_no_vacio():
    tok = _StubTokenizer()
    model = _StubModel()
    result = translate_es_to_en("Pablo Isla es presidente de Inditex.", tok, model)
    assert isinstance(result, str)
    assert len(result) > 0


def test_multiples_frases_se_unen_con_espacio():
    tok = _StubTokenizer()
    model = _StubModel()
    # The stub always returns "translated" for each sentence.
    result = translate_es_to_en(
        "Primera frase. Segunda frase. Tercera frase.",
        tok,
        model,
    )
    # With the stub, each of the three sentences yields "translated",
    # joined with a single space.
    parts = result.split(" ")
    assert all(p == "translated" for p in parts)
    assert len(parts) == 3
@@ -0,0 +1,33 @@
"""Tests for trimmed_mean."""

import math
import sys
import os

sys.path.insert(0, os.path.join(os.path.dirname(__file__), ".."))
from trimmed_mean import trimmed_mean


def test_trimmed_mean_basic():
    result = trimmed_mean([1, 2, 3, 4, 5, 100], 0.1)
    assert abs(result - 3.5) < 0.5, f"Expected ~3.5, got {result}"


def test_trimmed_mean_empty_returns_nan():
    result = trimmed_mean([], 0.05)
    assert math.isnan(result)


def test_trimmed_mean_no_trim():
    result = trimmed_mean([1.0, 2.0, 3.0, 4.0, 5.0], 0.0)
    assert abs(result - 3.0) < 1e-9


def test_trimmed_mean_single_element():
    result = trimmed_mean([42.0], 0.05)
    assert abs(result - 42.0) < 1e-9


def test_trimmed_mean_uniform():
    result = trimmed_mean([5.0, 5.0, 5.0, 5.0, 5.0], 0.1)
    assert abs(result - 5.0) < 1e-9
@@ -0,0 +1,49 @@
"""Tests for words_to_dataset."""

import sys
import os

sys.path.insert(0, os.path.join(os.path.dirname(__file__), ".."))

from words_to_dataset import words_to_dataset


def test_cuenta_palabras_repetidas():
    texts = ["calle mayor", "calle del sol", "avenida principal"]
    result = words_to_dataset(texts)
    palabras = {r["palabra"]: r["ocurrencias"] for r in result}
    assert palabras["CALLE"] == 2


def test_eliminar_stopwords_filtra_del():
    texts = ["calle mayor", "calle del sol", "avenida principal"]
    result = words_to_dataset(texts, eliminar_stopwords=True)
    palabras = {r["palabra"] for r in result}
    assert "DEL" not in palabras


def test_min_ocurrencias_filtra():
    texts = ["calle mayor", "calle del sol", "avenida principal"]
    result = words_to_dataset(texts, min_ocurrencias=2)
    palabras = {r["palabra"]: r["ocurrencias"] for r in result}
    assert "CALLE" in palabras
    assert "MAYOR" not in palabras


def test_none_ignorados():
    texts = ["hola mundo", None, "hola"]
    result = words_to_dataset(texts)
    palabras = {r["palabra"]: r["ocurrencias"] for r in result}
    assert palabras["HOLA"] == 2


def test_lista_vacia():
    result = words_to_dataset([])
    assert result == []


def test_orden_descendente():
    texts = ["a a a", "b b", "c"]
    result = words_to_dataset(texts)
    counts = [r["ocurrencias"] for r in result]
    assert counts == sorted(counts, reverse=True)
@@ -0,0 +1,85 @@
---
name: translate_es_to_en
kind: function
lang: py
domain: datascience
version: "1.0.0"
purity: impure
signature: "def translate_es_to_en(text: str, tokenizer: Any, model: Any, max_length: int = 512, num_beams: int = 4) -> str"
description: "Translates Spanish text to English sentence by sentence using MarianMT. Splits on sentence boundaries, translates each sentence independently and joins the results with a space. Preserves proper nouns better than passing the whole paragraph at once."
tags: [marianmt, translation, es-en, nlp, datascience, python]
uses_functions: [marianmt_es_en_load_model_py_datascience]
uses_types: []
returns: []
returns_optional: false
error_type: "error_go_core"
imports: [re]
params:
  - name: text
    desc: "Spanish text to translate; can be a single sentence or a multi-sentence paragraph"
  - name: tokenizer
    desc: "MarianMT tokenizer loaded with marianmt_es_en_load_model"
  - name: model
    desc: "MarianMT model loaded with marianmt_es_en_load_model"
  - name: max_length
    desc: "maximum length in tokens per sentence for tokenization and generation (default 512)"
  - name: num_beams
    desc: "number of beams for beam search; higher = better quality, slower (default 4)"
output: "Translated English text. Sentences joined with a single space. Empty string if the input is empty."
tested: true
tests:
  - "texto vacio retorna string vacio"
  - "una frase en espanol produce output no vacio"
test_file_path: "python/functions/datascience/tests/test_translate_es_to_en.py"
file_path: "python/functions/datascience/translate_es_to_en.py"
notes: |
  impure: calls model.generate, which depends on the model state (weights, device).

  The sentence split uses a regex lookbehind on [.!?] followed by whitespace.
  It never splits inside dotted tokens (S.A., U.S.A.) because they contain no
  whitespace. It applies no abbreviation rules, unlike NLTK sent_tokenize, so an
  abbreviation followed by a space is still treated as a sentence boundary.

  Useful as a preprocessor for rebel_load_model (English-only, Apache 2.0):
    ES text -> translate_es_to_en -> EN text -> REBEL -> triplets
  Direct alternative: mrebel_load_model (multilingual, CC BY-NC-SA).
---

## Example

```python
from python.functions.datascience.marianmt_es_en_load_model import marianmt_es_en_load_model
from python.functions.datascience.translate_es_to_en import translate_es_to_en

tokenizer, model = marianmt_es_en_load_model()

text = "Pablo Isla es presidente de Inditex. La empresa tiene sede en Arteixo."
translated = translate_es_to_en(text, tokenizer, model)
# Expected output along the lines of:
# "Pablo Isla is president of Inditex. The company is headquartered in Arteixo."
```

## Why sentence by sentence

Passing the whole paragraph to MarianMT can degrade the translation of proper nouns
because the model spreads its attention over the full context. Splitting by sentence gives:

1. Shorter context, so less confusion over proper nouns.
2. Truncation is less likely (512 tokens is enough for ordinary sentences).
3. A more predictable pipeline for debugging (each sentence can be inspected).

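The splitting step behind these points can be looked at in isolation. The following is a minimal sketch that reuses the sentence-boundary pattern from `translate_es_to_en.py`; the helper name `split_sentences` is hypothetical and exists only for this illustration. It also shows a limitation: an abbreviation followed by a space is treated as a sentence boundary.

```python
from __future__ import annotations

import re

# Same pattern as _SENTENCE_RE in translate_es_to_en.py: split on whitespace
# that follows ".", "!" or "?".
SENTENCE_RE = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text: str) -> list[str]:
    """Hypothetical helper: return the non-empty sentences of `text`."""
    return [s.strip() for s in SENTENCE_RE.split(text.strip()) if s.strip()]


print(split_sentences("Pablo Isla es presidente de Inditex. La empresa tiene sede en Arteixo."))
# ['Pablo Isla es presidente de Inditex.', 'La empresa tiene sede en Arteixo.']

# Dotted tokens are never split internally, but the space after "S.A." counts
# as a boundary because the pattern has no abbreviation rules.
print(split_sentences("Inditex S.A. tiene sede en Arteixo."))
# ['Inditex S.A.', 'tiene sede en Arteixo.']
```

If inputs with abbreviations in the middle of a sentence are common, inspecting the split output like this before translating is a cheap way to spot fragments that were cut too early.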
## Pipeline pattern: ES -> EN -> REBEL

```python
# marianmt_es_en_load_model and translate_es_to_en are imported as in the
# example above; rebel_load_model and parse_rebel_output must be imported
# from their own registry modules.
es_text = "Pablo Isla es presidente de Inditex."

# Step 1: load the models
mt_tok, mt_model = marianmt_es_en_load_model()
rebel_tok, rebel_model = rebel_load_model()

# Step 2: translate
en_text = translate_es_to_en(es_text, mt_tok, mt_model)

# Step 3: extract relations
inputs = rebel_tok(en_text, return_tensors="pt", max_length=512, truncation=True)
generated = rebel_model.generate(**inputs, num_beams=4, max_length=256)
decoded = rebel_tok.decode(generated[0], skip_special_tokens=False)
triplets = parse_rebel_output(decoded)
```
@@ -0,0 +1,68 @@
"""Translate Spanish text to English using MarianMT, sentence by sentence."""

from __future__ import annotations

import re
from typing import Any

# Sentence-split pattern: period, exclamation or question mark followed by whitespace.
_SENTENCE_RE = re.compile(r"(?<=[.!?])\s+")


def translate_es_to_en(
    text: str,
    tokenizer: Any,
    model: Any,
    max_length: int = 512,
    num_beams: int = 4,
) -> str:
    """Translate Spanish text to English, sentence by sentence.

    Splits the input on sentence boundaries (after ``.``, ``!``, ``?``),
    translates each sentence independently, and rejoins with a single space.
    Processing sentence by sentence preserves proper nouns (names, companies,
    locations) better than passing the full paragraph in a single call, because
    the translation model can focus on shorter context windows.

    Args:
        text: Spanish text to translate. Can be a single sentence or a
            multi-sentence paragraph.
        tokenizer: MarianMT tokenizer loaded with ``marianmt_es_en_load_model``.
        model: MarianMT model loaded with ``marianmt_es_en_load_model``.
        max_length: Maximum token length for each sentence during tokenization
            and generation. Sentences longer than this are truncated.
        num_beams: Number of beams for beam search. Higher = better quality,
            slower. Default 4 is a good tradeoff.

    Returns:
        Translated English text. Sentences joined with a single space.
        Returns an empty string if ``text`` is empty or whitespace-only.

    Raises:
        RuntimeError: if model.generate fails (propagated from transformers).
    """
    if not text or not text.strip():
        return ""

    sentences = _SENTENCE_RE.split(text.strip())
    sentences = [s.strip() for s in sentences if s.strip()]
    if not sentences:
        return ""

    translated_parts: list[str] = []
    for sentence in sentences:
        inputs = tokenizer(
            sentence,
            return_tensors="pt",
            max_length=max_length,
            truncation=True,
        )
        generated = model.generate(
            **inputs,
            num_beams=num_beams,
            max_length=max_length,
        )
        decoded = tokenizer.decode(generated[0], skip_special_tokens=True)
        translated_parts.append(decoded.strip())

    return " ".join(translated_parts)
@@ -0,0 +1,53 @@
---
id: trimmed_mean_py_datascience
name: trimmed_mean
kind: function
lang: py
domain: datascience
version: "1.0.0"
purity: pure
signature: "def trimmed_mean(values: list[float], trim: float = 0.05) -> float"
description: "Arithmetic mean after cutting the bottom and top trim percentiles. Returns math.nan for empty input."
tags: [statistics, mean, robust, trimming, outliers]
uses_functions: []
uses_types: []
returns: []
returns_optional: false
error_type: ""
imports: [math, numpy]
example: |
  from trimmed_mean import trimmed_mean
  result = trimmed_mean([1, 2, 3, 4, 5, 100], 0.1)  # 3.5
tested: true
tests:
  - "test_trimmed_mean_basic"
  - "test_trimmed_mean_empty_returns_nan"
  - "test_trimmed_mean_no_trim"
  - "test_trimmed_mean_single_element"
  - "test_trimmed_mean_uniform"
test_file_path: "python/functions/datascience/tests/test_trimmed_mean.py"
file_path: "python/functions/datascience/trimmed_mean.py"
params:
  - name: values
    desc: "List of numeric values to average."
  - name: trim
    desc: "Fraction to cut from each tail before averaging (0 <= trim < 0.5). Default 0.05."
output: "Trimmed arithmetic mean as float. Returns math.nan if values is empty or all values are trimmed away."
source_repo: "internal:footprint_aurgi"
source_license: "internal-aurgi"
source_file: "aurgi_mapas/generar_pdf_reporte.py:117"
---

## Example

```python
from trimmed_mean import trimmed_mean

trimmed_mean([1, 2, 3, 4, 5, 100], 0.1)  # 3.5 (1 and 100 fall outside [1.5, 52.5])
trimmed_mean([], 0.05)                   # math.nan
trimmed_mean([5.0, 5.0, 5.0], 0.0)       # 5.0
```

## Notes

Uses numpy.percentile to compute the thresholds lo and hi, then keeps the values within the inclusive range [lo, hi]. Because the filter works on values rather than on a fixed count, values tied with a threshold are kept. Useful for computing robust averages when the distribution contains extreme values.

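To make the thresholds concrete, here is a minimal sketch of the same filtering in pure Python. It assumes the default linear interpolation of `numpy.percentile`; the helper names `linear_percentile` and `trimmed_mean_sketch` are hypothetical stand-ins for illustration, not part of the registry function.

```python
from __future__ import annotations

import math


def linear_percentile(sorted_values: list[float], q: float) -> float:
    """Percentile q (0-100) of pre-sorted data, linearly interpolated."""
    pos = (q / 100.0) * (len(sorted_values) - 1)
    lower = math.floor(pos)
    upper = math.ceil(pos)
    frac = pos - lower
    return sorted_values[lower] + frac * (sorted_values[upper] - sorted_values[lower])


def trimmed_mean_sketch(values: list[float], trim: float = 0.05) -> float:
    """Mean of the values inside the inclusive range [lo, hi]."""
    if not values:
        return math.nan
    data = sorted(float(v) for v in values)
    lo = linear_percentile(data, trim * 100)
    hi = linear_percentile(data, (1 - trim) * 100)
    kept = [v for v in data if lo <= v <= hi]
    return sum(kept) / len(kept) if kept else math.nan


data = [1, 2, 3, 4, 5, 100]
print(linear_percentile(sorted(data), 10))  # 1.5
print(linear_percentile(sorted(data), 90))  # 52.5
print(trimmed_mean_sketch(data, 0.1))       # 3.5
```

With six values and `trim=0.1`, both the smallest and the largest value fall outside the thresholds, so the mean is taken over `[2, 3, 4, 5]`.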
@@ -0,0 +1,28 @@
"""trimmed_mean: Arithmetic mean after trimming extreme percentiles."""

import math
import numpy as np


def trimmed_mean(values: list[float], trim: float = 0.05) -> float:
    """Return the trimmed arithmetic mean of values.

    Cuts the bottom `trim` and top `trim` percentiles before averaging.
    Returns math.nan for an empty list or when trimming removes all elements.

    Args:
        values: List of numeric values.
        trim: Fraction to cut from each tail (0 <= trim < 0.5).

    Returns:
        Trimmed mean as float, or math.nan if the list is empty.
    """
    if not values:
        return math.nan
    arr = np.array(values, dtype=float)
    lo = np.percentile(arr, trim * 100)
    hi = np.percentile(arr, (1 - trim) * 100)
    trimmed = arr[(arr >= lo) & (arr <= hi)]
    if len(trimmed) == 0:
        return math.nan
    return float(np.mean(trimmed))
Some files were not shown because too many files have changed in this diff.