feat: bulk extraction from footprint_aurgi (41 funcs + 4 types + Docker geo stack)

Extracts functions from the internal footprint_aurgi project into the registry:
- core (6): slugify_ascii, normalize_for_join, cp_provincia_es, infer_provincia_from_cp, safe_read_csv_fallback, csv_to_parquet_duckdb
- pure geo (7): haversine_km, point_in_ring, point_in_polygon, point_in_polygons_bbox, polygon_bbox, extent_with_padding, distance_bucket
- geo I/O (4): load_geojson_polygons, load_boundary_gdf, add_basemap_osm, add_basemap_with_timeout
- valhalla client (4): valhalla_route, valhalla_isochrone, valhalla_isochrones_async, valhalla_matrix_1_to_n
- datascience stats (7): trimmed_mean, geometric_mean, detect_distribution_type, best_central_tendency, summary_stats, kde_density_levels, alpha_shape_concave_hull
- datascience fuzzy (3): fuzzy_merge_adaptive (rapidfuzz), words_to_dataset, remove_words_from_column
- datascience viz (2): plot_kde_2d, plot_heatmap_log
- infra (4): compress_pdf_ghostscript, render_table_page_pdfpages, add_header_logo, osm2pgsql_ingest
- pipelines (4): setup_geo_stack_docker, compute_centers_reachability, generate_isochrones_by_zone, count_points_per_zone
- geo types (4): LonLat, BBox, IsochroneRequest, Centro

Includes:
- apps/footprint_geo_stack/ (PostGIS + Martin + Valhalla via docker-compose)
- 131/132 tests pass (1 expected skip: depends on osm2pgsql being on PATH)
- Issue tracker: dev/issues/0052-footprint-aurgi-extraction.md
- Uniform attribution: source_repo internal:footprint_aurgi, source_license internal-aurgi
- Built with 9 agents in parallel (8 in wave 1 + 1 in wave 2 for pipelines)

Also commits previously uncommitted work: aggregate_extraction_results, chunk_with_overlap, clean_pdf_text, merge_entity_aliases, extract_graph_gliner2, extract_relations_mrebel, extract_triples_spacy_es, gliner2/mrebel/marianmt/rebel/spacy_es load_model, parse_rebel_output, translate_es_to_en, issues 0050/0051.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 23:35:22 +02:00
parent f73ea072bd
commit faac610745
193 changed files with 13146 additions and 3 deletions
@@ -0,0 +1,51 @@
---
name: aggregate_extraction_results
kind: function
lang: py
domain: core
version: "1.0.0"
purity: pure
signature: "def aggregate_extraction_results(extract_results: list[dict]) -> dict"
description: "Aggregates entities and relations from N per-chunk extraction results. Deduplicates entities by (type, name_lowercased), accumulating counts. Deduplicates relations by (head, rel_type, tail) using a Counter."
tags: [nlp, aggregation, entities, relations, deduplication, chunking, ner, re, graph]
uses_functions: []
uses_types: []
returns: []
returns_optional: false
error_type: ""
imports: [collections.Counter]
params:
  - name: extract_results
    desc: "List of per-chunk results. Each element has the shape {'entities': {type: [name, ...]}, 'relation_extraction': {rel_type: [(head, tail), ...]}}. This is the output of extract_graph_gliner2. Missing keys are tolerated."
output: "Dict with two fields: 'entities' -> dict keyed by (type, name_lower) with values {type, name, count}; 'relations' -> Counter mapping (head, rel_type, tail) -> count. Designed to feed filter_relations_by_entity_types and merge_entity_aliases."
tested: true
tests:
  - "empty list returns empty entities and empty relations"
  - "a single result is aggregated correctly"
  - "two overlapping results accumulate counts"
  - "entities are deduplicated case-insensitively"
test_file_path: "python/functions/core/tests/test_aggregate_extraction_results.py"
file_path: "python/functions/core/aggregate_extraction_results.py"
notes: |
  The output shape is deliberate, for composition with the pipeline:
  - entities keyed by (type, name_lower) allow O(1) lookup by type + name
  - relations as a Counter allow filtering by frequency (count >= 2)
  Does not apply coreference; merge_entity_aliases does that on the
  canonical names after aggregation.
---
## Example
```python
from core.aggregate_extraction_results import aggregate_extraction_results
results = [
{"entities": {"person": ["Pablo Isla"], "organization": ["Inditex"]},
"relation_extraction": {"ceo_of": [("Pablo Isla", "Inditex")]}},
{"entities": {"person": ["pablo isla"], "organization": ["Inditex"]},
"relation_extraction": {"ceo_of": [("Pablo Isla", "Inditex")]}},
]
agg = aggregate_extraction_results(results)
# agg["entities"][("person", "pablo isla")]["count"] == 2
# agg["relations"][("Pablo Isla", "ceo_of", "Inditex")] == 2
```
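Because `relations` is a Counter, filtering by frequency is a short dictionary comprehension. The sketch below is illustrative only: `filter_relations_by_count` is a hypothetical helper that is not part of the registry, and the Counter is a hand-built stand-in for `agg["relations"]`.

```python
from collections import Counter


def filter_relations_by_count(relations: Counter, min_count: int = 2) -> dict:
    """Hypothetical helper: keep triples seen at least `min_count` times."""
    return {triple: n for triple, n in relations.items() if n >= min_count}


# Stand-in for agg["relations"] as produced by aggregate_extraction_results.
relations = Counter({
    ("Pablo Isla", "ceo_of", "Inditex"): 2,
    ("Zara", "owned_by", "Inditex"): 1,
})

frequent = filter_relations_by_count(relations)
print(frequent)
```

Triples seen in only one chunk are dropped here; whether 2 is a sensible threshold depends on chunk size and overlap.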
@@ -0,0 +1,45 @@
"""Aggregate and deduplicate entities + relations from N per-chunk extraction results."""

from __future__ import annotations

from collections import Counter


def aggregate_extraction_results(extract_results: list[dict]) -> dict:
    """Aggregate entities + relations from multiple chunk-level extraction results.

    Deduplicates entities by (type, name_lowercased) and counts occurrences.
    Deduplicates relations by (head, rel_type, tail) and counts occurrences.

    Each input result is expected to have shape:
        {"entities": {type: [name, ...]}, "relation_extraction": {rel_type: [(head, tail), ...]}}
    This is the output format of extract_graph_gliner2.

    Args:
        extract_results: List of per-chunk extraction dicts. May be empty.
            Missing keys ("entities", "relation_extraction") are tolerated.

    Returns:
        {
            "entities": dict[(type, name_lower)] -> {"type": str, "name": str, "count": int},
            "relations": Counter mapping (head, rel_type, tail) -> count
        }
    """
    all_ents: dict[tuple[str, str], dict] = {}
    all_rels: Counter = Counter()
    for r in extract_results:
        for typ, names in (r.get("entities") or {}).items():
            for n in names:
                key = (typ, (n or "").strip().lower())
                if not key[1]:
                    continue  # skip None / empty / whitespace-only names
                if key not in all_ents:
                    # The first spelling seen becomes the display name.
                    all_ents[key] = {"type": typ, "name": n.strip(), "count": 0}
                all_ents[key]["count"] += 1
        for rt, pairs in (r.get("relation_extraction") or {}).items():
            for h, t in pairs:
                all_rels[(h.strip(), rt, t.strip())] += 1
    return {"entities": all_ents, "relations": all_rels}
@@ -0,0 +1,64 @@
---
name: chunk_with_overlap
kind: function
lang: py
domain: core
version: "1.0.0"
purity: pure
signature: "def chunk_with_overlap(text: str, max_chars: int = 1500, overlap_sentences: int = 2) -> list[dict]"
description: "Splits text into chunks at sentence boundaries with sliding-window overlap. Guarantees forced progress if a sentence exceeds max_chars (avoids an infinite loop). Each chunk is a dict with 'text' and 'sentences'."
tags: [text, chunking, nlp, split, overlap, sentence, ner, gliner, sliding-window]
uses_functions: []
uses_types: []
returns: []
returns_optional: false
error_type: ""
imports: [re]
params:
  - name: text
    desc: "Text to split. Sentences are detected by [.!?] followed by whitespace. Newlines are fine if the text was already cleaned with clean_pdf_text."
  - name: max_chars
    desc: "Maximum characters per chunk (soft limit). A single sentence longer than max_chars is still included, to avoid an infinite loop."
  - name: overlap_sentences
    desc: "Number of trailing sentences of the previous chunk to prepend to the current chunk. 0 disables overlap. The overlap is skipped when it plus the next sentence would exceed max_chars."
output: "List of dicts [{'text': str, 'sentences': list[str]}, ...]. 'text' is ready to pass to GLiNER2. Empty list if the input is empty."
tested: true
tests:
  - "empty text returns empty list"
  - "one sentence shorter than max_chars produces 1 chunk"
  - "multiple sentences produce N chunks with overlap"
  - "sentence longer than max_chars is included without an infinite loop"
  - "overlap=0 does not duplicate sentences across chunks"
  - "overlap=2 chunk N+1 starts with the last 2 sentences of chunk N"
test_file_path: "python/functions/core/tests/test_chunk_with_overlap.py"
file_path: "python/functions/core/chunk_with_overlap.py"
notes: |
  Algorithm validated empirically in notebook 06 of the gliner_glirel_tuning
  analysis. Sentence-level overlap (as opposed to character-level overlap)
  ensures that entities appearing at the end of one chunk also appear at the
  start of the next, improving GLiNER2 recall.
  split_text_into_chunks_py_core overlaps by characters (RAG mode).
  chunk_with_overlap overlaps by whole sentences (NER/RE mode). They are
  complementary, not competing.
---
## Example
```python
from core.chunk_with_overlap import chunk_with_overlap
text = "Pablo Isla preside Inditex. La empresa opera en 93 paises. Zara es su marca principal."
chunks = chunk_with_overlap(text, max_chars=80, overlap_sentences=1)
# chunk 0: text="Pablo Isla preside Inditex. La empresa opera en 93 paises."
# chunk 1: text="La empresa opera en 93 paises. Zara es su marca principal."
#                ^--- 1-sentence overlap
for c in chunks:
    print(c["text"])
```
## Difference from split_text_into_chunks
- `split_text_into_chunks`: character-level overlap, geared toward RAG
- `chunk_with_overlap`: whole-sentence overlap, geared toward NER/RE (GLiNER2)
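Sentence detection relies on a lookbehind for `[.!?]` followed by whitespace. The standalone sketch below reproduces only that split step (`split_sentences` is a hypothetical name, not a registry function) and shows one consequence worth knowing: abbreviations that end in a period, such as "S.A.", also count as sentence boundaries.

```python
import re


def split_sentences(text: str) -> list[str]:
    """Split on whitespace that follows '.', '!' or '?' (illustrative only)."""
    parts = re.split(r"(?<=[.!?])\s+", text)
    return [p.strip() for p in parts if p.strip()]


# "S.A." ends with a period followed by a space, so the text is split there.
sentences = split_sentences("Banco Bilbao Vizcaya Argentaria, S.A. operó en 2023.")
print(sentences)
```

For NER/RE this is usually harmless, because the overlap re-joins neighbouring fragments in the next chunk, but it can inflate the sentence count.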
@@ -0,0 +1,73 @@
"""Sentence-boundary chunking with sliding-window overlap.

Empirically validated in notebook 06 (gliner_glirel_tuning) for NER+RE
pipelines with GLiNER2. Fixes the infinite-loop bug of the naive version
when a sentence exceeds max_chars.
"""

from __future__ import annotations

import re


def chunk_with_overlap(
    text: str,
    max_chars: int = 1500,
    overlap_sentences: int = 2,
) -> list[dict]:
    """Split text into chunks with sentence-level sliding window overlap.

    Each chunk has up to `max_chars` characters. Each chunk after the first
    starts with the last `overlap_sentences` sentences of the previous chunk
    if they fit. If a single sentence exceeds max_chars, it is force-included
    (chunk size will exceed max_chars rather than infinite-loop).

    Args:
        text: Input text to split. Sentences are detected by [.!?] followed by whitespace.
        max_chars: Maximum characters per chunk (soft limit; exceeded if a single
            sentence is longer than max_chars to avoid infinite loop).
        overlap_sentences: Number of trailing sentences of the previous chunk to
            prepend to the next chunk. 0 disables overlap.

    Returns:
        list of dicts: [{"text": str, "sentences": list[str]}, ...]
        Empty list if text is empty or contains only whitespace.
    """
    if not text or not text.strip():
        return []

    sentences = re.split(r"(?<=[\.!?])\s+", text)
    sentences = [s.strip() for s in sentences if s.strip()]
    if not sentences:
        return []

    chunks: list[dict] = []
    i = 0
    while i < len(sentences):
        current_sents: list[str] = []
        current_len = 0

        # Overlap from the previous chunk (only if it fits with the next sentence)
        if chunks and overlap_sentences > 0:
            prev_sents = chunks[-1]["sentences"][-overlap_sentences:]
            overlap_len = sum(len(s) + 1 for s in prev_sents)
            next_len = len(sentences[i]) + 1
            if overlap_len + next_len <= max_chars:
                current_sents = list(prev_sents)
                current_len = overlap_len

        # FORCED PROGRESS: add at least ONE sentence even if it exceeds max_chars
        # (avoids an infinite loop on very long sentences)
        current_sents.append(sentences[i])
        current_len += len(sentences[i]) + 1
        i += 1

        # Keep adding sentences while they fit
        while i < len(sentences) and current_len + len(sentences[i]) + 1 <= max_chars:
            current_sents.append(sentences[i])
            current_len += len(sentences[i]) + 1
            i += 1

        chunks.append({"text": " ".join(current_sents), "sentences": current_sents})
    return chunks
@@ -0,0 +1,53 @@
---
name: clean_pdf_text
kind: function
lang: py
domain: core
version: "1.0.0"
purity: pure
signature: "def clean_pdf_text(text: str) -> str"
description: "Cleans PyPDF2/pdfplumber extraction artifacts: removes page markers (1/20), tabs, end-of-line hyphenation, mid-sentence line breaks, and duplicated spaces."
tags: [pdf, text, cleaning, nlp, preprocessing, pypdf2]
uses_functions: []
uses_types: []
returns: []
returns_optional: false
error_type: ""
imports: [re]
params:
  - name: text
    desc: "Plain text extracted from a PDF (e.g. via PyPDF2.PdfReader or pdfplumber). May contain pagination artifacts, end-of-line hyphenation, and spurious line breaks."
output: "Cleaned text with artifacts removed and whitespace normalized. Ready for chunking or NER extraction."
tested: true
tests:
  - "empty string returns empty string"
  - "page marker 1/20 is removed"
  - "dehyphenation exa-newline-mple -> example"
  - "duplicated spaces are collapsed"
  - "line break in the middle of a sentence is joined with a space"
  - "line break after a period is preserved"
test_file_path: "python/functions/core/tests/test_clean_pdf_text.py"
file_path: "python/functions/core/clean_pdf_text.py"
notes: |
  Pure function with no external dependencies (only re from the stdlib).
  The order of operations matters: dehyphenation runs before line-break
  collapsing to avoid false positives.
  Line breaks after a period, exclamation mark, or question mark are not
  removed: they mark the end of a sentence and must be preserved for the chunker.
---
## Example
```python
from core.clean_pdf_text import clean_pdf_text
raw = "Banco Bilbao Vizcaya Argen-\ntaria, S.A. operó en 2023.\n1/20\n\nFoo Bar"
clean = clean_pdf_text(raw)
# "Banco Bilbao Vizcaya Argentaria, S.A. operó en 2023.\nFoo Bar"
```
## Notes
Designed to preprocess text before passing it to `chunk_with_overlap` +
`extract_graph_gliner2`. The full pipeline is:
`extract_pdf_text` -> `clean_pdf_text` -> `chunk_with_overlap` -> `extract_graph_gliner2`.
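The order of the regex passes matters. The sketch below applies only two of the substitutions, dehyphenation and soft-newline collapsing, in both orders to show why dehyphenation has to run first. The constant names are illustrative and not part of the module.

```python
import re

DEHYPHEN = re.compile(r"-\s*\n\s*")           # "exa-\nmple" -> "example"
SOFT_NEWLINE = re.compile(r"(?<![.!?])\n+")   # newline not after sentence end

raw = "Vizcaya Argen-\ntaria opera en Bilbao."

# Dehyphenate first, then collapse remaining soft newlines: the word is rejoined.
right = SOFT_NEWLINE.sub(" ", DEHYPHEN.sub("", raw))

# Reversed order: the newline becomes a space first, so the hyphen pattern
# (which requires a newline) no longer matches and the word stays broken.
wrong = DEHYPHEN.sub("", SOFT_NEWLINE.sub(" ", raw))

print(right)
print(wrong)
```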
@@ -0,0 +1,32 @@
"""Cleanup of typical PyPDF2 extraction artifacts in plain text."""

from __future__ import annotations

import re


def clean_pdf_text(text: str) -> str:
    """Clean PDF text extraction artifacts.

    Removes: page-number markers like '1/20', tabs, hyphenated line breaks
    in mid-word, duplicated spaces, line breaks not at sentence end.

    Args:
        text: Raw text extracted from a PDF (e.g. via PyPDF2 or pdfplumber).

    Returns:
        Cleaned text with artifacts removed and whitespace normalized.
    """
    # Remove page markers such as "1/20" or "3/128"
    text = re.sub(r"\b\d{1,2}/\d{1,3}\b", " ", text)
    # Tabs to spaces
    text = text.replace("\t", " ")
    # Dehyphenation: "exa-\nmple" -> "example"
    text = re.sub(r"-\s*\n\s*", "", text)
    # Line breaks that are NOT at a sentence end -> space
    text = re.sub(r"(?<![\.!?])\n+", " ", text)
    # Collapse runs of spaces
    text = re.sub(r" {2,}", " ", text)
    # Drop empty lines and trim each remaining line
    text = "\n".join(line.strip() for line in text.split("\n") if line.strip())
    return text.strip()
@@ -0,0 +1,58 @@
---
id: cp_provincia_es_py_core
name: cp_provincia_es
kind: function
lang: py
domain: core
version: "1.0.0"
purity: pure
signature: "def cp_provincia_es(codigo_postal: str | int) -> str | None"
description: "Looks up the Spanish province for a postal code. Accepts a full postal code (5 digits) or a 2-digit prefix. Returns None if the prefix is not in the dictionary of the 52 Spanish provinces and autonomous cities."
tags: [string, normalization, spain, geography, postal-code]
uses_functions: []
uses_types: []
returns: []
returns_optional: false
error_type: ""
imports: []
example: |
  from cp_provincia_es import cp_provincia_es
  cp_provincia_es("28001")  # "Madrid"
  cp_provincia_es("28")     # "Madrid"
  cp_provincia_es(1)        # "Álava"
  cp_provincia_es("99")     # None
tested: true
tests: ["full postal code returns province", "2-digit prefix returns province", "first prefix 01 returns Alava", "unknown postal code returns None"]
test_file_path: "python/functions/core/tests/test_cp_provincia_es.py"
file_path: "python/functions/core/cp_provincia_es.py"
params:
  - name: codigo_postal
    desc: "Spanish postal code as a string or integer. Accepts a 5-digit postal code ('28001', 28001) or a 2-digit prefix ('28', 28)."
output: "Province name in Spanish (with diacritics), or None if the postal-code prefix does not match any known province."
source_repo: "internal:footprint_aurgi"
source_license: "internal-aurgi"
source_file: "aurgi_mapas/generar_pdf_reporte.py"
---
## Example
```python
from cp_provincia_es import cp_provincia_es
cp_provincia_es("28001")  # "Madrid"
cp_provincia_es("28")     # "Madrid"
cp_provincia_es(28)       # "Madrid"
cp_provincia_es("01")     # "Álava"
cp_provincia_es(1)        # "Álava" (inputs of 1-2 characters are treated as a prefix: "1" -> zfill(2) -> "01")
cp_provincia_es("99")     # None
```
## Notes
Pure function with no dependencies. The embedded dictionary covers the 50 Spanish
provinces plus Ceuta ("51") and Melilla ("52"). Copied verbatim from
`aurgi_mapas/generar_pdf_reporte.py:CP_TO_PROVINCIA`.
Note on integers: an input whose string form has at most 2 characters is treated as a prefix and padded with `zfill(2)`, so `cp_provincia_es(1)` -> "01" -> "Álava".
Longer inputs are padded with `zfill(5)` and the first two characters are used: `cp_provincia_es(1001)` -> "01001" -> "Álava".
Full numeric postal codes work as well: `cp_provincia_es(28001)` -> "Madrid".
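The prefix normalization can be isolated into a few lines. This is a minimal sketch mirroring the branching in the implementation: `cp_prefix` is a hypothetical helper name used for illustration only, and the province dictionary lookup is omitted.

```python
def cp_prefix(codigo_postal) -> str:
    """Return the 2-character province prefix for a postal code (illustrative)."""
    cp = str(codigo_postal).strip()
    if len(cp) <= 2:
        # Already a prefix (possibly missing its leading zero).
        return cp.zfill(2)
    # Full postal code, possibly missing leading zeros because it was an int.
    return cp.zfill(5)[:2]


for value in (1, "01", 1001, "28001", 801):
    print(repr(value), "->", cp_prefix(value))
```

A three-digit integer such as `801` pads to "00801" and yields the prefix "00", which matches no province, so the lookup would return None for it.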
@@ -0,0 +1,44 @@
"""Spanish province lookup by postal code."""

from __future__ import annotations

_CP_TO_PROVINCIA = {
    "01": "Álava", "02": "Albacete", "03": "Alicante", "04": "Almería",
    "05": "Ávila", "06": "Badajoz", "07": "Illes Balears", "08": "Barcelona",
    "09": "Burgos", "10": "Cáceres", "11": "Cádiz", "12": "Castellón",
    "13": "Ciudad Real", "14": "Córdoba", "15": "A Coruña", "16": "Cuenca",
    "17": "Girona", "18": "Granada", "19": "Guadalajara", "20": "Gipuzkoa",
    "21": "Huelva", "22": "Huesca", "23": "Jaén", "24": "León",
    "25": "Lleida", "26": "La Rioja", "27": "Lugo", "28": "Madrid",
    "29": "Málaga", "30": "Murcia", "31": "Navarra", "32": "Ourense",
    "33": "Asturias", "34": "Palencia", "35": "Las Palmas",
    "36": "Pontevedra", "37": "Salamanca", "38": "Santa Cruz de Tenerife",
    "39": "Cantabria", "40": "Segovia", "41": "Sevilla",
    "42": "Soria", "43": "Tarragona", "44": "Teruel",
    "45": "Toledo", "46": "Valencia", "47": "Valladolid",
    "48": "Bizkaia", "49": "Zamora", "50": "Zaragoza",
    "51": "Ceuta", "52": "Melilla",
}


def cp_provincia_es(codigo_postal: "str | int") -> "str | None":
    """Return the Spanish province for a postal code.

    Accepts a full postal code (5 digits) or a 2-digit prefix. Inputs of at
    most 2 characters are normalized with zfill(2); longer inputs with
    zfill(5)[:2]. Returns None if the prefix is not in the dictionary.

    Args:
        codigo_postal: Spanish postal code as a string or integer.
            Can be a full postal code ("28001", 28001) or a prefix ("28", 28).

    Returns:
        Province name in Spanish, or None if the postal code is unknown.
    """
    cp = str(codigo_postal).strip()
    # Already a prefix of 2 digits (or fewer): pad to 2 and use it directly
    if len(cp) <= 2:
        prefix = cp.zfill(2)
    else:
        prefix = cp.zfill(5)[:2]
    return _CP_TO_PROVINCIA.get(prefix)
@@ -0,0 +1,54 @@
---
name: csv_to_parquet_duckdb
kind: function
lang: py
domain: core
version: "1.0.0"
purity: impure
signature: "def csv_to_parquet_duckdb(csv_path: str | Path, parquet_path: str | Path, column_casts: dict[str, str] | None = None, overwrite: bool = False) -> bool"
description: "Converts a CSV to Parquet using DuckDB read_csv_auto. If overwrite=False and the Parquet file already exists, does nothing. column_casts overrides the inferred type per column. Returns True if the file was written."
tags: [csv, parquet, duckdb, etl, core]
uses_functions: []
uses_types: []
returns: []
returns_optional: false
error_type: "error_go_core"
imports: [duckdb, pathlib]
params:
  - name: csv_path
    desc: "Path to the source CSV file."
  - name: parquet_path
    desc: "Destination path of the Parquet file. Intermediate directories are created if they do not exist."
  - name: column_casts
    desc: "Optional dict col→DuckDB type to override inferred types (e.g. {\"cp\": \"VARCHAR\"})."
  - name: overwrite
    desc: "If False (default), an existing Parquet file is not overwritten and the function returns False."
output: "True if the Parquet file was written, False if it was skipped because it already existed."
tested: true
tests:
  - "converts csv to parquet and duckdb can read it"
  - "overwrite=False does not overwrite an existing parquet"
test_file_path: "python/functions/core/tests/test_csv_to_parquet_duckdb.py"
file_path: "python/functions/core/csv_to_parquet_duckdb.py"
source_repo: "internal:footprint_aurgi"
source_license: "internal-aurgi"
source_file: "zonas_mapas_aurgi/scripts/prepare_parquet.py"
---
## Example
```python
from core.csv_to_parquet_duckdb import csv_to_parquet_duckdb

written = csv_to_parquet_duckdb(
    "data/centros.csv",
    "data/centros.parquet",
    column_casts={"cp": "VARCHAR"},
)
if written:
    print("Parquet written")
```
## Notes
Uses DuckDB read_csv_auto, which infers types automatically. For columns holding
postal codes or other numeric-looking fields that must remain strings, use column_casts.
Raises FileNotFoundError if csv_path does not exist. Other DuckDB errors propagate.
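When `column_casts` is given, the function needs a SELECT list that casts the named columns and passes the rest through. The sketch below shows one way such a list could be assembled as a plain string. `quote_ident` and `build_select` are hypothetical helpers, and double-quoting the identifiers is an assumption added here so that column names containing spaces survive; it is not a statement about what the module does.

```python
def quote_ident(name: str) -> str:
    """Double-quote a SQL identifier, doubling any embedded quotes."""
    return '"' + name.replace('"', '""') + '"'


def build_select(all_cols: list[str], column_casts: dict[str, str]) -> str:
    """Cast the columns named in column_casts; pass the others through."""
    parts = []
    for col in all_cols:
        ident = quote_ident(col)
        if col in column_casts:
            parts.append(f"CAST({ident} AS {column_casts[col]}) AS {ident}")
        else:
            parts.append(ident)
    return ", ".join(parts)


select_expr = build_select(["id", "cp", "nombre centro"], {"cp": "VARCHAR"})
print(select_expr)
```

Entries in `column_casts` that name a column not present in the CSV are simply ignored by this construction.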
@@ -0,0 +1,79 @@
"""Convert a CSV file to Parquet format using DuckDB."""

from __future__ import annotations

from pathlib import Path


def csv_to_parquet_duckdb(
    csv_path: "str | Path",
    parquet_path: "str | Path",
    column_casts: "dict[str, str] | None" = None,
    overwrite: bool = False,
) -> bool:
    """Convert a CSV file to Parquet using DuckDB's read_csv_auto.

    If overwrite is False and the parquet file already exists, the function
    does nothing and returns False. Otherwise uses DuckDB to read the CSV
    (with automatic type inference) and writes it as Parquet.

    Optional column_casts allow overriding inferred types for specific columns
    (e.g. {"codigo_postal": "VARCHAR"}).

    Args:
        csv_path: Path to the source CSV file.
        parquet_path: Path for the output Parquet file.
        column_casts: Optional dict mapping column names to DuckDB SQL types.
        overwrite: If False (default), skip conversion when parquet exists.

    Returns:
        True if the Parquet file was written, False if skipped.

    Raises:
        FileNotFoundError: If csv_path does not exist.
        Exception: Any DuckDB error (malformed CSV, type cast failure, etc.).
    """
    import duckdb

    csv_p = Path(csv_path)
    parquet_p = Path(parquet_path)

    if not csv_p.exists():
        raise FileNotFoundError(f"CSV not found: {csv_p}")

    if not overwrite and parquet_p.exists():
        return False

    parquet_p.parent.mkdir(parents=True, exist_ok=True)

    # Escape single quotes so the paths are safe inside SQL string literals.
    csv_lit = str(csv_p).replace("'", "''")
    parquet_lit = str(parquet_p).replace("'", "''")
    source = f"read_csv_auto('{csv_lit}', header=true)"

    con = duckdb.connect()
    try:
        if column_casts:
            # DESCRIBE lists every column so that uncast columns pass through.
            # Caveat: the CAST runs after read_csv_auto has inferred types, so a
            # value already parsed as a number may not keep its original text
            # form (e.g. leading zeros).
            rows = con.execute(f"DESCRIBE SELECT * FROM {source}").fetchall()
            select_parts = []
            for col in (row[0] for row in rows):
                # Double-quote identifiers so names with spaces or keywords work.
                ident = '"' + col.replace('"', '""') + '"'
                if col in column_casts:
                    select_parts.append(f"CAST({ident} AS {column_casts[col]}) AS {ident}")
                else:
                    select_parts.append(ident)
            select_expr = ", ".join(select_parts)
        else:
            select_expr = "*"
        con.execute(
            f"COPY (SELECT {select_expr} FROM {source}) "
            f"TO '{parquet_lit}' (FORMAT PARQUET)"
        )
    finally:
        con.close()
    return True
@@ -0,0 +1,67 @@
---
name: filter_relations_by_entity_types
kind: function
lang: py
domain: core
version: "1.0.0"
purity: pure
signature: "def filter_relations_by_entity_types(relations: dict, name_to_type: dict, allowed: dict) -> tuple[list, list]"
description: "Typed post-filtering of NER+RE relations: discards pairs whose entity types (head_type, tail_type) do not match those allowed for the relation kind. E.g. discards 'Madrid president_of <organization>' because Madrid is a location, not a person."
tags: [nlp, relations, filter, entity-types, graph, ner, re, post-process, gliner2]
uses_functions: []
uses_types: []
returns: []
returns_optional: false
error_type: ""
imports: []
params:
  - name: relations
    desc: "Dict {rel_type: [(head_name, tail_name), ...]}. Names must be non-empty strings. E.g. {'president_of': [('Carlos Torres', 'BBVA')]}"
  - name: name_to_type
    desc: "Dict {lowercased_name: entity_type}. Built from the result of extract_graph_gliner2 or aggregate_extraction_results. E.g. {'carlos torres': 'person', 'bbva': 'organization'}"
  - name: allowed
    desc: "Dict {rel_type: (allowed_head_types, allowed_tail_types)}. Each value is a tuple of two lists of strings. If a rel_type is not in allowed, all of its pairs are accepted. E.g. {'president_of': (['person'], ['organization'])}"
output: "Tuple (kept, dropped). Each element is a list of dicts {from, kind, to, head_type, tail_type}. kept holds the valid relations, dropped the rejected ones (useful for debugging)."
tested: true
tests:
  - "valid pairs are included in kept"
  - "pairs with incompatible types go to dropped"
  - "rel_type not in allowed is always accepted"
  - "entity not found in name_to_type goes to dropped"
test_file_path: "python/functions/core/tests/test_filter_relations_by_entity_types.py"
file_path: "python/functions/core/filter_relations_by_entity_types.py"
notes: |
  Validated in playground/server.py of the gliner_glirel_tuning analysis.
  The (head_type, tail_type) rule avoids common false positives in knowledge
  graphs such as "Madrid presides over Santander" (Location -> Organization).
  The dropped list makes it easy to inspect which relations were removed
  and why (head_type/tail_type of None indicates an unknown entity).
---
## Example
```python
from core.filter_relations_by_entity_types import filter_relations_by_entity_types
relations = {
    "president_of": [
        ("Carlos Torres", "BBVA"),   # person -> organization: OK
        ("Madrid", "Santander"),     # location -> organization: INVALID
    ],
    "unknown_rel": [("A", "B")],     # not in allowed: accepted
}
name_to_type = {
"carlos torres": "person",
"bbva": "organization",
"madrid": "location",
"santander": "organization",
"a": "person", "b": "person",
}
allowed = {
"president_of": (["person"], ["organization"]),
}
kept, dropped = filter_relations_by_entity_types(relations, name_to_type, allowed)
# kept: [{"from": "Carlos Torres", "kind": "president_of", "to": "BBVA", ...},
# {"from": "A", "kind": "unknown_rel", "to": "B", ...}]
# dropped: [{"from": "Madrid", "kind": "president_of", "to": "Santander", ...}]
```
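The `name_to_type` argument can be derived from the `entities` dict returned by `aggregate_extraction_results`, whose keys are `(type, name_lower)` tuples. A minimal sketch follows, assuming that when the same name appears under several types the first one seen wins. That tie-break and the helper name `build_name_to_type` are assumptions for illustration, not registry behaviour.

```python
def build_name_to_type(entities: dict) -> dict:
    """Map lowercased entity name -> type from (type, name_lower) keys."""
    name_to_type: dict[str, str] = {}
    for typ, name_lower in entities:
        # setdefault keeps the first type seen for a given name.
        name_to_type.setdefault(name_lower, typ)
    return name_to_type


# Stand-in for agg["entities"] as produced by aggregate_extraction_results.
entities = {
    ("person", "carlos torres"): {"type": "person", "name": "Carlos Torres", "count": 3},
    ("organization", "bbva"): {"type": "organization", "name": "BBVA", "count": 5},
    ("location", "bbva"): {"type": "location", "name": "BBVA", "count": 1},
}

name_to_type = build_name_to_type(entities)
print(name_to_type)
```

Picking the type with the highest count instead of the first one seen would be a reasonable alternative when a name is tagged inconsistently across chunks.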
@@ -0,0 +1,49 @@
"""Typed post-filtering of relations: discards pairs with incompatible entity types."""

from __future__ import annotations


def filter_relations_by_entity_types(
    relations: dict,
    name_to_type: dict,
    allowed: dict,
) -> tuple[list, list]:
    """Filter relations by allowed (head_type, tail_type) per relation kind.

    Validates that each (head, tail) pair in a relation has the expected entity
    types. Relations with unknown types (not in name_to_type) are dropped when
    the relation_type appears in allowed.

    Args:
        relations: Dict mapping rel_type -> list of (head_name, tail_name) tuples.
            E.g. {"president_of": [("Carlos Torres", "BBVA")], ...}
        name_to_type: Dict mapping lowercased entity name -> entity type.
            E.g. {"carlos torres": "person", "bbva": "organization"}
        allowed: Dict mapping rel_type -> (allowed_head_types, allowed_tail_types).
            Each value is a tuple/list of two lists of strings.
            If a rel_type is NOT in allowed, all its pairs are kept.
            E.g. {"president_of": (["person"], ["organization"])}

    Returns:
        Tuple (kept, dropped) where each is a list of dicts:
            {"from": str, "kind": str, "to": str, "head_type": str|None, "tail_type": str|None}
    """
    kept: list[dict] = []
    dropped: list[dict] = []
    for rt, pairs in relations.items():
        rule = allowed.get(rt)
        for h, t in pairs:
            # Unknown names resolve to None, which never matches an allowed type.
            ht = name_to_type.get(h.lower().strip())
            tt = name_to_type.get(t.lower().strip())
            row = {"from": h, "kind": rt, "to": t, "head_type": ht, "tail_type": tt}
            if rule is None:
                kept.append(row)
            else:
                head_ok, tail_ok = rule
                if ht in head_ok and tt in tail_ok:
                    kept.append(row)
                else:
                    dropped.append(row)
    return kept, dropped
@@ -0,0 +1,65 @@
---
id: infer_provincia_from_cp_py_core
name: infer_provincia_from_cp
kind: function
lang: py
domain: core
version: "1.0.0"
purity: pure
signature: "def infer_provincia_from_cp(rows: list[dict], cp_col: str = \"codigo_postal\", prov_col: str = \"provincia\") -> list[str | None]"
description: "Infers the correct province for each row based on the dominant postal-code prefix per province. Computes the top-2 postal-code prefixes per province; if the row's prefix is in that top-2, the province of the row's own postal code is used, otherwise the province of the dominant prefix. Pure stdlib, no pandas."
tags: [string, normalization, spain, geography, postal-code, inference]
uses_functions: [cp_provincia_es_py_core]
uses_types: []
returns: []
returns_optional: false
error_type: ""
imports: ["collections.Counter"]
example: |
  from infer_provincia_from_cp import infer_provincia_from_cp
  rows = [
      {"codigo_postal": "28001", "provincia": "Madrid"},
      {"codigo_postal": "28010", "provincia": "Madrid"},
      {"codigo_postal": "99999", "provincia": "Madrid"},
  ]
  infer_provincia_from_cp(rows)
  # ["Madrid", "Madrid", None]
  # With only two distinct prefixes, "99" is itself in the top-2, and "99" is not a known province prefix.
tested: true
tests: ["inference with dominant madrid postal code", "row with postal code outside top2 uses dominant", "row without province returns None"]
test_file_path: "python/functions/core/tests/test_infer_provincia_from_cp.py"
file_path: "python/functions/core/infer_provincia_from_cp.py"
params:
  - name: rows
    desc: "List of dicts. Each dict must have at least cp_col (postal code) and prov_col (declared province)."
  - name: cp_col
    desc: "Name of the postal-code key in each dict. Defaults to 'codigo_postal'."
  - name: prov_col
    desc: "Name of the province key in each dict. Defaults to 'provincia'."
output: "List of strings or None with the inferred province for each row, in the same order as rows. None when the row's province or postal code is None, when the province has insufficient data, or when the selected prefix is not a known province prefix."
source_repo: "internal:footprint_aurgi"
source_license: "internal-aurgi"
source_file: "aurgi_mapas/generar_pdf_reporte.py"
---
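## Example
```python
from infer_provincia_from_cp import infer_provincia_from_cp
rows = [
    {"codigo_postal": "28001", "provincia": "Madrid"},
    {"codigo_postal": "28010", "provincia": "Madrid"},
    {"codigo_postal": "28020", "provincia": "Madrid"},
    {"codigo_postal": "08001", "provincia": "Madrid"},
    {"codigo_postal": "08002", "provincia": "Madrid"},
    {"codigo_postal": "41001", "provincia": "Madrid"},  # Sevilla postal code, declared as Madrid
]
result = infer_provincia_from_cp(rows)
# ["Madrid", "Madrid", "Madrid", "Barcelona", "Barcelona", "Madrid"]
# Top-2 prefixes for "Madrid" are 28 (x3) and 08 (x2). Prefix 41 is outside the
# top-2, so the last row falls back to the dominant prefix (28 -> Madrid).
# Rows with prefix 08 are inside the top-2, so they resolve to their own province.
```
## Notes
Pure function. Uses `cp_provincia_es` from the same domain for the final lookup.
Adapted from `add_provincia_poliza_correcta` in `aurgi_mapas/generar_pdf_reporte.py`,
removing the pandas dependency and generalizing the columns into parameters.
The algorithm keeps the original semantics: top-2 prefixes per province, with a
fallback to the dominant prefix when the row's postal code is not in that top-2.
Note that a province with at most two distinct prefixes never triggers the fallback.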
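The top-2 selection can be traced by hand with a `Counter`. This is a stripped-down sketch for a single declared province: it returns prefixes rather than province names, and `resolve` is a hypothetical helper standing in for the per-row decision.

```python
from collections import Counter

# Postal codes of rows that all declare the same province.
cps = ["28001", "28010", "28020", "08001", "08002", "41001"]

prefix_counts = Counter(cp.zfill(5)[:2] for cp in cps)
top2 = [prefix for prefix, _ in prefix_counts.most_common(2)]
dominant = top2[0]


def resolve(cp: str) -> str:
    """Keep the row's own prefix if it is in the top-2, else use the dominant one."""
    prefix = cp.zfill(5)[:2]
    return prefix if prefix in top2 else dominant


print(top2, [resolve(cp) for cp in cps])
```

The resolved prefix would then be passed to `cp_provincia_es` to obtain the province name.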
@@ -0,0 +1,85 @@
"""Infer the correct province of each row from the dominant postal-code prefix per province."""

from __future__ import annotations

import os
import sys
from collections import Counter


def infer_provincia_from_cp(
    rows: list[dict],
    cp_col: str = "codigo_postal",
    prov_col: str = "provincia",
) -> list[str | None]:
    """Infer the correct province of each row using the dominant postal code per province.

    For each province in the dataset, computes the top-2 most frequent
    postal-code prefixes. If a row's prefix is in that top-2 for its province,
    the province derived from the row's own postal code is used; otherwise the
    province derived from the dominant (top-1) prefix of its province is used.

    Generic logic (pure stdlib, no pandas):
        1. Count prefix frequency per province.
        2. Select the top-2 prefixes per province.
        3. For each row: if its prefix is in the top-2 of its province,
           return cp_provincia_es(prefix); otherwise return cp_provincia_es(top1).
        4. If the row's province has no data, return None.

    Args:
        rows: List of dicts with at least the columns cp_col and prov_col.
        cp_col: Name of the postal-code column (default "codigo_postal").
        prov_col: Name of the original province column (default "provincia").

    Returns:
        List of strings (or None) with the inferred province for each row,
        in the same order as rows.
    """
    # Make the sibling module importable regardless of how this file is loaded.
    _here = os.path.dirname(os.path.abspath(__file__))
    if _here not in sys.path:
        sys.path.insert(0, _here)
    from cp_provincia_es import cp_provincia_es

    # Step 1: count frequency of (province, prefix)
    freq: dict[str, Counter] = {}
    for row in rows:
        prov = row.get(prov_col)
        cp_raw = row.get(cp_col)
        if prov is None or cp_raw is None:
            continue
        cp_str = str(cp_raw).strip().zfill(5)
        prefix = cp_str[:2]
        if prov not in freq:
            freq[prov] = Counter()
        freq[prov][prefix] += 1

    # Step 2: top-2 prefixes per province and dominant (top-1) prefix
    top2: dict[str, list[str]] = {}
    dominant: dict[str, str] = {}
    for prov, counter in freq.items():
        ordered = [p for p, _ in counter.most_common(2)]
        top2[prov] = ordered
        if ordered:
            dominant[prov] = ordered[0]

    # Step 3: resolve the province for each row
    result = []
    for row in rows:
        prov = row.get(prov_col)
        cp_raw = row.get(cp_col)
        if prov is None or cp_raw is None:
            result.append(None)
            continue
        cp_str = str(cp_raw).strip().zfill(5)
        prefix = cp_str[:2]
        if prov in top2 and prefix in top2[prov]:
            result.append(cp_provincia_es(prefix))
        elif prov in dominant:
            result.append(cp_provincia_es(dominant[prov]))
        else:
            result.append(None)
    return result
@@ -0,0 +1,49 @@
---
name: merge_entity_aliases
kind: function
lang: py
domain: core
version: "1.0.0"
purity: pure
signature: "def merge_entity_aliases(entity_names: list[str]) -> dict[str, str]"
description: "Simple coreference via normalization + substring: maps each entity name to its canonical form. 'BBVA' and 'bbva' -> same canonical. Short names are absorbed by longer names that contain them as a whole word (min 4 normalized chars)."
tags: [nlp, coreference, entity, alias, normalization, merge, graph, ner]
uses_functions: []
uses_types: []
returns: []
returns_optional: false
error_type: ""
imports: [re, collections.defaultdict]
params:
- name: entity_names
  desc: "List of entity names as extracted by the NER model. May contain duplicates, casing variations (BBVA/bbva) and long/short forms where the short form appears verbatim in the long one (Bilbao / Banco Bilbao Vizcaya Argentaria)."
output: "Dict {original_name: canonical_name}. Identity for names that are not aliases of anything. An empty list returns an empty dict."
tested: true
tests:
- "duplicados case-insensitive se mapean al mismo canonical"
- "nombre corto se absorbe en nombre largo que lo contiene"
- "siglas cortas menos de 4 chars no absorben falsamente"
- "nombres totalmente disjuntos se mapean a si mismos"
test_file_path: "python/functions/core/tests/test_merge_entity_aliases.py"
file_path: "python/functions/core/merge_entity_aliases.py"
notes: |
  Validated in playground/server.py of the gliner_glirel_tuning analysis.
  The 4-normalized-char threshold keeps short acronyms such as "US", "EU", "SA"
  from being merged into longer names that merely contain them as a word.
  The merge is asymmetric: the LONG name is the canonical one, not the short one.
  Useful as a post-processing step after aggregate_extraction_results, before
  building the final graph.
---
## Example
```python
from core.merge_entity_aliases import merge_entity_aliases

names = ["BBVA", "bbva", "Banco Bilbao Vizcaya Argentaria, S.A.", "Inditex"]
alias = merge_entity_aliases(names)
# alias["BBVA"] -> "bbva"  (same normalized form; the length tie is broken by string order)
# alias["bbva"] -> "bbva"
# alias["Banco Bilbao Vizcaya Argentaria, S.A."] -> "Banco Bilbao Vizcaya Argentaria, S.A."
#   ("bbva" does not occur as a word in the long name, so there is no substring merge)
# alias["Inditex"] -> "Inditex"  (identity, no alias)
```
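The whole-word rule behind the substring merge can be checked in isolation. The sketch below is not the registry function: `is_whole_word_alias` is a hypothetical helper that reproduces only the `\b`-delimited match and the 4-character minimum described in the notes, and it expects strings that are already normalized (lowercase, punctuation stripped).

```python
import re


def is_whole_word_alias(short_norm: str, long_norm: str, min_chars: int = 4) -> bool:
    """Return True if short_norm occurs as a whole word inside long_norm.

    Hypothetical helper for illustration; both inputs are assumed to be
    already normalized (lowercase, no punctuation).
    """
    if len(short_norm) < min_chars:
        return False  # too short: treated as an unreliable alias
    pattern = r"\b" + re.escape(short_norm) + r"\b"
    return re.search(pattern, long_norm) is not None


long_name = "banco bilbao vizcaya argentaria sa"
print(is_whole_word_alias("bilbao", long_name))  # True: whole word, 6 chars
print(is_whole_word_alias("bbva", long_name))    # False: an acronym is not a substring
print(is_whole_word_alias("sa", long_name))      # False: below the 4-char minimum
print(is_whole_word_alias("banc", long_name))    # False: only a prefix of "banco"
```

Under these assumptions, acronym resolution (for example "BBVA" to its full legal name) is outside what a substring rule can do; it would need a separate lookup.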
@@ -0,0 +1,62 @@
"""Coreference simple por normalizacion y substring para entidades nombradas."""
from __future__ import annotations

import re
from collections import defaultdict


def merge_entity_aliases(entity_names: list[str]) -> dict[str, str]:
    """Build alias map: original_name -> canonical_name.

    Two-pass algorithm:
      Step 1 - Normalize: lowercase + strip punctuation -> cluster by normalized form.
               Canonical per cluster = longest original string (ties broken by
               string order).
      Step 2 - Substring merge: short names absorbed by longer ones if short_name
               appears as whole word inside long_name (normalized) AND
               short_name has >= 4 normalized chars (prevents false positives
               such as 'US' being merged into any name that contains 'us').

    Args:
        entity_names: List of entity name strings (may have duplicates or
            different casings, e.g. ["BBVA", "bbva", "Banco Bilbao..."]).

    Returns:
        Dict mapping each input name to its final canonical form.
        Identity mapping for names that are not aliases of anything else.
    """
    if not entity_names:
        return {}

    def normalize(s: str) -> str:
        s = re.sub(r"[\.,;:\"'`()\[\]]", "", s.strip())
        s = re.sub(r"\s+", " ", s)
        return s.strip().lower()

    # Step 1: group by normalized form and pick the longest name as canonical
    norm_groups: dict[str, list[str]] = defaultdict(list)
    for n in entity_names:
        norm_groups[normalize(n)].append(n)

    canonical: dict[str, str] = {}
    for group in norm_groups.values():
        winner = max(group, key=lambda x: (len(x), x))
        for n in group:
            canonical[n] = winner

    # Step 2: substring merge over the canonicals (long absorbs short if short occurs inside long)
    canon_set = sorted(set(canonical.values()), key=len, reverse=True)
    absorbed: dict[str, str] = {}
    for long_n in canon_set:
        long_norm = normalize(long_n)
        for short_n in canon_set:
            if short_n == long_n or short_n in absorbed:
                continue
            short_norm = normalize(short_n)
            if len(short_norm) < 4:
                continue
            if re.search(r"\b" + re.escape(short_norm) + r"\b", long_norm):
                absorbed[short_n] = long_n

    return {orig: absorbed.get(canon, canon) for orig, canon in canonical.items()}
@@ -0,0 +1,52 @@
---
id: normalize_for_join_py_core
name: normalize_for_join
kind: function
lang: py
domain: core
version: "1.0.0"
purity: pure
signature: "def normalize_for_join(values: Iterable) -> list[str]"
description: "Normalizes strings for fuzzy joins: upper + strip diacritics (NFD) + remove anything outside [A-Z0-9 ] + collapse whitespace. Works with any iterable. None/NaN -> empty string."
tags: [string, normalization, join, fuzzy, spain]
uses_functions: []
uses_types: []
returns: []
returns_optional: false
error_type: ""
imports: ["re", "unicodedata", "typing.Iterable"]
example: |
from normalize_for_join import normalize_for_join
normalize_for_join(["Calle Mayor, 14", "avila", None])
# ["CALLE MAYOR 14", "AVILA", ""]
tested: true
tests: ["normalize con puntuacion y diacriticos y None"]
test_file_path: "python/functions/core/tests/test_normalize_for_join.py"
file_path: "python/functions/core/normalize_for_join.py"
params:
- name: values
  desc: "Iterable of strings or None/NaN to normalize. Accepts lists, generators, pd.Series, etc."
output: "List of normalized uppercase strings without diacritics. None and NaN become the empty string."
source_repo: "internal:footprint_aurgi"
source_license: "internal-aurgi"
source_file: "fuzzy_joins/arreglo_fuzzy.py"
---
## Example
```python
from normalize_for_join import normalize_for_join
normalize_for_join(["Calle Mayor, 14", "ávila", None])
# ["CALLE MAYOR 14", "AVILA", ""]
normalize_for_join(["José García S.L.", "BANCO DE ESPAÑA"])
# ["JOSE GARCIA SL", "BANCO DE ESPANA"]
```
## Notes
Pure function with no external dependencies (only `re` and `unicodedata` from the stdlib).
Adapted from `preparar_para_join` / `normalizar_string` in `fuzzy_joins/arreglo_fuzzy.py`,
dropping the pandas dependency so that it works with any iterable.
Useful as a preprocessing step before exact-equality joins on the normalized values.
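As an illustration of the exact-equality join mentioned in the notes, here is a minimal, self-contained sketch. `norm_key` is a hypothetical single-value stand-in that follows the same documented steps as `normalize_for_join`; the sample rows are invented for the example.

```python
import re
import unicodedata


def norm_key(value: str) -> str:
    """Hypothetical single-value helper mirroring the documented normalization steps."""
    text = unicodedata.normalize("NFD", str(value).upper())
    text = "".join(c for c in text if unicodedata.category(c) != "Mn")  # drop combining marks
    text = re.sub(r"[^A-Z0-9\s]", "", text)                             # keep letters, digits, spaces
    return re.sub(r"\s+", " ", text).strip()                            # collapse whitespace


# Invented sample data: the same companies written in two different ways.
left = [{"id": 1, "name": "José García S.L."}, {"id": 2, "name": "Banco de España"}]
right = [{"name": "JOSE GARCIA SL", "city": "Valencia"}, {"name": "banco  de españa", "city": "Madrid"}]

# Index one side by normalized key, then look the other side up by plain equality.
right_by_key = {norm_key(r["name"]): r["city"] for r in right}
joined = [
    (row["id"], right_by_key[norm_key(row["name"])])
    for row in left
    if norm_key(row["name"]) in right_by_key
]
print(joined)  # [(1, 'Valencia'), (2, 'Madrid')]
```

Because both sides reduce to the same key, a dictionary lookup is enough here; a fuzzy matcher would only be needed for the rows this step leaves unmatched.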
@@ -0,0 +1,44 @@
"""Normaliza strings para joins sin dependencias externas."""
import re
import unicodedata
from typing import Iterable


def normalize_for_join(values: Iterable) -> list:
    """Normalize strings for joins: upper + no diacritics + only [A-Z0-9 ] + collapsed whitespace.

    For each value: convert to string, uppercase, strip diacritics via NFD,
    remove every character that is not a letter, digit or whitespace,
    collapse repeated whitespace, trim. None or NaN become the empty string.
    Does not depend on pandas; works with any iterable of strings or None.

    Args:
        values: Iterable of strings or None. Can be a list, generator, Series, etc.

    Returns:
        List of normalized strings. None/NaN become "".
    """
    result = []
    for v in values:
        if v is None:
            result.append("")
            continue
        # Detect numpy/pandas NaN without importing them
        try:
            if v != v:  # NaN != NaN
                result.append("")
                continue
        except (TypeError, ValueError):
            pass
        texto = str(v).upper()
        texto = "".join(
            c for c in unicodedata.normalize("NFD", texto)
            if unicodedata.category(c) != "Mn"
        )
        texto = re.sub(r"[^A-Z0-9\s]", "", texto)
        texto = re.sub(r"\s+", " ", texto)
        texto = texto.strip()
        result.append(texto)
    return result
@@ -0,0 +1,43 @@
---
name: safe_read_csv_fallback
kind: function
lang: py
domain: core
version: "1.0.0"
purity: impure
signature: "safe_read_csv_fallback(path: str | Path) -> pd.DataFrame"
description: "Reads a CSV trying utf-8 first; if that fails with UnicodeDecodeError, retries with latin-1. Covers legacy exports from Excel and other Western European tools."
tags: [csv, encoding, pandas, io, core]
uses_functions: []
uses_types: []
returns: []
returns_optional: false
error_type: "error_go_core"
imports: [pandas, pathlib]
params:
- name: path
  desc: "Path to the CSV file to read. May be a str or a Path."
output: "pandas DataFrame with the CSV contents. The encoding is chosen automatically (utf-8, falling back to latin-1)."
tested: true
tests:
- "lee csv utf-8 correctamente"
- "lee csv latin-1 con fallback"
test_file_path: "python/functions/core/tests/test_safe_read_csv_fallback.py"
file_path: "python/functions/core/safe_read_csv_fallback.py"
source_repo: "internal:footprint_aurgi"
source_license: "internal-aurgi"
source_file: "ponderacion_isochronas/example/models/eda/utils.py"
---
## Example
```python
from core.safe_read_csv_fallback import safe_read_csv_fallback

df = safe_read_csv_fallback("datos_clientes.csv")
print(df.shape)
```
## Notes
The fallback triggers only on UnicodeDecodeError. Other errors (missing file,
malformed CSV) propagate as usual.
latin-1 covers most Excel exports in Spanish and other Western European languages.
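The same try/except pattern can be seen on raw bytes, without pandas. This is a stdlib-only sketch of the idea, not the registry function; `decode_with_fallback` is a hypothetical name introduced for the example.

```python
from __future__ import annotations


def decode_with_fallback(raw: bytes) -> tuple[str, str]:
    """Decode as utf-8, falling back to latin-1 (hypothetical helper)."""
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # latin-1 maps every byte value to a character, so this branch cannot fail.
        return raw.decode("latin-1"), "latin-1"


text = "\u00c1vila,\u00f1"  # "Ávila,ñ"
utf8_bytes = text.encode("utf-8")
latin1_bytes = text.encode("latin-1")

print(decode_with_fallback(utf8_bytes))    # ('Ávila,ñ', 'utf-8')
print(decode_with_fallback(latin1_bytes))  # ('Ávila,ñ', 'latin-1')
```

Since latin-1 accepts any byte sequence, the fallback never raises; a file written in a different single-byte encoding would therefore load without error but with wrong characters, which is worth keeping in mind when the source of the CSV is unknown.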
@@ -0,0 +1,34 @@
"""Read a CSV file with automatic encoding fallback from utf-8 to latin-1."""
from __future__ import annotations
from pathlib import Path
from typing import TYPE_CHECKING
if TYPE_CHECKING:
    import pandas as pd


def safe_read_csv_fallback(path: "str | Path") -> "pd.DataFrame":
    """Read a CSV file, falling back to latin-1 if utf-8 decoding fails.

    Tries pandas read_csv with the default utf-8 encoding first. On a
    UnicodeDecodeError retries with latin-1 (ISO-8859-1), which covers most
    Western European legacy CSV exports.

    Args:
        path: Path to the CSV file.

    Returns:
        A pandas DataFrame with the CSV contents.

    Raises:
        FileNotFoundError: If the file does not exist.
        Exception: Any other pandas read error (malformed CSV, etc.).
    """
    import pandas as pd

    p = Path(path)
    try:
        return pd.read_csv(p)
    except UnicodeDecodeError:
        return pd.read_csv(p, encoding="latin-1")
@@ -0,0 +1,59 @@
---
id: slugify_ascii_py_core
name: slugify_ascii
kind: function
lang: py
domain: core
version: "1.0.0"
purity: pure
signature: "def slugify_ascii(text: str, max_len: int = 80, default: str = \"centro\") -> str"
description: "Converts text to a lowercase ASCII slug without diacritics. Strip + lower + NFD + replace non-alphanumerics with a hyphen + collapse hyphens. Returns default if empty. Truncates to max_len."
tags: [string, normalization, slug, ascii, spain]
uses_functions: []
uses_types: []
returns: []
returns_optional: false
error_type: ""
imports: ["re", "unicodedata"]
example: |
from slugify_ascii import slugify_ascii
slugify_ascii("Calle Mayor, 14") # "calle-mayor-14"
slugify_ascii("Ávila") # "avila"
slugify_ascii("") # "centro"
slugify_ascii("a" * 100, max_len=10) # "aaaaaaaaaa"
tested: true
tests: ["slugify texto con puntuacion", "slugify diacriticos", "slugify cadena vacia retorna default", "slugify trunca a max_len"]
test_file_path: "python/functions/core/tests/test_slugify_ascii.py"
file_path: "python/functions/core/slugify_ascii.py"
params:
- name: text
  desc: "Input text to convert into a slug. None is treated as an empty string."
- name: max_len
  desc: "Maximum length of the resulting slug. Defaults to 80 characters."
- name: default
  desc: "Value returned if the resulting slug is empty. Defaults to 'centro'."
output: "Lowercase ASCII slug without diacritics, at most max_len characters. Returns default if the result is empty."
source_repo: "internal:footprint_aurgi"
source_license: "internal-aurgi"
source_file: "zonas_mapas_aurgi/scripts/generate_isochrones.py"
---
## Example
```python
from slugify_ascii import slugify_ascii
slugify_ascii("Calle Mayor, 14") # "calle-mayor-14"
slugify_ascii("Ávila") # "avila"
slugify_ascii("") # "centro"
slugify_ascii(None) # "centro"
slugify_ascii("a" * 100, max_len=10) # "aaaaaaaaaa"
slugify_ascii("---", default="sin-nombre") # "sin-nombre"
```
## Notes
Pure function with no external dependencies. Uses only `re` and `unicodedata` from the stdlib.
Adapted from `_slugify` in `zonas_mapas_aurgi/scripts/generate_isochrones.py` and
`ponderacion_isochronas/src/generar_isochronas_aurgi.py`, combining the
NFD normalization of the former with the truncation and default of the latter.
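To make the NFD step concrete, the sketch below shows what the decomposition produces and which characters it does not cover. `strip_marks` is a hypothetical helper written for this example; it isolates only the diacritic-removal step and is not the registry function.

```python
import re
import unicodedata


def strip_marks(text: str) -> str:
    """Drop combining marks (category Mn) after NFD decomposition (hypothetical helper)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if unicodedata.category(c) != "Mn")


# NFD splits an accented letter into a base letter plus a combining mark.
decomposed = unicodedata.normalize("NFD", "\u00e1vila")  # "ávila"
print([unicodedata.name(c) for c in decomposed[:2]])
# ['LATIN SMALL LETTER A', 'COMBINING ACUTE ACCENT']
print(strip_marks("\u00e1vila"))  # avila

# Letters with no canonical decomposition (such as the German sharp s) survive
# this step, so the non-alphanumeric rule turns them into a hyphen afterwards.
print(re.sub(r"[^a-z0-9]+", "-", strip_marks("stra\u00dfe")))  # stra-e
```

For Spanish input this distinction rarely matters, since accented vowels and "ñ" all decompose; it becomes relevant only if the function is reused on names from other languages.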
@@ -0,0 +1,33 @@
"""Convierte texto a slug ASCII lowercase sin diacriticos."""
import re
import unicodedata


def slugify_ascii(text: str, max_len: int = 80, default: str = "centro") -> str:
    """Convert text to a lowercase ASCII slug without diacritics.

    Applies: strip + lower + remove diacritics via NFD + replace
    non-alphanumerics with a hyphen + collapse hyphens + trim. If the result
    is empty, returns default. Truncates to max_len.

    Args:
        text: Input text. None is treated as empty.
        max_len: Maximum length of the resulting slug (default 80).
        default: Value returned if the slug ends up empty (default "centro").

    Returns:
        Lowercase ASCII slug, at most max_len characters.
    """
    if text is None:
        return default
    text = str(text).strip().lower()
    text = "".join(
        c for c in unicodedata.normalize("NFD", text)
        if unicodedata.category(c) != "Mn"
    )
    text = re.sub(r"[^a-z0-9]+", "-", text)
    text = text.strip("-")
    if not text:
        return default
    return text[:max_len]
@@ -0,0 +1,65 @@
"""Tests para aggregate_extraction_results."""
from __future__ import annotations
import os
import sys
from collections import Counter
sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", ".."))
from core.aggregate_extraction_results import aggregate_extraction_results
def test_lista_vacia_retorna_entities_y_relations_vacios():
"""lista vacia retorna entities vacio y relations vacio"""
result = aggregate_extraction_results([])
assert result["entities"] == {}
assert result["relations"] == Counter()
def test_resultado_unico_se_agrega_correctamente():
"""resultado unico se agrega correctamente"""
r = [
{
"entities": {"person": ["Pablo Isla"], "organization": ["Inditex"]},
"relation_extraction": {"ceo_of": [("Pablo Isla", "Inditex")]},
}
]
result = aggregate_extraction_results(r)
assert ("person", "pablo isla") in result["entities"]
assert ("organization", "inditex") in result["entities"]
assert result["entities"][("person", "pablo isla")]["count"] == 1
assert result["relations"][("Pablo Isla", "ceo_of", "Inditex")] == 1
def test_dos_resultados_con_solapamiento_acumulan_counts():
"""dos resultados con solapamiento acumulan counts"""
r = [
{
"entities": {"person": ["Pablo Isla"], "organization": ["Inditex"]},
"relation_extraction": {"ceo_of": [("Pablo Isla", "Inditex")]},
},
{
"entities": {"person": ["Pablo Isla"], "organization": ["Inditex"]},
"relation_extraction": {"ceo_of": [("Pablo Isla", "Inditex")]},
},
]
result = aggregate_extraction_results(r)
assert result["entities"][("person", "pablo isla")]["count"] == 2
assert result["relations"][("Pablo Isla", "ceo_of", "Inditex")] == 2
def test_entidades_deduplicen_case_insensitive():
    """entidades se deduplican case-insensitive"""
    r = [
        {"entities": {"person": ["Pablo Isla"]}, "relation_extraction": {}},
        {"entities": {"person": ["pablo isla"]}, "relation_extraction": {}},
    ]
    result = aggregate_extraction_results(r)
    # Both land on the same key (person, pablo isla)
    assert ("person", "pablo isla") in result["entities"]
    assert result["entities"][("person", "pablo isla")]["count"] == 2
    # Only one key for pablo isla
    pablo_keys = [k for k in result["entities"] if k[1] == "pablo isla"]
    assert len(pablo_keys) == 1
@@ -0,0 +1,72 @@
"""Tests para chunk_with_overlap."""
from __future__ import annotations
import os
import sys
sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", ".."))
from core.chunk_with_overlap import chunk_with_overlap
def test_texto_vacio_retorna_lista_vacia():
"""texto vacio retorna lista vacia"""
assert chunk_with_overlap("") == []
assert chunk_with_overlap(" ") == []
def test_una_frase_menor_que_max_chars_produce_1_chunk():
"""una frase menor que max_chars produce 1 chunk"""
text = "Esta es una frase corta."
chunks = chunk_with_overlap(text, max_chars=500, overlap_sentences=0)
assert len(chunks) == 1
assert chunks[0]["text"] == text
def test_multiples_frases_producen_N_chunks_con_overlap():
    """multiples frases producen N chunks con overlap"""
    # 3 sentences of 25 chars each do not fit in one 55-char chunk -> at least 2 chunks
    text = "Primera frase larga aqui. Segunda frase larga aqui. Tercera frase larga aqui."
    chunks = chunk_with_overlap(text, max_chars=55, overlap_sentences=1)
    assert len(chunks) >= 2
    # Every chunk has non-empty text
    for c in chunks:
        assert c["text"].strip()
        assert len(c["sentences"]) > 0
def test_frase_mas_larga_que_max_chars_no_bucle_infinito():
"""frase mas larga que max_chars se incluye sin bucle infinito"""
long_sentence = "A" * 2000 + "."
chunks = chunk_with_overlap(long_sentence, max_chars=100, overlap_sentences=0)
# Debe terminar (no bucle infinito) y producir exactamente 1 chunk
assert len(chunks) == 1
assert chunks[0]["text"] == long_sentence.strip()
def test_overlap_0_no_duplica_frases():
"""overlap=0 no duplica frases entre chunks"""
text = "Primera frase aqui completa. Segunda frase aqui completa. Tercera frase aqui completa."
chunks = chunk_with_overlap(text, max_chars=50, overlap_sentences=0)
# Recolectar todas las frases de todos los chunks
all_sents = [s for c in chunks for s in c["sentences"]]
# Con overlap=0 ninguna frase debe aparecer dos veces
assert len(all_sents) == len(set(all_sents))
def test_overlap_2_el_chunk_N_mas_1_empieza_con_ultimas_2_frases_del_N():
"""overlap=2 el chunk N+1 empieza con las 2 ultimas frases del chunk N"""
# 5 frases cortas, max_chars=80 para forzar al menos 2 chunks
text = (
"Frase uno aqui. "
"Frase dos aqui. "
"Frase tres aqui. "
"Frase cuatro aqui. "
"Frase cinco aqui."
)
chunks = chunk_with_overlap(text, max_chars=80, overlap_sentences=2)
if len(chunks) >= 2:
prev_tail = chunks[0]["sentences"][-2:]
next_head = chunks[1]["sentences"][:2]
assert prev_tail == next_head
@@ -0,0 +1,49 @@
"""Tests para clean_pdf_text."""
from __future__ import annotations
import os
import sys
sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", ".."))
from core.clean_pdf_text import clean_pdf_text
def test_string_vacio_retorna_vacio():
"""string vacio retorna vacio"""
assert clean_pdf_text("") == ""
def test_marca_de_pagina_1_20_se_elimina():
"""marca de pagina 1/20 se elimina"""
result = clean_pdf_text("1/20\nfoo bar")
assert "1/20" not in result
assert "foo bar" in result
def test_dehyphenation_exa_newline_mple():
"""dehyphenation exa-newline-mple -> example"""
result = clean_pdf_text("exa-\nmple")
assert result == "example"
def test_espacios_duplicados_se_colapsan():
    """espacios duplicados se colapsan"""
    result = clean_pdf_text("ab    cd")
    assert result == "ab cd"
def test_salto_de_linea_en_mitad_de_oracion_se_une_con_espacio():
"""salto de linea en mitad de oracion se une con espacio"""
result = clean_pdf_text("Pablo Isla es el\npresidente de Inditex")
assert result == "Pablo Isla es el presidente de Inditex"
def test_salto_de_linea_tras_punto_se_preserva():
"""salto de linea tras punto se preserva"""
result = clean_pdf_text("Primera oracion.\nSegunda oracion.")
# El salto tras punto debe quedar (no se une con espacio)
assert "\n" in result
assert "Primera oracion." in result
assert "Segunda oracion." in result
@@ -0,0 +1,44 @@
"""Tests para cp_provincia_es."""
import sys
import os
sys.path.insert(0, os.path.join(os.path.dirname(__file__), ".."))
from cp_provincia_es import cp_provincia_es
def test_cp_completo_retorna_provincia():
"""cp completo retorna provincia"""
assert cp_provincia_es("28001") == "Madrid"
def test_prefijo_2_digitos_retorna_provincia():
"""prefijo 2 digitos retorna provincia"""
assert cp_provincia_es("28") == "Madrid"
def test_primer_prefijo_01_retorna_alava():
"""primer prefijo 01 retorna Alava"""
assert cp_provincia_es("01") == "Álava"
def test_cp_desconocido_retorna_none():
"""cp desconocido retorna None"""
assert cp_provincia_es("99") is None
def test_cp_entero_completo():
assert cp_provincia_es(28001) == "Madrid"
def test_cp_ceuta():
assert cp_provincia_es("51001") == "Ceuta"
def test_cp_melilla():
assert cp_provincia_es("52") == "Melilla"
def test_cp_barcelona():
assert cp_provincia_es("08") == "Barcelona"
@@ -0,0 +1,54 @@
"""Tests para csv_to_parquet_duckdb."""
from __future__ import annotations
import tempfile
from pathlib import Path
import pytest
def test_convierte_csv_a_parquet_y_duckdb_puede_leerlo():
"""convierte csv a parquet y duckdb puede leerlo"""
import sys
sys.path.insert(0, str(Path(__file__).resolve().parents[2]))
from core.csv_to_parquet_duckdb import csv_to_parquet_duckdb
import duckdb
with tempfile.TemporaryDirectory() as tmpdir:
csv_path = Path(tmpdir) / "test.csv"
parquet_path = Path(tmpdir) / "test.parquet"
csv_path.write_text("nombre,lat,lon\nMadrid,40.4,-3.7\nBarcelona,41.3,2.1\n")
result = csv_to_parquet_duckdb(csv_path, parquet_path)
assert result is True
assert parquet_path.exists()
assert parquet_path.stat().st_size > 0
# Verify duckdb can read it back
con = duckdb.connect()
df = con.execute(f"SELECT * FROM read_parquet('{parquet_path}')").df()
con.close()
assert df.shape == (2, 3)
assert set(df.columns) == {"nombre", "lat", "lon"}
def test_overwrite_False_no_sobreescribe_parquet_existente():
"""overwrite=False no sobreescribe parquet existente"""
import sys
sys.path.insert(0, str(Path(__file__).resolve().parents[2]))
from core.csv_to_parquet_duckdb import csv_to_parquet_duckdb
with tempfile.TemporaryDirectory() as tmpdir:
csv_path = Path(tmpdir) / "test.csv"
parquet_path = Path(tmpdir) / "test.parquet"
csv_path.write_text("a,b\n1,2\n")
# Create existing parquet with known content
parquet_path.write_bytes(b"existing content")
original_size = parquet_path.stat().st_size
result = csv_to_parquet_duckdb(csv_path, parquet_path, overwrite=False)
assert result is False
# File must remain unchanged
assert parquet_path.stat().st_size == original_size
@@ -0,0 +1,60 @@
"""Tests para filter_relations_by_entity_types."""
from __future__ import annotations
import os
import sys
sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", ".."))
from core.filter_relations_by_entity_types import filter_relations_by_entity_types
NAME_TO_TYPE = {
"carlos torres": "person",
"bbva": "organization",
"madrid": "location",
"santander": "organization",
"ana": "person",
}
ALLOWED = {
"president_of": (["person"], ["organization"]),
"located_in": (["organization", "person"], ["location"]),
}
def test_pares_validos_se_incluyen_en_kept():
"""pares validos se incluyen en kept"""
relations = {"president_of": [("Carlos Torres", "BBVA")]}
kept, dropped = filter_relations_by_entity_types(relations, NAME_TO_TYPE, ALLOWED)
assert len(kept) == 1
assert kept[0]["from"] == "Carlos Torres"
assert kept[0]["to"] == "BBVA"
assert len(dropped) == 0
def test_pares_con_tipos_incompatibles_van_a_dropped():
"""pares con tipos incompatibles van a dropped"""
# Madrid es location, no person -> no puede presidir nada
relations = {"president_of": [("Madrid", "Santander")]}
kept, dropped = filter_relations_by_entity_types(relations, NAME_TO_TYPE, ALLOWED)
assert len(kept) == 0
assert len(dropped) == 1
assert dropped[0]["head_type"] == "location"
def test_rel_type_no_en_allowed_se_acepta_siempre():
"""rel_type no en allowed se acepta siempre"""
relations = {"unknown_rel": [("Carlos Torres", "Madrid")]}
kept, dropped = filter_relations_by_entity_types(relations, NAME_TO_TYPE, ALLOWED)
assert len(kept) == 1
assert len(dropped) == 0
def test_entidad_no_encontrada_en_name_to_type_va_a_dropped():
"""entidad no encontrada en name_to_type va a dropped"""
# "Desconocido" no esta en name_to_type -> head_type es None -> dropped
relations = {"president_of": [("Desconocido", "BBVA")]}
kept, dropped = filter_relations_by_entity_types(relations, NAME_TO_TYPE, ALLOWED)
assert len(dropped) == 1
assert dropped[0]["head_type"] is None
@@ -0,0 +1,78 @@
"""Tests para infer_provincia_from_cp."""
import sys
import os
sys.path.insert(0, os.path.join(os.path.dirname(__file__), ".."))
from infer_provincia_from_cp import infer_provincia_from_cp
def test_inferencia_con_cp_dominante_madrid():
"""inferencia con cp dominante madrid"""
rows = [
{"codigo_postal": "28001", "provincia": "Madrid"},
{"codigo_postal": "28010", "provincia": "Madrid"},
]
result = infer_provincia_from_cp(rows)
assert result == ["Madrid", "Madrid"]
def test_fila_con_cp_fuera_de_top2_usa_dominante():
    """fila con cp fuera de top2 usa dominante"""
    # Madrid has 3 distinct prefixes: 28 (x4), 29 (x2), 41 (x1).
    # The top 2 are 28 and 29, so 41 falls outside the top 2.
    rows = [
        {"codigo_postal": "28001", "provincia": "Madrid"},
        {"codigo_postal": "28002", "provincia": "Madrid"},
        {"codigo_postal": "28003", "provincia": "Madrid"},
        {"codigo_postal": "28004", "provincia": "Madrid"},
        {"codigo_postal": "29001", "provincia": "Madrid"},
        {"codigo_postal": "29002", "provincia": "Madrid"},
        {"codigo_postal": "41001", "provincia": "Madrid"},  # outlier: outside the top 2
    ]
    result = infer_provincia_from_cp(rows)
    # "41" is not in the top 2, so the row uses the dominant prefix (28 -> Madrid).
    assert result[6] == "Madrid"
def test_fila_sin_provincia_retorna_none():
"""fila sin provincia retorna None"""
rows = [
{"codigo_postal": "28001", "provincia": None},
]
result = infer_provincia_from_cp(rows)
assert result == [None]
def test_fila_sin_cp_retorna_none():
rows = [
{"codigo_postal": None, "provincia": "Madrid"},
]
result = infer_provincia_from_cp(rows)
assert result == [None]
def test_columnas_custom():
rows = [
{"cp": "28001", "prov": "Madrid"},
{"cp": "28010", "prov": "Madrid"},
]
result = infer_provincia_from_cp(rows, cp_col="cp", prov_col="prov")
assert result == ["Madrid", "Madrid"]
def test_multiples_provincias():
rows = [
{"codigo_postal": "28001", "provincia": "Madrid"},
{"codigo_postal": "08001", "provincia": "Barcelona"},
{"codigo_postal": "41001", "provincia": "Sevilla"},
]
result = infer_provincia_from_cp(rows)
assert result == ["Madrid", "Barcelona", "Sevilla"]
def test_lista_vacia():
assert infer_provincia_from_cp([]) == []
@@ -0,0 +1,58 @@
"""Tests para merge_entity_aliases."""
from __future__ import annotations
import os
import sys
sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", ".."))
from core.merge_entity_aliases import merge_entity_aliases
def test_duplicados_case_insensitive_se_mapean_al_mismo_canonical():
    """duplicados case-insensitive se mapean al mismo canonical"""
    result = merge_entity_aliases(["BBVA", "bbva", "Bbva"])
    # All three must point to the same canonical form.
    vals = set(result.values())
    assert len(vals) == 1
    # All variants have the same length, so the tie is broken by string order
    # (max over (len, str)), which selects the lowercase "bbva".
    canon = vals.pop()
    assert canon.lower() == "bbva"
def test_nombre_corto_se_absorbe_en_nombre_largo_que_lo_contiene():
"""nombre corto se absorbe en nombre largo que lo contiene"""
# El substring merge funciona cuando la forma corta APARECE LITERALMENTE
# en la forma larga (normalizada). Ejemplo: "bilbao" esta en "banco bilbao vizcaya argentaria"
names = ["Bilbao", "Banco Bilbao Vizcaya Argentaria"]
result = merge_entity_aliases(names)
# "bilbao" (6 chars) aparece como palabra en la forma larga normalizada
assert result["Bilbao"] == "Banco Bilbao Vizcaya Argentaria"
assert result["Banco Bilbao Vizcaya Argentaria"] == "Banco Bilbao Vizcaya Argentaria"
def test_siglas_cortas_menos_de_4_chars_no_absorben_falsamente():
    """siglas cortas menos de 4 chars no absorben falsamente"""
    # "US" (2 normalized chars) and "USA" (3) are below the 4-char minimum,
    # so neither may be merged into another name.
    names = ["US", "USA", "Standard Chartered"]
    result = merge_entity_aliases(names)
    assert result["US"] == "US"
    assert result["USA"] == "USA"
    # A name that contains no other entry as a whole word maps to itself.
    assert result["Standard Chartered"] == "Standard Chartered"
def test_nombres_totalmente_disjuntos_se_mapean_a_si_mismos():
"""nombres totalmente disjuntos se mapean a si mismos"""
names = ["Inditex", "Santander", "Telefonica"]
result = merge_entity_aliases(names)
assert result["Inditex"] == "Inditex"
assert result["Santander"] == "Santander"
assert result["Telefonica"] == "Telefonica"
def test_lista_vacia_retorna_dict_vacio():
"""lista vacia retorna dict vacio"""
assert merge_entity_aliases([]) == {}
@@ -0,0 +1,42 @@
"""Tests para normalize_for_join."""
import sys
import os
sys.path.insert(0, os.path.join(os.path.dirname(__file__), ".."))
from normalize_for_join import normalize_for_join
def test_normalize_con_puntuacion_y_diacriticos_y_none():
"""normalize con puntuacion y diacriticos y None"""
result = normalize_for_join(["Calle Mayor, 14", "ávila", None])
assert result == ["CALLE MAYOR 14", "AVILA", ""]
def test_normalize_lista_vacia():
assert normalize_for_join([]) == []
def test_normalize_upper():
assert normalize_for_join(["madrid"]) == ["MADRID"]
def test_normalize_elimina_simbolos():
assert normalize_for_join(["José García S.L."]) == ["JOSE GARCIA SL"]
def test_normalize_colapsa_espacios():
    assert normalize_for_join(["  hola   mundo  "]) == ["HOLA MUNDO"]
def test_normalize_nan_as_empty():
# NaN de float (float('nan'))
result = normalize_for_join([float("nan")])
assert result == [""]
def test_normalize_entero():
# Enteros se convierten a string
result = normalize_for_join([28001])
assert result == ["28001"]
@@ -0,0 +1,40 @@
"""Tests para safe_read_csv_fallback."""
from __future__ import annotations
import tempfile
from pathlib import Path
import pytest
def test_lee_csv_utf_8_correctamente():
"""lee csv utf-8 correctamente"""
import sys
sys.path.insert(0, str(Path(__file__).resolve().parents[2]))
from core.safe_read_csv_fallback import safe_read_csv_fallback
with tempfile.TemporaryDirectory() as tmpdir:
csv_path = Path(tmpdir) / "test_utf8.csv"
csv_path.write_text("nombre,valor\nAña,42\nBéta,99\n", encoding="utf-8")
df = safe_read_csv_fallback(csv_path)
assert df.shape == (2, 2)
assert list(df.columns) == ["nombre", "valor"]
assert df["nombre"].tolist() == ["Aña", "Béta"]
def test_lee_csv_latin_1_con_fallback():
"""lee csv latin-1 con fallback"""
import sys
sys.path.insert(0, str(Path(__file__).resolve().parents[2]))
from core.safe_read_csv_fallback import safe_read_csv_fallback
with tempfile.TemporaryDirectory() as tmpdir:
csv_path = Path(tmpdir) / "test_latin1.csv"
# Write latin-1 encoded CSV (ñ, é are 0xF1, 0xE9 in latin-1)
csv_path.write_bytes("nombre,valor\nMad\xf1id,10\nC\xe9ntro,20\n".encode("latin-1"))
df = safe_read_csv_fallback(csv_path)
assert df.shape == (2, 2)
assert "Mad" in df["nombre"].iloc[0]
assert df["valor"].tolist() == [10, 20]
@@ -0,0 +1,44 @@
"""Tests para slugify_ascii."""
import sys
import os
sys.path.insert(0, os.path.join(os.path.dirname(__file__), ".."))
from slugify_ascii import slugify_ascii
def test_slugify_texto_con_puntuacion():
"""slugify texto con puntuacion"""
assert slugify_ascii("Calle Mayor, 14") == "calle-mayor-14"
def test_slugify_diacriticos():
"""slugify diacriticos"""
assert slugify_ascii("Ávila") == "avila"
def test_slugify_cadena_vacia_retorna_default():
"""slugify cadena vacia retorna default"""
assert slugify_ascii("") == "centro"
def test_slugify_trunca_a_max_len():
"""slugify trunca a max_len"""
assert slugify_ascii("a" * 100, max_len=10) == "aaaaaaaaaa"
def test_slugify_none_retorna_default():
assert slugify_ascii(None) == "centro"
def test_slugify_default_custom():
assert slugify_ascii("---", default="sin-nombre") == "sin-nombre"
def test_slugify_solo_diacriticos_y_puntuacion():
assert slugify_ascii("ñoño") == "nono"
def test_slugify_numeros():
assert slugify_ascii("28001 Madrid") == "28001-madrid"
@@ -0,0 +1,70 @@
---
name: align_relations_to_entities
kind: function
lang: py
domain: datascience
version: "1.0.0"
purity: pure
signature: "def align_relations_to_entities(triplets: list[dict], entity_names: list[str]) -> list[dict]"
description: "Filters and aligns REBEL/mREBEL triplets to canonical entity names. For each triplet, resolves head and tail against entity_names by exact case-insensitive match or by substring match (the longest name wins). Drops triplets where either side fails to resolve or where head == tail."
tags: [rebel, mrebel, relation-extraction, nlp, align, knowledge-graph, datascience, python]
uses_functions: []
uses_types: []
returns: []
returns_optional: false
error_type: ""
imports: []
params:
  - name: triplets
    desc: "list of dicts produced by parse_rebel_output, with keys head, head_type, type, tail, tail_type"
  - name: entity_names
    desc: "canonical names of known entities to align against (e.g. [e.name for e in entities])"
output: "list of dicts with keys from (str), kind (str), to (str), head_type (str), tail_type (str). from/to are values taken verbatim from entity_names."
tested: true
tests:
- "match exacto case-insensitive resuelve correctamente"
- "substring entity en span del head"
- "substring span dentro del nombre de entidad"
- "gana el nombre de entidad mas largo en ambiguedad"
- "triplet sin match se descarta"
- "triplet con head == tail se descarta (self-loop)"
test_file_path: "python/functions/datascience/tests/test_align_relations_to_entities.py"
file_path: "python/functions/datascience/align_relations_to_entities.py"
notes: |
  Pure function. Composes with parse_rebel_output: the output of parse_rebel_output is passed
  in as triplets, and entity_names comes from [e.name for e in entities] in the extraction context.
  Matching strategy:
  1. Exact case-insensitive (O(1) via dict)
  2. Bidirectional substring: entity in span OR span in entity (iterates by length DESC)
  This covers cases such as mREBEL emitting "esta en Bilbao" when the entity is "Bilbao",
  or "Banco Santander S.A." when the canonical entity is "Banco Santander".
---
## Example
```python
from python.functions.datascience.parse_rebel_output import parse_rebel_output
from python.functions.datascience.align_relations_to_entities import align_relations_to_entities
decoded = "tp_XX<triplet> Pablo Isla <per> Inditex <org> employer"
triplets = parse_rebel_output(decoded)
entities = ["Pablo Isla", "Inditex", "A Coruna"]
aligned = align_relations_to_entities(triplets, entities)
# [{'from': 'Pablo Isla', 'kind': 'employer', 'to': 'Inditex',
# 'head_type': 'per', 'tail_type': 'org'}]
```
## Matching strategy

1. **Exact case-insensitive**: ``"inditex"`` == ``"Inditex"``.
2. **Bidirectional substring**: the entity is contained in the model span,
   or the model span is contained in the entity name.
   When several entities fit, the longest (most specific) one wins.

## Notes

- No fuzzy matching (Levenshtein, etc.): precision is preferred over recall
  in the knowledge-graph context.
- To improve recall: normalize entity_names before calling (remove acronyms and accents).
- Triplets with ``from == to`` (self-loops) are always dropped.
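
The two-step resolution can be illustrated in isolation. The following is a minimal sketch, assuming a standalone helper named `resolve_span` (a hypothetical name, not part of the registry) that mirrors the exact-then-longest-substring order:

```python
from typing import Optional


def resolve_span(span: str, entity_names: list) -> Optional[str]:
    """Resolve a model span to a canonical entity name, or None.

    Hypothetical standalone helper for illustration only.
    """
    if not span:
        return None
    span_lower = span.lower()
    # Step 1: exact case-insensitive lookup.
    exact = {name.lower(): name for name in entity_names}
    if span_lower in exact:
        return exact[span_lower]
    # Step 2: bidirectional substring, trying the longest names first.
    for name in sorted(entity_names, key=len, reverse=True):
        name_lower = name.lower()
        if name_lower in span_lower or span_lower in name_lower:
            return name
    return None


names = ["Bilbao", "Banco Santander", "Santander"]
print(resolve_span("inditex", ["Inditex"]))         # Inditex
print(resolve_span("esta en Bilbao", names))        # Bilbao
print(resolve_span("Banco Santander S.A.", names))  # Banco Santander
print(resolve_span("Madrid", names))                # None
```

Sorting by length before the substring pass is what makes "Banco Santander" win over "Santander" for the same span.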
@@ -0,0 +1,90 @@
"""Alinea triplets REBEL / mREBEL a nombres canonicos de entidades."""
from __future__ import annotations


def align_relations_to_entities(
    triplets: list[dict],
    entity_names: list[str],
) -> list[dict]:
    """Align REBEL triplets to a set of canonical entity names.

    For each triplet produced by ``parse_rebel_output``, tries to resolve the
    ``head`` and ``tail`` spans to a canonical entity name from ``entity_names``
    using the following strategy (in order):

    1. **Exact case-insensitive match**: ``"Inditex" == "inditex"``.
    2. **Substring match**: either the span contains an entity name, or an
       entity name contains the span. When multiple entity names match, the
       *longest* one wins (most specific).

    Triplets are dropped when:

    - Either ``head`` or ``tail`` cannot be resolved to an entity name.
    - The resolved ``from`` and ``to`` are the same name (self-loop).

    Args:
        triplets: List of dicts produced by ``parse_rebel_output``, each with
            keys ``head``, ``head_type``, ``type``, ``tail``, ``tail_type``.
        entity_names: Canonical entity names to match against. Typically
            ``[e.name for e in entities]``. Order does not matter; matching
            is case-insensitive.

    Returns:
        List of dicts with keys:
            ``from`` (str), ``kind`` (str), ``to`` (str),
            ``head_type`` (str), ``tail_type`` (str).
        ``from`` and ``to`` are values taken verbatim from ``entity_names``.
        Empty list if no triplet survives alignment.
    """
    if not triplets or not entity_names:
        return []

    # Pre-build lookup: lowercased -> original for O(1) exact lookup.
    lower_to_name: dict[str, str] = {n.lower(): n for n in entity_names}

    # Sort by length DESC for substring match (longest entity wins).
    names_by_len: list[str] = sorted(entity_names, key=len, reverse=True)

    def _resolve(span: str) -> str | None:
        """Return a canonical entity name for `span`, or None if no match."""
        if not span:
            return None
        span_lower = span.lower()

        # 1. Exact case-insensitive.
        if span_lower in lower_to_name:
            return lower_to_name[span_lower]

        # 2. Substring: longest entity that is contained in span, or whose
        #    name contains span (both directions), longest-wins.
        for name in names_by_len:
            name_lower = name.lower()
            if name_lower in span_lower or span_lower in name_lower:
                return name

        return None

    aligned: list[dict] = []
    for triplet in triplets:
        head_span = triplet.get("head", "")
        tail_span = triplet.get("tail", "")
        relation = triplet.get("type", "")

        from_name = _resolve(head_span)
        to_name = _resolve(tail_span)

        # Drop the triplet if either side is unresolved.
        if from_name is None or to_name is None:
            continue
        # Drop self-loops.
        if from_name == to_name:
            continue

        aligned.append(
            {
                "from": from_name,
                "kind": relation,
                "to": to_name,
                "head_type": triplet.get("head_type", ""),
                "tail_type": triplet.get("tail_type", ""),
            }
        )

    return aligned
@@ -0,0 +1,42 @@
---
id: alpha_shape_concave_hull_py_datascience
name: alpha_shape_concave_hull
kind: function
lang: py
domain: datascience
version: "1.0.0"
purity: pure
signature: "def alpha_shape_concave_hull(points: list[tuple[float, float]], alpha: float) -> shapely.geometry.base.BaseGeometry | None"
description: "Computes the alpha-shape (concave hull) of a 2-D point set via Delaunay triangulation, filtering triangles by circumradius <= alpha and merging survivors."
tags: [geometry, spatial, concave-hull, alpha-shape, shapely, delaunay]
uses_functions: []
uses_types: []
returns: []
returns_optional: true
error_type: ""
imports: [numpy, shapely]
example: |
from alpha_shape_concave_hull import alpha_shape_concave_hull
pts = [(0.0,0.0),(1.0,0.0),(1.0,1.0),(0.0,1.0)]
geom = alpha_shape_concave_hull(pts, alpha=10.0)
# shapely Polygon
tested: true
tests:
- "test_alpha_shape_square_large_alpha"
- "test_alpha_shape_too_few_points"
- "test_alpha_shape_very_small_alpha_returns_none"
- "test_alpha_shape_5_points_returns_geometry"
test_file_path: "python/functions/datascience/tests/test_alpha_shape_concave_hull.py"
file_path: "python/functions/datascience/alpha_shape_concave_hull.py"
params:
- name: points
desc: "List of (x, y) coordinate pairs. Requires at least 4 points."
- name: alpha
desc: "Alpha radius parameter. Triangles with circumradius > alpha are discarded. Smaller alpha = more concave hull."
output: "Shapely geometry (Polygon or MultiPolygon) of the alpha-shape, or None if fewer than 4 points or no triangles survive the alpha filter."
source_repo: "internal:footprint_aurgi"
source_license: "internal-aurgi"
source_file: "ponderacion_isochronas/src/recomendador_centros.py:408"
---
Requires shapely. If shapely is not installed, returns None silently. returns_optional=true because there may be no valid triangles.
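
The circumradius filter relies on the identity R = abc / (4 * Area). The following is a minimal sketch of that computation, assuming a standalone helper named `circumradius` (a hypothetical name); unlike the registry function, it reports degenerate triangles as `math.inf` instead of skipping them:

```python
import math


def circumradius(a, b, c):
    """Circumradius of triangle a-b-c given as (x, y) tuples.

    Hypothetical helper for illustration: R = (|ab| * |bc| * |ca|) / (4 * area).
    """
    ab = math.dist(a, b)
    bc = math.dist(b, c)
    ca = math.dist(c, a)
    # Triangle area from the 2-D cross product of two edge vectors.
    area = abs(
        (b[0] - a[0]) * (c[1] - a[1]) - (c[0] - a[0]) * (b[1] - a[1])
    ) / 2.0
    if area == 0:
        return math.inf  # collinear points: no finite circumcircle
    return (ab * bc * ca) / (4.0 * area)


# 3-4-5 right triangle: the hypotenuse is the diameter, so R = 2.5.
r = circumradius((0.0, 0.0), (3.0, 0.0), (0.0, 4.0))
print(r)         # 2.5
print(r <= 3.0)  # True: kept with alpha = 3.0
print(r <= 2.0)  # False: discarded with alpha = 2.0
```

A smaller `alpha` discards more of the large, thin triangles that span concavities, which is why the resulting hull becomes more concave.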
@@ -0,0 +1,67 @@
"""alpha_shape_concave_hull — Concave hull via Delaunay alpha-shape filtering."""
from __future__ import annotations


def alpha_shape_concave_hull(
    points: list[tuple[float, float]],
    alpha: float,
) -> "shapely.geometry.base.BaseGeometry | None":
    """Compute the alpha-shape (concave hull) of a 2-D point set.

    Performs a Delaunay triangulation over the input points, then keeps only
    those triangles whose circumscribed circle radius is <= alpha. The
    remaining triangles are merged via unary_union.

    Args:
        points: List of (x, y) coordinate pairs. Must have >= 4 elements.
        alpha: Alpha parameter controlling concavity (smaller = more concave).
            Triangles with circumradius > alpha are discarded.

    Returns:
        A shapely geometry (Polygon, MultiPolygon, or GeometryCollection)
        representing the alpha-shape, or None if len(points) < 4, if numpy
        or shapely cannot be imported, or if no triangles survive the alpha
        filter.
    """
    if len(points) < 4:
        return None

    try:
        import numpy as np
        from shapely.geometry import MultiPoint
        from shapely.ops import triangulate, unary_union
    except ImportError:
        return None

    mp = MultiPoint(points)
    triangles = triangulate(mp)

    valid = []
    for tri in triangles:
        coords = list(tri.exterior.coords)
        a_pt = np.array(coords[0])
        b_pt = np.array(coords[1])
        c_pt = np.array(coords[2])

        # Circumradius via the formula R = (abc) / (4 * Area)
        ab = np.linalg.norm(b_pt - a_pt)
        bc = np.linalg.norm(c_pt - b_pt)
        ca = np.linalg.norm(a_pt - c_pt)

        # Area via cross product
        area = abs(
            (b_pt[0] - a_pt[0]) * (c_pt[1] - a_pt[1])
            - (c_pt[0] - a_pt[0]) * (b_pt[1] - a_pt[1])
        ) / 2.0

        # Skip degenerate (collinear) triangles.
        if area == 0:
            continue

        circumradius = (ab * bc * ca) / (4.0 * area)
        if circumradius <= alpha:
            valid.append(tri)

    if not valid:
        return None

    return unary_union(valid)
@@ -0,0 +1,68 @@
---
id: best_central_tendency_py_datascience
name: best_central_tendency
kind: function
lang: py
domain: datascience
version: "1.0.0"
purity: pure
signature: "def best_central_tendency(values: list[float], dist_type: str) -> tuple[str, float]"
description: "Selects the most appropriate central tendency measure for a given distribution type. Returns (label, value) pair."
tags: [statistics, central-tendency, distribution, robust, mean, median]
uses_functions:
- geometric_mean_py_datascience
- trimmed_mean_py_datascience
uses_types: []
returns: []
returns_optional: false
error_type: ""
imports: [math, numpy]
example: |
from best_central_tendency import best_central_tendency
label, value = best_central_tendency([1, 2, 3, 4, 5], "normal-ish")
# ("mean", 3.0)
tested: true
tests:
- "test_best_central_tendency_normal_ish"
- "test_best_central_tendency_right_skewed"
- "test_best_central_tendency_left_skewed"
- "test_best_central_tendency_lognormal_ish"
- "test_best_central_tendency_heavy_tail"
- "test_best_central_tendency_empty"
- "test_best_central_tendency_default"
test_file_path: "python/functions/datascience/tests/test_best_central_tendency.py"
file_path: "python/functions/datascience/best_central_tendency.py"
params:
- name: values
desc: "List of numeric values to summarize."
- name: dist_type
desc: "Distribution type string, typically from detect_distribution_type. One of: normal-ish, lognormal-ish, heavy-tail, right-skewed, left-skewed, other, too_few_samples."
output: >
Tuple (label, value) where label is one of "mean", "median", "geometric_mean",
"trimmed_mean_5%", and value is the computed central tendency. Returns ("median", math.nan) for empty input.
source_repo: "internal:footprint_aurgi"
source_license: "internal-aurgi"
source_file: "aurgi_mapas/generar_pdf_reporte.py:196"
---
## Example
```python
from best_central_tendency import best_central_tendency
best_central_tendency([1, 2, 3, 4, 5], "normal-ish") # ("mean", 3.0)
best_central_tendency([1, 2, 3, 4, 5], "right-skewed") # ("median", 3.0)
best_central_tendency([1, 2, 4, 8], "lognormal-ish") # ("geometric_mean", ~2.83)
best_central_tendency([1, 2, 3, 100], "heavy-tail") # ("trimmed_mean_5%", ...)
```
## Mapping from distribution type to measure

| dist_type       | Measure          | Internal function     |
|-----------------|------------------|-----------------------|
| normal-ish      | mean             | numpy.mean            |
| lognormal-ish   | geometric_mean   | geometric_mean()      |
| heavy-tail      | trimmed_mean_5%  | trimmed_mean(0.05)    |
| right-skewed    | median           | numpy.median          |
| left-skewed     | median           | numpy.median          |
| other / default | median           | numpy.median          |
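
To see why the geometric mean suits lognormal-like data, the `~2.83` value quoted for `[1, 2, 4, 8]` can be reproduced by hand. The following is a minimal sketch, assuming a helper named `geometric_mean_sketch` (a hypothetical name; the registry's own `geometric_mean` may differ in how it handles edge cases) that averages in log space over the positive values:

```python
import math


def geometric_mean_sketch(values):
    """Geometric mean of the positive values, computed in log space.

    Hypothetical stand-in for illustration; returns NaN if no value is positive.
    """
    positives = [v for v in values if v > 0]
    if not positives:
        return math.nan
    return math.exp(sum(math.log(v) for v in positives) / len(positives))


# (1 * 2 * 4 * 8) ** (1/4) = 64 ** 0.25 = 2 ** 1.5
print(round(geometric_mean_sketch([1, 2, 4, 8]), 2))  # 2.83
```

The arithmetic mean of the same values is 3.75, pulled upward by the largest value; the geometric mean sits at the midpoint of the doubling sequence.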
@@ -0,0 +1,45 @@
"""best_central_tendency — Select the best central tendency measure for a distribution type."""
import math
import numpy as np

try:
    from .geometric_mean import geometric_mean
    from .trimmed_mean import trimmed_mean
except ImportError:
    from geometric_mean import geometric_mean  # type: ignore
    from trimmed_mean import trimmed_mean  # type: ignore


def best_central_tendency(values: list[float], dist_type: str) -> tuple[str, float]:
    """Return the most appropriate central tendency measure given the distribution type.

    Mapping:
        "normal-ish"    -> ("mean", arithmetic mean)
        "lognormal-ish" -> ("geometric_mean", geometric mean of positives)
        "heavy-tail"    -> ("trimmed_mean_5%", trimmed mean at 5%)
        "right-skewed"  -> ("median", median)
        "left-skewed"   -> ("median", median)
        default         -> ("median", median)

    Args:
        values: List of numeric values.
        dist_type: Distribution type string (from detect_distribution_type).

    Returns:
        Tuple (label: str, value: float). Value is math.nan if values is empty.
    """
    if not values:
        return ("median", math.nan)

    arr = np.array(values, dtype=float)

    if dist_type == "normal-ish":
        return ("mean", float(np.mean(arr)))
    elif dist_type == "lognormal-ish":
        return ("geometric_mean", geometric_mean(values))
    elif dist_type == "heavy-tail":
        return ("trimmed_mean_5%", trimmed_mean(values, trim=0.05))
    else:
        # right-skewed, left-skewed, other, too_few_samples, unknown
        return ("median", float(np.median(arr)))
@@ -0,0 +1,67 @@
---
id: detect_distribution_type_py_datascience
name: detect_distribution_type
kind: function
lang: py
domain: datascience
version: "1.0.0"
purity: pure
signature: "def detect_distribution_type(values: list[float]) -> dict"
description: "Classifies the shape of a numeric distribution using skewness, excess kurtosis, tail ratio and log-skewness. Returns a type label and raw stats."
tags: [statistics, distribution, classification, skewness, kurtosis]
uses_functions: []
uses_types: []
returns: []
returns_optional: false
error_type: ""
imports: [math, numpy]
example: |
from detect_distribution_type import detect_distribution_type
import numpy as np
result = detect_distribution_type(np.random.normal(0, 1, 200).tolist())
# {"type": "normal-ish", "stats": {"n": 200, "skew": ..., ...}}
tested: true
tests:
- "test_detect_too_few_samples"
- "test_detect_normal_ish"
- "test_detect_right_skewed"
- "test_detect_stats_keys"
- "test_detect_exactly_30"
test_file_path: "python/functions/datascience/tests/test_detect_distribution_type.py"
file_path: "python/functions/datascience/detect_distribution_type.py"
params:
- name: values
desc: "List of numeric values to classify. Minimum 30 for meaningful classification."
output: >
  Dict with "type" (str) and "stats" (dict). Type is one of: normal-ish,
  lognormal-ish, heavy-tail, right-skewed, left-skewed, other, too_few_samples.
  Stats contains: n, skew, kurtosis, tail_ratio, log_skew (only n when type is
  too_few_samples).
source_repo: "internal:footprint_aurgi"
source_license: "internal-aurgi"
source_file: "aurgi_mapas/generar_pdf_reporte.py:133"
---
## Example
```python
from detect_distribution_type import detect_distribution_type
import numpy as np
detect_distribution_type(np.random.normal(0, 1, 200).tolist())
# {"type": "normal-ish", "stats": {"n": 200, "skew": 0.03, ...}}
detect_distribution_type([1]*5)
# {"type": "too_few_samples", "stats": {"n": 5}}
```
## Classification logic

Rules are evaluated in order; the first match wins.

- n < 30 → too_few_samples
- excess kurtosis > 3 → heavy-tail
- |skew| <= 0.5 AND |kurt| <= 1 → normal-ish
- skew > 0.5 AND |log_skew| <= 0.5 AND tail_ratio > 2 → lognormal-ish
- skew > 0.5 → right-skewed
- skew < -0.5 → left-skewed
- default → other

tail_ratio = p99/p50 (NaN when the median is 0); log_skew is computed only when there are >= 30 positive values.
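
The rule order matters, since several conditions can hold at once. The following is a minimal sketch of the decision cascade on precomputed statistics, assuming a helper named `classify_shape` (a hypothetical name; the registry function computes the statistics itself and also handles the n < 30 case):

```python
import math


def classify_shape(skew, kurt, log_skew=math.nan, tail_ratio=math.nan):
    """Apply the classification rules in order to precomputed statistics.

    Hypothetical helper for illustration; `kurt` is excess kurtosis.
    """
    if kurt > 3.0:
        return "heavy-tail"
    if abs(skew) <= 0.5 and abs(kurt) <= 1.0:
        return "normal-ish"
    if (
        skew > 0.5
        and not math.isnan(log_skew)
        and abs(log_skew) <= 0.5
        and not math.isnan(tail_ratio)
        and tail_ratio > 2.0
    ):
        return "lognormal-ish"
    if skew > 0.5:
        return "right-skewed"
    if skew < -0.5:
        return "left-skewed"
    return "other"


print(classify_shape(0.1, 0.2))            # normal-ish
print(classify_shape(1.2, 2.0, 0.1, 3.5))  # lognormal-ish
print(classify_shape(1.2, 2.0))            # right-skewed (log_skew unavailable)
print(classify_shape(0.2, 2.0))            # other
```

Because the kurtosis check comes first, a strongly skewed sample with excess kurtosis above 3 is labelled heavy-tail rather than right-skewed or lognormal-ish.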
@@ -0,0 +1,89 @@
"""detect_distribution_type — Classify the distribution shape of a sample."""
import math
import numpy as np


def detect_distribution_type(values: list[float]) -> dict:
    """Classify the distribution shape of a numeric sample.

    Uses skewness, excess kurtosis, tail ratio (p99/p50), and log-skewness
    to assign one of: normal-ish, lognormal-ish, heavy-tail, right-skewed,
    left-skewed, other, or too_few_samples (n < 30).

    Args:
        values: List of numeric values.

    Returns:
        Dict with keys:
            "type" (str): distribution label.
            "stats" (dict): {"n", "skew", "kurtosis", "tail_ratio", "log_skew"},
                or just {"n"} when the type is too_few_samples.
    """
    n = len(values)
    if n < 30:
        return {"type": "too_few_samples", "stats": {"n": n}}

    arr = np.array(values, dtype=float)
    mean = float(np.mean(arr))
    std = float(np.std(arr, ddof=1))

    # Skewness
    if std == 0:
        skew = 0.0
    else:
        skew = float(np.mean(((arr - mean) / std) ** 3))

    # Excess kurtosis
    if std == 0:
        kurt = 0.0
    else:
        kurt = float(np.mean(((arr - mean) / std) ** 4)) - 3.0

    # Tail ratio: p99 / p50 (only meaningful when median != 0)
    p50 = float(np.percentile(arr, 50))
    p99 = float(np.percentile(arr, 99))
    tail_ratio = (p99 / p50) if p50 != 0 else math.nan

    # Log-skewness on positive values
    positives = arr[arr > 0]
    if len(positives) >= 30:
        log_arr = np.log(positives)
        log_mean = float(np.mean(log_arr))
        log_std = float(np.std(log_arr, ddof=1))
        if log_std == 0:
            log_skew = 0.0
        else:
            log_skew = float(np.mean(((log_arr - log_mean) / log_std) ** 3))
    else:
        log_skew = math.nan

    stats = {
        "n": n,
        "skew": skew,
        "kurtosis": kurt,
        "tail_ratio": tail_ratio,
        "log_skew": log_skew,
    }

    # Classification logic (first matching rule wins)
    if kurt > 3.0:
        dist_type = "heavy-tail"
    elif abs(skew) <= 0.5 and abs(kurt) <= 1.0:
        dist_type = "normal-ish"
    elif (
        skew > 0.5
        and not math.isnan(log_skew)
        and abs(log_skew) <= 0.5
        and not math.isnan(tail_ratio)
        and tail_ratio > 2.0
    ):
        dist_type = "lognormal-ish"
    elif skew > 0.5:
        dist_type = "right-skewed"
    elif skew < -0.5:
        dist_type = "left-skewed"
    else:
        dist_type = "other"

    return {"type": dist_type, "stats": stats}
@@ -0,0 +1,65 @@
---
name: extract_graph_gliner2
kind: function
lang: py
domain: datascience
version: "1.0.0"
purity: impure
signature: "def extract_graph_gliner2(text: str, entity_labels: list[str], relation_labels: list | dict, model: Any, threshold: float = 0.3, include_confidence: bool = False) -> dict"
description: "Extracts entities + relations in a single pass with GLiNER2. High-level wrapper: builds the schema, runs the extraction, and normalizes the result to a flat dict. Applies no post-filtering or coreference."
tags: [gliner2, ner, relation-extraction, nlp, extraction, graph, zero-shot, datascience, python, apache2]
uses_functions:
- gliner2_load_model_py_datascience
uses_types: []
returns: []
returns_optional: false
error_type: "error_go_core"
imports: [time, typing.Any]
params:
  - name: text
    desc: "Text to analyze. Recommended up to 1500 chars (pre-chunked with chunk_with_overlap). Longer texts degrade GLiNER2 recall."
  - name: entity_labels
    desc: "List of strings with the entity types in lowercase snake_case. E.g. ['person', 'organization', 'location']. snake_case labels improve recall according to notebook 08."
  - name: relation_labels
    desc: "List of strings or dict {label: description} with the relation types. E.g. ['works_at', 'ceo_of'] or {'works_at': 'person works at an organization'}."
  - name: model
    desc: "GLiNER2 instance loaded with gliner2_load_model. Injected by the caller (not loaded here)."
  - name: threshold
    desc: "Confidence threshold between 0 and 1. 0.3 was validated empirically in notebook 04 (gliner_glirel_tuning). Lower values = more recall, more noise."
  - name: include_confidence
    desc: "If True, GLiNER2 returns internal scores per entity and relation. False by default for cleaner output."
output: "Dict with three fields: 'entities' -> {type: [name, ...]}, 'relation_extraction' -> {rel_type: [(head, tail), ...]}, 'elapsed_s' -> float. Compatible with aggregate_extraction_results."
tested: true
tests:
- "output tiene claves entities relation_extraction elapsed_s"
- "stub model retorna shape correcto"
test_file_path: "python/functions/datascience/tests/test_extract_graph_gliner2.py"
file_path: "python/functions/datascience/extract_graph_gliner2.py"
notes: |
  LICENSE: GLiNER2 (fastino/gliner2-large-v1) is Apache 2.0, so commercial use is OK.
  impure: invokes model inference (computational side effect + variable time).
  The model is injected externally to allow caching and reuse across calls.
  For long texts, run chunk_with_overlap first and call this function per chunk,
  then aggregate with aggregate_extraction_results.
---
## Example
```python
from datascience.gliner2_load_model import gliner2_load_model
from datascience.extract_graph_gliner2 import extract_graph_gliner2
model = gliner2_load_model(device="auto")
result = extract_graph_gliner2(
text="Carlos Torres es presidente de BBVA, con sede en Bilbao.",
entity_labels=["person", "organization", "location"],
relation_labels=["president_of", "headquartered_in"],
model=model,
threshold=0.3,
)
# result["entities"] -> {"person": ["Carlos Torres"], ...}
# result["relation_extraction"]-> {"president_of": [("Carlos Torres", "BBVA")]}
# result["elapsed_s"] -> 0.234
```
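
Because the model is injected, the wrapper can be exercised without loading GLiNER2. The following is a minimal sketch, assuming a hand-written stub (`StubModel` and `StubSchema` are hypothetical names) that imitates only the three calls the wrapper makes, plus a local copy of the wrapper logic named `extract_graph_sketch`:

```python
import time
from typing import Any


class StubSchema:
    """Hypothetical stand-in for the schema builder used by the wrapper."""

    def entities(self, labels):
        self.entity_labels = list(labels)
        return self

    def relations(self, labels):
        self.relation_labels = list(labels)
        return self


class StubModel:
    """Hypothetical stand-in exposing create_schema() and extract()."""

    def create_schema(self):
        return StubSchema()

    def extract(self, text, schema, threshold, include_confidence):
        # Canned answer; a real model would run inference here.
        return {
            "entities": {"person": ["Carlos Torres"], "organization": ["BBVA"]},
            "relation_extraction": {"president_of": [("Carlos Torres", "BBVA")]},
        }


def extract_graph_sketch(text: str, entity_labels, relation_labels, model: Any,
                         threshold: float = 0.3, include_confidence: bool = False) -> dict:
    """Local copy of the wrapper logic, for illustration only."""
    schema = model.create_schema().entities(entity_labels).relations(relation_labels)
    t0 = time.time()
    raw = model.extract(text, schema=schema, threshold=threshold,
                        include_confidence=include_confidence)
    return {
        "entities": raw.get("entities", {}),
        "relation_extraction": raw.get("relation_extraction", {}),
        "elapsed_s": round(time.time() - t0, 3),
    }


result = extract_graph_sketch(
    "Carlos Torres es presidente de BBVA.",
    ["person", "organization"],
    ["president_of"],
    StubModel(),
)
print(sorted(result))  # ['elapsed_s', 'entities', 'relation_extraction']
```

This is the same shape check that the listed test "stub model retorna shape correcto" describes, under the assumption that the real model returns a dict with those two keys.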
@@ -0,0 +1,60 @@
"""Extraccion de entidades + relaciones en una pasada con GLiNER2."""
from __future__ import annotations
import time
from typing import Any


def extract_graph_gliner2(
    text: str,
    entity_labels: list[str],
    relation_labels: list | dict,
    model: Any,
    threshold: float = 0.3,
    include_confidence: bool = False,
) -> dict:
    """Extract entities + relations using GLiNER2 with one schema pass.

    High-level wrapper over the GLiNER2 API. Builds the schema, runs the
    extraction, and normalizes the result to a flat dict.

    Does NOT apply post-filtering or coreference; the caller does that with
    filter_relations_by_entity_types and merge_entity_aliases.

    Args:
        text: Text to analyze. Recommended: <= 1500 chars (pre-chunked).
        entity_labels: List of strings with the entity types.
            E.g. ["person", "organization", "location"]
        relation_labels: List of strings or dict {label: description} with
            the relation types.
            E.g. ["works_at", "ceo_of"] or
            {"works_at": "person works at organization"}
        model: GLiNER2 instance loaded with gliner2_load_model.
        threshold: Confidence threshold (0-1). 0.3 is the value validated
            empirically in the analysis notebooks.
        include_confidence: If True, the model returns scores per entity
            and relation (GLiNER2 internal format).

    Returns:
        {
            "entities": {type: [name, ...]},
            "relation_extraction": {rel_type: [(head, tail), ...]},
            "elapsed_s": float
        }
    """
    schema = model.create_schema().entities(entity_labels).relations(relation_labels)

    t0 = time.time()
    r = model.extract(
        text,
        schema=schema,
        threshold=threshold,
        include_confidence=include_confidence,
    )
    elapsed = round(time.time() - t0, 3)

    return {
        "entities": r.get("entities", {}),
        "relation_extraction": r.get("relation_extraction", {}),
        "elapsed_s": elapsed,
    }
@@ -0,0 +1,114 @@
---
name: extract_relations_mrebel
kind: function
lang: py
domain: datascience
version: "1.0.0"
purity: impure
signature: "def extract_relations_mrebel(text: str, entities: list[EntityCandidate], tokenizer: Any, model: Any, src_lang: str = 'es_XX', sentence_split_re: str = r'(?<=[.!?])\\s+', min_sentence_chars: int = 20, num_beams: int = 4, max_length: int = 256) -> list[RelationCandidate]"
description: "Extracts relations between entities using mREBEL (multilingual seq2seq). Splits the text into sentences, generates triplets with mREBEL, parses them with parse_rebel_output and aligns them to known entities with align_relations_to_entities. Drop-in replacement for extract_relations_glirel in benchmarks."
tags: [mrebel, relation-extraction, nlp, extract, knowledge-graph, seq2seq, multilingual, datascience, python]
uses_functions:
- mrebel_load_model_py_datascience
- parse_rebel_output_py_datascience
- align_relations_to_entities_py_datascience
uses_types:
- entity_candidate_py_datascience
- relation_candidate_py_datascience
returns:
- relation_candidate_py_datascience
returns_optional: false
error_type: "error_go_core"
imports: [re]
params:
  - name: text
    desc: "source text in the language of src_lang (the same chunk used to extract the entities)"
  - name: entities
    desc: "entities already extracted from this text (from extract_entities_gliner or similar). Only relations between entities in this list are kept."
  - name: tokenizer
    desc: "mREBEL tokenizer loaded with mrebel_load_model, injected by the caller to avoid reloading in batch runs"
  - name: model
    desc: "mREBEL model loaded with mrebel_load_model, injected by the caller"
  - name: src_lang
    desc: "informational: the language the tokenizer was loaded with (e.g. 'es_XX'). Not used at runtime."
  - name: sentence_split_re
    desc: "regex pattern used to split the text into sentences. Default: split after [.!?] followed by whitespace."
  - name: min_sentence_chars
    desc: "minimum length in characters for a sentence to be processed. Shorter fragments are skipped (default 20)."
  - name: num_beams
    desc: "beam search width for model.generate (default 4)"
  - name: max_length
    desc: "maximum length in tokens for tokenization and generation (default 256)"
output: "list of RelationCandidate with confidence=1.0 (mREBEL produces no continuous score). from_name/to_name always match entities from the input."
tested: true
tests:
- "flujo completo con stub produce RelationCandidate correctos"
- "menos de 2 entidades retorna vacio"
- "texto vacio retorna vacio"
- "triplets no alineables se descartan"
test_file_path: "python/functions/datascience/tests/test_extract_relations_mrebel.py"
file_path: "python/functions/datascience/extract_relations_mrebel.py"
notes: |
  impure: model.generate is computational I/O with external state (the model weights).
  mREBEL does not produce a continuous confidence score; it returns the triplets that the model
  decoded as the most likely output. confidence=1.0 is a marker meaning "the model emitted it",
  not a calibrated probability. To filter by quality, use the number of beams
  as a proxy or combine with a downstream classifier.
  Drop-in replacement for extract_relations_glirel in benchmarks:
  - Same input interface (text, entities, model)
  - Same output (list[RelationCandidate])
  - Difference: mREBEL does not need relation_types (it generates relations freely),
    glirel needs relation_types (discriminative zero-shot).
  Model LICENSE: Babelscape/mrebel-large is CC BY-NC-SA 4.0 (non-commercial).
  See mrebel_load_model for more details.
---
## Example
```python
from python.functions.datascience.mrebel_load_model import mrebel_load_model
from python.functions.datascience.extract_relations_mrebel import extract_relations_mrebel
from python.types.datascience.entity_candidate import EntityCandidate
tokenizer, model = mrebel_load_model(src_lang="es_XX")
text = "Pablo Isla es el presidente de Inditex. La empresa tiene sede en Arteixo, A Coruna."
entities = [
EntityCandidate(name="Pablo Isla", type_label="PER", confidence=0.95),
EntityCandidate(name="Inditex", type_label="ORG", confidence=0.92),
EntityCandidate(name="Arteixo", type_label="LOC", confidence=0.88),
EntityCandidate(name="A Coruna", type_label="LOC", confidence=0.85),
]
relations = extract_relations_mrebel(
text=text,
entities=entities,
tokenizer=tokenizer,
model=model,
)
# [RelationCandidate(from_name='Pablo Isla', to_name='Inditex',
# relation_type='employer', confidence=1.0, ...), ...]
```
## Comparison with extract_relations_glirel

| | mREBEL | GLiREL |
|---|---|---|
| Type | Generative seq2seq | Discriminative zero-shot |
| relation_types | No (generates freely) | Yes (required) |
| Confidence | Fixed 1.0 (not calibrated) | 0.0-1.0 (calibrated) |
| Languages | 30+ multilingual | Mainly EN |
| Model license | CC BY-NC-SA (non-commercial) | Apache 2.0 |
| Speed | Slower (seq2seq) | Faster (classifier) |

## Design notes

- `parse_rebel_output` and `align_relations_to_entities` are pure functions
  composed by this impure function, so they can be tested independently.
- Tokenization/generation errors are caught per sentence and skipped (the
  sentence is ignored, the rest of the text is processed).
- `source_chunk_index` tracks the index of the source sentence, not of the
  text chunk; useful for debugging.
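
The sentence-splitting step that precedes generation can be tried on its own. The following is a minimal sketch, assuming a helper named `split_sentences` (a hypothetical name) that applies the default `sentence_split_re` and `min_sentence_chars` values from the signature:

```python
import re


def split_sentences(text, sentence_split_re=r"(?<=[.!?])\s+", min_sentence_chars=20):
    """Split on whitespace that follows . ! or ?, then drop short fragments.

    Hypothetical helper for illustration only.
    """
    parts = re.split(sentence_split_re, text.strip())
    return [p.strip() for p in parts if len(p.strip()) >= min_sentence_chars]


text = (
    "Pablo Isla es el presidente de Inditex. Si. "
    "La empresa tiene sede en Arteixo, A Coruna."
)
for sentence in split_sentences(text):
    print(sentence)
# Pablo Isla es el presidente de Inditex.
# La empresa tiene sede en Arteixo, A Coruna.
```

The lookbehind keeps the terminating punctuation attached to each sentence, and the length filter removes fragments such as "Si." that are too short to carry a relation.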
@@ -0,0 +1,136 @@
"""Extrae relaciones entre entidades usando mREBEL (seq2seq multilingue)."""
from __future__ import annotations
import os
import re
import sys
from typing import Any
sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "..", ".."))
from python.types.datascience.entity_candidate import EntityCandidate
from python.types.datascience.relation_candidate import RelationCandidate
from python.functions.datascience.parse_rebel_output import parse_rebel_output
from python.functions.datascience.align_relations_to_entities import align_relations_to_entities
def extract_relations_mrebel(
text: str,
entities: list[EntityCandidate],
tokenizer: Any,
model: Any,
src_lang: str = "es_XX",
sentence_split_re: str = r"(?<=[.!?])\s+",
min_sentence_chars: int = 20,
num_beams: int = 4,
max_length: int = 256,
) -> list[RelationCandidate]:
"""Extract relations from text using mREBEL, sentence by sentence.
Orchestrates the full pipeline:
1. Split ``text`` into sentences using ``sentence_split_re``.
2. Filter out sentences shorter than ``min_sentence_chars``.
3. For each sentence: tokenize → generate → decode (with special tokens)
→ ``parse_rebel_output`` → accumulate raw triplets.
4. Collect all entity names from ``entities``, sorted DESC by length
(so longer names win in substring matching).
5. Call ``align_relations_to_entities`` to resolve head/tail spans to
canonical entity names and drop unresolved / self-loop triplets.
6. Wrap each aligned triplet in a ``RelationCandidate``.
mREBEL does not produce a continuous confidence score — ``confidence``
is set to ``1.0`` as a marker meaning "the model emitted this triplet".
Args:
text: Source text (same language as ``src_lang``).
entities: Entities already extracted from this text (e.g. via
``extract_entities_gliner``). Used to filter triplets to
known entities only.
tokenizer: mREBEL tokenizer loaded with ``mrebel_load_model``.
model: mREBEL model loaded with ``mrebel_load_model``.
src_lang: Informational — the language the tokenizer was loaded with.
Not used at inference time (mBART lang tokens are set at load time).
sentence_split_re: Regex pattern for sentence splitting. Default splits
on whitespace that follows ``.``, ``!`` or ``?``.
min_sentence_chars: Minimum character length for a sentence to be
processed. Shorter fragments are skipped.
num_beams: Beam search width for ``model.generate``. Default 4.
max_length: Max token length for both tokenization and generation.
Returns:
List of ``RelationCandidate`` where ``from_name`` and ``to_name``
always correspond to names in ``entities``. Empty list if no aligned
triplets are found or ``entities`` has fewer than 2 items.
"""
if len(entities) < 2:
return []
if not text or not text.strip():
return []
split_re = re.compile(sentence_split_re)
sentences = split_re.split(text.strip())
sentences = [s.strip() for s in sentences if s.strip() and len(s.strip()) >= min_sentence_chars]
if not sentences:
return []
# Step 1-3: gather raw triplets from all sentences.
raw_triplets: list[dict] = []
for idx, sentence in enumerate(sentences):
try:
inputs = tokenizer(
sentence,
return_tensors="pt",
max_length=max_length,
truncation=True,
)
generated = model.generate(
**inputs,
num_beams=num_beams,
length_penalty=1.0,
max_length=max_length,
)
decoded = tokenizer.decode(generated[0], skip_special_tokens=False)
except Exception:
# Skip sentences that fail (e.g. tokenizer errors on special chars).
continue
sentence_triplets = parse_rebel_output(decoded)
# Tag each triplet with the sentence index for source_chunk_index.
for t in sentence_triplets:
t["_sentence_idx"] = idx
raw_triplets.extend(sentence_triplets)
if not raw_triplets:
return []
# Step 4-5: align to entity names (sorted DESC by length for substring match).
entity_names = sorted([e.name for e in entities if e.name], key=len, reverse=True)
aligned = align_relations_to_entities(raw_triplets, entity_names)
# Step 6: wrap in RelationCandidate.
candidates: list[RelationCandidate] = []
for item in aligned:
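# Recover sentence_idx heuristically: take the first raw triplet with a non-empty
# head and the same relation type. Head/tail text is not compared, so a relation
# type that appears in several sentences resolves to the earliest one.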
sentence_idx = -1
for raw in raw_triplets:
if (
raw.get("head", "").strip() and
raw.get("type", "").strip() == item["kind"]
):
sentence_idx = raw.get("_sentence_idx", -1)
break
candidates.append(
RelationCandidate(
from_name=item["from"],
to_name=item["to"],
relation_type=item["kind"],
description="",
confidence=1.0,
source_chunk_index=sentence_idx,
)
)
return candidates
@@ -0,0 +1,65 @@
---
name: extract_triples_spacy_es
kind: function
lang: py
domain: datascience
version: "1.0.0"
purity: impure
signature: "def extract_triples_spacy_es(text: str, nlp: Any) -> dict"
description: "Schema-less OpenIE extraction for Spanish using spaCy dependency rules. Detects subject-verb-object patterns and uses the verb lemma as the relation (no fixed vocabulary). Also extracts NER entities (PER, ORG, LOC, MISC)."
tags: [spacy, openie, nlp, spanish, triples, dependency, ner, schema-less, datascience, python, mit]
uses_functions:
- spacy_es_load_model_py_datascience
uses_types: []
returns: []
returns_optional: false
error_type: "error_go_core"
imports: [time, typing.Any]
params:
  - name: text
    desc: "Spanish text to analyze. Works best with complete sentences; several sentences per call are supported."
  - name: nlp
    desc: "spaCy Language instance loaded with spacy_es_load_model. Must include the parser, POS tagger and NER (es_core_news_md or lg)."
output: "Dict with 'text' (the input), 'triples' (list of {subject, relation, object, verb_form, object_dep, prep}), 'entities' (list of {text, label}) and 'elapsed_s'. The relation is the verb lemma, optionally suffixed with a preposition (_en, _con) or a passive marker ([pass])."
tested: true
tests:
- "oracion simple produce tripleta con sujeto verbo objeto"
- "carlos torres preside bbva produce tripleta president"
- "amancio ortega fundo inditex en 1985 produce tripletas con fundar_en"
- "texto sin verbos produce tripletas vacias"
- "entities NER detecta PER ORG LOC"
test_file_path: "python/functions/datascience/tests/test_extract_triples_spacy_es.py"
file_path: "python/functions/datascience/extract_triples_spacy_es.py"
notes: |
  LICENSE: spaCy is MIT. The es_core_news_md model is CC BY-SA 4.0.
  Commercial use is permitted with attribution.
  Validated in notebook 09 of the gliner_glirel_tuning analysis.
  Complements extract_graph_gliner2: GLiNER2 uses a fixed relation vocabulary
  with higher precision; spaCy OpenIE uses verb lemmas (no fixed vocabulary)
  but needs manual post-filtering.
  impure: runs model inference (computational side effect).
  The nlp object is injected by the caller so it can be cached and reused.
  Compound relations: 'fundar_en' (fundar + preposition 'en'),
  'nombrar[pass]' (passive: the verb lemma plus the [pass] marker),
  'trabajar_con' (trabajar + 'con').
---
## Example
```python
from datascience.spacy_es_load_model import spacy_es_load_model
from datascience.extract_triples_spacy_es import extract_triples_spacy_es
nlp = spacy_es_load_model()
result = extract_triples_spacy_es(
"Amancio Ortega fundó Inditex en 1985 en La Coruña.",
nlp=nlp,
)
# result["triples"]:
# [{"subject": "Amancio Ortega", "relation": "fundar", "object": "Inditex", ...},
# {"subject": "Amancio Ortega", "relation": "fundar_en", "object": "1985", ...},
# {"subject": "Amancio Ortega", "relation": "fundar_en", "object": "La Coruna", ...}]
```
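The notes mention that spaCy OpenIE output needs manual post-filtering. A minimal sketch of such a filter follows, assuming the triple dict shape documented above; `filter_triples` and `ALLOWED_RELATIONS` are hypothetical names that are not part of the registry.

```python
from typing import Iterable

# Hypothetical allow-list; tune it to the relations your graph actually needs.
ALLOWED_RELATIONS = frozenset({"fundar", "fundar_en", "presidir", "trabajar_con"})


def filter_triples(
    triples: Iterable[dict],
    allowed: frozenset = ALLOWED_RELATIONS,
    min_chars: int = 3,
) -> list:
    """Keep allow-listed, non-trivial triples and drop case-insensitive duplicates."""
    kept = []
    seen = set()
    for t in triples:
        if t["relation"] not in allowed:
            continue
        # Very short subjects/objects are usually pronouns or parser noise.
        if len(t["subject"]) < min_chars or len(t["object"]) < min_chars:
            continue
        key = (t["subject"].lower(), t["relation"], t["object"].lower())
        if key in seen:
            continue
        seen.add(key)
        kept.append(t)
    return kept
```

Typical use would be `filter_triples(result["triples"])` right after calling `extract_triples_spacy_es`.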
@@ -0,0 +1,124 @@
"""Schema-less OpenIE triple extraction for Spanish via dependency rules.

Validated in notebook 09 of the gliner_glirel_tuning analysis.
LICENSE: spaCy MIT + es_core_news_md CC BY-SA 4.0.
"""
from __future__ import annotations

import time
from typing import Any

# Determiners and pronouns that are not meaningful entities on their own.
STOP_TOKENS = {
    "el", "la", "los", "las", "un", "una", "unos", "unas",
    "esto", "eso", "aquello", "esta", "este", "estos", "estas",
    "que", "quien", "cual", "cuales",
}


def _clean_span(span_tokens) -> str:  # type: ignore[type-arg]
    """Return the text of a token span, dropping leading prepositions."""
    toks = list(span_tokens)
    while toks and toks[0].pos_ == "ADP":
        toks = toks[1:]
    return " ".join(t.text for t in toks).strip()


def _is_meaningful(text: str) -> bool:
    """Check that a span is neither empty nor a single stopword."""
    if not text or not text.strip():
        return False
    if text.lower() in STOP_TOKENS:
        return False
    return True


def extract_triples_spacy_es(text: str, nlp: Any) -> dict:
    """Extract OpenIE-style (subject, relation, object) triples from Spanish text.

    Uses spaCy dependency rules to find subject-verb-object patterns.
    Schema-LESS: the relation is the verb's lemma (no fixed vocabulary).
    Also extracts spaCy NER entities (PER, ORG, LOC, MISC).

    Args:
        text: Spanish text to analyze. Works best with complete sentences.
        nlp: spaCy Language instance loaded with spacy_es_load_model.

    Returns:
        {
            "text": str,
            "triples": [
                {"subject": str, "relation": str, "object": str,
                 "verb_form": str, "object_dep": str, "prep": str|None},
                ...
            ],
            "entities": [{"text": str, "label": str}, ...],
            "elapsed_s": float
        }
    """
    t0 = time.time()
    doc = nlp(text)
    triples: list[dict] = []

    for tok in doc:
        if tok.pos_ not in ("VERB", "AUX"):
            continue
        verb_lemma = tok.lemma_
        verb_form = tok.text

        subjs = [
            c for c in tok.children
            if c.dep_ in ("nsubj", "nsubj:pass", "csubj")
        ]
        if not subjs:
            continue

        # Collect (token, dependency label, preposition) for every object-like child.
        objects: list[tuple] = []
        for c in tok.children:
            if c.dep_ in ("obj", "dobj", "iobj", "attr", "xcomp", "ccomp"):
                objects.append((c, c.dep_, None))
            elif c.dep_ in ("obl", "obl:agent", "nmod"):
                prep = None
                for cc in c.children:
                    if cc.dep_ == "case" and cc.pos_ == "ADP":
                        prep = cc.text.lower()
                        break
                objects.append((c, c.dep_, prep))

        for s in subjs:
            s_text = _clean_span(s.subtree)
            if not _is_meaningful(s_text):
                continue
            for o, dep, prep in objects:
                o_text = _clean_span(o.subtree)
                if not _is_meaningful(o_text):
                    continue
                # Build the relation label.
                rel = verb_lemma
                # Passive: mark with [pass].
                if any(c.dep_ == "nsubj:pass" for c in tok.children):
                    rel = f"{verb_lemma}[pass]"
                # Oblique with a preposition (excluding agents and the marker "a").
                elif prep and dep != "obl:agent" and prep != "a":
                    rel = f"{verb_lemma}_{prep}"
                triples.append({
                    "subject": s_text,
                    "relation": rel,
                    "object": o_text,
                    "verb_form": verb_form,
                    "object_dep": dep,
                    "prep": prep,
                })

    ents = [{"text": e.text, "label": e.label_} for e in doc.ents]
    return {
        "text": text,
        "triples": triples,
        "entities": ents,
        "elapsed_s": round(time.time() - t0, 3),
    }
@@ -0,0 +1,58 @@
---
name: fuzzy_merge_adaptive
kind: function
lang: py
domain: datascience
version: "1.0.0"
purity: pure
signature: "def fuzzy_merge_adaptive(left: list[dict], right: list[dict], left_key: str, right_key: str, thresholds: list[int] | None = None, how: str = 'left') -> list[dict]"
description: "Adaptive fuzzy join between two lists of dicts using rapidfuzz.fuzz.token_sort_ratio. Thresholds are tried from highest to lowest and the highest one met is recorded. Supports how='left' (every left item) and how='inner' (matched items only). Colliding fields get _left/_right suffixes."
tags: [fuzzy, matching, join, merge, rapidfuzz, string-similarity, datascience]
params:
  - name: left
    desc: List of dicts (left side of the join).
  - name: right
    desc: List of dicts (right side of the join).
  - name: left_key
    desc: Key in the left dicts used for string matching.
  - name: right_key
    desc: Key in the right dicts used for string matching.
  - name: thresholds
    desc: List of integer thresholds to try in descending order. Default [90,80,70,60,50].
  - name: how
    desc: "'left' keeps every left item; 'inner' keeps only items with a match."
output: "List of merged dicts with left + right fields (_left/_right suffixes on collision) plus fuzzy_match (str|None), match_score (float similarity in 0-100, or 0 when there is no match), threshold_used (int|None)."
uses_functions: []
uses_types: []
returns: []
returns_optional: false
error_type: ""
imports: ["rapidfuzz"]
tested: true
tests:
- "left join con typo"
- "inner join excluye sin match"
- "left join sin match devuelve none"
- "threshold adaptativo"
- "colision de claves usa sufijos"
test_file_path: "python/functions/datascience/tests/test_fuzzy_merge_adaptive.py"
file_path: "python/functions/datascience/fuzzy_merge_adaptive.py"
source_repo: "internal:footprint_aurgi"
source_license: "internal-aurgi"
source_file: "fuzzy_joins/fuzzy_en_batches.py"
---
## Example
```python
from fuzzy_merge_adaptive import fuzzy_merge_adaptive
left = [{"name": "Madrid"}, {"name": "Barclona"}]
right = [{"name": "Madrid", "cp": "28"}, {"name": "Barcelona", "cp": "08"}]
result = fuzzy_merge_adaptive(left, right, left_key="name", right_key="name")
# result[1]["fuzzy_match"] == "Barcelona", result[1]["match_score"] >= 80
```
## Notes
Migrated from thefuzz to rapidfuzz (compatible API, faster). No pandas: the merge is a manual dict lookup on right_key, so when several right rows share the same key value the last one wins. Thresholds are tried from highest to lowest; the first one met is stored in threshold_used.
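The threshold cascade can be illustrated without installing rapidfuzz. The sketch below swaps in the standard library's `difflib.SequenceMatcher` as a stand-in scorer, so its scores will not match `rapidfuzz.fuzz.token_sort_ratio` exactly; `token_sort_score` and `best_match` are hypothetical helper names used only for illustration.

```python
from difflib import SequenceMatcher


def token_sort_score(a: str, b: str) -> float:
    """Rough 0-100 similarity after sorting tokens (difflib stand-in, not rapidfuzz)."""
    norm_a = " ".join(sorted(a.lower().split()))
    norm_b = " ".join(sorted(b.lower().split()))
    return 100.0 * SequenceMatcher(None, norm_a, norm_b).ratio()


def best_match(value, candidates, thresholds=(90, 80, 70, 60, 50)):
    """Return (match, score, threshold_used), or (None, 0, None) when nothing qualifies."""
    if not candidates:
        return None, 0, None
    match = max(candidates, key=lambda c: token_sort_score(value, c))
    score = token_sort_score(value, match)
    # Strictest threshold first: the first one met is the highest one met.
    for t in sorted(thresholds, reverse=True):
        if score >= t:
            return match, score, t
    return None, 0, None
```

With this stand-in, `best_match("Barclona", ["Madrid", "Barcelona"])` selects `"Barcelona"`, while a value with no close candidate falls through every threshold and yields `(None, 0, None)`.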
@@ -0,0 +1,108 @@
"""Adaptive fuzzy merge with multiple thresholds using rapidfuzz."""
from __future__ import annotations


def fuzzy_merge_adaptive(
    left: list[dict],
    right: list[dict],
    left_key: str,
    right_key: str,
    thresholds: list[int] | None = None,
    how: str = "left",
) -> list[dict]:
    """Run an adaptive fuzzy join between two lists of dicts.

    For each item in left, finds the best match in right using
    rapidfuzz.fuzz.token_sort_ratio. Thresholds are tried from highest to
    lowest and threshold_used is the highest one met. If none is met,
    the match is None.

    Args:
        left: List of dicts (left side of the join).
        right: List of dicts (right side of the join).
        left_key: Key in the left dicts used for matching.
        right_key: Key in the right dicts used for matching.
        thresholds: Thresholds to try in descending order.
            Default [90, 80, 70, 60, 50].
        how: Join type. 'left' keeps every left item (with None in the
            right-side fields when there is no match).
            'inner' keeps only items with a match.

    Returns:
        List of merged dicts with left fields + right fields
        (_left/_right suffixes on collision) + fuzzy_match, match_score,
        threshold_used.
    """
    from rapidfuzz import fuzz, process

    if thresholds is None:
        thresholds = [90, 80, 70, 60, 50]
    # Always try the strictest threshold first, whatever order the caller used.
    thresholds = sorted(thresholds, reverse=True)

    right_values = [
        str(r[right_key]) for r in right if r.get(right_key) is not None
    ]

    def find_best_match(value: str | None) -> tuple[str | None, float, int | None]:
        if value is None:
            return None, 0, None
        result = process.extractOne(str(value), right_values, scorer=fuzz.token_sort_ratio)
        if not result:
            return None, 0, None
        match_str, score = result[0], result[1]
        for t in thresholds:
            if score >= t:
                return match_str, score, t
        return None, 0, None

    # Detect key collisions (only the first row of each side is inspected).
    left_keys = set(left[0].keys()) if left else set()
    right_keys = set(right[0].keys()) if right else set()
    collision_keys = left_keys & right_keys

    # Index right rows by right_key; duplicated key values keep the last row.
    right_index: dict[str, dict] = {}
    for r in right:
        val = r.get(right_key)
        if val is not None:
            right_index[str(val)] = r

    result_rows = []
    for item in left:
        value = item.get(left_key)
        fuzzy_match, score, threshold_used = find_best_match(value)
        if fuzzy_match is None and how == "inner":
            continue

        row: dict = {}
        # Left-side fields.
        for k, v in item.items():
            if k in collision_keys:
                row[f"{k}_left"] = v
            else:
                row[k] = v

        # Right-side fields (None placeholders when there is no match).
        matched_right = right_index.get(fuzzy_match) if fuzzy_match else None
        if matched_right is not None:
            for k, v in matched_right.items():
                if k in collision_keys:
                    row[f"{k}_right"] = v
                else:
                    row[k] = v
        else:
            for k in right_keys:
                if k in collision_keys:
                    row[f"{k}_right"] = None
                else:
                    row[k] = None

        row["fuzzy_match"] = fuzzy_match
        row["match_score"] = score
        row["threshold_used"] = threshold_used
        result_rows.append(row)

    return result_rows
@@ -0,0 +1,52 @@
---
id: geometric_mean_py_datascience
name: geometric_mean
kind: function
lang: py
domain: datascience
version: "1.0.0"
purity: pure
signature: "def geometric_mean(values: list[float]) -> float"
description: "Geometric mean of positive elements via exp(mean(log(x))). Non-positive values are filtered out. Returns math.nan if no positives."
tags: [statistics, mean, geometric, distribution, lognormal]
uses_functions: []
uses_types: []
returns: []
returns_optional: false
error_type: ""
imports: [math, numpy]
example: |
from geometric_mean import geometric_mean
result = geometric_mean([1, 2, 4, 8]) # ~2.828 (2^1.5)
tested: true
tests:
- "test_geometric_mean_powers_of_two"
- "test_geometric_mean_filters_non_positive"
- "test_geometric_mean_empty_returns_nan"
- "test_geometric_mean_all_negative_returns_nan"
- "test_geometric_mean_single_positive"
test_file_path: "python/functions/datascience/tests/test_geometric_mean.py"
file_path: "python/functions/datascience/geometric_mean.py"
params:
- name: values
desc: "List of numeric values. Non-positive elements are silently ignored."
output: "Geometric mean as float, computed over positive elements only. Returns math.nan if there are no positive values."
source_repo: "internal:footprint_aurgi"
source_license: "internal-aurgi"
source_file: "aurgi_mapas/generar_pdf_reporte.py:126"
---
## Example
```python
from geometric_mean import geometric_mean
geometric_mean([1, 2, 4, 8]) # 2.828... (= 2^1.5)
geometric_mean([1, -2, 3]) # exp((log(1)+log(3))/2) — ignores -2
geometric_mean([]) # math.nan
geometric_mean([-1, -2]) # math.nan — no positives
```
## Notes
Suited to lognormal distributions or multiplicative data (prices, ratios, growth rates). Equivalent to the n-th root of the product, but numerically more stable because it works in log space.
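The stability claim can be checked with a small standard-library sketch. Both helpers below are illustrative re-implementations written for this comparison (they are not the registry function): the naive product overflows to infinity for large inputs, while the log-space form stays finite.

```python
import math


def geometric_mean_naive(values):
    """n-th root of the product; the intermediate product can overflow to inf."""
    positives = [v for v in values if v > 0]
    if not positives:
        return math.nan
    return math.prod(positives) ** (1.0 / len(positives))


def geometric_mean_log(values):
    """exp(mean(log(x))); same quantity, computed in log space."""
    positives = [v for v in values if v > 0]
    if not positives:
        return math.nan
    return math.exp(sum(math.log(v) for v in positives) / len(positives))


big = [1e200, 1e200, 1e200]
naive = geometric_mean_naive(big)    # product exceeds the float range
stable = geometric_mean_log(big)     # stays close to 1e200
```

For moderate inputs such as `[1, 2, 4, 8]` the two forms agree up to floating-point rounding.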
@@ -0,0 +1,23 @@
"""geometric_mean: Geometric mean of positive values."""
import math

import numpy as np


def geometric_mean(values: list[float]) -> float:
    """Return the geometric mean of the positive elements in values.

    Filters out non-positive numbers before computing exp(mean(log(x))).
    Returns math.nan if there are no positive values.

    Args:
        values: List of numeric values (non-positive elements are ignored).

    Returns:
        Geometric mean as float, or math.nan if no positive values exist.
    """
    positives = [v for v in values if v > 0]
    if not positives:
        return math.nan
    arr = np.array(positives, dtype=float)
    return float(np.exp(np.mean(np.log(arr))))
@@ -0,0 +1,67 @@
---
name: gliner2_load_model
kind: function
lang: py
domain: datascience
version: "1.0.0"
purity: impure
signature: "def gliner2_load_model(model_name: str = 'fastino/gliner2-large-v1', device: str = 'auto') -> Any"
description: "Loads (and caches by (model_name, device)) a GLiNER2 model (joint NER+RE). GLiNER2 extracts entities and relations in a single pass with a unified schema. ~2x faster than running GLiNER + GLiREL separately. LICENSE: Apache 2.0."
tags: [gliner2, ner, relation-extraction, nlp, model, huggingface, zero-shot, joint, datascience, python, apache2]
uses_functions: []
uses_types: []
returns: []
returns_optional: false
error_type: "error_go_core"
imports: [gliner2]
params:
  - name: model_name
    desc: "HuggingFace Hub model ID. Default: fastino/gliner2-large-v1. Alternative: fastino/gliner2-base-v1 (lighter)."
  - name: device
    desc: "'auto' uses CUDA when available, otherwise CPU. Accepted values: 'cpu', 'cuda', 'cuda:0', 'cuda:1'. 'auto' is the recommended default."
output: "GLiNER2 instance cached by (model_name, device). Exposes .create_schema().entities(...).relations(...) and .extract(text, schema=schema, threshold=0.3)."
tested: true
tests:
- "cache devuelve la misma instancia con los mismos parametros"
- "device=auto resuelve a cpu si torch no esta instalado"
- "ImportError si gliner2 no esta instalado"
test_file_path: "python/functions/datascience/tests/test_gliner2_load_model.py"
file_path: "python/functions/datascience/gliner2_load_model.py"
notes: |
  LICENSE: fastino/gliner2-large-v1 is Apache 2.0, so commercial use is OK.
  Difference from gliner_load_model: GLiNER only does NER, GLiNER2 does NER+RE
  in a single pass (joint schema). For graph pipelines use GLiNER2
  when both tasks are needed at once.
  impure: network/disk download on first use, keeps state in _MODEL_CACHE.
  Size: fastino/gliner2-large-v1 ~500 MB. First load takes 15-30 s on CPU.
  CPU inference: 10-50 KB of text per second with a typical schema (3 entity + 8 relation labels).
---
## Example
```python
from datascience.gliner2_load_model import gliner2_load_model
model = gliner2_load_model(device="auto")
schema = (model.create_schema()
.entities(["person", "organization", "location"])
.relations(["works_at", "ceo_of", "located_in"]))
result = model.extract(
"Pablo Isla es el CEO de Inditex, empresa con sede en Arteixo.",
schema=schema,
threshold=0.3,
)
# result["entities"] -> {"person": ["Pablo Isla"], "organization": ["Inditex"], ...}
# result["relation_extraction"] -> {"ceo_of": [("Pablo Isla", "Inditex")], ...}
```
## Installation
```bash
cd python && uv pip install gliner2
# or with the full NLP extra:
cd python && uv pip install -e '.[nlp]'
```
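For graph pipelines the extracted relations usually need to become an edge list. The sketch below assumes the result shape shown in the example comments above (`result["relation_extraction"]` mapping a relation label to `(head, tail)` pairs); `relations_to_edges` is a hypothetical helper, and the shape should be verified against the installed gliner2 version.

```python
def relations_to_edges(result: dict) -> list:
    """Flatten {"relation_extraction": {label: [(head, tail), ...]}} into edge dicts."""
    edges = []
    for rel_type, pairs in result.get("relation_extraction", {}).items():
        for head, tail in pairs:
            edges.append({"from": head, "to": tail, "type": rel_type})
    return edges


# Hand-written sample mirroring the documented shape (not real model output).
sample = {
    "entities": {"person": ["Pablo Isla"], "organization": ["Inditex"]},
    "relation_extraction": {"ceo_of": [("Pablo Isla", "Inditex")]},
}
edges = relations_to_edges(sample)
```

Each edge keeps the relation label as `type`, which is convenient when the target graph stores typed edges.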
@@ -0,0 +1,62 @@
"""Loads (and caches) a GLiNER2 model (joint NER+RE in a single pass).

LICENSE: Apache 2.0, commercial use permitted.
Default model: fastino/gliner2-large-v1
"""
from __future__ import annotations

from typing import Any

# Global cache: (model_name, resolved device) -> GLiNER2 instance
_MODEL_CACHE: dict[tuple[str, str], Any] = {}


def _resolve_device(device: str) -> str:
    """Resolve 'auto' to 'cuda' or 'cpu' depending on torch availability."""
    if device != "auto":
        return device
    try:
        import torch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


def gliner2_load_model(
    model_name: str = "fastino/gliner2-large-v1",
    device: str = "auto",
) -> Any:
    """Load (and cache) a GLiNER2 model.

    GLiNER2 extracts entities AND relations in a single forward pass using
    a joint schema (entities + relation_labels). This is ~2x faster than
    running GLiNER + GLiREL separately for co-occurring entities.

    Returns model instance with .extract() and .create_schema() methods.

    LICENSE: Apache 2.0, commercial use OK.

    Args:
        model_name: HuggingFace Hub model ID. Default: fastino/gliner2-large-v1.
        device: 'auto' uses CUDA if available, else CPU. 'cpu', 'cuda', 'cuda:N'.

    Returns:
        GLiNER2 instance cached by (model_name, device).
    """
    resolved = _resolve_device(device)
    key = (model_name, resolved)
    if key in _MODEL_CACHE:
        return _MODEL_CACHE[key]

    from gliner2 import GLiNER2  # type: ignore[import]

    m = GLiNER2.from_pretrained(model_name)
    if hasattr(m, "to") and resolved != "cpu":
        try:
            m.to(resolved)
        except Exception:
            # Best effort: if the move fails the model silently stays on CPU,
            # although it is still cached under the requested device key.
            pass
    _MODEL_CACHE[key] = m
    return m
@@ -0,0 +1,52 @@
---
id: kde_density_levels_py_datascience
name: kde_density_levels
kind: function
lang: py
domain: datascience
version: "1.0.0"
purity: pure
signature: "def kde_density_levels(xs: list[float], ys: list[float], bw_adjust: float = 0.6, abs_quantile: float = 0.1, dense_quantile: float = 0.85, bins: int = 80) -> dict | None"
description: "Estimates 2-D density via KDE (scipy) or histogram fallback (numpy) and returns per-point density values plus absolute and dense quantile thresholds."
tags: [statistics, kde, density, spatial, geospatial, scipy, numpy]
uses_functions: []
uses_types: []
returns: []
returns_optional: true
error_type: ""
imports: [numpy, scipy]
example: |
from kde_density_levels import kde_density_levels
import numpy as np
rng = np.random.default_rng(42)
result = kde_density_levels(rng.normal(0,1,50).tolist(), rng.normal(0,1,50).tolist())
# {"method": "kde", "densities": array(...), "abs_level": ..., "dense_level": ...}
tested: true
tests:
- "test_kde_density_levels_returns_dict_for_50_points"
- "test_kde_density_levels_none_for_few_points"
- "test_kde_density_levels_none_for_4_points"
- "test_kde_density_levels_levels_ordered"
- "test_kde_density_levels_mismatched_lengths"
test_file_path: "python/functions/datascience/tests/test_kde_density_levels.py"
file_path: "python/functions/datascience/kde_density_levels.py"
params:
- name: xs
desc: "X-coordinates of the 2-D point cloud."
- name: ys
desc: "Y-coordinates of the 2-D point cloud. Must have same length as xs."
- name: bw_adjust
desc: "Scalar passed to gaussian_kde as bw_method, which SciPy uses directly as the bandwidth factor. Ignored by the histogram fallback. Default 0.6."
- name: abs_quantile
desc: "Quantile of density values used as the absolute (sparse) threshold. Default 0.1."
- name: dense_quantile
desc: "Quantile of density values used as the dense cluster threshold. Default 0.85."
- name: bins
desc: "Number of bins per axis for the histogram fallback. Default 80."
output: "Dict with method (str), densities (np.ndarray of per-point density), abs_level (float), dense_level (float). Returns None if len(xs) < 5 or lengths differ."
source_repo: "internal:footprint_aurgi"
source_license: "internal-aurgi"
source_file: "ponderacion_isochronas/src/recomendador_centros.py:305"
---
Pure function that writes nothing to disk. returns_optional=true because it returns None when there are fewer than 5 points or when xs and ys differ in length.
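Callers have to handle the `None` return before using the thresholds. A minimal sketch of one way to consume the result follows; the three labels and the `classify_points` helper are assumptions made for illustration, not part of the registry function.

```python
def classify_points(result):
    """Label each point from a kde_density_levels-style result dict.

    Assumed convention: below abs_level -> "sparse", at or above
    dense_level -> "dense", anything in between -> "normal".
    """
    if result is None:  # fewer than 5 points, or xs/ys length mismatch
        return []
    labels = []
    for d in result["densities"]:
        if d < result["abs_level"]:
            labels.append("sparse")
        elif d >= result["dense_level"]:
            labels.append("dense")
        else:
            labels.append("normal")
    return labels


# Hand-written stand-in for a real result, to keep the sketch dependency-free.
fake_result = {
    "method": "kde",
    "densities": [0.1, 0.5, 0.9],
    "abs_level": 0.2,
    "dense_level": 0.8,
}
labels = classify_points(fake_result)
```

The same function works unchanged when `densities` is the NumPy array returned by the real implementation, since it only iterates and compares.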
@@ -0,0 +1,65 @@
"""kde_density_levels: Compute density levels via KDE or histogram fallback."""
from __future__ import annotations

import numpy as np


def kde_density_levels(
    xs: list[float],
    ys: list[float],
    bw_adjust: float = 0.6,
    abs_quantile: float = 0.1,
    dense_quantile: float = 0.85,
    bins: int = 80,
) -> dict | None:
    """Estimate 2-D density and compute absolute and dense threshold levels.

    Uses scipy.stats.gaussian_kde when available; falls back to
    numpy.histogram2d if scipy is not installed.

    Args:
        xs: X-coordinates of points.
        ys: Y-coordinates of points.
        bw_adjust: Scalar passed to gaussian_kde as bw_method, i.e. used
            directly as the bandwidth factor (ignored for histogram fallback).
        abs_quantile: Quantile of density values used as the absolute threshold.
        dense_quantile: Quantile of density values used as the dense threshold.
        bins: Number of bins per axis for the histogram fallback.

    Returns:
        Dict with keys:
            "method" (str): "kde" or "hist".
            "densities" (np.ndarray): 1-D array of per-point density estimates.
            "abs_level" (float): density at abs_quantile.
            "dense_level" (float): density at dense_quantile.
        Returns None if len(xs) < 5 or xs and ys have different lengths.
    """
    if len(xs) < 5 or len(xs) != len(ys):
        return None

    xs_arr = np.array(xs, dtype=float)
    ys_arr = np.array(ys, dtype=float)
    points = np.vstack([xs_arr, ys_arr])

    try:
        from scipy.stats import gaussian_kde  # type: ignore

        kde = gaussian_kde(points, bw_method=bw_adjust)
        densities = kde(points)
        method = "kde"
    except ImportError:
        # Histogram fallback: look up the count of the bin each point falls in.
        h, xedges, yedges = np.histogram2d(xs_arr, ys_arr, bins=bins)
        # side="right" matches histogram2d's half-open bins [a, b); the clip
        # keeps the maximum value in the last (closed) bin.
        xi = np.clip(np.searchsorted(xedges, xs_arr, side="right") - 1, 0, bins - 1)
        yi = np.clip(np.searchsorted(yedges, ys_arr, side="right") - 1, 0, bins - 1)
        densities = h[xi, yi].astype(float)
        method = "hist"

    abs_level = float(np.quantile(densities, abs_quantile))
    dense_level = float(np.quantile(densities, dense_quantile))
    return {
        "method": method,
        "densities": densities,
        "abs_level": abs_level,
        "dense_level": dense_level,
    }
@@ -0,0 +1,61 @@
---
name: marianmt_es_en_load_model
kind: function
lang: py
domain: datascience
version: "1.0.0"
purity: impure
signature: "def marianmt_es_en_load_model(model_name: str = 'Helsinki-NLP/opus-mt-es-en') -> tuple[Any, Any]"
description: "Loads (and caches) the MarianMT tokenizer and model for ES->EN translation (Helsinki-NLP, ~300 MB). Apache 2.0 license. Cached by model_name."
tags: [marianmt, translation, es-en, nlp, model, huggingface, apache2, datascience, python]
uses_functions: []
uses_types: []
returns: []
returns_optional: false
error_type: "error_go_core"
imports: [transformers]
params:
  - name: model_name
    desc: "HuggingFace Hub model ID (default: Helsinki-NLP/opus-mt-es-en, ~300 MB)"
output: "Tuple (tokenizer, model) ready for inference, cached by model_name."
tested: false
tests: []
test_file_path: ""
file_path: "python/functions/datascience/marianmt_es_en_load_model.py"
notes: |
  LICENSE: Apache 2.0, commercial use permitted.
  Useful as a step before REBEL (English-only): translate ES -> EN with MarianMT
  and then pass the result to rebel_load_model for relation extraction in English.
  impure: network/disk download on first use, keeps state in _MODEL_CACHE.
  Uses MarianTokenizer and MarianMTModel instead of Auto* because Marian models
  have a specialised tokenizer with an SPM vocabulary.
---
## Example
```python
from python.functions.datascience.marianmt_es_en_load_model import marianmt_es_en_load_model
from python.functions.datascience.translate_es_to_en import translate_es_to_en
tokenizer, model = marianmt_es_en_load_model()
translated = translate_es_to_en("Pablo Isla es presidente de Inditex.", tokenizer, model)
# "Pablo Isla is president of Inditex."
```
## Size and latency
- `Helsinki-NLP/opus-mt-es-en`: ~300 MB on disk.
- First load: 5-15 s on CPU.
- CPU inference: 0.5-2 s per sentence.
- GPU: much faster.
## Use as a preprocessor for REBEL
```
ES text -> marianmt_es_en -> EN text -> rebel_load_model -> triplets
```
This pipeline makes it possible to use REBEL (Apache 2.0, English only) on Spanish text.
Direct alternative: use mrebel_load_model (CC BY-NC-SA, multilingual).
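The translate-then-extract flow can be sketched as a small composition with injected callables, which keeps it testable without downloading either model. `es_text_to_triplets` is a hypothetical helper; in real use `translate` would wrap `translate_es_to_en(text, tokenizer, model)` and `extract` would wrap the REBEL generate/decode/parse steps.

```python
from typing import Callable


def es_text_to_triplets(
    text_es: str,
    translate: Callable[[str], str],
    extract: Callable[[str], list],
) -> list:
    """Translate Spanish text to English, extract triplets, keep both source texts."""
    text_en = translate(text_es)
    triplets = extract(text_en)
    # Copy each triplet so the extractor's own dicts are not mutated.
    return [dict(t, source_es=text_es, source_en=text_en) for t in triplets]


# Stub callables standing in for the real models (illustration only).
def fake_translate(text: str) -> str:
    return "Pablo Isla is president of Inditex."


def fake_extract(text: str) -> list:
    return [{"head": "Pablo Isla", "type": "president", "tail": "Inditex"}]


result = es_text_to_triplets(
    "Pablo Isla es presidente de Inditex.", fake_translate, fake_extract
)
```

Keeping the Spanish source next to each triplet makes it easier to map English heads and tails back to the original entities afterwards.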
@@ -0,0 +1,54 @@
"""Loads (and caches) the MarianMT model for ES -> EN translation."""
from __future__ import annotations

from typing import Any

# Global cache: model_name -> (tokenizer, model)
_MODEL_CACHE: dict[str, tuple[Any, Any]] = {}


def marianmt_es_en_load_model(
    model_name: str = "Helsinki-NLP/opus-mt-es-en",
) -> tuple[Any, Any]:
    """Loads (and caches) a MarianMT model for Spanish-to-English translation.

    MarianMT is a lightweight seq2seq translation model (~300 MB) from
    Helsinki-NLP, trained on the OPUS parallel corpus.

    LICENSE: Apache 2.0, commercial use permitted.

    The first call downloads the model from HuggingFace Hub (~300 MB).
    Subsequent calls with the same ``model_name`` return the cached instance.

    Args:
        model_name: HuggingFace Hub model ID. Default is the ES->EN model.
            Other available models follow the pattern
            ``Helsinki-NLP/opus-mt-{src}-{tgt}``.

    Returns:
        Tuple ``(tokenizer, model)`` both ready for inference with
        ``model.generate(...)`` and ``tokenizer.decode(...)``.

    Raises:
        ImportError: if ``transformers`` is not installed.
        OSError: if the model cannot be downloaded or loaded from disk.
    """
    cached = _MODEL_CACHE.get(model_name)
    if cached is not None:
        return cached

    try:
        from transformers import MarianMTModel, MarianTokenizer
    except ImportError as exc:
        raise ImportError(
            "transformers no esta instalado. Instalalo con "
            "`uv pip install transformers` o `uv pip install -e '.[nlp]'`."
        ) from exc

    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)
    model.eval()  # inference mode: disables dropout
    _MODEL_CACHE[model_name] = (tokenizer, model)
    return tokenizer, model
@@ -0,0 +1,56 @@
---
name: mrebel_base_load_model
kind: function
lang: py
domain: datascience
version: "1.0.0"
purity: impure
signature: "def mrebel_base_load_model(model_name: str = 'Babelscape/mrebel-base', src_lang: str = 'es_XX', tgt_lang: str = 'tp_XX') -> tuple[Any, Any]"
description: "Fast variant of mrebel_load_model using the base checkpoint (250M params, ~900 MB). Delegates entirely to mrebel_load_model. Same CC BY-NC-SA 4.0 license: non-commercial use only."
tags: [mrebel, relation-extraction, nlp, model, huggingface, multilingual, seq2seq, datascience, python]
uses_functions: [mrebel_load_model_py_datascience]
uses_types: []
returns: []
returns_optional: false
error_type: "error_go_core"
imports: []
params:
  - name: model_name
    desc: "HuggingFace Hub model ID (default: Babelscape/mrebel-base, 250M params)"
  - name: src_lang
    desc: "Source language code for the mBART tokenizer: 'es_XX' (ES), 'en_XX' (EN), etc."
  - name: tgt_lang
    desc: "Target language token for the decoder, always 'tp_XX'"
output: "Tuple (tokenizer, model) ready for inference, cached by (model_name, src_lang) in the cache shared with mrebel_load_model."
tested: false
tests: []
test_file_path: ""
file_path: "python/functions/datascience/mrebel_base_load_model.py"
notes: |
  LICENSE: Babelscape/mrebel-base is under CC BY-NC-SA 4.0 (Creative Commons
  Non-Commercial Share-Alike). Non-commercial use only. Do NOT use in commercial products.
  This function is a thin wrapper: it does NOT duplicate the load/cache logic. All of
  that logic lives in mrebel_load_model. Useful for benchmarks that compare
  base vs large through the same interface.
  The cache is shared with mrebel_load_model (same module-level _MODEL_CACHE dict).
---
## Example
```python
from python.functions.datascience.mrebel_base_load_model import mrebel_base_load_model
# 250M params vs 600M, same interface
tokenizer, model = mrebel_base_load_model(src_lang="es_XX")
```
## Base vs large comparison
| Variant | Params | Size | CPU latency per sentence | Typical recall |
|---------|--------|------|--------------------------|----------------|
| mrebel-large | 600M | ~2.4 GB | 15-30 s | high |
| mrebel-base | 250M | ~900 MB | 5-10 s | medium |
For speed benchmarks in graph_explorer, use base. For final production, evaluate large.
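Latency figures like the ones in the table depend heavily on hardware, so it is worth measuring them locally. The helper below is a minimal sketch of such a measurement; `seconds_per_sentence` is a hypothetical name, and `extract` stands for any callable that runs one model variant on a single sentence.

```python
import time
from statistics import median


def seconds_per_sentence(extract, sentences, warmup=1):
    """Median wall-clock seconds per sentence, after a few warm-up calls."""
    # Warm-up calls absorb one-off costs (lazy initialisation, caches).
    for s in sentences[:warmup]:
        extract(s)
    timings = []
    for s in sentences:
        t0 = time.perf_counter()
        extract(s)
        timings.append(time.perf_counter() - t0)
    return median(timings)


# Stub extractor so the sketch runs without any model (illustration only).
calls = []


def fake_extract(sentence):
    calls.append(sentence)
    return []


latency = seconds_per_sentence(fake_extract, ["a", "b", "c"], warmup=1)
```

Running it once with a callable built on the base checkpoint and once with the large one gives directly comparable medians on the same sentences.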
@@ -0,0 +1,41 @@
"""Loads (and caches) the mREBEL-base model (fast variant, 250M params)."""
from __future__ import annotations

from typing import Any

from python.functions.datascience.mrebel_load_model import mrebel_load_model


def mrebel_base_load_model(
    model_name: str = "Babelscape/mrebel-base",
    src_lang: str = "es_XX",
    tgt_lang: str = "tp_XX",
) -> tuple[Any, Any]:
    """Loads (and caches) the mREBEL-base tokenizer and model.

    Thin wrapper over ``mrebel_load_model`` with the base checkpoint as
    default (250M params, ~900 MB). Faster than the large variant at the
    cost of some recall on complex sentences.

    LICENSE NOTICE: Babelscape/mrebel-base is licensed under CC BY-NC-SA 4.0
    (Creative Commons Non-Commercial Share-Alike). Do NOT use in commercial
    products without replacing this model.

    Args:
        model_name: HuggingFace Hub model ID. Defaults to the base checkpoint.
        src_lang: Source language code for the mBART tokenizer.
        tgt_lang: Target language token for the decoder (always ``"tp_XX"``).

    Returns:
        Tuple ``(tokenizer, model)`` ready for inference.

    Raises:
        ImportError: if ``transformers`` is not installed.
        OSError: if the model cannot be downloaded or loaded from disk.
    """
    return mrebel_load_model(
        model_name=model_name,
        src_lang=src_lang,
        tgt_lang=tgt_lang,
    )
@@ -0,0 +1,76 @@
---
name: mrebel_load_model
kind: function
lang: py
domain: datascience
version: "1.0.0"
purity: impure
signature: "def mrebel_load_model(model_name: str = 'Babelscape/mrebel-large', src_lang: str = 'es_XX', tgt_lang: str = 'tp_XX') -> tuple[Any, Any]"
description: "Loads (and caches) the mREBEL tokenizer and model (mBART-based, ~600M params, ~2.4 GB). Multilingual, 30+ languages. Cached by (model_name, src_lang). The first call downloads from HuggingFace. LICENSE CC BY-NC-SA 4.0: non-commercial use only."
tags: [mrebel, relation-extraction, nlp, model, huggingface, multilingual, seq2seq, datascience, python]
uses_functions: []
uses_types: []
returns: []
returns_optional: false
error_type: "error_go_core"
imports: [transformers]
params:
  - name: model_name
    desc: "HuggingFace Hub model ID (default: Babelscape/mrebel-large, 600M params)"
  - name: src_lang
    desc: "Source language code for the mBART tokenizer: 'es_XX' (ES), 'en_XX' (EN), 'fr_XX' (FR), etc."
  - name: tgt_lang
    desc: "Target language token for the decoder, always 'tp_XX' for the mREBEL triplet format"
output: "Tuple (tokenizer, model) ready for inference. Cached by (model_name, src_lang)."
tested: false
tests: []
test_file_path: ""
file_path: "python/functions/datascience/mrebel_load_model.py"
notes: |
  LICENSE: Babelscape/mrebel-large is under CC BY-NC-SA 4.0 (Creative Commons
  Non-Commercial Share-Alike). Non-commercial use only. Do NOT use in commercial
  products without replacing it with a commercially licensed model.
  impure: network/disk download on first use, keeps state in _MODEL_CACHE.
  Does not need the glirel HF kwargs patch: AutoModelForSeq2SeqLM is the standard path.
  The cache key is (model_name, src_lang): two different languages create two instances
  because src_lang is fixed in the tokenizer at load time.
---
## Example
```python
from python.functions.datascience.mrebel_load_model import mrebel_load_model
from python.functions.datascience.parse_rebel_output import parse_rebel_output
tokenizer, model = mrebel_load_model(src_lang="es_XX")
text = "Pablo Isla es el presidente de Inditex."
inputs = tokenizer(text, return_tensors="pt", max_length=512, truncation=True)
generated = model.generate(**inputs, num_beams=4, length_penalty=1.0, max_length=256)
decoded = tokenizer.decode(generated[0], skip_special_tokens=False)
triplets = parse_rebel_output(decoded)
```
## Tamanio y latencia
- `Babelscape/mrebel-large`: ~2.4 GB en disco (modelo + tokenizer).
- Primera carga: 30-90 s en CPU, depende de red y disco.
- Inferencia CPU: 5-15 s por frase (mBART es mas lento que REBEL/BART).
- Inferencia GPU (CUDA T4): 0.5-2 s por frase.
## Idiomas soportados
mREBEL soporta los idiomas de mBART-50. Ejemplos:
- `es_XX` — Espanol
- `en_XX` — Ingles
- `fr_XX` — Frances
- `de_DE` — Aleman
- `pt_XX` — Portugues
- `it_IT` — Italiano
## Notas
- Para ingles y usos comerciales, usar `rebel_load_model` (Apache 2.0).
- Para benchmarks rapidos, usar `mrebel_base_load_model` (250M params, misma licencia).
- `model.eval()` se llama al cargar para desactivar dropout en inferencia.
@@ -0,0 +1,69 @@
"""Carga (y cachea) el modelo mREBEL para extraccion de relaciones multilingue."""

from __future__ import annotations

from typing import Any

# Global cache: (model_name, src_lang) -> (tokenizer, model)
_MODEL_CACHE: dict[tuple[str, str], tuple[Any, Any]] = {}


def mrebel_load_model(
    model_name: str = "Babelscape/mrebel-large",
    src_lang: str = "es_XX",
    tgt_lang: str = "tp_XX",
) -> tuple[Any, Any]:
    """Loads (and caches) the mREBEL tokenizer and model.

    mREBEL is a multilingual seq2seq model (mBART-based, ~600M params, ~2.4 GB)
    for relation extraction. It supports 30+ languages via language codes
    (``src_lang``).

    LICENSE NOTICE: Babelscape/mrebel-large is licensed under CC BY-NC-SA 4.0
    (Creative Commons Non-Commercial Share-Alike). Do NOT use in commercial
    products without replacing this model with a commercially-licensed
    alternative (e.g. Babelscape/rebel-large which is Apache 2.0 but
    English-only).

    The first call downloads the model from HuggingFace Hub (~2.4 GB).
    Subsequent calls with the same ``(model_name, src_lang)`` return the
    cached instance without re-loading.

    Args:
        model_name: HuggingFace Hub model ID. Default is the large variant.
        src_lang: Source language code for the mBART tokenizer, e.g.
            ``"es_XX"`` (Spanish), ``"en_XX"`` (English), ``"fr_XX"`` (French).
        tgt_lang: Target language token for the decoder (always ``"tp_XX"``
            for the triplet format; only change if using a custom checkpoint).

    Returns:
        Tuple ``(tokenizer, model)`` both ready for inference with
        ``model.generate(...)`` and ``tokenizer.decode(...)``.

    Raises:
        ImportError: if ``transformers`` is not installed.
        OSError: if the model cannot be downloaded or loaded from disk.
    """
    # tgt_lang is not part of the key, so a cached tokenizer keeps the
    # tgt_lang it was first created with.
    cache_key = (model_name, src_lang)
    cached = _MODEL_CACHE.get(cache_key)
    if cached is not None:
        return cached

    try:
        from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
    except ImportError as exc:
        raise ImportError(
            "transformers no esta instalado. Instalalo con "
            "`uv pip install transformers` o `uv pip install -e '.[nlp]'`."
        ) from exc

    tokenizer = AutoTokenizer.from_pretrained(
        model_name,
        src_lang=src_lang,
        tgt_lang=tgt_lang,
    )
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
    model.eval()  # disable dropout for inference

    _MODEL_CACHE[cache_key] = (tokenizer, model)
    return tokenizer, model
@@ -0,0 +1,65 @@
---
name: parse_rebel_output
kind: function
lang: py
domain: datascience
version: "1.0.0"
purity: pure
signature: "def parse_rebel_output(decoded_text: str) -> list[dict]"
description: "Parser puro del wire format de REBEL / mREBEL. Convierte la cadena decoded por el tokenizer (con skip_special_tokens=False) a una lista de triplets tipados {head, head_type, type, tail, tail_type}. Nunca lanza excepcion."
tags: [rebel, mrebel, relation-extraction, nlp, parser, knowledge-graph, datascience, python]
uses_functions: []
uses_types: []
returns: []
returns_optional: false
error_type: ""
imports: []
params:
  - name: decoded_text
    desc: "raw string produced by tokenizer.decode(..., skip_special_tokens=False); includes special tokens such as <triplet>, <per>, <org>, <loc>, tp_XX, etc."
output: "list of dicts with keys head (str), head_type (str), type (str), tail (str), tail_type (str). Empty list if there are no complete triplets or the input is empty."
tested: true
tests:
- "string vacio retorna lista vacia"
- "un triplet completo retorna un dict con campos correctos"
- "dos triplets retorna dos dicts"
- "triplet incompleto sin cierre no rompe"
- "tokens angulares desconocidos no lanzan excepcion"
test_file_path: "python/functions/datascience/tests/test_parse_rebel_output.py"
file_path: "python/functions/datascience/parse_rebel_output.py"
notes: |
  Pure function. Adapts the official parser from the Babelscape/rebel README to the registry style.
  Compatible with mREBEL (tp_XX prefix, language tokens __es__, __en__) and REBEL (no language prefix).
  The wire format uses <triplet> to separate triplets and <type> tokens to close the
  head/tail spans. The state machine is: t=reading head, s=reading tail, o=reading relation.
---
## Ejemplo
```python
from python.functions.datascience.parse_rebel_output import parse_rebel_output
decoded = "tp_XX<triplet> Pablo Isla <per> Inditex <org> employer"
triplets = parse_rebel_output(decoded)
# [{'head': 'Pablo Isla', 'head_type': 'per', 'type': 'employer',
# 'tail': 'Inditex', 'tail_type': 'org'}]
```
## Formato wire REBEL / mREBEL
```
tp_XX<triplet> HEAD_TOKENS <HEAD_TYPE> TAIL_TOKENS <TAIL_TYPE> RELATION_TOKENS<triplet> ...
```
- `<triplet>` — marca el inicio de un nuevo triplet (y cierra el anterior).
- `<HEAD_TYPE>` — cierra el span del head y abre el span del tail.
- `<TAIL_TYPE>` — cierra el span del tail y abre el span de la relacion.
- El ultimo triplet se cierra con `</s>` (ya eliminado antes del split).
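
To make the token roles above concrete, here is a small sketch that labels each whitespace-separated token of a wire string. It is an illustration only, not the registry parser: `classify_tokens` is a hypothetical helper, and it assumes markers are delimited by whitespace (a marker glued to a word would be labelled as plain text).

```python
def classify_tokens(decoded: str) -> list[tuple[str, str]]:
    """Label each whitespace token as 'marker', 'type' or 'text'."""
    labelled = []
    # Drop the tp_XX language prefix, leaving a space so tokens stay separated.
    for token in decoded.replace("tp_XX", " ").split():
        if token == "<triplet>":
            labelled.append(("marker", token))
        elif token.startswith("<") and token.endswith(">"):
            # Any other angle-bracket token is an entity type such as <per>.
            labelled.append(("type", token[1:-1]))
        else:
            labelled.append(("text", token))
    return labelled


wire = "tp_XX<triplet> Pablo Isla <per> Inditex <org> employer"
for kind, value in classify_tokens(wire):
    print(f"{kind:6} {value}")
```

Reading the labels in order gives the span boundaries: text before the first type token is the head, text between the two type tokens is the tail, and text after the second type token is the relation.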
## Notas
- No valida ni filtra los `head_type`/`tail_type` — los devuelve tal cual emite el modelo.
- Compatible con cualquier variante seq2seq que use el mismo wire format (Babelscape/rebel,
Babelscape/mrebel-large, Babelscape/mrebel-base).
- Para usar el output en el grafo, pasar por `align_relations_to_entities` que resuelve
head/tail a nombres canonicos del conjunto de entidades conocido.
@@ -0,0 +1,105 @@
"""Parser puro del wire format de REBEL / mREBEL."""

from __future__ import annotations


def parse_rebel_output(decoded_text: str) -> list[dict]:
    """Parse REBEL / mREBEL decoded output into typed triplets.

    The input is the string produced by the HuggingFace tokenizer with
    ``skip_special_tokens=False``, e.g.::

        tp_XX<triplet> Pablo Isla <per> Inditex <org> employer<triplet> ...

    Args:
        decoded_text: Raw decoded string from the seq2seq model, including
            special tokens like ``<triplet>``, ``<relation>``, ``<per>``,
            ``<org>``, ``<loc>``, etc.

    Returns:
        List of dicts with keys:
            ``head`` (str), ``head_type`` (str),
            ``type`` (str), ``tail`` (str), ``tail_type`` (str).
        Returns an empty list on empty input or if no complete triplet is
        found. Never raises.
    """
    if not decoded_text or not decoded_text.strip():
        return []

    triplets: list[dict] = []

    # Strip language / padding tokens common to mREBEL.
    text = (
        decoded_text
        .replace("<s>", "")
        .replace("<pad>", "")
        .replace("</s>", "")
        .replace("tp_XX", "")
        .replace("__en__", "")
        .strip()
    )

    current = "x"  # x=init, t=head span, s=tail span, o=relation span
    subject = ""
    relation = ""
    object_ = ""
    object_type = ""
    subject_type = ""

    for token in text.split():
        if token in ("<triplet>", "<relation>"):
            current = "t"
            if relation:
                triplets.append(
                    {
                        "head": subject.strip(),
                        "head_type": subject_type,
                        "type": relation.strip(),
                        "tail": object_.strip(),
                        "tail_type": object_type,
                    }
                )
                relation = ""
            subject = ""
        elif token.startswith("<") and token.endswith(">"):
            if current in ("t", "o"):
                # Closing the head span; now reading tail.
                current = "s"
                if relation:
                    triplets.append(
                        {
                            "head": subject.strip(),
                            "head_type": subject_type,
                            "type": relation.strip(),
                            "tail": object_.strip(),
                            "tail_type": object_type,
                        }
                    )
                object_ = ""
                subject_type = token[1:-1]
            else:
                # Closing the tail span; now reading relation.
                current = "o"
                object_type = token[1:-1]
                relation = ""
        else:
            if current == "t":
                subject += " " + token
            elif current == "s":
                object_ += " " + token
            elif current == "o":
                relation += " " + token

    # Flush the last triplet if all fields are present.
    if subject and relation and object_ and object_type and subject_type:
        triplets.append(
            {
                "head": subject.strip(),
                "head_type": subject_type,
                "type": relation.strip(),
                "tail": object_.strip(),
                "tail_type": object_type,
            }
        )

    return triplets
@@ -0,0 +1,64 @@
---
name: plot_heatmap_log
kind: function
lang: py
domain: datascience
version: "1.0.0"
purity: impure
signature: "def plot_heatmap_log(ax: Axes, xs: list[float] | np.ndarray, ys: list[float] | np.ndarray, extent: tuple[float, float, float, float], bins: int = 200, cmap: str = 'hot', alpha: float = 0.6) -> None"
description: "Dibuja un heatmap 2D con escala log1p sobre un Axes de matplotlib. Usa np.histogram2d con el extent dado y ax.imshow para renderizar."
tags: [visualization, heatmap, histogram, matplotlib, datascience, log]
uses_functions: []
uses_types: []
returns: []
returns_optional: false
error_type: "error_go_core"
imports: ["numpy", "matplotlib"]
params:
  - name: ax
    desc: "matplotlib Axes the heatmap is drawn on."
  - name: xs
    desc: "X coordinates of the points."
  - name: ys
    desc: "Y coordinates of the points."
  - name: extent
    desc: "Bounding box as (minx, maxx, miny, maxy) defining the histogram range."
  - name: bins
    desc: "Number of histogram bins along each axis. Default 200."
  - name: cmap
    desc: "Matplotlib colormap name. Default 'hot'."
  - name: alpha
    desc: "Opacity of the overlay (0-1). Default 0.6."
output: "None. Modifica el Axes in-place añadiendo el heatmap como imagen con ax.imshow."
tested: true
tests:
- "100 puntos no lanza excepción"
- "ax tiene al menos una imagen tras la llamada"
test_file_path: "python/functions/datascience/tests/test_plot_heatmap_log.py"
file_path: "python/functions/datascience/plot_heatmap_log.py"
source_repo: "internal:footprint_aurgi"
source_license: "internal-aurgi"
source_file: "zonas_mapas_aurgi/examples/generar_reporte_madrid.py:62"
---
## Ejemplo
```python
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
import numpy as np
from datascience.plot_heatmap_log import plot_heatmap_log
rng = np.random.default_rng(42)
xs = rng.uniform(-4.0, -3.5, 500)
ys = rng.uniform(40.3, 40.6, 500)
fig, ax = plt.subplots()
plot_heatmap_log(ax, xs, ys, extent=(-4.0, -3.5, 40.3, 40.6), bins=100)
fig.savefig("heatmap.png")
```
## Notas
Aplica `np.log1p` a las cuentas del histograma para comprimir el rango dinámico y hacer visibles tanto zonas densas como dispersas. El histograma se transpone (`counts.T`) antes de pasar a imshow para alinear correctamente los ejes x/y. `aspect="auto"` permite que la imagen se estire al aspecto del Axes.
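
The effect of `log1p` on the dynamic range can be seen with a tiny standard-library sketch. It is a stand-in for the `np.histogram2d` + `np.log1p` steps, not the registry function: `log_histogram_2d` is a hypothetical helper. The grid is built directly as rows of y and columns of x, which is the layout `imshow` expects and the reason the NumPy version transposes its counts (`np.histogram2d` puts x along the first axis).

```python
import math


def log_histogram_2d(xs, ys, extent, bins):
    """Bin points into a bins x bins grid and return log1p(counts) as rows of y."""
    minx, maxx, miny, maxy = extent
    grid = [[0] * bins for _ in range(bins)]  # grid[row=y][col=x]
    for x, y in zip(xs, ys):
        if not (minx <= x <= maxx and miny <= y <= maxy):
            continue  # points outside the extent are ignored
        # Clamp so a point exactly on the upper edge falls in the last bin.
        col = min(int((x - minx) / (maxx - minx) * bins), bins - 1)
        row = min(int((y - miny) / (maxy - miny) * bins), bins - 1)
        grid[row][col] += 1
    return [[math.log1p(c) for c in row] for row in grid]


# 1000 points in one cell, a single point in another.
xs = [0.1] * 1000 + [0.9]
ys = [0.1] * 1000 + [0.9]
heat = log_histogram_2d(xs, ys, extent=(0.0, 1.0, 0.0, 1.0), bins=2)
print(heat[0][0], heat[1][1])
```

The raw counts differ by a factor of 1000, while the log-scaled values differ by roughly a factor of 10, so the sparse cell stays visible next to the dense one.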
@@ -0,0 +1,53 @@
"""Plot a log-scale 2D histogram heatmap on a matplotlib Axes."""

from __future__ import annotations


def plot_heatmap_log(
    ax: "Axes",
    xs: "list[float] | np.ndarray",
    ys: "list[float] | np.ndarray",
    extent: "tuple[float, float, float, float]",
    bins: int = 200,
    cmap: str = "hot",
    alpha: float = 0.6,
) -> None:
    """Plot a log-scale 2D density heatmap using histogram binning.

    Computes a 2D histogram over the given points within ``extent``, applies
    log1p to compress the dynamic range, and renders the result as an image
    overlay on the Axes.

    Args:
        ax: matplotlib Axes to draw on.
        xs: X coordinates (longitude or projected x).
        ys: Y coordinates (latitude or projected y).
        extent: Bounding box as (minx, maxx, miny, maxy).
        bins: Number of histogram bins along each axis. Default 200.
        cmap: Matplotlib colormap name. Default "hot".
        alpha: Opacity of the heatmap overlay (0-1). Default 0.6.
    """
    import numpy as np  # type: ignore

    xs_arr = np.asarray(xs, dtype=float)
    ys_arr = np.asarray(ys, dtype=float)

    minx, maxx, miny, maxy = extent

    counts, _xedges, _yedges = np.histogram2d(
        xs_arr,
        ys_arr,
        bins=bins,
        range=[[minx, maxx], [miny, maxy]],
    )

    # histogram2d puts x on the first axis; imshow expects rows to be y.
    log_counts = np.log1p(counts.T)

    ax.imshow(
        log_counts,
        extent=[minx, maxx, miny, maxy],
        origin="lower",
        cmap=cmap,
        alpha=alpha,
        aspect="auto",
    )
@@ -0,0 +1,66 @@
---
name: plot_kde_2d
kind: function
lang: py
domain: datascience
version: "1.0.0"
purity: impure
signature: "def plot_kde_2d(ax: Axes, xs: list[float] | np.ndarray, ys: list[float] | np.ndarray, cmap: str = 'magma', alpha: float = 0.35, thresh: float = 0.02, levels: int = 30, bw_adjust: float = 0.6) -> None"
description: "Dibuja un KDE 2D como contornos rellenos sobre un Axes de matplotlib usando seaborn.kdeplot. Si los arrays están vacíos retorna sin pintar."
tags: [visualization, kde, density, seaborn, matplotlib, datascience]
uses_functions: []
uses_types: []
returns: []
returns_optional: false
error_type: "error_go_core"
imports: ["numpy", "seaborn", "matplotlib"]
params:
  - name: ax
    desc: "matplotlib Axes the density is drawn on."
  - name: xs
    desc: "X coordinates of the points (longitude or projected x)."
  - name: ys
    desc: "Y coordinates of the points (latitude or projected y)."
  - name: cmap
    desc: "Matplotlib colormap name for the density fill. Default 'magma'."
  - name: alpha
    desc: "Opacity of the density overlay (0-1). Default 0.35."
  - name: thresh
    desc: "Density threshold below which contours are not drawn (0-1). Default 0.02."
  - name: levels
    desc: "Number of contour levels. Default 30."
  - name: bw_adjust
    desc: "Bandwidth adjustment factor for the kernel. Values < 1 produce more detailed estimates. Default 0.6."
output: "None. Modifica el Axes in-place añadiendo los contornos de densidad."
tested: true
tests:
- "50 puntos aleatorios no lanza excepción"
- "arrays vacíos retorna sin error"
test_file_path: "python/functions/datascience/tests/test_plot_kde_2d.py"
file_path: "python/functions/datascience/plot_kde_2d.py"
source_repo: "internal:footprint_aurgi"
source_license: "internal-aurgi"
source_file: "ponderacion_isochronas/src/recomendador_centros.py:275"
---
## Ejemplo
```python
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
import numpy as np
from datascience.plot_kde_2d import plot_kde_2d
rng = np.random.default_rng(42)
xs = rng.normal(0, 1, 200)
ys = rng.normal(0, 1, 200)
fig, ax = plt.subplots()
plot_kde_2d(ax, xs, ys, cmap="viridis", alpha=0.5)
fig.savefig("kde.png")
```
## Notas
Requiere seaborn y numpy. El parámetro `fill=True` se pasa a seaborn.kdeplot para renderizar contornos rellenos (disponible desde seaborn 0.11). Arrays vacíos se detectan con `np.asarray(xs).size == 0` antes de llamar a seaborn para evitar errores internos.
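
The role of `bw_adjust` can be illustrated with a one-dimensional Gaussian KDE written with the standard library only. This is a simplified sketch under stated assumptions, not what seaborn executes: `kde_1d` is a hypothetical helper, it works in 1D instead of 2D, and it assumes the base bandwidth follows Scott's rule (`n ** (-1/5)` times the sample standard deviation) before being multiplied by `bw_adjust`.

```python
import math
import statistics


def kde_1d(x, data, bw_adjust=1.0):
    """Gaussian KDE at x; bandwidth = Scott factor * sample std * bw_adjust."""
    n = len(data)
    h = n ** (-1 / 5) * statistics.stdev(data) * bw_adjust
    total = sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in data)
    return total / (n * h * math.sqrt(2 * math.pi))


# A tight cluster around 0 plus one distant point at 5.
data = [-0.1, 0.0, 0.1, 5.0]
print(kde_1d(0.0, data, bw_adjust=1.0))  # density at the cluster, wide kernel
print(kde_1d(0.0, data, bw_adjust=0.6))  # density at the cluster, narrow kernel
print(kde_1d(2.5, data, bw_adjust=1.0))  # density in the gap, wide kernel
print(kde_1d(2.5, data, bw_adjust=0.6))  # density in the gap, narrow kernel
```

A smaller factor concentrates the density on the cluster and lowers it in the empty gap, which is the "more detailed" behaviour the parameter description refers to.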
@@ -0,0 +1,53 @@
"""Plot a 2D KDE density overlay on a matplotlib Axes using seaborn."""

from __future__ import annotations


def plot_kde_2d(
    ax: "Axes",
    xs: "list[float] | np.ndarray",
    ys: "list[float] | np.ndarray",
    cmap: str = "magma",
    alpha: float = 0.35,
    thresh: float = 0.02,
    levels: int = 30,
    bw_adjust: float = 0.6,
) -> None:
    """Plot a 2D kernel density estimate as a filled contour overlay.

    Uses seaborn.kdeplot to render a smooth density surface over the given
    scatter of (x, y) points. If either array is empty the function returns
    immediately without painting anything.

    Args:
        ax: matplotlib Axes to draw on.
        xs: X coordinates (longitude or projected x).
        ys: Y coordinates (latitude or projected y).
        cmap: Matplotlib colormap name for the density fill. Default "magma".
        alpha: Opacity of the density overlay (0-1). Default 0.35.
        thresh: Density threshold below which contours are not drawn (0-1).
            Default 0.02 removes very sparse outlier contours.
        levels: Number of contour levels. Default 30.
        bw_adjust: Bandwidth adjustment factor for the kernel. Values < 1
            produce tighter, more detailed estimates. Default 0.6.
    """
    import numpy as np  # type: ignore
    import seaborn as sns  # type: ignore

    xs_arr = np.asarray(xs)
    ys_arr = np.asarray(ys)

    # Guard against empty input before handing the data to seaborn.
    if xs_arr.size == 0 or ys_arr.size == 0:
        return

    sns.kdeplot(
        x=xs_arr,
        y=ys_arr,
        ax=ax,
        cmap=cmap,
        fill=True,
        alpha=alpha,
        thresh=thresh,
        levels=levels,
        bw_adjust=bw_adjust,
    )
@@ -0,0 +1,65 @@
---
name: rebel_load_model
kind: function
lang: py
domain: datascience
version: "1.0.0"
purity: impure
signature: "def rebel_load_model(model_name: str = 'Babelscape/rebel-large') -> tuple[Any, Any]"
description: "Carga (y cachea) el tokenizer y modelo REBEL (BART-based, ~1.5 GB). Solo ingles. Licencia Apache 2.0 — uso comercial permitido. Cache por model_name."
tags: [rebel, relation-extraction, nlp, model, huggingface, english, seq2seq, apache2, datascience, python]
uses_functions: []
uses_types: []
returns: []
returns_optional: false
error_type: "error_go_core"
imports: [transformers]
params:
  - name: model_name
    desc: "HuggingFace Hub model ID (default: Babelscape/rebel-large, BART ~1.5 GB, EN only)"
output: "tuple (tokenizer, model) ready for inference, cached by model_name."
tested: false
tests: []
test_file_path: ""
file_path: "python/functions/datascience/rebel_load_model.py"
notes: |
  LICENSE: Apache 2.0, commercial use allowed (unlike mREBEL, which is CC BY-NC-SA).
  Only works well with ENGLISH text. For Spanish use mrebel_load_model.
  REBEL uses the same wire format as mREBEL, so parse_rebel_output is compatible.
  Difference vs mREBEL: it does not emit the tp_XX language prefix in the output
  (parse_rebel_output copes with both because it already does .replace('tp_XX', '')).
  impure: hits network/disk on the first call, keeps state in _MODEL_CACHE.
  Cache is separate from mrebel_load_model (different module).
---
## Ejemplo
```python
from python.functions.datascience.rebel_load_model import rebel_load_model
from python.functions.datascience.parse_rebel_output import parse_rebel_output
tokenizer, model = rebel_load_model()
text = "Pablo Isla is the CEO of Inditex, based in Arteixo."
inputs = tokenizer(text, return_tensors="pt", max_length=512, truncation=True)
generated = model.generate(**inputs, num_beams=4, length_penalty=1.0, max_length=256)
decoded = tokenizer.decode(generated[0], skip_special_tokens=False)
triplets = parse_rebel_output(decoded)
```
## Comparacion REBEL vs mREBEL
| | REBEL | mREBEL |
|---|---|---|
| Licencia | Apache 2.0 (comercial OK) | CC BY-NC-SA 4.0 (no comercial) |
| Idiomas | Solo ingles | 30+ (es_XX, en_XX, fr_XX...) |
| Tamanio | ~1.5 GB | ~2.4 GB (large) / ~900 MB (base) |
| Base | BART | mBART-50 |
## Tamanio y latencia
- `Babelscape/rebel-large`: ~1.5 GB en disco.
- Primera carga: 20-60 s en CPU.
- Inferencia CPU: 3-10 s por frase (mas rapido que mREBEL por ser BART vs mBART).
@@ -0,0 +1,52 @@
"""Carga (y cachea) el modelo REBEL para extraccion de relaciones en ingles."""

from __future__ import annotations

from typing import Any

# Global cache: model_name -> (tokenizer, model)
_MODEL_CACHE: dict[str, tuple[Any, Any]] = {}


def rebel_load_model(
    model_name: str = "Babelscape/rebel-large",
) -> tuple[Any, Any]:
    """Loads (and caches) the REBEL tokenizer and model. English only.

    REBEL is a BART-based seq2seq model (~1.5 GB) for relation extraction,
    trained on the REBEL dataset (English Wikipedia abstracts aligned with
    Wikidata). It extracts triplets (head, relation, tail) from English text.

    LICENSE: Apache 2.0 (commercial use permitted).

    The first call downloads the model from HuggingFace Hub (~1.5 GB).
    Subsequent calls with the same ``model_name`` return the cached instance.

    Args:
        model_name: HuggingFace Hub model ID. Default is the large variant.

    Returns:
        Tuple ``(tokenizer, model)`` both ready for inference.

    Raises:
        ImportError: if ``transformers`` is not installed.
        OSError: if the model cannot be downloaded or loaded from disk.
    """
    cached = _MODEL_CACHE.get(model_name)
    if cached is not None:
        return cached

    try:
        from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
    except ImportError as exc:
        raise ImportError(
            "transformers no esta instalado. Instalalo con "
            "`uv pip install transformers` o `uv pip install -e '.[nlp]'`."
        ) from exc

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
    model.eval()  # disable dropout for inference

    _MODEL_CACHE[model_name] = (tokenizer, model)
    return tokenizer, model
@@ -0,0 +1,52 @@
---
name: remove_words_from_column
kind: function
lang: py
domain: datascience
version: "1.0.0"
purity: pure
signature: "def remove_words_from_column(values: Iterable[str | None], words: list[str]) -> list[str]"
description: "Elimina palabras especificas de un iterable de strings usando regex de palabra completa (\\b). Case-insensitive. Colapsa espacios multiples y hace strip. None se convierte en cadena vacia. Sin pandas."
tags: [text, cleaning, regex, words, nlp, datascience]
params:
  - name: values
    desc: Iterable of strings or None to clean.
  - name: words
    desc: List of words to remove. Case-insensitive, whole-word matching (no partial matches).
output: "Lista de strings con las palabras eliminadas y espacios normalizados. Misma longitud que el input."
uses_functions: []
uses_types: []
returns: []
returns_optional: false
error_type: ""
imports: []
tested: true
tests:
- "elimina palabras case insensitive"
- "none devuelve string vacio"
- "colapsa espacios multiples"
- "palabras vacias no modifica"
- "palabra completa no parcial"
- "lista vacia"
test_file_path: "python/functions/datascience/tests/test_remove_words_from_column.py"
file_path: "python/functions/datascience/remove_words_from_column.py"
source_repo: "internal:footprint_aurgi"
source_license: "internal-aurgi"
source_file: "fuzzy_joins/arreglo_fuzzy.py"
---
## Ejemplo
```python
from remove_words_from_column import remove_words_from_column
result = remove_words_from_column(
    ["Calle Mayor 14", "Avenida del Sol"],
    words=["calle", "avenida", "del"],
)
# ["Mayor 14", "Sol"]
```
## Notas
El patron regex se compila una sola vez para todo el iterable (eficiente). Usa \\b para no eliminar palabras parciales ("calle" no toca "calleja"). None en el input produce "" en el output.
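
The whole-word behaviour described above can be checked in isolation. The sketch below rebuilds the same kind of pattern for a hypothetical word list and applies it to two strings; `strip_words` is an illustrative helper, not part of the registry.

```python
import re

words = ["calle", "del"]
# One alternation wrapped in \b...\b, compiled once and reused for every string.
pattern = re.compile(
    r"\b(" + "|".join(re.escape(w) for w in words) + r")\b",
    flags=re.IGNORECASE,
)


def strip_words(text: str) -> str:
    """Drop whole-word matches, then collapse runs of whitespace."""
    return re.sub(r"\s+", " ", pattern.sub("", text)).strip()


print(strip_words("CALLE del Sol"))    # both words removed, case ignored
print(strip_words("Calleja del Rio"))  # "Calleja" survives: no word boundary after "calle"
```

`re.escape` keeps words containing regex metacharacters (for example a trailing dot) from being interpreted as pattern syntax.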
@@ -0,0 +1,42 @@
"""Elimina palabras especificas de una lista de strings."""

from __future__ import annotations

import re
from typing import Iterable


def remove_words_from_column(
    values: Iterable[str | None],
    words: list[str],
) -> list[str]:
    """Remove words from a list of strings using a whole-word regex.

    For each string, applies a case-insensitive regex \\b(w1|w2|...)\\b,
    replaces matches with the empty string, collapses repeated whitespace
    and strips the result. None becomes the empty string.

    Args:
        values: Iterable of strings (or None) to clean.
        words: List of words to remove (case-insensitive).

    Returns:
        List of strings with the words removed and whitespace normalized.
    """
    if not words:
        # Nothing to remove: strings pass through unchanged, None becomes "".
        return [v if v is not None else "" for v in values]

    # Compile once and reuse the pattern for every value.
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(w) for w in words) + r")\b",
        flags=re.IGNORECASE,
    )

    result = []
    for value in values:
        if value is None:
            result.append("")
            continue
        cleaned = pattern.sub("", str(value))
        cleaned = re.sub(r"\s+", " ", cleaned).strip()
        result.append(cleaned)
    return result
@@ -0,0 +1,61 @@
---
name: spacy_es_load_model
kind: function
lang: py
domain: datascience
version: "1.0.0"
purity: impure
signature: "def spacy_es_load_model(model_name: str = 'es_core_news_md') -> Any"
description: "Carga (y cachea) un modelo spaCy en castellano. Provee POS, dependencias y NER (PER, ORG, LOC, MISC). Usado por extract_triples_spacy_es para OpenIE schema-less. LICENSE: spaCy MIT + es_core_news_md CC BY-SA 4.0."
tags: [spacy, nlp, spanish, ner, dependency-parsing, openie, model, datascience, python, mit, cc-by-sa]
uses_functions: []
uses_types: []
returns: []
returns_optional: false
error_type: "error_go_core"
imports: [spacy]
params:
  - name: model_name
    desc: "Name of the installed spaCy model. Default: es_core_news_md (balance of accuracy and size). Alternatives: es_core_news_sm (smaller, less accurate), es_core_news_lg (larger, more accurate)."
output: "Instancia spaCy Language cacheada por model_name. Provee nlp(text) -> Doc con tokens, POS, deps y ents."
tested: true
tests:
- "cache devuelve la misma instancia"
- "OSError si el modelo no esta instalado"
test_file_path: "python/functions/datascience/tests/test_spacy_es_load_model.py"
file_path: "python/functions/datascience/spacy_es_load_model.py"
notes: |
  LICENSE: spaCy is MIT. The es_core_news_md model uses weights trained on
  the CoNLL-2002 corpus (CC BY-SA 4.0). Commercial use allowed with attribution.
  Install the model before use:
    python -m spacy download es_core_news_md
  impure: loads the model from disk on the first call, keeps state in _MODEL_CACHE.
  Size: es_core_news_md ~43 MB. First load ~1-3 s on CPU.
---
## Ejemplo
```python
from datascience.spacy_es_load_model import spacy_es_load_model
nlp = spacy_es_load_model()
doc = nlp("Carlos Torres preside BBVA en Bilbao.")
for ent in doc.ents:
    print(ent.text, ent.label_)
# Carlos Torres PER
# BBVA ORG
# Bilbao LOC
```
## Instalacion
```bash
# En el venv del registry:
python/.venv/bin/python3 -m spacy download es_core_news_md
# O via uv:
cd python && uv run python -m spacy download es_core_news_md
```
@@ -0,0 +1,40 @@
"""Carga (y cachea) un modelo spaCy en castellano para NER y OpenIE.
LICENSE: spaCy = MIT. Modelo es_core_news_md = CC BY-SA 4.0 (datos CoNLL-2002).
"""

from __future__ import annotations

from typing import Any

# Global cache: model_name -> spaCy nlp instance
_MODEL_CACHE: dict[str, Any] = {}


def spacy_es_load_model(model_name: str = "es_core_news_md") -> Any:
    """Load (and cache) a spaCy Spanish language model.

    The model provides dependency parsing, POS tagging and NER (PER, ORG, LOC, MISC).
    Used by extract_triples_spacy_es for schema-less OpenIE in Spanish.

    LICENSE: spaCy = MIT. es_core_news_md = CC BY-SA 4.0 (CoNLL-2002 corpus).

    Args:
        model_name: Name of the spaCy model. Default: es_core_news_md.
            Alternatives: es_core_news_sm (smaller), es_core_news_lg (larger).

    Returns:
        spaCy Language instance cached by model_name.

    Raises:
        OSError: If the model is not installed. Install with:
            python -m spacy download es_core_news_md
    """
    if model_name in _MODEL_CACHE:
        return _MODEL_CACHE[model_name]

    import spacy  # type: ignore[import]

    nlp = spacy.load(model_name)
    _MODEL_CACHE[model_name] = nlp
    return nlp
@@ -0,0 +1,38 @@
---
id: summary_stats_py_datascience
name: summary_stats
kind: function
lang: py
domain: datascience
version: "1.0.0"
purity: pure
signature: "def summary_stats(values: list[float]) -> dict"
description: "Returns basic descriptive statistics (n, mean, median, p25, p75) for a list of floats. Empty input returns n=0 and nan for all numeric fields."
tags: [statistics, descriptive, eda, summary, percentile]
uses_functions: []
uses_types: []
returns: []
returns_optional: false
error_type: ""
imports: [math, numpy]
example: |
  from summary_stats import summary_stats
  result = summary_stats([1, 2, 3, 4, 5])
tested: true
tests:
- "test_summary_stats_basic"
- "test_summary_stats_empty"
- "test_summary_stats_single"
- "test_summary_stats_keys"
test_file_path: "python/functions/datascience/tests/test_summary_stats.py"
file_path: "python/functions/datascience/summary_stats.py"
params:
  - name: values
    desc: "List of numeric values to summarize."
output: "Dict with n (int), mean, median, p25, p75 (floats). All floats are math.nan when values is empty."
source_repo: "internal:footprint_aurgi"
source_license: "internal-aurgi"
source_file: "ponderacion_isochronas/example/models/eda/utils.py:60"
---
Funcion pura minimal para EDA rapido. No incluye std, min, max ni otros percentiles — mantener la interfaz pequena.
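
For environments without NumPy, the same five fields can be approximated with the standard library. This is a sketch, not the registry implementation: `summary_stats_stdlib` is a hypothetical name, and it assumes that `statistics.quantiles(..., method="inclusive")` interpolates the same way as NumPy's default percentile method.

```python
import math
import statistics


def summary_stats_stdlib(values):
    """Return n, mean, median, p25 and p75 using only the standard library."""
    if not values:
        return {"n": 0, "mean": math.nan, "median": math.nan,
                "p25": math.nan, "p75": math.nan}
    data = [float(v) for v in values]
    if len(data) == 1:
        # statistics.quantiles needs at least two points on older Pythons.
        p25 = p75 = data[0]
    else:
        # method="inclusive" interpolates linearly between order statistics.
        p25, _, p75 = statistics.quantiles(data, n=4, method="inclusive")
    return {
        "n": len(data),
        "mean": statistics.fmean(data),
        "median": statistics.median(data),
        "p25": p25,
        "p75": p75,
    }


print(summary_stats_stdlib([1, 2, 3, 4, 5]))
```

As in the registry function, empty input yields `n=0` with `nan` for every numeric field, so callers can test with `math.isnan` without handling exceptions.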
@@ -0,0 +1,36 @@
"""summary_stats — Compute descriptive statistics for a numeric list."""

import math

import numpy as np


def summary_stats(values: list[float]) -> dict:
    """Return basic descriptive statistics for a list of floats.

    Args:
        values: List of numeric values.

    Returns:
        Dict with keys:
            "n" (int): number of elements.
            "mean" (float): arithmetic mean, or math.nan if empty.
            "median" (float): median, or math.nan if empty.
            "p25" (float): 25th percentile, or math.nan if empty.
            "p75" (float): 75th percentile, or math.nan if empty.
    """
    if not values:
        return {
            "n": 0,
            "mean": math.nan,
            "median": math.nan,
            "p25": math.nan,
            "p75": math.nan,
        }

    arr = np.array(values, dtype=float)
    return {
        "n": int(len(arr)),
        "mean": float(np.mean(arr)),
        "median": float(np.median(arr)),
        "p25": float(np.percentile(arr, 25)),
        "p75": float(np.percentile(arr, 75)),
    }
@@ -0,0 +1,103 @@
"""Tests para align_relations_to_entities."""

import os
import sys

sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "..", "..", ".."))

from python.functions.datascience.align_relations_to_entities import align_relations_to_entities


def _t(head, head_type, relation, tail, tail_type):
    return {
        "head": head,
        "head_type": head_type,
        "type": relation,
        "tail": tail,
        "tail_type": tail_type,
    }


def test_match_exacto_case_insensitive_resuelve_correctamente():
    triplets = [_t("pablo isla", "per", "employer", "inditex", "org")]
    entities = ["Pablo Isla", "Inditex"]
    result = align_relations_to_entities(triplets, entities)
    assert len(result) == 1
    assert result[0]["from"] == "Pablo Isla"
    assert result[0]["to"] == "Inditex"
    assert result[0]["kind"] == "employer"


def test_substring_entity_en_span_del_head():
    # mREBEL emits "esta en Bilbao" but the entity is "Bilbao".
    triplets = [_t("esta en Bilbao", "loc", "located in", "Espana", "loc")]
    entities = ["Bilbao", "Espana"]
    result = align_relations_to_entities(triplets, entities)
    assert len(result) == 1
    assert result[0]["from"] == "Bilbao"
    assert result[0]["to"] == "Espana"


def test_substring_span_dentro_del_nombre_de_entidad():
    # The span "Santander" is contained in the entity name "Banco Santander".
    triplets = [_t("Santander", "org", "owns", "Openbank", "org")]
    entities = ["Banco Santander", "Openbank"]
    result = align_relations_to_entities(triplets, entities)
    assert len(result) == 1
    assert result[0]["from"] == "Banco Santander"
    assert result[0]["to"] == "Openbank"


def test_gana_nombre_de_entidad_mas_largo_en_ambiguedad():
    # Two candidate entities, "Madrid" and "Comunidad de Madrid", both contain
    # the span "Madrid". The substring fallback walks names_by_len (sorted by
    # length, descending), so on its own it would pick "Comunidad de Madrid".
    # The exact case-insensitive match runs first, though, and resolves
    # "Madrid" -> "Madrid" before the substring search is reached.
    triplets = [_t("Madrid", "loc", "capital of", "Espana", "loc")]
    entities = ["Madrid", "Comunidad de Madrid", "Espana"]
    result = align_relations_to_entities(triplets, entities)
    assert len(result) == 1
    # The test therefore only checks that the ambiguity does not break the
    # alignment and that from/to are values taken from entities.
    assert result[0]["from"] in entities
    assert result[0]["to"] in entities


def test_triplet_sin_match_se_descarta():
    triplets = [_t("Unknown Entity", "per", "works for", "Another Unknown", "org")]
    entities = ["Pablo Isla", "Inditex"]
    result = align_relations_to_entities(triplets, entities)
    assert result == []


def test_triplet_con_head_igual_tail_se_descarta_self_loop():
    triplets = [_t("Inditex", "org", "owns", "Inditex", "org")]
    entities = ["Inditex", "Zara"]
    result = align_relations_to_entities(triplets, entities)
    assert result == []


def test_lista_triplets_vacia_retorna_vacia():
    result = align_relations_to_entities([], ["Pablo Isla", "Inditex"])
    assert result == []


def test_lista_entity_names_vacia_retorna_vacia():
    triplets = [_t("Pablo Isla", "per", "employer", "Inditex", "org")]
    result = align_relations_to_entities(triplets, [])
    assert result == []


def test_multiples_triplets_con_mezcla_de_matches_y_descartes():
    triplets = [
        _t("Pablo Isla", "per", "employer", "Inditex", "org"),  # match
        _t("Ghost Entity", "per", "employer", "Inditex", "org"),  # head has no match
        _t("Pablo Isla", "per", "employer", "Pablo Isla", "per"),  # self-loop
    ]
    entities = ["Pablo Isla", "Inditex"]
    result = align_relations_to_entities(triplets, entities)
    assert len(result) == 1
    assert result[0]["from"] == "Pablo Isla"
    assert result[0]["to"] == "Inditex"
@@ -0,0 +1,38 @@
"""Tests para alpha_shape_concave_hull."""
import sys
import os
sys.path.insert(0, os.path.join(os.path.dirname(__file__), ".."))
from alpha_shape_concave_hull import alpha_shape_concave_hull
def test_alpha_shape_square_large_alpha():
"""4 corner points with large alpha should return a geometry."""
pts = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
result = alpha_shape_concave_hull(pts, alpha=10.0)
assert result is not None
def test_alpha_shape_too_few_points():
result = alpha_shape_concave_hull([(0, 0), (1, 0), (0, 1)], alpha=10.0)
assert result is None
def test_alpha_shape_very_small_alpha_returns_none():
"""Alpha so small that no triangle circumradius fits."""
pts = [(0.0, 0.0), (100.0, 0.0), (100.0, 100.0), (0.0, 100.0)]
result = alpha_shape_concave_hull(pts, alpha=0.0001)
assert result is None
def test_alpha_shape_5_points_returns_geometry():
pts = [
(0.0, 0.0),
(2.0, 0.0),
(2.0, 2.0),
(0.0, 2.0),
(1.0, 1.0),
]
result = alpha_shape_concave_hull(pts, alpha=5.0)
assert result is not None
@@ -0,0 +1,47 @@
"""Tests para best_central_tendency."""
import math
import sys
import os
sys.path.insert(0, os.path.join(os.path.dirname(__file__), ".."))
from best_central_tendency import best_central_tendency
def test_best_central_tendency_normal_ish():
label, value = best_central_tendency([1, 2, 3, 4, 5], "normal-ish")
assert label == "mean"
assert abs(value - 3.0) < 1e-9
def test_best_central_tendency_right_skewed():
label, value = best_central_tendency([1, 2, 3, 4, 5], "right-skewed")
assert label == "median"
assert abs(value - 3.0) < 1e-9
def test_best_central_tendency_left_skewed():
label, value = best_central_tendency([1, 2, 3, 4, 5], "left-skewed")
assert label == "median"
def test_best_central_tendency_lognormal_ish():
label, value = best_central_tendency([1, 2, 4, 8], "lognormal-ish")
assert label == "geometric_mean"
assert abs(value - 2 ** 1.5) < 1e-6
def test_best_central_tendency_heavy_tail():
label, value = best_central_tendency([1, 2, 3, 4, 5, 100], "heavy-tail")
assert label == "trimmed_mean_5%"
assert not math.isnan(value)
def test_best_central_tendency_empty():
label, value = best_central_tendency([], "normal-ish")
assert math.isnan(value)
def test_best_central_tendency_default():
label, value = best_central_tendency([1, 2, 3, 4, 5], "other")
assert label == "median"
@@ -0,0 +1,45 @@
"""Tests para detect_distribution_type."""
import sys
import os
sys.path.insert(0, os.path.join(os.path.dirname(__file__), ".."))
from detect_distribution_type import detect_distribution_type
import numpy as np
def test_detect_too_few_samples():
result = detect_distribution_type([1] * 5)
assert result["type"] == "too_few_samples"
def test_detect_normal_ish():
rng = np.random.default_rng(42)
values = rng.normal(0, 1, 200).tolist()
result = detect_distribution_type(values)
assert result["type"] == "normal-ish", f"Got {result['type']}"
def test_detect_right_skewed():
rng = np.random.default_rng(0)
# Exponential distribution is heavily right-skewed
values = rng.exponential(scale=1.0, size=200).tolist()
result = detect_distribution_type(values)
assert result["type"] in ("right-skewed", "lognormal-ish", "heavy-tail"), f"Got {result['type']}"
def test_detect_stats_keys():
rng = np.random.default_rng(7)
values = rng.normal(5, 2, 100).tolist()
result = detect_distribution_type(values)
assert "stats" in result
assert "n" in result["stats"]
assert result["stats"]["n"] == 100
def test_detect_exactly_30():
rng = np.random.default_rng(1)
values = rng.normal(0, 1, 30).tolist()
result = detect_distribution_type(values)
assert result["type"] != "too_few_samples"
@@ -0,0 +1,67 @@
"""Tests para extract_graph_gliner2.
Usa un stub GLiNER2 para validar el contrato sin descargar el modelo real.
"""
from __future__ import annotations
import os
import sys
import pytest
sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "..", "..", ".."))
from python.functions.datascience.extract_graph_gliner2 import extract_graph_gliner2
class _Schema:
def entities(self, labels):
self._entities = labels
return self
def relations(self, labels):
self._relations = labels
return self
class _StubModel:
"""Stub que devuelve entidades y relaciones conocidas."""
_extract_result = {
"entities": {"person": ["Pablo Isla"], "organization": ["Inditex"]},
"relation_extraction": {"ceo_of": [("Pablo Isla", "Inditex")]},
}
def create_schema(self):
return _Schema()
def extract(self, text, schema=None, threshold=0.3, include_confidence=False):
return self._extract_result
def test_output_tiene_claves_entities_relation_extraction_elapsed_s():
"""output tiene claves entities relation_extraction elapsed_s"""
result = extract_graph_gliner2(
text="Pablo Isla es CEO de Inditex.",
entity_labels=["person", "organization"],
relation_labels=["ceo_of"],
model=_StubModel(),
)
assert "entities" in result
assert "relation_extraction" in result
assert "elapsed_s" in result
assert isinstance(result["elapsed_s"], float)
def test_stub_model_retorna_shape_correcto():
"""stub model retorna shape correcto"""
result = extract_graph_gliner2(
text="Texto cualquiera.",
entity_labels=["person"],
relation_labels=["works_at"],
model=_StubModel(),
threshold=0.3,
)
assert result["entities"] == {"person": ["Pablo Isla"], "organization": ["Inditex"]}
assert "ceo_of" in result["relation_extraction"]
@@ -0,0 +1,112 @@
"""Tests para extract_relations_mrebel con stubs de modelo."""
import os
import sys
sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "..", "..", ".."))
from python.functions.datascience.extract_relations_mrebel import extract_relations_mrebel
from python.types.datascience.entity_candidate import EntityCandidate
from python.types.datascience.relation_candidate import RelationCandidate
# ---------------------------------------------------------------------------
# Stubs
# ---------------------------------------------------------------------------
class _TokenizerStub:
"""Tokenizer stub que devuelve inputs triviales y decodifica el wire format canonico."""
def __init__(self, decoded_output: str = ""):
self._decoded = decoded_output
def __call__(self, text, return_tensors=None, max_length=512, truncation=True):
return {"input_ids": [[1, 2, 3]]}
def decode(self, token_ids, skip_special_tokens=True):
return self._decoded
class _ModelStub:
"""Modelo stub que devuelve tokens triviales."""
def generate(self, input_ids=None, num_beams=4, length_penalty=1.0, max_length=256, **kwargs):
return [[10, 11, 12]]
# ---------------------------------------------------------------------------
# Tests
# ---------------------------------------------------------------------------
def test_flujo_completo_con_stub_produce_relation_candidates_correctos():
# Wire format canonico con un triplet valido
decoded = "<triplet> Pablo Isla <per> Inditex <org> employer"
tok = _TokenizerStub(decoded_output=decoded)
model = _ModelStub()
entities = [
EntityCandidate(name="Pablo Isla", type_label="PER", confidence=0.95),
EntityCandidate(name="Inditex", type_label="ORG", confidence=0.92),
]
text = "Pablo Isla es el presidente de Inditex."
result = extract_relations_mrebel(text, entities, tok, model)
assert len(result) == 1
rc = result[0]
assert isinstance(rc, RelationCandidate)
assert rc.from_name == "Pablo Isla"
assert rc.to_name == "Inditex"
assert rc.relation_type == "employer"
assert rc.confidence == 1.0
def test_menos_de_2_entidades_retorna_vacio():
tok = _TokenizerStub()
model = _ModelStub()
entities = [EntityCandidate(name="Pablo Isla", type_label="PER")]
result = extract_relations_mrebel("Texto cualquiera.", entities, tok, model)
assert result == []
def test_texto_vacio_retorna_vacio():
tok = _TokenizerStub()
model = _ModelStub()
entities = [
EntityCandidate(name="A", type_label="PER"),
EntityCandidate(name="B", type_label="ORG"),
]
assert extract_relations_mrebel("", entities, tok, model) == []
def test_triplets_no_alineables_se_descartan():
# El stub emite entidades que no estan en la lista
decoded = "<triplet> Ghost Entity <per> Unknown Org <org> some relation"
tok = _TokenizerStub(decoded_output=decoded)
model = _ModelStub()
entities = [
EntityCandidate(name="Pablo Isla", type_label="PER"),
EntityCandidate(name="Inditex", type_label="ORG"),
]
result = extract_relations_mrebel("Texto largo suficiente.", entities, tok, model)
assert result == []
def test_multiples_frases_generan_multiples_candidates():
# El stub siempre emite el mismo triplet valido — una por frase
decoded = "<triplet> Pablo Isla <per> Inditex <org> employer"
tok = _TokenizerStub(decoded_output=decoded)
model = _ModelStub()
entities = [
EntityCandidate(name="Pablo Isla", type_label="PER"),
EntityCandidate(name="Inditex", type_label="ORG"),
]
# Dos frases separadas por ". "
text = "Pablo Isla es el presidente de Inditex. Inditex tiene sedes en todo el mundo."
result = extract_relations_mrebel(text, entities, tok, model)
# Puede haber 1 o 2 dependiendo de la dedup — lo importante es que no es vacio
assert len(result) >= 1
assert all(isinstance(rc, RelationCandidate) for rc in result)
@@ -0,0 +1,81 @@
"""Tests para extract_triples_spacy_es.
Requiere spaCy y es_core_news_md instalados. Si no estan, los tests se omiten.
"""
from __future__ import annotations
import os
import sys
import pytest
sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "..", "..", ".."))
from python.functions.datascience.extract_triples_spacy_es import extract_triples_spacy_es
spacy = pytest.importorskip("spacy", reason="spacy not installed — skip")
def _load_nlp():
try:
return spacy.load("es_core_news_md")
except OSError:
return None
_NLP = _load_nlp()
pytestmark = pytest.mark.skipif(
_NLP is None,
reason="es_core_news_md not installed — run: python -m spacy download es_core_news_md",
)
def test_oracion_simple_produce_tripleta_con_sujeto_verbo_objeto():
    """A simple sentence yields a subject-verb-object triple."""
    result = extract_triples_spacy_es("Enmanuel quiere a Ashlly.", _NLP)
    assert len(result["triples"]) >= 1
    # At least one triple whose subject contains "Enmanuel" (case-insensitive).
    subjs = [t["subject"] for t in result["triples"]]
    assert any("enmanuel" in s.lower() for s in subjs)


def test_carlos_torres_preside_bbva():
    """'Carlos Torres preside BBVA' yields a triple whose relation contains 'presidir'."""
    result = extract_triples_spacy_es("Carlos Torres preside BBVA.", _NLP)
    triples = result["triples"]
    assert len(triples) >= 1
    rels = [t["relation"] for t in triples]
    assert any("presidir" in r.lower() for r in rels)
def test_amancio_ortega_fundo_inditex_en_1985():
    """'Amancio Ortega fundo Inditex en 1985' yields triples for the verb fundar."""
    result = extract_triples_spacy_es(
        "Amancio Ortega fundo Inditex en 1985.", _NLP
    )
    triples = result["triples"]
    # The verb may yield two triples (Inditex as direct object, 1985 as
    # oblique); the assertions below only require one of them.
    assert len(triples) >= 1
    subjs = {t["subject"] for t in triples}
    assert any("Amancio" in s or "Ortega" in s for s in subjs)
    # At least one triple must have Inditex or 1985 as its object.
    objects = {t["object"] for t in triples}
    assert any("Inditex" in o or "1985" in o for o in objects)
def test_texto_sin_verbos_produce_tripletas_vacias():
"""texto sin verbos produce tripletas vacias"""
result = extract_triples_spacy_es("BBVA Santander Inditex.", _NLP)
assert result["triples"] == []
def test_entities_ner_detecta_categorias():
"""entities NER detecta PER ORG LOC"""
result = extract_triples_spacy_es(
"Carlos Torres es presidente de BBVA en Bilbao.", _NLP
)
ents = result["entities"]
labels = {e["label"] for e in ents}
# Debe detectar al menos uno de PER, ORG o LOC
assert labels & {"PER", "ORG", "LOC"}
@@ -0,0 +1,67 @@
"""Tests para fuzzy_merge_adaptive."""
import sys
import os
sys.path.insert(0, os.path.join(os.path.dirname(__file__), ".."))
from fuzzy_merge_adaptive import fuzzy_merge_adaptive
def test_left_join_con_typo():
left = [{"name": "Madrid"}, {"name": "Barclona"}]
right = [{"name": "Madrid", "cp": "28"}, {"name": "Barcelona", "cp": "08"}]
result = fuzzy_merge_adaptive(left, right, left_key="name", right_key="name")
assert len(result) == 2
scores = [r["match_score"] for r in result]
assert all(s >= 80 for s in scores), f"Scores bajos: {scores}"
assert result[0]["cp"] == "28"
assert result[1]["cp"] == "08"
def test_inner_join_excluye_sin_match():
left = [{"name": "Madrid"}, {"name": "ZZZinexistente"}]
right = [{"name": "Madrid", "cp": "28"}]
result = fuzzy_merge_adaptive(
left, right, left_key="name", right_key="name",
thresholds=[90, 80, 70], how="inner"
)
assert len(result) == 1
assert result[0]["fuzzy_match"] == "Madrid"
def test_left_join_sin_match_devuelve_none():
left = [{"name": "ZZZinexistente"}]
right = [{"name": "Madrid", "cp": "28"}]
result = fuzzy_merge_adaptive(
left, right, left_key="name", right_key="name",
thresholds=[95], how="left"
)
assert len(result) == 1
assert result[0]["fuzzy_match"] is None
assert result[0]["match_score"] == 0
assert result[0]["threshold_used"] is None
def test_threshold_adaptativo():
left = [{"name": "Bcn"}]
right = [{"name": "Barcelona", "cp": "08"}]
result = fuzzy_merge_adaptive(
left, right, left_key="name", right_key="name",
thresholds=[90, 80, 70, 60, 50]
)
assert len(result) == 1
# Puede matchear o no segun score, pero threshold_used <= 90
if result[0]["threshold_used"] is not None:
assert result[0]["threshold_used"] <= 90
def test_colision_de_claves_usa_sufijos():
left = [{"name": "Madrid", "info": "left_info"}]
right = [{"name": "Madrid", "info": "right_info"}]
result = fuzzy_merge_adaptive(left, right, left_key="name", right_key="name")
assert len(result) == 1
assert "info_left" in result[0]
assert "info_right" in result[0]
assert result[0]["info_left"] == "left_info"
assert result[0]["info_right"] == "right_info"
@@ -0,0 +1,35 @@
"""Tests para geometric_mean."""
import math
import sys
import os
sys.path.insert(0, os.path.join(os.path.dirname(__file__), ".."))
from geometric_mean import geometric_mean
def test_geometric_mean_powers_of_two():
result = geometric_mean([1, 2, 4, 8])
expected = 2 ** 1.5 # ~2.828
assert abs(result - expected) < 1e-6, f"Expected ~{expected}, got {result}"
def test_geometric_mean_filters_non_positive():
result = geometric_mean([1, -2, 3])
expected = math.exp((math.log(1) + math.log(3)) / 2)
assert abs(result - expected) < 1e-6
def test_geometric_mean_empty_returns_nan():
result = geometric_mean([])
assert math.isnan(result)
def test_geometric_mean_all_negative_returns_nan():
result = geometric_mean([-1, -2, -3])
assert math.isnan(result)
def test_geometric_mean_single_positive():
result = geometric_mean([9.0])
assert abs(result - 9.0) < 1e-9
@@ -0,0 +1,84 @@
"""Tests para gliner2_load_model.
El modelo real (gliner2) es opcional. Los tests usan un stub para validar
el cache sin descargar el modelo. Tests que requieran el modelo real se
marcan con pytest.importorskip('gliner2').
"""
from __future__ import annotations
import os
import sys
import pytest
sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "..", "..", ".."))
from python.functions.datascience.gliner2_load_model import (
_MODEL_CACHE,
_resolve_device,
gliner2_load_model,
)
class _StubGLiNER2:
"""Stub duck-typed para validar el cache sin descargar el modelo real."""
@classmethod
def from_pretrained(cls, model_name: str) -> "_StubGLiNER2":
return cls()
def create_schema(self):
return self
def entities(self, labels):
return self
def relations(self, labels):
return self
def extract(self, text, **kwargs):
return {"entities": {}, "relation_extraction": {}}
def test_cache_devuelve_la_misma_instancia():
    """The cache returns the same instance for the same parameters."""
    _MODEL_CACHE.clear()
    # Seed the cache with a stub to simulate an earlier load, so the real
    # model is never imported or downloaded.
    key = ("fastino/gliner2-large-v1", "cpu")
    stub = _StubGLiNER2()
    _MODEL_CACHE[key] = stub
    # A call with the same parameters must return the cached object.
    result = gliner2_load_model(model_name="fastino/gliner2-large-v1", device="cpu")
    assert result is stub
    _MODEL_CACHE.clear()
def test_device_auto_resuelve_a_cpu_si_torch_no_esta(monkeypatch):
"""device=auto resuelve a cpu si torch no esta instalado"""
import sys
# Simular que torch no esta disponible
monkeypatch.setitem(sys.modules, "torch", None)
resolved = _resolve_device("auto")
assert resolved == "cpu"
def test_import_error_si_gliner2_no_esta_instalado():
    """Placeholder: skipped when gliner2 is missing; ImportError itself is not asserted yet."""
    pytest.importorskip("gliner2", reason="gliner2 not installed — skip real model test")
@@ -0,0 +1,46 @@
"""Tests para kde_density_levels."""
import sys
import os
import numpy as np
sys.path.insert(0, os.path.join(os.path.dirname(__file__), ".."))
from kde_density_levels import kde_density_levels
def test_kde_density_levels_returns_dict_for_50_points():
rng = np.random.default_rng(42)
xs = rng.normal(0, 1, 50).tolist()
ys = rng.normal(0, 1, 50).tolist()
result = kde_density_levels(xs, ys)
assert result is not None
assert "method" in result
assert result["method"] in ("kde", "hist")
assert "densities" in result
assert len(result["densities"]) == 50
assert "abs_level" in result
assert "dense_level" in result
def test_kde_density_levels_none_for_few_points():
result = kde_density_levels([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
assert result is None
def test_kde_density_levels_none_for_4_points():
result = kde_density_levels([1, 2, 3, 4], [1, 2, 3, 4])
assert result is None
def test_kde_density_levels_levels_ordered():
rng = np.random.default_rng(0)
xs = rng.uniform(0, 10, 100).tolist()
ys = rng.uniform(0, 10, 100).tolist()
result = kde_density_levels(xs, ys, abs_quantile=0.1, dense_quantile=0.85)
assert result is not None
assert result["abs_level"] <= result["dense_level"]
def test_kde_density_levels_mismatched_lengths():
result = kde_density_levels([1, 2, 3, 4, 5], [1, 2, 3])
assert result is None
@@ -0,0 +1,75 @@
"""Tests para parse_rebel_output."""
import os
import sys
sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "..", "..", ".."))
from python.functions.datascience.parse_rebel_output import parse_rebel_output
def test_string_vacio_retorna_lista_vacia():
assert parse_rebel_output("") == []
def test_string_solo_espacios_retorna_lista_vacia():
assert parse_rebel_output(" ") == []
def test_un_triplet_completo_retorna_un_dict_con_campos_correctos():
decoded = "tp_XX<triplet> Pablo Isla <per> Inditex <org> employer"
result = parse_rebel_output(decoded)
assert len(result) == 1
t = result[0]
assert t["head"] == "Pablo Isla"
assert t["head_type"] == "per"
assert t["tail"] == "Inditex"
assert t["tail_type"] == "org"
assert t["type"] == "employer"
def test_dos_triplets_retorna_dos_dicts():
decoded = (
"tp_XX<triplet> Pablo Isla <per> Inditex <org> employer "
"<triplet> Arteixo <loc> A Coruna <loc> located in the administrative territorial entity"
)
result = parse_rebel_output(decoded)
assert len(result) == 2
assert result[0]["head"] == "Pablo Isla"
assert result[0]["tail"] == "Inditex"
assert result[1]["head"] == "Arteixo"
assert result[1]["tail"] == "A Coruna"
assert "located" in result[1]["type"]
def test_triplet_incompleto_sin_cierre_no_rompe():
# Solo head span, sin tail ni relacion
decoded = "tp_XX<triplet> Pablo Isla"
result = parse_rebel_output(decoded)
# No hay cierre, puede retornar lista vacia o incompleta pero no rompe
assert isinstance(result, list)
def test_tokens_angulares_desconocidos_no_lanzan_excepcion():
# Un tipo desconocido como <unknown_type> no debe romper el parser
decoded = "<triplet> Entity One <unknown_type> Entity Two <org> some relation"
result = parse_rebel_output(decoded)
assert isinstance(result, list)
def test_sin_prefijo_tp_xx_funciona():
# REBEL monolingue no emite tp_XX
decoded = "<triplet> Barack Obama <per> United States <org> president of"
result = parse_rebel_output(decoded)
assert len(result) == 1
assert result[0]["head"] == "Barack Obama"
assert result[0]["tail"] == "United States"
assert result[0]["type"] == "president of"
def test_strip_tags_s_pad():
decoded = "<s><pad>tp_XX<triplet> Ana <per> BBVA <org> works at</s>"
result = parse_rebel_output(decoded)
assert len(result) == 1
assert result[0]["head"] == "Ana"
assert result[0]["tail"] == "BBVA"
@@ -0,0 +1,38 @@
"""Tests para plot_heatmap_log."""
import sys
from pathlib import Path
import matplotlib
matplotlib.use("Agg")
sys.path.insert(0, str(Path(__file__).parent.parent.parent))
from datascience.plot_heatmap_log import plot_heatmap_log
def test_100_puntos_no_lanza_excepcion():
import matplotlib.pyplot as plt
import numpy as np
rng = np.random.default_rng(0)
xs = rng.uniform(-4.0, -3.5, 100)
ys = rng.uniform(40.3, 40.6, 100)
fig, ax = plt.subplots()
plot_heatmap_log(ax, xs, ys, extent=(-4.0, -3.5, 40.3, 40.6), bins=50)
plt.close(fig)
def test_ax_tiene_imagen_tras_la_llamada():
import matplotlib.pyplot as plt
import numpy as np
rng = np.random.default_rng(1)
xs = rng.uniform(-4.0, -3.5, 100)
ys = rng.uniform(40.3, 40.6, 100)
fig, ax = plt.subplots()
plot_heatmap_log(ax, xs, ys, extent=(-4.0, -3.5, 40.3, 40.6), bins=50)
assert len(ax.images) > 0, "ax should have at least one image after heatmap"
plt.close(fig)
@@ -0,0 +1,32 @@
"""Tests para plot_kde_2d."""
import sys
from pathlib import Path
import matplotlib
matplotlib.use("Agg")
sys.path.insert(0, str(Path(__file__).parent.parent.parent))
from datascience.plot_kde_2d import plot_kde_2d
def test_50_puntos_aleatorios_no_lanza_excepcion():
import matplotlib.pyplot as plt
import numpy as np
rng = np.random.default_rng(42)
xs = rng.normal(0, 1, 50)
ys = rng.normal(0, 1, 50)
fig, ax = plt.subplots()
plot_kde_2d(ax, xs, ys)
plt.close(fig)
def test_arrays_vacios_retorna_sin_error():
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
plot_kde_2d(ax, [], [])
plt.close(fig)
@@ -0,0 +1,42 @@
"""Tests para remove_words_from_column."""
import sys
import os
sys.path.insert(0, os.path.join(os.path.dirname(__file__), ".."))
from remove_words_from_column import remove_words_from_column
def test_elimina_palabras_case_insensitive():
values = ["Calle Mayor 14", "Avenida del Sol"]
result = remove_words_from_column(values, words=["calle", "avenida", "del"])
assert result == ["Mayor 14", "Sol"]
def test_none_devuelve_string_vacio():
result = remove_words_from_column([None, "hola mundo"], words=["hola"])
assert result[0] == ""
assert result[1] == "mundo"
def test_colapsa_espacios_multiples():
result = remove_words_from_column(["uno dos tres"], words=["dos"])
assert result[0] == "uno tres"
def test_palabras_vacias_no_modifica():
values = ["hola mundo", "foo bar"]
result = remove_words_from_column(values, words=[])
assert result == ["hola mundo", "foo bar"]
def test_palabra_completa_no_parcial():
# "calle" no debe eliminar "calleja"
result = remove_words_from_column(["calleja mayor"], words=["calle"])
assert result[0] == "calleja mayor"
def test_lista_vacia():
result = remove_words_from_column([], words=["foo"])
assert result == []
@@ -0,0 +1,46 @@
"""Tests para spacy_es_load_model."""
from __future__ import annotations
import os
import sys
import pytest
sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "..", "..", ".."))
from python.functions.datascience.spacy_es_load_model import (
_MODEL_CACHE,
spacy_es_load_model,
)
spacy = pytest.importorskip("spacy", reason="spacy not installed — skip")
def _has_model(model_name: str) -> bool:
try:
spacy.load(model_name)
return True
except OSError:
return False
@pytest.mark.skipif(
not _has_model("es_core_news_md"),
reason="es_core_news_md not installed",
)
def test_cache_devuelve_la_misma_instancia():
"""cache devuelve la misma instancia"""
_MODEL_CACHE.clear()
m1 = spacy_es_load_model("es_core_news_md")
m2 = spacy_es_load_model("es_core_news_md")
assert m1 is m2
_MODEL_CACHE.clear()
def test_oserror_si_el_modelo_no_esta_instalado():
"""OSError si el modelo no esta instalado"""
_MODEL_CACHE.clear()
with pytest.raises(OSError):
spacy_es_load_model("es_nonexistent_model_xyz")
_MODEL_CACHE.clear()
@@ -0,0 +1,38 @@
"""Tests para summary_stats."""
import math
import sys
import os
sys.path.insert(0, os.path.join(os.path.dirname(__file__), ".."))
from summary_stats import summary_stats
def test_summary_stats_basic():
result = summary_stats([1, 2, 3, 4, 5])
assert result["n"] == 5
assert abs(result["mean"] - 3.0) < 1e-9
assert abs(result["median"] - 3.0) < 1e-9
assert abs(result["p25"] - 2.0) < 0.01
assert abs(result["p75"] - 4.0) < 0.01
def test_summary_stats_empty():
result = summary_stats([])
assert result["n"] == 0
assert math.isnan(result["mean"])
assert math.isnan(result["median"])
assert math.isnan(result["p25"])
assert math.isnan(result["p75"])
def test_summary_stats_single():
result = summary_stats([7.0])
assert result["n"] == 1
assert abs(result["mean"] - 7.0) < 1e-9
assert abs(result["median"] - 7.0) < 1e-9
def test_summary_stats_keys():
result = summary_stats([1, 2, 3])
assert set(result.keys()) == {"n", "mean", "median", "p25", "p75"}
@@ -0,0 +1,62 @@
"""Tests para translate_es_to_en — smoke tests con modelo stub."""
import os
import sys
sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "..", "..", ".."))
from python.functions.datascience.translate_es_to_en import translate_es_to_en
class _StubTokenizer:
"""Tokenizer stub que devuelve inputs triviales."""
def __call__(self, text, return_tensors=None, max_length=512, truncation=True):
# Devuelve un dict con una clave 'input_ids' que el modelo stub acepta.
return {"input_ids": [[1, 2, 3]], "_text": text}
def decode(self, token_ids, skip_special_tokens=True):
# Devuelve siempre "translated" para testing.
return "translated"
class _StubModel:
"""Modelo stub que devuelve tokens triviales."""
def generate(self, input_ids=None, num_beams=4, max_length=512, **kwargs):
return [[10, 11, 12]]
def test_texto_vacio_retorna_string_vacio():
tok = _StubTokenizer()
model = _StubModel()
assert translate_es_to_en("", tok, model) == ""
def test_solo_espacios_retorna_string_vacio():
tok = _StubTokenizer()
model = _StubModel()
assert translate_es_to_en(" ", tok, model) == ""
def test_una_frase_en_espanol_produce_output_no_vacio():
tok = _StubTokenizer()
model = _StubModel()
result = translate_es_to_en("Pablo Isla es presidente de Inditex.", tok, model)
assert isinstance(result, str)
assert len(result) > 0
def test_multiples_frases_se_unen_con_espacio():
tok = _StubTokenizer()
model = _StubModel()
# El stub siempre devuelve "translated" por frase
result = translate_es_to_en(
"Primera frase. Segunda frase. Tercera frase.",
tok,
model,
)
# Con el stub, cada frase produce "translated", unidas con espacio
parts = result.split(" ")
assert all(p == "translated" for p in parts)
assert len(parts) >= 1
@@ -0,0 +1,33 @@
"""Tests para trimmed_mean."""
import math
import sys
import os
sys.path.insert(0, os.path.join(os.path.dirname(__file__), ".."))
from trimmed_mean import trimmed_mean
def test_trimmed_mean_basic():
result = trimmed_mean([1, 2, 3, 4, 5, 100], 0.1)
assert abs(result - 3.5) < 0.5, f"Expected ~3.5, got {result}"
def test_trimmed_mean_empty_returns_nan():
result = trimmed_mean([], 0.05)
assert math.isnan(result)
def test_trimmed_mean_no_trim():
result = trimmed_mean([1.0, 2.0, 3.0, 4.0, 5.0], 0.0)
assert abs(result - 3.0) < 1e-9
def test_trimmed_mean_single_element():
result = trimmed_mean([42.0], 0.05)
assert abs(result - 42.0) < 1e-9
def test_trimmed_mean_uniform():
result = trimmed_mean([5.0, 5.0, 5.0, 5.0, 5.0], 0.1)
assert abs(result - 5.0) < 1e-9
@@ -0,0 +1,49 @@
"""Tests para words_to_dataset."""
import sys
import os
sys.path.insert(0, os.path.join(os.path.dirname(__file__), ".."))
from words_to_dataset import words_to_dataset
def test_cuenta_palabras_repetidas():
texts = ["calle mayor", "calle del sol", "avenida principal"]
result = words_to_dataset(texts)
palabras = {r["palabra"]: r["ocurrencias"] for r in result}
assert palabras["CALLE"] == 2
def test_eliminar_stopwords_filtra_del():
texts = ["calle mayor", "calle del sol", "avenida principal"]
result = words_to_dataset(texts, eliminar_stopwords=True)
palabras = {r["palabra"] for r in result}
assert "DEL" not in palabras
def test_min_ocurrencias_filtra():
texts = ["calle mayor", "calle del sol", "avenida principal"]
result = words_to_dataset(texts, min_ocurrencias=2)
palabras = {r["palabra"]: r["ocurrencias"] for r in result}
assert "CALLE" in palabras
assert "MAYOR" not in palabras
def test_none_ignorados():
texts = ["hola mundo", None, "hola"]
result = words_to_dataset(texts)
palabras = {r["palabra"]: r["ocurrencias"] for r in result}
assert palabras["HOLA"] == 2
def test_lista_vacia():
result = words_to_dataset([])
assert result == []
def test_orden_descendente():
texts = ["a a a", "b b", "c"]
result = words_to_dataset(texts)
counts = [r["ocurrencias"] for r in result]
assert counts == sorted(counts, reverse=True)
@@ -0,0 +1,85 @@
---
name: translate_es_to_en
kind: function
lang: py
domain: datascience
version: "1.0.0"
purity: impure
signature: "def translate_es_to_en(text: str, tokenizer: Any, model: Any, max_length: int = 512, num_beams: int = 4) -> str"
description: "Translates Spanish text to English sentence by sentence using MarianMT. Splits on sentence boundaries, translates each sentence independently and joins the results with a single space. Preserves proper nouns better than passing the whole paragraph at once."
tags: [marianmt, translation, es-en, nlp, datascience, python]
uses_functions: [marianmt_es_en_load_model_py_datascience]
uses_types: []
returns: []
returns_optional: false
error_type: "error_go_core"
imports: [re]
params:
  - name: text
    desc: "Spanish text to translate; may be a single sentence or a multi-sentence paragraph"
  - name: tokenizer
    desc: "MarianMT tokenizer loaded with marianmt_es_en_load_model"
  - name: model
    desc: "MarianMT model loaded with marianmt_es_en_load_model"
  - name: max_length
    desc: "maximum length in tokens per sentence, for tokenization and generation (default 512)"
  - name: num_beams
    desc: "number of beams for beam search; higher = better quality, slower (default 4)"
output: "text translated into English. Sentences joined with a single space. Empty string if the input is empty."
tested: true
tests:
- "texto vacio retorna string vacio"
- "una frase en espanol produce output no vacio"
test_file_path: "python/functions/datascience/tests/test_translate_es_to_en.py"
file_path: "python/functions/datascience/translate_es_to_en.py"
notes: |
  impure: calls model.generate, which depends on model state (weights, device).
  Sentences are split with a regex lookbehind on [.!?] followed by whitespace.
  No abbreviation rules are applied (unlike NLTK sent_tokenize): abbreviations
  written without internal spaces (S.A., U.S.A.) are never split in the middle,
  but a split does occur right after one when it is followed by a space.
  Useful as a preprocessor for rebel_load_model (English-only, Apache 2.0):
    ES text -> translate_es_to_en -> EN text -> REBEL -> triplets
  Direct alternative: mrebel_load_model (multilingual, CC BY-NC-SA).
---
## Example
```python
from python.functions.datascience.marianmt_es_en_load_model import marianmt_es_en_load_model
from python.functions.datascience.translate_es_to_en import translate_es_to_en
tokenizer, model = marianmt_es_en_load_model()
text = "Pablo Isla es presidente de Inditex. La empresa tiene sede en Arteixo."
translated = translate_es_to_en(text, tokenizer, model)
# "Pablo Isla is president of Inditex. The company is headquartered in Arteixo."
```
## Why sentence by sentence
Passing the whole paragraph to MarianMT can degrade the translation of proper nouns
because the model spreads its attention over the full context. Splitting by sentence:
1. Shorter context → less confusion around proper nouns.
2. Truncation is less likely (512 tokens is enough for ordinary sentences).
3. The pipeline is more predictable to debug (each sentence can be inspected on its own).
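
To see where the splitter actually cuts, the regex can be run on its own. This is a minimal sketch that copies the pattern from `translate_es_to_en.py`; `split_sentences` is a hypothetical helper name used only for this illustration.

```python
import re

# Same pattern as in translate_es_to_en.py: whitespace preceded by ., ! or ?
SENTENCE_RE = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text):
    """Split on the pattern and drop empty fragments (hypothetical helper)."""
    return [s.strip() for s in SENTENCE_RE.split(text.strip()) if s.strip()]


print(split_sentences("Pablo Isla es presidente de Inditex. La empresa tiene sede en Arteixo."))
# ['Pablo Isla es presidente de Inditex.', 'La empresa tiene sede en Arteixo.']

# Dots inside a token are not followed by whitespace, so they do not split.
print(split_sentences("Opera en U.S.A., Mexico y Chile."))
# ['Opera en U.S.A., Mexico y Chile.']

# An abbreviation followed by a space is split, since no abbreviation rules exist.
print(split_sentences("Inditex S.A. tiene sede en Arteixo."))
# ['Inditex S.A.', 'tiene sede en Arteixo.']
```

The last case is the trade-off of a rule-free splitter: it is simple and predictable, but a mid-sentence abbreviation produces two fragments that are translated separately.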
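## ES -> EN -> REBEL pipeline pattern
```python
# Module paths are assumed to follow the one-function-per-file layout used in the example above.
from python.functions.datascience.marianmt_es_en_load_model import marianmt_es_en_load_model
from python.functions.datascience.parse_rebel_output import parse_rebel_output
from python.functions.datascience.rebel_load_model import rebel_load_model
from python.functions.datascience.translate_es_to_en import translate_es_to_en

es_text = "Pablo Isla es presidente de Inditex."

# Step 1: load models
mt_tok, mt_model = marianmt_es_en_load_model()
rebel_tok, rebel_model = rebel_load_model()
# Step 2: translate
en_text = translate_es_to_en(es_text, mt_tok, mt_model)
# Step 3: extract relations (special tokens are kept because they delimit the triplets)
inputs = rebel_tok(en_text, return_tensors="pt", max_length=512, truncation=True)
generated = rebel_model.generate(**inputs, num_beams=4, max_length=256)
decoded = rebel_tok.decode(generated[0], skip_special_tokens=False)
triplets = parse_rebel_output(decoded)
```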
@@ -0,0 +1,68 @@
"""Translate Spanish text to English using MarianMT, sentence by sentence."""

from __future__ import annotations

import re
from typing import Any

# Sentence-split pattern: whitespace preceded by a period, exclamation mark, or question mark.
_SENTENCE_RE = re.compile(r"(?<=[.!?])\s+")


def translate_es_to_en(
    text: str,
    tokenizer: Any,
    model: Any,
    max_length: int = 512,
    num_beams: int = 4,
) -> str:
    """Translate Spanish text to English, sentence by sentence.

    Splits the input on whitespace that follows ``.``, ``!``, or ``?``,
    translates each sentence independently, and rejoins with a single space.

    Processing sentence by sentence preserves proper nouns (names, companies,
    locations) better than passing the full paragraph in a single call, because
    the translation model can focus on shorter context windows.

    Args:
        text: Spanish text to translate. Can be a single sentence or a
            multi-sentence paragraph.
        tokenizer: MarianMT tokenizer loaded with ``marianmt_es_en_load_model``.
        model: MarianMT model loaded with ``marianmt_es_en_load_model``.
        max_length: Maximum token length for each sentence during tokenization
            and generation. Sentences longer than this are truncated.
        num_beams: Number of beams for beam search. Higher = better quality,
            slower. Default 4 is a good tradeoff.

    Returns:
        Translated English text. Sentences joined with a single space.
        Returns an empty string if ``text`` is empty or whitespace-only.

    Raises:
        RuntimeError: if model.generate fails (propagated from transformers).
    """
    if not text or not text.strip():
        return ""

    sentences = _SENTENCE_RE.split(text.strip())
    sentences = [s.strip() for s in sentences if s.strip()]
    if not sentences:
        return ""

    translated_parts: list[str] = []
    for sentence in sentences:
        inputs = tokenizer(
            sentence,
            return_tensors="pt",
            max_length=max_length,
            truncation=True,
        )
        generated = model.generate(
            **inputs,
            num_beams=num_beams,
            max_length=max_length,
        )
        decoded = tokenizer.decode(generated[0], skip_special_tokens=True)
        translated_parts.append(decoded.strip())
    return " ".join(translated_parts)
@@ -0,0 +1,53 @@
---
id: trimmed_mean_py_datascience
name: trimmed_mean
kind: function
lang: py
domain: datascience
version: "1.0.0"
purity: pure
signature: "def trimmed_mean(values: list[float], trim: float = 0.05) -> float"
description: "Arithmetic mean after discarding values below the trim percentile and above the (1 - trim) percentile. Returns math.nan for empty input."
tags: [statistics, mean, robust, trimming, outliers]
uses_functions: []
uses_types: []
returns: []
returns_optional: false
error_type: ""
imports: [math, numpy]
example: |
  from trimmed_mean import trimmed_mean
  result = trimmed_mean([1, 2, 3, 4, 5, 100], 0.1)  # ~3.5
tested: true
tests:
- "test_trimmed_mean_basic"
- "test_trimmed_mean_empty_returns_nan"
- "test_trimmed_mean_no_trim"
- "test_trimmed_mean_single_element"
- "test_trimmed_mean_uniform"
test_file_path: "python/functions/datascience/tests/test_trimmed_mean.py"
file_path: "python/functions/datascience/trimmed_mean.py"
params:
- name: values
  desc: "List of numeric values to average."
- name: trim
  desc: "Fraction to cut from each tail before averaging (0 <= trim < 0.5). Default 0.05."
output: "Trimmed arithmetic mean as float. Returns math.nan if values is empty or all values are trimmed away."
source_repo: "internal:footprint_aurgi"
source_license: "internal-aurgi"
source_file: "aurgi_mapas/generar_pdf_reporte.py:117"
---
## Example
```python
from trimmed_mean import trimmed_mean
trimmed_mean([1, 2, 3, 4, 5, 100], 0.1)  # 3.5 (1 and 100 fall outside [1.5, 52.5])
trimmed_mean([], 0.05) # math.nan
trimmed_mean([5.0, 5.0, 5.0], 0.0) # 5.0
```
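## Notes
Uses numpy.percentile (linear interpolation by default) to compute the thresholds lo and hi, then keeps only the values inside the closed range [lo, hi]. Useful for robust averages when the distribution contains extreme values. Because the thresholds are interpolated, very short inputs with a large trim can end up with no value inside [lo, hi], in which case the result is math.nan.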
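
The thresholding can be followed step by step with a pure-Python sketch. It is not the registry implementation: `percentile_linear` and `trimmed_mean_sketch` are hypothetical names, and the interpolation is written to mirror what numpy.percentile does with its default linear method, which is an assumption stated here rather than something this repository confirms.

```python
import math


def percentile_linear(sorted_vals, q):
    """Percentile for a fraction q in [0, 1], interpolating linearly between ranks."""
    pos = q * (len(sorted_vals) - 1)
    lower = math.floor(pos)
    upper = min(lower + 1, len(sorted_vals) - 1)
    frac = pos - lower
    return sorted_vals[lower] + frac * (sorted_vals[upper] - sorted_vals[lower])


def trimmed_mean_sketch(values, trim=0.05):
    """Mean of the values inside [lo, hi]; math.nan if nothing is left."""
    if not values:
        return math.nan
    ordered = sorted(float(v) for v in values)
    lo = percentile_linear(ordered, trim)
    hi = percentile_linear(ordered, 1 - trim)
    kept = [v for v in ordered if lo <= v <= hi]
    return sum(kept) / len(kept) if kept else math.nan


# lo = 1.5 and hi = 52.5, so 1 and 100 are dropped and the mean of 2..5 remains.
print(trimmed_mean_sketch([1, 2, 3, 4, 5, 100], 0.1))  # 3.5

# Two points with trim=0.4 give lo = 4.0 and hi = 6.0: no value survives.
print(trimmed_mean_sketch([0.0, 10.0], 0.4))  # nan
```

The second call shows why the function documents a math.nan return for the case where trimming removes every element, even though the input list is not empty.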
@@ -0,0 +1,28 @@
"""trimmed_mean: arithmetic mean after trimming extreme percentiles."""

import math
import numpy as np


def trimmed_mean(values: list[float], trim: float = 0.05) -> float:
    """Return the trimmed arithmetic mean of values.

    Cuts the bottom `trim` and top `trim` percentiles before averaging.
    Returns math.nan for an empty list or when trimming removes all elements.

    Args:
        values: List of numeric values.
        trim: Fraction to cut from each tail (0 <= trim < 0.5).

    Returns:
        Trimmed mean as float, or math.nan if the list is empty or trimming removes every element.
    """
    if not values:
        return math.nan
    arr = np.array(values, dtype=float)
    lo = np.percentile(arr, trim * 100)
    hi = np.percentile(arr, (1 - trim) * 100)
    trimmed = arr[(arr >= lo) & (arr <= hi)]
    if len(trimmed) == 0:
        return math.nan
    return float(np.mean(trimmed))
