feat: bulk extraction from footprint_aurgi (41 funcs + 4 types + Docker geo stack)

Extracts functions from the internal footprint_aurgi project into the registry:
- core (6): slugify_ascii, normalize_for_join, cp_provincia_es, infer_provincia_from_cp, safe_read_csv_fallback, csv_to_parquet_duckdb
- pure geo (7): haversine_km, point_in_ring, point_in_polygon, point_in_polygons_bbox, polygon_bbox, extent_with_padding, distance_bucket
- geo I/O (4): load_geojson_polygons, load_boundary_gdf, add_basemap_osm, add_basemap_with_timeout
- valhalla client (4): valhalla_route, valhalla_isochrone, valhalla_isochrones_async, valhalla_matrix_1_to_n
- datascience stats (7): trimmed_mean, geometric_mean, detect_distribution_type, best_central_tendency, summary_stats, kde_density_levels, alpha_shape_concave_hull
- datascience fuzzy (3): fuzzy_merge_adaptive (rapidfuzz), words_to_dataset, remove_words_from_column
- datascience viz (2): plot_kde_2d, plot_heatmap_log
- infra (4): compress_pdf_ghostscript, render_table_page_pdfpages, add_header_logo, osm2pgsql_ingest
- pipelines (4): setup_geo_stack_docker, compute_centers_reachability, generate_isochrones_by_zone, count_points_per_zone
- types geo (4): LonLat, BBox, IsochroneRequest, Centro

Includes:
- apps/footprint_geo_stack/ (PostGIS + Martin + Valhalla via docker-compose)
- 131/132 tests pass (1 expected skip: osm2pgsql on PATH)
- Issue tracker: dev/issues/0052-footprint-aurgi-extraction.md
- Uniform attribution: source_repo internal:footprint_aurgi, source_license internal-aurgi
- Built with 9 agents in parallel (8 in wave 1 + 1 in wave 2 for pipelines)

Also commits previously uncommitted work: aggregate_extraction_results, chunk_with_overlap, clean_pdf_text, merge_entity_aliases, extract_graph_gliner2, extract_relations_mrebel, extract_triples_spacy_es, gliner2/mrebel/marianmt/rebel/spacy_es load_model, parse_rebel_output, translate_es_to_en, issues 0050/0051.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 23:35:22 +02:00
parent f73ea072bd
commit faac610745
193 changed files with 13146 additions and 3 deletions
@@ -0,0 +1,51 @@
---
name: aggregate_extraction_results
kind: function
lang: py
domain: core
version: "1.0.0"
purity: pure
signature: "def aggregate_extraction_results(extract_results: list[dict]) -> dict"
description: "Aggregates entities and relations from N per-chunk extraction results. Deduplicates entities by (type, name_lowercased), accumulating counts. Deduplicates relations by (head, rel_type, tail) using a Counter."
tags: [nlp, aggregation, entities, relations, deduplication, chunking, ner, re, graph]
uses_functions: []
uses_types: []
returns: []
returns_optional: false
error_type: ""
imports: [collections.Counter]
params:
  - name: extract_results
    desc: "List of per-chunk results. Each element has shape {'entities': {type: [name, ...]}, 'relation_extraction': {rel_type: [(head, tail), ...]}}. This is the output of extract_graph_gliner2. Missing keys are tolerated."
output: "Dict with two fields: 'entities' -> dict keyed by (type, name_lower) with {type, name, count}; 'relations' -> Counter mapping (head, rel_type, tail) -> count. Shaped to feed filter_relations_by_entity_types and merge_entity_aliases."
tested: true
tests:
  - "empty list returns empty entities and empty relations"
  - "a single result is aggregated correctly"
  - "two overlapping results accumulate counts"
  - "entities are deduplicated case-insensitively"
test_file_path: "python/functions/core/tests/test_aggregate_extraction_results.py"
file_path: "python/functions/core/aggregate_extraction_results.py"
notes: |
  The output shape is deliberate, for composition with the pipeline:
  - entities keyed by (type, name_lower) allow O(1) lookup by type + name
  - relations as a Counter allow filtering by frequency (count >= 2)
  It does not apply coreference; merge_entity_aliases does that on the
  canonical names after aggregation.
---
## Example
```python
from core.aggregate_extraction_results import aggregate_extraction_results
results = [
{"entities": {"person": ["Pablo Isla"], "organization": ["Inditex"]},
"relation_extraction": {"ceo_of": [("Pablo Isla", "Inditex")]}},
{"entities": {"person": ["pablo isla"], "organization": ["Inditex"]},
"relation_extraction": {"ceo_of": [("Pablo Isla", "Inditex")]}},
]
agg = aggregate_extraction_results(results)
# agg["entities"][("person", "pablo isla")]["count"] == 2
# agg["relations"][("Pablo Isla", "ceo_of", "Inditex")] == 2
```
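
The aggregated output still needs a small reshaping step before the typed relation filter can consume it. The following is a minimal sketch, assuming only the documented output shape: it derives a `name_to_type` lookup and regroups the relations seen at least twice as `{rel_type: [(head, tail), ...]}`. The helpers `build_name_to_type` and `frequent_relations` are hypothetical names, not part of the registry, and `agg` is hand-written sample data.

```python
from collections import Counter

# Hand-written sample with the documented output shape (assumed, not real data).
agg = {
    "entities": {
        ("person", "pablo isla"): {"type": "person", "name": "Pablo Isla", "count": 2},
        ("organization", "inditex"): {"type": "organization", "name": "Inditex", "count": 2},
    },
    "relations": Counter({
        ("Pablo Isla", "ceo_of", "Inditex"): 2,
        ("Inditex", "owns", "Zara"): 1,
    }),
}


def build_name_to_type(entities: dict) -> dict[str, str]:
    """Hypothetical helper: map lowercased entity name -> entity type."""
    return {name_lower: typ for (typ, name_lower) in entities}


def frequent_relations(relations: Counter, min_count: int = 2) -> dict[str, list[tuple[str, str]]]:
    """Hypothetical helper: keep relations seen at least min_count times,
    regrouped as {rel_type: [(head, tail), ...]}."""
    grouped: dict[str, list[tuple[str, str]]] = {}
    for (head, rel_type, tail), count in relations.items():
        if count >= min_count:
            grouped.setdefault(rel_type, []).append((head, tail))
    return grouped


name_to_type = build_name_to_type(agg["entities"])
relations = frequent_relations(agg["relations"])
```

If the same lowercased name appears under two entity types, the last one wins in this sketch; a real pipeline may prefer the type with the higher count.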
@@ -0,0 +1,45 @@
"""Aggregates and deduplicates entities + relations from N per-chunk extraction results."""
from __future__ import annotations

from collections import Counter


def aggregate_extraction_results(extract_results: list[dict]) -> dict:
    """Aggregate entities + relations from multiple chunk-level extraction results.

    Deduplicates entities by (type, name_lowercased) and counts occurrences.
    Deduplicates relations by (head, rel_type, tail) and counts occurrences.

    Each input result is expected to have shape:
        {"entities": {type: [name, ...]}, "relation_extraction": {rel_type: [(head, tail), ...]}}
    This is the output format of extract_graph_gliner2.

    Args:
        extract_results: List of per-chunk extraction dicts. May be empty.
            Missing keys ("entities", "relation_extraction") are tolerated.

    Returns:
        {
            "entities": dict[(type, name_lower)] -> {"type": str, "name": str, "count": int},
            "relations": Counter mapping (head, rel_type, tail) -> count
        }
    """
    all_ents: dict[tuple[str, str], dict] = {}
    all_rels: Counter = Counter()
    for r in extract_results:
        for typ, names in (r.get("entities") or {}).items():
            for n in names:
                key = (typ, (n or "").strip().lower())
                # Skip None or blank names
                if not key[1]:
                    continue
                # The first spelling seen becomes the display name
                if key not in all_ents:
                    all_ents[key] = {"type": typ, "name": n.strip(), "count": 0}
                all_ents[key]["count"] += 1
        for rt, pairs in (r.get("relation_extraction") or {}).items():
            for h, t in pairs:
                all_rels[(h.strip(), rt, t.strip())] += 1
    return {"entities": all_ents, "relations": all_rels}
@@ -0,0 +1,64 @@
---
name: chunk_with_overlap
kind: function
lang: py
domain: core
version: "1.0.0"
purity: pure
signature: "def chunk_with_overlap(text: str, max_chars: int = 1500, overlap_sentences: int = 2) -> list[dict]"
description: "Splits text into chunks at sentence boundaries with sliding-window overlap. Guarantees forced progress when a single sentence exceeds max_chars (avoids an infinite loop). Each chunk is returned as a dict with 'text' and 'sentences'."
tags: [text, chunking, nlp, split, overlap, sentence, ner, gliner, sliding-window]
uses_functions: []
uses_types: []
returns: []
returns_optional: false
error_type: ""
imports: [re]
params:
  - name: text
    desc: "Text to split. Sentences are detected by [.!?] followed by whitespace. Line breaks are accepted if the text has already been cleaned with clean_pdf_text."
  - name: max_chars
    desc: "Maximum characters per chunk (soft limit). A single sentence longer than max_chars is still included, to avoid an infinite loop."
  - name: overlap_sentences
    desc: "Number of trailing sentences from the previous chunk to prepend to the current chunk, applied only when they fit within max_chars together with the next sentence. 0 disables overlap."
output: "List of dicts [{'text': str, 'sentences': list[str]}, ...]. 'text' is the text ready to pass to GLiNER2. Empty list if the input is empty."
tested: true
tests:
  - "empty text returns an empty list"
  - "one sentence shorter than max_chars produces 1 chunk"
  - "multiple sentences produce N chunks with overlap"
  - "a sentence longer than max_chars is included without an infinite loop"
  - "overlap=0 does not duplicate sentences across chunks"
  - "with overlap=2, chunk N+1 starts with the last 2 sentences of chunk N"
test_file_path: "python/functions/core/tests/test_chunk_with_overlap.py"
file_path: "python/functions/core/chunk_with_overlap.py"
notes: |
  Algorithm validated empirically in notebook 06 of the gliner_glirel_tuning
  analysis. Sentence-level overlap (as opposed to character-level overlap)
  ensures that entities appearing at the end of one chunk also appear at the
  start of the next, which improves GLiNER2 recall.
  split_text_into_chunks_py_core overlaps by characters (RAG mode).
  chunk_with_overlap overlaps by whole sentences (NER/RE mode); the two are
  complementary, not competing.
---
## Example
```python
from core.chunk_with_overlap import chunk_with_overlap
text = "Pablo Isla preside Inditex. La empresa opera en 93 paises. Zara es su marca principal."
chunks = chunk_with_overlap(text, max_chars=80, overlap_sentences=1)
# chunk 0: text="Pablo Isla preside Inditex. La empresa opera en 93 paises."
# chunk 1: text="La empresa opera en 93 paises. Zara es su marca principal."
#               ^--- 1-sentence overlap
for c in chunks:
print(c["text"])
```
## Difference from split_text_into_chunks
- `split_text_into_chunks`: character-level overlap, aimed at RAG
- `chunk_with_overlap`: whole-sentence overlap, aimed at NER/RE (GLiNER2)
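
Sentence detection relies on a single regular expression, so it helps to see what it treats as a boundary. The sketch below applies the same pattern used in the source to two inputs. `split_sentences` is a hypothetical helper name for illustration only. Note that an abbreviation ending in a period and followed by a space (such as "S.A.") is also treated as a sentence end.

```python
import re

# Same pattern the source uses: split on whitespace that follows ., ! or ?
SENTENCE_BOUNDARY = re.compile(r"(?<=[\.!?])\s+")


def split_sentences(text: str) -> list[str]:
    """Illustrative helper (hypothetical name): split and drop empty pieces."""
    return [s.strip() for s in SENTENCE_BOUNDARY.split(text) if s.strip()]


plain = split_sentences("Zara abre tienda. Inditex crece!  Sigue?")
abbrev = split_sentences("Inditex S.A. opera en 93 paises.")

print(plain)   # three sentences
print(abbrev)  # split after the abbreviation "S.A."
```

Because overlap is counted in sentences, a spurious split like the one after "S.A." only makes the overlap window slightly shorter in characters; no text is lost.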
@@ -0,0 +1,73 @@
"""Chunking at sentence boundaries with sliding-window overlap.

Validated empirically in notebook 06 (gliner_glirel_tuning) for NER+RE
pipelines with GLiNER2. Fixes the infinite-loop bug of the naive version
when a sentence exceeds max_chars.
"""
from __future__ import annotations

import re


def chunk_with_overlap(
    text: str,
    max_chars: int = 1500,
    overlap_sentences: int = 2,
) -> list[dict]:
    """Split text into chunks with sentence-level sliding window overlap.

    Each chunk has up to `max_chars` characters. Each chunk after the first
    starts with the last `overlap_sentences` sentences of the previous chunk
    if they fit. If a single sentence exceeds max_chars, it is force-included
    (chunk size will exceed max_chars rather than infinite-loop).

    Args:
        text: Input text to split. Sentences are detected by [.!?] followed by whitespace.
        max_chars: Maximum characters per chunk (soft limit; exceeded if a single
            sentence is longer than max_chars to avoid infinite loop).
        overlap_sentences: Number of trailing sentences of the previous chunk to
            prepend to the next chunk. 0 disables overlap.

    Returns:
        list of dicts: [{"text": str, "sentences": list[str]}, ...]
        Empty list if text is empty or contains only whitespace.
    """
    if not text or not text.strip():
        return []
    sentences = re.split(r"(?<=[\.!?])\s+", text)
    sentences = [s.strip() for s in sentences if s.strip()]
    if not sentences:
        return []

    chunks: list[dict] = []
    i = 0
    while i < len(sentences):
        current_sents: list[str] = []
        current_len = 0

        # Overlap from the previous chunk, only if it fits with the next sentence
        if chunks and overlap_sentences > 0:
            prev_sents = chunks[-1]["sentences"][-overlap_sentences:]
            overlap_len = sum(len(s) + 1 for s in prev_sents)
            next_len = len(sentences[i]) + 1
            if overlap_len + next_len <= max_chars:
                current_sents = list(prev_sents)
                current_len = overlap_len

        # FORCED PROGRESS: take at least ONE sentence even if it exceeds max_chars
        # (avoids an infinite loop with very long sentences)
        current_sents.append(sentences[i])
        current_len += len(sentences[i]) + 1
        i += 1

        # Keep adding sentences while they fit
        while i < len(sentences) and current_len + len(sentences[i]) + 1 <= max_chars:
            current_sents.append(sentences[i])
            current_len += len(sentences[i]) + 1
            i += 1

        chunks.append({"text": " ".join(current_sents), "sentences": current_sents})
    return chunks
@@ -0,0 +1,53 @@
---
name: clean_pdf_text
kind: function
lang: py
domain: core
version: "1.0.0"
purity: pure
signature: "def clean_pdf_text(text: str) -> str"
description: "Cleans PyPDF2/pdfplumber extraction artifacts: removes page markers (1/20), tabs, end-of-line hyphenation, line breaks in the middle of sentences, and duplicated spaces."
tags: [pdf, text, cleaning, nlp, preprocessing, pypdf2]
uses_functions: []
uses_types: []
returns: []
returns_optional: false
error_type: ""
imports: [re]
params:
  - name: text
    desc: "Plain text extracted from a PDF (e.g. via PyPDF2.PdfReader or pdfplumber). May contain pagination artifacts, end-of-line hyphenation and spurious line breaks."
output: "Cleaned text with artifacts removed and whitespace normalized. Ready for chunking or NER extraction."
tested: true
tests:
  - "empty string returns empty string"
  - "page marker 1/20 is removed"
  - "dehyphenation exa-newline-mple -> example"
  - "duplicated spaces are collapsed"
  - "line break in the middle of a sentence is joined with a space"
  - "line break after a period is preserved"
test_file_path: "python/functions/core/tests/test_clean_pdf_text.py"
file_path: "python/functions/core/clean_pdf_text.py"
notes: |
  Pure function with no external dependencies (only re from the stdlib).
  The order of operations matters: dehyphenation runs before line-break
  collapsing; otherwise "exa-\nmple" would become "exa- mple" and could no
  longer be rejoined.
  Line breaks after a period, exclamation mark or question mark are not
  removed; they mark the end of a sentence and must be preserved for the chunker.
---
## Example
```python
from core.clean_pdf_text import clean_pdf_text
raw = "Banco Bilbao Vizcaya Argen-\ntaria, S.A. operó en 2023.\n1/20\n\nFoo Bar"
clean = clean_pdf_text(raw)
# "Banco Bilbao Vizcaya Argentaria, S.A. operó en 2023.\nFoo Bar"
```
## Notes
Designed to preprocess text before passing it to `chunk_with_overlap` +
`extract_graph_gliner2`. The full pipeline is:
`extract_pdf_text` -> `clean_pdf_text` -> `chunk_with_overlap` -> `extract_graph_gliner2`.
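
The order of the cleaning steps matters. As a minimal sketch, the two helpers below isolate the dehyphenation and line-break regular expressions from the source and apply them in both orders. The function names `dehyphenate` and `collapse_newlines` are hypothetical and exist only for this illustration.

```python
import re


def dehyphenate(text: str) -> str:
    # Join words split by a hyphen at a line break: "exa-\nmple" -> "example"
    return re.sub(r"-\s*\n\s*", "", text)


def collapse_newlines(text: str) -> str:
    # Replace newlines that do not follow ., ! or ? with a single space
    return re.sub(r"(?<![\.!?])\n+", " ", text)


raw = "Argen-\ntaria opera\nen 2023."

right_order = collapse_newlines(dehyphenate(raw))
wrong_order = dehyphenate(collapse_newlines(raw))

print(right_order)
print(wrong_order)
```

In the wrong order the newline after the hyphen has already been turned into a space, so the dehyphenation pattern no longer matches and the word stays broken.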
@@ -0,0 +1,32 @@
"""Cleans typical PyPDF2 extraction artifacts from plain text."""
from __future__ import annotations

import re


def clean_pdf_text(text: str) -> str:
    """Clean PDF text extraction artifacts.

    Removes: page-number markers like '1/20', tabs, hyphenated line breaks
    in mid-word, duplicated spaces, line breaks not at sentence end.

    Args:
        text: Raw text extracted from a PDF (e.g. via PyPDF2 or pdfplumber).

    Returns:
        Cleaned text with artifacts removed and whitespace normalized.
    """
    # Remove page markers such as "1/20" or "3/128"
    text = re.sub(r"\b\d{1,2}/\d{1,3}\b", " ", text)
    # Tabs to spaces
    text = text.replace("\t", " ")
    # Dehyphenation: "exa-\nmple" -> "example"
    text = re.sub(r"-\s*\n\s*", "", text)
    # Line breaks that are NOT at a sentence end -> space
    text = re.sub(r"(?<![\.!?])\n+", " ", text)
    # Collapse multiple spaces
    text = re.sub(r" {2,}", " ", text)
    # Drop empty lines and trim each line
    text = "\n".join(line.strip() for line in text.split("\n") if line.strip())
    return text.strip()
@@ -0,0 +1,58 @@
---
id: cp_provincia_es_py_core
name: cp_provincia_es
kind: function
lang: py
domain: core
version: "1.0.0"
purity: pure
signature: "def cp_provincia_es(codigo_postal: str | int) -> str | None"
description: "Looks up the Spanish province for a postal code. Accepts a full postal code (5 digits) or a 2-digit prefix. Returns None if the prefix is not in the dictionary of the 52 Spanish provinces/autonomous cities."
tags: [string, normalization, spain, geography, postal-code]
uses_functions: []
uses_types: []
returns: []
returns_optional: false
error_type: ""
imports: []
example: |
  from cp_provincia_es import cp_provincia_es
  cp_provincia_es("28001")  # "Madrid"
  cp_provincia_es("28")     # "Madrid"
  cp_provincia_es(1)        # "Álava"
  cp_provincia_es("99")     # None
tested: true
tests: ["full postal code returns province", "2-digit prefix returns province", "first prefix 01 returns Alava", "unknown postal code returns None"]
test_file_path: "python/functions/core/tests/test_cp_provincia_es.py"
file_path: "python/functions/core/cp_provincia_es.py"
params:
  - name: codigo_postal
    desc: "Spanish postal code as a string or integer. Accepts a 5-digit postal code ('28001', 28001) or a 2-digit prefix ('28', 28)."
output: "Province name in Spanish (with diacritics), or None if the postal code prefix does not match any known province."
source_repo: "internal:footprint_aurgi"
source_license: "internal-aurgi"
source_file: "aurgi_mapas/generar_pdf_reporte.py"
---
## Example
```python
from cp_provincia_es import cp_provincia_es
cp_provincia_es("28001") # "Madrid"
cp_provincia_es("28") # "Madrid"
cp_provincia_es(28) # "Madrid"
cp_provincia_es("01") # "Álava"
cp_provincia_es(1)        # "Álava" (str(1) = "1" -> zfill(2) = "01")
cp_provincia_es("99") # None
```
## Notes
Pure function with no dependencies. The embedded dictionary covers the 50 Spanish
provinces plus Ceuta ("51") and Melilla ("52"). Copied as-is from
`aurgi_mapas/generar_pdf_reporte.py:CP_TO_PROVINCIA`.
Note on integers: inputs of one or two characters are treated as prefixes and padded
with zfill(2), so `cp_provincia_es(1)` -> "01" -> "Álava".
Longer inputs are padded with zfill(5) before taking the first two characters, so
`cp_provincia_es(1001)` -> "01001" -> "Álava" and `cp_provincia_es(28001)` -> "Madrid".
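
The normalisation rule can be checked in isolation. The following is a minimal sketch of the prefix derivation only, without the province dictionary. `cp_prefix` is a hypothetical helper name that mirrors the branch in the source and is not part of the registry.

```python
def cp_prefix(codigo_postal: "str | int") -> str:
    """Hypothetical helper mirroring the prefix normalisation in cp_provincia_es."""
    cp = str(codigo_postal).strip()
    if len(cp) <= 2:
        # Already a prefix: pad to two digits ("1" -> "01")
        return cp.zfill(2)
    # Full postal code, possibly missing leading zeros: pad to five, keep two
    return cp.zfill(5)[:2]


for value in ["28001", "28", 28, "01", 1, 1001, "99"]:
    print(repr(value), "->", cp_prefix(value))
```

Inputs of three or four characters are treated as full postal codes that lost their leading zeros, so `1001` resolves to prefix "01".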
@@ -0,0 +1,44 @@
"""Spanish province lookup by postal code."""
from __future__ import annotations

_CP_TO_PROVINCIA = {
    "01": "Álava", "02": "Albacete", "03": "Alicante", "04": "Almería",
    "05": "Ávila", "06": "Badajoz", "07": "Illes Balears", "08": "Barcelona",
    "09": "Burgos", "10": "Cáceres", "11": "Cádiz", "12": "Castellón",
    "13": "Ciudad Real", "14": "Córdoba", "15": "A Coruña", "16": "Cuenca",
    "17": "Girona", "18": "Granada", "19": "Guadalajara", "20": "Gipuzkoa",
    "21": "Huelva", "22": "Huesca", "23": "Jaén", "24": "León",
    "25": "Lleida", "26": "La Rioja", "27": "Lugo", "28": "Madrid",
    "29": "Málaga", "30": "Murcia", "31": "Navarra", "32": "Ourense",
    "33": "Asturias", "34": "Palencia", "35": "Las Palmas",
    "36": "Pontevedra", "37": "Salamanca", "38": "Santa Cruz de Tenerife",
    "39": "Cantabria", "40": "Segovia", "41": "Sevilla",
    "42": "Soria", "43": "Tarragona", "44": "Teruel",
    "45": "Toledo", "46": "Valencia", "47": "Valladolid",
    "48": "Bizkaia", "49": "Zamora", "50": "Zaragoza",
    "51": "Ceuta", "52": "Melilla",
}


def cp_provincia_es(codigo_postal: "str | int") -> "str | None":
    """Return the Spanish province that corresponds to a postal code.

    Accepts a full postal code (5 digits) or a 2-digit prefix. Inputs of up
    to two characters are padded with zfill(2); longer inputs are padded with
    zfill(5) and their first two characters are used. Returns None if the
    prefix is not in the dictionary.

    Args:
        codigo_postal: Spanish postal code as a string or integer.
            May be a full postal code ("28001", 28001) or a prefix ("28", 28).

    Returns:
        Province name in Spanish, or None if the postal code is unknown.
    """
    cp = str(codigo_postal).strip()
    # Already a prefix of 2 digits (or fewer): use it directly with zfill(2)
    if len(cp) <= 2:
        prefix = cp.zfill(2)
    else:
        prefix = cp.zfill(5)[:2]
    return _CP_TO_PROVINCIA.get(prefix)
@@ -0,0 +1,54 @@
---
name: csv_to_parquet_duckdb
kind: function
lang: py
domain: core
version: "1.0.0"
purity: impure
signature: "def csv_to_parquet_duckdb(csv_path: str | Path, parquet_path: str | Path, column_casts: dict[str, str] | None = None, overwrite: bool = False) -> bool"
description: "Converts a CSV to Parquet using DuckDB read_csv_auto. If overwrite=False and the parquet file already exists, it does nothing. column_casts overrides the inferred types per column. Returns True if the file was written."
tags: [csv, parquet, duckdb, etl, core]
uses_functions: []
uses_types: []
returns: []
returns_optional: false
error_type: "error_go_core"
imports: [duckdb, pathlib]
params:
  - name: csv_path
    desc: "Path to the source CSV file."
  - name: parquet_path
    desc: "Destination path of the Parquet file. Intermediate directories are created if they do not exist."
  - name: column_casts
    desc: "Optional dict mapping column -> DuckDB type to override inferred types (e.g. {\"cp\": \"VARCHAR\"})."
  - name: overwrite
    desc: "If False (default), an existing parquet file is not overwritten and the function returns False."
output: "True if the Parquet file was written, False if it was skipped because it already exists."
tested: true
tests:
  - "converts csv to parquet and duckdb can read it"
  - "overwrite=False does not overwrite an existing parquet"
test_file_path: "python/functions/core/tests/test_csv_to_parquet_duckdb.py"
file_path: "python/functions/core/csv_to_parquet_duckdb.py"
source_repo: "internal:footprint_aurgi"
source_license: "internal-aurgi"
source_file: "zonas_mapas_aurgi/scripts/prepare_parquet.py"
---
## Example
```python
written = csv_to_parquet_duckdb(
"data/centros.csv",
"data/centros.parquet",
column_casts={"cp": "VARCHAR"},
)
if written:
print("Parquet generado")
```
## Notes
Uses DuckDB read_csv_auto, which infers types automatically. For postal codes or
other numeric-looking fields that must stay strings, use column_casts.
Raises FileNotFoundError if csv_path does not exist. Other DuckDB errors propagate.
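
When column_casts is given, the function builds one SELECT expression per column: cast columns are wrapped in CAST and the rest pass through. Below is a standalone sketch of that construction, with identifiers double-quoted so that column names containing spaces stay valid SQL. `quote_ident` and `build_select_list` are hypothetical helper names, and the column names are made up for the example.

```python
def quote_ident(name: str) -> str:
    """Double-quote a SQL identifier, escaping embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def build_select_list(all_cols: list[str], column_casts: dict[str, str]) -> str:
    """Hypothetical helper: cast the requested columns, pass the rest through."""
    parts = []
    for col in all_cols:
        quoted = quote_ident(col)
        if col in column_casts:
            parts.append(f"CAST({quoted} AS {column_casts[col]}) AS {quoted}")
        else:
            parts.append(quoted)
    return ", ".join(parts)


select_list = build_select_list(["id", "cp", "nombre centro"], {"cp": "VARCHAR"})
print(select_list)
```

The type strings are still interpolated verbatim, so column_casts should come from trusted code rather than user input.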
@@ -0,0 +1,79 @@
"""Convert a CSV file to Parquet format using DuckDB."""
from __future__ import annotations

from pathlib import Path


def csv_to_parquet_duckdb(
    csv_path: "str | Path",
    parquet_path: "str | Path",
    column_casts: "dict[str, str] | None" = None,
    overwrite: bool = False,
) -> bool:
    """Convert a CSV file to Parquet using DuckDB's read_csv_auto.

    If overwrite is False and the parquet file already exists, the function
    does nothing and returns False. Otherwise uses DuckDB to read the CSV
    (with automatic type inference) and writes it as Parquet.

    Optional column_casts allow overriding inferred types for specific columns
    (e.g. {"codigo_postal": "VARCHAR"} to prevent numeric coercion).

    Args:
        csv_path: Path to the source CSV file.
        parquet_path: Path for the output Parquet file.
        column_casts: Optional dict mapping column names to DuckDB SQL types.
        overwrite: If False (default), skip conversion when parquet exists.

    Returns:
        True if the Parquet file was written, False if skipped.

    Raises:
        FileNotFoundError: If csv_path does not exist.
        Exception: Any DuckDB error (malformed CSV, type cast failure, etc.).
    """
    import duckdb

    csv_p = Path(csv_path)
    parquet_p = Path(parquet_path)
    if not csv_p.exists():
        raise FileNotFoundError(f"CSV not found: {csv_p}")
    if not overwrite and parquet_p.exists():
        return False
    parquet_p.parent.mkdir(parents=True, exist_ok=True)

    # Escape single quotes so paths can be embedded in SQL string literals
    src = str(csv_p).replace("'", "''")
    dst = str(parquet_p).replace("'", "''")

    con = duckdb.connect()
    try:
        select_expr = "*"
        if column_casts:
            # DESCRIBE lists every column so that uncast ones pass through unchanged
            describe_sql = f"DESCRIBE SELECT * FROM read_csv_auto('{src}', header=true)"
            all_cols = [row[0] for row in con.execute(describe_sql).fetchall()]
            select_parts = []
            for col in all_cols:
                # Double-quote identifiers so names with spaces or keywords work
                quoted = '"' + col.replace('"', '""') + '"'
                if col in column_casts:
                    select_parts.append(f"CAST({quoted} AS {column_casts[col]}) AS {quoted}")
                else:
                    select_parts.append(quoted)
            select_expr = ", ".join(select_parts)
        sql = (
            f"COPY (SELECT {select_expr} FROM read_csv_auto('{src}', header=true)) "
            f"TO '{dst}' (FORMAT PARQUET)"
        )
        con.execute(sql)
    finally:
        con.close()
    return True
@@ -0,0 +1,67 @@
---
name: filter_relations_by_entity_types
kind: function
lang: py
domain: core
version: "1.0.0"
purity: pure
signature: "def filter_relations_by_entity_types(relations: dict, name_to_type: dict, allowed: dict) -> tuple[list, list]"
description: "Typed post-filtering of NER+RE relations: discards pairs whose entity types (head_type, tail_type) do not match those allowed for the relation kind. E.g. discards 'Madrid president_of Persona' because Madrid is a location, not a person."
tags: [nlp, relations, filter, entity-types, graph, ner, re, post-process, gliner2]
uses_functions: []
uses_types: []
returns: []
returns_optional: false
error_type: ""
imports: []
params:
  - name: relations
    desc: "Dict {rel_type: [(head_name, tail_name), ...]}. Names must be non-empty strings. E.g. {'president_of': [('Carlos Torres', 'BBVA')]}"
  - name: name_to_type
    desc: "Dict {lowercased_name: entity_type}. Built from the result of extract_graph_gliner2 or aggregate_extraction_results. E.g. {'carlos torres': 'person', 'bbva': 'organization'}"
  - name: allowed
    desc: "Dict {rel_type: (allowed_head_types, allowed_tail_types)}. Each value is a tuple of two lists of strings. If a rel_type is not in allowed, all of its pairs are accepted. E.g. {'president_of': (['person'], ['organization'])}"
output: "Tuple (kept, dropped). Each element is a list of dicts {from, kind, to, head_type, tail_type}. kept holds the valid pairs, dropped the rejected ones (useful for debugging)."
tested: true
tests:
  - "valid pairs are included in kept"
  - "pairs with incompatible types go to dropped"
  - "rel_type not in allowed is always accepted"
  - "entity not found in name_to_type goes to dropped"
test_file_path: "python/functions/core/tests/test_filter_relations_by_entity_types.py"
file_path: "python/functions/core/filter_relations_by_entity_types.py"
notes: |
  Validated in playground/server.py of the gliner_glirel_tuning analysis.
  The (head_type, tail_type) rule avoids common false positives in knowledge
  graphs such as "Madrid preside Santander" (Location -> Organization).
  The dropped return value makes it easy to inspect which relations were
  removed and why (a head_type/tail_type of None indicates an unknown entity).
---
## Example
```python
from core.filter_relations_by_entity_types import filter_relations_by_entity_types
relations = {
"president_of": [
("Carlos Torres", "BBVA"), # person -> organization: OK
("Madrid", "Santander"), # location -> organization: INVALIDO
],
"unknown_rel": [("A", "B")], # no en allowed: se acepta
}
name_to_type = {
"carlos torres": "person",
"bbva": "organization",
"madrid": "location",
"santander": "organization",
"a": "person", "b": "person",
}
allowed = {
"president_of": (["person"], ["organization"]),
}
kept, dropped = filter_relations_by_entity_types(relations, name_to_type, allowed)
# kept: [{"from": "Carlos Torres", "kind": "president_of", "to": "BBVA", ...},
# {"from": "A", "kind": "unknown_rel", "to": "B", ...}]
# dropped: [{"from": "Madrid", "kind": "president_of", "to": "Santander", ...}]
```
@@ -0,0 +1,49 @@
"""Typed post-filtering of relations: discards pairs with incompatible types."""
from __future__ import annotations


def filter_relations_by_entity_types(
    relations: dict,
    name_to_type: dict,
    allowed: dict,
) -> tuple[list, list]:
    """Filter relations by allowed (head_type, tail_type) per relation kind.

    Validates that each (head, tail) pair in a relation has the expected entity
    types. Relations with unknown types (not in name_to_type) are dropped when
    the relation_type appears in allowed.

    Args:
        relations: Dict mapping rel_type -> list of (head_name, tail_name) tuples.
            E.g. {"president_of": [("Carlos Torres", "BBVA")], ...}
        name_to_type: Dict mapping lowercased entity name -> entity type.
            E.g. {"carlos torres": "person", "bbva": "organization"}
        allowed: Dict mapping rel_type -> (allowed_head_types, allowed_tail_types).
            Each value is a tuple/list of two lists of strings.
            If a rel_type is NOT in allowed, all its pairs are kept.
            E.g. {"president_of": (["person"], ["organization"])}

    Returns:
        Tuple (kept, dropped) where each is a list of dicts:
        {"from": str, "kind": str, "to": str, "head_type": str|None, "tail_type": str|None}
    """
    kept: list[dict] = []
    dropped: list[dict] = []
    for rt, pairs in relations.items():
        rule = allowed.get(rt)
        for h, t in pairs:
            ht = name_to_type.get(h.lower().strip())
            tt = name_to_type.get(t.lower().strip())
            row = {"from": h, "kind": rt, "to": t, "head_type": ht, "tail_type": tt}
            if rule is None:
                # No rule for this relation kind: accept every pair
                kept.append(row)
            else:
                head_ok, tail_ok = rule
                if ht in head_ok and tt in tail_ok:
                    kept.append(row)
                else:
                    dropped.append(row)
    return kept, dropped
@@ -0,0 +1,65 @@
---
id: infer_provincia_from_cp_py_core
name: infer_provincia_from_cp
kind: function
lang: py
domain: core
version: "1.0.0"
purity: pure
signature: "def infer_provincia_from_cp(rows: list[dict], cp_col: str = \"codigo_postal\", prov_col: str = \"provincia\") -> list[str | None]"
description: "Infers the correct province for each row based on the dominant postal code per province. Computes the top-2 postal code prefixes per province; if the row's postal code is in that top-2 it uses the row's own, otherwise the dominant one. Pure stdlib, no pandas."
tags: [string, normalization, spain, geography, postal-code, inference]
uses_functions: [cp_provincia_es_py_core]
uses_types: []
returns: []
returns_optional: false
error_type: ""
imports: ["collections.Counter"]
example: |
  from infer_provincia_from_cp import infer_provincia_from_cp
  rows = [
      {"codigo_postal": "28001", "provincia": "Madrid"},
      {"codigo_postal": "28010", "provincia": "Madrid"},
      {"codigo_postal": "99999", "provincia": "Madrid"},
  ]
  infer_provincia_from_cp(rows)
  # ["Madrid", "Madrid", None]
  # Prefix 99 is in Madrid's top-2 (only two prefixes occur) but is not a known province prefix
tested: true
tests: ["inference with dominant postal code madrid", "row with postal code outside top2 uses dominant", "row without province returns None"]
test_file_path: "python/functions/core/tests/test_infer_provincia_from_cp.py"
file_path: "python/functions/core/infer_provincia_from_cp.py"
params:
  - name: rows
    desc: "List of dicts. Each dict must have at least cp_col (postal code) and prov_col (declared province)."
  - name: cp_col
    desc: "Name of the postal code key in each dict. Defaults to 'codigo_postal'."
  - name: prov_col
    desc: "Name of the province key in each dict. Defaults to 'provincia'."
output: "List of strings or None with the inferred province for each row, in the same order as rows. None when the row's province or postal code is None, when the province has insufficient data, or when the selected prefix is not a known province prefix."
source_repo: "internal:footprint_aurgi"
source_license: "internal-aurgi"
source_file: "aurgi_mapas/generar_pdf_reporte.py"
---
## Example
```python
from infer_provincia_from_cp import infer_provincia_from_cp
rows = [
    {"codigo_postal": "28001", "provincia": "Madrid"},
    {"codigo_postal": "28010", "provincia": "Madrid"},
    {"codigo_postal": "41001", "provincia": "Madrid"},  # Sevilla postal code, declared province Madrid
]
result = infer_provincia_from_cp(rows)
# ["Madrid", "Madrid", "Sevilla"]
# Only two distinct prefixes (28, 41) occur for Madrid, so both are in its top-2 and the
# third row keeps the province of its own postal code. The fallback to the dominant prefix
# (28 -> Madrid) only applies when a row's prefix falls outside the top-2.
```
## Notes
Pure function. Uses `cp_provincia_es` from the same domain for the final lookup.
Adapted from `add_provincia_poliza_correcta` in `aurgi_mapas/generar_pdf_reporte.py`,
removing the pandas dependency and generalizing the columns as parameters.
The algorithm keeps the original semantics: top-2 prefixes per province, with
fallback to the dominant one when the row's postal code does not fit in that top-2.
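
The top-2 rule only triggers the fallback when a province has three or more distinct prefixes. The sketch below is a reduced version of the selection step that works on prefixes alone, without the province lookup. `resolve_prefixes` is a hypothetical helper written for this illustration, not the registry function.

```python
from collections import Counter


def resolve_prefixes(pairs: list[tuple[str, str]]) -> list[str]:
    """Hypothetical sketch: pairs are (provincia, cp_prefix); returns the
    prefix that would be passed to the province lookup for each row."""
    freq: dict[str, Counter] = {}
    for prov, prefix in pairs:
        freq.setdefault(prov, Counter())[prefix] += 1

    resolved = []
    for prov, prefix in pairs:
        top2 = [p for p, _ in freq[prov].most_common(2)]
        # Keep the row's own prefix if it is in the top-2, else use the dominant one
        resolved.append(prefix if prefix in top2 else top2[0])
    return resolved


# Two distinct prefixes: both are in the top-2, so nothing is corrected
two_prefixes = resolve_prefixes([("Madrid", "28"), ("Madrid", "28"), ("Madrid", "41")])

# Three distinct prefixes: 41 falls outside the top-2 and is replaced by 28
three_prefixes = resolve_prefixes(
    [("Madrid", "28")] * 3 + [("Madrid", "45")] * 2 + [("Madrid", "41")]
)

print(two_prefixes)
print(three_prefixes)
```

A practical consequence is that small groups are never corrected: with one or two distinct prefixes per declared province, every row keeps the province of its own postal code.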
@@ -0,0 +1,85 @@
"""Infers the correct province of each row from the dominant postal code per province."""
from __future__ import annotations

import os
import sys
from collections import Counter


def infer_provincia_from_cp(
    rows: list[dict],
    cp_col: str = "codigo_postal",
    prov_col: str = "provincia",
) -> list[str | None]:
    """Infer the correct province of each row using the dominant postal code per province.

    For each province in the dataset, computes the top-2 most frequent postal
    code prefixes. If a row's postal code belongs to that top-2 for its
    province, the province derived from the row's own postal code is used;
    otherwise the province derived from the dominant (top-1) prefix is used.

    Generic logic (pure stdlib, no pandas):
        1. Count prefix frequency per province.
        2. Select the top-2 prefixes per province.
        3. For each row: if its prefix is in the top-2 of its province,
           return cp_provincia_es(prefix); otherwise return cp_provincia_es(top1).
        4. If the row's province has no data, return None.

    Args:
        rows: List of dicts with at least the cp_col and prov_col keys.
        cp_col: Name of the postal code column (default "codigo_postal").
        prov_col: Name of the original province column (default "provincia").

    Returns:
        List of strings (or None) with the inferred province for each row,
        in the same order as rows.
    """
    # Make the sibling module importable when this file is loaded standalone
    _here = os.path.dirname(os.path.abspath(__file__))
    if _here not in sys.path:
        sys.path.insert(0, _here)
    from cp_provincia_es import cp_provincia_es

    # Step 1: count frequency of (province, prefix)
    freq: dict[str, Counter] = {}
    for row in rows:
        prov = row.get(prov_col)
        cp_raw = row.get(cp_col)
        if prov is None or cp_raw is None:
            continue
        cp_str = str(cp_raw).strip().zfill(5)
        prefix = cp_str[:2]
        if prov not in freq:
            freq[prov] = Counter()
        freq[prov][prefix] += 1

    # Step 2: top-2 prefixes per province and dominant prefix (top-1)
    top2: dict[str, list[str]] = {}
    dominant: dict[str, str] = {}
    for prov, counter in freq.items():
        ordered = [p for p, _ in counter.most_common(2)]
        top2[prov] = ordered
        if ordered:
            dominant[prov] = ordered[0]

    # Step 3: resolve the province for each row
    result = []
    for row in rows:
        prov = row.get(prov_col)
        cp_raw = row.get(cp_col)
        if prov is None or cp_raw is None:
            result.append(None)
            continue
        cp_str = str(cp_raw).strip().zfill(5)
        prefix = cp_str[:2]
        if prov in top2 and prefix in top2[prov]:
            result.append(cp_provincia_es(prefix))
        elif prov in dominant:
            result.append(cp_provincia_es(dominant[prov]))
        else:
            result.append(None)
    return result
@@ -0,0 +1,49 @@
---
name: merge_entity_aliases
kind: function
lang: py
domain: core
version: "1.0.0"
purity: pure
signature: "def merge_entity_aliases(entity_names: list[str]) -> dict[str, str]"
description: "Coreference simple por normalizacion + substring: mapea cada nombre de entidad a su forma canonica. 'BBVA' y 'bbva' -> mismo canonical. Nombres cortos absorbidos por nombres largos que los contienen como palabra completa (min 4 chars normalizados)."
tags: [nlp, coreference, entity, alias, normalization, merge, graph, ner]
uses_functions: []
uses_types: []
returns: []
returns_optional: false
error_type: ""
imports: [re, collections.defaultdict]
params:
- name: entity_names
desc: "Lista de nombres de entidades tal como los extrajo el modelo NER. Puede contener duplicados, variaciones de casing (BBVA/bbva) y formas largas/cortas (BBVA / Banco Bilbao Vizcaya Argentaria, S.A.)."
output: "Dict {nombre_original: nombre_canonical}. Identidad para nombres que no son alias de nada. Lista vacia retorna dict vacio."
tested: true
tests:
- "duplicados case-insensitive se mapean al mismo canonical"
- "nombre corto se absorbe en nombre largo que lo contiene"
- "siglas cortas menos de 4 chars no absorben falsamente"
- "nombres totalmente disjuntos se mapean a si mismos"
test_file_path: "python/functions/core/tests/test_merge_entity_aliases.py"
file_path: "python/functions/core/merge_entity_aliases.py"
notes: |
  Validated in playground/server.py of the gliner_glirel_tuning analysis.
  The minimum of 4 normalized chars keeps short acronyms such as "US", "EU", "SA"
  from being absorbed into longer entities that contain them as a word.
  The merge is asymmetric: the LONG name is the canonical one, not the short one.
  Useful as a post-processing step after aggregate_extraction_results, before
  building the final graph.
---
## Ejemplo
```python
from core.merge_entity_aliases import merge_entity_aliases
names = ["BBVA", "bbva", "Bilbao", "Banco Bilbao Vizcaya Argentaria, S.A.", "Inditex"]
alias = merge_entity_aliases(names)
# alias["BBVA"] -> "bbva" (same normalized cluster; equal length, tie broken by string order)
# alias["bbva"] -> "bbva" ("bbva" is not a substring of the long name, so it is not absorbed)
# alias["Bilbao"] -> "Banco Bilbao Vizcaya Argentaria, S.A." (absorbed: whole-word substring)
# alias["Banco Bilbao Vizcaya Argentaria, S.A."] -> "Banco Bilbao Vizcaya Argentaria, S.A."
# alias["Inditex"] -> "Inditex" (identity, not an alias of anything)
```
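
The whole-word rule used in the substring merge can be looked at in isolation. The following is a minimal sketch using only `re`; the helper name `is_whole_word_alias` and its `min_chars` parameter are hypothetical and not part of the registry, and the inputs are assumed to be already normalized (lowercase, punctuation stripped).

```python
import re


def is_whole_word_alias(short_norm: str, long_norm: str, min_chars: int = 4) -> bool:
    """Hypothetical helper: True if short_norm would qualify for absorption by long_norm."""
    # Names below the minimum length are never absorbed.
    if len(short_norm) < min_chars:
        return False
    # The short name must appear as a complete word inside the long name.
    return re.search(r"\b" + re.escape(short_norm) + r"\b", long_norm) is not None


long_norm = "banco bilbao vizcaya argentaria sa"
print(is_whole_word_alias("bilbao", long_norm))  # whole word, 6 chars
print(is_whole_word_alias("bbva", long_norm))    # the acronym is not a substring
print(is_whole_word_alias("sa", long_norm))      # whole word, but under 4 chars
print(is_whole_word_alias("banc", long_norm))    # prefix only, not a whole word
```

Only the first call qualifies. In particular, an acronym is merged with its expanded form only if it literally appears in it, which is why "BBVA" stays separate from the long company name in the example above.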
@@ -0,0 +1,62 @@
"""Coreference simple por normalizacion y substring para entidades nombradas."""
from __future__ import annotations

import re
from collections import defaultdict


def merge_entity_aliases(entity_names: list[str]) -> dict[str, str]:
    """Build alias map: original_name -> canonical_name.

    Two-pass algorithm:
      Step 1 - Normalize: lowercase + strip punctuation -> cluster by normalized form.
               Canonical per cluster = longest original form; ties on length are
               broken by string order (so "bbva" wins over "BBVA").
      Step 2 - Substring merge: short names absorbed by longer ones if short_name
               appears as whole word inside long_name (normalized) AND
               short_name has >= 4 normalized chars (prevents short acronyms
               like 'US' or 'SA' from being absorbed into any longer name
               that contains them as a word).

    Args:
        entity_names: List of entity name strings (may have duplicates or
            different casings, e.g. ["BBVA", "bbva", "Banco Bilbao..."]).

    Returns:
        Dict mapping each input name to its final canonical form.
        Identity mapping for names that are not aliases of anything else.
    """
    if not entity_names:
        return {}

    def normalize(s: str) -> str:
        s = re.sub(r"[\.,;:\"'`()\[\]]", "", s.strip())
        s = re.sub(r"\s+", " ", s)
        return s.strip().lower()

    # Step 1: group by normalized form, pick the longest original as canonical
    norm_groups: dict[str, list[str]] = defaultdict(list)
    for n in entity_names:
        norm_groups[normalize(n)].append(n)

    canonical: dict[str, str] = {}
    for nrm, group in norm_groups.items():
        winner = max(group, key=lambda x: (len(x), x))
        for n in group:
            canonical[n] = winner

    # Step 2: substring merge over the canonicals (long absorbs short if short is inside long)
    canon_set = sorted(set(canonical.values()), key=len, reverse=True)
    absorbed: dict[str, str] = {}
    for long_n in canon_set:
        long_norm = normalize(long_n)
        for short_n in canon_set:
            if short_n == long_n or short_n in absorbed:
                continue
            short_norm = normalize(short_n)
            if len(short_norm) < 4:
                continue
            if re.search(r"\b" + re.escape(short_norm) + r"\b", long_norm):
                absorbed[short_n] = long_n

    return {orig: absorbed.get(canon, canon) for orig, canon in canonical.items()}
@@ -0,0 +1,52 @@
---
id: normalize_for_join_py_core
name: normalize_for_join
kind: function
lang: py
domain: core
version: "1.0.0"
purity: pure
signature: "def normalize_for_join(values: Iterable) -> list[str]"
description: "Normaliza strings para fuzzy joins: upper + strip diacriticos NFD + elimina non [A-Z0-9 ] + colapsa espacios. Trabaja con cualquier iterable. None/NaN -> cadena vacia."
tags: [string, normalization, join, fuzzy, spain]
uses_functions: []
uses_types: []
returns: []
returns_optional: false
error_type: ""
imports: ["re", "unicodedata", "typing.Iterable"]
example: |
from normalize_for_join import normalize_for_join
normalize_for_join(["Calle Mayor, 14", "avila", None])
# ["CALLE MAYOR 14", "AVILA", ""]
tested: true
tests: ["normalize con puntuacion y diacriticos y None"]
test_file_path: "python/functions/core/tests/test_normalize_for_join.py"
file_path: "python/functions/core/normalize_for_join.py"
params:
- name: values
desc: "Iterable de strings o None/NaN a normalizar. Acepta listas, generadores, pd.Series, etc."
output: "Lista de strings normalizados en mayusculas sin diacriticos. None y NaN se convierten a cadena vacia."
source_repo: "internal:footprint_aurgi"
source_license: "internal-aurgi"
source_file: "fuzzy_joins/arreglo_fuzzy.py"
---
## Ejemplo
```python
from normalize_for_join import normalize_for_join
normalize_for_join(["Calle Mayor, 14", "ávila", None])
# ["CALLE MAYOR 14", "AVILA", ""]
normalize_for_join(["José García S.L.", "BANCO DE ESPAÑA"])
# ["JOSE GARCIA SL", "BANCO DE ESPANA"]
```
## Notas
Pure function with no external dependencies (only `re` and `unicodedata` from the stdlib).
Adapted from `preparar_para_join` / `normalizar_string` in `fuzzy_joins/arreglo_fuzzy.py`,
dropping the pandas dependency so that it works with any iterable.
Useful as a step before exact-equality joins on the normalized values.
@@ -0,0 +1,44 @@
"""Normaliza strings para joins sin dependencias externas."""
import re
import unicodedata
from typing import Iterable


def normalize_for_join(values: Iterable) -> list:
    """Normalize strings for joins: upper + no diacritics + only [A-Z0-9 ] + collapsed spaces.

    For each value: convert to string, uppercase, strip diacritics via NFD,
    remove every character that is not a letter, digit or whitespace,
    collapse repeated whitespace, trim. None or NaN become the empty string.
    Does not depend on pandas; works with any iterable of strings or None.

    Args:
        values: Iterable of strings or None. Can be a list, generator, Series, etc.

    Returns:
        List of normalized strings. None/NaN become "".
    """
    result = []
    for v in values:
        if v is None:
            result.append("")
            continue
        # Detect numpy/pandas NaN without importing them
        try:
            if v != v:  # NaN != NaN
                result.append("")
                continue
        except (TypeError, ValueError):
            pass
        texto = str(v).upper()
        texto = "".join(
            c for c in unicodedata.normalize("NFD", texto)
            if unicodedata.category(c) != "Mn"
        )
        texto = re.sub(r"[^A-Z0-9\s]", "", texto)
        texto = re.sub(r"\s+", " ", texto)
        texto = texto.strip()
        result.append(texto)
    return result
@@ -0,0 +1,43 @@
---
name: safe_read_csv_fallback
kind: function
lang: py
domain: core
version: "1.0.0"
purity: impure
signature: "safe_read_csv_fallback(path: str | Path) -> pd.DataFrame"
description: "Lee un CSV intentando utf-8 primero; si falla con UnicodeDecodeError reintenta con latin-1. Cubre exportaciones legacy de Excel y herramientas occidentales."
tags: [csv, encoding, pandas, io, core]
uses_functions: []
uses_types: []
returns: []
returns_optional: false
error_type: "error_go_core"
imports: [pandas, pathlib]
params:
- name: path
desc: "Ruta al archivo CSV a leer. Puede ser str o Path."
output: "DataFrame de pandas con el contenido del CSV. Codificación detectada automáticamente (utf-8 o latin-1)."
tested: true
tests:
- "lee csv utf-8 correctamente"
- "lee csv latin-1 con fallback"
test_file_path: "python/functions/core/tests/test_safe_read_csv_fallback.py"
file_path: "python/functions/core/safe_read_csv_fallback.py"
source_repo: "internal:footprint_aurgi"
source_license: "internal-aurgi"
source_file: "ponderacion_isochronas/example/models/eda/utils.py"
---
## Ejemplo
```python
df = safe_read_csv_fallback("datos_clientes.csv")
print(df.shape)
```
## Notas
Falls back only on UnicodeDecodeError. Other errors (missing file,
malformed CSV) propagate normally.
latin-1 covers most Excel exports from Spanish / Western European locales.
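
The fallback order can be illustrated without pandas. The sketch below applies the same try/except pattern to raw bytes with the stdlib only; the sample string is made up and stands in for the content of a legacy export.

```python
# Bytes as a legacy latin-1 export might store them (hypothetical content).
raw = "Madrid,Ávila,Cataluña".encode("latin-1")

try:
    text = raw.decode("utf-8")
    used = "utf-8"
except UnicodeDecodeError:
    # latin-1 maps every byte 0x00-0xFF to a code point, so this decode cannot fail.
    text = raw.decode("latin-1")
    used = "latin-1"

print(used, text)
```

Because a latin-1 decode never raises, the fallback also "succeeds" on files written in some other single-byte encoding, in which case some characters may come out wrong without any error being reported.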
@@ -0,0 +1,34 @@
"""Read a CSV file with automatic encoding fallback from utf-8 to latin-1."""
from __future__ import annotations

from pathlib import Path
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    import pandas as pd


def safe_read_csv_fallback(path: "str | Path") -> "pd.DataFrame":
    """Read a CSV file, falling back to latin-1 if utf-8 decoding fails.

    Tries pandas read_csv with the default utf-8 encoding first. On a
    UnicodeDecodeError retries with latin-1 (ISO-8859-1), which covers most
    Western European legacy CSV exports.

    Args:
        path: Path to the CSV file.

    Returns:
        A pandas DataFrame with the CSV contents.

    Raises:
        FileNotFoundError: If the file does not exist.
        Exception: Any other pandas read error (malformed CSV, etc.).
    """
    import pandas as pd

    p = Path(path)
    try:
        return pd.read_csv(p)
    except UnicodeDecodeError:
        return pd.read_csv(p, encoding="latin-1")
@@ -0,0 +1,59 @@
---
id: slugify_ascii_py_core
name: slugify_ascii
kind: function
lang: py
domain: core
version: "1.0.0"
purity: pure
signature: "def slugify_ascii(text: str, max_len: int = 80, default: str = \"centro\") -> str"
description: "Convierte texto a slug ASCII lowercase sin diacriticos. Strip + lower + NFD + reemplaza non-alphanum por guion + colapsa guiones. Si vacio retorna default. Trunca a max_len."
tags: [string, normalization, slug, ascii, spain]
uses_functions: []
uses_types: []
returns: []
returns_optional: false
error_type: ""
imports: ["re", "unicodedata"]
example: |
from slugify_ascii import slugify_ascii
slugify_ascii("Calle Mayor, 14") # "calle-mayor-14"
slugify_ascii("Ávila") # "avila"
slugify_ascii("") # "centro"
slugify_ascii("a" * 100, max_len=10) # "aaaaaaaaaa"
tested: true
tests: ["slugify texto con puntuacion", "slugify diacriticos", "slugify cadena vacia retorna default", "slugify trunca a max_len"]
test_file_path: "python/functions/core/tests/test_slugify_ascii.py"
file_path: "python/functions/core/slugify_ascii.py"
params:
- name: text
desc: "Texto de entrada a convertir en slug. None se trata como cadena vacia."
- name: max_len
desc: "Longitud maxima del slug resultante. Por defecto 80 caracteres."
- name: default
desc: "Valor a retornar si el slug resultante esta vacio. Por defecto 'centro'."
output: "Slug ASCII lowercase sin diacriticos, maximo max_len caracteres. Retorna default si el resultado esta vacio."
source_repo: "internal:footprint_aurgi"
source_license: "internal-aurgi"
source_file: "zonas_mapas_aurgi/scripts/generate_isochrones.py"
---
## Ejemplo
```python
from slugify_ascii import slugify_ascii
slugify_ascii("Calle Mayor, 14") # "calle-mayor-14"
slugify_ascii("Ávila") # "avila"
slugify_ascii("") # "centro"
slugify_ascii(None) # "centro"
slugify_ascii("a" * 100, max_len=10) # "aaaaaaaaaa"
slugify_ascii("---", default="sin-nombre") # "sin-nombre"
```
## Notas
Pure function with no external dependencies. Uses only `re` and `unicodedata` from the stdlib.
Adapted from `_slugify` in `zonas_mapas_aurgi/scripts/generate_isochrones.py` and
`ponderacion_isochronas/src/generar_isochronas_aurgi.py`, combining the
NFD normalization of the former with the truncation and default of the latter.
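
One edge case of the truncation step is worth knowing about: the cut is applied after hyphens are trimmed, so it can land right on a hyphen. The sketch below reproduces the function body from this commit as a local copy so that it runs on its own, and adds `slugify_trimmed`, a hypothetical wrapper (not part of the registry) that drops the dangling hyphen.

```python
import re
import unicodedata


def slugify_ascii(text, max_len=80, default="centro"):
    """Local copy of the function body from this commit, for a runnable demo."""
    if text is None:
        return default
    text = str(text).strip().lower()
    text = "".join(
        c for c in unicodedata.normalize("NFD", text)
        if unicodedata.category(c) != "Mn"
    )
    text = re.sub(r"[^a-z0-9]+", "-", text).strip("-")
    if not text:
        return default
    return text[:max_len]


def slugify_trimmed(text, max_len=80, default="centro"):
    """Hypothetical wrapper: drop a hyphen left dangling by truncation."""
    return slugify_ascii(text, max_len, default).rstrip("-") or default


print(slugify_ascii("Calle Mayor", max_len=6))    # truncation lands on the hyphen
print(slugify_trimmed("Calle Mayor", max_len=6))  # dangling hyphen removed
```

Whether a trailing hyphen matters depends on how the slug is used (file names, URLs); with the default `max_len=80` it is unlikely to show up in practice.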
@@ -0,0 +1,33 @@
"""Convierte texto a slug ASCII lowercase sin diacriticos."""
import re
import unicodedata


def slugify_ascii(text: str, max_len: int = 80, default: str = "centro") -> str:
    """Convert text to a lowercase ASCII slug without diacritics.

    Applies: strip + lower + remove diacritics via NFD + replace
    non-alphanumerics with a hyphen + collapse hyphens + trim. If the
    result is empty, returns default. Truncates to max_len.

    Args:
        text: Input text. None is treated as empty.
        max_len: Maximum length of the resulting slug (default 80).
        default: Value to return if the slug ends up empty (default "centro").

    Returns:
        Lowercase ASCII slug, at most max_len characters.
    """
    if text is None:
        return default
    text = str(text).strip().lower()
    text = "".join(
        c for c in unicodedata.normalize("NFD", text)
        if unicodedata.category(c) != "Mn"
    )
    text = re.sub(r"[^a-z0-9]+", "-", text)
    text = text.strip("-")
    if not text:
        return default
    return text[:max_len]
@@ -0,0 +1,65 @@
"""Tests para aggregate_extraction_results."""
from __future__ import annotations
import os
import sys
from collections import Counter
sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", ".."))
from core.aggregate_extraction_results import aggregate_extraction_results
def test_lista_vacia_retorna_entities_y_relations_vacios():
"""lista vacia retorna entities vacio y relations vacio"""
result = aggregate_extraction_results([])
assert result["entities"] == {}
assert result["relations"] == Counter()
def test_resultado_unico_se_agrega_correctamente():
"""resultado unico se agrega correctamente"""
r = [
{
"entities": {"person": ["Pablo Isla"], "organization": ["Inditex"]},
"relation_extraction": {"ceo_of": [("Pablo Isla", "Inditex")]},
}
]
result = aggregate_extraction_results(r)
assert ("person", "pablo isla") in result["entities"]
assert ("organization", "inditex") in result["entities"]
assert result["entities"][("person", "pablo isla")]["count"] == 1
assert result["relations"][("Pablo Isla", "ceo_of", "Inditex")] == 1
def test_dos_resultados_con_solapamiento_acumulan_counts():
"""dos resultados con solapamiento acumulan counts"""
r = [
{
"entities": {"person": ["Pablo Isla"], "organization": ["Inditex"]},
"relation_extraction": {"ceo_of": [("Pablo Isla", "Inditex")]},
},
{
"entities": {"person": ["Pablo Isla"], "organization": ["Inditex"]},
"relation_extraction": {"ceo_of": [("Pablo Isla", "Inditex")]},
},
]
result = aggregate_extraction_results(r)
assert result["entities"][("person", "pablo isla")]["count"] == 2
assert result["relations"][("Pablo Isla", "ceo_of", "Inditex")] == 2
def test_entidades_deduplicen_case_insensitive():
"""entidades se deduplicien case-insensitive"""
r = [
{"entities": {"person": ["Pablo Isla"]}, "relation_extraction": {}},
{"entities": {"person": ["pablo isla"]}, "relation_extraction": {}},
]
result = aggregate_extraction_results(r)
# Ambas van a la misma key (person, pablo isla)
assert ("person", "pablo isla") in result["entities"]
assert result["entities"][("person", "pablo isla")]["count"] == 2
# Solo una key para pablo isla
pablo_keys = [k for k in result["entities"] if k[1] == "pablo isla"]
assert len(pablo_keys) == 1
@@ -0,0 +1,72 @@
"""Tests para chunk_with_overlap."""
from __future__ import annotations
import os
import sys
sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", ".."))
from core.chunk_with_overlap import chunk_with_overlap
def test_texto_vacio_retorna_lista_vacia():
"""texto vacio retorna lista vacia"""
assert chunk_with_overlap("") == []
assert chunk_with_overlap(" ") == []
def test_una_frase_menor_que_max_chars_produce_1_chunk():
"""una frase menor que max_chars produce 1 chunk"""
text = "Esta es una frase corta."
chunks = chunk_with_overlap(text, max_chars=500, overlap_sentences=0)
assert len(chunks) == 1
assert chunks[0]["text"] == text
def test_multiples_frases_producen_N_chunks_con_overlap():
    """multiples frases producen N chunks con overlap"""
    # 3 sentences of 25 chars each, max_chars=55 -> at least 2 chunks
    text = "Primera frase larga aqui. Segunda frase larga aqui. Tercera frase larga aqui."
    chunks = chunk_with_overlap(text, max_chars=55, overlap_sentences=1)
    assert len(chunks) >= 2
    # Every chunk has non-empty text
    for c in chunks:
        assert c["text"].strip()
        assert len(c["sentences"]) > 0
def test_frase_mas_larga_que_max_chars_no_bucle_infinito():
"""frase mas larga que max_chars se incluye sin bucle infinito"""
long_sentence = "A" * 2000 + "."
chunks = chunk_with_overlap(long_sentence, max_chars=100, overlap_sentences=0)
# Debe terminar (no bucle infinito) y producir exactamente 1 chunk
assert len(chunks) == 1
assert chunks[0]["text"] == long_sentence.strip()
def test_overlap_0_no_duplica_frases():
"""overlap=0 no duplica frases entre chunks"""
text = "Primera frase aqui completa. Segunda frase aqui completa. Tercera frase aqui completa."
chunks = chunk_with_overlap(text, max_chars=50, overlap_sentences=0)
# Recolectar todas las frases de todos los chunks
all_sents = [s for c in chunks for s in c["sentences"]]
# Con overlap=0 ninguna frase debe aparecer dos veces
assert len(all_sents) == len(set(all_sents))
def test_overlap_2_el_chunk_N_mas_1_empieza_con_ultimas_2_frases_del_N():
"""overlap=2 el chunk N+1 empieza con las 2 ultimas frases del chunk N"""
# 5 frases cortas, max_chars=80 para forzar al menos 2 chunks
text = (
"Frase uno aqui. "
"Frase dos aqui. "
"Frase tres aqui. "
"Frase cuatro aqui. "
"Frase cinco aqui."
)
chunks = chunk_with_overlap(text, max_chars=80, overlap_sentences=2)
if len(chunks) >= 2:
prev_tail = chunks[0]["sentences"][-2:]
next_head = chunks[1]["sentences"][:2]
assert prev_tail == next_head
@@ -0,0 +1,49 @@
"""Tests para clean_pdf_text."""
from __future__ import annotations
import os
import sys
sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", ".."))
from core.clean_pdf_text import clean_pdf_text
def test_string_vacio_retorna_vacio():
"""string vacio retorna vacio"""
assert clean_pdf_text("") == ""
def test_marca_de_pagina_1_20_se_elimina():
"""marca de pagina 1/20 se elimina"""
result = clean_pdf_text("1/20\nfoo bar")
assert "1/20" not in result
assert "foo bar" in result
def test_dehyphenation_exa_newline_mple():
"""dehyphenation exa-newline-mple -> example"""
result = clean_pdf_text("exa-\nmple")
assert result == "example"
def test_espacios_duplicados_se_colapsan():
"""espacios duplicados se colapsan"""
result = clean_pdf_text("ab cd")
assert result == "ab cd"
def test_salto_de_linea_en_mitad_de_oracion_se_une_con_espacio():
"""salto de linea en mitad de oracion se une con espacio"""
result = clean_pdf_text("Pablo Isla es el\npresidente de Inditex")
assert result == "Pablo Isla es el presidente de Inditex"
def test_salto_de_linea_tras_punto_se_preserva():
"""salto de linea tras punto se preserva"""
result = clean_pdf_text("Primera oracion.\nSegunda oracion.")
# El salto tras punto debe quedar (no se une con espacio)
assert "\n" in result
assert "Primera oracion." in result
assert "Segunda oracion." in result
@@ -0,0 +1,44 @@
"""Tests para cp_provincia_es."""
import sys
import os
sys.path.insert(0, os.path.join(os.path.dirname(__file__), ".."))
from cp_provincia_es import cp_provincia_es
def test_cp_completo_retorna_provincia():
"""cp completo retorna provincia"""
assert cp_provincia_es("28001") == "Madrid"
def test_prefijo_2_digitos_retorna_provincia():
"""prefijo 2 digitos retorna provincia"""
assert cp_provincia_es("28") == "Madrid"
def test_primer_prefijo_01_retorna_alava():
"""primer prefijo 01 retorna Alava"""
assert cp_provincia_es("01") == "Álava"
def test_cp_desconocido_retorna_none():
"""cp desconocido retorna None"""
assert cp_provincia_es("99") is None
def test_cp_entero_completo():
assert cp_provincia_es(28001) == "Madrid"
def test_cp_ceuta():
assert cp_provincia_es("51001") == "Ceuta"
def test_cp_melilla():
assert cp_provincia_es("52") == "Melilla"
def test_cp_barcelona():
assert cp_provincia_es("08") == "Barcelona"
@@ -0,0 +1,54 @@
"""Tests para csv_to_parquet_duckdb."""
from __future__ import annotations
import tempfile
from pathlib import Path
import pytest
def test_convierte_csv_a_parquet_y_duckdb_puede_leerlo():
"""convierte csv a parquet y duckdb puede leerlo"""
import sys
sys.path.insert(0, str(Path(__file__).resolve().parents[2]))
from core.csv_to_parquet_duckdb import csv_to_parquet_duckdb
import duckdb
with tempfile.TemporaryDirectory() as tmpdir:
csv_path = Path(tmpdir) / "test.csv"
parquet_path = Path(tmpdir) / "test.parquet"
csv_path.write_text("nombre,lat,lon\nMadrid,40.4,-3.7\nBarcelona,41.3,2.1\n")
result = csv_to_parquet_duckdb(csv_path, parquet_path)
assert result is True
assert parquet_path.exists()
assert parquet_path.stat().st_size > 0
# Verify duckdb can read it back
con = duckdb.connect()
df = con.execute(f"SELECT * FROM read_parquet('{parquet_path}')").df()
con.close()
assert df.shape == (2, 3)
assert set(df.columns) == {"nombre", "lat", "lon"}
def test_overwrite_False_no_sobreescribe_parquet_existente():
"""overwrite=False no sobreescribe parquet existente"""
import sys
sys.path.insert(0, str(Path(__file__).resolve().parents[2]))
from core.csv_to_parquet_duckdb import csv_to_parquet_duckdb
with tempfile.TemporaryDirectory() as tmpdir:
csv_path = Path(tmpdir) / "test.csv"
parquet_path = Path(tmpdir) / "test.parquet"
csv_path.write_text("a,b\n1,2\n")
# Create existing parquet with known content
parquet_path.write_bytes(b"existing content")
original_size = parquet_path.stat().st_size
result = csv_to_parquet_duckdb(csv_path, parquet_path, overwrite=False)
assert result is False
# File must remain unchanged
assert parquet_path.stat().st_size == original_size
@@ -0,0 +1,60 @@
"""Tests para filter_relations_by_entity_types."""
from __future__ import annotations
import os
import sys
sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", ".."))
from core.filter_relations_by_entity_types import filter_relations_by_entity_types
NAME_TO_TYPE = {
"carlos torres": "person",
"bbva": "organization",
"madrid": "location",
"santander": "organization",
"ana": "person",
}
ALLOWED = {
"president_of": (["person"], ["organization"]),
"located_in": (["organization", "person"], ["location"]),
}
def test_pares_validos_se_incluyen_en_kept():
"""pares validos se incluyen en kept"""
relations = {"president_of": [("Carlos Torres", "BBVA")]}
kept, dropped = filter_relations_by_entity_types(relations, NAME_TO_TYPE, ALLOWED)
assert len(kept) == 1
assert kept[0]["from"] == "Carlos Torres"
assert kept[0]["to"] == "BBVA"
assert len(dropped) == 0
def test_pares_con_tipos_incompatibles_van_a_dropped():
"""pares con tipos incompatibles van a dropped"""
# Madrid es location, no person -> no puede presidir nada
relations = {"president_of": [("Madrid", "Santander")]}
kept, dropped = filter_relations_by_entity_types(relations, NAME_TO_TYPE, ALLOWED)
assert len(kept) == 0
assert len(dropped) == 1
assert dropped[0]["head_type"] == "location"
def test_rel_type_no_en_allowed_se_acepta_siempre():
"""rel_type no en allowed se acepta siempre"""
relations = {"unknown_rel": [("Carlos Torres", "Madrid")]}
kept, dropped = filter_relations_by_entity_types(relations, NAME_TO_TYPE, ALLOWED)
assert len(kept) == 1
assert len(dropped) == 0
def test_entidad_no_encontrada_en_name_to_type_va_a_dropped():
"""entidad no encontrada en name_to_type va a dropped"""
# "Desconocido" no esta en name_to_type -> head_type es None -> dropped
relations = {"president_of": [("Desconocido", "BBVA")]}
kept, dropped = filter_relations_by_entity_types(relations, NAME_TO_TYPE, ALLOWED)
assert len(dropped) == 1
assert dropped[0]["head_type"] is None
@@ -0,0 +1,78 @@
"""Tests para infer_provincia_from_cp."""
import sys
import os
sys.path.insert(0, os.path.join(os.path.dirname(__file__), ".."))
from infer_provincia_from_cp import infer_provincia_from_cp
def test_inferencia_con_cp_dominante_madrid():
"""inferencia con cp dominante madrid"""
rows = [
{"codigo_postal": "28001", "provincia": "Madrid"},
{"codigo_postal": "28010", "provincia": "Madrid"},
]
result = infer_provincia_from_cp(rows)
assert result == ["Madrid", "Madrid"]
def test_fila_con_cp_fuera_de_top2_usa_dominante():
    """fila con cp fuera de top2 usa dominante"""
    # Madrid has 3 distinct prefixes: 28 (x4), 29 (x2), 41 (x1).
    # The top-2 are 28 and 29, so 41 is left out.
    # More than 2 distinct prefixes are needed for any prefix to fall outside the top-2.
    rows = [
        {"codigo_postal": "28001", "provincia": "Madrid"},
        {"codigo_postal": "28002", "provincia": "Madrid"},
        {"codigo_postal": "28003", "provincia": "Madrid"},
        {"codigo_postal": "28004", "provincia": "Madrid"},
        {"codigo_postal": "29001", "provincia": "Madrid"},
        {"codigo_postal": "29002", "provincia": "Madrid"},
        {"codigo_postal": "41001", "provincia": "Madrid"},  # outlier: outside the top-2
    ]
    result = infer_provincia_from_cp(rows)
    # "41" is not in the top-2, so the dominant prefix is used (28 -> Madrid)
    assert result[6] == "Madrid"
def test_fila_sin_provincia_retorna_none():
"""fila sin provincia retorna None"""
rows = [
{"codigo_postal": "28001", "provincia": None},
]
result = infer_provincia_from_cp(rows)
assert result == [None]
def test_fila_sin_cp_retorna_none():
rows = [
{"codigo_postal": None, "provincia": "Madrid"},
]
result = infer_provincia_from_cp(rows)
assert result == [None]
def test_columnas_custom():
rows = [
{"cp": "28001", "prov": "Madrid"},
{"cp": "28010", "prov": "Madrid"},
]
result = infer_provincia_from_cp(rows, cp_col="cp", prov_col="prov")
assert result == ["Madrid", "Madrid"]
def test_multiples_provincias():
rows = [
{"codigo_postal": "28001", "provincia": "Madrid"},
{"codigo_postal": "08001", "provincia": "Barcelona"},
{"codigo_postal": "41001", "provincia": "Sevilla"},
]
result = infer_provincia_from_cp(rows)
assert result == ["Madrid", "Barcelona", "Sevilla"]
def test_lista_vacia():
assert infer_provincia_from_cp([]) == []
@@ -0,0 +1,58 @@
"""Tests para merge_entity_aliases."""
from __future__ import annotations
import os
import sys
sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", ".."))
from core.merge_entity_aliases import merge_entity_aliases
def test_duplicados_case_insensitive_se_mapean_al_mismo_canonical():
    """duplicados case-insensitive se mapean al mismo canonical"""
    result = merge_entity_aliases(["BBVA", "bbva", "Bbva"])
    # All three must point to the same canonical
    vals = set(result.values())
    assert len(vals) == 1
    # All forms have the same length, so the tie is broken by string order
    # (which one wins is an implementation detail; only the normalized form is checked)
    canon = vals.pop()
    assert canon.lower() == "bbva"
def test_nombre_corto_se_absorbe_en_nombre_largo_que_lo_contiene():
"""nombre corto se absorbe en nombre largo que lo contiene"""
# El substring merge funciona cuando la forma corta APARECE LITERALMENTE
# en la forma larga (normalizada). Ejemplo: "bilbao" esta en "banco bilbao vizcaya argentaria"
names = ["Bilbao", "Banco Bilbao Vizcaya Argentaria"]
result = merge_entity_aliases(names)
# "bilbao" (6 chars) aparece como palabra en la forma larga normalizada
assert result["Bilbao"] == "Banco Bilbao Vizcaya Argentaria"
assert result["Banco Bilbao Vizcaya Argentaria"] == "Banco Bilbao Vizcaya Argentaria"
def test_siglas_cortas_menos_de_4_chars_no_absorben_falsamente():
    """siglas cortas menos de 4 chars no absorben falsamente"""
    # "US" (2 normalized chars) and "USA" (3) are below the 4-char minimum,
    # so neither may be merged into a longer name.
    names = ["US", "USA", "Standard Chartered"]
    result = merge_entity_aliases(names)
    assert result["US"] == "US"
    assert result["USA"] == "USA"
    # Unrelated names always map to themselves
    assert result["Standard Chartered"] == "Standard Chartered"
def test_nombres_totalmente_disjuntos_se_mapean_a_si_mismos():
"""nombres totalmente disjuntos se mapean a si mismos"""
names = ["Inditex", "Santander", "Telefonica"]
result = merge_entity_aliases(names)
assert result["Inditex"] == "Inditex"
assert result["Santander"] == "Santander"
assert result["Telefonica"] == "Telefonica"
def test_lista_vacia_retorna_dict_vacio():
"""lista vacia retorna dict vacio"""
assert merge_entity_aliases([]) == {}
@@ -0,0 +1,42 @@
"""Tests para normalize_for_join."""
import sys
import os
sys.path.insert(0, os.path.join(os.path.dirname(__file__), ".."))
from normalize_for_join import normalize_for_join
def test_normalize_con_puntuacion_y_diacriticos_y_none():
"""normalize con puntuacion y diacriticos y None"""
result = normalize_for_join(["Calle Mayor, 14", "ávila", None])
assert result == ["CALLE MAYOR 14", "AVILA", ""]
def test_normalize_lista_vacia():
assert normalize_for_join([]) == []
def test_normalize_upper():
assert normalize_for_join(["madrid"]) == ["MADRID"]
def test_normalize_elimina_simbolos():
assert normalize_for_join(["José García S.L."]) == ["JOSE GARCIA SL"]
def test_normalize_colapsa_espacios():
assert normalize_for_join([" hola mundo "]) == ["HOLA MUNDO"]
def test_normalize_nan_as_empty():
# NaN de float (float('nan'))
result = normalize_for_join([float("nan")])
assert result == [""]
def test_normalize_entero():
# Enteros se convierten a string
result = normalize_for_join([28001])
assert result == ["28001"]
@@ -0,0 +1,40 @@
"""Tests para safe_read_csv_fallback."""
from __future__ import annotations
import tempfile
from pathlib import Path
import pytest
def test_lee_csv_utf_8_correctamente():
"""lee csv utf-8 correctamente"""
import sys
sys.path.insert(0, str(Path(__file__).resolve().parents[2]))
from core.safe_read_csv_fallback import safe_read_csv_fallback
with tempfile.TemporaryDirectory() as tmpdir:
csv_path = Path(tmpdir) / "test_utf8.csv"
csv_path.write_text("nombre,valor\nAña,42\nBéta,99\n", encoding="utf-8")
df = safe_read_csv_fallback(csv_path)
assert df.shape == (2, 2)
assert list(df.columns) == ["nombre", "valor"]
assert df["nombre"].tolist() == ["Aña", "Béta"]
def test_lee_csv_latin_1_con_fallback():
"""lee csv latin-1 con fallback"""
import sys
sys.path.insert(0, str(Path(__file__).resolve().parents[2]))
from core.safe_read_csv_fallback import safe_read_csv_fallback
with tempfile.TemporaryDirectory() as tmpdir:
csv_path = Path(tmpdir) / "test_latin1.csv"
# Write latin-1 encoded CSV (ñ, é are 0xF1, 0xE9 in latin-1)
csv_path.write_bytes("nombre,valor\nMad\xf1id,10\nC\xe9ntro,20\n".encode("latin-1"))
df = safe_read_csv_fallback(csv_path)
assert df.shape == (2, 2)
assert "Mad" in df["nombre"].iloc[0]
assert df["valor"].tolist() == [10, 20]
@@ -0,0 +1,44 @@
"""Tests para slugify_ascii."""
import sys
import os
sys.path.insert(0, os.path.join(os.path.dirname(__file__), ".."))
from slugify_ascii import slugify_ascii
def test_slugify_texto_con_puntuacion():
"""slugify texto con puntuacion"""
assert slugify_ascii("Calle Mayor, 14") == "calle-mayor-14"
def test_slugify_diacriticos():
"""slugify diacriticos"""
assert slugify_ascii("Ávila") == "avila"
def test_slugify_cadena_vacia_retorna_default():
"""slugify cadena vacia retorna default"""
assert slugify_ascii("") == "centro"
def test_slugify_trunca_a_max_len():
"""slugify trunca a max_len"""
assert slugify_ascii("a" * 100, max_len=10) == "aaaaaaaaaa"
def test_slugify_none_retorna_default():
assert slugify_ascii(None) == "centro"
def test_slugify_default_custom():
assert slugify_ascii("---", default="sin-nombre") == "sin-nombre"
def test_slugify_solo_diacriticos_y_puntuacion():
assert slugify_ascii("ñoño") == "nono"
def test_slugify_numeros():
assert slugify_ascii("28001 Madrid") == "28001-madrid"