feat: bulk extraction from footprint_aurgi (41 funcs + 4 types + geo Docker stack)

Extracts functions from the internal footprint_aurgi project into the registry:
- core (6): slugify_ascii, normalize_for_join, cp_provincia_es, infer_provincia_from_cp, safe_read_csv_fallback, csv_to_parquet_duckdb
- pure geo (7): haversine_km, point_in_ring, point_in_polygon, point_in_polygons_bbox, polygon_bbox, extent_with_padding, distance_bucket
- geo I/O (4): load_geojson_polygons, load_boundary_gdf, add_basemap_osm, add_basemap_with_timeout
- valhalla client (4): valhalla_route, valhalla_isochrone, valhalla_isochrones_async, valhalla_matrix_1_to_n
- datascience stats (7): trimmed_mean, geometric_mean, detect_distribution_type, best_central_tendency, summary_stats, kde_density_levels, alpha_shape_concave_hull
- datascience fuzzy (3): fuzzy_merge_adaptive (rapidfuzz), words_to_dataset, remove_words_from_column
- datascience viz (2): plot_kde_2d, plot_heatmap_log
- infra (4): compress_pdf_ghostscript, render_table_page_pdfpages, add_header_logo, osm2pgsql_ingest
- pipelines (4): setup_geo_stack_docker, compute_centers_reachability, generate_isochrones_by_zone, count_points_per_zone
- geo types (4): LonLat, BBox, IsochroneRequest, Centro

Includes:
- apps/footprint_geo_stack/ (PostGIS + Martin + Valhalla via docker-compose)
- 131/132 tests pass (1 expected skip: osm2pgsql on PATH)
- Issue tracker dev/issues/0052-footprint-aurgi-extraction.md
- Uniform attribution: source_repo internal:footprint_aurgi, source_license internal-aurgi
- Built with 9 agents in parallel (8 in wave 1 + 1 in wave 2 for pipelines)

Also commits previously uncommitted work: aggregate_extraction_results, chunk_with_overlap, clean_pdf_text, merge_entity_aliases, extract_graph_gliner2, extract_relations_mrebel, extract_triples_spacy_es, gliner2/mrebel/marianmt/rebel/spacy_es load_model, parse_rebel_output, translate_es_to_en, issues 0050/0051.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -0,0 +1,70 @@
---
name: align_relations_to_entities
kind: function
lang: py
domain: datascience
version: "1.0.0"
purity: pure
signature: "def align_relations_to_entities(triplets: list[dict], entity_names: list[str]) -> list[dict]"
description: "Filters and aligns REBEL/mREBEL triplets to canonical entity names. For each triplet, resolves head and tail against entity_names by exact case-insensitive match or by substring match (the longest name wins). Drops triplets where either side fails to resolve or where both sides resolve to the same entity."
tags: [rebel, mrebel, relation-extraction, nlp, align, knowledge-graph, datascience, python]
uses_functions: []
uses_types: []
returns: []
returns_optional: false
error_type: ""
imports: []
params:
  - name: triplets
    desc: "list of dicts produced by parse_rebel_output, with keys head, head_type, type, tail, tail_type"
  - name: entity_names
    desc: "canonical names of the known entities to align against (e.g. [e.name for e in entities])"
output: "list of dicts with keys from (str), kind (str), to (str), head_type (str), tail_type (str). from/to are values taken verbatim from entity_names."
tested: true
tests:
  - "exact case-insensitive match resolves correctly"
  - "entity is a substring of the head span"
  - "span is a substring of the entity name"
  - "longest entity name wins on ambiguity"
  - "triplet with no match is dropped"
  - "triplet with head == tail is dropped (self-loop)"
test_file_path: "python/functions/datascience/tests/test_align_relations_to_entities.py"
file_path: "python/functions/datascience/align_relations_to_entities.py"
notes: |
  Pure function. Composes with parse_rebel_output: the output of parse_rebel_output goes in
  as triplets, and entity_names comes from [e.name for e in entities] in the extraction context.
  Matching strategy:
  1. Exact case-insensitive (O(1) via dict)
  2. Bidirectional substring: entity in span OR span in entity (iterates by length DESC)
  This covers cases such as mREBEL emitting "esta en Bilbao" when the entity is "Bilbao",
  or "Banco Santander S.A." when the canonical entity is "Banco Santander".
---

## Example

```python
from python.functions.datascience.parse_rebel_output import parse_rebel_output
from python.functions.datascience.align_relations_to_entities import align_relations_to_entities

decoded = "tp_XX<triplet> Pablo Isla <per> Inditex <org> employer"
triplets = parse_rebel_output(decoded)

entities = ["Pablo Isla", "Inditex", "A Coruna"]
aligned = align_relations_to_entities(triplets, entities)
# [{'from': 'Pablo Isla', 'kind': 'employer', 'to': 'Inditex',
#   'head_type': 'per', 'tail_type': 'org'}]
```

## Matching strategy

1. **Exact case-insensitive**: ``"inditex"`` == ``"Inditex"``.
2. **Bidirectional substring**: the entity is contained in the model's span,
   or the model's span is contained in the entity name.
   When several entities fit, the longest one wins (most specific).

## Notes

- No fuzzy matching (Levenshtein, etc.): precision is preferred over recall
  in the knowledge-graph context.
- To improve recall: normalize entity_names before calling (strip acronyms, accents).
- Triplets with ``from == to`` (self-loops) are always dropped.
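
The two-step matching strategy can be sketched as a small standalone resolver. This is a minimal sketch for illustration, not the registry implementation: `resolve_span` is a hypothetical name, and the sample entity list is made up.

```python
from __future__ import annotations


def resolve_span(span: str, entity_names: list[str]) -> str | None:
    """Resolve a model span to a canonical entity name (illustrative sketch)."""
    if not span:
        return None
    span_lower = span.lower()

    # Step 1: exact case-insensitive match.
    for name in entity_names:
        if name.lower() == span_lower:
            return name

    # Step 2: bidirectional substring match, trying the longest names first.
    for name in sorted(entity_names, key=len, reverse=True):
        name_lower = name.lower()
        if name_lower in span_lower or span_lower in name_lower:
            return name

    return None


names = ["Bilbao", "Banco Santander", "Santander"]  # hypothetical entity list
print(resolve_span("esta en Bilbao", names))        # entity contained in the span
print(resolve_span("Banco Santander S.A.", names))  # longest matching entity wins
print(resolve_span("Madrid", names))                # no match
```

Trying the longest names first is what keeps "Banco Santander S.A." from collapsing to the shorter "Santander".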
@@ -0,0 +1,90 @@
"""Align REBEL / mREBEL triplets to canonical entity names."""

from __future__ import annotations


def align_relations_to_entities(
    triplets: list[dict],
    entity_names: list[str],
) -> list[dict]:
    """Align REBEL triplets to a set of canonical entity names.

    For each triplet produced by ``parse_rebel_output``, tries to resolve the
    ``head`` and ``tail`` spans to a canonical entity name from ``entity_names``
    using the following strategy (in order):

    1. **Exact case-insensitive match**: ``"Inditex" == "inditex"``.
    2. **Substring match**: either the span contains an entity name, or an
       entity name contains the span. When multiple entity names match, the
       *longest* one wins (most specific).

    Triplets are dropped when:
    - Either ``head`` or ``tail`` cannot be resolved to any entity name.
    - The resolved ``from`` and ``to`` are the same name (self-loop).

    Args:
        triplets: List of dicts produced by ``parse_rebel_output``, each with
            keys ``head``, ``head_type``, ``type``, ``tail``, ``tail_type``.
        entity_names: Canonical entity names to match against. Typically
            ``[e.name for e in entities]``. Matching is case-insensitive;
            order only affects tie-breaks (equal-length substring candidates,
            or names that differ only in case).

    Returns:
        List of dicts with keys:
            ``from`` (str), ``kind`` (str), ``to`` (str),
            ``head_type`` (str), ``tail_type`` (str).
        ``from`` and ``to`` are values taken verbatim from ``entity_names``.
        Empty list if no triplet survives alignment.
    """
    if not triplets or not entity_names:
        return []

    # Pre-build lookup: lowercased -> original for O(1) exact lookup.
    lower_to_name: dict[str, str] = {n.lower(): n for n in entity_names}
    # Sort by length DESC for substring match (longest entity wins).
    names_by_len: list[str] = sorted(entity_names, key=len, reverse=True)

    def _resolve(span: str) -> str | None:
        """Return a canonical entity name for `span`, or None if no match."""
        if not span:
            return None
        span_lower = span.lower()

        # 1. Exact case-insensitive.
        if span_lower in lower_to_name:
            return lower_to_name[span_lower]

        # 2. Substring: longest entity that is contained in span, or whose
        #    name contains span (both directions), longest-wins.
        for name in names_by_len:
            name_lower = name.lower()
            if name_lower in span_lower or span_lower in name_lower:
                return name

        return None

    aligned: list[dict] = []
    for triplet in triplets:
        head_span = triplet.get("head", "")
        tail_span = triplet.get("tail", "")
        relation = triplet.get("type", "")

        from_name = _resolve(head_span)
        to_name = _resolve(tail_span)

        # Drop the triplet if either side is unresolved or it is a self-loop.
        if from_name is None or to_name is None:
            continue
        if from_name == to_name:
            continue

        aligned.append(
            {
                "from": from_name,
                "kind": relation,
                "to": to_name,
                "head_type": triplet.get("head_type", ""),
                "tail_type": triplet.get("tail_type", ""),
            }
        )

    return aligned
@@ -0,0 +1,42 @@
---
id: alpha_shape_concave_hull_py_datascience
name: alpha_shape_concave_hull
kind: function
lang: py
domain: datascience
version: "1.0.0"
purity: pure
signature: "def alpha_shape_concave_hull(points: list[tuple[float, float]], alpha: float) -> shapely.geometry.base.BaseGeometry | None"
description: "Computes the alpha-shape (concave hull) of a 2-D point set via Delaunay triangulation, filtering triangles by circumradius <= alpha and merging survivors."
tags: [geometry, spatial, concave-hull, alpha-shape, shapely, delaunay]
uses_functions: []
uses_types: []
returns: []
returns_optional: true
error_type: ""
imports: [numpy, shapely]
example: |
  from alpha_shape_concave_hull import alpha_shape_concave_hull
  pts = [(0.0,0.0),(1.0,0.0),(1.0,1.0),(0.0,1.0)]
  geom = alpha_shape_concave_hull(pts, alpha=10.0)
  # shapely Polygon
tested: true
tests:
  - "test_alpha_shape_square_large_alpha"
  - "test_alpha_shape_too_few_points"
  - "test_alpha_shape_very_small_alpha_returns_none"
  - "test_alpha_shape_5_points_returns_geometry"
test_file_path: "python/functions/datascience/tests/test_alpha_shape_concave_hull.py"
file_path: "python/functions/datascience/alpha_shape_concave_hull.py"
params:
  - name: points
    desc: "List of (x, y) coordinate pairs. Requires at least 4 points."
  - name: alpha
    desc: "Alpha radius parameter. Triangles with circumradius > alpha are discarded. Smaller alpha = more concave hull."
output: "Shapely geometry (Polygon or MultiPolygon) of the alpha-shape, or None if fewer than 4 points or no triangles survive the alpha filter."
source_repo: "internal:footprint_aurgi"
source_license: "internal-aurgi"
source_file: "ponderacion_isochronas/src/recomendador_centros.py:408"
---

Requires shapely and numpy. If either is not installed, the function returns None silently. returns_optional=true because there may be no valid triangles.
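
The alpha filter rests on the circumradius formula R = (a · b · c) / (4 · Area). A minimal stdlib-only sketch of that check follows, assuming 2-D points as tuples; `circumradius` is a hypothetical helper name, and returning infinity for degenerate triangles is an illustrative choice (the function itself skips them).

```python
import math


def circumradius(a, b, c):
    """Circumradius of triangle abc via R = (|ab| * |bc| * |ca|) / (4 * area)."""
    ab = math.dist(a, b)
    bc = math.dist(b, c)
    ca = math.dist(c, a)
    # Triangle area from the 2-D cross product of two edge vectors.
    area = abs((b[0] - a[0]) * (c[1] - a[1]) - (c[0] - a[0]) * (b[1] - a[1])) / 2.0
    if area == 0:
        return math.inf  # collinear points: never passes an alpha filter
    return (ab * bc * ca) / (4.0 * area)


# 3-4-5 right triangle: the circumradius is half the hypotenuse.
r = circumradius((0, 0), (3, 0), (0, 4))
print(r)          # 2.5
print(r <= 3.0)   # True  -> triangle kept with alpha = 3.0
print(r <= 2.0)   # False -> triangle discarded with alpha = 2.0
```

Lowering alpha discards the large, thin triangles that bridge gaps between clusters, which is what makes the resulting hull concave.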
@@ -0,0 +1,67 @@
"""alpha_shape_concave_hull: Concave hull via Delaunay alpha-shape filtering."""

from __future__ import annotations


def alpha_shape_concave_hull(
    points: list[tuple[float, float]],
    alpha: float,
) -> "shapely.geometry.base.BaseGeometry | None":
    """Compute the alpha-shape (concave hull) of a 2-D point set.

    Performs a Delaunay triangulation over the input points, then keeps only
    those triangles whose circumscribed circle radius is <= alpha. The
    remaining triangles are merged via unary_union.

    Args:
        points: List of (x, y) coordinate pairs. Must have >= 4 elements.
        alpha: Alpha parameter controlling concavity (smaller = more concave).
            Triangles with circumradius > alpha are discarded.

    Returns:
        A shapely geometry (Polygon, MultiPolygon, or GeometryCollection)
        representing the alpha-shape, or None if len(points) < 4, no
        triangles survive the alpha filter, or numpy/shapely cannot be
        imported.
    """
    if len(points) < 4:
        return None

    try:
        import numpy as np
        from shapely.geometry import MultiPoint
        from shapely.ops import triangulate, unary_union
    except ImportError:
        return None

    mp = MultiPoint(points)
    triangles = triangulate(mp)

    valid = []
    for tri in triangles:
        coords = list(tri.exterior.coords)
        a_pt = np.array(coords[0])
        b_pt = np.array(coords[1])
        c_pt = np.array(coords[2])

        # Circumradius via the formula R = (abc) / (4 * Area)
        ab = np.linalg.norm(b_pt - a_pt)
        bc = np.linalg.norm(c_pt - b_pt)
        ca = np.linalg.norm(a_pt - c_pt)

        # Area via cross product
        area = abs(
            (b_pt[0] - a_pt[0]) * (c_pt[1] - a_pt[1])
            - (c_pt[0] - a_pt[0]) * (b_pt[1] - a_pt[1])
        ) / 2.0

        # Skip degenerate (collinear) triangles to avoid division by zero.
        if area == 0:
            continue

        circumradius = (ab * bc * ca) / (4.0 * area)
        if circumradius <= alpha:
            valid.append(tri)

    if not valid:
        return None

    return unary_union(valid)
@@ -0,0 +1,68 @@
---
id: best_central_tendency_py_datascience
name: best_central_tendency
kind: function
lang: py
domain: datascience
version: "1.0.0"
purity: pure
signature: "def best_central_tendency(values: list[float], dist_type: str) -> tuple[str, float]"
description: "Selects the most appropriate central tendency measure for a given distribution type. Returns (label, value) pair."
tags: [statistics, central-tendency, distribution, robust, mean, median]
uses_functions:
  - geometric_mean_py_datascience
  - trimmed_mean_py_datascience
uses_types: []
returns: []
returns_optional: false
error_type: ""
imports: [math, numpy]
example: |
  from best_central_tendency import best_central_tendency
  label, value = best_central_tendency([1, 2, 3, 4, 5], "normal-ish")
  # ("mean", 3.0)
tested: true
tests:
  - "test_best_central_tendency_normal_ish"
  - "test_best_central_tendency_right_skewed"
  - "test_best_central_tendency_left_skewed"
  - "test_best_central_tendency_lognormal_ish"
  - "test_best_central_tendency_heavy_tail"
  - "test_best_central_tendency_empty"
  - "test_best_central_tendency_default"
test_file_path: "python/functions/datascience/tests/test_best_central_tendency.py"
file_path: "python/functions/datascience/best_central_tendency.py"
params:
  - name: values
    desc: "List of numeric values to summarize."
  - name: dist_type
    desc: "Distribution type string, typically from detect_distribution_type. One of: normal-ish, lognormal-ish, heavy-tail, right-skewed, left-skewed, other, too_few_samples."
output: >
  Tuple (label, value) where label is one of "mean", "median", "geometric_mean",
  "trimmed_mean_5%", and value is the computed central tendency. Returns ("median", math.nan) for empty input.
source_repo: "internal:footprint_aurgi"
source_license: "internal-aurgi"
source_file: "aurgi_mapas/generar_pdf_reporte.py:196"
---

## Example

```python
from best_central_tendency import best_central_tendency

best_central_tendency([1, 2, 3, 4, 5], "normal-ish")    # ("mean", 3.0)
best_central_tendency([1, 2, 3, 4, 5], "right-skewed")  # ("median", 3.0)
best_central_tendency([1, 2, 4, 8], "lognormal-ish")    # ("geometric_mean", ~2.83)
best_central_tendency([1, 2, 3, 100], "heavy-tail")     # ("trimmed_mean_5%", ...)
```

## Mapping from distribution type to measure

| dist_type       | Measure          | Internal function       |
|-----------------|------------------|-------------------------|
| normal-ish      | mean             | numpy.mean              |
| lognormal-ish   | geometric_mean   | geometric_mean()        |
| heavy-tail      | trimmed_mean_5%  | trimmed_mean(trim=0.05) |
| right-skewed    | median           | numpy.median            |
| left-skewed     | median           | numpy.median            |
| other / default | median           | numpy.median            |
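
The mapping can be sketched with the standard library alone. This is an illustrative sketch, not the registry code: `central_tendency_sketch` and `trimmed_mean_sketch` are hypothetical names, and the trimmed mean shown here (drop `floor(n * trim)` values from each end) is an assumed definition, since the registry's `trimmed_mean` is not reproduced here.

```python
import math
import statistics


def trimmed_mean_sketch(values, trim=0.05):
    """Mean after dropping floor(n * trim) values from each end (assumed definition)."""
    ordered = sorted(values)
    k = int(len(ordered) * trim)
    kept = ordered[k:len(ordered) - k] if k else ordered
    return statistics.fmean(kept)


def central_tendency_sketch(values, dist_type):
    """Pick a (label, value) pair following the dist_type -> measure mapping."""
    if not values:
        return ("median", math.nan)
    if dist_type == "normal-ish":
        return ("mean", statistics.fmean(values))
    if dist_type == "lognormal-ish":
        # Geometric mean is only defined for positive values.
        positives = [v for v in values if v > 0]
        return ("geometric_mean", statistics.geometric_mean(positives))
    if dist_type == "heavy-tail":
        return ("trimmed_mean_5%", trimmed_mean_sketch(values, trim=0.05))
    # right-skewed, left-skewed, other, too_few_samples, unknown labels
    return ("median", statistics.median(values))


print(central_tendency_sketch([1, 2, 3, 4, 5], "normal-ish"))
print(central_tendency_sketch([1, 2, 4, 8], "lognormal-ish"))
# 40 values with two extreme outliers: the 5% trim removes 2 values per end.
print(central_tendency_sketch(list(range(1, 39)) + [1000, 2000], "heavy-tail"))
```

Falling back to the median for every unrecognized label keeps the function total: an unexpected `dist_type` degrades to a robust estimate instead of raising.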
@@ -0,0 +1,45 @@
"""best_central_tendency: Select the best central tendency measure for a distribution type."""

import math
import numpy as np

try:
    from .geometric_mean import geometric_mean
    from .trimmed_mean import trimmed_mean
except ImportError:
    from geometric_mean import geometric_mean  # type: ignore
    from trimmed_mean import trimmed_mean  # type: ignore


def best_central_tendency(values: list[float], dist_type: str) -> tuple[str, float]:
    """Return the most appropriate central tendency measure given the distribution type.

    Mapping:
        "normal-ish"    -> ("mean", arithmetic mean)
        "lognormal-ish" -> ("geometric_mean", geometric mean of positives)
        "heavy-tail"    -> ("trimmed_mean_5%", trimmed mean at 5%)
        "right-skewed"  -> ("median", median)
        "left-skewed"   -> ("median", median)
        default         -> ("median", median)

    Args:
        values: List of numeric values.
        dist_type: Distribution type string (from detect_distribution_type).

    Returns:
        Tuple (label: str, value: float). Value is math.nan if values is empty.
    """
    if not values:
        return ("median", math.nan)

    arr = np.array(values, dtype=float)

    if dist_type == "normal-ish":
        return ("mean", float(np.mean(arr)))
    elif dist_type == "lognormal-ish":
        return ("geometric_mean", geometric_mean(values))
    elif dist_type == "heavy-tail":
        return ("trimmed_mean_5%", trimmed_mean(values, trim=0.05))
    else:
        # right-skewed, left-skewed, other, too_few_samples, unknown
        return ("median", float(np.median(arr)))
@@ -0,0 +1,67 @@
---
id: detect_distribution_type_py_datascience
name: detect_distribution_type
kind: function
lang: py
domain: datascience
version: "1.0.0"
purity: pure
signature: "def detect_distribution_type(values: list[float]) -> dict"
description: "Classifies the shape of a numeric distribution using skewness, excess kurtosis, tail ratio and log-skewness. Returns a type label and raw stats."
tags: [statistics, distribution, classification, skewness, kurtosis]
uses_functions: []
uses_types: []
returns: []
returns_optional: false
error_type: ""
imports: [math, numpy]
example: |
  from detect_distribution_type import detect_distribution_type
  import numpy as np
  result = detect_distribution_type(np.random.normal(0, 1, 200).tolist())
  # typically {"type": "normal-ish", "stats": {"n": 200, "skew": ..., ...}}
tested: true
tests:
  - "test_detect_too_few_samples"
  - "test_detect_normal_ish"
  - "test_detect_right_skewed"
  - "test_detect_stats_keys"
  - "test_detect_exactly_30"
test_file_path: "python/functions/datascience/tests/test_detect_distribution_type.py"
file_path: "python/functions/datascience/detect_distribution_type.py"
params:
  - name: values
    desc: "List of numeric values to classify. Minimum 30 for meaningful classification."
output: >
  Dict with "type" (str) and "stats" (dict). Type is one of: normal-ish,
  lognormal-ish, heavy-tail, right-skewed, left-skewed, other, too_few_samples.
  Stats contains: n, skew, kurtosis, tail_ratio, log_skew (only n when type is too_few_samples).
source_repo: "internal:footprint_aurgi"
source_license: "internal-aurgi"
source_file: "aurgi_mapas/generar_pdf_reporte.py:133"
---

## Example

```python
from detect_distribution_type import detect_distribution_type
import numpy as np

detect_distribution_type(np.random.normal(0, 1, 200).tolist())
# typically {"type": "normal-ish", "stats": {"n": 200, "skew": 0.03, ...}}

detect_distribution_type([1]*5)
# {"type": "too_few_samples", "stats": {"n": 5}}
```

## Classification logic

Rules are evaluated in order; the first match wins.

- n < 30 → too_few_samples
- excess kurtosis > 3 → heavy-tail
- |skew| <= 0.5 AND |kurt| <= 1 → normal-ish
- skew > 0.5 AND |log_skew| <= 0.5 AND tail_ratio > 2 → lognormal-ish
- skew > 0.5 → right-skewed
- skew < -0.5 → left-skewed
- default → other

tail_ratio = p99/p50 (NaN when p50 is 0); log_skew is computed only when there are >= 30 positive values (NaN otherwise).
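
The skewness and excess kurtosis that drive these rules are plain standardized moments. The sketch below reproduces that computation with the standard library only, assuming the sample standard deviation (ddof=1) for the z-scores; `shape_stats` is a hypothetical name used for illustration.

```python
import statistics


def shape_stats(values):
    """Return (skew, excess_kurtosis) from standardized moments (sample std, ddof=1)."""
    mean = statistics.fmean(values)
    std = statistics.stdev(values)  # sample standard deviation (ddof=1)
    if std == 0:
        return 0.0, 0.0  # constant sample: no asymmetry, no tails
    z = [(v - mean) / std for v in values]
    skew = statistics.fmean([x ** 3 for x in z])
    kurt = statistics.fmean([x ** 4 for x in z]) - 3.0
    return skew, kurt


# Symmetric, flat sample: skew near 0, negative excess kurtosis.
sym_skew, sym_kurt = shape_stats(list(range(-50, 51)))
print(round(sym_skew, 6), sym_kurt < 0)

# Mostly small values plus a block of large ones: strong right skew.
right_skew, right_kurt = shape_stats([1] * 90 + [100] * 10)
print(right_skew > 0.5, right_kurt > 3.0)
```

The second sample has both skew > 0.5 and excess kurtosis > 3. Because the kurtosis rule is checked first, such a sample would be labelled heavy-tail rather than right-skewed.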
@@ -0,0 +1,89 @@
"""detect_distribution_type: Classify the distribution shape of a sample."""

import math
import numpy as np


def detect_distribution_type(values: list[float]) -> dict:
    """Classify the distribution shape of a numeric sample.

    Uses skewness, excess kurtosis, tail ratio (p99/p50), and log-skewness
    to assign one of: normal-ish, lognormal-ish, heavy-tail, right-skewed,
    left-skewed, other, or too_few_samples (n < 30).

    Args:
        values: List of numeric values.

    Returns:
        Dict with keys:
            "type" (str): distribution label.
            "stats" (dict): {"n", "skew", "kurtosis", "tail_ratio", "log_skew"}.
    """
    n = len(values)
    if n < 30:
        return {"type": "too_few_samples", "stats": {"n": n}}

    arr = np.array(values, dtype=float)

    mean = float(np.mean(arr))
    std = float(np.std(arr, ddof=1))

    # Skewness
    if std == 0:
        skew = 0.0
    else:
        skew = float(np.mean(((arr - mean) / std) ** 3))

    # Excess kurtosis
    if std == 0:
        kurt = 0.0
    else:
        kurt = float(np.mean(((arr - mean) / std) ** 4)) - 3.0

    # Tail ratio: p99 / p50 (only meaningful when median != 0)
    p50 = float(np.percentile(arr, 50))
    p99 = float(np.percentile(arr, 99))
    tail_ratio = (p99 / p50) if p50 != 0 else math.nan

    # Log-skewness on positive values
    positives = arr[arr > 0]
    if len(positives) >= 30:
        log_arr = np.log(positives)
        log_mean = float(np.mean(log_arr))
        log_std = float(np.std(log_arr, ddof=1))
        if log_std == 0:
            log_skew = 0.0
        else:
            log_skew = float(np.mean(((log_arr - log_mean) / log_std) ** 3))
    else:
        log_skew = math.nan

    stats = {
        "n": n,
        "skew": skew,
        "kurtosis": kurt,
        "tail_ratio": tail_ratio,
        "log_skew": log_skew,
    }

    # Classification logic
    if kurt > 3.0:
        dist_type = "heavy-tail"
    elif abs(skew) <= 0.5 and abs(kurt) <= 1.0:
        dist_type = "normal-ish"
    elif (
        skew > 0.5
        and not math.isnan(log_skew)
        and abs(log_skew) <= 0.5
        and not math.isnan(tail_ratio)
        and tail_ratio > 2.0
    ):
        dist_type = "lognormal-ish"
    elif skew > 0.5:
        dist_type = "right-skewed"
    elif skew < -0.5:
        dist_type = "left-skewed"
    else:
        dist_type = "other"

    return {"type": dist_type, "stats": stats}
@@ -0,0 +1,65 @@
---
name: extract_graph_gliner2
kind: function
lang: py
domain: datascience
version: "1.0.0"
purity: impure
signature: "def extract_graph_gliner2(text: str, entity_labels: list[str], relation_labels: list | dict, model: Any, threshold: float = 0.3, include_confidence: bool = False) -> dict"
description: "Extracts entities + relations in a single pass with GLiNER2. High-level wrapper: builds the schema, runs the extraction, and normalizes the result to a flat dict. Applies no post-filtering or coreference."
tags: [gliner2, ner, relation-extraction, nlp, extraction, graph, zero-shot, datascience, python, apache2]
uses_functions:
  - gliner2_load_model_py_datascience
uses_types: []
returns: []
returns_optional: false
error_type: "error_go_core"
imports: [time, typing.Any]
params:
  - name: text
    desc: "Text to analyze. Recommended up to 1500 chars (pre-chunked with chunk_with_overlap). Longer texts degrade GLiNER2 recall."
  - name: entity_labels
    desc: "List of strings with the entity types in lowercase snake_case. E.g. ['person', 'organization', 'location']. snake_case labels improve recall according to notebook 08."
  - name: relation_labels
    desc: "List of strings or dict {label: description} with the relation types. E.g. ['works_at', 'ceo_of'] or {'works_at': 'person works at an organization'}."
  - name: model
    desc: "GLiNER2 instance loaded with gliner2_load_model. Injected by the caller (not loaded here)."
  - name: threshold
    desc: "Confidence threshold between 0 and 1. 0.3 was validated empirically in notebook 04 (gliner_glirel_tuning). Lower values = more recall, more noise."
  - name: include_confidence
    desc: "If True, GLiNER2 returns internal scores per entity and relation. False by default for cleaner output."
output: "Dict with three fields: 'entities' -> {type: [name, ...]}, 'relation_extraction' -> {rel_type: [(head, tail), ...]}, 'elapsed_s' -> float. Compatible with aggregate_extraction_results."
tested: true
tests:
  - "output has keys entities relation_extraction elapsed_s"
  - "stub model returns the correct shape"
test_file_path: "python/functions/datascience/tests/test_extract_graph_gliner2.py"
file_path: "python/functions/datascience/extract_graph_gliner2.py"
notes: |
  LICENSE: GLiNER2 (fastino/gliner2-large-v1) is Apache 2.0: commercial use OK.

  impure: invokes model inference (computational side effect + variable runtime).
  The model is injected externally to allow caching and reuse across calls.
  For long texts, run chunk_with_overlap first and call this function once per chunk,
  then merge the results with aggregate_extraction_results.
---

## Example

```python
from datascience.gliner2_load_model import gliner2_load_model
from datascience.extract_graph_gliner2 import extract_graph_gliner2

model = gliner2_load_model(device="auto")

result = extract_graph_gliner2(
    text="Carlos Torres es presidente de BBVA, con sede en Bilbao.",
    entity_labels=["person", "organization", "location"],
    relation_labels=["president_of", "headquartered_in"],
    model=model,
    threshold=0.3,
)
# result["entities"]            -> {"person": ["Carlos Torres"], ...}
# result["relation_extraction"] -> {"president_of": [("Carlos Torres", "BBVA")]}
# result["elapsed_s"]           -> 0.234
```
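
Because the model is injected, the wrapper's output shape can be exercised without loading GLiNER2. The sketch below is illustrative only: `StubSchema`, `StubModel` and `run_extraction` are hypothetical names, the stub returns canned data, and `run_extraction` mirrors the call sequence of the wrapper (`create_schema().entities(...).relations(...)` followed by `extract(...)`) rather than importing it.

```python
import time
from typing import Any


class StubSchema:
    """Hypothetical stand-in for the schema builder; it only records the labels."""

    def entities(self, labels):
        self.entity_labels = labels
        return self

    def relations(self, labels):
        self.relation_labels = labels
        return self


class StubModel:
    """Hypothetical stand-in for a loaded GLiNER2 model (no inference)."""

    def create_schema(self):
        return StubSchema()

    def extract(self, text, schema, threshold, include_confidence):
        # Canned response in the shape the wrapper expects.
        return {
            "entities": {"person": ["Carlos Torres"], "organization": ["BBVA"]},
            "relation_extraction": {"president_of": [("Carlos Torres", "BBVA")]},
        }


def run_extraction(text: str, entity_labels: list, relation_labels: Any,
                   model: Any, threshold: float = 0.3) -> dict:
    """Build the schema, run the extraction, and flatten the result to a dict."""
    schema = model.create_schema().entities(entity_labels).relations(relation_labels)
    t0 = time.time()
    raw = model.extract(text, schema=schema, threshold=threshold,
                        include_confidence=False)
    return {
        "entities": raw.get("entities", {}),
        "relation_extraction": raw.get("relation_extraction", {}),
        "elapsed_s": round(time.time() - t0, 3),
    }


result = run_extraction(
    "Carlos Torres es presidente de BBVA.",
    ["person", "organization"],
    ["president_of"],
    StubModel(),
)
print(sorted(result))  # ['elapsed_s', 'entities', 'relation_extraction']
```

The same stub can stand in for the model when testing a per-chunk loop, since nothing in the wrapper depends on GLiNER2 beyond these two method calls.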
@@ -0,0 +1,60 @@
"""Entity + relation extraction in a single pass with GLiNER2."""

from __future__ import annotations

import time
from typing import Any


def extract_graph_gliner2(
    text: str,
    entity_labels: list[str],
    relation_labels: list | dict,
    model: Any,
    threshold: float = 0.3,
    include_confidence: bool = False,
) -> dict:
    """Extract entities + relations using GLiNER2 with one schema pass.

    High-level wrapper over the GLiNER2 API. Builds the schema, runs the
    extraction, and normalizes the result to a flat dict.
    Does NOT apply post-filtering or coreference; the caller does that with
    filter_relations_by_entity_types and merge_entity_aliases.

    Args:
        text: Text to analyze. Recommended: <= 1500 chars (pre-chunked).
        entity_labels: List of strings with the entity types.
            E.g. ["person", "organization", "location"]
        relation_labels: List of strings or dict {label: description} with
            the relation types.
            E.g. ["works_at", "ceo_of"] or
            {"works_at": "person works at organization"}
        model: GLiNER2 instance loaded with gliner2_load_model.
        threshold: Confidence threshold (0-1). 0.3 is the value validated
            empirically in the analysis notebooks.
        include_confidence: If True, the model returns scores per entity
            and relation (GLiNER2's internal format).

    Returns:
        {
            "entities": {type: [name, ...]},
            "relation_extraction": {rel_type: [(head, tail), ...]},
            "elapsed_s": float
        }
    """
    schema = model.create_schema().entities(entity_labels).relations(relation_labels)

    # Time only the model call, not the schema construction.
    t0 = time.time()
    r = model.extract(
        text,
        schema=schema,
        threshold=threshold,
        include_confidence=include_confidence,
    )
    elapsed = round(time.time() - t0, 3)

    return {
        "entities": r.get("entities", {}),
        "relation_extraction": r.get("relation_extraction", {}),
        "elapsed_s": elapsed,
    }
@@ -0,0 +1,114 @@
|
||||
---
|
||||
name: extract_relations_mrebel
|
||||
kind: function
|
||||
lang: py
|
||||
domain: datascience
|
||||
version: "1.0.0"
|
||||
purity: impure
|
||||
signature: "def extract_relations_mrebel(text: str, entities: list[EntityCandidate], tokenizer: Any, model: Any, src_lang: str = 'es_XX', sentence_split_re: str = r'(?<=[.!?])\\s+', min_sentence_chars: int = 20, num_beams: int = 4, max_length: int = 256) -> list[RelationCandidate]"
|
||||
description: "Extrae relaciones entre entidades usando mREBEL (seq2seq multilingue). Divide el texto por oraciones, genera triplets con mREBEL, parsea con parse_rebel_output y alinea a entidades conocidas con align_relations_to_entities. Drop-in con extract_relations_glirel para benchmarks."
|
||||
tags: [mrebel, relation-extraction, nlp, extract, knowledge-graph, seq2seq, multilingual, datascience, python]
|
||||
uses_functions:
|
||||
- mrebel_load_model_py_datascience
|
||||
- parse_rebel_output_py_datascience
|
||||
- align_relations_to_entities_py_datascience
|
||||
uses_types:
|
||||
- entity_candidate_py_datascience
|
||||
- relation_candidate_py_datascience
|
||||
returns:
|
||||
- relation_candidate_py_datascience
|
||||
returns_optional: false
|
||||
error_type: "error_go_core"
|
||||
imports: [re]
|
||||
params:
|
||||
- name: text
|
||||
desc: "texto fuente en el idioma de src_lang (mismo chunk usado para extraer las entidades)"
|
||||
- name: entities
|
||||
desc: "entidades ya extraidas de este texto (de extract_entities_gliner o similar). Solo se conservan relaciones entre entidades de esta lista."
|
||||
- name: tokenizer
|
||||
desc: "tokenizer mREBEL cargado con mrebel_load_model — inyectado por el caller para evitar re-carga en batch"
|
||||
- name: model
|
||||
desc: "modelo mREBEL cargado con mrebel_load_model — inyectado por el caller"
|
||||
- name: src_lang
|
||||
desc: "informativo — el idioma con que se cargo el tokenizer (ej. 'es_XX'). No se usa en runtime."
|
||||
- name: sentence_split_re
|
||||
desc: "patron regex para dividir el texto en oraciones. Defecto: split despues de [.!?] seguido de espacio."
|
||||
- name: min_sentence_chars
|
||||
desc: "longitud minima de caracteres para procesar una oracion. Fragmentos mas cortos se saltan (defecto 20)."
|
||||
- name: num_beams
|
||||
desc: "ancho del beam search para model.generate (defecto 4)"
|
||||
- name: max_length
|
||||
desc: "longitud maxima en tokens para tokenizacion y generacion (defecto 256)"
|
||||
output: "lista de RelationCandidate con confidence=1.0 (mREBEL no produce score continuo). from_name/to_name siempre coinciden con entidades del input."
|
||||
tested: true
|
||||
tests:
|
||||
- "flujo completo con stub produce RelationCandidate correctos"
|
||||
- "menos de 2 entidades retorna vacio"
|
||||
- "texto vacio retorna vacio"
|
||||
- "triplets no alineables se descartan"
|
||||
test_file_path: "python/functions/datascience/tests/test_extract_relations_mrebel.py"
|
||||
file_path: "python/functions/datascience/extract_relations_mrebel.py"
|
||||
notes: |
|
||||
  impure: model.generate is computational I/O with external state (the model weights).

  mREBEL does not produce a continuous confidence score: it returns the triplets the
  model decoded as its most likely output. confidence=1.0 is a marker meaning "the model
  emitted it", not a calibrated probability. To filter by quality, use the number of
  beams as a rough proxy (it is not a per-triplet score) or add a downstream classifier.

  Drop-in replacement for extract_relations_glirel in benchmarks:
  - Same input interface (text, entities, model)
  - Same output (list[RelationCandidate])
  - Difference: mREBEL does not need relation_types (it generates relations freely),
    while GLiREL requires relation_types (discriminative zero-shot).

  MODEL LICENSE: Babelscape/mrebel-large is CC BY-NC-SA 4.0 (non-commercial).
  See mrebel_load_model for details.
|
||||
---
|
||||
|
||||
## Example

```python
from python.functions.datascience.mrebel_load_model import mrebel_load_model
from python.functions.datascience.extract_relations_mrebel import extract_relations_mrebel
from python.types.datascience.entity_candidate import EntityCandidate

tokenizer, model = mrebel_load_model(src_lang="es_XX")

text = "Pablo Isla es el presidente de Inditex. La empresa tiene sede en Arteixo, A Coruna."
entities = [
    EntityCandidate(name="Pablo Isla", type_label="PER", confidence=0.95),
    EntityCandidate(name="Inditex", type_label="ORG", confidence=0.92),
    EntityCandidate(name="Arteixo", type_label="LOC", confidence=0.88),
    EntityCandidate(name="A Coruna", type_label="LOC", confidence=0.85),
]

relations = extract_relations_mrebel(
    text=text,
    entities=entities,
    tokenizer=tokenizer,
    model=model,
)
# [RelationCandidate(from_name='Pablo Isla', to_name='Inditex',
#                    relation_type='employer', confidence=1.0, ...), ...]
```
|
||||
|
||||
## Comparison with extract_relations_glirel

| | mREBEL | GLiREL |
|---|---|---|
| Type | Generative seq2seq | Discriminative zero-shot |
| relation_types | No (free-form generation) | Yes (required) |
| Confidence | Fixed 1.0 (not calibrated) | 0.0-1.0 (calibrated) |
| Languages | 30+ (multilingual) | Mainly EN |
| Model license | CC BY-NC-SA (non-commercial) | Apache 2.0 |
| Speed | Slower (seq2seq) | Faster (classifier) |
|
||||
|
||||
## Design notes

- `parse_rebel_output` and `align_relations_to_entities` are pure functions
  composed by this impure function, so each can be tested independently.
- Tokenization/generation errors are caught per sentence and skipped (the
  failing sentence is ignored and the rest of the text is still processed).
- `source_chunk_index` tracks the index of the source sentence (counted after
  short sentences are filtered out), not of a text chunk, which is useful for
  debugging.
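
The sentence-splitting and length-filter steps described above can be sketched in isolation. This is a minimal illustration, not the registry function itself: `split_sentences` is a hypothetical helper, and only the default regex and the 20-character minimum come from the documented parameters.

```python
import re


# Hypothetical helper for illustration; the default pattern and the
# 20-character minimum are the documented defaults of extract_relations_mrebel.
def split_sentences(text: str, pattern: str = r"(?<=[.!?])\s+", min_chars: int = 20) -> list[str]:
    """Split on whitespace that follows ., ! or ? and drop short fragments."""
    parts = re.split(pattern, text.strip())
    return [p.strip() for p in parts if len(p.strip()) >= min_chars]


text = "Pablo Isla es el presidente de Inditex. Ok. La empresa tiene sede en Arteixo, A Coruna."
sentences = split_sentences(text)
for idx, sentence in enumerate(sentences):
    # idx is the position among the kept sentences, which is what
    # source_chunk_index would carry under this sketch.
    print(idx, sentence)
```

Because the index is taken after filtering, a dropped fragment such as "Ok." does not consume an index.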
|
||||
@@ -0,0 +1,136 @@
|
||||
"""Extrae relaciones entre entidades usando mREBEL (seq2seq multilingue)."""
|
||||
|
||||
from __future__ import annotations

import os
import re
import sys
from typing import Any

sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "..", ".."))

from python.types.datascience.entity_candidate import EntityCandidate
from python.types.datascience.relation_candidate import RelationCandidate
from python.functions.datascience.parse_rebel_output import parse_rebel_output
from python.functions.datascience.align_relations_to_entities import align_relations_to_entities


def _span_matches(span: str, name: str) -> bool:
    """Loosely check that a raw mREBEL span refers to a canonical entity name."""
    span_l, name_l = span.strip().lower(), name.strip().lower()
    return bool(span_l) and bool(name_l) and (name_l in span_l or span_l in name_l)


def extract_relations_mrebel(
    text: str,
    entities: list[EntityCandidate],
    tokenizer: Any,
    model: Any,
    src_lang: str = "es_XX",
    sentence_split_re: str = r"(?<=[.!?])\s+",
    min_sentence_chars: int = 20,
    num_beams: int = 4,
    max_length: int = 256,
) -> list[RelationCandidate]:
    """Extract relations from text using mREBEL, sentence by sentence.

    Orchestrates the full pipeline:

    1. Split ``text`` into sentences using ``sentence_split_re``.
    2. Filter out sentences shorter than ``min_sentence_chars``.
    3. For each sentence: tokenize → generate → decode (with special tokens)
       → ``parse_rebel_output`` → accumulate raw triplets.
    4. Collect all entity names from ``entities``, sorted DESC by length
       (so longer names win in substring matching).
    5. Call ``align_relations_to_entities`` to resolve head/tail spans to
       canonical entity names and drop unresolved / self-loop triplets.
    6. Wrap each aligned triplet in a ``RelationCandidate``.

    mREBEL does not produce a continuous confidence score, so ``confidence``
    is set to ``1.0`` as a marker meaning "the model emitted this triplet".

    Args:
        text: Source text (same language as ``src_lang``).
        entities: Entities already extracted from this text (e.g. via
            ``extract_entities_gliner``). Used to filter triplets to
            known entities only.
        tokenizer: mREBEL tokenizer loaded with ``mrebel_load_model``.
        model: mREBEL model loaded with ``mrebel_load_model``.
        src_lang: Informational: the language the tokenizer was loaded with.
            Not used at inference time (mBART lang tokens are set at load time).
        sentence_split_re: Regex pattern for sentence splitting. Default splits
            on whitespace that follows ``.``, ``!`` or ``?``.
        min_sentence_chars: Minimum character length for a sentence to be
            processed. Shorter fragments are skipped.
        num_beams: Beam search width for ``model.generate``. Default 4.
        max_length: Max token length for both tokenization and generation.

    Returns:
        List of ``RelationCandidate`` where ``from_name`` and ``to_name``
        always correspond to names in ``entities``. Empty list if no aligned
        triplets are found or ``entities`` has fewer than 2 items.
    """
    if len(entities) < 2:
        return []
    if not text or not text.strip():
        return []

    split_re = re.compile(sentence_split_re)
    sentences = split_re.split(text.strip())
    sentences = [s.strip() for s in sentences if s.strip() and len(s.strip()) >= min_sentence_chars]
    if not sentences:
        return []

    # Step 1-3: gather raw triplets from all sentences.
    raw_triplets: list[dict] = []
    for idx, sentence in enumerate(sentences):
        try:
            inputs = tokenizer(
                sentence,
                return_tensors="pt",
                max_length=max_length,
                truncation=True,
            )
            generated = model.generate(
                **inputs,
                num_beams=num_beams,
                length_penalty=1.0,
                max_length=max_length,
            )
            decoded = tokenizer.decode(generated[0], skip_special_tokens=False)
        except Exception:
            # Skip sentences that fail (e.g. tokenizer errors on special chars).
            continue

        sentence_triplets = parse_rebel_output(decoded)
        # Tag each triplet with the sentence index for source_chunk_index.
        for t in sentence_triplets:
            t["_sentence_idx"] = idx
        raw_triplets.extend(sentence_triplets)

    if not raw_triplets:
        return []

    # Step 4-5: align to entity names (sorted DESC by length for substring match).
    entity_names = sorted([e.name for e in entities if e.name], key=len, reverse=True)
    aligned = align_relations_to_entities(raw_triplets, entity_names)

    # Step 6: wrap in RelationCandidate.
    candidates: list[RelationCandidate] = []
    for item in aligned:
        # Recover sentence_idx from the raw triplets. Prefer a raw triplet whose
        # type, head and tail all match the aligned item; if none does, fall back
        # to the first raw triplet with the same relation type.
        sentence_idx = -1
        for raw in raw_triplets:
            if raw.get("type", "").strip() != item["kind"]:
                continue
            if sentence_idx == -1:
                sentence_idx = raw.get("_sentence_idx", -1)
            if _span_matches(raw.get("head", ""), item["from"]) and _span_matches(
                raw.get("tail", ""), item["to"]
            ):
                sentence_idx = raw.get("_sentence_idx", -1)
                break

        candidates.append(
            RelationCandidate(
                from_name=item["from"],
                to_name=item["to"],
                relation_type=item["kind"],
                description="",
                confidence=1.0,
                source_chunk_index=sentence_idx,
            )
        )

    return candidates
|
||||
@@ -0,0 +1,65 @@
|
||||
---
|
||||
name: extract_triples_spacy_es
|
||||
kind: function
|
||||
lang: py
|
||||
domain: datascience
|
||||
version: "1.0.0"
|
||||
purity: impure
|
||||
signature: "def extract_triples_spacy_es(text: str, nlp: Any) -> dict"
|
||||
description: "Extraccion OpenIE schema-less en castellano via reglas de dependencia spaCy. Detecta patrones sujeto-verbo-objeto con el lemma del verbo como relacion (sin vocabulario fijo). Tambien extrae entidades NER (PER, ORG, LOC, MISC)."
|
||||
tags: [spacy, openie, nlp, spanish, triples, dependency, ner, schema-less, datascience, python, mit]
|
||||
uses_functions:
|
||||
- spacy_es_load_model_py_datascience
|
||||
uses_types: []
|
||||
returns: []
|
||||
returns_optional: false
|
||||
error_type: "error_go_core"
|
||||
imports: [time, typing.Any]
|
||||
params:
|
||||
- name: text
|
||||
desc: "Texto en castellano a analizar. Funciona mejor con oraciones completas. Admite multiples oraciones en el mismo texto."
|
||||
- name: nlp
|
||||
desc: "Instancia spaCy Language cargada con spacy_es_load_model. Debe incluir dependencias + POS + NER (es_core_news_md o lg)."
|
||||
output: "Dict con 'text' (input), 'triples' (lista de {subject, relation, object, verb_form, object_dep, prep}), 'entities' (lista de {text, label}) y 'elapsed_s'. La relacion es el lemma del verbo, opcionalmente sufijado con preposicion (_en, _con) o modo pasivo ([pass])."
|
||||
tested: true
|
||||
tests:
|
||||
- "oracion simple produce tripleta con sujeto verbo objeto"
|
||||
- "carlos torres preside bbva produce tripleta president"
|
||||
- "amancio ortega fundo inditex en 1985 produce tripletas con fundar_en"
|
||||
- "texto sin verbos produce tripletas vacias"
|
||||
- "entities NER detecta PER ORG LOC"
|
||||
test_file_path: "python/functions/datascience/tests/test_extract_triples_spacy_es.py"
|
||||
file_path: "python/functions/datascience/extract_triples_spacy_es.py"
|
||||
notes: |
|
||||
  LICENSE: spaCy is MIT. The es_core_news_md model is CC BY-SA 4.0.
  Commercial use is allowed with attribution.

  Validated in notebook 09 of the gliner_glirel_tuning analysis.
  Complements extract_graph_gliner2: GLiNER2 uses a fixed relation vocabulary
  with higher precision; spaCy OpenIE uses verb lemmas (no fixed vocabulary)
  but needs manual post-filtering.

  impure: runs model inference (computational side effect).
  The nlp object is injected by the caller to allow caching and reuse.

  Compound relations: 'fundar_en' (fundar + preposition 'en'),
  'ser_nombrado[pass]' (passive), 'trabajar_con' (trabajar + 'con').
|
||||
---
|
||||
|
||||
## Example

```python
from datascience.spacy_es_load_model import spacy_es_load_model
from datascience.extract_triples_spacy_es import extract_triples_spacy_es

nlp = spacy_es_load_model()

result = extract_triples_spacy_es(
    "Amancio Ortega fundo Inditex en 1985 en La Coruna.",
    nlp=nlp,
)
# result["triples"]:
# [{"subject": "Amancio Ortega", "relation": "fundar", "object": "Inditex", ...},
#  {"subject": "Amancio Ortega", "relation": "fundar_en", "object": "1985", ...},
#  {"subject": "Amancio Ortega", "relation": "fundar_en", "object": "La Coruna", ...}]
```
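
The way relation labels such as `fundar_en` are composed can be shown without loading spaCy. The sketch below restates the labelling rule as a standalone function; `build_relation_label` and its parameters are hypothetical names introduced here for illustration, while the rule itself (passive first, then a preposition suffix except for agents and the preposition "a") follows the source of `extract_triples_spacy_es`.

```python
from typing import Optional


# Hypothetical helper: restates the labelling rule as a pure function.
def build_relation_label(
    verb_lemma: str,
    prep: Optional[str] = None,
    dep: str = "obj",
    is_passive: bool = False,
) -> str:
    """Compose the relation label from the verb lemma and its context."""
    if is_passive:
        # Passive constructions take priority over any preposition suffix.
        return f"{verb_lemma}[pass]"
    if prep and dep != "obl:agent" and prep != "a":
        # Oblique with a preposition, excluding agents and direct "a".
        return f"{verb_lemma}_{prep}"
    return verb_lemma


print(build_relation_label("fundar"))                      # plain object
print(build_relation_label("fundar", prep="en", dep="obl"))  # oblique with "en"
print(build_relation_label("nombrar", prep="por", dep="obl:agent", is_passive=True))
```

Note that under this rule a passive verb never receives a preposition suffix, so downstream filters may want to read the separate `prep` field of each triple.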
|
||||
@@ -0,0 +1,124 @@
|
||||
"""Extraccion de tripletas OpenIE schema-less en castellano via reglas de dependencia.
|
||||
|
||||
Validado en notebook 09 del analisis gliner_glirel_tuning.
|
||||
LICENSE: spaCy MIT + es_core_news_md CC BY-SA 4.0.
|
||||
"""
|
||||
|
||||
from __future__ import annotations

import time
from typing import Any

# Determiners and pronouns that are not meaningful entities
STOP_TOKENS = {
    "el", "la", "los", "las", "un", "una", "unos", "unas",
    "esto", "eso", "aquello", "esta", "este", "estos", "estas",
    "que", "quien", "cual", "cuales",
}


def _clean_span(span_tokens) -> str:  # type: ignore[type-arg]
    """Return the text of a token span, dropping leading prepositions."""
    toks = list(span_tokens)
    while toks and toks[0].pos_ == "ADP":
        toks = toks[1:]
    return " ".join(t.text for t in toks).strip()


def _is_meaningful(text: str) -> bool:
    """Check that a span is neither empty nor a stopword."""
    if not text or not text.strip():
        return False
    if text.lower() in STOP_TOKENS:
        return False
    return True


def extract_triples_spacy_es(text: str, nlp: Any) -> dict:
    """Extract OpenIE-style (subject, relation, object) triples from Spanish text.

    Uses spaCy dependency rules to find subject-verb-object patterns.
    Schema-LESS: the relation is the verb's lemma (no fixed vocabulary).
    Also extracts spaCy NER entities (PER, ORG, LOC, MISC).

    Args:
        text: Spanish text to analyze. Works best with complete sentences.
        nlp: spaCy Language instance loaded with spacy_es_load_model.

    Returns:
        {
            "text": str,
            "triples": [
                {"subject": str, "relation": str, "object": str,
                 "verb_form": str, "object_dep": str, "prep": str|None},
                ...
            ],
            "entities": [{"text": str, "label": str}, ...],
            "elapsed_s": float
        }
    """
    t0 = time.time()
    doc = nlp(text)
    triples: list[dict] = []

    for tok in doc:
        if tok.pos_ not in ("VERB", "AUX"):
            continue

        verb_lemma = tok.lemma_
        verb_form = tok.text

        subjs = [
            c for c in tok.children
            if c.dep_ in ("nsubj", "nsubj:pass", "csubj")
        ]
        if not subjs:
            continue

        # Each entry is (token, dependency label, preposition or None).
        objects: list[tuple] = []
        for c in tok.children:
            if c.dep_ in ("obj", "dobj", "iobj", "attr", "xcomp", "ccomp"):
                objects.append((c, c.dep_, None))
            elif c.dep_ in ("obl", "obl:agent", "nmod"):
                prep = None
                for cc in c.children:
                    if cc.dep_ == "case" and cc.pos_ == "ADP":
                        prep = cc.text.lower()
                        break
                objects.append((c, c.dep_, prep))

        for s in subjs:
            s_text = _clean_span(s.subtree)
            if not _is_meaningful(s_text):
                continue
            for o, dep, prep in objects:
                o_text = _clean_span(o.subtree)
                if not _is_meaningful(o_text):
                    continue

                # Build the relation label
                rel = verb_lemma
                # Passive: mark with [pass]
                if any(c.dep_ == "nsubj:pass" for c in tok.children):
                    rel = f"{verb_lemma}[pass]"
                # Oblique with preposition (excluding agent and direct "a")
                elif prep and dep != "obl:agent" and prep != "a":
                    rel = f"{verb_lemma}_{prep}"

                triples.append({
                    "subject": s_text,
                    "relation": rel,
                    "object": o_text,
                    "verb_form": verb_form,
                    "object_dep": dep,
                    "prep": prep,
                })

    ents = [{"text": e.text, "label": e.label_} for e in doc.ents]

    return {
        "text": text,
        "triples": triples,
        "entities": ents,
        "elapsed_s": round(time.time() - t0, 3),
    }
|
||||
@@ -0,0 +1,58 @@
|
||||
---
|
||||
name: fuzzy_merge_adaptive
|
||||
kind: function
|
||||
lang: py
|
||||
domain: datascience
|
||||
version: "1.0.0"
|
||||
purity: pure
|
||||
signature: "def fuzzy_merge_adaptive(left: list[dict], right: list[dict], left_key: str, right_key: str, thresholds: list[int] | None = None, how: str = 'left') -> list[dict]"
|
||||
description: "Fuzzy join adaptativo entre dos listas de dicts usando rapidfuzz.token_sort_ratio. Prueba thresholds de mayor a menor y asigna el mayor cumplido. Soporta how='left' (todos los de left) e how='inner' (solo con match). Campos colisionantes reciben sufijos _left/_right."
|
||||
tags: [fuzzy, matching, join, merge, rapidfuzz, string-similarity, datascience]
|
||||
params:
|
||||
- name: left
|
||||
desc: Lista de dicts (lado izquierdo del join).
|
||||
- name: right
|
||||
desc: Lista de dicts (lado derecho del join).
|
||||
- name: left_key
|
||||
desc: Clave en los dicts de left usada para matching de strings.
|
||||
- name: right_key
|
||||
desc: Clave en los dicts de right usada para matching de strings.
|
||||
- name: thresholds
|
||||
desc: Lista de thresholds enteros a probar en orden descendente. Default [90,80,70,60,50].
|
||||
- name: how
|
||||
desc: "'left' incluye todos los items de left; 'inner' solo los que tienen match."
|
||||
output: "Lista de dicts mergeados con campos de left + right (sufijos _left/_right si colisionan) + fuzzy_match (str|None), match_score (int), threshold_used (int|None)."
|
||||
uses_functions: []
|
||||
uses_types: []
|
||||
returns: []
|
||||
returns_optional: false
|
||||
error_type: ""
|
||||
imports: ["rapidfuzz"]
|
||||
tested: true
|
||||
tests:
|
||||
- "left join con typo"
|
||||
- "inner join excluye sin match"
|
||||
- "left join sin match devuelve none"
|
||||
- "threshold adaptativo"
|
||||
- "colision de claves usa sufijos"
|
||||
test_file_path: "python/functions/datascience/tests/test_fuzzy_merge_adaptive.py"
|
||||
file_path: "python/functions/datascience/fuzzy_merge_adaptive.py"
|
||||
source_repo: "internal:footprint_aurgi"
|
||||
source_license: "internal-aurgi"
|
||||
source_file: "fuzzy_joins/fuzzy_en_batches.py"
|
||||
---
|
||||
|
||||
## Example

```python
from fuzzy_merge_adaptive import fuzzy_merge_adaptive

left = [{"name": "Madrid"}, {"name": "Barclona"}]
right = [{"name": "Madrid", "cp": "28"}, {"name": "Barcelona", "cp": "08"}]
result = fuzzy_merge_adaptive(left, right, left_key="name", right_key="name")
# result[1]["fuzzy_match"] == "Barcelona", result[1]["match_score"] >= 80
```
|
||||
|
||||
## Notes

Migrated from thefuzz to rapidfuzz (compatible API, faster). No pandas: the merge is implemented manually via a dict lookup keyed on right_key. Thresholds are tried from highest to lowest; the first one met is assigned to threshold_used.
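
The threshold selection described above can be illustrated without rapidfuzz. This is a minimal sketch: `pick_threshold` is a hypothetical helper and the scores passed to it are made-up values standing in for `token_sort_ratio` results.

```python
from typing import Optional, Sequence


# Hypothetical helper: returns the highest threshold that the score meets.
def pick_threshold(score: float, thresholds: Sequence[int] = (90, 80, 70, 60, 50)) -> Optional[int]:
    """Try thresholds from highest to lowest and return the first one met."""
    for t in sorted(thresholds, reverse=True):
        if score >= t:
            return t
    return None  # no threshold met: the row is treated as unmatched


print(pick_threshold(94))                # meets the top threshold
print(pick_threshold(75))                # falls through to a lower one
print(pick_threshold(40))                # below every threshold
print(pick_threshold(85, [50, 90, 80]))  # input order does not matter
```

Sorting inside the helper is a design choice of this sketch: it keeps the "highest threshold met" guarantee even when the caller passes the list unsorted.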
|
||||
@@ -0,0 +1,108 @@
|
||||
"""Fuzzy merge adaptativo con multiples thresholds usando rapidfuzz."""
|
||||
|
||||
from __future__ import annotations


def fuzzy_merge_adaptive(
    left: list[dict],
    right: list[dict],
    left_key: str,
    right_key: str,
    thresholds: list[int] | None = None,
    how: str = "left",
) -> list[dict]:
    """Perform an adaptive fuzzy join between two lists of dicts.

    For each item in left, finds the best match in right using
    rapidfuzz.fuzz.token_sort_ratio. Thresholds are tried from highest to
    lowest and threshold_used is set to the highest threshold met. If none
    is met, the match is None.

    Args:
        left: List of dicts (left side of the join).
        right: List of dicts (right side of the join).
        left_key: Key in the left dicts used for matching.
        right_key: Key in the right dicts used for matching.
        thresholds: Thresholds to try in descending order.
            Default [90, 80, 70, 60, 50].
        how: Join type. 'left' includes every item of left
            (with None in the right-side fields when there is no match).
            'inner' includes only items with a match.

    Returns:
        List of merged dicts with the fields of left + the fields of right
        (suffixes _left/_right on collisions) + fuzzy_match, match_score,
        threshold_used.
    """
    from rapidfuzz import fuzz, process

    if thresholds is None:
        thresholds = [90, 80, 70, 60, 50]
    # Enforce the documented descending order even if the caller passes the
    # thresholds unsorted.
    thresholds = sorted(thresholds, reverse=True)

    right_values = [
        str(r[right_key]) for r in right if r.get(right_key) is not None
    ]

    def find_best_match(value: str | None) -> tuple[str | None, int, int | None]:
        if value is None:
            return None, 0, None
        result = process.extractOne(str(value), right_values, scorer=fuzz.token_sort_ratio)
        if not result:
            return None, 0, None
        match_str, score = result[0], result[1]
        for t in thresholds:
            if score >= t:
                return match_str, score, t
        return None, 0, None

    # Detect key collisions (based on the first row of each side)
    left_keys = set(left[0].keys()) if left else set()
    right_keys = set(right[0].keys()) if right else set()
    collision_keys = left_keys & right_keys

    # Build an index of right rows keyed by right_key
    right_index: dict[str, dict] = {}
    for r in right:
        val = r.get(right_key)
        if val is not None:
            right_index[str(val)] = r

    result_rows = []
    for item in left:
        value = item.get(left_key)
        fuzzy_match, score, threshold_used = find_best_match(value)

        if fuzzy_match is None and how == "inner":
            continue

        row: dict = {}
        # Fields from left
        for k, v in item.items():
            if k in collision_keys:
                row[f"{k}_left"] = v
            else:
                row[k] = v

        # Fields from right
        matched_right = right_index.get(fuzzy_match) if fuzzy_match else None
        if matched_right is not None:
            for k, v in matched_right.items():
                if k in collision_keys:
                    row[f"{k}_right"] = v
                else:
                    row[k] = v
        else:
            for k in right_keys:
                if k in collision_keys:
                    row[f"{k}_right"] = None
                else:
                    row[k] = None

        row["fuzzy_match"] = fuzzy_match
        row["match_score"] = score
        row["threshold_used"] = threshold_used
        result_rows.append(row)

    return result_rows
|
||||
@@ -0,0 +1,52 @@
|
||||
---
|
||||
id: geometric_mean_py_datascience
|
||||
name: geometric_mean
|
||||
kind: function
|
||||
lang: py
|
||||
domain: datascience
|
||||
version: "1.0.0"
|
||||
purity: pure
|
||||
signature: "def geometric_mean(values: list[float]) -> float"
|
||||
description: "Geometric mean of positive elements via exp(mean(log(x))). Non-positive values are filtered out. Returns math.nan if no positives."
|
||||
tags: [statistics, mean, geometric, distribution, lognormal]
|
||||
uses_functions: []
|
||||
uses_types: []
|
||||
returns: []
|
||||
returns_optional: false
|
||||
error_type: ""
|
||||
imports: [math, numpy]
|
||||
example: |
|
||||
from geometric_mean import geometric_mean
|
||||
result = geometric_mean([1, 2, 4, 8]) # ~2.828 (2^1.5)
|
||||
tested: true
|
||||
tests:
|
||||
- "test_geometric_mean_powers_of_two"
|
||||
- "test_geometric_mean_filters_non_positive"
|
||||
- "test_geometric_mean_empty_returns_nan"
|
||||
- "test_geometric_mean_all_negative_returns_nan"
|
||||
- "test_geometric_mean_single_positive"
|
||||
test_file_path: "python/functions/datascience/tests/test_geometric_mean.py"
|
||||
file_path: "python/functions/datascience/geometric_mean.py"
|
||||
params:
|
||||
- name: values
|
||||
desc: "List of numeric values. Non-positive elements are silently ignored."
|
||||
output: "Geometric mean as float, computed over positive elements only. Returns math.nan if there are no positive values."
|
||||
source_repo: "internal:footprint_aurgi"
|
||||
source_license: "internal-aurgi"
|
||||
source_file: "aurgi_mapas/generar_pdf_reporte.py:126"
|
||||
---
|
||||
|
||||
## Example

```python
from geometric_mean import geometric_mean

geometric_mean([1, 2, 4, 8])   # 2.828... (= 2^1.5)
geometric_mean([1, -2, 3])     # exp((log(1)+log(3))/2), ignores -2
geometric_mean([])             # math.nan
geometric_mean([-1, -2])       # math.nan, no positives
```
|
||||
|
||||
## Notes

Suited to lognormal distributions or multiplicative data (prices, ratios, growth rates). Equivalent to the n-th root of the product, but numerically more stable because it is computed in log space.
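
The stability claim can be checked with a small experiment. The sketch below uses only the standard library rather than numpy, and both function names are hypothetical stand-ins written for this comparison; it assumes the same "ignore non-positive values" rule as `geometric_mean`.

```python
import math


def geometric_mean_logspace(values: list[float]) -> float:
    """Geometric mean via exp(mean(log(x))) over the positive values."""
    positives = [v for v in values if v > 0]
    if not positives:
        return math.nan
    return math.exp(sum(math.log(v) for v in positives) / len(positives))


def geometric_mean_naive(values: list[float]) -> float:
    """Geometric mean via the n-th root of the product (overflow-prone)."""
    positives = [v for v in values if v > 0]
    if not positives:
        return math.nan
    return math.prod(positives) ** (1.0 / len(positives))


big = [1e200, 1e200]
# The product 1e400 exceeds the float range, so the naive form degrades to inf,
# while the log-space form stays finite.
print(geometric_mean_naive(big))
print(geometric_mean_logspace(big))
```

For moderate inputs both forms agree up to rounding; the difference only shows when the running product leaves the float range.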
|
||||
@@ -0,0 +1,23 @@
|
||||
"""geometric_mean — Geometric mean of positive values."""
|
||||
|
||||
import math
import numpy as np


def geometric_mean(values: list[float]) -> float:
    """Return the geometric mean of the positive elements in values.

    Filters out non-positive numbers before computing exp(mean(log(x))).
    Returns math.nan if there are no positive values.

    Args:
        values: List of numeric values (non-positive elements are ignored).

    Returns:
        Geometric mean as float, or math.nan if no positive values exist.
    """
    positives = [v for v in values if v > 0]
    if not positives:
        return math.nan
    arr = np.array(positives, dtype=float)
    return float(np.exp(np.mean(np.log(arr))))
|
||||
@@ -0,0 +1,67 @@
|
||||
---
|
||||
name: gliner2_load_model
|
||||
kind: function
|
||||
lang: py
|
||||
domain: datascience
|
||||
version: "1.0.0"
|
||||
purity: impure
|
||||
signature: "def gliner2_load_model(model_name: str = 'fastino/gliner2-large-v1', device: str = 'auto') -> Any"
|
||||
description: "Carga (y cachea por (model_name, device)) un modelo GLiNER2 (NER+RE joint). GLiNER2 extrae entidades y relaciones en una sola pasada con schema unificado. ~2x mas rapido que GLiNER + GLiREL separados. LICENSE: Apache 2.0."
|
||||
tags: [gliner2, ner, relation-extraction, nlp, model, huggingface, zero-shot, joint, datascience, python, apache2]
|
||||
uses_functions: []
|
||||
uses_types: []
|
||||
returns: []
|
||||
returns_optional: false
|
||||
error_type: "error_go_core"
|
||||
imports: [gliner2]
|
||||
params:
|
||||
- name: model_name
|
||||
desc: "ID del modelo en HuggingFace Hub. Default: fastino/gliner2-large-v1. Alternativas: fastino/gliner2-base-v1 (mas ligero)."
|
||||
- name: device
|
||||
desc: "'auto' usa CUDA si disponible, sino CPU. Valores: 'cpu', 'cuda', 'cuda:0', 'cuda:1'. auto es el default recomendado."
|
||||
output: "Instancia GLiNER2 cacheada por (model_name, device). Tiene metodos .create_schema().entities(...).relations(...) y .extract(text, schema=schema, threshold=0.3)."
|
||||
tested: true
|
||||
tests:
|
||||
- "cache devuelve la misma instancia con los mismos parametros"
|
||||
- "device=auto resuelve a cpu si torch no esta instalado"
|
||||
- "ImportError si gliner2 no esta instalado"
|
||||
test_file_path: "python/functions/datascience/tests/test_gliner2_load_model.py"
|
||||
file_path: "python/functions/datascience/gliner2_load_model.py"
|
||||
notes: |
|
||||
  LICENSE: fastino/gliner2-large-v1 is Apache 2.0, commercial use OK.
  Difference from gliner_load_model: GLiNER does NER only, GLiNER2 does NER+RE
  in a single pass (joint schema). For graph pipelines, use GLiNER2
  when both tasks are needed at the same time.

  impure: downloads from network/disk on first use and keeps state in _MODEL_CACHE.
  Size: fastino/gliner2-large-v1 ~500 MB. First load takes 15-30 s on CPU.
  CPU inference: 10-50 KB of text/s with a typical schema (3 entity + 8 relation labels).
|
||||
---
|
||||
|
||||
## Example

```python
from datascience.gliner2_load_model import gliner2_load_model

model = gliner2_load_model(device="auto")

schema = (
    model.create_schema()
    .entities(["person", "organization", "location"])
    .relations(["works_at", "ceo_of", "located_in"])
)

result = model.extract(
    "Pablo Isla es el CEO de Inditex, empresa con sede en Arteixo.",
    schema=schema,
    threshold=0.3,
)
# result["entities"] -> {"person": ["Pablo Isla"], "organization": ["Inditex"], ...}
# result["relation_extraction"] -> {"ceo_of": [("Pablo Isla", "Inditex")], ...}
```
|
||||
|
||||
## Installation

```bash
cd python && uv pip install gliner2
# or with the full NLP extra:
cd python && uv pip install -e '.[nlp]'
```
|
||||
@@ -0,0 +1,62 @@
|
||||
"""Carga (y cachea) un modelo GLiNER2 (NER+RE joint en una sola pasada).
|
||||
|
||||
LICENSE: Apache 2.0 — uso comercial permitido.
|
||||
Modelo por defecto: fastino/gliner2-large-v1
|
||||
"""
|
||||
|
||||
from __future__ import annotations

from typing import Any

# Global cache: (model_name, device) -> GLiNER2 instance
_MODEL_CACHE: dict[tuple[str, str], Any] = {}


def _resolve_device(device: str) -> str:
    """Resolve 'auto' to 'cuda' or 'cpu' depending on torch availability."""
    if device != "auto":
        return device
    try:
        import torch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


def gliner2_load_model(
    model_name: str = "fastino/gliner2-large-v1",
    device: str = "auto",
) -> Any:
    """Load (and cache) a GLiNER2 model.

    GLiNER2 extracts entities AND relations in a single forward pass using
    a joint schema (entities + relation_labels). This is ~2x faster than
    running GLiNER + GLiREL separately for co-occurring entities.

    Returns model instance with .extract() and .create_schema() methods.

    LICENSE: Apache 2.0, commercial use OK.

    Args:
        model_name: HuggingFace Hub model ID. Default: fastino/gliner2-large-v1.
        device: 'auto' uses CUDA if available, else CPU. 'cpu', 'cuda', 'cuda:N'.

    Returns:
        GLiNER2 instance cached by (model_name, device).
    """
    resolved = _resolve_device(device)
    key = (model_name, resolved)
    if key in _MODEL_CACHE:
        return _MODEL_CACHE[key]

    from gliner2 import GLiNER2  # type: ignore[import]

    m = GLiNER2.from_pretrained(model_name)
    if hasattr(m, "to") and resolved != "cpu":
        try:
            m.to(resolved)
        except Exception:
            # Moving to the device failed: keep the model on CPU silently.
            # The cache key still records the requested device.
            pass

    _MODEL_CACHE[key] = m
    return m
|
||||
@@ -0,0 +1,52 @@
|
||||
---
|
||||
id: kde_density_levels_py_datascience
|
||||
name: kde_density_levels
|
||||
kind: function
|
||||
lang: py
|
||||
domain: datascience
|
||||
version: "1.0.0"
|
||||
purity: pure
|
||||
signature: "def kde_density_levels(xs: list[float], ys: list[float], bw_adjust: float = 0.6, abs_quantile: float = 0.1, dense_quantile: float = 0.85, bins: int = 80) -> dict | None"
|
||||
description: "Estimates 2-D density via KDE (scipy) or histogram fallback (numpy) and returns per-point density values plus absolute and dense quantile thresholds."
|
||||
tags: [statistics, kde, density, spatial, geospatial, scipy, numpy]
|
||||
uses_functions: []
|
||||
uses_types: []
|
||||
returns: []
|
||||
returns_optional: true
|
||||
error_type: ""
|
||||
imports: [numpy, scipy]
|
||||
example: |
|
||||
from kde_density_levels import kde_density_levels
|
||||
import numpy as np
|
||||
rng = np.random.default_rng(42)
|
||||
result = kde_density_levels(rng.normal(0,1,50).tolist(), rng.normal(0,1,50).tolist())
|
||||
# {"method": "kde", "densities": array(...), "abs_level": ..., "dense_level": ...}
|
||||
tested: true
|
||||
tests:
|
||||
- "test_kde_density_levels_returns_dict_for_50_points"
|
||||
- "test_kde_density_levels_none_for_few_points"
|
||||
- "test_kde_density_levels_none_for_4_points"
|
||||
- "test_kde_density_levels_levels_ordered"
|
||||
- "test_kde_density_levels_mismatched_lengths"
|
||||
test_file_path: "python/functions/datascience/tests/test_kde_density_levels.py"
|
||||
file_path: "python/functions/datascience/kde_density_levels.py"
|
||||
params:
|
||||
- name: xs
|
||||
desc: "X-coordinates of the 2-D point cloud."
|
||||
- name: ys
|
||||
desc: "Y-coordinates of the 2-D point cloud. Must have same length as xs."
|
||||
- name: bw_adjust
|
||||
desc: "Scalar passed to gaussian_kde as bw_method, so it is used directly as the KDE bandwidth factor. Default 0.6."
|
||||
- name: abs_quantile
|
||||
desc: "Quantile of density values used as the absolute (sparse) threshold. Default 0.1."
|
||||
- name: dense_quantile
|
||||
desc: "Quantile of density values used as the dense cluster threshold. Default 0.85."
|
||||
- name: bins
|
||||
desc: "Number of bins per axis for the histogram fallback. Default 80."
|
||||
output: "Dict with method (str), densities (np.ndarray of per-point density), abs_level (float), dense_level (float). Returns None if len(xs) < 5 or lengths differ."
|
||||
source_repo: "internal:footprint_aurgi"
|
||||
source_license: "internal-aurgi"
|
||||
source_file: "ponderacion_isochronas/src/recomendador_centros.py:305"
|
||||
---
|
||||
|
||||
Pure function that writes nothing to disk. returns_optional=true because it returns None when there are fewer than 5 points.
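
The histogram fallback assigns each point the count of the bin it falls in and then takes two quantiles of those per-point values. A minimal pure-Python sketch of that idea follows, reduced to one dimension for brevity. The helper names `bin_index` and `quantile` and the sample data are hypothetical and are not part of the registry function. The quantile helper uses linear interpolation, which is also the default of `np.quantile`.

```python
from bisect import bisect_right


def bin_index(edges: list[float], value: float) -> int:
    """Return the index of the half-open bin [edges[i], edges[i+1]) holding value.

    The last bin is treated as closed on the right, so value == edges[-1]
    maps to the final bin instead of falling off the end.
    """
    n_bins = len(edges) - 1
    idx = bisect_right(edges, value) - 1
    return min(max(idx, 0), n_bins - 1)


def quantile(values: list[float], q: float) -> float:
    """Linear-interpolation quantile of a non-empty list, for 0 <= q <= 1."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


# Hypothetical 1-D data with 4 equal-width bins on [0, 4].
edges = [0.0, 1.0, 2.0, 3.0, 4.0]
points = [0.0, 0.5, 1.0, 1.2, 1.9, 3.5, 4.0]

# Build the histogram.
counts = [0] * (len(edges) - 1)
for p in points:
    counts[bin_index(edges, p)] += 1

# Per-point density = count of the bin each point falls in.
densities = [float(counts[bin_index(edges, p)]) for p in points]
abs_level = quantile(densities, 0.1)
dense_level = quantile(densities, 0.85)
```

Points whose density is below `abs_level` would be treated as sparse, and points at or above `dense_level` as part of a dense cluster.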
|
||||
@@ -0,0 +1,65 @@
|
||||
"""kde_density_levels — Compute density levels via KDE or histogram fallback."""
|
||||
|
||||
import math
|
||||
import numpy as np
|
||||
|
||||
|
||||
def kde_density_levels(
    xs: list[float],
    ys: list[float],
    bw_adjust: float = 0.6,
    abs_quantile: float = 0.1,
    dense_quantile: float = 0.85,
    bins: int = 80,
) -> dict | None:
    """Estimate 2-D density and compute absolute and dense threshold levels.

    Uses scipy.stats.gaussian_kde when available; falls back to
    numpy.histogram2d if scipy is not installed.

    Args:
        xs: X-coordinates of points.
        ys: Y-coordinates of points.
        bw_adjust: Scalar passed to gaussian_kde as ``bw_method``, so it is
            used directly as the KDE bandwidth factor (ignored by the
            histogram fallback).
        abs_quantile: Quantile of density values used as the absolute threshold.
        dense_quantile: Quantile of density values used as the dense threshold.
        bins: Number of bins per axis for the histogram fallback.

    Returns:
        Dict with keys:
            "method" (str): "kde" or "hist".
            "densities" (np.ndarray): 1-D array of per-point density estimates.
            "abs_level" (float): density at abs_quantile.
            "dense_level" (float): density at dense_quantile.
        Returns None if len(xs) < 5 or xs and ys have different lengths.
    """
    if len(xs) < 5 or len(xs) != len(ys):
        return None

    xs_arr = np.array(xs, dtype=float)
    ys_arr = np.array(ys, dtype=float)
    points = np.vstack([xs_arr, ys_arr])

    try:
        from scipy.stats import gaussian_kde  # type: ignore

        kde = gaussian_kde(points, bw_method=bw_adjust)
        densities = kde(points)
        method = "kde"
    except ImportError:
        # Histogram fallback: each point gets the count of the bin it falls in.
        h, xedges, yedges = np.histogram2d(xs_arr, ys_arr, bins=bins)
        # side="right" matches histogram2d's half-open bins [edge_i, edge_i+1);
        # the clip keeps the maximum value inside the last bin.
        xi = np.clip(np.searchsorted(xedges, xs_arr, side="right") - 1, 0, bins - 1)
        yi = np.clip(np.searchsorted(yedges, ys_arr, side="right") - 1, 0, bins - 1)
        densities = h[xi, yi].astype(float)
        method = "hist"

    abs_level = float(np.quantile(densities, abs_quantile))
    dense_level = float(np.quantile(densities, dense_quantile))

    return {
        "method": method,
        "densities": densities,
        "abs_level": abs_level,
        "dense_level": dense_level,
    }
|
||||
@@ -0,0 +1,61 @@
|
||||
---
|
||||
name: marianmt_es_en_load_model
|
||||
kind: function
|
||||
lang: py
|
||||
domain: datascience
|
||||
version: "1.0.0"
|
||||
purity: impure
|
||||
signature: "def marianmt_es_en_load_model(model_name: str = 'Helsinki-NLP/opus-mt-es-en') -> tuple[Any, Any]"
|
||||
description: "Loads (and caches) the MarianMT tokenizer and model for ES->EN translation (Helsinki-NLP, ~300 MB). Apache 2.0 license. Cached by model_name."
|
||||
tags: [marianmt, translation, es-en, nlp, model, huggingface, apache2, datascience, python]
|
||||
uses_functions: []
|
||||
uses_types: []
|
||||
returns: []
|
||||
returns_optional: false
|
||||
error_type: "error_go_core"
|
||||
imports: [transformers]
|
||||
params:
  - name: model_name
    desc: "HuggingFace Hub model ID (default: Helsinki-NLP/opus-mt-es-en, ~300 MB)"
output: "Tuple (tokenizer, model) ready for inference, cached by model_name."
|
||||
tested: false
|
||||
tests: []
|
||||
test_file_path: ""
|
||||
file_path: "python/functions/datascience/marianmt_es_en_load_model.py"
|
||||
notes: |
  LICENSE: Apache 2.0, commercial use permitted.

  Useful as a step before REBEL (English-only): translate ES -> EN with MarianMT
  and then use rebel_load_model for relation extraction in English.

  impure: downloads over the network and writes to disk on first use, and keeps state in _MODEL_CACHE.
  Uses MarianTokenizer and MarianMTModel instead of Auto* because Marian models
  have a specialized tokenizer with a SentencePiece (SPM) vocabulary.
|
||||
---
|
||||
|
||||
## Example
|
||||
|
||||
```python
|
||||
from python.functions.datascience.marianmt_es_en_load_model import marianmt_es_en_load_model
|
||||
from python.functions.datascience.translate_es_to_en import translate_es_to_en
|
||||
|
||||
tokenizer, model = marianmt_es_en_load_model()
|
||||
translated = translate_es_to_en("Pablo Isla es presidente de Inditex.", tokenizer, model)
|
||||
# "Pablo Isla is president of Inditex."
|
||||
```
|
||||
|
||||
## Size and latency

- `Helsinki-NLP/opus-mt-es-en`: ~300 MB on disk.
- First load: 5-15 s on CPU.
- CPU inference: 0.5-2 s per sentence.
- GPU: much faster.

## Use as a preprocessor for REBEL

```
Spanish text -> marianmt_es_en -> English text -> rebel_load_model -> triplets
```

This pipeline makes it possible to use REBEL (Apache 2.0, English only) with Spanish text.
Direct alternative: use mrebel_load_model (CC BY-NC-SA, multilingual).
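
The translate-then-extract flow can be sketched as a small composition of two callables. This is only an illustration under stated assumptions: `extract_relations_via_english`, `fake_translate`, and `fake_extract` are hypothetical names, and the two stand-ins replace the real MarianMT translation and REBEL extraction steps so that the sketch runs without downloading any model.

```python
from typing import Callable

Triplet = dict[str, str]


def extract_relations_via_english(
    text_es: str,
    translate: Callable[[str], str],
    extract: Callable[[str], list[Triplet]],
) -> list[Triplet]:
    """Translate Spanish text to English, then run an English-only extractor."""
    if not text_es.strip():
        return []
    return extract(translate(text_es))


# Stand-in for the MarianMT translation step (hypothetical, no model involved).
def fake_translate(text: str) -> str:
    known = {
        "Pablo Isla es presidente de Inditex.": "Pablo Isla is president of Inditex.",
    }
    return known.get(text, text)


# Stand-in for the REBEL extraction step (hypothetical, no model involved).
def fake_extract(text_en: str) -> list[Triplet]:
    if "Inditex" in text_en:
        return [{"head": "Pablo Isla", "type": "employer", "tail": "Inditex"}]
    return []


triplets = extract_relations_via_english(
    "Pablo Isla es presidente de Inditex.", fake_translate, fake_extract
)
```

Injecting the two steps as callables keeps the license-sensitive choice (REBEL vs mREBEL) at the call site rather than inside the pipeline.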
|
||||
@@ -0,0 +1,54 @@
"""Loads (and caches) the MarianMT model for ES -> EN translation."""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
from typing import Any
|
||||
|
||||
# Cache global: model_name -> (tokenizer, model)
|
||||
_MODEL_CACHE: dict[str, tuple[Any, Any]] = {}
|
||||
|
||||
|
||||
def marianmt_es_en_load_model(
    model_name: str = "Helsinki-NLP/opus-mt-es-en",
) -> tuple[Any, Any]:
    """Loads (and caches) a MarianMT model for Spanish-to-English translation.

    MarianMT is a lightweight seq2seq translation model (~300 MB) from
    Helsinki-NLP, trained on the OPUS parallel corpus.

    LICENSE: Apache 2.0, commercial use permitted.

    The first call downloads the model from HuggingFace Hub (~300 MB).
    Subsequent calls with the same ``model_name`` return the cached instance.

    Args:
        model_name: HuggingFace Hub model ID. Default is the ES->EN model.
            Other available models follow the pattern
            ``Helsinki-NLP/opus-mt-{src}-{tgt}``.

    Returns:
        Tuple ``(tokenizer, model)`` both ready for inference with
        ``model.generate(...)`` and ``tokenizer.decode(...)``.

    Raises:
        ImportError: if ``transformers`` is not installed.
        OSError: if the model cannot be downloaded or loaded from disk.
    """
    cached = _MODEL_CACHE.get(model_name)
    if cached is not None:
        return cached

    try:
        from transformers import MarianMTModel, MarianTokenizer
    except ImportError as exc:
        raise ImportError(
            "transformers is not installed. Install it with "
            "`uv pip install transformers` or `uv pip install -e '.[nlp]'`."
        ) from exc

    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)
    model.eval()

    _MODEL_CACHE[model_name] = (tokenizer, model)
    return tokenizer, model
|
||||
@@ -0,0 +1,56 @@
|
||||
---
|
||||
name: mrebel_base_load_model
|
||||
kind: function
|
||||
lang: py
|
||||
domain: datascience
|
||||
version: "1.0.0"
|
||||
purity: impure
|
||||
signature: "def mrebel_base_load_model(model_name: str = 'Babelscape/mrebel-base', src_lang: str = 'es_XX', tgt_lang: str = 'tp_XX') -> tuple[Any, Any]"
|
||||
description: "Fast variant of mrebel_load_model using the base checkpoint (250M params, ~900 MB). Delegates entirely to mrebel_load_model. Same CC BY-NC-SA 4.0 license: non-commercial use only."
|
||||
tags: [mrebel, relation-extraction, nlp, model, huggingface, multilingual, seq2seq, datascience, python]
|
||||
uses_functions: [mrebel_load_model_py_datascience]
|
||||
uses_types: []
|
||||
returns: []
|
||||
returns_optional: false
|
||||
error_type: "error_go_core"
|
||||
imports: []
|
||||
params:
  - name: model_name
    desc: "HuggingFace Hub model ID (default: Babelscape/mrebel-base, 250M params)"
  - name: src_lang
    desc: "Source language code for the mBART tokenizer: 'es_XX' (ES), 'en_XX' (EN), etc."
  - name: tgt_lang
    desc: "Target language token for the decoder, always 'tp_XX'"
output: "Tuple (tokenizer, model) ready for inference, cached by (model_name, src_lang) in the cache shared with mrebel_load_model."
|
||||
tested: false
|
||||
tests: []
|
||||
test_file_path: ""
|
||||
file_path: "python/functions/datascience/mrebel_base_load_model.py"
|
||||
notes: |
  LICENSE: Babelscape/mrebel-base is under CC BY-NC-SA 4.0 (Creative Commons
  Non-Commercial Share-Alike). Non-commercial use only. Do NOT use in commercial products.

  This function is a thin wrapper: it does NOT duplicate any loading or caching logic.
  All of that logic lives in mrebel_load_model. Useful for benchmarks that compare
  base vs large through the same interface.

  The cache is shared with mrebel_load_model (the same module-level _MODEL_CACHE dict).
|
||||
---
|
||||
|
||||
## Example
|
||||
|
||||
```python
|
||||
from python.functions.datascience.mrebel_base_load_model import mrebel_base_load_model
|
||||
|
||||
# 250M params vs 600M: same interface
|
||||
tokenizer, model = mrebel_base_load_model(src_lang="es_XX")
|
||||
```
|
||||
|
||||
## Base vs large comparison

| Variant | Params | Size | CPU latency per sentence | Typical recall |
|---------|--------|------|--------------------------|----------------|
| mrebel-large | 600M | ~2.4 GB | 15-30 s | high |
| mrebel-base | 250M | ~900 MB | 5-10 s | medium |

For speed benchmarks in graph_explorer, use base. For final production use, evaluate large.
|
||||
@@ -0,0 +1,41 @@
"""Loads (and caches) the mREBEL-base model (fast variant, 250M params)."""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
from typing import Any
|
||||
|
||||
from python.functions.datascience.mrebel_load_model import mrebel_load_model
|
||||
|
||||
|
||||
def mrebel_base_load_model(
    model_name: str = "Babelscape/mrebel-base",
    src_lang: str = "es_XX",
    tgt_lang: str = "tp_XX",
) -> tuple[Any, Any]:
    """Loads (and caches) the mREBEL-base tokenizer and model.

    Thin wrapper over ``mrebel_load_model`` with the base checkpoint as
    default (250M params, ~900 MB). Faster than the large variant at the
    cost of some recall on complex sentences.

    LICENSE NOTICE: Babelscape/mrebel-base is licensed under CC BY-NC-SA 4.0
    (Creative Commons Non-Commercial Share-Alike). Do NOT use in commercial
    products without replacing this model.

    Args:
        model_name: HuggingFace Hub model ID. Defaults to the base checkpoint.
        src_lang: Source language code for the mBART tokenizer.
        tgt_lang: Target language token for the decoder (always ``"tp_XX"``).

    Returns:
        Tuple ``(tokenizer, model)`` ready for inference.

    Raises:
        ImportError: if ``transformers`` is not installed.
        OSError: if the model cannot be downloaded or loaded from disk.
    """
    return mrebel_load_model(
        model_name=model_name,
        src_lang=src_lang,
        tgt_lang=tgt_lang,
    )
|
||||
@@ -0,0 +1,76 @@
|
||||
---
|
||||
name: mrebel_load_model
|
||||
kind: function
|
||||
lang: py
|
||||
domain: datascience
|
||||
version: "1.0.0"
|
||||
purity: impure
|
||||
signature: "def mrebel_load_model(model_name: str = 'Babelscape/mrebel-large', src_lang: str = 'es_XX', tgt_lang: str = 'tp_XX') -> tuple[Any, Any]"
|
||||
description: "Loads (and caches) the mREBEL tokenizer and model (mBART-based, ~600M params, ~2.4 GB). Multilingual, 30+ languages. Cached by (model_name, src_lang). The first call downloads from HuggingFace. LICENSE CC BY-NC-SA 4.0: non-commercial use only."
|
||||
tags: [mrebel, relation-extraction, nlp, model, huggingface, multilingual, seq2seq, datascience, python]
|
||||
uses_functions: []
|
||||
uses_types: []
|
||||
returns: []
|
||||
returns_optional: false
|
||||
error_type: "error_go_core"
|
||||
imports: [transformers]
|
||||
params:
  - name: model_name
    desc: "HuggingFace Hub model ID (default: Babelscape/mrebel-large, 600M params)"
  - name: src_lang
    desc: "Source language code for the mBART tokenizer: 'es_XX' (ES), 'en_XX' (EN), 'fr_XX' (FR), etc."
  - name: tgt_lang
    desc: "Target language token for the decoder, always 'tp_XX' for the mREBEL triplet format"
output: "Tuple (tokenizer, model) ready for inference. Cached by (model_name, src_lang)."
|
||||
tested: false
|
||||
tests: []
|
||||
test_file_path: ""
|
||||
file_path: "python/functions/datascience/mrebel_load_model.py"
|
||||
notes: |
  LICENSE: Babelscape/mrebel-large is under CC BY-NC-SA 4.0 (Creative Commons
  Non-Commercial Share-Alike). Non-commercial use only. Do NOT use in commercial
  products without replacing it with a commercially licensed model.

  impure: downloads over the network and writes to disk on first use, and keeps state in _MODEL_CACHE.
  Does not need the glirel HF kwargs patch: AutoModelForSeq2SeqLM is the standard path.
  The cache is keyed by (model_name, src_lang): two different languages create two instances
  because the tokenizer has src_lang fixed at load time.
|
||||
---
|
||||
|
||||
## Example
|
||||
|
||||
```python
|
||||
from python.functions.datascience.mrebel_load_model import mrebel_load_model
|
||||
from python.functions.datascience.parse_rebel_output import parse_rebel_output
|
||||
|
||||
tokenizer, model = mrebel_load_model(src_lang="es_XX")
|
||||
|
||||
text = "Pablo Isla es el presidente de Inditex."
|
||||
inputs = tokenizer(text, return_tensors="pt", max_length=512, truncation=True)
|
||||
generated = model.generate(**inputs, num_beams=4, length_penalty=1.0, max_length=256)
|
||||
decoded = tokenizer.decode(generated[0], skip_special_tokens=False)
|
||||
triplets = parse_rebel_output(decoded)
|
||||
```
|
||||
|
||||
## Size and latency

- `Babelscape/mrebel-large`: ~2.4 GB on disk (model + tokenizer).
- First load: 30-90 s on CPU, depending on network and disk.
- CPU inference: 5-15 s per sentence (mBART is slower than REBEL/BART).
- GPU inference (CUDA T4): 0.5-2 s per sentence.

## Supported languages

mREBEL supports the mBART-50 languages. Examples:
- `es_XX`: Spanish
- `en_XX`: English
- `fr_XX`: French
- `de_DE`: German
- `pt_XX`: Portuguese
- `it_IT`: Italian

## Notes

- For English and commercial use, use `rebel_load_model` (Apache 2.0).
- For quick benchmarks, use `mrebel_base_load_model` (250M params, same license).
- `model.eval()` is called on load to disable dropout at inference time.
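
The caching behaviour keyed by `(model_name, src_lang)` can be illustrated without loading any model. The sketch below is an assumption-labelled stand-in: `cached_load`, `stub_loader`, and `load_calls` are hypothetical names, and the stub returns plain strings where the real function returns a tokenizer and a model.

```python
from typing import Any, Callable

# Hypothetical cache with the same key shape as the real module.
_CACHE: dict[tuple[str, str], tuple[Any, Any]] = {}
load_calls: list[tuple[str, str]] = []


def cached_load(
    model_name: str,
    src_lang: str,
    loader: Callable[[str, str], tuple[Any, Any]],
) -> tuple[Any, Any]:
    """Return the cached pair for (model_name, src_lang), loading it once."""
    key = (model_name, src_lang)
    cached = _CACHE.get(key)
    if cached is not None:
        return cached
    pair = loader(model_name, src_lang)
    _CACHE[key] = pair
    return pair


def stub_loader(model_name: str, src_lang: str) -> tuple[str, str]:
    """Stand-in for the expensive load; records each call."""
    load_calls.append((model_name, src_lang))
    return (f"tokenizer[{src_lang}]", f"model[{model_name}]")


a = cached_load("Babelscape/mrebel-large", "es_XX", stub_loader)
b = cached_load("Babelscape/mrebel-large", "es_XX", stub_loader)
c = cached_load("Babelscape/mrebel-large", "en_XX", stub_loader)
```

The second call reuses the first entry, while the `en_XX` call triggers a new load, which is why switching languages costs a second model instance in memory.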
|
||||
@@ -0,0 +1,69 @@
"""Loads (and caches) the mREBEL model for multilingual relation extraction."""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
from typing import Any
|
||||
|
||||
# Cache global: (model_name, src_lang) -> (tokenizer, model)
|
||||
_MODEL_CACHE: dict[tuple[str, str], tuple[Any, Any]] = {}
|
||||
|
||||
|
||||
def mrebel_load_model(
    model_name: str = "Babelscape/mrebel-large",
    src_lang: str = "es_XX",
    tgt_lang: str = "tp_XX",
) -> tuple[Any, Any]:
    """Loads (and caches) the mREBEL tokenizer and model.

    mREBEL is a multilingual seq2seq model (mBART-based, ~600M params, ~2.4 GB)
    for relation extraction. It supports 30+ languages via language codes
    (``src_lang``).

    LICENSE NOTICE: Babelscape/mrebel-large is licensed under CC BY-NC-SA 4.0
    (Creative Commons Non-Commercial Share-Alike). Do NOT use in commercial
    products without replacing this model with a commercially-licensed
    alternative (e.g. Babelscape/rebel-large which is Apache 2.0 but
    English-only).

    The first call downloads the model from HuggingFace Hub (~2.4 GB).
    Subsequent calls with the same ``(model_name, src_lang)`` return the
    cached instance without re-loading.

    Args:
        model_name: HuggingFace Hub model ID. Default is the large variant.
        src_lang: Source language code for the mBART tokenizer, e.g.
            ``"es_XX"`` (Spanish), ``"en_XX"`` (English), ``"fr_XX"`` (French).
        tgt_lang: Target language token for the decoder (always ``"tp_XX"``
            for the triplet format; only change if using a custom checkpoint).

    Returns:
        Tuple ``(tokenizer, model)`` both ready for inference with
        ``model.generate(...)`` and ``tokenizer.decode(...)``.

    Raises:
        ImportError: if ``transformers`` is not installed.
        OSError: if the model cannot be downloaded or loaded from disk.
    """
    cache_key = (model_name, src_lang)
    cached = _MODEL_CACHE.get(cache_key)
    if cached is not None:
        return cached

    try:
        from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
    except ImportError as exc:
        raise ImportError(
            "transformers is not installed. Install it with "
            "`uv pip install transformers` or `uv pip install -e '.[nlp]'`."
        ) from exc

    tokenizer = AutoTokenizer.from_pretrained(
        model_name,
        src_lang=src_lang,
        tgt_lang=tgt_lang,
    )
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
    model.eval()

    _MODEL_CACHE[cache_key] = (tokenizer, model)
    return tokenizer, model
|
||||
@@ -0,0 +1,65 @@
|
||||
---
|
||||
name: parse_rebel_output
|
||||
kind: function
|
||||
lang: py
|
||||
domain: datascience
|
||||
version: "1.0.0"
|
||||
purity: pure
|
||||
signature: "def parse_rebel_output(decoded_text: str) -> list[dict]"
|
||||
description: "Pure parser for the REBEL / mREBEL wire format. Converts the string decoded by the tokenizer (with skip_special_tokens=False) into a list of typed triplets {head, head_type, type, tail, tail_type}. Never raises."
|
||||
tags: [rebel, mrebel, relation-extraction, nlp, parser, knowledge-graph, datascience, python]
|
||||
uses_functions: []
|
||||
uses_types: []
|
||||
returns: []
|
||||
returns_optional: false
|
||||
error_type: ""
|
||||
imports: []
|
||||
params:
  - name: decoded_text
    desc: "Raw string produced by tokenizer.decode(..., skip_special_tokens=False); includes special tokens such as <triplet>, <per>, <org>, <loc>, tp_XX, etc."
output: "List of dicts with keys head (str), head_type (str), type (str), tail (str), tail_type (str). Empty list if there are no complete triplets or the input is empty."
|
||||
tested: true
|
||||
tests:
|
||||
- "string vacio retorna lista vacia"
|
||||
- "un triplet completo retorna un dict con campos correctos"
|
||||
- "dos triplets retorna dos dicts"
|
||||
- "triplet incompleto sin cierre no rompe"
|
||||
- "tokens angulares desconocidos no lanzan excepcion"
|
||||
test_file_path: "python/functions/datascience/tests/test_parse_rebel_output.py"
|
||||
file_path: "python/functions/datascience/parse_rebel_output.py"
|
||||
notes: |
  Pure function. Adapts the official parser from the Babelscape/rebel README to the registry style.
  Compatible with mREBEL (tp_XX prefix, language tokens __es__, __en__) and REBEL (no language prefix).
  The wire format uses <triplet> to separate triplets and <type> tokens to close the
  head/tail spans. The parser states are: t=reading head, s=reading tail, o=reading relation.
|
||||
---
|
||||
|
||||
## Example
|
||||
|
||||
```python
|
||||
from python.functions.datascience.parse_rebel_output import parse_rebel_output
|
||||
|
||||
decoded = "tp_XX<triplet> Pablo Isla <per> Inditex <org> employer"
|
||||
triplets = parse_rebel_output(decoded)
|
||||
# [{'head': 'Pablo Isla', 'head_type': 'per', 'type': 'employer',
|
||||
# 'tail': 'Inditex', 'tail_type': 'org'}]
|
||||
```
|
||||
|
||||
## REBEL / mREBEL wire format

```
tp_XX<triplet> HEAD_TOKENS <HEAD_TYPE> TAIL_TOKENS <TAIL_TYPE> RELATION_TOKENS<triplet> ...
```

- `<triplet>`: marks the start of a new triplet (and closes the previous one).
- `<HEAD_TYPE>`: closes the head span and opens the tail span.
- `<TAIL_TYPE>`: closes the tail span and opens the relation span.
- The last triplet is closed by `</s>` (already removed before the split).
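
Because the parser tokenizes on whitespace, a special token glued to its neighbour (as in `RELATION_TOKENS<triplet>` above) would not be seen as a separate token. Whether a given tokenizer emits such glued tokens depends on its decoding settings, so treat the following as a hedged pre-processing sketch rather than part of the registry function. The helper name `pad_special_tokens` and the sample string are hypothetical.

```python
import re

# Matches angle-bracket tokens such as <triplet>, <per>, <org>, </s>.
_ANGLE_TOKEN = re.compile(r"<[^<>\s]+>")


def pad_special_tokens(decoded_text: str) -> str:
    """Surround every <...> token with spaces and collapse whitespace runs."""
    padded = _ANGLE_TOKEN.sub(lambda m: f" {m.group(0)} ", decoded_text)
    return " ".join(padded.split())


# Hypothetical decoded string with two triplets and glued special tokens.
raw = (
    "tp_XX<triplet> Pablo Isla <per> Inditex <org> employer"
    "<triplet> Inditex <org> Pablo Isla <per> chairperson</s>"
)

tokens_before = raw.split()
tokens_after = pad_special_tokens(raw).split()
```

After padding, every special token is its own whitespace-delimited token, so a split-based state machine can recognise each `<triplet>` boundary.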
|
||||
|
||||
## Notes

- Does not validate or filter `head_type`/`tail_type`; they are returned exactly as the model emits them.
- Compatible with any seq2seq variant that uses the same wire format (Babelscape/rebel,
  Babelscape/mrebel-large, Babelscape/mrebel-base).
- To use the output in the graph, pass it through `align_relations_to_entities`, which resolves
  head/tail to canonical names from the set of known entities.
|
||||
@@ -0,0 +1,105 @@
"""Pure parser for the REBEL / mREBEL wire format."""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
|
||||
def parse_rebel_output(decoded_text: str) -> list[dict]:
    """Parse REBEL / mREBEL decoded output into typed triplets.

    The input is the string produced by the HuggingFace tokenizer with
    ``skip_special_tokens=False``, e.g.::

        tp_XX<triplet> Pablo Isla <per> Inditex <org> employer<triplet> ...

    Args:
        decoded_text: Raw decoded string from the seq2seq model, including
            special tokens like ``<triplet>``, ``<relation>``, ``<per>``,
            ``<org>``, ``<loc>``, etc.

    Returns:
        List of dicts with keys:
            ``head`` (str), ``head_type`` (str),
            ``type`` (str), ``tail`` (str), ``tail_type`` (str).
        Returns an empty list on empty input or if no complete triplet is
        found. Never raises.
    """
    if not decoded_text or not decoded_text.strip():
        return []

    triplets: list[dict] = []

    # Strip language / padding tokens common to mREBEL.
    text = (
        decoded_text
        .replace("<s>", "")
        .replace("<pad>", "")
        .replace("</s>", "")
        .replace("tp_XX", "")
        .replace("__en__", "")
        .strip()
    )

    current = "x"  # x=init, t=head span, s=tail span, o=relation span
    subject = ""
    relation = ""
    object_ = ""
    object_type = ""
    subject_type = ""

    for token in text.split():
        if token in ("<triplet>", "<relation>"):
            current = "t"
            if relation:
                triplets.append(
                    {
                        "head": subject.strip(),
                        "head_type": subject_type,
                        "type": relation.strip(),
                        "tail": object_.strip(),
                        "tail_type": object_type,
                    }
                )
                relation = ""
            subject = ""
        elif token.startswith("<") and token.endswith(">"):
            if current in ("t", "o"):
                # Closing the head span — now reading tail.
                current = "s"
                if relation:
                    triplets.append(
                        {
                            "head": subject.strip(),
                            "head_type": subject_type,
                            "type": relation.strip(),
                            "tail": object_.strip(),
                            "tail_type": object_type,
                        }
                    )
                object_ = ""
                subject_type = token[1:-1]
            else:
                # Closing the tail span — now reading relation.
                current = "o"
                object_type = token[1:-1]
                relation = ""
        else:
            if current == "t":
                subject += " " + token
            elif current == "s":
                object_ += " " + token
            elif current == "o":
                relation += " " + token

    # Flush the last triplet if all fields are present.
    if subject and relation and object_ and object_type and subject_type:
        triplets.append(
            {
                "head": subject.strip(),
                "head_type": subject_type,
                "type": relation.strip(),
                "tail": object_.strip(),
                "tail_type": object_type,
            }
        )

    return triplets
|
||||
@@ -0,0 +1,64 @@
|
||||
---
|
||||
name: plot_heatmap_log
|
||||
kind: function
|
||||
lang: py
|
||||
domain: datascience
|
||||
version: "1.0.0"
|
||||
purity: impure
|
||||
signature: "def plot_heatmap_log(ax: Axes, xs: list[float] | np.ndarray, ys: list[float] | np.ndarray, extent: tuple[float, float, float, float], bins: int = 200, cmap: str = 'hot', alpha: float = 0.6) -> None"
|
||||
description: "Draws a 2D heatmap with log1p scaling on a matplotlib Axes. Uses np.histogram2d with the given extent and ax.imshow to render."
|
||||
tags: [visualization, heatmap, histogram, matplotlib, datascience, log]
|
||||
uses_functions: []
|
||||
uses_types: []
|
||||
returns: []
|
||||
returns_optional: false
|
||||
error_type: "error_go_core"
|
||||
imports: ["numpy", "matplotlib"]
|
||||
params:
  - name: ax
    desc: "matplotlib Axes on which the heatmap is drawn."
  - name: xs
    desc: "X coordinates of the points."
  - name: ys
    desc: "Y coordinates of the points."
  - name: extent
    desc: "Bounding box as (minx, maxx, miny, maxy) that defines the histogram range."
  - name: bins
    desc: "Number of histogram bins along each axis. Default 200."
  - name: cmap
    desc: "Matplotlib colormap name. Default 'hot'."
  - name: alpha
    desc: "Opacity of the overlay (0-1). Default 0.6."
output: "None. Modifies the Axes in place by adding the heatmap as an image via ax.imshow."
|
||||
tested: true
|
||||
tests:
|
||||
- "100 puntos no lanza excepción"
|
||||
- "ax tiene al menos una imagen tras la llamada"
|
||||
test_file_path: "python/functions/datascience/tests/test_plot_heatmap_log.py"
|
||||
file_path: "python/functions/datascience/plot_heatmap_log.py"
|
||||
source_repo: "internal:footprint_aurgi"
|
||||
source_license: "internal-aurgi"
|
||||
source_file: "zonas_mapas_aurgi/examples/generar_reporte_madrid.py:62"
|
||||
---
|
||||
|
||||
## Example
|
||||
|
||||
```python
|
||||
import matplotlib
|
||||
matplotlib.use("Agg")
|
||||
import matplotlib.pyplot as plt
|
||||
import numpy as np
|
||||
from datascience.plot_heatmap_log import plot_heatmap_log
|
||||
|
||||
rng = np.random.default_rng(42)
|
||||
xs = rng.uniform(-4.0, -3.5, 500)
|
||||
ys = rng.uniform(40.3, 40.6, 500)
|
||||
|
||||
fig, ax = plt.subplots()
|
||||
plot_heatmap_log(ax, xs, ys, extent=(-4.0, -3.5, 40.3, 40.6), bins=100)
|
||||
fig.savefig("heatmap.png")
|
||||
```
|
||||
|
||||
## Notes

Applies `np.log1p` to the histogram counts to compress the dynamic range and make both dense and sparse areas visible. The histogram is transposed (`counts.T`) before being passed to imshow so that the x/y axes line up correctly. `aspect="auto"` lets the image stretch to the aspect ratio of the Axes.
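
The effect of the transpose and the `log1p` compression can be seen on a tiny grid using only the standard library. This is an illustrative sketch with made-up counts, not the registry code; the variable names are hypothetical.

```python
import math

# Hypothetical 2x3 histogram: counts[i][j] has x-bin i and y-bin j,
# which is the layout np.histogram2d returns.
counts = [
    [0, 1, 9],
    [99, 999, 0],
]

# Transpose so rows follow y and columns follow x, as an image expects.
counts_t = [list(col) for col in zip(*counts)]

# log1p(c) = log(1 + c): maps 0 to 0 and compresses large counts.
log_counts = [[math.log1p(c) for c in row] for row in counts_t]

raw_ratio = 999 / 9
log_ratio = math.log1p(999) / math.log1p(9)
```

Here the raw counts 9 and 999 differ by a factor of 111, while their `log1p` values differ by a factor of 3, so sparse bins stay visible next to dense ones on the same colour scale.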
|
||||
@@ -0,0 +1,53 @@
|
||||
"""Plot a log-scale 2D histogram heatmap on a matplotlib Axes."""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
|
||||
def plot_heatmap_log(
    ax: "Axes",
    xs: "list[float] | np.ndarray",
    ys: "list[float] | np.ndarray",
    extent: "tuple[float, float, float, float]",
    bins: int = 200,
    cmap: str = "hot",
    alpha: float = 0.6,
) -> None:
    """Plot a log-scale 2D density heatmap using histogram binning.

    Computes a 2D histogram over the given points within ``extent``, applies
    log1p to compress the dynamic range, and renders the result as an image
    overlay on the Axes.

    Args:
        ax: matplotlib Axes to draw on.
        xs: X coordinates (longitude or projected x).
        ys: Y coordinates (latitude or projected y).
        extent: Bounding box as (minx, maxx, miny, maxy).
        bins: Number of histogram bins along each axis. Default 200.
        cmap: Matplotlib colormap name. Default "hot".
        alpha: Opacity of the heatmap overlay (0–1). Default 0.6.
    """
    import numpy as np  # type: ignore

    xs_arr = np.asarray(xs, dtype=float)
    ys_arr = np.asarray(ys, dtype=float)

    minx, maxx, miny, maxy = extent

    counts, _xedges, _yedges = np.histogram2d(
        xs_arr,
        ys_arr,
        bins=bins,
        range=[[minx, maxx], [miny, maxy]],
    )

    # histogram2d puts x on the first axis; imshow expects rows to follow y.
    log_counts = np.log1p(counts.T)

    ax.imshow(
        log_counts,
        extent=[minx, maxx, miny, maxy],
        origin="lower",
        cmap=cmap,
        alpha=alpha,
        aspect="auto",
    )
|
||||
@@ -0,0 +1,66 @@
|
||||
---
|
||||
name: plot_kde_2d
|
||||
kind: function
|
||||
lang: py
|
||||
domain: datascience
|
||||
version: "1.0.0"
|
||||
purity: impure
|
||||
signature: "def plot_kde_2d(ax: Axes, xs: list[float] | np.ndarray, ys: list[float] | np.ndarray, cmap: str = 'magma', alpha: float = 0.35, thresh: float = 0.02, levels: int = 30, bw_adjust: float = 0.6) -> None"
|
||||
description: "Draws a 2D KDE as filled contours on a matplotlib Axes using seaborn.kdeplot. Returns without drawing if the arrays are empty."
|
||||
tags: [visualization, kde, density, seaborn, matplotlib, datascience]
|
||||
uses_functions: []
|
||||
uses_types: []
|
||||
returns: []
|
||||
returns_optional: false
|
||||
error_type: "error_go_core"
|
||||
imports: ["numpy", "seaborn", "matplotlib"]
|
||||
params:
  - name: ax
    desc: "matplotlib Axes on which the density is drawn."
  - name: xs
    desc: "X coordinates of the points (longitude or projected x)."
  - name: ys
    desc: "Y coordinates of the points (latitude or projected y)."
  - name: cmap
    desc: "Matplotlib colormap name for the density fill. Default 'magma'."
  - name: alpha
    desc: "Opacity of the density overlay (0-1). Default 0.35."
  - name: thresh
    desc: "Density threshold below which no contours are drawn (0-1). Default 0.02."
  - name: levels
    desc: "Number of contour levels. Default 30."
  - name: bw_adjust
    desc: "Adjustment factor for the kernel bandwidth. Values < 1 produce more detailed estimates. Default 0.6."
output: "None. Modifies the Axes in place by adding the density contours."
|
||||
tested: true
|
||||
tests:
|
||||
- "50 puntos aleatorios no lanza excepción"
|
||||
- "arrays vacíos retorna sin error"
|
||||
test_file_path: "python/functions/datascience/tests/test_plot_kde_2d.py"
|
||||
file_path: "python/functions/datascience/plot_kde_2d.py"
|
||||
source_repo: "internal:footprint_aurgi"
|
||||
source_license: "internal-aurgi"
|
||||
source_file: "ponderacion_isochronas/src/recomendador_centros.py:275"
|
||||
---
|
||||
|
||||
## Ejemplo
|
||||
|
||||
```python
|
||||
import matplotlib
|
||||
matplotlib.use("Agg")
|
||||
import matplotlib.pyplot as plt
|
||||
import numpy as np
|
||||
from datascience.plot_kde_2d import plot_kde_2d
|
||||
|
||||
rng = np.random.default_rng(42)
|
||||
xs = rng.normal(0, 1, 200)
|
||||
ys = rng.normal(0, 1, 200)
|
||||
|
||||
fig, ax = plt.subplots()
|
||||
plot_kde_2d(ax, xs, ys, cmap="viridis", alpha=0.5)
|
||||
fig.savefig("kde.png")
|
||||
```
|
||||
|
||||
## Notas
|
||||
|
||||
Requiere seaborn y numpy. El parámetro `fill=True` se pasa a seaborn.kdeplot para renderizar contornos rellenos (disponible desde seaborn 0.11). Arrays vacíos se detectan con `np.asarray(xs).size == 0` antes de llamar a seaborn para evitar errores internos.
|
||||
@@ -0,0 +1,53 @@
"""Plot a 2D KDE density overlay on a matplotlib Axes using seaborn."""

from __future__ import annotations


def plot_kde_2d(
    ax: "Axes",
    xs: "list[float] | np.ndarray",
    ys: "list[float] | np.ndarray",
    cmap: str = "magma",
    alpha: float = 0.35,
    thresh: float = 0.02,
    levels: int = 30,
    bw_adjust: float = 0.6,
) -> None:
    """Plot a 2D kernel density estimate as a filled contour overlay.

    Uses seaborn.kdeplot to render a smooth density surface over the given
    scatter of (x, y) points. If either array is empty the function returns
    immediately without painting anything.

    Args:
        ax: matplotlib Axes to draw on.
        xs: X coordinates (longitude or projected x).
        ys: Y coordinates (latitude or projected y).
        cmap: Matplotlib colormap name for the density fill. Default "magma".
        alpha: Opacity of the density overlay (0–1). Default 0.35.
        thresh: Density threshold below which contours are not drawn (0–1).
            Default 0.02 removes very sparse outlier contours.
        levels: Number of contour levels. Default 30.
        bw_adjust: Bandwidth adjustment factor for the kernel. Values < 1
            produce tighter, more detailed estimates. Default 0.6.
    """
    import numpy as np  # type: ignore
    import seaborn as sns  # type: ignore

    xs_arr = np.asarray(xs)
    ys_arr = np.asarray(ys)

    if xs_arr.size == 0 or ys_arr.size == 0:
        return

    sns.kdeplot(
        x=xs_arr,
        y=ys_arr,
        ax=ax,
        cmap=cmap,
        fill=True,
        alpha=alpha,
        thresh=thresh,
        levels=levels,
        bw_adjust=bw_adjust,
    )
@@ -0,0 +1,65 @@
|
||||
---
name: rebel_load_model
kind: function
lang: py
domain: datascience
version: "1.0.0"
purity: impure
signature: "def rebel_load_model(model_name: str = 'Babelscape/rebel-large') -> tuple[Any, Any]"
description: "Loads (and caches) the REBEL tokenizer and model (BART-based, ~1.5 GB). English only. License Apache 2.0: commercial use permitted. Cached per model_name."
tags: [rebel, relation-extraction, nlp, model, huggingface, english, seq2seq, apache2, datascience, python]
uses_functions: []
uses_types: []
returns: []
returns_optional: false
error_type: "error_go_core"
imports: [transformers]
params:
  - name: model_name
    desc: "Model ID on the HuggingFace Hub (default: Babelscape/rebel-large, BART ~1.5 GB, EN only)"
output: "Tuple (tokenizer, model) ready for inference, cached per model_name."
tested: false
tests: []
test_file_path: ""
file_path: "python/functions/datascience/rebel_load_model.py"
notes: |
  LICENSE: Apache 2.0, commercial use permitted (unlike mREBEL, which is CC BY-NC-SA).
  Only works well with ENGLISH text. For Spanish use mrebel_load_model.

  REBEL uses the same wire format as mREBEL, so parse_rebel_output is compatible.
  Difference vs mREBEL: it does not emit the tp_XX language prefix in the output
  (parse_rebel_output handles this because it already does .replace('tp_XX', '')).

  impure: downloads over the network and writes to disk the first time, keeps state in _MODEL_CACHE.
  Cache is separate from mrebel_load_model (different module).
---

## Example

```python
from python.functions.datascience.rebel_load_model import rebel_load_model
from python.functions.datascience.parse_rebel_output import parse_rebel_output

tokenizer, model = rebel_load_model()

text = "Pablo Isla is the CEO of Inditex, based in Arteixo."
inputs = tokenizer(text, return_tensors="pt", max_length=512, truncation=True)
generated = model.generate(**inputs, num_beams=4, length_penalty=1.0, max_length=256)
decoded = tokenizer.decode(generated[0], skip_special_tokens=False)
triplets = parse_rebel_output(decoded)
```

## REBEL vs mREBEL comparison

| | REBEL | mREBEL |
|---|---|---|
| License | Apache 2.0 (commercial OK) | CC BY-NC-SA 4.0 (non-commercial) |
| Languages | English only | 30+ (es_XX, en_XX, fr_XX...) |
| Size | ~1.5 GB | ~2.4 GB (large) / ~900 MB (base) |
| Base | BART | mBART-50 |

## Size and latency

- `Babelscape/rebel-large`: ~1.5 GB on disk.
- First load: 20-60 s on CPU.
- CPU inference: 3-10 s per sentence (faster than mREBEL because it is BART rather than mBART).
@@ -0,0 +1,52 @@
"""Load (and cache) the REBEL model for English relation extraction."""

from __future__ import annotations

from typing import Any

# Global cache: model_name -> (tokenizer, model)
_MODEL_CACHE: dict[str, tuple[Any, Any]] = {}


def rebel_load_model(
    model_name: str = "Babelscape/rebel-large",
) -> tuple[Any, Any]:
    """Loads (and caches) the REBEL tokenizer and model. English only.

    REBEL is a BART-based seq2seq model (~1.5 GB) for relation extraction,
    trained on a dataset derived from English Wikipedia. It extracts triplets
    (head, relation, tail) from English text.

    LICENSE: Apache 2.0, commercial use permitted.

    The first call downloads the model from HuggingFace Hub (~1.5 GB).
    Subsequent calls with the same ``model_name`` return the cached instance.

    Args:
        model_name: HuggingFace Hub model ID. Default is the large variant.

    Returns:
        Tuple ``(tokenizer, model)`` both ready for inference.

    Raises:
        ImportError: if ``transformers`` is not installed.
        OSError: if the model cannot be downloaded or loaded from disk.
    """
    cached = _MODEL_CACHE.get(model_name)
    if cached is not None:
        return cached

    try:
        from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
    except ImportError as exc:
        raise ImportError(
            "transformers is not installed. Install it with "
            "`uv pip install transformers` or `uv pip install -e '.[nlp]'`."
        ) from exc

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
    model.eval()

    _MODEL_CACHE[model_name] = (tokenizer, model)
    return tokenizer, model
@@ -0,0 +1,52 @@
|
||||
---
name: remove_words_from_column
kind: function
lang: py
domain: datascience
version: "1.0.0"
purity: pure
signature: "def remove_words_from_column(values: Iterable[str | None], words: list[str]) -> list[str]"
description: "Removes specific words from an iterable of strings using a whole-word regex (\\b). Case-insensitive. Collapses repeated whitespace and strips. None becomes an empty string. No pandas."
tags: [text, cleaning, regex, words, nlp, datascience]
params:
  - name: values
    desc: Iterable of strings or None to clean.
  - name: words
    desc: List of words to remove. Case-insensitive whole-word matching (not partial).
output: "List of strings with the words removed and whitespace normalized. Same length as the input."
uses_functions: []
uses_types: []
returns: []
returns_optional: false
error_type: ""
imports: []
tested: true
tests:
  - "elimina palabras case insensitive"
  - "none devuelve string vacio"
  - "colapsa espacios multiples"
  - "palabras vacias no modifica"
  - "palabra completa no parcial"
  - "lista vacia"
test_file_path: "python/functions/datascience/tests/test_remove_words_from_column.py"
file_path: "python/functions/datascience/remove_words_from_column.py"
source_repo: "internal:footprint_aurgi"
source_license: "internal-aurgi"
source_file: "fuzzy_joins/arreglo_fuzzy.py"
---

## Example

```python
from remove_words_from_column import remove_words_from_column

result = remove_words_from_column(
    ["Calle Mayor 14", "Avenida del Sol"],
    words=["calle", "avenida", "del"]
)
# ["Mayor 14", "Sol"]
```

## Notes

The regex pattern is compiled once for the whole iterable. It uses `\b` so that partial words are left alone ("calle" does not touch "calleja"). None in the input produces "" in the output.
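
The whole-word behavior can be checked in isolation with the standard `re` module. This is a minimal sketch of the same pattern shape; `strip_words` is a hypothetical helper used only for illustration.

```python
import re

words = ["calle", "del"]
# Escape each word and wrap the alternation in \b so only whole words match.
pattern = re.compile(
    r"\b(" + "|".join(re.escape(w) for w in words) + r")\b",
    flags=re.IGNORECASE,
)


def strip_words(text: str) -> str:
    """Remove matched words, then collapse runs of whitespace and strip."""
    return re.sub(r"\s+", " ", pattern.sub("", text)).strip()


whole = strip_words("CALLE del Sol")      # both listed words are removed
partial = strip_words("Calleja del Pez")  # "Calleja" survives: no boundary after "calle"
```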
@@ -0,0 +1,42 @@
"""Remove specific words from a list of strings."""

from __future__ import annotations

import re
from typing import Iterable


def remove_words_from_column(
    values: Iterable[str | None],
    words: list[str],
) -> list[str]:
    """Remove words from a list of strings using a whole-word regex.

    For each string, applies a case-insensitive regex pattern \\b(w1|w2|...)\\b,
    replaces matches with the empty string, collapses repeated whitespace and strips.
    None becomes an empty string.

    Args:
        values: Iterable of strings (or None) to clean.
        words: List of words to remove (case-insensitive).

    Returns:
        List of strings with the words removed and whitespace normalized.
    """
    if not words:
        return [v if v is not None else "" for v in values]

    pattern = re.compile(
        r"\b(" + "|".join(re.escape(w) for w in words) + r")\b",
        flags=re.IGNORECASE,
    )

    result = []
    for value in values:
        if value is None:
            result.append("")
            continue
        cleaned = pattern.sub("", str(value))
        cleaned = re.sub(r"\s+", " ", cleaned).strip()
        result.append(cleaned)
    return result
@@ -0,0 +1,61 @@
|
||||
---
name: spacy_es_load_model
kind: function
lang: py
domain: datascience
version: "1.0.0"
purity: impure
signature: "def spacy_es_load_model(model_name: str = 'es_core_news_md') -> Any"
description: "Loads (and caches) a Spanish spaCy model. Provides POS, dependencies and NER (PER, ORG, LOC, MISC). Used by extract_triples_spacy_es for schema-less OpenIE. LICENSE: spaCy MIT + es_core_news_md CC BY-SA 4.0."
tags: [spacy, nlp, spanish, ner, dependency-parsing, openie, model, datascience, python, mit, cc-by-sa]
uses_functions: []
uses_types: []
returns: []
returns_optional: false
error_type: "error_go_core"
imports: [spacy]
params:
  - name: model_name
    desc: "Name of the installed spaCy model. Default: es_core_news_md (balance of accuracy and size). Alternatives: es_core_news_sm (smaller, less accurate), es_core_news_lg (larger, more accurate)."
output: "spaCy Language instance cached per model_name. Provides nlp(text) -> Doc with tokens, POS, deps and ents."
tested: true
tests:
  - "cache devuelve la misma instancia"
  - "OSError si el modelo no esta instalado"
test_file_path: "python/functions/datascience/tests/test_spacy_es_load_model.py"
file_path: "python/functions/datascience/spacy_es_load_model.py"
notes: |
  LICENSE: spaCy is MIT. The es_core_news_md model uses weights trained on
  the CoNLL-2002 corpus (CC BY-SA 4.0). Commercial use permitted with attribution.

  Install the model before use:
    python -m spacy download es_core_news_md

  impure: loads the model from disk the first time, keeps state in _MODEL_CACHE.
  Size: es_core_news_md ~43 MB. First load ~1-3 s on CPU.
---

## Example

```python
from datascience.spacy_es_load_model import spacy_es_load_model

nlp = spacy_es_load_model()

doc = nlp("Carlos Torres preside BBVA en Bilbao.")
for ent in doc.ents:
    print(ent.text, ent.label_)
# Carlos Torres PER
# BBVA ORG
# Bilbao LOC
```

## Installation

```bash
# In the registry venv:
python/.venv/bin/python3 -m spacy download es_core_news_md

# Or via uv:
cd python && uv run python -m spacy download es_core_news_md
```
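
When the model may be absent at runtime, the `OSError` that `spacy.load` raises for a missing model package can be caught and turned into a fallback. A minimal sketch, assuming a caller that prefers `None` over an exception; `load_spacy_or_none` is a hypothetical helper, not part of the registry.

```python
from typing import Any, Optional


def load_spacy_or_none(model_name: str) -> Optional[Any]:
    """Return a loaded spaCy pipeline, or None if spaCy or the model is missing."""
    try:
        import spacy  # imported lazily so this module works without spaCy installed
    except ImportError:
        return None
    try:
        return spacy.load(model_name)
    except OSError:
        # spaCy raises OSError when the named model package cannot be found.
        return None


# Deliberately nonexistent model name, so the fallback path is exercised.
nlp = load_spacy_or_none("model_that_is_not_installed_xyz")
```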
@@ -0,0 +1,40 @@
"""Load (and cache) a Spanish spaCy model for NER and OpenIE.

LICENSE: spaCy = MIT. Model es_core_news_md = CC BY-SA 4.0 (CoNLL-2002 data).
"""

from __future__ import annotations

from typing import Any

# Global cache: model_name -> spaCy nlp instance
_MODEL_CACHE: dict[str, Any] = {}


def spacy_es_load_model(model_name: str = "es_core_news_md") -> Any:
    """Load (and cache) a spaCy Spanish language model.

    The model provides dependency parsing, POS tagging and NER (PER, ORG, LOC, MISC).
    Used by extract_triples_spacy_es for schema-less OpenIE in Spanish.

    LICENSE: spaCy = MIT. es_core_news_md = CC BY-SA 4.0 (CoNLL-2002 corpus).

    Args:
        model_name: Name of the spaCy model. Default: es_core_news_md.
            Alternatives: es_core_news_sm (smaller), es_core_news_lg (larger).

    Returns:
        spaCy Language instance cached by model_name.

    Raises:
        OSError: If the model is not installed. Install with:
            python -m spacy download es_core_news_md
    """
    if model_name in _MODEL_CACHE:
        return _MODEL_CACHE[model_name]

    import spacy  # type: ignore[import]

    nlp = spacy.load(model_name)
    _MODEL_CACHE[model_name] = nlp
    return nlp
@@ -0,0 +1,38 @@
|
||||
---
id: summary_stats_py_datascience
name: summary_stats
kind: function
lang: py
domain: datascience
version: "1.0.0"
purity: pure
signature: "def summary_stats(values: list[float]) -> dict"
description: "Returns basic descriptive statistics (n, mean, median, p25, p75) for a list of floats. Empty input returns n=0 and nan for all numeric fields."
tags: [statistics, descriptive, eda, summary, percentile]
uses_functions: []
uses_types: []
returns: []
returns_optional: false
error_type: ""
imports: [math, numpy]
example: |
  from summary_stats import summary_stats
  result = summary_stats([1, 2, 3, 4, 5])
tested: true
tests:
  - "test_summary_stats_basic"
  - "test_summary_stats_empty"
  - "test_summary_stats_single"
  - "test_summary_stats_keys"
test_file_path: "python/functions/datascience/tests/test_summary_stats.py"
file_path: "python/functions/datascience/summary_stats.py"
params:
  - name: values
    desc: "List of numeric values to summarize."
output: "Dict with n (int), mean, median, p25, p75 (floats). All floats are math.nan when values is empty."
source_repo: "internal:footprint_aurgi"
source_license: "internal-aurgi"
source_file: "ponderacion_isochronas/example/models/eda/utils.py:60"
---

Minimal pure function for quick EDA. It deliberately omits std, min, max and other percentiles to keep the interface small.
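
For reference, the same five fields can be computed with the standard library alone. This is a sketch, not the registry function: `summary_stats_stdlib` is a hypothetical name, and it assumes that `statistics.quantiles(..., method="inclusive")` agrees with the linear interpolation NumPy uses by default for percentiles.

```python
import math
import statistics


def summary_stats_stdlib(values: list[float]) -> dict:
    """Stdlib-only variant returning n, mean, median, p25 and p75."""
    if not values:
        return {"n": 0, "mean": math.nan, "median": math.nan, "p25": math.nan, "p75": math.nan}
    data = [float(v) for v in values]
    if len(data) == 1:
        # statistics.quantiles needs at least two points on older Python versions.
        p25 = p75 = data[0]
    else:
        p25, _, p75 = statistics.quantiles(data, n=4, method="inclusive")
    return {
        "n": len(data),
        "mean": statistics.fmean(data),
        "median": statistics.median(data),
        "p25": p25,
        "p75": p75,
    }


stats = summary_stats_stdlib([1, 2, 3, 4, 5])
empty = summary_stats_stdlib([])
```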
@@ -0,0 +1,36 @@
"""summary_stats — Compute descriptive statistics for a numeric list."""

import math
import numpy as np


def summary_stats(values: list[float]) -> dict:
    """Return basic descriptive statistics for a list of floats.

    Args:
        values: List of numeric values.

    Returns:
        Dict with keys:
            "n" (int): number of elements.
            "mean" (float): arithmetic mean, or math.nan if empty.
            "median" (float): median, or math.nan if empty.
            "p25" (float): 25th percentile, or math.nan if empty.
            "p75" (float): 75th percentile, or math.nan if empty.
    """
    if not values:
        return {
            "n": 0,
            "mean": math.nan,
            "median": math.nan,
            "p25": math.nan,
            "p75": math.nan,
        }
    arr = np.array(values, dtype=float)
    return {
        "n": int(len(arr)),
        "mean": float(np.mean(arr)),
        "median": float(np.median(arr)),
        "p25": float(np.percentile(arr, 25)),
        "p75": float(np.percentile(arr, 75)),
    }
@@ -0,0 +1,103 @@
"""Tests for align_relations_to_entities."""

import os
import sys

sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "..", "..", ".."))

from python.functions.datascience.align_relations_to_entities import align_relations_to_entities


def _t(head, head_type, relation, tail, tail_type):
    return {
        "head": head,
        "head_type": head_type,
        "type": relation,
        "tail": tail,
        "tail_type": tail_type,
    }


def test_match_exacto_case_insensitive_resuelve_correctamente():
    triplets = [_t("pablo isla", "per", "employer", "inditex", "org")]
    entities = ["Pablo Isla", "Inditex"]
    result = align_relations_to_entities(triplets, entities)
    assert len(result) == 1
    assert result[0]["from"] == "Pablo Isla"
    assert result[0]["to"] == "Inditex"
    assert result[0]["kind"] == "employer"


def test_substring_entity_en_span_del_head():
    # mREBEL emits "esta en Bilbao" but the entity is "Bilbao"
    triplets = [_t("esta en Bilbao", "loc", "located in", "Espana", "loc")]
    entities = ["Bilbao", "Espana"]
    result = align_relations_to_entities(triplets, entities)
    assert len(result) == 1
    assert result[0]["from"] == "Bilbao"
    assert result[0]["to"] == "Espana"


def test_substring_span_dentro_del_nombre_de_entidad():
    # The span "Santander" is contained in the entity name "Banco Santander"
    triplets = [_t("Santander", "org", "owns", "Openbank", "org")]
    entities = ["Banco Santander", "Openbank"]
    result = align_relations_to_entities(triplets, entities)
    assert len(result) == 1
    assert result[0]["from"] == "Banco Santander"
    assert result[0]["to"] == "Openbank"


def test_gana_nombre_de_entidad_mas_largo_en_ambiguedad():
    # Two entities: "Madrid" and "Comunidad de Madrid". In the substring search,
    # which is bidirectional and takes the first hit in names_by_len (sorted
    # DESC by length), "Comunidad de Madrid" would win for the span "Madrid"
    # because it is longer and its lowercase form contains "madrid".
    triplets = [_t("Madrid", "loc", "capital of", "Espana", "loc")]
    entities = ["Madrid", "Comunidad de Madrid", "Espana"]
    result = align_relations_to_entities(triplets, entities)
    assert len(result) == 1
    # The exact case-insensitive match resolves "Madrid" -> "Madrid" directly
    # (before the substring search). We only check that nothing breaks and
    # that from/to are values taken from entities.
    assert result[0]["from"] in entities
    assert result[0]["to"] in entities


def test_triplet_sin_match_se_descarta():
    triplets = [_t("Unknown Entity", "per", "works for", "Another Unknown", "org")]
    entities = ["Pablo Isla", "Inditex"]
    result = align_relations_to_entities(triplets, entities)
    assert result == []


def test_triplet_con_head_igual_tail_se_descarta_self_loop():
    triplets = [_t("Inditex", "org", "owns", "Inditex", "org")]
    entities = ["Inditex", "Zara"]
    result = align_relations_to_entities(triplets, entities)
    assert result == []


def test_lista_triplets_vacia_retorna_vacia():
    result = align_relations_to_entities([], ["Pablo Isla", "Inditex"])
    assert result == []


def test_lista_entity_names_vacia_retorna_vacia():
    triplets = [_t("Pablo Isla", "per", "employer", "Inditex", "org")]
    result = align_relations_to_entities(triplets, [])
    assert result == []


def test_multiples_triplets_con_mezcla_de_matches_y_descartes():
    triplets = [
        _t("Pablo Isla", "per", "employer", "Inditex", "org"),  # match
        _t("Ghost Entity", "per", "employer", "Inditex", "org"),  # head has no match
        _t("Pablo Isla", "per", "employer", "Pablo Isla", "per"),  # self-loop
    ]
    entities = ["Pablo Isla", "Inditex"]
    result = align_relations_to_entities(triplets, entities)
    assert len(result) == 1
    assert result[0]["from"] == "Pablo Isla"
    assert result[0]["to"] == "Inditex"
@@ -0,0 +1,38 @@
"""Tests for alpha_shape_concave_hull."""

import sys
import os

sys.path.insert(0, os.path.join(os.path.dirname(__file__), ".."))
from alpha_shape_concave_hull import alpha_shape_concave_hull


def test_alpha_shape_square_large_alpha():
    """4 corner points with large alpha should return a geometry."""
    pts = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
    result = alpha_shape_concave_hull(pts, alpha=10.0)
    assert result is not None


def test_alpha_shape_too_few_points():
    result = alpha_shape_concave_hull([(0, 0), (1, 0), (0, 1)], alpha=10.0)
    assert result is None


def test_alpha_shape_very_small_alpha_returns_none():
    """Alpha so small that no triangle circumradius fits."""
    pts = [(0.0, 0.0), (100.0, 0.0), (100.0, 100.0), (0.0, 100.0)]
    result = alpha_shape_concave_hull(pts, alpha=0.0001)
    assert result is None


def test_alpha_shape_5_points_returns_geometry():
    pts = [
        (0.0, 0.0),
        (2.0, 0.0),
        (2.0, 2.0),
        (0.0, 2.0),
        (1.0, 1.0),
    ]
    result = alpha_shape_concave_hull(pts, alpha=5.0)
    assert result is not None
@@ -0,0 +1,47 @@
"""Tests for best_central_tendency."""

import math
import sys
import os

sys.path.insert(0, os.path.join(os.path.dirname(__file__), ".."))
from best_central_tendency import best_central_tendency


def test_best_central_tendency_normal_ish():
    label, value = best_central_tendency([1, 2, 3, 4, 5], "normal-ish")
    assert label == "mean"
    assert abs(value - 3.0) < 1e-9


def test_best_central_tendency_right_skewed():
    label, value = best_central_tendency([1, 2, 3, 4, 5], "right-skewed")
    assert label == "median"
    assert abs(value - 3.0) < 1e-9


def test_best_central_tendency_left_skewed():
    label, value = best_central_tendency([1, 2, 3, 4, 5], "left-skewed")
    assert label == "median"


def test_best_central_tendency_lognormal_ish():
    label, value = best_central_tendency([1, 2, 4, 8], "lognormal-ish")
    assert label == "geometric_mean"
    assert abs(value - 2 ** 1.5) < 1e-6


def test_best_central_tendency_heavy_tail():
    label, value = best_central_tendency([1, 2, 3, 4, 5, 100], "heavy-tail")
    assert label == "trimmed_mean_5%"
    assert not math.isnan(value)


def test_best_central_tendency_empty():
    label, value = best_central_tendency([], "normal-ish")
    assert math.isnan(value)


def test_best_central_tendency_default():
    label, value = best_central_tendency([1, 2, 3, 4, 5], "other")
    assert label == "median"
@@ -0,0 +1,45 @@
"""Tests for detect_distribution_type."""

import sys
import os

sys.path.insert(0, os.path.join(os.path.dirname(__file__), ".."))
from detect_distribution_type import detect_distribution_type

import numpy as np


def test_detect_too_few_samples():
    result = detect_distribution_type([1] * 5)
    assert result["type"] == "too_few_samples"


def test_detect_normal_ish():
    rng = np.random.default_rng(42)
    values = rng.normal(0, 1, 200).tolist()
    result = detect_distribution_type(values)
    assert result["type"] == "normal-ish", f"Got {result['type']}"


def test_detect_right_skewed():
    rng = np.random.default_rng(0)
    # Exponential distribution is heavily right-skewed
    values = rng.exponential(scale=1.0, size=200).tolist()
    result = detect_distribution_type(values)
    assert result["type"] in ("right-skewed", "lognormal-ish", "heavy-tail"), f"Got {result['type']}"


def test_detect_stats_keys():
    rng = np.random.default_rng(7)
    values = rng.normal(5, 2, 100).tolist()
    result = detect_distribution_type(values)
    assert "stats" in result
    assert "n" in result["stats"]
    assert result["stats"]["n"] == 100


def test_detect_exactly_30():
    rng = np.random.default_rng(1)
    values = rng.normal(0, 1, 30).tolist()
    result = detect_distribution_type(values)
    assert result["type"] != "too_few_samples"
@@ -0,0 +1,67 @@
"""Tests for extract_graph_gliner2.

Uses a GLiNER2 stub to validate the contract without downloading the real model.
"""

from __future__ import annotations

import os
import sys

import pytest

sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "..", "..", ".."))

from python.functions.datascience.extract_graph_gliner2 import extract_graph_gliner2


class _Schema:
    def entities(self, labels):
        self._entities = labels
        return self

    def relations(self, labels):
        self._relations = labels
        return self


class _StubModel:
    """Stub that returns known entities and relations."""

    _extract_result = {
        "entities": {"person": ["Pablo Isla"], "organization": ["Inditex"]},
        "relation_extraction": {"ceo_of": [("Pablo Isla", "Inditex")]},
    }

    def create_schema(self):
        return _Schema()

    def extract(self, text, schema=None, threshold=0.3, include_confidence=False):
        return self._extract_result


def test_output_tiene_claves_entities_relation_extraction_elapsed_s():
    """output tiene claves entities relation_extraction elapsed_s"""
    result = extract_graph_gliner2(
        text="Pablo Isla es CEO de Inditex.",
        entity_labels=["person", "organization"],
        relation_labels=["ceo_of"],
        model=_StubModel(),
    )
    assert "entities" in result
    assert "relation_extraction" in result
    assert "elapsed_s" in result
    assert isinstance(result["elapsed_s"], float)


def test_stub_model_retorna_shape_correcto():
    """stub model retorna shape correcto"""
    result = extract_graph_gliner2(
        text="Texto cualquiera.",
        entity_labels=["person"],
        relation_labels=["works_at"],
        model=_StubModel(),
        threshold=0.3,
    )
    assert result["entities"] == {"person": ["Pablo Isla"], "organization": ["Inditex"]}
    assert "ceo_of" in result["relation_extraction"]
@@ -0,0 +1,112 @@
"""Tests for extract_relations_mrebel using model stubs."""

import os
import sys

sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "..", "..", ".."))

from python.functions.datascience.extract_relations_mrebel import extract_relations_mrebel
from python.types.datascience.entity_candidate import EntityCandidate
from python.types.datascience.relation_candidate import RelationCandidate


# ---------------------------------------------------------------------------
# Stubs
# ---------------------------------------------------------------------------

class _TokenizerStub:
    """Tokenizer stub that returns trivial inputs and decodes to the canonical wire format."""

    def __init__(self, decoded_output: str = ""):
        self._decoded = decoded_output

    def __call__(self, text, return_tensors=None, max_length=512, truncation=True):
        return {"input_ids": [[1, 2, 3]]}

    def decode(self, token_ids, skip_special_tokens=True):
        return self._decoded


class _ModelStub:
    """Model stub that returns trivial tokens."""

    def generate(self, input_ids=None, num_beams=4, length_penalty=1.0, max_length=256, **kwargs):
        return [[10, 11, 12]]


# ---------------------------------------------------------------------------
# Tests
# ---------------------------------------------------------------------------

def test_flujo_completo_con_stub_produce_relation_candidates_correctos():
    # Canonical wire format with one valid triplet
    decoded = "<triplet> Pablo Isla <per> Inditex <org> employer"
    tok = _TokenizerStub(decoded_output=decoded)
    model = _ModelStub()

    entities = [
        EntityCandidate(name="Pablo Isla", type_label="PER", confidence=0.95),
        EntityCandidate(name="Inditex", type_label="ORG", confidence=0.92),
    ]
    text = "Pablo Isla es el presidente de Inditex."

    result = extract_relations_mrebel(text, entities, tok, model)

    assert len(result) == 1
    rc = result[0]
    assert isinstance(rc, RelationCandidate)
    assert rc.from_name == "Pablo Isla"
    assert rc.to_name == "Inditex"
    assert rc.relation_type == "employer"
    assert rc.confidence == 1.0


def test_menos_de_2_entidades_retorna_vacio():
    tok = _TokenizerStub()
    model = _ModelStub()
    entities = [EntityCandidate(name="Pablo Isla", type_label="PER")]
    result = extract_relations_mrebel("Texto cualquiera.", entities, tok, model)
    assert result == []


def test_texto_vacio_retorna_vacio():
    tok = _TokenizerStub()
    model = _ModelStub()
    entities = [
        EntityCandidate(name="A", type_label="PER"),
        EntityCandidate(name="B", type_label="ORG"),
    ]
    assert extract_relations_mrebel("", entities, tok, model) == []


def test_triplets_no_alineables_se_descartan():
    # The stub emits entities that are not in the list
    decoded = "<triplet> Ghost Entity <per> Unknown Org <org> some relation"
    tok = _TokenizerStub(decoded_output=decoded)
    model = _ModelStub()

    entities = [
        EntityCandidate(name="Pablo Isla", type_label="PER"),
        EntityCandidate(name="Inditex", type_label="ORG"),
    ]
    result = extract_relations_mrebel("Texto largo suficiente.", entities, tok, model)
    assert result == []


def test_multiples_frases_generan_multiples_candidates():
    # The stub always emits the same valid triplet, once per sentence
    decoded = "<triplet> Pablo Isla <per> Inditex <org> employer"
    tok = _TokenizerStub(decoded_output=decoded)
    model = _ModelStub()

    entities = [
        EntityCandidate(name="Pablo Isla", type_label="PER"),
        EntityCandidate(name="Inditex", type_label="ORG"),
    ]
    # Two sentences separated by ". "
    text = "Pablo Isla es el presidente de Inditex. Inditex tiene sedes en todo el mundo."

    result = extract_relations_mrebel(text, entities, tok, model)
    # There may be 1 or 2 depending on dedup; what matters is that it is not empty
    assert len(result) >= 1
    assert all(isinstance(rc, RelationCandidate) for rc in result)
@@ -0,0 +1,81 @@
|
||||
"""Tests para extract_triples_spacy_es.
|
||||
|
||||
Requiere spaCy y es_core_news_md instalados. Si no estan, los tests se omiten.
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import os
|
||||
import sys
|
||||
|
||||
import pytest
|
||||
|
||||
sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "..", "..", ".."))
|
||||
|
||||
from python.functions.datascience.extract_triples_spacy_es import extract_triples_spacy_es
|
||||
|
||||
spacy = pytest.importorskip("spacy", reason="spacy not installed — skip")
|
||||
|
||||
|
||||
def _load_nlp():
|
||||
try:
|
||||
return spacy.load("es_core_news_md")
|
||||
except OSError:
|
||||
return None
|
||||
|
||||
|
||||
_NLP = _load_nlp()
|
||||
pytestmark = pytest.mark.skipif(
|
||||
_NLP is None,
|
||||
reason="es_core_news_md not installed — run: python -m spacy download es_core_news_md",
|
||||
)
|
||||
|
||||
|
||||
def test_oracion_simple_produce_tripleta_con_sujeto_verbo_objeto():
    """oracion simple produce tripleta con sujeto verbo objeto"""
    result = extract_triples_spacy_es("Enmanuel quiere a Ashlly.", _NLP)
    assert len(result["triples"]) >= 1
    # At least one triple whose subject contains "Enmanuel" (case-insensitive)
    subjs = [t["subject"] for t in result["triples"]]
    assert any("enmanuel" in s.lower() for s in subjs)
|
||||
|
||||
|
||||
def test_carlos_torres_preside_bbva():
    """carlos torres preside bbva produce tripleta president"""
    result = extract_triples_spacy_es("Carlos Torres preside BBVA.", _NLP)
    triples = result["triples"]
    assert len(triples) >= 1
    # The relation is expected to carry the verb lemma "presidir"
    rels = [t["relation"] for t in triples]
    assert any("presidir" in r.lower() for r in rels)
|
||||
|
||||
|
||||
def test_amancio_ortega_fundo_inditex_en_1985():
    """amancio ortega fundo inditex en 1985 produce tripletas con fundar_en"""
    result = extract_triples_spacy_es(
        "Amancio Ortega fundo Inditex en 1985.", _NLP
    )
    triples = result["triples"]
    assert len(triples) >= 1
    # The verb may yield two triples (Inditex as object, 1985 as oblique),
    # but only one is required here.
    subjs = {t["subject"] for t in triples}
    assert any("Amancio" in s or "Ortega" in s for s in subjs)
    # At least one object must mention Inditex or 1985
    objects = {t["object"] for t in triples}
    assert any("Inditex" in o or "1985" in o for o in objects)
|
||||
|
||||
|
||||
def test_texto_sin_verbos_produce_tripletas_vacias():
|
||||
"""texto sin verbos produce tripletas vacias"""
|
||||
result = extract_triples_spacy_es("BBVA Santander Inditex.", _NLP)
|
||||
assert result["triples"] == []
|
||||
|
||||
|
||||
def test_entities_ner_detecta_categorias():
|
||||
"""entities NER detecta PER ORG LOC"""
|
||||
result = extract_triples_spacy_es(
|
||||
"Carlos Torres es presidente de BBVA en Bilbao.", _NLP
|
||||
)
|
||||
ents = result["entities"]
|
||||
labels = {e["label"] for e in ents}
|
||||
# Debe detectar al menos uno de PER, ORG o LOC
|
||||
assert labels & {"PER", "ORG", "LOC"}
|
||||
@@ -0,0 +1,67 @@
|
||||
"""Tests para fuzzy_merge_adaptive."""
|
||||
|
||||
import sys
|
||||
import os
|
||||
|
||||
sys.path.insert(0, os.path.join(os.path.dirname(__file__), ".."))
|
||||
|
||||
from fuzzy_merge_adaptive import fuzzy_merge_adaptive
|
||||
|
||||
|
||||
def test_left_join_con_typo():
|
||||
left = [{"name": "Madrid"}, {"name": "Barclona"}]
|
||||
right = [{"name": "Madrid", "cp": "28"}, {"name": "Barcelona", "cp": "08"}]
|
||||
result = fuzzy_merge_adaptive(left, right, left_key="name", right_key="name")
|
||||
assert len(result) == 2
|
||||
scores = [r["match_score"] for r in result]
|
||||
assert all(s >= 80 for s in scores), f"Scores bajos: {scores}"
|
||||
assert result[0]["cp"] == "28"
|
||||
assert result[1]["cp"] == "08"
|
||||
|
||||
|
||||
def test_inner_join_excluye_sin_match():
|
||||
left = [{"name": "Madrid"}, {"name": "ZZZinexistente"}]
|
||||
right = [{"name": "Madrid", "cp": "28"}]
|
||||
result = fuzzy_merge_adaptive(
|
||||
left, right, left_key="name", right_key="name",
|
||||
thresholds=[90, 80, 70], how="inner"
|
||||
)
|
||||
assert len(result) == 1
|
||||
assert result[0]["fuzzy_match"] == "Madrid"
|
||||
|
||||
|
||||
def test_left_join_sin_match_devuelve_none():
|
||||
left = [{"name": "ZZZinexistente"}]
|
||||
right = [{"name": "Madrid", "cp": "28"}]
|
||||
result = fuzzy_merge_adaptive(
|
||||
left, right, left_key="name", right_key="name",
|
||||
thresholds=[95], how="left"
|
||||
)
|
||||
assert len(result) == 1
|
||||
assert result[0]["fuzzy_match"] is None
|
||||
assert result[0]["match_score"] == 0
|
||||
assert result[0]["threshold_used"] is None
|
||||
|
||||
|
||||
def test_threshold_adaptativo():
|
||||
left = [{"name": "Bcn"}]
|
||||
right = [{"name": "Barcelona", "cp": "08"}]
|
||||
result = fuzzy_merge_adaptive(
|
||||
left, right, left_key="name", right_key="name",
|
||||
thresholds=[90, 80, 70, 60, 50]
|
||||
)
|
||||
assert len(result) == 1
|
||||
# Puede matchear o no segun score, pero threshold_used <= 90
|
||||
if result[0]["threshold_used"] is not None:
|
||||
assert result[0]["threshold_used"] <= 90
|
||||
|
||||
|
||||
def test_colision_de_claves_usa_sufijos():
|
||||
left = [{"name": "Madrid", "info": "left_info"}]
|
||||
right = [{"name": "Madrid", "info": "right_info"}]
|
||||
result = fuzzy_merge_adaptive(left, right, left_key="name", right_key="name")
|
||||
assert len(result) == 1
|
||||
assert "info_left" in result[0]
|
||||
assert "info_right" in result[0]
|
||||
assert result[0]["info_left"] == "left_info"
|
||||
assert result[0]["info_right"] == "right_info"
|
||||
@@ -0,0 +1,35 @@
|
||||
"""Tests para geometric_mean."""
|
||||
|
||||
import math
|
||||
import sys
|
||||
import os
|
||||
|
||||
sys.path.insert(0, os.path.join(os.path.dirname(__file__), ".."))
|
||||
from geometric_mean import geometric_mean
|
||||
|
||||
|
||||
def test_geometric_mean_powers_of_two():
|
||||
result = geometric_mean([1, 2, 4, 8])
|
||||
expected = 2 ** 1.5 # ~2.828
|
||||
assert abs(result - expected) < 1e-6, f"Expected ~{expected}, got {result}"
|
||||
|
||||
|
||||
def test_geometric_mean_filters_non_positive():
|
||||
result = geometric_mean([1, -2, 3])
|
||||
expected = math.exp((math.log(1) + math.log(3)) / 2)
|
||||
assert abs(result - expected) < 1e-6
|
||||
|
||||
|
||||
def test_geometric_mean_empty_returns_nan():
|
||||
result = geometric_mean([])
|
||||
assert math.isnan(result)
|
||||
|
||||
|
||||
def test_geometric_mean_all_negative_returns_nan():
|
||||
result = geometric_mean([-1, -2, -3])
|
||||
assert math.isnan(result)
|
||||
|
||||
|
||||
def test_geometric_mean_single_positive():
|
||||
result = geometric_mean([9.0])
|
||||
assert abs(result - 9.0) < 1e-9
|
||||
@@ -0,0 +1,84 @@
|
||||
"""Tests para gliner2_load_model.
|
||||
|
||||
El modelo real (gliner2) es opcional. Los tests usan un stub para validar
|
||||
el cache sin descargar el modelo. Tests que requieran el modelo real se
|
||||
marcan con pytest.importorskip('gliner2').
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import os
|
||||
import sys
|
||||
|
||||
import pytest
|
||||
|
||||
sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "..", "..", ".."))
|
||||
|
||||
from python.functions.datascience.gliner2_load_model import (
|
||||
_MODEL_CACHE,
|
||||
_resolve_device,
|
||||
gliner2_load_model,
|
||||
)
|
||||
|
||||
|
||||
class _StubGLiNER2:
|
||||
"""Stub duck-typed para validar el cache sin descargar el modelo real."""
|
||||
|
||||
@classmethod
|
||||
def from_pretrained(cls, model_name: str) -> "_StubGLiNER2":
|
||||
return cls()
|
||||
|
||||
def create_schema(self):
|
||||
return self
|
||||
|
||||
def entities(self, labels):
|
||||
return self
|
||||
|
||||
def relations(self, labels):
|
||||
return self
|
||||
|
||||
def extract(self, text, **kwargs):
|
||||
return {"entities": {}, "relation_extraction": {}}
|
||||
|
||||
|
||||
def test_cache_devuelve_la_misma_instancia(monkeypatch):
    """cache devuelve la misma instancia con los mismos parametros"""
    monkeypatch.setattr(
        "python.functions.datascience.gliner2_load_model.GLiNER2",
        _StubGLiNER2,
        raising=False,
    )

    # Seed the cache directly with a stub to simulate a previous load
    _MODEL_CACHE.clear()
    key = ("fastino/gliner2-large-v1", "cpu")
    stub = _StubGLiNER2()
    _MODEL_CACHE[key] = stub

    # A call with the same parameters must return the cached object
    result = gliner2_load_model(model_name="fastino/gliner2-large-v1", device="cpu")
    assert result is stub
    _MODEL_CACHE.clear()
|
||||
|
||||
|
||||
def test_device_auto_resuelve_a_cpu_si_torch_no_esta(monkeypatch):
    """device=auto resuelve a cpu si torch no esta instalado"""
    # Simulate torch being unavailable: a None entry in sys.modules
    # makes "import torch" raise ImportError
    monkeypatch.setitem(sys.modules, "torch", None)
    resolved = _resolve_device("auto")
    assert resolved == "cpu"
|
||||
|
||||
|
||||
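def test_import_error_si_gliner2_no_esta_instalado():
    """ImportError si gliner2 no esta instalado"""
    # Placeholder: this only skips when gliner2 is missing; it does not yet
    # assert that gliner2_load_model raises ImportError.
    pytest.importorskip("gliner2", reason="gliner2 not installed, skip real model test")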
|
||||
@@ -0,0 +1,46 @@
|
||||
"""Tests para kde_density_levels."""
|
||||
|
||||
import sys
|
||||
import os
|
||||
import numpy as np
|
||||
|
||||
sys.path.insert(0, os.path.join(os.path.dirname(__file__), ".."))
|
||||
from kde_density_levels import kde_density_levels
|
||||
|
||||
|
||||
def test_kde_density_levels_returns_dict_for_50_points():
|
||||
rng = np.random.default_rng(42)
|
||||
xs = rng.normal(0, 1, 50).tolist()
|
||||
ys = rng.normal(0, 1, 50).tolist()
|
||||
result = kde_density_levels(xs, ys)
|
||||
assert result is not None
|
||||
assert "method" in result
|
||||
assert result["method"] in ("kde", "hist")
|
||||
assert "densities" in result
|
||||
assert len(result["densities"]) == 50
|
||||
assert "abs_level" in result
|
||||
assert "dense_level" in result
|
||||
|
||||
|
||||
def test_kde_density_levels_none_for_few_points():
|
||||
result = kde_density_levels([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
|
||||
assert result is None
|
||||
|
||||
|
||||
def test_kde_density_levels_none_for_4_points():
|
||||
result = kde_density_levels([1, 2, 3, 4], [1, 2, 3, 4])
|
||||
assert result is None
|
||||
|
||||
|
||||
def test_kde_density_levels_levels_ordered():
|
||||
rng = np.random.default_rng(0)
|
||||
xs = rng.uniform(0, 10, 100).tolist()
|
||||
ys = rng.uniform(0, 10, 100).tolist()
|
||||
result = kde_density_levels(xs, ys, abs_quantile=0.1, dense_quantile=0.85)
|
||||
assert result is not None
|
||||
assert result["abs_level"] <= result["dense_level"]
|
||||
|
||||
|
||||
def test_kde_density_levels_mismatched_lengths():
|
||||
result = kde_density_levels([1, 2, 3, 4, 5], [1, 2, 3])
|
||||
assert result is None
|
||||
@@ -0,0 +1,75 @@
|
||||
"""Tests para parse_rebel_output."""
|
||||
|
||||
import os
|
||||
import sys
|
||||
|
||||
sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "..", "..", ".."))
|
||||
|
||||
from python.functions.datascience.parse_rebel_output import parse_rebel_output
|
||||
|
||||
|
||||
def test_string_vacio_retorna_lista_vacia():
|
||||
assert parse_rebel_output("") == []
|
||||
|
||||
|
||||
def test_string_solo_espacios_retorna_lista_vacia():
|
||||
assert parse_rebel_output(" ") == []
|
||||
|
||||
|
||||
def test_un_triplet_completo_retorna_un_dict_con_campos_correctos():
|
||||
decoded = "tp_XX<triplet> Pablo Isla <per> Inditex <org> employer"
|
||||
result = parse_rebel_output(decoded)
|
||||
assert len(result) == 1
|
||||
t = result[0]
|
||||
assert t["head"] == "Pablo Isla"
|
||||
assert t["head_type"] == "per"
|
||||
assert t["tail"] == "Inditex"
|
||||
assert t["tail_type"] == "org"
|
||||
assert t["type"] == "employer"
|
||||
|
||||
|
||||
def test_dos_triplets_retorna_dos_dicts():
|
||||
decoded = (
|
||||
"tp_XX<triplet> Pablo Isla <per> Inditex <org> employer "
|
||||
"<triplet> Arteixo <loc> A Coruna <loc> located in the administrative territorial entity"
|
||||
)
|
||||
result = parse_rebel_output(decoded)
|
||||
assert len(result) == 2
|
||||
assert result[0]["head"] == "Pablo Isla"
|
||||
assert result[0]["tail"] == "Inditex"
|
||||
assert result[1]["head"] == "Arteixo"
|
||||
assert result[1]["tail"] == "A Coruna"
|
||||
assert "located" in result[1]["type"]
|
||||
|
||||
|
||||
def test_triplet_incompleto_sin_cierre_no_rompe():
|
||||
# Solo head span, sin tail ni relacion
|
||||
decoded = "tp_XX<triplet> Pablo Isla"
|
||||
result = parse_rebel_output(decoded)
|
||||
# No hay cierre, puede retornar lista vacia o incompleta pero no rompe
|
||||
assert isinstance(result, list)
|
||||
|
||||
|
||||
def test_tokens_angulares_desconocidos_no_lanzan_excepcion():
|
||||
# Un tipo desconocido como <unknown_type> no debe romper el parser
|
||||
decoded = "<triplet> Entity One <unknown_type> Entity Two <org> some relation"
|
||||
result = parse_rebel_output(decoded)
|
||||
assert isinstance(result, list)
|
||||
|
||||
|
||||
def test_sin_prefijo_tp_xx_funciona():
|
||||
# REBEL monolingue no emite tp_XX
|
||||
decoded = "<triplet> Barack Obama <per> United States <org> president of"
|
||||
result = parse_rebel_output(decoded)
|
||||
assert len(result) == 1
|
||||
assert result[0]["head"] == "Barack Obama"
|
||||
assert result[0]["tail"] == "United States"
|
||||
assert result[0]["type"] == "president of"
|
||||
|
||||
|
||||
def test_strip_tags_s_pad():
|
||||
decoded = "<s><pad>tp_XX<triplet> Ana <per> BBVA <org> works at</s>"
|
||||
result = parse_rebel_output(decoded)
|
||||
assert len(result) == 1
|
||||
assert result[0]["head"] == "Ana"
|
||||
assert result[0]["tail"] == "BBVA"
|
||||
@@ -0,0 +1,38 @@
|
||||
"""Tests para plot_heatmap_log."""
|
||||
|
||||
import sys
|
||||
from pathlib import Path
|
||||
|
||||
import matplotlib
|
||||
matplotlib.use("Agg")
|
||||
|
||||
sys.path.insert(0, str(Path(__file__).parent.parent.parent))
|
||||
|
||||
from datascience.plot_heatmap_log import plot_heatmap_log
|
||||
|
||||
|
||||
def test_100_puntos_no_lanza_excepcion():
|
||||
import matplotlib.pyplot as plt
|
||||
import numpy as np
|
||||
|
||||
rng = np.random.default_rng(0)
|
||||
xs = rng.uniform(-4.0, -3.5, 100)
|
||||
ys = rng.uniform(40.3, 40.6, 100)
|
||||
|
||||
fig, ax = plt.subplots()
|
||||
plot_heatmap_log(ax, xs, ys, extent=(-4.0, -3.5, 40.3, 40.6), bins=50)
|
||||
plt.close(fig)
|
||||
|
||||
|
||||
def test_ax_tiene_imagen_tras_la_llamada():
|
||||
import matplotlib.pyplot as plt
|
||||
import numpy as np
|
||||
|
||||
rng = np.random.default_rng(1)
|
||||
xs = rng.uniform(-4.0, -3.5, 100)
|
||||
ys = rng.uniform(40.3, 40.6, 100)
|
||||
|
||||
fig, ax = plt.subplots()
|
||||
plot_heatmap_log(ax, xs, ys, extent=(-4.0, -3.5, 40.3, 40.6), bins=50)
|
||||
assert len(ax.images) > 0, "ax should have at least one image after heatmap"
|
||||
plt.close(fig)
|
||||
@@ -0,0 +1,32 @@
|
||||
"""Tests para plot_kde_2d."""
|
||||
|
||||
import sys
|
||||
from pathlib import Path
|
||||
|
||||
import matplotlib
|
||||
matplotlib.use("Agg")
|
||||
|
||||
sys.path.insert(0, str(Path(__file__).parent.parent.parent))
|
||||
|
||||
from datascience.plot_kde_2d import plot_kde_2d
|
||||
|
||||
|
||||
def test_50_puntos_aleatorios_no_lanza_excepcion():
|
||||
import matplotlib.pyplot as plt
|
||||
import numpy as np
|
||||
|
||||
rng = np.random.default_rng(42)
|
||||
xs = rng.normal(0, 1, 50)
|
||||
ys = rng.normal(0, 1, 50)
|
||||
|
||||
fig, ax = plt.subplots()
|
||||
plot_kde_2d(ax, xs, ys)
|
||||
plt.close(fig)
|
||||
|
||||
|
||||
def test_arrays_vacios_retorna_sin_error():
|
||||
import matplotlib.pyplot as plt
|
||||
|
||||
fig, ax = plt.subplots()
|
||||
plot_kde_2d(ax, [], [])
|
||||
plt.close(fig)
|
||||
@@ -0,0 +1,42 @@
|
||||
"""Tests para remove_words_from_column."""
|
||||
|
||||
import sys
|
||||
import os
|
||||
|
||||
sys.path.insert(0, os.path.join(os.path.dirname(__file__), ".."))
|
||||
|
||||
from remove_words_from_column import remove_words_from_column
|
||||
|
||||
|
||||
def test_elimina_palabras_case_insensitive():
|
||||
values = ["Calle Mayor 14", "Avenida del Sol"]
|
||||
result = remove_words_from_column(values, words=["calle", "avenida", "del"])
|
||||
assert result == ["Mayor 14", "Sol"]
|
||||
|
||||
|
||||
def test_none_devuelve_string_vacio():
|
||||
result = remove_words_from_column([None, "hola mundo"], words=["hola"])
|
||||
assert result[0] == ""
|
||||
assert result[1] == "mundo"
|
||||
|
||||
|
||||
def test_colapsa_espacios_multiples():
|
||||
result = remove_words_from_column(["uno dos tres"], words=["dos"])
|
||||
assert result[0] == "uno tres"
|
||||
|
||||
|
||||
def test_palabras_vacias_no_modifica():
|
||||
values = ["hola mundo", "foo bar"]
|
||||
result = remove_words_from_column(values, words=[])
|
||||
assert result == ["hola mundo", "foo bar"]
|
||||
|
||||
|
||||
def test_palabra_completa_no_parcial():
|
||||
# "calle" no debe eliminar "calleja"
|
||||
result = remove_words_from_column(["calleja mayor"], words=["calle"])
|
||||
assert result[0] == "calleja mayor"
|
||||
|
||||
|
||||
def test_lista_vacia():
|
||||
result = remove_words_from_column([], words=["foo"])
|
||||
assert result == []
|
||||
@@ -0,0 +1,46 @@
|
||||
"""Tests para spacy_es_load_model."""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import os
|
||||
import sys
|
||||
|
||||
import pytest
|
||||
|
||||
sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "..", "..", ".."))
|
||||
|
||||
from python.functions.datascience.spacy_es_load_model import (
|
||||
_MODEL_CACHE,
|
||||
spacy_es_load_model,
|
||||
)
|
||||
|
||||
spacy = pytest.importorskip("spacy", reason="spacy not installed — skip")
|
||||
|
||||
|
||||
def _has_model(model_name: str) -> bool:
|
||||
try:
|
||||
spacy.load(model_name)
|
||||
return True
|
||||
except OSError:
|
||||
return False
|
||||
|
||||
|
||||
@pytest.mark.skipif(
|
||||
not _has_model("es_core_news_md"),
|
||||
reason="es_core_news_md not installed",
|
||||
)
|
||||
def test_cache_devuelve_la_misma_instancia():
|
||||
"""cache devuelve la misma instancia"""
|
||||
_MODEL_CACHE.clear()
|
||||
m1 = spacy_es_load_model("es_core_news_md")
|
||||
m2 = spacy_es_load_model("es_core_news_md")
|
||||
assert m1 is m2
|
||||
_MODEL_CACHE.clear()
|
||||
|
||||
|
||||
def test_oserror_si_el_modelo_no_esta_instalado():
|
||||
"""OSError si el modelo no esta instalado"""
|
||||
_MODEL_CACHE.clear()
|
||||
with pytest.raises(OSError):
|
||||
spacy_es_load_model("es_nonexistent_model_xyz")
|
||||
_MODEL_CACHE.clear()
|
||||
@@ -0,0 +1,38 @@
|
||||
"""Tests para summary_stats."""
|
||||
|
||||
import math
|
||||
import sys
|
||||
import os
|
||||
|
||||
sys.path.insert(0, os.path.join(os.path.dirname(__file__), ".."))
|
||||
from summary_stats import summary_stats
|
||||
|
||||
|
||||
def test_summary_stats_basic():
|
||||
result = summary_stats([1, 2, 3, 4, 5])
|
||||
assert result["n"] == 5
|
||||
assert abs(result["mean"] - 3.0) < 1e-9
|
||||
assert abs(result["median"] - 3.0) < 1e-9
|
||||
assert abs(result["p25"] - 2.0) < 0.01
|
||||
assert abs(result["p75"] - 4.0) < 0.01
|
||||
|
||||
|
||||
def test_summary_stats_empty():
|
||||
result = summary_stats([])
|
||||
assert result["n"] == 0
|
||||
assert math.isnan(result["mean"])
|
||||
assert math.isnan(result["median"])
|
||||
assert math.isnan(result["p25"])
|
||||
assert math.isnan(result["p75"])
|
||||
|
||||
|
||||
def test_summary_stats_single():
|
||||
result = summary_stats([7.0])
|
||||
assert result["n"] == 1
|
||||
assert abs(result["mean"] - 7.0) < 1e-9
|
||||
assert abs(result["median"] - 7.0) < 1e-9
|
||||
|
||||
|
||||
def test_summary_stats_keys():
|
||||
result = summary_stats([1, 2, 3])
|
||||
assert set(result.keys()) == {"n", "mean", "median", "p25", "p75"}
|
||||
@@ -0,0 +1,62 @@
|
||||
"""Tests para translate_es_to_en — smoke tests con modelo stub."""
|
||||
|
||||
import os
|
||||
import sys
|
||||
|
||||
sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "..", "..", ".."))
|
||||
|
||||
from python.functions.datascience.translate_es_to_en import translate_es_to_en
|
||||
|
||||
|
||||
class _StubTokenizer:
|
||||
"""Tokenizer stub que devuelve inputs triviales."""
|
||||
|
||||
def __call__(self, text, return_tensors=None, max_length=512, truncation=True):
|
||||
# Devuelve un dict con una clave 'input_ids' que el modelo stub acepta.
|
||||
return {"input_ids": [[1, 2, 3]], "_text": text}
|
||||
|
||||
def decode(self, token_ids, skip_special_tokens=True):
|
||||
# Devuelve siempre "translated" para testing.
|
||||
return "translated"
|
||||
|
||||
|
||||
class _StubModel:
|
||||
"""Modelo stub que devuelve tokens triviales."""
|
||||
|
||||
def generate(self, input_ids=None, num_beams=4, max_length=512, **kwargs):
|
||||
return [[10, 11, 12]]
|
||||
|
||||
|
||||
def test_texto_vacio_retorna_string_vacio():
|
||||
tok = _StubTokenizer()
|
||||
model = _StubModel()
|
||||
assert translate_es_to_en("", tok, model) == ""
|
||||
|
||||
|
||||
def test_solo_espacios_retorna_string_vacio():
|
||||
tok = _StubTokenizer()
|
||||
model = _StubModel()
|
||||
assert translate_es_to_en(" ", tok, model) == ""
|
||||
|
||||
|
||||
def test_una_frase_en_espanol_produce_output_no_vacio():
|
||||
tok = _StubTokenizer()
|
||||
model = _StubModel()
|
||||
result = translate_es_to_en("Pablo Isla es presidente de Inditex.", tok, model)
|
||||
assert isinstance(result, str)
|
||||
assert len(result) > 0
|
||||
|
||||
|
||||
def test_multiples_frases_se_unen_con_espacio():
    tok = _StubTokenizer()
    model = _StubModel()
    result = translate_es_to_en(
        "Primera frase. Segunda frase. Tercera frase.",
        tok,
        model,
    )
    # The stub returns "translated" for every sentence, so three sentences
    # must come back as three space-separated parts
    parts = result.split(" ")
    assert all(p == "translated" for p in parts)
    assert len(parts) == 3
|
||||
@@ -0,0 +1,33 @@
|
||||
"""Tests para trimmed_mean."""
|
||||
|
||||
import math
|
||||
import sys
|
||||
import os
|
||||
|
||||
sys.path.insert(0, os.path.join(os.path.dirname(__file__), ".."))
|
||||
from trimmed_mean import trimmed_mean
|
||||
|
||||
|
||||
def test_trimmed_mean_basic():
|
||||
result = trimmed_mean([1, 2, 3, 4, 5, 100], 0.1)
|
||||
assert abs(result - 3.5) < 0.5, f"Expected ~3.5, got {result}"
|
||||
|
||||
|
||||
def test_trimmed_mean_empty_returns_nan():
|
||||
result = trimmed_mean([], 0.05)
|
||||
assert math.isnan(result)
|
||||
|
||||
|
||||
def test_trimmed_mean_no_trim():
|
||||
result = trimmed_mean([1.0, 2.0, 3.0, 4.0, 5.0], 0.0)
|
||||
assert abs(result - 3.0) < 1e-9
|
||||
|
||||
|
||||
def test_trimmed_mean_single_element():
|
||||
result = trimmed_mean([42.0], 0.05)
|
||||
assert abs(result - 42.0) < 1e-9
|
||||
|
||||
|
||||
def test_trimmed_mean_uniform():
|
||||
result = trimmed_mean([5.0, 5.0, 5.0, 5.0, 5.0], 0.1)
|
||||
assert abs(result - 5.0) < 1e-9
|
||||
@@ -0,0 +1,49 @@
|
||||
"""Tests para words_to_dataset."""
|
||||
|
||||
import sys
|
||||
import os
|
||||
|
||||
sys.path.insert(0, os.path.join(os.path.dirname(__file__), ".."))
|
||||
|
||||
from words_to_dataset import words_to_dataset
|
||||
|
||||
|
||||
def test_cuenta_palabras_repetidas():
|
||||
texts = ["calle mayor", "calle del sol", "avenida principal"]
|
||||
result = words_to_dataset(texts)
|
||||
palabras = {r["palabra"]: r["ocurrencias"] for r in result}
|
||||
assert palabras["CALLE"] == 2
|
||||
|
||||
|
||||
def test_eliminar_stopwords_filtra_del():
|
||||
texts = ["calle mayor", "calle del sol", "avenida principal"]
|
||||
result = words_to_dataset(texts, eliminar_stopwords=True)
|
||||
palabras = {r["palabra"] for r in result}
|
||||
assert "DEL" not in palabras
|
||||
|
||||
|
||||
def test_min_ocurrencias_filtra():
|
||||
texts = ["calle mayor", "calle del sol", "avenida principal"]
|
||||
result = words_to_dataset(texts, min_ocurrencias=2)
|
||||
palabras = {r["palabra"]: r["ocurrencias"] for r in result}
|
||||
assert "CALLE" in palabras
|
||||
assert "MAYOR" not in palabras
|
||||
|
||||
|
||||
def test_none_ignorados():
|
||||
texts = ["hola mundo", None, "hola"]
|
||||
result = words_to_dataset(texts)
|
||||
palabras = {r["palabra"]: r["ocurrencias"] for r in result}
|
||||
assert palabras["HOLA"] == 2
|
||||
|
||||
|
||||
def test_lista_vacia():
|
||||
result = words_to_dataset([])
|
||||
assert result == []
|
||||
|
||||
|
||||
def test_orden_descendente():
|
||||
texts = ["a a a", "b b", "c"]
|
||||
result = words_to_dataset(texts)
|
||||
counts = [r["ocurrencias"] for r in result]
|
||||
assert counts == sorted(counts, reverse=True)
|
||||
@@ -0,0 +1,85 @@
|
||||
---
|
||||
name: translate_es_to_en
|
||||
kind: function
|
||||
lang: py
|
||||
domain: datascience
|
||||
version: "1.0.0"
|
||||
purity: impure
|
||||
signature: "def translate_es_to_en(text: str, tokenizer: Any, model: Any, max_length: int = 512, num_beams: int = 4) -> str"
|
||||
description: "Translates Spanish text to English sentence by sentence using MarianMT. Splits on sentence boundaries, translates each sentence independently and joins the results with a space. Preserves proper nouns better than passing the whole paragraph at once."
|
||||
tags: [marianmt, translation, es-en, nlp, datascience, python]
|
||||
uses_functions: [marianmt_es_en_load_model_py_datascience]
|
||||
uses_types: []
|
||||
returns: []
|
||||
returns_optional: false
|
||||
error_type: "error_go_core"
|
||||
imports: [re]
|
||||
params:
  - name: text
    desc: "Spanish text to translate; can be a single sentence or a multi-sentence paragraph"
  - name: tokenizer
    desc: "MarianMT tokenizer loaded with marianmt_es_en_load_model"
  - name: model
    desc: "MarianMT model loaded with marianmt_es_en_load_model"
  - name: max_length
    desc: "maximum length in tokens per sentence for tokenization and generation (default 512)"
  - name: num_beams
    desc: "number of beams for beam search; higher means better quality but slower (default 4)"
output: "text translated to English. Sentences joined with a single space. Empty string if the input is empty."
|
||||
tested: true
|
||||
tests:
|
||||
- "texto vacio retorna string vacio"
|
||||
- "una frase en espanol produce output no vacio"
|
||||
test_file_path: "python/functions/datascience/tests/test_translate_es_to_en.py"
|
||||
file_path: "python/functions/datascience/translate_es_to_en.py"
|
||||
notes: |
  impure: calls model.generate, which depends on model state (weights, device).

  Sentence splitting uses a regex lookbehind on [.!?] followed by whitespace.
  Unlike NLTK sent_tokenize, it applies no abbreviation rules: it simply splits
  wherever whitespace follows terminal punctuation. Dots with no following space
  (as inside S.A. or U.S.A.) do not split, but an abbreviation followed by a
  space still does.

  Useful as a preprocessor for rebel_load_model (English-only, Apache 2.0):
    ES text -> translate_es_to_en -> EN text -> REBEL -> triplets
  Direct alternative: mrebel_load_model (multilingual, CC BY-NC-SA).
|
||||
---
|
||||
|
||||
## Example

```python
from python.functions.datascience.marianmt_es_en_load_model import marianmt_es_en_load_model
from python.functions.datascience.translate_es_to_en import translate_es_to_en

tokenizer, model = marianmt_es_en_load_model()

text = "Pablo Isla es presidente de Inditex. La empresa tiene sede en Arteixo."
translated = translate_es_to_en(text, tokenizer, model)
# e.g. "Pablo Isla is president of Inditex. The company is headquartered in Arteixo."
```
|
||||
|
||||
## Why sentence by sentence

Passing the whole paragraph to MarianMT can degrade the translation of proper nouns,
because the model spreads its attention over the full context. Splitting into sentences gives:

1. Shorter context, so less confusion around proper nouns.
2. Truncation is less likely (512 tokens is enough for ordinary sentences).
3. A more predictable pipeline for debugging (each sentence can be inspected on its own).
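
The splitting behaviour behind the points above can be checked in isolation. The sketch below reuses the pattern of `_SENTENCE_RE` from `translate_es_to_en.py`; the helper name `split_sentences` is hypothetical and exists only for illustration. It also shows one caveat: an abbreviation followed by a space is treated as a sentence boundary.

```python
from __future__ import annotations

import re

# Same pattern as _SENTENCE_RE in translate_es_to_en.py: split on whitespace
# that directly follows terminal punctuation (lookbehind on . ! ?).
SENTENCE_RE = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text: str) -> list[str]:
    """Hypothetical helper: return the non-empty sentences found in text."""
    return [s.strip() for s in SENTENCE_RE.split(text.strip()) if s.strip()]


print(split_sentences("Pablo Isla es presidente de Inditex. La empresa tiene sede en Arteixo."))
# ['Pablo Isla es presidente de Inditex.', 'La empresa tiene sede en Arteixo.']

# Caveat: no abbreviation rules, so "S.A." followed by a space also splits.
print(split_sentences("Inditex S.A. cotiza en bolsa."))
# ['Inditex S.A.', 'cotiza en bolsa.']
```

If abbreviations such as "S.A." are common in the input, the split would need an extra rule; that is outside what this sketch covers.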
|
||||
|
||||
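## ES -> EN -> REBEL pipeline pattern

```python
from python.functions.datascience.marianmt_es_en_load_model import marianmt_es_en_load_model
from python.functions.datascience.parse_rebel_output import parse_rebel_output
# Assumed module path, following the layout of the other loaders
from python.functions.datascience.rebel_load_model import rebel_load_model
from python.functions.datascience.translate_es_to_en import translate_es_to_en

es_text = "Pablo Isla es presidente de Inditex."

# Step 1: load models
mt_tok, mt_model = marianmt_es_en_load_model()
rebel_tok, rebel_model = rebel_load_model()

# Step 2: translate
en_text = translate_es_to_en(es_text, mt_tok, mt_model)

# Step 3: extract relations (keep special tokens: the parser needs <triplet> etc.)
inputs = rebel_tok(en_text, return_tensors="pt", max_length=512, truncation=True)
generated = rebel_model.generate(**inputs, num_beams=4, max_length=256)
decoded = rebel_tok.decode(generated[0], skip_special_tokens=False)
triplets = parse_rebel_output(decoded)
```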
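
The translation step can be dry-run without downloading any weights by passing duck-typed stubs, much as the test suite does. What follows is a minimal sketch, assuming the per-sentence loop of `translate_es_to_en` from this commit. `StubTokenizer`, `StubModel` and `translate_sketch` are hypothetical names, and the stubs return plain lists rather than tensors.

```python
from __future__ import annotations

import re
from typing import Any

SENTENCE_RE = re.compile(r"(?<=[.!?])\s+")


class StubTokenizer:
    """Hypothetical stand-in for a MarianMT tokenizer."""

    def __call__(self, text: str, **kwargs: Any) -> dict[str, Any]:
        # One fake token id per whitespace-separated word.
        return {"input_ids": [[len(w) for w in text.split()]]}

    def decode(self, token_ids: list[int], skip_special_tokens: bool = True) -> str:
        return f"<{len(token_ids)} tokens>"


class StubModel:
    """Hypothetical stand-in for a MarianMT model: echoes its input ids."""

    def generate(self, input_ids: Any = None, **kwargs: Any) -> Any:
        return input_ids


def translate_sketch(text: str, tokenizer: Any, model: Any,
                     max_length: int = 512, num_beams: int = 4) -> str:
    """Split into sentences, run each through tokenizer/model, join with a space."""
    sentences = [s.strip() for s in SENTENCE_RE.split(text.strip()) if s.strip()]
    parts: list[str] = []
    for sentence in sentences:
        inputs = tokenizer(sentence, return_tensors="pt",
                           max_length=max_length, truncation=True)
        generated = model.generate(**inputs, num_beams=num_beams, max_length=max_length)
        parts.append(tokenizer.decode(generated[0], skip_special_tokens=True).strip())
    return " ".join(parts)


result = translate_sketch("Primera frase. Segunda frase corta.", StubTokenizer(), StubModel())
print(result)  # <2 tokens> <3 tokens>
```

Each sentence produces exactly one `generate` call and one decoded part, which is what makes the pipeline easy to inspect sentence by sentence.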
|
||||
@@ -0,0 +1,68 @@
|
||||
"""Traduce texto espanol a ingles usando MarianMT, frase a frase."""
|
||||
|
||||
from __future__ import annotations

import re
from typing import Any

# Sentence-split pattern: period, exclamation or question mark followed by whitespace.
_SENTENCE_RE = re.compile(r"(?<=[.!?])\s+")


def translate_es_to_en(
    text: str,
    tokenizer: Any,
    model: Any,
    max_length: int = 512,
    num_beams: int = 4,
) -> str:
    """Translate Spanish text to English, sentence by sentence.

    Splits the input on sentence boundaries (after ``.``, ``!``, ``?``),
    translates each sentence independently, and rejoins with a single space.
    Processing sentence by sentence preserves proper nouns (names, companies,
    locations) better than passing the full paragraph in a single call, because
    the translation model can focus on shorter context windows.

    Args:
        text: Spanish text to translate. Can be a single sentence or a
            multi-sentence paragraph.
        tokenizer: MarianMT tokenizer loaded with ``marianmt_es_en_load_model``.
        model: MarianMT model loaded with ``marianmt_es_en_load_model``.
        max_length: Maximum token length for each sentence during tokenization
            and generation. Sentences longer than this are truncated.
        num_beams: Number of beams for beam search. Higher = better quality,
            slower. Default 4 is a good tradeoff.

    Returns:
        Translated English text. Sentences joined with a single space.
        Returns an empty string if ``text`` is empty or whitespace-only.

    Raises:
        RuntimeError: if model.generate fails (propagated from transformers).
    """
    if not text or not text.strip():
        return ""

    sentences = _SENTENCE_RE.split(text.strip())
    sentences = [s.strip() for s in sentences if s.strip()]
    if not sentences:
        return ""

    translated_parts: list[str] = []
    for sentence in sentences:
        inputs = tokenizer(
            sentence,
            return_tensors="pt",
            max_length=max_length,
            truncation=True,
        )
        generated = model.generate(
            **inputs,
            num_beams=num_beams,
            max_length=max_length,
        )
        decoded = tokenizer.decode(generated[0], skip_special_tokens=True)
        translated_parts.append(decoded.strip())

    return " ".join(translated_parts)
|
||||
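
The splitting step can be checked without loading any model. The sketch below reuses the same regular expression and filtering as `translate_es_to_en`; `split_sentences` is a hypothetical helper name introduced only for this illustration.

```python
import re

# Same pattern as in translate_es_to_en: whitespace preceded by ., ! or ?
_SENTENCE_RE = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text: str) -> list[str]:
    """Split text into sentences exactly as translate_es_to_en does."""
    return [s.strip() for s in _SENTENCE_RE.split(text.strip()) if s.strip()]


print(split_sentences("Hola. Como estas? Muy bien!"))
# ['Hola.', 'Como estas?', 'Muy bien!']

print(split_sentences("El Sr. Garcia trabaja en Aurgi."))
# ['El Sr.', 'Garcia trabaja en Aurgi.']
```

The second call shows a limitation worth keeping in mind: an abbreviation followed by a space (`Sr.`) is treated as a sentence end, so a title and the name it belongs to are translated in separate calls.
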
@@ -0,0 +1,53 @@
---
id: trimmed_mean_py_datascience
name: trimmed_mean
kind: function
lang: py
domain: datascience
version: "1.0.0"
purity: pure
signature: "def trimmed_mean(values: list[float], trim: float = 0.05) -> float"
description: "Arithmetic mean after discarding values below the trim quantile and above the (1 - trim) quantile. Returns math.nan for empty input."
tags: [statistics, mean, robust, trimming, outliers]
uses_functions: []
uses_types: []
returns: []
returns_optional: false
error_type: ""
imports: [math, numpy]
example: |
  from trimmed_mean import trimmed_mean
  result = trimmed_mean([1, 2, 3, 4, 5, 100], 0.1)  # 3.5
tested: true
tests:
  - "test_trimmed_mean_basic"
  - "test_trimmed_mean_empty_returns_nan"
  - "test_trimmed_mean_no_trim"
  - "test_trimmed_mean_single_element"
  - "test_trimmed_mean_uniform"
test_file_path: "python/functions/datascience/tests/test_trimmed_mean.py"
file_path: "python/functions/datascience/trimmed_mean.py"
params:
  - name: values
    desc: "List of numeric values to average."
  - name: trim
    desc: "Fraction to cut from each tail before averaging (0 <= trim < 0.5). Default 0.05."
output: "Trimmed arithmetic mean as float. Returns math.nan if values is empty or all values are trimmed away."
source_repo: "internal:footprint_aurgi"
source_license: "internal-aurgi"
source_file: "aurgi_mapas/generar_pdf_reporte.py:117"
---

## Example

```python
from trimmed_mean import trimmed_mean

trimmed_mean([1, 2, 3, 4, 5, 100], 0.1)  # 3.5 (both 1 and 100 are trimmed)
trimmed_mean([], 0.05)                   # math.nan
trimmed_mean([5.0, 5.0, 5.0], 0.0)       # 5.0
```

## Notes

Uses numpy.percentile to compute the lo and hi thresholds, then keeps the values within the range [lo, hi]; values equal to a threshold are kept. Useful for computing robust averages when the distribution contains extreme values.
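
To see where the 3.5 in the example comes from, the thresholds can be reproduced by hand. This is a pure-Python sketch that assumes `numpy.percentile` is used with its default linear interpolation; `percentile_linear` is a hypothetical helper written for this illustration, not part of the registry.

```python
import math


def percentile_linear(sorted_vals: list[float], q: float) -> float:
    """q-th percentile (0..100) by linear interpolation between neighbours."""
    pos = (len(sorted_vals) - 1) * q / 100.0
    lower = math.floor(pos)
    upper = math.ceil(pos)
    frac = pos - lower
    return sorted_vals[lower] + (sorted_vals[upper] - sorted_vals[lower]) * frac


values = sorted([1, 2, 3, 4, 5, 100])
lo = percentile_linear(values, 10)   # 1.5
hi = percentile_linear(values, 90)   # 52.5
kept = [v for v in values if lo <= v <= hi]
print(lo, hi, kept, sum(kept) / len(kept))   # 1.5 52.5 [2, 3, 4, 5] 3.5

# With very few values the thresholds can fall between the data points,
# which is the case where the function returns math.nan.
pair = [1, 2]
lo2 = percentile_linear(pair, 40)    # 1.4
hi2 = percentile_linear(pair, 60)    # 1.6
print([v for v in pair if lo2 <= v <= hi2])  # []
```

Because the filter works on thresholds rather than on a fixed count of elements, the number of discarded values depends on how the data are spread around `lo` and `hi`.
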
@@ -0,0 +1,28 @@
"""trimmed_mean: arithmetic mean after trimming extreme percentiles."""

import math
import numpy as np


def trimmed_mean(values: list[float], trim: float = 0.05) -> float:
    """Return the trimmed arithmetic mean of values.

    Discards values below the `trim` quantile and above the `1 - trim` quantile before averaging.
    Returns math.nan for an empty list or when trimming removes all elements.

    Args:
        values: List of numeric values.
        trim: Fraction to cut from each tail (0 <= trim < 0.5).

    Returns:
        Trimmed mean as float, or math.nan if the list is empty or no value survives trimming.
    """
    if not values:
        return math.nan
    arr = np.array(values, dtype=float)
    lo = np.percentile(arr, trim * 100)
    hi = np.percentile(arr, (1 - trim) * 100)
    trimmed = arr[(arr >= lo) & (arr <= hi)]
    if len(trimmed) == 0:
        return math.nan
    return float(np.mean(trimmed))
@@ -0,0 +1,55 @@
---
name: words_to_dataset
kind: function
lang: py
domain: datascience
version: "1.0.0"
purity: pure
signature: "def words_to_dataset(texts: Iterable[str | None], min_ocurrencias: int = 1, eliminar_stopwords: bool = False) -> list[dict]"
description: "Extracts words and their occurrence counts from an iterable of texts. Tokenizes with \\b\\w+\\b, uppercases, counts with Counter, filters by a minimum occurrence count, and optionally removes Spanish stopwords. No pandas dependency."
tags: [nlp, text, words, frequency, counter, stopwords, spanish, datascience]
params:
  - name: texts
    desc: Iterable of strings or None. None entries are silently skipped.
  - name: min_ocurrencias
    desc: Minimum number of occurrences required to include a word. Default 1.
  - name: eliminar_stopwords
    desc: If True, filters out an embedded set of common Spanish stopwords.
output: "List of dicts {'palabra': str, 'ocurrencias': int} sorted by ocurrencias in descending order."
uses_functions: []
uses_types: []
returns: []
returns_optional: false
error_type: ""
imports: []
tested: true
tests:
  - "cuenta palabras repetidas"
  - "eliminar stopwords filtra del"
  - "min ocurrencias filtra"
  - "none ignorados"
  - "lista vacia"
  - "orden descendente"
test_file_path: "python/functions/datascience/tests/test_words_to_dataset.py"
file_path: "python/functions/datascience/words_to_dataset.py"
source_repo: "internal:footprint_aurgi"
source_license: "internal-aurgi"
source_file: "fuzzy_joins/arreglo_fuzzy.py"
---

## Example

```python
from words_to_dataset import words_to_dataset

texts = ["calle mayor", "calle del sol", "avenida principal"]
result = words_to_dataset(texts)
# [{"palabra": "CALLE", "ocurrencias": 2}, {"palabra": "MAYOR", "ocurrencias": 1}, ...]

result_clean = words_to_dataset(texts, eliminar_stopwords=True)
# "DEL" is no longer in the result
```

## Notes

Embedded stopwords (a frozenset of about 50 Spanish words, stored uppercase and without accents). Pure function: standard library only (re, collections.Counter). Tokens are uppercased so that "Calle" and "CALLE" count as the same word.
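
One consequence of the unaccented stopword set is that accented input does not match it. The sketch below shows the effect on a three-word excerpt of the stopword set and one possible workaround, folding accents before tokenizing. `strip_accents` and `STOPWORDS_EXCERPT` are hypothetical names for this illustration; the registry function does not perform this step.

```python
import re
import unicodedata

# Small excerpt of the embedded stopword set, for illustration only.
STOPWORDS_EXCERPT = frozenset({"MAS", "TAMBIEN", "DEL"})


def strip_accents(text: str) -> str:
    """Remove combining marks after NFD decomposition (also turns n-tilde into n)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


raw = "m\u00e1s caf\u00e9 tambi\u00e9n"  # "más café también"

# Tokenize the way words_to_dataset does: uppercase, then \b\w+\b.
tokens = re.findall(r"\b\w+\b", raw.upper())
unfiltered = [t for t in tokens if t not in STOPWORDS_EXCERPT]
print(unfiltered)  # accented MAS / TAMBIEN survive the filter

# Fold accents first, then tokenize and filter.
folded = re.findall(r"\b\w+\b", strip_accents(raw).upper())
filtered = [t for t in folded if t not in STOPWORDS_EXCERPT]
print(filtered)  # ['CAFE']
```

Accent folding is lossy (for example, it merges words that differ only by a tilde), so whether to apply it before calling `words_to_dataset` depends on the data.
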
@@ -0,0 +1,54 @@
"""Extract words and their occurrence counts from raw texts."""

from __future__ import annotations

import re
from collections import Counter
from typing import Iterable


_STOPWORDS_ES: frozenset[str] = frozenset({
    "DE", "LA", "EL", "EN", "Y", "A", "LOS", "DEL", "SE", "LAS",
    "UN", "POR", "CON", "NO", "UNA", "SU", "PARA", "ES", "AL", "LO",
    "COMO", "MAS", "O", "PERO", "SUS", "LE", "YA", "ESTE",
    "SI", "PORQUE", "ESTA", "ENTRE", "CUANDO", "MUY", "SIN", "SOBRE",
    "TAMBIEN", "ME", "HASTA", "HAY", "DONDE", "QUIEN", "DESDE", "TODO",
    "NOS", "DURANTE", "TODOS", "UNO", "LES", "NI", "CONTRA", "OTROS",
})


def words_to_dataset(
    texts: Iterable[str | None],
    min_ocurrencias: int = 1,
    eliminar_stopwords: bool = False,
) -> list[dict]:
    """Extract words and occurrence counts from a collection of texts.

    No external dependencies. Tokenizes each text with the regex \\b\\w+\\b,
    uppercases it, counts occurrences, and filters by a minimum count.

    Args:
        texts: Iterable of strings (or None). None entries are skipped.
        min_ocurrencias: Minimum number of occurrences required to include a
            word. Default 1.
        eliminar_stopwords: If True, filters out common Spanish words.

    Returns:
        List of dicts {"palabra": str, "ocurrencias": int} sorted by
        ocurrencias in descending order.
    """
    all_words: list[str] = []
    for text in texts:
        if text is None:
            continue
        words = re.findall(r"\b\w+\b", str(text).upper())
        if eliminar_stopwords:
            words = [w for w in words if w not in _STOPWORDS_ES]
        all_words.extend(words)

    counter = Counter(all_words)
    return [
        {"palabra": word, "ocurrencias": count}
        for word, count in counter.most_common()
        if count >= min_ocurrencias
    ]