Compare commits
17 Commits
| SHA1 |
|---|
| 26569c7015 |
| 44622339fa |
| c0d44a6352 |
| cab0fbf0a3 |
| 7f304adc9c |
| a74a5a047f |
| 44be1d6b58 |
| 64306f3b1c |
| f2eb782a5f |
| 80d10010f5 |
| ecc22d6d57 |
| 7bdb8bffb5 |
| 4139394326 |
| 54a9ab70c7 |
| 4773781323 |
| 50c05d126c |
| 6f88f184f1 |
@@ -41,12 +41,13 @@ reconocido se degrada a `Note`, nunca lanza).
 | `Heading(text, level=1)` | section title, `level` 1 (large) … 3 (small) | one or more lines in bold; level 1 gets an accent-colored underline |
 | `Markdown(text)` | lightweight markdown text | see the subset below; **never cut mid-line** |
 | `KVTable(rows, title=None)` | `rows = [(key, value), ...]` | 2-column label/value table; the value wraps |
-| `DataTable(header, rows, title=None, note=None)` | `header=[...]`, `rows=[[...],...]` | table with a header; **split by rows, repeating the header**; long cells wrap within their column |
+| `DataTable(header, rows, title=None, note=None)` | `header=[...]`, `rows=[[...],...]` | table with a header; **if it fits** as text, it is split by rows, repeating the header; **if it does NOT fit** (too many columns), it is rasterized whole as a high-resolution image that can be zoomed. See §11.4 |
 | `Figure(fig=None, make=None, caption=None, height_in=None)` | an already built `matplotlib.figure.Figure` (`fig`) or a callable `make()->Figure` (lazy) | rasterized and scaled to fit whole (never cropped) |
 | `Image(path, caption=None, height_in=None)` | path to a PNG/JPG | scaled to fit whole |
 | `Caption(text)` / `Note(text)` | small auxiliary text | caption/note in gray; `Note` is also the fallback for unknown blocks |
-| `Group(blocks, title=None)` | **keep-together** unit: its blocks stay together | the renderer measures the whole group and moves it entirely to the next page/slide if it does not fit; it shrinks the figure to leave room for the title+text. See §11 |
+| `Group(blocks, title=None, page_break_before=False, layout="stack")` | **keep-together** unit: its blocks stay together | the renderer measures the whole group and moves it entirely to the next page/slide if it does not fit; it shrinks the figure to leave room for the title+text. `layout="side_by_side"` places table+figure in two columns (PPTX only). See §11 and §11.4 |
 | `GlossaryEntry(key, label, definition)` | a glossary entry (clickable target) | generated by the `glosario` chapter; registers its position as the target of the marked terms. See §11 |
+| `TocEntry(label, target_id)` | a **clickable table-of-contents** entry on the cover | generated by the `portada` chapter; the renderer wires it as a jump to the start of the chapter whose `id` or `title` matches `target_id`. See §11.4 |
 
 `Figure`/`Image` accept `height_in` (hint): the renderer **clamps** the figure to that maximum height (`Group` uses it to shrink the figure). Every figure is scaled leaving room for its caption on the same page/slide; in PPTX the caption is **always** visible (if no `caption` is given, it falls back to the last heading or to "Figura").
 
@@ -397,6 +398,65 @@ cabecera con su fondo propio. Es automático en PDF y PPTX; el patrón se mantie
when a long table is split and repeats its header (the row index is logical, not per
page). Nothing needs to be done in the chapters.

### 11.4 Global render quality: high DPI, wide table → image, figure beside the table, clickable table of contents

Four cross-cutting engine capabilities, **all automatic except `layout`** (which a
chapter enables explicitly). They apply to PDF and PPTX unless stated otherwise.

**(a) High DPI (automatic).** Every embedded figure/image is rasterized at **220 dpi**
(constant `_RASTER_DPI` in both renderers; in PDF it is also applied to the page's
`savefig`, because matplotlib re-rasterizes each `imshow` when writing the page). Goal:
zoom in on a phone and read detail (axes, cells) without pixelation. Text remains
vector and selectable. Nothing needs to be done in the chapters.

**(b) Wide table → high-resolution image (automatic).** When a `DataTable` has
**too many columns to be legible as text** within the usable width (criterion
`_table_fits_as_text`: the table does not fit when minimum legible width per column ×
number of columns > usable width; in practice this triggers on `df.head`-style tables
with many columns), the columns are not squeezed until they become illegible. Instead,
the table is drawn **whole as a high-resolution image** (function
`render_table_as_figure_py_datascience`: shaded header + zebra striping) scaled to fit
completely, so the reader can **zoom** in and read it without losing data.
If the table **does fit**, it stays as selectable text (PDF) / a native table (PPTX).
`KVTable`s (2 columns) always fit and stay as text. Nothing needs to be done in the
chapters.
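As a rough illustration of the fit criterion in (b), the check could look like the sketch below. The name `table_fits_as_text` (without the leading underscore), its signature, and the 0.6-inch minimum column width are assumptions made for this example; the engine's actual `_table_fits_as_text` and its thresholds are not shown in this diff.

```python
def table_fits_as_text(n_columns: int, usable_width_in: float,
                       min_col_width_in: float = 0.6) -> bool:
    """Return True when every column can keep a legible minimum width.

    Hypothetical sketch: min_col_width_in is an assumed threshold, not the
    engine's real value.
    """
    if n_columns <= 0:
        return True  # nothing to lay out
    # The table fits when the sum of minimum widths does not exceed the page.
    return n_columns * min_col_width_in <= usable_width_in


# A 2-column table fits in 5 usable inches; a 20-column df.head-style table
# does not, so it would be rasterized as an image instead.
print(table_fits_as_text(2, 5.0))   # True
print(table_fits_as_text(20, 5.0))  # False
```

Keeping the decision in a pure function like this makes the text/image switch easy to unit-test without rendering anything.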

**(c) Figure beside the table: `Group(layout="side_by_side")`.** A layout hint that a
chapter enables so that its **table sits on the left and its figure on the right** on
the same slide, instead of stacked:

```python
model.Group(
    layout="side_by_side",
    blocks=[
        model.Heading(text=str(name), level=2),     # full width, on top
        model.DataTable(header=..., rows=...),      # LEFT column (~55%)
        model.Figure(make=_grafico_perezoso(...)),  # RIGHT column (~45%)
        model.Markdown(text="explicación…"),        # full width, at the bottom
    ])
```

Exact contract of the field:

| Field | Value | Effect |
|---|---|---|
| `layout` | `"stack"` (default) | historical behavior: vertical stacking (keep-together). |
| `layout` | `"side_by_side"` | **PPTX**: the table (rasterized to an image) takes the left column (~55% of the usable width) and the figure the right one (~45%); any other block (heading, markdown) goes full width above/below. If there is no table+figure pair, or they do not fit side by side on one slide, it **automatically falls back to stacked**. **PDF**: treated **the same as `stack`** (the mobile A5 width cannot hold two legible columns). Unknown values degrade to `"stack"`. |

It is **backward compatible**: a `Group` without `layout` (or with `layout="stack"`)
behaves exactly as before. The `cat_distr` chapter is the intended consumer (chart to
the right of the category table in PPT); this engine only provides the support.
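The fallback rules in the contract table can be summarized as a small decision function. This is only a sketch of the documented behavior: `resolve_layout`, its parameters and the `"pptx"`/`"pdf"` format strings are hypothetical names chosen for the example, not the renderer's real API.

```python
VALID_LAYOUTS = ("stack", "side_by_side")


def resolve_layout(layout, output_format, has_table_and_figure,
                   fits_side_by_side=True):
    """Return the effective layout for a Group (hypothetical helper)."""
    if layout not in VALID_LAYOUTS:
        return "stack"  # unknown values degrade to "stack"
    if layout == "side_by_side":
        if output_format != "pptx":
            return "stack"  # PDF treats side_by_side the same as stack
        if not (has_table_and_figure and fits_side_by_side):
            return "stack"  # no table+figure pair, or the pair does not fit
    return layout


print(resolve_layout("side_by_side", "pptx", True))   # side_by_side
print(resolve_layout("side_by_side", "pdf", True))    # stack
print(resolve_layout("grid", "pptx", True))           # stack
```

Every branch ends in `"stack"` unless all conditions for two columns hold, which mirrors the "never raises, always degrades" stance of the rest of the engine.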

**(d) Clickable table of contents on the cover: `TocEntry`.** The cover emits a
`Heading("Índice")` followed by one `TocEntry(label, target_id)` per chapter. The
renderer records the start page/slide of **every** chapter (indexed by `id` **and** by
`title`) and wires each `TocEntry` as a real jump to that start: in **PDF** via
`add_pdf_internal_links_py_datascience` (PyMuPDF GOTO link), in **PPTX** via
`pptx_link_run_to_slide_py_datascience` (native slide jump). Because the cover only
knows the chapter **titles**, `target_id` is matched against the target's `title` (or
its `id`). If a target does not resolve, the entry is still shown as text (in link
color) and is never cut. It is the same mechanism as the clickable glossary terms
(§11.1), reused in the cover → chapter direction.
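The target lookup described in (d) amounts to an index keyed by both `id` and `title`. The sketch below illustrates that idea only; `build_chapter_index`, `resolve_toc_target`, the dict-shaped chapters and the sample titles are assumptions for the example and do not reproduce the renderer's code.

```python
def build_chapter_index(chapters, start_pages):
    """Map each chapter's id AND title to its start page (hypothetical helper)."""
    index = {}
    for chapter, page in zip(chapters, start_pages):
        index[chapter["id"]] = page
        index[chapter["title"]] = page
    return index


def resolve_toc_target(target_id, index):
    """Return the start page for target_id, or None when it does not resolve."""
    return index.get(target_id)


chapters = [
    {"id": "overview", "title": "Resumen"},          # sample titles, made up
    {"id": "cat_distr", "title": "Categóricas"},
]
index = build_chapter_index(chapters, start_pages=[2, 5])

print(resolve_toc_target("Categóricas", index))  # 5  (matched by title)
print(resolve_toc_target("overview", index))     # 2  (matched by id)
print(resolve_toc_target("missing", index))      # None -> rendered as plain text
```

Returning `None` instead of raising lets the caller keep the entry visible as unlinked text, which is the degradation described above.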

---

## 10. Future integration with `profile_table` (next phase)
File diff suppressed because one or more lines are too long
@@ -29,6 +29,7 @@ from .model import ( # noqa: F401
    KVTable,
    Markdown,
    Note,
    TocEntry,
    as_blocks,
    as_chapters,
    merge_manifest,
@@ -52,6 +53,7 @@ __all__ = [
    "Group",
    "GlossaryEntry",
    "GlossaryCollector",
    "TocEntry",
    "Chapter",
    "as_blocks",
    "as_chapters",
@@ -0,0 +1,109 @@
"""Tests for the `only` filter of build_document (chapter selection).

They verify that:
- only=None keeps the historical behavior (all chapters).
- only=[ids] restricts the BODY to those ids; the cover comes first and the
  glossary comes last whenever a selected chapter registered terms.
- only=[] yields the minimal document (cover only: with no body there are no
  terms, so the empty glossary is omitted).
- the selection also travels through the reserved key ctx['_only_chapters']
  (the channel used by the renderers, which call build_document without `only`),
  and that key never leaks into the chapters.
"""

import os
import sys

_HERE = os.path.dirname(os.path.abspath(__file__))
_FUNCTIONS = os.path.abspath(os.path.join(_HERE, "..", "..", ".."))  # python/functions
if _FUNCTIONS not in sys.path:
    sys.path.insert(0, _FUNCTIONS)

from datascience.automatic_eda import build_document  # noqa: E402


def _profile_with_cat_and_num():
    """Minimal profile that makes cat_distr and num_distr build (non-empty body)."""
    return {
        "table": "ventas", "n_rows": 120, "n_cols": 2, "quality_score": 91,
        "duplicate_pct": 1.5, "null_cell_pct": 0.8,
        "columns": [
            {"name": "region", "inferred_type": "categorical",
             "categorical": {
                 "top": [{"value": "norte", "count": 50, "pct": 0.42},
                         {"value": "sur", "count": 40, "pct": 0.33},
                         {"value": "este", "count": 30, "pct": 0.25}],
                 "mode": "norte", "n_distinct": 3, "entropy": 1.55,
                 "imbalance": 0.1}},
            {"name": "importe", "inferred_type": "numeric",
             "numeric": {"mean": 50.0, "median": 48.0, "std": 10.0,
                         "min": 10, "max": 99, "iqr": 15,
                         "histogram": [{"lo": 0, "hi": 50, "count": 40},
                                       {"lo": 50, "hi": 100, "count": 80}]}},
        ],
    }


def test_only_none_is_full_document():
    """Backward compatibility: without `only`, every applicable chapter is emitted."""
    chs = build_document(_profile_with_cat_and_num(), ctx={"dataset_name": "v"})
    ids = [c.id for c in chs]
    assert ids[0] == "portada"
    assert ids[-1] == "glosario"
    # The body carries the distributions (cat/num), not just cover + glossary.
    assert "num_distr" in ids
    assert "cat_distr" in ids


def test_only_restricts_body_but_keeps_cover_and_glossary():
    # cat_distr registers the "entropía" term in the glossary, so the glossary
    # (target of the clickable term) appears. This demonstrates the contract
    # "cover first + chapter + glossary last".
    chs = build_document(_profile_with_cat_and_num(),
                         ctx={"dataset_name": "v"}, only=["cat_distr"])
    ids = [c.id for c in chs]
    assert ids[0] == "portada", f"cover is not first: {ids}"
    assert ids[-1] == "glosario", f"glossary is not last: {ids}"
    assert "cat_distr" in ids
    # num_distr was left out of the selection.
    assert "num_distr" not in ids


def test_only_empty_yields_minimal_document():
    # only=[] -> empty body. The cover is always present; the glossary only
    # appears if some chapter registered terms (pre-existing pattern: an empty
    # glossary is omitted). No body means no terms, so the minimal document is
    # the cover alone.
    chs = build_document(_profile_with_cat_and_num(),
                         ctx={"dataset_name": "v"}, only=[])
    ids = [c.id for c in chs]
    assert ids == ["portada"], \
        f"only=[] must yield the minimal document (cover only), not {ids}"


def test_selection_via_reserved_ctx_key():
    """The selection travels through ctx['_only_chapters'] when `only` is not passed."""
    chs = build_document(_profile_with_cat_and_num(),
                         ctx={"dataset_name": "v",
                              "_only_chapters": ["cat_distr"]})
    ids = [c.id for c in chs]
    assert "cat_distr" in ids
    assert "num_distr" not in ids
    assert ids[0] == "portada" and ids[-1] == "glosario"


def test_explicit_only_arg_wins_over_ctx_key():
    """If both are passed, the `only` argument takes precedence over the ctx key."""
    chs = build_document(_profile_with_cat_and_num(),
                         ctx={"dataset_name": "v",
                              "_only_chapters": ["cat_distr"]},
                         only=["num_distr"])
    ids = [c.id for c in chs]
    assert "num_distr" in ids
    assert "cat_distr" not in ids


def test_reserved_key_not_leaked_to_caller_ctx():
    """build_document does not mutate the caller's ctx (it works on an internal copy)."""
    ctx = {"dataset_name": "v", "_only_chapters": ["num_distr"]}
    build_document(_profile_with_cat_and_num(), ctx=ctx)
    # The reserved key is still in the caller's dict (the caller's ctx was not mutated).
    assert ctx["_only_chapters"] == ["num_distr"]
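The precedence and no-mutation rules exercised by these tests could be implemented along the lines of the sketch below. It is not taken from `build_document`: the helper name `pick_selection` and its return shape are assumptions made for illustration. Only the reserved key name `_only_chapters` comes from the tests.

```python
RESERVED_KEY = "_only_chapters"


def pick_selection(ctx, only=None):
    """Return (selection, chapter_ctx) without mutating the caller's ctx.

    Hypothetical helper: the explicit `only` argument wins over the reserved
    ctx key, and the reserved key is stripped from the ctx handed to chapters.
    """
    chapter_ctx = dict(ctx or {})                 # shallow copy, caller untouched
    from_ctx = chapter_ctx.pop(RESERVED_KEY, None)
    # `only=[]` is a valid (empty) selection, so compare against None explicitly.
    selection = only if only is not None else from_ctx
    return selection, chapter_ctx


caller_ctx = {"dataset_name": "v", "_only_chapters": ["cat_distr"]}
selection, chapter_ctx = pick_selection(caller_ctx, only=["num_distr"])
print(selection)     # ['num_distr']
print(chapter_ctx)   # {'dataset_name': 'v'}
print(caller_ctx)    # {'dataset_name': 'v', '_only_chapters': ['cat_distr']}
```

Testing `only is not None` rather than truthiness matters here: an empty list must select an empty body instead of silently falling back to the ctx key.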
@@ -0,0 +1,205 @@
"""chapter_deps: central map of compute dependencies per EDA chapter.

SINGLE source of truth for what each chapter in ``CHAPTER_ORDER`` needs in order
to be computed in FULL (without falling into its degraded "insufficient data"
branch). It is consumed by the ``render_automatic_eda`` pipeline when it is asked
to render a SUBSET of chapters (kwarg ``only_chapters``): before profiling, it
resolves the requirements of the requested chapters and enables ONLY the
computation those chapters need, so that a single chapter always arrives
populated and no CPU/LLM time is wasted on pieces no requested chapter uses.

Design: the map is CENTRAL (this module), NOT a per-chapter constant. This avoids
touching the ``chapters/<id>.py`` files (each agent owns its chapter) and removes
the risk of collisions between branches. If a chapter changes what it reads from
``profile``/``ctx``, THIS map is updated: it is where the engine looks.

Two kinds of dependency, derived by inspecting what each chapter reads:

- ``profile_flags``: ``profile_table`` cost flags that must be ENABLED so that
  the ``profile`` carries the block the chapter reads. These are the expensive
  ones:
    * ``run_models`` -> ``profile['models']`` (KMeans/IsolationForest/PCA).
      Read by ``outliers`` (multivariate fallback) and ``modelos``.
    * ``run_series`` -> ``profile['series']`` (time-series analysis).
      Read by ``timeseries``.
    * ``run_llm`` -> ``profile['llm']`` (model interpretation).
      Read by ``analisis_llm``.

- ``ctx``: labels of the RAW DATA pieces that ``build_eda_render_ctx`` builds
  and that the chapter reads from ``ctx``. If the list is empty, the chapter
  needs no raw data, and the pipeline can skip ``build_eda_render_ctx`` entirely
  when no requested chapter asks for any.
  Labels and the actual keys they map to (see ``CTX_LABEL_TO_KEYS``):
    * ``head_rows`` -> ``ctx['head_rows']`` (overview: real df.head).
    * ``raw_numeric`` -> ``ctx['raw_numeric']`` (outliers/modelos/
      correlacion/missingness/geospatial: row-aligned numeric sample).
    * ``timeseries_raw`` -> ``ctx['timeseries_raw']`` (timeseries: raw series).
    * ``geo_points`` -> ``ctx['geo_points']`` (+ ``raw_numeric``)
      (geospatial: lat/lon).
    * ``db_path_table`` -> ``ctx['db_path']`` + ``ctx['table']`` (agregacion/
      text_distr/missingness/relaciones: push-down of their own queries).

``portada`` and ``glosario`` are NOT optional: the pipeline ALWAYS includes them
(the cover summarizes the document and the glossary is the target of the
clickable terms), so they are declared here with no compute requirements.

All functions in this module are PURE (no I/O, deterministic), which makes them
suitable for direct unit testing.
"""

from __future__ import annotations

# Central map. One entry per id in CHAPTER_ORDER. ``profile_flags`` lists the
# cost flags to enable; ``ctx`` lists the raw-data labels the chapter reads.
# Empty lists mean "needs no dependency of that kind".
CHAPTER_DEPS = {
    # Cover and glossary: ALWAYS present, with no computation of their own (the
    # cover reads the document_summary assembled by build_document; the glossary
    # reads the terms the other chapters registered). They are declared so that
    # the map covers all of CHAPTER_ORDER and validation recognizes them.
    "portada": {"profile_flags": [], "ctx": []},
    "overview": {"profile_flags": [], "ctx": ["head_rows"]},
    "analisis_llm": {"profile_flags": ["run_llm"], "ctx": []},
    "num_distr": {"profile_flags": [], "ctx": []},
    "cat_distr": {"profile_flags": [], "ctx": []},
    # text_distr pushes down its own text query (it does not use raw_numeric);
    # it needs db_path/table in the ctx to do so.
    "text_distr": {"profile_flags": [], "ctx": ["db_path_table"]},
    "calidad": {"profile_flags": [], "ctx": []},
    # missingness reads the raw numeric sample (co-occurrence of missing values)
    # and can push down a null-pattern query using db_path/table.
    "missingness": {"profile_flags": [], "ctx": ["raw_numeric", "db_path_table"]},
    # outliers runs IsolationForest LIVE on ctx['raw_numeric']; run_models also
    # guarantees the profile['models']['outliers'] fallback if the ctx is missing.
    "outliers": {"profile_flags": ["run_models"], "ctx": ["raw_numeric"]},
    "correlacion": {"profile_flags": [], "ctx": ["raw_numeric"]},
    "relaciones": {"profile_flags": [], "ctx": ["db_path_table"]},
    "modelos": {"profile_flags": ["run_models"], "ctx": ["raw_numeric"]},
    "timeseries": {"profile_flags": ["run_series"], "ctx": ["timeseries_raw"]},
    "geospatial": {"profile_flags": [], "ctx": ["geo_points", "raw_numeric"]},
    "agregacion": {"profile_flags": [], "ctx": ["db_path_table"]},
    "glosario": {"profile_flags": [], "ctx": []},
}

# Chapters the document ALWAYS includes, regardless of only_chapters.
ALWAYS_PRESENT = ("portada", "glosario")

# Recognized cost flags (order does not matter; they are returned as a set).
KNOWN_PROFILE_FLAGS = ("run_models", "run_series", "run_llm")

# Maps each ctx label to the ACTUAL keys produced by build_eda_render_ctx.
# ``db_path_table`` is special: db_path/table are always set for a valid backend
# and are harmless, so they are never pruned (they do not appear in
# DATA_CTX_KEYS). The rest (head_rows/raw_numeric/timeseries_raw/geo_points) are
# the prunable data pieces.
CTX_LABEL_TO_KEYS = {
    "head_rows": {"head_rows"},
    "raw_numeric": {"raw_numeric"},
    "timeseries_raw": {"timeseries_raw"},
    "geo_points": {"geo_points", "raw_numeric"},
    "db_path_table": set(),  # db_path/table are always present; never pruned.
}

# Raw-data ctx keys that can be pruned when no requested chapter needs them
# (the ones that cost sampling). db_path/table do NOT belong here.
DATA_CTX_KEYS = ("head_rows", "raw_numeric", "timeseries_raw", "geo_points")


def _as_id_list(chapter_ids):
    """Normalize the input to a list of string ids, defensively. None -> []."""
    if chapter_ids is None:
        return []
    if isinstance(chapter_ids, str):
        return [chapter_ids]
    return [c for c in chapter_ids if isinstance(c, str)]


def validate_chapter_ids(chapter_ids, order):
    """Split the requested ids into valid and unknown with respect to ``order``.

    Args:
        chapter_ids: list (or str) of requested chapter ids.
        order: canonical list of valid ids (CHAPTER_ORDER).

    Returns:
        dict ``{"valid": [...], "unknown": [...]}`` preserving the order in
        which the ids appear in the input. Pure function.
    """
    valid_set = set(order or [])
    valid, unknown = [], []
    for cid in _as_id_list(chapter_ids):
        (valid if cid in valid_set else unknown).append(cid)
    return {"valid": valid, "unknown": unknown}


def resolve_requirements(chapter_ids):
    """Merge the compute requirements of the requested chapters.

    This is the core of dependency resolution: given the subset of chapters to
    render, it returns EVERYTHING that must be enabled/built so that those
    chapters arrive COMPLETE, and nothing else.

    The ``ALWAYS_PRESENT`` chapters (portada/glosario) are added implicitly
    because the pipeline always includes them; since they have no requirements
    they do not change the result, but they are considered so that the set is
    consistent.

    Args:
        chapter_ids: list (or str) of chapter ids. Unknown ids are silently
            ignored (strict validation is the caller's responsibility).
            None or an empty list -> empty requirements.

    Returns:
        dict ``{"profile_flags": set[str], "ctx_keys": set[str]}`` where
        ``ctx_keys`` holds the ctx LABELS (not the actual keys). Pure
        function.
    """
    ids = set(_as_id_list(chapter_ids)) | set(ALWAYS_PRESENT)
    profile_flags = set()
    ctx_keys = set()
    for cid in ids:
        dep = CHAPTER_DEPS.get(cid)
        if not isinstance(dep, dict):
            continue
        for f in dep.get("profile_flags", []) or []:
            if f in KNOWN_PROFILE_FLAGS:
                profile_flags.add(f)
        for k in dep.get("ctx", []) or []:
            ctx_keys.add(k)
    return {"profile_flags": profile_flags, "ctx_keys": ctx_keys}


def resolve_profile_flags(chapter_ids):
    """Shortcut: only the set of profile_flags to enable for the requested chapters.

    Pure function. Returns a set ⊆ KNOWN_PROFILE_FLAGS.
    """
    return resolve_requirements(chapter_ids)["profile_flags"]


def needs_render_ctx(chapter_ids):
    """True if any requested chapter needs raw data from the ctx.

    When it is False, the pipeline can skip ``build_eda_render_ctx`` entirely
    (a real CPU/I/O saving): the requested chapters read no piece of raw data.
    Pure function.
    """
    return bool(resolve_requirements(chapter_ids)["ctx_keys"])


def resolve_ctx_data_keys(chapter_ids):
    """ACTUAL ctx data keys to KEEP for the requested chapters.

    Translates the ctx labels into the concrete keys produced by
    ``build_eda_render_ctx`` (head_rows/raw_numeric/timeseries_raw/geo_points).
    The pipeline prunes from the ctx the data keys that are NOT in this set, so
    that a single chapter does not drag along data pieces it does not use.
    db_path/table are never pruned (they do not appear here). Pure function.

    Returns:
        set[str], a subset of DATA_CTX_KEYS.
    """
    req = resolve_requirements(chapter_ids)
    keep = set()
    for label in req["ctx_keys"]:
        keep |= CTX_LABEL_TO_KEYS.get(label, set())
    # Only prunable data keys (db_path/table are handled separately).
    return {k for k in keep if k in DATA_CTX_KEYS}
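To make the pruning step mentioned in the `resolve_ctx_data_keys` docstring concrete, the pipeline side might look roughly like the sketch below. The `prune_ctx` helper and the sample ctx values are hypothetical; the real pruning lives in `render_automatic_eda`, which is not part of this diff. `DATA_CTX_KEYS` is copied from the module above so the snippet runs on its own.

```python
DATA_CTX_KEYS = ("head_rows", "raw_numeric", "timeseries_raw", "geo_points")


def prune_ctx(ctx, keep):
    """Drop prunable data keys that are not in `keep`; leave everything else.

    Hypothetical helper. Returns a new dict, so the input ctx is not mutated.
    """
    return {k: v for k, v in ctx.items()
            if k not in DATA_CTX_KEYS or k in keep}


full_ctx = {
    "db_path": "eda.duckdb",          # sample values, made up for the example
    "table": "ventas",
    "head_rows": [["norte", 50.0]],
    "raw_numeric": {"importe": [10, 99]},
}

# For only_chapters=["outliers"], resolve_ctx_data_keys returns {"raw_numeric"}.
pruned = prune_ctx(full_ctx, keep={"raw_numeric"})
print(sorted(pruned))  # ['db_path', 'raw_numeric', 'table']
```

Because `db_path` and `table` are not in `DATA_CTX_KEYS`, they survive any selection, which matches the "never pruned" rule stated in the module.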
@@ -0,0 +1,160 @@
"""Tests for the central per-chapter dependency map (chapter_deps).

All functions under test are PURE (no I/O): they are exercised directly, with
no DuckDB and no rendering. They cover requirement resolution (golden cases and
edge cases), id validation and the efficiency helpers (which computation is
skipped).
"""

import os
import sys

_HERE = os.path.dirname(os.path.abspath(__file__))
_FUNCTIONS = os.path.abspath(os.path.join(_HERE, "..", "..", ".."))  # python/functions
if _FUNCTIONS not in sys.path:
    sys.path.insert(0, _FUNCTIONS)

from datascience.automatic_eda.chapter_deps import (  # noqa: E402
    ALWAYS_PRESENT,
    CHAPTER_DEPS,
    DATA_CTX_KEYS,
    needs_render_ctx,
    resolve_ctx_data_keys,
    resolve_profile_flags,
    resolve_requirements,
    validate_chapter_ids,
)
from datascience.automatic_eda.chapters_registry import CHAPTER_ORDER  # noqa: E402


# --------------------------------------------------------------------------- #
# The map covers all of CHAPTER_ORDER (no gaps, no extra keys).
# --------------------------------------------------------------------------- #
def test_chapter_deps_covers_every_chapter_in_order():
    assert set(CHAPTER_DEPS) == set(CHAPTER_ORDER), (
        "CHAPTER_DEPS must declare exactly the ids in CHAPTER_ORDER")
    # Every entry has the expected shape.
    for cid, dep in CHAPTER_DEPS.items():
        assert isinstance(dep.get("profile_flags"), list), cid
        assert isinstance(dep.get("ctx"), list), cid


# --------------------------------------------------------------------------- #
# resolve_requirements, golden case: outliers requires run_models + raw_numeric.
# --------------------------------------------------------------------------- #
def test_resolve_outliers_requires_run_models_and_raw_numeric():
    req = resolve_requirements(["outliers"])
    assert "run_models" in req["profile_flags"]
    assert "raw_numeric" in req["ctx_keys"]
    assert "run_series" not in req["profile_flags"]
    assert "run_llm" not in req["profile_flags"]


def test_resolve_timeseries_requires_run_series():
    req = resolve_requirements(["timeseries"])
    assert req["profile_flags"] == {"run_series"}
    assert "timeseries_raw" in req["ctx_keys"]


def test_resolve_analisis_llm_requires_run_llm():
    assert resolve_requirements(["analisis_llm"])["profile_flags"] == {"run_llm"}


def test_resolve_union_of_several_chapters():
    req = resolve_requirements(["outliers", "timeseries", "analisis_llm"])
    assert req["profile_flags"] == {"run_models", "run_series", "run_llm"}


# --------------------------------------------------------------------------- #
# Efficiency: chapters that do NOT need expensive flags do not enable them.
# --------------------------------------------------------------------------- #
def test_resolve_geospatial_needs_no_cost_flags():
    """geospatial is built from geo_points/raw_numeric in the ctx, NOT from the models."""
    req = resolve_requirements(["geospatial"])
    assert req["profile_flags"] == set(), \
        "geospatial must not enable run_models/run_series/run_llm"
    assert "geo_points" in req["ctx_keys"]


def test_resolve_correlacion_needs_raw_numeric_but_no_models():
    req = resolve_requirements(["correlacion"])
    assert req["profile_flags"] == set()
    assert "raw_numeric" in req["ctx_keys"]


def test_always_present_chapters_add_no_requirements():
    """portada and glosario are always present, but they bring no computation."""
    for cid in ALWAYS_PRESENT:
        req = resolve_requirements([cid])
        assert req["profile_flags"] == set()
        assert req["ctx_keys"] == set()


def test_resolve_profile_flags_shortcut():
    assert resolve_profile_flags(["modelos"]) == {"run_models"}
    assert resolve_profile_flags(["num_distr"]) == set()


# --------------------------------------------------------------------------- #
# needs_render_ctx: when build_eda_render_ctx can be skipped entirely.
# --------------------------------------------------------------------------- #
def test_needs_render_ctx_true_when_chapter_reads_raw_data():
    assert needs_render_ctx(["outliers"]) is True
    assert needs_render_ctx(["agregacion"]) is True  # db_path/table push-down
    assert needs_render_ctx(["timeseries"]) is True


def test_needs_render_ctx_false_for_purely_aggregated_chapters():
    """num_distr / cat_distr / calidad only read the aggregated profile."""
    assert needs_render_ctx(["num_distr"]) is False
    assert needs_render_ctx(["cat_distr", "calidad"]) is False


# --------------------------------------------------------------------------- #
# resolve_ctx_data_keys, pruning: which DATA keys to keep (not db_path/table).
# --------------------------------------------------------------------------- #
def test_resolve_ctx_data_keys_outliers_keeps_only_raw_numeric():
    assert resolve_ctx_data_keys(["outliers"]) == {"raw_numeric"}


def test_resolve_ctx_data_keys_geospatial_keeps_geo_and_numeric():
    assert resolve_ctx_data_keys(["geospatial"]) == {"geo_points", "raw_numeric"}


def test_resolve_ctx_data_keys_aggregation_keeps_nothing_prunable():
    """agregacion uses db_path/table (always present): 0 prunable keys."""
    assert resolve_ctx_data_keys(["agregacion"]) == set()


def test_resolve_ctx_data_keys_subset_of_data_keys():
    keep = resolve_ctx_data_keys(["overview", "timeseries", "geospatial"])
    assert keep <= set(DATA_CTX_KEYS)
    assert {"head_rows", "timeseries_raw", "geo_points", "raw_numeric"} == keep


# --------------------------------------------------------------------------- #
# validate_chapter_ids: separates valid from unknown ids, preserving order.
# --------------------------------------------------------------------------- #
def test_validate_separates_known_and_unknown():
    out = validate_chapter_ids(["outliers", "nope", "timeseries", "ghost"],
                               CHAPTER_ORDER)
    assert out["valid"] == ["outliers", "timeseries"]
    assert out["unknown"] == ["nope", "ghost"]


def test_validate_all_known():
    out = validate_chapter_ids(["portada", "glosario"], CHAPTER_ORDER)
    assert out["unknown"] == []


# --------------------------------------------------------------------------- #
# Robustness: odd inputs never raise.
# --------------------------------------------------------------------------- #
def test_resolve_handles_none_and_empty():
    assert resolve_requirements(None)["profile_flags"] == set()
    assert resolve_requirements([])["profile_flags"] == set()
    # Unknown ids are silently ignored during resolution.
    assert resolve_requirements(["no_existe"])["ctx_keys"] == set()


def test_resolve_accepts_single_string():
    assert resolve_requirements("outliers")["profile_flags"] == {"run_models"}
@@ -5,28 +5,32 @@ page (PDF) / slide (PPTX)**: every column is wrapped in a keep-together
 ``model.Group`` with ``page_break_before=True`` (except the first, which may share
 the intro's page), so its chart sits next to its tables and no column is split.
 
-A short intro names the clickable **[[term:entropia]]entropía[[/term]]** term —
-the full definition lives in the GLOSARIO chapter, so it is NOT repeated inline
-here (one click jumps to the glossary entry). The intro also carries the dataset
-row total used as a comparison baseline.
+Per column the Group is laid out ``side_by_side`` (PPTX: cardinality table LEFT,
+chart RIGHT; PDF: stacked) and contains, in order:
 
-Per column the Group contains, in order:
-
-1. A cardinality key/value table: distinct values, ``% distinct`` (distinct /
+1. The column name plus, when the LLM layer ran, its business **description** and
+**unit** (read from ``profile['llm']['dictionary']``, matched by column name).
+2. A cardinality key/value table: distinct values, ``% distinct`` (distinct /
 total rows), total dataset rows, singleton values (frequency 1), entropy with
 its theoretical maximum and the normalized ratio, mode, imbalance and
 string-length stats.
-2. A short note flagging problematic cardinality (id-like ≈100% distinct, or a
+3. A short note flagging problematic cardinality (id-like ≈100% distinct, or a
 single dominating category).
-3. A ``top-k`` table (value / count / %).
-4. A **donut pie chart** of the most common categories (top-k + an "Otros"
+4. A ``top-k`` table (value / count / %).
+5. A **horizontal bar chart** of the most common categories (top-k + an "Otros"
 bucket), drawn lazily so the renderers scale it to fit entirely.
 
+A short intro names the clickable **[[term:entropia]]entropía[[/term]]** and
+**[[term:pagina_categorica]]page-layout[[/term]]** terms — their full
+definitions live in the GLOSARIO chapter, so they are NOT repeated inline here
+(one click jumps to the glossary entry). The intro also carries the dataset row
+total used as a comparison baseline.
+
 Data comes from the ``eda`` group: each ``columns[i]['categorical']`` is the
 output of ``summarize_categorical`` (``top[{value,count,pct}]``, ``mode``,
 ``n_distinct``, ``entropy``, ``imbalance``, ``len_min/mean/max``). The derived
-cardinality metrics and the pie figure are delegated to two registry functions
-(``categorical_cardinality_block`` and ``categorical_top_pie_figure``); both are
+cardinality metrics and the bar figure are delegated to two registry functions
+(``categorical_cardinality_block`` and ``categorical_top_bar_figure``); both are
 imported lazily and degrade to a minimal inline fallback so this chapter never
|
||||
raises even if they are unavailable.
|
||||
|
||||
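The cardinality metrics listed in the docstring above (distinct values, `% distinct`, singletons, entropy with its theoretical maximum and normalized ratio) can be illustrated with a short sketch. `categorical_cardinality_block` itself is not part of this diff, so the function below is a hypothetical stand-in that assumes Shannon entropy in bits (base 2) and a maximum of `log2(n_distinct)`:

```python
import math


def cardinality_metrics(counts, n_rows):
    """counts: mapping value -> frequency; n_rows: total dataset rows.

    Hypothetical helper; the project's own formulas may differ.
    """
    total = sum(counts.values())
    n_distinct = len(counts)
    probs = [c / total for c in counts.values() if c > 0]
    # Shannon entropy in bits.
    entropy = -sum(p * math.log2(p) for p in probs)
    # A uniform distribution over n_distinct values maximizes entropy.
    max_entropy = math.log2(n_distinct) if n_distinct > 1 else 0.0
    return {
        "n_distinct": n_distinct,
        "pct_distinct": 100.0 * n_distinct / n_rows if n_rows else None,
        "singletons": sum(1 for c in counts.values() if c == 1),
        "entropy_bits": entropy,
        "entropy_max_bits": max_entropy,
        "entropy_norm": entropy / max_entropy if max_entropy else 0.0,
    }


m = cardinality_metrics({"a": 2, "b": 1, "c": 1}, n_rows=4)
```

A normalized ratio close to 1 means the categories are nearly evenly spread; a value close to 0 means one category dominates.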
@@ -39,10 +43,21 @@ import math
|
||||
|
||||
from .. import model
|
||||
|
||||
CHAPTER_VERSION = "1.2.0"
|
||||
CHAPTER_VERSION = "1.3.0"
|
||||
CHAPTER_ID = "cat_distr"
|
||||
CHAPTER_TITLE = "Distribuciones categóricas"
|
||||
|
||||
# Key under which eda_llm_insights stores its interpretive block in the profile.
|
||||
LLM_KEY = "llm"
|
||||
|
||||
# Second glossary term this chapter names: "how each categorical page is laid
|
||||
# out". The long paragraph that used to describe it inline in the intro now lives
|
||||
# in the GLOSARIO chapter (canonical definition in ``glosario._BASELINE_TERMS``);
|
||||
# the intro only names the clickable term, relocating the explanation, not losing
|
||||
# it. The chapter only needs to register key+label here.
|
||||
_TERM_PAGINA_KEY = "pagina_categorica"
|
||||
_TERM_PAGINA_LABEL = "Cómo se organiza cada página categórica"
|
||||
|
||||
# Glossary term this chapter explains. Registered in the shared collector and
|
||||
# marked clickable on its first appearance (end-to-end glossary example —
|
||||
# mejora 6). Other chapters hook their own terms the same way (see the contract).
|
||||
@@ -59,14 +74,14 @@ _TERM_ENTROPIA_DEF = (
|
||||
# Cap the number of categorical columns rendered to keep the document bounded;
|
||||
# the rest are summarized in a closing note (no silent truncation).
|
||||
MAX_COLS = 40
|
||||
# Rows shown in each top-k table and explicit slices in the pie. Kept moderate so
|
||||
# the whole column — cardinality table + top-k table + donut — fits on ONE
|
||||
# Rows shown in each top-k table and explicit bars in the chart. Kept moderate so
|
||||
# the whole column — cardinality table + top-k table + bar chart — fits on ONE
|
||||
# page/slide with the chart next to its tables; the table note still reports
|
||||
# "top N of M" so nothing is silently hidden. For id-like columns (≈100%
|
||||
# distinct) the top-k table is dropped entirely (it would be a list of unique
|
||||
# values — pure noise), which also frees the room the donut needs (see build).
|
||||
# values — pure noise), which also frees the room the chart needs (see build).
|
||||
TOP_TABLE_ROWS = 8
|
||||
PIE_TOP_K = 6
|
||||
CHART_TOP_K = 6
|
||||
# Truncate very long category labels in tables (the renderer also wraps). Kept
|
||||
# tight so a column with long id-like values (names, tickets) still fits its page.
|
||||
LABEL_MAX = 28
|
||||
@@ -208,26 +223,74 @@ def _fallback_cardinality(cat: dict, n_rows) -> dict:
|
||||
}
|
||||
|
||||
|
||||
def _pie_make(top, n_distinct, title, n_rows):
|
||||
"""Return a zero-arg callable that builds the donut figure lazily."""
|
||||
def _llm_index(profile: dict, ctx: dict) -> dict:
    """Map column name -> its LLM dictionary entry (description/unit/...).

    Reads the ``llm.dictionary`` list that ``eda_llm_insights`` stored in the
    profile (``profile['llm']``; falls back to ``ctx['llm']``). Returns an empty
    dict when ``run_llm`` did not run, so the caller degrades cleanly. Fully
    defensive: never raises on malformed input.
    """
    llm = profile.get(LLM_KEY)
    if not isinstance(llm, dict):
        llm = ctx.get(LLM_KEY)
    if not isinstance(llm, dict):
        return {}
    entries = llm.get("dictionary")
    if not isinstance(entries, (list, tuple)):
        return {}
    index: dict = {}
    for e in entries:
        if not isinstance(e, dict):
            continue
        col = e.get("column")
        if col is None:
            continue
        index[model._safe_str(col)] = e
    return index


def _llm_desc_unit_block(name: str, llm_index: dict):
    """Markdown block with the LLM business description + unit of a column, or
    None when no LLM entry matches the column (clean fallback without LLM)."""
    entry = llm_index.get(model._safe_str(name))
    if not isinstance(entry, dict):
        return None
    raw_desc = entry.get("description") or entry.get("business_meaning")
    desc = " ".join(model._safe_str(raw_desc).split()) if raw_desc else ""
    raw_unit = entry.get("unit")
    unit = " ".join(model._safe_str(raw_unit).split()) if raw_unit else ""
    parts = []
    if desc:
        parts.append(f"**Descripción:** {desc}")
    if unit:
        parts.append(f"**Unidad:** {unit}")
    if not parts:
        return None
    return model.Markdown(text=" · ".join(parts))
|
||||
|
||||
|
||||
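To make the description/unit lookup concrete, here is a simplified sketch that works on plain dicts and strings. It assumes `str()` in place of `model._safe_str` and returns the text instead of a `model.Markdown` block; the helper names `build_llm_index` and `desc_unit_text` are hypothetical:

```python
def build_llm_index(entries):
    """Index LLM dictionary entries by column name, skipping malformed ones."""
    index = {}
    if not isinstance(entries, (list, tuple)):
        return index
    for e in entries:
        if isinstance(e, dict) and e.get("column") is not None:
            index[str(e["column"])] = e
    return index


def desc_unit_text(name, index):
    """Return the per-column text, or None when nothing useful is available."""
    entry = index.get(str(name))
    if not isinstance(entry, dict):
        return None
    # description wins; business_meaning is only a fallback.
    raw_desc = entry.get("description") or entry.get("business_meaning") or ""
    desc = " ".join(str(raw_desc).split())
    unit = " ".join(str(entry.get("unit") or "").split())
    parts = []
    if desc:
        parts.append(f"**Descripción:** {desc}")
    if unit:
        parts.append(f"**Unidad:** {unit}")
    return " · ".join(parts) if parts else None


idx = build_llm_index([
    {"column": "categoria",
     "description": "Familia de producto del recambio",
     "unit": "categoría"},
    {"column": "uuid", "description": "Identificador único de registro",
     "unit": ""},
    "not-a-dict",  # malformed entry, silently skipped
])
```

An empty unit is dropped, so the `uuid` column shows only its description, and a column without any entry yields `None`, which the caller treats as "omit the block".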
def _bar_make(top, n_distinct, title, n_rows):
|
||||
"""Return a zero-arg callable that builds the bar figure lazily."""
|
||||
|
||||
def make():
|
||||
try:
|
||||
from datascience.categorical_top_pie_figure import (
|
||||
categorical_top_pie_figure,
|
||||
from datascience.categorical_top_bar_figure import (
|
||||
categorical_top_bar_figure,
|
||||
)
|
||||
|
||||
return categorical_top_pie_figure(
|
||||
return categorical_top_bar_figure(
|
||||
top=top, n_distinct=n_distinct or 0, title=title,
|
||||
top_k=PIE_TOP_K, n_rows=n_rows)
|
||||
top_k=CHART_TOP_K, n_rows=n_rows)
|
||||
except Exception: # noqa: BLE001 — minimal local fallback figure.
|
||||
return _fallback_pie(top, title)
|
||||
return _fallback_bar(top, title)
|
||||
|
||||
return make
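The lazy factory above defers both the import and the drawing until the renderer calls `make()`. A stripped-down sketch of the same pattern, assuming a module name (`hypothetical_registry_module`) that does not exist so the fallback path is exercised; the dict returned by the fallback merely stands in for a figure:

```python
def lazy_figure_factory(top, title):
    """Return a zero-arg callable; nothing is imported or built until called."""

    def make():
        try:
            # Hypothetical module and function names, for illustration only.
            from hypothetical_registry_module import build_figure
            return build_figure(top=top, title=title)
        except Exception:  # import or build failure: degrade, never raise
            return {"kind": "fallback", "title": title, "n": len(top or [])}

    return make


make = lazy_figure_factory([{"value": "a", "count": 3}], "demo")
fig = make()
```

Catching the exception inside `make()` rather than at import time keeps the chapter importable even when the optional dependency is missing.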
|
||||
|
||||
|
||||
def _fallback_pie(top, title):
|
||||
"""Minimal donut figure used only if the registry function is unavailable."""
|
||||
def _fallback_bar(top, title):
|
||||
"""Minimal horizontal-bar figure used only if the registry function is
|
||||
unavailable. Largest category on top, the rest folded into "Otros"."""
|
||||
import matplotlib
|
||||
|
||||
matplotlib.use("Agg")
|
||||
@@ -238,8 +301,8 @@ def _fallback_pie(top, title):
|
||||
items = [t for t in (top or [])
|
||||
if isinstance(t, dict) and isinstance(t.get("count"), (int, float))]
|
||||
items = sorted(items, key=lambda t: t.get("count") or 0, reverse=True)
|
||||
head = items[:PIE_TOP_K]
|
||||
rest = items[PIE_TOP_K:]
|
||||
head = items[:CHART_TOP_K]
|
||||
rest = items[CHART_TOP_K:]
|
||||
labels = [_truncate(t.get("value"), 20) for t in head]
|
||||
sizes = [float(t.get("count") or 0) for t in head]
|
||||
if rest:
|
||||
@@ -249,10 +312,13 @@ def _fallback_pie(top, title):
|
||||
ax.text(0.5, 0.5, "sin datos categóricos", ha="center", va="center")
|
||||
ax.axis("off")
|
||||
return fig
|
||||
ax.pie(sizes, labels=None, wedgeprops={"width": 0.42},
|
||||
autopct=lambda p: f"{p:.0f}%" if p >= 4 else "")
|
||||
ax.legend(labels, loc="center left", bbox_to_anchor=(1.0, 0.5),
|
||||
fontsize=7, frameon=False)
|
||||
# barh draws bottom-up, so reverse to put the largest category on top.
|
||||
y_pos = range(len(labels))
|
||||
ax.barh(list(y_pos), list(reversed(sizes)), color="#4C72B0",
|
||||
edgecolor="white")
|
||||
ax.set_yticks(list(y_pos))
|
||||
ax.set_yticklabels(list(reversed(labels)), fontsize=7)
|
||||
ax.set_xlabel("conteo", fontsize=8)
|
||||
ax.set_title(_truncate(title, 40))
|
||||
fig.tight_layout()
|
||||
return fig
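The data preparation behind the fallback bar chart (sort by count, keep the top k, fold the tail into "Otros", then reverse for `barh`) can be shown without matplotlib. This is a sketch under the assumption that the elided lines append a single "Otros" entry whose size is the sum of the remaining counts; `fold_top_k` is a hypothetical name:

```python
def fold_top_k(top, k):
    """Return (labels, sizes) ordered bottom-to-top for a horizontal bar chart."""
    items = [t for t in (top or [])
             if isinstance(t, dict) and isinstance(t.get("count"), (int, float))]
    items.sort(key=lambda t: t["count"], reverse=True)
    labels = [str(t.get("value")) for t in items[:k]]
    sizes = [float(t["count"]) for t in items[:k]]
    rest = sum(float(t["count"]) for t in items[k:])
    if rest > 0:
        labels.append("Otros")
        sizes.append(rest)
    # barh places index 0 at the bottom, so reverse to get the largest on top.
    return labels[::-1], sizes[::-1]


labels, sizes = fold_top_k(
    [{"value": "a", "count": 5}, {"value": "b", "count": 3},
     {"value": "c", "count": 1}, {"value": "d", "count": 1}],
    k=2,
)
```

With this ordering the "Otros" bar sits at the bottom of the chart and the most frequent category at the top.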
|
||||
@@ -373,22 +439,17 @@ def _topk_table(cat: dict):
|
||||
note=note)
|
||||
|
||||
|
||||
def _intro_blocks(n_rows, mark_term: bool = False):
|
||||
total = _fmt_int(n_rows)
|
||||
# Mark the first appearance of the term as a clickable glossary jump when the
|
||||
# term was registered (mark_term). The full definition of entropy lives in the
|
||||
# GLOSARIO chapter, so the intro only names the clickable term here instead of
|
||||
# repeating the long explanation (avoids the redundancy with the glossary).
|
||||
def _intro_blocks(mark_term: bool = False):
|
||||
# The full explanation of entropy AND of how each categorical page is laid out
|
||||
# lives in the GLOSARIO chapter; the chapter body keeps only the minimal
|
||||
# clickable terms — no descriptive prose — to avoid duplicating the glossary.
|
||||
# The dataset row total is not repeated here: each column's cardinality table
|
||||
# already carries "Total filas (dataset)".
|
||||
entropia = ("[[term:entropia]]entropía[[/term]]" if mark_term
|
||||
else "entropía")
|
||||
text = (
|
||||
f"Cada columna categórica ocupa su propia página: sus métricas de "
|
||||
f"cardinalidad —incluida la {entropia}—, una nota que señala cardinalidad "
|
||||
"problemática, la tabla de las categorías más frecuentes y un gráfico de "
|
||||
"tarta (donut) de las más comunes, todo junto."
|
||||
)
|
||||
if n_rows is not None:
|
||||
text += f" El dataset tiene {total} filas en total como referencia."
|
||||
pagina = ("[[term:pagina_categorica]]cómo se organiza cada página[[/term]]"
|
||||
if mark_term else "cómo se organiza cada página")
|
||||
text = f"Términos: {entropia} · {pagina}."
|
||||
return [
|
||||
model.Heading(text="Entropía y cardinalidad", level=2),
|
||||
model.Markdown(text=text),
|
||||
@@ -406,15 +467,22 @@ def build_cat_distr(profile: dict, ctx: dict):
|
||||
return None
|
||||
|
||||
n_rows = profile.get("n_rows")
|
||||
# Register "entropía" in the shared glossary collector (if present) and mark
|
||||
# its first appearance clickable. End-to-end glossary example (mejora 6).
|
||||
# Register "entropía" and the "how each categorical page is laid out" term in
|
||||
# the shared glossary collector (if present) and mark their first appearance
|
||||
# clickable. End-to-end glossary example (mejora 6).
|
||||
glossary = ctx.get("glossary")
|
||||
mark_term = False
|
||||
if isinstance(glossary, model.GlossaryCollector):
|
||||
glossary.add(_TERM_ENTROPIA_KEY, _TERM_ENTROPIA_LABEL,
|
||||
_TERM_ENTROPIA_DEF)
|
||||
glossary.add(_TERM_PAGINA_KEY, _TERM_PAGINA_LABEL)
|
||||
mark_term = True
|
||||
blocks = list(_intro_blocks(n_rows, mark_term=mark_term))
|
||||
blocks = list(_intro_blocks(mark_term=mark_term))
|
||||
|
||||
# Business description + unit per column come from the LLM dictionary
|
||||
# (profile['llm']['dictionary'], matched by column name); absent without
|
||||
# run_llm, in which case the per-column description block is simply omitted.
|
||||
llm_index = _llm_index(profile, ctx)
|
||||
|
||||
rendered = cat_cols[:MAX_COLS]
|
||||
for idx, col in enumerate(rendered):
|
||||
@@ -422,31 +490,36 @@ def build_cat_distr(profile: dict, ctx: dict):
|
||||
cat = col.get("categorical") or {}
|
||||
card = _normalize_card(_cardinality(cat, n_rows))
|
||||
|
||||
# One Group per categorical column: heading + cardinality table + flag
|
||||
# note + top-k table + donut figure are kept together and the renderer
|
||||
# starts each on a fresh page/slide (page_break_before) so every column
|
||||
# gets its own page with its chart next to its tables. The first column
|
||||
# may share the intro's page (no forced break) to avoid a near-empty page.
|
||||
col_blocks = [
|
||||
model.Heading(text=str(name), level=2),
|
||||
_cardinality_block(card),
|
||||
]
|
||||
# One Group per categorical column: heading + (optional) LLM description +
|
||||
# cardinality table + flag note + top-k table + bar figure are kept
|
||||
# together and the renderer starts each on a fresh page/slide
|
||||
# (page_break_before) so every column gets its own page with its chart next
|
||||
# to its tables. The first column may share the intro's page (no forced
|
||||
# break) to avoid a near-empty page.
|
||||
col_blocks = [model.Heading(text=str(name), level=2)]
|
||||
desc_block = _llm_desc_unit_block(name, llm_index)
|
||||
if desc_block is not None:
|
||||
col_blocks.append(desc_block)
|
||||
col_blocks.append(_cardinality_block(card))
|
||||
note = _flag_note(card)
|
||||
if note is not None:
|
||||
col_blocks.append(note)
|
||||
# For id-like columns (≈100% distinct) the top-k is a list of unique
|
||||
# values — pure noise; skip it (the flag note already explains why) and
|
||||
# let the donut take that room so the whole column fits one page/slide.
|
||||
# let the bar chart take that room so the whole column fits one page/slide.
|
||||
if not card.get("id_like"):
|
||||
topk = _topk_table(cat)
|
||||
if topk is not None:
|
||||
col_blocks.append(topk)
|
||||
col_blocks.append(model.Figure(
|
||||
make=_pie_make(cat.get("top") or [], card.get("n_distinct"),
|
||||
make=_bar_make(cat.get("top") or [], card.get("n_distinct"),
|
||||
str(name), n_rows),
|
||||
caption=(f"Categorías más comunes de «{_truncate(name, 32)}» "
|
||||
"(donut: top-k + «Otros»)")))
|
||||
blocks.append(model.Group(blocks=col_blocks,
|
||||
"(barras: top-k + «Otros»)")))
|
||||
# layout="side_by_side": in PPTX the cardinality table goes to the LEFT and
|
||||
# the bar chart to the RIGHT of the same slide; the PDF renderer stacks it
|
||||
# (the A5 mobile page is too narrow for two readable columns).
|
||||
blocks.append(model.Group(blocks=col_blocks, layout="side_by_side",
|
||||
page_break_before=(idx > 0)))
|
||||
|
||||
if len(cat_cols) > len(rendered):
|
||||
|
||||
@@ -2,12 +2,14 @@
|
||||
|
||||
Self-contained: builds synthetic TableProfiles (no DuckDB) so the suite is fast
|
||||
and deterministic. Verifies that ``build_cat_distr`` emits the blocks the user
|
||||
asked for (distinct/total/%-distinct/unique metrics, top-k table and a donut
|
||||
asked for (distinct/total/%-distinct/unique metrics, top-k table and a bar
|
||||
figure), that EACH categorical column is wrapped in its own keep-together
|
||||
``Group`` that starts on a fresh page/slide (one column per page, chart next to
|
||||
its tables), that the long entropy explanation is NOT repeated inline (it lives
|
||||
in the glossary — only the clickable term is kept), that the chapter renders
|
||||
inside the full document to both PDF and PPTX showing that content, that a
|
||||
``Group`` laid out ``side_by_side`` (PPTX: table left / bars right) that starts on
|
||||
a fresh page/slide (one column per page, chart next to its tables), that the LLM
|
||||
business description + unit are shown per column when the profile carries an LLM
|
||||
block, that the long entropy / page-layout explanations are NOT repeated inline
|
||||
(they live in the glossary — only the clickable terms are kept), that the chapter
|
||||
renders inside the full document to both PDF and PPTX showing that content, that a
|
||||
profile with no categorical columns yields ``None`` without raising, and that
|
||||
long labels / many columns are never cut in either output.
|
||||
"""
|
||||
@@ -116,6 +118,10 @@ def test_golden_build_cat_distr_emite_bloques_pedidos():
|
||||
assert "log2" not in md.text # redundant explanation removed.
|
||||
assert "máxima diversidad" not in md.text
|
||||
|
||||
# The donut/pie is gone: the intro no longer mentions tarta/donut (the chart
|
||||
# is now a bar chart; the long page-layout explanation moved to the glossary).
|
||||
assert "donut" not in md.text and "tarta" not in md.text
|
||||
|
||||
# Per-column blocks are wrapped in keep-together Groups: flatten to inspect.
|
||||
flat = _flatten(ch.blocks)
|
||||
kv = next(b for b in flat if isinstance(b, KVTable))
|
||||
@@ -128,11 +134,13 @@ def test_golden_build_cat_distr_emite_bloques_pedidos():
|
||||
assert any("Entropía" in lbl for lbl in labels)
|
||||
assert "únicos" in values and "%" in values
|
||||
assert "bits" in values and "norm" in values # entropy + max + normalized.
|
||||
# Top-k table + pie figure.
|
||||
# Top-k table + bar figure.
|
||||
dt = next(b for b in flat if isinstance(b, DataTable))
|
||||
assert dt.header == ["Valor", "Conteo", "%"]
|
||||
assert any("neumaticos" in str(cell) for row in dt.rows for cell in row)
|
||||
assert any(isinstance(b, Figure) for b in flat)
|
||||
# Each per-column Group is laid out side_by_side (table left / bars right).
|
||||
assert all(g.layout == "side_by_side" for g in _column_groups(ch))
|
||||
# id-like column flagged with a Note that also explains the top-k is dropped.
|
||||
idnote = next((b for b in flat
|
||||
if isinstance(b, Note) and "identificador" in b.text), None)
|
||||
@@ -140,9 +148,9 @@ def test_golden_build_cat_distr_emite_bloques_pedidos():
|
||||
assert "No se lista el top" in idnote.text
|
||||
|
||||
|
||||
def test_golden_idlike_omite_topk_y_conserva_donut():
|
||||
def test_golden_idlike_omite_topk_y_conserva_grafico():
|
||||
# The id-like column (uuid, 100% distinct) must NOT carry a top-k DataTable
|
||||
# (it would be a list of unique values), but must still keep its donut Figure
|
||||
# (it would be a list of unique values), but must still keep its bar Figure
|
||||
# and its cardinality table so it stays a full per-column page.
|
||||
ch = build_cat_distr(_profile(), {})
|
||||
groups = _column_groups(ch)
|
||||
@@ -151,7 +159,7 @@ def test_golden_idlike_omite_topk_y_conserva_donut():
|
||||
kinds = [b.kind for b in uuid_group.blocks]
|
||||
assert "data_table" not in kinds # top-k of unique values dropped.
|
||||
assert "kv_table" in kinds # cardinality kept.
|
||||
assert "figure" in kinds # donut kept (chart per column).
|
||||
assert "figure" in kinds # bar chart kept (chart per column).
|
||||
# A non-id-like column keeps its top-k table.
|
||||
cat_group = next(g for g in groups
|
||||
if any(getattr(b, "text", "") == "categoria"
|
||||
@@ -205,7 +213,7 @@ def test_golden_render_pdf_una_pagina_por_columna():
|
||||
assert "Entrop" in txt
|
||||
assert "distintos" in txt
|
||||
assert "categoria" in txt and "neumaticos" in txt
|
||||
assert "donut" in txt # figure caption rendered as text.
|
||||
assert "barras" in txt # bar-chart caption rendered as text (PDF).
|
||||
assert "identificador" in txt # id-like note rendered.
|
||||
|
||||
|
||||
@@ -258,9 +266,11 @@ def _profile_high_card() -> dict:
|
||||
|
||||
|
||||
def test_golden_pptx_una_slide_por_columna_con_su_grafico():
|
||||
"""Each categorical column occupies EXACTLY ONE cat_distr slide that carries
|
||||
BOTH its cardinality table and its donut figure (picture) — i.e. the chart is
|
||||
never separated from its table, even for a high-cardinality column."""
    """Each categorical column occupies EXACTLY ONE cat_distr slide that carries
    its chart (picture) on the same slide: the chart is never separated from its
    column, even for a high-cardinality column. With the side_by_side layout the
    table is rasterized to an image, so the check relies on the presence of a
    picture (not on the table text)."""
|
||||
from pptx.enum.shapes import MSO_SHAPE_TYPE
|
||||
|
||||
prof = _profile_high_card()
|
||||
@@ -272,7 +282,7 @@ def test_golden_pptx_una_slide_por_columna_con_su_grafico():
|
||||
prs = Presentation(out)
|
||||
|
||||
# Per column: the cat_distr slides whose text mentions it, and whether the
|
||||
# owning slide also has the donut caption + an actual picture shape.
|
||||
# owning slide also carries an actual picture shape (its chart).
|
||||
slides_with_col = {n: [] for n in cat_names}
|
||||
owner_has_chart = {n: False for n in cat_names}
|
||||
for i, sl in enumerate(prs.slides):
|
||||
@@ -288,15 +298,106 @@ def test_golden_pptx_una_slide_por_columna_con_su_grafico():
|
||||
for n in cat_names:
|
||||
if n in txt:
|
||||
slides_with_col[n].append(i)
|
||||
has_table = "Cardinalidad" in txt or "distintos" in txt
|
||||
if has_pic and "donut" in txt and has_table:
|
||||
if has_pic:
|
||||
owner_has_chart[n] = True
|
||||
|
||||
for n in cat_names:
|
||||
# Exactly one slide carries the column (not split across slides).
|
||||
assert len(slides_with_col[n]) == 1, (n, slides_with_col[n])
|
||||
# That single slide also holds its table AND its donut picture.
|
||||
assert owner_has_chart[n], (n, "tabla y donut no están en el mismo slide")
|
||||
# That single slide also holds its chart picture.
|
||||
assert owner_has_chart[n], (n, "el gráfico no está en el slide de la columna")
|
||||
|
||||
|
||||
def test_golden_pptx_columna_side_by_side_tabla_izq_barra_der():
    """With the side_by_side layout, a categorical column places its cardinality
    table (image) in the left half and its bar chart (image) in the right half
    of the SAME slide. Verifies that at least one column ends up in two columns
    (table left / bars right), which is evidence of side_by_side in PPTX."""
|
||||
from pptx.enum.shapes import MSO_SHAPE_TYPE
|
||||
from pptx.util import Inches
|
||||
|
||||
with tempfile.TemporaryDirectory() as d:
|
||||
out = os.path.join(d, "eda.pptx")
|
||||
render_automatic_eda_pptx(_profile(), out, {"title": "EDA"})
|
||||
prs = Presentation(out)
|
||||
centre = int(Inches(13.333 / 2.0)) # half of the 16:9 slide width.
|
||||
two_col_slides = 0
|
||||
for sl in prs.slides:
|
||||
texts, lefts = [], []
|
||||
for sh in sl.shapes:
|
||||
if sh.has_text_frame:
|
||||
texts.append(sh.text_frame.text)
|
||||
if (sh.shape_type == MSO_SHAPE_TYPE.PICTURE
|
||||
and sh.left is not None):
|
||||
lefts.append(sh.left)
|
||||
txt = re.sub(r"\s+", " ", " ".join(texts))
|
||||
if "Distribuciones categ" not in txt:
|
||||
continue
|
||||
# One picture starts in the left half, another in the right half.
|
||||
if len(lefts) >= 2 and min(lefts) < centre and max(lefts) > centre:
|
||||
two_col_slides += 1
|
||||
assert two_col_slides >= 1, (
|
||||
"ninguna columna quedó con tabla-izq / barras-der (side_by_side)")
|
||||
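The left/right check in the test above boils down to comparing picture `left` offsets against the slide midpoint. A dependency-free sketch of that arithmetic, using the fact that python-pptx lengths are expressed in EMU (914400 per inch); the helper `is_two_column` is hypothetical and the 13.333 in slide width is taken from the test:

```python
EMUS_PER_INCH = 914400  # English Metric Units per inch


def is_two_column(picture_lefts, slide_width_in=13.333):
    """True when one picture starts left of centre and another right of it."""
    centre = int(slide_width_in / 2.0 * EMUS_PER_INCH)
    return (len(picture_lefts) >= 2
            and min(picture_lefts) < centre
            and max(picture_lefts) > centre)


# 0.5 in and 7.5 in: one picture on each side of the ~6.67 in midpoint.
side_by_side = is_two_column([457200, 6858000])
# 0.5 in and 1.0 in: both pictures start in the left half.
stacked = is_two_column([457200, 914400])
```

Only the left edge of each picture is inspected, so a wide picture that starts in the left half and extends past the centre still counts as "left".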
|
||||
|
||||
def _profile_with_llm() -> dict:
|
||||
"""The base profile plus an ``llm`` block (as eda_llm_insights would store it
|
||||
with run_llm=True): a data dictionary with description/unit per column."""
|
||||
prof = _profile()
|
||||
prof["llm"] = {
|
||||
"dictionary": [
|
||||
{"column": "categoria",
|
||||
"description": "Familia de producto del recambio",
|
||||
"business_meaning": "Agrupa el catálogo por tipo de pieza",
|
||||
"unit": "categoría"},
|
||||
{"column": "uuid",
|
||||
"description": "Identificador único de registro",
|
||||
"unit": ""},
|
||||
],
|
||||
}
|
||||
return prof
|
||||
|
||||
|
||||
def test_llm_descripcion_y_unidad_por_columna():
|
||||
# With an LLM dictionary, each categorical column whose name matches shows its
|
||||
# business description and unit in a per-column markdown block.
|
||||
ch = build_cat_distr(_profile_with_llm(), {})
|
||||
groups = _column_groups(ch)
|
||||
cat_group = next(g for g in groups
|
||||
if any(getattr(b, "text", "") == "categoria"
|
||||
for b in g.blocks))
|
||||
md = " ".join(b.text for b in cat_group.blocks
|
||||
if getattr(b, "kind", "") == "markdown")
|
||||
assert "Descripción" in md and "Familia de producto" in md
|
||||
assert "Unidad" in md and "categoría" in md
|
||||
|
||||
|
||||
def test_edge_sin_llm_no_anade_descripcion():
|
||||
# Without an LLM block the per-column description markdown is simply omitted;
|
||||
# the column still renders its cardinality table and bar figure.
|
||||
ch = build_cat_distr(_profile(), {})
|
||||
for g in _column_groups(ch):
|
||||
mds = [b.text for b in g.blocks if getattr(b, "kind", "") == "markdown"]
|
||||
assert not any("Descripción" in t for t in mds)
|
||||
|
||||
|
||||
def test_pagina_categorica_clicable_y_definicion_en_glosario():
|
||||
# The "how each categorical page is laid out" term is registered + marked
|
||||
# clickable in the intro, and its full definition lands in the glossary
|
||||
# chapter (canonical baseline catalog), not inline.
|
||||
from datascience.automatic_eda.chapters.glosario import build_glosario
|
||||
|
||||
gc = GlossaryCollector()
|
||||
ch = build_cat_distr(_profile(), {"glossary": gc})
|
||||
md = next(b for b in ch.blocks if isinstance(b, Markdown))
|
||||
assert "[[term:pagina_categorica]]" in md.text
|
||||
assert gc.has("pagina_categorica")
|
||||
glos = build_glosario(_profile(), {"glossary": gc})
|
||||
entry = next(b for b in glos.blocks
|
||||
if getattr(b, "kind", "") == "glossary_entry"
|
||||
and b.key == "pagina_categorica")
|
||||
assert "barras" in entry.definition
|
||||
assert "identificador" in entry.definition
|
||||
|
||||
|
||||
def test_edge_sin_categoricas_devuelve_none():
|
||||
|
||||
@@ -17,10 +17,63 @@ from __future__ import annotations
|
||||
|
||||
from .. import model
|
||||
|
||||
CHAPTER_VERSION = "1.0.0"
|
||||
CHAPTER_VERSION = "1.1.0"
|
||||
CHAPTER_ID = "glosario"
|
||||
CHAPTER_TITLE = "Glosario"
|
||||
|
||||
# Canonical definitions for cross-cutting terms — the "how to read it" entries
|
||||
# that do not belong to a single chapter. A chapter only needs to *register* the
|
||||
# term (``ctx['glossary'].add(key, label)``) and mark its in-text appearance with
|
||||
# ``[[term:key]]…[[/term]]``; this chapter supplies the full definition here when
|
||||
# the collector carries the term without one. Keeping the prose in a single place
|
||||
# avoids repeating a long paragraph inline in every chapter that names the term
|
||||
# (the explanation moved out of the NUM DISTR and CAT DISTR intros lives here).
|
||||
_BASELINE_TERMS = {
|
||||
"histograma_boxplot": {
|
||||
"label": "Cómo leer el histograma y el boxplot",
|
||||
"definition": (
|
||||
"Para cada columna numérica se muestra su histograma con tres líneas "
|
||||
"de referencia: la media (línea roja discontinua), la mediana (línea "
|
||||
"verde continua) y la banda ±1σ (zona sombreada que cubre una "
|
||||
"desviación estándar a cada lado de la media). Debajo, alineado al "
|
||||
"mismo eje horizontal, un boxplot de Tukey: la caja abarca del primer "
|
||||
"al tercer cuartil (P25–P75), la línea interior es la mediana y los "
|
||||
"bigotes llegan hasta 1,5·IQR; los puntos rojos señalan que hay "
|
||||
"valores más allá de las vallas (posibles atípicos). Comparar la media "
|
||||
"con la mediana revela la asimetría: si la media supera a la mediana la "
|
||||
"cola larga cae hacia los valores altos (asimetría a la derecha), y al "
|
||||
"revés hacia los bajos."),
|
||||
},
|
||||
"pagina_categorica": {
|
||||
"label": "Cómo se organiza cada página categórica",
|
||||
"definition": (
|
||||
"Cada columna categórica ocupa su propia página: muestra sus métricas "
|
||||
"de cardinalidad —incluida la entropía—, una nota que señala "
|
||||
"cardinalidad problemática (columnas que se comportan como "
|
||||
"identificador, con casi todos los valores distintos, o dominadas por "
|
||||
"una sola categoría), la tabla de las categorías más frecuentes (top-k, "
|
||||
"con su conteo y porcentaje) y un gráfico de barras de las categorías "
|
||||
"más comunes (top-k más una barra «Otros» que agrupa la cola). El total "
|
||||
"de filas del dataset se usa como referencia para interpretar los "
|
||||
"conteos."),
|
||||
},
|
||||
}
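The contract described in the comment above `_BASELINE_TERMS` has two halves: a chapter marks a term in its text with `[[term:key]]…[[/term]]`, and the glossary fills in a missing definition from the baseline catalog. A small sketch of both, assuming a regex is enough to read the markup (how the real renderers parse it is not shown in this diff) and using hypothetical names `TERM_RE`, `marked_terms` and `complete_definition`:

```python
import re

# Assumed pattern for the [[term:key]]label[[/term]] markup.
TERM_RE = re.compile(r"\[\[term:([a-z0-9_]+)\]\](.*?)\[\[/term\]\]")


def marked_terms(text):
    """Return (key, visible_label) pairs for every marked term in the text."""
    return TERM_RE.findall(text)


def complete_definition(term, baseline):
    """Fill an empty definition from the baseline catalog, if the key exists."""
    definition = str(term.get("definition") or "")
    base = baseline.get(str(term.get("key") or ""))
    if base and not definition.strip():
        definition = str(base.get("definition") or "")
    return definition


intro = ("Términos: [[term:entropia]]entropía[[/term]] · "
         "[[term:pagina_categorica]]cómo se organiza cada página[[/term]].")
pairs = marked_terms(intro)

baseline = {"pagina_categorica": {"definition": "texto canónico"}}
filled = complete_definition({"key": "pagina_categorica"}, baseline)
kept = complete_definition(
    {"key": "pagina_categorica", "definition": "propia"}, baseline)
```

A definition supplied by the chapter wins over the baseline, so the catalog only completes terms that were registered with key and label alone.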
|
||||
|
||||
|
||||
def _resolve_term(term: dict) -> tuple:
    """Return (label, definition) for a collected term, completing a missing
    definition (and, if absent, the label) from the canonical baseline catalog."""
    key = model._safe_str(term.get("key"))
    label = model._safe_str(term.get("label"))
    definition = model._safe_str(term.get("definition"))
    base = _BASELINE_TERMS.get(key)
    if base:
        if not definition.strip():
            definition = model._safe_str(base.get("definition"))
        if not label.strip() or label == key:
            label = model._safe_str(base.get("label")) or label
    return label, definition
|
||||
|
||||
|
||||
def build_glosario(profile: dict, ctx: dict):
|
||||
"""Build the glossary Chapter from the shared collector, or None if empty."""
|
||||
@@ -36,12 +89,14 @@ def build_glosario(profile: dict, ctx: dict):
|
||||
"Cada término va resaltado en el texto y, al pulsarlo, salta a su "
|
||||
"definición en esta sección.")),
|
||||
]
|
||||
# One clickable destination per term, alphabetically by visible label.
|
||||
# One clickable destination per term, alphabetically by visible label. A term
|
||||
# registered without a definition is completed from the canonical baseline.
|
||||
for term in glossary.terms(by="label"):
|
||||
label, definition = _resolve_term(term)
|
||||
blocks.append(model.GlossaryEntry(
|
||||
key=model._safe_str(term.get("key")),
|
||||
label=model._safe_str(term.get("label")),
|
||||
definition=model._safe_str(term.get("definition"))))
|
||||
label=label,
|
||||
definition=definition))
|
||||
|
||||
return model.Chapter(id=CHAPTER_ID, title=CHAPTER_TITLE,
|
||||
version=CHAPTER_VERSION, blocks=blocks)
|
||||
|
||||
@@ -35,10 +35,21 @@ try:
|
||||
except Exception: # noqa: BLE001 — keep the chapter importable no matter what.
|
||||
build_boxplot_stats = None # type: ignore[assignment]
|
||||
|
||||
CHAPTER_VERSION = "1.2.0"
|
||||
CHAPTER_VERSION = "1.3.0"
|
||||
CHAPTER_ID = "num_distr"
|
||||
CHAPTER_TITLE = "Distribuciones numéricas"
|
||||
|
||||
# Glossary term this chapter explains. The long "how to read the histogram and
|
||||
# the boxplot" paragraph used to live inline in the intro; it now lives in the
|
||||
# GLOSARIO chapter (canonical definition in ``glosario._BASELINE_TERMS``) and the
|
||||
# intro only names the clickable term — one click jumps to the full explanation,
|
||||
# so the information is relocated, not lost (mejora glosario).
|
||||
_TERM_HISTOBOX_KEY = "histograma_boxplot"
|
||||
_TERM_HISTOBOX_LABEL = "Cómo leer el histograma y el boxplot"
|
||||
|
||||
# Key under which eda_llm_insights stores its interpretive block in the profile.
|
||||
LLM_KEY = "llm"
|
||||
|
||||
# Plain-Spanish gloss for every label ``detect_distribution_type`` can emit, so a
|
||||
# non-expert reader understands the shape and the suggested next step (MUST-4.3).
|
||||
_DIST_GLOSS = {
|
||||
@@ -99,6 +110,53 @@ def _numeric_columns(profile: dict) -> list:
|
||||
return out
|
||||
|
||||
|
||||
def _llm_index(profile: dict, ctx: dict) -> dict:
    """Map column name -> its LLM dictionary entry (description/unit/...).

    Reads the ``llm.dictionary`` list that ``eda_llm_insights`` stored in the
    profile (``profile['llm']``; falls back to ``ctx['llm']``). Returns an empty
    dict when ``run_llm`` did not run, so the caller degrades cleanly. Fully
    defensive: never raises on malformed input.
    """
    llm = profile.get(LLM_KEY)
    if not isinstance(llm, dict):
        llm = ctx.get(LLM_KEY)
    if not isinstance(llm, dict):
        return {}
    entries = llm.get("dictionary")
    if not isinstance(entries, (list, tuple)):
        return {}
    index: dict = {}
    for e in entries:
        if not isinstance(e, dict):
            continue
        col = e.get("column")
        if col is None:
            continue
        index[model._safe_str(col)] = e
    return index


def _llm_desc_unit_block(name: str, llm_index: dict):
    """Markdown block with the LLM business description + unit of a column, or
    None when no LLM entry matches the column (clean fallback without LLM)."""
    entry = llm_index.get(model._safe_str(name))
    if not isinstance(entry, dict):
        return None
    raw_desc = entry.get("description") or entry.get("business_meaning")
    desc = " ".join(model._safe_str(raw_desc).split()) if raw_desc else ""
    raw_unit = entry.get("unit")
    unit = " ".join(model._safe_str(raw_unit).split()) if raw_unit else ""
    parts = []
    if desc:
        parts.append(f"**Descripción:** {desc}")
    if unit:
        parts.append(f"**Unidad:** {unit}")
    if not parts:
        return None
    return model.Markdown(text=" · ".join(parts))
|
||||
|
||||
|
||||
def _make_hist_box(name: str, numeric: dict, box: dict):
|
||||
"""Build the histogram (with mean/median/±σ lines) + boxplot figure.
|
||||
|
||||
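A minimal standalone sketch of the lookup added in this hunk, assuming plain dicts in place of the project's `model` helpers. The names `index_by_column` and `describe` are hypothetical, and `str()` stands in for `model._safe_str`. It shows how one dictionary entry becomes the description/unit line, and how a column without an entry yields `None` so that the block is simply omitted.

```python
def index_by_column(entries):
    """Map column name -> entry, skipping malformed items (hypothetical helper)."""
    index = {}
    if not isinstance(entries, (list, tuple)):
        return index
    for entry in entries:
        if isinstance(entry, dict) and entry.get("column") is not None:
            index[str(entry["column"])] = entry
    return index


def describe(name, index):
    """Return the description/unit line for a column, or None when absent."""
    entry = index.get(str(name))
    if not isinstance(entry, dict):
        return None
    raw_desc = entry.get("description") or entry.get("business_meaning")
    raw_unit = entry.get("unit")
    # " ".join(s.split()) collapses runs of whitespace, including newlines.
    desc = " ".join(str(raw_desc).split()) if raw_desc else ""
    unit = " ".join(str(raw_unit).split()) if raw_unit else ""
    parts = []
    if desc:
        parts.append(f"**Descripción:** {desc}")
    if unit:
        parts.append(f"**Unidad:** {unit}")
    return " · ".join(parts) if parts else None


llm_dictionary = [
    {"column": "precio", "description": "Precio  de venta\ndel producto", "unit": "EUR"},
    {"column": "alcohol", "business_meaning": "Grado alcohólico"},
    "not a dict",          # malformed entry: ignored
    {"description": "x"},  # no "column" key: ignored
]
index = index_by_column(llm_dictionary)
print(describe("precio", index))
print(describe("alcohol", index))
print(describe("missing", index))
```

The `business_meaning` fallback mirrors the `or` chain in `_llm_desc_unit_block`; whether the real LLM dictionary always carries one of the two keys is not visible in this diff.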
@@ -271,15 +329,26 @@ def build_num_distr(profile: dict, ctx: dict):
     if not numerics:
         return None  # chapter does not apply to a dataset with no numerics.
 
+    # Register the "how to read the histogram and boxplot" term in the shared
+    # glossary collector (if present) and mark its first appearance clickable. The
+    # full explanation (colour code, 1,5·IQR rule, asymmetry reading) lives in the
+    # GLOSARIO chapter instead of inline here: the intro only names the term.
+    glossary = ctx.get("glossary")
+    mark_term = False
+    if isinstance(glossary, model.GlossaryCollector):
+        glossary.add(_TERM_HISTOBOX_KEY, _TERM_HISTOBOX_LABEL)
+        mark_term = True
+    como_leer = ("[[term:histograma_boxplot]]cómo leer estos gráficos[[/term]]"
+                 if mark_term else "cómo leer estos gráficos")
     intro = (
-        "Para cada columna numérica se muestra su **histograma** con tres líneas "
-        "de referencia: la **media** (línea roja discontinua), la **mediana** "
-        "(línea verde continua) y la banda **±1σ** (zona sombreada). Debajo, "
-        "alineado al mismo eje, un **boxplot de Tukey**: la caja abarca del "
-        "primer al tercer cuartil (P25–P75), la línea interior es la mediana y "
-        "los bigotes llegan hasta 1,5·IQR; los puntos rojos señalan que hay "
-        "valores más allá de las vallas. Comparar media y mediana revela la "
-        "asimetría de la distribución.")
+        "Cada columna numérica muestra su **histograma** (con la **media**, la "
+        "**mediana** y la banda **±1σ**) y, debajo y al mismo eje, su **boxplot "
+        f"de Tukey** — {como_leer}.")
 
+    # Business description + unit per column come from the LLM dictionary
+    # (profile['llm']['dictionary'], matched by column name); absent without
+    # run_llm, in which case the per-column description block is simply omitted.
+    llm_index = _llm_index(profile, ctx)
+
     blocks = [
         model.Heading(text=CHAPTER_TITLE, level=1),
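The register-then-mark pattern in this hunk can be sketched in isolation. `TinyCollector` below is a hypothetical stand-in for `model.GlossaryCollector`: only `add` and `has` are modelled, and "first definition wins" follows the comment in the outliers chapter rather than the real class, which this diff does not show. The point is the degradation: without a collector the sentence falls back to plain text.

```python
class TinyCollector:
    """Hypothetical stand-in for model.GlossaryCollector (first definition wins)."""

    def __init__(self):
        self._terms = {}

    def add(self, key, label, definition=None):
        # setdefault keeps the first registration, so repeated adds are harmless.
        self._terms.setdefault(key, (label, definition))

    def has(self, key):
        return key in self._terms


def term(mark, key, text):
    """Wrap text in [[term:key]]...[[/term]] only when a collector is active."""
    return f"[[term:{key}]]{text}[[/term]]" if mark else text


def intro_sentence(ctx):
    glossary = ctx.get("glossary")
    mark = isinstance(glossary, TinyCollector)
    if mark:
        glossary.add("histograma_boxplot", "Cómo leer el histograma y el boxplot")
    return "boxplot de Tukey: " + term(mark, "histograma_boxplot",
                                       "cómo leer estos gráficos")


collector = TinyCollector()
print(intro_sentence({"glossary": collector}))  # marked, term registered
print(intro_sentence({}))                       # plain text, nothing registered
```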
@@ -293,17 +362,20 @@ def build_num_distr(profile: dict, ctx: dict):
             box = build_boxplot_stats(numeric) or {}
         except Exception:  # noqa: BLE001 — degrade, never raise.
             box = {}
-        # Keep the column heading, its figure and its stats note together on the
-        # same page/slide (mejora 3 — keep-together): the renderers measure the
-        # whole Group and move it whole when it would not fit.
-        blocks.append(model.Group(blocks=[
-            model.Heading(text=str(name), level=2),
-            model.Figure(
-                make=_figure_maker(name, numeric, box),
-                caption=f"Distribución de «{name}» — histograma "
-                        f"(media/mediana/±σ) y boxplot."),
-            model.Markdown(text=_stats_note(name, numeric, box)),
-        ]))
+        # Keep the column heading, its (optional) LLM description, its figure and
+        # its stats note together on the same page/slide (mejora 3 —
+        # keep-together): the renderers measure the whole Group and move it whole
+        # when it would not fit.
+        col_blocks = [model.Heading(text=str(name), level=2)]
+        desc_block = _llm_desc_unit_block(name, llm_index)
+        if desc_block is not None:
+            col_blocks.append(desc_block)
+        col_blocks.append(model.Figure(
+            make=_figure_maker(name, numeric, box),
+            caption=f"Distribución de «{name}» — histograma "
+                    f"(media/mediana/±σ) y boxplot."))
+        col_blocks.append(model.Markdown(text=_stats_note(name, numeric, box)))
+        blocks.append(model.Group(blocks=col_blocks))
 
     return model.Chapter(id=CHAPTER_ID, title=CHAPTER_TITLE,
                          version=CHAPTER_VERSION, blocks=blocks)
@@ -101,7 +101,7 @@ def test_golden_chapter_estructura_y_bloques():
 
 
 def test_golden_media_mediana_sigma_y_boxplot_presentes():
-    # The intro documents the three reference lines and the Tukey boxplot; the
+    # The short intro names the three reference lines and the Tukey boxplot; the
     # per-column note carries the actual mean/median/σ numbers and the shape.
     ch = build_num_distr(_profile(n_numeric=1, extra_categorical=False), {})
     md_texts = " ".join(b.text for b in _flatten(ch.blocks)
@@ -110,10 +110,58 @@ def test_golden_media_mediana_sigma_y_boxplot_presentes():
     assert "±1σ" in md_texts or "σ" in md_texts
     assert "boxplot" in md_texts.lower()
     assert "Tukey" in md_texts
+    # The long "how to read it" explanation moved to the glossary: the colour-code
+    # / 1,5·IQR walkthrough is no longer inline in the chapter body.
+    assert "1,5·IQR" not in md_texts
+    assert "línea roja" not in md_texts
     # distribution_type gloss surfaced for the column (right-skewed preset).
     assert _DIST_GLOSS["right-skewed"].split(";")[0][:20] in md_texts
 
 
+def test_glosario_histograma_boxplot_clicable_y_definicion():
+    # With a glossary collector the intro marks the clickable term and the FULL
+    # explanation (the long paragraph removed from the body) lands in the glossary.
+    from datascience.automatic_eda.chapters.glosario import build_glosario
+
+    gc = model.GlossaryCollector()
+    prof = _profile(n_numeric=1, extra_categorical=False)
+    ch = build_num_distr(prof, {"glossary": gc})
+    intro = next(b for b in ch.blocks if b.kind == "markdown")
+    assert "[[term:histograma_boxplot]]" in intro.text
+    assert gc.has("histograma_boxplot")
+    glos = build_glosario(prof, {"glossary": gc})
+    entry = next(b for b in glos.blocks
+                 if getattr(b, "kind", "") == "glossary_entry"
+                 and b.key == "histograma_boxplot")
+    assert "boxplot" in entry.definition.lower()
+    assert "1,5·IQR" in entry.definition
+
+
+def test_llm_descripcion_y_unidad_por_columna():
+    # With an LLM dictionary, each numeric column whose name matches shows its
+    # business description and unit in a per-column markdown block.
+    prof = _profile(n_numeric=2)
+    prof["llm"] = {"dictionary": [
+        {"column": "precio", "description": "Precio de venta del producto",
+         "unit": "EUR"},
+        {"column": "alcohol", "business_meaning": "Grado alcohólico",
+         "unit": "% vol"},
+    ]}
+    ch = build_num_distr(prof, {})
+    md_all = " ".join(b.text for b in _flatten(ch.blocks)
+                      if b.kind == "markdown")
+    assert "Precio de venta" in md_all and "EUR" in md_all
+    assert "Grado alcohólico" in md_all and "% vol" in md_all
+
+
+def test_edge_sin_llm_no_anade_descripcion():
+    # Without an LLM block the per-column description markdown is simply omitted.
+    ch = build_num_distr(_profile(n_numeric=2), {})
+    md_all = " ".join(b.text for b in _flatten(ch.blocks)
+                      if b.kind == "markdown")
+    assert "Descripción" not in md_all
+
+
 def test_boxplot_stats_se_consumen_del_registry():
     # The chapter must feed build_boxplot_stats (group eda) and the resulting
     # box must carry the Tukey fences for the figure.
@@ -0,0 +1,593 @@
+"""Outliers chapter (OUTLIERS) — univariate + multivariate atypical values.
+
+Today the analysis of atypical values is scattered across the document: the
+NUM DISTR chapter mentions the per-column outlier count inside each distribution
+figure, and the MODELOS chapter runs Isolation Forest as one of several cheap
+models. This chapter gathers and deepens the whole outlier story in a single
+place, with its interpretation: an [[term:outlier]]outlier[[/term]] is **not
+necessarily an error** — it can be a legitimate, extreme but real observation —
+so the reading is exploratory (what to look at), never confirmatory (what to
+delete).
+
+Sections, in order:
+
+1. **Resumen univariante por columna** — for every numeric column, the number
+   and percentage of atypical values by two complementary criteria: Tukey's
+   1.5·IQR rule ([[term:tukey_fence]]vallas de Tukey[[/term]]) and the
+   [[term:zscore]]z-score[[/term]] rule (|z| > 3). The most contaminated columns
+   are flagged. The fences come from the pure registry function
+   ``build_boxplot_stats`` (derived from the profile percentiles); the per-column
+   counts use the raw sample in ``ctx['raw_numeric']`` when available (the exact
+   count), degrading to the profile's own z-score counts otherwise.
+2. **Boxplots** — a single figure with the Tukey boxplots of the most
+   contaminated columns (box, whiskers and atypical points), delegated to the
+   reusable registry helper ``build_boxplots_figure``.
+3. **Multivariante (filas anómalas)** — rows that are atypical considering ALL
+   columns at once, via the registry function ``isolation_forest_outliers``: the
+   count and percentage of anomalous rows, the most anomalous rows with their
+   score, and the dimensions that make each one rare (top columns by |z|, via
+   ``summarize_outlier_dims``). Run live on ``ctx['raw_numeric']`` (the same
+   numeric columns ``summarize_outlier_dims`` uses, so the row indexing stays
+   coherent and the dimension breakdown is correct); falls back to the
+   precomputed ``profile['models']['outliers']`` only when no raw sample is
+   available (e.g. the lite preset), where no per-row breakdown is shown.
+4. **Interpretación** — outlier ≠ error: how to tell a data-entry error from a
+   genuine extreme value, and what to do (inspect, winsorize, or re-express —
+   linking to the Tukey re-expression the profile already computes).
+
+The chapter activates whenever the table has at least one numeric column; with
+no numeric column it returns ``None`` and disappears from the document.
+
+Reads everything defensively (``.get``) and never raises: every registry
+delegation is imported lazily and degraded to an honest note on any failure.
+
+Contract: build_<id>(profile, ctx) -> Chapter | None ; CHAPTER_VERSION = "x.y.z".
+"""
+
+from __future__ import annotations
+
+from .. import model
+
+CHAPTER_VERSION = "1.0.0"
+CHAPTER_ID = "outliers"
+CHAPTER_TITLE = "Valores atípicos"
+
+# z-score threshold for the univariate z rule: |z| > 3 flags a value ~3 standard
+# deviations from the mean (≈99.7% of a normal distribution lies within ±3σ).
+_Z_THRESH = 3.0
+# How many columns to draw in the boxplots figure (most contaminated first) and
+# how many anomalous rows to list in the multivariate table.
+_TOP_BOX = 12
+_TOP_ROWS = 12
+# Cap on the raw atypical values passed as boxplot fliers, so a heavy-tailed
+# column does not flood the figure with thousands of points.
+_MAX_FLIERS = 200
+# How many columns flagged as "most contaminated" in the summary note.
+_TOP_FLAGGED = 3
+
+# Glossary terms this chapter explains (contract §11.1). Registered in the shared
+# collector and marked clickable on first appearance. ``isolation_forest`` and
+# ``zscore`` may also be registered by the MODELOS chapter — ``add`` is
+# idempotent (first definition wins), so registering them here is harmless and
+# keeps this chapter self-contained when MODELOS does not render.
+_TERM_DEFS = {
+    "outlier": (
+        "Valor atípico (outlier)",
+        "Una observación que se aparta mucho del grueso de los datos. Un atípico "
+        "NO es necesariamente un error: puede ser un fallo de medida o de "
+        "registro, pero también un dato real extremo (un cliente que gasta diez "
+        "veces la media, un día de ventas excepcional). Por eso se señalan para "
+        "revisarlos, no para borrarlos automáticamente.",
+    ),
+    "tukey_fence": (
+        "Vallas de Tukey (1,5·IQR)",
+        "Regla clásica para marcar atípicos a partir de los cuartiles: se calcula "
+        "el rango intercuartílico IQR = P75 − P25 y se trazan dos vallas, una "
+        "inferior en P25 − 1,5·IQR y otra superior en P75 + 1,5·IQR. Los valores "
+        "que caen fuera de esas vallas se consideran atípicos. Es robusta porque "
+        "se apoya en la mediana y los cuartiles, no en la media.",
+    ),
+    "zscore": (
+        "z-score (puntuación típica)",
+        "Mide a cuántas desviaciones típicas está un valor de la media de su "
+        "columna: z = (valor − media) / desviación típica. Un |z| grande (aquí > "
+        "3) señala un valor alejado del centro. A diferencia de las vallas de "
+        "Tukey, el z-score usa media y desviación, así que es más sensible a la "
+        "presencia de los propios atípicos.",
+    ),
+    "isolation_forest": (
+        "Isolation Forest (anomalías multivariantes)",
+        "Algoritmo de detección de anomalías que considera TODAS las columnas a "
+        "la vez: construye árboles que parten el espacio con cortes aleatorios y "
+        "mide cuántos cortes hacen falta para aislar cada fila. Las filas raras "
+        "se aíslan con muy pocos cortes y se marcan como atípicas según un umbral "
+        "de contaminación. Detecta combinaciones de valores poco frecuentes que "
+        "ninguna columna por separado revelaría.",
+    ),
+}
+
+
+# --------------------------------------------------------------------------- #
+# Lazy registry delegations (each degrades to None / no-op on any failure).
+# --------------------------------------------------------------------------- #
+def _load_build_boxplot_stats():
+    try:
+        from datascience.build_boxplot_stats import build_boxplot_stats
+        return build_boxplot_stats
+    except Exception:  # noqa: BLE001
+        return None
+
+
+def _load_detect_outliers():
+    # detect_outliers lives in the monolithic ``datascience.datascience`` module
+    # (file_path datascience.py), not in its own submodule — try both shapes.
+    try:
+        from datascience.datascience import detect_outliers
+        return detect_outliers
+    except Exception:  # noqa: BLE001
+        try:
+            from datascience import detect_outliers
+            return detect_outliers
+        except Exception:  # noqa: BLE001
+            return None
+
+
+def _load_isolation_forest():
+    try:
+        from datascience.isolation_forest_outliers import isolation_forest_outliers
+        return isolation_forest_outliers
+    except Exception:  # noqa: BLE001
+        return None
+
+
+def _load_summarize_dims():
+    try:
+        from datascience.summarize_outlier_dims import summarize_outlier_dims
+        return summarize_outlier_dims
+    except Exception:  # noqa: BLE001
+        return None
+
+
+# --------------------------------------------------------------------------- #
+# Defensive formatters (own copy: the chapter never imports siblings).
+# --------------------------------------------------------------------------- #
+def _fmt_num(value, decimals: int = 3) -> str:
+    if value is None:
+        return "—"
+    if isinstance(value, bool):
+        return "sí" if value else "no"
+    if isinstance(value, int):
+        return f"{value:,}".replace(",", ".")
+    if isinstance(value, float):
+        if value != value:  # NaN
+            return "—"
+        if value in (float("inf"), float("-inf")):
+            return str(value)
+        text = f"{value:.{decimals}f}".rstrip("0").rstrip(".")
+        return text if text else "0"
+    return model._safe_str(value)
+
+
+def _fmt_int(value) -> str:
+    if value is None:
+        return "—"
+    try:
+        return f"{int(round(float(value))):,}".replace(",", ".")
+    except (TypeError, ValueError):
+        return model._safe_str(value)
+
+
+def _fmt_pct(value, decimals: int = 2) -> str:
+    """Format an already-0-100 value as a percentage. None -> placeholder."""
+    if value is None:
+        return "—"
+    try:
+        return f"{float(value):.{decimals}f}%"
+    except (TypeError, ValueError):
+        return model._safe_str(value)
+
+
+def _term(mark: bool, key: str, text: str) -> str:
+    return f"[[term:{key}]]{text}[[/term]]" if mark else text
+
+
+def _is_dict(v) -> bool:
+    return isinstance(v, dict)
+
+
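To make the formatting rules above concrete, here is a small sketch that reuses the same two expressions outside the module. `fmt_int_es` and `fmt_float_trim` are hypothetical names for illustration; the bodies are the f-string idioms from `_fmt_int` and `_fmt_num`.

```python
def fmt_int_es(value):
    """Round to an integer and use '.' as the thousands separator."""
    return f"{int(round(float(value))):,}".replace(",", ".")


def fmt_float_trim(value, decimals=3):
    """Fixed decimals, then drop trailing zeros and a dangling decimal point."""
    text = f"{value:.{decimals}f}".rstrip("0").rstrip(".")
    return text if text else "0"


print(fmt_int_es(1234567))     # 1.234.567
print(fmt_float_trim(3.14))    # 3.14
print(fmt_float_trim(100.0))   # 100
print(fmt_float_trim(0.0004))  # 0
```

One thing worth checking in review: because integers use `.` for thousands while floats keep `.` as the decimal point, `fmt_int_es(1234)` and `fmt_float_trim(1.234)` both render as `1.234`. Whether that ambiguity can reach a reader depends on which cells receive ints versus floats.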
+# --------------------------------------------------------------------------- #
+# Profile reads.
+# --------------------------------------------------------------------------- #
+def _numeric_columns(profile: dict) -> list:
+    """Return [(name, numeric_dict)] for numeric columns with usable stats."""
+    out = []
+    for col in profile.get("columns") or []:
+        if not isinstance(col, dict):
+            continue
+        if col.get("inferred_type") != "numeric":
+            continue
+        num = col.get("numeric")
+        if not isinstance(num, dict) or not num:
+            continue
+        if num.get("mean") is None and num.get("median") is None:
+            continue
+        out.append((col.get("name") or "(columna)", num))
+    return out
+
+
+def _clean_values(raw):
+    """Return the finite float values of a raw column list (drop None/NaN/inf)."""
+    if not isinstance(raw, (list, tuple)):
+        return None
+    vals = []
+    for v in raw:
+        if v is None or isinstance(v, bool):
+            continue
+        try:
+            f = float(v)
+        except (TypeError, ValueError):
+            continue
+        if f != f or f in (float("inf"), float("-inf")):
+            continue
+        vals.append(f)
+    return vals
+
+
+# --------------------------------------------------------------------------- #
+# Per-column univariate summary.
+# --------------------------------------------------------------------------- #
+def _univariate_row(name, numeric, raw_vals, box_fn, detect_fn):
+    """Compute one univariate summary row + boxplot inputs for a column.
+
+    Returns a dict with the table cells and, when raw values are available, the
+    exact Tukey/z counts and the list of atypical (flier) values; otherwise it
+    degrades to the profile's own z-score counts and the fence flags.
+    """
+    box = {}
+    if box_fn is not None:
+        try:
+            box = box_fn(numeric) or {}
+        except Exception:  # noqa: BLE001
+            box = {}
+    lf = box.get("lower_fence")
+    uf = box.get("upper_fence")
+
+    vals = _clean_values(raw_vals)
+    n_tukey = pct_tukey = None
+    n_z = pct_z = None
+    low_extreme = high_extreme = None
+    fliers = []
+    contamination = None  # metric used to rank columns (prefer Tukey %).
+
+    if vals:
+        n = len(vals)
+        tukey_out = []
+        for v in vals:
+            below = (lf is not None and v < lf)
+            above = (uf is not None and v > uf)
+            if below or above:
+                tukey_out.append(v)
+        n_tukey = len(tukey_out)
+        pct_tukey = 100.0 * n_tukey / n if n else None
+        if tukey_out:
+            # Report each tail on its own: the low extreme only considers values
+            # below the lower fence and the high extreme only values above the
+            # upper fence. Taking min/max over all fliers would label a one-sided
+            # tail as both "↓" and "↑".
+            lows = [v for v in tukey_out if lf is not None and v < lf]
+            highs = [v for v in tukey_out if uf is not None and v > uf]
+            low_extreme = min(lows) if lows else None
+            high_extreme = max(highs) if highs else None
+        fliers = tukey_out[:_MAX_FLIERS]
+        # z-score rule via the registry function (returns parallel bools).
+        if detect_fn is not None:
+            try:
+                flags = detect_fn(vals, _Z_THRESH) or []
+                n_z = int(sum(1 for b in flags if b))
+                pct_z = 100.0 * n_z / n if n else None
+            except Exception:  # noqa: BLE001
+                n_z = pct_z = None
+        contamination = pct_tukey
+    else:
+        # Degrade: no raw sample for this column. The profile's own outlier
+        # count/pct come from the z-score block (build_boxplot_stats note); the
+        # Tukey count is unknown, only the fence flags are.
+        n_z = numeric.get("n_outliers")
+        pct_z = numeric.get("outlier_pct")
+        if box.get("has_low_outliers") and box.get("min") is not None:
+            low_extreme = box.get("min")
+        if box.get("has_high_outliers") and box.get("max") is not None:
+            high_extreme = box.get("max")
+        contamination = pct_z if isinstance(pct_z, (int, float)) else None
+
+    # Compact "extremos atípicos" cell: down/up arrows for the low/high tail.
+    extremes = []
+    if low_extreme is not None:
+        extremes.append(f"↓ {_fmt_num(low_extreme)}")
+    if high_extreme is not None:
+        extremes.append(f"↑ {_fmt_num(high_extreme)}")
+    extremes_cell = " ".join(extremes) if extremes else "—"
+
+    return {
+        "name": model._safe_str(name),
+        "n_tukey": n_tukey,
+        "pct_tukey": pct_tukey,
+        "n_z": n_z,
+        "pct_z": pct_z,
+        "lower_fence": lf,
+        "upper_fence": uf,
+        "extremes": extremes_cell,
+        "box": box,
+        "fliers": fliers,
+        "has_raw": bool(vals),
+        "contamination": contamination if isinstance(contamination, (int, float)) else -1.0,
+    }
+
+
+def _univariate_table(rows: list) -> model.DataTable:
+    header = ["Columna", "Atípicos Tukey", "% Tukey", "Atípicos z", "% z",
+              "Valla inf.", "Valla sup.", "Extremos atípicos"]
+    table_rows = []
+    for r in rows:
+        table_rows.append([
+            r["name"],
+            _fmt_int(r["n_tukey"]) if r["n_tukey"] is not None else "—",
+            _fmt_pct(r["pct_tukey"]) if r["pct_tukey"] is not None else "—",
+            _fmt_int(r["n_z"]) if r["n_z"] is not None else "—",
+            _fmt_pct(r["pct_z"]) if r["pct_z"] is not None else "—",
+            _fmt_num(r["lower_fence"]),
+            _fmt_num(r["upper_fence"]),
+            r["extremes"],
+        ])
+    return model.DataTable(
+        header=header, rows=table_rows,
+        title="Valores atípicos por columna",
+        note="Tukey = fuera de las vallas 1,5·IQR · z = |z-score| > 3 · "
+             "ordenado de más a menos contaminada")
+
+
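The two criteria in this section can disagree, and a tiny worked case shows why the chapter reports both. This is a sketch under stated assumptions: the quartiles come from `statistics.quantiles(..., method="inclusive")` and the z-score uses the population standard deviation, whereas the chapter takes its fences from `build_boxplot_stats` and its z flags from `detect_outliers`, whose exact conventions are not visible in this diff. The helper names are hypothetical.

```python
import statistics


def tukey_fences(values, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def count_tukey(values, k=1.5):
    """Count values strictly outside the Tukey fences."""
    lower, upper = tukey_fences(values, k)
    return sum(1 for v in values if v < lower or v > upper)


def count_z(values, threshold=3.0):
    """Count values whose |z| exceeds the threshold (population std)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        return 0  # constant column: no value can be far from the mean
    return sum(1 for v in values if abs(v - mean) / sd > threshold)


sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
print(tukey_fences(sample))  # (-3.5, 14.5)
print(count_tukey(sample))   # 1: the value 100 is beyond the upper fence
print(count_z(sample))       # 0: the outlier inflates the std it is judged by
```

Here the value 100 drags the mean to 14.5 and the standard deviation to about 28.6, so its own z-score stays just under 3. This is the masking effect the glossary entry for `zscore` alludes to, and it is strongest in small samples.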
+# --------------------------------------------------------------------------- #
+# Multivariate (Isolation Forest) section.
+# --------------------------------------------------------------------------- #
+def _resolve_multivariate(profile: dict, ctx: dict, raw_numeric):
+    """Return (outliers_dict_or_None, source).
+
+    Prefers a LIVE Isolation Forest over ``raw_numeric`` so the detector and
+    ``summarize_outlier_dims`` use EXACTLY the same numeric columns and the same
+    valid-row indexing — otherwise the precomputed ``profile['models']
+    ['outliers']`` (run by MODELOS over a possibly different column subset) would
+    yield ``row_index`` values that no longer point at the rows
+    ``summarize_outlier_dims`` reconstructs, mislabelling the "dimensions that
+    make each row rare". Falls back to the precomputed block when no raw sample
+    is available (e.g. the lite preset drops ``raw_numeric``)."""
+    if _is_dict(raw_numeric) and raw_numeric:
+        iso = _load_isolation_forest()
+        if iso is not None:
+            try:
+                out = iso(raw_numeric)
+                if _is_dict(out) and out.get("n_outliers") is not None and out.get("n_rows_used"):
+                    return out, "live"
+            except Exception:  # noqa: BLE001
+                pass
+    # Fallback: the model the MODELOS chapter already computed (no raw sample to
+    # recompute against, so no per-row dimension breakdown either).
+    models = profile.get("models") if _is_dict(profile.get("models")) else {}
+    pre = models.get("outliers") if _is_dict(models) else None
+    if _is_dict(pre) and pre.get("n_outliers") is not None and pre.get("n_rows_used"):
+        return pre, "precomputed"
+    return None, "none"
+
+
+def _multivariate_blocks(outliers: dict, raw_numeric, mark: bool) -> list:
+    isof = _term(mark, "isolation_forest", "**Isolation Forest**")
+    blocks = [
+        model.Heading(text="Filas atípicas (multivariante)", level=2),
+        model.Markdown(text=(
+            f"Hasta aquí cada columna se ha mirado por separado. {isof} busca "
+            "filas raras considerando **todas las columnas a la vez**: una fila "
+            "puede ser normal en cada variable y aun así ser atípica por la "
+            "**combinación** de sus valores (p. ej. una edad baja con una tarifa "
+            "muy alta). La tabla resume cuántas filas se marcaron y el umbral de "
+            "decisión.")),
+        model.KVTable(rows=[
+            ("Filas analizadas", _fmt_int(outliers.get("n_rows_used"))),
+            ("Columnas consideradas", _fmt_int(outliers.get("n_features"))),
+            ("Filas atípicas", _fmt_int(outliers.get("n_outliers"))),
+            ("% filas atípicas", _fmt_pct(outliers.get("outlier_pct"))),
+            ("Umbral de decisión", _fmt_num(outliers.get("threshold"), 4)),
+        ], title="Anomalías multivariantes"),
+    ]
+
+    rows_in = outliers.get("outlier_rows") or []
+    if not rows_in:
+        return blocks
+
+    # Enrich each anomalous row with the dimensions that make it rare, when the
+    # raw sample is available (summarize_outlier_dims reconstructs the same
+    # valid-row indexing as isolation_forest_outliers).
+    dims_by_row = {}
+    if _is_dict(raw_numeric) and raw_numeric:
+        summ = _load_summarize_dims()
+        if summ is not None:
+            try:
+                enriched = summ(raw_numeric, rows_in, top_k=3) or []
+                for e in enriched:
+                    if _is_dict(e) and e.get("row_index") is not None:
+                        dims_by_row[e.get("row_index")] = e.get("dims") or []
+            except Exception:  # noqa: BLE001
+                dims_by_row = {}
+
+    has_dims = bool(dims_by_row)
+    header = ["Fila (entre válidas)", "Score"]
+    if has_dims:
+        header.append("Dimensiones que la hacen rara (col = valor, z)")
+    table_rows = []
+    for r in rows_in[:_TOP_ROWS]:
+        if not _is_dict(r):
+            continue
+        ridx = r.get("row_index")
+        cells = [_fmt_int(ridx), _fmt_num(r.get("score"), 4)]
+        if has_dims:
+            dims = dims_by_row.get(ridx) or []
+            parts = []
+            for d in dims:
+                if not _is_dict(d):
+                    continue
+                parts.append(
+                    f"{model._safe_str(d.get('col'))} = {_fmt_num(d.get('value'))} "
+                    f"(z {_fmt_num(d.get('z'), 2)})")
+            cells.append("; ".join(parts) if parts else "—")
+        table_rows.append(cells)
+
+    if table_rows:
+        shown = len(table_rows)
+        total = outliers.get("n_outliers")
+        note = "las filas más anómalas primero (score más bajo = más rara)"
+        if isinstance(total, int) and total > shown:
+            note += f" — top {shown} de {total}"
+        if not has_dims:
+            note += (" · no se pudo recuperar la muestra cruda para explicar las "
+                     "dimensiones de cada fila")
+        blocks.append(model.DataTable(
+            header=header, rows=table_rows,
+            title="Filas más atípicas", note=note))
+    return blocks
+
+
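The "dimensions that make each row rare" column relies on ranking a row's columns by |z|. A minimal sketch of that idea follows, assuming a dict of equally long numeric lists and the population standard deviation. `top_dims_by_z` is a hypothetical helper, not the registry's `summarize_outlier_dims`, whose signature and handling of missing values are not shown here; only the `col` / `value` / `z` keys are taken from the code above.

```python
import statistics


def top_dims_by_z(columns, row_index, top_k=3):
    """Rank the columns of one row by |z| and keep the top_k (hypothetical)."""
    dims = []
    for col, values in columns.items():
        sd = statistics.pstdev(values)
        if sd == 0:
            continue  # a constant column cannot explain why a row is rare
        z = (values[row_index] - statistics.fmean(values)) / sd
        dims.append({"col": col, "value": values[row_index], "z": z})
    dims.sort(key=lambda d: abs(d["z"]), reverse=True)
    return dims[:top_k]


columns = {
    "edad": [30, 32, 31, 29, 33, 20],
    "tarifa": [10, 12, 11, 9, 13, 500],
    "hermanos": [1, 1, 1, 1, 1, 1],
}
for d in top_dims_by_z(columns, row_index=5, top_k=2):
    print(f"{d['col']} = {d['value']} (z {d['z']:.2f})")
```

Both detector and explainer must index rows the same way for this to be meaningful, which is the reason `_resolve_multivariate` prefers a live run over `raw_numeric`.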
+# --------------------------------------------------------------------------- #
+# Interpretation section.
+# --------------------------------------------------------------------------- #
+def _interpretation_block(mark: bool) -> model.Markdown:
+    outlier = _term(mark, "outlier", "atípico")
+    text = (
+        f"**Un {outlier} no es necesariamente un error.** Conviene distinguir "
+        "dos casos antes de actuar:\n\n"
+        "- **Error de dato** (medida, registro o unidad equivocada): una edad de "
+        "200 años, un importe negativo donde no puede haberlo, un decimal "
+        "desplazado. Estos sí se corrigen o se eliminan, idealmente en el origen.\n"
+        "- **Dato real extremo**: una observación legítima de la cola de la "
+        "distribución (un cliente que gasta mucho más, una tarifa de lujo, un día "
+        "de ventas excepcional). Borrarla sesga el análisis y oculta información "
+        "valiosa.\n\n"
+        "**Qué hacer.** Primero, **revisar** los valores señalados arriba contra "
+        "su origen para decidir cuál de los dos casos es. Si son errores, "
+        "corregirlos. Si son datos reales que distorsionan medias y modelos, hay "
+        "alternativas a borrarlos: **winsorizar** (recortar los extremos a un "
+        "percentil), o **re-expresar** la variable (por ejemplo una "
+        "transformación logarítmica o la escalera de re-expresión de Tukey que "
+        "este mismo perfil ya calcula para las columnas asimétricas), que suele "
+        "domar la cola sin perder ninguna fila. La elección depende del objetivo: "
+        "esta lectura es **exploratoria** —orienta dónde mirar—, no una regla "
+        "automática de limpieza.")
+    return model.Markdown(text=text)
+
+
+# --------------------------------------------------------------------------- #
+# Entry point.
+# --------------------------------------------------------------------------- #
+def build_outliers(profile: dict, ctx: dict):
+    """Build the OUTLIERS Chapter, or None if the dataset has no numeric column."""
+    profile = profile or {}
+    ctx = ctx or {}
+    if not isinstance(profile, dict):
+        return None
+
+    numerics = _numeric_columns(profile)
+    if not numerics:
+        return None  # chapter does not apply to a dataset with no numerics.
+
+    # Register glossary terms (if a collector is present) and mark them clickable.
+    glossary = ctx.get("glossary")
+    mark = False
+    if isinstance(glossary, model.GlossaryCollector):
+        for key, (label, definition) in _TERM_DEFS.items():
+            glossary.add(key, label, definition)
+        mark = True
+
+    raw_numeric = ctx.get("raw_numeric")
+    raw_numeric = raw_numeric if isinstance(raw_numeric, dict) else {}
+
+    box_fn = _load_build_boxplot_stats()
+    detect_fn = _load_detect_outliers()
+
+    # --- Univariate summary ------------------------------------------------- #
+    uni_rows = []
+    for name, numeric in numerics:
+        uni_rows.append(_univariate_row(
+            name, numeric, raw_numeric.get(name), box_fn, detect_fn))
+    # Rank columns by contamination (Tukey % when available, else z %).
+    uni_rows.sort(key=lambda r: r.get("contamination", -1.0), reverse=True)
+
+    intro = (
+        "Este capítulo reúne en un solo sitio el análisis de los **valores "
+        "atípicos** de la tabla, que en el resto del informe aparecen dispersos. "
+        f"Un {_term(mark, 'outlier', 'atípico')} es una observación que se aparta "
+        "mucho del grueso de los datos. Cada columna numérica se evalúa con dos "
+        f"criterios complementarios: las {_term(mark, 'tukey_fence', 'vallas de Tukey')} "
+        "(fuera de P25−1,5·IQR o P75+1,5·IQR, robusto a la propia cola) y el "
+        f"{_term(mark, 'zscore', 'z-score')} (|z| > 3, sensible a la media). La "
+        "tabla está ordenada de la columna más contaminada a la menos.")
+
blocks = [
|
||||
model.Heading(text=CHAPTER_TITLE, level=1),
|
||||
model.Markdown(text=intro),
|
||||
_univariate_table(uni_rows),
|
||||
]
|
||||
|
||||
# Flag the most contaminated columns explicitly.
|
||||
flagged = [r["name"] for r in uni_rows
|
||||
if r.get("contamination", -1.0) > 0][:_TOP_FLAGGED]
|
||||
if flagged:
|
||||
names = ", ".join(f"**{n}**" for n in flagged)
|
||||
blocks.append(model.Markdown(text=(
|
||||
f"Las columnas con mayor proporción de atípicos son {names}: "
|
||||
"concentran el grueso de los valores fuera de las vallas y son las "
|
||||
"primeras a revisar.")))
|
||||
|
||||
# --- Boxplots figure ---------------------------------------------------- #
|
||||
box_entries = [
|
||||
{"name": r["name"], "box": r["box"], "fliers": r.get("fliers")}
|
||||
for r in uni_rows
|
||||
if r.get("box")
|
||||
][:_TOP_BOX]
|
||||
if box_entries:
|
||||
def _boxplots_make(entries=box_entries):
|
||||
try:
|
||||
from datascience.build_boxplots_figure import build_boxplots_figure
|
||||
return build_boxplots_figure(
|
||||
entries, title="Boxplots de Tukey por columna",
|
||||
max_boxes=_TOP_BOX)
|
||||
except Exception: # noqa: BLE001 — minimal fallback figure.
|
||||
import matplotlib
|
||||
matplotlib.use("Agg")
|
||||
from matplotlib.figure import Figure
|
||||
fig = Figure(figsize=(5.0, 2.2))
|
||||
ax = fig.add_subplot(111)
|
||||
ax.text(0.5, 0.5, "(boxplots no disponibles)",
|
||||
ha="center", va="center")
|
||||
ax.axis("off")
|
||||
return fig
|
||||
|
||||
blocks.append(model.Group(blocks=[
|
||||
model.Heading(text="Boxplots", level=2),
|
||||
model.Markdown(text=(
|
||||
"Cada caja abarca del primer al tercer cuartil (P25–P75), la línea "
|
||||
"interior es la mediana y los bigotes llegan hasta 1,5·IQR; los "
|
||||
"puntos son los valores que caen fuera de las vallas (atípicos por "
|
||||
"Tukey).")),
|
||||
model.Figure(
|
||||
make=_boxplots_make,
|
||||
caption="Boxplots de Tukey de las columnas más contaminadas."),
|
||||
]))
|
||||
|
||||
# --- Multivariate ------------------------------------------------------- #
|
||||
outliers, _src = _resolve_multivariate(profile, ctx, raw_numeric)
|
||||
if outliers is not None:
|
||||
blocks.extend(_multivariate_blocks(outliers, raw_numeric, mark))
|
||||
else:
|
||||
blocks.append(model.Heading(text="Filas atípicas (multivariante)", level=2))
|
||||
blocks.append(model.Note(
|
||||
"No se pudo analizar la anomalía multivariante: hacen falta al menos "
|
||||
"dos columnas numéricas y la muestra cruda (o los modelos del perfil) "
|
||||
"para correr Isolation Forest."))
|
||||
|
||||
# --- Interpretation ----------------------------------------------------- #
|
||||
blocks.append(model.Heading(text="Cómo interpretar los atípicos", level=2))
|
||||
blocks.append(_interpretation_block(mark))
|
||||
|
||||
return model.Chapter(id=CHAPTER_ID, title=CHAPTER_TITLE,
|
||||
version=CHAPTER_VERSION, blocks=blocks)
|
||||
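The two univariate criteria named in the chapter intro (Tukey fences at P25 - 1.5·IQR and P75 + 1.5·IQR, and |z| > 3) can be sketched as below. This is a minimal standalone illustration, not the project's `detect_outliers` or `build_boxplot_stats`, whose code is not part of this diff. The helper names `percentile`, `tukey_outliers` and `zscore_outliers` are hypothetical, and the linear-interpolation percentile and population standard deviation are assumptions.

```python
import math


def percentile(sorted_vals, q):
    """Linear-interpolation percentile (q in 0..1) on a sorted, non-empty list."""
    pos = q * (len(sorted_vals) - 1)
    lo, hi = math.floor(pos), math.ceil(pos)
    frac = pos - lo
    return sorted_vals[lo] * (1 - frac) + sorted_vals[hi] * frac


def tukey_outliers(values, k=1.5):
    """Values strictly outside [P25 - k*IQR, P75 + k*IQR]."""
    if not values:
        return []
    s = sorted(values)
    p25, p75 = percentile(s, 0.25), percentile(s, 0.75)
    iqr = p75 - p25
    low, high = p25 - k * iqr, p75 + k * iqr
    return [v for v in values if v < low or v > high]


def zscore_outliers(values, threshold=3.0):
    """Values whose |z| exceeds the threshold (population std, assumed)."""
    if not values:
        return []
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    if std == 0:
        return []  # a constant column has no z-score outliers
    return [v for v in values if abs((v - mean) / std) > threshold]


# Toy column: 1..20 plus one extreme value.
values = [float(v) for v in range(1, 21)] + [100.0]
tukey = tukey_outliers(values)
zscore = zscore_outliers(values)
```

On this toy column both rules flag only the extreme value. On heavier tails they can disagree, because the extreme values inflate the mean and the standard deviation that the z-score relies on, which is why the chapter reports both.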
@@ -0,0 +1,304 @@
|
||||
"""Tests for the OUTLIERS chapter — DoD: golden + edges + error path.
|
||||
|
||||
Self-contained: builds synthetic ``numeric`` blocks + a raw_numeric sample (no
|
||||
DuckDB) so the suite is fast and deterministic. Verifies that the chapter emits
|
||||
the univariate per-column table, a boxplots figure, the multivariate Isolation
|
||||
Forest section and the outlier≠error interpretation; that the most contaminated
|
||||
column is ranked first; that a profile with no numeric column yields None; that
|
||||
None/empty never raises; that the glossary terms are registered; and that the
|
||||
chapter renders into both PDF and PPTX without cutting its title.
|
||||
"""
|
||||
|
||||
import math
|
||||
import os
|
||||
import re
|
||||
import tempfile
|
||||
|
||||
from pypdf import PdfReader
|
||||
|
||||
from datascience.automatic_eda.chapters.outliers import (
|
||||
build_outliers, CHAPTER_VERSION, CHAPTER_TITLE, _TERM_DEFS,
|
||||
)
|
||||
from datascience.automatic_eda import model
|
||||
from datascience.render_automatic_eda_pdf import render_automatic_eda_pdf
|
||||
from datascience.render_automatic_eda_pptx import render_automatic_eda_pptx
|
||||
|
||||
|
||||
def _percentile(sorted_vals, q):
    """Linear-interpolation percentile (q in 0..1) on an already-sorted list."""
    if not sorted_vals:
        return None
    if len(sorted_vals) == 1:
        return float(sorted_vals[0])
    pos = q * (len(sorted_vals) - 1)
    lo = int(math.floor(pos))
    hi = int(math.ceil(pos))
    if lo == hi:
        return float(sorted_vals[lo])
    frac = pos - lo
    return float(sorted_vals[lo] * (1 - frac) + sorted_vals[hi] * frac)
|
||||
|
||||
|
||||
def _col_from_values(values, nbins=10):
|
||||
"""Build a ``numeric`` sub-block shaped like describe_numeric's output from a
|
||||
concrete list of raw values, so the profile percentiles and the raw sample
|
||||
are consistent (the boxplot fences match the raw sample)."""
|
||||
vals = [float(v) for v in values]
|
||||
s = sorted(vals)
|
||||
n = len(s)
|
||||
mean = sum(vals) / n
|
||||
var = sum((v - mean) ** 2 for v in vals) / n
|
||||
std = math.sqrt(var)
|
||||
median = _percentile(s, 0.5)
|
||||
p25 = _percentile(s, 0.25)
|
||||
p75 = _percentile(s, 0.75)
|
||||
mn, mx = s[0], s[-1]
|
||||
# z-score outlier count (population), what the profile's n_outliers carries.
|
||||
n_out = sum(1 for v in vals if std > 0 and abs((v - mean) / std) > 3.0)
|
||||
width = (mx - mn) / nbins if mx > mn else 1.0
|
||||
hist = [{"lo": mn + i * width, "hi": mn + (i + 1) * width, "count": 1}
|
||||
for i in range(nbins)]
|
||||
return {
|
||||
"min": mn, "max": mx, "mean": mean, "median": median, "std": std,
|
||||
"p25": p25, "p50": median, "p75": p75, "iqr": (p75 - p25),
|
||||
"n_outliers": n_out, "outlier_pct": 100.0 * n_out / n,
|
||||
"distribution_type": "right-skewed", "histogram": hist,
|
||||
}
|
||||
|
||||
|
||||
def _fare_values():
|
||||
"""A heavy-tailed column (bulk 7-31, a few 180-512): clear Tukey/z outliers."""
|
||||
base = [7.0 + (i % 25) for i in range(120)] # bulk 7..31
|
||||
tail = [180.0, 210.0, 263.0, 512.0] # extreme upper tail
|
||||
return base + tail
|
||||
|
||||
|
||||
def _age_values():
|
||||
"""A roughly symmetric column with a few extreme values at both ends."""
|
||||
base = [22.0 + (i % 40) for i in range(120)] # 22..61
|
||||
return base + [80.0, 0.5, 74.0, 1.0]
|
||||
|
||||
|
||||
def _quiet_values():
|
||||
"""A clean column with no atypical values."""
|
||||
return [50.0 + (i % 5) for i in range(124)]
|
||||
|
||||
|
||||
def _profile_and_ctx(with_models=True, with_raw=True):
|
||||
fare = _fare_values()
|
||||
age = _age_values()
|
||||
quiet = _quiet_values()
|
||||
cols = [
|
||||
{"name": "Fare", "inferred_type": "numeric", "numeric": _col_from_values(fare)},
|
||||
{"name": "Age", "inferred_type": "numeric", "numeric": _col_from_values(age)},
|
||||
{"name": "Quiet", "inferred_type": "numeric", "numeric": _col_from_values(quiet)},
|
||||
{"name": "Sexo", "inferred_type": "categorical",
|
||||
"categorical": {"top": [{"value": "male", "count": 80}]}},
|
||||
]
|
||||
profile = {"table": "titanic", "n_rows": len(fare), "n_cols": len(cols),
|
||||
"columns": cols}
|
||||
if with_models:
|
||||
profile["models"] = {
|
||||
"outliers": {
|
||||
"n_outliers": 4, "outlier_pct": 3.2,
|
||||
"outlier_rows": [
|
||||
{"row_index": 123, "score": -0.21},
|
||||
{"row_index": 121, "score": -0.15},
|
||||
],
|
||||
"threshold": -0.02, "n_rows_used": 124, "n_features": 3,
|
||||
}
|
||||
}
|
||||
ctx = {}
|
||||
if with_raw:
|
||||
ctx["raw_numeric"] = {"Fare": fare, "Age": age, "Quiet": quiet}
|
||||
return profile, ctx
|
||||
|
||||
|
||||
def _pdf_text(path: str) -> str:
|
||||
txt = "".join((pg.extract_text() or "") for pg in PdfReader(path).pages)
|
||||
return re.sub(r"\s+", " ", txt)
|
||||
|
||||
|
||||
def _flatten(blocks):
    out = []
    for b in blocks:
        if getattr(b, "kind", "") == "group":
            out.extend(_flatten(getattr(b, "blocks", []) or []))
        else:
            out.append(b)
    return out
|
||||
|
||||
|
||||
# --------------------------------------------------------------------------- #
|
||||
# Golden.
|
||||
# --------------------------------------------------------------------------- #
|
||||
def test_golden_estructura_y_secciones():
|
||||
profile, ctx = _profile_and_ctx()
|
||||
ctx["glossary"] = model.GlossaryCollector()
|
||||
ch = build_outliers(profile, ctx)
|
||||
assert ch is not None
|
||||
assert ch.id == "outliers"
|
||||
assert ch.version == CHAPTER_VERSION
|
||||
|
||||
flat = _flatten(ch.blocks)
|
||||
kinds = [b.kind for b in flat]
|
||||
# Title heading + univariate DataTable + boxplots Figure + multivariate
|
||||
# KVTable + interpretation Markdown.
|
||||
assert kinds[0] == "heading" and flat[0].text == CHAPTER_TITLE
|
||||
tables = [b for b in flat if b.kind == "data_table"]
|
||||
titles = [t.title for t in tables]
|
||||
assert any(t and "atípicos por columna" in t for t in titles)
|
||||
assert any(b.kind == "figure" for b in flat), "falta la figura de boxplots"
|
||||
assert any(b.kind == "kv_table" for b in flat), "falta el resumen multivariante"
|
||||
|
||||
# The boxplots figure maker yields a real matplotlib figure (or its fallback).
|
||||
fig = next(b for b in flat if b.kind == "figure").make()
|
||||
assert fig is not None
|
||||
import matplotlib.pyplot as plt
|
||||
plt.close(fig)
|
||||
|
||||
|
||||
def test_golden_fare_es_la_mas_contaminada():
|
||||
# The univariate table must rank Fare (heavy tail) first and report a
|
||||
# non-zero Tukey percentage for it.
|
||||
profile, ctx = _profile_and_ctx()
|
||||
ch = build_outliers(profile, ctx)
|
||||
table = next(b for b in _flatten(ch.blocks)
|
||||
if b.kind == "data_table" and b.title
|
||||
and "atípicos por columna" in b.title)
|
||||
first_col = table.rows[0][0]
|
||||
assert first_col == "Fare", f"esperaba Fare primera, fue {first_col}"
|
||||
# % Tukey column (index 2) of the first row must be > 0.
|
||||
pct_cell = table.rows[0][2]
|
||||
assert pct_cell not in ("—", "0%", "0.00%"), f"% Tukey de Fare vacío: {pct_cell}"
|
||||
# The z-score rule (detect_outliers) must actually run with raw_numeric: at
|
||||
# least one column reports a non-empty z count/percentage (regression guard
|
||||
# for the detect_outliers import path).
|
||||
z_pcts = [r[4] for r in table.rows]
|
||||
assert any(c not in ("—",) for c in z_pcts), f"columna z toda vacía: {z_pcts}"
|
||||
z_counts = [r[3] for r in table.rows]
|
||||
assert any(c not in ("—",) for c in z_counts), f"conteo z vacío: {z_counts}"
|
||||
|
||||
|
||||
def test_golden_interpretacion_outlier_no_es_error():
|
||||
profile, ctx = _profile_and_ctx()
|
||||
ch = build_outliers(profile, ctx)
|
||||
md = " ".join(b.text for b in _flatten(ch.blocks) if b.kind == "markdown")
|
||||
assert "no es necesariamente un error" in md.lower()
|
||||
# Mentions the actionable options (winsorize / re-express).
|
||||
assert "winsoriz" in md.lower()
|
||||
assert "re-expres" in md.lower() or "logarítmic" in md.lower()
|
||||
|
||||
|
||||
def test_golden_terminos_glosario_registrados():
|
||||
profile, ctx = _profile_and_ctx()
|
||||
gloss = model.GlossaryCollector()
|
||||
ctx["glossary"] = gloss
|
||||
build_outliers(profile, ctx)
|
||||
for key in _TERM_DEFS:
|
||||
assert gloss.has(key), f"término '{key}' no registrado en el glosario"
|
||||
# Terms are marked clickable in the body text.
|
||||
md = " ".join(b.text for b in _flatten(build_outliers(profile, ctx).blocks)
|
||||
if b.kind == "markdown")
|
||||
assert "[[term:outlier]]" in md and "[[term:tukey_fence]]" in md
|
||||
|
||||
|
||||
# --------------------------------------------------------------------------- #
|
||||
# Multivariate.
|
||||
# --------------------------------------------------------------------------- #
|
||||
def test_multivariante_live_con_raw_y_dims():
|
||||
# With a raw sample the chapter runs Isolation Forest live (over the same
|
||||
# columns summarize_outlier_dims uses) and lists the anomalous rows with the
|
||||
# dimensions that make each one rare.
|
||||
profile, ctx = _profile_and_ctx(with_models=False, with_raw=True)
|
||||
ch = build_outliers(profile, ctx)
|
||||
flat = _flatten(ch.blocks)
|
||||
kv = next(b for b in flat if b.kind == "kv_table")
|
||||
flat_kv = " ".join(f"{k} {v}" for (k, v) in kv.rows)
|
||||
assert "Filas atípicas" in flat_kv
|
||||
# A non-zero number of anomalous rows is reported.
|
||||
n_cell = dict(kv.rows).get("Filas atípicas")
|
||||
assert n_cell not in (None, "—", "0"), f"sin filas atípicas: {n_cell}"
|
||||
# The anomalous-rows table carries the per-row dimension breakdown.
|
||||
tbls = [b for b in flat if b.kind == "data_table" and b.title
|
||||
and "más atípicas" in b.title]
|
||||
assert tbls, "falta la tabla de filas más atípicas"
|
||||
assert any("hacen rara" in h for h in tbls[0].header), \
|
||||
f"falta la columna de dimensiones: {tbls[0].header}"
|
||||
|
||||
|
||||
def test_multivariante_precomputed_sin_raw():
|
||||
# Without a raw sample the chapter falls back to profile['models']['outliers']
|
||||
# (lite preset path); the precomputed n_outliers (4) surfaces in the KV table.
|
||||
profile, ctx = _profile_and_ctx(with_models=True, with_raw=False)
|
||||
ch = build_outliers(profile, ctx)
|
||||
kv = next(b for b in _flatten(ch.blocks) if b.kind == "kv_table")
|
||||
assert any("4" in str(v) for (k, v) in kv.rows)
|
||||
|
||||
|
||||
def test_multivariante_ausente_degrada_a_nota():
|
||||
# No models and no raw sample → an honest note, never a crash.
|
||||
profile, ctx = _profile_and_ctx(with_models=False, with_raw=False)
|
||||
ch = build_outliers(profile, ctx)
|
||||
assert ch is not None
|
||||
notes = [b.text for b in _flatten(ch.blocks) if b.kind == "note"]
|
||||
assert any("Isolation Forest" in n for n in notes)
|
||||
|
||||
|
||||
# --------------------------------------------------------------------------- #
|
||||
# Edges / error path.
|
||||
# --------------------------------------------------------------------------- #
|
||||
def test_edge_sin_columnas_numericas_devuelve_none():
|
||||
prof = {"columns": [{"name": "c", "inferred_type": "categorical",
|
||||
"categorical": {"top": [{"value": "x", "count": 3}]}}]}
|
||||
assert build_outliers(prof, {}) is None
|
||||
|
||||
|
||||
def test_edge_solo_texto_sintetico_devuelve_none():
|
||||
# A text-only synthetic table (no numeric column) yields None (does not break).
|
||||
prof = {"table": "notas", "n_rows": 3, "n_cols": 1,
|
||||
"columns": [{"name": "comentario", "inferred_type": "text",
|
||||
"text": {"n_docs": 3}}]}
|
||||
assert build_outliers(prof, {}) is None
|
||||
|
||||
|
||||
def test_edge_profile_none_y_vacio_no_revienta():
|
||||
assert build_outliers(None, None) is None
|
||||
assert build_outliers({}, {}) is None
|
||||
assert build_outliers({"columns": []}, {}) is None
|
||||
|
||||
|
||||
def test_edge_sin_raw_numeric_degrada_a_perfil():
|
||||
# Without raw_numeric the chapter still builds, using the profile z-score
|
||||
# counts; the univariate table exists and Tukey counts degrade to '—'.
|
||||
profile, ctx = _profile_and_ctx(with_models=True, with_raw=False)
|
||||
ch = build_outliers(profile, ctx)
|
||||
assert ch is not None
|
||||
table = next(b for b in _flatten(ch.blocks)
|
||||
if b.kind == "data_table" and b.title
|
||||
and "atípicos por columna" in b.title)
|
||||
# Only the row shape is asserted here: each row keeps its 8 cells without raw data.
|
||||
assert all(len(r) == 8 for r in table.rows)
|
||||
|
||||
|
||||
# --------------------------------------------------------------------------- #
|
||||
# Anti-cut render.
|
||||
# --------------------------------------------------------------------------- #
|
||||
def test_render_pdf_y_pptx_incluyen_el_capitulo():
|
||||
profile, ctx = _profile_and_ctx()
|
||||
# The renderers always run the full chapter registry, so the chapter is not
# rendered standalone: pass the profile directly and check that the OUTLIERS
# chapter shows up in the rendered output.
|
||||
with tempfile.TemporaryDirectory() as d:
|
||||
pdf = os.path.join(d, "out.pdf")
|
||||
res_pdf = render_automatic_eda_pdf(profile, pdf,
|
||||
{"write_manifest": False, "ctx": ctx})
|
||||
assert res_pdf["path"] == pdf
|
||||
txt = _pdf_text(pdf)
|
||||
assert CHAPTER_TITLE in txt, "el capítulo OUTLIERS no aparece en el PDF"
|
||||
assert "Fare" in txt
|
||||
pptx = os.path.join(d, "out.pptx")
|
||||
res_pptx = render_automatic_eda_pptx(profile, pptx,
|
||||
{"write_manifest": False, "ctx": ctx})
|
||||
assert res_pptx["path"] == pptx
|
||||
assert res_pptx["n_slides"] >= 1
|
||||
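The three multivariate tests above imply an order of precedence for the Isolation Forest section: a raw sample runs the analysis live, a precomputed `profile['models']['outliers']` block is the fallback, and with neither the chapter degrades to a note. A minimal sketch of that decision follows. The real `_resolve_multivariate` is not shown in this diff, so the function name, the return labels and the two-column requirement (taken from the text of the note) are assumptions.

```python
def resolve_multivariate_source(profile, raw_numeric):
    """Return 'live', 'precomputed' or None (hypothetical helper)."""
    raw = raw_numeric if isinstance(raw_numeric, dict) else {}
    # Live analysis needs at least two numeric columns that carry values.
    usable = [name for name, vals in raw.items() if vals]
    if len(usable) >= 2:
        return "live"
    # Otherwise fall back to what the calculation phase stored in the profile.
    models = (profile or {}).get("models")
    if isinstance(models, dict) and isinstance(models.get("outliers"), dict):
        return "precomputed"
    # Nothing to work with: the caller emits a note instead of a table.
    return None


live = resolve_multivariate_source({}, {"Fare": [7.0, 512.0], "Age": [22.0, 80.0]})
precomputed = resolve_multivariate_source(
    {"models": {"outliers": {"n_outliers": 4}}}, {})
absent = resolve_multivariate_source({}, None)
```

Returning `None` rather than raising keeps the caller's error path to a single `if`, which matches the "honest note, never a crash" behaviour the last test checks.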
@@ -7,11 +7,21 @@ as needed, the renderers paginate):
|
||||
NOT carry the raw head, so this is read from ``ctx['head_rows']`` /
|
||||
``profile['head_rows']`` (a list of row dicts). When absent the chapter shows
|
||||
an honest placeholder documenting the missing key instead of inventing data.
|
||||
2. Column dictionary — name / type / nulls / non-null examples. Examples come
|
||||
2. Column dictionary — name / type / nulls / non-null examples plus, when the
|
||||
LLM layer ran, the business **description** and **unit** of each column so the
|
||||
reader knows at a glance what every column is and in which unit. Examples come
|
||||
from ``columns[i]['examples']`` when present; otherwise they are derived from
|
||||
real non-null profile values (categorical top values, numeric min/median/max)
|
||||
so the cell is never empty nor fabricated.
|
||||
3. ``df.describe`` — mean / median / min / max / std for every numeric column.
|
||||
3. ``df.describe`` — mean / median / min / max / std for every numeric column,
|
||||
plus its **unit** (same LLM source) so the stats read in context.
|
||||
|
||||
The description/unit come from the ``llm`` block that ``eda_llm_insights`` (group
|
||||
``eda``) already stored in the profile (``profile['llm']['dictionary']``, a list
|
||||
of ``{"column","description","business_meaning","unit"}`` entries) — this chapter
|
||||
only **consumes** it, matching by column name; it never calls the LLM nor
|
||||
recomputes anything. When the block is absent (``run_llm`` did not run) those
|
||||
cells degrade to ``"—"`` and the tables still render.
|
||||
|
||||
Contract: build_<id>(profile, ctx) -> Chapter | None ; CHAPTER_VERSION = "x.y.z".
|
||||
"""
|
||||
@@ -20,13 +30,59 @@ from __future__ import annotations
|
||||
|
||||
from .. import model
|
||||
|
||||
-CHAPTER_VERSION = "1.1.0"
+CHAPTER_VERSION = "1.2.0"
|
||||
CHAPTER_ID = "overview"
|
||||
CHAPTER_TITLE = "Overview"
|
||||
|
||||
# Profile/ctx keys the calculation phase must add for a full head + examples.
|
||||
HEAD_KEY = "head_rows" # list[dict] — df.head(n)
|
||||
EXAMPLES_KEY = "examples" # per column: list of non-null sample values
|
||||
LLM_KEY = "llm" # interpretive block from eda_llm_insights
|
||||
|
||||
|
||||
def _llm_dict_index(profile: dict, ctx: dict) -> dict:
    """Map column name -> its LLM dictionary entry (description/unit/...).

    Reads the ``llm.dictionary`` list that ``eda_llm_insights`` stored in the
    profile (``profile['llm']``; falls back to ``ctx['llm']``). Returns an empty
    dict when no LLM block ran, so the caller degrades to "—" cells. Fully
    defensive: never raises on malformed input.
    """
    llm = profile.get(LLM_KEY)
    if not isinstance(llm, dict):
        llm = ctx.get(LLM_KEY)
    if not isinstance(llm, dict):
        return {}
    entries = llm.get("dictionary")
    if not isinstance(entries, (list, tuple)):
        return {}
    index: dict = {}
    for e in entries:
        if not isinstance(e, dict):
            continue
        col = e.get("column")
        if col is None:
            continue
        index[model._safe_str(col)] = e
    return index
|
||||
|
||||
|
||||
def _llm_desc(entry) -> str:
    """Business description of a column from its LLM entry, or "—"."""
    if not isinstance(entry, dict):
        return "—"
    raw = entry.get("description") or entry.get("business_meaning")
    text = " ".join(model._safe_str(raw).split()) if raw is not None else ""
    return text or "—"
|
||||
|
||||
|
||||
def _llm_unit(entry) -> str:
    """Unit of a column from its LLM entry, or "—"."""
    if not isinstance(entry, dict):
        return "—"
    raw = entry.get("unit")
    text = " ".join(model._safe_str(raw).split()) if raw is not None else ""
    return text or "—"
|
||||
|
||||
|
||||
def _fmt_num(value, decimals: int = 3) -> str:
|
||||
@@ -104,9 +160,12 @@ def _head_block(profile: dict, ctx: dict):
|
||||
"pasarlo en ctx['head_rows'] para mostrar las primeras filas.")
|
||||
|
||||
|
||||
def _columns_block(profile: dict):
|
||||
def _columns_block(profile: dict, llm_index: dict):
|
||||
cols = profile.get("columns") or []
|
||||
header = ["Columna", "Tipo", "Nulos", "Ejemplos (no nulos)"]
|
||||
# Descripción / Unidad come from the LLM dictionary (matched by column name);
|
||||
# they read "—" when run_llm did not run, so the table always renders.
|
||||
header = ["Columna", "Tipo", "Nulos", "Ejemplos (no nulos)",
|
||||
"Descripción", "Unidad"]
|
||||
rows = []
|
||||
for c in cols:
|
||||
if not isinstance(c, dict):
|
||||
@@ -126,15 +185,18 @@ def _columns_block(profile: dict):
|
||||
nulls = str(null_count)
|
||||
else:
|
||||
nulls = "—"
|
||||
rows.append([name, ctype, nulls, _examples_for(c)])
|
||||
entry = llm_index.get(model._safe_str(name))
|
||||
rows.append([name, ctype, nulls, _examples_for(c),
|
||||
_llm_desc(entry), _llm_unit(entry)])
|
||||
if not rows:
|
||||
return None
|
||||
return model.DataTable(header=header, rows=rows, title="Columnas")
|
||||
|
||||
|
||||
def _describe_block(profile: dict):
|
||||
def _describe_block(profile: dict, llm_index: dict):
|
||||
cols = profile.get("columns") or []
|
||||
header = ["Columna", "mean", "median", "min", "max", "std"]
|
||||
# "Unidad" (LLM source) lets the reader know in which unit each stat is.
|
||||
header = ["Columna", "mean", "median", "min", "max", "std", "Unidad"]
|
||||
rows = []
|
||||
for c in cols:
|
||||
if not isinstance(c, dict) or c.get("inferred_type") != "numeric":
|
||||
@@ -142,13 +204,16 @@ def _describe_block(profile: dict):
|
||||
num = c.get("numeric") or {}
|
||||
if not num:
|
||||
continue
|
||||
name = c.get("name") or "(col)"
|
||||
entry = llm_index.get(model._safe_str(name))
|
||||
rows.append([
|
||||
c.get("name") or "(col)",
|
||||
name,
|
||||
_fmt_num(num.get("mean")),
|
||||
_fmt_num(num.get("median")),
|
||||
_fmt_num(num.get("min")),
|
||||
_fmt_num(num.get("max")),
|
||||
_fmt_num(num.get("std")),
|
||||
_llm_unit(entry),
|
||||
])
|
||||
if not rows:
|
||||
return None
|
||||
@@ -163,16 +228,18 @@ def build_overview(profile: dict, ctx: dict):
|
||||
if not cols and not (ctx.get(HEAD_KEY) or profile.get(HEAD_KEY)):
|
||||
return None
|
||||
|
||||
llm_index = _llm_dict_index(profile, ctx)
|
||||
|
||||
blocks = [
|
||||
model.Heading(text="Primeras filas (df.head)", level=2),
|
||||
_head_block(profile, ctx),
|
||||
]
|
||||
cols_block = _columns_block(profile)
|
||||
cols_block = _columns_block(profile, llm_index)
|
||||
if cols_block is not None:
|
||||
blocks.append(model.Heading(
|
||||
text="Diccionario de columnas", level=2))
|
||||
blocks.append(cols_block)
|
||||
desc_block = _describe_block(profile)
|
||||
desc_block = _describe_block(profile, llm_index)
|
||||
if desc_block is not None:
|
||||
blocks.append(model.Heading(
|
||||
text="Resumen estadístico numérico", level=2))
|
||||
|
||||
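The name-matched lookup that feeds the new "Descripción" and "Unidad" cells can be tried in isolation with the sketch below. It is a simplified stand-in for `_llm_dict_index`, `_llm_desc` and `_llm_unit`, not a copy: `index_by_column`, `cell` and `PLACEHOLDER` are hypothetical names, plain `str()` replaces `model._safe_str`, and the fallback walks the keys one by one instead of using `or`.

```python
PLACEHOLDER = "\u2014"  # the dash the tables show for a missing value


def index_by_column(llm_block):
    """Map column name -> dictionary entry, skipping malformed entries."""
    entries = llm_block.get("dictionary") if isinstance(llm_block, dict) else None
    if not isinstance(entries, (list, tuple)):
        return {}
    return {str(e["column"]): e for e in entries
            if isinstance(e, dict) and e.get("column") is not None}


def cell(entry, *keys):
    """First non-empty value among keys, whitespace-collapsed, else the placeholder."""
    if not isinstance(entry, dict):
        return PLACEHOLDER
    for key in keys:
        text = " ".join(str(entry.get(key) or "").split())
        if text:
            return text
    return PLACEHOLDER


llm = {"dictionary": [
    {"column": "Pclass", "description": "Clase  del billete", "unit": "clase (1-3)"},
    "malformed entry",  # ignored: not a dict
]}
index = index_by_column(llm)

# A column with an LLM entry gets its description and unit.
row_pclass = [cell(index.get("Pclass"), "description", "business_meaning"),
              cell(index.get("Pclass"), "unit")]
# A column without an entry degrades to the placeholder in both cells.
row_survived = [cell(index.get("Survived"), "description", "business_meaning"),
                cell(index.get("Survived"), "unit")]
```

Because a missing entry and a missing key both end in the placeholder, the table builders never need to branch on whether the LLM layer ran.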
@@ -56,7 +56,21 @@ def _head_rows() -> list:
|
||||
]
|
||||
|
||||
|
||||
def _profile(with_head: bool = True) -> dict:
|
||||
def _llm() -> dict:
|
||||
"""Interpretive block as eda_llm_insights stores it under profile['llm']."""
|
||||
return {
|
||||
"summary": "Pasajeros del Titanic.",
|
||||
"dictionary": [
|
||||
{"column": "PassengerId", "description": "Identificador del pasajero",
|
||||
"business_meaning": "Clave única de cada pasajero", "unit": "id"},
|
||||
{"column": "Pclass", "description": "Clase del billete",
|
||||
"business_meaning": "Clase socioeconómica", "unit": "clase (1-3)"},
|
||||
# No entry for Survived/Name/Sex on purpose -> they degrade to "—".
|
||||
],
|
||||
}
|
||||
|
||||
|
||||
def _profile(with_head: bool = True, with_llm: bool = False) -> dict:
|
||||
prof = {
|
||||
"table": "titanic",
|
||||
"source": "/data/titanic.csv",
|
||||
@@ -68,6 +82,8 @@ def _profile(with_head: bool = True) -> dict:
|
||||
}
|
||||
if with_head:
|
||||
prof["head_rows"] = _head_rows()
|
||||
if with_llm:
|
||||
prof["llm"] = _llm()
|
||||
return prof
|
||||
|
||||
|
||||
@@ -185,3 +201,70 @@ def test_edge_none_y_vacio_no_rompen():
|
||||
assert ch is not None
|
||||
tables = [b for b in _flatten(ch.blocks) if isinstance(b, DataTable)]
|
||||
assert tables and len(tables[0].rows) == 3
|
||||
|
||||
|
||||
def _table_by_header(blocks, marker: str):
|
||||
"""Return the first DataTable whose header contains ``marker``."""
|
||||
for b in _flatten(blocks):
|
||||
if isinstance(b, DataTable) and marker in b.header:
|
||||
return b
|
||||
return None
|
||||
|
||||
|
||||
def test_golden_diccionario_lleva_descripcion_y_unidad_del_llm():
|
||||
# With run_llm: the column dictionary gains "Descripción" and "Unidad"
|
||||
# columns populated from profile['llm']['dictionary'], matched by name.
|
||||
ch = build_overview(_profile(with_llm=True), {})
|
||||
assert ch is not None
|
||||
dic = _table_by_header(ch.blocks, "Descripción")
|
||||
assert dic is not None
|
||||
assert dic.header == ["Columna", "Tipo", "Nulos", "Ejemplos (no nulos)",
|
||||
"Descripción", "Unidad"]
|
||||
by_name = {row[0]: row for row in dic.rows}
|
||||
# PassengerId has an LLM entry -> description + unit populated.
|
||||
assert by_name["PassengerId"][4] == "Identificador del pasajero"
|
||||
assert by_name["PassengerId"][5] == "id"
|
||||
assert by_name["Pclass"][5] == "clase (1-3)"
|
||||
# Columns with no LLM entry degrade to "—" without breaking the row.
|
||||
assert by_name["Survived"][4] == "—" and by_name["Survived"][5] == "—"
|
||||
|
||||
|
||||
def test_golden_describe_lleva_unidad_del_llm():
|
||||
ch = build_overview(_profile(with_llm=True), {})
|
||||
desc = _table_by_header(ch.blocks, "std")
|
||||
assert desc is not None
|
||||
assert desc.header[-1] == "Unidad"
|
||||
by_name = {row[0]: row for row in desc.rows}
|
||||
assert by_name["PassengerId"][-1] == "id"
|
||||
assert by_name["Pclass"][-1] == "clase (1-3)"
|
||||
# Numeric column with no LLM unit still renders, unit "—".
|
||||
assert by_name["Survived"][-1] == "—"
|
||||
|
||||
|
||||
def test_edge_sin_llm_descripcion_unidad_son_guion():
|
||||
# No profile['llm'] at all: the new cells degrade to "—" and nothing breaks.
|
||||
ch = build_overview(_profile(), {})
|
||||
assert ch is not None
|
||||
dic = _table_by_header(ch.blocks, "Unidad")
|
||||
assert dic is not None
|
||||
for row in dic.rows:
|
||||
assert row[4] == "—" and row[5] == "—"
|
||||
desc = _table_by_header(ch.blocks, "std")
|
||||
assert all(row[-1] == "—" for row in desc.rows)
|
||||
|
||||
|
||||
def test_golden_llm_via_ctx_tambien_funciona():
|
||||
# LLM block arriving through ctx['llm'] (fallback path) is consumed too.
|
||||
ch = build_overview(_profile(with_llm=False), {"llm": _llm()})
|
||||
dic = _table_by_header(ch.blocks, "Descripción")
|
||||
by_name = {row[0]: row for row in dic.rows}
|
||||
assert by_name["PassengerId"][5] == "id"
|
||||
|
||||
|
||||
def test_golden_render_pdf_muestra_descripcion_y_unidad():
|
||||
with tempfile.TemporaryDirectory() as d:
|
||||
out = os.path.join(d, "eda.pdf")
|
||||
render_automatic_eda_pdf(_profile(with_llm=True), out, {"title": "EDA"})
|
||||
txt = _pdf_text(out)
|
||||
assert "Descripción" in txt and "Unidad" in txt
|
||||
assert "Identificador del pasajero" in txt
|
||||
|
||||
@@ -26,7 +26,7 @@ from datetime import datetime, timezone
|
||||
|
||||
from .. import model
|
||||
|
||||
-CHAPTER_VERSION = "1.2.0"
+CHAPTER_VERSION = "1.4.0"
|
||||
CHAPTER_ID = "portada"
|
||||
CHAPTER_TITLE = "Portada"
|
||||
|
||||
@@ -35,12 +35,9 @@ CHAPTER_TITLE = "Portada"
|
||||
# row represents) from it when the LLM layer ran (``run_llm``).
|
||||
_LLM_KEY = "llm"
|
||||
|
||||
# Default human description of what the table quality score measures. Chapters
|
||||
# can override it via ctx["quality_criteria"].
|
||||
_DEFAULT_QUALITY_CRITERIA = (
|
||||
"media de los scores por columna (0–100): completitud (sin nulos/vacíos), "
|
||||
"validez (tipo y rango coherentes) y consistencia (sin duplicados/constantes)."
|
||||
)
|
||||
# Font size (pt) for the dataset name on the PPTX cover slide — notably larger
|
||||
# than the default H1 so the dataset name stands out (shown underlined too).
|
||||
_PPTX_TITLE_PT = 44.0
|
||||
|
||||
|
||||
def _storage_from_source(source: str) -> str:
|
||||
@@ -120,11 +117,20 @@ def _summary_blocks(summary) -> list:
|
||||
|
||||
blocks = [model.Heading(text="Resumen del análisis", level=2)]
|
||||
if rows:
|
||||
blocks.append(model.KVTable(rows=rows))
|
||||
# Values pinned to the right margin (numbers flush right, label left).
|
||||
blocks.append(model.KVTable(rows=rows, value_align="right"))
|
||||
if titles:
|
||||
bullets = "\n".join(f"- {model._safe_str(t)}" for t in titles)
|
||||
blocks.append(model.Markdown(
|
||||
text="Este informe incluye los siguientes capítulos:\n" + bullets))
|
||||
# Clickable index ("Índice"): one TocEntry per chapter title. Each entry
|
||||
# becomes a real jump to that chapter's first page/slide once the document
|
||||
# is laid out (the renderers register every chapter start and wire the
|
||||
# links; ``target_id`` is matched against the chapter title). The cover only
|
||||
# knows chapter titles, so the title doubles as the link target.
|
||||
blocks.append(model.Heading(text="Índice", level=2))
|
||||
for t in titles:
|
||||
label = model._safe_str(t)
|
||||
if not label:
|
||||
continue
|
||||
blocks.append(model.TocEntry(label=label, target_id=label))
|
||||
return blocks
|
||||
|
||||
|
||||
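The comment on the clickable index says each `TocEntry` is matched to a chapter by title once the document is laid out. The sketch below shows one way that matching could work. The renderer side is not part of this diff, so the `TocEntry` dataclass here is a hypothetical stand-in for `model.TocEntry`, and `resolve_toc` and the `chapter_starts` mapping (title to first page) are assumptions.

```python
from dataclasses import dataclass


@dataclass
class TocEntry:  # hypothetical stand-in for model.TocEntry
    label: str
    target_id: str


def resolve_toc(entries, chapter_starts):
    """Pair each entry with the page where its chapter starts, or None if unknown."""
    return [(e.label, chapter_starts.get(e.target_id)) for e in entries]


# The cover only knows titles, so the title doubles as the link target;
# empty titles are skipped, as in the loop above.
titles = ["Overview", "", "Outliers", "Glosario"]
entries = [TocEntry(label=t, target_id=t) for t in titles if t]

# Assumed to be filled by the renderer as it lays out each chapter.
chapter_starts = {"Overview": 2, "Outliers": 7}
links = resolve_toc(entries, chapter_starts)
```

An entry whose chapter was never registered resolves to `None`, so a renderer following this sketch could draw it as plain text instead of a broken link. Matching by title also means two chapters with the same title would collide; a stable chapter id would avoid that if the cover had access to one.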
@@ -213,9 +219,7 @@ def _derive_description(profile: dict, ctx: dict) -> str:
|
||||
score = profile.get("quality_score")
|
||||
if score is not None:
|
||||
parts.append(f"Calidad media estimada: {score}/100.")
|
||||
parts.append(
|
||||
"Resumen derivado del perfil; active la interpretación LLM (`run_llm`) "
|
||||
"para una descripción de negocio más rica.")
|
||||
parts.append("Resumen derivado del perfil.")
|
||||
return " ".join(parts)
|
||||
|
||||
|
||||
@@ -259,7 +263,6 @@ def build_portada(profile: dict, ctx: dict):
|
||||
shape = f"{_fmt_int(n_rows)} filas × {_fmt_int(n_cols)} columnas"
|
||||
|
||||
score = profile.get("quality_score")
|
||||
quality_criteria = ctx.get("quality_criteria") or _DEFAULT_QUALITY_CRITERIA
|
||||
quality_value = "—" if score is None else f"{score} / 100"
|
||||
|
||||
llm = _llm_block(profile, ctx)
|
||||
@@ -282,8 +285,11 @@ def build_portada(profile: dict, ctx: dict):
|
||||
|
||||
# Title + dataset size shown together and BIG (Heading) at the top, kept on
|
||||
# the same page (Group). The size is no longer buried in the metadata table.
|
||||
# The dataset name is shown big and underlined on the PPTX cover slide
|
||||
# (size_pt/underline are honoured by the PPTX renderer; the PDF ignores them).
|
||||
cover = [
|
||||
model.Heading(text=str(dataset_name), level=1),
|
||||
model.Heading(text=str(dataset_name), level=1, underline=True,
|
||||
size_pt=_PPTX_TITLE_PT),
|
||||
model.Markdown(text="**Automatic-EDA** · informe exploratorio automático"),
|
||||
model.Heading(text=shape, level=2),
|
||||
]
|
||||
@@ -295,7 +301,6 @@ def build_portada(profile: dict, ctx: dict):
|
||||
("Almacenamiento", storage),
|
||||
("Generado", when),
|
||||
("Calidad", quality_value),
|
||||
("Criterios de calidad", quality_criteria),
|
||||
]),
|
||||
model.Heading(text="Descripción", level=2),
|
||||
model.Markdown(text=str(description)),
|
||||
|
||||
@@ -34,6 +34,7 @@ CHAPTER_ORDER = [
|
||||
"text_distr", # free-text / NLP distributions (non-tabular content)
|
||||
"calidad", # data quality
|
||||
"missingness", # missing-data patterns (co-occurrence of absences; MCAR/MAR)
|
||||
"outliers", # atypical values: univariate (Tukey/z) + multivariate (IsolationForest)
|
||||
"correlacion", # correlations / associations
|
||||
"relaciones", # key relations: declared/candidate PK + FK (inter/intra-table)
|
||||
"modelos", # cheap models (PCA/KMeans/outliers)
|
||||
@@ -72,24 +73,51 @@ def build_chapter(chapter_id: str, profile: dict, ctx: dict):
|
||||
return model.as_chapter(result)
|
||||
|
||||
|
||||
def build_document(profile: dict, ctx: dict = None) -> list:
|
||||
"""Build the full ordered list of chapters for a TableProfile.
|
||||
def build_document(profile: dict, ctx: dict = None, only: list = None) -> list:
|
||||
"""Build the ordered list of chapters for a TableProfile.
|
||||
|
||||
Args:
|
||||
profile: the ``eda`` group TableProfile dict (may be None/empty).
|
||||
ctx: optional context dict carrying presentation metadata not present in
|
||||
the profile (dataset_name, source_origin, storage, generated_at,
|
||||
description, granularity, quality_criteria, head_rows, ...).
|
||||
only: optional list of chapter ids to render. ``None`` (default) keeps
|
||||
the historical behaviour — every implemented & applicable chapter in
|
||||
canonical order. A list restricts the BODY to just those ids (in
|
||||
canonical order), but the cover (``portada``) and glossary
|
||||
(``glosario``) are ALWAYS included so the document stays valid and
|
||||
the clickable terms keep a destination — so passing ``only=["x"]``
|
||||
yields portada + x + glosario. Unknown ids are simply skipped (the
|
||||
caller is responsible for strict validation). ``only=[]`` yields the
|
||||
minimal document (portada + glosario only). This argument is additive
|
||||
and backward-compatible: existing callers that omit it keep the previous
|
||||
behaviour (default ``None``).
|
||||
|
||||
Returns:
|
||||
list[Chapter] in canonical order, containing only the chapters that are
|
||||
implemented and applicable. Never raises.
|
||||
implemented, applicable and selected. Never raises.
|
||||
"""
|
||||
if not isinstance(profile, dict):
|
||||
profile = {}
|
||||
# Copy ctx so the shared collector / summary we add do not leak to the caller.
|
||||
ctx = dict(ctx) if isinstance(ctx, dict) else {}
|
||||
|
||||
# only=None -> all body chapters (historical). only=list -> restrict body to
|
||||
# that selection (portada/glosario are added unconditionally below). The
|
||||
# renderers call build_document(profile, meta['ctx']) without an `only`
|
||||
# argument, so the pipeline forwards the selection through a reserved ctx key
|
||||
# (``_only_chapters``); an explicit `only` argument always wins. The key is
|
||||
# popped from the local ctx copy so it never reaches the chapters.
|
||||
if only is None:
|
||||
_carried = ctx.pop("_only_chapters", None)
|
||||
if isinstance(_carried, (list, tuple, set)):
|
||||
only = list(_carried)
|
||||
else:
|
||||
ctx.pop("_only_chapters", None)
|
||||
# A set makes the membership test cheap; the iteration order stays
|
||||
# CHAPTER_ORDER. only=[] is a valid (empty) selection -> minimal document.
|
||||
only_set = set(only) if isinstance(only, (list, tuple, set)) else None
|
||||
|
||||
# A single glossary collector is shared by every chapter via ctx['glossary'].
|
||||
# Chapters call ctx['glossary'].add(key, label, definition) and mark in-text
|
||||
# appearances with [[term:key]]…[[/term]]; the glosario chapter renders the
|
||||
@@ -105,6 +133,10 @@ def build_document(profile: dict, ctx: dict = None) -> list:
|
||||
for cid in CHAPTER_ORDER:
|
||||
if cid in (_PORTADA, _GLOSARIO):
|
||||
continue
|
||||
# When a selection is given, skip body chapters outside it. portada and
|
||||
# glosario are never filtered (handled out of this loop).
|
||||
if only_set is not None and cid not in only_set:
|
||||
continue
|
||||
ch = build_chapter(cid, profile, ctx)
|
||||
if ch is not None and ch.blocks:
|
||||
body.append(ch)
|
||||
|
||||
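The selection rules described in the `build_document` docstring can be sketched in isolation. This is a minimal illustration, not the project's implementation: `select_chapter_ids`, the shortened `CHAPTER_ORDER` and the `ALWAYS` tuple are hypothetical stand-ins. The real function builds `Chapter` objects and also drops chapters that produce no blocks.

```python
from typing import List, Optional

# Hypothetical, shortened stand-in for the real CHAPTER_ORDER.
CHAPTER_ORDER = ["portada", "calidad", "missingness", "outliers", "glosario"]
# Cover and glossary are never filtered by the selection.
ALWAYS = ("portada", "glosario")


def select_chapter_ids(only: Optional[List[str]] = None,
                       ctx: Optional[dict] = None) -> List[str]:
    """Return the chapter ids that would be rendered, in canonical order."""
    # Work on a copy so the reserved key never leaks back to the caller.
    ctx = dict(ctx) if isinstance(ctx, dict) else {}
    carried = ctx.pop("_only_chapters", None)
    # An explicit `only` argument wins over the value carried in ctx.
    if only is None and isinstance(carried, (list, tuple, set)):
        only = list(carried)
    # None means "no filter"; an empty list is a valid, empty selection.
    only_set = set(only) if isinstance(only, (list, tuple, set)) else None
    body = [cid for cid in CHAPTER_ORDER
            if cid not in ALWAYS and (only_set is None or cid in only_set)]
    return ["portada"] + body + ["glosario"]
```

Iterating over `CHAPTER_ORDER` rather than over `only` is what keeps the output in canonical order and makes unknown ids drop out silently.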
@@ -38,10 +38,18 @@ ENGINE_NAME = "AutomaticEDA"
|
||||
# --------------------------------------------------------------------------- #
|
||||
@dataclass
|
||||
class Heading:
|
||||
"""A section heading. ``level`` 1 (largest) .. 3 (smallest)."""
|
||||
"""A section heading. ``level`` 1 (largest) .. 3 (smallest).
|
||||
|
||||
``underline`` and ``size_pt`` are optional emphasis hints honoured by the
|
||||
PPTX renderer (the cover uses them to show the dataset name big and
|
||||
underlined). ``size_pt`` overrides the per-level font size when set; the PDF
|
||||
renderer ignores both so its layout is unchanged.
|
||||
"""
|
||||
|
||||
text: str = ""
|
||||
level: int = 1
|
||||
underline: bool = False
|
||||
size_pt: Optional[float] = None
|
||||
kind: str = field(default="heading", init=False)
|
||||
|
||||
|
||||
@@ -62,10 +70,17 @@ class Markdown:
|
||||
|
||||
@dataclass
|
||||
class KVTable:
|
||||
"""A two-column key/value table. ``rows`` is a list of ``(label, value)``."""
|
||||
"""A two-column key/value table. ``rows`` is a list of ``(label, value)``.
|
||||
|
||||
``value_align`` controls the horizontal alignment of the value column in the
|
||||
PDF renderer: ``"left"`` (default) keeps values next to the label column;
|
||||
``"right"`` pins them to the right margin (used by the cover's analysis
|
||||
summary so the numbers line up flush right).
|
||||
"""
|
||||
|
||||
rows: list = field(default_factory=list)
|
||||
title: Optional[str] = None
|
||||
value_align: str = "left"
|
||||
kind: str = field(default="kv_table", init=False)
|
||||
|
||||
|
||||
@@ -145,11 +160,21 @@ class Group:
|
||||
a chapter can give each unit its own page — e.g. one categorical column per
|
||||
page (see CAT DISTR). It is purely additive: the default False keeps the plain
|
||||
keep-together behaviour for every existing chapter.
|
||||
|
||||
``layout`` is a hint for how the group's children are arranged:
|
||||
``"stack"`` (default) keeps the historical top-to-bottom flow; ``"side_by_side"``
|
||||
asks the PPTX renderer to place the group's table to the LEFT and its figure to
|
||||
the RIGHT of the same slide (table ~55% width, figure ~45%), measuring so both
|
||||
fit and falling back to stacking when they do not. The PDF renderer treats
|
||||
``"side_by_side"`` exactly like ``"stack"`` (the A5 mobile page is too narrow for
|
||||
two readable columns). Unknown values degrade to ``"stack"``. Purely additive:
|
||||
the default keeps every existing chapter unchanged.
|
||||
"""
|
||||
|
||||
blocks: list = field(default_factory=list)
|
||||
title: Optional[str] = None
|
||||
page_break_before: bool = False
|
||||
layout: str = "stack"
|
||||
kind: str = field(default="group", init=False)
|
||||
|
||||
|
||||
@@ -168,6 +193,22 @@ class GlossaryEntry:
|
||||
kind: str = field(default="glossary_entry", init=False)
|
||||
|
||||
|
||||
@dataclass
|
||||
class TocEntry:
|
||||
"""One clickable index (table-of-contents) entry shown on the cover.
|
||||
|
||||
Rendered as a single line — the chapter ``label`` in the accent link colour —
|
||||
that, once the document is laid out, becomes a real click jumping to the first
|
||||
page/slide of the target chapter (PDF link annotation via PyMuPDF; PPTX native
|
||||
slide jump). ``target_id`` is matched against each chapter's ``id`` *and* its
|
||||
``title`` (the cover only knows chapter titles), so either resolves. If the
|
||||
target cannot be resolved the entry still renders as plain text (never cut)."""
|
||||
|
||||
label: str = ""
|
||||
target_id: str = ""
|
||||
kind: str = field(default="toc_entry", init=False)
|
||||
|
||||
|
||||
@dataclass
|
||||
class Chapter:
|
||||
"""An ordered set of blocks with an id, a title and a generation version."""
|
||||
@@ -192,13 +233,14 @@ _BLOCK_BY_KIND = {
|
||||
"note": Note,
|
||||
"group": Group,
|
||||
"glossary_entry": GlossaryEntry,
|
||||
"toc_entry": TocEntry,
|
||||
}
|
||||
|
||||
|
||||
def as_block(obj: Any):
|
||||
"""Coerce a value into a block dataclass. Unknown values become a Note."""
|
||||
if isinstance(obj, (Heading, Markdown, KVTable, DataTable, Figure, Image,
|
||||
Caption, Note, Group, GlossaryEntry)):
|
||||
Caption, Note, Group, GlossaryEntry, TocEntry)):
|
||||
if isinstance(obj, Group):
|
||||
obj.blocks = as_blocks(obj.blocks)
|
||||
return obj
|
||||
@@ -210,13 +252,20 @@ def as_block(obj: Any):
|
||||
# Build only with fields the dataclass accepts (ignore extras).
|
||||
try:
|
||||
if cls is Heading:
|
||||
size_pt = obj.get("size_pt")
|
||||
return Heading(text=_safe_str(obj.get("text")),
|
||||
level=int(obj.get("level", 1) or 1))
|
||||
level=int(obj.get("level", 1) or 1),
|
||||
underline=bool(obj.get("underline", False)),
|
||||
size_pt=(float(size_pt)
|
||||
if isinstance(size_pt, (int, float))
|
||||
else None))
|
||||
if cls is Markdown:
|
||||
return Markdown(text=_safe_str(obj.get("text")))
|
||||
if cls is KVTable:
|
||||
return KVTable(rows=list(obj.get("rows") or []),
|
||||
title=obj.get("title"))
|
||||
title=obj.get("title"),
|
||||
value_align=_safe_str(
|
||||
obj.get("value_align")) or "left")
|
||||
if cls is DataTable:
|
||||
return DataTable(header=list(obj.get("header") or []),
|
||||
rows=list(obj.get("rows") or []),
|
||||
@@ -237,11 +286,15 @@ def as_block(obj: Any):
|
||||
return Group(blocks=as_blocks(obj.get("blocks")),
|
||||
title=obj.get("title"),
|
||||
page_break_before=bool(
|
||||
obj.get("page_break_before", False)))
|
||||
obj.get("page_break_before", False)),
|
||||
layout=_safe_str(obj.get("layout")) or "stack")
|
||||
if cls is GlossaryEntry:
|
||||
return GlossaryEntry(key=_safe_str(obj.get("key")),
|
||||
label=_safe_str(obj.get("label")),
|
||||
definition=_safe_str(obj.get("definition")))
|
||||
if cls is TocEntry:
|
||||
return TocEntry(label=_safe_str(obj.get("label")),
|
||||
target_id=_safe_str(obj.get("target_id")))
|
||||
except Exception: # noqa: BLE001 — never raise on a malformed block.
|
||||
return Note(text=_safe_str(obj))
|
||||
return Note(text=_safe_str(obj))
|
||||
|
||||
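The tolerant coercion of a dict into a `Heading` can be tried on its own. The sketch below redeclares a minimal `Heading` dataclass with the fields shown in this diff. `heading_from_dict` is a hypothetical helper name, and `str(... or "")` stands in for the project's `_safe_str`, whose exact behaviour is not shown here. The sketch also omits the `try/except` fallback to `Note`.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Heading:
    text: str = ""
    level: int = 1
    underline: bool = False
    size_pt: Optional[float] = None
    kind: str = field(default="heading", init=False)


def heading_from_dict(obj: dict) -> Heading:
    """Build a Heading from a dict, ignoring keys the dataclass lacks."""
    size_pt = obj.get("size_pt")
    return Heading(
        text=str(obj.get("text") or ""),
        # A missing, None or 0 level falls back to 1.
        level=int(obj.get("level", 1) or 1),
        underline=bool(obj.get("underline", False)),
        # Only real numbers become a size; strings and None are dropped.
        size_pt=float(size_pt) if isinstance(size_pt, (int, float)) else None,
    )
```

Because `bool` is a subclass of `int` in Python, `size_pt=True` would pass the `isinstance` check and become `1.0`. A stricter variant could exclude booleans explicitly.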
@@ -298,11 +298,16 @@ def test_cover_first_glossary_last_with_summary():
|
||||
headings = [b.text for b in cover.blocks if b.kind == "heading"]
|
||||
assert any("Resumen" in h for h in headings), \
|
||||
"la portada no incluye el resumen agregado"
|
||||
# The summary reflects the body chapters (e.g. the numeric/categorical ones).
|
||||
cover_text = " ".join(
|
||||
b.text for b in cover.blocks if getattr(b, "kind", "") == "markdown")
|
||||
assert "Distribuciones" in cover_text, \
|
||||
"el resumen de portada no menciona los capítulos del cuerpo"
|
||||
# The index ("Índice") is now a clickable list of TocEntry blocks (one per
|
||||
# body chapter), not a markdown bullet list. Verify both the heading and that
|
||||
# the entries name the body chapters.
|
||||
assert any("Índice" in h for h in headings), \
|
||||
"la portada no incluye la sección Índice"
|
||||
toc_labels = " ".join(
|
||||
getattr(b, "label", "") for b in cover.blocks
|
||||
if getattr(b, "kind", "") == "toc_entry")
|
||||
assert "Distribuciones" in toc_labels, \
|
||||
"el índice de portada no menciona los capítulos del cuerpo"
|
||||
|
||||
|
||||
# --------------------------------------------------------------------------- #
|
||||
|
||||
@@ -46,11 +46,23 @@ _MUTED = "#8a8a8a"
|
||||
_RULE = "#cccccc"
|
||||
_HEAD_BG = "#eef3f6"
|
||||
|
||||
# Rasterization DPI for every embedded raster (figure/table image) AND for the
|
||||
# page save itself. Raised from the old 150/default-100 to 220 so a reader can
|
||||
# pinch-zoom on a phone and still see crisp detail (axis labels, table cells)
|
||||
# without pixelation. Text stays vector (pdf.fonttype=42) so it remains
|
||||
# selectable regardless of DPI — only the embedded images gain resolution. 220 is
|
||||
# a deliberate balance: noticeably sharper than 150 while keeping the file size
|
||||
# reasonable. ``savefig.dpi`` matters because matplotlib re-rasterizes each
|
||||
# ``imshow`` when PdfPages writes the page; without it the final image would land
|
||||
# at ~100 dpi no matter how sharp the intermediate PNG was.
|
||||
_RASTER_DPI = 220
|
||||
|
||||
_RC = {
|
||||
"font.size": 10,
|
||||
"font.family": "sans-serif",
|
||||
"figure.facecolor": "white",
|
||||
"savefig.facecolor": "white",
|
||||
"savefig.dpi": _RASTER_DPI,
|
||||
"pdf.fonttype": 42, # embed TrueType — text stays selectable on mobile.
|
||||
}
|
||||
|
||||
@@ -80,6 +92,10 @@ class _PdfState:
|
||||
# points (1/72") with a top-left origin — same convention as PyMuPDF.
|
||||
self.term_sources = [] # [{key, page, rect:[x0,y0,x1,y1]}]
|
||||
self.term_dests = {} # key -> {page, point:[x,y]}
|
||||
# Clickable index (cover → chapter). Sources are the cover's TocEntry
|
||||
# rects; chapter_starts maps a chapter id AND its title to its first page.
|
||||
self.toc_sources = [] # [{target_id, page, rect:[x0,y0,x1,y1]}]
|
||||
self.chapter_starts = {} # id|title -> {page, point:[x,y]}
|
||||
|
||||
|
||||
# --------------------------------------------------------------------------- #
|
||||
@@ -317,10 +333,18 @@ def _place_kv_table(st: _PdfState, block) -> None:
|
||||
if title:
|
||||
_place_heading(st, model.Heading(title, level=2))
|
||||
rows = getattr(block, "rows", []) or []
|
||||
# ``value_align="right"`` pins the value column to the right margin (label
|
||||
# left, number flush right) — used by the cover's analysis summary.
|
||||
right = str(getattr(block, "value_align", "left")).lower() == "right"
|
||||
key_w = 1.9 # inches reserved for the label column.
|
||||
# Right-aligned values wrap against the full usable width minus the label
|
||||
# column; left-aligned values wrap against the value column only.
|
||||
val_chars = tl.chars_per_line(_USABLE_W - key_w - 0.1, _FS_BODY)
|
||||
lh = tl.line_height_in(_FS_BODY)
|
||||
for row in rows:
|
||||
# ``data_idx`` is the 0-based row index. Rows at odd indices (the even rows
|
||||
# when counting from 1) are zebra-shaded, matching the data-table convention so
|
||||
# every table in the document carries the same striping.
|
||||
for data_idx, row in enumerate(rows):
|
||||
try:
|
||||
label, value = row[0], row[1]
|
||||
except Exception: # noqa: BLE001
|
||||
@@ -329,11 +353,25 @@ def _place_kv_table(st: _PdfState, block) -> None:
|
||||
row_h = lh * len(v_lines) + _ROW_VPAD
|
||||
_ensure_space(st, row_h)
|
||||
y0 = st.y
|
||||
# Faint zebra fill for even rows, drawn first (zorder 0) so striping
|
||||
# never hides the text/value drawn on top.
|
||||
if data_idx % 2 == 1:
|
||||
st.fig.add_artist(Rectangle(
|
||||
(_xf(_ML), _yf(y0 + row_h)), _xf(_ML + _USABLE_W) - _xf(_ML),
|
||||
_yf(y0) - _yf(y0 + row_h), transform=st.fig.transFigure,
|
||||
color=_ZEBRA, lw=0, zorder=0))
|
||||
st.fig.text(_xf(_ML), _yf(y0), tl.strip_inline_md(model._safe_str(label)),
|
||||
fontsize=_FS_BODY, color=_MUTED, ha="left", va="top")
|
||||
fontsize=_FS_BODY, color=_MUTED, ha="left", va="top",
|
||||
zorder=2)
|
||||
for k, vl in enumerate(v_lines):
|
||||
st.fig.text(_xf(_ML + key_w), _yf(y0 + k * lh), vl,
|
||||
fontsize=_FS_BODY, color=_INK, ha="left", va="top")
|
||||
if right:
|
||||
st.fig.text(_xf(_ML + _USABLE_W), _yf(y0 + k * lh), vl,
|
||||
fontsize=_FS_BODY, color=_INK, ha="right",
|
||||
va="top", zorder=2)
|
||||
else:
|
||||
st.fig.text(_xf(_ML + key_w), _yf(y0 + k * lh), vl,
|
||||
fontsize=_FS_BODY, color=_INK, ha="left",
|
||||
va="top", zorder=2)
|
||||
st.y = y0 + row_h
|
||||
st.y += _GAP
|
||||
|
||||
@@ -363,6 +401,57 @@ def _col_widths(header: list, rows: list, fs: float) -> list:
|
||||
return widths
|
||||
|
||||
|
||||
# Minimal legible characters reserved per column when deciding whether a table
|
||||
# can be shown as selectable text. Below this width per column the cells become
|
||||
# unreadable, so the table is rasterized to a zoomable high-res image instead.
|
||||
_MIN_LEGIBLE_CHARS = 8
|
||||
|
||||
|
||||
def _table_fits_as_text(header: list, rows: list) -> bool:
|
||||
"""True when the table fits the usable width as readable text.
|
||||
|
||||
A table whose columns cannot each get a minimal legible width within the A5
|
||||
usable width (typically many columns, e.g. a 19-column ``df.head``) is flagged
|
||||
so it is rendered as a single high-resolution image — the reader zooms in on
|
||||
the phone and reads every cell, nothing cut — instead of being squeezed until
|
||||
unreadable. Narrow tables (few columns) keep the selectable-text rendering."""
|
||||
header = header or []
|
||||
rows = rows or []
|
||||
ncol = len(header) if header else (len(rows[0]) if rows else 1)
|
||||
ncol = max(1, ncol)
|
||||
cw = tl.avg_char_width_in(_FS_CELL)
|
||||
min_needed = ncol * (_MIN_LEGIBLE_CHARS * cw + _CELL_PAD * 2)
|
||||
return min_needed <= _USABLE_W
|
||||
|
||||
|
||||
def _table_figure_block(block):
|
||||
"""Wrap a too-wide table as a lazily-rasterized Figure (cached on the block).
|
||||
|
||||
The table is drawn once via ``render_table_as_figure`` (header shading + zebra)
|
||||
and embedded as one high-res image scaled to fit entirely. The same Figure is
|
||||
reused for measuring and placing so keep-together stays consistent. The table
|
||||
title/note are drawn inside the image (self-describing when zoomed/shared), so
|
||||
the block-level caption is left empty to avoid a duplicate title."""
|
||||
cached = getattr(block, "_aeda_tablefig", None)
|
||||
if cached is not None:
|
||||
return cached
|
||||
header = list(getattr(block, "header", []) or [])
|
||||
rows = list(getattr(block, "rows", []) or [])
|
||||
title = getattr(block, "title", None)
|
||||
note = getattr(block, "note", None)
|
||||
|
||||
def _make():
|
||||
from datascience.render_table_as_figure import render_table_as_figure
|
||||
return render_table_as_figure(header, rows, title=title, note=note)
|
||||
|
||||
fig = model.Figure(make=_make, caption=None)
|
||||
try:
|
||||
block._aeda_tablefig = fig
|
||||
except Exception: # noqa: BLE001 — block may reject attributes; degrade.
|
||||
pass
|
||||
return fig
|
||||
|
||||
|
||||
def _wrap_row(cells: list, widths: list, fs: float) -> list:
|
||||
"""Wrap each cell to its column width → list of line-lists per cell."""
|
||||
out = []
|
||||
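The width test behind `_table_fits_as_text` reduces to one inequality, which a standalone sketch makes easy to try with numbers. Everything numeric here is an assumption for illustration: the usable width and average character width are passed in explicitly, and the `0.05` padding default is borrowed from the PPTX hunk because the PDF renderer's `_CELL_PAD` value is not visible in this diff.

```python
def table_fits_as_text(ncol: int, usable_w_in: float, char_w_in: float,
                       min_chars: int = 8, cell_pad_in: float = 0.05) -> bool:
    """True when every column can get `min_chars` characters plus padding."""
    ncol = max(1, ncol)
    # Minimum width of one column: legible characters plus padding on both sides.
    per_column = min_chars * char_w_in + 2 * cell_pad_in
    return ncol * per_column <= usable_w_in


# Hypothetical page: 5.0 in usable width, 0.07 in average character width.
print(table_fits_as_text(7, 5.0, 0.07))   # 7 columns need about 4.62 in
print(table_fits_as_text(19, 5.0, 0.07))  # 19 columns need about 12.54 in
```

With these assumed values each column needs about 0.66 in, so seven columns stay as text and a 19-column table would be rasterized.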
@@ -402,11 +491,16 @@ def _draw_table_row(st: _PdfState, cells_lines: list, widths: list, fs: float,
|
||||
|
||||
|
||||
def _place_data_table(st: _PdfState, block) -> None:
|
||||
header = list(getattr(block, "header", []) or [])
|
||||
rows = list(getattr(block, "rows", []) or [])
|
||||
# Too many columns to be legible as text → render the whole table as one
|
||||
# high-res image, scaled to fit entirely (the reader zooms to read it).
|
||||
if not _table_fits_as_text(header, rows):
|
||||
_place_figure(st, _table_figure_block(block))
|
||||
return
|
||||
title = getattr(block, "title", None)
|
||||
if title:
|
||||
_place_heading(st, model.Heading(title, level=2))
|
||||
header = list(getattr(block, "header", []) or [])
|
||||
rows = list(getattr(block, "rows", []) or [])
|
||||
fs = _FS_CELL
|
||||
widths = _col_widths(header, rows, fs)
|
||||
header_lines = _wrap_row(header, widths, fs) if header else None
|
||||
@@ -464,8 +558,11 @@ def _resolve_figure(block):
|
||||
|
||||
|
||||
def _png_from_figure(fig) -> bytes:
|
||||
# ``bbox_inches='tight'`` is kept so the real aspect ratio is what we measure
|
||||
# and place. The page save (savefig.dpi in _RC) re-rasterizes this at the same
|
||||
# high DPI, so the embedded image stays crisp for phone zoom.
|
||||
buf = io.BytesIO()
|
||||
fig.savefig(buf, format="png", dpi=150, bbox_inches="tight")
|
||||
fig.savefig(buf, format="png", dpi=_RASTER_DPI, bbox_inches="tight")
|
||||
buf.seek(0)
|
||||
return buf.read()
|
||||
|
||||
@@ -707,12 +804,16 @@ def _measure_data_table(block) -> float:
|
||||
Counts the optional title heading, the wrapped header row, every wrapped data
|
||||
row (per-column wrap via the same ``_col_widths``/``_wrap_row`` the placer
|
||||
uses) and the optional note. Keep this in sync with ``_place_data_table``."""
|
||||
header = list(getattr(block, "header", []) or [])
|
||||
rows = list(getattr(block, "rows", []) or [])
|
||||
# Mirror the placer: a too-wide table is drawn as a single image, so its
|
||||
# keep-together height is the image's, not the (squeezed) text layout's.
|
||||
if not _table_fits_as_text(header, rows):
|
||||
return _measure_figure_like(_table_figure_block(block))
|
||||
h = 0.0
|
||||
title = getattr(block, "title", None)
|
||||
if title:
|
||||
h += _measure_heading_text(title, 2)
|
||||
header = list(getattr(block, "header", []) or [])
|
||||
rows = list(getattr(block, "rows", []) or [])
|
||||
fs = _FS_CELL
|
||||
widths = _col_widths(header, rows, fs)
|
||||
lh = tl.line_height_in(fs)
|
||||
@@ -744,6 +845,10 @@ def _measure_block(st: _PdfState, block) -> float:
|
||||
lines = tl.wrap(getattr(block, "text", ""),
|
||||
tl.chars_per_line(_USABLE_W, _FS_NOTE))
|
||||
return tl.line_height_in(_FS_NOTE) * len(lines) + _GAP
|
||||
if kind == "toc_entry":
|
||||
lines = tl.wrap(tl.strip_inline_md(getattr(block, "label", "")),
|
||||
tl.chars_per_line(_USABLE_W - 0.22, _FS_BODY)) or [""]
|
||||
return tl.line_height_in(_FS_BODY) * len(lines) + _GAP * 0.4
|
||||
if kind == "kv_table":
|
||||
return _measure_kv_table(block)
|
||||
if kind == "data_table":
|
||||
@@ -828,6 +933,38 @@ def _place_glossary_entry(st: _PdfState, block) -> None:
|
||||
st.y += _GAP * 0.5
|
||||
|
||||
|
||||
def _place_toc_entry(st: _PdfState, block) -> None:
|
||||
"""Render one clickable index line and record it as a link source.
|
||||
|
||||
Drawn as a bulleted line in the accent link colour; its rectangle is recorded
|
||||
in ``st.toc_sources`` so the post-processor turns it into a real jump to the
|
||||
target chapter's first page. If the target is never resolved the line still
|
||||
shows as plain (accent) text — never cut, never broken."""
|
||||
label = tl.strip_inline_md(getattr(block, "label", "")) or ""
|
||||
target_id = getattr(block, "target_id", "") or ""
|
||||
fs = _FS_BODY
|
||||
lh = tl.line_height_in(fs)
|
||||
bullet = "• "
|
||||
indent = 0.22
|
||||
max_chars = tl.chars_per_line(_USABLE_W - indent, fs)
|
||||
lines = tl.wrap(label, max_chars) or [""]
|
||||
for idx, ln in enumerate(lines):
|
||||
_ensure_space(st, lh)
|
||||
x = _ML
|
||||
st.fig.text(_xf(x), _yf(st.y), bullet if idx == 0 else " ",
|
||||
fontsize=fs, color=_LINK, ha="left", va="top")
|
||||
x += indent
|
||||
w = _text_width_in(st, ln, fs, False)
|
||||
st.fig.text(_xf(x), _yf(st.y), ln, fontsize=fs, color=_LINK,
|
||||
ha="left", va="top")
|
||||
if target_id and idx == 0:
|
||||
st.toc_sources.append({
|
||||
"target_id": target_id, "page": st.page - 1,
|
||||
"rect": _pt_rect(_ML, st.y, x + w, st.y + lh)})
|
||||
st.y += lh
|
||||
st.y += _GAP * 0.4
|
||||
|
||||
|
||||
_PLACERS = {
|
||||
"heading": _place_heading,
|
||||
"markdown": _place_markdown,
|
||||
@@ -839,6 +976,7 @@ _PLACERS = {
|
||||
"note": _place_note,
|
||||
"group": _place_group,
|
||||
"glossary_entry": _place_glossary_entry,
|
||||
"toc_entry": _place_toc_entry,
|
||||
}
|
||||
|
||||
|
||||
@@ -870,6 +1008,15 @@ def render_pdf(chapters: list, out_path: str, meta: dict = None) -> dict:
|
||||
st.chapter = ch
|
||||
st.chapter_pages = 0
|
||||
_new_page(st) # each chapter starts on a fresh page.
|
||||
# Record this chapter's first page as a link target for the
|
||||
# cover index (keyed by id AND title, since the cover only
|
||||
# knows titles). Point is the top of the content area.
|
||||
_start = {"page": st.page - 1,
|
||||
"point": [_ML * 72.0, _CONTENT_TOP * 72.0]}
|
||||
if ch.id:
|
||||
st.chapter_starts[ch.id] = _start
|
||||
if getattr(ch, "title", ""):
|
||||
st.chapter_starts.setdefault(ch.title, _start)
|
||||
for block in ch.blocks:
|
||||
placer = _PLACERS.get(getattr(block, "kind", ""),
|
||||
_place_note)
|
||||
@@ -902,7 +1049,7 @@ def render_pdf(chapters: list, out_path: str, meta: dict = None) -> dict:
|
||||
|
||||
note = f"{n_pages} páginas"
|
||||
if n_links:
|
||||
note += f" · {n_links} enlaces de glosario"
|
||||
note += f" · {n_links} enlaces internos"
|
||||
if notes:
|
||||
note += " · " + "; ".join(notes)
|
||||
return {"path": out_path, "n_pages": n_pages, "chapters": chapters_meta,
|
||||
@@ -910,9 +1057,11 @@ def render_pdf(chapters: list, out_path: str, meta: dict = None) -> dict:
|
||||
|
||||
|
||||
def _wire_glossary_links(st: _PdfState, out_path: str, notes: list) -> int:
|
||||
"""Build {source rect → glossary dest} links and apply them via PyMuPDF.
|
||||
"""Apply internal PDF links via PyMuPDF: glossary terms + the cover index.
|
||||
|
||||
Returns the number of links applied (0 if there is nothing to wire or the
|
||||
Builds two sets of GOTO links — every in-text glossary term → its entry, and
|
||||
every cover ``TocEntry`` → its chapter's first page — and applies them in one
|
||||
pass. Returns the number of links applied (0 if there is nothing to wire or the
|
||||
post-processor is unavailable). Never raises."""
|
||||
try:
|
||||
links = []
|
||||
@@ -923,6 +1072,14 @@ def _wire_glossary_links(st: _PdfState, out_path: str, notes: list) -> int:
|
||||
links.append({
|
||||
"src_page": src["page"], "src_rect": src["rect"],
|
||||
"dst_page": dest["page"], "dst_point": dest["point"]})
|
||||
# Cover index → chapter first page (clickable, navigable table of contents).
|
||||
for src in st.toc_sources:
|
||||
dest = st.chapter_starts.get(src.get("target_id"))
|
||||
if not dest:
|
||||
continue
|
||||
links.append({
|
||||
"src_page": src["page"], "src_rect": src["rect"],
|
||||
"dst_page": dest["page"], "dst_point": dest["point"]})
|
||||
if not links:
|
||||
return 0
|
||||
from datascience.add_pdf_internal_links import add_pdf_internal_links
|
||||
@@ -930,7 +1087,7 @@ def _wire_glossary_links(st: _PdfState, out_path: str, notes: list) -> int:
|
||||
if isinstance(res, dict) and res.get("status") == "ok":
|
||||
return int(res.get("n_links") or 0)
|
||||
if isinstance(res, dict) and res.get("error"):
|
||||
notes.append(f"glosario sin enlaces: {res.get('error')}")
|
||||
notes.append(f"enlaces internos no aplicados: {res.get('error')}")
|
||||
except Exception as e: # noqa: BLE001 — links are best-effort.
|
||||
notes.append(f"glosario sin enlaces: {e}")
|
||||
notes.append(f"enlaces internos no aplicados: {e}")
|
||||
return 0
|
||||
|
||||
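The cover index wiring has two halves: registering each chapter's first page under its id and its title, then resolving every `TocEntry` source against that map. The sketch below mirrors the dict shapes used in this diff. `register_chapter` and `build_toc_links` are hypothetical helper names, and the step that applies the links through PyMuPDF is left out.

```python
def register_chapter(chapter_starts: dict, chapter_id: str, title: str,
                     page: int, point: list) -> None:
    """Record a chapter's first page under its id and, if free, its title."""
    start = {"page": page, "point": list(point)}
    if chapter_id:
        chapter_starts[chapter_id] = start
    if title:
        # setdefault: an id that equals another chapter's title is not overwritten.
        chapter_starts.setdefault(title, start)


def build_toc_links(toc_sources: list, chapter_starts: dict) -> list:
    """Turn cover index rectangles into link dicts; skip unresolved targets."""
    links = []
    for src in toc_sources:
        dest = chapter_starts.get(src.get("target_id"))
        if not dest:
            continue  # the entry stays on the page as plain text
        links.append({
            "src_page": src["page"], "src_rect": src["rect"],
            "dst_page": dest["page"], "dst_point": dest["point"]})
    return links
```

Keying the map by title as well as id is what lets the cover, which only knows chapter titles, use the title as `target_id`.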
@@ -51,6 +51,12 @@ _FS_H1, _FS_H2, _FS_H3 = 20, 16, 13
|
||||
_FS_BODY, _FS_CELL, _FS_NOTE = 14, 11, 11
|
||||
_GAP = 0.12
|
||||
|
||||
# Rasterization DPI for every embedded figure/table image. Raised from 150 to 220
|
||||
# so a viewer can zoom into a slide (or a shared picture) and read crisp detail —
|
||||
# axis labels, table cells — without pixelation. Kept moderate so the deck size
|
||||
# stays reasonable. Same value as the PDF renderer.
|
||||
_RASTER_DPI = 220
|
||||
|
||||
|
||||
class _PptxState:
|
||||
def __init__(self, prs, title: str):
|
||||
@@ -65,6 +71,10 @@ class _PptxState:
|
||||
# Glossary wiring (mejora 6): runs to link and per-term target slide.
|
||||
self.term_runs = [] # [(key, run)]
|
||||
self.term_anchor_slide = {} # key -> Slide (glossary entry)
|
||||
# Clickable index (cover → chapter). toc_runs are the cover's index runs;
|
||||
# chapter_starts maps a chapter id AND its title to its first slide.
|
||||
self.toc_runs = [] # [(target_id, run, src_slide)]
|
||||
self.chapter_starts = {} # id|title -> Slide (chapter first slide)
|
||||
|
||||
|
||||
def _rgb(c):
|
||||
@@ -135,7 +145,7 @@ def _ensure(st: _PptxState, height: float) -> None:
|
||||
|
||||
|
||||
def _add_text(st: _PptxState, lines: list, fs: float, color, bold=False,
|
||||
italic=False, indent=0.0, bullet=False) -> None:
|
||||
italic=False, indent=0.0, bullet=False, underline=False) -> None:
|
||||
lh = tl.line_height_in(fs)
|
||||
height = lh * len(lines) + 0.05
|
||||
_ensure(st, height)
|
||||
@@ -153,6 +163,7 @@ def _add_text(st: _PptxState, lines: list, fs: float, color, bold=False,
|
||||
run.font.size = Pt(fs)
|
||||
run.font.bold = bold
|
||||
run.font.italic = italic
|
||||
run.font.underline = underline
|
||||
run.font.color.rgb = _rgb(color)
|
||||
st.y += height
|
||||
|
||||
@@ -206,10 +217,16 @@ def _add_rich_text(st: _PptxState, rich_lines: list, fs: float, color,
|
||||
def _place_heading(st: _PptxState, block) -> None:
|
||||
level = max(1, min(3, int(getattr(block, "level", 1) or 1)))
|
||||
fs = {1: _FS_H1, 2: _FS_H2, 3: _FS_H3}[level]
|
||||
# Optional per-heading emphasis (cover dataset name): a larger font and an
|
||||
# underline. ``size_pt`` overrides the per-level size when set.
|
||||
size_override = getattr(block, "size_pt", None)
|
||||
if isinstance(size_override, (int, float)) and size_override > 0:
|
||||
fs = float(size_override)
|
||||
underline = bool(getattr(block, "underline", False))
|
||||
text = tl.strip_inline_md(getattr(block, "text", ""))
|
||||
st.last_heading = text or st.last_heading
|
||||
lines = tl.wrap(text, tl.chars_per_line(_USABLE_W, fs))
|
||||
_add_text(st, lines, fs, _INK, bold=True)
|
||||
_add_text(st, lines, fs, _INK, bold=True, underline=underline)
|
||||
st.y += 0.04
|
||||
|
||||
|
||||
@@ -302,6 +319,58 @@ def _col_widths(header, rows):
|
||||
return [_USABLE_W * w / total for w in clamped]
|
||||
|
||||
|
||||
# Minimal legible characters reserved per column when deciding whether a table
|
||||
# can be shown as a native (selectable) PowerPoint table. Below this width per
|
||||
# column the cells become unreadable, so the table is rasterized to a zoomable
|
||||
# high-res image instead. The 16:9 slide is wide, so more columns fit than on A5.
|
||||
_MIN_LEGIBLE_CHARS = 8
|
||||
_CELL_PAD = 0.05
|
||||
|
||||
|
||||
def _table_fits_as_text(header: list, rows: list) -> bool:
|
||||
"""True when the table fits the usable slide width as a readable table.
|
||||
|
||||
A table whose columns cannot each get a minimal legible width within the slide
|
||||
usable width (typically many columns, e.g. a 19-column ``df.head``) is flagged
|
||||
so it is rendered as one high-resolution image — the viewer zooms in and reads
|
||||
every cell — instead of being squeezed unreadable. Narrow tables keep the
|
||||
native selectable table."""
|
||||
header = header or []
|
||||
rows = rows or []
|
||||
ncol = len(header) if header else (len(rows[0]) if rows else 1)
|
||||
ncol = max(1, ncol)
|
||||
cw = tl.avg_char_width_in(_FS_CELL)
|
||||
min_needed = ncol * (_MIN_LEGIBLE_CHARS * cw + _CELL_PAD * 2)
|
||||
return min_needed <= _USABLE_W
|
||||
|
||||
|
||||
def _table_figure_block(block):
    """Wrap a too-wide table as a lazily-rasterized Figure (cached on the block).

    Drawn once via ``render_table_as_figure`` (header shading + zebra) and embedded
    as one high-res image scaled to fit entirely. The title/note are drawn inside
    the image (self-describing when zoomed/shared), so no separate caption is
    emitted. Reused for measuring and placing so keep-together stays consistent."""
    cached = getattr(block, "_aeda_tablefig", None)
    if cached is not None:
        return cached
    header = list(getattr(block, "header", []) or [])
    rows = list(getattr(block, "rows", []) or [])
    title = getattr(block, "title", None)
    note = getattr(block, "note", None)

    def _make():
        from datascience.render_table_as_figure import render_table_as_figure
        return render_table_as_figure(header, rows, title=title, note=note)

    fig = model.Figure(make=_make, caption=None)
    try:
        block._aeda_tablefig = fig
    except Exception:  # noqa: BLE001 — block may reject attributes; degrade.
        pass
    return fig


def _row_height_in(cells, widths, fs) -> float:
    lh = tl.line_height_in(fs)
    maxlines = 1
@@ -365,11 +434,27 @@ def _style_cell(cell, fs, color, bold, fill) -> None:

 def _place_data_table(st: _PptxState, block, shaded_header=True,
                       key_value=False) -> None:
+    header = list(getattr(block, "header", []) or [])
+    rows = list(getattr(block, "rows", []) or [])
+    # Too many columns to be legible as a native table → render the whole table as
+    # one high-res picture, scaled to fit entirely (the viewer zooms to read it).
+    # KVTables (rendered here as a 2-column Campo/Valor table) are excluded: they
+    # always fit in width and stay as a selectable table.
+    if not key_value and not _table_fits_as_text(header, rows):
+        figblock = _table_figure_block(block)
+        data, _asp = _figure_bytes_cached(figblock)
+        if data is None:
+            _add_text(st, ["(tabla no disponible)"], _FS_NOTE, _MUTED,
+                      italic=True)
+            st.y += _GAP
+            return
+        _place_picture_bytes(st, data, None,
+                             max_h_in=getattr(figblock, "height_in", None),
+                             force_caption=False)
+        return
     title = getattr(block, "title", None)
     if title:
         _place_heading(st, model.Heading(title, level=2))
-    header = list(getattr(block, "header", []) or [])
-    rows = list(getattr(block, "rows", []) or [])
     fs = _FS_CELL
     widths = _col_widths(header, rows)
     header_h = _row_height_in(header, widths, fs) if header else 0.0
@@ -429,7 +514,7 @@ def _resolve_png(block):
     try:
         import matplotlib.pyplot as plt
         buf = io.BytesIO()
-        f.savefig(buf, format="png", dpi=150, bbox_inches="tight")
+        f.savefig(buf, format="png", dpi=_RASTER_DPI, bbox_inches="tight")
         buf.seek(0)
         return buf.read()
     except Exception:  # noqa: BLE001
@@ -476,12 +561,15 @@ def _figure_bytes_cached(block):


 def _place_picture_bytes(st: _PptxState, data: bytes, caption,
-                         max_h_in=None) -> None:
+                         max_h_in=None, force_caption=True) -> None:
     # Mejora 4 — every figure on a slide carries a visible caption/title. If the
     # block has no caption, fall back to the current section heading, then to a
-    # generic label, so no image is ever shown untitled.
-    caption = (model._safe_str(caption).strip()
-               or model._safe_str(st.last_heading).strip() or "Figura")
+    # generic label, so no image is ever shown untitled. ``force_caption=False``
+    # suppresses that fallback (used for table images, whose title is inside the
+    # picture) so no redundant caption is drawn.
+    caption = model._safe_str(caption).strip()
+    if not caption and force_caption:
+        caption = model._safe_str(st.last_heading).strip() or "Figura"
     w_px, h_px = _img_size_px(data)
     aspect = (h_px / w_px) if w_px else 0.66
     # Reserve the caption's REAL (possibly multi-line) height FIRST, then scale
@@ -489,9 +577,11 @@ def _place_picture_bytes(st: _PptxState, data: bytes, caption,
     # so its caption always fits on the SAME slide and no image is untitled.
     # cap_real = what _add_text consumes; cap_reserve adds the post-image gap and
     # a small cushion so the caption never spills to the next slide.
-    cap_lines = tl.wrap(caption, tl.chars_per_line(_USABLE_W, _FS_NOTE))
-    cap_real = tl.line_height_in(_FS_NOTE) * len(cap_lines) + 0.05
-    cap_reserve = cap_real + 0.05 + 0.10
+    cap_lines = tl.wrap(caption, tl.chars_per_line(_USABLE_W, _FS_NOTE)) \
+        if caption else []
+    cap_real = (tl.line_height_in(_FS_NOTE) * len(cap_lines) + 0.05) \
+        if cap_lines else 0.0
+    cap_reserve = (cap_real + 0.05 + 0.10) if cap_lines else 0.05
     max_h = _CONTENT_BOTTOM - _CONTENT_TOP
     # height_in hint (model.Figure/Image): cap the target height so a figure in a
     # keep-together Group shrinks to leave room for its heading and text.
@@ -510,7 +600,8 @@ def _place_picture_bytes(st: _PptxState, data: bytes, caption,
     st.slide.shapes.add_picture(io.BytesIO(data), Inches(left), Inches(st.y),
                                 width=Inches(target_w), height=Inches(target_h))
     st.y += target_h + 0.05
-    _add_text(st, cap_lines, _FS_NOTE, _MUTED, italic=True)
+    if cap_lines:
+        _add_text(st, cap_lines, _FS_NOTE, _MUTED, italic=True)
     st.y += _GAP


@@ -552,9 +643,11 @@ def _place_note(st: _PptxState, block) -> None:
 # WITHOUT drawing it so a Group can move whole to the next slide before drawing.
 # Over-estimating only triggers an earlier slide break, never a content cut.
 # --------------------------------------------------------------------------- #
-def _measure_heading_text(text: str, level: int) -> float:
+def _measure_heading_text(text: str, level: int, size_pt=None) -> float:
     level = max(1, min(3, int(level or 1)))
     fs = {1: _FS_H1, 2: _FS_H2, 3: _FS_H3}[level]
+    if isinstance(size_pt, (int, float)) and size_pt > 0:
+        fs = float(size_pt)
     lines = tl.wrap(tl.strip_inline_md(text), tl.chars_per_line(_USABLE_W, fs))
     return tl.line_height_in(fs) * len(lines) + 0.05 + 0.04

@@ -654,12 +747,16 @@ def _measure_kv_table(block) -> float:
 def _measure_data_table(block) -> float:
     """Faithful DataTable height — matches ``_place_data_table`` (title heading +
     wrapped header + every wrapped row + optional note). Keep in sync."""
+    header = list(getattr(block, "header", []) or [])
+    rows = list(getattr(block, "rows", []) or [])
+    # Mirror the placer: a too-wide table is drawn as one image, so its
+    # keep-together height is the image's, not the (squeezed) table layout's.
+    if not _table_fits_as_text(header, rows):
+        return _measure_figure_like(_table_figure_block(block))
     h = 0.0
     title = getattr(block, "title", None)
     if title:
         h += _measure_heading_text(title, 2)
-    header = list(getattr(block, "header", []) or [])
-    rows = list(getattr(block, "rows", []) or [])
     fs = _FS_CELL
     widths = _col_widths(header, rows)
     if header:
@@ -679,7 +776,8 @@ def _measure_block(st: _PptxState, block) -> float:
     try:
         if kind == "heading":
             return _measure_heading_text(getattr(block, "text", ""),
-                                         getattr(block, "level", 1))
+                                         getattr(block, "level", 1),
+                                         size_pt=getattr(block, "size_pt", None))
         if kind == "markdown":
             return _measure_markdown(block)
         if kind in ("figure", "image"):
@@ -688,6 +786,10 @@ def _measure_block(st: _PptxState, block) -> float:
             lines = tl.wrap(getattr(block, "text", ""),
                             tl.chars_per_line(_USABLE_W, _FS_NOTE))
             return tl.line_height_in(_FS_NOTE) * len(lines) + 0.05 + _GAP
+        if kind == "toc_entry":
+            lines = tl.wrap(tl.strip_inline_md(getattr(block, "label", "")),
+                            tl.chars_per_line(_USABLE_W - 0.3, _FS_BODY)) or [""]
+            return tl.line_height_in(_FS_BODY) * len(lines) + 0.05
         if kind == "kv_table":
             return _measure_kv_table(block)
         if kind == "data_table":
@@ -800,6 +902,73 @@ def _fit_group_blocks(st: _PptxState, blocks: list, avail_full: float) -> list:
     return out


+def _fit_img(width_col: float, aspect: float, max_h: float):
+    """Scale an image to ``width_col`` then clamp to ``max_h`` keeping aspect."""
+    w = width_col
+    h = w * aspect
+    if h > max_h:
+        h = max_h
+        w = (h / aspect) if aspect else width_col
+    return w, h
+
+
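A short worked example may help when reading `_fit_img`: the image first fills the column width, and only if the resulting height exceeds the limit is it clamped, with the width shrunk to keep the aspect ratio (`aspect` is height divided by width, as in `_place_picture_bytes`). The function body below copies the logic from the diff under the hypothetical name `fit_img`; the input sizes are made up for illustration.

```python
def fit_img(width_col: float, aspect: float, max_h: float):
    """Same logic as `_fit_img` in the diff: fill the column, then clamp height."""
    w = width_col
    h = w * aspect
    if h > max_h:
        # Too tall for the band: pin the height and derive the width from it.
        h = max_h
        w = (h / aspect) if aspect else width_col
    return w, h


# A wide image (aspect 0.5) is 2.5 in tall in a 5 in column, so it is untouched.
wide = fit_img(5.0, 0.5, 3.0)
# A taller image (aspect 0.75) would be 3.75 in tall, so it is clamped to 3 in.
tall = fit_img(5.0, 0.75, 3.0)
print(wide, tall)
```

Because the clamp only ever reduces both dimensions together, a picture placed this way is never cropped or stretched; it just occupies less than the full column width.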
+def _place_group_side_by_side(st: _PptxState, block, avail_full: float) -> bool:
+    """Place a Group's table (left ~55%) next to its figure (right ~45%).
+
+    Both the table and the figure are rasterized to high-res images and placed in
+    two columns of the SAME slide; any other blocks (e.g. a heading) render full
+    width above the pair, the rest below. Returns True on success; returns False
+    (so the caller falls back to stacking) when the group has no table+figure pair
+    or the pair cannot fit side by side on one slide. Never raises by itself."""
+    blocks = getattr(block, "blocks", []) or []
+    tbl = next((b for b in blocks
+                if getattr(b, "kind", "") in ("data_table", "kv_table")), None)
+    fig = next((b for b in blocks
+                if getattr(b, "kind", "") in ("figure", "image")), None)
+    if tbl is None or fig is None:
+        return False
+    gap_col = 0.3
+    left_w = _USABLE_W * 0.55 - gap_col / 2.0
+    right_w = _USABLE_W * 0.45 - gap_col / 2.0
+    if left_w <= 1.0 or right_w <= 1.0:
+        return False
+    tdata, tasp = _figure_bytes_cached(_table_figure_block(tbl))
+    fdata, fasp = _figure_bytes_cached(fig)
+    if not tdata or not fdata:
+        return False
+    ti, fi = blocks.index(tbl), blocks.index(fig)
+    lo = min(ti, fi)
+    lead = list(blocks[:lo])
+    rest = [b for b in blocks[lo + 1:] if b is not tbl and b is not fig]
+    lead_h = sum(_measure_block(st, b) for b in lead)
+    rest_h = sum(_measure_block(st, b) for b in rest)
+    col_max_h = avail_full - lead_h - rest_h - _GAP * 2
+    if col_max_h < 1.2:
+        return False  # not enough vertical room to put the pair side by side.
+    tw, th = _fit_img(left_w, tasp, col_max_h)
+    fw, fh = _fit_img(right_w, fasp, col_max_h)
+    band = max(th, fh)
+    needed = lead_h + band + rest_h + _GAP * 2
+    if needed > avail_full:
+        return False  # taller than a whole slide even side by side → stack.
+    if needed > _remaining(st):
+        _new_slide(st, cont=True)
+    for b in lead:
+        _PLACERS.get(getattr(b, "kind", ""), _place_note)(st, b)
+    top = st.y
+    f_left = _ML + left_w + gap_col
+    st.slide.shapes.add_picture(
+        io.BytesIO(tdata), Inches(_ML + (left_w - tw) / 2.0),
+        Inches(top + (band - th) / 2.0), width=Inches(tw), height=Inches(th))
+    st.slide.shapes.add_picture(
+        io.BytesIO(fdata), Inches(f_left + (right_w - fw) / 2.0),
+        Inches(top + (band - fh) / 2.0), width=Inches(fw), height=Inches(fh))
+    st.y = top + band + _GAP
+    for b in rest:
+        _PLACERS.get(getattr(b, "kind", ""), _place_note)(st, b)
+    return True
+
+
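The column geometry in `_place_group_side_by_side` can be checked by hand: the 55% and 45% columns each give up half of the gap, so the two columns plus the gap add up to the usable width again, and each picture is centred inside its own column. The sketch below isolates that arithmetic. The helper names and the sample sizes (12.0 in usable width, 0.5 in left margin, picture widths) are assumptions for illustration and are not the renderer's `_USABLE_W` or `_ML`.

```python
def side_by_side_columns(usable_w: float, margin_left: float,
                         gap_col: float = 0.3):
    """Column widths and the figure column's left edge, as in the diff."""
    left_w = usable_w * 0.55 - gap_col / 2.0
    right_w = usable_w * 0.45 - gap_col / 2.0
    fig_left = margin_left + left_w + gap_col
    return left_w, right_w, fig_left


def centred_left(col_left: float, col_w: float, pic_w: float) -> float:
    """Left edge that centres a picture of width pic_w inside a column."""
    return col_left + (col_w - pic_w) / 2.0


# Assumed slide metrics, in inches (hypothetical values).
left_w, right_w, fig_left = side_by_side_columns(usable_w=12.0, margin_left=0.5)
table_x = centred_left(0.5, left_w, pic_w=4.0)        # table image, left column
figure_x = centred_left(fig_left, right_w, pic_w=5.0)  # figure, right column
print(round(left_w, 3), round(right_w, 3), round(table_x, 3), round(figure_x, 3))
```

The vertical offsets in the diff follow the same centring idea, using the shared `band` height (the taller of the two pictures) in place of the column width.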
 def _place_group(st: _PptxState, block) -> None:
     """Render a keep-together Group: move it whole to the next slide if needed."""
     blocks = getattr(block, "blocks", []) or []
@@ -810,6 +979,14 @@ def _place_group(st: _PptxState, block) -> None:
     if getattr(block, "page_break_before", False) and st.y > _CONTENT_TOP + 1e-6:
         _new_slide(st, cont=True)
     avail_full = _CONTENT_BOTTOM - _CONTENT_TOP
+    # layout="side_by_side": try table-left / figure-right on one slide; on any
+    # reason it can't, fall through to the normal stacked keep-together below.
+    if str(getattr(block, "layout", "stack")).lower() == "side_by_side":
+        try:
+            if _place_group_side_by_side(st, block, avail_full):
+                return
+        except Exception:  # noqa: BLE001 — degrade to stacking, never abort.
+            pass
     # Trim oversized tables first (keeps the chart on the same slide), then shrink
     # the figure to share the remaining room.
     blocks = _fit_group_blocks(st, blocks, avail_full)
@@ -843,6 +1020,44 @@ def _place_glossary_entry(st: _PptxState, block) -> None:
     st.y += _GAP


+def _place_toc_entry(st: _PptxState, block) -> None:
+    """Render one clickable index line and record its run as a link source.
+
+    Drawn as a bulleted line in the accent link colour; the run is recorded in
+    ``st.toc_runs`` so it later becomes a native slide-jump to the target chapter's
+    first slide. If the target is never resolved the line still shows as plain
+    (accent) text — never cut."""
+    label = tl.strip_inline_md(getattr(block, "label", "")) or ""
+    target_id = getattr(block, "target_id", "") or ""
+    fs = _FS_BODY
+    lines = tl.wrap(label, tl.chars_per_line(_USABLE_W - 0.3, fs)) or [""]
+    lh = tl.line_height_in(fs)
+    height = lh * len(lines) + 0.05
+    _ensure(st, height)
+    box = st.slide.shapes.add_textbox(
+        Inches(_ML), Inches(st.y), Inches(_USABLE_W), Inches(height))
+    tf = box.text_frame
+    tf.word_wrap = True
+    first = True
+    link_run = None
+    for idx, ln in enumerate(lines):
+        p = tf.paragraphs[0] if first else tf.add_paragraph()
+        first = False
+        r0 = p.add_run()
+        r0.text = "• " if idx == 0 else " "
+        r0.font.size = Pt(fs)
+        r0.font.color.rgb = _rgb(_LINK)
+        run = p.add_run()
+        run.text = ln
+        run.font.size = Pt(fs)
+        run.font.color.rgb = _rgb(_LINK)
+        if idx == 0:
+            link_run = run
+    if target_id and link_run is not None:
+        st.toc_runs.append((target_id, link_run, st.slide))
+    st.y += height
+
+
 _PLACERS = {
     "heading": _place_heading,
     "markdown": _place_markdown,
@@ -854,6 +1069,7 @@ _PLACERS = {
     "note": _place_note,
     "group": _place_group,
     "glossary_entry": _place_glossary_entry,
+    "toc_entry": _place_toc_entry,
 }


@@ -889,6 +1105,12 @@ def render_pptx(chapters: list, out_path: str, meta: dict = None) -> dict:
         st.chapter = ch
         st.chapter_slides = 0
         _new_slide(st, cont=False)
+        # Record this chapter's first slide as a link target for the cover
+        # index (keyed by id AND title, since the cover only knows titles).
+        if ch.id:
+            st.chapter_starts[ch.id] = st.slide
+        if getattr(ch, "title", ""):
+            st.chapter_starts.setdefault(ch.title, st.slide)
         for block in ch.blocks:
             placer = _PLACERS.get(getattr(block, "kind", ""), _place_note)
             try:
@@ -916,7 +1138,7 @@ def render_pptx(chapters: list, out_path: str, meta: dict = None) -> dict:

     note = f"{n_slides} slides"
     if n_links:
-        note += f" · {n_links} enlaces de glosario"
+        note += f" · {n_links} enlaces internos"
     if notes:
         note += " · " + "; ".join(notes)
     return {"path": out_path, "n_slides": n_slides, "chapters": chapters_meta,
@@ -924,19 +1146,21 @@ def render_pptx(chapters: list, out_path: str, meta: dict = None) -> dict:


def _wire_glossary_links(st: _PptxState, notes: list) -> int:
    """Turn each recorded term run into a native jump to its glossary slide.
    """Apply native slide-jumps: glossary terms + the cover index.

    Returns the number of links applied. A term whose only appearance is inside
    its own glossary entry (source slide == target slide) is skipped. Never
    Each in-text glossary term run jumps to its glossary entry slide, and each
    cover ``TocEntry`` run jumps to its chapter's first slide. Returns the total
    number of links applied. A run whose target is its own slide is skipped. Never
    raises."""
    if not st.term_runs or not st.term_anchor_slide:
    if not (st.term_runs and st.term_anchor_slide) and not (
            st.toc_runs and st.chapter_starts):
        return 0
    linked = 0
    try:
        from datascience.pptx_link_run_to_slide import pptx_link_run_to_slide
    except Exception as e:  # noqa: BLE001
        notes.append(f"glosario sin enlaces: {e}")
        notes.append(f"enlaces internos no aplicados: {e}")
        return 0
    linked = 0
    for key, run, src_slide in st.term_runs:
        tgt = st.term_anchor_slide.get(key)
        if tgt is None or tgt is src_slide:
@@ -946,4 +1170,14 @@ def _wire_glossary_links(st: _PptxState, notes: list) -> int:
                 linked += 1
         except Exception:  # noqa: BLE001 — links are best-effort.
             pass
+    # Cover index → chapter first slide (clickable, navigable table of contents).
+    for target_id, run, src_slide in st.toc_runs:
+        tgt = st.chapter_starts.get(target_id)
+        if tgt is None or tgt is src_slide:
+            continue
+        try:
+            if pptx_link_run_to_slide(run, src_slide, tgt):
+                linked += 1
+        except Exception:  # noqa: BLE001 — links are best-effort.
+            pass
     return linked
@@ -0,0 +1,283 @@
+"""Golden tests for the global render-quality features (issue: eda-render-quality).
+
+Covers, with executable evidence:
+  * High DPI: every embedded figure is rasterized at 220 dpi, so a phone reader
+    can zoom in and still see crisp detail.
+  * Wide table → image: a table too wide to be legible as text (e.g. a 19-column
+    df.head) is rendered as one high-res image that scales to fit entirely, while
+    a narrow table keeps its selectable-text/native-table rendering.
+  * ``Group(layout="side_by_side")``: in PPTX the table and figure are placed in
+    two columns of the same slide; in PDF the same group stacks vertically.
+  * Backward compatibility: a Group without ``layout`` defaults to ``"stack"`` and
+    a fitting table renders exactly as before.
+
+Renderers are invoked for real; PDFs are inspected with PyMuPDF and PPTX decks
+with python-pptx.
+"""
+
+from __future__ import annotations
+
+import os
+import tempfile
+
+import matplotlib
+
+matplotlib.use("Agg")
+import matplotlib.pyplot as plt  # noqa: E402
+
+import pytest  # noqa: E402
+
+from datascience.automatic_eda import model  # noqa: E402
+from datascience.automatic_eda.render_pdf_impl import (  # noqa: E402
+    render_pdf, _RASTER_DPI as _PDF_DPI, _table_fits_as_text as _pdf_fits)
+from datascience.automatic_eda.render_pptx_impl import (  # noqa: E402
+    render_pptx, _RASTER_DPI as _PPTX_DPI, _table_fits_as_text as _pptx_fits)
+
+
+# --------------------------------------------------------------------------- #
+# Helpers.
+# --------------------------------------------------------------------------- #
+def _simple_fig():
+    """A small, real matplotlib figure for the figure blocks."""
+    fig, ax = plt.subplots(figsize=(4, 3))
+    ax.plot([0, 1, 2, 3], [1, 3, 2, 4])
+    ax.set_title("demo")
+    return fig
+
+
+def _wide_table(n_cols=19, n_rows=5):
+    header = [f"columna_{i}" for i in range(n_cols)]
+    rows = [[f"v{r}_{c}" for c in range(n_cols)] for r in range(n_rows)]
+    return model.DataTable(header=header, rows=rows, title="Primeras filas")
+
+
+def _narrow_table():
+    return model.DataTable(header=["a", "b", "c"],
+                           rows=[["1", "2", "3"], ["4", "5", "6"]],
+                           title="Tabla estrecha")
+
+
+def _chapter(blocks, cid="cap", title="Capítulo"):
+    return [model.Chapter(id=cid, title=title, version="1.0.0", blocks=blocks)]
+
+
+# --------------------------------------------------------------------------- #
+# 1) High DPI — the unit constant and a real embedded image.
+# --------------------------------------------------------------------------- #
+def test_raster_dpi_is_high_both_renderers():
+    assert _PDF_DPI >= 200, "el DPI del PDF debe ser alto (>=200)"
+    assert _PPTX_DPI >= 200, "el DPI del PPTX debe ser alto (>=200)"
+
+
+def test_pdf_embedded_figure_is_high_resolution(tmp_path):
+    fitz = pytest.importorskip("fitz")
+    out = str(tmp_path / "fig.pdf")
+    res = render_pdf(_chapter([model.Figure(make=_simple_fig, caption="demo")]),
+                     out, {"title": "T"})
+    assert res["path"] == out
+    doc = fitz.open(out)
+    try:
+        widths = []
+        for page in doc:
+            for img in page.get_images(full=True):
+                xref = img[0]
+                info = doc.extract_image(xref)
+                widths.append(info.get("width", 0))
+        assert widths, "no se incrustó ninguna imagen en el PDF"
+        # A ~4" figure rasterized at 220 dpi is ~ >850 px wide. At the old 150 dpi
+        # it would be ~600 px. The high-res threshold proves the DPI bump.
+        assert max(widths) >= 800, \
+            f"la figura embebida no es de alta resolución: {max(widths)} px"
+    finally:
+        doc.close()
+
+
+# --------------------------------------------------------------------------- #
+# 2) Wide table → image (PDF and PPTX); narrow table stays text.
+# --------------------------------------------------------------------------- #
+def test_fit_criterion_flags_wide_and_keeps_narrow():
+    wide = _wide_table()
+    narrow = _narrow_table()
+    assert not _pdf_fits(wide.header, wide.rows), \
+        "una tabla de 19 columnas debería NO caber como texto en A5"
+    assert not _pptx_fits(wide.header, wide.rows), \
+        "una tabla de 19 columnas debería NO caber como tabla nativa en 16:9"
+    assert _pdf_fits(narrow.header, narrow.rows), \
+        "una tabla de 3 columnas debería caber como texto en A5"
+    assert _pptx_fits(narrow.header, narrow.rows), \
+        "una tabla de 3 columnas debería caber como tabla nativa en 16:9"
+
+
+def test_wide_table_rendered_as_image_pdf(tmp_path):
+    fitz = pytest.importorskip("fitz")
+    out = str(tmp_path / "wide.pdf")
+    res = render_pdf(_chapter([_wide_table()]), out, {"title": "T"})
+    assert res["path"] == out
+    doc = fitz.open(out)
+    try:
+        n_images = sum(len(page.get_images(full=True)) for page in doc)
+        text = "".join(page.get_text() for page in doc)
+    finally:
+        doc.close()
+    assert n_images >= 1, "la tabla ancha no se rasterizó como imagen en el PDF"
+    # The cells are now inside the image, not selectable text. A unique cell value
+    # must therefore NOT appear as extractable text (it lives in the picture).
+    assert "v4_18" not in text, \
+        "la tabla ancha sigue como texto seleccionable (no se hizo imagen)"
+
+
+def test_narrow_table_stays_selectable_text_pdf(tmp_path):
+    fitz = pytest.importorskip("fitz")
+    out = str(tmp_path / "narrow.pdf")
+    render_pdf(_chapter([_narrow_table()]), out, {"title": "T"})
+    doc = fitz.open(out)
+    try:
+        text = "".join(page.get_text() for page in doc)
+    finally:
+        doc.close()
+    # Narrow table is selectable text: its header/cells are extractable.
+    for v in ("a", "b", "c", "1", "6"):
+        assert v in text, f"la celda '{v}' debería ser texto seleccionable"
+
+
+def test_wide_table_rendered_as_picture_pptx(tmp_path):
+    pptx = pytest.importorskip("pptx")
+    from pptx.enum.shapes import MSO_SHAPE_TYPE
+    out = str(tmp_path / "wide.pptx")
+    res = render_pptx(_chapter([_wide_table()]), out, {"title": "T"})
+    assert res["path"] == out
+    prs = pptx.Presentation(out)
+    pics = sum(1 for s in prs.slides for sh in s.shapes
+               if sh.shape_type == MSO_SHAPE_TYPE.PICTURE)
+    assert pics >= 1, "la tabla ancha no se colocó como imagen en el PPTX"
+
+
+# --------------------------------------------------------------------------- #
+# 3) Group(layout="side_by_side"): two columns in PPTX, stacked in PDF.
+# --------------------------------------------------------------------------- #
+def _side_by_side_group():
+    return model.Group(
+        blocks=[model.Heading(text="Columna X", level=2),
+                _narrow_table(),
+                model.Figure(make=_simple_fig, caption="grafico")],
+        layout="side_by_side")
+
+
+def test_side_by_side_places_two_columns_pptx(tmp_path):
+    pptx = pytest.importorskip("pptx")
+    from pptx.enum.shapes import MSO_SHAPE_TYPE
+    from pptx.util import Inches
+    out = str(tmp_path / "sbs.pptx")
+    render_pptx(_chapter([_side_by_side_group()]), out, {"title": "T"})
+    prs = pptx.Presentation(out)
+    # Find the slide that holds the pair (table image + figure image).
+    centre_emu = int(Inches(13.333 / 2.0))
+    placed = False
+    for s in prs.slides:
+        lefts = [sh.left for sh in s.shapes
+                 if sh.shape_type == MSO_SHAPE_TYPE.PICTURE
+                 and sh.left is not None]
+        if len(lefts) >= 2:
+            # one picture starts in the left half, another in the right half.
+            if min(lefts) < centre_emu and max(lefts) > centre_emu:
+                placed = True
+                break
+    assert placed, \
+        "side_by_side no colocó tabla y figura en dos columnas de la misma slide"
+
+
+def test_side_by_side_stacks_in_pdf(tmp_path):
+    fitz = pytest.importorskip("fitz")
+    out = str(tmp_path / "sbs.pdf")
+    res = render_pdf(_chapter([_side_by_side_group()]), out, {"title": "T"})
+    assert res["path"] == out and res["n_pages"] >= 1
+    doc = fitz.open(out)
+    try:
+        n_images = sum(len(page.get_images(full=True)) for page in doc)
+        text = "".join(page.get_text() for page in doc)
+    finally:
+        doc.close()
+    # PDF stacks: the narrow table stays selectable text (1 of its cells is
+    # extractable) and the figure is the single embedded image — not a 2-column
+    # pair of pictures like PPTX.
+    assert n_images == 1, "el PDF no debería usar el layout de dos imágenes"
+    assert "Columna X" in text and "1" in text, \
+        "la tabla del grupo debería seguir como texto apilado en el PDF"
+
+
+# --------------------------------------------------------------------------- #
+# 4) Backward compatibility — default layout stacks, fitting table unchanged.
+# --------------------------------------------------------------------------- #
+def test_group_default_layout_is_stack():
+    g = model.Group(blocks=[_narrow_table()])
+    assert g.layout == "stack", "el layout por defecto debe ser 'stack'"
+
+
+# --------------------------------------------------------------------------- #
+# 5) Clickable cover index ("Índice") → chapter first page/slide.
+# --------------------------------------------------------------------------- #
+def _doc_with_index():
+    portada = model.Chapter(id="portada", title="Portada", version="1.0.0",
+                            blocks=[model.Heading(text="Índice", level=2),
+                                    model.TocEntry(label="Distribuciones",
+                                                   target_id="Distribuciones")])
+    cap = model.Chapter(id="num", title="Distribuciones", version="1.0.0",
+                        blocks=[model.Markdown(text="contenido del capítulo")])
+    return [portada, cap]
+
+
+def test_cover_index_is_clickable_pdf(tmp_path):
+    fitz = pytest.importorskip("fitz")
+    out = str(tmp_path / "idx.pdf")
+    res = render_pdf(_doc_with_index(), out, {"title": "T"})
+    assert res["path"] == out
+    doc = fitz.open(out)
+    try:
+        # The cover (page 0) must carry a GOTO link jumping to a later page.
+        goto = [lk for lk in doc[0].get_links()
+                if lk.get("kind") == fitz.LINK_GOTO and lk.get("page", 0) > 0]
+    finally:
+        doc.close()
+    assert goto, "el índice de la portada no produjo enlaces clicables en el PDF"
+
+
+def test_cover_index_shows_heading_pdf(tmp_path):
+    fitz = pytest.importorskip("fitz")
+    out = str(tmp_path / "idxh.pdf")
+    render_pdf(_doc_with_index(), out, {"title": "T"})
+    doc = fitz.open(out)
+    try:
+        text = "".join(page.get_text() for page in doc)
+    finally:
+        doc.close()
+    assert "Índice" in text, "la portada no muestra el encabezado 'Índice'"
+    assert "Este informe incluye" not in text, \
+        "la portada aún muestra el texto antiguo 'Este informe incluye'"
+
+
+def test_cover_index_is_clickable_pptx(tmp_path):
+    pptx = pytest.importorskip("pptx")
+    out = str(tmp_path / "idx.pptx")
+    render_pptx(_doc_with_index(), out, {"title": "T"})
+    prs = pptx.Presentation(out)
+    cover_xml = prs.slides[0]._element.xml
+    assert "hlinksldjump" in cover_xml, \
+        "el índice de la portada no produjo un salto de slide nativo en el PPTX"
+
+
+def test_default_group_renders_like_before_pptx(tmp_path):
+    pptx = pytest.importorskip("pptx")
+    from pptx.enum.shapes import MSO_SHAPE_TYPE
+    out = str(tmp_path / "stack.pptx")
+    grp = model.Group(blocks=[model.Heading(text="Y", level=2),
+                              _narrow_table(),
+                              model.Figure(make=_simple_fig, caption="g")])
+    render_pptx(_chapter([grp]), out, {"title": "T"})
+    prs = pptx.Presentation(out)
+    # Stacked group: the narrow table is a NATIVE table (selectable), and there is
+    # exactly one picture (the figure) — not the two-image side-by-side layout.
+    n_tables = sum(1 for s in prs.slides for sh in s.shapes if sh.has_table)
+    n_pics = sum(1 for s in prs.slides for sh in s.shapes
+                 if sh.shape_type == MSO_SHAPE_TYPE.PICTURE)
+    assert n_tables >= 1, "el grupo apilado debería usar una tabla nativa"
+    assert n_pics == 1, "el grupo apilado no debería duplicar imágenes"
@@ -0,0 +1,125 @@
+---
+id: build_boxplots_figure_py_datascience
+name: build_boxplots_figure
+kind: function
+lang: py
+domain: datascience
+version: "1.0.0"
+purity: impure
+signature: "def build_boxplots_figure(boxes: list, title: str = \"\", max_boxes: int = 12) -> \"matplotlib.figure.Figure\""
+description: "Builds a single matplotlib figure with HORIZONTAL Tukey boxplots (one per column) using ax.bxp: Q1-Q3 box, whiskers up to 1.5*IQR, median line, and outlier points. Consumes the output of build_boxplot_stats (one box dict per column, read with .get) plus an optional list of raw outliers per column; if provided they are drawn as points (showfliers), otherwise only box[min]/box[max] are marked when there are tail outliers (same as num_distr). Draws at most max_boxes boxes (the first ones, already sorted by contamination by the caller) and flags the truncation with (mostrando N de M). Agg backend without global pyplot; height adapts to the number of boxes. Defensive: skips invalid entries and NEVER raises; with no valid boxes it returns a placeholder figure (sin boxplots). It is the small-multiples version of the num_distr chapter, meant to answer at a glance which columns have the most outliers."
+tags: [eda, outliers, boxplot, tukey, iqr, bxp, matplotlib, figure, visualization, small-multiples, datascience, impure]
+uses_functions: []
+uses_types: []
+returns: []
+returns_optional: false
+error_type: "error_go_core"
+imports: [matplotlib]
+example: |
+  from datascience.build_boxplot_stats import build_boxplot_stats
+  from datascience.build_boxplots_figure import build_boxplots_figure
+  boxes = [
+      {"name": "ingresos", "box": build_boxplot_stats({"min": 1.0, "max": 9e3,
+          "p25": 1e3, "median": 2e3, "p75": 3e3, "n_outliers": 7}), "fliers": None},
+      {"name": "edad", "box": build_boxplot_stats({"min": 0.0, "max": 99.0,
+          "p25": 25.0, "median": 38.0, "p75": 52.0}), "fliers": None},
+  ]
+  fig = build_boxplots_figure(boxes, title="Outliers por columna", max_boxes=12)
+tested: true
+tests:
+  - "test_returns_figure_with_axes"
+  - "test_empty_list_returns_placeholder_figure"
+  - "test_invalid_box_is_skipped_not_raised"
+  - "test_all_invalid_returns_placeholder"
+  - "test_raw_fliers_are_drawn"
+  - "test_max_boxes_truncates_and_does_not_raise"
+test_file_path: "python/functions/datascience/build_boxplots_figure_test.py"
+file_path: "python/functions/datascience/build_boxplots_figure.py"
+params:
+  - name: boxes
+    desc: "List of dicts, each {\"name\": str, \"box\": dict, \"fliers\": list|None}. box is EXACTLY the output of build_boxplot_stats (keys read with .get: q1, median, q3, whisker_lo, whisker_hi, min, max, has_low_outliers, has_high_outliers, lower_fence, upper_fence, n_outliers). fliers is the optional list of raw outliers: if present it is drawn as points; if None/absent only the extremes box[min]/box[max] are marked when there are tail outliers. Entries that are not a dict, lack a box dict, or lack q1/median/q3 are skipped. The caller passes them already sorted by contamination (highest first)."
+  - name: title
+    desc: "Figure title (fig.suptitle, left-aligned). Empty => no title. If len(boxes) > max_boxes, a note \"(mostrando N de M)\" is appended so the truncation is not silent. Default \"\"."
+  - name: max_boxes
+    desc: "Maximum number of boxes to draw (the first ones in the list). Default 12. A non-integer value or a value <= 0 falls back to 12. If the list has more entries, the extras are discarded but this is reported in the title with (mostrando N de M)."
+output: "A matplotlib.figure.Figure (figsize 7.0 x adaptive height = max(2.0, 0.5*n + 1.0), dpi 150) with a single Axes that stacks horizontal Tukey boxplots (ax.bxp, orientation=horizontal with fallback vert=False), one per valid column, top to bottom in the order received. Each box: fill #9ec6df, edge/whiskers/caps #5b8aa6, median #2e8b57, outliers #c0392b. Y-axis labels = column names; X axis labeled \"valor\". Outliers are drawn from raw fliers (showfliers) or, if those are missing, marked at box[min]/box[max] according to has_low/high_outliers. If no valid box remains (empty list or all invalid) it returns a placeholder Figure with centered text \"(sin boxplots)\"; any unexpected error is caught and a Figure with the error message is returned. NEVER raises. The caller rasterizes/closes the figure; the function neither shows nor saves it."
|
||||
---
|
||||
|
||||
## Example

```python
import sys, os
sys.path.insert(0, os.path.join("python", "functions"))
from datascience.build_boxplot_stats import build_boxplot_stats
from datascience.build_boxplots_figure import build_boxplots_figure

# One `box` per numeric column, derived from the `numeric` sub-block of the profile
# (output of describe_numeric). The caller passes them already sorted by outlier_pct.
boxes = [
    {
        "name": "ingresos",
        "box": build_boxplot_stats({
            "min": 1.0, "max": 9000.0,
            "p25": 1000.0, "median": 2000.0, "p75": 3000.0,
            "n_outliers": 7,
        }),
        "fliers": None,  # raw values unknown -> only the extreme is marked.
    },
    {
        "name": "edad",
        "box": build_boxplot_stats({
            "min": 0.0, "max": 99.0,
            "p25": 25.0, "median": 38.0, "p75": 52.0,
        }),
        # Raw outliers, i.e. values above the upper fence (52 + 1.5*27 = 92.5),
        # are drawn as points.
        "fliers": [95.0, 99.0],
    },
]

fig = build_boxplots_figure(boxes, title="Outliers por columna", max_boxes=12)

# The report renderer rasterizes it; here we only persist it for inspection.
fig.savefig("/tmp/boxplots.png")
```
|
||||
|
||||
## When to use it

Use it in the outliers chapter of an EDA report when you want to compare at a
glance *which columns are most contaminated by outliers*: unlike `num_distr`
(which draws one histogram+boxplot per column in separate figures), here all the
horizontal boxplots are stacked in **a single figure** (small multiples). First
derive each column's `box` with `build_boxplot_stats`, sort the columns by
descending `outlier_pct`, wrap them as `{"name", "box", "fliers"}` and pass them
in. If you have the raw values that fall outside the fences, pass them as the
`fliers` list and they are drawn as points; otherwise the function marks only
the `min`/`max` extremes when the box reports tail outliers.
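
As a rough illustration of the preparation step described above, the sketch
below computes Tukey fences from quartiles and sorts columns by outlier share
before wrapping them as entries. It is not the registry's `build_boxplot_stats`:
`tukey_box`, `make_entries` and the `stats`/`outlier_pct` fields of the input
are hypothetical names assumed for this example. Only the 1.5·IQR rule and the
entry shape come from the text, and the whiskers are simply clipped at the
fences because no raw data is available.

```python
def tukey_box(p25, median, p75, lo, hi):
    """Hypothetical stand-in for build_boxplot_stats (Tukey 1.5*IQR fences)."""
    iqr = p75 - p25
    lower_fence = p25 - 1.5 * iqr
    upper_fence = p75 + 1.5 * iqr
    return {
        "q1": p25, "median": median, "q3": p75,
        "lower_fence": lower_fence, "upper_fence": upper_fence,
        # Approximation: clip the whisker at the fence or the data extreme.
        "whisker_lo": max(lo, lower_fence),
        "whisker_hi": min(hi, upper_fence),
        "min": lo, "max": hi,
        "has_low_outliers": lo < lower_fence,
        "has_high_outliers": hi > upper_fence,
    }


def make_entries(columns):
    """Sort columns by outlier share (highest first) and wrap them as entries."""
    ordered = sorted(columns, key=lambda c: c["outlier_pct"], reverse=True)
    return [
        {"name": c["name"], "box": tukey_box(*c["stats"]), "fliers": c.get("fliers")}
        for c in ordered
    ]


# Assumed input shape: stats = (p25, median, p75, min, max).
columns = [
    {"name": "edad", "stats": (25.0, 38.0, 52.0, 0.0, 99.0), "outlier_pct": 0.01},
    {"name": "ingresos", "stats": (1000.0, 2000.0, 3000.0, 1.0, 9000.0), "outlier_pct": 0.07},
]
entries = make_entries(columns)
```

The resulting `entries` list has the shape expected by `build_boxplots_figure`,
with the most contaminated column first.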
|
||||
|
||||
## Gotchas

- **Impure because of matplotlib.** It touches the rendering machinery. It uses
  the `Agg` backend and the object-oriented `Figure`/`add_subplot` API, NEVER
  `pyplot.*`, so it touches no global state and leaks no figures between calls.
  `pyplot` is NOT thread-safe; this function builds the `Figure` directly, so it
  is safe to call in a loop from the renderer.
- **The caller releases the figure.** It returns the `Figure` but neither shows
  nor saves it. The consumer must rasterize it and then release it so memory
  does not pile up in large batches. Because the `Figure` is built without
  `pyplot`, it is not registered in pyplot's figure manager: dropping the last
  reference (optionally after `fig.clf()`) is enough, and
  `matplotlib.pyplot.close(fig)` only matters if the caller itself attached the
  figure to pyplot.
- **`fliers` is optional, with different semantics.** If you pass the list of raw
  outliers, all of them are drawn as points (`showfliers=True`). If it is
  `None`/absent the values are unknown and a single point is marked at
  `box["min"]` / `box["max"]` when `has_low_outliers` / `has_high_outliers` is
  set, the same criterion as `num_distr`. Do not invent fliers from the profile:
  the `box` does not carry the raw values, only whether the extremes exceed the
  fences.
- **Orientation API of `ax.bxp`.** Recent matplotlib uses
  `orientation="horizontal"`; older versions use `vert=False`. The function
  tries the first and falls back to the second in `except TypeError`, so it
  works with both. If `bxp` fails altogether, the Axes degrades to the text
  "(boxplot no disponible)" instead of propagating the error.
- **Visible truncation.** `max_boxes` (default 12) limits the number of boxes so
  that none overlap; if the list carries more, the extra ones are dropped but
  the title reports it with "(mostrando N de M)". Pass the columns already
  sorted by contamination so that the dropped ones are the least relevant.
- **Defensive, never raises.** An empty list, non-dict entries, entries without
  `box`, or without `q1`/`median`/`q3` are skipped without propagating; with no
  valid boxes it returns a "(sin boxplots)" placeholder, and any unexpected
  error is captured in a figure carrying the error text. Do not wrap the call in
  try/except for fear of a raise: there is none.
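
The truncation and sizing rules listed above can be sketched without
matplotlib. The helper below is hypothetical (`plan_boxplot_layout` is not part
of the registry) and assumes every entry is valid; it mirrors the documented
behaviour: the fallback to 12 for an unusable `max_boxes`, the
"(mostrando N de M)" note, and the adaptive height `max(2.0, 0.5*n + 1.0)`.

```python
def plan_boxplot_layout(total, max_boxes=12, title=""):
    """Hypothetical helper: cap, title heading and figsize for `total` entries."""
    try:
        cap = int(max_boxes)
    except (TypeError, ValueError):
        cap = 12  # not convertible with int() -> default
    if cap <= 0:
        cap = 12  # non-positive -> default
    n = min(total, cap)  # assumption: all of the first `cap` entries are valid
    note = f"(mostrando {n} de {total})" if total > cap else ""
    heading = " ".join(part for part in (title, note) if part)
    height = max(2.0, 0.5 * n + 1.0)  # grows with the number of boxes
    return {"cap": cap, "heading": heading, "figsize": (7.0, height)}


layout = plan_boxplot_layout(20, max_boxes=5, title="muchos")
```

With 20 entries and `max_boxes=5`, the heading carries the truncation note and
the figure is 3.5 inches tall; with a single box the 2.0 inch minimum applies.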
|
||||
@@ -0,0 +1,250 @@
"""Impure EDA helper: a single figure of horizontal Tukey boxplots (`eda` group).

Draws, in one ``matplotlib.figure.Figure``, a stack of horizontal Tukey boxplots
(one per column) using ``ax.bxp``: each carries its box (Q1–Q3), whiskers (up to
1.5·IQR), the median line and its outlier points. It consumes the output of the
pure registry function ``build_boxplot_stats`` (one ``box`` dict per column) plus
an optional list of raw outlier values per column; it never recomputes anything.

It is the "small-multiples" companion of ``num_distr`` (which draws one
histogram+boxplot per column): here every column shares a single figure so the
caller can show, at a glance, *which* columns are the most contaminated by
outliers (the caller passes them already ordered by contamination).

Impure because it touches matplotlib's rendering machinery. It uses the headless
Agg backend and the object-oriented ``Figure`` API (no ``pyplot``) so it leaks no
global state and is safe to call repeatedly from a report renderer. It is fully
defensive and NEVER raises: invalid entries are skipped and, if nothing valid
remains, it returns a placeholder figure carrying a centered "(sin boxplots)".
"""

import matplotlib

matplotlib.use("Agg")

from matplotlib.figure import Figure  # noqa: E402

# Blue palette shared with the ``num_distr`` chapter so the report stays coherent.
_BOX_FACE = "#9ec6df"  # box fill.
_BOX_EDGE = "#5b8aa6"  # box / whisker / cap border.
_MEDIAN = "#2e8b57"  # median line (sea green).
_OUTLIER = "#c0392b"  # outlier points (soft red).
# Muted gray for the placeholder / fallback message text.
_MUTED_TEXT = "#5f6b7a"
# Soft red for the error fallback message.
_ERROR_TEXT = "#b00020"


def _num(value):
    """Coerce ``value`` to float defensively; None for None/bool/non-numeric/NaN."""
    # bool is a subclass of int; a stat value is never a real bool, so treat
    # True/False as missing instead of silently coercing to 1.0/0.0.
    if value is None or isinstance(value, bool):
        return None
    try:
        f = float(value)
    except (TypeError, ValueError):
        return None
    if f != f:  # NaN guard.
        return None
    return f


def _placeholder_figure(message: str, color: str = _MUTED_TEXT) -> "Figure":
    """Return a fallback ``Figure`` carrying a single centered message."""
    fig = Figure(figsize=(7.0, 2.4), dpi=150)
    ax = fig.add_subplot(111)
    ax.axis("off")
    ax.text(
        0.5,
        0.5,
        message,
        ha="center",
        va="center",
        fontsize=12,
        color=color,
        wrap=True,
        transform=ax.transAxes,
    )
    fig.tight_layout()
    return fig


def build_boxplots_figure(
    boxes: list,
    title: str = "",
    max_boxes: int = 12,
) -> "matplotlib.figure.Figure":
    """Build one figure of stacked horizontal Tukey boxplots (one per column).

    For each entry the function builds a ``bxp`` stats record (``med, q1, q3,
    whislo, whishi, fliers, label``) from its ``box`` sub-dict (the output of
    ``build_boxplot_stats``) and draws all of them as horizontal boxplots sharing
    the X axis, top-to-bottom in the order received (the caller is expected to
    pass them already sorted by contamination).

    Outliers are shown two ways:

    - If an entry carries a ``fliers`` list (the raw out-of-fence values), they
      are drawn as red points via ``ax.bxp(..., showfliers=True)``.
    - If ``fliers`` is ``None``/absent, the raw values are unknown, so only the
      extremes are marked: a red point at ``box["min"]`` when
      ``box["has_low_outliers"]`` and at ``box["max"]`` when
      ``box["has_high_outliers"]`` (same convention as ``num_distr``).

    The function is fully defensive and NEVER raises. Entries that are not dicts,
    lack a ``box`` dict, or miss any of ``q1``/``median``/``q3`` are skipped. If
    after filtering no valid box remains it returns a placeholder ``Figure`` with
    a centered "(sin boxplots)"; any unexpected error is caught and turned into a
    fallback figure carrying the error text. It always returns a ``Figure``.

    Args:
        boxes: List of dicts ``{"name": str, "box": dict, "fliers": list|None}``.
            ``box`` is exactly the output of ``build_boxplot_stats`` (read with
            ``.get``: ``q1, median, q3, whisker_lo, whisker_hi, min, max,
            has_low_outliers, has_high_outliers, ...``). ``fliers`` is the
            optional list of raw outlier values; when present they are plotted,
            otherwise only the extremes are marked.
        title: Figure title (``fig.suptitle``). Empty => no title. When the list
            is longer than ``max_boxes`` a "(mostrando N de M)" note is appended.
        max_boxes: Draw at most the first ``max_boxes`` entries (default 12). The
            rest are dropped but their omission is surfaced in the title note, so
            the truncation is never silent.

    Returns:
        A ``matplotlib.figure.Figure`` with a single Axes holding the horizontal
        boxplots (height adaptive to the box count so none overlap). The caller is
        responsible for rasterizing/closing it; this function never shows nor
        saves it.
    """
    try:
        if not isinstance(boxes, (list, tuple)) or len(boxes) == 0:
            return _placeholder_figure("(sin boxplots)")

        total = len(boxes)

        # Cap the number of boxes; tolerate a non-int / non-positive max_boxes.
        try:
            cap = int(max_boxes)
        except (TypeError, ValueError):
            cap = 12
        if cap <= 0:
            cap = 12
        candidates = list(boxes)[:cap]

        stats_list = []  # bxp stats records, in draw order.
        labels = []  # Y tick labels (column names).
        manual_markers = []  # (position, box) for entries without raw fliers.
        any_fliers = False  # whether to enable showfliers in the bxp call.

        for entry in candidates:
            if not isinstance(entry, dict):
                continue
            box = entry.get("box")
            if not isinstance(box, dict):
                continue

            q1 = _num(box.get("q1"))
            med = _num(box.get("median"))
            q3 = _num(box.get("q3"))
            # Without the three quartiles a boxplot cannot be drawn: skip it.
            if q1 is None or med is None or q3 is None:
                continue

            # Whisker extremes fall back to the quartiles when missing.
            whislo = _num(box.get("whisker_lo"))
            whishi = _num(box.get("whisker_hi"))
            if whislo is None:
                whislo = q1
            if whishi is None:
                whishi = q3

            name = entry.get("name")
            label = "" if name is None else str(name)

            position = len(stats_list) + 1  # bxp positions are 1-indexed.
            fliers_raw = entry.get("fliers")
            if isinstance(fliers_raw, (list, tuple)):
                fliers = [v for v in (_num(x) for x in fliers_raw) if v is not None]
                if fliers:
                    any_fliers = True
            else:
                # Raw values unknown: draw no bxp fliers, mark min/max by hand.
                fliers = []
                manual_markers.append((position, box))

            stats_list.append({
                "med": med,
                "q1": q1,
                "q3": q3,
                "whislo": whislo,
                "whishi": whishi,
                "fliers": fliers,
                "label": label,
            })
            labels.append(label)

        if not stats_list:
            return _placeholder_figure("(sin boxplots)")

        n = len(stats_list)
        positions = list(range(1, n + 1))

        # Height grows with the box count so none of them overlap.
        height = max(2.0, 0.5 * n + 1.0)
        fig = Figure(figsize=(7.0, height), dpi=150)
        ax = fig.add_subplot(111)

        bxp_kw = dict(
            showfliers=any_fliers, widths=0.5, patch_artist=True,
            boxprops={"facecolor": _BOX_FACE, "edgecolor": _BOX_EDGE},
            medianprops={"color": _MEDIAN, "linewidth": 1.6},
            whiskerprops={"color": _BOX_EDGE},
            capprops={"color": _BOX_EDGE},
            flierprops={"marker": "o", "markersize": 3.5,
                        "markerfacecolor": _OUTLIER, "markeredgecolor": _OUTLIER,
                        "linestyle": "none"})
        try:
            # ``orientation`` is the current API; older matplotlib uses ``vert``.
            try:
                ax.bxp(stats_list, positions=positions,
                       orientation="horizontal", **bxp_kw)
            except TypeError:
                ax.bxp(stats_list, positions=positions, vert=False, **bxp_kw)
        except Exception:  # noqa: BLE001 - never let bxp kill the whole figure.
            ax.text(0.5, 0.5, "(boxplot no disponible)", ha="center",
                    va="center", fontsize=10, color=_MUTED_TEXT,
                    transform=ax.transAxes)

        # For entries without raw fliers, mark only the out-of-fence extremes.
        for position, box in manual_markers:
            mn = _num(box.get("min"))
            mx = _num(box.get("max"))
            if box.get("has_low_outliers") and mn is not None:
                ax.plot([mn], [position], marker="o", markersize=3.5,
                        color=_OUTLIER, zorder=5)
            if box.get("has_high_outliers") and mx is not None:
                ax.plot([mx], [position], marker="o", markersize=3.5,
                        color=_OUTLIER, zorder=5)

        # Pin the Y tick labels explicitly so they work across matplotlib
        # versions regardless of whether ``bxp`` consumed the ``label`` key.
        ax.set_yticks(positions)
        ax.set_yticklabels(labels, fontsize=8)
        ax.set_xlabel("valor", fontsize=9)
        ax.tick_params(labelsize=7)
        ax.margins(y=0.15)
        # Position 1 sits at the bottom by default; flip the Y axis so the first
        # entry is drawn on top, as the docstring promises.
        ax.invert_yaxis()
        for spine in ("top", "right"):
            ax.spines[spine].set_visible(False)

        # Surface truncation in the title instead of silently dropping boxes.
        note = f"(mostrando {n} de {total})" if total > cap else ""
        heading = " ".join(p for p in (title, note) if p)
        if heading:
            fig.suptitle(heading, fontsize=12, x=0.02, ha="left")

        fig.tight_layout()
        return fig
    except Exception as exc:  # noqa: BLE001 - never raise from a figure builder.
        return _placeholder_figure(
            f"error al dibujar boxplots: {exc}", color=_ERROR_TEXT)
|
||||
@@ -0,0 +1,109 @@
"""Tests for build_boxplots_figure (horizontal Tukey boxplots, eda group).

Uses the Agg backend with no display; neither shows nor saves figures. Each test
explicitly closes the Figure it built (matplotlib.pyplot.close) so that no state
piles up between tests.
"""

import matplotlib

matplotlib.use("Agg")

import matplotlib.pyplot as plt  # noqa: E402
from matplotlib.figure import Figure  # noqa: E402

from build_boxplots_figure import build_boxplots_figure


def _box(name, q1, median, q3, mn, mx, low=False, high=False, fliers=None):
    """Build a {name, box, fliers} entry with a build_boxplot_stats-style box."""
    iqr = q3 - q1
    return {
        "name": name,
        "box": {
            "q1": q1,
            "median": median,
            "q3": q3,
            "iqr": iqr,
            "lower_fence": q1 - 1.5 * iqr,
            "upper_fence": q3 + 1.5 * iqr,
            "whisker_lo": max(mn, q1 - 1.5 * iqr),
            "whisker_hi": min(mx, q3 + 1.5 * iqr),
            "min": mn,
            "max": mx,
            "has_low_outliers": low,
            "has_high_outliers": high,
            "n_outliers": 0,
        },
        "fliers": fliers,
    }


def test_returns_figure_with_axes():
    boxes = [
        _box("edad", 10.0, 25.0, 40.0, 1.0, 100.0, high=True),
        _box("ingresos", 100.0, 200.0, 300.0, 50.0, 400.0),
        _box("score", -1.0, 0.0, 1.0, -5.0, 5.0, low=True, high=True),
    ]
    fig = build_boxplots_figure(boxes, title="Boxplots", max_boxes=12)
    assert isinstance(fig, Figure)
    assert len(fig.axes) >= 1
    # Three boxes -> three labels on the Y axis.
    ax = fig.axes[0]
    assert len(ax.get_yticks()) == 3
    plt.close(fig)


def test_empty_list_returns_placeholder_figure():
    fig = build_boxplots_figure([], title="vacío")
    assert isinstance(fig, Figure)
    assert len(fig.axes) >= 1
    plt.close(fig)


def test_invalid_box_is_skipped_not_raised():
    boxes = [
        {"name": "rota", "box": {"q1": None, "median": None, "q3": None}},
        {"name": "sin_box"},  # the box key is missing.
        "no_es_dict",  # non-dict entry.
        _box("buena", 1.0, 2.0, 3.0, 0.0, 10.0, high=True),
    ]
    fig = build_boxplots_figure(boxes)
    assert isinstance(fig, Figure)
    ax = fig.axes[0]
    # Only the valid box survives the filtering.
    assert len(ax.get_yticks()) == 1
    plt.close(fig)


def test_all_invalid_returns_placeholder():
    boxes = [
        {"name": "a", "box": {"q1": None, "median": 1.0, "q3": 2.0}},
        {"name": "b"},
    ]
    fig = build_boxplots_figure(boxes)
    assert isinstance(fig, Figure)
    assert len(fig.axes) >= 1
    plt.close(fig)


def test_raw_fliers_are_drawn():
    boxes = [
        _box("con_fliers", 10.0, 20.0, 30.0, 5.0, 200.0,
             high=True, fliers=[150.0, 180.0, 200.0]),
    ]
    fig = build_boxplots_figure(boxes)
    assert isinstance(fig, Figure)
    assert len(fig.axes) >= 1
    plt.close(fig)


def test_max_boxes_truncates_and_does_not_raise():
    boxes = [_box(f"c{i}", float(i), float(i + 1), float(i + 2),
                  float(i - 5), float(i + 10)) for i in range(20)]
    fig = build_boxplots_figure(boxes, title="muchos", max_boxes=5)
    assert isinstance(fig, Figure)
    ax = fig.axes[0]
    # Only the first 5 boxes are drawn.
    assert len(ax.get_yticks()) == 5
    plt.close(fig)
|
||||
@@ -0,0 +1,111 @@
|
||||
---
|
||||
id: categorical_top_bar_figure_py_datascience
name: categorical_top_bar_figure
kind: function
lang: py
domain: datascience
version: "1.0.0"
purity: impure
signature: "def categorical_top_bar_figure(top: list, n_distinct: int = 0, title: str = \"\", top_k: int = 6, n_rows=None) -> \"matplotlib.figure.Figure\""
description: "Builds a matplotlib figure of horizontal bars for the top_k most frequent categories of a categorical column, largest on top, folding the rest into a gray \"Otros (N categorías)\" bar. Input contract identical to categorical_top_pie_figure (direct donut↔bar swap): consumes the `top` block of summarize_categorical and returns a matplotlib.figure.Figure ready to be rasterized by the EDA report renderer. Agg backend with no global pyplot; fully defensive against an empty/None top, never raises."
tags: [eda, categorical, bar, barh, matplotlib, figure, visualization, datascience, impure]
uses_functions: []
uses_types: []
returns: []
returns_optional: false
error_type: "error_go_core"
imports: [matplotlib]
example: |
  from categorical_top_bar_figure import categorical_top_bar_figure
  top = [
      {"value": "rojo", "count": 40, "pct": 0.4},
      {"value": "azul", "count": 30, "pct": 0.3},
      {"value": "verde", "count": 20, "pct": 0.2},
  ]
  fig = categorical_top_bar_figure(top, n_distinct=12, title="color", top_k=6, n_rows=100)
tested: true
tests:
  - "test_returns_figure"
  - "test_ten_items_topk_six_yields_seven_bars"
  - "test_empty_top_does_not_raise_and_returns_figure"
  - "test_long_value_truncated"
  - "test_none_value_and_none_count_are_handled"
  - "test_n_rows_adds_exact_others_bar"
test_file_path: "python/functions/datascience/categorical_top_bar_figure_test.py"
file_path: "python/functions/datascience/categorical_top_bar_figure.py"
params:
  - name: top
    desc: "List of {value, count, pct} dicts sorted by count, largest first (output of the `top` block of summarize_categorical). It may be empty or carry incomplete dicts: non-dict items, items without count, with count None or count <= 0 are dropped. value None is accepted (empty label)."
  - name: n_distinct
    desc: "Total number of distinct categories in the column. Labels the aggregated bar as \"Otros (n_distinct - top_k)\" (floored at 0). If it does not exceed the number of bars shown, the actual overflow of `top` is used as the number of aggregated categories. Default 0."
  - name: title
    desc: "Figure title (the column name). Truncated to ~48 chars with an ellipsis when too long. Default \"\" (no title)."
  - name: top_k
    desc: "Maximum number of explicit bars. Default 6. The \"Otros\" bar does not count against this limit. With top_k <= 0 at least the largest category is shown."
  - name: n_rows
    desc: "Optional. Total number of rows in the dataset. When given and the sum of the shown counts is < n_rows, the \"Otros\" bar uses (n_rows - sum_shown) as its count so that it is exact with respect to the real total. When omitted, \"Otros\" uses the sum of the counts outside the shown top_k (only when top carries more than top_k items). Default None."
output: "A matplotlib.figure.Figure (figsize 6.4 x height scaled with the number of bars, dpi 150) with one Axes of horizontal bars: the most frequent category on top, the gray \"Otros (N categorías)\" bar at the bottom, each bar annotated with its count and percentage at the end, and category labels (yticklabels) truncated to ~22 chars. With no valid counts it still returns a Figure with the centered text \"sin datos categóricos\" (never raises); any unexpected error falls back to a Figure carrying the error text. The caller rasterizes/closes the figure; the function neither shows nor saves it."
|
||||
---
|
||||
|
||||
## Example

```python
from categorical_top_bar_figure import categorical_top_bar_figure

# `top` is the output of the "top" block of summarize_categorical (already sorted desc).
top = [
    {"value": "rojo", "count": 40, "pct": 0.40},
    {"value": "azul", "count": 30, "pct": 0.30},
    {"value": "verde", "count": 20, "pct": 0.20},
    {"value": "amarillo", "count": 5, "pct": 0.05},
]

fig = categorical_top_bar_figure(
    top,
    n_distinct=12,  # 12 distinct categories in total
    title="color_producto",
    top_k=6,  # up to 6 explicit bars
    n_rows=100,  # "Otros" = 100 - 95 = 5, across 8 aggregated categories
)

# The report renderer rasterizes it; here we only persist it for inspection.
fig.savefig("/tmp/barras_color.png")
```
|
||||
|
||||
## When to use it

Use it inside an EDA report when you want to compare the **magnitudes** of the
dominant categories of a categorical column: which category leads and by how
much over the next ones. Pass it the `top` block of `summarize_categorical`
directly (already sorted largest first) plus `n_distinct`, so that the "Otros"
bar states how many categories are grouped. It is the bar-chart clone of the
donut `categorical_top_pie_figure` with an **identical input contract**: you can
swap one for the other without touching the caller. Choose bars when comparing
exact sizes matters, and the donut when the share of the total matters.
|
||||
|
||||
## Gotchas

- **Impure because of matplotlib.** It touches the rendering machinery. It uses
  the `Agg` backend and the object-oriented `Figure`/`add_subplot` API, NEVER
  `pyplot.*`, so it touches no global state and leaks no figures between calls.
  `pyplot` is NOT thread-safe; this function avoids that risk by building the
  `Figure` directly, so it is safe to call in a loop from the renderer.
- **The caller closes the figure.** The function returns the `Figure` but
  neither shows nor saves it. The consumer must rasterize it and then release it
  (`fig.clf()`, or `matplotlib.pyplot.close(fig)` if the caller used pyplot) so
  memory does not pile up in large batches of columns.
- **`barh` draws bottom-up.** The most frequent category ends up on top because
  the display order is reversed before plotting; the "Otros" bar always sits at
  the bottom. Do not reorder `top` expecting a different layout: the function
  assumes it already arrives sorted by count, largest first.
- **Exact "Otros" magnitude only with `n_rows`.** Without `n_rows`, the "Otros"
  bar is computed from the overflow present in `top`; if the producer already
  trimmed `top` to `top_k`, there is no "Otros" bar even when more categories
  exist. Pass `n_rows` (total rows in the dataset) to get a bar that is correct
  with respect to the real total.
- **Defensive, never raises.** `top=[]`, `value=None`, `count=None` or
  non-numeric counts are handled without error: in the worst case it returns a
  `Figure` with "sin datos categóricos", and any unexpected exception falls back
  to a `Figure` carrying the error text. Do not wrap the call in try/except for
  fear of a raise: there is none.
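
The "Otros" rule described above can be sketched in a few lines. This is only
an illustration of the documented behaviour, not the function's actual code:
`others_count` is a hypothetical helper, `counts` is assumed to be the list of
valid counts already sorted largest first, and falling back to the overflow
when `n_rows` does not exceed the shown sum is an assumption.

```python
def others_count(counts, top_k=6, n_rows=None):
    """Hypothetical sketch: magnitude of the aggregated "Otros" bar."""
    # top_k <= 0 still shows at least the largest category.
    k = max(int(top_k), 0) or 1
    sum_shown = sum(counts[:k])
    if n_rows is not None and sum_shown < n_rows:
        # Exact remainder with respect to the real dataset total.
        return n_rows - sum_shown
    # Without n_rows only the overflow carried by `top` is known.
    return sum(counts[k:])


counts = [40, 30, 20, 5]
with_total = others_count(counts, top_k=6, n_rows=100)
without_total = others_count(counts, top_k=6)
```

With `n_rows=100` the remainder is 5 even though `top` carries no overflow;
without it the sketch returns 0, which is the case where no "Otros" bar appears.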
|
||||
@@ -0,0 +1,233 @@
|
||||
"""Impure EDA helper: horizontal bar figure of the most common categories (`eda` group).
|
||||
|
||||
Builds a horizontal bar chart of the ``top_k`` most frequent categories of a
|
||||
categorical column, folding everything else into a single gray
|
||||
"Otros (N categorías)" bar. The most frequent category sits at the top, each bar
|
||||
labelled with its count (and percentage) at the end. Returns a ready-to-rasterize
|
||||
``matplotlib.figure.Figure``; it never shows nor saves it.
|
||||
|
||||
This is the "magnitude" twin of ``categorical_top_pie_figure``: identical input
|
||||
contract (same ``top``/``n_distinct``/``title``/``top_k``/``n_rows`` signature) so
|
||||
it can be swapped in directly, but it communicates comparable magnitudes via bars
|
||||
instead of proportions via wedges.
|
||||
|
||||
Impure because it touches matplotlib's rendering machinery. It uses the headless
|
||||
Agg backend and the object-oriented ``Figure`` API (no ``pyplot``) so it leaks no
|
||||
global state and is safe to call repeatedly from a report renderer.
|
||||
"""
|
||||
|
||||
import matplotlib
|
||||
|
||||
matplotlib.use("Agg")
|
||||
|
||||
from matplotlib.figure import Figure # noqa: E402
|
||||
|
||||
|
||||
# Gray reserved for the aggregated "Otros" bar.
|
||||
_OTHER_COLOR = "#9e9e9e"
|
||||
# Muted gray for secondary text (title fallback, no-data message).
|
||||
_MUTED_TEXT = "#5f6b7a"
|
||||
# Soft red for the error fallback message.
|
||||
_ERROR_TEXT = "#b00020"
|
||||
# Pleasant, colour-blind-friendly qualitative palette for the explicit bars.
|
||||
_PALETTE = [
|
||||
"#4C72B0",
|
||||
"#DD8452",
|
||||
"#55A868",
|
||||
"#C44E52",
|
||||
"#8172B3",
|
||||
"#937860",
|
||||
"#DA8BC3",
|
||||
"#8C8C8C",
|
||||
"#CCB974",
|
||||
"#64B5CD",
|
||||
]
|
||||
|
||||
|
||||
def _truncate(text, width: int = 22) -> str:
|
||||
"""Truncate ``text`` to ``width`` chars, appending an ellipsis if cut."""
|
||||
s = "" if text is None else str(text)
|
||||
if len(s) <= width:
|
||||
return s
|
||||
if width <= 1:
|
||||
return s[:width]
|
||||
return s[: width - 1] + "…"
|
||||
|
||||
|
||||
def _message_figure(message: str, color: str = _MUTED_TEXT, title: str = "") -> "Figure":
|
||||
"""Return a fallback ``Figure`` carrying a single centered message."""
|
||||
fig = Figure(figsize=(6.4, 4.0), dpi=150)
|
||||
ax = fig.add_subplot(111)
|
||||
ax.axis("off")
|
||||
ax.text(
|
||||
0.5,
|
||||
0.5,
|
||||
message,
|
||||
ha="center",
|
||||
va="center",
|
||||
fontsize=12,
|
||||
color=color,
|
||||
wrap=True,
|
||||
transform=ax.transAxes,
|
||||
)
|
||||
if title:
|
||||
ax.set_title(_truncate(title, 48), fontsize=12, loc="center", pad=8)
|
||||
fig.tight_layout()
|
||||
return fig
|
||||
|
||||
|
||||
def categorical_top_bar_figure(
|
||||
top: list,
|
||||
n_distinct: int = 0,
|
||||
title: str = "",
|
||||
top_k: int = 6,
|
||||
n_rows=None,
|
||||
) -> "matplotlib.figure.Figure":
|
||||
"""Build a horizontal bar figure of the most common categories of a column.
|
||||
|
||||
Renders the ``top_k`` most frequent categories as explicit horizontal bars,
|
||||
largest at the top, and aggregates every remaining category into a single
|
||||
gray "Otros (N categorías)" bar at the bottom. Each bar is annotated with its
|
||||
count and percentage of the total at the end of the bar; the category names
|
||||
are truncated Y tick labels.
|
||||
|
||||
The function shares the exact input contract of
|
||||
``categorical_top_pie_figure`` (the donut twin) so it is a drop-in swap. It is
|
||||
fully defensive: empty input, missing/``None`` values or counts never raise.
|
||||
When there is nothing valid to draw it still returns a ``Figure`` carrying a
|
||||
centered "sin datos categóricos" message, and any unexpected error is caught
|
||||
and turned into a fallback ``Figure`` carrying the error text.
|
||||
|
||||
Args:
|
||||
top: List of ``{value, count, pct}`` dicts, already sorted by ``count``
|
||||
descending (the ``top`` block of ``summarize_categorical``). May be
|
||||
empty or carry incomplete/``None`` entries; non-dict items, items
|
||||
without a positive numeric ``count`` and ``None`` counts are skipped.
|
||||
n_distinct: Total number of distinct categories in the column. Used to
|
||||
label the aggregated bar as "Otros (n_distinct - shown bars)" (floored at
|
||||
0). Ignored when it does not exceed the number of shown bars.
|
||||
title: Figure title (the column name). Truncated when too long.
|
||||
top_k: Maximum number of explicit bars. Default 6. The "Otros" bar does
|
||||
not count against this limit.
|
||||
n_rows: Optional total row count of the dataset. When given and the sum of
|
||||
shown counts is below ``n_rows``, the "Otros" bar uses
|
||||
``n_rows - sum_shown`` as its count so it is exact with respect to the
|
||||
real total. When omitted, "Otros" uses the sum of the counts that fall
|
||||
outside the shown ``top_k`` (only when ``top`` carries more than
|
||||
``top_k`` items).
|
||||
|
||||
Returns:
|
||||
A ``matplotlib.figure.Figure`` with a single horizontal-bar Axes. The
|
||||
caller is responsible for rasterizing/closing it.
|
||||
"""
|
||||
try:
|
||||
safe_title = _truncate(title, 48)
|
||||
|
||||
# --- Defensive parse: keep only well-formed {value, count} with count > 0.
|
||||
cleaned = []
|
||||
if isinstance(top, list):
|
||||
for item in top:
|
||||
if not isinstance(item, dict):
|
||||
continue
|
||||
count = item.get("count")
|
||||
if count is None:
|
||||
continue
|
||||
try:
|
||||
count = float(count)
|
||||
except (TypeError, ValueError):
|
||||
continue
|
||||
if count <= 0:
|
||||
continue
|
||||
cleaned.append((item.get("value"), count))
|
||||
|
||||
if not cleaned:
|
||||
return _message_figure("sin datos categóricos", title=title)
|
||||
|
||||
# --- Split into shown bars and the aggregated remainder.
|
||||
shown = cleaned[: max(int(top_k), 0)]
|
||||
if not shown: # top_k <= 0 — show at least the largest category.
|
||||
shown = cleaned[:1]
|
||||
|
||||
sum_shown = sum(c for _, c in shown)
|
||||
overflow_count = sum(c for _, c in cleaned[len(shown):])
|
||||
|
||||
# How many categories are folded into "Otros".
|
||||
try:
|
||||
nd = int(n_distinct)
|
||||
except (TypeError, ValueError):
|
||||
nd = 0
|
||||
others_categories = max(nd - len(shown), 0)
|
||||
# If n_distinct is unknown/too small, fall back to the overflow we
|
||||
# actually have in `top` beyond the shown bars.
|
||||
overflow_items = len(cleaned) - len(shown)
|
||||
if others_categories == 0 and overflow_items > 0:
|
||||
others_categories = overflow_items
|
||||
|
||||
# Count attributed to the "Otros" bar.
|
||||
others_count = 0.0
|
||||
if n_rows is not None:
|
||||
try:
|
||||
total_rows = float(n_rows)
|
||||
except (TypeError, ValueError):
|
||||
total_rows = None
|
||||
if total_rows is not None and total_rows > sum_shown:
|
||||
others_count = total_rows - sum_shown
|
||||
if others_count <= 0:
|
||||
others_count = overflow_count
|
||||
|
||||
# --- Build the display order (top to bottom): largest .. smallest, Otros.
|
||||
display_labels = [_truncate(v, 22) for v, _ in shown]
|
||||
display_values = [c for _, c in shown]
|
||||
display_colors = [_PALETTE[i % len(_PALETTE)] for i in range(len(shown))]
|
||||
|
||||
has_others = others_count > 0 and others_categories > 0
|
||||
if has_others:
|
||||
display_labels.append(f"Otros ({others_categories} categorías)")
|
||||
display_values.append(others_count)
|
||||
display_colors.append(_OTHER_COLOR)
|
||||
|
||||
total = sum(display_values) or 1.0
|
||||
|
||||
# barh draws bottom-up, so reverse the display order before plotting to
|
||||
# land the largest category on top and "Otros" at the bottom.
|
||||
labels = list(reversed(display_labels))
|
||||
values = list(reversed(display_values))
|
||||
colors = list(reversed(display_colors))
|
||||
y_pos = range(len(values))
|
||||
|
||||
# Height scales with the number of bars so dense reports stay readable.
|
||||
n_bars = len(values)
|
||||
height = max(2.4, min(0.4 * n_bars + 1.2, 14.0))
|
||||
fig = Figure(figsize=(6.4, height), dpi=150)
|
||||
ax = fig.add_subplot(111)
|
||||
|
||||
ax.barh(list(y_pos), values, color=colors, edgecolor="white")
|
||||
ax.set_yticks(list(y_pos))
|
||||
ax.set_yticklabels(labels, fontsize=8)
|
||||
ax.set_xlabel("conteo", fontsize=9)
|
||||
|
||||
max_val = max(values) if values else 1.0
|
||||
ax.set_xlim(0, max_val * 1.18 if max_val > 0 else 1.0)
|
||||
|
||||
# Annotate each bar with its count and percentage at the end of the bar.
|
||||
for y, val in zip(y_pos, values):
|
||||
pct = val / total * 100.0
|
||||
ax.text(
|
||||
val + max_val * 0.012,
|
||||
y,
|
||||
f"{int(round(val))} ({pct:.0f}%)",
|
||||
va="center",
|
||||
ha="left",
|
||||
fontsize=7,
|
||||
color="#202020",
|
||||
)
|
||||
|
||||
if safe_title:
|
||||
ax.set_title(safe_title, fontsize=13, loc="left", pad=10)
|
||||
|
||||
fig.tight_layout()
|
||||
return fig
|
||||
except Exception as exc: # noqa: BLE001 — never raise from a figure builder.
|
||||
return _message_figure(
|
||||
f"error al dibujar barras: {exc}", color=_ERROR_TEXT
|
||||
)
|
||||
@@ -0,0 +1,103 @@
|
"""Tests for categorical_top_bar_figure (top-category bars, eda group).

Uses the headless Agg backend; figures are neither shown nor saved. Each test
explicitly closes the Figure it builds (matplotlib.pyplot.close) so that no
state accumulates between tests.
"""
|
||||
|
||||
import matplotlib
|
||||
|
||||
matplotlib.use("Agg")
|
||||
|
||||
import matplotlib.pyplot as plt # noqa: E402
|
||||
from matplotlib.figure import Figure # noqa: E402
|
||||
|
||||
from categorical_top_bar_figure import categorical_top_bar_figure
|
||||
|
||||
|
||||
def _make_top(n):
|
"""Return n {value, count, pct} items sorted by count, descending."""
|
||||
return [
|
||||
{"value": f"cat_{i}", "count": n - i, "pct": (n - i) / sum(range(1, n + 1))}
|
||||
for i in range(n)
|
||||
]
|
||||
|
||||
|
||||
def _bar_count(ax):
|
"""Return the number of bars (length of the Axes' first BarContainer)."""
|
||||
if ax.containers:
|
||||
return len(ax.containers[0])
|
||||
return 0
|
||||
|
||||
|
||||
def test_returns_figure():
|
||||
fig = categorical_top_bar_figure(_make_top(3), n_distinct=3, title="col")
|
||||
assert isinstance(fig, Figure)
|
||||
plt.close(fig)
|
||||
|
||||
|
||||
def test_ten_items_topk_six_yields_seven_bars():
|
||||
top = _make_top(10)
|
||||
fig = categorical_top_bar_figure(top, n_distinct=10, title="muchas", top_k=6)
|
||||
ax = fig.axes[0]
|
||||
# 6 explicit categories + 1 "Otros" bar.
|
||||
assert _bar_count(ax) == 7
|
||||
plt.close(fig)
|
||||
|
||||
|
||||
def test_empty_top_does_not_raise_and_returns_figure():
|
||||
fig = categorical_top_bar_figure([], n_distinct=0, title="vacía")
|
||||
assert isinstance(fig, Figure)
|
||||
# No data: there must be no bars.
|
||||
assert _bar_count(fig.axes[0]) == 0
|
||||
plt.close(fig)
|
||||
|
||||
|
||||
def test_long_value_truncated():
|
||||
long_value = "una_categoria_con_un_nombre_larguisimo_que_excede_el_limite"
|
||||
top = [
|
||||
{"value": long_value, "count": 10, "pct": 0.5},
|
||||
{"value": "corta", "count": 10, "pct": 0.5},
|
||||
]
|
||||
fig = categorical_top_bar_figure(top, n_distinct=2, title="col", top_k=6)
|
||||
ax = fig.axes[0]
|
||||
tick_texts = [t.get_text() for t in ax.get_yticklabels()]
|
||||
# The long value appears truncated with an ellipsis and NOT in its full form.
|
||||
assert any("…" in t for t in tick_texts)
|
||||
assert long_value not in " ".join(tick_texts)
|
||||
plt.close(fig)
|
||||
|
||||
|
||||
def test_none_value_and_none_count_are_handled():
|
||||
top = [
|
||||
{"value": None, "count": 5, "pct": 0.5},
|
||||
{"value": "b", "count": None, "pct": 0.0}, # count None -> dropped
|
||||
{"value": "c", "count": 5, "pct": 0.5},
|
||||
]
|
||||
fig = categorical_top_bar_figure(top, n_distinct=2, title="con nones", top_k=6)
|
||||
assert isinstance(fig, Figure)
|
||||
# Only 2 valid items, no overflow -> 2 bars, no "Otros".
|
||||
assert _bar_count(fig.axes[0]) == 2
|
||||
plt.close(fig)
|
||||
|
||||
|
||||
def test_n_rows_adds_exact_others_bar():
|
||||
# The 3 shown categories add up to 30, the real dataset has 100 rows -> "Otros" = 70.
|
||||
top = [
|
||||
{"value": "a", "count": 15, "pct": 0.15},
|
||||
{"value": "b", "count": 10, "pct": 0.10},
|
||||
{"value": "c", "count": 5, "pct": 0.05},
|
||||
]
|
||||
fig = categorical_top_bar_figure(
|
||||
top, n_distinct=20, title="col", top_k=3, n_rows=100
|
||||
)
|
||||
ax = fig.axes[0]
|
||||
# 3 explicit bars + "Otros".
|
||||
assert _bar_count(ax) == 4
|
||||
tick_texts = [t.get_text() for t in ax.get_yticklabels()]
|
||||
# The "Otros" bar reflects n_distinct - top_k = 17 categories.
|
||||
assert any("Otros (17 categorías)" in t for t in tick_texts)
|
||||
# Its annotation carries the count 70.
|
||||
annotation_texts = [t.get_text() for t in ax.texts]
|
||||
assert any("70" in t for t in annotation_texts)
|
||||
plt.close(fig)
|
||||
@@ -0,0 +1,121 @@
|
||||
---
|
||||
id: render_table_as_figure_py_datascience
|
||||
name: render_table_as_figure
|
||||
kind: function
|
||||
lang: py
|
||||
domain: datascience
|
||||
version: "1.0.0"
|
||||
purity: impure
|
||||
signature: "def render_table_as_figure(header, rows, title=None, note=None, fontsize=9.0, max_cell_chars=40) -> \"matplotlib.figure.Figure\""
|
||||
description: "Draws a tabular block (header + rows) as a crisp matplotlib.figure.Figure, ready to be rasterized at high DPI. Intended for tables that do NOT fit as text on a page/slide of the EDA report: the table is rasterized at high resolution (the caller uses dpi=220, bbox_inches='tight') and the reader zooms in on a phone to read it in full without losing data. Shaded (#eef3f6) bold header, even rows (1-based) with a soft zebra stripe (#f6f8fa), dark ink (#1b1b1b) on white, very thin gray grid (#cccccc). Every value is str()-ed (None -> \"\") and every cell is truncated to max_cell_chars with an ellipsis. figsize is proportional to the content (width from the number and text length of the columns, height from the number of rows) so it stays legible under zoom. Agg backend, no global pyplot. Defensive: ragged or scalar rows are padded into a rectangular grid; when there is nothing to draw (header and rows both empty/None) or on any internal error it returns a placeholder Figure with the centered text \"(tabla no disponible)\". NEVER raises."
|
||||
tags: [eda, table, figure, matplotlib, visualization, rasterize, zoom, render, datascience, impure]
|
||||
uses_functions: []
|
||||
uses_types: []
|
||||
returns: []
|
||||
returns_optional: false
|
||||
error_type: "error_go_core"
|
||||
imports: [matplotlib]
|
||||
example: |
  from datascience.render_table_as_figure import render_table_as_figure
  header = ["columna", "n_nulos", "%_nulos", "distintos", "tipo", "ejemplo"]
  rows = [
      ["ingresos", 12, "1.2%", 980, "float64", "2345.67"],
      ["edad", 0, "0.0%", 88, "int64", "37"],
      ["ciudad", 5, "0.5%", 412, "object", "Madrid"],
  ]
  fig = render_table_as_figure(header, rows, title="Resumen de columnas",
                               note="rasteriza a dpi=220 y haz zoom")
  fig.savefig("/tmp/tabla.png", dpi=220, bbox_inches="tight")
|
||||
tested: true
|
||||
tests:
|
||||
- "test_returns_figure_with_table"
|
||||
- "test_rows_none_does_not_raise"
|
||||
- "test_header_none_does_not_raise"
|
||||
- "test_empty_lists_return_placeholder_figure"
|
||||
- "test_both_none_return_placeholder_figure"
|
||||
- "test_long_cell_is_truncated"
|
||||
- "test_none_cells_become_empty_strings"
|
||||
- "test_can_rasterize_to_png_high_dpi"
|
||||
- "test_placeholder_can_rasterize"
|
||||
- "test_ragged_rows_are_padded"
|
||||
test_file_path: "python/functions/datascience/render_table_as_figure_test.py"
|
||||
file_path: "python/functions/datascience/render_table_as_figure.py"
|
||||
params:
  - name: header
    desc: "List of column names (may be [] or None). Each name is str()-ed, truncated to max_cell_chars and drawn in the shaded, bold header row. When it is empty/None no header row is drawn (body only)."
  - name: rows
    desc: "List of rows; each row is a list of cells holding arbitrary values (they are str()-ed; None -> \"\"). Accepts None (treated as []), scalar rows (wrapped into a single cell) and rows of different lengths (the grid is made rectangular at the maximum width, padding with empty cells). Newlines/tabs inside a cell are collapsed to spaces so the cell does not spill into other rows."
  - name: title
    desc: "Optional title drawn above the table, bold, ink #1b1b1b, left-aligned. None or \"\" => no title. Default None."
  - name: note
    desc: "Optional note at the bottom of the figure, in gray #8a8a8a italics. None or \"\" => no note. Default None."
  - name: fontsize
    desc: "Base font size (pt) of the table cells; the header row uses the same size in bold. The title uses fontsize+3 and the note max(7, fontsize-1). A non-numeric value or one <= 0 falls back to 9.0. Default 9.0."
  - name: max_cell_chars
    desc: "Truncates the text of each cell to this number of characters (with a trailing … when it is cut) so the width does not blow up. A value that cannot be converted to int falls back to 40; <= 0 leaves the cells empty. Default 40."
|
||||
output: "A matplotlib.figure.Figure (figsize proportional to the content: width ≈ 0.9-1.6\" per column depending on its text, total clamped to 3-26\"; height ≈ 0.32\" per table row, header included, plus room for title/note, clamped to 1-60\") with an axes-off Axes holding an ax.table(...); the Figure is NOT closed. Header: background #eef3f6, text #1b1b1b bold; even rows (1-based) zebra #f6f8fa, odd rows white; ink #1b1b1b; borders/grid #cccccc, lw 0.4; text left-aligned. Title above (bold) and note below (gray italic) when given. When there is nothing to draw (header and rows both empty/None), or on any internal error, it returns a small placeholder Figure with the centered text \"(tabla no disponible)\". NEVER raises. The caller rasterizes it (dpi=220, bbox_inches='tight') and closes it; the function neither shows nor saves it."
|
||||
---
|
||||
|
||||
## Example

```python
import sys, os
sys.path.insert(0, os.path.join("python", "functions"))
from datascience.render_table_as_figure import render_table_as_figure

# A table that does not fit as text on the slide -> rasterize it and read it zoomed in.
header = ["columna", "n_nulos", "%_nulos", "distintos", "tipo", "ejemplo"]
rows = [
    ["ingresos", 12, "1.2%", 980, "float64", "2345.67"],
    ["edad", 0, "0.0%", 88, "int64", "37"],
    ["ciudad", 5, "0.5%", 412, "object", "Madrid"],
    ["categoria_producto", 0, "0.0%", 1840, "object",
     "un_valor_categorico_muy_largo_que_se_trunca"],
]

fig = render_table_as_figure(
    header,
    rows,
    title="Resumen de columnas",
    note="rasteriza a dpi=220 y haz zoom en el móvil",
    fontsize=9.0,
    max_cell_chars=40,
)

# The report renderer rasterizes it at high resolution; here we just persist it.
fig.savefig("/tmp/tabla.png", dpi=220, bbox_inches="tight")
```
|
||||
|
||||
## When to use it

Use it in an EDA report when a table **does not fit as text** on a page or
slide and you would rather have a crisp image that the reader can zoom into on a
phone to read it in full (column profiles, count matrices, frequency tables with
many rows or wide columns). Pass the header and the rows as they are (the values
are `str()`-ed for you) plus an optional `title`/`note`; the caller rasterizes
it at `dpi=220` with `bbox_inches='tight'`. It is the "table-as-image"
counterpart of the `build_boxplots_figure` / `categorical_top_pie_figure`
charts: same palette and same contract (Agg, no `pyplot`, the caller closes the
figure).
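
To make "the values are `str()`-ed for you" concrete, here is a minimal sketch of the per-cell normalization this document describes (`None` becomes an empty string, newlines/tabs collapse to spaces, long text is cut with a trailing `…`). `normalize_cell` is a hypothetical name used only for illustration; it is not part of the module's public API.

```python
def normalize_cell(value, max_cell_chars=40):
    """Hypothetical stand-in for the module's private cell-text helper."""
    # None -> "", anything else goes through str().
    text = "" if value is None else str(value)
    # Collapse newlines/tabs so one cell never spills across table rows.
    for ch in ("\n", "\r", "\t"):
        text = text.replace(ch, " ")
    # Tolerate a bad limit: fall back to the documented default of 40.
    try:
        limit = int(max_cell_chars)
    except (TypeError, ValueError):
        limit = 40
    if limit <= 0:
        return ""
    if len(text) <= limit:
        return text
    # Keep limit-1 characters and spend the last slot on the ellipsis.
    return text[: limit - 1] + "…"


row = [12, None, "a\tb", "x" * 50]
cells = [normalize_cell(v, max_cell_chars=10) for v in row]
print(cells)
```

The truncated cell is exactly `max_cell_chars` long, ellipsis included, so the limit you pass is the widest text a column can hold.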
|
||||
|
||||
## Gotchas
|
||||
|
||||
- **Impure because of matplotlib.** It touches the rendering machinery. It uses
  the `Agg` backend and the object-oriented `Figure`/`add_subplot` API; NEVER
  call `pyplot.*` here, so that no global state is touched and no figures leak
  between calls. `pyplot` is NOT thread-safe; this function builds the `Figure`
  directly, so it is safe to call in a loop from the renderer.
- **The caller closes the figure.** It returns the `Figure` but neither shows
  nor saves it. Whoever consumes it must rasterize it and then release it
  (`matplotlib.pyplot.close(fig)`) so memory does not pile up in large batches.
  Because the `Figure` is built outside `pyplot` it is never registered there,
  so also drop your own reference to it; that is what actually frees the memory.
- **Meant to be rasterized at high DPI.** The `figsize` is proportional to the
  content, but real legibility comes from the DPI: rasterize with `dpi=220` and
  `bbox_inches='tight'`. A table with a great many rows grows in height (capped
  at 60", which the sizing formula reaches at about 187 table rows); for
  thousands of rows, split the table or summarize it before passing it in.
- **Visible cell truncation.** Each cell is cut to `max_cell_chars` (default 40)
  with a trailing `…`, and newlines/tabs are collapsed to spaces, so that no
  cell spills into other rows. Raise `max_cell_chars` if you need to see the
  full value (at the cost of width).
- **Defensive, never raises.** Empty or `None` `header`/`rows`, scalar rows,
  rows of different lengths or any internal error are handled without
  propagating: in the worst case it returns a placeholder `Figure` with
  "(tabla no disponible)". Do not wrap the call in try/except for fear of a
  raise; there is none.
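
The sizing rules behind the gotchas above can be approximated with a few lines of arithmetic. The sketch below assumes the constants quoted in this document's `output` description (0.9-1.6" per column, 3-26" total width, 0.32" per table row, 60" height cap) plus the 0.085" per character, 0.30" base margin, 0.45" title and 0.30" note allowances taken from the accompanying source. `estimate_figsize` is a hypothetical helper, not a function of the module, and the result is the size before `bbox_inches='tight'` crops it.

```python
def estimate_figsize(longest_per_col, n_table_rows, has_title=False, has_note=False):
    """Hypothetical estimate of (width, height) in inches for a table figure.

    longest_per_col: length in characters of the longest text in each column.
    n_table_rows: body rows plus one for the header row when there is one.
    """
    # Each column: 0.9" base, +0.085" per character beyond 6, clamped to [0.9, 1.6].
    widths = [max(0.9, min(1.6, 0.9 + 0.085 * max(n - 6, 0))) for n in longest_per_col]
    fig_w = max(3.0, min(26.0, sum(widths)))

    # Height: 0.32" per table row plus margins for the optional title and note.
    fig_h = 0.32 * n_table_rows + 0.30
    if has_title:
        fig_h += 0.45
    if has_note:
        fig_h += 0.30
    fig_h = max(1.0, min(60.0, fig_h))
    return fig_w, fig_h


# Six columns, header + 4 body rows, with title and note: roughly 7.31" x 2.65".
w, h = estimate_figsize([18, 7, 7, 9, 7, 40], 5, has_title=True, has_note=True)
print(round(w, 2), round(h, 2))

# The height cap is hit between 186 and 187 table rows (no title, no note).
print(estimate_figsize([5] * 6, 186)[1] < 60.0, estimate_figsize([5] * 6, 187)[1])
```

An estimate like this is a cheap way to decide, before rendering, whether a table is about to hit the height cap and should be split or summarized first.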
|
||||
@@ -0,0 +1,241 @@
|
||||
"""Impure EDA helper: a crisp table rendered as a matplotlib Figure (`eda` group).
|
||||
|
||||
Draws a tabular block (header + rows) as a sharp ``matplotlib.figure.Figure``
|
||||
ready to be rasterized at high DPI, so a table that does NOT fit as text on a
|
||||
page/slide can still be read in full by zooming into the rasterized image on a
|
||||
phone. The header is shaded and bold, even rows carry a soft zebra stripe, the
|
||||
ink is dark on white and the grid is very thin.
|
||||
|
||||
Impure because it touches matplotlib's rendering machinery. It uses the headless
|
||||
Agg backend and the object-oriented ``Figure`` API (no ``pyplot``) so it leaks no
|
||||
global state and is safe to call repeatedly from a report renderer. It is fully
|
||||
defensive and NEVER raises: empty/invalid input or any internal error returns a
|
||||
small placeholder figure carrying a centered "(tabla no disponible)".
|
||||
"""
|
||||
|
||||
import matplotlib
|
||||
|
||||
matplotlib.use("Agg")
|
||||
|
||||
from matplotlib.figure import Figure # noqa: E402
|
||||
|
||||
# Palette shared with the EDA report renderer so the document stays coherent.
|
||||
_HEADER_BG = "#eef3f6" # header cell background.
|
||||
_HEADER_TEXT = "#1b1b1b" # header cell text (bold).
|
||||
_ZEBRA_BG = "#f6f8fa" # even (1-based) row background stripe.
|
||||
_BODY_BG = "#ffffff" # odd row background.
|
||||
_INK = "#1b1b1b" # body text + title ink.
|
||||
_GRID = "#cccccc" # cell borders / grid (thin).
|
||||
_NOTE_TEXT = "#8a8a8a" # muted gray for the note (italic).
|
||||
|
||||
|
||||
def _placeholder_figure(message: str = "(tabla no disponible)") -> "Figure":
|
||||
"""Return a small fallback ``Figure`` carrying a single centered message."""
|
||||
fig = Figure(figsize=(6.0, 1.6), dpi=150)
|
||||
ax = fig.add_subplot(111)
|
||||
ax.axis("off")
|
||||
ax.text(
|
||||
0.5,
|
||||
0.5,
|
||||
message,
|
||||
ha="center",
|
||||
va="center",
|
||||
fontsize=11,
|
||||
color=_NOTE_TEXT,
|
||||
style="italic",
|
||||
wrap=True,
|
||||
transform=ax.transAxes,
|
||||
)
|
||||
fig.tight_layout()
|
||||
return fig
|
||||
|
||||
|
||||
def _cell_text(value, max_cell_chars: int) -> str:
|
||||
"""``str()`` a cell value defensively, None -> "", truncate with an ellipsis."""
|
||||
s = "" if value is None else str(value)
|
||||
# Collapse newlines/tabs so a single cell never spills across table rows.
|
||||
s = s.replace("\n", " ").replace("\r", " ").replace("\t", " ")
|
||||
try:
|
||||
limit = int(max_cell_chars)
|
||||
except (TypeError, ValueError):
|
||||
limit = 40
|
||||
if limit <= 0:
|
||||
return ""
|
||||
if len(s) <= limit:
|
||||
return s
|
||||
if limit == 1:
|
||||
return "…"
|
||||
return s[: limit - 1] + "…"
|
||||
|
||||
|
||||
def render_table_as_figure(
|
||||
header,
|
||||
rows,
|
||||
title=None,
|
||||
note=None,
|
||||
fontsize=9.0,
|
||||
max_cell_chars=40,
|
||||
):
|
    """Draw a crisp table as a matplotlib.figure.Figure, ready to rasterize at high DPI.

    Intended for tables that do NOT fit as text on a page/slide: the table is
    rasterized at high resolution and the reader zooms in on a phone to read it
    in full without losing data. Shaded bold header, even rows with a soft
    zebra stripe, dark ink on white, very thin grid.

    Args:
        header: list of column names (may be [] or None).
        rows: list of rows; each row is a list of cells (arbitrary values, str()-ed).
        title: optional title drawn above the table (or None).
        note: optional gray/italic note below the table (or None).
        fontsize: base font size (pt) of the cells.
        max_cell_chars: truncate cell text to this many characters (with a
            trailing …) so the width does not blow up.

    Returns:
        matplotlib.figure.Figure, NOT closed (the caller rasterizes and closes it).
        Never raises: on any error it returns a Figure with the text
        "(tabla no disponible)".
    """
|
||||
try:
|
||||
# --- Defensive normalization of header/rows into a rectangular grid.
|
||||
header_list = list(header) if isinstance(header, (list, tuple)) else []
|
||||
raw_rows = list(rows) if isinstance(rows, (list, tuple)) else []
|
||||
|
||||
clean_rows = []
|
||||
for row in raw_rows:
|
||||
if isinstance(row, (list, tuple)):
|
||||
clean_rows.append(list(row))
|
||||
elif row is None:
|
||||
clean_rows.append([])
|
||||
else:
|
||||
# A scalar row becomes a single-cell row instead of being dropped.
|
||||
clean_rows.append([row])
|
||||
|
||||
# Nothing to draw at all -> placeholder.
|
||||
if not header_list and not clean_rows:
|
||||
return _placeholder_figure()
|
||||
|
||||
# Number of columns = widest of header / any row.
|
||||
n_cols = len(header_list)
|
||||
for row in clean_rows:
|
||||
if len(row) > n_cols:
|
||||
n_cols = len(row)
|
||||
if n_cols <= 0:
|
||||
return _placeholder_figure()
|
||||
|
||||
# Base font size, tolerate a bad value.
|
||||
try:
|
||||
base_fs = float(fontsize)
|
||||
except (TypeError, ValueError):
|
||||
base_fs = 9.0
|
||||
if base_fs <= 0:
|
||||
base_fs = 9.0
|
||||
|
||||
# --- Build the truncated, padded text matrix.
|
||||
header_cells = [
|
||||
_cell_text(header_list[c] if c < len(header_list) else "", max_cell_chars)
|
||||
for c in range(n_cols)
|
||||
]
|
||||
body_cells = []
|
||||
for row in clean_rows:
|
||||
body_cells.append(
|
||||
[
|
||||
_cell_text(row[c] if c < len(row) else "", max_cell_chars)
|
||||
for c in range(n_cols)
|
||||
]
|
||||
)
|
||||
|
||||
has_header = any(t for t in header_cells)
|
||||
n_body = len(body_cells)
|
||||
# Total drawn table rows (header counts as one when present).
|
||||
n_table_rows = n_body + (1 if has_header else 0)
|
||||
if n_table_rows <= 0:
|
||||
return _placeholder_figure()
|
||||
|
||||
# --- figsize proportional to content so it reads under zoom.
|
||||
# Width: per-column width scales with the longest text in that column,
|
||||
# clamped to a sensible per-column range, total capped.
|
||||
per_col_widths = []
|
||||
for c in range(n_cols):
|
||||
col_texts = [header_cells[c]] if has_header else []
|
||||
col_texts += [body_cells[r][c] for r in range(n_body)]
|
||||
longest = max((len(t) for t in col_texts), default=0)
|
||||
# ~0.085" per char at the base font, clamped to [0.9, 1.6] inches.
|
||||
w = 0.9 + 0.085 * max(longest - 6, 0)
|
||||
w = max(0.9, min(1.6, w))
|
||||
per_col_widths.append(w)
|
||||
fig_w = sum(per_col_widths)
|
||||
fig_w = max(3.0, min(26.0, fig_w))
|
||||
|
||||
# Height: ~0.32" per row + room for title / note.
|
||||
fig_h = 0.32 * n_table_rows + 0.30
|
||||
if title is not None and str(title) != "":
|
||||
fig_h += 0.45
|
||||
if note is not None and str(note) != "":
|
||||
fig_h += 0.30
|
||||
fig_h = max(1.0, min(60.0, fig_h))
|
||||
|
||||
fig = Figure(figsize=(fig_w, fig_h), dpi=150)
|
||||
ax = fig.add_subplot(111)
|
||||
ax.axis("off")
|
||||
|
||||
# Reserve vertical bands for the optional title (top) and note (bottom)
|
||||
# so the table itself never overlaps them.
|
||||
title_band = 0.10 if (title is not None and str(title) != "") else 0.0
|
||||
note_band = 0.07 if (note is not None and str(note) != "") else 0.0
|
||||
table_bbox = [0.0, note_band, 1.0, max(0.05, 1.0 - title_band - note_band)]
|
||||
|
||||
cell_text = ([header_cells] if has_header else []) + body_cells
|
||||
|
||||
col_widths = [w / fig_w for w in per_col_widths]
|
||||
|
||||
table = ax.table(
|
||||
cellText=cell_text,
|
||||
colWidths=col_widths,
|
||||
cellLoc="left",
|
||||
loc="center",
|
||||
bbox=table_bbox,
|
||||
)
|
||||
table.auto_set_font_size(False)
|
||||
table.set_fontsize(base_fs)
|
||||
|
||||
# --- Style every cell: zebra body, shaded bold header, thin gray grid.
|
||||
for (r, _c), cell in table.get_celld().items():
|
||||
cell.set_edgecolor(_GRID)
|
||||
cell.set_linewidth(0.4)
|
||||
# Small horizontal padding so text does not touch the border.
|
||||
cell.PAD = 0.04
|
||||
if has_header and r == 0:
|
||||
cell.set_facecolor(_HEADER_BG)
|
||||
cell.set_text_props(color=_HEADER_TEXT, fontweight="bold", ha="left")
|
||||
else:
|
||||
body_index = r - 1 if has_header else r # 0-based body row.
|
||||
# 1-based even rows get the zebra stripe.
|
||||
is_even = ((body_index + 1) % 2) == 0
|
||||
cell.set_facecolor(_ZEBRA_BG if is_even else _BODY_BG)
|
||||
cell.set_text_props(color=_INK, ha="left")
|
||||
|
||||
if title is not None and str(title) != "":
|
||||
ax.set_title(
|
||||
str(title),
|
||||
fontsize=base_fs + 3.0,
|
||||
fontweight="bold",
|
||||
color=_INK,
|
||||
loc="left",
|
||||
pad=8,
|
||||
)
|
||||
|
||||
if note is not None and str(note) != "":
|
||||
fig.text(
|
||||
0.01,
|
||||
0.01,
|
||||
str(note),
|
||||
ha="left",
|
||||
va="bottom",
|
||||
fontsize=max(7.0, base_fs - 1.0),
|
||||
color=_NOTE_TEXT,
|
||||
style="italic",
|
||||
)
|
||||
|
||||
return fig
|
||||
except Exception: # noqa: BLE001 — never raise from a figure builder.
|
||||
return _placeholder_figure()
|
||||
@@ -0,0 +1,119 @@
|
"""Tests for render_table_as_figure (crisp table as a Figure, eda group).

Uses the headless Agg backend; figures are neither shown nor saved to disk, only
written to an in-memory BytesIO. Each test explicitly closes the Figure it
builds (matplotlib.pyplot.close) so that no state accumulates between tests.
"""
|
||||
|
||||
from io import BytesIO
|
||||
|
||||
import matplotlib
|
||||
|
||||
matplotlib.use("Agg")
|
||||
|
||||
import matplotlib.pyplot as plt # noqa: E402
|
||||
from matplotlib.figure import Figure # noqa: E402
|
||||
|
||||
from render_table_as_figure import render_table_as_figure
|
||||
|
||||
|
||||
def _grid(n_cols, n_rows):
|
"""Return a header of n_cols columns plus n_rows rows of cells."""
|
||||
header = [f"col_{c}" for c in range(n_cols)]
|
||||
rows = [[f"r{r}c{c}" for c in range(n_cols)] for r in range(n_rows)]
|
||||
return header, rows
|
||||
|
||||
|
||||
def test_returns_figure_with_table():
|
||||
header, rows = _grid(6, 5)
|
||||
fig = render_table_as_figure(header, rows, title="Tabla", note="nota al pie")
|
||||
assert isinstance(fig, Figure)
|
||||
# There is at least one Axes and it holds a table with cells.
|
||||
assert len(fig.axes) >= 1
|
||||
ax = fig.axes[0]
|
||||
assert len(ax.tables) >= 1
|
||||
# 6 columns x (1 header + 5 rows) = 36 cells.
|
||||
assert len(ax.tables[0].get_celld()) == 6 * (5 + 1)
|
||||
plt.close(fig)
|
||||
|
||||
|
||||
def test_rows_none_does_not_raise():
|
||||
fig = render_table_as_figure(["a", "b"], None)
|
||||
assert isinstance(fig, Figure)
|
||||
assert len(fig.axes) >= 1
|
||||
plt.close(fig)
|
||||
|
||||
|
||||
def test_header_none_does_not_raise():
|
||||
fig = render_table_as_figure(None, [["x", "y"], ["z", "w"]])
|
||||
assert isinstance(fig, Figure)
|
||||
assert len(fig.axes) >= 1
|
||||
plt.close(fig)
|
||||
|
||||
|
||||
def test_empty_lists_return_placeholder_figure():
|
||||
fig = render_table_as_figure([], [])
|
||||
assert isinstance(fig, Figure)
|
||||
# Placeholder: one Axes with text, no table.
|
||||
assert len(fig.axes) >= 1
|
||||
assert len(fig.axes[0].tables) == 0
|
||||
plt.close(fig)
|
||||
|
||||
|
||||
def test_both_none_return_placeholder_figure():
|
||||
fig = render_table_as_figure(None, None)
|
||||
assert isinstance(fig, Figure)
|
||||
assert len(fig.axes[0].tables) == 0
|
||||
plt.close(fig)
|
||||
|
||||
|
||||
def test_long_cell_is_truncated():
|
||||
long_value = "x" * 200
|
||||
header, _ = _grid(2, 0)
|
||||
fig = render_table_as_figure(header, [[long_value, "ok"]], max_cell_chars=20)
|
||||
assert isinstance(fig, Figure)
|
||||
ax = fig.axes[0]
|
||||
texts = [c.get_text().get_text() for c in ax.tables[0].get_celld().values()]
|
||||
# The long cell appears truncated with an ellipsis and never in its full form.
|
||||
assert any(t.endswith("…") and len(t) <= 20 for t in texts)
|
||||
assert long_value not in texts
|
||||
plt.close(fig)
|
||||
|
||||
|
||||
def test_none_cells_become_empty_strings():
|
||||
fig = render_table_as_figure(["a", "b"], [[None, "v"], ["w", None]])
|
||||
assert isinstance(fig, Figure)
|
||||
ax = fig.axes[0]
|
||||
texts = [c.get_text().get_text() for c in ax.tables[0].get_celld().values()]
|
||||
# There are empty cells (the None values) and cells with a value.
|
||||
assert "" in texts
|
||||
assert "v" in texts
|
||||
plt.close(fig)
|
||||
|
||||
|
||||
def test_can_rasterize_to_png_high_dpi():
|
||||
header, rows = _grid(6, 8)
|
||||
fig = render_table_as_figure(header, rows, title="Render", note="zoom me")
|
||||
buf = BytesIO()
|
||||
# Must not raise when rasterizing at high DPI with a tight bbox.
|
||||
fig.savefig(buf, format="png", dpi=220, bbox_inches="tight")
|
||||
assert buf.getbuffer().nbytes > 0
|
||||
plt.close(fig)
|
||||
|
||||
|
||||
def test_placeholder_can_rasterize():
|
||||
fig = render_table_as_figure([], [])
|
||||
buf = BytesIO()
|
||||
fig.savefig(buf, format="png", dpi=220, bbox_inches="tight")
|
||||
assert buf.getbuffer().nbytes > 0
|
||||
plt.close(fig)
|
||||
|
||||
|
||||
def test_ragged_rows_are_padded():
|
||||
# Rows of different lengths: the grid is made rectangular at the maximum width.
|
||||
fig = render_table_as_figure(["a", "b", "c"], [["1"], ["1", "2", "3", "4"]])
|
||||
assert isinstance(fig, Figure)
|
||||
ax = fig.axes[0]
|
||||
# 4 columns (the widest row) x (1 header + 2 rows) = 12 cells.
|
||||
assert len(ax.tables[0].get_celld()) == 4 * (2 + 1)
|
||||
plt.close(fig)
|
||||
@@ -0,0 +1,79 @@
|
||||
---
|
||||
name: summarize_outlier_dims
|
||||
kind: function
|
||||
lang: py
|
||||
domain: datascience
|
||||
version: "1.0.0"
|
||||
purity: pure
|
||||
signature: "def summarize_outlier_dims(raw_numeric: dict, outlier_rows: list, top_k: int = 3) -> list"
|
||||
description: "Explains WHICH columns make each anomalous row detected by isolation_forest_outliers unusual. For each {row_index, score} it rebuilds the valid row (same numeric-column filter and same dropping of rows containing None as the detector, so row_index lines up) and returns the top_k columns with the largest population |z-score| (ddof=0). Explainability layer for the multivariate outlier step in EDA. Pure and deterministic; on empty/invalid inputs or when there are no valid rows it returns [] without raising."
|
||||
tags: [eda, models, outliers, anomaly-detection, explainability, z-score, multivariate]
|
||||
params:
  - name: raw_numeric
    desc: "dict {column_name: [values]} aligned by row (like ctx['raw_numeric'] in the AutomaticEDA engine). Only columns whose values are all numeric are used (None is allowed per row; bool/str/NaN/Inf discard the whole column); the filter is IDENTICAL to the one in isolation_forest_outliers so that row_index lines up."
  - name: outlier_rows
    desc: "List of {row_index, score} exactly as returned by isolation_forest_outliers. row_index counts ONLY the valid rows (those without None), in order of appearance, 0-based. Out-of-range or malformed entries are ignored defensively."
  - name: top_k
    desc: "Number of columns (those with the largest |z-score|) to report per outlier. Default 3. Invalid values (non-int, bool, <1) fall back to 3."
|
||||
output: "Lista paralela a outlier_rows (mismo orden) de dicts {row_index: int, score: float, dims: [{col: str, value: float, z: float}, ...]}. dims trae hasta top_k columnas ordenadas por |z| descendente, con z (z-score poblacional, ddof=0) redondeado a 3 decimales; si una columna tiene std==0 su z es 0. Las entradas de outlier_rows fuera de rango/malformadas se omiten. Ante raw_numeric vacio/no-dict, outlier_rows no-lista, 0 columnas numericas o 0 filas validas devuelve []."
|
||||
uses_functions: []
|
||||
uses_types: []
|
||||
returns: []
|
||||
returns_optional: false
|
||||
error_type: ""
|
||||
imports: []
|
||||
tested: true
|
||||
tests: ["test_row_index_skips_none_rows", "test_extreme_row_flagged_via_isolation", "test_out_of_range_row_index_is_ignored", "test_degrades_to_empty_on_invalid_inputs"]
|
||||
test_file_path: "python/functions/datascience/summarize_outlier_dims_test.py"
|
||||
file_path: "python/functions/datascience/summarize_outlier_dims.py"
|
||||
---

## Example

```python
from datascience import isolation_forest_outliers, summarize_outlier_dims

# Dense cluster of similar rows plus 1 row with an extreme value in "c".
raw_numeric = {
    "a": [0.1, 0.2, -0.1, 0.0, 0.3, -0.2, 0.15, -0.05, 0.25, 0.2, -0.3, 0.1],
    "b": [1.0, 1.1, 0.9, 1.2, 0.8, 1.0, 1.1, 0.95, 1.05, 0.9, 1.15, 1.0],
    "c": [5.0, 5.2, 4.8, 5.1, 4.9, 5.0, 4.95, 5.05, 4.9, 500.0, 5.1, 5.0],
}

result = isolation_forest_outliers(raw_numeric, contamination=0.1)
summary = summarize_outlier_dims(raw_numeric, result["outlier_rows"], top_k=3)

for item in summary:
    top = item["dims"][0]
    print(item["row_index"], top["col"], top["value"], top["z"])
# The row with the value 500 comes out with top dim "c" and a high |z|: that is what makes it unusual.
```
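
To see where the high |z| comes from, the population z-score (ddof=0) of the extreme value can be checked by hand with NumPy. This is a minimal sketch that uses only column `c` from the example; it does not call `summarize_outlier_dims`, and the variable names are illustrative assumptions.

```python
import numpy as np

# Column "c" from the example: eleven values near 5 and one extreme value.
c = np.array([5.0, 5.2, 4.8, 5.1, 4.9, 5.0, 4.95, 5.05, 4.9, 500.0, 5.1, 5.0])

mean = c.mean()
std = c.std(ddof=0)  # population standard deviation, as the docs describe
z = (c - mean) / std

# Position of the value that deviates most, and its z-score.
extreme_index = int(np.argmax(np.abs(z)))
print(extreme_index, round(float(z[extreme_index]), 3))
```

With a single extreme value among n rows, a population z-score cannot exceed sqrt(n - 1), which is about 3.32 for n = 12, so a value close to +3.3 is already the strongest signal this column can produce.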

## When to use it

Use it right **after** `isolation_forest_outliers`, once you know WHICH rows are
anomalous and want to explain WHY: in which columns they deviate most from the
rest. It is useful for filling in the outliers section of an EDA report/notebook
with "row 9 is unusual mainly because of `c` (z=+3.3)" instead of just an opaque
row_index. Pass the same `raw_numeric` you gave the detector and its
`outlier_rows` untouched; the `row_index` points to the same row because both
functions apply the same column filter and the same dropping of rows that
contain None.
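
As a rough sketch of how one summary entry could be turned into that kind of report sentence: `describe_outlier` is a hypothetical helper, not part of the library, and it assumes only the documented output shape `{row_index, score, dims}`. The sample entry is written by hand rather than taken from a real detector run.

```python
def describe_outlier(entry: dict) -> str:
    """Build a one-line explanation from one summarize_outlier_dims entry.

    Hypothetical helper: it relies only on the documented output shape
    {row_index, score, dims: [{col, value, z}, ...]}.
    """
    dims = entry.get("dims") or []
    if not dims:
        return f"row {entry['row_index']}: no numeric dimension to report"
    top = dims[0]  # dims are sorted by |z| descending, so the first one dominates
    return (
        f"row {entry['row_index']} is unusual mainly because of "
        f"`{top['col']}` (z={top['z']:+.1f})"
    )


# Hand-written entry in the documented shape (not real detector output).
sample = {
    "row_index": 9,
    "score": -0.5,
    "dims": [{"col": "c", "value": 500.0, "z": 3.317}],
}
print(describe_outlier(sample))
# prints: row 9 is unusual mainly because of `c` (z=+3.3)
```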

## Gotchas

- **Same `raw_numeric` as the detector**: `row_index` only matches if you pass
  the same column dict (same order, same lists) you called
  `isolation_forest_outliers` with. If you change the columns or their order, the
  indices stop mapping.
- **`row_index` is relative to the valid rows**: rows with `None` in any used
  column are dropped and the indices are recomputed over the remaining ones
  (0-based, order of appearance). It does not map 1:1 to the input lists if
  there are None values.
- **Population z-score (ddof=0)**: the population standard deviation is used,
  consistent with the detector's scaling. Columns with `std==0` (all values
  equal) give `z=0`, so they always rank last and never stand out as "unusual".
- **Returns `[]` instead of raising**: non-dict/non-list input, 0 numeric
  columns, 0 valid rows, or all entries out of range -> empty list. It raises no
  exceptions.
- **Does not call `isolation_forest_outliers`**: it only consumes its output. It
  is an independent function (it does not import it), which is why
  `uses_functions` is empty.
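
Because `row_index` counts only valid rows, it can be handy to translate it back to a position in the input lists. The sketch below shows one way to do that under stated assumptions: `valid_to_original_index` is a hypothetical helper, it expects a dict that already contains only the numeric columns the detector would use (the column filter is not reproduced here), and it truncates to the shortest column.

```python
def valid_to_original_index(columns: dict) -> list:
    """Return, for each valid row (no None in any column), its original position.

    Hypothetical helper; assumes `columns` holds only the numeric columns
    that the detector would keep.
    """
    names = list(columns)
    if not names:
        return []
    n_rows = min(len(columns[name]) for name in names)
    return [
        i
        for i in range(n_rows)
        if all(columns[name][i] is not None for name in names)
    ]


cols = {
    "a": [0.1, None, 0.3, 0.2],
    "b": [1.0, 1.1, None, 0.9],
}
mapping = valid_to_original_index(cols)
print(mapping)  # [0, 3]: valid row 1 is original row 3
```

Indexing `mapping[row_index]` then gives the original row for any entry of the summary, as long as the same columns were passed to the detector.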
@@ -0,0 +1,144 @@
"""Explain which dimensions (columns) make each anomalous row unusual.

Takes the multivariate output of `isolation_forest_outliers` (a list of
`{row_index, score}`) and, for each outlier, returns the columns with the
largest |z-score| relative to the distribution of the valid rows. It is the
"explainability" layer of the multivariate outlier step in the EDA phase: the
Isolation Forest says WHICH rows are unusual, this function says WHY (in which
columns they deviate most).

Pure and deterministic: it rebuilds EXACTLY the same "valid rows" used by
`isolation_forest_outliers` (same numeric-column filter and same dropping of
rows containing None), so `row_index` points to the same row in both
functions. It performs no I/O and depends on no state.
"""

import math

import numpy as np


def _is_finite_number(v) -> bool:
    """True if v is a finite int/float. bool does NOT count; neither do NaN/Inf."""
    if isinstance(v, bool):
        return False
    if not isinstance(v, (int, float)):
        return False
    if isinstance(v, float) and (math.isnan(v) or math.isinf(v)):
        return False
    return True


def summarize_outlier_dims(
    raw_numeric: dict,
    outlier_rows: list,
    top_k: int = 3,
) -> list:
    """Summarize the dimensions that deviate most for each anomalous row.

    Args:
        raw_numeric: dict {column_name: [values]} aligned by row (like
            ctx['raw_numeric'] from the AutomaticEDA engine). Only columns
            whose values are all numeric are used (None is allowed per row;
            bool, str, NaN and Inf discard the whole column); the filter is
            identical to that of isolation_forest_outliers.
        outlier_rows: list of {row_index, score} as returned by
            isolation_forest_outliers. row_index counts ONLY the valid rows
            (no None) in order of appearance, starting at 0.
        top_k: number of columns (those with the largest |z-score|) to report
            for each outlier. Default 3. Invalid values fall back to 3.

    Returns:
        List parallel to outlier_rows (same order) of dicts
        {row_index, score, dims}, where dims is the list of up to top_k columns
        sorted by descending |z|: [{col, value, z}, ...] with z rounded to
        3 decimals. Out-of-range or malformed entries of outlier_rows are
        omitted (defensive). On empty/non-dict raw_numeric, non-list
        outlier_rows, 0 numeric columns or 0 valid rows it returns [].
    """
    # Defensive validation of the main arguments.
    if not isinstance(raw_numeric, dict) or not isinstance(outlier_rows, list):
        return []
    if not isinstance(top_k, int) or isinstance(top_k, bool) or top_k < 1:
        top_k = 3

    # Numeric column selection: identical to isolation_forest_outliers.
    # A column is kept only if all its values are numeric (None is allowed
    # per row); any bool/str/NaN/Inf discards the whole column.
    numeric_cols: dict[str, list] = {}
    for name, values in raw_numeric.items():
        if not isinstance(values, (list, tuple)):
            continue
        ok = True
        for v in values:
            if v is None:
                continue
            if not _is_finite_number(v):
                ok = False
                break
        if ok:
            numeric_cols[name] = list(values)

    if len(numeric_cols) < 1:
        return []

    col_names = list(numeric_cols.keys())
    try:
        n_rows_total = min(len(numeric_cols[c]) for c in col_names)
    except ValueError:
        return []

    # Rebuild the valid rows with the SAME criterion as the detector: row i
    # takes one value per column; if any value is None, the row is dropped
    # and does NOT advance the valid index. This way row_index in
    # outlier_rows points into this same sequence (0-based, order of appearance).
    valid_rows: list[list[float]] = []
    for i in range(n_rows_total):
        row = [numeric_cols[c][i] for c in col_names]
        if any(v is None for v in row):
            continue
        valid_rows.append([float(v) for v in row])

    if not valid_rows:
        return []

    matrix = np.asarray(valid_rows, dtype=float)
    n_valid = matrix.shape[0]
    means = matrix.mean(axis=0)
    stds = matrix.std(axis=0, ddof=0)  # population (ddof=0)

    out: list = []
    for entry in outlier_rows:
        if not isinstance(entry, dict):
            continue
        ri = entry.get("row_index")
        # bool is a subclass of int: exclude it explicitly.
        if not isinstance(ri, int) or isinstance(ri, bool):
            continue
        if ri < 0 or ri >= n_valid:
            continue

        try:
            score = float(entry.get("score"))
        except (TypeError, ValueError):
            score = 0.0

        row = matrix[ri]
        dims = []
        for j, name in enumerate(col_names):
            std = stds[j]
            if std == 0.0:
                z = 0.0
            else:
                z = float((row[j] - means[j]) / std)
            dims.append({"col": name, "value": float(row[j]), "z": z})

        # Largest |z| first; stable sort, ties keep column order.
        dims.sort(key=lambda d: abs(d["z"]), reverse=True)
        dims = dims[:top_k]
        for d in dims:
            d["z"] = round(d["z"], 3)

        out.append({"row_index": int(ri), "score": score, "dims": dims})

    return out
@@ -0,0 +1,93 @@
"""Tests for summarize_outlier_dims."""

from isolation_forest_outliers import isolation_forest_outliers
from summarize_outlier_dims import summarize_outlier_dims


# Shared dataset: 3 columns, 13 rows. ORIGINAL row 6 has None in "a"
# (it is dropped), so ORIGINAL row 10 -- with an extreme value in "c"
# -- ends up at VALID index 9 (not 10). This verifies the None skip.
A = [0.1, 0.2, -0.1, 0.0, 0.3, -0.2, None, 0.15, -0.05, 0.25, 0.2, -0.3, 0.1]
B = [1.0, 1.1, 0.9, 1.2, 0.8, 1.0, 1.3, 1.1, 0.95, 1.05, 0.9, 1.15, 1.0]
C = [5.0, 5.2, 4.8, 5.1, 4.9, 5.0, 5.3, 4.95, 5.05, 4.9, 500.0, 5.1, 5.0]
RAW = {"a": A, "b": B, "c": C}

# Original -> valid map (skipping original 6):
#   orig:   0  1  2  3  4  5  7  8  9 10 11 12
#   valid:  0  1  2  3  4  5  6  7  8  9 10 11
# => the extreme in "c" (original 10) is at valid index 9.
EXTREME_VALID_INDEX = 9


def test_row_index_skips_none_rows():
    # Direct mapping (without depending on IsolationForest randomness): valid
    # index 9 must correspond to the row with c == 500 -> the None in
    # original row 6 was skipped correctly.
    summary = summarize_outlier_dims(
        RAW, [{"row_index": EXTREME_VALID_INDEX, "score": -0.5}], top_k=3
    )
    assert len(summary) == 1
    entry = summary[0]
    assert entry["row_index"] == EXTREME_VALID_INDEX
    assert entry["score"] == -0.5
    # The dominant dimension is "c", with its extreme value and a high |z|.
    top = entry["dims"][0]
    assert top["col"] == "c"
    assert top["value"] == 500.0
    assert abs(top["z"]) > 2.0
    # top_k respected: at most 3 dims.
    assert len(entry["dims"]) <= 3


def test_extreme_row_flagged_via_isolation():
    # Real integration: detect outliers and explain them.
    result = isolation_forest_outliers(RAW, contamination=0.1)
    assert "note" not in result
    outlier_rows = result["outlier_rows"]
    assert outlier_rows  # at least one outlier

    summary = summarize_outlier_dims(RAW, outlier_rows, top_k=3)
    # Parallel to outlier_rows (all indices are in range).
    assert len(summary) == len(outlier_rows)

    by_index = {e["row_index"]: e for e in summary}
    # The extreme point must be among the detected outliers...
    assert EXTREME_VALID_INDEX in by_index
    # ...and its top dimension must be "c" (where it deviates by several sigmas).
    extreme = by_index[EXTREME_VALID_INDEX]
    assert extreme["dims"][0]["col"] == "c"
    assert abs(extreme["dims"][0]["z"]) > 2.0


def test_out_of_range_row_index_is_ignored():
    # Out-of-range indices are omitted instead of raising.
    summary = summarize_outlier_dims(
        RAW,
        [
            {"row_index": 999, "score": -1.0},
            {"row_index": -1, "score": -1.0},
            {"row_index": EXTREME_VALID_INDEX, "score": -0.5},
        ],
        top_k=2,
    )
    # Only the valid index survives; the other two are discarded.
    assert len(summary) == 1
    assert summary[0]["row_index"] == EXTREME_VALID_INDEX
    assert len(summary[0]["dims"]) <= 2


def test_degrades_to_empty_on_invalid_inputs():
    # Empty raw_numeric + empty outlier_rows.
    assert summarize_outlier_dims({}, [], 3) == []
    # raw_numeric is not a dict.
    assert summarize_outlier_dims("not a dict", [{"row_index": 0}], 3) == []
    # outlier_rows is not a list.
    assert summarize_outlier_dims(RAW, "not a list", 3) == []
    # No numeric columns (all strings) -> [].
    assert summarize_outlier_dims(
        {"s": ["x", "y", "z"]}, [{"row_index": 0, "score": -1.0}], 3
    ) == []
    # Malformed entries inside outlier_rows are ignored (they do not raise).
    assert summarize_outlier_dims(
        RAW, ["nope", 42, {"no_row_index": 1}], 3
    ) == []
@@ -0,0 +1,466 @@
"""ACCEPTANCE test battery for AutomaticEDA: "every AEDA comes out the way we want".

This suite is the safety net of the EDA subsystem of the `eda` group: it
guarantees that EVERY chapter of an AutomaticEDA report comes out populated and
with its essential content, that the standalone-chapters feature
(``only_chapters``) resolves its compute dependencies, that optional chapters
return None when they do not apply, that the multi-table folder report detects
the FK, and that the Markdown includes the full appendix (entire association
matrix + describe with skew/kurtosis). Unlike the per-chapter unit tests, this
suite exercises the pipeline END-TO-END on a deterministic synthetic dataset
that activates all chapters at once.

Determinism: the dataset is generated with a fixed ``seed`` and the pipeline
runs without an LLM (``profile_level='standard'``), so the manifest and the
Markdown are reproducible across runs. A single `standard` render is reused
through a module-scoped fixture to avoid repeating the expensive computation.

dict-no-throw: the pipelines of the `eda` group never raise; here we assert on
``status == 'ok'`` and then on the concrete content of the manifest / Markdown.

Honesty (DoD): the asserts check real CONTENT (the essential text of each
chapter), not just the heading. If a chapter stopped emitting its content (a
change broke the numeric distribution, the Isolation Forest, the full
correlation matrix, …), the corresponding test FAILS naming the chapter and the
missing fragment; it is not softened to make it pass.
"""

import json
import os
import subprocess
import sys

import pytest

_HERE = os.path.dirname(os.path.abspath(__file__))
_FUNCTIONS = os.path.abspath(os.path.join(_HERE, "..", ".."))  # python/functions
if _FUNCTIONS not in sys.path:
    sys.path.insert(0, _FUNCTIONS)

from datascience.automatic_eda import CHAPTER_ORDER  # noqa: E402
from datascience.generate_synthetic_eda_folder import (  # noqa: E402
    generate_synthetic_eda_folder,
)
from datascience.generate_synthetic_eda_table import (  # noqa: E402
    generate_synthetic_eda_table,
)
from pipelines.render_automatic_eda import render_automatic_eda  # noqa: E402
from pipelines.render_automatic_eda_folder import (  # noqa: E402
    render_automatic_eda_folder,
)

# --------------------------------------------------------------------------- #
# Deterministic parameters of the golden fixture.
# --------------------------------------------------------------------------- #
SEED = 42
N_ROWS = 800
TABLE = "synthetic"

# The `analisis_llm` chapter is ONLY computed with run_llm=True; in the
# `standard` preset (no LLM, which is what this suite uses) it must not appear.
# So the chapters expected in a `standard` report are all those in
# CHAPTER_ORDER EXCEPT analisis_llm. CHAPTER_ORDER is the source of truth for
# the engine's 16 chapters (portada … glosario).
LLM_ONLY_CHAPTERS = {"analisis_llm"}
EXPECTED_STANDARD = [c for c in CHAPTER_ORDER if c not in LLM_ONLY_CHAPTERS]


def _pdf_text(path):
    """Text of the PDF via pdftotext, or None if the tool is not available."""
    try:
        out = subprocess.run(
            ["pdftotext", "-layout", path, "-"],
            capture_output=True, text=True, timeout=60,
        )
        return out.stdout if out.returncode == 0 else None
    except Exception:  # noqa: BLE001 - the main check is on the Markdown.
        return None


def _manifest_chapters(result):
    """Set of chapter ids present in the manifest of the result."""
    with open(result["manifest_path"], encoding="utf-8") as fh:
        return set((json.load(fh).get("chapters") or {}).keys())


# --------------------------------------------------------------------------- #
# Module-scoped fixtures: the synthetic dataset is generated ONCE and the
# `standard` render is computed ONCE; all content tests reuse it.
# --------------------------------------------------------------------------- #
@pytest.fixture(scope="module")
def synth_db(tmp_path_factory):
    """Deterministic synthetic table that activates the engine's 16 chapters."""
    d = tmp_path_factory.mktemp("aeda_accept_synth")
    db = str(d / "synthetic.duckdb")
    g = generate_synthetic_eda_table(db, TABLE, n_rows=N_ROWS, seed=SEED)
    assert g["status"] == "ok", g.get("error")
    return {"db": db, "table": TABLE, "gen": g}


@pytest.fixture(scope="module")
def standard_run(synth_db, tmp_path_factory):
    """AutomaticEDA `standard` render (no LLM) over the synthetic dataset.

    Returns the pipeline dict plus the loaded manifest, the Markdown text and
    the PDF text (if pdftotext is available). Reused by most of the tests.
    """
    out = str(tmp_path_factory.mktemp("aeda_accept_std"))
    r = render_automatic_eda(
        synth_db["db"], synth_db["table"],
        profile_level="standard", out_dir=out, basename="synth_std",
    )
    assert r["status"] == "ok", r.get("error")
    with open(r["manifest_path"], encoding="utf-8") as fh:
        manifest = json.load(fh)
    md = open(r["aeda_md_path"], encoding="utf-8").read()
    return {
        "r": r,
        "manifest": manifest,
        "chapters": manifest.get("chapters") or {},
        "md": md,
        "pdf_text": _pdf_text(r["pdf_path"]),
    }


@pytest.fixture(scope="module")
def minimal_db(tmp_path_factory):
    """Minimal table with NO free text, NO date and NO lat/lon.

    Used to check that text_distr / timeseries / geospatial return None (they
    do not appear in the manifest) and that the EDA does not crash. Only
    continuous numeric columns + one low-cardinality categorical.
    """
    import random

    import duckdb

    d = tmp_path_factory.mktemp("aeda_accept_min")
    db = str(d / "minimal.duckdb")
    con = duckdb.connect(db)
    con.execute("CREATE TABLE minimal (a DOUBLE, b DOUBLE, c INTEGER, grp VARCHAR)")
    random.seed(7)
    rows = [
        (round(random.gauss(10, 2), 3), round(random.gauss(50, 5), 3),
         random.randint(1, 100), ["x", "y", "z"][i % 3])
        for i in range(120)
    ]
    con.executemany("INSERT INTO minimal VALUES (?,?,?,?)", rows)
    con.close()
    return {"db": db, "table": "minimal"}


# --------------------------------------------------------------------------- #
# 1) CHAPTER COVERAGE (golden): the standard manifest has the 15 expected
#    non-LLM chapters, none missing, and analisis_llm does NOT appear without LLM.
# --------------------------------------------------------------------------- #
def test_standard_cubre_todos_los_capitulos_esperados(standard_run):
    chapters = set(standard_run["chapters"].keys())
    expected = set(EXPECTED_STANDARD)
    missing = expected - chapters
    assert not missing, (
        "expected chapters missing from the standard manifest: "
        f"{sorted(missing)} (present: {sorted(chapters)})"
    )
    # analisis_llm requires run_llm=True: in standard it must NOT appear.
    assert "analisis_llm" not in chapters, (
        "analisis_llm appeared without an LLM: the standard preset should not compute it"
    )


def test_manifest_top_level_es_valido(standard_run):
    """The manifest declares the engine and a dict of chapters with per-id metadata."""
    man = standard_run["manifest"]
    assert man.get("engine") == "AutomaticEDA"
    assert man.get("engine_version")
    chapters = standard_run["chapters"]
    # Each chapter has a version + number of pages/slides (manifest format).
    for cid, meta in chapters.items():
        assert meta.get("version"), f"chapter {cid} has no version in the manifest"
        assert (meta.get("n_pages") or 0) > 0, f"chapter {cid} has 0 pages"


# --------------------------------------------------------------------------- #
# 2) KEY CONTENT PER CHAPTER (acceptance): each chapter has its ESSENTIAL
#    content in the Markdown, not just the heading. A missing fragment names
#    the chapter and the text that is missing.
# --------------------------------------------------------------------------- #
# STABLE text fragments that each chapter emits in the Markdown of the
# synthetic dataset. They are not fragile numbers: they are chapter
# labels/structure plus column names from the fixture. If a chapter stops
# populating its content, its fragment disappears and the test fails naming it.
CHAPTER_NEEDLES = {
    "portada": ["800 filas", "19 columnas"],
    "overview": ["Primeras filas (df.head)", "Diccionario de columnas",
                 "customer_id", "signup_date"],
    "num_distr": ["Distribuciones numéricas", "vallas Tukey", "income"],
    "cat_distr": ["Distribuciones categóricas", "Entropía", "Top categorías",
                  "country"],
    "text_distr": ["Texto libre (NLP)", "TTR", "Términos más frecuentes",
                   "Idioma dominante"],
    "calidad": ["Cómo se calcula la calidad", "Calidad global"],
    "missingness": ["Datos faltantes", "Celdas faltantes (global)",
                    "Faltantes por columna"],
    "outliers": ["Valores atípicos por columna", "Filas atípicas (multivariante)",
                 "Isolation Forest", "Filas analizadas"],
    "correlacion": ["Matriz de asociación", "Pares más correlacionados"],
    "relaciones": ["Candidatas a clave primaria", "customer_id"],
    "modelos": ["PCA — varianza explicada", "Segmentación (KMeans)"],
    "timeseries": ["Series temporales", "Columna de fecha", "signup_date"],
    "geospatial": ["Análisis geoespacial", "Extensión geográfica", "Centroide"],
    "agregacion": ["Agregación por grupos", "Agrupado por"],
    "glosario": ["Glosario de términos",
                 "### Isolation Forest (anomalías multivariantes)",
                 "### PCA (componentes principales)"],
}


def test_needles_cubren_exactamente_los_capitulos_standard():
    """Maintenance guard: the needles cover the same 15 non-LLM chapters.

    If someone adds a new chapter to CHAPTER_ORDER, this test is a reminder that
    its essential content must be documented here (or marked as LLM-only)."""
    assert set(CHAPTER_NEEDLES.keys()) == set(EXPECTED_STANDARD), (
        "CHAPTER_NEEDLES out of sync with the expected standard chapters: "
        f"missing needles for {set(EXPECTED_STANDARD) - set(CHAPTER_NEEDLES)}, "
        f"extra {set(CHAPTER_NEEDLES) - set(EXPECTED_STANDARD)}"
    )


@pytest.mark.parametrize("chapter_id", list(CHAPTER_NEEDLES.keys()))
def test_capitulo_trae_su_contenido_esencial(standard_run, chapter_id):
    md = standard_run["md"]
    # Precondition: the chapter is in the manifest (coverage). If not, it is a
    # coverage failure, not a content failure, and it is reported as such.
    assert chapter_id in standard_run["chapters"], (
        f"chapter {chapter_id} missing from the manifest (coverage failure)"
    )
    for needle in CHAPTER_NEEDLES[chapter_id]:
        assert needle in md, (
            f"chapter '{chapter_id}': its essential content is missing from the "
            f"Markdown; missing fragment: {needle!r}"
        )


def test_outliers_isolation_forest_poblado_no_degradado(standard_run):
    """The multivariate block (Isolation Forest) comes out with data, not degraded."""
    md = standard_run["md"]
    assert "Anomalías multivariantes" in md
    assert "Filas analizadas" in md, "the Isolation Forest does not include its populated table"
    assert "No se pudo analizar la anomalía multivariante" not in md, (
        "the multivariate block came out degraded in the full report"
    )
    # The profile includes the models block with the multivariate outliers.
    models = (standard_run["r"]["profile"] or {}).get("models") or {}
    assert models.get("outliers") is not None, "profile['models']['outliers'] is empty"


# --------------------------------------------------------------------------- #
# 3) STANDALONE CHAPTERS WITH RESOLVED DEPS (acceptance of only_chapters):
#    requesting a single chapter leaves it POPULATED because dependency
#    resolution activates the computation it needs, even if the caller did not
#    ask for it.
# --------------------------------------------------------------------------- #
def test_only_outliers_isolation_forest_poblado(synth_db, tmp_path):
    """only=['outliers'] without explicit run_models → IsolationForest populated."""
    out = str(tmp_path / "only_out")
    r = render_automatic_eda(
        synth_db["db"], synth_db["table"],
        only_chapters=["outliers"], out_dir=out, basename="only_outliers",
    )
    assert r["status"] == "ok", r.get("error")
    # Document = portada + outliers + glosario, nothing else.
    assert _manifest_chapters(r) == {"portada", "outliers", "glosario"}
    md = open(r["aeda_md_path"], encoding="utf-8").read()
    assert "Filas atípicas (multivariante)" in md
    assert "Filas analizadas" in md, "Isolation Forest without a populated table"
    assert "No se pudo analizar la anomalía multivariante" not in md, (
        "the multivariate block came out degraded despite resolving the deps"
    )
    # The resolution activated run_models → the profile includes the models block.
    assert ((r["profile"] or {}).get("models") or {}).get("outliers") is not None


def test_only_timeseries_rango_temporal_presente(synth_db, tmp_path):
    """only=['timeseries'] → time range populated (run_series resolved)."""
    out = str(tmp_path / "only_ts")
    r = render_automatic_eda(
        synth_db["db"], synth_db["table"],
        only_chapters=["timeseries"], out_dir=out, basename="only_ts",
    )
    assert r["status"] == "ok", r.get("error")
    assert "timeseries" in _manifest_chapters(r)
    md = open(r["aeda_md_path"], encoding="utf-8").read()
    assert "Columna de fecha" in md
    assert "signup_date" in md, "the series does not name its date column"
    # run_series resolved via deps → the profile includes the series analysis.
    assert (r["profile"] or {}).get("series") is not None, (
        "only=['timeseries'] must activate run_series through dependencies"
    )


def test_only_correlacion_scatters_presentes(synth_db, tmp_path):
    """only=['correlacion'] → matrix + scatters of the strong pairs."""
    out = str(tmp_path / "only_corr")
    r = render_automatic_eda(
        synth_db["db"], synth_db["table"],
        only_chapters=["correlacion"], out_dir=out, basename="only_corr",
    )
    assert r["status"] == "ok", r.get("error")
    assert _manifest_chapters(r) == {"portada", "correlacion", "glosario"}
    md = open(r["aeda_md_path"], encoding="utf-8").read()
    assert "Matriz de asociación" in md
    assert "Relaciones más fuertes (scatter)" in md, "the scatters are missing"
    assert "Dispersión de" in md, "no scatter figure was emitted"


# --------------------------------------------------------------------------- #
# 4) NONE WHEN NOT APPLICABLE: on a table with no long text, no date and no
#    lat/lon, text_distr / timeseries / geospatial do NOT appear and the EDA
#    does not crash.
# --------------------------------------------------------------------------- #
def test_capitulos_opcionales_ausentes_cuando_no_aplican(minimal_db, tmp_path):
    out = str(tmp_path / "minimal_out")
    r = render_automatic_eda(
        minimal_db["db"], minimal_db["table"],
        profile_level="standard", out_dir=out, basename="minimal",
    )
    assert r["status"] == "ok", r.get("error")
    chapters = _manifest_chapters(r)
    for absent in ("text_distr", "timeseries", "geospatial"):
        assert absent not in chapters, (
            f"chapter {absent} appeared for a table that does not justify it "
            f"(present: {sorted(chapters)})"
        )
    # The document is still valid: portada + glosario + the chapters that do
    # apply (at least overview/num_distr).
    assert {"portada", "glosario", "overview", "num_distr"} <= chapters


# --------------------------------------------------------------------------- #
# 5) MULTI-TABLE FOLDER (acceptance): the folder report profiles the N tables
#    and the relaciones chapter detects the FK by containment.
# --------------------------------------------------------------------------- #
def test_folder_multitabla_con_fk_detectada(tmp_path):
    fdir = str(tmp_path / "folder")
    g = generate_synthetic_eda_folder(fdir, n_rows=300, seed=SEED)
    assert g["status"] == "ok", g.get("error")

    out = str(tmp_path / "fout")
    rf = render_automatic_eda_folder(fdir, out_dir=out, basename="folder")
    assert rf["status"] == "ok", rf.get("error")

    # All 3 tables were profiled.
    assert rf["n_tables"] == 3, f"expected 3 tables, saw {rf['n_tables']}"

    # The base manifest includes the inter-table relaciones chapter.
    with open(rf["manifest_path"], encoding="utf-8") as fh:
        chapters = set((json.load(fh).get("chapters") or {}).keys())
    assert "relaciones" in chapters, (
        f"the folder document does not include the relaciones chapter: {chapters}"
    )

    # The Markdown names the 3 tables and declares the FK detected by containment.
    md = open(rf["md_path"], encoding="utf-8").read()
    for tbl in ("customers", "orders", "reviews"):
        assert tbl in md, f"table {tbl} does not appear in the folder report"
    assert "FK candidatas" in md, "the FK candidates are not declared"
    assert "orders.customer_id" in md and "customers.customer_id" in md, (
        "the orders→customers FK was not detected by containment"
    )
    assert "reviews.customer_id" in md, "the reviews→customers FK was not detected"


# --------------------------------------------------------------------------- #
|
||||
# 6) MD COMPLETITUD (regresión) — el Markdown trae el apéndice con la matriz de
|
||||
# asociación COMPLETA (todos los pares, no solo el top) y el describe con
|
||||
# skew/kurtosis de todas las numéricas. Protege un fix ya mergeado.
|
||||
# --------------------------------------------------------------------------- #
|
||||
def test_md_apendice_matriz_correlacion_completa(standard_run):
|
||||
md = standard_run["md"]
|
||||
assert "Matriz de asociación — todos los pares" in md, (
|
||||
"falta el apéndice con la matriz de asociación completa"
|
||||
)
|
||||
# Un par num-num de correlación BAJA que el top del capítulo NUNCA mostraría:
|
||||
# su presencia prueba que el apéndice lista TODOS los pares, no solo el top.
|
||||
assert "income ↔ longitude" in md, (
|
||||
"el apéndice no contiene los pares de baja correlación: no es la matriz "
|
||||
"completa, solo el top-k del capítulo"
|
||||
)
|
||||
|
||||
|
||||
def test_md_apendice_describe_con_skew_kurtosis(standard_run):
|
||||
md = standard_run["md"]
|
||||
assert "Estadísticos numéricos completos (describe)" in md, (
|
||||
"falta el apéndice describe completo"
|
||||
)
|
||||
# La cabecera del describe del apéndice lleva las columnas skew y kurtosis
|
||||
# (subcadena única de ese header). Sin ellas el describe está incompleto.
|
||||
assert "| skew | kurtosis |" in md, (
|
||||
"el describe del apéndice no trae las columnas skew/kurtosis"
|
||||
)
|
||||
|
||||
|
||||
# --------------------------------------------------------------------------- #
|
||||
# 7) LAS 3 SALIDAS NO-VACÍAS — PDF con páginas, PPTX con slides, MD con un mínimo
|
||||
# de caracteres, y los tres archivos en disco. Manifest válido.
|
||||
# --------------------------------------------------------------------------- #
|
||||
def test_tres_salidas_no_vacias(standard_run):
|
||||
r = standard_run["r"]
|
||||
assert r["pdf_path"] and os.path.exists(r["pdf_path"])
|
||||
assert r["pptx_path"] and os.path.exists(r["pptx_path"])
|
||||
assert r["aeda_md_path"] and os.path.exists(r["aeda_md_path"])
|
||||
assert (r["n_pages"] or 0) > 0, "el PDF no tiene páginas"
|
||||
assert (r["n_slides"] or 0) > 0, "el PPTX no tiene slides"
|
||||
# El informe completo es grande: un mínimo holgado protege contra un MD vacío
|
||||
# o truncado sin atarse a un tamaño exacto.
|
||||
assert (r["md_chars"] or 0) > 10000, f"MD demasiado corto: {r['md_chars']} chars"
|
||||
assert r["manifest_path"] and os.path.exists(r["manifest_path"])
|
||||
|
||||
|
||||
def test_pdf_texto_extraible_con_contenido(standard_run):
|
||||
"""Si pdftotext está disponible, el PDF debe traer texto real (no solo
|
||||
imágenes): la portada nombra el dataset y su forma. Si no está la
|
||||
herramienta, el test se omite (no es un fallo del EDA)."""
|
||||
txt = standard_run["pdf_text"]
|
||||
if txt is None:
|
||||
pytest.skip("pdftotext no disponible")
|
||||
assert len(txt) > 5000, "el PDF apenas tiene texto extraíble"
|
||||
assert "Portada" in txt or "synthetic" in txt, (
|
||||
"el texto del PDF no contiene la portada esperada"
|
||||
)
|
||||
|
||||
|
||||
# --------------------------------------------------------------------------- #
|
||||
# DETERMINISMO — dos renders del MISMO dataset producen el MISMO manifest
|
||||
# (mismos capítulos y mismos n_pages/n_slides por capítulo). El generated_at
|
||||
# difiere por timestamp, por eso se compara el dict de capítulos, no el archivo.
|
||||
# --------------------------------------------------------------------------- #
|
||||
def test_render_es_determinista(synth_db, tmp_path):
    out1 = str(tmp_path / "det1")
    out2 = str(tmp_path / "det2")
    r1 = render_automatic_eda(synth_db["db"], synth_db["table"],
                              profile_level="standard", out_dir=out1, basename="d1")
    r2 = render_automatic_eda(synth_db["db"], synth_db["table"],
                              profile_level="standard", out_dir=out2, basename="d2")
    assert r1["status"] == "ok" and r2["status"] == "ok"
    # Read each manifest inside a context manager so the file handle is closed.
    with open(r1["manifest_path"], encoding="utf-8") as fh:
        c1 = json.load(fh).get("chapters")
    with open(r2["manifest_path"], encoding="utf-8") as fh:
        c2 = json.load(fh).get("chapters")
    assert c1 == c2, "the manifest is not deterministic across two renders of the same dataset"
|
||||
|
||||
|
||||
# --------------------------------------------------------------------------- #
|
||||
# SLOW (opcional, skippeable) — informe `full` con narrativa LLM. Requiere red /
|
||||
# credenciales y NO es determinista, por eso está apagado salvo opt-in explícito
|
||||
# vía la variable de entorno EDA_ACCEPT_LLM=1. Se omite con skipif (no con un
|
||||
# marker custom) para no depender de registro de marks en la config del repo.
|
||||
# --------------------------------------------------------------------------- #
|
||||
@pytest.mark.skipif(
|
||||
os.environ.get("EDA_ACCEPT_LLM") != "1",
|
||||
reason="full+LLM es lento/no determinista; exporta EDA_ACCEPT_LLM=1 para correrlo",
|
||||
)
|
||||
def test_full_incluye_capitulo_analisis_llm(synth_db, tmp_path):
|
||||
out = str(tmp_path / "full")
|
||||
r = render_automatic_eda(synth_db["db"], synth_db["table"],
|
||||
profile_level="full", out_dir=out, basename="full")
|
||||
assert r["status"] == "ok", r.get("error")
|
||||
assert "analisis_llm" in _manifest_chapters(r), (
|
||||
"el preset full debe incluir el capítulo de análisis LLM"
|
||||
)
|
||||
@@ -4,8 +4,8 @@ kind: pipeline
|
||||
lang: py
|
||||
domain: pipelines
|
||||
purity: impure
|
||||
version: "1.1.0"
|
||||
signature: "def render_automatic_eda(db_path: str, table: str, backend: str = \"duckdb\", sample: int = None, run_models: bool = None, run_series: bool = None, run_llm: bool = None, profile_level: str = \"standard\", out_dir: str = \"reports\", basename: str = None, ctx_extra: dict = None) -> dict"
|
||||
version: "1.2.0"
|
||||
signature: "def render_automatic_eda(db_path: str, table: str, backend: str = \"duckdb\", sample: int = None, run_models: bool = None, run_series: bool = None, run_llm: bool = None, profile_level: str = \"standard\", out_dir: str = \"reports\", basename: str = None, ctx_extra: dict = None, emit_md: bool = True, only_chapters: list = None) -> dict"
|
||||
description: "Informe AutomaticEDA COMPLETO one-shot de una tabla DuckDB/PostgreSQL: perfila con profile_table, construye el ctx con los datos crudos (build_eda_render_ctx: raw_numeric para modelos/geo, timeseries_raw para series, geo_points para el mapa, db_path/table para la agregacion push-down) y emite PDF (A5 movil) Y PPTX (16:9) del mismo documento por capitulos, con los 11 capitulos POBLADOS de verdad (clusters pintados sobre el PCA, evolucion temporal, mapa geografico y tablas de agregacion), no degradados. El parametro profile_level es un preset de consumo CPU/LLM (lite/standard/full) que mapea a los flags run_models/run_series/run_llm/sample; un flag explicito siempre prima sobre el preset. lite=bajo consumo (sin LLM, sin serie, modelos solo PCA+normalidad sin KMeans/IsolationForest, sample reducido); standard=comportamiento historico; full=standard+narrativa LLM. Devuelve las rutas de PDF/PPTX y el manifiesto de versiones por capitulo."
|
||||
tags: [eda, duckdb, postgres, profiling, pipeline, dataops, report, pdf, pptx]
|
||||
uses_functions:
|
||||
@@ -46,6 +46,10 @@ params:
|
||||
desc: "Nombre base de los archivos sin extension. Default 'aeda_<table>_<timestamp>'."
|
||||
- name: ctx_extra
|
||||
desc: "Dict opcional con claves de presentacion/contexto extra que se mezclan en el ctx (dataset_name, description, source_origin, ...); no pisan las claves de datos calculadas por build_eda_render_ctx."
|
||||
- name: emit_md
|
||||
desc: "Ademas del PDF y el PPTX, emite un Markdown autocontenido del mismo documento por capitulos (texto + tablas markdown, sin binarios) para pegar a un LLM. Default True. La ruta sale en aeda_md_path."
|
||||
- name: only_chapters
|
||||
desc: "Lista opcional de ids de capitulo a renderizar (subconjunto de CHAPTER_ORDER) para iterar/testear un capitulo suelto sin generar el documento entero. Default None => documento COMPLETO (retrocompatible). Cuando se pasa una lista: (1) se VALIDA contra CHAPTER_ORDER, un id desconocido o lista vacia devuelve error claro listando los validos; (2) se RESUELVEN las dependencias de computo de esos capitulos (automatic_eda.chapter_deps) activando los flags que necesiten (run_models/run_series/run_llm) aunque el caller no los pidiera y construyendo SOLO las piezas de ctx que leen, de modo que el capitulo suelto SIEMPRE llega poblado (p.ej. ['outliers'] activa run_models y conserva raw_numeric -> Isolation Forest completo) sin malgastar CPU/LLM en lo que ningun capitulo pedido usa; (3) el documento y su manifest contienen SOLO esos capitulos MAS portada (primera) y glosario (ultima, cuando hay terminos clicables). Un flag explicito del caller prima sobre la resolucion de dependencias."
|
||||
output: "dict {status:'ok', pdf_path:str, pptx_path:str, manifest_path:str|None, n_pages:int, n_slides:int, pdf_note:str, pptx_note:str, profile:<TableProfile>} o {status:'error', error:str} (dict-no-throw)."
|
||||
---
|
||||
|
||||
@@ -69,6 +73,21 @@ r = render_automatic_eda("/tmp/ventas.duckdb", "ventas", profile_level="full")
|
||||
# Precedencia: el flag explicito SIEMPRE prima sobre el preset. lite pero con LLM:
|
||||
r = render_automatic_eda("/tmp/ventas.duckdb", "ventas",
|
||||
profile_level="lite", run_llm=True) # el LLM SI se ejecuta
|
||||
|
||||
# Capitulo SUELTO: itera/testea un capitulo sin generar el documento entero. La
|
||||
# resolucion de dependencias activa el computo que el capitulo necesita aunque no
|
||||
# se pase explicito. Pedir solo 'outliers' activa run_models y conserva
|
||||
# raw_numeric -> el bloque Isolation Forest sale COMPLETO. Documento = portada +
|
||||
# outliers + glosario.
|
||||
r = render_automatic_eda("/tmp/ventas.duckdb", "ventas", only_chapters=["outliers"])
|
||||
|
||||
# Varios capitulos sueltos a la vez (se unen sus dependencias):
|
||||
r = render_automatic_eda("/tmp/ventas.duckdb", "ventas",
|
||||
only_chapters=["correlacion", "missingness"])
|
||||
|
||||
# id desconocido -> error claro listando los validos (dict-no-throw, no lanza):
|
||||
r = render_automatic_eda("/tmp/ventas.duckdb", "ventas", only_chapters=["nope"])
|
||||
# {'status': 'error', 'error': 'only_chapters con ids desconocidos: nope. Capitulos validos: portada, overview, ...'}
|
||||
```
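
The last example above relies on the selection being validated before any work is done, with the failure returned as a dict instead of raised. A minimal sketch of that check, assuming a hypothetical `check_only_chapters` helper and an illustrative `CHAPTER_ORDER` (the real list and the real `validate_chapter_ids` are not reproduced here, and the error wording is illustrative):

```python
# Hypothetical chapter order for illustration only; the real CHAPTER_ORDER
# is defined in the automatic_eda package and is not reproduced here.
CHAPTER_ORDER = ["portada", "overview", "outliers", "geospatial", "glosario"]


def check_only_chapters(only_chapters, chapter_order=CHAPTER_ORDER):
    """Return an error dict for a bad selection, or None if it is acceptable."""
    if only_chapters is None:
        return None  # None means "render the full document"
    valid_list = ", ".join(chapter_order)
    if not isinstance(only_chapters, (list, tuple)):
        return {"status": "error",
                "error": "only_chapters must be a list of chapter ids or None"}
    if not only_chapters:
        return {"status": "error",
                "error": "only_chapters is empty. Valid chapters: " + valid_list}
    unknown = [c for c in only_chapters if c not in chapter_order]
    if unknown:
        return {"status": "error",
                "error": "unknown chapter ids: " + ", ".join(map(str, unknown))
                         + ". Valid chapters: " + valid_list}
    return None
```

Returning the error as a value keeps the dict-no-throw contract: the caller inspects `status` instead of wrapping the call in `try/except`.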
|
||||
|
||||
## Cuando usarla
|
||||
@@ -86,6 +105,16 @@ Para un EDA **barato/rapido** (CI, vistazo previo, maquina sin GPU o sin red) us
|
||||
temporal y el LLM. Para el **maximo** con interpretacion narrativa por capitulo,
|
||||
`profile_level="full"`. El default `"standard"` mantiene el comportamiento previo.
|
||||
|
||||
When you are **iterating on or testing ONE specific chapter** (tuning the
outliers render, checking the geospatial map, debugging the aggregation), use
`only_chapters=[...]`. It generates the document with only those chapters (plus
the cover, `portada`, and the glossary, `glosario`) but **resolves their compute
dependencies** so that the standalone chapter never comes out degraded:
requesting `['outliers']` enables `run_models` and keeps `raw_numeric` even if
you do not pass them, while spending no CPU/LLM on anything that none of the
requested chapters needs (requesting `['geospatial']` does not run models). This
is much faster than rendering the whole report on every iteration. The central
dependency map lives in `automatic_eda/chapter_deps.py` (the source of truth).
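
The resolution step described above can be sketched as a union over a per-chapter dependency map, followed by the "explicit flag wins" rule. This is only an illustration under stated assumptions: the `CHAPTER_DEPS` contents, `resolve` and `effective_flag` below are hypothetical names and values, not the actual contents of `automatic_eda/chapter_deps.py`.

```python
# Hypothetical dependency map for illustration; the real map lives in
# automatic_eda/chapter_deps.py and its exact contents are not shown here.
CHAPTER_DEPS = {
    "outliers": {"flags": {"run_models"}, "ctx_keys": {"raw_numeric"}},
    "timeseries": {"flags": {"run_series"}, "ctx_keys": {"timeseries_raw"}},
    "geospatial": {"flags": set(), "ctx_keys": {"geo_points"}},
    "num_distr": {"flags": set(), "ctx_keys": set()},
}


def resolve(chapters):
    """Union the cost flags and ctx data keys needed by the requested chapters."""
    flags, ctx_keys = set(), set()
    for chapter in chapters:
        deps = CHAPTER_DEPS.get(chapter, {})
        flags |= deps.get("flags", set())
        ctx_keys |= deps.get("ctx_keys", set())
    return {"profile_flags": flags, "ctx_keys": ctx_keys}


def effective_flag(name, explicit, resolved_flags):
    """An explicit caller value (not None) wins; otherwise the deps decide."""
    return (name in resolved_flags) if explicit is None else explicit


req = resolve(["outliers", "geospatial"])
run_models = effective_flag("run_models", None, req["profile_flags"])
run_llm = effective_flag("run_llm", None, req["profile_flags"])
```

With this selection `run_models` ends up enabled because `outliers` asks for it, and `run_llm` stays off because no requested chapter needs it.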
|
||||
|
||||
## Gotchas
|
||||
|
||||
- Impura: ESCRIBE el PDF, el PPTX y `automatic_eda_manifest.json` en `out_dir`.
|
||||
@@ -111,9 +140,29 @@ temporal y el LLM. Para el **maximo** con interpretacion narrativa por capitulo,
|
||||
- Los datos crudos del ctx se muestrean con `sample` (LIMIT), no se trae la tabla
|
||||
entera a RAM; con tablas enormes sube `sample` si quieres mas representatividad
|
||||
(coste: mas memoria).
|
||||
- **`only_chapters` and the glossary**: the glossary (last chapter) only appears
  if some body chapter registered clickable terms. A standalone chapter that
  registers no terms (e.g. `timeseries`, `geospatial`) comes out as cover + that
  chapter, with no glossary, because there is nothing to link to. This is
  correct, not a failure.
- **`only_chapters` with `profile_level="lite"`**: for standalone chapters the
  preset only governs `sample`; the models do NOT take the "lite" path (which
  would prune `ctx['raw_numeric']` and leave outliers without its live
  multivariate analysis). For standalone chapters, dependency resolution is in
  charge, not the model-cost preset.
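
The interaction between the preset and standalone chapters can be sketched as follows. The preset values and the `plan` helper are hypothetical (the real `_PROFILE_PRESETS` values are not shown in this document); only the selection logic mirrors what the pipeline diff does: an unknown level falls back to `standard`, an explicit `sample` wins, and `model_opts` is dropped whenever `only_chapters` is given.

```python
# Hypothetical preset table; the numbers and the model_opts payload are
# placeholders, not the real _PROFILE_PRESETS values.
PRESETS = {
    "lite": {"sample": 5_000, "model_opts": {"cheap": True}},
    "standard": {"sample": 50_000, "model_opts": None},
}


def plan(profile_level, only_chapters=None, sample=None):
    """Pick the sample size and model options for a render."""
    # Unknown preset names fall back to "standard" instead of raising.
    preset = PRESETS.get(profile_level, PRESETS["standard"])
    chosen_sample = preset["sample"] if sample is None else sample
    # Standalone chapters never take the cheap-models path: it would prune
    # raw_numeric, which a chapter such as outliers still needs.
    model_opts = None if only_chapters is not None else preset["model_opts"]
    return {"sample": chosen_sample, "model_opts": model_opts}
```

The preset still decides how many rows are sampled, so `lite` remains cheaper on I/O even when a single chapter is requested.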
|
||||
|
||||
## Capability growth log
|
||||
|
||||
- v1.2.0 (2026-06-30): adds the `only_chapters` parameter, which renders a
  SUBSET of chapters (to iterate on or test a single one) while resolving their
  compute dependencies via `automatic_eda/chapter_deps.py` (central map
  CHAPTER_DEPS). It enables the cost flags the chapter needs (run_models/
  run_series/run_llm) even if the caller does not pass them and builds only the
  pieces of ctx that it reads, so the standalone chapter ALWAYS arrives
  populated (golden: ['outliers'] -> full Isolation Forest) without spending on
  what it does not use. The selection travels to build_document through the
  reserved key `ctx['_only_chapters']` (the renderers do not change). Validates
  ids (clear dict-no-throw error). Additive, backward-compatible change:
  `only_chapters=None` produces the full document, identical to v1.1.0.
|
||||
- v1.1.0 (2026-06-30) — anade el parametro `profile_level` (lite/standard/full),
|
||||
preset de consumo CPU/LLM que mapea a los flags run_models/run_series/run_llm/
|
||||
sample. lite limita los modelos a PCA+normalidad (cableado a run_eda_models con
|
||||
|
||||
@@ -99,6 +99,7 @@ def render_automatic_eda(
|
||||
basename: str = None,
|
||||
ctx_extra: dict = None,
|
||||
emit_md: bool = True,
|
||||
only_chapters: list = None,
|
||||
) -> dict:
|
||||
"""Perfila una tabla y emite el informe AutomaticEDA completo (PDF + PPTX).
|
||||
|
||||
@@ -150,6 +151,29 @@ def render_automatic_eda(
|
||||
MISMO documento por capítulos (texto plano + tablas markdown, sin
|
||||
binarios), pensado para pegar a un LLM. Default True. La ruta sale en
|
||||
la clave de retorno ``aeda_md_path``. No altera las demás salidas.
|
||||
        only_chapters: optional list of chapter ids to render (a SUBSET of
            CHAPTER_ORDER) to iterate on or test one specific chapter without
            generating the whole document. Default None => FULL document,
            identical to today's (backward-compatible). When a list is passed:

            - It is VALIDATED against CHAPTER_ORDER; an unknown id returns a
              clear error listing the valid ones (dict-no-throw, never raises).
              An empty list ``[]`` also returns an error (pass at least one
              chapter, or None).
            - The compute dependencies of those chapters are RESOLVED
              (``automatic_eda.chapter_deps``): the cost flags they need
              (run_models / run_series / run_llm) are enabled EVEN IF the
              caller did not request them, and ONLY the pieces of ``ctx`` those
              chapters read are built. A standalone chapter therefore ALWAYS
              arrives populated (e.g. ``only_chapters=['outliers']`` enables
              run_models and keeps ``ctx['raw_numeric']`` so that the
              IsolationForest block comes out complete), while no CPU/LLM is
              spent on anything that none of the requested chapters uses
              (requesting only ``geospatial`` does not run models).
            - The document (PDF/PPTX/MD) and its manifest contain ONLY those
              chapters, PLUS the cover (first) and the glossary (last). The
              glossary appears when a body chapter registers clickable terms,
              so that those terms have a target.
            - An explicit caller flag (run_models/run_series/run_llm != None)
              ALWAYS takes precedence over whatever the dependencies resolve.
|
||||
|
||||
Returns:
|
||||
dict (nunca lanza). En éxito::
|
||||
@@ -169,11 +193,56 @@ def render_automatic_eda(
|
||||
# "standard" (comportamiento histórico), sin lanzar.
|
||||
preset = _PROFILE_PRESETS.get(profile_level, _PROFILE_PRESETS["standard"])
|
||||
sample = preset["sample"] if sample is None else sample
|
||||
run_models = preset["run_models"] if run_models is None else run_models
|
||||
run_series = preset["run_series"] if run_series is None else run_series
|
||||
run_llm = preset["run_llm"] if run_llm is None else run_llm
|
||||
model_opts = preset["model_opts"]
|
||||
|
||||
# 0.bis) Modo "capítulos sueltos": valida la selección y RESUELVE sus
|
||||
# dependencias de cómputo. Es lo que garantiza que un capítulo pedido
|
||||
# llegue completo (activa lo que necesita) sin malgastar en lo que no.
|
||||
# Cuando only_chapters es None se conserva el camino histórico (preset).
|
||||
if only_chapters is not None:
|
||||
from datascience.automatic_eda import CHAPTER_ORDER
|
||||
from datascience.automatic_eda.chapter_deps import (
|
||||
needs_render_ctx,
|
||||
resolve_ctx_data_keys,
|
||||
resolve_requirements,
|
||||
validate_chapter_ids,
|
||||
)
|
||||
|
||||
if not isinstance(only_chapters, (list, tuple)):
|
||||
return {"status": "error",
|
||||
"error": "only_chapters debe ser una lista de ids de "
|
||||
"capítulo o None (documento completo)."}
|
||||
only_chapters = list(only_chapters)
|
||||
if not only_chapters:
|
||||
return {"status": "error",
|
||||
"error": "only_chapters=[] está vacío. Pasa al menos un "
|
||||
"capítulo, o None para el documento completo. "
|
||||
"Capítulos válidos: " + ", ".join(CHAPTER_ORDER)}
|
||||
checked = validate_chapter_ids(only_chapters, CHAPTER_ORDER)
|
||||
if checked["unknown"]:
|
||||
return {"status": "error",
|
||||
"error": "only_chapters con ids desconocidos: "
|
||||
+ ", ".join(checked["unknown"])
|
||||
+ ". Capítulos válidos: "
|
||||
+ ", ".join(CHAPTER_ORDER)}
|
||||
only_chapters = checked["valid"]
|
||||
|
||||
# Las dependencias fijan el DEFAULT de cada flag de coste (eficiencia:
|
||||
# lo que ningún capítulo pedido necesita queda en False); un flag
|
||||
# explícito del caller (!= None) sigue primando.
|
||||
dep_flags = resolve_requirements(only_chapters)["profile_flags"]
|
||||
run_models = ("run_models" in dep_flags) if run_models is None else run_models
|
||||
run_series = ("run_series" in dep_flags) if run_series is None else run_series
|
||||
run_llm = ("run_llm" in dep_flags) if run_llm is None else run_llm
|
||||
# En capítulos sueltos no se usa el camino "modelos baratos" (lite),
|
||||
# que poda ctx['raw_numeric']: un capítulo como outliers lo necesita
|
||||
# para su multivariante en vivo. El preset solo gobierna `sample`.
|
||||
model_opts = None
|
||||
else:
|
||||
run_models = preset["run_models"] if run_models is None else run_models
|
||||
run_series = preset["run_series"] if run_series is None else run_series
|
||||
run_llm = preset["run_llm"] if run_llm is None else run_llm
|
||||
|
||||
# En el camino "modelos baratos" (lite) profile_table NO corre los
|
||||
# modelos: los ejecuta este pipeline con run_eda_models y la granularidad
|
||||
# del preset, evitando pagar el coste CPU de KMeans + IsolationForest.
|
||||
@@ -217,10 +286,25 @@ def render_automatic_eda(
|
||||
if ctx_extra:
|
||||
base_ctx.update(ctx_extra)
|
||||
|
||||
ctx = build_eda_render_ctx(
|
||||
db_path, table, prof, backend=backend, sample=sample,
|
||||
base_ctx=base_ctx,
|
||||
)
|
||||
# En modo capítulos sueltos, si NINGÚN capítulo pedido necesita datos
|
||||
# crudos del ctx, se salta build_eda_render_ctx por completo (ahorro real
|
||||
# de I/O): solo se conservan presentación + db_path/table. Si sí los
|
||||
# necesita, se construye el ctx y luego se PODAN las piezas de datos que
|
||||
# ningún capítulo pedido usa (db_path/table nunca se podan).
|
||||
if only_chapters is not None and not needs_render_ctx(only_chapters):
|
||||
ctx = dict(base_ctx)
|
||||
ctx["db_path"] = db_path
|
||||
ctx["table"] = table
|
||||
else:
|
||||
ctx = build_eda_render_ctx(
|
||||
db_path, table, prof, backend=backend, sample=sample,
|
||||
base_ctx=base_ctx,
|
||||
)
|
||||
if only_chapters is not None and isinstance(ctx, dict):
|
||||
keep = resolve_ctx_data_keys(only_chapters)
|
||||
for k in ("head_rows", "raw_numeric", "timeseries_raw", "geo_points"):
|
||||
if k not in keep:
|
||||
ctx.pop(k, None)
|
||||
|
||||
# 2.5) Camino lite — modelos baratos (PCA + normalidad, sin KMeans ni
|
||||
# IsolationForest). profile_table no corrió los modelos; aquí se corren
|
||||
@@ -245,6 +329,13 @@ def render_automatic_eda(
|
||||
ctx.pop("raw_numeric", None)
|
||||
|
||||
# 3) Render a ambos formatos desde el MISMO documento por capítulos.
|
||||
# In standalone-chapter mode, the selection travels to build_document via a
# reserved ctx key (the renderers call build_document without passing
# `only`): build_document filters the body down to those chapters and adds
# the cover (first) plus the glossary (last, when clickable terms exist).
# build_document consumes the key and removes it, so it never reaches the
# chapters.
|
||||
if only_chapters is not None and isinstance(ctx, dict):
|
||||
ctx["_only_chapters"] = list(only_chapters)
|
||||
os.makedirs(out_dir, exist_ok=True)
|
||||
ts = datetime.now(timezone.utc).strftime("%Y%m%d-%H%M%S")
|
||||
base = basename or f"aeda_{table}_{ts}"
|
||||
@@ -283,6 +374,7 @@ def render_automatic_eda(
|
||||
"pdf_note": rpdf.get("note"),
|
||||
"pptx_note": rpptx.get("note"),
|
||||
"md_note": rmd.get("note"),
|
||||
"only_chapters": only_chapters,
|
||||
"profile": prof,
|
||||
}
|
||||
except Exception as e: # noqa: BLE001 — dict-no-throw: degradar, nunca lanzar.
|
||||
|
||||
@@ -0,0 +1,235 @@
|
||||
"""Tests del modo `only_chapters` del pipeline render_automatic_eda.
|
||||
|
||||
Cubre la tarea de "capítulos sueltos con resolución de dependencias":
|
||||
|
||||
- Golden (DuckDB real): pedir SOLO un capítulo genera un documento con solo
|
||||
portada + ese capítulo + glosario, y el capítulo llega COMPLETO porque la
|
||||
resolución de dependencias activó el cómputo que necesita aunque el caller
|
||||
no lo pidiera (outliers → run_models + raw_numeric → IsolationForest poblado;
|
||||
timeseries → run_series; correlacion → raw_numeric).
|
||||
- Eficiencia: pedir un capítulo que NO necesita flags caros (geospatial) no los
|
||||
activa, y un capítulo puramente agregado (num_distr) ni siquiera construye el
|
||||
ctx de datos crudos.
|
||||
- Edge: id desconocido / lista vacía / no-lista devuelven error claro sin
|
||||
lanzar; only_chapters=None mantiene el comportamiento histórico.
|
||||
"""
|
||||
|
||||
import json
|
||||
import os
|
||||
import random
|
||||
import sys
|
||||
from datetime import date, timedelta
|
||||
|
||||
_HERE = os.path.dirname(os.path.abspath(__file__))
|
||||
_FUNCTIONS = os.path.abspath(os.path.join(_HERE, "..", "..")) # python/functions
|
||||
if _FUNCTIONS not in sys.path:
|
||||
sys.path.insert(0, _FUNCTIONS)
|
||||
|
||||
import duckdb # noqa: E402
|
||||
|
||||
from pipelines.render_automatic_eda import render_automatic_eda # noqa: E402
|
||||
|
||||
|
||||
def _make_db_models(path):
|
||||
"""DB con fecha + 3 numéricas continuas en 3 clusters gaussianos.
|
||||
|
||||
Garantiza material para outliers/modelos (>=2 numéricas → IsolationForest),
|
||||
timeseries (columna DATE) y correlacion (numéricas). Mismo shape que el
|
||||
fixture del test del pipeline base.
|
||||
"""
|
||||
con = duckdb.connect(path)
|
||||
con.execute("CREATE TABLE pts (d DATE, grp VARCHAR, x1 DOUBLE, x2 DOUBLE, x3 DOUBLE)")
|
||||
random.seed(42)
|
||||
centers = [(0.0, 0.0, 0.0), (10.0, 10.0, 10.0), (20.0, 5.0, 15.0)]
|
||||
d0 = date(2024, 1, 1)
|
||||
rows = []
|
||||
for i in range(150):
|
||||
cx, cy, cz = centers[i % 3]
|
||||
rows.append((
|
||||
d0 + timedelta(days=i), f"g{i % 3}",
|
||||
round(cx + random.gauss(0, 1.0), 4),
|
||||
round(cy + random.gauss(0, 1.0), 4),
|
||||
round(cz + random.gauss(0, 1.0), 4),
|
||||
))
|
||||
con.executemany("INSERT INTO pts VALUES (?,?,?,?,?)", rows)
|
||||
con.close()
|
||||
|
||||
|
||||
def _manifest_chapters(result):
    with open(result["manifest_path"], encoding="utf-8") as fh:
        return set((json.load(fh).get("chapters") or {}).keys())
|
||||
|
||||
|
||||
# --------------------------------------------------------------------------- #
|
||||
# GOLDEN — outliers suelto: IsolationForest poblado por resolución de deps.
|
||||
# --------------------------------------------------------------------------- #
|
||||
def test_only_outliers_isolation_forest_populated_without_explicit_run_models(tmp_path):
    """The heart of the task: requesting ONLY 'outliers' without an explicit
    run_models enables run_models through dependencies and keeps
    ctx['raw_numeric'], so the multivariate block (Isolation Forest) comes out
    with data, not degraded."""
    db = str(tmp_path / "pts.duckdb")
    _make_db_models(db)
    out = str(tmp_path / "out")

    # NB: run_models is not passed; dependency resolution must enable it.
    r = render_automatic_eda(db, "pts", only_chapters=["outliers"],
                             out_dir=out, basename="only_outliers")
    assert r["status"] == "ok", r.get("error")
    assert r["only_chapters"] == ["outliers"]

    # Document = cover + outliers + glossary, nothing else.
    assert _manifest_chapters(r) == {"portada", "outliers", "glosario"}

    # The multivariate block came out POPULATED (not the degradation note).
    # Checked in the Markdown (same chapter-based document, reliable plain text).
    with open(r["aeda_md_path"], encoding="utf-8") as fh:
        md = fh.read()
    assert "Filas atípicas (multivariante)" in md
    assert "Filas analizadas" in md, "the Isolation Forest block lacks its populated table"
    assert "No se pudo analizar la anomalía multivariante" not in md, \
        "the multivariate block came out degraded despite resolving the deps"

    # Resolution enabled run_models → the profile carries the models block.
    assert ((r["profile"] or {}).get("models") or {}).get("outliers") is not None
|
||||
|
||||
|
||||
# --------------------------------------------------------------------------- #
|
||||
# GOLDEN — timeseries suelto activa run_series.
|
||||
# --------------------------------------------------------------------------- #
|
||||
def test_only_timeseries_activates_run_series(tmp_path):
|
||||
db = str(tmp_path / "pts.duckdb")
|
||||
_make_db_models(db)
|
||||
out = str(tmp_path / "out")
|
||||
|
||||
r = render_automatic_eda(db, "pts", only_chapters=["timeseries"],
|
||||
out_dir=out, basename="only_ts")
|
||||
assert r["status"] == "ok", r.get("error")
|
||||
assert "timeseries" in _manifest_chapters(r)
|
||||
assert "modelos" not in _manifest_chapters(r)
|
||||
# run_series resuelto por deps → el perfil trae el análisis de serie.
|
||||
assert (r["profile"] or {}).get("series") is not None, \
|
||||
"only_chapters=['timeseries'] debe activar run_series"
|
||||
|
||||
|
||||
# --------------------------------------------------------------------------- #
|
||||
# GOLDEN — correlacion suelto construye raw_numeric (sin activar modelos).
|
||||
# --------------------------------------------------------------------------- #
|
||||
def test_only_correlacion_builds_raw_numeric_without_models(tmp_path):
|
||||
db = str(tmp_path / "pts.duckdb")
|
||||
_make_db_models(db)
|
||||
out = str(tmp_path / "out")
|
||||
|
||||
r = render_automatic_eda(db, "pts", only_chapters=["correlacion"],
|
||||
out_dir=out, basename="only_corr")
|
||||
assert r["status"] == "ok", r.get("error")
|
||||
assert _manifest_chapters(r) == {"portada", "correlacion", "glosario"}
|
||||
# Eficiencia: correlacion no necesita los modelos → no se corrieron.
|
||||
assert ((r["profile"] or {}).get("models") or {}).get("outliers") is None
|
||||
assert (r["profile"] or {}).get("series") is None
|
||||
|
||||
|
||||
# --------------------------------------------------------------------------- #
|
||||
# Eficiencia y precedencia — vía stub (sin DuckDB).
|
||||
# --------------------------------------------------------------------------- #
|
||||
def _patch(monkeypatch, cap):
|
||||
import pipelines.render_automatic_eda as mod
|
||||
|
||||
def fake_pt(db, t, **kw):
|
||||
cap["run_models"] = kw.get("run_models")
|
||||
cap["run_series"] = kw.get("run_series")
|
||||
cap["run_llm"] = kw.get("run_llm")
|
||||
return {"status": "ok", "profile": {"columns": []}}
|
||||
|
||||
def fake_ctx(db, t, prof, **kw):
|
||||
cap["ctx_called"] = True
|
||||
return {"db_path": db, "table": t}
|
||||
|
||||
cap["ctx_called"] = False
|
||||
monkeypatch.setattr(mod, "profile_table", fake_pt)
|
||||
monkeypatch.setattr(mod, "build_eda_render_ctx", fake_ctx)
|
||||
monkeypatch.setattr(mod, "render_automatic_eda_pdf",
|
||||
lambda *a, **k: {"path": "x.pdf", "n_pages": 1,
|
||||
"manifest_path": "m.json"})
|
||||
monkeypatch.setattr(mod, "render_automatic_eda_pptx",
|
||||
lambda *a, **k: {"path": "x.pptx", "n_slides": 1})
|
||||
monkeypatch.setattr(mod, "render_automatic_eda_markdown",
|
||||
lambda *a, **k: {"path": "x.md", "n_chars": 1})
|
||||
|
||||
|
||||
def test_only_geospatial_does_not_activate_cost_flags(monkeypatch):
|
||||
"""Eficiencia: pedir solo geospatial NO corre modelos/serie/LLM."""
|
||||
cap = {}
|
||||
_patch(monkeypatch, cap)
|
||||
render_automatic_eda("db", "t", only_chapters=["geospatial"])
|
||||
assert cap["run_models"] is False
|
||||
assert cap["run_series"] is False
|
||||
assert cap["run_llm"] is False
|
||||
|
||||
|
||||
def test_only_outliers_activates_run_models_via_deps(monkeypatch):
|
||||
cap = {}
|
||||
_patch(monkeypatch, cap)
|
||||
render_automatic_eda("db", "t", only_chapters=["outliers"])
|
||||
assert cap["run_models"] is True
|
||||
assert cap["run_series"] is False
|
||||
|
||||
|
||||
def test_explicit_flag_overrides_dependency_resolution(monkeypatch):
|
||||
"""run_models=False explícito gana, aunque outliers lo pediría por deps."""
|
||||
cap = {}
|
||||
_patch(monkeypatch, cap)
|
||||
render_automatic_eda("db", "t", only_chapters=["outliers"], run_models=False)
|
||||
assert cap["run_models"] is False
|
||||
|
||||
|
||||
def test_purely_aggregated_chapter_skips_render_ctx(monkeypatch):
|
||||
"""num_distr solo lee el profile → build_eda_render_ctx no se llama."""
|
||||
cap = {}
|
||||
_patch(monkeypatch, cap)
|
||||
render_automatic_eda("db", "t", only_chapters=["num_distr"])
|
||||
assert cap["ctx_called"] is False, \
|
||||
"num_distr no necesita datos crudos: el ctx no debe construirse"
|
||||
|
||||
|
||||
def test_chapter_that_needs_ctx_builds_it(monkeypatch):
|
||||
cap = {}
|
||||
_patch(monkeypatch, cap)
|
||||
render_automatic_eda("db", "t", only_chapters=["outliers"])
|
||||
assert cap["ctx_called"] is True
|
||||
|
||||
|
||||
# --------------------------------------------------------------------------- #
|
||||
# EDGE — errores claros sin lanzar.
|
||||
# --------------------------------------------------------------------------- #
|
||||
def test_unknown_chapter_id_returns_clear_error(tmp_path):
|
||||
r = render_automatic_eda(str(tmp_path / "x.duckdb"), "t",
|
||||
only_chapters=["no_existe"])
|
||||
assert r["status"] == "error"
|
||||
assert "no_existe" in r["error"]
|
||||
assert "Capítulos válidos" in r["error"]
|
||||
# Algún id válido conocido aparece en la lista.
|
||||
assert "outliers" in r["error"]
|
||||
|
||||
|
||||
def test_empty_only_list_returns_error(tmp_path):
|
||||
r = render_automatic_eda(str(tmp_path / "x.duckdb"), "t", only_chapters=[])
|
||||
assert r["status"] == "error"
|
||||
assert "vac" in r["error"].lower()
|
||||
|
||||
|
||||
def test_only_chapters_not_a_list_returns_error(tmp_path):
|
||||
r = render_automatic_eda(str(tmp_path / "x.duckdb"), "t",
|
||||
only_chapters="outliers")
|
||||
assert r["status"] == "error"
|
||||
|
||||
|
||||
def test_only_none_keeps_full_document(tmp_path):
|
||||
"""Retro-compat: only_chapters=None genera el documento completo."""
|
||||
db = str(tmp_path / "pts.duckdb")
|
||||
_make_db_models(db)
|
||||
out = str(tmp_path / "out")
|
||||
r = render_automatic_eda(db, "pts", out_dir=out, basename="full")
|
||||
assert r["status"] == "ok", r.get("error")
|
||||
chapters = _manifest_chapters(r)
|
||||
# Documento completo: muchos más capítulos que portada/glosario.
|
||||
assert {"portada", "glosario", "overview", "correlacion"} <= chapters
|
||||
assert len(chapters) > 4
|
||||