feat(eda): CAP4/CAP5 distributions: paragraphs moved to the glossary, LLM description + unit per column, donut→bars, PPT figure on the right

CAP4 num_distr:
- Moves the long introductory histogram/boxplot paragraph to the glossary (new clickable term "histograma_boxplot"); the chapter body only names the term with [[term:histograma_boxplot]], and the full explanation (color code, 1.5·IQR, how to read skewness) lives in the glossary entry. The information is relocated, not lost.
- Adds, for each numeric column, the LLM business description and the unit, read from profile['llm']['dictionary'] (matched by column name). Without an LLM block, the description block is omitted cleanly.

CAP5 cat_distr:
- Moves the paragraph "Cada columna categórica ocupa su propia página..." to the glossary (new clickable term "pagina_categorica"); the intro only names the terms entropía and pagina_categorica.
- Adds the LLM description + unit per column (same source as CAP4).
- Replaces the donut/pie with a horizontal bar chart (new registry function categorical_top_bar_figure_py_datascience, with an input contract identical to the donut's for a direct swap), plus its inline bar fallback.
- Marks each column Group with layout="side_by_side": in PPTX the cardinality table sits on the left and the bar chart on the right; in PDF they are stacked (narrow A5 page). The renderers are not touched; layout support already existed.

Glossary:
- Canonical catalog _BASELINE_TERMS with the definitions of the two new terms; build_glosario fills in the definition of a term registered without one from the catalog (the chapters only register key + label).

Tests updated (donut→bars, side_by_side, LLM description/unit, glossary), and the new function ships with its own tests. Subsystem suite + acceptance green.
@@ -5,28 +5,32 @@ page (PDF) / slide (PPTX)**: every column is wrapped in a keep-together
 ``model.Group`` with ``page_break_before=True`` (except the first, which may share
 the intro's page), so its chart sits next to its tables and no column is split.
 
-A short intro names the clickable **[[term:entropia]]entropía[[/term]]** term —
-the full definition lives in the GLOSARIO chapter, so it is NOT repeated inline
-here (one click jumps to the glossary entry). The intro also carries the dataset
-row total used as a comparison baseline.
+Per column the Group is laid out ``side_by_side`` (PPTX: cardinality table LEFT,
+chart RIGHT; PDF: stacked) and contains, in order:
 
-Per column the Group contains, in order:
-
-1. A cardinality key/value table: distinct values, ``% distinct`` (distinct /
+1. The column name plus, when the LLM layer ran, its business **description** and
+**unit** (read from ``profile['llm']['dictionary']``, matched by column name).
+2. A cardinality key/value table: distinct values, ``% distinct`` (distinct /
 total rows), total dataset rows, singleton values (frequency 1), entropy with
 its theoretical maximum and the normalized ratio, mode, imbalance and
 string-length stats.
-2. A short note flagging problematic cardinality (id-like ≈100% distinct, or a
+3. A short note flagging problematic cardinality (id-like ≈100% distinct, or a
 single dominating category).
-3. A ``top-k`` table (value / count / %).
-4. A **donut pie chart** of the most common categories (top-k + an "Otros"
+4. A ``top-k`` table (value / count / %).
+5. A **horizontal bar chart** of the most common categories (top-k + an "Otros"
 bucket), drawn lazily so the renderers scale it to fit entirely.
 
+A short intro names the clickable **[[term:entropia]]entropía[[/term]]** and
+**[[term:pagina_categorica]]page-layout[[/term]]** terms — their full
+definitions live in the GLOSARIO chapter, so they are NOT repeated inline here
+(one click jumps to the glossary entry). The intro also carries the dataset row
+total used as a comparison baseline.
+
 Data comes from the ``eda`` group: each ``columns[i]['categorical']`` is the
 output of ``summarize_categorical`` (``top[{value,count,pct}]``, ``mode``,
 ``n_distinct``, ``entropy``, ``imbalance``, ``len_min/mean/max``). The derived
-cardinality metrics and the pie figure are delegated to two registry functions
-(``categorical_cardinality_block`` and ``categorical_top_pie_figure``); both are
+cardinality metrics and the bar figure are delegated to two registry functions
+(``categorical_cardinality_block`` and ``categorical_top_bar_figure``); both are
 imported lazily and degrade to a minimal inline fallback so this chapter never
 raises even if they are unavailable.
 
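Review note: the docstring above says both registry functions are imported lazily and degrade to an inline fallback. A minimal sketch of that pattern follows, assuming a hypothetical module name (`hypothetical_registry_module`), a hypothetical function name (`build_figure`) and a hypothetical `minimal_fallback` helper; none of these are part of this commit.

```python
import importlib


def minimal_fallback(top):
    """Hypothetical stand-in for the inline fallback: just count the items."""
    return {"source": "fallback", "n_items": len(top or [])}


def make_lazy(top, module_name="hypothetical_registry_module",
              func_name="build_figure"):
    """Return a zero-arg callable; the import only happens when it is called."""

    def make():
        try:
            module = importlib.import_module(module_name)
            return getattr(module, func_name)(top=top)
        except Exception:  # broad on purpose: the caller must never see a raise
            return minimal_fallback(top)

    return make


# The module does not exist, so calling the closure takes the fallback path.
result = make_lazy([{"value": "a", "count": 1}])()
```

Deferring the import into the closure means a missing dependency only matters at render time, and even then the caller still receives a usable object.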
@@ -39,10 +43,21 @@ import math
 
 from .. import model
 
-CHAPTER_VERSION = "1.2.0"
+CHAPTER_VERSION = "1.3.0"
 CHAPTER_ID = "cat_distr"
 CHAPTER_TITLE = "Distribuciones categóricas"
 
+# Key under which eda_llm_insights stores its interpretive block in the profile.
+LLM_KEY = "llm"
+
+# Second glossary term this chapter names: "how each categorical page is laid
+# out". The long paragraph that used to describe it inline in the intro now lives
+# in the GLOSARIO chapter (canonical definition in ``glosario._BASELINE_TERMS``);
+# the intro only names the clickable term, relocating the explanation, not losing
+# it. The chapter only needs to register key+label here.
+_TERM_PAGINA_KEY = "pagina_categorica"
+_TERM_PAGINA_LABEL = "Cómo se organiza cada página categórica"
+
 # Glossary term this chapter explains. Registered in the shared collector and
 # marked clickable on its first appearance (end-to-end glossary example —
 # mejora 6). Other chapters hook their own terms the same way (see the contract).
@@ -59,14 +74,14 @@ _TERM_ENTROPIA_DEF = (
 # Cap the number of categorical columns rendered to keep the document bounded;
 # the rest are summarized in a closing note (no silent truncation).
 MAX_COLS = 40
-# Rows shown in each top-k table and explicit slices in the pie. Kept moderate so
-# the whole column — cardinality table + top-k table + donut — fits on ONE
+# Rows shown in each top-k table and explicit bars in the chart. Kept moderate so
+# the whole column — cardinality table + top-k table + bar chart — fits on ONE
 # page/slide with the chart next to its tables; the table note still reports
 # "top N of M" so nothing is silently hidden. For id-like columns (≈100%
 # distinct) the top-k table is dropped entirely (it would be a list of unique
-# values — pure noise), which also frees the room the donut needs (see build).
+# values — pure noise), which also frees the room the chart needs (see build).
 TOP_TABLE_ROWS = 8
-PIE_TOP_K = 6
+CHART_TOP_K = 6
 # Truncate very long category labels in tables (the renderer also wraps). Kept
 # tight so a column with long id-like values (names, tickets) still fits its page.
 LABEL_MAX = 28
@@ -208,26 +223,74 @@ def _fallback_cardinality(cat: dict, n_rows) -> dict:
     }
 
 
-def _pie_make(top, n_distinct, title, n_rows):
-    """Return a zero-arg callable that builds the donut figure lazily."""
+def _llm_index(profile: dict, ctx: dict) -> dict:
+    """Map column name -> its LLM dictionary entry (description/unit/...).
+
+    Reads the ``llm.dictionary`` list that ``eda_llm_insights`` stored in the
+    profile (``profile['llm']``; falls back to ``ctx['llm']``). Returns an empty
+    dict when ``run_llm`` did not run, so the caller degrades cleanly. Fully
+    defensive: never raises on malformed input.
+    """
+    llm = profile.get(LLM_KEY)
+    if not isinstance(llm, dict):
+        llm = ctx.get(LLM_KEY)
+    if not isinstance(llm, dict):
+        return {}
+    entries = llm.get("dictionary")
+    if not isinstance(entries, (list, tuple)):
+        return {}
+    index: dict = {}
+    for e in entries:
+        if not isinstance(e, dict):
+            continue
+        col = e.get("column")
+        if col is None:
+            continue
+        index[model._safe_str(col)] = e
+    return index
+
+
+def _llm_desc_unit_block(name: str, llm_index: dict):
+    """Markdown block with the LLM business description + unit of a column, or
+    None when no LLM entry matches the column (clean fallback without LLM)."""
+    entry = llm_index.get(model._safe_str(name))
+    if not isinstance(entry, dict):
+        return None
+    raw_desc = entry.get("description") or entry.get("business_meaning")
+    desc = " ".join(model._safe_str(raw_desc).split()) if raw_desc else ""
+    raw_unit = entry.get("unit")
+    unit = " ".join(model._safe_str(raw_unit).split()) if raw_unit else ""
+    parts = []
+    if desc:
+        parts.append(f"**Descripción:** {desc}")
+    if unit:
+        parts.append(f"**Unidad:** {unit}")
+    if not parts:
+        return None
+    return model.Markdown(text=" · ".join(parts))
+
+
+def _bar_make(top, n_distinct, title, n_rows):
+    """Return a zero-arg callable that builds the bar figure lazily."""
 
     def make():
         try:
-            from datascience.categorical_top_pie_figure import (
-                categorical_top_pie_figure,
+            from datascience.categorical_top_bar_figure import (
+                categorical_top_bar_figure,
             )
 
-            return categorical_top_pie_figure(
+            return categorical_top_bar_figure(
                 top=top, n_distinct=n_distinct or 0, title=title,
-                top_k=PIE_TOP_K, n_rows=n_rows)
+                top_k=CHART_TOP_K, n_rows=n_rows)
         except Exception:  # noqa: BLE001 — minimal local fallback figure.
-            return _fallback_pie(top, title)
+            return _fallback_bar(top, title)
 
     return make
 
 
-def _fallback_pie(top, title):
-    """Minimal donut figure used only if the registry function is unavailable."""
+def _fallback_bar(top, title):
+    """Minimal horizontal-bar figure used only if the registry function is
+    unavailable. Largest category on top, the rest folded into "Otros"."""
     import matplotlib
 
     matplotlib.use("Agg")
@@ -238,8 +301,8 @@ def _fallback_pie(top, title):
     items = [t for t in (top or [])
              if isinstance(t, dict) and isinstance(t.get("count"), (int, float))]
     items = sorted(items, key=lambda t: t.get("count") or 0, reverse=True)
-    head = items[:PIE_TOP_K]
-    rest = items[PIE_TOP_K:]
+    head = items[:CHART_TOP_K]
+    rest = items[CHART_TOP_K:]
     labels = [_truncate(t.get("value"), 20) for t in head]
     sizes = [float(t.get("count") or 0) for t in head]
     if rest:
@@ -249,10 +312,13 @@ def _fallback_pie(top, title):
         ax.text(0.5, 0.5, "sin datos categóricos", ha="center", va="center")
         ax.axis("off")
         return fig
-    ax.pie(sizes, labels=None, wedgeprops={"width": 0.42},
-           autopct=lambda p: f"{p:.0f}%" if p >= 4 else "")
-    ax.legend(labels, loc="center left", bbox_to_anchor=(1.0, 0.5),
-              fontsize=7, frameon=False)
+    # barh draws bottom-up, so reverse to put the largest category on top.
+    y_pos = range(len(labels))
+    ax.barh(list(y_pos), list(reversed(sizes)), color="#4C72B0",
+            edgecolor="white")
+    ax.set_yticks(list(y_pos))
+    ax.set_yticklabels(list(reversed(labels)), fontsize=7)
+    ax.set_xlabel("conteo", fontsize=8)
     ax.set_title(_truncate(title, 40))
     fig.tight_layout()
     return fig
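Review note: the data preparation behind the fallback bar chart (sort, keep the top k, fold the tail into "Otros", then reverse for `barh`) can be sketched without matplotlib. The body of the `if rest:` branch is not visible in this diff, so summing the tail into a single "Otros" bucket is an assumption taken from the docstring; `fold_top_k` and `barh_order` are hypothetical names.

```python
def fold_top_k(items, top_k=6, other_label="Otros"):
    """Sort by count, keep the top_k largest and fold the rest into one bucket."""
    valid = [t for t in (items or [])
             if isinstance(t, dict) and isinstance(t.get("count"), (int, float))]
    valid.sort(key=lambda t: t["count"], reverse=True)
    head, rest = valid[:top_k], valid[top_k:]
    labels = [str(t.get("value")) for t in head]
    sizes = [float(t["count"]) for t in head]
    if rest:
        # Assumption: the tail is summed into a single trailing bucket.
        labels.append(other_label)
        sizes.append(float(sum(t["count"] for t in rest)))
    return labels, sizes


def barh_order(labels, sizes):
    """A horizontal bar chart puts index 0 at the bottom, so reverse both lists."""
    return list(reversed(labels)), list(reversed(sizes))


top = [{"value": "b", "count": 3}, {"value": "a", "count": 5},
       {"value": "c", "count": 1}, {"value": "bad", "count": None}]
labels, sizes = fold_top_k(top, top_k=2)
plot_labels, plot_sizes = barh_order(labels, sizes)
```

With this ordering the largest category ends up as the top bar and the "Otros" bucket as the bottom one.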
@@ -375,17 +441,19 @@ def _topk_table(cat: dict):
 
 def _intro_blocks(n_rows, mark_term: bool = False):
     total = _fmt_int(n_rows)
-    # Mark the first appearance of the term as a clickable glossary jump when the
-    # term was registered (mark_term). The full definition of entropy lives in the
-    # GLOSARIO chapter, so the intro only names the clickable term here instead of
-    # repeating the long explanation (avoids the redundancy with the glossary).
+    # Mark the first appearance of each term as a clickable glossary jump when the
+    # terms were registered (mark_term). The full definition of the entropy term
+    # AND of how each categorical page is laid out live in the GLOSARIO chapter, so
+    # the intro only names the clickable terms instead of repeating the long
+    # explanation (avoids the redundancy with the glossary).
     entropia = ("[[term:entropia]]entropía[[/term]]" if mark_term
                 else "entropía")
+    pagina = ("[[term:pagina_categorica]]cómo se organiza cada página[[/term]]"
+              if mark_term else "cómo se organiza cada página")
     text = (
-        f"Cada columna categórica ocupa su propia página: sus métricas de "
-        f"cardinalidad —incluida la {entropia}—, una nota que señala cardinalidad "
-        "problemática, la tabla de las categorías más frecuentes y un gráfico de "
-        "tarta (donut) de las más comunes, todo junto."
+        f"Cada columna categórica ocupa su propia página — {pagina}: "
+        f"cardinalidad (incluida la {entropia}), top de categorías y un gráfico "
+        "de barras de las más comunes."
     )
     if n_rows is not None:
         text += f" El dataset tiene {total} filas en total como referencia."
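Review note: the intro marks glossary terms with the `[[term:key]]label[[/term]]` markup seen in this hunk. The renderer that consumes it is not part of this diff, so the following is only a sketch of how such markers could be located or stripped. `TERM_RE`, `find_terms` and `strip_terms` are hypothetical names, and the key pattern (lowercase letters, digits, underscores) is an assumption based on the two keys used here.

```python
import re

# Hypothetical pattern: a key made of lowercase letters, digits and underscores,
# then the visible label up to the closing marker (non-greedy).
TERM_RE = re.compile(r"\[\[term:([a-z0-9_]+)\]\](.*?)\[\[/term\]\]")


def find_terms(text):
    """Return (key, visible label) pairs in order of appearance."""
    return [(m.group(1), m.group(2)) for m in TERM_RE.finditer(text)]


def strip_terms(text):
    """Drop the markers and keep only the visible labels."""
    return TERM_RE.sub(r"\2", text)


sample = ("la [[term:entropia]]entropía[[/term]] y "
          "[[term:pagina_categorica]]cómo se organiza cada página[[/term]]")
found = find_terms(sample)
plain = strip_terms(sample)
```

Stripping the markers yields the same text the intro emits when `mark_term` is false, which is a quick way to check that the two variants stay in sync.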
@@ -406,47 +474,59 @@ def build_cat_distr(profile: dict, ctx: dict):
         return None
 
     n_rows = profile.get("n_rows")
-    # Register "entropía" in the shared glossary collector (if present) and mark
-    # its first appearance clickable. End-to-end glossary example (mejora 6).
+    # Register "entropía" and the "how each categorical page is laid out" term in
+    # the shared glossary collector (if present) and mark their first appearance
+    # clickable. End-to-end glossary example (mejora 6).
     glossary = ctx.get("glossary")
     mark_term = False
     if isinstance(glossary, model.GlossaryCollector):
         glossary.add(_TERM_ENTROPIA_KEY, _TERM_ENTROPIA_LABEL,
                      _TERM_ENTROPIA_DEF)
+        glossary.add(_TERM_PAGINA_KEY, _TERM_PAGINA_LABEL)
         mark_term = True
     blocks = list(_intro_blocks(n_rows, mark_term=mark_term))
 
+    # Business description + unit per column come from the LLM dictionary
+    # (profile['llm']['dictionary'], matched by column name); absent without
+    # run_llm, in which case the per-column description block is simply omitted.
+    llm_index = _llm_index(profile, ctx)
+
     rendered = cat_cols[:MAX_COLS]
     for idx, col in enumerate(rendered):
         name = col.get("name") or "(columna)"
         cat = col.get("categorical") or {}
         card = _normalize_card(_cardinality(cat, n_rows))
 
-        # One Group per categorical column: heading + cardinality table + flag
-        # note + top-k table + donut figure are kept together and the renderer
-        # starts each on a fresh page/slide (page_break_before) so every column
-        # gets its own page with its chart next to its tables. The first column
-        # may share the intro's page (no forced break) to avoid a near-empty page.
-        col_blocks = [
-            model.Heading(text=str(name), level=2),
-            _cardinality_block(card),
-        ]
+        # One Group per categorical column: heading + (optional) LLM description +
+        # cardinality table + flag note + top-k table + bar figure are kept
+        # together and the renderer starts each on a fresh page/slide
+        # (page_break_before) so every column gets its own page with its chart next
+        # to its tables. The first column may share the intro's page (no forced
+        # break) to avoid a near-empty page.
+        col_blocks = [model.Heading(text=str(name), level=2)]
+        desc_block = _llm_desc_unit_block(name, llm_index)
+        if desc_block is not None:
+            col_blocks.append(desc_block)
+        col_blocks.append(_cardinality_block(card))
         note = _flag_note(card)
         if note is not None:
             col_blocks.append(note)
         # For id-like columns (≈100% distinct) the top-k is a list of unique
         # values — pure noise; skip it (the flag note already explains why) and
-        # let the donut take that room so the whole column fits one page/slide.
+        # let the bar chart take that room so the whole column fits one page/slide.
         if not card.get("id_like"):
             topk = _topk_table(cat)
             if topk is not None:
                 col_blocks.append(topk)
         col_blocks.append(model.Figure(
-            make=_pie_make(cat.get("top") or [], card.get("n_distinct"),
+            make=_bar_make(cat.get("top") or [], card.get("n_distinct"),
                            str(name), n_rows),
             caption=(f"Categorías más comunes de «{_truncate(name, 32)}» "
-                     "(donut: top-k + «Otros»)")))
-        blocks.append(model.Group(blocks=col_blocks,
+                     "(barras: top-k + «Otros»)")))
+        # layout="side_by_side": in PPTX the cardinality table goes to the LEFT and
+        # the bar chart to the RIGHT of the same slide; the PDF renderer stacks it
+        # (the A5 mobile page is too narrow for two readable columns).
+        blocks.append(model.Group(blocks=col_blocks, layout="side_by_side",
                                   page_break_before=(idx > 0)))
 
     if len(cat_cols) > len(rendered):
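Review note: the per-column LLM lookup added in this file (index the dictionary by column name, then format description and unit) can be exercised in isolation. The sketch below is a simplified stand-in, not the code from the diff: it uses plain `str` instead of `model._safe_str`, returns text instead of a `model.Markdown` block, and leaves out the `ctx['llm']` and `business_meaning` fallbacks. `build_llm_index` and `describe_column` are hypothetical names.

```python
def build_llm_index(profile):
    """Map column name -> dictionary entry; empty dict when anything is missing."""
    llm = profile.get("llm") if isinstance(profile, dict) else None
    entries = llm.get("dictionary") if isinstance(llm, dict) else None
    if not isinstance(entries, (list, tuple)):
        return {}
    return {str(e["column"]): e for e in entries
            if isinstance(e, dict) and e.get("column") is not None}


def describe_column(name, index):
    """Return the 'Descripción · Unidad' text, or None if there is nothing to show."""
    entry = index.get(str(name))
    if not isinstance(entry, dict):
        return None
    # Collapse runs of whitespace so multi-line LLM output stays on one line.
    desc = " ".join(str(entry.get("description") or "").split())
    unit = " ".join(str(entry.get("unit") or "").split())
    parts = []
    if desc:
        parts.append(f"**Descripción:** {desc}")
    if unit:
        parts.append(f"**Unidad:** {unit}")
    return " · ".join(parts) if parts else None


profile = {"llm": {"dictionary": [
    {"column": "categoria", "description": "Familia de  producto",
     "unit": "categoría"},
    {"column": "uuid", "description": "Identificador único de registro",
     "unit": ""},
    "malformed entry",
]}}
index = build_llm_index(profile)
```

An empty unit is dropped rather than rendered as a dangling label, and a profile without an `llm` block yields an empty index, so every column falls through to `None`.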
@@ -2,12 +2,14 @@
 
 Self-contained: builds synthetic TableProfiles (no DuckDB) so the suite is fast
 and deterministic. Verifies that ``build_cat_distr`` emits the blocks the user
-asked for (distinct/total/%-distinct/unique metrics, top-k table and a donut
+asked for (distinct/total/%-distinct/unique metrics, top-k table and a bar
 figure), that EACH categorical column is wrapped in its own keep-together
-``Group`` that starts on a fresh page/slide (one column per page, chart next to
-its tables), that the long entropy explanation is NOT repeated inline (it lives
-in the glossary — only the clickable term is kept), that the chapter renders
-inside the full document to both PDF and PPTX showing that content, that a
+``Group`` laid out ``side_by_side`` (PPTX: table left / bars right) that starts on
+a fresh page/slide (one column per page, chart next to its tables), that the LLM
+business description + unit are shown per column when the profile carries an LLM
+block, that the long entropy / page-layout explanations are NOT repeated inline
+(they live in the glossary — only the clickable terms are kept), that the chapter
+renders inside the full document to both PDF and PPTX showing that content, that a
 profile with no categorical columns yields ``None`` without raising, and that
 long labels / many columns are never cut in either output.
 """
@@ -116,6 +118,10 @@ def test_golden_build_cat_distr_emite_bloques_pedidos():
     assert "log2" not in md.text  # redundant explanation removed.
     assert "máxima diversidad" not in md.text
 
+    # The donut/pie is gone: the intro no longer mentions tarta/donut (the chart
+    # is now a bar chart; the long page-layout explanation moved to the glossary).
+    assert "donut" not in md.text and "tarta" not in md.text
+
     # Per-column blocks are wrapped in keep-together Groups: flatten to inspect.
     flat = _flatten(ch.blocks)
     kv = next(b for b in flat if isinstance(b, KVTable))
@@ -128,11 +134,13 @@ def test_golden_build_cat_distr_emite_bloques_pedidos():
     assert any("Entropía" in lbl for lbl in labels)
     assert "únicos" in values and "%" in values
     assert "bits" in values and "norm" in values  # entropy + max + normalized.
-    # Top-k table + pie figure.
+    # Top-k table + bar figure.
     dt = next(b for b in flat if isinstance(b, DataTable))
     assert dt.header == ["Valor", "Conteo", "%"]
     assert any("neumaticos" in str(cell) for row in dt.rows for cell in row)
     assert any(isinstance(b, Figure) for b in flat)
+    # Each per-column Group is laid out side_by_side (table left / bars right).
+    assert all(g.layout == "side_by_side" for g in _column_groups(ch))
     # id-like column flagged with a Note that also explains the top-k is dropped.
     idnote = next((b for b in flat
                    if isinstance(b, Note) and "identificador" in b.text), None)
@@ -140,9 +148,9 @@ def test_golden_build_cat_distr_emite_bloques_pedidos():
     assert "No se lista el top" in idnote.text
 
 
-def test_golden_idlike_omite_topk_y_conserva_donut():
+def test_golden_idlike_omite_topk_y_conserva_grafico():
     # The id-like column (uuid, 100% distinct) must NOT carry a top-k DataTable
-    # (it would be a list of unique values), but must still keep its donut Figure
+    # (it would be a list of unique values), but must still keep its bar Figure
     # and its cardinality table so it stays a full per-column page.
     ch = build_cat_distr(_profile(), {})
     groups = _column_groups(ch)
@@ -151,7 +159,7 @@ def test_golden_idlike_omite_topk_y_conserva_donut():
     kinds = [b.kind for b in uuid_group.blocks]
     assert "data_table" not in kinds  # top-k of unique values dropped.
     assert "kv_table" in kinds  # cardinality kept.
-    assert "figure" in kinds  # donut kept (chart per column).
+    assert "figure" in kinds  # bar chart kept (chart per column).
     # A non-id-like column keeps its top-k table.
     cat_group = next(g for g in groups
                      if any(getattr(b, "text", "") == "categoria"
@@ -205,7 +213,7 @@ def test_golden_render_pdf_una_pagina_por_columna():
     assert "Entrop" in txt
     assert "distintos" in txt
     assert "categoria" in txt and "neumaticos" in txt
-    assert "donut" in txt  # figure caption rendered as text.
+    assert "barras" in txt  # bar-chart caption rendered as text (PDF).
     assert "identificador" in txt  # id-like note rendered.
 
 
@@ -258,9 +266,11 @@ def _profile_high_card() -> dict:
 
 
 def test_golden_pptx_una_slide_por_columna_con_su_grafico():
-    """Each categorical column occupies EXACTLY ONE cat_distr slide that carries
-    BOTH its cardinality table and its donut figure (picture) — i.e. the chart is
-    never separated from its table, even for a high-cardinality column."""
+    """Cada columna categórica ocupa EXACTAMENTE UN slide cat_distr que lleva su
+    gráfico (picture) en la misma slide — el chart nunca se separa de su columna,
+    ni siquiera para una columna de alta cardinalidad. Con layout side_by_side la
+    tabla se rasteriza a imagen, así que la comprobación se hace por presencia de
+    picture (no por el texto de la tabla)."""
     from pptx.enum.shapes import MSO_SHAPE_TYPE
 
     prof = _profile_high_card()
@@ -272,7 +282,7 @@ def test_golden_pptx_una_slide_por_columna_con_su_grafico():
         prs = Presentation(out)
 
     # Per column: the cat_distr slides whose text mentions it, and whether the
-    # owning slide also has the donut caption + an actual picture shape.
+    # owning slide also carries an actual picture shape (its chart).
     slides_with_col = {n: [] for n in cat_names}
     owner_has_chart = {n: False for n in cat_names}
     for i, sl in enumerate(prs.slides):
@@ -288,15 +298,106 @@ def test_golden_pptx_una_slide_por_columna_con_su_grafico():
         for n in cat_names:
             if n in txt:
                 slides_with_col[n].append(i)
-                has_table = "Cardinalidad" in txt or "distintos" in txt
-                if has_pic and "donut" in txt and has_table:
+                if has_pic:
                     owner_has_chart[n] = True
 
     for n in cat_names:
         # Exactly one slide carries the column (not split across slides).
         assert len(slides_with_col[n]) == 1, (n, slides_with_col[n])
-        # That single slide also holds its table AND its donut picture.
-        assert owner_has_chart[n], (n, "tabla y donut no están en el mismo slide")
+        # That single slide also holds its chart picture.
+        assert owner_has_chart[n], (n, "el gráfico no está en el slide de la columna")
+
+
+def test_golden_pptx_columna_side_by_side_tabla_izq_barra_der():
+    """Con layout side_by_side, una columna categórica coloca su tabla de
+    cardinalidad (imagen) en la mitad izquierda y su gráfico de barras (imagen) en
+    la mitad derecha de la MISMA slide. Verifica que al menos una columna queda en
+    dos columnas (tabla-izq / barras-der), evidencia del side_by_side en PPTX."""
+    from pptx.enum.shapes import MSO_SHAPE_TYPE
+    from pptx.util import Inches
+
+    with tempfile.TemporaryDirectory() as d:
+        out = os.path.join(d, "eda.pptx")
+        render_automatic_eda_pptx(_profile(), out, {"title": "EDA"})
+        prs = Presentation(out)
+    centre = int(Inches(13.333 / 2.0))  # half of the 16:9 slide width.
+    two_col_slides = 0
+    for sl in prs.slides:
+        texts, lefts = [], []
+        for sh in sl.shapes:
+            if sh.has_text_frame:
+                texts.append(sh.text_frame.text)
+            if (sh.shape_type == MSO_SHAPE_TYPE.PICTURE
+                    and sh.left is not None):
+                lefts.append(sh.left)
+        txt = re.sub(r"\s+", " ", " ".join(texts))
+        if "Distribuciones categ" not in txt:
+            continue
+        # One picture starts in the left half, another in the right half.
+        if len(lefts) >= 2 and min(lefts) < centre and max(lefts) > centre:
+            two_col_slides += 1
+    assert two_col_slides >= 1, (
+        "ninguna columna quedó con tabla-izq / barras-der (side_by_side)")
+
+
+def _profile_with_llm() -> dict:
+    """The base profile plus an ``llm`` block (as eda_llm_insights would store it
+    with run_llm=True): a data dictionary with description/unit per column."""
+    prof = _profile()
+    prof["llm"] = {
+        "dictionary": [
+            {"column": "categoria",
+             "description": "Familia de producto del recambio",
+             "business_meaning": "Agrupa el catálogo por tipo de pieza",
+             "unit": "categoría"},
+            {"column": "uuid",
+             "description": "Identificador único de registro",
+             "unit": ""},
+        ],
+    }
+    return prof
+
+
+def test_llm_descripcion_y_unidad_por_columna():
+    # With an LLM dictionary, each categorical column whose name matches shows its
+    # business description and unit in a per-column markdown block.
+    ch = build_cat_distr(_profile_with_llm(), {})
+    groups = _column_groups(ch)
+    cat_group = next(g for g in groups
+                     if any(getattr(b, "text", "") == "categoria"
+                            for b in g.blocks))
+    md = " ".join(b.text for b in cat_group.blocks
+                  if getattr(b, "kind", "") == "markdown")
+    assert "Descripción" in md and "Familia de producto" in md
+    assert "Unidad" in md and "categoría" in md
+
+
+def test_edge_sin_llm_no_anade_descripcion():
+    # Without an LLM block the per-column description markdown is simply omitted;
+    # the column still renders its cardinality table and bar figure.
+    ch = build_cat_distr(_profile(), {})
+    for g in _column_groups(ch):
+        mds = [b.text for b in g.blocks if getattr(b, "kind", "") == "markdown"]
+        assert not any("Descripción" in t for t in mds)
+
+
+def test_pagina_categorica_clicable_y_definicion_en_glosario():
+    # The "how each categorical page is laid out" term is registered + marked
+    # clickable in the intro, and its full definition lands in the glossary
+    # chapter (canonical baseline catalog), not inline.
+    from datascience.automatic_eda.chapters.glosario import build_glosario
+
+    gc = GlossaryCollector()
+    ch = build_cat_distr(_profile(), {"glossary": gc})
+    md = next(b for b in ch.blocks if isinstance(b, Markdown))
+    assert "[[term:pagina_categorica]]" in md.text
+    assert gc.has("pagina_categorica")
+    glos = build_glosario(_profile(), {"glossary": gc})
+    entry = next(b for b in glos.blocks
+                 if getattr(b, "kind", "") == "glossary_entry"
+                 and b.key == "pagina_categorica")
+    assert "barras" in entry.definition
+    assert "identificador" in entry.definition
 
 
 def test_edge_sin_categoricas_devuelve_none():
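Review note: the new side_by_side test decides "two columns" purely from the left edges of the picture shapes relative to the slide centre. The arithmetic can be checked without python-pptx, given that the library measures lengths in EMU (914400 per inch) and that `Inches(x)` truncates `x * 914400` to an integer. `slide_centre_emu` and `is_two_column` are hypothetical helper names; the 13.333 in width comes from the test itself.

```python
EMU_PER_INCH = 914400  # English Metric Units per inch


def slide_centre_emu(slide_width_in=13.333):
    """Horizontal centre of the slide in EMU (truncated, like int(Inches(...)))."""
    return int(slide_width_in / 2.0 * EMU_PER_INCH)


def is_two_column(picture_lefts, centre):
    """True when one picture starts left of centre and another right of it."""
    return (len(picture_lefts) >= 2
            and min(picture_lefts) < centre
            and max(picture_lefts) > centre)


centre = slide_centre_emu()
table_left = int(0.5 * EMU_PER_INCH)   # picture starting near the left margin
chart_left = int(7.0 * EMU_PER_INCH)   # picture starting in the right half
```

Because only left edges are compared, two pictures stacked in the left half fail the check, which is what separates a stacked layout from a side-by-side one. A picture that starts exactly at the centre counts for neither half.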
@@ -17,10 +17,63 @@ from __future__ import annotations
 
 from .. import model
 
-CHAPTER_VERSION = "1.0.0"
+CHAPTER_VERSION = "1.1.0"
 CHAPTER_ID = "glosario"
 CHAPTER_TITLE = "Glosario"
 
+# Canonical definitions for cross-cutting terms — the "how to read it" entries
+# that do not belong to a single chapter. A chapter only needs to *register* the
+# term (``ctx['glossary'].add(key, label)``) and mark its in-text appearance with
+# ``[[term:key]]…[[/term]]``; this chapter supplies the full definition here when
+# the collector carries the term without one. Keeping the prose in a single place
+# avoids repeating a long paragraph inline in every chapter that names the term
+# (the explanation moved out of the NUM DISTR and CAT DISTR intros lives here).
+_BASELINE_TERMS = {
+    "histograma_boxplot": {
+        "label": "Cómo leer el histograma y el boxplot",
+        "definition": (
+            "Para cada columna numérica se muestra su histograma con tres líneas "
+            "de referencia: la media (línea roja discontinua), la mediana (línea "
+            "verde continua) y la banda ±1σ (zona sombreada que cubre una "
+            "desviación estándar a cada lado de la media). Debajo, alineado al "
+            "mismo eje horizontal, un boxplot de Tukey: la caja abarca del primer "
+            "al tercer cuartil (P25–P75), la línea interior es la mediana y los "
+            "bigotes llegan hasta 1,5·IQR; los puntos rojos señalan que hay "
+            "valores más allá de las vallas (posibles atípicos). Comparar la media "
+            "con la mediana revela la asimetría: si la media supera a la mediana la "
+            "cola larga cae hacia los valores altos (asimetría a la derecha), y al "
+            "revés hacia los bajos."),
+    },
+    "pagina_categorica": {
+        "label": "Cómo se organiza cada página categórica",
+        "definition": (
+            "Cada columna categórica ocupa su propia página: muestra sus métricas "
+            "de cardinalidad —incluida la entropía—, una nota que señala "
+            "cardinalidad problemática (columnas que se comportan como "
+            "identificador, con casi todos los valores distintos, o dominadas por "
+            "una sola categoría), la tabla de las categorías más frecuentes (top-k, "
+            "con su conteo y porcentaje) y un gráfico de barras de las categorías "
+            "más comunes (top-k más una barra «Otros» que agrupa la cola). El total "
+            "de filas del dataset se usa como referencia para interpretar los "
+            "conteos."),
+    },
+}
+
+
+def _resolve_term(term: dict) -> tuple:
+    """Return (label, definition) for a collected term, completing a missing
+    definition (and, if absent, the label) from the canonical baseline catalog."""
+    key = model._safe_str(term.get("key"))
+    label = model._safe_str(term.get("label"))
+    definition = model._safe_str(term.get("definition"))
+    base = _BASELINE_TERMS.get(key)
+    if base:
+        if not definition.strip():
+            definition = model._safe_str(base.get("definition"))
+        if not label.strip() or label == key:
+            label = model._safe_str(base.get("label")) or label
+    return label, definition
+
 
 def build_glosario(profile: dict, ctx: dict):
     """Build the glossary Chapter from the shared collector, or None if empty."""
@@ -36,12 +89,14 @@ def build_glosario(profile: dict, ctx: dict):
             "Cada término va resaltado en el texto y, al pulsarlo, salta a su "
             "definición en esta sección.")),
     ]
-    # One clickable destination per term, alphabetically by visible label.
+    # One clickable destination per term, alphabetically by visible label. A term
+    # registered without a definition is completed from the canonical baseline.
     for term in glossary.terms(by="label"):
+        label, definition = _resolve_term(term)
         blocks.append(model.GlossaryEntry(
             key=model._safe_str(term.get("key")),
-            label=model._safe_str(term.get("label")),
-            definition=model._safe_str(term.get("definition"))))
+            label=label,
+            definition=definition))

     return model.Chapter(id=CHAPTER_ID, title=CHAPTER_TITLE,
                          version=CHAPTER_VERSION, blocks=blocks)
@@ -35,10 +35,21 @@ try:
 except Exception:  # noqa: BLE001 — keep the chapter importable no matter what.
     build_boxplot_stats = None  # type: ignore[assignment]

-CHAPTER_VERSION = "1.2.0"
+CHAPTER_VERSION = "1.3.0"
 CHAPTER_ID = "num_distr"
 CHAPTER_TITLE = "Distribuciones numéricas"

+# Glossary term this chapter explains. The long "how to read the histogram and
+# the boxplot" paragraph used to live inline in the intro; it now lives in the
+# GLOSARIO chapter (canonical definition in ``glosario._BASELINE_TERMS``) and the
+# intro only names the clickable term — one click jumps to the full explanation,
+# so the information is relocated, not lost (mejora glosario).
+_TERM_HISTOBOX_KEY = "histograma_boxplot"
+_TERM_HISTOBOX_LABEL = "Cómo leer el histograma y el boxplot"
+
+# Key under which eda_llm_insights stores its interpretive block in the profile.
+LLM_KEY = "llm"
+
 # Plain-Spanish gloss for every label ``detect_distribution_type`` can emit, so a
 # non-expert reader understands the shape and the suggested next step (MUST-4.3).
 _DIST_GLOSS = {
@@ -99,6 +110,53 @@ def _numeric_columns(profile: dict) -> list:
     return out


+def _llm_index(profile: dict, ctx: dict) -> dict:
+    """Map column name -> its LLM dictionary entry (description/unit/...).
+
+    Reads the ``llm.dictionary`` list that ``eda_llm_insights`` stored in the
+    profile (``profile['llm']``; falls back to ``ctx['llm']``). Returns an empty
+    dict when ``run_llm`` did not run, so the caller degrades cleanly. Fully
+    defensive: never raises on malformed input.
+    """
+    llm = profile.get(LLM_KEY)
+    if not isinstance(llm, dict):
+        llm = ctx.get(LLM_KEY)
+    if not isinstance(llm, dict):
+        return {}
+    entries = llm.get("dictionary")
+    if not isinstance(entries, (list, tuple)):
+        return {}
+    index: dict = {}
+    for e in entries:
+        if not isinstance(e, dict):
+            continue
+        col = e.get("column")
+        if col is None:
+            continue
+        index[model._safe_str(col)] = e
+    return index
+
+
+def _llm_desc_unit_block(name: str, llm_index: dict):
+    """Markdown block with the LLM business description + unit of a column, or
+    None when no LLM entry matches the column (clean fallback without LLM)."""
+    entry = llm_index.get(model._safe_str(name))
+    if not isinstance(entry, dict):
+        return None
+    raw_desc = entry.get("description") or entry.get("business_meaning")
+    desc = " ".join(model._safe_str(raw_desc).split()) if raw_desc else ""
+    raw_unit = entry.get("unit")
+    unit = " ".join(model._safe_str(raw_unit).split()) if raw_unit else ""
+    parts = []
+    if desc:
+        parts.append(f"**Descripción:** {desc}")
+    if unit:
+        parts.append(f"**Unidad:** {unit}")
+    if not parts:
+        return None
+    return model.Markdown(text=" · ".join(parts))
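The lookup-then-render flow of these two helpers can be exercised without the `model` package. The sketch below is an approximation under stated assumptions: `llm_index` and `desc_unit_text` are hypothetical names, `str()` stands in for `model._safe_str`, the `ctx` fallback is omitted, and the block is returned as a plain string instead of a `model.Markdown`.

```python
def llm_index(profile):
    """Map column name -> LLM dictionary entry; empty when there is no LLM block."""
    llm = profile.get("llm")
    entries = llm.get("dictionary") if isinstance(llm, dict) else None
    if not isinstance(entries, (list, tuple)):
        return {}
    return {str(e["column"]): e for e in entries
            if isinstance(e, dict) and e.get("column") is not None}


def desc_unit_text(name, index):
    """Return the 'Descripción · Unidad' line for a column, or None if absent."""
    entry = index.get(str(name))
    if not isinstance(entry, dict):
        return None
    raw_desc = entry.get("description") or entry.get("business_meaning")
    # split() + join collapses newlines and repeated spaces from the LLM output.
    desc = " ".join(str(raw_desc).split()) if raw_desc else ""
    raw_unit = entry.get("unit")
    unit = " ".join(str(raw_unit).split()) if raw_unit else ""
    parts = []
    if desc:
        parts.append(f"**Descripción:** {desc}")
    if unit:
        parts.append(f"**Unidad:** {unit}")
    return " · ".join(parts) if parts else None


profile = {"llm": {"dictionary": [
    {"column": "precio", "description": "Precio de  venta\ndel producto",
     "unit": "EUR"},
    {"column": "alcohol", "business_meaning": "Grado alcohólico"},
]}}
index = llm_index(profile)
precio_line = desc_unit_text("precio", index)
alcohol_line = desc_unit_text("alcohol", index)
missing_line = desc_unit_text("ph", index)
```

Returning `None` for an unmatched column is what lets the chapter skip the description block when the LLM step did not run.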
+
+
 def _make_hist_box(name: str, numeric: dict, box: dict):
     """Build the histogram (with mean/median/±σ lines) + boxplot figure.

@@ -271,15 +329,26 @@ def build_num_distr(profile: dict, ctx: dict):
     if not numerics:
         return None  # chapter does not apply to a dataset with no numerics.

+    # Register the "how to read the histogram and boxplot" term in the shared
+    # glossary collector (if present) and mark its first appearance clickable. The
+    # full explanation (colour code, 1,5·IQR rule, asymmetry reading) lives in the
+    # GLOSARIO chapter instead of inline here: the intro only names the term.
+    glossary = ctx.get("glossary")
+    mark_term = False
+    if isinstance(glossary, model.GlossaryCollector):
+        glossary.add(_TERM_HISTOBOX_KEY, _TERM_HISTOBOX_LABEL)
+        mark_term = True
+    como_leer = ("[[term:histograma_boxplot]]cómo leer estos gráficos[[/term]]"
+                 if mark_term else "cómo leer estos gráficos")
     intro = (
-        "Para cada columna numérica se muestra su **histograma** con tres líneas "
-        "de referencia: la **media** (línea roja discontinua), la **mediana** "
-        "(línea verde continua) y la banda **±1σ** (zona sombreada). Debajo, "
-        "alineado al mismo eje, un **boxplot de Tukey**: la caja abarca del "
-        "primer al tercer cuartil (P25–P75), la línea interior es la mediana y "
-        "los bigotes llegan hasta 1,5·IQR; los puntos rojos señalan que hay "
-        "valores más allá de las vallas. Comparar media y mediana revela la "
-        "asimetría de la distribución.")
+        "Cada columna numérica muestra su **histograma** (con la **media**, la "
+        "**mediana** y la banda **±1σ**) y, debajo y al mismo eje, su **boxplot "
+        f"de Tukey** — {como_leer}.")

+    # Business description + unit per column come from the LLM dictionary
+    # (profile['llm']['dictionary'], matched by column name); absent without
+    # run_llm, in which case the per-column description block is simply omitted.
+    llm_index = _llm_index(profile, ctx)
+
     blocks = [
         model.Heading(text=CHAPTER_TITLE, level=1),
@@ -293,17 +362,20 @@ def build_num_distr(profile: dict, ctx: dict):
             box = build_boxplot_stats(numeric) or {}
         except Exception:  # noqa: BLE001 — degrade, never raise.
             box = {}
-        # Keep the column heading, its figure and its stats note together on the
-        # same page/slide (mejora 3 — keep-together): the renderers measure the
-        # whole Group and move it whole when it would not fit.
-        blocks.append(model.Group(blocks=[
-            model.Heading(text=str(name), level=2),
-            model.Figure(
-                make=_figure_maker(name, numeric, box),
-                caption=f"Distribución de «{name}» — histograma "
-                        f"(media/mediana/±σ) y boxplot."),
-            model.Markdown(text=_stats_note(name, numeric, box)),
-        ]))
+        # Keep the column heading, its (optional) LLM description, its figure and
+        # its stats note together on the same page/slide (mejora 3 —
+        # keep-together): the renderers measure the whole Group and move it whole
+        # when it would not fit.
+        col_blocks = [model.Heading(text=str(name), level=2)]
+        desc_block = _llm_desc_unit_block(name, llm_index)
+        if desc_block is not None:
+            col_blocks.append(desc_block)
+        col_blocks.append(model.Figure(
+            make=_figure_maker(name, numeric, box),
+            caption=f"Distribución de «{name}» — histograma "
+                    f"(media/mediana/±σ) y boxplot."))
+        col_blocks.append(model.Markdown(text=_stats_note(name, numeric, box)))
+        blocks.append(model.Group(blocks=col_blocks))

     return model.Chapter(id=CHAPTER_ID, title=CHAPTER_TITLE,
                          version=CHAPTER_VERSION, blocks=blocks)
@@ -101,7 +101,7 @@ def test_golden_chapter_estructura_y_bloques():


 def test_golden_media_mediana_sigma_y_boxplot_presentes():
-    # The intro documents the three reference lines and the Tukey boxplot; the
+    # The short intro names the three reference lines and the Tukey boxplot; the
     # per-column note carries the actual mean/median/σ numbers and the shape.
     ch = build_num_distr(_profile(n_numeric=1, extra_categorical=False), {})
     md_texts = " ".join(b.text for b in _flatten(ch.blocks)
@@ -110,10 +110,58 @@ def test_golden_media_mediana_sigma_y_boxplot_presentes():
     assert "±1σ" in md_texts or "σ" in md_texts
     assert "boxplot" in md_texts.lower()
     assert "Tukey" in md_texts
+    # The long "how to read it" explanation moved to the glossary: the colour-code
+    # / 1,5·IQR walkthrough is no longer inline in the chapter body.
+    assert "1,5·IQR" not in md_texts
+    assert "línea roja" not in md_texts
     # distribution_type gloss surfaced for the column (right-skewed preset).
     assert _DIST_GLOSS["right-skewed"].split(";")[0][:20] in md_texts


+def test_glosario_histograma_boxplot_clicable_y_definicion():
+    # With a glossary collector the intro marks the clickable term and the FULL
+    # explanation (the long paragraph removed from the body) lands in the glossary.
+    from datascience.automatic_eda.chapters.glosario import build_glosario
+
+    gc = model.GlossaryCollector()
+    prof = _profile(n_numeric=1, extra_categorical=False)
+    ch = build_num_distr(prof, {"glossary": gc})
+    intro = next(b for b in ch.blocks if b.kind == "markdown")
+    assert "[[term:histograma_boxplot]]" in intro.text
+    assert gc.has("histograma_boxplot")
+    glos = build_glosario(prof, {"glossary": gc})
+    entry = next(b for b in glos.blocks
+                 if getattr(b, "kind", "") == "glossary_entry"
+                 and b.key == "histograma_boxplot")
+    assert "boxplot" in entry.definition.lower()
+    assert "1,5·IQR" in entry.definition
+
+
+def test_llm_descripcion_y_unidad_por_columna():
+    # With an LLM dictionary, each numeric column whose name matches shows its
+    # business description and unit in a per-column markdown block.
+    prof = _profile(n_numeric=2)
+    prof["llm"] = {"dictionary": [
+        {"column": "precio", "description": "Precio de venta del producto",
+         "unit": "EUR"},
+        {"column": "alcohol", "business_meaning": "Grado alcohólico",
+         "unit": "% vol"},
+    ]}
+    ch = build_num_distr(prof, {})
+    md_all = " ".join(b.text for b in _flatten(ch.blocks)
+                      if b.kind == "markdown")
+    assert "Precio de venta" in md_all and "EUR" in md_all
+    assert "Grado alcohólico" in md_all and "% vol" in md_all
+
+
+def test_edge_sin_llm_no_anade_descripcion():
+    # Without an LLM block the per-column description markdown is simply omitted.
+    ch = build_num_distr(_profile(n_numeric=2), {})
+    md_all = " ".join(b.text for b in _flatten(ch.blocks)
+                      if b.kind == "markdown")
+    assert "Descripción" not in md_all
+
+
 def test_boxplot_stats_se_consumen_del_registry():
     # The chapter must feed build_boxplot_stats (group eda) and the resulting
     # box must carry the Tukey fences for the figure.