feat(eda): CAP4/CAP5 distributions: paragraphs moved to the glossary, LLM description + unit per column, donut→bars, PPT figure on the right

CAP4 num_distr:
- Moves the long introductory histogram/boxplot paragraph to the glossary (new clickable term "histograma_boxplot"); the chapter body only names the term with [[term:histograma_boxplot]], and the full explanation (color code, 1.5·IQR, how to read skewness) lives in the glossary entry. The information is relocated, not lost.
- Adds, for each numeric column, the LLM business description and the unit, read from profile['llm']['dictionary'] (matched by column name). Without an LLM block the description block is omitted cleanly.

CAP5 cat_distr:
- Moves the paragraph "Cada columna categórica ocupa su propia página..." to the glossary (new clickable term "pagina_categorica"); the intro only names the terms entropía and pagina_categorica.
- Adds the LLM description + unit per column (same source as CAP4).
- Replaces the donut/pie with a horizontal bar chart (new registry function categorical_top_bar_figure_py_datascience, with an input contract identical to the donut's so it is a direct swap) plus its inline bar fallback.
- Marks each column Group with layout="side_by_side": in PPTX the cardinality table sits on the left and the bar chart on the right; in PDF the blocks are stacked (narrow A5 page). The renderers are not touched; layout support already existed.

Glossary:
- Canonical catalog _BASELINE_TERMS with the definitions of the two new terms; when a term is registered without a definition, build_glosario fills it in from the catalog (chapters only register key + label).

Tests updated (donut→bars, side_by_side, LLM description/unit, glossary), and the new function ships with its own tests. Subsystem suite + acceptance are green.
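
The per-column description/unit lookup can be illustrated with a minimal standalone sketch. It assumes the dictionary shape used by the tests in this commit (a list of entries with `column`, `description`, `business_meaning` and `unit`). `index_by_column` and `describe` are hypothetical names chosen for illustration, not the chapter's actual helpers, and plain `str()` stands in for `model._safe_str`.

```python
def index_by_column(profile: dict) -> dict:
    """Map column name -> LLM dictionary entry; empty when there is no LLM block."""
    llm = profile.get("llm")
    if not isinstance(llm, dict):
        return {}
    entries = llm.get("dictionary")
    if not isinstance(entries, (list, tuple)):
        return {}
    # Skip malformed entries instead of raising (hypothetical simplification).
    return {str(e["column"]): e for e in entries
            if isinstance(e, dict) and e.get("column") is not None}


def describe(name: str, index: dict):
    """Return the 'description · unit' line for a column, or None if nothing applies."""
    entry = index.get(str(name))
    if not isinstance(entry, dict):
        return None
    raw_desc = entry.get("description") or entry.get("business_meaning")
    raw_unit = entry.get("unit")
    # Collapse internal whitespace so multi-line LLM output stays on one line.
    desc = " ".join(str(raw_desc).split()) if raw_desc else ""
    unit = " ".join(str(raw_unit).split()) if raw_unit else ""
    parts = []
    if desc:
        parts.append(f"**Descripción:** {desc}")
    if unit:
        parts.append(f"**Unidad:** {unit}")
    return " · ".join(parts) if parts else None


profile = {"llm": {"dictionary": [
    {"column": "categoria",
     "description": "Familia de producto del recambio",
     "unit": "categoría"},
    {"column": "uuid",
     "description": "Identificador único de registro",
     "unit": ""},
]}}
index = index_by_column(profile)
with_unit = describe("categoria", index)   # description and unit
no_unit = describe("uuid", index)          # empty unit is dropped
missing = describe("precio", index)        # column absent from the dictionary
no_llm = describe("categoria", index_by_column({}))  # profile without an LLM block
```

In this sketch a column with no matching entry, or a profile with no `llm` block, yields `None`, which mirrors the "omitted cleanly" behavior described above: the caller simply does not append a description block.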
2026-07-01 02:01:07 +02:00
parent 7f304adc9c
commit cab0fbf0a3
8 changed files with 901 additions and 98 deletions
@@ -5,28 +5,32 @@ page (PDF) / slide (PPTX)**: every column is wrapped in a keep-together
 ``model.Group`` with ``page_break_before=True`` (except the first, which may share
 the intro's page), so its chart sits next to its tables and no column is split.

-A short intro names the clickable **[[term:entropia]]entropía[[/term]]** term —
-the full definition lives in the GLOSARIO chapter, so it is NOT repeated inline
-here (one click jumps to the glossary entry). The intro also carries the dataset
-row total used as a comparison baseline.
+Per column the Group is laid out ``side_by_side`` (PPTX: cardinality table LEFT,
+chart RIGHT; PDF: stacked) and contains, in order:

-Per column the Group contains, in order:
-
-1. A cardinality key/value table: distinct values, ``% distinct`` (distinct /
+1. The column name plus, when the LLM layer ran, its business **description** and
+   **unit** (read from ``profile['llm']['dictionary']``, matched by column name).
+2. A cardinality key/value table: distinct values, ``% distinct`` (distinct /
   total rows), total dataset rows, singleton values (frequency 1), entropy with
   its theoretical maximum and the normalized ratio, mode, imbalance and
   string-length stats.
-2. A short note flagging problematic cardinality (id-like ≈100% distinct, or a
+3. A short note flagging problematic cardinality (id-like ≈100% distinct, or a
   single dominating category).
-3. A ``top-k`` table (value / count / %).
-4. A **donut pie chart** of the most common categories (top-k + an "Otros"
+4. A ``top-k`` table (value / count / %).
+5. A **horizontal bar chart** of the most common categories (top-k + an "Otros"
   bucket), drawn lazily so the renderers scale it to fit entirely.

+A short intro names the clickable **[[term:entropia]]entropía[[/term]]** and
+**[[term:pagina_categorica]]page-layout[[/term]]** terms — their full
+definitions live in the GLOSARIO chapter, so they are NOT repeated inline here
+(one click jumps to the glossary entry). The intro also carries the dataset row
+total used as a comparison baseline.
+
 Data comes from the ``eda`` group: each ``columns[i]['categorical']`` is the
 output of ``summarize_categorical`` (``top[{value,count,pct}]``, ``mode``,
 ``n_distinct``, ``entropy``, ``imbalance``, ``len_min/mean/max``). The derived
-cardinality metrics and the pie figure are delegated to two registry functions
-(``categorical_cardinality_block`` and ``categorical_top_pie_figure``); both are
+cardinality metrics and the bar figure are delegated to two registry functions
+(``categorical_cardinality_block`` and ``categorical_top_bar_figure``); both are
 imported lazily and degrade to a minimal inline fallback so this chapter never
 raises even if they are unavailable.

@@ -39,10 +43,21 @@ import math

 from .. import model

-CHAPTER_VERSION = "1.2.0"
+CHAPTER_VERSION = "1.3.0"
 CHAPTER_ID = "cat_distr"
 CHAPTER_TITLE = "Distribuciones categóricas"

+# Key under which eda_llm_insights stores its interpretive block in the profile.
+LLM_KEY = "llm"
+
+# Second glossary term this chapter names: "how each categorical page is laid
+# out". The long paragraph that used to describe it inline in the intro now lives
+# in the GLOSARIO chapter (canonical definition in ``glosario._BASELINE_TERMS``);
+# the intro only names the clickable term, relocating the explanation, not losing
+# it. The chapter only needs to register key+label here.
+_TERM_PAGINA_KEY = "pagina_categorica"
+_TERM_PAGINA_LABEL = "Cómo se organiza cada página categórica"
+
 # Glossary term this chapter explains. Registered in the shared collector and
 # marked clickable on its first appearance (end-to-end glossary example —
 # mejora 6). Other chapters hook their own terms the same way (see the contract).
@@ -59,14 +74,14 @@ _TERM_ENTROPIA_DEF = (
 # Cap the number of categorical columns rendered to keep the document bounded;
 # the rest are summarized in a closing note (no silent truncation).
 MAX_COLS = 40
-# Rows shown in each top-k table and explicit slices in the pie. Kept moderate so
-# the whole column — cardinality table + top-k table + donut — fits on ONE
+# Rows shown in each top-k table and explicit bars in the chart. Kept moderate so
+# the whole column — cardinality table + top-k table + bar chart — fits on ONE
 # page/slide with the chart next to its tables; the table note still reports
 # "top N of M" so nothing is silently hidden. For id-like columns (≈100%
 # distinct) the top-k table is dropped entirely (it would be a list of unique
-# values — pure noise), which also frees the room the donut needs (see build).
+# values — pure noise), which also frees the room the chart needs (see build).
 TOP_TABLE_ROWS = 8
-PIE_TOP_K = 6
+CHART_TOP_K = 6
 # Truncate very long category labels in tables (the renderer also wraps). Kept
 # tight so a column with long id-like values (names, tickets) still fits its page.
 LABEL_MAX = 28
@@ -208,26 +223,74 @@ def _fallback_cardinality(cat: dict, n_rows) -> dict:
    }


-def _pie_make(top, n_distinct, title, n_rows):
-    """Return a zero-arg callable that builds the donut figure lazily."""
+def _llm_index(profile: dict, ctx: dict) -> dict:
+    """Map column name -> its LLM dictionary entry (description/unit/...).
+
+    Reads the ``llm.dictionary`` list that ``eda_llm_insights`` stored in the
+    profile (``profile['llm']``; falls back to ``ctx['llm']``). Returns an empty
+    dict when ``run_llm`` did not run, so the caller degrades cleanly. Fully
+    defensive: never raises on malformed input.
+    """
+    llm = profile.get(LLM_KEY)
+    if not isinstance(llm, dict):
+        llm = ctx.get(LLM_KEY)
+    if not isinstance(llm, dict):
+        return {}
+    entries = llm.get("dictionary")
+    if not isinstance(entries, (list, tuple)):
+        return {}
+    index: dict = {}
+    for e in entries:
+        if not isinstance(e, dict):
+            continue
+        col = e.get("column")
+        if col is None:
+            continue
+        index[model._safe_str(col)] = e
+    return index
+
+
+def _llm_desc_unit_block(name: str, llm_index: dict):
+    """Markdown block with the LLM business description + unit of a column, or
+    None when no LLM entry matches the column (clean fallback without LLM)."""
+    entry = llm_index.get(model._safe_str(name))
+    if not isinstance(entry, dict):
+        return None
+    raw_desc = entry.get("description") or entry.get("business_meaning")
+    desc = " ".join(model._safe_str(raw_desc).split()) if raw_desc else ""
+    raw_unit = entry.get("unit")
+    unit = " ".join(model._safe_str(raw_unit).split()) if raw_unit else ""
+    parts = []
+    if desc:
+        parts.append(f"**Descripción:** {desc}")
+    if unit:
+        parts.append(f"**Unidad:** {unit}")
+    if not parts:
+        return None
+    return model.Markdown(text=" · ".join(parts))
+
+
+def _bar_make(top, n_distinct, title, n_rows):
+    """Return a zero-arg callable that builds the bar figure lazily."""

    def make():
        try:
-            from datascience.categorical_top_pie_figure import (
-                categorical_top_pie_figure,
+            from datascience.categorical_top_bar_figure import (
+                categorical_top_bar_figure,
            )

-            return categorical_top_pie_figure(
+            return categorical_top_bar_figure(
                top=top, n_distinct=n_distinct or 0, title=title,
-                top_k=PIE_TOP_K, n_rows=n_rows)
+                top_k=CHART_TOP_K, n_rows=n_rows)
        except Exception:  # noqa: BLE001 — minimal local fallback figure.
-            return _fallback_pie(top, title)
+            return _fallback_bar(top, title)

    return make


-def _fallback_pie(top, title):
-    """Minimal donut figure used only if the registry function is unavailable."""
+def _fallback_bar(top, title):
+    """Minimal horizontal-bar figure used only if the registry function is
+    unavailable. Largest category on top, the rest folded into "Otros"."""
    import matplotlib

    matplotlib.use("Agg")
@@ -238,8 +301,8 @@ def _fallback_pie(top, title):
    items = [t for t in (top or [])
             if isinstance(t, dict) and isinstance(t.get("count"), (int, float))]
    items = sorted(items, key=lambda t: t.get("count") or 0, reverse=True)
-    head = items[:PIE_TOP_K]
-    rest = items[PIE_TOP_K:]
+    head = items[:CHART_TOP_K]
+    rest = items[CHART_TOP_K:]
    labels = [_truncate(t.get("value"), 20) for t in head]
    sizes = [float(t.get("count") or 0) for t in head]
    if rest:
@@ -249,10 +312,13 @@ def _fallback_pie(top, title):
        ax.text(0.5, 0.5, "sin datos categóricos", ha="center", va="center")
        ax.axis("off")
        return fig
-    ax.pie(sizes, labels=None, wedgeprops={"width": 0.42},
-           autopct=lambda p: f"{p:.0f}%" if p >= 4 else "")
-    ax.legend(labels, loc="center left", bbox_to_anchor=(1.0, 0.5),
-              fontsize=7, frameon=False)
+    # barh draws bottom-up, so reverse to put the largest category on top.
+    y_pos = range(len(labels))
+    ax.barh(list(y_pos), list(reversed(sizes)), color="#4C72B0",
+            edgecolor="white")
+    ax.set_yticks(list(y_pos))
+    ax.set_yticklabels(list(reversed(labels)), fontsize=7)
+    ax.set_xlabel("conteo", fontsize=8)
    ax.set_title(_truncate(title, 40))
    fig.tight_layout()
    return fig
@@ -375,17 +441,19 @@ def _topk_table(cat: dict):

 def _intro_blocks(n_rows, mark_term: bool = False):
    total = _fmt_int(n_rows)
-    # Mark the first appearance of the term as a clickable glossary jump when the
-    # term was registered (mark_term). The full definition of entropy lives in the
-    # GLOSARIO chapter, so the intro only names the clickable term here instead of
-    # repeating the long explanation (avoids the redundancy with the glossary).
+    # Mark the first appearance of each term as a clickable glossary jump when the
+    # terms were registered (mark_term). The full definition of the entropy term
+    # AND of how each categorical page is laid out live in the GLOSARIO chapter, so
+    # the intro only names the clickable terms instead of repeating the long
+    # explanation (avoids the redundancy with the glossary).
    entropia = ("[[term:entropia]]entropía[[/term]]" if mark_term
                else "entropía")
+    pagina = ("[[term:pagina_categorica]]cómo se organiza cada página[[/term]]"
+              if mark_term else "cómo se organiza cada página")
    text = (
-        f"Cada columna categórica ocupa su propia página: sus métricas de "
-        f"cardinalidad —incluida la {entropia}—, una nota que señala cardinalidad "
-        "problemática, la tabla de las categorías más frecuentes y un gráfico de "
-        "tarta (donut) de las más comunes, todo junto."
+        f"Cada columna categórica ocupa su propia página — {pagina}: "
+        f"cardinalidad (incluida la {entropia}), top de categorías y un gráfico "
+        "de barras de las más comunes."
    )
    if n_rows is not None:
        text += f" El dataset tiene {total} filas en total como referencia."
@@ -406,47 +474,59 @@ def build_cat_distr(profile: dict, ctx: dict):
        return None

    n_rows = profile.get("n_rows")
-    # Register "entropía" in the shared glossary collector (if present) and mark
-    # its first appearance clickable. End-to-end glossary example (mejora 6).
+    # Register "entropía" and the "how each categorical page is laid out" term in
+    # the shared glossary collector (if present) and mark their first appearance
+    # clickable. End-to-end glossary example (mejora 6).
    glossary = ctx.get("glossary")
    mark_term = False
    if isinstance(glossary, model.GlossaryCollector):
        glossary.add(_TERM_ENTROPIA_KEY, _TERM_ENTROPIA_LABEL,
                     _TERM_ENTROPIA_DEF)
+        glossary.add(_TERM_PAGINA_KEY, _TERM_PAGINA_LABEL)
        mark_term = True
    blocks = list(_intro_blocks(n_rows, mark_term=mark_term))

+    # Business description + unit per column come from the LLM dictionary
+    # (profile['llm']['dictionary'], matched by column name); absent without
+    # run_llm, in which case the per-column description block is simply omitted.
+    llm_index = _llm_index(profile, ctx)
+
    rendered = cat_cols[:MAX_COLS]
    for idx, col in enumerate(rendered):
        name = col.get("name") or "(columna)"
        cat = col.get("categorical") or {}
        card = _normalize_card(_cardinality(cat, n_rows))

-        # One Group per categorical column: heading + cardinality table + flag
-        # note + top-k table + donut figure are kept together and the renderer
-        # starts each on a fresh page/slide (page_break_before) so every column
-        # gets its own page with its chart next to its tables. The first column
-        # may share the intro's page (no forced break) to avoid a near-empty page.
-        col_blocks = [
-            model.Heading(text=str(name), level=2),
-            _cardinality_block(card),
-        ]
+        # One Group per categorical column: heading + (optional) LLM description +
+        # cardinality table + flag note + top-k table + bar figure are kept
+        # together and the renderer starts each on a fresh page/slide
+        # (page_break_before) so every column gets its own page with its chart next
+        # to its tables. The first column may share the intro's page (no forced
+        # break) to avoid a near-empty page.
+        col_blocks = [model.Heading(text=str(name), level=2)]
+        desc_block = _llm_desc_unit_block(name, llm_index)
+        if desc_block is not None:
+            col_blocks.append(desc_block)
+        col_blocks.append(_cardinality_block(card))
        note = _flag_note(card)
        if note is not None:
            col_blocks.append(note)
        # For id-like columns (≈100% distinct) the top-k is a list of unique
        # values — pure noise; skip it (the flag note already explains why) and
-        # let the donut take that room so the whole column fits one page/slide.
+        # let the bar chart take that room so the whole column fits one page/slide.
        if not card.get("id_like"):
            topk = _topk_table(cat)
            if topk is not None:
                col_blocks.append(topk)
        col_blocks.append(model.Figure(
-            make=_pie_make(cat.get("top") or [], card.get("n_distinct"),
+            make=_bar_make(cat.get("top") or [], card.get("n_distinct"),
                           str(name), n_rows),
            caption=(f"Categorías más comunes de «{_truncate(name, 32)}» "
-                     "(donut: top-k + «Otros»)")))
-        blocks.append(model.Group(blocks=col_blocks,
+                     "(barras: top-k + «Otros»)")))
+        # layout="side_by_side": in PPTX the cardinality table goes to the LEFT and
+        # the bar chart to the RIGHT of the same slide; the PDF renderer stacks it
+        # (the A5 mobile page is too narrow for two readable columns).
+        blocks.append(model.Group(blocks=col_blocks, layout="side_by_side",
                                  page_break_before=(idx > 0)))

    if len(cat_cols) > len(rendered):
@@ -2,12 +2,14 @@

 Self-contained: builds synthetic TableProfiles (no DuckDB) so the suite is fast
 and deterministic. Verifies that ``build_cat_distr`` emits the blocks the user
-asked for (distinct/total/%-distinct/unique metrics, top-k table and a donut
+asked for (distinct/total/%-distinct/unique metrics, top-k table and a bar
 figure), that EACH categorical column is wrapped in its own keep-together
-``Group`` that starts on a fresh page/slide (one column per page, chart next to
-its tables), that the long entropy explanation is NOT repeated inline (it lives
-in the glossary — only the clickable term is kept), that the chapter renders
-inside the full document to both PDF and PPTX showing that content, that a
+``Group`` laid out ``side_by_side`` (PPTX: table left / bars right) that starts on
+a fresh page/slide (one column per page, chart next to its tables), that the LLM
+business description + unit are shown per column when the profile carries an LLM
+block, that the long entropy / page-layout explanations are NOT repeated inline
+(they live in the glossary — only the clickable terms are kept), that the chapter
+renders inside the full document to both PDF and PPTX showing that content, that a
 profile with no categorical columns yields ``None`` without raising, and that
 long labels / many columns are never cut in either output.
 """
@@ -116,6 +118,10 @@ def test_golden_build_cat_distr_emite_bloques_pedidos():
    assert "log2" not in md.text          # redundant explanation removed.
    assert "máxima diversidad" not in md.text

+    # The donut/pie is gone: the intro no longer mentions tarta/donut (the chart
+    # is now a bar chart; the long page-layout explanation moved to the glossary).
+    assert "donut" not in md.text and "tarta" not in md.text
+
    # Per-column blocks are wrapped in keep-together Groups: flatten to inspect.
    flat = _flatten(ch.blocks)
    kv = next(b for b in flat if isinstance(b, KVTable))
@@ -128,11 +134,13 @@ def test_golden_build_cat_distr_emite_bloques_pedidos():
    assert any("Entropía" in lbl for lbl in labels)
    assert "únicos" in values and "%" in values
    assert "bits" in values and "norm" in values   # entropy + max + normalized.
-    # Top-k table + pie figure.
+    # Top-k table + bar figure.
    dt = next(b for b in flat if isinstance(b, DataTable))
    assert dt.header == ["Valor", "Conteo", "%"]
    assert any("neumaticos" in str(cell) for row in dt.rows for cell in row)
    assert any(isinstance(b, Figure) for b in flat)
+    # Each per-column Group is laid out side_by_side (table left / bars right).
+    assert all(g.layout == "side_by_side" for g in _column_groups(ch))
    # id-like column flagged with a Note that also explains the top-k is dropped.
    idnote = next((b for b in flat
                   if isinstance(b, Note) and "identificador" in b.text), None)
@@ -140,9 +148,9 @@ def test_golden_build_cat_distr_emite_bloques_pedidos():
    assert "No se lista el top" in idnote.text


-def test_golden_idlike_omite_topk_y_conserva_donut():
+def test_golden_idlike_omite_topk_y_conserva_grafico():
    # The id-like column (uuid, 100% distinct) must NOT carry a top-k DataTable
-    # (it would be a list of unique values), but must still keep its donut Figure
+    # (it would be a list of unique values), but must still keep its bar Figure
    # and its cardinality table so it stays a full per-column page.
    ch = build_cat_distr(_profile(), {})
    groups = _column_groups(ch)
@@ -151,7 +159,7 @@ def test_golden_idlike_omite_topk_y_conserva_donut():
    kinds = [b.kind for b in uuid_group.blocks]
    assert "data_table" not in kinds      # top-k of unique values dropped.
    assert "kv_table" in kinds            # cardinality kept.
-    assert "figure" in kinds              # donut kept (chart per column).
+    assert "figure" in kinds              # bar chart kept (chart per column).
    # A non-id-like column keeps its top-k table.
    cat_group = next(g for g in groups
                     if any(getattr(b, "text", "") == "categoria"
@@ -205,7 +213,7 @@ def test_golden_render_pdf_una_pagina_por_columna():
        assert "Entrop" in txt
        assert "distintos" in txt
        assert "categoria" in txt and "neumaticos" in txt
-        assert "donut" in txt           # figure caption rendered as text.
+        assert "barras" in txt          # bar-chart caption rendered as text (PDF).
        assert "identificador" in txt   # id-like note rendered.


@@ -258,9 +266,11 @@ def _profile_high_card() -> dict:


 def test_golden_pptx_una_slide_por_columna_con_su_grafico():
-    """Each categorical column occupies EXACTLY ONE cat_distr slide that carries
-    BOTH its cardinality table and its donut figure (picture) — i.e. the chart is
-    never separated from its table, even for a high-cardinality column."""
+    """Each categorical column occupies EXACTLY ONE cat_distr slide that carries
+    its chart (picture) on that same slide: the chart is never separated from its
+    column, not even for a high-cardinality column. With the side_by_side layout
+    the table is rasterized to an image, so the check relies on the presence of a
+    picture (not on the table text)."""
    from pptx.enum.shapes import MSO_SHAPE_TYPE

    prof = _profile_high_card()
@@ -272,7 +282,7 @@ def test_golden_pptx_una_slide_por_columna_con_su_grafico():
        prs = Presentation(out)

        # Per column: the cat_distr slides whose text mentions it, and whether the
-        # owning slide also has the donut caption + an actual picture shape.
+        # owning slide also carries an actual picture shape (its chart).
        slides_with_col = {n: [] for n in cat_names}
        owner_has_chart = {n: False for n in cat_names}
        for i, sl in enumerate(prs.slides):
@@ -288,15 +298,106 @@ def test_golden_pptx_una_slide_por_columna_con_su_grafico():
            for n in cat_names:
                if n in txt:
                    slides_with_col[n].append(i)
-                    has_table = "Cardinalidad" in txt or "distintos" in txt
-                    if has_pic and "donut" in txt and has_table:
+                    if has_pic:
                        owner_has_chart[n] = True

        for n in cat_names:
            # Exactly one slide carries the column (not split across slides).
            assert len(slides_with_col[n]) == 1, (n, slides_with_col[n])
-            # That single slide also holds its table AND its donut picture.
-            assert owner_has_chart[n], (n, "tabla y donut no están en el mismo slide")
+            # That single slide also holds its chart picture.
+            assert owner_has_chart[n], (n, "el gráfico no está en el slide de la columna")
+
+
+def test_golden_pptx_columna_side_by_side_tabla_izq_barra_der():
+    """With the side_by_side layout, a categorical column places its cardinality
+    table (image) in the left half and its bar chart (image) in the right half of
+    the SAME slide. Verifies that at least one column ends up in two columns
+    (table-left / bars-right), which is the evidence of side_by_side in PPTX."""
+    from pptx.enum.shapes import MSO_SHAPE_TYPE
+    from pptx.util import Inches
+
+    with tempfile.TemporaryDirectory() as d:
+        out = os.path.join(d, "eda.pptx")
+        render_automatic_eda_pptx(_profile(), out, {"title": "EDA"})
+        prs = Presentation(out)
+        centre = int(Inches(13.333 / 2.0))   # half of the 16:9 slide width.
+        two_col_slides = 0
+        for sl in prs.slides:
+            texts, lefts = [], []
+            for sh in sl.shapes:
+                if sh.has_text_frame:
+                    texts.append(sh.text_frame.text)
+                if (sh.shape_type == MSO_SHAPE_TYPE.PICTURE
+                        and sh.left is not None):
+                    lefts.append(sh.left)
+            txt = re.sub(r"\s+", " ", " ".join(texts))
+            if "Distribuciones categ" not in txt:
+                continue
+            # One picture starts in the left half, another in the right half.
+            if len(lefts) >= 2 and min(lefts) < centre and max(lefts) > centre:
+                two_col_slides += 1
+        assert two_col_slides >= 1, (
+            "ninguna columna quedó con tabla-izq / barras-der (side_by_side)")
+
+
+def _profile_with_llm() -> dict:
+    """The base profile plus an ``llm`` block (as eda_llm_insights would store it
+    with run_llm=True): a data dictionary with description/unit per column."""
+    prof = _profile()
+    prof["llm"] = {
+        "dictionary": [
+            {"column": "categoria",
+             "description": "Familia de producto del recambio",
+             "business_meaning": "Agrupa el catálogo por tipo de pieza",
+             "unit": "categoría"},
+            {"column": "uuid",
+             "description": "Identificador único de registro",
+             "unit": ""},
+        ],
+    }
+    return prof
+
+
+def test_llm_descripcion_y_unidad_por_columna():
+    # With an LLM dictionary, each categorical column whose name matches shows its
+    # business description and unit in a per-column markdown block.
+    ch = build_cat_distr(_profile_with_llm(), {})
+    groups = _column_groups(ch)
+    cat_group = next(g for g in groups
+                     if any(getattr(b, "text", "") == "categoria"
+                            for b in g.blocks))
+    md = " ".join(b.text for b in cat_group.blocks
+                  if getattr(b, "kind", "") == "markdown")
+    assert "Descripción" in md and "Familia de producto" in md
+    assert "Unidad" in md and "categoría" in md
+
+
+def test_edge_sin_llm_no_anade_descripcion():
+    # Without an LLM block the per-column description markdown is simply omitted;
+    # the column still renders its cardinality table and bar figure.
+    ch = build_cat_distr(_profile(), {})
+    for g in _column_groups(ch):
+        mds = [b.text for b in g.blocks if getattr(b, "kind", "") == "markdown"]
+        assert not any("Descripción" in t for t in mds)
+
+
+def test_pagina_categorica_clicable_y_definicion_en_glosario():
+    # The "how each categorical page is laid out" term is registered + marked
+    # clickable in the intro, and its full definition lands in the glossary
+    # chapter (canonical baseline catalog), not inline.
+    from datascience.automatic_eda.chapters.glosario import build_glosario
+
+    gc = GlossaryCollector()
+    ch = build_cat_distr(_profile(), {"glossary": gc})
+    md = next(b for b in ch.blocks if isinstance(b, Markdown))
+    assert "[[term:pagina_categorica]]" in md.text
+    assert gc.has("pagina_categorica")
+    glos = build_glosario(_profile(), {"glossary": gc})
+    entry = next(b for b in glos.blocks
+                 if getattr(b, "kind", "") == "glossary_entry"
+                 and b.key == "pagina_categorica")
+    assert "barras" in entry.definition
+    assert "identificador" in entry.definition


 def test_edge_sin_categoricas_devuelve_none():
@@ -17,10 +17,63 @@ from __future__ import annotations

 from .. import model

-CHAPTER_VERSION = "1.0.0"
+CHAPTER_VERSION = "1.1.0"
 CHAPTER_ID = "glosario"
 CHAPTER_TITLE = "Glosario"

+# Canonical definitions for cross-cutting terms — the "how to read it" entries
+# that do not belong to a single chapter. A chapter only needs to *register* the
+# term (``ctx['glossary'].add(key, label)``) and mark its in-text appearance with
+# ``[[term:key]]…[[/term]]``; this chapter supplies the full definition here when
+# the collector carries the term without one. Keeping the prose in a single place
+# avoids repeating a long paragraph inline in every chapter that names the term
+# (the explanation moved out of the NUM DISTR and CAT DISTR intros lives here).
+_BASELINE_TERMS = {
+    "histograma_boxplot": {
+        "label": "Cómo leer el histograma y el boxplot",
+        "definition": (
+            "Para cada columna numérica se muestra su histograma con tres líneas "
+            "de referencia: la media (línea roja discontinua), la mediana (línea "
+            "verde continua) y la banda ±1σ (zona sombreada que cubre una "
+            "desviación estándar a cada lado de la media). Debajo, alineado al "
+            "mismo eje horizontal, un boxplot de Tukey: la caja abarca del primer "
+            "al tercer cuartil (P25–P75), la línea interior es la mediana y los "
+            "bigotes llegan hasta 1,5·IQR; los puntos rojos señalan que hay "
+            "valores más allá de las vallas (posibles atípicos). Comparar la media "
+            "con la mediana revela la asimetría: si la media supera a la mediana la "
+            "cola larga cae hacia los valores altos (asimetría a la derecha), y al "
+            "revés hacia los bajos."),
+    },
+    "pagina_categorica": {
+        "label": "Cómo se organiza cada página categórica",
+        "definition": (
+            "Cada columna categórica ocupa su propia página: muestra sus métricas "
+            "de cardinalidad —incluida la entropía—, una nota que señala "
+            "cardinalidad problemática (columnas que se comportan como "
+            "identificador, con casi todos los valores distintos, o dominadas por "
+            "una sola categoría), la tabla de las categorías más frecuentes (top-k, "
+            "con su conteo y porcentaje) y un gráfico de barras de las categorías "
+            "más comunes (top-k más una barra «Otros» que agrupa la cola). El total "
+            "de filas del dataset se usa como referencia para interpretar los "
+            "conteos."),
+    },
+}
+
+
+def _resolve_term(term: dict) -> tuple:
+    """Return (label, definition) for a collected term, completing a missing
+    definition (and, if absent, the label) from the canonical baseline catalog."""
+    key = model._safe_str(term.get("key"))
+    label = model._safe_str(term.get("label"))
+    definition = model._safe_str(term.get("definition"))
+    base = _BASELINE_TERMS.get(key)
+    if base:
+        if not definition.strip():
+            definition = model._safe_str(base.get("definition"))
+        if not label.strip() or label == key:
+            label = model._safe_str(base.get("label")) or label
+    return label, definition
+

 def build_glosario(profile: dict, ctx: dict):
    """Build the glossary Chapter from the shared collector, or None if empty."""
@@ -36,12 +89,14 @@ def build_glosario(profile: dict, ctx: dict):
            "Cada término va resaltado en el texto y, al pulsarlo, salta a su "
            "definición en esta sección.")),
    ]
-    # One clickable destination per term, alphabetically by visible label.
+    # One clickable destination per term, alphabetically by visible label. A term
+    # registered without a definition is completed from the canonical baseline.
    for term in glossary.terms(by="label"):
+        label, definition = _resolve_term(term)
        blocks.append(model.GlossaryEntry(
            key=model._safe_str(term.get("key")),
-            label=model._safe_str(term.get("label")),
-            definition=model._safe_str(term.get("definition"))))
+            label=label,
+            definition=definition))

    return model.Chapter(id=CHAPTER_ID, title=CHAPTER_TITLE,
                         version=CHAPTER_VERSION, blocks=blocks)
@@ -35,10 +35,21 @@ try:
 except Exception:  # noqa: BLE001 — keep the chapter importable no matter what.
    build_boxplot_stats = None  # type: ignore[assignment]

-CHAPTER_VERSION = "1.2.0"
+CHAPTER_VERSION = "1.3.0"
 CHAPTER_ID = "num_distr"
 CHAPTER_TITLE = "Distribuciones numéricas"

+# Glossary term this chapter explains. The long "how to read the histogram and
+# the boxplot" paragraph used to live inline in the intro; it now lives in the
+# GLOSARIO chapter (canonical definition in ``glosario._BASELINE_TERMS``) and the
+# intro only names the clickable term: one click jumps to the full explanation,
+# so the information is relocated, not lost (glossary improvement).
+_TERM_HISTOBOX_KEY = "histograma_boxplot"
+_TERM_HISTOBOX_LABEL = "Cómo leer el histograma y el boxplot"
+
+# Key under which eda_llm_insights stores its interpretive block in the profile.
+LLM_KEY = "llm"
+
 # Plain-Spanish gloss for every label ``detect_distribution_type`` can emit, so a
 # non-expert reader understands the shape and the suggested next step (MUST-4.3).
 _DIST_GLOSS = {
@@ -99,6 +110,53 @@ def _numeric_columns(profile: dict) -> list:
    return out


+def _llm_index(profile: dict, ctx: dict) -> dict:
+    """Map column name -> its LLM dictionary entry (description/unit/...).
+
+    Reads the ``llm.dictionary`` list that ``eda_llm_insights`` stored in the
+    profile (``profile['llm']``; falls back to ``ctx['llm']``). Returns an empty
+    dict when ``run_llm`` did not run, so the caller degrades cleanly. Fully
+    defensive: never raises on malformed input.
+    """
+    llm = profile.get(LLM_KEY)
+    if not isinstance(llm, dict):
+        llm = ctx.get(LLM_KEY)
+    if not isinstance(llm, dict):
+        return {}
+    entries = llm.get("dictionary")
+    if not isinstance(entries, (list, tuple)):
+        return {}
+    index: dict = {}
+    for e in entries:
+        if not isinstance(e, dict):
+            continue
+        col = e.get("column")
+        if col is None:
+            continue
+        index[model._safe_str(col)] = e
+    return index
+
+
+def _llm_desc_unit_block(name: str, llm_index: dict):
+    """Markdown block with the LLM business description + unit of a column, or
+    None when no entry matches or both are empty (clean fallback without LLM)."""
+    entry = llm_index.get(model._safe_str(name))
+    if not isinstance(entry, dict):
+        return None
+    raw_desc = entry.get("description") or entry.get("business_meaning")
+    desc = " ".join(model._safe_str(raw_desc).split()) if raw_desc else ""
+    raw_unit = entry.get("unit")
+    unit = " ".join(model._safe_str(raw_unit).split()) if raw_unit else ""
+    parts = []
+    if desc:
+        parts.append(f"**Descripción:** {desc}")
+    if unit:
+        parts.append(f"**Unidad:** {unit}")
+    if not parts:
+        return None
+    return model.Markdown(text=" · ".join(parts))
+
+
 def _make_hist_box(name: str, numeric: dict, box: dict):
    """Build the histogram (with mean/median/±σ lines) + boxplot figure.

@@ -271,15 +329,26 @@ def build_num_distr(profile: dict, ctx: dict):
    if not numerics:
        return None  # chapter does not apply to a dataset with no numerics.

+    # Register the "how to read the histogram and boxplot" term in the shared
+    # glossary collector (if present) and mark its first appearance clickable. The
+    # full explanation (colour code, 1,5·IQR rule, asymmetry reading) lives in the
+    # GLOSARIO chapter instead of inline here: the intro only names the term.
+    glossary = ctx.get("glossary")
+    mark_term = False
+    if isinstance(glossary, model.GlossaryCollector):
+        glossary.add(_TERM_HISTOBOX_KEY, _TERM_HISTOBOX_LABEL)
+        mark_term = True
+    como_leer = ("[[term:histograma_boxplot]]cómo leer estos gráficos[[/term]]"
+                 if mark_term else "cómo leer estos gráficos")
    intro = (
-        "Para cada columna numérica se muestra su **histograma** con tres líneas "
-        "de referencia: la **media** (línea roja discontinua), la **mediana** "
-        "(línea verde continua) y la banda **±1σ** (zona sombreada). Debajo, "
-        "alineado al mismo eje, un **boxplot de Tukey**: la caja abarca del "
-        "primer al tercer cuartil (P25–P75), la línea interior es la mediana y "
-        "los bigotes llegan hasta 1,5·IQR; los puntos rojos señalan que hay "
-        "valores más allá de las vallas. Comparar media y mediana revela la "
-        "asimetría de la distribución.")
+        "Cada columna numérica muestra su **histograma** (con la **media**, la "
+        "**mediana** y la banda **±1σ**) y, debajo y al mismo eje, su **boxplot "
+        f"de Tukey** — {como_leer}.")
+
+    # Business description + unit per column come from the LLM dictionary
+    # (profile['llm']['dictionary'], matched by column name); absent without
+    # run_llm, in which case the per-column description block is simply omitted.
+    llm_index = _llm_index(profile, ctx)

    blocks = [
        model.Heading(text=CHAPTER_TITLE, level=1),
@@ -293,17 +362,20 @@ def build_num_distr(profile: dict, ctx: dict):
                box = build_boxplot_stats(numeric) or {}
            except Exception:  # noqa: BLE001 — degrade, never raise.
                box = {}
-        # Keep the column heading, its figure and its stats note together on the
-        # same page/slide (mejora 3 — keep-together): the renderers measure the
-        # whole Group and move it whole when it would not fit.
-        blocks.append(model.Group(blocks=[
-            model.Heading(text=str(name), level=2),
-            model.Figure(
-                make=_figure_maker(name, numeric, box),
-                caption=f"Distribución de «{name}» — histograma "
-                        f"(media/mediana/±σ) y boxplot."),
-            model.Markdown(text=_stats_note(name, numeric, box)),
-        ]))
+        # Keep the column heading, its (optional) LLM description, its figure and
+        # its stats note together on the same page/slide (improvement 3,
+        # keep-together): the renderers measure the whole Group and move it whole
+        # when it would not fit.
+        col_blocks = [model.Heading(text=str(name), level=2)]
+        desc_block = _llm_desc_unit_block(name, llm_index)
+        if desc_block is not None:
+            col_blocks.append(desc_block)
+        col_blocks.append(model.Figure(
+            make=_figure_maker(name, numeric, box),
+            caption=f"Distribución de «{name}» — histograma "
+                    "(media/mediana/±σ) y boxplot."))
+        col_blocks.append(model.Markdown(text=_stats_note(name, numeric, box)))
+        blocks.append(model.Group(blocks=col_blocks))

    return model.Chapter(id=CHAPTER_ID, title=CHAPTER_TITLE,
                         version=CHAPTER_VERSION, blocks=blocks)
@@ -101,7 +101,7 @@ def test_golden_chapter_estructura_y_bloques():


 def test_golden_media_mediana_sigma_y_boxplot_presentes():
-    # The intro documents the three reference lines and the Tukey boxplot; the
+    # The short intro names the three reference lines and the Tukey boxplot; the
    # per-column note carries the actual mean/median/σ numbers and the shape.
    ch = build_num_distr(_profile(n_numeric=1, extra_categorical=False), {})
    md_texts = " ".join(b.text for b in _flatten(ch.blocks)
@@ -110,10 +110,58 @@ def test_golden_media_mediana_sigma_y_boxplot_presentes():
    assert "±1σ" in md_texts or "σ" in md_texts
    assert "boxplot" in md_texts.lower()
    assert "Tukey" in md_texts
+    # The long "how to read it" explanation moved to the glossary: the colour-code
+    # / 1,5·IQR walkthrough is no longer inline in the chapter body.
+    assert "1,5·IQR" not in md_texts
+    assert "línea roja" not in md_texts
    # distribution_type gloss surfaced for the column (right-skewed preset).
    assert _DIST_GLOSS["right-skewed"].split(";")[0][:20] in md_texts


+def test_glosario_histograma_boxplot_clicable_y_definicion():
+    # With a glossary collector the intro marks the clickable term and the FULL
+    # explanation (the long paragraph removed from the body) lands in the glossary.
+    from datascience.automatic_eda.chapters.glosario import build_glosario
+
+    gc = model.GlossaryCollector()
+    prof = _profile(n_numeric=1, extra_categorical=False)
+    ch = build_num_distr(prof, {"glossary": gc})
+    intro = next(b for b in ch.blocks if b.kind == "markdown")
+    assert "[[term:histograma_boxplot]]" in intro.text
+    assert gc.has("histograma_boxplot")
+    glos = build_glosario(prof, {"glossary": gc})
+    entry = next(b for b in glos.blocks
+                 if getattr(b, "kind", "") == "glossary_entry"
+                 and b.key == "histograma_boxplot")
+    assert "boxplot" in entry.definition.lower()
+    assert "1,5·IQR" in entry.definition
+
+
+def test_llm_descripcion_y_unidad_por_columna():
+    # With an LLM dictionary, each numeric column whose name matches shows its
+    # business description and unit in a per-column markdown block.
+    prof = _profile(n_numeric=2)
+    prof["llm"] = {"dictionary": [
+        {"column": "precio", "description": "Precio de venta del producto",
+         "unit": "EUR"},
+        {"column": "alcohol", "business_meaning": "Grado alcohólico",
+         "unit": "% vol"},
+    ]}
+    ch = build_num_distr(prof, {})
+    md_all = " ".join(b.text for b in _flatten(ch.blocks)
+                      if b.kind == "markdown")
+    assert "Precio de venta" in md_all and "EUR" in md_all
+    assert "Grado alcohólico" in md_all and "% vol" in md_all
+
+
+def test_edge_sin_llm_no_anade_descripcion():
+    # Without an LLM block the per-column description markdown is simply omitted.
+    ch = build_num_distr(_profile(n_numeric=2), {})
+    md_all = " ".join(b.text for b in _flatten(ch.blocks)
+                      if b.kind == "markdown")
+    assert "Descripción" not in md_all
+
+
 def test_boxplot_stats_se_consumen_del_registry():
    # The chapter must feed build_boxplot_stats (group eda) and the resulting
    # box must carry the Tukey fences for the figure.