refactor(eda): quitar definiciones inline redundantes con el glosario en 5 capítulos

Ahora que el AutomaticEDA tiene un capítulo GLOSARIO con las definiciones de los términos técnicos (enganchados como links clicables desde el cuerpo), los capítulos calidad/correlacion/modelos/agregacion/relaciones ya no repiten inline esas explicaciones largas: se deja el TÉRMINO marcado (clicable, sigue saltando al glosario) y se elimina el párrafo/oración de definición redundante. Los HALLAZGOS y datos concretos del análisis se mantienen intactos; solo se quitan las definiciones generales que el glosario ya cubre. - calidad: _criteria_intro pasa de un bullet-list con las definiciones de completitud/validez/unicidad/calidad + fórmula renormalizada + párrafo de outliers a una frase que nombra las dimensiones, sus pesos (60/40) y el principio de outliers; los 4 términos siguen marcados. - modelos: la nota de normalización deja de explicar la fórmula del z-score; la intro de PCA ya no define "componentes ortogonales ordenados por varianza"; la de KMeans quita "rango −1 a 1: cuanto más alto..." (silhouette); la sección de Isolation Forest quita la descripción de árboles/cortes/umbral. Términos marcados intactos. - correlacion: la intro deja de describir cada método y consolida la duplicación signo/dirección; los 4 métodos + FDR siguen marcados. - agregacion: la intro quita la definición de pivot ("cruzan dos categóricas sobre una medida") y abrevia la selección de claves; groupby y pivot marcados. - relaciones: la intro y la sección de candidatas/inter-tabla quitan las definiciones de PK ("identifica cada fila"), FK ("referencian a otra tabla") y containment ("valores contenidos en la clave de otra"); pk/fk/cardinalidad/ containment siguen marcados. Verificado sobre el EDA de titanic (run_models + run_llm, 48 págs): los 23 link annotations término→glosario se conservan (PyMuPDF), el glosario mantiene las 20 definiciones, y el texto visible de los 5 capítulos baja un 34.7% en conjunto (calidad −67%, modelos −33%, relaciones −19%, agregacion −15%, correlacion −8%). Tests actualizados (calidad_test asertaba el texto viejo). Suite EDA + pipeline verde (118 passed). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-30 19:15:24 +02:00
11 changed files with 164 additions and 572 deletions
@@ -561,13 +561,11 @@ def _intro_blocks(gloss=None, mark_term: bool = False) -> list:
    t_groupby = _term(mark_term, "groupby", "**por grupos** (split-apply-combine)")
    t_pivot = _term(mark_term, "pivot_table", "**tablas dinámicas** (pivot)")
    text = (
-        f"Este capítulo analiza la tabla {t_groupby}: "
-        "elige las columnas categóricas más informativas — por su cardinalidad "
-        "y relevancia, no todas contra todas, para no inflar comparaciones "
-        "espurias — y resume las variables numéricas dentro de cada grupo "
-        f"(conteo, media, mediana, desviación). Las {t_pivot} "
-        "cruzan dos categóricas sobre una medida, y los **gráficos de barras** "
-        "(siempre desde cero) comparan los grupos de un vistazo."
+        f"Este capítulo analiza la tabla {t_groupby}: elige las columnas "
+        "categóricas más informativas (por cardinalidad y relevancia, no todas "
+        "contra todas) y resume las variables numéricas dentro de cada grupo "
+        f"(conteo, media, mediana, desviación). Se añaden {t_pivot} y "
+        "**gráficos de barras** (siempre desde cero) para comparar los grupos."
    )
    return [model.Heading(text=CHAPTER_TITLE, level=1),
            model.Markdown(text=text)]
@@ -3,12 +3,13 @@
 Builds the quality chapter from a ``TableProfile`` of the ``eda`` group. The
 chapter implements the quality model of report 2046:

-1. **En qué se basa la calidad** — an intro paragraph explaining the two scored
+1. **En qué se basa la calidad** — a concise intro naming the two scored
   dimensions and their weights (completitud 60%, validez 40%) plus the
-   table-level row uniqueness, BEFORE any number, and stating explicitly that
-   outliers are reported as observations and do **not** lower the score. The
-   criteria terms (calidad de datos, completitud, validez, unicidad de registro)
-   are hooked into the shared glossary as clickable jumps.
+   table-level row uniqueness, BEFORE any number, and stating that outliers are
+   reported as observations and do **not** lower the score. The criteria terms
+   (calidad de datos, completitud, validez, unicidad de registro) are hooked
+   into the shared glossary as clickable jumps; their full definitions live in
+   the GLOSARIO chapter, not inline here.
 2. **Scores por columna** — a table with, per column, the total quality score and
   its breakdown into completeness / validity (no consistency dimension).
 3. **Problemas de calidad** — a table listing ONLY real quality defects
@@ -309,30 +310,22 @@ def _term(key: str, label: str, mark: bool) -> str:


 def _criteria_intro(mark: bool) -> str:
-    """Intro paragraph explaining the two scored dimensions and the principle."""
+    """Intro: how the score is composed, with every term marked clickable.
+
+    Concise on purpose: the definitions of each term (calidad de datos,
+    completitud, validez, unicidad de registro) now live in the GLOSARIO
+    chapter, so the body no longer repeats them — it only states how the score
+    is composed and keeps each term marked so it stays a clickable jump.
+    """
    calidad = _term("calidad_datos", "calidad de datos", mark)
-    completitud = _term("completitud", "Completitud (peso 60%)", mark)
-    validez = _term("validez", "Validez (peso 40%, cuando es medible)", mark)
+    completitud = _term("completitud", "completitud", mark)
+    validez = _term("validez", "validez", mark)
    unicidad = _term("unicidad_registro", "unicidad de registro", mark)
    return (
-        f"La {calidad} de cada columna es un score de 0 a 100 que combina solo "
-        "dimensiones medibles desde el perfil de la tabla, sin fuente externa "
-        "de verdad:\n\n"
-        f"- {completitud}: proporción de valores presentes (1 − % de nulos; en "
-        "texto, las celdas vacías cuentan como faltantes). Los nulos y vacíos "
-        "bajan el score.\n"
-        f"- {validez}: proporción de valores que encajan con su tipo o formato "
-        "(un número que parsea, una fecha legible, un email con forma de email). "
-        "Si una columna es texto libre sin formato esperado, la validez no se "
-        "mide y el score se basa solo en la completitud.\n\n"
-        f"Score de columna = 100 × (0,6·completitud + 0,4·validez), "
-        "renormalizado cuando la validez no aplica. A nivel de tabla se añade "
-        f"la {unicidad} (1 − % de filas duplicadas).\n\n"
-        "**Los valores atípicos (outliers) NO bajan la calidad.** Un valor "
-        "extremo puede ser real y correcto; detectar atípicos es parte del "
-        "análisis de la distribución, no un juicio de corrección. Por eso, junto "
-        "con las columnas constantes y los identificadores, se listan aparte "
-        "como **observaciones analíticas** que no afectan al score."
+        f"La {calidad} de cada columna es un score de 0 a 100 que combina "
+        f"{completitud} (peso 60%) y {validez} (peso 40%, cuando es medible); "
+        f"a nivel de tabla se añade la {unicidad}. Los valores atípicos no "
+        "bajan el score: se listan aparte como **observaciones analíticas**."
    )


@@ -72,14 +72,16 @@ def test_golden_chapter_estructura_y_version():
    assert "markdown" in kinds and "kv_table" in kinds and "data_table" in kinds


-def test_golden_intro_explica_dos_dimensiones_y_pesos():
+def test_golden_intro_nombra_dos_dimensiones_y_pesos():
+    # La intro nombra las dos dimensiones, sus pesos y la unicidad, pero ya NO
+    # repite sus definiciones largas: estas viven ahora en el capítulo GLOSARIO.
    ch = build_calidad(_profile(), {})
    intro = [b for b in ch.blocks if b.kind == "markdown"][0].text
-    for needle in ("Completitud", "Validez", "60%", "40%",
+    for needle in ("completitud", "validez", "60%", "40%",
                   "unicidad de registro"):
        assert needle in intro, f"falta {needle!r} en la intro de criterios"
    # El principio: los outliers NO bajan la calidad.
-    assert "atípicos" in intro and "NO bajan" in intro
+    assert "atípicos" in intro and "no bajan" in intro
    # Ya no se menciona la dimensión consistencia eliminada.
    assert "20%" not in intro

@@ -1,25 +1,19 @@
 """Categorical distributions chapter (CAT DISTR).

-Third reference chapter for AutomaticEDA. Each categorical column gets **its own
-page (PDF) / slide (PPTX)**: every column is wrapped in a keep-together
-``model.Group`` with ``page_break_before=True`` (except the first, which may share
-the intro's page), so its chart sits next to its tables and no column is split.
+Third reference chapter for AutomaticEDA. For every categorical column it shows,
+fulfilling the user's request:

-A short intro names the clickable **[[term:entropia]]entropía[[/term]]** term —
-the full definition lives in the GLOSARIO chapter, so it is NOT repeated inline
-here (one click jumps to the glossary entry). The intro also carries the dataset
-row total used as a comparison baseline.
-
-Per column the Group contains, in order:
-
-1. A cardinality key/value table: distinct values, ``% distinct`` (distinct /
-   total rows), total dataset rows, singleton values (frequency 1), entropy with
-   its theoretical maximum and the normalized ratio, mode, imbalance and
-   string-length stats.
-2. A short note flagging problematic cardinality (id-like ≈100% distinct, or a
+1. A short opening explanation of **Shannon entropy** (what it measures, its 0
+   and log2(k) bounds, the normalized 0–1 version) and the dataset row total used
+   as a comparison baseline.
+2. Per column, a cardinality key/value table: distinct values, ``% distinct``
+   (distinct / total rows), total dataset rows, singleton values (frequency 1),
+   entropy with its theoretical maximum and the normalized ratio, mode, imbalance
+   and string-length stats.
+3. A short note flagging problematic cardinality (id-like ≈100% distinct, or a
   single dominating category).
-3. A ``top-k`` table (value / count / %).
-4. A **donut pie chart** of the most common categories (top-k + an "Otros"
+4. A ``top-k`` table (value / count / %).
+5. A **donut pie chart** of the most common categories (top-k + an "Otros"
   bucket), drawn lazily so the renderers scale it to fit entirely.

 Data comes from the ``eda`` group: each ``columns[i]['categorical']`` is the
@@ -39,7 +33,7 @@ import math

 from .. import model

-CHAPTER_VERSION = "1.2.0"
+CHAPTER_VERSION = "1.1.0"
 CHAPTER_ID = "cat_distr"
 CHAPTER_TITLE = "Distribuciones categóricas"

@@ -59,17 +53,11 @@ _TERM_ENTROPIA_DEF = (
 # Cap the number of categorical columns rendered to keep the document bounded;
 # the rest are summarized in a closing note (no silent truncation).
 MAX_COLS = 40
-# Rows shown in each top-k table and explicit slices in the pie. Kept moderate so
-# the whole column — cardinality table + top-k table + donut — fits on ONE
-# page/slide with the chart next to its tables; the table note still reports
-# "top N of M" so nothing is silently hidden. For id-like columns (≈100%
-# distinct) the top-k table is dropped entirely (it would be a list of unique
-# values — pure noise), which also frees the room the donut needs (see build).
-TOP_TABLE_ROWS = 8
+# Rows shown in each top-k table and explicit slices in the pie.
+TOP_TABLE_ROWS = 15
 PIE_TOP_K = 6
-# Truncate very long category labels in tables (the renderer also wraps). Kept
-# tight so a column with long id-like values (names, tickets) still fits its page.
-LABEL_MAX = 28
+# Truncate very long category labels in tables (the renderer also wraps).
+LABEL_MAX = 48


 def _fmt_int(value) -> str:
@@ -279,55 +267,45 @@ def _normalize_card(card: dict) -> dict:


 def _cardinality_block(card: dict):
-    """KVTable with the cardinality / entropy metrics for one column.
-
-    Related metrics are grouped onto a single row each (distinct/%/unique;
-    entropy bits/max/normalized; length min/mean/max) so the whole column —
-    table + chart — fits one page/slide without dropping any datum; the short
-    16:9 PPTX slide does not fit one metric per row plus a chart otherwise."""
+    """KVTable with the cardinality / entropy metrics for one column."""
    n_singletons = card.get("n_singletons")
    if n_singletons is not None and card.get("n_singletons_partial"):
-        singletons = f"≥{_fmt_int(n_singletons)}"
+        singletons = f"≥{_fmt_int(n_singletons)} (en top mostrado)"
    elif n_singletons is not None:
        singletons = _fmt_int(n_singletons)
    else:
        singletons = "—"

-    # Distinct count · % distinct · unique (frequency 1) on one row.
-    distinct_combo = (f"{_fmt_int(card.get('n_distinct'))} · "
-                      f"{_fmt_pct_value(card.get('pct_distinct'))} · "
-                      f"{singletons} únicos")
-
-    # Entropy bits · theoretical max · normalized 0–1 on one row.
-    entropy_combo = (f"{_fmt_num(card.get('entropy'))} bits · "
-                     f"máx {_fmt_num(card.get('entropy_max'))} · "
-                     f"norm {_fmt_num(card.get('entropy_norm'))}")
+    entropy_ref = _fmt_num(card.get("entropy"))
+    emax = card.get("entropy_max")
+    if emax is not None:
+        entropy_ref = f"{entropy_ref} (máx {_fmt_num(emax)})"

    mode = card.get("mode")
    mode_pct = card.get("mode_pct")
-    mode_str = "—" if mode is None else _truncate(mode, 32)
+    mode_str = "—" if mode is None else model._safe_str(mode)
    if mode is not None and mode_pct is not None:
        mode_str = f"{mode_str} ({_fmt_pct_value(mode_pct)})"

    rows = [
-        ("Distintos · % · únicos", distinct_combo),
+        ("Valores distintos", _fmt_int(card.get("n_distinct"))),
+        ("% distintos", _fmt_pct_value(card.get("pct_distinct"))),
        ("Total filas (dataset)", _fmt_int(card.get("n_rows"))),
-        ("Entropía (bits · máx · norm)", entropy_combo),
+        ("Valores únicos (frecuencia 1)", singletons),
+        ("Entropía (bits)", entropy_ref),
+        ("Entropía normalizada (0–1)", _fmt_num(card.get("entropy_norm"))),
        ("Moda", mode_str),
    ]
    imbalance = card.get("imbalance")
+    if imbalance is not None:
+        rows.append(("Desbalance", _fmt_num(imbalance)))
    lm = card.get("len_min")
    lmean = card.get("len_mean")
    lmax = card.get("len_max")
-    # Imbalance and string length (both secondary) share one closing row.
-    extras = []
-    if imbalance is not None:
-        extras.append(f"desbalance {_fmt_num(imbalance)}")
    if any(v is not None for v in (lm, lmean, lmax)):
-        extras.append(
-            f"long. {_fmt_num(lm)}/{_fmt_num(lmean)}/{_fmt_num(lmax)}")
-    if extras:
-        rows.append(("Desbalance · longitud", " · ".join(extras)))
+        rows.append((
+            "Longitud (mín/media/máx)",
+            f"{_fmt_num(lm)} / {_fmt_num(lmean)} / {_fmt_num(lmax)}"))
    return model.KVTable(rows=rows, title="Cardinalidad")


@@ -337,8 +315,7 @@ def _flag_note(card: dict):
        return model.Note(
            "Casi todos los valores son distintos (≈100% distintos): la columna "
            "se comporta como un identificador y aporta poco para agrupar o "
-            "comparar categorías. No se lista el top de categorías (serían "
-            "valores casi todos únicos).")
+            "comparar categorías.")
    if card.get("dominated"):
        mp = card.get("mode_pct")
        mp_str = _fmt_pct_value(mp) if mp is not None else "muy alta"
@@ -358,7 +335,7 @@ def _topk_table(cat: dict):
        if not isinstance(t, dict):
            continue
        rows.append([
-            _truncate(t.get("value")),
+            model._safe_str(t.get("value")),
            _fmt_int(t.get("count")),
            _pct_from_maybe_fraction(t.get("pct")),
        ])
@@ -376,16 +353,20 @@ def _topk_table(cat: dict):
 def _intro_blocks(n_rows, mark_term: bool = False):
    total = _fmt_int(n_rows)
    # Mark the first appearance of the term as a clickable glossary jump when the
-    # term was registered (mark_term). The full definition of entropy lives in the
-    # GLOSARIO chapter, so the intro only names the clickable term here instead of
-    # repeating the long explanation (avoids the redundancy with the glossary).
-    entropia = ("[[term:entropia]]entropía[[/term]]" if mark_term
-                else "entropía")
+    # term was registered (mark_term). The visible text is identical either way.
+    entropia = ("[[term:entropia]]**entropía de Shannon**[[/term]]" if mark_term
+                else "**entropía de Shannon**")
    text = (
-        f"Cada columna categórica ocupa su propia página: sus métricas de "
-        f"cardinalidad —incluida la {entropia}—, una nota que señala cardinalidad "
-        "problemática, la tabla de las categorías más frecuentes y un gráfico de "
-        "tarta (donut) de las más comunes, todo junto."
+        f"La {entropia} mide cómo de repartidos están los valores de "
+        "una columna categórica, en bits. Vale 0 cuando una sola categoría "
+        "concentra todas las filas (máxima previsibilidad) y alcanza su máximo, "
+        "log2(k) para k categorías distintas, cuando todas aparecen por igual "
+        "(máxima diversidad). La **entropía normalizada** (entropía dividida por "
+        "su máximo) la lleva al rango 0–1 para comparar columnas con distinto "
+        "número de categorías. Para cada columna se muestran los valores "
+        "distintos, el porcentaje que representan sobre el total de filas, los "
+        "valores únicos (que aparecen una sola vez), la tabla de las categorías "
+        "más frecuentes y un gráfico de tarta (donut) de las más comunes."
    )
    if n_rows is not None:
        text += f" El dataset tiene {total} filas en total como referencia."
@@ -417,37 +398,24 @@ def build_cat_distr(profile: dict, ctx: dict):
    blocks = list(_intro_blocks(n_rows, mark_term=mark_term))

    rendered = cat_cols[:MAX_COLS]
-    for idx, col in enumerate(rendered):
+    for col in rendered:
        name = col.get("name") or "(columna)"
        cat = col.get("categorical") or {}
        card = _normalize_card(_cardinality(cat, n_rows))

-        # One Group per categorical column: heading + cardinality table + flag
-        # note + top-k table + donut figure are kept together and the renderer
-        # starts each on a fresh page/slide (page_break_before) so every column
-        # gets its own page with its chart next to its tables. The first column
-        # may share the intro's page (no forced break) to avoid a near-empty page.
-        col_blocks = [
-            model.Heading(text=str(name), level=2),
-            _cardinality_block(card),
-        ]
+        blocks.append(model.Heading(text=str(name), level=2))
+        blocks.append(_cardinality_block(card))
        note = _flag_note(card)
        if note is not None:
-            col_blocks.append(note)
-        # For id-like columns (≈100% distinct) the top-k is a list of unique
-        # values — pure noise; skip it (the flag note already explains why) and
-        # let the donut take that room so the whole column fits one page/slide.
-        if not card.get("id_like"):
-            topk = _topk_table(cat)
-            if topk is not None:
-                col_blocks.append(topk)
-        col_blocks.append(model.Figure(
+            blocks.append(note)
+        topk = _topk_table(cat)
+        if topk is not None:
+            blocks.append(topk)
+        blocks.append(model.Figure(
            make=_pie_make(cat.get("top") or [], card.get("n_distinct"),
                           str(name), n_rows),
            caption=(f"Categorías más comunes de «{_truncate(name, 32)}» "
                     "(donut: top-k + «Otros»)")))
-        blocks.append(model.Group(blocks=col_blocks,
-                                  page_break_before=(idx > 0)))

    if len(cat_cols) > len(rendered):
        omitted = len(cat_cols) - len(rendered)
@@ -2,14 +2,11 @@

 Self-contained: builds synthetic TableProfiles (no DuckDB) so the suite is fast
 and deterministic. Verifies that ``build_cat_distr`` emits the blocks the user
-asked for (distinct/total/%-distinct/unique metrics, top-k table and a donut
-figure), that EACH categorical column is wrapped in its own keep-together
-``Group`` that starts on a fresh page/slide (one column per page, chart next to
-its tables), that the long entropy explanation is NOT repeated inline (it lives
-in the glossary — only the clickable term is kept), that the chapter renders
-inside the full document to both PDF and PPTX showing that content, that a
-profile with no categorical columns yields ``None`` without raising, and that
-long labels / many columns are never cut in either output.
+asked for (entropy intro, distinct/total/%-distinct/unique metrics, top-k table
+and a donut figure), that the chapter renders inside the full document to both
+PDF and PPTX showing that content, that a profile with no categorical columns
+yields ``None`` without raising, and that long labels / many columns are never
+cut in either output.
 """

 import os
@@ -20,8 +17,7 @@ from pypdf import PdfReader
 from pptx import Presentation

 from datascience.automatic_eda.model import (
-    DataTable, Figure, GlossaryCollector, Group, Heading, KVTable, Markdown,
-    Note,
+    DataTable, Figure, Heading, KVTable, Note,
 )
 from datascience.automatic_eda.chapters.cat_distr import (
    CHAPTER_ID, CHAPTER_VERSION, build_cat_distr,
@@ -85,20 +81,8 @@ def _pptx_text(path: str) -> str:
    return re.sub(r"\s+", " ", " ".join(parts))


-def _flatten(blocks):
-    """Expand keep-together Groups so the per-column heading/table/figure are
-    inspectable as a flat block list (the chapter wraps each column in a Group)."""
-    out = []
-    for b in blocks:
-        if getattr(b, "kind", "") == "group":
-            out.extend(_flatten(getattr(b, "blocks", []) or []))
-        else:
-            out.append(b)
-    return out
-
-
-def _column_groups(chapter):
-    return [b for b in chapter.blocks if isinstance(b, Group)]
+def _kinds(chapter):
+    return [b.kind for b in chapter.blocks]


 def test_golden_build_cat_distr_emite_bloques_pedidos():
@@ -106,101 +90,36 @@ def test_golden_build_cat_distr_emite_bloques_pedidos():
    assert ch is not None
    assert ch.id == CHAPTER_ID
    assert ch.version == CHAPTER_VERSION
-
-    # Entropy intro present, but the long explanation is gone (it lives in the
-    # glossary now): only the term is named, no log2/normalizada walkthrough.
+    kinds = _kinds(ch)
+    # Entropy intro present.
    headings = [b.text for b in ch.blocks if isinstance(b, Heading)]
    assert any("Entrop" in h for h in headings)
-    md = next(b for b in ch.blocks if isinstance(b, Markdown))
-    assert "entropía" in md.text.lower()
-    assert "log2" not in md.text          # redundant explanation removed.
-    assert "máxima diversidad" not in md.text
-
-    # Per-column blocks are wrapped in keep-together Groups: flatten to inspect.
-    flat = _flatten(ch.blocks)
-    kv = next(b for b in flat if isinstance(b, KVTable))
+    md = next(b for b in ch.blocks if b.kind == "markdown")
+    assert "entropía" in md.text.lower() and "log2" in md.text
+    # Cardinality metrics: distinct, total rows, %-distinct, unique values.
+    kv = next(b for b in ch.blocks if isinstance(b, KVTable))
    labels = [r[0] for r in kv.rows]
-    values = " ".join(str(r[1]) for r in kv.rows)
-    # Cardinality metrics: distinct count, %-distinct, unique values and total
-    # rows are present (grouped onto compact rows so the chart fits the page).
-    assert "Distintos · % · únicos" in labels
+    assert "Valores distintos" in labels
+    assert "% distintos" in labels
    assert "Total filas (dataset)" in labels
+    assert "Valores únicos (frecuencia 1)" in labels
    assert any("Entropía" in lbl for lbl in labels)
-    assert "únicos" in values and "%" in values
-    assert "bits" in values and "norm" in values   # entropy + max + normalized.
    # Top-k table + pie figure.
-    dt = next(b for b in flat if isinstance(b, DataTable))
+    dt = next(b for b in ch.blocks if isinstance(b, DataTable))
    assert dt.header == ["Valor", "Conteo", "%"]
    assert any("neumaticos" in str(cell) for row in dt.rows for cell in row)
-    assert any(isinstance(b, Figure) for b in flat)
-    # id-like column flagged with a Note that also explains the top-k is dropped.
-    idnote = next((b for b in flat
-                   if isinstance(b, Note) and "identificador" in b.text), None)
-    assert idnote is not None
-    assert "No se lista el top" in idnote.text
+    assert any(isinstance(b, Figure) for b in ch.blocks)
+    # id-like column flagged with a Note.
+    assert any(isinstance(b, Note) and "identificador" in b.text
+               for b in ch.blocks)


-def test_golden_idlike_omite_topk_y_conserva_donut():
-    # The id-like column (uuid, 100% distinct) must NOT carry a top-k DataTable
-    # (it would be a list of unique values), but must still keep its donut Figure
-    # and its cardinality table so it stays a full per-column page.
-    ch = build_cat_distr(_profile(), {})
-    groups = _column_groups(ch)
-    uuid_group = next(g for g in groups
-                      if any(getattr(b, "text", "") == "uuid" for b in g.blocks))
-    kinds = [b.kind for b in uuid_group.blocks]
-    assert "data_table" not in kinds      # top-k of unique values dropped.
-    assert "kv_table" in kinds            # cardinality kept.
-    assert "figure" in kinds              # donut kept (chart per column).
-    # A non-id-like column keeps its top-k table.
-    cat_group = next(g for g in groups
-                     if any(getattr(b, "text", "") == "categoria"
-                            for b in g.blocks))
-    assert "data_table" in [b.kind for b in cat_group.blocks]
-
-
-def test_golden_una_pagina_por_columna_groups():
-    ch = build_cat_distr(_profile(), {})
-    groups = _column_groups(ch)
-    # Two categorical columns -> two column Groups (numeric column excluded).
-    assert len(groups) == 2
-    # Each Group carries one column: a heading + its cardinality table + figure.
-    for g in groups:
-        kinds = [b.kind for b in g.blocks]
-        assert kinds[0] == "heading"
-        assert "kv_table" in kinds
-        assert "figure" in kinds
-    # The first column may share the intro page (no forced break); every later
-    # column starts on a fresh page/slide so each column gets its own page.
-    assert groups[0].page_break_before is False
-    assert all(g.page_break_before is True for g in groups[1:])
-
-
-def test_golden_entropia_clicable_y_definicion_en_glosario():
-    # With a glossary collector the intro marks the clickable term and the FULL
-    # definition (the long explanation removed from the intro) lands in the
-    # glossary, not inline — no data lost, just relocated.
-    gc = GlossaryCollector()
-    ch = build_cat_distr(_profile(), {"glossary": gc})
-    md = next(b for b in ch.blocks if isinstance(b, Markdown))
-    assert "[[term:entropia]]entropía[[/term]]" in md.text
-    assert gc.has("entropia")
-    entry = gc.get("entropia")
-    assert entry is not None
-    # The definition kept in the glossary still carries the detail removed inline.
-    assert "log2" in entry["definition"]
-    assert "normalizada" in entry["definition"].lower()
-
-
-def test_golden_render_pdf_una_pagina_por_columna():
+def test_golden_render_pdf_muestra_categoricas():
    with tempfile.TemporaryDirectory() as d:
        out = os.path.join(d, "eda.pdf")
        res = render_automatic_eda_pdf(_profile(), out, {"title": "EDA"})
        assert res["path"] == out and os.path.exists(out)
-        cat_meta = next(c for c in res["chapters"] if c["id"] == CHAPTER_ID)
-        # Two categorical columns, each on its own page -> >= 2 pages for the
-        # chapter (intro shares the first column's page).
-        assert cat_meta["n_pages"] >= 2
+        assert CHAPTER_ID in [c["id"] for c in res["chapters"]]
        txt = _pdf_text(out)
        assert "Entrop" in txt
        assert "distintos" in txt
@@ -214,91 +133,13 @@ def test_golden_render_pptx_muestra_categoricas():
        out = os.path.join(d, "eda.pptx")
        res = render_automatic_eda_pptx(_profile(), out, {"title": "EDA"})
        assert res["path"] == out and os.path.exists(out)
-        cat_meta = next(c for c in res["chapters"] if c["id"] == CHAPTER_ID)
-        assert cat_meta["n_slides"] >= 2  # one slide per categorical column.
+        assert CHAPTER_ID in [c["id"] for c in res["chapters"]]
        txt = _pptx_text(out)
        assert "Entrop" in txt
        assert "categoria" in txt and "neumaticos" in txt
        assert "distintos" in txt


-def _profile_high_card() -> dict:
-    """Profile with a high-cardinality NON-id-like categorical column whose top-k
-    of long values would split from its donut on a short 16:9 slide unless the
-    renderer trims the table — the exact case the adversarial check flagged
-    (Ticket / Cabin)."""
-    long_vals = [f"Valor largo de categoria numero {i:02d} con texto extra"
-                 for i in range(40)]
-    top = [{"value": v, "count": 60 - i, "pct": (60 - i) / 5000.0}
-           for i, v in enumerate(long_vals)]
-    return {
-        "table": "t", "source": "t.csv", "n_rows": 5000, "n_cols": 3,
-        "quality_score": 80.0,
-        "columns": [
-            {"name": "precio", "inferred_type": "numeric", "null_pct": 0.0,
-             "numeric": {"mean": 1.0, "median": 1.0, "min": 0.0, "max": 2.0,
-                         "std": 0.5}},
-            # 40 distinct over 5000 rows = 0.8% distinct -> NOT id-like, keeps
-            # its (long) top-k table; the tall table must not push the donut off.
-            {"name": "alta_card_col", "inferred_type": "categorical",
-             "null_pct": 0.0, "distinct_count": 40,
-             "categorical": {"top": top, "mode": long_vals[0], "n_distinct": 40,
-                             "entropy": 5.2, "imbalance": 1.2, "len_min": 40,
-                             "len_mean": 45, "len_max": 50}},
-            {"name": "baja_card_col", "inferred_type": "categorical",
-             "null_pct": 0.0, "distinct_count": 4,
-             "categorical": {
-                 "top": [{"value": "norte", "count": 2000, "pct": 0.4},
-                         {"value": "sur", "count": 1500, "pct": 0.3},
-                         {"value": "este", "count": 1000, "pct": 0.2},
-                         {"value": "oeste", "count": 500, "pct": 0.1}],
-                 "mode": "norte", "n_distinct": 4, "entropy": 1.8}},
-        ],
-    }
-
-
-def test_golden_pptx_una_slide_por_columna_con_su_grafico():
-    """Each categorical column occupies EXACTLY ONE cat_distr slide that carries
-    BOTH its cardinality table and its donut figure (picture) — i.e. the chart is
-    never separated from its table, even for a high-cardinality column."""
-    from pptx.enum.shapes import MSO_SHAPE_TYPE
-
-    prof = _profile_high_card()
-    cat_names = ["alta_card_col", "baja_card_col"]
-    with tempfile.TemporaryDirectory() as d:
-        out = os.path.join(d, "eda.pptx")
-        res = render_automatic_eda_pptx(prof, out, {"title": "EDA"})
-        assert res["path"] == out and os.path.exists(out)
-        prs = Presentation(out)
-
-        # Per column: the cat_distr slides whose text mentions it, and whether the
-        # owning slide also has the donut caption + an actual picture shape.
-        slides_with_col = {n: [] for n in cat_names}
-        owner_has_chart = {n: False for n in cat_names}
-        for i, sl in enumerate(prs.slides):
-            texts, has_pic = [], False
-            for sh in sl.shapes:
-                if sh.has_text_frame:
-                    texts.append(sh.text_frame.text)
-                if sh.shape_type == MSO_SHAPE_TYPE.PICTURE:
-                    has_pic = True
-            txt = re.sub(r"\s+", " ", " ".join(texts))
-            if "Distribuciones categ" not in txt:   # footer stamp of the chapter.
-                continue
-            for n in cat_names:
-                if n in txt:
-                    slides_with_col[n].append(i)
-                    has_table = "Cardinalidad" in txt or "distintos" in txt
-                    if has_pic and "donut" in txt and has_table:
-                        owner_has_chart[n] = True
-
-        for n in cat_names:
-            # Exactly one slide carries the column (not split across slides).
-            assert len(slides_with_col[n]) == 1, (n, slides_with_col[n])
-            # That single slide also holds its table AND its donut picture.
-            assert owner_has_chart[n], (n, "tabla y donut no están en el mismo slide")
-
-
 def test_edge_sin_categoricas_devuelve_none():
    only_numeric = {
        "n_rows": 10, "columns": [
@@ -329,15 +170,11 @@ def test_anti_corte_label_largo_y_muchas_columnas():

    ch = build_cat_distr(profile, {})
    assert ch is not None
-    # One Group per column, each forcing its own page (except the first).
-    groups = _column_groups(ch)
-    assert len(groups) == 30
-    assert sum(1 for g in groups if g.page_break_before) == 29
    with tempfile.TemporaryDirectory() as d:
        pdf = os.path.join(d, "anti.pdf")
        res = render_automatic_eda_pdf(profile, pdf, {"write_manifest": False})
        assert res["path"] == pdf
-        assert res["n_pages"] > 1       # one page per column, OK.
+        assert res["n_pages"] > 1       # many columns spilled across pages, OK.
        txt = _pdf_text(pdf)
        # Long label wrapped (not truncated): every word survives.
        for word in ("Lorem", "incididunt", "reprehenderit", "voluptate"):
@@ -356,12 +356,11 @@ def build_correlacion(profile: dict, ctx: dict):
    t_cramers = _term(mark_term, "cramers_v", "Cramér's V")
    t_corr_ratio = _term(mark_term, "correlation_ratio", "razón de correlación")
    blocks.append(model.Markdown(text=(
-        "Asociación entre columnas. Cada par se evalúa con la métrica adecuada a "
-        f"sus tipos ({t_pearson}/{t_spearman} entre numéricas — con **signo**; "
-        f"{t_cramers} entre categóricas; {t_corr_ratio} num-categórica; "
-        "información mutua como medida común no lineal). Sólo las correlaciones "
-        "**num-num** tienen dirección: por eso los pares **negativos** son siempre "
-        "num-num.")))
+        "Asociación entre columnas. Cada par se evalúa con la métrica adecuada "
+        f"a sus tipos: {t_pearson}/{t_spearman} (numéricas), {t_cramers} "
+        f"(categóricas), {t_corr_ratio} (num-categórica) e información mutua. "
+        "Sólo las correlaciones **num-num** llevan **signo** (dirección): por "
+        "eso los pares **negativos** son siempre num-num.")))

    # 1) Association matrix (heatmap).
    labels, trimmed = _ordered_labels(pairs)
@@ -6,15 +6,16 @@ normality}``). It renders, as structured markdown/tables/figures that the core
 paginator never cuts:

 1. **Normalization note** — every multivariate model below standardizes the
-   columns with z-score first; the chapter explains why (different scales would
-   otherwise dominate distance/variance).
+   columns with z-score first (the term is marked clickable; its definition
+   lives in the GLOSARIO chapter, not inline).
 2. **PCA** — a scree plot (explained + cumulative variance, single Y axis) plus
   variance and top-loadings tables.
 3. **KMeans segments** — a PCA scatter **coloured by cluster** (its own
   page/slide), the cluster-size table, and a per-cluster LLM micro-analysis
   with a title for each segment.
-4. **Isolation Forest outliers** — a short explanation of how anomalous rows are
-   isolated multivariately and how the threshold is chosen, plus the counts.
+4. **Isolation Forest outliers** — the multivariate anomaly counts and decision
+   threshold (the method is marked clickable; its definition lives in the
+   GLOSARIO chapter, not inline).
 5. **Normality** — per-column Jarque-Bera / D'Agostino / Shapiro verdicts.

 The raw numeric data needed to colour the cluster scatter is **not** in the
@@ -314,12 +315,8 @@ def _normalization_intro(gloss=None, mark_term: bool = False) -> list:
    text = (
        "Estos modelos son **no supervisados**: buscan estructura latente sin "
        "una variable objetivo. Antes de aplicarlos, todas las columnas "
-        f"numéricas se {zscore} (cada valor menos la media, dividido por la "
-        "desviación típica). Sin esta normalización, una variable con escala "
-        "grande (p.ej. ingresos en euros) dominaría las distancias y la varianza "
-        "frente a otra de escala pequeña (p.ej. un ratio entre 0 y 1), sesgando "
-        "tanto el PCA como el KMeans. Tras la estandarización todas las variables "
-        "pesan por igual."
+        f"numéricas se {zscore}, para que todas pesen por igual con "
+        "independencia de su escala."
    )
    return [model.Heading(text="Modelos no supervisados", level=1),
            model.Markdown(text=text)]
@@ -334,11 +331,11 @@ def _pca_section(pca: dict, gloss=None, mark_term: bool = False) -> list:
    n_used = pca.get("n_rows_used")
    n_feat = pca.get("n_features")
    intro = (
-        f"El {_term(mark_term, 'pca', 'PCA')} resume {_fmt_num(n_feat)} variables "
-        "numéricas en componentes ortogonales ordenados por la varianza que "
-        f"capturan ({_fmt_num(n_used)} filas usadas tras eliminar nulos). El "
-        "gráfico de sedimentación (scree) muestra cuánta varianza aporta cada "
-        "componente y su acumulado: un codo marca cuántos componentes bastan."
+        f"El {_term(mark_term, 'pca', 'PCA')} se aplica sobre "
+        f"{_fmt_num(n_feat)} variables numéricas ({_fmt_num(n_used)} filas "
+        "usadas tras eliminar nulos). El gráfico de sedimentación (scree) "
+        "muestra cuánta varianza aporta cada componente y su acumulado: un "
+        "codo marca cuántos componentes bastan."
    )
    blocks.append(model.Markdown(text=intro))

@@ -403,9 +400,8 @@ def _kmeans_section(kmeans: dict, projection: dict, titles,
    t_sil = _term(mark_term, "silhouette", "*silhouette*")
    intro = (
        f"{t_kmeans} agrupa las filas en **{_fmt_num(best_k)} segmentos** "
-        f"elegidos automáticamente maximizando el coeficiente de {t_sil} "
-        f"(**{_fmt_num(sil)}**, rango −1 a 1: cuanto más alto, segmentos más "
-        "compactos y separados). Los segmentos se proyectan sobre el plano de "
+        f"elegidos automáticamente por el coeficiente de {t_sil} "
+        f"(**{_fmt_num(sil)}**). Los segmentos se proyectan sobre el plano de "
        "los dos primeros componentes principales para visualizarlos."
    )
    blocks.append(model.Markdown(text=intro))
@@ -469,14 +465,10 @@ def _outliers_section(outliers: dict, gloss=None, mark_term: bool = False) -> li
                            level=2)]
    isof = _term(mark_term, "isolation_forest", "**Isolation Forest**")
    explain = (
-        f"{isof} detecta filas anómalas de forma *multivariante*: "
-        "construye árboles que parten el espacio con cortes aleatorios y mide "
-        "cuántos cortes hacen falta para aislar cada fila. Las filas raras "
-        "(combinaciones de valores poco frecuentes considerando **todas las "
-        "columnas a la vez**, no una sola) se aíslan con muy pocos cortes y "
-        "obtienen un score bajo. El **umbral** de decisión separa las filas "
-        "normales de las anómalas según la contaminación esperada del modelo: "
-        "una fila es outlier cuando su score queda por debajo de ese umbral."
+        f"{isof} marca filas anómalas de forma *multivariante*: combinaciones "
+        "de valores poco frecuentes considerando **todas las columnas a la "
+        "vez**, no una sola. La tabla resume cuántas se detectaron y el umbral "
+        "de decisión empleado."
    )
    blocks.append(model.Markdown(text=explain))
    blocks.append(model.KVTable(rows=[
@@ -256,14 +256,14 @@ def _pk_candidates_section(profile: dict, mark: bool) -> list:
    pk = ("[[term:pk]]**clave primaria**[[/term]]" if mark
          else "**clave primaria**")
    intro = (
-        f"Estas columnas son **candidatas a {pk}**: su "
-        "[[term:cardinalidad]]cardinalidad[[/term]] iguala al número de filas y no "
-        "tienen nulos, así que cada valor identifica una fila distinta. Son "
-        "candidatas, no una clave declarada: la base no las marca como tal."
+        f"Columnas **candidatas a {pk}**: su "
+        "[[term:cardinalidad]]cardinalidad[[/term]] iguala al número de filas y "
+        "no tienen nulos. Son candidatas, no una clave declarada: la base no "
+        "las marca como tal."
        if mark else
-        "Estas columnas son **candidatas a clave primaria**: su cardinalidad "
-        "iguala al número de filas y no tienen nulos, así que cada valor "
-        "identifica una fila distinta.")
+        "Columnas **candidatas a clave primaria**: su cardinalidad iguala al "
+        "número de filas y no tienen nulos. Son candidatas, no una clave "
+        "declarada.")

    rows = []
    for name in keys:
@@ -320,10 +320,10 @@ def _inter_table_section(db_path: str, tables: list, mark: bool) -> list:
    blocks = [
        model.Heading(text="Claves foráneas candidatas (inter-tabla)", level=2),
        model.Markdown(text=(
-            f"La fuente tiene varias tablas. Estas {fk_term} candidatas se infieren "
-            f"por señal de nombre y por {containment}: una columna de una tabla cuyos "
-            "valores están contenidos en la clave de otra. No están declaradas por "
-            "la base; son la relación más probable según los datos.")),
+            f"La fuente tiene varias tablas. Estas {fk_term} candidatas se "
+            f"infieren por señal de nombre y por {containment}. No están "
+            "declaradas por la base; son la relación más probable según los "
+            "datos.")),
    ]

    shown = candidates[:MAX_FK_ROWS]
@@ -441,13 +441,12 @@ def _intro_blocks(mark: bool) -> list:
    pk = "[[term:pk]]clave primaria[[/term]]" if mark else "clave primaria"
    fk = "[[term:fk]]clave foránea[[/term]]" if mark else "clave foránea"
    text = (
-        f"Este capítulo analiza las **relaciones de clave** de la tabla: qué columna "
-        f"identifica cada fila (la {pk}) y qué columnas referencian a otra tabla (las "
-        f"{fk}). Cuando la base las **declara** como restricciones del esquema, se "
-        "muestran tal cual; cuando no, se proponen las más probables a partir de los "
-        "datos —por inclusión de valores entre tablas (containment) o, en una sola "
-        "tabla, por una heurística de nombre y cardinalidad— siempre marcadas como "
-        "candidatas, nunca como hechos.")
+        f"Este capítulo analiza las **relaciones de clave** de la tabla: cuál es "
+        f"la {pk} y cuáles son las {fk}. Cuando la base las **declara** como "
+        "restricciones del esquema, se muestran tal cual; cuando no, se proponen "
+        "las más probables a partir de los datos —por containment entre tablas o, "
+        "en una sola tabla, por una heurística de nombre y cardinalidad— siempre "
+        "marcadas como candidatas, nunca como hechos.")
    return [model.Heading(text=CHAPTER_TITLE, level=1), model.Markdown(text=text)]


@@ -139,17 +139,10 @@ class Group:
    it starts on a fresh page and flows (honest degradation, never cut). Use it to
    bind ``Heading`` + ``Markdown`` + ``Figure`` of one idea together (see the
    DISTR NUM / AGREGACION chapters).
-
-    When ``page_break_before`` is True the renderer additionally forces the group
-    to *start* on a fresh page/slide (unless the current one is already empty), so
-    a chapter can give each unit its own page — e.g. one categorical column per
-    page (see CAT DISTR). It is purely additive: the default False keeps the plain
-    keep-together behaviour for every existing chapter.
    """

    blocks: list = field(default_factory=list)
    title: Optional[str] = None
-    page_break_before: bool = False
    kind: str = field(default="group", init=False)


@@ -235,9 +228,7 @@ def as_block(obj: Any):
                return Note(text=_safe_str(obj.get("text")))
            if cls is Group:
                return Group(blocks=as_blocks(obj.get("blocks")),
-                             title=obj.get("title"),
-                             page_break_before=bool(
-                                 obj.get("page_break_before", False)))
+                             title=obj.get("title"))
            if cls is GlossaryEntry:
                return GlossaryEntry(key=_safe_str(obj.get("key")),
                                     label=_safe_str(obj.get("label")),
@@ -675,61 +675,6 @@ def _measure_figure_like(block) -> float:
    return target_h + 0.04 + cap_h + _GAP


-def _measure_kv_table(block) -> float:
-    """Faithful height of a KVTable — matches ``_place_kv_table``.
-
-    Counts the optional title heading and, per row, the wrapped VALUE column
-    (the label column never wraps in the placer). The previous estimate assumed
-    one line per row and ignored the title, so a column's keep-together Group
-    under-budgeted the figure and the chart spilled to the next page. Keep this in
-    sync with ``_place_kv_table``."""
-    h = 0.0
-    title = getattr(block, "title", None)
-    if title:
-        h += _measure_heading_text(title, 2)
-    rows = getattr(block, "rows", []) or []
-    key_w = 1.9
-    val_chars = tl.chars_per_line(_USABLE_W - key_w - 0.1, _FS_BODY)
-    lh = tl.line_height_in(_FS_BODY)
-    for row in rows:
-        try:
-            value = row[1]
-        except Exception:  # noqa: BLE001
-            value = ""
-        v_lines = tl.wrap(model._safe_str(value), val_chars)
-        h += lh * len(v_lines) + _ROW_VPAD
-    return h + _GAP
-
-
-def _measure_data_table(block) -> float:
-    """Faithful height of a DataTable — matches ``_place_data_table``.
-
-    Counts the optional title heading, the wrapped header row, every wrapped data
-    row (per-column wrap via the same ``_col_widths``/``_wrap_row`` the placer
-    uses) and the optional note. Keep this in sync with ``_place_data_table``."""
-    h = 0.0
-    title = getattr(block, "title", None)
-    if title:
-        h += _measure_heading_text(title, 2)
-    header = list(getattr(block, "header", []) or [])
-    rows = list(getattr(block, "rows", []) or [])
-    fs = _FS_CELL
-    widths = _col_widths(header, rows, fs)
-    lh = tl.line_height_in(fs)
-    if header:
-        header_lines = _wrap_row(header, widths, fs)
-        h += lh * max((len(c) for c in header_lines), default=1) + _ROW_VPAD * 2
-    for r in rows:
-        cells_lines = _wrap_row(r, widths, fs)
-        h += lh * max((len(c) for c in cells_lines), default=1) + _ROW_VPAD * 2
-    note = getattr(block, "note", None)
-    if note:
-        nlines = tl.wrap(model._safe_str(note),
-                         tl.chars_per_line(_USABLE_W, _FS_NOTE))
-        h += tl.line_height_in(_FS_NOTE) * len(nlines)
-    return h + _GAP
-
-
 def _measure_block(st: _PdfState, block) -> float:
    kind = getattr(block, "kind", "")
    try:
@@ -745,9 +690,13 @@ def _measure_block(st: _PdfState, block) -> float:
                            tl.chars_per_line(_USABLE_W, _FS_NOTE))
            return tl.line_height_in(_FS_NOTE) * len(lines) + _GAP
        if kind == "kv_table":
-            return _measure_kv_table(block)
+            rows = getattr(block, "rows", []) or []
+            return (tl.line_height_in(_FS_BODY) + _ROW_VPAD) * (len(rows) + 1) \
+                + _GAP
        if kind == "data_table":
-            return _measure_data_table(block)
+            rows = getattr(block, "rows", []) or []
+            return (tl.line_height_in(_FS_CELL) + _ROW_VPAD * 2) \
+                * (len(rows) + 1) + _GAP
        if kind == "group":
            return sum(_measure_block(st, b)
                       for b in (getattr(block, "blocks", []) or []))
@@ -786,10 +735,6 @@ def _place_group(st: _PdfState, block) -> None:
    blocks = getattr(block, "blocks", []) or []
    if not blocks:
        return
-    # Opt-in page break: start this group on a fresh page unless the current one
-    # is still empty (so a chapter can give each unit its own page).
-    if getattr(block, "page_break_before", False) and st.y > _CONTENT_TOP + 1e-6:
-        _new_page(st)
    avail_full = _CONTENT_BOTTOM - _CONTENT_TOP
    _shrink_group_figures(st, blocks, avail_full)
    total = sum(_measure_block(st, b) for b in blocks)
@@ -625,55 +625,6 @@ def _measure_figure_like(block) -> float:
    return target_h + 0.05 + cap_h + _GAP


-def _measure_kv_table(block) -> float:
-    """Faithful KVTable height — matches ``_place_kv_table`` (rendered as a
-    Campo/Valor data table with wrapped cells). The previous estimate assumed one
-    line per row and ignored the title, so a keep-together Group under-budgeted
-    the figure and the chart spilled to the next slide. Keep in sync."""
-    h = 0.0
-    title = getattr(block, "title", None)
-    if title:
-        h += _measure_heading_text(title, 2)
-    rows = getattr(block, "rows", []) or []
-    data_rows = []
-    for row in rows:
-        try:
-            label, value = row[0], row[1]
-        except Exception:  # noqa: BLE001
-            label, value = str(row), ""
-        data_rows.append([model._safe_str(label), model._safe_str(value)])
-    header = ["Campo", "Valor"]
-    widths = _col_widths(header, data_rows)
-    fs = _FS_CELL
-    h += _row_height_in(header, widths, fs)
-    for r in data_rows:
-        h += _row_height_in(r, widths, fs)
-    return h + _GAP
-
-
-def _measure_data_table(block) -> float:
-    """Faithful DataTable height — matches ``_place_data_table`` (title heading +
-    wrapped header + every wrapped row + optional note). Keep in sync."""
-    h = 0.0
-    title = getattr(block, "title", None)
-    if title:
-        h += _measure_heading_text(title, 2)
-    header = list(getattr(block, "header", []) or [])
-    rows = list(getattr(block, "rows", []) or [])
-    fs = _FS_CELL
-    widths = _col_widths(header, rows)
-    if header:
-        h += _row_height_in(header, widths, fs)
-    for r in rows:
-        h += _row_height_in(r, widths, fs)
-    note = getattr(block, "note", None)
-    if note:
-        nlines = tl.wrap(model._safe_str(note),
-                         tl.chars_per_line(_USABLE_W, _FS_NOTE))
-        h += tl.line_height_in(_FS_NOTE) * len(nlines) + 0.05
-    return h + _GAP
-
-
 def _measure_block(st: _PptxState, block) -> float:
    kind = getattr(block, "kind", "")
    try:
@@ -688,10 +639,9 @@ def _measure_block(st: _PptxState, block) -> float:
            lines = tl.wrap(getattr(block, "text", ""),
                            tl.chars_per_line(_USABLE_W, _FS_NOTE))
            return tl.line_height_in(_FS_NOTE) * len(lines) + 0.05 + _GAP
-        if kind == "kv_table":
-            return _measure_kv_table(block)
-        if kind == "data_table":
-            return _measure_data_table(block)
+        if kind in ("kv_table", "data_table"):
+            rows = getattr(block, "rows", []) or []
+            return (tl.line_height_in(_FS_CELL) + 0.10) * (len(rows) + 1) + _GAP
        if kind == "group":
            return sum(_measure_block(st, b)
                       for b in (getattr(block, "blocks", []) or []))
@@ -714,14 +664,10 @@ def _shrink_group_figures(st: _PptxState, blocks: list, avail_full: float) -> No
                   if getattr(b, "kind", "") not in ("figure", "image"))
    fig_overhead = tl.line_height_in(_FS_NOTE) + 0.05 + 0.05 + _GAP
    budget = avail_full - nonfig_h - 0.10 * len(fig_blocks)
-    # Low thresholds: a 16:9 slide is short, so a content-heavy column (cardinality
-    # table + top-k + chart) only fits if the chart is allowed to shrink small.
-    # Prefer a small-but-present chart on the SAME slide over splitting the column
-    # across slides (matches the PDF renderer's keep-together philosophy).
-    if budget <= 0.6:
+    if budget <= 1.0:
        return  # not enough room to keep together; let it flow (degrade).
    per = budget / len(fig_blocks) - fig_overhead
-    if per <= 0.35:
+    if per <= 0.8:
        return
    for fb in fig_blocks:
        cur = getattr(fb, "height_in", None)
@@ -729,90 +675,12 @@ def _shrink_group_figures(st: _PptxState, blocks: list, avail_full: float) -> No
                        if isinstance(cur, (int, float)) and cur > 0 else per)


-# Minimum height (inches) reserved for a figure inside a keep-together group on
-# the short 16:9 slide. When a high-cardinality column's table(s) would otherwise
-# leave no room, the data table is trimmed (with an honest note) so the chart
-# stays on the SAME slide next to its table instead of spilling to the next one.
-_GROUP_MIN_FIG_H = 1.3
-
-
-def _trim_data_table_to_budget(block, budget: float):
-    """Return a copy of a DataTable whose rows fit within ``budget`` inches.
-
-    Keeps the title, header, as many leading rows as fit (at least one) and an
-    honest note reporting how many of the original rows are shown. NEVER mutates
-    the original block — the same Chapter blocks are rendered by the PDF renderer,
-    which keeps the full table (an A5 page fits it)."""
-    header = list(getattr(block, "header", []) or [])
-    rows = list(getattr(block, "rows", []) or [])
-    title = getattr(block, "title", None)
-    fs = _FS_CELL
-    widths = _col_widths(header, rows)
-    fixed = 0.0
-    if title:
-        fixed += _measure_heading_text(title, 2)
-    if header:
-        fixed += _row_height_in(header, widths, fs)
-    note_h = tl.line_height_in(_FS_NOTE) + 0.05
-    avail_rows = budget - fixed - note_h - _GAP
-    kept = []
-    used = 0.0
-    for r in rows:
-        rh = _row_height_in(r, widths, fs)
-        if used + rh > avail_rows and kept:
-            break
-        kept.append(r)
-        used += rh
-    if len(kept) >= len(rows):
-        return block  # already fits; keep the original (with its own note).
-    note = (f"top {len(kept)} de {len(rows)} categorías mostradas "
-            "(recortado para caber en el slide; el PDF muestra más)")
-    return model.DataTable(header=header, rows=kept, title=title, note=note)
-
-
-def _fit_group_blocks(st: _PptxState, blocks: list, avail_full: float) -> list:
-    """Return a slide-fitting copy of a keep-together group's blocks.
-
-    On the short 16:9 slide a high-cardinality column's top-k table plus its
-    chart can overflow. Reserve ``_GROUP_MIN_FIG_H`` for the (later shrunk) figure
-    and trim the data table(s) to what is left, so every column keeps its chart
-    next to its table on ONE slide. No-op when the group has no figure+table pair
-    (e.g. id-like columns already drop the top-k upstream, or it already fits)."""
-    has_fig = any(getattr(b, "kind", "") in ("figure", "image") for b in blocks)
-    tbls = [b for b in blocks if getattr(b, "kind", "") == "data_table"]
-    if not (has_fig and tbls):
-        return blocks
-    fixed_h = sum(_measure_block(st, b) for b in blocks
-                  if getattr(b, "kind", "") not in ("figure", "image",
-                                                    "data_table"))
-    tables_h = sum(_measure_block(st, b) for b in tbls)
-    budget_tables = avail_full - fixed_h - _GROUP_MIN_FIG_H
-    if tables_h <= budget_tables:
-        return blocks  # already fits next to a min-height figure; leave intact.
-    out = []
-    for b in blocks:
-        if getattr(b, "kind", "") != "data_table":
-            out.append(b)
-            continue
-        trimmed = _trim_data_table_to_budget(b, max(budget_tables, 0.8))
-        out.append(trimmed)
-        budget_tables -= _measure_data_table(trimmed)
-    return out
-
-
 def _place_group(st: _PptxState, block) -> None:
    """Render a keep-together Group: move it whole to the next slide if needed."""
    blocks = getattr(block, "blocks", []) or []
    if not blocks:
        return
-    # Opt-in slide break: start this group on a fresh slide unless the current one
-    # is still empty (so a chapter can give each unit its own slide).
-    if getattr(block, "page_break_before", False) and st.y > _CONTENT_TOP + 1e-6:
-        _new_slide(st, cont=True)
    avail_full = _CONTENT_BOTTOM - _CONTENT_TOP
-    # Trim oversized tables first (keeps the chart on the same slide), then shrink
-    # the figure to share the remaining room.
-    blocks = _fit_group_blocks(st, blocks, avail_full)
    _shrink_group_figures(st, blocks, avail_full)
    total = sum(_measure_block(st, b) for b in blocks)
    if total <= avail_full: