feat(eda): AutomaticEDA engine, phase 4a: render fixes + keep-together + clickable glossary

Cross-cutting improvements to the render engine (not to chapter content):

1. Fix for bold text overlapping its neighbours (PDF): _place_rich_lines now
   measures the REAL width of each span with the renderer's font metrics
   (correct weight) instead of the average-width grid; bold and regular text
   on the same line no longer overlap.
2. Zebra striping: even rows shaded (#f6f8fa) in DataTable (PDF + PPTX),
   consistent when long tables are split (logical row index, not per page).
3. Keep-together: new Group block; the renderer measures the whole group and
   moves it intact to the next page/slide if it does not fit, and shrinks the
   figure (height_in) to leave room for its heading and text. num_distr uses it.
4. Caption always visible on every PPTX figure (falls back to the heading); the
   figure reserves the height of its caption so both fit on the same slide.
5. Cover built last (with an aggregated summary of the analysis via
   ctx['document_summary']) but placed first by build_document.
6. Glossary: new chapter (last) + GlossaryCollector in ctx; chapters register
   terms and mark their appearances with [[term:key]]...[[/term]]. Real
   clickable links: PDF (PyMuPDF, GOTO link) and PPTX (native slide jump).
   "entropía" is hooked up in cat_distr as an end-to-end example.
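
As an illustration of item 3 (keep-together), here is a minimal sketch of the
placement decision. It is not the renderer's code: the names `place_group` and
`Placement`, the inch-based parameters and the `min_figure_in` floor are all
assumptions made for this example. The commit itself only states that the group
is measured as a whole, moved whole when it does not fit, and that the figure's
height_in is reduced to leave room for its heading and text.

```python
from dataclasses import dataclass


@dataclass
class Placement:
    new_page: bool        # True when the group starts on a fresh page/slide
    figure_height: float  # figure height actually used, in inches


def place_group(remaining_in: float, page_in: float,
                text_in: float, figure_in: float,
                min_figure_in: float = 2.0) -> Placement:
    """Hypothetical keep-together decision for heading + figure + note.

    remaining_in: free height left on the current page/slide
    page_in:      usable height of an empty page/slide
    text_in:      combined height of heading, caption and note
    figure_in:    preferred figure height
    """
    needed = text_in + figure_in
    if needed <= remaining_in:
        # The whole group fits where we are: stay on this page.
        return Placement(new_page=False, figure_height=figure_in)
    if needed <= page_in:
        # It fits on an empty page: move the group there intact.
        return Placement(new_page=True, figure_height=figure_in)
    # Too tall even for an empty page: shrink the figure so the text still
    # fits, but never below the (assumed) minimum legible height.
    shrunk = max(min_figure_in, page_in - text_in)
    return Placement(new_page=remaining_in < page_in, figure_height=shrunk)
```

Under these assumptions, a group needing 5.5 in with only 3 in left moves to the
next page unchanged, while a group taller than a whole page keeps its text and
gives up figure height instead.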

Reusable functions delegated to fn-constructor (tag eda):
- add_pdf_internal_links_py_datascience (PyMuPDF)
- pptx_link_run_to_slide_py_datascience (slide-jump)
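
Before either helper can attach a link, the marked spans have to be located in
the chapter text. The sketch below shows one way that could be done for the
[[term:key]]...[[/term]] syntax. The function name `split_terms`, the regular
expression and the (key, start, end) span shape are assumptions for
illustration only; the real renderers may parse the markers differently.

```python
import re

# Non-greedy body so that two marked terms on one line stay separate.
TERM_RE = re.compile(r"\[\[term:([^\]]+)\]\](.*?)\[\[/term\]\]", re.DOTALL)


def split_terms(text):
    """Strip glossary markers and report where each marked span ended up.

    Returns (plain_text, spans), where spans is a list of (key, start, end)
    offsets into plain_text. The visible text is left untouched.
    """
    parts = []
    spans = []
    pos = 0      # read position in the marked-up input
    length = 0   # length of the plain text emitted so far
    for match in TERM_RE.finditer(text):
        before = text[pos:match.start()]
        parts.append(before)
        length += len(before)
        visible = match.group(2)
        spans.append((match.group(1), length, length + len(visible)))
        parts.append(visible)
        length += len(visible)
        pos = match.end()
    parts.append(text[pos:])
    return "".join(parts), spans
```

A renderer could then lay out the plain text as usual and turn each reported
span into a link rectangle (PDF) or a hyperlinked run (PPTX).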

Contract docs/automatic_eda_contract.md updated (§1/§3/§5 + new §11) with the
glossary, keep-together and zebra API for the next phase. PyMuPDF declared in
pyproject. Test suite green (90 tests); titanic golden verified.
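
To make the zebra rule concrete, the sketch below shades rows by their logical
index in the full table, so the pattern continues unbroken when a long table is
split. The function name `paginate_with_zebra`, the fixed `rows_per_page` and
the choice of a zero-based index with even indexes shaded are assumptions for
this example; only the fill colour and the "logical index, not per page" rule
come from the commit.

```python
ZEBRA_FILL = "#f6f8fa"


def paginate_with_zebra(rows, rows_per_page):
    """Split rows into pages, pairing each row with its fill colour (or None)."""
    if rows_per_page <= 0:
        raise ValueError("rows_per_page must be positive")
    pages = []
    for start in range(0, len(rows), rows_per_page):
        page = []
        for offset, row in enumerate(rows[start:start + rows_per_page]):
            logical = start + offset  # index in the whole table, not the page
            fill = ZEBRA_FILL if logical % 2 == 0 else None
            page.append((row, fill))
        pages.append(page)
    return pages
```

With three rows per page, the first row of the second page has logical index 3
and stays unshaded, whereas a per-page counter would have restarted at 0 and
shaded it.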

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-30 17:35:19 +02:00
parent b5334a2e97
commit d1a3d58a6b
21 changed files with 2116 additions and 107 deletions
@@ -33,10 +33,23 @@ import math
 from .. import model
-CHAPTER_VERSION = "1.0.0"
+CHAPTER_VERSION = "1.1.0"
 CHAPTER_ID = "cat_distr"
 CHAPTER_TITLE = "Distribuciones categóricas"
+# Glossary term this chapter explains. Registered in the shared collector and
+# marked clickable on its first appearance (end-to-end glossary example,
+# improvement 6). Other chapters hook their own terms the same way (see the contract).
+_TERM_ENTROPIA_KEY = "entropia"
+_TERM_ENTROPIA_LABEL = "Entropía (de Shannon)"
+_TERM_ENTROPIA_DEF = (
+    "Medida, en bits, de cómo de repartidos están los valores de una columna "
+    "categórica. Vale 0 cuando una sola categoría concentra todas las filas "
+    "(máxima previsibilidad) y alcanza su máximo, log2(k) para k categorías "
+    "distintas, cuando todas aparecen por igual (máxima diversidad). La entropía "
+    "normalizada (entropía dividida por su máximo) la lleva al rango 0–1 para "
+    "comparar columnas con distinto número de categorías.")
 # Cap the number of categorical columns rendered to keep the document bounded;
 # the rest are summarized in a closing note (no silent truncation).
 MAX_COLS = 40
@@ -337,10 +350,14 @@ def _topk_table(cat: dict):
         note=note)
-def _intro_blocks(n_rows):
+def _intro_blocks(n_rows, mark_term: bool = False):
     total = _fmt_int(n_rows)
+    # Mark the first appearance of the term as a clickable glossary jump when the
+    # term was registered (mark_term). The visible text is identical either way.
+    entropia = ("[[term:entropia]]**entropía de Shannon**[[/term]]" if mark_term
+                else "**entropía de Shannon**")
     text = (
-        "La **entropía de Shannon** mide cómo de repartidos están los valores de "
+        f"La {entropia} mide cómo de repartidos están los valores de "
         "una columna categórica, en bits. Vale 0 cuando una sola categoría "
         "concentra todas las filas (máxima previsibilidad) y alcanza su máximo, "
         "log2(k) para k categorías distintas, cuando todas aparecen por igual "
@@ -370,7 +387,15 @@ def build_cat_distr(profile: dict, ctx: dict):
         return None
     n_rows = profile.get("n_rows")
-    blocks = list(_intro_blocks(n_rows))
+    # Register "entropía" in the shared glossary collector (if present) and mark
+    # its first appearance clickable. End-to-end glossary example (improvement 6).
+    glossary = ctx.get("glossary")
+    mark_term = False
+    if isinstance(glossary, model.GlossaryCollector):
+        glossary.add(_TERM_ENTROPIA_KEY, _TERM_ENTROPIA_LABEL,
+                     _TERM_ENTROPIA_DEF)
+        mark_term = True
+    blocks = list(_intro_blocks(n_rows, mark_term=mark_term))
     rendered = cat_cols[:MAX_COLS]
     for col in rendered:
@@ -0,0 +1,47 @@
+"""Glossary chapter (GLOSARIO) — always the last chapter, clickable terms.
+Renders one entry per glossary term that the other chapters registered during
+the document build through ``ctx['glossary'].add(key, label, definition)`` (see
+``GlossaryCollector`` in ``model.py``). Each entry is a clickable destination:
+every in-text appearance a chapter marked with ``[[term:key]]texto[[/term]]``
+becomes a real jump to its entry here — PDF link annotations (PyMuPDF) and PPTX
+native slide jumps, both wired by the renderers.
+Returns ``None`` when no term was registered (there is nothing to show), so the
+chapter simply disappears from documents that did not mark any term.
+Contract: build_<id>(profile, ctx) -> Chapter | None ; CHAPTER_VERSION = "x.y.z".
+"""
+from __future__ import annotations
+from .. import model
+CHAPTER_VERSION = "1.0.0"
+CHAPTER_ID = "glosario"
+CHAPTER_TITLE = "Glosario"
+def build_glosario(profile: dict, ctx: dict):
+    """Build the glossary Chapter from the shared collector, or None if empty."""
+    ctx = ctx or {}
+    glossary = ctx.get("glossary")
+    if not isinstance(glossary, model.GlossaryCollector) or not glossary:
+        return None
+    blocks = [
+        model.Heading(text="Glosario de términos", level=1),
+        model.Markdown(text=(
+            "Definición de los términos técnicos que aparecen en el informe. "
+            "Cada término va resaltado en el texto y, al pulsarlo, salta a su "
+            "definición en esta sección.")),
+    ]
+    # One clickable destination per term, alphabetically by visible label.
+    for term in glossary.terms(by="label"):
+        blocks.append(model.GlossaryEntry(
+            key=model._safe_str(term.get("key")),
+            label=model._safe_str(term.get("label")),
+            definition=model._safe_str(term.get("definition"))))
+    return model.Chapter(id=CHAPTER_ID, title=CHAPTER_TITLE,
+                         version=CHAPTER_VERSION, blocks=blocks)
@@ -34,7 +34,7 @@ try:
 except Exception:  # noqa: BLE001 — keep the chapter importable no matter what.
     build_boxplot_stats = None  # type: ignore[assignment]
-CHAPTER_VERSION = "1.0.0"
+CHAPTER_VERSION = "1.1.0"
 CHAPTER_ID = "num_distr"
 CHAPTER_TITLE = "Distribuciones numéricas"
@@ -278,12 +278,17 @@ def build_num_distr(profile: dict, ctx: dict):
             box = build_boxplot_stats(numeric) or {}
         except Exception:  # noqa: BLE001 — degrade, never raise.
             box = {}
-        blocks.append(model.Heading(text=str(name), level=2))
-        blocks.append(model.Figure(
-            make=_figure_maker(name, numeric, box),
-            caption=f"Distribución de «{name}» — histograma (media/mediana/±σ) "
-                    f"y boxplot."))
-        blocks.append(model.Markdown(text=_stats_note(name, numeric, box)))
+        # Keep the column heading, its figure and its stats note together on the
+        # same page/slide (improvement 3, keep-together): the renderers measure
+        # the whole Group and move it whole when it would not fit.
+        blocks.append(model.Group(blocks=[
+            model.Heading(text=str(name), level=2),
+            model.Figure(
+                make=_figure_maker(name, numeric, box),
+                caption=f"Distribución de «{name}» — histograma "
+                        f"(media/mediana/±σ) y boxplot."),
+            model.Markdown(text=_stats_note(name, numeric, box)),
+        ]))
     return model.Chapter(id=CHAPTER_ID, title=CHAPTER_TITLE,
                          version=CHAPTER_VERSION, blocks=blocks)
@@ -65,19 +65,33 @@ def _pdf_text(path: str) -> str:
     return re.sub(r"\s+", " ", txt)
+def _flatten(blocks):
+    """Expand keep-together Groups so the per-column heading/figure/markdown are
+    inspectable as a flat block list (the chapter wraps each column in a Group)."""
+    out = []
+    for b in blocks:
+        if getattr(b, "kind", "") == "group":
+            out.extend(_flatten(getattr(b, "blocks", []) or []))
+        else:
+            out.append(b)
+    return out
 def test_golden_chapter_estructura_y_bloques():
     ch = build_num_distr(_profile(n_numeric=2), {})
     assert ch is not None
     assert ch.id == "num_distr"
     assert ch.version == CHAPTER_VERSION
-    kinds = [b.kind for b in ch.blocks]
+    # Per-column blocks are wrapped in keep-together Groups: flatten to inspect.
+    flat = _flatten(ch.blocks)
+    kinds = [b.kind for b in flat]
     # Heading + intro Markdown, then per column: Heading + Figure + Markdown.
     assert kinds[0] == "heading"
     assert kinds[1] == "markdown"
     assert kinds.count("figure") == 2  # one figure per numeric column.
     assert kinds.count("heading") == 1 + 2  # chapter title + one per column.
     # Each figure has a lazy maker that produces a real matplotlib figure.
-    figs = [b for b in ch.blocks if b.kind == "figure"]
+    figs = [b for b in flat if b.kind == "figure"]
     fig = figs[0].make()
     assert fig is not None
     # Two stacked axes: histogram + boxplot share the figure.
@@ -90,7 +104,8 @@ def test_golden_media_mediana_sigma_y_boxplot_presentes():
     # The intro documents the three reference lines and the Tukey boxplot; the
     # per-column note carries the actual mean/median/σ numbers and the shape.
     ch = build_num_distr(_profile(n_numeric=1, extra_categorical=False), {})
-    md_texts = " ".join(b.text for b in ch.blocks if b.kind == "markdown")
+    md_texts = " ".join(b.text for b in _flatten(ch.blocks)
+                        if b.kind == "markdown")
     assert "media" in md_texts and "mediana" in md_texts
     assert "±1σ" in md_texts or "σ" in md_texts
     assert "boxplot" in md_texts.lower()
@@ -126,7 +141,8 @@ def test_anti_corte_muchas_columnas_pdf_y_pptx():
     # 8 numeric columns + long note text: nothing may be cut. Every column
     # heading must survive in both the PDF text and the PPTX deck.
     ch = build_num_distr(_profile(n_numeric=8), {})
-    names = [b.text for b in ch.blocks if b.kind == "heading" and b.level == 2]
+    names = [b.text for b in _flatten(ch.blocks)
+             if b.kind == "heading" and b.level == 2]
     assert len(names) == 8
     with tempfile.TemporaryDirectory() as d:
         pdf = os.path.join(d, "num.pdf")
@@ -17,7 +17,7 @@ from datetime import datetime, timezone
 from .. import model
-CHAPTER_VERSION = "1.0.0"
+CHAPTER_VERSION = "1.1.0"
 CHAPTER_ID = "portada"
 CHAPTER_TITLE = "Portada"
@@ -67,6 +67,53 @@ def _fmt_int(v) -> str:
         return str(v)
+def _fmt_pct(value) -> str:
+    """Format a percentage that may arrive as a 0–1 fraction or a 0–100 number."""
+    if value is None:
+        return ""
+    try:
+        v = float(value)
+    except (TypeError, ValueError):
+        return str(value)
+    if 0 < v <= 1.0:
+        v *= 100.0
+    return f"{v:.1f}%"
+def _summary_blocks(summary) -> list:
+    """Mini-summary of the rest of the analysis, shown on the cover (improvement 5).
+    The cover is built AFTER the body (``build_document`` passes the aggregated
+    ``ctx['document_summary']``), so it can reflect what the analysis found:
+    shape, column types, quality flags and which chapters were included. Returns
+    an empty list when there is no summary (the cover degrades to its metadata
+    table only)."""
+    if not isinstance(summary, dict) or not summary:
+        return []
+    rows = []
+    n_num = summary.get("n_numeric")
+    n_cat = summary.get("n_categorical")
+    if n_num is not None or n_cat is not None:
+        rows.append(("Columnas numéricas / categóricas",
+                     f"{_fmt_int(n_num)} / {_fmt_int(n_cat)}"))
+    if summary.get("duplicate_pct") is not None:
+        rows.append(("Filas duplicadas", _fmt_pct(summary.get("duplicate_pct"))))
+    if summary.get("null_cell_pct") is not None:
+        rows.append(("Celdas nulas", _fmt_pct(summary.get("null_cell_pct"))))
+    titles = summary.get("chapter_titles") or []
+    if titles:
+        rows.append(("Capítulos del informe", _fmt_int(len(titles))))
+    blocks = [model.Heading(text="Resumen del análisis", level=2)]
+    if rows:
+        blocks.append(model.KVTable(rows=rows))
+    if titles:
+        bullets = "\n".join(f"- {model._safe_str(t)}" for t in titles)
+        blocks.append(model.Markdown(
+            text="Este informe incluye los siguientes capítulos:\n" + bullets))
+    return blocks
 def _fmt_date_eu(value) -> str:
     """Format a date/ISO string as European DD/MM/AAAA HH:mm (UI convention).
@@ -152,5 +199,8 @@ def build_portada(profile: dict, ctx: dict):
         model.Markdown(text=str(granularity)),
     ]
+    # Mini-summary of the rest of the analysis (built last, shown on the cover).
+    blocks.extend(_summary_blocks(ctx.get("document_summary")))
     return model.Chapter(id=CHAPTER_ID, title=CHAPTER_TITLE,
                          version=CHAPTER_VERSION, blocks=blocks)