feat(eda): AutomaticEDA engine phase 4a: render fixes + keep-together + clickable glossary

Cross-cutting improvements to the render engine (not to chapter content):

1. Fix bold text overlapping its neighbours (PDF): _place_rich_lines now measures
   the REAL width of each span with the renderer's font metrics (correct weight)
   instead of the average-width grid; bold and regular text on the same line no
   longer overlap.
2. Zebra striping: even rows shaded (#f6f8fa) in DataTable (PDF + PPTX),
   consistent when long tables are split (logical row index, not per-page index).
3. Keep-together: new Group block; the renderer measures the whole group and
   moves it intact to the next page/slide if it does not fit, and shrinks the
   figure (height_in) to leave room for its title and text. num_distr uses it.
4. Caption always visible on every PPTX figure (falls back to the heading); the
   figure reserves the height of its caption so both fit on the same slide.
5. Cover built last (with an aggregated summary of the analysis via
   ctx['document_summary']) but placed first by build_document.
6. Glossary: new chapter (last) + GlossaryCollector in ctx; chapters register
   terms and mark occurrences with [[term:key]]...[[/term]]. Real clickable
   links: PDF (PyMuPDF, GOTO link) and PPTX (native slide jump). The term
   "entropía" (entropy) is wired into cat_distr as an end-to-end example.
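
Item 2 can be illustrated with a minimal sketch. It assumes a hypothetical helper, `paginate_with_zebra`, that is not part of the engine, and it assumes "even" means even 0-based indices (the commit does not say which base the renderer uses). The point it shows is that the shade is chosen from the row's index in the whole table, so the pattern continues across a page break instead of restarting.

```python
ZEBRA_FILL = "#f6f8fa"  # shade named in the commit message


def paginate_with_zebra(rows, rows_per_page):
    """Split rows into pages and shade by logical (whole-table) row index.

    Hypothetical helper, for illustration only. Returns a list of pages,
    each a list of (row, fill) tuples where fill is ZEBRA_FILL or None.
    """
    if rows_per_page < 1:
        raise ValueError("rows_per_page must be >= 1")
    pages = []
    for start in range(0, len(rows), rows_per_page):
        page = []
        for offset, row in enumerate(rows[start:start + rows_per_page]):
            logical_index = start + offset  # position in the whole table
            # Assumption: even 0-based logical rows are the shaded ones.
            fill = ZEBRA_FILL if logical_index % 2 == 0 else None
            page.append((row, fill))
        pages.append(page)
    return pages


pages = paginate_with_zebra(["a", "b", "c", "d", "e"], rows_per_page=3)
```

With a per-page index, row "d" (first row of the second page) would be shaded right after the shaded row "c"; with the logical index it is not.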

Reusable functions delegated to fn-constructor (tag eda):
- add_pdf_internal_links_py_datascience (PyMuPDF)
- pptx_link_run_to_slide_py_datascience (slide-jump)
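
Before either helper can attach a link, the [[term:key]]...[[/term]] markers have to be located and stripped from the text. The engine's actual parser is not shown in this commit; the following is only a sketch of one way to do it, with a hypothetical function name (`split_term_markers`) and an assumed regex.

```python
import re

# Matches [[term:key]]visible text[[/term]]; the visible text is non-greedy.
_TERM_RE = re.compile(
    r"\[\[term:(?P<key>[^\]]+)\]\](?P<text>.*?)\[\[/term\]\]", re.DOTALL
)


def split_term_markers(text):
    """Return (plain_text, links); links is a list of (start, end, key).

    start/end are offsets into plain_text that cover the visible term,
    which is the span a renderer would turn into a clickable link.
    """
    parts = []
    links = []
    pos = 0      # read position in the marked-up input
    out_len = 0  # length of the plain text emitted so far
    for m in _TERM_RE.finditer(text):
        before = text[pos:m.start()]
        parts.append(before)
        out_len += len(before)
        visible = m.group("text")
        links.append((out_len, out_len + len(visible), m.group("key")))
        parts.append(visible)
        out_len += len(visible)
        pos = m.end()
    parts.append(text[pos:])  # trailing text after the last marker
    return "".join(parts), links


plain, links = split_term_markers(
    "High [[term:entropy]]entropy[[/term]] column."
)
```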

Contract docs/automatic_eda_contract.md updated (§1/§3/§5 + new §11) with the
glossary, keep-together and zebra APIs for the next phase. PyMuPDF declared in
pyproject. Suite green (90 tests); titanic golden verified.
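
The keep-together rule from item 3 can be sketched as a small placement decision. This is an assumption-laden illustration, not the renderer's code: `place_group`, its parameters, the `min_figure_in` floor and the order of the checks are all hypothetical.

```python
def place_group(remaining_in, page_in, figure_in, text_in, min_figure_in=1.0):
    """Decide where a keep-together group goes and how tall its figure is.

    All arguments are heights in inches. Hypothetical helper: returns
    (new_page, figure_height_in), where new_page is True when the whole
    group must move to the next page/slide.
    """
    group_in = figure_in + text_in
    if group_in <= remaining_in:
        return False, figure_in  # fits in the space left on this page
    if group_in <= page_in:
        return True, figure_in   # fits intact on a fresh page
    # Taller than a whole page: shrink the figure so title/text still fit.
    return True, max(min_figure_in, page_in - text_in)


same_page = place_group(remaining_in=5.0, page_in=7.0, figure_in=3.0, text_in=1.5)
moved = place_group(remaining_in=2.0, page_in=7.0, figure_in=3.0, text_in=1.5)
shrunk = place_group(remaining_in=2.0, page_in=7.0, figure_in=6.5, text_in=1.5)
```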

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-30 17:35:19 +02:00
parent b5334a2e97
commit d1a3d58a6b
21 changed files with 2116 additions and 107 deletions
@@ -26,7 +26,7 @@ from . import model
 # placeholders other agents will fill by creating chapters/<id>.py; they will
 # appear in this exact position automatically once their module exists.
 CHAPTER_ORDER = [
-    "portada",  # cover
+    "portada",  # cover: BUILT LAST, PLACED FIRST (see build_document).
     "overview",  # df.head + columns/types/nulls/examples + describe
     "analisis_llm",  # LLM interpretation; sits next to overview (user request)
     "num_distr",  # numeric distributions
@@ -37,8 +37,15 @@ CHAPTER_ORDER = [
     "timeseries",  # time-series analysis
     "geospatial",  # geospatial
     "agregacion",  # aggregations / pivots
+    "glosario",  # glossary: ALWAYS LAST; clickable term destinations.
 ]

+# Chapters whose position is special-cased by build_document: portada is built
+# last (so it can summarize the rest) but placed first; glosario is built and
+# placed last (it reads the terms every other chapter registered).
+_PORTADA = "portada"
+_GLOSARIO = "glosario"
+

 def build_chapter(chapter_id: str, profile: dict, ctx: dict):
     """Build a single chapter by id, or None if absent/not-applicable/error.
@@ -75,15 +82,72 @@ def build_document(profile: dict, ctx: dict = None) -> list:
     list[Chapter] in canonical order, containing only the chapters that are
     implemented and applicable. Never raises.
     """
     if profile is None:
         profile = {}
     if not isinstance(profile, dict):
         profile = {}
     if ctx is None:
         ctx = {}
-    chapters = []
+    # Copy ctx so the shared collector / summary we add do not leak to the caller.
+    ctx = dict(ctx) if isinstance(ctx, dict) else {}
+    # A single glossary collector is shared by every chapter via ctx['glossary'].
+    # Chapters call ctx['glossary'].add(key, label, definition) and mark in-text
+    # appearances with [[term:key]]…[[/term]]; the glosario chapter renders the
+    # registered terms and the renderers wire the clickable links.
+    glossary = ctx.get("glossary")
+    if not isinstance(glossary, model.GlossaryCollector):
+        glossary = model.GlossaryCollector()
+    ctx["glossary"] = glossary
+
+    # 1) Body: every chapter except portada (built last) and glosario (placed
+    #    last), in canonical order. This also fills the glossary collector.
+    body = []
     for cid in CHAPTER_ORDER:
+        if cid in (_PORTADA, _GLOSARIO):
+            continue
         ch = build_chapter(cid, profile, ctx)
         if ch is not None and ch.blocks:
-            chapters.append(ch)
+            body.append(ch)
+
+    # 2) Aggregated summary of the rest, for the cover (user decision: the cover
+    #    is BUILT after the body so it can reflect what the analysis found).
+    ctx["document_summary"] = _summarize_document(profile, body)
+    # 3) Build the cover last, place it FIRST.
+    portada = build_chapter(_PORTADA, profile, ctx)
+    # 4) Build the glossary last (reads the terms the body registered), place LAST.
+    glosario = build_chapter(_GLOSARIO, profile, ctx)
+    chapters = []
+    if portada is not None and portada.blocks:
+        chapters.append(portada)
+    chapters.extend(body)
+    if glosario is not None and glosario.blocks:
+        chapters.append(glosario)
     return chapters
+
+
+def _summarize_document(profile: dict, body: list) -> dict:
+    """Aggregate a tiny findings summary of the body for the cover. Never raises.
+    Returns a dict with dataset shape, quality, column-type counts and the list
+    of chapters actually included, enough for the cover to show a mini-summary
+    of the analysis without re-deriving anything."""
+    try:
+        cols = profile.get("columns") or []
+        n_num = sum(1 for c in cols if isinstance(c, dict)
+                    and c.get("inferred_type") == "numeric")
+        n_cat = sum(1 for c in cols if isinstance(c, dict)
+                    and isinstance(c.get("categorical"), dict)
+                    and c.get("categorical", {}).get("top")
+                    and c.get("inferred_type") != "numeric")
+        return {
+            "n_chapters": len(body),
+            "chapter_titles": [getattr(c, "title", "") for c in body],
+            "n_rows": profile.get("n_rows"),
+            "n_cols": profile.get("n_cols"),
+            "quality_score": profile.get("quality_score"),
+            "n_numeric": n_num,
+            "n_categorical": n_cat,
+            "duplicate_pct": profile.get("duplicate_pct"),
+            "null_cell_pct": profile.get("null_cell_pct"),
+        }
+    except Exception:  # noqa: BLE001 (the summary is best-effort)
+        return {"n_chapters": len(body) if isinstance(body, list) else 0}
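
The build order introduced by this hunk (body first, then cover, then glossary, with the cover emitted first) can be checked in isolation with a toy version. This sketch is not the module's code: `assemble`, the `build` callable and the returned build log are hypothetical stand-ins for build_document and build_chapter.

```python
def assemble(order, build, first="portada", last="glosario"):
    """Build body chapters, then `first`, then `last`; return final order.

    `build` is a hypothetical callable: chapter_id -> chapter or None.
    Returns (chapters, build_log) so the build order can be inspected.
    """
    build_log = []

    def tracked(cid):
        build_log.append(cid)  # record the order in which chapters are built
        return build(cid)

    body = []
    for cid in order:
        if cid in (first, last):
            continue  # special-cased chapters are built after the body
        ch = tracked(cid)
        if ch is not None:
            body.append(ch)

    cover = tracked(first)      # built after the body, placed first
    glossary = tracked(last)    # built last, placed last

    chapters = [cover] if cover is not None else []
    chapters.extend(body)
    if glossary is not None:
        chapters.append(glossary)
    return chapters, build_log


order = ["portada", "overview", "num_distr", "glosario"]
chapters, build_log = assemble(order, build=lambda cid: cid.upper())
```

The output order matches CHAPTER_ORDER while the build log shows the cover and glossary being built after the body chapters.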