feat(eda): AutomaticEDA engine phase 4a: render fixes + keep-together + clickable glossary

Cross-cutting improvements to the render engine (not to chapter content):

1. Fix bold overlapping text (PDF): _place_rich_lines measures the REAL width
   of each span using the renderer's font metrics (correct weight) instead of
   the average-width grid; bold and regular text on the same line no longer
   overlap.
2. Zebra striping: even rows shaded (#f6f8fa) in DataTable (PDF + PPTX),
   consistent when long tables are split (logical row index, not per page).
3. Keep-together: new Group block; the renderer measures the whole group and
   moves it intact to the next page/slide if it does not fit, and shrinks the
   figure (height_in) to leave room for its title and text. num_distr uses it.
4. Caption always visible on every PPTX figure (falls back to the heading);
   the figure reserves the height of its caption so both fit on the same slide.
5. Cover built last (with an aggregated summary of the analysis via
   ctx['document_summary']) but placed first by build_document.
6. Glossary: new chapter (last) + GlossaryCollector in ctx; chapters register
   terms and mark occurrences with [[term:key]]...[[/term]]. Real clickable
   links: PDF (PyMuPDF, GOTO link) and PPTX (native slide jump). Hooked up
   "entropía" (entropy) in cat_distr as an end-to-end example.

Reusable functions delegated to fn-constructor (tag eda):
- add_pdf_internal_links_py_datascience (PyMuPDF)
- pptx_link_run_to_slide_py_datascience (slide-jump)

Contract docs/automatic_eda_contract.md updated (§1/§3/§5 plus new §11) with
the glossary, keep-together, and zebra API for the next phase. PyMuPDF declared
in pyproject. Suite green (90 tests); titanic golden verified.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-30 17:35:19 +02:00
parent b5334a2e97
commit d1a3d58a6b
21 changed files with 2116 additions and 107 deletions
@@ -26,7 +26,7 @@ from . import model
 # placeholders other agents will fill by creating chapters/<id>.py — they will
 # appear in this exact position automatically once their module exists.
 CHAPTER_ORDER = [
-    "portada",       # cover
+    "portada",       # cover — BUILT LAST, PLACED FIRST (see build_document).
    "overview",      # df.head + columns/types/nulls/examples + describe
    "analisis_llm",  # LLM interpretation — sits next to overview (user request)
    "num_distr",     # numeric distributions
@@ -37,8 +37,15 @@ CHAPTER_ORDER = [
    "timeseries",    # time-series analysis
    "geospatial",    # geospatial
    "agregacion",    # aggregations / pivots
+    "glosario",      # glossary — ALWAYS LAST; clickable term destinations.
 ]

+# Chapters whose position is special-cased by build_document: portada is built
+# last (so it can summarize the rest) but placed first; glosario is built and
+# placed last (it reads the terms every other chapter registered).
+_PORTADA = "portada"
+_GLOSARIO = "glosario"
+

 def build_chapter(chapter_id: str, profile: dict, ctx: dict):
    """Build a single chapter by id, or None if absent/not-applicable/error.
@@ -75,15 +82,72 @@ def build_document(profile: dict, ctx: dict = None) -> list:
        list[Chapter] in canonical order, containing only the chapters that are
        implemented and applicable. Never raises.
    """
-    if profile is None:
-        profile = {}
    if not isinstance(profile, dict):
        profile = {}
-    if ctx is None:
-        ctx = {}
-    chapters = []
+    # Copy ctx so the shared collector / summary we add do not leak to the caller.
+    ctx = dict(ctx) if isinstance(ctx, dict) else {}
+
+    # A single glossary collector is shared by every chapter via ctx['glossary'].
+    # Chapters call ctx['glossary'].add(key, label, definition) and mark in-text
+    # appearances with [[term:key]]…[[/term]]; the glosario chapter renders the
+    # registered terms and the renderers wire the clickable links.
+    glossary = ctx.get("glossary")
+    if not isinstance(glossary, model.GlossaryCollector):
+        glossary = model.GlossaryCollector()
+        ctx["glossary"] = glossary
+
+    # 1) Body: every chapter except portada (built last) and glosario (placed
+    # last), in canonical order. This also fills the glossary collector.
+    body = []
    for cid in CHAPTER_ORDER:
+        if cid in (_PORTADA, _GLOSARIO):
+            continue
        ch = build_chapter(cid, profile, ctx)
        if ch is not None and ch.blocks:
-            chapters.append(ch)
+            body.append(ch)
+
+    # 2) Aggregated summary of the rest, for the cover (user decision: the cover
+    # is BUILT after the body so it can reflect what the analysis found).
+    ctx["document_summary"] = _summarize_document(profile, body)
+
+    # 3) Build the cover last, place it FIRST.
+    portada = build_chapter(_PORTADA, profile, ctx)
+    # 4) Build the glossary last (reads the terms the body registered), place LAST.
+    glosario = build_chapter(_GLOSARIO, profile, ctx)
+
+    chapters = []
+    if portada is not None and portada.blocks:
+        chapters.append(portada)
+    chapters.extend(body)
+    if glosario is not None and glosario.blocks:
+        chapters.append(glosario)
    return chapters
+
+
+def _summarize_document(profile: dict, body: list) -> dict:
+    """Aggregate a tiny findings summary of the body for the cover. Never raises.
+
+    Returns a dict with dataset shape, quality, column-type counts and the list
+    of chapters actually included — enough for the cover to show a mini-summary
+    of the analysis without re-deriving anything."""
+    try:
+        cols = profile.get("columns") or []
+        n_num = sum(1 for c in cols if isinstance(c, dict)
+                    and c.get("inferred_type") == "numeric")
+        n_cat = sum(1 for c in cols if isinstance(c, dict)
+                    and isinstance(c.get("categorical"), dict)
+                    and c.get("categorical", {}).get("top")
+                    and c.get("inferred_type") != "numeric")
+        return {
+            "n_chapters": len(body),
+            "chapter_titles": [getattr(c, "title", "") for c in body],
+            "n_rows": profile.get("n_rows"),
+            "n_cols": profile.get("n_cols"),
+            "quality_score": profile.get("quality_score"),
+            "n_numeric": n_num,
+            "n_categorical": n_cat,
+            "duplicate_pct": profile.get("duplicate_pct"),
+            "null_cell_pct": profile.get("null_cell_pct"),
+        }
+    except Exception:  # noqa: BLE001 — the summary is best-effort.
+        return {"n_chapters": len(body) if isinstance(body, list) else 0}
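
Review note: the build order introduced in build_document (body first, then summary, then cover and glossary, with the cover placed first) can be illustrated in isolation. The sketch below is a minimal stand-in, not the project's code: Chapter, GlossaryCollector, and the three build_* functions are hypothetical simplifications, since the real model module and chapter modules are not part of this diff. Only the add(key, label, definition) call shape and the ctx keys 'glossary' and 'document_summary' are taken from the change itself.

```python
from dataclasses import dataclass, field


# Hypothetical stand-ins: the real Chapter and GlossaryCollector live in the
# project's model module, whose definitions are not shown in this diff.
@dataclass
class Chapter:
    title: str
    blocks: list = field(default_factory=list)


class GlossaryCollector:
    def __init__(self):
        self.terms = {}

    def add(self, key, label, definition):
        # Assumption: the first registration of a key wins.
        self.terms.setdefault(key, (label, definition))


def build_cat_distr(ctx):
    # A body chapter registers a term and marks its appearance in the text.
    ctx["glossary"].add("entropy", "Entropy", "Spread of a categorical distribution.")
    return Chapter("Categorical distributions",
                   ["[[term:entropy]]entropy[[/term]] per column"])


def build_portada(ctx):
    # Runs after the body, so the aggregated summary is already in ctx.
    n = ctx["document_summary"]["n_chapters"]
    return Chapter("Cover", [f"{n} chapter(s) analysed"])


def build_glosario(ctx):
    # Runs after the body, so every term has already been registered.
    terms = ctx["glossary"].terms.values()
    return Chapter("Glossary", [f"{label}: {text}" for label, text in terms])


def build_document_sketch():
    ctx = {"glossary": GlossaryCollector()}
    body = [build_cat_distr(ctx)]                        # 1) body first
    ctx["document_summary"] = {"n_chapters": len(body)}  # 2) summary of the body
    portada = build_portada(ctx)                         # 3) cover built late
    glosario = build_glosario(ctx)                       # 4) glossary built late
    chapters = [portada, *body]                          # cover placed first
    if glosario.blocks:                                  # empty chapters are dropped
        chapters.append(glosario)
    return chapters


titles = [c.title for c in build_document_sketch()]
print(titles)
```

Building the cover and glossary after the body means both read a fully populated ctx, while the final list still presents the cover first and the glossary last.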