feat(eda): AutomaticEDA core, chapter-based document + no-cut PDF/PPTX renderers

Introduces the intermediate layer between an EDA's content and its output format. A document is a list of versioned chapters; each chapter is an ordered set of format-independent blocks (heading, markdown, kv_table, data_table, figure, image, caption, note).

Core (support package python/functions/datascience/automatic_eda/):
- model.py: block dataclasses + Chapter, defensive normalizers (accept a dataclass or a dict, never raise), ENGINE_VERSION and the per-chapter manifest (automatic_eda_manifest.json).
- text_layout.py: shared character-grid measurement/wrapping.
- chapters_registry.py: pre-declared CHAPTER_ORDER + build_document with convention-based auto-discovery of chapters (lets chapters be added in parallel without editing the registry).
- render_pdf_impl.py: mobile-oriented A5 portrait paginator that MEASURES every block and never cuts one: text breaks at whole lines, long tables are split by rows with the header repeated, figures/images are scaled to fit whole. Per-chapter versioned footer.
- render_pptx_impl.py: same principle on 16:9 slides (continues on a "(cont.)" slide; tables repeat the header; figures are exported to PNG and scaled).
- chapters/portada.py and chapters/overview.py: reference chapters. Portada (cover) with name, Automatic-EDA label, source, storage (inferred from source), European-format date, rows×cols, description, granularity and quality with criteria. Overview with df.head (honest placeholder if head_rows is missing), column dictionary (type/nulls/examples) and numeric describe.

Public registry functions (eda group, dict-no-throw):
- render_automatic_eda_pdf / render_automatic_eda_pptx: accept chapters or a TableProfile (building the chapters with build_document) and write the manifest. Additive: they do not replace render_eda_pdf.

Self-contained tests (no DuckDB) for both renderers: golden (portada + overview), splitting of long tables with the header repeated, no cutting of long cells or long markdown, profile None/{} yielding a valid 1-page/1-slide document, and the error path for a non-writable directory. 23 tests green (including the existing render_eda_pdf tests, untouched). New dependency python-pptx>=1.0.2 declared in python/pyproject.toml.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-30 14:30:31 +02:00
parent 5501507588
commit 9cdde4a341
17 changed files with 2563 additions and 0 deletions
@@ -0,0 +1,89 @@
+"""Chapter registry — the canonical order of an AutomaticEDA document.
+
+``CHAPTER_ORDER`` declares every chapter the engine will *ever* place, in the
+order they appear in the document. Each id maps by convention to a module
+``automatic_eda/chapters/<id>.py`` exposing ``build_<id>(profile, ctx) ->
+Chapter | None`` and a ``CHAPTER_VERSION`` constant.
+
+This pre-declared order is what lets many agents add chapters in parallel
+without contention: an agent only creates its own ``chapters/<id>.py`` module —
+it never edits this file. ``build_document`` imports each chapter lazily; a
+chapter whose module does not exist yet (not implemented) is simply skipped, so
+the document is always renderable with whatever chapters are present today.
+
+``build_document`` never raises: a chapter that errors out is dropped, and a
+chapter that returns ``None`` (does not apply to this dataset, e.g. time series
+on a dataset with no date column) or has no blocks is omitted.
+"""
+
+from __future__ import annotations
+
+import importlib
+
+from . import model
+
+# Canonical document order. Implemented today: portada, overview. The rest are
+# placeholders other agents will fill by creating chapters/<id>.py — they will
+# appear in this exact position automatically once their module exists.
+CHAPTER_ORDER = [
+    "portada",       # cover
+    "overview",      # df.head + columns/types/nulls/examples + describe
+    "num_distr",     # numeric distributions
+    "cat_distr",     # categorical distributions
+    "calidad",       # data quality
+    "correlacion",   # correlations / associations
+    "modelos",       # cheap models (PCA/KMeans/outliers)
+    "analisis_llm",  # LLM interpretation
+    "timeseries",    # time-series analysis
+    "geospatial",    # geospatial
+    "agregacion",    # aggregations / pivots
+]
+
+
+def build_chapter(chapter_id: str, profile: dict, ctx: dict):
+    """Build a single chapter by id, or None if absent/not-applicable/error.
+
+    Looks up ``automatic_eda.chapters.<chapter_id>`` and calls its
+    ``build_<chapter_id>(profile, ctx)``. Returns a normalized Chapter, or None
+    when the module is missing, the builder returns None, or anything raises.
+    """
+    mod_name = f"{__package__}.chapters.{chapter_id}"
+    try:
+        mod = importlib.import_module(mod_name)
+    except Exception:  # noqa: BLE001 - missing or unimportable chapter: skip.
+        return None
+    builder = getattr(mod, f"build_{chapter_id}", None)
+    if builder is None:
+        return None
+    try:
+        result = builder(profile or {}, ctx or {})
+    except Exception:  # noqa: BLE001 — a broken chapter never aborts the doc.
+        return None
+    return model.as_chapter(result)
+
+
+def build_document(profile: dict | None, ctx: dict | None = None) -> list:
+    """Build the full ordered list of chapters for a TableProfile.
+
+    Args:
+        profile: the ``eda`` group TableProfile dict (may be None/empty).
+        ctx: optional context dict carrying presentation metadata not present in
+            the profile (dataset_name, source_origin, storage, generated_at,
+            description, granularity, quality_criteria, head_rows, ...).
+
+    Returns:
+        list[Chapter] in canonical order, containing only the chapters that are
+        implemented and applicable. Never raises.
+    """
+    if profile is None:
+        profile = {}
+    if not isinstance(profile, dict):
+        profile = {}
+    if ctx is None:
+        ctx = {}
+    chapters = []
+    for cid in CHAPTER_ORDER:
+        ch = build_chapter(cid, profile, ctx)
+        if ch is not None and ch.blocks:
+            chapters.append(ch)
+    return chapters
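
The convention-based discovery in this file can be tried in isolation. The following is a minimal sketch, not the module above: the package name `demo_eda_chapters`, the in-memory modules, the dict-shaped chapters, and the `skipped` report are all hypothetical stand-ins chosen so the example runs without files on disk or the real `model.Chapter`. Unlike `build_document`, which drops failing chapters without reporting them, this variant also returns the reason each chapter was skipped, which may help when debugging a new chapter module.

```python
import importlib
import sys
import types

PKG = "demo_eda_chapters"  # hypothetical package name, demo only
ORDER = ["portada", "overview", "timeseries"]  # hypothetical subset of ids


def register_demo_modules():
    """Register in-memory chapter modules so the demo needs no files."""
    pkg = types.ModuleType(PKG)
    pkg.__path__ = []  # marks it as a package with no on-disk submodules
    sys.modules[PKG] = pkg

    def build_portada(profile, ctx):
        return {"id": "portada", "blocks": ["title"]}

    def build_overview(profile, ctx):
        raise ValueError("simulated chapter failure")

    # "timeseries" is deliberately left unregistered (not implemented yet).
    for cid, builder in (("portada", build_portada), ("overview", build_overview)):
        mod = types.ModuleType(f"{PKG}.{cid}")
        setattr(mod, f"build_{cid}", builder)
        sys.modules[mod.__name__] = mod


def discover(order, profile, ctx):
    """Build chapters in declared order; never raises.

    Returns (built, skipped), where skipped maps chapter id -> reason.
    """
    built, skipped = [], {}
    for cid in order:
        try:
            mod = importlib.import_module(f"{PKG}.{cid}")
        except ImportError:
            skipped[cid] = "module missing"
            continue
        builder = getattr(mod, f"build_{cid}", None)
        if builder is None:
            skipped[cid] = "no builder"
            continue
        try:
            chapter = builder(profile or {}, ctx or {})
        except Exception as exc:  # a broken chapter must not abort the document
            skipped[cid] = f"builder raised {type(exc).__name__}"
            continue
        if chapter is None:
            skipped[cid] = "not applicable"
            continue
        built.append(chapter)
    return built, skipped


register_demo_modules()
built, skipped = discover(ORDER, {}, {})
print([c["id"] for c in built])
print(skipped)
```

Because the order list is fixed up front, adding a `timeseries` module later would place that chapter in its declared position without touching `ORDER` or `discover`.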