feat(eda): AutomaticEDA core: chapter-based document + anti-cut PDF/PPTX renderers

Introduces the intermediate layer between an EDA's content and its output format. A document is a list of versioned chapters; each chapter is an ordered set of format-independent blocks (heading, markdown, kv_table, data_table, figure, image, caption, note).

Core (support package python/functions/datascience/automatic_eda/):
- model.py: block dataclasses + Chapter, defensive normalizers (accept a dataclass or a dict, never raise), ENGINE_VERSION and the per-chapter manifest (automatic_eda_manifest.json).
- text_layout.py: measurement/wrapping on a shared character grid.
- chapters_registry.py: pre-declared CHAPTER_ORDER + build_document with convention-based auto-discovery of chapters (chapters can be added in parallel without editing the registry).
- render_pdf_impl.py: mobile A5-portrait paginator that MEASURES every block and never cuts: text breaks on whole lines, long tables are split by rows with the header repeated, figures/images are scaled to fit whole. Per-chapter versioned footer.
- render_pptx_impl.py: same principle on 16:9 slides (content continues on a "(cont.)" slide; tables repeat the header; figures are exported as scaled PNGs).
- chapters/portada.py and chapters/overview.py: reference chapters. The cover (portada) shows name, Automatic-EDA label, source, storage (inferred from source), European-format date, rows×cols, description, granularity and quality with criteria. The overview shows df.head (an honest placeholder if head_rows is missing), a column dictionary (type/nulls/examples) and the numeric describe.

Public registry functions (eda group, dict-no-throw):
- render_automatic_eda_pdf / render_automatic_eda_pptx: accept chapters or a TableProfile (in which case the chapters are built with build_document) and write the manifest. Additive: they do not replace render_eda_pdf.

Self-contained tests (no DuckDB) for both renderers: golden (cover + overview), long-table splitting with the header repeated, no cutting of long cells or long markdown, a None/{} profile producing a valid 1-page/1-slide output, and the error path for a non-writable directory. 23 tests green (including the earlier render_eda_pdf tests, untouched).

New dependency python-pptx>=1.0.2 declared in python/pyproject.toml.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-30 14:30:31 +02:00
parent 5501507588
commit 9cdde4a341
17 changed files with 2563 additions and 0 deletions
@@ -0,0 +1,83 @@
+"""render_automatic_eda_pdf — chapter-based EDA report as an A5-portrait PDF.
+
+Public ``eda``-group entry point of the AutomaticEDA engine. Takes either a list
+of chapters (the format-independent document model) or an ``eda`` TableProfile
+dict (in which case the canonical chapters are built with ``build_document``),
+and renders a mobile-first PDF whose paginator MEASURES every block and never
+cuts text, tables or images: text wraps to whole lines, long tables split by
+rows repeating the header, figures/images scale to fit entirely. Each chapter
+starts on a fresh page stamped ``<Chapter> · v<version>`` in the footer, and a
+per-chapter manifest (``automatic_eda_manifest.json``) is written next to the
+output for version tracking.
+
+dict-no-throw: never raises. Returns ``{path, n_pages, chapters, manifest_path,
+note}``; on a fatal write error ``path`` is None and ``note`` explains why.
+
+Additive: this does NOT replace ``render_eda_pdf`` (still used by
+``profile_table(emit_pdf=True)``). It is the new engine that will, in the next
+phase, let every EDA emit both a PDF and a PPTX from the same chapter model.
+"""
+
+from __future__ import annotations
+
+import os
+
+from datascience.automatic_eda import build_document, merge_manifest, render_pdf
+from datascience.automatic_eda.model import as_chapter, as_chapters
+
+
+def _coerce_chapters(chapters_or_profile, meta: dict) -> list:
+    """Accept chapters OR an eda profile and return a list of Chapter."""
+    arg = chapters_or_profile
+    if isinstance(arg, (list, tuple)):
+        return as_chapters(list(arg))
+    if isinstance(arg, dict):
+        # A single chapter dict has 'blocks'; a profile has columns/table/rows.
+        if "blocks" in arg and "columns" not in arg:
+            ch = as_chapter(arg)
+            return [ch] if ch is not None else []
+        # Treat as an eda TableProfile.
+        return build_document(arg, (meta or {}).get("ctx"))
+    return []
+
+
+def render_automatic_eda_pdf(chapters_or_profile, out_path: str,
+                             meta: dict | None = None) -> dict:
+    """Render an AutomaticEDA document into a mobile-readable PDF.
+
+    Args:
+        chapters_or_profile: either a list of chapters (``Chapter`` dataclasses
+            or dicts following the document model) or an ``eda`` TableProfile
+            dict — in the latter case the canonical chapters are built via
+            ``build_document(profile, meta['ctx'])``.
+        out_path: filesystem path for the PDF (parent dirs are created).
+        meta: optional dict. Recognised keys: ``title`` (cover/footer title),
+            ``ctx`` (presentation context passed to chapter builders when a
+            profile is given), ``manifest_path`` (override; defaults to
+            ``automatic_eda_manifest.json`` beside ``out_path``),
+            ``write_manifest`` (set False to skip), ``generated_at``.
+
+    Returns:
+        dict (never raises): ``{path, n_pages, chapters, manifest_path, note}``.
+    """
+    meta = dict(meta or {})
+    chapters = _coerce_chapters(chapters_or_profile, meta)
+    result = render_pdf(chapters, out_path, meta)
+
+    manifest_path = None
+    if meta.get("write_manifest", True) and result.get("path"):
+        manifest_path = meta.get("manifest_path")
+        if not manifest_path:
+            manifest_path = os.path.join(
+                os.path.dirname(os.path.abspath(out_path)),
+                "automatic_eda_manifest.json")
+        generated_at = meta.get("generated_at") or _now_iso()
+        merge_manifest(manifest_path, "pdf", result.get("chapters") or [],
+                       generated_at)
+    result["manifest_path"] = manifest_path
+    return result
+
+
+def _now_iso() -> str:
+    from datetime import datetime, timezone
+    return datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%S UTC")
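
The input dispatch in `_coerce_chapters` and the default manifest location can be exercised without the `datascience` package. The sketch below mirrors the two rules as they appear in the diff; the names `classify_input` and `default_manifest_path` are hypothetical stand-ins, and the returned labels replace the real calls to `as_chapters`, `as_chapter` and `build_document`.

```python
import os


def classify_input(arg) -> str:
    """Return which branch of the dispatch a given argument would take.

    Hypothetical stand-in: labels replace the real normalizer calls.
    """
    if isinstance(arg, (list, tuple)):
        return "chapters"          # list of Chapter dataclasses or dicts
    if isinstance(arg, dict):
        # A single chapter dict has 'blocks' and no 'columns'.
        if "blocks" in arg and "columns" not in arg:
            return "chapter"
        return "profile"           # any other dict is treated as a TableProfile
    return "empty"                 # unsupported input yields no chapters


def default_manifest_path(out_path: str) -> str:
    """Manifest path used when meta['manifest_path'] is not given."""
    return os.path.join(os.path.dirname(os.path.abspath(out_path)),
                        "automatic_eda_manifest.json")
```

Under this rule an empty dict `{}` falls into the profile branch, which is consistent with the commit message's note that a `None`/`{}` profile still produces a valid one-page document.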