feat(eda): AutomaticEDA core, chapter-based document + no-clipping PDF/PPTX renderers
Introduces the intermediate layer between the content of an EDA and its
output format. A document is a list of versioned chapters; each chapter is
an ordered set of format-independent blocks (heading, markdown, kv_table,
data_table, figure, image, caption, note).
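As a rough illustration of the document model described above, the following sketch shows chapters made of typed blocks plus a defensive normalizer that accepts a dataclass or a dict and returns `None` instead of raising. The field names (`kind`, `payload`, `title`, `version`) and the `*_sketch` helpers are assumptions made for this example; the actual definitions live in `model.py` and may differ.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any

# Block kinds listed in the commit message.
BLOCK_KINDS = ("heading", "markdown", "kv_table", "data_table",
               "figure", "image", "caption", "note")


@dataclass
class Block:
    kind: str                                   # one of BLOCK_KINDS
    payload: dict[str, Any] = field(default_factory=dict)


@dataclass
class Chapter:
    title: str
    version: str = "1"                          # hypothetical default
    blocks: list[Block] = field(default_factory=list)


def as_block_sketch(obj: Any) -> Block | None:
    """Accept a Block or a dict; return None for anything unusable."""
    if isinstance(obj, Block):
        return obj if obj.kind in BLOCK_KINDS else None
    if isinstance(obj, dict) and obj.get("kind") in BLOCK_KINDS:
        payload = {k: v for k, v in obj.items() if k != "kind"}
        return Block(kind=obj["kind"], payload=payload)
    return None


def as_chapter_sketch(obj: Any) -> Chapter | None:
    """Accept a Chapter or a dict with a 'blocks' list; never raise."""
    if isinstance(obj, Chapter):
        return obj
    if isinstance(obj, dict) and isinstance(obj.get("blocks"), (list, tuple)):
        # Invalid blocks are dropped silently rather than aborting the chapter.
        blocks = [b for b in map(as_block_sketch, obj["blocks"]) if b is not None]
        return Chapter(title=str(obj.get("title", "")),
                       version=str(obj.get("version", "1")),
                       blocks=blocks)
    return None
```

Dropping invalid blocks instead of raising keeps a partially malformed chapter renderable, which matches the "never raise" intent stated for the normalizers.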
Core (support package python/functions/datascience/automatic_eda/):
- model.py: block dataclasses + Chapter, defensive normalizers (accept a
  dataclass or a dict, never raise), ENGINE_VERSION and the per-chapter
  manifest (automatic_eda_manifest.json).
- text_layout.py: shared character-grid measurement/wrapping.
- chapters_registry.py: pre-declared CHAPTER_ORDER + build_document with
  convention-based chapter auto-discovery (lets chapters be added in
  parallel without editing the registry).
- render_pdf_impl.py: mobile-oriented A5-portrait paginator that MEASURES
  every block and never cuts content: text breaks at whole lines, long tables
  are split by rows with the header repeated, figures/images are scaled to
  fit whole. Versioned footer per chapter.
- render_pptx_impl.py: same principle on 16:9 slides (overflow continues on a
  "(cont.)" slide; tables repeat the header; figures are exported to PNG and
  scaled).
- chapters/portada.py and chapters/overview.py: reference chapters. Cover
  (portada) with name, Automatic-EDA label, source, storage (inferred from
  source), European-format date, rows×cols, description, granularity and
  quality with criteria. Overview with df.head (honest placeholder if
  head_rows is missing), column dictionary (type/nulls/examples) and numeric
  describe.
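The "measure, then split by rows repeating the header" idea behind the paginator can be sketched as follows. This is a simplified illustration on a character grid using the standard `textwrap` module, not the code in `render_pdf_impl.py`; the function names, the `col_widths` parameter and the `lines_per_page` budget are assumptions made for the example.

```python
import textwrap


def cell_lines(text, width):
    """Wrap one cell to whole lines on a character grid."""
    return textwrap.wrap(str(text), width=width) or [""]


def row_height(row, col_widths):
    """A row is as tall as its tallest wrapped cell."""
    return max(len(cell_lines(c, w)) for c, w in zip(row, col_widths))


def paginate_table(header, rows, col_widths, lines_per_page):
    """Greedily fill pages; every page starts with the header row."""
    header_h = row_height(header, col_widths)
    pages = []
    current, used = [header], header_h
    for row in rows:
        h = row_height(row, col_widths)
        # Start a new page when the row does not fit, unless the page
        # holds only the header (then the row is placed anyway).
        if used + h > lines_per_page and len(current) > 1:
            pages.append(current)
            current, used = [header], header_h
        current.append(row)
        used += h
    pages.append(current)
    return pages
```

A row taller than a whole page is still placed on its own page in this sketch; a real paginator would need an extra rule for that case.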
Public registry functions (eda group, dict-no-throw):
- render_automatic_eda_pdf / render_automatic_eda_pptx: accept chapters or a
  TableProfile (building the chapters with build_document) and write the
  manifest. Additive: they do not replace render_eda_pdf.
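The dict-no-throw convention mentioned above can be illustrated with a small decorator that turns any exception into a result dict with `path` set to `None` and an explanatory `note`. This is only a sketch of the pattern; `dict_no_throw` and `write_report` are hypothetical names, and the actual registry functions may implement the convention differently.

```python
import functools
import os


def dict_no_throw(func):
    """Wrap func so that callers always get a dict and never an exception."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            result = func(*args, **kwargs)
        except Exception as exc:
            # Report the failure through the result instead of raising.
            return {"path": None, "note": f"{type(exc).__name__}: {exc}"}
        if not isinstance(result, dict):
            return {"path": None, "note": "unexpected non-dict result"}
        return result
    return wrapper


@dict_no_throw
def write_report(text, out_path):
    """Toy writer standing in for a renderer (hypothetical)."""
    os.makedirs(os.path.dirname(os.path.abspath(out_path)), exist_ok=True)
    with open(out_path, "w", encoding="utf-8") as fh:
        fh.write(text)
    return {"path": out_path, "note": ""}
```

Callers then branch on `result["path"]` rather than wrapping each call in `try`/`except`.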
Self-contained tests (no DuckDB) for both renderers: golden (cover +
overview), splitting of long tables with the header repeated, no clipping of
long cells and long markdown, profile None/{} producing a valid 1-page/1-slide
output, and the error path for a non-writable directory. 23 tests green
(including the existing render_eda_pdf tests, untouched).
New dependency python-pptx>=1.0.2 declared in python/pyproject.toml.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@@ -0,0 +1,83 @@
"""render_automatic_eda_pdf: chapter-based EDA report as an A5-portrait PDF.

Public ``eda``-group entry point of the AutomaticEDA engine. Takes either a list
of chapters (the format-independent document model) or an ``eda`` TableProfile
dict (in which case the canonical chapters are built with ``build_document``),
and renders a mobile-first PDF whose paginator MEASURES every block and never
cuts text, tables or images: text wraps to whole lines, long tables split by
rows repeating the header, figures/images scale to fit entirely. Each chapter
starts on a fresh page stamped ``<Chapter> · v<version>`` in the footer, and a
per-chapter manifest (``automatic_eda_manifest.json``) is written next to the
output for version tracking.

dict-no-throw: never raises. Returns ``{path, n_pages, chapters, manifest_path,
note}``; on a fatal write error ``path`` is None and ``note`` explains why.

Additive: this does NOT replace ``render_eda_pdf`` (still used by
``profile_table(emit_pdf=True)``). It is the new engine that will, in the next
phase, let every EDA emit both a PDF and a PPTX from the same chapter model.
"""

from __future__ import annotations

import os

from datascience.automatic_eda import build_document, merge_manifest, render_pdf
from datascience.automatic_eda.model import as_chapter, as_chapters


def _coerce_chapters(chapters_or_profile, meta: dict) -> list:
    """Accept chapters OR an eda profile and return a list of Chapter."""
    arg = chapters_or_profile
    if isinstance(arg, (list, tuple)):
        return as_chapters(list(arg))
    if isinstance(arg, dict):
        # A single chapter dict has 'blocks'; a profile has columns/table/rows.
        if "blocks" in arg and "columns" not in arg:
            ch = as_chapter(arg)
            return [ch] if ch is not None else []
        # Treat as an eda TableProfile.
        return build_document(arg, (meta or {}).get("ctx"))
    return []


def render_automatic_eda_pdf(chapters_or_profile, out_path: str,
                             meta: dict | None = None) -> dict:
    """Render an AutomaticEDA document into a mobile-readable PDF.

    Args:
        chapters_or_profile: either a list of chapters (``Chapter`` dataclasses
            or dicts following the document model) or an ``eda`` TableProfile
            dict; in the latter case the canonical chapters are built via
            ``build_document(profile, meta['ctx'])``.
        out_path: filesystem path for the PDF (parent dirs are created).
        meta: optional dict. Recognised keys: ``title`` (cover/footer title),
            ``ctx`` (presentation context passed to chapter builders when a
            profile is given), ``manifest_path`` (override; defaults to
            ``automatic_eda_manifest.json`` beside ``out_path``),
            ``write_manifest`` (set False to skip), ``generated_at``.

    Returns:
        dict (never raises): ``{path, n_pages, chapters, manifest_path, note}``.
    """
    meta = dict(meta or {})
    chapters = _coerce_chapters(chapters_or_profile, meta)
    result = render_pdf(chapters, out_path, meta)

    manifest_path = None
    # Only write the manifest when the PDF itself was written successfully.
    if meta.get("write_manifest", True) and result.get("path"):
        manifest_path = meta.get("manifest_path")
        if not manifest_path:
            manifest_path = os.path.join(
                os.path.dirname(os.path.abspath(out_path)),
                "automatic_eda_manifest.json")
        generated_at = meta.get("generated_at") or _now_iso()
        merge_manifest(manifest_path, "pdf", result.get("chapters") or [],
                       generated_at)
    result["manifest_path"] = manifest_path
    return result


def _now_iso() -> str:
    from datetime import datetime, timezone
    return datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%S UTC")
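The diff calls `merge_manifest(manifest_path, "pdf", chapters, generated_at)` but does not show its body. The sketch below is one plausible way such a merge could work, so that the PDF and PPTX renderers share a single `automatic_eda_manifest.json` without overwriting each other's entries. The JSON layout (a top-level `formats` key) and the name `merge_manifest_sketch` are assumptions for illustration, not the actual implementation.

```python
import json


def merge_manifest_sketch(manifest_path, fmt, chapters, generated_at):
    """Merge one renderer's chapter list into a shared JSON manifest.

    Returns the merged manifest dict, or None if it could not be written.
    Assumes the entries in ``chapters`` are JSON-serialisable.
    """
    manifest = {}
    try:
        with open(manifest_path, encoding="utf-8") as fh:
            loaded = json.load(fh)
        if isinstance(loaded, dict):
            manifest = loaded
    except (OSError, ValueError):
        pass  # missing or corrupt manifest: start from an empty one

    formats = manifest.get("formats")
    if not isinstance(formats, dict):
        formats = {}
    manifest["formats"] = formats
    # Replace only this format's entry; other formats are preserved.
    formats[fmt] = {"generated_at": generated_at, "chapters": list(chapters)}

    try:
        with open(manifest_path, "w", encoding="utf-8") as fh:
            json.dump(manifest, fh, ensure_ascii=False, indent=2)
    except OSError:
        return None
    return manifest
```

Keying the manifest by output format means a later PPTX run adds its own entry next to the PDF one instead of replacing it.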