feat(eda): AutomaticEDA core: chapter-based document + no-cut PDF/PPTX renderers

Introduces the intermediate layer between the content of an EDA and its
output format. A document is a list of versioned chapters; each chapter is
an ordered set of format-independent blocks (heading, markdown, kv_table,
data_table, figure, image, caption, note).

Core (support package python/functions/datascience/automatic_eda/):
- model.py: block dataclasses + Chapter, defensive normalizers (they accept
  a dataclass or a dict and never raise), ENGINE_VERSION and the per-chapter
  manifest (automatic_eda_manifest.json).
- text_layout.py: measurement/wrapping on a shared character grid.
- chapters_registry.py: pre-declared CHAPTER_ORDER + build_document with
  convention-based auto-discovery of chapters (chapters can be added in
  parallel without editing the registry).
- render_pdf_impl.py: mobile-portrait A5 paginator that MEASURES every block
  and never cuts: text in whole lines, long tables split by rows with the
  header repeated, figures/images scaled to fit whole. Per-chapter versioned
  footer.
- render_pptx_impl.py: same principle on 16:9 slides (content continues on a
  "(cont.)" slide; tables repeat the header; figures are exported to PNG and
  scaled).
- chapters/portada.py and chapters/overview.py: reference chapters. Cover
  (portada) with name, Automatic-EDA label, source, storage (inferred from
  source), European-format date, rows×cols, description, granularity and
  quality with criteria. Overview with df.head (honest placeholder if
  head_rows is missing), column dictionary (type/nulls/examples) and numeric
  describe.

Public registry functions (eda group, dict-no-throw):
- render_automatic_eda_pdf / render_automatic_eda_pptx: accept chapters or a
  TableProfile (in which case they build the chapters with build_document)
  and write the manifest. Additive: they do not replace render_eda_pdf.
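
One way to read the "dict-no-throw" convention is that a public function always returns a dict and reports failures inside it instead of raising. The sketch below is only an assumption about that pattern: the decorator name `dict_no_throw`, the `ok` and `error` keys, and the toy `write_report` function are all hypothetical and do not come from this commit.

```python
import functools
import os
import tempfile


def dict_no_throw(func):
    """Wrap `func` so it returns a dict and never raises (hypothetical helper)."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            result = func(*args, **kwargs)
        except Exception as exc:  # report the failure instead of raising
            return {"ok": False, "error": f"{type(exc).__name__}: {exc}"}
        # Non-dict results are wrapped so the caller always gets a dict.
        payload = result if isinstance(result, dict) else {"result": result}
        return {"ok": True, **payload}

    return wrapper


@dict_no_throw
def write_report(path: str, text: str) -> dict:
    """Toy stand-in for a renderer: write `text` to `path`."""
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(text)
    return {"path": path, "bytes": len(text.encode("utf-8"))}


# Success: the target directory exists.
good = write_report(os.path.join(tempfile.mkdtemp(), "r.txt"), "hola")
# Failure: the "missing" subdirectory does not exist, so open() fails.
bad = write_report(os.path.join(tempfile.mkdtemp(), "missing", "r.txt"), "hola")
```

With this shape the caller checks `result["ok"]` rather than wrapping every call in `try/except`, which is the kind of error path a non-writable directory test would exercise.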

Self-contained tests (no DuckDB) for both renderers: golden (cover +
overview), splitting of long tables with the header repeated, no cutting of
long cells or long markdown, a None/{} profile producing a valid 1-page/1-slide
output, and the error path on a non-writable directory. 23 tests passing
(including the earlier render_eda_pdf tests, untouched).

New dependency python-pptx>=1.0.2 declared in python/pyproject.toml.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@@ -0,0 +1,89 @@
"""Chapter registry: the canonical order of an AutomaticEDA document.

``CHAPTER_ORDER`` declares every chapter the engine will *ever* place, in the
order they appear in the document. Each id maps by convention to a module
``automatic_eda/chapters/<id>.py`` exposing ``build_<id>(profile, ctx) ->
Chapter | None`` and a ``CHAPTER_VERSION`` constant.

This pre-declared order is what lets many agents add chapters in parallel
without contention: an agent only creates its own ``chapters/<id>.py`` module
and never edits this file. ``build_document`` imports each chapter lazily; a
chapter whose module does not exist yet (not implemented) is simply skipped, so
the document is always renderable with whatever chapters are present today.

``build_document`` never raises: a chapter that errors out is dropped silently
(``build_chapter`` returns ``None``), and a chapter that returns ``None`` (not
applicable to this dataset, e.g. time series with no date column) is omitted.
"""

from __future__ import annotations

import importlib

from . import model

# Canonical document order. Implemented today: portada, overview. The rest are
# placeholders other agents will fill by creating chapters/<id>.py; they will
# appear in this exact position automatically once their module exists.
CHAPTER_ORDER = [
    "portada",       # cover
    "overview",      # df.head + columns/types/nulls/examples + describe
    "num_distr",     # numeric distributions
    "cat_distr",     # categorical distributions
    "calidad",       # data quality
    "correlacion",   # correlations / associations
    "modelos",       # cheap models (PCA/KMeans/outliers)
    "analisis_llm",  # LLM interpretation
    "timeseries",    # time-series analysis
    "geospatial",    # geospatial
    "agregacion",    # aggregations / pivots
]


def build_chapter(chapter_id: str, profile: dict, ctx: dict):
    """Build a single chapter by id, or None if absent/not-applicable/error.

    Looks up ``automatic_eda.chapters.<chapter_id>`` and calls its
    ``build_<chapter_id>(profile, ctx)``. Returns a normalized Chapter, or None
    when the module is missing, the builder returns None, or anything raises.
    """
    mod_name = f"{__package__}.chapters.{chapter_id}"
    try:
        mod = importlib.import_module(mod_name)
    except Exception:  # noqa: BLE001 - chapter not implemented yet → skip.
        return None
    builder = getattr(mod, f"build_{chapter_id}", None)
    if builder is None:
        return None
    try:
        result = builder(profile or {}, ctx or {})
    except Exception:  # noqa: BLE001 - a broken chapter never aborts the doc.
        return None
    return model.as_chapter(result)


def build_document(profile: dict | None, ctx: dict | None = None) -> list:
    """Build the full ordered list of chapters for a TableProfile.

    Args:
        profile: the ``eda`` group TableProfile dict (may be None/empty).
        ctx: optional context dict carrying presentation metadata not present in
            the profile (dataset_name, source_origin, storage, generated_at,
            description, granularity, quality_criteria, head_rows, ...).

    Returns:
        list[Chapter] in canonical order, containing only the chapters that are
        implemented and applicable. Never raises.
    """
    if profile is None:
        profile = {}
    if not isinstance(profile, dict):
        profile = {}
    if ctx is None:
        ctx = {}
    chapters = []
    for cid in CHAPTER_ORDER:
        ch = build_chapter(cid, profile, ctx)
        if ch is not None and ch.blocks:
            chapters.append(ch)
    return chapters
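
The lookup convention in the module docstring (import `chapters/<id>.py` lazily, call `build_<id>(profile, ctx)`, treat a missing module or a `None` result as "skip") can be tried out in isolation. The sketch below is a standalone illustration, not part of this commit: the `demo_chapters` package name, the `discover` helper, the `numeric_columns` profile key and the dict keys returned by the builder are all assumptions. The fake module is registered in `sys.modules` only so the example runs without files on disk.

```python
import importlib
import sys
import types

# Hypothetical stand-in for a chapters/<id>.py module.
demo = types.ModuleType("demo_chapters.num_distr")
demo.CHAPTER_VERSION = "0.1.0"


def build_num_distr(profile, ctx):
    """Return a chapter as a plain dict, or None when it does not apply."""
    columns = profile.get("numeric_columns") or []  # assumed profile key
    if not columns:
        return None
    return {
        "id": "num_distr",
        "version": demo.CHAPTER_VERSION,
        "blocks": [{"type": "heading", "text": "Numeric distributions"}],
    }


demo.build_num_distr = build_num_distr
sys.modules["demo_chapters.num_distr"] = demo


def discover(chapter_id, profile, ctx=None, package="demo_chapters"):
    """Import by id, call build_<id>, and swallow any error as None."""
    try:
        mod = importlib.import_module(f"{package}.{chapter_id}")
        builder = getattr(mod, f"build_{chapter_id}")
        return builder(profile or {}, ctx or {})
    except Exception:
        # Missing module, missing builder, or a builder that raised.
        return None


applies = discover("num_distr", {"numeric_columns": ["price"]})
not_applicable = discover("num_distr", {})
not_implemented = discover("geospatial", {})
```

Only the first call yields a chapter. The other two return `None`, one because the builder decides the chapter does not apply and one because no such module exists, which is why new chapters can land without touching the registry.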