feat(eda): AutomaticEDA core, chapter-based document + no-cut PDF/PPTX renderers

Introduces the intermediate layer between the content of an EDA and its
output format. A document is a list of versioned chapters; each chapter is
an ordered set of format-independent blocks (heading, markdown, kv_table,
data_table, figure, image, caption, note).
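As a rough illustration of the shape described above, a chapter can be written as plain dicts, one per block, each tagged with a `kind`. This is a minimal sketch, not the package's API: `count_kinds` and `KNOWN_KINDS` are hypothetical names introduced here, and the field names simply follow the dataclasses in model.py.

```python
from collections import Counter

# A document is an ordered list of chapters; a chapter is
# {id, title, version, blocks}, and every block carries a "kind".
document = [
    {
        "id": "overview",
        "title": "Overview",
        "version": "1.0.0",
        "blocks": [
            {"kind": "heading", "text": "Overview", "level": 1},
            {"kind": "markdown", "text": "First rows of the table."},
            {"kind": "data_table", "header": ["col", "type"],
             "rows": [["id", "int"], ["name", "str"]]},
            {"kind": "note", "text": "head_rows not provided."},
        ],
    },
]

# The closed set of block kinds listed in the commit message.
KNOWN_KINDS = {"heading", "markdown", "kv_table", "data_table",
               "figure", "image", "caption", "note"}


def count_kinds(doc):
    """Hypothetical helper: tally blocks per kind.

    Anything that is not a dict with a known kind is counted as a note,
    mirroring the "unknown content becomes a Note" rule.
    """
    counts = Counter()
    for chapter in doc:
        for block in chapter.get("blocks", []):
            kind = block.get("kind") if isinstance(block, dict) else None
            counts[kind if kind in KNOWN_KINDS else "note"] += 1
    return dict(counts)


print(count_kinds(document))
```

Because blocks are plain data with a stable `kind`, the same document can be handed to either renderer without the chapter knowing which format it will end up in.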

Core (support package python/functions/datascience/automatic_eda/):
- model.py: block dataclasses + Chapter, defensive normalizers (accept a
  dataclass or a dict, never raise), ENGINE_VERSION and the per-chapter
  manifest (automatic_eda_manifest.json).
- text_layout.py: shared character-grid text measurement/wrapping.
- chapters_registry.py: pre-declared CHAPTER_ORDER + build_document with
  convention-based chapter auto-discovery (lets chapters be added in
  parallel without editing the registry).
- render_pdf_impl.py: mobile-oriented A5 portrait paginator that MEASURES
  every block and never cuts: text in whole lines, long tables split by rows
  with the header repeated, figures/images scaled to fit whole. Versioned
  footer per chapter.
- render_pptx_impl.py: same principle on 16:9 slides (continues on a
  "(cont.)" slide; tables repeat the header; figures exported to PNG and
  scaled).
- chapters/portada.py and chapters/overview.py: reference chapters. The cover
  (portada) carries name, Automatic-EDA label, source, storage (inferred
  from source), European-format date, rows×cols, description, granularity
  and quality with criteria. The overview carries df.head (an honest
  placeholder if head_rows is missing), a column dictionary
  (type/nulls/examples) and a numeric describe.
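The two no-cut rules above (text in whole lines, long tables split by rows with the header repeated) can be sketched as follows. This is only an approximation under stated assumptions: `split_table` and `paginate_lines` are hypothetical names, and a fixed row/line budget per page stands in for the measured block heights that the renderers are described as using.

```python
import textwrap


def split_table(header, rows, max_rows):
    """Split rows into chunks of at most max_rows, each paired with the header.

    Every chunk carries the same header so a continuation page or slide
    can repeat it. An empty table still yields one header-only chunk.
    """
    if max_rows < 1:
        raise ValueError("max_rows must be at least 1")
    if not rows:
        return [(header, [])]
    return [(header, rows[i:i + max_rows])
            for i in range(0, len(rows), max_rows)]


def paginate_lines(text, chars_per_line, lines_per_page):
    """Wrap text to whole lines, then group the lines into pages.

    Lines are never split across pages, so nothing is cut mid-line.
    """
    lines = textwrap.wrap(text, width=chars_per_line) or [""]
    return [lines[i:i + lines_per_page]
            for i in range(0, len(lines), lines_per_page)]


chunks = split_table(["col"], [[1], [2], [3], [4], [5]], max_rows=2)
for header, part in chunks:
    print(header, part)

pages = paginate_lines("aaa bbb ccc ddd", chars_per_line=7, lines_per_page=1)
print(pages)
```

A real paginator would replace the fixed `max_rows` and `lines_per_page` with the space left on the current page, but the invariant is the same: the unit that moves to the next page is always a whole row or a whole line.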

Public registry functions (eda group, dict-no-throw):
- render_automatic_eda_pdf / render_automatic_eda_pptx: accept chapters or a
  TableProfile (building the chapters with build_document) and write the
  manifest. Additive: they do not replace render_eda_pdf.

Self-contained tests (no DuckDB) for both renderers: golden (cover +
overview), splitting of long tables with the header repeated, no cutting of
long cells or long markdown, profile None/{} still producing a valid
1-page/1-slide output, and the error path for a non-writable directory.
23 tests green (including the existing render_eda_pdf tests, untouched).

New dependency python-pptx>=1.0.2 declared in python/pyproject.toml.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@@ -0,0 +1,310 @@
"""AutomaticEDA document model: format-independent blocks and chapters.

This is the intermediate layer between *content* (what an EDA chapter wants to
say) and *output format* (PDF for mobile reading, PPTX for sharing). A document
is an ordered list of :class:`Chapter`. A chapter is ``{id, title, version,
blocks}``. A block is one of a small, closed set of presentation primitives
(heading, markdown, key/value table, data table, figure, image, caption, note).

Neither renderer knows anything about the EDA profile: they only know how to lay
out blocks so that **nothing is ever cut**: long text wraps to whole lines,
long tables split by rows repeating the header, figures and images are scaled to
fit entirely. Each chapter declares its own ``version`` so every page/slide can
be stamped ``<Chapter> · v<version>`` and tracked in a manifest for continuous,
per-chapter improvement.

Reading is defensive throughout (the ``eda`` group "dict-no-throw" style): the
normalizers accept dataclass blocks *or* plain dicts, coerce anything unknown
into a readable :class:`Note` instead of raising, and the renderers degrade a
malformed block to text rather than crashing the whole document.
"""

from __future__ import annotations

import json
import os
from dataclasses import dataclass, field
from typing import Any, Callable, Optional

# Global engine version. Bump when the document model or a renderer changes in a
# way that affects output. Individual chapters carry their own CHAPTER_VERSION.
ENGINE_VERSION = "1.0.0"
ENGINE_NAME = "AutomaticEDA"


# --------------------------------------------------------------------------- #
# Block primitives. Each carries a stable ``kind`` string so renderers can
# dispatch by kind (works for dataclass instances and for plain dicts alike).
# --------------------------------------------------------------------------- #
@dataclass
class Heading:
    """A section heading. ``level`` 1 (largest) .. 3 (smallest)."""

    text: str = ""
    level: int = 1
    kind: str = field(default="heading", init=False)


@dataclass
class Markdown:
    """A block of light markdown text.

    Supported subset (everything else is rendered verbatim, never dropped):
    ``#``/``##``/``###`` headings, ``-``/``*`` bullet lists, ``| a | b |``
    tables (consecutive pipe lines become a data table), blank lines as
    paragraph breaks, and ``**bold**`` inline markers (markers are stripped, the
    text is kept). Text is wrapped to whole lines so it is never cut mid-line.
    """

    text: str = ""
    kind: str = field(default="markdown", init=False)


@dataclass
class KVTable:
    """A two-column key/value table. ``rows`` is a list of ``(label, value)``."""

    rows: list = field(default_factory=list)
    title: Optional[str] = None
    kind: str = field(default="kv_table", init=False)


@dataclass
class DataTable:
    """A tabular block with a header row.

    If it does not fit in the remaining page/slide space it is split by rows,
    **repeating the header** on each continuation. Long cell text wraps inside
    its column (the row grows taller) so no cell content is ever lost.
    """

    header: list = field(default_factory=list)
    rows: list = field(default_factory=list)  # list[list[Any]]
    title: Optional[str] = None
    note: Optional[str] = None
    kind: str = field(default="data_table", init=False)


@dataclass
class Figure:
    """A matplotlib figure, scaled to fit entirely (never cropped).

    Provide either an already-built ``fig`` (a ``matplotlib.figure.Figure``) or
    a zero-arg ``make`` callable that returns one (lazy: only built when the
    renderer needs it). ``height_in`` is an optional hint for the target height
    on the page; renderers clamp it to the available space preserving aspect.
    """

    fig: Any = None
    make: Optional[Callable[[], Any]] = None
    caption: Optional[str] = None
    height_in: Optional[float] = None
    kind: str = field(default="figure", init=False)


@dataclass
class Image:
    """A raster image (PNG/JPG) by path, scaled to fit entirely."""

    path: str = ""
    caption: Optional[str] = None
    height_in: Optional[float] = None
    kind: str = field(default="image", init=False)


@dataclass
class Caption:
    """Small auxiliary text rendered under a figure/table."""

    text: str = ""
    kind: str = field(default="caption", init=False)


@dataclass
class Note:
    """Small auxiliary note (italic). Also the fallback for unknown content."""

    text: str = ""
    kind: str = field(default="note", init=False)


@dataclass
class Chapter:
    """An ordered set of blocks with an id, a title and a generation version."""

    id: str = ""
    title: str = ""
    version: str = "1.0.0"
    blocks: list = field(default_factory=list)


# --------------------------------------------------------------------------- #
# Defensive normalizers: accept dataclasses OR plain dicts, never raise.
# --------------------------------------------------------------------------- #
_BLOCK_BY_KIND = {
    "heading": Heading,
    "markdown": Markdown,
    "kv_table": KVTable,
    "data_table": DataTable,
    "figure": Figure,
    "image": Image,
    "caption": Caption,
    "note": Note,
}


def as_block(obj: Any):
    """Coerce a value into a block dataclass. Unknown values become a Note."""
    if isinstance(obj, (Heading, Markdown, KVTable, DataTable, Figure, Image,
                        Caption, Note)):
        return obj
    if isinstance(obj, dict):
        kind = obj.get("kind")
        cls = _BLOCK_BY_KIND.get(kind)
        if cls is None:
            return Note(text=_safe_str(obj))
        # Build only with fields the dataclass accepts (ignore extras).
        try:
            if cls is Heading:
                return Heading(text=_safe_str(obj.get("text")),
                               level=int(obj.get("level", 1) or 1))
            if cls is Markdown:
                return Markdown(text=_safe_str(obj.get("text")))
            if cls is KVTable:
                return KVTable(rows=list(obj.get("rows") or []),
                               title=obj.get("title"))
            if cls is DataTable:
                return DataTable(header=list(obj.get("header") or []),
                                 rows=list(obj.get("rows") or []),
                                 title=obj.get("title"), note=obj.get("note"))
            if cls is Figure:
                return Figure(fig=obj.get("fig"), make=obj.get("make"),
                              caption=obj.get("caption"),
                              height_in=obj.get("height_in"))
            if cls is Image:
                return Image(path=_safe_str(obj.get("path")),
                             caption=obj.get("caption"),
                             height_in=obj.get("height_in"))
            if cls is Caption:
                return Caption(text=_safe_str(obj.get("text")))
            if cls is Note:
                return Note(text=_safe_str(obj.get("text")))
        except Exception:  # noqa: BLE001 - never raise on a malformed block.
            return Note(text=_safe_str(obj))
    return Note(text=_safe_str(obj))


def as_blocks(seq: Any) -> list:
    """Normalize an arbitrary sequence into a list of block dataclasses."""
    if seq is None:
        return []
    if not isinstance(seq, (list, tuple)):
        return [as_block(seq)]
    return [as_block(b) for b in seq]


def as_chapter(obj: Any) -> Optional[Chapter]:
    """Coerce a value into a Chapter (or None). Accepts a dict or a Chapter."""
    if obj is None:
        return None
    if isinstance(obj, Chapter):
        obj.blocks = as_blocks(obj.blocks)
        return obj
    if isinstance(obj, dict):
        return Chapter(
            id=_safe_str(obj.get("id")),
            title=_safe_str(obj.get("title")) or _safe_str(obj.get("id")),
            version=_safe_str(obj.get("version")) or "1.0.0",
            blocks=as_blocks(obj.get("blocks")),
        )
    return None


def as_chapters(seq: Any) -> list:
    """Normalize a sequence of chapters, dropping anything that can't coerce."""
    if seq is None:
        return []
    if isinstance(seq, Chapter):
        return [as_chapter(seq)]
    if not isinstance(seq, (list, tuple)):
        return []
    out = []
    for c in seq:
        ch = as_chapter(c)
        if ch is not None:
            out.append(ch)
    return out


def _safe_str(v: Any) -> str:
    """str() that never raises and maps None to ''."""
    if v is None:
        return ""
    try:
        return str(v)
    except Exception:  # noqa: BLE001
        return ""


# --------------------------------------------------------------------------- #
# Manifest: per-chapter versions and page/slide counts for tracking.
# --------------------------------------------------------------------------- #
def merge_manifest(manifest_path: str, renderer: str, chapters_meta: list,
                   generated_at: str,
                   engine_version: str = ENGINE_VERSION) -> dict:
    """Read-modify-write the AutomaticEDA manifest, merging one renderer's run.

    The manifest lives next to the outputs as ``automatic_eda_manifest.json``
    and records, per chapter, its version plus the page count (PDF) and slide
    count (PPTX). Calling either renderer creates or updates it. Never raises:
    on any error returns the in-memory manifest without writing.

    Args:
        manifest_path: path to the JSON manifest to create or update.
        renderer: "pdf" or "pptx"; selects which count key is written.
        chapters_meta: list of ``{"id", "version", "n_pages"|"n_slides"}``.
        generated_at: ISO-ish timestamp string for this run.
        engine_version: AutomaticEDA engine version.

    Returns:
        The merged manifest dict (also written to disk on success).
    """
    data: dict = {}
    try:
        if manifest_path and os.path.exists(manifest_path):
            with open(manifest_path, "r", encoding="utf-8") as fh:
                loaded = json.load(fh)
            if isinstance(loaded, dict):
                data = loaded
    except Exception:  # noqa: BLE001 - a corrupt manifest is overwritten.
        data = {}

    data["engine"] = ENGINE_NAME
    data["engine_version"] = engine_version
    data["generated_at"] = generated_at
    chapters = data.get("chapters")
    if not isinstance(chapters, dict):
        chapters = {}
    count_key = "n_slides" if renderer == "pptx" else "n_pages"
    for cm in chapters_meta or []:
        if not isinstance(cm, dict):
            continue
        cid = cm.get("id")
        if not cid:
            continue
        entry = chapters.get(cid)
        if not isinstance(entry, dict):
            entry = {}
        entry["version"] = cm.get("version") or entry.get("version") or "1.0.0"
        entry[count_key] = cm.get(count_key, cm.get("n_pages", cm.get("n_slides")))
        chapters[cid] = entry
    data["chapters"] = chapters

    try:
        parent = os.path.dirname(os.path.abspath(manifest_path))
        os.makedirs(parent, exist_ok=True)
        with open(manifest_path, "w", encoding="utf-8") as fh:
            json.dump(data, fh, ensure_ascii=False, indent=2, default=str)
    except Exception:  # noqa: BLE001 - never raise from the manifest writer.
        pass
    return data
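
The read-modify-write merge above means a PDF run and a PPTX run over the same output directory end up sharing one entry per chapter. The following is a condensed, self-contained sketch of that behaviour only: `merge_counts` is a hypothetical stand-in written for this example, and it deliberately omits the defensive try/except handling and the engine metadata of `merge_manifest`.

```python
import json
import os
import tempfile


def merge_counts(manifest_path, renderer, chapters_meta):
    """Simplified merge: keep existing per-chapter keys, add this run's count."""
    data = {}
    if os.path.exists(manifest_path):
        with open(manifest_path, "r", encoding="utf-8") as fh:
            data = json.load(fh)
    chapters = data.setdefault("chapters", {})
    # The renderer decides which count key this run is allowed to write.
    count_key = "n_slides" if renderer == "pptx" else "n_pages"
    for cm in chapters_meta:
        entry = chapters.setdefault(cm["id"], {})
        entry["version"] = cm["version"]
        entry[count_key] = cm[count_key]
    with open(manifest_path, "w", encoding="utf-8") as fh:
        json.dump(data, fh, indent=2)
    return data


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "automatic_eda_manifest.json")
    # First run: the PDF renderer records page counts.
    merge_counts(path, "pdf",
                 [{"id": "overview", "version": "1.0.0", "n_pages": 3}])
    # Second run: the PPTX renderer adds slide counts to the same entry.
    merged = merge_counts(path, "pptx",
                          [{"id": "overview", "version": "1.0.0", "n_slides": 2}])

print(merged["chapters"]["overview"])
```

After both runs the `overview` entry holds its version together with `n_pages` and `n_slides`, since each renderer only overwrites its own count key.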