feat(eda): núcleo AutomaticEDA — documento por capítulos + renderers PDF/PPTX anti-corte
Introduce la capa intermedia entre el contenido de un EDA y su formato de
salida. Un documento es una lista de capítulos versionados; cada capítulo es
un conjunto ordenado de bloques (heading, markdown, kv_table, data_table,
figure, image, caption, note) independientes del formato.
Núcleo (paquete de soporte python/functions/datascience/automatic_eda/):
- model.py: dataclasses de bloques + Chapter, normalizadores defensivos
(aceptan dataclass o dict, nunca lanzan), ENGINE_VERSION y el manifiesto
por capítulo (automatic_eda_manifest.json).
- text_layout.py: medición/wrapping por rejilla de caracteres compartida.
- chapters_registry.py: CHAPTER_ORDER pre-declarado + build_document con
auto-discovery de capítulos por convención (permite añadir capítulos en
paralelo sin editar el registro).
- render_pdf_impl.py: paginador A5 retrato móvil que MIDE cada bloque y nunca
corta: texto a líneas completas, tablas largas partidas por filas repitiendo
cabecera, figuras/imágenes escaladas para caber enteras. Pie versionado por
capítulo.
- render_pptx_impl.py: mismo principio sobre slides 16:9 (continúa en slide
"(cont.)"; tablas repiten cabecera; figuras exportadas a PNG escaladas).
- chapters/portada.py y chapters/overview.py: capítulos de referencia. Portada
con nombre, rótulo Automatic-EDA, fuente, almacenamiento (inferido de
source), fecha europea, filas×cols, descripción, granularidad y calidad con
criterios. Overview con df.head (placeholder honesto si falta head_rows),
diccionario de columnas (tipo/nulos/ejemplos) y describe numérico.
Funciones públicas del registry (grupo eda, dict-no-throw):
- render_automatic_eda_pdf / render_automatic_eda_pptx: aceptan capítulos o un
TableProfile (construyen los capítulos con build_document) y escriben el
manifiesto. Aditivas — no reemplazan render_eda_pdf.
Tests self-contained (sin DuckDB) para ambos renderers: golden (portada +
overview), partición de tablas largas repitiendo cabecera, no-corte de celdas
y markdown largos, profile None/{} válido de 1 página/slide, y error path en
directorio no escribible. 23 tests verdes (incluye los previos de
render_eda_pdf, intactos).
Dependencia nueva python-pptx>=1.0.2 declarada en python/pyproject.toml.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This commit is contained in:
@@ -0,0 +1,7 @@
|
||||
"""AutomaticEDA chapters.
|
||||
|
||||
Each chapter is a module ``<id>.py`` exposing ``build_<id>(profile, ctx) ->
|
||||
Chapter | None`` and a ``CHAPTER_VERSION`` constant. The canonical document
|
||||
order lives in :mod:`automatic_eda.chapters_registry`. Implemented today:
|
||||
``portada`` and ``overview`` (the reference chapters other agents copy).
|
||||
"""
|
||||
@@ -0,0 +1,176 @@
|
||||
"""Overview chapter — df.head, column dictionary and describe (reference).
|
||||
|
||||
Second reference chapter for AutomaticEDA. Renders (across as many pages/slides
|
||||
as needed, the renderers paginate):
|
||||
|
||||
1. ``df.head`` — the first rows of the table. The current ``TableProfile`` does
|
||||
NOT carry the raw head, so this is read from ``ctx['head_rows']`` /
|
||||
``profile['head_rows']`` (a list of row dicts). When absent the chapter shows
|
||||
an honest placeholder documenting the missing key instead of inventing data.
|
||||
2. Column dictionary — name / type / nulls / non-null examples. Examples come
|
||||
from ``columns[i]['examples']`` when present; otherwise they are derived from
|
||||
real non-null profile values (categorical top values, numeric min/median/max)
|
||||
so the cell is never empty nor fabricated.
|
||||
3. ``df.describe`` — mean / median / min / max / std for every numeric column.
|
||||
|
||||
Contract: build_<id>(profile, ctx) -> Chapter | None ; CHAPTER_VERSION = "x.y.z".
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
from .. import model
|
||||
|
||||
CHAPTER_VERSION = "1.0.0"
|
||||
CHAPTER_ID = "overview"
|
||||
CHAPTER_TITLE = "Overview"
|
||||
|
||||
# Profile/ctx keys the calculation phase must add for a full head + examples.
|
||||
HEAD_KEY = "head_rows" # list[dict] — df.head(n)
|
||||
EXAMPLES_KEY = "examples" # per column: list of non-null sample values
|
||||
|
||||
|
||||
def _fmt_num(value, decimals: int = 3) -> str:
|
||||
if value is None:
|
||||
return "—"
|
||||
if isinstance(value, bool):
|
||||
return str(value)
|
||||
if isinstance(value, int):
|
||||
return f"{value:,}".replace(",", ".")
|
||||
if isinstance(value, float):
|
||||
if value != value: # NaN
|
||||
return "NaN"
|
||||
if value in (float("inf"), float("-inf")):
|
||||
return str(value)
|
||||
text = f"{value:.{decimals}f}".rstrip("0").rstrip(".")
|
||||
return text if text else "0"
|
||||
return str(value)
|
||||
|
||||
|
||||
def _fmt_pct(value, decimals: int = 1) -> str:
|
||||
if value is None:
|
||||
return "—"
|
||||
try:
|
||||
return f"{float(value) * 100:.{decimals}f}%"
|
||||
except (TypeError, ValueError):
|
||||
return str(value)
|
||||
|
||||
|
||||
def _examples_for(col: dict) -> str:
|
||||
"""Build a short string of real non-null example values for a column."""
|
||||
explicit = col.get(EXAMPLES_KEY)
|
||||
if isinstance(explicit, (list, tuple)) and explicit:
|
||||
return ", ".join(model._safe_str(v) for v in explicit[:4])
|
||||
cat = col.get("categorical") or {}
|
||||
top = cat.get("top") or []
|
||||
if top:
|
||||
vals = [model._safe_str((t or {}).get("value")) for t in top[:4]
|
||||
if isinstance(t, dict)]
|
||||
vals = [v for v in vals if v]
|
||||
if vals:
|
||||
return ", ".join(vals)
|
||||
num = col.get("numeric") or {}
|
||||
if num:
|
||||
bits = []
|
||||
for key in ("min", "median", "max"):
|
||||
v = num.get(key)
|
||||
if v is not None:
|
||||
bits.append(_fmt_num(v))
|
||||
if bits:
|
||||
return ", ".join(bits)
|
||||
return "—"
|
||||
|
||||
|
||||
def _head_block(profile: dict, ctx: dict):
|
||||
"""Return a DataTable for df.head, or a Note documenting the missing key."""
|
||||
head = ctx.get(HEAD_KEY) or profile.get(HEAD_KEY)
|
||||
if isinstance(head, list) and head and isinstance(head[0], dict):
|
||||
# Column order from the profile, then any extra keys present in rows.
|
||||
cols = [c.get("name") for c in (profile.get("columns") or [])
|
||||
if c.get("name")]
|
||||
if not cols:
|
||||
cols = list(head[0].keys())
|
||||
rows = [[model._safe_str(r.get(c)) for c in cols] for r in head[:10]]
|
||||
return model.DataTable(header=cols, rows=rows,
|
||||
note=f"primeras {len(rows)} filas")
|
||||
return model.Note(
|
||||
"df.head no disponible: el TableProfile no incluye 'head_rows'. La fase "
|
||||
"de cálculo debe añadir profile['head_rows'] (lista de dicts fila) o "
|
||||
"pasarlo en ctx['head_rows'] para mostrar las primeras filas.")
|
||||
|
||||
|
||||
def _columns_block(profile: dict):
|
||||
cols = profile.get("columns") or []
|
||||
header = ["Columna", "Tipo", "Nulos", "Ejemplos (no nulos)"]
|
||||
rows = []
|
||||
for c in cols:
|
||||
if not isinstance(c, dict):
|
||||
continue
|
||||
name = c.get("name") or "(col)"
|
||||
ctype = c.get("inferred_type") or c.get("physical_type") or "—"
|
||||
sem = c.get("semantic_type")
|
||||
if sem:
|
||||
ctype = f"{ctype} ({sem})"
|
||||
null_pct = c.get("null_pct")
|
||||
null_count = c.get("null_count")
|
||||
if null_pct is not None:
|
||||
nulls = _fmt_pct(null_pct)
|
||||
if null_count is not None:
|
||||
nulls += f" ({null_count})"
|
||||
elif null_count is not None:
|
||||
nulls = str(null_count)
|
||||
else:
|
||||
nulls = "—"
|
||||
rows.append([name, ctype, nulls, _examples_for(c)])
|
||||
if not rows:
|
||||
return None
|
||||
return model.DataTable(header=header, rows=rows, title="Columnas")
|
||||
|
||||
|
||||
def _describe_block(profile: dict):
|
||||
cols = profile.get("columns") or []
|
||||
header = ["Columna", "mean", "median", "min", "max", "std"]
|
||||
rows = []
|
||||
for c in cols:
|
||||
if not isinstance(c, dict) or c.get("inferred_type") != "numeric":
|
||||
continue
|
||||
num = c.get("numeric") or {}
|
||||
if not num:
|
||||
continue
|
||||
rows.append([
|
||||
c.get("name") or "(col)",
|
||||
_fmt_num(num.get("mean")),
|
||||
_fmt_num(num.get("median")),
|
||||
_fmt_num(num.get("min")),
|
||||
_fmt_num(num.get("max")),
|
||||
_fmt_num(num.get("std")),
|
||||
])
|
||||
if not rows:
|
||||
return None
|
||||
return model.DataTable(header=header, rows=rows, title="Estadística (describe)")
|
||||
|
||||
|
||||
def build_overview(profile: dict, ctx: dict):
|
||||
"""Build the Overview Chapter, or None if the profile has no columns."""
|
||||
profile = profile or {}
|
||||
ctx = ctx or {}
|
||||
cols = profile.get("columns") or []
|
||||
if not cols and not (ctx.get(HEAD_KEY) or profile.get(HEAD_KEY)):
|
||||
return None
|
||||
|
||||
blocks = [
|
||||
model.Heading(text="Primeras filas (df.head)", level=2),
|
||||
_head_block(profile, ctx),
|
||||
]
|
||||
cols_block = _columns_block(profile)
|
||||
if cols_block is not None:
|
||||
blocks.append(model.Heading(
|
||||
text="Diccionario de columnas", level=2))
|
||||
blocks.append(cols_block)
|
||||
desc_block = _describe_block(profile)
|
||||
if desc_block is not None:
|
||||
blocks.append(model.Heading(
|
||||
text="Resumen estadístico numérico", level=2))
|
||||
blocks.append(desc_block)
|
||||
|
||||
return model.Chapter(id=CHAPTER_ID, title=CHAPTER_TITLE,
|
||||
version=CHAPTER_VERSION, blocks=blocks)
|
||||
@@ -0,0 +1,156 @@
|
||||
"""Cover chapter (PORTADA) — the reference chapter for AutomaticEDA.
|
||||
|
||||
Builds the document cover from a TableProfile plus an optional ``ctx`` of
|
||||
presentation metadata. Reads everything defensively (``.get``) and degrades
|
||||
honestly: a field that is neither in the profile nor in ``ctx`` is shown as a
|
||||
placeholder rather than invented, leaving a hook for the LLM layer to fill it.
|
||||
|
||||
Contract for chapter authors (see ``docs/capabilities/automatic_eda.md``):
|
||||
build_<id>(profile: dict, ctx: dict) -> Chapter | None
|
||||
CHAPTER_VERSION = "x.y.z"
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import os
|
||||
from datetime import datetime, timezone
|
||||
|
||||
from .. import model
|
||||
|
||||
CHAPTER_VERSION = "1.0.0"
|
||||
CHAPTER_ID = "portada"
|
||||
CHAPTER_TITLE = "Portada"
|
||||
|
||||
# Default human description of what the table quality score measures. Chapters
|
||||
# can override it via ctx["quality_criteria"].
|
||||
_DEFAULT_QUALITY_CRITERIA = (
|
||||
"media de los scores por columna (0–100): completitud (sin nulos/vacíos), "
|
||||
"validez (tipo y rango coherentes) y consistencia (sin duplicados/constantes)."
|
||||
)
|
||||
|
||||
|
||||
def _storage_from_source(source: str) -> str:
|
||||
"""Infer the storage technology the dataset currently lives in.
|
||||
|
||||
Heuristic on the profile ``source`` string (a path, DSN or backend name).
|
||||
Returns a human label; falls back to the raw source when unknown.
|
||||
"""
|
||||
s = (source or "").strip().lower()
|
||||
if not s:
|
||||
return "—"
|
||||
if s.endswith(".csv") or s.endswith(".tsv"):
|
||||
return "CSV"
|
||||
if s.endswith(".parquet") or s.endswith(".pq"):
|
||||
return "Parquet"
|
||||
if s.endswith(".json") or s.endswith(".ndjson"):
|
||||
return "JSON"
|
||||
if s.endswith(".xlsx") or s.endswith(".xls"):
|
||||
return "Excel"
|
||||
if s.endswith((".duckdb", ".ddb")) or s == "duckdb" or s.endswith(".db"):
|
||||
return "DuckDB"
|
||||
if s.startswith(("postgres://", "postgresql://")) or "postgres" in s:
|
||||
return "PostgreSQL"
|
||||
if s.startswith("bigquery") or "bigquery" in s or s.count(".") == 2 and " " not in s:
|
||||
return "BigQuery"
|
||||
if "sqlite" in s:
|
||||
return "SQLite"
|
||||
# Unknown: show the raw source so nothing is hidden.
|
||||
return source
|
||||
|
||||
|
||||
def _fmt_int(v) -> str:
|
||||
if v is None:
|
||||
return "—"
|
||||
try:
|
||||
return f"{int(v):,}".replace(",", ".")
|
||||
except (TypeError, ValueError):
|
||||
return str(v)
|
||||
|
||||
|
||||
def _fmt_date_eu(value) -> str:
|
||||
"""Format a date/ISO string as European DD/MM/AAAA HH:mm (UI convention).
|
||||
|
||||
Accepts a datetime, an ISO-8601 string (with or without microseconds/tz) or
|
||||
any other string. Non-parseable strings are returned verbatim so nothing is
|
||||
lost; None yields a placeholder.
|
||||
"""
|
||||
if value is None:
|
||||
return "—"
|
||||
if isinstance(value, datetime):
|
||||
return value.strftime("%d/%m/%Y %H:%M")
|
||||
s = str(value).strip()
|
||||
if not s:
|
||||
return "—"
|
||||
try:
|
||||
dt = datetime.fromisoformat(s.replace("Z", "+00:00"))
|
||||
return dt.strftime("%d/%m/%Y %H:%M")
|
||||
except (TypeError, ValueError):
|
||||
# Try a couple of common forms before giving up.
|
||||
for fmt in ("%Y-%m-%d %H:%M:%S UTC", "%Y-%m-%d %H:%M UTC",
|
||||
"%Y-%m-%d %H:%M:%S", "%Y-%m-%d"):
|
||||
try:
|
||||
return datetime.strptime(s, fmt).strftime("%d/%m/%Y %H:%M")
|
||||
except ValueError:
|
||||
continue
|
||||
return s
|
||||
|
||||
|
||||
def build_portada(profile: dict, ctx: dict):
|
||||
"""Build the cover Chapter, or None if there is truly nothing to show."""
|
||||
profile = profile or {}
|
||||
ctx = ctx or {}
|
||||
|
||||
dataset_name = (ctx.get("dataset_name") or profile.get("table")
|
||||
or "(dataset sin nombre)")
|
||||
source = profile.get("source") or ""
|
||||
# Where the dataset comes from (origin), distinct from where it is stored.
|
||||
source_origin = ctx.get("source_origin") or source or "—"
|
||||
storage = ctx.get("storage") or _storage_from_source(source)
|
||||
|
||||
when = _fmt_date_eu(
|
||||
ctx.get("generated_at") or profile.get("profiled_at")
|
||||
or datetime.now(timezone.utc))
|
||||
|
||||
n_rows = profile.get("n_rows")
|
||||
n_cols = profile.get("n_cols")
|
||||
shape = f"{_fmt_int(n_rows)} filas × {_fmt_int(n_cols)} columnas"
|
||||
|
||||
score = profile.get("quality_score")
|
||||
quality_criteria = ctx.get("quality_criteria") or _DEFAULT_QUALITY_CRITERIA
|
||||
quality_value = "—" if score is None else f"{score} / 100"
|
||||
|
||||
# Granularity: ctx wins; else derive from key candidates; else be honest.
|
||||
granularity = ctx.get("granularity")
|
||||
if not granularity:
|
||||
keys = profile.get("key_candidates") or []
|
||||
if keys:
|
||||
granularity = ("Cada fila parece identificada por "
|
||||
+ ", ".join(str(k) for k in keys[:3]) + ".")
|
||||
else:
|
||||
granularity = ("Cada fila es… (granularidad no determinada — "
|
||||
"pendiente de la capa de cálculo/LLM).")
|
||||
|
||||
description = ctx.get("description")
|
||||
if not description:
|
||||
description = ("Descripción no provista — pendiente de la capa LLM "
|
||||
"(`run_llm`) o de `ctx['description']`.")
|
||||
|
||||
blocks = [
|
||||
model.Heading(text=str(dataset_name), level=1),
|
||||
model.Markdown(text="**Automatic-EDA** · informe exploratorio automático"),
|
||||
model.KVTable(rows=[
|
||||
("Fuente", source_origin),
|
||||
("Almacenamiento", storage),
|
||||
("Generado", when),
|
||||
("Tamaño", shape),
|
||||
("Calidad", quality_value),
|
||||
("Criterios de calidad", quality_criteria),
|
||||
]),
|
||||
model.Heading(text="Descripción", level=2),
|
||||
model.Markdown(text=str(description)),
|
||||
model.Heading(text="Granularidad", level=2),
|
||||
model.Markdown(text=str(granularity)),
|
||||
]
|
||||
|
||||
return model.Chapter(id=CHAPTER_ID, title=CHAPTER_TITLE,
|
||||
version=CHAPTER_VERSION, blocks=blocks)
|
||||
Reference in New Issue
Block a user