48de3ce3da
Adds a third output format to AutomaticEDA, alongside PDF and PPTX: a self-contained Markdown rendering of the SAME chapter-based document (chapters_registry.build_document), optimised for feeding into an LLM (plain text + real Markdown tables, no embedded binaries).

- render_md_impl.render_md(chapters, out_path, meta): serializes the model's blocks (Heading/Markdown/KVTable/DataTable/Figure/Image/Caption/Note/Group/GlossaryEntry) to Markdown. Header with metadata plus a navigable table of contents with GitHub anchors; tables dumped in full (Markdown does not paginate); glossary markers stripped while keeping the bold; glossary at the end.
- Figures: an LLM cannot see the image, so text + data take priority. The caption is emitted and, when the figure has bars (a histogram), the bin table (Desde/Hasta/Frecuencia, i.e. From/To/Frequency) is extracted from the matplotlib artists. The ±1σ band (axvspan) is discarded by width so that it does not show up as a false bin. Optional PNG via meta['embed_figures'] (off by default → no binaries).
- render_automatic_eda_markdown: public registry function (tag eda), mirroring render_automatic_eda_pdf/pptx; accepts a list of chapters or a TableProfile (build_document). dict-no-throw.
- render_automatic_eda (pipeline): also emits the .md (emit_md=True by default, return key aeda_md_path). Additive change: PDF/PPTX/manifest are still produced as before.

Tests: golden covering all kinds + regression for the ±1σ band filter + empty-document edge case + profile path. Package and pipeline suites green (122 passed).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
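The ±1σ band filter mentioned in the commit message can be illustrated with a minimal sketch. It is not the code from render_md_impl: `Rect` is a hypothetical stand-in for a matplotlib rectangle artist, and the rule "keep rectangles whose width is within a relative tolerance of the median width" is an assumption about how a by-width filter might work.

```python
from dataclasses import dataclass
from statistics import median


@dataclass(frozen=True)
class Rect:
    """Hypothetical stand-in for a matplotlib rectangle artist."""
    x: float       # left edge
    width: float
    height: float  # bar height, i.e. the bin frequency


def bins_table(rects: list[Rect], tol: float = 0.25) -> str:
    """Return a Markdown Desde/Hasta/Frecuencia table, skipping off-width rectangles."""
    if not rects:
        return ""
    typical = median(r.width for r in rects)
    # Assumed rule: a real bin is within `tol` (relative) of the median width;
    # a shaded band spanning several bins is much wider and gets dropped.
    bars = [r for r in rects if abs(r.width - typical) <= tol * typical]
    lines = ["| Desde | Hasta | Frecuencia |", "| --- | --- | --- |"]
    for r in sorted(bars, key=lambda r: r.x):
        lines.append(f"| {r.x:g} | {r.x + r.width:g} | {r.height:g} |")
    return "\n".join(lines)


# Three unit-width bins plus one wide rectangle playing the role of the band.
rects = [Rect(0, 1, 3), Rect(1, 1, 7), Rect(2, 1, 2), Rect(0.5, 2.0, 1.0)]
table = bins_table(rects)
print(table)
```

Filtering relative to the median width keeps the sketch independent of the bin size. It would not be enough if the band happened to be exactly one bin wide, in which case some other attribute would be needed to tell the two apart.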
56 lines
2.5 KiB
Python
"""render_automatic_eda_markdown: chapter-based EDA report as one Markdown file.

Public ``eda``-group entry point that serializes an AutomaticEDA document (a list
of chapters, or an ``eda`` TableProfile from which the canonical chapters are
built) into a single self-contained Markdown file optimised to be **pasted into
an LLM**: plain text, Markdown tables (every row dumped, since there are no pages
to cut), figures reduced to caption + underlying data, no binaries. It mirrors
``render_automatic_eda_pdf`` / ``render_automatic_eda_pptx`` but for text output;
unlike those it writes no manifest (KISS: Markdown is a single text artefact).

dict-no-throw: never raises. Returns ``{path, n_chars, chapters, note}``; on a
fatal error ``path`` is None and ``note`` explains why.
"""

from __future__ import annotations

from datascience.automatic_eda import build_document, render_md
from datascience.automatic_eda.model import as_chapter, as_chapters


def _coerce_chapters(chapters_or_profile, meta: dict) -> list:
    """Accept chapters OR an eda profile and return a list of Chapter."""
    arg = chapters_or_profile
    if isinstance(arg, (list, tuple)):
        return as_chapters(list(arg))
    if isinstance(arg, dict):
        # A dict with "blocks" but no "columns" is a single chapter, not a profile.
        if "blocks" in arg and "columns" not in arg:
            ch = as_chapter(arg)
            return [ch] if ch is not None else []
        # Otherwise treat the dict as an eda profile and build the canonical chapters.
        return build_document(arg, (meta or {}).get("ctx"))
    # Unsupported input type: render an empty document.
    return []


def render_automatic_eda_markdown(chapters_or_profile, out_path: str,
                                  meta: dict | None = None) -> dict:
    """Render an AutomaticEDA document into a single self-contained Markdown file.

    Args:
        chapters_or_profile: a list of chapters (``Chapter`` dataclasses or
            dicts) or an ``eda`` TableProfile dict (chapters built via
            ``build_document(profile, meta['ctx'])``).
        out_path: filesystem path for the ``.md`` (parent dirs are created).
        meta: optional dict. Recognised keys: ``title``, ``ctx`` (dict with
            ``dataset_name``/``source_origin``/``storage``/``n_rows``/``n_cols``),
            ``generated_at``, ``embed_figures`` (export PNGs beside the .md,
            default False; off keeps the Markdown self-contained).

    Returns:
        dict (never raises): ``{path: str|None, n_chars: int,
        chapters: list[{id, version}], note: str}``. On a fatal error ``path`` is
        None and ``note`` explains the cause.
    """
    meta = dict(meta or {})
    try:
        chapters = _coerce_chapters(chapters_or_profile, meta)
    except Exception as exc:
        # Honour the dict-no-throw contract even if building the chapters fails.
        return {"path": None, "n_chars": 0, "chapters": [],
                "note": f"could not build chapters: {exc}"}
    return render_md(chapters, out_path, meta)
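
The dict-no-throw contract described in the docstrings above can be illustrated with a minimal, standard-library-only sketch. `write_markdown_no_throw` is a hypothetical helper, not the package's `render_md`; only the shape of the returned dict (`path`, `n_chars`, `chapters`, `note`) is taken from the docstring, and the content of `note` is an assumption.

```python
import os
import tempfile


def write_markdown_no_throw(text: str, out_path: str) -> dict:
    """Hypothetical helper: write `text` to `out_path`, reporting errors in the result."""
    result = {"path": None, "n_chars": 0, "chapters": [], "note": ""}
    try:
        parent = os.path.dirname(os.path.abspath(out_path))
        os.makedirs(parent, exist_ok=True)  # create parent dirs if missing
        with open(out_path, "w", encoding="utf-8") as fh:
            fh.write(text)
        result["path"] = out_path
        result["n_chars"] = len(text)
    except Exception as exc:  # broad on purpose: the caller never sees a raise
        result["note"] = f"{type(exc).__name__}: {exc}"
    return result


tmp = tempfile.mkdtemp()
ok = write_markdown_no_throw("# Report\n", os.path.join(tmp, "out", "report.md"))
bad = write_markdown_no_throw("# Report\n", tmp)  # a directory, so open() fails
print(ok["path"], ok["n_chars"])
print(bad["path"], bad["note"])
```

Callers check `result["path"] is None` instead of wrapping the call in `try`/`except`, which keeps a pipeline that emits several artefacts (PDF, PPTX, Markdown) from aborting when only one of them fails.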