Files
fn_registry/python/functions/datascience/render_automatic_eda_markdown.py
T
egutierrez 48de3ce3da feat(eda): salida Markdown del AutomaticEDA para pegar a un LLM
Añade un tercer formato de salida al AutomaticEDA, junto al PDF y el PPTX:
un Markdown autocontenido del MISMO documento por capítulos
(chapters_registry.build_document), optimizado para incorporar a un LLM
(texto plano + tablas markdown reales, sin binarios incrustados).

- render_md_impl.render_md(chapters, out_path, meta): serializa los bloques
  del modelo (Heading/Markdown/KVTable/DataTable/Figure/Image/Caption/Note/
  Group/GlossaryEntry) a Markdown. Cabecera con metadatos + índice navegable
  con anclas GitHub; tablas volcadas enteras (el MD no pagina); marcadores de
  glosario eliminados conservando la negrita; glosario al final.
- Figuras: un LLM no ve la imagen, así que se prioriza texto + datos. Se emite
  el caption y, cuando la figura tiene barras (histograma), se extrae la tabla
  de bins (Desde/Hasta/Frecuencia) de los artistas matplotlib. La banda ±1σ
  (axvspan) se descarta por ancho para que no aparezca como un falso bin.
  PNG opcional vía meta['embed_figures'] (off por defecto → sin binarios).
- render_automatic_eda_markdown: función pública del registry (tag eda),
  espejo de render_automatic_eda_pdf/pptx, acepta lista de capítulos o un
  TableProfile (build_document). dict-no-throw.
- render_automatic_eda (pipeline): emite también el .md (emit_md=True por
  defecto, clave de retorno aeda_md_path). Cambio aditivo: PDF/PPTX/manifest
  siguen saliendo igual.

Tests: golden de todos los kinds + regresión del filtro de la banda ±1σ +
edge documento vacío + profile path. Suite del paquete y del pipeline verde
(122 passed).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-30 18:52:08 +02:00

56 lines
2.5 KiB
Python

"""render_automatic_eda_markdown — chapter-based EDA report as one Markdown file.
Public ``eda``-group entry point that serializes an AutomaticEDA document (a list
of chapters, or an ``eda`` TableProfile from which the canonical chapters are
built) into a single self-contained Markdown file optimised to be **pasted into
an LLM**: plain text, Markdown tables (every row dumped — there are no pages to
cut), figures reduced to caption + underlying data, no binaries. It mirrors
``render_automatic_eda_pdf`` / ``render_automatic_eda_pptx`` but for text output;
unlike those it writes no manifest (KISS — Markdown is a single text artefact).
dict-no-throw: never raises. Returns ``{path, n_chars, chapters, note}``; on a
fatal error ``path`` is None and ``note`` explains why.
"""
from __future__ import annotations
from datascience.automatic_eda import build_document, render_md
from datascience.automatic_eda.model import as_chapter, as_chapters
def _coerce_chapters(chapters_or_profile, meta: dict) -> list:
"""Accept chapters OR an eda profile and return a list of Chapter."""
arg = chapters_or_profile
if isinstance(arg, (list, tuple)):
return as_chapters(list(arg))
if isinstance(arg, dict):
if "blocks" in arg and "columns" not in arg:
ch = as_chapter(arg)
return [ch] if ch is not None else []
return build_document(arg, (meta or {}).get("ctx"))
return []
def render_automatic_eda_markdown(chapters_or_profile, out_path: str,
meta: dict = None) -> dict:
"""Render an AutomaticEDA document into a single self-contained Markdown file.
Args:
chapters_or_profile: a list of chapters (``Chapter`` dataclasses or
dicts) or an ``eda`` TableProfile dict (chapters built via
``build_document(profile, meta['ctx'])``).
out_path: filesystem path for the ``.md`` (parent dirs are created).
meta: optional dict. Recognised keys: ``title``, ``ctx`` (dict with
``dataset_name``/``source_origin``/``storage``/``n_rows``/``n_cols``),
``generated_at``, ``embed_figures`` (export PNGs beside the .md,
default False — off keeps the Markdown self-contained).
Returns:
dict (never raises): ``{path: str|None, n_chars: int,
chapters: list[{id, version}], note: str}``. On a fatal error ``path`` is
None and ``note`` explains the cause.
"""
meta = dict(meta or {})
chapters = _coerce_chapters(chapters_or_profile, meta)
return render_md(chapters, out_path, meta)