Compare commits
7 Commits
| Author | SHA1 | Date |
|---|---|---|
| | 6e3c3cf2a2 | |
| | a1e2e3567c | |
| | 833597c831 | |
| | 7158be8142 | |
| | 9be84a48ea | |
| | 4099d88eaf | |
| | 48de3ce3da | |
@@ -54,6 +54,13 @@ reports/*
!reports/.gitkeep
projects/*/reports/

# Papers: local artifact, reproducible academic papers. In the internal phase they live
# locally and gitignored (like the reports); when promoted to the publishable phase they
# become their own Gitea sub-repo (like apps/analyses). Only the .gitkeep marker is
# versioned. Convention: docs/capabilities/papers.md
papers/*
!papers/.gitkeep

# Node / pnpm
**/node_modules/

@@ -0,0 +1,58 @@
---
name: next_numbered_dir
kind: function
lang: bash
domain: io
version: "1.0.0"
purity: impure
signature: "next_numbered_dir(parent_dir: string, [width: int]) -> string"
description: "Computes the next numeric NNNN- prefix for an incrementally numbered directory. Scans the direct subdirectories of parent_dir whose names start with NNNN- (4+ digits followed by a hyphen), takes the maximum, adds 1, and prints it zero-padded to width (default 4). If parent_dir does not exist or has no matching subdirectories, prints 0001."
tags: [papers, io, scaffold]
uses_functions: []
uses_types: []
returns: []
returns_optional: false
error_type: "error_go_core"
imports: []
params:
  - name: parent_dir
    desc: "parent directory whose numbered subdirectories (NNNN-...) are scanned; required"
  - name: width
    desc: "zero-padding width of the printed number (default 4); optional"
output: "the next number as a string zero-padded to width digits on stdout (e.g. 0003); usage on stderr and exit 1 if parent_dir is missing"
tested: false
tests: []
test_file_path: ""
file_path: "bash/functions/io/next_numbered_dir.sh"
---

## Example

```bash
source bash/functions/io/next_numbered_dir.sh

# With a papers/ directory that already contains 0001-foo and 0002-bar
mkdir -p /tmp/papers/{0001-foo,0002-bar}
next_numbered_dir /tmp/papers
# -> 0003

# Empty or nonexistent directory -> first number
next_numbered_dir /tmp/papers_nuevo
# -> 0001

# Different padding width
next_numbered_dir /tmp/papers 6
# -> 000003
```

## When to use it

Use it when you scaffold an incrementally numbered artifact (papers/, reports/, issues/) and need the next NNNN without a collision: it scans what already exists on disk and returns the next number, ready for creating `<NNNN>-<slug>`.

## Gotchas

- **Impure**: reads the filesystem (the state of the directory at call time). It creates nothing; it only computes and prints the number.
- **Octal**: bash arithmetic reads numbers with a leading zero as octal, so `08` and `09` are errors and a prefix such as `0010` would silently be read as 8. The function forces base 10 with `10#$num` to avoid this.
- **Subdirectories only**: counts direct subdirectories only. Loose files (`.gitkeep`, `notas.md`) and subdirectories that do not match the pattern are ignored. It is not recursive.
- **Strict pattern**: the prefix must be `NNNN-` (at least 4 digits followed by a hyphen). A subdirectory named `12-foo` or `0001foo` (no hyphen) is NOT counted.
- **No gap detection**: it returns `max+1`, not the first free number in between. If you have `0001` and `0003`, it returns `0004`, not `0002`.
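
The pattern, base-10, and `max+1` rules listed in the gotchas can be checked without touching the filesystem. The following is a minimal sketch rather than the function itself: the `names` array and its entries are made up for illustration, and the loop only mirrors the matching logic described in this page.

```bash
#!/usr/bin/env bash
# Illustrative sketch: hypothetical directory names, no filesystem access.
names=(0001-foo 0003-bar 12-short 0009baz 0008-qux notes.md)

max=0
for name in "${names[@]}"; do
  # Strict pattern: 4+ digits followed by a hyphen.
  if [[ "$name" =~ ^([0-9]{4,})- ]]; then
    # 10# forces decimal; a bare 0008 would be rejected as invalid octal.
    num=$(( 10#${BASH_REMATCH[1]} ))
    if (( num > max )); then
      max=$num
    fi
  fi
done

# max+1, not the first gap: 0002 is free here but is not reused.
next="$(printf '%04d' $(( max + 1 )))"
echo "$next"
```

Only `0001-foo`, `0003-bar`, and `0008-qux` match the pattern in this sample, so the sketch prints `0009`.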
@@ -0,0 +1,46 @@
#!/usr/bin/env bash
# next_numbered_dir: compute the next NNNN- prefix for a numbered directory.
#
# Scans the DIRECT subdirectories of <parent_dir> whose names start with a
# numeric prefix of the form `NNNN-` (4+ digits followed by a hyphen), takes
# the maximum number, adds 1, and prints it zero-padded to <width> (default 4).
# If <parent_dir> does not exist or contains no matching subdir, prints the
# first number (0001 at default width).

next_numbered_dir() {
  local parent_dir="${1:-}"
  local width="${2:-4}"

  if [[ -z "$parent_dir" ]]; then
    echo "usage: next_numbered_dir <parent_dir> [width]" >&2
    return 1
  fi

  local max=0
  local entry base num

  if [[ -d "$parent_dir" ]]; then
    # Iterate only over direct subdirectories. The trailing slash in the
    # glob ensures files (e.g. .gitkeep) are skipped; only dirs match.
    for entry in "$parent_dir"/*/; do
      # If the glob matched nothing it stays literal; guard with -d.
      [[ -d "$entry" ]] || continue
      base="$(basename "$entry")"
      # Require a prefix of 4+ digits followed by a hyphen.
      if [[ "$base" =~ ^([0-9]{4,})- ]]; then
        num="${BASH_REMATCH[1]}"
        # Force base 10 so leading zeros (08, 09) are not read as octal.
        num=$((10#$num))
        if (( num > max )); then
          max=$num
        fi
      fi
    done
  fi

  printf "%0*d\n" "$width" $(( max + 1 ))
}

if [[ "${BASH_SOURCE[0]}" == "${0}" ]]; then
  next_numbered_dir "$@"
fi
@@ -0,0 +1,69 @@
---
name: init_paper
kind: pipeline
lang: bash
domain: pipelines
version: "1.0.0"
purity: impure
signature: "init_paper(slug: string, [--title <t>] [--domain <d>] [--tags <csv>]) -> void"
description: "Scaffolds a reproducible academic paper in papers/<NNNN-slug>/. Computes the next incremental number by scanning papers/, creates the subfolders (experiments data figures reviews out), copies the paper.md (IMRaD) and preregistration.md (anti-HARKing) templates while filling in the frontmatter (title, slug, today's date, phase=question, status=draft), and creates references.md. Does NOT run git init: the paper starts in the local internal phase (papers/ is gitignored). Capability group: papers."
tags: [papers, scaffold, paper, pipeline, bash, launcher]
uses_functions:
  - next_numbered_dir_bash_io
  - slugify_ascii_py_core
uses_types: []
returns: []
returns_optional: false
error_type: "error_go_core"
imports: []
params:
  - name: slug
    desc: "human-readable identifier of the paper; it is slugified to ASCII (spaces and accents are normalized) and prefixed with the next incremental NNNN"
  - name: "--title"
    desc: "paper title (string); if omitted, the clean slug is used. Must not contain the '|' character"
  - name: "--domain"
    desc: "paper domain written to the frontmatter (default datascience)"
  - name: "--tags"
    desc: "CSV tags written to the frontmatter of paper.md (optional)"
output: "no direct return value; creates papers/<NNNN-slug>/ with paper.md, preregistration.md, references.md and the subfolders experiments/ data/ figures/ reviews/ out/. Prints the summary and the next steps to stdout."
tested: false
tests: []
test_file_path: ""
file_path: "bash/functions/pipelines/init_paper.sh"
---

## Example

```bash
# Scaffold a new paper (numbered 0001, 0002, ... automatically)
fn run init_paper mi-primer-paper --title "Mi primer paper"
fn run init_paper reactive-loop-calls --domain datascience --tags registry,telemetria

# The slug is slugified: "Áreas de Mejora" -> papers/0003-areas-de-mejora/
fn run init_paper "Áreas de Mejora"
```

## When to use it

Use it when you start a new academic paper inside `fn_registry` and need the skeleton of the artifact (`papers/<NNNN-slug>/`) with the IMRaD and pre-registration templates ready to fill in. It is step 1 of the `papers` capability group (see `docs/capabilities/papers.md`), before the literature review and the pre-registration of the hypothesis.

## Flow

1. Parses `<slug>` (positional) plus the `--title` / `--domain` / `--tags` flags. Fails with exit ≠ 0 if the slug is missing.
2. `slugify_ascii`: normalizes the slug to lowercase ASCII without diacritics (reuses the registry function, stdlib only).
3. `next_numbered_dir papers/`: computes the next 4-digit NNNN without a collision.
4. Creates `papers/<NNNN-slug>/` with the subfolders `experiments/ data/ figures/ reviews/ out/`.
5. Copies `docs/templates/paper.md` + `docs/templates/preregistration.md` and fills in the frontmatter by line key (title, slug, today's date, domain, tags; phase=question and status=draft come from the template).
6. Creates a stub `references.md` (a title heading plus a format note, no entries).

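Step 5 fills the frontmatter by rewriting whole lines keyed on their field name. Here is a minimal sketch of that idea, assuming a throwaway template in a temporary directory. The file names, the sample title, and the slug are placeholders chosen for illustration. `sed_escape` mirrors the helper of the same name in `init_paper.sh`. The sketch writes to a new file instead of editing in place, which is a simplification.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Escape the characters that are special on the right-hand side of s|...|...|
sed_escape() { printf '%s' "$1" | sed -e 's/[\\&|]/\\&/g'; }

workdir="$(mktemp -d)"
cat > "$workdir/template.md" <<'EOF'
---
title: "PAPER TITLE"
slug: NNNN-slug
date: 2026-01-01
---
EOF

title='Cats & dogs'                      # sample title containing a sed-special character
title_esc="$(sed_escape "$title")"

# Each expression replaces one whole frontmatter line, identified by its key.
sed \
  -e "s|^title:.*|title: \"${title_esc}\"|" \
  -e "s|^slug:.*|slug: 0001-cats-and-dogs|" \
  "$workdir/template.md" > "$workdir/paper.md"

filled_title="$(grep '^title:' "$workdir/paper.md")"
echo "$filled_title"
rm -rf "$workdir"
```

Without the escaping step, the `&` in the title would be expanded by sed to the matched text instead of being written literally.
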
## Gotchas

- **Does NOT run `git init`.** The paper starts in the local internal phase; `papers/` is gitignored in the parent repo (only `papers/.gitkeep` is versioned). Promotion to a Gitea sub-repo (publishable phase) is manual.
- **`--title` must not contain the `|` character** (it is the sed delimiter used when filling in the frontmatter; `&` and `\` are escaped).
- **Does not index the paper in `registry.db`**: the `papers/<slug>/` artifacts are not indexed in this phase (KISS); this pipeline itself is indexed.
- Requires `python3` (from the registry venv or from the system) to slugify; `slugify_ascii` uses only the stdlib, so the venv is not mandatory.
- No overwriting: if the target directory already exists, it aborts with exit ≠ 0 instead of overwriting it.

## Notes

Each paper is an independent artifact (same pattern as `apps/` and `analysis/`, but for research). The pipeline uses `set -euo pipefail`: any failure stops execution. Part of the `papers` capability group; full design in `reports/0001-2026-06-30-papers-system-design.md`.
@@ -0,0 +1,177 @@
#!/usr/bin/env bash
# init_paper
# ----------
# Scaffolds a reproducible academic paper in papers/<NNNN-slug>/.
#
# Computes the next incremental number by scanning papers/, creates the
# directory with all its subfolders (experiments data figures reviews out),
# copies the paper.md + preregistration.md templates while filling in the frontmatter
# (title, slug, today's date, phase=question, status=draft) and creates references.md.
#
# Does NOT run `git init`: the paper starts in the local internal phase (papers/ is
# gitignored in the parent repo, only .gitkeep is versioned). Promotion to a
# Gitea sub-repo (publishable phase) is a later MANUAL step.
#
# Composes: next_numbered_dir (the registry's numbering helper) +
# slugify_ascii (the registry's ASCII slug function).
#
# USAGE:
#   ./init_paper.sh <slug> [--title "..."] [--domain <d>] [--tags a,b,c]
#
# EXAMPLES:
#   ./init_paper.sh mi-primer-paper --title "Mi primer paper"
#   ./init_paper.sh reactive-loop-calls --domain datascience --tags registry,telemetria

set -euo pipefail

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
REGISTRY_ROOT="$(cd "$SCRIPT_DIR/../../.." && pwd)"

# Atomic functions from the registry
source "$REGISTRY_ROOT/bash/functions/io/next_numbered_dir.sh"

# ── Argument parsing ─────────────────────────────────────────

SLUG_RAW=""
TITLE=""
DOMAIN="datascience"
TAGS=""

while [ $# -gt 0 ]; do
  case "$1" in
    --title)
      # ${2?...} gives a readable error if the flag is passed without a value.
      TITLE="${2?--title requires a value}"; shift 2 ;;
    --domain)
      DOMAIN="${2?--domain requires a value}"; shift 2 ;;
    --tags)
      TAGS="${2?--tags requires a value}"; shift 2 ;;
    -h|--help)
      grep "^#" "$0" | sed 's/^# \?//' ; exit 0 ;;
    -*)
      echo "Unknown flag: $1" >&2 ; exit 1 ;;
    *)
      if [ -z "$SLUG_RAW" ]; then
        SLUG_RAW="$1"
      else
        echo "ERROR: unexpected positional argument: '$1' (only one <slug> is accepted)." >&2
        exit 1
      fi
      shift ;;
  esac
done

if [ -z "$SLUG_RAW" ]; then
  echo "ERROR: missing <slug> argument." >&2
  echo "Usage: $0 <slug> [--title \"...\"] [--domain <d>] [--tags a,b,c]" >&2
  echo "  Example: $0 mi-primer-paper --title \"Mi primer paper\"" >&2
  exit 1
fi

# ── Slugify (reuses the registry's slugify_ascii; stdlib only) ──

PYBIN="$REGISTRY_ROOT/python/.venv/bin/python3"
[ -x "$PYBIN" ] || PYBIN="$(command -v python3 || true)"
if [ -z "$PYBIN" ]; then
  echo "ERROR: python3 not found; it is needed to slugify the slug." >&2
  exit 1
fi

SLUG_CLEAN=$("$PYBIN" -c '
import sys, os
sys.path.insert(0, os.path.join(sys.argv[2], "python", "functions"))
from core.slugify_ascii import slugify_ascii
print(slugify_ascii(sys.argv[1], default="paper"))
' "$SLUG_RAW" "$REGISTRY_ROOT")

# ── Resolve the incremental number and target directory ──────

PAPERS_DIR="$REGISTRY_ROOT/papers"
mkdir -p "$PAPERS_DIR"

NUM=$(next_numbered_dir "$PAPERS_DIR")
SLUG_FULL="${NUM}-${SLUG_CLEAN}"
PAPER_DIR="$PAPERS_DIR/$SLUG_FULL"

if [ -d "$PAPER_DIR" ]; then
  echo "ERROR: the paper directory already exists: $PAPER_DIR" >&2
  exit 1
fi

TODAY=$(date +%Y-%m-%d)
[ -n "$TITLE" ] || TITLE="$SLUG_CLEAN"

TAGS_YAML="[]"
if [ -n "$TAGS" ]; then
  TAGS_YAML="[$(echo "$TAGS" | sed 's/,/, /g')]"
fi

echo ""
echo "════════════════════════════════════════════════════════════"
echo "  INIT PAPER: ${SLUG_FULL}"
echo "  Title: ${TITLE}"
echo "  Directory: ${PAPER_DIR}"
echo "════════════════════════════════════════════════════════════"
echo ""

# ── Create the structure ─────────────────────────────────────

echo "[1/3] Creating structure..."
mkdir -p "$PAPER_DIR"/experiments "$PAPER_DIR"/data "$PAPER_DIR"/figures \
  "$PAPER_DIR"/reviews "$PAPER_DIR"/out
echo "  experiments/ data/ figures/ reviews/ out/"

# ── Copy templates + fill in the frontmatter ─────────────────

echo "[2/3] Writing paper.md + preregistration.md..."

# Escapes the characters that are special in the sed RHS (delimiter |)
sed_escape() { printf '%s' "$1" | sed -e 's/[\\&|]/\\&/g'; }
TITLE_ESC="$(sed_escape "$TITLE")"
DOMAIN_ESC="$(sed_escape "$DOMAIN")"
# Tags are user input too, so they get the same escaping.
TAGS_ESC="$(sed_escape "$TAGS_YAML")"

PAPER_MD="$PAPER_DIR/paper.md"
PREREG_MD="$PAPER_DIR/preregistration.md"

cp "$REGISTRY_ROOT/docs/templates/paper.md" "$PAPER_MD"
cp "$REGISTRY_ROOT/docs/templates/preregistration.md" "$PREREG_MD"

# Note: `sed -i` without a suffix argument is GNU sed syntax.
sed -i \
  -e "s|^title:.*|title: \"${TITLE_ESC}\"|" \
  -e "s|^slug:.*|slug: ${SLUG_FULL}|" \
  -e "s|^date:.*|date: ${TODAY}|" \
  -e "s|^domain:.*|domain: ${DOMAIN_ESC}|" \
  -e "s|^tags:.*|tags: ${TAGS_ESC}|" \
  "$PAPER_MD"

sed -i \
  -e "s|^paper_slug:.*|paper_slug: ${SLUG_FULL}|" \
  "$PREREG_MD"

echo "  $PAPER_MD"
echo "  $PREREG_MD"

# ── references.md ────────────────────────────────────────────

echo "[3/3] Writing references.md..."
cat > "$PAPER_DIR/references.md" << EOF
# References: ${TITLE}

<!-- One entry per reference. Free format (or BibTeX) until promotion to publishable. -->
EOF
echo "  $PAPER_DIR/references.md"

# ── Summary ──────────────────────────────────────────────────

echo ""
echo "════════════════════════════════════════════════════════════"
echo "  PAPER '${SLUG_FULL}' READY (phase: question, status: draft)"
echo "════════════════════════════════════════════════════════════"
echo ""
echo "  Next steps:"
echo "  1. Literature review (skill /deep-research) → Related work."
echo "  2. Pre-registration: freeze H0/H1 + plan in preregistration.md (preregister_hypothesis)."
echo "  3. Experiments in experiments/ → analysis (eda group) → IMRaD writing in paper.md."
echo "  4. render_paper_pdf → out/paper.pdf. Adversarial peer review → reviews/."
echo ""
echo "  papers/ is gitignored: this paper stays local until promoted to publishable."
echo ""
@@ -39,6 +39,7 @@ Indice de grupos de capacidades del registry. Cada grupo agrupa >=3 funciones qu
| [cpp-tables](tql.md) | 9 | Pure C++ Table Query Language: filter, group, agg, sort, join, stats, Lua formulas, emit/apply round-trip |
| [data-table-renderers](data_table_renderers.md) | 1 | Declarative cell-renderer API for data_table: Badge, Progress, Duration, Icon via TableInput.column_specs |
| [scheduler](scheduler.md) | 4 | Cron expression parsing, matching, next-run, and human-readable translation (used by `apps/dag_engine`) |
| [papers](papers.md) | - | Reproducible academic papers in `papers/<NNNN-slug>/`: artifact scaffold (`init_paper` + `next_numbered_dir` helper), IMRaD + anti-HARKing pre-registration templates, and (under construction by the fleet) hypothesis freezing, statistical functions (effect size/CI/multiple-comparison correction), md→PDF rendering, and adversarial peer review. Reuses `deep-research`, the `eda` group, and the `datascience` PDF engine. Design: `reports/0001-2026-06-30-papers-system-design.md` |
| [extractor](extractor.md) | 15 | Functions that read data from external sources (DB, API, files, web). Input nodes of `data_factory` |
| [transformer](transformer.md) | 15 | Functions that clean/dedup/aggregate/feature-engineer data. Intermediate nodes of `data_factory` |
| [sink](sink.md) | 11 | Functions that write data to an external destination (DB, dashboard, alert, email). Output nodes |

@@ -0,0 +1,82 @@
# papers: reproducible academic papers

Capability group for producing **academic papers** inside `fn_registry`: research with falsifiable hypotheses, reproducible experiments, honest statistical analysis, and writing in IMRaD format. Each paper is a new artifact in `papers/<NNNN-slug>/` that reuses existing infrastructure (the `deep-research` skill for the literature review, the `eda` group for the analysis, the md→PDF engine of `datascience`, the orchestrator's adversarial verification pattern) and adds what is missing as registry functions.

Full design and decisions: `reports/0001-2026-06-30-papers-system-design.md`.

> **Anti-paper-mill golden rule:** a hypothesis that **could** fail + an experiment with a real risk of refutation + statistics that are not theater. If there is no risk of refutation, it is not a paper. Claims never exceed the evidence. The antidote to HARKing is **pre-registration**: the analysis plan is frozen *before* looking at the data.

## Artifact structure

```
papers/0001-mi-paper/
  paper.md              # frontmatter (title, slug, authors, date, status, phase, tags, domain, hypothesis_id) + IMRaD body
  preregistration.md    # H0/H1 + analysis plan FROZEN (frozen_at + content_hash) before running
  references.md         # bibliography
  experiments/          # code / notebooks per experiment (exp01_*, exp02_*)
  data/                 # raw + processed (gitignored if heavy)
  figures/              # generated plots
  reviews/              # outputs of the adversarial peer review
  out/                  # paper.pdf, the final deliverable
  .git/                 # ONLY once promoted to the publishable phase (Gitea sub-repo)
```

`papers/` is gitignored in the parent repo (only `papers/.gitkeep` is versioned), so a paper in the internal phase does not pollute the repo. When promoted to `status: publishable` it becomes the Gitea sub-repo `dataforge/<slug>` (like apps and analyses).

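To make the layout concrete, the following sketch recreates it by hand in a temporary directory. It is only an illustration: in practice `init_paper` creates the tree, the path below is a throwaway, and the three Markdown files are left empty here rather than copied from the templates.

```bash
#!/usr/bin/env bash
set -euo pipefail

root="$(mktemp -d)"
paper_dir="$root/papers/0001-mi-paper"

# Subfolders from the layout listing.
mkdir -p "$paper_dir"/{experiments,data,figures,reviews,out}
# Empty placeholders for the three Markdown files (a simplification of this sketch).
touch "$paper_dir"/{paper.md,preregistration.md,references.md}

# No .git/ yet: it only appears on promotion to the publishable phase.
if [[ -d "$paper_dir/.git" ]]; then has_git=yes; else has_git=no; fi
entry_count="$(find "$paper_dir" -mindepth 1 -maxdepth 1 | wc -l | tr -d ' ')"
echo "entries: $entry_count, .git present: $has_git"
rm -rf "$root"
```

Five subfolders plus three files give eight top-level entries, and no `.git/` exists while the paper is in the internal phase.
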
### Phases (`phase` field of `paper.md`)

```
question → review → hypothesis → design → running → analysis → writing → internal-review
  → [DONE internal] → polish → submitted        [publishable phase only]
```

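The phase order can be treated as a simple ordered list. As a sketch only, the hypothetical helper `next_phase` below (not a registry function) looks up the phase that follows a given one. It assumes the `[DONE internal]` marker is a milestone rather than a value of the `phase` field, which matches the values listed in the template comment.

```bash
#!/usr/bin/env bash
# Ordered phase values, as listed for the `phase` field.
PHASES=(question review hypothesis design running analysis writing internal-review polish submitted)

# next_phase <current>: prints the following phase.
# Returns 1 if <current> is the last phase, 2 if it is unknown.
next_phase() {
  local current="$1" i
  for i in "${!PHASES[@]}"; do
    if [[ "${PHASES[$i]}" == "$current" ]]; then
      if (( i + 1 < ${#PHASES[@]} )); then
        echo "${PHASES[$((i + 1))]}"
        return 0
      fi
      return 1
    fi
  done
  echo "unknown phase: $current" >&2
  return 2
}

next_phase question          # prints: review
next_phase internal-review   # prints: polish
```

A linear array keeps the order explicit and makes an unknown or misspelled phase fail loudly instead of passing silently.
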
## Functions

| ID | Purity | Status | What it does |
|---|---|---|---|
| `init_paper_bash_pipelines` | impure | ✅ available | Scaffolds `papers/<NNNN-slug>/`: computes the next NNNN, creates the subfolders, copies `paper.md` + `preregistration.md` with the frontmatter filled in (slug, title, today's date, `phase: question`, `status: draft`) and a stub `references.md`. Does NOT run `git init` (the paper starts in the local internal phase). |
| `next_numbered_dir_bash_io` | impure | ✅ available | Given a directory, returns the next 4-digit incremental number (`0001`, `0002`, …) by scanning the subdirectories with an `NNNN-` prefix. Numbering helper of `init_paper` (reusable by reports/issues). |
| `preregister_hypothesis` | impure | 🚧 under construction (fleet) | Freezes `preregistration.md` (H0/H1 + analysis plan) with `frozen_at` + `content_hash`, sets `status` to `frozen` and writes `hypothesis_id` in `paper.md`. Kills HARKing: once frozen, the plan is not edited. |
| `cohens_d` (effect size) | pure | 🚧 under construction (fleet) | Effect size (Cohen's d) between two groups. Reports magnitude, not just significance. |
| `confidence_interval` | pure | 🚧 under construction (fleet) | Confidence interval of a metric (mean/difference). |
| `holm_bonferroni` | pure | 🚧 under construction (fleet) | Multiple-comparison correction (Holm-Bonferroni / FWER) for the analysis plan. |
| `render_paper_pdf` | impure | 🚧 under construction (fleet) | IMRaD Markdown (`paper.md` + figures) → `out/paper.pdf`, reusing the md→PDF engine of the `eda`/`datascience` group. |

> The statistical functions reuse whatever already exists in `datascience` (e.g. `fdr_correction_py_datascience` covers multiple-comparison correction by FDR; the experimental-rigor agent decides whether to add Holm-Bonferroni or reuse what exists). Search before duplicating: `mcp__registry__fn_search query="effect size" domain="datascience"`.

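`preregister_hypothesis` is still under construction, so its real behaviour is not defined on this page. Purely as an illustration of what freezing could look like, the sketch below hashes the body of a sample pre-registration and writes `frozen_at`, `content_hash`, and `status: frozen` into the frontmatter. Everything specific here is an assumption: SHA-256 as the hash, hashing only the text after the frontmatter, the UTC timestamp format, the sample file, and the use of `sha256sum` from GNU coreutils.

```bash
#!/usr/bin/env bash
set -euo pipefail

workdir="$(mktemp -d)"
prereg="$workdir/preregistration.md"
cat > "$prereg" <<'EOF'
---
paper_slug: 0001-mi-paper
frozen_at: ""
content_hash: ""
status: draft
---

# Pre-registration
- H0: no difference between groups.
- H1: the treatment lowers the metric.
EOF

# Assumption: hash only the body, i.e. every line after the closing '---'.
content_hash="$(awk 'n >= 2 { print } /^---$/ { n++ }' "$prereg" | sha256sum | cut -d' ' -f1)"
frozen_at="$(date -u +%Y-%m-%dT%H:%M:%SZ)"

# Rewrite the three frontmatter keys; the body is left untouched.
sed \
  -e "s|^frozen_at:.*|frozen_at: \"${frozen_at}\"|" \
  -e "s|^content_hash:.*|content_hash: \"${content_hash}\"|" \
  -e "s|^status:.*|status: frozen|" \
  "$prereg" > "$prereg.frozen"

frozen_status="$(grep '^status:' "$prereg.frozen")"
echo "$frozen_status"
echo "content_hash length: ${#content_hash}"
rm -rf "$workdir"
```

Hashing only the body means that filling in the frontmatter does not change the hash, so a later check can recompute it and detect any edit to the frozen plan.
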
### Peer review (not a registry function)

The adversarial agent `.claude/agents/paper-reviewer.md` (🚧 under construction by the fleet) scores novelty, rigor, reproducibility, and validity, and tries to **refute** every claim. It defaults to "failed" if the evidence does not support the claim. It writes its verdict to `reviews/`. It is the equivalent of the orchestrator's adversarial verifier, applied to the paper.

## Canonical example (end-to-end)

```bash
# 1. Scaffold the paper (question phase, local). Creates papers/0001-mi-paper/.
./fn run init_paper mi-paper --title "Does the reactive loop reduce inline calls?" --domain datascience --tags registry,telemetria

# 2. Literature review → fills in Related work (deep-research skill, review phase).
#    /deep-research "..."

# 3. Pre-registration: freeze H0/H1 + analysis plan BEFORE looking at data (hypothesis phase).
./fn run preregister_hypothesis papers/0001-mi-paper        # 🚧 under construction

# 4. Experiments in papers/0001-mi-paper/experiments/ (running phase) →
#    analysis with the `eda` group + effect size / CI / multiple-comparison functions (analysis phase).

# 5. IMRaD writing in paper.md (writing phase) → render the PDF deliverable.
./fn run render_paper_pdf papers/0001-mi-paper              # 🚧 under construction → out/paper.pdf

# 6. Adversarial peer review (internal-review phase).
#    Agent(subagent_type="paper-reviewer", prompt="Review papers/0001-mi-paper ...")   # 🚧 under construction
```

## Boundaries

- **NOT for work reports.** A report (`reports/`) is the written deliverable of a task (summary + evidence + gaps); a paper is research with a falsifiable hypothesis and an experiment. See `.claude/rules/reports.md`.
- **NOT indexed in `registry.db` in this phase.** There is no `papers` table and no `paper` `entity_type` (KISS); it would be added with its own migration if that is decided. The group's *functions* are indexed (they live in `bash/functions/`, `python/functions/`), but the `papers/<slug>/` artifacts are not.
- **Does NOT run `git init` in the scaffold.** The paper starts in the internal phase, local and gitignored. Promotion to a Gitea sub-repo (publishable phase) is a later manual step.
- **Does NOT support LaTeX/arXiv yet.** Chosen format: Markdown as the source + PDF as the deliverable. LaTeX support would be added when a paper is promoted to the publishable phase.

## Status

Scaffolding phase. Available: artifact structure, templates (`docs/templates/paper.md`, `docs/templates/preregistration.md`), the `init_paper` pipeline + `next_numbered_dir` helper, this page, and the gitignore block for `papers/`. Under construction by the fleet: `preregister_hypothesis`, statistical functions (effect size / CI / multiple-comparison correction), `render_paper_pdf`, and the `paper-reviewer` agent. End-to-end validation with a real pilot paper: pending.
@@ -0,0 +1,94 @@
---
title: "PAPER TITLE"
slug: NNNN-slug
authors: [Enmanuel]
date: 2026-01-01
status: draft # draft | internal | publishable
phase: question # question -> review -> hypothesis -> design -> running -> analysis -> writing -> internal-review -> polish -> submitted
tags: []
domain: datascience
hypothesis_id: "" # filled in by preregister_hypothesis when the pre-registration is frozen
---

<!--
Reproducible academic paper (IMRaD format). This is the editable SOURCE in Markdown;
the PDF deliverable is generated with render_paper_pdf (`papers` group).

Anti-paper-mill golden rule: a hypothesis that COULD fail + an experiment with a
real risk of refutation + statistics that are not theater. If there is no risk of
refutation, it is not a paper. Claims never exceed the evidence.
-->

# {{paper title}}

## Abstract

<!--
Structured summary in 4-6 sentences: context -> gap -> method -> results -> conclusion.
No citations, no undefined abbreviations. It is the only thing many people will read: it must stand on its own.
-->

## 1. Introduction

<!--
A funnel in four moves:
1. Context: the area and why it matters.
2. Gap: what is NOT yet known (the hole this paper fills).
3. Question / hypothesis: stated in a falsifiable way (see preregistration.md).
4. Contribution: an explicit list of what this work adds ("Contributions:").
-->

## 2. Related work

<!--
What already exists and why it is not enough. Group by approach, not by author. Each citation must
justify why the gap remains open. Output of the review phase (deep-research skill).
-->

## 3. Methods

<!--
REPRODUCIBLE design: another person runs it and gets the same result.
- Variables: independent, dependent, control.
- Design: N, conditions, sampling, randomization.
- Metrics and how they are measured.
- Step-by-step protocol + where the code (experiments/) and the data (data/) live.
Must be consistent with the frozen preregistration.md (the plan is not changed after seeing data).
-->

## 4. Results

<!--
UNINTERPRETED data. Tables and figures (figures/) with their literal reading.
Report effect size + confidence intervals, not just p-values.
Include negative / non-significant results too (anti cherry-picking).
-->

## 5. Discussion

<!--
Interpretation of the results in light of the question. Claims <= evidence.
-->

### 5.1 Limitations

<!-- What the study does not cover, assumptions, missing data. Explicit honesty. -->

### 5.2 Threats to validity

<!--
- Internal validity: is the cause what we say it is, or are there confounders?
- External validity: does it generalize beyond this sample/these conditions?
- Construct validity: does the metric measure what it claims to measure?
- Statistical validity: is N sufficient, are the test's assumptions met, are multiple comparisons corrected?
-->

## 6. Conclusion + Future work

<!--
Wrap-up in 2-4 sentences: what was learned (without overclaiming) + the next questions it opens.
-->

## References

<!-- See references.md. -->
@@ -0,0 +1,59 @@
---
paper_slug: NNNN-slug
frozen_at: "" # ISO timestamp, filled in by preregister_hypothesis when freezing
content_hash: "" # hash of the frozen content, filled in by preregister_hypothesis
status: draft # draft -> frozen (preregister_hypothesis sets it to frozen; once frozen it is NOT edited)
---

> **⚠️ THIS DOCUMENT IS FROZEN BEFORE LOOKING AT THE DATA (anti-HARKing).**
> The analysis plan is fixed here *before* running the experiment. Once frozen
> (`status: frozen`, with `frozen_at` + `content_hash`), **it is not edited**. Inventing or adjusting
> the hypothesis after seeing the results (HARKing) invalidates the paper. If the plan changes
> after seeing data, that is exploratory analysis and is reported as such, not as confirmatory.

# Pre-registration: {{paper title}}

## 1. Research question

<!-- The concrete question, in one sentence. It must be answerable with an experiment. -->

## 2. Hypotheses

<!-- Falsifiable (Popper): a prediction that COULD fail. -->

- **H0 (null):** <!-- no effect / no difference. This is what the test tries to reject. -->
- **H1 (alternative):** <!-- the expected effect, with its direction if there is one. -->

## 3. Variables

- **Independent:** <!-- what is manipulated. -->
- **Dependent:** <!-- what is measured (the outcome metric). -->
- **Control:** <!-- what is held fixed / controlled for statistically. -->

## 4. Design

<!--
- N: sample size (and justification / power analysis if applicable).
- Conditions / groups.
- Sampling and randomization.
- Data inclusion / exclusion criteria (defined NOW, not later).
-->

## 5. Analysis plan

<!--
The EXACT statistical plan, decided before seeing the data:
- Specific statistical test (e.g. Welch's t-test, Mann-Whitney U, regression...).
- Effect size metric (e.g. Cohen's d, difference of means, odds ratio).
- Decision criterion (alpha threshold, which result confirms/refutes H1).
- Multiple-comparison correction (e.g. Holm-Bonferroni) if there is more than one contrast.
- Handling of assumptions (normality, variances) and what is done if they are not met.
-->

## 6. Quantitative prediction

<!--
The concrete numerical prediction that the experiment will put to the test.
E.g. "we expect d >= 0.5 with a 95% CI that does not cross 0" or "a reduction >= 15% in metric X".
The more specific, the more falsifiable.
-->
@@ -64,6 +64,7 @@ from .exploratory_caveats import exploratory_caveats
from .render_eda_pdf import render_eda_pdf, render_eda_pdf_relational
from .render_automatic_eda_pdf import render_automatic_eda_pdf
from .render_automatic_eda_pptx import render_automatic_eda_pptx
from .render_automatic_eda_markdown import render_automatic_eda_markdown
from .detect_time_column import detect_time_column
from .extract_timeseries_raw import extract_timeseries_raw
from .build_eda_render_ctx import build_eda_render_ctx
@@ -82,6 +83,7 @@ __all__ = [
|
||||
"resample_timeseries",
|
||||
"render_automatic_eda_pdf",
|
||||
"render_automatic_eda_pptx",
|
||||
"render_automatic_eda_markdown",
|
||||
"decode_qr_image",
|
||||
"adf_kpss_stationarity",
|
||||
"acf_pacf",
|
||||
|
||||
@@ -36,6 +36,7 @@ from .model import ( # noqa: F401
|
||||
from .chapters_registry import CHAPTER_ORDER, build_chapter, build_document # noqa: F401
|
||||
from .render_pdf_impl import render_pdf # noqa: F401
|
||||
from .render_pptx_impl import render_pptx # noqa: F401
|
||||
from .render_md_impl import render_md # noqa: F401
|
||||
|
||||
__all__ = [
|
||||
"ENGINE_NAME",
|
||||
@@ -60,4 +61,5 @@ __all__ = [
|
||||
"build_document",
|
||||
"render_pdf",
|
||||
"render_pptx",
|
||||
"render_md",
|
||||
]
|
||||
|
||||
@@ -1,19 +1,25 @@
|
||||
"""Categorical distributions chapter (CAT DISTR).
|
||||
|
||||
Third reference chapter for AutomaticEDA. For every categorical column it shows,
|
||||
fulfilling the user's request:
|
||||
Third reference chapter for AutomaticEDA. Each categorical column gets **its own
|
||||
page (PDF) / slide (PPTX)**: every column is wrapped in a keep-together
|
||||
``model.Group`` with ``page_break_before=True`` (except the first, which may share
|
||||
the intro's page), so its chart sits next to its tables and no column is split.
|
||||
|
||||
1. A short opening explanation of **Shannon entropy** (what it measures, its 0
|
||||
and log2(k) bounds, the normalized 0–1 version) and the dataset row total used
|
||||
as a comparison baseline.
|
||||
2. Per column, a cardinality key/value table: distinct values, ``% distinct``
|
||||
(distinct / total rows), total dataset rows, singleton values (frequency 1),
|
||||
entropy with its theoretical maximum and the normalized ratio, mode, imbalance
|
||||
and string-length stats.
|
||||
3. A short note flagging problematic cardinality (id-like ≈100% distinct, or a
|
||||
A short intro names the clickable **[[term:entropia]]entropía[[/term]]** term —
|
||||
the full definition lives in the GLOSARIO chapter, so it is NOT repeated inline
|
||||
here (one click jumps to the glossary entry). The intro also carries the dataset
|
||||
row total used as a comparison baseline.
|
||||
|
||||
Per column the Group contains, in order:
|
||||
|
||||
1. A cardinality key/value table: distinct values, ``% distinct`` (distinct /
|
||||
total rows), total dataset rows, singleton values (frequency 1), entropy with
|
||||
its theoretical maximum and the normalized ratio, mode, imbalance and
|
||||
string-length stats.
|
||||
2. A short note flagging problematic cardinality (id-like ≈100% distinct, or a
|
||||
single dominating category).
|
||||
4. A ``top-k`` table (value / count / %).
|
||||
5. A **donut pie chart** of the most common categories (top-k + an "Otros"
|
||||
3. A ``top-k`` table (value / count / %).
|
||||
4. A **donut pie chart** of the most common categories (top-k + an "Otros"
|
||||
bucket), drawn lazily so the renderers scale it to fit entirely.
|
||||
|
||||
Data comes from the ``eda`` group: each ``columns[i]['categorical']`` is the
|
||||
@@ -33,7 +39,7 @@ import math
|
||||
|
||||
from .. import model
|
||||
|
||||
CHAPTER_VERSION = "1.1.0"
|
||||
CHAPTER_VERSION = "1.2.0"
|
||||
CHAPTER_ID = "cat_distr"
|
||||
CHAPTER_TITLE = "Distribuciones categóricas"
|
||||
|
||||
@@ -53,11 +59,17 @@ _TERM_ENTROPIA_DEF = (
|
||||
# Cap the number of categorical columns rendered to keep the document bounded;
|
||||
# the rest are summarized in a closing note (no silent truncation).
|
||||
MAX_COLS = 40
|
||||
# Rows shown in each top-k table and explicit slices in the pie.
|
||||
TOP_TABLE_ROWS = 15
|
||||
# Rows shown in each top-k table and explicit slices in the pie. Kept moderate so
|
||||
# the whole column — cardinality table + top-k table + donut — fits on ONE
|
||||
# page/slide with the chart next to its tables; the table note still reports
|
||||
# "top N of M" so nothing is silently hidden. For id-like columns (≈100%
|
||||
# distinct) the top-k table is dropped entirely (it would be a list of unique
|
||||
# values — pure noise), which also frees the room the donut needs (see build).
|
||||
TOP_TABLE_ROWS = 8
|
||||
PIE_TOP_K = 6
|
||||
# Truncate very long category labels in tables (the renderer also wraps).
|
||||
LABEL_MAX = 48
|
||||
# Truncate very long category labels in tables (the renderer also wraps). Kept
|
||||
# tight so a column with long id-like values (names, tickets) still fits its page.
|
||||
LABEL_MAX = 28
|
||||
|
||||
|
||||
def _fmt_int(value) -> str:
|
||||
@@ -267,45 +279,55 @@ def _normalize_card(card: dict) -> dict:
|
||||
|
||||
|
||||
def _cardinality_block(card: dict):
|
||||
"""KVTable with the cardinality / entropy metrics for one column."""
|
||||
"""KVTable with the cardinality / entropy metrics for one column.
|
||||
|
||||
Related metrics are grouped onto a single row each (distinct/%/unique;
|
||||
entropy bits/max/normalized; length min/mean/max) so the whole column —
|
||||
table + chart — fits one page/slide without dropping any datum; the short
|
||||
16:9 PPTX slide does not fit one metric per row plus a chart otherwise."""
|
||||
n_singletons = card.get("n_singletons")
|
||||
if n_singletons is not None and card.get("n_singletons_partial"):
|
||||
singletons = f"≥{_fmt_int(n_singletons)} (en top mostrado)"
|
||||
singletons = f"≥{_fmt_int(n_singletons)}"
|
||||
elif n_singletons is not None:
|
||||
singletons = _fmt_int(n_singletons)
|
||||
else:
|
||||
singletons = "—"
|
||||
|
||||
entropy_ref = _fmt_num(card.get("entropy"))
|
||||
emax = card.get("entropy_max")
|
||||
if emax is not None:
|
||||
entropy_ref = f"{entropy_ref} (máx {_fmt_num(emax)})"
|
||||
# Distinct count · % distinct · unique (frequency 1) on one row.
|
||||
distinct_combo = (f"{_fmt_int(card.get('n_distinct'))} · "
|
||||
f"{_fmt_pct_value(card.get('pct_distinct'))} · "
|
||||
f"{singletons} únicos")
|
||||
|
||||
# Entropy bits · theoretical max · normalized 0–1 on one row.
|
||||
entropy_combo = (f"{_fmt_num(card.get('entropy'))} bits · "
|
||||
f"máx {_fmt_num(card.get('entropy_max'))} · "
|
||||
f"norm {_fmt_num(card.get('entropy_norm'))}")
|
||||
|
||||
mode = card.get("mode")
|
||||
mode_pct = card.get("mode_pct")
|
||||
mode_str = "—" if mode is None else model._safe_str(mode)
|
||||
mode_str = "—" if mode is None else _truncate(mode, 32)
|
||||
if mode is not None and mode_pct is not None:
|
||||
mode_str = f"{mode_str} ({_fmt_pct_value(mode_pct)})"
|
||||
|
||||
rows = [
|
||||
("Valores distintos", _fmt_int(card.get("n_distinct"))),
|
||||
("% distintos", _fmt_pct_value(card.get("pct_distinct"))),
|
||||
("Distintos · % · únicos", distinct_combo),
|
||||
("Total filas (dataset)", _fmt_int(card.get("n_rows"))),
|
||||
("Valores únicos (frecuencia 1)", singletons),
|
||||
("Entropía (bits)", entropy_ref),
|
||||
("Entropía normalizada (0–1)", _fmt_num(card.get("entropy_norm"))),
|
||||
("Entropía (bits · máx · norm)", entropy_combo),
|
||||
("Moda", mode_str),
|
||||
]
|
||||
imbalance = card.get("imbalance")
|
||||
if imbalance is not None:
|
||||
rows.append(("Desbalance", _fmt_num(imbalance)))
|
||||
lm = card.get("len_min")
|
||||
lmean = card.get("len_mean")
|
||||
lmax = card.get("len_max")
|
||||
# Imbalance and string length (both secondary) share one closing row.
|
||||
extras = []
|
||||
if imbalance is not None:
|
||||
extras.append(f"desbalance {_fmt_num(imbalance)}")
|
||||
if any(v is not None for v in (lm, lmean, lmax)):
|
||||
rows.append((
|
||||
"Longitud (mín/media/máx)",
|
||||
f"{_fmt_num(lm)} / {_fmt_num(lmean)} / {_fmt_num(lmax)}"))
|
||||
extras.append(
|
||||
f"long. {_fmt_num(lm)}/{_fmt_num(lmean)}/{_fmt_num(lmax)}")
|
||||
if extras:
|
||||
rows.append(("Desbalance · longitud", " · ".join(extras)))
|
||||
return model.KVTable(rows=rows, title="Cardinalidad")
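The entropy row above reads three keys from `card`: `entropy`, `entropy_max` and `entropy_norm`. The code that fills them is not part of this diff, so the following is only a sketch of how such values could be derived from raw category values, assuming Shannon entropy in bits with a theoretical maximum of log2(k) for k distinct categories. `entropy_metrics` is a hypothetical name.

```python
import math
from collections import Counter


def entropy_metrics(values):
    """Shannon entropy in bits, its log2(k) maximum, and the 0-1 ratio."""
    counts = Counter(values)
    total = sum(counts.values())
    if total == 0:
        return {"entropy": None, "entropy_max": None, "entropy_norm": None}
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    entropy_max = math.log2(len(counts))  # 0.0 when there is a single category
    # Guard the division: with one category both entropy and its maximum are 0.
    entropy_norm = entropy / entropy_max if entropy_max > 0 else 0.0
    return {"entropy": entropy, "entropy_max": entropy_max,
            "entropy_norm": entropy_norm}


balanced = entropy_metrics(["a", "a", "b", "b"])
constant = entropy_metrics(["x", "x", "x"])
```

A perfectly balanced two-category column reaches its maximum (1 bit, normalized 1.0), while a constant column has zero entropy. These are the two bounds the glossary definition refers to.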
|
||||
|
||||
|
||||
@@ -315,7 +337,8 @@ def _flag_note(card: dict):
|
||||
return model.Note(
|
||||
"Casi todos los valores son distintos (≈100% distintos): la columna "
|
||||
"se comporta como un identificador y aporta poco para agrupar o "
|
||||
"comparar categorías.")
|
||||
"comparar categorías. No se lista el top de categorías (serían "
|
||||
"valores casi todos únicos).")
|
||||
if card.get("dominated"):
|
||||
mp = card.get("mode_pct")
|
||||
mp_str = _fmt_pct_value(mp) if mp is not None else "muy alta"
|
||||
@@ -335,7 +358,7 @@ def _topk_table(cat: dict):
|
||||
if not isinstance(t, dict):
|
||||
continue
|
||||
rows.append([
|
||||
model._safe_str(t.get("value")),
|
||||
_truncate(t.get("value")),
|
||||
_fmt_int(t.get("count")),
|
||||
_pct_from_maybe_fraction(t.get("pct")),
|
||||
])
|
||||
@@ -353,20 +376,16 @@ def _topk_table(cat: dict):
|
||||
def _intro_blocks(n_rows, mark_term: bool = False):
|
||||
total = _fmt_int(n_rows)
|
||||
# Mark the first appearance of the term as a clickable glossary jump when the
|
||||
# term was registered (mark_term). The visible text is identical either way.
|
||||
entropia = ("[[term:entropia]]**entropía de Shannon**[[/term]]" if mark_term
|
||||
else "**entropía de Shannon**")
|
||||
# term was registered (mark_term). The full definition of entropy lives in the
|
||||
# GLOSARIO chapter, so the intro only names the clickable term here instead of
|
||||
# repeating the long explanation (avoids the redundancy with the glossary).
|
||||
entropia = ("[[term:entropia]]entropía[[/term]]" if mark_term
|
||||
else "entropía")
|
||||
text = (
|
||||
f"La {entropia} mide cómo de repartidos están los valores de "
|
||||
"una columna categórica, en bits. Vale 0 cuando una sola categoría "
|
||||
"concentra todas las filas (máxima previsibilidad) y alcanza su máximo, "
|
||||
"log2(k) para k categorías distintas, cuando todas aparecen por igual "
|
||||
"(máxima diversidad). La **entropía normalizada** (entropía dividida por "
|
||||
"su máximo) la lleva al rango 0–1 para comparar columnas con distinto "
|
||||
"número de categorías. Para cada columna se muestran los valores "
|
||||
"distintos, el porcentaje que representan sobre el total de filas, los "
|
||||
"valores únicos (que aparecen una sola vez), la tabla de las categorías "
|
||||
"más frecuentes y un gráfico de tarta (donut) de las más comunes."
|
||||
f"Cada columna categórica ocupa su propia página: sus métricas de "
|
||||
f"cardinalidad —incluida la {entropia}—, una nota que señala cardinalidad "
|
||||
"problemática, la tabla de las categorías más frecuentes y un gráfico de "
|
||||
"tarta (donut) de las más comunes, todo junto."
|
||||
)
|
||||
if n_rows is not None:
|
||||
text += f" El dataset tiene {total} filas en total como referencia."
|
||||
@@ -398,24 +417,37 @@ def build_cat_distr(profile: dict, ctx: dict):
|
||||
blocks = list(_intro_blocks(n_rows, mark_term=mark_term))
|
||||
|
||||
rendered = cat_cols[:MAX_COLS]
|
||||
for col in rendered:
|
||||
for idx, col in enumerate(rendered):
|
||||
name = col.get("name") or "(columna)"
|
||||
cat = col.get("categorical") or {}
|
||||
card = _normalize_card(_cardinality(cat, n_rows))
|
||||
|
||||
blocks.append(model.Heading(text=str(name), level=2))
|
||||
blocks.append(_cardinality_block(card))
|
||||
# One Group per categorical column: heading + cardinality table + flag
|
||||
# note + top-k table + donut figure are kept together and the renderer
|
||||
# starts each on a fresh page/slide (page_break_before) so every column
|
||||
# gets its own page with its chart next to its tables. The first column
|
||||
# may share the intro's page (no forced break) to avoid a near-empty page.
|
||||
col_blocks = [
|
||||
model.Heading(text=str(name), level=2),
|
||||
_cardinality_block(card),
|
||||
]
|
||||
note = _flag_note(card)
|
||||
if note is not None:
|
||||
blocks.append(note)
|
||||
topk = _topk_table(cat)
|
||||
if topk is not None:
|
||||
blocks.append(topk)
|
||||
blocks.append(model.Figure(
|
||||
col_blocks.append(note)
|
||||
# For id-like columns (≈100% distinct) the top-k is a list of unique
|
||||
# values — pure noise; skip it (the flag note already explains why) and
|
||||
# let the donut take that room so the whole column fits one page/slide.
|
||||
if not card.get("id_like"):
|
||||
topk = _topk_table(cat)
|
||||
if topk is not None:
|
||||
col_blocks.append(topk)
|
||||
col_blocks.append(model.Figure(
|
||||
make=_pie_make(cat.get("top") or [], card.get("n_distinct"),
|
||||
str(name), n_rows),
|
||||
caption=(f"Categorías más comunes de «{_truncate(name, 32)}» "
|
||||
"(donut: top-k + «Otros»)")))
|
||||
blocks.append(model.Group(blocks=col_blocks,
|
||||
page_break_before=(idx > 0)))
|
||||
|
||||
if len(cat_cols) > len(rendered):
|
||||
omitted = len(cat_cols) - len(rendered)
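The per-column grouping introduced in this hunk can be seen in isolation. The sketch below uses a stand-in `Group` dataclass reduced to the two fields that matter here, and tuples in place of the real `model.Heading` / `KVTable` / `Figure` blocks. Both are simplifications for illustration, not the project's types.

```python
from dataclasses import dataclass, field


@dataclass
class Group:  # stand-in for model.Group, reduced to the fields used here
    blocks: list = field(default_factory=list)
    page_break_before: bool = False


def group_columns(column_names):
    """Wrap each column's blocks in a Group; only later groups force a break."""
    groups = []
    for idx, name in enumerate(column_names):
        col_blocks = [("heading", name), ("kv_table", name), ("figure", name)]
        # The first column may share the intro's page, so no break for idx 0.
        groups.append(Group(blocks=col_blocks, page_break_before=(idx > 0)))
    return groups


groups = group_columns(["categoria", "uuid", "region"])
```

With three columns the break flags come out as `False, True, True`, which is the same shape the updated tests assert on (`groups[0].page_break_before is False`, all later groups `True`).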
|
||||
|
||||
@@ -2,11 +2,14 @@
|
||||
|
||||
Self-contained: builds synthetic TableProfiles (no DuckDB) so the suite is fast
|
||||
and deterministic. Verifies that ``build_cat_distr`` emits the blocks the user
|
||||
asked for (entropy intro, distinct/total/%-distinct/unique metrics, top-k table
|
||||
and a donut figure), that the chapter renders inside the full document to both
|
||||
PDF and PPTX showing that content, that a profile with no categorical columns
|
||||
yields ``None`` without raising, and that long labels / many columns are never
|
||||
cut in either output.
|
||||
asked for (distinct/total/%-distinct/unique metrics, top-k table and a donut
|
||||
figure), that EACH categorical column is wrapped in its own keep-together
|
||||
``Group`` that starts on a fresh page/slide (one column per page, chart next to
|
||||
its tables), that the long entropy explanation is NOT repeated inline (it lives
|
||||
in the glossary — only the clickable term is kept), that the chapter renders
|
||||
inside the full document to both PDF and PPTX showing that content, that a
|
||||
profile with no categorical columns yields ``None`` without raising, and that
|
||||
long labels / many columns are never cut in either output.
|
||||
"""
|
||||
|
||||
import os
|
||||
@@ -17,7 +20,8 @@ from pypdf import PdfReader
|
||||
from pptx import Presentation
|
||||
|
||||
from datascience.automatic_eda.model import (
|
||||
DataTable, Figure, Heading, KVTable, Note,
|
||||
DataTable, Figure, GlossaryCollector, Group, Heading, KVTable, Markdown,
|
||||
Note,
|
||||
)
|
||||
from datascience.automatic_eda.chapters.cat_distr import (
|
||||
CHAPTER_ID, CHAPTER_VERSION, build_cat_distr,
|
||||
@@ -81,8 +85,20 @@ def _pptx_text(path: str) -> str:
|
||||
return re.sub(r"\s+", " ", " ".join(parts))
|
||||
|
||||
|
||||
def _kinds(chapter):
|
||||
return [b.kind for b in chapter.blocks]
|
||||
def _flatten(blocks):
|
||||
"""Expand keep-together Groups so the per-column heading/table/figure are
|
||||
inspectable as a flat block list (the chapter wraps each column in a Group)."""
|
||||
out = []
|
||||
for b in blocks:
|
||||
if getattr(b, "kind", "") == "group":
|
||||
out.extend(_flatten(getattr(b, "blocks", []) or []))
|
||||
else:
|
||||
out.append(b)
|
||||
return out
|
||||
|
||||
|
||||
def _column_groups(chapter):
|
||||
return [b for b in chapter.blocks if isinstance(b, Group)]
|
||||
|
||||
|
||||
def test_golden_build_cat_distr_emite_bloques_pedidos():
|
||||
@@ -90,36 +106,101 @@ def test_golden_build_cat_distr_emite_bloques_pedidos():
|
||||
assert ch is not None
|
||||
assert ch.id == CHAPTER_ID
|
||||
assert ch.version == CHAPTER_VERSION
|
||||
kinds = _kinds(ch)
|
||||
# Entropy intro present.
|
||||
|
||||
# Entropy intro present, but the long explanation is gone (it lives in the
|
||||
# glossary now): only the term is named, no log2/normalizada walkthrough.
|
||||
headings = [b.text for b in ch.blocks if isinstance(b, Heading)]
|
||||
assert any("Entrop" in h for h in headings)
|
||||
md = next(b for b in ch.blocks if b.kind == "markdown")
|
||||
assert "entropía" in md.text.lower() and "log2" in md.text
|
||||
# Cardinality metrics: distinct, total rows, %-distinct, unique values.
|
||||
kv = next(b for b in ch.blocks if isinstance(b, KVTable))
|
||||
md = next(b for b in ch.blocks if isinstance(b, Markdown))
|
||||
assert "entropía" in md.text.lower()
|
||||
assert "log2" not in md.text # redundant explanation removed.
|
||||
assert "máxima diversidad" not in md.text
|
||||
|
||||
# Per-column blocks are wrapped in keep-together Groups: flatten to inspect.
|
||||
flat = _flatten(ch.blocks)
|
||||
kv = next(b for b in flat if isinstance(b, KVTable))
|
||||
labels = [r[0] for r in kv.rows]
|
||||
assert "Valores distintos" in labels
|
||||
assert "% distintos" in labels
|
||||
values = " ".join(str(r[1]) for r in kv.rows)
|
||||
# Cardinality metrics: distinct count, %-distinct, unique values and total
|
||||
# rows are present (grouped onto compact rows so the chart fits the page).
|
||||
assert "Distintos · % · únicos" in labels
|
||||
assert "Total filas (dataset)" in labels
|
||||
assert "Valores únicos (frecuencia 1)" in labels
|
||||
assert any("Entropía" in lbl for lbl in labels)
|
||||
assert "únicos" in values and "%" in values
|
||||
assert "bits" in values and "norm" in values # entropy + max + normalized.
|
||||
# Top-k table + pie figure.
|
||||
dt = next(b for b in ch.blocks if isinstance(b, DataTable))
|
||||
dt = next(b for b in flat if isinstance(b, DataTable))
|
||||
assert dt.header == ["Valor", "Conteo", "%"]
|
||||
assert any("neumaticos" in str(cell) for row in dt.rows for cell in row)
|
||||
assert any(isinstance(b, Figure) for b in ch.blocks)
|
||||
# id-like column flagged with a Note.
|
||||
assert any(isinstance(b, Note) and "identificador" in b.text
|
||||
for b in ch.blocks)
|
||||
assert any(isinstance(b, Figure) for b in flat)
|
||||
# id-like column flagged with a Note that also explains the top-k is dropped.
|
||||
idnote = next((b for b in flat
|
||||
if isinstance(b, Note) and "identificador" in b.text), None)
|
||||
assert idnote is not None
|
||||
assert "No se lista el top" in idnote.text
|
||||
|
||||
|
||||
def test_golden_render_pdf_muestra_categoricas():
|
||||
def test_golden_idlike_omite_topk_y_conserva_donut():
|
||||
# The id-like column (uuid, 100% distinct) must NOT carry a top-k DataTable
|
||||
# (it would be a list of unique values), but must still keep its donut Figure
|
||||
# and its cardinality table so it stays a full per-column page.
|
||||
ch = build_cat_distr(_profile(), {})
|
||||
groups = _column_groups(ch)
|
||||
uuid_group = next(g for g in groups
|
||||
if any(getattr(b, "text", "") == "uuid" for b in g.blocks))
|
||||
kinds = [b.kind for b in uuid_group.blocks]
|
||||
assert "data_table" not in kinds # top-k of unique values dropped.
|
||||
assert "kv_table" in kinds # cardinality kept.
|
||||
assert "figure" in kinds # donut kept (chart per column).
|
||||
# A non-id-like column keeps its top-k table.
|
||||
cat_group = next(g for g in groups
|
||||
if any(getattr(b, "text", "") == "categoria"
|
||||
for b in g.blocks))
|
||||
assert "data_table" in [b.kind for b in cat_group.blocks]
|
||||
|
||||
|
||||
def test_golden_una_pagina_por_columna_groups():
|
||||
ch = build_cat_distr(_profile(), {})
|
||||
groups = _column_groups(ch)
|
||||
# Two categorical columns -> two column Groups (numeric column excluded).
|
||||
assert len(groups) == 2
|
||||
# Each Group carries one column: a heading + its cardinality table + figure.
|
||||
for g in groups:
|
||||
kinds = [b.kind for b in g.blocks]
|
||||
assert kinds[0] == "heading"
|
||||
assert "kv_table" in kinds
|
||||
assert "figure" in kinds
|
||||
# The first column may share the intro page (no forced break); every later
|
||||
# column starts on a fresh page/slide so each column gets its own page.
|
||||
assert groups[0].page_break_before is False
|
||||
assert all(g.page_break_before is True for g in groups[1:])
|
||||
|
||||
|
||||
def test_golden_entropia_clicable_y_definicion_en_glosario():
|
||||
# With a glossary collector the intro marks the clickable term and the FULL
|
||||
# definition (the long explanation removed from the intro) lands in the
|
||||
# glossary, not inline — no data lost, just relocated.
|
||||
gc = GlossaryCollector()
|
||||
ch = build_cat_distr(_profile(), {"glossary": gc})
|
||||
md = next(b for b in ch.blocks if isinstance(b, Markdown))
|
||||
assert "[[term:entropia]]entropía[[/term]]" in md.text
|
||||
assert gc.has("entropia")
|
||||
entry = gc.get("entropia")
|
||||
assert entry is not None
|
||||
# The definition kept in the glossary still carries the detail removed inline.
|
||||
assert "log2" in entry["definition"]
|
||||
assert "normalizada" in entry["definition"].lower()
|
||||
|
||||
|
||||
def test_golden_render_pdf_una_pagina_por_columna():
|
||||
with tempfile.TemporaryDirectory() as d:
|
||||
out = os.path.join(d, "eda.pdf")
|
||||
res = render_automatic_eda_pdf(_profile(), out, {"title": "EDA"})
|
||||
assert res["path"] == out and os.path.exists(out)
|
||||
assert CHAPTER_ID in [c["id"] for c in res["chapters"]]
|
||||
cat_meta = next(c for c in res["chapters"] if c["id"] == CHAPTER_ID)
|
||||
# Two categorical columns, each on its own page -> >= 2 pages for the
|
||||
# chapter (intro shares the first column's page).
|
||||
assert cat_meta["n_pages"] >= 2
|
||||
txt = _pdf_text(out)
|
||||
assert "Entrop" in txt
|
||||
assert "distintos" in txt
|
||||
@@ -133,13 +214,91 @@ def test_golden_render_pptx_muestra_categoricas():
|
||||
out = os.path.join(d, "eda.pptx")
|
||||
res = render_automatic_eda_pptx(_profile(), out, {"title": "EDA"})
|
||||
assert res["path"] == out and os.path.exists(out)
|
||||
assert CHAPTER_ID in [c["id"] for c in res["chapters"]]
|
||||
cat_meta = next(c for c in res["chapters"] if c["id"] == CHAPTER_ID)
|
||||
assert cat_meta["n_slides"] >= 2 # one slide per categorical column.
|
||||
txt = _pptx_text(out)
|
||||
assert "Entrop" in txt
|
||||
assert "categoria" in txt and "neumaticos" in txt
|
||||
assert "distintos" in txt
|
||||
|
||||
|
||||
def _profile_high_card() -> dict:
|
||||
"""Profile with a high-cardinality NON-id-like categorical column whose top-k
|
||||
of long values would split from its donut on a short 16:9 slide unless the
|
||||
renderer trims the table — the exact case the adversarial check flagged
|
||||
(Ticket / Cabin)."""
|
||||
long_vals = [f"Valor largo de categoria numero {i:02d} con texto extra"
|
||||
for i in range(40)]
|
||||
top = [{"value": v, "count": 60 - i, "pct": (60 - i) / 5000.0}
|
||||
for i, v in enumerate(long_vals)]
|
||||
return {
|
||||
"table": "t", "source": "t.csv", "n_rows": 5000, "n_cols": 3,
|
||||
"quality_score": 80.0,
|
||||
"columns": [
|
||||
{"name": "precio", "inferred_type": "numeric", "null_pct": 0.0,
|
||||
"numeric": {"mean": 1.0, "median": 1.0, "min": 0.0, "max": 2.0,
|
||||
"std": 0.5}},
|
||||
# 40 distinct over 5000 rows = 0.8% distinct -> NOT id-like, keeps
|
||||
# its (long) top-k table; the tall table must not push the donut off.
|
||||
{"name": "alta_card_col", "inferred_type": "categorical",
|
||||
"null_pct": 0.0, "distinct_count": 40,
|
||||
"categorical": {"top": top, "mode": long_vals[0], "n_distinct": 40,
|
||||
"entropy": 5.2, "imbalance": 1.2, "len_min": 40,
|
||||
"len_mean": 45, "len_max": 50}},
|
||||
{"name": "baja_card_col", "inferred_type": "categorical",
|
||||
"null_pct": 0.0, "distinct_count": 4,
|
||||
"categorical": {
|
||||
"top": [{"value": "norte", "count": 2000, "pct": 0.4},
|
||||
{"value": "sur", "count": 1500, "pct": 0.3},
|
||||
{"value": "este", "count": 1000, "pct": 0.2},
|
||||
{"value": "oeste", "count": 500, "pct": 0.1}],
|
||||
"mode": "norte", "n_distinct": 4, "entropy": 1.8}},
|
||||
],
|
||||
}
|
||||
|
||||
|
||||
def test_golden_pptx_una_slide_por_columna_con_su_grafico():
|
||||
"""Each categorical column occupies EXACTLY ONE cat_distr slide that carries
|
||||
BOTH its cardinality table and its donut figure (picture) — i.e. the chart is
|
||||
never separated from its table, even for a high-cardinality column."""
|
||||
from pptx.enum.shapes import MSO_SHAPE_TYPE
|
||||
|
||||
prof = _profile_high_card()
|
||||
cat_names = ["alta_card_col", "baja_card_col"]
|
||||
with tempfile.TemporaryDirectory() as d:
|
||||
out = os.path.join(d, "eda.pptx")
|
||||
res = render_automatic_eda_pptx(prof, out, {"title": "EDA"})
|
||||
assert res["path"] == out and os.path.exists(out)
|
||||
prs = Presentation(out)
|
||||
|
||||
# Per column: the cat_distr slides whose text mentions it, and whether the
|
||||
# owning slide also has the donut caption + an actual picture shape.
|
||||
slides_with_col = {n: [] for n in cat_names}
|
||||
owner_has_chart = {n: False for n in cat_names}
|
||||
for i, sl in enumerate(prs.slides):
|
||||
texts, has_pic = [], False
|
||||
for sh in sl.shapes:
|
||||
if sh.has_text_frame:
|
||||
texts.append(sh.text_frame.text)
|
||||
if sh.shape_type == MSO_SHAPE_TYPE.PICTURE:
|
||||
has_pic = True
|
||||
txt = re.sub(r"\s+", " ", " ".join(texts))
|
||||
if "Distribuciones categ" not in txt: # footer stamp of the chapter.
|
||||
continue
|
||||
for n in cat_names:
|
||||
if n in txt:
|
||||
slides_with_col[n].append(i)
|
||||
has_table = "Cardinalidad" in txt or "distintos" in txt
|
||||
if has_pic and "donut" in txt and has_table:
|
||||
owner_has_chart[n] = True
|
||||
|
||||
for n in cat_names:
|
||||
# Exactly one slide carries the column (not split across slides).
|
||||
assert len(slides_with_col[n]) == 1, (n, slides_with_col[n])
|
||||
# That single slide also holds its table AND its donut picture.
|
||||
assert owner_has_chart[n], (n, "tabla y donut no están en el mismo slide")
|
||||
|
||||
|
||||
def test_edge_sin_categoricas_devuelve_none():
|
||||
only_numeric = {
|
||||
"n_rows": 10, "columns": [
|
||||
@@ -170,11 +329,15 @@ def test_anti_corte_label_largo_y_muchas_columnas():
|
||||
|
||||
ch = build_cat_distr(profile, {})
|
||||
assert ch is not None
|
||||
# One Group per column, each forcing its own page (except the first).
|
||||
groups = _column_groups(ch)
|
||||
assert len(groups) == 30
|
||||
assert sum(1 for g in groups if g.page_break_before) == 29
|
||||
with tempfile.TemporaryDirectory() as d:
|
||||
pdf = os.path.join(d, "anti.pdf")
|
||||
res = render_automatic_eda_pdf(profile, pdf, {"write_manifest": False})
|
||||
assert res["path"] == pdf
|
||||
assert res["n_pages"] > 1 # many columns spilled across pages, OK.
|
||||
assert res["n_pages"] > 1 # one page per column, OK.
|
||||
txt = _pdf_text(pdf)
|
||||
# Long label wrapped (not truncated): every word survives.
|
||||
for word in ("Lorem", "incididunt", "reprehenderit", "voluptate"):
|
||||
|
||||
@@ -139,10 +139,17 @@ class Group:
|
||||
it starts on a fresh page and flows (honest degradation, never cut). Use it to
|
||||
bind ``Heading`` + ``Markdown`` + ``Figure`` of one idea together (see the
|
||||
DISTR NUM / AGREGACION chapters).
|
||||
|
||||
When ``page_break_before`` is True the renderer additionally forces the group
|
||||
to *start* on a fresh page/slide (unless the current one is already empty), so
|
||||
a chapter can give each unit its own page — e.g. one categorical column per
|
||||
page (see CAT DISTR). It is purely additive: the default False keeps the plain
|
||||
keep-together behaviour for every existing chapter.
|
||||
"""
|
||||
|
||||
blocks: list = field(default_factory=list)
|
||||
title: Optional[str] = None
|
||||
page_break_before: bool = False
|
||||
kind: str = field(default="group", init=False)
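The renderer changes that honour `page_break_before` are not shown in this part of the diff. As a rough illustration of the rule stated in the docstring (start on a fresh page unless the current one is already empty), here is a hypothetical `paginate` function over `(blocks, page_break_before)` pairs. It deliberately ignores the keep-together overflow logic and is not the project's renderer.

```python
def paginate(groups):
    """Assign blocks to pages, honouring page_break_before.

    A break is only emitted when the current page already holds something,
    so a leading group with page_break_before=True does not leave an empty
    first page behind.
    """
    pages = [[]]
    for blocks, page_break_before in groups:
        if page_break_before and pages[-1]:
            pages.append([])
        pages[-1].extend(blocks)
    return pages


pages = paginate([
    (["intro"], False),
    (["col A"], False),  # first column shares the intro's page
    (["col B"], True),
    (["col C"], True),
])
```

Because the default is `False`, any chapter that never sets the flag falls through the `if` and keeps flowing onto the current page, which matches the "purely additive" claim in the docstring.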
|
||||
|
||||
|
||||
@@ -228,7 +235,9 @@ def as_block(obj: Any):
|
||||
return Note(text=_safe_str(obj.get("text")))
|
||||
if cls is Group:
|
||||
return Group(blocks=as_blocks(obj.get("blocks")),
|
||||
title=obj.get("title"))
|
||||
title=obj.get("title"),
|
||||
page_break_before=bool(
|
||||
obj.get("page_break_before", False)))
|
||||
if cls is GlossaryEntry:
|
||||
return GlossaryEntry(key=_safe_str(obj.get("key")),
|
||||
label=_safe_str(obj.get("label")),
|
||||
|
||||
@@ -0,0 +1,458 @@
|
||||
"""AutomaticEDA Markdown serializer — one self-contained file to paste to an LLM.
|
||||
|
||||
Same document model as the PDF/PPTX renderers (an ordered list of
|
||||
:class:`Chapter`, each a list of format-independent blocks) but emitted as plain
|
||||
**Markdown** instead of a binary. The goal is different from the other two
|
||||
renderers: a Markdown EDA is meant to be *pasted into an LLM*, so it prioritises
|
||||
TEXT and DATA over visuals. Tables become Markdown tables (every row dumped, no
|
||||
pagination — nothing is cut because there are no pages); a ``Figure`` becomes its
|
||||
caption plus, when possible, the underlying bar/histogram data as a Markdown
|
||||
table (an LLM cannot see the image); glossary term markers are stripped while
|
||||
``**bold**`` is kept (it is valid Markdown).
|
||||
|
||||
dict-no-throw (the ``eda`` group style): :func:`render_md` never raises. On a
|
||||
fatal error it returns ``{path: None, ...}`` with a ``note`` explaining why; a
|
||||
malformed block degrades to a readable note rather than crashing the document.
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import os
|
||||
import re
|
||||
|
||||
from . import model
|
||||
|
||||
# Glossary span markers (kept text, dropped markers). We intentionally do NOT use
|
||||
# ``text_layout.strip_inline_md`` for Markdown blocks because that also removes
|
||||
# ``**bold**`` — valid Markdown we want to preserve when pasting to an LLM.
|
||||
_TERM_OPEN_RE = re.compile(r"\[\[term:[A-Za-z0-9_]+\]\]")
|
||||
_MAX_BAR_ROWS = 100
|
||||
|
||||
|
||||
# --------------------------------------------------------------------------- #
|
||||
# Small helpers.
|
||||
# --------------------------------------------------------------------------- #
|
||||
def _clean_terms(s) -> str:
|
||||
"""Drop glossary term markers, keeping the visible text (and any **bold**)."""
|
||||
s = model._safe_str(s)
|
||||
s = _TERM_OPEN_RE.sub("", s)
|
||||
return s.replace("[[/term]]", "")
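A standalone check of the marker stripping described above, assuming plain `str` input (the real helper first passes the value through `model._safe_str`, which is not reproduced here). The names `TERM_OPEN_RE` and `clean_terms` are local to this sketch.

```python
import re

# Same pattern as in the diff: keys are limited to ASCII letters, digits, "_".
TERM_OPEN_RE = re.compile(r"\[\[term:[A-Za-z0-9_]+\]\]")


def clean_terms(text: str) -> str:
    """Remove [[term:key]] / [[/term]] markers but keep the wrapped text."""
    return TERM_OPEN_RE.sub("", text).replace("[[/term]]", "")


cleaned = clean_terms("incluida la [[term:entropia]]**entropía**[[/term]] en bits")
hyphenated = clean_terms("[[term:mi-clave]]x[[/term]]")
```

The bold markers survive while the glossary markers disappear. One consequence of the character class is that a key containing a hyphen or an accented letter would not match, so its opening marker would be left in the output. Whether such keys can occur depends on how terms are registered, which this diff does not show.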
|
||||
|
||||
|
||||
def _cell(v) -> str:
|
||||
"""Render a value as a safe Markdown table cell.
|
||||
|
||||
Escapes pipes (``|`` -> ``\\|``) so they do not break the column layout and
|
||||
folds newlines to ``<br>`` so a multi-line value stays inside one cell. None
|
||||
becomes an empty string.
|
||||
"""
|
||||
s = model._safe_str(v)
|
||||
s = s.replace("|", "\\|")
|
||||
s = s.replace("\r\n", "\n").replace("\r", "\n").replace("\n", "<br>")
|
||||
return s
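To show why both escapes matter, the sketch below rebuilds the cell rules on plain `str` and feeds them into a hypothetical `md_table` helper. The table serializer itself lies outside this excerpt, so `md_table` is an assumption about how cells might be joined, not the module's code.

```python
def cell(value) -> str:
    """Escape a value for a Markdown table cell (same rules as above)."""
    s = "" if value is None else str(value)
    s = s.replace("|", "\\|")  # a raw pipe would start a new column
    return s.replace("\r\n", "\n").replace("\r", "\n").replace("\n", "<br>")


def md_table(header, rows) -> str:
    """Hypothetical helper: join escaped cells into a pipe table."""
    lines = ["| " + " | ".join(cell(h) for h in header) + " |",
             "| " + " | ".join("---" for _ in header) + " |"]
    lines += ["| " + " | ".join(cell(c) for c in row) + " |" for row in rows]
    return "\n".join(lines)


table = md_table(["Valor", "Conteo"], [["a|b", 3], ["line1\nline2", None]])
```

Without the pipe escape the first data row would parse as three columns; without the newline fold the second would break the table into two lines.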
|
||||
|
||||
|
||||
def _slug(text: str) -> str:
|
||||
"""GitHub-style heading anchor: lowercase, spaces->'-', drop other symbols."""
|
||||
s = model._safe_str(text).strip().lower()
|
||||
out = []
|
||||
for ch in s:
|
||||
if ch.isalnum():
|
||||
out.append(ch)
|
||||
elif ch in " -":
|
||||
out.append("-")
|
||||
# any other symbol is dropped.
|
||||
slug = "".join(out)
|
||||
while "--" in slug:
|
||||
slug = slug.replace("--", "-")
|
||||
return slug.strip("-")
|
||||
|
||||
|
||||
def _fmt_num(v) -> str:
|
||||
"""Compact number for the figure data tables (ints as ints, else 4 sig figs)."""
|
||||
try:
|
||||
f = float(v)
|
||||
except Exception: # noqa: BLE001
|
||||
return model._safe_str(v)
|
||||
if f != f: # NaN
|
||||
return "NaN"
|
||||
if abs(f) < 1e15 and f == int(f):  # magnitude first: int(inf) would raise OverflowError
|
||||
return str(int(f))
|
||||
return f"{f:.4g}"
|
||||
|
||||
|
||||
def _fmt_int(v) -> str:
|
||||
try:
|
||||
return str(int(v))
|
||||
except Exception: # noqa: BLE001
|
||||
return model._safe_str(v)
|
||||
|
||||
|
||||
def _now_iso() -> str:
|
||||
from datetime import datetime, timezone
|
||||
return datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%S UTC")
|
||||
|
||||
|
||||
# --------------------------------------------------------------------------- #
# Document header (title + metadata blockquote + numbered index).
# --------------------------------------------------------------------------- #
def _meta_block(meta: dict) -> list:
    """Build the metadata lines for the header blockquote (omitting absentees)."""
    ctx = meta.get("ctx") if isinstance(meta.get("ctx"), dict) else {}
    lines: list = []

    def add(label, value) -> None:
        if value is None:
            return
        s = model._safe_str(value).strip()
        if s and s.lower() != "none":
            lines.append(f"**{label}:** {s}")

    add("Dataset", ctx.get("dataset_name") or meta.get("dataset_name"))
    add("Fuente", ctx.get("source_origin") or meta.get("source_origin"))
    add("Almacenamiento", ctx.get("storage") or meta.get("storage"))
    n_rows = ctx.get("n_rows", meta.get("n_rows"))
    n_cols = ctx.get("n_cols", meta.get("n_cols"))
    if n_rows is not None and n_cols is not None:
        lines.append(
            f"**Dimensiones:** {_fmt_int(n_rows)} filas × {_fmt_int(n_cols)} columnas")
    add("Generado", meta.get("generated_at") or _now_iso())
    lines.append(f"**Motor:** {model.ENGINE_NAME} v{model.ENGINE_VERSION}")
    return lines


# --------------------------------------------------------------------------- #
# Per-block serializers. Each returns a Markdown string (no surrounding blanks;
# the caller separates blocks with a blank line).
# --------------------------------------------------------------------------- #
def _md_heading(block) -> str:
    level = int(getattr(block, "level", 1) or 1)
    hashes = "#" * min(level + 2, 6)  # level1 -> ###; '#'/'##' reserved for doc/chapter.
    text = _clean_terms(getattr(block, "text", "")).strip()
    return f"{hashes} {text}"


def _md_markdown(block) -> str:
    # Keep the text verbatim, dropping only glossary markers (keep **bold**).
    return _clean_terms(getattr(block, "text", "")).rstrip("\n")


def _md_kv_table(block) -> str:
    lines: list = []
    title = getattr(block, "title", None)
    if title:
        lines.append(f"**{_clean_terms(title).strip()}**")
        lines.append("")
    lines.append("| Campo | Valor |")
    lines.append("| --- | --- |")
    for row in (getattr(block, "rows", []) or []):
        try:
            label, value = row[0], row[1]
        except Exception:  # noqa: BLE001
            label, value = row, ""
        lines.append(f"| {_cell(label)} | {_cell(value)} |")
    return "\n".join(lines)


def _md_data_table(block) -> str:
    lines: list = []
    title = getattr(block, "title", None)
    if title:
        lines.append(f"**{_clean_terms(title).strip()}**")
        lines.append("")
    header = list(getattr(block, "header", []) or [])
    rows = list(getattr(block, "rows", []) or [])
    if not header:
        ncol = max((len(r) for r in rows), default=1)
        header = [f"col{i + 1}" for i in range(ncol)]
    ncol = len(header)
    lines.append("| " + " | ".join(_cell(h) for h in header) + " |")
    lines.append("| " + " | ".join(["---"] * ncol) + " |")
    for r in rows:  # dump every row — no pagination, nothing cut.
        cells = [_cell(r[c]) if c < len(r) else "" for c in range(ncol)]
        lines.append("| " + " | ".join(cells) + " |")
    note = getattr(block, "note", None)
    if note:
        lines.append("")
        lines.append(f"*{_clean_terms(note).strip()}*")
    return "\n".join(lines)


def _bars_table(bars: list) -> str:
    """Render extracted bar/histogram data as a Markdown table (Desde/Hasta/Frec)."""
    lines = ["| Desde | Hasta | Frecuencia |", "| --- | --- | --- |"]
    shown = bars[:_MAX_BAR_ROWS]
    for x0, x1, h in shown:
        lines.append(f"| {_fmt_num(x0)} | {_fmt_num(x1)} | {_fmt_num(h)} |")
    out = "\n".join(lines)
    extra = len(bars) - len(shown)
    if extra > 0:
        out += f"\n\n*… ({extra} filas más)*"
    return out


def _extract_bars(fig) -> list:
    """Collect (x_from, x_to, height) of the rectangular bars of a matplotlib fig.

    Histogram / bar-chart bars are ``matplotlib.patches.Rectangle`` with positive
    width and height; spines, legends and zero-area artists are skipped. Never
    raises — returns ``[]`` on any problem.
    """
    bars: list = []
    try:
        for ax in fig.get_axes():
            # Collect this axes' positive-area rectangles, then keep only the ones
            # that look like actual histogram/bar bins. Reference shapes that
            # matplotlib also stores in ``ax.patches`` — most notably the ``±1σ``
            # band drawn by ``axvspan`` (a single rectangle far wider than a bin)
            # and a lone Tukey boxplot box — would otherwise show up as fake
            # "bins". A histogram axes has several near-equal-width bars, so we
            # drop any rectangle whose width is more than twice the median width
            # of that axes' rectangles (the σ-band spans many bins; uniform bins
            # all sit at the median width and stay).
            ax_bars: list = []
            for patch in list(getattr(ax, "patches", []) or []):
                try:
                    w = patch.get_width()
                    h = patch.get_height()
                    x = patch.get_x()
                except Exception:  # noqa: BLE001 — not a Rectangle-like patch.
                    continue
                if w and w > 0 and h and h > 0:
                    ax_bars.append((x, x + w, h))
            if len(ax_bars) >= 3:
                widths = sorted(b[1] - b[0] for b in ax_bars)
                median_w = widths[len(widths) // 2]
                if median_w > 0:
                    ax_bars = [b for b in ax_bars
                               if (b[1] - b[0]) <= 2.0 * median_w]
            bars.extend(ax_bars)
    except Exception:  # noqa: BLE001
        return []
    return bars


def _md_figure(block, meta: dict, out_path: str, counter: list) -> str:
    """Serialize a Figure prioritising TEXT + DATA (an LLM cannot see the image).

    Emits the caption, then — if the matplotlib figure has bars — a Markdown table
    of the underlying (Desde, Hasta, Frecuencia) values. Optionally (when
    ``meta['embed_figures']`` is True) also exports a PNG beside the .md and adds
    an image link; off by default so the Markdown stays self-contained.
    """
    caption = model._safe_str(getattr(block, "caption", "")).strip()
    parts = [f"*Figura: {caption}*" if caption else "*Figura*"]
    fig = None
    try:
        import matplotlib
        matplotlib.use("Agg")  # defensive: headless rasterization backend.
        fig = getattr(block, "fig", None)
        make = getattr(block, "make", None)
        if fig is None and callable(make):
            fig = make()
        if fig is not None:
            bars = _extract_bars(fig)
            if bars:
                parts.append(_bars_table(bars))
            if meta.get("embed_figures"):
                png = _embed_png(fig, out_path, counter)
                if png:
                    parts.append(f"![{caption}]({png})")
    except Exception:  # noqa: BLE001 — a bad figure degrades to just its caption.
        pass
    finally:
        if fig is not None:
            try:
                import matplotlib.pyplot as plt
                plt.close(fig)
            except Exception:  # noqa: BLE001
                pass
    return "\n\n".join(parts)


def _embed_png(fig, out_path: str, counter: list) -> str:
    """Export the figure to ``<basename>_figN.png`` beside the .md; return its name."""
    try:
        counter[0] += 1
        base = os.path.splitext(os.path.basename(out_path))[0] or "figura"
        name = f"{base}_fig{counter[0]}.png"
        path = os.path.join(os.path.dirname(os.path.abspath(out_path)), name)
        fig.savefig(path, format="png", dpi=120, bbox_inches="tight")
        return name
    except Exception:  # noqa: BLE001
        return ""


def _md_image(block) -> str:
    path = model._safe_str(getattr(block, "path", ""))
    caption = model._safe_str(getattr(block, "caption", "")).strip()
    out = f"![{caption}]({path})"
    if caption:
        out += f"\n\n*{caption}*"
    return out


def _md_caption(block) -> str:
    return f"*{_clean_terms(getattr(block, 'text', '')).strip()}*"


def _md_note(block) -> str:
    text = _clean_terms(getattr(block, "text", "")).strip()
    lines = text.split("\n")
    return "\n".join((f"> {ln}" if ln.strip() else ">") for ln in lines)


def _md_group(block, meta: dict, out_path: str, counter: list) -> str:
    parts: list = []
    title = getattr(block, "title", None)
    if title:
        parts.append(f"### {_clean_terms(title).strip()}")
    for b in (getattr(block, "blocks", []) or []):
        try:
            seg = _serialize_block(b, meta, out_path, counter)
        except Exception:  # noqa: BLE001
            seg = ""
        if seg:
            parts.append(seg)
    return "\n\n".join(parts)


def _md_glossary_entry(block) -> str:
    label = (model._safe_str(getattr(block, "label", "")).strip()
             or model._safe_str(getattr(block, "key", "")).strip())
    definition = _clean_terms(getattr(block, "definition", "")).strip()
    out = f"### {label}"
    if definition:
        out += f"\n\n{definition}"
    return out


def _serialize_block(block, meta: dict, out_path: str, counter: list) -> str:
    """Dispatch a single block to its Markdown serializer. Unknown -> note."""
    kind = getattr(block, "kind", "")
    if kind == "heading":
        return _md_heading(block)
    if kind == "markdown":
        return _md_markdown(block)
    if kind == "kv_table":
        return _md_kv_table(block)
    if kind == "data_table":
        return _md_data_table(block)
    if kind == "figure":
        return _md_figure(block, meta, out_path, counter)
    if kind == "image":
        return _md_image(block)
    if kind == "caption":
        return _md_caption(block)
    if kind == "note":
        return _md_note(block)
    if kind == "group":
        return _md_group(block, meta, out_path, counter)
    if kind == "glossary_entry":
        return _md_glossary_entry(block)
    # Unknown content -> readable note (mirrors the model's defensive coercion).
    return _md_note(model.Note(text=model._safe_str(block)))


# --------------------------------------------------------------------------- #
# Entry point.
# --------------------------------------------------------------------------- #
def render_md(chapters: list, out_path: str, meta: dict = None) -> dict:
    """Serialize a list of Chapters into a single self-contained Markdown file.

    The output leads with ``# <title>``, a metadata blockquote and a numbered
    ``## Índice`` linking each chapter, then one ``## N. <title>`` section per
    chapter with its blocks. Tables become Markdown tables (every row dumped),
    figures become caption + underlying data table, glossary markers are stripped
    while ``**bold**`` is kept. Designed to be pasted into an LLM.

    Args:
        chapters: a list of ``Chapter`` (dataclasses or dicts); normalized
            defensively with ``model.as_chapters``.
        out_path: filesystem path for the ``.md`` (parent dirs are created).
        meta: optional dict. Recognised keys: ``title``, ``ctx`` (dict with
            ``dataset_name``/``source_origin``/``storage``/``n_rows``/``n_cols``),
            ``generated_at``, ``embed_figures`` (export PNGs beside the .md,
            default False).

    Returns:
        dict (never raises): ``{path: str|None, n_chars: int,
        chapters: list[{id, version}], note: str}``. On a fatal error ``path`` is
        None and ``note`` explains why.
    """
    meta = meta or {}
    chapters = model.as_chapters(chapters)
    title = model._safe_str(meta.get("title")) or model.ENGINE_NAME

    # Edge: nothing to render -> a minimal but valid Markdown document.
    if not chapters:
        content = (f"# {title}\n\n"
                   "*(documento vacío — sin capítulos aplicables)*\n")
        return _write(out_path, content, [], "documento vacío")

    counter = [0]  # document-wide figure counter for unique PNG names.
    notes: list = []
    segments: list = [f"# {title}"]

    meta_lines = _meta_block(meta)
    if meta_lines:
        segments.append("\n".join(f"> {ln}" for ln in meta_lines))

    # Numbered index. The anchor matches the chapter heading emitted below
    # (``## N. <title>``) in GitHub slug style.
    chap_heads = []
    idx_lines = ["## Índice"]
    for i, ch in enumerate(chapters, 1):
        head_text = f"{i}. {model._safe_str(ch.title)}"
        anchor = _slug(head_text)
        chap_heads.append((head_text, anchor))
        idx_lines.append(f"{i}. [{model._safe_str(ch.title)}](#{anchor})")
    segments.append("\n".join(idx_lines))

    chapters_meta = []
    for i, ch in enumerate(chapters, 1):
        segments.append("---")
        head_text, _anchor = chap_heads[i - 1]
        segments.append(f"## {head_text}")

        blocks = list(ch.blocks or [])
        # Omit a leading level-1 Heading that just repeats the chapter title.
        if blocks:
            b0 = blocks[0]
            if (getattr(b0, "kind", "") == "heading"
                    and int(getattr(b0, "level", 1) or 1) == 1
                    and _clean_terms(getattr(b0, "text", "")).strip()
                    == model._safe_str(ch.title).strip()):
                blocks = blocks[1:]

        for block in blocks:
            try:
                seg = _serialize_block(block, meta, out_path, counter)
            except Exception as e:  # noqa: BLE001
                seg = _md_note(model.Note(text=model._safe_str(block)))
                notes.append(
                    f"bloque '{getattr(block, 'kind', '?')}' del capítulo "
                    f"'{ch.id}' degradado: {e}")
            if seg:
                segments.append(seg)
        chapters_meta.append({"id": ch.id, "version": ch.version})

    content = "\n\n".join(segments) + "\n"
    note = f"{len(content)} caracteres"
    if notes:
        note += " · " + "; ".join(notes)
    return _write(out_path, content, chapters_meta, note)


def _write(out_path: str, content: str, chapters_meta: list, note: str) -> dict:
    """Write the Markdown to disk (creating parents). dict-no-throw."""
    try:
        parent = os.path.dirname(os.path.abspath(out_path))
        os.makedirs(parent, exist_ok=True)
        with open(out_path, "w", encoding="utf-8") as fh:
            fh.write(content)
    except Exception as e:  # noqa: BLE001 — never raise from the writer.
        return {"path": None, "n_chars": 0, "chapters": [],
                "note": f"no se pudo escribir el Markdown: {e}"}
    return {"path": out_path, "n_chars": len(content),
            "chapters": chapters_meta, "note": note}

@@ -675,6 +675,61 @@ def _measure_figure_like(block) -> float:
    return target_h + 0.04 + cap_h + _GAP


def _measure_kv_table(block) -> float:
    """Faithful height of a KVTable — matches ``_place_kv_table``.

    Counts the optional title heading and, per row, the wrapped VALUE column
    (the label column never wraps in the placer). The previous estimate assumed
    one line per row and ignored the title, so a column's keep-together Group
    under-budgeted the figure and the chart spilled to the next page. Keep this in
    sync with ``_place_kv_table``."""
    h = 0.0
    title = getattr(block, "title", None)
    if title:
        h += _measure_heading_text(title, 2)
    rows = getattr(block, "rows", []) or []
    key_w = 1.9
    val_chars = tl.chars_per_line(_USABLE_W - key_w - 0.1, _FS_BODY)
    lh = tl.line_height_in(_FS_BODY)
    for row in rows:
        try:
            value = row[1]
        except Exception:  # noqa: BLE001
            value = ""
        v_lines = tl.wrap(model._safe_str(value), val_chars)
        h += lh * len(v_lines) + _ROW_VPAD
    return h + _GAP


def _measure_data_table(block) -> float:
    """Faithful height of a DataTable — matches ``_place_data_table``.

    Counts the optional title heading, the wrapped header row, every wrapped data
    row (per-column wrap via the same ``_col_widths``/``_wrap_row`` the placer
    uses) and the optional note. Keep this in sync with ``_place_data_table``."""
    h = 0.0
    title = getattr(block, "title", None)
    if title:
        h += _measure_heading_text(title, 2)
    header = list(getattr(block, "header", []) or [])
    rows = list(getattr(block, "rows", []) or [])
    fs = _FS_CELL
    widths = _col_widths(header, rows, fs)
    lh = tl.line_height_in(fs)
    if header:
        header_lines = _wrap_row(header, widths, fs)
        h += lh * max((len(c) for c in header_lines), default=1) + _ROW_VPAD * 2
    for r in rows:
        cells_lines = _wrap_row(r, widths, fs)
        h += lh * max((len(c) for c in cells_lines), default=1) + _ROW_VPAD * 2
    note = getattr(block, "note", None)
    if note:
        nlines = tl.wrap(model._safe_str(note),
                         tl.chars_per_line(_USABLE_W, _FS_NOTE))
        h += tl.line_height_in(_FS_NOTE) * len(nlines)
    return h + _GAP


def _measure_block(st: _PdfState, block) -> float:
    kind = getattr(block, "kind", "")
    try:
@@ -690,13 +745,9 @@ def _measure_block(st: _PdfState, block) -> float:
                             tl.chars_per_line(_USABLE_W, _FS_NOTE))
             return tl.line_height_in(_FS_NOTE) * len(lines) + _GAP
         if kind == "kv_table":
-            rows = getattr(block, "rows", []) or []
-            return (tl.line_height_in(_FS_BODY) + _ROW_VPAD) * (len(rows) + 1) \
-                + _GAP
+            return _measure_kv_table(block)
         if kind == "data_table":
-            rows = getattr(block, "rows", []) or []
-            return (tl.line_height_in(_FS_CELL) + _ROW_VPAD * 2) \
-                * (len(rows) + 1) + _GAP
+            return _measure_data_table(block)
         if kind == "group":
             return sum(_measure_block(st, b)
                        for b in (getattr(block, "blocks", []) or []))
@@ -735,6 +786,10 @@ def _place_group(st: _PdfState, block) -> None:
    blocks = getattr(block, "blocks", []) or []
    if not blocks:
        return
    # Opt-in page break: start this group on a fresh page unless the current one
    # is still empty (so a chapter can give each unit its own page).
    if getattr(block, "page_break_before", False) and st.y > _CONTENT_TOP + 1e-6:
        _new_page(st)
    avail_full = _CONTENT_BOTTOM - _CONTENT_TOP
    _shrink_group_figures(st, blocks, avail_full)
    total = sum(_measure_block(st, b) for b in blocks)

@@ -625,6 +625,55 @@ def _measure_figure_like(block) -> float:
    return target_h + 0.05 + cap_h + _GAP


def _measure_kv_table(block) -> float:
    """Faithful KVTable height — matches ``_place_kv_table`` (rendered as a
    Campo/Valor data table with wrapped cells). The previous estimate assumed one
    line per row and ignored the title, so a keep-together Group under-budgeted
    the figure and the chart spilled to the next slide. Keep in sync."""
    h = 0.0
    title = getattr(block, "title", None)
    if title:
        h += _measure_heading_text(title, 2)
    rows = getattr(block, "rows", []) or []
    data_rows = []
    for row in rows:
        try:
            label, value = row[0], row[1]
        except Exception:  # noqa: BLE001
            label, value = str(row), ""
        data_rows.append([model._safe_str(label), model._safe_str(value)])
    header = ["Campo", "Valor"]
    widths = _col_widths(header, data_rows)
    fs = _FS_CELL
    h += _row_height_in(header, widths, fs)
    for r in data_rows:
        h += _row_height_in(r, widths, fs)
    return h + _GAP


def _measure_data_table(block) -> float:
    """Faithful DataTable height — matches ``_place_data_table`` (title heading +
    wrapped header + every wrapped row + optional note). Keep in sync."""
    h = 0.0
    title = getattr(block, "title", None)
    if title:
        h += _measure_heading_text(title, 2)
    header = list(getattr(block, "header", []) or [])
    rows = list(getattr(block, "rows", []) or [])
    fs = _FS_CELL
    widths = _col_widths(header, rows)
    if header:
        h += _row_height_in(header, widths, fs)
    for r in rows:
        h += _row_height_in(r, widths, fs)
    note = getattr(block, "note", None)
    if note:
        nlines = tl.wrap(model._safe_str(note),
                         tl.chars_per_line(_USABLE_W, _FS_NOTE))
        h += tl.line_height_in(_FS_NOTE) * len(nlines) + 0.05
    return h + _GAP


def _measure_block(st: _PptxState, block) -> float:
    kind = getattr(block, "kind", "")
    try:
@@ -639,9 +688,10 @@ def _measure_block(st: _PptxState, block) -> float:
             lines = tl.wrap(getattr(block, "text", ""),
                             tl.chars_per_line(_USABLE_W, _FS_NOTE))
             return tl.line_height_in(_FS_NOTE) * len(lines) + 0.05 + _GAP
-        if kind in ("kv_table", "data_table"):
-            rows = getattr(block, "rows", []) or []
-            return (tl.line_height_in(_FS_CELL) + 0.10) * (len(rows) + 1) + _GAP
+        if kind == "kv_table":
+            return _measure_kv_table(block)
+        if kind == "data_table":
+            return _measure_data_table(block)
         if kind == "group":
             return sum(_measure_block(st, b)
                        for b in (getattr(block, "blocks", []) or []))
@@ -664,10 +714,14 @@ def _shrink_group_figures(st: _PptxState, blocks: list, avail_full: float) -> No
                    if getattr(b, "kind", "") not in ("figure", "image"))
     fig_overhead = tl.line_height_in(_FS_NOTE) + 0.05 + 0.05 + _GAP
     budget = avail_full - nonfig_h - 0.10 * len(fig_blocks)
-    if budget <= 1.0:
+    # Low thresholds: a 16:9 slide is short, so a content-heavy column (cardinality
+    # table + top-k + chart) only fits if the chart is allowed to shrink small.
+    # Prefer a small-but-present chart on the SAME slide over splitting the column
+    # across slides (matches the PDF renderer's keep-together philosophy).
+    if budget <= 0.6:
         return  # not enough room to keep together; let it flow (degrade).
     per = budget / len(fig_blocks) - fig_overhead
-    if per <= 0.8:
+    if per <= 0.35:
         return
     for fb in fig_blocks:
         cur = getattr(fb, "height_in", None)
@@ -675,12 +729,90 @@ def _shrink_group_figures(st: _PptxState, blocks: list, avail_full: float) -> No
                if isinstance(cur, (int, float)) and cur > 0 else per)


# Minimum height (inches) reserved for a figure inside a keep-together group on
# the short 16:9 slide. When a high-cardinality column's table(s) would otherwise
# leave no room, the data table is trimmed (with an honest note) so the chart
# stays on the SAME slide next to its table instead of spilling to the next one.
_GROUP_MIN_FIG_H = 1.3


def _trim_data_table_to_budget(block, budget: float):
    """Return a copy of a DataTable whose rows fit within ``budget`` inches.

    Keeps the title, header, as many leading rows as fit (at least one) and an
    honest note reporting how many of the original rows are shown. NEVER mutates
    the original block — the same Chapter blocks are rendered by the PDF renderer,
    which keeps the full table (an A5 page fits it)."""
    header = list(getattr(block, "header", []) or [])
    rows = list(getattr(block, "rows", []) or [])
    title = getattr(block, "title", None)
    fs = _FS_CELL
    widths = _col_widths(header, rows)
    fixed = 0.0
    if title:
        fixed += _measure_heading_text(title, 2)
    if header:
        fixed += _row_height_in(header, widths, fs)
    note_h = tl.line_height_in(_FS_NOTE) + 0.05
    avail_rows = budget - fixed - note_h - _GAP
    kept = []
    used = 0.0
    for r in rows:
        rh = _row_height_in(r, widths, fs)
        if used + rh > avail_rows and kept:
            break
        kept.append(r)
        used += rh
    if len(kept) >= len(rows):
        return block  # already fits; keep the original (with its own note).
    note = (f"top {len(kept)} de {len(rows)} categorías mostradas "
            "(recortado para caber en el slide; el PDF muestra más)")
    return model.DataTable(header=header, rows=kept, title=title, note=note)


def _fit_group_blocks(st: _PptxState, blocks: list, avail_full: float) -> list:
    """Return a slide-fitting copy of a keep-together group's blocks.

    On the short 16:9 slide a high-cardinality column's top-k table plus its
    chart can overflow. Reserve ``_GROUP_MIN_FIG_H`` for the (later shrunk) figure
    and trim the data table(s) to what is left, so every column keeps its chart
    next to its table on ONE slide. No-op when the group has no figure+table pair
    (e.g. id-like columns already drop the top-k upstream, or it already fits)."""
    has_fig = any(getattr(b, "kind", "") in ("figure", "image") for b in blocks)
    tbls = [b for b in blocks if getattr(b, "kind", "") == "data_table"]
    if not (has_fig and tbls):
        return blocks
    fixed_h = sum(_measure_block(st, b) for b in blocks
                  if getattr(b, "kind", "") not in ("figure", "image",
                                                    "data_table"))
    tables_h = sum(_measure_block(st, b) for b in tbls)
    budget_tables = avail_full - fixed_h - _GROUP_MIN_FIG_H
    if tables_h <= budget_tables:
        return blocks  # already fits next to a min-height figure; leave intact.
    out = []
    for b in blocks:
        if getattr(b, "kind", "") != "data_table":
            out.append(b)
            continue
        trimmed = _trim_data_table_to_budget(b, max(budget_tables, 0.8))
        out.append(trimmed)
        budget_tables -= _measure_data_table(trimmed)
    return out


def _place_group(st: _PptxState, block) -> None:
    """Render a keep-together Group: move it whole to the next slide if needed."""
    blocks = getattr(block, "blocks", []) or []
    if not blocks:
        return
    # Opt-in slide break: start this group on a fresh slide unless the current one
    # is still empty (so a chapter can give each unit its own slide).
    if getattr(block, "page_break_before", False) and st.y > _CONTENT_TOP + 1e-6:
        _new_slide(st, cont=True)
    avail_full = _CONTENT_BOTTOM - _CONTENT_TOP
    # Trim oversized tables first (keeps the chart on the same slide), then shrink
    # the figure to share the remaining room.
    blocks = _fit_group_blocks(st, blocks, avail_full)
    _shrink_group_figures(st, blocks, avail_full)
    total = sum(_measure_block(st, b) for b in blocks)
    if total <= avail_full:

@@ -0,0 +1,89 @@
---
name: render_automatic_eda_markdown
kind: function
lang: py
domain: datascience
version: "1.0.0"
purity: impure
signature: "def render_automatic_eda_markdown(chapters_or_profile, out_path: str, meta: dict = None) -> dict"
description: "Renders an AutomaticEDA document organized by CHAPTERS (a format-independent block model) into a single self-contained MARKDOWN file meant to be PASTED INTO AN LLM. Accepts either a list of model chapters or a TableProfile from the eda group directly (the canonical chapters are then built with build_document). Prioritizes TEXT + DATA over visuals: tables are dumped as Markdown tables with ALL rows (no pagination, since there are no pages to cut), a matplotlib figure is reduced to its caption plus the underlying data table (Desde/Hasta/Frecuencia of the histogram bars) because an LLM cannot see the image, and glossary markers are removed while **bold** is kept. Includes a header (# title), a metadata blockquote and a numbered index with GitHub-style anchors. Mirrors render_automatic_eda_pdf/render_automatic_eda_pptx but WITHOUT a manifest (KISS: the Markdown is a single text artifact). dict-no-throw: never raises, returns {path, n_chars, chapters, note}; on a fatal error path is None and note explains the cause. The optional meta['embed_figures'] flag exports PNGs beside the .md (off by default)."
tags: [eda, markdown, render, report, llm, automatic-eda, chapters, versioned, no-cut, text, datascience, python]
uses_functions: []
uses_types: []
returns: []
returns_optional: false
error_type: "error_go_core"
imports: [os, re, matplotlib, "datascience.automatic_eda"]
params:
- name: chapters_or_profile
desc: "a list of AutomaticEDA model chapters (Chapter dataclasses or dicts {id,title,version,blocks}) OR a TableProfile dict from the eda group. If it is a TableProfile, the canonical chapters are built with build_document(profile, meta['ctx']). Supported blocks: heading, markdown, kv_table, data_table, figure, image, caption, note, group, glossary_entry. Defensive reading: anything unrecognized is degraded to a Note; it never raises."
- name: out_path
desc: "path of the output .md file. Parent directories are created if missing. Non-writable directory → {path:None, note:<cause>} without raising."
- name: meta
desc: "optional dict. Keys: title (document title), ctx (dict with dataset_name→Dataset, source_origin→Fuente, storage→Almacenamiento, n_rows/n_cols→Dimensiones; also consumed by the chapter builders when a profile is given), generated_at (timestamp; if missing, an ISO-style UTC timestamp is generated), embed_figures (True to export PNGs <basename>_figN.png beside the .md; default False, so the Markdown stays self-contained)."
output: "dict (never raises): {path: str|None, n_chars: int, chapters: list[{id,version}], note: str}. On a fatal error (e.g. a non-writable directory) path is None and note explains the cause. A document with no applicable chapters produces a minimal valid Markdown with 'documento vacío' and chapters=[]."
tested: true
tests: ["test_golden_bloques_sinteticos_serializa_todo_a_markdown", "test_edge_documento_vacio_no_revienta", "test_profile_path_construye_capitulos_y_escribe"]
test_file_path: "python/functions/datascience/render_automatic_eda_markdown_test.py"
file_path: "python/functions/datascience/render_automatic_eda_markdown.py"
---

## Example

```python
from datascience import render_automatic_eda_markdown

# From a TableProfile of the eda group (same model as the PDF/PPTX renderers).
profile = {
    "table": "ventas", "source": "/data/ventas.csv",
    "n_rows": 1000, "n_cols": 2, "quality_score": 92.5,
    "columns": [
        {"name": "precio", "inferred_type": "numeric", "null_pct": 0.01,
         "numeric": {"mean": 42.5, "median": 40.0, "min": 1.0, "max": 100.0,
                     "std": 12.3}},
        {"name": "categoria", "inferred_type": "categorical", "null_pct": 0.0,
         "categorical": {"top": [{"value": "neumaticos", "count": 500}]}},
    ],
}
res = render_automatic_eda_markdown(
    profile, "reports/ventas_aeda.md",
    {"title": "EDA — ventas",
     "ctx": {"dataset_name": "Ventas", "source_origin": "ERP export",
             "n_rows": 1000, "n_cols": 2}})
print(res["path"], res["n_chars"], res["chapters"])
# -> reports/ventas_aeda.md 4123 [{'id':'portada','version':'1.0.0'}, ...]
```
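
Because the renderer reports failures through its return value instead of raising, a caller has to inspect the returned dict. The sketch below illustrates that contract with a hypothetical stand-in, `write_markdown_no_throw` (it is not part of the library), which only mimics the documented return shape `{path, n_chars, chapters, note}`. The `report` helper is likewise an assumption made for illustration.

```python
import os
import tempfile


def write_markdown_no_throw(out_path: str, content: str) -> dict:
    """Hypothetical stand-in for the dict-no-throw contract (not the real renderer)."""
    try:
        parent = os.path.dirname(os.path.abspath(out_path))
        os.makedirs(parent, exist_ok=True)
        with open(out_path, "w", encoding="utf-8") as fh:
            fh.write(content)
    except Exception as e:  # never raise: report the failure in the dict instead.
        return {"path": None, "n_chars": 0, "chapters": [], "note": str(e)}
    return {"path": out_path, "n_chars": len(content), "chapters": [],
            "note": f"{len(content)} caracteres"}


def report(res: dict) -> str:
    """Turn a result dict into a one-line status for logs."""
    if res["path"] is None:
        return f"FAILED: {res['note']}"
    return f"OK: {res['path']} ({res['n_chars']} chars)"


with tempfile.TemporaryDirectory() as tmp:
    # Missing parent directories are created on the fly.
    ok = write_markdown_no_throw(os.path.join(tmp, "sub", "eda.md"), "# EDA\n")
    # A "parent directory" that is actually a regular file cannot be created.
    blocker = os.path.join(tmp, "blocker")
    with open(blocker, "w", encoding="utf-8") as fh:
        fh.write("x")
    bad = write_markdown_no_throw(os.path.join(blocker, "eda.md"), "# EDA\n")
    print(report(ok))
    print(report(bad))
```

Checking `res["path"] is None` is the only failure signal a caller needs; `note` carries the human-readable cause.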

## When to use it

When you want to **paste the EDA into an LLM** (ChatGPT, Claude, ...) or keep it as
versionable plain text: the same chapter-based document as the PDF/PPTX, but serialized
to Markdown with no binaries. Use it as a third output alongside
`render_automatic_eda_pdf` (mobile) and `render_automatic_eda_pptx` (sharing), all from
the SAME chapter model. Unlike those two, there are no pages or slides: every row of
every table is dumped (nothing is cut) and each figure is reduced to its caption plus
the underlying data table, which is what an LLM can actually read. To add chapters to
the document, see `docs/capabilities/automatic_eda.md`.

## Gotchas

- **Impure**: writes the `.md` to `out_path` (creating the parent directories). With
  `meta['embed_figures']=True` it also exports one PNG `<basename>_figN.png` per figure
  next to the `.md`; by default it exports NOTHING and the Markdown stays self-contained.
- **Never raises** (dict-no-throw): a block that fails degrades to a note and is recorded
  in `note`; the document is still written. An empty profile/list produces a minimal
  valid Markdown with `*(documento vacío …)*` and `chapters=[]`.
- **Figures = data, not image**: a `figure` block is serialized as `*Figura: caption*`
  plus, if the matplotlib figure contains bars (histogram / bar chart), a
  `| Desde | Hasta | Frecuencia |` table extracted from the `Rectangle` patches (max 100
  rows; the rest is truncated with `*… (N filas más)*`). If there are no bars or anything
  fails, only the caption is emitted. The figure is closed (`plt.close`) after reading.
- **Glossary vs bold**: ONLY the glossary markers `[[term:key]]visible[[/term]]` are
  removed (leaving `visible`); Markdown `**bold**` IS PRESERVED (it is valid).
  `strip_inline_md` is not used here because it also strips bold.
- **Index anchors**: `## Índice` links each chapter with a GitHub-style anchor of the
  heading `## N. Título` (lowercase, spaces→`-`, punctuation removed). If two chapters
  share the exact same title their anchors collide (a rare case; the canonical chapters
  have unique titles).
- **Tables**: cells escape `|` (→ `\|`) and fold line breaks into `<br>` so the column
  does not break. There is no width allocation; an LLM does not need it.
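
The text rules in the list above can be sketched in a few lines. This is only an illustration under stated assumptions, not the library's code: the helper names (`strip_glossary_markers`, `escape_cell`, `github_anchor`, `bins_table`) are hypothetical, the anchor rule merely approximates GitHub's slug algorithm, and the bins are passed as plain `(lo, hi, count)` tuples rather than read from a matplotlib figure.

```python
import re

# Marker syntax as described in the Gotchas: [[term:key]]visible[[/term]].
_TERM_RE = re.compile(r"\[\[term:[^\]]*\]\](.*?)\[\[/term\]\]", re.DOTALL)


def strip_glossary_markers(text: str) -> str:
    """Drop glossary markers but keep the visible text and any **bold**."""
    return _TERM_RE.sub(r"\1", text)


def escape_cell(value) -> str:
    """Make a value safe inside a Markdown table cell."""
    text = str(value).replace("|", "\\|")
    return text.replace("\r\n", "<br>").replace("\n", "<br>")


def github_anchor(heading: str) -> str:
    """Approximate a GitHub-style anchor: lowercase, no punctuation, spaces to '-'."""
    kept = "".join(ch for ch in heading.lower() if ch.isalnum() or ch in " -_")
    return kept.replace(" ", "-")


def bins_table(bins, max_rows: int = 100) -> str:
    """Render (lo, hi, count) tuples as a table, truncating after max_rows."""
    lines = ["| Desde | Hasta | Frecuencia |", "| --- | --- | --- |"]
    for lo, hi, count in bins[:max_rows]:
        lines.append(f"| {lo:g} | {hi:g} | {count:g} |")
    extra = len(bins) - max_rows
    if extra > 0:
        lines.append(f"*… ({extra} filas más)*")
    return "\n".join(lines)


print(strip_glossary_markers("Texto con [[term:ent]]entropia[[/term]] y **bold**."))
print(escape_cell("a|b\nc"))
print(github_anchor("2. Calidad de datos"))
```

Keeping each rule in its own small pure function makes it easy to test in isolation, which matters here because two chapters with the same title would produce the same anchor.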
@@ -0,0 +1,55 @@
"""render_automatic_eda_markdown: chapter-based EDA report as one Markdown file.

Public ``eda``-group entry point that serializes an AutomaticEDA document (a list
of chapters, or an ``eda`` TableProfile from which the canonical chapters are
built) into a single self-contained Markdown file optimised to be **pasted into
an LLM**: plain text, Markdown tables (every row dumped, there are no pages to
cut), figures reduced to caption + underlying data, no binaries. It mirrors
``render_automatic_eda_pdf`` / ``render_automatic_eda_pptx`` but for text output;
unlike those it writes no manifest (KISS: Markdown is a single text artefact).

dict-no-throw: never raises. Returns ``{path, n_chars, chapters, note}``; on a
fatal error ``path`` is None and ``note`` explains why.
"""

from __future__ import annotations

from datascience.automatic_eda import build_document, render_md
from datascience.automatic_eda.model import as_chapter, as_chapters


def _coerce_chapters(chapters_or_profile, meta: dict) -> list:
    """Accept chapters OR an eda profile and return a list of Chapter."""
    arg = chapters_or_profile
    if isinstance(arg, (list, tuple)):
        return as_chapters(list(arg))
    if isinstance(arg, dict):
        if "blocks" in arg and "columns" not in arg:
            ch = as_chapter(arg)
            return [ch] if ch is not None else []
        return build_document(arg, (meta or {}).get("ctx"))
    return []


def render_automatic_eda_markdown(chapters_or_profile, out_path: str,
                                  meta: dict = None) -> dict:
    """Render an AutomaticEDA document into a single self-contained Markdown file.

    Args:
        chapters_or_profile: a list of chapters (``Chapter`` dataclasses or
            dicts) or an ``eda`` TableProfile dict (chapters built via
            ``build_document(profile, meta['ctx'])``).
        out_path: filesystem path for the ``.md`` (parent dirs are created).
        meta: optional dict. Recognised keys: ``title``, ``ctx`` (dict with
            ``dataset_name``/``source_origin``/``storage``/``n_rows``/``n_cols``),
            ``generated_at``, ``embed_figures`` (export PNGs beside the .md,
            default False; off keeps the Markdown self-contained).

    Returns:
        dict (never raises): ``{path: str|None, n_chars: int,
        chapters: list[{id, version}], note: str}``. On a fatal error ``path`` is
        None and ``note`` explains the cause.
    """
    meta = dict(meta or {})
    chapters = _coerce_chapters(chapters_or_profile, meta)
    return render_md(chapters, out_path, meta)
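
The dict-no-throw contract described in the module docstring can be illustrated in isolation. `render_md` itself is not shown in this diff, so the sketch below is an assumption about the general pattern rather than its implementation: `write_markdown_no_throw` is a hypothetical helper that only reproduces the documented return shape (`path`, `n_chars`, `chapters`, `note`).

```python
import os
import tempfile


def write_markdown_no_throw(text: str, out_path: str) -> dict:
    """Write text to out_path and report the outcome as a dict; never raises.

    Hypothetical helper: it mirrors the documented return shape only.
    """
    result = {"path": None, "n_chars": 0, "chapters": [], "note": ""}
    try:
        parent = os.path.dirname(out_path)
        if parent:
            os.makedirs(parent, exist_ok=True)  # create parent dirs, as documented
        with open(out_path, "w", encoding="utf-8") as fh:
            fh.write(text)
        result["path"] = out_path
        result["n_chars"] = len(text)
    except Exception as exc:  # noqa: BLE001 - degrade to a note instead of raising
        result["note"] = f"{type(exc).__name__}: {exc}"
    return result


with tempfile.TemporaryDirectory() as d:
    ok = write_markdown_no_throw("# Demo\n", os.path.join(d, "sub", "demo.md"))
    print(ok["path"] is not None, ok["n_chars"])
    # Passing an existing directory as the file path fails, but does not raise.
    bad = write_markdown_no_throw("# Demo\n", d)
    print(bad["path"], bool(bad["note"]))
```

Callers then branch on `path is None` instead of wrapping the call in `try`/`except`, which is how the tests in this change check for success.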
@@ -0,0 +1,168 @@
"""Tests for render_automatic_eda_markdown. DoD: golden + edge + profile path.

Self-contained synthetic blocks (no DuckDB). Verifies every block kind serializes
to Markdown (heading, markdown with glossary+bold, kv/data tables, a figure whose
histogram bars become a data table, caption, note, group, glossary entry), that a
leading level-1 heading equal to the chapter title is omitted, that an empty
document degrades to a valid minimal Markdown without raising, and that passing a
minimal TableProfile builds chapters and writes the file.
"""

import os
import tempfile

from datascience.render_automatic_eda_markdown import render_automatic_eda_markdown
from datascience.automatic_eda.model import (
    Caption, Chapter, DataTable, Figure, GlossaryEntry, Group, Heading, KVTable,
    Markdown, Note,
)


def _hist_fig():
    import matplotlib
    matplotlib.use("Agg")
    import matplotlib.pyplot as plt
    fig, ax = plt.subplots()
    ax.hist([1, 1, 2, 2, 2, 3, 4, 4, 5, 5, 5, 5], bins=5)
    return fig


def _chapters() -> list:
    blocks = [
        Heading("Demo", 1),  # == chapter title -> omitted.
        Heading("Seccion dos", 2),  # -> ####
        Markdown("Texto con [[term:ent]]entropia[[/term]] y **bold** aqui."),
        KVTable(rows=[("Filas", 1000), ("Columnas", 5)], title="Resumen"),
        DataTable(header=["col", "valor"],
                  rows=[["alpha", "111"], ["beta", "222"], ["gamma", "333"]],
                  title="Datos", note="nota inferior"),
        Figure(make=_hist_fig, caption="Histograma demo"),
        Caption("pie de figura"),
        Note("una nota aparte"),
        Group(title="Grupo X", blocks=[Markdown("dentro del grupo")]),
        GlossaryEntry(key="ent", label="Entropia",
                      definition="Medida de incertidumbre."),
    ]
    return [Chapter(id="demo", title="Demo", version="1.0.0", blocks=blocks)]


def _read(path: str) -> str:
    with open(path, "r", encoding="utf-8") as fh:
        return fh.read()


def test_golden_bloques_sinteticos_serializa_todo_a_markdown():
    with tempfile.TemporaryDirectory() as d:
        out = os.path.join(d, "demo.md")
        res = render_automatic_eda_markdown(
            _chapters(), out,
            {"title": "EDA Demo",
             "ctx": {"dataset_name": "Demo", "n_rows": 12, "n_cols": 2}})
        assert res["path"] == out
        assert os.path.exists(out)
        assert res["n_chars"] > 0
        assert res["chapters"] == [{"id": "demo", "version": "1.0.0"}]

        content = _read(out)
        # Document structure.
        assert content.startswith("# ")
        assert "## Índice" in content
        # A Markdown table is present (header + separator row).
        assert "| " in content and "| --- " in content
        # DataTable values are all dumped.
        for v in ("alpha", "111", "beta", "222", "gamma", "333"):
            assert v in content
        # Glossary markers stripped, bold kept.
        assert "[[term" not in content
        assert "[[/term]]" not in content
        assert "**bold**" in content
        assert "entropia" in content  # visible glossary text preserved.
        # Figure histogram bars became a data table.
        assert "| Desde | Hasta | Frecuencia |" in content
        # Glossary entry rendered as a level-3 heading.
        assert "### Entropia" in content
        # Level-2 heading -> ####.
        assert "#### Seccion dos" in content
        # Leading level-1 heading equal to the title was omitted.
        assert "### Demo" not in content
        # Group title rendered.
        assert "### Grupo X" in content


def _hist_fig_with_span():
    """Histogram with a wide ``axvspan`` (±1σ band) over it.

    Reproduces the num_distr figure shape: matplotlib keeps the span as a lone
    Rectangle in ``ax.patches`` alongside the bin bars; it must NOT leak into the
    extracted bins table as a fake bin (it is ~5x wider than a bin)."""
    import matplotlib
    matplotlib.use("Agg")
    import matplotlib.pyplot as plt
    fig, ax = plt.subplots()
    data = [1, 1, 2, 2, 2, 3, 4, 4, 5, 5, 5, 5]
    ax.hist(data, bins=5)
    ax.axvspan(2.0, 4.0, alpha=0.2)  # mean±σ band: a wide stray rectangle.
    return fig


def test_figura_descarta_axvspan_de_la_tabla_de_bins():
    """The ±1σ band rectangle must not appear as a row in the bins table."""
    blocks = [Figure(make=_hist_fig_with_span, caption="Hist con banda")]
    chapters = [Chapter(id="f", title="Fig", version="1.0.0", blocks=blocks)]
    with tempfile.TemporaryDirectory() as d:
        out = os.path.join(d, "fig.md")
        render_automatic_eda_markdown(chapters, out, {"title": "T"})
        content = _read(out)
        assert "| Desde | Hasta | Frecuencia |" in content
        # Extract the rows of the bins table: lines between the header/separator
        # and the next blank line.
        lines = content.splitlines()
        hi = next(i for i, ln in enumerate(lines)
                  if ln.startswith("| Desde | Hasta | Frecuencia |"))
        rows = []
        for ln in lines[hi + 2:]:  # skip header + separator
            if not ln.startswith("|"):
                break
            rows.append(ln)
        # 5 histogram bins, no extra wide span row.
        assert len(rows) == 5, rows
        # No row spans a width of ~2.0 (the axvspan from x=2 to x=4).
        for ln in rows:
            cells = [c.strip() for c in ln.strip("|").split("|")]
            lo, hi_v = float(cells[0]), float(cells[1])
            assert (hi_v - lo) < 1.5, f"wide span leaked: {ln}"


def test_edge_documento_vacio_no_revienta():
    with tempfile.TemporaryDirectory() as d:
        out = os.path.join(d, "empty.md")
        res = render_automatic_eda_markdown([], out, {})
        assert res["path"] == out
        assert os.path.exists(out)
        assert res["chapters"] == []
        content = _read(out)
        assert "documento vacío" in content
        assert content.startswith("# ")


def test_profile_path_construye_capitulos_y_escribe():
    profile = {
        "table": "mini",
        "source": "/data/mini.csv",
        "n_rows": 10,
        "n_cols": 1,
        "quality_score": 88.0,
        "columns": [
            {"name": "x", "inferred_type": "numeric", "null_pct": 0.0,
             "null_count": 0,
             "numeric": {"mean": 1.0, "median": 1.0, "min": 0.0, "max": 2.0,
                         "std": 0.5}},
        ],
    }
    with tempfile.TemporaryDirectory() as d:
        out = os.path.join(d, "mini.md")
        res = render_automatic_eda_markdown(
            profile, out, {"title": "Mini", "ctx": {"dataset_name": "Mini"}})
        assert res["path"] == out  # not None: no exception, file written.
        assert os.path.exists(out)
        assert res["n_chars"] > 0
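
The `axvspan` regression test above checks that a wide band does not end up as a fake histogram bin. One way such a filter could work is sketched below. This is an assumption, not the renderer's actual rule: rectangles are modelled as plain `(x, width, height)` tuples (the values a matplotlib `Rectangle` exposes through `get_x()`, `get_width()` and `get_height()`), and the median-width criterion with a factor of 1.5 is chosen only for illustration.

```python
from statistics import median


def drop_wide_rectangles(rects, factor: float = 1.5):
    """Keep rectangles whose width is at most factor times the median width.

    The median-width rule and the default factor are assumptions of this sketch.
    """
    if not rects:
        return []
    typical = median(w for _, w, _ in rects)
    return [r for r in rects if r[1] <= factor * typical]


def rectangles_to_bins(rects):
    """Turn (x, width, height) tuples into sorted (lo, hi, count) rows."""
    return sorted((x, x + w, h) for x, w, h in rects)


# Five bins of width 0.8 over [1, 5], as produced by the test data with bins=5,
# plus a span from x=2 to x=4 (width 2.0) standing in for the axvspan band.
bars = [(1.0, 0.8, 2), (1.8, 0.8, 3), (2.6, 0.8, 1), (3.4, 0.8, 2), (4.2, 0.8, 4)]
span = (2.0, 2.0, 1.0)
rows = rectangles_to_bins(drop_wide_rectangles(bars + [span]))
print(len(rows))  # -> 5
```

A width-based rule is robust to the band's position, since the band overlaps real bins on the x axis and cannot be told apart by location alone.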
@@ -1,9 +1,10 @@
-"""render_automatic_eda - complete one-shot EDA: profile → ctx → PDF + PPTX.
+"""render_automatic_eda - complete one-shot EDA: profile → ctx → PDF + PPTX + MD.
 
 Impure pipeline of the `eda` capability group. Given ONE DuckDB (or
-PostgreSQL) table, it produces the COMPLETE AutomaticEDA report in both formats at
-once (mobile A5 PDF + 16:9 PPTX) with all 11 chapters POPULATED, in a single
-call. It composes, without reimplementing their logic, four registry functions:
+PostgreSQL) table, it produces the COMPLETE AutomaticEDA report in its three formats
+at once (mobile A5 PDF + 16:9 PPTX + self-contained Markdown to paste into an LLM)
+with the chapters POPULATED, in a single call. It composes, without reimplementing
+their logic, several registry functions:
 
 - profile_table : profiles the table end-to-end (aggregated TableProfile),
                   optionally with cheap models and series analysis.
@@ -12,8 +13,11 @@ call. It composes, without reimplementing their logic, four registry functions:
                     models/geo, timeseries_raw for series, geo_points
                     for the map, db_path/table for the push-down
                     aggregation). Without it, those chapters degrade.
-- render_automatic_eda_pdf  : renders the chapter-based document to PDF.
-- render_automatic_eda_pptx : renders the same document to PPTX.
+- render_automatic_eda_pdf      : renders the chapter-based document to PDF.
+- render_automatic_eda_pptx     : renders the same document to PPTX.
+- render_automatic_eda_markdown : serializes the same document to self-contained
+                                  Markdown (text + Markdown tables, no
+                                  binaries) to feed into an LLM.
 
 The aggregated TableProfile is enough for portada/overview/distribuciones/calidad/
 correlación, but the `modelos`, `timeseries`, `geospatial` and
@@ -32,6 +36,7 @@ from datetime import datetime, timezone
 
 from datascience import (
     build_eda_render_ctx,
+    render_automatic_eda_markdown,
     render_automatic_eda_pdf,
     render_automatic_eda_pptx,
     run_eda_models,
@@ -93,6 +98,7 @@ def render_automatic_eda(
     out_dir: str = "reports",
     basename: str = None,
     ctx_extra: dict = None,
+    emit_md: bool = True,
 ) -> dict:
     """Profile a table and emit the complete AutomaticEDA report (PDF + PPTX).
 
@@ -140,13 +146,19 @@ def render_automatic_eda(
         ctx_extra: optional dict with extra presentation/context keys that are
             merged into the ctx (e.g. dataset_name, description, source_origin).
             They do not overwrite the data keys computed by build_eda_render_ctx.
+        emit_md: in addition to the PDF and the PPTX, emit a self-contained Markdown
+            of the SAME chapter-based document (plain text + Markdown tables, no
+            binaries), meant for pasting into an LLM. Default True. The path is
+            returned under the ``aeda_md_path`` key. Does not alter the other outputs.
 
     Returns:
         dict (never raises). On success::
 
             {"status": "ok", "pdf_path": str, "pptx_path": str,
-             "manifest_path": str|None, "n_pages": int, "n_slides": int,
-             "pdf_note": str, "pptx_note": str, "profile": <TableProfile>}
+             "aeda_md_path": str|None, "manifest_path": str|None,
+             "n_pages": int, "n_slides": int, "md_chars": int|None,
+             "pdf_note": str, "pptx_note": str, "md_note": str|None,
+             "profile": <TableProfile>}
 
         On error: {"status": "error", "error": str}.
     """
@@ -243,15 +255,26 @@ def render_automatic_eda(
         rpdf = render_automatic_eda_pdf(prof, pdf_path, meta) or {}
         rpptx = render_automatic_eda_pptx(prof, pptx_path, meta) or {}
 
+        # Self-contained Markdown output (same chapter-based document) for
+        # pasting into an LLM. Additive: does not affect PDF/PPTX/manifest. dict-no-throw.
+        rmd = {}
+        md_path = None
+        if emit_md:
+            md_path = os.path.join(out_dir, base + ".md")
+            rmd = render_automatic_eda_markdown(prof, md_path, meta) or {}
+
         return {
             "status": "ok",
             "pdf_path": rpdf.get("path"),
             "pptx_path": rpptx.get("path"),
+            "aeda_md_path": rmd.get("path"),
             "manifest_path": rpdf.get("manifest_path"),
             "n_pages": rpdf.get("n_pages"),
             "n_slides": rpptx.get("n_slides"),
+            "md_chars": rmd.get("n_chars"),
             "pdf_note": rpdf.get("note"),
             "pptx_note": rpptx.get("note"),
+            "md_note": rmd.get("note"),
             "profile": prof,
         }
     except Exception as e:  # noqa: BLE001 - dict-no-throw: degrade, never raise.
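
The hunk above makes the Markdown output additive: when `emit_md` is false, `rmd` stays `{}` and every Markdown key in the result is `None`. A minimal standalone sketch of that merging behaviour follows. `merge_outputs` is a hypothetical stand-in that reuses only the key names visible in the diff; it is not the pipeline's code.

```python
def merge_outputs(rpdf: dict, rpptx: dict, rmd: dict) -> dict:
    """Combine per-renderer result dicts into one summary dict.

    Hypothetical helper: it mirrors a subset of the keys returned by the pipeline.
    """
    rpdf, rpptx, rmd = rpdf or {}, rpptx or {}, rmd or {}
    return {
        "status": "ok",
        "pdf_path": rpdf.get("path"),
        "pptx_path": rpptx.get("path"),
        "aeda_md_path": rmd.get("path"),  # None when Markdown was skipped
        "md_chars": rmd.get("n_chars"),
        "md_note": rmd.get("note"),
    }


# Markdown skipped: the result keeps its shape, with None in the Markdown keys.
skipped = merge_outputs({"path": "reports/t.pdf"}, {"path": "reports/t.pptx"}, {})
print(skipped["aeda_md_path"], skipped["md_chars"])

# Markdown emitted: the renderer's dict is passed through key by key.
emitted = merge_outputs({}, {}, {"path": "reports/t.md", "n_chars": 120, "note": ""})
print(emitted["aeda_md_path"], emitted["md_chars"])
```

Because the keys are always present, existing callers that only read `pdf_path` and `pptx_path` are unaffected by the new output.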