Compare commits
1 Commits
| Author | SHA1 | Date | |
|---|---|---|---|
| | 8917105184 | | |
@@ -3,11 +3,11 @@ name: launch_fleetclaude
kind: function
lang: bash
domain: infra
version: "1.6.0"
version: "1.5.0"
purity: impure
signature: "launch_fleetclaude [--cwd <dir>] [--bin <path>] [--session <name>] [--reuse] [--cols <n>]"
description: "FleetView entrypoint: opens a terminal window with a tmux session (socket isolated per profile) made of two panes (fleetview TUI on the left, claude --dangerously-skip-permissions on the right) to centralize the Claude fleet. The terminal is AUTO-DETECTED with no per-PC config: kitty if it is installed and a display is available ($DISPLAY/$WAYLAND_DISPLAY); otherwise Windows Terminal (wt.exe) on WSL, attaching via wsl.exe. The TUI pane runs inside the supervise_fleetview_tui supervisor loop, which relaunches it if it dies (crash/panic/kill), so the control panel is NEVER lost. Supports multiple PROFILES: without --session/--reuse each invocation opens a new profile (fleet, fleet2, fleet3, ...) with its own fleet; it injects FLEET_SOCKET/FLEET_SESSION into the TUI so that each panel sees only its own Claudes. Installs alt+arrows/alt+enter/alt+n shortcuts that control the TUI from any pane, and pins the sidebar width with hooks."
tags: [claude-fleet, infra, kitty, tmux, claude, fleetview, launcher, wsl, windows-terminal]
description: "FleetView entrypoint: opens a kitty window with a tmux session (socket isolated per profile) made of two panes (fleetview TUI on the left, claude --dangerously-skip-permissions on the right) to centralize the Claude fleet. The TUI pane runs inside the supervise_fleetview_tui supervisor loop, which relaunches it if it dies (crash/panic/kill), so the control panel is NEVER lost. Supports multiple PROFILES: without --session/--reuse each invocation opens a new profile (fleet, fleet2, fleet3, ...) with its own fleet; it injects FLEET_SOCKET/FLEET_SESSION into the TUI so that each panel sees only its own Claudes. Installs alt+arrows/alt+enter/alt+n shortcuts that control the TUI from any pane, and pins the sidebar width with hooks."
tags: [claude-fleet, infra, kitty, tmux, claude, fleetview, launcher]
params:
  - name: --cwd
    desc: "Working directory of both tmux panes. Optional. Default: root of the fn_registry repo, derived dynamically via git rev-parse from the script's location (no hardcoded user paths)."
@@ -19,7 +19,7 @@ params:
    desc: "Reattach to the main 'fleet' profile instead of opening a new one. Optional. Restores the classic idempotent behavior (invoking again does NOT duplicate the fleet; it reuses the existing one)."
  - name: --cols
    desc: "Width in columns of the left pane (the TUI). Optional. Default: 40."
output: "Creates/reuses a detached tmux session with two panes and launches a 'FleetView' terminal window attached to it (kitty or Windows Terminal, depending on auto-detection), decoupled from the parent shell. Prints the status to stdout. No return value; exit 0 on success."
output: "Creates/reuses a detached tmux session with two panes and launches a kitty 'FleetView' window attached to it, decoupled from the parent shell (setsid). Prints the status to stdout. No return value; exit 0 on success."
uses_functions:
  - supervise_fleetview_tui_bash_infra
uses_types: []
@@ -49,7 +49,7 @@ launch_fleetclaude --reuse
launch_fleetclaude --session trabajo --cols 50
```

After invoking it, a terminal window titled `FleetView (<profile>)` appears with two
After invoking it, a kitty window titled `FleetView (<profile>)` appears with two
side-by-side panes: the `fleetview` TUI on the left and, on the right, a
`claude --dangerously-skip-permissions` session. Each profile is an isolated tmux
socket+session with its own fleet: you can have several FleetViews open at once.
@@ -78,24 +78,12 @@ al retomar el trabajo en el repo `fn_registry`.
  `respawn-pane` from alt+R and the new Claudes inherit the socket). `main.go` reads
  them with a fallback to `fleet`. That is why each panel sees ONLY the Claudes of its
  profile (it cross-checks the system list against the panes of its socket).
- **Terminal auto-detection (no per-PC config)**: on the new-window path the
  launcher picks the terminal by itself. (1) `kitty` installed **and** a usable display
  (`$DISPLAY`/`$WAYLAND_DISPLAY`) → kitty (native Linux desktop or WSLg with
  kitty). (2) Otherwise, WSL with `wt.exe` on the PATH → Windows Terminal running
  `wsl.exe [-d $WSL_DISTRO_NAME] -- bash -lic 'tmux -L <profile> attach ...'`.
  (3) Neither → error listing the possible ways out. This way the SAME `fleetclaude`
  works on a PC with kitty and on another WSL machine without kitty; each one picks
  its own terminal. Root cause of the "the fleet launches but nothing shows up" symptom:
  with kitty not installed on WSL, the tmux session was created with no window to display it.
- **Inside tmux it opens a new window**: if you invoke `fleetclaude` from inside
  a tmux session (`$TMUX` set), it does NOT do a nested `attach` (which breaks / warns about
  nesting); it falls through to the new-window path (terminal auto-detection). Outside
  tmux and with a TTY, it reuses the current terminal with `exec tmux attach`.
- **kitty detached (setsid)**: the kitty window is launched with `setsid ... &` so that
  it survives the closing of the terminal that invoked it. The Windows
  Terminal (wt.exe) window is already a Windows process independent of the Linux process
  tree, so it survives on its own (it is launched with `&`+`disown` from a subshell with
  cwd `/mnt/c` to avoid the wt.exe warning about a UNC cwd `\\wsl.localhost\...`).
  nesting); it falls through to the kitty path and opens a new window. Outside tmux and
  with a TTY, it reuses the current terminal with `exec tmux attach`.
- **kitty detached (setsid)**: the window is launched with `setsid ... &` so that
  it survives the closing of the terminal that invoked it. It does not block the parent shell.
- **TUI under a supervisor (auto-respawn)**: the left pane does NOT run a
  single-life `exec fleetview`, but `supervise_fleetview_tui` (a loop that
  relaunches the TUI if it dies from crash/panic/kill). This way the control panel is never
@@ -128,23 +116,14 @@ al retomar el trabajo en el repo `fn_registry`.
- **Sidebar width via hooks**: `client-resized` and `window-layout-changed`
  re-pin pane 0 (TUI) to `--cols` columns, because kitty's `attach` and
  switching between Claudes redistribute the space.
- **tmux always; terminal (kitty/wt.exe) only without a TTY**: `tmux` is mandatory
  (exits non-zero if it is missing). A new terminal (kitty or Windows Terminal) is only
  needed on the no-TTY path (inside tmux, desktop shortcut, cron, script),
  where it opens a new window. Invoked from an interactive terminal outside
  tmux (the normal case for the `fleetclaude` alias), it reuses the current terminal with
  `exec tmux attach` and needs neither kitty nor wt.exe.
- **tmux always, kitty only without a TTY**: `tmux` is mandatory (exits non-zero if
  it is missing). `kitty` is only needed on the no-TTY path (desktop shortcut, cron,
  script), where it opens a new window. Invoked from an interactive terminal
  (the normal case for the `fleetclaude` alias), it reuses the current terminal with
  `exec tmux attach` and does NOT need kitty, which is useful on WSL or hosts without kitty.

## Capability growth log

- v1.6.0 (2026-06-29): **terminal auto-detection (kitty ↔ Windows Terminal)**.
  The new-window path no longer assumes kitty: it picks the terminal based on the host. kitty
  if it is installed and a display is available (`$DISPLAY`/`$WAYLAND_DISPLAY`); otherwise, on
  WSL it opens Windows Terminal (`wt.exe`) running `wsl.exe [-d $WSL_DISTRO_NAME] -- bash
  -lic 'tmux ... attach'`. Same `fleetclaude` on a PC with kitty and on another WSL machine
  without kitty. Fixes the "the fleet launches but nothing shows up" symptom: on WSL without
  kitty the tmux session was created but no window displayed it. wt.exe is
  launched from a subshell with cwd `/mnt/c` to avoid the UNC cwd warning.
- v1.5.0 (2026-06-24): **TUI auto-respawn**. The left pane no longer runs
  `exec fleetview` (single-life), but the supervisor loop
  `supervise_fleetview_tui`, which relaunches the TUI if it dies (crash/panic/kill of its

@@ -294,61 +294,31 @@ USAGE
    $T set-hook -g window-layout-changed "resize-pane -t $left_pane -x $cols"

    # -----------------------------------------------------------------------
    # Attach the session in a terminal, DETACHED from the parent shell so that
    # it does not die when the invoking terminal is closed.
    # Launch kitty attached to the session, DETACHED from the parent shell with
    # setsid, so that it does not die when the invoking terminal is closed.
    # (Same pattern as reboot_all_claudes for relaunching terminals.)
    # -----------------------------------------------------------------------
    # Attach the session:
    #   - Interactive terminal OUTSIDE tmux: turn THAT terminal into the
    #     FleetView panel (exec replaces the process; on detach the shell
    #     returns). This way `fleetclaude` does not open another window: it uses the current one.
    #   - INSIDE tmux (or with no TTY: desktop shortcut, cron, script): open
    #     a NEW detached terminal window. We do not do a nested `attach`
    #     a new detached kitty window (setsid). We do not do a nested `attach`
    #     inside another tmux session (it breaks / gives the nesting warning).
    if [ -t 0 ] && [ -t 1 ] && [ -z "${TMUX:-}" ]; then
        exec tmux -L "$session" attach -t "$session"
    fi

    # -----------------------------------------------------------------------
    # New-window path: AUTO-DETECT the available terminal (no per-PC
    # config). The same `fleetclaude` works on a Linux desktop with kitty and on
    # a WSL host without kitty but with Windows Terminal.
    #   1. kitty installed + usable display ($DISPLAY/$WAYLAND_DISPLAY) -> kitty
    #      (native Linux desktop, or WSLg with kitty installed).
    #   2. WSL with wt.exe reachable -> Windows Terminal running wsl.exe, which
    #      attaches the tmux session (WSL PCs without kitty: the kitty window never
    #      appears without a real Linux terminal, hence "it launches but nothing shows up").
    #   3. Neither -> clear error with the two possible ways out.
    # -----------------------------------------------------------------------
    if command -v kitty >/dev/null 2>&1 && [[ -n "${DISPLAY:-}${WAYLAND_DISPLAY:-}" ]]; then
        setsid kitty --title "FleetView ($session)" -e tmux -L "$session" attach -t "$session" </dev/null >/dev/null 2>&1 &
        disown 2>/dev/null || true
        echo "launch_fleetclaude: kitty window 'FleetView ($session)' attached to profile '$session'."
        return 0
    # New-window path: we need kitty to open it.
    if ! command -v kitty >/dev/null 2>&1; then
        echo "launch_fleetclaude: kitty is not installed (required to open a new window)." >&2
        echo "launch_fleetclaude: launch it from an interactive terminal outside tmux, or install kitty." >&2
        return 1
    fi
    setsid kitty --title "FleetView ($session)" -e tmux -L "$session" attach -t "$session" </dev/null >/dev/null 2>&1 &
    disown 2>/dev/null || true

    if command -v wt.exe >/dev/null 2>&1; then
        # bash -lic <attach> inside wsl.exe: login+interactive so that tmux and
        # the profile's PATH are available in the Windows Terminal window.
        local attach_cmd
        attach_cmd="tmux -L $(printf '%q' "$session") attach -t $(printf '%q' "$session")"
        local distro="${WSL_DISTRO_NAME:-}"
        local wsl_args=(wsl.exe)
        [[ -n "$distro" ]] && wsl_args+=(-d "$distro")
        wsl_args+=(-- bash -lic "$attach_cmd")
        # cd to a Windows path (/mnt/c) avoids the wt.exe warning about a UNC cwd
        # (\\wsl.localhost\...). The real cwd of the panes is set by the tmux session.
        ( cd /mnt/c 2>/dev/null || cd /
          wt.exe new-tab --title "FleetView ($session)" "${wsl_args[@]}" </dev/null >/dev/null 2>&1 &
          disown 2>/dev/null || true )
        echo "launch_fleetclaude: Windows Terminal 'FleetView ($session)' attached to profile '$session' (WSL distro '${distro:-default}')."
        return 0
    fi

    echo "launch_fleetclaude: no terminal available to open a new window." >&2
    echo "launch_fleetclaude:   - Linux desktop: install kitty and export DISPLAY/WAYLAND_DISPLAY." >&2
    echo "launch_fleetclaude:   - WSL: use Windows Terminal (wt.exe must be on the PATH)." >&2
    echo "launch_fleetclaude:   - or launch fleetclaude from an interactive terminal outside tmux." >&2
    return 1
    echo "launch_fleetclaude: kitty window 'FleetView ($session)' attached to profile '$session'."
    return 0
}

# Allow running the file directly (not only as a sourced function).

@@ -1,299 +0,0 @@
# AutomaticEDA: chapter contract

Authoritative document for **writing chapters** of the AutomaticEDA report. Read it
in full before adding a chapter: it defines the block model, the builder signature,
the versioning, where to place the module, how it is registered in the document order,
which `profile` keys each chapter consumes, and a complete reference chapter example
(OVERVIEW).

AutomaticEDA is the intermediate layer between **content** (what a chapter wants to
say) and **output format** (mobile PDF + PPTX for sharing). The same chapter-based
document is rendered to both formats with a **no-cut** guarantee: text is
wrapped into whole lines, long tables are split by rows with the header
repeated, and figures/images are scaled to fit in full.
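
The two mechanical parts of the no-cut guarantee (row splitting with a repeated header, and scale-to-fit) can be sketched as follows. This is a minimal illustration, not the engine's code: the function names and the `rows_per_page` parameter are hypothetical, and a real renderer would presumably measure row heights instead of counting rows.

```python
def split_rows_with_header(header, rows, rows_per_page):
    """Split `rows` into page-sized chunks, repeating `header` on top of each chunk."""
    if rows_per_page < 1:
        raise ValueError("rows_per_page must be >= 1")
    pages = [
        [header] + rows[start:start + rows_per_page]
        for start in range(0, len(rows), rows_per_page)
    ]
    # An empty table still renders its header once.
    return pages or [[header]]


def scale_to_fit(width, height, max_width, max_height):
    """Shrink (never enlarge) a box so that it fits whole inside the available area."""
    factor = min(max_width / width, max_height / height, 1.0)
    return width * factor, height * factor
```

Because the scale factor is capped at 1.0 and applied to both sides, an image keeps its aspect ratio and is never cropped or upscaled.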

- Engine code: `python/functions/datascience/automatic_eda/` (support package).
- Public registry functions (`eda` group): `render_automatic_eda_pdf`,
  `render_automatic_eda_pptx`.
- It supersedes `render_eda_pdf` incrementally and **additively** (the latter stays active in
  `profile_table(emit_pdf=True)`).

---

## 1. Document model

```
Document = list[Chapter]
Chapter  = { id: str, title: str, version: str, blocks: list[Block] }
Block    = Heading | Markdown | KVTable | DataTable | Figure | Image | Caption | Note
```

Import the model from `datascience.automatic_eda.model` (or
`from datascience.automatic_eda import ...`). All blocks are dataclasses; the
renderers also accept **dicts** with a `kind` key (defensive reading: anything
unrecognized degrades to a `Note` and never raises).
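
The defensive reading of dict blocks could look roughly like this. It is a sketch under stated assumptions: `Heading` and `Note` here are stand-ins for the real dataclasses in `model.py`, `coerce_block` is a hypothetical helper name, and the `kind` spellings (`"heading"`, `"note"`) are assumed, since the contract does not list them.

```python
from dataclasses import dataclass


@dataclass
class Heading:  # stand-in for model.Heading
    text: str
    level: int = 1


@dataclass
class Note:  # stand-in for model.Note
    text: str


def coerce_block(block):
    """Return a dataclass block; unrecognized input degrades to Note, never raises."""
    if isinstance(block, (Heading, Note)):
        return block
    if isinstance(block, dict):
        kind = block.get("kind")
        text = str(block.get("text", ""))
        if kind == "heading":  # assumed spelling of the `kind` value
            try:
                level = min(max(int(block.get("level", 1)), 1), 3)
            except (TypeError, ValueError):
                level = 1
            return Heading(text=text, level=level)
        if kind == "note":  # assumed spelling of the `kind` value
            return Note(text=text)
    # Fallback: keep whatever was passed visible as a note instead of failing.
    return Note(text=str(block))
```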

### Blocks

| Block | Construction | What it does at render time |
|---|---|---|
| `Heading(text, level=1)` | section title, `level` from 1 (large) to 3 (small) | one or more bold lines; level 1 gets an accent underline |
| `Markdown(text)` | lightweight markdown text | see the subset below; **never cuts mid-line** |
| `KVTable(rows, title=None)` | `rows = [(key, value), ...]` | 2-column label/value table; the value wraps |
| `DataTable(header, rows, title=None, note=None)` | `header=[...]`, `rows=[[...],...]` | table with a header; **split by rows with the header repeated**; long cells wrap within their column |
| `Figure(fig=None, make=None, caption=None, height_in=None)` | an already built `matplotlib.figure.Figure` (`fig`) or a callable `make()->Figure` (lazy) | rasterized and scaled to fit in full (never cropped) |
| `Image(path, caption=None, height_in=None)` | path to a PNG/JPG | scaled to fit in full |
| `Caption(text)` / `Note(text)` | small auxiliary text | caption/note in gray; `Note` is also the fallback for anything unknown |

### Supported markdown subset (`Markdown`)

`#`/`##`/`###` → headings; `-`/`*` → bullets; consecutive `| a | b |` lines → one
`DataTable`; blank line → paragraph break; `**bold**`/`__bold__`/`` `code` ``
→ the markers are removed and the text is kept. Everything else is rendered as-is.
Guarantee: no character is lost; whatever does not fit is wrapped or moves to the next page/slide.
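
As an illustration of the subset above, the line classification and marker stripping might be written like this. The helper names are hypothetical and the regular expressions are one possible reading of the rules, not the engine's actual parser.

```python
import re

# **bold**, __bold__ and `code`: drop the markers, keep the inner text.
_INLINE = re.compile(r"\*\*(.+?)\*\*|__(.+?)__|`(.+?)`")


def strip_inline_markers(line: str) -> str:
    """Remove bold/code markers while keeping every character of the wrapped text."""
    return _INLINE.sub(lambda m: next(g for g in m.groups() if g is not None), line)


def classify_line(line: str) -> str:
    """Classify one markdown line into the kinds the subset distinguishes."""
    s = line.strip()
    if not s:
        return "blank"
    if re.match(r"#{1,3}\s", s):
        return "heading"
    if s.startswith(("- ", "* ")):
        return "bullet"
    if len(s) > 1 and s.startswith("|") and s.endswith("|"):
        return "table_row"
    return "text"
```

Consecutive lines classified as `table_row` would then be grouped into a single `DataTable`.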

---

## 2. Chapter builder signature (MANDATORY)

Each chapter is a module `python/functions/datascience/automatic_eda/chapters/<id>.py`
that exposes **two** symbols:

```python
CHAPTER_VERSION = "1.0.0"   # semver of the chapter's generation (see §4)

def build_<id>(profile: dict, ctx: dict) -> "Chapter | None":
    """Build the chapter from the TableProfile and the presentation context.

    Returns None if the chapter does NOT apply to this dataset (e.g. timeseries
    without a date column). ALWAYS read defensively with .get and NEVER raise.
    """
```

- The function name is exactly `build_<id>`, where `<id>` matches the module name and
  the entry in `CHAPTER_ORDER` (§3). E.g.: `chapters/num_distr.py` → `build_num_distr`.
- It returns a `model.Chapter(id, title, version=CHAPTER_VERSION, blocks=[...])` or `None`.
- A chapter that returns `None`, or whose `blocks` end up empty, is omitted from the document.
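
A minimal builder that follows these rules might look like the sketch below. Everything in it is illustrative: `KVTable` and `Chapter` are local stand-ins for the `model` dataclasses so the snippet runs on its own, and the body of `build_num_distr` is a hypothetical example, not the pending `num_distr` chapter.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class KVTable:  # stand-in for model.KVTable
    rows: list
    title: Optional[str] = None


@dataclass
class Chapter:  # stand-in for model.Chapter
    id: str
    title: str
    version: str
    blocks: list = field(default_factory=list)


CHAPTER_VERSION = "0.1.0"  # illustrative value


def build_num_distr(profile, ctx):
    """Hypothetical builder: one KVTable with the mean of each numeric column."""
    try:
        cols = (profile or {}).get("columns") or []
        numeric = [c for c in cols
                   if isinstance(c, dict) and isinstance(c.get("numeric"), dict)]
        if not numeric:
            return None  # the chapter does not apply to this dataset
        rows = [(str(c.get("name", "?")), str(c["numeric"].get("mean")))
                for c in numeric]
        return Chapter(id="num_distr", title="Numeric distributions",
                       version=CHAPTER_VERSION,
                       blocks=[KVTable(rows=rows, title="Mean per column")])
    except Exception:
        return None  # never raise: a broken chapter is simply omitted
```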

---

## 3. Registration and document order

The canonical order is **pre-declared** in
`python/functions/datascience/automatic_eda/chapters_registry.py`:

```python
CHAPTER_ORDER = [
    "portada", "overview", "num_distr", "cat_distr", "calidad", "correlacion",
    "modelos", "analisis_llm", "timeseries", "geospatial", "agregacion",
]
```

`build_document(profile, ctx)` walks this order, lazily imports
`chapters/<id>.py` and calls `build_<id>`. **To add a chapter you do NOT edit
`chapters_registry.py`**: it is enough to create the module `chapters/<id>.py` (with its `<id>`
already in `CHAPTER_ORDER`) and it will appear automatically in its position. This lets many
agents work **in parallel** without contention: each one touches only its own file.

If your chapter uses an `<id>` that is not yet in `CHAPTER_ORDER`, add it in the correct
position (the only shared edit; coordinate it with the orchestrator).

`build_document` never raises: a chapter whose module does not exist is skipped, and one that
fails or returns `None` is omitted.
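
The behavior described for `build_document` can be sketched with `importlib`. This is an assumption-laden illustration rather than the real implementation: the `order` and `package` parameters are hypothetical additions that make the sketch testable, and the shortened `CHAPTER_ORDER` is only sample data.

```python
import importlib

CHAPTER_ORDER = ["portada", "overview", "num_distr"]  # shortened sample


def build_document(profile, ctx, order=CHAPTER_ORDER, package="chapters"):
    """Walk the declared order, lazily import each chapter module, and never raise."""
    chapters = []
    for chapter_id in order:
        try:
            module = importlib.import_module(f"{package}.{chapter_id}")
            builder = getattr(module, f"build_{chapter_id}")
            chapter = builder(profile, ctx)
        except Exception:
            continue  # missing module or failing builder: skip the chapter
        # Chapters that return None or end up without blocks are omitted.
        if chapter is not None and getattr(chapter, "blocks", None):
            chapters.append(chapter)
    return chapters
```

Because the import happens inside the loop, a chapter that has not been written yet costs nothing and does not block the others.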

---

## 4. Per-chapter versioning + manifest

- `CHAPTER_VERSION` (semver) identifies the **generation** of the chapter. Bump it when
  you change what the chapter emits or how it emits it (not on every run). It is stamped
  in the footer of each page/slide: `<Title> · v<version>`.
- `ENGINE_VERSION` (in `model.py`) versions the global engine.
- On render, `automatic_eda_manifest.json` is written next to the output:

```json
{
  "engine": "AutomaticEDA",
  "engine_version": "1.0.0",
  "generated_at": "2026-06-30 12:20:56 UTC",
  "chapters": {
    "portada":  { "version": "1.0.0", "n_pages": 1, "n_slides": 1 },
    "overview": { "version": "1.0.0", "n_pages": 2, "n_slides": 2 }
  }
}
```

Calling one or both renderers creates/updates the manifest (defensive
read-modify-write). This enables **tracking and continuous improvement per chapter**.
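
One way to picture the defensive read-modify-write is the sketch below, in which each renderer merges its own count key (`n_pages` for PDF, `n_slides` for PPTX) into the existing file. The function name, its parameters and the temp-file-then-rename step are assumptions for illustration; the contract does not describe how the engine writes the file.

```python
import json
import os
from datetime import datetime, timezone


def update_manifest(path, versions, counts, count_key, engine_version="1.0.0"):
    """Merge one renderer's per-chapter counts into the manifest at `path`."""
    try:
        with open(path, encoding="utf-8") as fh:
            manifest = json.load(fh)
        if not isinstance(manifest, dict):
            manifest = {}
    except (OSError, ValueError):
        manifest = {}  # missing or corrupt manifest: start over
    chapters = manifest.get("chapters")
    if not isinstance(chapters, dict):
        chapters = {}
    for chapter_id, version in versions.items():
        entry = chapters.get(chapter_id)
        if not isinstance(entry, dict):
            entry = {}
        entry["version"] = version
        entry[count_key] = counts.get(chapter_id, 0)
        chapters[chapter_id] = entry
    manifest.update({
        "engine": "AutomaticEDA",
        "engine_version": engine_version,
        "generated_at": datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%S UTC"),
        "chapters": chapters,
    })
    tmp = path + ".tmp"
    with open(tmp, "w", encoding="utf-8") as fh:
        json.dump(manifest, fh, indent=2, ensure_ascii=False)
    os.replace(tmp, path)  # swap the file in one step
    return manifest
```

Calling it once per renderer leaves a single entry per chapter that carries both counts, matching the shape of the JSON example.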

---

## 5. `ctx`: presentation context

`ctx` carries metadata that is **not** in the `TableProfile` (the caller supplies it via
`meta['ctx']`). Conventional keys (all optional):

| Key | Use |
|---|---|
| `dataset_name` | dataset name (cover). Default: `profile['table']` |
| `source_origin` | where the dataset comes from (cover). Default: `profile['source']` |
| `storage` | storage technology (cover). Default: inferred from `source` |
| `generated_at` | generation date (cover/manifest). Default: `profiled_at`/now |
| `description` | one-sentence description of the dataset (cover) |
| `granularity` | "Each row is…" (cover). Default: derived from `key_candidates` |
| `quality_criteria` | criteria behind the quality score (cover) |
| `head_rows` | `list[dict]` with `df.head` (overview). See §7 |

A chapter may define and consume its own `ctx` keys; document which ones in its
docstring.
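
The documented defaults can be resolved in one place, as in this sketch. `resolve_ctx` is a hypothetical helper, and it covers only the keys whose fallback the table spells out; `storage` and `granularity` are left out because the contract does not say how they are inferred.

```python
from datetime import datetime, timezone


def resolve_ctx(profile, ctx):
    """Apply the documented ctx defaults without ever raising."""
    profile = profile if isinstance(profile, dict) else {}
    ctx = ctx if isinstance(ctx, dict) else {}
    now = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%S UTC")
    return {
        "dataset_name": ctx.get("dataset_name") or profile.get("table"),
        "source_origin": ctx.get("source_origin") or profile.get("source"),
        "generated_at": ctx.get("generated_at") or profile.get("profiled_at") or now,
        "description": ctx.get("description"),  # no default documented
    }
```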

---

## 6. `profile` keys consumed by each chapter

The `TableProfile` is produced by `profile_table(...)["profile"]` (`eda` group). Top-level
keys: `table, source, profiled_at, n_rows, n_cols, size_bytes, duplicate_rows,
duplicate_pct, null_cell_pct, constant_cols, all_null_cols, quality_score,
type_breakdown, key_candidates, columns[], correlations, llm, models, series, caveats`.

Each `columns[i]`: `name, inferred_type, semantic_type, physical_type, distinct_count,
unique_pct, null_count, null_pct, empty_count, empty_pct, flags, quality_score,
numeric{min,max,mean,median,std,variance,cv,iqr,skew,kurtosis,p1..p99,mode,n_outliers,
outlier_pct,zero_pct,negative_pct,distribution_type,histogram[{lo,hi,count}]},
categorical{top[{value,count,pct}],mode,n_distinct,entropy,imbalance,len_min/mean/max},
reexpression, series{...}`.

| Chapter | Profile keys it consumes |
|---|---|
| `portada` | `table, source, profiled_at, n_rows, n_cols, quality_score, key_candidates` + `ctx` |
| `overview` | `columns[].{name,inferred_type,semantic_type,physical_type,null_pct,null_count,categorical.top,numeric.{min,median,max,mean,std}}`, `head_rows` (see §7) |
| `num_distr` (pending) | `columns[] numeric.{histogram,mean,median,std,outlier_pct,...}` |
| `cat_distr` (pending) | `columns[] categorical.{top,entropy,imbalance}` |
| `calidad` (pending) | `quality_score`, `columns[].{quality_score,flags,issues}`, `duplicate_*`, `null_cell_pct`, `constant_cols`, `all_null_cols` |
| `correlacion` (pending) | `correlations.pairs[{a,b,value,method}]`, `correlations.levels_caveat` |
| `modelos` (pending) | `models.{pca,kmeans,outliers,normality}` |
| `analisis_llm` (pending) | `llm` |
| `timeseries` (pending) | `series{col:{stationarity,acf_pacf,stl,levels_*}}` |
| `geospatial` (pending) | columns with a geographic `semantic_type` (lat/lon) |
| `agregacion` (pending) | `columns[]` + aggregates added by the computation phase |

---

## 7. New profile keys that the computation phase must add

The current `TableProfile` does **not** carry these keys; the OVERVIEW chapter consumes them
and, if they are missing, degrades honestly (placeholder + derivation from real values). For
a complete overview, the computation phase (another agent) must add:

- `profile['head_rows']`: `list[dict]` with the first N rows (`df.head`), one
  `{column: value}` dict per row. Until then OVERVIEW shows a placeholder.
- `columns[i]['examples']`: `list` of up to N raw **non-null** values from the column.
  Until then OVERVIEW derives examples from `categorical.top[].value` (categorical columns)
  and from `numeric.{min,median,max}` (numeric columns); these are real values, not invented ones.

Implementation suggestion (not mandatory in this phase): a registry function that
samples `head_rows`/`examples` from DuckDB and injects them into the profile before
rendering (delegate to `fn-constructor`, tag `eda`).
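
The fallback chain for example values described above can be sketched as follows. `derive_examples` is a hypothetical name, and it returns a list for clarity; it is not the `_examples_for` helper of the real OVERVIEW module, whose body is not shown in this contract.

```python
def derive_examples(col, limit=3):
    """Return up to `limit` real example values for a column, never invented ones."""
    if not isinstance(col, dict):
        return []
    # 1) Raw examples, once the computation phase provides them.
    examples = col.get("examples")
    if isinstance(examples, list):
        values = [v for v in examples if v is not None]
        if values:
            return values[:limit]
    # 2) Categorical columns: the most frequent values.
    cat = col.get("categorical")
    if isinstance(cat, dict) and isinstance(cat.get("top"), list):
        values = [t.get("value") for t in cat["top"]
                  if isinstance(t, dict) and t.get("value") is not None]
        if values:
            return values[:limit]
    # 3) Numeric columns: min / median / max taken from the profile.
    num = col.get("numeric")
    if isinstance(num, dict):
        return [num[k] for k in ("min", "median", "max") if num.get(k) is not None][:limit]
    return []
```

An empty result signals that the caller should show a placeholder rather than an empty or made-up cell.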

---

## 8. COMPLETE reference chapter example (OVERVIEW)

Copy this pattern. Real file:
`python/functions/datascience/automatic_eda/chapters/overview.py`.

```python
from .. import model

CHAPTER_VERSION = "1.0.0"
CHAPTER_ID = "overview"
CHAPTER_TITLE = "Overview"

def _fmt_num(v, d=3):
    # ... defensive formatting (None -> "—", compact floats) ...
    ...

def _examples_for(col: dict) -> str:
    # 1) col['examples'] if present; 2) categorical.top[].value;
    # 3) numeric.{min,median,max}. Never an empty or invented cell.
    ...

def build_overview(profile: dict, ctx: dict):
    profile = profile or {}
    ctx = ctx or {}
    cols = profile.get("columns") or []
    if not cols and not (ctx.get("head_rows") or profile.get("head_rows")):
        return None                       # not applicable.

    blocks = [
        model.Heading(text="First rows (df.head)", level=2),
        _head_block(profile, ctx),        # DataTable(df.head), or a Note if head_rows is missing.
    ]
    cols_block = _columns_block(profile)  # DataTable: name/type/nulls/examples.
    if cols_block is not None:
        blocks.append(model.Heading(text="Column dictionary", level=2))
        blocks.append(cols_block)
    desc_block = _describe_block(profile) # DataTable: mean/median/min/max/std.
    if desc_block is not None:
        blocks.append(model.Heading(text="Numeric statistical summary", level=2))
        blocks.append(desc_block)

    return model.Chapter(id=CHAPTER_ID, title=CHAPTER_TITLE,
                         version=CHAPTER_VERSION, blocks=blocks)
```

Key points every chapter must respect:

1. **Defensive reading**: `profile.get(...)`, `or []`, `isinstance` checks; never
   assume a key exists and never raise.
2. **`None` if not applicable**: return `None` (or empty `blocks`) when the dataset lacks
   what the chapter needs.
3. **Do not invent**: if a piece of data is missing (e.g. `df.head`), show an honest
   placeholder or derive it from real values in the profile; leave the gap documented.
4. **Tables via `DataTable`**: let the renderer split them and repeat the header; do not
   pre-paginate them yourself.
5. **Figures via `Figure(make=...)`**: pass them lazily; the renderer draws and scales them.

---

## 9. How to test a chapter

```python
from datascience.automatic_eda import build_document, render_pdf, render_pptx
chapters = build_document(profile, ctx={"dataset_name": "..."})
render_pdf(chapters, "reports/x.pdf", {"title": "EDA"})
render_pptx(chapters, "reports/x.pptx", {"title": "EDA"})
```

Or directly through the public functions with the whole profile (they build the chapters):

```python
from datascience import render_automatic_eda_pdf, render_automatic_eda_pptx
render_automatic_eda_pdf(profile, "reports/x.pdf", {"ctx": {...}})
render_automatic_eda_pptx(profile, "reports/x.pptx", {"ctx": {...}})
```

Add one self-contained test per chapter (synthetic profile, no DuckDB) that verifies
that its blocks are present and that nothing is cut (long text intact in the output). Pattern:
`render_automatic_eda_pdf_test.py`.

---

## 10. Future integration with `profile_table` (next phase)

`profile_table(emit_pdf=True)` currently uses `render_eda_pdf` (untouched). In the next phase
`emit_automatic=True` will be added (or `emit_pdf` will be migrated) so that every EDA
**always** emits the AutomaticEDA engine's PDF + PPTX from the same profile:

```python
# Sketch of the additive integration (do NOT enable it if it breaks the current tests):
if emit_automatic:
    ctx = {"dataset_name": table, "source_origin": db_path, ...}
    render_automatic_eda_pdf(prof, os.path.join(report_dir, f"aeda_{table}_{ts}.pdf"),
                             {"title": f"EDA — {table}", "ctx": ctx})
    render_automatic_eda_pptx(prof, os.path.join(report_dir, f"aeda_{table}_{ts}.pptx"),
                              {"title": f"EDA — {table}", "ctx": ctx})
```

Until then the renderers are invoked directly on the `profile` that
`profile_table` already returns.
@@ -68,7 +68,7 @@ Indice de grupos de capacidades del registry. Cada grupo agrupa >=3 funciones qu
|
||||
| [consent](consent.md) | 3 | CMP / IAB TCF / data brokers: detectar el CMP de un sitio (Didomi/OneTrust/Sourcepoint/Quantcast), leer `__tcfapi` para contar vendors y propositos, aceptar el banner (selectores + fallback LLM con haiku que localiza Aceptar/Ver socios), y descargar la GVL de IAB para nominar cada broker y que datos recopila. Nacio de `projects/databrokers/` |
|
||||
| [onlyoffice](onlyoffice.md) | 3 | Operar ONLYOFFICE Desktop Editors (binario onlyoffice-desktopeditors) en Linux/X11 desde terminal via instancia aislada (slot HOME=/tmp/oo_<instance>): abrir un archivo en ventana propia, cerrar+reabrir para mostrar datos editados en disco (no hay reload nativo, Issue #2313), y matar el proceso del slot. Solo gestiona la ventana, NO edita ni crea archivos. Requiere X11 + wmctrl + xdotool. No confundir con el Document Server (web/Docker) |
|
||||
| [email](email.md) | 21 | Gestionar cuentas de correo por IMAP+SMTP directo (Python stdlib, sin browser ni MCP Gmail): conectar/listar/buscar/leer (imap_*), mutar estado (mark_seen/move/delete/save_draft) por UID, y construir+enviar (email_build_html/smtp_send). Auth user+app-password (NO OAuth; Outlook fuera). Credenciales desde pass, resueltas por la capa app. Complementa al browser (interactivo) — no lo reemplaza |
|
||||
| [eda](eda.md) | 29 | Exploratory Data Analysis por tabla y base con motor DuckDB + PostgreSQL push-down: perfil base SQL (SUMMARIZE + distinct exacto), estadística numérica/categórica, tipo semántico regex, calidad, correlación/asociación (Pearson/Spearman/Cramér's V/Theil's U/η/MI), relaciones inter-tabla (FK containment + join graph mermaid), modelos baratos (PCA/KMeans/IsolationForest/normalidad/tendencia), capa LLM (dictionary/PII/limpieza/análisis) y generación de notebook. Orquestadores `profile_table` (backend duckdb/postgres, flags run_models/run_llm) y `profile_database` |
|
||||
| [eda](eda.md) | 27 | Exploratory Data Analysis por tabla y base con motor DuckDB + PostgreSQL push-down: perfil base SQL (SUMMARIZE + distinct exacto), estadística numérica/categórica, tipo semántico regex, calidad, correlación/asociación (Pearson/Spearman/Cramér's V/Theil's U/η/MI), relaciones inter-tabla (FK containment + join graph mermaid), modelos baratos (PCA/KMeans/IsolationForest/normalidad/tendencia), capa LLM (dictionary/PII/limpieza/análisis) y generación de notebook. Orquestadores `profile_table` (backend duckdb/postgres, flags run_models/run_llm) y `profile_database` |
|
||||
| [seo](seo.md) | 3 | SEO orientado a datos sobre Google Search Console: autenticar con service account (`gsc_auth`), extraer Search Analytics paginado (`pull_gsc_search_analytics`) y el pipeline de ingesta a DuckDB + espejo Postgres para Metabase (`ingest_gsc_search_analytics`). Cadena de ingesta del proyecto `seo_analytics`; alimenta dashboards de striking distance, CTR opportunities y content decay |
|
||||
| [local-hub](local-hub.md) | 4 | Exponer los procesos locales como subdominios `*.localhost` (via Caddy, sin DNS) y reunirlos en una pantalla principal Glance con estado en vivo, refrescada a diario por dag_engine. Descubre servicios (manifiesto + registry), renderiza Caddyfile + config Glance (puras), y el pipeline `refresh_local_hub` regenera+recarga. Fuente de verdad: `apps/local_hub/local_services.yaml` |
|
||||
| [comfyui-judge](comfyui-judge.md) | 4 | Multi-judge image-quality panel: LAION-V2 aesthetic score (`comfyui_score_aesthetic`, 0-10) + CLIP prompt↔image fidelity (`comfyui_score_clip_alignment`, 0-1) + LLM-vision critique (`comfyui_critique_image_llm`, good/bad). Aggregated by majority vote in `comfyui_judge_image`. Objective gate for tests/DoD and the ComfyUI skill-improvement loop; degrades gracefully if a judge fails. The aesthetic/fidelity judges run as subprocesses in the ComfyUI venv (torch+open_clip); the critic runs via claude-direct |
|
||||
|
||||
@@ -71,10 +71,6 @@ One-shot orchestrators:
|
||||
| `eda_llm_insights_py_datascience` | impure | One LLM call over the aggregated profile (not raw rows): data dictionary, summary, row granularity, PII/GDPR, cleaning, suggested analyses. |
|
||||
| `build_eda_notebook_py_datascience` | impure | Generates an `.ipynb` (nbformat v4) that profiles the table, ready to launch in collaborative Jupyter. |
|
||||
| `render_eda_pdf_py_datascience` | impure | Renders the `TableProfile` to a multi-page **portrait (A5), mobile-readable** PDF (Tufte style: histograms as small multiples, top-k, association heatmap). Fourth output of the workflow, alongside JSON/Markdown/notebook. |
|
||||
| `render_automatic_eda_pdf_py_datascience` | impure | **AutomaticEDA** engine: a CHAPTER-based document (format-independent block model) → mobile A5 PDF that **never cuts** text/tables/images (long tables are split with the header repeated) + a per-chapter versioned manifest. Accepts the `TableProfile` or chapters of the model. Additive; does not replace `render_eda_pdf`. |
|
||||
| `render_automatic_eda_pptx_py_datascience` | impure | **AutomaticEDA** engine → 16:9 PPTX for **sharing**, built from the same chapter-based document; same anti-cut principle (content continues on a `(cont.)` slide). Engine: `python-pptx`. |
|
||||
|
||||
> **AutomaticEDA** (new core, chapter phase): separates content (chapters/blocks) from format (mobile PDF + PPTX). To write a new chapter (NUM DISTR, CAT DISTR, CALIDAD, CORRELACIÓN, MODELOS, ANÁLISIS LLM, TIMESERIES, GEOSPATIAL, AGREGACIÓN), read the contract: **`docs/automatic_eda_contract.md`**. The engine code lives in `python/functions/datascience/automatic_eda/`; reference chapters: `portada`, `overview`.
|
||||
|
||||
### Orchestrators (pipelines)
|
||||
| ID | What it does |
|
||||
|
||||
@@ -30,6 +30,7 @@ type auditFnMeta struct {
|
||||
domain string
|
||||
lang string
|
||||
signature string
|
||||
filePath string // registry-relative path to the .go source (Go funcs only)
|
||||
}
|
||||
|
||||
// skipDirs are directory names ignored when walking source for audits.
|
||||
@@ -80,15 +81,16 @@ func AuditUsesFunctions(registryRoot string) ([]UsesFunctionsAudit, error) {
|
||||
return nil, fmt.Errorf("audit_uses_functions: ping db: %w", err)
|
||||
}
|
||||
|
||||
// Load all Go/Python/TS functions from registry: id → name, domain, lang, signature.
|
||||
rows, err := db.Query(`SELECT id, name, domain, lang, COALESCE(signature, '') FROM functions WHERE lang IN ('go','py','ts')`)
|
||||
// Load all Go/Python/TS functions from registry: id → name, domain, lang,
|
||||
// signature, file_path. file_path feeds the Go .go fallback (see auditGoApp).
|
||||
rows, err := db.Query(`SELECT id, name, domain, lang, COALESCE(signature, ''), COALESCE(file_path, '') FROM functions WHERE lang IN ('go','py','ts')`)
|
||||
if err != nil {
|
||||
return nil, fmt.Errorf("audit_uses_functions: query functions: %w", err)
|
||||
}
|
||||
allFunctions := make(map[string]auditFnMeta) // id → meta
|
||||
for rows.Next() {
|
||||
var m auditFnMeta
|
||||
if err := rows.Scan(&m.id, &m.name, &m.domain, &m.lang, &m.signature); err != nil {
|
||||
if err := rows.Scan(&m.id, &m.name, &m.domain, &m.lang, &m.signature, &m.filePath); err != nil {
|
||||
continue
|
||||
}
|
||||
allFunctions[m.id] = m
|
||||
@@ -144,7 +146,7 @@ func AuditUsesFunctions(registryRoot string) ([]UsesFunctionsAudit, error) {
|
||||
|
||||
switch app.lang {
|
||||
case "go":
|
||||
importedIDs = append(importedIDs, auditGoApp(absDir, allFunctions)...)
|
||||
importedIDs = append(importedIDs, auditGoApp(absDir, allFunctions, registryRoot)...)
|
||||
scannedLangs["go"] = true
|
||||
case "py":
|
||||
importedIDs = append(importedIDs, auditPyApp(absDir, allFunctions)...)
|
||||
@@ -197,11 +199,18 @@ func AuditUsesFunctions(registryRoot string) ([]UsesFunctionsAudit, error) {
|
||||
// Strategy:
|
||||
// 1. Find all "fn-registry/functions/<domain>" import paths (production code only).
|
||||
// 2. For each domain, collect registry functions in that domain.
|
||||
// 3. Grep source files for the exported symbol. The token tried first is the
|
||||
// real Go func identifier parsed from the registry signature; fallback is
|
||||
// PascalCase(name). Many functions deviate (e.g. sqlite_column_exists has
|
||||
// `func ColumnExists`), so signature is the source of truth.
|
||||
func auditGoApp(appDir string, all map[string]auditFnMeta) []string {
|
||||
// 3. Grep source files for the exported symbol. Tokens tried, in order:
|
||||
// a) the real Go func identifier parsed from the registry signature;
|
||||
// b) PascalCase(name) (with commonAbbrevs);
|
||||
// c) the real exported func read straight from the function's .go file.
|
||||
//
|
||||
// Many functions deviate from snake_case→PascalCase (e.g. sqlite_column_exists
|
||||
// has `func ColumnExists`, wails_bind_crud has `func GenerateWailsCRUD`). The
|
||||
// signature is usually the source of truth, but some signatures omit the `func`
|
||||
// keyword or list a different primary symbol; step (c) reads the .go file as a
|
||||
// last-resort fallback so those cases stop being false positives ("unused").
|
||||
// The .go read is cached per execution to avoid reopening the same file.
|
||||
func auditGoApp(appDir string, all map[string]auditFnMeta, registryRoot string) []string {
|
||||
// Step 1: collect imported domains.
|
||||
importedDomains := collectGoImportedDomains(appDir)
|
||||
if len(importedDomains) == 0 {
|
||||
@@ -216,6 +225,10 @@ func auditGoApp(appDir string, all map[string]auditFnMeta) []string {
|
||||
return nil
|
||||
}
|
||||
|
||||
// Cache for the .go fallback: registry file_path → real exported func name.
|
||||
// Populated lazily, only when the cheaper tokens fail to match.
|
||||
goFileSymbolCache := make(map[string]string)
|
||||
|
||||
for _, m := range all {
|
||||
if m.lang != "go" {
|
||||
continue
|
||||
@@ -223,17 +236,76 @@ func auditGoApp(appDir string, all map[string]auditFnMeta) []string {
|
||||
if !importedDomains[m.domain] {
|
||||
continue
|
||||
}
|
||||
tokens := goCandidateTokens(m)
|
||||
for _, tok := range tokens {
|
||||
matched := false
|
||||
for _, tok := range goCandidateTokens(m) {
|
||||
if containsToken(blob, tok) {
|
||||
used = append(used, m.id)
|
||||
matched = true
|
||||
break
|
||||
}
|
||||
}
|
||||
if !matched && goSignatureSymbol(m) == "" {
|
||||
// Fallback (c): read the registry .go file and look for the real
|
||||
// exported func name. Gated on an EMPTY signature symbol on purpose:
|
||||
// when the signature already yields a concrete `func <Name>` it is the
|
||||
// authoritative symbol, so reading the .go (which can only guess the
|
||||
// file's first exported func) must not override it. Several registry
|
||||
// functions share one .go file via the "TU adicional" pattern (e.g.
|
||||
// cdp_new_tab lives in cdp_list_tabs.go); without this gate the first
|
||||
// func would be mis-attributed to every sibling and suppress real
|
||||
// "unused" findings. The file read therefore only happens for the rare
|
||||
// functions whose stored signature omits the `func` keyword.
|
||||
if sym := goRealExportedName(registryRoot, m.filePath, goFileSymbolCache); sym != "" {
|
||||
if containsToken(blob, sym) {
|
||||
matched = true
|
||||
}
|
||||
}
|
||||
}
|
||||
if matched {
|
||||
used = append(used, m.id)
|
||||
}
|
||||
}
|
||||
return used
|
||||
}
|
||||
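The matching order described in the strategy comment of `auditGoApp` can be sketched in isolation. This is a minimal sketch, not the registry's code: `fnMeta`, `hasToken` and `isUsed` are hypothetical names, and the word-boundary regex is only an assumed stand-in for `containsToken`, whose implementation is not shown in this diff.

```go
package main

import (
	"fmt"
	"regexp"
)

// fnMeta is a hypothetical, trimmed-down stand-in for auditFnMeta.
type fnMeta struct {
	sigSymbol string // symbol parsed from the signature, "" when it has no `func`
	pascal    string // PascalCase(name)
	fileFirst string // first exported func found in the .go file, "" if unknown
}

// hasToken reports whether tok occurs in blob as a whole identifier.
// Assumption: the real containsToken may use a different boundary rule.
func hasToken(blob, tok string) bool {
	if tok == "" {
		return false
	}
	re := regexp.MustCompile(`\b` + regexp.QuoteMeta(tok) + `\b`)
	return re.MatchString(blob)
}

// isUsed tries the cheap tokens first and consults the file-derived symbol
// only when the signature produced no symbol (the gate).
func isUsed(blob string, m fnMeta) bool {
	for _, tok := range []string{m.sigSymbol, m.pascal} {
		if hasToken(blob, tok) {
			return true
		}
	}
	if m.sigSymbol == "" {
		return hasToken(blob, m.fileFirst)
	}
	return false
}

func main() {
	tabsApp := "fmt.Println(browser.ListTabs())"
	listTabs := fnMeta{sigSymbol: "ListTabs", pascal: "ListTabs", fileFirst: "ListTabs"}
	// NewTab shares the file with ListTabs, so the file's first func is ListTabs.
	newTab := fnMeta{sigSymbol: "NewTab", pascal: "NewTab", fileFirst: "ListTabs"}

	crudApp := "fmt.Println(infra.GenerateWailsCRUD(spec))"
	// Signature without `func`: only the file fallback can recover the symbol.
	wails := fnMeta{sigSymbol: "", pascal: "GenWailsCRUD", fileFirst: "GenerateWailsCRUD"}

	fmt.Println(isUsed(tabsApp, listTabs), isUsed(tabsApp, newTab), isUsed(crudApp, wails))
	// prints: true false true
}
```

Without the `m.sigSymbol == ""` gate, `newTab` would match through the shared file's first func and the genuine "unused" finding would disappear, which is the regression the shared-file test pins down.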
|
||||
// goRealExportedFnRe matches a top-level exported func declaration in a .go
|
||||
// source file: `func Name(` or the generic form `func Name[T any](`. It captures
|
||||
// the func identifier. Method declarations (`func (r *T) Name(`) are skipped on
|
||||
// purpose — a registry function's primary symbol is a top-level func, and method
|
||||
// names would risk spurious matches. Used by the .go fallback to recover the real
|
||||
// symbol name when the registry signature/name heuristics fail.
|
||||
var goRealExportedFnRe = regexp.MustCompile(`^func\s+([A-Z][A-Za-z0-9_]*)\s*[\(\[]`)
|
||||
|
||||
// goRealExportedName reads the registry .go file at filePath (relative to
|
||||
// registryRoot) and returns the first exported func identifier found. Results
|
||||
// are memoised in cache (filePath → symbol, "" when the file is unreadable or
|
||||
// has no exported func) so a file is opened at most once per audit run.
|
||||
func goRealExportedName(registryRoot, filePath string, cache map[string]string) string {
|
||||
if filePath == "" {
|
||||
return ""
|
||||
}
|
||||
if sym, ok := cache[filePath]; ok {
|
||||
return sym
|
||||
}
|
||||
cache[filePath] = "" // pre-seed so an unreadable file is not retried
|
||||
abs := filePath
|
||||
if !filepath.IsAbs(abs) {
|
||||
abs = filepath.Join(registryRoot, filePath)
|
||||
}
|
||||
f, err := os.Open(abs)
|
||||
if err != nil {
|
||||
return ""
|
||||
}
|
||||
defer f.Close()
|
||||
sc := bufio.NewScanner(f)
|
||||
for sc.Scan() {
|
||||
if m := goRealExportedFnRe.FindStringSubmatch(sc.Text()); m != nil {
|
||||
cache[filePath] = m[1]
|
||||
return m[1]
|
||||
}
|
||||
}
|
||||
return ""
|
||||
}
|
||||
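To see which declarations the pattern in `goRealExportedFnRe` accepts, it can be run over an in-memory source string. The sketch below reuses that regular expression as it appears in the diff; `firstExportedFunc` is a hypothetical helper that scans a string instead of opening a file and leaves out the cache.

```go
package main

import (
	"bufio"
	"fmt"
	"regexp"
	"strings"
)

// Same pattern as goRealExportedFnRe in the diff: a top-level exported func,
// plain `func Name(` or generic `func Name[`.
var exportedFnRe = regexp.MustCompile(`^func\s+([A-Z][A-Za-z0-9_]*)\s*[\(\[]`)

// firstExportedFunc returns the first top-level exported func name in src,
// or "" when there is none. Hypothetical helper for illustration only.
func firstExportedFunc(src string) string {
	sc := bufio.NewScanner(strings.NewReader(src))
	for sc.Scan() {
		if m := exportedFnRe.FindStringSubmatch(sc.Text()); m != nil {
			return m[1]
		}
	}
	return ""
}

func main() {
	src := "package infra\n\n" +
		"func helper() {}\n\n" + // unexported: skipped
		"func (t *T) Save() {}\n\n" + // method: skipped, '(' follows `func `
		"func GenerateWailsCRUD(spec int) string { return \"\" }\n"
	fmt.Println(firstExportedFunc(src))
	// prints: GenerateWailsCRUD

	fmt.Println(firstExportedFunc("package infra\n\nfunc WailsStreamData[X any](xs []X) {}\n"))
	// prints: WailsStreamData
}
```

Because the pattern is anchored with `^` and each scanned line is matched on its own, an indented or nested `func` literal would not match; only declarations that start in column one are considered.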
|
||||
// goCandidateTokens returns the identifiers we try when looking for usages
|
||||
// of a Go function in source. Real exported name from signature first,
|
||||
// PascalCase(name) as fallback.
|
||||
@@ -241,10 +313,8 @@ var goSignatureFnRe = regexp.MustCompile(`^\s*func\s+(?:\([^)]*\)\s+)?([A-Z][A-Z
|
||||
|
||||
func goCandidateTokens(m auditFnMeta) []string {
|
||||
out := []string{}
|
||||
if m.signature != "" {
|
||||
if match := goSignatureFnRe.FindStringSubmatch(m.signature); match != nil {
|
||||
out = append(out, match[1])
|
||||
}
|
||||
if sym := goSignatureSymbol(m); sym != "" {
|
||||
out = append(out, sym)
|
||||
}
|
||||
pascal := snakeToPascal(m.name)
|
||||
if pascal != "" && (len(out) == 0 || out[0] != pascal) {
|
||||
@@ -253,6 +323,21 @@ func goCandidateTokens(m auditFnMeta) []string {
|
||||
return out
|
||||
}
|
||||
|
||||
// goSignatureSymbol returns the exported Go identifier parsed from the registry
|
||||
// signature (`func Name(...)` or `func (r *T) Name(...)`), or "" when the
|
||||
// signature is empty or does not start with a `func` declaration. A non-empty
|
||||
// result is the authoritative symbol for the function and gates off the .go
|
||||
// fallback in auditGoApp.
|
||||
func goSignatureSymbol(m auditFnMeta) string {
|
||||
if m.signature == "" {
|
||||
return ""
|
||||
}
|
||||
if match := goSignatureFnRe.FindStringSubmatch(m.signature); match != nil {
|
||||
return match[1]
|
||||
}
|
||||
return ""
|
||||
}
|
||||
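The gate depends on whether a stored signature begins with a `func` declaration. The sketch below illustrates that distinction under an explicit assumption: the registry's `goSignatureFnRe` is only partially visible in this diff, so `sigRe` is an approximation written for this example, and `signatureSymbol` is a hypothetical name.

```go
package main

import (
	"fmt"
	"regexp"
)

// sigRe approximates the signature pattern: optional leading space, `func`,
// an optional method receiver, then an exported identifier.
// Assumption: this is NOT the registry's goSignatureFnRe, which is truncated
// in the diff.
var sigRe = regexp.MustCompile(`^\s*func\s+(?:\([^)]*\)\s+)?([A-Z][A-Za-z0-9_]*)`)

// signatureSymbol returns the exported identifier of a `func` signature,
// or "" when the signature is empty or does not start with `func`.
func signatureSymbol(signature string) string {
	if m := sigRe.FindStringSubmatch(signature); m != nil {
		return m[1]
	}
	return ""
}

func main() {
	for _, sig := range []string{
		"func ColumnExists(db *sql.DB, table, col string) (bool, error)",
		"func (r *Repo) Save() error",
		"GenerateWailsCRUD(spec WailsCRUDSpec) string", // no `func` keyword
		"",
	} {
		fmt.Printf("%q -> %q\n", sig, signatureSymbol(sig))
	}
}
```

Only the last two inputs yield an empty symbol, and those are the only cases in which `auditGoApp` goes on to read the `.go` file.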
|
||||
// collectGoImportedDomains returns the set of registry domains imported by .go files.
|
||||
var goImportRe = regexp.MustCompile(`"fn-registry/functions/([a-z]+)"`)
|
||||
|
||||
@@ -452,6 +537,34 @@ var commonAbbrevs = map[string]string{
|
||||
"io": "IO",
|
||||
"ok": "OK",
|
||||
"ui": "UI",
|
||||
// Issue 0057 — abbreviations verified consistent across the registry's own
|
||||
// Go func names (each entry maps a real `func <Name>` deviation). These only
|
||||
// improve the PascalCase fallback; the signature and the .go fallback remain
|
||||
// the primary sources of truth. Deliberately NOT added because the registry
|
||||
// itself is inconsistent for them (mapping would create more mismatches than
|
||||
// it fixes): "cdp" (uses Cdp: CdpGetHTML, CdpNavigate — not CDP) and
|
||||
// "pdf" (CdpPrintPDF vs PdfSimpleReport).
|
||||
"ohlcv": "OHLCV",
|
||||
"duckdb": "DuckDB",
|
||||
"clickhouse": "ClickHouse",
|
||||
"nordvpn": "NordVPN",
|
||||
"sha256": "SHA256",
|
||||
"md5": "MD5",
|
||||
"ansi": "ANSI",
|
||||
"cidr": "CIDR",
|
||||
"aead": "AEAD",
|
||||
"pty": "PTY",
|
||||
"vps": "VPS",
|
||||
"wg": "WG",
|
||||
"vt": "VT",
|
||||
"fft": "FFT",
|
||||
"ema": "EMA",
|
||||
"rsi": "RSI",
|
||||
"sma": "SMA",
|
||||
"vwap": "VWAP",
|
||||
"ax": "AX",
|
||||
"e2e": "E2E",
|
||||
"urls": "URLs",
|
||||
}
|
||||
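How an abbreviation table feeds the PascalCase fallback can be shown with a small stand-alone conversion. The body of `snakeToPascal` is not part of this diff, so `pascalFromSnake` below is a hypothetical reimplementation, and the `abbrevs` map holds only a few entries; `"html"` is assumed to be one of the pre-existing mappings.

```go
package main

import (
	"fmt"
	"strings"
)

// abbrevs is a small illustrative subset of an abbreviation table.
// "cdp" is deliberately absent, so it is title-cased like any other word.
var abbrevs = map[string]string{
	"ohlcv":  "OHLCV",
	"duckdb": "DuckDB",
	"urls":   "URLs",
	"html":   "HTML", // assumed pre-existing entry
}

// pascalFromSnake converts snake_case to PascalCase, replacing each part
// found in abbrevs with its canonical spelling. ASCII names only.
func pascalFromSnake(name string) string {
	var b strings.Builder
	for _, part := range strings.Split(name, "_") {
		if part == "" {
			continue // tolerate doubled or trailing underscores
		}
		if canon, ok := abbrevs[part]; ok {
			b.WriteString(canon)
			continue
		}
		b.WriteString(strings.ToUpper(part[:1]) + part[1:])
	}
	return b.String()
}

func main() {
	for _, in := range []string{"load_ohlcv_from_duckdb", "extract_urls", "cdp_get_html"} {
		fmt.Println(in, "->", pascalFromSnake(in))
	}
	// prints:
	// load_ohlcv_from_duckdb -> LoadOHLCVFromDuckDB
	// extract_urls -> ExtractURLs
	// cdp_get_html -> CdpGetHTML
}
```

Leaving `"cdp"` out of the table is what keeps `CdpGetHTML` mixed-case, mirroring the deliberate non-mappings described in the comment above.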
|
||||
// hasTSSources reports whether appDir contains any production .ts/.tsx files
|
||||
|
||||
@@ -148,6 +148,273 @@ func main() { fmt.Println("hello") }
|
||||
})
|
||||
}
|
||||
|
||||
// TestSnakeToPascal_HandlesAbbreviations verifies the commonAbbrevs expansion
|
||||
// (issue 0057, Fase 1). Each "want" is the exported Go symbol the registry
|
||||
// actually uses for that snake_case name. It also pins the deliberate
|
||||
// non-mappings (cdp, pdf): the registry's own convention is mixed-case there,
|
||||
// so the abbreviation must NOT fire.
|
||||
func TestSnakeToPascal_HandlesAbbreviations(t *testing.T) {
|
||||
cases := []struct{ in, want string }{
|
||||
// New abbreviations added by issue 0057 (verified against real func names).
|
||||
{"fetch_ohlcv", "FetchOHLCV"},
|
||||
{"normalize_ohlcv", "NormalizeOHLCV"},
|
||||
{"duckdb_open", "DuckDBOpen"},
|
||||
{"load_ohlcv_from_duckdb", "LoadOHLCVFromDuckDB"},
|
||||
{"clickhouse_open", "ClickHouseOpen"},
|
||||
{"nordvpn_container_run", "NordVPNContainerRun"},
|
||||
{"parse_nordvpn_status", "ParseNordVPNStatus"},
|
||||
{"hash_sha256", "HashSHA256"},
|
||||
{"hash_md5", "HashMD5"},
|
||||
{"strip_ansi", "StripANSI"},
|
||||
{"parse_ip_cidr", "ParseIPCIDR"},
|
||||
{"open_aead", "OpenAEAD"},
|
||||
{"seal_aead", "SealAEAD"},
|
||||
{"pty_capture_stream", "PTYCaptureStream"},
|
||||
{"setup_vps_app", "SetupVPSApp"},
|
||||
{"vps_setup_app", "VPSSetupApp"},
|
||||
{"wg_keygen", "WGKeygen"},
|
||||
{"wg_peer_add", "WGPeerAdd"},
|
||||
{"vt_render", "VTRender"},
|
||||
{"fft", "FFT"},
|
||||
{"ema", "EMA"},
|
||||
{"rsi", "RSI"},
|
||||
{"sma", "SMA"},
|
||||
{"vwap", "VWAP"},
|
||||
{"cdp_get_ax_outline", "CdpGetAXOutline"},
|
||||
{"audit_e2e_coverage", "AuditE2ECoverage"},
|
||||
{"e2e_run_checks", "E2ERunChecks"},
|
||||
{"extract_urls", "ExtractURLs"},
|
||||
// Pre-existing abbreviations (regression guard — must keep working).
|
||||
{"http_json_response", "HTTPJSONResponse"},
|
||||
{"sqlite_open", "SQLiteOpen"},
|
||||
{"random_hex_id", "RandomHexID"},
|
||||
// Deliberate non-mappings: registry uses mixed-case (Cdp, Pdf) here, so
|
||||
// the snake_case→Pascal conversion must leave them mixed-case. These are
|
||||
// the cases the .go fallback (Fase 2) and the signature path cover.
|
||||
{"cdp_get_html", "CdpGetHTML"},
|
||||
{"cdp_navigate", "CdpNavigate"},
|
||||
{"pdf_simple_report", "PdfSimpleReport"},
|
||||
}
|
||||
for _, c := range cases {
|
||||
if got := snakeToPascal(c.in); got != c.want {
|
||||
t.Errorf("snakeToPascal(%q) = %q, want %q", c.in, got, c.want)
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
// goFallbackEnv builds a minimal registry.db + app on disk for the .go fallback
|
||||
// test. The registry function gen_wails_crud_go_infra mimics wails_bind_crud:
|
||||
// its signature omits the `func` keyword (so the signature regex misses) and its
|
||||
// PascalCase("gen_wails_crud")="GenWailsCRUD" differs from the real exported
|
||||
// symbol "GenerateWailsCRUD". The app calls the real symbol. When writeFnFile is
|
||||
// true, the registry .go file exists and the fallback can recover the symbol.
|
||||
func goFallbackEnv(t *testing.T, fnFilePath string, writeFnFile bool) UsesFunctionsAudit {
|
||||
t.Helper()
|
||||
root := t.TempDir()
|
||||
dbPath := filepath.Join(root, "registry.db")
|
||||
db, err := sql.Open("sqlite3", dbPath)
|
||||
if err != nil {
|
||||
t.Fatal(err)
|
||||
}
|
||||
_, err = db.Exec(`
|
||||
CREATE TABLE functions (id TEXT PRIMARY KEY, name TEXT, domain TEXT, lang TEXT, signature TEXT, file_path TEXT);
|
||||
CREATE TABLE apps (id TEXT PRIMARY KEY, lang TEXT, dir_path TEXT, uses_functions TEXT DEFAULT '[]');
|
||||
`)
|
||||
if err != nil {
|
||||
t.Fatal(err)
|
||||
}
|
||||
_, err = db.Exec(
|
||||
`INSERT INTO functions (id,name,domain,lang,signature,file_path) VALUES (?,?,?,?,?,?)`,
|
||||
"gen_wails_crud_go_infra", "gen_wails_crud", "infra", "go",
|
||||
"GenerateWailsCRUD(spec WailsCRUDSpec) string", fnFilePath,
|
||||
)
|
||||
if err != nil {
|
||||
t.Fatal(err)
|
||||
}
|
||||
_, err = db.Exec(
|
||||
`INSERT INTO apps (id,lang,dir_path,uses_functions) VALUES (?,?,?,?)`,
|
||||
"myapp_go_infra", "go", "apps/myapp", `["gen_wails_crud_go_infra"]`,
|
||||
)
|
||||
if err != nil {
|
||||
t.Fatal(err)
|
||||
}
|
||||
db.Close()
|
||||
|
||||
if writeFnFile {
|
||||
fnAbsDir := filepath.Join(root, filepath.Dir(fnFilePath))
|
||||
if err := os.MkdirAll(fnAbsDir, 0755); err != nil {
|
||||
t.Fatal(err)
|
||||
}
|
||||
src := "package infra\n\ntype WailsCRUDSpec struct{}\n\nfunc GenerateWailsCRUD(spec WailsCRUDSpec) string { return \"\" }\n"
|
||||
if err := os.WriteFile(filepath.Join(root, fnFilePath), []byte(src), 0644); err != nil {
|
||||
t.Fatal(err)
|
||||
}
|
||||
}
|
||||
|
||||
appDir := filepath.Join(root, "apps", "myapp")
|
||||
if err := os.MkdirAll(appDir, 0755); err != nil {
|
||||
t.Fatal(err)
|
||||
}
|
||||
appSrc := "package main\n\nimport (\n\t\"fmt\"\n\t\"fn-registry/functions/infra\"\n)\n\nfunc main() {\n\tfmt.Println(infra.GenerateWailsCRUD(infra.WailsCRUDSpec{}))\n}\n"
|
||||
if err := os.WriteFile(filepath.Join(appDir, "main.go"), []byte(appSrc), 0644); err != nil {
|
||||
t.Fatal(err)
|
||||
}
|
||||
|
||||
results, err := AuditUsesFunctions(root)
|
||||
if err != nil {
|
||||
t.Fatalf("AuditUsesFunctions: %v", err)
|
||||
}
|
||||
if len(results) != 1 {
|
||||
t.Fatalf("expected 1 result, got %d", len(results))
|
||||
}
|
||||
return results[0]
|
||||
}
|
||||
|
||||
// TestAuditUsesFunctions_GoFileFallback verifies the .go fallback (issue 0057,
|
||||
// Fase 2): when neither the registry signature nor PascalCase(name) yields the
|
||||
// real exported symbol, the auditor reads the function's .go file to recover it,
|
||||
// so a genuinely-used function is not a false "unused". The error sub-case (file
|
||||
// absent) shows the fallback degrades gracefully and the function is then
|
||||
// correctly reported unused — proving the fallback is load-bearing.
|
||||
func TestAuditUsesFunctions_GoFileFallback(t *testing.T) {
|
||||
t.Run("golden: .go fallback recovers real symbol -> not unused", func(t *testing.T) {
|
||||
got := goFallbackEnv(t, "functions/infra/gen_wails_crud.go", true)
|
||||
if len(got.Unused) != 0 {
|
||||
t.Errorf("Unused = %v, want [] (fallback should find GenerateWailsCRUD)", got.Unused)
|
||||
}
|
||||
if len(got.Missing) != 0 {
|
||||
t.Errorf("Missing = %v, want []", got.Missing)
|
||||
}
|
||||
})
|
||||
|
||||
t.Run("error: missing .go file -> flagged unused, no crash", func(t *testing.T) {
|
||||
got := goFallbackEnv(t, "functions/infra/gen_wails_crud.go", false)
|
||||
if len(got.Unused) != 1 || got.Unused[0] != "gen_wails_crud_go_infra" {
|
||||
t.Errorf("Unused = %v, want [gen_wails_crud_go_infra] (no fallback file to read)", got.Unused)
|
||||
}
|
||||
})
|
||||
}
|
||||
|
||||
// TestAuditUsesFunctions_SharedGoFileNotMisattributed pins the regression caught
|
||||
// during issue 0057 verification: several registry functions can share one .go
|
||||
// file (the "TU adicional" pattern, e.g. cdp_new_tab living in cdp_list_tabs.go).
|
||||
// Because they have valid signatures, the .go fallback must stay GATED OFF for
|
||||
// them — otherwise the file's first exported func (here ListTabs) would be
|
||||
// mis-attributed to a sibling (NewTab) and suppress a genuine "unused" finding.
|
||||
// The app below uses only ListTabs; NewTab must remain flagged unused.
|
||||
func TestAuditUsesFunctions_SharedGoFileNotMisattributed(t *testing.T) {
|
||||
root := t.TempDir()
|
||||
dbPath := filepath.Join(root, "registry.db")
|
||||
db, err := sql.Open("sqlite3", dbPath)
|
||||
if err != nil {
|
||||
t.Fatal(err)
|
||||
}
|
||||
_, err = db.Exec(`
|
||||
CREATE TABLE functions (id TEXT PRIMARY KEY, name TEXT, domain TEXT, lang TEXT, signature TEXT, file_path TEXT);
|
||||
CREATE TABLE apps (id TEXT PRIMARY KEY, lang TEXT, dir_path TEXT, uses_functions TEXT DEFAULT '[]');
|
||||
INSERT INTO functions (id,name,domain,lang,signature,file_path) VALUES
|
||||
('list_tabs_go_browser','list_tabs','browser','go','func ListTabs() error','functions/browser/tabs.go'),
|
||||
('new_tab_go_browser','new_tab','browser','go','func NewTab() error','functions/browser/tabs.go');
|
||||
INSERT INTO apps (id,lang,dir_path,uses_functions) VALUES
|
||||
('tabsapp_go_browser','go','apps/tabsapp','["list_tabs_go_browser","new_tab_go_browser"]');
|
||||
`)
|
||||
if err != nil {
|
||||
t.Fatal(err)
|
||||
}
|
||||
db.Close()
|
||||
|
||||
// Shared registry .go file: ListTabs is the FIRST exported func.
|
||||
fnDir := filepath.Join(root, "functions", "browser")
|
||||
if err := os.MkdirAll(fnDir, 0755); err != nil {
|
||||
t.Fatal(err)
|
||||
}
|
||||
tabsSrc := "package browser\n\nfunc ListTabs() error { return nil }\n\nfunc NewTab() error { return nil }\n"
|
||||
if err := os.WriteFile(filepath.Join(fnDir, "tabs.go"), []byte(tabsSrc), 0644); err != nil {
|
||||
t.Fatal(err)
|
||||
}
|
||||
|
||||
// App calls only ListTabs, but declares both.
|
||||
appDir := filepath.Join(root, "apps", "tabsapp")
|
||||
if err := os.MkdirAll(appDir, 0755); err != nil {
|
||||
t.Fatal(err)
|
||||
}
|
||||
appSrc := "package main\n\nimport (\n\t\"fmt\"\n\t\"fn-registry/functions/browser\"\n)\n\nfunc main() {\n\tfmt.Println(browser.ListTabs())\n}\n"
|
||||
if err := os.WriteFile(filepath.Join(appDir, "main.go"), []byte(appSrc), 0644); err != nil {
|
||||
t.Fatal(err)
|
||||
}
|
||||
|
||||
results, err := AuditUsesFunctions(root)
|
||||
if err != nil {
|
||||
t.Fatalf("AuditUsesFunctions: %v", err)
|
||||
}
|
||||
if len(results) != 1 {
|
||||
t.Fatalf("expected 1 result, got %d", len(results))
|
||||
}
|
||||
got := results[0]
|
||||
if len(got.Unused) != 1 || got.Unused[0] != "new_tab_go_browser" {
|
||||
t.Errorf("Unused = %v, want [new_tab_go_browser] (sibling must NOT rescue via shared file)", got.Unused)
|
||||
}
|
||||
}
|
||||
|
||||
// TestGoRealExportedName verifies the .go symbol extractor: top-level exported
|
||||
// funcs (plain and generic) are recovered, method receivers are skipped, the
|
||||
// result is cached, and unreadable/empty paths return "" without error.
|
||||
func TestGoRealExportedName(t *testing.T) {
|
||||
root := t.TempDir()
|
||||
if err := os.MkdirAll(filepath.Join(root, "functions", "infra"), 0755); err != nil {
|
||||
t.Fatal(err)
|
||||
}
|
||||
// File whose first exported func is preceded by an unexported func + a method.
|
||||
src := "package infra\n\n" +
|
||||
"import \"fmt\"\n\n" +
|
||||
"func helper() {}\n\n" +
|
||||
"type T struct{}\n\n" +
|
||||
"func (t *T) Save() {}\n\n" +
|
||||
"func GenerateWailsCRUD(spec int) string { fmt.Println(spec); return \"\" }\n\n" +
|
||||
"func WailsStreamData[X any](xs []X) {}\n"
|
||||
rel := "functions/infra/sample.go"
|
||||
if err := os.WriteFile(filepath.Join(root, rel), []byte(src), 0644); err != nil {
|
||||
t.Fatal(err)
|
||||
}
|
||||
cache := map[string]string{}
|
||||
|
||||
t.Run("golden: first top-level exported func (skips helper + method)", func(t *testing.T) {
|
||||
if got := goRealExportedName(root, rel, cache); got != "GenerateWailsCRUD" {
|
||||
t.Errorf("got %q, want GenerateWailsCRUD", got)
|
||||
}
|
||||
if cache[rel] != "GenerateWailsCRUD" {
|
||||
t.Errorf("cache[%q] = %q, want GenerateWailsCRUD", rel, cache[rel])
|
||||
}
|
||||
})
|
||||
|
||||
t.Run("edge: generic func form func Name[T any](", func(t *testing.T) {
|
||||
genRel := "functions/infra/gen.go"
|
||||
genSrc := "package infra\n\nfunc WailsStreamData[X any](xs []X) {}\n"
|
||||
if err := os.WriteFile(filepath.Join(root, genRel), []byte(genSrc), 0644); err != nil {
|
||||
t.Fatal(err)
|
||||
}
|
||||
if got := goRealExportedName(root, genRel, cache); got != "WailsStreamData" {
|
||||
t.Errorf("got %q, want WailsStreamData", got)
|
||||
}
|
||||
})
|
||||
|
||||
t.Run("error: missing file -> empty string, cached", func(t *testing.T) {
|
||||
missRel := "functions/infra/does_not_exist.go"
|
||||
if got := goRealExportedName(root, missRel, cache); got != "" {
|
||||
t.Errorf("got %q, want empty for missing file", got)
|
||||
}
|
||||
if v, ok := cache[missRel]; !ok || v != "" {
|
||||
t.Errorf("missing file should be cached as empty, got ok=%v v=%q", ok, v)
|
||||
}
|
||||
})
|
||||
|
||||
t.Run("error: empty file_path -> empty string", func(t *testing.T) {
|
||||
if got := goRealExportedName(root, "", cache); got != "" {
|
||||
t.Errorf("got %q, want empty for empty path", got)
|
||||
}
|
||||
})
|
||||
}
|
||||
|
||||
// TestAuditUsesFunctions_MissingDir verifies that apps whose dir_path does not
|
||||
// exist on disk get an entry with nil Missing/Unused slices (cannot inspect).
|
||||
func TestAuditUsesFunctions_MissingDir(t *testing.T) {
|
||||
|
||||
@@ -53,12 +53,8 @@ from .fdr_correction import fdr_correction
|
||||
from .suggest_reexpression import suggest_reexpression
|
||||
from .exploratory_caveats import exploratory_caveats
|
||||
from .render_eda_pdf import render_eda_pdf, render_eda_pdf_relational
|
||||
from .render_automatic_eda_pdf import render_automatic_eda_pdf
|
||||
from .render_automatic_eda_pptx import render_automatic_eda_pptx
|
||||
|
||||
__all__ = [
|
||||
"render_automatic_eda_pdf",
|
||||
"render_automatic_eda_pptx",
|
||||
"decode_qr_image",
|
||||
"adf_kpss_stationarity",
|
||||
"acf_pacf",
|
||||
|
||||
@@ -1,57 +0,0 @@
|
||||
"""AutomaticEDA — chapter-based, versioned EDA document with PDF + PPTX output.
|
||||
|
||||
Public surface (support package for the registry functions
|
||||
``render_automatic_eda_pdf`` and ``render_automatic_eda_pptx``):
|
||||
|
||||
- Document model: ``Heading``, ``Markdown``, ``KVTable``, ``DataTable``,
|
||||
``Figure``, ``Image``, ``Caption``, ``Note``, ``Chapter``; normalizers
|
||||
``as_blocks`` / ``as_chapters``; ``ENGINE_VERSION`` / ``ENGINE_NAME``.
|
||||
- ``build_document(profile, ctx)`` — assemble the ordered chapters of a profile.
|
||||
- ``render_pdf(chapters, out_path, meta)`` / ``render_pptx(...)`` — the two
|
||||
renderers (used by the public registry functions).
|
||||
- ``merge_manifest(...)`` — write/update the per-chapter version manifest.
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
from .model import ( # noqa: F401
|
||||
ENGINE_NAME,
|
||||
ENGINE_VERSION,
|
||||
Caption,
|
||||
Chapter,
|
||||
DataTable,
|
||||
Figure,
|
||||
Heading,
|
||||
Image,
|
||||
KVTable,
|
||||
Markdown,
|
||||
Note,
|
||||
as_blocks,
|
||||
as_chapters,
|
||||
merge_manifest,
|
||||
)
|
||||
from .chapters_registry import CHAPTER_ORDER, build_chapter, build_document # noqa: F401
|
||||
from .render_pdf_impl import render_pdf # noqa: F401
|
||||
from .render_pptx_impl import render_pptx # noqa: F401
|
||||
|
||||
__all__ = [
|
||||
"ENGINE_NAME",
|
||||
"ENGINE_VERSION",
|
||||
"Heading",
|
||||
"Markdown",
|
||||
"KVTable",
|
||||
"DataTable",
|
||||
"Figure",
|
||||
"Image",
|
||||
"Caption",
|
||||
"Note",
|
||||
"Chapter",
|
||||
"as_blocks",
|
||||
"as_chapters",
|
||||
"merge_manifest",
|
||||
"CHAPTER_ORDER",
|
||||
"build_chapter",
|
||||
"build_document",
|
||||
"render_pdf",
|
||||
"render_pptx",
|
||||
]
|
||||
@@ -1,7 +0,0 @@
|
||||
"""AutomaticEDA chapters.
|
||||
|
||||
Each chapter is a module ``<id>.py`` exposing ``build_<id>(profile, ctx) ->
|
||||
Chapter | None`` and a ``CHAPTER_VERSION`` constant. The canonical document
|
||||
order lives in :mod:`automatic_eda.chapters_registry`. Implemented today:
|
||||
``portada`` and ``overview`` (the reference chapters other agents copy).
|
||||
"""
|
||||
@@ -1,402 +0,0 @@
|
||||
"""Categorical distributions chapter (CAT DISTR).
|
||||
|
||||
Third reference chapter for AutomaticEDA. For every categorical column it shows,
|
||||
fulfilling the user's request:
|
||||
|
||||
1. A short opening explanation of **Shannon entropy** (what it measures, its 0
|
||||
and log2(k) bounds, the normalized 0–1 version) and the dataset row total used
|
||||
as a comparison baseline.
|
||||
2. Per column, a cardinality key/value table: distinct values, ``% distinct``
|
||||
(distinct / total rows), total dataset rows, singleton values (frequency 1),
|
||||
entropy with its theoretical maximum and the normalized ratio, mode, imbalance
|
||||
and string-length stats.
|
||||
3. A short note flagging problematic cardinality (id-like ≈100% distinct, or a
|
||||
single dominating category).
|
||||
4. A ``top-k`` table (value / count / %).
|
||||
5. A **donut pie chart** of the most common categories (top-k + an "Otros"
|
||||
bucket), drawn lazily so the renderers scale it to fit entirely.
|
||||
|
||||
Data comes from the ``eda`` group: each ``columns[i]['categorical']`` is the
|
||||
output of ``summarize_categorical`` (``top[{value,count,pct}]``, ``mode``,
|
||||
``n_distinct``, ``entropy``, ``imbalance``, ``len_min/mean/max``). The derived
|
||||
cardinality metrics and the pie figure are delegated to two registry functions
|
||||
(``categorical_cardinality_block`` and ``categorical_top_pie_figure``); both are
|
||||
imported lazily and degrade to a minimal inline fallback so this chapter never
|
||||
raises even if they are unavailable.
|
||||
|
||||
Contract: build_<id>(profile, ctx) -> Chapter | None ; CHAPTER_VERSION = "x.y.z".
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import math
|
||||
|
||||
from .. import model
|
||||
|
||||
CHAPTER_VERSION = "1.0.0"
|
||||
CHAPTER_ID = "cat_distr"
|
||||
CHAPTER_TITLE = "Distribuciones categóricas"
|
||||
|
||||
# Cap the number of categorical columns rendered to keep the document bounded;
|
||||
# the rest are summarized in a closing note (no silent truncation).
|
||||
MAX_COLS = 40
|
||||
# Rows shown in each top-k table and explicit slices in the pie.
|
||||
TOP_TABLE_ROWS = 15
|
||||
PIE_TOP_K = 6
|
||||
# Truncate very long category labels in tables (the renderer also wraps).
|
||||
LABEL_MAX = 48
|
||||
|
||||
|
||||
def _fmt_int(value) -> str:
|
||||
if value is None:
|
||||
return "—"
|
||||
try:
|
||||
return f"{int(value):,}".replace(",", ".")
|
||||
except (TypeError, ValueError):
|
||||
return str(value)
|
||||
|
||||
|
||||
def _fmt_num(value, decimals: int = 3) -> str:
|
||||
if value is None:
|
||||
return "—"
|
||||
if isinstance(value, bool):
|
||||
return str(value)
|
||||
if isinstance(value, int):
|
||||
return f"{value:,}".replace(",", ".")
|
||||
if isinstance(value, float):
|
||||
if value != value: # NaN
|
||||
return "NaN"
|
||||
if value in (float("inf"), float("-inf")):
|
||||
return str(value)
|
||||
text = f"{value:.{decimals}f}".rstrip("0").rstrip(".")
|
||||
return text if text else "0"
|
||||
return str(value)
|
||||
|
||||
|
||||
def _fmt_pct_value(value, decimals: int = 1) -> str:
|
||||
"""Format an already-in-percent value (0–100). None -> placeholder."""
|
||||
if value is None:
|
||||
return "—"
|
||||
try:
|
||||
return f"{float(value):.{decimals}f}%"
|
||||
except (TypeError, ValueError):
|
||||
return str(value)
|
||||
|
||||
|
||||
def _pct_from_maybe_fraction(value, decimals: int = 1) -> str:
|
||||
"""Format a percentage that may arrive as a 0–1 fraction or a 0–100 number."""
|
||||
if value is None:
|
||||
return "—"
|
||||
try:
|
||||
v = float(value)
|
||||
except (TypeError, ValueError):
|
||||
return str(value)
|
||||
if v <= 1.0:
|
||||
v *= 100.0
|
||||
return f"{v:.{decimals}f}%"
|
||||
|
||||
|
||||
def _truncate(text: str, limit: int = LABEL_MAX) -> str:
    s = model._safe_str(text)
    if len(s) <= limit:
        return s
    return s[: max(1, limit - 1)].rstrip() + "…"


def _is_categorical(col: dict) -> bool:
    """A column is treated as categorical when it carries a non-empty top list
    and is not a pure numeric column (numeric columns may still expose a top)."""
    if not isinstance(col, dict):
        return False
    cat = col.get("categorical")
    if not (isinstance(cat, dict) and cat.get("top")):
        return False
    if col.get("inferred_type") == "numeric":
        return False
    return True


def _cardinality(cat: dict, n_rows) -> dict:
    """Derive cardinality metrics for a column, via the registry function when
    available, otherwise a minimal inline fallback. Never raises."""
    try:
        from datascience.categorical_cardinality_block import (
            categorical_cardinality_block,
        )

        out = categorical_cardinality_block(cat=cat, n_rows=n_rows)
        if isinstance(out, dict):
            return out
    except Exception:  # noqa: BLE001 — fall back to the inline derivation.
        pass
    return _fallback_cardinality(cat, n_rows)


def _fallback_cardinality(cat: dict, n_rows) -> dict:
    cat = cat or {}
    top = cat.get("top") or []
    n_distinct = cat.get("n_distinct")
    entropy = cat.get("entropy")
    try:
        nr = int(n_rows) if n_rows is not None else None
    except (TypeError, ValueError):
        nr = None
    pct_distinct = None
    if isinstance(n_distinct, (int, float)) and nr:
        pct_distinct = float(n_distinct) / nr * 100.0
    entropy_max = None
    if isinstance(n_distinct, (int, float)):
        entropy_max = math.log2(n_distinct) if n_distinct > 1 else 0.0
    entropy_norm = None
    if isinstance(entropy, (int, float)) and entropy_max:
        entropy_norm = max(0.0, min(1.0, float(entropy) / entropy_max))
    mode_pct = cat.get("mode_pct")
    if mode_pct is None and top and isinstance(top[0], dict):
        mode_pct = top[0].get("pct")
    # Normalize to a 0–100 scale: summarize_categorical emits a 0–1 fraction.
    if isinstance(mode_pct, (int, float)) and not isinstance(mode_pct, bool):
        mode_pct = float(mode_pct) * 100.0 if mode_pct <= 1.0 else float(mode_pct)
    else:
        mode_pct = None
    n_singletons = None
    if top:
        n_singletons = sum(
            1 for t in top if isinstance(t, dict) and t.get("count") == 1)
    return {
        "n_distinct": n_distinct,
        "n_rows": nr,
        "pct_distinct": pct_distinct,
        "entropy": entropy,
        "entropy_max": entropy_max,
        "entropy_norm": entropy_norm,
        "mode": cat.get("mode"),
        "mode_pct": mode_pct,
        "imbalance": cat.get("imbalance"),
        "n_singletons": n_singletons,
        "n_singletons_partial": (
            isinstance(n_distinct, (int, float)) and n_distinct > len(top)),
        "len_min": cat.get("len_min"),
        "len_mean": cat.get("len_mean"),
        "len_max": cat.get("len_max"),
        "id_like": pct_distinct is not None and pct_distinct >= 99.0,
        "dominated": mode_pct is not None and mode_pct >= 90.0,
    }


def _pie_make(top, n_distinct, title, n_rows):
    """Return a zero-arg callable that builds the donut figure lazily."""

    def make():
        try:
            from datascience.categorical_top_pie_figure import (
                categorical_top_pie_figure,
            )

            return categorical_top_pie_figure(
                top=top, n_distinct=n_distinct or 0, title=title,
                top_k=PIE_TOP_K, n_rows=n_rows)
        except Exception:  # noqa: BLE001 — minimal local fallback figure.
            return _fallback_pie(top, title)

    return make


def _fallback_pie(top, title):
    """Minimal donut figure used only if the registry function is unavailable."""
    import matplotlib

    matplotlib.use("Agg")
    from matplotlib.figure import Figure

    fig = Figure(figsize=(5.0, 3.2))
    ax = fig.add_subplot(111)
    items = [t for t in (top or [])
             if isinstance(t, dict) and isinstance(t.get("count"), (int, float))]
    items = sorted(items, key=lambda t: t.get("count") or 0, reverse=True)
    head = items[:PIE_TOP_K]
    rest = items[PIE_TOP_K:]
    labels = [_truncate(t.get("value"), 20) for t in head]
    sizes = [float(t.get("count") or 0) for t in head]
    if rest:
        labels.append(f"Otros ({len(rest)})")
        sizes.append(sum(float(t.get("count") or 0) for t in rest))
    if not sizes or sum(sizes) <= 0:
        ax.text(0.5, 0.5, "sin datos categóricos", ha="center", va="center")
        ax.axis("off")
        return fig
    ax.pie(sizes, labels=None, wedgeprops={"width": 0.42},
           autopct=lambda p: f"{p:.0f}%" if p >= 4 else "")
    ax.legend(labels, loc="center left", bbox_to_anchor=(1.0, 0.5),
              fontsize=7, frameon=False)
    ax.set_title(_truncate(title, 40))
    fig.tight_layout()
    return fig


def _normalize_card(card: dict) -> dict:
    """Make the cardinality dict robust regardless of the upstream scale.

    ``summarize_categorical`` emits ``mode_pct`` as a 0–1 fraction; bring it to a
    0–100 scale and recompute the ``dominated`` flag here so the chapter is
    correct whether it consumed the registry function or the inline fallback.
    """
    card = dict(card or {})
    mp = card.get("mode_pct")
    if isinstance(mp, (int, float)) and not isinstance(mp, bool):
        mp = float(mp) * 100.0 if mp <= 1.0 else float(mp)
    else:
        mp = None
    card["mode_pct"] = mp
    card["dominated"] = mp is not None and mp >= 90.0
    pd = card.get("pct_distinct")
    card["id_like"] = isinstance(pd, (int, float)) and pd >= 99.0
    return card


def _cardinality_block(card: dict):
    """KVTable with the cardinality / entropy metrics for one column."""
    n_singletons = card.get("n_singletons")
    if n_singletons is not None and card.get("n_singletons_partial"):
        singletons = f"≥{_fmt_int(n_singletons)} (en top mostrado)"
    elif n_singletons is not None:
        singletons = _fmt_int(n_singletons)
    else:
        singletons = "—"

    entropy_ref = _fmt_num(card.get("entropy"))
    emax = card.get("entropy_max")
    if emax is not None:
        entropy_ref = f"{entropy_ref} (máx {_fmt_num(emax)})"

    mode = card.get("mode")
    mode_pct = card.get("mode_pct")
    mode_str = "—" if mode is None else model._safe_str(mode)
    if mode is not None and mode_pct is not None:
        mode_str = f"{mode_str} ({_fmt_pct_value(mode_pct)})"

    rows = [
        ("Valores distintos", _fmt_int(card.get("n_distinct"))),
        ("% distintos", _fmt_pct_value(card.get("pct_distinct"))),
        ("Total filas (dataset)", _fmt_int(card.get("n_rows"))),
        ("Valores únicos (frecuencia 1)", singletons),
        ("Entropía (bits)", entropy_ref),
        ("Entropía normalizada (0–1)", _fmt_num(card.get("entropy_norm"))),
        ("Moda", mode_str),
    ]
    imbalance = card.get("imbalance")
    if imbalance is not None:
        rows.append(("Desbalance", _fmt_num(imbalance)))
    lm = card.get("len_min")
    lmean = card.get("len_mean")
    lmax = card.get("len_max")
    if any(v is not None for v in (lm, lmean, lmax)):
        rows.append((
            "Longitud (mín/media/máx)",
            f"{_fmt_num(lm)} / {_fmt_num(lmean)} / {_fmt_num(lmax)}"))
    return model.KVTable(rows=rows, title="Cardinalidad")


def _flag_note(card: dict):
    """Return a Note flagging problematic cardinality, or None."""
    if card.get("id_like"):
        return model.Note(
            "Casi todos los valores son distintos (≈100% distintos): la columna "
            "se comporta como un identificador y aporta poco para agrupar o "
            "comparar categorías.")
    if card.get("dominated"):
        mp = card.get("mode_pct")
        mp_str = _fmt_pct_value(mp) if mp is not None else "muy alta"
        return model.Note(
            f"Una sola categoría domina la columna (moda {mp_str}): la "
            "distribución está muy desbalanceada.")
    return None


def _topk_table(cat: dict):
    """DataTable value / count / % for the top categories."""
    top = cat.get("top") or []
    n_distinct = cat.get("n_distinct")
    header = ["Valor", "Conteo", "%"]
    rows = []
    for t in top[:TOP_TABLE_ROWS]:
        if not isinstance(t, dict):
            continue
        rows.append([
            model._safe_str(t.get("value")),
            _fmt_int(t.get("count")),
            _pct_from_maybe_fraction(t.get("pct")),
        ])
    if not rows:
        return None
    shown = len(rows)
    if isinstance(n_distinct, (int, float)) and n_distinct > shown:
        note = f"top {shown} de {_fmt_int(n_distinct)} categorías distintas"
    else:
        note = f"{shown} categorías"
    return model.DataTable(header=header, rows=rows, title="Top categorías",
                           note=note)


def _intro_blocks(n_rows):
    total = _fmt_int(n_rows)
    text = (
        "La **entropía de Shannon** mide cómo de repartidos están los valores de "
        "una columna categórica, en bits. Vale 0 cuando una sola categoría "
        "concentra todas las filas (máxima previsibilidad) y alcanza su máximo, "
        "log2(k) para k categorías distintas, cuando todas aparecen por igual "
        "(máxima diversidad). La **entropía normalizada** (entropía dividida por "
        "su máximo) la lleva al rango 0–1 para comparar columnas con distinto "
        "número de categorías. Para cada columna se muestran los valores "
        "distintos, el porcentaje que representan sobre el total de filas, los "
        "valores únicos (que aparecen una sola vez), la tabla de las categorías "
        "más frecuentes y un gráfico de tarta (donut) de las más comunes."
    )
    if n_rows is not None:
        text += f" El dataset tiene {total} filas en total como referencia."
    return [
        model.Heading(text="Entropía y cardinalidad", level=2),
        model.Markdown(text=text),
    ]


def build_cat_distr(profile: dict, ctx: dict):
    """Build the categorical-distributions Chapter, or None if the dataset has
    no categorical columns."""
    profile = profile or {}
    ctx = ctx or {}
    cols = profile.get("columns") or []
    cat_cols = [c for c in cols if _is_categorical(c)]
    if not cat_cols:
        return None

    n_rows = profile.get("n_rows")
    blocks = list(_intro_blocks(n_rows))

    rendered = cat_cols[:MAX_COLS]
    for col in rendered:
        name = col.get("name") or "(columna)"
        cat = col.get("categorical") or {}
        card = _normalize_card(_cardinality(cat, n_rows))

        blocks.append(model.Heading(text=str(name), level=2))
        blocks.append(_cardinality_block(card))
        note = _flag_note(card)
        if note is not None:
            blocks.append(note)
        topk = _topk_table(cat)
        if topk is not None:
            blocks.append(topk)
        blocks.append(model.Figure(
            make=_pie_make(cat.get("top") or [], card.get("n_distinct"),
                           str(name), n_rows),
            caption=(f"Categorías más comunes de «{_truncate(name, 32)}» "
                     "(donut: top-k + «Otros»)")))

    if len(cat_cols) > len(rendered):
        omitted = len(cat_cols) - len(rendered)
        blocks.append(model.Note(
            f"Se muestran las primeras {len(rendered)} columnas categóricas; "
            f"quedan {omitted} sin mostrar para mantener acotado el informe."))

    return model.Chapter(id=CHAPTER_ID, title=CHAPTER_TITLE,
                         version=CHAPTER_VERSION, blocks=blocks)
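
Reviewer note: the chapter above reads `entropy` from the profile and derives `entropy_max` and `entropy_norm` from `n_distinct`. As a rough illustration of what those three numbers mean, here is a minimal sketch that computes them directly from category counts. The function name `entropy_metrics` and the idea of starting from raw counts are assumptions made for this example; they are not part of the chapter module.

```python
import math


def entropy_metrics(counts):
    """Shannon entropy in bits, its maximum log2(k), and their ratio.

    Hypothetical helper: `counts` is a list of per-category row counts.
    """
    total = sum(counts)
    k = sum(1 for c in counts if c > 0)  # number of categories actually present
    if total <= 0 or k == 0:
        return {"entropy": None, "entropy_max": None, "entropy_norm": None}
    entropy = -sum((c / total) * math.log2(c / total) for c in counts if c > 0)
    # With a single category the maximum is 0 bits, so the ratio is undefined.
    entropy_max = math.log2(k) if k > 1 else 0.0
    entropy_norm = entropy / entropy_max if entropy_max else None
    return {"entropy": entropy, "entropy_max": entropy_max,
            "entropy_norm": entropy_norm}
```

Four equally frequent categories give 2 bits and a normalized value of 1; a single category gives 0 bits and no normalized value, which matches the `entropy_max` guard in `_fallback_cardinality`.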
@@ -1,186 +0,0 @@
"""Tests for the CAT DISTR chapter — DoD: golden + edges + anti-cut.

Self-contained: builds synthetic TableProfiles (no DuckDB) so the suite is fast
and deterministic. Verifies that ``build_cat_distr`` emits the blocks the user
asked for (entropy intro, distinct/total/%-distinct/unique metrics, top-k table
and a donut figure), that the chapter renders inside the full document to both
PDF and PPTX showing that content, that a profile with no categorical columns
yields ``None`` without raising, and that long labels / many columns are never
cut in either output.
"""

import os
import re
import tempfile

from pypdf import PdfReader
from pptx import Presentation

from datascience.automatic_eda.model import (
    DataTable, Figure, Heading, KVTable, Note,
)
from datascience.automatic_eda.chapters.cat_distr import (
    CHAPTER_ID, CHAPTER_VERSION, build_cat_distr,
)
from datascience.render_automatic_eda_pdf import render_automatic_eda_pdf
from datascience.render_automatic_eda_pptx import render_automatic_eda_pptx


def _profile() -> dict:
    return {
        "table": "productos",
        "source": "/data/productos.csv",
        "profiled_at": "2026-06-30T10:00:00+00:00",
        "n_rows": 1000,
        "n_cols": 3,
        "quality_score": 90.0,
        "columns": [
            {"name": "precio", "inferred_type": "numeric", "null_pct": 0.0,
             "null_count": 0,
             "numeric": {"mean": 42.5, "median": 40.0, "min": 1.0,
                         "max": 100.0, "std": 12.3}},
            {"name": "categoria", "inferred_type": "categorical",
             "null_pct": 0.0, "null_count": 0, "distinct_count": 8,
             "categorical": {
                 "top": [
                     {"value": "neumaticos", "count": 500, "pct": 0.5},
                     {"value": "aceite", "count": 300, "pct": 0.3},
                     {"value": "filtros", "count": 120, "pct": 0.12},
                     {"value": "frenos", "count": 80, "pct": 0.08},
                 ],
                 "mode": "neumaticos", "n_distinct": 8, "entropy": 1.6,
                 "imbalance": 6.25, "len_min": 6, "len_mean": 7.5,
                 "len_max": 10}},
            {"name": "uuid", "inferred_type": "categorical",
             "null_pct": 0.0, "null_count": 0, "distinct_count": 1000,
             "categorical": {
                 "top": [{"value": f"id-{i}", "count": 1} for i in range(5)],
                 "mode": "id-0", "n_distinct": 1000, "entropy": 9.97,
                 "imbalance": 1.0}},
        ],
    }


def _pdf_text(path: str) -> str:
    txt = "".join((pg.extract_text() or "") for pg in PdfReader(path).pages)
    return re.sub(r"\s+", " ", txt)


def _pptx_text(path: str) -> str:
    prs = Presentation(path)
    parts = []
    for sl in prs.slides:
        for sh in sl.shapes:
            if sh.has_text_frame:
                parts.append(sh.text_frame.text)
            if sh.has_table:
                tb = sh.table
                for r in range(len(tb.rows)):
                    for c in range(len(tb.columns)):
                        parts.append(tb.cell(r, c).text)
    return re.sub(r"\s+", " ", " ".join(parts))


def _kinds(chapter):
    return [b.kind for b in chapter.blocks]


def test_golden_build_cat_distr_emite_bloques_pedidos():
    ch = build_cat_distr(_profile(), {})
    assert ch is not None
    assert ch.id == CHAPTER_ID
    assert ch.version == CHAPTER_VERSION
    kinds = _kinds(ch)
    # Entropy intro present.
    headings = [b.text for b in ch.blocks if isinstance(b, Heading)]
    assert any("Entrop" in h for h in headings)
    md = next(b for b in ch.blocks if b.kind == "markdown")
    assert "entropía" in md.text.lower() and "log2" in md.text
    # Cardinality metrics: distinct, total rows, %-distinct, unique values.
    kv = next(b for b in ch.blocks if isinstance(b, KVTable))
    labels = [r[0] for r in kv.rows]
    assert "Valores distintos" in labels
    assert "% distintos" in labels
    assert "Total filas (dataset)" in labels
    assert "Valores únicos (frecuencia 1)" in labels
    assert any("Entropía" in lbl for lbl in labels)
    # Top-k table + pie figure.
    dt = next(b for b in ch.blocks if isinstance(b, DataTable))
    assert dt.header == ["Valor", "Conteo", "%"]
    assert any("neumaticos" in str(cell) for row in dt.rows for cell in row)
    assert any(isinstance(b, Figure) for b in ch.blocks)
    # id-like column flagged with a Note.
    assert any(isinstance(b, Note) and "identificador" in b.text
               for b in ch.blocks)


def test_golden_render_pdf_muestra_categoricas():
    with tempfile.TemporaryDirectory() as d:
        out = os.path.join(d, "eda.pdf")
        res = render_automatic_eda_pdf(_profile(), out, {"title": "EDA"})
        assert res["path"] == out and os.path.exists(out)
        assert CHAPTER_ID in [c["id"] for c in res["chapters"]]
        txt = _pdf_text(out)
        assert "Entrop" in txt
        assert "distintos" in txt
        assert "categoria" in txt and "neumaticos" in txt
        assert "donut" in txt  # figure caption rendered as text.
        assert "identificador" in txt  # id-like note rendered.


def test_golden_render_pptx_muestra_categoricas():
    with tempfile.TemporaryDirectory() as d:
        out = os.path.join(d, "eda.pptx")
        res = render_automatic_eda_pptx(_profile(), out, {"title": "EDA"})
        assert res["path"] == out and os.path.exists(out)
        assert CHAPTER_ID in [c["id"] for c in res["chapters"]]
        txt = _pptx_text(out)
        assert "Entrop" in txt
        assert "categoria" in txt and "neumaticos" in txt
        assert "distintos" in txt


def test_edge_sin_categoricas_devuelve_none():
    only_numeric = {
        "n_rows": 10, "columns": [
            {"name": "x", "inferred_type": "numeric",
             "numeric": {"mean": 1.0}}]}
    assert build_cat_distr(only_numeric, {}) is None
    # None / empty / no-columns never raise and yield None.
    assert build_cat_distr(None, None) is None
    assert build_cat_distr({}, {}) is None
    assert build_cat_distr({"columns": []}, {}) is None


def test_anti_corte_label_largo_y_muchas_columnas():
    long_label = ("Lorem ipsum dolor sit amet consectetur adipiscing elit sed "
                  "do eiusmod tempor incididunt ut labore reprehenderit voluptate")
    cols = []
    for i in range(30):
        cols.append({
            "name": f"cat_{i}", "inferred_type": "categorical",
            "distinct_count": 3,
            "categorical": {
                "top": [{"value": long_label, "count": 60},
                        {"value": "b", "count": 30},
                        {"value": "c", "count": 10}],
                "mode": long_label, "n_distinct": 3, "entropy": 1.2}})
    profile = {"table": "t", "source": "t.csv", "n_rows": 100,
               "n_cols": len(cols), "columns": cols}

    ch = build_cat_distr(profile, {})
    assert ch is not None
    with tempfile.TemporaryDirectory() as d:
        pdf = os.path.join(d, "anti.pdf")
        res = render_automatic_eda_pdf(profile, pdf, {"write_manifest": False})
        assert res["path"] == pdf
        assert res["n_pages"] > 1  # many columns spilled across pages, OK.
        txt = _pdf_text(pdf)
        # Long label wrapped (not truncated): every word survives.
        for word in ("Lorem", "incididunt", "reprehenderit", "voluptate"):
            assert word in txt
        # PPTX path must not raise either.
        pptx = os.path.join(d, "anti.pptx")
        res2 = render_automatic_eda_pptx(profile, pptx,
                                         {"write_manifest": False})
        assert res2["path"] == pptx and os.path.exists(pptx)
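
Reviewer note: the anti-cut test above relies on collapsing whitespace in the extracted text and then checking that individual words of a long label are still present. A minimal sketch of that check in isolation follows; `words_survive` is a hypothetical name introduced here, not a helper from the test suite.

```python
import re


def words_survive(label: str, rendered_text: str) -> bool:
    """True when every whitespace-separated word of `label` appears in the text.

    Line breaks introduced by wrapping are collapsed to single spaces first,
    so a wrapped label still passes while a truncated one does not.
    """
    flat = re.sub(r"\s+", " ", rendered_text)
    return all(word in flat for word in label.split())
```

One limitation worth keeping in mind: a renderer that breaks a single word across two lines would presumably fail this check even though nothing was cut, since the two halves end up separated by a space.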
@@ -1,176 +0,0 @@
"""Overview chapter — df.head, column dictionary and describe (reference).

Second reference chapter for AutomaticEDA. Renders (across as many pages/slides
as needed, the renderers paginate):

1. ``df.head`` — the first rows of the table. The current ``TableProfile`` does
   NOT carry the raw head, so this is read from ``ctx['head_rows']`` /
   ``profile['head_rows']`` (a list of row dicts). When absent the chapter shows
   an honest placeholder documenting the missing key instead of inventing data.
2. Column dictionary — name / type / nulls / non-null examples. Examples come
   from ``columns[i]['examples']`` when present; otherwise they are derived from
   real non-null profile values (categorical top values, numeric min/median/max)
   so the cell is never empty nor fabricated.
3. ``df.describe`` — mean / median / min / max / std for every numeric column.

Contract: build_<id>(profile, ctx) -> Chapter | None ; CHAPTER_VERSION = "x.y.z".
"""

from __future__ import annotations

from .. import model

CHAPTER_VERSION = "1.0.0"
CHAPTER_ID = "overview"
CHAPTER_TITLE = "Overview"

# Profile/ctx keys the calculation phase must add for a full head + examples.
HEAD_KEY = "head_rows"  # list[dict] — df.head(n)
EXAMPLES_KEY = "examples"  # per column: list of non-null sample values


def _fmt_num(value, decimals: int = 3) -> str:
    if value is None:
        return "—"
    if isinstance(value, bool):
        return str(value)
    if isinstance(value, int):
        return f"{value:,}".replace(",", ".")
    if isinstance(value, float):
        if value != value:  # NaN
            return "NaN"
        if value in (float("inf"), float("-inf")):
            return str(value)
        text = f"{value:.{decimals}f}".rstrip("0").rstrip(".")
        return text if text else "0"
    return str(value)


def _fmt_pct(value, decimals: int = 1) -> str:
    if value is None:
        return "—"
    try:
        return f"{float(value) * 100:.{decimals}f}%"
    except (TypeError, ValueError):
        return str(value)


def _examples_for(col: dict) -> str:
    """Build a short string of real non-null example values for a column."""
    explicit = col.get(EXAMPLES_KEY)
    if isinstance(explicit, (list, tuple)) and explicit:
        return ", ".join(model._safe_str(v) for v in explicit[:4])
    cat = col.get("categorical") or {}
    top = cat.get("top") or []
    if top:
        vals = [model._safe_str((t or {}).get("value")) for t in top[:4]
                if isinstance(t, dict)]
        vals = [v for v in vals if v]
        if vals:
            return ", ".join(vals)
    num = col.get("numeric") or {}
    if num:
        bits = []
        for key in ("min", "median", "max"):
            v = num.get(key)
            if v is not None:
                bits.append(_fmt_num(v))
        if bits:
            return ", ".join(bits)
    return "—"


def _head_block(profile: dict, ctx: dict):
    """Return a DataTable for df.head, or a Note documenting the missing key."""
    head = ctx.get(HEAD_KEY) or profile.get(HEAD_KEY)
    if isinstance(head, list) and head and isinstance(head[0], dict):
        # Column order from the profile, then any extra keys present in rows.
        cols = [c.get("name") for c in (profile.get("columns") or [])
                if c.get("name")]
        if not cols:
            cols = list(head[0].keys())
        rows = [[model._safe_str(r.get(c)) for c in cols] for r in head[:10]]
        return model.DataTable(header=cols, rows=rows,
                               note=f"primeras {len(rows)} filas")
    return model.Note(
        "df.head no disponible: el TableProfile no incluye 'head_rows'. La fase "
        "de cálculo debe añadir profile['head_rows'] (lista de dicts fila) o "
        "pasarlo en ctx['head_rows'] para mostrar las primeras filas.")


def _columns_block(profile: dict):
    cols = profile.get("columns") or []
    header = ["Columna", "Tipo", "Nulos", "Ejemplos (no nulos)"]
    rows = []
    for c in cols:
        if not isinstance(c, dict):
            continue
        name = c.get("name") or "(col)"
        ctype = c.get("inferred_type") or c.get("physical_type") or "—"
        sem = c.get("semantic_type")
        if sem:
            ctype = f"{ctype} ({sem})"
        null_pct = c.get("null_pct")
        null_count = c.get("null_count")
        if null_pct is not None:
            nulls = _fmt_pct(null_pct)
            if null_count is not None:
                nulls += f" ({null_count})"
        elif null_count is not None:
            nulls = str(null_count)
        else:
            nulls = "—"
        rows.append([name, ctype, nulls, _examples_for(c)])
    if not rows:
        return None
    return model.DataTable(header=header, rows=rows, title="Columnas")


def _describe_block(profile: dict):
    cols = profile.get("columns") or []
    header = ["Columna", "mean", "median", "min", "max", "std"]
    rows = []
    for c in cols:
        if not isinstance(c, dict) or c.get("inferred_type") != "numeric":
            continue
        num = c.get("numeric") or {}
        if not num:
            continue
        rows.append([
            c.get("name") or "(col)",
            _fmt_num(num.get("mean")),
            _fmt_num(num.get("median")),
            _fmt_num(num.get("min")),
            _fmt_num(num.get("max")),
            _fmt_num(num.get("std")),
        ])
    if not rows:
        return None
    return model.DataTable(header=header, rows=rows, title="Estadística (describe)")


def build_overview(profile: dict, ctx: dict):
    """Build the Overview Chapter, or None if the profile has no columns."""
    profile = profile or {}
    ctx = ctx or {}
    cols = profile.get("columns") or []
    if not cols and not (ctx.get(HEAD_KEY) or profile.get(HEAD_KEY)):
        return None

    blocks = [
        model.Heading(text="Primeras filas (df.head)", level=2),
        _head_block(profile, ctx),
    ]
    cols_block = _columns_block(profile)
    if cols_block is not None:
        blocks.append(model.Heading(
            text="Diccionario de columnas", level=2))
        blocks.append(cols_block)
    desc_block = _describe_block(profile)
    if desc_block is not None:
        blocks.append(model.Heading(
            text="Resumen estadístico numérico", level=2))
        blocks.append(desc_block)

    return model.Chapter(id=CHAPTER_ID, title=CHAPTER_TITLE,
                         version=CHAPTER_VERSION, blocks=blocks)
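
Reviewer note: the overview chapter expects `head_rows` as a list of row dicts and takes its column order from the profile, falling back to the keys of the first row. The sketch below reproduces that ordering rule in a standalone function so the expected input shape is easy to see. `head_table` is a hypothetical name, and plain `str()` stands in for `model._safe_str`, whose exact behavior is not shown in this diff.

```python
def head_table(head_rows, profile_columns=None, limit=10):
    """Return (header, rows) for the first `limit` row dicts, or None.

    Assumption: every element of `head_rows` is a dict, as the chapter expects.
    """
    if not (isinstance(head_rows, list) and head_rows
            and isinstance(head_rows[0], dict)):
        return None
    # Prefer the column order declared in the profile.
    header = [c["name"] for c in (profile_columns or [])
              if isinstance(c, dict) and c.get("name")]
    if not header:
        header = list(head_rows[0].keys())
    rows = [["" if r.get(c) is None else str(r.get(c)) for c in header]
            for r in head_rows[:limit]]
    return header, rows
```

A caller would typically pass something like `ctx = {"head_rows": [{"precio": 42.5, "categoria": "aceite"}]}`; columns named in the profile but missing from a row simply render as empty cells in this sketch.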
@@ -1,156 +0,0 @@
"""Cover chapter (PORTADA) — the reference chapter for AutomaticEDA.

Builds the document cover from a TableProfile plus an optional ``ctx`` of
presentation metadata. Reads everything defensively (``.get``) and degrades
honestly: a field that is neither in the profile nor in ``ctx`` is shown as a
placeholder rather than invented, leaving a hook for the LLM layer to fill it.

Contract for chapter authors (see ``docs/capabilities/automatic_eda.md``):
    build_<id>(profile: dict, ctx: dict) -> Chapter | None
    CHAPTER_VERSION = "x.y.z"
"""

from __future__ import annotations

import os
from datetime import datetime, timezone

from .. import model

CHAPTER_VERSION = "1.0.0"
CHAPTER_ID = "portada"
CHAPTER_TITLE = "Portada"

# Default human description of what the table quality score measures. Chapters
# can override it via ctx["quality_criteria"].
_DEFAULT_QUALITY_CRITERIA = (
    "media de los scores por columna (0–100): completitud (sin nulos/vacíos), "
    "validez (tipo y rango coherentes) y consistencia (sin duplicados/constantes)."
)


def _storage_from_source(source: str) -> str:
    """Infer the storage technology the dataset currently lives in.

    Heuristic on the profile ``source`` string (a path, DSN or backend name).
    Returns a human label; falls back to the raw source when unknown.
    """
    s = (source or "").strip().lower()
    if not s:
        return "—"
    if s.endswith(".csv") or s.endswith(".tsv"):
        return "CSV"
    if s.endswith(".parquet") or s.endswith(".pq"):
        return "Parquet"
    if s.endswith(".json") or s.endswith(".ndjson"):
        return "JSON"
    if s.endswith(".xlsx") or s.endswith(".xls"):
        return "Excel"
    if s.endswith((".duckdb", ".ddb")) or s == "duckdb" or s.endswith(".db"):
        return "DuckDB"
    if s.startswith(("postgres://", "postgresql://")) or "postgres" in s:
        return "PostgreSQL"
    # Exactly two dots and no spaces is read as a project.dataset.table name.
    if (s.startswith("bigquery") or "bigquery" in s
            or (s.count(".") == 2 and " " not in s)):
        return "BigQuery"
    if "sqlite" in s:
        return "SQLite"
    # Unknown: show the raw source so nothing is hidden.
    return source


def _fmt_int(v) -> str:
    if v is None:
        return "—"
    try:
        return f"{int(v):,}".replace(",", ".")
    except (TypeError, ValueError):
        return str(v)


def _fmt_date_eu(value) -> str:
    """Format a date/ISO string as European DD/MM/AAAA HH:mm (UI convention).

    Accepts a datetime, an ISO-8601 string (with or without microseconds/tz) or
    any other string. Non-parseable strings are returned verbatim so nothing is
    lost; None yields a placeholder.
    """
    if value is None:
        return "—"
    if isinstance(value, datetime):
        return value.strftime("%d/%m/%Y %H:%M")
    s = str(value).strip()
    if not s:
        return "—"
    try:
        dt = datetime.fromisoformat(s.replace("Z", "+00:00"))
        return dt.strftime("%d/%m/%Y %H:%M")
    except (TypeError, ValueError):
        # Try a couple of common forms before giving up.
        for fmt in ("%Y-%m-%d %H:%M:%S UTC", "%Y-%m-%d %H:%M UTC",
                    "%Y-%m-%d %H:%M:%S", "%Y-%m-%d"):
            try:
                return datetime.strptime(s, fmt).strftime("%d/%m/%Y %H:%M")
            except ValueError:
                continue
        return s


def build_portada(profile: dict, ctx: dict):
    """Build the cover Chapter, or None if there is truly nothing to show."""
    profile = profile or {}
    ctx = ctx or {}

    dataset_name = (ctx.get("dataset_name") or profile.get("table")
                    or "(dataset sin nombre)")
    source = profile.get("source") or ""
    # Where the dataset comes from (origin), distinct from where it is stored.
    source_origin = ctx.get("source_origin") or source or "—"
    storage = ctx.get("storage") or _storage_from_source(source)

    when = _fmt_date_eu(
        ctx.get("generated_at") or profile.get("profiled_at")
        or datetime.now(timezone.utc))

    n_rows = profile.get("n_rows")
    n_cols = profile.get("n_cols")
    shape = f"{_fmt_int(n_rows)} filas × {_fmt_int(n_cols)} columnas"

    score = profile.get("quality_score")
    quality_criteria = ctx.get("quality_criteria") or _DEFAULT_QUALITY_CRITERIA
    quality_value = "—" if score is None else f"{score} / 100"

    # Granularity: ctx wins; else derive from key candidates; else be honest.
    granularity = ctx.get("granularity")
    if not granularity:
        keys = profile.get("key_candidates") or []
        if keys:
            granularity = ("Cada fila parece identificada por "
                           + ", ".join(str(k) for k in keys[:3]) + ".")
        else:
            granularity = ("Cada fila es… (granularidad no determinada — "
                           "pendiente de la capa de cálculo/LLM).")

    description = ctx.get("description")
    if not description:
        description = ("Descripción no provista — pendiente de la capa LLM "
                       "(`run_llm`) o de `ctx['description']`.")

    blocks = [
        model.Heading(text=str(dataset_name), level=1),
        model.Markdown(text="**Automatic-EDA** · informe exploratorio automático"),
        model.KVTable(rows=[
            ("Fuente", source_origin),
            ("Almacenamiento", storage),
            ("Generado", when),
            ("Tamaño", shape),
            ("Calidad", quality_value),
            ("Criterios de calidad", quality_criteria),
        ]),
        model.Heading(text="Descripción", level=2),
        model.Markdown(text=str(description)),
        model.Heading(text="Granularidad", level=2),
        model.Markdown(text=str(granularity)),
    ]

    return model.Chapter(id=CHAPTER_ID, title=CHAPTER_TITLE,
                         version=CHAPTER_VERSION, blocks=blocks)
|
||||
@@ -1,89 +0,0 @@
"""Chapter registry: the canonical order of an AutomaticEDA document.

``CHAPTER_ORDER`` declares every chapter the engine will *ever* place, in the
order they appear in the document. Each id maps by convention to a module
``automatic_eda/chapters/<id>.py`` exposing ``build_<id>(profile, ctx) ->
Chapter | None`` and a ``CHAPTER_VERSION`` constant.

This pre-declared order is what lets many agents add chapters in parallel
without contention: an agent only creates its own ``chapters/<id>.py`` module;
it never edits this file. ``build_document`` imports each chapter lazily; a
chapter whose module does not exist yet (not implemented) is simply skipped, so
the document is always renderable with whatever chapters are present today.

``build_document`` never raises: a chapter that errors out is silently dropped,
and a chapter that returns ``None`` (does not apply to this dataset, e.g.
time series on a dataset with no date column) is omitted.
"""

from __future__ import annotations

import importlib

from . import model

# Canonical document order. Implemented today: portada, overview. The rest are
# placeholders other agents will fill by creating chapters/<id>.py; they will
# appear in this exact position automatically once their module exists.
CHAPTER_ORDER = [
    "portada",       # cover
    "overview",      # df.head + columns/types/nulls/examples + describe
    "num_distr",     # numeric distributions
    "cat_distr",     # categorical distributions
    "calidad",       # data quality
    "correlacion",   # correlations / associations
    "modelos",       # cheap models (PCA/KMeans/outliers)
    "analisis_llm",  # LLM interpretation
    "timeseries",    # time-series analysis
    "geospatial",    # geospatial
    "agregacion",    # aggregations / pivots
]

def build_chapter(chapter_id: str, profile: dict, ctx: dict):
    """Build a single chapter by id, or None if absent/not-applicable/error.

    Looks up ``automatic_eda.chapters.<chapter_id>`` and calls its
    ``build_<chapter_id>(profile, ctx)``. Returns a normalized Chapter, or None
    when the module is missing, the builder returns None, or anything raises.
    """
    mod_name = f"{__package__}.chapters.{chapter_id}"
    try:
        mod = importlib.import_module(mod_name)
    except Exception:  # noqa: BLE001 - chapter not implemented yet → skip.
        return None
    builder = getattr(mod, f"build_{chapter_id}", None)
    if builder is None:
        return None
    try:
        result = builder(profile or {}, ctx or {})
    except Exception:  # noqa: BLE001 - a broken chapter never aborts the doc.
        return None
    return model.as_chapter(result)

def build_document(profile: dict, ctx: dict | None = None) -> list:
    """Build the full ordered list of chapters for a TableProfile.

    Args:
        profile: the ``eda`` group TableProfile dict (may be None/empty).
        ctx: optional context dict carrying presentation metadata not present in
            the profile (dataset_name, source_origin, storage, generated_at,
            description, granularity, quality_criteria, head_rows, ...).

    Returns:
        list[Chapter] in canonical order, containing only the chapters that are
        implemented and applicable. Never raises.
    """
    # Anything that is not a dict (including None) is treated as empty.
    if not isinstance(profile, dict):
        profile = {}
    if ctx is None:
        ctx = {}
    chapters = []
    for cid in CHAPTER_ORDER:
        ch = build_chapter(cid, profile, ctx)
        if ch is not None and ch.blocks:
            chapters.append(ch)
    return chapters

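Review note: the lazy, convention-based lookup in this registry can be tried in isolation. The sketch below is not the engine's code. It assumes a hypothetical package name `demo_eda`, fakes the `chapters/<id>.py` modules through `sys.modules` (the `_register` helper is invented for the demo), and uses plain dicts in place of `Chapter` objects. It shows that a missing module is skipped, that a builder returning `None` is omitted, and that output order follows `CHAPTER_ORDER` rather than registration order.

```python
import importlib
import sys
import types

CHAPTER_ORDER = ["portada", "overview", "timeseries"]
PACKAGE = "demo_eda"  # hypothetical package name, for illustration only


def build_chapter(chapter_id, profile, ctx):
    """Import <PACKAGE>.chapters.<id> lazily; None if missing or failing."""
    try:
        mod = importlib.import_module(f"{PACKAGE}.chapters.{chapter_id}")
        builder = getattr(mod, f"build_{chapter_id}")
        return builder(profile, ctx)
    except Exception:
        return None


def build_document(profile, ctx=None):
    """Collect the chapters that exist and apply, in canonical order."""
    built = (build_chapter(cid, profile or {}, ctx or {})
             for cid in CHAPTER_ORDER)
    return [ch for ch in built if ch]


def _register(chapter_id, builder):
    """Demo helper: fake a chapters/<id>.py module without touching disk."""
    mod = types.ModuleType(f"{PACKAGE}.chapters.{chapter_id}")
    setattr(mod, f"build_{chapter_id}", builder)
    sys.modules[mod.__name__] = mod


# "timeseries" is registered first, yet it still renders after "portada".
_register("timeseries",
          lambda p, c: {"id": "timeseries"} if p.get("date_col") else None)
_register("portada", lambda p, c: {"id": "portada"})

doc_ids = [ch["id"] for ch in build_document({"date_col": "ts"})]
no_dates = [ch["id"] for ch in build_document({})]
```

Here `overview` has no module, so it never appears, and `timeseries` drops out as soon as the profile carries no date column.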
@@ -1,310 +0,0 @@
"""AutomaticEDA document model: format-independent blocks and chapters.

This is the intermediate layer between *content* (what an EDA chapter wants to
say) and *output format* (PDF for mobile reading, PPTX for sharing). A document
is an ordered list of :class:`Chapter`. A chapter is ``{id, title, version,
blocks}``. A block is one of a small, closed set of presentation primitives
(heading, markdown, key/value table, data table, figure, image, caption, note).

Neither renderer knows anything about the EDA profile: they only know how to lay
out blocks so that **nothing is ever cut**. Long text wraps to whole lines,
long tables split by rows repeating the header, figures and images are scaled to
fit entirely. Each chapter declares its own ``version`` so every page/slide can
be stamped ``<Chapter> · v<version>`` and tracked in a manifest for continuous,
per-chapter improvement.

Reading is defensive throughout (the ``eda`` group "dict-no-throw" style): the
normalizers accept dataclass blocks *or* plain dicts, coerce anything unknown
into a readable :class:`Note` instead of raising, and the renderers degrade a
malformed block to text rather than crashing the whole document.
"""

from __future__ import annotations

import json
import os
from dataclasses import dataclass, field
from typing import Any, Callable, Optional

# Global engine version. Bump when the document model or a renderer changes in a
# way that affects output. Individual chapters carry their own CHAPTER_VERSION.
ENGINE_VERSION = "1.0.0"
ENGINE_NAME = "AutomaticEDA"

# --------------------------------------------------------------------------- #
# Block primitives. Each carries a stable ``kind`` string so renderers can
# dispatch by kind (works for dataclass instances and for plain dicts alike).
# --------------------------------------------------------------------------- #
@dataclass
class Heading:
    """A section heading. ``level`` 1 (largest) .. 3 (smallest)."""

    text: str = ""
    level: int = 1
    kind: str = field(default="heading", init=False)


@dataclass
class Markdown:
    """A block of light markdown text.

    Supported subset (everything else is rendered verbatim, never dropped):
    ``#``/``##``/``###`` headings, ``-``/``*`` bullet lists, ``| a | b |``
    tables (consecutive pipe lines become a data table), blank lines as
    paragraph breaks, and ``**bold**`` inline markers (markers are stripped, the
    text is kept). Text is wrapped to whole lines so it is never cut mid-line.
    """

    text: str = ""
    kind: str = field(default="markdown", init=False)


@dataclass
class KVTable:
    """A two-column key/value table. ``rows`` is a list of ``(label, value)``."""

    rows: list = field(default_factory=list)
    title: Optional[str] = None
    kind: str = field(default="kv_table", init=False)


@dataclass
class DataTable:
    """A tabular block with a header row.

    If it does not fit in the remaining page/slide space it is split by rows,
    **repeating the header** on each continuation. Long cell text wraps inside
    its column (the row grows taller) so no cell content is ever lost.
    """

    header: list = field(default_factory=list)
    rows: list = field(default_factory=list)  # list[list[Any]]
    title: Optional[str] = None
    note: Optional[str] = None
    kind: str = field(default="data_table", init=False)


@dataclass
class Figure:
    """A matplotlib figure, scaled to fit entirely (never cropped).

    Provide either an already-built ``fig`` (a ``matplotlib.figure.Figure``) or
    a zero-arg ``make`` callable that returns one (lazy: only built when the
    renderer needs it). ``height_in`` is an optional hint for the target height
    on the page; renderers clamp it to the available space preserving aspect.
    """

    fig: Any = None
    make: Optional[Callable[[], Any]] = None
    caption: Optional[str] = None
    height_in: Optional[float] = None
    kind: str = field(default="figure", init=False)


@dataclass
class Image:
    """A raster image (PNG/JPG) by path, scaled to fit entirely."""

    path: str = ""
    caption: Optional[str] = None
    height_in: Optional[float] = None
    kind: str = field(default="image", init=False)


@dataclass
class Caption:
    """Small auxiliary text rendered under a figure/table."""

    text: str = ""
    kind: str = field(default="caption", init=False)


@dataclass
class Note:
    """Small auxiliary note (italic). Also the fallback for unknown content."""

    text: str = ""
    kind: str = field(default="note", init=False)


@dataclass
class Chapter:
    """An ordered set of blocks with an id, a title and a generation version."""

    id: str = ""
    title: str = ""
    version: str = "1.0.0"
    blocks: list = field(default_factory=list)

# --------------------------------------------------------------------------- #
# Defensive normalizers: accept dataclasses OR plain dicts, never raise.
# --------------------------------------------------------------------------- #
_BLOCK_BY_KIND = {
    "heading": Heading,
    "markdown": Markdown,
    "kv_table": KVTable,
    "data_table": DataTable,
    "figure": Figure,
    "image": Image,
    "caption": Caption,
    "note": Note,
}


def as_block(obj: Any):
    """Coerce a value into a block dataclass. Unknown values become a Note."""
    if isinstance(obj, (Heading, Markdown, KVTable, DataTable, Figure, Image,
                        Caption, Note)):
        return obj
    if isinstance(obj, dict):
        kind = obj.get("kind")
        # A non-string kind may be unhashable; looking it up would raise.
        cls = _BLOCK_BY_KIND.get(kind) if isinstance(kind, str) else None
        if cls is None:
            return Note(text=_safe_str(obj))
        # Build only with fields the dataclass accepts (ignore extras).
        try:
            if cls is Heading:
                return Heading(text=_safe_str(obj.get("text")),
                               level=int(obj.get("level", 1) or 1))
            if cls is Markdown:
                return Markdown(text=_safe_str(obj.get("text")))
            if cls is KVTable:
                return KVTable(rows=list(obj.get("rows") or []),
                               title=obj.get("title"))
            if cls is DataTable:
                return DataTable(header=list(obj.get("header") or []),
                                 rows=list(obj.get("rows") or []),
                                 title=obj.get("title"), note=obj.get("note"))
            if cls is Figure:
                return Figure(fig=obj.get("fig"), make=obj.get("make"),
                              caption=obj.get("caption"),
                              height_in=obj.get("height_in"))
            if cls is Image:
                return Image(path=_safe_str(obj.get("path")),
                             caption=obj.get("caption"),
                             height_in=obj.get("height_in"))
            if cls is Caption:
                return Caption(text=_safe_str(obj.get("text")))
            if cls is Note:
                return Note(text=_safe_str(obj.get("text")))
        except Exception:  # noqa: BLE001 - never raise on a malformed block.
            return Note(text=_safe_str(obj))
    return Note(text=_safe_str(obj))

def as_blocks(seq: Any) -> list:
    """Normalize an arbitrary sequence into a list of block dataclasses."""
    if seq is None:
        return []
    if not isinstance(seq, (list, tuple)):
        return [as_block(seq)]
    return [as_block(b) for b in seq]


def as_chapter(obj: Any) -> Optional[Chapter]:
    """Coerce a value into a Chapter (or None). Accepts a dict or a Chapter."""
    if obj is None:
        return None
    if isinstance(obj, Chapter):
        obj.blocks = as_blocks(obj.blocks)
        return obj
    if isinstance(obj, dict):
        return Chapter(
            id=_safe_str(obj.get("id")),
            title=_safe_str(obj.get("title")) or _safe_str(obj.get("id")),
            version=_safe_str(obj.get("version")) or "1.0.0",
            blocks=as_blocks(obj.get("blocks")),
        )
    return None


def as_chapters(seq: Any) -> list:
    """Normalize a sequence of chapters, dropping anything that can't coerce."""
    if seq is None:
        return []
    if isinstance(seq, Chapter):
        return [as_chapter(seq)]
    if not isinstance(seq, (list, tuple)):
        return []
    out = []
    for c in seq:
        ch = as_chapter(c)
        if ch is not None:
            out.append(ch)
    return out


def _safe_str(v: Any) -> str:
    """str() that never raises and maps None to ''."""
    if v is None:
        return ""
    try:
        return str(v)
    except Exception:  # noqa: BLE001
        return ""

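Review note: the "dict-no-throw" coercion used by the normalizers can be illustrated with a cut-down model. The sketch below is an assumption-laden reduction, not the module itself: it keeps only two block types (`Heading`, `Note`) and uses plain `str()` where the real code uses `_safe_str`. It shows a dict being promoted to a dataclass, and malformed or unknown input degrading to a readable `Note` instead of raising.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Heading:
    text: str = ""
    level: int = 1
    kind: str = field(default="heading", init=False)


@dataclass
class Note:
    text: str = ""
    kind: str = field(default="note", init=False)


def as_block(obj: Any):
    """Return a block dataclass; anything unrecognized becomes a Note."""
    if isinstance(obj, (Heading, Note)):
        return obj
    if isinstance(obj, dict) and obj.get("kind") == "heading":
        try:
            return Heading(text=str(obj.get("text") or ""),
                           level=int(obj.get("level", 1) or 1))
        except Exception:  # e.g. level="big" cannot be converted to int
            pass
    return Note(text=str(obj))


blocks = [as_block(b) for b in (
    {"kind": "heading", "text": "Overview", "level": "2"},  # valid dict
    {"kind": "heading", "text": "Bad", "level": "big"},     # malformed field
    {"kind": "sparkline"},                                  # unknown kind
    42,                                                     # not a block at all
)]
kinds = [b.kind for b in blocks]
```

Only the first input survives as a heading; the other three are kept as notes, so nothing a chapter emits is silently lost at this layer.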
# --------------------------------------------------------------------------- #
# Manifest: per-chapter versions and page/slide counts for tracking.
# --------------------------------------------------------------------------- #
def merge_manifest(manifest_path: str, renderer: str, chapters_meta: list,
                   generated_at: str,
                   engine_version: str = ENGINE_VERSION) -> dict:
    """Read-modify-write the AutomaticEDA manifest, merging one renderer's run.

    The manifest lives next to the outputs as ``automatic_eda_manifest.json``
    and records, per chapter, its version plus the page count (PDF) and slide
    count (PPTX). Calling either renderer creates or updates it. Never raises:
    on any error returns the in-memory manifest without writing.

    Args:
        manifest_path: path to the JSON manifest to create or update.
        renderer: "pdf" or "pptx"; selects which count key is written.
        chapters_meta: list of ``{"id", "version", "n_pages"|"n_slides"}``.
        generated_at: ISO-ish timestamp string for this run.
        engine_version: AutomaticEDA engine version.

    Returns:
        The merged manifest dict (also written to disk on success).
    """
    data: dict = {}
    try:
        if manifest_path and os.path.exists(manifest_path):
            with open(manifest_path, "r", encoding="utf-8") as fh:
                loaded = json.load(fh)
            if isinstance(loaded, dict):
                data = loaded
    except Exception:  # noqa: BLE001 - a corrupt manifest is overwritten.
        data = {}

    data["engine"] = ENGINE_NAME
    data["engine_version"] = engine_version
    data["generated_at"] = generated_at
    chapters = data.get("chapters")
    if not isinstance(chapters, dict):
        chapters = {}
    count_key = "n_slides" if renderer == "pptx" else "n_pages"
    for cm in chapters_meta or []:
        if not isinstance(cm, dict):
            continue
        cid = cm.get("id")
        if not cid:
            continue
        entry = chapters.get(cid)
        if not isinstance(entry, dict):
            entry = {}
        entry["version"] = cm.get("version") or entry.get("version") or "1.0.0"
        entry[count_key] = cm.get(count_key, cm.get("n_pages", cm.get("n_slides")))
        chapters[cid] = entry
    data["chapters"] = chapters

    try:
        parent = os.path.dirname(os.path.abspath(manifest_path))
        os.makedirs(parent, exist_ok=True)
        with open(manifest_path, "w", encoding="utf-8") as fh:
            json.dump(data, fh, ensure_ascii=False, indent=2, default=str)
    except Exception:  # noqa: BLE001 - never raise from the manifest writer.
        pass
    return data

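Review note: the read-modify-write merge is what lets the PDF and PPTX renderers share one manifest without clobbering each other's counts. The sketch below is a simplified stand-in, not `merge_manifest` itself: `merge_counts` is a hypothetical name, it omits the engine and timestamp fields, and unlike the original it lets write errors propagate. It shows a PDF run followed by a PPTX run ending up in a single per-chapter entry.

```python
import json
import os
import tempfile


def merge_counts(path, renderer, chapters_meta):
    """Merge one renderer's per-chapter counts into the JSON file at path."""
    data = {}
    try:
        with open(path, encoding="utf-8") as fh:
            loaded = json.load(fh)
        if isinstance(loaded, dict):
            data = loaded
    except (OSError, ValueError):
        pass  # missing or corrupt manifest: start from scratch
    chapters = data.get("chapters")
    if not isinstance(chapters, dict):
        chapters = {}
    data["chapters"] = chapters
    # Each renderer writes only its own count key and leaves the other alone.
    key = "n_slides" if renderer == "pptx" else "n_pages"
    for cm in chapters_meta:
        entry = chapters.setdefault(cm["id"], {})
        entry["version"] = cm["version"]
        entry[key] = cm[key]
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(data, fh, indent=2)
    return data


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "automatic_eda_manifest.json")
    merge_counts(path, "pdf",
                 [{"id": "portada", "version": "1.0.0", "n_pages": 1}])
    merged = merge_counts(path, "pptx",
                          [{"id": "portada", "version": "1.1.0", "n_slides": 2}])
```

After both runs the `portada` entry carries the latest version together with both `n_pages` and `n_slides`.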
@@ -1,532 +0,0 @@
"""AutomaticEDA PDF renderer: A5 portrait, mobile-first, never cuts content.

A flow paginator: it measures each block (using the deterministic character grid
from :mod:`text_layout`) and places it top-to-bottom on the current page. When a
unit does not fit in the remaining space it moves whole to the next page:
text by whole lines (never mid-line, never mid-word), data tables by rows
**repeating the header**, figures/images scaled to fit entirely (never cropped).

Each chapter starts on a fresh page and every page is stamped in the footer with
``<Chapter> · v<version>`` plus the engine version and a running page number, so
output is versioned per chapter for continuous improvement.

dict-no-throw: a failure inside one block is caught and noted; the PDF is always
produced and at least one page is guaranteed. Engine: matplotlib ``PdfPages``.
"""

from __future__ import annotations

import io
import os

import matplotlib

matplotlib.use("Agg")

import matplotlib.image as mpimg  # noqa: E402
import matplotlib.pyplot as plt  # noqa: E402
from matplotlib.backends.backend_pdf import PdfPages  # noqa: E402
from matplotlib.patches import Rectangle  # noqa: E402

from . import model  # noqa: E402
from . import text_layout as tl  # noqa: E402

# A5 portrait, inches.
_W, _H = 5.83, 8.27
_ML, _MR, _MT, _MB = 0.5, 0.42, 0.55, 0.5
_FOOTER_H = 0.34
_USABLE_W = _W - _ML - _MR
_CONTENT_TOP = _MT
_CONTENT_BOTTOM = _H - _MB - _FOOTER_H

# Palette / type (inherits the Tufte-ish mobile look of render_eda_pdf).
_INK = "#1b1b1b"
_ACCENT = "#2a6f97"
_MUTED = "#8a8a8a"
_RULE = "#cccccc"
_HEAD_BG = "#eef3f6"

_RC = {
    "font.size": 10,
    "font.family": "sans-serif",
    "figure.facecolor": "white",
    "savefig.facecolor": "white",
    "pdf.fonttype": 42,  # embed TrueType so text stays selectable on mobile.
}

# Font sizes (pt) and derived line heights (in).
_FS_H1, _FS_H2, _FS_H3 = 17, 13, 11
_FS_BODY, _FS_CELL, _FS_NOTE = 10.5, 9.0, 9.0
_GAP = 0.12       # vertical gap after a block, inches.
_CELL_PAD = 0.06  # horizontal padding inside a table cell, inches.
_ROW_VPAD = 0.05  # vertical padding inside a table row, inches.

class _PdfState:
    """Mutable layout cursor for the running PDF document."""

    def __init__(self, pdf, title: str):
        self.pdf = pdf
        self.title = title
        self.fig = None
        self.y = _CONTENT_TOP   # inches from the top of the page.
        self.page = 0           # global page counter.
        self.chapter = None     # current Chapter (for the footer).
        self.chapter_pages = 0  # pages produced for the current chapter.


# --------------------------------------------------------------------------- #
# Coordinate helpers (inches-from-top → matplotlib figure fraction).
# --------------------------------------------------------------------------- #
def _yf(y_in: float) -> float:
    return 1.0 - (y_in / _H)


def _xf(x_in: float) -> float:
    return x_in / _W

def _new_page(st: _PdfState) -> None:
    """Close the current page (if any) and open a fresh one with a footer."""
    _flush_page(st)
    st.fig = plt.figure(figsize=(_W, _H))
    st.y = _CONTENT_TOP
    st.page += 1
    st.chapter_pages += 1
    _draw_footer(st)


def _flush_page(st: _PdfState) -> None:
    if st.fig is not None:
        st.pdf.savefig(st.fig)
        plt.close(st.fig)
        st.fig = None


def _draw_footer(st: _PdfState) -> None:
    ch = st.chapter
    left = ""
    if ch is not None:
        left = f"{ch.title} · v{ch.version}"
    right = f"{model.ENGINE_NAME} v{model.ENGINE_VERSION} · p.{st.page}"
    yb = (_MB * 0.45) / _H
    st.fig.text(_xf(_ML), yb, left, fontsize=7.5, color=_MUTED,
                ha="left", va="center")
    st.fig.text(_xf(_W - _MR), yb, right, fontsize=7.5, color=_MUTED,
                ha="right", va="center")
    # A thin rule above the footer.
    st.fig.add_artist(Rectangle(
        (_xf(_ML), (_MB + _FOOTER_H * 0.5) / _H),
        _xf(_W - _MR) - _xf(_ML), 0.0008,
        transform=st.fig.transFigure, color=_RULE, lw=0.6))


def _remaining(st: _PdfState) -> float:
    return _CONTENT_BOTTOM - st.y


def _ensure_space(st: _PdfState, height: float) -> None:
    """Open a new page if ``height`` does not fit in the remaining space."""
    if _remaining(st) < height:
        _new_page(st)

# --------------------------------------------------------------------------- #
# Block placers. Each advances st.y and paginates as needed.
# --------------------------------------------------------------------------- #
def _place_heading(st: _PdfState, block) -> None:
    level = max(1, min(3, int(getattr(block, "level", 1) or 1)))
    fs = {1: _FS_H1, 2: _FS_H2, 3: _FS_H3}[level]
    text = tl.strip_inline_md(getattr(block, "text", ""))
    max_chars = tl.chars_per_line(_USABLE_W, fs)
    lines = tl.wrap(text, max_chars)
    lh = tl.line_height_in(fs, leading=1.2)
    block_h = lh * len(lines) + 0.06
    # Keep at least the heading + a couple of body lines together when possible.
    _ensure_space(st, min(block_h + tl.line_height_in(_FS_BODY) * 2,
                          _CONTENT_BOTTOM - _CONTENT_TOP))
    for ln in lines:
        _ensure_space(st, lh)
        st.fig.text(_xf(_ML), _yf(st.y), ln, fontsize=fs, fontweight="bold",
                    color=_INK, ha="left", va="top")
        st.y += lh
    if level == 1:
        # Accent underline under a top-level heading.
        st.fig.add_artist(Rectangle(
            (_xf(_ML), _yf(st.y + 0.02)), _xf(_ML + 1.4) - _xf(_ML), 0.0016,
            transform=st.fig.transFigure, color=_ACCENT, lw=0))
        st.y += 0.10
    st.y += _GAP


def _place_text_lines(st: _PdfState, lines: list, fs: float, color: str,
                      style: str = "normal", indent: float = 0.0) -> None:
    lh = tl.line_height_in(fs)
    for ln in lines:
        _ensure_space(st, lh)
        st.fig.text(_xf(_ML + indent), _yf(st.y), ln, fontsize=fs, color=color,
                    ha="left", va="top", style=style)
        st.y += lh

def _place_markdown(st: _PdfState, block) -> None:
    raw = getattr(block, "text", "") or ""
    md_lines = str(raw).split("\n")
    i = 0
    n = len(md_lines)
    while i < n:
        line = md_lines[i]
        stripped = line.strip()
        # Consecutive pipe-table lines → a DataTable.
        if stripped.startswith("|") and stripped.endswith("|"):
            j = i
            tbl_lines = []
            while j < n and md_lines[j].strip().startswith("|") \
                    and md_lines[j].strip().endswith("|"):
                tbl_lines.append(md_lines[j])
                j += 1
            parsed = tl.parse_md_table(tbl_lines)
            if parsed:
                header, rows = parsed
                _place_data_table(st, model.DataTable(header=header, rows=rows))
                i = j
                continue
        if stripped == "":
            st.y += tl.line_height_in(_FS_BODY) * 0.5
            i += 1
            continue
        if stripped.startswith("### "):
            _place_heading(st, model.Heading(stripped[4:], level=3))
            i += 1
            continue
        if stripped.startswith("## "):
            _place_heading(st, model.Heading(stripped[3:], level=2))
            i += 1
            continue
        if stripped.startswith("# "):
            _place_heading(st, model.Heading(stripped[2:], level=1))
            i += 1
            continue
        if stripped.startswith("- ") or stripped.startswith("* "):
            content = tl.strip_inline_md(stripped[2:])
            bullet_chars = tl.chars_per_line(_USABLE_W - 0.22, _FS_BODY)
            wrapped = tl.wrap(content, bullet_chars)
            first = True
            for w in wrapped:
                prefix = "• " if first else " "
                _place_text_lines(st, [prefix + w], _FS_BODY, _INK,
                                  indent=0.0)
                first = False
            i += 1
            continue
        # Plain paragraph (gather following plain lines into one paragraph).
        para = [tl.strip_inline_md(stripped)]
        j = i + 1
        while j < n:
            nxt = md_lines[j].strip()
            if nxt == "" or nxt.startswith(("|", "#", "- ", "* ")):
                break
            para.append(tl.strip_inline_md(nxt))
            j += 1
        text = " ".join(para)
        max_chars = tl.chars_per_line(_USABLE_W, _FS_BODY)
        _place_text_lines(st, tl.wrap(text, max_chars), _FS_BODY, _INK)
        i = j
    st.y += _GAP

def _place_kv_table(st: _PdfState, block) -> None:
    title = getattr(block, "title", None)
    if title:
        _place_heading(st, model.Heading(title, level=2))
    rows = getattr(block, "rows", []) or []
    key_w = 1.9  # inches reserved for the label column.
    val_chars = tl.chars_per_line(_USABLE_W - key_w - 0.1, _FS_BODY)
    lh = tl.line_height_in(_FS_BODY)
    for row in rows:
        try:
            label, value = row[0], row[1]
        except Exception:  # noqa: BLE001
            label, value = str(row), ""
        v_lines = tl.wrap(model._safe_str(value), val_chars)
        row_h = lh * len(v_lines) + _ROW_VPAD
        _ensure_space(st, row_h)
        y0 = st.y
        st.fig.text(_xf(_ML), _yf(y0), tl.strip_inline_md(model._safe_str(label)),
                    fontsize=_FS_BODY, color=_MUTED, ha="left", va="top")
        for k, vl in enumerate(v_lines):
            st.fig.text(_xf(_ML + key_w), _yf(y0 + k * lh), vl,
                        fontsize=_FS_BODY, color=_INK, ha="left", va="top")
        st.y = y0 + row_h
    st.y += _GAP

def _col_widths(header: list, rows: list, fs: float) -> list:
    """Distribute usable width across columns proportional to content length."""
    ncol = len(header) if header else (len(rows[0]) if rows else 1)
    ncol = max(1, ncol)
    natural = [3] * ncol
    for c in range(ncol):
        if header and c < len(header):
            natural[c] = max(natural[c], len(model._safe_str(header[c])))
        for r in rows:
            if c < len(r):
                natural[c] = max(natural[c], len(model._safe_str(r[c])))
    # Clamp so one very long column does not starve the others.
    clamped = [min(max(w, 4), 40) for w in natural]
    total = float(sum(clamped)) or 1.0
    widths = [_USABLE_W * w / total for w in clamped]
    # Enforce a minimum readable column width.
    min_w = 0.45
    widths = [max(w, min_w) for w in widths]
    # Renormalize if the minimums pushed us over the usable width.
    s = sum(widths)
    if s > _USABLE_W:
        widths = [w * _USABLE_W / s for w in widths]
    return widths

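Review note: the clamp-then-renormalize width allocation is easier to judge with numbers. The sketch below is a standalone approximation of `_col_widths` under stated assumptions: the name `col_widths` and its keyword defaults are invented for the demo, and `usable_w=4.91` is the A5 usable width implied by the constants (5.83 - 0.5 - 0.42). It shows that a 200-character cell is capped at 40 characters' worth of weight, and that renormalization brings the total back to the usable width.

```python
def col_widths(header, rows, usable_w=4.91, min_w=0.45):
    """Split usable_w across columns in proportion to clamped text length."""
    ncol = max(1, len(header) if header else (len(rows[0]) if rows else 1))
    natural = []
    for c in range(ncol):
        cells = [header[c]] if c < len(header) else []
        cells += [r[c] for r in rows if c < len(r)]
        natural.append(max([len(str(x)) for x in cells], default=0))
    # Clamp to [4, 40] characters so one huge cell cannot starve the rest.
    clamped = [min(max(n, 4), 40) for n in natural]
    total = float(sum(clamped))
    widths = [max(usable_w * n / total, min_w) for n in clamped]
    # The minimum may push the sum over the page; scale back if it does.
    overshoot = sum(widths)
    if overshoot > usable_w:
        widths = [w * usable_w / overshoot for w in widths]
    return widths


widths = col_widths(["id", "description"], [[1, "x" * 200]])
```

One consequence worth noting: after the final rescale a narrow column can end up marginally below `min_w`, so the minimum is a target rather than a hard guarantee.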
def _wrap_row(cells: list, widths: list, fs: float) -> list:
    """Wrap each cell to its column width → list of line-lists per cell."""
    out = []
    for c, w in enumerate(widths):
        text = model._safe_str(cells[c]) if c < len(cells) else ""
        max_chars = tl.chars_per_line(w - _CELL_PAD * 2, fs)
        out.append(tl.wrap(text, max_chars))
    return out


def _draw_table_row(st: _PdfState, cells_lines: list, widths: list, fs: float,
                    y0: float, header: bool) -> float:
    lh = tl.line_height_in(fs)
    nlines = max((len(c) for c in cells_lines), default=1)
    row_h = lh * nlines + _ROW_VPAD * 2
    if header:
        st.fig.add_artist(Rectangle(
            (_xf(_ML), _yf(y0 + row_h)), _xf(_ML + _USABLE_W) - _xf(_ML),
            _yf(y0) - _yf(y0 + row_h), transform=st.fig.transFigure,
            color=_HEAD_BG, lw=0, zorder=0))
    x = _ML
    for c, lines in enumerate(cells_lines):
        for k, ln in enumerate(lines):
            st.fig.text(_xf(x + _CELL_PAD), _yf(y0 + _ROW_VPAD + k * lh), ln,
                        fontsize=fs, color=_INK,
                        fontweight="bold" if header else "normal",
                        ha="left", va="top", zorder=2)
        x += widths[c]
    # Bottom rule of the row.
    st.fig.add_artist(Rectangle(
        (_xf(_ML), _yf(y0 + row_h)), _xf(_ML + _USABLE_W) - _xf(_ML), 0.0006,
        transform=st.fig.transFigure, color=_RULE, lw=0, zorder=1))
    return row_h

def _place_data_table(st: _PdfState, block) -> None:
    title = getattr(block, "title", None)
    if title:
        _place_heading(st, model.Heading(title, level=2))
    header = list(getattr(block, "header", []) or [])
    rows = list(getattr(block, "rows", []) or [])
    fs = _FS_CELL
    widths = _col_widths(header, rows, fs)
    header_lines = _wrap_row(header, widths, fs) if header else None
    lh = tl.line_height_in(fs)

    def header_h() -> float:
        if not header_lines:
            return 0.0
        return lh * max((len(c) for c in header_lines), default=1) + _ROW_VPAD * 2

    def draw_header() -> None:
        if header_lines:
            st.y += _draw_table_row(st, header_lines, widths, fs, st.y,
                                    header=True)

    # Ensure header + first row fit, else start on a new page.
    first_row_h = 0.0
    if rows:
        first_lines = _wrap_row(rows[0], widths, fs)
        first_row_h = lh * max((len(c) for c in first_lines), default=1) \
            + _ROW_VPAD * 2
    _ensure_space(st, header_h() + max(first_row_h, lh))
    draw_header()
    for r in rows:
        cells_lines = _wrap_row(r, widths, fs)
        row_h = lh * max((len(c) for c in cells_lines), default=1) \
            + _ROW_VPAD * 2
        if _remaining(st) < row_h:
            _new_page(st)
            draw_header()  # repeat header on the continuation page.
        st.y += _draw_table_row(st, cells_lines, widths, fs, st.y, header=False)
    note = getattr(block, "note", None)
    if note:
        _place_text_lines(st, tl.wrap(model._safe_str(note),
                                      tl.chars_per_line(_USABLE_W, _FS_NOTE)),
                          _FS_NOTE, _MUTED, style="italic")
    st.y += _GAP

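Review note: the row-wise table split with a repeated header reduces to a small piece of bookkeeping that can be checked without matplotlib. The sketch below is a hedged model of that logic only: `paginate_rows` is a hypothetical helper, heights are abstract units instead of inches, and it assumes every row fits on a page together with the header. It returns which row indices land on which page.

```python
def paginate_rows(header_h, row_heights, page_h, start_y=0.0):
    """Assign row indices to pages; every page begins with the header.

    Assumes header_h + any single row height <= page_h.
    """
    y = start_y
    first_h = row_heights[0] if row_heights else 0.0
    if page_h - y < header_h + first_h:
        y = 0.0                  # header + first row do not fit: fresh page
    y += header_h                # header drawn before the first row
    pages = [[]]
    for i, h in enumerate(row_heights):
        if page_h - y < h:
            pages.append([])     # continuation page...
            y = header_h         # ...which repeats the header first
        pages[-1].append(i)
        y += h
    return pages


# Five rows of height 1 under a header of height 1, on pages of height 3.
from_top = paginate_rows(1.0, [1.0] * 5, page_h=3.0)
mid_page = paginate_rows(1.0, [1.0] * 5, page_h=3.0, start_y=1.0)
pushed = paginate_rows(1.0, [1.0] * 5, page_h=3.0, start_y=2.5)
```

Starting mid-page leaves room for only one row before the first break, while a start too low on the page pushes the whole table to a fresh page, matching the "header plus first row stay together" rule.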
def _resolve_figure(block):
    fig = getattr(block, "fig", None)
    if fig is not None:
        return fig, False
    make = getattr(block, "make", None)
    if callable(make):
        try:
            return make(), True
        except Exception:  # noqa: BLE001
            return None, False
    return None, False


def _png_from_figure(fig) -> bytes:
    buf = io.BytesIO()
    fig.savefig(buf, format="png", dpi=150, bbox_inches="tight")
    buf.seek(0)
    return buf.read()

def _place_image_array(st: _PdfState, arr, caption) -> None:
    h_px, w_px = arr.shape[0], arr.shape[1]
    aspect = (h_px / w_px) if w_px else 1.0
    max_h = _CONTENT_BOTTOM - _CONTENT_TOP
    target_w = _USABLE_W
    target_h = target_w * aspect
    if target_h > max_h:
        target_h = max_h
        target_w = target_h / aspect if aspect else _USABLE_W
    cap_h = tl.line_height_in(_FS_NOTE) + 0.04 if caption else 0.0
    # Move the whole image to the next page if it does not fit in the remaining
    # space. target_h is already clamped to max_h, so the image itself always
    # fits on a fresh page.
    if _remaining(st) < target_h + cap_h:
        _new_page(st)
    left_frac = _xf(_ML + (_USABLE_W - target_w) / 2.0)
    bottom_frac = _yf(st.y + target_h)
    ax = st.fig.add_axes([left_frac, bottom_frac, target_w / _W, target_h / _H])
    ax.imshow(arr)
    ax.axis("off")
    st.y += target_h + 0.04
    if caption:
        _place_text_lines(st, tl.wrap(model._safe_str(caption),
                                      tl.chars_per_line(_USABLE_W, _FS_NOTE)),
                          _FS_NOTE, _MUTED, style="italic")
    st.y += _GAP

def _place_figure(st: _PdfState, block) -> None:
|
||||
fig, owned = _resolve_figure(block)
|
||||
if fig is None:
|
||||
_place_text_lines(st, ["(figura no disponible)"], _FS_NOTE, _MUTED,
|
||||
style="italic")
|
||||
st.y += _GAP
|
||||
return
|
||||
try:
|
||||
png = _png_from_figure(fig)
|
||||
finally:
|
||||
if owned:
|
||||
try:
|
||||
plt.close(fig)
|
||||
except Exception: # noqa: BLE001
|
||||
pass
|
||||
arr = mpimg.imread(io.BytesIO(png))
|
||||
_place_image_array(st, arr, getattr(block, "caption", None))
|
||||
|
||||
|
||||
def _place_image(st: _PdfState, block) -> None:
|
||||
path = getattr(block, "path", "")
|
||||
if not path or not os.path.exists(path):
|
||||
_place_text_lines(st, [f"(imagen no encontrada: {path})"], _FS_NOTE,
|
||||
_MUTED, style="italic")
|
||||
st.y += _GAP
|
||||
return
|
||||
arr = mpimg.imread(path)
|
||||
_place_image_array(st, arr, getattr(block, "caption", None))
|
||||
|
||||
|
||||
def _place_caption(st: _PdfState, block) -> None:
|
||||
_place_text_lines(st, tl.wrap(getattr(block, "text", ""),
|
||||
tl.chars_per_line(_USABLE_W, _FS_NOTE)),
|
||||
_FS_NOTE, _MUTED, style="italic")
|
||||
st.y += _GAP
|
||||
|
||||
|
||||
def _place_note(st: _PdfState, block) -> None:
|
||||
_place_text_lines(st, tl.wrap(getattr(block, "text", ""),
|
||||
tl.chars_per_line(_USABLE_W, _FS_NOTE)),
|
||||
_FS_NOTE, _MUTED, style="italic")
|
||||
st.y += _GAP
|
||||
|
||||
|
||||
_PLACERS = {
|
||||
"heading": _place_heading,
|
||||
"markdown": _place_markdown,
|
||||
"kv_table": _place_kv_table,
|
||||
"data_table": _place_data_table,
|
||||
"figure": _place_figure,
|
||||
"image": _place_image,
|
||||
"caption": _place_caption,
|
||||
"note": _place_note,
|
||||
}
|
||||
|
||||
|
||||
def render_pdf(chapters: list, out_path: str, meta: dict = None) -> dict:
|
||||
"""Render a list of Chapters into an A5-portrait, mobile-readable PDF.
|
||||
|
||||
Never raises. Returns ``{path, n_pages, chapters, note}`` where ``chapters``
|
||||
is a list of ``{id, version, n_pages}`` for the manifest. On a fatal write
|
||||
error ``path`` is None and ``note`` explains why.
|
||||
"""
|
||||
meta = meta or {}
|
||||
chapters = model.as_chapters(chapters)
|
||||
notes = []
|
||||
|
||||
try:
|
||||
parent = os.path.dirname(os.path.abspath(out_path))
|
||||
os.makedirs(parent, exist_ok=True)
|
||||
except OSError as e:
|
||||
return {"path": None, "n_pages": 0, "chapters": [],
|
||||
"note": f"no se pudo crear el directorio destino: {e}"}
|
||||
|
||||
title = meta.get("title") or model.ENGINE_NAME
|
||||
chapters_meta = []
|
||||
try:
|
||||
with plt.rc_context(_RC):
|
||||
with PdfPages(out_path) as pdf:
|
||||
st = _PdfState(pdf, title)
|
||||
for ch in chapters:
|
||||
st.chapter = ch
|
||||
st.chapter_pages = 0
|
||||
_new_page(st) # each chapter starts on a fresh page.
|
||||
for block in ch.blocks:
|
||||
placer = _PLACERS.get(getattr(block, "kind", ""),
|
||||
_place_note)
|
||||
try:
|
||||
placer(st, block)
|
||||
except Exception as e: # noqa: BLE001
|
||||
notes.append(
|
||||
f"bloque '{getattr(block, 'kind', '?')}' del "
|
||||
f"capítulo '{ch.id}' omitido: {e}")
|
||||
chapters_meta.append({"id": ch.id, "version": ch.version,
|
||||
"n_pages": st.chapter_pages})
|
||||
_flush_page(st)
|
||||
if st.page == 0:
|
||||
# No chapters at all → guarantee one valid page.
|
||||
st.chapter = model.Chapter(id="vacio", title=title,
|
||||
version=model.ENGINE_VERSION)
|
||||
_new_page(st)
|
||||
_place_note(st, model.Note(
|
||||
"(documento vacío — sin capítulos aplicables)"))
|
||||
_flush_page(st)
|
||||
n_pages = st.page
|
||||
except Exception as e: # noqa: BLE001
|
||||
return {"path": None, "n_pages": 0, "chapters": [],
|
||||
"note": f"fallo al escribir el PDF: {e}"}
|
||||
|
||||
note = f"{n_pages} páginas"
|
||||
if notes:
|
||||
note += " · " + "; ".join(notes)
|
||||
return {"path": out_path, "n_pages": n_pages, "chapters": chapters_meta,
|
||||
"note": note}
|
||||
@@ -1,518 +0,0 @@
|
||||
"""AutomaticEDA PPTX renderer — 16:9 slides, never cuts content.
|
||||
|
||||
Same flow principle as the PDF renderer but onto PowerPoint slides: measure each
|
||||
block and place it top-to-bottom; when it does not fit in the remaining slide
|
||||
space, continue on a new slide titled ``<Chapter> (cont.)``. Data tables split by
|
||||
rows **repeating the header**; figures/images are scaled to fit entirely. Every
|
||||
slide carries a footer ``<Chapter> · v<version>`` plus the engine version.
|
||||
|
||||
dict-no-throw: a failure inside one block is caught and noted; the deck is always
|
||||
produced with at least one slide. Engine: ``python-pptx`` (added dependency).
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import io
|
||||
import os
|
||||
|
||||
from . import model
|
||||
from . import text_layout as tl
|
||||
|
||||
try:
|
||||
from pptx import Presentation
|
||||
from pptx.util import Inches, Pt, Emu
|
||||
from pptx.dml.color import RGBColor
|
||||
from pptx.enum.text import PP_ALIGN
|
||||
_PPTX_OK = True
|
||||
_PPTX_ERR = ""
|
||||
except Exception as _e: # noqa: BLE001 — surfaced as a dict-no-throw note.
|
||||
_PPTX_OK = False
|
||||
_PPTX_ERR = str(_e)
|
||||
|
||||
# 16:9 widescreen, inches.
|
||||
_W, _H = 13.333, 7.5
|
||||
_ML, _MR = 0.7, 0.7
|
||||
_TITLE_TOP, _TITLE_H = 0.28, 0.7
|
||||
_CONTENT_TOP = 1.12
|
||||
_FOOTER_H = 0.4
|
||||
_CONTENT_BOTTOM = _H - _FOOTER_H - 0.15
|
||||
_USABLE_W = _W - _ML - _MR
|
||||
|
||||
_INK = (0x1B, 0x1B, 0x1B)
|
||||
_ACCENT = (0x2A, 0x6F, 0x97)
|
||||
_MUTED = (0x8A, 0x8A, 0x8A)
|
||||
_HEAD_BG = (0xEE, 0xF3, 0xF6)
|
||||
_WHITE = (0xFF, 0xFF, 0xFF)
|
||||
|
||||
_FS_TITLE = 26
|
||||
_FS_H1, _FS_H2, _FS_H3 = 20, 16, 13
|
||||
_FS_BODY, _FS_CELL, _FS_NOTE = 14, 11, 11
|
||||
_GAP = 0.12
|
||||
|
||||
|
||||
class _PptxState:
|
||||
def __init__(self, prs, title: str):
|
||||
self.prs = prs
|
||||
self.title = title
|
||||
self.slide = None
|
||||
self.y = _CONTENT_TOP
|
||||
self.chapter = None
|
||||
self.slide_no = 0
|
||||
self.chapter_slides = 0
|
||||
|
||||
|
||||
def _rgb(c):
|
||||
return RGBColor(*c)
|
||||
|
||||
|
||||
def _new_slide(st: _PptxState, cont: bool = False) -> None:
|
||||
blank = st.prs.slide_layouts[6]
|
||||
st.slide = st.prs.slides.add_slide(blank)
|
||||
st.y = _CONTENT_TOP
|
||||
st.slide_no += 1
|
||||
st.chapter_slides += 1
|
||||
_draw_title(st, cont)
|
||||
_draw_footer(st)
|
||||
|
||||
|
||||
def _draw_title(st: _PptxState, cont: bool) -> None:
|
||||
ch = st.chapter
|
||||
title = ch.title if ch is not None else st.title
|
||||
if cont:
|
||||
title = f"{title} (cont.)"
|
||||
box = st.slide.shapes.add_textbox(
|
||||
Inches(_ML), Inches(_TITLE_TOP), Inches(_USABLE_W), Inches(_TITLE_H))
|
||||
tf = box.text_frame
|
||||
tf.word_wrap = True
|
||||
p = tf.paragraphs[0]
|
||||
run = p.add_run()
|
||||
run.text = title
|
||||
run.font.size = Pt(_FS_TITLE)
|
||||
run.font.bold = True
|
||||
run.font.color.rgb = _rgb(_INK)
|
||||
|
||||
|
||||
def _draw_footer(st: _PptxState) -> None:
|
||||
ch = st.chapter
|
||||
left = f"{ch.title} · v{ch.version}" if ch is not None else ""
|
||||
right = f"{model.ENGINE_NAME} v{model.ENGINE_VERSION} · {st.slide_no}"
|
||||
box = st.slide.shapes.add_textbox(
|
||||
Inches(_ML), Inches(_H - _FOOTER_H), Inches(_USABLE_W),
|
||||
Inches(_FOOTER_H * 0.7))
|
||||
tf = box.text_frame
|
||||
tf.word_wrap = False
|
||||
p = tf.paragraphs[0]
|
||||
r = p.add_run()
|
||||
r.text = left
|
||||
r.font.size = Pt(9)
|
||||
r.font.color.rgb = _rgb(_MUTED)
|
||||
# Right-aligned engine stamp on a second textbox.
|
||||
box2 = st.slide.shapes.add_textbox(
|
||||
Inches(_ML), Inches(_H - _FOOTER_H), Inches(_USABLE_W),
|
||||
Inches(_FOOTER_H * 0.7))
|
||||
tf2 = box2.text_frame
|
||||
p2 = tf2.paragraphs[0]
|
||||
p2.alignment = PP_ALIGN.RIGHT
|
||||
r2 = p2.add_run()
|
||||
r2.text = right
|
||||
r2.font.size = Pt(9)
|
||||
r2.font.color.rgb = _rgb(_MUTED)
|
||||
|
||||
|
||||
def _remaining(st: _PptxState) -> float:
|
||||
return _CONTENT_BOTTOM - st.y
|
||||
|
||||
|
||||
def _ensure(st: _PptxState, height: float) -> None:
|
||||
if _remaining(st) < height:
|
||||
_new_slide(st, cont=True)
|
||||
|
||||
|
||||
def _add_text(st: _PptxState, lines: list, fs: float, color, bold=False,
|
||||
italic=False, indent=0.0, bullet=False) -> None:
|
||||
lh = tl.line_height_in(fs)
|
||||
height = lh * len(lines) + 0.05
|
||||
_ensure(st, height)
|
||||
box = st.slide.shapes.add_textbox(
|
||||
Inches(_ML + indent), Inches(st.y), Inches(_USABLE_W - indent),
|
||||
Inches(height))
|
||||
tf = box.text_frame
|
||||
tf.word_wrap = True
|
||||
first = True
|
||||
for ln in lines:
|
||||
p = tf.paragraphs[0] if first else tf.add_paragraph()
|
||||
first = False
|
||||
run = p.add_run()
|
||||
run.text = ("• " + ln) if bullet else ln
|
||||
run.font.size = Pt(fs)
|
||||
run.font.bold = bold
|
||||
run.font.italic = italic
|
||||
run.font.color.rgb = _rgb(color)
|
||||
st.y += height
|
||||
|
||||
|
||||
def _place_heading(st: _PptxState, block) -> None:
|
||||
level = max(1, min(3, int(getattr(block, "level", 1) or 1)))
|
||||
fs = {1: _FS_H1, 2: _FS_H2, 3: _FS_H3}[level]
|
||||
text = tl.strip_inline_md(getattr(block, "text", ""))
|
||||
lines = tl.wrap(text, tl.chars_per_line(_USABLE_W, fs))
|
||||
_add_text(st, lines, fs, _INK, bold=True)
|
||||
st.y += 0.04
|
||||
|
||||
|
||||
def _place_markdown(st: _PptxState, block) -> None:
|
||||
raw = str(getattr(block, "text", "") or "")
|
||||
md_lines = raw.split("\n")
|
||||
i, n = 0, len(md_lines)
|
||||
while i < n:
|
||||
stripped = md_lines[i].strip()
|
||||
if stripped.startswith("|") and stripped.endswith("|"):
|
||||
j = i
|
||||
tbl = []
|
||||
while j < n and md_lines[j].strip().startswith("|") \
|
||||
and md_lines[j].strip().endswith("|"):
|
||||
tbl.append(md_lines[j])
|
||||
j += 1
|
||||
parsed = tl.parse_md_table(tbl)
|
||||
if parsed:
|
||||
header, rows = parsed
|
||||
_place_data_table(st, model.DataTable(header=header, rows=rows))
|
||||
i = j
|
||||
continue
|
||||
if stripped == "":
|
||||
st.y += tl.line_height_in(_FS_BODY) * 0.4
|
||||
i += 1
|
||||
continue
|
||||
if stripped.startswith("### "):
|
||||
_place_heading(st, model.Heading(stripped[4:], level=3))
|
||||
i += 1
|
||||
continue
|
||||
if stripped.startswith("## "):
|
||||
_place_heading(st, model.Heading(stripped[3:], level=2))
|
||||
i += 1
|
||||
continue
|
||||
if stripped.startswith("# "):
|
||||
_place_heading(st, model.Heading(stripped[2:], level=1))
|
||||
i += 1
|
||||
continue
|
||||
if stripped.startswith("- ") or stripped.startswith("* "):
|
||||
content = tl.strip_inline_md(stripped[2:])
|
||||
lines = tl.wrap(content, tl.chars_per_line(_USABLE_W - 0.3, _FS_BODY))
|
||||
_add_text(st, lines, _FS_BODY, _INK, bullet=True)
|
||||
i += 1
|
||||
continue
|
||||
para = [tl.strip_inline_md(stripped)]
|
||||
j = i + 1
|
||||
while j < n:
|
||||
nxt = md_lines[j].strip()
|
||||
if nxt == "" or nxt.startswith(("|", "#", "- ", "* ")):
|
||||
break
|
||||
para.append(tl.strip_inline_md(nxt))
|
||||
j += 1
|
||||
text = " ".join(para)
|
||||
_add_text(st, tl.wrap(text, tl.chars_per_line(_USABLE_W, _FS_BODY)),
|
||||
_FS_BODY, _INK)
|
||||
i = j
|
||||
st.y += _GAP
|
||||
|
||||
|
||||
def _place_kv_table(st: _PptxState, block) -> None:
|
||||
title = getattr(block, "title", None)
|
||||
if title:
|
||||
_place_heading(st, model.Heading(title, level=2))
|
||||
rows = getattr(block, "rows", []) or []
|
||||
data_rows = []
|
||||
for row in rows:
|
||||
try:
|
||||
label, value = row[0], row[1]
|
||||
except Exception: # noqa: BLE001
|
||||
label, value = str(row), ""
|
||||
data_rows.append([model._safe_str(label), model._safe_str(value)])
|
||||
_place_data_table(st, model.DataTable(header=["Campo", "Valor"],
|
||||
rows=data_rows), shaded_header=True,
|
||||
key_value=True)
|
||||
|
||||
|
||||
def _col_widths(header, rows):
|
||||
ncol = len(header) if header else (len(rows[0]) if rows else 1)
|
||||
ncol = max(1, ncol)
|
||||
natural = [3] * ncol
|
||||
for c in range(ncol):
|
||||
if header and c < len(header):
|
||||
natural[c] = max(natural[c], len(model._safe_str(header[c])))
|
||||
for r in rows:
|
||||
if c < len(r):
|
||||
natural[c] = max(natural[c], len(model._safe_str(r[c])))
|
||||
clamped = [min(max(w, 4), 44) for w in natural]
|
||||
total = float(sum(clamped)) or 1.0
|
||||
return [_USABLE_W * w / total for w in clamped]
|
||||
|
||||
|
||||
def _row_height_in(cells, widths, fs) -> float:
|
||||
lh = tl.line_height_in(fs)
|
||||
maxlines = 1
|
||||
for c, w in enumerate(widths):
|
||||
text = model._safe_str(cells[c]) if c < len(cells) else ""
|
||||
lines = tl.wrap(text, tl.chars_per_line(w - 0.12, fs))
|
||||
maxlines = max(maxlines, len(lines))
|
||||
return lh * maxlines + 0.10
|
||||
|
||||
|
||||
def _emit_table(st: _PptxState, header, chunk, widths, fs) -> None:
|
||||
nrows = len(chunk) + (1 if header else 0)
|
||||
ncol = len(widths)
|
||||
# Pre-measure total height to size the shape (pptx still auto-grows rows).
|
||||
heights = []
|
||||
if header:
|
||||
heights.append(_row_height_in(header, widths, fs))
|
||||
for r in chunk:
|
||||
heights.append(_row_height_in(r, widths, fs))
|
||||
total_h = sum(heights)
|
||||
gtable = st.slide.shapes.add_table(
|
||||
nrows, ncol, Inches(_ML), Inches(st.y), Inches(_USABLE_W),
|
||||
Inches(total_h)).table
|
||||
gtable.first_row = bool(header)
|
||||
gtable.horz_banding = False
|
||||
for c in range(ncol):
|
||||
gtable.columns[c].width = Emu(int(Inches(widths[c])))
|
||||
ridx = 0
|
||||
if header:
|
||||
for c in range(ncol):
|
||||
cell = gtable.cell(0, c)
|
||||
cell.text = model._safe_str(header[c]) if c < len(header) else ""
|
||||
_style_cell(cell, fs, _INK, bold=True, fill=_HEAD_BG)
|
||||
ridx = 1
|
||||
for r in chunk:
|
||||
for c in range(ncol):
|
||||
cell = gtable.cell(ridx, c)
|
||||
cell.text = model._safe_str(r[c]) if c < len(r) else ""
|
||||
_style_cell(cell, fs, _INK, bold=False, fill=_WHITE)
|
||||
ridx += 1
|
||||
st.y += total_h + _GAP
|
||||
|
||||
|
||||
def _style_cell(cell, fs, color, bold, fill) -> None:
|
||||
cell.fill.solid()
|
||||
cell.fill.fore_color.rgb = _rgb(fill)
|
||||
cell.margin_left = Inches(0.05)
|
||||
cell.margin_right = Inches(0.05)
|
||||
cell.margin_top = Inches(0.02)
|
||||
cell.margin_bottom = Inches(0.02)
|
||||
for p in cell.text_frame.paragraphs:
|
||||
for run in p.runs:
|
||||
run.font.size = Pt(fs)
|
||||
run.font.bold = bold
|
||||
run.font.color.rgb = _rgb(color)
|
||||
|
||||
|
||||
def _place_data_table(st: _PptxState, block, shaded_header=True,
|
||||
key_value=False) -> None:
|
||||
title = getattr(block, "title", None)
|
||||
if title:
|
||||
_place_heading(st, model.Heading(title, level=2))
|
||||
header = list(getattr(block, "header", []) or [])
|
||||
rows = list(getattr(block, "rows", []) or [])
|
||||
fs = _FS_CELL
|
||||
widths = _col_widths(header, rows)
|
||||
header_h = _row_height_in(header, widths, fs) if header else 0.0
|
||||
|
||||
idx = 0
|
||||
n = len(rows)
|
||||
if n == 0:
|
||||
# Header-only table still rendered (one slide).
|
||||
_ensure(st, header_h + 0.2)
|
||||
_emit_table(st, header, [], widths, fs)
|
||||
return
|
||||
while idx < n:
|
||||
# Greedily fill the current slide with as many rows as fit.
|
||||
if _remaining(st) < header_h + _row_height_in(rows[idx], widths, fs):
|
||||
_new_slide(st, cont=True)
|
||||
avail = _remaining(st) - header_h
|
||||
chunk = []
|
||||
used = 0.0
|
||||
while idx < n:
|
||||
rh = _row_height_in(rows[idx], widths, fs)
|
||||
if used + rh > avail and chunk:
|
||||
break
|
||||
chunk.append(rows[idx])
|
||||
used += rh
|
||||
idx += 1
|
||||
_emit_table(st, header, chunk, widths, fs)
|
||||
note = getattr(block, "note", None)
|
||||
if note:
|
||||
_add_text(st, tl.wrap(model._safe_str(note),
|
||||
tl.chars_per_line(_USABLE_W, _FS_NOTE)), _FS_NOTE, _MUTED,
|
||||
italic=True)
|
||||
|
||||
|
||||
def _img_size_px(data: bytes):
|
||||
try:
|
||||
from PIL import Image
|
||||
with Image.open(io.BytesIO(data)) as im:
|
||||
return im.size # (w, h)
|
||||
except Exception: # noqa: BLE001
|
||||
return (1200, 800)
|
||||
|
||||
|
||||
def _resolve_png(block):
|
||||
fig = getattr(block, "fig", None)
|
||||
make = getattr(block, "make", None)
|
||||
f = fig
|
||||
owned = False
|
||||
if f is None and callable(make):
|
||||
try:
|
||||
f = make()
|
||||
owned = True
|
||||
except Exception: # noqa: BLE001
|
||||
f = None
|
||||
if f is None:
|
||||
return None
|
||||
try:
|
||||
import matplotlib.pyplot as plt
|
||||
buf = io.BytesIO()
|
||||
f.savefig(buf, format="png", dpi=150, bbox_inches="tight")
|
||||
buf.seek(0)
|
||||
return buf.read()
|
||||
except Exception: # noqa: BLE001
|
||||
return None
|
||||
finally:
|
||||
if owned:
|
||||
try:
|
||||
import matplotlib.pyplot as plt
|
||||
plt.close(f)
|
||||
except Exception: # noqa: BLE001
|
||||
pass
|
||||
|
||||
|
||||
def _place_picture_bytes(st: _PptxState, data: bytes, caption) -> None:
|
||||
w_px, h_px = _img_size_px(data)
|
||||
aspect = (h_px / w_px) if w_px else 0.66
|
||||
max_h = _CONTENT_BOTTOM - _CONTENT_TOP
|
||||
target_w = _USABLE_W
|
||||
target_h = target_w * aspect
|
||||
if target_h > max_h:
|
||||
target_h = max_h
|
||||
target_w = target_h / aspect if aspect else _USABLE_W
|
||||
cap_h = tl.line_height_in(_FS_NOTE) + 0.05 if caption else 0.0
|
||||
if _remaining(st) < target_h + cap_h:
|
||||
_new_slide(st, cont=True)
|
||||
left = _ML + (_USABLE_W - target_w) / 2.0
|
||||
st.slide.shapes.add_picture(io.BytesIO(data), Inches(left), Inches(st.y),
|
||||
width=Inches(target_w), height=Inches(target_h))
|
||||
st.y += target_h + 0.05
|
||||
if caption:
|
||||
_add_text(st, tl.wrap(model._safe_str(caption),
|
||||
tl.chars_per_line(_USABLE_W, _FS_NOTE)), _FS_NOTE, _MUTED,
|
||||
italic=True)
|
||||
st.y += _GAP
|
||||
|
||||
|
||||
def _place_figure(st: _PptxState, block) -> None:
|
||||
png = _resolve_png(block)
|
||||
if png is None:
|
||||
_add_text(st, ["(figura no disponible)"], _FS_NOTE, _MUTED, italic=True)
|
||||
st.y += _GAP
|
||||
return
|
||||
_place_picture_bytes(st, png, getattr(block, "caption", None))
|
||||
|
||||
|
||||
def _place_image(st: _PptxState, block) -> None:
|
||||
path = getattr(block, "path", "")
|
||||
if not path or not os.path.exists(path):
|
||||
_add_text(st, [f"(imagen no encontrada: {path})"], _FS_NOTE, _MUTED,
|
||||
italic=True)
|
||||
st.y += _GAP
|
||||
return
|
||||
try:
|
||||
with open(path, "rb") as fh:
|
||||
data = fh.read()
|
||||
except Exception as e: # noqa: BLE001
|
||||
_add_text(st, [f"(no se pudo leer la imagen: {e})"], _FS_NOTE, _MUTED,
|
||||
italic=True)
|
||||
st.y += _GAP
|
||||
return
|
||||
_place_picture_bytes(st, data, getattr(block, "caption", None))
|
||||
|
||||
|
||||
def _place_caption(st: _PptxState, block) -> None:
|
||||
_add_text(st, tl.wrap(getattr(block, "text", ""),
|
||||
tl.chars_per_line(_USABLE_W, _FS_NOTE)), _FS_NOTE, _MUTED,
|
||||
italic=True)
|
||||
st.y += _GAP
|
||||
|
||||
|
||||
def _place_note(st: _PptxState, block) -> None:
|
||||
_place_caption(st, block)
|
||||
|
||||
|
||||
_PLACERS = {
|
||||
"heading": _place_heading,
|
||||
"markdown": _place_markdown,
|
||||
"kv_table": _place_kv_table,
|
||||
"data_table": _place_data_table,
|
||||
"figure": _place_figure,
|
||||
"image": _place_image,
|
||||
"caption": _place_caption,
|
||||
"note": _place_note,
|
||||
}
|
||||
|
||||
|
||||
def render_pptx(chapters: list, out_path: str, meta: dict = None) -> dict:
|
||||
"""Render a list of Chapters into a 16:9 PPTX deck. Never raises.
|
||||
|
||||
Returns ``{path, n_slides, chapters, note}`` where ``chapters`` is a list of
|
||||
``{id, version, n_slides}`` for the manifest. On a fatal error ``path`` is
|
||||
None and ``note`` explains why (e.g. python-pptx not installed).
|
||||
"""
|
||||
meta = meta or {}
|
||||
if not _PPTX_OK:
|
||||
return {"path": None, "n_slides": 0, "chapters": [],
|
||||
"note": f"python-pptx no disponible: {_PPTX_ERR}"}
|
||||
|
||||
chapters = model.as_chapters(chapters)
|
||||
notes = []
|
||||
try:
|
||||
parent = os.path.dirname(os.path.abspath(out_path))
|
||||
os.makedirs(parent, exist_ok=True)
|
||||
except OSError as e:
|
||||
return {"path": None, "n_slides": 0, "chapters": [],
|
||||
"note": f"no se pudo crear el directorio destino: {e}"}
|
||||
|
||||
title = meta.get("title") or model.ENGINE_NAME
|
||||
chapters_meta = []
|
||||
try:
|
||||
prs = Presentation()
|
||||
prs.slide_width = Inches(_W)
|
||||
prs.slide_height = Inches(_H)
|
||||
st = _PptxState(prs, title)
|
||||
for ch in chapters:
|
||||
st.chapter = ch
|
||||
st.chapter_slides = 0
|
||||
_new_slide(st, cont=False)
|
||||
for block in ch.blocks:
|
||||
placer = _PLACERS.get(getattr(block, "kind", ""), _place_note)
|
||||
try:
|
||||
placer(st, block)
|
||||
except Exception as e: # noqa: BLE001
|
||||
notes.append(
|
||||
f"bloque '{getattr(block, 'kind', '?')}' del capítulo "
|
||||
f"'{ch.id}' omitido: {e}")
|
||||
chapters_meta.append({"id": ch.id, "version": ch.version,
|
||||
"n_slides": st.chapter_slides})
|
||||
if st.slide_no == 0:
|
||||
st.chapter = model.Chapter(id="vacio", title=title,
|
||||
version=model.ENGINE_VERSION)
|
||||
_new_slide(st, cont=False)
|
||||
_place_note(st, model.Note(
|
||||
"(documento vacío — sin capítulos aplicables)"))
|
||||
prs.save(out_path)
|
||||
n_slides = st.slide_no
|
||||
except Exception as e: # noqa: BLE001
|
||||
return {"path": None, "n_slides": 0, "chapters": [],
|
||||
"note": f"fallo al escribir el PPTX: {e}"}
|
||||
|
||||
note = f"{n_slides} slides"
|
||||
if notes:
|
||||
note += " · " + "; ".join(notes)
|
||||
return {"path": out_path, "n_slides": n_slides, "chapters": chapters_meta,
|
||||
"note": note}
|
||||
@@ -1,107 +0,0 @@
"""Shared text-measurement helpers for the AutomaticEDA renderers.

Both renderers flow content top-to-bottom and must know, *before* placing a
block, how much vertical space it will take. That is what guarantees nothing is
cut: a unit either fits in the remaining space or moves to the next page/slide
whole. Measuring proportional text exactly in matplotlib/pptx is impractical, so
we use a deterministic character-grid estimate (chars-per-line from an average
glyph width). The estimate slightly over-estimates for typical text and so errs
on the safe side: it is designed not to claim something fits when it would
overflow.

Wrapping is word-aware (``textwrap``) and additionally hard-splits any single
token longer than the line, so a 200-character value still wraps instead of
overflowing. That is wrapping, not loss: every character is still rendered.
"""

from __future__ import annotations

import textwrap


def avg_char_width_in(fontsize_pt: float) -> float:
    """Approximate average glyph width in inches for a sans-serif font.

    ~0.5 of the point size is a conservative mean advance width for proportional
    sans fonts; dividing by 72 converts points to inches.
    """
    return 0.5 * fontsize_pt / 72.0


def line_height_in(fontsize_pt: float, leading: float = 1.32) -> float:
    """Line height in inches for a given font size and leading."""
    return leading * fontsize_pt / 72.0


def chars_per_line(width_in: float, fontsize_pt: float) -> int:
    """How many average glyphs fit in ``width_in`` at ``fontsize_pt``."""
    cw = avg_char_width_in(fontsize_pt)
    if cw <= 0:
        return 80
    n = int(width_in / cw)
    return max(1, n)


def wrap(text: str, max_chars: int) -> list:
    """Word-wrap ``text`` to lines of at most ``max_chars``, never losing chars.

    Long tokens (no spaces) are hard-split so they cannot overflow. Existing
    newlines are honored as hard breaks. Empty input yields a single empty line
    so callers can still reserve a row.
    """
    if max_chars < 1:
        max_chars = 1
    s = "" if text is None else str(text)
    out: list = []
    for raw_line in s.split("\n"):
        if raw_line == "":
            out.append("")
            continue
        # textwrap with break_long_words so no token overflows the column.
        wrapped = textwrap.wrap(
            raw_line, width=max_chars, break_long_words=True,
            break_on_hyphens=False, replace_whitespace=True,
            drop_whitespace=True,
        )
        if not wrapped:
            out.append("")
        else:
            out.extend(wrapped)
    return out or [""]


def strip_inline_md(text: str) -> str:
    """Strip a tiny subset of inline markdown markers, keeping the text.

    Removes ``**bold**`` / ``__bold__`` / `` `code` `` markers so the content is
    preserved without trying to style spans (which the line-grid layout cannot
    do). Single-asterisk ``*em*`` markers are left untouched. Nothing is dropped
    except the markers themselves.
    """
    if not text:
        return ""
    s = str(text)
    for marker in ("**", "__", "`"):
        s = s.replace(marker, "")
    return s


def parse_md_table(lines: list):
    """Parse consecutive ``| a | b |`` lines into ``(header, rows)`` or None.

    Accepts an optional separator row (``|---|---|``) right after the header,
    which is ignored. Returns None if the lines are not a pipe table.
    """
    cells_rows = []
    for ln in lines:
        s = ln.strip()
        if not (s.startswith("|") and s.endswith("|")):
            return None
        parts = [c.strip() for c in s.strip("|").split("|")]
        cells_rows.append(parts)
    if not cells_rows:
        return None
    header = cells_rows[0]
    body = cells_rows[1:]
    # Drop a markdown separator row (all cells are dashes/colons).
    if body and all(set(c) <= set("-: ") and "-" in c for c in body[0]):
        body = body[1:]
    return header, body
@@ -1,115 +0,0 @@
|
||||
---
id: categorical_cardinality_block_py_datascience
name: categorical_cardinality_block
kind: function
lang: py
domain: datascience
version: "1.0.0"
purity: pure
signature: "def categorical_cardinality_block(cat: dict, n_rows: int) -> dict"
description: "Derives render-ready cardinality metrics from the output of summarize_categorical for ONE categorical column plus the total number of rows. Computes pct_distinct, entropy_max=log2(n_distinct), entropy_norm (clamped to [0,1]), n_singletons (over the visible top) and the id_like / dominated flags. It does NOT recompute the entropy or reimplement summarize_categorical: it consumes its output. Dict-no-throw style of the eda group: it never raises."
tags: [eda, categorical, cardinality, entropy, profiling, datascience, pure]
uses_functions: []
uses_types: []
returns: []
returns_optional: false
error_type: ""
imports: [math]
example: |
  from categorical_cardinality_block import categorical_cardinality_block
  cat = {"top": [{"value": "a", "count": 5, "pct": 0.5}], "mode": "a",
         "mode_pct": 0.5, "n_distinct": 4, "entropy": 1.685, "imbalance": 5.0,
         "len_min": 1, "len_mean": 1.0, "len_max": 1}
  block = categorical_cardinality_block(cat, n_rows=10)
tested: true
tests:
  - "test_normal_case"
  - "test_empty_cat_does_not_raise"
  - "test_none_cat_does_not_raise"
  - "test_n_rows_zero_no_zero_division"
  - "test_id_like_when_distinct_near_rows"
  - "test_dominated_when_mode_pct_high"
  - "test_mode_pct_fallback_from_top_fraction"
  - "test_n_singletons_partial_when_top_truncated"
  - "test_single_distinct_value_entropy_norm_none"
test_file_path: "python/functions/datascience/categorical_cardinality_block_test.py"
file_path: "python/functions/datascience/categorical_cardinality_block.py"
params:
  - name: cat
    desc: "Dict produced by summarize_categorical for ONE categorical column. Keys read (all optional, read defensively): top (list of {value,count,pct}), mode, mode_pct (may be missing), n_distinct, entropy (Shannon, in bits), imbalance, len_min, len_mean, len_max. None or a non-dict is treated as {}."
  - name: n_rows
    desc: "Total number of rows in the dataset. Used for pct_distinct. If it is 0 or None, pct_distinct is None (no ZeroDivisionError)."
output: "Dict with exactly 16 keys, all always present: n_distinct, n_rows, pct_distinct, entropy, entropy_max, entropy_norm, mode, mode_pct, imbalance, n_singletons, n_singletons_partial, len_min, len_mean, len_max, id_like, dominated. Values are None/False when they cannot be derived; the function never raises. pct_distinct is on a 0-100 scale. entropy_max=log2(n_distinct) (0.0 if n_distinct in {0,1}). entropy_norm=entropy/entropy_max clamped to [0,1]. n_singletons = number of top entries with count==1 (None if top is empty). n_singletons_partial=True if n_distinct>len(top). id_like=pct_distinct>=99. dominated=mode_pct>=90."
---
|
||||
## Example

```python
from categorical_cardinality_block import categorical_cardinality_block

# Typical summarize_categorical output for one column, plus the dataset's n_rows.
cat = {
    "top": [
        {"value": "a", "count": 5, "pct": 0.5},
        {"value": "b", "count": 3, "pct": 0.3},
        {"value": "c", "count": 1, "pct": 0.1},
        {"value": "d", "count": 1, "pct": 0.1},
    ],
    "mode": "a",
    "mode_pct": 0.5,
    "n_distinct": 4,
    "entropy": 1.685,  # Shannon entropy in bits (<= log2(4) = 2.0)
    "imbalance": 5.0,
    "len_min": 1, "len_mean": 1.0, "len_max": 1,
}

categorical_cardinality_block(cat, n_rows=10)
# {
#   "n_distinct": 4, "n_rows": 10,
#   "pct_distinct": 40.0,          # 4 / 10 * 100
#   "entropy": 1.685,
#   "entropy_max": 2.0,            # log2(4)
#   "entropy_norm": 0.8425,        # 1.685 / 2.0, clamped to [0,1]
#   "mode": "a", "mode_pct": 0.5,
#   "imbalance": 5.0,
#   "n_singletons": 2,             # c and d have count == 1
#   "n_singletons_partial": False, # top covers all 4 distinct values
#   "len_min": 1, "len_mean": 1.0, "len_max": 1,
#   "id_like": False,              # pct_distinct 40 < 99
#   "dominated": False,            # mode_pct 0.5 < 90
# }
```
|
||||
## When to use it

Use it right after `summarize_categorical`, when you are about to render the
cardinality block of a categorical column in an EDA: you need the ratio of
distinct values (`pct_distinct`), the entropy normalized to `[0,1]` so that
columns with different cardinalities can be compared, the singleton count, and
the heuristic flags `id_like` (the column looks like an identifier) and
`dominated` (a single category dominates). Pass it the raw
`summarize_categorical` dict for that column and the dataset's total `n_rows`.
It reimplements nothing: it only derives presentation metrics from what has
already been computed.
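
To illustrate why the normalized entropy makes columns with different cardinalities comparable, here is a minimal sketch of the derivation described above. `normalized_entropy` is a hypothetical helper written for this example, not the function's actual code; it only assumes the documented rule `entropy_norm = entropy / log2(n_distinct)`, clamped to `[0,1]` and `None` for constant columns. The second column's entropy value is invented for the comparison.

```python
from math import log2


def normalized_entropy(entropy, n_distinct):
    """Hypothetical helper: entropy / log2(n_distinct), clamped to [0, 1].

    Returns None when entropy_max would be 0 (n_distinct is 0 or 1), so no
    division by zero happens and no value is invented.
    """
    if not n_distinct or n_distinct <= 1:
        return None
    entropy_max = log2(n_distinct)
    return min(1.0, max(0.0, entropy / entropy_max))


# Two columns with different cardinalities: raw entropies are not comparable,
# normalized ones are.
col_a = normalized_entropy(1.685, 4)   # 4 distinct values, entropy_max = 2.0
col_b = normalized_entropy(3.2, 16)    # 16 distinct values, entropy_max = 4.0

print(col_a, col_b)
```

Although the second column has almost twice the raw entropy, its normalized value is slightly lower, so it is relatively less evenly spread than the first.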
|
||||
|
||||
## Gotchas

- **`mode_pct` is passed through exactly as it arrives in `cat`.**
  `summarize_categorical` produces `mode_pct` as a **fraction** (0–1), not as a
  percentage. The `dominated` flag compares `mode_pct >= 90.0`, so with the raw
  `summarize_categorical` output (fractions) `dominated` never fires: feed it
  `mode_pct` on a 0–100 scale if you want to use that flag. Only the *fallback*
  path (when `cat` has no `mode_pct` and it is derived from `top[0]['pct']`)
  normalizes a fraction `<= 1` by multiplying it by 100.
- **`n_singletons` only covers the visible `top`.** If `summarize_categorical`
  was called with a small `top_k`, some values fall outside the top; in that
  case `n_singletons_partial` is `True` to warn that the count is partial.
- **`pct_distinct` is `None` if `n_rows` is 0 or `None`** (it does not raise
  `ZeroDivisionError`); `id_like` is therefore `False` in that case.
- **`entropy_norm` is `None` when `entropy_max <= 0`** (constant column,
  `n_distinct in {0,1}`): there is no division by zero and no made-up 0/1.
- **It does not recompute the entropy.** If `cat['entropy']` is inconsistent
  with `n_distinct`, `entropy_norm` is clamped to `[0,1]` but the input value is
  not corrected.
- **`bool` does not count as a number.** A `True`/`False` in a numeric key of
  `cat` is treated as missing (`None`) by the defensive guard.
|
||||
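The `mode_pct` gotcha can be handled on the caller side. The following is a minimal sketch, assuming `mode_pct` arrives as a fraction in `[0, 1]`; `with_percent_mode_pct` is a hypothetical helper, not part of the `eda` group, and it copies the dict rather than mutating it.

```python
def with_percent_mode_pct(cat):
    """Return a shallow copy of ``cat`` with ``mode_pct`` on a 0-100 scale.

    Hypothetical helper: it mirrors the fallback rule (a value <= 1 is taken
    to be a fraction). Bools and non-numbers are left untouched.
    """
    out = dict(cat) if isinstance(cat, dict) else {}
    value = out.get("mode_pct")
    is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
    if is_number and 0 <= value <= 1:
        out["mode_pct"] = value * 100.0
    return out


raw = {"mode": "x", "mode_pct": 0.95, "n_distinct": 3}
scaled = with_percent_mode_pct(raw)
print(scaled["mode_pct"] >= 90.0)  # the threshold `dominated` compares against
print(raw["mode_pct"])             # the original dict is unchanged
```

A `mode_pct` of exactly 1 is ambiguous (1% or 100%); the sketch follows the fallback rule and treats it as a fraction.
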
@@ -1,132 +0,0 @@
"""Pure EDA helper: cardinality metrics block from a `summarize_categorical` output.

Part of the `eda` capability group. Consumes the per-column dict produced by
``summarize_categorical`` (for a single categorical/text column) plus the total
row count of the dataset and derives render-ready cardinality metrics: distinct
ratio, normalized entropy, singleton count, and the ``id_like`` / ``dominated``
flags.

It does NOT recompute the entropy nor reimplement ``summarize_categorical``; it
only reads that function's output. Dict-no-throw style of the `eda` group: it
never raises. Missing or malformed inputs yield ``None``/``False``/``0`` for the
affected keys, never an exception. Stdlib only (``math.log2``).
"""

from math import log2


def _num(value):
    """Return ``value`` unchanged if it is a real (non-bool) number, else ``None``.

    ``bool`` is rejected on purpose: in Python ``True`` is an ``int`` but it is
    never a meaningful count/ratio here.
    """
    if isinstance(value, bool):
        return None
    if isinstance(value, (int, float)):
        return value
    return None


def categorical_cardinality_block(cat: dict, n_rows: int) -> dict:
    """Derive cardinality metrics for one categorical column.

    Args:
        cat: The per-column dict produced by ``summarize_categorical`` for a
            single categorical/text column. Expected (all optional, read
            defensively) keys: ``top`` (list of ``{value, count, pct}``),
            ``mode``, ``mode_pct``, ``n_distinct``, ``entropy`` (Shannon, bits),
            ``imbalance``, ``len_min``, ``len_mean``, ``len_max``. ``None`` or a
            non-dict is treated as ``{}``.
        n_rows: Total number of rows in the dataset (used for ``pct_distinct``).

    Returns:
        Dict with exactly these keys, every one always present:
        ``n_distinct``, ``n_rows``, ``pct_distinct``, ``entropy``,
        ``entropy_max``, ``entropy_norm``, ``mode``, ``mode_pct``,
        ``imbalance``, ``n_singletons``, ``n_singletons_partial``, ``len_min``,
        ``len_mean``, ``len_max``, ``id_like``, ``dominated``. Values are
        ``None``/``False`` when not derivable; the function never raises.
    """
    cat = cat if isinstance(cat, dict) else {}

    # --- passthroughs (numeric-validated, type preserved) ---
    n_distinct = _num(cat.get("n_distinct"))
    n_rows_out = _num(n_rows)
    entropy = _num(cat.get("entropy"))
    imbalance = _num(cat.get("imbalance"))
    len_min = _num(cat.get("len_min"))
    len_mean = _num(cat.get("len_mean"))
    len_max = _num(cat.get("len_max"))
    mode = cat.get("mode")  # any value (or None); passthrough as-is

    # --- pct_distinct ---
    if n_distinct is None or n_rows_out is None or n_rows_out == 0:
        pct_distinct = None
    else:
        pct_distinct = n_distinct / n_rows_out * 100.0

    # --- entropy_max = log2(n_distinct) ---
    if n_distinct is None:
        entropy_max = None
    elif n_distinct > 1:
        entropy_max = log2(n_distinct)
    else:  # n_distinct in {0, 1}
        entropy_max = 0.0

    # --- entropy_norm = entropy / entropy_max, clipped to [0, 1] ---
    if entropy_max is not None and entropy_max > 0 and entropy is not None:
        entropy_norm = entropy / entropy_max
        entropy_norm = max(0.0, min(1.0, entropy_norm))
    else:
        entropy_norm = None

    # --- mode_pct: prefer cat['mode_pct']; else derive from top[0].pct ---
    mode_pct = _num(cat.get("mode_pct"))
    top = cat.get("top")
    has_top = isinstance(top, (list, tuple)) and len(top) > 0
    if mode_pct is None and has_top:
        first = top[0]
        if isinstance(first, dict):
            first_pct = _num(first.get("pct"))
            if first_pct is not None:
                # Normalize to 0-100: a fraction (<= 1) becomes a percentage.
                mode_pct = first_pct * 100.0 if first_pct <= 1 else first_pct

    # --- singletons (count == 1) within the visible top ---
    if has_top:
        n_singletons = sum(
            1
            for item in top
            if isinstance(item, dict) and _num(item.get("count")) == 1
        )
    else:
        n_singletons = None

    # The singleton count only covers the visible top; there may be more
    # distinct values (and thus more singletons) outside it.
    top_len = len(top) if isinstance(top, (list, tuple)) else 0
    n_singletons_partial = bool(n_distinct is not None and n_distinct > top_len)

    # --- derived flags ---
    id_like = pct_distinct is not None and pct_distinct >= 99.0
    dominated = mode_pct is not None and mode_pct >= 90.0

    return {
        "n_distinct": n_distinct,
        "n_rows": n_rows_out,
        "pct_distinct": pct_distinct,
        "entropy": entropy,
        "entropy_max": entropy_max,
        "entropy_norm": entropy_norm,
        "mode": mode,
        "mode_pct": mode_pct,
        "imbalance": imbalance,
        "n_singletons": n_singletons,
        "n_singletons_partial": n_singletons_partial,
        "len_min": len_min,
        "len_mean": len_mean,
        "len_max": len_max,
        "id_like": id_like,
        "dominated": dominated,
    }
@@ -1,216 +0,0 @@
"""Tests for categorical_cardinality_block."""

import sys
import os
from math import log2

sys.path.insert(0, os.path.dirname(__file__))

from categorical_cardinality_block import categorical_cardinality_block


# Output contract: every call returns exactly these 16 keys.
EXPECTED_KEYS = {
    "n_distinct",
    "n_rows",
    "pct_distinct",
    "entropy",
    "entropy_max",
    "entropy_norm",
    "mode",
    "mode_pct",
    "imbalance",
    "n_singletons",
    "n_singletons_partial",
    "len_min",
    "len_mean",
    "len_max",
    "id_like",
    "dominated",
}


def _sample_cat():
    """A realistic summarize_categorical output for one column."""
    return {
        "top": [
            {"value": "a", "count": 5, "pct": 0.5},
            {"value": "b", "count": 3, "pct": 0.3},
            {"value": "c", "count": 1, "pct": 0.1},
            {"value": "d", "count": 1, "pct": 0.1},
        ],
        "mode": "a",
        "mode_pct": 0.5,
        "n_distinct": 4,
        "entropy": 1.685,  # <= log2(4) = 2.0
        "imbalance": 5.0,
        "len_min": 1,
        "len_mean": 1.0,
        "len_max": 1,
    }


def test_normal_case():
    """Normal case: pct_distinct, entropy_max=log2(n_distinct), entropy_norm in [0,1], n_singletons."""
    cat = _sample_cat()
    result = categorical_cardinality_block(cat, n_rows=10)

    assert set(result.keys()) == EXPECTED_KEYS

    # passthroughs
    assert result["n_distinct"] == 4
    assert result["n_rows"] == 10
    assert result["entropy"] == 1.685
    assert result["imbalance"] == 5.0
    assert result["mode"] == "a"
    assert result["mode_pct"] == 0.5  # passthrough, not normalized
    assert result["len_min"] == 1
    assert result["len_max"] == 1

    # pct_distinct = 4 / 10 * 100
    assert abs(result["pct_distinct"] - 40.0) < 1e-12

    # entropy_max = log2(4) = 2.0
    assert abs(result["entropy_max"] - log2(4)) < 1e-12
    assert abs(result["entropy_max"] - 2.0) < 1e-12

    # entropy_norm = 1.685 / 2.0 = 0.8425, within [0, 1]
    assert abs(result["entropy_norm"] - 1.685 / 2.0) < 1e-12
    assert 0.0 <= result["entropy_norm"] <= 1.0

    # singletons: c and d have count == 1
    assert result["n_singletons"] == 2
    # top covers all distinct values (4 == 4)
    assert result["n_singletons_partial"] is False

    # neither id-like (40%) nor dominated (mode_pct 0.5)
    assert result["id_like"] is False
    assert result["dominated"] is False


def test_empty_cat_does_not_raise():
    """cat={}: does not raise; derived keys are None and flags are False."""
    result = categorical_cardinality_block({}, n_rows=100)

    assert set(result.keys()) == EXPECTED_KEYS
    for key in (
        "n_distinct",
        "pct_distinct",
        "entropy",
        "entropy_max",
        "entropy_norm",
        "mode",
        "mode_pct",
        "imbalance",
        "n_singletons",
        "len_min",
        "len_mean",
        "len_max",
    ):
        assert result[key] is None
    assert result["n_singletons_partial"] is False
    assert result["id_like"] is False
    assert result["dominated"] is False
    # n_rows is a passthrough of the argument, still coherent.
    assert result["n_rows"] == 100


def test_none_cat_does_not_raise():
    """cat=None: treated as {}, same guarantees as the empty dict."""
    result = categorical_cardinality_block(None, n_rows=None)
    assert set(result.keys()) == EXPECTED_KEYS
    assert result["n_distinct"] is None
    assert result["pct_distinct"] is None
    assert result["entropy_max"] is None
    assert result["entropy_norm"] is None
    assert result["id_like"] is False
    assert result["dominated"] is False


def test_n_rows_zero_no_zero_division():
    """n_rows=0: pct_distinct is None, with no ZeroDivisionError."""
    cat = _sample_cat()
    result = categorical_cardinality_block(cat, n_rows=0)
    assert result["pct_distinct"] is None
    # n_distinct still passes through.
    assert result["n_distinct"] == 4
    assert result["id_like"] is False


def test_id_like_when_distinct_near_rows():
    """id_like is True when n_distinct ~ n_rows (pct_distinct >= 99)."""
    cat = {"n_distinct": 99, "entropy": 6.6, "top": [], "mode": None}
    result = categorical_cardinality_block(cat, n_rows=100)
    assert abs(result["pct_distinct"] - 99.0) < 1e-12
    assert result["id_like"] is True

    # exact identity column: 100 / 100 = 100%
    cat_full = {"n_distinct": 100, "top": []}
    result_full = categorical_cardinality_block(cat_full, n_rows=100)
    assert result_full["id_like"] is True


def test_dominated_when_mode_pct_high():
    """dominated is True when mode_pct is high (>= 90)."""
    cat = {
        "n_distinct": 3,
        "entropy": 0.3,
        "mode": "x",
        "mode_pct": 95.0,
        "top": [
            {"value": "x", "count": 95, "pct": 0.95},
            {"value": "y", "count": 3, "pct": 0.03},
            {"value": "z", "count": 2, "pct": 0.02},
        ],
        "imbalance": 47.5,
    }
    result = categorical_cardinality_block(cat, n_rows=100)
    assert result["mode_pct"] == 95.0
    assert result["dominated"] is True


def test_mode_pct_fallback_from_top_fraction():
    """No mode_pct: derived from the pct of the first top item; a fraction <= 1 is scaled to 0-100."""
    cat = {
        "n_distinct": 3,
        "top": [
            {"value": "x", "count": 95, "pct": 0.95},
            {"value": "y", "count": 5, "pct": 0.05},
        ],
    }
    result = categorical_cardinality_block(cat, n_rows=100)
    # 0.95 (fraction) -> 95.0 (percentage)
    assert abs(result["mode_pct"] - 95.0) < 1e-12
    assert result["dominated"] is True


def test_n_singletons_partial_when_top_truncated():
    """n_distinct > len(top): n_singletons covers only the visible top, partial is True."""
    cat = {
        "n_distinct": 10,
        "top": [
            {"value": "a", "count": 4, "pct": 0.4},
            {"value": "b", "count": 1, "pct": 0.1},
            {"value": "c", "count": 1, "pct": 0.1},
        ],
        "entropy": 2.5,
    }
    result = categorical_cardinality_block(cat, n_rows=12)
    assert result["n_singletons"] == 2  # only b, c visible
    assert result["n_singletons_partial"] is True


def test_single_distinct_value_entropy_norm_none():
    """n_distinct=1: entropy_max=0.0 -> entropy_norm None (no division by zero)."""
    cat = {
        "n_distinct": 1,
        "entropy": 0.0,
        "mode": "only",
        "mode_pct": 1.0,
        "top": [{"value": "only", "count": 7, "pct": 1.0}],
        "imbalance": 1.0,
    }
    result = categorical_cardinality_block(cat, n_rows=7)
    assert result["entropy_max"] == 0.0
    assert result["entropy_norm"] is None
    assert result["n_singletons"] == 0
@@ -1,108 +0,0 @@
---
id: categorical_top_pie_figure_py_datascience
name: categorical_top_pie_figure
kind: function
lang: py
domain: datascience
version: "1.0.0"
purity: impure
signature: "def categorical_top_pie_figure(top: list, n_distinct: int = 0, title: str = \"\", top_k: int = 6, n_rows=None) -> \"matplotlib.figure.Figure\""
description: "Builds a matplotlib donut figure (a pie with a central hole) of the top_k most frequent categories of a categorical column, folding the rest into a gray \"Otros (N categorías)\" slice. Consumes the `top` block of summarize_categorical and returns a matplotlib.figure.Figure ready to be rasterized by the EDA report renderer. Uses the Agg backend with no global pyplot state; defensive against an empty or None top."
tags: [eda, categorical, pie, donut, matplotlib, figure, visualization, datascience, impure]
uses_functions: []
uses_types: []
returns: []
returns_optional: false
error_type: "error_go_core"
imports: [matplotlib]
example: |
  from categorical_top_pie_figure import categorical_top_pie_figure
  top = [
      {"value": "rojo", "count": 40, "pct": 0.4},
      {"value": "azul", "count": 30, "pct": 0.3},
      {"value": "verde", "count": 20, "pct": 0.2},
  ]
  fig = categorical_top_pie_figure(top, n_distinct=12, title="color", top_k=6, n_rows=100)
tested: true
tests:
  - "test_returns_figure"
  - "test_ten_items_topk_six_yields_seven_wedges"
  - "test_empty_top_does_not_raise_and_returns_figure"
  - "test_long_value_truncated_in_legend"
  - "test_none_value_and_none_count_are_handled"
  - "test_n_rows_adds_exact_others_slice"
test_file_path: "python/functions/datascience/categorical_top_pie_figure_test.py"
file_path: "python/functions/datascience/categorical_top_pie_figure.py"
params:
  - name: top
    desc: "List of {value, count, pct} dicts sorted by count in descending order (the `top` block of summarize_categorical). May be empty or contain incomplete dicts: non-dict items and items with no count, count None or count <= 0 are discarded. A None value is accepted (no label)."
  - name: n_distinct
    desc: "Total number of distinct categories in the column. Labels the aggregated slice with n_distinct minus the number of slices shown (which equals top_k when `top` holds at least top_k valid items), floored at 0. If it does not exceed the number of slices shown, the actual overflow in `top` is used as the number of aggregated categories. Default 0."
  - name: title
    desc: "Figure title (the column name). Truncated to about 48 chars with an ellipsis if too long. Default \"\" (no title)."
  - name: top_k
    desc: "Maximum number of explicit slices. Default 6. The \"Otros\" slice does not count against this limit. With top_k <= 0, at least the largest category is shown."
  - name: n_rows
    desc: "Optional. Total number of rows in the dataset. If given and the sum of the shown counts is below n_rows, the \"Otros\" slice uses (n_rows - sum_shown) as its count so that the angles are exact with respect to the real total. If omitted, \"Otros\" uses the sum of the counts outside the shown top_k (only when top carries more than top_k items). Default None."
output: "A matplotlib.figure.Figure (figsize 6.4x4.0, dpi 150) with a donut Axes (wedgeprops width 0.42) plus a side legend showing each value truncated to 20 chars and its count; the \"Otros\" slice is gray. A central annotation shows the total n. If there are no valid counts it still returns a Figure with the centered text \"sin datos categóricos\" (it never raises). The caller rasterizes and closes the figure; the function neither shows nor saves it."
---

## Example

```python
from categorical_top_pie_figure import categorical_top_pie_figure

# `top` is the "top" block returned by summarize_categorical (already sorted
# in descending order).
top = [
    {"value": "rojo", "count": 40, "pct": 0.40},
    {"value": "azul", "count": 30, "pct": 0.30},
    {"value": "verde", "count": 20, "pct": 0.20},
    {"value": "amarillo", "count": 5, "pct": 0.05},
]

fig = categorical_top_pie_figure(
    top,
    n_distinct=12,          # 12 distinct categories in total
    title="color_producto",
    top_k=6,                # up to 6 explicit slices
    n_rows=100,             # "Otros" = 100 - 95 = 5, over 8 aggregated categories
)

# The report renderer rasterizes it; here we only save it for inspection.
fig.savefig("/tmp/donut_color.png")
```

## When to use it

Use it inside an EDA report when you want to show the composition of a
categorical column at a glance: how many rows fall into the dominant
categories versus the long tail. Pass it the `top` block of
`summarize_categorical` directly (already sorted in descending order) plus
`n_distinct`, so that the "Otros" slice states how many categories are grouped
together. It is the "composition" counterpart of the top-k bar chart: the donut
conveys shares of the total, the bars convey comparable magnitudes.

## Gotchas

- **Impure because of matplotlib.** It touches the rendering machinery. It uses
  the `Agg` backend and the object-oriented `Figure`/`add_subplot` API, NEVER
  `pyplot.*`, so that it touches no global state and leaks no figures between
  calls. `pyplot` is NOT thread-safe; this function avoids that risk by
  building the `Figure` directly, so it is safe to call in a loop from the
  renderer.
- **The caller closes the figure.** The function returns the `Figure` but
  neither shows nor saves it. The consumer must rasterize it and then release
  it (`fig.clf()`, or `matplotlib.pyplot.close(fig)` if the caller used pyplot)
  so that memory does not pile up over large batches of columns.
- **Pies mislead when they have many slices.** That is why `top_k` defaults to
  6 and the rest is folded into "Otros": donuts with 15+ slices are unreadable.
  For very high cardinality the donut shows only the head of the distribution;
  the tail lives in the gray slice.
- **Angles are exact only with `n_rows`.** Without `n_rows`, the "Otros" slice
  is computed from the overflow present in `top`; if the producer has already
  trimmed `top` to `top_k`, there will be no "Otros" slice even though more
  categories exist. Pass `n_rows` (the dataset's total row count) to get angles
  that are correct with respect to the real total.
- **Defensive, never raises.** `top=[]`, `value=None`, `count=None` and
  non-numeric counts are handled without error: in the worst case it returns a
  `Figure` reading "sin datos categóricos". Do not wrap the call in try/except
  for fear of a raise on these inputs; there is none.

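The `n_rows` gotcha comes down to a small piece of arithmetic. The sketch below mirrors that rule in plain Python with no matplotlib dependency; `others_slice_count` is a hypothetical name used only for illustration, not a function exported by the module.

```python
def others_slice_count(counts, top_k=6, n_rows=None):
    """Count attributed to the aggregated "Otros" slice (illustrative sketch).

    ``counts`` are the valid category counts, sorted in descending order.
    """
    shown = counts[: max(int(top_k), 0)] or counts[:1]
    sum_shown = sum(shown)
    overflow = sum(counts[len(shown):])
    if n_rows is not None and float(n_rows) > sum_shown:
        return float(n_rows) - sum_shown  # exact with respect to the real total
    return float(overflow)                # only what `top` itself carries


counts = [40, 30, 20, 5]  # sums to 95
print(others_slice_count(counts, top_k=6, n_rows=100))  # 5.0
print(others_slice_count(counts, top_k=6))              # 0.0
print(others_slice_count(counts, top_k=2))              # 25.0
```

A result of 0 corresponds to the case where no "Otros" slice is drawn: `top` was already trimmed by the producer and no `n_rows` was supplied to recover the tail.
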
@@ -1,230 +0,0 @@
"""Impure EDA helper: donut figure of the most common categories (`eda` group).

Builds a matplotlib donut (pie with a central hole) of the ``top_k`` most
frequent categories of a categorical column, folding everything else into a
single "Otros (N categorías)" slice. Returns a ready-to-rasterize
``matplotlib.figure.Figure``; it never shows nor saves it.

Impure because it touches matplotlib's rendering machinery. It uses the headless
Agg backend and the object-oriented ``Figure`` API (no ``pyplot``) so it leaks no
global state and is safe to call repeatedly from a report renderer.
"""

import matplotlib

matplotlib.use("Agg")

from matplotlib.figure import Figure  # noqa: E402


# Gray reserved for the aggregated "Otros" slice.
_OTHER_COLOR = "#9e9e9e"
# Muted gray for secondary text (title fallback, center annotation, no-data).
_MUTED_TEXT = "#5f6b7a"
# Pleasant, colour-blind-friendly qualitative palette for the explicit slices.
_PALETTE = [
    "#4C72B0",
    "#DD8452",
    "#55A868",
    "#C44E52",
    "#8172B3",
    "#937860",
    "#DA8BC3",
    "#8C8C8C",
    "#CCB974",
    "#64B5CD",
]


def _truncate(text, width: int = 20) -> str:
    """Truncate ``text`` to ``width`` chars, appending an ellipsis if cut."""
    s = "" if text is None else str(text)
    if len(s) <= width:
        return s
    if width <= 1:
        return s[:width]
    return s[: width - 1] + "…"


def categorical_top_pie_figure(
    top: list,
    n_distinct: int = 0,
    title: str = "",
    top_k: int = 6,
    n_rows=None,
) -> "matplotlib.figure.Figure":
    """Build a donut figure of the most common categories of a column.

    Renders the ``top_k`` most frequent categories as explicit donut slices and
    aggregates every remaining category into a single gray "Otros (N
    categorías)" slice. Category names are not painted on the wedges; they are
    listed in a lateral legend (truncated value + count) to avoid overlap on
    narrow (mobile) figures.

    The function is fully defensive: empty input, missing/``None`` values or
    counts never raise. When there is nothing valid to draw it still returns a
    ``Figure`` carrying a centered "sin datos categóricos" message.

    Args:
        top: List of ``{value, count, pct}`` dicts, already sorted by ``count``
            descending (the ``top`` block of ``summarize_categorical``). May be
            empty or carry incomplete/``None`` entries; non-dict items, items
            without a positive numeric ``count`` and ``None`` counts are skipped.
        n_distinct: Total number of distinct categories in the column. Used to
            label the aggregated slice with ``n_distinct`` minus the number of
            shown slices (floored at 0). When that difference is 0, the
            overflow actually present in ``top`` is used instead.
        title: Figure title (the column name). Truncated when too long.
        top_k: Maximum number of explicit slices. Default 6. The "Otros" slice
            does not count against this limit.
        n_rows: Optional total row count of the dataset. When given and the sum
            of shown counts is below ``n_rows``, the "Otros" slice uses
            ``n_rows - sum_shown`` as its count so the wedge angles are exact
            with respect to the real total. When omitted, "Otros" uses the sum
            of the counts that fall outside the shown ``top_k`` (only when
            ``top`` carries more than ``top_k`` items).

    Returns:
        A ``matplotlib.figure.Figure`` with a single donut Axes plus a lateral
        legend. The caller is responsible for rasterizing/closing it.
    """
    fig = Figure(figsize=(6.4, 4.0), dpi=150)
    ax = fig.add_subplot(111)

    safe_title = _truncate(title, 48)

    # --- Defensive parse: keep only well-formed {value, count} with count > 0.
    cleaned = []
    if isinstance(top, list):
        for item in top:
            if not isinstance(item, dict):
                continue
            count = item.get("count")
            if count is None:
                continue
            try:
                count = float(count)
            except (TypeError, ValueError):
                continue
            if count <= 0:
                continue
            cleaned.append((item.get("value"), count))

    if not cleaned:
        ax.axis("off")
        ax.text(
            0.5,
            0.5,
            "sin datos categóricos",
            ha="center",
            va="center",
            fontsize=12,
            color=_MUTED_TEXT,
            transform=ax.transAxes,
        )
        if safe_title:
            ax.set_title(safe_title, fontsize=12, loc="center", pad=8)
        fig.tight_layout()
        return fig

    # --- Split into shown slices and the aggregated remainder.
    shown = cleaned[: max(int(top_k), 0)]
    if not shown:  # top_k <= 0: show at least the largest category.
        shown = cleaned[:1]

    sum_shown = sum(c for _, c in shown)
    overflow_count = sum(c for _, c in cleaned[len(shown):])

    # How many categories are folded into "Otros".
    try:
        nd = int(n_distinct)
    except (TypeError, ValueError):
        nd = 0
    others_categories = max(nd - len(shown), 0)
    # If n_distinct is unknown/too small, fall back to the overflow we actually
    # have in `top` beyond the shown slices.
    overflow_items = len(cleaned) - len(shown)
    if others_categories == 0 and overflow_items > 0:
        others_categories = overflow_items

    # Count attributed to the "Otros" slice for exact angles.
    others_count = 0.0
    if n_rows is not None:
        try:
            total_rows = float(n_rows)
        except (TypeError, ValueError):
            total_rows = None
        if total_rows is not None and total_rows > sum_shown:
            others_count = total_rows - sum_shown
    if others_count <= 0:
        others_count = overflow_count

    labels = [v for v, _ in shown]
    values = [c for _, c in shown]
    colors = [_PALETTE[i % len(_PALETTE)] for i in range(len(shown))]

    has_others = others_count > 0 and others_categories > 0
    if has_others:
        values.append(others_count)
        labels.append("Otros")
        colors.append(_OTHER_COLOR)

    total = sum(values)

    def _autopct(pct: float) -> str:
        # Hide tiny labels to avoid crowding the wedges.
        return f"{pct:.0f}%" if pct >= 5 else ""

    wedges, _texts, autotexts = ax.pie(
        values,
        colors=colors,
        startangle=90,
        counterclock=False,
        wedgeprops={"width": 0.42, "edgecolor": "white", "linewidth": 1.0},
        autopct=_autopct,
        pctdistance=0.79,
        textprops={"fontsize": 8},
    )
    for at in autotexts:
        at.set_color("white")
        at.set_fontweight("bold")
    ax.set_aspect("equal")

    # --- Lateral legend: truncated value + count (+ "(N categorías)" for Otros).
    legend_labels = []
    for idx, (lab, val) in enumerate(zip(labels, values)):
        if has_others and idx == len(labels) - 1:
            legend_labels.append(
                f"Otros ({others_categories} categorías) — {int(round(val))}"
            )
        else:
            legend_labels.append(f"{_truncate(lab, 20)} — {int(round(val))}")

    ax.legend(
        wedges,
        legend_labels,
        title="Categorías",
        loc="center left",
        bbox_to_anchor=(1.02, 0.5),
        fontsize=8,
        title_fontsize=9,
        frameon=False,
    )

    if safe_title:
        ax.set_title(safe_title, fontsize=13, loc="left", pad=10)

    # Center annotation: total count covered by the donut.
    ax.text(
        0,
        0,
        f"n={int(round(total))}",
        ha="center",
        va="center",
        fontsize=11,
        color=_MUTED_TEXT,
        fontweight="bold",
    )

    # Leave room on the right for the legend (avoid clipping it).
    fig.subplots_adjust(left=0.02, right=0.62, top=0.88, bottom=0.06)
    return fig
@@ -1,104 +0,0 @@
"""Tests for categorical_top_pie_figure (donut of the top categories, eda group).

Uses the Agg backend; no figure is shown or saved. pyplot is imported only so
that each test can explicitly close the Figure it builds
(matplotlib.pyplot.close) and no state accumulates between tests.
"""

import matplotlib

matplotlib.use("Agg")

import matplotlib.pyplot as plt  # noqa: E402
from matplotlib.figure import Figure  # noqa: E402

from categorical_top_pie_figure import categorical_top_pie_figure


def _make_top(n):
    """n items {value, count, pct} sorted by count in descending order."""
    return [
        {"value": f"cat_{i}", "count": n - i, "pct": (n - i) / sum(range(1, n + 1))}
        for i in range(n)
    ]


def _wedges(ax):
    """Return the wedges (slices) of an Axes that holds a pie."""
    from matplotlib.patches import Wedge

    return [p for p in ax.patches if isinstance(p, Wedge)]


def test_returns_figure():
    fig = categorical_top_pie_figure(_make_top(3), n_distinct=3, title="col")
    assert isinstance(fig, Figure)
    plt.close(fig)


def test_ten_items_topk_six_yields_seven_wedges():
    top = _make_top(10)
    fig = categorical_top_pie_figure(top, n_distinct=10, title="muchas", top_k=6)
    ax = fig.axes[0]
    wedges = _wedges(ax)
    # 6 explicit categories + 1 "Otros" slice.
    assert len(wedges) == 7
    plt.close(fig)


def test_empty_top_does_not_raise_and_returns_figure():
    fig = categorical_top_pie_figure([], n_distinct=0, title="vacía")
    assert isinstance(fig, Figure)
    # No data: there must be no pie slices.
    assert len(_wedges(fig.axes[0])) == 0
    plt.close(fig)


def test_long_value_truncated_in_legend():
    long_value = "una_categoria_con_un_nombre_larguisimo_que_excede_el_limite"
    top = [
        {"value": long_value, "count": 10, "pct": 0.5},
        {"value": "corta", "count": 10, "pct": 0.5},
    ]
    fig = categorical_top_pie_figure(top, n_distinct=2, title="col", top_k=6)
    ax = fig.axes[0]
    legend = ax.get_legend()
    assert legend is not None
    texts = [t.get_text() for t in legend.get_texts()]
    # The long value appears truncated with an ellipsis and NOT in its full form.
    assert any("…" in t for t in texts)
    assert long_value not in " ".join(texts)
    plt.close(fig)


def test_none_value_and_none_count_are_handled():
    top = [
        {"value": None, "count": 5, "pct": 0.5},
        {"value": "b", "count": None, "pct": 0.0},  # count None -> discarded
        {"value": "c", "count": 5, "pct": 0.5},
    ]
    fig = categorical_top_pie_figure(top, n_distinct=2, title="con nones", top_k=6)
    assert isinstance(fig, Figure)
    # Only 2 valid items and no overflow -> 2 wedges, no "Otros".
    assert len(_wedges(fig.axes[0])) == 2
    plt.close(fig)


def test_n_rows_adds_exact_others_slice():
    # The 3 shown categories sum to 30 and the real dataset has 100 rows,
    # so "Otros" = 70.
    top = [
        {"value": "a", "count": 15, "pct": 0.15},
        {"value": "b", "count": 10, "pct": 0.10},
        {"value": "c", "count": 5, "pct": 0.05},
    ]
    fig = categorical_top_pie_figure(
        top, n_distinct=20, title="col", top_k=3, n_rows=100
    )
    ax = fig.axes[0]
    # 3 explicit slices + Otros.
    assert len(_wedges(ax)) == 4
    legend_texts = [t.get_text() for t in ax.get_legend().get_texts()]
    # The Otros slice reflects n_distinct - top_k = 17 categories and count 70.
    assert any("Otros (17 categorías)" in t and "70" in t for t in legend_texts)
    plt.close(fig)
@@ -1,107 +0,0 @@
---
name: render_automatic_eda_pdf
kind: function
lang: py
domain: datascience
version: "1.0.0"
purity: impure
signature: "def render_automatic_eda_pdf(chapters_or_profile, out_path: str, meta: dict = None) -> dict"
description: "Renders a chapter-based AutomaticEDA document (format-independent block model) to an A5 portrait PDF designed for READING ON A PHONE. Accepts a list of model chapters or an eda-group TableProfile directly (in which case it builds the canonical chapters with build_document). The paginator MEASURES every block and NEVER cuts anything: text wraps to whole lines, long tables split by rows REPEATING the header, figures and images scale to fit entirely. Each chapter starts on a new page with the footer 'Chapter · vX.Y.Z', and an automatic_eda_manifest.json manifest is written next to the output for per-chapter tracking. dict-no-throw: never raises, returns {path, n_pages, chapters, manifest_path, note}. Engine: matplotlib PdfPages. Additive: does NOT replace render_eda_pdf."
tags: [eda, pdf, render, report, mobile, automatic-eda, chapters, versioned, no-cut, pagination, matplotlib, datascience, python]
uses_functions: []
uses_types: []
returns: []
returns_optional: false
error_type: "error_go_core"
imports: [os, matplotlib, "datascience.automatic_eda"]
params:
  - name: chapters_or_profile
    desc: "a list of AutomaticEDA model chapters (Chapter dataclasses or dicts {id,title,version,blocks}) OR an eda-group TableProfile dict. If it is a TableProfile, the canonical chapters are built with build_document(profile, meta['ctx']). A chapter is {id,title,version,blocks}; a block is one of: heading, markdown, kv_table, data_table, figure, image, caption, note. Defensive reading: anything unrecognized degrades to a Note, never raises."
  - name: out_path
    desc: "path of the output PDF file. Missing parent directories are created. If it is in a non-writable directory (e.g. /proc/...) returns {path:None, note:<cause>} without raising."
  - name: meta
    desc: "optional dict. Keys: title (cover/footer title), ctx (presentation context passed to the chapter builders when a profile is given: dataset_name, source_origin, storage, generated_at, description, granularity, quality_criteria, head_rows...), manifest_path (override; defaults to automatic_eda_manifest.json next to out_path), write_manifest (False to skip writing it), generated_at."
output: "dict (never raises): {path: str|None, n_pages: int, chapters: list[{id,version,n_pages}], manifest_path: str|None, note: str}. On success path is the written path, n_pages the total page count, chapters the per-chapter breakdown for the manifest. On a fatal error path is None and note explains the cause."
tested: true
tests: ["test_golden_profile_genera_pdf_portada_y_overview", "test_edge_tabla_larga_parte_repitiendo_cabecera", "test_edge_celda_larga_no_se_corta", "test_no_corta_texto_markdown", "test_edge_profile_none_y_vacio_un_pagina", "test_error_path_directorio_no_escribible_no_revienta"]
test_file_path: "python/functions/datascience/render_automatic_eda_pdf_test.py"
file_path: "python/functions/datascience/render_automatic_eda_pdf.py"
---

## Example

```python
from datascience import render_automatic_eda_pdf

# Case 1: directly from an eda-group TableProfile.
# profile = profile_table(db, "ventas", backend="duckdb")["profile"]
profile = {
    "table": "ventas", "source": "/data/ventas.csv",
    "n_rows": 1000, "n_cols": 2, "quality_score": 92.5,
    "columns": [
        {"name": "precio", "inferred_type": "numeric", "null_pct": 0.01,
         "null_count": 10,
         "numeric": {"mean": 42.5, "median": 40.0, "min": 1.0, "max": 100.0,
                     "std": 12.3}},
        {"name": "categoria", "inferred_type": "categorical", "null_pct": 0.0,
         "categorical": {"top": [{"value": "neumaticos", "count": 500},
                                 {"value": "aceite", "count": 300}]}},
    ],
}
res = render_automatic_eda_pdf(
    profile, "reports/ventas_aeda.pdf",
    {"title": "EDA — ventas",
     "ctx": {"dataset_name": "Ventas", "source_origin": "ERP export",
             "description": "Líneas de venta del ERP.",
             "granularity": "Cada fila es una línea de venta."}})
print(res["n_pages"], res["chapters"], res["manifest_path"])
# -> 3 [{'id':'portada','version':'1.0.0','n_pages':1},
#       {'id':'overview','version':'1.0.0','n_pages':2}] <cwd>/reports/automatic_eda_manifest.json
# (the default manifest path is built from the absolute path of out_path)

# Case 2: from hand-built chapters (block model).
from datascience.automatic_eda.model import Chapter, Heading, DataTable
ch = Chapter(id="resumen", title="Resumen", version="1.0.0", blocks=[
    Heading("Tabla", 1),
    DataTable(header=["col", "valor"], rows=[["a", "1"], ["b", "2"]]),
])
render_automatic_eda_pdf([ch], "reports/manual.pdf")
```
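
Because the function never raises, callers have to inspect the returned dict themselves. The following is a minimal sketch of that check, assuming only the documented keys (`path`, `n_pages`, `chapters`, `note`). `summarize_render_result` is a hypothetical helper, not part of the package, and the sample dicts (including the `note` text) are made up for illustration.

```python
def summarize_render_result(res: dict) -> str:
    """Turn a dict-no-throw render result into a one-line status (hypothetical helper)."""
    if res.get("path") is None:
        # Fatal error: the renderer reports the cause in `note` instead of raising.
        return f"render failed: {res.get('note') or 'unknown cause'}"
    per_chapter = ", ".join(
        f"{c['id']} v{c['version']} ({c['n_pages']}p)" for c in res.get("chapters") or []
    )
    return f"{res['path']}: {res.get('n_pages', 0)} pages [{per_chapter}]"


# Illustrative result dicts shaped like the documented output (values are invented).
ok = {"path": "reports/ventas_aeda.pdf", "n_pages": 3, "note": "",
      "chapters": [{"id": "portada", "version": "1.0.0", "n_pages": 1},
                   {"id": "overview", "version": "1.0.0", "n_pages": 2}]}
failed = {"path": None, "n_pages": 0, "chapters": [], "note": "directory not writable"}

print(summarize_render_result(ok))
print(summarize_render_result(failed))
```

Checking `path` first keeps the error branch explicit, which matters here because a failed render looks like a normal return value.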

## When to use it

Use it when you want the **mobile PDF from the new chapter-based AutomaticEDA engine**
(cover + overview + whatever chapters exist): after `profile_table(...)`, pass it the
`profile` and you get an A5 portrait PDF versioned per chapter, with a manifest. Use it
as the PDF presentation layer of the `eda` group when you need a **no-cut guarantee**
(text, tables and images are never clipped) and **per-chapter versioning** for
continuous improvement. It is the evolutionary successor to `render_eda_pdf` (additive
for now: `render_eda_pdf` is not replaced). It shares the Tufte/mobile aesthetic but
separates content (chapters/blocks) from format (renderer), so the same document can
also be emitted as PPTX (`render_automatic_eda_pptx`). To add a new chapter, see
`docs/capabilities/automatic_eda.md`.

## Gotchas

- **Impure**: writes the PDF to `out_path` (creating parent directories) and, unless
  `meta['write_manifest']=False`, an `automatic_eda_manifest.json` next to the output.
  Uses matplotlib's headless `Agg` backend (runs on agents/CI without a display).
- **Never raises** (dict-no-throw): a block or chapter that fails is skipped and recorded
  in `note`; the PDF is still generated. A `None`/`{}` profile produces a valid 1-page
  PDF. Non-writable `out_path` → `{path: None, note: <cause>}`.
- **Cuts nothing**: the paginator measures every block on a character grid (it
  over-estimates slightly and never claims something fits when it would overflow).
  Text wraps to whole lines (no mid-word breaks), long tables split by rows
  **repeating the header**, cells with long text wrap inside their column (the row
  grows), and figures/images scale to fit entirely (never cropped).
- **Very wide tables**: with many columns (>10) each column narrows and its text wraps
  onto several lines (nothing is lost). Splitting very wide tables into column groups
  is a pending improvement (see the capability page).
- **head_rows / examples**: the Overview chapter shows `df.head` from
  `ctx['head_rows']`/`profile['head_rows']` and non-null examples from
  `columns[i]['examples']`; if the profile lacks them (today it does), it degrades to
  an honest placeholder and derives the examples from the profile's actual values
  (categorical top values, numeric min/median/max). Documented in the contract.
- **Package registration**: the example above uses `from datascience import
  render_automatic_eda_pdf` (added to `__init__.py`); the test imports the module
  directly so it does not depend on that registration.
- **European dates in the UI**: the cover formats the date as `DD/MM/YYYY HH:mm`.
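
The no-cut rule for long tables (split by rows, repeat the header) can be sketched in a few lines. This is an illustration only, not the engine's paginator: it assumes a fixed number of rows per page, whereas the notes above say the real paginator measures each block. `split_rows_repeating_header` is a hypothetical name.

```python
from typing import List, Sequence


def split_rows_repeating_header(
    header: Sequence[str], rows: Sequence[Sequence[str]], rows_per_page: int
) -> List[List[List[str]]]:
    """Split `rows` into page-sized tables, each starting with a copy of `header`."""
    if rows_per_page < 1:
        raise ValueError("rows_per_page must be >= 1")
    pages = []
    for start in range(0, len(rows), rows_per_page):
        chunk = [list(r) for r in rows[start:start + rows_per_page]]
        pages.append([list(header)] + chunk)
    # An empty table still yields one page carrying just the header.
    return pages or [[list(header)]]


header = ["ALPHA", "BETA"]
rows = [[f"r{r}c{c}" for c in range(2)] for r in range(5)]
pages = split_rows_repeating_header(header, rows, rows_per_page=2)

print(len(pages))   # number of pages needed for 5 rows at 2 rows per page
print(pages[-1])    # the last page: header plus the single remaining row
```

Every page is a complete table on its own, which is the property the header-repetition test in this change set checks on the rendered PDF.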
@@ -1,83 +0,0 @@
"""render_automatic_eda_pdf — chapter-based EDA report as an A5-portrait PDF.

Public ``eda``-group entry point of the AutomaticEDA engine. Takes either a list
of chapters (the format-independent document model) or an ``eda`` TableProfile
dict (in which case the canonical chapters are built with ``build_document``),
and renders a mobile-first PDF whose paginator MEASURES every block and never
cuts text, tables or images: text wraps to whole lines, long tables split by
rows repeating the header, figures/images scale to fit entirely. Each chapter
starts on a fresh page stamped ``<Chapter> · v<version>`` in the footer, and a
per-chapter manifest (``automatic_eda_manifest.json``) is written next to the
output for version tracking.

dict-no-throw: never raises. Returns ``{path, n_pages, chapters, manifest_path,
note}``; on a fatal write error ``path`` is None and ``note`` explains why.

Additive: this does NOT replace ``render_eda_pdf`` (still used by
``profile_table(emit_pdf=True)``). It is the new engine that will, in the next
phase, let every EDA emit both a PDF and a PPTX from the same chapter model.
"""

from __future__ import annotations

import os

from datascience.automatic_eda import build_document, merge_manifest, render_pdf
from datascience.automatic_eda.model import as_chapter, as_chapters


def _coerce_chapters(chapters_or_profile, meta: dict) -> list:
    """Accept chapters OR an eda profile and return a list of Chapter."""
    arg = chapters_or_profile
    if isinstance(arg, (list, tuple)):
        return as_chapters(list(arg))
    if isinstance(arg, dict):
        # A single chapter dict has 'blocks'; a profile has columns/table/rows.
        if "blocks" in arg and "columns" not in arg:
            ch = as_chapter(arg)
            return [ch] if ch is not None else []
        # Treat as an eda TableProfile.
        return build_document(arg, (meta or {}).get("ctx"))
    return []


def render_automatic_eda_pdf(chapters_or_profile, out_path: str,
                             meta: dict = None) -> dict:
    """Render an AutomaticEDA document into a mobile-readable PDF.

    Args:
        chapters_or_profile: either a list of chapters (``Chapter`` dataclasses
            or dicts following the document model) or an ``eda`` TableProfile
            dict — in the latter case the canonical chapters are built via
            ``build_document(profile, meta['ctx'])``.
        out_path: filesystem path for the PDF (parent dirs are created).
        meta: optional dict. Recognised keys: ``title`` (cover/footer title),
            ``ctx`` (presentation context passed to chapter builders when a
            profile is given), ``manifest_path`` (override; defaults to
            ``automatic_eda_manifest.json`` beside ``out_path``),
            ``write_manifest`` (set False to skip), ``generated_at``.

    Returns:
        dict (never raises): ``{path, n_pages, chapters, manifest_path, note}``.
    """
    meta = dict(meta or {})
    chapters = _coerce_chapters(chapters_or_profile, meta)
    result = render_pdf(chapters, out_path, meta)

    manifest_path = None
    if meta.get("write_manifest", True) and result.get("path"):
        manifest_path = meta.get("manifest_path")
        if not manifest_path:
            manifest_path = os.path.join(
                os.path.dirname(os.path.abspath(out_path)),
                "automatic_eda_manifest.json")
        generated_at = meta.get("generated_at") or _now_iso()
        merge_manifest(manifest_path, "pdf", result.get("chapters") or [],
                       generated_at)
    result["manifest_path"] = manifest_path
    return result


def _now_iso() -> str:
    from datetime import datetime, timezone
    return datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%S UTC")
@@ -1,140 +0,0 @@
"""Tests for render_automatic_eda_pdf — DoD: golden + edges + error path.

Self-contained: builds a synthetic TableProfile (no DuckDB) so the suite is fast
and deterministic. Verifies the cover/overview reference chapters render, that
long tables split by rows repeating the header without losing any cell text,
that an empty/None profile still yields a valid 1-page PDF, and that an
unwritable destination returns ``{path: None}`` without raising.
"""

import os
import re
import tempfile

from pypdf import PdfReader

from datascience.render_automatic_eda_pdf import render_automatic_eda_pdf
from datascience.automatic_eda.model import Chapter, DataTable, Heading, Markdown


def _profile() -> dict:
    return {
        "table": "ventas",
        "source": "/data/ventas.csv",
        "profiled_at": "2026-06-30T10:00:00+00:00",
        "n_rows": 1000,
        "n_cols": 3,
        "quality_score": 92.5,
        "key_candidates": ["id"],
        "type_breakdown": {"numeric": 2, "categorical": 1},
        "columns": [
            {"name": "id", "inferred_type": "numeric", "null_pct": 0.0,
             "null_count": 0,
             "numeric": {"mean": 500.0, "median": 500.0, "min": 1.0,
                         "max": 1000.0, "std": 288.7}},
            {"name": "precio", "inferred_type": "numeric", "null_pct": 0.01,
             "null_count": 10,
             "numeric": {"mean": 42.5, "median": 40.0, "min": 1.0,
                         "max": 100.0, "std": 12.3}},
            {"name": "categoria", "inferred_type": "categorical",
             "null_pct": 0.0, "null_count": 0,
             "categorical": {"top": [{"value": "neumaticos", "count": 500},
                                     {"value": "aceite", "count": 300}]}},
        ],
    }


def _pdf_text(path: str) -> str:
    txt = "".join((pg.extract_text() or "") for pg in PdfReader(path).pages)
    return re.sub(r"\s+", " ", txt)


def test_golden_profile_genera_pdf_portada_y_overview():
    with tempfile.TemporaryDirectory() as d:
        out = os.path.join(d, "eda.pdf")
        res = render_automatic_eda_pdf(_profile(), out, {"title": "EDA — ventas"})
        assert res["path"] == out
        assert os.path.exists(out)
        assert res["n_pages"] >= 2  # portada + overview (1+ each).
        ids = [c["id"] for c in res["chapters"]]
        assert "portada" in ids and "overview" in ids
        # Manifest written next to the output with both chapters versioned.
        assert res["manifest_path"] and os.path.exists(res["manifest_path"])
        txt = _pdf_text(out)
        # Cover fields.
        assert "Automatic-EDA" in txt
        assert "CSV" in txt  # storage inferred from .csv source.
        assert "Calidad" in txt and "92.5" in txt
        assert "Fuente" in txt
        # Overview content: column dictionary + describe.
        assert "precio" in txt and "categoria" in txt
        assert "median" in txt


def test_edge_tabla_larga_parte_repitiendo_cabecera():
    # 60 rows over 6 wide columns: the table must split across pages and repeat
    # the header on every continuation page (headers wide enough not to wrap).
    header = ["ALPHA", "BETA", "GAMMA", "DELTA", "EPSILON", "ZETA"]
    rows = [[f"r{r}c{c}" for c in range(6)] for r in range(60)]
    ch = Chapter(id="edge", title="Edge", version="1.0.0",
                 blocks=[Heading("Tabla", 1),
                         DataTable(header=header, rows=rows)])
    with tempfile.TemporaryDirectory() as d:
        out = os.path.join(d, "edge.pdf")
        res = render_automatic_eda_pdf([ch], out, {"write_manifest": False})
        assert res["path"] == out
        reader = PdfReader(out)
        n_pages = len(reader.pages)
        assert n_pages > 1  # table spilled to several pages.
        pages_with_header = sum(
            1 for pg in reader.pages if "ALPHA" in (pg.extract_text() or ""))
        assert pages_with_header == n_pages  # header repeated on every page.


def test_edge_celda_larga_no_se_corta():
    # A single cell with ~150 chars must wrap inside its column (the row grows),
    # never truncated: all of its words survive in the rendered PDF.
    long_cell = ("Lorem ipsum dolor sit amet consectetur adipiscing elit sed do "
                 "eiusmod tempor incididunt ut labore et dolore magna aliqua "
                 "reprehenderit voluptate")
    header = ["clave", "descripcion"]
    rows = [["k1", long_cell], ["k2", "corto"]]
    ch = Chapter(id="edge2", title="Edge2", version="1.0.0",
                 blocks=[DataTable(header=header, rows=rows)])
    with tempfile.TemporaryDirectory() as d:
        out = os.path.join(d, "edge2.pdf")
        render_automatic_eda_pdf([ch], out, {"write_manifest": False})
        txt = _pdf_text(out)
        # Every word of the long cell present (wrapped, not truncated).
        for word in ("Lorem", "incididunt", "reprehenderit", "voluptate"):
            assert word in txt


def test_no_corta_texto_markdown():
    para = " ".join(f"palabra{i}" for i in range(120))
    ch = Chapter(id="md", title="MD", version="1.0.0",
                 blocks=[Markdown(text=para)])
    with tempfile.TemporaryDirectory() as d:
        out = os.path.join(d, "md.pdf")
        render_automatic_eda_pdf([ch], out, {"write_manifest": False})
        txt = _pdf_text(out)
        for i in (0, 60, 119):  # first, middle, last words all present.
            assert f"palabra{i}" in txt


def test_edge_profile_none_y_vacio_un_pagina():
    with tempfile.TemporaryDirectory() as d:
        for arg, name in ((None, "none"), ({}, "empty")):
            out = os.path.join(d, f"{name}.pdf")
            res = render_automatic_eda_pdf(arg, out, {"write_manifest": False})
            assert res["path"] == out
            assert os.path.exists(out)
            assert res["n_pages"] == 1


def test_error_path_directorio_no_escribible_no_revienta():
    res = render_automatic_eda_pdf(_profile(), "/proc/nope/x.pdf",
                                   {"write_manifest": False})
    assert res["path"] is None
    assert res["n_pages"] == 0
    assert res["note"]
@@ -1,86 +0,0 @@
---
name: render_automatic_eda_pptx
kind: function
lang: py
domain: datascience
version: "1.0.0"
purity: impure
signature: "def render_automatic_eda_pptx(chapters_or_profile, out_path: str, meta: dict = None) -> dict"
description: "Renders a chapter-based AutomaticEDA document (format-independent block model) to a 16:9 PPTX presentation designed for SHARING. Accepts a list of model chapters or an eda-group TableProfile directly (it builds the canonical chapters with build_document). Same anti-cut principle as the PDF renderer: every block is measured and, if it does not fit on the slide, continues on a '<Chapter> (cont.)' slide; long tables split by rows REPEATING the header; matplotlib figures are exported to PNG and inserted scaled to fit entirely. Each slide carries the footer 'Chapter · vX.Y.Z' and automatic_eda_manifest.json is written next to the output. dict-no-throw: never raises, returns {path, n_slides, chapters, manifest_path, note}. Engine: python-pptx (dependency declared in python/pyproject.toml)."
tags: [eda, pptx, render, report, share, automatic-eda, chapters, versioned, no-cut, slides, python-pptx, datascience, python]
uses_functions: []
uses_types: []
returns: []
returns_optional: false
error_type: "error_go_core"
imports: [os, "python-pptx", "datascience.automatic_eda"]
params:
  - name: chapters_or_profile
    desc: "a list of AutomaticEDA model chapters (Chapter dataclasses or dicts {id,title,version,blocks}) OR an eda-group TableProfile dict. If it is a TableProfile, the canonical chapters are built with build_document(profile, meta['ctx']). Supported blocks: heading, markdown, kv_table, data_table, figure, image, caption, note. Defensive reading: anything unrecognized degrades to a Note, never raises."
  - name: out_path
    desc: "path of the output PPTX file. Missing parent directories are created. Non-writable directory → {path:None, note:<cause>} without raising."
  - name: meta
    desc: "optional dict. Keys: title (title), ctx (presentation context for the chapter builders when a profile is given), manifest_path (override; defaults to automatic_eda_manifest.json next to out_path), write_manifest (False to skip writing it), generated_at."
output: "dict (never raises): {path: str|None, n_slides: int, chapters: list[{id,version,n_slides}], manifest_path: str|None, note: str}. On a fatal error (including python-pptx not being installed) path is None and note explains the cause."
tested: true
tests: ["test_golden_profile_genera_pptx_portada_y_overview", "test_edge_tabla_larga_parte_repitiendo_cabecera_sin_cortar", "test_edge_profile_none_y_vacio_un_slide", "test_error_path_directorio_no_escribible_no_revienta"]
test_file_path: "python/functions/datascience/render_automatic_eda_pptx_test.py"
file_path: "python/functions/datascience/render_automatic_eda_pptx.py"
---

## Example

```python
from datascience import render_automatic_eda_pptx

# From an eda-group TableProfile (same model as the PDF renderer).
profile = {
    "table": "ventas", "source": "/data/ventas.csv",
    "n_rows": 1000, "n_cols": 2, "quality_score": 92.5,
    "columns": [
        {"name": "precio", "inferred_type": "numeric", "null_pct": 0.01,
         "numeric": {"mean": 42.5, "median": 40.0, "min": 1.0, "max": 100.0,
                     "std": 12.3}},
        {"name": "categoria", "inferred_type": "categorical", "null_pct": 0.0,
         "categorical": {"top": [{"value": "neumaticos", "count": 500}]}},
    ],
}
res = render_automatic_eda_pptx(
    profile, "reports/ventas_aeda.pptx",
    {"title": "EDA — ventas",
     "ctx": {"dataset_name": "Ventas", "source_origin": "ERP export"}})
print(res["n_slides"], res["chapters"], res["manifest_path"])
# -> 3 [{'id':'portada','version':'1.0.0','n_slides':1},
#       {'id':'overview','version':'1.0.0','n_slides':2}] <cwd>/reports/automatic_eda_manifest.json
# (the default manifest path is built from the absolute path of out_path)
```
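
Both renderers derive the default manifest location the same way, so a PDF and a PPTX written to the same directory end up sharing one manifest file. A minimal sketch of that derivation, mirroring the `os.path` expression in the module source shown in this diff; `default_manifest_path` is a hypothetical helper name used only for illustration:

```python
import os


def default_manifest_path(out_path: str) -> str:
    """Return automatic_eda_manifest.json in the same directory as `out_path`."""
    out_dir = os.path.dirname(os.path.abspath(out_path))
    return os.path.join(out_dir, "automatic_eda_manifest.json")


pdf_manifest = default_manifest_path("reports/ventas_aeda.pdf")
pptx_manifest = default_manifest_path("reports/ventas_aeda.pptx")

# Same output directory, so both renderers resolve to the same manifest file.
print(pdf_manifest == pptx_manifest)
# abspath makes the result absolute regardless of how out_path was given.
print(os.path.isabs(pdf_manifest))
```

Pass `meta['manifest_path']` when the two outputs live in different directories and should still share a manifest.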

## When to use it

Use it when you want to **share the EDA as a presentation** (not for reading on a phone
but for showing it to someone): the same chapter-based document as the PDF, emitted as
a 16:9 PPTX. Use it alongside `render_automatic_eda_pdf` so that every EDA has both
outputs (mobile PDF + shareable PPTX) from the same chapter model. It guarantees no
cuts: no text, table or image is clipped; whatever does not fit on a slide continues on
another `(cont.)` slide, with the header repeated for tables. To add new chapters to
the document, see `docs/capabilities/automatic_eda.md`.

## Gotchas

- **Impure**: writes the PPTX to `out_path` and, unless `meta['write_manifest']=False`,
  the `automatic_eda_manifest.json` manifest next to the output.
- **python-pptx dependency**: declared in `python/pyproject.toml`
  (`python-pptx>=1.0.2`). If it is not installed, returns `{path: None, note:
  'python-pptx no disponible: ...'}` without raising. Install with
  `uv pip install --python python/.venv/bin/python3 python-pptx`.
- **Never raises** (dict-no-throw): a block that fails is skipped and recorded in `note`;
  the deck is still generated. A `None`/`{}` profile produces a valid 1-slide deck.
- **Cuts nothing**: every block is measured; if it does not fit on the current slide, a
  `(cont.)` slide is opened. Long tables split by rows **repeating the header** (the
  remaining rows move to the next slide). matplotlib figures are exported to PNG in
  memory and inserted scaled to fit entirely (never cropped).
- **Figures**: a `figure` block can carry an already built `matplotlib.figure.Figure`
  or a `make` callable (built lazily). It is closed after rasterizing. Images
  (`image`) given by path are scaled preserving the aspect ratio.
- **Wide tables**: with many columns the width per column shrinks and the text wraps
  inside the cell (nothing is lost). Splitting very wide tables into column groups is
  a pending improvement.
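
Scaling a figure or image "to fit entirely" while keeping its aspect ratio comes down to a single scale factor. A minimal sketch under stated assumptions: `fit_within` is a hypothetical helper, the dimensions are plain numbers in whatever unit the caller uses, and since the notes above do not say whether the engine enlarges small images, this version only shrinks.

```python
from typing import Tuple


def fit_within(width: float, height: float,
               max_width: float, max_height: float) -> Tuple[float, float]:
    """Scale (width, height) down uniformly so it fits inside the box."""
    if min(width, height, max_width, max_height) <= 0:
        raise ValueError("all dimensions must be positive")
    # One factor for both axes keeps the aspect ratio; capping it at 1.0
    # means images that already fit are left at their original size.
    scale = min(max_width / width, max_height / height, 1.0)
    return width * scale, height * scale


print(fit_within(1600, 900, 800, 600))   # wide image: limited by the box width
print(fit_within(400, 1200, 800, 600))   # tall image: limited by the box height
print(fit_within(200, 100, 800, 600))    # already fits: returned unchanged
```

Taking the minimum of the two ratios is what guarantees nothing is cropped: the tighter axis decides the scale and the other axis simply leaves empty space.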
@@ -1,76 +0,0 @@
"""render_automatic_eda_pptx — chapter-based EDA report as a 16:9 PPTX deck.

Public ``eda``-group entry point that renders an AutomaticEDA document (a list
of chapters, or an ``eda`` TableProfile from which the canonical chapters are
built) into a PowerPoint deck for sharing. Same anti-cut principle as the PDF
renderer: every block is measured and, when it does not fit, continues on a new
slide titled ``<Chapter> (cont.)``; data tables split by rows repeating the
header; matplotlib figures are exported to PNG and inserted scaled to fit
entirely. Each slide is stamped ``<Chapter> · v<version>`` and a per-chapter
manifest (``automatic_eda_manifest.json``) is written next to the output.

dict-no-throw: never raises. Returns ``{path, n_slides, chapters,
manifest_path, note}``; on a fatal error ``path`` is None and ``note`` explains
why (e.g. python-pptx not installed).

Engine: ``python-pptx`` (added dependency; declared in python/pyproject.toml).
"""

from __future__ import annotations

import os

from datascience.automatic_eda import build_document, merge_manifest, render_pptx
from datascience.automatic_eda.model import as_chapter, as_chapters


def _coerce_chapters(chapters_or_profile, meta: dict) -> list:
    """Accept chapters OR an eda profile and return a list of Chapter."""
    arg = chapters_or_profile
    if isinstance(arg, (list, tuple)):
        return as_chapters(list(arg))
    if isinstance(arg, dict):
        if "blocks" in arg and "columns" not in arg:
            ch = as_chapter(arg)
            return [ch] if ch is not None else []
        return build_document(arg, (meta or {}).get("ctx"))
    return []


def render_automatic_eda_pptx(chapters_or_profile, out_path: str,
                              meta: dict = None) -> dict:
    """Render an AutomaticEDA document into a shareable PPTX deck.

    Args:
        chapters_or_profile: a list of chapters (``Chapter`` dataclasses or
            dicts) or an ``eda`` TableProfile dict (chapters built via
            ``build_document(profile, meta['ctx'])``).
        out_path: filesystem path for the PPTX (parent dirs are created).
        meta: optional dict. Recognised keys: ``title``, ``ctx``,
            ``manifest_path`` (defaults to ``automatic_eda_manifest.json`` beside
            ``out_path``), ``write_manifest`` (False to skip), ``generated_at``.

    Returns:
        dict (never raises): ``{path, n_slides, chapters, manifest_path, note}``.
    """
    meta = dict(meta or {})
    chapters = _coerce_chapters(chapters_or_profile, meta)
    result = render_pptx(chapters, out_path, meta)

    manifest_path = None
    if meta.get("write_manifest", True) and result.get("path"):
        manifest_path = meta.get("manifest_path")
        if not manifest_path:
            manifest_path = os.path.join(
                os.path.dirname(os.path.abspath(out_path)),
                "automatic_eda_manifest.json")
        generated_at = meta.get("generated_at") or _now_iso()
        merge_manifest(manifest_path, "pptx", result.get("chapters") or [],
                       generated_at)
    result["manifest_path"] = manifest_path
    return result


def _now_iso() -> str:
    from datetime import datetime, timezone
    return datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%S UTC")
@@ -1,114 +0,0 @@
"""Tests for render_automatic_eda_pptx — DoD: golden + edges + error path.

Self-contained synthetic TableProfile (no DuckDB). Verifies the cover/overview
chapters render to slides, that long tables split across slides repeating the
header without losing cell text, that an empty/None profile yields a valid
1-slide deck, and that an unwritable destination returns ``{path: None}``.
"""

import os
import tempfile

from pptx import Presentation

from datascience.render_automatic_eda_pptx import render_automatic_eda_pptx
from datascience.automatic_eda.model import Chapter, DataTable, Heading


def _profile() -> dict:
    return {
        "table": "ventas",
        "source": "/data/ventas.csv",
        "profiled_at": "2026-06-30T10:00:00+00:00",
        "n_rows": 1000,
        "n_cols": 2,
        "quality_score": 92.5,
        "columns": [
            {"name": "precio", "inferred_type": "numeric", "null_pct": 0.01,
             "null_count": 10,
             "numeric": {"mean": 42.5, "median": 40.0, "min": 1.0,
                         "max": 100.0, "std": 12.3}},
            {"name": "categoria", "inferred_type": "categorical",
             "null_pct": 0.0, "null_count": 0,
             "categorical": {"top": [{"value": "neumaticos", "count": 500},
                                     {"value": "aceite", "count": 300}]}},
        ],
    }


def _slide_texts(path: str) -> list:
    prs = Presentation(path)
    out = []
    for sl in prs.slides:
        parts = []
        for sh in sl.shapes:
            if sh.has_text_frame:
                parts.append(sh.text_frame.text)
            if sh.has_table:
                tb = sh.table
                for r in range(len(tb.rows)):
                    for c in range(len(tb.columns)):
                        parts.append(tb.cell(r, c).text)
        out.append(" ".join(parts))
    return out


def test_golden_profile_genera_pptx_portada_y_overview():
    with tempfile.TemporaryDirectory() as d:
        out = os.path.join(d, "eda.pptx")
        res = render_automatic_eda_pptx(_profile(), out, {"title": "EDA — ventas"})
        assert res["path"] == out
        assert os.path.exists(out)
        assert res["n_slides"] >= 2
        ids = [c["id"] for c in res["chapters"]]
        assert "portada" in ids and "overview" in ids
        assert res["manifest_path"] and os.path.exists(res["manifest_path"])
        joined = " ".join(_slide_texts(out))
        assert "Automatic-EDA" in joined
        assert "CSV" in joined
        assert "92.5" in joined
        assert "precio" in joined and "categoria" in joined
        assert "median" in joined


def test_edge_tabla_larga_parte_repitiendo_cabecera_sin_cortar():
    long_cell = ("Lorem ipsum dolor sit amet consectetur adipiscing elit sed do "
                 "eiusmod tempor incididunt reprehenderit voluptate")
    header = ["ALPHA", "BETA", "GAMMA", "DELTA"]
    rows = [[f"r{r}c{c}" for c in range(4)] for r in range(50)]
    rows[0][1] = long_cell
    ch = Chapter(id="edge", title="Edge", version="1.0.0",
                 blocks=[Heading("Tabla", 1),
                         DataTable(header=header, rows=rows)])
    with tempfile.TemporaryDirectory() as d:
        out = os.path.join(d, "edge.pptx")
        res = render_automatic_eda_pptx([ch], out, {"write_manifest": False})
        assert res["path"] == out
        texts = _slide_texts(out)
        assert res["n_slides"] > 1  # table spilled to several slides.
        # Header repeated: every slide that carries table rows shows "ALPHA".
        slides_with_header = sum(1 for t in texts if "ALPHA" in t)
        assert slides_with_header >= 2
        joined = " ".join(texts)
        assert "Lorem ipsum dolor" in joined and "reprehenderit voluptate" in joined
        # No row lost: every data cell r0..r49 col0 present.
        for r in (0, 25, 49):
            assert f"r{r}c0" in joined


def test_edge_profile_none_y_vacio_un_slide():
    with tempfile.TemporaryDirectory() as d:
        for arg, name in ((None, "none"), ({}, "empty")):
            out = os.path.join(d, f"{name}.pptx")
            res = render_automatic_eda_pptx(arg, out, {"write_manifest": False})
            assert res["path"] == out
            assert os.path.exists(out)
            assert res["n_slides"] == 1


def test_error_path_directorio_no_escribible_no_revienta():
    res = render_automatic_eda_pptx(_profile(), "/proc/nope/x.pptx",
                                    {"write_manifest": False})
    assert res["path"] is None
    assert res["n_slides"] == 0
    assert res["note"]
@@ -28,7 +28,6 @@ dependencies = [
    "pypdf>=6.10.0",
    "pyproj>=3.7.2",
    "python-docx>=1.2.0",
    "python-pptx>=1.0.2",
    "pyyaml>=6.0.3",
    "qrcode[pil]>=8.2",
    "rapidfuzz>=3.14.5",