Compare commits
4 Commits
| Author | SHA1 | Date |
|---|---|---|
| | a2074a0167 | |
| | c6d9bc26da | |
| | d1a3d58a6b | |
| | b5334a2e97 | |
@@ -25,7 +25,8 @@ cabecera, y figuras/imágenes se escalan para caber enteras.
 ```
 Document = list[Chapter]
 Chapter  = { id: str, title: str, version: str, blocks: list[Block] }
-Block    = Heading | Markdown | KVTable | DataTable | Figure | Image | Caption | Note
+Block    = Heading | Markdown | KVTable | DataTable | Figure | Image | Caption
+         | Note | Group | GlossaryEntry
 ```
|
||||
Importa el modelo desde `datascience.automatic_eda.model` (o
|
||||
@@ -44,6 +45,10 @@ reconocido se degrada a `Note`, nunca lanza).
|
||||
| `Figure(fig=None, make=None, caption=None, height_in=None)` | an already-built `matplotlib.figure.Figure` (`fig`) or a lazy callable `make() -> Figure` | rasterized and scaled to fit whole (never cropped) |
| `Image(path, caption=None, height_in=None)` | path to a PNG/JPG | scaled to fit whole |
| `Caption(text)` / `Note(text)` | small auxiliary text | gray caption/note; `Note` is also the fallback for unknown blocks |
| `Group(blocks, title=None)` | **keep-together** unit: its blocks stay together | the renderer measures the whole group and moves it intact to the next page/slide if it does not fit; it shrinks the figure to leave room for the title and text. See §11 |
| `GlossaryEntry(key, label, definition)` | one glossary entry (clickable target) | generated by the `glosario` chapter; records its position as the target of the marked terms. See §11 |

`Figure`/`Image` accept `height_in` (a hint): the renderer **clamps** the figure to that maximum height (`Group` uses it to shrink the figure). Every figure scales so that its caption fits on the same page/slide; in PPTX the caption is **always** visible (when no `caption` is given, it falls back to the last heading or to "Figura").
|
||||
|
||||
### Subset de markdown soportado (`Markdown`)
|
||||
|
||||
@@ -84,8 +89,9 @@ El orden canónico está **pre-declarado** en
|
||||
|
||||
 ```python
 CHAPTER_ORDER = [
-    "portada", "overview", "num_distr", "cat_distr", "calidad", "correlacion",
-    "modelos", "analisis_llm", "timeseries", "geospatial", "agregacion",
+    "portada", "overview", "analisis_llm", "num_distr", "cat_distr", "calidad",
+    "correlacion", "modelos", "timeseries", "geospatial", "agregacion",
+    "glosario",
 ]
 ```
|
||||
|
||||
@@ -95,6 +101,15 @@ CHAPTER_ORDER = [
|
||||
`CHAPTER_ORDER`) and it will appear automatically in its position. This lets many
agents work **in parallel** without contention: each one touches only its own file.
|
||||
|
||||
**Two chapters have a special position** (`build_document` manages them; do not touch this):

- `portada`: it is **built last** (after the body) so that it can summarize the
  analysis, but it is **placed first**. It receives `ctx['document_summary']` (see §5) with
  an aggregated summary of the rest. User decision: the cover reflects the findings.
- `glosario`: it is built and **placed last**. It reads the terms that the other
  chapters registered in `ctx['glossary']` (see §11). If none were registered, the
  chapter returns `None` and disappears.
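
The "built last, placed first" rule can be pictured with a small sketch. This is not the real `build_document`: the function name `build_document_sketch`, the use of plain dicts as chapters, the shortened `CHAPTER_ORDER` and the contents of `document_summary` are all assumptions made for illustration, and the relative build order of cover and glossary is a guess.

```python
from typing import Callable, Optional

Chapter = dict  # hypothetical stand-in for the real Chapter model
Builder = Callable[[dict, dict], Optional[Chapter]]

# Shortened, illustrative order (the real list has more chapter ids).
CHAPTER_ORDER = ["portada", "overview", "num_distr", "glosario"]
SPECIAL = ("portada", "glosario")


def build_document_sketch(builders: dict, profile: dict, ctx: dict) -> list:
    """Build the body first, then glossary and cover; place cover first, glossary last."""
    body = []
    for chapter_id in CHAPTER_ORDER:
        if chapter_id in SPECIAL:
            continue
        build = builders.get(chapter_id)
        chapter = build(profile, ctx) if build else None
        if chapter is not None:  # a chapter may opt out by returning None
            body.append(chapter)

    glossary_build = builders.get("glosario")
    glossary = glossary_build(profile, ctx) if glossary_build else None

    # The cover is built after the body so it can read the aggregated summary.
    ctx["document_summary"] = {"chapter_titles": [c["title"] for c in body]}
    cover_build = builders.get("portada")
    cover = cover_build(profile, ctx) if cover_build else None

    return [c for c in (cover, *body, glossary) if c is not None]
```

The point of the sketch is only the separation between build order and placement order: chapters that return `None` vanish, and the cover sees a summary that exists only once the body has been built.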
|
||||
|
||||
If your chapter uses an `<id>` that is not yet in `CHAPTER_ORDER`, add it at the correct
position (this is the only shared edit; coordinate it with the orchestrator).
|
||||
|
||||
@@ -143,6 +158,8 @@ defensivo). Esto habilita el **seguimiento y la mejora continua por capítulo**.
|
||||
| `granularity` | "Cada fila es…" ("Each row is…", cover). Default: derived from `key_candidates` |
| `quality_criteria` | criteria of the quality score (cover) |
| `head_rows` | `list[dict]` with `df.head` (overview). See §7 |
| `glossary` | shared `GlossaryCollector`; chapters register terms in it. Created by `build_document`; see §11 |
| `document_summary` | dict with the aggregated summary of the body (n_rows, n_cols, quality_score, n_numeric, n_categorical, chapter_titles, …). Computed by `build_document` and consumed by the cover |

A chapter may define and consume its own `ctx` keys; document which ones in its
docstring.
|
||||
@@ -279,6 +296,109 @@ sus bloques presentes y el no-corte (texto largo intacto en la salida). Patrón:
|
||||
|
||||
---
|
||||
|
||||
## 11. Glossary, keep-together and zebra (engine, phase 4a)

Three cross-cutting engine capabilities that **all** chapters can use. 11.1 (glossary)
requires the chapter to cooperate (register and mark terms); 11.2 (keep-together) is
opt-in per chapter (wrap blocks in a `Group`); 11.3 (zebra) is automatic (there is
nothing to do).
|
||||
|
||||
### 11.1 Glossary with clickable terms

The glossary is a new chapter (`chapters/glosario.py`) that is **always rendered last**
and lists every technical term that some chapter has registered. Each marked occurrence
of the term in the text becomes a **real click** that jumps to its entry: in PDF as an
internal *link annotation* (post-processed with PyMuPDF, because `PdfPages` does not
support internal hyperlinks), in PPTX as a native *slide jump* (`ppaction://hlinksldjump`).
|
||||
|
||||
**Exact API for a chapter (two steps):**

1. **Register the term** in the shared collector `ctx['glossary']` (a
   `model.GlossaryCollector`, created by `build_document` and passed to every chapter):

   ```python
   glossary = ctx.get("glossary")
   if isinstance(glossary, model.GlossaryCollector):
       glossary.add("entropia", "Entropía (de Shannon)", "Medida, en bits, de …")
   ```

   `add(key, label, definition)` is idempotent (the first definition of each `key` wins).
   `key` must match `[A-Za-z0-9_]+`. If `ctx` has no collector (standalone rendering), the
   chapter simply does not mark terms: it degrades without breaking.
|
||||
|
||||
2. **Mark each occurrence** in the text of a `Markdown` block with the inline span
   `[[term:KEY]]visible text[[/term]]`. The visible text may carry `**bold**`. The
   marker does not alter the visible text (it is stripped like any other inline marker);
   it only makes that text a clickable jump to the glossary entry.

   ```python
   # In cat_distr (real example, already implemented):
   "La [[term:entropia]]**entropía de Shannon**[[/term]] mide cómo de repartidos…"
   ```
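
To make the "first definition wins" contract concrete, here is a minimal sketch of a collector with that behavior. It is not the real `model.GlossaryCollector`: the class name `GlossaryCollectorSketch`, the boolean return value of `add` and the `entries()` method are assumptions made for illustration. Only the idempotency rule, the `[A-Za-z0-9_]+` key constraint and the alphabetical-by-label ordering come from this section.

```python
import re

_KEY_RE = re.compile(r"[A-Za-z0-9_]+")


class GlossaryCollectorSketch:
    """Hypothetical collector: idempotent add(), entries sorted by label."""

    def __init__(self):
        self._entries = {}

    def add(self, key, label, definition) -> bool:
        """Register a term; return False (and change nothing) if rejected."""
        if not isinstance(key, str) or not _KEY_RE.fullmatch(key):
            return False  # invalid key: ignored, never raises
        if key in self._entries:
            return False  # first definition wins
        self._entries[key] = (str(label), str(definition))
        return True

    def entries(self):
        """Return (key, label, definition) tuples in alphabetical order by label."""
        rows = [(k, lab, d) for k, (lab, d) in self._entries.items()]
        return sorted(rows, key=lambda row: row[1].casefold())
```

Because `add` is idempotent, several chapters can register the same key without coordinating; whichever chapter is built first supplies the definition.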
|
||||
|
||||
That is all: the `glosario` chapter collects the terms (alphabetical order by `label`),
emits one `GlossaryEntry` per term, and the renderers wire the links automatically.
If no chapter registered any term, the glossary does not appear.
|
||||
|
||||
**`text_layout` helpers (do not reimplement):** `parse_inline_rich(text)` →
`[(text, is_bold, term_key), …]`; `wrap_rich_terms(text, max_chars)` → lines made of those
spans, without cutting. `strip_inline_md` already removes the `[[term:…]]`/`[[/term]]` markers.
(The earlier functions `parse_inline_bold` / `wrap_rich` still exist, without term support.)
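
For readers who want to see what a `(text, is_bold, term_key)` span list looks like, the following sketch tokenizes the two inline markers described in this section. It is an independent illustration, not the implementation of `parse_inline_rich`: the function name `parse_spans_sketch` and the regex are assumptions, and the sketch ignores every other Markdown construct.

```python
import re

# Markers recognised by this sketch: [[term:KEY]], [[/term]] and ** (bold toggle).
_TOKEN_RE = re.compile(r"(\[\[term:[A-Za-z0-9_]+\]\]|\[\[/term\]\]|\*\*)")


def parse_spans_sketch(text: str) -> list:
    """Split text into (visible_text, is_bold, term_key) tuples."""
    spans = []
    bold = False
    term = None
    for token in _TOKEN_RE.split(text):
        if not token:
            continue  # re.split yields empty strings between adjacent markers
        if token == "**":
            bold = not bold
        elif token == "[[/term]]":
            term = None
        elif _TOKEN_RE.fullmatch(token):
            term = token[len("[[term:"):-2]  # keep only KEY
        else:
            spans.append((token, bold, term))
    return spans
```

Joining the first element of every tuple gives back the visible text with all markers removed, which is the property the renderers rely on when they measure and wrap lines.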
|
||||
|
||||
**Registry functions that wire the links** (group `eda`, already invoked by the
renderers; they degrade silently if missing): `add_pdf_internal_links_py_datascience`
(PyMuPDF, GOTO link) and `pptx_link_run_to_slide_py_datascience` (native slide jump).
Dependency: `pymupdf` (declared in `python/pyproject.toml`).
|
||||
|
||||
**Work for the next phase: hook more terms.** The mechanism is built and tested end to
end with `entropia` (in `cat_distr`). Each chapter must register and mark ITS OWN terms
with the same two-step pattern. Candidates per chapter:

| Chapter | Terms to hook (suggested key) |
|---|---|
| `cat_distr` | `entropia` ✅ (done) |
| `calidad` | `completitud`, `validez`, `consistencia` |
| `correlacion` | `cramers_v`, `fdr` (multiple comparisons), the correlation method used |
| `modelos` | `pca`, `silhouette`, `isolation_forest` |
| `timeseries` | `estacionariedad`, `acf_pacf`, `stl` |
| `num_distr` | `iqr`, `curtosis`, `outlier` (Tukey fences) |

Write each term's definition in its own chapter (as a local constant, like
`_TERM_ENTROPIA_DEF` in `cat_distr`) and mark the term on its first occurrence.
|
||||
|
||||
### 11.2 Keep-together: chart next to its title and text (`Group`)

To prevent a heading from landing on one page/slide and its figure on the next, wrap
the blocks that belong to the same idea in a `model.Group`:

```python
blocks.append(model.Group(blocks=[
    model.Heading(text=str(name), level=2),
    model.Figure(make=_figura_perezosa(...), caption="…"),
    model.Markdown(text="explicación…"),
]))
```

The renderer **measures the whole group** before drawing anything: if it does not fit in
what is left of the page/slide but does fit in a full one, it moves the group **intact**
to the next one; and it **shrinks the figure** (via `height_in`) just enough for title,
text and figure to fit together. If the group is taller than a full page, it starts on a
new page and flows (honest degradation, never cut). Real example already implemented:
`num_distr` wraps each column (heading + histogram/boxplot figure + note) in a `Group`.

Recommended for `agregacion` and for any chapter where a figure must stay attached to its
title/explanation. Cost: if a chapter inspects `chapter.blocks` in its tests, it will now
find `Group`s there; flatten them with a recursive helper (see `num_distr_test.py::_flatten`).
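
A recursive flatten helper of the kind mentioned above could look like the sketch below. It is not the code of `num_distr_test.py::_flatten`, which is not shown here; the `Heading` and `Group` dataclasses are simplified stand-ins for the real model classes, and the name `flatten_blocks` is hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Heading:  # simplified stand-in for model.Heading
    text: str


@dataclass
class Group:  # simplified stand-in for model.Group
    blocks: list = field(default_factory=list)


def flatten_blocks(blocks) -> list:
    """Return the leaf blocks in document order, descending into nested Groups."""
    flat = []
    for block in blocks:
        if isinstance(block, Group):
            flat.extend(flatten_blocks(block.blocks))
        else:
            flat.append(block)
    return flat
```

With a helper like this, a test that used to assert on `chapter.blocks` directly can assert on `flatten_blocks(chapter.blocks)` and stay valid whether or not the chapter groups its blocks.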
|
||||
|
||||
### 11.3 Zebra striping in tables (automatic)

Every `DataTable` is rendered with **even rows shaded** (very light gray, `#f6f8fa`) and
a header with its own background. This is automatic in PDF and PPTX; the pattern stays
consistent when a long table is split and its header repeated (the row index is logical,
not per page). There is nothing to do in the chapters.
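
The "logical, not per page" row index can be illustrated as follows. This is a sketch under assumptions, not the renderer's code: the function name `zebra_fills` is hypothetical, and it assumes rows are counted from 1, so the "even" rows are the 2nd, 4th, and so on.

```python
EVEN_ROW_FILL = "#f6f8fa"  # shade quoted in this section


def zebra_fills(n_rows: int, rows_per_page: int) -> list:
    """Return, per page, the fill of each data row (None means unshaded).

    The shade depends on the row's position in the whole table, so the
    pattern continues across page breaks instead of restarting.
    """
    if rows_per_page <= 0:
        raise ValueError("rows_per_page must be positive")
    pages = []
    for start in range(0, n_rows, rows_per_page):
        stop = min(start + rows_per_page, n_rows)
        pages.append([
            EVEN_ROW_FILL if (logical_index + 1) % 2 == 0 else None
            for logical_index in range(start, stop)
        ])
    return pages
```

For a five-row table split three rows per page, the second page starts with a shaded row, because that row is the fourth of the table.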
|
||||
|
||||
---
|
||||
|
||||
## 10. Future integration with `profile_table` (next phase)

`profile_table(emit_pdf=True)` currently uses `render_eda_pdf` (untouched). In the next phase
|
||||
|
||||
@@ -68,11 +68,13 @@ from .extract_timeseries_raw import extract_timeseries_raw
|
||||
from .build_eda_render_ctx import build_eda_render_ctx
|
||||
from .profile_datetime import profile_datetime
|
||||
from .resample_timeseries import resample_timeseries
|
||||
from .add_pdf_internal_links import add_pdf_internal_links
|
||||
|
||||
__all__ = [
|
||||
"detect_time_column",
|
||||
"extract_timeseries_raw",
|
||||
"build_eda_render_ctx",
|
||||
"add_pdf_internal_links",
|
||||
"profile_datetime",
|
||||
"resample_timeseries",
|
||||
"render_automatic_eda_pdf",
|
||||
|
||||
@@ -0,0 +1,85 @@
|
||||
---
|
||||
name: add_pdf_internal_links
|
||||
kind: function
|
||||
lang: py
|
||||
domain: datascience
|
||||
version: "1.0.0"
|
||||
purity: impure
|
||||
signature: "def add_pdf_internal_links(pdf_path: str, links: list) -> dict"
|
||||
description: "Postprocesa un PDF YA escrito insertando link annotations internos de tipo GOTO ('ir a') con PyMuPDF (import fitz). Pensado para PDFs generados por matplotlib PdfPages, que NO soporta hyperlinks internos: tras escribir el PDF se reabre y, por cada entrada de `links`, se añade una anotacion clicable desde un rectangulo de una pagina origen (src_page + src_rect en puntos top-left) hasta un punto de una pagina destino (dst_page + dst_point). Caso de uso tipico del grupo eda: hacer clicables los terminos de un AutomaticEDA que apuntan a su entrada en el glosario al final del documento. Estilo dict-no-throw: NUNCA lanza; valida cada link y SALTA (n_skipped++) los malformados o fuera de rango en vez de fallar. Guarda de forma segura escribiendo a un temporal en el mismo directorio y haciendo os.replace atomico (evita corromper el original). Devuelve {status:ok,n_links,n_skipped} o {status:error,error}; si pymupdf no esta disponible o el archivo no existe devuelve status error."
|
||||
tags: [eda, datascience, pdf, links, glossary, pymupdf, fitz, postprocess, python]
|
||||
uses_functions: []
|
||||
uses_types: []
|
||||
returns: []
|
||||
returns_optional: false
|
||||
error_type: "error_go_core"
|
||||
imports: []
|
||||
params:
|
||||
- name: pdf_path
|
||||
desc: "ruta al PDF existente (str no vacio). Se reescribe IN SITU (in-place) tras añadir los links: se guarda a un temporal `.<base>.tmp_links` en el mismo directorio y se reemplaza atomicamente con os.replace. Si no es str o no existe el archivo -> {status:error}."
|
||||
- name: links
|
||||
desc: "lista de dicts, uno por link a insertar. Cada dict: src_page (int 0-based de la pagina origen), src_rect ([x0,y0,x1,y1] del rectangulo clicable en PUNTOS PDF 1/72\" con origen ARRIBA-IZQUIERDA), dst_page (int 0-based de la pagina destino), dst_point ([x,y] punto destino, mismos puntos top-left). Las entradas que no son dict, con page fuera de rango [0,page_count), src_rect que no tenga 4 numeros o dst_point que no tenga 2 numeros se SALTAN (n_skipped++), no lanzan. None se trata como lista vacia."
|
||||
output: "dict (NUNCA lanza): en exito {\"status\":\"ok\",\"n_links\":int,\"n_skipped\":int} con n_links = anotaciones GOTO insertadas y n_skipped = entradas invalidas saltadas. En fallo {\"status\":\"error\",\"error\":str}: pymupdf no disponible, pdf_path no es str / no existe, links no es lista, o cualquier excepcion global (el PDF original queda intacto porque el replace solo ocurre tras un save correcto)."
|
||||
tested: true
|
||||
tests: ["test_add_goto_link_basico", "test_links_invalidos_se_saltan", "test_archivo_inexistente_devuelve_error"]
|
||||
test_file_path: "python/functions/datascience/add_pdf_internal_links_test.py"
|
||||
file_path: "python/functions/datascience/add_pdf_internal_links.py"
|
||||
---
|
||||
|
||||
## Example

```python
import os
import sys

sys.path.insert(0, os.path.join("python", "functions"))
from datascience import add_pdf_internal_links

# You have a PDF already written by matplotlib PdfPages (no internal hyperlinks).
# You want the text "Margen bruto" on page 0 (rectangle in top-left points)
# to jump to its glossary entry on the last page (index 7).
res = add_pdf_internal_links(
    "reports/eda.pdf",
    [
        {"src_page": 0, "src_rect": [72, 120, 180, 134], "dst_page": 7, "dst_point": [72, 200]},
        {"src_page": 0, "src_rect": [72, 140, 180, 154], "dst_page": 7, "dst_point": [72, 260]},
    ],
)
# res == {"status": "ok", "n_links": 2, "n_skipped": 0}
```
|
||||
|
||||
## When to use it

Right AFTER writing a PDF with matplotlib `PdfPages` (or any engine that does not
generate internal hyperlinks), when you need certain terms or references to be
clickable and jump to another page of the same document. The canonical case is
linking the terms of an AutomaticEDA to their glossary entry at the end. It is a
post-processing step: first you generate the PDF and work out which rectangle each
term ended up in (in PDF points), then you pass that list to this function to
inject the GOTO annotations.
|
||||
|
||||
## Gotchas

- **Impure: it rewrites the file IN PLACE.** The PDF at `pdf_path` is replaced
  by the version with the links. Saving is safe: it writes to a temporary
  `.<base>.tmp_links` in the SAME directory and performs an atomic `os.replace`
  after closing the document, so a failure midway does not corrupt the original.
  Even so, keep a copy if the PDF is valuable.
|
||||
- **Coordinate system: top-left points (the PyMuPDF convention).** `src_rect`
  and `dst_point` are expressed in PDF POINTS (1/72") with the origin at the
  TOP-LEFT of the page, which is how PyMuPDF addresses a page. Rectangles that
  the caller already computes in top-left page points are passed as-is. Note
  that matplotlib's figure and display coordinates place the origin at the
  bottom-left: rectangles measured there need their Y axis flipped
  (`y_top = page_height - y_bottom`) and, if measured in pixels, scaled by
  `72 / dpi` before being passed in.
|
||||
- **0-based page indices.** `src_page` / `dst_page` are 0-based indices
  (the first page is 0). Outside the range `[0, page_count)` the link is SKIPPED
  (it counts toward `n_skipped`); it does not raise.
- **dict-no-throw, per-link validation.** Malformed entries (not a dict, page
  out of range, `src_rect` without 4 numbers, `dst_point` without 2 numbers) are
  skipped individually and increment `n_skipped`; the remaining valid links are
  still inserted. The function returns `{status:error}` only on global failures
  (pymupdf missing, file not found, `links` is not a list).
- **`error_type: error_go_core` is registry metadata, not behavior.**
  Every impure function must declare it and the indexer requires it, but the code
  NEVER raises that exception: it degrades to the status dict.
- **Requires PyMuPDF (`import fitz`).** If it is not installed, the function returns
  `{"status":"error","error":"pymupdf no disponible: ..."}`. In the registry, the
  `python/.venv` venv already ships it.
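
If the rectangles you measured live in a space whose Y axis grows upward (for example, a bounding box taken from a matplotlib figure), they need converting to the top-left points this function expects. The helper below is a hypothetical sketch, not part of the registry: the name `bottom_left_bbox_to_src_rect` and its parameters are assumptions, and it presumes the figure maps one-to-one onto the PDF page.

```python
def bottom_left_bbox_to_src_rect(x0, y0, x1, y1, page_height_pt, dpi=72.0):
    """Convert a bottom-left-origin bbox into a top-left-origin rect in points.

    x0, y0, x1, y1 are in units of 1/dpi inch with the origin at the
    bottom-left of the page; the result is [x0, y0, x1, y1] in PDF points
    (1/72 inch) with the origin at the top-left.
    """
    scale = 72.0 / float(dpi)  # pixels (or other 1/dpi units) -> points
    left, right = sorted((x0 * scale, x1 * scale))
    bottom, top = sorted((y0 * scale, y1 * scale))
    # Flip the Y axis: the highest edge becomes the smallest top-left y.
    return [left, page_height_pt - top, right, page_height_pt - bottom]
```

The returned list can be used directly as `src_rect`; a `dst_point` can be converted the same way by flipping its single y value.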
|
||||
@@ -0,0 +1,132 @@
"""Post-process an existing PDF by inserting internal link annotations (GOTO).

Engine: PyMuPDF (``import fitz``). Intended for PDFs generated by matplotlib
``PdfPages``, which does not support internal hyperlinks: after the PDF is
written, this function reopens it and adds "go to" (GOTO) annotations from a
rectangle on a source page to a point on a destination page. Useful for making
terms clickable so that they jump to their entry in a glossary at the end of
the document.

dict-no-throw style of the `eda` group: NEVER raises; returns a status dict.
"""

import os


def add_pdf_internal_links(pdf_path: str, links: list) -> dict:
    """Add internal link annotations (GOTO) to an already-written PDF.

    Post-processes a PDF (e.g. generated by matplotlib PdfPages, which does NOT
    support internal hyperlinks) by inserting, for each entry in ``links``, a
    "go to" annotation from a rectangle on a source page to a point on a
    destination page. Used to make terms clickable so that they jump to their
    entry in a glossary at the end of the document.

    Args:
        pdf_path: path to the existing PDF (rewritten in place).
        links: list of dicts, each one:
            {
              "src_page": int,            # 0-based index of the source page
              "src_rect": [x0,y0,x1,y1],  # clickable rectangle, in PDF POINTS
                                          # (1/72") with a TOP-LEFT origin
              "dst_page": int,            # 0-based index of the destination page
              "dst_point": [x, y],        # destination point, same top-left points
            }

    Returns:
        dict (NEVER raises): {"status":"ok","n_links":int,"n_skipped":int}
        or {"status":"error","error":str}. If pymupdf is not available or the
        file does not exist -> {"status":"error", ...}.
    """
    try:
        try:
            import fitz  # PyMuPDF
        except Exception as exc:  # ImportError or any other load failure
            return {"status": "error", "error": f"pymupdf no disponible: {exc}"}

        if not isinstance(pdf_path, str) or not pdf_path:
            return {"status": "error", "error": "pdf_path debe ser una ruta no vacia"}
        if not os.path.isfile(pdf_path):
            return {"status": "error", "error": f"el archivo no existe: {pdf_path}"}

        if links is None:
            links = []
        if not isinstance(links, (list, tuple)):
            return {"status": "error", "error": "links debe ser una lista de dicts"}

        doc = fitz.open(pdf_path)
        try:
            n_pages = doc.page_count
            n_ok = 0
            n_skipped = 0

            for link in links:
                if not isinstance(link, dict):
                    n_skipped += 1
                    continue

                src_page = link.get("src_page")
                dst_page = link.get("dst_page")
                src_rect = link.get("src_rect")
                dst_point = link.get("dst_point")

                # src_page / dst_page: 0-based integers within range.
                if not _is_int(src_page) or not _is_int(dst_page):
                    n_skipped += 1
                    continue
                if not (0 <= src_page < n_pages) or not (0 <= dst_page < n_pages):
                    n_skipped += 1
                    continue

                # src_rect: 4 numbers.
                if not _is_num_seq(src_rect, 4):
                    n_skipped += 1
                    continue
                # dst_point: 2 numbers.
                if not _is_num_seq(dst_point, 2):
                    n_skipped += 1
                    continue

                try:
                    doc[int(src_page)].insert_link(
                        {
                            "kind": fitz.LINK_GOTO,
                            "from": fitz.Rect(*[float(v) for v in src_rect]),
                            "page": int(dst_page),
                            "to": fitz.Point(*[float(v) for v in dst_point]),
                        }
                    )
                    n_ok += 1
                except Exception:
                    n_skipped += 1
                    continue

            # Safe save: write to a temporary file in the same directory and
            # replace atomically (avoids corrupting the original PDF).
            directory = os.path.dirname(os.path.abspath(pdf_path)) or "."
            base = os.path.basename(pdf_path)
            tmp_path = os.path.join(directory, f".{base}.tmp_links")
            doc.save(tmp_path)
        finally:
            doc.close()

        os.replace(tmp_path, pdf_path)

        return {"status": "ok", "n_links": n_ok, "n_skipped": n_skipped}
    except Exception as exc:  # degrade any failure to an error dict
        return {"status": "error", "error": str(exc)}


def _is_int(value) -> bool:
    """True if value is an integer (not a bool)."""
    return isinstance(value, int) and not isinstance(value, bool)


def _is_num_seq(value, length: int) -> bool:
    """True if value is a sequence of `length` numbers (int/float, not bool)."""
    if not isinstance(value, (list, tuple)) or len(value) != length:
        return False
    for v in value:
        if isinstance(v, bool) or not isinstance(v, (int, float)):
            return False
    return True
|
||||
@@ -0,0 +1,77 @@
"""Tests for add_pdf_internal_links."""

import os
import sys

import pytest

sys.path.insert(0, os.path.dirname(__file__))

from add_pdf_internal_links import add_pdf_internal_links


def test_add_goto_link_basico(tmp_path):
    """Golden: a 2-page PDF receives a GOTO link from page 0 to page 1."""
    fitz = pytest.importorskip("fitz")

    # 1) Temporary 2-page A5 PDF (~419x595 points).
    pdf = str(tmp_path / "doc.pdf")
    doc = fitz.open()
    doc.new_page(width=419, height=595)
    doc.new_page(width=419, height=595)
    doc.save(pdf)
    doc.close()

    # 2) Insert an internal link from page 0 to page 1.
    res = add_pdf_internal_links(
        pdf,
        [{"src_page": 0, "src_rect": [50, 50, 200, 70], "dst_page": 1, "dst_point": [40, 40]}],
    )
    assert res["status"] == "ok"
    assert res["n_links"] == 1
    assert res["n_skipped"] == 0

    # 3) Reopen and check that page 0 has a GOTO link to page 1.
    doc = fitz.open(pdf)
    try:
        links = doc[0].get_links()
        goto = [
            link for link in links
            if link.get("kind") == fitz.LINK_GOTO and link.get("page") == 1
        ]
        assert len(goto) >= 1
    finally:
        doc.close()


def test_links_invalidos_se_saltan(tmp_path):
    """Edge: malformed or out-of-range entries increment n_skipped, they do not raise."""
    fitz = pytest.importorskip("fitz")

    pdf = str(tmp_path / "doc.pdf")
    doc = fitz.open()
    doc.new_page(width=419, height=595)
    doc.new_page(width=419, height=595)
    doc.save(pdf)
    doc.close()

    res = add_pdf_internal_links(
        pdf,
        [
            # valid
            {"src_page": 0, "src_rect": [10, 10, 90, 30], "dst_page": 1, "dst_point": [20, 20]},
            # dst_page out of range
            {"src_page": 0, "src_rect": [10, 40, 90, 60], "dst_page": 9, "dst_point": [20, 20]},
            # src_rect with 3 numbers
            {"src_page": 0, "src_rect": [10, 70, 90], "dst_page": 1, "dst_point": [20, 20]},
            # not a dict
            "no-soy-un-dict",
        ],
    )
    assert res["status"] == "ok"
    assert res["n_links"] == 1
    assert res["n_skipped"] == 3


def test_archivo_inexistente_devuelve_error():
    """Error path: nonexistent pdf_path -> status error without raising."""
    res = add_pdf_internal_links("/ruta/que/no/existe_xyz.pdf", [])
    assert res["status"] == "error"
    assert "error" in res
|
||||
@@ -21,6 +21,9 @@ from .model import ( # noqa: F401
|
||||
Chapter,
|
||||
DataTable,
|
||||
Figure,
|
||||
GlossaryCollector,
|
||||
GlossaryEntry,
|
||||
Group,
|
||||
Heading,
|
||||
Image,
|
||||
KVTable,
|
||||
@@ -45,6 +48,9 @@ __all__ = [
|
||||
"Image",
|
||||
"Caption",
|
||||
"Note",
|
||||
"Group",
|
||||
"GlossaryEntry",
|
||||
"GlossaryCollector",
|
||||
"Chapter",
|
||||
"as_blocks",
|
||||
"as_chapters",
|
||||
|
||||
@@ -1,22 +1,26 @@
 """Data-quality chapter (CALIDAD) for AutomaticEDA.

 Builds the quality chapter from a ``TableProfile`` of the ``eda`` group. The
-chapter answers, in Spanish and as tables, the three things the user asked for:
+chapter implements the quality model of report 2046:

-1. **En qué se basa la calidad** — an intro paragraph explaining the criteria and
-   their weights (completeness, validity, consistency) before any number, plus a
-   table-level summary (global score and aggregates).
+1. **En qué se basa la calidad** — an intro paragraph explaining the two scored
+   dimensions and their weights (completitud 60%, validez 40%) plus the
+   table-level row uniqueness, BEFORE any number, and stating explicitly that
+   outliers are reported as observations and do **not** lower the score. The
+   criteria terms (calidad de datos, completitud, validez, unicidad de registro)
+   are hooked into the shared glossary as clickable jumps.
 2. **Scores por columna** — a table with, per column, the total quality score and
-   its breakdown into completeness / validity / consistency.
-3. **Problemas en español** — a second table listing, per column, the readable
-   issues in Spanish (kept separate from the type ``flags``).
+   its breakdown into completeness / validity (no consistency dimension).
+3. **Problemas de calidad** — a table listing ONLY real quality defects
+   (nulls, empty cells, values not conforming to their type/semantics).
+4. **Observaciones analíticas** — a SEPARATE table for outliers, constant
+   columns, high-cardinality ids and strong skew, with an explicit note that
+   these do not affect the score.

-The breakdown and the issues are NOT recomputed here: they come from the registry
-function ``column_quality_score`` (group ``eda``), which already derives
-``{score, completeness, validity, consistency, issues}`` from the ColumnProfile.
-This chapter is render-only — it consumes that function and lays the result out
-as model blocks; the renderers paginate tables (splitting by rows, repeating the
-header) and wrap long cells so nothing is ever cut.
+The breakdown, issues and observations are NOT recomputed here: they come from
+the registry function ``column_quality_score`` (group ``eda``), which derives
+``{score, completeness, validity, dimensions, applicable, issues,
+observations}`` from the ColumnProfile. This chapter is render-only.

 Contract: build_<id>(profile, ctx) -> Chapter | None ; CHAPTER_VERSION = "x.y.z".
 """
|
||||
@@ -33,28 +37,47 @@ try: # pragma: no cover - import wiring
|
||||
except Exception: # noqa: BLE001 - never let an import error abort the document.
|
||||
_column_quality_score = None
|
||||
|
||||
CHAPTER_VERSION = "1.0.0"
|
||||
CHAPTER_VERSION = "2.0.0"
|
||||
CHAPTER_ID = "calidad"
|
||||
CHAPTER_TITLE = "Calidad"
|
||||
|
||||
# Weights mirror column_quality_score: completeness 0.5, validity 0.3,
|
||||
# consistency 0.2. Kept here only to render the human explanation; the actual
|
||||
# numbers always come from the function so the two never drift in computation.
|
||||
_CRITERIA_INTRO = (
|
||||
"La calidad de cada columna es un score de 0 a 100 que combina tres "
|
||||
"criterios, cada uno con un peso:\n\n"
|
||||
"- **Completitud (peso 50%)**: proporción de valores presentes (sin nulos "
|
||||
"ni vacíos). Una columna con muchos nulos baja de score.\n"
|
||||
"- **Validez (peso 30%)**: los valores son coherentes con su tipo y rango "
|
||||
"esperado (penaliza outliers y semánticas declaradas que no coinciden).\n"
|
||||
"- **Consistencia (peso 20%)**: la columna aporta información útil (penaliza "
|
||||
"columnas constantes o identificadores de cardinalidad muy alta).\n\n"
|
||||
"Score = 100 × (0,5·completitud + 0,3·validez + 0,2·consistencia). "
|
||||
"Los problemas detectados por columna se listan en español más abajo."
|
||||
)
|
||||
# Glossary terms this chapter explains (report 2046 §6). Registered in the shared
|
||||
# collector and marked clickable on their first appearance (contract §11.1).
|
||||
_TERMS = {
|
||||
"calidad_datos": (
|
||||
"Calidad de datos (score 0-100)",
|
||||
"Mide hasta qué punto los datos están presentes y son utilizables tal "
|
||||
"cual, no si son «buenos para el análisis». Se compone solo de "
|
||||
"dimensiones medibles automáticamente desde el perfil de la tabla, sin "
|
||||
"fuente externa de verdad: completitud (60%), validez (40%, cuando es "
|
||||
"medible) y, a nivel de tabla, unicidad de registro. Los valores "
|
||||
"atípicos NO bajan la calidad: se listan aparte como observaciones.",
|
||||
),
|
||||
"completitud": (
|
||||
"Completitud",
|
||||
"Proporción de valores realmente presentes en una columna (1 − % de "
|
||||
"nulos; en texto, las celdas vacías también cuentan como faltantes). Los "
|
||||
"nulos y vacíos bajan el score porque falta información que debería "
|
||||
"estar. Pesa el 60% del score de columna.",
|
||||
),
|
||||
"validez": (
|
||||
"Validez",
|
||||
"Proporción de valores que encajan con su tipo o formato esperado: un "
|
||||
"número que parsea, una fecha legible, un email con forma de email. Los "
|
||||
"valores que no parsean a su tipo bajan el score. Si la columna es texto "
|
||||
"libre sin formato esperado, la validez no se puede medir y el score se "
|
||||
"basa solo en la completitud. Pesa el 40% del score cuando es medible.",
|
||||
),
|
||||
"unicidad_registro": (
|
||||
"Unicidad de registro",
|
||||
"A nivel de tabla, las filas duplicadas restan calidad al conjunto "
|
||||
"(1 − % de filas duplicadas). Es distinta de que una columna no-clave "
|
||||
"repita valores, que no es un defecto de calidad.",
|
||||
),
|
||||
}
|
||||
|
||||
# Cap for the joined issues cell so a single row never grows taller than a page;
|
||||
# the remainder is summarized as "(+N más)" instead of being silently dropped.
|
||||
# Cap for the joined cell so a single row never grows taller than a page; the
|
||||
# remainder is summarized as "(+N más)" instead of being silently dropped.
|
||||
_ISSUES_MAXLEN = 160
|
||||
|
||||
|
||||
@@ -82,12 +105,19 @@ def _fmt_unit_pct(value) -> str:
|
||||
return str(value)
|
||||
|
||||
|
||||
+def _fmt_validity(value) -> str:
+    """Validity is ``None`` when not applicable: show ``n/a`` not a fake 0%."""
+    if value is None:
+        return "n/a"
+    return _fmt_unit_pct(value)
+
+
 def _quality_of(col: dict) -> dict:
-    """Return ``{score, completeness, validity, consistency, issues}`` for a column.
+    """Return the quality dict for a column.

     Uses the registry ``column_quality_score`` when available; otherwise falls
     back to the per-column ``quality_score`` already in the profile (number only,
-    empty breakdown/issues). Never raises.
+    empty breakdown/issues/observations). Never raises.
     """
     if not isinstance(col, dict):
         col = {}
|
||||
@@ -98,26 +128,25 @@ def _quality_of(col: dict) -> dict:
|
||||
return res
|
||||
except Exception: # noqa: BLE001 - degrade instead of aborting.
|
||||
pass
|
||||
# Fallback: only the final score is available pre-computed in the profile.
|
||||
return {
|
||||
"score": col.get("quality_score"),
|
||||
"completeness": None,
|
||||
"validity": None,
|
||||
"consistency": None,
|
||||
"issues": [],
|
||||
"observations": [],
|
||||
}
|
||||
|
||||
|
||||
def _join_issues(issues) -> str:
|
||||
"""Join Spanish issue strings into one cell, truncating overly long lists.
|
||||
def _join_cells(items) -> str:
|
||||
"""Join Spanish strings into one cell, truncating overly long lists.
|
||||
|
||||
The renderer wraps cell text, but a column with many long issues could make a
|
||||
single row taller than a whole page; cap the length and append ``(+N más)``
|
||||
so the count of hidden issues is honest rather than silently lost.
|
||||
The renderer wraps cell text, but a column with many long entries could make
|
||||
a single row taller than a whole page; cap the length and append ``(+N más)``
|
||||
so the count of hidden entries is honest rather than silently lost.
|
||||
"""
|
||||
if not isinstance(issues, (list, tuple)) or not issues:
|
||||
if not isinstance(items, (list, tuple)) or not items:
|
||||
return ""
|
||||
parts = [model._safe_str(i).strip() for i in issues]
|
||||
parts = [model._safe_str(i).strip() for i in items]
|
||||
parts = [p for p in parts if p]
|
||||
if not parts:
|
||||
return ""
|
||||
@@ -142,6 +171,33 @@ def _columns_with_quality(profile: dict):
|
||||
yield c, _quality_of(c)
|
||||
|
||||
|
||||
def _fmt_unit_pct_or_pct(value) -> str:
    """Format a value that may be a 0-1 fraction or an already-0-100 percentage."""
    try:
        num = float(value)
    except (TypeError, ValueError):
        return model._safe_str(value)
    if num != num:  # NaN
        return "—"
    pct = num * 100 if num <= 1.0 else num
    text = f"{pct:.1f}".rstrip("0").rstrip(".")
    return f"{text}%"
|
||||
|
||||
|
||||
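The scale heuristic in `_fmt_unit_pct_or_pct` is easiest to judge from a few inputs. The sketch below is a standalone copy for experimentation: `fmt_pct_either_scale` is a made-up name, and `str()` stands in for `model._safe_str`, whose implementation is not shown here.

```python
def fmt_pct_either_scale(value) -> str:
    """Standalone copy of the heuristic: values <= 1.0 are read as fractions."""
    try:
        num = float(value)
    except (TypeError, ValueError):
        return str(value)  # assumption: stand-in for model._safe_str
    if num != num:  # NaN is the only float that differs from itself
        return "\u2014"
    pct = num * 100 if num <= 1.0 else num
    # Drop a trailing ".0" so 50.0 prints as "50" but 12.3 stays "12.3".
    text = f"{pct:.1f}".rstrip("0").rstrip(".")
    return f"{text}%"


samples = {v: fmt_pct_either_scale(v) for v in (0.5, 50, 0.123, 1.0, 1.5)}
```

Note the boundary: `1.0` is read as a fraction and printed as `100%`, while `1.5` is read as already being a percentage and printed as `1.5%`. A value of exactly 1% on the 0-100 scale is therefore indistinguishable from 100% on the 0-1 scale.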
def _row_uniqueness(profile: dict):
    """Return row uniqueness (1 - duplicate_pct) in [0,1], or None if unknown."""
    dup = profile.get("duplicate_pct")
    if dup is None:
        return None
    try:
        d = float(dup)
    except (TypeError, ValueError):
        return None
    if d > 1.0:  # tolerate a 0-100 scale
        d = d / 100.0
    return max(0.0, min(1.0, 1.0 - d))
|
||||
|
||||
|
||||
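As a quick check of the behavior, here is a standalone copy of the row-uniqueness logic run against a few profiles. The name `row_uniqueness` and the sample profiles are illustrative only; the `0.04` value is the one used by the test profile later in this diff.

```python
def row_uniqueness(profile: dict):
    """Standalone copy: 1 - duplicate_pct, clamped to [0, 1], or None."""
    dup = profile.get("duplicate_pct")
    if dup is None:
        return None
    try:
        d = float(dup)
    except (TypeError, ValueError):
        return None
    if d > 1.0:  # a value above 1 is taken to be on the 0-100 scale
        d = d / 100.0
    return max(0.0, min(1.0, 1.0 - d))


as_fraction = row_uniqueness({"duplicate_pct": 0.04})
as_percent = row_uniqueness({"duplicate_pct": 4})
unknown = row_uniqueness({})
clamped = row_uniqueness({"duplicate_pct": 250})
```

Both `0.04` and `4` resolve to roughly 0.96, a missing or unparseable value yields `None` so the summary row is simply omitted, and an out-of-range input is clamped rather than producing a negative score.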
def _summary_block(profile: dict, evaluated: list):
|
||||
"""Table-level KVTable: global score and quality aggregates."""
|
||||
rows = []
|
||||
@@ -153,14 +209,15 @@ def _summary_block(profile: dict, evaluated: list):
|
||||
if isinstance(q.get("completeness"), (int, float))]
|
||||
vals = [q.get("validity") for _, q in evaluated
|
||||
if isinstance(q.get("validity"), (int, float))]
|
||||
cons = [q.get("consistency") for _, q in evaluated
|
||||
if isinstance(q.get("consistency"), (int, float))]
|
||||
if comps:
|
||||
rows.append(("Completitud media", _fmt_unit_pct(sum(comps) / len(comps))))
|
||||
if vals:
|
||||
rows.append(("Validez media", _fmt_unit_pct(sum(vals) / len(vals))))
|
||||
if cons:
|
||||
rows.append(("Consistencia media", _fmt_unit_pct(sum(cons) / len(cons))))
|
||||
rows.append(("Validez media (donde aplica)",
|
||||
_fmt_unit_pct(sum(vals) / len(vals))))
|
||||
|
||||
ru = _row_uniqueness(profile)
|
||||
if ru is not None:
|
||||
rows.append(("Unicidad de registro", _fmt_unit_pct(ru)))
|
||||
|
||||
n_problem = sum(1 for _, q in evaluated if q.get("issues"))
|
||||
rows.append(("Columnas con problemas", str(n_problem)))
|
||||
@@ -182,22 +239,9 @@ def _summary_block(profile: dict, evaluated: list):
|
||||
return model.KVTable(rows=rows, title="Resumen de calidad")
|
||||
|
||||
|
||||
def _fmt_unit_pct_or_pct(value) -> str:
|
||||
"""Format a value that may be a 0-1 fraction or an already-0-100 percentage."""
|
||||
try:
|
||||
num = float(value)
|
||||
except (TypeError, ValueError):
|
||||
return model._safe_str(value)
|
||||
if num != num: # NaN
|
||||
return "—"
|
||||
pct = num * 100 if num <= 1.0 else num
|
||||
text = f"{pct:.1f}".rstrip("0").rstrip(".")
|
||||
return f"{text}%"
|
||||
|
||||
|
||||
def _scores_block(evaluated: list):
|
||||
"""DataTable with per-column score and its three-criteria breakdown."""
|
||||
header = ["Columna", "Calidad", "Completitud", "Validez", "Consistencia"]
|
||||
"""DataTable with per-column score and its completeness/validity breakdown."""
|
||||
header = ["Columna", "Calidad", "Completitud", "Validez"]
|
||||
rows = []
|
||||
# Worst columns first so the reader sees the problems at the top.
|
||||
ordered = sorted(
|
||||
@@ -210,22 +254,22 @@ def _scores_block(evaluated: list):
|
||||
col.get("name") or "(col)",
|
||||
_fmt_score(q.get("score")),
|
||||
_fmt_unit_pct(q.get("completeness")),
|
||||
_fmt_unit_pct(q.get("validity")),
|
||||
_fmt_unit_pct(q.get("consistency")),
|
||||
_fmt_validity(q.get("validity")),
|
||||
])
|
||||
if not rows:
|
||||
return None
|
||||
return model.DataTable(header=header, rows=rows,
|
||||
title="Scores de calidad por columna",
|
||||
note="0 = peor, 100 = mejor; ordenado de peor a mejor")
|
||||
note="0 = peor, 100 = mejor; «n/a» = dimensión no "
|
||||
"medible; ordenado de peor a mejor")
|
||||
|
||||
|
||||
def _issues_block(evaluated: list):
|
||||
"""DataTable listing Spanish issues per column, or a Note when there are none."""
|
||||
header = ["Columna", "Problemas detectados (español)"]
|
||||
"""DataTable listing ONLY real quality defects per column, or a Note."""
|
||||
header = ["Columna", "Problemas de calidad (español)"]
|
||||
rows = []
|
||||
for col, q in evaluated:
|
||||
joined = _join_issues(q.get("issues"))
|
||||
joined = _join_cells(q.get("issues"))
|
||||
if joined:
|
||||
rows.append([col.get("name") or "(col)", joined])
|
||||
if not rows:
|
||||
@@ -235,6 +279,63 @@ def _issues_block(evaluated: list):
|
||||
title="Problemas de calidad por columna")
|
||||
|
||||
|
||||
def _observations_block(evaluated: list):
    """DataTable listing analytical observations per column, or None.

    Observations (outliers, constant columns, ids, strong skew) are NOT quality
    defects: they do not affect the score. Returned as a separate table from the
    issues so the report never presents a legitimate outlier as a problem.
    """
    header = ["Columna", "Observaciones analíticas"]
    rows = []
    for col, q in evaluated:
        joined = _join_cells(q.get("observations"))
        if joined:
            rows.append([col.get("name") or "(col)", joined])
    if not rows:
        return None
    return model.DataTable(
        header=header, rows=rows,
        title="Observaciones analíticas por columna",
        note="No son defectos de calidad y NO afectan al score; orientan el "
             "análisis (atípicos, columnas constantes, identificadores).")
|
||||
|
||||
|
||||
def _term(key: str, label: str, mark: bool) -> str:
    """Render a term as a clickable glossary span when marking is enabled."""
    if mark:
        return f"[[term:{key}]]**{label}**[[/term]]"
    return f"**{label}**"
|
||||
|
||||
|
||||
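To make the marker syntax concrete, the sketch below produces a marked term and then recovers the plain markdown and the referenced keys. `term` mirrors `_term` above; the regular expression and `strip_terms` are hypothetical helpers for illustration and are not how the renderers are known to parse the markers.

```python
import re

# Assumption: keys are lowercase snake_case, as in the keys registered in this diff.
TERM_RE = re.compile(r"\[\[term:([a-z0-9_]+)\]\](.*?)\[\[/term\]\]", re.S)


def term(key: str, label: str, mark: bool) -> str:
    """Same output shape as _term in the chapter."""
    if mark:
        return f"[[term:{key}]]**{label}**[[/term]]"
    return f"**{label}**"


def strip_terms(text: str):
    """Hypothetical helper: return (text without markers, list of term keys)."""
    keys = [m.group(1) for m in TERM_RE.finditer(text)]
    plain = TERM_RE.sub(lambda m: m.group(2), text)
    return plain, keys


marked = term("validez", "Validez", mark=True)
plain, keys = strip_terms(f"La {marked} pesa el 40%.")
```

With `mark=False` the visible text is identical, which is why a chapter can be built with or without a glossary collector and still read the same.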
def _criteria_intro(mark: bool) -> str:
    """Intro paragraph explaining the two scored dimensions and the principle."""
    calidad = _term("calidad_datos", "calidad de datos", mark)
    completitud = _term("completitud", "Completitud (peso 60%)", mark)
    validez = _term("validez", "Validez (peso 40%, cuando es medible)", mark)
    unicidad = _term("unicidad_registro", "unicidad de registro", mark)
    return (
        f"La {calidad} de cada columna es un score de 0 a 100 que combina solo "
        "dimensiones medibles desde el perfil de la tabla, sin fuente externa "
        "de verdad:\n\n"
        f"- {completitud}: proporción de valores presentes (1 − % de nulos; en "
        "texto, las celdas vacías cuentan como faltantes). Los nulos y vacíos "
        "bajan el score.\n"
        f"- {validez}: proporción de valores que encajan con su tipo o formato "
        "(un número que parsea, una fecha legible, un email con forma de email). "
        "Si una columna es texto libre sin formato esperado, la validez no se "
        "mide y el score se basa solo en la completitud.\n\n"
        f"Score de columna = 100 × (0,6·completitud + 0,4·validez), "
        "renormalizado cuando la validez no aplica. A nivel de tabla se añade "
        f"la {unicidad} (1 − % de filas duplicadas).\n\n"
        "**Los valores atípicos (outliers) NO bajan la calidad.** Un valor "
        "extremo puede ser real y correcto; detectar atípicos es parte del "
        "análisis de la distribución, no un juicio de corrección. Por eso, junto "
        "con las columnas constantes y los identificadores, se listan aparte "
        "como **observaciones analíticas** que no afectan al score."
    )
|
||||
|
||||
|
||||
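The scoring rule stated in the intro text can be written out as a small function. This is a sketch of the formula as described (60% completeness, 40% validity, completeness only when validity is not measurable); the real `column_quality_score` lives in a registry that is not part of this diff, so `column_score` and its parameters are illustrative names.

```python
def column_score(completeness, validity=None, w_comp=0.6, w_val=0.4):
    """Sketch of the documented rule: 100 * (0.6*completeness + 0.4*validity).

    When validity is None (not measurable), the weights are renormalized so
    the score rests on completeness alone.
    """
    if validity is None:
        return 100.0 * completeness
    return 100.0 * (w_comp * completeness + w_val * validity)


# "edad" in the test profile: 20% nulls, every present value parses.
edad = round(column_score(0.8, 1.0))
# Free text with 20% empty cells: validity is not measurable.
texto = round(column_score(0.8, None))
```

The first case gives 88, which matches the value asserted for `edad` in the tests further down; outliers never enter the formula.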
def build_calidad(profile: dict, ctx: dict):
|
||||
"""Build the data-quality Chapter, or None if the profile has no columns.
|
||||
|
||||
@@ -250,17 +351,35 @@ def build_calidad(profile: dict, ctx: dict):
|
||||
if not evaluated:
|
||||
return None # no columns to score -> chapter does not apply.
|
||||
|
||||
# Register the criteria terms in the shared glossary (if present) and mark
|
||||
# their first appearance clickable. Contract §11.1.
|
||||
glossary = ctx.get("glossary")
|
||||
mark = False
|
||||
if isinstance(glossary, model.GlossaryCollector):
|
||||
for key, (label, definition) in _TERMS.items():
|
||||
glossary.add(key, label, definition)
|
||||
mark = True
|
||||
|
||||
blocks = [
|
||||
model.Heading(text="Cómo se calcula la calidad", level=2),
|
||||
model.Markdown(text=_CRITERIA_INTRO),
|
||||
model.Markdown(text=_criteria_intro(mark)),
|
||||
_summary_block(profile, evaluated),
|
||||
model.Heading(text="Scores por columna", level=2),
|
||||
]
|
||||
scores = _scores_block(evaluated)
|
||||
if scores is not None:
|
||||
blocks.append(scores)
|
||||
blocks.append(model.Heading(text="Problemas detectados", level=2))
|
||||
|
||||
blocks.append(model.Heading(text="Problemas de calidad", level=2))
|
||||
blocks.append(_issues_block(evaluated))
|
||||
|
||||
observations = _observations_block(evaluated)
|
||||
if observations is not None:
|
||||
blocks.append(model.Heading(text="Observaciones analíticas", level=2))
|
||||
blocks.append(model.Note(
|
||||
"Las observaciones siguientes NO son defectos de calidad y no "
|
||||
"afectan al score: son señales para orientar el análisis."))
|
||||
blocks.append(observations)
|
||||
|
||||
return model.Chapter(id=CHAPTER_ID, title=CHAPTER_TITLE,
|
||||
version=CHAPTER_VERSION, blocks=blocks)
|
||||
|
||||
@@ -1,11 +1,12 @@
|
||||
"""Tests for the CALIDAD chapter — DoD: golden + edges + anti-cut.
|
||||
"""Tests for the CALIDAD chapter — DoD: golden + edges + anti-cut + glossary.
|
||||
|
||||
Self-contained: builds synthetic TableProfiles (no DuckDB) so the suite is fast
|
||||
and deterministic. Verifies that the chapter explains the quality criteria, shows
|
||||
per-column scores with the completeness/validity/consistency breakdown, lists the
|
||||
issues in Spanish (separate from the type flags), returns None when it does not
|
||||
apply, and that a wide profile with long names renders to PDF and PPTX without
|
||||
cutting any cell text (long content wraps, it is never truncated).
|
||||
and deterministic. Verifies the report-2046 quality model: the chapter explains
|
||||
the two scored dimensions (completitud 60% / validez 40%), shows per-column
|
||||
scores without a consistency column, keeps quality DEFECTS (issues) separate
|
||||
from analytical OBSERVATIONS (outliers, constant, ids), hooks the criteria terms
|
||||
into the glossary, returns None when it does not apply, and renders a wide
|
||||
profile to PDF and PPTX without cutting any cell text.
|
||||
"""
|
||||
|
||||
import os
|
||||
@@ -20,28 +21,30 @@ from datascience.automatic_eda.chapters.calidad import (
|
||||
CHAPTER_VERSION,
|
||||
)
|
||||
from datascience.automatic_eda import build_document, render_pdf, render_pptx
|
||||
from datascience.automatic_eda import model
|
||||
|
||||
|
||||
def _profile() -> dict:
|
||||
"""A small profile with one column per quality problem (nulls, outliers,
|
||||
constant, high-cardinality id) plus one clean column."""
|
||||
constant, high-cardinality id) plus one clean column. ``outlier_pct`` is in
|
||||
the 0-100 scale that describe_numeric actually emits."""
|
||||
return {
|
||||
"table": "demo",
|
||||
"quality_score": 72.5,
|
||||
"quality_score": 82.0,
|
||||
"duplicate_pct": 0.04,
|
||||
"null_cell_pct": 0.11,
|
||||
"constant_cols": ["flag_const"],
|
||||
"all_null_cols": [],
|
||||
"columns": [
|
||||
{"name": "edad", "inferred_type": "integer", "null_pct": 0.2,
|
||||
"numeric": {"outlier_pct": 0.15, "min": 0, "max": 99},
|
||||
"quality_score": 60},
|
||||
{"name": "edad", "inferred_type": "numeric", "null_pct": 0.2,
|
||||
"n_rows": 100, "unique_pct": 0.5,
|
||||
"numeric": {"outlier_pct": 15.0, "min": 0, "max": 99}},
|
||||
{"name": "nombre", "inferred_type": "text", "null_pct": 0.0,
|
||||
"unique_pct": 0.98, "quality_score": 80},
|
||||
"unique_pct": 0.98, "flags": ["possible_id"]},
|
||||
{"name": "flag_const", "inferred_type": "text", "null_pct": 0.0,
|
||||
"flags": ["constant"], "quality_score": 50},
|
||||
{"name": "limpia", "inferred_type": "float", "null_pct": 0.0,
|
||||
"numeric": {"outlier_pct": 0.0}, "quality_score": 100},
|
||||
"unique_pct": 0.01, "flags": ["constant"]},
|
||||
{"name": "limpia", "inferred_type": "numeric", "null_pct": 0.0,
|
||||
"unique_pct": 0.5, "numeric": {"outlier_pct": 0.0}},
|
||||
],
|
||||
}
|
||||
|
||||
@@ -50,16 +53,9 @@ def _tables(chapter):
|
||||
return [b for b in chapter.blocks if getattr(b, "kind", None) == "data_table"]
|
||||
|
||||
|
||||
def _scores_table(chapter):
|
||||
def _table_by_title(chapter, needle):
|
||||
for t in _tables(chapter):
|
||||
if "Scores" in (t.title or ""):
|
||||
return t
|
||||
return None
|
||||
|
||||
|
||||
def _issues_table(chapter):
|
||||
for t in _tables(chapter):
|
||||
if "Problemas" in (t.title or ""):
|
||||
if needle in (t.title or ""):
|
||||
return t
|
||||
return None
|
||||
|
||||
@@ -73,41 +69,84 @@ def test_golden_chapter_estructura_y_version():
|
||||
assert ch.id == "calidad"
|
||||
assert ch.version == CHAPTER_VERSION
|
||||
kinds = [b.kind for b in ch.blocks]
|
||||
# intro heading + markdown criteria + summary kv + scores table + issues table
|
||||
assert "markdown" in kinds and "kv_table" in kinds and "data_table" in kinds
|
||||
|
||||
|
||||
def test_golden_intro_explica_criterios_y_pesos():
|
||||
def test_golden_intro_explica_dos_dimensiones_y_pesos():
|
||||
ch = build_calidad(_profile(), {})
|
||||
intro = [b for b in ch.blocks if b.kind == "markdown"][0].text
|
||||
for needle in ("Completitud", "Validez", "Consistencia",
|
||||
"50%", "30%", "20%"):
|
||||
for needle in ("Completitud", "Validez", "60%", "40%",
|
||||
"unicidad de registro"):
|
||||
assert needle in intro, f"missing {needle!r} in the criteria intro"
|
||||
# The principle: outliers do NOT lower quality.
|
||||
assert "atípicos" in intro and "NO bajan" in intro
|
||||
# The removed consistency dimension is no longer mentioned.
|
||||
assert "20%" not in intro
|
||||
|
||||
|
||||
def test_golden_scores_incluyen_desglose_por_criterio():
|
||||
def test_golden_scores_sin_columna_consistencia():
|
||||
ch = build_calidad(_profile(), {})
|
||||
scores = _scores_table(ch)
|
||||
scores = _table_by_title(ch, "Scores")
|
||||
assert scores is not None
|
||||
assert scores.header == ["Columna", "Calidad", "Completitud",
|
||||
"Validez", "Consistencia"]
|
||||
# 4 columns scored, none dropped.
|
||||
assert scores.header == ["Columna", "Calidad", "Completitud", "Validez"]
|
||||
assert "Consistencia" not in scores.header
|
||||
assert len(scores.rows) == 4
|
||||
names = {r[0] for r in scores.rows}
|
||||
assert names == {"edad", "nombre", "flag_const", "limpia"}
|
||||
|
||||
|
||||
def test_golden_issues_en_espanol_separados_de_flags():
|
||||
def test_golden_outliers_en_observaciones_no_en_problemas():
|
||||
ch = build_calidad(_profile(), {})
|
||||
issues = _issues_table(ch)
|
||||
assert issues is not None
|
||||
flat = " | ".join(" ".join(r) for r in issues.rows)
|
||||
assert "nulos" in flat # completeness issue (ES)
|
||||
assert "outliers" in flat # validity issue (ES)
|
||||
assert "columna constante" in flat
|
||||
assert "posible id de alta cardinalidad" in flat
|
||||
# The raw type flag string must NOT leak as a "problem".
|
||||
assert "constant" not in flat or "columna constante" in flat
|
||||
problemas = _table_by_title(ch, "Problemas de calidad")
|
||||
observaciones = _table_by_title(ch, "Observaciones")
|
||||
assert problemas is not None
|
||||
assert observaciones is not None
|
||||
|
||||
problemas_txt = " | ".join(" ".join(r) for r in problemas.rows)
|
||||
observaciones_txt = " | ".join(" ".join(r) for r in observaciones.rows)
|
||||
|
||||
# Nulls ARE a quality problem.
|
||||
assert "nulos" in problemas_txt
|
||||
# Outliers do NOT appear as a problem...
|
||||
assert "atípic" not in problemas_txt and "outlier" not in problemas_txt
|
||||
# ...but as an analytical observation.
|
||||
assert "atípic" in observaciones_txt
|
||||
# Constant and id columns: observations, not problems.
|
||||
assert "constante" in observaciones_txt
|
||||
assert "identificador" in observaciones_txt
|
||||
assert "constante" not in problemas_txt
|
||||
|
||||
|
||||
def test_golden_score_columna_limpia_es_100():
    """Column with no nulls, native numeric: score 100 whether or not it has outliers."""
    ch = build_calidad(_profile(), {})
    scores = _table_by_title(ch, "Scores")
    by_name = {r[0]: r for r in scores.rows}
    assert by_name["limpia"][1] == "100 / 100"
    # edad: 20% nulls -> 100*(0.6*0.8 + 0.4*1.0) = 88; outliers lower nothing.
    assert by_name["edad"][1] == "88 / 100"
|
||||
|
||||
|
||||
# --------------------------------------------------------------------------- #
|
||||
# Glossary (contract §11.1)
|
||||
# --------------------------------------------------------------------------- #
|
||||
def test_glosario_registra_los_cuatro_terminos_y_marca_clicable():
    glossary = model.GlossaryCollector()
    ch = build_calidad(_profile(), {"glossary": glossary})
    for key in ("calidad_datos", "completitud", "validez", "unicidad_registro"):
        assert glossary.has(key), f"term {key!r} not registered in the glossary"
    intro = [b for b in ch.blocks if b.kind == "markdown"][0].text
    # With a collector present, the first appearance is marked clickable.
    assert "[[term:completitud]]" in intro
    assert "[[term:validez]]" in intro
    assert "[[term:calidad_datos]]" in intro
    assert "[[term:unicidad_registro]]" in intro
|
||||
|
||||
|
||||
def test_sin_glosario_no_marca_terminos():
    ch = build_calidad(_profile(), {})  # ctx without a glossary
    intro = [b for b in ch.blocks if b.kind == "markdown"][0].text
    assert "[[term:" not in intro
|
||||
|
||||
|
||||
# --------------------------------------------------------------------------- #
|
||||
@@ -124,17 +163,17 @@ def test_edge_perfil_limpio_sin_problemas_usa_nota():
|
||||
prof = {
|
||||
"quality_score": 100,
|
||||
"columns": [
|
||||
{"name": "a", "inferred_type": "float", "null_pct": 0.0,
|
||||
"numeric": {"outlier_pct": 0.0}},
|
||||
{"name": "b", "inferred_type": "float", "null_pct": 0.0,
|
||||
"numeric": {"outlier_pct": 0.0}},
|
||||
{"name": "a", "inferred_type": "numeric", "null_pct": 0.0,
|
||||
"unique_pct": 0.5, "numeric": {"outlier_pct": 0.0}},
|
||||
{"name": "b", "inferred_type": "numeric", "null_pct": 0.0,
|
||||
"unique_pct": 0.5, "numeric": {"outlier_pct": 0.0}},
|
||||
],
|
||||
}
|
||||
ch = build_calidad(prof, {})
|
||||
assert ch is not None
|
||||
assert _issues_table(ch) is None # no issues table
|
||||
assert _table_by_title(ch, "Problemas de calidad") is None # no issues table
|
||||
notes = [b for b in ch.blocks if b.kind == "note"]
|
||||
assert notes and "No se detectaron problemas" in notes[0].text
|
||||
assert any("No se detectaron problemas" in n.text for n in notes)
|
||||
|
||||
|
||||
# --------------------------------------------------------------------------- #
|
||||
@@ -143,44 +182,42 @@ def test_edge_perfil_limpio_sin_problemas_usa_nota():
|
||||
def _wide_profile(ncols: int = 22) -> dict:
|
||||
cols = [
|
||||
{"name": "identificador_unico_de_transaccion_con_nombre_muy_largo",
|
||||
"inferred_type": "text", "null_pct": 0.0, "unique_pct": 0.99},
|
||||
"inferred_type": "text", "null_pct": 0.0, "unique_pct": 0.99,
|
||||
"flags": ["possible_id"]},
|
||||
{"name": "columna_constante_sin_ninguna_variacion_de_valor",
|
||||
"inferred_type": "text", "null_pct": 0.0, "flags": ["constant"]},
|
||||
"inferred_type": "text", "null_pct": 0.0, "unique_pct": 0.01,
|
||||
"flags": ["constant"]},
|
||||
]
|
||||
for k in range(ncols - 2):
|
||||
cols.append({
|
||||
"name": f"metrica_numerica_de_negocio_{k:02d}_con_nombre_largo",
|
||||
"inferred_type": "float", "null_pct": 0.1 + (k % 3) * 0.05,
|
||||
"numeric": {"outlier_pct": 0.08, "min": 0, "max": 1000},
|
||||
"inferred_type": "numeric", "null_pct": 0.1 + (k % 3) * 0.05,
|
||||
"unique_pct": 0.5,
|
||||
"numeric": {"outlier_pct": 8.0, "min": 0, "max": 1000},
|
||||
})
|
||||
return {"table": "ancha", "quality_score": 70.0, "columns": cols}
|
||||
return {"table": "ancha", "quality_score": 70.0, "duplicate_pct": 0.0,
|
||||
"columns": cols}
|
||||
|
||||
|
||||
def test_anticut_pdf_y_pptx_no_truncan_nombres_largos():
|
||||
prof = _wide_profile(22)
|
||||
full = build_document(prof, {"dataset_name": "ancha"})
|
||||
assert any(c.id == "calidad" for c in full)
|
||||
# Render ONLY the calidad chapter so the anti-cut assertions are scoped to
|
||||
# this chapter (other chapters, e.g. portada, legitimately contain '…').
|
||||
chapters = [c for c in full if c.id == "calidad"]
|
||||
long_name = "metrica_numerica_de_negocio_00_con_nombre_largo"
|
||||
with tempfile.TemporaryDirectory() as d:
|
||||
pdf = os.path.join(d, "q.pdf")
|
||||
pptx = os.path.join(d, "q.pptx")
|
||||
rp = render_pdf(chapters, pdf, {"title": "EDA"})
|
||||
rx = render_pptx(chapters, pptx, {"title": "EDA"})
|
||||
render_pptx(chapters, pptx, {"title": "EDA"})
|
||||
assert os.path.exists(pdf) and os.path.exists(pptx)
|
||||
# The wide table forces pagination across several pages/slides.
|
||||
assert (rp or {}).get("n_pages", 0) >= 2
|
||||
|
||||
# PDF: the long name survives whole once wraps (spaces/newlines) removed,
|
||||
# and there is no truncation marker.
|
||||
pdf_txt = "".join((pg.extract_text() or "") for pg in PdfReader(pdf).pages)
|
||||
assert "…" not in pdf_txt and "..." not in pdf_txt
|
||||
norm = re.sub(r"\s+", "", pdf_txt)
|
||||
assert long_name in norm, "the long name was cut in the PDF"
|
||||
|
||||
# PPTX: long name present in some cell, untruncated.
|
||||
allt = []
|
||||
for s in Presentation(pptx).slides:
|
||||
for sh in s.shapes:
|
||||
|
||||
@@ -33,10 +33,23 @@ import math
|
||||
|
||||
from .. import model
|
||||
|
||||
CHAPTER_VERSION = "1.0.0"
|
||||
CHAPTER_VERSION = "1.1.0"
|
||||
CHAPTER_ID = "cat_distr"
|
||||
CHAPTER_TITLE = "Distribuciones categóricas"
|
||||
|
||||
# Glossary term this chapter explains. Registered in the shared collector and
|
||||
# marked clickable on its first appearance (end-to-end glossary example —
|
||||
# mejora 6). Other chapters hook their own terms the same way (see the contract).
|
||||
_TERM_ENTROPIA_KEY = "entropia"
|
||||
_TERM_ENTROPIA_LABEL = "Entropía (de Shannon)"
|
||||
_TERM_ENTROPIA_DEF = (
|
||||
"Medida, en bits, de cómo de repartidos están los valores de una columna "
|
||||
"categórica. Vale 0 cuando una sola categoría concentra todas las filas "
|
||||
"(máxima previsibilidad) y alcanza su máximo, log2(k) para k categorías "
|
||||
"distintas, cuando todas aparecen por igual (máxima diversidad). La entropía "
|
||||
"normalizada (entropía dividida por su máximo) la lleva al rango 0–1 para "
|
||||
"comparar columnas con distinto número de categorías.")
|
||||
|
||||
# Cap the number of categorical columns rendered to keep the document bounded;
|
||||
# the rest are summarized in a closing note (no silent truncation).
|
||||
MAX_COLS = 40
|
||||
@@ -337,10 +350,14 @@ def _topk_table(cat: dict):
|
||||
note=note)
|
||||
|
||||
|
||||
def _intro_blocks(n_rows):
|
||||
def _intro_blocks(n_rows, mark_term: bool = False):
|
||||
total = _fmt_int(n_rows)
|
||||
# Mark the first appearance of the term as a clickable glossary jump when the
|
||||
# term was registered (mark_term). The visible text is identical either way.
|
||||
entropia = ("[[term:entropia]]**entropía de Shannon**[[/term]]" if mark_term
|
||||
else "**entropía de Shannon**")
|
||||
text = (
|
||||
"La **entropía de Shannon** mide cómo de repartidos están los valores de "
|
||||
f"La {entropia} mide cómo de repartidos están los valores de "
|
||||
"una columna categórica, en bits. Vale 0 cuando una sola categoría "
|
||||
"concentra todas las filas (máxima previsibilidad) y alcanza su máximo, "
|
||||
"log2(k) para k categorías distintas, cuando todas aparecen por igual "
|
||||
@@ -370,7 +387,15 @@ def build_cat_distr(profile: dict, ctx: dict):
|
||||
return None
|
||||
|
||||
n_rows = profile.get("n_rows")
|
||||
blocks = list(_intro_blocks(n_rows))
|
||||
# Register "entropía" in the shared glossary collector (if present) and mark
|
||||
# its first appearance clickable. End-to-end glossary example (mejora 6).
|
||||
glossary = ctx.get("glossary")
|
||||
mark_term = False
|
||||
if isinstance(glossary, model.GlossaryCollector):
|
||||
glossary.add(_TERM_ENTROPIA_KEY, _TERM_ENTROPIA_LABEL,
|
||||
_TERM_ENTROPIA_DEF)
|
||||
mark_term = True
|
||||
blocks = list(_intro_blocks(n_rows, mark_term=mark_term))
|
||||
|
||||
rendered = cat_cols[:MAX_COLS]
|
||||
for col in rendered:
|
||||
|
||||
@@ -0,0 +1,47 @@
|
||||
"""Glossary chapter (GLOSARIO) — always the last chapter, clickable terms.
|
||||
|
||||
Renders one entry per glossary term that the other chapters registered during
|
||||
the document build through ``ctx['glossary'].add(key, label, definition)`` (see
|
||||
``GlossaryCollector`` in ``model.py``). Each entry is a clickable destination:
|
||||
every in-text appearance a chapter marked with ``[[term:key]]texto[[/term]]``
|
||||
becomes a real jump to its entry here — PDF link annotations (PyMuPDF) and PPTX
|
||||
native slide jumps, both wired by the renderers.
|
||||
|
||||
Returns ``None`` when no term was registered (there is nothing to show), so the
|
||||
chapter simply disappears from documents that did not mark any term.
|
||||
|
||||
Contract: build_<id>(profile, ctx) -> Chapter | None ; CHAPTER_VERSION = "x.y.z".
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
from .. import model
|
||||
|
||||
CHAPTER_VERSION = "1.0.0"
|
||||
CHAPTER_ID = "glosario"
|
||||
CHAPTER_TITLE = "Glosario"
|
||||
|
||||
|
||||
def build_glosario(profile: dict, ctx: dict):
    """Build the glossary Chapter from the shared collector, or None if empty."""
    ctx = ctx or {}
    glossary = ctx.get("glossary")
    if not isinstance(glossary, model.GlossaryCollector) or not glossary:
        return None

    blocks = [
        model.Heading(text="Glosario de términos", level=1),
        model.Markdown(text=(
            "Definición de los términos técnicos que aparecen en el informe. "
            "Cada término va resaltado en el texto y, al pulsarlo, salta a su "
            "definición en esta sección.")),
    ]
    # One clickable destination per term, alphabetically by visible label.
    for term in glossary.terms(by="label"):
        blocks.append(model.GlossaryEntry(
            key=model._safe_str(term.get("key")),
            label=model._safe_str(term.get("label")),
            definition=model._safe_str(term.get("definition"))))

    return model.Chapter(id=CHAPTER_ID, title=CHAPTER_TITLE,
                         version=CHAPTER_VERSION, blocks=blocks)
|
||||
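The chapters in this diff rely on four behaviors of the collector: `add(key, label, definition)`, `has(key)`, `terms(by="label")` returning dict-like entries, and being falsy when empty. The real `GlossaryCollector` in `model.py` is not shown here, so the class below is only a minimal stand-in with that surface. Its name, the first-registration-wins rule, and the case-insensitive sort are assumptions.

```python
class GlossaryCollectorSketch:
    """Hypothetical stand-in exposing the methods the chapters call."""

    def __init__(self):
        self._terms = {}

    def add(self, key, label, definition):
        # Assumption: the first registration of a key wins; later ones are ignored.
        self._terms.setdefault(
            key, {"key": key, "label": label, "definition": definition})

    def has(self, key):
        return key in self._terms

    def __len__(self):
        # An empty collector is falsy, so `not glossary` means "nothing to show".
        return len(self._terms)

    def terms(self, by="label"):
        # Sort by the requested field, ignoring case.
        return sorted(self._terms.values(),
                      key=lambda t: str(t.get(by, "")).casefold())


g = GlossaryCollectorSketch()
empty_is_falsy = not g
g.add("validez", "Validez", "Proporción de valores con formato válido.")
g.add("completitud", "Completitud", "Proporción de valores presentes.")
labels = [t["label"] for t in g.terms(by="label")]
```

Because every chapter writes into the same collector during the build, the glossary chapter has to run last; that ordering is what the module docstring means by "always the last chapter".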
@@ -34,7 +34,7 @@ try:
|
||||
except Exception: # noqa: BLE001 — keep the chapter importable no matter what.
|
||||
build_boxplot_stats = None # type: ignore[assignment]
|
||||
|
||||
CHAPTER_VERSION = "1.0.0"
|
||||
CHAPTER_VERSION = "1.1.0"
|
||||
CHAPTER_ID = "num_distr"
|
||||
CHAPTER_TITLE = "Distribuciones numéricas"
|
||||
|
||||
@@ -278,12 +278,17 @@ def build_num_distr(profile: dict, ctx: dict):
|
||||
box = build_boxplot_stats(numeric) or {}
|
||||
except Exception: # noqa: BLE001 — degrade, never raise.
|
||||
box = {}
|
||||
blocks.append(model.Heading(text=str(name), level=2))
|
||||
blocks.append(model.Figure(
|
||||
make=_figure_maker(name, numeric, box),
|
||||
caption=f"Distribución de «{name}» — histograma (media/mediana/±σ) "
|
||||
f"y boxplot."))
|
||||
blocks.append(model.Markdown(text=_stats_note(name, numeric, box)))
|
||||
# Keep the column heading, its figure and its stats note together on the
|
||||
# same page/slide (mejora 3 — keep-together): the renderers measure the
|
||||
# whole Group and move it whole when it would not fit.
|
||||
blocks.append(model.Group(blocks=[
|
||||
model.Heading(text=str(name), level=2),
|
||||
model.Figure(
|
||||
make=_figure_maker(name, numeric, box),
|
||||
caption=f"Distribución de «{name}» — histograma "
|
||||
f"(media/mediana/±σ) y boxplot."),
|
||||
model.Markdown(text=_stats_note(name, numeric, box)),
|
||||
]))
|
||||
|
||||
return model.Chapter(id=CHAPTER_ID, title=CHAPTER_TITLE,
|
||||
version=CHAPTER_VERSION, blocks=blocks)
|
||||
|
||||
@@ -65,19 +65,33 @@ def _pdf_text(path: str) -> str:
|
||||
return re.sub(r"\s+", " ", txt)
|
||||
|
||||
|
||||
def _flatten(blocks):
    """Expand keep-together Groups so the per-column heading/figure/markdown are
    inspectable as a flat block list (the chapter wraps each column in a Group)."""
    out = []
    for b in blocks:
        if getattr(b, "kind", "") == "group":
            out.extend(_flatten(getattr(b, "blocks", []) or []))
        else:
            out.append(b)
    return out
|
||||
|
||||
|
||||
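The effect of the flattening helper is easy to see on a toy block list. The sketch below uses `types.SimpleNamespace` objects in place of the real model classes, on the assumption that a block only needs a `kind` attribute and that a group carries a `blocks` list; `flatten` is a copy of `_flatten` above.

```python
from types import SimpleNamespace as Block


def flatten(blocks):
    """Recursively expand blocks of kind "group" into their children."""
    out = []
    for b in blocks:
        if getattr(b, "kind", "") == "group":
            out.extend(flatten(getattr(b, "blocks", []) or []))
        else:
            out.append(b)
    return out


doc = [
    Block(kind="heading", text="Distribuciones numéricas"),
    Block(kind="group", blocks=[
        Block(kind="heading", text="edad"),
        Block(kind="figure"),
        Block(kind="markdown", text="..."),
    ]),
]
kinds = [b.kind for b in flatten(doc)]
```

The group itself disappears from the flat list and its children take its place in order, which lets the existing assertions about headings, figures and markdown keep working after the chapter started wrapping each column in a `Group`.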
def test_golden_chapter_estructura_y_bloques():
|
||||
ch = build_num_distr(_profile(n_numeric=2), {})
|
||||
assert ch is not None
|
||||
assert ch.id == "num_distr"
|
||||
assert ch.version == CHAPTER_VERSION
|
||||
kinds = [b.kind for b in ch.blocks]
|
||||
# Per-column blocks are wrapped in keep-together Groups: flatten to inspect.
|
||||
flat = _flatten(ch.blocks)
|
||||
kinds = [b.kind for b in flat]
|
||||
# Heading + intro Markdown, then per column: Heading + Figure + Markdown.
|
||||
assert kinds[0] == "heading"
|
||||
assert kinds[1] == "markdown"
|
||||
assert kinds.count("figure") == 2 # one figure per numeric column.
|
||||
assert kinds.count("heading") == 1 + 2 # chapter title + one per column.
|
||||
# Each figure has a lazy maker that produces a real matplotlib figure.
|
||||
figs = [b for b in ch.blocks if b.kind == "figure"]
|
||||
figs = [b for b in flat if b.kind == "figure"]
|
||||
fig = figs[0].make()
|
||||
assert fig is not None
|
||||
# Two stacked axes: histogram + boxplot share the figure.
|
||||
@@ -90,7 +104,8 @@ def test_golden_media_mediana_sigma_y_boxplot_presentes():
|
||||
# The intro documents the three reference lines and the Tukey boxplot; the
|
||||
# per-column note carries the actual mean/median/σ numbers and the shape.
|
||||
ch = build_num_distr(_profile(n_numeric=1, extra_categorical=False), {})
|
||||
-md_texts = " ".join(b.text for b in ch.blocks if b.kind == "markdown")
+md_texts = " ".join(b.text for b in _flatten(ch.blocks)
+if b.kind == "markdown")
|
||||
assert "media" in md_texts and "mediana" in md_texts
|
||||
assert "±1σ" in md_texts or "σ" in md_texts
|
||||
assert "boxplot" in md_texts.lower()
|
||||
@@ -126,7 +141,8 @@ def test_anti_corte_muchas_columnas_pdf_y_pptx():
|
||||
# 8 numeric columns + long note text: nothing may be cut. Every column
|
||||
# heading must survive in both the PDF text and the PPTX deck.
|
||||
ch = build_num_distr(_profile(n_numeric=8), {})
|
||||
-names = [b.text for b in ch.blocks if b.kind == "heading" and b.level == 2]
+names = [b.text for b in _flatten(ch.blocks)
+if b.kind == "heading" and b.level == 2]
|
||||
assert len(names) == 8
|
||||
with tempfile.TemporaryDirectory() as d:
|
||||
pdf = os.path.join(d, "num.pdf")
|
||||
|
||||
@@ -17,7 +17,7 @@ from datetime import datetime, timezone
|
||||
|
||||
from .. import model
|
||||
|
||||
-CHAPTER_VERSION = "1.0.0"
+CHAPTER_VERSION = "1.1.0"
|
||||
CHAPTER_ID = "portada"
|
||||
CHAPTER_TITLE = "Portada"
|
||||
|
||||
@@ -67,6 +67,53 @@ def _fmt_int(v) -> str:
|
||||
return str(v)
|
||||
|
||||
|
||||
def _fmt_pct(value) -> str:
|
||||
"""Format a percentage that may arrive as a 0–1 fraction or a 0–100 number."""
|
||||
if value is None:
|
||||
return "—"
|
||||
try:
|
||||
v = float(value)
|
||||
except (TypeError, ValueError):
|
||||
return str(value)
|
||||
if 0 < v <= 1.0:
|
||||
v *= 100.0
|
||||
return f"{v:.1f}%"
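The scale heuristic in `_fmt_pct` is worth seeing on concrete values. The sketch below mirrors the logic shown above as a standalone function; the name `fmt_pct` and the `"n/a"` placeholder for missing values are choices made for this illustration only. It treats any value in the interval (0, 1] as a fraction and everything else as an already-scaled percentage.

```python
def fmt_pct(value) -> str:
    """Format a percentage; values in (0, 1] are assumed to be fractions."""
    if value is None:
        return "n/a"  # placeholder used only in this sketch
    try:
        v = float(value)
    except (TypeError, ValueError):
        return str(value)  # not numeric: show it unchanged
    if 0 < v <= 1.0:
        v *= 100.0  # fraction -> percentage
    return f"{v:.1f}%"


samples = {x: fmt_pct(x) for x in (0.25, 25, 0.8, 1.5, 0)}
```

One consequence of this design: a source that reports percentages on the 0 to 100 scale cannot express values below 1% unambiguously, since an input of `0.8` meaning 0.8% is rendered as `80.0%`. Whether that matters depends on which scale the profile actually uses, which the diff does not state.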
|
||||
|
||||
|
||||
def _summary_blocks(summary) -> list:
|
||||
"""Mini-summary of the rest of the analysis, shown on the cover (mejora 5).
|
||||
|
||||
The cover is built AFTER the body (``build_document`` passes the aggregated
|
||||
``ctx['document_summary']``), so it can reflect what the analysis found:
|
||||
shape, column types, quality flags and which chapters were included. Returns
|
||||
an empty list when there is no summary (the cover degrades to its metadata
|
||||
table only)."""
|
||||
if not isinstance(summary, dict) or not summary:
|
||||
return []
|
||||
rows = []
|
||||
n_num = summary.get("n_numeric")
|
||||
n_cat = summary.get("n_categorical")
|
||||
if n_num is not None or n_cat is not None:
|
||||
rows.append(("Columnas numéricas / categóricas",
|
||||
f"{_fmt_int(n_num)} / {_fmt_int(n_cat)}"))
|
||||
if summary.get("duplicate_pct") is not None:
|
||||
rows.append(("Filas duplicadas", _fmt_pct(summary.get("duplicate_pct"))))
|
||||
if summary.get("null_cell_pct") is not None:
|
||||
rows.append(("Celdas nulas", _fmt_pct(summary.get("null_cell_pct"))))
|
||||
titles = summary.get("chapter_titles") or []
|
||||
if titles:
|
||||
rows.append(("Capítulos del informe", _fmt_int(len(titles))))
|
||||
|
||||
blocks = [model.Heading(text="Resumen del análisis", level=2)]
|
||||
if rows:
|
||||
blocks.append(model.KVTable(rows=rows))
|
||||
if titles:
|
||||
bullets = "\n".join(f"- {model._safe_str(t)}" for t in titles)
|
||||
blocks.append(model.Markdown(
|
||||
text="Este informe incluye los siguientes capítulos:\n" + bullets))
|
||||
return blocks
|
||||
|
||||
|
||||
def _fmt_date_eu(value) -> str:
|
||||
"""Format a date/ISO string as European DD/MM/AAAA HH:mm (UI convention).
|
||||
|
||||
@@ -152,5 +199,8 @@ def build_portada(profile: dict, ctx: dict):
|
||||
model.Markdown(text=str(granularity)),
|
||||
]
|
||||
|
||||
# Mini-summary of the rest of the analysis (built last, shown on the cover).
|
||||
blocks.extend(_summary_blocks(ctx.get("document_summary")))
|
||||
|
||||
return model.Chapter(id=CHAPTER_ID, title=CHAPTER_TITLE,
|
||||
version=CHAPTER_VERSION, blocks=blocks)
|
||||
|
||||
@@ -26,7 +26,7 @@ from . import model
|
||||
# placeholders other agents will fill by creating chapters/<id>.py — they will
|
||||
# appear in this exact position automatically once their module exists.
|
||||
CHAPTER_ORDER = [
|
||||
"portada", # cover
|
||||
"portada", # cover — BUILT LAST, PLACED FIRST (see build_document).
|
||||
"overview", # df.head + columns/types/nulls/examples + describe
|
||||
"analisis_llm", # LLM interpretation — sits next to overview (user request)
|
||||
"num_distr", # numeric distributions
|
||||
@@ -37,8 +37,15 @@ CHAPTER_ORDER = [
|
||||
"timeseries", # time-series analysis
|
||||
"geospatial", # geospatial
|
||||
"agregacion", # aggregations / pivots
|
||||
"glosario", # glossary — ALWAYS LAST; clickable term destinations.
|
||||
]
|
||||
|
||||
# Chapters whose position is special-cased by build_document: portada is built
|
||||
# last (so it can summarize the rest) but placed first; glosario is built and
|
||||
# placed last (it reads the terms every other chapter registered).
|
||||
_PORTADA = "portada"
|
||||
_GLOSARIO = "glosario"
|
||||
|
||||
|
||||
def build_chapter(chapter_id: str, profile: dict, ctx: dict):
|
||||
"""Build a single chapter by id, or None if absent/not-applicable/error.
|
||||
@@ -75,15 +82,72 @@ def build_document(profile: dict, ctx: dict = None) -> list:
|
||||
list[Chapter] in canonical order, containing only the chapters that are
|
||||
implemented and applicable. Never raises.
|
||||
"""
|
||||
if profile is None:
|
||||
profile = {}
|
||||
if not isinstance(profile, dict):
|
||||
profile = {}
|
||||
if ctx is None:
|
||||
ctx = {}
|
||||
chapters = []
|
||||
# Copy ctx so the shared collector / summary we add do not leak to the caller.
|
||||
ctx = dict(ctx) if isinstance(ctx, dict) else {}
|
||||
|
||||
# A single glossary collector is shared by every chapter via ctx['glossary'].
|
||||
# Chapters call ctx['glossary'].add(key, label, definition) and mark in-text
|
||||
# appearances with [[term:key]]…[[/term]]; the glosario chapter renders the
|
||||
# registered terms and the renderers wire the clickable links.
|
||||
glossary = ctx.get("glossary")
|
||||
if not isinstance(glossary, model.GlossaryCollector):
|
||||
glossary = model.GlossaryCollector()
|
||||
ctx["glossary"] = glossary
|
||||
|
||||
# 1) Body: every chapter except portada (built last) and glosario (placed
|
||||
# last), in canonical order. This also fills the glossary collector.
|
||||
body = []
|
||||
for cid in CHAPTER_ORDER:
|
||||
if cid in (_PORTADA, _GLOSARIO):
|
||||
continue
|
||||
ch = build_chapter(cid, profile, ctx)
|
||||
if ch is not None and ch.blocks:
|
||||
chapters.append(ch)
|
||||
body.append(ch)
|
||||
|
||||
# 2) Aggregated summary of the rest, for the cover (user decision: the cover
|
||||
# is BUILT after the body so it can reflect what the analysis found).
|
||||
ctx["document_summary"] = _summarize_document(profile, body)
|
||||
|
||||
# 3) Build the cover last, place it FIRST.
|
||||
portada = build_chapter(_PORTADA, profile, ctx)
|
||||
# 4) Build the glossary last (reads the terms the body registered), place LAST.
|
||||
glosario = build_chapter(_GLOSARIO, profile, ctx)
|
||||
|
||||
chapters = []
|
||||
if portada is not None and portada.blocks:
|
||||
chapters.append(portada)
|
||||
chapters.extend(body)
|
||||
if glosario is not None and glosario.blocks:
|
||||
chapters.append(glosario)
|
||||
return chapters
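The build order and the output order differ in `build_document`, which is easy to miss when reading the hunk. The following is a simplified sketch of that flow, assuming chapters are plain dicts and that `build_chapter` here is a hypothetical stand-in that always succeeds. It records the order in which chapters are built so it can be compared with the order in which they are returned.

```python
CHAPTER_ORDER = ["portada", "overview", "num_distr", "glosario"]
SPECIAL = ("portada", "glosario")

build_log = []  # records the order in which chapters are built


def build_chapter(cid: str, ctx: dict) -> dict:
    """Hypothetical stand-in for a chapter builder; returns a plain dict."""
    build_log.append(cid)
    chapter = {"id": cid, "title": cid.upper()}
    if cid == "portada":
        # The cover can read the summary because the body was built first.
        chapter["summary"] = ctx.get("document_summary", {})
    return chapter


def build_document(ctx: dict = None) -> list:
    # Copy ctx so keys added here do not leak back to the caller.
    ctx = dict(ctx) if isinstance(ctx, dict) else {}
    # 1) Body chapters in canonical order, skipping the special ones.
    body = [build_chapter(cid, ctx) for cid in CHAPTER_ORDER if cid not in SPECIAL]
    # 2) Aggregate a summary of the body for the cover.
    ctx["document_summary"] = {"chapter_titles": [c["title"] for c in body]}
    # 3) Cover and glossary are built after the body.
    cover = build_chapter("portada", ctx)
    glossary = build_chapter("glosario", ctx)
    # 4) Output order: cover first, glossary last.
    return [cover, *body, glossary]


document = build_document()
```

The real function additionally drops chapters that are `None` or have no blocks; that filtering is omitted here to keep the ordering logic visible.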
|
||||
|
||||
|
||||
def _summarize_document(profile: dict, body: list) -> dict:
|
||||
"""Aggregate a tiny findings summary of the body for the cover. Never raises.
|
||||
|
||||
Returns a dict with dataset shape, quality, column-type counts and the list
|
||||
of chapters actually included — enough for the cover to show a mini-summary
|
||||
of the analysis without re-deriving anything."""
|
||||
try:
|
||||
cols = profile.get("columns") or []
|
||||
n_num = sum(1 for c in cols if isinstance(c, dict)
|
||||
and c.get("inferred_type") == "numeric")
|
||||
n_cat = sum(1 for c in cols if isinstance(c, dict)
|
||||
and isinstance(c.get("categorical"), dict)
|
||||
and c.get("categorical", {}).get("top")
|
||||
and c.get("inferred_type") != "numeric")
|
||||
return {
|
||||
"n_chapters": len(body),
|
||||
"chapter_titles": [getattr(c, "title", "") for c in body],
|
||||
"n_rows": profile.get("n_rows"),
|
||||
"n_cols": profile.get("n_cols"),
|
||||
"quality_score": profile.get("quality_score"),
|
||||
"n_numeric": n_num,
|
||||
"n_categorical": n_cat,
|
||||
"duplicate_pct": profile.get("duplicate_pct"),
|
||||
"null_cell_pct": profile.get("null_cell_pct"),
|
||||
}
|
||||
except Exception: # noqa: BLE001 — the summary is best-effort.
|
||||
return {"n_chapters": len(body) if isinstance(body, list) else 0}
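The column-type counting in `_summarize_document` can be tried on a small profile. This is a minimal sketch that reproduces only the two counters; `count_column_types` is a name invented for the example, and the sample profile is synthetic. A column counts as categorical only when it has a non-empty `categorical.top` list and is not already typed as numeric.

```python
def count_column_types(profile: dict) -> tuple:
    """Return (n_numeric, n_categorical) using the rules shown in the diff."""
    cols = profile.get("columns") or []
    n_num = sum(
        1 for c in cols
        if isinstance(c, dict) and c.get("inferred_type") == "numeric"
    )
    n_cat = sum(
        1 for c in cols
        if isinstance(c, dict)
        and isinstance(c.get("categorical"), dict)
        and c["categorical"].get("top")           # needs at least one top value
        and c.get("inferred_type") != "numeric"   # numeric wins over categorical
    )
    return n_num, n_cat


profile = {
    "columns": [
        {"name": "region", "inferred_type": "categorical",
         "categorical": {"top": [{"value": "norte", "count": 50}]}},
        {"name": "importe", "inferred_type": "numeric"},
        {"name": "comentario", "inferred_type": "text"},  # counted in neither
        "not-a-dict",                                     # ignored, never raises
    ]
}
counts = count_column_types(profile)
```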
|
||||
|
||||
@@ -128,6 +128,39 @@ class Note:
|
||||
kind: str = field(default="note", init=False)
|
||||
|
||||
|
||||
@dataclass
|
||||
class Group:
|
||||
"""A keep-together unit: its blocks render on the SAME page/slide.
|
||||
|
||||
Renderers measure the whole group first; if it does not fit in the remaining
|
||||
space they move it *whole* to the next page (PDF) or slide (PPTX) before
|
||||
drawing anything — so a heading never gets stranded apart from the figure and
|
||||
text it introduces. If the group is taller than a full page even on its own,
|
||||
it starts on a fresh page and flows (honest degradation, never cut). Use it to
|
||||
bind ``Heading`` + ``Markdown`` + ``Figure`` of one idea together (see the
|
||||
DISTR NUM / AGREGACION chapters).
|
||||
"""
|
||||
|
||||
blocks: list = field(default_factory=list)
|
||||
title: Optional[str] = None
|
||||
kind: str = field(default="group", init=False)
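The keep-together rule described in the `Group` docstring reduces to a small decision. The sketch below is one possible reading of it, not the renderer's actual code: `plan_group` and `Placement` are hypothetical names, the cursor `y` is assumed to grow downward from `top` to `bottom` (in inches), and a group that is already at the top of an empty page is assumed not to trigger a further break.

```python
from typing import NamedTuple


class Placement(NamedTuple):
    page_break: bool  # start the group on a new page/slide first
    overflows: bool   # group is taller than a full page and must flow


def plan_group(y: float, group_h: float, top: float, bottom: float) -> Placement:
    """Decide where a keep-together group of height group_h should start."""
    remaining = bottom - y
    if group_h <= remaining:
        # Fits in the space left: draw it where the cursor is.
        return Placement(page_break=False, overflows=False)
    overflows = group_h > (bottom - top)
    at_page_top = y <= top
    # Move the whole group to a fresh page unless we are already on one.
    return Placement(page_break=not at_page_top, overflows=overflows)


fits_later = plan_group(y=9.6, group_h=3.0, top=1.0, bottom=10.0)
fits_now = plan_group(y=1.0, group_h=3.0, top=1.0, bottom=10.0)
too_tall = plan_group(y=5.0, group_h=12.0, top=1.0, bottom=10.0)
```

The key property is that the measurement happens before anything is drawn, so the heading is never placed on a page that the figure cannot share.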
|
||||
|
||||
|
||||
@dataclass
|
||||
class GlossaryEntry:
|
||||
"""One glossary term: a clickable destination at the end of the document.
|
||||
|
||||
Rendered as the term ``label`` (heading) plus its ``definition`` (markdown).
|
||||
The renderers register its page/slide position as the link target so every
|
||||
in-text appearance of the same ``key`` becomes a real clickable jump (PDF link
|
||||
annotation via PyMuPDF; PPTX internal slide jump)."""
|
||||
|
||||
key: str = ""
|
||||
label: str = ""
|
||||
definition: str = ""
|
||||
kind: str = field(default="glossary_entry", init=False)
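Glossary links depend on the inline span `[[term:key]]visible text[[/term]]`. The actual parser lives in `text_layout.parse_inline_rich`, which this diff does not show, so the following is only an illustrative sketch of how such a span could be split into segments. The regular expression and the `split_terms` name are assumptions made for the example.

```python
import re

# Hypothetical pattern: a key without "]" followed by non-greedy visible text.
TERM_RE = re.compile(r"\[\[term:([^\]]+)\]\](.*?)\[\[/term\]\]")


def split_terms(text: str) -> list:
    """Split text into (visible_text, term_key_or_None) segments."""
    segments, pos = [], 0
    for m in TERM_RE.finditer(text):
        if m.start() > pos:
            segments.append((text[pos:m.start()], None))  # plain text before
        segments.append((m.group(2), m.group(1)))          # marked term
        pos = m.end()
    if pos < len(text):
        segments.append((text[pos:], None))                # trailing plain text
    return segments


segments = split_terms("La [[term:entropia]]entropía[[/term]] es alta.")
```

A renderer can then draw each segment in order and record the bounding box of those whose key is not `None` as link sources.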
|
||||
|
||||
|
||||
@dataclass
|
||||
class Chapter:
|
||||
"""An ordered set of blocks with an id, a title and a generation version."""
|
||||
@@ -150,13 +183,17 @@ _BLOCK_BY_KIND = {
|
||||
"image": Image,
|
||||
"caption": Caption,
|
||||
"note": Note,
|
||||
"group": Group,
|
||||
"glossary_entry": GlossaryEntry,
|
||||
}
|
||||
|
||||
|
||||
def as_block(obj: Any):
|
||||
"""Coerce a value into a block dataclass. Unknown values become a Note."""
|
||||
if isinstance(obj, (Heading, Markdown, KVTable, DataTable, Figure, Image,
|
||||
-Caption, Note)):
+Caption, Note, Group, GlossaryEntry)):
+if isinstance(obj, Group):
+obj.blocks = as_blocks(obj.blocks)
|
||||
return obj
|
||||
if isinstance(obj, dict):
|
||||
kind = obj.get("kind")
|
||||
@@ -189,6 +226,13 @@ def as_block(obj: Any):
|
||||
return Caption(text=_safe_str(obj.get("text")))
|
||||
if cls is Note:
|
||||
return Note(text=_safe_str(obj.get("text")))
|
||||
if cls is Group:
|
||||
return Group(blocks=as_blocks(obj.get("blocks")),
|
||||
title=obj.get("title"))
|
||||
if cls is GlossaryEntry:
|
||||
return GlossaryEntry(key=_safe_str(obj.get("key")),
|
||||
label=_safe_str(obj.get("label")),
|
||||
definition=_safe_str(obj.get("definition")))
|
||||
except Exception: # noqa: BLE001 — never raise on a malformed block.
|
||||
return Note(text=_safe_str(obj))
|
||||
return Note(text=_safe_str(obj))
|
||||
@@ -246,6 +290,67 @@ def _safe_str(v: Any) -> str:
|
||||
return ""
|
||||
|
||||
|
||||
# --------------------------------------------------------------------------- #
|
||||
# Glossary collector — chapters register the terms they use; the glosario
|
||||
# chapter renders them at the end and the renderers wire the clickable links.
|
||||
# --------------------------------------------------------------------------- #
|
||||
class GlossaryCollector:
|
||||
"""Accumulates glossary terms registered by chapters during document build.
|
||||
|
||||
A single instance is created by :func:`build_document` and passed to every
|
||||
chapter via ``ctx['glossary']``. A chapter calls ``add(key, label,
|
||||
definition)`` to declare a term it explains (e.g. ``"entropia"`` →
|
||||
"Entropía"), and marks each in-text appearance with the inline span
|
||||
``[[term:key]]texto visible[[/term]]`` (see ``text_layout.parse_inline_rich``).
|
||||
The ``glosario`` chapter reads ``terms()`` to emit one :class:`GlossaryEntry`
|
||||
per term; the renderers turn every marked appearance into a real click that
|
||||
jumps to that entry. First registration of a key wins (idempotent); never
|
||||
raises."""
|
||||
|
||||
def __init__(self):
|
||||
self._terms: dict = {}
|
||||
self._order: list = []
|
||||
|
||||
def add(self, key: Any, label: Any = None, definition: Any = "") -> str:
"""Register a term and return its normalized key ('' if invalid)."""
|
||||
try:
|
||||
k = _safe_str(key).strip()
|
||||
if not k:
|
||||
return ""
|
||||
if k not in self._terms:
|
||||
self._terms[k] = {
|
||||
"key": k,
|
||||
"label": _safe_str(label).strip() or k,
|
||||
"definition": _safe_str(definition),
|
||||
}
|
||||
self._order.append(k)
|
||||
return k
|
||||
except Exception: # noqa: BLE001 — collecting a term never breaks a build.
|
||||
return ""
|
||||
|
||||
def has(self, key: Any) -> bool:
|
||||
return _safe_str(key).strip() in self._terms
|
||||
|
||||
def get(self, key: Any) -> Optional[dict]:
|
||||
return self._terms.get(_safe_str(key).strip())
|
||||
|
||||
def terms(self, by: str = "label") -> list:
|
||||
"""Return the registered terms as dicts.
|
||||
|
||||
``by='label'`` (default) sorts alphabetically by visible label;
|
||||
``by='order'`` keeps first-appearance order."""
|
||||
if by == "order":
|
||||
return [self._terms[k] for k in self._order]
|
||||
return sorted(self._terms.values(),
|
||||
key=lambda t: _safe_str(t.get("label")).lower())
|
||||
|
||||
def __len__(self) -> int:
|
||||
return len(self._terms)
|
||||
|
||||
def __bool__(self) -> bool:
|
||||
return bool(self._terms)
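The two behaviours that matter to chapter authors are that the first registration of a key wins and that `terms()` can be ordered two ways. The sketch below is a reduced re-implementation for illustration; `TermCollector` is a hypothetical name, and plain `str()` stands in for the module's `_safe_str` helper, whose body is not shown in this diff.

```python
class TermCollector:
    """Reduced sketch of a glossary collector (first registration wins)."""

    def __init__(self):
        self._terms = {}
        self._order = []

    def add(self, key, label=None, definition="") -> str:
        k = str(key or "").strip()
        if not k:
            return ""  # invalid key: nothing is registered
        if k not in self._terms:
            self._terms[k] = {
                "key": k,
                "label": str(label or "").strip() or k,  # label defaults to key
                "definition": str(definition or ""),
            }
            self._order.append(k)
        return k

    def terms(self, by: str = "label") -> list:
        if by == "order":
            return [self._terms[k] for k in self._order]
        return sorted(self._terms.values(), key=lambda t: t["label"].lower())


glossary = TermCollector()
glossary.add("iqr", "IQR", "Rango intercuartílico.")
glossary.add("entropia", "Entropía", "Medida de dispersión de categorías.")
glossary.add("entropia", "Otra etiqueta", "Ignorada: la clave ya existe.")
blank_key = glossary.add("   ")

by_label = [t["label"] for t in glossary.terms()]
by_order = [t["key"] for t in glossary.terms(by="order")]
```

Making `add` idempotent lets several chapters register the same term without coordinating with each other.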
|
||||
|
||||
|
||||
# --------------------------------------------------------------------------- #
|
||||
# Manifest — per-chapter versions and page/slide counts for tracking.
|
||||
# --------------------------------------------------------------------------- #
|
||||
|
||||
@@ -0,0 +1,354 @@
|
||||
"""Tests for the AutomaticEDA engine features added in phase 4a.
|
||||
|
||||
Covers, with executable evidence, the six render-engine improvements:

1. Bold no longer overlaps the following text in the PDF (real width measured).
2. Zebra striping on data tables (PDF Rectangle fills + PPTX cell fills).
3. Keep-together: a Group moves whole to the next page/slide (heading never gets
   stranded from its figure).
4. Every PPTX figure carries a visible caption/title (fallback to the heading).
5. Cover is built last but placed first and reflects an aggregated summary.
6. Glossary is the last chapter; the term "entropía" is a real clickable link in
   the PDF (PyMuPDF GOTO annotation) and in the PPTX (native slide-jump run).

Self-contained: synthetic profiles, no DuckDB. Heavy renderer checks (fitz/pptx)
skip cleanly when the optional engine is missing.
"""
|
||||
|
||||
import os
|
||||
import sys
|
||||
|
||||
import pytest
|
||||
|
||||
_HERE = os.path.dirname(os.path.abspath(__file__))
|
||||
_FUNCTIONS = os.path.abspath(os.path.join(_HERE, "..", "..", "..")) # python/functions
|
||||
if _FUNCTIONS not in sys.path:
|
||||
sys.path.insert(0, _FUNCTIONS)
|
||||
|
||||
import matplotlib # noqa: E402
|
||||
|
||||
matplotlib.use("Agg")
|
||||
import matplotlib.colors as mcolors # noqa: E402
|
||||
import matplotlib.pyplot as plt # noqa: E402
|
||||
from matplotlib.patches import Rectangle # noqa: E402
|
||||
|
||||
from datascience.automatic_eda import model # noqa: E402
|
||||
from datascience.automatic_eda import render_pdf_impl as RP # noqa: E402
|
||||
from datascience.automatic_eda import render_pptx_impl as RX # noqa: E402
|
||||
from datascience.automatic_eda import build_document # noqa: E402
|
||||
from datascience.render_automatic_eda_pdf import render_automatic_eda_pdf # noqa: E402
|
||||
from datascience.render_automatic_eda_pptx import render_automatic_eda_pptx # noqa: E402
|
||||
|
||||
|
||||
class _FakePdf:
|
||||
"""Stand-in for PdfPages so the placers can call _new_page in unit tests."""
|
||||
|
||||
def savefig(self, fig): # noqa: D401
|
||||
pass
|
||||
|
||||
|
||||
def _small_fig():
|
||||
fig = plt.figure(figsize=(4.0, 1.5))
|
||||
ax = fig.add_subplot(111)
|
||||
ax.plot([0, 1, 2], [1, 3, 2])
|
||||
return fig
|
||||
|
||||
|
||||
def _profile_with_cat_and_num():
|
||||
"""A tiny profile that triggers cat_distr (→ entropía term) and num_distr."""
|
||||
return {
|
||||
"table": "ventas", "n_rows": 120, "n_cols": 2, "quality_score": 91,
|
||||
"duplicate_pct": 1.5, "null_cell_pct": 0.8,
|
||||
"columns": [
|
||||
{"name": "region", "inferred_type": "categorical",
|
||||
"categorical": {
|
||||
"top": [{"value": "norte", "count": 50, "pct": 0.42},
|
||||
{"value": "sur", "count": 40, "pct": 0.33},
|
||||
{"value": "este", "count": 30, "pct": 0.25}],
|
||||
"mode": "norte", "n_distinct": 3, "entropy": 1.55,
|
||||
"imbalance": 0.1}},
|
||||
{"name": "importe", "inferred_type": "numeric",
|
||||
"numeric": {"mean": 50.0, "median": 48.0, "std": 10.0,
|
||||
"min": 10, "max": 99, "iqr": 15,
|
||||
"histogram": [{"lo": 0, "hi": 50, "count": 40},
|
||||
{"lo": 50, "hi": 100, "count": 80}]}},
|
||||
],
|
||||
}
|
||||
|
||||
|
||||
# --------------------------------------------------------------------------- #
|
||||
# 1) Bold does not overlap the following text (PDF).
|
||||
# --------------------------------------------------------------------------- #
|
||||
def test_pdf_bold_span_does_not_overlap_following_text():
|
||||
fig = plt.figure(figsize=(RP._W, RP._H))
|
||||
st = RP._PdfState(_FakePdf(), "t")
|
||||
st.fig = fig
|
||||
st.page = 1
|
||||
# A wide bold token immediately followed by normal text on the SAME line.
|
||||
rich = [[("PALABRAMUYANCHAENNEGRITA", True, None),
|
||||
(" texto normal justo después", False, None)]]
|
||||
RP._place_rich_lines(st, rich, RP._FS_BODY, RP._INK)
|
||||
|
||||
renderer = fig.canvas.get_renderer()
|
||||
boxes = sorted((t.get_window_extent(renderer) for t in fig.texts),
|
||||
key=lambda b: b.x0)
|
||||
assert len(boxes) == 2, "se esperaban dos spans dibujados"
|
||||
# The bold span ends before the normal span starts (no overlap). 1px slack.
|
||||
assert boxes[0].x1 <= boxes[1].x0 + 1.0, \
|
||||
"la negrita se solapa con el texto siguiente"
|
||||
plt.close(fig)
|
||||
|
||||
|
||||
# --------------------------------------------------------------------------- #
|
||||
# 2) Zebra striping.
|
||||
# --------------------------------------------------------------------------- #
|
||||
def _facecolor_eq(artist, hexcolor) -> bool:
|
||||
want = mcolors.to_rgba(hexcolor)
|
||||
got = artist.get_facecolor()
|
||||
return all(abs(a - b) < 0.02 for a, b in zip(got[:3], want[:3]))
|
||||
|
||||
|
||||
def test_pdf_table_has_zebra_striping():
|
||||
fig = plt.figure(figsize=(RP._W, RP._H))
|
||||
st = RP._PdfState(_FakePdf(), "t")
|
||||
st.fig = fig
|
||||
st.page = 1
|
||||
st.chapter = model.Chapter(id="c", title="C", version="1.0.0")
|
||||
dt = model.DataTable(header=["A", "B"],
|
||||
rows=[["1", "x"], ["2", "y"], ["3", "z"], ["4", "w"]])
|
||||
RP._place_data_table(st, dt)
|
||||
zebra = [a for a in fig.findobj(Rectangle) if _facecolor_eq(a, RP._ZEBRA)]
|
||||
# 4 data rows → even rows (1-based 2 and 4) shaded = 2 zebra rectangles.
|
||||
assert len(zebra) == 2, f"esperadas 2 filas zebra, hay {len(zebra)}"
|
||||
plt.close(fig)
|
||||
|
||||
|
||||
def test_pptx_table_has_zebra_striping(tmp_path):
|
||||
pptx = pytest.importorskip("pptx")
|
||||
from pptx import Presentation
|
||||
from pptx.dml.color import RGBColor
|
||||
|
||||
doc = [model.Chapter(id="c", title="Tabla", version="1.0.0", blocks=[
|
||||
model.DataTable(header=["A", "B"],
|
||||
rows=[["1", "x"], ["2", "y"], ["3", "z"], ["4", "w"]])])]
|
||||
out = str(tmp_path / "zebra.pptx")
|
||||
assert render_automatic_eda_pptx(doc, out, {"write_manifest": False})["path"]
|
||||
|
||||
prs = Presentation(out)
|
||||
table = None
|
||||
for slide in prs.slides:
|
||||
for sh in slide.shapes:
|
||||
if sh.has_table:
|
||||
table = sh.table
|
||||
break
|
||||
assert table is not None, "no se encontró la tabla en el deck"
|
||||
zebra = RGBColor(0xF6, 0xF8, 0xFA)
|
||||
white = RGBColor(0xFF, 0xFF, 0xFF)
|
||||
# Row 0 = header; data rows follow. Even data rows (table rows 2, 4) shaded.
|
||||
assert table.cell(1, 0).fill.fore_color.rgb == white
|
||||
assert table.cell(2, 0).fill.fore_color.rgb == zebra
|
||||
assert table.cell(4, 0).fill.fore_color.rgb == zebra
|
||||
|
||||
|
||||
# --------------------------------------------------------------------------- #
|
||||
# 3) Keep-together (Group): heading + figure never split.
|
||||
# --------------------------------------------------------------------------- #
|
||||
def test_pdf_group_moves_whole_to_next_page_when_it_does_not_fit():
|
||||
fig = plt.figure(figsize=(RP._W, RP._H))
|
||||
st = RP._PdfState(_FakePdf(), "t")
|
||||
st.fig = fig
|
||||
st.page = 1
|
||||
st.chapter = model.Chapter(id="c", title="C", version="1.0.0")
|
||||
grp = model.Group(blocks=[
|
||||
model.Heading(text="Sección con figura", level=2),
|
||||
model.Figure(make=_small_fig, caption="cap"),
|
||||
model.Markdown(text="Descripción breve de la figura."),
|
||||
])
|
||||
# Only ~0.4in left: the group does not fit here but fits on a fresh page.
|
||||
st.y = RP._CONTENT_BOTTOM - 0.4
|
||||
page_before = st.page
|
||||
RP._place_group(st, grp)
|
||||
# Exactly one page break: the whole group (heading+figure+text) stays
|
||||
# together on the new page — no second break inside it.
|
||||
assert st.page == page_before + 1
|
||||
plt.close(st.fig)
|
||||
|
||||
|
||||
def test_pdf_group_does_not_break_when_it_fits():
|
||||
fig = plt.figure(figsize=(RP._W, RP._H))
|
||||
st = RP._PdfState(_FakePdf(), "t")
|
||||
st.fig = fig
|
||||
st.page = 1
|
||||
st.chapter = model.Chapter(id="c", title="C", version="1.0.0")
|
||||
grp = model.Group(blocks=[
|
||||
model.Heading(text="Cabe entera", level=2),
|
||||
model.Figure(make=_small_fig, caption="cap"),
|
||||
])
|
||||
st.y = RP._CONTENT_TOP # empty page → fits, must not break.
|
||||
page_before = st.page
|
||||
RP._place_group(st, grp)
|
||||
assert st.page == page_before
|
||||
plt.close(st.fig)
|
||||
|
||||
|
||||
def test_pptx_group_moves_whole_to_next_slide(tmp_path):
|
||||
pytest.importorskip("pptx")
|
||||
from pptx import Presentation
|
||||
from pptx.util import Inches
|
||||
|
||||
prs = Presentation()
|
||||
prs.slide_width = Inches(RX._W)
|
||||
prs.slide_height = Inches(RX._H)
|
||||
st = RX._PptxState(prs, "t")
|
||||
st.chapter = model.Chapter(id="c", title="C", version="1.0.0")
|
||||
RX._new_slide(st, cont=False)
|
||||
grp = model.Group(blocks=[
|
||||
model.Heading(text="Sección con figura", level=2),
|
||||
model.Figure(make=_small_fig, caption="cap"),
|
||||
model.Markdown(text="Descripción breve."),
|
||||
])
|
||||
st.y = RX._CONTENT_BOTTOM - 0.4 # does not fit here.
|
||||
slide_before = st.slide_no
|
||||
RX._place_group(st, grp)
|
||||
assert st.slide_no == slide_before + 1 # one jump; group kept together.
|
||||
|
||||
|
||||
# --------------------------------------------------------------------------- #
|
||||
# 4) Every PPTX figure carries a visible caption/title.
|
||||
# --------------------------------------------------------------------------- #
|
||||
def test_pptx_figure_without_caption_gets_heading_title(tmp_path):
|
||||
pytest.importorskip("pptx")
|
||||
from pptx import Presentation
|
||||
from pptx.enum.shapes import MSO_SHAPE_TYPE
|
||||
|
||||
doc = [model.Chapter(id="c", title="Cap", version="1.0.0", blocks=[
|
||||
model.Heading(text="Mi sección gráfica", level=2),
|
||||
model.Figure(make=_small_fig), # NO caption provided.
|
||||
])]
|
||||
out = str(tmp_path / "cap.pptx")
|
||||
assert render_automatic_eda_pptx(doc, out, {"write_manifest": False})["path"]
|
||||
|
||||
prs = Presentation(out)
|
||||
for slide in prs.slides:
|
||||
has_pic = any(sh.shape_type == MSO_SHAPE_TYPE.PICTURE
|
||||
for sh in slide.shapes)
|
||||
if not has_pic:
|
||||
continue
|
||||
italic = [r.text for sh in slide.shapes if sh.has_text_frame
|
||||
for p in sh.text_frame.paragraphs for r in p.runs
|
||||
if r.font.italic and r.text.strip()]
|
||||
assert italic, "la figura no lleva caption visible en su slide"
|
||||
assert any("Mi sección gráfica" in t for t in italic), \
|
||||
"el caption no cayó al título de la sección"
|
||||
return
|
||||
pytest.fail("no se encontró ningún slide con imagen")
|
||||
|
||||
|
||||
def test_pptx_no_figure_slide_is_ever_untitled(tmp_path):
|
||||
"""Invariant: across many figures (incl. tall ones), NO slide with an image
|
||||
lacks a visible caption — the caption never spills to the next slide."""
|
||||
pytest.importorskip("pptx")
|
||||
from pptx import Presentation
|
||||
from pptx.enum.shapes import MSO_SHAPE_TYPE
|
||||
|
||||
def _tall_fig():
|
||||
fig = plt.figure(figsize=(5.0, 4.6)) # nearly square → fills the slide.
|
||||
fig.add_subplot(111).bar([1, 2, 3], [4, 5, 6])
|
||||
return fig
|
||||
|
||||
blocks = []
|
||||
for i in range(6):
|
||||
blocks.append(model.Heading(text=f"Gráfico {i}", level=2))
|
||||
blocks.append(model.Figure(
|
||||
make=_tall_fig,
|
||||
caption=("Una descripción de la figura deliberadamente larga para "
|
||||
"que el caption ocupe más de una línea al envolverse en el "
|
||||
f"ancho del slide — figura número {i} del bloque.")))
|
||||
doc = [model.Chapter(id="c", title="Muchas figuras", version="1.0.0",
|
||||
blocks=blocks)]
|
||||
out = str(tmp_path / "many.pptx")
|
||||
assert render_automatic_eda_pptx(doc, out, {"write_manifest": False})["path"]
|
||||
|
||||
prs = Presentation(out)
|
||||
missing = []
|
||||
pics = 0
|
||||
for i, slide in enumerate(prs.slides):
|
||||
if not any(sh.shape_type == MSO_SHAPE_TYPE.PICTURE
|
||||
for sh in slide.shapes):
|
||||
continue
|
||||
pics += 1
|
||||
italic = [r.text for sh in slide.shapes if sh.has_text_frame
|
||||
for p in sh.text_frame.paragraphs for r in p.runs
|
||||
if r.font.italic and r.text.strip()]
|
||||
if not italic:
|
||||
missing.append(i)
|
||||
assert pics >= 6, f"esperadas >=6 figuras, hay {pics}"
|
||||
assert not missing, f"slides con imagen sin caption: {missing}"
|
||||
|
||||
|
||||
# --------------------------------------------------------------------------- #
|
||||
# 5) Cover built last, placed first, with an aggregated summary.
|
||||
# --------------------------------------------------------------------------- #
|
||||
def test_cover_first_glossary_last_with_summary():
|
||||
chs = build_document(_profile_with_cat_and_num(), ctx={"dataset_name": "v"})
|
||||
ids = [c.id for c in chs]
|
||||
assert ids[0] == "portada", f"la portada no es la primera: {ids}"
|
||||
assert ids[-1] == "glosario", f"el glosario no es el último: {ids}"
|
||||
cover = chs[0]
|
||||
headings = [b.text for b in cover.blocks if b.kind == "heading"]
|
||||
assert any("Resumen" in h for h in headings), \
|
||||
"la portada no incluye el resumen agregado"
|
||||
# The summary reflects the body chapters (e.g. the numeric/categorical ones).
|
||||
cover_text = " ".join(
|
||||
b.text for b in cover.blocks if getattr(b, "kind", "") == "markdown")
|
||||
assert "Distribuciones" in cover_text, \
|
||||
"el resumen de portada no menciona los capítulos del cuerpo"
|
||||
|
||||
|
||||
# --------------------------------------------------------------------------- #
|
||||
# 6) Glossary clickable in PDF (PyMuPDF GOTO) and PPTX (native slide jump).
|
||||
# --------------------------------------------------------------------------- #
|
||||
def test_pdf_glossary_term_is_clickable(tmp_path):
|
||||
fitz = pytest.importorskip("fitz")
|
||||
out = str(tmp_path / "glos.pdf")
|
||||
res = render_automatic_eda_pdf(_profile_with_cat_and_num(), out,
|
||||
{"ctx": {"dataset_name": "v"},
|
||||
"write_manifest": False})
|
||||
assert res["path"] == out and os.path.exists(out)
|
||||
|
||||
doc = fitz.open(out)
|
||||
goto = [(pno, l) for pno in range(doc.page_count)
|
||||
for l in doc[pno].get_links() if l.get("kind") == fitz.LINK_GOTO]
|
||||
doc.close()
|
||||
assert goto, "no hay ningún enlace interno (entropía → glosario) en el PDF"
|
||||
# Destination must be a real page in the document (the glossary page).
|
||||
assert all(0 <= l.get("page", -1) for _p, l in goto)
|
||||
|
||||
|
||||
def test_pptx_glossary_term_is_clickable(tmp_path):
|
||||
pytest.importorskip("pptx")
|
||||
from pptx import Presentation
|
||||
from pptx.oxml.ns import qn
|
||||
|
||||
out = str(tmp_path / "glos.pptx")
|
||||
res = render_automatic_eda_pptx(_profile_with_cat_and_num(), out,
|
||||
{"ctx": {"dataset_name": "v"},
|
||||
"write_manifest": False})
|
||||
assert res["path"] == out and os.path.exists(out)
|
||||
|
||||
prs = Presentation(out)
|
||||
found = False
|
||||
for slide in prs.slides:
|
||||
for sh in slide.shapes:
|
||||
if not sh.has_text_frame:
|
||||
continue
|
||||
for p in sh.text_frame.paragraphs:
|
||||
for r in p.runs:
|
||||
rpr = r._r.find(qn("a:rPr"))
|
||||
if rpr is None:
|
||||
continue
|
||||
hl = rpr.find(qn("a:hlinkClick"))
|
||||
if hl is not None and \
|
||||
hl.get("action") == "ppaction://hlinksldjump":
|
||||
found = True
|
||||
assert found, "ningún término tiene hyperlink de salto a slide en el PPTX"
|
||||
@@ -60,6 +60,8 @@ _FS_BODY, _FS_CELL, _FS_NOTE = 10.5, 9.0, 9.0
|
||||
_GAP = 0.12 # vertical gap after a block, inches.
|
||||
_CELL_PAD = 0.06 # horizontal padding inside a table cell, inches.
|
||||
_ROW_VPAD = 0.05 # vertical padding inside a table row, inches.
|
||||
_ZEBRA = "#f6f8fa" # very light grey for zebra-striped (even) table rows.
|
||||
_LINK = "#2a6f97" # accent colour for clickable glossary terms.
|
||||
|
||||
|
||||
class _PdfState:
|
||||
@@ -73,6 +75,11 @@ class _PdfState:
|
||||
self.page = 0 # global page counter.
|
||||
self.chapter = None # current Chapter (for the footer).
|
||||
self.chapter_pages = 0 # pages produced for the current chapter.
|
||||
self.last_heading = "" # text of the most recent heading.
|
||||
# Glossary wiring (mejora 6). Pages are 0-based; rects/points are in PDF
|
||||
# points (1/72") with a top-left origin — same convention as PyMuPDF.
|
||||
self.term_sources = [] # [{key, page, rect:[x0,y0,x1,y1]}]
|
||||
self.term_dests = {} # key -> {page, point:[x,y]}
|
||||
|
||||
|
||||
# --------------------------------------------------------------------------- #
|
||||
@@ -121,6 +128,35 @@ def _draw_footer(st: _PdfState) -> None:
|
||||
transform=st.fig.transFigure, color=_RULE, lw=0.6))
|
||||
|
||||
|
||||
def _text_width_in(st: _PdfState, s: str, fs: float, bold: bool) -> float:
    """Real rendered width (inches) of ``s`` at ``fs`` with the given weight.

    Measured with the Agg renderer's own font metrics (the same TrueType the PDF
    backend embeds), so a **bold** span advances the cursor by its ACTUAL width.
    This fixes the bug where bold text overlapped the following normal text
    because the cursor advanced by the normal-weight average-glyph estimate.
    Falls back to the deterministic character grid if the renderer is
    unavailable, so it never raises.
    """
    if not s:
        return 0.0
    try:
        from matplotlib.font_manager import FontProperties
        renderer = st.fig.canvas.get_renderer()
        prop = FontProperties(family="sans-serif", size=fs,
                              weight="bold" if bold else "normal")
        w_px, _h, _d = renderer.get_text_width_height_descent(s, prop, False)
        return w_px / float(st.fig.dpi)
    except Exception:  # noqa: BLE001 - fall back to the conservative grid metric.
        return tl.avg_char_width_in(fs) * len(s)
|
||||
|
||||
|
||||
def _pt_rect(x0_in: float, y_top_in: float, x1_in: float,
             y_bottom_in: float) -> list:
    """An inches box (top-left origin) → a PDF-points rect for PyMuPDF links."""
    return [x0_in * 72.0, y_top_in * 72.0, x1_in * 72.0, y_bottom_in * 72.0]
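For illustration, the inches-to-points conversion can be reproduced in isolation. This is a minimal sketch rather than the project's code: `inches_box_to_pt_rect` and `POINTS_PER_INCH` are hypothetical names, and the only fact it relies on is that one PDF point is 1/72 inch.

```python
POINTS_PER_INCH = 72.0  # one PDF point is 1/72 inch


def inches_box_to_pt_rect(x0_in, y_top_in, x1_in, y_bottom_in):
    """Scale a top-left-origin box from inches to PDF points (hypothetical helper)."""
    return [v * POINTS_PER_INCH for v in (x0_in, y_top_in, x1_in, y_bottom_in)]


# A 2.5 in wide, 0.25 in tall box whose top-left corner sits at (1 in, 2 in).
rect = inches_box_to_pt_rect(1.0, 2.0, 3.5, 2.25)
print(rect)  # [72.0, 144.0, 252.0, 162.0]
```

Because the layout and the link rectangles share a top-left origin, no vertical flip is needed here, only the uniform scale.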
|
||||
|
||||
|
||||
def _remaining(st: _PdfState) -> float:
|
||||
return _CONTENT_BOTTOM - st.y
|
||||
|
||||
@@ -138,6 +174,7 @@ def _place_heading(st: _PdfState, block) -> None:
|
||||
level = max(1, min(3, int(getattr(block, "level", 1) or 1)))
|
||||
fs = {1: _FS_H1, 2: _FS_H2, 3: _FS_H3}[level]
|
||||
text = tl.strip_inline_md(getattr(block, "text", ""))
|
||||
st.last_heading = text or st.last_heading
|
||||
max_chars = tl.chars_per_line(_USABLE_W, fs)
|
||||
lines = tl.wrap(text, max_chars)
|
||||
lh = tl.line_height_in(fs, leading=1.2)
|
||||
@@ -171,17 +208,19 @@ def _place_text_lines(st: _PdfState, lines: list, fs: float, color: str,
|
||||
|
||||
def _place_rich_lines(st: _PdfState, rich_lines: list, fs: float, color: str,
|
||||
indent: float = 0.0, prefixes=None) -> None:
|
||||
"""Draw pre-wrapped lines of styled segments (bold spans rendered bold).
|
||||
"""Draw pre-wrapped lines of styled segments (bold + clickable term spans).
|
||||
|
||||
Each line is ``[(text, is_bold), ...]``. Segments are placed left-to-right,
|
||||
advancing x by the deterministic character grid (same metric the wrapper
|
||||
used), so a bold span is rendered with ``fontweight='bold'`` without
|
||||
changing the line's measured width — the no-cut guarantee is preserved.
|
||||
Each line is a list of ``(text, is_bold)`` or ``(text, is_bold, term_key)``
|
||||
segments. Segments are placed left-to-right, advancing x by the segment's
|
||||
REAL rendered width (measured with the renderer's font metrics for the actual
|
||||
weight) — this is what stops a bold span from overlapping the following text:
|
||||
the cursor no longer advances by the normal-weight estimate. A segment with a
|
||||
``term_key`` is drawn in the accent colour and its rectangle is recorded in
|
||||
``st.term_sources`` so it becomes a clickable jump to the glossary entry.
|
||||
``prefixes`` is an optional ``(first_line, other_lines)`` pair (e.g. a
|
||||
bullet) drawn before the segments.
|
||||
"""
|
||||
lh = tl.line_height_in(fs)
|
||||
cw = tl.avg_char_width_in(fs)
|
||||
for idx, segs in enumerate(rich_lines):
|
||||
_ensure_space(st, lh)
|
||||
x = _ML + indent
|
||||
@@ -190,14 +229,23 @@ def _place_rich_lines(st: _PdfState, rich_lines: list, fs: float, color: str,
|
||||
if prefix:
|
||||
st.fig.text(_xf(x), _yf(st.y), prefix, fontsize=fs, color=color,
|
||||
ha="left", va="top")
|
||||
x += cw * len(prefix)
|
||||
for seg_text, is_bold in segs:
|
||||
x += _text_width_in(st, prefix, fs, False)
|
||||
for seg in segs:
|
||||
if len(seg) == 3:
|
||||
seg_text, is_bold, term = seg
|
||||
else:
|
||||
seg_text, is_bold, term = seg[0], seg[1], None
|
||||
if seg_text == "":
|
||||
continue
|
||||
st.fig.text(_xf(x), _yf(st.y), seg_text, fontsize=fs, color=color,
|
||||
ha="left", va="top",
|
||||
w = _text_width_in(st, seg_text, fs, bool(is_bold))
|
||||
st.fig.text(_xf(x), _yf(st.y), seg_text, fontsize=fs,
|
||||
color=(_LINK if term else color), ha="left", va="top",
|
||||
fontweight="bold" if is_bold else "normal")
|
||||
x += cw * len(seg_text)
|
||||
if term:
|
||||
st.term_sources.append({
|
||||
"key": term, "page": st.page - 1,
|
||||
"rect": _pt_rect(x, st.y, x + w, st.y + lh)})
|
||||
x += w
|
||||
st.y += lh
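The segment handling in `_place_rich_lines` can be sketched without matplotlib. The example below is an assumption-laden illustration, not the renderer itself: `normalize_segment`, `layout_line` and `toy_width` are hypothetical names, and the width function is a toy metric standing in for the real font measurement.

```python
def normalize_segment(seg):
    """Return (text, is_bold, term_key) for a 2-tuple or 3-tuple segment."""
    if len(seg) == 3:
        return seg[0], bool(seg[1]), seg[2]
    return seg[0], bool(seg[1]), None


def layout_line(segs, x_start, width_of):
    """Place segments left to right, advancing x by each segment's own width.

    Returns (placements, term_boxes): where each text starts, and the
    horizontal extent of every segment that carries a glossary term key.
    """
    x = x_start
    placements, term_boxes = [], []
    for seg in segs:
        text, is_bold, term = normalize_segment(seg)
        if text == "":
            continue  # empty segments draw nothing and take no width
        w = width_of(text, is_bold)
        placements.append((text, x))
        if term:
            term_boxes.append((term, x, x + w))
        x += w
    return placements, term_boxes


def toy_width(text, is_bold):
    # Hypothetical metric: 10 units per glyph, 12 when bold.
    return len(text) * (12 if is_bold else 10)


line = [("ab", False), ("cd", True), ("ef", False, "iqr")]
print(layout_line(line, 0, toy_width))
```

The point of the design is that the bold segment pushes the cursor by its bold width (24 units here, not 20), so the following text cannot overlap it.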
|
||||
|
||||
|
||||
@@ -242,7 +290,7 @@ def _place_markdown(st: _PdfState, block) -> None:
|
||||
if stripped.startswith("- ") or stripped.startswith("* "):
|
||||
content = stripped[2:] # keep inline markers for bold rendering.
|
||||
bullet_chars = tl.chars_per_line(_USABLE_W - 0.22, _FS_BODY)
|
||||
rich = tl.wrap_rich(content, bullet_chars)
|
||||
rich = tl.wrap_rich_terms(content, bullet_chars)
|
||||
_place_rich_lines(st, rich, _FS_BODY, _INK,
|
||||
prefixes=("• ", " "))
|
||||
i += 1
|
||||
@@ -258,7 +306,8 @@ def _place_markdown(st: _PdfState, block) -> None:
|
||||
j += 1
|
||||
text = " ".join(para)
|
||||
max_chars = tl.chars_per_line(_USABLE_W, _FS_BODY)
|
||||
_place_rich_lines(st, tl.wrap_rich(text, max_chars), _FS_BODY, _INK)
|
||||
_place_rich_lines(st, tl.wrap_rich_terms(text, max_chars), _FS_BODY,
|
||||
_INK)
|
||||
i = j
|
||||
st.y += _GAP
|
||||
|
||||
@@ -325,15 +374,18 @@ def _wrap_row(cells: list, widths: list, fs: float) -> list:
|
||||
|
||||
|
||||
def _draw_table_row(st: _PdfState, cells_lines: list, widths: list, fs: float,
|
||||
y0: float, header: bool) -> float:
|
||||
y0: float, header: bool, zebra: bool = False) -> float:
|
||||
lh = tl.line_height_in(fs)
|
||||
nlines = max((len(c) for c in cells_lines), default=1)
|
||||
row_h = lh * nlines + _ROW_VPAD * 2
|
||||
if header:
|
||||
# Background: header band, or a faint zebra fill for even data rows. Drawn
|
||||
# below the text/rule (zorder 0) so striping never hides cell content.
|
||||
bg = _HEAD_BG if header else (_ZEBRA if zebra else None)
|
||||
if bg is not None:
|
||||
st.fig.add_artist(Rectangle(
|
||||
(_xf(_ML), _yf(y0 + row_h)), _xf(_ML + _USABLE_W) - _xf(_ML),
|
||||
_yf(y0) - _yf(y0 + row_h), transform=st.fig.transFigure,
|
||||
color=_HEAD_BG, lw=0, zorder=0))
|
||||
color=bg, lw=0, zorder=0))
|
||||
x = _ML
|
||||
for c, lines in enumerate(cells_lines):
|
||||
for k, ln in enumerate(lines):
|
||||
@@ -378,14 +430,18 @@ def _place_data_table(st: _PdfState, block) -> None:
|
||||
+ _ROW_VPAD * 2
|
||||
_ensure_space(st, header_h() + max(first_row_h, lh))
|
||||
draw_header()
|
||||
for r in rows:
|
||||
# ``data_idx`` is the LOGICAL row index (not reset across page breaks) so the
|
||||
# zebra pattern stays coherent when a long table splits and repeats the
|
||||
# header: even rows (1-based) are shaded → 0-based odd indices.
|
||||
for data_idx, r in enumerate(rows):
|
||||
cells_lines = _wrap_row(r, widths, fs)
|
||||
row_h = lh * max((len(c) for c in cells_lines), default=1) \
|
||||
+ _ROW_VPAD * 2
|
||||
if _remaining(st) < row_h:
|
||||
_new_page(st)
|
||||
draw_header() # repeat header on the continuation page.
|
||||
st.y += _draw_table_row(st, cells_lines, widths, fs, st.y, header=False)
|
||||
st.y += _draw_table_row(st, cells_lines, widths, fs, st.y,
|
||||
header=False, zebra=(data_idx % 2 == 1))
|
||||
note = getattr(block, "note", None)
|
||||
if note:
|
||||
_place_text_lines(st, tl.wrap(model._safe_str(note),
|
||||
@@ -414,53 +470,98 @@ def _png_from_figure(fig) -> bytes:
|
||||
return buf.read()
|
||||
|
||||
|
||||
def _place_image_array(st: _PdfState, arr, caption) -> None:
|
||||
def _figure_png_cached(block):
|
||||
"""Rasterize a Figure to PNG bytes ONCE and cache (bytes, aspect).
|
||||
|
||||
Measuring (keep-together) and drawing must agree on the REAL aspect ratio:
|
||||
``bbox_inches='tight'`` changes it vs ``figsize``, so we rasterize once and
|
||||
reuse the bytes for both. Cached on the block; never raises."""
|
||||
cached = getattr(block, "_aeda_png", None)
|
||||
if cached is not None:
|
||||
return cached
|
||||
fig, owned = _resolve_figure(block)
|
||||
data = None
|
||||
if fig is not None:
|
||||
try:
|
||||
data = _png_from_figure(fig)
|
||||
finally:
|
||||
if owned:
|
||||
try:
|
||||
plt.close(fig)
|
||||
except Exception: # noqa: BLE001
|
||||
pass
|
||||
aspect = 0.66
|
||||
if data is not None:
|
||||
try:
|
||||
arr = mpimg.imread(io.BytesIO(data))
|
||||
aspect = (arr.shape[0] / arr.shape[1]) if arr.shape[1] else 0.66
|
||||
except Exception: # noqa: BLE001
|
||||
aspect = 0.66
|
||||
try:
|
||||
block._aeda_png = (data, aspect)
|
||||
return block._aeda_png
|
||||
except Exception: # noqa: BLE001 — block may reject attributes; degrade.
|
||||
return (data, aspect)
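The rasterize-once pattern used by `_figure_png_cached` can be shown on its own. This is a sketch under stated assumptions: `cached_on_block`, `Block`, `FrozenBlock` and `expensive` are hypothetical names, and the "render" step is a stand-in that returns fixed bytes.

```python
def cached_on_block(block, compute, attr="_cached_result"):
    """Compute a value once per block and stash it on the block itself.

    If the block refuses new attributes, the value is still returned,
    only without being cached.
    """
    cached = getattr(block, attr, None)
    if cached is not None:
        return cached
    result = compute(block)
    try:
        setattr(block, attr, result)
    except Exception:  # e.g. a __slots__ class with no room for the attribute
        pass
    return result


class Block:
    """Ordinary object: accepts arbitrary attributes."""


class FrozenBlock:
    """Rejects new attributes, so caching degrades to recomputing."""
    __slots__ = ()


calls = []


def expensive(block):
    # Stand-in for rasterizing a figure: records each call, returns (bytes, aspect).
    calls.append(block)
    return (b"png-bytes", 0.66)
```

Measuring and drawing then both call `cached_on_block(block, expensive)`, and see the same bytes and aspect ratio without rendering twice.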
|
||||
|
||||
|
||||
def _image_aspect(block) -> float:
|
||||
"""Real aspect (h/w) of an Image block by path, for measurement."""
|
||||
path = getattr(block, "path", "")
|
||||
if path and os.path.exists(path):
|
||||
try:
|
||||
arr = mpimg.imread(path)
|
||||
return (arr.shape[0] / arr.shape[1]) if arr.shape[1] else 0.66
|
||||
except Exception: # noqa: BLE001
|
||||
pass
|
||||
return 0.66
|
||||
|
||||
|
||||
def _place_image_array(st: _PdfState, arr, caption, max_h_in=None) -> None:
|
||||
h_px, w_px = arr.shape[0], arr.shape[1]
|
||||
aspect = (h_px / w_px) if w_px else 1.0
|
||||
# Reserve the caption's REAL (possibly multi-line) height FIRST, then scale
|
||||
# the image to (max_h - cap_reserve) so figure + caption always fit the same
|
||||
# page. cap_reserve adds a cushion so the caption never spills to next page.
|
||||
cap_lines = (tl.wrap(model._safe_str(caption),
|
||||
tl.chars_per_line(_USABLE_W, _FS_NOTE))
|
||||
if caption else [])
|
||||
cap_real = tl.line_height_in(_FS_NOTE) * len(cap_lines) if caption else 0.0
|
||||
cap_reserve = (cap_real + 0.04 + 0.08) if caption else 0.0
|
||||
max_h = _CONTENT_BOTTOM - _CONTENT_TOP
|
||||
# height_in hint (model.Figure/Image): cap the height so a figure in a
|
||||
# keep-together Group shrinks to leave room for its heading and text.
|
||||
if isinstance(max_h_in, (int, float)) and max_h_in > 0:
|
||||
max_h = min(max_h, float(max_h_in))
|
||||
max_img_h = max(max_h - cap_reserve, 0.6)
|
||||
target_w = _USABLE_W
|
||||
target_h = target_w * aspect
|
||||
if target_h > max_h:
|
||||
target_h = max_h
|
||||
if target_h > max_img_h:
|
||||
target_h = max_img_h
|
||||
target_w = target_h / aspect if aspect else _USABLE_W
|
||||
cap_h = tl.line_height_in(_FS_NOTE) + 0.04 if caption else 0.0
|
||||
# Move whole image to next page if it does not fit in remaining space.
|
||||
if _remaining(st) < target_h + cap_h:
|
||||
if (max_h) >= target_h + cap_h:
|
||||
_new_page(st)
|
||||
else:
|
||||
# Taller than a full page even at min — already clamped to max_h.
|
||||
_new_page(st)
|
||||
if _remaining(st) < target_h + cap_reserve:
|
||||
_new_page(st)
|
||||
left_frac = _xf(_ML + (_USABLE_W - target_w) / 2.0)
|
||||
bottom_frac = _yf(st.y + target_h)
|
||||
ax = st.fig.add_axes([left_frac, bottom_frac, target_w / _W, target_h / _H])
|
||||
ax.imshow(arr)
|
||||
ax.axis("off")
|
||||
st.y += target_h + 0.04
|
||||
if caption:
|
||||
_place_text_lines(st, tl.wrap(model._safe_str(caption),
|
||||
tl.chars_per_line(_USABLE_W, _FS_NOTE)),
|
||||
_FS_NOTE, _MUTED, style="italic")
|
||||
if cap_lines:
|
||||
_place_text_lines(st, cap_lines, _FS_NOTE, _MUTED, style="italic")
|
||||
st.y += _GAP
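The sizing rule in the new `_place_image_array` (reserve the caption first, then fit the image into what is left) can be written as a pure function. This is a hedged sketch: `fit_image` and its parameter names are hypothetical, and the numbers in the usage lines are made up for illustration.

```python
def fit_image(aspect, usable_w, page_h, cap_reserve=0.0, max_h_in=None, min_h=0.6):
    """Return (width, height) in inches for an image of aspect = height / width.

    The image starts at full usable width and is shrunk, keeping its aspect
    ratio, until it fits the page height minus the space reserved for the
    caption. An optional height hint lowers the ceiling further.
    """
    max_h = page_h
    if isinstance(max_h_in, (int, float)) and max_h_in > 0:
        max_h = min(max_h, float(max_h_in))
    max_img_h = max(max_h - cap_reserve, min_h)
    target_w = usable_w
    target_h = target_w * aspect
    if target_h > max_img_h:
        target_h = max_img_h
        target_w = target_h / aspect if aspect else usable_w
    return target_w, target_h


print(fit_image(1.0, 6.0, 5.0, cap_reserve=0.5))                # (4.5, 4.5)
print(fit_image(1.0, 6.0, 5.0, cap_reserve=0.5, max_h_in=3.0))  # (2.5, 2.5)
print(fit_image(0.5, 6.0, 5.0, cap_reserve=0.5))                # (6.0, 3.0)
```

Subtracting the caption reserve before scaling is what keeps the figure and its caption on the same page.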
|
||||
|
||||
|
||||
def _place_figure(st: _PdfState, block) -> None:
|
||||
fig, owned = _resolve_figure(block)
|
||||
if fig is None:
|
||||
png, _aspect = _figure_png_cached(block)
|
||||
if png is None:
|
||||
_place_text_lines(st, ["(figura no disponible)"], _FS_NOTE, _MUTED,
|
||||
style="italic")
|
||||
st.y += _GAP
|
||||
return
|
||||
try:
|
||||
png = _png_from_figure(fig)
|
||||
finally:
|
||||
if owned:
|
||||
try:
|
||||
plt.close(fig)
|
||||
except Exception: # noqa: BLE001
|
||||
pass
|
||||
arr = mpimg.imread(io.BytesIO(png))
|
||||
_place_image_array(st, arr, getattr(block, "caption", None))
|
||||
_place_image_array(st, arr, getattr(block, "caption", None),
|
||||
max_h_in=getattr(block, "height_in", None))
|
||||
|
||||
|
||||
def _place_image(st: _PdfState, block) -> None:
|
||||
@@ -471,7 +572,8 @@ def _place_image(st: _PdfState, block) -> None:
|
||||
st.y += _GAP
|
||||
return
|
||||
arr = mpimg.imread(path)
|
||||
_place_image_array(st, arr, getattr(block, "caption", None))
|
||||
_place_image_array(st, arr, getattr(block, "caption", None),
|
||||
max_h_in=getattr(block, "height_in", None))
|
||||
|
||||
|
||||
def _place_caption(st: _PdfState, block) -> None:
|
||||
@@ -488,6 +590,189 @@ def _place_note(st: _PdfState, block) -> None:
|
||||
st.y += _GAP
|
||||
|
||||
|
||||
# --------------------------------------------------------------------------- #
|
||||
# Block measurement (mejora 3 — keep-together). These estimate a block's height
|
||||
# WITHOUT drawing it, so a Group can decide to move whole to the next page before
|
||||
# anything is drawn. Over-estimating is safe: it only triggers an earlier page
|
||||
# break, never a content cut (the placers keep their own no-cut pagination).
|
||||
# --------------------------------------------------------------------------- #
|
||||
def _measure_heading_text(text: str, level: int) -> float:
|
||||
level = max(1, min(3, int(level or 1)))
|
||||
fs = {1: _FS_H1, 2: _FS_H2, 3: _FS_H3}[level]
|
||||
lines = tl.wrap(tl.strip_inline_md(text), tl.chars_per_line(_USABLE_W, fs))
|
||||
h = tl.line_height_in(fs, leading=1.2) * len(lines) + 0.06
|
||||
if level == 1:
|
||||
h += 0.10
|
||||
return h + _GAP
|
||||
|
||||
|
||||
def _measure_markdown(block) -> float:
|
||||
raw = str(getattr(block, "text", "") or "")
|
||||
md_lines = raw.split("\n")
|
||||
h = 0.0
|
||||
i, n = 0, len(md_lines)
|
||||
while i < n:
|
||||
stripped = md_lines[i].strip()
|
||||
if stripped.startswith("|") and stripped.endswith("|"):
|
||||
j = i
|
||||
while j < n and md_lines[j].strip().startswith("|") \
|
||||
and md_lines[j].strip().endswith("|"):
|
||||
j += 1
|
||||
h += (tl.line_height_in(_FS_CELL) + _ROW_VPAD * 2) * (j - i) + _GAP
|
||||
i = j
|
||||
continue
|
||||
if stripped == "":
|
||||
h += tl.line_height_in(_FS_BODY) * 0.5
|
||||
i += 1
|
||||
continue
|
||||
if stripped.startswith("### "):
|
||||
h += _measure_heading_text(stripped[4:], 3)
|
||||
i += 1
|
||||
continue
|
||||
if stripped.startswith("## "):
|
||||
h += _measure_heading_text(stripped[3:], 2)
|
||||
i += 1
|
||||
continue
|
||||
if stripped.startswith("# "):
|
||||
h += _measure_heading_text(stripped[2:], 1)
|
||||
i += 1
|
||||
continue
|
||||
if stripped.startswith("- ") or stripped.startswith("* "):
|
||||
lines = tl.wrap_rich_terms(
|
||||
stripped[2:], tl.chars_per_line(_USABLE_W - 0.22, _FS_BODY))
|
||||
h += tl.line_height_in(_FS_BODY) * len(lines)
|
||||
i += 1
|
||||
continue
|
||||
para = [stripped]
|
||||
j = i + 1
|
||||
while j < n:
|
||||
nxt = md_lines[j].strip()
|
||||
if nxt == "" or nxt.startswith(("|", "#", "- ", "* ")):
|
||||
break
|
||||
para.append(nxt)
|
||||
j += 1
|
||||
lines = tl.wrap_rich_terms(" ".join(para),
|
||||
tl.chars_per_line(_USABLE_W, _FS_BODY))
|
||||
h += tl.line_height_in(_FS_BODY) * len(lines)
|
||||
i = j
|
||||
return h + _GAP
|
||||
|
||||
|
||||
def _measure_figure_like(block) -> float:
|
||||
max_h = _CONTENT_BOTTOM - _CONTENT_TOP
|
||||
hint = getattr(block, "height_in", None)
|
||||
if isinstance(hint, (int, float)) and hint > 0:
|
||||
target_h = min(float(hint), max_h)
|
||||
else:
|
||||
# Real rasterized aspect (cached) so measuring matches drawing.
|
||||
if getattr(block, "kind", "") == "image":
|
||||
aspect = _image_aspect(block)
|
||||
else:
|
||||
_data, aspect = _figure_png_cached(block)
|
||||
target_h = min(_USABLE_W * aspect, max_h)
|
||||
cap = getattr(block, "caption", None)
|
||||
cap_h = tl.line_height_in(_FS_NOTE) + 0.04 if cap else 0.0
|
||||
return target_h + 0.04 + cap_h + _GAP
|
||||
|
||||
|
||||
def _measure_block(st: _PdfState, block) -> float:
|
||||
kind = getattr(block, "kind", "")
|
||||
try:
|
||||
if kind == "heading":
|
||||
return _measure_heading_text(getattr(block, "text", ""),
|
||||
getattr(block, "level", 1))
|
||||
if kind == "markdown":
|
||||
return _measure_markdown(block)
|
||||
if kind in ("figure", "image"):
|
||||
return _measure_figure_like(block)
|
||||
if kind in ("caption", "note"):
|
||||
lines = tl.wrap(getattr(block, "text", ""),
|
||||
tl.chars_per_line(_USABLE_W, _FS_NOTE))
|
||||
return tl.line_height_in(_FS_NOTE) * len(lines) + _GAP
|
||||
if kind == "kv_table":
|
||||
rows = getattr(block, "rows", []) or []
|
||||
return (tl.line_height_in(_FS_BODY) + _ROW_VPAD) * (len(rows) + 1) \
|
||||
+ _GAP
|
||||
if kind == "data_table":
|
||||
rows = getattr(block, "rows", []) or []
|
||||
return (tl.line_height_in(_FS_CELL) + _ROW_VPAD * 2) \
|
||||
* (len(rows) + 1) + _GAP
|
||||
if kind == "group":
|
||||
return sum(_measure_block(st, b)
|
||||
for b in (getattr(block, "blocks", []) or []))
|
||||
except Exception: # noqa: BLE001 — a measurement never aborts rendering.
|
||||
pass
|
||||
return tl.line_height_in(_FS_BODY)
|
||||
|
||||
|
||||
def _shrink_group_figures(st: _PdfState, blocks: list, avail_full: float) -> None:
    """Cap each figure's height (via height_in) so the whole group fits a page.

    The figure shrinks just enough to leave room for its heading, text and
    caption: keep-together puts the chart on the SAME page as its title and
    description instead of pushing it to the next page."""
    fig_blocks = [b for b in blocks
                  if getattr(b, "kind", "") in ("figure", "image")]
    if not fig_blocks:
        return
    nonfig_h = sum(_measure_block(st, b) for b in blocks
                   if getattr(b, "kind", "") not in ("figure", "image"))
    fig_overhead = tl.line_height_in(_FS_NOTE) + 0.04 + 0.04 + _GAP
    budget = avail_full - nonfig_h - 0.08 * len(fig_blocks)
    if budget <= 0.8:
        return
    per = budget / len(fig_blocks) - fig_overhead
    if per <= 0.6:
        return
    for fb in fig_blocks:
        cur = getattr(fb, "height_in", None)
        fb.height_in = (min(float(cur), per)
                        if isinstance(cur, (int, float)) and cur > 0 else per)
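The height budget computed by `_shrink_group_figures` can be checked with plain numbers. The sketch below is illustrative only: `per_figure_height` is a hypothetical name, and the thresholds (0.8 in, 0.6 in) and the 0.08 in per-figure padding are copied from the diff.

```python
def per_figure_height(avail_full, nonfig_h, n_figs, fig_overhead, pad=0.08):
    """Height cap (inches) for each figure in a group, or None to leave it alone.

    avail_full   : usable page height
    nonfig_h     : measured height of headings/text in the group
    fig_overhead : caption line + gaps that each figure drags along
    """
    if n_figs == 0:
        return None
    budget = avail_full - nonfig_h - pad * n_figs
    if budget <= 0.8:
        return None  # too little room: shrinking would make the figure useless
    per = budget / n_figs - fig_overhead
    return per if per > 0.6 else None


# One figure sharing a 9 in page with 2 in of text and 0.5 in of caption overhead.
print(per_figure_height(9.0, 2.0, 1, 0.5))
```

When the function returns `None`, the figures keep their original size and the group falls back to normal pagination.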
|
||||
|
||||
|
||||
def _place_group(st: _PdfState, block) -> None:
    """Render a keep-together Group: move it whole to the next page if needed."""
    blocks = getattr(block, "blocks", []) or []
    if not blocks:
        return
    avail_full = _CONTENT_BOTTOM - _CONTENT_TOP
    _shrink_group_figures(st, blocks, avail_full)
    total = sum(_measure_block(st, b) for b in blocks)
    if total <= avail_full:
        # Fits on one page: keep it together by moving whole when it won't fit.
        if total > _remaining(st):
            _new_page(st)
    elif st.y > _CONTENT_TOP + 1e-6:
        # Taller than a full page: at least start it on a fresh page, then flow.
        _new_page(st)
    for b in blocks:
        placer = _PLACERS.get(getattr(b, "kind", ""), _place_note)
        try:
            placer(st, b)
        except Exception:  # noqa: BLE001 - a bad block never aborts the group.
            pass
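The page-break decision inside `_place_group` reduces to two cases, sketched here as a pure function. `needs_new_page` and its parameters are hypothetical names introduced for the example; the logic mirrors the branch structure in the diff.

```python
def needs_new_page(total_h, remaining_h, page_h, at_page_top):
    """Decide whether a keep-together group should start on a fresh page.

    - If the group fits on one page, break only when it does not fit in the
      space left on the current page.
    - If it is taller than a full page, break unless we are already at the top.
    """
    if total_h <= page_h:
        return total_h > remaining_h
    return not at_page_top


print(needs_new_page(3.0, 2.0, 9.0, at_page_top=False))   # True: move whole group
print(needs_new_page(3.0, 4.0, 9.0, at_page_top=False))   # False: fits where it is
print(needs_new_page(12.0, 4.0, 9.0, at_page_top=False))  # True: start fresh, then flow
print(needs_new_page(12.0, 9.0, 9.0, at_page_top=True))   # False: already at the top
```

Over-estimating `total_h` is harmless in this scheme: it can only trigger an earlier break, never cut content.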
|
||||
|
||||
|
||||
def _place_glossary_entry(st: _PdfState, block) -> None:
|
||||
"""Render one glossary term and register it as a clickable link target."""
|
||||
key = getattr(block, "key", "")
|
||||
label = getattr(block, "label", "") or key
|
||||
definition = getattr(block, "definition", "")
|
||||
# Reserve the term + its first definition line together, then anchor the
|
||||
# destination at the resolved page/position before drawing.
|
||||
_ensure_space(st, tl.line_height_in(_FS_H3, leading=1.2)
|
||||
+ tl.line_height_in(_FS_BODY) * 2)
|
||||
if key:
|
||||
st.term_dests[key] = {"page": st.page - 1,
|
||||
"point": [_ML * 72.0, st.y * 72.0]}
|
||||
_place_heading(st, model.Heading(text=str(label), level=3))
|
||||
if definition:
|
||||
_place_text_lines(st, tl.wrap(model._safe_str(definition),
|
||||
tl.chars_per_line(_USABLE_W, _FS_BODY)),
|
||||
_FS_BODY, _INK)
|
||||
st.y += _GAP * 0.5
|
||||
|
||||
|
||||
_PLACERS = {
|
||||
"heading": _place_heading,
|
||||
"markdown": _place_markdown,
|
||||
@@ -497,6 +782,8 @@ _PLACERS = {
|
||||
"image": _place_image,
|
||||
"caption": _place_caption,
|
||||
"note": _place_note,
|
||||
"group": _place_group,
|
||||
"glossary_entry": _place_glossary_entry,
|
||||
}
|
||||
|
||||
|
||||
@@ -553,8 +840,42 @@ def render_pdf(chapters: list, out_path: str, meta: dict = None) -> dict:
|
||||
return {"path": None, "n_pages": 0, "chapters": [],
|
||||
"note": f"fallo al escribir el PDF: {e}"}
|
||||
|
||||
# Improvement 6: wire clickable glossary links now that the PDF is closed on disk.
|
||||
# PdfPages cannot emit internal hyperlinks, so we post-process with PyMuPDF
|
||||
# (delegated registry function). Degrades silently if it is unavailable.
|
||||
n_links = _wire_glossary_links(st, out_path, notes)
|
||||
|
||||
note = f"{n_pages} páginas"
|
||||
if n_links:
|
||||
note += f" · {n_links} enlaces de glosario"
|
||||
if notes:
|
||||
note += " · " + "; ".join(notes)
|
||||
return {"path": out_path, "n_pages": n_pages, "chapters": chapters_meta,
|
||||
"note": note}
|
||||
|
||||
|
||||
def _wire_glossary_links(st: _PdfState, out_path: str, notes: list) -> int:
    """Build {source rect → glossary dest} links and apply them via PyMuPDF.

    Returns the number of links applied (0 if there is nothing to wire or the
    post-processor is unavailable). Never raises."""
    try:
        links = []
        for src in st.term_sources:
            dest = st.term_dests.get(src.get("key"))
            if not dest:
                continue
            links.append({
                "src_page": src["page"], "src_rect": src["rect"],
                "dst_page": dest["page"], "dst_point": dest["point"]})
        if not links:
            return 0
        from datascience.add_pdf_internal_links import add_pdf_internal_links
        res = add_pdf_internal_links(out_path, links)
        if isinstance(res, dict) and res.get("status") == "ok":
            return int(res.get("n_links") or 0)
        if isinstance(res, dict) and res.get("error"):
            notes.append(f"glosario sin enlaces: {res.get('error')}")
    except Exception as e:  # noqa: BLE001 - links are best-effort.
        notes.append(f"glosario sin enlaces: {e}")
    return 0
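The pairing step of `_wire_glossary_links` (source rectangles joined to glossary destinations by key) can be exercised without any PDF library. This is a sketch: `build_links` is a hypothetical name, the dictionaries follow the shapes documented in `_PdfState`, and the sample data is invented.

```python
def build_links(term_sources, term_dests):
    """Join clickable term rectangles to their glossary destinations by key.

    Sources whose key has no destination (term never defined in the
    glossary) are skipped rather than treated as an error.
    """
    links = []
    for src in term_sources:
        dest = term_dests.get(src.get("key"))
        if not dest:
            continue
        links.append({
            "src_page": src["page"], "src_rect": src["rect"],
            "dst_page": dest["page"], "dst_point": dest["point"],
        })
    return links


sources = [
    {"key": "iqr", "page": 0, "rect": [72.0, 144.0, 120.0, 158.0]},
    {"key": "undefined-term", "page": 1, "rect": [72.0, 200.0, 130.0, 214.0]},
]
dests = {"iqr": {"page": 5, "point": [72.0, 90.0]}}
print(build_links(sources, dests))
```

Only the `iqr` source produces a link; the second source is dropped because the glossary never registered a destination for it.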
|
||||
|
||||
@@ -43,6 +43,8 @@ _ACCENT = (0x2A, 0x6F, 0x97)
|
||||
_MUTED = (0x8A, 0x8A, 0x8A)
|
||||
_HEAD_BG = (0xEE, 0xF3, 0xF6)
|
||||
_WHITE = (0xFF, 0xFF, 0xFF)
|
||||
_ZEBRA = (0xF6, 0xF8, 0xFA) # faint grey for even (zebra) data rows.
|
||||
_LINK = (0x2A, 0x6F, 0x97) # accent colour for clickable glossary terms.
|
||||
|
||||
_FS_TITLE = 26
|
||||
_FS_H1, _FS_H2, _FS_H3 = 20, 16, 13
|
||||
@@ -59,6 +61,10 @@ class _PptxState:
|
||||
self.chapter = None
|
||||
self.slide_no = 0
|
||||
self.chapter_slides = 0
|
||||
self.last_heading = "" # text of the most recent heading.
|
||||
# Glossary wiring (mejora 6): runs to link and per-term target slide.
|
||||
self.term_runs = [] # [(key, run, slide)]
|
||||
self.term_anchor_slide = {} # key -> Slide (glossary entry)
|
||||
|
||||
|
||||
def _rgb(c):
|
||||
@@ -155,9 +161,13 @@ def _add_rich_text(st: _PptxState, rich_lines: list, fs: float, color,
|
||||
indent=0.0, bullet=False) -> None:
|
||||
"""Add pre-wrapped lines of styled segments as one paragraph per line.
|
||||
|
||||
Each line is ``[(text, is_bold), ...]``; every segment becomes its own run
|
||||
so ``**bold**`` spans render with native PowerPoint bold (``run.font.bold``)
|
||||
without affecting the measured height (one paragraph per pre-wrapped line).
|
||||
Each line is a list of ``(text, is_bold)`` or ``(text, is_bold, term_key)``
|
||||
segments; every segment becomes its own run so ``**bold**`` spans render with
|
||||
native PowerPoint bold (``run.font.bold``) without affecting the measured
|
||||
height (one paragraph per pre-wrapped line). A segment carrying a
|
||||
``term_key`` is drawn in the accent colour and its run is recorded in
|
||||
``st.term_runs`` so it later becomes a native hyperlink jumping to the
|
||||
glossary slide of that term.
|
||||
"""
|
||||
lh = tl.line_height_in(fs)
|
||||
height = lh * len(rich_lines) + 0.05
|
||||
@@ -176,14 +186,20 @@ def _add_rich_text(st: _PptxState, rich_lines: list, fs: float, color,
|
||||
r0.text = "• "
|
||||
r0.font.size = Pt(fs)
|
||||
r0.font.color.rgb = _rgb(color)
|
||||
for seg_text, is_bold in segs:
|
||||
for seg in segs:
|
||||
if len(seg) == 3:
|
||||
seg_text, is_bold, term = seg
|
||||
else:
|
||||
seg_text, is_bold, term = seg[0], seg[1], None
|
||||
if seg_text == "":
|
||||
continue
|
||||
run = p.add_run()
|
||||
run.text = seg_text
|
||||
run.font.size = Pt(fs)
|
||||
run.font.bold = bool(is_bold)
|
||||
run.font.color.rgb = _rgb(color)
|
||||
run.font.color.rgb = _rgb(_LINK if term else color)
|
||||
if term:
|
||||
st.term_runs.append((term, run, st.slide))
|
||||
st.y += height
|
||||
|
||||
|
||||
@@ -191,6 +207,7 @@ def _place_heading(st: _PptxState, block) -> None:
|
||||
level = max(1, min(3, int(getattr(block, "level", 1) or 1)))
|
||||
fs = {1: _FS_H1, 2: _FS_H2, 3: _FS_H3}[level]
|
||||
text = tl.strip_inline_md(getattr(block, "text", ""))
|
||||
st.last_heading = text or st.last_heading
|
||||
lines = tl.wrap(text, tl.chars_per_line(_USABLE_W, fs))
|
||||
_add_text(st, lines, fs, _INK, bold=True)
|
||||
st.y += 0.04
|
||||
@@ -233,12 +250,12 @@ def _place_markdown(st: _PptxState, block) -> None:
|
||||
continue
|
||||
if stripped.startswith("- ") or stripped.startswith("* "):
|
||||
content = stripped[2:] # keep inline markers for bold rendering.
|
||||
rich = tl.wrap_rich(content,
|
||||
tl.chars_per_line(_USABLE_W - 0.3, _FS_BODY))
|
||||
rich = tl.wrap_rich_terms(content,
|
||||
tl.chars_per_line(_USABLE_W - 0.3, _FS_BODY))
|
||||
_add_rich_text(st, rich, _FS_BODY, _INK, bullet=True)
|
||||
i += 1
|
||||
continue
|
||||
para = [stripped] # keep inline markers; wrap_rich renders **bold**.
|
||||
para = [stripped] # keep inline markers; wrap_rich_terms renders **bold**.
|
||||
j = i + 1
|
||||
while j < n:
|
||||
nxt = md_lines[j].strip()
|
||||
@@ -247,8 +264,8 @@ def _place_markdown(st: _PptxState, block) -> None:
|
||||
para.append(nxt)
|
||||
j += 1
|
||||
text = " ".join(para)
|
||||
_add_rich_text(st, tl.wrap_rich(text, tl.chars_per_line(_USABLE_W, _FS_BODY)),
|
||||
_FS_BODY, _INK)
|
||||
_add_rich_text(st, tl.wrap_rich_terms(
|
||||
text, tl.chars_per_line(_USABLE_W, _FS_BODY)), _FS_BODY, _INK)
|
||||
i = j
|
||||
st.y += _GAP
|
||||
|
||||
@@ -295,7 +312,8 @@ def _row_height_in(cells, widths, fs) -> float:
|
||||
return lh * maxlines + 0.10
|
||||
|
||||
|
||||
def _emit_table(st: _PptxState, header, chunk, widths, fs) -> None:
|
||||
def _emit_table(st: _PptxState, header, chunk, widths, fs,
|
||||
start_index: int = 0) -> None:
|
||||
nrows = len(chunk) + (1 if header else 0)
|
||||
ncol = len(widths)
|
||||
# Pre-measure total height to size the shape (pptx still auto-grows rows).
|
||||
@@ -319,11 +337,14 @@ def _emit_table(st: _PptxState, header, chunk, widths, fs) -> None:
|
||||
cell.text = model._safe_str(header[c]) if c < len(header) else ""
|
||||
_style_cell(cell, fs, _INK, bold=True, fill=_HEAD_BG)
|
||||
ridx = 1
|
||||
for r in chunk:
|
||||
# Zebra striping: shade even data rows (1-based) using the GLOBAL row index
|
||||
# (start_index offset) so the pattern stays coherent across split chunks.
|
||||
for k, r in enumerate(chunk):
|
||||
fill = _ZEBRA if (start_index + k) % 2 == 1 else _WHITE
|
||||
for c in range(ncol):
|
||||
cell = gtable.cell(ridx, c)
|
||||
cell.text = model._safe_str(r[c]) if c < len(r) else ""
|
||||
_style_cell(cell, fs, _INK, bold=False, fill=_WHITE)
|
||||
_style_cell(cell, fs, _INK, bold=False, fill=fill)
|
||||
ridx += 1
|
||||
st.y += total_h + _GAP
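The reason for passing `start_index` into `_emit_table` is easiest to see with a small example. The sketch below is illustrative: `zebra_flags` is a hypothetical helper that reports, per chunk, which rows would be shaded when the parity test uses the global row index.

```python
def zebra_flags(chunk_sizes):
    """For each chunk of a split table, say which rows get the zebra fill.

    Shading uses the global 0-based row index (odd indices are shaded), so
    the pattern continues across chunks instead of restarting on each slide.
    """
    out, start_index = [], 0
    for size in chunk_sizes:
        out.append([(start_index + k) % 2 == 1 for k in range(size)])
        start_index += size
    return out


# A 5-row table split into a chunk of 3 and a chunk of 2.
print(zebra_flags([3, 2]))  # [[False, True, False], [True, False]]
```

With a per-chunk index the second chunk would start unshaded again, producing two unshaded rows in a row across the split.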
|
||||
|
||||
@@ -367,6 +388,7 @@ def _place_data_table(st: _PptxState, block, shaded_header=True,
|
||||
avail = _remaining(st) - header_h
|
||||
chunk = []
|
||||
used = 0.0
|
||||
chunk_start = idx # global index of the first row in this chunk (zebra).
|
||||
while idx < n:
|
||||
rh = _row_height_in(rows[idx], widths, fs)
|
||||
if used + rh > avail and chunk:
|
||||
@@ -374,7 +396,7 @@ def _place_data_table(st: _PptxState, block, shaded_header=True,
|
||||
chunk.append(rows[idx])
|
||||
used += rh
|
||||
idx += 1
|
||||
_emit_table(st, header, chunk, widths, fs)
|
||||
_emit_table(st, header, chunk, widths, fs, start_index=chunk_start)
|
||||
note = getattr(block, "note", None)
|
||||
if note:
|
||||
_add_text(st, tl.wrap(model._safe_str(note),
|
||||
@@ -421,54 +443,97 @@ def _resolve_png(block):
|
||||
pass
|
||||
|
||||
|
||||
def _place_picture_bytes(st: _PptxState, data: bytes, caption) -> None:
|
||||
def _figure_bytes_cached(block):
|
||||
"""Rasterize a figure/image to PNG bytes ONCE and cache (bytes, aspect).
|
||||
|
||||
Measuring (keep-together) and drawing must agree on the real aspect ratio —
|
||||
``bbox_inches='tight'`` changes it vs ``figsize``, so we rasterize once and
|
||||
reuse the bytes for both. Cached on the block; never raises."""
|
||||
cached = getattr(block, "_aeda_png", None)
|
||||
if cached is not None:
|
||||
return cached
|
||||
kind = getattr(block, "kind", "")
|
||||
data = None
|
||||
if kind == "image":
|
||||
path = getattr(block, "path", "")
|
||||
if path and os.path.exists(path):
|
||||
try:
|
||||
with open(path, "rb") as fh:
|
||||
data = fh.read()
|
||||
except Exception: # noqa: BLE001
|
||||
data = None
|
||||
else:
|
||||
data = _resolve_png(block)
|
||||
aspect = 0.66
|
||||
if data is not None:
|
||||
w_px, h_px = _img_size_px(data)
|
||||
aspect = (h_px / w_px) if w_px else 0.66
|
||||
try:
|
||||
block._aeda_png = (data, aspect)
|
||||
return block._aeda_png
|
||||
except Exception: # noqa: BLE001 — block may reject attributes; degrade.
|
||||
return (data, aspect)
|
||||
|
||||
|
||||
def _place_picture_bytes(st: _PptxState, data: bytes, caption,
|
||||
max_h_in=None) -> None:
|
||||
# Improvement 4: every figure on a slide carries a visible caption/title. If the
|
||||
# block has no caption, fall back to the current section heading, then to a
|
||||
# generic label, so no image is ever shown untitled.
|
||||
caption = (model._safe_str(caption).strip()
|
||||
or model._safe_str(st.last_heading).strip() or "Figura")
|
||||
w_px, h_px = _img_size_px(data)
|
||||
aspect = (h_px / w_px) if w_px else 0.66
|
||||
# Reserve the caption's REAL (possibly multi-line) height FIRST, then scale
|
||||
# the image to (max_h - cap_reserve): a figure never fills the whole slide,
|
||||
# so its caption always fits on the SAME slide and no image is untitled.
|
||||
# cap_real = what _add_text consumes; cap_reserve adds the post-image gap and
|
||||
# a small cushion so the caption never spills to the next slide.
|
||||
cap_lines = tl.wrap(caption, tl.chars_per_line(_USABLE_W, _FS_NOTE))
|
||||
cap_real = tl.line_height_in(_FS_NOTE) * len(cap_lines) + 0.05
|
||||
cap_reserve = cap_real + 0.05 + 0.10
|
||||
max_h = _CONTENT_BOTTOM - _CONTENT_TOP
|
||||
# height_in hint (model.Figure/Image): cap the target height so a figure in a
|
||||
# keep-together Group shrinks to leave room for its heading and text.
|
||||
if isinstance(max_h_in, (int, float)) and max_h_in > 0:
|
||||
max_h = min(max_h, float(max_h_in))
|
||||
max_img_h = max(max_h - cap_reserve, 0.6)
|
||||
target_w = _USABLE_W
|
||||
target_h = target_w * aspect
|
||||
if target_h > max_h:
|
||||
target_h = max_h
|
||||
if target_h > max_img_h:
|
||||
target_h = max_img_h
|
||||
target_w = target_h / aspect if aspect else _USABLE_W
|
||||
cap_h = tl.line_height_in(_FS_NOTE) + 0.05 if caption else 0.0
|
||||
if _remaining(st) < target_h + cap_h:
|
||||
# Keep the image and its caption together on the same slide.
|
||||
if _remaining(st) < target_h + cap_reserve:
|
||||
_new_slide(st, cont=True)
|
||||
left = _ML + (_USABLE_W - target_w) / 2.0
|
||||
st.slide.shapes.add_picture(io.BytesIO(data), Inches(left), Inches(st.y),
|
||||
width=Inches(target_w), height=Inches(target_h))
|
||||
st.y += target_h + 0.05
|
||||
if caption:
|
||||
_add_text(st, tl.wrap(model._safe_str(caption),
|
||||
tl.chars_per_line(_USABLE_W, _FS_NOTE)), _FS_NOTE, _MUTED,
|
||||
italic=True)
|
||||
_add_text(st, cap_lines, _FS_NOTE, _MUTED, italic=True)
|
||||
st.y += _GAP
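The caption fallback chain used for slides (explicit caption, then last heading, then a generic label) can be sketched independently. `resolve_caption` and `_clean` are hypothetical names; the sketch assumes that the project's `model._safe_str` turns `None` into an empty string, which the diff does not show.

```python
def _clean(value):
    # Assumption: None and blank strings both count as "no caption".
    return "" if value is None else str(value).strip()


def resolve_caption(caption, last_heading, default="Figura"):
    """Pick the first non-empty label so no picture is shown untitled."""
    return _clean(caption) or _clean(last_heading) or default


print(resolve_caption("Fig. 1", "Distribution"))  # Fig. 1
print(resolve_caption(None, "Distribution"))      # Distribution
print(resolve_caption("   ", ""))                 # Figura
```

Because the fallback always yields some text, the caption height can be reserved unconditionally before the picture is scaled.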
|
||||
|
||||
|
||||
def _place_figure(st: _PptxState, block) -> None:
|
||||
png = _resolve_png(block)
|
||||
png, _aspect = _figure_bytes_cached(block)
|
||||
if png is None:
|
||||
_add_text(st, ["(figura no disponible)"], _FS_NOTE, _MUTED, italic=True)
|
||||
st.y += _GAP
|
||||
return
|
||||
_place_picture_bytes(st, png, getattr(block, "caption", None))
|
||||
_place_picture_bytes(st, png, getattr(block, "caption", None),
|
||||
max_h_in=getattr(block, "height_in", None))
|
||||
|
||||
|
||||
def _place_image(st: _PptxState, block) -> None:
|
||||
path = getattr(block, "path", "")
|
||||
if not path or not os.path.exists(path):
|
||||
data, _aspect = _figure_bytes_cached(block)
|
||||
if data is None:
|
||||
path = getattr(block, "path", "")
|
||||
_add_text(st, [f"(imagen no encontrada: {path})"], _FS_NOTE, _MUTED,
|
||||
italic=True)
|
||||
st.y += _GAP
|
||||
return
|
||||
try:
|
||||
with open(path, "rb") as fh:
|
||||
data = fh.read()
|
||||
except Exception as e: # noqa: BLE001
|
||||
_add_text(st, [f"(no se pudo leer la imagen: {e})"], _FS_NOTE, _MUTED,
|
||||
italic=True)
|
||||
st.y += _GAP
|
||||
return
|
||||
_place_picture_bytes(st, data, getattr(block, "caption", None))
|
||||
_place_picture_bytes(st, data, getattr(block, "caption", None),
|
||||
max_h_in=getattr(block, "height_in", None))
|
||||
|
||||
|
||||
def _place_caption(st: _PptxState, block) -> None:
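The caption-aware scaling introduced in the picture placer of this hunk can be read as a small pure function. The following is a minimal sketch under stated assumptions, not the renderer's code: `fit_picture` and its parameter names are hypothetical, and the slide dimensions in the usage lines are invented for illustration. It only mirrors the arithmetic visible in the diff (reserve the caption height first, then clamp the image height and shrink the width to preserve the aspect ratio).

```python
def fit_picture(aspect, usable_w, max_h, cap_reserve, min_img_h=0.6):
    """Return (width, height) in inches, leaving cap_reserve free for the caption."""
    # The image may use at most the available height minus the caption reserve,
    # but never less than a small floor so it stays legible.
    max_img_h = max(max_h - cap_reserve, min_img_h)
    target_w = usable_w
    target_h = target_w * aspect
    if target_h > max_img_h:
        # Too tall: clamp the height and shrink the width to keep the aspect.
        target_h = max_img_h
        target_w = target_h / aspect if aspect else usable_w
    return target_w, target_h


# Hypothetical slide: 9.0 in usable width, 5.0 in content height, 0.5 in reserve.
wide = fit_picture(0.25, 9.0, 5.0, 0.5)    # fits at full width
square = fit_picture(1.0, 9.0, 5.0, 0.5)   # clamped to the 4.5 in limit
```

Because the reserve is subtracted before scaling, the image never consumes the space its caption needs, which is the property the comments in the hunk describe.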
@@ -482,6 +547,170 @@ def _place_note(st: _PptxState, block) -> None:
     _place_caption(st, block)


+# --------------------------------------------------------------------------- #
+# Block measurement (improvement 3: keep-together). Estimate a block's slide height
+# WITHOUT drawing it so a Group can move whole to the next slide before drawing.
+# Over-estimating only triggers an earlier slide break, never a content cut.
+# --------------------------------------------------------------------------- #
+def _measure_heading_text(text: str, level: int) -> float:
+    level = max(1, min(3, int(level or 1)))
+    fs = {1: _FS_H1, 2: _FS_H2, 3: _FS_H3}[level]
+    lines = tl.wrap(tl.strip_inline_md(text), tl.chars_per_line(_USABLE_W, fs))
+    return tl.line_height_in(fs) * len(lines) + 0.05 + 0.04
+
+
+def _measure_markdown(block) -> float:
+    raw = str(getattr(block, "text", "") or "")
+    md_lines = raw.split("\n")
+    h = 0.0
+    i, n = 0, len(md_lines)
+    while i < n:
+        stripped = md_lines[i].strip()
+        if stripped.startswith("|") and stripped.endswith("|"):
+            j = i
+            while j < n and md_lines[j].strip().startswith("|") \
+                    and md_lines[j].strip().endswith("|"):
+                j += 1
+            h += (tl.line_height_in(_FS_CELL) + 0.10) * (j - i) + _GAP
+            i = j
+            continue
+        if stripped == "":
+            h += tl.line_height_in(_FS_BODY) * 0.4
+            i += 1
+            continue
+        if stripped.startswith("### "):
+            h += _measure_heading_text(stripped[4:], 3)
+            i += 1
+            continue
+        if stripped.startswith("## "):
+            h += _measure_heading_text(stripped[3:], 2)
+            i += 1
+            continue
+        if stripped.startswith("# "):
+            h += _measure_heading_text(stripped[2:], 1)
+            i += 1
+            continue
+        if stripped.startswith("- ") or stripped.startswith("* "):
+            lines = tl.wrap_rich_terms(
+                stripped[2:], tl.chars_per_line(_USABLE_W - 0.3, _FS_BODY))
+            h += tl.line_height_in(_FS_BODY) * len(lines) + 0.05
+            i += 1
+            continue
+        para = [stripped]
+        j = i + 1
+        while j < n:
+            nxt = md_lines[j].strip()
+            if nxt == "" or nxt.startswith(("|", "#", "- ", "* ")):
+                break
+            para.append(nxt)
+            j += 1
+        lines = tl.wrap_rich_terms(" ".join(para),
+                                   tl.chars_per_line(_USABLE_W, _FS_BODY))
+        h += tl.line_height_in(_FS_BODY) * len(lines) + 0.05
+        i = j
+    return h + _GAP
+
+
+def _measure_figure_like(block) -> float:
+    max_h = _CONTENT_BOTTOM - _CONTENT_TOP
+    hint = getattr(block, "height_in", None)
+    if isinstance(hint, (int, float)) and hint > 0:
+        max_h = min(max_h, float(hint))
+    # Use the REAL rasterized aspect (cached) so measuring matches drawing; this
+    # is what keeps a figure together with its heading instead of splitting.
+    _data, aspect = _figure_bytes_cached(block)
+    target_h = min(_USABLE_W * aspect, max_h)
+    # Caption is always emitted now (improvement 4), so always reserve its line.
+    cap_h = tl.line_height_in(_FS_NOTE) + 0.05
+    return target_h + 0.05 + cap_h + _GAP
+
+
+def _measure_block(st: _PptxState, block) -> float:
+    kind = getattr(block, "kind", "")
+    try:
+        if kind == "heading":
+            return _measure_heading_text(getattr(block, "text", ""),
+                                         getattr(block, "level", 1))
+        if kind == "markdown":
+            return _measure_markdown(block)
+        if kind in ("figure", "image"):
+            return _measure_figure_like(block)
+        if kind in ("caption", "note"):
+            lines = tl.wrap(getattr(block, "text", ""),
+                            tl.chars_per_line(_USABLE_W, _FS_NOTE))
+            return tl.line_height_in(_FS_NOTE) * len(lines) + 0.05 + _GAP
+        if kind in ("kv_table", "data_table"):
+            rows = getattr(block, "rows", []) or []
+            return (tl.line_height_in(_FS_CELL) + 0.10) * (len(rows) + 1) + _GAP
+        if kind == "group":
+            return sum(_measure_block(st, b)
+                       for b in (getattr(block, "blocks", []) or []))
+    except Exception:  # noqa: BLE001 - a measurement never aborts rendering.
+        pass
+    return tl.line_height_in(_FS_BODY)
+
+
+def _shrink_group_figures(st: _PptxState, blocks: list, avail_full: float) -> None:
+    """Cap each figure's height (via height_in) so the whole group fits a slide.
+
+    The figure shrinks just enough to leave room for its heading, text and
+    caption; that is how keep-together puts a chart on the SAME slide as its
+    title and description instead of pushing it to the next slide."""
+    fig_blocks = [b for b in blocks
+                  if getattr(b, "kind", "") in ("figure", "image")]
+    if not fig_blocks:
+        return
+    nonfig_h = sum(_measure_block(st, b) for b in blocks
+                   if getattr(b, "kind", "") not in ("figure", "image"))
+    fig_overhead = tl.line_height_in(_FS_NOTE) + 0.05 + 0.05 + _GAP
+    budget = avail_full - nonfig_h - 0.10 * len(fig_blocks)
+    if budget <= 1.0:
+        return  # not enough room to keep together; let it flow (degrade).
+    per = budget / len(fig_blocks) - fig_overhead
+    if per <= 0.8:
+        return
+    for fb in fig_blocks:
+        cur = getattr(fb, "height_in", None)
+        fb.height_in = (min(float(cur), per)
+                        if isinstance(cur, (int, float)) and cur > 0 else per)
+
+
+def _place_group(st: _PptxState, block) -> None:
+    """Render a keep-together Group: move it whole to the next slide if needed."""
+    blocks = getattr(block, "blocks", []) or []
+    if not blocks:
+        return
+    avail_full = _CONTENT_BOTTOM - _CONTENT_TOP
+    _shrink_group_figures(st, blocks, avail_full)
+    total = sum(_measure_block(st, b) for b in blocks)
+    if total <= avail_full:
+        if total > _remaining(st):
+            _new_slide(st, cont=True)
+    elif st.y > _CONTENT_TOP + 1e-6:
+        _new_slide(st, cont=True)
+    for b in blocks:
+        placer = _PLACERS.get(getattr(b, "kind", ""), _place_note)
+        try:
+            placer(st, b)
+        except Exception:  # noqa: BLE001 - a bad block never aborts the group.
+            pass
+
+
+def _place_glossary_entry(st: _PptxState, block) -> None:
+    """Render one glossary term and register its slide as the link target."""
+    key = getattr(block, "key", "")
+    label = getattr(block, "label", "") or key
+    definition = getattr(block, "definition", "")
+    _ensure(st, tl.line_height_in(_FS_H3) + tl.line_height_in(_FS_BODY) * 2)
+    if key:
+        st.term_anchor_slide[key] = st.slide
+    _place_heading(st, model.Heading(text=str(label), level=3))
+    if definition:
+        _add_text(st, tl.wrap(model._safe_str(definition),
+                              tl.chars_per_line(_USABLE_W, _FS_BODY)), _FS_BODY, _INK)
+    st.y += _GAP
+
+
 _PLACERS = {
     "heading": _place_heading,
     "markdown": _place_markdown,
@@ -491,6 +720,8 @@ _PLACERS = {
     "image": _place_image,
     "caption": _place_caption,
     "note": _place_note,
+    "group": _place_group,
+    "glossary_entry": _place_glossary_entry,
 }


@@ -542,6 +773,9 @@ def render_pptx(chapters: list, out_path: str, meta: dict = None) -> dict:
             _new_slide(st, cont=False)
             _place_note(st, model.Note(
                 "(documento vacío — sin capítulos aplicables)"))
+        # Improvement 6: wire clickable glossary terms to their entry slide (native
+        # PowerPoint slide-jump). Delegated registry function; degrades silently.
+        n_links = _wire_glossary_links(st, notes)
         prs.save(out_path)
         n_slides = st.slide_no
     except Exception as e:  # noqa: BLE001
@@ -549,7 +783,35 @@ def render_pptx(chapters: list, out_path: str, meta: dict = None) -> dict:
                 "note": f"fallo al escribir el PPTX: {e}"}

     note = f"{n_slides} slides"
+    if n_links:
+        note += f" · {n_links} enlaces de glosario"
     if notes:
         note += " · " + "; ".join(notes)
     return {"path": out_path, "n_slides": n_slides, "chapters": chapters_meta,
             "note": note}
+
+
+def _wire_glossary_links(st: _PptxState, notes: list) -> int:
+    """Turn each recorded term run into a native jump to its glossary slide.
+
+    Returns the number of links applied. A term whose only appearance is inside
+    its own glossary entry (source slide == target slide) is skipped. Never
+    raises."""
+    if not st.term_runs or not st.term_anchor_slide:
+        return 0
+    linked = 0
+    try:
+        from datascience.pptx_link_run_to_slide import pptx_link_run_to_slide
+    except Exception as e:  # noqa: BLE001
+        notes.append(f"glosario sin enlaces: {e}")
+        return 0
+    for key, run, src_slide in st.term_runs:
+        tgt = st.term_anchor_slide.get(key)
+        if tgt is None or tgt is src_slide:
+            continue
+        try:
+            if pptx_link_run_to_slide(run, src_slide, tgt):
+                linked += 1
+        except Exception:  # noqa: BLE001 - links are best-effort.
+            pass
+    return linked
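The slide-break rule that `_place_group` applies before drawing can be isolated as a pure decision. This is a hedged sketch only: `needs_new_slide` and its arguments are hypothetical names, and the heights are plain numbers in inches rather than values measured from real blocks. It follows the branch structure shown in the diff: a group that fits on one slide moves whole when the current slide lacks room, and a group taller than a slide at least starts on a fresh one.

```python
def needs_new_slide(total, avail_full, remaining, at_top):
    """Decide whether a keep-together group should start on a new slide.

    total      -- estimated height of the whole group
    avail_full -- height of an empty slide's content area
    remaining  -- height still free on the current slide
    at_top     -- True when nothing has been drawn on the current slide yet
    """
    if total <= avail_full:
        # The group fits on one slide: move it whole only if it does not fit here.
        return total > remaining
    # The group is taller than any slide: start it on a fresh slide unless the
    # current one is still empty, then let its blocks flow as usual.
    return not at_top


fits_here = needs_new_slide(3.0, 5.0, 4.0, at_top=False)
moves_whole = needs_new_slide(3.0, 5.0, 2.0, at_top=False)
oversized = needs_new_slide(6.0, 5.0, 1.0, at_top=False)
oversized_at_top = needs_new_slide(6.0, 5.0, 5.0, at_top=True)
```

Over-estimating `total` only makes the first branch break earlier, which matches the comment in the measurement section that an over-estimate never cuts content.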
@@ -24,6 +24,13 @@ import textwrap
 # the visible text matches ``strip_inline_md`` exactly.
 _INLINE_SPAN_RE = re.compile(r"(\*\*.+?\*\*|__.+?__|`.+?`)")

+# Glossary term span: ``[[term:key]]texto visible[[/term]]``. The visible text
+# (which may itself contain ``**bold**``) is kept and tagged with ``key`` so the
+# renderers can turn each appearance into a clickable jump to the glossary entry.
+_TERM_SPAN_RE = re.compile(r"\[\[term:([A-Za-z0-9_]+)\]\](.*?)\[\[/term\]\]",
+                           re.S)
+_TERM_OPEN_RE = re.compile(r"\[\[term:[A-Za-z0-9_]+\]\]")
+

 def avg_char_width_in(fontsize_pt: float) -> float:
     """Approximate average glyph width in inches for a sans-serif font.
@@ -86,11 +93,21 @@ def strip_inline_md(text: str) -> str:
     if not text:
         return ""
     s = str(text)
+    # Drop glossary term markers, keeping the visible inner text.
+    s = _TERM_SPAN_RE.sub(lambda m: m.group(2), s)
+    s = _TERM_OPEN_RE.sub("", s)  # leftover unbalanced open marker.
+    s = s.replace("[[/term]]", "")  # leftover unbalanced close marker.
     for marker in ("**", "__", "`"):
         s = s.replace(marker, "")
     return s


+def _strip_term_markers(s: str) -> str:
+    """Remove any (balanced or leftover) glossary term markers, keeping text."""
+    s = _TERM_OPEN_RE.sub("", s)
+    return s.replace("[[/term]]", "")
+
+
 def _strip_leftover_markers(s: str) -> str:
     """Drop any unbalanced inline markers from a plain (non-span) fragment.

@@ -222,6 +239,118 @@ def wrap_rich(text: str, max_chars: int):
     return lines or [[("", False)]]


+def parse_inline_rich(text: str):
+    """Split ``text`` into ``[(fragment, is_bold, term_key), ...]``.
+
+    Extends :func:`parse_inline_bold` with glossary term spans
+    ``[[term:key]]visible[[/term]]``: the inner ``visible`` text is parsed for
+    ``**bold**`` as usual and every resulting fragment carries ``term_key`` so the
+    renderers can make it clickable. Text outside a term span gets ``term_key =
+    None``. Unbalanced term markers are stripped (kept identical to
+    :func:`strip_inline_md`). The concatenation of all fragment texts equals
+    ``strip_inline_md(text)``: visible characters and wrapping are unchanged; only
+    the bold flag and the term key are added. Adjacent fragments with the same
+    (bold, term) are merged.
+    """
+    s = "" if text is None else str(text)
+    if not s:
+        return []
+    out = []
+
+    def _emit(fragment: str, bold: bool, term) -> None:
+        if fragment == "":
+            return
+        if out and out[-1][1] == bold and out[-1][2] == term:
+            out[-1] = (out[-1][0] + fragment, bold, term)
+        else:
+            out.append((fragment, bold, term))
+
+    def _emit_bolded(segment: str, term) -> None:
+        # Reuse the bold parser on a term-marker-free segment.
+        for frag, bold in parse_inline_bold(_strip_term_markers(segment)):
+            _emit(frag, bold, term)
+
+    pos = 0
+    for m in _TERM_SPAN_RE.finditer(s):
+        if m.start() > pos:
+            _emit_bolded(s[pos:m.start()], None)
+        _emit_bolded(m.group(2), m.group(1))
+        pos = m.end()
+    if pos < len(s):
+        _emit_bolded(s[pos:], None)
+    return out
+
+
+def wrap_rich_terms(text: str, max_chars: int):
+    """Like :func:`wrap_rich` but preserving glossary term keys per fragment.
+
+    Returns ``list[list[(fragment, is_bold, term_key)]]``, one inner list per
+    output line. Wrapping is word-aware and hard-splits over-long tokens so no
+    line exceeds ``max_chars`` (the renderers measure these very lines). Term and
+    bold flags never widen a line: the visible width matches :func:`wrap`.
+    """
+    if max_chars < 1:
+        max_chars = 1
+    spans = parse_inline_rich(text)
+    if not spans:
+        return [[("", False, None)]]
+
+    tokens = []  # each: (word, bold, term) or ("\n", None, None)
+    for frag, bold, term in spans:
+        parts = frag.split("\n")
+        for pi, part in enumerate(parts):
+            if pi > 0:
+                tokens.append(("\n", None, None))
+            for word in part.split(" "):
+                if word == "":
+                    continue
+                tokens.append((word, bold, term))
+
+    lines = []
+    cur = []
+    cur_len = 0
+
+    def _flush():
+        nonlocal cur, cur_len
+        merged = []
+        for k, (word, bold, term) in enumerate(cur):
+            piece = word if k == 0 else " " + word
+            if merged and merged[-1][1] == bold and merged[-1][2] == term:
+                merged[-1] = (merged[-1][0] + piece, bold, term)
+            else:
+                merged.append((piece, bold, term))
+        lines.append(merged or [("", False, None)])
+        cur = []
+        cur_len = 0
+
+    for word, bold, term in tokens:
+        if bold is None:  # forced newline
+            _flush()
+            continue
+        if len(word) > max_chars:
+            if cur:
+                _flush()
+            chunks = _hard_split(word, max_chars)
+            for ci, chunk in enumerate(chunks):
+                if ci < len(chunks) - 1:
+                    lines.append([(chunk, bold, term)])
+                else:
+                    cur = [(chunk, bold, term)]
+                    cur_len = len(chunk)
+            continue
+        add = len(word) if cur_len == 0 else cur_len + 1 + len(word)
+        if cur_len != 0 and add > max_chars:
+            _flush()
+            cur = [(word, bold, term)]
+            cur_len = len(word)
+        else:
+            cur.append((word, bold, term))
+            cur_len = add
+    if cur:
+        _flush()
+    return lines or [[("", False, None)]]
+
+
 def parse_md_table(lines: list):
     """Parse consecutive ``| a | b |`` lines into ``(header, rows)`` or None.
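The `[[term:key]]visible[[/term]]` markup added in this file can be tried in isolation. The sketch below is a simplification under stated assumptions: it reuses the term regular expression shown in the diff, but `split_terms` is a hypothetical helper that ignores `**bold**` handling and leftover-marker cleanup, both of which the real `parse_inline_rich` performs.

```python
import re

# Same pattern as _TERM_SPAN_RE in the diff: key in group 1, visible text in group 2.
TERM_SPAN_RE = re.compile(r"\[\[term:([A-Za-z0-9_]+)\]\](.*?)\[\[/term\]\]", re.S)


def split_terms(text):
    """Split text into (fragment, term_key) pairs; term_key is None outside spans."""
    out = []
    pos = 0
    for m in TERM_SPAN_RE.finditer(text):
        if m.start() > pos:
            out.append((text[pos:m.start()], None))   # plain text before the span
        if m.group(2):
            out.append((m.group(2), m.group(1)))      # visible text tagged with its key
        pos = m.end()
    if pos < len(text):
        out.append((text[pos:], None))                # trailing plain text
    return out


fragments = split_terms("The [[term:iqr]]IQR[[/term]] is robust.")
visible = "".join(fragment for fragment, _key in fragments)
```

Joining the fragments gives back the visible text without markers, which is the invariant the `parse_inline_rich` docstring states against `strip_inline_md`.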
@@ -4,10 +4,10 @@ name: column_quality_score
 kind: function
 lang: py
 domain: datascience
-version: "1.0.0"
+version: "2.0.0"
 purity: pure
 signature: "def column_quality_score(col: dict) -> dict"
-description: "Computes a 0-100 data quality score for a ColumnProfile from the eda group, with a completeness/validity/consistency breakdown and a list of readable issues. Pure function, does not mutate the input."
+description: "Computes a 0-100 data quality score for a ColumnProfile from the eda group. Combines completeness (0.6) and validity (0.4) with renormalization by applicability; outliers, constant columns and ids do NOT lower the score (they go to observations). Returns a per-dimension breakdown, issues (defects) and observations (analytical signals). Pure function, does not mutate the input."
 tags: [eda, data-quality, profiling, scoring, datascience]
 uses_functions: []
 uses_types: []
@@ -17,20 +17,26 @@ error_type: ""
 imports: []
 example: |
   from datascience import column_quality_score
-  col = {"name": "precio", "inferred_type": "float", "null_pct": 0.2,
-         "unique_pct": 0.4, "flags": [], "numeric": {"outlier_pct": 0.08}}
+  col = {"name": "precio", "inferred_type": "numeric", "null_pct": 0.2,
+         "unique_pct": 0.4, "flags": [], "numeric": {"outlier_pct": 8.0}}
   column_quality_score(col)
-  # {"score": 86.8, "completeness": 0.8, "validity": 0.92,
-  #  "consistency": 1.0, "issues": ["20% nulos", "8% outliers"]}
+  # {"score": 88.0, "completeness": 0.8, "validity": 1.0,
+  #  "applicable": ["completeness", "validity"], "issues": ["20% nulos"],
+  #  "observations": ["8% de valores atípicos (z-score>3): ..."]}
 tested: true
 tests:
   - "test_clean_column_high_score"
-  - "test_half_null_lowers_completeness_and_score"
-  - "test_constant_column_flags_issue"
+  - "test_weights_60_40_native_type"
+  - "test_outliers_do_not_penalize_score"
+  - "test_nulls_lower_score_more_than_outliers"
+  - "test_validity_from_parse_rate_lowers_score"
+  - "test_validity_from_match_rate"
+  - "test_free_text_renormalizes_to_completeness_only"
+  - "test_all_null_column_scores_zero"
+  - "test_constant_column_scores_full_and_is_observation"
+  - "test_high_cardinality_id_scores_full_and_is_observation"
+  - "test_mostly_null_no_double_counts_validity"
   - "test_empty_dict_does_not_crash"
-  - "test_outliers_penalize_validity"
-  - "test_mostly_null_flag_halves_validity"
-  - "test_high_cardinality_text_flagged_as_id"
   - "test_none_values_treated_defensively"
   - "test_does_not_mutate_input"
 test_file_path: "python/functions/datascience/column_quality_score_test.py"
@@ -38,16 +44,22 @@ file_path: "python/functions/datascience/column_quality_score.py"
 params:
   - name: col
     desc: >
-      ColumnProfile dict from the eda group (e.g. output of summarize_table_duckdb).
-      Its keys are read defensively with .get(...) and None values are
-      tolerated. Keys used: null_pct (0-1), inferred_type, semantic_type,
-      unique_pct (0-1), flags (list[str], recognizes "constant"/"mostly_null"),
-      numeric ({outlier_pct: 0-1, ...}|None) and match_rate (optional, 0-1).
+      ColumnProfile dict from the eda group (e.g. output of summarize_table_duckdb /
+      profile_table). Its keys are read defensively with .get(...) and None
+      values are tolerated. Keys used: null_pct (0-1), n_rows, empty_count
+      (text), inferred_type, semantic_type, validity_rate (0-1, exposed by
+      profile_table when promoting text to number/date), match_rate (0-1),
+      unique_pct (0-1), flags (list[str], recognizes
+      "constant"/"possible_id"/"high_cardinality") and numeric ({outlier_pct: 0-100,
+      skew, ...}|None).
 output: >
-  dict with score (float 0-100, rounded to 1 decimal), completeness (0-1),
-  validity (0-1), consistency (0-1) and issues (list[str] of readable
-  descriptions of the detected problems). score = round(100 * (0.5*completeness
-  + 0.3*validity + 0.2*consistency), 1).
+  dict with score (float 0-100, 1 decimal), completeness (0-1), validity (0-1 or
+  None if not applicable), dimensions ({completeness, validity}), applicable
+  (list[str] of the dimensions that entered the score), issues (list[str], ONLY
+  quality defects: nulls, empties, non-conforming values) and observations
+  (list[str] of analytical signals that do NOT lower the score: outliers, constant
+  column, possible id, skewness). score = round(100 * (0.6*completeness +
+  0.4*validity) / applicable_weights, 1), renormalized when validity does not apply.
 ---

 ## Example
@@ -59,51 +71,71 @@ from datascience import column_quality_score
 col = {
     "name": "precio",
     "physical_type": "DOUBLE",
-    "inferred_type": "float",
+    "inferred_type": "numeric",
     "semantic_type": "",
     "count": 800,
     "n_rows": 1000,
     "null_count": 200,
     "null_pct": 0.20,
     "distinct_count": 400,
     "unique_pct": 0.40,
     "flags": [],
-    "numeric": {"outlier_pct": 0.08},
+    "numeric": {"outlier_pct": 8.0, "skew": 0.3},
     "categorical": None,
     "datetime": None,
 }

 column_quality_score(col)
 # {
-#   "score": 86.8,
-#   "completeness": 0.8,   # 1 - 0.20
-#   "validity": 0.92,      # 1 - min(0.08, 0.3)
-#   "consistency": 1.0,
-#   "issues": ["20% nulos", "8% outliers"],
+#   "score": 88.0,                       # 100 * (0.6*0.8 + 0.4*1.0)
+#   "completeness": 0.8,                 # 1 - 0.20
+#   "validity": 1.0,                     # native numeric: the type conforms
+#   "dimensions": {"completeness": 0.8, "validity": 1.0},
+#   "applicable": ["completeness", "validity"],
+#   "issues": ["20% nulos"],             # ONLY quality defects
+#   "observations": ["8% de valores atípicos (z-score>3): ..."],  # do NOT lower the score
 # }
 ```

 ## When to use it

 When you have profiled a table with the `eda` group (e.g.
-`summarize_table_duckdb`) and need a 0-100 number per column to
-sort/prioritize data cleaning, paint quality traffic lights on a
-dashboard, or decide which columns to drop before modeling. It is the
-scoring layer over the raw ColumnProfile: it reads the profile, it does not touch the data.
+`summarize_table_duckdb` / `profile_table`) and need a 0-100 number per
+column to sort/prioritize data cleaning, paint quality traffic lights,
+or decide which columns to drop before modeling. It separates **real quality
+defects** (`issues`: nulls, empties, values that do not parse to their type) from
+**analytical observations** (`observations`: outliers, constant columns,
+ids), which are reported but not penalized. It is the scoring layer over the
+raw ColumnProfile: it reads the profile, it does not touch the data.

-## Notes
+## Gotchas

-Pure function, with no I/O or external dependencies, does not mutate `col`. It reads
-every key with `.get(...)` and tolerates `None` values (a ColumnProfile fresh
-out of `summarize_table_duckdb` carries many keys set to `None`), so it
-never fails on missing keys: a `{}` yields a well-defined result.
+Pure function, no I/O, does not mutate `col`. Still, it is worth knowing:

-Score weights: completeness 0.5, validity 0.3, consistency 0.2.
+- **Outliers do NOT lower the score.** An extreme value can be real and correct
+  (a customer who buys a lot); detecting atypical values is an analysis of the
+  distribution, not a judgment of correctness. They appear in `observations`, not in
+  `issues`. The same applies to constant columns and high-cardinality
+  identifiers: they are observations, not defects.
+- **`validity` can be `None`** (not applicable): free text with no `semantic_type`
+  or `validity_rate`, or a 100% null column. In that case the score is renormalized
+  to `completeness` alone (the column is neither rewarded nor punished for something unmeasurable).
+- **`outlier_pct` is interpreted on a 0-100 scale** (the one emitted by
+  `describe_numeric`, z-score>3). Passing a 0-1 fraction yields an observation
+  text with the wrong %, but it NEVER affects the score.
+- **`validity_rate` is populated by `profile_table`** when it promotes a text
+  column to number/date (the fraction that parses). If it is absent and the type is
+  native numeric/date/bool, `validity = 1.0`.
+- No double counting: missing data counts only in `completeness` (the old
+  `mostly_null` penalty on `validity` was removed).

-- **completeness** = `1 - null_pct` (None -> 0 nulls -> 1.0).
-- **validity**: starts at 1.0 and subtracts `min(outlier_pct, 0.3)` for numeric
-  columns, `0.5 * (1 - match_rate)` if a `semantic_type` is declared and a
-  low `match_rate` is available, and multiplies by 0.5 if the `mostly_null` flag
-  is present.
-- **consistency**: 1.0 unless the `constant` flag is set (-> 0.3, uninformative column)
-  or text has `unique_pct > 0.9` (-> 0.6, possible high-cardinality id).
+## Capability growth log
+
+- v2.0.0 (2026-06-30): new quality formula (report 2046): 60/40 weights
+  (completeness/validity) with renormalization by applicability; the
+  `consistency`-as-informativeness dimension and the double penalty for
+  `mostly_null` are removed; outliers/constants/ids leave the score and go to `observations`;
+  validity measures real conformance (parse rate / match rate / native type). Output
+  extended with `dimensions`, `applicable` and `observations`.
+- v1.0.0: initial version: 50/30/20 weights (completeness/validity/consistency),
+  outliers penalized validity (with a scale bug) and consistency penalized
+  informativeness.
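The 60/40 weighting with renormalization described in this document can be checked with a few lines of arithmetic. This is a minimal sketch, assuming only the weights and the rounding stated above: `combine_score` is a hypothetical helper, not the library function, and it takes the two dimensions as already-computed numbers instead of deriving them from a ColumnProfile.

```python
W_COMPLETENESS = 0.6
W_VALIDITY = 0.4


def combine_score(completeness, validity):
    """Weighted 0-100 score; validity=None means 'not applicable' and is excluded."""
    parts = [(W_COMPLETENESS, completeness)]
    if validity is not None:
        parts.append((W_VALIDITY, validity))
    # Renormalize over the weights of the dimensions that actually apply.
    total_weight = sum(weight for weight, _value in parts)
    weighted = sum(weight * value for weight, value in parts)
    return round(100.0 * weighted / total_weight, 1)


native_numeric = combine_score(0.8, 1.0)   # the documented example: 20% nulls
free_text = combine_score(0.8, None)       # validity not applicable
```

With validity excluded the score collapses to completeness alone, so a free-text column with 20% nulls scores 80.0 rather than being rewarded or punished for a dimension that cannot be measured.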
@@ -1,34 +1,78 @@
 """Data quality score (0-100) for a ColumnProfile from the eda group.

 Pure function: given the profile of a column produced by the
-`eda` capability group (e.g. summarize_table_duckdb), computes an aggregate
-quality score together with its completeness / validity / consistency breakdown
-and a list of readable issues. Performs no I/O and does not mutate the input.
+`eda` capability group (e.g. summarize_table_duckdb / profile_table), computes an
+aggregate quality score together with its per-dimension breakdown and two
+separate readable lists: `issues` (real quality defects that DO lower the
+score) and `observations` (analytical signals that do NOT lower the score).
+Performs no I/O and does not mutate the input.
+
+Model (DAMA-DMBOK / ISO 8000), see report 2046:
+
+- Only the dimensions that can be measured automatically from the profile,
+  with no external source of truth, enter the score: per-column completeness and validity.
+- Renormalization by applicability: if a dimension cannot be measured for the
+  column (free text without semantics -> validity does not apply; 100% null column
+  -> validity not measurable), it is excluded and the weights are renormalized over
+  the applicable ones. A column is neither rewarded nor punished for something unmeasurable.
+- No double counting: missing data counts only in completeness (the old
+  extra `mostly_null` penalty on validity was removed).
+- OUTLIERS do NOT lower quality. An extreme value can be real and
+  correct; detecting atypical values is an analysis of the distribution, not a
+  judgment of correctness. Outliers, constant columns and high-cardinality
+  identifiers go to `observations`, never to `issues`.
 """


+# Base weights of the column dimensions (renormalized by applicability).
+_W_COMPLETENESS = 0.6
+_W_VALIDITY = 0.4
+
+# Inferred types whose storage guarantees type conformance (validity=1.0)
+# when they do NOT come from a text promotion (in which case validity_rate rules).
+_NATIVE_TYPED = ("numeric", "integer", "float", "datetime", "date", "boolean", "bool")
+
+
 def column_quality_score(col: dict) -> dict:
     """Compute a 0-100 data quality score for a ColumnProfile.

-    The score weighs three dimensions:
-    - completeness (0.5): proportion of non-null values.
-    - validity (0.3): absence of outliers / validity heuristics.
-    - consistency (0.2): the column carries information (not constant, not noise).
+    The score combines only quality dimensions measurable from the profile, with
+    renormalization by applicability:
+
+    - completeness (base weight 0.6, always applies): proportion of values
+      present = 1 - null_pct. For text, empty cells (`empty_count`)
+      also count as missing.
+    - validity (base weight 0.4, when there is a real validation criterion):
+      fraction of non-null values that conform to their type/semantics. Native
+      numeric/date/bool type = 1.0; text promoted to number/date = parse rate
+      (`validity_rate`); text with a regex-checkable `semantic_type` = `match_rate`;
+      free text or a 100% null column = NOT applicable (renormalizes to
+      completeness alone).
+
+    Outliers, constant columns, identifiers and strong skewness do NOT
+    lower the score: they are returned in `observations`.

     Args:
         col: ColumnProfile dict from the eda group. Keys are read
             defensively with .get(...) and many of them may be None.
-            Relevant keys: null_pct, inferred_type, semantic_type,
-            unique_pct, flags (list[str]), numeric ({outlier_pct, ...}|None),
-            match_rate (optional).
+            Relevant keys: null_pct (0-1), n_rows, empty_count,
+            inferred_type, semantic_type, validity_rate (0-1, exposed by
+            profile_table when promoting text to number/date), match_rate
+            (0-1), unique_pct (0-1), flags (list[str], recognizes
+            "constant"/"possible_id"/"high_cardinality"), numeric
+            ({outlier_pct: 0-100, skew, ...}|None).

     Returns:
         dict with:
-            score (float, 0-100, rounded to 1 decimal),
-            completeness (float, 0-1),
-            validity (float, 0-1),
-            consistency (float, 0-1),
-            issues (list[str]) readable descriptions of the problems.
+            score (float 0-100, rounded to 1 decimal),
+            completeness (float 0-1),
+            validity (float 0-1 | None if not applicable),
+            dimensions ({completeness, validity}),
+            applicable (list[str] of the dimensions that entered the score),
+            issues (list[str]) ONLY quality defects (nulls, empties,
+                values that do not conform to their type/semantics),
+            observations (list[str]) analytical signals that do NOT lower the score
+                (outliers, constant column, possible id, skewness).
     """
     if not isinstance(col, dict):
         col = {}
@@ -39,103 +83,153 @@ def column_quality_score(col: dict) -> dict:
     flags = set(flags)

     issues: list[str] = []
+    observations: list[str] = []
+
+    inferred_type = col.get("inferred_type") or ""
+    semantic_type = col.get("semantic_type") or ""

     # --- completeness -------------------------------------------------
-    null_pct = col.get("null_pct")
-    if null_pct is None:
-        null_pct = 0.0
-    try:
-        null_pct = float(null_pct)
-    except (TypeError, ValueError):
-        null_pct = 0.0
-    null_pct = _clamp(null_pct, 0.0, 1.0)
+    # Missing data = nulls + (for text) empty cells. This is the only place
+    # where missing data counts: it is never duplicated in validity.
+    null_pct = _clamp(_num(col.get("null_pct"), 0.0), 0.0, 1.0)
     completeness = 1.0 - null_pct
     if null_pct > 0:
-        issues.append(f"{round(null_pct * 100)}% nulos")
+        issues.append(f"{_pct(null_pct)} nulos")

-    # --- validity -----------------------------------------------------
-    validity = 1.0
-    inferred_type = col.get("inferred_type") or ""
+    empty_frac = 0.0
+    n_rows = col.get("n_rows")
+    empty_count = col.get("empty_count")
+    if (
+        isinstance(n_rows, (int, float)) and not isinstance(n_rows, bool) and n_rows > 0
+        and isinstance(empty_count, (int, float)) and not isinstance(empty_count, bool)
+        and empty_count > 0
+    ):
+        empty_frac = _clamp(float(empty_count) / float(n_rows), 0.0, 1.0)
+        completeness = _clamp(completeness - empty_frac, 0.0, 1.0)
+        issues.append(f"{_pct(empty_frac)} vacíos")
|
||||
|
||||
numeric = col.get("numeric")
|
||||
is_numeric = inferred_type in ("integer", "float", "numeric") or isinstance(numeric, dict)
|
||||
if isinstance(numeric, dict):
|
||||
outlier_pct = numeric.get("outlier_pct")
|
||||
if outlier_pct is not None:
|
||||
try:
|
||||
outlier_pct = float(outlier_pct)
|
||||
except (TypeError, ValueError):
|
||||
outlier_pct = 0.0
|
||||
outlier_pct = _clamp(outlier_pct, 0.0, 1.0)
|
||||
if outlier_pct > 0:
|
||||
penalty = min(outlier_pct, 0.3)
|
||||
validity -= penalty
|
||||
issues.append(f"{round(outlier_pct * 100)}% outliers")
|
||||
|
||||
# semantic_type declarado pero con baja tasa de match (si la conocemos).
|
||||
semantic_type = col.get("semantic_type") or ""
|
||||
match_rate = col.get("match_rate")
|
||||
if semantic_type and match_rate is not None:
|
||||
try:
|
||||
match_rate = float(match_rate)
|
||||
except (TypeError, ValueError):
|
||||
match_rate = None
|
||||
if match_rate is not None:
|
||||
match_rate = _clamp(match_rate, 0.0, 1.0)
|
||||
if match_rate < 1.0:
|
||||
shortfall = 1.0 - match_rate
|
||||
validity -= 0.5 * shortfall
|
||||
issues.append(
|
||||
f"semantic_type '{semantic_type}' con baja coincidencia "
|
||||
f"({round(match_rate * 100)}%)"
|
||||
)
|
||||
|
||||
if "mostly_null" in flags:
|
||||
validity *= 0.5
|
||||
issues.append("mayoritariamente nula")
|
||||
|
||||
validity = _clamp(validity, 0.0, 1.0)
|
||||
|
||||
# --- consistency --------------------------------------------------
|
||||
consistency = 1.0
|
||||
if "constant" in flags:
|
||||
consistency = 0.3
|
||||
issues.append("columna constante")
|
||||
# --- validity (con renormalizacion por aplicabilidad) -------------
|
||||
# None = no medible -> se excluye del score (no penaliza ni premia).
|
||||
validity = None
|
||||
if completeness <= 0.0:
|
||||
# Columna 100% faltante: no hay valores no nulos sobre los que medir
|
||||
# conformidad. validity no aplica -> el score sale solo de completeness
|
||||
# (= 0). Es el peor defecto de calidad posible.
|
||||
validity = None
|
||||
else:
|
||||
unique_pct = col.get("unique_pct")
|
||||
if unique_pct is not None:
|
||||
try:
|
||||
unique_pct = float(unique_pct)
|
||||
except (TypeError, ValueError):
|
||||
unique_pct = None
|
||||
if (
|
||||
inferred_type == "text"
|
||||
validity_rate = col.get("validity_rate")
|
||||
match_rate = col.get("match_rate")
|
||||
if validity_rate is not None:
|
||||
# Texto promovido a numero/fecha: parse rate real de la muestra.
|
||||
v = _num(validity_rate, None)
|
||||
if v is not None:
|
||||
validity = _clamp(v, 0.0, 1.0)
|
||||
if validity < 1.0:
|
||||
kind = (
|
||||
"número" if inferred_type == "numeric"
|
||||
else "fecha" if inferred_type == "datetime"
|
||||
else inferred_type or "su tipo"
|
||||
)
|
||||
issues.append(
|
||||
f"{_pct(1.0 - validity)} no parsea al tipo {kind}"
|
||||
)
|
||||
elif inferred_type in _NATIVE_TYPED:
|
||||
# Tipo nativo garantizado por el almacen: no hay valores que no
|
||||
# parseen. validity = 1.0 (no se confunde con tener outliers).
|
||||
validity = 1.0
|
||||
elif semantic_type and match_rate is not None:
|
||||
v = _num(match_rate, None)
|
||||
if v is not None:
|
||||
validity = _clamp(v, 0.0, 1.0)
|
||||
if validity < 1.0:
|
||||
issues.append(
|
||||
f"{_pct(1.0 - validity)} no casa con el "
|
||||
f"formato «{semantic_type}»"
|
||||
)
|
||||
else:
|
||||
# Texto libre / categorica sin semantica: no hay criterio honesto
|
||||
# de validez. No aplica.
|
||||
validity = None
|
||||
|
||||
# --- observations (NO bajan el score) -----------------------------
|
||||
numeric = col.get("numeric")
|
||||
if isinstance(numeric, dict):
|
||||
# outlier_pct viene en escala 0-100 desde describe_numeric (z-score>3).
|
||||
outlier_pct = _num(numeric.get("outlier_pct"), None)
|
||||
if outlier_pct is not None and outlier_pct >= 0.05:
|
||||
observations.append(
|
||||
f"{_pct(outlier_pct / 100.0)} de valores atípicos (z-score>3): "
|
||||
"revisar si son errores u observaciones legítimas"
|
||||
)
|
||||
skew = _num(numeric.get("skew"), None)
|
||||
if skew is not None and abs(skew) >= 1.0:
|
||||
observations.append(
|
||||
f"asimetría fuerte (skew={round(skew, 2)}): considerar "
|
||||
"re-expresión antes de modelar"
|
||||
)
|
||||
|
||||
if "constant" in flags:
|
||||
observations.append(
|
||||
"columna constante: aporta poca información para el análisis"
|
||||
)
|
||||
|
||||
unique_pct = _num(col.get("unique_pct"), None)
|
||||
is_id = (
|
||||
"possible_id" in flags
|
||||
or "high_cardinality" in flags
|
||||
or (
|
||||
inferred_type in ("text", "categorical")
|
||||
and unique_pct is not None
|
||||
and _clamp(unique_pct, 0.0, 1.0) > 0.9
|
||||
):
|
||||
consistency = 0.6
|
||||
issues.append("posible id de alta cardinalidad")
|
||||
|
||||
consistency = _clamp(consistency, 0.0, 1.0)
|
||||
|
||||
# --- score agregado ----------------------------------------------
|
||||
score = round(
|
||||
100.0 * (0.5 * completeness + 0.3 * validity + 0.2 * consistency),
|
||||
1,
|
||||
)
|
||||
)
|
||||
if is_id:
|
||||
observations.append(
|
||||
"valores casi únicos: posible identificador (no es un defecto de calidad)"
|
||||
)
|
||||
|
||||
# Silencia warnings sobre la variable de tipo no usada.
|
||||
_ = is_numeric
|
||||
# --- score agregado con renormalizacion ---------------------------
|
||||
applicable = ["completeness"]
|
||||
num = _W_COMPLETENESS * completeness
|
||||
den = _W_COMPLETENESS
|
||||
if validity is not None:
|
||||
applicable.append("validity")
|
||||
num += _W_VALIDITY * validity
|
||||
den += _W_VALIDITY
|
||||
score = round(100.0 * num / den, 1) if den > 0 else 0.0
|
||||
|
||||
return {
|
||||
"score": score,
|
||||
"completeness": completeness,
|
||||
"validity": validity,
|
||||
"consistency": consistency,
|
||||
"dimensions": {"completeness": completeness, "validity": validity},
|
||||
"applicable": applicable,
|
||||
"issues": issues,
|
||||
"observations": observations,
|
||||
}
|
||||
|
||||
|
||||
def _pct(frac: float) -> str:
    """Format a 0-1 fraction as an honest percentage: "N%" when >= 1%, "0.N%"
    below that (so a small but real defect is never shown as "0%")."""
    p = frac * 100.0
    if p >= 1.0:
        return f"{round(p)}%"
    return f"{p:.1f}%"


def _num(x, default):
    """Convert x to float; return `default` if it is None, a bool, or not parseable."""
    if x is None:
        return default
    if isinstance(x, bool):
        return default
    try:
        return float(x)
    except (TypeError, ValueError):
        return default
|
||||
|
||||
|
||||
def _clamp(x: float, lo: float, hi: float) -> float:
|
||||
"""Recorta x al rango [lo, hi]."""
|
||||
if x < lo:
|
||||
|
||||
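The renormalized aggregate at the end of `column_quality_score` can be sketched in isolation. This is a minimal sketch, not the module's code: the 0.6/0.4 weights are taken from the test docstring in this change, and the function name `renormalized_score` and the constant names are illustrative assumptions.

```python
from typing import Optional

# Assumed weights: the tests in this change describe completeness 0.6 / validity 0.4.
W_COMPLETENESS = 0.6
W_VALIDITY = 0.4


def renormalized_score(completeness: float, validity: Optional[float]) -> float:
    """Weighted mean over the applicable dimensions only, scaled to 0-100."""
    num = W_COMPLETENESS * completeness
    den = W_COMPLETENESS
    if validity is not None:  # a non-measurable validity is left out entirely
        num += W_VALIDITY * validity
        den += W_VALIDITY
    return round(100.0 * num / den, 1)


print(renormalized_score(0.7, 1.0))   # 30% nulls on a native-typed column
print(renormalized_score(0.7, None))  # free text: completeness is the only dimension
```

Dividing by the sum of the weights that actually entered the numerator is what keeps a column without a validity criterion from being rewarded or punished for it.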
@@ -1,4 +1,12 @@
|
||||
"""Tests para column_quality_score."""
|
||||
"""Tests para column_quality_score (nueva fórmula, report 2046).
|
||||
|
||||
Verifica las invariantes de la fórmula de calidad:
|
||||
- completeness (0.6) + validity (0.4) con renormalización por aplicabilidad.
|
||||
- Los OUTLIERS no bajan el score (van a observations, no a issues).
|
||||
- Columnas constantes e ids no bajan el score (observations).
|
||||
- Sin doble conteo de la falta de datos.
|
||||
- all-null -> score 0; función pura (no muta el input).
|
||||
"""
|
||||
|
||||
import os
|
||||
import sys
|
||||
@@ -9,11 +17,11 @@ from column_quality_score import column_quality_score
|
||||
|
||||
|
||||
def _clean_numeric_col() -> dict:
|
||||
"""ColumnProfile de una columna numerica sana, sin problemas."""
|
||||
"""ColumnProfile de una columna numérica nativa sana, sin problemas."""
|
||||
return {
|
||||
"name": "edad",
|
||||
"physical_type": "INTEGER",
|
||||
"inferred_type": "integer",
|
||||
"inferred_type": "numeric",
|
||||
"semantic_type": "",
|
||||
"count": 1000,
|
||||
"n_rows": 1000,
|
||||
@@ -28,85 +36,163 @@ def _clean_numeric_col() -> dict:
|
||||
}
|
||||
|
||||
|
||||
# --------------------------------------------------------------------------- #
|
||||
# Golden
|
||||
# --------------------------------------------------------------------------- #
|
||||
def test_clean_column_high_score():
|
||||
out = column_quality_score(_clean_numeric_col())
|
||||
assert out["score"] > 90
|
||||
assert out["score"] == 100.0
|
||||
assert out["completeness"] == 1.0
|
||||
assert out["validity"] == 1.0
|
||||
assert out["consistency"] == 1.0
|
||||
assert out["applicable"] == ["completeness", "validity"]
|
||||
assert out["issues"] == []
|
||||
assert out["observations"] == []
|
||||
|
||||
|
||||
def test_half_null_lowers_completeness_and_score():
|
||||
def test_weights_60_40_native_type():
|
||||
"""30% nulos en numérica nativa: score = 100*(0.6*0.7 + 0.4*1.0) = 82."""
|
||||
col = _clean_numeric_col()
|
||||
col["null_count"] = 500
|
||||
col["null_pct"] = 0.5
|
||||
clean_score = column_quality_score(_clean_numeric_col())["score"]
|
||||
col["null_pct"] = 0.30
|
||||
col["null_count"] = 300
|
||||
out = column_quality_score(col)
|
||||
assert out["completeness"] == 0.5
|
||||
assert out["score"] < clean_score
|
||||
assert any("nulos" in issue for issue in out["issues"])
|
||||
assert out["completeness"] == 0.7
|
||||
assert out["validity"] == 1.0
|
||||
assert out["score"] == 82.0
|
||||
assert any("nulos" in i for i in out["issues"])
|
||||
|
||||
|
||||
def test_constant_column_flags_issue():
|
||||
# --------------------------------------------------------------------------- #
|
||||
# Outliers FUERA del score
|
||||
# --------------------------------------------------------------------------- #
|
||||
def test_outliers_do_not_penalize_score():
|
||||
"""Columna con outliers pero sin nulos -> score máximo; outliers en observations."""
|
||||
col = _clean_numeric_col()
|
||||
col["numeric"] = {"outlier_pct": 18.0, "skew": 0.2} # 18% atípicos (escala 0-100)
|
||||
out = column_quality_score(col)
|
||||
assert out["score"] == 100.0 # los outliers NO bajan la calidad
|
||||
assert out["validity"] == 1.0
|
||||
# No aparecen como problema de calidad...
|
||||
assert not any("atípic" in i or "outlier" in i for i in out["issues"])
|
||||
# ...sino como observación analítica.
|
||||
assert any("atípic" in o for o in out["observations"])
|
||||
|
||||
|
||||
def test_nulls_lower_score_more_than_outliers():
|
||||
"""Vacíos sí penalizan; outliers no: comparar las dos columnas."""
|
||||
con_nulos = _clean_numeric_col()
|
||||
con_nulos["null_pct"] = 0.30
|
||||
con_outliers = _clean_numeric_col()
|
||||
con_outliers["numeric"] = {"outlier_pct": 30.0}
|
||||
assert column_quality_score(con_nulos)["score"] < \
|
||||
column_quality_score(con_outliers)["score"]
|
||||
|
||||
|
||||
# --------------------------------------------------------------------------- #
|
||||
# Validity: aplicabilidad y renormalización
|
||||
# --------------------------------------------------------------------------- #
|
||||
def test_validity_from_parse_rate_lowers_score():
|
||||
"""Numérica como texto con 20% basura: validity=0.8 -> score=92."""
|
||||
col = {
|
||||
"name": "precio_txt", "inferred_type": "numeric", "semantic_type": "decimal",
|
||||
"null_pct": 0.0, "validity_rate": 0.80, "flags": [], "numeric": None,
|
||||
}
|
||||
out = column_quality_score(col)
|
||||
assert out["validity"] == 0.8
|
||||
assert out["score"] == 92.0 # 100*(0.6 + 0.4*0.8)
|
||||
assert any("no parsea" in i for i in out["issues"])
|
||||
|
||||
|
||||
def test_validity_from_match_rate():
|
||||
"""Texto con semantic_type y 5% no conforme: validity=0.95."""
|
||||
col = {
|
||||
"name": "email", "inferred_type": "text", "semantic_type": "email",
|
||||
"null_pct": 0.0, "match_rate": 0.95, "unique_pct": 0.5, "flags": [],
|
||||
}
|
||||
out = column_quality_score(col)
|
||||
assert out["validity"] == 0.95
|
||||
assert out["score"] == 98.0 # 100*(0.6 + 0.4*0.95)
|
||||
assert any("no casa" in i for i in out["issues"])
|
||||
|
||||
|
||||
def test_free_text_renormalizes_to_completeness_only():
|
||||
"""Texto libre sin semántica: validity no aplica -> score = 100*completeness."""
|
||||
col = {
|
||||
"name": "comentario", "inferred_type": "text", "semantic_type": "",
|
||||
"null_pct": 0.30, "unique_pct": 0.5, "flags": [], "numeric": None,
|
||||
}
|
||||
out = column_quality_score(col)
|
||||
assert out["validity"] is None
|
||||
assert out["applicable"] == ["completeness"]
|
||||
assert out["completeness"] == 0.7
|
||||
assert out["score"] == 70.0 # renormalizado a solo completeness
|
||||
|
||||
|
||||
# --------------------------------------------------------------------------- #
|
||||
# Casos límite (report §4.6)
|
||||
# --------------------------------------------------------------------------- #
|
||||
def test_all_null_column_scores_zero():
|
||||
col = _clean_numeric_col()
|
||||
col["null_pct"] = 1.0
|
||||
col["null_count"] = 1000
|
||||
out = column_quality_score(col)
|
||||
assert out["completeness"] == 0.0
|
||||
assert out["validity"] is None # no medible sin valores no nulos
|
||||
assert out["score"] == 0.0
|
||||
|
||||
|
||||
def test_constant_column_scores_full_and_is_observation():
|
||||
"""Columna constante: dato válido y completo -> score 100; baja info = observación."""
|
||||
col = _clean_numeric_col()
|
||||
col["flags"] = ["constant"]
|
||||
col["distinct_count"] = 1
|
||||
col["unique_pct"] = 0.001
|
||||
out = column_quality_score(col)
|
||||
assert out["consistency"] == 0.3
|
||||
assert any("constante" in issue for issue in out["issues"])
|
||||
assert out["score"] == 100.0 # NO se castiga la baja informatividad
|
||||
assert not any("constante" in i for i in out["issues"])
|
||||
assert any("constante" in o for o in out["observations"])
|
||||
|
||||
|
||||
def test_high_cardinality_id_scores_full_and_is_observation():
|
||||
"""Id de alta cardinalidad: unicidad perfecta -> score 100; posible id = observación."""
|
||||
col = {
|
||||
"name": "uuid", "inferred_type": "text", "semantic_type": "",
|
||||
"null_pct": 0.0, "unique_pct": 0.99, "flags": ["possible_id"],
|
||||
"numeric": None,
|
||||
}
|
||||
out = column_quality_score(col)
|
||||
assert out["score"] == 100.0
|
||||
assert not any("identificador" in i for i in out["issues"])
|
||||
assert any("identificador" in o for o in out["observations"])
|
||||
|
||||
|
||||
def test_mostly_null_no_double_counts_validity():
|
||||
"""85% nulos: solo completeness penaliza; validity nativa sigue 1.0 (sin doble castigo)."""
|
||||
col = _clean_numeric_col()
|
||||
col["null_pct"] = 0.85
|
||||
col["flags"] = ["mostly_null"]
|
||||
out = column_quality_score(col)
|
||||
assert out["validity"] == 1.0 # ya no se multiplica por 0.5
|
||||
# score = 100*(0.6*0.15 + 0.4*1.0) = 49
|
||||
assert out["score"] == 49.0
|
||||
assert not any("mayoritariamente" in o for o in out["observations"])
|
||||
|
||||
|
||||
# --------------------------------------------------------------------------- #
|
||||
# Robustez
|
||||
# --------------------------------------------------------------------------- #
|
||||
def test_empty_dict_does_not_crash():
|
||||
out = column_quality_score({})
|
||||
assert isinstance(out["score"], float)
|
||||
assert out["completeness"] == 1.0
|
||||
assert 0.0 <= out["score"] <= 100.0
|
||||
assert isinstance(out["issues"], list)
|
||||
|
||||
|
||||
def test_outliers_penalize_validity():
|
||||
col = _clean_numeric_col()
|
||||
col["numeric"] = {"outlier_pct": 0.2}
|
||||
out = column_quality_score(col)
|
||||
assert out["validity"] < 1.0
|
||||
assert any("outliers" in issue for issue in out["issues"])
|
||||
|
||||
|
||||
def test_mostly_null_flag_halves_validity():
|
||||
col = _clean_numeric_col()
|
||||
col["null_pct"] = 0.85
|
||||
col["flags"] = ["mostly_null"]
|
||||
out = column_quality_score(col)
|
||||
assert out["validity"] == 0.5
|
||||
assert any("mayoritariamente nula" in issue for issue in out["issues"])
|
||||
|
||||
|
||||
def test_high_cardinality_text_flagged_as_id():
|
||||
col = {
|
||||
"name": "uuid",
|
||||
"inferred_type": "text",
|
||||
"semantic_type": "",
|
||||
"null_pct": 0.0,
|
||||
"unique_pct": 0.99,
|
||||
"flags": [],
|
||||
"numeric": None,
|
||||
}
|
||||
out = column_quality_score(col)
|
||||
assert out["consistency"] < 1.0
|
||||
assert any("alta cardinalidad" in issue for issue in out["issues"])
|
||||
assert isinstance(out["observations"], list)
|
||||
|
||||
|
||||
def test_none_values_treated_defensively():
|
||||
col = {
|
||||
"name": "x",
|
||||
"inferred_type": None,
|
||||
"semantic_type": None,
|
||||
"null_pct": None,
|
||||
"unique_pct": None,
|
||||
"flags": None,
|
||||
"numeric": None,
|
||||
"name": "x", "inferred_type": None, "semantic_type": None,
|
||||
"null_pct": None, "unique_pct": None, "flags": None, "numeric": None,
|
||||
}
|
||||
out = column_quality_score(col)
|
||||
assert out["completeness"] == 1.0
|
||||
|
||||
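The applicability rules that these tests exercise can be summarized in a small sketch. It is an approximation under stated assumptions: `resolve_validity` is a hypothetical name, `NATIVE_TYPED` is a stand-in for the module's `_NATIVE_TYPED` (whose contents are not shown in this diff), and the defensive parsing done by `_num` is omitted.

```python
from typing import Optional

# Hypothetical stand-in for the module's _NATIVE_TYPED set.
NATIVE_TYPED = {"numeric", "datetime", "boolean"}


def _clamp01(x: float) -> float:
    return min(max(x, 0.0), 1.0)


def resolve_validity(col: dict, completeness: float) -> Optional[float]:
    """Return validity in [0, 1], or None when it cannot be measured."""
    if completeness <= 0.0:
        return None  # all-missing column: no values to check conformity against
    validity_rate = col.get("validity_rate")
    if validity_rate is not None:
        return _clamp01(float(validity_rate))  # parse rate of text promoted to a type
    if col.get("inferred_type") in NATIVE_TYPED:
        return 1.0  # type guaranteed by the store; outliers do not count here
    if col.get("semantic_type") and col.get("match_rate") is not None:
        return _clamp01(float(col["match_rate"]))  # conformity to the semantic format
    return None  # free text / categorical without semantics


print(resolve_validity({"inferred_type": "text", "semantic_type": ""}, 0.7))
```

The order of the branches matters: a measured parse rate takes precedence over the native-type shortcut, which is why a `numeric` column with `validity_rate` 0.8 does not get validity 1.0.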
@@ -0,0 +1,85 @@
|
||||
---
name: pptx_link_run_to_slide
kind: function
lang: py
domain: datascience
version: "1.0.0"
purity: impure
signature: "def pptx_link_run_to_slide(run, source_slide, target_slide) -> bool"
description: "Turns a python-pptx text run into an INTERNAL 'go to slide' hyperlink. python-pptx supports run.hyperlink.address for external URLs but NOT for jumping to another slide in the same deck; this function creates that jump by manipulating the XML: it adds a slide->slide relationship (RT.SLIDE) and an <a:hlinkClick> with action='ppaction://hlinksldjump' and the relationship's r:id, inserted as the first child of the run's <a:rPr> (schema-valid when the rPr has no other children, as on a freshly added run). Idempotent (removes any previous hlinkClick before inserting). Clicking the text in PowerPoint or compatible viewers navigates to target_slide. Engine: python-pptx. Never raises: any exception -> return False."
tags: [eda, pptx, hyperlink, slide-jump, navigation, glossary, automatic-eda, python-pptx, xml, datascience, python]
uses_functions: []
uses_types: []
returns: []
returns_optional: false
error_type: "error_go_core"
imports: ["python-pptx"]
params:
  - name: run
    desc: "the pptx.text.text._Run whose text becomes clickable. It must be a real run (one that exposes ._r, the <a:r> element). An object without ._r makes the function return False without raising."
  - name: source_slide
    desc: "the Slide that contains the run. Its part receives the slide->slide relationship (relate_to with RELATIONSHIP_TYPE.SLIDE); the resulting r:id is referenced in the hlinkClick."
  - name: target_slide
    desc: "the Slide the jump leads to. It must belong to the SAME Presentation as source_slide for the internal relationship to be valid."
output: "bool. True if the internal hyperlink was applied (relationship created + <a:hlinkClick> inserted in the run's rPr); False if something prevented it (invalid run, slides from different presentations, etc.). Never raises."
tested: true
tests: ["test_golden_run_se_vuelve_salto_a_otra_slide", "test_idempotente_reaplica_sin_duplicar_hlinkclick", "test_error_path_run_invalido_devuelve_false_sin_lanzar"]
test_file_path: "python/functions/datascience/pptx_link_run_to_slide_test.py"
file_path: "python/functions/datascience/pptx_link_run_to_slide.py"
---
|
||||
|
||||
## Example

```python
from pptx import Presentation
from pptx.util import Inches
from pptx.oxml.ns import qn

from datascience.pptx_link_run_to_slide import pptx_link_run_to_slide

prs = Presentation()
blank = prs.slide_layouts[6]           # blank layout
slide0 = prs.slides.add_slide(blank)
slide1 = prs.slides.add_slide(blank)   # jump target (e.g. the glossary)

box = slide0.shapes.add_textbox(Inches(1), Inches(1), Inches(4), Inches(1))
run = box.text_frame.paragraphs[0].add_run()
run.text = "ir al glosario"

ok = pptx_link_run_to_slide(run, slide0, slide1)
print(ok)  # -> True

# The run now carries <a:rPr><a:hlinkClick action="ppaction://hlinksldjump" r:id="rIdN"/></a:rPr>
hlink = run._r.get_or_add_rPr().find(qn("a:hlinkClick"))
print(hlink.get("action"))  # -> ppaction://hlinksldjump
prs.save("deck_con_salto.pptx")
```
|
||||
|
||||
## When to use it

Use it when you build a PPTX deck with **internal navigation** and want a piece of text to
jump to another slide when clicked: a **clickable glossary** (each term links to its
definition slide), a **navigable index/table of contents**, "back to the cover" buttons, or
cross-references between chapters. It is the piece that `python-pptx` does not provide out
of the box. Apply it to runs already created by renderers such as
`render_automatic_eda_pptx` from the `eda` group to enrich the deck with jumps without
rewriting the XML by hand each time.
|
||||
|
||||
## Gotchas

- **Impure**: it mutates the run's XML and creates a new relationship on the part of `source_slide`.
- **Only navigates in viewers that honor `ppaction://hlinksldjump`**: PowerPoint and most
  compatible viewers follow it; some web/lightweight viewers ignore it (the text looks the
  same but does not jump).
- **Same Presentation**: `source_slide` and `target_slide` must belong to the same deck.
  If they come from different presentations, the internal relationship is not valid and the
  jump will not work (the function may still return True because the relationship was
  created, but the result in the viewer will not be the expected one).
- **The `<a:hlinkClick>` lives in the run's `<a:rPr>`**, not as a direct child of `<a:r>`.
  To locate it: `run._r.get_or_add_rPr().find(qn("a:hlinkClick"))` (a `find` on
  `run._r` returns `None` because it only looks at direct children of `<a:r>`).
- **Idempotent**: if the run already had an `hlinkClick` (e.g. an external URL or an earlier
  jump), it is removed before the new one is inserted; a run has at most one click link.
- **Never raises**: any exception (run without `._r`, incompatible slides, etc.) is
  swallowed and `False` is returned. Check the boolean if the jump is critical.
- **python-pptx dependency**: declared in `python/pyproject.toml`. Run the tests with
  `~/fn_registry/python/.venv/bin/python3` (it has `python-pptx` installed).
|
||||
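The "lives in `<a:rPr>`" gotcha can be reproduced without python-pptx. The sketch below is only an illustration: it uses the standard-library `xml.etree.ElementTree` instead of the lxml elements that python-pptx exposes, builds the run XML by hand, and uses a made-up relationship id (`rId2`).

```python
import xml.etree.ElementTree as ET

A = "http://schemas.openxmlformats.org/drawingml/2006/main"
R = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"

# Hand-built <a:r><a:rPr><a:hlinkClick .../></a:rPr><a:t>...</a:t></a:r>
r = ET.Element(f"{{{A}}}r")
rPr = ET.SubElement(r, f"{{{A}}}rPr")
ET.SubElement(
    rPr,
    f"{{{A}}}hlinkClick",
    {f"{{{R}}}id": "rId2", "action": "ppaction://hlinksldjump"},  # rId2 is made up
)
ET.SubElement(r, f"{{{A}}}t").text = "ir al glosario"

tag = f"{{{A}}}hlinkClick"
direct = r.find(tag)                      # a bare tag matches direct children of <a:r> only
nested = r.find(f"{{{A}}}rPr").find(tag)  # the link is one level down, inside <a:rPr>
print(direct is None, nested.get("action"))
```

A bare tag passed to `find` matches direct children only, which is why the lookup has to go through the `rPr` element first.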
@@ -0,0 +1,50 @@
|
"""Turn a python-pptx text run into an internal "go to slide" hyperlink.

python-pptx exposes ``run.hyperlink.address`` for external URLs, but it does NOT offer a
public API for jumping to another slide in the same deck. This function creates that
internal jump by manipulating the XML: it adds a ``slide -> slide`` relationship and an
``<a:hlinkClick>`` with the ``ppaction://hlinksldjump`` action on the run, so that
clicking the text in PowerPoint (or in viewers that honor that action) navigates to the
target slide.
"""

from pptx.opc.constants import RELATIONSHIP_TYPE as RT
from pptx.oxml.ns import qn


def pptx_link_run_to_slide(run, source_slide, target_slide) -> bool:
    """Turn a text run into an internal "go to slide" hyperlink.

    Adds a ``slide -> slide`` relationship from the source slide to the part of the
    target slide and creates an ``<a:hlinkClick>`` with
    ``action="ppaction://hlinksldjump"`` as the first child of the run's ``<a:rPr>``
    (schema-valid when the ``rPr`` has no other children, as on a freshly added run).
    The operation is idempotent: a previous ``hlinkClick`` on the same run is removed
    before the new one is inserted.

    Args:
        run: the ``pptx.text.text._Run`` whose text becomes clickable.
        source_slide: the ``Slide`` that contains the run.
        target_slide: the ``Slide`` the jump leads to.

    Returns:
        True if the hyperlink was applied; False if something prevented it (never raises).
    """
    try:
        rId = source_slide.part.relate_to(target_slide.part, RT.SLIDE)
        rPr = run._r.get_or_add_rPr()
        # Remove a previous hlinkClick, if any (idempotent).
        for existing in rPr.findall(qn("a:hlinkClick")):
            rPr.remove(existing)
        hlink = rPr.makeelement(
            qn("a:hlinkClick"),
            {
                qn("r:id"): rId,
                "action": "ppaction://hlinksldjump",
            },
        )
        # Insert a:hlinkClick as the first child of rPr. On a freshly added run rPr is
        # empty, so this matches the schema; note that CT_TextCharacterProperties orders
        # hlinkClick after the fill and font children (a:solidFill, a:latin, ...).
        rPr.insert(0, hlink)
        return True
    except Exception:
        return False
|
||||
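The remove-then-insert pattern that makes the function idempotent can be tried on a bare element. This is a hedged sketch that assumes nothing about python-pptx: it uses the standard-library `xml.etree.ElementTree`, a hypothetical helper name `set_slide_jump`, and made-up relationship ids.

```python
import xml.etree.ElementTree as ET

A = "http://schemas.openxmlformats.org/drawingml/2006/main"
R = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"
HLINK = f"{{{A}}}hlinkClick"


def set_slide_jump(rPr: ET.Element, r_id: str) -> None:
    """Replace any existing hlinkClick in rPr with a slide-jump link for r_id."""
    for existing in rPr.findall(HLINK):
        rPr.remove(existing)  # drop earlier links so at most one remains
    hlink = ET.Element(
        HLINK, {f"{{{R}}}id": r_id, "action": "ppaction://hlinksldjump"}
    )
    rPr.insert(0, hlink)


rPr = ET.Element(f"{{{A}}}rPr")  # empty rPr, as on a freshly added run
set_slide_jump(rPr, "rId2")      # rId2 / rId3 are made-up relationship ids
set_slide_jump(rPr, "rId3")      # re-applying must not duplicate the link
links = rPr.findall(HLINK)
print(len(links), links[0].get(f"{{{R}}}id"))
```

Because the old element is removed before the new one is inserted, the last call wins and the run never ends up with two click links.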
@@ -0,0 +1,73 @@
|
"""Tests for pptx_link_run_to_slide: internal jump from a run to a slide.

Self-contained: builds an in-memory Presentation with two blank slides and a textbox
with one run on slide 0, then checks that the function injects an
``<a:hlinkClick>`` with ``action="ppaction://hlinksldjump"`` and an ``r:id`` that
resolves to the part of slide 1.
"""

import pytest

pytest.importorskip("pptx")

from pptx import Presentation  # noqa: E402
from pptx.oxml.ns import qn  # noqa: E402
from pptx.util import Inches  # noqa: E402

from datascience.pptx_link_run_to_slide import pptx_link_run_to_slide  # noqa: E402


def _two_slide_deck_with_run():
    prs = Presentation()
    blank = prs.slide_layouts[6]  # blank layout
    slide0 = prs.slides.add_slide(blank)
    slide1 = prs.slides.add_slide(blank)

    box = slide0.shapes.add_textbox(Inches(1), Inches(1), Inches(4), Inches(1))
    tf = box.text_frame
    para = tf.paragraphs[0]
    run = para.add_run()
    run.text = "ir al glosario"
    return prs, slide0, slide1, run


def test_golden_run_se_vuelve_salto_a_otra_slide():
    prs, slide0, slide1, run = _two_slide_deck_with_run()

    ok = pptx_link_run_to_slide(run, slide0, slide1)
    assert ok is True

    # The hlinkClick is a child of the run's rPr (per the
    # CT_TextCharacterProperties schema), not a direct child of <a:r>.
    rPr = run._r.get_or_add_rPr()
    hlink = rPr.find(qn("a:hlinkClick"))
    assert hlink is not None
    assert hlink.get("action") == "ppaction://hlinksldjump"

    rId = hlink.get(qn("r:id"))
    assert rId, "the hlinkClick must carry a non-empty r:id"

    # The rId must exist in the source slide's relationships and point
    # to the part of the target slide.
    rels = slide0.part.rels
    assert rId in rels
    assert rels[rId].target_part is slide1.part


def test_idempotente_reaplica_sin_duplicar_hlinkclick():
    prs, slide0, slide1, run = _two_slide_deck_with_run()

    assert pptx_link_run_to_slide(run, slide0, slide1) is True
    assert pptx_link_run_to_slide(run, slide0, slide1) is True

    rPr = run._r.get_or_add_rPr()
    hlinks = rPr.findall(qn("a:hlinkClick"))
    assert len(hlinks) == 1


def test_error_path_run_invalido_devuelve_false_sin_lanzar():
    prs, slide0, slide1, _run = _two_slide_deck_with_run()

    # An object without ._r: the function does not raise, it returns False.
    ok = pptx_link_run_to_slide(object(), slide0, slide1)
    assert ok is False
|
||||
@@ -3,7 +3,7 @@ name: summarize_table_duckdb
|
||||
kind: function
|
||||
lang: py
|
||||
domain: datascience
|
||||
version: "1.0.0"
|
||||
version: "1.1.0"
|
||||
purity: impure
|
||||
signature: "def summarize_table_duckdb(db_path: str, table: str, high_card_ratio: float = 0.9) -> dict"
|
||||
description: "Perfila una tabla DuckDB en una sola pasada SQL (SUMMARIZE, push-down sin traer filas a RAM) y devuelve el esqueleto de un TableProfile con el perfil base por columna. Corazon del grupo eda: base barata sobre la que otras funciones anaden lo estadistico fino (skew/kurtosis/histograma sobre muestra)."
|
||||
@@ -64,6 +64,7 @@ else:
|
||||
- **`distinct_count` is exact for tables with <=200k rows, approximate and capped above that**: `SUMMARIZE` uses HyperLogLog (`approx_unique`), which can OVERESTIMATE and on small tables may report more distinct values than rows (pushing `unique_pct` above 1.0 and triggering false `possible_id` flags). For that reason, when `n_rows <= 200000` the function computes an EXACT `COUNT(DISTINCT)` in a single combined (cheap) query and uses that value. For larger tables it keeps `approx_unique` but CAPS it at `n_rows` (`distinct_count = min(approx_unique, n_rows)`). In both cases `unique_pct = min(distinct_count / n_rows, 1.0)`, so `distinct_count` never exceeds the row count and `unique_pct` never goes above 1.0. The `possible_id` / `high_cardinality` flags derive from that corrected `distinct_count` (exact and reliable up to 200k rows; approximate and conservative above).
- **`SUMMARIZE` does NOT provide skew, kurtosis or a histogram**, nor fine percentiles (p1/p5/p95/p99), mode, outliers, correlations, key_candidates or quality_score. Those keys are left as `None`/`[]` on purpose: another function in the `eda` group fills them in from a sample. The `numeric` sub-dict only carries min, max, mean, std, p25, p50, p75.
- **`SUMMARIZE.count` is the total row count, not the non-null count**: the function derives the ColumnProfile's non-null `count` as `n_rows - null_count` (with `null_count` rounded from `null_percentage`).
- **`duplicate_rows`/`duplicate_pct` are populated push-down** (since v1.1.0) with `count(*)` over `SELECT DISTINCT *` (without pulling rows into RAM): `duplicate_rows = n_rows - distinct_rows`, `duplicate_pct` as a 0-1 fraction. They enable the record-uniqueness dimension of the dataset score (`profile_table` step 6). If the table has types that `DISTINCT` cannot compare (BLOB/LIST/MAP), the query degrades and both fall back to `None` (which renormalizes the score to `cell_quality` only).
- **min/max/avg/std/q25/q50/q75 arrive as strings** from DuckDB; they are converted to float (None if the column is not numeric).
- **Requires DuckDB 1.5.2** (`SUMMARIZE` columns validated against that version: column_name, column_type, min, max, approx_unique, avg, std, q25, q50, q75, count, null_percentage).
- **The table identifier is interpolated** (it cannot be parameterized in `SUMMARIZE`), so it is validated against `^[A-Za-z_][A-Za-z0-9_]*$` before being quoted. An invalid name (e.g. containing `;` or spaces) returns `{status:'error'}` without touching the database.
|
||||
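The push-down duplicate count and the identifier check described above can be sketched end to end. To stay runnable without DuckDB, this sketch uses the standard-library `sqlite3` as a stand-in engine, which is an assumption of the example and not what the function does; `duplicate_stats` is a hypothetical name, and it raises on an invalid identifier where the real function returns an error dict.

```python
import re
import sqlite3

# Same identifier pattern the docs quote; fullmatch also rejects a trailing newline.
IDENT_RE = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def duplicate_stats(con: sqlite3.Connection, table: str):
    """Return (duplicate_rows, duplicate_pct), both computed inside the database."""
    if not IDENT_RE.fullmatch(table):
        raise ValueError(f"invalid table identifier: {table!r}")
    quoted = f'"{table}"'  # interpolation is acceptable only because of the check above
    n_rows = con.execute(f"SELECT count(*) FROM {quoted}").fetchone()[0]
    if n_rows == 0:
        return None, None
    distinct_rows = con.execute(
        f"SELECT count(*) FROM (SELECT DISTINCT * FROM {quoted})"
    ).fetchone()[0]
    duplicate_rows = max(0, n_rows - distinct_rows)
    return duplicate_rows, duplicate_rows / n_rows  # fraction 0-1


con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (a INTEGER, b VARCHAR)")
con.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [(1, "x"), (2, "y"), (1, "x"), (3, "z"), (2, "y")],
)
print(duplicate_stats(con, "t"))  # 5 rows, 3 distinct
```

Only two scalar counts leave the database, so the cost in client memory does not grow with the table.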
|
||||
@@ -196,6 +196,21 @@ def summarize_table_duckdb(
|
||||
sum(c["null_pct"] for c in columns) / len(columns) if columns else 0.0
|
||||
)
|
||||
|
||||
# Unicidad de registro: filas duplicadas via COUNT de filas distintas
|
||||
# push-down (DISTINCT *), sin traer filas a RAM. Habilita la dimension
|
||||
# de uniqueness del score de dataset (1 - duplicate_pct). Degrada a None
|
||||
# si la tabla tiene tipos no comparables con DISTINCT (BLOB/LIST/MAP).
|
||||
duplicate_rows = None
|
||||
duplicate_pct = None
|
||||
if n_rows > 0:
|
||||
dup_res = duckdb_query_readonly(
|
||||
db_path, f"SELECT count(*) AS c FROM (SELECT DISTINCT * FROM {quoted})"
|
||||
)
|
||||
if dup_res["status"] == "ok" and dup_res["rows"]:
|
||||
distinct_rows = int(dup_res["rows"][0]["c"])
|
||||
duplicate_rows = max(0, n_rows - distinct_rows)
|
||||
duplicate_pct = duplicate_rows / n_rows # fraccion 0-1
|
||||
|
||||
profile = {
|
||||
"table": table,
|
||||
"source": "duckdb",
|
||||
@@ -203,8 +218,8 @@ def summarize_table_duckdb(
|
||||
"n_rows": n_rows,
|
||||
"n_cols": len(columns),
|
||||
"size_bytes": None,
|
||||
"duplicate_rows": None,
|
||||
"duplicate_pct": None,
|
||||
"duplicate_rows": duplicate_rows,
|
||||
"duplicate_pct": duplicate_pct,
|
||||
"constant_cols": constant_cols,
|
||||
"all_null_cols": all_null_cols,
|
||||
"null_cell_pct": null_cell_pct,
|
||||
|
||||
@@ -54,6 +54,30 @@ def test_shape_y_metadatos_tabla(db):
assert profile["correlations"] is None


def test_duplicate_pct_sin_duplicados(db):
"""Table where all rows are distinct: duplicate_pct = 0, not None."""
profile = summarize_table_duckdb(db, "ventas")["profile"]
assert profile["duplicate_rows"] == 0
assert profile["duplicate_pct"] == 0.0


def test_duplicate_pct_con_duplicados(tmp_path):
"""Repeated rows: duplicate_rows/duplicate_pct are populated via push-down."""
path = str(tmp_path / "dups.duckdb")
con = duckdb.connect(path)
con.execute("CREATE TABLE t (a INTEGER, b VARCHAR)")
# 5 rows, 2 of them identical to others -> 2 duplicates out of 5 = 0.4.
con.execute(
"INSERT INTO t VALUES "
"(1,'x'), (2,'y'), (1,'x'), (3,'z'), (2,'y')"
)
con.close()
profile = summarize_table_duckdb(path, "t")["profile"]
assert profile["n_rows"] == 5
assert profile["duplicate_rows"] == 2
assert profile["duplicate_pct"] == 0.4


def test_column_profile_shape(db):
profile = summarize_table_duckdb(db, "ventas")["profile"]
by_name = {c["name"]: c for c in profile["columns"]}
@@ -4,7 +4,7 @@ kind: pipeline
lang: py
domain: pipelines
purity: impure
version: "1.0.0"
version: "1.1.0"
signature: "def profile_table(db_path: str, table: str, backend: str = \"duckdb\", sample: int = 5000, run_models: bool = False, run_llm: bool = False, run_series: bool = False, emit_pdf: bool = False, emit_automatic: bool = False, report_dir: str = \"reports\", write_report: bool = True) -> dict"
description: "One-shot orchestrator for the eda capability group: profiles ONE table (DuckDB or PostgreSQL) end-to-end by composing the group's functions (SQL base profile + read-only sampling + semantic inference + type promotion + numeric/categorical statistics + quality score + correlations with FDR correction + Tukey re-expression + exploratory warnings) and, optionally, cheap models (run_models), LLM interpretation (run_llm), and per-column time-series analysis (run_series: ADF+KPSS stationarity, ACF/PACF, STL, returns). Emits the full TableProfile plus (optionally) a markdown report + JSON sidecar + mobile PDF (emit_pdf). It is the canonical composition for 'run an EDA on this table'."
tags: [eda, duckdb, postgres, profiling, data-quality, pipeline, dataops, timeseries]
@@ -114,3 +114,12 @@ to audit the quality of a table already in production. Replaces orchestrating by hand
Exotic formats may be silently dropped from the numeric computation.
- `db_path` must exist: read-only DuckDB does NOT create the database. Sampling uses the
default sandbox of `duckdb_query_readonly` (no FS/network access).
- **Quality score (report 2046, since v1.1.0).** Step 5: each column gets a
`quality_score` from `column_quality_score` using the 60/40 formula
(completeness/validity); when text is promoted to number/date,
`col["validity_rate"]` (the sample's parse rate) is exposed to feed the validity
dimension. Step 6: the dataset score is NOT the simple mean; it is
`100 * (0.85*cell_quality + 0.15*row_uniqueness)`, where
`cell_quality = mean(score_col/100)` and `row_uniqueness = 1 - duplicate_pct`.
If `duplicate_pct` is `None` (the backend did not compute it) the score is renormalized
to `cell_quality` only. Outliers do NOT lower the score (they go to `observations`).
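As a worked example of the step 6 formula described above, the sketch below computes the dataset score from a list of column scores and a `duplicate_pct`. It is an illustration under stated assumptions: `dataset_quality_score` is a hypothetical standalone name, and the inputs are made-up numbers chosen only to show the arithmetic.

```python
from typing import Iterable, Optional


def dataset_quality_score(
    column_scores: Iterable[Optional[float]],
    duplicate_pct: Optional[float],
) -> Optional[float]:
    """Blend cell quality (85%) with row uniqueness (15%) on a 0-100 scale.

    Hypothetical helper illustrating the formula; column scores are 0-100
    and duplicate_pct is a fraction in 0-1 (or None when unavailable).
    """
    scores = [s for s in column_scores if s is not None]
    if not scores:
        return None
    cell_quality = (sum(scores) / len(scores)) / 100.0
    if duplicate_pct is None:
        # Renormalize to cell_quality only when duplicates were not computed.
        return round(100.0 * cell_quality, 1)
    row_uniqueness = max(0.0, min(1.0, 1.0 - duplicate_pct))
    return round(100.0 * (0.85 * cell_quality + 0.15 * row_uniqueness), 1)


# Two columns scoring 100 and 80, with 40% duplicate rows:
# cell_quality = 0.9, row_uniqueness = 0.6
# score = 100 * (0.85*0.9 + 0.15*0.6) = 100 * 0.855 = 85.5
print(dataset_quality_score([100, 80], 0.4))   # 85.5
print(dataset_quality_score([100, 80], None))  # 90.0
```

With these weights, a table whose cells are perfect but whose rows are 40% duplicates can lose at most 6 points, which reflects the small 15% weight given to record uniqueness.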
@@ -477,9 +477,18 @@ def profile_table(
if vals and (len(ok) / len(vals)) >= _PROMOTE_MIN_PARSE:
col["inferred_type"] = "numeric"
inferred = "numeric"
# Actual parse rate of the sample: feeds the
# validity dimension of column_quality_score (fraction
# of values conforming to the promoted numeric type).
col["validity_rate"] = len(ok) / len(vals)
elif semantic in _DATETIME_SEMANTIC:
col["inferred_type"] = "datetime"
inferred = "datetime"
# Parse rate of the sample to date (same role as the
# numeric parse rate) for the validity dimension.
parsed_dt = [_to_ordinal_days(v) for v in vals]
ok_dt = [d for d in parsed_dt if d is not None]
col["validity_rate"] = (len(ok_dt) / len(vals)) if vals else None

# 4) Enrich according to the final inferred_type.
if inferred == "numeric":
@@ -506,11 +515,36 @@ def profile_table(
# 5) Per-column quality score.
col["quality_score"] = column_quality_score(col).get("score")

# 6) Aggregate table score (mean of the columns).
# 6) Aggregate table score (report 2046): NOT a simple mean.
# cell_quality = mean of the column scores, in [0,1].
# row_uniqueness = 1 - duplicate_pct (record uniqueness).
# score = 100 * (0.85*cell_quality + 0.15*row_uniqueness).
# Renormalizes to cell_quality only if duplicate_pct could not be computed.
scores = [
c["quality_score"] for c in cols if c.get("quality_score") is not None
]
prof["quality_score"] = round(sum(scores) / len(scores), 1) if scores else None
if scores:
cell_quality = (sum(scores) / len(scores)) / 100.0
dup_pct = prof.get("duplicate_pct")
if dup_pct is not None:
try:
d = float(dup_pct)
except (TypeError, ValueError):
d = None
else:
d = None
if d is not None:
# Tolerate a 0-100 scale in case some backend delivers it that way.
if d > 1.0:
d = d / 100.0
row_uniqueness = max(0.0, min(1.0, 1.0 - d))
prof["quality_score"] = round(
100.0 * (0.85 * cell_quality + 0.15 * row_uniqueness), 1
)
else:
prof["quality_score"] = round(100.0 * cell_quality, 1)
else:
prof["quality_score"] = None

# 7) Key candidates.
key_candidates = []
@@ -25,6 +25,7 @@ dependencies = [
"polars>=1.40.1",
"pymeshlab>=2025.7.post1",
"pymssql>=2.3.13",
"pymupdf>=1.28.0",
"pypdf>=6.10.0",
"pyproj>=3.7.2",
"python-docx>=1.2.0",