feat(eda): new data-quality formula (report 2046) + calidad chapter

Implements the report 2046 quality model in the eda group.

Column score: 0.6·completeness + 0.4·validity, renormalized by applicability (when validity cannot be measured, i.e. free text or a 100% null column, the score is based on completeness alone). Validity = actual conformance to the type: native numeric/date/bool = 1.0; text promoted to number/date = parse rate (validity_rate); text with a semantic_type = match_rate; free text = not applicable. Outliers, constant columns and identifiers leave the score and move to an analytical-observations block (they are not quality defects). This removes the double counting of missing data (mostly_null no longer penalizes validity) and the outlier scale bug (outliers no longer enter the score at all).

Dataset score: 100·(0.85·cell_quality + 0.15·row_uniqueness) instead of the simple mean. duplicate_rows/duplicate_pct are now populated via push-down in summarize_table_duckdb (COUNT over DISTINCT *, no RAM) to enable row uniqueness; the score renormalizes to cell_quality alone when uniqueness cannot be computed.

Calidad chapter (v2.0.0): two-dimension intro (60/40) stating that outliers do not lower the score; score table Columna|Calidad|Completitud|Validez (no Consistencia column, n/a when not applicable); TWO separate tables (Problemas de calidad vs Observaciones analíticas); summary with Unicidad de registro; clickable glossary for completitud, validez, unicidad de registro and calidad de datos.

Verified: 123 green tests (automatic_eda + render_automatic_eda + column_quality_score + summarize_table_duckdb + profile_table). Golden EDA of titanic (run_models+run_llm) with the score recomputed by hand, outliers separated into observations and a clickable glossary (5 GOTO links in the PDF).

Versions: column_quality_score v2.0.0, summarize_table_duckdb v1.1.0, profile_table v1.1.0.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
merge: Fase 4a AutomaticEDA motor+glosario (verificado met)
2026-06-30 18:10:23 +02:00 · 2026-06-30 17:45:30 +02:00 · 2026-06-30 17:35:19 +02:00 · 2026-06-30 16:19:52 +02:00
31 changed files with 2895 additions and 435 deletions
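
The two scoring formulas described in the commit message can be sketched as below. This is only an illustration under stated assumptions, not the registry's `column_quality_score`: the function names `column_score` and `dataset_score` are hypothetical, every input is assumed to be a rate in [0, 1], the column score is kept on the unit scale, and `None` stands for "not applicable / not computable".

```python
from typing import Optional

W_COMPLETENESS = 0.6  # weights taken from the commit message
W_VALIDITY = 0.4


def column_score(completeness: float, validity: Optional[float]) -> float:
    """Column score in [0, 1], renormalized when validity is not applicable."""
    if validity is None:
        # Only completeness applies: (0.6 * c) / 0.6 == c.
        return completeness
    return W_COMPLETENESS * completeness + W_VALIDITY * validity


def dataset_score(cell_quality: float, row_uniqueness: Optional[float]) -> float:
    """Dataset score in [0, 100]; falls back to cell_quality alone."""
    if row_uniqueness is None:
        return 100.0 * cell_quality
    return 100.0 * (0.85 * cell_quality + 0.15 * row_uniqueness)
```

With these definitions a free-text column that is 90% complete scores 0.9, while a parsed column with 90% completeness and a 50% parse rate scores 0.6·0.9 + 0.4·0.5 = 0.74.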
@@ -25,7 +25,8 @@ cabecera, y figuras/imágenes se escalan para caber enteras.
 ```
 Document  = list[Chapter]
 Chapter   = { id: str, title: str, version: str, blocks: list[Block] }
-Block     = Heading | Markdown | KVTable | DataTable | Figure | Image | Caption | Note
+Block     = Heading | Markdown | KVTable | DataTable | Figure | Image | Caption
+          | Note | Group | GlossaryEntry
 ```

 Importa el modelo desde `datascience.automatic_eda.model` (o
@@ -44,6 +45,10 @@ reconocido se degrada a `Note`, nunca lanza).
 | `Figure(fig=None, make=None, caption=None, height_in=None)` | an already-built `matplotlib.figure.Figure` (`fig`) or a lazy callable `make()->Figure` | rasterized and scaled to fit whole (never cropped) |
 | `Image(path, caption=None, height_in=None)` | path to a PNG/JPG | scaled to fit whole |
 | `Caption(text)` / `Note(text)` | small auxiliary text | gray caption/note; `Note` is also the fallback for unknown blocks |
+| `Group(blocks, title=None)` | **keep-together** unit: its blocks stay together | the renderer measures the whole group and moves it intact to the next page/slide if it does not fit; it shrinks the figure to leave room for the title and text. See §11 |
+| `GlossaryEntry(key, label, definition)` | a glossary entry (clickable target) | generated by the `glosario` chapter; records its position as the target of the marked terms. See §11 |
+
+`Figure`/`Image` accept `height_in` (a hint): the renderer **clamps** the figure to that maximum height (`Group` uses it to shrink the figure). Every figure is scaled to leave room for its caption on the same page/slide; in PPTX the caption is **always** visible (when no `caption` is given, it falls back to the last heading or to "Figura").

 ### Subset de markdown soportado (`Markdown`)

@@ -84,8 +89,9 @@ El orden canónico está **pre-declarado** en

 ```python
 CHAPTER_ORDER = [
-    "portada", "overview", "num_distr", "cat_distr", "calidad", "correlacion",
-    "modelos", "analisis_llm", "timeseries", "geospatial", "agregacion",
+    "portada", "overview", "analisis_llm", "num_distr", "cat_distr", "calidad",
+    "correlacion", "modelos", "timeseries", "geospatial", "agregacion",
+    "glosario",
 ]
 ```

@@ -95,6 +101,15 @@ CHAPTER_ORDER = [
 `CHAPTER_ORDER`) and it will appear automatically in its position. This lets many
 agents work **in parallel** without contention: each one touches only its own file.

+**Two chapters have a special position** (handled by `build_document`; do not change this):
+
+- `portada`: **built last** (after the body) so it can summarize the analysis, but
+  **placed first**. It receives `ctx['document_summary']` (see §5) with an aggregated
+  summary of the rest. User decision: the cover page reflects the findings.
+- `glosario`: built and **placed last**. It reads the terms that the other chapters
+  registered in `ctx['glossary']` (see §11). If none was registered, the chapter
+  returns `None` and disappears.
+
 If your chapter uses an `<id>` that is not yet in `CHAPTER_ORDER`, add it at the right
 position (the only shared edit; coordinate it with the orchestrator).
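
The "built late, placed by `CHAPTER_ORDER`" rule for `portada` and `glosario` can be pictured with the sketch below. It is not the real `build_document`: `assemble`, the dict-based `Chapter` stand-in and the summary contents are hypothetical, and the relative build order of `portada` versus `glosario` (cover page first, glossary after) is an assumption, since the text only says both are built after the body.

```python
from typing import Callable, Dict, List, Optional

Chapter = dict  # hypothetical stand-in for model.Chapter
Builder = Callable[[dict, dict], Optional[Chapter]]


def assemble(order: List[str], builders: Dict[str, Builder],
             profile: dict, ctx: dict) -> List[Chapter]:
    built: Dict[str, Chapter] = {}

    def build(chapter_id: str) -> None:
        builder = builders.get(chapter_id)
        chapter = builder(profile, ctx) if builder else None
        if chapter is not None:  # a chapter that returns None disappears
            built[chapter_id] = chapter

    # 1) Body chapters, in canonical order.
    for chapter_id in order:
        if chapter_id not in ("portada", "glosario"):
            build(chapter_id)

    # 2) Summary of the body, consumed by the cover page.
    ctx["document_summary"] = {
        "chapter_titles": [built[c]["title"] for c in order if c in built],
    }

    # 3) Cover page after the body, glossary last (assumed sub-order).
    build("portada")
    build("glosario")

    # Placement follows `order`, regardless of the build order.
    return [built[c] for c in order if c in built]
```

The point of the design is that build order and placement order are decoupled: the cover page can read a summary that only exists once the body is built, yet it still comes first in the output.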

@@ -143,6 +158,8 @@ defensivo). Esto habilita el **seguimiento y la mejora continua por capítulo**.
 | `granularity` | the "Cada fila es…" ("each row is…") statement on the cover page. Default: derived from `key_candidates` |
 | `quality_criteria` | criteria of the quality score (cover page) |
 | `head_rows` | `list[dict]` with `df.head` (overview). See §7 |
+| `glossary` | shared `GlossaryCollector`; chapters register terms in it. Created by `build_document`; see §11 |
+| `document_summary` | dict with the aggregated summary of the body (n_rows, n_cols, quality_score, n_numeric, n_categorical, chapter_titles, …). Computed by `build_document` and consumed by the cover page |

 A chapter may define and consume its own `ctx` keys; document which ones in its
 docstring.
@@ -279,6 +296,109 @@ sus bloques presentes y el no-corte (texto largo intacto en la salida). Patrón:

 ---

+## 11. Glossary, keep-together and zebra (engine, phase 4a)
+
+Three cross-cutting engine capabilities that **every** chapter can use. 11.1
+(glossary) requires the chapter to cooperate (register and mark terms); 11.2
+(keep-together) is opt-in per chapter (wrap blocks in a `Group`); 11.3 (zebra) is
+automatic (nothing to do).
+
+### 11.1 Glossary with clickable terms
+
+The glossary is a new chapter (`chapters/glosario.py`) that is **always rendered
+last** and lists every technical term that some chapter registered. Each occurrence
+of the term in the text becomes a **real click** that jumps to its entry: in PDF as an
+internal *link annotation* (post-processed with PyMuPDF, because `PdfPages` does not
+support internal hyperlinks), in PPTX as a native *slide jump* (`ppaction://hlinksldjump`).
+
+**Exact API for a chapter (two steps):**
+
+1. **Register the term** in the shared collector `ctx['glossary']` (a
+   `model.GlossaryCollector`, created by `build_document` and passed to every chapter):
+
+   ```python
+   glossary = ctx.get("glossary")
+   if isinstance(glossary, model.GlossaryCollector):
+       glossary.add("entropia", "Entropía (de Shannon)", "Medida, en bits, de …")
+   ```
+
+   `add(key, label, definition)` is idempotent (the first definition of each `key` wins).
+   `key` must match `[A-Za-z0-9_]+`. If there is no collector in `ctx` (standalone
+   rendering), the chapter simply marks no terms; it degrades without breaking.
+
+2. **Mark each occurrence** in the text of a `Markdown` block with the inline span
+   `[[term:KEY]]visible text[[/term]]`. The visible text may carry `**bold**`. The
+   marker does not alter the visible text (it is stripped like any inline marker); it
+   only makes the span clickable.
+
+   ```python
+   # In cat_distr (real example, already implemented):
+   "La [[term:entropia]]**entropía de Shannon**[[/term]] mide cómo de repartidos…"
+   ```
+
+That is all: the `glosario` chapter collects the terms (alphabetical order by `label`),
+emits one `GlossaryEntry` per term, and the renderers wire the links automatically.
+If no chapter registered any term, the glossary does not appear.
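
To make the collector contract concrete (idempotent `add`, restricted keys, entries sorted by label), here is a minimal sketch. It is not the real `model.GlossaryCollector`: the class name `GlossaryCollectorSketch`, the boolean return of `add`, the silent rejection of invalid keys and the `entries()` method are all assumptions made for illustration.

```python
import re
from typing import Dict, List, Tuple

_KEY_RE = re.compile(r"[A-Za-z0-9_]+")  # key pattern stated in the contract


class GlossaryCollectorSketch:
    """Hypothetical stand-in for model.GlossaryCollector."""

    def __init__(self) -> None:
        self._terms: Dict[str, Tuple[str, str]] = {}

    def add(self, key: str, label: str, definition: str) -> bool:
        """Register a term; return True only if it was newly stored."""
        if not isinstance(key, str) or not _KEY_RE.fullmatch(key):
            return False  # assumed behavior: reject silently, never raise
        if key in self._terms:
            return False  # idempotent: the first definition wins
        self._terms[key] = (label, definition)
        return True

    def entries(self) -> List[Tuple[str, str, str]]:
        """(key, label, definition) tuples in alphabetical order by label."""
        items = [(k, lab, d) for k, (lab, d) in self._terms.items()]
        return sorted(items, key=lambda e: e[1].casefold())
```

Because the first definition wins, two chapters can register the same key without coordinating; only the wording of the earlier one is kept.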
+
+**`text_layout` helpers (do not reimplement):** `parse_inline_rich(text)` →
+`[(text, is_bold, term_key), …]`; `wrap_rich_terms(text, max_chars)` → lines of those
+spans, without cutting them. `strip_inline_md` already removes the `[[term:…]]`/`[[/term]]` markers.
+(The earlier functions `parse_inline_bold` / `wrap_rich` still exist, without term support.)
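
For readers who want to see what the `(text, is_bold, term_key)` span shape looks like in practice, the sketch below parses the marker syntax with regular expressions. It is an illustration only and must not replace the real `text_layout.parse_inline_rich`: the name `parse_inline_rich_sketch` is hypothetical, and it assumes that bold never straddles a term marker and that term spans are not nested.

```python
import re
from typing import List, Optional, Tuple

Span = Tuple[str, bool, Optional[str]]  # (text, is_bold, term_key)

_TERM_RE = re.compile(r"\[\[term:([A-Za-z0-9_]+)\]\](.*?)\[\[/term\]\]", re.DOTALL)
_BOLD_RE = re.compile(r"\*\*(.+?)\*\*", re.DOTALL)


def _split_bold(text: str, term_key: Optional[str]) -> List[Span]:
    """Split text into bold / non-bold spans, all tagged with term_key."""
    spans: List[Span] = []
    pos = 0
    for m in _BOLD_RE.finditer(text):
        if m.start() > pos:
            spans.append((text[pos:m.start()], False, term_key))
        spans.append((m.group(1), True, term_key))
        pos = m.end()
    if pos < len(text):
        spans.append((text[pos:], False, term_key))
    return spans


def parse_inline_rich_sketch(text: str) -> List[Span]:
    """Parse [[term:KEY]]…[[/term]] and **bold** into flat spans."""
    spans: List[Span] = []
    pos = 0
    for m in _TERM_RE.finditer(text):
        spans.extend(_split_bold(text[pos:m.start()], None))
        spans.extend(_split_bold(m.group(2), m.group(1)))
        pos = m.end()
    spans.extend(_split_bold(text[pos:], None))
    return spans
```

Joining the first element of every span gives the visible text with all markers removed, which is the property the contract relies on when it says the marker does not alter what the reader sees.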
+
+**Registry functions that wire the links** (group `eda`, already invoked by the
+renderers; they degrade silently if missing): `add_pdf_internal_links_py_datascience`
+(PyMuPDF, GOTO link) and `pptx_link_run_to_slide_py_datascience` (native slide jump).
+Dependency: `pymupdf` (declared in `python/pyproject.toml`).
+
+**Work for the next phase: hook up more terms.** The mechanism is done and tested
+end to end with `entropia` (in `cat_distr`). Each chapter must register and mark ITS
+terms with the same two-step pattern. Candidates per chapter:
+
+| Chapter | Terms to hook up (suggested key) |
+|---|---|
+| `cat_distr` | `entropia` ✅ (done) |
+| `calidad` | `completitud`, `validez`, `consistencia` |
+| `correlacion` | `cramers_v`, `fdr` (multiple comparisons), the correlation method used |
+| `modelos` | `pca`, `silhouette`, `isolation_forest` |
+| `timeseries` | `estacionariedad`, `acf_pacf`, `stl` |
+| `num_distr` | `iqr`, `curtosis`, `outlier` (Tukey fences) |
+
+Write the definition of each term in its own chapter (a local constant, like
+`_TERM_ENTROPIA_DEF` in `cat_distr`) and mark the term on its first occurrence.
+
+### 11.2 Keep-together: chart next to its title and text (`Group`)
+
+So that a heading does not end up on one page/slide and its figure on the next, wrap
+the blocks of a single idea in a `model.Group`:
+
+```python
+blocks.append(model.Group(blocks=[
+    model.Heading(text=str(name), level=2),
+    model.Figure(make=_figura_perezosa(...), caption="…"),
+    model.Markdown(text="explicación…"),
+]))
+```
+
+The renderer **measures the whole group** before drawing anything: if it does not fit in
+what is left of the page/slide but fits in a full one, it moves it **intact** to the next;
+and it **shrinks the figure** (via `height_in`) just enough for title + text + figure to
+fit together. If the group is taller than a full page, it starts on a new one and flows
+(honest degradation, never cut). Real implemented example: `num_distr` wraps each column
+(heading + histogram/boxplot figure + note) in a `Group`.
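
The placement rule above boils down to two small decisions, sketched here in isolation. This is not the renderer's code: `place_group`, `shrink_figure` and the `min_fig_height` floor are hypothetical names and parameters, and all heights are assumed to be in the same unit (for example inches, as in `height_in`).

```python
def place_group(group_height: float, remaining: float, page_height: float) -> str:
    """Decide where a keep-together group starts."""
    if group_height <= remaining:
        return "here"                # fits in what is left of the page
    if group_height <= page_height:
        return "next_page"           # move the whole group, intact
    return "new_page_and_flow"       # taller than a page: start fresh and flow


def shrink_figure(fig_height: float, other_height: float, page_height: float,
                  min_fig_height: float = 1.0) -> float:
    """Clamp the figure so that title + text + figure fit on one page.

    other_height is the measured height of everything in the group that is
    not the figure. The floor (min_fig_height) is an assumption that keeps
    the figure legible when the text alone nearly fills the page.
    """
    available = page_height - other_height
    return max(min_fig_height, min(fig_height, available))
```

Measuring before drawing is what makes the move atomic: nothing from the group is emitted until its final starting position is known.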
+
+Recommended for `agregacion` and any chapter where a figure must stay attached to its
+title/explanation. Cost: if a chapter inspects `chapter.blocks` in its tests, it will now
+find `Group`s; flatten them with a recursive helper (see `num_distr_test.py::_flatten`).
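
A recursive flattening helper of the kind mentioned above might look like the following. The `Heading` and `Group` dataclasses here are simplified, hypothetical stand-ins for the real `model` classes, and `flatten` is not claimed to match `num_distr_test.py::_flatten` line for line.

```python
from dataclasses import dataclass, field
from typing import Any, List


@dataclass
class Heading:  # hypothetical stand-in for model.Heading
    text: str


@dataclass
class Group:  # hypothetical stand-in for model.Group
    blocks: List[Any] = field(default_factory=list)


def flatten(blocks: List[Any]) -> List[Any]:
    """Return the leaf blocks in document order, expanding nested Groups."""
    flat: List[Any] = []
    for block in blocks:
        if isinstance(block, Group):
            flat.extend(flatten(block.blocks))
        else:
            flat.append(block)
    return flat
```

A test that used to assert on `chapter.blocks` directly can then assert on `flatten(chapter.blocks)` and stay agnostic about how blocks are grouped.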
+
+### 11.3 Zebra striping in tables (automatic)
+
+Every `DataTable` is rendered with **shaded even rows** (very light gray `#f6f8fa`) and a
+header with its own background. It is automatic in PDF and PPTX; the pattern stays consistent
+when a long table is split and repeats its header (the row index is logical, not per
+page). Nothing to do in the chapters.
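
The "logical, not per page" row index is easiest to see in a small sketch. The helper names `row_fill` and `paginate`, the white base color and the choice of which parity counts as "even" (here: the second, fourth, … body rows) are assumptions; only the shade `#f6f8fa` comes from the text.

```python
from typing import Iterator, List, Tuple

SHADE = "#f6f8fa"  # from the contract text
BASE = "#ffffff"   # assumed background for unshaded rows


def row_fill(logical_index: int) -> str:
    """Fill for a body row, given its 0-based index in the WHOLE table."""
    return SHADE if logical_index % 2 == 1 else BASE


def paginate(rows: List[str], rows_per_page: int) -> Iterator[Tuple[int, str, str]]:
    """Yield (page, row, fill); the index keeps counting across pages."""
    for i, row in enumerate(rows):
        yield i // rows_per_page, row, row_fill(i)
```

With three rows per page, the first row of the second page is the fourth row overall and is therefore shaded; restarting the index on each page would break the alternation at every page boundary.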
+
+---
+
 ## 10. Future integration with `profile_table` (next phase)

 `profile_table(emit_pdf=True)` currently uses `render_eda_pdf` (untouched). In the next phase
@@ -68,11 +68,13 @@ from .extract_timeseries_raw import extract_timeseries_raw
 from .build_eda_render_ctx import build_eda_render_ctx
 from .profile_datetime import profile_datetime
 from .resample_timeseries import resample_timeseries
+from .add_pdf_internal_links import add_pdf_internal_links

 __all__ = [
    "detect_time_column",
    "extract_timeseries_raw",
    "build_eda_render_ctx",
+    "add_pdf_internal_links",
    "profile_datetime",
    "resample_timeseries",
    "render_automatic_eda_pdf",
@@ -0,0 +1,85 @@
+---
+name: add_pdf_internal_links
+kind: function
+lang: py
+domain: datascience
+version: "1.0.0"
+purity: impure
+signature: "def add_pdf_internal_links(pdf_path: str, links: list) -> dict"
+description: "Postprocesa un PDF YA escrito insertando link annotations internos de tipo GOTO ('ir a') con PyMuPDF (import fitz). Pensado para PDFs generados por matplotlib PdfPages, que NO soporta hyperlinks internos: tras escribir el PDF se reabre y, por cada entrada de `links`, se añade una anotacion clicable desde un rectangulo de una pagina origen (src_page + src_rect en puntos top-left) hasta un punto de una pagina destino (dst_page + dst_point). Caso de uso tipico del grupo eda: hacer clicables los terminos de un AutomaticEDA que apuntan a su entrada en el glosario al final del documento. Estilo dict-no-throw: NUNCA lanza; valida cada link y SALTA (n_skipped++) los malformados o fuera de rango en vez de fallar. Guarda de forma segura escribiendo a un temporal en el mismo directorio y haciendo os.replace atomico (evita corromper el original). Devuelve {status:ok,n_links,n_skipped} o {status:error,error}; si pymupdf no esta disponible o el archivo no existe devuelve status error."
+tags: [eda, datascience, pdf, links, glossary, pymupdf, fitz, postprocess, python]
+uses_functions: []
+uses_types: []
+returns: []
+returns_optional: false
+error_type: "error_go_core"
+imports: []
+params:
+  - name: pdf_path
+    desc: "ruta al PDF existente (str no vacio). Se reescribe IN SITU (in-place) tras añadir los links: se guarda a un temporal `.<base>.tmp_links` en el mismo directorio y se reemplaza atomicamente con os.replace. Si no es str o no existe el archivo -> {status:error}."
+  - name: links
+    desc: "lista de dicts, uno por link a insertar. Cada dict: src_page (int 0-based de la pagina origen), src_rect ([x0,y0,x1,y1] del rectangulo clicable en PUNTOS PDF 1/72\" con origen ARRIBA-IZQUIERDA), dst_page (int 0-based de la pagina destino), dst_point ([x,y] punto destino, mismos puntos top-left). Las entradas que no son dict, con page fuera de rango [0,page_count), src_rect que no tenga 4 numeros o dst_point que no tenga 2 numeros se SALTAN (n_skipped++), no lanzan. None se trata como lista vacia."
+output: "dict (NUNCA lanza): en exito {\"status\":\"ok\",\"n_links\":int,\"n_skipped\":int} con n_links = anotaciones GOTO insertadas y n_skipped = entradas invalidas saltadas. En fallo {\"status\":\"error\",\"error\":str}: pymupdf no disponible, pdf_path no es str / no existe, links no es lista, o cualquier excepcion global (el PDF original queda intacto porque el replace solo ocurre tras un save correcto)."
+tested: true
+tests: ["test_add_goto_link_basico", "test_links_invalidos_se_saltan", "test_archivo_inexistente_devuelve_error"]
+test_file_path: "python/functions/datascience/add_pdf_internal_links_test.py"
+file_path: "python/functions/datascience/add_pdf_internal_links.py"
+---
+
+## Example
+
+```python
+import sys, os
+sys.path.insert(0, os.path.join("python", "functions"))
+from datascience import add_pdf_internal_links
+
+# You have a PDF already written by matplotlib PdfPages (no internal hyperlinks).
+# You want the text "Margen bruto" on page 0 (rectangle in top-left points)
+# to jump to its glossary entry on the last page (index 7).
+res = add_pdf_internal_links(
+    "reports/eda.pdf",
+    [
+        {"src_page": 0, "src_rect": [72, 120, 180, 134], "dst_page": 7, "dst_point": [72, 200]},
+        {"src_page": 0, "src_rect": [72, 140, 180, 154], "dst_page": 7, "dst_point": [72, 260]},
+    ],
+)
+# res == {"status": "ok", "n_links": 2, "n_skipped": 0}
+```
+
+## When to use it
+
+Right AFTER writing a PDF with matplotlib `PdfPages` (or any engine that does
+not generate internal hyperlinks), when certain terms or references need to be
+clickable and jump to another page of the same document. The canonical case is
+linking the terms of an AutomaticEDA to their glossary entry at the end. It is a
+post-processing step: first generate the PDF and compute the rectangle where
+each term landed (in PDF points), then pass that list to this function to
+inject the GOTO annotations.
+
+## Gotchas
+
+- **Impure: rewrites the file IN PLACE.** The PDF at `pdf_path` is replaced
+  by the version with the links. Saving is safe: it writes to a temporary
+  `.<base>.tmp_links` in the SAME directory and does an atomic `os.replace` after
+  closing the document, so a mid-way failure does not corrupt the original. Even
+  so, keep a copy if the PDF is valuable.
+- **Coordinate system: points with a top-left origin.** `src_rect` and
+  `dst_point` are in PDF POINTS (1/72") with the origin at the TOP-LEFT of the
+  page, which is the convention PyMuPDF uses. Matplotlib's figure and display
+  coordinates (like its data space) put the origin at the BOTTOM-left, so a y
+  measured that way must be flipped (`y_top = page_height - y_bottom`) before
+  it is passed in. If the renderer already lays out the page from the top, in
+  points, the rectangle is passed as is.
+- **0-based page indices.** `src_page` / `dst_page` are 0-based indices
+  (the first page is 0). Outside the range `[0, page_count)` the link is SKIPPED
+  (counted in `n_skipped`); it does not raise.
+- **dict-no-throw, per-link validation.** Malformed entries (not a dict,
+  non-integer or out-of-range page, `src_rect` without 4 numbers, `dst_point`
+  without 2 numbers) are skipped individually and increment `n_skipped`; the
+  remaining valid links are still inserted. The function returns
+  `{status:error}` only on global failures (pymupdf missing, file not found,
+  `links` is not a list).
+- **`error_type: error_go_core` is registry metadata, not behavior.**
+  Every impure function must declare it and the indexer requires it, but the code
+  NEVER raises that exception: it degrades to the status dict.
+- **Requires PyMuPDF (`import fitz`).** If it is not installed it returns
+  `{"status":"error","error":"pymupdf no disponible: ..."}`. In the registry, the
+  `python/.venv` venv already ships it.
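
Since malformed links are skipped rather than reported, a caller may want to know in advance which entries would be dropped. The sketch below mirrors the per-link rules listed above in pure Python, without touching the PDF. `preflight_links` is a hypothetical helper, not part of the registry, and it cannot predict links that PyMuPDF itself would reject at insertion time.

```python
from typing import Any, List, Tuple


def _is_int(value: Any) -> bool:
    """True for ints, excluding bool."""
    return isinstance(value, int) and not isinstance(value, bool)


def _is_num_seq(value: Any, length: int) -> bool:
    """True for a list/tuple of `length` ints or floats (no bools)."""
    if not isinstance(value, (list, tuple)) or len(value) != length:
        return False
    return all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in value
    )


def preflight_links(links: Any, page_count: int) -> Tuple[List[Any], List[Any]]:
    """Split links into (valid, skipped) using the documented rules."""
    valid: List[Any] = []
    skipped: List[Any] = []
    for link in links or []:
        ok = (
            isinstance(link, dict)
            and _is_int(link.get("src_page"))
            and _is_int(link.get("dst_page"))
            and 0 <= link["src_page"] < page_count
            and 0 <= link["dst_page"] < page_count
            and _is_num_seq(link.get("src_rect"), 4)
            and _is_num_seq(link.get("dst_point"), 2)
        )
        (valid if ok else skipped).append(link)
    return valid, skipped
```

Logging `skipped` before calling the real function turns a silent `n_skipped` count into something a developer can act on.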
@@ -0,0 +1,132 @@
+"""Postprocesa un PDF existente insertando link annotations internos (GOTO).
+
+Motor: PyMuPDF (``import fitz``). Pensado para PDFs generados por matplotlib
+``PdfPages``, que no soporta hyperlinks internos: tras escribir el PDF, esta
+funcion lo reabre y le añade anotaciones "ir a" (GOTO) desde un rectangulo de
+una pagina origen hasta un punto de una pagina destino. Util para hacer
+clicables terminos que apuntan a su entrada en un glosario al final del
+documento.
+
+Estilo dict-no-throw del grupo `eda`: NUNCA lanza; devuelve un dict de estado.
+"""
+
+import os
+
+
+def add_pdf_internal_links(pdf_path: str, links: list) -> dict:
+    """Añade link annotations internos (GOTO) a un PDF ya escrito.
+
+    Postprocesa un PDF (p.ej. generado por matplotlib PdfPages, que NO soporta
+    hyperlinks internos) insertando, por cada entrada de ``links``, una
+    anotacion de tipo "ir a" desde un rectangulo de una pagina origen hasta un
+    punto de una pagina destino. Sirve para hacer clicables terminos que apuntan
+    a su entrada en un glosario al final del documento.
+
+    Args:
+        pdf_path: ruta al PDF existente (se reescribe in situ).
+        links: lista de dicts, cada uno:
+            {
+              "src_page": int,            # indice 0-based de la pagina origen
+              "src_rect": [x0,y0,x1,y1],  # rectangulo clicable, en PUNTOS PDF
+                                          # (1/72") con origen ARRIBA-IZQUIERDA
+              "dst_page": int,            # indice 0-based de la pagina destino
+              "dst_point": [x, y],        # punto destino, mismos puntos top-left
+            }
+
+    Returns:
+        dict (NUNCA lanza): {"status":"ok","n_links":int,"n_skipped":int}
+        o {"status":"error","error":str}. Si pymupdf no esta disponible o el
+        archivo no existe -> {"status":"error", ...}.
+    """
+    try:
+        try:
+            import fitz  # PyMuPDF
+        except Exception as exc:  # ImportError u otro fallo de carga
+            return {"status": "error", "error": f"pymupdf no disponible: {exc}"}
+
+        if not isinstance(pdf_path, str) or not pdf_path:
+            return {"status": "error", "error": "pdf_path debe ser una ruta no vacia"}
+        if not os.path.isfile(pdf_path):
+            return {"status": "error", "error": f"el archivo no existe: {pdf_path}"}
+
+        if links is None:
+            links = []
+        if not isinstance(links, (list, tuple)):
+            return {"status": "error", "error": "links debe ser una lista de dicts"}
+
+        doc = fitz.open(pdf_path)
+        try:
+            n_pages = doc.page_count
+            n_ok = 0
+            n_skipped = 0
+
+            for link in links:
+                if not isinstance(link, dict):
+                    n_skipped += 1
+                    continue
+
+                src_page = link.get("src_page")
+                dst_page = link.get("dst_page")
+                src_rect = link.get("src_rect")
+                dst_point = link.get("dst_point")
+
+                # src_page / dst_page: enteros 0-based en rango.
+                if not _is_int(src_page) or not _is_int(dst_page):
+                    n_skipped += 1
+                    continue
+                if not (0 <= src_page < n_pages) or not (0 <= dst_page < n_pages):
+                    n_skipped += 1
+                    continue
+
+                # src_rect: 4 numeros.
+                if not _is_num_seq(src_rect, 4):
+                    n_skipped += 1
+                    continue
+                # dst_point: 2 numeros.
+                if not _is_num_seq(dst_point, 2):
+                    n_skipped += 1
+                    continue
+
+                try:
+                    doc[int(src_page)].insert_link(
+                        {
+                            "kind": fitz.LINK_GOTO,
+                            "from": fitz.Rect(*[float(v) for v in src_rect]),
+                            "page": int(dst_page),
+                            "to": fitz.Point(*[float(v) for v in dst_point]),
+                        }
+                    )
+                    n_ok += 1
+                except Exception:
+                    n_skipped += 1
+                    continue
+
+            # Guardado seguro: escribir a temporal en el mismo directorio y
+            # reemplazar atomicamente (evita corromper el PDF original).
+            directory = os.path.dirname(os.path.abspath(pdf_path)) or "."
+            base = os.path.basename(pdf_path)
+            tmp_path = os.path.join(directory, f".{base}.tmp_links")
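+            try:
+                doc.save(tmp_path)
+            except Exception:
+                # Do not leave a half-written temp file behind if saving fails;
+                # the outer handler turns the re-raised error into a status dict.
+                if os.path.exists(tmp_path):
+                    os.remove(tmp_path)
+                raise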
+        finally:
+            doc.close()
+
+        os.replace(tmp_path, pdf_path)
+
+        return {"status": "ok", "n_links": n_ok, "n_skipped": n_skipped}
+    except Exception as exc:  # degrada cualquier fallo a dict de error
+        return {"status": "error", "error": str(exc)}
+
+
+def _is_int(value) -> bool:
+    """True si value es un entero (no bool)."""
+    return isinstance(value, int) and not isinstance(value, bool)
+
+
+def _is_num_seq(value, length: int) -> bool:
+    """True si value es una secuencia de `length` numeros (int/float, no bool)."""
+    if not isinstance(value, (list, tuple)) or len(value) != length:
+        return False
+    for v in value:
+        if isinstance(v, bool) or not isinstance(v, (int, float)):
+            return False
+    return True
@@ -0,0 +1,77 @@
+"""Tests para add_pdf_internal_links."""
+
+import os
+import sys
+
+import pytest
+
+sys.path.insert(0, os.path.dirname(__file__))
+
+from add_pdf_internal_links import add_pdf_internal_links
+
+
+def test_add_goto_link_basico(tmp_path):
+    """Golden: un PDF de 2 paginas recibe un link GOTO de la pag 0 a la pag 1."""
+    fitz = pytest.importorskip("fitz")
+
+    # 1) PDF temporal de 2 paginas A5 (~419x595 puntos).
+    pdf = str(tmp_path / "doc.pdf")
+    doc = fitz.open()
+    doc.new_page(width=419, height=595)
+    doc.new_page(width=419, height=595)
+    doc.save(pdf)
+    doc.close()
+
+    # 2) Insertar un link interno desde la pag 0 hacia la pag 1.
+    res = add_pdf_internal_links(
+        pdf,
+        [{"src_page": 0, "src_rect": [50, 50, 200, 70], "dst_page": 1, "dst_point": [40, 40]}],
+    )
+    assert res["status"] == "ok"
+    assert res["n_links"] == 1
+    assert res["n_skipped"] == 0
+
+    # 3) Reabrir y verificar que la pag 0 tiene un link GOTO a la pag 1.
+    doc = fitz.open(pdf)
+    try:
+        links = doc[0].get_links()
+        goto = [lk for lk in links if lk.get("kind") == fitz.LINK_GOTO and lk.get("page") == 1]
+        assert len(goto) >= 1
+    finally:
+        doc.close()
+
+
+def test_links_invalidos_se_saltan(tmp_path):
+    """Edge: entradas malformadas o fuera de rango incrementan n_skipped, no lanzan."""
+    fitz = pytest.importorskip("fitz")
+
+    pdf = str(tmp_path / "doc.pdf")
+    doc = fitz.open()
+    doc.new_page(width=419, height=595)
+    doc.new_page(width=419, height=595)
+    doc.save(pdf)
+    doc.close()
+
+    res = add_pdf_internal_links(
+        pdf,
+        [
+            # valido
+            {"src_page": 0, "src_rect": [10, 10, 90, 30], "dst_page": 1, "dst_point": [20, 20]},
+            # dst_page fuera de rango
+            {"src_page": 0, "src_rect": [10, 40, 90, 60], "dst_page": 9, "dst_point": [20, 20]},
+            # src_rect con 3 numeros
+            {"src_page": 0, "src_rect": [10, 70, 90], "dst_page": 1, "dst_point": [20, 20]},
+            # no es dict
+            "no-soy-un-dict",
+        ],
+    )
+    assert res["status"] == "ok"
+    assert res["n_links"] == 1
+    assert res["n_skipped"] == 3
+
+
+def test_archivo_inexistente_devuelve_error():
+    """Error path: pdf_path inexistente -> status error sin lanzar."""
+    res = add_pdf_internal_links("/ruta/que/no/existe_xyz.pdf", [])
+    assert res["status"] == "error"
+    assert "error" in res
@@ -21,6 +21,9 @@ from .model import (  # noqa: F401
    Chapter,
    DataTable,
    Figure,
+    GlossaryCollector,
+    GlossaryEntry,
+    Group,
    Heading,
    Image,
    KVTable,
@@ -45,6 +48,9 @@ __all__ = [
    "Image",
    "Caption",
    "Note",
+    "Group",
+    "GlossaryEntry",
+    "GlossaryCollector",
    "Chapter",
    "as_blocks",
    "as_chapters",
@@ -1,22 +1,26 @@
 """Data-quality chapter (CALIDAD) for AutomaticEDA.

 Builds the quality chapter from a ``TableProfile`` of the ``eda`` group. The
-chapter answers, in Spanish and as tables, the three things the user asked for:
+chapter implements the quality model of report 2046:

-1. **En qué se basa la calidad** — an intro paragraph explaining the criteria and
-   their weights (completeness, validity, consistency) before any number, plus a
-   table-level summary (global score and aggregates).
+1. **En qué se basa la calidad** — an intro paragraph explaining the two scored
+   dimensions and their weights (completitud 60%, validez 40%) plus the
+   table-level row uniqueness, BEFORE any number, and stating explicitly that
+   outliers are reported as observations and do **not** lower the score. The
+   criteria terms (calidad de datos, completitud, validez, unicidad de registro)
+   are hooked into the shared glossary as clickable jumps.
 2. **Scores por columna** — a table with, per column, the total quality score and
-   its breakdown into completeness / validity / consistency.
-3. **Problemas en español** — a second table listing, per column, the readable
-   issues in Spanish (kept separate from the type ``flags``).
+   its breakdown into completeness / validity (no consistency dimension).
+3. **Problemas de calidad** — a table listing ONLY real quality defects
+   (nulls, empty cells, values not conforming to their type/semantics).
+4. **Observaciones analíticas** — a SEPARATE table for outliers, constant
+   columns, high-cardinality ids and strong skew, with an explicit note that
+   these do not affect the score.

-The breakdown and the issues are NOT recomputed here: they come from the registry
-function ``column_quality_score`` (group ``eda``), which already derives
-``{score, completeness, validity, consistency, issues}`` from the ColumnProfile.
-This chapter is render-only — it consumes that function and lays the result out
-as model blocks; the renderers paginate tables (splitting by rows, repeating the
-header) and wrap long cells so nothing is ever cut.
+The breakdown, issues and observations are NOT recomputed here: they come from
+the registry function ``column_quality_score`` (group ``eda``), which derives
+``{score, completeness, validity, dimensions, applicable, issues,
+observations}`` from the ColumnProfile. This chapter is render-only.

 Contract: build_<id>(profile, ctx) -> Chapter | None ; CHAPTER_VERSION = "x.y.z".
 """
@@ -33,28 +37,47 @@ try:  # pragma: no cover - import wiring
 except Exception:  # noqa: BLE001 - never let an import error abort the document.
    _column_quality_score = None

-CHAPTER_VERSION = "1.0.0"
+CHAPTER_VERSION = "2.0.0"
 CHAPTER_ID = "calidad"
 CHAPTER_TITLE = "Calidad"

-# Weights mirror column_quality_score: completeness 0.5, validity 0.3,
-# consistency 0.2. Kept here only to render the human explanation; the actual
-# numbers always come from the function so the two never drift in computation.
-_CRITERIA_INTRO = (
-    "La calidad de cada columna es un score de 0 a 100 que combina tres "
-    "criterios, cada uno con un peso:\n\n"
-    "- **Completitud (peso 50%)**: proporción de valores presentes (sin nulos "
-    "ni vacíos). Una columna con muchos nulos baja de score.\n"
-    "- **Validez (peso 30%)**: los valores son coherentes con su tipo y rango "
-    "esperado (penaliza outliers y semánticas declaradas que no coinciden).\n"
-    "- **Consistencia (peso 20%)**: la columna aporta información útil (penaliza "
-    "columnas constantes o identificadores de cardinalidad muy alta).\n\n"
-    "Score = 100 × (0,5·completitud + 0,3·validez + 0,2·consistencia). "
-    "Los problemas detectados por columna se listan en español más abajo."
-)
+# Glossary terms this chapter explains (report 2046 §6). Registered in the shared
+# collector and marked clickable on their first appearance (contract §11.1).
+_TERMS = {
+    "calidad_datos": (
+        "Calidad de datos (score 0-100)",
+        "Mide hasta qué punto los datos están presentes y son utilizables tal "
+        "cual, no si son «buenos para el análisis». Se compone solo de "
+        "dimensiones medibles automáticamente desde el perfil de la tabla, sin "
+        "fuente externa de verdad: completitud (60%), validez (40%, cuando es "
+        "medible) y, a nivel de tabla, unicidad de registro. Los valores "
+        "atípicos NO bajan la calidad: se listan aparte como observaciones.",
+    ),
+    "completitud": (
+        "Completitud",
+        "Proporción de valores realmente presentes en una columna (1 − % de "
+        "nulos; en texto, las celdas vacías también cuentan como faltantes). Los "
+        "nulos y vacíos bajan el score porque falta información que debería "
+        "estar. Pesa el 60% del score de columna.",
+    ),
+    "validez": (
+        "Validez",
+        "Proporción de valores que encajan con su tipo o formato esperado: un "
+        "número que parsea, una fecha legible, un email con forma de email. Los "
+        "valores que no parsean a su tipo bajan el score. Si la columna es texto "
+        "libre sin formato esperado, la validez no se puede medir y el score se "
+        "basa solo en la completitud. Pesa el 40% del score cuando es medible.",
+    ),
+    "unicidad_registro": (
+        "Unicidad de registro",
+        "A nivel de tabla, las filas duplicadas restan calidad al conjunto "
+        "(1 − % de filas duplicadas). Es distinta de que una columna no-clave "
+        "repita valores, que no es un defecto de calidad.",
+    ),
+}

-# Cap for the joined issues cell so a single row never grows taller than a page;
-# the remainder is summarized as "(+N más)" instead of being silently dropped.
+# Cap for the joined cell so a single row never grows taller than a page; the
+# remainder is summarized as "(+N más)" instead of being silently dropped.
 _ISSUES_MAXLEN = 160


@@ -82,12 +105,19 @@ def _fmt_unit_pct(value) -> str:
        return str(value)


+def _fmt_validity(value) -> str:
+    """Validity is ``None`` when not applicable: show ``n/a`` not a fake 0%."""
+    if value is None:
+        return "n/a"
+    return _fmt_unit_pct(value)
+
+
 def _quality_of(col: dict) -> dict:
-    """Return ``{score, completeness, validity, consistency, issues}`` for a column.
+    """Return the quality dict for a column.

    Uses the registry ``column_quality_score`` when available; otherwise falls
    back to the per-column ``quality_score`` already in the profile (number only,
-    empty breakdown/issues). Never raises.
+    empty breakdown/issues/observations). Never raises.
    """
    if not isinstance(col, dict):
        col = {}
@@ -98,26 +128,25 @@ def _quality_of(col: dict) -> dict:
                return res
        except Exception:  # noqa: BLE001 - degrade instead of aborting.
            pass
-    # Fallback: only the final score is available pre-computed in the profile.
    return {
        "score": col.get("quality_score"),
        "completeness": None,
        "validity": None,
-        "consistency": None,
        "issues": [],
+        "observations": [],
    }


-def _join_issues(issues) -> str:
-    """Join Spanish issue strings into one cell, truncating overly long lists.
+def _join_cells(items) -> str:
+    """Join Spanish strings into one cell, truncating overly long lists.

-    The renderer wraps cell text, but a column with many long issues could make a
-    single row taller than a whole page; cap the length and append ``(+N más)``
-    so the count of hidden issues is honest rather than silently lost.
+    The renderer wraps cell text, but a column with many long entries could make
+    a single row taller than a whole page; cap the length and append ``(+N más)``
+    so the count of hidden entries is honest rather than silently lost.
    """
-    if not isinstance(issues, (list, tuple)) or not issues:
+    if not isinstance(items, (list, tuple)) or not items:
        return ""
-    parts = [model._safe_str(i).strip() for i in issues]
+    parts = [model._safe_str(i).strip() for i in items]
    parts = [p for p in parts if p]
    if not parts:
        return ""
@@ -142,6 +171,33 @@ def _columns_with_quality(profile: dict):
            yield c, _quality_of(c)


+def _fmt_unit_pct_or_pct(value) -> str:
+    """Format a value that may be a 0-1 fraction or an already-0-100 percentage."""
+    try:
+        num = float(value)
+    except (TypeError, ValueError):
+        return model._safe_str(value)
+    if num != num:  # NaN
+        return "—"
+    pct = num * 100 if num <= 1.0 else num
+    text = f"{pct:.1f}".rstrip("0").rstrip(".")
+    return f"{text}%"
+
+
+def _row_uniqueness(profile: dict):
+    """Return row uniqueness (1 - duplicate_pct) in [0,1], or None if unknown."""
+    dup = profile.get("duplicate_pct")
+    if dup is None:
+        return None
+    try:
+        d = float(dup)
+    except (TypeError, ValueError):
+        return None
+    if d != d:  # NaN: unknown, not 100% unique (0-100 scale tolerated below)
+        return None
+    return max(0.0, min(1.0, 1.0 - (d / 100.0 if d > 1.0 else d)))
+
+
 def _summary_block(profile: dict, evaluated: list):
    """Table-level KVTable: global score and quality aggregates."""
    rows = []
@@ -153,14 +209,15 @@ def _summary_block(profile: dict, evaluated: list):
             if isinstance(q.get("completeness"), (int, float))]
    vals = [q.get("validity") for _, q in evaluated
            if isinstance(q.get("validity"), (int, float))]
-    cons = [q.get("consistency") for _, q in evaluated
-            if isinstance(q.get("consistency"), (int, float))]
    if comps:
        rows.append(("Completitud media", _fmt_unit_pct(sum(comps) / len(comps))))
    if vals:
-        rows.append(("Validez media", _fmt_unit_pct(sum(vals) / len(vals))))
-    if cons:
-        rows.append(("Consistencia media", _fmt_unit_pct(sum(cons) / len(cons))))
+        rows.append(("Validez media (donde aplica)",
+                     _fmt_unit_pct(sum(vals) / len(vals))))
+
+    ru = _row_uniqueness(profile)
+    if ru is not None:
+        rows.append(("Unicidad de registro", _fmt_unit_pct(ru)))

    n_problem = sum(1 for _, q in evaluated if q.get("issues"))
    rows.append(("Columnas con problemas", str(n_problem)))
@@ -182,22 +239,9 @@ def _summary_block(profile: dict, evaluated: list):
    return model.KVTable(rows=rows, title="Resumen de calidad")


-def _fmt_unit_pct_or_pct(value) -> str:
-    """Format a value that may be a 0-1 fraction or an already-0-100 percentage."""
-    try:
-        num = float(value)
-    except (TypeError, ValueError):
-        return model._safe_str(value)
-    if num != num:  # NaN
-        return "—"
-    pct = num * 100 if num <= 1.0 else num
-    text = f"{pct:.1f}".rstrip("0").rstrip(".")
-    return f"{text}%"
-
-
 def _scores_block(evaluated: list):
-    """DataTable with per-column score and its three-criteria breakdown."""
-    header = ["Columna", "Calidad", "Completitud", "Validez", "Consistencia"]
+    """DataTable with per-column score and its completeness/validity breakdown."""
+    header = ["Columna", "Calidad", "Completitud", "Validez"]
    rows = []
    # Worst columns first so the reader sees the problems at the top.
    ordered = sorted(
@@ -210,22 +254,22 @@ def _scores_block(evaluated: list):
            col.get("name") or "(col)",
            _fmt_score(q.get("score")),
            _fmt_unit_pct(q.get("completeness")),
-            _fmt_unit_pct(q.get("validity")),
-            _fmt_unit_pct(q.get("consistency")),
+            _fmt_validity(q.get("validity")),
        ])
    if not rows:
        return None
    return model.DataTable(header=header, rows=rows,
                           title="Scores de calidad por columna",
-                           note="0 = peor, 100 = mejor; ordenado de peor a mejor")
+                           note="0 = peor, 100 = mejor; «n/a» = dimensión no "
+                                "medible; ordenado de peor a mejor")


 def _issues_block(evaluated: list):
-    """DataTable listing Spanish issues per column, or a Note when there are none."""
-    header = ["Columna", "Problemas detectados (español)"]
+    """DataTable listing ONLY real quality defects per column, or a Note."""
+    header = ["Columna", "Problemas de calidad (español)"]
    rows = []
    for col, q in evaluated:
-        joined = _join_issues(q.get("issues"))
+        joined = _join_cells(q.get("issues"))
        if joined:
            rows.append([col.get("name") or "(col)", joined])
    if not rows:
@@ -235,6 +279,63 @@ def _issues_block(evaluated: list):
                           title="Problemas de calidad por columna")


+def _observations_block(evaluated: list):
+    """DataTable listing analytical observations per column, or None.
+
+    Observations (outliers, constant columns, ids, strong skew) are NOT quality
+    defects: they do not affect the score. Returned as a separate table from the
+    issues so the report never presents a legitimate outlier as a problem.
+    """
+    header = ["Columna", "Observaciones analíticas"]
+    rows = []
+    for col, q in evaluated:
+        joined = _join_cells(q.get("observations"))
+        if joined:
+            rows.append([col.get("name") or "(col)", joined])
+    if not rows:
+        return None
+    return model.DataTable(
+        header=header, rows=rows,
+        title="Observaciones analíticas por columna",
+        note="No son defectos de calidad y NO afectan al score; orientan el "
+             "análisis (atípicos, columnas constantes, identificadores).")
+
+
+def _term(key: str, label: str, mark: bool) -> str:
+    """Render a term as a clickable glossary span when marking is enabled."""
+    if mark:
+        return f"[[term:{key}]]**{label}**[[/term]]"
+    return f"**{label}**"
+
+
+def _criteria_intro(mark: bool) -> str:
+    """Intro paragraph explaining the two scored dimensions and the principle."""
+    calidad = _term("calidad_datos", "calidad de datos", mark)
+    completitud = _term("completitud", "Completitud (peso 60%)", mark)
+    validez = _term("validez", "Validez (peso 40%, cuando es medible)", mark)
+    unicidad = _term("unicidad_registro", "unicidad de registro", mark)
+    return (
+        f"La {calidad} de cada columna es un score de 0 a 100 que combina solo "
+        "dimensiones medibles desde el perfil de la tabla, sin fuente externa "
+        "de verdad:\n\n"
+        f"- {completitud}: proporción de valores presentes (1 − % de nulos; en "
+        "texto, las celdas vacías cuentan como faltantes). Los nulos y vacíos "
+        "bajan el score.\n"
+        f"- {validez}: proporción de valores que encajan con su tipo o formato "
+        "(un número que parsea, una fecha legible, un email con forma de email). "
+        "Si una columna es texto libre sin formato esperado, la validez no se "
+        "mide y el score se basa solo en la completitud.\n\n"
+        "Score de columna = 100 × (0,6·completitud + 0,4·validez), "
+        "renormalizado cuando la validez no aplica. A nivel de tabla se añade "
+        f"la {unicidad} (1 − % de filas duplicadas).\n\n"
+        "**Los valores atípicos (outliers) NO bajan la calidad.** Un valor "
+        "extremo puede ser real y correcto; detectar atípicos es parte del "
+        "análisis de la distribución, no un juicio de corrección. Por eso, junto "
+        "con las columnas constantes y los identificadores, se listan aparte "
+        "como **observaciones analíticas** que no afectan al score."
+    )
+
+
 def build_calidad(profile: dict, ctx: dict):
    """Build the data-quality Chapter, or None if the profile has no columns.

@@ -250,17 +351,35 @@ def build_calidad(profile: dict, ctx: dict):
    if not evaluated:
        return None  # no columns to score -> chapter does not apply.

+    # Register the criteria terms in the shared glossary (if present) and mark
+    # their first appearance clickable. Contract §11.1.
+    glossary = ctx.get("glossary")
+    mark = False
+    if isinstance(glossary, model.GlossaryCollector):
+        for key, (label, definition) in _TERMS.items():
+            glossary.add(key, label, definition)
+        mark = True
+
    blocks = [
        model.Heading(text="Cómo se calcula la calidad", level=2),
-        model.Markdown(text=_CRITERIA_INTRO),
+        model.Markdown(text=_criteria_intro(mark)),
        _summary_block(profile, evaluated),
        model.Heading(text="Scores por columna", level=2),
    ]
    scores = _scores_block(evaluated)
    if scores is not None:
        blocks.append(scores)
-    blocks.append(model.Heading(text="Problemas detectados", level=2))
+
+    blocks.append(model.Heading(text="Problemas de calidad", level=2))
    blocks.append(_issues_block(evaluated))

+    observations = _observations_block(evaluated)
+    if observations is not None:
+        blocks.append(model.Heading(text="Observaciones analíticas", level=2))
+        blocks.append(model.Note(
+            "Las observaciones siguientes NO son defectos de calidad y no "
+            "afectan al score: son señales para orientar el análisis."))
+        blocks.append(observations)
+
    return model.Chapter(id=CHAPTER_ID, title=CHAPTER_TITLE,
                         version=CHAPTER_VERSION, blocks=blocks)
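
The scoring rule that the intro above describes can be sketched on its own. This is a minimal sketch, not the registry's `column_quality_score`: the function names `column_score` and `dataset_score` are hypothetical, the weights come from the intro text (0.6 / 0.4) and the commit message (0.85 / 0.15), and treating every input as a 0-1 fraction is an assumption.

```python
from typing import Optional

# Weights as stated in the chapter intro and the commit message.
W_COMPLETENESS = 0.6
W_VALIDITY = 0.4
W_CELL_QUALITY = 0.85
W_ROW_UNIQUENESS = 0.15


def column_score(completeness: float, validity: Optional[float]) -> float:
    """Column score in 0-100 from 0-1 inputs.

    When validity is not measurable (None), the weights are renormalized
    over the applicable dimensions, which leaves completeness alone.
    """
    if validity is None:
        return 100.0 * completeness
    return 100.0 * (W_COMPLETENESS * completeness + W_VALIDITY * validity)


def dataset_score(cell_quality: float,
                  row_uniqueness: Optional[float]) -> float:
    """Dataset score in 0-100; falls back to cell quality alone when the
    row uniqueness could not be computed (None)."""
    if row_uniqueness is None:
        return 100.0 * cell_quality
    return 100.0 * (W_CELL_QUALITY * cell_quality
                    + W_ROW_UNIQUENESS * row_uniqueness)
```

With 20% nulls and a native numeric type (validity 1.0), `column_score(0.8, 1.0)` gives approximately 88, which matches the value the golden test expects for the `edad` column; outliers never enter the computation.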
@@ -1,11 +1,12 @@
-"""Tests for the CALIDAD chapter — DoD: golden + edges + anti-cut.
+"""Tests for the CALIDAD chapter — DoD: golden + edges + anti-cut + glossary.

 Self-contained: builds synthetic TableProfiles (no DuckDB) so the suite is fast
-and deterministic. Verifies that the chapter explains the quality criteria, shows
-per-column scores with the completeness/validity/consistency breakdown, lists the
-issues in Spanish (separate from the type flags), returns None when it does not
-apply, and that a wide profile with long names renders to PDF and PPTX without
-cutting any cell text (long content wraps, it is never truncated).
+and deterministic. Verifies the report-2046 quality model: the chapter explains
+the two scored dimensions (completitud 60% / validez 40%), shows per-column
+scores without a consistency column, keeps quality DEFECTS (issues) separate
+from analytical OBSERVATIONS (outliers, constant, ids), hooks the criteria terms
+into the glossary, returns None when it does not apply, and renders a wide
+profile to PDF and PPTX without cutting any cell text.
 """

 import os
@@ -20,28 +21,30 @@ from datascience.automatic_eda.chapters.calidad import (
    CHAPTER_VERSION,
 )
 from datascience.automatic_eda import build_document, render_pdf, render_pptx
+from datascience.automatic_eda import model


 def _profile() -> dict:
    """A small profile with one column per quality problem (nulls, outliers,
-    constant, high-cardinality id) plus one clean column."""
+    constant, high-cardinality id) plus one clean column. ``outlier_pct`` is in
+    the 0-100 scale that describe_numeric actually emits."""
    return {
        "table": "demo",
-        "quality_score": 72.5,
+        "quality_score": 82.0,
        "duplicate_pct": 0.04,
        "null_cell_pct": 0.11,
        "constant_cols": ["flag_const"],
        "all_null_cols": [],
        "columns": [
-            {"name": "edad", "inferred_type": "integer", "null_pct": 0.2,
-             "numeric": {"outlier_pct": 0.15, "min": 0, "max": 99},
-             "quality_score": 60},
+            {"name": "edad", "inferred_type": "numeric", "null_pct": 0.2,
+             "n_rows": 100, "unique_pct": 0.5,
+             "numeric": {"outlier_pct": 15.0, "min": 0, "max": 99}},
            {"name": "nombre", "inferred_type": "text", "null_pct": 0.0,
-             "unique_pct": 0.98, "quality_score": 80},
+             "unique_pct": 0.98, "flags": ["possible_id"]},
            {"name": "flag_const", "inferred_type": "text", "null_pct": 0.0,
-             "flags": ["constant"], "quality_score": 50},
-            {"name": "limpia", "inferred_type": "float", "null_pct": 0.0,
-             "numeric": {"outlier_pct": 0.0}, "quality_score": 100},
+             "unique_pct": 0.01, "flags": ["constant"]},
+            {"name": "limpia", "inferred_type": "numeric", "null_pct": 0.0,
+             "unique_pct": 0.5, "numeric": {"outlier_pct": 0.0}},
        ],
    }

@@ -50,16 +53,9 @@ def _tables(chapter):
    return [b for b in chapter.blocks if getattr(b, "kind", None) == "data_table"]


-def _scores_table(chapter):
+def _table_by_title(chapter, needle):
    for t in _tables(chapter):
-        if "Scores" in (t.title or ""):
-            return t
-    return None
-
-
-def _issues_table(chapter):
-    for t in _tables(chapter):
-        if "Problemas" in (t.title or ""):
+        if needle in (t.title or ""):
            return t
    return None

@@ -73,41 +69,84 @@ def test_golden_chapter_estructura_y_version():
    assert ch.id == "calidad"
    assert ch.version == CHAPTER_VERSION
    kinds = [b.kind for b in ch.blocks]
-    # intro heading + markdown criteria + summary kv + scores table + issues table
    assert "markdown" in kinds and "kv_table" in kinds and "data_table" in kinds


-def test_golden_intro_explica_criterios_y_pesos():
+def test_golden_intro_explica_dos_dimensiones_y_pesos():
    ch = build_calidad(_profile(), {})
    intro = [b for b in ch.blocks if b.kind == "markdown"][0].text
-    for needle in ("Completitud", "Validez", "Consistencia",
-                   "50%", "30%", "20%"):
+    for needle in ("Completitud", "Validez", "60%", "40%",
+                   "unicidad de registro"):
        assert needle in intro, f"falta {needle!r} en la intro de criterios"
+    # The principle: outliers do NOT lower the quality score.
+    assert "atípicos" in intro and "NO bajan" in intro
+    # The removed consistency dimension is no longer mentioned.
+    assert "20%" not in intro


-def test_golden_scores_incluyen_desglose_por_criterio():
+def test_golden_scores_sin_columna_consistencia():
    ch = build_calidad(_profile(), {})
-    scores = _scores_table(ch)
+    scores = _table_by_title(ch, "Scores")
    assert scores is not None
-    assert scores.header == ["Columna", "Calidad", "Completitud",
-                             "Validez", "Consistencia"]
-    # 4 columns scored, none dropped.
+    assert scores.header == ["Columna", "Calidad", "Completitud", "Validez"]
+    assert "Consistencia" not in scores.header
    assert len(scores.rows) == 4
    names = {r[0] for r in scores.rows}
    assert names == {"edad", "nombre", "flag_const", "limpia"}


-def test_golden_issues_en_espanol_separados_de_flags():
+def test_golden_outliers_en_observaciones_no_en_problemas():
    ch = build_calidad(_profile(), {})
-    issues = _issues_table(ch)
-    assert issues is not None
-    flat = " | ".join(" ".join(r) for r in issues.rows)
-    assert "nulos" in flat            # completeness issue (ES)
-    assert "outliers" in flat         # validity issue (ES)
-    assert "columna constante" in flat
-    assert "posible id de alta cardinalidad" in flat
-    # The raw type flag string must NOT leak as a "problem".
-    assert "constant" not in flat or "columna constante" in flat
+    problemas = _table_by_title(ch, "Problemas de calidad")
+    observaciones = _table_by_title(ch, "Observaciones")
+    assert problemas is not None
+    assert observaciones is not None
+
+    problemas_txt = " | ".join(" ".join(r) for r in problemas.rows)
+    observaciones_txt = " | ".join(" ".join(r) for r in observaciones.rows)
+
+    # Nulls ARE a quality problem.
+    assert "nulos" in problemas_txt
+    # Outliers do NOT appear as a problem...
+    assert "atípic" not in problemas_txt and "outlier" not in problemas_txt
+    # ...but as an analytical observation.
+    assert "atípic" in observaciones_txt
+    # Constant and id columns: observations, not problems.
+    assert "constante" in observaciones_txt
+    assert "identificador" in observaciones_txt
+    assert "constante" not in problemas_txt
+
+
+def test_golden_score_columna_limpia_es_100():
+    """Null-free native numeric column: score 100 with or without outliers."""
+    ch = build_calidad(_profile(), {})
+    scores = _table_by_title(ch, "Scores")
+    by_name = {r[0]: r for r in scores.rows}
+    assert by_name["limpia"][1] == "100 / 100"
+    # edad: 20% nulls -> 100*(0.6*0.8 + 0.4*1.0) = 88; outliers do not lower it.
+    assert by_name["edad"][1] == "88 / 100"
+
+
+# --------------------------------------------------------------------------- #
+# Glossary (contract §11.1)
+# --------------------------------------------------------------------------- #
+def test_glosario_registra_los_cuatro_terminos_y_marca_clicable():
+    glossary = model.GlossaryCollector()
+    ch = build_calidad(_profile(), {"glossary": glossary})
+    for key in ("calidad_datos", "completitud", "validez", "unicidad_registro"):
+        assert glossary.has(key), f"término {key!r} no registrado en el glosario"
+    intro = [b for b in ch.blocks if b.kind == "markdown"][0].text
+    # With a collector present, the first appearance is marked clickable.
+    assert "[[term:completitud]]" in intro
+    assert "[[term:validez]]" in intro
+    assert "[[term:calidad_datos]]" in intro
+    assert "[[term:unicidad_registro]]" in intro
+
+
+def test_sin_glosario_no_marca_terminos():
+    ch = build_calidad(_profile(), {})  # ctx without a glossary
+    intro = [b for b in ch.blocks if b.kind == "markdown"][0].text
+    assert "[[term:" not in intro


 # --------------------------------------------------------------------------- #
@@ -124,17 +163,17 @@ def test_edge_perfil_limpio_sin_problemas_usa_nota():
    prof = {
        "quality_score": 100,
        "columns": [
-            {"name": "a", "inferred_type": "float", "null_pct": 0.0,
-             "numeric": {"outlier_pct": 0.0}},
-            {"name": "b", "inferred_type": "float", "null_pct": 0.0,
-             "numeric": {"outlier_pct": 0.0}},
+            {"name": "a", "inferred_type": "numeric", "null_pct": 0.0,
+             "unique_pct": 0.5, "numeric": {"outlier_pct": 0.0}},
+            {"name": "b", "inferred_type": "numeric", "null_pct": 0.0,
+             "unique_pct": 0.5, "numeric": {"outlier_pct": 0.0}},
        ],
    }
    ch = build_calidad(prof, {})
    assert ch is not None
-    assert _issues_table(ch) is None  # no issues table
+    assert _table_by_title(ch, "Problemas de calidad") is None  # no issues table
    notes = [b for b in ch.blocks if b.kind == "note"]
-    assert notes and "No se detectaron problemas" in notes[0].text
+    assert any("No se detectaron problemas" in n.text for n in notes)


 # --------------------------------------------------------------------------- #
@@ -143,44 +182,42 @@ def test_edge_perfil_limpio_sin_problemas_usa_nota():
 def _wide_profile(ncols: int = 22) -> dict:
    cols = [
        {"name": "identificador_unico_de_transaccion_con_nombre_muy_largo",
-         "inferred_type": "text", "null_pct": 0.0, "unique_pct": 0.99},
+         "inferred_type": "text", "null_pct": 0.0, "unique_pct": 0.99,
+         "flags": ["possible_id"]},
        {"name": "columna_constante_sin_ninguna_variacion_de_valor",
-         "inferred_type": "text", "null_pct": 0.0, "flags": ["constant"]},
+         "inferred_type": "text", "null_pct": 0.0, "unique_pct": 0.01,
+         "flags": ["constant"]},
    ]
    for k in range(ncols - 2):
        cols.append({
            "name": f"metrica_numerica_de_negocio_{k:02d}_con_nombre_largo",
-            "inferred_type": "float", "null_pct": 0.1 + (k % 3) * 0.05,
-            "numeric": {"outlier_pct": 0.08, "min": 0, "max": 1000},
+            "inferred_type": "numeric", "null_pct": 0.1 + (k % 3) * 0.05,
+            "unique_pct": 0.5,
+            "numeric": {"outlier_pct": 8.0, "min": 0, "max": 1000},
        })
-    return {"table": "ancha", "quality_score": 70.0, "columns": cols}
+    return {"table": "ancha", "quality_score": 70.0, "duplicate_pct": 0.0,
+            "columns": cols}


 def test_anticut_pdf_y_pptx_no_truncan_nombres_largos():
    prof = _wide_profile(22)
    full = build_document(prof, {"dataset_name": "ancha"})
    assert any(c.id == "calidad" for c in full)
-    # Render ONLY the calidad chapter so the anti-cut assertions are scoped to
-    # this chapter (other chapters, e.g. portada, legitimately contain '…').
    chapters = [c for c in full if c.id == "calidad"]
    long_name = "metrica_numerica_de_negocio_00_con_nombre_largo"
    with tempfile.TemporaryDirectory() as d:
        pdf = os.path.join(d, "q.pdf")
        pptx = os.path.join(d, "q.pptx")
        rp = render_pdf(chapters, pdf, {"title": "EDA"})
-        rx = render_pptx(chapters, pptx, {"title": "EDA"})
+        render_pptx(chapters, pptx, {"title": "EDA"})
        assert os.path.exists(pdf) and os.path.exists(pptx)
-        # The wide table forces pagination across several pages/slides.
        assert (rp or {}).get("n_pages", 0) >= 2

-        # PDF: the long name survives whole once wraps (spaces/newlines) removed,
-        # and there is no truncation marker.
        pdf_txt = "".join((pg.extract_text() or "") for pg in PdfReader(pdf).pages)
        assert "…" not in pdf_txt and "..." not in pdf_txt
        norm = re.sub(r"\s+", "", pdf_txt)
        assert long_name in norm, "el nombre largo se cortó en el PDF"

-        # PPTX: long name present in some cell, untruncated.
        allt = []
        for s in Presentation(pptx).slides:
            for sh in s.shapes:
@@ -33,10 +33,23 @@ import math

 from .. import model

-CHAPTER_VERSION = "1.0.0"
+CHAPTER_VERSION = "1.1.0"
 CHAPTER_ID = "cat_distr"
 CHAPTER_TITLE = "Distribuciones categóricas"

+# Glossary term this chapter explains. Registered in the shared collector and
+# marked clickable on its first appearance (end-to-end glossary example —
+# mejora 6). Other chapters hook their own terms the same way (see the contract).
+_TERM_ENTROPIA_KEY = "entropia"
+_TERM_ENTROPIA_LABEL = "Entropía (de Shannon)"
+_TERM_ENTROPIA_DEF = (
+    "Medida, en bits, de cómo de repartidos están los valores de una columna "
+    "categórica. Vale 0 cuando una sola categoría concentra todas las filas "
+    "(máxima previsibilidad) y alcanza su máximo, log2(k) para k categorías "
+    "distintas, cuando todas aparecen por igual (máxima diversidad). La entropía "
+    "normalizada (entropía dividida por su máximo) la lleva al rango 0–1 para "
+    "comparar columnas con distinto número de categorías.")
+
 # Cap the number of categorical columns rendered to keep the document bounded;
 # the rest are summarized in a closing note (no silent truncation).
 MAX_COLS = 40
@@ -337,10 +350,14 @@ def _topk_table(cat: dict):
                           note=note)


-def _intro_blocks(n_rows):
+def _intro_blocks(n_rows, mark_term: bool = False):
    total = _fmt_int(n_rows)
+    # Mark the first appearance of the term as a clickable glossary jump when the
+    # term was registered (mark_term). The visible text is identical either way.
+    entropia = ("[[term:entropia]]**entropía de Shannon**[[/term]]" if mark_term
+                else "**entropía de Shannon**")
    text = (
-        "La **entropía de Shannon** mide cómo de repartidos están los valores de "
+        f"La {entropia} mide cómo de repartidos están los valores de "
        "una columna categórica, en bits. Vale 0 cuando una sola categoría "
        "concentra todas las filas (máxima previsibilidad) y alcanza su máximo, "
        "log2(k) para k categorías distintas, cuando todas aparecen por igual "
@@ -370,7 +387,15 @@ def build_cat_distr(profile: dict, ctx: dict):
        return None

    n_rows = profile.get("n_rows")
-    blocks = list(_intro_blocks(n_rows))
+    # Register "entropía" in the shared glossary collector (if present) and mark
+    # its first appearance clickable. End-to-end glossary example (mejora 6).
+    glossary = ctx.get("glossary")
+    mark_term = False
+    if isinstance(glossary, model.GlossaryCollector):
+        glossary.add(_TERM_ENTROPIA_KEY, _TERM_ENTROPIA_LABEL,
+                     _TERM_ENTROPIA_DEF)
+        mark_term = True
+    blocks = list(_intro_blocks(n_rows, mark_term=mark_term))

    rendered = cat_cols[:MAX_COLS]
    for col in rendered:
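
The entropy definition registered in the glossary above (0 for a single category, log2(k) for k equally frequent categories, divided by its maximum to normalize) can be checked with a few lines. This is an illustrative sketch only: `shannon_entropy` and `normalized_entropy` are hypothetical helpers, not functions from this package, and returning 0.0 for an empty or single-category column is an assumption.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def shannon_entropy(values: Sequence[Hashable]) -> float:
    """Shannon entropy, in bits, of the category frequencies in `values`."""
    counts = Counter(values)
    total = sum(counts.values())
    if total == 0:
        return 0.0  # assumption: an empty column has no spread to measure
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def normalized_entropy(values: Sequence[Hashable]) -> float:
    """Entropy divided by its maximum log2(k), giving a value in [0, 1]."""
    k = len(set(values))
    if k <= 1:
        return 0.0  # assumption: log2(1) == 0, so define the ratio as 0
    return shannon_entropy(values) / math.log2(k)
```

Four equally frequent categories yield 2 bits and a normalized entropy of 1, while a column holding a single repeated value yields 0 in both measures.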
@@ -0,0 +1,47 @@
+"""Glossary chapter (GLOSARIO) — always the last chapter, clickable terms.
+
+Renders one entry per glossary term that the other chapters registered during
+the document build through ``ctx['glossary'].add(key, label, definition)`` (see
+``GlossaryCollector`` in ``model.py``). Each entry is a clickable destination:
+every in-text appearance a chapter marked with ``[[term:key]]texto[[/term]]``
+becomes a real jump to its entry here — PDF link annotations (PyMuPDF) and PPTX
+native slide jumps, both wired by the renderers.
+
+Returns ``None`` when no term was registered (there is nothing to show), so the
+chapter simply disappears from documents that did not mark any term.
+
+Contract: build_<id>(profile, ctx) -> Chapter | None ; CHAPTER_VERSION = "x.y.z".
+"""
+
+from __future__ import annotations
+
+from .. import model
+
+CHAPTER_VERSION = "1.0.0"
+CHAPTER_ID = "glosario"
+CHAPTER_TITLE = "Glosario"
+
+
+def build_glosario(profile: dict, ctx: dict):
+    """Build the glossary Chapter from the shared collector, or None if empty."""
+    ctx = ctx or {}
+    glossary = ctx.get("glossary")
+    if not isinstance(glossary, model.GlossaryCollector) or not glossary:
+        return None
+
+    blocks = [
+        model.Heading(text="Glosario de términos", level=1),
+        model.Markdown(text=(
+            "Definición de los términos técnicos que aparecen en el informe. "
+            "Cada término va resaltado en el texto y, al pulsarlo, salta a su "
+            "definición en esta sección.")),
+    ]
+    # One clickable destination per term, alphabetically by visible label.
+    for term in glossary.terms(by="label"):
+        blocks.append(model.GlossaryEntry(
+            key=model._safe_str(term.get("key")),
+            label=model._safe_str(term.get("label")),
+            definition=model._safe_str(term.get("definition"))))
+
+    return model.Chapter(id=CHAPTER_ID, title=CHAPTER_TITLE,
+                         version=CHAPTER_VERSION, blocks=blocks)
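
The `[[term:key]]texto[[/term]]` markers that chapters emit are resolved by the renderers, which are not part of this diff. As a rough illustration of the marker syntax only, the sketch below strips the markers and collects the referenced keys; `strip_term_markers` and its regular expression are hypothetical and do not reproduce how the PDF or PPTX renderers build their links.

```python
import re
from typing import List, Tuple

# Assumption: a key is any run of characters other than "]", and markers
# are not nested.
_TERM_RE = re.compile(r"\[\[term:([^\]]+)\]\](.*?)\[\[/term\]\]", re.DOTALL)


def strip_term_markers(text: str) -> Tuple[str, List[str]]:
    """Return the text without term markers, plus the keys in order of
    first appearance."""
    keys: List[str] = []

    def _keep_inner(match: "re.Match[str]") -> str:
        key = match.group(1)
        if key not in keys:
            keys.append(key)
        return match.group(2)  # keep only the visible text

    return _TERM_RE.sub(_keep_inner, text), keys
```

A check of this kind can confirm that a marked intro and an unmarked intro show the same visible text, which is the property the chapters rely on when no glossary collector is present.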
@@ -34,7 +34,7 @@ try:
 except Exception:  # noqa: BLE001 — keep the chapter importable no matter what.
    build_boxplot_stats = None  # type: ignore[assignment]

-CHAPTER_VERSION = "1.0.0"
+CHAPTER_VERSION = "1.1.0"
 CHAPTER_ID = "num_distr"
 CHAPTER_TITLE = "Distribuciones numéricas"

@@ -278,12 +278,17 @@ def build_num_distr(profile: dict, ctx: dict):
                box = build_boxplot_stats(numeric) or {}
            except Exception:  # noqa: BLE001 — degrade, never raise.
                box = {}
-        blocks.append(model.Heading(text=str(name), level=2))
-        blocks.append(model.Figure(
-            make=_figure_maker(name, numeric, box),
-            caption=f"Distribución de «{name}» — histograma (media/mediana/±σ) "
-                    f"y boxplot."))
-        blocks.append(model.Markdown(text=_stats_note(name, numeric, box)))
+        # Keep the column heading, its figure and its stats note together on the
+        # same page/slide (mejora 3 — keep-together): the renderers measure the
+        # whole Group and move it whole when it would not fit.
+        blocks.append(model.Group(blocks=[
+            model.Heading(text=str(name), level=2),
+            model.Figure(
+                make=_figure_maker(name, numeric, box),
+                caption=f"Distribución de «{name}» — histograma "
+                        f"(media/mediana/±σ) y boxplot."),
+            model.Markdown(text=_stats_note(name, numeric, box)),
+        ]))

    return model.Chapter(id=CHAPTER_ID, title=CHAPTER_TITLE,
                         version=CHAPTER_VERSION, blocks=blocks)
@@ -65,19 +65,33 @@ def _pdf_text(path: str) -> str:
    return re.sub(r"\s+", " ", txt)


+def _flatten(blocks):
+    """Expand keep-together Groups so the per-column heading/figure/markdown are
+    inspectable as a flat block list (the chapter wraps each column in a Group)."""
+    out = []
+    for b in blocks:
+        if getattr(b, "kind", "") == "group":
+            out.extend(_flatten(getattr(b, "blocks", []) or []))
+        else:
+            out.append(b)
+    return out
+
+
 def test_golden_chapter_estructura_y_bloques():
    ch = build_num_distr(_profile(n_numeric=2), {})
    assert ch is not None
    assert ch.id == "num_distr"
    assert ch.version == CHAPTER_VERSION
-    kinds = [b.kind for b in ch.blocks]
+    # Per-column blocks are wrapped in keep-together Groups: flatten to inspect.
+    flat = _flatten(ch.blocks)
+    kinds = [b.kind for b in flat]
    # Heading + intro Markdown, then per column: Heading + Figure + Markdown.
    assert kinds[0] == "heading"
    assert kinds[1] == "markdown"
    assert kinds.count("figure") == 2          # one figure per numeric column.
    assert kinds.count("heading") == 1 + 2     # chapter title + one per column.
    # Each figure has a lazy maker that produces a real matplotlib figure.
-    figs = [b for b in ch.blocks if b.kind == "figure"]
+    figs = [b for b in flat if b.kind == "figure"]
    fig = figs[0].make()
    assert fig is not None
    # Two stacked axes: histogram + boxplot share the figure.
@@ -90,7 +104,8 @@ def test_golden_media_mediana_sigma_y_boxplot_presentes():
    # The intro documents the three reference lines and the Tukey boxplot; the
    # per-column note carries the actual mean/median/σ numbers and the shape.
    ch = build_num_distr(_profile(n_numeric=1, extra_categorical=False), {})
-    md_texts = " ".join(b.text for b in ch.blocks if b.kind == "markdown")
+    md_texts = " ".join(b.text for b in _flatten(ch.blocks)
+                        if b.kind == "markdown")
    assert "media" in md_texts and "mediana" in md_texts
    assert "±1σ" in md_texts or "σ" in md_texts
    assert "boxplot" in md_texts.lower()
@@ -126,7 +141,8 @@ def test_anti_corte_muchas_columnas_pdf_y_pptx():
    # 8 numeric columns + long note text: nothing may be cut. Every column
    # heading must survive in both the PDF text and the PPTX deck.
    ch = build_num_distr(_profile(n_numeric=8), {})
-    names = [b.text for b in ch.blocks if b.kind == "heading" and b.level == 2]
+    names = [b.text for b in _flatten(ch.blocks)
+             if b.kind == "heading" and b.level == 2]
    assert len(names) == 8
    with tempfile.TemporaryDirectory() as d:
        pdf = os.path.join(d, "num.pdf")
@@ -17,7 +17,7 @@ from datetime import datetime, timezone

 from .. import model

-CHAPTER_VERSION = "1.0.0"
+CHAPTER_VERSION = "1.1.0"
 CHAPTER_ID = "portada"
 CHAPTER_TITLE = "Portada"

@@ -67,6 +67,53 @@ def _fmt_int(v) -> str:
        return str(v)


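+def _fmt_pct(value) -> str:
+    """Format a percentage that may arrive as a 0–1 fraction or a 0–100 number.
+
+    Values in (0, 1] are treated as fractions and scaled by 100, so a genuine
+    percentage at or below 1 (e.g. 0.8 meaning 0.8%) is indistinguishable from a
+    fraction and renders as 80.0%."""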
+    if value is None:
+        return "—"
+    try:
+        v = float(value)
+    except (TypeError, ValueError):
+        return str(value)
+    if 0 < v <= 1.0:
+        v *= 100.0
+    return f"{v:.1f}%"
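
The scale heuristic in `_fmt_pct` is easiest to judge on concrete inputs. The sketch below uses a local copy of the same rule (named `fmt_pct` here so it runs on its own); the sample values are illustrative and not taken from any real profile.

```python
def fmt_pct(value) -> str:
    """Local copy of the heuristic: values in (0, 1] are read as fractions."""
    if value is None:
        return "—"
    try:
        v = float(value)
    except (TypeError, ValueError):
        return str(value)
    if 0 < v <= 1.0:
        v *= 100.0
    return f"{v:.1f}%"


# Illustrative inputs: fractions, whole percentages, and non-numeric values.
samples = [0.25, 25, 0.8, 1.5, 0, None, "n/a"]
results = {repr(s): fmt_pct(s) for s in samples}
for raw, shown in results.items():
    print(f"{raw:>6} -> {shown}")
```

Note that `0.8` renders as `80.0%` while `1.5` renders as `1.5%`: a profile that reports percentages on a 0–100 scale would have any value at or below 1 inflated a hundredfold. If the upstream scale is known, passing it explicitly would likely be more robust than guessing from the magnitude.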
+
+
+def _summary_blocks(summary) -> list:
+    """Mini-summary of the rest of the analysis, shown on the cover (improvement 5).
+
+    The cover is built AFTER the body (``build_document`` passes the aggregated
+    ``ctx['document_summary']``), so it can reflect what the analysis found:
+    shape, column types, quality flags and which chapters were included. Returns
+    an empty list when there is no summary (the cover degrades to its metadata
+    table only)."""
+    if not isinstance(summary, dict) or not summary:
+        return []
+    rows = []
+    n_num = summary.get("n_numeric")
+    n_cat = summary.get("n_categorical")
+    if n_num is not None or n_cat is not None:
+        rows.append(("Columnas numéricas / categóricas",
+                     f"{_fmt_int(n_num)} / {_fmt_int(n_cat)}"))
+    if summary.get("duplicate_pct") is not None:
+        rows.append(("Filas duplicadas", _fmt_pct(summary.get("duplicate_pct"))))
+    if summary.get("null_cell_pct") is not None:
+        rows.append(("Celdas nulas", _fmt_pct(summary.get("null_cell_pct"))))
+    titles = summary.get("chapter_titles") or []
+    if titles:
+        rows.append(("Capítulos del informe", _fmt_int(len(titles))))
+
+    blocks = [model.Heading(text="Resumen del análisis", level=2)]
+    if rows:
+        blocks.append(model.KVTable(rows=rows))
+    if titles:
+        bullets = "\n".join(f"- {model._safe_str(t)}" for t in titles)
+        blocks.append(model.Markdown(
+            text="Este informe incluye los siguientes capítulos:\n" + bullets))
+    return blocks
+
+
 def _fmt_date_eu(value) -> str:
    """Format a date/ISO string as European DD/MM/AAAA HH:mm (UI convention).

@@ -152,5 +199,8 @@ def build_portada(profile: dict, ctx: dict):
        model.Markdown(text=str(granularity)),
    ]

+    # Mini-summary of the rest of the analysis (built last, shown on the cover).
+    blocks.extend(_summary_blocks(ctx.get("document_summary")))
+
    return model.Chapter(id=CHAPTER_ID, title=CHAPTER_TITLE,
                         version=CHAPTER_VERSION, blocks=blocks)
@@ -26,7 +26,7 @@ from . import model
 # placeholders other agents will fill by creating chapters/<id>.py — they will
 # appear in this exact position automatically once their module exists.
 CHAPTER_ORDER = [
-    "portada",       # cover
+    "portada",       # cover — BUILT LAST, PLACED FIRST (see build_document).
    "overview",      # df.head + columns/types/nulls/examples + describe
    "analisis_llm",  # LLM interpretation — sits next to overview (user request)
    "num_distr",     # numeric distributions
@@ -37,8 +37,15 @@ CHAPTER_ORDER = [
    "timeseries",    # time-series analysis
    "geospatial",    # geospatial
    "agregacion",    # aggregations / pivots
+    "glosario",      # glossary — ALWAYS LAST; clickable term destinations.
 ]

+# Chapters whose position is special-cased by build_document: portada is built
+# last (so it can summarize the rest) but placed first; glosario is built and
+# placed last (it reads the terms every other chapter registered).
+_PORTADA = "portada"
+_GLOSARIO = "glosario"
+

 def build_chapter(chapter_id: str, profile: dict, ctx: dict):
    """Build a single chapter by id, or None if absent/not-applicable/error.
@@ -75,15 +82,72 @@ def build_document(profile: dict, ctx: dict = None) -> list:
        list[Chapter] in canonical order, containing only the chapters that are
        implemented and applicable. Never raises.
    """
-    if profile is None:
-        profile = {}
    if not isinstance(profile, dict):
        profile = {}
-    if ctx is None:
-        ctx = {}
-    chapters = []
+    # Copy ctx so the shared collector / summary we add do not leak to the caller.
+    ctx = dict(ctx) if isinstance(ctx, dict) else {}
+
+    # A single glossary collector is shared by every chapter via ctx['glossary'].
+    # Chapters call ctx['glossary'].add(key, label, definition) and mark in-text
+    # appearances with [[term:key]]…[[/term]]; the glosario chapter renders the
+    # registered terms and the renderers wire the clickable links.
+    glossary = ctx.get("glossary")
+    if not isinstance(glossary, model.GlossaryCollector):
+        glossary = model.GlossaryCollector()
+        ctx["glossary"] = glossary
+
+    # 1) Body: every chapter except portada (built last) and glosario (placed
+    # last), in canonical order. This also fills the glossary collector.
+    body = []
    for cid in CHAPTER_ORDER:
+        if cid in (_PORTADA, _GLOSARIO):
+            continue
        ch = build_chapter(cid, profile, ctx)
        if ch is not None and ch.blocks:
-            chapters.append(ch)
+            body.append(ch)
+
+    # 2) Aggregated summary of the rest, for the cover (user decision: the cover
+    # is BUILT after the body so it can reflect what the analysis found).
+    ctx["document_summary"] = _summarize_document(profile, body)
+
+    # 3) Build the cover after the body, place it FIRST.
+    portada = build_chapter(_PORTADA, profile, ctx)
+    # 4) Build the glossary after everything else (it reads the terms the other
+    # chapters registered), place it LAST.
+    glosario = build_chapter(_GLOSARIO, profile, ctx)
+
+    chapters = []
+    if portada is not None and portada.blocks:
+        chapters.append(portada)
+    chapters.extend(body)
+    if glosario is not None and glosario.blocks:
+        chapters.append(glosario)
    return chapters
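
The "built last, placed first" ordering in `build_document` can be sketched without the real chapter modules. Everything below is hypothetical (dict-based chapters, the `build_in_order` helper and the three builder functions); it only illustrates the build order versus the placement order described above.

```python
def build_in_order(order, builders):
    """Build body chapters first, then cover and glossary, then assemble."""
    ctx = {"terms": [], "summary": None}
    # 1) Body: every chapter except the two special-cased ones.
    body = [builders[cid](ctx) for cid in order
            if cid not in ("portada", "glosario")]
    body = [ch for ch in body if ch]
    # 2) The summary exists only after the body has been built.
    ctx["summary"] = {"chapter_titles": [ch["title"] for ch in body]}
    # 3) Cover and glossary are built now, but placed first and last.
    cover = builders["portada"](ctx)
    glossary = builders["glosario"](ctx)
    return [ch for ch in (cover, *body, glossary) if ch]


def overview(ctx):
    ctx["terms"].append("entropia")  # the body registers a glossary term
    return {"id": "overview", "title": "Overview"}


def cover(ctx):
    n = len(ctx["summary"]["chapter_titles"])
    return {"id": "portada", "title": f"Cover ({n} chapters)"}


def glossary(ctx):
    if not ctx["terms"]:
        return None  # nothing registered: the chapter is omitted
    return {"id": "glosario", "title": "Glossary", "terms": list(ctx["terms"])}


doc = build_in_order(
    ["portada", "overview", "glosario"],
    {"portada": cover, "overview": overview, "glosario": glossary},
)
ids = [ch["id"] for ch in doc]
print(ids, doc[0]["title"])
```

The cover can count the body chapters only because it runs after them, and the glossary sees the term only because the body registered it first.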
+
+
+def _summarize_document(profile: dict, body: list) -> dict:
+    """Aggregate a tiny findings summary of the body for the cover. Never raises.
+
+    Returns a dict with dataset shape, quality, column-type counts and the list
+    of chapters actually included — enough for the cover to show a mini-summary
+    of the analysis without re-deriving anything."""
+    try:
+        cols = profile.get("columns") or []
+        n_num = sum(1 for c in cols if isinstance(c, dict)
+                    and c.get("inferred_type") == "numeric")
+        n_cat = sum(1 for c in cols if isinstance(c, dict)
+                    and isinstance(c.get("categorical"), dict)
+                    and c.get("categorical", {}).get("top")
+                    and c.get("inferred_type") != "numeric")
+        return {
+            "n_chapters": len(body),
+            "chapter_titles": [getattr(c, "title", "") for c in body],
+            "n_rows": profile.get("n_rows"),
+            "n_cols": profile.get("n_cols"),
+            "quality_score": profile.get("quality_score"),
+            "n_numeric": n_num,
+            "n_categorical": n_cat,
+            "duplicate_pct": profile.get("duplicate_pct"),
+            "null_cell_pct": profile.get("null_cell_pct"),
+        }
+    except Exception:  # noqa: BLE001 — the summary is best-effort.
+        return {"n_chapters": len(body) if isinstance(body, list) else 0}
@@ -128,6 +128,39 @@ class Note:
    kind: str = field(default="note", init=False)


+@dataclass
+class Group:
+    """A keep-together unit: its blocks render on the SAME page/slide.
+
+    Renderers measure the whole group first; if it does not fit in the remaining
+    space they move it *whole* to the next page (PDF) or slide (PPTX) before
+    drawing anything — so a heading never gets stranded apart from the figure and
+    text it introduces. If the group is taller than a full page even on its own,
+    it starts on a fresh page and flows (honest degradation, never cut). Use it to
+    bind ``Heading`` + ``Markdown`` + ``Figure`` of one idea together (see the
+    DISTR NUM / AGREGACION chapters).
+    """
+
+    blocks: list = field(default_factory=list)
+    title: Optional[str] = None
+    kind: str = field(default="group", init=False)
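
The keep-together rule in the `Group` docstring reduces to a small decision on heights. The sketch below is one possible reading of it, not the renderer's actual code: `place_group` and its parameters are hypothetical, with `y` growing downward in inches between a content `top` and `bottom`.

```python
def place_group(y, group_h, top, bottom):
    """Decide where a keep-together group starts.

    Returns (page_breaks, start_y, flows): how many page breaks happen before
    drawing, the y at which the group starts, and whether the group is taller
    than a full page and therefore has to flow across pages.
    """
    remaining = bottom - y
    page_h = bottom - top
    if group_h <= remaining:
        return 0, y, False            # fits in the space left: draw in place
    breaks = 1 if y > top else 0      # no extra break if already on a fresh page
    return breaks, top, group_h > page_h


# Hypothetical page geometry: content area from y=1.0in to y=10.0in.
TOP, BOTTOM = 1.0, 10.0
moved = place_group(9.6, 3.0, TOP, BOTTOM)     # only ~0.4in left
in_place = place_group(1.0, 3.0, TOP, BOTTOM)  # empty page
oversized = place_group(5.0, 12.0, TOP, BOTTOM)
print(moved, in_place, oversized)
```

Measuring the whole group before drawing anything is what prevents a heading from being placed on one page while its figure lands on the next.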
+
+
+@dataclass
+class GlossaryEntry:
+    """One glossary term: a clickable destination at the end of the document.
+
+    Rendered as the term ``label`` (heading) plus its ``definition`` (markdown).
+    The renderers register its page/slide position as the link target so every
+    in-text appearance of the same ``key`` becomes a real clickable jump (PDF link
+    annotation via PyMuPDF; PPTX internal slide jump)."""
+
+    key: str = ""
+    label: str = ""
+    definition: str = ""
+    kind: str = field(default="glossary_entry", init=False)
+
+
 @dataclass
 class Chapter:
    """An ordered set of blocks with an id, a title and a generation version."""
@@ -150,13 +183,17 @@ _BLOCK_BY_KIND = {
    "image": Image,
    "caption": Caption,
    "note": Note,
+    "group": Group,
+    "glossary_entry": GlossaryEntry,
 }


 def as_block(obj: Any):
    """Coerce a value into a block dataclass. Unknown values become a Note."""
    if isinstance(obj, (Heading, Markdown, KVTable, DataTable, Figure, Image,
-                        Caption, Note)):
+                        Caption, Note, Group, GlossaryEntry)):
+        if isinstance(obj, Group):
+            obj.blocks = as_blocks(obj.blocks)
        return obj
    if isinstance(obj, dict):
        kind = obj.get("kind")
@@ -189,6 +226,13 @@ def as_block(obj: Any):
                return Caption(text=_safe_str(obj.get("text")))
            if cls is Note:
                return Note(text=_safe_str(obj.get("text")))
+            if cls is Group:
+                return Group(blocks=as_blocks(obj.get("blocks")),
+                             title=obj.get("title"))
+            if cls is GlossaryEntry:
+                return GlossaryEntry(key=_safe_str(obj.get("key")),
+                                     label=_safe_str(obj.get("label")),
+                                     definition=_safe_str(obj.get("definition")))
        except Exception:  # noqa: BLE001 — never raise on a malformed block.
            return Note(text=_safe_str(obj))
    return Note(text=_safe_str(obj))
@@ -246,6 +290,67 @@ def _safe_str(v: Any) -> str:
        return ""


+# --------------------------------------------------------------------------- #
+# Glossary collector — chapters register the terms they use; the glosario
+# chapter renders them at the end and the renderers wire the clickable links.
+# --------------------------------------------------------------------------- #
+class GlossaryCollector:
+    """Accumulates glossary terms registered by chapters during document build.
+
+    A single instance is created by :func:`build_document` and passed to every
+    chapter via ``ctx['glossary']``. A chapter calls ``add(key, label,
+    definition)`` to declare a term it explains (e.g. ``"entropia"`` →
+    "Entropía"), and marks each in-text appearance with the inline span
+    ``[[term:key]]texto visible[[/term]]`` (see ``text_layout.parse_inline_rich``).
+    The ``glosario`` chapter reads ``terms()`` to emit one :class:`GlossaryEntry`
+    per term; the renderers turn every marked appearance into a real click that
+    jumps to that entry. First registration of a key wins (idempotent); never
+    raises."""
+
+    def __init__(self):
+        self._terms: dict = {}
+        self._order: list = []
+
+    def add(self, key: Any, label: Any = None, definition: Any = "") -> str:
+        """Register a term and return its normalized key ('' if invalid)."""
+        try:
+            k = _safe_str(key).strip()
+            if not k:
+                return ""
+            if k not in self._terms:
+                self._terms[k] = {
+                    "key": k,
+                    "label": _safe_str(label).strip() or k,
+                    "definition": _safe_str(definition),
+                }
+                self._order.append(k)
+            return k
+        except Exception:  # noqa: BLE001 — collecting a term never breaks a build.
+            return ""
+
+    def has(self, key: Any) -> bool:
+        return _safe_str(key).strip() in self._terms
+
+    def get(self, key: Any) -> Optional[dict]:
+        return self._terms.get(_safe_str(key).strip())
+
+    def terms(self, by: str = "label") -> list:
+        """Return the registered terms as dicts.
+
+        ``by='label'`` (default) sorts alphabetically by visible label;
+        ``by='order'`` keeps first-appearance order."""
+        if by == "order":
+            return [self._terms[k] for k in self._order]
+        return sorted(self._terms.values(),
+                      key=lambda t: _safe_str(t.get("label")).lower())
+
+    def __len__(self) -> int:
+        return len(self._terms)
+
+    def __bool__(self) -> bool:
+        return bool(self._terms)
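
The collector's two guarantees (first registration wins, and `terms()` sorts by label unless asked for insertion order) can be exercised with a reduced stand-in. `MiniGlossary` below is a hypothetical simplification written for this example, and the definitions are placeholder text.

```python
class MiniGlossary:
    """Reduced stand-in for GlossaryCollector: first registration wins."""

    def __init__(self):
        self._terms = {}  # dicts keep insertion order (Python 3.7+)

    def add(self, key, label=None, definition=""):
        k = str(key or "").strip()
        if not k:
            return ""  # blank keys are rejected, never raised on
        if k not in self._terms:
            self._terms[k] = {
                "key": k,
                "label": str(label or "").strip() or k,
                "definition": str(definition),
            }
        return k

    def terms(self, by="label"):
        items = list(self._terms.values())
        if by == "order":
            return items
        return sorted(items, key=lambda t: t["label"].lower())


g = MiniGlossary()
g.add("validez", "Validez", "first definition")
g.add("completitud", "Completitud", "share of non-null cells")
g.add("validez", "Otra etiqueta", "ignored: the key already exists")
blank = g.add("   ", "sin clave")

by_label = [t["key"] for t in g.terms()]
by_order = [t["key"] for t in g.terms(by="order")]
print(by_label, by_order)
```

First-wins semantics means the chapter that runs earliest in `CHAPTER_ORDER` decides the wording of a shared term, so later chapters can call `add` defensively without overwriting it.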
+
+
 # --------------------------------------------------------------------------- #
 # Manifest — per-chapter versions and page/slide counts for tracking.
 # --------------------------------------------------------------------------- #
@@ -0,0 +1,354 @@
+"""Tests for the AutomaticEDA engine features added in phase 4a.
+
+Covers, with executable evidence, the six render-engine improvements:
+
+1. Bold no longer overlaps the following text in the PDF (real width measured).
+2. Zebra striping on data tables (PDF Rectangle fills + PPTX cell fills).
+3. Keep-together: a Group moves whole to the next page/slide (heading never gets
+   stranded from its figure).
+4. Every PPTX figure carries a visible caption/title (fallback to the heading).
+5. Cover is built last but placed first and reflects an aggregated summary.
+6. Glossary is the last chapter; the term "entropía" is a real clickable link in
+   the PDF (PyMuPDF GOTO annotation) and in the PPTX (native slide-jump run).
+
+Self-contained: synthetic profiles, no DuckDB. Heavy renderer checks (fitz/pptx)
+skip cleanly when the optional engine is missing.
+"""
+
+import os
+import sys
+
+import pytest
+
+_HERE = os.path.dirname(os.path.abspath(__file__))
+_FUNCTIONS = os.path.abspath(os.path.join(_HERE, "..", "..", ".."))  # python/functions
+if _FUNCTIONS not in sys.path:
+    sys.path.insert(0, _FUNCTIONS)
+
+import matplotlib  # noqa: E402
+
+matplotlib.use("Agg")
+import matplotlib.colors as mcolors  # noqa: E402
+import matplotlib.pyplot as plt  # noqa: E402
+from matplotlib.patches import Rectangle  # noqa: E402
+
+from datascience.automatic_eda import model  # noqa: E402
+from datascience.automatic_eda import render_pdf_impl as RP  # noqa: E402
+from datascience.automatic_eda import render_pptx_impl as RX  # noqa: E402
+from datascience.automatic_eda import build_document  # noqa: E402
+from datascience.render_automatic_eda_pdf import render_automatic_eda_pdf  # noqa: E402
+from datascience.render_automatic_eda_pptx import render_automatic_eda_pptx  # noqa: E402
+
+
+class _FakePdf:
+    """Stand-in for PdfPages so the placers can call _new_page in unit tests."""
+
+    def savefig(self, fig):  # noqa: D401
+        pass
+
+
+def _small_fig():
+    fig = plt.figure(figsize=(4.0, 1.5))
+    ax = fig.add_subplot(111)
+    ax.plot([0, 1, 2], [1, 3, 2])
+    return fig
+
+
+def _profile_with_cat_and_num():
+    """A tiny profile that triggers cat_distr (→ entropía term) and num_distr."""
+    return {
+        "table": "ventas", "n_rows": 120, "n_cols": 2, "quality_score": 91,
+        "duplicate_pct": 1.5, "null_cell_pct": 0.8,
+        "columns": [
+            {"name": "region", "inferred_type": "categorical",
+             "categorical": {
+                 "top": [{"value": "norte", "count": 50, "pct": 0.42},
+                         {"value": "sur", "count": 40, "pct": 0.33},
+                         {"value": "este", "count": 30, "pct": 0.25}],
+                 "mode": "norte", "n_distinct": 3, "entropy": 1.55,
+                 "imbalance": 0.1}},
+            {"name": "importe", "inferred_type": "numeric",
+             "numeric": {"mean": 50.0, "median": 48.0, "std": 10.0,
+                         "min": 10, "max": 99, "iqr": 15,
+                         "histogram": [{"lo": 0, "hi": 50, "count": 40},
+                                       {"lo": 50, "hi": 100, "count": 80}]}},
+        ],
+    }
+
+
+# --------------------------------------------------------------------------- #
+# 1) Bold does not overlap the following text (PDF).
+# --------------------------------------------------------------------------- #
+def test_pdf_bold_span_does_not_overlap_following_text():
+    fig = plt.figure(figsize=(RP._W, RP._H))
+    st = RP._PdfState(_FakePdf(), "t")
+    st.fig = fig
+    st.page = 1
+    # A wide bold token immediately followed by normal text on the SAME line.
+    rich = [[("PALABRAMUYANCHAENNEGRITA", True, None),
+             (" texto normal justo después", False, None)]]
+    RP._place_rich_lines(st, rich, RP._FS_BODY, RP._INK)
+
+    renderer = fig.canvas.get_renderer()
+    boxes = sorted((t.get_window_extent(renderer) for t in fig.texts),
+                   key=lambda b: b.x0)
+    assert len(boxes) == 2, "se esperaban dos spans dibujados"
+    # The bold span ends before the normal span starts (no overlap). 1px slack.
+    assert boxes[0].x1 <= boxes[1].x0 + 1.0, \
+        "la negrita se solapa con el texto siguiente"
+    plt.close(fig)
+
+
+# --------------------------------------------------------------------------- #
+# 2) Zebra striping.
+# --------------------------------------------------------------------------- #
+def _facecolor_eq(artist, hexcolor) -> bool:
+    want = mcolors.to_rgba(hexcolor)
+    got = artist.get_facecolor()
+    return all(abs(a - b) < 0.02 for a, b in zip(got[:3], want[:3]))
+
+
+def test_pdf_table_has_zebra_striping():
+    fig = plt.figure(figsize=(RP._W, RP._H))
+    st = RP._PdfState(_FakePdf(), "t")
+    st.fig = fig
+    st.page = 1
+    st.chapter = model.Chapter(id="c", title="C", version="1.0.0")
+    dt = model.DataTable(header=["A", "B"],
+                         rows=[["1", "x"], ["2", "y"], ["3", "z"], ["4", "w"]])
+    RP._place_data_table(st, dt)
+    zebra = [a for a in fig.findobj(Rectangle) if _facecolor_eq(a, RP._ZEBRA)]
+    # 4 data rows → even rows (1-based 2 and 4) shaded = 2 zebra rectangles.
+    assert len(zebra) == 2, f"esperadas 2 filas zebra, hay {len(zebra)}"
+    plt.close(fig)
+
+
+def test_pptx_table_has_zebra_striping(tmp_path):
+    pytest.importorskip("pptx")
+    from pptx import Presentation
+    from pptx.dml.color import RGBColor
+
+    doc = [model.Chapter(id="c", title="Tabla", version="1.0.0", blocks=[
+        model.DataTable(header=["A", "B"],
+                        rows=[["1", "x"], ["2", "y"], ["3", "z"], ["4", "w"]])])]
+    out = str(tmp_path / "zebra.pptx")
+    assert render_automatic_eda_pptx(doc, out, {"write_manifest": False})["path"]
+
+    prs = Presentation(out)
+    table = None
+    for slide in prs.slides:
+        for sh in slide.shapes:
+            if sh.has_table:
+                table = sh.table
+                break
+    assert table is not None, "no se encontró la tabla en el deck"
+    zebra = RGBColor(0xF6, 0xF8, 0xFA)
+    white = RGBColor(0xFF, 0xFF, 0xFF)
+    # Row 0 = header; data rows follow. Even data rows (table rows 2, 4) shaded.
+    assert table.cell(1, 0).fill.fore_color.rgb == white
+    assert table.cell(2, 0).fill.fore_color.rgb == zebra
+    assert table.cell(4, 0).fill.fore_color.rgb == zebra
+
+
+# --------------------------------------------------------------------------- #
+# 3) Keep-together (Group): heading + figure never split.
+# --------------------------------------------------------------------------- #
+def test_pdf_group_moves_whole_to_next_page_when_it_does_not_fit():
+    fig = plt.figure(figsize=(RP._W, RP._H))
+    st = RP._PdfState(_FakePdf(), "t")
+    st.fig = fig
+    st.page = 1
+    st.chapter = model.Chapter(id="c", title="C", version="1.0.0")
+    grp = model.Group(blocks=[
+        model.Heading(text="Sección con figura", level=2),
+        model.Figure(make=_small_fig, caption="cap"),
+        model.Markdown(text="Descripción breve de la figura."),
+    ])
+    # Only ~0.4in left: the group does not fit here but fits on a fresh page.
+    st.y = RP._CONTENT_BOTTOM - 0.4
+    page_before = st.page
+    RP._place_group(st, grp)
+    # Exactly one page break: the whole group (heading+figure+text) stays
+    # together on the new page — no second break inside it.
+    assert st.page == page_before + 1
+    plt.close(st.fig)
+
+
+def test_pdf_group_does_not_break_when_it_fits():
+    fig = plt.figure(figsize=(RP._W, RP._H))
+    st = RP._PdfState(_FakePdf(), "t")
+    st.fig = fig
+    st.page = 1
+    st.chapter = model.Chapter(id="c", title="C", version="1.0.0")
+    grp = model.Group(blocks=[
+        model.Heading(text="Cabe entera", level=2),
+        model.Figure(make=_small_fig, caption="cap"),
+    ])
+    st.y = RP._CONTENT_TOP  # empty page → fits, must not break.
+    page_before = st.page
+    RP._place_group(st, grp)
+    assert st.page == page_before
+    plt.close(st.fig)
+
+
+def test_pptx_group_moves_whole_to_next_slide(tmp_path):
+    pytest.importorskip("pptx")
+    from pptx import Presentation
+    from pptx.util import Inches
+
+    prs = Presentation()
+    prs.slide_width = Inches(RX._W)
+    prs.slide_height = Inches(RX._H)
+    st = RX._PptxState(prs, "t")
+    st.chapter = model.Chapter(id="c", title="C", version="1.0.0")
+    RX._new_slide(st, cont=False)
+    grp = model.Group(blocks=[
+        model.Heading(text="Sección con figura", level=2),
+        model.Figure(make=_small_fig, caption="cap"),
+        model.Markdown(text="Descripción breve."),
+    ])
+    st.y = RX._CONTENT_BOTTOM - 0.4  # does not fit here.
+    slide_before = st.slide_no
+    RX._place_group(st, grp)
+    assert st.slide_no == slide_before + 1  # one jump; group kept together.
+
+
+# --------------------------------------------------------------------------- #
+# 4) Every PPTX figure carries a visible caption/title.
+# --------------------------------------------------------------------------- #
+def test_pptx_figure_without_caption_gets_heading_title(tmp_path):
+    pytest.importorskip("pptx")
+    from pptx import Presentation
+    from pptx.enum.shapes import MSO_SHAPE_TYPE
+
+    doc = [model.Chapter(id="c", title="Cap", version="1.0.0", blocks=[
+        model.Heading(text="Mi sección gráfica", level=2),
+        model.Figure(make=_small_fig),  # NO caption provided.
+    ])]
+    out = str(tmp_path / "cap.pptx")
+    assert render_automatic_eda_pptx(doc, out, {"write_manifest": False})["path"]
+
+    prs = Presentation(out)
+    for slide in prs.slides:
+        has_pic = any(sh.shape_type == MSO_SHAPE_TYPE.PICTURE
+                      for sh in slide.shapes)
+        if not has_pic:
+            continue
+        italic = [r.text for sh in slide.shapes if sh.has_text_frame
+                  for p in sh.text_frame.paragraphs for r in p.runs
+                  if r.font.italic and r.text.strip()]
+        assert italic, "la figura no lleva caption visible en su slide"
+        assert any("Mi sección gráfica" in t for t in italic), \
+            "el caption no cayó al título de la sección"
+        return
+    pytest.fail("no se encontró ningún slide con imagen")
+
+
+def test_pptx_no_figure_slide_is_ever_untitled(tmp_path):
+    """Invariant: across many figures (incl. tall ones), NO slide with an image
+    lacks a visible caption — the caption never spills to the next slide."""
+    pytest.importorskip("pptx")
+    from pptx import Presentation
+    from pptx.enum.shapes import MSO_SHAPE_TYPE
+
+    def _tall_fig():
+        fig = plt.figure(figsize=(5.0, 4.6))  # nearly square → fills the slide.
+        fig.add_subplot(111).bar([1, 2, 3], [4, 5, 6])
+        return fig
+
+    blocks = []
+    for i in range(6):
+        blocks.append(model.Heading(text=f"Gráfico {i}", level=2))
+        blocks.append(model.Figure(
+            make=_tall_fig,
+            caption=("Una descripción de la figura deliberadamente larga para "
+                     "que el caption ocupe más de una línea al envolverse en el "
+                     f"ancho del slide — figura número {i} del bloque.")))
+    doc = [model.Chapter(id="c", title="Muchas figuras", version="1.0.0",
+                         blocks=blocks)]
+    out = str(tmp_path / "many.pptx")
+    assert render_automatic_eda_pptx(doc, out, {"write_manifest": False})["path"]
+
+    prs = Presentation(out)
+    missing = []
+    pics = 0
+    for i, slide in enumerate(prs.slides):
+        if not any(sh.shape_type == MSO_SHAPE_TYPE.PICTURE
+                   for sh in slide.shapes):
+            continue
+        pics += 1
+        italic = [r.text for sh in slide.shapes if sh.has_text_frame
+                  for p in sh.text_frame.paragraphs for r in p.runs
+                  if r.font.italic and r.text.strip()]
+        if not italic:
+            missing.append(i)
+    assert pics >= 6, f"esperadas >=6 figuras, hay {pics}"
+    assert not missing, f"slides con imagen sin caption: {missing}"
+
+
+# --------------------------------------------------------------------------- #
+# 5) Cover built last, placed first, with an aggregated summary.
+# --------------------------------------------------------------------------- #
+def test_cover_first_glossary_last_with_summary():
+    chs = build_document(_profile_with_cat_and_num(), ctx={"dataset_name": "v"})
+    ids = [c.id for c in chs]
+    assert ids[0] == "portada", f"la portada no es la primera: {ids}"
+    assert ids[-1] == "glosario", f"el glosario no es el último: {ids}"
+    cover = chs[0]
+    headings = [b.text for b in cover.blocks if b.kind == "heading"]
+    assert any("Resumen" in h for h in headings), \
+        "la portada no incluye el resumen agregado"
+    # The summary reflects the body chapters (e.g. the numeric/categorical ones).
+    cover_text = " ".join(
+        b.text for b in cover.blocks if getattr(b, "kind", "") == "markdown")
+    assert "Distribuciones" in cover_text, \
+        "el resumen de portada no menciona los capítulos del cuerpo"
+
+
+# --------------------------------------------------------------------------- #
+# 6) Glossary clickable in PDF (PyMuPDF GOTO) and PPTX (native slide jump).
+# --------------------------------------------------------------------------- #
+def test_pdf_glossary_term_is_clickable(tmp_path):
+    fitz = pytest.importorskip("fitz")
+    out = str(tmp_path / "glos.pdf")
+    res = render_automatic_eda_pdf(_profile_with_cat_and_num(), out,
+                                   {"ctx": {"dataset_name": "v"},
+                                    "write_manifest": False})
+    assert res["path"] == out and os.path.exists(out)
+
+    doc = fitz.open(out)
+    n_pages = doc.page_count
+    goto = [(pno, link) for pno in range(n_pages)
+            for link in doc[pno].get_links()
+            if link.get("kind") == fitz.LINK_GOTO]
+    doc.close()
+    assert goto, "no hay ningún enlace interno (entropía → glosario) en el PDF"
+    # Destination must be a real page in the document (the glossary page).
+    assert all(0 <= link.get("page", -1) < n_pages for _p, link in goto)
+
+
+def test_pptx_glossary_term_is_clickable(tmp_path):
+    pytest.importorskip("pptx")
+    from pptx import Presentation
+    from pptx.oxml.ns import qn
+
+    out = str(tmp_path / "glos.pptx")
+    res = render_automatic_eda_pptx(_profile_with_cat_and_num(), out,
+                                    {"ctx": {"dataset_name": "v"},
+                                     "write_manifest": False})
+    assert res["path"] == out and os.path.exists(out)
+
+    prs = Presentation(out)
+    found = False
+    for slide in prs.slides:
+        for sh in slide.shapes:
+            if not sh.has_text_frame:
+                continue
+            for p in sh.text_frame.paragraphs:
+                for r in p.runs:
+                    rpr = r._r.find(qn("a:rPr"))
+                    if rpr is None:
+                        continue
+                    hl = rpr.find(qn("a:hlinkClick"))
+                    if hl is not None and \
+                            hl.get("action") == "ppaction://hlinksldjump":
+                        found = True
+    assert found, "ningún término tiene hyperlink de salto a slide en el PPTX"
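
Both clickable-term tests depend on in-text markers of the form `[[term:key]]visible text[[/term]]`. The real parsing lives in `text_layout.parse_inline_rich`, which is not shown in this diff; the sketch below is a hypothetical regex-based stand-in (`split_term_spans`) that only illustrates how such a marker could be split into plain and term segments.

```python
import re

# Hypothetical pattern for the inline span: [[term:key]]visible[[/term]].
_TERM_RE = re.compile(r"\[\[term:([^\]]+)\]\](.*?)\[\[/term\]\]", re.DOTALL)


def split_term_spans(text):
    """Split text into (visible_text, term_key_or_None) segments."""
    segments = []
    pos = 0
    for m in _TERM_RE.finditer(text):
        if m.start() > pos:
            segments.append((text[pos:m.start()], None))  # plain text before
        segments.append((m.group(2), m.group(1).strip()))  # the marked term
        pos = m.end()
    if pos < len(text):
        segments.append((text[pos:], None))  # trailing plain text
    return segments


segments = split_term_spans("La [[term:entropia]]entropía[[/term]] es alta.")
print(segments)
```

A renderer would draw the segment carrying a key in the accent colour and record its rectangle as a link source, while segments with `None` are drawn as ordinary text.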
@@ -60,6 +60,8 @@ _FS_BODY, _FS_CELL, _FS_NOTE = 10.5, 9.0, 9.0
 _GAP = 0.12          # vertical gap after a block, inches.
 _CELL_PAD = 0.06     # horizontal padding inside a table cell, inches.
 _ROW_VPAD = 0.05     # vertical padding inside a table row, inches.
+_ZEBRA = "#f6f8fa"   # very light grey for zebra-striped (even) table rows.
+_LINK = "#2a6f97"    # accent colour for clickable glossary terms.


 class _PdfState:
@@ -73,6 +75,11 @@ class _PdfState:
        self.page = 0                # global page counter.
        self.chapter = None          # current Chapter (for the footer).
        self.chapter_pages = 0       # pages produced for the current chapter.
+        self.last_heading = ""       # text of the most recent heading.
+        # Glossary wiring (improvement 6). Pages are 0-based; rects/points are in
+        # PDF points (1/72") with a top-left origin, the same convention as PyMuPDF.
+        self.term_sources = []       # [{key, page, rect:[x0,y0,x1,y1]}]
+        self.term_dests = {}         # key -> {page, point:[x,y]}


 # --------------------------------------------------------------------------- #
@@ -121,6 +128,35 @@ def _draw_footer(st: _PdfState) -> None:
        transform=st.fig.transFigure, color=_RULE, lw=0.6))


+def _text_width_in(st: _PdfState, s: str, fs: float, bold: bool) -> float:
+    """Real rendered width (inches) of ``s`` at ``fs`` with the given weight.
+
+    Measured with the Agg renderer's own font metrics (the same TrueType the PDF
+    backend embeds), so a **bold** span advances the cursor by its ACTUAL width —
+    fixing the bug where bold text overlapped the following normal text because
+    the cursor advanced by the normal-weight average-glyph estimate. Falls back to
+    the deterministic character grid if the renderer is unavailable, so it never
+    raises.
+    """
+    if not s:
+        return 0.0
+    try:
+        from matplotlib.font_manager import FontProperties
+        renderer = st.fig.canvas.get_renderer()
+        prop = FontProperties(family="sans-serif", size=fs,
+                              weight="bold" if bold else "normal")
+        w_px, _h, _d = renderer.get_text_width_height_descent(s, prop, False)
+        return w_px / float(st.fig.dpi)
+    except Exception:  # noqa: BLE001 — fall back to the conservative grid metric.
+        return tl.avg_char_width_in(fs) * len(s)
+
+
+def _pt_rect(x0_in: float, y_top_in: float, x1_in: float,
+             y_bottom_in: float) -> list:
+    """An inches box (top-left origin) → a PDF-points rect for PyMuPDF links."""
+    return [x0_in * 72.0, y_top_in * 72.0, x1_in * 72.0, y_bottom_in * 72.0]
+
+
 def _remaining(st: _PdfState) -> float:
    return _CONTENT_BOTTOM - st.y

@@ -138,6 +174,7 @@ def _place_heading(st: _PdfState, block) -> None:
    level = max(1, min(3, int(getattr(block, "level", 1) or 1)))
    fs = {1: _FS_H1, 2: _FS_H2, 3: _FS_H3}[level]
    text = tl.strip_inline_md(getattr(block, "text", ""))
+    st.last_heading = text or st.last_heading
    max_chars = tl.chars_per_line(_USABLE_W, fs)
    lines = tl.wrap(text, max_chars)
    lh = tl.line_height_in(fs, leading=1.2)
@@ -171,17 +208,19 @@ def _place_text_lines(st: _PdfState, lines: list, fs: float, color: str,

 def _place_rich_lines(st: _PdfState, rich_lines: list, fs: float, color: str,
                      indent: float = 0.0, prefixes=None) -> None:
-    """Draw pre-wrapped lines of styled segments (bold spans rendered bold).
+    """Draw pre-wrapped lines of styled segments (bold + clickable term spans).

-    Each line is ``[(text, is_bold), ...]``. Segments are placed left-to-right,
-    advancing x by the deterministic character grid (same metric the wrapper
-    used), so a bold span is rendered with ``fontweight='bold'`` without
-    changing the line's measured width — the no-cut guarantee is preserved.
+    Each line is a list of ``(text, is_bold)`` or ``(text, is_bold, term_key)``
+    segments. Segments are placed left-to-right, advancing x by the segment's
+    REAL rendered width (measured with the renderer's font metrics for the actual
+    weight) — this is what stops a bold span from overlapping the following text:
+    the cursor no longer advances by the normal-weight estimate. A segment with a
+    ``term_key`` is drawn in the accent colour and its rectangle is recorded in
+    ``st.term_sources`` so it becomes a clickable jump to the glossary entry.
    ``prefixes`` is an optional ``(first_line, other_lines)`` pair (e.g. a
    bullet) drawn before the segments.
    """
    lh = tl.line_height_in(fs)
-    cw = tl.avg_char_width_in(fs)
    for idx, segs in enumerate(rich_lines):
        _ensure_space(st, lh)
        x = _ML + indent
@@ -190,14 +229,23 @@ def _place_rich_lines(st: _PdfState, rich_lines: list, fs: float, color: str,
            if prefix:
                st.fig.text(_xf(x), _yf(st.y), prefix, fontsize=fs, color=color,
                            ha="left", va="top")
-                x += cw * len(prefix)
-        for seg_text, is_bold in segs:
+                x += _text_width_in(st, prefix, fs, False)
+        for seg in segs:
+            if len(seg) == 3:
+                seg_text, is_bold, term = seg
+            else:
+                seg_text, is_bold, term = seg[0], seg[1], None
            if seg_text == "":
                continue
-            st.fig.text(_xf(x), _yf(st.y), seg_text, fontsize=fs, color=color,
-                        ha="left", va="top",
+            w = _text_width_in(st, seg_text, fs, bool(is_bold))
+            st.fig.text(_xf(x), _yf(st.y), seg_text, fontsize=fs,
+                        color=(_LINK if term else color), ha="left", va="top",
                        fontweight="bold" if is_bold else "normal")
-            x += cw * len(seg_text)
+            if term:
+                st.term_sources.append({
+                    "key": term, "page": st.page - 1,
+                    "rect": _pt_rect(x, st.y, x + w, st.y + lh)})
+            x += w
        st.y += lh


@@ -242,7 +290,7 @@ def _place_markdown(st: _PdfState, block) -> None:
        if stripped.startswith("- ") or stripped.startswith("* "):
            content = stripped[2:]  # keep inline markers for bold rendering.
            bullet_chars = tl.chars_per_line(_USABLE_W - 0.22, _FS_BODY)
-            rich = tl.wrap_rich(content, bullet_chars)
+            rich = tl.wrap_rich_terms(content, bullet_chars)
            _place_rich_lines(st, rich, _FS_BODY, _INK,
                              prefixes=("•  ", "   "))
            i += 1
@@ -258,7 +306,8 @@ def _place_markdown(st: _PdfState, block) -> None:
            j += 1
        text = " ".join(para)
        max_chars = tl.chars_per_line(_USABLE_W, _FS_BODY)
-        _place_rich_lines(st, tl.wrap_rich(text, max_chars), _FS_BODY, _INK)
+        _place_rich_lines(st, tl.wrap_rich_terms(text, max_chars), _FS_BODY,
+                          _INK)
        i = j
    st.y += _GAP

@@ -325,15 +374,18 @@ def _wrap_row(cells: list, widths: list, fs: float) -> list:


 def _draw_table_row(st: _PdfState, cells_lines: list, widths: list, fs: float,
-                    y0: float, header: bool) -> float:
+                    y0: float, header: bool, zebra: bool = False) -> float:
    lh = tl.line_height_in(fs)
    nlines = max((len(c) for c in cells_lines), default=1)
    row_h = lh * nlines + _ROW_VPAD * 2
-    if header:
+    # Background: header band, or a faint zebra fill for even data rows. Drawn
+    # below the text/rule (zorder 0) so striping never hides cell content.
+    bg = _HEAD_BG if header else (_ZEBRA if zebra else None)
+    if bg is not None:
        st.fig.add_artist(Rectangle(
            (_xf(_ML), _yf(y0 + row_h)), _xf(_ML + _USABLE_W) - _xf(_ML),
            _yf(y0) - _yf(y0 + row_h), transform=st.fig.transFigure,
-            color=_HEAD_BG, lw=0, zorder=0))
+            color=bg, lw=0, zorder=0))
    x = _ML
    for c, lines in enumerate(cells_lines):
        for k, ln in enumerate(lines):
@@ -378,14 +430,18 @@ def _place_data_table(st: _PdfState, block) -> None:
            + _ROW_VPAD * 2
    _ensure_space(st, header_h() + max(first_row_h, lh))
    draw_header()
-    for r in rows:
+    # ``data_idx`` is the LOGICAL row index (not reset across page breaks) so the
+    # zebra pattern stays coherent when a long table splits and repeats the
+    # header: even rows (1-based) are shaded → 0-based odd indices.
+    for data_idx, r in enumerate(rows):
        cells_lines = _wrap_row(r, widths, fs)
        row_h = lh * max((len(c) for c in cells_lines), default=1) \
            + _ROW_VPAD * 2
        if _remaining(st) < row_h:
            _new_page(st)
            draw_header()  # repeat header on the continuation page.
-        st.y += _draw_table_row(st, cells_lines, widths, fs, st.y, header=False)
+        st.y += _draw_table_row(st, cells_lines, widths, fs, st.y,
+                                header=False, zebra=(data_idx % 2 == 1))
    note = getattr(block, "note", None)
    if note:
        _place_text_lines(st, tl.wrap(model._safe_str(note),
@@ -414,53 +470,98 @@ def _png_from_figure(fig) -> bytes:
    return buf.read()


-def _place_image_array(st: _PdfState, arr, caption) -> None:
+def _figure_png_cached(block):
+    """Rasterize a Figure to PNG bytes ONCE and cache (bytes, aspect).
+
+    Measuring (keep-together) and drawing must agree on the REAL aspect ratio:
+    ``bbox_inches='tight'`` changes it vs ``figsize``, so we rasterize once and
+    reuse the bytes for both. Caching never raises; rasterizing may."""
+    cached = getattr(block, "_aeda_png", None)
+    if cached is not None:
+        return cached
+    fig, owned = _resolve_figure(block)
+    data = None
+    if fig is not None:
+        try:
+            data = _png_from_figure(fig)
+        finally:
+            if owned:
+                try:
+                    plt.close(fig)
+                except Exception:  # noqa: BLE001
+                    pass
+    aspect = 0.66
+    if data is not None:
+        try:
+            arr = mpimg.imread(io.BytesIO(data))
+            aspect = (arr.shape[0] / arr.shape[1]) if arr.shape[1] else 0.66
+        except Exception:  # noqa: BLE001
+            aspect = 0.66
+    try:
+        block._aeda_png = (data, aspect)
+        return block._aeda_png
+    except Exception:  # noqa: BLE001 — block may reject attributes; degrade.
+        return (data, aspect)
+
+
+def _image_aspect(block) -> float:
+    """Real aspect (h/w) of an Image block by path, for measurement."""
+    path = getattr(block, "path", "")
+    if path and os.path.exists(path):
+        try:
+            arr = mpimg.imread(path)
+            return (arr.shape[0] / arr.shape[1]) if arr.shape[1] else 0.66
+        except Exception:  # noqa: BLE001
+            pass
+    return 0.66
+
+
+def _place_image_array(st: _PdfState, arr, caption, max_h_in=None) -> None:
    h_px, w_px = arr.shape[0], arr.shape[1]
    aspect = (h_px / w_px) if w_px else 1.0
+    # Reserve the caption's REAL (possibly multi-line) height FIRST, then scale
+    # the image to (max_h - cap_reserve) so figure + caption always fit the same
+    # page. cap_reserve adds a cushion so the caption never spills to next page.
+    cap_lines = (tl.wrap(model._safe_str(caption),
+                         tl.chars_per_line(_USABLE_W, _FS_NOTE))
+                 if caption else [])
+    cap_real = tl.line_height_in(_FS_NOTE) * len(cap_lines) if caption else 0.0
+    cap_reserve = (cap_real + 0.04 + 0.08) if caption else 0.0
    max_h = _CONTENT_BOTTOM - _CONTENT_TOP
+    # height_in hint (model.Figure/Image): cap the height so a figure in a
+    # keep-together Group shrinks to leave room for its heading and text.
+    if isinstance(max_h_in, (int, float)) and max_h_in > 0:
+        max_h = min(max_h, float(max_h_in))
+    max_img_h = max(max_h - cap_reserve, 0.6)
    target_w = _USABLE_W
    target_h = target_w * aspect
-    if target_h > max_h:
-        target_h = max_h
+    if target_h > max_img_h:
+        target_h = max_img_h
        target_w = target_h / aspect if aspect else _USABLE_W
-    cap_h = tl.line_height_in(_FS_NOTE) + 0.04 if caption else 0.0
    # Move whole image to next page if it does not fit in remaining space.
-    if _remaining(st) < target_h + cap_h:
-        if (max_h) >= target_h + cap_h:
-            _new_page(st)
-        else:
-            # Taller than a full page even at min — already clamped to max_h.
-            _new_page(st)
+    if _remaining(st) < target_h + cap_reserve:
+        _new_page(st)
    left_frac = _xf(_ML + (_USABLE_W - target_w) / 2.0)
    bottom_frac = _yf(st.y + target_h)
    ax = st.fig.add_axes([left_frac, bottom_frac, target_w / _W, target_h / _H])
    ax.imshow(arr)
    ax.axis("off")
    st.y += target_h + 0.04
-    if caption:
-        _place_text_lines(st, tl.wrap(model._safe_str(caption),
-                          tl.chars_per_line(_USABLE_W, _FS_NOTE)),
-                          _FS_NOTE, _MUTED, style="italic")
+    if cap_lines:
+        _place_text_lines(st, cap_lines, _FS_NOTE, _MUTED, style="italic")
    st.y += _GAP


 def _place_figure(st: _PdfState, block) -> None:
-    fig, owned = _resolve_figure(block)
-    if fig is None:
+    png, _aspect = _figure_png_cached(block)
+    if png is None:
        _place_text_lines(st, ["(figura no disponible)"], _FS_NOTE, _MUTED,
                          style="italic")
        st.y += _GAP
        return
-    try:
-        png = _png_from_figure(fig)
-    finally:
-        if owned:
-            try:
-                plt.close(fig)
-            except Exception:  # noqa: BLE001
-                pass
    arr = mpimg.imread(io.BytesIO(png))
-    _place_image_array(st, arr, getattr(block, "caption", None))
+    _place_image_array(st, arr, getattr(block, "caption", None),
+                       max_h_in=getattr(block, "height_in", None))


 def _place_image(st: _PdfState, block) -> None:
@@ -471,7 +572,8 @@ def _place_image(st: _PdfState, block) -> None:
        st.y += _GAP
        return
    arr = mpimg.imread(path)
-    _place_image_array(st, arr, getattr(block, "caption", None))
+    _place_image_array(st, arr, getattr(block, "caption", None),
+                       max_h_in=getattr(block, "height_in", None))


 def _place_caption(st: _PdfState, block) -> None:
@@ -488,6 +590,189 @@ def _place_note(st: _PdfState, block) -> None:
    st.y += _GAP


+# --------------------------------------------------------------------------- #
+# Block measurement (improvement 3: keep-together). Estimates a block's height
+# WITHOUT drawing it, so a Group can decide to move whole to the next page before
+# anything is drawn. Over-estimating is safe: it only triggers an earlier page
+# break, never a content cut (the placers keep their own no-cut pagination).
+# --------------------------------------------------------------------------- #
+def _measure_heading_text(text: str, level: int) -> float:
+    level = max(1, min(3, int(level or 1)))
+    fs = {1: _FS_H1, 2: _FS_H2, 3: _FS_H3}[level]
+    lines = tl.wrap(tl.strip_inline_md(text), tl.chars_per_line(_USABLE_W, fs))
+    h = tl.line_height_in(fs, leading=1.2) * len(lines) + 0.06
+    if level == 1:
+        h += 0.10
+    return h + _GAP
+
+
+def _measure_markdown(block) -> float:
+    raw = str(getattr(block, "text", "") or "")
+    md_lines = raw.split("\n")
+    h = 0.0
+    i, n = 0, len(md_lines)
+    while i < n:
+        stripped = md_lines[i].strip()
+        if stripped.startswith("|") and stripped.endswith("|"):
+            j = i
+            while j < n and md_lines[j].strip().startswith("|") \
+                    and md_lines[j].strip().endswith("|"):
+                j += 1
+            h += (tl.line_height_in(_FS_CELL) + _ROW_VPAD * 2) * (j - i) + _GAP
+            i = j
+            continue
+        if stripped == "":
+            h += tl.line_height_in(_FS_BODY) * 0.5
+            i += 1
+            continue
+        if stripped.startswith("### "):
+            h += _measure_heading_text(stripped[4:], 3)
+            i += 1
+            continue
+        if stripped.startswith("## "):
+            h += _measure_heading_text(stripped[3:], 2)
+            i += 1
+            continue
+        if stripped.startswith("# "):
+            h += _measure_heading_text(stripped[2:], 1)
+            i += 1
+            continue
+        if stripped.startswith("- ") or stripped.startswith("* "):
+            lines = tl.wrap_rich_terms(
+                stripped[2:], tl.chars_per_line(_USABLE_W - 0.22, _FS_BODY))
+            h += tl.line_height_in(_FS_BODY) * len(lines)
+            i += 1
+            continue
+        para = [stripped]
+        j = i + 1
+        while j < n:
+            nxt = md_lines[j].strip()
+            if nxt == "" or nxt.startswith(("|", "#", "- ", "* ")):
+                break
+            para.append(nxt)
+            j += 1
+        lines = tl.wrap_rich_terms(" ".join(para),
+                                   tl.chars_per_line(_USABLE_W, _FS_BODY))
+        h += tl.line_height_in(_FS_BODY) * len(lines)
+        i = j
+    return h + _GAP
+
+
+def _measure_figure_like(block) -> float:
+    max_h = _CONTENT_BOTTOM - _CONTENT_TOP
+    hint = getattr(block, "height_in", None)
+    if isinstance(hint, (int, float)) and hint > 0:
+        target_h = min(float(hint), max_h)
+    else:
+        # Real rasterized aspect (cached) so measuring matches drawing.
+        if getattr(block, "kind", "") == "image":
+            aspect = _image_aspect(block)
+        else:
+            _data, aspect = _figure_png_cached(block)
+        target_h = min(_USABLE_W * aspect, max_h)
+    cap = getattr(block, "caption", None)
+    cap_h = tl.line_height_in(_FS_NOTE) + 0.04 if cap else 0.0
+    return target_h + 0.04 + cap_h + _GAP
+
+
+def _measure_block(st: _PdfState, block) -> float:
+    kind = getattr(block, "kind", "")
+    try:
+        if kind == "heading":
+            return _measure_heading_text(getattr(block, "text", ""),
+                                         getattr(block, "level", 1))
+        if kind == "markdown":
+            return _measure_markdown(block)
+        if kind in ("figure", "image"):
+            return _measure_figure_like(block)
+        if kind in ("caption", "note"):
+            lines = tl.wrap(getattr(block, "text", ""),
+                            tl.chars_per_line(_USABLE_W, _FS_NOTE))
+            return tl.line_height_in(_FS_NOTE) * len(lines) + _GAP
+        if kind == "kv_table":
+            rows = getattr(block, "rows", []) or []
+            return (tl.line_height_in(_FS_BODY) + _ROW_VPAD) * (len(rows) + 1) \
+                + _GAP
+        if kind == "data_table":
+            rows = getattr(block, "rows", []) or []
+            return (tl.line_height_in(_FS_CELL) + _ROW_VPAD * 2) \
+                * (len(rows) + 1) + _GAP
+        if kind == "group":
+            return sum(_measure_block(st, b)
+                       for b in (getattr(block, "blocks", []) or []))
+    except Exception:  # noqa: BLE001 — a measurement never aborts rendering.
+        pass
+    return tl.line_height_in(_FS_BODY)
+
+
+def _shrink_group_figures(st: _PdfState, blocks: list, avail_full: float) -> None:
+    """Cap each figure's height (via height_in) so the whole group fits a page.
+
+    The figure shrinks just enough to leave room for its heading, text and
+    caption — keep-together puts the chart on the SAME page as its title and
+    description instead of pushing it to the next page."""
+    fig_blocks = [b for b in blocks
+                  if getattr(b, "kind", "") in ("figure", "image")]
+    if not fig_blocks:
+        return
+    nonfig_h = sum(_measure_block(st, b) for b in blocks
+                   if getattr(b, "kind", "") not in ("figure", "image"))
+    fig_overhead = tl.line_height_in(_FS_NOTE) + 0.04 + 0.04 + _GAP
+    budget = avail_full - nonfig_h - 0.08 * len(fig_blocks)
+    if budget <= 0.8:
+        return
+    per = budget / len(fig_blocks) - fig_overhead
+    if per <= 0.6:
+        return
+    for fb in fig_blocks:
+        cur = getattr(fb, "height_in", None)
+        fb.height_in = (min(float(cur), per)
+                        if isinstance(cur, (int, float)) and cur > 0 else per)
+
+
+def _place_group(st: _PdfState, block) -> None:
+    """Render a keep-together Group: move it whole to the next page if needed."""
+    blocks = getattr(block, "blocks", []) or []
+    if not blocks:
+        return
+    avail_full = _CONTENT_BOTTOM - _CONTENT_TOP
+    _shrink_group_figures(st, blocks, avail_full)
+    total = sum(_measure_block(st, b) for b in blocks)
+    if total <= avail_full:
+        # Fits on one page: keep it together by moving whole when it won't fit.
+        if total > _remaining(st):
+            _new_page(st)
+    elif st.y > _CONTENT_TOP + 1e-6:
+        # Taller than a full page: at least start it on a fresh page, then flow.
+        _new_page(st)
+    for b in blocks:
+        placer = _PLACERS.get(getattr(b, "kind", ""), _place_note)
+        try:
+            placer(st, b)
+        except Exception:  # noqa: BLE001 — a bad block never aborts the group.
+            pass
+
+
+def _place_glossary_entry(st: _PdfState, block) -> None:
+    """Render one glossary term and register it as a clickable link target."""
+    key = getattr(block, "key", "")
+    label = getattr(block, "label", "") or key
+    definition = getattr(block, "definition", "")
+    # Reserve the term + its first definition line together, then anchor the
+    # destination at the resolved page/position before drawing.
+    _ensure_space(st, tl.line_height_in(_FS_H3, leading=1.2)
+                  + tl.line_height_in(_FS_BODY) * 2)
+    if key:
+        st.term_dests[key] = {"page": st.page - 1,
+                              "point": [_ML * 72.0, st.y * 72.0]}
+    _place_heading(st, model.Heading(text=str(label), level=3))
+    if definition:
+        _place_text_lines(st, tl.wrap(model._safe_str(definition),
+                          tl.chars_per_line(_USABLE_W, _FS_BODY)),
+                          _FS_BODY, _INK)
+    st.y += _GAP * 0.5
+
+
 _PLACERS = {
    "heading": _place_heading,
    "markdown": _place_markdown,
@@ -497,6 +782,8 @@ _PLACERS = {
    "image": _place_image,
    "caption": _place_caption,
    "note": _place_note,
+    "group": _place_group,
+    "glossary_entry": _place_glossary_entry,
 }


@@ -553,8 +840,42 @@ def render_pdf(chapters: list, out_path: str, meta: dict = None) -> dict:
        return {"path": None, "n_pages": 0, "chapters": [],
                "note": f"fallo al escribir el PDF: {e}"}

+    # Improvement 6: wire clickable glossary links now that the PDF is closed
+    # on disk. PdfPages cannot emit internal hyperlinks, so we post-process
+    # with PyMuPDF (delegated registry function). Degrades silently if absent.
+    n_links = _wire_glossary_links(st, out_path, notes)
+
    note = f"{n_pages} páginas"
+    if n_links:
+        note += f" · {n_links} enlaces de glosario"
    if notes:
        note += " · " + "; ".join(notes)
    return {"path": out_path, "n_pages": n_pages, "chapters": chapters_meta,
            "note": note}
+
+
+def _wire_glossary_links(st: _PdfState, out_path: str, notes: list) -> int:
+    """Build {source rect → glossary dest} links and apply them via PyMuPDF.
+
+    Returns the number of links applied (0 if there is nothing to wire or the
+    post-processor is unavailable). Never raises."""
+    try:
+        links = []
+        for src in st.term_sources:
+            dest = st.term_dests.get(src.get("key"))
+            if not dest:
+                continue
+            links.append({
+                "src_page": src["page"], "src_rect": src["rect"],
+                "dst_page": dest["page"], "dst_point": dest["point"]})
+        if not links:
+            return 0
+        from datascience.add_pdf_internal_links import add_pdf_internal_links
+        res = add_pdf_internal_links(out_path, links)
+        if isinstance(res, dict) and res.get("status") == "ok":
+            return int(res.get("n_links") or 0)
+        if isinstance(res, dict) and res.get("error"):
+            notes.append(f"glosario sin enlaces: {res.get('error')}")
+    except Exception as e:  # noqa: BLE001 — links are best-effort.
+        notes.append(f"glosario sin enlaces: {e}")
+    return 0
@@ -43,6 +43,8 @@ _ACCENT = (0x2A, 0x6F, 0x97)
 _MUTED = (0x8A, 0x8A, 0x8A)
 _HEAD_BG = (0xEE, 0xF3, 0xF6)
 _WHITE = (0xFF, 0xFF, 0xFF)
+_ZEBRA = (0xF6, 0xF8, 0xFA)   # faint grey for even (zebra) data rows.
+_LINK = (0x2A, 0x6F, 0x97)    # accent colour for clickable glossary terms.

 _FS_TITLE = 26
 _FS_H1, _FS_H2, _FS_H3 = 20, 16, 13
@@ -59,6 +61,10 @@ class _PptxState:
        self.chapter = None
        self.slide_no = 0
        self.chapter_slides = 0
+        self.last_heading = ""        # text of the most recent heading.
+        # Glossary wiring (improvement 6): runs to link and per-term target slide.
+        self.term_runs = []           # [(key, run, slide)]
+        self.term_anchor_slide = {}   # key -> Slide (glossary entry)


 def _rgb(c):
@@ -155,9 +161,13 @@ def _add_rich_text(st: _PptxState, rich_lines: list, fs: float, color,
                   indent=0.0, bullet=False) -> None:
    """Add pre-wrapped lines of styled segments as one paragraph per line.

-    Each line is ``[(text, is_bold), ...]``; every segment becomes its own run
-    so ``**bold**`` spans render with native PowerPoint bold (``run.font.bold``)
-    without affecting the measured height (one paragraph per pre-wrapped line).
+    Each line is a list of ``(text, is_bold)`` or ``(text, is_bold, term_key)``
+    segments; every segment becomes its own run so ``**bold**`` spans render with
+    native PowerPoint bold (``run.font.bold``) without affecting the measured
+    height (one paragraph per pre-wrapped line). A segment carrying a
+    ``term_key`` is drawn in the accent colour and its run is recorded in
+    ``st.term_runs`` so it later becomes a native hyperlink jumping to the
+    glossary slide of that term.
    """
    lh = tl.line_height_in(fs)
    height = lh * len(rich_lines) + 0.05
@@ -176,14 +186,20 @@ def _add_rich_text(st: _PptxState, rich_lines: list, fs: float, color,
            r0.text = "•  "
            r0.font.size = Pt(fs)
            r0.font.color.rgb = _rgb(color)
-        for seg_text, is_bold in segs:
+        for seg in segs:
+            if len(seg) == 3:
+                seg_text, is_bold, term = seg
+            else:
+                seg_text, is_bold, term = seg[0], seg[1], None
            if seg_text == "":
                continue
            run = p.add_run()
            run.text = seg_text
            run.font.size = Pt(fs)
            run.font.bold = bool(is_bold)
-            run.font.color.rgb = _rgb(color)
+            run.font.color.rgb = _rgb(_LINK if term else color)
+            if term:
+                st.term_runs.append((term, run, st.slide))
    st.y += height


@@ -191,6 +207,7 @@ def _place_heading(st: _PptxState, block) -> None:
    level = max(1, min(3, int(getattr(block, "level", 1) or 1)))
    fs = {1: _FS_H1, 2: _FS_H2, 3: _FS_H3}[level]
    text = tl.strip_inline_md(getattr(block, "text", ""))
+    st.last_heading = text or st.last_heading
    lines = tl.wrap(text, tl.chars_per_line(_USABLE_W, fs))
    _add_text(st, lines, fs, _INK, bold=True)
    st.y += 0.04
@@ -233,12 +250,12 @@ def _place_markdown(st: _PptxState, block) -> None:
            continue
        if stripped.startswith("- ") or stripped.startswith("* "):
            content = stripped[2:]  # keep inline markers for bold rendering.
-            rich = tl.wrap_rich(content,
-                                tl.chars_per_line(_USABLE_W - 0.3, _FS_BODY))
+            rich = tl.wrap_rich_terms(content,
+                                      tl.chars_per_line(_USABLE_W - 0.3, _FS_BODY))
            _add_rich_text(st, rich, _FS_BODY, _INK, bullet=True)
            i += 1
            continue
-        para = [stripped]  # keep inline markers; wrap_rich renders **bold**.
+        para = [stripped]  # keep inline markers; wrap_rich_terms renders **bold**.
        j = i + 1
        while j < n:
            nxt = md_lines[j].strip()
@@ -247,8 +264,8 @@ def _place_markdown(st: _PptxState, block) -> None:
            para.append(nxt)
            j += 1
        text = " ".join(para)
-        _add_rich_text(st, tl.wrap_rich(text, tl.chars_per_line(_USABLE_W, _FS_BODY)),
-                       _FS_BODY, _INK)
+        _add_rich_text(st, tl.wrap_rich_terms(
+            text, tl.chars_per_line(_USABLE_W, _FS_BODY)), _FS_BODY, _INK)
        i = j
    st.y += _GAP

@@ -295,7 +312,8 @@ def _row_height_in(cells, widths, fs) -> float:
    return lh * maxlines + 0.10


-def _emit_table(st: _PptxState, header, chunk, widths, fs) -> None:
+def _emit_table(st: _PptxState, header, chunk, widths, fs,
+                start_index: int = 0) -> None:
    nrows = len(chunk) + (1 if header else 0)
    ncol = len(widths)
    # Pre-measure total height to size the shape (pptx still auto-grows rows).
@@ -319,11 +337,14 @@ def _emit_table(st: _PptxState, header, chunk, widths, fs) -> None:
            cell.text = model._safe_str(header[c]) if c < len(header) else ""
            _style_cell(cell, fs, _INK, bold=True, fill=_HEAD_BG)
        ridx = 1
-    for r in chunk:
+    # Zebra striping: shade even data rows (1-based) using the GLOBAL row index
+    # (start_index offset) so the pattern stays coherent across split chunks.
+    for k, r in enumerate(chunk):
+        fill = _ZEBRA if (start_index + k) % 2 == 1 else _WHITE
        for c in range(ncol):
            cell = gtable.cell(ridx, c)
            cell.text = model._safe_str(r[c]) if c < len(r) else ""
-            _style_cell(cell, fs, _INK, bold=False, fill=_WHITE)
+            _style_cell(cell, fs, _INK, bold=False, fill=fill)
        ridx += 1
    st.y += total_h + _GAP

@@ -367,6 +388,7 @@ def _place_data_table(st: _PptxState, block, shaded_header=True,
        avail = _remaining(st) - header_h
        chunk = []
        used = 0.0
+        chunk_start = idx  # global index of the first row in this chunk (zebra).
        while idx < n:
            rh = _row_height_in(rows[idx], widths, fs)
            if used + rh > avail and chunk:
@@ -374,7 +396,7 @@ def _place_data_table(st: _PptxState, block, shaded_header=True,
            chunk.append(rows[idx])
            used += rh
            idx += 1
-        _emit_table(st, header, chunk, widths, fs)
+        _emit_table(st, header, chunk, widths, fs, start_index=chunk_start)
    note = getattr(block, "note", None)
    if note:
        _add_text(st, tl.wrap(model._safe_str(note),
@@ -421,54 +443,97 @@ def _resolve_png(block):
                pass


-def _place_picture_bytes(st: _PptxState, data: bytes, caption) -> None:
+def _figure_bytes_cached(block):
+    """Rasterize a figure/image to PNG bytes ONCE and cache (bytes, aspect).
+
+    Measuring (keep-together) and drawing must agree on the real aspect ratio —
+    ``bbox_inches='tight'`` changes it vs ``figsize``, so we rasterize once and
+    reuse the bytes for both. Cached on the block; never raises."""
+    cached = getattr(block, "_aeda_png", None)
+    if cached is not None:
+        return cached
+    kind = getattr(block, "kind", "")
+    data = None
+    if kind == "image":
+        path = getattr(block, "path", "")
+        if path and os.path.exists(path):
+            try:
+                with open(path, "rb") as fh:
+                    data = fh.read()
+            except Exception:  # noqa: BLE001
+                data = None
+    else:
+        data = _resolve_png(block)
+    aspect = 0.66
+    if data is not None:
+        w_px, h_px = _img_size_px(data)
+        aspect = (h_px / w_px) if w_px else 0.66
+    try:
+        block._aeda_png = (data, aspect)
+        return block._aeda_png
+    except Exception:  # noqa: BLE001 — block may reject attributes; degrade.
+        return (data, aspect)
+
+
+def _place_picture_bytes(st: _PptxState, data: bytes, caption,
+                         max_h_in=None) -> None:
+    # Improvement 4: every figure on a slide carries a visible caption/title. If
+    # the block has no caption, fall back to the current section heading, then to
+    # a generic label, so no image is ever shown untitled.
+    caption = (model._safe_str(caption).strip()
+               or model._safe_str(st.last_heading).strip() or "Figura")
    w_px, h_px = _img_size_px(data)
    aspect = (h_px / w_px) if w_px else 0.66
+    # Reserve the caption's REAL (possibly multi-line) height FIRST, then scale
+    # the image to (max_h - cap_reserve): a figure never fills the whole slide,
+    # so its caption always fits on the SAME slide and no image is untitled.
+    # cap_real = what _add_text consumes; cap_reserve adds the post-image gap and
+    # a small cushion so the caption never spills to the next slide.
+    cap_lines = tl.wrap(caption, tl.chars_per_line(_USABLE_W, _FS_NOTE))
+    cap_real = tl.line_height_in(_FS_NOTE) * len(cap_lines) + 0.05
+    cap_reserve = cap_real + 0.05 + 0.10
    max_h = _CONTENT_BOTTOM - _CONTENT_TOP
+    # height_in hint (model.Figure/Image): cap the target height so a figure in a
+    # keep-together Group shrinks to leave room for its heading and text.
+    if isinstance(max_h_in, (int, float)) and max_h_in > 0:
+        max_h = min(max_h, float(max_h_in))
+    max_img_h = max(max_h - cap_reserve, 0.6)
    target_w = _USABLE_W
    target_h = target_w * aspect
-    if target_h > max_h:
-        target_h = max_h
+    if target_h > max_img_h:
+        target_h = max_img_h
        target_w = target_h / aspect if aspect else _USABLE_W
-    cap_h = tl.line_height_in(_FS_NOTE) + 0.05 if caption else 0.0
-    if _remaining(st) < target_h + cap_h:
+    # Keep the image and its caption together on the same slide.
+    if _remaining(st) < target_h + cap_reserve:
        _new_slide(st, cont=True)
    left = _ML + (_USABLE_W - target_w) / 2.0
    st.slide.shapes.add_picture(io.BytesIO(data), Inches(left), Inches(st.y),
                                width=Inches(target_w), height=Inches(target_h))
    st.y += target_h + 0.05
-    if caption:
-        _add_text(st, tl.wrap(model._safe_str(caption),
-                  tl.chars_per_line(_USABLE_W, _FS_NOTE)), _FS_NOTE, _MUTED,
-                  italic=True)
+    _add_text(st, cap_lines, _FS_NOTE, _MUTED, italic=True)
    st.y += _GAP


 def _place_figure(st: _PptxState, block) -> None:
-    png = _resolve_png(block)
+    png, _aspect = _figure_bytes_cached(block)
    if png is None:
        _add_text(st, ["(figura no disponible)"], _FS_NOTE, _MUTED, italic=True)
        st.y += _GAP
        return
-    _place_picture_bytes(st, png, getattr(block, "caption", None))
+    _place_picture_bytes(st, png, getattr(block, "caption", None),
+                         max_h_in=getattr(block, "height_in", None))


 def _place_image(st: _PptxState, block) -> None:
-    path = getattr(block, "path", "")
-    if not path or not os.path.exists(path):
+    data, _aspect = _figure_bytes_cached(block)
+    if data is None:
+        path = getattr(block, "path", "")
        _add_text(st, [f"(imagen no encontrada: {path})"], _FS_NOTE, _MUTED,
                  italic=True)
        st.y += _GAP
        return
-    try:
-        with open(path, "rb") as fh:
-            data = fh.read()
-    except Exception as e:  # noqa: BLE001
-        _add_text(st, [f"(no se pudo leer la imagen: {e})"], _FS_NOTE, _MUTED,
-                  italic=True)
-        st.y += _GAP
-        return
-    _place_picture_bytes(st, data, getattr(block, "caption", None))
+    _place_picture_bytes(st, data, getattr(block, "caption", None),
+                         max_h_in=getattr(block, "height_in", None))


 def _place_caption(st: _PptxState, block) -> None:
@@ -482,6 +547,170 @@ def _place_note(st: _PptxState, block) -> None:
    _place_caption(st, block)


+# --------------------------------------------------------------------------- #
+# Block measurement (mejora 3 — keep-together). Estimate a block's slide height
+# WITHOUT drawing it so a Group can move whole to the next slide before drawing.
+# Over-estimating only triggers an earlier slide break, never a content cut.
+# --------------------------------------------------------------------------- #
+def _measure_heading_text(text: str, level: int) -> float:
+    level = max(1, min(3, int(level or 1)))
+    fs = {1: _FS_H1, 2: _FS_H2, 3: _FS_H3}[level]
+    lines = tl.wrap(tl.strip_inline_md(text), tl.chars_per_line(_USABLE_W, fs))
+    return tl.line_height_in(fs) * len(lines) + 0.05 + 0.04
+
+
+def _measure_markdown(block) -> float:
+    raw = str(getattr(block, "text", "") or "")
+    md_lines = raw.split("\n")
+    h = 0.0
+    i, n = 0, len(md_lines)
+    while i < n:
+        stripped = md_lines[i].strip()
+        if stripped.startswith("|") and stripped.endswith("|"):
+            j = i
+            while j < n and md_lines[j].strip().startswith("|") \
+                    and md_lines[j].strip().endswith("|"):
+                j += 1
+            h += (tl.line_height_in(_FS_CELL) + 0.10) * (j - i) + _GAP
+            i = j
+            continue
+        if stripped == "":
+            h += tl.line_height_in(_FS_BODY) * 0.4
+            i += 1
+            continue
+        if stripped.startswith("### "):
+            h += _measure_heading_text(stripped[4:], 3)
+            i += 1
+            continue
+        if stripped.startswith("## "):
+            h += _measure_heading_text(stripped[3:], 2)
+            i += 1
+            continue
+        if stripped.startswith("# "):
+            h += _measure_heading_text(stripped[2:], 1)
+            i += 1
+            continue
+        if stripped.startswith("- ") or stripped.startswith("* "):
+            lines = tl.wrap_rich_terms(
+                stripped[2:], tl.chars_per_line(_USABLE_W - 0.3, _FS_BODY))
+            h += tl.line_height_in(_FS_BODY) * len(lines) + 0.05
+            i += 1
+            continue
+        para = [stripped]
+        j = i + 1
+        while j < n:
+            nxt = md_lines[j].strip()
+            if nxt == "" or nxt.startswith(("|", "#", "- ", "* ")):
+                break
+            para.append(nxt)
+            j += 1
+        lines = tl.wrap_rich_terms(" ".join(para),
+                                   tl.chars_per_line(_USABLE_W, _FS_BODY))
+        h += tl.line_height_in(_FS_BODY) * len(lines) + 0.05
+        i = j
+    return h + _GAP
+
+
+def _measure_figure_like(block) -> float:
+    max_h = _CONTENT_BOTTOM - _CONTENT_TOP
+    hint = getattr(block, "height_in", None)
+    if isinstance(hint, (int, float)) and hint > 0:
+        max_h = min(max_h, float(hint))
+    # Use the REAL rasterized aspect (cached) so measuring matches drawing — this
+    # is what keeps a figure together with its heading instead of splitting.
+    _data, aspect = _figure_bytes_cached(block)
+    target_h = min(_USABLE_W * aspect, max_h)
+    # Caption is always emitted now (mejora 4), so always reserve its line.
+    cap_h = tl.line_height_in(_FS_NOTE) + 0.05
+    return target_h + 0.05 + cap_h + _GAP
+
+
+def _measure_block(st: _PptxState, block) -> float:
+    kind = getattr(block, "kind", "")
+    try:
+        if kind == "heading":
+            return _measure_heading_text(getattr(block, "text", ""),
+                                         getattr(block, "level", 1))
+        if kind == "markdown":
+            return _measure_markdown(block)
+        if kind in ("figure", "image"):
+            return _measure_figure_like(block)
+        if kind in ("caption", "note"):
+            lines = tl.wrap(getattr(block, "text", ""),
+                            tl.chars_per_line(_USABLE_W, _FS_NOTE))
+            return tl.line_height_in(_FS_NOTE) * len(lines) + 0.05 + _GAP
+        if kind in ("kv_table", "data_table"):
+            rows = getattr(block, "rows", []) or []
+            return (tl.line_height_in(_FS_CELL) + 0.10) * (len(rows) + 1) + _GAP
+        if kind == "group":
+            return sum(_measure_block(st, b)
+                       for b in (getattr(block, "blocks", []) or []))
+    except Exception:  # noqa: BLE001 — a measurement never aborts rendering.
+        pass
+    return tl.line_height_in(_FS_BODY)
+
+
+def _shrink_group_figures(st: _PptxState, blocks: list, avail_full: float) -> None:
+    """Cap each figure's height (via height_in) so the whole group fits a slide.
+
+    The figure shrinks just enough to leave room for its heading, text and
+    caption — that is how keep-together puts a chart on the SAME slide as its
+    title and description instead of pushing it to the next slide."""
+    fig_blocks = [b for b in blocks
+                  if getattr(b, "kind", "") in ("figure", "image")]
+    if not fig_blocks:
+        return
+    nonfig_h = sum(_measure_block(st, b) for b in blocks
+                   if getattr(b, "kind", "") not in ("figure", "image"))
+    fig_overhead = tl.line_height_in(_FS_NOTE) + 0.05 + 0.05 + _GAP
+    budget = avail_full - nonfig_h - 0.10 * len(fig_blocks)
+    if budget <= 1.0:
+        return  # not enough room to keep together; let it flow (degrade).
+    per = budget / len(fig_blocks) - fig_overhead
+    if per <= 0.8:
+        return
+    for fb in fig_blocks:
+        cur = getattr(fb, "height_in", None)
+        fb.height_in = (min(float(cur), per)
+                        if isinstance(cur, (int, float)) and cur > 0 else per)
+
+
+def _place_group(st: _PptxState, block) -> None:
+    """Render a keep-together Group: move it whole to the next slide if needed."""
+    blocks = getattr(block, "blocks", []) or []
+    if not blocks:
+        return
+    avail_full = _CONTENT_BOTTOM - _CONTENT_TOP
+    _shrink_group_figures(st, blocks, avail_full)
+    total = sum(_measure_block(st, b) for b in blocks)
+    if total <= avail_full:
+        if total > _remaining(st):
+            _new_slide(st, cont=True)
+    elif st.y > _CONTENT_TOP + 1e-6:
+        _new_slide(st, cont=True)
+    for b in blocks:
+        placer = _PLACERS.get(getattr(b, "kind", ""), _place_note)
+        try:
+            placer(st, b)
+        except Exception:  # noqa: BLE001 — a bad block never aborts the group.
+            pass
+
+
+def _place_glossary_entry(st: _PptxState, block) -> None:
+    """Render one glossary term and register its slide as the link target."""
+    key = getattr(block, "key", "")
+    label = getattr(block, "label", "") or key
+    definition = getattr(block, "definition", "")
+    _ensure(st, tl.line_height_in(_FS_H3) + tl.line_height_in(_FS_BODY) * 2)
+    if key:
+        st.term_anchor_slide[key] = st.slide
+    _place_heading(st, model.Heading(text=str(label), level=3))
+    if definition:
+        _add_text(st, tl.wrap(model._safe_str(definition),
+                  tl.chars_per_line(_USABLE_W, _FS_BODY)), _FS_BODY, _INK)
+    st.y += _GAP
+
+
 _PLACERS = {
    "heading": _place_heading,
    "markdown": _place_markdown,
@@ -491,6 +720,8 @@ _PLACERS = {
    "image": _place_image,
    "caption": _place_caption,
    "note": _place_note,
+    "group": _place_group,
+    "glossary_entry": _place_glossary_entry,
 }


@@ -542,6 +773,9 @@ def render_pptx(chapters: list, out_path: str, meta: dict = None) -> dict:
            _new_slide(st, cont=False)
            _place_note(st, model.Note(
                "(documento vacío — sin capítulos aplicables)"))
+        # Mejora 6 — wire clickable glossary terms to their entry slide (native
+        # PowerPoint slide-jump). Delegated registry function; degrades silently.
+        n_links = _wire_glossary_links(st, notes)
        prs.save(out_path)
        n_slides = st.slide_no
    except Exception as e:  # noqa: BLE001
@@ -549,7 +783,35 @@ def render_pptx(chapters: list, out_path: str, meta: dict = None) -> dict:
                "note": f"fallo al escribir el PPTX: {e}"}

    note = f"{n_slides} slides"
+    if n_links:
+        note += f" · {n_links} enlaces de glosario"
    if notes:
        note += " · " + "; ".join(notes)
    return {"path": out_path, "n_slides": n_slides, "chapters": chapters_meta,
            "note": note}
+
+
+def _wire_glossary_links(st: _PptxState, notes: list) -> int:
+    """Turn each recorded term run into a native jump to its glossary slide.
+
+    Returns the number of links applied. A run located on the same slide as its
+    glossary entry (source slide is the target slide) is skipped, so an entry
+    never links to itself. Never raises."""
+    if not st.term_runs or not st.term_anchor_slide:
+        return 0
+    linked = 0
+    try:
+        from datascience.pptx_link_run_to_slide import pptx_link_run_to_slide
+    except Exception as e:  # noqa: BLE001
+        notes.append(f"glosario sin enlaces: {e}")
+        return 0
+    for key, run, src_slide in st.term_runs:
+        tgt = st.term_anchor_slide.get(key)
+        if tgt is None or tgt is src_slide:
+            continue
+        try:
+            if pptx_link_run_to_slide(run, src_slide, tgt):
+                linked += 1
+        except Exception:  # noqa: BLE001 — links are best-effort.
+            pass
+    return linked
@@ -24,6 +24,13 @@ import textwrap
 # the visible text matches ``strip_inline_md`` exactly.
 _INLINE_SPAN_RE = re.compile(r"(\*\*.+?\*\*|__.+?__|`.+?`)")

+# Glossary term span: ``[[term:key]]texto visible[[/term]]``. The visible text
+# (which may itself contain ``**bold**``) is kept and tagged with ``key`` so the
+# renderers can turn each appearance into a clickable jump to the glossary entry.
+_TERM_SPAN_RE = re.compile(r"\[\[term:([A-Za-z0-9_]+)\]\](.*?)\[\[/term\]\]",
+                           re.S)
+_TERM_OPEN_RE = re.compile(r"\[\[term:[A-Za-z0-9_]+\]\]")
+

 def avg_char_width_in(fontsize_pt: float) -> float:
    """Approximate average glyph width in inches for a sans-serif font.
@@ -86,11 +93,21 @@ def strip_inline_md(text: str) -> str:
    if not text:
        return ""
    s = str(text)
+    # Drop glossary term markers, keeping the visible inner text.
+    s = _TERM_SPAN_RE.sub(lambda m: m.group(2), s)
+    s = _TERM_OPEN_RE.sub("", s)      # leftover unbalanced open marker.
+    s = s.replace("[[/term]]", "")    # leftover unbalanced close marker.
    for marker in ("**", "__", "`"):
        s = s.replace(marker, "")
    return s


+def _strip_term_markers(s: str) -> str:
+    """Remove any (balanced or leftover) glossary term markers, keeping text."""
+    s = _TERM_OPEN_RE.sub("", s)
+    return s.replace("[[/term]]", "")
+
+
 def _strip_leftover_markers(s: str) -> str:
    """Drop any unbalanced inline markers from a plain (non-span) fragment.

@@ -222,6 +239,118 @@ def wrap_rich(text: str, max_chars: int):
    return lines or [[("", False)]]


+def parse_inline_rich(text: str):
+    """Split ``text`` into ``[(fragment, is_bold, term_key), ...]``.
+
+    Extends :func:`parse_inline_bold` with glossary term spans
+    ``[[term:key]]visible[[/term]]``: the inner ``visible`` text is parsed for
+    ``**bold**`` as usual and every resulting fragment carries ``term_key`` so the
+    renderers can make it clickable. Text outside a term span gets ``term_key =
+    None``. Unbalanced term markers are stripped (kept identical to
+    :func:`strip_inline_md`). The concatenation of all fragment texts equals
+    ``strip_inline_md(text)`` — visible characters and wrapping are unchanged; only
+    the bold flag and the term key are added. Adjacent fragments with the same
+    (bold, term) are merged.
+    """
+    s = "" if text is None else str(text)
+    if not s:
+        return []
+    out = []
+
+    def _emit(fragment: str, bold: bool, term) -> None:
+        if fragment == "":
+            return
+        if out and out[-1][1] == bold and out[-1][2] == term:
+            out[-1] = (out[-1][0] + fragment, bold, term)
+        else:
+            out.append((fragment, bold, term))
+
+    def _emit_bolded(segment: str, term) -> None:
+        # Reuse the bold parser on a term-marker-free segment.
+        for frag, bold in parse_inline_bold(_strip_term_markers(segment)):
+            _emit(frag, bold, term)
+
+    pos = 0
+    for m in _TERM_SPAN_RE.finditer(s):
+        if m.start() > pos:
+            _emit_bolded(s[pos:m.start()], None)
+        _emit_bolded(m.group(2), m.group(1))
+        pos = m.end()
+    if pos < len(s):
+        _emit_bolded(s[pos:], None)
+    return out
+
+
+def wrap_rich_terms(text: str, max_chars: int):
+    """Like :func:`wrap_rich` but preserving glossary term keys per fragment.
+
+    Returns ``list[list[(fragment, is_bold, term_key)]]`` — one inner list per
+    output line. Wrapping is word-aware and hard-splits over-long tokens so no
+    line exceeds ``max_chars`` (the renderers measure these very lines). Term and
+    bold flags never widen a line: the visible width matches :func:`wrap`.
+    """
+    if max_chars < 1:
+        max_chars = 1
+    spans = parse_inline_rich(text)
+    if not spans:
+        return [[("", False, None)]]
+
+    tokens = []  # each: (word, bold, term) or ("\n", None, None)
+    for frag, bold, term in spans:
+        parts = frag.split("\n")
+        for pi, part in enumerate(parts):
+            if pi > 0:
+                tokens.append(("\n", None, None))
+            for word in part.split(" "):
+                if word == "":
+                    continue
+                tokens.append((word, bold, term))
+
+    lines = []
+    cur = []
+    cur_len = 0
+
+    def _flush():
+        nonlocal cur, cur_len
+        merged = []
+        for k, (word, bold, term) in enumerate(cur):
+            piece = word if k == 0 else " " + word
+            if merged and merged[-1][1] == bold and merged[-1][2] == term:
+                merged[-1] = (merged[-1][0] + piece, bold, term)
+            else:
+                merged.append((piece, bold, term))
+        lines.append(merged or [("", False, None)])
+        cur = []
+        cur_len = 0
+
+    for word, bold, term in tokens:
+        if bold is None:  # forced newline
+            _flush()
+            continue
+        if len(word) > max_chars:
+            if cur:
+                _flush()
+            chunks = _hard_split(word, max_chars)
+            for ci, chunk in enumerate(chunks):
+                if ci < len(chunks) - 1:
+                    lines.append([(chunk, bold, term)])
+                else:
+                    cur = [(chunk, bold, term)]
+                    cur_len = len(chunk)
+            continue
+        add = len(word) if cur_len == 0 else cur_len + 1 + len(word)
+        if cur_len != 0 and add > max_chars:
+            _flush()
+            cur = [(word, bold, term)]
+            cur_len = len(word)
+        else:
+            cur.append((word, bold, term))
+            cur_len = add
+    if cur:
+        _flush()
+    return lines or [[("", False, None)]]
+
+
 def parse_md_table(lines: list):
    """Parse consecutive ``| a | b |`` lines into ``(header, rows)`` or None.

@@ -4,10 +4,10 @@ name: column_quality_score
 kind: function
 lang: py
 domain: datascience
-version: "1.0.0"
+version: "2.0.0"
 purity: pure
 signature: "def column_quality_score(col: dict) -> dict"
-description: "Calcula un score de calidad de datos 0-100 para un ColumnProfile del grupo eda, con desglose completeness/validity/consistency y lista de issues legibles. Funcion pura, no muta el input."
+description: "Calcula un score de calidad de datos 0-100 para un ColumnProfile del grupo eda. Combina completeness (0.6) y validity (0.4) con renormalizacion por aplicabilidad; los outliers, columnas constantes e ids NO bajan el score (van a observations). Devuelve desglose por dimension, issues (defectos) y observations (señales analiticas). Funcion pura, no muta el input."
 tags: [eda, data-quality, profiling, scoring, datascience]
 uses_functions: []
 uses_types: []
@@ -17,20 +17,26 @@ error_type: ""
 imports: []
 example: |
  from datascience import column_quality_score
-  col = {"name": "precio", "inferred_type": "float", "null_pct": 0.2,
-         "unique_pct": 0.4, "flags": [], "numeric": {"outlier_pct": 0.08}}
+  col = {"name": "precio", "inferred_type": "numeric", "null_pct": 0.2,
+         "unique_pct": 0.4, "flags": [], "numeric": {"outlier_pct": 8.0}}
  column_quality_score(col)
-  # {"score": 86.8, "completeness": 0.8, "validity": 0.92,
-  #  "consistency": 1.0, "issues": ["20% nulos", "8% outliers"]}
+  # {"score": 88.0, "completeness": 0.8, "validity": 1.0,
+  #  "applicable": ["completeness", "validity"], "issues": ["20% nulos"],
+  #  "observations": ["8% de valores atípicos (z-score>3): ..."]}
 tested: true
 tests:
  - "test_clean_column_high_score"
-  - "test_half_null_lowers_completeness_and_score"
-  - "test_constant_column_flags_issue"
+  - "test_weights_60_40_native_type"
+  - "test_outliers_do_not_penalize_score"
+  - "test_nulls_lower_score_more_than_outliers"
+  - "test_validity_from_parse_rate_lowers_score"
+  - "test_validity_from_match_rate"
+  - "test_free_text_renormalizes_to_completeness_only"
+  - "test_all_null_column_scores_zero"
+  - "test_constant_column_scores_full_and_is_observation"
+  - "test_high_cardinality_id_scores_full_and_is_observation"
+  - "test_mostly_null_no_double_counts_validity"
  - "test_empty_dict_does_not_crash"
-  - "test_outliers_penalize_validity"
-  - "test_mostly_null_flag_halves_validity"
-  - "test_high_cardinality_text_flagged_as_id"
  - "test_none_values_treated_defensively"
  - "test_does_not_mutate_input"
 test_file_path: "python/functions/datascience/column_quality_score_test.py"
@@ -38,16 +44,22 @@ file_path: "python/functions/datascience/column_quality_score.py"
 params:
  - name: col
    desc: >
-      ColumnProfile dict del grupo eda (p.ej. salida de summarize_table_duckdb).
-      Se leen sus claves de forma defensiva con .get(...) y se toleran valores
-      None. Claves usadas: null_pct (0-1), inferred_type, semantic_type,
-      unique_pct (0-1), flags (list[str], reconoce "constant"/"mostly_null"),
-      numeric ({outlier_pct: 0-1, ...}|None) y match_rate (opcional, 0-1).
+      ColumnProfile dict del grupo eda (p.ej. salida de summarize_table_duckdb /
+      profile_table). Se leen sus claves de forma defensiva con .get(...) y se
+      toleran valores None. Claves usadas: null_pct (0-1), n_rows, empty_count
+      (texto), inferred_type, semantic_type, validity_rate (0-1, lo expone
+      profile_table al promocionar texto a numero/fecha), match_rate (0-1),
+      unique_pct (0-1), flags (list[str], reconoce
+      "constant"/"possible_id"/"high_cardinality") y numeric ({outlier_pct: 0-100,
+      skew, ...}|None).
 output: >
-  dict con score (float 0-100, redondeado a 1 decimal), completeness (0-1),
-  validity (0-1), consistency (0-1) e issues (list[str] de descripciones
-  legibles de los problemas detectados). score = round(100 * (0.5*completeness
-  + 0.3*validity + 0.2*consistency), 1).
+  dict con score (float 0-100, 1 decimal), completeness (0-1), validity (0-1 o
+  None si no aplicable), dimensions ({completeness, validity}), applicable
+  (list[str] de dimensiones que entraron en el score), issues (list[str] SOLO de
+  defectos de calidad: nulos, vacios, valores no conformes) y observations
+  (list[str] de señales analiticas que NO bajan el score: outliers, columna
+  constante, posible id, asimetria). score = round(100 * (0.6*completeness +
+  0.4*validity) / suma_pesos_aplicables, 1); cuando validity no aplica, el
+  score se reduce a round(100 * completeness, 1).
 ---

 ## Ejemplo
@@ -59,51 +71,71 @@ from datascience import column_quality_score
 col = {
    "name": "precio",
    "physical_type": "DOUBLE",
-    "inferred_type": "float",
+    "inferred_type": "numeric",
    "semantic_type": "",
-    "count": 800,
    "n_rows": 1000,
    "null_count": 200,
    "null_pct": 0.20,
    "distinct_count": 400,
    "unique_pct": 0.40,
    "flags": [],
-    "numeric": {"outlier_pct": 0.08},
+    "numeric": {"outlier_pct": 8.0, "skew": 0.3},
    "categorical": None,
    "datetime": None,
 }

 column_quality_score(col)
 # {
-#   "score": 86.8,
-#   "completeness": 0.8,    # 1 - 0.20
-#   "validity": 0.92,       # 1 - min(0.08, 0.3)
-#   "consistency": 1.0,
-#   "issues": ["20% nulos", "8% outliers"],
+#   "score": 88.0,            # 100 * (0.6*0.8 + 0.4*1.0)
+#   "completeness": 0.8,      # 1 - 0.20
+#   "validity": 1.0,          # numerica nativa: el tipo es conforme
+#   "dimensions": {"completeness": 0.8, "validity": 1.0},
+#   "applicable": ["completeness", "validity"],
+#   "issues": ["20% nulos"],                       # SOLO defectos de calidad
+#   "observations": ["8% de valores atípicos (z-score>3): ..."],  # NO bajan score
 # }
 ```

 ## Cuando usarla

 Cuando hayas perfilado una tabla con el grupo `eda` (p.ej.
-`summarize_table_duckdb`) y necesites un numero 0-100 por columna para
-ordenar/priorizar limpieza de datos, pintar semaforos de calidad en un
-dashboard, o decidir que columnas descartar antes de modelar. Es la capa de
-scoring sobre el ColumnProfile crudo: lee el perfil, no toca los datos.
+`summarize_table_duckdb` / `profile_table`) y necesites un numero 0-100 por
+columna para ordenar/priorizar limpieza de datos, pintar semaforos de calidad,
+o decidir que columnas descartar antes de modelar. Separa los **defectos de
+calidad reales** (`issues`: nulos, vacios, valores que no parsean a su tipo) de
+las **observaciones analiticas** (`observations`: outliers, columnas constantes,
+ids), que se reportan pero no penalizan. Es la capa de scoring sobre el
+ColumnProfile crudo: lee el perfil, no toca los datos.

-## Notas
+## Gotchas

-Funcion pura, sin I/O ni dependencias externas, no muta `col`. Lee todas las
-claves con `.get(...)` y tolera que vengan en `None` (un ColumnProfile recien
-salido de `summarize_table_duckdb` trae muchas claves a `None`), por lo que
-nunca falla por claves ausentes — un `{}` produce un resultado bien definido.
+Funcion pura, sin I/O, no muta `col`. Aun asi conviene saber:

-Pesos del score: completeness 0.5, validity 0.3, consistency 0.2.
+- **Los outliers NO bajan el score.** Un valor extremo puede ser real y correcto
+  (un cliente que compra mucho); detectar atipicos es analisis de la
+  distribucion, no un juicio de correccion. Salen en `observations`, no en
+  `issues`. Mismo trato para columnas constantes e identificadores de alta
+  cardinalidad: son observaciones, no defectos.
+- **`validity` puede ser `None`** (no aplicable): texto libre sin `semantic_type`
+  ni `validity_rate`, o columna 100% nula. En ese caso el score se renormaliza a
+  solo `completeness` (la columna no se premia ni castiga por algo no medible).
+- **`outlier_pct` se interpreta en escala 0-100** (la que emite
+  `describe_numeric`, z-score>3). Pasar una fraccion 0-1 produce un texto de
+  observacion con el % equivocado, pero NUNCA afecta al score.
+- **`validity_rate` lo puebla `profile_table`** al promocionar una columna de
+  texto a numero/fecha (fraccion que parsea). Si no esta presente y el tipo es
+  nativo numerico/fecha/bool, `validity = 1.0`.
+- Sin doble conteo: la falta de datos cuenta solo en `completeness` (el antiguo
+  castigo de `mostly_null` sobre `validity` se elimino).
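
The renormalization by applicability described in the notes above can be sketched as follows. This is a minimal illustration, not the actual body of `column_quality_score`: the helper name `renormalized_score` and the constant names are hypothetical, and only the 0.6/0.4 weights and the "validity may be `None`" rule are taken from the documented formula.

```python
from typing import Optional

# Base weights from the documented formula (assumed constant names).
W_COMPLETENESS = 0.6
W_VALIDITY = 0.4


def renormalized_score(completeness: float, validity: Optional[float]) -> float:
    """Weighted 0-100 score over the dimensions that are applicable.

    Completeness always applies. Validity is skipped when it is None, and the
    remaining weight is renormalized so the column is neither rewarded nor
    penalized for a dimension that cannot be measured.
    """
    weighted = W_COMPLETENESS * completeness
    total_weight = W_COMPLETENESS
    if validity is not None:
        weighted += W_VALIDITY * validity
        total_weight += W_VALIDITY
    return round(100.0 * weighted / total_weight, 1)
```

With the values from the example column (completeness 0.8, native numeric validity 1.0) this sketch gives 88.0; for a free-text column with the same completeness and no measurable validity it gives 80.0, that is, the score collapses to completeness alone.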

- **completeness** = `1 - null_pct` (None -> 0 nulls -> 1.0).
- **validity**: parte de 1.0 y penaliza `min(outlier_pct, 0.3)` en columnas
-  numericas, `0.5 * (1 - match_rate)` si hay `semantic_type` declarado con
-  `match_rate` bajo disponible, y multiplica por 0.5 si el flag `mostly_null`
-  esta presente.
- **consistency**: 1.0 salvo flag `constant` (-> 0.3, columna poco informativa)
-  o texto con `unique_pct > 0.9` (-> 0.6, posible id de alta cardinalidad).
+## Capability growth log
+
+- v2.0.0 (2026-06-30) — nueva formula de calidad (report 2046): pesos 60/40
+  (completeness/validity) con renormalizacion por aplicabilidad; se elimina la
+  dimension `consistency`-como-informatividad y el doble castigo de
+  `mostly_null`; los outliers/constantes/ids salen del score a `observations`;
+  validity mide conformidad real (parse rate / match rate / tipo nativo). Salida
+  ampliada con `dimensions`, `applicable` y `observations`.
+- v1.0.0 — version inicial: pesos 50/30/20 (completeness/validity/consistency),
+  los outliers penalizaban validity (con bug de escala) y consistency penalizaba
+  informatividad.
@@ -1,34 +1,78 @@
 """Score de calidad de datos (0-100) para un ColumnProfile del grupo eda.

 Funcion pura: dado el perfil de una columna producido por el grupo de
-capacidad `eda` (p.ej. summarize_table_duckdb), calcula un score agregado
-de calidad junto a su desglose en completeness / validity / consistency y
-una lista de issues legibles. No realiza I/O ni muta el input.
+capacidad `eda` (p.ej. summarize_table_duckdb / profile_table), calcula un
+score agregado de calidad junto a su desglose por dimension y dos listas
+legibles separadas: `issues` (defectos de calidad reales que SI bajan el
+score) y `observations` (señales analiticas que NO bajan el score). No
+realiza I/O ni muta el input.
+
+Modelo (DAMA-DMBOK / ISO 8000), ver report 2046:
+
+- Solo entran en el score las dimensiones medibles automaticamente desde el
+  perfil, sin fuente externa de verdad: completeness y validity por columna.
+- Renormalizacion por aplicabilidad: si una dimension no es medible en la
+  columna (texto libre sin semantica -> validity no aplica; columna 100% nula
+  -> validity no medible), se excluye y los pesos se renormalizan sobre las
+  aplicables. Una columna ni se premia ni se castiga por algo no medible.
+- Sin doble conteo: la falta de datos cuenta solo en completeness (se elimino
+  el antiguo castigo extra de `mostly_null` sobre validity).
+- Los OUTLIERS NO bajan la calidad. Un valor extremo puede ser real y
+  correcto; detectar atipicos es analisis de la distribucion, no un juicio de
+  correccion. Outliers, columnas constantes e identificadores de alta
+  cardinalidad pasan a `observations`, nunca a `issues`.
 """


+# Pesos base de las dimensiones de columna (se renormalizan por aplicabilidad).
+_W_COMPLETENESS = 0.6
+_W_VALIDITY = 0.4
+
+# Tipos inferidos cuyo almacen garantiza la conformidad de tipo (validity=1.0)
+# cuando NO vienen de una promocion de texto (en cuyo caso manda validity_rate).
+_NATIVE_TYPED = ("numeric", "integer", "float", "datetime", "date", "boolean", "bool")
+
+
 def column_quality_score(col: dict) -> dict:
    """Calcula un score de calidad de datos 0-100 para un ColumnProfile.

-    El score pondera tres dimensiones:
-      - completeness (0.5): proporcion de valores no nulos.
-      - validity     (0.3): ausencia de outliers / heuristicas de validez.
-      - consistency  (0.2): la columna aporta informacion (no constante, no ruido).
+    El score combina solo dimensiones de calidad medibles desde el perfil, con
+    renormalizacion por aplicabilidad:
+
+      - completeness (peso base 0.6, siempre aplica): proporcion de valores
+        presentes = 1 - null_pct. En texto, las celdas vacias (`empty_count`)
+        tambien cuentan como faltantes.
+      - validity (peso base 0.4, cuando hay un criterio de validacion real):
+        fraccion de valores no nulos conformes a su tipo/semantica. Tipo nativo
+        numerico/fecha/bool = 1.0; texto promovido a numero/fecha = parse rate
+        (`validity_rate`); texto con `semantic_type` regexable = `match_rate`;
+        texto libre o columna 100% nula = NO aplicable (renormaliza a solo
+        completeness).
+
+    Los outliers, columnas constantes, identificadores y asimetria fuerte NO
+    bajan el score: se devuelven en `observations`.

    Args:
        col: ColumnProfile dict del grupo eda. Se leen las claves de forma
            defensiva con .get(...) y se tolera que muchas vengan en None.
-            Claves relevantes: null_pct, inferred_type, semantic_type,
-            unique_pct, flags (list[str]), numeric ({outlier_pct, ...}|None),
-            match_rate (opcional).
+            Claves relevantes: null_pct (0-1), n_rows, empty_count,
+            inferred_type, semantic_type, validity_rate (0-1, lo expone
+            profile_table al promocionar texto a numero/fecha), match_rate
+            (0-1), unique_pct (0-1), flags (list[str], reconoce
+            "constant"/"possible_id"/"high_cardinality"), numeric
+            ({outlier_pct: 0-100, skew, ...}|None).

    Returns:
        dict con:
-          score        (float, 0-100, redondeado a 1 decimal),
-          completeness (float, 0-1),
-          validity     (float, 0-1),
-          consistency  (float, 0-1),
-          issues       (list[str]) descripciones legibles de los problemas.
+          score        (float 0-100, redondeado a 1 decimal),
+          completeness (float 0-1),
+          validity     (float 0-1 | None si no aplicable),
+          dimensions   ({completeness, validity}),
+          applicable   (list[str] de dimensiones que entraron en el score),
+          issues       (list[str]) SOLO defectos de calidad (nulos, vacios,
+                       valores no conformes a su tipo/semantica),
+          observations (list[str]) señales analiticas que NO bajan el score
+                       (outliers, columna constante, posible id, asimetria).
    """
    if not isinstance(col, dict):
        col = {}
@@ -39,103 +83,153 @@ def column_quality_score(col: dict) -> dict:
    flags = set(flags)

    issues: list[str] = []
+    observations: list[str] = []
+
+    inferred_type = col.get("inferred_type") or ""
+    semantic_type = col.get("semantic_type") or ""

    # --- completeness -------------------------------------------------
-    null_pct = col.get("null_pct")
-    if null_pct is None:
-        null_pct = 0.0
-    try:
-        null_pct = float(null_pct)
-    except (TypeError, ValueError):
-        null_pct = 0.0
-    null_pct = _clamp(null_pct, 0.0, 1.0)
+    # Falta de datos = nulos + (en texto) celdas vacias. Es el unico sitio
+    # donde la falta de datos cuenta: nunca se duplica en validity.
+    null_pct = _clamp(_num(col.get("null_pct"), 0.0), 0.0, 1.0)
    completeness = 1.0 - null_pct
    if null_pct > 0:
-        issues.append(f"{round(null_pct * 100)}% nulos")
+        issues.append(f"{_pct(null_pct)} nulos")

-    # --- validity -----------------------------------------------------
-    validity = 1.0
-    inferred_type = col.get("inferred_type") or ""
+    empty_frac = 0.0
+    n_rows = col.get("n_rows")
+    empty_count = col.get("empty_count")
+    if (
+        isinstance(n_rows, (int, float)) and not isinstance(n_rows, bool) and n_rows > 0
+        and isinstance(empty_count, (int, float)) and not isinstance(empty_count, bool)
+        and empty_count > 0
+    ):
+        empty_frac = _clamp(float(empty_count) / float(n_rows), 0.0, 1.0)
+        completeness = _clamp(completeness - empty_frac, 0.0, 1.0)
+        issues.append(f"{_pct(empty_frac)} vacíos")

-    numeric = col.get("numeric")
-    is_numeric = inferred_type in ("integer", "float", "numeric") or isinstance(numeric, dict)
-    if isinstance(numeric, dict):
-        outlier_pct = numeric.get("outlier_pct")
-        if outlier_pct is not None:
-            try:
-                outlier_pct = float(outlier_pct)
-            except (TypeError, ValueError):
-                outlier_pct = 0.0
-            outlier_pct = _clamp(outlier_pct, 0.0, 1.0)
-            if outlier_pct > 0:
-                penalty = min(outlier_pct, 0.3)
-                validity -= penalty
-                issues.append(f"{round(outlier_pct * 100)}% outliers")
-
-    # semantic_type declarado pero con baja tasa de match (si la conocemos).
-    semantic_type = col.get("semantic_type") or ""
-    match_rate = col.get("match_rate")
-    if semantic_type and match_rate is not None:
-        try:
-            match_rate = float(match_rate)
-        except (TypeError, ValueError):
-            match_rate = None
-        if match_rate is not None:
-            match_rate = _clamp(match_rate, 0.0, 1.0)
-            if match_rate < 1.0:
-                shortfall = 1.0 - match_rate
-                validity -= 0.5 * shortfall
-                issues.append(
-                    f"semantic_type '{semantic_type}' con baja coincidencia "
-                    f"({round(match_rate * 100)}%)"
-                )
-
-    if "mostly_null" in flags:
-        validity *= 0.5
-        issues.append("mayoritariamente nula")
-
-    validity = _clamp(validity, 0.0, 1.0)
-
-    # --- consistency --------------------------------------------------
-    consistency = 1.0
-    if "constant" in flags:
-        consistency = 0.3
-        issues.append("columna constante")
+    # --- validity (renormalized by applicability) ---------------------
+    # None = not measurable -> excluded from the score (neither penalizes nor rewards).
+    validity = None
+    if completeness <= 0.0:
+        # Column is 100% missing: there are no non-null values on which to
+        # measure conformity. validity does not apply -> the score comes from
+        # completeness alone (= 0). This is the worst possible quality defect.
+        validity = None
    else:
-        unique_pct = col.get("unique_pct")
-        if unique_pct is not None:
-            try:
-                unique_pct = float(unique_pct)
-            except (TypeError, ValueError):
-                unique_pct = None
-        if (
-            inferred_type == "text"
+        validity_rate = col.get("validity_rate")
+        match_rate = col.get("match_rate")
+        if validity_rate is not None:
+            # Text promoted to number/date: actual parse rate measured on the sample.
+            v = _num(validity_rate, None)
+            if v is not None:
+                validity = _clamp(v, 0.0, 1.0)
+                if validity < 1.0:
+                    kind = (
+                        "número" if inferred_type == "numeric"
+                        else "fecha" if inferred_type == "datetime"
+                        else inferred_type or "su tipo"
+                    )
+                    issues.append(
+                        f"{_pct(1.0 - validity)} no parsea al tipo {kind}"
+                    )
+        elif inferred_type in _NATIVE_TYPED:
+            # Native type guaranteed by the store: every value parses.
+            # validity = 1.0 (not to be confused with having outliers).
+            validity = 1.0
+        elif semantic_type and match_rate is not None:
+            v = _num(match_rate, None)
+            if v is not None:
+                validity = _clamp(v, 0.0, 1.0)
+                if validity < 1.0:
+                    issues.append(
+                        f"{_pct(1.0 - validity)} no casa con el "
+                        f"formato «{semantic_type}»"
+                    )
+        else:
+            # Free text / categorical without semantics: there is no honest
+            # validity criterion. Not applicable.
+            validity = None
+
+    # --- observations (do NOT lower the score) ------------------------
+    numeric = col.get("numeric")
+    if isinstance(numeric, dict):
+        # outlier_pct arrives on a 0-100 scale from describe_numeric (z-score>3).
+        outlier_pct = _num(numeric.get("outlier_pct"), None)
+        if outlier_pct is not None and outlier_pct >= 0.05:
+            observations.append(
+                f"{_pct(outlier_pct / 100.0)} de valores atípicos (z-score>3): "
+                "revisar si son errores u observaciones legítimas"
+            )
+        skew = _num(numeric.get("skew"), None)
+        if skew is not None and abs(skew) >= 1.0:
+            observations.append(
+                f"asimetría fuerte (skew={round(skew, 2)}): considerar "
+                "re-expresión antes de modelar"
+            )
+
+    if "constant" in flags:
+        observations.append(
+            "columna constante: aporta poca información para el análisis"
+        )
+
+    unique_pct = _num(col.get("unique_pct"), None)
+    is_id = (
+        "possible_id" in flags
+        or "high_cardinality" in flags
+        or (
+            inferred_type in ("text", "categorical")
            and unique_pct is not None
            and _clamp(unique_pct, 0.0, 1.0) > 0.9
-        ):
-            consistency = 0.6
-            issues.append("posible id de alta cardinalidad")
-
-    consistency = _clamp(consistency, 0.0, 1.0)
-
-    # --- score agregado ----------------------------------------------
-    score = round(
-        100.0 * (0.5 * completeness + 0.3 * validity + 0.2 * consistency),
-        1,
+        )
    )
+    if is_id:
+        observations.append(
+            "valores casi únicos: posible identificador (no es un defecto de calidad)"
+        )

-    # Silencia warnings sobre la variable de tipo no usada.
-    _ = is_numeric
+    # --- aggregate score with renormalization -------------------------
+    applicable = ["completeness"]
+    num = _W_COMPLETENESS * completeness
+    den = _W_COMPLETENESS
+    if validity is not None:
+        applicable.append("validity")
+        num += _W_VALIDITY * validity
+        den += _W_VALIDITY
+    score = round(100.0 * num / den, 1) if den > 0 else 0.0

    return {
        "score": score,
        "completeness": completeness,
        "validity": validity,
-        "consistency": consistency,
+        "dimensions": {"completeness": completeness, "validity": validity},
+        "applicable": applicable,
        "issues": issues,
+        "observations": observations,
    }


+def _pct(frac: float) -> str:
+    """Format a 0-1 fraction as a percentage: «N%» when >= 1%, one decimal
+    («0.N%») below that, so a small but real defect is not shown as a bare «0%»."""
+    p = frac * 100.0
+    if p >= 1.0:
+        return f"{round(p)}%"
+    return f"{p:.1f}%"
+
+
+def _num(x, default):
+    """Convert x to float; return `default` if x is None, a bool, or not parseable."""
+    if x is None:
+        return default
+    if isinstance(x, bool):
+        return default
+    try:
+        return float(x)
+    except (TypeError, ValueError):
+        return default
+
+
 def _clamp(x: float, lo: float, hi: float) -> float:
    """Recorta x al rango [lo, hi]."""
    if x < lo:
@@ -1,4 +1,12 @@
-"""Tests para column_quality_score."""
+"""Tests for column_quality_score (new formula, report 2046).
+
+Checks the invariants of the quality formula:
+  - completeness (0.6) + validity (0.4), renormalized by applicability.
+  - OUTLIERS do not lower the score (they go to observations, not issues).
+  - Constant columns and ids do not lower the score (observations).
+  - Missing data is not double-counted.
+  - all-null -> score 0; pure function (does not mutate the input).
+"""

 import os
 import sys
@@ -9,11 +17,11 @@ from column_quality_score import column_quality_score


 def _clean_numeric_col() -> dict:
-    """ColumnProfile de una columna numerica sana, sin problemas."""
+    """ColumnProfile of a healthy native numeric column, with no problems."""
    return {
        "name": "edad",
        "physical_type": "INTEGER",
-        "inferred_type": "integer",
+        "inferred_type": "numeric",
        "semantic_type": "",
        "count": 1000,
        "n_rows": 1000,
@@ -28,85 +36,163 @@ def _clean_numeric_col() -> dict:
    }


+# --------------------------------------------------------------------------- #
+# Golden
+# --------------------------------------------------------------------------- #
 def test_clean_column_high_score():
    out = column_quality_score(_clean_numeric_col())
-    assert out["score"] > 90
+    assert out["score"] == 100.0
    assert out["completeness"] == 1.0
    assert out["validity"] == 1.0
-    assert out["consistency"] == 1.0
+    assert out["applicable"] == ["completeness", "validity"]
    assert out["issues"] == []
+    assert out["observations"] == []


-def test_half_null_lowers_completeness_and_score():
+def test_weights_60_40_native_type():
+    """30% nulls in a native numeric column: score = 100*(0.6*0.7 + 0.4*1.0) = 82."""
    col = _clean_numeric_col()
-    col["null_count"] = 500
-    col["null_pct"] = 0.5
-    clean_score = column_quality_score(_clean_numeric_col())["score"]
+    col["null_pct"] = 0.30
+    col["null_count"] = 300
    out = column_quality_score(col)
-    assert out["completeness"] == 0.5
-    assert out["score"] < clean_score
-    assert any("nulos" in issue for issue in out["issues"])
+    assert out["completeness"] == 0.7
+    assert out["validity"] == 1.0
+    assert out["score"] == 82.0
+    assert any("nulos" in i for i in out["issues"])


-def test_constant_column_flags_issue():
+# --------------------------------------------------------------------------- #
+# Outliers OUTSIDE the score
+# --------------------------------------------------------------------------- #
+def test_outliers_do_not_penalize_score():
+    """Column with outliers but no nulls -> maximum score; outliers go to observations."""
+    col = _clean_numeric_col()
+    col["numeric"] = {"outlier_pct": 18.0, "skew": 0.2}  # 18% outliers (0-100 scale)
+    out = column_quality_score(col)
+    assert out["score"] == 100.0  # outliers do NOT lower quality
+    assert out["validity"] == 1.0
+    # They do not show up as a quality problem...
+    assert not any("atípic" in i or "outlier" in i for i in out["issues"])
+    # ...but as an analytical observation.
+    assert any("atípic" in o for o in out["observations"])
+
+
+def test_nulls_lower_score_more_than_outliers():
+    """Nulls do penalize; outliers do not: compare the two columns."""
+    con_nulos = _clean_numeric_col()
+    con_nulos["null_pct"] = 0.30
+    con_outliers = _clean_numeric_col()
+    con_outliers["numeric"] = {"outlier_pct": 30.0}
+    assert column_quality_score(con_nulos)["score"] < \
+        column_quality_score(con_outliers)["score"]
+
+
+# --------------------------------------------------------------------------- #
+# Validity: applicability and renormalization
+# --------------------------------------------------------------------------- #
+def test_validity_from_parse_rate_lowers_score():
+    """Numeric stored as text with 20% garbage: validity=0.8 -> score=92."""
+    col = {
+        "name": "precio_txt", "inferred_type": "numeric", "semantic_type": "decimal",
+        "null_pct": 0.0, "validity_rate": 0.80, "flags": [], "numeric": None,
+    }
+    out = column_quality_score(col)
+    assert out["validity"] == 0.8
+    assert out["score"] == 92.0  # 100*(0.6 + 0.4*0.8)
+    assert any("no parsea" in i for i in out["issues"])
+
+
+def test_validity_from_match_rate():
+    """Text with a semantic_type and 5% non-conforming values: validity=0.95."""
+    col = {
+        "name": "email", "inferred_type": "text", "semantic_type": "email",
+        "null_pct": 0.0, "match_rate": 0.95, "unique_pct": 0.5, "flags": [],
+    }
+    out = column_quality_score(col)
+    assert out["validity"] == 0.95
+    assert out["score"] == 98.0  # 100*(0.6 + 0.4*0.95)
+    assert any("no casa" in i for i in out["issues"])
+
+
+def test_free_text_renormalizes_to_completeness_only():
+    """Free text without semantics: validity does not apply -> score = 100*completeness."""
+    col = {
+        "name": "comentario", "inferred_type": "text", "semantic_type": "",
+        "null_pct": 0.30, "unique_pct": 0.5, "flags": [], "numeric": None,
+    }
+    out = column_quality_score(col)
+    assert out["validity"] is None
+    assert out["applicable"] == ["completeness"]
+    assert out["completeness"] == 0.7
+    assert out["score"] == 70.0  # renormalized to completeness only
+
+
+# --------------------------------------------------------------------------- #
+# Edge cases (report §4.6)
+# --------------------------------------------------------------------------- #
+def test_all_null_column_scores_zero():
+    col = _clean_numeric_col()
+    col["null_pct"] = 1.0
+    col["null_count"] = 1000
+    out = column_quality_score(col)
+    assert out["completeness"] == 0.0
+    assert out["validity"] is None  # not measurable without non-null values
+    assert out["score"] == 0.0
+
+
+def test_constant_column_scores_full_and_is_observation():
+    """Constant column: valid, complete data -> score 100; low information = observation."""
    col = _clean_numeric_col()
    col["flags"] = ["constant"]
    col["distinct_count"] = 1
    col["unique_pct"] = 0.001
    out = column_quality_score(col)
-    assert out["consistency"] == 0.3
-    assert any("constante" in issue for issue in out["issues"])
+    assert out["score"] == 100.0  # low informativeness is NOT penalized
+    assert not any("constante" in i for i in out["issues"])
+    assert any("constante" in o for o in out["observations"])


+def test_high_cardinality_id_scores_full_and_is_observation():
+    """High-cardinality id: near-unique, complete values -> score 100; possible id = observation."""
+    col = {
+        "name": "uuid", "inferred_type": "text", "semantic_type": "",
+        "null_pct": 0.0, "unique_pct": 0.99, "flags": ["possible_id"],
+        "numeric": None,
+    }
+    out = column_quality_score(col)
+    assert out["score"] == 100.0
+    assert not any("identificador" in i for i in out["issues"])
+    assert any("identificador" in o for o in out["observations"])
+
+
+def test_mostly_null_no_double_counts_validity():
+    """85% nulls: only completeness is penalized; native validity stays 1.0 (no double penalty)."""
+    col = _clean_numeric_col()
+    col["null_pct"] = 0.85
+    col["flags"] = ["mostly_null"]
+    out = column_quality_score(col)
+    assert out["validity"] == 1.0  # no longer multiplied by 0.5
+    # score = 100*(0.6*0.15 + 0.4*1.0) = 49
+    assert out["score"] == 49.0
+    assert not any("mayoritariamente" in o for o in out["observations"])
+
+
+# --------------------------------------------------------------------------- #
+# Robustness
+# --------------------------------------------------------------------------- #
 def test_empty_dict_does_not_crash():
    out = column_quality_score({})
    assert isinstance(out["score"], float)
    assert out["completeness"] == 1.0
    assert 0.0 <= out["score"] <= 100.0
    assert isinstance(out["issues"], list)
-
-
-def test_outliers_penalize_validity():
-    col = _clean_numeric_col()
-    col["numeric"] = {"outlier_pct": 0.2}
-    out = column_quality_score(col)
-    assert out["validity"] < 1.0
-    assert any("outliers" in issue for issue in out["issues"])
-
-
-def test_mostly_null_flag_halves_validity():
-    col = _clean_numeric_col()
-    col["null_pct"] = 0.85
-    col["flags"] = ["mostly_null"]
-    out = column_quality_score(col)
-    assert out["validity"] == 0.5
-    assert any("mayoritariamente nula" in issue for issue in out["issues"])
-
-
-def test_high_cardinality_text_flagged_as_id():
-    col = {
-        "name": "uuid",
-        "inferred_type": "text",
-        "semantic_type": "",
-        "null_pct": 0.0,
-        "unique_pct": 0.99,
-        "flags": [],
-        "numeric": None,
-    }
-    out = column_quality_score(col)
-    assert out["consistency"] < 1.0
-    assert any("alta cardinalidad" in issue for issue in out["issues"])
+    assert isinstance(out["observations"], list)


 def test_none_values_treated_defensively():
    col = {
-        "name": "x",
-        "inferred_type": None,
-        "semantic_type": None,
-        "null_pct": None,
-        "unique_pct": None,
-        "flags": None,
-        "numeric": None,
+        "name": "x", "inferred_type": None, "semantic_type": None,
+        "null_pct": None, "unique_pct": None, "flags": None, "numeric": None,
    }
    out = column_quality_score(col)
    assert out["completeness"] == 1.0
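
The renormalization that these tests pin down can be reproduced by hand. The following is a minimal sketch, not the function from the diff: it assumes the weights 0.6/0.4 stated in the commit message and test docstrings, and the name `renormalized_score` is hypothetical. It only mirrors the final aggregation step (numerator and denominator grow together when validity is applicable).

```python
# Hypothetical stand-alone sketch of the aggregation step; weights taken from
# the commit message (0.6 completeness, 0.4 validity).
W_COMPLETENESS = 0.6
W_VALIDITY = 0.4


def renormalized_score(completeness, validity=None):
    """Weighted mean over the applicable dimensions, scaled to 0-100."""
    num = W_COMPLETENESS * completeness
    den = W_COMPLETENESS
    if validity is not None:
        # validity only enters the score when it is measurable
        num += W_VALIDITY * validity
        den += W_VALIDITY
    return round(100.0 * num / den, 1)


cases = {
    "native numeric, 30% nulls": renormalized_score(0.7, 1.0),
    "text promoted to number, 20% unparseable": renormalized_score(1.0, 0.8),
    "free text, 30% nulls": renormalized_score(0.7, None),
    "all-null column": renormalized_score(0.0, None),
}
for label, score in cases.items():
    print(f"{label}: {score}")
```

Dividing by the sum of the applicable weights is what keeps a free-text column with 30% nulls at 70 instead of capping it at 60·0.7 = 42.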
@@ -0,0 +1,85 @@
+---
+name: pptx_link_run_to_slide
+kind: function
+lang: py
+domain: datascience
+version: "1.0.0"
+purity: impure
+signature: "def pptx_link_run_to_slide(run, source_slide, target_slide) -> bool"
+description: "Turns a python-pptx text run into an INTERNAL 'go to slide' hyperlink. python-pptx supports run.hyperlink.address for external URLs but NOT for jumping to another slide of the same deck; this function creates that jump by manipulating the XML: it adds a slide->slide relationship (RT.SLIDE) and an <a:hlinkClick> with action='ppaction://hlinksldjump' and the relationship's r:id, inserted as the first child of the run's <a:rPr> (CT_TextCharacterProperties fixes the child order; first position is schema-valid as long as the rPr has no fill, font or other earlier-ordered children). Idempotent (removes a previous hlinkClick before inserting). Clicking the text in PowerPoint or compatible viewers navigates to target_slide. Built on python-pptx. Never raises: any exception -> return False."
+tags: [eda, pptx, hyperlink, slide-jump, navigation, glossary, automatic-eda, python-pptx, xml, datascience, python]
+uses_functions: []
+uses_types: []
+returns: []
+returns_optional: false
+error_type: "error_go_core"
+imports: ["python-pptx"]
+params:
+  - name: run
+    desc: "the pptx.text.text._Run whose text becomes clickable. Must be a real run (exposes ._r, the <a:r> element). An object without ._r makes the function return False without raising."
+  - name: source_slide
+    desc: "the Slide that contains the run. Its part receives the slide->slide relationship (relate_to with RELATIONSHIP_TYPE.SLIDE); the resulting r:id is referenced in the hlinkClick."
+  - name: target_slide
+    desc: "the destination Slide of the jump. Must belong to the SAME Presentation as source_slide for the internal relationship to be valid."
+output: "bool. True if the internal hyperlink was applied (relationship created + <a:hlinkClick> inserted in the run's rPr); False if an exception prevented it (e.g. an invalid run). Slides from different presentations are not detected and may still return True. Never raises."
+tested: true
+tests: ["test_golden_run_se_vuelve_salto_a_otra_slide", "test_idempotente_reaplica_sin_duplicar_hlinkclick", "test_error_path_run_invalido_devuelve_false_sin_lanzar"]
+test_file_path: "python/functions/datascience/pptx_link_run_to_slide_test.py"
+file_path: "python/functions/datascience/pptx_link_run_to_slide.py"
+---
+
+## Example
+
+```python
+from pptx import Presentation
+from pptx.util import Inches
+from pptx.oxml.ns import qn
+
+from datascience.pptx_link_run_to_slide import pptx_link_run_to_slide
+
+prs = Presentation()
+blank = prs.slide_layouts[6]            # blank layout
+slide0 = prs.slides.add_slide(blank)
+slide1 = prs.slides.add_slide(blank)    # jump target (e.g. the glossary)
+
+box = slide0.shapes.add_textbox(Inches(1), Inches(1), Inches(4), Inches(1))
+run = box.text_frame.paragraphs[0].add_run()
+run.text = "ir al glosario"
+
+ok = pptx_link_run_to_slide(run, slide0, slide1)
+print(ok)  # -> True
+
+# The run now carries <a:rPr><a:hlinkClick action="ppaction://hlinksldjump" r:id="rIdN"/></a:rPr>
+hlink = run._r.get_or_add_rPr().find(qn("a:hlinkClick"))
+print(hlink.get("action"))  # -> ppaction://hlinksldjump
+prs.save("deck_con_salto.pptx")
+```
+
+## When to use it
+
+When you build a PPTX deck with **internal navigation** and want a piece of text to jump to
+another slide when clicked: a **clickable glossary** (each term links to its definition
+slide), a **navigable index/table of contents**, "back to the cover" buttons, or
+cross-references between chapters. It is the piece `python-pptx` does not ship out of the box;
+use it on runs already created by renderers such as `render_automatic_eda_pptx` from the
+`eda` group to enrich the deck with jumps without hand-writing the XML every time.
+
+## Gotchas
+
+- **Impure**: mutates the run's XML and creates a new relationship in the part of `source_slide`.
+- **Only navigates in viewers that honor `ppaction://hlinksldjump`**: PowerPoint and
+  most compatible viewers follow it; some web/lightweight viewers ignore it (the
+  text looks the same but does not jump).
+- **Same Presentation**: `source_slide` and `target_slide` must belong to the same deck.
+  If they come from different presentations, the internal relationship is not valid and
+  the jump will not work (the function may still return True because the relationship
+  was created, but the result in the viewer will not be the expected one).
+- **The `<a:hlinkClick>` lives in the run's `<a:rPr>`**, not as a direct child of `<a:r>`.
+  To locate it: `run._r.get_or_add_rPr().find(qn("a:hlinkClick"))` (a `find` on
+  `run._r` returns `None` because it only looks at direct children of `<a:r>`).
+- **Idempotent**: if the run already had an `hlinkClick` (e.g. an external URL or a
+  previous jump), it is removed before the new one is inserted; a run has at most one click-link.
+- **Never raises**: any exception (run without `._r`, incompatible slides, etc.) is
+  swallowed and `False` is returned. Check the boolean if the jump is critical.
+- **python-pptx dependency**: declared in `python/pyproject.toml`. Run the tests with
+  `~/fn_registry/python/.venv/bin/python3` (it has `python-pptx` installed).
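
The XML shape described in the gotchas can be explored without python-pptx. The following is a minimal sketch using only the standard library's `xml.etree.ElementTree`; the helper `set_slide_jump` and the relationship ids `rId2`/`rId3` are hypothetical, and the real function additionally creates the slide->slide relationship, which is not modelled here. It shows why a `find` on `<a:r>` misses the link and why re-applying leaves a single `hlinkClick`.

```python
import xml.etree.ElementTree as ET

# DrawingML and relationships namespaces used by PresentationML runs.
A = "http://schemas.openxmlformats.org/drawingml/2006/main"
R = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"
HLINK = f"{{{A}}}hlinkClick"


def set_slide_jump(rPr, r_id):
    """Hypothetical helper: replace any click-link in rPr with a slide jump."""
    for existing in rPr.findall(HLINK):
        rPr.remove(existing)  # keeps the operation idempotent
    hlink = ET.Element(
        HLINK, {f"{{{R}}}id": r_id, "action": "ppaction://hlinksldjump"}
    )
    rPr.insert(0, hlink)  # rPr has no other children in this sketch
    return hlink


# <a:r><a:rPr lang="es-ES"/><a:t>ir al glosario</a:t></a:r>
run = ET.Element(f"{{{A}}}r")
rPr = ET.SubElement(run, f"{{{A}}}rPr", {"lang": "es-ES"})
text = ET.SubElement(run, f"{{{A}}}t")
text.text = "ir al glosario"

set_slide_jump(rPr, "rId2")
set_slide_jump(rPr, "rId3")  # re-applying replaces, it does not duplicate

print(run.find(HLINK))           # None: find() only inspects direct children
print(len(rPr.findall(HLINK)))   # 1
print(rPr.find(HLINK).get(f"{{{R}}}id"))  # rId3
```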
@@ -0,0 +1,50 @@
+"""Turn a python-pptx text run into an internal "go to slide" hyperlink.
+
+python-pptx exposes ``run.hyperlink.address`` for external URLs, but does NOT offer a
+public API for jumping to another slide of the same deck. This function creates that
+internal jump by manipulating the XML: it adds a ``slide -> slide`` relationship and an
+``<a:hlinkClick>`` with the action ``ppaction://hlinksldjump`` on the run, so that
+clicking the text in PowerPoint (or in viewers that honor that action) navigates to the
+target slide.
+"""
+
+from pptx.opc.constants import RELATIONSHIP_TYPE as RT
+from pptx.oxml.ns import qn
+
+
+def pptx_link_run_to_slide(run, source_slide, target_slide) -> bool:
+    """Turn a text run into an internal "go to slide" hyperlink.
+
+    Adds a ``slide -> slide`` relationship from the source slide to the part of the
+    target slide and creates an ``<a:hlinkClick>`` with
+    ``action="ppaction://hlinksldjump"`` as the first child of the run's ``<a:rPr>``
+    (valid in the ``CT_TextCharacterProperties`` child order as long as the rPr has no
+    fill, font or other earlier-ordered children). The operation is idempotent: a
+    previous ``hlinkClick`` on the same run is removed before the new one is inserted.
+
+    Args:
+        run: the ``pptx.text.text._Run`` whose text becomes clickable.
+        source_slide: the ``Slide`` that contains the run.
+        target_slide: the destination ``Slide`` of the jump.
+
+    Returns:
+        True if the hyperlink was applied; False if something prevented it (never raises).
+    """
+    try:
+        rId = source_slide.part.relate_to(target_slide.part, RT.SLIDE)
+        rPr = run._r.get_or_add_rPr()
+        # Remove a previous hlinkClick, if any (keeps the call idempotent).
+        for existing in rPr.findall(qn("a:hlinkClick")):
+            rPr.remove(existing)
+        hlink = rPr.makeelement(
+            qn("a:hlinkClick"),
+            {
+                qn("r:id"): rId,
+                "action": "ppaction://hlinksldjump",
+            },
+        )
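+        # Insert a:hlinkClick as the first child of rPr. CT_TextCharacterProperties
+        # orders it after ln/fill/latin/ea/cs/sym, so this is schema-valid only
+        # while rPr has none of those children (true for a freshly added run).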
+        rPr.insert(0, hlink)
+        return True
+    except Exception:
+        return False
@@ -0,0 +1,73 @@
+"""Tests for pptx_link_run_to_slide: internal run -> slide jump.
+
+Self-contained: builds an in-memory Presentation with two blank slides and a
+textbox with one run on slide 0, then checks that the function injects an
+``<a:hlinkClick>`` with ``action="ppaction://hlinksldjump"`` and an ``r:id`` that
+resolves to the part of slide 1.
+"""
+
+import pytest
+
+pytest.importorskip("pptx")
+
+from pptx import Presentation  # noqa: E402
+from pptx.oxml.ns import qn  # noqa: E402
+from pptx.util import Inches  # noqa: E402
+
+from datascience.pptx_link_run_to_slide import pptx_link_run_to_slide  # noqa: E402
+
+
+def _two_slide_deck_with_run():
+    prs = Presentation()
+    blank = prs.slide_layouts[6]  # blank layout
+    slide0 = prs.slides.add_slide(blank)
+    slide1 = prs.slides.add_slide(blank)
+
+    box = slide0.shapes.add_textbox(Inches(1), Inches(1), Inches(4), Inches(1))
+    tf = box.text_frame
+    para = tf.paragraphs[0]
+    run = para.add_run()
+    run.text = "ir al glosario"
+    return prs, slide0, slide1, run
+
+
+def test_golden_run_se_vuelve_salto_a_otra_slide():
+    prs, slide0, slide1, run = _two_slide_deck_with_run()
+
+    ok = pptx_link_run_to_slide(run, slide0, slide1)
+    assert ok is True
+
+    # The hlinkClick is a child of the run's rPr, not a direct child of
+    # the <a:r> element.
+    rPr = run._r.get_or_add_rPr()
+    hlink = rPr.find(qn("a:hlinkClick"))
+    assert hlink is not None
+    assert hlink.get("action") == "ppaction://hlinksldjump"
+
+    rId = hlink.get(qn("r:id"))
+    assert rId, "the hlinkClick must carry a non-empty r:id"
+
+    # The rId must exist in the source slide's relationships and point
+    # to the part of the target slide.
+    rels = slide0.part.rels
+    assert rId in rels
+    assert rels[rId].target_part is slide1.part
+
+
+def test_idempotente_reaplica_sin_duplicar_hlinkclick():
+    prs, slide0, slide1, run = _two_slide_deck_with_run()
+
+    assert pptx_link_run_to_slide(run, slide0, slide1) is True
+    assert pptx_link_run_to_slide(run, slide0, slide1) is True
+
+    rPr = run._r.get_or_add_rPr()
+    hlinks = rPr.findall(qn("a:hlinkClick"))
+    assert len(hlinks) == 1
+
+
+def test_error_path_run_invalido_devuelve_false_sin_lanzar():
+    prs, slide0, slide1, _run = _two_slide_deck_with_run()
+
+    # An object without ._r (not a real run) -> the function does not raise, it returns False.
+    ok = pptx_link_run_to_slide(object(), slide0, slide1)
+    assert ok is False
@@ -3,7 +3,7 @@ name: summarize_table_duckdb
 kind: function
 lang: py
 domain: datascience
-version: "1.0.0"
+version: "1.1.0"
 purity: impure
 signature: "def summarize_table_duckdb(db_path: str, table: str, high_card_ratio: float = 0.9) -> dict"
 description: "Perfila una tabla DuckDB en una sola pasada SQL (SUMMARIZE, push-down sin traer filas a RAM) y devuelve el esqueleto de un TableProfile con el perfil base por columna. Corazon del grupo eda: base barata sobre la que otras funciones anaden lo estadistico fino (skew/kurtosis/histograma sobre muestra)."
@@ -64,6 +64,7 @@ else:
 - **`distinct_count` exacto para tablas <=200k filas, aproximado+capado por encima**: `SUMMARIZE` usa HyperLogLog (`approx_unique`), que SOBREESTIMA y en tablas pequenas puede reportar mas distintos que filas (inflando `unique_pct` por encima de 1.0 y disparando flags `possible_id` falsos). Por eso, para `n_rows <= 200000` la funcion calcula `COUNT(DISTINCT)` EXACTO en una sola query combinada (barata) y usa ese valor. Para tablas mas grandes mantiene `approx_unique` pero lo CAPA a `n_rows` (`distinct_count = min(approx_unique, n_rows)`). En ambos casos `unique_pct = min(distinct_count / n_rows, 1.0)`, asi que `distinct_count` nunca supera las filas ni `unique_pct` pasa de 1.0. Los flags `possible_id` / `high_cardinality` derivan de ese `distinct_count` ya corregido (exacto y fiable por debajo de 200k filas; aproximado y conservador por encima).
 - **`SUMMARIZE` NO da skew, kurtosis ni histograma**, ni percentiles finos (p1/p5/p95/p99), moda, outliers, correlaciones, key_candidates ni quality_score. Esas claves quedan en `None`/`[]` a proposito: las rellena otra funcion del grupo `eda` sobre una muestra. El sub-dict `numeric` solo trae min, max, mean, std, p25, p50, p75.
 - **`SUMMARIZE.count` es el total de filas, no el no-nulo**: la funcion deriva el `count` no-nulo del ColumnProfile como `n_rows - null_count` (con `null_count` redondeado de `null_percentage`).
+- **`duplicate_rows`/`duplicate_pct` are populated push-down** (since v1.1.0) with `count(*)` over `SELECT DISTINCT *` (no rows are brought into RAM): `duplicate_rows = n_rows - distinct_rows`, `duplicate_pct` as a 0-1 fraction. They enable the record-uniqueness dimension of the dataset score (`profile_table` step 6). If the table is empty, or has types that `DISTINCT` cannot compare (BLOB/LIST/MAP) so the query fails, both stay `None` and the dataset score is renormalized to `cell_quality` only.
 - **min/max/avg/std/q25/q50/q75 vienen como strings** desde DuckDB; se convierten a float (None si la columna no es numerica).
 - **Requiere DuckDB 1.5.2** (columnas de `SUMMARIZE` validadas con esa version: column_name, column_type, min, max, approx_unique, avg, std, q25, q50, q75, count, null_percentage).
 - **El identificador de tabla se interpola** (no parametrizable en `SUMMARIZE`): por eso se valida contra `^[A-Za-z_][A-Za-z0-9_]*$` antes de citarlo. Un nombre invalido (p.ej. con `;` o espacios) devuelve `{status:'error'}` sin tocar la base.
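
The duplicate-row computation and its role in the dataset score can be sketched outside DuckDB. The following is an illustration only: it uses the standard library's `sqlite3` as a stand-in engine (the real function goes through `duckdb_query_readonly`), the helper names `duplicate_stats` and `dataset_score` are hypothetical, and the 0.85/0.15 weights are taken from the commit message. How `cell_quality` itself is computed is not shown in this diff and is assumed to be a 0-1 value.

```python
import sqlite3


def duplicate_stats(con, table):
    """Return (duplicate_rows, duplicate_pct); (None, None) for an empty table.

    `table` is assumed to be already validated as a plain identifier.
    """
    n_rows = con.execute(f'SELECT count(*) FROM "{table}"').fetchone()[0]
    if n_rows == 0:
        return None, None
    # The engine deduplicates; only a single count comes back to Python.
    distinct_rows = con.execute(
        f'SELECT count(*) FROM (SELECT DISTINCT * FROM "{table}")'
    ).fetchone()[0]
    duplicate_rows = max(0, n_rows - distinct_rows)
    return duplicate_rows, duplicate_rows / n_rows  # fraction 0-1


def dataset_score(cell_quality, duplicate_pct):
    """Assumed dataset formula: 100*(0.85*cell_quality + 0.15*row_uniqueness)."""
    if duplicate_pct is None:
        # uniqueness not computable -> renormalize to cell_quality only
        return round(100.0 * cell_quality, 1)
    row_uniqueness = 1.0 - duplicate_pct
    return round(100.0 * (0.85 * cell_quality + 0.15 * row_uniqueness), 1)


con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (a INTEGER, b TEXT)")
con.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [(1, "x"), (2, "y"), (1, "x"), (3, "z"), (2, "y")],
)
dup_rows, dup_pct = duplicate_stats(con, "t")
print(dup_rows, dup_pct)
print(dataset_score(0.9, dup_pct), dataset_score(0.9, None))
```

The five-row table matches the one used in `test_duplicate_pct_con_duplicados`, so the duplicate figures should agree with that test's expectations.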
@@ -196,6 +196,21 @@ def summarize_table_duckdb(
            sum(c["null_pct"] for c in columns) / len(columns) if columns else 0.0
        )

+        # Record uniqueness: duplicate rows via a COUNT of distinct rows pushed
+        # down to the engine (DISTINCT *), without bringing rows into RAM. Enables
+        # the uniqueness dimension of the dataset score (1 - duplicate_pct). Falls
+        # back to None if the table has types DISTINCT cannot compare (BLOB/LIST/MAP).
+        duplicate_rows = None
+        duplicate_pct = None
+        if n_rows > 0:
+            dup_res = duckdb_query_readonly(
+                db_path, f"SELECT count(*) AS c FROM (SELECT DISTINCT * FROM {quoted})"
+            )
+            if dup_res["status"] == "ok" and dup_res["rows"]:
+                distinct_rows = int(dup_res["rows"][0]["c"])
+                duplicate_rows = max(0, n_rows - distinct_rows)
+                duplicate_pct = duplicate_rows / n_rows  # fraction 0-1
+
        profile = {
            "table": table,
            "source": "duckdb",
@@ -203,8 +218,8 @@ def summarize_table_duckdb(
            "n_rows": n_rows,
            "n_cols": len(columns),
            "size_bytes": None,
-            "duplicate_rows": None,
-            "duplicate_pct": None,
+            "duplicate_rows": duplicate_rows,
+            "duplicate_pct": duplicate_pct,
            "constant_cols": constant_cols,
            "all_null_cols": all_null_cols,
            "null_cell_pct": null_cell_pct,
@@ -54,6 +54,30 @@ def test_shape_y_metadatos_tabla(db):
    assert profile["correlations"] is None


+def test_duplicate_pct_sin_duplicados(db):
+    """Table where every row is distinct: duplicate_pct = 0, not None."""
+    profile = summarize_table_duckdb(db, "ventas")["profile"]
+    assert profile["duplicate_rows"] == 0
+    assert profile["duplicate_pct"] == 0.0
+
+
+def test_duplicate_pct_con_duplicados(tmp_path):
+    """Repeated rows: duplicate_rows/duplicate_pct are populated push-down."""
+    path = str(tmp_path / "dups.duckdb")
+    con = duckdb.connect(path)
+    con.execute("CREATE TABLE t (a INTEGER, b VARCHAR)")
+    # 5 rows, 2 of them identical to earlier ones -> 2 duplicates out of 5 = 0.4.
+    con.execute(
+        "INSERT INTO t VALUES "
+        "(1,'x'), (2,'y'), (1,'x'), (3,'z'), (2,'y')"
+    )
+    con.close()
+    profile = summarize_table_duckdb(path, "t")["profile"]
+    assert profile["n_rows"] == 5
+    assert profile["duplicate_rows"] == 2
+    assert profile["duplicate_pct"] == 0.4
+
+
 def test_column_profile_shape(db):
    profile = summarize_table_duckdb(db, "ventas")["profile"]
    by_name = {c["name"]: c for c in profile["columns"]}
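The push-down duplicate count exercised by the two tests above can be tried outside the project. The following is a minimal sketch, not the project's code: it uses the standard-library `sqlite3` module as a stand-in for DuckDB (both accept `SELECT DISTINCT *` inside a subquery), and `duplicate_stats` with its identifier quoting is a hypothetical helper introduced only for illustration.

```python
import sqlite3
from typing import Optional, Tuple


def duplicate_stats(
    con: sqlite3.Connection, table: str
) -> Tuple[Optional[int], Optional[float]]:
    """Return (duplicate_rows, duplicate_pct), computed inside the engine.

    Hypothetical helper: sqlite3 stands in for DuckDB here.
    """
    # Quote the identifier so unusual table names do not break the query.
    quoted = '"' + table.replace('"', '""') + '"'
    n_rows = con.execute(f"SELECT count(*) FROM {quoted}").fetchone()[0]
    if n_rows == 0:
        # Same behaviour as the diff: nothing is computed for an empty table.
        return None, None
    # Only a single count crosses the connection; no rows reach Python.
    distinct_rows = con.execute(
        f"SELECT count(*) FROM (SELECT DISTINCT * FROM {quoted})"
    ).fetchone()[0]
    duplicate_rows = max(0, n_rows - distinct_rows)
    return duplicate_rows, duplicate_rows / n_rows  # fraction in [0, 1]


con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (a INTEGER, b TEXT)")
con.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [(1, "x"), (2, "y"), (1, "x"), (3, "z"), (2, "y")],
)
con.execute("CREATE TABLE empty_t (a INTEGER)")

print(duplicate_stats(con, "t"))        # (2, 0.4)
print(duplicate_stats(con, "empty_t"))  # (None, None)
```

Counting distinct rows and subtracting from the total treats every repeat beyond the first occurrence as a duplicate, which is why five rows with three distinct values yield two duplicates.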
@@ -4,7 +4,7 @@ kind: pipeline
 lang: py
 domain: pipelines
 purity: impure
-version: "1.0.0"
+version: "1.1.0"
 signature: "def profile_table(db_path: str, table: str, backend: str = \"duckdb\", sample: int = 5000, run_models: bool = False, run_llm: bool = False, run_series: bool = False, emit_pdf: bool = False, emit_automatic: bool = False, report_dir: str = \"reports\", write_report: bool = True) -> dict"
 description: "One-shot orchestrator for the eda capability group: profiles ONE table (DuckDB or PostgreSQL) end-to-end by composing the group's functions (base SQL profile + read-only sampling + semantic inference + type promotion + numeric/categorical statistics + quality score + correlations with FDR correction + Tukey re-expression + exploratory warnings) and, optionally, cheap models (run_models), LLM interpretation (run_llm), and per-column time-series analysis (run_series: ADF+KPSS stationarity, ACF/PACF, STL, returns). Emits the full TableProfile plus (optionally) a markdown report + JSON sidecar + mobile PDF (emit_pdf). It is the canonical composition for 'give me an EDA of this table'."
 tags: [eda, duckdb, postgres, profiling, data-quality, pipeline, dataops, timeseries]
@@ -114,3 +114,12 @@ para auditar la calidad de una tabla ya productiva. Reemplaza orquestar a mano
  Exotic formats may be silently dropped from the numeric calculation.
 - `db_path` must exist: read-only DuckDB does NOT create the database. Sampling uses the
  default sandbox of `duckdb_query_readonly` (no FS/network access).
+- **Quality score (report 2046, since v1.1.0).** Step 5: each column gets a
+  `quality_score` from `column_quality_score` using the 60/40 formula
+  (completeness/validity); when text is promoted to numeric/datetime,
+  `col["validity_rate"]` (the sample's parse rate) is exposed to feed the
+  validity dimension. Step 6: the dataset score is NOT the simple mean; it is
+  `100 * (0.85*cell_quality + 0.15*row_uniqueness)`, where
+  `cell_quality = mean(score_col/100)` and `row_uniqueness = 1 - duplicate_pct`.
+  If `duplicate_pct` is `None` (the backend does not compute it), the score is
+  renormalized to `cell_quality` alone. Outliers do NOT lower the score (they go to `observations`).
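The step 6 formula described above can be sketched as a standalone function. This is an illustrative sketch, assuming column scores on a 0-100 scale and `duplicate_pct` as a fraction in [0, 1]; `dataset_quality_score` is a hypothetical name, not a function from the repository.

```python
from typing import Optional, Sequence


def dataset_quality_score(
    column_scores: Sequence[float], duplicate_pct: Optional[float]
) -> Optional[float]:
    """Combine per-column scores (0-100) with row uniqueness into one score."""
    if not column_scores:
        return None  # no scored columns, so no dataset score
    # Mean column score rescaled to [0, 1].
    cell_quality = (sum(column_scores) / len(column_scores)) / 100.0
    if duplicate_pct is None:
        # Uniqueness unavailable: renormalize to cell_quality alone.
        return round(100.0 * cell_quality, 1)
    # Clamp so an out-of-range input cannot push the score outside 0-100.
    row_uniqueness = max(0.0, min(1.0, 1.0 - duplicate_pct))
    return round(100.0 * (0.85 * cell_quality + 0.15 * row_uniqueness), 1)


# 100 * (0.85 * 0.9 + 0.15 * 0.8) = 88.5
print(dataset_quality_score([100.0, 80.0], 0.2))   # 88.5
# Without duplicate_pct the score is just the mean column score.
print(dataset_quality_score([100.0, 80.0], None))  # 90.0
```

With these weights a table made entirely of duplicate rows loses at most 15 points, so cell-level quality still dominates the result.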
@@ -477,9 +477,18 @@ def profile_table(
                    if vals and (len(ok) / len(vals)) >= _PROMOTE_MIN_PARSE:
                        col["inferred_type"] = "numeric"
                        inferred = "numeric"
+                        # Actual parse rate of the sample: feeds the validity
+                        # dimension of column_quality_score (fraction of values
+                        # that conform to the promoted numeric type).
+                        col["validity_rate"] = len(ok) / len(vals)
                elif semantic in _DATETIME_SEMANTIC:
                    col["inferred_type"] = "datetime"
                    inferred = "datetime"
+                    # Parse rate of the sample as dates (same role as the
+                    # numeric parse rate) for the validity dimension.
+                    parsed_dt = [_to_ordinal_days(v) for v in vals]
+                    ok_dt = [d for d in parsed_dt if d is not None]
+                    col["validity_rate"] = (len(ok_dt) / len(vals)) if vals else None

            # 4) Enrich according to the final inferred_type.
            if inferred == "numeric":
@@ -506,11 +515,36 @@ def profile_table(
            # 5) Per-column quality score.
            col["quality_score"] = column_quality_score(col).get("score")

-        # 6) Aggregate table score (mean of columns).
+        # 6) Aggregate table score (report 2046): NOT a simple mean.
+        #   cell_quality   = mean of the column scores, in [0,1].
+        #   row_uniqueness = 1 - duplicate_pct (row uniqueness).
+        #   score = 100 * (0.85*cell_quality + 0.15*row_uniqueness).
+        # Renormalizes to cell_quality alone if duplicate_pct could not be computed.
        scores = [
            c["quality_score"] for c in cols if c.get("quality_score") is not None
        ]
-        prof["quality_score"] = round(sum(scores) / len(scores), 1) if scores else None
+        if scores:
+            cell_quality = (sum(scores) / len(scores)) / 100.0
+            dup_pct = prof.get("duplicate_pct")
+            if dup_pct is not None:
+                try:
+                    d = float(dup_pct)
+                except (TypeError, ValueError):
+                    d = None
+            else:
+                d = None
+            if d is not None:
+                # Tolerate a 0-100 scale in case some backend reports it that way.
+                if d > 1.0:
+                    d = d / 100.0
+                row_uniqueness = max(0.0, min(1.0, 1.0 - d))
+                prof["quality_score"] = round(
+                    100.0 * (0.85 * cell_quality + 0.15 * row_uniqueness), 1
+                )
+            else:
+                prof["quality_score"] = round(100.0 * cell_quality, 1)
+        else:
+            prof["quality_score"] = None

        # 7) Key candidates.
        key_candidates = []
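The hunks above record a parse rate as `validity_rate`, which the commit message says feeds a column score of 0.6·completeness + 0.4·validity, falling back to completeness alone when validity is not measurable. A minimal sketch of that relationship follows. `column_quality_score` itself is not part of this diff, so the names `parse_rate` and `column_score`, the use of `float()` as the parser, the [0, 1] input scale, and the one-decimal rounding are all assumptions made for illustration.

```python
from typing import Optional, Sequence


def parse_rate(values: Sequence[str]) -> Optional[float]:
    """Fraction of sampled values that parse as a number; None if no sample.

    Assumption: float() stands in for the project's own numeric parser.
    """
    if not values:
        return None
    ok = 0
    for v in values:
        try:
            float(v)
            ok += 1
        except ValueError:
            pass  # value does not conform to the promoted type
    return ok / len(values)


def column_score(completeness: float, validity: Optional[float]) -> float:
    """60/40 column score on a 0-100 scale, renormalized by applicability."""
    if validity is None:
        # Validity not measurable (free text, all-null column): the 0.6
        # weight is renormalized to 1.0, leaving completeness alone.
        return round(100.0 * completeness, 1)
    return round(100.0 * (0.6 * completeness + 0.4 * validity), 1)


rate = parse_rate(["1", "2.5", "x", "4"])
print(rate)                     # 0.75
print(column_score(0.9, rate))  # 84.0
print(column_score(0.9, None))  # 90.0
```

Passing `None` rather than `0.0` for an unmeasurable validity matters: a zero would penalize free-text columns for a dimension that does not apply to them.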
@@ -25,6 +25,7 @@ dependencies = [
    "polars>=1.40.1",
    "pymeshlab>=2025.7.post1",
    "pymssql>=2.3.13",
+    "pymupdf>=1.28.0",
    "pypdf>=6.10.0",
    "pyproj>=3.7.2",
    "python-docx>=1.2.0",
Author	SHA1	Message	Date
egutierrez	a2074a0167	feat(eda): new data quality formula (report 2046) + quality chapter Implements the report 2046 quality model in the eda group. Column score: 0.6·completeness + 0.4·validity with renormalization by applicability (if validity is not measurable, i.e. free text or a 100% null column, the score is based on completeness alone). Validity = actual conformance to the type: native numeric/date/bool = 1.0; text promoted to number/date = parse rate (validity_rate); text with a semantic_type = match_rate; free text = not applicable. Outliers, constant columns, and identifiers move out of the score into a block of analytical observations (they are not quality defects). Removes the double counting of missing data (mostly_null no longer penalizes validity) and the outlier scale bug (outliers also no longer enter the score). Dataset score: 100·(0.85·cell_quality + 0.15·row_uniqueness) instead of the simple mean. Populates duplicate_rows/duplicate_pct via push-down in summarize_table_duckdb (COUNT over DISTINCT *, no RAM) to enable row uniqueness; renormalizes to cell_quality alone if it cannot be computed. Quality chapter (v2.0.0): two-dimension intro (60/40) stating that outliers do not lower the score; score table Column\|Quality\|Completeness\|Validity (no Consistency, n/a when not applicable); TWO separate tables (Quality problems vs Analytical observations); summary with Row uniqueness; clickable glossary of completeness, validity, row uniqueness, and data quality. Verified: 123 green tests (automatic_eda + render_automatic_eda + column_quality_score + summarize_table_duckdb + profile_table). titanic golden EDA (run_models+run_llm) with the score recomputed by hand, outliers separated into observations, and a clickable glossary (5 GOTO links in the PDF). column_quality_score v2.0.0, summarize_table_duckdb v1.1.0, profile_table v1.1.0. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-30 18:10:23 +02:00
egutierrez	c6d9bc26da	merge: Phase 4a AutomaticEDA engine+glossary (verified met) - fix for bold overlapping text in PDF, zebra striping (PDF+PPT), keep-together (Group: heading+figure+text on the same page/slide), captioned images in PPT - cover built last but shown in position 1 (with an aggregated summary of the body) - glossary chapter at the end + REAL clickable terms: PDF link annotation (add_pdf_internal_links, PyMuPDF) + native PPT hyperlink (pptx_link_run_to_slide); entropy wired into cat_distr as an E2E example - contract docs/automatic_eda_contract.md §11 (glossary + keep-together + zebra) - pymupdf>=1.28.0	2026-06-30 17:45:30 +02:00
egutierrez	d1a3d58a6b	feat(eda): AutomaticEDA engine phase 4a: render fixes + keep-together + clickable glossary Cross-cutting improvements to the render engine (not to chapter content): 1. Fix for bold overlapping text (PDF): _place_rich_lines measures the REAL width of each span with the renderer's font metrics (correct weight) instead of the average-width grid; bold and regular text on the same line no longer overlap. 2. Zebra striping: even rows shaded (#f6f8fa) in DataTable (PDF + PPTX), consistent when long tables are split (logical row index, not per page). 3. Keep-together: new Group block; the renderer measures the whole group and moves it intact to the next page/slide if it does not fit, and shrinks the figure (height_in) to leave room for its title and text. num_distr uses it. 4. Caption always visible on every PPTX figure (falls back to the heading); the figure reserves the height of its caption so both fit on the same slide. 5. Cover built last (with an aggregated summary of the analysis via ctx['document_summary']) but placed first by build_document. 6. Glossary: new chapter (last) + GlossaryCollector in ctx; chapters register terms and mark occurrences with [[term:key]]...[[/term]]. Real clickable links: PDF (PyMuPDF, GOTO link) and PPTX (native slide jump). Wired "entropy" into cat_distr as an end-to-end example. Reusable functions delegated to fn-constructor (tag eda): - add_pdf_internal_links_py_datascience (PyMuPDF) - pptx_link_run_to_slide_py_datascience (slide jump) Contract docs/automatic_eda_contract.md updated (§1/§3/§5 + new §11) with the glossary, keep-together, and zebra API for the next phase. PyMuPDF declared in pyproject. Green suite (90 tests); titanic golden verified. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-30 17:35:19 +02:00
egutierrez	b5334a2e97	merge: Phase 3 AutomaticEDA wiring (verified met) - build_eda_render_ctx: builds ctx (raw_numeric, timeseries_raw, geo_points, db_path+table) from a DuckDB table - render_automatic_eda pipeline: profiles + ctx + build_document -> PDF + PPTX (11 populated chapters) - profile_table: the emit_automatic flag emits the AutomaticEDA report (PDF+PPT) without breaking render_eda_pdf - text_layout: real bold rendering in PDF and PPTX - .claude/commands/eda.md updated The 4 chapters that used to degrade (models/timeseries/geospatial/aggregation) now come out POPULATED end-to-end.	2026-06-30 16:19:52 +02:00