feat(eda): OUTLIERS chapter (univariate + multivariate outliers)

New dedicated `outliers` chapter for the AutomaticEDA engine. It gathers and deepens, in one place, the outlier analysis that is currently scattered between `num_distr` (per-column count) and `modelos` (IsolationForest). It is registered in `chapters_registry.py` between `missingness` and `correlacion` (data-quality block: calidad → missingness → outliers).

Chapter contents:
- Per-column univariate summary: count and % of outliers by Tukey (1.5·IQR) and by z-score (|z| > 3), with lower/upper fences and extreme values. Sorted by contamination, flagging the most affected columns. Reuses the registry functions `build_boxplot_stats` (fences from the profile percentiles) and `detect_outliers` (z-score rule over the raw sample in `ctx`).
- Tukey boxplots of the most contaminated columns (box, whiskers and outlier points), delegated to the new function `build_boxplots_figure`.
- Multivariate: anomalous rows considering all columns at once with `isolation_forest_outliers`: count and % of rows, the most anomalous rows with their score, and the dimensions that make them unusual (top columns by |z|, via the new function `summarize_outlier_dims`). The detector runs live on `raw_numeric` so that its row indexing matches exactly the one used for the dimensions; it falls back to the profile's precomputed block when there is no raw sample (lite preset).
- Exploratory interpretation: an outlier is not necessarily an error (the text distinguishes a data error from a genuine extreme value), plus recommendations (inspect, winsorize or re-express, linking to the profile's Tukey re-expression).

Clickable terms registered in the shared glossary: `outlier`, `tukey_fence`, `zscore`, `isolation_forest`.
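
The two univariate rules named above (Tukey's 1.5·IQR fences and |z| > 3) can be sketched as follows. This is a minimal illustration under stated assumptions, not the registry code: `tukey_fences` and `count_outliers` are hypothetical names, the quartiles come from the raw sample rather than from the profile percentiles, and the population standard deviation is assumed for the z-score.

```python
import statistics


def tukey_fences(values, k=1.5):
    """Return (lower, upper) Tukey fences computed from the sample quartiles."""
    # Hypothetical helper: the chapter itself takes the fences from
    # build_boxplot_stats, whose exact percentile method is not shown here.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def count_outliers(values, z_thresh=3.0):
    """Count outliers under both rules for one numeric column."""
    lower, upper = tukey_fences(values)
    n_tukey = sum(1 for v in values if v < lower or v > upper)

    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # assumption: population std
    # A constant column (std == 0) has no z-score outliers by definition.
    n_z = sum(1 for v in values if std > 0 and abs(v - mean) / std > z_thresh)

    n = len(values)
    return {
        "lower_fence": lower,
        "upper_fence": upper,
        "n_tukey": n_tukey,
        "pct_tukey": 100.0 * n_tukey / n,
        "n_z": n_z,
        "pct_z": 100.0 * n_z / n,
    }
```

Reporting both counts side by side is useful because the rules can disagree: on a tiny sample such as `[1, 2, 3, 4, 100]` the fences flag `100`, while its |z| stays below 3 because the extreme value itself inflates the standard deviation.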

New registry functions (domain datascience, group eda):
- `build_boxplots_figure_py_datascience` (figure helper, impure)
- `summarize_outlier_dims_py_datascience` (pure)

The chapter activates with ≥1 numeric column and returns None when there is none; it reads everything defensively and never raises.

Tests: chapter (golden + edges + error path + PDF/PPTX render) and both new functions. AutomaticEDA non-regression suite green. Verified end-to-end with the Titanic dataset (Fare/Parch/SibSp as the most contaminated columns).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
merge(eda): MISSINGNESS chapter, null patterns (co-occurrence + MCAR/MAR)
2026-06-30 21:12:40 +02:00 · 2026-06-30 20:42:46 +02:00 · 2026-06-30 20:41:29 +02:00 · 2026-06-30 20:39:16 +02:00 · 2026-06-30 20:38:17 +02:00 · 2026-06-30 20:37:01 +02:00
45 changed files with 6359 additions and 6 deletions
@@ -31,7 +31,7 @@ import math

 from .. import model

-CHAPTER_VERSION = "1.0.0"
+CHAPTER_VERSION = "1.1.0"
 CHAPTER_ID = "correlacion"
 CHAPTER_TITLE = "Correlación"

@@ -47,6 +47,13 @@ _MAX_MATRIX_LABELS = 16
 # How many pairs to show in each of the top-positive / top-negative tables.
 _TOP_N = 10

+# How many of the strongest numeric-numeric pairs to draw as scatter plots on
+# each sign (positive / negative). A scatter per pair carries a fitted line/curve
+# and a relationship-type label; keeping the count small keeps the chapter
+# readable on a phone / a slide. Only signed (Pearson/Spearman) pairs qualify —
+# Cramér's V / correlation ratio pairs are not numeric-numeric, so no scatter.
+_SCATTER_TOP_N = 3
+
 # Glossary terms this chapter explains. Each is registered in the shared
 # collector (ctx['glossary']) and marked clickable on its first appearance in the
 # body — the canonical two-step pattern (see ``cat_distr`` for the reference
@@ -314,6 +321,139 @@ def _fdr_text(corr: dict, mark_term: bool = False) -> str | None:
    return " ".join(parts)


+def _is_seq(values) -> bool:
+    """True for a non-empty list/tuple of values (a raw numeric column)."""
+    return isinstance(values, (list, tuple)) and len(values) > 0
+
+
+def _select_scatter_pairs(pairs: list, top_n: int = _SCATTER_TOP_N):
+    """Pick the strongest numeric-numeric pairs to draw as scatters.
+
+    Only signed (Pearson/Spearman) pairs are numeric-numeric and thus eligible
+    for a scatter with a fitted curve. Returns up to ``top_n`` of the strongest
+    positive pairs followed by up to ``top_n`` of the strongest negative ones,
+    each ranked by magnitude. Mixed-type metrics (Cramér's V, correlation ratio,
+    mutual information) are excluded — they have no x/y scatter interpretation.
+    """
+    positive = []
+    negative = []
+    for pair in pairs:
+        if not isinstance(pair, dict) or not _is_signed(pair):
+            continue
+        value = pair.get("value")
+        if not _is_num(value):
+            continue
+        if value > 0:
+            positive.append(pair)
+        elif value < 0:
+            negative.append(pair)
+    positive.sort(key=lambda p: abs(float(p.get("value", 0.0))), reverse=True)
+    negative.sort(key=lambda p: abs(float(p.get("value", 0.0))), reverse=True)
+    return positive[:top_n] + negative[:top_n]
+
+
+def _classification_note(a: str, b: str, cls: dict) -> str:
+    """Human-readable sentence describing the relationship of a pair.
+
+    Plain text (not baked into the figure image) so the type label is selectable
+    in the PDF / extractable by pdftotext, and sits right next to its scatter
+    inside the keep-together Group.
+    """
+    tipo = model._safe_str(cls.get("tipo")) or "sin forma clara"
+    bits = []
+    pearson = cls.get("pearson")
+    spearman = cls.get("spearman")
+    r2_lin = cls.get("r2_linear")
+    r2_poly = None
+    for key in ("r2_poly2", "r2_poly3"):
+        v = cls.get(key)
+        if _is_num(v) and (r2_poly is None or float(v) > r2_poly):
+            r2_poly = float(v)
+    if _is_num(pearson):
+        bits.append(f"Pearson r={float(pearson):+.2f}")
+    if _is_num(spearman):
+        bits.append(f"Spearman ρ={float(spearman):+.2f}")
+    if _is_num(r2_lin):
+        bits.append(f"R² lineal={float(r2_lin):.2f}")
+    if r2_poly is not None:
+        bits.append(f"R² polinómico={r2_poly:.2f}")
+    metrics = "; ".join(bits)
+    text = (f"Relación **{tipo}** entre «{a}» y «{b}»."
+            + (f" {metrics}." if metrics else ""))
+    return text
+
+
+def _scatter_blocks(pairs: list, raw_numeric):
+    """Build keep-together scatter Groups for the strongest num-num pairs.
+
+    Returns a list of blocks (a Heading plus one Group per pair), or an empty
+    list when there is no raw numeric data (e.g. the lite profile drops
+    ``ctx['raw_numeric']`` to skip live recomputation) or the relationship
+    helpers are unavailable. Never raises: any failure degrades to no scatters,
+    leaving the matrix + tables intact.
+    """
+    if not isinstance(raw_numeric, dict) or not raw_numeric:
+        return []
+    selected = _select_scatter_pairs(pairs)
+    if not selected:
+        return []
+
+    # The relationship helpers live in the datascience package. Import lazily so
+    # the chapter still builds (matrix + tables) when they are absent.
+    try:
+        from datascience.classify_relationship_type import (
+            classify_relationship_type,
+        )
+        from datascience.relationship_scatter_figure import (
+            relationship_scatter_figure,
+        )
+    except Exception:  # noqa: BLE001 — degrade, never break the chapter.
+        return []
+
+    groups = []
+    for pair in selected:
+        a = pair.get("a")
+        b = pair.get("b")
+        xs = raw_numeric.get(a)
+        ys = raw_numeric.get(b)
+        # Edge: a selected pair has no raw column (aggregated profile, renamed
+        # column, …) — skip just that pair, keep the rest.
+        if not _is_seq(xs) or not _is_seq(ys):
+            continue
+        try:
+            cls = classify_relationship_type(list(xs), list(ys)) or {}
+        except Exception:  # noqa: BLE001
+            continue
+        a_lbl = model._safe_str(a)
+        b_lbl = model._safe_str(b)
+
+        def _make(xs=xs, ys=ys, a_lbl=a_lbl, b_lbl=b_lbl, cls=cls):
+            return relationship_scatter_figure(
+                list(xs), list(ys), x_label=a_lbl, y_label=b_lbl,
+                classification=cls)
+
+        groups.append(model.Group(blocks=[
+            model.Heading(text=f"{a_lbl} ↔ {b_lbl}", level=2),
+            model.Figure(
+                make=_make,
+                caption=(f"Dispersión de «{a_lbl}» frente a «{b_lbl}» con la "
+                         "curva de ajuste del mejor modelo.")),
+            model.Markdown(text=_classification_note(a_lbl, b_lbl, cls)),
+        ]))
+
+    if not groups:
+        return []
+    intro = model.Markdown(text=(
+        "Para los pares numéricos más fuertes (positivos y negativos) se dibuja "
+        "la nube de puntos con su ajuste y se clasifica el **tipo de relación**: "
+        "**lineal** (una recta basta), **polinómica** (curva de grado 2/3 que "
+        "mejora claramente el ajuste lineal), **monótona no-lineal** (crece o "
+        "decrece siempre pero no en línea recta; Spearman ≫ Pearson) o "
+        "**débil/sin forma**."))
+    return [model.Heading(text="Relaciones más fuertes (scatter)", level=2),
+            intro] + groups
+
+
 def build_correlacion(profile: dict, ctx: dict):
    """Build the Correlation Chapter, or None if there are no pairs to show.

@@ -392,6 +532,18 @@ def build_correlacion(profile: dict, ctx: dict):
            "No se han hallado correlaciones negativas significativas entre "
            "columnas numéricas.")))

+    # 2.5) Scatter plots of the strongest numeric-numeric pairs, each with its
+    # fitted curve and a relationship-type label (lineal / polinómica / monótona
+    # / débil). Needs the raw numeric sample (ctx['raw_numeric'], row-aligned);
+    # when it is absent (aggregated/lite profile) the scatters are simply omitted
+    # and the matrix + tables above stand on their own.
+    raw_numeric = None
+    if isinstance(ctx, dict):
+        raw_numeric = ctx.get("raw_numeric") or profile.get("raw_numeric")
+    else:
+        raw_numeric = profile.get("raw_numeric")
+    blocks.extend(_scatter_blocks(pairs, raw_numeric))
+
    # 3) Spuriousness caveat for level-based correlations (Granger–Newbold).
    caveat = corr.get("levels_caveat")
    if isinstance(caveat, str) and caveat.strip():
@@ -175,6 +175,105 @@ def test_anticorte_matriz_ancha_y_etiquetas_largas_no_se_cortan():
        assert "azufre" in _pdf_text(pdf)


+def _raw_numeric_for_profile(n: int = 80) -> dict:
+    """Row-aligned raw numeric sample matching the signed pairs of _profile().
+
+    Builds columns with a clear, deterministic shape so the relationship-type
+    classifier has something unambiguous to label:
+      - density vs alcohol: strong negative linear (the top-negative pair).
+      - alcohol vs quality: positive linear.
+      - ph, fixed_acidity, sulphates: filler columns for the remaining pairs.
+    """
+    import math as _m
+
+    alcohol = [8.0 + 0.05 * i for i in range(n)]
+    density = [1.0 - 0.002 * a for a in alcohol]           # neg linear vs alcohol
+    quality = [3.0 + 0.4 * a + (0.1 if i % 2 else -0.1)    # pos linear vs alcohol
+               for i, a in enumerate(alcohol)]
+    ph = [3.0 + 0.3 * _m.sin(i / 5.0) for i in range(n)]
+    fixed_acidity = [7.0 - 0.5 * p for p in ph]            # neg linear vs ph
+    sulphates = [0.5 + 0.01 * (i % 7) for i in range(n)]
+    return {
+        "alcohol": alcohol, "density": density, "quality": quality,
+        "ph": ph, "fixed_acidity": fixed_acidity, "sulphates": sulphates,
+    }
+
+
+def test_golden_scatters_de_pares_num_num_con_tipo_de_relacion():
+    """Con ctx['raw_numeric'], el capítulo añade scatters (Figure dentro de Group)
+    de los pares num-num más fuertes, cada uno con su etiqueta de tipo en texto."""
+    from datascience.automatic_eda.model import Group
+
+    ctx = {"raw_numeric": _raw_numeric_for_profile()}
+    ch = build_correlacion(_profile(), ctx)
+    assert ch is not None
+    groups = [b for b in ch.blocks if isinstance(b, Group)]
+    assert groups, "debe emitir al menos un Group con scatter"
+    # Cada Group lleva su figura (lazy) y una nota de texto con el tipo.
+    for g in groups:
+        gkinds = [b.kind for b in g.blocks]
+        assert "figure" in gkinds and "markdown" in gkinds
+    # La sección y la etiqueta de tipo aparecen como texto plano (extraíble).
+    headings = " ".join(b.text for b in ch.blocks if b.kind == "heading")
+    assert "Relaciones más fuertes" in headings
+    body = " ".join(b.text for g in groups for b in g.blocks
+                    if b.kind == "markdown")
+    assert any(t in body for t in
+               ("lineal", "polinómica", "monótona", "sin forma"))
+    # El par num-num más fuerte (density ↔ alcohol) tiene scatter; el par cat-cat
+    # (region ↔ type) NO — no es numérico.
+    assert "density" in body or "alcohol" in body
+    assert "region" not in body and "type" not in body
+
+
+def test_golden_pdf_muestra_scatters_con_etiqueta_de_tipo():
+    """En el PDF, el capítulo Correlación incluye los scatters y su etiqueta de
+    tipo en texto seleccionable (pdftotext la encuentra)."""
+    prof = _profile()
+    ctx = {"raw_numeric": _raw_numeric_for_profile()}
+    with tempfile.TemporaryDirectory() as d:
+        pdf = os.path.join(d, "corr_scatter.pdf")
+        rp = render_automatic_eda_pdf(prof, pdf, {"title": "EDA — wine",
+                                                  "ctx": ctx})
+        assert rp["path"] == pdf and rp["n_pages"] >= 1
+        txt = _pdf_text(pdf)
+        assert "Relaciones" in txt and "scatter" in txt.lower()
+        # Alguna etiqueta de tipo de relación, en texto.
+        assert any(t in txt for t in
+                   ("lineal", "polin", "monóton", "monoton", "sin forma"))
+
+
+def test_edge_sin_raw_numeric_omite_scatters_sin_lanzar():
+    """profile lite / ctx None: sin raw_numeric el capítulo omite los scatters
+    pero sigue emitiendo matriz + tablas (no lanza)."""
+    from datascience.automatic_eda.model import Group
+
+    for ctx in (None, {}, {"raw_numeric": None}, {"raw_numeric": {}}):
+        ch = build_correlacion(_profile(), ctx)
+        assert ch is not None
+        assert not [b for b in ch.blocks if isinstance(b, Group)]
+        # La matriz y al menos una tabla top siguen presentes.
+        assert any(b.kind == "figure" for b in ch.blocks)
+        assert any(b.kind == "data_table" for b in ch.blocks)
+
+
+def test_edge_par_sin_columna_cruda_se_omite_sin_lanzar():
+    """Si un par seleccionado no tiene su columna en raw_numeric, se omite ese
+    par (no lanza); los demás scatters se construyen igual."""
+    from datascience.automatic_eda.model import Group
+
+    raw = _raw_numeric_for_profile()
+    raw.pop("density", None)   # rompe el par density ↔ alcohol
+    ch = build_correlacion(_profile(), {"raw_numeric": raw})
+    assert ch is not None
+    groups = [b for b in ch.blocks if isinstance(b, Group)]
+    body = " ".join(b.text for g in groups for b in g.blocks
+                    if b.kind == "markdown")
+    # density desaparece de los scatters; otros pares (p.ej. ph↔fixed_acidity,
+    # alcohol↔quality) pueden seguir presentes sin error.
+    assert "density" not in body
+
+
 def test_glosario_engancha_metodos_y_fdr():
    """Mejora 4b: los métodos de correlación (Pearson, Spearman, Cramér's V,
    razón de correlación) y la corrección por comparaciones múltiples (FDR) se
@@ -0,0 +1,593 @@
+"""Outliers chapter (OUTLIERS) — univariate + multivariate atypical values.
+
+Today the analysis of atypical values is scattered across the document: the
+NUM DISTR chapter mentions the per-column outlier count inside each distribution
+figure, and the MODELOS chapter runs Isolation Forest as one of several cheap
+models. This chapter gathers and deepens the whole outlier story in a single
+place, with its interpretation: an [[term:outlier]]outlier[[/term]] is **not
+necessarily an error** — it can be a legitimate, extreme but real observation —
+so the reading is exploratory (what to look at), never confirmatory (what to
+delete).
+
+Sections, in order:
+
+1. **Resumen univariante por columna** — for every numeric column, the number
+   and percentage of atypical values by two complementary criteria: Tukey's
+   1.5·IQR rule ([[term:tukey_fence]]vallas de Tukey[[/term]]) and the
+   [[term:zscore]]z-score[[/term]] rule (|z| > 3). The most contaminated columns
+   are flagged. The fences come from the pure registry function
+   ``build_boxplot_stats`` (derived from the profile percentiles); the per-column
+   counts use the raw sample in ``ctx['raw_numeric']`` when available (the exact
+   count), degrading to the profile's own z-score counts otherwise.
+2. **Boxplots** — a single figure with the Tukey boxplots of the most
+   contaminated columns (box, whiskers and atypical points), delegated to the
+   reusable registry helper ``build_boxplots_figure``.
+3. **Multivariante (filas anómalas)** — rows that are atypical considering ALL
+   columns at once, via the registry function ``isolation_forest_outliers``: the
+   count and percentage of anomalous rows, the most anomalous rows with their
+   score, and the dimensions that make each one rare (top columns by |z|, via
+   ``summarize_outlier_dims``). Run live on ``ctx['raw_numeric']`` (the same
+   numeric columns ``summarize_outlier_dims`` uses, so the row indexing stays
+   coherent and the dimension breakdown is correct); falls back to the
+   precomputed ``profile['models']['outliers']`` only when no raw sample is
+   available (e.g. the lite preset), where no per-row breakdown is shown.
+4. **Interpretación** — outlier ≠ error: how to tell a data-entry error from a
+   genuine extreme value, and what to do (inspect, winsorize, or re-express —
+   linking to the Tukey re-expression the profile already computes).
+
+The chapter activates whenever the table has at least one numeric column; with
+no numeric column it returns ``None`` and disappears from the document.
+
+Reads everything defensively (``.get``) and never raises: every registry
+delegation is imported lazily and degraded to an honest note on any failure.
+
+Contract: build_<id>(profile, ctx) -> Chapter | None ; CHAPTER_VERSION = "x.y.z".
+"""
+
+from __future__ import annotations
+
+from .. import model
+
+CHAPTER_VERSION = "1.0.0"
+CHAPTER_ID = "outliers"
+CHAPTER_TITLE = "Valores atípicos"
+
+# z-score threshold for the univariate z rule: |z| > 3 flags a value more than
+# 3 standard deviations from the mean (≈99.7% of a normal lies within ±3σ).
+_Z_THRESH = 3.0
+# How many columns to draw in the boxplots figure (most contaminated first) and
+# how many anomalous rows to list in the multivariate table.
+_TOP_BOX = 12
+_TOP_ROWS = 12
+# Cap on the raw atypical values passed as boxplot fliers, so a heavy-tailed
+# column does not flood the figure with thousands of points.
+_MAX_FLIERS = 200
+# How many columns to flag as "most contaminated" in the summary note.
+_TOP_FLAGGED = 3
+
+# Glossary terms this chapter explains (contract §11.1). Registered in the shared
+# collector and marked clickable on first appearance. ``isolation_forest`` and
+# ``zscore`` may also be registered by the MODELOS chapter — ``add`` is
+# idempotent (first definition wins), so registering them here is harmless and
+# keeps this chapter self-contained when MODELOS does not render.
+_TERM_DEFS = {
+    "outlier": (
+        "Valor atípico (outlier)",
+        "Una observación que se aparta mucho del grueso de los datos. Un atípico "
+        "NO es necesariamente un error: puede ser un fallo de medida o de "
+        "registro, pero también un dato real extremo (un cliente que gasta diez "
+        "veces la media, un día de ventas excepcional). Por eso se señalan para "
+        "revisarlos, no para borrarlos automáticamente.",
+    ),
+    "tukey_fence": (
+        "Vallas de Tukey (1,5·IQR)",
+        "Regla clásica para marcar atípicos a partir de los cuartiles: se calcula "
+        "el rango intercuartílico IQR = P75 − P25 y se trazan dos vallas, una "
+        "inferior en P25 − 1,5·IQR y otra superior en P75 + 1,5·IQR. Los valores "
+        "que caen fuera de esas vallas se consideran atípicos. Es robusta porque "
+        "se apoya en la mediana y los cuartiles, no en la media.",
+    ),
+    "zscore": (
+        "z-score (puntuación típica)",
+        "Mide a cuántas desviaciones típicas está un valor de la media de su "
+        "columna: z = (valor − media) / desviación típica. Un |z| grande (aquí > "
+        "3) señala un valor alejado del centro. A diferencia de las vallas de "
+        "Tukey, el z-score usa media y desviación, así que es más sensible a la "
+        "presencia de los propios atípicos.",
+    ),
+    "isolation_forest": (
+        "Isolation Forest (anomalías multivariantes)",
+        "Algoritmo de detección de anomalías que considera TODAS las columnas a "
+        "la vez: construye árboles que parten el espacio con cortes aleatorios y "
+        "mide cuántos cortes hacen falta para aislar cada fila. Las filas raras "
+        "se aíslan con muy pocos cortes y se marcan como atípicas según un umbral "
+        "de contaminación. Detecta combinaciones de valores poco frecuentes que "
+        "ninguna columna por separado revelaría.",
+    ),
+}
+
+
+# --------------------------------------------------------------------------- #
+# Lazy registry delegations (each degrades to None / no-op on any failure).
+# --------------------------------------------------------------------------- #
+def _load_build_boxplot_stats():
+    try:
+        from datascience.build_boxplot_stats import build_boxplot_stats
+        return build_boxplot_stats
+    except Exception:  # noqa: BLE001
+        return None
+
+
+def _load_detect_outliers():
+    # detect_outliers lives in the monolithic ``datascience.datascience`` module
+    # (file_path datascience.py), not in its own submodule — try both shapes.
+    try:
+        from datascience.datascience import detect_outliers
+        return detect_outliers
+    except Exception:  # noqa: BLE001
+        try:
+            from datascience import detect_outliers
+            return detect_outliers
+        except Exception:  # noqa: BLE001
+            return None
+
+
+def _load_isolation_forest():
+    try:
+        from datascience.isolation_forest_outliers import isolation_forest_outliers
+        return isolation_forest_outliers
+    except Exception:  # noqa: BLE001
+        return None
+
+
+def _load_summarize_dims():
+    try:
+        from datascience.summarize_outlier_dims import summarize_outlier_dims
+        return summarize_outlier_dims
+    except Exception:  # noqa: BLE001
+        return None
+
+
+# --------------------------------------------------------------------------- #
+# Defensive formatters (own copy: the chapter never imports siblings).
+# --------------------------------------------------------------------------- #
+def _fmt_num(value, decimals: int = 3) -> str:
+    if value is None:
+        return "—"
+    if isinstance(value, bool):
+        return "sí" if value else "no"
+    if isinstance(value, int):
+        return f"{value:,}".replace(",", ".")
+    if isinstance(value, float):
+        if value != value:  # NaN
+            return "—"
+        if value in (float("inf"), float("-inf")):
+            return str(value)
+        text = f"{value:.{decimals}f}".rstrip("0").rstrip(".")
+        return text if text else "0"
+    return model._safe_str(value)
+
+
+def _fmt_int(value) -> str:
+    if value is None:
+        return "—"
+    try:
+        return f"{int(round(float(value))):,}".replace(",", ".")
+    except (TypeError, ValueError):
+        return model._safe_str(value)
+
+
+def _fmt_pct(value, decimals: int = 2) -> str:
+    """Format an already-0-100 value as a percentage. None -> placeholder."""
+    if value is None:
+        return "—"
+    try:
+        return f"{float(value):.{decimals}f}%"
+    except (TypeError, ValueError):
+        return model._safe_str(value)
+
+
+def _term(mark: bool, key: str, text: str) -> str:
+    return f"[[term:{key}]]{text}[[/term]]" if mark else text
+
+
+def _is_dict(v) -> bool:
+    return isinstance(v, dict)
+
+
+# --------------------------------------------------------------------------- #
+# Profile reads.
+# --------------------------------------------------------------------------- #
+def _numeric_columns(profile: dict) -> list:
+    """Return [(name, numeric_dict)] for numeric columns with usable stats."""
+    out = []
+    for col in profile.get("columns") or []:
+        if not isinstance(col, dict):
+            continue
+        if col.get("inferred_type") != "numeric":
+            continue
+        num = col.get("numeric")
+        if not isinstance(num, dict) or not num:
+            continue
+        if num.get("mean") is None and num.get("median") is None:
+            continue
+        out.append((col.get("name") or "(columna)", num))
+    return out
+
+
+def _clean_values(raw):
+    """Return the finite float values of a raw column list (drop None/NaN/inf)."""
+    if not isinstance(raw, (list, tuple)):
+        return None
+    vals = []
+    for v in raw:
+        if v is None or isinstance(v, bool):
+            continue
+        try:
+            f = float(v)
+        except (TypeError, ValueError):
+            continue
+        if f != f or f in (float("inf"), float("-inf")):
+            continue
+        vals.append(f)
+    return vals
+
+
+# --------------------------------------------------------------------------- #
+# Per-column univariate summary.
+# --------------------------------------------------------------------------- #
+def _univariate_row(name, numeric, raw_vals, box_fn, detect_fn):
+    """Compute one univariate summary row + boxplot inputs for a column.
+
+    Returns a dict with the table cells and, when raw values are available, the
+    exact Tukey/z counts and the list of atypical (flier) values; otherwise it
+    degrades to the profile's own z-score counts and the fence flags.
+    """
+    box = {}
+    if box_fn is not None:
+        try:
+            box = box_fn(numeric) or {}
+        except Exception:  # noqa: BLE001
+            box = {}
+    lf = box.get("lower_fence")
+    uf = box.get("upper_fence")
+
+    vals = _clean_values(raw_vals)
+    n_tukey = pct_tukey = None
+    n_z = pct_z = None
+    low_extreme = high_extreme = None
+    fliers = []
+    contamination = None  # metric used to rank columns (prefer Tukey %).
+
+    if vals:
+        n = len(vals)
+        tukey_out = []
+        for v in vals:
+            below = (lf is not None and v < lf)
+            above = (uf is not None and v > uf)
+            if below or above:
+                tukey_out.append(v)
+        n_tukey = len(tukey_out)
+        pct_tukey = 100.0 * n_tukey / n if n else None
+        if tukey_out:
+            low_extreme = min(tukey_out)
+            high_extreme = max(tukey_out)
+            fliers = tukey_out[:_MAX_FLIERS]
+        # z-score rule via the registry function (returns parallel bools).
+        if detect_fn is not None:
+            try:
+                flags = detect_fn(vals, _Z_THRESH) or []
+                n_z = int(sum(1 for b in flags if b))
+                pct_z = 100.0 * n_z / n if n else None
+            except Exception:  # noqa: BLE001
+                n_z = pct_z = None
+        contamination = pct_tukey
+    else:
+        # Degrade: no raw sample for this column. The profile's own outlier
+        # count/pct come from the z-score block (build_boxplot_stats note); the
+        # Tukey count is unknown, only the fence flags are.
+        n_z = numeric.get("n_outliers")
+        pct_z = numeric.get("outlier_pct")
+        if box.get("has_low_outliers") and box.get("min") is not None:
+            low_extreme = box.get("min")
+        if box.get("has_high_outliers") and box.get("max") is not None:
+            high_extreme = box.get("max")
+        contamination = pct_z if isinstance(pct_z, (int, float)) else None
+
+    # Compact "extremos atípicos" cell: down/up arrows for the low/high tail.
+    extremes = []
+    if low_extreme is not None:
+        extremes.append(f"↓ {_fmt_num(low_extreme)}")
+    if high_extreme is not None:
+        extremes.append(f"↑ {_fmt_num(high_extreme)}")
+    extremes_cell = "  ".join(extremes) if extremes else "—"
+
+    return {
+        "name": model._safe_str(name),
+        "n_tukey": n_tukey,
+        "pct_tukey": pct_tukey,
+        "n_z": n_z,
+        "pct_z": pct_z,
+        "lower_fence": lf,
+        "upper_fence": uf,
+        "extremes": extremes_cell,
+        "box": box,
+        "fliers": fliers,
+        "has_raw": bool(vals),
+        "contamination": contamination if isinstance(contamination, (int, float)) else -1.0,
+    }
+
+
+def _univariate_table(rows: list) -> model.DataTable:
+    header = ["Columna", "Atípicos Tukey", "% Tukey", "Atípicos z", "% z",
+              "Valla inf.", "Valla sup.", "Extremos atípicos"]
+    table_rows = []
+    for r in rows:
+        table_rows.append([
+            r["name"],
+            _fmt_int(r["n_tukey"]) if r["n_tukey"] is not None else "—",
+            _fmt_pct(r["pct_tukey"]) if r["pct_tukey"] is not None else "—",
+            _fmt_int(r["n_z"]) if r["n_z"] is not None else "—",
+            _fmt_pct(r["pct_z"]) if r["pct_z"] is not None else "—",
+            _fmt_num(r["lower_fence"]),
+            _fmt_num(r["upper_fence"]),
+            r["extremes"],
+        ])
+    return model.DataTable(
+        header=header, rows=table_rows,
+        title="Valores atípicos por columna",
+        note="Tukey = fuera de las vallas 1,5·IQR · z = |z-score| > 3 · "
+             "ordenado de más a menos contaminada")
+
+
+# --------------------------------------------------------------------------- #
+# Multivariate (Isolation Forest) section.
+# --------------------------------------------------------------------------- #
+def _resolve_multivariate(profile: dict, ctx: dict, raw_numeric):
+    """Return (outliers_dict_or_None, source).
+
+    Prefers a LIVE Isolation Forest over ``raw_numeric`` so the detector and
+    ``summarize_outlier_dims`` use EXACTLY the same numeric columns and the same
+    valid-row indexing — otherwise the precomputed ``profile['models']
+    ['outliers']`` (run by MODELOS over a possibly different column subset) would
+    yield ``row_index`` values that no longer point at the rows
+    ``summarize_outlier_dims`` reconstructs, mislabelling the "dimensions that
+    make each row rare". Falls back to the precomputed block when no raw sample
+    is available (e.g. the lite preset drops ``raw_numeric``)."""
+    if _is_dict(raw_numeric) and raw_numeric:
+        iso = _load_isolation_forest()
+        if iso is not None:
+            try:
+                out = iso(raw_numeric)
+                if _is_dict(out) and out.get("n_outliers") is not None and out.get("n_rows_used"):
+                    return out, "live"
+            except Exception:  # noqa: BLE001
+                pass
+    # Fallback: the model the MODELOS chapter already computed (no raw sample to
+    # recompute against, so no per-row dimension breakdown either).
+    models = profile.get("models") if _is_dict(profile.get("models")) else {}
+    pre = models.get("outliers") if _is_dict(models) else None
+    if _is_dict(pre) and pre.get("n_outliers") is not None and pre.get("n_rows_used"):
+        return pre, "precomputed"
+    return None, "none"
+
+
+def _multivariate_blocks(outliers: dict, raw_numeric, mark: bool) -> list:
+    isof = _term(mark, "isolation_forest", "**Isolation Forest**")
+    blocks = [
+        model.Heading(text="Filas atípicas (multivariante)", level=2),
+        model.Markdown(text=(
+            f"Hasta aquí cada columna se ha mirado por separado. {isof} busca "
+            "filas raras considerando **todas las columnas a la vez**: una fila "
+            "puede ser normal en cada variable y aun así ser atípica por la "
+            "**combinación** de sus valores (p. ej. una edad baja con una tarifa "
+            "muy alta). La tabla resume cuántas filas se marcaron y el umbral de "
+            "decisión.")),
+        model.KVTable(rows=[
+            ("Filas analizadas", _fmt_int(outliers.get("n_rows_used"))),
+            ("Columnas consideradas", _fmt_int(outliers.get("n_features"))),
+            ("Filas atípicas", _fmt_int(outliers.get("n_outliers"))),
+            ("% filas atípicas", _fmt_pct(outliers.get("outlier_pct"))),
+            ("Umbral de decisión", _fmt_num(outliers.get("threshold"), 4)),
+        ], title="Anomalías multivariantes"),
+    ]
+
+    rows_in = outliers.get("outlier_rows") or []
+    if not rows_in:
+        return blocks
+
+    # Enrich each anomalous row with the dimensions that make it rare, when the
+    # raw sample is available (summarize_outlier_dims reconstructs the same
+    # valid-row indexing as isolation_forest_outliers).
+    dims_by_row = {}
+    if _is_dict(raw_numeric) and raw_numeric:
+        summ = _load_summarize_dims()
+        if summ is not None:
+            try:
+                enriched = summ(raw_numeric, rows_in, top_k=3) or []
+                for e in enriched:
+                    if _is_dict(e) and e.get("row_index") is not None:
+                        dims_by_row[e.get("row_index")] = e.get("dims") or []
+            except Exception:  # noqa: BLE001
+                dims_by_row = {}
+
+    has_dims = bool(dims_by_row)
+    header = ["Fila (entre válidas)", "Score"]
+    if has_dims:
+        header.append("Dimensiones que la hacen rara (col = valor, z)")
+    table_rows = []
+    for r in rows_in[:_TOP_ROWS]:
+        if not _is_dict(r):
+            continue
+        ridx = r.get("row_index")
+        cells = [_fmt_int(ridx), _fmt_num(r.get("score"), 4)]
+        if has_dims:
+            dims = dims_by_row.get(ridx) or []
+            parts = []
+            for d in dims:
+                if not _is_dict(d):
+                    continue
+                parts.append(
+                    f"{model._safe_str(d.get('col'))} = {_fmt_num(d.get('value'))} "
+                    f"(z {_fmt_num(d.get('z'), 2)})")
+            cells.append("; ".join(parts) if parts else "—")
+        table_rows.append(cells)
+
+    if table_rows:
+        shown = len(table_rows)
+        total = outliers.get("n_outliers")
+        note = "las filas más anómalas primero (score más bajo = más rara)"
+        if isinstance(total, int) and total > shown:
+            note += f" — top {shown} de {total}"
+        if not has_dims:
+            note += (" · no se pudo recuperar la muestra cruda para explicar las "
+                     "dimensiones de cada fila")
+        blocks.append(model.DataTable(
+            header=header, rows=table_rows,
+            title="Filas más atípicas", note=note))
+    return blocks
+
+
+# --------------------------------------------------------------------------- #
+# Interpretation section.
+# --------------------------------------------------------------------------- #
+def _interpretation_block(mark: bool) -> model.Markdown:
+    outlier = _term(mark, "outlier", "atípico")
+    text = (
+        f"**Un {outlier} no es necesariamente un error.** Conviene distinguir "
+        "dos casos antes de actuar:\n\n"
+        "- **Error de dato** (medida, registro o unidad equivocada): una edad de "
+        "200 años, un importe negativo donde no puede haberlo, un decimal "
+        "desplazado. Estos sí se corrigen o se eliminan, idealmente en el origen.\n"
+        "- **Dato real extremo**: una observación legítima de la cola de la "
+        "distribución (un cliente que gasta mucho más, una tarifa de lujo, un día "
+        "de ventas excepcional). Borrarla sesga el análisis y oculta información "
+        "valiosa.\n\n"
+        "**Qué hacer.** Primero, **revisar** los valores señalados arriba contra "
+        "su origen para decidir cuál de los dos casos es. Si son errores, "
+        "corregirlos. Si son datos reales que distorsionan medias y modelos, hay "
+        "alternativas a borrarlos: **winsorizar** (recortar los extremos a un "
+        "percentil), o **re-expresar** la variable (por ejemplo una "
+        "transformación logarítmica o la escalera de re-expresión de Tukey que "
+        "este mismo perfil ya calcula para las columnas asimétricas), que suele "
+        "domar la cola sin perder ninguna fila. La elección depende del objetivo: "
+        "esta lectura es **exploratoria** —orienta dónde mirar—, no una regla "
+        "automática de limpieza.")
+    return model.Markdown(text=text)
+
+
+# --------------------------------------------------------------------------- #
+# Entry point.
+# --------------------------------------------------------------------------- #
+def build_outliers(profile: dict, ctx: dict):
+    """Build the OUTLIERS Chapter, or None if the dataset has no numeric column."""
+    profile = profile or {}
+    ctx = ctx or {}
+    if not isinstance(profile, dict):
+        return None
+
+    numerics = _numeric_columns(profile)
+    if not numerics:
+        return None  # chapter does not apply to a dataset with no numerics.
+
+    # Register glossary terms (if a collector is present) and mark them clickable.
+    glossary = ctx.get("glossary")
+    mark = False
+    if isinstance(glossary, model.GlossaryCollector):
+        for key, (label, definition) in _TERM_DEFS.items():
+            glossary.add(key, label, definition)
+        mark = True
+
+    raw_numeric = ctx.get("raw_numeric")
+    raw_numeric = raw_numeric if isinstance(raw_numeric, dict) else {}
+
+    box_fn = _load_build_boxplot_stats()
+    detect_fn = _load_detect_outliers()
+
+    # --- Univariate summary ------------------------------------------------- #
+    uni_rows = []
+    for name, numeric in numerics:
+        uni_rows.append(_univariate_row(
+            name, numeric, raw_numeric.get(name), box_fn, detect_fn))
+    # Rank columns by contamination (Tukey % when available, else z %).
+    uni_rows.sort(key=lambda r: r.get("contamination", -1.0), reverse=True)
+
+    intro = (
+        "Este capítulo reúne en un solo sitio el análisis de los **valores "
+        "atípicos** de la tabla, que en el resto del informe aparecen dispersos. "
+        f"Un {_term(mark, 'outlier', 'atípico')} es una observación que se aparta "
+        "mucho del grueso de los datos. Cada columna numérica se evalúa con dos "
+        f"criterios complementarios: las {_term(mark, 'tukey_fence', 'vallas de Tukey')} "
+        "(fuera de P25−1,5·IQR o P75+1,5·IQR, robusto a la propia cola) y el "
+        f"{_term(mark, 'zscore', 'z-score')} (|z| > 3, sensible a la media). La "
+        "tabla está ordenada de la columna más contaminada a la menos.")
+
+    blocks = [
+        model.Heading(text=CHAPTER_TITLE, level=1),
+        model.Markdown(text=intro),
+        _univariate_table(uni_rows),
+    ]
+
+    # Flag the most contaminated columns explicitly.
+    flagged = [r["name"] for r in uni_rows
+               if r.get("contamination", -1.0) > 0][:_TOP_FLAGGED]
+    if flagged:
+        names = ", ".join(f"**{n}**" for n in flagged)
+        blocks.append(model.Markdown(text=(
+            f"Las columnas con mayor proporción de atípicos son {names}: "
+            "concentran el grueso de los valores fuera de las vallas y son las "
+            "primeras a revisar.")))
+
+    # --- Boxplots figure ---------------------------------------------------- #
+    box_entries = [
+        {"name": r["name"], "box": r["box"], "fliers": r.get("fliers")}
+        for r in uni_rows
+        if r.get("box")
+    ][:_TOP_BOX]
+    if box_entries:
+        def _boxplots_make(entries=box_entries):
+            try:
+                from datascience.build_boxplots_figure import build_boxplots_figure
+                return build_boxplots_figure(
+                    entries, title="Boxplots de Tukey por columna",
+                    max_boxes=_TOP_BOX)
+            except Exception:  # noqa: BLE001 — minimal fallback figure.
+                import matplotlib
+                matplotlib.use("Agg")
+                from matplotlib.figure import Figure
+                fig = Figure(figsize=(5.0, 2.2))
+                ax = fig.add_subplot(111)
+                ax.text(0.5, 0.5, "(boxplots no disponibles)",
+                        ha="center", va="center")
+                ax.axis("off")
+                return fig
+
+        blocks.append(model.Group(blocks=[
+            model.Heading(text="Boxplots", level=2),
+            model.Markdown(text=(
+                "Cada caja abarca del primer al tercer cuartil (P25–P75), la línea "
+                "interior es la mediana y los bigotes llegan hasta 1,5·IQR; los "
+                "puntos son los valores que caen fuera de las vallas (atípicos por "
+                "Tukey).")),
+            model.Figure(
+                make=_boxplots_make,
+                caption="Boxplots de Tukey de las columnas más contaminadas."),
+        ]))
+
+    # --- Multivariate ------------------------------------------------------- #
+    outliers, _src = _resolve_multivariate(profile, ctx, raw_numeric)
+    if outliers is not None:
+        blocks.extend(_multivariate_blocks(outliers, raw_numeric, mark))
+    else:
+        blocks.append(model.Heading(text="Filas atípicas (multivariante)", level=2))
+        blocks.append(model.Note(
+            "No se pudo analizar la anomalía multivariante: hacen falta al menos "
+            "dos columnas numéricas y la muestra cruda (o los modelos del perfil) "
+            "para correr Isolation Forest."))
+
+    # --- Interpretation ----------------------------------------------------- #
+    blocks.append(model.Heading(text="Cómo interpretar los atípicos", level=2))
+    blocks.append(_interpretation_block(mark))
+
+    return model.Chapter(id=CHAPTER_ID, title=CHAPTER_TITLE,
+                         version=CHAPTER_VERSION, blocks=blocks)
@@ -0,0 +1,304 @@
+"""Tests for the OUTLIERS chapter — DoD: golden + edges + error path.
+
+Self-contained: builds synthetic ``numeric`` blocks + a raw_numeric sample (no
+DuckDB) so the suite is fast and deterministic. Verifies that the chapter emits
+the univariate per-column table, a boxplots figure, the multivariate Isolation
+Forest section and the outlier≠error interpretation; that the most contaminated
+column is ranked first; that a profile with no numeric column yields None; that
+None/empty never raises; that the glossary terms are registered; and that the
+chapter renders into both PDF and PPTX without cutting its title.
+"""
+
+import math
+import os
+import re
+import tempfile
+
+from pypdf import PdfReader
+
+from datascience.automatic_eda.chapters.outliers import (
+    build_outliers, CHAPTER_VERSION, CHAPTER_TITLE, _TERM_DEFS,
+)
+from datascience.automatic_eda import model
+from datascience.render_automatic_eda_pdf import render_automatic_eda_pdf
+from datascience.render_automatic_eda_pptx import render_automatic_eda_pptx
+
+
+def _percentile(sorted_vals, q):
+    """Linear-interpolation percentile (q in 0..1) on an already-sorted list."""
+    if not sorted_vals:
+        return None
+    if len(sorted_vals) == 1:
+        return float(sorted_vals[0])
+    pos = q * (len(sorted_vals) - 1)
+    lo = int(math.floor(pos))
+    hi = int(math.ceil(pos))
+    if lo == hi:
+        return float(sorted_vals[lo])
+    frac = pos - lo
+    return float(sorted_vals[lo] * (1 - frac) + sorted_vals[hi] * frac)
+
+
+def _col_from_values(values, nbins=10):
+    """Build a ``numeric`` sub-block shaped like describe_numeric's output from a
+    concrete list of raw values, so the profile percentiles and the raw sample
+    are consistent (the boxplot fences match the raw sample)."""
+    vals = [float(v) for v in values]
+    s = sorted(vals)
+    n = len(s)
+    mean = sum(vals) / n
+    var = sum((v - mean) ** 2 for v in vals) / n
+    std = math.sqrt(var)
+    median = _percentile(s, 0.5)
+    p25 = _percentile(s, 0.25)
+    p75 = _percentile(s, 0.75)
+    mn, mx = s[0], s[-1]
+    # z-score outlier count (population std), as carried by the profile's n_outliers.
+    n_out = sum(1 for v in vals if std > 0 and abs((v - mean) / std) > 3.0)
+    width = (mx - mn) / nbins if mx > mn else 1.0
+    hist = [{"lo": mn + i * width, "hi": mn + (i + 1) * width, "count": 1}
+            for i in range(nbins)]
+    return {
+        "min": mn, "max": mx, "mean": mean, "median": median, "std": std,
+        "p25": p25, "p50": median, "p75": p75, "iqr": (p75 - p25),
+        "n_outliers": n_out, "outlier_pct": 100.0 * n_out / n,
+        "distribution_type": "right-skewed", "histogram": hist,
+    }
+
+
+def _fare_values():
+    """A heavy-tailed column (bulk 7-31, tail 180-512): clear Tukey/z outliers."""
+    base = [7.0 + (i % 25) for i in range(120)]      # bulk 7..31
+    tail = [180.0, 210.0, 263.0, 512.0]              # extreme upper tail
+    return base + tail
+
+
+def _age_values():
+    """A roughly symmetric bulk (22-61) plus strays at both ends (0.5, 1, 74, 80)."""
+    base = [22.0 + (i % 40) for i in range(120)]     # 22..61
+    return base + [80.0, 0.5, 74.0, 1.0]
+
+
+def _quiet_values():
+    """A clean column with no atypical values."""
+    return [50.0 + (i % 5) for i in range(124)]
+
+
+def _profile_and_ctx(with_models=True, with_raw=True):
+    fare = _fare_values()
+    age = _age_values()
+    quiet = _quiet_values()
+    cols = [
+        {"name": "Fare", "inferred_type": "numeric", "numeric": _col_from_values(fare)},
+        {"name": "Age", "inferred_type": "numeric", "numeric": _col_from_values(age)},
+        {"name": "Quiet", "inferred_type": "numeric", "numeric": _col_from_values(quiet)},
+        {"name": "Sexo", "inferred_type": "categorical",
+         "categorical": {"top": [{"value": "male", "count": 80}]}},
+    ]
+    profile = {"table": "titanic", "n_rows": len(fare), "n_cols": len(cols),
+               "columns": cols}
+    if with_models:
+        profile["models"] = {
+            "outliers": {
+                "n_outliers": 4, "outlier_pct": 3.2,
+                "outlier_rows": [
+                    {"row_index": 123, "score": -0.21},
+                    {"row_index": 121, "score": -0.15},
+                ],
+                "threshold": -0.02, "n_rows_used": 124, "n_features": 3,
+            }
+        }
+    ctx = {}
+    if with_raw:
+        ctx["raw_numeric"] = {"Fare": fare, "Age": age, "Quiet": quiet}
+    return profile, ctx
+
+
+def _pdf_text(path: str) -> str:
+    txt = "".join((pg.extract_text() or "") for pg in PdfReader(path).pages)
+    return re.sub(r"\s+", " ", txt)
+
+
+def _flatten(blocks):
+    out = []
+    for b in blocks:
+        if getattr(b, "kind", "") == "group":
+            out.extend(_flatten(getattr(b, "blocks", []) or []))
+        else:
+            out.append(b)
+    return out
+
+
+# --------------------------------------------------------------------------- #
+# Golden.
+# --------------------------------------------------------------------------- #
+def test_golden_estructura_y_secciones():
+    profile, ctx = _profile_and_ctx()
+    ctx["glossary"] = model.GlossaryCollector()
+    ch = build_outliers(profile, ctx)
+    assert ch is not None
+    assert ch.id == "outliers"
+    assert ch.version == CHAPTER_VERSION
+
+    flat = _flatten(ch.blocks)
+    kinds = [b.kind for b in flat]
+    # Title heading + univariate DataTable + boxplots Figure + multivariate
+    # KVTable + interpretation Markdown.
+    assert kinds[0] == "heading" and flat[0].text == CHAPTER_TITLE
+    tables = [b for b in flat if b.kind == "data_table"]
+    titles = [t.title for t in tables]
+    assert any(t and "atípicos por columna" in t for t in titles)
+    assert any(b.kind == "figure" for b in flat), "falta la figura de boxplots"
+    assert any(b.kind == "kv_table" for b in flat), "falta el resumen multivariante"
+
+    # The boxplots figure maker yields a real matplotlib figure (or its fallback).
+    fig = next(b for b in flat if b.kind == "figure").make()
+    assert fig is not None
+    import matplotlib.pyplot as plt
+    plt.close(fig)
+
+
+def test_golden_fare_es_la_mas_contaminada():
+    # The univariate table must rank Fare (heavy tail) first and report a
+    # non-zero Tukey percentage for it.
+    profile, ctx = _profile_and_ctx()
+    ch = build_outliers(profile, ctx)
+    table = next(b for b in _flatten(ch.blocks)
+                 if b.kind == "data_table" and b.title
+                 and "atípicos por columna" in b.title)
+    first_col = table.rows[0][0]
+    assert first_col == "Fare", f"esperaba Fare primera, fue {first_col}"
+    # % Tukey column (index 2) of the first row must be > 0.
+    pct_cell = table.rows[0][2]
+    assert pct_cell not in ("—", "0%", "0.0%", "0.00%"), f"% Tukey de Fare vacío: {pct_cell}"
+    # The z-score rule (detect_outliers) must actually run with raw_numeric: at
+    # least one column reports a non-empty z count/percentage (regression guard
+    # for the detect_outliers import path).
+    z_pcts = [r[4] for r in table.rows]
+    assert any(c not in ("—",) for c in z_pcts), f"columna z toda vacía: {z_pcts}"
+    z_counts = [r[3] for r in table.rows]
+    assert any(c not in ("—",) for c in z_counts), f"conteo z vacío: {z_counts}"
+
+
+def test_golden_interpretacion_outlier_no_es_error():
+    profile, ctx = _profile_and_ctx()
+    ch = build_outliers(profile, ctx)
+    md = " ".join(b.text for b in _flatten(ch.blocks) if b.kind == "markdown")
+    assert "no es necesariamente un error" in md.lower()
+    # Mentions the actionable options (winsorize / re-express).
+    assert "winsoriz" in md.lower()
+    assert "re-expres" in md.lower() or "logarítmic" in md.lower()
+
+
+def test_golden_terminos_glosario_registrados():
+    profile, ctx = _profile_and_ctx()
+    gloss = model.GlossaryCollector()
+    ctx["glossary"] = gloss
+    build_outliers(profile, ctx)
+    for key in _TERM_DEFS:
+        assert gloss.has(key), f"término '{key}' no registrado en el glosario"
+    # Terms are marked clickable in the body text.
+    md = " ".join(b.text for b in _flatten(build_outliers(profile, ctx).blocks)
+                  if b.kind == "markdown")
+    assert "[[term:outlier]]" in md and "[[term:tukey_fence]]" in md
+
+
+# --------------------------------------------------------------------------- #
+# Multivariate.
+# --------------------------------------------------------------------------- #
+def test_multivariante_live_con_raw_y_dims():
+    # With a raw sample the chapter runs Isolation Forest live (over the same
+    # columns summarize_outlier_dims uses) and lists the anomalous rows with the
+    # dimensions that make each one rare.
+    profile, ctx = _profile_and_ctx(with_models=False, with_raw=True)
+    ch = build_outliers(profile, ctx)
+    flat = _flatten(ch.blocks)
+    kv = next(b for b in flat if b.kind == "kv_table")
+    flat_kv = " ".join(f"{k} {v}" for (k, v) in kv.rows)
+    assert "Filas atípicas" in flat_kv
+    # A non-zero number of anomalous rows is reported.
+    n_cell = dict(kv.rows).get("Filas atípicas")
+    assert n_cell not in (None, "—", "0"), f"sin filas atípicas: {n_cell}"
+    # The anomalous-rows table carries the per-row dimension breakdown.
+    tbls = [b for b in flat if b.kind == "data_table" and b.title
+            and "más atípicas" in b.title]
+    assert tbls, "falta la tabla de filas más atípicas"
+    assert any("hacen rara" in h for h in tbls[0].header), \
+        f"falta la columna de dimensiones: {tbls[0].header}"
+
+
+def test_multivariante_precomputed_sin_raw():
+    # Without a raw sample the chapter falls back to profile['models']['outliers']
+    # (lite preset path); the precomputed n_outliers (4) surfaces in the KV table.
+    profile, ctx = _profile_and_ctx(with_models=True, with_raw=False)
+    ch = build_outliers(profile, ctx)
+    kv = next(b for b in _flatten(ch.blocks) if b.kind == "kv_table")
+    assert dict(kv.rows).get("Filas atípicas") == "4"
+
+
+def test_multivariante_ausente_degrada_a_nota():
+    # No models and no raw sample → an honest note, never a crash.
+    profile, ctx = _profile_and_ctx(with_models=False, with_raw=False)
+    ch = build_outliers(profile, ctx)
+    assert ch is not None
+    notes = [b.text for b in _flatten(ch.blocks) if b.kind == "note"]
+    assert any("Isolation Forest" in n for n in notes)
+
+
+# --------------------------------------------------------------------------- #
+# Edges / error path.
+# --------------------------------------------------------------------------- #
+def test_edge_sin_columnas_numericas_devuelve_none():
+    prof = {"columns": [{"name": "c", "inferred_type": "categorical",
+                         "categorical": {"top": [{"value": "x", "count": 3}]}}]}
+    assert build_outliers(prof, {}) is None
+
+
+def test_edge_solo_texto_sintetico_devuelve_none():
+    # A text-only synthetic table (no numeric column) yields None (does not break).
+    prof = {"table": "notas", "n_rows": 3, "n_cols": 1,
+            "columns": [{"name": "comentario", "inferred_type": "text",
+                         "text": {"n_docs": 3}}]}
+    assert build_outliers(prof, {}) is None
+
+
+def test_edge_profile_none_y_vacio_no_revienta():
+    assert build_outliers(None, None) is None
+    assert build_outliers({}, {}) is None
+    assert build_outliers({"columns": []}, {}) is None
+
+
+def test_edge_sin_raw_numeric_degrada_a_perfil():
+    # Without raw_numeric the chapter still builds, using the profile z-score
+    # counts; the univariate table exists and Tukey counts degrade to '—'.
+    profile, ctx = _profile_and_ctx(with_models=True, with_raw=False)
+    ch = build_outliers(profile, ctx)
+    assert ch is not None
+    table = next(b for b in _flatten(ch.blocks)
+                 if b.kind == "data_table" and b.title
+                 and "atípicos por columna" in b.title)
+    # z counts come from the profile, Tukey counts may be '—'; only row shape is asserted.
+    assert all(len(r) == 8 for r in table.rows)
+
+
+# --------------------------------------------------------------------------- #
+# Anti-cut render.
+# --------------------------------------------------------------------------- #
+def test_render_pdf_y_pptx_incluyen_el_capitulo():
+    profile, ctx = _profile_and_ctx()
+    # The renderers take the profile and run the full chapter registry, so the
+    # OUTLIERS chapter is reached through the registry rather than rendered on
+    # its own; the assertions below only check that it made it into the output.
+    with tempfile.TemporaryDirectory() as d:
+        pdf = os.path.join(d, "out.pdf")
+        res_pdf = render_automatic_eda_pdf(profile, pdf,
+                                           {"write_manifest": False, "ctx": ctx})
+        assert res_pdf["path"] == pdf
+        txt = _pdf_text(pdf)
+        assert CHAPTER_TITLE in txt, "el capítulo OUTLIERS no aparece en el PDF"
+        assert "Fare" in txt
+        pptx = os.path.join(d, "out.pptx")
+        res_pptx = render_automatic_eda_pptx(profile, pptx,
+                                             {"write_manifest": False, "ctx": ctx})
+        assert res_pptx["path"] == pptx
+        assert res_pptx["n_slides"] >= 1
@@ -0,0 +1,559 @@
+"""Free-text / NLP distributions chapter (TEXT DISTR) for AutomaticEDA.
+
+First chapter for **non-tabular** content: it profiles the linguistic content of
+any column holding long free text (reviews, descriptions, comments, tickets) that
+the categorical chapter cannot meaningfully summarize (high cardinality, many
+words per value). It is the cheap, model-free counterpart to ``cat_distr`` for
+columns that are prose rather than discrete labels.
+
+Activation (returns ``None`` when it does not apply):
+
+1. Cheap gate from the aggregated profile: at least one non-numeric column whose
+   ``categorical.len_mean`` (mean character length) is ``>= _MIN_LEN_CHARS``.
+   A dataset whose only string columns are short labels (e.g. titanic's
+   ``Name``, ~27 chars) never passes this gate, so the chapter disappears with
+   zero extra work and the existing report is untouched.
+2. Confirmation from a raw sample: each candidate column is sampled (push-down
+   ``extract_text_sample`` over ``ctx['db_path']``/``ctx['table']``, or an
+   in-memory ``ctx['text_raw']`` for tests) and kept only if the **median word
+   count is ``>= _MIN_WORDS``** — i.e. it is genuinely long text, not a long
+   single token. If no column survives, the chapter returns ``None``.
+
+Per surviving column the chapter emits, kept together on its own page/slide
+(``Group(page_break_before=...)``):
+
+- a key/value summary (documents, length percentiles, vocabulary richness with
+  **[[term:ttr]]TTR[[/term]]** and **[[term:hapax]]hapax legomena[[/term]]**,
+  dominant language, exact-duplicate %, readability when available);
+- a word-count histogram figure;
+- a top-terms table + a horizontal bar figure;
+- bigram and trigram frequency tables;
+- a detected-language bar figure (when ``langdetect`` is available);
+- an optional word-cloud figure (only when ``wordcloud`` is installed);
+- a closing note on duplicates / readability degradation.
+
+Every metric is delegated to pure ``eda`` registry functions
+(``compute_text_length_stats``, ``compute_vocabulary_stats``,
+``compute_top_ngrams``, ``detect_corpus_language``, ``compute_text_duplicates``,
+``compute_text_readability``) and the raw sample to ``extract_text_sample``; all
+are imported defensively so a missing function or optional library degrades that
+single piece to a note instead of aborting the chapter. Optional libraries
+(``langdetect``, ``textstat``, ``wordcloud``, ``datasketch``) are never required:
+the piece is silently omitted when they are absent.
+
+Contract: build_<id>(profile, ctx) -> Chapter | None ; CHAPTER_VERSION = "x.y.z".
+"""
+
+from __future__ import annotations
+
+from .. import model
+
+CHAPTER_VERSION = "1.0.0"
+CHAPTER_ID = "text_distr"
+CHAPTER_TITLE = "Texto libre (NLP)"
+
+# Cheap activation gate (characters): a non-numeric column whose mean string
+# length reaches this is a candidate for "long text". Short labels (titanic's
+# Name ≈ 27 chars) stay below it, so the chapter does not fire on them.
+_MIN_LEN_CHARS = 50
+# Confirmation gate (words): a candidate is kept only if its median document has
+# at least this many words — genuine prose, not a long id/URL token.
+_MIN_WORDS = 20
+# Bound the document so very wide datasets stay readable.
+_MAX_TEXT_COLS = 5
+# Raw text rows to sample per column when the chapter must extract them itself.
+_SAMPLE_ROWS = 2000
+# Rows shown in the frequency tables.
+_TOP_TERMS = 15
+_TOP_NGRAMS = 10
+
+# Glossary terms this chapter explains (registered in the shared collector and
+# marked clickable on first appearance — same mechanism as cat_distr's entropía).
+_TERMS = {
+    "ttr": (
+        "TTR (type-token ratio)",
+        "Riqueza léxica de un texto: número de palabras distintas (tipos) "
+        "dividido por el número total de palabras (tokens). Vale 1 cuando no se "
+        "repite ninguna palabra (máxima variedad) y baja hacia 0 cuando el "
+        "vocabulario se repite mucho. Depende de la longitud del corpus, así que "
+        "compara mejor textos de tamaño parecido."),
+    "hapax": (
+        "Hapax legomena",
+        "Palabras que aparecen una sola vez en todo el corpus. Un porcentaje "
+        "alto de hapax indica vocabulario muy variado o, a veces, ruido "
+        "(erratas, identificadores, tokens raros). Se expresa como porcentaje "
+        "sobre el número de palabras distintas."),
+}
+
+
+def _fmt_int(value) -> str:
+    if value is None:
+        return "—"
+    try:
+        return f"{int(value):,}".replace(",", ".")
+    except (TypeError, ValueError):
+        return str(value)
+
+
+def _fmt_num(value, decimals: int = 2) -> str:
+    if value is None:
+        return "—"
+    if isinstance(value, bool):
+        return str(value)
+    if isinstance(value, int):
+        return f"{value:,}".replace(",", ".")
+    if isinstance(value, float):
+        if value != value:  # NaN
+            return "NaN"
+        if value in (float("inf"), float("-inf")):
+            return str(value)
+        text = f"{value:.{decimals}f}".rstrip("0").rstrip(".")
+        return text if text else "0"
+    return str(value)
+
+
+def _fmt_pct(value, decimals: int = 1) -> str:
+    if value is None:
+        return "—"
+    try:
+        return f"{float(value):.{decimals}f}%"
+    except (TypeError, ValueError):
+        return str(value)
+
+
+def _truncate(text, limit: int = 40) -> str:
+    s = model._safe_str(text)
+    return s if len(s) <= limit else s[: max(1, limit - 1)].rstrip() + "…"
+
+
+# --------------------------------------------------------------------------- #
+# Defensive wrappers around the registry functions: each returns the function's
+# output dict or a safe empty default, never raising and never importing at
+# module load (so the chapter stays importable even if a function is missing).
+# --------------------------------------------------------------------------- #
+def _length_stats(texts) -> dict:
+    try:
+        from datascience.compute_text_length_stats import compute_text_length_stats
+        out = compute_text_length_stats(texts)
+        if isinstance(out, dict):
+            return out
+    except Exception:  # noqa: BLE001
+        pass
+    return {}
+
+
+def _vocab_stats(texts) -> dict:
+    try:
+        from datascience.compute_vocabulary_stats import compute_vocabulary_stats
+        out = compute_vocabulary_stats(texts, top_k=_TOP_TERMS)
+        if isinstance(out, dict):
+            return out
+    except Exception:  # noqa: BLE001
+        pass
+    return {}
+
+
+def _ngrams(texts, n) -> list:
+    try:
+        from datascience.compute_top_ngrams import compute_top_ngrams
+        out = compute_top_ngrams(texts, n=n, top_k=_TOP_NGRAMS)
+        if isinstance(out, dict):
+            return out.get("top") or []
+    except Exception:  # noqa: BLE001
+        pass
+    return []
+
+
+def _language(texts) -> dict:
+    try:
+        from datascience.detect_corpus_language import detect_corpus_language
+        out = detect_corpus_language(texts)
+        if isinstance(out, dict):
+            return out
+    except Exception:  # noqa: BLE001
+        pass
+    return {"available": False, "distribution": [], "dominant": None}
+
+
+def _duplicates(texts) -> dict:
+    try:
+        from datascience.compute_text_duplicates import compute_text_duplicates
+        out = compute_text_duplicates(texts)
+        if isinstance(out, dict):
+            return out
+    except Exception:  # noqa: BLE001
+        pass
+    return {}
+
+
+def _readability(texts) -> dict:
+    try:
+        from datascience.compute_text_readability import compute_text_readability
+        out = compute_text_readability(texts)
+        if isinstance(out, dict):
+            return out
+    except Exception:  # noqa: BLE001
+        pass
+    return {"available": False, "flesch": {}}
+
+
+# --------------------------------------------------------------------------- #
+# Candidate detection + raw sample acquisition.
+# --------------------------------------------------------------------------- #
+def _candidate_columns(profile: dict) -> list:
+    """Cheap gate: non-numeric columns whose mean char length reaches the
+    threshold. Returns the list of column names (possibly empty)."""
+    out = []
+    for col in profile.get("columns") or []:
+        if not isinstance(col, dict):
+            continue
+        if col.get("inferred_type") == "numeric":
+            continue
+        cat = col.get("categorical")
+        if not isinstance(cat, dict):
+            continue
+        len_mean = cat.get("len_mean")
+        if isinstance(len_mean, (int, float)) and not isinstance(len_mean, bool) \
+                and len_mean >= _MIN_LEN_CHARS:
+            name = col.get("name")
+            if name:
+                out.append(str(name))
+    return out
+
+
+def _get_samples(profile: dict, ctx: dict, columns: list) -> dict:
+    """Return {col: [str, ...]} raw text samples for the candidate columns.
+
+    Prefers an in-memory ``ctx['text_raw']`` (used by tests); otherwise the
+    sampling is pushed down to the database via ``extract_text_sample`` using
+    ctx db_path / table. Never raises: returns {} when no sample is obtained."""
+    text_raw = ctx.get("text_raw")
+    if isinstance(text_raw, dict) and text_raw:
+        return {c: [str(v) for v in (text_raw.get(c) or []) if v is not None]
+                for c in columns if text_raw.get(c)}
+
+    db_path = ctx.get("db_path")
+    table = ctx.get("table")
+    if not db_path or not table:
+        return {}
+    backend = ctx.get("backend") or "duckdb"
+    sample = ctx.get("sample") or _SAMPLE_ROWS
+    try:
+        from datascience.extract_text_sample import extract_text_sample
+        out = extract_text_sample(db_path, table, columns, backend=backend,
+                                  sample=sample)
+        if isinstance(out, dict) and out.get("status") == "ok":
+            cols = out.get("columns")
+            if isinstance(cols, dict):
+                return {c: list(v) for c, v in cols.items() if v}
+    except Exception:  # noqa: BLE001 — dict-no-throw: no sample → chapter omits.
+        pass
+    return {}
+
+
+def _confirm_long_text(samples: dict) -> dict:
+    """Keep only columns whose median word count reaches _MIN_WORDS. Returns
+    {col: length_stats_dict} for the survivors, in input order."""
+    survivors = {}
+    for col, texts in samples.items():
+        stats = _length_stats(texts)
+        words = stats.get("words") if isinstance(stats, dict) else None
+        median = words.get("p50") if isinstance(words, dict) else None
+        if isinstance(median, (int, float)) and not isinstance(median, bool) \
+                and median >= _MIN_WORDS:
+            survivors[col] = stats
+    return survivors
+
+
+# --------------------------------------------------------------------------- #
+# Figures (lazy matplotlib, scaled by the renderers — same style as num_distr).
+# --------------------------------------------------------------------------- #
+def _hist_figure(name: str, length_stats: dict):
+    def make():
+        import matplotlib
+        matplotlib.use("Agg")
+        from matplotlib.figure import Figure
+        fig = Figure(figsize=(6.2, 3.0))
+        ax = fig.add_subplot(111)
+        bins = (length_stats or {}).get("word_hist") or []
+        drew = False
+        for b in bins:
+            if not isinstance(b, dict):
+                continue
+            lo, hi, count = b.get("lo"), b.get("hi"), b.get("count") or 0
+            if lo is None or hi is None:
+                continue
+            width = (hi - lo) if hi > lo else max(abs(lo) * 1e-3, 1e-6)
+            ax.bar(lo, count, width=width, align="edge", color="#9ec6df",
+                   edgecolor="#5b8aa6", linewidth=0.4)
+            drew = True
+        if not drew:
+            ax.text(0.5, 0.5, "(sin datos de longitud)", ha="center",
+                    va="center", color="#8a8a8a", transform=ax.transAxes)
+        ax.set_xlabel("palabras por documento", fontsize=8)
+        ax.set_ylabel("nº de documentos", fontsize=8)
+        ax.tick_params(labelsize=7)
+        for spine in ("top", "right"):
+            ax.spines[spine].set_visible(False)
+        ax.set_title(f"Longitud de «{_truncate(name, 30)}»", fontsize=10,
+                     loc="left")
+        fig.tight_layout()
+        return fig
+    return make
+
+
+def _barh_figure(title: str, items: list, label_key: str, value_key: str,
+                 xlabel: str):
+    """Horizontal bar chart from [{label_key:..., value_key:...}, ...]."""
+    def make():
+        import matplotlib
+        matplotlib.use("Agg")
+        from matplotlib.figure import Figure
+        rows = [it for it in (items or []) if isinstance(it, dict)
+                and isinstance(it.get(value_key), (int, float))]
+        rows = rows[:12]
+        fig = Figure(figsize=(6.2, max(2.2, 0.32 * len(rows) + 0.8)))
+        ax = fig.add_subplot(111)
+        if not rows:
+            ax.text(0.5, 0.5, "(sin datos)", ha="center", va="center",
+                    color="#8a8a8a", transform=ax.transAxes)
+            ax.axis("off")
+            return fig
+        labels = [_truncate(r.get(label_key), 28) for r in rows][::-1]
+        values = [float(r.get(value_key) or 0) for r in rows][::-1]
+        ypos = range(len(rows))
+        ax.barh(list(ypos), values, color="#9ec6df", edgecolor="#5b8aa6",
+                linewidth=0.4)
+        ax.set_yticks(list(ypos))
+        ax.set_yticklabels(labels, fontsize=7)
+        ax.set_xlabel(xlabel, fontsize=8)
+        ax.tick_params(labelsize=7)
+        for spine in ("top", "right"):
+            ax.spines[spine].set_visible(False)
+        ax.set_title(_truncate(title, 44), fontsize=10, loc="left")
+        fig.tight_layout()
+        return fig
+    return make
+
+
+def _wordcloud_figure(texts):
+    """Word-cloud figure callable, or None if wordcloud is not installed."""
+    try:
+        import wordcloud  # noqa: F401
+    except Exception:  # noqa: BLE001 — optional dependency: omit the figure.
+        return None
+
+    def make():
+        import matplotlib
+        matplotlib.use("Agg")
+        from matplotlib.figure import Figure
+        from wordcloud import WordCloud
+        fig = Figure(figsize=(6.2, 3.2))
+        ax = fig.add_subplot(111)
+        joined = " ".join(t for t in texts if isinstance(t, str))
+        try:
+            wc = WordCloud(width=800, height=400, background_color="white",
+                           colormap="viridis").generate(joined)
+            ax.imshow(wc, interpolation="bilinear")
+        except Exception:  # noqa: BLE001
+            ax.text(0.5, 0.5, "(nube de palabras no disponible)", ha="center",
+                    va="center", color="#8a8a8a", transform=ax.transAxes)
+        ax.axis("off")
+        fig.tight_layout()
+        return fig
+    return make
+
+
+# --------------------------------------------------------------------------- #
+# Per-column block assembly.
+# --------------------------------------------------------------------------- #
+def _summary_kv(n_docs, length_stats, vocab, lang, dup, read):
+    chars = (length_stats or {}).get("chars") or {}
+    words = (length_stats or {}).get("words") or {}
+    sents = (length_stats or {}).get("sentences") or {}
+    rows = [
+        ("Documentos", _fmt_int(n_docs)),
+        ("Caracteres (media · p50 · p90 · p99)",
+         f"{_fmt_num(chars.get('mean'))} · {_fmt_int(chars.get('p50'))} · "
+         f"{_fmt_int(chars.get('p90'))} · {_fmt_int(chars.get('p99'))}"),
+        ("Palabras (media · p50 · p90 · p99)",
+         f"{_fmt_num(words.get('mean'))} · {_fmt_int(words.get('p50'))} · "
+         f"{_fmt_int(words.get('p90'))} · {_fmt_int(words.get('p99'))}"),
+        ("Frases (media · máx)",
+         f"{_fmt_num(sents.get('mean'))} · {_fmt_int(sents.get('max'))}"),
+        ("Vocabulario (tokens · tipos · TTR)",
+         f"{_fmt_int(vocab.get('n_tokens'))} · {_fmt_int(vocab.get('n_types'))} "
+         f"· {_fmt_num(vocab.get('ttr'), 3)}"),
+        ("Hapax legomena",
+         f"{_fmt_int(vocab.get('n_hapax'))} ({_fmt_pct(vocab.get('hapax_pct'))})"),
+    ]
+    if isinstance(lang, dict) and lang.get("available"):
+        dom = lang.get("dominant")
+        n_langs = len(lang.get("distribution") or [])
+        rows.append(("Idioma dominante · nº idiomas",
+                     f"{model._safe_str(dom) or '—'} · {_fmt_int(n_langs)}"))
+    if isinstance(dup, dict) and dup.get("n_docs"):
+        rows.append(("Duplicados exactos",
+                     f"{_fmt_int(dup.get('n_exact_dup'))} "
+                     f"({_fmt_pct(dup.get('exact_dup_pct'))})"))
+    if isinstance(read, dict) and read.get("available"):
+        flesch = read.get("flesch") or {}
+        rows.append(("Legibilidad Flesch (media)",
+                     _fmt_num(flesch.get("mean"), 1)))
+    return model.KVTable(rows=rows, title="Resumen del texto")
+
+
+def _terms_table(vocab) -> "model.DataTable | None":
+    top = (vocab or {}).get("top_terms") or []
+    rows = [[_truncate(t.get("term"), 32), _fmt_int(t.get("count")),
+             _fmt_pct(t.get("pct"))]
+            for t in top[:_TOP_TERMS] if isinstance(t, dict)]
+    if not rows:
+        return None
+    return model.DataTable(header=["Término", "Conteo", "% tokens"], rows=rows,
+                           title="Términos más frecuentes",
+                           note="stopwords ES+EN eliminadas")
+
+
+def _ngram_table(items, n_label) -> "model.DataTable | None":
+    rows = [[_truncate(it.get("ngram"), 40), _fmt_int(it.get("count"))]
+            for it in (items or [])[:_TOP_NGRAMS] if isinstance(it, dict)]
+    if not rows:
+        return None
+    return model.DataTable(header=[n_label, "Conteo"], rows=rows,
+                           title=f"{n_label} más frecuentes")
+
+
+def _dup_note(dup, lang, read) -> "model.Note | None":
+    bits = []
+    if isinstance(dup, dict):
+        nd = dup.get("near_dup") or {}
+        if nd.get("available"):
+            bits.append(
+                f"casi-duplicados detectados (MinHash, umbral "
+                f"{_fmt_num(nd.get('threshold'))}): "
+                f"{_fmt_int(nd.get('n_near_dup_docs'))} documentos")
+        else:
+            bits.append("casi-duplicados no calculados (datasketch no instalado; "
+                        "se reportan solo los duplicados exactos por hash)")
+    if isinstance(lang, dict) and not lang.get("available"):
+        bits.append("detección de idioma omitida (langdetect no instalado)")
+    if isinstance(read, dict) and not read.get("available"):
+        bits.append("legibilidad omitida (textstat no instalado)")
+    if not bits:
+        return None
+    return model.Note(" · ".join(bits))
+
+
+def _column_group(name, texts, length_stats, idx, mark_terms):
+    vocab = _vocab_stats(texts)
+    lang = _language(texts)
+    dup = _duplicates(texts)
+    read = _readability(texts)
+    n_docs = (length_stats or {}).get("n_docs")
+
+    blocks = [
+        model.Heading(text=str(name), level=2),
+        _summary_kv(n_docs, length_stats, vocab, lang, dup, read),
+        model.Figure(make=_hist_figure(name, length_stats),
+                     caption=f"Distribución de la longitud (palabras) de "
+                             f"«{_truncate(name, 30)}»."),
+    ]
+
+    terms_tbl = _terms_table(vocab)
+    if terms_tbl is not None:
+        blocks.append(terms_tbl)
+        blocks.append(model.Figure(
+            make=_barh_figure(f"Top términos de «{_truncate(name, 24)}»",
+                              vocab.get("top_terms"), "term", "count",
+                              "conteo"),
+            caption="Términos más frecuentes (barras)."))
+
+    bi_tbl = _ngram_table(_ngrams(texts, 2), "Bigrama")
+    if bi_tbl is not None:
+        blocks.append(bi_tbl)
+    tri_tbl = _ngram_table(_ngrams(texts, 3), "Trigrama")
+    if tri_tbl is not None:
+        blocks.append(tri_tbl)
+
+    if isinstance(lang, dict) and lang.get("available") \
+            and lang.get("distribution"):
+        blocks.append(model.Figure(
+            make=_barh_figure(f"Idiomas detectados en «{_truncate(name, 24)}»",
+                              lang.get("distribution"), "lang", "count",
+                              "documentos"),
+            caption="Distribución de idiomas detectados (langdetect)."))
+
+    wc = _wordcloud_figure(texts)
+    if wc is not None:
+        blocks.append(model.Figure(
+            make=wc, caption=f"Nube de palabras de «{_truncate(name, 30)}»."))
+
+    note = _dup_note(dup, lang, read)
+    if note is not None:
+        blocks.append(note)
+
+    return model.Group(blocks=blocks, page_break_before=(idx > 0))
+
+
+def _intro_blocks(n_cols, mark_terms):
+    ttr = ("[[term:ttr]]TTR[[/term]]" if mark_terms else "TTR")
+    hapax = ("[[term:hapax]]hapax legomena[[/term]]" if mark_terms
+             else "hapax legomena")
+    text = (
+        f"Este capítulo perfila las columnas de **texto libre largo** del "
+        f"dataset (reseñas, descripciones, comentarios): contenido lingüístico "
+        f"que la distribución categórica no resume bien. Para cada columna se "
+        f"muestran la longitud de los documentos, la riqueza de vocabulario "
+        f"(incluido el {ttr} y el porcentaje de {hapax}), los términos y "
+        f"n-gramas más frecuentes, los idiomas detectados y el nivel de "
+        f"duplicación. Las métricas son baratas y sin modelos pesados; las "
+        f"piezas que dependen de una librería opcional se omiten si no está "
+        f"instalada.")
+    return [
+        model.Heading(text=CHAPTER_TITLE, level=1),
+        model.Markdown(text=text),
+    ]
+
+
+def build_text_distr(profile: dict, ctx: dict):
+    """Build the free-text Chapter, or None if no long-text column applies."""
+    profile = profile or {}
+    ctx = ctx or {}
+
+    # 1) Cheap gate from the profile (no DB access yet).
+    candidates = _candidate_columns(profile)
+    if not candidates:
+        return None
+
+    # 2) Raw sample + 3) confirm genuine long text (median words >= threshold).
+    samples = _get_samples(profile, ctx, candidates)
+    if not samples:
+        return None
+    survivors = _confirm_long_text(samples)
+    if not survivors:
+        return None
+
+    # Register glossary terms (clickable) once we know the chapter applies.
+    glossary = ctx.get("glossary")
+    mark_terms = False
+    if isinstance(glossary, model.GlossaryCollector):
+        for key, (label, definition) in _TERMS.items():
+            glossary.add(key, label, definition)
+        mark_terms = True
+
+    blocks = list(_intro_blocks(len(survivors), mark_terms))
+
+    rendered = list(survivors.items())[:_MAX_TEXT_COLS]
+    for idx, (name, length_stats) in enumerate(rendered):
+        texts = samples.get(name) or []
+        blocks.append(_column_group(name, texts, length_stats, idx, mark_terms))
+
+    if len(survivors) > len(rendered):
+        omitted = len(survivors) - len(rendered)
+        blocks.append(model.Note(
+            f"Se muestran las primeras {len(rendered)} columnas de texto; "
+            f"quedan {omitted} sin mostrar para mantener acotado el informe."))
+
+    return model.Chapter(id=CHAPTER_ID, title=CHAPTER_TITLE,
+                         version=CHAPTER_VERSION, blocks=blocks)
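Note on the gating in `build_text_distr`: the chapter applies two filters in sequence, a cheap profile-only check on mean character length and then a confirmation on the median word count of the raw sample. The sketch below illustrates that idea in isolation. It is not the module's code: the threshold values, the helper names `cheap_gate` and `confirm`, and the whitespace tokenization are assumptions made for the example (the real `_MIN_LEN_CHARS`, `_MIN_WORDS` and `_length_stats` are not shown in this diff).

```python
from statistics import median

# Hypothetical thresholds; the real _MIN_LEN_CHARS / _MIN_WORDS are not shown.
MIN_LEN_CHARS = 50
MIN_WORDS = 8


def cheap_gate(profile: dict) -> list:
    """Stage 1: pick non-numeric columns by mean character length only."""
    names = []
    for col in profile.get("columns") or []:
        cat = col.get("categorical") if isinstance(col, dict) else None
        if not isinstance(cat, dict) or col.get("inferred_type") == "numeric":
            continue
        len_mean = cat.get("len_mean")
        if isinstance(len_mean, (int, float)) and not isinstance(len_mean, bool) \
                and len_mean >= MIN_LEN_CHARS and col.get("name"):
            names.append(str(col["name"]))
    return names


def confirm(samples: dict) -> list:
    """Stage 2: keep columns whose median whitespace-token count is high enough."""
    kept = []
    for name, texts in samples.items():
        counts = [len(t.split()) for t in texts if isinstance(t, str)]
        if counts and median(counts) >= MIN_WORDS:
            kept.append(name)
    return kept


profile = {"columns": [
    {"name": "review", "inferred_type": "categorical",
     "categorical": {"len_mean": 180.0}},
    {"name": "sku", "inferred_type": "categorical",
     "categorical": {"len_mean": 64.0}},
    {"name": "rating", "inferred_type": "numeric"},
]}
samples = {
    "review": ["the product arrived early and works exactly as described here"] * 5,
    "sku": ["AAAA-BBBB-CCCC-DDDD-EEEE-FFFF-GGGG-HHHH-IIII-JJJJ-KKKK-LLLL-0001"] * 5,
}

candidates = cheap_gate(profile)          # long by characters
survivors = confirm({c: samples[c] for c in candidates})  # long by words
```

Here `sku` passes the character gate because its values are long strings, but each value is a single token, so the word-count confirmation drops it. That is the case the second stage exists for: long identifiers or codes that are not free text.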
@@ -0,0 +1,256 @@
+"""Tests for the TEXT DISTR chapter — DoD: golden + edges + degradation.
+
+Self-contained: builds synthetic TableProfiles and feeds the raw text sample
+in-memory through ``ctx['text_raw']`` (no DuckDB needed), so the suite is fast
+and deterministic. Verifies that ``build_text_distr``:
+
+- GOLDEN: with a long-text column, emits the chapter with its key blocks
+  (length summary, word histogram, top-terms table, n-gram tables, language
+  bars) and registers the clickable glossary terms; and that it renders inside
+  the full document to both PDF and PPTX showing that content.
+- EDGE (None): a dataset whose string columns hold only short labels (titanic-like
+  ``Name``) yields ``None`` without raising — the existing report is untouched.
+- EDGE (None): a column that passes the cheap char gate but whose documents are
+  short (median words below the threshold) is rejected at the confirmation step.
+- DEGRADATION: with ``langdetect`` / ``textstat`` / ``wordcloud`` unavailable,
+  the chapter still builds (those pieces are omitted) and never raises.
+"""
+
+import builtins
+import os
+import tempfile
+
+from pypdf import PdfReader
+from pptx import Presentation
+
+from datascience.automatic_eda.model import (
+    DataTable, Figure, GlossaryCollector, Group, Heading, KVTable, Markdown,
+    Note,
+)
+from datascience.automatic_eda.chapters.text_distr import (
+    CHAPTER_ID, CHAPTER_VERSION, build_text_distr,
+)
+from datascience.automatic_eda.chapters_registry import build_document
+from datascience.render_automatic_eda_pdf import render_automatic_eda_pdf
+from datascience.render_automatic_eda_pptx import render_automatic_eda_pptx
+
+
+# --------------------------------------------------------------------------- #
+# Synthetic corpus + profiles.
+# --------------------------------------------------------------------------- #
+_ES = [
+    "El producto llegó en perfecto estado y mucho antes de lo previsto por la tienda",
+    "La calidad de los materiales es realmente excelente y se nota la diferencia al usarlo",
+    "No me convenció del todo porque esperaba bastante más por el precio que pagué finalmente",
+    "El servicio de atención al cliente fue rápido amable y resolvió mi problema sin demora",
+    "Lo recomiendo totalmente ya que ha superado con creces todas mis expectativas iniciales",
+]
+_EN = [
+    "The product arrived in perfect condition and much earlier than the store had promised me",
+    "The build quality is genuinely outstanding and you can really feel the difference using it",
+    "I was not fully convinced because I expected quite a lot more for the price I finally paid",
+    "Customer support was fast friendly and solved my whole problem without any delay at all",
+    "I highly recommend it since it has exceeded by far every one of my initial expectations",
+]
+
+
+def _long_reviews(n=40) -> list:
+    """A corpus of long multi-sentence reviews (>= 20 words each), mixing two
+    languages and including a few exact duplicates."""
+    out = []
+    for i in range(n):
+        base = _ES if i % 3 != 0 else _EN  # mostly ES, some EN
+        a = base[i % len(base)]
+        b = base[(i + 2) % len(base)]
+        out.append(f"{a}. {b}.")
+    # Inject a couple of exact duplicates.
+    out.append(out[0])
+    out.append(out[1])
+    return out
+
+
+def _text_profile() -> dict:
+    """Profile with a long free-text column (review) + a numeric + a short cat."""
+    return {
+        "table": "reviews",
+        "source": "/data/reviews.duckdb",
+        "profiled_at": "2026-06-30T10:00:00+00:00",
+        "n_rows": 42,
+        "n_cols": 3,
+        "quality_score": 88.0,
+        "columns": [
+            {
+                "name": "review",
+                "inferred_type": "categorical",
+                "categorical": {
+                    "top": [{"value": "x", "count": 2, "pct": 0.05}],
+                    "n_distinct": 40,
+                    "len_mean": 180.0,
+                    "len_min": 80,
+                    "len_max": 220,
+                },
+            },
+            {
+                "name": "rating",
+                "inferred_type": "numeric",
+                "numeric": {"mean": 3.1, "median": 3.0, "std": 1.2,
+                            "min": 1, "max": 5},
+            },
+            {
+                "name": "product",
+                "inferred_type": "categorical",
+                "categorical": {
+                    "top": [{"value": "teclado", "count": 10, "pct": 0.25}],
+                    "n_distinct": 6,
+                    "len_mean": 7.0,
+                    "len_min": 5, "len_max": 11,
+                },
+            },
+        ],
+    }
+
+
+def _no_text_profile() -> dict:
+    """titanic-like: string columns hold only short labels (Name ≈ 27 chars)."""
+    return {
+        "table": "titanic",
+        "n_rows": 891,
+        "n_cols": 3,
+        "columns": [
+            {"name": "Age", "inferred_type": "numeric",
+             "numeric": {"mean": 29.7, "median": 28.0, "std": 14.5}},
+            {"name": "Name", "inferred_type": "categorical",
+             "categorical": {"top": [{"value": "Braund, Mr. Owen Harris",
+                                      "count": 1, "pct": 0.001}],
+                             "n_distinct": 891, "len_mean": 27.0,
+                             "len_min": 12, "len_max": 82}},
+            {"name": "Sex", "inferred_type": "categorical",
+             "categorical": {"top": [{"value": "male", "count": 577,
+                                      "pct": 0.65}],
+                             "n_distinct": 2, "len_mean": 4.6,
+                             "len_min": 4, "len_max": 6}},
+        ],
+    }
+
+
+def _flatten(blocks) -> list:
+    """Recursively flatten Group blocks so tests can inspect leaf blocks."""
+    out = []
+    for b in blocks:
+        if isinstance(b, Group):
+            out.extend(_flatten(b.blocks))
+        else:
+            out.append(b)
+    return out
+
+
+# --------------------------------------------------------------------------- #
+# Golden.
+# --------------------------------------------------------------------------- #
+def test_golden_activa_con_texto():
+    glossary = GlossaryCollector()
+    ctx = {"text_raw": {"review": _long_reviews()}, "glossary": glossary}
+    ch = build_text_distr(_text_profile(), ctx)
+
+    assert ch is not None, "el capítulo debe activarse con una columna de texto largo"
+    assert ch.id == CHAPTER_ID
+    assert ch.version == CHAPTER_VERSION
+    leaves = _flatten(ch.blocks)
+    kinds = [b.kind for b in leaves]
+    assert "heading" in kinds
+    assert "kv_table" in kinds          # summary
+    assert "figure" in kinds            # histogram / bars
+    assert "data_table" in kinds        # top terms + n-grams
+
+    # KV summary mentions vocabulary metrics.
+    kv = next(b for b in leaves if isinstance(b, KVTable))
+    labels = " ".join(str(r[0]) for r in kv.rows)
+    assert "TTR" in labels
+    assert "Hapax" in labels or "hapax" in labels
+
+    # There is a terms table and at least one n-gram table.
+    titles = [getattr(b, "title", "") or "" for b in leaves
+              if isinstance(b, DataTable)]
+    assert any("Términos" in t for t in titles)
+    assert any("Bigrama" in t for t in titles)
+
+    # Glossary terms were registered (clickable destinations).
+    assert glossary.has("ttr")
+    assert glossary.has("hapax")
+
+
+def test_golden_render_pdf_pptx():
+    profile = _text_profile()
+    ctx = {"text_raw": {"review": _long_reviews()},
+           "dataset_name": "reviews"}
+    chapters = build_document(profile, ctx)
+    ids = [c.id for c in chapters]
+    assert "text_distr" in ids, f"text_distr ausente en {ids}"
+
+    with tempfile.TemporaryDirectory() as d:
+        pdf = os.path.join(d, "t.pdf")
+        pptx = os.path.join(d, "t.pptx")
+        rp = render_automatic_eda_pdf(profile, pdf, {"title": "EDA", "ctx": ctx})
+        rx = render_automatic_eda_pptx(profile, pptx, {"title": "EDA", "ctx": ctx})
+        assert rp.get("path") and os.path.exists(pdf)
+        assert rx.get("path") and os.path.exists(pptx)
+
+        text = "\n".join(p.extract_text() or "" for p in PdfReader(pdf).pages)
+        assert "Texto libre" in text or "TTR" in text
+
+        prs = Presentation(pptx)
+        ptext = []
+        for slide in prs.slides:
+            for shp in slide.shapes:
+                if shp.has_text_frame:
+                    ptext.append(shp.text_frame.text)
+        joined = "\n".join(ptext)
+        assert "Texto libre" in joined or "TTR" in joined
+
+
+# --------------------------------------------------------------------------- #
+# Edges — None.
+# --------------------------------------------------------------------------- #
+def test_edge_none_sin_texto_largo():
+    # titanic-like: short labels only → chapter must not apply.
+    assert build_text_distr(_no_text_profile(), {}) is None
+
+
+def test_edge_none_palabras_cortas():
+    # Char gate passes (len_mean high) but documents are short → confirmation
+    # rejects them (median words below threshold).
+    profile = _text_profile()
+    short = ["palabra " * 3] * 30  # 3 words each, < _MIN_WORDS
+    ctx = {"text_raw": {"review": short}}
+    assert build_text_distr(profile, ctx) is None
+
+
+def test_edge_none_empty_profile():
+    assert build_text_distr({}, {}) is None
+    assert build_text_distr(None, None) is None
+
+
+# --------------------------------------------------------------------------- #
+# Degradation — optional libs absent.
+# --------------------------------------------------------------------------- #
+def test_degradacion_sin_libs(monkeypatch):
+    real_import = builtins.__import__
+    blocked = ("langdetect", "textstat", "wordcloud", "datasketch")
+
+    def fake_import(name, *a, **k):
+        if name in blocked or any(name.startswith(b + ".") for b in blocked):
+            raise ImportError(f"simulado: {name}")
+        return real_import(name, *a, **k)
+
+    monkeypatch.setattr(builtins, "__import__", fake_import)
+
+    ctx = {"text_raw": {"review": _long_reviews()}}
+    ch = build_text_distr(_text_profile(), ctx)
+    # Still builds (the cheap, stdlib-only pieces remain) and never raises.
+    assert ch is not None
+    leaves = _flatten(ch.blocks)
+    assert any(isinstance(b, KVTable) for b in leaves)
+    assert any(isinstance(b, DataTable) for b in leaves)
+    # A degradation note is present mentioning the missing optional libs.
+    notes = " ".join(b.text for b in leaves if isinstance(b, Note))
+    assert "langdetect" in notes or "textstat" in notes or "datasketch" in notes
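Note on `test_degradacion_sin_libs`: the test simulates missing optional dependencies by replacing `builtins.__import__`, which intercepts `import` statements even for modules already cached in `sys.modules`. A minimal standalone sketch of that technique follows, written as a context manager instead of pytest's `monkeypatch`. The names `block_imports` and `optional_feature` are hypothetical, and the stdlib module `colorsys` merely stands in for an optional package such as `langdetect`.

```python
import builtins
from contextlib import contextmanager


@contextmanager
def block_imports(*blocked):
    """Make `import X` raise ImportError for the given top-level packages."""
    real_import = builtins.__import__

    def fake_import(name, *args, **kwargs):
        # Block the package itself and any of its submodules.
        if name in blocked or any(name.startswith(b + ".") for b in blocked):
            raise ImportError(f"simulated: {name}")
        return real_import(name, *args, **kwargs)

    builtins.__import__ = fake_import
    try:
        yield
    finally:
        # Always restore the real importer, even if the body raised.
        builtins.__import__ = real_import


def optional_feature() -> dict:
    """Hypothetical helper following the chapter's never-raise pattern."""
    try:
        import colorsys  # stands in for an optional dependency
        return {"available": True,
                "value": colorsys.rgb_to_hsv(1.0, 0.0, 0.0)[0]}
    except Exception:
        return {"available": False}


with block_imports("colorsys"):
    degraded = optional_feature()   # import fails inside the block
normal = optional_feature()         # import works again afterwards
```

The lazy `import` inside the helper is what makes this work: a module-level import would already have been resolved before the patch is installed, so the blocked path would never be exercised.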
@@ -31,8 +31,10 @@ CHAPTER_ORDER = [
    "analisis_llm",  # LLM interpretation — sits next to overview (user request)
    "num_distr",     # numeric distributions
    "cat_distr",     # categorical distributions
+    "text_distr",    # free-text / NLP distributions (non-tabular content)
    "calidad",       # data quality
    "missingness",   # missing-data patterns (co-occurrence of absences; MCAR/MAR)
+    "outliers",      # atypical values: univariate (Tukey/z) + multivariate (IsolationForest)
    "correlacion",   # correlations / associations
    "relaciones",    # key relations: declared/candidate PK + FK (inter/intra-table)
    "modelos",       # cheap models (PCA/KMeans/outliers)
@@ -0,0 +1,253 @@
+"""Tests for the Markdown completeness appendix (report 2053).
+
+The AutomaticEDA Markdown is the output meant to be *pasted into an LLM*, so it
+must carry EVERYTHING the engine computed — even the numbers the human-facing
+chapters (shared with the PDF/PPTX) drop for readability. ``render_md`` appends a
+full-data appendix built from ``meta['profile']`` that closes the six losses the
+evaluation found:
+
+1. the complete association matrix (every pair, incl. correlation_ratio /
+   cramers_v) — not just the top extremes;
+2. every numeric statistic for every numeric column (skew/kurtosis/percentiles);
+3. the concrete recommended re-expression;
+4. KMeans ``scores_by_k``;
+5. the normality test statistics;
+6. correct headers for bar/scree figure tables (not ``Desde/Hasta/Frecuencia``).
+
+Self-contained: a synthetic profile, no DuckDB, no heavy renderer.
+"""
+
+import os
+import sys
+
+import pytest  # noqa: F401
+
+_HERE = os.path.dirname(os.path.abspath(__file__))
+_FUNCTIONS = os.path.abspath(os.path.join(_HERE, "..", "..", ".."))  # python/functions
+if _FUNCTIONS not in sys.path:
+    sys.path.insert(0, _FUNCTIONS)
+
+from datascience.automatic_eda import model  # noqa: E402
+from datascience.automatic_eda.render_md_impl import (  # noqa: E402
+    _bars_table,
+    _is_histogram_caption,
+    _profile_appendix,
+    render_md,
+)
+
+
+# --------------------------------------------------------------------------- #
+# Synthetic profile fixtures.
+# --------------------------------------------------------------------------- #
+def _numeric(skew, kurtosis):
+    """A numeric stat block with every key the appendix serializes."""
+    return {
+        "count": 100, "min": 0.0, "max": 10.0, "mean": 5.0, "median": 5.0,
+        "mode": 4.0, "std": 2.0, "variance": 4.0, "cv": 0.4,
+        "p1": 0.1, "p5": 0.5, "p25": 2.5, "p50": 5.0, "p75": 7.5,
+        "p95": 9.5, "p99": 9.9, "iqr": 5.0, "skew": skew, "kurtosis": kurtosis,
+        "n_outliers": 1, "distribution_type": "normal",
+    }
+
+
+def _profile():
+    """A small but structurally faithful TableProfile (3 numeric, 2 categorical)."""
+    pairs = [
+        {"a": "A", "b": "B", "a_type": "numeric", "b_type": "numeric",
+         "method": "pearson/spearman", "value": 0.8,
+         "p_value": 1e-9, "p_value_adjusted": 2e-9, "significant": True},
+        {"a": "A", "b": "C", "a_type": "numeric", "b_type": "numeric",
+         "method": "pearson/spearman", "value": -0.3,
+         "p_value": 0.01, "p_value_adjusted": 0.02, "significant": True},
+        {"a": "A", "b": "Cat1", "a_type": "numeric", "b_type": "categorical",
+         "method": "correlation_ratio", "value": 0.45,
+         "p_value": 0.001, "p_value_adjusted": 0.002, "significant": True},
+        # The single cat-cat pair the human chapter never shows.
+        {"a": "Cat1", "b": "Cat2", "a_type": "categorical",
+         "b_type": "categorical", "method": "cramers_v", "value": 0.11,
+         "p_value": 0.04, "p_value_adjusted": 0.05, "significant": False},
+    ]
+    return {
+        "correlations": {
+            "pairs": pairs,
+            "multiple_testing": {"method": "bh", "n_tests": 4, "n_rejected": 3},
+        },
+        "columns": [
+            {"name": "A", "count": 100, "numeric": _numeric(0.0, -1.2),
+             "reexpression": {"recommended": "none", "ladder_power": 1.0,
+                              "reason": "symmetric", "alternatives": []}},
+            {"name": "B", "count": 100, "numeric": _numeric(4.77, 33.1),
+             "reexpression": {"recommended": "log1p", "ladder_power": 0.0,
+                              "reason": "skew 4.77 with zeros",
+                              "alternatives": [{"transform": "yeo-johnson"},
+                                               {"transform": "sqrt"}]}},
+            {"name": "C", "count": 100, "numeric": _numeric(-0.6, 0.2)},
+            {"name": "Cat1", "categorical": {"top": [], "mode": "x"}},
+            {"name": "Cat2", "categorical": {"top": [], "mode": "y"}},
+        ],
+        "models": {
+            "kmeans": {
+                "best_k": 3,
+                "scores_by_k": [
+                    {"k": 2, "silhouette": 0.46, "inertia": 900.0},
+                    {"k": 3, "silhouette": 0.50, "inertia": 550.0},
+                    {"k": 4, "silhouette": 0.38, "inertia": 430.0},
+                ],
+                "cluster_sizes": [40, 35, 25],
+            },
+            "normality": {
+                "A": {"n": 100,
+                      "jarque_bera": {"stat": 18.7, "p": 8e-5, "normal": False},
+                      "dagostino": {"stat": 18.1, "p": 1e-4, "normal": False},
+                      "shapiro": {"stat": 0.98, "p": 7e-8, "normal": False},
+                      "is_normal": False},
+                "C": {"n": 100,
+                      "jarque_bera": {"stat": 2.1, "p": 0.35, "normal": True},
+                      "dagostino": {"stat": 1.9, "p": 0.38, "normal": True},
+                      "shapiro": {"stat": 0.99, "p": 0.12, "normal": True},
+                      "is_normal": True},
+            },
+        },
+    }
+
+
+def _dummy_chapters():
+    """A minimal one-chapter document so render_md does not early-return empty."""
+    return model.as_chapters([
+        {"id": "intro", "title": "Intro",
+         "blocks": [{"kind": "markdown", "text": "cuerpo del informe"}]},
+    ])
+
+
+def _render(tmp_path, profile):
+    out = os.path.join(str(tmp_path), "out.md")
+    res = render_md(_dummy_chapters(), out, {"title": "EDA — t", "profile": profile})
+    assert res["path"] == out
+    return open(out, encoding="utf-8").read()
+
+
+def _table_rows(md, section_title):
+    """Count data rows of the first Markdown table under ``section_title``."""
+    seg = md.split(section_title, 1)[1]
+    rows, in_t, seen_sep = 0, False, False
+    for ln in seg.splitlines():
+        if ln.startswith("|"):
+            in_t = True
+            stripped = ln.replace("|", "").replace(" ", "")
+            if stripped and set(stripped) == {"-"}:
+                seen_sep = True
+                continue
+            if seen_sep:
+                rows += 1
+        elif in_t and not ln.strip():
+            break
+    return rows
+
+
+# --------------------------------------------------------------------------- #
+# Golden: every datum the profile holds reaches the .md.
+# --------------------------------------------------------------------------- #
+def test_appendix_lists_all_correlation_pairs(tmp_path):
+    md = _render(tmp_path, _profile())
+    assert "## Apéndice — Datos completos del perfil" in md
+    # All 4 pairs (the real titanic profile has 28; here 4 synthetic).
+    assert _table_rows(md, "### Matriz de asociación") == 4
+    # The cat-cat Cramér's V pair the human chapter drops is present.
+    assert "Cat1 ↔ Cat2" in md
+    assert "cramers_v" in md
+    assert "correlation_ratio" in md
+
+
+def test_appendix_has_skew_kurtosis_for_every_numeric(tmp_path):
+    md = _render(tmp_path, _profile())
+    seg = md.split("### Estadísticos numéricos completos", 1)[1].split("###", 1)[0]
+    lines = [ln for ln in seg.splitlines() if ln.startswith("|")]
+    header = [h.strip() for h in lines[0].strip("|").split("|")]
+    assert "skew" in header and "kurtosis" in header
+    ski, kui = header.index("skew"), header.index("kurtosis")
+    data = lines[2:]  # skip header + separator
+    assert len(data) == 3  # exactly the 3 numeric columns
+    for row in data:
+        cells = [c.strip() for c in row.strip("|").split("|")]
+        assert cells[ski] != "", f"missing skew in {cells[0]}"
+        assert cells[kui] != "", f"missing kurtosis in {cells[0]}"
+
+
+def test_appendix_has_extended_percentiles(tmp_path):
+    md = _render(tmp_path, _profile())
+    seg = md.split("### Estadísticos numéricos completos", 1)[1]
+    header = [h.strip() for h in seg.splitlines()[2].strip("|").split("|")]
+    for p in ("p1", "p5", "p25", "p75", "p95", "p99"):
+        assert p in header, f"percentile {p} missing from describe header"
+
+
+def test_appendix_names_concrete_reexpression(tmp_path):
+    md = _render(tmp_path, _profile())
+    assert "### Re-expresión recomendada" in md
+    assert "log1p" in md  # the concrete transform, not just "consider re-expressing"
+    assert "yeo-johnson" in md  # alternatives listed too
+
+
+def test_appendix_has_kmeans_scores_by_k(tmp_path):
+    md = _render(tmp_path, _profile())
+    assert "scores_by_k" in md
+    assert _table_rows(md, "#### KMeans — selección de k") == 3  # k=2,3,4
+
+
+def test_appendix_has_normality_statistics(tmp_path):
+    md = _render(tmp_path, _profile())
+    assert "JB stat" in md  # the statistic, not only the p-value
+    assert "Shapiro stat" in md
+    assert _table_rows(md, "#### Tests de normalidad") == 2  # cols A and C
+
+
+# --------------------------------------------------------------------------- #
+# Edge: a profile missing models / correlations degrades, never raises.
+# --------------------------------------------------------------------------- #
+def test_lite_profile_without_models(tmp_path):
+    prof = _profile()
+    prof.pop("models")  # lite: no KMeans/normality
+    md = _render(tmp_path, prof)
+    assert "scores_by_k" not in md  # section skipped
+    assert "Matriz de asociación" in md  # correlations still dumped
+    assert "## Apéndice" in md
+
+
+def test_profile_without_correlations(tmp_path):
+    prof = _profile()
+    prof.pop("correlations")
+    md = _render(tmp_path, prof)  # must not raise
+    assert "Matriz de asociación" not in md
+    assert "Estadísticos numéricos completos" in md  # numeric section still there
+
+
+def test_no_profile_means_no_appendix(tmp_path):
+    out = os.path.join(str(tmp_path), "noprof.md")
+    res = render_md(_dummy_chapters(), out, {"title": "x"})
+    assert res["path"] == out
+    assert "## Apéndice" not in open(out, encoding="utf-8").read()
+
+
+def test_appendix_helper_is_defensive():
+    assert _profile_appendix(None) == ""
+    assert _profile_appendix({}) == ""
+    assert _profile_appendix({"columns": []}) == ""
+
+
+# --------------------------------------------------------------------------- #
+# Loss #6: bar/scree figure tables get a non-misleading header.
+# --------------------------------------------------------------------------- #
+def test_histogram_caption_detection():
+    assert _is_histogram_caption("Histograma de Age")
+    assert _is_histogram_caption("Distribución de Fare")
+    assert not _is_histogram_caption("Media de Survived por Sex")
+    assert not _is_histogram_caption("Varianza explicada (scree PCA)")
+
+
+def test_bars_table_custom_header():
+    bars = [(0.0, 1.0, 5.0), (1.0, 2.0, 3.0)]
+    hist = _bars_table(bars)  # default histogram header
+    assert "| Desde | Hasta | Frecuencia |" in hist
+    bar = _bars_table(bars, ("Inicio", "Fin", "Valor"))
+    assert "| Inicio | Fin | Valor |" in bar
+    assert "Frecuencia" not in bar
@@ -178,9 +178,17 @@ def _md_data_table(block) -> str:
    return "\n".join(lines)


-def _bars_table(bars: list) -> str:
-    """Render extracted bar/histogram data as a Markdown table (Desde/Hasta/Frec)."""
-    lines = ["| Desde | Hasta | Frecuencia |", "| --- | --- | --- |"]
+def _bars_table(bars: list, header: tuple = ("Desde", "Hasta", "Frecuencia")) -> str:
+    """Render extracted bar/histogram data as a Markdown table.
+
+    ``header`` is the 3-column header to use. Histogram bars are
+    ``(Desde, Hasta, Frecuencia)``; bar/scree charts (means by group, PCA
+    explained variance) are *not* bins, so the caller passes a semantically
+    correct header (e.g. ``(Inicio, Fin, Valor)``) to avoid the misleading
+    "Frecuencia" label — see report 2053, loss #6.
+    """
+    h0, h1, h2 = header
+    lines = [f"| {h0} | {h1} | {h2} |", "| --- | --- | --- |"]
    shown = bars[:_MAX_BAR_ROWS]
    for x0, x1, h in shown:
        lines.append(f"| {_fmt_num(x0)} | {_fmt_num(x1)} | {_fmt_num(h)} |")
@@ -191,6 +199,18 @@ def _bars_table(bars: list) -> str:
    return out


+def _is_histogram_caption(caption: str) -> bool:
+    """True when a figure caption describes a histogram (genuine numeric bins).
+
+    Histograms are the only figures whose bars are real ``[Desde, Hasta)`` bins
+    with a frequency count. Bar charts (means by group) and the PCA scree plot
+    carry per-category / per-component values, not bins — they must not inherit
+    the ``Desde/Hasta/Frecuencia`` header.
+    """
+    c = (caption or "").lower()
+    return "histograma" in c or "distribución" in c or "distribucion" in c
+
+
 def _extract_bars(fig) -> list:
    """Collect (x_from, x_to, height) of the rectangular bars of a matplotlib fig.

@@ -253,7 +273,13 @@ def _md_figure(block, meta: dict, out_path: str, counter: list) -> str:
        if fig is not None:
            bars = _extract_bars(fig)
            if bars:
-                parts.append(_bars_table(bars))
+                # A histogram's bars are genuine numeric bins (Desde/Hasta/
+                # Frecuencia). Bar charts and the PCA scree plot are not bins —
+                # give them a header that does not lie about "Frecuencia".
+                header = (("Desde", "Hasta", "Frecuencia")
+                          if _is_histogram_caption(caption)
+                          else ("Inicio", "Fin", "Valor"))
+                parts.append(_bars_table(bars, header))
            if meta.get("embed_figures"):
                png = _embed_png(fig, out_path, counter)
                if png:
@@ -354,6 +380,258 @@ def _serialize_block(block, meta: dict, out_path: str, counter: list) -> str:
    return _md_note(model.Note(text=model._safe_str(block)))


+# --------------------------------------------------------------------------- #
+# Profile appendix — the data the human-facing chapters drop.
+#
+# The chapter document (shared with the PDF/PPTX renderers) is designed for human
+# reading and intentionally omits raw numbers: the correlation matrix shows only
+# the top extremes, the numeric blocks skip skew/kurtosis/extended percentiles,
+# the model chapter does not list ``scores_by_k`` or the normality test
+# statistics. But the Markdown is meant to be *pasted into an LLM*, so it should
+# carry EVERYTHING the engine computed. This appendix serializes the full
+# ``profile`` (passed via ``meta['profile']``) as Markdown tables, additively:
+# the PDF/PPTX are untouched, the .md simply has more than they do. Each section
+# is emitted only when its source data is present, so a ``lite`` profile (no
+# models) or a profile without correlations degrades cleanly instead of raising.
+# See report 2053 for the six losses this closes.
+# --------------------------------------------------------------------------- #
+def _pair_types(a_type, b_type) -> str:
+    """Short ``num↔cat`` label for an association pair's variable types."""
+    def short(t):
+        t = model._safe_str(t).lower()
+        if t.startswith("num"):
+            return "num"
+        if t.startswith("cat"):
+            return "cat"
+        return t or "?"
+    return f"{short(a_type)}↔{short(b_type)}"
+
+
+def _app_correlations(corr: dict) -> str:
+    """Loss #1 — every association pair (not just the top extremes).
+
+    Dumps all of ``correlations['pairs']`` as a table (pair · types · method ·
+    value · p · p-FDR · significant), ordered by |value| desc so the strongest
+    associations lead while nothing is cut. Includes the ``correlation_ratio``
+    (num↔cat) and ``cramers_v`` (cat↔cat) pairs the human chapter never shows.
+    """
+    pairs = list(corr.get("pairs", []) or [])
+    if not pairs:
+        return ""
+    def keyfn(p):
+        try:
+            return -abs(float(p.get("value")))
+        except Exception:  # noqa: BLE001
+            return 0.0
+    pairs_sorted = sorted(pairs, key=keyfn)
+    lines = ["### Matriz de asociación — todos los pares",
+             "",
+             ("| Par | Tipos | Método | Valor | p-value | p-ajustado (FDR) "
+              "| ¿Sig? |"),
+             "| --- | --- | --- | --- | --- | --- | --- |"]
+    for p in pairs_sorted:
+        par = f"{_cell(p.get('a'))} ↔ {_cell(p.get('b'))}"
+        types = _pair_types(p.get("a_type"), p.get("b_type"))
+        method = _cell(p.get("method"))
+        val = _fmt_num(p.get("value"))
+        pv = _fmt_num(p.get("p_value")) if p.get("p_value") is not None else ""
+        padj = (_fmt_num(p.get("p_value_adjusted"))
+                if p.get("p_value_adjusted") is not None else "")
+        sig = "sí" if p.get("significant") else "no"
+        lines.append(
+            f"| {par} | {types} | {method} | {val} | {pv} | {padj} | {sig} |")
+    mt = corr.get("multiple_testing") or {}
+    n_tests = mt.get("n_tests", corr.get("n_tests"))
+    n_rej = mt.get("n_rejected")
+    note_bits = [f"{len(pairs)} pares en total"]
+    if n_tests is not None and n_rej is not None:
+        note_bits.append(
+            f"{n_rej} de {n_tests} significativos tras corrección "
+            f"{model._safe_str(mt.get('method', 'FDR')).upper()}")
+    lines.append("")
+    lines.append(f"*{'; '.join(note_bits)}.*")
+    return "\n".join(lines)
+
+
+# Numeric statistics, in serialization order: (profile key, column header).
+_NUM_STATS = [
+    ("count", "n"), ("mean", "mean"), ("median", "median"), ("mode", "mode"),
+    ("std", "std"), ("variance", "variance"), ("cv", "cv"),
+    ("skew", "skew"), ("kurtosis", "kurtosis"),
+    ("min", "min"), ("p1", "p1"), ("p5", "p5"), ("p25", "p25"), ("p50", "p50"),
+    ("p75", "p75"), ("p95", "p95"), ("p99", "p99"), ("iqr", "iqr"),
+    ("max", "max"), ("n_outliers", "outliers"),
+    ("distribution_type", "distribución"),
+]
+
+
+def _app_numeric_describe(columns: list) -> str:
+    """Loss #2 — every numeric statistic for every numeric column.
+
+    One row per numeric column with the full describe: mean/median/mode/std/
+    variance/cv, skew & kurtosis (for ALL columns, not only the skewed ones),
+    p1/p5/p25/p50/p75/p95/p99, iqr, min/max, outliers and distribution_type.
+    """
+    rows = []
+    for info in (columns or []):
+        num = info.get("numeric") if isinstance(info, dict) else None
+        if not num:
+            continue
+        name = _cell(info.get("name"))
+        cells = [name]
+        for key, _hdr in _NUM_STATS:
+            v = num.get(key)
+            if key == "count" and v is None:
+                v = info.get("count")  # fall back to the column-level count
+            if key == "distribution_type":
+                cells.append(_cell(v))
+            else:
+                cells.append(_fmt_num(v) if v is not None else "")
+        rows.append(cells)
+    if not rows:
+        return ""
+    header = ["Columna"] + [hdr for _k, hdr in _NUM_STATS]
+    lines = ["### Estadísticos numéricos completos (describe)",
+             "",
+             "| " + " | ".join(header) + " |",
+             "| " + " | ".join(["---"] * len(header)) + " |"]
+    for cells in rows:
+        lines.append("| " + " | ".join(cells) + " |")
+    return "\n".join(lines)
+
+
+def _app_reexpression(columns: list) -> str:
+    """Loss #3 — the concrete recommended re-expression per column.
+
+    Names the transform (log1p/sqrt/yeo-johnson/none) instead of a vague
+    "consider re-expressing", with the ladder power, reason and alternatives.
+    """
+    rows = []
+    for info in (columns or []):
+        rx = info.get("reexpression") if isinstance(info, dict) else None
+        if not rx or not isinstance(rx, dict):
+            continue
+        rec = model._safe_str(rx.get("recommended")).strip()
+        if not rec:
+            continue
+        alts = rx.get("alternatives") or []
+        alt_txt = ", ".join(
+            model._safe_str(a.get("transform")) for a in alts
+            if isinstance(a, dict) and a.get("transform")) or "—"
+        rows.append([
+            _cell(info.get("name")), _cell(rec),
+            _fmt_num(rx.get("ladder_power")) if rx.get("ladder_power") is not None else "",
+            _cell(rx.get("reason")), _cell(alt_txt),
+        ])
+    if not rows:
+        return ""
+    lines = ["### Re-expresión recomendada (escalera de Tukey)",
+             "",
+             "| Columna | Recomendada | Potencia | Razón | Alternativas |",
+             "| --- | --- | --- | --- | --- |"]
+    for r in rows:
+        lines.append("| " + " | ".join(r) + " |")
+    return "\n".join(lines)
+
+
+def _app_kmeans_scores(kmeans: dict) -> str:
+    """Loss #4 — KMeans silhouette + inertia per k (justifies the chosen k)."""
+    scores = list(kmeans.get("scores_by_k", []) or [])
+    if not scores:
+        return ""
+    best_k = kmeans.get("best_k")
+    lines = ["#### KMeans — selección de k (`scores_by_k`)",
+             "",
+             "| k | Silhouette | Inercia | Elegido |",
+             "| --- | --- | --- | --- |"]
+    for s in scores:
+        if not isinstance(s, dict):
+            continue
+        k = s.get("k")
+        chosen = "✓" if best_k is not None and k == best_k else ""
+        lines.append(
+            f"| {_fmt_num(k)} | {_fmt_num(s.get('silhouette'))} "
+            f"| {_fmt_num(s.get('inertia'))} | {chosen} |")
+    return "\n".join(lines)
+
+
+def _app_normality(normality: dict) -> str:
+    """Loss #5 — each normality test's statistic next to its p-value."""
+    if not isinstance(normality, dict) or not normality:
+        return ""
+    lines = ["#### Tests de normalidad (estadístico + p-value)",
+             "",
+             ("| Columna | n | JB stat | JB p | D'Agostino stat | D'Agostino p "
+              "| Shapiro stat | Shapiro p | ¿Normal? |"),
+             "| --- | --- | --- | --- | --- | --- | --- | --- | --- |"]
+    any_row = False
+    for col, res in normality.items():
+        if not isinstance(res, dict):
+            continue
+        jb = res.get("jarque_bera") or {}
+        da = res.get("dagostino") or {}
+        sh = res.get("shapiro") or {}
+        is_norm = "sí" if res.get("is_normal") else "no"
+        lines.append(
+            f"| {_cell(col)} | {_fmt_num(res.get('n')) if res.get('n') is not None else ''} "
+            f"| {_fmt_num(jb.get('stat'))} | {_fmt_num(jb.get('p'))} "
+            f"| {_fmt_num(da.get('stat'))} | {_fmt_num(da.get('p'))} "
+            f"| {_fmt_num(sh.get('stat'))} | {_fmt_num(sh.get('p'))} | {is_norm} |")
+        any_row = True
+    return "\n".join(lines) if any_row else ""
+
+
+def _profile_appendix(profile: dict) -> str:
+    """Build the full-data appendix from a TableProfile dict (additive).
+
+    Returns a Markdown ``## Apéndice`` section with one sub-table per loss the
+    human chapters drop, or ``""`` when the profile carries none of them. Never
+    raises: a missing/oddly-shaped section is skipped, not fatal.
+    """
+    if not isinstance(profile, dict):
+        return ""
+    sections: list = []
+    try:
+        corr = profile.get("correlations") or {}
+        seg = _app_correlations(corr) if isinstance(corr, dict) else ""
+        if seg:
+            sections.append(seg)
+    except Exception:  # noqa: BLE001
+        pass
+    try:
+        columns = profile.get("columns") or []
+        seg = _app_numeric_describe(columns)
+        if seg:
+            sections.append(seg)
+        seg = _app_reexpression(columns)
+        if seg:
+            sections.append(seg)
+    except Exception:  # noqa: BLE001
+        pass
+    try:
+        models = profile.get("models") or {}
+        if isinstance(models, dict):
+            model_segs = []
+            seg = _app_kmeans_scores(models.get("kmeans") or {})
+            if seg:
+                model_segs.append(seg)
+            seg = _app_normality(models.get("normality") or {})
+            if seg:
+                model_segs.append(seg)
+            if model_segs:
+                sections.append(
+                    "### Modelos — detalle\n\n" + "\n\n".join(model_segs))
+    except Exception:  # noqa: BLE001
+        pass
+    if not sections:
+        return ""
+    intro = ("Volcado completo de los datos que el motor computó y que los "
+             "capítulos (pensados para lectura humana / PDF) resumen. "
+             "Pensado para que un LLM reconstruya el análisis entero.")
+    return ("## Apéndice — Datos completos del perfil\n\n"
+            f"*{intro}*\n\n" + "\n\n".join(sections))
+
+
 # --------------------------------------------------------------------------- #
 # Entry point.
 # --------------------------------------------------------------------------- #
@@ -437,6 +715,18 @@ def render_md(chapters: list, out_path: str, meta: dict = None) -> dict:
                segments.append(seg)
        chapters_meta.append({"id": ch.id, "version": ch.version})

+    # Full-data appendix: dump everything the profile holds that the human
+    # chapters drop (additive — the .md ends up with more than the PDF/PPTX).
+    # Emitted only when a profile is supplied via meta['profile']; never fatal.
+    try:
+        appendix = _profile_appendix(meta.get("profile"))
+    except Exception as e:  # noqa: BLE001
+        appendix = ""
+        notes.append(f"apéndice de perfil omitido: {e}")
+    if appendix:
+        segments.append("---")
+        segments.append(appendix)
+
    content = "\n\n".join(segments) + "\n"
    note = f"{len(content)} caracteres"
    if notes:
@@ -0,0 +1,125 @@
+---
+id: build_boxplots_figure_py_datascience
+name: build_boxplots_figure
+kind: function
+lang: py
+domain: datascience
+version: "1.0.0"
+purity: impure
+signature: "def build_boxplots_figure(boxes: list, title: str = \"\", max_boxes: int = 12) -> \"matplotlib.figure.Figure\""
+description: "Builds a single matplotlib figure of HORIZONTAL Tukey boxplots (one per column) using ax.bxp: Q1-Q3 box, whiskers up to 1.5*IQR, median line and outlier points. Consumes the output of build_boxplot_stats (one box dict per column, read with .get) plus an optional list of raw outliers per column; when present they are drawn as points (showfliers), otherwise only box[min]/box[max] are marked when there are tail outliers (same as num_distr). Draws at most max_boxes boxes (the first ones, already sorted by contamination by the caller) and flags the truncation with (mostrando N de M). Agg backend without global pyplot; height adapts to the number of boxes. Defensive: skips invalid entries and NEVER raises; with no valid boxes it returns a placeholder figure (sin boxplots). It is the small-multiples version of the num_distr chapter, answering at a glance which columns have the most outliers."
+tags: [eda, outliers, boxplot, tukey, iqr, bxp, matplotlib, figure, visualization, small-multiples, datascience, impure]
+uses_functions: []
+uses_types: []
+returns: []
+returns_optional: false
+error_type: "error_go_core"
+imports: [matplotlib]
+example: |
+  from datascience.build_boxplot_stats import build_boxplot_stats
+  from datascience.build_boxplots_figure import build_boxplots_figure
+  boxes = [
+      {"name": "ingresos", "box": build_boxplot_stats({"min": 1.0, "max": 9e3,
+          "p25": 1e3, "median": 2e3, "p75": 3e3, "n_outliers": 7}), "fliers": None},
+      {"name": "edad", "box": build_boxplot_stats({"min": 0.0, "max": 99.0,
+          "p25": 25.0, "median": 38.0, "p75": 52.0}), "fliers": None},
+  ]
+  fig = build_boxplots_figure(boxes, title="Outliers por columna", max_boxes=12)
+tested: true
+tests:
+  - "test_returns_figure_with_axes"
+  - "test_empty_list_returns_placeholder_figure"
+  - "test_invalid_box_is_skipped_not_raised"
+  - "test_all_invalid_returns_placeholder"
+  - "test_raw_fliers_are_drawn"
+  - "test_max_boxes_truncates_and_does_not_raise"
+test_file_path: "python/functions/datascience/build_boxplots_figure_test.py"
+file_path: "python/functions/datascience/build_boxplots_figure.py"
+params:
+  - name: boxes
+    desc: "List of dicts, each {\"name\": str, \"box\": dict, \"fliers\": list|None}. box is EXACTLY the output of build_boxplot_stats (keys read with .get: q1, median, q3, whisker_lo, whisker_hi, min, max, has_low_outliers, has_high_outliers, lower_fence, upper_fence, n_outliers). fliers is the optional list of raw outliers: when present it is drawn as points; when None/absent only the extremes box[min]/box[max] are marked if there are tail outliers. Entries that are not dicts, have no box dict, or lack q1/median/q3 are skipped. The caller passes them already sorted by contamination (highest first)."
+  - name: title
+    desc: "Figure title (fig.suptitle, left-aligned). Empty => no title. If len(boxes) > max_boxes a \"(mostrando N de M)\" note is appended so the truncation is not silent. Default \"\"."
+  - name: max_boxes
+    desc: "Maximum number of boxes to draw (the first ones in the list). Default 12. A non-integer or <= 0 value falls back to 12. If the list holds more entries, the extras are dropped but reported in the title with (mostrando N de M)."
+output: "A matplotlib.figure.Figure (figsize 7.0 x adaptive height = max(2.0, 0.5*n + 1.0), dpi 150) with a single Axes stacking horizontal Tukey boxplots (ax.bxp, orientation=horizontal with a vert=False fallback), one per valid column, top to bottom in the order received. Each box: fill #9ec6df, border/whiskers/caps #5b8aa6, median #2e8b57, outliers #c0392b. Y-axis labels = column names; X axis labeled \"valor\". Outliers are drawn from raw fliers (showfliers) or, when missing, marked at box[min]/box[max] according to has_low/high_outliers. If no valid box remains (empty list or all invalid) it returns a placeholder Figure with the centered text \"(sin boxplots)\"; any unexpected error is caught and returned as a Figure carrying the error message. NEVER raises. The caller rasterizes and releases the figure; the function neither shows nor saves it."
+---
+
+## Ejemplo
+
+```python
+import sys, os
+sys.path.insert(0, os.path.join("python", "functions"))
+from datascience.build_boxplot_stats import build_boxplot_stats
+from datascience.build_boxplots_figure import build_boxplots_figure
+
+# One `box` per numeric column, derived from the profile's `numeric` sub-block
+# (the output of describe_numeric). The caller passes them sorted by outlier_pct.
+boxes = [
+    {
+        "name": "ingresos",
+        "box": build_boxplot_stats({
+            "min": 1.0, "max": 9000.0,
+            "p25": 1000.0, "median": 2000.0, "p75": 3000.0,
+            "n_outliers": 7,
+        }),
+        "fliers": None,  # raw values unknown -> only the extreme is marked.
+    },
+    {
+        "name": "edad",
+        "box": build_boxplot_stats({
+            "min": 0.0, "max": 99.0,
+            "p25": 25.0, "median": 38.0, "p75": 52.0,
+        }),
+        "fliers": [95.0, 99.0],  # raw outliers (above the 92.5 fence) -> drawn as points.
+    },
+]
+
+fig = build_boxplots_figure(boxes, title="Outliers por columna", max_boxes=12)
+
+# The report renderer rasterizes it; here we only save it for inspection.
+fig.savefig("/tmp/boxplots.png")
+```
+
+## Cuando usarla
+
+Use it in the outliers chapter of an EDA report when you want to compare at a
+glance *which columns are most contaminated by outliers*: unlike `num_distr`
+(which draws one histogram+boxplot per column in separate figures), here all
+the horizontal boxplots are stacked in **a single figure** (small multiples).
+First derive each column's `box` with `build_boxplot_stats`, sort the columns
+by descending `outlier_pct`, wrap them as `{"name", "box", "fliers"}` and pass
+them in. If you have the raw values that fall outside the fences, supply them
+as the `fliers` list and they are drawn as points; otherwise the function
+marks only the `min`/`max` extremes when there is a tail.
+
+## Gotchas
+
+- **Impure because of matplotlib.** It touches the rendering machinery. It uses
+  the `Agg` backend and the object-oriented `Figure`/`add_subplot` API, NEVER
+  `pyplot.*`, so it neither touches pyplot's global state nor leaks figures
+  between calls. `pyplot` is NOT thread-safe; this function builds the `Figure`
+  directly, so it is safe to call in a loop from the renderer.
+- **The caller releases the figure.** It returns the `Figure` without showing or
+  saving it. The consumer rasterizes it and then drops the reference; the figure
+  was never registered with `pyplot`, so `pyplot.close(fig)` has nothing to free.
+- **`fliers` is optional, with different semantics.** If you pass the list of raw
+  outliers, all of them are drawn as points (`showfliers=True`). If it is
+  `None`/absent, the values are unknown and a single point is marked at
+  `box["min"]` / `box["max"]` when `has_low_outliers` / `has_high_outliers` is
+  set, the same criterion as `num_distr`. Do not invent fliers from the profile:
+  the `box` carries no raw values, only whether the extremes exceed the fences.
+- **Orientation API of `ax.bxp`.** Recent matplotlib uses
+  `orientation="horizontal"`; older versions use `vert=False`. The function
+  tries the former and falls back to the latter on `except TypeError`, so it
+  works with both. If `bxp` fails altogether, the Axes degrades to the text
+  "(boxplot no disponible)" instead of propagating.
+- **Visible truncation.** `max_boxes` (default 12) caps the number of boxes so
+  that none overlap; if the list holds more, the extras are dropped and the
+  title says so with "(mostrando N de M)". Pass the columns already sorted by
+  contamination so that the dropped ones are the least relevant.
+- **Defensive, never raises.** Entries that are not dicts, have no `box`, or lack
+  `q1`/`median`/`q3` are skipped without propagating; with an empty list or no
+  valid boxes it returns a "(sin boxplots)" placeholder, and any unexpected
+  error is captured in a figure carrying the error text. There is no need to
+  wrap the call in try/except: it does not raise.
@@ -0,0 +1,250 @@
+"""Impure EDA helper: a single figure of horizontal Tukey boxplots (`eda` group).
+
+Draws, in one ``matplotlib.figure.Figure``, a stack of horizontal Tukey boxplots
+(one per column) using ``ax.bxp``: each carries its box (Q1–Q3), whiskers (up to
+1.5·IQR), the median line and its outlier points. It consumes the output of the
+pure registry function ``build_boxplot_stats`` (one ``box`` dict per column) plus
+an optional list of raw outlier values per column; it never recomputes anything.
+
+It is the "small-multiples" companion of ``num_distr`` (which draws one
+histogram+boxplot per column): here every column shares a single figure so the
+caller can show, at a glance, *which* columns are the most contaminated by
+outliers (the caller passes them already ordered by contamination).
+
+Impure because it touches matplotlib's rendering machinery. Apart from selecting
+the headless Agg backend at import time, it keeps to the object-oriented
+``Figure`` API (no ``pyplot``), so it is safe to call repeatedly from a report
+renderer. It is fully defensive and NEVER raises: invalid entries are skipped
+and, if nothing valid remains, it returns a placeholder "(sin boxplots)" figure.
+"""
+
+import matplotlib
+
+matplotlib.use("Agg")
+
+from matplotlib.figure import Figure  # noqa: E402
+
+# Blue palette shared with the ``num_distr`` chapter so the report stays coherent.
+_BOX_FACE = "#9ec6df"   # box fill.
+_BOX_EDGE = "#5b8aa6"   # box / whisker / cap border.
+_MEDIAN = "#2e8b57"     # median line (sea green).
+_OUTLIER = "#c0392b"    # outlier points (soft red).
+# Muted gray for the placeholder / fallback message text.
+_MUTED_TEXT = "#5f6b7a"
+# Soft red for the error fallback message.
+_ERROR_TEXT = "#b00020"
+
+
+def _num(value):
+    """Coerce ``value`` to float defensively; None for None/bool/non-numeric/NaN."""
+    # bool is a subclass of int; a stat value is never a real bool, so treat
+    # True/False as missing instead of silently coercing to 1.0/0.0.
+    if value is None or isinstance(value, bool):
+        return None
+    try:
+        f = float(value)
+    except (TypeError, ValueError):
+        return None
+    if f != f:  # NaN guard.
+        return None
+    return f
+
+
+def _placeholder_figure(message: str, color: str = _MUTED_TEXT) -> "Figure":
+    """Return a fallback ``Figure`` carrying a single centered message."""
+    fig = Figure(figsize=(7.0, 2.4), dpi=150)
+    ax = fig.add_subplot(111)
+    ax.axis("off")
+    ax.text(
+        0.5,
+        0.5,
+        message,
+        ha="center",
+        va="center",
+        fontsize=12,
+        color=color,
+        wrap=True,
+        transform=ax.transAxes,
+    )
+    fig.tight_layout()
+    return fig
+
+
+def build_boxplots_figure(
+    boxes: list,
+    title: str = "",
+    max_boxes: int = 12,
+) -> "matplotlib.figure.Figure":
+    """Build one figure of stacked horizontal Tukey boxplots (one per column).
+
+    For each entry the function builds a ``bxp`` stats record (``med, q1, q3,
+    whislo, whishi, fliers, label``) from its ``box`` sub-dict (the output of
+    ``build_boxplot_stats``) and draws all of them as horizontal boxplots sharing
+    the X axis, top-to-bottom in the order received (the caller is expected to
+    pass them already sorted by contamination).
+
+    Outliers are shown two ways:
+
+    - If an entry carries a ``fliers`` list (the raw out-of-fence values), they
+      are drawn as red points via ``ax.bxp(..., showfliers=True)``.
+    - If ``fliers`` is ``None``/absent, the raw values are unknown, so only the
+      extremes are marked: a red point at ``box["min"]`` when
+      ``box["has_low_outliers"]`` and at ``box["max"]`` when
+      ``box["has_high_outliers"]`` (same convention as ``num_distr``).
+
+    The function is fully defensive and NEVER raises. Entries that are not dicts,
+    lack a ``box`` dict, or miss any of ``q1``/``median``/``q3`` are skipped. If
+    after filtering no valid box remains it returns a placeholder ``Figure`` with
+    a centered "(sin boxplots)"; any unexpected error is caught and turned into a
+    fallback figure carrying the error text. It always returns a ``Figure``.
+
+    Args:
+        boxes: List of dicts ``{"name": str, "box": dict, "fliers": list|None}``.
+            ``box`` is exactly the output of ``build_boxplot_stats`` (read with
+            ``.get``: ``q1, median, q3, whisker_lo, whisker_hi, min, max,
+            has_low_outliers, has_high_outliers, ...``). ``fliers`` is the
+            optional list of raw outlier values; when present they are plotted,
+            otherwise only the extremes are marked.
+        title: Figure title (``fig.suptitle``); a "(mostrando N de M)" note is
+            appended when ``len(boxes) > max_boxes``. Neither => no suptitle.
+        max_boxes: Draw at most the first ``max_boxes`` entries (default 12). The
+            rest are dropped but their omission is surfaced in the title note, so
+            the truncation is never silent.
+
+    Returns:
+        A ``matplotlib.figure.Figure`` with a single Axes holding the horizontal
+        boxplots (height adaptive to the box count so none overlap). The caller is
+        responsible for rasterizing/closing it; this function never shows nor
+        saves it.
+    """
+    try:
+        if not isinstance(boxes, (list, tuple)) or len(boxes) == 0:
+            return _placeholder_figure("(sin boxplots)")
+
+        total = len(boxes)
+
+        # Cap the number of boxes; tolerate a non-int / non-positive max_boxes.
+        try:
+            cap = int(max_boxes)
+        except (TypeError, ValueError, OverflowError):
+            cap = 12
+        if cap <= 0:
+            cap = 12
+        candidates = list(boxes)[:cap]
+
+        stats_list = []        # bxp stats records, in draw order.
+        labels = []            # Y tick labels (column names).
+        manual_markers = []    # (position, box) for entries without raw fliers.
+        any_fliers = False     # whether to enable showfliers in the bxp call.
+
+        for entry in candidates:
+            if not isinstance(entry, dict):
+                continue
+            box = entry.get("box")
+            if not isinstance(box, dict):
+                continue
+
+            q1 = _num(box.get("q1"))
+            med = _num(box.get("median"))
+            q3 = _num(box.get("q3"))
+            # Without the three quartiles a boxplot cannot be drawn — skip it.
+            if q1 is None or med is None or q3 is None:
+                continue
+
+            # Whisker extremes fall back to the quartiles when missing.
+            whislo = _num(box.get("whisker_lo"))
+            whishi = _num(box.get("whisker_hi"))
+            if whislo is None:
+                whislo = q1
+            if whishi is None:
+                whishi = q3
+
+            name = entry.get("name")
+            label = "" if name is None else str(name)
+
+            position = len(stats_list) + 1  # bxp positions are 1-indexed.
+            fliers_raw = entry.get("fliers")
+            if isinstance(fliers_raw, (list, tuple)):
+                fliers = [v for v in (_num(x) for x in fliers_raw) if v is not None]
+                if fliers:
+                    any_fliers = True
+            else:
+                # Raw values unknown: draw no bxp fliers, mark min/max by hand.
+                fliers = []
+                manual_markers.append((position, box))
+
+            stats_list.append({
+                "med": med,
+                "q1": q1,
+                "q3": q3,
+                "whislo": whislo,
+                "whishi": whishi,
+                "fliers": fliers,
+                "label": label,
+            })
+            labels.append(label)
+
+        if not stats_list:
+            return _placeholder_figure("(sin boxplots)")
+
+        n = len(stats_list)
+        positions = list(range(1, n + 1))
+
+        # Height grows with the box count so none of them overlap.
+        height = max(2.0, 0.5 * n + 1.0)
+        fig = Figure(figsize=(7.0, height), dpi=150)
+        ax = fig.add_subplot(111)
+
+        bxp_kw = dict(
+            showfliers=any_fliers, widths=0.5, patch_artist=True,
+            boxprops={"facecolor": _BOX_FACE, "edgecolor": _BOX_EDGE},
+            medianprops={"color": _MEDIAN, "linewidth": 1.6},
+            whiskerprops={"color": _BOX_EDGE},
+            capprops={"color": _BOX_EDGE},
+            flierprops={"marker": "o", "markersize": 3.5,
+                        "markerfacecolor": _OUTLIER, "markeredgecolor": _OUTLIER,
+                        "linestyle": "none"})
+        try:
+            # ``orientation`` is the current API; older matplotlib uses ``vert``.
+            try:
+                ax.bxp(stats_list, positions=positions,
+                       orientation="horizontal", **bxp_kw)
+            except TypeError:
+                ax.bxp(stats_list, positions=positions, vert=False, **bxp_kw)
+        except Exception:  # noqa: BLE001 — never let bxp kill the whole figure.
+            ax.text(0.5, 0.5, "(boxplot no disponible)", ha="center",
+                    va="center", fontsize=10, color=_MUTED_TEXT,
+                    transform=ax.transAxes)
+
+        # For entries without raw fliers, mark only the out-of-fence extremes.
+        for position, box in manual_markers:
+            mn = _num(box.get("min"))
+            mx = _num(box.get("max"))
+            if box.get("has_low_outliers") and mn is not None:
+                ax.plot([mn], [position], marker="o", markersize=3.5,
+                        color=_OUTLIER, zorder=5)
+            if box.get("has_high_outliers") and mx is not None:
+                ax.plot([mx], [position], marker="o", markersize=3.5,
+                        color=_OUTLIER, zorder=5)
+
+        # Pin the Y tick labels explicitly so they work across matplotlib
+        # versions regardless of whether ``bxp`` consumed the ``label`` key.
+        ax.set_yticks(positions)
+        ax.set_yticklabels(labels, fontsize=8)
+        ax.set_xlabel("valor", fontsize=9)
+        ax.tick_params(labelsize=7)
+        ax.margins(y=0.15)
+        for spine in ("top", "right"):
+            ax.spines[spine].set_visible(False)
+
+        # Surface truncation in the title instead of silently dropping boxes.
+        note = f"(mostrando {n} de {total})" if total > cap else ""
+        heading = "  ".join(p for p in (title, note) if p)
+        if heading:
+            fig.suptitle(heading, fontsize=12, x=0.02, ha="left")
+
+        fig.tight_layout()
+        return fig
+    except Exception as exc:  # noqa: BLE001 — never raise from a figure builder.
+        return _placeholder_figure(
+            f"error al dibujar boxplots: {exc}", color=_ERROR_TEXT)
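The `box` dict consumed by `build_boxplots_figure` comes from `build_boxplot_stats`, which this diff does not show. As a rough, stdlib-only sketch of what one `{name, box, fliers}` entry could look like when built from raw values (the helper name `tukey_box_entry` is hypothetical, and the registry function reportedly derives its fences from profile percentiles rather than from raw data, so the numbers may differ):

```python
import statistics


def tukey_box_entry(name, values):
    """Hypothetical helper: build a {name, box, fliers} entry from raw values."""
    data = sorted(float(v) for v in values)
    # Inclusive quartiles; other quantile conventions give slightly different cuts.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    inside = [v for v in data if lower_fence <= v <= upper_fence]
    fliers = [v for v in data if v < lower_fence or v > upper_fence]
    box = {
        "q1": q1, "median": median, "q3": q3, "iqr": iqr,
        "lower_fence": lower_fence, "upper_fence": upper_fence,
        # Whiskers reach the most extreme values still inside the fences.
        "whisker_lo": inside[0], "whisker_hi": inside[-1],
        "min": data[0], "max": data[-1],
        "has_low_outliers": data[0] < lower_fence,
        "has_high_outliers": data[-1] > upper_fence,
        "n_outliers": len(fliers),
    }
    return {"name": name, "box": box, "fliers": fliers}


entry = tukey_box_entry("fare", [1, 2, 3, 4, 5, 100])
print(entry["box"]["upper_fence"], entry["fliers"])
```

Passing `fliers` as a list makes the figure builder draw every raw outlier; passing `None` instead would make it mark only `min`/`max` according to the `has_low_outliers`/`has_high_outliers` flags.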
@@ -0,0 +1,109 @@
+"""Tests for build_boxplots_figure (horizontal Tukey boxplots, eda group).
+
+Uses the headless Agg backend; no figure is shown or saved. Each test
+explicitly closes the Figure it builds (matplotlib.pyplot.close) so that no
+state accumulates between tests.
+"""
+
+import matplotlib
+
+matplotlib.use("Agg")
+
+import matplotlib.pyplot as plt  # noqa: E402
+from matplotlib.figure import Figure  # noqa: E402
+
+from build_boxplots_figure import build_boxplots_figure
+
+
+def _box(name, q1, median, q3, mn, mx, low=False, high=False, fliers=None):
+    """Build a {name, box, fliers} entry with a build_boxplot_stats-style box."""
+    iqr = q3 - q1
+    return {
+        "name": name,
+        "box": {
+            "q1": q1,
+            "median": median,
+            "q3": q3,
+            "iqr": iqr,
+            "lower_fence": q1 - 1.5 * iqr,
+            "upper_fence": q3 + 1.5 * iqr,
+            "whisker_lo": max(mn, q1 - 1.5 * iqr),
+            "whisker_hi": min(mx, q3 + 1.5 * iqr),
+            "min": mn,
+            "max": mx,
+            "has_low_outliers": low,
+            "has_high_outliers": high,
+            "n_outliers": 0,
+        },
+        "fliers": fliers,
+    }
+
+
+def test_returns_figure_with_axes():
+    boxes = [
+        _box("edad", 10.0, 25.0, 40.0, 1.0, 100.0, high=True),
+        _box("ingresos", 100.0, 200.0, 300.0, 50.0, 400.0),
+        _box("score", -1.0, 0.0, 1.0, -5.0, 5.0, low=True, high=True),
+    ]
+    fig = build_boxplots_figure(boxes, title="Boxplots", max_boxes=12)
+    assert isinstance(fig, Figure)
+    assert len(fig.axes) >= 1
+    # Three boxes -> three tick labels on the Y axis.
+    ax = fig.axes[0]
+    assert len(ax.get_yticks()) == 3
+    plt.close(fig)
+
+
+def test_empty_list_returns_placeholder_figure():
+    fig = build_boxplots_figure([], title="vacío")
+    assert isinstance(fig, Figure)
+    assert len(fig.axes) >= 1
+    plt.close(fig)
+
+
+def test_invalid_box_is_skipped_not_raised():
+    boxes = [
+        {"name": "rota", "box": {"q1": None, "median": None, "q3": None}},
+        {"name": "sin_box"},                         # missing the box key.
+        "no_es_dict",                                 # non-dict entry.
+        _box("buena", 1.0, 2.0, 3.0, 0.0, 10.0, high=True),
+    ]
+    fig = build_boxplots_figure(boxes)
+    assert isinstance(fig, Figure)
+    ax = fig.axes[0]
+    # Only the valid box survives the filtering.
+    assert len(ax.get_yticks()) == 1
+    plt.close(fig)
+
+
+def test_all_invalid_returns_placeholder():
+    boxes = [
+        {"name": "a", "box": {"q1": None, "median": 1.0, "q3": 2.0}},
+        {"name": "b"},
+    ]
+    fig = build_boxplots_figure(boxes)
+    assert isinstance(fig, Figure)
+    assert len(fig.axes) >= 1
+    plt.close(fig)
+
+
+def test_raw_fliers_are_drawn():
+    boxes = [
+        _box("con_fliers", 10.0, 20.0, 30.0, 5.0, 200.0,
+             high=True, fliers=[150.0, 180.0, 200.0]),
+    ]
+    fig = build_boxplots_figure(boxes)
+    assert isinstance(fig, Figure)
+    assert any(200.0 in line.get_xdata() for line in fig.axes[0].get_lines())
+    plt.close(fig)
+
+
+def test_max_boxes_truncates_and_does_not_raise():
+    boxes = [_box(f"c{i}", float(i), float(i + 1), float(i + 2),
+                  float(i - 5), float(i + 10)) for i in range(20)]
+    fig = build_boxplots_figure(boxes, title="muchos", max_boxes=5)
+    assert isinstance(fig, Figure)
+    ax = fig.axes[0]
+    # Only the first 5 boxes are drawn.
+    assert len(ax.get_yticks()) == 5
+    plt.close(fig)
@@ -0,0 +1,68 @@
+---
+name: classify_relationship_type
+kind: function
+lang: py
+domain: datascience
+version: "1.0.0"
+purity: pure
+signature: "def classify_relationship_type(xs: list, ys: list) -> dict"
+description: "Classifies the TYPE of relationship between two numeric variables paired by index, for the automatic EDA of the eda group. Cleans the pairs defensively (drops None/bool/NaN/inf), reuses pearson and spearman_corr from the registry, and fits degree-2 and degree-3 polynomials with numpy.polyfit (manual R^2); from those signals it labels the shape: 'lineal', 'polinómica (grado 2)', 'polinómica (grado 3)', 'monótona no-lineal' or 'débil/sin forma'. Decision order: débil -> monótona -> polinómica -> lineal (first match wins), with thresholds calibrated for real discrete/noisy data. Also returns the coefficients of the best model in numpy.polyval order so the fitted curve can be drawn over the scatter. Pure, non-throwing function: on insufficient data (fewer than 5 valid pairs or variance ~0) or any failure it returns the canonical dict with tipo='débil/sin forma' and every other key set to None."
+tags: [eda, correlation, relationship, classification, polyfit, datascience, pure]
+params:
+  - name: xs
+    desc: "List (or tuple) of numeric values for the first variable, paired by index with ys. Each pair xs[i],ys[i] is dropped if either value is None, bool, NaN or inf. Read defensively."
+  - name: ys
+    desc: "List (or tuple) of numeric values for the second variable, paired by index with xs. Same cleaning rules as xs."
+output: "Dict that ALWAYS has the same 8 keys: tipo (str: 'lineal' | 'polinómica (grado 2)' | 'polinómica (grado 3)' | 'monótona no-lineal' | 'débil/sin forma'); pearson (float|None: Pearson coefficient r); r2_linear (float|None: r**2 of the linear fit); spearman (float|None: Spearman rho); r2_poly2 (float|None: R^2 of the degree-2 polynomial fit); r2_poly3 (float|None: R^2 of the degree-3 fit); best_degree (int|None: degree of the chosen model: 1 linear, 2/3 polynomial, None if monotonic/weak); coeffs (list|None: coefficients of the best model in numpy.polyval order for drawing the curve, or None). On insufficient data or error: tipo='débil/sin forma' and every other key set to None."
+uses_functions: [pearson_py_datascience, spearman_corr_py_datascience]
+uses_types: []
+returns: []
+returns_optional: false
+error_type: ""
+imports: [numpy]
+tested: true
+tests: ["test_lineal", "test_polinomica_cuadratica", "test_monotona_no_lineal", "test_monotona_exponencial", "test_debil_sin_forma", "test_lista_vacia_no_lanza", "test_longitudes_distintas_no_lanza", "test_todos_none_no_lanza", "test_entradas_none_no_lanza", "test_constante_no_lanza", "test_filtra_nan_inf_bool"]
+test_file_path: "python/functions/datascience/classify_relationship_type_test.py"
+file_path: "python/functions/datascience/classify_relationship_type.py"
+---
+
+## Example
+
+```python
+import sys, os
+sys.path.insert(0, os.path.join("python", "functions"))
+from datascience.classify_relationship_type import classify_relationship_type
+import numpy as np
+
+# Clearly quadratic relationship (parabola) over a symmetric domain.
+x = list(np.linspace(-10, 10, 60))
+y = [v * v for v in x]
+
+res = classify_relationship_type(x, y)
+print(res["tipo"])         # 'polinómica (grado 2)'
+print(res["best_degree"])  # 2
+print(res["r2_linear"])    # ~0.0 -> the linear Pearson cannot see the parabola
+print(res["r2_poly2"])     # ~1.0
+print(res["coeffs"])       # ~[1.0, 0.0, 0.0] (zeros may print as -0.0) -> numpy.polyval(coeffs, x) ~ x**2
+
+# The chapter draws the fitted curve when coeffs is not None:
+#   if res["coeffs"] is not None:
+#       xs_fit = np.linspace(min(x), max(x), 200)
+#       ys_fit = np.polyval(res["coeffs"], xs_fit)
+#       ax.plot(xs_fit, ys_fit)   # curve over the ax.scatter(x, y)
+```
+
+## When to use it
+
+- Use it in the relationships/correlations chapter of the automatic EDA, after detecting two numeric columns with some association, to decide WHICH fitted curve to draw over the scatter (line, parabola, cubic or none) and to give the relationship type a readable label.
+- When a low Pearson does not mean "no relationship": this function cross-checks Pearson against Spearman and against polynomial fits to tell a weak linear relationship apart from a monotonic non-linear one (which the rank correlation does capture) or from a polynomial curve.
+- When you need a deterministic, non-throwing entry point that, given the same data, always returns the same `tipo` and the same `coeffs` ready for `numpy.polyval`, without fitting models by hand in the chapter.
+
+## Gotchas
+
+- Pure, deterministic, non-throwing function: with fewer than 5 valid pairs, variance ~0 (constant xs or ys) or any internal exception it returns the canonical dict `tipo="débil/sin forma"` with every other key set to `None`. The dict ALWAYS carries all 8 keys: never check for key existence, check for `None`.
+- The decision order matters: `débil -> monótona -> polinómica -> lineal` (first match wins). Monotonicity is evaluated BEFORE the polynomial fit, so a smooth monotonic curve (exp, log, powers) comes out as `monótona no-lineal` even if a cubic also fits it; rank dominance (Spearman >> Pearson) is the most interpretable signal. Only a NON-monotonic curved shape lands in `polinómica` (e.g. a parabola: Spearman ~0 but a high polynomial R^2).
+- Fixed thresholds (calibrated for EDA on discrete/noisy data, not for formal inference): `débil/sin forma` if all three signals are low at once (`abs(pearson) < 0.3` and `abs(spearman) < 0.3` and `mejor_poly < 0.3`); `monótona no-lineal` if `abs(spearman) - abs(pearson) >= 0.1` and `abs(spearman) >= 0.4`; `polinómica (grado N)` if the best polynomial improves on the linear fit by `>= 0.1` and its R^2 is `>= 0.3`; any other case with signal (not weak) is `lineal`. The 0.3 floor avoids calling real but discrete relationships (counts, ordinal scales) "weak" when their R^2 is low but their direction is clear.
+- `coeffs` is in `numpy.polyval` order (descending degree). For `lineal` it is `[slope, intercept]` (degree 1); for `polinómica`, the coefficients of the chosen degree; for `monótona no-lineal` and `débil/sin forma` it is `None` (the scatter will show a smoothed curve or nothing; the chapter decides, not this function).
+- `best_degree` prefers degree 2 over degree 3 when they tie within 0.02 of R^2 (parsimony): do not expect degree 3 unless it is clearly better.
+- Pairs containing `None`, `bool`, `NaN` or `inf` are silently dropped by index; `bool` counts as non-numeric (a `True` is not `1`). The domain of the data affects the result: a parabola over a symmetric domain gives Pearson ~0 (labelled `polinómica`), but over an asymmetric domain Pearson rises and the label may be `lineal`.
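The threshold rules in the Gotchas list can be restated as a small decision function. This is only an illustrative sketch: `label_relationship` is a hypothetical name, it takes the four signals as already-computed numbers, and it returns just the label and the degree, not the full 8-key dict.

```python
def label_relationship(pearson, spearman, r2_poly2, r2_poly3):
    """Illustrative restatement of the documented decision order."""
    r2_linear = pearson ** 2
    best_poly = max(r2_poly2, r2_poly3)
    # Parsimony: degree 3 only when it beats degree 2 by more than 0.02.
    degree = 3 if (r2_poly3 - r2_poly2) > 0.02 else 2
    abs_p, abs_s = abs(pearson), abs(spearman)

    # First match wins: weak -> monotonic -> polynomial -> linear.
    if abs_p < 0.3 and abs_s < 0.3 and best_poly < 0.3:
        return "débil/sin forma", None
    if (abs_s - abs_p) >= 0.1 and abs_s >= 0.4:
        return "monótona no-lineal", None
    if (best_poly - r2_linear) >= 0.1 and best_poly >= 0.3:
        return f"polinómica (grado {degree})", degree
    return "lineal", 1


# Parabola on a symmetric domain: both correlations ~0, perfect quadratic fit.
parabola = label_relationship(0.0, 0.0, 1.0, 1.0)
# exp(x): rank correlation 1, Pearson around 0.86, so monotonicity wins.
exponential = label_relationship(0.86, 1.0, 0.95, 0.99)
# Noisy straight line: the polynomial barely improves on r**2.
straight = label_relationship(0.99, 0.99, 0.985, 0.985)
# Independent noise: every signal is below the 0.3 floor.
noise = label_relationship(0.05, 0.04, 0.01, 0.02)
print(parabola, exponential, straight, noise)
```

The exponential case shows why the order matters: a cubic fits it well (0.99 in this made-up input), yet the monotonic branch is checked first and no polynomial is reported.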
@@ -0,0 +1,187 @@
+"""Classifies the TYPE of relationship between two paired numeric variables.
+
+Pure function of the eda group. Given two numeric lists paired by index, it
+cleans the pairs defensively, computes the linear (Pearson) and rank (Spearman)
+correlations plus degree-2 and degree-3 polynomial fits, and from those signals
+labels the shape of the relationship for the automatic EDA:
+
+    "lineal" | "polinómica (grado 2)" | "polinómica (grado 3)" |
+    "monótona no-lineal" | "débil/sin forma"
+
+It also returns the coefficients of the best model (in numpy.polyval order) so
+the chapter can draw the fitted curve over the scatter. It reuses the registry
+functions `pearson` and `spearman_corr` instead of reimplementing them.
+
+NEVER raises: on any failure or insufficient data it returns the canonical dict
+with tipo="débil/sin forma" and every other key set to None.
+"""
+
+import math
+import warnings
+
+import numpy as np
+
+from datascience.datascience import pearson
+from datascience.spearman_corr import spearman_corr
+
+# Forma canonica de la respuesta cuando no se puede clasificar (datos
+# insuficientes, varianza nula o error interno). Siempre las mismas claves.
+_WEAK = {
+    "tipo": "débil/sin forma",
+    "pearson": None,
+    "r2_linear": None,
+    "spearman": None,
+    "r2_poly2": None,
+    "r2_poly3": None,
+    "best_degree": None,
+    "coeffs": None,
+}
+
+
+def _is_num(v) -> bool:
+    """True si v es un numero real finito (int/float, no bool, no NaN, no inf)."""
+    return (
+        isinstance(v, (int, float))
+        and not isinstance(v, bool)
+        and not (isinstance(v, float) and (math.isnan(v) or math.isinf(v)))
+    )
+
+
+def _poly_r2(coeffs, x_arr, y_arr, ss_tot: float) -> float:
+    """R^2 de un ajuste polinomico: 1 - SS_res/SS_tot. 0 si SS_tot==0."""
+    if ss_tot == 0.0:
+        return 0.0
+    pred = np.polyval(coeffs, x_arr)
+    ss_res = float(np.sum((y_arr - pred) ** 2))
+    return 1.0 - ss_res / ss_tot
+
+
+def classify_relationship_type(xs: list, ys: list) -> dict:
+    """Classify the relationship type between two paired numeric variables.
+
+    Pairs xs[i],ys[i] by index and drops the pair if either value is None,
+    bool, NaN or inf. On the clean pairs it computes Pearson r
+    (r2_linear = r**2), Spearman rho and the R^2 of degree-2 and degree-3
+    polynomial fits (numpy.polyfit + manual R^2). Those signals pick the label.
+
+    Label evaluation order (first match wins). The thresholds are calibrated
+    for real data, often discrete and noisy (counts, ordinal scales): a
+    relationship with |r| >= 0.3, |rho| >= 0.3 or a polynomial with R^2 >= 0.3
+    already has SHAPE and must not be labelled "débil".
+        1. "débil/sin forma": every signal is low at once:
+           abs(pearson) < 0.3 and abs(spearman) < 0.3 and mejor_poly < 0.3.
+        2. "monótona no-lineal": the rank (Spearman) captures a monotonic trend
+           that the linear Pearson misses: abs(spearman) - abs(pearson) >= 0.1
+           and abs(spearman) >= 0.4. No polynomial is forced (coeffs/best_degree
+           = None); the chapter draws the ordered trend over the scatter.
+        3. "polinómica (grado N)": the best polynomial clearly improves on the
+           linear fit (mejor_poly - r2_linear >= 0.1) and mejor_poly >= 0.3. N is
+           the degree (2 or 3) with the best R^2, preferring 2 when they tie
+           within 0.02 (parsimony).
+        4. "lineal": everything else: there is signal (not weak) and the shape
+           that exists is essentially linear. best_degree=1, degree-1 coeffs.
+
+    With fewer than 5 valid pairs, or when the variance of xs or ys is ~0
+    (constant), it returns "débil/sin forma" directly.
+
+    Args:
+        xs: list (or tuple) of numeric values for the first variable, paired
+            by index with ys. Pairs with None/bool/NaN/inf are dropped.
+        ys: list (or tuple) of numeric values for the second variable, paired
+            by index with xs.
+
+    Returns:
+        dict that ALWAYS has the same keys:
+            tipo (str), pearson (float|None), r2_linear (float|None),
+            spearman (float|None), r2_poly2 (float|None), r2_poly3 (float|None),
+            best_degree (int|None: 1, 2, 3 or None),
+            coeffs (list|None: coefficients in numpy.polyval order, or None).
+        Never raises: on failure or insufficient data it returns the weak dict.
+    """
+    try:
+        if xs is None or ys is None:
+            return dict(_WEAK)
+
+        pairs = [
+            (float(x), float(y))
+            for x, y in zip(xs, ys)
+            if _is_num(x) and _is_num(y)
+        ]
+
+        # Datos insuficientes para hablar de forma de la relacion.
+        if len(pairs) < 5:
+            return dict(_WEAK)
+
+        clean_x = [p[0] for p in pairs]
+        clean_y = [p[1] for p in pairs]
+
+        # Varianza ~0 en cualquiera de las series => relacion indefinida.
+        if len(set(clean_x)) < 2 or len(set(clean_y)) < 2:
+            return dict(_WEAK)
+        x_arr = np.asarray(clean_x, dtype=float)
+        y_arr = np.asarray(clean_y, dtype=float)
+        if float(np.var(x_arr)) < 1e-15 or float(np.var(y_arr)) < 1e-15:
+            return dict(_WEAK)
+
+        # Correlaciones reutilizando las funciones del registry.
+        r = pearson(clean_x, clean_y)
+        spearman = spearman_corr(clean_x, clean_y)
+        r2_linear = r ** 2
+
+        # Ajustes polinomicos grado 2 y 3 con R^2 manual.
+        ss_tot = float(np.sum((y_arr - float(np.mean(y_arr))) ** 2))
+        with warnings.catch_warnings():
+            warnings.simplefilter("ignore")
+            c1 = np.polyfit(x_arr, y_arr, 1)
+            c2 = np.polyfit(x_arr, y_arr, 2)
+            c3 = np.polyfit(x_arr, y_arr, 3)
+        r2_poly2 = _poly_r2(c2, x_arr, y_arr, ss_tot)
+        r2_poly3 = _poly_r2(c3, x_arr, y_arr, ss_tot)
+
+        mejor_poly = max(r2_poly2, r2_poly3)
+        # Grado del mejor polinomico, con preferencia por la parsimonia: solo se
+        # elige el grado 3 si supera al grado 2 por mas de 0.02.
+        best_poly_degree = 3 if (r2_poly3 - r2_poly2) > 0.02 else 2
+
+        abs_s = abs(spearman)
+        abs_p = abs(r)
+
+        # Decision en orden: debil-temprano -> monotona -> polinomica -> lineal.
+        if abs_p < 0.3 and abs_s < 0.3 and mejor_poly < 0.3:
+            # Ninguna senal supera el suelo de forma: relacion debil/sin forma.
+            tipo = "débil/sin forma"
+            best_degree = None
+            coeffs = None
+        elif (abs_s - abs_p) >= 0.1 and abs_s >= 0.4:
+            # Spearman (rango) capta una monotonia que el Pearson lineal no:
+            # relacion monotona no-lineal. No se fuerza un polinomio que tal vez
+            # no ajusta bien; el capitulo dibuja la tendencia ordenada.
+            tipo = "monótona no-lineal"
+            best_degree = None
+            coeffs = None
+        elif (mejor_poly - r2_linear) >= 0.1 and mejor_poly >= 0.3:
+            tipo = "polinómica (grado {})".format(best_poly_degree)
+            best_degree = best_poly_degree
+            best_coeffs = c2 if best_poly_degree == 2 else c3
+            coeffs = [float(c) for c in best_coeffs]
+        else:
+            # Hay senal (no es debil) y no es ni monotona-pura ni polinomica:
+            # la correlacion que existe es esencialmente lineal.
+            tipo = "lineal"
+            best_degree = 1
+            coeffs = [float(c) for c in c1]
+
+        return {
+            "tipo": tipo,
+            "pearson": round(float(r), 6),
+            "r2_linear": round(float(r2_linear), 6),
+            "spearman": round(float(spearman), 6),
+            "r2_poly2": round(float(r2_poly2), 6),
+            "r2_poly3": round(float(r2_poly3), 6),
+            "best_degree": best_degree,
+            "coeffs": (
+                [round(c, 8) for c in coeffs] if coeffs is not None else None
+            ),
+        }
+    except Exception:
+        return dict(_WEAK)
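The manual R^2 in `_poly_r2` (1 - SS_res/SS_tot) and its comparison against Pearson can be checked by hand without numpy. The sketch below is a stdlib-only approximation under stated assumptions: `pearson_r`, `r_squared` and `squares` are hypothetical names, and the exact model y = x**2 stands in for the degree-2 `numpy.polyfit` result.

```python
def pearson_r(xs, ys):
    """Plain Pearson correlation (assumes non-constant inputs)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def r_squared(ys, preds):
    """R^2 = 1 - SS_res / SS_tot, with 0.0 when SS_tot is zero."""
    mean_y = sum(ys) / len(ys)
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    if ss_tot == 0.0:
        return 0.0
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))
    return 1.0 - ss_res / ss_tot


def squares(xs):
    return [x * x for x in xs]


symmetric = [float(v) for v in range(-10, 11)]   # -10 .. 10
asymmetric = [float(v) for v in range(0, 21)]    # 0 .. 20

r_sym = pearson_r(symmetric, squares(symmetric))
r_asym = pearson_r(asymmetric, squares(asymmetric))
r2_quadratic = r_squared(squares(symmetric), squares(symmetric))
print(r_sym, r_asym, r2_quadratic)
```

On the symmetric domain the linear correlation vanishes while the quadratic model explains everything, so the gap over `r2_linear` is far above 0.1. On the asymmetric domain Pearson is already high, the gap shrinks below 0.1, and the same parabola would fall through to `lineal`.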
@@ -0,0 +1,174 @@
+"""Tests para classify_relationship_type."""
+
+import os
+import sys
+
+import numpy as np
+
+sys.path.insert(0, os.path.dirname(__file__))
+
+from classify_relationship_type import classify_relationship_type
+
+# Claves que el dict de salida debe contener SIEMPRE.
+_EXPECTED_KEYS = {
+    "tipo", "pearson", "r2_linear", "spearman",
+    "r2_poly2", "r2_poly3", "best_degree", "coeffs",
+}
+
+
+def _assert_shape(r):
+    """Toda salida tiene exactamente las 8 claves canonicas."""
+    assert isinstance(r, dict)
+    assert set(r.keys()) == _EXPECTED_KEYS
+
+
+def test_lineal():
+    """Golden: y = 2x + 1 con ruido pequeno -> 'lineal', best_degree=1."""
+    rng = np.random.default_rng(42)
+    x = np.linspace(0.0, 10.0, 50)
+    y = 2.0 * x + 1.0 + rng.normal(0.0, 0.3, 50)
+
+    r = classify_relationship_type(list(x), list(y))
+    _assert_shape(r)
+
+    assert r["tipo"] == "lineal"
+    assert r["best_degree"] == 1
+    assert r["r2_linear"] >= 0.5
+    # coeffs ~ [pendiente, intercepto] del ajuste de grado 1.
+    assert r["coeffs"] is not None and len(r["coeffs"]) == 2
+    assert abs(r["coeffs"][0] - 2.0) < 0.1   # pendiente ~2
+    assert abs(r["coeffs"][1] - 1.0) < 0.3   # intercepto ~1
+
+
+def test_polinomica_cuadratica():
+    """Golden: y = x**2 sobre [-10, 10] -> 'polinómica', best_degree in (2, 3)."""
+    x = np.linspace(-10.0, 10.0, 60)
+    y = x ** 2
+
+    r = classify_relationship_type(list(x), list(y))
+    _assert_shape(r)
+
+    assert r["tipo"].startswith("polinómica")
+    assert r["best_degree"] in (2, 3)
+    # Una parabola perfecta queda capturada por el grado 2 (parsimonia).
+    assert r["best_degree"] == 2
+    assert r["r2_poly2"] > 0.99
+    assert r["coeffs"] is not None and len(r["coeffs"]) == r["best_degree"] + 1
+
+
+def test_monotona_no_lineal():
+    """Golden: monotona convexa de cola pesada -> 'monótona no-lineal'.
+
+    y = 1/(N+1-i)**2 es estrictamente creciente (Spearman ~ 1) pero su cola
+    explosiva hace que ni la recta ni un polinomio de grado 2/3 la ajusten
+    (R^2 polinomico < 0.5), de modo que el Pearson lineal NO capta la relacion
+    que el rango (Spearman) si ve. Construccion deterministica (sin azar).
+    """
+    n = 200
+    i = np.arange(n, dtype=float)
+    y = 1.0 / (n + 1 - i) ** 2
+
+    r = classify_relationship_type(list(i), list(y))
+    _assert_shape(r)
+
+    assert r["tipo"] == "monótona no-lineal"
+    assert r["best_degree"] is None
+    assert r["coeffs"] is None
+    # Spearman fuerte y claramente por encima del Pearson.
+    assert abs(r["spearman"]) >= 0.5
+    assert abs(r["spearman"]) - abs(r["pearson"]) >= 0.15
+
+
+def test_monotona_exponencial():
+    """DoD literal: y = exp(x) (monotona no-lineal) -> 'monótona no-lineal'.
+
+    exp es estrictamente creciente (Spearman = 1) pero el Pearson lineal queda
+    claramente por debajo (~0.86), así que la dominancia del rango la marca como
+    monótona no-lineal en vez de lineal o polinómica.
+    """
+    x = np.linspace(0.0, 5.0, 80)
+    y = np.exp(x)
+
+    r = classify_relationship_type(list(x), list(y))
+    _assert_shape(r)
+
+    assert r["tipo"] == "monótona no-lineal"
+    assert r["best_degree"] is None and r["coeffs"] is None
+    assert abs(r["spearman"]) >= 0.9
+    assert abs(r["spearman"]) - abs(r["pearson"]) >= 0.1
+
+
+def test_debil_sin_forma():
+    """Golden: x e y independientes (semilla fija) -> 'débil/sin forma'."""
+    rng = np.random.default_rng(0)
+    x = rng.normal(0.0, 1.0, 200)
+    y = rng.normal(0.0, 1.0, 200)
+
+    r = classify_relationship_type(list(x), list(y))
+    _assert_shape(r)
+
+    assert r["tipo"] == "débil/sin forma"
+    assert r["best_degree"] is None
+    assert r["coeffs"] is None
+    # Todas las senales son bajas.
+    assert abs(r["pearson"]) < 0.3
+    assert r["r2_linear"] < 0.1
+
+
+def test_lista_vacia_no_lanza():
+    """Edge: listas vacias -> dict debil canonico, sin lanzar."""
+    r = classify_relationship_type([], [])
+    _assert_shape(r)
+    assert r["tipo"] == "débil/sin forma"
+    assert r["pearson"] is None
+    assert r["r2_linear"] is None
+    assert r["spearman"] is None
+    assert r["r2_poly2"] is None
+    assert r["r2_poly3"] is None
+    assert r["best_degree"] is None
+    assert r["coeffs"] is None
+
+
+def test_longitudes_distintas_no_lanza():
+    """Edge: listas de distinta longitud -> empareja por indice, no lanza."""
+    # zip trunca a la longitud minima: solo 3 pares (< 5) -> debil.
+    r = classify_relationship_type([1, 2, 3, 4, 5, 6, 7, 8], [1.0, 2.0, 3.0])
+    _assert_shape(r)
+    assert r["tipo"] == "débil/sin forma"
+    assert r["best_degree"] is None
+
+
+def test_todos_none_no_lanza():
+    """Edge: todos los valores None -> ningun par valido -> debil, no lanza."""
+    r = classify_relationship_type([None, None, None, None, None, None],
+                                   [None, None, None, None, None, None])
+    _assert_shape(r)
+    assert r["tipo"] == "débil/sin forma"
+    assert r["coeffs"] is None
+
+
+def test_entradas_none_no_lanza():
+    """Edge: xs/ys None directamente -> debil, no lanza."""
+    assert classify_relationship_type(None, None)["tipo"] == "débil/sin forma"
+    assert classify_relationship_type([1.0, 2.0], None)["tipo"] == "débil/sin forma"
+
+
+def test_constante_no_lanza():
+    """Edge: ys constante (varianza ~0) -> debil, no lanza."""
+    r = classify_relationship_type([1, 2, 3, 4, 5, 6, 7], [5, 5, 5, 5, 5, 5, 5])
+    _assert_shape(r)
+    assert r["tipo"] == "débil/sin forma"
+
+
+def test_filtra_nan_inf_bool():
+    """Edge: pares con NaN/inf/bool/None se descartan por indice."""
+    nan = float("nan")
+    inf = float("inf")
+    # Solo i=0,1,2,3,4 quedan validos (5 pares) y forman una recta perfecta.
+    xs = [0.0, 1.0, 2.0, 3.0, 4.0, nan, inf, True, None]
+    ys = [1.0, 3.0, 5.0, 7.0, 9.0, 1.0, 2.0, 3.0, 4.0]
+    r = classify_relationship_type(xs, ys)
+    _assert_shape(r)
+    # Los 5 pares validos son y = 2x + 1 exacto -> lineal.
+    assert r["tipo"] == "lineal"
+    assert r["best_degree"] == 1
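The last two edge cases rely on how pairs are cleaned before any statistic is computed. A minimal sketch of that step, mirroring `_is_num` plus the `zip` pairing (the name `clean_pairs` is hypothetical; the real function inlines this logic):

```python
import math


def clean_pairs(xs, ys):
    """Keep (x, y) pairs where both values are finite real numbers."""
    def is_num(v):
        # bool is a subclass of int, so it has to be excluded explicitly.
        return (
            isinstance(v, (int, float))
            and not isinstance(v, bool)
            and not (isinstance(v, float) and (math.isnan(v) or math.isinf(v)))
        )

    if xs is None or ys is None:
        return []
    # zip stops at the shorter input, so unequal lengths never raise.
    return [(float(x), float(y)) for x, y in zip(xs, ys) if is_num(x) and is_num(y)]


nan, inf = float("nan"), float("inf")
kept = clean_pairs(
    [0.0, 1.0, 2.0, 3.0, 4.0, nan, inf, True, None],
    [1.0, 3.0, 5.0, 7.0, 9.0, 1.0, 2.0, 3.0, 4.0],
)
truncated = clean_pairs([1, 2, 3, 4, 5, 6, 7, 8], [1.0, 2.0, 3.0])
print(len(kept), len(truncated))
```

A pair is dropped as a unit: the valid `ys` values at the positions of `nan`, `inf`, `True` and `None` go away with their partners, which is why exactly five pairs survive in the first call.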
@@ -0,0 +1,102 @@
+---
+id: compute_text_duplicates_py_datascience
+name: compute_text_duplicates
+kind: function
+lang: py
+domain: datascience
+version: "1.0.0"
+purity: pure
+signature: "def compute_text_duplicates(texts, near_threshold=0.85, sample_max=2000) -> dict"
+description: "Detects duplicate documents in a text corpus. EXACT duplicates are always computed with the stdlib: each document is normalized (collapse whitespace, strip, lower) and hashed with SHA-1; n_exact_dup is how many docs repeat one already seen and exact_dup_pct is their percentage. NEAR-duplicates (near-dup) use the OPTIONAL dependency datasketch (MinHash + LSH over 3-word shingles); if it is not installed, that part degrades to available:False without affecting the rest. Follows the dict-no-throw style of the eda group: it never raises."
+tags: [eda, datascience, text, nlp, duplicates, minhash, pure, python]
+uses_functions: []
+uses_types: []
+returns: []
+returns_optional: false
+error_type: ""
+imports: [hashlib, re]
+example: |
+  from datascience.compute_text_duplicates import compute_text_duplicates
+  texts = ["El gato come pescado", "El gato come pescado", "Un perro ladra"]
+  result = compute_text_duplicates(texts)
+  # {"n_docs": 3, "n_exact_dup": 1, "exact_dup_pct": 33.33, "n_unique": 2,
+  #  "near_dup": {"available": False, "n_near_dup_docs": 0}}
+tested: true
+tests:
+  - "test_duplicados_exactos"
+  - "test_sin_duplicados"
+  - "test_vacio"
+  - "test_near_dup_degrada"
+test_file_path: "python/functions/datascience/compute_text_duplicates_test.py"
+file_path: "python/functions/datascience/compute_text_duplicates.py"
+params:
+  - name: texts
+    desc: "List of text documents. None elements and non-str elements are silently discarded; n_docs counts only the valid documents. Passing None as the argument is treated as an empty list."
+  - name: near_threshold
+    desc: "Jaccard similarity threshold (0–1) for treating two documents as near-duplicates in the near-dup computation via MinHashLSH. Only applies if datasketch is installed. Default 0.85."
+  - name: sample_max
+    desc: "Maximum number of documents sampled (the first ones) for the near-dup computation, whose MinHash memory grows as O(n). Does not affect the exact-duplicate count, which always scans the whole corpus. Default 2000."
+output: "Dict with exactly 5 keys, always present: n_docs (int, valid docs), n_exact_dup (int, docs that repeat an already-seen normalized text = n_docs - n_unique), exact_dup_pct (float with 2 decimals = n_exact_dup/n_docs*100, or None if the corpus is empty), n_unique (int, number of distinct normalized texts), and near_dup (sub-dict with available:bool and n_near_dup_docs:int; when available is True it also includes threshold with the near_threshold used). The function never raises: it catches every exception and degrades."
+---
+
+## Example
+
+```python
+from datascience.compute_text_duplicates import compute_text_duplicates
+
+# Three copies of the same text (with different spacing/casing) + two unique ones.
+texts = [
+    "El gato come pescado",
+    "El gato come pescado",
+    "el  GATO   come pescado",   # same after normalization
+    "Un perro ladra",
+    "La luna brilla",
+]
+
+compute_text_duplicates(texts)
+# {
+#   "n_docs": 5,
+#   "n_exact_dup": 2,          # 3 copies of the first text => 2 repetitions
+#   "exact_dup_pct": 40.0,     # 2 / 5 * 100
+#   "n_unique": 3,             # 3 distinct normalized texts
+#   "near_dup": {"available": False, "n_near_dup_docs": 0},  # datasketch missing
+# }
+
+# Empty corpus: stable contract, exact_dup_pct is None, no exception.
+compute_text_duplicates([])
+# {"n_docs": 0, "n_exact_dup": 0, "exact_dup_pct": None, "n_unique": 0,
+#  "near_dup": {"available": False, "n_near_dup_docs": 0}}
+```
+
+## When to use it
+
+Use it in the quality phase of a text EDA, when you want to know how much of
+your corpus is duplicated noise before training, vectorizing or sampling: it
+gives you the percentage of exact duplicates (`exact_dup_pct`), the number of
+unique documents (`n_unique`) and, if `datasketch` is installed, an estimate of
+near-duplicates (paraphrases, copies with small edits) via MinHash + LSH.
+Pass it the column/list of raw texts directly; the function filters out None
+and non-str items for you and never raises, so it is safe to chain in
+profiling pipelines.
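
To make the near-duplicate criterion concrete, the sketch below computes the exact Jaccard similarity over 3-word shingles, which is the quantity that MinHash + LSH approximates. The helper names `word_shingles` and `jaccard` are illustrative and not part of the registry, and no `datasketch` call is involved; because LSH is probabilistic, real results close to the threshold may differ from this exact computation.

```python
import re


def word_shingles(doc, k=3):
    """Return the set of k-word shingles of a document (lowercased word tokens)."""
    tokens = re.findall(r"\w+", doc.lower())
    shingles = {" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}
    # Documents shorter than k tokens fall back to their individual tokens.
    return shingles or set(tokens)


def jaccard(a, b):
    """Exact Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


a = word_shingles("uno dos tres cuatro")
b = word_shingles("uno dos tres cuatro cinco")
similarity = jaccard(a, b)         # 2 shared shingles out of 3 distinct ones
is_near_dup = similarity >= 0.85   # compared against the default near_threshold
```

Under these assumptions, one extra word at the end of a four-word text already drops the similarity to about 0.67, so very short documents rarely clear the default threshold of 0.85.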
+
+## Gotchas
+
+- **Near-dup requires `datasketch` (optional).** If the library is not
+  installed, `near_dup` degrades to `{"available": False, "n_near_dup_docs": 0}`
+  (without the `threshold` key) and the rest of the result is computed as
+  usual. **Exact** duplicates always work because they only use the stdlib (hash).
+- **Normalization for exact matches.** Two texts count as the same exact
+  duplicate if they match after `" ".join(doc.split()).strip().lower()`: runs of
+  spaces/tabs/newlines are collapsed, the ends are trimmed and case is ignored.
+  Differences in punctuation or accents DO distinguish them (they are not removed).
+- **`n_exact_dup` counts repetitions, not groups.** With 3 copies of the same
+  text, `n_exact_dup` is 2 (the two extra copies), not 1. It equals
+  `n_docs - n_unique`.
+- **`exact_dup_pct` is `None` for an empty corpus** (no `ZeroDivisionError`); in
+  every other case it is a float rounded to 2 decimals.
+- **`sample_max` only limits near-dup.** The exact-duplicate count scans the
+  whole corpus; near-dup samples the first `sample_max` documents to bound
+  memory. If the corpus is ordered, consider shuffling it first so that the
+  sample is representative.
+- **Non-str elements are discarded.** `True`/`False` do not count as str and
+  are ignored just like `None`; `n_docs` reflects only the valid documents.
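
The normalization and counting rules above can be sketched in a few lines. This is a minimal illustration, not the registry implementation: `normalize` and `count_exact_duplicates` are hypothetical names, and `normalize` omits the `.strip()` call because `str.split()` with no arguments already discards leading and trailing whitespace.

```python
import hashlib


def normalize(doc):
    """Collapse whitespace runs, trim the ends and lowercase."""
    return " ".join(doc.split()).lower()


def count_exact_duplicates(docs):
    """Return (n_exact_dup, n_unique) over SHA-1 digests of normalized docs."""
    seen = set()
    n_dup = 0
    for doc in docs:
        digest = hashlib.sha1(normalize(doc).encode("utf-8")).hexdigest()
        if digest in seen:
            n_dup += 1          # a repetition of a text already seen
        else:
            seen.add(digest)    # first occurrence of this normalized text
    return n_dup, len(seen)


docs = [
    "El gato come pescado",
    "el  GATO\tcome pescado ",   # same text after normalization
    "El gato come pescado.",     # trailing period: a different text
    "Él gato come pescado",      # accent: a different text
]
n_dup, n_unique = count_exact_duplicates(docs)
```

Only the second document counts as a repetition here; the punctuation and accent variants stay distinct, and the repetition count equals `len(docs) - n_unique`.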
@@ -0,0 +1,128 @@
+"""Detection of duplicate documents in a text corpus.
+
+Pure function, dict-no-throw style of the `eda` group: it never raises and
+always returns the same key contract. EXACT duplicates are always computed
+with the stdlib (normalization + SHA-1 hash). NEAR-duplicates (near-dup)
+require the optional dependency `datasketch`; if it is not installed, that
+part degrades cleanly to ``available: False`` without affecting the rest of
+the computation.
+"""
+
+import hashlib
+import re
+
+
+def _compute_near_dup(valid, near_threshold, sample_max):
+    """Count documents with at least one other near-duplicate via MinHash + LSH.
+
+    Lazy import of ``datasketch``. If the library is unavailable (or any step
+    fails), degrades to ``{"available": False, "n_near_dup_docs": 0}`` without
+    propagating the exception.
+
+    Args:
+        valid: list of str, already filtered (no None or non-str items).
+        near_threshold: Jaccard similarity threshold for LSH.
+        sample_max: maximum number of documents to sample.
+
+    Returns:
+        dict with ``available`` (bool) and ``n_near_dup_docs`` (int). When
+        ``available`` is True, it also includes ``threshold``.
+    """
+    try:
+        from datasketch import MinHash, MinHashLSH
+    except Exception:
+        return {"available": False, "n_near_dup_docs": 0}
+
+    try:
+        docs = valid[:sample_max]
+        num_perm = 128
+        lsh = MinHashLSH(threshold=near_threshold, num_perm=num_perm)
+        minhashes = {}
+
+        for i, doc in enumerate(docs):
+            tokens = re.findall(r"\w+", doc.lower())
+            shingles = set()
+            for j in range(len(tokens) - 2):
+                shingles.add(" ".join(tokens[j:j + 3]))
+            # Documents with fewer than 3 tokens produce no 3-shingles: fall
+            # back to the individual tokens so they are not lost entirely.
+            if not shingles:
+                shingles = set(tokens)
+            if not shingles:
+                # Document with no tokens (empty string / symbols only): skipped.
+                continue
+            m = MinHash(num_perm=num_perm)
+            for sh in shingles:
+                m.update(sh.encode("utf-8"))
+            key = "d{}".format(i)
+            minhashes[key] = m
+            lsh.insert(key, m)
+
+        n_near = 0
+        for key, m in minhashes.items():
+            matches = lsh.query(m)
+            if len(matches) > 1:
+                n_near += 1
+
+        return {
+            "available": True,
+            "n_near_dup_docs": int(n_near),
+            "threshold": near_threshold,
+        }
+    except Exception:
+        return {"available": False, "n_near_dup_docs": 0}
+
+
+def compute_text_duplicates(texts, near_threshold=0.85, sample_max=2000) -> dict:
+    """Detect exact duplicates and near-duplicates in a text corpus.
+
+    Args:
+        texts: list of documents. None elements and non-str elements are
+            discarded; ``n_docs`` counts only the valid ones.
+        near_threshold: Jaccard similarity threshold for treating two
+            documents as near-duplicates (near-dup only, requires datasketch).
+        sample_max: cap on the number of documents sampled for near-dup.
+
+    Returns:
+        dict with the keys ``n_docs``, ``n_exact_dup``, ``exact_dup_pct``
+        (float rounded to 2 decimals, or None if the corpus is empty),
+        ``n_unique`` and ``near_dup`` (sub-dict with ``available`` and
+        ``n_near_dup_docs``, plus ``threshold`` when available).
+        Never raises: catches every exception and degrades.
+    """
+    # Defensive filtering of valid documents.
+    try:
+        valid = [t for t in texts if isinstance(t, str)] if texts is not None else []
+    except Exception:
+        valid = []
+
+    n_docs = len(valid)
+
+    # Exact duplicates: normalize + SHA-1 hash (stdlib, always available).
+    try:
+        seen = set()
+        n_exact_dup = 0
+        for doc in valid:
+            norm = " ".join(doc.split()).strip().lower()
+            digest = hashlib.sha1(norm.encode("utf-8")).hexdigest()
+            if digest in seen:
+                n_exact_dup += 1
+            else:
+                seen.add(digest)
+        n_unique = len(seen)
+    except Exception:
+        n_exact_dup = 0
+        n_unique = 0
+
+    exact_dup_pct = round(n_exact_dup / n_docs * 100, 2) if n_docs > 0 else None
+
+    # Near-duplicates: optional via datasketch, degrades on its own.
+    near_dup = _compute_near_dup(valid, near_threshold, sample_max)
+
+    return {
+        "n_docs": n_docs,
+        "n_exact_dup": n_exact_dup,
+        "exact_dup_pct": exact_dup_pct,
+        "n_unique": n_unique,
+        "near_dup": near_dup,
+    }
@@ -0,0 +1,77 @@
+"""Tests for compute_text_duplicates.
+
+Imports the leaf module directly (`datascience.compute_text_duplicates`) so as
+not to depend on the package re-exporting the function in its __init__.
+datasketch is normally NOT installed in the venv, so near_dup degrades to
+available=False; the tests do not require the library.
+"""
+
+from datascience.compute_text_duplicates import compute_text_duplicates
+
+
+EXPECTED_KEYS = {"n_docs", "n_exact_dup", "exact_dup_pct", "n_unique", "near_dup"}
+
+
+def test_duplicados_exactos():
+    """3 copies of the same text + 2 unique ones: n_exact_dup=2, pct>0."""
+    texts = [
+        "El gato come pescado",
+        "El gato come pescado",
+        "el  GATO   come pescado",  # same after normalization (spaces + case)
+        "Un perro ladra",
+        "La luna brilla",
+    ]
+    result = compute_text_duplicates(texts)
+
+    assert set(result.keys()) == EXPECTED_KEYS
+    assert result["n_docs"] == 5
+    # 3 copies of the first text (2 are repetitions) + 2 unique texts.
+    assert result["n_exact_dup"] == 2
+    assert result["n_unique"] == 3
+    assert result["exact_dup_pct"] is not None
+    assert result["exact_dup_pct"] > 0
+    # 2 / 5 * 100 = 40.0
+    assert abs(result["exact_dup_pct"] - 40.0) < 1e-9
+
+
+def test_sin_duplicados():
+    """Corpus with no repetitions: n_exact_dup=0, n_unique==n_docs."""
+    texts = [
+        "primero documento distinto",
+        "segundo documento distinto",
+        "tercero documento distinto",
+    ]
+    result = compute_text_duplicates(texts)
+
+    assert result["n_docs"] == 3
+    assert result["n_exact_dup"] == 0
+    assert result["n_unique"] == 3
+    assert abs(result["exact_dup_pct"] - 0.0) < 1e-9
+
+
+def test_vacio():
+    """Empty corpus: n_docs 0, exact_dup_pct None, does not raise."""
+    result = compute_text_duplicates([])
+
+    assert set(result.keys()) == EXPECTED_KEYS
+    assert result["n_docs"] == 0
+    assert result["n_exact_dup"] == 0
+    assert result["exact_dup_pct"] is None
+    assert result["n_unique"] == 0
+    assert result["near_dup"]["n_near_dup_docs"] == 0
+
+
+def test_near_dup_degrada():
+    """near_dup exposes 'available' (bool); no raise even without datasketch."""
+    texts = ["uno dos tres cuatro", "uno dos tres cuatro cinco", "algo distinto"]
+    result = compute_text_duplicates(texts)
+
+    near = result["near_dup"]
+    assert "available" in near
+    assert isinstance(near["available"], bool)
+    assert "n_near_dup_docs" in near
+    assert isinstance(near["n_near_dup_docs"], int)
+    # It also tolerates None and non-str inputs without raising.
+    mixed = compute_text_duplicates(["hola", None, 123, "hola"])
+    assert mixed["n_docs"] == 2
+    assert mixed["n_exact_dup"] == 1
@@ -0,0 +1,86 @@
+---
+id: compute_text_length_stats_py_datascience
+name: compute_text_length_stats
+kind: function
+lang: py
+domain: datascience
+version: "1.0.0"
+purity: pure
+signature: "def compute_text_length_stats(texts, n_bins=20) -> dict"
+description: "Profiles the length distribution of a corpus of text documents for EDA: per-document characters, words (unicode \\w+ tokens) and sentences (segments split on .!?… with a minimum of 1 per non-empty doc), each summarized with mean/p50/p90/p99/min/max (nearest-rank percentiles), plus an equal-width histogram of per-document word counts. None and non-str items are discarded. Dict-no-throw: never raises. Stdlib only (re, math)."
+tags: [eda, datascience, text, nlp, length, statistics, pure, python]
+uses_functions: []
+uses_types: []
+returns: []
+returns_optional: false
+error_type: ""
+imports: [re, math]
+example: |
+  from datascience.compute_text_length_stats import compute_text_length_stats
+  result = compute_text_length_stats(["Hola mundo.", "Una frase mas larga aqui."], n_bins=5)
+tested: true
+tests:
+  - "test_basico"
+  - "test_vacio"
+  - "test_descarta_none"
+  - "test_un_documento"
+test_file_path: "python/functions/datascience/compute_text_length_stats_test.py"
+file_path: "python/functions/datascience/compute_text_length_stats.py"
+params:
+  - name: texts
+    desc: "List of text documents (str). None entries and any non-str items (ints, floats, etc.) are discarded before any computation. An empty string \"\" is kept (chars 0, words 0, sentences 0)."
+  - name: n_bins
+    desc: "Number of equal-width bins for the per-document word-count histogram. Default 20. When all docs have the same word count, there are <2 docs, or n_bins < 1, a single covering bin is returned instead."
+output: "Dict with keys n_docs (int), chars, words, sentences and word_hist. Each of the three axis sub-dicts has the exact keys mean (float, 2 decimals), p50, p90, p99, min, max (ints). When there are no valid documents, n_docs is 0, every axis statistic is None and word_hist is []. word_hist is a list of {lo: float, hi: float, count: int} bins; the sum of all bin counts equals n_docs."
+---
+
+## Example
+
+```python
+from datascience.compute_text_length_stats import compute_text_length_stats
+
+compute_text_length_stats(
+    [
+        "Hola mundo.",
+        "Una frase mas larga con varias palabras aqui.",
+        "Esto. Tiene. Tres frases distintas!",
+    ],
+    n_bins=5,
+)
+# {
+#   "n_docs": 3,
+#   "chars":     {"mean": 30.33, "p50": 35, "p90": 45, "p99": 45, "min": 11, "max": 45},
+#   "words":     {"mean": 5.0,   "p50": 5,  "p90": 8,  "p99": 8,  "min": 2,  "max": 8},
+#   "sentences": {"mean": 1.67,  "p50": 1,  "p90": 3,  "p99": 3,  "min": 1,  "max": 3},
+#   "word_hist": [
+#     {"lo": 2.0, "hi": 3.2, "count": 1},
+#     {"lo": 3.2, "hi": 4.4, "count": 0},
+#     {"lo": 4.4, "hi": 5.6, "count": 1},
+#     {"lo": 5.6, "hi": 6.8, "count": 0},
+#     {"lo": 6.8, "hi": 8.0, "count": 1},
+#   ],
+# }
+```
+
+## When to use it
+
+Use it when profiling a free-text column or corpus in an EDA: when you need to
+know how long the documents are (in characters, words and sentences) and how
+that length is distributed before tokenizing, vectorizing or deciding on
+truncation/windows for a model. Pass it the list of raw strings from the
+column; `None` and non-text values are discarded automatically. It fits in the
+`eda` group as the length block alongside `summarize_categorical`.
+
+## Gotchas
+
+- Pure function, stdlib only (`re`, `math`). No numpy, pandas or sklearn.
+- Percentiles use the **nearest-rank** method (they return an actual value from
+  the list, with no interpolation); that is why p50/p90/p99/min/max are
+  integers and `mean` is the only float (rounded to 2 decimals).
+- The sentence count is an **approximation** based on punctuation (`.!?…`): a
+  text without that punctuation counts as 1 sentence if it is not empty;
+  abbreviations or ellipses can inflate or reduce the count.
+- `word_hist` is equal-width between the min and max word counts: with all docs
+  the same size, fewer than 2 docs, or `n_bins < 1`, it returns a single bin.
+- Dict-no-throw: on unexpected input it returns the empty shape
+  (`n_docs` 0, `None` axes, `word_hist` []) instead of raising.
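
The nearest-rank rule mentioned above can be illustrated with a short sketch. The function name `nearest_rank` is hypothetical (the module keeps its own private helper); the comparison with `statistics.median` is only there to show how an interpolating definition differs.

```python
import math
import statistics


def nearest_rank(sorted_vals, q):
    """Value at 1-based rank ceil(q/100 * n), clamped to [1, n]; None if empty."""
    n = len(sorted_vals)
    if n == 0:
        return None
    rank = min(max(math.ceil(q / 100.0 * n), 1), n)
    return sorted_vals[rank - 1]


word_counts = sorted([2, 8, 5])
p50 = nearest_rank(word_counts, 50)   # rank ceil(0.5 * 3) = 2
p90 = nearest_rank(word_counts, 90)   # rank ceil(0.9 * 3) = 3

# With an even number of values the two definitions diverge.
pair = [2, 8]
nr_median = nearest_rank(pair, 50)        # rank 1: always an observed value
interp_median = statistics.median(pair)   # averages the two middle values
```

Because nearest-rank always returns an element of the input, the percentiles of integer counts stay integers, which is consistent with the output contract described above.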
@@ -0,0 +1,168 @@
+"""Pure EDA helper: document length distribution for the `eda` group.
+
+Given a list of text documents, computes the length distribution along three
+axes (characters, words and sentences) plus an equal-width histogram of the
+per-document word counts. Stdlib only (``re`` and ``math``); percentiles use a
+hand-rolled nearest-rank method. No numpy, no sklearn.
+
+The function is dict-no-throw: it never raises. On any unexpected input it
+degrades to the empty-shape result.
+"""
+
+import math
+import re
+
+_WORD_RE = re.compile(r"\w+", re.UNICODE)
+_SENT_RE = re.compile(r"[.!?…]+")
+
+
+def _empty_axis() -> dict:
+    """Return an axis sub-dict with every statistic set to ``None``."""
+    return {"mean": None, "p50": None, "p90": None, "p99": None, "min": None, "max": None}
+
+
+def _pct(sorted_vals, q):
+    """Nearest-rank percentile of an already-sorted list.
+
+    Args:
+        sorted_vals: List of numbers sorted ascending.
+        q: Percentile in the 0..100 range.
+
+    Returns:
+        The value at the nearest rank, or ``None`` for an empty list.
+    """
+    n = len(sorted_vals)
+    if n == 0:
+        return None
+    if q <= 0:
+        return sorted_vals[0]
+    rank = math.ceil(q / 100.0 * n)
+    if rank < 1:
+        rank = 1
+    if rank > n:
+        rank = n
+    return sorted_vals[rank - 1]
+
+
+def _axis_stats(values) -> dict:
+    """Compute mean/p50/p90/p99/min/max over a list of integer counts.
+
+    ``mean`` is rounded to 2 decimals; every other statistic is an integer
+    (they are counts). Returns an all-``None`` axis for an empty list.
+    """
+    if not values:
+        return _empty_axis()
+    sv = sorted(values)
+    return {
+        "mean": round(sum(sv) / len(sv), 2),
+        "p50": int(_pct(sv, 50)),
+        "p90": int(_pct(sv, 90)),
+        "p99": int(_pct(sv, 99)),
+        "min": int(sv[0]),
+        "max": int(sv[-1]),
+    }
+
+
+def _word_hist(word_counts, n_bins) -> list:
+    """Equal-width histogram of per-document word counts.
+
+    Builds ``n_bins`` bins between ``min`` and ``max`` of the word counts. When
+    every document has the same number of words, there are fewer than 2
+    documents, or ``n_bins`` is not at least 1, a single covering bin is
+    returned. With no documents the result is ``[]``. The sum of bin ``count``
+    always equals ``len(word_counts)``.
+    """
+    if not word_counts:
+        return []
+    wmin = min(word_counts)
+    wmax = max(word_counts)
+    if wmax == wmin or len(word_counts) < 2 or n_bins < 1:
+        return [{"lo": float(wmin), "hi": float(wmax), "count": len(word_counts)}]
+
+    width = (wmax - wmin) / n_bins
+    bins = []
+    for i in range(n_bins):
+        lo = wmin + i * width
+        hi = wmin + (i + 1) * width
+        bins.append({"lo": float(lo), "hi": float(hi), "count": 0})
+    # Pin the last upper edge to the real maximum to avoid float drift.
+    bins[-1]["hi"] = float(wmax)
+
+    for wc in word_counts:
+        if wc >= wmax:
+            idx = n_bins - 1
+        else:
+            idx = int((wc - wmin) / width)
+            if idx < 0:
+                idx = 0
+            elif idx >= n_bins:
+                idx = n_bins - 1
+        bins[idx]["count"] += 1
+    return bins
+
+
+def compute_text_length_stats(texts, n_bins=20) -> dict:
+    """Summarize the length distribution of a corpus of text documents.
+
+    For each document three lengths are measured: characters (``len(doc)``),
+    words (count of ``\\w+`` unicode tokens) and sentences (non-empty segments
+    after splitting on ``.!?…``, with a minimum of 1 for any non-empty
+    document). For each axis the mean, p50, p90, p99, min and max are reported,
+    plus an equal-width histogram of the per-document word counts.
+
+    ``None`` entries and any non-``str`` items in ``texts`` are discarded.
+    The function never raises: on empty/``None`` input or any internal error it
+    returns the empty-shape result (``n_docs`` 0, all-``None`` axes, ``[]``
+    histogram).
+
+    Args:
+        texts: List of text documents (``str``). ``None`` and non-``str``
+            items are dropped.
+        n_bins: Number of equal-width bins for the word-count histogram.
+            Default 20.
+
+    Returns:
+        Dict with keys ``n_docs``, ``chars``, ``words``, ``sentences`` and
+        ``word_hist``. Each of the three axes is a sub-dict with ``mean``
+        (float, 2 decimals), ``p50``, ``p90``, ``p99``, ``min`` and ``max``
+        (ints), all ``None`` when there are no documents. ``word_hist`` is a
+        list of ``{lo, hi, count}`` bins whose ``count`` sums to ``n_docs``.
+    """
+    empty_axis = _empty_axis()
+    fallback = {
+        "n_docs": 0,
+        "chars": dict(empty_axis),
+        "words": dict(empty_axis),
+        "sentences": dict(empty_axis),
+        "word_hist": [],
+    }
+    try:
+        if not texts:
+            return fallback
+
+        docs = [t for t in texts if isinstance(t, str)]
+        n_docs = len(docs)
+        if n_docs == 0:
+            return fallback
+
+        char_counts = [len(d) for d in docs]
+        word_counts = [len(_WORD_RE.findall(d)) for d in docs]
+
+        sent_counts = []
+        for d in docs:
+            segments = [s for s in _SENT_RE.split(d) if s.strip()]
+            n = len(segments)
+            if d and n == 0:
+                # Non-empty document with no detectable sentence: count as 1.
+                n = 1
+            sent_counts.append(n)
+
+        return {
+            "n_docs": n_docs,
+            "chars": _axis_stats(char_counts),
+            "words": _axis_stats(word_counts),
+            "sentences": _axis_stats(sent_counts),
+            "word_hist": _word_hist(word_counts, n_bins),
+        }
+    except Exception:
+        return fallback
@@ -0,0 +1,70 @@
+"""Tests for compute_text_length_stats.
+
+Inserts `python/functions` into sys.path (relative to this file) to import the
+leaf module through its `datascience` package, without depending on the package
+re-exporting it in its __init__.
+"""
+
+import os
+import sys
+
+sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
+
+from datascience.compute_text_length_stats import compute_text_length_stats
+
+
+def test_basico():
+    """Several texts of different lengths: consistent stats and histogram."""
+    texts = [
+        "Hola mundo.",                      # 2 words, 1 sentence
+        "Una frase mas larga con varias palabras aqui.",  # 8 words, 1 sentence
+        "Corto.",                           # 1 word, 1 sentence
+        "Esto. Tiene. Tres frases distintas!",            # 5 words, 3 sentences
+    ]
+    result = compute_text_length_stats(texts)
+
+    assert result["n_docs"] == 4
+    # Different word lengths -> max strictly greater than min.
+    assert result["words"]["max"] > result["words"]["min"]
+    # The word histogram is not empty.
+    assert result["word_hist"] != []
+    # The histogram counts add up to cover every document.
+    assert sum(b["count"] for b in result["word_hist"]) == result["n_docs"]
+    # mean is a rounded float; min/max are integers.
+    assert isinstance(result["words"]["mean"], float)
+    assert isinstance(result["words"]["min"], int)
+    assert isinstance(result["words"]["max"], int)
+    # The document with 3 sentences pushes the sentences max to >= 3.
+    assert result["sentences"]["max"] >= 3
+
+
+def test_vacio():
+    """Empty list: n_docs 0, all-None sub-dicts, word_hist []."""
+    result = compute_text_length_stats([])
+    assert result["n_docs"] == 0
+    for axis in ("chars", "words", "sentences"):
+        for key in ("mean", "p50", "p90", "p99", "min", "max"):
+            assert result[axis][key] is None
+    assert result["word_hist"] == []
+
+
+def test_descarta_none():
+    """None and non-str values are dropped from the computation."""
+    result = compute_text_length_stats(["hello world", None, 123, 4.5, "foo bar baz"])
+    # Only two valid strings.
+    assert result["n_docs"] == 2
+    assert result["words"]["min"] == 2  # "hello world"
+    assert result["words"]["max"] == 3  # "foo bar baz"
+    assert sum(b["count"] for b in result["word_hist"]) == 2
+
+
+def test_un_documento():
+    """A single document: word_hist has exactly one bin with count 1."""
+    result = compute_text_length_stats(["solo un documento aqui"])
+    assert result["n_docs"] == 1
+    assert len(result["word_hist"]) == 1
+    assert result["word_hist"][0]["count"] == 1
+    # With a single document, p50 == min == max == its word count (4).
+    assert result["words"]["min"] == 4
+    assert result["words"]["max"] == 4
+    assert result["words"]["p50"] == 4
@@ -0,0 +1,88 @@
+---
+id: compute_text_readability_py_datascience
+name: compute_text_readability
+kind: function
+lang: py
+domain: datascience
+version: "1.0.0"
+purity: pure
+signature: "def compute_text_readability(texts, sample_max=500) -> dict"
+description: "Computes the Flesch Reading Ease readability of a text corpus using textstat with lazy import and graceful degradation. Filters out None/non-str/empty items, samples up to sample_max documents (the first ones) and aggregates the Flesch scores into {mean, p50, min, max}. If textstat is not installed it returns available=False without raising. Dict-no-throw style of the eda group: never raises."
+tags: [eda, datascience, text, nlp, readability, flesch, textstat, pure, python]
+uses_functions: []
+uses_types: []
+returns: []
+returns_optional: false
+error_type: ""
+imports: [math, textstat]
+example: |
+  from datascience.compute_text_readability import compute_text_readability
+  out = compute_text_readability(["The cat sat on the mat. It was warm and sunny."])
+  # {"available": True, "n_scored": 1, "flesch": {"mean": 109.0, "p50": 109.0, "min": 108.96..., "max": 108.96...}}
+tested: true
+tests:
+  - "test_prosa_ingles"
+  - "test_vacio"
+  - "test_degradacion"
+test_file_path: "python/functions/datascience/compute_text_readability_test.py"
+file_path: "python/functions/datascience/compute_text_readability.py"
+params:
+  - name: texts
+    desc: "List of str (the corpus documents). Elements that are None, non-str or empty after strip() are silently discarded. Order is preserved: sampling takes the first valid documents."
+  - name: sample_max
+    desc: "Maximum number of valid documents to score (the first ones). Default 500. Bounds the cost on large corpora. Values not convertible to int fall back to 500; negative values are treated as 0."
+output: "Dict with exactly 3 keys, always present: available (bool: True if textstat could be imported), n_scored (int: number of documents actually scored), flesch (dict with mean, p50, min, max). mean and p50 are rounded to 1 decimal; p50 uses nearest-rank over the sorted scores; min/max are the extreme scores, unrounded. All flesch values are None when n_scored is 0. The function never raises: any global exception (including an ImportError from textstat) degrades to available=False, n_scored=0 and an all-None flesch."
+---
+
+## Example
+
+```python
+from datascience.compute_text_readability import compute_text_readability
+
+textos = [
+    "The cat sat on the mat. It was a warm and sunny day in the park.",
+    "Reading is a wonderful habit. Books open doors to new worlds and ideas.",
+    "He ran quickly to the store to buy some fresh bread and a bottle of milk.",
+]
+
+compute_text_readability(textos)
+# {
+#   "available": True,
+#   "n_scored": 3,
+#   "flesch": {"mean": 91.4, "p50": 95.4, "min": 70.08..., "max": 108.83...}
+# }
+
+# Empty corpus (textstat present): available is True but there is nothing to score.
+compute_text_readability([])
+# {"available": True, "n_scored": 0,
+#  "flesch": {"mean": None, "p50": None, "min": None, "max": None}}
+```
+
+## When to use it
+
+Use it in a text EDA when you need a single, comparable metric of **how easy
+it is to read** a corpus of documents (descriptions, reviews, articles,
+tickets). It returns the aggregated Flesch Reading Ease summary
+(`mean`/`p50`/`min`/`max`), ready for a report or a notebook block, without
+having to iterate over `textstat` by hand. Pass it the list of raw texts and,
+if the corpus is large, bound the cost with `sample_max`. The dict-no-throw
+style lets you embed it in `eda` group pipelines without wrapping it in try/except.
+
+## Gotchas
+
+- **`textstat` is an optional dependency.** If it is not installed (or fails to
+  import) the function does NOT raise: it returns `available=False`,
+  `n_scored=0` and an all-`None` `flesch`. Check `available` before interpreting the numbers.
+- **Flesch Reading Ease is designed for English prose.** Applied to other
+  languages or to non-prose text (code, lists, tables, very short strings) the
+  scores are not interpretable, even though they are computed without error.
+- **Flesch scale:** **high** values = easier to read (≈90–100 very easy),
+  **low** values = harder (it can be negative for very dense text). Scores are
+  not clipped to any range: they are reported exactly as `textstat` returns them.
+- **`available=True` with `n_scored=0`** means `textstat` is present but the
+  corpus contributed no scorable documents (empty, only None/non-str, or every
+  doc failed to score). This is different from `available=False`.
+- **Sampling = the first `sample_max`**, not random. If the corpus order is
+  biased, the summary will reflect that bias.
+- **`mean` and `p50` are rounded to 1 decimal**; `min`/`max` are returned
+  unrounded (the actual extreme scores).
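
To see why the scale is unbounded, it helps to look at the standard English Flesch Reading Ease formula, 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words). The sketch below evaluates it from raw counts; the function and its arguments are illustrative and not the `textstat` API, and `textstat` counts syllables with its own heuristics, so real scores may differ from hand-computed ones.

```python
def flesch_reading_ease(n_words, n_sentences, n_syllables):
    """Standard English Flesch Reading Ease from raw counts, or None if undefined."""
    if n_words <= 0 or n_sentences <= 0:
        return None
    words_per_sentence = n_words / n_sentences
    syllables_per_word = n_syllables / n_words
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


# Hypothetical counts chosen only to show the range of the scale.
typical = flesch_reading_ease(100, 5, 130)     # 20 words/sentence, 1.3 syllables/word
very_easy = flesch_reading_ease(6, 1, 6)       # short sentence of monosyllables
very_dense = flesch_reading_ease(40, 1, 100)   # one long sentence of long words
```

Short sentences of one-syllable words push the score above 100, while a single long sentence of polysyllabic words drives it below zero, which matches the note that scores are not clipped to any range.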
@@ -0,0 +1,121 @@
+"""Flesch Reading Ease readability of a text corpus.
+
+Pure function of the `eda` group, dict-no-throw style: it never raises. Uses
+the `textstat` library with lazy import and graceful degradation: if `textstat`
+is not installed (or fails to import), it returns a result with
+`available=False` instead of propagating the error.
+"""
+
+
+def _percentile_nearest_rank(sorted_values, pct):
+    """Nearest-rank percentile over a list already sorted in ascending order.
+
+    rank = ceil(pct/100 * n); 1-based index clamped to [1, n].
+    Returns None if the list is empty.
+    """
+    n = len(sorted_values)
+    if n == 0:
+        return None
+    import math
+
+    rank = math.ceil((pct / 100.0) * n)
+    if rank < 1:
+        rank = 1
+    if rank > n:
+        rank = n
+    return sorted_values[rank - 1]
+
+
+def compute_text_readability(texts, sample_max=500) -> dict:
+    """Compute the Flesch Reading Ease readability of a corpus.
+
+    Args:
+        texts: list of str. Elements that are None, non-str or empty (after
+            strip) are discarded. The first `sample_max` valid documents are
+            sampled.
+        sample_max: maximum number of documents to score (the first ones).
+
+    Returns:
+        Dict with the exact shape::
+
+            {"available": bool, "n_scored": int,
+             "flesch": {"mean": float|None, "p50": float|None,
+                        "min": float|None, "max": float|None}}
+
+        `available` is True if `textstat` could be imported. The function never
+        raises: any global exception degrades to `available=False`.
+    """
+    empty = {
+        "available": False,
+        "n_scored": 0,
+        "flesch": {"mean": None, "p50": None, "min": None, "max": None},
+    }
+    try:
+        # Lazy import with degradation: textstat is an optional dependency.
+        try:
+            import textstat
+        except Exception:
+            return {
+                "available": False,
+                "n_scored": 0,
+                "flesch": {"mean": None, "p50": None, "min": None, "max": None},
+            }
+
+        # Filter and sample valid documents (the first sample_max).
+        docs = []
+        if texts is not None:
+            try:
+                limit = int(sample_max)
+            except Exception:
+                limit = 500
+            if limit < 0:
+                limit = 0
+            for item in texts:
+                # Check the cap before appending so that sample_max=0 scores nothing.
+                if len(docs) >= limit:
+                    break
+                if not isinstance(item, str):
+                    continue
+                if item.strip() == "":
+                    continue
+                docs.append(item)
+
+        scores = []
+        for doc in docs:
+            try:
+                score = textstat.flesch_reading_ease(doc)
+            except Exception:
+                continue
+            try:
+                score = float(score)
+            except Exception:
+                continue
+            scores.append(score)
+
+        n_scored = len(scores)
+        if n_scored == 0:
+            # textstat is present but the corpus is empty or nothing could be scored.
+            return {
+                "available": True,
+                "n_scored": 0,
+                "flesch": {"mean": None, "p50": None, "min": None, "max": None},
+            }
+
+        mean_val = round(sum(scores) / n_scored, 1)
+        sorted_scores = sorted(scores)
+        p50_raw = _percentile_nearest_rank(sorted_scores, 50)
+        p50_val = round(p50_raw, 1) if p50_raw is not None else None
+        min_val = sorted_scores[0]
+        max_val = sorted_scores[-1]
+
+        return {
+            "available": True,
+            "n_scored": n_scored,
+            "flesch": {
+                "mean": mean_val,
+                "p50": p50_val,
+                "min": min_val,
+                "max": max_val,
+            },
+        }
+    except Exception:
+        return empty
@@ -0,0 +1,74 @@
+"""Tests for compute_text_readability."""
+
+import sys
+import os
+import builtins
+
+sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", ".."))
+
+from datascience.compute_text_readability import compute_text_readability
+
+
+EXPECTED_KEYS = {"available", "n_scored", "flesch"}
+FLESCH_KEYS = {"mean", "p50", "min", "max"}
+
+
+def test_prosa_ingles():
+    """Several English prose texts: available True, n_scored > 0, mean not None."""
+    texts = [
+        "The cat sat on the mat. It was a warm and sunny day in the park.",
+        "She sells sea shells by the sea shore. The shells she sells are surely sea shells.",
+        "Reading is a wonderful habit. Books open doors to new worlds and ideas.",
+        "He ran quickly to the store to buy some fresh bread and a bottle of milk.",
+    ]
+    out = compute_text_readability(texts)
+
+    assert set(out.keys()) == EXPECTED_KEYS
+    assert out["available"] is True
+    assert out["n_scored"] > 0
+    assert set(out["flesch"].keys()) == FLESCH_KEYS
+    assert out["flesch"]["mean"] is not None
+    assert out["flesch"]["p50"] is not None
+    assert out["flesch"]["min"] is not None
+    assert out["flesch"]["max"] is not None
+    # min <= max must hold. mean/p50 are rounded to 1 decimal, so they are not
+    # compared against the unrounded min/max.
+    assert out["flesch"]["min"] <= out["flesch"]["max"]
+
+
+def test_vacio():
+    """Empty corpus with textstat present: available True, n_scored 0, flesch None."""
+    out = compute_text_readability([])
+
+    assert set(out.keys()) == EXPECTED_KEYS
+    assert out["available"] is True
+    assert out["n_scored"] == 0
+    assert out["flesch"]["mean"] is None
+    assert out["flesch"]["p50"] is None
+    assert out["flesch"]["min"] is None
+    assert out["flesch"]["max"] is None
+
+    # Non-str / empty elements are also discarded -> n_scored 0.
+    out2 = compute_text_readability([None, "", "   ", 123])
+    assert out2["available"] is True
+    assert out2["n_scored"] == 0
+
+
+def test_degradacion(monkeypatch):
+    """Without textstat (forced ImportError): degrades to available False without raising."""
+    import datascience.compute_text_readability as m
+
+    real = builtins.__import__
+
+    def fake(name, *a, **k):
+        if name == "textstat" or name.startswith("textstat."):
+            raise ImportError("simulado")
+        return real(name, *a, **k)
+
+    monkeypatch.setattr(builtins, "__import__", fake)
+    out = m.compute_text_readability(["The cat sat on the mat. It was happy and warm."])
+    assert out["available"] is False
+    assert out["n_scored"] == 0
+    assert out["flesch"]["mean"] is None
+    assert out["flesch"]["p50"] is None
+    assert out["flesch"]["min"] is None
+    assert out["flesch"]["max"] is None
@@ -0,0 +1,103 @@
+---
+id: compute_top_ngrams_py_datascience
+name: compute_top_ngrams
+kind: function
+lang: py
+domain: datascience
+version: "1.0.0"
+purity: pure
+signature: "def compute_top_ngrams(texts, n=2, top_k=15, remove_stopwords=True) -> dict"
+description: "Computes the most frequent word n-grams of a text corpus (n=1 unigrams, 2 bigrams, 3 trigrams...). Tokenises to lowercase with re.findall(r'\\w+', ...), drops purely numeric tokens and, when remove_stopwords=True, removes ES+EN stopwords BEFORE forming the n-grams (contiguous n-grams over the sequence of content tokens, never crossing documents). Pure and self-contained with collections.Counter, no sklearn. dict-no-throw style of the eda group: never raises."
+tags: [eda, datascience, text, nlp, ngrams, bigrams, trigrams, pure, python]
+uses_functions: []
+uses_types: []
+returns: []
+returns_optional: false
+error_type: ""
+imports: [re, collections]
+example: |
+  from datascience.compute_top_ngrams import compute_top_ngrams
+  texts = ["machine learning rocks", "we love machine learning"]
+  compute_top_ngrams(texts, n=2, top_k=5)
+  # {"n": 2, "top": [{"ngram": "machine learning", "count": 2}, ...]}
+tested: true
+tests:
+  - "test_bigramas"
+  - "test_trigramas"
+  - "test_vacio"
+  - "test_stopwords"
+test_file_path: "python/functions/datascience/compute_top_ngrams_test.py"
+file_path: "python/functions/datascience/compute_top_ngrams.py"
+params:
+  - name: texts
+    desc: "List (or tuple) of strings. Elements that are None or not str are silently discarded. Each document is tokenised separately; n-grams never cross the boundary between documents."
+  - name: n
+    desc: "N-gram size: 1 unigrams, 2 bigrams, 3 trigrams, etc. Values < 1 or non-integer produce an empty top (the value is kept as-is under the 'n' key of the result)."
+  - name: top_k
+    desc: "Maximum number of n-grams to return, sorted by descending frequency with deterministic alphabetical tie-breaking. Default 15. Negative values are treated as 0."
+  - name: remove_stopwords
+    desc: "When True (default), removes the ES+EN stopwords of an inline list (~150 very high-frequency terms) BEFORE forming the n-grams, so the n-grams are built over the sequence of content tokens."
+output: "Dict with exactly 2 keys: n (the n received, not normalised) and top (list of dicts {'ngram': str, 'count': int} sorted by descending count, length <= top_k). ngram is the n-gram's tokens joined by a single space. An empty corpus, too few tokens to form n-grams or any internal exception degrades to {'n': n, 'top': []}. The function never raises."
+---
+
+## Ejemplo
+
+```python
+from datascience.compute_top_ngrams import compute_top_ngrams
+
+texts = [
+    "machine learning rocks",
+    "machine learning is fun",
+    "we love machine learning",
+]
+
+# Bigrams (n=2): "machine learning" appears in all 3 documents.
+compute_top_ngrams(texts, n=2, top_k=5)
+# {
+#   "n": 2,
+#   "top": [
+#       {"ngram": "machine learning", "count": 3},
+#       {"ngram": "learning fun",     "count": 1},
+#       {"ngram": "learning rocks",   "count": 1},
+#       {"ngram": "love machine",     "count": 1},
+#   ],
+# }
+
+# Unigrams with stopwords removed (default): content words only.
+compute_top_ngrams(["the cat sat on the mat"], n=1, top_k=3)
+# {"n": 1, "top": [{"ngram": "cat", "count": 1},
+#                  {"ngram": "mat", "count": 1},
+#                  {"ngram": "sat", "count": 1}]}
+```
+
+## Cuando usarla
+
+Use it in the text EDA phase when, beyond the individual vocabulary, you need
+to see which **contiguous word combinations** dominate a corpus: collocations,
+recurring technical phrases ("machine learning", "data analyst"), or trigram
+patterns in headlines/descriptions. It is the natural complement to a
+vocabulary profile: it moves from "which words appear" to "which sequences
+appear". Call it with `n=1` for unigrams, `n=2` for bigrams and `n=3` for
+trigrams, and set `top_k` to the size of the table you will render. Leave
+`remove_stopwords=True` so the n-grams reflect content rather than grammatical connectors.
+
+## Gotchas
+
+- **Stopwords are removed BEFORE the n-grams are formed.** With
+  `remove_stopwords=True` the phrase "data of analysis" yields the bigram
+  "data analysis" (the intermediate "of" disappears and the content tokens
+  become adjacent), not "data of" or "of analysis". To preserve the literal
+  adjacency of the original text, pass `remove_stopwords=False`.
+- **N-grams do NOT cross documents.** Each element of `texts` is tokenised and
+  traversed separately; the last token of one document is never combined with
+  the first token of the next.
+- **Purely numeric tokens are dropped** (`tok.isdigit()`), but mixed
+  alphanumeric ones are not: "3d" and "covid19" do count as tokens. A decimal
+  such as "3.5" is split into "3" and "5" by `\w+`, and both are dropped as numeric.
+- **The stopword list is inline ES+EN**, intended for general text in those
+  two languages. For other languages or domain-specific jargon it may let
+  connectors through; in that case filter the corpus upstream or use
+  `remove_stopwords=False` and post-filter.
+- **`top` may contain fewer than `top_k` items** if the corpus does not have
+  that many distinct n-grams. Ties in frequency are broken alphabetically
+  (deterministic), not by order of appearance.
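Review note: the first and third gotchas are easiest to see on a toy input. The sketch below re-implements only the tokenise, filter and join steps described above, with a two-word stoplist chosen for illustration. `STOP` and `bigrams` are hypothetical names and do not replace `compute_top_ngrams`.

```python
import re

STOP = {"of", "the"}  # tiny illustrative stoplist, NOT the real inline list


def bigrams(text, remove_stopwords=True):
    """Hypothetical helper: contiguous bigrams after optional stopword removal."""
    # Lowercase, split on \w+, drop purely numeric tokens.
    tokens = [t for t in re.findall(r"\w+", text.lower()) if not t.isdigit()]
    if remove_stopwords:
        # Filtering happens BEFORE pairing, so neighbours can change.
        tokens = [t for t in tokens if t not in STOP]
    return [" ".join(tokens[i:i + 2]) for i in range(len(tokens) - 1)]


with_filter = bigrams("data of analysis")
# ['data analysis']
without_filter = bigrams("data of analysis", remove_stopwords=False)
# ['data of', 'of analysis']
numeric_case = bigrams("version 3.5 of covid19 3d", remove_stopwords=False)
# ['version of', 'of covid19', 'covid19 3d']
```

In the last call "3.5" is split into "3" and "5", both are dropped as numeric, and "version" becomes adjacent to "of".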
@@ -0,0 +1,94 @@
+"""Top most frequent word n-grams of a text corpus.
+
+Pure, self-contained function (stdlib only: re + collections.Counter). It does
+not depend on scikit-learn or any other external library. dict-no-throw style
+of the `eda` group: on any degenerate input or internal exception it returns
+``{"n": n, "top": []}`` instead of raising.
+"""
+
+import re
+from collections import Counter
+
+# Inline list of ES + EN stopwords (~150 very high-frequency terms).
+# They are removed BEFORE forming the n-grams: the n-grams are built over the
+# sequence of content tokens, not over the original text.
+_STOPWORDS = frozenset({
+    # Spanish
+    "de", "la", "que", "el", "en", "y", "a", "los", "del", "se", "las", "por",
+    "un", "para", "con", "no", "una", "su", "al", "lo", "como", "más", "mas",
+    "pero", "sus", "le", "ya", "o", "este", "sí", "si", "porque", "esta",
+    "entre", "cuando", "muy", "sin", "sobre", "también", "tambien", "me",
+    "hasta", "hay", "donde", "quien", "desde", "todo", "nos", "durante",
+    "todos", "uno", "les", "ni", "contra", "otros", "ese", "eso", "ante",
+    "ellos", "e", "esto", "mí", "antes", "algunos", "qué", "unos", "yo",
+    "otro", "otras", "otra", "él", "tanto", "esa", "estos", "mucho", "quienes",
+    "nada", "muchos", "cual", "poco", "ella", "estar", "estas", "algunas",
+    "algo", "nosotros",
+    # English
+    "the", "of", "and", "to", "in", "is", "it", "for", "on", "with", "as",
+    "are", "was", "be", "this", "that", "by", "an", "or", "at", "from", "but",
+    "not", "have", "has", "had", "they", "you", "we", "he", "she", "his",
+    "her", "their", "its", "i", "my", "me", "our", "us", "do", "does", "did",
+    "will", "would", "can", "could", "should", "there", "which", "who", "what",
+    "when", "where", "how", "all", "if", "so", "than", "then", "out", "up",
+})
+
+
+def compute_top_ngrams(texts, n=2, top_k=15, remove_stopwords=True) -> dict:
+    """Compute the most frequent word n-grams of a corpus.
+
+    Args:
+        texts: list (or tuple) of strings. Elements that are ``None`` or not
+            ``str`` are silently discarded.
+        n: n-gram size (1 = unigrams, 2 = bigrams, 3 = trigrams...).
+            Values < 1 or non-integer produce an empty ``top``.
+        top_k: maximum number of n-grams to return, sorted by descending
+            frequency (with deterministic alphabetical tie-breaking).
+        remove_stopwords: if ``True``, removes the ES+EN stopwords BEFORE
+            forming the n-grams, so the n-grams are built over the sequence
+            of content tokens (never crossing documents).
+
+    Returns:
+        ``{"n": n, "top": [{"ngram": "w1 w2", "count": int}, ...]}``. An empty
+        corpus, too few tokens or any internal exception degrades to
+        ``{"n": n, "top": []}``. Never raises.
+    """
+    try:
+        if not isinstance(n, int) or n < 1:
+            return {"n": n, "top": []}
+
+        try:
+            limit = int(top_k)
+        except (TypeError, ValueError):
+            limit = 0
+        if limit < 0:
+            limit = 0
+
+        if not isinstance(texts, (list, tuple)):
+            return {"n": n, "top": []}
+
+        counter = Counter()
+        for doc in texts:
+            if not isinstance(doc, str):
+                continue
+            tokens = [
+                tok
+                for tok in re.findall(r"\w+", doc.lower(), re.UNICODE)
+                if not tok.isdigit()
+            ]
+            if remove_stopwords:
+                tokens = [tok for tok in tokens if tok not in _STOPWORDS]
+            if len(tokens) < n:
+                continue
+            for i in range(len(tokens) - n + 1):
+                ngram = " ".join(tokens[i:i + n])
+                counter[ngram] += 1
+
+        if not counter:
+            return {"n": n, "top": []}
+
+        ordered = sorted(counter.items(), key=lambda kv: (-kv[1], kv[0]))
+        top = [{"ngram": ngram, "count": count} for ngram, count in ordered[:limit]]
+        return {"n": n, "top": top}
+    except Exception:
+        return {"n": n, "top": []}
@@ -0,0 +1,65 @@
+"""Tests for compute_top_ngrams."""
+
+import sys
+import os
+
+# Standard sys.path setup: add `python/functions/` (the parent of this package
+# directory) so the module can be imported through its root package.
+sys.path.insert(0, os.path.join(os.path.dirname(__file__), ".."))
+
+from datascience.compute_top_ngrams import compute_top_ngrams
+
+
+def test_bigramas():
+    # "machine learning" occurs in every document -> most frequent bigram.
+    texts = [
+        "machine learning rocks",
+        "machine learning is fun",
+        "we love machine learning",
+    ]
+    result = compute_top_ngrams(texts, n=2, top_k=5)
+    assert result["n"] == 2
+    assert result["top"], "expected at least one bigram"
+    assert result["top"][0]["ngram"] == "machine learning"
+    assert result["top"][0]["count"] == 3
+    # Each entry honours the {"ngram": str, "count": int} contract.
+    for item in result["top"]:
+        assert isinstance(item["ngram"], str)
+        assert isinstance(item["count"], int)
+
+
+def test_trigramas():
+    texts = [
+        "alpha beta gamma delta",
+        "alpha beta gamma omega",
+    ]
+    # Stopwords disabled so that no content tokens are dropped.
+    result = compute_top_ngrams(texts, n=3, top_k=5, remove_stopwords=False)
+    assert result["n"] == 3
+    ngrams = {item["ngram"]: item["count"] for item in result["top"]}
+    # "alpha beta gamma" appears in both documents.
+    assert ngrams.get("alpha beta gamma") == 2
+    # Trigrams unique to each document.
+    assert ngrams.get("beta gamma delta") == 1
+    assert ngrams.get("beta gamma omega") == 1
+
+
+def test_vacio():
+    assert compute_top_ngrams([], n=2) == {"n": 2, "top": []}
+    # Non-str / None documents are discarded -> effectively empty corpus.
+    assert compute_top_ngrams([None, 123, {"a": 1}], n=2) == {"n": 2, "top": []}
+
+
+def test_stopwords():
+    # "the cat" should disappear once stopwords are removed ("the" is an EN stopword).
+    texts = ["the cat the cat the cat"]
+    con = compute_top_ngrams(texts, n=2, top_k=10, remove_stopwords=True)
+    sin = compute_top_ngrams(texts, n=2, top_k=10, remove_stopwords=False)
+
+    con_ngrams = {item["ngram"] for item in con["top"]}
+    sin_ngrams = {item["ngram"] for item in sin["top"]}
+
+    # Without filtering, the dominant bigram is "the cat".
+    assert "the cat" in sin_ngrams
+    # With stopword filtering, "the cat" no longer appears (only "cat cat" remains).
+    assert "the cat" not in con_ngrams
+    assert con_ngrams != sin_ngrams
@@ -0,0 +1,91 @@
+---
+id: compute_vocabulary_stats_py_datascience
+name: compute_vocabulary_stats
+kind: function
+lang: py
+domain: datascience
+version: "1.0.0"
+purity: pure
+signature: "def compute_vocabulary_stats(texts: list, top_k: int = 20, remove_stopwords: bool = True) -> dict"
+description: "Profiles the vocabulary of a text corpus for EDA: tokenises a list of documents, counts term frequencies and derives lexical-richness measures — total tokens, unique types, type-token ratio (TTR), hapax legomena and the top-k most frequent terms. Pure, stdlib only (re + collections.Counter); no nltk, no sklearn. Inline ES+EN stopword list, opt-out via remove_stopwords. Never raises: empty/degenerate input returns the zeroed result."
+tags: [eda, datascience, text, nlp, vocabulary, ttr, hapax, pure, python]
+uses_functions: []
+uses_types: []
+returns: []
+returns_optional: false
+error_type: ""
+imports: [re, collections]
+example: |
+  from datascience.compute_vocabulary_stats import compute_vocabulary_stats
+  result = compute_vocabulary_stats(["el gato y el perro", "gato veloz"], top_k=5)
+tested: true
+tests:
+  - "test_basico"
+  - "test_vacio"
+  - "test_stopwords_quitadas"
+  - "test_stopwords_conservadas"
+test_file_path: "python/functions/datascience/compute_vocabulary_stats_test.py"
+file_path: "python/functions/datascience/compute_vocabulary_stats.py"
+params:
+  - name: texts
+    desc: "List of documents (strings) forming the corpus. Entries that are None or not a str are silently discarded. Tokens are extracted per document with re.findall(r'\\w+', doc.lower(), re.UNICODE); purely numeric tokens (tok.isdigit()) are dropped."
+  - name: top_k
+    desc: "Maximum number of most-frequent terms to return in top_terms. Default 20. Does not affect n_tokens/n_types/ttr/hapax — only the length of the top_terms list."
+  - name: remove_stopwords
+    desc: "When True (default) common Spanish+English stopwords from the inline _STOPWORDS set (~155 entries) are removed from the token stream before any counting. Set False to keep every word (raw lexical profile)."
+output: "Dict with the exact keys n_tokens (int), n_types (int), ttr (float|None, n_types/n_tokens rounded to 4 dp), n_hapax (int, terms occurring exactly once), hapax_pct (float|None, n_hapax/n_types*100 rounded to 2 dp) and top_terms (list of {term, count, pct} sorted by count descending, pct = count/n_tokens*100 rounded to 2 dp). For an empty corpus (no tokens after filtering): n_tokens=0, n_types=0, ttr=None, n_hapax=0, hapax_pct=None, top_terms=[]. Any exception degrades to that same empty result — the function never throws."
+---
+
+## Ejemplo
+
+```python
+from datascience.compute_vocabulary_stats import compute_vocabulary_stats
+
+compute_vocabulary_stats(
+    ["el gato y el perro", "gato veloz corre", "perro perro perro"],
+    top_k=5,
+)
+# {
+#   "n_tokens": 8,        # stopwords (el, y) removed by default
+#   "n_types": 4,         # gato, perro, veloz, corre
+#   "ttr": 0.5,           # n_types / n_tokens
+#   "n_hapax": 2,         # veloz, corre (one occurrence each)
+#   "hapax_pct": 50.0,    # n_hapax / n_types * 100
+#   "top_terms": [
+#     {"term": "perro", "count": 4, "pct": 50.0},
+#     {"term": "gato",  "count": 2, "pct": 25.0},
+#     ...
+#   ],
+# }
+
+# Raw lexical profile (stopwords kept):
+compute_vocabulary_stats(["the cat and the dog"], remove_stopwords=False)
+```
+
+## Cuando usarla
+
+Use it when profiling a free-text column or corpus in an EDA of the `eda`
+group: when you need to measure lexical richness (how many tokens and how many
+distinct words, type-token ratio, percentage of words that occur only once)
+and see which terms dominate the vocabulary (top-k frequencies). Pass it the
+list of raw documents (the column's rows); `None` and non-string values are
+ignored automatically. It is the long-text counterpart of
+`summarize_categorical`, which profiles short categories.
+
+## Gotchas
+
+- Pure and stdlib-only, but the result depends on the **language**: the
+  `_STOPWORDS` list covers Spanish and English. For other languages set
+  `remove_stopwords=False` or filter externally; otherwise unrecognised
+  stopwords will end up in `top_terms`.
+- Tokenisation is `\w+` with `re.UNICODE`: it splits on punctuation and keeps
+  accents/ñ, but does NOT stem or lemmatise, so "gato" and "gatos" count as
+  distinct types. It does not strip accents either, so "más" (accented) and
+  "mas" are different tokens (both are in the stoplist).
+- **Purely numeric** tokens (`"123"`) are always dropped; a mixed
+  alphanumeric token (`"covid19"`) is kept.
+- `ttr` drops artificially on large corpora (more text, more repetition): do
+  not compare TTR across corpora of very different sizes without normalising.
+- Never raises: empty input, `None`, or any internal exception returns the
+  result with zeros/`None`/`[]`. Check `n_tokens == 0` to detect the
+  degenerate case.
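Review note: the TTR gotcha mentions normalising but does not say how. One common option is the mean segmental TTR (MSTTR), which averages the TTR of consecutive fixed-size windows so that corpus length stops dominating the figure. The sketch below is an assumption-level illustration: `mean_segmental_ttr` is a hypothetical helper, not part of this change, and the window size is arbitrary.

```python
def mean_segmental_ttr(tokens, window=50):
    """Hypothetical helper: average TTR over consecutive fixed-size windows."""
    if window < 1:
        return None
    # Non-overlapping full windows; a trailing partial window is discarded.
    segments = [
        tokens[i:i + window]
        for i in range(0, len(tokens) - window + 1, window)
    ]
    if not segments:
        return None  # fewer tokens than one window
    return sum(len(set(seg)) / window for seg in segments) / len(segments)


small = ["a", "b", "c", "d"]
large = small * 10  # same vocabulary, ten times more text

raw_small = len(set(small)) / len(small)         # 1.0
raw_large = len(set(large)) / len(large)         # 0.1
seg_small = mean_segmental_ttr(small, window=4)  # 1.0
seg_large = mean_segmental_ttr(large, window=4)  # 1.0
```

The raw TTR drops from 1.0 to 0.1 purely because the text is longer, while the windowed figure stays at 1.0 for both corpora.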
@@ -0,0 +1,99 @@
+"""Profile the vocabulary of a text corpus for EDA (pure, stdlib only).
+
+Tokenises a list of documents, counts term frequencies and derives lexical
+richness measures (type-token ratio, hapax legomena) plus the top-k terms.
+No external NLP dependencies (no nltk, no sklearn) — only ``re`` and
+``collections`` from the standard library.
+"""
+
+import re
+from collections import Counter
+
+# Common Spanish + English stopwords. Inline, lowercase, no accents stripped
+# beyond what already appears here. Filtering is on by default; opt out with
+# remove_stopwords=False.
+_STOPWORDS = {
+    # Spanish
+    "de", "la", "que", "el", "en", "y", "a", "los", "del", "se", "las", "por",
+    "un", "para", "con", "no", "una", "su", "al", "es", "lo", "como", "mas",
+    "más", "pero", "sus", "le", "ya", "o", "este", "si", "sí", "porque",
+    "esta", "entre", "cuando", "muy", "sin", "sobre", "tambien", "también",
+    "me", "hasta", "hay", "donde", "quien", "desde", "todo", "nos", "durante",
+    "todos", "uno", "les", "ni", "contra", "otros", "ese", "eso", "ante",
+    "ellos", "e", "esto", "antes", "algunos", "qué", "unos", "yo", "otro",
+    "otras", "otra", "él", "tanto", "esa", "estos", "mucho", "nada", "muchos",
+    # English
+    "the", "of", "and", "to", "in", "is", "it", "for", "on", "with", "as",
+    "was", "but", "are", "this", "that", "an", "be", "by", "or", "not", "at",
+    "from", "my", "i", "you", "he", "she", "we", "they", "his", "her", "its",
+    "our", "their", "what", "which", "who", "whom", "has", "have", "had", "do",
+    "does", "did", "will", "would", "can", "could", "should", "may", "might",
+    "must", "if", "then", "than", "so", "too", "very", "just", "also", "were",
+    "been", "being", "there", "here", "all", "any", "some", "more", "most",
+    "out", "up", "down", "into", "over", "such", "only", "own", "same",
+}
+
+
+def compute_vocabulary_stats(texts, top_k=20, remove_stopwords=True) -> dict:
+    """Profile the vocabulary of a corpus of documents.
+
+    Args:
+        texts: List of strings (the corpus). Entries that are None or not a
+            string are discarded silently.
+        top_k: Maximum number of most-frequent terms to include in
+            ``top_terms``. Default 20. Does not affect the other measures.
+        remove_stopwords: When True (default) common ES+EN stopwords are
+            dropped from the token stream before any counting.
+
+    Returns:
+        A dict with the exact keys ``n_tokens``, ``n_types``, ``ttr``,
+        ``n_hapax``, ``hapax_pct`` and ``top_terms``. For an empty corpus (no
+        tokens after filtering): n_tokens=0, n_types=0, ttr=None, n_hapax=0,
+        hapax_pct=None, top_terms=[]. Never raises — any exception degrades to
+        the empty-corpus result.
+    """
+    empty = {
+        "n_tokens": 0,
+        "n_types": 0,
+        "ttr": None,
+        "n_hapax": 0,
+        "hapax_pct": None,
+        "top_terms": [],
+    }
+    try:
+        tokens = []
+        for doc in texts or []:
+            if not isinstance(doc, str):
+                continue
+            for tok in re.findall(r"\w+", doc.lower(), re.UNICODE):
+                if tok.isdigit():
+                    continue
+                if remove_stopwords and tok in _STOPWORDS:
+                    continue
+                tokens.append(tok)
+
+        n_tokens = len(tokens)
+        if n_tokens == 0:
+            return dict(empty)
+
+        counts = Counter(tokens)
+        n_types = len(counts)
+        ttr = round(n_types / n_tokens, 4)
+
+        n_hapax = sum(1 for c in counts.values() if c == 1)
+        hapax_pct = round(n_hapax / n_types * 100, 2)
+
+        top_terms = [
+            {"term": term, "count": count, "pct": round(count / n_tokens * 100, 2)}
+            for term, count in counts.most_common(top_k)
+        ]
+
+        return {
+            "n_tokens": n_tokens,
+            "n_types": n_types,
+            "ttr": ttr,
+            "n_hapax": n_hapax,
+            "hapax_pct": hapax_pct,
+            "top_terms": top_terms,
+        }
+    except Exception:
+        return dict(empty)
@@ -0,0 +1,74 @@
+"""Tests for compute_vocabulary_stats."""
+
+import os
+import sys
+
+sys.path.insert(
+    0, os.path.join(os.path.dirname(__file__), "..", "..", "functions")
+)
+
+from datascience.compute_vocabulary_stats import compute_vocabulary_stats
+
+
+def test_basico():
+    # Corpus with repetitions and hapax. Stopwords disabled to control
+    # exactly which tokens are counted.
+    texts = ["gato gato perro", "perro perro raton", "elefante"]
+    r = compute_vocabulary_stats(texts, top_k=10, remove_stopwords=False)
+
+    # n_types < n_tokens when there are repetitions.
+    assert r["n_types"] < r["n_tokens"]
+    assert r["n_tokens"] == 7
+    assert r["n_types"] == 4  # gato, perro, raton, elefante
+
+    # ttr in (0, 1].
+    assert 0 < r["ttr"] <= 1
+    assert r["ttr"] == round(4 / 7, 4)
+
+    # top_terms sorted by descending count.
+    counts = [t["count"] for t in r["top_terms"]]
+    assert counts == sorted(counts, reverse=True)
+    assert r["top_terms"][0]["term"] == "perro"
+    assert r["top_terms"][0]["count"] == 3
+
+    # hapax: raton and elefante occur exactly once.
+    assert r["n_hapax"] == 2
+    assert r["hapax_pct"] == round(2 / 4 * 100, 2)
+
+    # pct consistent with count/n_tokens.
+    assert r["top_terms"][0]["pct"] == round(3 / 7 * 100, 2)
+
+
+def test_vacio():
+    # No valid documents -> zeros / None / [].
+    for arg in ([], None, [None, 123, ""], ["123 456"]):
+        r = compute_vocabulary_stats(arg)
+        assert r["n_tokens"] == 0
+        assert r["n_types"] == 0
+        assert r["ttr"] is None
+        assert r["n_hapax"] == 0
+        assert r["hapax_pct"] is None
+        assert r["top_terms"] == []
+
+
+def test_stopwords_quitadas():
+    texts = ["the gato the perro", "de la casa azul"]
+    r = compute_vocabulary_stats(texts, remove_stopwords=True)
+    terms = {t["term"] for t in r["top_terms"]}
+    # ES+EN stopwords must not appear.
+    assert "the" not in terms
+    assert "de" not in terms
+    assert "la" not in terms
+    # Content words must.
+    assert "gato" in terms
+    assert "casa" in terms
+
+
+def test_stopwords_conservadas():
+    texts = ["the gato the perro", "de la casa azul"]
+    r = compute_vocabulary_stats(texts, remove_stopwords=False)
+    terms = {t["term"] for t in r["top_terms"]}
+    # With the filter disabled, stopwords are kept.
+    assert "the" in terms
+    assert "de" in terms
+    assert "la" in terms
@@ -0,0 +1,80 @@
+---
+id: detect_corpus_language_py_datascience
+name: detect_corpus_language
+kind: function
+lang: py
+domain: datascience
+version: "1.0.0"
+purity: pure
+signature: "def detect_corpus_language(texts, top_k=10, sample_max=1000) -> dict"
+description: "Estimates the language distribution of a text corpus with the langdetect library (lazy import). Pure, defensive function of the eda group: it filters out None/non-str/empty documents, samples up to sample_max docs, classifies each one with detect() while ignoring those langdetect cannot resolve (LangDetectException), and returns the top_k distribution by frequency plus the dominant language. If langdetect is not installed or anything fails, it degrades to {available: False, ...} and NEVER raises (dict-no-throw). Fixed seed (DetectorFactory.seed=0) for deterministic detection."
+tags: [eda, datascience, text, nlp, language-detection, langdetect, pure, python]
+params:
+  - name: texts
+    desc: "List of strings (documents). Elements that are None, non-str or empty after strip are discarded before classification."
+  - name: top_k
+    desc: "Maximum number of languages to return in distribution, sorted by descending count (ties broken by ascending ISO code). Default 10."
+  - name: sample_max
+    desc: "Maximum number of documents to classify (the first ones in the corpus are taken) to bound the cost. Default 1000."
+output: >
+  Dict with a fixed shape (dict-no-throw, never raises):
+  {"available": bool, "n_detected": int,
+   "distribution": [{"lang": str, "count": int, "pct": float}, ...],
+   "dominant": str|None}.
+  available=True if langdetect is importable; lang values are ISO 639-1 codes ("es","en","fr",...);
+  pct = count/n_detected*100 rounded to 2 decimals; n_detected = documents classified successfully;
+  dominant = most frequent language (None if there were no detections). Empty corpus with langdetect
+  present -> available True, n_detected 0, distribution [], dominant None. Without langdetect (or on a
+  global failure) -> available False and the remaining fields at their empty value.
+uses_functions: []
+uses_types: []
+returns: []
+returns_optional: false
+error_type: ""
+imports: [langdetect]
+tested: true
+tests: ["test_mixto_es_en", "test_vacio", "test_degradacion"]
+test_file_path: "python/functions/datascience/detect_corpus_language_test.py"
+file_path: "python/functions/datascience/detect_corpus_language.py"
+---
+
+## Ejemplo
+
+```python
+import sys, os
+sys.path.insert(0, os.path.join("python", "functions"))
+from datascience.detect_corpus_language import detect_corpus_language
+
+corpus = [
+    "este es un texto bastante largo en español para detectar el idioma correctamente",
+    "la inteligencia artificial transforma la manera en que trabajamos cada dia",
+    "this is a fairly long english text to detect the language correctly without issues",
+]
+out = detect_corpus_language(corpus)
+# {"available": True, "n_detected": 3,
+#  "distribution": [{"lang": "es", "count": 2, "pct": 66.67},
+#                   {"lang": "en", "count": 1, "pct": 33.33}],
+#  "dominant": "es"}
+```
+
+## Cuando usarla
+
+Use it when profiling a text column or corpus in an EDA and you need to know
+which language(s) it is written in before choosing tokenisers, stopwords, NLP
+models or stemmers. It is also useful as a quality check: spotting mixed
+corpora or an unexpected language. Call it with the list of raw texts; the
+function cleans, samples and summarises on its own.
+
+## Gotchas
+
+- `langdetect` is **optional**: if it is not installed, the function does not raise;
+  it returns `{"available": False, "n_detected": 0, "distribution": [], "dominant": None}`.
+  Check `out["available"]` before using the distribution.
+- **Short texts** (a few words, or no linguistic features) may go undetected:
+  langdetect raises `LangDetectException`, which is swallowed, and the doc does not
+  count towards `n_detected`. Pass reasonably long sentences for reliable results.
+- **Determinism**: `DetectorFactory.seed = 0` is set on every call so detection
+  is reproducible; without that seed langdetect can return slightly different
+  results between runs.
+- `distribution` is truncated to `top_k`; if the corpus has more languages than
+  `top_k`, the sum of the `count` values shown can be lower than `n_detected`
+  (but `dominant` always reflects the most frequent language among all classified docs).
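The last gotcha (truncation to `top_k`) is easy to check in isolation. Below is a minimal sketch of the ranking step only, assuming the per-document labels are already known; `summarize_languages` is a hypothetical helper written for this illustration, not part of the registry, and no language detection happens in it.

```python
from collections import Counter


def summarize_languages(labels, top_k=10):
    """Rank language labels and truncate the list to top_k (hypothetical helper).

    `labels` stands in for the per-document output of a language detector.
    """
    counts = Counter(labels)
    n_detected = sum(counts.values())
    # Descending count, ties broken alphabetically by language code.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    distribution = [
        {"lang": lang, "count": count, "pct": round(count / n_detected * 100, 2)}
        for lang, count in ordered[:top_k]
    ]
    # dominant is taken before truncation, so top_k never changes it.
    dominant = ordered[0][0] if ordered else None
    return {"n_detected": n_detected, "distribution": distribution, "dominant": dominant}


labels = ["es", "es", "es", "en", "en", "fr"]
summary = summarize_languages(labels, top_k=2)
shown = sum(item["count"] for item in summary["distribution"])
print(summary["dominant"], shown, summary["n_detected"])  # es 5 6
```

With three languages and `top_k=2`, the counts shown add up to 5 while `n_detected` is 6, so a consumer that needs the full total should read `n_detected` rather than summing `distribution`.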
@@ -0,0 +1,91 @@
+"""Detect the language distribution of a corpus of texts.
+
+Pure, defensive function: the computation is deterministic and local (no network I/O).
+The optional `langdetect` library is imported lazily inside the
+function; if it is not installed (or any step fails), the function degrades
+cleanly to `available=False` and NEVER raises.
+"""
+
+
+def detect_corpus_language(texts, top_k=10, sample_max=1000) -> dict:
+    """Estimate the language distribution of a corpus with `langdetect`.
+
+    Args:
+        texts: list of strings (documents). Elements that are None, non-str or
+            empty after strip are discarded.
+        top_k: maximum number of languages returned in `distribution`,
+            sorted by descending frequency.
+        sample_max: maximum number of documents to classify (the first ones
+            are taken) to bound the cost.
+
+    Returns:
+        dict with the fixed shape (dict-no-throw):
+        {
+            "available": bool,   # True if langdetect is importable
+            "n_detected": int,   # documents classified successfully
+            "distribution": [{"lang": str, "count": int, "pct": float}, ...],
+            "dominant": str | None,
+        }
+    """
+    degraded = {
+        "available": False,
+        "n_detected": 0,
+        "distribution": [],
+        "dominant": None,
+    }
+    try:
+        # Lazy import with degradation: if langdetect is not available,
+        # return the degraded dict without raising.
+        try:
+            from langdetect import detect, DetectorFactory
+
+            # Fixed seed -> deterministic detection across runs.
+            DetectorFactory.seed = 0
+        except Exception:
+            return dict(degraded)
+
+        # Normalize and filter the corpus.
+        docs = []
+        if texts:
+            for t in texts:
+                if isinstance(t, str):
+                    s = t.strip()
+                    if s:
+                        docs.append(s)
+
+        # Keep only the first `sample_max` documents.
+        if sample_max is not None and sample_max >= 0:
+            docs = docs[:sample_max]
+
+        # Count per language; langdetect raises LangDetectException on texts
+        # with no detectable features -> skip the doc and carry on.
+        counts: dict = {}
+        for doc in docs:
+            try:
+                lang = detect(doc)
+            except Exception:
+                continue
+            counts[lang] = counts.get(lang, 0) + 1
+
+        n_detected = sum(counts.values())
+
+        # Stable order: descending count, ties broken by language code.
+        ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
+
+        k = top_k if (top_k is not None and top_k >= 0) else len(ordered)
+        distribution = []
+        for lang, count in ordered[:k]:
+            pct = round(count / n_detected * 100, 2) if n_detected else 0.0
+            distribution.append({"lang": lang, "count": count, "pct": pct})
+
+        dominant = ordered[0][0] if ordered else None
+
+        return {
+            "available": True,
+            "n_detected": n_detected,
+            "distribution": distribution,
+            "dominant": dominant,
+        }
+    except Exception:
+        # Any global failure degrades to available False without raising.
+        return dict(degraded)
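The lazy-import-with-degradation pattern used above can be tried without `langdetect` installed. This is a minimal sketch under stated assumptions: `optional_feature` and the module name `package_that_is_not_installed_xyz` are hypothetical, chosen only to exercise the fallback branch.

```python
import importlib


def optional_feature(module_name):
    """Return a fixed-shape dict; never raise if the dependency is missing."""
    degraded = {"available": False, "version": None}
    try:
        # Import inside the function so a missing package cannot break module load.
        module = importlib.import_module(module_name)
    except Exception:
        # Return a copy so callers cannot mutate the shared template.
        return dict(degraded)
    return {"available": True, "version": getattr(module, "__version__", None)}


print(optional_feature("json")["available"])                  # True
print(optional_feature("package_that_is_not_installed_xyz"))  # degraded dict
```

Both branches return the same keys, which is what lets a caller branch on `available` alone.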
@@ -0,0 +1,58 @@
+"""Tests for detect_corpus_language."""
+
+import builtins
+import os
+import sys
+
+# Add python/functions to sys.path so the `datascience` package can be imported.
+sys.path.insert(0, os.path.join(os.path.dirname(__file__), ".."))
+
+from datascience.detect_corpus_language import detect_corpus_language
+
+_ES = [
+    "este es un texto bastante largo en español para detectar el idioma correctamente sin problemas",
+    "la inteligencia artificial transforma la manera en que trabajamos cada dia en muchos sectores",
+]
+_EN = [
+    "this is a fairly long english text to detect the language correctly without any length issues",
+    "machine learning models can classify documents into many different categories quite reliably",
+]
+
+
+def test_mixto_es_en():
+    """Golden: clear mixed ES+EN corpus -> available True, >=2 languages, consistent counts."""
+    out = detect_corpus_language(_ES + _EN)
+    assert out["available"] is True
+    assert out["dominant"] in {"es", "en"}
+    assert len(out["distribution"]) >= 2
+    total = sum(item["count"] for item in out["distribution"])
+    assert total == out["n_detected"]
+    assert out["n_detected"] == 4
+
+
+def test_vacio():
+    """Edge: empty list with langdetect present -> available True, no detections."""
+    out = detect_corpus_language([])
+    assert out["available"] is True
+    assert out["n_detected"] == 0
+    assert out["distribution"] == []
+    assert out["dominant"] is None
+
+
+def test_degradacion(monkeypatch):
+    """Error path: if langdetect is not importable -> degrades to available False without raising."""
+    import datascience.detect_corpus_language as m
+
+    real_import = builtins.__import__
+
+    def fake_import(name, *a, **k):
+        if name == "langdetect" or name.startswith("langdetect."):
+            raise ImportError("simulado")
+        return real_import(name, *a, **k)
+
+    monkeypatch.setattr(builtins, "__import__", fake_import)
+    out = m.detect_corpus_language(["hola mundo", "hello world"])
+    assert out["available"] is False
+    assert out["n_detected"] == 0
+    assert out["distribution"] == []
+    assert out["dominant"] is None
@@ -0,0 +1,102 @@
+---
+name: extract_text_sample
+kind: function
+lang: py
+domain: datascience
+version: "1.0.0"
+purity: impure
+signature: "def extract_text_sample(db_path: str, table: str, columns: list, backend: str = 'duckdb', sample: int = 2000) -> dict"
+description: "Samples text columns from a DuckDB/Postgres table with SQL push-down (LIMIT sample), WITHOUT pulling the whole table into RAM. Impure function of the `eda` capability group: it is used by the text/NLP chapters of AutomaticEDA that need raw text values (lengths, tokens, examples) over a bounded sample. Builds the read-only reader query_fn(sql)->dict the same way as build_eda_render_ctx (closure over duckdb_query_readonly / pg_query, imported lazily from infra). Wraps identifiers in double quotes and issues a single query SELECT \"c1\", \"c2\" FROM \"table\" LIMIT n. Per column, the list of strings only contains values that are NOT None and NOT empty: each non-null cell is converted with str(...) and dropped if the result is the empty string. dict-no-throw style of the eda group: NEVER raises; on any failure (query, conversion, unknown backend) returns {status:'error', error:str, columns:{}, n:0}. The key n reports the number of ROWS read by the query (before filtering None/empty)."
+tags: [eda, datascience, text, nlp, extraction, read-only, duckdb, postgres, python]
+uses_functions: [duckdb_query_readonly_py_infra, pg_query_py_infra]
+uses_types: []
+returns: []
+returns_optional: false
+error_type: "error_go_core"
+imports: []
+params:
+  - name: db_path
+    desc: "path to the DuckDB file, or PostgreSQL DSN if backend='postgres'. Injected into the query_fn closure. Not validated here: if the database does not exist or the DSN is invalid, the query returns an error status and the result is {status:'error', ...} (it does not raise)."
+  - name: table
+    desc: "table name. Wrapped in double quotes in the query (SELECT ... FROM \"table\")."
+  - name: columns
+    desc: "list of text column names to sample. Filtered down to the entries that are non-empty str; each name is wrapped in double quotes. If nothing is left after filtering -> {status:'ok', columns:{}, n:0} without touching the database."
+  - name: backend
+    desc: "'duckdb' (default) or 'postgres'. Selects the registry's read-only reader (duckdb_query_readonly / pg_query). Any other value -> {status:'error', error:'backend desconocido: <value>', columns:{}, n:0}."
+  - name: sample
+    desc: "maximum number of rows to sample (LIMIT clause). Default 2000. Bounds memory and time: on large tables you get the first chunk in physical order (no ORDER BY), not a uniform sample."
+output: "dict-no-throw dict (NEVER raises): {status:'ok'|'error', columns:{col_name:[str,...]}, n:int, error:str}. On success (status='ok') columns maps each requested column to the list of its text values that are NOT None and NOT empty (each cell converted with str(...)); n is the number of ROWS read by the query (before filtering None/empty). Empty columns argument -> {status:'ok', columns:{}, n:0}. On error (unknown backend, query with status!='ok', or any exception) -> {status:'error', error:str, columns:{}, n:0}; the error key only appears in this case."
+tested: true
+tests: ["test_extract_basic", "test_backend_desconocido", "test_columns_vacio", "test_sample_limit"]
+test_file_path: "python/functions/datascience/extract_text_sample_test.py"
+file_path: "python/functions/datascience/extract_text_sample.py"
+---
+
+## Example
+
+```python
+import sys, os
+sys.path.insert(0, os.path.join("python", "functions"))
+# Direct import of the submodule (no export in datascience/__init__.py required).
+from datascience.extract_text_sample import extract_text_sample
+
+# Sample up to 2000 rows of two text columns from a DuckDB table.
+res = extract_text_sample(
+    "data/reviews.duckdb", "reviews", ["title", "body"],
+    backend="duckdb", sample=2000,
+)
+# res == {
+#   "status": "ok",
+#   "columns": {
+#     "title": ["Gran producto", "No funciona", ...],   # only non-None, non-""
+#     "body":  ["Lo uso a diario...", ...],
+#   },
+#   "n": 2000,   # rows read by the query (if the table has at least 2000 rows)
+# }
+
+# Postgres: db_path is the DSN.
+res_pg = extract_text_sample(
+    "postgresql://user:pass@localhost:5433/trends", "comentarios", ["texto"],
+    backend="postgres", sample=500,
+)
+```
+
+## When to use it
+
+When you need RAW text values from one or more columns for NLP/text analysis
+(length distribution, token counts, representative examples,
+language detection) but do NOT want to load the whole table into memory. It is the
+text sampler of the `eda` group: a single call with `LIMIT` push-down
+returns lists of strings per column, free of None and empty values, ready to
+feed an AutomaticEDA text chapter or any tokenization routine.
+Use it together with `profile_table` / `build_eda_render_ctx` when the aggregated
+profile is not enough and the actual text is needed.
+
+## Gotchas
+
+- **Impure**: it reads from the database through `query_fn` (closure over
+  `duckdb_query_readonly` / `pg_query`). It opens no connections outside those
+  registry wrappers. dict-no-throw style of the `eda` group: it NEVER raises; on
+  any failure it returns `{status:'error', error:str, columns:{}, n:0}`.
+- **`error_type` in the frontmatter is `error_go_core` by registry convention**
+  (every impure function must declare it and the indexer requires it), but the code
+  does NOT raise that exception: it degrades to the error dict. Metadata, not behavior.
+- **Unknown backend**: with a `backend` other than `duckdb` or `postgres` it
+  returns `{status:'error', error:'backend desconocido: <value>', columns:{},
+  n:0}` without touching the database.
+- **The lists do NOT include None or empty strings**: each non-null cell goes
+  through `str(...)` and is dropped if the result is `""`. That is why `len(columns[col])`
+  can be lower than `n` (which counts the rows read). If you need per-row alignment
+  (one entry per row even when it is None), use `build_eda_render_ctx` (raw_numeric),
+  not this function.
+- **`LIMIT sample` without `ORDER BY`**: on large tables you get the first chunk
+  in the backend's physical order, not a uniform or reproducible sample. Raise
+  `sample` for more coverage, or pre-sort/shuffle the table if you need
+  representativeness.
+- **DuckDB sandboxed by default**: `duckdb_query_readonly` opens the connection with
+  `enable_external_access=False`, so the query can only read the database itself
+  (no `read_csv`/`httpfs`/`ATTACH` to external paths). Reading tables that already
+  exist in the DuckDB file works fine.
+- **Do not log the raw data**: the `columns` lists may contain sensitive text
+  (reviews, comments, PII). In traces use only counts (`n`,
+  `len(columns[col])`) and column names, not the full dict.
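Two of the gotchas above (lists shorter than `n`, and logging counts instead of raw text) can be illustrated without a database. The sketch below mimics the per-column filtering on in-memory rows; `split_text_columns` and `log_safe_summary` are hypothetical helpers for this illustration and the rows are made up.

```python
def split_text_columns(rows, cols):
    """Mimic the per-column filtering on in-memory rows (hypothetical helper)."""
    out = {c: [] for c in cols}
    for row in rows:
        for c in cols:
            value = row.get(c)
            if value is None:
                continue
            s = str(value)
            if s:  # drop empty strings, keep everything else as-is
                out[c].append(s)
    # n counts rows read, not values kept.
    return {"status": "ok", "columns": out, "n": len(rows)}


def log_safe_summary(res):
    """Counts and column names only, never the raw text (hypothetical helper)."""
    return {
        "status": res.get("status"),
        "n": res.get("n", 0),
        "kept": {c: len(v) for c, v in res.get("columns", {}).items()},
    }


rows = [
    {"title": "Great product", "body": "I use it daily"},
    {"title": None, "body": ""},
    {"title": "Broken", "body": None},
]
res = split_text_columns(rows, ["title", "body"])
print(log_safe_summary(res))
# {'status': 'ok', 'n': 3, 'kept': {'title': 2, 'body': 1}}
```

Three rows were read, but only two titles and one body survive the filter, so the lists cannot be zipped back into rows.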
@@ -0,0 +1,112 @@
+"""extract_text_sample: sample text columns from a table without loading it into RAM.
+
+Impure function (reads from the database) of the `eda` capability group. Given a
+``db_path`` + ``table`` (DuckDB or PostgreSQL) and a list of text ``columns``,
+it fetches a SAMPLE of those columns with SQL push-down (``LIMIT sample``), never the
+whole table. It is used by the text/NLP chapters of AutomaticEDA that need
+raw text values (lengths, tokens, examples) without materializing millions
+of rows in memory.
+
+The read-only reader ``query_fn(sql) -> dict`` is built the same way as in
+``build_eda_render_ctx`` / ``profile_table``: a closure over the registry
+wrapper (``duckdb_query_readonly`` / ``pg_query``), imported lazily
+inside the function to avoid import cycles when the ``datascience`` package
+``__init__`` is loaded. It never opens connections outside those wrappers.
+
+dict-no-throw style of the `eda` group: the function NEVER raises. It catches any
+exception (query, conversion) and returns ``{"status":"error", "error":str(e),
+"columns":{}, "n":0}``. If the underlying query returns ``status != "ok"``, that is
+propagated as an error carrying the wrapper's message.
+
+Per column, the list of strings only contains values that are NOT null and NOT empty:
+each non-None cell is converted with ``str(...)`` and dropped if the result is ``""``.
+The key ``n`` reports the number of ROWS read by the query (before filtering out
+None/empty values), which tells you how much was actually sampled.
+"""
+
+
+def extract_text_sample(db_path: str, table: str, columns: list, backend: str = "duckdb", sample: int = 2000) -> dict:
+    """Sample text columns from a DuckDB/Postgres table with SQL push-down.
+
+    Args:
+        db_path: path to the DuckDB file, or PostgreSQL DSN if backend="postgres".
+            Injected into the query_fn closure. Not validated here: if the database
+            does not exist or the DSN is invalid, the query returns an error status
+            and the result is {status:'error', ...} (it does not raise).
+        table: table name. Wrapped in double quotes in the query.
+        columns: list of text column names to sample. Filtered down to the
+            entries that are non-empty str; each name is wrapped in double
+            quotes. If nothing is left after filtering -> {status:'ok', columns:{}, n:0}.
+        backend: "duckdb" (default) or "postgres". Selects the registry's read-only
+            reader (duckdb_query_readonly / pg_query). Any other value
+            -> {status:'error', error:'backend desconocido: ...', columns:{}, n:0}.
+        sample: maximum number of rows to sample (LIMIT clause). Default 2000. Bounds
+            memory and time: on large tables you get the first chunk in
+            physical order, not a uniform sample.
+
+    Returns:
+        dict (dict-no-throw, NEVER raises):
+        {"status": "ok"|"error",
+         "columns": {col_name: [str, str, ...], ...},  # only non-None, non-""
+         "n": int,        # number of rows read by the query (before filtering)
+         "error": str}    # only present if status == "error"
+    """
+    try:
+        # 1) Read-only reader for the active backend, built as in
+        # build_eda_render_ctx (closure over the registry wrapper). Lazy
+        # imports: this module lives in the `datascience` package, and importing
+        # `infra` at module level would create a cycle when __init__ loads.
+        if backend == "duckdb":
+            from infra import duckdb_query_readonly
+
+            def query_fn(sql):
+                return duckdb_query_readonly(db_path, sql)
+
+        elif backend == "postgres":
+            from infra import pg_query
+
+            def query_fn(sql):
+                return pg_query(db_path, sql)
+
+        else:
+            return {
+                "status": "error",
+                "error": f"backend desconocido: {backend}",
+                "columns": {},
+                "n": 0,
+            }
+
+        # 2) Valid columns (non-empty str). If none is left there is nothing to
+        # sample: ok with empty columns.
+        cols = []
+        if isinstance(columns, (list, tuple)):
+            cols = [c for c in columns if isinstance(c, str) and c != ""]
+        if not cols:
+            return {"status": "ok", "columns": {}, "n": 0}
+
+        # 3) Push-down: one query with LIMIT. Identifiers are double-quoted as in
+        # build_eda_render_ctx; embedded '"' are doubled so names cannot break out.
+        cols_sql = ", ".join('"' + c.replace('"', '""') + '"' for c in cols)
+        sql = 'SELECT {} FROM "{}" LIMIT {}'.format(cols_sql, str(table).replace('"', '""'), int(sample))
+        q = query_fn(sql)
+        if not isinstance(q, dict) or q.get("status") != "ok":
+            err = q.get("error") if isinstance(q, dict) else "query sin resultado"
+            return {"status": "error", "error": str(err), "columns": {}, "n": 0}
+
+        rows = q.get("rows") or []
+        out = {c: [] for c in cols}
+        for row in rows:
+            if not isinstance(row, dict):
+                continue
+            for c in cols:
+                value = row.get(c)
+                if value is None:
+                    continue
+                s = str(value)
+                if s == "":
+                    continue
+                out[c].append(s)
+
+        return {"status": "ok", "columns": out, "n": len(rows)}
+    except Exception as exc:  # noqa: BLE001 - dict-no-throw of the eda group
+        return {"status": "error", "error": str(exc), "columns": {}, "n": 0}
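The query builder above wraps identifiers in double quotes. As a minimal sketch of standard SQL identifier quoting, where an embedded double quote is escaped by doubling it, the following builds the same kind of `SELECT ... LIMIT` statement. `quote_ident` and `build_sample_query` are hypothetical names for illustration, not registry functions.

```python
def quote_ident(name):
    """Quote a SQL identifier, doubling embedded double quotes (hypothetical helper)."""
    if not isinstance(name, str) or name == "":
        raise ValueError("identifier must be a non-empty string")
    return '"' + name.replace('"', '""') + '"'


def build_sample_query(table, columns, sample):
    """Build the push-down query text; int() rejects non-numeric limits."""
    cols_sql = ", ".join(quote_ident(c) for c in columns)
    return f"SELECT {cols_sql} FROM {quote_ident(table)} LIMIT {int(sample)}"


print(build_sample_query("reviews", ["title", 'we"ird'], 500))
# SELECT "title", "we""ird" FROM "reviews" LIMIT 500
```

Doubling the quote keeps a name such as `we"ird` inside a single quoted identifier instead of ending the identifier early.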
@@ -0,0 +1,83 @@
+"""Tests for extract_text_sample.
+
+Self-contained: creates a small temporary DuckDB with one text column (some
+rows NULL) and one numeric column, and checks that the text sample only holds
+the non-null values, that an unknown backend and an empty column list are
+handled dict-no-throw, and that sample bounds the number of rows read.
+"""
+
+import os
+import sys
+
+_HERE = os.path.dirname(os.path.abspath(__file__))
+_FUNCTIONS = os.path.abspath(os.path.join(_HERE, ".."))  # python/functions
+if _FUNCTIONS not in sys.path:
+    sys.path.insert(0, _FUNCTIONS)
+
+import duckdb  # noqa: E402
+
+from datascience.extract_text_sample import extract_text_sample  # noqa: E402
+
+_TABLE = "t"
+# 6 rows: txt VARCHAR with two NULLs, other INT always present.
+_ROWS = [
+    ("alpha", 1),
+    ("beta", 2),
+    (None, 3),
+    ("gamma", 4),
+    (None, 5),
+    ("delta", 6),
+]
+_TXT_NON_NULL = {"alpha", "beta", "gamma", "delta"}
+
+
+def _make_db(tmp_path):
+    """Create a temporary DuckDB with the test table and return its path."""
+    db_path = os.path.join(str(tmp_path), "text_sample.duckdb")
+    con = duckdb.connect(db_path)
+    try:
+        con.execute(f'CREATE TABLE "{_TABLE}" (txt VARCHAR, other INTEGER)')
+        con.executemany(f'INSERT INTO "{_TABLE}" VALUES (?, ?)', _ROWS)
+    finally:
+        con.close()
+    return db_path
+
+
+def test_extract_basic(tmp_path):
+    db_path = _make_db(tmp_path)
+    res = extract_text_sample(db_path, _TABLE, ["txt"])
+    assert res["status"] == "ok"
+    # n = rows read by the query (6), before filtering None.
+    assert res["n"] == len(_ROWS)
+    # columns["txt"] only holds the non-null strings (the two NULLs are dropped).
+    assert "txt" in res["columns"]
+    assert set(res["columns"]["txt"]) == _TXT_NON_NULL
+    assert len(res["columns"]["txt"]) == len(_TXT_NON_NULL)
+    # "other" was not requested, so it must not appear.
+    assert "other" not in res["columns"]
+
+
+def test_backend_desconocido(tmp_path):
+    db_path = _make_db(tmp_path)
+    res = extract_text_sample(db_path, _TABLE, ["txt"], backend="mysql")
+    assert res["status"] == "error"
+    assert "backend desconocido" in res["error"]
+    assert res["columns"] == {}
+    assert res["n"] == 0
+
+
+def test_columns_vacio(tmp_path):
+    db_path = _make_db(tmp_path)
+    res = extract_text_sample(db_path, _TABLE, [])
+    assert res["status"] == "ok"
+    assert res["columns"] == {}
+    assert res["n"] == 0
+
+
+def test_sample_limit(tmp_path):
+    db_path = _make_db(tmp_path)
+    res = extract_text_sample(db_path, _TABLE, ["txt"], sample=2)
+    assert res["status"] == "ok"
+    # sample=2 -> the query reads at most 2 rows.
+    assert res["n"] == 2
+    assert len(res["columns"]["txt"]) <= 2
@@ -0,0 +1,122 @@
+---
+id: relationship_scatter_figure_py_datascience
+name: relationship_scatter_figure
+kind: function
+lang: py
+domain: datascience
+version: "1.0.0"
+purity: impure
+signature: "def relationship_scatter_figure(xs: list, ys: list, x_label: str = \"\", y_label: str = \"\", classification: dict = None, max_points: int = 2000) -> \"matplotlib.figure.Figure\""
+description: "Builds a matplotlib scatter figure for a pair of numeric variables with its fitted curve/line and an annotation of the relationship type (linear, polynomial of degree 2/3, monotonic non-linear, etc.) plus its metrics (r, ρ, R²lin, R²poly). Consumes the dict from classify_relationship_type; if it is None, computes it internally by reusing that function. Returns a matplotlib.figure.Figure ready to be rasterized by the EDA report renderer (PDF/PPTX). Agg backend without global pyplot; deterministic downsample of the drawn points; defensive against empty/None input."
+tags: [eda, correlation, scatter, relationship, matplotlib, figure, visualization, datascience, impure]
+uses_functions: [classify_relationship_type_py_datascience]
+uses_types: []
+returns: []
+returns_optional: false
+error_type: "error_go_core"
+imports: [matplotlib, numpy]
+example: |
+  from relationship_scatter_figure import relationship_scatter_figure
+  xs = [float(i) for i in range(100)]
+  ys = [0.5 * x * x - x + 3 for x in xs]
+  classification = {
+      "tipo": "polinómica (grado 2)", "pearson": 0.97, "spearman": 0.99,
+      "r2_linear": 0.92, "r2_poly2": 0.999, "r2_poly3": 0.999,
+      "best_degree": 2, "coeffs": [0.5, -1.0, 3.0],
+  }
+  fig = relationship_scatter_figure(xs, ys, x_label="dosis", y_label="efecto", classification=classification)
+tested: true
+tests:
+  - "test_returns_figure"
+  - "test_downsample_determinista"
+  - "test_empty_no_lanza"
+  - "test_classification_none"
+test_file_path: "python/functions/datascience/relationship_scatter_figure_test.py"
+file_path: "python/functions/datascience/relationship_scatter_figure.py"
+params:
+  - name: xs
+    desc: "List (or tuple) of x values. Paired by index with ys. A None, bool, NaN or inf value drops that pair (defensive read)."
+  - name: ys
+    desc: "List (or tuple) of y values, parallel to xs. Same defensive rules as xs."
+  - name: x_label
+    desc: "Axis/title label for the x variable. Default \"\" (falls back to \"x\" in the title)."
+  - name: y_label
+    desc: "Axis/title label for the y variable. Default \"\" (falls back to \"y\" in the title)."
+  - name: classification
+    desc: "Optional. Dict from classify_relationship_type with keys tipo, pearson, r2_linear, spearman, r2_poly2, r2_poly3, best_degree, coeffs. If None it is computed internally by importing and calling classify_relationship_type on the clean pairs (self-contained). If the sibling module is not available, the scatter is drawn without fit curve or annotation. Default None."
+  - name: max_points
+    desc: "Cap on the number of DRAWN points. If the clean pairs exceed the cap, the cloud is subsampled with a fixed step = ceil(n/max_points), taking pairs[::step]: DETERMINISTIC, not random, reproducible. Classification/fit ALWAYS uses every clean pair; the downsample only thins the drawing. A non-positive or non-int value disables the downsample. Default 2000."
+output: "A matplotlib.figure.Figure (figsize 6.4x4.0, dpi 150) with a scatter Axes (semi-transparent points, alpha 0.5, color #4C72B0), the fitted curve/line (numpy.polyval over coeffs, color #C44E52) when a polynomial fit is available, title \"{x_label} ↔ {y_label}\", axis labels and an annotation box in the top-left corner with the relationship type and the available metrics (r, ρ, R²lin, R²poly; None ones are omitted). If fewer than 2 valid pairs remain after cleaning, it still returns a Figure with the centered text \"Sin datos suficientes para el scatter\" (never raises). The caller rasterizes and releases the figure; the function neither shows nor saves it."
+---
+
+## Example
+
+```python
+from relationship_scatter_figure import relationship_scatter_figure
+
+# Numeric pair with a quadratic relationship and its classification (from
+# classify_relationship_type). Passing it explicitly avoids recomputing it.
+xs = [float(i) for i in range(100)]
+ys = [0.5 * x * x - x + 3 for x in xs]
+classification = {
+    "tipo": "polinómica (grado 2)",
+    "pearson": 0.97,
+    "spearman": 0.99,
+    "r2_linear": 0.92,
+    "r2_poly2": 0.999,
+    "r2_poly3": 0.999,
+    "best_degree": 2,
+    "coeffs": [0.5, -1.0, 3.0],
+}
+
+fig = relationship_scatter_figure(
+    xs, ys, x_label="dosis", y_label="efecto", classification=classification
+)
+
+# The report renderer rasterizes it; here we only persist it for inspection.
+fig.savefig("/tmp/scatter_dosis_efecto.png")
+
+# With classification=None the function computes it internally (self-contained):
+fig2 = relationship_scatter_figure(xs, ys, x_label="dosis", y_label="efecto")
+```
+
+## When to use it
+
+Use it inside the automatic EDA report when you want to see at a glance
+the relationship between two numeric variables: the point cloud, the curve that
+fits it best and a readable label of the relationship type with its metrics. It is the
+"human view" counterpart of `classify_relationship_type`: that function decides the
+type and the coefficients; this one draws them on a `Figure` that the report
+renderer rasterizes to PDF/PPTX. Pass the classification dict if you already have it
+(you avoid recomputing the fit); otherwise leave it as `None` and the function
+works it out on the clean pairs. Designed for phones: small annotation
+(fontsize 8) and a cloud thinned by `max_points` so the PDF stays light.
+
+## Gotchas
+
+- **Impure because of matplotlib.** It touches the rendering machinery. It uses the `Agg`
+  backend and the object-oriented `Figure`/`add_subplot` API, NEVER `pyplot.*` here,
+  so it touches no global state and leaks no figures between calls. `pyplot` is NOT
+  thread-safe; this function avoids it by building the `Figure` directly,
+  so it is safe to call in a loop from the renderer.
+- **The caller releases the figure.** It returns the `Figure` but neither shows nor
+  saves it. The consumer must rasterize it and then drop its reference (optionally
+  after `fig.clear()`) so memory does not pile up on large batches of column pairs.
+  The figure is never registered with `pyplot`, so `pyplot.close(fig)` has nothing to close.
+- **Deterministic downsample, drawing only.** When the clean pairs exceed
+  `max_points`, the DRAWN cloud is thinned by a fixed step `pairs[::step]`
+  (reproducible, not random). Classification and fit ALWAYS use every clean
+  pair; the downsample changes neither the metrics nor the curve.
+- **`classification=None` ⇒ it is computed internally.** It imports and calls
+  `classify_relationship_type` on the clean pairs. If that sibling module is not
+  available (incomplete environment), it does NOT raise: it draws the scatter without
+  fit curve or annotation. Passing the classification explicitly is cheaper (the
+  fit is not recomputed).
+- **No curve for `monótona no-lineal`.** When `coeffs` is `None` or
+  `best_degree` is `None` (e.g. type "monótona no-lineal"), no polynomial line
+  is drawn, only the cloud and the annotation. The curve is also skipped when the
+  x range is zero (all x equal). This never causes a failure.
+- **Defensive, never raises.** `xs=[]`, `ys=[]`, fewer than 2 valid pairs, values that are
+  `None`/`bool`/`NaN`/`inf`, or malformed `coeffs` are handled without error: in the
+  worst case it returns a `Figure` reading "Sin datos suficientes para el scatter".
+  Do not wrap the call in try/except for fear of a raise; there is none.
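The fixed-step thinning described in the gotchas can be sketched with the standard library alone. This is a minimal illustration of `step = ceil(n / max_points)` followed by `pairs[::step]`; `downsample_pairs` is a hypothetical name and the real function may organize the step differently.

```python
import math


def downsample_pairs(pairs, max_points=2000):
    """Thin a list of (x, y) pairs by a fixed step; deterministic, not random."""
    # A non-int or non-positive cap disables the downsample.
    if not isinstance(max_points, int) or max_points <= 0:
        return list(pairs)
    n = len(pairs)
    if n <= max_points:
        return list(pairs)
    step = math.ceil(n / max_points)
    return list(pairs[::step])


pairs = [(float(i), float(i) ** 2) for i in range(5000)]
drawn = downsample_pairs(pairs, max_points=2000)
print(len(drawn), drawn[:2])  # 1667 [(0.0, 0.0), (3.0, 9.0)]
```

With 5000 pairs and a cap of 2000 the step is 3, so 1667 points are drawn: the result stays under the cap but does not necessarily reach it, and repeated calls return the same points.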
@@ -0,0 +1,322 @@
+"""Impure EDA helper: scatter figure of a numeric pair with its fit (`eda` group).
+
+Builds a matplotlib scatter of two numeric variables, overlays the fitted
+curve/line implied by the relationship classification (linear, polynomial of
+degree 2/3, etc.) and annotates the relationship type with its available
+metrics. Returns a ready-to-rasterize ``matplotlib.figure.Figure``; it never
+shows nor saves it.
+
+Impure because it touches matplotlib's rendering machinery. It uses the headless
+Agg backend and the object-oriented ``Figure`` API (no ``pyplot``) so it leaks no
+global state and is safe to call repeatedly from a report renderer.
+
+To keep the rendered PDF/PPTX light on phones, when the number of valid pairs
+exceeds ``max_points`` the *plotted* points are down-sampled DETERMINISTICALLY by
+a fixed step (``pairs[::step]``), never randomly, so the output is reproducible.
+The classification/fit always uses every clean pair; the down-sample only thins
+the drawn cloud.
+"""
+
+import math
+
+import matplotlib
+
+matplotlib.use("Agg")
+
+import numpy as np  # noqa: E402
+from matplotlib.figure import Figure  # noqa: E402
+
+# Sober blue for the scatter cloud and red for the fitted curve (Tufte: the
+# data points are the primary ink, the fit is the secondary highlight).
+_POINT_COLOR = "#4C72B0"
+_FIT_COLOR = "#C44E52"
+# Muted gray for the no-data fallback message.
+_MUTED_TEXT = "#5f6b7a"
+
+
+def _finite(value):
+    """Coerce ``value`` to a finite float, or return None when not usable.
+
+    bool is a subclass of int, but a real numeric measurement is never a bool,
+    so True/False are treated as missing instead of coercing to 1.0/0.0. NaN and
+    +/-infinity are never valid either.
+    """
+    if value is None or isinstance(value, bool):
+        return None
+    try:
+        f = float(value)
+    except (TypeError, ValueError):
+        return None
+    if math.isnan(f) or math.isinf(f):
+        return None
+    return f
+
+
+def _clean_pairs(xs, ys):
+    """Pair ``xs[i], ys[i]`` by index, dropping any pair with a non-finite end."""
+    pairs = []
+    if isinstance(xs, (list, tuple)) and isinstance(ys, (list, tuple)):
+        n = min(len(xs), len(ys))
+        for i in range(n):
+            x = _finite(xs[i])
+            y = _finite(ys[i])
+            if x is None or y is None:
+                continue
+            pairs.append((x, y))
+    return pairs
+
+
+def _ordered_trend(xs_clean, ys_clean, n_bins: int = 12):
+    """Return (x_trend, y_trend): the ordered trend of y over x for a monotonic
+    relationship that has no polynomial fit.
+
+    When x has few distinct values (an ordinal/discrete scale) the trend is the
+    mean of y per distinct x value. Otherwise x is split into ``n_bins`` ordered
+    quantile bins and each point is (mean x, mean y) of the bin. Returns
+    ``(None, None)`` when there is nothing meaningful to draw.
+    """
+    x_arr = np.asarray(xs_clean, dtype=float)
+    y_arr = np.asarray(ys_clean, dtype=float)
+    if x_arr.size < 2:
+        return None, None
+    uniq = np.unique(x_arr)
+    if uniq.size <= max(2, n_bins):
+        # Discrete x: one trend point per distinct value (mean y).
+        xt = uniq
+        yt = np.array([float(np.mean(y_arr[x_arr == ux])) for ux in uniq])
+        return xt, yt
+    # Continuous x: ordered quantile bins, (mean x, mean y) per bin.
+    order = np.argsort(x_arr, kind="stable")
+    x_sorted = x_arr[order]
+    y_sorted = y_arr[order]
+    chunks_x = np.array_split(x_sorted, n_bins)
+    chunks_y = np.array_split(y_sorted, n_bins)
+    xt = np.array([float(np.mean(cx)) for cx in chunks_x if cx.size])
+    yt = np.array([float(np.mean(cy)) for cy in chunks_y if cy.size])
+    return xt, yt
+
+
+def _no_data_figure(message: str) -> "matplotlib.figure.Figure":
+    """A bare Figure carrying a centered muted message (defensive fallback)."""
+    fig = Figure(figsize=(6.4, 4.0), dpi=150)
+    ax = fig.add_subplot(111)
+    ax.axis("off")
+    ax.text(
+        0.5,
+        0.5,
+        message,
+        ha="center",
+        va="center",
+        fontsize=12,
+        color=_MUTED_TEXT,
+        transform=ax.transAxes,
+    )
+    fig.tight_layout()
+    return fig
+
+
+def _metrics_caption(classification: dict) -> str:
+    """Format the available metrics of a classification dict into one line.
+
+    Omits the metrics that are None. Keys consumed (any may be absent/None):
+    ``pearson`` (r), ``spearman`` (rho), ``r2_linear`` (R²lin) and one polynomial
+    R² (that of ``best_degree`` 2 or 3, else ``r2_poly3`` then ``r2_poly2``).
+    """
+    parts = []
+    r = _finite(classification.get("pearson"))
+    if r is not None:
+        parts.append(f"r={r:.2f}")
+    rho = _finite(classification.get("spearman"))
+    if rho is not None:
+        parts.append(f"ρ={rho:.2f}")
+    r2_lin = _finite(classification.get("r2_linear"))
+    if r2_lin is not None:
+        parts.append(f"R²lin={r2_lin:.2f}")
+    # Prefer the R² of the best polynomial degree when it is a poly fit.
+    best_degree = classification.get("best_degree")
+    r2_poly = None
+    if best_degree == 3:
+        r2_poly = _finite(classification.get("r2_poly3"))
+    elif best_degree == 2:
+        r2_poly = _finite(classification.get("r2_poly2"))
+    if r2_poly is None:
+        # Fall back to whichever poly R² is present (cubic first).
+        r2_poly = _finite(classification.get("r2_poly3"))
+        if r2_poly is None:
+            r2_poly = _finite(classification.get("r2_poly2"))
+    if r2_poly is not None:
+        parts.append(f"R²poly={r2_poly:.2f}")
+    return "  ".join(parts)
+
+
+def relationship_scatter_figure(
+    xs: list,
+    ys: list,
+    x_label: str = "",
+    y_label: str = "",
+    classification: dict = None,
+    max_points: int = 2000,
+) -> "matplotlib.figure.Figure":
+    """Build a scatter figure of a numeric pair with its fit and a type label.
+
+    Cleans the pairs defensively (drops any pair with a None/bool/NaN/inf end),
+    plots a semi-transparent scatter cloud (down-sampled deterministically when
+    it exceeds ``max_points``), overlays the polynomial fit implied by
+    ``classification`` and annotates the relationship type plus its available
+    metrics in a corner box.
+
+    The fit and classification always use every clean pair; only the drawn cloud
+    is thinned by the down-sample. When ``classification`` is None it is computed
+    internally by reusing ``classify_relationship_type`` over the clean pairs, so
+    the function is self-contained.
+
+    The function is fully defensive: empty input, fewer than 2 clean pairs, a
+    missing/None ``coeffs`` or a missing sibling classifier never raise. When
+    there is nothing valid to draw it still returns a ``Figure`` carrying a
+    centered "Sin datos suficientes para el scatter" message.
+
+    Args:
+        xs: List (or tuple) of x values. Paired by index with ``ys``. Values that
+            are None, bool, NaN or infinite discard that pair. Read defensively.
+        ys: List (or tuple) of y values, parallel to ``xs``. Same defensive rules.
+        x_label: Axis/title label for the x variable. Default "" (falls back to
+            "x" in the title).
+        y_label: Axis/title label for the y variable. Default "" (falls back to
+            "y" in the title).
+        classification: Optional dict from ``classify_relationship_type`` with
+            keys ``tipo, pearson, r2_linear, spearman, r2_poly2, r2_poly3,
+            best_degree, coeffs``. When None, it is computed internally by
+            importing and calling ``classify_relationship_type`` over the clean
+            pairs. When that sibling module is unavailable, the scatter is still
+            drawn (no fit curve, no annotation).
+        max_points: Cap on the number of *plotted* points. When the number of
+            clean pairs exceeds this cap, the drawn cloud is down-sampled by a
+            fixed step ``ceil(n/max_points)`` taking ``pairs[::step]`` —
+            DETERMINISTIC, not random, so the figure is reproducible. A
+            non-positive or non-int value disables down-sampling. Default 2000.
+
+    Returns:
+        A ``matplotlib.figure.Figure`` (figsize 6.4x4.0, dpi 150) with a single
+        scatter Axes, the fitted curve (when a polynomial fit is available) and a
+        corner annotation with the relationship type and metrics. When there are
+        fewer than 2 clean pairs it returns a Figure with a centered "Sin datos
+        suficientes para el scatter" message. The caller rasterizes/closes it.
+    """
+    pairs = _clean_pairs(xs, ys)
+    if len(pairs) < 2:
+        return _no_data_figure("Sin datos suficientes para el scatter")
+
+    # Full clean coordinates feed the classification/fit; the plotted cloud is
+    # what gets thinned.
+    xs_clean = [p[0] for p in pairs]
+    ys_clean = [p[1] for p in pairs]
+
+    # Resolve the classification. If not provided, reuse the sibling classifier
+    # over ALL clean pairs (self-contained). Missing module => no fit/annotation.
+    cls = classification
+    if cls is None:
+        try:
+            from classify_relationship_type import classify_relationship_type
+
+            cls = classify_relationship_type(xs_clean, ys_clean)
+        except Exception:
+            cls = None
+    if not isinstance(cls, dict):
+        cls = {}
+
+    # --- Deterministic down-sampling of the DRAWN points only.
+    n_total = len(pairs)
+    if (
+        isinstance(max_points, int)
+        and not isinstance(max_points, bool)
+        and max_points > 0
+        and n_total > max_points
+    ):
+        step = math.ceil(n_total / max_points)
+        sampled = pairs[::step]
+    else:
+        sampled = pairs
+
+    x_plot = [p[0] for p in sampled]
+    y_plot = [p[1] for p in sampled]
+
+    fig = Figure(figsize=(6.4, 4.0), dpi=150)
+    ax = fig.add_subplot(111)
+
+    ax.scatter(
+        x_plot,
+        y_plot,
+        s=12,
+        alpha=0.5,
+        color=_POINT_COLOR,
+        edgecolors="none",
+        rasterized=True,
+    )
+
+    # --- Fitted curve/line over the full clean x range.
+    coeffs = cls.get("coeffs")
+    best_degree = cls.get("best_degree")
+    tipo = cls.get("tipo")
+    x_min, x_max = min(xs_clean), max(xs_clean)
+    drew_fit = False
+    if coeffs is not None and best_degree is not None and x_max > x_min:
+        try:
+            coeff_arr = np.asarray(coeffs, dtype=float)
+            if coeff_arr.ndim == 1 and coeff_arr.size > 0 and np.all(np.isfinite(coeff_arr)):
+                x_line = np.linspace(x_min, x_max, 200)
+                y_line = np.polyval(coeff_arr, x_line)
+                if np.all(np.isfinite(y_line)):
+                    ax.plot(x_line, y_line, color=_FIT_COLOR, linewidth=2)
+                    drew_fit = True
+        except Exception:
+            # Never fail the figure because of a malformed coeffs array.
+            pass
+
+    # A monotonic non-linear relationship has no fitted polynomial (coeffs is
+    # None by design — a low-degree polynomial would mislead). Draw instead the
+    # ordered trend of y over x so the reader still sees the shape: y averaged
+    # within ordered x-bins (or per distinct x value when x is discrete with few
+    # levels, e.g. an ordinal scale). Defensive: any failure leaves the cloud.
+    if (not drew_fit and isinstance(tipo, str) and "monóton" in tipo.lower()
+            and x_max > x_min):
+        try:
+            xt, yt = _ordered_trend(xs_clean, ys_clean)
+            if xt is not None and len(xt) >= 2:
+                ax.plot(xt, yt, color=_FIT_COLOR, linewidth=2, marker="o",
+                        markersize=3)
+        except Exception:
+            pass
+
+    # --- Labels and title.
+    tx = x_label if x_label else "x"
+    ty = y_label if y_label else "y"
+    ax.set_title(f"{tx} ↔ {ty}", fontsize=12, loc="left", pad=8)
+    ax.set_xlabel(x_label)
+    ax.set_ylabel(y_label)
+
+    # --- Corner annotation: relationship type + available metrics.
+    caption_lines = []
+    if tipo:
+        caption_lines.append(str(tipo))
+    metrics_line = _metrics_caption(cls)
+    if metrics_line:
+        caption_lines.append(metrics_line)
+    if caption_lines:
+        ax.text(
+            0.03,
+            0.97,
+            "\n".join(caption_lines),
+            transform=ax.transAxes,
+            ha="left",
+            va="top",
+            fontsize=8,
+            bbox=dict(
+                boxstyle="round,pad=0.35",
+                facecolor="white",
+                edgecolor="#cccccc",
+                alpha=0.85,
+            ),
+        )
+
+    fig.tight_layout()
+    return fig
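Note on the down-sampling step in the file above: a minimal standalone sketch of the fixed-stride thinning, assuming the same rule as the diff (`step = ceil(n / max_points)`, then `pairs[::step]`). The helper name `downsample_fixed_step` is hypothetical and not part of the commit; it only illustrates why the drawn cloud stays within the cap and why two runs yield identical points.

```python
import math


def downsample_fixed_step(pairs, max_points):
    """Thin `pairs` with a fixed stride (hypothetical helper, mirrors the diff)."""
    valid_cap = (
        isinstance(max_points, int)
        and not isinstance(max_points, bool)  # bool is an int subclass: reject it
        and max_points > 0
    )
    if not valid_cap or len(pairs) <= max_points:
        return list(pairs)  # nothing to thin, or down-sampling disabled
    step = math.ceil(len(pairs) / max_points)
    return list(pairs[::step])  # no randomness: same input, same output


pairs = [(float(i), 0.5 * i) for i in range(5000)]
thinned = downsample_fixed_step(pairs, 1000)
print(len(thinned))   # 1000 (step = 5)
print(thinned[:2])    # [(0.0, 0.0), (5.0, 2.5)]
```

Because `step >= n / max_points`, the slice holds `ceil(n / step) <= max_points` points, so the cap holds for any `n`, not only for exact multiples.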
@@ -0,0 +1,100 @@
+"""Tests for relationship_scatter_figure (scatter of a numeric pair, eda group).
+
+Uses the Agg backend and neither shows nor saves figures. Each test calls
+matplotlib.pyplot.close on the Figure it builds so that no state accumulates
+between tests.
+"""
+
+import os
+import sys
+
+sys.path.insert(0, os.path.dirname(__file__))
+
+import matplotlib
+
+matplotlib.use("Agg")
+
+import matplotlib.pyplot as plt  # noqa: E402
+from matplotlib.collections import PathCollection  # noqa: E402
+from matplotlib.figure import Figure  # noqa: E402
+
+from relationship_scatter_figure import relationship_scatter_figure
+
+
+def _scatter_offsets(fig):
+    """Return the plotted points of the first PathCollection (scatter) found."""
+    for ax in fig.axes:
+        for coll in ax.collections:
+            if isinstance(coll, PathCollection):
+                return coll.get_offsets()
+    return None
+
+
+def test_returns_figure():
+    xs = [float(i) for i in range(20)]
+    ys = [2.0 * x + 1.0 for x in xs]  # y = 2x + 1
+    classification = {
+        "tipo": "lineal",
+        "pearson": 1.0,
+        "r2_linear": 1.0,
+        "spearman": 1.0,
+        "r2_poly2": 1.0,
+        "r2_poly3": 1.0,
+        "best_degree": 1,
+        "coeffs": [2.0, 1.0],
+    }
+    fig = relationship_scatter_figure(
+        xs, ys, x_label="a", y_label="b", classification=classification
+    )
+    assert hasattr(fig, "savefig")
+    assert len(fig.axes) >= 1
+    plt.close(fig)
+
+
+def test_downsample_determinista():
+    n = 5000
+    xs = [float(i) for i in range(n)]
+    ys = [0.5 * x for x in xs]
+    classification = {
+        "tipo": "lineal",
+        "pearson": 1.0,
+        "r2_linear": 1.0,
+        "spearman": 1.0,
+        "r2_poly2": 1.0,
+        "r2_poly3": 1.0,
+        "best_degree": 1,
+        "coeffs": [0.5, 0.0],
+    }
+    fig = relationship_scatter_figure(
+        xs, ys, x_label="x", y_label="y", classification=classification, max_points=1000
+    )
+    assert isinstance(fig, Figure)
+    offsets = _scatter_offsets(fig)
+    assert offsets is not None
+    # The number of drawn points must not exceed the cap.
+    assert len(offsets) <= 1000
+    plt.close(fig)
+
+
+def test_empty_no_lanza():
+    fig = relationship_scatter_figure([], [], x_label="x", y_label="y")
+    assert isinstance(fig, Figure)
+    plt.close(fig)
+
+
+def test_classification_none():
+    # Runs only when the sibling module classify_relationship_type exists.
+    try:
+        import classify_relationship_type  # noqa: F401
+    except Exception:
+        import pytest
+
+        pytest.skip("classify_relationship_type not available yet")
+    xs = [float(i) for i in range(30)]
+    ys = [3.0 * x - 2.0 for x in xs]
+    fig = relationship_scatter_figure(
+        xs, ys, x_label="a", y_label="b", classification=None
+    )
+    assert isinstance(fig, Figure)
+    assert len(fig.axes) >= 1
+    plt.close(fig)
@@ -0,0 +1,79 @@
+---
+name: summarize_outlier_dims
+kind: function
+lang: py
+domain: datascience
+version: "1.0.0"
+purity: pure
+signature: "def summarize_outlier_dims(raw_numeric: dict, outlier_rows: list, top_k: int = 3) -> list"
+description: "Explains WHICH columns make each anomalous row detected by isolation_forest_outliers unusual. For each {row_index, score} it rebuilds the valid row (same numeric-column filter and same dropping of rows containing None as the detector, so row_index lines up) and returns the top_k columns with the largest population |z-score| (ddof=0). Explainability layer for the multivariate outlier step in EDA. Pure and deterministic; on empty/invalid input or when there are no valid rows it returns [] instead of raising."
+tags: [eda, models, outliers, anomaly-detection, explainability, z-score, multivariate]
+params:
+  - name: raw_numeric
+    desc: "dict {column_name: [values]} aligned by row (like ctx['raw_numeric'] in the AutomaticEDA engine). Only columns whose values are all numeric are used (None is allowed per row; a bool/str/NaN/Inf discards the whole column), the SAME filter as isolation_forest_outliers so that row_index lines up."
+  - name: outlier_rows
+    desc: "List of {row_index, score} exactly as returned by isolation_forest_outliers. row_index counts ONLY the valid rows (no None) in order of appearance, 0-based. Out-of-range or malformed entries are skipped defensively."
+  - name: top_k
+    desc: "Number of columns (those with the largest |z-score|) to report per outlier. Default 3. Invalid values (non-int, bool, <1) fall back to 3."
+output: "List in the same order as outlier_rows of dicts {row_index: int, score: float, dims: [{col: str, value: float, z: float}, ...]}. dims holds up to top_k columns sorted by descending |z|, with z (population z-score, ddof=0) rounded to 3 decimals; a column with std==0 gets z = 0. Out-of-range/malformed entries of outlier_rows are omitted. For an empty/non-dict raw_numeric, a non-list outlier_rows, 0 numeric columns or 0 valid rows it returns []."
+uses_functions: []
+uses_types: []
+returns: []
+returns_optional: false
+error_type: ""
+imports: []
+tested: true
+tests: ["test_row_index_skips_none_rows", "test_extreme_row_flagged_via_isolation", "test_out_of_range_row_index_is_ignored", "test_degrades_to_empty_on_invalid_inputs"]
+test_file_path: "python/functions/datascience/summarize_outlier_dims_test.py"
+file_path: "python/functions/datascience/summarize_outlier_dims.py"
+---
+
+## Example
+
+```python
+from datascience import isolation_forest_outliers, summarize_outlier_dims
+
+# Dense cloud around the origin + 1 row with an extreme value in "c".
+raw_numeric = {
+    "a": [0.1, 0.2, -0.1, 0.0, 0.3, -0.2, 0.15, -0.05, 0.25, 0.2, -0.3, 0.1],
+    "b": [1.0, 1.1, 0.9, 1.2, 0.8, 1.0, 1.1, 0.95, 1.05, 0.9, 1.15, 1.0],
+    "c": [5.0, 5.2, 4.8, 5.1, 4.9, 5.0, 4.95, 5.05, 4.9, 500.0, 5.1, 5.0],
+}
+
+result = isolation_forest_outliers(raw_numeric, contamination=0.1)
+summary = summarize_outlier_dims(raw_numeric, result["outlier_rows"], top_k=3)
+
+for item in summary:
+    top = item["dims"][0]
+    print(item["row_index"], top["col"], top["value"], top["z"])
+# The row holding 500 comes out with top dim "c" and a high |z|: that is what makes it unusual.
+```
+
+## When to use it
+
+Right **after** `isolation_forest_outliers`, once you know WHICH rows are
+anomalous and want to explain WHY: in which columns they deviate most from the
+rest. Useful for filling the outlier section of an EDA report/notebook with
+"row 9 is unusual mainly because of `c` (z=+3.3)" instead of a bare, opaque
+row_index. Pass the same `raw_numeric` you gave the detector and its
+`outlier_rows` untouched; `row_index` points at the same row because both
+functions apply the same column filter and drop the same rows containing None.
+
+## Gotchas
+
+- **Same `raw_numeric` as the detector**: `row_index` only lines up if you pass
+  the same column dict (same order, same lists) you used to call
+  `isolation_forest_outliers`. If you change the columns or their order, the
+  indices stop mapping.
+- **`row_index` is relative to the valid rows**: rows with `None` in any used
+  column are dropped and the indices are recomputed over the remaining ones
+  (0-based, order of appearance). It does not map 1:1 to the input lists when
+  there are None values.
+- **Population z-score (ddof=0)**: the population standard deviation is used,
+  consistent with the detector's scaling. Columns with `std==0` (all values
+  equal) get `z=0`, so they sort last and reach `dims` only when fewer than
+  `top_k` columns deviate.
+- **Returns `[]` instead of raising**: non-dict/non-list input, 0 numeric
+  columns, 0 valid rows, or every entry out of range -> empty list.
+- **Does not call `isolation_forest_outliers`**: it only consumes its output. It
+  is an independent function (no import), which is why `uses_functions` is empty.
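Note on the `row_index` gotcha in the doc above: a minimal sketch of how a valid-row index maps back to a position in the input lists, assuming the columns already passed the numeric filter and that at least one column is present. `valid_to_original` is a hypothetical helper for illustration, not a function of the registry.

```python
def valid_to_original(columns):
    """Return, for each valid-row index, the original row position.

    A row is valid when no column holds None at that position
    (hypothetical helper; assumes a non-empty dict of equal-role columns).
    """
    n_rows = min(len(values) for values in columns.values())
    return [
        i
        for i in range(n_rows)
        if all(values[i] is not None for values in columns.values())
    ]


raw = {"a": [0.1, None, 0.3, 0.2], "c": [5.0, 5.1, 500.0, 4.9]}
mapping = valid_to_original(raw)
print(mapping)     # [0, 2, 3]
print(mapping[1])  # 2: valid row_index 1 is original row 2
```

Keeping such a mapping next to the summary lets a report quote the original row number instead of the detector's internal index.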
@@ -0,0 +1,144 @@
+"""Explain which dimensions (columns) make each anomalous row unusual.
+
+Takes the multivariate output of `isolation_forest_outliers` (a list of
+`{row_index, score}`) and, for each outlier, returns the columns with the
+largest |z-score| relative to the distribution of the valid rows. It is the
+"explainability" layer of the multivariate outlier step in the EDA phase: the
+Isolation Forest says WHICH rows are unusual, this function says WHY (in which
+columns they deviate most).
+
+Pure and deterministic: it rebuilds EXACTLY the same "valid rows" used by
+`isolation_forest_outliers` (same numeric-column filter and same dropping of
+rows containing None), so `row_index` points at the same row in both
+functions. It performs no I/O and depends on no state.
+"""
+
+import math
+
+import numpy as np
+
+
+def _is_finite_number(v) -> bool:
+    """True if v is a finite int/float. bool does NOT count, nor does NaN/Inf."""
+    if isinstance(v, bool):
+        return False
+    if not isinstance(v, (int, float)):
+        return False
+    if isinstance(v, float) and (math.isnan(v) or math.isinf(v)):
+        return False
+    return True
+
+
+def summarize_outlier_dims(
+    raw_numeric: dict,
+    outlier_rows: list,
+    top_k: int = 3,
+) -> list:
+    """Summarize the dimensions that deviate most for each anomalous row.
+
+    Args:
+        raw_numeric: dict {column_name: [values]} aligned by row (like
+            ctx['raw_numeric'] in the AutomaticEDA engine). Only columns whose
+            values are all numeric are used (None is allowed per row; a bool,
+            str, NaN or Inf discards the whole column), the same filter as
+            isolation_forest_outliers.
+        outlier_rows: list of {row_index, score} as returned by
+            isolation_forest_outliers. row_index counts ONLY the valid rows
+            (no None) in order of appearance, starting at 0.
+        top_k: number of columns (those with the largest |z-score|) to report
+            per outlier. Default 3. Invalid values fall back to 3.
+
+    Returns:
+        List in the same order as outlier_rows of dicts
+        {row_index, score, dims}, where dims lists up to top_k columns sorted
+        by descending |z|: [{col, value, z}, ...] with z rounded to 3
+        decimals. Out-of-range or malformed entries of outlier_rows are
+        omitted (defensive). For an empty/non-dict raw_numeric, a non-list
+        outlier_rows, 0 numeric columns or 0 valid rows it returns [].
+    """
+    # Defensive validation of the main arguments.
+    if not isinstance(raw_numeric, dict) or not isinstance(outlier_rows, list):
+        return []
+    if not isinstance(top_k, int) or isinstance(top_k, bool) or top_k < 1:
+        top_k = 3
+
+    # Numeric-column selection: identical to isolation_forest_outliers.
+    # A column is kept only if all its values are numeric (None is allowed per
+    # row); any bool/str/NaN/Inf discards the whole column.
+    numeric_cols: dict[str, list] = {}
+    for name, values in raw_numeric.items():
+        if not isinstance(values, (list, tuple)):
+            continue
+        ok = True
+        for v in values:
+            if v is None:
+                continue
+            if not _is_finite_number(v):
+                ok = False
+                break
+        if ok:
+            numeric_cols[name] = list(values)
+
+    if len(numeric_cols) < 1:
+        return []
+
+    col_names = list(numeric_cols.keys())
+    try:
+        n_rows_total = min(len(numeric_cols[c]) for c in col_names)
+    except ValueError:
+        return []
+
+    # Rebuild the valid rows with the SAME criterion as the detector: row i
+    # takes one value per column; if any value is None, the row is dropped and
+    # does NOT advance the valid index. That way row_index in outlier_rows
+    # points into this same sequence (0-based, order of appearance).
+    valid_rows: list[list[float]] = []
+    for i in range(n_rows_total):
+        row = [numeric_cols[c][i] for c in col_names]
+        if any(v is None for v in row):
+            continue
+        valid_rows.append([float(v) for v in row])
+
+    if not valid_rows:
+        return []
+
+    matrix = np.asarray(valid_rows, dtype=float)
+    n_valid = matrix.shape[0]
+    means = matrix.mean(axis=0)
+    stds = matrix.std(axis=0, ddof=0)  # population std (ddof=0)
+
+    out: list = []
+    for entry in outlier_rows:
+        if not isinstance(entry, dict):
+            continue
+        ri = entry.get("row_index")
+        # bool is a subclass of int: exclude it explicitly.
+        if not isinstance(ri, int) or isinstance(ri, bool):
+            continue
+        if ri < 0 or ri >= n_valid:
+            continue
+
+        try:
+            score = float(entry.get("score"))
+        except (TypeError, ValueError):
+            score = 0.0
+
+        row = matrix[ri]
+        dims = []
+        for j, name in enumerate(col_names):
+            std = stds[j]
+            if std == 0.0:
+                z = 0.0
+            else:
+                z = float((row[j] - means[j]) / std)
+            dims.append({"col": name, "value": float(row[j]), "z": z})
+
+        # Largest |z| first; stable sort, ties keep column order.
+        dims.sort(key=lambda d: abs(d["z"]), reverse=True)
+        dims = dims[:top_k]
+        for d in dims:
+            d["z"] = round(d["z"], 3)
+
+        out.append({"row_index": int(ri), "score": score, "dims": dims})
+
+    return out
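Note on the population z-score used in the file above: a worked check, assuming only the standard library (`statistics.fmean`, `statistics.pstdev`); `population_z` is a hypothetical name, not part of the commit. With ddof=0 no value in a sample of n points can exceed |z| = sqrt(n - 1), which is why a single extreme among 12 valid rows shows up at roughly z = 3.3 rather than at hundreds of sigmas.

```python
import math
import statistics


def population_z(values, index):
    """Population z-score (ddof=0) of values[index]; 0.0 when std is zero."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation
    if std == 0.0:
        return 0.0  # constant column: same convention as the diff
    return (values[index] - mean) / std


# 11 identical values plus one extreme: the extreme reaches the ceiling.
column = [5.0] * 11 + [500.0]
z = population_z(column, 11)
print(round(z, 3))                            # 3.317
print(round(math.sqrt(len(column) - 1), 3))   # 3.317
```

In small samples the ranking of `dims` by |z| is therefore more informative than the absolute magnitude, and fixed thresholds such as |z| > 3 become unreachable once n is 10 or fewer.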
@@ -0,0 +1,93 @@
+"""Tests for summarize_outlier_dims."""
+
+from isolation_forest_outliers import isolation_forest_outliers
+from summarize_outlier_dims import summarize_outlier_dims
+
+
+# Shared dataset: 3 columns, 13 rows. ORIGINAL row 6 has None in "a" (it is
+# dropped), so ORIGINAL row 10, which holds an extreme value in "c", lands at
+# VALID index 9 (not 10). This checks that None rows are skipped.
+A = [0.1, 0.2, -0.1, 0.0, 0.3, -0.2, None, 0.15, -0.05, 0.25, 0.2, -0.3, 0.1]
+B = [1.0, 1.1, 0.9, 1.2, 0.8, 1.0, 1.3, 1.1, 0.95, 1.05, 0.9, 1.15, 1.0]
+C = [5.0, 5.2, 4.8, 5.1, 4.9, 5.0, 5.3, 4.95, 5.05, 4.9, 500.0, 5.1, 5.0]
+RAW = {"a": A, "b": B, "c": C}
+
+# Original -> valid map (skipping original 6):
+#   orig: 0 1 2 3 4 5 7 8 9 10 11 12
+#  valid: 0 1 2 3 4 5 6 7 8  9 10 11
+# => the extreme in "c" (original 10) sits at valid index 9.
+EXTREME_VALID_INDEX = 9
+
+
+def test_row_index_skips_none_rows():
+    # Direct mapping (independent of IsolationForest randomness): valid index 9
+    # must correspond to the row with c == 500 -> the None in original row 6
+    # was skipped correctly.
+    summary = summarize_outlier_dims(
+        RAW, [{"row_index": EXTREME_VALID_INDEX, "score": -0.5}], top_k=3
+    )
+    assert len(summary) == 1
+    entry = summary[0]
+    assert entry["row_index"] == EXTREME_VALID_INDEX
+    assert entry["score"] == -0.5
+    # The dominant dimension is "c", with its extreme value and a high |z|.
+    top = entry["dims"][0]
+    assert top["col"] == "c"
+    assert top["value"] == 500.0
+    assert abs(top["z"]) > 2.0
+    # top_k respected: at most 3 dims.
+    assert len(entry["dims"]) <= 3
+
+
+def test_extreme_row_flagged_via_isolation():
+    # Real integration: detect outliers and explain them.
+    result = isolation_forest_outliers(RAW, contamination=0.1)
+    assert "note" not in result
+    outlier_rows = result["outlier_rows"]
+    assert outlier_rows  # at least one outlier
+
+    summary = summarize_outlier_dims(RAW, outlier_rows, top_k=3)
+    # Parallel to outlier_rows (every index is in range).
+    assert len(summary) == len(outlier_rows)
+
+    by_index = {e["row_index"]: e for e in summary}
+    # The extreme point must be among the detected outliers...
+    assert EXTREME_VALID_INDEX in by_index
+    # ...and its top dimension must be "c" (where it deviates by many sigmas).
+    extreme = by_index[EXTREME_VALID_INDEX]
+    assert extreme["dims"][0]["col"] == "c"
+    assert abs(extreme["dims"][0]["z"]) > 2.0
+
+
+def test_out_of_range_row_index_is_ignored():
+    # Out-of-range indices are skipped instead of raising.
+    summary = summarize_outlier_dims(
+        RAW,
+        [
+            {"row_index": 999, "score": -1.0},
+            {"row_index": -1, "score": -1.0},
+            {"row_index": EXTREME_VALID_INDEX, "score": -0.5},
+        ],
+        top_k=2,
+    )
+    # Only the valid index survives; the other two are discarded.
+    assert len(summary) == 1
+    assert summary[0]["row_index"] == EXTREME_VALID_INDEX
+    assert len(summary[0]["dims"]) <= 2
+
+
+def test_degrades_to_empty_on_invalid_inputs():
+    # Empty raw_numeric + empty outlier_rows.
+    assert summarize_outlier_dims({}, [], 3) == []
+    # raw_numeric is not a dict.
+    assert summarize_outlier_dims("not a dict", [{"row_index": 0}], 3) == []
+    # outlier_rows is not a list.
+    assert summarize_outlier_dims(RAW, "not a list", 3) == []
+    # No numeric columns (all strings) -> [].
+    assert summarize_outlier_dims(
+        {"s": ["x", "y", "z"]}, [{"row_index": 0, "score": -1.0}], 3
+    ) == []
+    # Malformed entries inside outlier_rows are ignored (no exception).
+    assert summarize_outlier_dims(
+        RAW, ["nope", 42, {"no_row_index": 1}], 3
+    ) == []
@@ -261,7 +261,15 @@ def render_automatic_eda(
        md_path = None
        if emit_md:
            md_path = os.path.join(out_dir, base + ".md")
-            rmd = render_automatic_eda_markdown(prof, md_path, meta) or {}
+            # Markdown is the MOST complete output: besides the per-chapter
+            # document (shared with PDF/PPTX) it dumps an appendix with ALL the
+            # numeric data of the profile (full association matrix, describe
+            # with skew/kurtosis/percentiles, re-expressions, KMeans
+            # scores_by_k, normality statistics). `prof` is passed in via
+            # meta['profile']; a separate meta avoids altering the PDF/PPTX one.
+            md_meta = dict(meta)
+            md_meta["profile"] = prof
+            rmd = render_automatic_eda_markdown(prof, md_path, md_meta) or {}

        return {
            "status": "ok",
@@ -18,6 +18,7 @@ dependencies = [
    "google-cloud-bigquery-storage>=2.27",
    "google-cloud-storage>=3.10.1",
    "httpx",
+    "langdetect>=1.0.9",
    "matplotlib>=3.10.9",
    "opencv-contrib-python-headless>=4.13.0.92",
    "openpyxl>=3.1.5",
@@ -40,6 +41,7 @@ dependencies = [
    "seaborn>=0.13.2",
    "shapely>=2.1.2",
    "statsmodels>=0.14.6",
+    "textstat>=0.7.13",
    "trimesh>=4.12.2",
    "xlrd>=2.0.2",
 ]
@@ -899,6 +899,7 @@ dependencies = [
    { name = "google-cloud-bigquery-storage" },
    { name = "google-cloud-storage" },
    { name = "httpx" },
+    { name = "langdetect" },
    { name = "matplotlib" },
    { name = "opencv-contrib-python-headless" },
    { name = "openpyxl" },
@@ -906,9 +907,11 @@ dependencies = [
    { name = "polars" },
    { name = "pymeshlab" },
    { name = "pymssql" },
+    { name = "pymupdf" },
    { name = "pypdf" },
    { name = "pyproj" },
    { name = "python-docx" },
+    { name = "python-pptx" },
    { name = "pyyaml" },
    { name = "qrcode", extra = ["pil"] },
    { name = "rapidfuzz" },
@@ -919,6 +922,7 @@ dependencies = [
    { name = "seaborn" },
    { name = "shapely" },
    { name = "statsmodels" },
+    { name = "textstat" },
    { name = "trimesh" },
    { name = "xlrd" },
 ]
@@ -959,6 +963,7 @@ requires-dist = [
    { name = "jupyter-collaboration", marker = "extra == 'jupyter'", specifier = ">=2.0" },
    { name = "jupyter-mcp-server", marker = "extra == 'jupyter'" },
    { name = "jupyterlab", marker = "extra == 'jupyter'", specifier = ">=4.0" },
+    { name = "langdetect", specifier = ">=1.0.9" },
    { name = "matplotlib", specifier = ">=3.10.9" },
    { name = "opencv-contrib-python-headless", specifier = ">=4.13.0.92" },
    { name = "openpyxl", specifier = ">=3.1.5" },
@@ -966,9 +971,11 @@ requires-dist = [
    { name = "polars", specifier = ">=1.40.1" },
    { name = "pymeshlab", specifier = ">=2025.7.post1" },
    { name = "pymssql", specifier = ">=2.3.13" },
+    { name = "pymupdf", specifier = ">=1.28.0" },
    { name = "pypdf", specifier = ">=6.10.0" },
    { name = "pyproj", specifier = ">=3.7.2" },
    { name = "python-docx", specifier = ">=1.2.0" },
+    { name = "python-pptx", specifier = ">=1.0.2" },
    { name = "pyyaml", specifier = ">=6.0.3" },
    { name = "qrcode", extras = ["pil"], specifier = ">=8.2" },
    { name = "rapidfuzz", specifier = ">=3.14.5" },
@@ -979,6 +986,7 @@ requires-dist = [
    { name = "seaborn", specifier = ">=0.13.2" },
    { name = "shapely", specifier = ">=2.1.2" },
    { name = "statsmodels", specifier = ">=0.14.6" },
+    { name = "textstat", specifier = ">=0.7.13" },
    { name = "trimesh", specifier = ">=4.12.2" },
    { name = "xlrd", specifier = ">=2.0.2" },
 ]
@@ -2198,6 +2206,15 @@ wheels = [
    { url = "https://files.pythonhosted.org/packages/b5/91/53255615acd2a1eaca307ede3c90eb550bae9c94581f8c00081b6b1c8f44/kiwisolver-1.5.0-graalpy312-graalpy250_312_native-win_amd64.whl", hash = "sha256:1f1489f769582498610e015a8ef2d36f28f505ab3096d0e16b4858a9ec214f57", size = 75987, upload-time = "2026-03-09T13:15:39.65Z" },
 ]

+[[package]]
+name = "langdetect"
+version = "1.0.9"
+source = { registry = "https://pypi.org/simple" }
+dependencies = [
+    { name = "six" },
+]
+sdist = { url = "https://files.pythonhosted.org/packages/0e/72/a3add0e4eec4eb9e2569554f7c70f4a3c27712f40e3284d483e88094cc0e/langdetect-1.0.9.tar.gz", hash = "sha256:cbc1fef89f8d062739774bd51eda3da3274006b3661d199c2655f6b3f6d605a0", size = 981474, upload-time = "2021-05-07T07:54:13.562Z" }
+
 [[package]]
 name = "lark"
 version = "1.3.1"
@@ -2699,6 +2716,21 @@ wheels = [
    { url = "https://files.pythonhosted.org/packages/9e/c9/b2622292ea83fbb4ec318f5b9ab867d0a28ab43c5717bb85b0a5f6b3b0a4/networkx-3.6.1-py3-none-any.whl", hash = "sha256:d47fbf302e7d9cbbb9e2555a0d267983d2aa476bac30e90dfbe5669bd57f3762", size = 2068504, upload-time = "2025-12-08T17:02:38.159Z" },
 ]

+[[package]]
+name = "nltk"
+version = "3.9.4"
+source = { registry = "https://pypi.org/simple" }
+dependencies = [
+    { name = "click" },
+    { name = "joblib" },
+    { name = "regex" },
+    { name = "tqdm" },
+]
+sdist = { url = "https://files.pythonhosted.org/packages/74/a1/b3b4adf15585a5bc4c357adde150c01ebeeb642173ded4d871e89468767c/nltk-3.9.4.tar.gz", hash = "sha256:ed03bc098a40481310320808b2db712d95d13ca65b27372f8a403949c8b523d0", size = 2946864, upload-time = "2026-03-24T06:13:40.641Z" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/9d/91/04e965f8e717ba0ab4bdca5c112deeab11c9e750d94c4d4602f050295d39/nltk-3.9.4-py3-none-any.whl", hash = "sha256:f2fa301c3a12718ce4a0e9305c5675299da5ad9e26068218b69d692fda84828f", size = 1552087, upload-time = "2026-03-24T06:13:38.47Z" },
+]
+
 [[package]]
 name = "notebook-shim"
 version = "0.2.4"
@@ -3750,6 +3782,23 @@ wheels = [
    { url = "https://files.pythonhosted.org/packages/25/50/4be9bd9cf4b43208a7175117a533ece200cfe4131a39f9909bdc7560ddeb/pymssql-2.3.13-cp314-cp314-win_amd64.whl", hash = "sha256:7d7037d2b5b907acc7906d0479924db2935a70c720450c41339146a4ada2b93d", size = 2049139, upload-time = "2026-02-14T05:00:23.951Z" },
 ]

+[[package]]
+name = "pymupdf"
+version = "1.28.0"
+source = { registry = "https://pypi.org/simple" }
+sdist = { url = "https://files.pythonhosted.org/packages/8e/e9/6d6c5d6c0a3551bffd47681a6240caf941727f195b45593cf20ab36f018f/pymupdf-1.28.0.tar.gz", hash = "sha256:e53f3567403a92da15caa9e7ae0164327fff48817e9f40175367fb9de524258d", size = 87637751, upload-time = "2026-06-29T09:08:47.547Z" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/c8/b7/88043e38cc7529de070f0c9bd267fa258035cca0b4ad5260536b994594a7/pymupdf-1.28.0-cp310-abi3-macosx_10_15_x86_64.whl", hash = "sha256:892b89ba88e8f98b53133b62877a9dc9b5e7dc6a4aeb837b612db56a8d2e03ac", size = 24597385, upload-time = "2026-06-29T09:03:30.608Z" },
+    { url = "https://files.pythonhosted.org/packages/33/f4/23775bbda0781b61fc398cc75079a2b0e64696d8fcf93271748883e9627e/pymupdf-1.28.0-cp310-abi3-macosx_11_0_arm64.whl", hash = "sha256:4d692dcf44d3566ae96bc6f6346c6ad432274a29ba617bf7a9fe18009e24adb4", size = 23828292, upload-time = "2026-06-29T09:03:46.129Z" },
+    { url = "https://files.pythonhosted.org/packages/1c/f5/bf75fc7a415722f8b33662054f82d88520c0cbfd4c36d0e08aeaec605e49/pymupdf-1.28.0-cp310-abi3-manylinux_2_28_aarch64.whl", hash = "sha256:47a5c29ed4eb0744de9c4e37bb49b1259b18d4d75fcc8a7c130f7c9fa15956f6", size = 25045507, upload-time = "2026-06-29T09:04:03.86Z" },
+    { url = "https://files.pythonhosted.org/packages/58/69/5d12c9f1f2d76f28383d6110a069c79fbfced5a4f97bb1ee6e8354f52bb7/pymupdf-1.28.0-cp310-abi3-manylinux_2_28_x86_64.whl", hash = "sha256:44f0973f5e5edbaec95bc34b64e71d1959d4ee90b1328de1b4f4f5b4fa78673f", size = 25716599, upload-time = "2026-06-29T09:04:19.367Z" },
+    { url = "https://files.pythonhosted.org/packages/4d/b4/ec0e017bc42857cc86bd651441dbc41cc18be48d4698ecd27aac491e0c9a/pymupdf-1.28.0-cp310-abi3-musllinux_1_2_x86_64.whl", hash = "sha256:4d61ec323a706e153a12e262e51febfb43eeaa20977785ace135d18d48bcdc83", size = 25940489, upload-time = "2026-06-29T09:04:36.624Z" },
+    { url = "https://files.pythonhosted.org/packages/06/86/f831fef09013f33b3c9c09fb3923f2ff53e1e437f6ace14b8ae46392f558/pymupdf-1.28.0-cp310-abi3-win32.whl", hash = "sha256:caea2b3b67347fd79e5d15ed7929b0e886aac594ea228073b6d39de0078189da", size = 18489703, upload-time = "2026-06-29T20:50:30.599Z" },
+    { url = "https://files.pythonhosted.org/packages/2e/5d/1a03f53eb0449900469335fcfc742ca28e3ba159b7d650e0921d50b8b308/pymupdf-1.28.0-cp310-abi3-win_amd64.whl", hash = "sha256:e01e90fd86abfeb37ceb921eddb951f988a11d45ff6ce6b7664f2039849068ec", size = 19773102, upload-time = "2026-06-29T09:04:49.773Z" },
+    { url = "https://files.pythonhosted.org/packages/72/f6/1e52ce243ca792254f6223b4017c5667194c146ce9b88baf37bc5eb3d1c9/pymupdf-1.28.0-cp313-abi3-pyemscripten_2025_0_wasm32.whl", hash = "sha256:74c6d00ba2a9aad3a635db73b07c15db462b480741d831a34a75a56535ebc22b", size = 18357011, upload-time = "2026-06-29T20:50:50.353Z" },
+    { url = "https://files.pythonhosted.org/packages/62/b1/46b5b3d8ef3cc71114667cf10c4d8b33f39af97253af32e9a0986775b638/pymupdf-1.28.0-cp314-cp314t-manylinux_2_28_x86_64.whl", hash = "sha256:b3e1399c7a64c6914239116a369efcdaac4cfb9e838bde2656d7accc4a85c72d", size = 25753599, upload-time = "2026-06-29T09:05:09.398Z" },
+]
+
 [[package]]
 name = "pyogrio"
 version = "0.12.1"
@@ -3811,6 +3860,15 @@ wheels = [
    { url = "https://files.pythonhosted.org/packages/55/f2/7ebe366f633f30a6ad105f650f44f24f98cb1335c4157d21ae47138b3482/pypdf-6.10.0-py3-none-any.whl", hash = "sha256:90005e959e1596c6e6c84c8b0ad383285b3e17011751cedd17f2ce8fcdfc86de", size = 334459, upload-time = "2026-04-10T09:34:54.966Z" },
 ]

+[[package]]
+name = "pyphen"
+version = "0.17.2"
+source = { registry = "https://pypi.org/simple" }
+sdist = { url = "https://files.pythonhosted.org/packages/69/56/e4d7e1bd70d997713649c5ce530b2d15a5fc2245a74ca820fc2d51d89d4d/pyphen-0.17.2.tar.gz", hash = "sha256:f60647a9c9b30ec6c59910097af82bc5dd2d36576b918e44148d8b07ef3b4aa3", size = 2079470, upload-time = "2025-01-20T13:18:36.296Z" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/7b/1f/c2142d2edf833a90728e5cdeb10bdbdc094dde8dbac078cee0cf33f5e11b/pyphen-0.17.2-py3-none-any.whl", hash = "sha256:3a07fb017cb2341e1d9ff31b8634efb1ae4dc4b130468c7c39dd3d32e7c3affd", size = 2079358, upload-time = "2025-01-20T13:18:29.629Z" },
+]
+
 [[package]]
 name = "pyproj"
 version = "3.7.2"
@@ -3935,6 +3993,21 @@ wheels = [
    { url = "https://files.pythonhosted.org/packages/1c/fd/0318007beb234790993d3ec5afd051d1dbceb733e81e3afe2b981ece3f37/python_multipart-0.0.30-py3-none-any.whl", hash = "sha256:830964def8c90607ac5daa00514e3987815865713ade8d20febc9177ac0c3c5b", size = 29730, upload-time = "2026-05-31T19:24:53.814Z" },
 ]

+[[package]]
+name = "python-pptx"
+version = "1.0.2"
+source = { registry = "https://pypi.org/simple" }
+dependencies = [
+    { name = "lxml" },
+    { name = "pillow" },
+    { name = "typing-extensions" },
+    { name = "xlsxwriter" },
+]
+sdist = { url = "https://files.pythonhosted.org/packages/52/a9/0c0db8d37b2b8a645666f7fd8accea4c6224e013c42b1d5c17c93590cd06/python_pptx-1.0.2.tar.gz", hash = "sha256:479a8af0eaf0f0d76b6f00b0887732874ad2e3188230315290cd1f9dd9cc7095", size = 10109297, upload-time = "2024-08-07T17:33:37.772Z" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/d9/4f/00be2196329ebbff56ce564aa94efb0fbc828d00de250b1980de1a34ab49/python_pptx-1.0.2-py3-none-any.whl", hash = "sha256:160838e0b8565a8b1f67947675886e9fea18aa5e795db7ae531606d68e785cba", size = 472788, upload-time = "2024-08-07T17:33:28.192Z" },
+]
+
 [[package]]
 name = "pywin32"
 version = "311"
@@ -4936,6 +5009,20 @@ wheels = [
    { url = "https://files.pythonhosted.org/packages/6a/9e/2064975477fdc887e47ad42157e214526dcad8f317a948dee17e1659a62f/terminado-0.18.1-py3-none-any.whl", hash = "sha256:a4468e1b37bb318f8a86514f65814e1afc977cf29b3992a4500d9dd305dcceb0", size = 14154, upload-time = "2024-03-12T14:34:36.569Z" },
 ]

+[[package]]
+name = "textstat"
+version = "0.7.13"
+source = { registry = "https://pypi.org/simple" }
+dependencies = [
+    { name = "nltk" },
+    { name = "pyphen" },
+    { name = "setuptools" },
+]
+sdist = { url = "https://files.pythonhosted.org/packages/8c/0f/b673fcec5ad6e976b2e8368ef3651fe0fea3348a1191bacfcd41a17ddec6/textstat-0.7.13.tar.gz", hash = "sha256:a88d1da76287cd27ca4ce7bcba1ebaf2890544a5f0bb6a5758fa84cef3bceccb", size = 138932, upload-time = "2026-02-18T21:07:39.525Z" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/ca/31/0eb4cc5bb021b4ceaaa602c59ba16ce99256b9dd30981bef3f3a53d8555f/textstat-0.7.13-py3-none-any.whl", hash = "sha256:04b1ec995d1e8b2e628759497e6b23204a9ec91dcd652447d8cbba9478f25471", size = 177050, upload-time = "2026-02-18T21:07:38.163Z" },
+]
+
 [[package]]
 name = "threadpoolctl"
 version = "3.6.0"
@@ -5312,6 +5399,15 @@ wheels = [
    { url = "https://files.pythonhosted.org/packages/1a/62/c8d562e7766786ba6587d09c5a8ba9f718ed3fa8af7f4553e8f91c36f302/xlrd-2.0.2-py2.py3-none-any.whl", hash = "sha256:ea762c3d29f4cca48d82df517b6d89fbce4db3107f9d78713e48cd321d5c9aa9", size = 96555, upload-time = "2025-06-14T08:46:37.766Z" },
 ]

+[[package]]
+name = "xlsxwriter"
+version = "3.2.9"
+source = { registry = "https://pypi.org/simple" }
+sdist = { url = "https://files.pythonhosted.org/packages/46/2c/c06ef49dc36e7954e55b802a8b231770d286a9758b3d936bd1e04ce5ba88/xlsxwriter-3.2.9.tar.gz", hash = "sha256:254b1c37a368c444eac6e2f867405cc9e461b0ed97a3233b2ac1e574efb4140c", size = 215940, upload-time = "2025-09-16T00:16:21.63Z" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/3a/0c/3662f4a66880196a590b202f0db82d919dd2f89e99a27fadef91c4a33d41/xlsxwriter-3.2.9-py3-none-any.whl", hash = "sha256:9a5db42bc5dff014806c58a20b9eae7322a134abb6fce3c92c181bfb275ec5b3", size = 175315, upload-time = "2025-09-16T00:16:20.108Z" },
+]
+
 [[package]]
 name = "xxhash"
 version = "3.7.0"
Author	SHA1	Message	Date
egutierrez	6f88f184f1	feat(eda): OUTLIERS chapter (univariate + multivariate outliers) New dedicated `outliers` chapter for the AutomaticEDA engine that gathers and deepens the outlier analysis in one place; today it is scattered between `num_distr` (per-column count) and `modelos` (IsolationForest). It is registered in `chapters_registry.py` between `missingness` and `correlacion` (data-quality block: calidad → missingness → outliers). Chapter contents: - Per-column univariate summary: count and % of outliers by Tukey (1.5·IQR) and by z-score (|z| > 3), with lower/upper fences and extreme values. Sorted by contamination, flagging the most affected columns. Reuses the registry functions `build_boxplot_stats` (fences from the profile percentiles) and `detect_outliers` (z-score rule on the raw sample in `ctx`). - Tukey boxplots of the most contaminated columns (box, whiskers and outlier points), delegated to the new function `build_boxplots_figure`. - Multivariate: anomalous rows considering all columns at once with `isolation_forest_outliers`: count and % of rows, the most anomalous ones with their score, and the dimensions that make them unusual (top columns by |z|, via the new function `summarize_outlier_dims`). The detector runs live on `raw_numeric` so that row indexing matches the indexing of the dimensions exactly; it falls back to the profile's precomputed block when there is no raw sample (lite preset). - Exploratory interpretation: an outlier is not necessarily an error (it distinguishes a data error from a genuine extreme value), plus recommendations (review, winsorize or re-express, linking to the profile's Tukey re-expression). Clickable terms registered in the shared glossary: `outlier`, `tukey_fence`, `zscore`, `isolation_forest`.
New registry functions (datascience domain, eda group): - `build_boxplots_figure_py_datascience` (figure helper, impure) - `summarize_outlier_dims_py_datascience` (pure) The chapter activates with ≥1 numeric column and returns None otherwise; it reads everything defensively and never raises. Tests: chapter (golden + edges + error path + PDF/PPTX render) and both new functions. AutomaticEDA non-regression suite green. Verified end-to-end with the Titanic dataset (Fare/Parch/SibSp as the most contaminated columns). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-30 21:12:40 +02:00
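
The two univariate rules named in the commit above (Tukey fences at 1.5·IQR and |z| > 3) can be sketched as follows. This is a minimal stdlib illustration, not the registry's `build_boxplot_stats` or `detect_outliers`: the function names, the inclusive quartile method and the use of the population standard deviation are all assumptions made for the example.

```python
import statistics


def tukey_fences(values, k=1.5):
    """Return the (lower, upper) Tukey fences: Q1 - k*IQR and Q3 + k*IQR."""
    # Assumption: quartiles use the "inclusive" method; other conventions
    # shift the fences slightly.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def tukey_outliers(values, k=1.5):
    """Values lying strictly outside the Tukey fences."""
    lower, upper = tukey_fences(values, k)
    return [v for v in values if v < lower or v > upper]


def zscore_outliers(values, threshold=3.0):
    """Values whose |z| exceeds the threshold (population std, an assumption)."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return []  # constant column: nothing can be flagged
    return [v for v in values if abs(v - mean) / std > threshold]


small = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
print(tukey_fences(small))     # fences computed from Q1 and Q3
print(tukey_outliers(small))   # Tukey flags the extreme value
print(zscore_outliers(small))  # the z-score rule flags nothing here
```

The two rules can disagree on small samples: with the population standard deviation, no value in a sample of n points can have |z| above sqrt(n - 1), so with n = 10 the |z| > 3 rule can never fire, while the quartile-based fences still isolate the extreme value.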
egutierrez	e5abc18211	merge(eda): MISSINGNESS chapter, null patterns (co-occurrence + MCAR/MAR)	2026-06-30 20:42:46 +02:00
egutierrez	9da1ee6533	merge(eda): text_distr chapter (TEXT/NLP), first non-tabular chapter	2026-06-30 20:41:29 +02:00
egutierrez	5d4a48ec5e	merge(eda): scatters of correlated pairs + relationship type in the CORRELACION chapter	2026-06-30 20:39:16 +02:00
egutierrez	105e56cf05	feat(eda): text_distr chapter (TEXT/NLP), first chapter for non-tabular data Adds the `text_distr` chapter to the AutomaticEDA engine: it profiles long free-text columns (reviews, descriptions, comments) that the categorical distribution does not summarize well. It follows the cat_distr/num_distr pattern (build_text_distr(profile, ctx) -> Chapter | None) and is registered in CHAPTER_ORDER after cat_distr. Two-phase activation: a cheap gate from the profile (non-numeric column with len_mean >= 50 chars) + confirmation with the raw sample (median word count >= 20). A dataset without long text (e.g. titanic) returns None without touching the report. Per-column blocks (Group with page_break): summary (lengths, vocabulary with TTR and % hapax, dominant language, % duplicates, readability), length histogram, top terms (table + bars), bigrams/trigrams, detected languages and an optional word cloud. The ttr/hapax terms are hooked into the clickable glossary. Logic delegated to 7 new registry functions (datascience, eda tag), dict-no-throw style: - extract_text_sample (impure, SQL push-down DuckDB/Postgres) - compute_text_length_stats, compute_vocabulary_stats, compute_top_ngrams (pure, stdlib) - detect_corpus_language (optional langdetect), compute_text_readability (optional textstat), compute_text_duplicates (hash + optional datasketch) Cheap version without heavy models: the pieces that depend on an optional library (langdetect, textstat, wordcloud, datasketch) degrade to omitted without raising. Adds langdetect and textstat (lightweight) to pyproject + uv.lock. Verified: golden on a multi-language reviews dataset (chapter present in PDF+PPTX+MD with real metrics), titanic without the chapter (None), degradation without the libs, automatic_eda + pipeline suite green (128 passed), fn index OK. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-30 20:38:17 +02:00
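
The vocabulary metrics mentioned in the commit above (TTR and % hapax) can be illustrated with a short stdlib sketch. It is not the registry's `compute_vocabulary_stats`: the function name, the `\w+` tokenizer, the lowercasing and the choice to express hapax as a share of distinct words (rather than of all tokens) are assumptions made for the example.

```python
import re
from collections import Counter


def vocabulary_stats(texts):
    """Type-token ratio and hapax share for a list of strings (hypothetical helper)."""
    tokens = []
    for text in texts:
        # Assumption: lowercase word tokens matched by \w+
        tokens.extend(re.findall(r"\w+", text.lower()))
    counts = Counter(tokens)
    n_tokens, n_types = len(tokens), len(counts)
    if n_tokens == 0:
        # Dict-no-throw style: report empty input instead of raising
        return {"tokens": 0, "types": 0, "ttr": None, "hapax_pct": None}
    n_hapax = sum(1 for c in counts.values() if c == 1)  # words seen exactly once
    return {
        "tokens": n_tokens,
        "types": n_types,
        "ttr": n_types / n_tokens,               # distinct words / total words
        "hapax_pct": 100.0 * n_hapax / n_types,  # % of the vocabulary seen once
    }


stats = vocabulary_stats(["the cat sat", "the dog sat down"])
print(stats["tokens"], stats["types"], stats["hapax_pct"])  # 7 5 60.0
```

TTR falls as texts get longer, so comparing it across columns with very different lengths can mislead; that caveat applies to any implementation of the metric.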
egutierrez	eaca41a532	feat(eda): scatters of the most correlated pairs + relationship type in the CORRELACION chapter Adds to the AutomaticEDA `correlacion` chapter a scatter visualization of the most correlated numeric-numeric pairs (positively and negatively) and, for each one, a classification of the relationship type: linear, polynomial (degree 2/3), non-linear monotonic, or weak/shapeless. New registry functions (datascience domain, eda group): - classify_relationship_type_py_datascience (pure): given two paired numeric lists, it crosses Pearson r (linear), Spearman ρ (monotonic) and polynomial fits of degree 2 and 3 (numpy.polyfit + manual R²) to label the shape. Reuses pearson and spearman_corr from the registry. Thresholds calibrated for real discrete/noisy data (order: weak → monotonic → polynomial → linear). Returns the coefficients of the best model so the curve can be drawn. No-throw. - relationship_scatter_figure_py_datascience (impure): builds the matplotlib Figure of a pair's scatter with its fitted line/curve and an annotation with the type + metrics (r, ρ, R²lin, R²poly). Agg backend without global pyplot, deterministic downsampling of the plotted points, ordered trend (binned / by value) for the monotonic case without a polynomial. Defensive against empty input. Chapter correlacion.py (1.0.0 → 1.1.0): new section "Relaciones más fuertes (scatter)" ("Strongest relationships") after the matrix + top tables. It takes the top-K num↔num pairs by |value| from profile['correlations']['pairs'], fetches each pair's raw data from ctx['raw_numeric'] and emits, per pair, a Figure inside a keep-together Group next to a text note with the relationship type (extractable by pdftotext). num↔num only: cat↔cat pairs (Cramér's V) and num↔cat pairs (correlation ratio) get no scatter. When there is no raw_numeric (lite/aggregated profile or ctx None) the scatters are omitted without raising; the matrix + tables remain.
Verified: titanic EDA golden (run_models): the Correlación chapter of the PDF and PPTX includes the scatters (pclass↔fare → non-linear monotonic, sibsp↔parch → linear, …) with their fit and the type label as text. Synthetic classification tests (linear, y=x² → polynomial, y=exp(x) → monotonic, noise → weak) + chapter tests (golden with raw_numeric, edge without raw, pair with a missing column). automatic_eda suite + render_automatic_eda pipeline green (141 passed). fn index without errors. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-30 20:37:01 +02:00
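
The idea of crossing Pearson r, Spearman ρ and polynomial R² to label a relationship can be sketched roughly as below. This is not `classify_relationship_type_py_datascience`: the commit does not state its thresholds, so `strong` and `gain` are made-up values, the decision order is an assumption, and the polynomial R² is computed with a stdlib Gram-Schmidt projection swapped in for the `numpy.polyfit` fit the commit mentions.

```python
import math


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def centered(v):
    m = sum(v) / len(v)
    return [a - m for a in v]


def pearson(x, y):
    xc, yc = centered(x), centered(y)
    denom = math.sqrt(dot(xc, xc) * dot(yc, yc))
    return dot(xc, yc) / denom if denom > 0 else 0.0


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def poly_r2(x, y, degree):
    """R² of a least-squares polynomial fit: explained / total sum of squares."""
    yc = centered(y)
    ss_tot = dot(yc, yc)
    if ss_tot == 0:
        return 0.0
    basis, explained = [], 0.0
    for p in range(1, degree + 1):
        col = centered([a ** p for a in x])
        for b in basis:  # orthogonalize against the lower powers
            coef = dot(col, b) / dot(b, b)
            col = [c - coef * bi for c, bi in zip(col, b)]
        norm2 = dot(col, col)
        if norm2 > 1e-12:  # skip degenerate directions
            explained += dot(yc, col) ** 2 / norm2
            basis.append(col)
    return explained / ss_tot


def classify_relationship(x, y, strong=0.8, gain=0.1):
    """Label the shape of y vs x; thresholds are hypothetical, not calibrated."""
    r = pearson(x, y)
    rho = pearson(average_ranks(x), average_ranks(y))  # Spearman = Pearson on ranks
    r2_lin = r * r
    r2_poly = max(poly_r2(x, y, 2), poly_r2(x, y, 3))
    if r2_poly >= strong and r2_poly - r2_lin > gain:
        label = "polynomial"
    elif r2_lin >= strong:
        label = "linear"
    elif abs(rho) >= strong:
        label = "monotonic non-linear"
    else:
        label = "weak"
    return {"label": label, "r": r, "rho": rho, "r2_lin": r2_lin, "r2_poly": r2_poly}


xs = list(range(-5, 6))
print(classify_relationship(xs, [v * v for v in xs])["label"])  # polynomial
```

A symmetric parabola is the case that motivates the polynomial check: both r and ρ are zero there, so correlation alone would report no relationship at all.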
egutierrez	e815f5b3b9	merge(eda): AutomaticEDA MD dumps ALL profile data (28 pairs, skew/kurtosis/percentiles, scores_by_k)	2026-06-30 20:31:50 +02:00
egutierrez	7ec2bb1b45	feat(eda): the AutomaticEDA Markdown dumps ALL profile data The .md of the `eda` group is the output meant to be pasted into an LLM, so it must contain everything the engine computed, even if the PDF/PPTX (human view) summarize. Evaluation 2053 found 6 pieces of data that the .md lost relative to the profile. They are closed additively (the .md has MORE than the PDF/PPTX, without touching those renderers or the chapters). render_automatic_eda.py passes the profile to the Markdown serializer via meta['profile'] (a meta specific to the MD; the PDF/PPTX one stays intact). render_md_impl.py adds an "Apéndice — Datos completos del perfil" (appendix with the full profile data) at the end of the document, emitted only when there is a profile and degrading cleanly when a section is missing (lite without models, profile without correlations). The appendix is not coupled to the chapter ids (which other agents edit in parallel). Losses closed: 1. COMPLETE association matrix: all N pairs in correlations.pairs (not just the top 17), including correlation_ratio (num↔cat) and cramers_v (cat↔cat). 2. Numeric columns: full describe per column: mean/median/mode/std/variance/cv, skew and kurtosis for ALL (not just the skewed ones), p1/p5/p25/p50/p75/p95/p99, iqr, min/max, outliers, distribution_type. 3. Re-expression: names the concrete transformation (log1p/sqrt/yeo-johnson) with power, reason and alternatives, not a vague "consider re-expression". 4. KMeans: scores_by_k table (silhouette + inertia per k) marking the chosen k. 5. Normality: the statistic (stat) of each test next to the p-value. 6. Headers of bar/scree figures stop inheriting "Desde/Hasta/Frecuencia" (From/To/Frequency) from the histogram; they use "Inicio/Fin/Valor" (Start/End/Value) when the caption is not a histogram. New test md_completeness_test.py: synthetic profile, asserts the N correlation pairs, skew/kurtosis of every numeric column, extended percentiles, log1p, scores_by_k, normality stat, bar headers and the edges (no models / no correlations / no profile, defensive).
Verified with titanic (profile_level=full): 28 pairs in the table (incl. Sex↔Embarked cramers_v), 7 numeric columns with skew+kurtosis, p5/p95/p99, scores_by_k and JB/D'Agostino/Shapiro stat present. PDF/PPTX/manifest are still produced. automatic_eda suite + render_automatic_eda_test: 134 passed. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-30 20:27:30 +02:00