feat(eda): MISSINGNESS chapter, missing-data patterns (co-occurrence + MCAR/MAR)

Adds the `missingness` chapter to the AutomaticEDA engine, a natural complement to `calidad`: where `calidad` reports how much is missing per column, this chapter analyses the PATTERN of the nulls, that is, where values are missing and whether columns are missing together (co-occurrence of absences), the signal that distinguishes MCAR from MAR before imputing.

Chapter (`chapters/missingness.py`), registered in `chapters_registry.py` right after `calidad`:
- Global summary: % of missing cells, columns with nulls, complete vs incomplete rows.
- Per-column ranking (table + horizontal bars).
- Co-occurrence: correlation of the is-null masks between columns (heatmap + table of the pairs that co-miss, with co-missing row counts and Jaccard).
- Most frequent row patterns (in the style of the missingno matrix).
- Exploratory MCAR/MAR reading (a heuristic based on correlation/overlap of absences, not a confirmatory test) that cites the concrete evidence.
- Clickable glossary terms: missingness, MCAR, MAR.

The per-row is-null mask of ALL columns (numeric and categorical) is built with a DuckDB push-down over ctx['db_path']/table (same pattern as the aggregation chapter), with a fallback to ctx['raw_numeric'] when there is no database. The chapter activates only if the table has nulls; otherwise it returns None.

New functions in the `eda` group (datascience domain):
- extract_null_mask (impure): per-row is-null mask via query_fn.
- missingness_overview (pure): global summary + complete/incomplete rows.
- missingness_correlation (pure): absence correlation + pairs + Jaccard; reuses pearson.
- missingness_row_patterns (pure): most common row patterns.
- missingness_corr_heatmap_figure / missingness_rank_bar_figure (impure): figures.

Verified: the titanic EDA generates the chapter in PDF + PPTX + MD with Cabin 77.1%, Age 19.9% and the Age↔Cabin co-occurrence (158 rows). Full AutomaticEDA + render_automatic_eda suite is green (125 passed); tests per function and per chapter; fn index without errors.
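The co-occurrence measure described in this message can be sketched in a few lines. This is only an illustration under stated assumptions: the body of `missingness_correlation` is not part of the visible diff, so `pearson` and `co_missing_stats` below are hypothetical local stand-ins, and the ten-row masks are made-up data rather than the Titanic figures.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences; None if undefined."""
    n = len(x)
    if n == 0 or n != len(y):
        return None
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:  # a constant mask has no defined correlation
        return None
    return cov / sqrt(vx * vy)


def co_missing_stats(mask_a, mask_b):
    """Absence correlation, co-missing row count and Jaccard overlap of two masks."""
    both = sum(1 for a, b in zip(mask_a, mask_b) if a and b)
    either = sum(1 for a, b in zip(mask_a, mask_b) if a or b)
    return {
        "corr": pearson(mask_a, mask_b),
        "co_missing": both,
        "jaccard": both / either if either else None,
    }


# Hypothetical 10-row sample: 1 = value is missing, 0 = value is present.
age = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
cabin = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
stats = co_missing_stats(age, cabin)
print(stats)
```

In this made-up sample the absence correlation is about 0.25 and the Jaccard overlap is also 0.25, so the pair would fall under the 0.30 `_CORR_STRONG` threshold yet above the 0.20 `_JACCARD_NOTABLE` threshold defined in `chapters/missingness.py`. That is the "partial co-occurrence" case the chapter singles out when one column is missing far more often than the other.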
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-30 20:38:39 +02:00
32 changed files with 2624 additions and 1422 deletions
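The push-down mentioned in the commit message (one query that returns a 0/1 is-null flag per column and row) might look roughly like the sketch below. The real `extract_null_mask` is not shown in this diff, so `quote_ident`, `build_null_mask_sql` and `rows_to_mask` are hypothetical names. To stay dependency-free, the demo runs the generated SQL on the standard-library `sqlite3` module as a stand-in for DuckDB; the `CAST(... IS NULL AS INTEGER)` form is plain SQL that both engines accept.

```python
import sqlite3


def quote_ident(name):
    """Quote an SQL identifier, doubling any embedded double quotes."""
    return '"' + str(name).replace('"', '""') + '"'


def build_null_mask_sql(table, columns, max_rows=5000):
    """Build one SELECT that returns a 0/1 is-null flag per column and row."""
    flags = ", ".join(
        f"CAST({quote_ident(c)} IS NULL AS INTEGER) AS {quote_ident(c)}"
        for c in columns
    )
    # LIMIT without ORDER BY takes the first rows the engine returns,
    # which is a cap on the sample size, not a random sample.
    return f"SELECT {flags} FROM {quote_ident(table)} LIMIT {int(max_rows)}"


def rows_to_mask(columns, rows):
    """Pivot row tuples into the {column: [0/1, ...]} shape the chapter uses."""
    return {c: [int(r[i]) for r in rows] for i, c in enumerate(columns)}


# Stand-in database (sqlite3 instead of DuckDB) with one numeric and one text column.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (num1 REAL, cat TEXT)")
con.executemany(
    "INSERT INTO t VALUES (?, ?)", [(None, None), (1.0, "c1"), (2.0, None)]
)
cols = ["num1", "cat"]
mask = rows_to_mask(cols, con.execute(build_null_mask_sql("t", cols)).fetchall())
con.close()
print(mask)
```

Because the flag is computed in SQL for every column, text columns are covered in the same pass as numeric ones, which is the property the categorical-column test in this commit checks.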
@@ -72,10 +72,8 @@ from .profile_datetime import profile_datetime
 from .resample_timeseries import resample_timeseries
 from .add_pdf_internal_links import add_pdf_internal_links
 from .suggest_intratable_fk_candidates import suggest_intratable_fk_candidates
-from .draw_join_graph_figure import draw_join_graph_figure

 __all__ = [
-    "draw_join_graph_figure",
    "suggest_intratable_fk_candidates",
    "detect_time_column",
    "extract_timeseries_raw",
@@ -0,0 +1,594 @@
+"""Missingness chapter (MISSINGNESS) — patterns of missing data.
+
+Complements the CALIDAD chapter: where CALIDAD reports *how much* is missing per
+column (the null percentage that lowers the completeness score), this chapter
+reports the **pattern** of the missing data — whether columns tend to be missing
+*together* (co-occurrence of absences) or independently. That distinction is what
+separates data that is missing completely at random ([[term:mcar]]MCAR[[/term]])
+from data missing as a function of another variable ([[term:mar]]MAR[[/term]]),
+which is the key question to settle before imputing or modelling.
+
+The chapter activates only when the table actually has missing data (at least one
+column with a null in the aggregated profile); otherwise it returns ``None`` and
+disappears from the document.
+
+Sections, in order:
+
+1. **Resumen global** — % of missing cells in the dataset, number of columns with
+   nulls, and complete rows (no missing) vs incomplete rows (≥1 missing).
+2. **Ranking por columna** — columns sorted by their null percentage, with a
+   horizontal bar figure.
+3. **Co-ocurrencia de ausencias** — the correlation of the binary is-null masks
+   between columns (which columns tend to be missing together): a heatmap plus a
+   table of the top column pairs that co-miss.
+4. **Patrones de fila** — the most frequent "which columns are missing together"
+   row patterns, in the style of missingno's pattern matrix.
+5. **Lectura MCAR/MAR** — an interpretive, *exploratory* note (not a confirmatory
+   test such as Little's) reading the absence correlations as a hint of MCAR
+   (independent absences) vs MAR (co-occurring absences).
+
+The aggregate per-column null counts come from the ``eda`` group ``TableProfile``
+(``columns[i]['null_count'] / 'null_pct'`` and the table-level ``null_cell_pct``).
+The per-row is-null mask needed for co-occurrence is built from raw data: a single
+DuckDB push-down over ``ctx['db_path'] / ctx['table']`` (same pattern as the
+AGREGACION chapter) covering ALL columns, with a fallback to the numeric-only
+``ctx['raw_numeric']`` when no database is reachable. All the heavy lifting is
+delegated to pure registry functions (``missingness_overview``,
+``missingness_correlation``, ``missingness_row_patterns``) and two figure helpers
+(``missingness_rank_bar_figure``, ``missingness_corr_heatmap_figure``). Each one
+is imported lazily inside a ``try`` block and, on any failure, degrades to
+``None`` (the block is skipped or replaced by a note) or to a placeholder
+figure, so a missing or broken helper does not make this chapter raise.
+
+Contract: build_<id>(profile, ctx) -> Chapter | None ; CHAPTER_VERSION = "x.y.z".
+"""
+
+from __future__ import annotations
+
+from .. import model
+
+CHAPTER_VERSION = "1.0.0"
+CHAPTER_ID = "missingness"
+CHAPTER_TITLE = "Datos faltantes"
+
+# Sample cap for the per-row is-null mask push-down. Co-occurrence and row
+# patterns are computed on this sample; the global % of missing cells and the
+# per-column ranking come from the (exact) aggregated profile instead.
+MASK_SAMPLE = 5000
+# Thresholds for the MCAR/MAR heuristic note. A pair counts as a *strong*
+# co-occurrence when the absence correlation alone is high; as a *partial*
+# co-occurrence when the absences overlap materially (high Jaccard) even if the
+# Pearson correlation is modest — the usual case when one column is missing far
+# more often than the other (e.g. Cabin 77% vs Age 20% in Titanic), which dilutes
+# the correlation while the rows still co-miss in absolute terms.
+_CORR_STRONG = 0.30
+_JACCARD_NOTABLE = 0.20
+# Rows shown in the top-pairs and row-patterns tables (bounded, never silently
+# truncated: the table note reports the full count).
+_TOP_PAIRS = 12
+_TOP_PATTERNS = 12
+# Truncate long column names in tables (the renderer also wraps).
+_LABEL_MAX = 28
+
+# Glossary terms this chapter explains (contract §11.1). Registered in the shared
+# collector and marked clickable on their first appearance.
+_TERMS = {
+    "missingness": (
+        "Patrón de datos faltantes (missingness)",
+        "El patrón con el que faltan los datos: cuánto falta, en qué columnas y "
+        "si las ausencias de unas columnas coinciden (co-ocurren) con las de "
+        "otras. Analizarlo —no solo contar nulos— distingue datos que faltan al "
+        "azar (MCAR) de los que faltan en función de otra variable (MAR), lo que "
+        "decide cómo imputar o si descartar filas sin sesgar el análisis.",
+    ),
+    "mcar": (
+        "MCAR (Missing Completely At Random)",
+        "Los valores faltan de forma independiente de cualquier dato, observado o "
+        "no: las ausencias de unas columnas no se relacionan entre sí ni con los "
+        "valores. Es el caso más benigno —descartar filas o imputar la media no "
+        "introduce sesgo—, pero rara vez se cumple del todo en datos reales.",
+    ),
+    "mar": (
+        "MAR (Missing At Random)",
+        "La probabilidad de que un valor falte depende de OTRAS variables "
+        "observadas (p. ej. una medición que falta más en cierto grupo). Las "
+        "ausencias co-ocurren entre columnas o se relacionan con los valores de "
+        "otras; imputar exige condicionar en esas variables para no sesgar. La "
+        "co-ocurrencia fuerte de ausencias es un indicio (exploratorio) de MAR.",
+    ),
+}
+
+
+# --------------------------------------------------------------------------- #
+# Small defensive formatters (own copy: the chapter never imports siblings).
+# --------------------------------------------------------------------------- #
+def _fmt_int(value) -> str:
+    if value is None:
+        return "—"
+    try:
+        return f"{int(round(float(value))):,}".replace(",", ".")
+    except (TypeError, ValueError):
+        return model._safe_str(value)
+
+
+def _fmt_pct(value, decimals: int = 1) -> str:
+    """Format an already-0-100 value as a percentage. None -> placeholder."""
+    if value is None:
+        return "—"
+    try:
+        return f"{float(value):.{decimals}f}%"
+    except (TypeError, ValueError):
+        return model._safe_str(value)
+
+
+def _fmt_num(value, decimals: int = 3) -> str:
+    if value is None:
+        return "—"
+    try:
+        f = float(value)
+    except (TypeError, ValueError):
+        return model._safe_str(value)
+    if f != f:  # NaN
+        return "—"
+    text = f"{f:.{decimals}f}".rstrip("0").rstrip(".")
+    return text if text else "0"
+
+
+def _truncate(text, limit: int = _LABEL_MAX) -> str:
+    s = model._safe_str(text)
+    if len(s) <= limit:
+        return s
+    return s[: max(1, limit - 1)].rstrip() + "…"
+
+
+def _term(key: str, label: str, mark: bool) -> str:
+    if mark:
+        return f"[[term:{key}]]**{label}**[[/term]]"
+    return f"**{label}**"
+
+
+# --------------------------------------------------------------------------- #
+# Profile reads (exact, all rows).
+# --------------------------------------------------------------------------- #
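+def _null_count_of(col: dict):
+    """Best-effort null count of a column: ``null_count`` or null_pct*n_rows.
+
+    ``null_pct`` may arrive as a 0-1 fraction or as a 0-100 percentage; values
+    above 1 are read as percentages, matching ``_columns_with_nulls``."""
+    nc = col.get("null_count")
+    if isinstance(nc, (int, float)) and not isinstance(nc, bool):
+        return int(nc)
+    np_ = col.get("null_pct")
+    nr = col.get("n_rows")
+    if (isinstance(np_, (int, float)) and not isinstance(np_, bool)
+            and isinstance(nr, (int, float)) and not isinstance(nr, bool)):
+        frac = float(np_) / 100.0 if np_ > 1.0 else float(np_)
+        return int(round(frac * float(nr)))
+    return 0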
+
+
+def _columns_with_nulls(profile: dict):
+    """Return ``[(name, null_count, null_pct_0_100)]`` for columns with nulls,
+    sorted by null percentage descending. Reads the aggregated profile (exact)."""
+    cols = profile.get("columns") or []
+    out = []
+    for c in cols:
+        if not isinstance(c, dict):
+            continue
+        nc = _null_count_of(c)
+        if nc <= 0:
+            continue
+        np_ = c.get("null_pct")
+        nr = c.get("n_rows") or profile.get("n_rows")
+        if isinstance(np_, (int, float)) and not isinstance(np_, bool):
+            pct = float(np_) * 100.0 if np_ <= 1.0 else float(np_)
+        elif nr:
+            pct = nc / float(nr) * 100.0
+        else:
+            pct = None
+        out.append((c.get("name") or "(col)", nc, pct))
+    out.sort(key=lambda t: (t[2] if t[2] is not None else -1.0), reverse=True)
+    return out
+
+
+def _global_missing_pct(profile: dict):
+    """Table-level % of missing cells (0-100), exact, from the profile."""
+    v = profile.get("null_cell_pct")
+    if isinstance(v, (int, float)) and not isinstance(v, bool):
+        return float(v) * 100.0 if v <= 1.0 else float(v)
+    return None
+
+
+# --------------------------------------------------------------------------- #
+# Per-row is-null mask (sample): DuckDB push-down, fallback to raw_numeric.
+# --------------------------------------------------------------------------- #
+def _build_query_fn(ctx: dict):
+    """Return ``(query_fn, table)`` for a DuckDB-backed ctx, or ``(None, None)``.
+
+    Mirrors build_eda_render_ctx: a read-only closure over the registry wrapper.
+    Only DuckDB is supported here; any other backend degrades to raw_numeric."""
+    db_path = ctx.get("db_path")
+    table = ctx.get("table")
+    if not db_path or not table:
+        return None, None
+    try:
+        from infra import duckdb_query_readonly
+    except Exception:  # noqa: BLE001 — wrapper unavailable -> degrade.
+        return None, None
+
+    def query_fn(sql):
+        return duckdb_query_readonly(db_path, sql)
+
+    return query_fn, table
+
+
+def _null_mask(profile: dict, ctx: dict):
+    """Build the per-row is-null mask ``{col: [0/1, ...]}``.
+
+    Tries a single DuckDB push-down over ALL columns first (so categorical
+    columns like Cabin are covered, not only numeric ones); falls back to the
+    numeric-only ``ctx['raw_numeric']`` (None -> missing); returns ``(None, 0,
+    None)`` when neither is reachable. Never raises.
+    Returns ``(mask, n_sampled, source)`` with source in {"db","raw_numeric"}.
+    """
+    cols = profile.get("columns") or []
+    names = [c.get("name") for c in cols
+             if isinstance(c, dict) and c.get("name")]
+    # 1) DuckDB push-down over every column (covers categoricals too).
+    query_fn, table = _build_query_fn(ctx)
+    if query_fn is not None and names:
+        try:
+            from datascience.extract_null_mask import extract_null_mask
+
+            res = extract_null_mask(query_fn, table, names, max_rows=MASK_SAMPLE)
+            if isinstance(res, dict) and res.get("status") == "ok":
+                mask = res.get("mask") or {}
+                if mask:
+                    return mask, int(res.get("n") or 0), "db"
+        except Exception:  # noqa: BLE001 — degrade to raw_numeric.
+            pass
+    # 2) Fallback: numeric-only mask derived from raw_numeric (None -> missing).
+    rn = ctx.get("raw_numeric")
+    if isinstance(rn, dict) and rn:
+        mask = {}
+        for col, vals in rn.items():
+            if isinstance(vals, (list, tuple)):
+                mask[col] = [1 if v is None else 0 for v in vals]
+        if mask:
+            n = max((len(v) for v in mask.values()), default=0)
+            return mask, n, "raw_numeric"
+    return None, 0, None
+
+
+# --------------------------------------------------------------------------- #
+# Lazy registry delegations (each degrades to None on any failure).
+# --------------------------------------------------------------------------- #
+def _overview(mask: dict):
+    try:
+        from datascience.missingness_overview import missingness_overview
+
+        out = missingness_overview(mask)
+        return out if isinstance(out, dict) else None
+    except Exception:  # noqa: BLE001
+        return None
+
+
+def _correlation(mask: dict, top_k: int):
+    try:
+        from datascience.missingness_correlation import missingness_correlation
+
+        out = missingness_correlation(mask, top_k=top_k)
+        return out if isinstance(out, dict) else None
+    except Exception:  # noqa: BLE001
+        return None
+
+
+def _row_patterns(mask: dict, top_n: int):
+    try:
+        from datascience.missingness_row_patterns import missingness_row_patterns
+
+        out = missingness_row_patterns(mask, top_n=top_n)
+        return out if isinstance(out, dict) else None
+    except Exception:  # noqa: BLE001
+        return None
+
+
+def _rank_bar_make(names, pcts, title):
+    def make():
+        try:
+            from datascience.missingness_rank_bar_figure import (
+                missingness_rank_bar_figure,
+            )
+
+            return missingness_rank_bar_figure(names, pcts, title=title)
+        except Exception:  # noqa: BLE001 — minimal fallback figure.
+            return _fallback_fig("ranking de nulos no disponible")
+
+    return make
+
+
+def _heatmap_make(matrix, labels, title):
+    def make():
+        try:
+            from datascience.missingness_corr_heatmap_figure import (
+                missingness_corr_heatmap_figure,
+            )
+
+            return missingness_corr_heatmap_figure(matrix, labels, title=title)
+        except Exception:  # noqa: BLE001 — minimal fallback figure.
+            return _fallback_fig("heatmap de co-ocurrencia no disponible")
+
+    return make
+
+
+def _fallback_fig(message: str):
+    import matplotlib
+
+    matplotlib.use("Agg")
+    from matplotlib.figure import Figure
+
+    fig = Figure(figsize=(5.0, 2.2))
+    ax = fig.add_subplot(111)
+    ax.text(0.5, 0.5, message, ha="center", va="center")
+    ax.axis("off")
+    return fig
+
+
+# --------------------------------------------------------------------------- #
+# Block builders.
+# --------------------------------------------------------------------------- #
+def _summary_block(profile: dict, with_nulls: list, overview, sampled, n_total):
+    rows = []
+    gpct = _global_missing_pct(profile)
+    rows.append(("Celdas faltantes (global)", _fmt_pct(gpct)))
+    rows.append(("Columnas con faltantes", str(len(with_nulls))))
+    all_null = profile.get("all_null_cols")
+    if isinstance(all_null, (list, tuple)) and all_null:
+        rows.append(("Columnas 100% faltantes", str(len(all_null))))
+    if isinstance(overview, dict):
+        cr = overview.get("complete_rows")
+        ir = overview.get("incomplete_rows")
+        suffix = ""
+        if (isinstance(sampled, int) and isinstance(n_total, (int, float))
+                and sampled and n_total and sampled < n_total):
+            suffix = f" (sobre muestra de {_fmt_int(sampled)} filas)"
+        if cr is not None:
+            rows.append(("Filas completas (sin faltantes)",
+                         f"{_fmt_int(cr)} ({_fmt_pct(overview.get('complete_pct'))})"
+                         + suffix))
+        if ir is not None:
+            rows.append(("Filas con ≥1 faltante",
+                         f"{_fmt_int(ir)} "
+                         f"({_fmt_pct(overview.get('incomplete_pct'))})" + suffix))
+    return model.KVTable(rows=rows, title="Resumen de datos faltantes")
+
+
+def _ranking_block(with_nulls: list):
+    header = ["Columna", "Faltantes", "% faltante"]
+    rows = [[_truncate(n), _fmt_int(c), _fmt_pct(p)] for (n, c, p) in with_nulls]
+    if not rows:
+        return None
+    return model.DataTable(
+        header=header, rows=rows, title="Faltantes por columna",
+        note="ordenado de más a menos faltante")
+
+
+def _ranking_figure(with_nulls: list):
+    names = [n for (n, _, p) in with_nulls if p is not None]
+    pcts = [p for (_, _, p) in with_nulls if p is not None]
+    if not names:
+        return None
+    return model.Figure(
+        make=_rank_bar_make(names, pcts, "% de valores faltantes por columna"),
+        caption="Porcentaje de valores faltantes por columna (barras).")
+
+
+def _pairs_block(corr: dict):
+    """Top column pairs whose absences co-occur, as a table, or None."""
+    pairs = (corr or {}).get("pairs") or []
+    header = ["Columna A", "Columna B", "Corr. ausencia", "Co-faltan", "Jaccard"]
+    rows = []
+    for p in pairs[:_TOP_PAIRS]:
+        if not isinstance(p, dict):
+            continue
+        rows.append([
+            _truncate(p.get("a")),
+            _truncate(p.get("b")),
+            _fmt_num(p.get("corr")),
+            _fmt_int(p.get("co_missing")),
+            _fmt_num(p.get("jaccard")),
+        ])
+    if not rows:
+        return None
+    shown = len(rows)
+    total = len(pairs)
+    note = ("correlación de las máscaras is-null entre columnas; "
+            "«Co-faltan» = nº de filas en que ambas faltan a la vez")
+    if total > shown:
+        note += f" — top {shown} de {total} pares"
+    return model.DataTable(header=header, rows=rows,
+                           title="Pares de columnas que co-faltan", note=note)
+
+
+def _heatmap_block(corr: dict):
+    cols = (corr or {}).get("columns") or []
+    matrix = (corr or {}).get("matrix") or []
+    if len(cols) < 2 or not matrix:
+        return None
+    labels = [_truncate(c, 16) for c in cols]
+    return model.Figure(
+        make=_heatmap_make(matrix, labels, "Co-ocurrencia de ausencias"),
+        caption=("Correlación de las ausencias entre columnas (azul = faltan "
+                 "juntas; rojo = cuando una falta la otra tiende a estar)."))
+
+
+def _patterns_block(patterns_res: dict):
+    patterns = (patterns_res or {}).get("patterns") or []
+    header = ["Columnas que faltan juntas", "Filas", "%"]
+    rows = []
+    for p in patterns[:_TOP_PATTERNS]:
+        if not isinstance(p, dict):
+            continue
+        cols = p.get("missing_cols") or []
+        if cols:
+            label = ", ".join(_truncate(c, 18) for c in cols)
+        else:
+            label = "(fila completa — sin faltantes)"
+        rows.append([label, _fmt_int(p.get("n_rows")), _fmt_pct(p.get("pct"))])
+    if not rows:
+        return None
+    total = (patterns_res or {}).get("n_patterns")
+    shown = len(rows)
+    note = "cada fila es un patrón de «qué columnas faltan juntas»"
+    if isinstance(total, int) and total > shown:
+        note += f" — top {shown} de {total} patrones distintos"
+    return model.DataTable(header=header, rows=rows,
+                           title="Patrones de fila más comunes", note=note)
+
+
+def _mcar_mar_note(corr: dict, mark: bool):
+    """Interpretive, exploratory MCAR/MAR note from the absence correlations.
+
+    Reads the absence correlations at two levels so the verdict never contradicts
+    the visible evidence: a *strong* correlation flags a clear non-random (MAR)
+    pattern; a *partial* overlap (many rows co-miss — high Jaccard — even if the
+    correlation is diluted by one column being missing far more often) flags a
+    localized possible-MAR and cites the concrete co-missing pair; only when
+    neither holds does it read the absences as compatible with MCAR."""
+
+    def _pairs_with(attr_ok):
+        out = []
+        for p in (corr or {}).get("pairs") or []:
+            if isinstance(p, dict) and attr_ok(p):
+                out.append(p)
+        return out
+
+    def _cf(v):
+        try:
+            return float(v)
+        except (TypeError, ValueError):
+            return 0.0
+
+    strong = _pairs_with(lambda p: abs(_cf(p.get("corr"))) >= _CORR_STRONG)
+    partial = _pairs_with(
+        lambda p: _cf(p.get("corr")) > 0 and _cf(p.get("jaccard")) >= _JACCARD_NOTABLE)
+    mcar = _term("mcar", "MCAR", mark)
+    mar = _term("mar", "MAR", mark)
+    head = (
+        "**Lectura exploratoria MCAR/MAR.** Esta es una heurística basada en la "
+        "correlación de las ausencias entre columnas, NO un test confirmatorio "
+        "(como el de Little); orienta, no demuestra. ")
+    if strong:
+        top = max(strong, key=lambda p: abs(_cf(p.get("corr"))))
+        ev = (f"«{model._safe_str(top.get('a'))}» y "
+              f"«{model._safe_str(top.get('b'))}» "
+              f"(corr {_fmt_num(top.get('corr'))})")
+        body = (
+            f"Hay ausencias que co-ocurren con fuerza —{ev}—: las columnas no "
+            f"faltan de forma independiente, lo que es un indicio de un patrón no "
+            f"aleatorio ({mar}). Antes de imputar o descartar filas conviene "
+            f"comprobar si la ausencia depende de otra variable observada; en ese "
+            f"caso la imputación debería condicionar en ella para no sesgar.")
+    elif partial:
+        top = max(partial, key=lambda p: _cf(p.get("jaccard")))
+        ev = (f"«{model._safe_str(top.get('a'))}» y "
+              f"«{model._safe_str(top.get('b'))}» faltan a la vez en "
+              f"{_fmt_int(top.get('co_missing'))} filas "
+              f"(Jaccard {_fmt_num(top.get('jaccard'))})")
+        body = (
+            f"Hay co-ocurrencia parcial de ausencias —{ev}—: algunas columnas "
+            f"tienden a faltar juntas aunque la correlación global sea modesta "
+            f"(habitual cuando una columna falta mucho más que la otra). Es un "
+            f"indicio de un posible patrón localizado no aleatorio ({mar}); "
+            f"conviene revisar si esa ausencia depende de otra variable observada "
+            f"antes de imputar, en lugar de asumir que faltan al azar.")
+    else:
+        body = (
+            f"Las ausencias entre columnas no muestran correlación ni solape "
+            f"relevante: parecen independientes, lo que es compatible con que "
+            f"falten al azar ({mcar}). Aun así, la ausencia podría depender de "
+            f"variables no observadas (la heurística no lo descarta).")
+    return model.Markdown(text=head + body)
+
+
+def _intro_block(mark: bool, source):
+    missingness = _term("missingness", "missingness", mark)
+    text = (
+        f"Este capítulo analiza el {missingness} de la tabla: no solo cuánto "
+        "falta (eso lo cubre la calidad), sino DÓNDE falta y si las columnas "
+        "faltan juntas. La co-ocurrencia de ausencias se calcula sobre la matriz "
+        "binaria «is-null» por fila.")
+    if source == "raw_numeric":
+        text += (" Nota: no se pudo leer la tabla cruda completa, así que la "
+                 "co-ocurrencia se limita a las columnas numéricas disponibles.")
+    return model.Markdown(text=text)
+
+
+# --------------------------------------------------------------------------- #
+# Entry point.
+# --------------------------------------------------------------------------- #
+def build_missingness(profile: dict, ctx: dict):
+    """Build the missingness Chapter, or None if the table has no missing data."""
+    if not isinstance(profile, dict):
+        profile = {}
+    ctx = ctx or {}
+
+    with_nulls = _columns_with_nulls(profile)
+    if not with_nulls:
+        return None  # no missing data anywhere -> chapter does not apply.
+
+    # Register glossary terms (if a collector is present) and mark them clickable.
+    glossary = ctx.get("glossary")
+    mark = False
+    if isinstance(glossary, model.GlossaryCollector):
+        for key, (label, definition) in _TERMS.items():
+            glossary.add(key, label, definition)
+        mark = True
+
+    # Per-row is-null mask (sample) for co-occurrence and row patterns.
+    mask, sampled, source = _null_mask(profile, ctx)
+    overview = _overview(mask) if mask else None
+    n_total = profile.get("n_rows")
+
+    blocks = [
+        model.Heading(text="Cuánto y dónde faltan datos", level=2),
+        _intro_block(mark, source),
+        _summary_block(profile, with_nulls, overview, sampled, n_total),
+        model.Heading(text="Faltantes por columna", level=2),
+    ]
+    ranking = _ranking_block(with_nulls)
+    if ranking is not None:
+        blocks.append(ranking)
+    rank_fig = _ranking_figure(with_nulls)
+    if rank_fig is not None:
+        blocks.append(rank_fig)
+
+    # Co-occurrence + row patterns need the per-row mask. Without it, say so.
+    if not mask:
+        blocks.append(model.Note(
+            "No se pudo construir la matriz «is-null» por fila (sin acceso a los "
+            "datos crudos), así que no se analiza la co-ocurrencia de ausencias "
+            "ni los patrones de fila en este informe."))
+        return model.Chapter(id=CHAPTER_ID, title=CHAPTER_TITLE,
+                             version=CHAPTER_VERSION, blocks=blocks)
+
+    corr = _correlation(mask, _TOP_PAIRS) or {}
+    co_blocks = [model.Heading(text="Co-ocurrencia de ausencias", level=2)]
+    heatmap = _heatmap_block(corr)
+    if heatmap is not None:
+        co_blocks.append(heatmap)
+    pairs = _pairs_block(corr)
+    if pairs is not None:
+        co_blocks.append(pairs)
+    if heatmap is None and pairs is None:
+        co_blocks.append(model.Note(
+            "Ninguna pareja de columnas comparte ausencias con variación "
+            "suficiente para correlacionarlas (p. ej. una sola columna con "
+            "faltantes), así que no hay co-ocurrencia que mostrar."))
+    # Keep the co-occurrence heading next to its heatmap and table.
+    blocks.append(model.Group(blocks=co_blocks))
+
+    patterns_res = _row_patterns(mask, _TOP_PATTERNS) or {}
+    patterns = _patterns_block(patterns_res)
+    if patterns is not None:
+        blocks.append(model.Heading(text="Patrones de fila", level=2))
+        blocks.append(patterns)
+
+    blocks.append(model.Heading(text="Lectura MCAR / MAR", level=2))
+    blocks.append(_mcar_mar_note(corr, mark))
+
+    return model.Chapter(id=CHAPTER_ID, title=CHAPTER_TITLE,
+                         version=CHAPTER_VERSION, blocks=blocks)
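The row-pattern table built by `_patterns_block` consumes a result with `patterns` (each holding `missing_cols`, `n_rows`, `pct`) and `n_patterns`. The function that produces it, `missingness_row_patterns`, is not part of the visible diff, so the following is a minimal sketch of one way such a result could be computed. The name `row_patterns` and the sample mask are hypothetical.

```python
from collections import Counter


def row_patterns(mask, top_n=12):
    """Count how often each set of missing columns occurs across rows.

    `mask` maps column name -> list of 0/1 flags (1 = missing). The output
    keys mirror the ones read by `_patterns_block` in the chapter.
    """
    columns = list(mask)
    # Use the shortest column so ragged input cannot raise an IndexError.
    n = min((len(v) for v in mask.values()), default=0)
    counts = Counter(
        tuple(c for c in columns if mask[c][i]) for i in range(n)
    )
    patterns = [
        {"missing_cols": list(cols), "n_rows": k, "pct": 100.0 * k / n}
        for cols, k in counts.most_common(top_n)
    ]
    return {"patterns": patterns, "n_patterns": len(counts), "n": n}


# Hypothetical mask: num1 and cat are missing together in the first 4 of 10 rows.
sample = {
    "num1": [1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
    "num2": [0] * 10,
    "cat": [1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
}
res = row_patterns(sample)
for p in res["patterns"]:
    print(p["missing_cols"], p["n_rows"], p["pct"])
```

The empty pattern (no missing columns) is kept on purpose, since the chapter renders it as the "complete row" line of the table.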
@@ -0,0 +1,162 @@
+"""Tests for the MISSINGNESS chapter.
+
+Covers the Definition of Done for this chapter:
+  * Activates (non-None Chapter with the expected sections) when the profile has
+    missing data, building the co-occurrence from the per-row is-null mask.
+  * Returns None when the table has no missing data at all (edge case).
+  * Registers the MCAR/MAR/missingness glossary terms.
+  * The DuckDB push-down path covers categorical columns (not only numeric),
+    so a categorical column that co-misses with a numeric one is detected.
+"""
+
+import os
+import sys
+
+_HERE = os.path.dirname(os.path.abspath(__file__))
+_FUNCTIONS = os.path.abspath(os.path.join(_HERE, "..", "..", ".."))  # python/functions
+if _FUNCTIONS not in sys.path:
+    sys.path.insert(0, _FUNCTIONS)
+
+from datascience.automatic_eda import model  # noqa: E402
+from datascience.automatic_eda.chapters.missingness import (  # noqa: E402
+    build_missingness,
+)
+
+
+def _titles(chapter):
+    """Collect heading texts and table/figure titles for assertions."""
+    out = []
+    for b in chapter.blocks:
+        kind = getattr(b, "kind", None)
+        if kind == "heading":
+            out.append(("heading", getattr(b, "text", "")))
+        elif kind in ("data_table", "kv_table"):
+            out.append((kind, getattr(b, "title", "")))
+        elif kind == "group":
+            for inner in getattr(b, "blocks", []):
+                ik = getattr(inner, "kind", None)
+                if ik == "heading":
+                    out.append(("heading", getattr(inner, "text", "")))
+                elif ik in ("data_table", "kv_table"):
+                    out.append((ik, getattr(inner, "title", "")))
+                elif ik == "figure":
+                    out.append(("figure", getattr(inner, "caption", "")))
+        elif kind == "figure":
+            out.append(("figure", getattr(b, "caption", "")))
+    return out
+
+
+def _all_text(chapter):
+    """Join every text-bearing attribute of the blocks, recursing into groups."""
+    parts = []
+
+    def walk(blocks):
+        for b in blocks:
+            for attr in ("text", "title", "note", "caption"):
+                v = getattr(b, attr, None)
+                if v:
+                    parts.append(str(v))
+            if getattr(b, "kind", None) == "group":
+                walk(getattr(b, "blocks", []))
+
+    walk(chapter.blocks)
+    return "\n".join(parts)
+
+
+def test_returns_none_when_no_missing_data():
+    profile = {
+        "n_rows": 4,
+        "null_cell_pct": 0.0,
+        "columns": [
+            {"name": "a", "null_count": 0, "null_pct": 0.0, "n_rows": 4},
+            {"name": "b", "null_count": 0, "null_pct": 0.0, "n_rows": 4},
+        ],
+    }
+    assert build_missingness(profile, {}) is None
+
+
+def test_activates_with_cooccurrence_via_raw_numeric():
+    # a and b are missing in EXACTLY the same rows (0,1,2) -> perfect absence
+    # correlation. c has no nulls. No db_path -> the chapter falls back to the
+    # numeric raw_numeric mask.
+    profile = {
+        "n_rows": 6,
+        "null_cell_pct": (0.5 + 0.5 + 0.0) / 3.0,
+        "columns": [
+            {"name": "a", "null_count": 3, "null_pct": 0.5, "n_rows": 6},
+            {"name": "b", "null_count": 3, "null_pct": 0.5, "n_rows": 6},
+            {"name": "c", "null_count": 0, "null_pct": 0.0, "n_rows": 6},
+        ],
+    }
+    glossary = model.GlossaryCollector()
+    ctx = {
+        "raw_numeric": {
+            "a": [None, None, None, 1.0, 2.0, 3.0],
+            "b": [None, None, None, 4.0, 5.0, 6.0],
+        },
+        "glossary": glossary,
+    }
+    ch = build_missingness(profile, ctx)
+    assert ch is not None
+    assert ch.id == "missingness"
+    assert ch.blocks
+
+    titles = _titles(ch)
+    headings = {t for (k, t) in titles if k == "heading"}
+    # Core sections present.
+    assert any("Cuánto y dónde" in h for h in headings)
+    assert any("Faltantes por columna" in h for h in headings)
+    assert any("Co-ocurrencia" in h for h in headings)
+    assert any("MCAR" in h for h in headings)
+    # At least one KVTable (summary), one DataTable (ranking / pairs) and one
+    # figure (ranking bars / co-occurrence heatmap) are present.
+    kinds = {k for (k, _) in titles}
+    assert "kv_table" in kinds
+    assert "data_table" in kinds
+    assert "figure" in kinds
+
+    # Glossary terms registered.
+    keys = {t["key"] for t in glossary.terms()}
+    assert {"missingness", "mcar", "mar"} <= keys
+
+    # The MCAR/MAR note reads the co-occurrence; with a perfect overlap it must
+    # flag the non-random (MAR) reading.
+    text = _all_text(ch)
+    assert "MAR" in text
+
+
+def test_db_pushdown_covers_categorical_column(tmp_path):
+    """The is-null mask push-down must cover a categorical column, so a
+    categorical that co-misses with a numeric one shows up in the pairs."""
+    import duckdb
+
+    db = str(tmp_path / "miss.duckdb")
+    con = duckdb.connect(db)
+    con.execute("CREATE TABLE t (num1 DOUBLE, num2 DOUBLE, cat VARCHAR)")
+    # num1 and cat are NULL together in the first 4 of 10 rows; num2 never null.
+    rows = []
+    for i in range(10):
+        if i < 4:
+            rows.append((None, float(i), None))
+        else:
+            rows.append((float(i), float(i), f"c{i}"))
+    con.executemany("INSERT INTO t VALUES (?,?,?)", rows)
+    con.close()
+
+    profile = {
+        "n_rows": 10,
+        "null_cell_pct": (0.4 + 0.0 + 0.4) / 3.0,
+        "columns": [
+            {"name": "num1", "null_count": 4, "null_pct": 0.4, "n_rows": 10},
+            {"name": "num2", "null_count": 0, "null_pct": 0.0, "n_rows": 10},
+            {"name": "cat", "null_count": 4, "null_pct": 0.4, "n_rows": 10},
+        ],
+    }
+    ctx = {"db_path": db, "table": "t", "glossary": model.GlossaryCollector()}
+    ch = build_missingness(profile, ctx)
+    assert ch is not None
+
+    # The pairs table must mention both num1 and cat (they co-miss perfectly),
+    # which is only possible if the mask covered the categorical column.
+    text = _all_text(ch)
+    assert "num1" in text and "cat" in text
+    # Co-occurrence section + a pairs data table exist.
+    titles = _titles(ch)
+    assert any("co-faltan" in (t or "").lower() for (k, t) in titles)
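The `raw_numeric` fallback exercised by the tests above can be pictured with a small standalone sketch. It is not the chapter's code: `is_null_mask` and `pearson` are hypothetical helpers (the commit's own `missingness_correlation` is not shown in this diff), and treating only `None` as missing is an assumption.

```python
from math import sqrt


def is_null_mask(values):
    """Hypothetical helper: 1 where the value is None, 0 otherwise."""
    return [1 if v is None else 0 for v in values]


def pearson(xs, ys):
    """Plain Pearson correlation; returns None when either side is constant."""
    n = len(xs)
    if n == 0 or n != len(ys):
        return None
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None  # a column with no nulls (or all nulls) has no defined correlation
    return sxy / sqrt(sxx * syy)


# Same shape as the test fixture: a and b are missing in exactly rows 0, 1, 2.
raw_numeric = {
    "a": [None, None, None, 1.0, 2.0, 3.0],
    "b": [None, None, None, 4.0, 5.0, 6.0],
}
masks = {col: is_null_mask(vals) for col, vals in raw_numeric.items()}
r = pearson(masks["a"], masks["b"])
```

With identical masks the correlation is at its maximum, which is the situation the test describes as a perfect overlap. A column without nulls has a constant mask, so the sketch returns `None` for it instead of dividing by zero.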
@@ -32,6 +32,7 @@ CHAPTER_ORDER = [
    "num_distr",     # numeric distributions
    "cat_distr",     # categorical distributions
    "calidad",       # data quality
+    "missingness",   # missing-data patterns (co-occurrence of absences; MCAR/MAR)
    "correlacion",   # correlations / associations
    "relaciones",    # key relations: declared/candidate PK + FK (inter/intra-table)
    "modelos",       # cheap models (PCA/KMeans/outliers)
@@ -1,103 +0,0 @@
---
-id: draw_join_graph_figure_py_datascience
-name: draw_join_graph_figure
-kind: function
-lang: py
-domain: datascience
-version: "1.0.0"
-purity: impure
-signature: "def draw_join_graph_figure(join_graph: dict, title: str = None) -> \"matplotlib.figure.Figure\""
-description: "Rasteriza el join graph de una base (relaciones FK inter-tabla, salida de build_join_graph) a un matplotlib.figure.Figure: nodos circulares con el nombre de cada tabla (hubs en color de acento cálido, el resto neutro) y aristas dirigidas etiquetadas from_col→to_col (más la cardinalidad si viene). Es la contrapartida dibujada del string Mermaid para que el capítulo de relaciones del informe AutomaticEDA muestre un diagrama real. Layout networkx spring_layout determinista (seed=42), backend Agg sin abrir ventanas; defensivo: nunca lanza y nunca hace I/O."
-tags: [eda, plot, relations, graph, matplotlib, figure, networkx, datascience, impure]
-uses_functions: []
-uses_types: []
-returns: []
-returns_optional: false
-error_type: "error_go_core"
-imports: [matplotlib, networkx]
-example: |
-  from draw_join_graph_figure import draw_join_graph_figure
-  join_graph = {
-      "nodes": [
-          {"table": "customers", "out_degree": 0, "in_degree": 1, "role": "dimension"},
-          {"table": "orders", "out_degree": 1, "in_degree": 0, "role": "fact"},
-      ],
-      "edges": [
-          {"from_table": "orders", "from_col": "customer_id",
-           "to_table": "customers", "to_col": "id", "cardinality": "N:1"},
-      ],
-      "hubs": ["orders"],
-  }
-  fig = draw_join_graph_figure(join_graph, title="Relaciones FK")
-  fig.savefig("/tmp/join_graph.png")
-tested: true
-tests:
-  - "test_returns_figure_with_axis"
-  - "test_savefig_produces_nonempty_png"
-  - "test_empty_dict_does_not_raise_and_savefig_png"
-  - "test_none_does_not_raise_and_savefig_png"
-test_file_path: "python/functions/datascience/draw_join_graph_figure_test.py"
-file_path: "python/functions/datascience/draw_join_graph_figure.py"
-params:
-  - name: join_graph
-    desc: "Dict producido por build_join_graph. Claves: `nodes` (list[dict] con table, out_degree, in_degree, role), `edges` (list[dict] con from_table, from_col, to_table, to_col y opcional cardinality/inclusion) y `hubs` (list[str] de tablas hub a destacar en color cálido). Claves ausentes, items no-dict, None o {} se toleran (devuelve Figure con texto, sin lanzar). Los nombres de nodo se derivan también de las aristas, así que un grafo con edges pero sin nodes explícitos igual se dibuja."
-  - name: title
-    desc: "Título dibujado sobre el diagrama. Si se omite (None) se usa \"Join graph\". Default None."
-output: "Un matplotlib.figure.Figure (figsize 7x5) con un único Axes que contiene el diagrama node-link dirigido: tablas como nodos circulares etiquetados (hubs en acento cálido #DD8452, resto en azul neutro #4C72B0) y FKs como flechas dirigidas con etiqueta from_col→to_col (+ cardinalidad). Si join_graph no tiene nodos ni aristas (o es None/{}), devuelve igualmente una Figure con el texto centrado \"Sin relaciones FK detectadas.\"; ante cualquier fallo interno devuelve una Figure con un mensaje genérico (nunca lanza). El caller rasteriza/cierra la figura; la función no la muestra ni la guarda."
---
-
-## Ejemplo
-
-```python
-from draw_join_graph_figure import draw_join_graph_figure
-
-# `join_graph` es la salida de build_join_graph (nodes + edges + hubs).
-join_graph = {
-    "nodes": [
-        {"table": "customers", "out_degree": 0, "in_degree": 1, "role": "dimension"},
-        {"table": "orders", "out_degree": 2, "in_degree": 0, "role": "fact"},
-        {"table": "products", "out_degree": 0, "in_degree": 1, "role": "dimension"},
-    ],
-    "edges": [
-        {"from_table": "orders", "from_col": "customer_id",
-         "to_table": "customers", "to_col": "id", "cardinality": "N:1"},
-        {"from_table": "orders", "from_col": "product_id",
-         "to_table": "products", "to_col": "id", "cardinality": "N:1"},
-    ],
-    "hubs": ["orders"],  # `orders` se pinta en color de acento (tabla de hechos)
-}
-
-fig = draw_join_graph_figure(join_graph, title="Relaciones FK")
-
-# El renderer del informe lo rasteriza; aquí solo persistimos para inspección.
-fig.savefig("/tmp/join_graph.png")
-```
-
-## Cuando usarla
-
-Úsala en el capítulo de relaciones de un informe AutomaticEDA cuando quieras un
-diagrama **dibujado** del esquema relacional, no solo el bloque Mermaid pegable.
-Pásale directamente la salida de `build_join_graph` (`nodes` + `edges` + `hubs`)
-y obtienes una `matplotlib.figure.Figure` lista para que el renderer perezoso la
-rasterice. Es la pareja visual del string Mermaid: Mermaid sirve para pegar en
-Markdown/docs que lo soporten; esta función produce la imagen real (PNG/PDF) que
-va embebida en informes que no renderizan Mermaid.
-
-## Gotchas
-
- **Impura por matplotlib.** Fija el backend `Agg` al importar — no abre
-  ventanas ni depende de un display. Segura de llamar en lotes desde el
-  renderer.
- **Layout determinista (`seed=42`).** Usa `nx.spring_layout(G, seed=42)`, así
-  que la misma entrada produce el mismo diagrama (test reproducible). Para
-  grafos de 0/1 nodos usa una posición fija centrada en vez del spring layout.
- **No hace I/O.** No llama `plt.show()` ni guarda a disco — solo devuelve la
-  `Figure`. Quien la consume la rasteriza y la libera (`plt.close(fig)`) para no
-  acumular memoria en informes con muchas tablas.
- **Devuelve una Figure, NO un dict.** A diferencia de `build_join_graph` (que
-  devuelve el dict del grafo), esta función devuelve el objeto de figura ya
-  dibujado.
- **Defensiva, nunca lanza.** `None`, `{}`, claves ausentes o items malformados
-  se manejan sin error: en el peor caso devuelve una `Figure` con
-  "Sin relaciones FK detectadas." (vacío) o un mensaje genérico (fallo interno).
-  No la envuelvas en try/except por miedo a un raise — no lo hay.
@@ -1,214 +0,0 @@
-"""Impure EDA helper: rasterize a join graph to a matplotlib Figure (`eda` group).
-
-Takes the join graph produced by ``build_join_graph`` (inter-table FK relations)
-and draws it as a directed node-link diagram on a ready-to-rasterize
-``matplotlib.figure.Figure``. Hub tables (the ones with the highest out-degree,
-candidate fact tables of a star schema) are highlighted in a warm accent colour;
-the rest use a neutral colour. Directed edges carry a ``from_col→to_col`` label
-(plus the cardinality when present).
-
-This is the *drawn* counterpart of the Mermaid string that ``build_join_graph``
-also emits: the relations chapter of an AutomaticEDA report can show a real
-picture instead of only the pasteable Mermaid block.
-
-Impure because it touches matplotlib's rendering machinery. It pins the headless
-Agg backend and a deterministic ``spring_layout`` seed so the output is
-reproducible. It never raises: on any internal failure (or empty input) it
-returns a ``Figure`` carrying a centered message, so the lazy render of the
-document is never broken.
-"""
-
-import matplotlib
-
-matplotlib.use("Agg")
-
-import matplotlib.pyplot as plt  # noqa: E402
-import networkx as nx  # noqa: E402
-
-# Warm accent reserved for hub tables (candidate fact tables / star-schema cores).
-_HUB_COLOR = "#DD8452"
-# Neutral blue for every other table.
-_NODE_COLOR = "#4C72B0"
-# Muted gray for the empty/error message text.
-_MUTED_TEXT = "#5f6b7a"
-# Edge colour and label colour.
-_EDGE_COLOR = "#7a7a7a"
-_EDGE_LABEL_COLOR = "#34495e"
-# Constant node size; shared with the edge drawing so arrowheads stop at the
-# node boundary instead of being hidden under the marker.
-_NODE_SIZE = 2200
-
-
-def _text_figure(message: str) -> "matplotlib.figure.Figure":
-    """Return a blank Figure carrying a single centered message.
-
-    Used both for the "no relations" case and as the never-raise fallback.
-    """
-    fig, ax = plt.subplots(figsize=(7, 5))
-    ax.axis("off")
-    ax.text(
-        0.5,
-        0.5,
-        message,
-        ha="center",
-        va="center",
-        fontsize=12,
-        color=_MUTED_TEXT,
-        transform=ax.transAxes,
-    )
-    fig.tight_layout()
-    return fig
-
-
-def _edge_label(edge: dict) -> str:
-    """Build the ``from_col→to_col`` label of an edge, appending cardinality."""
-    fc = edge.get("from_col")
-    tc = edge.get("to_col")
-    if fc is not None and tc is not None:
-        label = f"{fc}→{tc}"
-    elif fc is not None:
-        label = str(fc)
-    elif tc is not None:
-        label = str(tc)
-    else:
-        label = ""
-    card = edge.get("cardinality")
-    if card:
-        label = f"{label} ({card})" if label else str(card)
-    return label
-
-
-def draw_join_graph_figure(join_graph: dict, title: str = None):
-    """Rasterize a join graph to a matplotlib Figure.
-
-    Builds a ``networkx.DiGraph`` from the graph's nodes and edges, lays it out
-    with a deterministic ``spring_layout`` (``seed=42``) and draws it on a
-    ``matplotlib.figure.Figure``: tables as labelled circular nodes (hubs in a
-    warm accent, the rest neutral) and FK relations as directed arrows labelled
-    ``from_col→to_col`` (plus cardinality when available).
-
-    The function never raises. On empty/``None`` input it returns a Figure with
-    a centered "Sin relaciones FK detectadas." message; on any internal failure
-    it returns a Figure with a generic centered message. It never shows the
-    figure nor writes it to disk — the document renderer rasterizes it.
-
-    Args:
-        join_graph: Dict produced by ``build_join_graph`` with keys ``nodes``
-            (list of ``{table, out_degree, in_degree, role}``), ``edges`` (list
-            of ``{from_table, from_col, to_table, to_col, cardinality?,
-            inclusion?}``) and ``hubs`` (list of hub table names to highlight).
-            Missing keys, non-dict items, ``None`` or ``{}`` are all tolerated.
-        title: Optional title drawn above the diagram. When omitted, the title
-            defaults to "Join graph".
-
-    Returns:
-        A ``matplotlib.figure.Figure`` (figsize 7x5) with a single Axes holding
-        the node-link diagram. The caller rasterizes/closes it.
-    """
-    try:
-        jg = join_graph if isinstance(join_graph, dict) else {}
-        nodes = jg.get("nodes") or []
-        edges = jg.get("edges") or []
-        hubs = {h for h in (jg.get("hubs") or []) if h is not None}
-
-        # Collect node names from the declared nodes and, defensively, from the
-        # edges (so a graph with edges but no explicit nodes still draws).
-        node_names: list = []
-        seen: set = set()
-
-        def _register(name) -> None:
-            if name is not None and name not in seen:
-                seen.add(name)
-                node_names.append(name)
-
-        for n in nodes:
-            if isinstance(n, dict):
-                _register(n.get("table"))
-        for e in edges:
-            if isinstance(e, dict):
-                _register(e.get("from_table"))
-                _register(e.get("to_table"))
-
-        if not node_names:
-            return _text_figure("Sin relaciones FK detectadas.")
-
-        graph = nx.DiGraph()
-        for name in node_names:
-            graph.add_node(name)
-
-        edge_labels: dict = {}
-        for e in edges:
-            if not isinstance(e, dict):
-                continue
-            ft = e.get("from_table")
-            tt = e.get("to_table")
-            if ft is None or tt is None:
-                continue
-            graph.add_edge(ft, tt)
-            edge_labels[(ft, tt)] = _edge_label(e)
-
-        fig, ax = plt.subplots(figsize=(7, 5))
-
-        # Deterministic layout. Fixed positions for trivial graphs so a single
-        # node sits centered instead of at an arbitrary spring-layout point.
-        if graph.number_of_nodes() <= 1:
-            pos = {name: (0.5, 0.5) for name in graph.nodes()}
-        else:
-            pos = nx.spring_layout(graph, seed=42)
-
-        node_colors = [
-            _HUB_COLOR if name in hubs else _NODE_COLOR for name in graph.nodes()
-        ]
-        nx.draw_networkx_nodes(
-            graph,
-            pos,
-            ax=ax,
-            node_color=node_colors,
-            node_size=_NODE_SIZE,
-            node_shape="o",
-            edgecolors="white",
-            linewidths=1.5,
-        )
-        nx.draw_networkx_labels(
-            graph,
-            pos,
-            ax=ax,
-            font_size=9,
-            font_color="white",
-            font_weight="bold",
-        )
-        nx.draw_networkx_edges(
-            graph,
-            pos,
-            ax=ax,
-            arrows=True,
-            arrowstyle="-|>",
-            arrowsize=18,
-            edge_color=_EDGE_COLOR,
-            width=1.4,
-            connectionstyle="arc3,rad=0.06",
-            node_size=_NODE_SIZE,
-        )
-        if any(lbl for lbl in edge_labels.values()):
-            nx.draw_networkx_edge_labels(
-                graph,
-                pos,
-                edge_labels=edge_labels,
-                ax=ax,
-                font_size=7,
-                font_color=_EDGE_LABEL_COLOR,
-                bbox={
-                    "boxstyle": "round,pad=0.2",
-                    "fc": "white",
-                    "ec": "none",
-                    "alpha": 0.7,
-                },
-            )
-
-        ax.set_title(title if title else "Join graph", fontsize=13)
-        ax.axis("off")
-        fig.tight_layout()
-        return fig
-    except Exception:
-        # Never raise — the document render is lazy and must not be broken.
-        return _text_figure("No se pudo dibujar el join graph.")
@@ -1,84 +0,0 @@
-"""Tests para draw_join_graph_figure (rasteriza el join graph, grupo eda).
-
-Usa el backend Agg sin abrir ventanas; cada test cierra la Figure construida
-(matplotlib.pyplot.close) para no acumular estado entre tests. Las aserciones de
-guardado escriben a tmp_path (fixture de pytest) y comprueban que el PNG no está
-vacío.
-"""
-
-import matplotlib
-
-matplotlib.use("Agg")
-
-import matplotlib.pyplot as plt  # noqa: E402
-from matplotlib.figure import Figure  # noqa: E402
-
-from draw_join_graph_figure import draw_join_graph_figure
-
-
-def _make_join_graph():
-    """Join graph mínimo: 3 nodos (customers/orders/products) y 2 aristas.
-
-    orders -> customers y orders -> products. `orders` es el hub (out_degree 2).
-    """
-    return {
-        "nodes": [
-            {"table": "customers", "out_degree": 0, "in_degree": 1, "role": "dimension"},
-            {"table": "orders", "out_degree": 2, "in_degree": 0, "role": "fact"},
-            {"table": "products", "out_degree": 0, "in_degree": 1, "role": "dimension"},
-        ],
-        "edges": [
-            {
-                "from_table": "orders",
-                "from_col": "customer_id",
-                "to_table": "customers",
-                "to_col": "id",
-                "cardinality": "N:1",
-                "inclusion": 1.0,
-            },
-            {
-                "from_table": "orders",
-                "from_col": "product_id",
-                "to_table": "products",
-                "to_col": "id",
-                "cardinality": "N:1",
-                "inclusion": 0.98,
-            },
-        ],
-        "hubs": ["orders"],
-    }
-
-
-def test_returns_figure_with_axis():
-    fig = draw_join_graph_figure(_make_join_graph(), title="Relaciones FK")
-    assert isinstance(fig, Figure)
-    # Al menos un eje con el diagrama.
-    assert len(fig.axes) >= 1
-    plt.close(fig)
-
-
-def test_savefig_produces_nonempty_png(tmp_path):
-    fig = draw_join_graph_figure(_make_join_graph())
-    out = tmp_path / "g.png"
-    fig.savefig(out)
-    assert out.exists()
-    assert out.stat().st_size > 0
-    plt.close(fig)
-
-
-def test_empty_dict_does_not_raise_and_savefig_png(tmp_path):
-    fig = draw_join_graph_figure({})
-    assert isinstance(fig, Figure)
-    out = tmp_path / "empty.png"
-    fig.savefig(out)
-    assert out.stat().st_size > 0
-    plt.close(fig)
-
-
-def test_none_does_not_raise_and_savefig_png(tmp_path):
-    fig = draw_join_graph_figure(None)
-    assert isinstance(fig, Figure)
-    out = tmp_path / "none.png"
-    fig.savefig(out)
-    assert out.stat().st_size > 0
-    plt.close(fig)
@@ -0,0 +1,97 @@
+---
+name: extract_null_mask
+kind: function
+lang: py
+domain: datascience
+version: "1.0.0"
+purity: impure
+signature: "def extract_null_mask(query_fn, table: str, columns: list, max_rows: int = 5000) -> dict"
+description: "Extracts the null mask (1=missing / 0=present) from a sample of a table's rows: one 0/1 list per column, aligned by row, to feed the AutomaticEDA quality / null-pattern chapter without the chapter touching the database. Takes an injected read-only reader `query_fn(sql) -> dict` (same contract as duckdb_query_readonly / pg_query / the `_q` of profile_table) and does NOT open any connection itself. Builds a SINGLE query that projects `CASE WHEN \"col\" IS NULL THEN 1 ELSE 0 END` for each column, with double-quoted identifiers and a LIMIT. Returns a dict and never raises: columns (those actually read, in order), mask (one 0/1 int list per column, all the same length) and n. A None cell is defensively counted as 1 (missing)."
+tags: [eda, nulls, missing, datascience, automatic-eda, extraction, read-only, duckdb, postgres, python]
+uses_functions: []
+uses_types: []
+returns: []
+returns_optional: false
+error_type: "error_go_core"
+imports: []
+params:
+  - name: query_fn
+    desc: "read-only reader callable for the active backend. Takes a SQL string and returns a dict {'status':'ok','rows':[{col:val,...},...]} (same contract as duckdb_query_readonly or the `_q` of profile_table). NO connection is opened inside the function: every read goes through query_fn. If None -> error."
+  - name: table
+    desc: "name of the table to sample the null mask from. Wrapped in double quotes in the query. Empty or None -> status error."
+  - name: columns
+    desc: "list of column names to evaluate. Each one yields an entry in `mask` with a parallel 0/1 list per row (1=IS NULL, 0=present). Each name is wrapped in double quotes. Empty or None -> status error."
+  - name: max_rows
+    desc: "maximum number of rows to sample (LIMIT clause). Default 5000. Guards against huge tables; LIMIT returns the first rows the backend yields, not a uniform sample."
+output: "dict (never raises). On success: {'status':'ok','table':str,'columns':[str,...] (in order),'mask':{col:[int 0/1,...],...} (1=missing/IS NULL, 0=present; every list has the same length = n),'n':int}. On error (without raising): {'status':'error','error':str,'table':str,'columns':[],'mask':{},'n':0}. Errors: query_fn is None, empty table, empty columns, or query_fn returns status!='ok' (its error is propagated)."
+tested: true
+tests: ["test_golden_mask_alineada", "test_celda_none_cuenta_como_falta", "test_columns_vacia_status_error", "test_query_fn_status_error_propaga", "test_query_fn_none_da_error_sin_reventar", "test_sql_contiene_case_y_limit"]
+test_file_path: "python/functions/datascience/extract_null_mask_test.py"
+file_path: "python/functions/datascience/extract_null_mask.py"
+---
+
+## Ejemplo
+
+```python
+import sys, os
+sys.path.insert(0, os.path.join("python", "functions"))
+from datascience.extract_null_mask import extract_null_mask
+from infra import duckdb_query_readonly
+
+# The read-only reader is injected as a closure (like the `_q` of profile_table).
+db = "data/clientes.duckdb"
+def _q(sql):
+    return duckdb_query_readonly(db, sql)
+
+res = extract_null_mask(_q, "clientes", ["email", "telefono", "edad"])
+# res == {
+#   "status": "ok",
+#   "table": "clientes",
+#   "columns": ["email", "telefono", "edad"],
+#   "mask": {
+#     "email":    [0, 0, 1, 0, ...],   # row index 2 has no email
+#     "telefono": [1, 0, 1, 0, ...],
+#     "edad":     [0, 0, 0, 1, ...],
+#   },
+#   "n": 5000,
+# }
+
+# % of nulls per column, estimated from the sample:
+pct = {c: 100 * sum(bits) / max(res["n"], 1) for c, bits in res["mask"].items()}
+
+# Handed to the quality chapter so that it never touches the DB:
+ctx = {"null_mask": res}
+```
+
+## Cuando usarla
+
+When the AutomaticEDA quality / null-pattern chapter needs to know WHERE values
+are missing (not just how many) and must NOT open the database itself: extract
+the row-aligned 0/1 mask per column here and pass it in `ctx['null_mask']`. Use
+it whenever you want to detect null co-occurrence (rows missing several columns
+at once), compute the % of nulls over a sample, or draw a missingness heatmap
+while reusing a single injected read-only reader, instead of running N separate
+`COUNT(*) WHERE col IS NULL` queries.
+
+## Gotchas
+
+- **Impure**: reads from the database through `query_fn`. It opens no connection
+  of its own and depends entirely on the injected reader. Follows the
+  dict-no-throw style of the `eda` group: it never raises; on any failure it
+  returns `{"status":"error","error":...}` with `columns=[]`, `mask={}`, `n=0`.
+- **`error_type` in the frontmatter is `error_go_core` by registry convention**
+  (every impure function must declare it and the indexer requires it), but the
+  code does NOT raise that exception: it degrades to the error dict. It is metadata, not behavior.
+- **Sample, not census**: `LIMIT max_rows` returns the first rows the backend
+  yields, not a uniform sample nor the whole table. The derived % of nulls is
+  an estimate over that sample; for the exact count use a separate
+  `COUNT(*)`/`COUNT(col)` aggregate.
+- **Row alignment**: `mask[col][i]` refers to the same row `i` as
+  `mask[other_col][i]`. Every list has length `n`, so you can cross columns by
+  index (null co-occurrence) without re-aligning.
+- **None -> 1 defense**: the SQL already returns 0/1, but if a cell arrives as `None`
+  (CASE not applied, column absent from the row, backend that nulls it) it counts
+  as 1 (missing). An unexpected value that cannot be converted to int counts as present (0).
+- **Do not log the raw data**: although `mask` is only 0/1, column names can
+  reveal the schema. In traces use `n` and the number of columns, not the full
+  dict.
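The row-alignment point above is what makes co-occurrence cheap to compute from the returned `mask`. A minimal sketch follows, assuming the `mask` shape documented in this file; `null_cooccurrence` is a hypothetical helper written for illustration and is not the `missingness_correlation` function this commit adds.

```python
def null_cooccurrence(mask, col_a, col_b):
    """Count rows where both columns are missing and derive the Jaccard index.

    `mask` is assumed to have the documented shape {col: [0/1, ...]} with
    row-aligned lists of equal length.
    """
    a = mask[col_a]
    b = mask[col_b]
    both = sum(1 for x, y in zip(a, b) if x and y)     # missing in both columns
    either = sum(1 for x, y in zip(a, b) if x or y)    # missing in at least one
    jaccard = both / either if either else 0.0          # 0.0 when nothing is missing
    return {"both": both, "either": either, "jaccard": jaccard}


# Toy mask with four rows; only row index 2 is missing in both columns.
mask = {
    "email": [0, 0, 1, 0],
    "telefono": [1, 0, 1, 0],
}
result = null_cooccurrence(mask, "email", "telefono")
# result == {"both": 1, "either": 2, "jaccard": 0.5}
```

Because the mask comes from a `LIMIT` sample, the counts describe that sample only; the same caveat as for the % of nulls applies.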
@@ -0,0 +1,101 @@
+"""extract_null_mask: extracts the null mask (1=missing / 0=present) of a table.
+
+Injected read-only reader: takes `query_fn(sql) -> dict` with the same contract
+as duckdb_query_readonly / pg_query (and the `_q` of profile_table):
+`{"status": "ok", "rows": [{col: val, ...}, ...]}`. This function does NOT open
+any connection itself; it only uses `query_fn`. It builds a SINGLE query that, for
+each requested column, evaluates `CASE WHEN "col" IS NULL THEN 1 ELSE 0 END` and
+returns a sample of rows with those bits. The result is a `mask` dict with one
+0/1 list per column, aligned by row (1 = value missing / IS NULL, 0 = present),
+ready to feed the AutomaticEDA quality / null-pattern chapter without the
+chapter touching the database.
+
+Dict-no-throw style of the `eda` group: never raises; it catches any exception
+and degrades to `{"status": "error", "error": str, ...}`.
+"""
+
+
+def _to_bit(value):
+    """Defensively coerce the CASE's 0/1 value to an int.
+
+    The SQL already returns 0 (present) or 1 (missing). If a cell arrives as None
+    (the CASE was not applied or the backend nulled it), it counts as 1 (missing).
+    Anything else is reduced to int: a non-zero integer counts as 1 (missing), 0
+    as present. A non-convertible value counts as present (0); never raises.
+    """
+    if value is None:
+        return 1
+    try:
+        return 1 if int(value) != 0 else 0
+    except (TypeError, ValueError, OverflowError):
+        return 0
+
+
+def extract_null_mask(query_fn, table, columns, max_rows=5000):
+    """Extract the null mask (1=missing / 0=present) from a sample of the table.
+
+    Args:
+        query_fn: read-only reader callable for the active backend. Takes a SQL
+            string and returns a dict {"status": "ok", "rows": [{col: val, ...}]}
+            (same contract as duckdb_query_readonly / the `_q` of profile_table).
+            No connection is opened here: every read goes through query_fn.
+        table: table name. Wrapped in double quotes in the query.
+        columns: list of column names to evaluate. Each one yields an entry in
+            `mask` with a parallel 0/1 list per row. Empty or None -> status
+            error.
+        max_rows: maximum number of rows to sample (LIMIT clause). Default 5000.
+
+    Returns:
+        dict (never raises):
+            {
+              "status": "ok" | "error",
+              "error": str,                 # only when status == "error"
+              "table": str,
+              "columns": [str, ...],        # columns actually read, in order
+              "mask": {col: [int 0/1, ...], ...},  # row-aligned, 1=missing, 0=present
+              "n": int                      # number of rows sampled
+            }
+        Every list in `mask` has the same length (= n).
+    """
+    base = {"status": "ok", "table": table, "columns": [], "mask": {}, "n": 0}
+    try:
+        if query_fn is None:
+            return {**base, "status": "error", "error": "query_fn es None"}
+        if not table:
+            return {**base, "status": "error", "error": "table es obligatorio"}
+        if not columns:
+            return {**base, "status": "error", "error": "columns vacío"}
+
+        # Identifiers are double-quoted (as profile_table does) and embedded
+        # quotes are doubled so odd names cannot break it; each column keeps its alias.
+        select_sql = ", ".join(
+            '(CASE WHEN "{0}" IS NULL THEN 1 ELSE 0 END) AS "{0}"'.format(str(c).replace('"', '""'))
+            for c in columns
+        )
+        sql = 'SELECT {} FROM "{}" LIMIT {}'.format(select_sql, str(table).replace('"', '""'), int(max_rows))
+
+        q = query_fn(sql)
+        if not isinstance(q, dict) or q.get("status") != "ok":
+            err = (
+                q.get("error", "query_fn fallo")
+                if isinstance(q, dict)
+                else "query_fn no devolvio un dict"
+            )
+            return {**base, "status": "error", "error": err}
+
+        rows = q.get("rows", []) or []
+        mask = {c: [] for c in columns}
+        for row in rows:
+            for c in columns:
+                # A non-dict row, or a row lacking the column, yields None -> missing.
+                mask[c].append(_to_bit(row.get(c) if isinstance(row, dict) else None))
+
+        return {
+            "status": "ok",
+            "table": table,
+            "columns": list(columns),
+            "mask": mask,
+            "n": len(rows),
+        }
+    except Exception as e:  # noqa: BLE001 - dict-no-throw: degrade, never raise
+        return {**base, "status": "error", "error": str(e)}
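The `query_fn` contract the function relies on (`{"status": "ok", "rows": [...]}` on success, `{"status": "error", "error": ...}` on failure) can be illustrated with a stand-in reader. The sketch below uses the standard-library `sqlite3` module purely as an assumption for illustration; the project injects `duckdb_query_readonly` or `pg_query` instead, and `make_query_fn` is a hypothetical name.

```python
import sqlite3


def make_query_fn(con):
    """Hypothetical adapter: wrap a DB-API connection in the dict contract."""

    def _q(sql):
        try:
            cur = con.execute(sql)
            names = [d[0] for d in cur.description]
            rows = [dict(zip(names, r)) for r in cur.fetchall()]
            return {"status": "ok", "rows": rows}
        except sqlite3.Error as exc:
            # Never raise: report the failure through the dict, as the contract expects.
            return {"status": "error", "error": str(exc)}

    return _q


con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (a REAL, b TEXT)")
con.executemany("INSERT INTO t VALUES (?, ?)", [(None, "x"), (1.0, None), (None, None)])
q = make_query_fn(con)

# Same query shape that extract_null_mask builds: one IS NULL bit per column.
ok = q(
    'SELECT (CASE WHEN "a" IS NULL THEN 1 ELSE 0 END) AS "a", '
    '(CASE WHEN "b" IS NULL THEN 1 ELSE 0 END) AS "b" FROM "t" LIMIT 10'
)
bad = q('SELECT 1 FROM "no_such_table"')
```

Keeping the reader behind a closure like this is what lets the tests swap in a fake `query_fn` without any backend.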
@@ -0,0 +1,116 @@
+"""Tests for extract_null_mask.
+
+No real DuckDB: injects a FAKE query_fn (closure) that returns predefined rows
+(simulating the SELECT of 0/1 bits) and, optionally, captures the SQL it
+receives to verify the generated query (CASE WHEN ... IS NULL + LIMIT). This
+keeps the test self-contained and independent of any backend.
+"""
+
+import os
+import sys
+
+sys.path.insert(0, os.path.dirname(__file__))
+
+from extract_null_mask import extract_null_mask
+
+
+def _fake_query(rows, captured=None, status="ok", error=None):
+    """Build a FAKE query_fn.
+
+    `captured` (optional list) receives the executed SQL so it can be inspected.
+    `status`/`error` let the test simulate a backend failure.
+    """
+
+    def _q(sql):
+        if captured is not None:
+            captured.append(sql)
+        if status != "ok":
+            return {"status": "error", "error": error or "boom"}
+        return {"status": "ok", "rows": rows}
+
+    return _q
+
+
+def test_golden_mask_alineada():
+    """Golden: 0/1 mask per column aligned by row, correct n, status ok."""
+    # Each row simulates SELECT (CASE WHEN col IS NULL THEN 1 ELSE 0 END) AS col.
+    rows = [
+        {"email": 0, "telefono": 1, "edad": 0},
+        {"email": 0, "telefono": 0, "edad": 1},
+        {"email": 1, "telefono": 1, "edad": 0},
+    ]
+    res = extract_null_mask(_fake_query(rows), "clientes", ["email", "telefono", "edad"])
+    assert res["status"] == "ok"
+    assert res["table"] == "clientes"
+    assert res["columns"] == ["email", "telefono", "edad"]
+    assert res["n"] == 3
+    assert res["mask"]["email"] == [0, 0, 1]
+    assert res["mask"]["telefono"] == [1, 0, 1]
+    assert res["mask"]["edad"] == [0, 1, 0]
+    # Every list has the same length.
+    assert all(len(v) == res["n"] for v in res["mask"].values())
+
+
+def test_celda_none_cuenta_como_falta():
+    """A None cell is defensively counted as 1 (missing)."""
+    rows = [
+        {"email": 0, "telefono": None},
+        {"email": None, "telefono": 1},
+        {"email": 1, "telefono": 0},
+    ]
+    res = extract_null_mask(_fake_query(rows), "clientes", ["email", "telefono"])
+    assert res["status"] == "ok"
+    assert res["mask"]["email"] == [0, 1, 1]
+    assert res["mask"]["telefono"] == [1, 1, 0]
+    assert res["n"] == 3
+
+
+def test_columns_vacia_status_error():
+    """columns vacia -> status error con columns/mask/n vacios."""
+    res = extract_null_mask(_fake_query([]), "clientes", [])
+    assert res["status"] == "error"
+    assert "columns" in res["error"]
+    assert res["table"] == "clientes"
+    assert res["columns"] == []
+    assert res["mask"] == {}
+    assert res["n"] == 0
+
+
+def test_query_fn_status_error_propaga():
+    """query_fn que devuelve status != ok -> se propaga como error, mask {}."""
+    res = extract_null_mask(
+        _fake_query([], status="error", error="db locked"),
+        "clientes",
+        ["email"],
+    )
+    assert res["status"] == "error"
+    assert "db locked" in res["error"]
+    assert res["mask"] == {}
+    assert res["n"] == 0
+
+
+def test_query_fn_none_da_error_sin_reventar():
+    """query_fn None -> error degradado, sin excepcion."""
+    res = extract_null_mask(None, "clientes", ["email"])
+    assert res["status"] == "error"
+    assert res["columns"] == []
+    assert res["mask"] == {}
+    assert res["n"] == 0
+
+
+def test_sql_contiene_case_y_limit():
+    """La query genera un CASE WHEN IS NULL por columna escapada + LIMIT sobre la tabla."""
+    captured = []
+    rows = [{"email": 0}]
+    extract_null_mask(
+        _fake_query(rows, captured),
+        "clientes_tbl",
+        ["email"],
+        max_rows=123,
+    )
+    assert len(captured) == 1
+    sql = captured[0]
+    assert 'CASE WHEN "email" IS NULL THEN 1 ELSE 0 END' in sql
+    assert 'AS "email"' in sql
+    assert 'FROM "clientes_tbl"' in sql
+    assert "LIMIT 123" in sql
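The assertions in `test_sql_contiene_case_y_limit` pin down the shape of the generated query: one quoted `CASE WHEN ... IS NULL` flag per column, a quoted table name and a `LIMIT`. A minimal sketch of how such a query could be assembled follows. The helper names `quote_ident` and `build_null_mask_sql` are hypothetical and are not taken from `extract_null_mask`, whose body is not shown in this section; the sketch only reproduces what the test checks.

```python
def quote_ident(name):
    """Double-quote an SQL identifier, doubling any embedded double quotes."""
    return '"' + str(name).replace('"', '""') + '"'


def build_null_mask_sql(table, columns, max_rows):
    """Build a SELECT that returns one 0/1 is-null flag per column and row."""
    flags = ", ".join(
        f"(CASE WHEN {quote_ident(c)} IS NULL THEN 1 ELSE 0 END) AS {quote_ident(c)}"
        for c in columns
    )
    # int() keeps anything non-numeric out of the LIMIT clause.
    return f"SELECT {flags} FROM {quote_ident(table)} LIMIT {int(max_rows)}"


sql = build_null_mask_sql("clientes_tbl", ["email", "telefono"], 123)
print(sql)
```

Quoting every identifier is what lets the mask cover columns whose names contain spaces or reserved words, which is presumably why the test asserts on the quoted form.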
@@ -0,0 +1,103 @@
+---
+id: missingness_corr_heatmap_figure_py_datascience
+name: missingness_corr_heatmap_figure
+kind: function
+lang: py
+domain: datascience
+version: "1.0.0"
+purity: impure
+signature: "def missingness_corr_heatmap_figure(matrix, labels, title=\"Co-ocurrencia de ausencias\") -> \"matplotlib.figure.Figure\""
+description: "Construye una figura matplotlib (heatmap) de la matriz NxN de correlación de ausencias entre columnas: +1 = dos columnas suelen ser nulas a la vez, -1 = cuando una falta la otra está presente, 0 = ausencias independientes. Usa ax.imshow con coolwarm fijado a [-1,1], ticks con los labels truncados (X rotados 45º), colorbar y anota el valor de cada celda si N<=12. Devuelve un matplotlib.figure.Figure listo para rasterizar por el renderer del informe EDA (capítulo de datos faltantes). Backend Agg sin pyplot global; defensivo ante matrix/labels vacíos o celdas no numéricas (nunca lanza)."
+tags: [eda, missing, missingness, correlation, heatmap, matplotlib, figure, visualization, datascience, impure]
+uses_functions: []
+uses_types: []
+returns: []
+returns_optional: false
+error_type: "error_go_core"
+imports: [matplotlib]
+example: |
+  from datascience.missingness_corr_heatmap_figure import missingness_corr_heatmap_figure
+  matrix = [
+      [1.0, 0.82, -0.10],
+      [0.82, 1.0, 0.05],
+      [-0.10, 0.05, 1.0],
+  ]
+  labels = ["telefono", "movil", "email"]
+  fig = missingness_corr_heatmap_figure(matrix, labels, title="Co-ocurrencia de ausencias")
+tested: true
+tests:
+  - "test_returns_figure_with_axes"
+  - "test_empty_matrix_does_not_raise_and_returns_figure"
+  - "test_empty_labels_returns_message_figure"
+  - "test_large_matrix_omits_annotations"
+  - "test_ragged_and_non_numeric_cells_are_handled"
+test_file_path: "python/functions/datascience/missingness_corr_heatmap_figure_test.py"
+file_path: "python/functions/datascience/missingness_corr_heatmap_figure.py"
+params:
+  - name: matrix
+    desc: "Lista de listas (NxN) de floats en [-1,1]: la correlación de ausencias por pares de columnas. Puede venir vacía. Filas de longitud desigual se toleran (se rellenan/recortan a N); celdas None, NaN o no numéricas se coercen a 0.0. No se muta el original."
+  - name: labels
+    desc: "Lista de N nombres de columna, paralela a matrix. Puede venir vacía (devuelve figura \"sin columnas con ausencia variable\"). Se truncan a ~14 chars con elipsis para los ticks; los originales no se mutan."
+  - name: title
+    desc: "Título de la figura. Se trunca a ~60 chars con elipsis si es muy largo. Default \"Co-ocurrencia de ausencias\"."
+output: "Un matplotlib.figure.Figure (figsize 6.4x5.2, dpi 150) con un Axes heatmap (imshow vmin=-1, vmax=1, cmap coolwarm) más una colorbar etiquetada \"correlación de ausencias\". Ticks en ambos ejes con los labels truncados (X rotados 45º). Si N<=12 cada celda lleva su valor numérico anotado (texto blanco sobre celdas saturadas, oscuro sobre pálidas); con N grande se omiten las anotaciones para no saturar. Si matrix o labels vienen vacíos devuelve una Figure con texto centrado \"sin columnas con ausencia variable\"; cualquier error inesperado se captura y devuelve una Figure con el mensaje de error (nunca lanza). El caller rasteriza/cierra la figura; la función no la muestra ni la guarda."
+---
+
+## Ejemplo
+
+```python
+from datascience.missingness_corr_heatmap_figure import missingness_corr_heatmap_figure
+
+# Correlación de ausencias entre 3 columnas de contacto:
+# telefono y movil tienden a faltar juntos (0.82); email es casi independiente.
+matrix = [
+    [1.00, 0.82, -0.10],
+    [0.82, 1.00,  0.05],
+    [-0.10, 0.05, 1.00],
+]
+labels = ["telefono", "movil", "email"]
+
+fig = missingness_corr_heatmap_figure(
+    matrix,
+    labels,
+    title="Co-ocurrencia de ausencias",
+)
+
+# El renderer del informe lo rasteriza; aquí solo persistimos para inspección.
+fig.savefig("/tmp/missingness_heatmap.png")
+```
+
+## Cuando usarla
+
+Úsala en el capítulo de datos faltantes de un informe EDA cuando quieras ver de
+un vistazo qué columnas faltan juntas (mismo formulario sin rellenar, mismo
+proceso roto) frente a columnas cuyas ausencias son independientes. Pásale la
+matriz de correlación de ausencias (calculada sobre la máscara de nulos, p. ej.
+`df.isnull().corr()`) restringida a las columnas que de verdad tienen ausencia
+variable, junto con sus nombres. Es la pareja "estructura" del ranking de % de
+nulos: las barras dicen *cuánto* falta cada columna, este heatmap dice *si las
+ausencias están relacionadas* entre columnas.
+
+## Gotchas
+
+- **Impura por matplotlib.** Toca la maquinaria de render. Usa el backend `Agg`
+  y la API orientada a objetos `Figure`/`add_subplot` — NUNCA `pyplot.*` aquí,
+  para no tocar el estado global ni filtrar figuras entre llamadas. `pyplot` NO
+  es thread-safe; esta función evita ese riesgo construyendo el `Figure`
+  directamente, así que es segura de llamar en bucle desde el renderer.
+- **El caller libera la figura.** Devuelve el `Figure` pero no lo muestra ni lo
+  guarda. Como no se crea con `pyplot`, no queda en su registro global de
+  figuras: tras rasterizarla basta con soltar la referencia para liberar memoria.
+- **Escala de color fija en [-1, 1].** `vmin=-1`, `vmax=1` están fijados a
+  propósito para que el color sea comparable entre informes y entre columnas. No
+  se autoescala al rango real de la matriz; valores fuera de `[-1, 1]` se
+  saturan al extremo del colormap.
+- **Anotaciones solo con N<=12.** Por encima de 12 columnas el grid de números
+  se vuelve ilegible y se omite; queda solo el color + la colorbar. Filtra a las
+  columnas con ausencia variable antes de llamar para no llegar a matrices
+  enormes.
+- **Defensiva, nunca lanza.** `matrix=[]`, `labels=[]`, filas cortas, celdas
+  `None`/`NaN`/no numéricas o cualquier error inesperado se manejan sin propagar:
+  en el peor caso devuelve una `Figure` con "sin columnas con ausencia variable"
+  o con el texto del error. No envuelvas la llamada en try/except por miedo a un
+  raise — no lo hay.
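The notes above ask the caller to pass only the columns whose absence varies, together with their correlation matrix. A minimal sketch of that preparation step in plain Python, assuming a 0/1 null mask per column with all lists aligned by row. `pearson_binary` and `heatmap_inputs` are illustrative names, not functions from this registry; in the actual chapter `missingness_correlation` returns the same kind of `columns` and `matrix`.

```python
from math import sqrt


def pearson_binary(x, y):
    """Pearson correlation of two equal-length numeric lists (0.0 if undefined)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy) if vx > 0 and vy > 0 else 0.0


def heatmap_inputs(null_mask):
    """Return (matrix, labels) restricted to columns whose absence varies."""
    # Keep only columns that have at least one 1 and at least one 0.
    labels = [c for c, v in null_mask.items() if 0 < sum(v) < len(v)]
    matrix = [
        [1.0 if a == b else pearson_binary(null_mask[a], null_mask[b]) for b in labels]
        for a in labels
    ]
    return matrix, labels


mask = {
    "telefono": [1, 0, 1, 0],
    "movil": [1, 0, 1, 0],   # missing on the same rows as telefono
    "email": [0, 1, 0, 1],   # missing exactly when telefono is present
    "id": [0, 0, 0, 0],      # never missing: filtered out
}
matrix, labels = heatmap_inputs(mask)
print(labels)
print(matrix)
```

The resulting `matrix` and `labels` can then be handed to `missingness_corr_heatmap_figure(matrix, labels)`.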
@@ -0,0 +1,158 @@
+"""Impure EDA helper: heatmap of missingness co-occurrence (`eda` group).
+
+Builds a matplotlib heatmap of the pairwise missingness correlation matrix of a
+dataset: a value near ``+1`` means two columns tend to be null together, near
+``-1`` means when one is null the other tends to be present, and ``0`` means
+their absences are independent. Returns a ready-to-rasterize
+``matplotlib.figure.Figure``; it never shows nor saves it.
+
+Impure because it touches matplotlib's rendering machinery. It uses the headless
+Agg backend and the object-oriented ``Figure`` API (no ``pyplot``) so it leaks no
+global state and is safe to call repeatedly from a report renderer.
+"""
+
+import matplotlib
+
+matplotlib.use("Agg")
+
+from matplotlib.figure import Figure  # noqa: E402
+
+# Muted gray for secondary text (no-data / fallback messages).
+_MUTED_TEXT = "#5f6b7a"
+# Soft red for the error fallback message (kept readable, not alarming).
+_ERROR_TEXT = "#b00020"
+
+
+def _truncate(text, width: int = 14) -> str:
+    """Truncate ``text`` to ``width`` chars, appending an ellipsis if cut."""
+    s = "" if text is None else str(text)
+    if len(s) <= width:
+        return s
+    if width <= 1:
+        return s[:width]
+    return s[: width - 1] + "…"
+
+
+def _message_figure(message: str, color: str = _MUTED_TEXT) -> "Figure":
+    """Return a fallback ``Figure`` carrying a single centered message."""
+    fig = Figure(figsize=(6.4, 4.0), dpi=150)
+    ax = fig.add_subplot(111)
+    ax.axis("off")
+    ax.text(
+        0.5,
+        0.5,
+        message,
+        ha="center",
+        va="center",
+        fontsize=12,
+        color=color,
+        wrap=True,
+        transform=ax.transAxes,
+    )
+    fig.tight_layout()
+    return fig
+
+
+def missingness_corr_heatmap_figure(
+    matrix,
+    labels,
+    title: str = "Co-ocurrencia de ausencias",
+) -> "matplotlib.figure.Figure":
+    """Build a heatmap figure of a missingness correlation matrix.
+
+    Renders an ``NxN`` matrix of missingness correlations in ``[-1, 1]`` with a
+    diverging ``coolwarm`` colormap (fixed ``vmin=-1``, ``vmax=1`` so the color
+    scale is comparable across reports). Both axes are tick-labelled with the
+    column names (truncated to ~14 chars; the X labels rotated 45°). A colorbar
+    is attached. When the matrix is small (``N <= 12``) each cell is annotated
+    with its numeric value; for larger matrices the annotations are omitted to
+    avoid an unreadable grid.
+
+    The function is fully defensive: empty/ragged/non-numeric input never raises.
+    When there is nothing valid to draw it returns a ``Figure`` carrying a
+    centered "sin columnas con ausencia variable" message, and any unexpected
+    error is caught and turned into a fallback ``Figure`` carrying the error text.
+
+    Args:
+        matrix: List of lists (``NxN``) of floats in ``[-1, 1]``: the pairwise
+            missingness correlation. An empty matrix yields the message figure;
+            otherwise rows are padded with ``0.0`` or clipped to ``len(labels)``,
+            so ragged input is tolerated. Non-numeric, ``None`` and NaN cells are
+            coerced to ``0.0``.
+        labels: List of ``N`` column names, parallel to ``matrix``. May be empty.
+            Truncated for display; the originals are not mutated.
+        title: Figure title. Default "Co-ocurrencia de ausencias".
+
+    Returns:
+        A ``matplotlib.figure.Figure`` with a single heatmap Axes plus a
+        colorbar. The caller is responsible for rasterizing/closing it.
+    """
+    try:
+        # --- Validate shape: need a non-empty square-ish matrix with labels.
+        if (
+            not isinstance(matrix, (list, tuple))
+            or not isinstance(labels, (list, tuple))
+            or len(matrix) == 0
+            or len(labels) == 0
+        ):
+            return _message_figure("sin columnas con ausencia variable")
+
+        n = len(labels)
+        # Build a clean NxN grid: coerce each cell to float, default 0.0, pad/clip
+        # rows so a ragged input never crashes imshow.
+        grid = []
+        for i in range(n):
+            row_src = matrix[i] if i < len(matrix) else []
+            if not isinstance(row_src, (list, tuple)):
+                row_src = []
+            row = []
+            for j in range(n):
+                cell = row_src[j] if j < len(row_src) else 0.0
+                try:
+                    val = float(cell)
+                except (TypeError, ValueError):
+                    val = 0.0
+                if val != val:  # NaN guard.
+                    val = 0.0
+                row.append(val)
+            grid.append(row)
+
+        fig = Figure(figsize=(6.4, 5.2), dpi=150)
+        ax = fig.add_subplot(111)
+
+        im = ax.imshow(grid, vmin=-1, vmax=1, cmap="coolwarm", aspect="equal")
+
+        short = [_truncate(lab, 14) for lab in labels]
+        ax.set_xticks(range(n))
+        ax.set_yticks(range(n))
+        ax.set_xticklabels(short, rotation=45, ha="right", fontsize=8)
+        ax.set_yticklabels(short, fontsize=8)
+
+        # Annotate each cell only when the grid is small enough to stay legible.
+        if n <= 12:
+            for i in range(n):
+                for j in range(n):
+                    val = grid[i][j]
+                    # White text over saturated (dark) cells, dark over pale.
+                    txt_color = "white" if abs(val) >= 0.55 else "#202020"
+                    ax.text(
+                        j,
+                        i,
+                        f"{val:.2f}",
+                        ha="center",
+                        va="center",
+                        fontsize=7,
+                        color=txt_color,
+                    )
+
+        cbar = fig.colorbar(im, ax=ax, fraction=0.046, pad=0.04)
+        cbar.ax.tick_params(labelsize=8)
+        cbar.set_label("correlación de ausencias", fontsize=8)
+
+        if title:
+            ax.set_title(_truncate(title, 60), fontsize=12, loc="center", pad=10)
+
+        fig.tight_layout()
+        return fig
+    except Exception as exc:  # noqa: BLE001 — never raise from a figure builder.
+        return _message_figure(f"error al dibujar heatmap: {exc}", color=_ERROR_TEXT)
@@ -0,0 +1,62 @@
+"""Tests para missingness_corr_heatmap_figure (heatmap de ausencias, grupo eda).
+
+La función bajo test usa el backend Agg sin pyplot y no muestra ni guarda
+figuras. Los tests sí importan pyplot, solo para llamar a plt.close(fig) tras
+cada caso como limpieza defensiva.
+"""
+
+import matplotlib
+
+matplotlib.use("Agg")
+
+import matplotlib.pyplot as plt  # noqa: E402
+from matplotlib.figure import Figure  # noqa: E402
+
+from missingness_corr_heatmap_figure import missingness_corr_heatmap_figure
+
+
+def _identity_matrix(n):
+    """Matriz NxN con diagonal 1.0 y resto 0.0 (correlación de ausencias)."""
+    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
+
+
+def test_returns_figure_with_axes():
+    matrix = [[1.0, 0.3, -0.2], [0.3, 1.0, 0.5], [-0.2, 0.5, 1.0]]
+    labels = ["edad", "ingresos", "ciudad"]
+    fig = missingness_corr_heatmap_figure(matrix, labels, title="ausencias")
+    assert isinstance(fig, Figure)
+    # Axes del heatmap + Axes propio de la colorbar -> 2 en total.
+    assert len(fig.axes) == 2
+    plt.close(fig)
+
+
+def test_empty_matrix_does_not_raise_and_returns_figure():
+    fig = missingness_corr_heatmap_figure([], [], title="vacía")
+    assert isinstance(fig, Figure)
+    assert len(fig.axes) >= 1
+    plt.close(fig)
+
+
+def test_empty_labels_returns_message_figure():
+    fig = missingness_corr_heatmap_figure([[1.0]], [], title="sin labels")
+    assert isinstance(fig, Figure)
+    plt.close(fig)
+
+
+def test_large_matrix_omits_annotations():
+    n = 16
+    fig = missingness_corr_heatmap_figure(
+        _identity_matrix(n), [f"col_{i}" for i in range(n)]
+    )
+    assert isinstance(fig, Figure)
+    # N > 12: el Axes del heatmap no debe llevar anotaciones de celda.
+    assert len(fig.axes[0].texts) == 0
+    plt.close(fig)
+
+
+def test_ragged_and_non_numeric_cells_are_handled():
+    # Fila corta + celda None + celda string -> se rellenan/coercen sin lanzar.
+    matrix = [[1.0, None], ["x", 1.0, 0.5]]
+    labels = ["a", "b"]
+    fig = missingness_corr_heatmap_figure(matrix, labels)
+    assert isinstance(fig, Figure)
+    plt.close(fig)
@@ -0,0 +1,68 @@
+---
+id: missingness_correlation_py_datascience
+name: missingness_correlation
+kind: function
+lang: py
+domain: datascience
+version: "1.0.0"
+purity: pure
+signature: "def missingness_correlation(null_mask: dict, top_k: int = 20) -> dict"
+description: "Co-ocurrencia de ausencias: nucleo del capitulo de missingness del grupo eda. Recibe la mascara binaria de nulos de una tabla (1 = falta, 0 = presente, alineada por fila) y mide hasta que punto las columnas faltan juntas. Calcula la matriz de correlacion de Pearson entre los vectores binarios de ausencia de las columnas con varianza (al menos un 1 y un 0), mas las cifras de solapamiento de conjuntos por par (co-missing, either-missing, Jaccard). Excluye las columnas constantes en su ausencia (correlacion indefinida) y reporta cuantas. Compone la funcion atomica pearson del registry; no la reimplementa. Lectura defensiva; NUNCA lanza."
+tags: [eda, missingness, correlation, pearson, co-occurrence, jaccard, datascience]
+params:
+  - name: null_mask
+    desc: "dict {col: [int 0/1, ...]} con la mascara de ausencias de la tabla, alineada por fila: 1 = el valor falta en esa fila, 0 = presente. Todas las listas se asumen de la misma longitud (numero de filas). Valores truthy distintos de 0 se tratan como ausencia; entradas no-lista se ignoran sin romper."
+  - name: top_k
+    desc: "Numero maximo de pares a devolver en `pairs`, ordenados por valor absoluto de correlacion descendente. Default 20. Solo limita la lista de pares; la matriz cubre siempre todas las columnas con varianza."
+output: "dict con: columns (columnas con varianza en la ausencia, en orden de entrada); matrix (len(columns) x len(columns) de correlacion de Pearson entre las mascaras binarias, diagonal 1.0); pairs (hasta top_k pares i<j ordenados por |corr| desc, cada uno {a, b, corr, co_missing, either_missing, jaccard} donde co_missing = filas en que ambas faltan, either_missing = filas en que al menos una falta, jaccard = co_missing/either_missing o 0.0 si either_missing=0); n_excluded (nº de columnas con algun nulo pero sin varianza, constantes en la ausencia); excluded_cols (esas columnas en orden de entrada). Si hay <2 columnas con varianza, columns/matrix/pairs van vacios pero n_excluded/excluded_cols se rellenan. NUNCA lanza."
+uses_functions: [pearson_py_datascience]
+uses_types: []
+returns: []
+returns_optional: false
+error_type: ""
+imports: []
+tested: true
+tests: ["test_co_ocurrencia_fuerte_corr_uno_jaccard_uno", "test_ausencias_disjuntas_corr_negativa_jaccard_cero", "test_columna_sin_varianza_se_excluye", "test_menos_de_dos_columnas_con_varianza_vacio_pero_cuenta_excluidas", "test_mask_vacio_todo_vacio", "test_top_k_limita_pares", "test_no_lanza_con_entradas_raras"]
+test_file_path: "python/functions/datascience/missingness_correlation_test.py"
+file_path: "python/functions/datascience/missingness_correlation.py"
+---
+
+## Ejemplo
+
+```python
+import sys, os
+sys.path.insert(0, os.path.join("python", "functions"))
+from datascience.missingness_correlation import missingness_correlation
+
+# Mascara de ausencias de 6 filas. 1 = falta, 0 = presente.
+mask = {
+    "ingresos":  [1, 0, 1, 0, 1, 0],   # falta junto a "deducciones"
+    "deducciones": [1, 0, 1, 0, 1, 0], # mismas filas que "ingresos"
+    "telefono":  [0, 0, 0, 1, 0, 0],   # casi siempre presente
+    "verificado": [1, 1, 1, 1, 1, 1],  # siempre ausente -> constante, excluida
+}
+out = missingness_correlation(mask, top_k=10)
+
+print(out["columns"])        # ['ingresos', 'deducciones', 'telefono']
+print(out["n_excluded"])     # 1
+print(out["excluded_cols"])  # ['verificado']
+
+# El par mas fuerte: ingresos y deducciones faltan siempre juntas.
+top = out["pairs"][0]
+print(top["a"], top["b"], round(top["corr"], 3))  # ingresos deducciones 1.0
+print(top["co_missing"], top["either_missing"], top["jaccard"])  # 3 3 1.0
+```
+
+## Cuando usarla
+
+- Usala en el capitulo de **missingness** de `AutomaticEDA` cuando ya tengas la mascara binaria de nulos por columna y quieras detectar **patrones de ausencia conjunta**: que columnas faltan siempre juntas (posible misma fuente/proceso roto) y cuales faltan de forma independiente.
+- Cuando necesites ordenar los pares de columnas por fuerza de co-ocurrencia (|corr|) para priorizar que bloques de ausencia investigar o imputar juntos.
+- Cuando quieras la cifra de solapamiento de conjuntos (Jaccard, co-missing) ademas de la correlacion lineal, para distinguir "faltan juntas" de "estan presentes juntas".
+- Antes de elegir una estrategia de imputacion: dos columnas con corr de ausencia ~1.0 no aportan informacion independiente sobre por que falta la otra.
+
+## Gotchas
+
+- Funcion pura, sin I/O y determinista. Lectura defensiva: entradas no-dict, columnas no-lista o vacias se ignoran sin lanzar.
+- Solo entran al calculo las columnas con **varianza en la ausencia** (al menos un 1 y al menos un 0). Una columna siempre-presente (todo 0) no aporta ausencia y **no** se cuenta como excluida; una columna siempre-ausente o constante con nulos (todo 1) tiene correlacion indefinida y se excluye, sumando a `n_excluded` / `excluded_cols`.
+- Con menos de 2 columnas con varianza, `columns`/`matrix`/`pairs` quedan vacios pero `n_excluded`/`excluded_cols` se rellenan igual — el caller debe contemplar el caso "sin pares".
+- La correlacion es la de Pearson sobre vectores binarios (equivale al coeficiente phi). El signo importa: corr negativa = las ausencias tienden a ser **complementarias** (cuando una falta, la otra suele estar presente).
+- Asume todas las listas alineadas por fila y de la misma longitud. Si vienen de longitudes distintas, `pearson` opera sobre el solapamiento que permita `zip` y degrada a 0.0 cuando no hay varianza efectiva; alinea la mascara antes de llamar.
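The phi equivalence mentioned in the notes above can be checked by hand. For two 0/1 vectors, Pearson's r equals the phi coefficient of their 2x2 contingency table, while Jaccard ignores the rows where both values are present. A minimal sketch, where `phi_and_jaccard` is an illustrative helper rather than part of the registry:

```python
from math import sqrt


def phi_and_jaccard(a, b):
    """Return (phi, jaccard) for two aligned 0/1 missingness vectors."""
    n11 = sum(1 for x, y in zip(a, b) if x and y)          # both missing
    n10 = sum(1 for x, y in zip(a, b) if x and not y)      # only a missing
    n01 = sum(1 for x, y in zip(a, b) if not x and y)      # only b missing
    n00 = sum(1 for x, y in zip(a, b) if not x and not y)  # both present
    denom = sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    phi = (n11 * n00 - n10 * n01) / denom if denom else 0.0
    either = n11 + n10 + n01
    jaccard = n11 / either if either else 0.0
    return phi, jaccard


# a is missing on row 0 only; b is missing on rows 0 and 1 (10 rows in total).
a = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
b = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
phi, jaccard = phi_and_jaccard(a, b)
print(round(phi, 3), jaccard)  # 0.667 0.5
```

Here the eight rows where both values are present push phi up to about 0.67, while Jaccard reports that the columns share only half of their missing rows. That gap is why `pairs` carries both figures.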
@@ -0,0 +1,120 @@
+"""Co-ocurrencia de ausencias: matriz de correlacion de Pearson entre mascaras de nulos.
+
+Funcion pura del grupo eda, nucleo del capitulo de missingness. Recibe la mascara
+binaria de ausencias de una tabla (1 = falta, 0 = presente, alineada por fila) y
+mide hasta que punto las columnas faltan juntas. Para cada par de columnas con
+varianza en su ausencia calcula la correlacion de Pearson entre los vectores
+binarios, mas las cifras de solapamiento de conjuntos (co-missing, either-missing,
+Jaccard). Compone la funcion atomica `pearson` del registry; no reimplementa la
+correlacion. Lectura defensiva; NUNCA lanza.
+"""
+
+from datascience import pearson
+
+
+def missingness_correlation(null_mask, top_k=20) -> dict:
+    """Correlacion de co-ocurrencia de ausencias entre columnas.
+
+    Args:
+        null_mask: dict {col: [int 0/1, ...]} alineado por fila (1 = el valor
+            falta en esa fila). Todas las listas se asumen de la misma longitud.
+        top_k: numero maximo de pares a devolver, ordenados por |corr| desc.
+
+    Returns:
+        dict con:
+          - columns: columnas con varianza en la ausencia (al menos un 1 y al
+            menos un 0), en orden de entrada.
+          - matrix: matriz len(columns) x len(columns) de correlacion de Pearson
+            entre las mascaras binarias, diagonal 1.0.
+          - pairs: lista de hasta top_k pares (i<j) ordenados por |corr| desc.
+            Cada par: {a, b, corr, co_missing, either_missing, jaccard}.
+          - n_excluded: numero de columnas con algun nulo pero sin varianza
+            (siempre ausentes; las siempre presentes no se cuentan).
+          - excluded_cols: lista de esas columnas (en orden de entrada).
+
+        Si hay menos de 2 columnas con varianza, columns/matrix/pairs van vacios
+        pero n_excluded/excluded_cols se rellenan igualmente. NUNCA lanza.
+    """
+    # Salida base, defensiva ante entradas no-dict.
+    result = {
+        "columns": [],
+        "matrix": [],
+        "pairs": [],
+        "n_excluded": 0,
+        "excluded_cols": [],
+    }
+
+    if not isinstance(null_mask, dict) or not null_mask:
+        return result
+
+    varying = []          # columnas con varianza en la ausencia
+    varying_vecs = []     # sus vectores binarios saneados (floats 0.0/1.0)
+    excluded_cols = []    # columnas con nulos pero sin varianza (constantes)
+
+    for col, raw in null_mask.items():
+        if not isinstance(raw, (list, tuple)):
+            continue
+        # Sanea a 0/1: cualquier valor truthy distinto de 0 cuenta como ausencia.
+        vec = [1 if bool(v) else 0 for v in raw]
+        if not vec:
+            continue
+        ones = sum(vec)
+        zeros = len(vec) - ones
+        if ones > 0 and zeros > 0:
+            varying.append(col)
+            varying_vecs.append([float(v) for v in vec])
+        elif ones > 0:
+            # Todas las filas faltan (constante en la ausencia): sin varianza.
+            excluded_cols.append(col)
+        # ones == 0 -> columna siempre presente, sin nulos: no se cuenta como
+        # excluida (no aporta ausencia al analisis de co-ocurrencia).
+
+    result["n_excluded"] = len(excluded_cols)
+    result["excluded_cols"] = excluded_cols
+
+    n = len(varying)
+    if n < 2:
+        return result
+
+    result["columns"] = list(varying)
+
+    # Matriz de correlacion de Pearson, diagonal 1.0.
+    matrix = [[0.0] * n for _ in range(n)]
+    for i in range(n):
+        matrix[i][i] = 1.0
+    for i in range(n):
+        for j in range(i + 1, n):
+            r = pearson(varying_vecs[i], varying_vecs[j])
+            matrix[i][j] = r
+            matrix[j][i] = r
+    result["matrix"] = matrix
+
+    # Pares con cifras de solapamiento de conjuntos.
+    pairs = []
+    for i in range(n):
+        vi = varying_vecs[i]
+        for j in range(i + 1, n):
+            vj = varying_vecs[j]
+            co_missing = 0
+            either_missing = 0
+            for a, b in zip(vi, vj):
+                a_miss = a != 0.0
+                b_miss = b != 0.0
+                if a_miss and b_miss:
+                    co_missing += 1
+                if a_miss or b_miss:
+                    either_missing += 1
+            jaccard = co_missing / either_missing if either_missing > 0 else 0.0
+            pairs.append({
+                "a": varying[i],
+                "b": varying[j],
+                "corr": matrix[i][j],
+                "co_missing": co_missing,
+                "either_missing": either_missing,
+                "jaccard": jaccard,
+            })
+
+    pairs.sort(key=lambda p: abs(p["corr"]), reverse=True)
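+    # top_k no entero (None, str, float) o negativo -> se devuelven todos los pares.
+    valid_k = isinstance(top_k, int) and top_k >= 0
+    result["pairs"] = pairs[:top_k] if valid_k else pairs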
+
+    return result
@@ -0,0 +1,115 @@
+"""Tests para missingness_correlation."""
+
+from datascience.missingness_correlation import missingness_correlation
+
+
+def test_co_ocurrencia_fuerte_corr_uno_jaccard_uno():
+    # a y b faltan EXACTAMENTE en las mismas filas -> corr 1.0, jaccard 1.0.
+    mask = {
+        "a": [1, 0, 1, 0, 1, 0],
+        "b": [1, 0, 1, 0, 1, 0],
+    }
+    out = missingness_correlation(mask)
+    assert out["columns"] == ["a", "b"]
+    assert out["n_excluded"] == 0
+    # Diagonal 1.0, off-diagonal ~1.0.
+    assert out["matrix"][0][0] == 1.0
+    assert out["matrix"][1][1] == 1.0
+    assert abs(out["matrix"][0][1] - 1.0) < 1e-9
+    assert len(out["pairs"]) == 1
+    pair = out["pairs"][0]
+    assert {pair["a"], pair["b"]} == {"a", "b"}
+    assert abs(pair["corr"] - 1.0) < 1e-9
+    assert pair["co_missing"] == 3       # filas 0,2,4
+    assert pair["either_missing"] == 3   # mismas filas
+    assert abs(pair["jaccard"] - 1.0) < 1e-9
+
+
+def test_ausencias_disjuntas_corr_negativa_jaccard_cero():
+    # a y b nunca faltan en la misma fila -> co_missing 0, jaccard 0, corr <= 0.
+    mask = {
+        "a": [1, 1, 0, 0],
+        "b": [0, 0, 1, 1],
+    }
+    out = missingness_correlation(mask)
+    assert out["columns"] == ["a", "b"]
+    pair = out["pairs"][0]
+    assert pair["co_missing"] == 0
+    assert pair["either_missing"] == 4
+    assert pair["jaccard"] == 0.0
+    # Solapamiento nulo + ausencias complementarias -> correlacion negativa.
+    assert pair["corr"] < 0.0
+    assert abs(pair["corr"] - out["matrix"][0][1]) < 1e-12
+
+
+def test_columna_sin_varianza_se_excluye():
+    # c esta siempre presente (todo 0): no aporta ausencia -> no entra ni como
+    # excluida. d esta siempre ausente (todo 1): tiene nulos pero sin varianza
+    # -> excluida y n_excluded incrementa. a y b tienen varianza.
+    mask = {
+        "a": [1, 0, 1, 0],
+        "b": [1, 0, 0, 0],
+        "c": [0, 0, 0, 0],   # siempre presente
+        "d": [1, 1, 1, 1],   # siempre ausente, constante
+    }
+    out = missingness_correlation(mask)
+    assert out["columns"] == ["a", "b"]
+    assert "d" in out["excluded_cols"]
+    assert "c" not in out["excluded_cols"]
+    assert out["n_excluded"] == 1
+    # Matriz solo de las columnas con varianza.
+    assert len(out["matrix"]) == 2
+    assert len(out["matrix"][0]) == 2
+
+
+def test_menos_de_dos_columnas_con_varianza_vacio_pero_cuenta_excluidas():
+    # Solo una columna con varianza (a) + una constante-ausente (d).
+    mask = {
+        "a": [1, 0, 1, 0],
+        "d": [1, 1, 1, 1],
+    }
+    out = missingness_correlation(mask)
+    assert out["columns"] == []
+    assert out["matrix"] == []
+    assert out["pairs"] == []
+    assert out["n_excluded"] == 1
+    assert out["excluded_cols"] == ["d"]
+
+
+def test_mask_vacio_todo_vacio():
+    out = missingness_correlation({})
+    assert out == {
+        "columns": [],
+        "matrix": [],
+        "pairs": [],
+        "n_excluded": 0,
+        "excluded_cols": [],
+    }
+
+
+def test_top_k_limita_pares():
+    # 4 columnas con varianza -> 6 pares; top_k=2 deja 2.
+    mask = {
+        "a": [1, 0, 1, 0, 0],
+        "b": [1, 0, 0, 1, 0],
+        "c": [0, 1, 1, 0, 1],
+        "d": [1, 1, 0, 0, 1],
+    }
+    out = missingness_correlation(mask, top_k=2)
+    assert len(out["columns"]) == 4
+    assert len(out["pairs"]) == 2
+    # Ordenados por |corr| desc.
+    assert abs(out["pairs"][0]["corr"]) >= abs(out["pairs"][1]["corr"])
+
+
+def test_no_lanza_con_entradas_raras():
+    # Valores no-lista y no-dict no deben romper.
+    assert missingness_correlation(None)["columns"] == []
+    mask = {
+        "a": [1, 0, 1, 0],
+        "b": [1, 0, 1, 0],
+        "bad": "not a list",
+        "empty": [],
+    }
+    out = missingness_correlation(mask)
+    assert out["columns"] == ["a", "b"]
@@ -0,0 +1,99 @@
+---
+id: missingness_overview_py_datascience
+name: missingness_overview
+kind: function
+lang: py
+domain: datascience
+version: "1.0.0"
+purity: pure
+signature: "def missingness_overview(null_mask) -> dict"
+description: "Resumen de ausencias a nivel de dataset a partir de una máscara de nulos 0/1 por columna ({col: [1=falta, 0=presente]} alineada por fila). Calcula celdas y porcentaje de datos faltantes, cuántas columnas tienen algún nulo y cuántas filas son completas vs. incompletas. Estilo dict-no-throw del grupo eda: nunca lanza. Lectura defensiva — no-dict o dict vacío devuelve todo a 0; columnas no-lista se tratan como vacías; listas de longitud distinta se alinean a la longitud máxima rellenando la cola corta como presente (0); valores None/no-int cuentan como presente; sin ZeroDivisionError."
+tags: [eda, missing, missingness, nulls, profiling, datascience, pure]
+uses_functions: []
+uses_types: []
+returns: []
+returns_optional: false
+error_type: ""
+imports: []
+example: |
+  from datascience.missingness_overview import missingness_overview
+  mask = {
+      "a": [1, 0, 0, 0, 1],
+      "b": [1, 0, 1, 0, 0],
+      "c": [0, 0, 0, 0, 1],
+  }
+  missingness_overview(mask)
+  # n_missing_cells=5, missing_cell_pct≈33.33, complete_rows=2, incomplete_rows=3
+tested: true
+tests:
+  - "test_cooccurrence_three_cols_exact"
+  - "test_empty_dict_all_zero"
+  - "test_output_keys_contract"
+  - "test_not_a_dict_returns_zero"
+  - "test_no_nulls_all_complete"
+  - "test_none_values_treated_as_present"
+  - "test_unequal_lengths_pad_with_max"
+  - "test_columns_present_but_no_rows"
+  - "test_never_raises_on_garbage"
+test_file_path: "python/functions/datascience/missingness_overview_test.py"
+file_path: "python/functions/datascience/missingness_overview.py"
+params:
+  - name: null_mask
+    desc: "Dict {col_name: [int 0/1, ...]} with the per-column null mask, aligned by row (1 = the value is missing, 0 = the value is present). Normally all lists have the same length = number of rows. Defensive reading: a non-dict or empty dict returns all zeros; columns whose value is not a list/tuple are treated as empty; lists of different lengths are aligned to the maximum length (positions that do not exist in the shorter columns count as present, 0); None and any value not equal to 1 count as present."
+output: "Dict with exactly 9 keys, all always present (the function never raises): n_rows (row count = maximum length across columns, 0 if empty), n_cols (number of columns), n_cols_with_null (columns with >=1 missing value), n_missing_cells (total count of 1s), missing_cell_pct (0-100 = n_missing_cells / (n_rows*n_cols) * 100), complete_rows (rows with nothing missing), incomplete_rows (rows with >=1 missing value), complete_pct (0-100), incomplete_pct (0-100). Percentages are 0.0 when the denominator is 0 (no ZeroDivisionError)."
+---
+
+## Example
+
+```python
+from datascience.missingness_overview import missingness_overview
+
+# Per-column null mask: 1 = missing, 0 = present, aligned by row.
+mask = {
+    "a": [1, 0, 0, 0, 1],
+    "b": [1, 0, 1, 0, 0],
+    "c": [0, 0, 0, 0, 1],
+}
+
+missingness_overview(mask)
+# {
+#   "n_rows": 5,
+#   "n_cols": 3,
+#   "n_cols_with_null": 3,      # a, b and c each have at least one missing value
+#   "n_missing_cells": 5,       # 2 (a) + 2 (b) + 1 (c)
+#   "missing_cell_pct": 33.33,  # 5 / (5*3) * 100 = 33.333... (returned unrounded)
+#   "complete_rows": 2,         # rows 1 and 3 have nothing missing
+#   "incomplete_rows": 3,       # rows 0 (a&b), 2 (b), 4 (a&c)
+#   "complete_pct": 40.0,       # 2 / 5 * 100
+#   "incomplete_pct": 60.0,     # 3 / 5 * 100
+# }
+
+missingness_overview({})
+# All zero: {"n_rows": 0, "n_cols": 0, "n_cols_with_null": 0,
+#            "n_missing_cells": 0, "missing_cell_pct": 0.0,
+#            "complete_rows": 0, "incomplete_rows": 0,
+#            "complete_pct": 0.0, "incomplete_pct": 0.0}
+```
+
+## When to use it
+
+Use it when profiling a dataset once you already have a per-column 0/1 null mask
+(e.g. derived from the EDA's load/profiling step) and want the global picture
+of missingness in a single call: what proportion of cells is missing, how many
+columns are affected and, above all, how many rows are complete vs.
+incomplete. It is the summary block of an EDA's quality/missingness chapter
+and the basis for choosing imputation or row-deletion strategies. Because it is
+pure and dict-no-throw, you can feed it the mask as is, without validating it
+first: malformed inputs degrade to zeros instead of breaking the pipeline.
+
+## Gotchas
+
+- **`n_rows` is the maximum length across columns.** With lists of unequal
+  length, the positions missing from the shorter columns count as present
+  (`0`); no rows are dropped. In the normal case (all lists of equal length)
+  `n_rows` is simply that length.
+- **Only a value equal to `1` counts as missing.** `None`, `0`, strings and
+  any other value are treated as present. `True` and `1.0` also count as
+  missing, because they compare equal to `1`.
+- **Percentages are on a 0-100 scale**, not fractions. Division by zero is
+  guarded: with `n_rows*n_cols == 0` the percentages come out as `0.0`.
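A minimal sketch, not part of this change, of how a per-column 0/1 null mask like the one `missingness_overview` expects could be built from plain row records and summarized by hand. The helper `build_null_mask` and the sample rows are hypothetical; treating an absent key or `None` as missing is an assumption of this sketch.

```python
def build_null_mask(rows, columns):
    """Hypothetical helper: turn row records into a {col: [0/1, ...]} null mask."""
    mask = {col: [] for col in columns}
    for row in rows:
        for col in columns:
            # An absent key or a None value is flagged as missing (1).
            mask[col].append(1 if row.get(col) is None else 0)
    return mask


rows = [
    {"age": 22, "cabin": None},
    {"age": None, "cabin": None},
    {"age": 35, "cabin": "C85"},
]
mask = build_null_mask(rows, ["age", "cabin"])
# mask == {"age": [0, 1, 0], "cabin": [1, 1, 0]}

# The same headline numbers the overview reports, computed by hand.
total_cells = len(rows) * len(mask)
missing_cells = sum(sum(flags) for flags in mask.values())
missing_cell_pct = missing_cells / total_cells * 100.0 if total_cells else 0.0
complete_rows = sum(1 for flags in zip(*mask.values()) if not any(flags))
```

Here 3 of 6 cells are missing (50%) and only the last row is complete, which is the kind of result the overview would be expected to return for this mask.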
@@ -0,0 +1,116 @@
+"""Pure EDA helper: dataset-level missingness overview from a 0/1 null mask.
+
+Part of the `eda` capability group. Consumes a per-column null mask
+(``{col_name: [int 0/1, ...]}`` aligned by row, ``1`` = value is missing,
+``0`` = value is present) and derives dataset-wide missingness metrics: cell
+count and percentage of missing data, how many columns carry any null, and how
+many rows are complete vs. incomplete.
+
+Dict-no-throw style of the `eda` group: it NEVER raises. A non-dict, an empty
+dict, malformed columns, ragged lists or non-int cell values all degrade
+gracefully to the zero/contract output. Stdlib only.
+
+Ragged-length policy: columns are allowed to have different lengths. ``n_rows``
+is the **maximum** column length; positions that don't exist in a shorter
+column are treated as present (``0``). This keeps the ``n_rows * n_cols`` cell
+grid well defined without dropping rows.
+"""
+
+
+def _is_missing(value) -> int:
+    """Return ``1`` iff ``value`` denotes a missing cell, else ``0``.
+
+    Only an exact equality to ``1`` (covers ``int`` ``1`` and ``float`` ``1.0``)
+    counts as missing. ``None``, ``0``, strings and any other value are treated
+    as present. The comparison cannot raise for standard inputs.
+    """
+    try:
+        return 1 if value == 1 else 0
+    except Exception:
+        return 0
+
+
+def missingness_overview(null_mask) -> dict:
+    """Summarize dataset-level missingness from a 0/1 null mask.
+
+    Args:
+        null_mask: Dict ``{col_name: [int 0/1, ...]}`` where each list is aligned
+            by row (``1`` = missing, ``0`` = present). Lists are normally all the
+            same length (= number of rows). Defensive: a non-dict or empty dict
+            returns the all-zero contract; non-list columns are treated as empty;
+            ragged lists are aligned to the maximum length, padding the missing
+            tail of shorter columns as present (``0``); ``None`` / non-int cells
+            count as present.
+
+    Returns:
+        Dict with exactly these keys, all always present (the function never
+        raises): ``n_rows``, ``n_cols``, ``n_cols_with_null``,
+        ``n_missing_cells``, ``missing_cell_pct`` (0-100), ``complete_rows``,
+        ``incomplete_rows``, ``complete_pct`` (0-100), ``incomplete_pct``
+        (0-100). Percentages are ``0.0`` when the denominator is zero (no
+        ``ZeroDivisionError``).
+    """
+    zero = {
+        "n_rows": 0,
+        "n_cols": 0,
+        "n_cols_with_null": 0,
+        "n_missing_cells": 0,
+        "missing_cell_pct": 0.0,
+        "complete_rows": 0,
+        "incomplete_rows": 0,
+        "complete_pct": 0.0,
+        "incomplete_pct": 0.0,
+    }
+
+    if not isinstance(null_mask, dict) or not null_mask:
+        return dict(zero)
+
+    # Normalize every column to a list; non-list columns become empty.
+    cols = {}
+    for name, seq in null_mask.items():
+        cols[name] = seq if isinstance(seq, (list, tuple)) else []
+
+    n_cols = len(cols)
+    lengths = [len(seq) for seq in cols.values()]
+    n_rows = max(lengths) if lengths else 0
+
+    if n_rows == 0:
+        # Columns exist but carry no rows: everything zero except n_cols.
+        out = dict(zero)
+        out["n_cols"] = n_cols
+        return out
+
+    n_missing_cells = 0
+    n_cols_with_null = 0
+    row_has_missing = [False] * n_rows
+
+    for seq in cols.values():
+        col_len = len(seq)
+        col_has_null = False
+        for r in range(n_rows):
+            if r < col_len and _is_missing(seq[r]):
+                n_missing_cells += 1
+                row_has_missing[r] = True
+                col_has_null = True
+        if col_has_null:
+            n_cols_with_null += 1
+
+    incomplete_rows = sum(1 for flag in row_has_missing if flag)
+    complete_rows = n_rows - incomplete_rows
+
+    total_cells = n_rows * n_cols
+    missing_cell_pct = (n_missing_cells / total_cells * 100.0) if total_cells else 0.0
+    complete_pct = complete_rows / n_rows * 100.0
+    incomplete_pct = incomplete_rows / n_rows * 100.0
+
+    return {
+        "n_rows": n_rows,
+        "n_cols": n_cols,
+        "n_cols_with_null": n_cols_with_null,
+        "n_missing_cells": n_missing_cells,
+        "missing_cell_pct": missing_cell_pct,
+        "complete_rows": complete_rows,
+        "incomplete_rows": incomplete_rows,
+        "complete_pct": complete_pct,
+        "incomplete_pct": incomplete_pct,
+    }
@@ -0,0 +1,146 @@
+"""Tests for missingness_overview."""
+
+import sys
+import os
+
+import pytest
+
+sys.path.insert(0, os.path.dirname(__file__))
+
+from missingness_overview import missingness_overview
+
+
+# Output contract: every call returns exactly these 9 keys.
+EXPECTED_KEYS = {
+    "n_rows",
+    "n_cols",
+    "n_cols_with_null",
+    "n_missing_cells",
+    "missing_cell_pct",
+    "complete_rows",
+    "incomplete_rows",
+    "complete_pct",
+    "incomplete_pct",
+}
+
+
+def test_cooccurrence_three_cols_exact():
+    # 3 columns, 5 rows. Hand-computed expectations:
+    #   col a missing at rows 0, 4      -> 2
+    #   col b missing at rows 0, 2      -> 2
+    #   col c missing at row  4         -> 1
+    #   n_missing_cells = 5, total_cells = 5*3 = 15 -> 33.333...%
+    #   row 0 (a&b co-occur)  -> incomplete
+    #   row 1 (all present)   -> complete
+    #   row 2 (b only)        -> incomplete
+    #   row 3 (all present)   -> complete
+    #   row 4 (a&c co-occur)  -> incomplete
+    mask = {
+        "a": [1, 0, 0, 0, 1],
+        "b": [1, 0, 1, 0, 0],
+        "c": [0, 0, 0, 0, 1],
+    }
+    out = missingness_overview(mask)
+    assert out["n_rows"] == 5
+    assert out["n_cols"] == 3
+    assert out["n_cols_with_null"] == 3
+    assert out["n_missing_cells"] == 5
+    assert out["missing_cell_pct"] == pytest.approx(33.33333333, abs=1e-6)
+    assert out["complete_rows"] == 2
+    assert out["incomplete_rows"] == 3
+    assert out["complete_pct"] == pytest.approx(40.0)
+    assert out["incomplete_pct"] == pytest.approx(60.0)
+
+
+def test_empty_dict_all_zero():
+    out = missingness_overview({})
+    assert out == {
+        "n_rows": 0,
+        "n_cols": 0,
+        "n_cols_with_null": 0,
+        "n_missing_cells": 0,
+        "missing_cell_pct": 0.0,
+        "complete_rows": 0,
+        "incomplete_rows": 0,
+        "complete_pct": 0.0,
+        "incomplete_pct": 0.0,
+    }
+
+
+def test_output_keys_contract():
+    # The 9-key contract holds even for the garbage/zero path.
+    assert set(missingness_overview({}).keys()) == EXPECTED_KEYS
+    assert set(missingness_overview({"a": [1, 0]}).keys()) == EXPECTED_KEYS
+
+
+def test_not_a_dict_returns_zero():
+    for bad in (None, [1, 0, 1], 42, "nope", 3.14):
+        out = missingness_overview(bad)
+        assert out["n_rows"] == 0
+        assert out["n_cols"] == 0
+        assert out["n_missing_cells"] == 0
+        assert out["missing_cell_pct"] == 0.0
+
+
+def test_no_nulls_all_complete():
+    mask = {"a": [0, 0, 0], "b": [0, 0, 0]}
+    out = missingness_overview(mask)
+    assert out["n_rows"] == 3
+    assert out["n_cols"] == 2
+    assert out["n_cols_with_null"] == 0
+    assert out["n_missing_cells"] == 0
+    assert out["missing_cell_pct"] == 0.0
+    assert out["complete_rows"] == 3
+    assert out["incomplete_rows"] == 0
+    assert out["complete_pct"] == pytest.approx(100.0)
+    assert out["incomplete_pct"] == pytest.approx(0.0)
+
+
+def test_none_values_treated_as_present():
+    # None and other non-1 values count as present (0).
+    mask = {"a": [None, 1, None, "x", 0]}
+    out = missingness_overview(mask)
+    assert out["n_rows"] == 5
+    assert out["n_cols"] == 1
+    assert out["n_missing_cells"] == 1  # only the explicit 1 at row 1
+    assert out["n_cols_with_null"] == 1
+    assert out["complete_rows"] == 4
+    assert out["incomplete_rows"] == 1
+
+
+def test_unequal_lengths_pad_with_max():
+    # Ragged lists: n_rows = max length; shorter column padded as present.
+    #   a = [1, 1] -> missing at rows 0, 1
+    #   b = [0]    -> row 1 padded to present
+    #   n_rows = 2, n_cols = 2, total_cells = 4, n_missing_cells = 2 -> 50%
+    mask = {"a": [1, 1], "b": [0]}
+    out = missingness_overview(mask)
+    assert out["n_rows"] == 2
+    assert out["n_cols"] == 2
+    assert out["n_cols_with_null"] == 1
+    assert out["n_missing_cells"] == 2
+    assert out["missing_cell_pct"] == pytest.approx(50.0)
+    assert out["complete_rows"] == 0
+    assert out["incomplete_rows"] == 2
+    assert out["incomplete_pct"] == pytest.approx(100.0)
+
+
+def test_columns_present_but_no_rows():
+    # Columns exist but all empty -> zero metrics, n_cols preserved.
+    out = missingness_overview({"a": [], "b": []})
+    assert out["n_rows"] == 0
+    assert out["n_cols"] == 2
+    assert out["n_missing_cells"] == 0
+    assert out["missing_cell_pct"] == 0.0
+    assert out["complete_pct"] == 0.0
+
+
+def test_never_raises_on_garbage():
+    # Non-list column values, mixed junk -> must not raise.
+    mask = {"a": "not a list", "b": 123, "c": [1, 0, 1]}
+    out = missingness_overview(mask)
+    assert set(out.keys()) == EXPECTED_KEYS
+    assert out["n_rows"] == 3
+    assert out["n_cols"] == 3
+    assert out["n_missing_cells"] == 2  # only col c contributes
+    assert out["n_cols_with_null"] == 1
@@ -0,0 +1,93 @@
+---
+id: missingness_rank_bar_figure_py_datascience
+name: missingness_rank_bar_figure
+kind: function
+lang: py
+domain: datascience
+version: "1.0.0"
+purity: impure
+signature: "def missingness_rank_bar_figure(names, pcts, title=\"% de valores faltantes por columna\") -> \"matplotlib.figure.Figure\""
+description: "Builds a matplotlib horizontal bar figure that ranks a dataset's columns by their percentage of missing values (0-100), largest at the top, labelling each bar with its NN.N% at the end. Uses ax.barh, an X axis fixed to 0-100 and labels truncated to ~22 chars. Returns a matplotlib.figure.Figure ready to be rasterized by the EDA report renderer (missing-data chapter). Agg backend without global pyplot; defensive against empty lists, unequal lengths or non-numeric values (never raises)."
+tags: [eda, missing, missingness, ranking, bar, barh, matplotlib, figure, visualization, datascience, impure]
+uses_functions: []
+uses_types: []
+returns: []
+returns_optional: false
+error_type: "error_go_core"
+imports: [matplotlib]
+example: |
+  from datascience.missingness_rank_bar_figure import missingness_rank_bar_figure
+  names = ["edad", "ingresos", "ciudad", "email"]
+  pcts = [12.5, 40.0, 3.2, 0.0]
+  fig = missingness_rank_bar_figure(names, pcts, title="% de valores faltantes por columna")
+tested: true
+tests:
+  - "test_returns_figure_with_axes"
+  - "test_sorted_descending_largest_on_top"
+  - "test_empty_lists_do_not_raise_and_returns_figure"
+  - "test_xlim_is_zero_to_hundred"
+  - "test_length_mismatch_and_non_numeric_are_handled"
+test_file_path: "python/functions/datascience/missingness_rank_bar_figure_test.py"
+file_path: "python/functions/datascience/missingness_rank_bar_figure.py"
+params:
+  - name: names
+    desc: "List of column names. May be empty (returns a \"sin datos faltantes\" figure). Items are converted to str and truncated to ~22 chars with an ellipsis for the Y-axis labels; the originals are not mutated."
+  - name: pcts
+    desc: "List parallel to names with the % of nulls in [0,100]. None, NaN or non-numeric values are coerced to 0.0 and negatives are clamped to 0. If len(names) != len(pcts) both are cut to the shorter of the two so nothing breaks."
+  - name: title
+    desc: "Figure title. Truncated to ~60 chars with an ellipsis if too long. Default \"% de valores faltantes por columna\"."
+output: "A matplotlib.figure.Figure (figsize 6.4 x adaptive height based on the number of bars, dpi 150) with one horizontal-bar Axes (ax.barh) sorted by % descending, largest at the top. X axis fixed to [0,100] with the label \"% faltante\", Y-axis labels truncated to ~22 chars, and each bar annotated with its NN.N% at the end. If names or pcts are empty it returns a Figure with the centered text \"sin datos faltantes\"; any unexpected error is caught and returned as a Figure carrying the error message (never raises). The caller rasterizes/closes the figure; the function neither shows nor saves it."
+---
+
+## Example
+
+```python
+from datascience.missingness_rank_bar_figure import missingness_rank_bar_figure
+
+# % of nulls per column, e.g. (df.isnull().mean() * 100).round(1).
+names = ["edad", "ingresos", "ciudad", "email"]
+pcts = [12.5, 40.0, 3.2, 0.0]
+
+fig = missingness_rank_bar_figure(
+    names,
+    pcts,
+    title="% de valores faltantes por columna",
+)
+
+# ingresos (40.0%) ends up at the top; email (0.0%) at the bottom.
+# The report renderer rasterizes it; here we only save it for inspection.
+fig.savefig("/tmp/missingness_rank.png")
+```
+
+## When to use it
+
+Use it to open the missing-data chapter of an EDA report and answer "which
+columns are the most incomplete?" at a glance. Pass it the column names and
+each column's % of nulls (`(df.isnull().mean() * 100).round(1)`); the function
+takes care of sorting from largest to smallest and putting the worst on top.
+It is the "magnitude" counterpart of the co-occurrence heatmap: the bars say
+*how much* is missing in each column, the heatmap says *whether those absences
+are related* across columns.
+
+## Gotchas
+
+- **Impure because of matplotlib.** It touches the rendering machinery. It uses
+  the `Agg` backend and the object-oriented `Figure`/`add_subplot` API, NEVER
+  `pyplot.*` here, so it neither touches global state nor leaks figures between
+  calls. `pyplot` is NOT thread-safe; this function avoids that risk by building
+  the `Figure` directly, so it is safe to call in a loop from the renderer.
+- **The caller releases the figure.** It returns the `Figure` but neither shows
+  nor saves it. Whoever consumes it must rasterize it and then drop the
+  reference so memory does not pile up in large batches. Because the figure is
+  built outside pyplot, `matplotlib.pyplot.close(fig)` is harmless but has
+  nothing registered to close.
+- **Expects 0-100 percentages, not 0-1 fractions.** The X axis is fixed to
+  `[0, 100]`. If you pass fractions (`0.4` instead of `40.0`) the bars will hug
+  the origin. Multiply by 100 before calling.
+- **Adaptive height.** The figure height grows with the number of bars (up to
+  a cap) so reports with many columns stay legible; even so, it is worth
+  filtering down to the columns with at least one null before calling, to
+  avoid listing dozens of bars at 0%.
+- **Defensive, never raises.** Empty lists, unequal lengths, values that are
+  `None`, `NaN` or non-numeric, or any unexpected error are handled without
+  propagating: in the worst case it returns a `Figure` with "sin datos
+  faltantes" or with the error text. Do not wrap the call in try/except for
+  fear of a raise; there is none.
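The advice above to pass 0-100 percentages and to filter out fully populated columns can be sketched as a small pre-processing step. `rank_missing_pcts` is a hypothetical helper (not part of this change) that derives the `names` / `pcts` pair from a 0/1 null mask; the figure function itself would then be called with its output.

```python
def rank_missing_pcts(null_mask):
    """Hypothetical helper: (names, pcts) on the 0-100 scale, largest first."""
    ranked = []
    for name, flags in null_mask.items():
        if not flags:
            continue  # a column with no rows has nothing to rank
        pct = sum(1 for f in flags if f == 1) / len(flags) * 100.0
        if pct > 0:
            # Keep only columns with at least one null; round for display.
            ranked.append((str(name), round(pct, 1)))
    ranked.sort(key=lambda item: item[1], reverse=True)
    names = [name for name, _ in ranked]
    pcts = [pct for _, pct in ranked]
    return names, pcts


names, pcts = rank_missing_pcts({
    "edad": [1, 0, 0, 0],
    "ingresos": [1, 1, 0, 0],
    "email": [0, 0, 0, 0],
})
# names == ["ingresos", "edad"], pcts == [50.0, 25.0]; "email" is dropped.
```

Sorting here is only for readability of the intermediate lists, since the figure function already sorts by percentage itself.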
@@ -0,0 +1,150 @@
+"""Impure EDA helper: ranked bar figure of missing-value share (`eda` group).
+
+Builds a horizontal bar chart ranking the columns of a dataset by their
+percentage of missing values (0-100), largest at the top, each bar labelled with
+its ``NN.N%`` at the end. Returns a ready-to-rasterize
+``matplotlib.figure.Figure``; it never shows nor saves it.
+
+Impure because it touches matplotlib's rendering machinery. It uses the headless
+Agg backend and the object-oriented ``Figure`` API (no ``pyplot``) so it leaks no
+global state and is safe to call repeatedly from a report renderer.
+"""
+
+import matplotlib
+
+matplotlib.use("Agg")
+
+from matplotlib.figure import Figure  # noqa: E402
+
+# Muted gray for secondary text (no-data / fallback messages).
+_MUTED_TEXT = "#5f6b7a"
+# Soft red for the error fallback message.
+_ERROR_TEXT = "#b00020"
+# Bar fill — a calm blue that reads well on white at report size.
+_BAR_COLOR = "#4C72B0"
+
+
+def _truncate(text, width: int = 22) -> str:
+    """Truncate ``text`` to ``width`` chars, appending an ellipsis if cut."""
+    s = "" if text is None else str(text)
+    if len(s) <= width:
+        return s
+    if width <= 1:
+        return s[:width]
+    return s[: width - 1] + "…"
+
+
+def _message_figure(message: str, color: str = _MUTED_TEXT) -> "Figure":
+    """Return a fallback ``Figure`` carrying a single centered message."""
+    fig = Figure(figsize=(6.4, 4.0), dpi=150)
+    ax = fig.add_subplot(111)
+    ax.axis("off")
+    ax.text(
+        0.5,
+        0.5,
+        message,
+        ha="center",
+        va="center",
+        fontsize=12,
+        color=color,
+        wrap=True,
+        transform=ax.transAxes,
+    )
+    fig.tight_layout()
+    return fig
+
+
+def missingness_rank_bar_figure(
+    names,
+    pcts,
+    title: str = "% de valores faltantes por columna",
+) -> "matplotlib.figure.Figure":
+    """Build a horizontal ranked bar figure of missing-value share per column.
+
+    Pairs each column name with its missing percentage, sorts by percentage
+    descending and draws horizontal bars with the largest at the top. The X axis
+    is pinned to ``[0, 100]`` so bars are comparable across reports, each bar is
+    annotated with its ``NN.N%`` at the end, and the Y tick labels are truncated
+    to ~22 chars.
+
+    The function is fully defensive: empty/mismatched/non-numeric input never
+    raises. When there is nothing valid to draw it returns a ``Figure`` carrying
+    a centered "sin datos faltantes" message, and any unexpected error is caught
+    and turned into a fallback ``Figure`` carrying the error text.
+
+    Args:
+        names: List of column names. May be empty. Items are stringified and
+            truncated for display; the originals are not mutated.
+        pcts: List parallel to ``names`` of missing-value percentages in
+            ``[0, 100]``. Non-numeric/``None`` values are coerced to ``0.0`` and
+            negatives are clamped to ``0``. The list is truncated to
+            ``min(len(names), len(pcts))`` so a length mismatch never crashes.
+        title: Figure title. Default "% de valores faltantes por columna".
+
+    Returns:
+        A ``matplotlib.figure.Figure`` with a single horizontal-bar Axes. The
+        caller is responsible for rasterizing/closing it.
+    """
+    try:
+        if (
+            not isinstance(names, (list, tuple))
+            or not isinstance(pcts, (list, tuple))
+            or len(names) == 0
+            or len(pcts) == 0
+        ):
+            return _message_figure("sin datos faltantes")
+
+        # --- Pair names with coerced percentages, tolerating length mismatch.
+        pairs = []
+        for name, pct in zip(names, pcts):
+            try:
+                val = float(pct)
+            except (TypeError, ValueError):
+                val = 0.0
+            if val != val:  # NaN guard.
+                val = 0.0
+            val = max(0.0, val)
+            pairs.append((name, val))
+
+        if not pairs:
+            return _message_figure("sin datos faltantes")
+
+        # Sort by percentage descending; barh draws bottom-up, so the largest
+        # ends at the top when we reverse the order before plotting.
+        pairs.sort(key=lambda p: p[1], reverse=True)
+        ordered = list(reversed(pairs))  # smallest first -> largest on top.
+
+        labels = [_truncate(name, 22) for name, _ in ordered]
+        values = [val for _, val in ordered]
+        y_pos = range(len(ordered))
+
+        # Height scales with the number of bars so dense reports stay readable.
+        height = max(2.4, min(0.4 * len(ordered) + 1.2, 14.0))
+        fig = Figure(figsize=(6.4, height), dpi=150)
+        ax = fig.add_subplot(111)
+
+        ax.barh(list(y_pos), values, color=_BAR_COLOR, edgecolor="white")
+        ax.set_yticks(list(y_pos))
+        ax.set_yticklabels(labels, fontsize=8)
+        ax.set_xlim(0, 100)
+        ax.set_xlabel("% faltante", fontsize=9)
+
+        # Annotate each bar with its percentage at the end of the bar.
+        for y, val in zip(y_pos, values):
+            ax.text(
+                min(val + 1.5, 99.0),
+                y,
+                f"{val:.1f}%",
+                va="center",
+                ha="left" if val < 90 else "right",
+                fontsize=7,
+                color="#202020",
+            )
+
+        if title:
+            ax.set_title(_truncate(title, 60), fontsize=12, loc="left", pad=10)
+
+        fig.tight_layout()
+        return fig
+    except Exception as exc:  # noqa: BLE001 — never raise from a figure builder.
+        return _message_figure(f"error al dibujar barras: {exc}", color=_ERROR_TEXT)
@@ -0,0 +1,64 @@
+"""Tests for missingness_rank_bar_figure (bars of % missing, eda group).
+
+Uses the Agg backend; pyplot is imported only to close figures, and nothing is
+shown or saved. Each test explicitly closes the Figure it builds
+(matplotlib.pyplot.close) so no state accumulates between tests.
+"""
+
+import matplotlib
+
+matplotlib.use("Agg")
+
+import matplotlib.pyplot as plt  # noqa: E402
+from matplotlib.figure import Figure  # noqa: E402
+
+from missingness_rank_bar_figure import missingness_rank_bar_figure
+
+
+def test_returns_figure_with_axes():
+    names = ["edad", "ingresos", "ciudad"]
+    pcts = [12.5, 40.0, 3.2]
+    fig = missingness_rank_bar_figure(names, pcts, title="faltantes")
+    assert isinstance(fig, Figure)
+    assert len(fig.axes) >= 1
+    plt.close(fig)
+
+
+def test_sorted_descending_largest_on_top():
+    names = ["a", "b", "c"]
+    pcts = [10.0, 50.0, 25.0]
+    fig = missingness_rank_bar_figure(names, pcts)
+    ax = fig.axes[0]
+    # barh draws bottom-up; the largest (50, "b") must end up on top (highest y).
+    bars = ax.patches
+    # The last patch (highest y index) corresponds to the top bar.
+    widths = [b.get_width() for b in bars]
+    assert max(widths) == 50.0
+    # The widest bar is the one with the highest y coordinate (top).
+    top_bar = max(bars, key=lambda b: b.get_y())
+    assert top_bar.get_width() == 50.0
+    plt.close(fig)
+
+
+def test_empty_lists_do_not_raise_and_returns_figure():
+    fig = missingness_rank_bar_figure([], [], title="vacía")
+    assert isinstance(fig, Figure)
+    assert len(fig.axes) >= 1
+    plt.close(fig)
+
+
+def test_xlim_is_zero_to_hundred():
+    fig = missingness_rank_bar_figure(["a"], [42.0])
+    ax = fig.axes[0]
+    assert ax.get_xlim() == (0.0, 100.0)
+    plt.close(fig)
+
+
+def test_length_mismatch_and_non_numeric_are_handled():
+    # More names than pcts plus a None pct -> zip truncates and None is coerced to 0.
+    names = ["a", "b", "c"]
+    pcts = [None, 30.0]
+    fig = missingness_rank_bar_figure(names, pcts)
+    assert isinstance(fig, Figure)
+    assert len(fig.axes) >= 1
+    plt.close(fig)
@@ -0,0 +1,65 @@
+---
+name: missingness_row_patterns
+kind: function
+lang: py
+domain: datascience
+version: "1.0.0"
+purity: pure
+signature: "def missingness_row_patterns(null_mask, top_n=10) -> dict"
+description: "Groups the rows of a dataset by their missingness pattern (missingno matrix style): for each row, the pattern is the SORTED tuple of columns missing in that row (those with a 1 in the null_mask). Counts the frequency of each distinct pattern, including the empty pattern (complete row). Returns the top_n by frequency with their pct of the total. Pure, defensive reading, NEVER raises; {} -> n_rows 0."
+tags: [eda, missingness, missingno, patterns, profiling, datascience, data-quality]
+params:
+  - name: null_mask
+    desc: "Dict {col: [0/1, ...]} aligned by row, where 1 = the cell is missing in that row and 0 = present. All columns should have the same length (one entry per row); if they differ, n_rows is the length of the longest list and out-of-range cells count as present. Keys are sorted by str(col) to canonicalize the pattern. {} (or a non-dict) -> n_rows 0."
+  - name: top_n
+    desc: "Maximum number of patterns returned in `patterns`, ranked by n_rows desc (tiebreak: fewer columns first, then column names). The total count of distinct patterns is always reported in `n_patterns` and is never truncated. Default 10. Negative values -> 0; non-int -> 10."
+output: "Dict {n_rows: int (total rows), n_patterns: int (distinct patterns, including the empty pattern = complete row), complete_rows: int (rows with the empty pattern, nothing missing), patterns: list of the top_n sorted by n_rows desc with [{missing_cols: [col,...] (empty = complete row), n_rows: int, pct: float 0-100 over the total n_rows, rounded to 2 decimals}]}. For {} it returns n_rows 0 and patterns []. NEVER raises."
+uses_functions: []
+uses_types: []
+returns: []
+returns_optional: false
+error_type: ""
+imports: []
+tested: true
+tests: ["test_patron_dominante_completas_singleton", "test_mask_vacio", "test_top_n_trunca_pero_cuenta_todos"]
+test_file_path: "python/functions/datascience/missingness_row_patterns_test.py"
+file_path: "python/functions/datascience/missingness_row_patterns.py"
+---
+
+## Example
+
+```python
+import sys, os
+sys.path.insert(0, os.path.join("python", "functions"))
+from datascience.missingness_row_patterns import missingness_row_patterns
+
+# null_mask aligned by row: 1 = the cell is missing in that row.
+null_mask = {
+    "A": [1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
+    "B": [1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
+    "C": [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
+}
+out = missingness_row_patterns(null_mask, top_n=10)
+print(out["n_rows"], out["n_patterns"], out["complete_rows"])  # 10 3 5
+for p in out["patterns"]:
+    label = p["missing_cols"] or "(fila completa)"
+    print(label, p["n_rows"], p["pct"])
+# (fila completa) 5 50.0
+# ['A', 'B'] 4 40.0
+# ['C'] 1 10.0
+```
+
+## When to use it
+
+- Use it in the quality/missingness chapter of `AutomaticEDA` to show the "missingno pattern matrix": instead of plotting cell by cell, it summarizes which combinations of columns go blank together and how often.
+- When you already have the per-column null_mask (1 = missing) and want to detect structural co-missingness ("A and B always go missing together") before deciding on an imputation or a joint drop of columns.
+- When you need a compact "pattern -> number of rows -> pct" table for a report or a bar chart of the most common missingness patterns, also separating out how many rows are complete (`complete_rows`).
+
+## Gotchas
+
+- Pure function, no I/O, deterministic. Defensive reading: `{}` or a non-dict return `n_rows` 0 with `patterns` []. NEVER raises.
+- The empty pattern (complete row, `missing_cols=[]`) DOES count as a pattern: it is included in `n_patterns` and may appear in `patterns`. The consumer labels it "(fila completa)".
+- `pct` is relative to the total `n_rows` (0-100), rounded to 2 decimals. The `pct` values of ALL patterns sum to 100 (up to rounding); if `top_n` truncates, the `pct` values shown will sum to less.
+- Columns are sorted by `str(col)` to canonicalize each pattern, so `{A,B}` and `{B,A}` collapse into the same pattern `["A", "B"]`.
+- A cell counts as missing only if it equals 1 (`int(cell) == 1`); 0, None and non-numeric values are treated as present.
+- If the column lists have different lengths, `n_rows` is the longest and the out-of-range positions of a short column count as present (0).
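As an illustration of the pattern idea described above, a simplified counting loop could look like the following. It is a sketch under the assumption that every column value is a list; `count_row_patterns` is a hypothetical name and this is not the function's actual implementation, which additionally handles malformed input, ranking and `top_n`.

```python
from collections import Counter


def count_row_patterns(null_mask):
    """Hypothetical sketch: count rows per sorted tuple of missing columns."""
    cols = sorted(null_mask, key=str)  # canonical column order
    n_rows = max((len(null_mask[c]) for c in cols), default=0)
    counts = Counter()
    for r in range(n_rows):
        # Out-of-range positions of a shorter column count as present.
        pattern = tuple(
            c for c in cols
            if r < len(null_mask[c]) and null_mask[c][r] == 1
        )
        counts[pattern] += 1
    return counts


counts = count_row_patterns({
    "A": [1, 1, 0, 0],
    "B": [1, 1, 0, 0],
    "C": [0, 0, 0, 1],
})
# Three distinct patterns: ("A", "B") twice, the empty pattern once, ("C",) once.
```

Using a tuple of sorted column names as the key is what makes `{A,B}` and `{B,A}` collapse into one pattern; the empty tuple stands for a complete row.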
@@ -0,0 +1,107 @@
+"""missingness_row_patterns — distinct per-row missingness patterns (missingno matrix style).
+
+Pure function: no I/O, deterministic, NEVER raises. Given a per-column null mask
+aligned by row ({col: [0/1, ...]}, 1 = missing), it groups rows by their missing
+"pattern" — the sorted tuple of column names that are missing in that row — and
+counts how often each distinct pattern occurs.
+
+This mirrors the missingno matrix idea: instead of plotting per-cell nullity, it
+collapses each row to the SET of columns it lacks, surfacing co-missing structure
+(e.g. "A and B always go missing together"). The empty pattern (a fully complete
+row) is a first-class pattern and may appear in the result with missing_cols=[];
+the caller labels it "(fila completa)".
+"""
+
+
+def _is_missing(cell) -> bool:
+    """A cell counts as missing only when int(cell) == 1 (0/1 mask).
+
+    None / 0 / non-numeric (incl. NaN, inf) count as present. Never raises.
+    """
+    try:
+        return int(cell) == 1
+    except (TypeError, ValueError, OverflowError):
+        return False  # unconvertible cell: present, as the docstring promises
+
+
+def missingness_row_patterns(null_mask, top_n=10) -> dict:
+    """Count distinct per-row missingness patterns from a column null mask.
+
+    For each row, its pattern is the sorted tuple of column names missing in that
+    row (the columns whose value is 1). The frequency of each distinct pattern is
+    counted, including the empty pattern (a complete row with nothing missing).
+
+    Args:
+        null_mask: Dict {col: [0/1, ...]} aligned by row, where 1 means the cell
+            is missing in that row. Read defensively; columns with differing
+            lengths are tolerated (n_rows is the longest list; out-of-range cells
+            count as present). Empty dict -> n_rows 0.
+        top_n: Maximum number of patterns returned in `patterns`, ranked by
+            n_rows desc (tiebreak: fewer columns first, then column names). The
+            full count of distinct patterns is always reported in `n_patterns`.
+
+    Returns:
+        Dict:
+        {
+          "n_rows": int,            # total rows
+          "n_patterns": int,        # distinct patterns (incl. the empty pattern)
+          "complete_rows": int,     # rows with the empty pattern (nothing missing)
+          "patterns": [             # top_n patterns, n_rows desc
+             {"missing_cols": [col, ...], "n_rows": int, "pct": float}  # [] = complete row
+          ],
+        }
+        For {} (or a non-dict) returns n_rows 0 and patterns []. NEVER raises.
+    """
+    empty = {"n_rows": 0, "n_patterns": 0, "complete_rows": 0, "patterns": []}
+    if not isinstance(null_mask, dict) or not null_mask:
+        return empty
+
+    # Stable, canonical column order so each row's pattern tuple is sorted.
+    items = sorted(null_mask.items(), key=lambda kv: str(kv[0]))
+    names = [str(k) for k, _ in items]
+    lists = [v if isinstance(v, (list, tuple)) else [] for _, v in items]
+
+    n_rows = max((len(lst) for lst in lists), default=0)
+    if n_rows == 0:
+        return empty
+
+    # Defensive parsing of top_n.
+    try:
+        limit = int(top_n)
+    except (TypeError, ValueError):
+        limit = 10
+    if limit < 0:
+        limit = 0
+
+    counts: dict = {}
+    n_cols = len(names)
+    for r in range(n_rows):
+        # names is sorted, so iterating in order yields an already-sorted tuple.
+        pattern = tuple(
+            names[c]
+            for c in range(n_cols)
+            if r < len(lists[c]) and _is_missing(lists[c][r])
+        )
+        counts[pattern] = counts.get(pattern, 0) + 1
+
+    complete_rows = counts.get((), 0)
+    n_patterns = len(counts)
+
+    # Rank: n_rows desc, then fewer columns first, then column names (deterministic).
+    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], len(kv[0]), kv[0]))
+
+    patterns = [
+        {
+            "missing_cols": list(pat),
+            "n_rows": cnt,
+            "pct": round(100.0 * cnt / n_rows, 2),
+        }
+        for pat, cnt in ordered[:limit]
+    ]
+
+    return {
+        "n_rows": n_rows,
+        "n_patterns": n_patterns,
+        "complete_rows": complete_rows,
+        "patterns": patterns,
+    }
@@ -0,0 +1,87 @@
+"""Tests for missingness_row_patterns."""
+
+import os
+import sys
+
+sys.path.insert(0, os.path.dirname(__file__))
+
+from missingness_row_patterns import missingness_row_patterns
+
+_EXPECTED_KEYS = {"n_rows", "n_patterns", "complete_rows", "patterns"}
+
+
+def test_patron_dominante_completas_singleton():
+    """Golden: {A,B} co-missing in 4 rows + 5 complete rows + 1 singleton {C}."""
+    # 10 rows. A and B are missing together in rows 0-3; rows 4-8 are complete;
+    # row 9 is missing only C.
+    null_mask = {
+        "A": [1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
+        "B": [1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
+        "C": [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
+    }
+    out = missingness_row_patterns(null_mask)
+
+    assert set(out.keys()) == _EXPECTED_KEYS
+    assert out["n_rows"] == 10
+    # 3 distinct patterns: (A,B), () and (C,).
+    assert out["n_patterns"] == 3
+    # 5 complete rows (rows 4-8).
+    assert out["complete_rows"] == 5
+
+    # Order: n_rows desc; tiebreak: fewer columns first.
+    # () has 5 rows, (A,B) 4, (C,) 1.
+    pats = out["patterns"]
+    assert len(pats) == 3
+
+    assert pats[0]["missing_cols"] == []
+    assert pats[0]["n_rows"] == 5
+    assert pats[0]["pct"] == 50.0
+
+    assert pats[1]["missing_cols"] == ["A", "B"]
+    assert pats[1]["n_rows"] == 4
+    assert pats[1]["pct"] == 40.0
+
+    assert pats[2]["missing_cols"] == ["C"]
+    assert pats[2]["n_rows"] == 1
+    assert pats[2]["pct"] == 10.0
+
+    # Output types.
+    assert isinstance(out["n_rows"], int)
+    assert isinstance(pats[0]["pct"], float)
+
+
+def test_mask_vacio():
+    """{} -> n_rows 0, no patterns, never raises."""
+    out = missingness_row_patterns({})
+    assert out == {
+        "n_rows": 0,
+        "n_patterns": 0,
+        "complete_rows": 0,
+        "patterns": [],
+    }
+    # A non-dict / None also degrades to empty without raising.
+    assert missingness_row_patterns(None)["n_rows"] == 0
+    # Columns present but with empty lists -> n_rows 0.
+    assert missingness_row_patterns({"A": [], "B": []})["patterns"] == []
+
+
+def test_top_n_trunca_pero_cuenta_todos():
+    """top_n limits `patterns`, but n_patterns reports ALL distinct patterns."""
+    null_mask = {
+        "A": [0, 1, 1, 0, 1],
+        "B": [0, 0, 0, 1, 1],
+        "C": [0, 0, 0, 0, 1],
+    }
+    # Rows: ()  (A,)  (A,)  (B,)  (A,B,C)
+    out = missingness_row_patterns(null_mask, top_n=2)
+
+    assert out["n_rows"] == 5
+    assert out["n_patterns"] == 4          # (), (A,), (B,), (A,B,C)
+    assert out["complete_rows"] == 1
+    # Only 2 patterns returned even though there are 4.
+    assert len(out["patterns"]) == 2
+    # (A,) dominates with 2 rows; the tie for 2nd among 1-row patterns goes to () (0 cols).
+    assert out["patterns"][0]["missing_cols"] == ["A"]
+    assert out["patterns"][0]["n_rows"] == 2
+    assert out["patterns"][1]["missing_cols"] == []
+    assert out["patterns"][1]["n_rows"] == 1
@@ -34,7 +34,6 @@ from .upsert_xlsx_sheet import upsert_xlsx_sheet
 from .duckdb_query_readonly import duckdb_query_readonly
 from .duckdb_execute import duckdb_execute
 from .duckdb_upsert import duckdb_upsert
-from .load_folder_to_duckdb import load_folder_to_duckdb
 from .imap_connect import imap_connect
 from .imap_list_mailboxes import imap_list_mailboxes
 from .imap_search import imap_search
@@ -51,7 +50,6 @@ __all__ = [
    "upsert_xlsx_sheet",
    "duckdb_query_readonly",
    "duckdb_execute",
-    "load_folder_to_duckdb",
    "duckdb_upsert",
    "pg_insert_rows",
    "pg_apply_sql",
@@ -1,100 +0,0 @@
---
-name: load_folder_to_duckdb
-kind: function
-lang: py
-domain: infra
-version: "1.0.0"
-purity: impure
-signature: "def load_folder_to_duckdb(folder: str, db_path: str = None, pattern: str = '*.csv,*.parquet,*.json') -> dict"
-description: "Escanea el primer nivel de una CARPETA buscando archivos tabulares (CSV/TSV/TXT, Parquet, JSON/NDJSON) y los carga como tablas en una base DuckDB usando los lectores nativos read_csv_auto/read_parquet/read_json_auto. Es la pieza de entrada del EDA a nivel de carpeta (grupo eda). Por cada archivo crea una tabla cuyo nombre se deriva del basename saneado a [0-9a-zA-Z_] en minusculas (prefijo t_ si empieza por digito, sufijos _2/_3 ante colisiones, tabla_<i> si queda vacio). El path se escapa (comilla simple '->'') antes de interpolarlo porque los lectores DuckDB no aceptan el path como parametro posicional. Glob NO recursivo: un glob.glob(os.path.join(folder, g)) por cada patron del CSV, dedup y ordenado. db_path=None genera una DuckDB temporal (mkstemp, se borra el placeholder vacio porque DuckDB rechaza un archivo de 0 bytes) y devuelve su ruta. Un fallo al cargar un archivo concreto no aborta el resto: se registra en errors y se continua. Devuelve siempre un dict sin lanzar (estilo del grupo duckdb): {status:'ok', db_path, tables, errors} en exito (carpeta sin archivos tabulares incluida, tables=[]) y {status:'error', error} cuando la carpeta no existe o falla algo global. Depende del paquete duckdb (1.5.2)."
-tags: [eda, duckdb, ingest, etl, folder]
-uses_functions: []
-uses_types: []
-returns: []
-returns_optional: false
-error_type: "error_py_core"
-imports: [glob, os, re, tempfile, duckdb]
-params:
-  - name: folder
-    desc: "ruta a un directorio. Se escanea solo su primer nivel (NO recursivo). Si no existe o no es un directorio devuelve {status:'error'} sin lanzar."
-  - name: db_path
-    desc: "ruta del archivo DuckDB destino, abierto en modo read-write (lo crea si no existe). None (default) genera una DuckDB temporal unica con tempfile.mkstemp y devuelve su ruta en el campo db_path del retorno. DuckDB es single-writer: si otro proceso lo tiene abierto en escritura, connect falla con error de lock devuelto en el dict."
-  - name: pattern
-    desc: "CSV de globs separados por coma (default '*.csv,*.parquet,*.json'). Cada glob se aplica con glob.glob(os.path.join(folder, g)) sobre el primer nivel de folder; los resultados de todos los globs se deduplican y ordenan. Los globs con ** NO descienden recursivamente (glob.glob sin recursive=True)."
-output: "dict. En exito: {status:'ok', db_path:str (ruta DuckDB usada), tables:[{name:str, source_file:str, n_rows:int}], errors:[{name?:str, source_file:str, error:str}]}. La carpeta sin archivos tabulares es un exito con tables=[] y errors=[]. En error (sin lanzar): {status:'error', error:str}."
-tested: true
-tests:
-  - "test_carga_dos_csv_como_tablas"
-  - "test_db_path_none_crea_temporal"
-  - "test_carpeta_vacia_es_ok_sin_tablas"
-  - "test_carpeta_inexistente_devuelve_status_error"
-test_file_path: "python/functions/infra/load_folder_to_duckdb_test.py"
-file_path: "python/functions/infra/load_folder_to_duckdb.py"
---
-
-## Example
-
-```python
-import sys
-sys.path.insert(0, "python/functions")
-from infra.load_folder_to_duckdb import load_folder_to_duckdb
-
-# Preparar una carpeta de demo con dos CSV.
-import os
-os.makedirs("/tmp/eda_folder_demo", exist_ok=True)
-with open("/tmp/eda_folder_demo/ventas.csv", "w") as f:
-    f.write("id,total\n1,10.5\n2,20.0\n3,5.25\n")
-with open("/tmp/eda_folder_demo/clientes.csv", "w") as f:
-    f.write("id,nombre\n1,ana\n2,luis\n")
-
-# Cargar todos los tabulares de la carpeta a una DuckDB temporal.
-res = load_folder_to_duckdb("/tmp/eda_folder_demo")
-print(res["status"])    # ok
-print(res["db_path"])   # /tmp/tmpXXXXXXXX.duckdb (temporal)
-for t in res["tables"]:
-    print(t["name"], t["n_rows"])   # ventas 3  /  clientes 2
-
-# Persistir en una DuckDB concreta y limitar a CSV.
-res2 = load_folder_to_duckdb(
-    "/tmp/eda_folder_demo",
-    db_path="/tmp/eda_folder_demo/folder.duckdb",
-    pattern="*.csv",
-)
-print(res2["tables"])   # [{'name': 'clientes', ...}, {'name': 'ventas', ...}]
-```
-
-## When to use it
-
-When you have a folder of loose data files (a dump, an export, several downloaded
-CSV/Parquet files) and want to analyze them together with SQL without wiring up
-the ingestion by hand, file by file. It is the first link of folder-level EDA
-(group `eda`): it leaves a DuckDB with one table per file, ready to profile with
-`duckdb_table_schema_py_infra`, query with `duckdb_query_readonly_py_infra`, or
-correlate downstream. Use it before any profiling step when the unit of work is
-"all the files in this directory".
-
-## Gotchas
-
- **NON-recursive glob**: with the default patterns only the first level of
-  `folder` is scanned, so files in subdirectories are ignored. `**` does not
-  recurse either: `glob.glob` is called without `recursive=True`, so `**` acts
-  like a single `*`. If you need recursion, flatten the folder or extend the function.
- **Table-name sanitization**: the basename is reduced to `[0-9a-zA-Z_]` in
-  lowercase. `Ventas 2024.csv` -> table `ventas_2024`. Two different files can
-  sanitize to the same name (`a-b.csv` and `a_b.csv`); the second is disambiguated
-  with a `_2`, `_3`, ... suffix. The actual file->table mapping is in
-  `tables[].name` / `tables[].source_file`; do not assume it.
- **`read_json_auto` requires tabular JSON** (an array of objects or homogeneous
-  NDJSON objects). Nested or irregular JSON may fail to load THAT table; the
-  error is recorded in `errors` and the remaining files keep loading.
- **Unknown extension = skipped**, not a failure: it is recorded in `errors` as
-  `unsupported extension`. Reader mapping: `.csv/.tsv/.txt`->`read_csv_auto`,
-  `.parquet/.pq`->`read_parquet`, `.json/.ndjson`->`read_json_auto`.
- **Real disk writes (impure)**. DuckDB is single-writer: if another process has
-  `db_path` open for writing, `connect` fails with a lock error returned in the
-  dict. A `db_path` whose parent directory does not exist also fails.
- **`db_path=None` creates a temporary file that is NOT deleted automatically**:
-  its path is returned in `db_path` so the caller can use it and clean it up.
- **Types inferred by the `_auto` readers**: column types are inferred by
-  DuckDB. Check the schema with `duckdb_table_schema_py_infra` if typing matters
-  downstream.
@@ -1,175 +0,0 @@
-"""Carga una carpeta de archivos tabulares (CSV/Parquet/JSON) como tablas DuckDB.
-
-Funcion impura: escanea el primer nivel de un directorio buscando archivos que
-casen con uno o varios globs, y por cada archivo crea una tabla en una base
-DuckDB usando los lectores nativos (`read_csv_auto`, `read_parquet`,
-`read_json_auto`). Es la pieza de entrada del EDA a nivel de carpeta (grupo
-`eda`): deja una DuckDB con una tabla por archivo, lista para perfilar y
-correlacionar aguas abajo.
-
-Devuelve siempre un dict sin lanzar excepciones, siguiendo el estilo del grupo
-duckdb del registry: {status:'ok', db_path, tables, errors} en exito (incluida
-la carpeta sin archivos tabulares, que es un exito con tables=[]) y
-{status:'error', error:str} cuando la carpeta no existe o falla algo global.
-
-El nombre de cada tabla se deriva del basename del archivo, saneado a
-`[0-9a-zA-Z_]` en minusculas, prefijado con `t_` si empieza por digito, y
-desambiguado con sufijos `_2`, `_3`, ... ante colisiones. El path del archivo se
-escapa (comilla simple, `'`->`''`) antes de interpolarlo en el SQL del lector,
-ya que los lectores DuckDB no admiten el path como parametro posicional. Un fallo
-al cargar un archivo concreto NO aborta el resto: se registra en `errors` y se
-continua con los siguientes.
-"""
-
-import glob
-import os
-import re
-import tempfile
-
-
-def _sanitize_table_name(basename_no_ext: str, index: int) -> str:
-    """Deriva un identificador de tabla valido desde el basename de un archivo.
-
-    Reemplaza todo lo que no sea ``[0-9a-zA-Z_]`` por ``_`` y baja a minusculas.
-    Si tras el saneo queda vacio, usa ``tabla_<index>``. Si empieza por digito,
-    prefija ``t_`` para que sea un identificador SQL valido.
-    """
-    name = re.sub(r"[^0-9a-zA-Z_]", "_", basename_no_ext).lower()
-    if not name:
-        name = f"tabla_{index}"
-    if name[0].isdigit():
-        name = "t_" + name
-    return name
-
-
-def _reader_for_extension(ext: str, quoted_path: str):
-    """Devuelve la expresion de lector DuckDB para una extension, o None.
-
-    El ``quoted_path`` ya viene escapado y entre comillas simples. Extensiones
-    desconocidas devuelven None para que el llamador salte el archivo.
-    """
-    ext = ext.lower()
-    if ext in (".csv", ".tsv", ".txt"):
-        return f"read_csv_auto('{quoted_path}')"
-    if ext in (".parquet", ".pq"):
-        return f"read_parquet('{quoted_path}')"
-    if ext in (".json", ".ndjson"):
-        return f"read_json_auto('{quoted_path}')"
-    return None
-
-
-def load_folder_to_duckdb(
-    folder: str,
-    db_path: str = None,
-    pattern: str = "*.csv,*.parquet,*.json",
-) -> dict:
-    """Load the tabular files in a folder as tables in a DuckDB database.
-
-    Args:
-        folder: path to a directory. If it does not exist or is not a directory,
-            returns {status:'error', ...} without raising.
-        db_path: path of the target DuckDB (read-write, created if missing). If
-            None, a temporary database path is reserved with tempfile.mkstemp and
-            returned in the result (`db_path`).
-        pattern: comma-separated list of globs (default
-            "*.csv,*.parquet,*.json"). Each glob is applied with
-            glob.glob(os.path.join(folder, g)) on the first level (NOT recursive);
-            results are deduplicated and sorted.
-
-    Returns:
-        dict. On success: {status:'ok', db_path:str, tables:[{name, source_file,
-        n_rows}], errors:[{name?, source_file, error}]}. A folder with no tabular
-        files is a success with tables=[] and errors=[]. On error (without
-        raising): {status:'error', error:str}.
-    """
-    if not isinstance(folder, str) or not os.path.isdir(folder):
-        return {
-            "status": "error",
-            "error": f"folder does not exist or is not a directory: {folder!r}",
-        }
-
-    conn = None
-    try:
-        # Resolver la ruta de la DuckDB destino. Si no se da, reservar un nombre
-        # temporal unico y borrar el archivo vacio que crea mkstemp: DuckDB 1.5.2
-        # rechaza abrir un archivo de 0 bytes ("not a valid DuckDB database
-        # file"), por lo que debe crear el archivo el mismo desde cero.
-        if db_path is None:
-            fd, tmp_name = tempfile.mkstemp(suffix=".duckdb")
-            os.close(fd)
-            os.remove(tmp_name)
-            db_path = tmp_name
-
-        # Resolver los archivos: un glob por cada patron, dedup + orden estable.
-        globs = [g.strip() for g in pattern.split(",") if g.strip()]
-        found = set()
-        for g in globs:
-            for path in glob.glob(os.path.join(folder, g)):
-                if os.path.isfile(path):
-                    found.add(path)
-        files = sorted(found)
-
-        conn = __import__("duckdb").connect(db_path)
-
-        tables = []
-        errors = []
-        used_names = set()
-
-        for i, path in enumerate(files):
-            base = os.path.basename(path)
-            stem, ext = os.path.splitext(base)
-            quoted_path = path.replace("'", "''")
-            reader = _reader_for_extension(ext, quoted_path)
-            if reader is None:
-                errors.append(
-                    {
-                        "source_file": path,
-                        "error": f"unsupported extension: {ext!r}",
-                    }
-                )
-                continue
-
-            name = _sanitize_table_name(stem, i)
-            # Desambiguar colisiones con sufijos _2, _3, ...
-            if name in used_names:
-                suffix = 2
-                while f"{name}_{suffix}" in used_names:
-                    suffix += 1
-                name = f"{name}_{suffix}"
-
-            quoted_ident = '"' + name.replace('"', '""') + '"'
-            try:
-                conn.execute(
-                    f"CREATE TABLE {quoted_ident} AS SELECT * FROM {reader}"
-                )
-                n_rows = conn.execute(
-                    f"SELECT count(*) FROM {quoted_ident}"
-                ).fetchone()[0]
-                used_names.add(name)
-                tables.append(
-                    {
-                        "name": name,
-                        "source_file": path,
-                        "n_rows": int(n_rows),
-                    }
-                )
-            except Exception as e:  # noqa: BLE001
-                errors.append(
-                    {
-                        "name": name,
-                        "source_file": path,
-                        "error": str(e),
-                    }
-                )
-
-        return {
-            "status": "ok",
-            "db_path": db_path,
-            "tables": tables,
-            "errors": errors,
-        }
-    except Exception as e:  # noqa: BLE001
-        return {"status": "error", "error": str(e)}
-    finally:
-        if conn is not None:
-            conn.close()
@@ -1,73 +0,0 @@
-"""Tests para load_folder_to_duckdb."""
-
-import os
-import sys
-
-sys.path.insert(0, os.path.dirname(__file__))
-
-import duckdb  # noqa: E402
-
-from load_folder_to_duckdb import load_folder_to_duckdb  # noqa: E402
-
-
-def _write_csv(path: str, header: str, rows: list[str]) -> None:
-    with open(path, "w", encoding="utf-8") as f:
-        f.write(header + "\n")
-        for r in rows:
-            f.write(r + "\n")
-
-
-def test_carga_dos_csv_como_tablas(tmp_path):
-    _write_csv(
-        str(tmp_path / "ventas.csv"),
-        "id,total",
-        ["1,10.5", "2,20.0", "3,5.25"],
-    )
-    _write_csv(
-        str(tmp_path / "clientes.csv"),
-        "id,nombre",
-        ["1,ana", "2,luis"],
-    )
-    db = tmp_path / "out.duckdb"
-    res = load_folder_to_duckdb(str(tmp_path), str(db))
-
-    assert res["status"] == "ok", res
-    assert res["errors"] == []
-    assert len(res["tables"]) == 2
-    assert res["db_path"] == str(db)
-    assert os.path.exists(str(db))
-
-    by_name = {t["name"]: t for t in res["tables"]}
-    assert by_name["ventas"]["n_rows"] == 3
-    assert by_name["clientes"]["n_rows"] == 2
-
-    # Verificar que las tablas existen realmente en la base.
-    con = duckdb.connect(str(db), read_only=True)
-    assert con.execute("SELECT count(*) FROM ventas").fetchone()[0] == 3
-    assert con.execute("SELECT count(*) FROM clientes").fetchone()[0] == 2
-    con.close()
-
-
-def test_db_path_none_crea_temporal(tmp_path):
-    _write_csv(str(tmp_path / "datos.csv"), "x", ["1", "2"])
-    res = load_folder_to_duckdb(str(tmp_path))
-    assert res["status"] == "ok", res
-    assert res["db_path"]
-    assert os.path.exists(res["db_path"])
-    assert len(res["tables"]) == 1
-    assert res["tables"][0]["n_rows"] == 2
-    os.remove(res["db_path"])
-
-
-def test_carpeta_vacia_es_ok_sin_tablas(tmp_path):
-    db = tmp_path / "out.duckdb"
-    res = load_folder_to_duckdb(str(tmp_path), str(db))
-    assert res["status"] == "ok", res
-    assert res["tables"] == []
-    assert res["errors"] == []
-
-
-def test_carpeta_inexistente_devuelve_status_error(tmp_path):
-    res = load_folder_to_duckdb(str(tmp_path / "no_existe"))
-    assert res["status"] == "error"
-    assert "folder" in res["error"]
@@ -1,115 +0,0 @@
---
-name: render_automatic_eda_folder
-kind: pipeline
-lang: py
-domain: pipelines
-purity: impure
-version: "1.0.0"
-signature: "def render_automatic_eda_folder(path: str, out_dir: str = \"reports\", basename: str = None, profile_level: str = \"standard\", emit_pdf: bool = True, emit_pptx: bool = True, emit_md: bool = True, per_table_eda: bool = False, min_inclusion: float = 0.9, ctx_extra: dict = None) -> dict"
-description: "Informe AutomaticEDA a nivel de BASE one-shot de una CARPETA de archivos tabulares (CSV/Parquet/JSON) o de una DuckDB existente. Carga la carpeta a una DuckDB temporal con load_folder_to_duckdb (o usa la DuckDB dada directa), perfila TODA la base con profile_database (resumen de cada tabla + FK candidatas por containment + join graph con diagrama Mermaid), ENSAMBLA un documento-base por capitulos (portada-base con nombre/n tablas/totales/fecha/fuente, resumen de tablas con una fila por tabla, y relaciones inter-tabla con la tabla de FK candidatas + una Figure matplotlib REAL del join graph dibujada con draw_join_graph_figure mas el texto Mermaid) y lo renderiza con el motor AutomaticEDA a PDF (A5 movil), PPTX (16:9) y Markdown autocontenido a la vez. Con per_table_eda=True anexa los capitulos de mini-EDA de cada tabla (build_document por tabla). Es el hermano a nivel de base de render_automatic_eda (que perfila UNA tabla): aqui el informe es de la base y de sus relaciones. Devuelve las rutas de PDF/PPTX/MD, el manifiesto y el DatabaseProfile."
-tags: [eda, duckdb, database, profiling, relations, pipeline, dataops, report, pdf, pptx, launcher]
-uses_functions:
-  - load_folder_to_duckdb_py_infra
-  - profile_database_py_pipelines
-  - render_automatic_eda_pdf_py_datascience
-  - render_automatic_eda_pptx_py_datascience
-  - render_automatic_eda_markdown_py_datascience
-  - draw_join_graph_figure_py_datascience
-uses_types: []
-returns: []
-returns_optional: false
-error_type: error_go_core
-imports: []
-tested: true
-tests:
-  - "golden: carpeta con 3 CSV relacionados (customers/orders/products) emite PDF+PPTX+MD del documento-base con 3 tablas y la FK orders.customer_id->customers.id"
-  - "edge: carpeta vacia -> status ok con documento minimo, sin lanzar"
-  - "edge: 1 sola tabla -> funciona sin relaciones (capitulo relaciones dice 'sin FK')"
-test_file_path: "python/functions/pipelines/render_automatic_eda_folder_test.py"
-file_path: "python/functions/pipelines/render_automatic_eda_folder.py"
-params:
-  - name: path
-    desc: "DIRECTORIO con archivos tabulares (CSV/Parquet/JSON) que se cargan a una DuckDB temporal, o una DuckDB ya existente (.duckdb/.ddb/.db) que se perfila directa."
-  - name: out_dir
-    desc: "Directorio de salida de los informes (se crea si no existe). Default 'reports'."
-  - name: basename
-    desc: "Nombre base de los archivos sin extension. Default 'aeda_base_<nombre>_<timestamp>'."
-  - name: profile_level
-    desc: "Preset de coste del perfil por tabla ('lite'/'standard'/'full'); ajusta el sample que profile_database pasa a cada tabla (lite=2000, standard/full=5000)."
-  - name: emit_pdf
-    desc: "Emite el PDF A5 movil del documento-base. Default True."
-  - name: emit_pptx
-    desc: "Emite el PPTX 16:9 del documento-base. Default True."
-  - name: emit_md
-    desc: "Emite el Markdown autocontenido del documento-base. Default True."
-  - name: per_table_eda
-    desc: "Si True, anexa al documento-base los capitulos de mini-EDA de cada tabla (Heading 'Tabla: <n>' + build_document por tabla). Default False (solo documento-base: portada + resumen + relaciones)."
-  - name: min_inclusion
-    desc: "Umbral de inclusion (0-1) para emitir una FK candidata (se pasa a profile_database). Default 0.9."
-  - name: ctx_extra
-    desc: "Dict opcional de claves de presentacion (p.ej. dataset_name, description) que se mezclan en el contexto de la portada-base."
-output: "Dict dict-no-throw. En exito: {status:'ok', pdf_path, pptx_path, md_path, manifest_path, n_tables, n_pages, n_slides, md_chars, db_path, db_profile}. En error: {status:'error', error:str}."
---
-
-# render_automatic_eda_folder
-
-EDA of a **folder / multi-table database** → chaptered AutomaticEDA report
-in PDF (mobile A5) + PPTX (16:9) + Markdown, in a single call. It is the
-**database**-level sibling of `render_automatic_eda` (which profiles a single
-table): here the document summarizes **all** tables and, above all, their
-inter-table **relationships** (containment-based FK candidates + join graph with a Mermaid diagram).
-
-It composes, without reimplementing their logic: `load_folder_to_duckdb` (loads
-the folder), `profile_database` (profiles the database + infers FKs + join graph)
-and the three AutomaticEDA renderers (`render_automatic_eda_pdf`/`_pptx`/`_markdown`),
-which directly accept the chapter list of the database document that this
-pipeline assembles. The single-table pipeline (`render_automatic_eda`) is left
-untouched: this is purely additive.
-
-## Example
-
-```bash
-# Carpeta con varios CSV/Parquet/JSON relacionados:
-./fn run render_automatic_eda_folder /tmp/eda_folder_demo
-
-# Una DuckDB ya existente (rama directa):
-./fn run render_automatic_eda_folder temp/bigdata/taxi.duckdb
-```
-
-```python
-import sys, os
-sys.path.insert(0, os.path.join("python", "functions"))
-from pipelines.render_automatic_eda_folder import render_automatic_eda_folder
-
-r = render_automatic_eda_folder("/tmp/eda_folder_demo", out_dir="reports")
-# r["status"] == "ok"; r["pdf_path"], r["pptx_path"], r["md_path"]
-# r["n_tables"] == 3; r["db_profile"]["fk_candidates"] incluye
-#   orders.customer_id -> customers.id
-```
-
-## When to use it
-
-When you want an EDA of a **whole database** (a folder of exports or a DuckDB
-with several tables) rather than a single table: to see at a glance which
-tables exist, their size and quality, and how they relate (FK candidates +
-diagram), in the same rich chaptered format (mobile PDF + PPTX + MD) as the
-table-level EDA. Use `per_table_eda=True` when you also want the mini-EDA of
-each table appended.
-
-## Gotchas
-
- Impure: reads files from disk and writes PDF/PPTX/MD to `out_dir`. In the
-  "folder" branch it creates a **temporary DuckDB** (its path is returned in
-  `db_path`); it is not deleted automatically (it is kept for re-inspection).
- `path` is interpreted as follows: directory → the folder is loaded; file with
-  extension `.duckdb`/`.ddb`/`.db` → used directly; any other file or a
-  nonexistent path → `{status:'error'}` (does not raise).
- The folder scan is **non-recursive** (first level only) and by default
-  covers `*.csv,*.parquet,*.json` (see `load_folder_to_duckdb`).
- The join graph is rasterized to a **real matplotlib Figure** (via
-  `draw_join_graph_figure`) that is drawn in the PDF/PPTX (nodes = tables,
-  arrows = FKs). In addition, the graph's **Mermaid text** is included as a code
-  block (in the Markdown it is a renderable diagram, and it is handy to paste
-  into an LLM).
- Empty folder or a single table: works the same; the relationships chapter
-  says "sin FK". dict-no-throw on every path.
@@ -1,366 +0,0 @@
-"""render_automatic_eda_folder — EDA de una CARPETA / base multi-tabla one-shot.
-
-Impure pipeline in the `eda` capability group, operating at DATABASE level.
-Given a FOLDER of tabular files (CSV/Parquet/JSON) or an existing DuckDB, it
-produces the AutomaticEDA report for the DATABASE in all three formats at once
-(mobile A5 PDF + 16:9 PPTX + self-contained Markdown), with the chapters
-POPULATED, in a single call. It is the database-level sibling of
-``render_automatic_eda`` (which profiles ONE table): here the chaptered
-document summarizes ALL tables and, above all, their inter-table RELATIONSHIPS
-(candidate FKs + join graph).
-
-It composes registry functions WITHOUT reimplementing their logic:
-
-  - load_folder_to_duckdb : loads a folder of files into a temporary DuckDB
-                            ("folder" branch). Skipped in the "already a
-                            DuckDB" branch.
-  - profile_database      : profiles the WHOLE database (per-table summary,
-                            full TableProfiles, candidate FKs by containment
-                            and join graph with a Mermaid diagram).
-  - draw_join_graph_figure        : draws the join graph as a matplotlib Figure
-                                    for the relationships chapter.
-  - render_automatic_eda_pdf      : renders the chaptered database document to PDF.
-  - render_automatic_eda_pptx     : renders the same database document to PPTX.
-  - render_automatic_eda_markdown : serializes the same database document to
-                                    self-contained Markdown (text + markdown tables).
-  - build_document        : (only with per_table_eda=True) assembles the
-                            canonical chapters of EACH table so they can be
-                            appended to the document.
-
-This pipeline's own layer is ASSEMBLING THE DATABASE DOCUMENT of chapters from
-the ``DatabaseProfile`` returned by ``profile_database`` and wiring up the
-three renderers of the AutomaticEDA engine. The minimal database document has
-three chapters: database cover (name/number of tables/totals/date/source),
-table summary (one row per table) and inter-table relationships (candidate FKs
-+ Mermaid diagram). With ``per_table_eda=True`` it appends, for each table, its
-mini-EDA chapters.
-
-It follows the dict-no-throw style of the `eda` group: it never raises; it
-catches any error and degrades to ``{"status": "error", "error": str}``.
-"""
-
-import os
-from datetime import datetime, timezone
-
-from datascience import (
-    draw_join_graph_figure,
-    render_automatic_eda_markdown,
-    render_automatic_eda_pdf,
-    render_automatic_eda_pptx,
-)
-from datascience.automatic_eda import build_document
-from infra import load_folder_to_duckdb
-from pipelines.profile_database import profile_database
-
-# Map profile_level -> per-column sample size for each table's profile.
-# At database level the cost is dominated by the number of tables; the preset
-# only adjusts the sample that profile_database passes to profile_table.
-_SAMPLE_BY_LEVEL = {"lite": 2000, "standard": 5000, "full": 5000}
-
-# Extensions treated as "an existing DuckDB" in the direct branch.
-_DUCKDB_EXTS = (".duckdb", ".ddb", ".db")
-
-
-def _fmt_num(v) -> str:
-    """Format a number as an integer (floats are truncated) with '.' as the
-    thousands separator; return an em-dash placeholder if it is not a number."""
-    if isinstance(v, bool) or not isinstance(v, (int, float)):
-        return "—"
-    try:
-        return f"{int(v):,}".replace(",", ".")
-    except Exception:  # noqa: BLE001
-        return str(v)
-
-
-def _portada_chapter(db_profile: dict, source_path: str, db_path: str,
-                     meta_ctx: dict) -> dict:
-    """Database-level cover chapter (does NOT reuse chapters/portada.py, which
-    is for a single table): database name, number of tables, totals and provenance."""
-    tables = db_profile.get("tables", []) or []
-    total_rows = sum(
-        (t.get("n_rows") or 0) for t in tables if isinstance(t.get("n_rows"), (int, float))
-    )
-    total_cols = sum(
-        (t.get("n_cols") or 0) for t in tables if isinstance(t.get("n_cols"), (int, float))
-    )
-    base_name = (meta_ctx or {}).get("dataset_name") or os.path.basename(
-        os.path.normpath(source_path)
-    ) or source_path
-
-    rows = [
-        ("Base", base_name),
-        ("Tablas", _fmt_num(db_profile.get("n_tables"))),
-        ("Filas totales", _fmt_num(total_rows)),
-        ("Columnas totales", _fmt_num(total_cols)),
-        ("Relaciones FK", _fmt_num(len(db_profile.get("fk_candidates", []) or []))),
-        ("Fuente", source_path),
-        ("DuckDB", db_path),
-        ("Generado", datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M UTC")),
-    ]
-    blocks = [
-        {"kind": "heading", "text": f"EDA de la base — {base_name}", "level": 1},
-        {"kind": "kv_table", "rows": rows, "title": "Resumen de la base"},
-    ]
-    errs = db_profile.get("errors", []) or []
-    if errs:
-        blocks.append({
-            "kind": "note",
-            "text": f"{len(errs)} aviso(s) durante el perfilado (ver detalle).",
-        })
-    return {"id": "portada_base", "title": "Portada", "version": "1.0.0",
-            "blocks": blocks}
-
-
-def _resumen_chapter(db_profile: dict) -> dict:
-    """Chapter with one row per table: rows, columns, quality, key_candidates."""
-    header = ["Tabla", "Filas", "Columnas", "Calidad", "key_candidates"]
-    rows = []
-    for t in db_profile.get("tables", []) or []:
-        keys = ", ".join(t.get("key_candidates") or []) or "—"
-        rows.append([
-            t.get("table"),
-            _fmt_num(t.get("n_rows")),
-            _fmt_num(t.get("n_cols")),
-            t.get("quality_score"),
-            keys,
-        ])
-    if rows:
-        blocks = [{
-            "kind": "data_table", "header": header, "rows": rows,
-            "title": "Tablas de la base",
-            "note": "Una fila por tabla. Calidad = score agregado del TableProfile.",
-        }]
-    else:
-        blocks = [{"kind": "note",
-                   "text": "La base no contiene tablas perfilables."}]
-    return {"id": "resumen_tablas", "title": "Resumen de tablas",
-            "version": "1.0.0", "blocks": blocks}
-
-
-def _relaciones_chapter(db_profile: dict) -> dict:
-    """Inter-table relationships chapter: table of candidate FKs, plus (only
-    when the join graph has edges) a figure and the Mermaid diagram as a code block."""
-    fks = db_profile.get("fk_candidates", []) or []
-    blocks = [{
-        "kind": "heading", "text": "Relaciones inter-tabla", "level": 2,
-    }]
-    if fks:
-        header = ["From", "To", "Inclusión", "Cardinalidad"]
-        rows = []
-        for fk in fks:
-            frm = f"{fk.get('from_table')}.{fk.get('from_col')}"
-            to = f"{fk.get('to_table')}.{fk.get('to_col')}"
-            inc = fk.get("inclusion")
-            inc_s = f"{inc:.3f}" if isinstance(inc, (int, float)) else str(inc)
-            rows.append([frm, to, inc_s, fk.get("cardinality")])
-        blocks.append({
-            "kind": "data_table", "header": header, "rows": rows,
-            "title": "FK candidatas (por containment de valores)",
-            "note": "Inclusión = fracción de valores de From contenidos en To.",
-        })
-    else:
-        blocks.append({
-            "kind": "note",
-            "text": "Sin relaciones FK candidatas detectadas entre las tablas.",
-        })
-
-    join_graph = db_profile.get("join_graph") or {}
-    has_edges = bool(join_graph.get("edges"))
-    if has_edges:
-        blocks.append({"kind": "heading", "text": "Diagrama (join graph)",
-                       "level": 3})
-        # REAL matplotlib Figure of the relationship graph (nodes = tables,
-        # edges = FKs). Lazy via `make`: the renderer builds it only when
-        # paginating, and it is rasterized in PDF/PPTX. draw_join_graph_figure
-        # never raises (it returns an error Figure if something fails).
-        blocks.append({
-            "kind": "figure",
-            "make": (lambda jg=join_graph: draw_join_graph_figure(
-                jg, title="Join graph (relaciones inter-tabla)")),
-            "caption": "Grafo de relaciones: nodos = tablas, flechas = FK "
-                       "candidatas (etiqueta from_col→to_col).",
-            "height_in": 4.5,
-        })
-        # Also emit the Mermaid source as text: in the Markdown output it stays
-        # a renderable diagram, and it is handy to paste into an LLM.
-        mermaid = (join_graph.get("mermaid", "") or "").strip()
-        if mermaid:
-            blocks.append({"kind": "markdown",
-                           "text": "```mermaid\n" + mermaid + "\n```"})
-    return {"id": "relaciones", "title": "Relaciones inter-tabla",
-            "version": "1.0.0", "blocks": blocks}
-
-
-def _build_db_document(db_profile: dict, source_path: str, db_path: str,
-                       meta_ctx: dict, per_table_eda: bool) -> list:
-    """Assemble the chaptered database document from the DatabaseProfile.
-
-    Minimum: database cover + table summary + relationships. With per_table_eda
-    True it appends, for each table, a separator chapter + the canonical chapters
-    of its mini-EDA (reusing build_document on each TableProfile)."""
-    chapters = [
-        _portada_chapter(db_profile, source_path, db_path, meta_ctx),
-        _resumen_chapter(db_profile),
-        _relaciones_chapter(db_profile),
-    ]
-    if per_table_eda:
-        for prof in db_profile.get("table_profiles", []) or []:
-            tname = prof.get("table") or "tabla"
-            chapters.append({
-                "id": f"tabla_{tname}", "title": f"Tabla: {tname}",
-                "version": "1.0.0",
-                "blocks": [{"kind": "heading", "text": f"Tabla: {tname}",
-                            "level": 1}],
-            })
-            try:
-                # build_document returns the table's canonical chapters.
-                # ctx None -> chapters that need raw data degrade, but the
-                # portada/overview/distrib/calidad chapters come out complete.
-                chapters.extend(build_document(prof, None) or [])
-            except Exception:  # noqa: BLE001 (one bad table must not break the doc)
-                chapters.append({
-                    "id": f"tabla_{tname}_err", "title": f"Tabla: {tname}",
-                    "version": "1.0.0",
-                    "blocks": [{"kind": "note",
-                                "text": "No se pudo ensamblar el mini-EDA de "
-                                        "esta tabla."}],
-                })
-    return chapters
-
-
-def _resolve_db_path(path: str) -> dict:
-    """Resolve the DuckDB to profile from ``path``.
-
-    - Directory -> loads the folder with load_folder_to_duckdb (temporary DuckDB).
-    - .duckdb/.ddb/.db file -> used as is ("already a DuckDB" branch).
-    - Any other file / nonexistent path -> error.
-
-    Returns {status, db_path, loaded, n_tables, load_errors} on success, or
-    {status, error} on failure.
-    """
-    if os.path.isdir(path):
-        lr = load_folder_to_duckdb(path)
-        if lr.get("status") != "ok":
-            return {"status": "error",
-                    "error": f"load_folder_to_duckdb falló: {lr.get('error')}"}
-        return {
-            "status": "ok",
-            "db_path": lr.get("db_path"),
-            "loaded": True,
-            "n_tables": len(lr.get("tables", []) or []),
-            "load_errors": lr.get("errors", []) or [],
-        }
-    if os.path.isfile(path):
-        if path.lower().endswith(_DUCKDB_EXTS):
-            return {"status": "ok", "db_path": path, "loaded": False,
-                    "n_tables": None, "load_errors": []}
-        return {"status": "error",
-                "error": f"'{path}' no es un directorio ni una DuckDB "
-                         f"(extensiones {_DUCKDB_EXTS})."}
-    return {"status": "error", "error": f"path no existe: {path}"}
-
-
-def render_automatic_eda_folder(
-    path: str,
-    out_dir: str = "reports",
-    basename: str = None,
-    profile_level: str = "standard",
-    emit_pdf: bool = True,
-    emit_pptx: bool = True,
-    emit_md: bool = True,
-    per_table_eda: bool = False,
-    min_inclusion: float = 0.9,
-    ctx_extra: dict = None,
-) -> dict:
-    """Profile a FOLDER (or a DuckDB) and emit the database's AutomaticEDA report.
-
-    Args:
-        path: either a DIRECTORY of tabular files (CSV/Parquet/JSON) that are
-            loaded into a temporary DuckDB, or an existing DuckDB
-            (``.duckdb``/``.ddb``/``.db``) that is profiled directly.
-        out_dir: output directory (created if missing). Default "reports".
-        basename: base name of the output files, without extension. Default
-            "aeda_base_<name>_<timestamp>".
-        profile_level: cost preset for the per-table profile ("lite"/"standard"/
-            "full"); it sets the ``sample`` that profile_database passes to each
-            table. Unknown values fall back to a sample of 5000.
-        emit_pdf / emit_pptx / emit_md: which formats to emit. Default: all three.
-        per_table_eda: if True, appends each table's mini-EDA chapters to the
-            database document (a "Tabla: <n>" heading + build_document per table).
-            Default False (database document only: cover + summary + relationships).
-        min_inclusion: inclusion threshold for emitting a candidate FK (0-1).
-        ctx_extra: optional dict of presentation keys (e.g. dataset_name,
-            description) that are merged into the cover context.
-
-    Returns:
-        dict (never raises). On success::
-
-            {"status": "ok", "pdf_path": str|None, "pptx_path": str|None,
-             "md_path": str|None, "manifest_path": str|None,
-             "n_tables": int, "n_pages": int|None, "n_slides": int|None,
-             "md_chars": int|None, "db_path": str, "db_profile": <DatabaseProfile>}
-
-        On error: {"status": "error", "error": str}.
-    """
-    try:
-        # 1) Resolve the DuckDB to profile (load the folder or use the given one).
-        rdb = _resolve_db_path(path)
-        if rdb.get("status") != "ok":
-            return {"status": "error", "error": rdb.get("error")}
-        db_path = rdb.get("db_path")
-
-        # 2) Profile the whole database (summary + FKs + join graph). No report
-        # of its own (write_report/emit_pdf False): this pipeline emits its own.
-        sample = _SAMPLE_BY_LEVEL.get(profile_level, 5000)
-        pres = profile_database(
-            db_path, sample=sample, write_report=False,
-            min_inclusion=min_inclusion, emit_pdf=False,
-        )
-        if pres.get("status") != "ok":
-            return {"status": "error",
-                    "error": f"profile_database falló: {pres.get('error')}"}
-        db_profile = pres.get("db_profile") or {}
-
-        # 3) Assemble the chaptered database document.
-        meta_ctx = dict(ctx_extra or {})
-        chapters = _build_db_document(
-            db_profile, path, db_path, meta_ctx, per_table_eda
-        )
-
-        # 4) Render all three formats from the SAME chaptered document.
-        os.makedirs(out_dir, exist_ok=True)
-        ts = datetime.now(timezone.utc).strftime("%Y%m%d-%H%M%S")
-        nm = (meta_ctx.get("dataset_name")
-              or os.path.basename(os.path.normpath(path)) or "base")
-        nm = "".join(c if c.isalnum() else "_" for c in str(nm)).strip("_") or "base"
-        base = basename or f"aeda_base_{nm}_{ts}"
-        title = f"EDA base — {meta_ctx.get('dataset_name') or nm}"
-        meta = {"title": title}
-
-        pdf_path = pptx_path = md_path = manifest_path = None
-        n_pages = n_slides = md_chars = None
-
-        if emit_pdf:
-            target = os.path.join(out_dir, base + ".pdf")
-            rpdf = render_automatic_eda_pdf(chapters, target, meta) or {}
-            pdf_path = rpdf.get("path")
-            n_pages = rpdf.get("n_pages")
-            manifest_path = rpdf.get("manifest_path")
-        if emit_pptx:
-            target = os.path.join(out_dir, base + ".pptx")
-            rpptx = render_automatic_eda_pptx(chapters, target, meta) or {}
-            pptx_path = rpptx.get("path")
-            n_slides = rpptx.get("n_slides")
-        if emit_md:
-            target = os.path.join(out_dir, base + ".md")
-            rmd = render_automatic_eda_markdown(chapters, target, meta) or {}
-            md_path = rmd.get("path")
-            md_chars = rmd.get("n_chars")
-
-        return {
-            "status": "ok",
-            "pdf_path": pdf_path,
-            "pptx_path": pptx_path,
-            "md_path": md_path,
-            "manifest_path": manifest_path,
-            "n_tables": db_profile.get("n_tables"),
-            "n_pages": n_pages,
-            "n_slides": n_slides,
-            "md_chars": md_chars,
-            "db_path": db_path,
-            "db_profile": db_profile,
-        }
-    except Exception as e:  # noqa: BLE001 (dict-no-throw: degrade, never raise)
-        return {"status": "error", "error": str(e)}
@@ -1,188 +0,0 @@
-"""Tests for render_automatic_eda_folder: EDA of a folder / multi-table database.
-
-Golden: a folder with 3 related CSVs (customers/orders/products) produces the
-database document in PDF + PPTX + MD, with the 3 tables in the summary and the
-FK orders.customer_id -> customers.id in the relationships chapter. Edges:
-empty folder (minimal document, without raising), a single table (no
-relationships) and the "already a DuckDB" branch on an existing .duckdb file.
-"""
-
-import os
-import sys
-
-import duckdb
-
-sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", ".."))
-
-from pipelines.render_automatic_eda_folder import (
-    _relaciones_chapter,
-    render_automatic_eda_folder,
-)
-
-
-def _write_demo_folder(folder: str) -> None:
-    """3 related CSVs: orders.customer_id -> customers.id (detectable FK)."""
-    with open(os.path.join(folder, "customers.csv"), "w", encoding="utf-8") as fh:
-        fh.write("id,name,city\n")
-        fh.write("1,Alice,Madrid\n2,Bob,Barcelona\n3,Carol,Valencia\n"
-                 "4,Dave,Sevilla\n5,Eve,Madrid\n")
-    with open(os.path.join(folder, "orders.csv"), "w", encoding="utf-8") as fh:
-        fh.write("order_id,customer_id,product_id,total\n")
-        fh.write("100,1,10,49.90\n101,1,11,12.50\n102,2,10,49.90\n"
-                 "103,3,12,8.00\n104,3,11,12.50\n105,5,10,49.90\n"
-                 "106,2,12,8.00\n")
-    with open(os.path.join(folder, "products.csv"), "w", encoding="utf-8") as fh:
-        fh.write("product_id,product_name,price\n")
-        fh.write("10,Widget,49.90\n11,Gadget,12.50\n12,Gizmo,8.00\n")
-
-
-def _has_fk(db_profile: dict, from_t: str, from_c: str, to_t: str) -> bool:
-    for fk in db_profile.get("fk_candidates", []) or []:
-        if (fk.get("from_table") == from_t and fk.get("from_col") == from_c
-                and fk.get("to_table") == to_t):
-            return True
-    return False
-
-
-def test_golden_folder_three_csv(tmp_path):
-    """Folder with 3 related CSVs -> PDF+PPTX+MD, 3 tables, FK detected."""
-    folder = tmp_path / "demo"
-    folder.mkdir()
-    _write_demo_folder(str(folder))
-    out = tmp_path / "out"
-
-    r = render_automatic_eda_folder(str(folder), out_dir=str(out))
-
-    assert r["status"] == "ok", r
-    assert r["n_tables"] == 3
-    # All three formats were emitted and exist on disk.
-    assert r["pdf_path"] and os.path.exists(r["pdf_path"])
-    assert r["pptx_path"] and os.path.exists(r["pptx_path"])
-    assert r["md_path"] and os.path.exists(r["md_path"])
-    assert (r["n_pages"] or 0) >= 1
-    assert (r["n_slides"] or 0) >= 1
-    # The FK orders.customer_id -> customers.id is detected by containment.
-    assert _has_fk(r["db_profile"], "orders", "customer_id", "customers"), \
-        r["db_profile"].get("fk_candidates")
-    # The Markdown mentions the 3 tables and the relationship.
-    with open(r["md_path"], encoding="utf-8") as fh:
-        md = fh.read()
-    for t in ("customers", "orders", "products"):
-        assert t in md
-    assert "customer_id" in md
-
-
-def test_edge_empty_folder(tmp_path):
-    """Empty folder -> status ok with a minimal document, without raising."""
-    folder = tmp_path / "empty"
-    folder.mkdir()
-    out = tmp_path / "out"
-
-    r = render_automatic_eda_folder(str(folder), out_dir=str(out))
-
-    assert r["status"] == "ok", r
-    assert r["n_tables"] == 0
-    # Even with no tables, it emits the minimal database document (cover +
-    # empty summary + "no FK" relationships).
-    assert r["pdf_path"] and os.path.exists(r["pdf_path"])
-    assert r["md_path"] and os.path.exists(r["md_path"])
-
-
-def test_edge_single_table_no_relations(tmp_path):
-    """Folder with a single table -> works without relationships ('no FK' chapter)."""
-    folder = tmp_path / "single"
-    folder.mkdir()
-    with open(folder / "lonely.csv", "w", encoding="utf-8") as fh:
-        fh.write("a,b\n1,x\n2,y\n3,z\n")
-    out = tmp_path / "out"
-
-    r = render_automatic_eda_folder(str(folder), out_dir=str(out))
-
-    assert r["status"] == "ok", r
-    assert r["n_tables"] == 1
-    assert not (r["db_profile"].get("fk_candidates") or [])
-    with open(r["md_path"], encoding="utf-8") as fh:
-        md = fh.read()
-    assert "Sin relaciones FK" in md or "sin FK" in md.lower()
-
-
-def test_accepts_existing_duckdb(tmp_path):
-    """'Already a DuckDB' branch: an existing .duckdb file is profiled directly."""
-    db = tmp_path / "base.duckdb"
-    conn = duckdb.connect(str(db))
-    try:
-        conn.execute("CREATE TABLE customers (id INTEGER, name VARCHAR)")
-        conn.execute("INSERT INTO customers VALUES (1,'Ana'),(2,'Luis'),(3,'Eva')")
-        conn.execute("CREATE TABLE orders (oid INTEGER, customer_id INTEGER)")
-        conn.execute("INSERT INTO orders VALUES (10,1),(11,2),(12,1),(13,3)")
-    finally:
-        conn.close()
-    out = tmp_path / "out"
-
-    r = render_automatic_eda_folder(str(db), out_dir=str(out))
-
-    assert r["status"] == "ok", r
-    assert r["n_tables"] == 2
-    assert r["db_path"] == str(db)
-    assert r["pdf_path"] and os.path.exists(r["pdf_path"])
-
-
-def test_emit_flags_select_formats(tmp_path):
-    """emit_pdf/pptx/md control which formats are emitted."""
-    folder = tmp_path / "demo"
-    folder.mkdir()
-    _write_demo_folder(str(folder))
-    out = tmp_path / "out"
-
-    r = render_automatic_eda_folder(
-        str(folder), out_dir=str(out),
-        emit_pdf=True, emit_pptx=False, emit_md=False,
-    )
-    assert r["status"] == "ok", r
-    assert r["pdf_path"] and os.path.exists(r["pdf_path"])
-    assert r["pptx_path"] is None
-    assert r["md_path"] is None
-
-
-def test_path_does_not_exist(tmp_path):
-    """Nonexistent path -> status error, without raising."""
-    r = render_automatic_eda_folder(str(tmp_path / "nope"))
-    assert r["status"] == "error"
-    assert "no existe" in r["error"].lower()
-
-
-def test_relaciones_chapter_has_real_figure_when_edges():
-    """With edges, the relationships chapter includes a REAL matplotlib Figure
-    block (not just the Mermaid text): its make() returns a Figure."""
-    db_profile = {
-        "join_graph": {
-            "nodes": [
-                {"table": "orders", "out_degree": 1, "in_degree": 0, "role": "fact"},
-                {"table": "customers", "out_degree": 0, "in_degree": 1, "role": "dim"},
-            ],
-            "edges": [{"from_table": "orders", "from_col": "customer_id",
-                       "to_table": "customers", "to_col": "id",
-                       "cardinality": "N:1"}],
-            "mermaid": "graph LR orders --> customers",
-            "hubs": ["orders"],
-        },
-        "fk_candidates": [{"from_table": "orders", "from_col": "customer_id",
-                           "to_table": "customers", "to_col": "id",
-                           "inclusion": 1.0, "cardinality": "N:1"}],
-    }
-    ch = _relaciones_chapter(db_profile)
-    figs = [b for b in ch["blocks"] if b.get("kind") == "figure"]
-    assert len(figs) == 1, ch["blocks"]
-    # The lazy make() produces a real matplotlib Figure.
-    import matplotlib
-    matplotlib.use("Agg")
-    fig = figs[0]["make"]()
-    from matplotlib.figure import Figure
-    assert isinstance(fig, Figure)
-    assert fig.get_axes(), "the join graph Figure must have at least one axis"
-
-
-def test_relaciones_chapter_no_figure_when_no_edges():
-    """Without edges, no Figure block is added (the chapter says 'no FK')."""
-    db_profile = {"join_graph": {"nodes": [], "edges": [], "mermaid": "",
-                                 "hubs": []}, "fk_candidates": []}
-    ch = _relaciones_chapter(db_profile)
-    assert not [b for b in ch["blocks"] if b.get("kind") == "figure"]