feat(eda): MISSINGNESS chapter, missing-data patterns (co-occurrence + MCAR/MAR)

Adds the `missingness` chapter to the AutomaticEDA engine, a natural complement to `calidad`: where `calidad` reports how much is missing per column, this chapter analyses the PATTERN of the nulls, that is, where values are missing and whether columns are missing together (co-occurrence of absences), the signal that distinguishes MCAR from MAR before imputing.

Chapter (`chapters/missingness.py`), registered in `chapters_registry.py` right after `calidad`:
- Global summary: % of missing cells, columns with nulls, complete vs incomplete rows.
- Per-column ranking (table + horizontal bars).
- Co-occurrence: correlation of the is-null masks between columns (heatmap + table of the pairs that co-miss, with co-missing count and Jaccard).
- Most frequent row patterns (in the style of the missingno matrix).
- Exploratory MCAR/MAR reading (a heuristic based on the correlation/overlap of absences, not confirmatory), which cites the concrete evidence.
- Clickable glossary terms: missingness, MCAR, MAR.

The per-row is-null mask of ALL columns (numeric and categorical) is built with a DuckDB push-down over ctx['db_path']/table (same pattern as the aggregation chapter), with a fallback to ctx['raw_numeric'] when there is no database. The chapter activates only if the table has nulls; otherwise it returns None.

New functions in the `eda` group (datascience domain):
- extract_null_mask (impure): per-row is-null mask via query_fn.
- missingness_overview (pure): global summary + complete/incomplete rows.
- missingness_correlation (pure): absence correlation + pairs + Jaccard; reuses pearson.
- missingness_row_patterns (pure): most common row patterns.
- missingness_corr_heatmap_figure / missingness_rank_bar_figure (impure): figures.

Verified: the titanic EDA generates the chapter in PDF + PPTX + MD with Cabin 77.1%, Age 19.9% and the Age↔Cabin co-occurrence (158 rows). Full AutomaticEDA + render_automatic_eda suite green (125 passed); per-function and per-chapter tests; fn index without errors.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
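As a rough illustration of the co-occurrence statistics named in the commit message (absence correlation, co-missing count and Jaccard), the sketch below computes them for a single pair of is-null masks. `absence_pair_stats` is a hypothetical helper written for this note; it is not the `missingness_correlation` function added by the commit, whose implementation is not part of this diff. The toy masks are invented for the example.

```python
from math import sqrt


def absence_pair_stats(mask_a, mask_b):
    """Hypothetical helper: stats for two 0/1 is-null masks of equal length."""
    n = len(mask_a)
    if n == 0 or n != len(mask_b):
        raise ValueError("masks must be non-empty and of equal length")
    # Rows where both columns are missing, and rows where at least one is.
    both = sum(1 for a, b in zip(mask_a, mask_b) if a and b)
    either = sum(1 for a, b in zip(mask_a, mask_b) if a or b)
    # Pearson correlation of two binary variables (the phi coefficient).
    p_a = sum(1 for a in mask_a if a) / n
    p_b = sum(1 for b in mask_b if b) / n
    var_a = p_a * (1.0 - p_a)
    var_b = p_b * (1.0 - p_b)
    if var_a > 0 and var_b > 0:
        corr = (both / n - p_a * p_b) / sqrt(var_a * var_b)
    else:
        corr = None  # a constant mask has no defined correlation
    jaccard = both / either if either else None
    return {"corr": corr, "co_missing": both, "jaccard": jaccard}


# Toy data: column A is missing in 8 of 10 rows, column B in 2 of 10,
# and every row where B is missing is also missing A.
mask_a = [1] * 8 + [0] * 2
mask_b = [1, 1] + [0] * 8
stats = absence_pair_stats(mask_a, mask_b)
print(stats)
```

With these toy masks the correlation and the Jaccard index both come out at 0.25: below the chapter's `_CORR_STRONG = 0.30` but above `_JACCARD_NOTABLE = 0.20`, which is the "partial co-occurrence" case the threshold comments describe for columns with very different null rates.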
@@ -0,0 +1,594 @@
"""Missingness chapter (MISSINGNESS) — patterns of missing data.

Complements the CALIDAD chapter: where CALIDAD reports *how much* is missing per
column (the null percentage that lowers the completeness score), this chapter
reports the **pattern** of the missing data — whether columns tend to be missing
*together* (co-occurrence of absences) or independently. That distinction is what
separates data that is missing completely at random ([[term:mcar]]MCAR[[/term]])
from data missing as a function of another variable ([[term:mar]]MAR[[/term]]),
which is the key question to settle before imputing or modelling.

The chapter activates only when the table actually has missing data (at least one
column with a null in the aggregated profile); otherwise it returns ``None`` and
disappears from the document.

Sections, in order:

1. **Resumen global** — % of missing cells in the dataset, number of columns with
   nulls, and complete rows (no missing) vs incomplete rows (≥1 missing).
2. **Ranking por columna** — columns sorted by their null percentage, with a
   horizontal bar figure.
3. **Co-ocurrencia de ausencias** — the correlation of the binary is-null masks
   between columns (which columns tend to be missing together): a heatmap plus a
   table of the top column pairs that co-miss.
4. **Patrones de fila** — the most frequent "which columns are missing together"
   row patterns, in the style of missingno's pattern matrix.
5. **Lectura MCAR/MAR** — an interpretive, *exploratory* note (not a confirmatory
   test such as Little's) reading the absence correlations as a hint of MCAR
   (independent absences) vs MAR (co-occurring absences).

The aggregate per-column null counts come from the ``eda`` group ``TableProfile``
(``columns[i]['null_count'] / 'null_pct'`` and the table-level ``null_cell_pct``).
The per-row is-null mask needed for co-occurrence is built from raw data: a single
DuckDB push-down over ``ctx['db_path'] / ctx['table']`` (same pattern as the
AGREGACION chapter) covering ALL columns, with a fallback to the numeric-only
``ctx['raw_numeric']`` when no database is reachable. All the heavy lifting is
delegated to pure registry functions (``missingness_overview``,
``missingness_correlation``, ``missingness_row_patterns``) and two figure helpers
(``missingness_rank_bar_figure``, ``missingness_corr_heatmap_figure``); every one
is imported lazily and degrades to an honest note so this chapter never raises.

Contract: build_<id>(profile, ctx) -> Chapter | None ; CHAPTER_VERSION = "x.y.z".
"""

from __future__ import annotations

from .. import model

CHAPTER_VERSION = "1.0.0"
CHAPTER_ID = "missingness"
CHAPTER_TITLE = "Datos faltantes"

# Sample cap for the per-row is-null mask push-down. Co-occurrence and row
# patterns are computed on this sample; the global % of missing cells and the
# per-column ranking come from the (exact) aggregated profile instead.
MASK_SAMPLE = 5000
# Thresholds for the MCAR/MAR heuristic note. A pair counts as a *strong*
# co-occurrence when the absence correlation alone is high; as a *partial*
# co-occurrence when the absences overlap materially (high Jaccard) even if the
# Pearson correlation is modest — the usual case when one column is missing far
# more often than the other (e.g. Cabin 77% vs Age 20% in Titanic), which dilutes
# the correlation while the rows still co-miss in absolute terms.
_CORR_STRONG = 0.30
_JACCARD_NOTABLE = 0.20
# Rows shown in the top-pairs and row-patterns tables (bounded, never silently
# truncated: the table note reports the full count).
_TOP_PAIRS = 12
_TOP_PATTERNS = 12
# Truncate long column names in tables (the renderer also wraps).
_LABEL_MAX = 28

# Glossary terms this chapter explains (contract §11.1). Registered in the shared
# collector and marked clickable on their first appearance.
_TERMS = {
    "missingness": (
        "Patrón de datos faltantes (missingness)",
        "El patrón con el que faltan los datos: cuánto falta, en qué columnas y "
        "si las ausencias de unas columnas coinciden (co-ocurren) con las de "
        "otras. Analizarlo —no solo contar nulos— distingue datos que faltan al "
        "azar (MCAR) de los que faltan en función de otra variable (MAR), lo que "
        "decide cómo imputar o si descartar filas sin sesgar el análisis.",
    ),
    "mcar": (
        "MCAR (Missing Completely At Random)",
        "Los valores faltan de forma independiente de cualquier dato, observado o "
        "no: las ausencias de unas columnas no se relacionan entre sí ni con los "
        "valores. Es el caso más benigno —descartar filas o imputar la media no "
        "introduce sesgo—, pero rara vez se cumple del todo en datos reales.",
    ),
    "mar": (
        "MAR (Missing At Random)",
        "La probabilidad de que un valor falte depende de OTRAS variables "
        "observadas (p. ej. una medición que falta más en cierto grupo). Las "
        "ausencias co-ocurren entre columnas o se relacionan con los valores de "
        "otras; imputar exige condicionar en esas variables para no sesgar. La "
        "co-ocurrencia fuerte de ausencias es un indicio (exploratorio) de MAR.",
    ),
}


# --------------------------------------------------------------------------- #
# Small defensive formatters (own copy: the chapter never imports siblings).
# --------------------------------------------------------------------------- #
def _fmt_int(value) -> str:
    if value is None:
        return "—"
    try:
        return f"{int(round(float(value))):,}".replace(",", ".")
    except (TypeError, ValueError):
        return model._safe_str(value)


def _fmt_pct(value, decimals: int = 1) -> str:
    """Format an already-0-100 value as a percentage. None -> placeholder."""
    if value is None:
        return "—"
    try:
        return f"{float(value):.{decimals}f}%"
    except (TypeError, ValueError):
        return model._safe_str(value)


def _fmt_num(value, decimals: int = 3) -> str:
    if value is None:
        return "—"
    try:
        f = float(value)
    except (TypeError, ValueError):
        return model._safe_str(value)
    if f != f:  # NaN
        return "—"
    text = f"{f:.{decimals}f}".rstrip("0").rstrip(".")
    return text if text else "0"


def _truncate(text, limit: int = _LABEL_MAX) -> str:
    s = model._safe_str(text)
    if len(s) <= limit:
        return s
    return s[: max(1, limit - 1)].rstrip() + "…"


def _term(key: str, label: str, mark: bool) -> str:
    if mark:
        return f"[[term:{key}]]**{label}**[[/term]]"
    return f"**{label}**"


# --------------------------------------------------------------------------- #
# Profile reads (exact, all rows).
# --------------------------------------------------------------------------- #
def _null_count_of(col: dict):
    """Best-effort null count of a column: ``null_count`` or null_pct*n_rows."""
    nc = col.get("null_count")
    if isinstance(nc, (int, float)) and not isinstance(nc, bool):
        return int(nc)
    np_ = col.get("null_pct")
    nr = col.get("n_rows")
    if isinstance(np_, (int, float)) and isinstance(nr, (int, float)):
        return int(round(float(np_) * float(nr)))
    return 0


def _columns_with_nulls(profile: dict):
    """Return ``[(name, null_count, null_pct_0_100)]`` for columns with nulls,
    sorted by null percentage descending. Reads the aggregated profile (exact)."""
    cols = profile.get("columns") or []
    out = []
    for c in cols:
        if not isinstance(c, dict):
            continue
        nc = _null_count_of(c)
        if nc <= 0:
            continue
        np_ = c.get("null_pct")
        nr = c.get("n_rows") or profile.get("n_rows")
        if isinstance(np_, (int, float)) and not isinstance(np_, bool):
            pct = float(np_) * 100.0 if np_ <= 1.0 else float(np_)
        elif nr:
            pct = nc / float(nr) * 100.0
        else:
            pct = None
        out.append((c.get("name") or "(col)", nc, pct))
    out.sort(key=lambda t: (t[2] if t[2] is not None else -1.0), reverse=True)
    return out


def _global_missing_pct(profile: dict):
    """Table-level % of missing cells (0-100), exact, from the profile."""
    v = profile.get("null_cell_pct")
    if isinstance(v, (int, float)) and not isinstance(v, bool):
        return float(v) * 100.0 if v <= 1.0 else float(v)
    return None


# --------------------------------------------------------------------------- #
# Per-row is-null mask (sample): DuckDB push-down, fallback to raw_numeric.
# --------------------------------------------------------------------------- #
def _build_query_fn(ctx: dict):
    """Return ``(query_fn, table)`` for a DuckDB-backed ctx, or ``(None, None)``.

    Mirrors build_eda_render_ctx: a read-only closure over the registry wrapper.
    Only DuckDB is supported here; any other backend degrades to raw_numeric."""
    db_path = ctx.get("db_path")
    table = ctx.get("table")
    if not db_path or not table:
        return None, None
    try:
        from infra import duckdb_query_readonly
    except Exception:  # noqa: BLE001 — wrapper unavailable -> degrade.
        return None, None

    def query_fn(sql):
        return duckdb_query_readonly(db_path, sql)

    return query_fn, table


def _null_mask(profile: dict, ctx: dict):
    """Build the per-row is-null mask ``{col: [0/1, ...]}``.

    Tries a single DuckDB push-down over ALL columns first (so categorical
    columns like Cabin are covered, not only numeric ones); falls back to the
    numeric-only ``ctx['raw_numeric']`` (None -> missing); returns ``(None, 0,
    None)`` when neither is reachable. Never raises.
    Returns ``(mask, n_sampled, source)`` with source in {"db","raw_numeric"}.
    """
    cols = profile.get("columns") or []
    names = [c.get("name") for c in cols
             if isinstance(c, dict) and c.get("name")]
    # 1) DuckDB push-down over every column (covers categoricals too).
    query_fn, table = _build_query_fn(ctx)
    if query_fn is not None and names:
        try:
            from datascience.extract_null_mask import extract_null_mask

            res = extract_null_mask(query_fn, table, names, max_rows=MASK_SAMPLE)
            if isinstance(res, dict) and res.get("status") == "ok":
                mask = res.get("mask") or {}
                if mask:
                    return mask, int(res.get("n") or 0), "db"
        except Exception:  # noqa: BLE001 — degrade to raw_numeric.
            pass
    # 2) Fallback: numeric-only mask derived from raw_numeric (None -> missing).
    rn = ctx.get("raw_numeric")
    if isinstance(rn, dict) and rn:
        mask = {}
        for col, vals in rn.items():
            if isinstance(vals, (list, tuple)):
                mask[col] = [1 if v is None else 0 for v in vals]
        if mask:
            n = max((len(v) for v in mask.values()), default=0)
            return mask, n, "raw_numeric"
    return None, 0, None


# --------------------------------------------------------------------------- #
# Lazy registry delegations (each degrades to None on any failure).
# --------------------------------------------------------------------------- #
def _overview(mask: dict):
    try:
        from datascience.missingness_overview import missingness_overview

        out = missingness_overview(mask)
        return out if isinstance(out, dict) else None
    except Exception:  # noqa: BLE001
        return None


def _correlation(mask: dict, top_k: int):
    try:
        from datascience.missingness_correlation import missingness_correlation

        out = missingness_correlation(mask, top_k=top_k)
        return out if isinstance(out, dict) else None
    except Exception:  # noqa: BLE001
        return None


def _row_patterns(mask: dict, top_n: int):
    try:
        from datascience.missingness_row_patterns import missingness_row_patterns

        out = missingness_row_patterns(mask, top_n=top_n)
        return out if isinstance(out, dict) else None
    except Exception:  # noqa: BLE001
        return None


def _rank_bar_make(names, pcts, title):
    def make():
        try:
            from datascience.missingness_rank_bar_figure import (
                missingness_rank_bar_figure,
            )

            return missingness_rank_bar_figure(names, pcts, title=title)
        except Exception:  # noqa: BLE001 — minimal fallback figure.
            return _fallback_fig("ranking de nulos no disponible")

    return make


def _heatmap_make(matrix, labels, title):
    def make():
        try:
            from datascience.missingness_corr_heatmap_figure import (
                missingness_corr_heatmap_figure,
            )

            return missingness_corr_heatmap_figure(matrix, labels, title=title)
        except Exception:  # noqa: BLE001 — minimal fallback figure.
            return _fallback_fig("heatmap de co-ocurrencia no disponible")

    return make


def _fallback_fig(message: str):
    import matplotlib

    matplotlib.use("Agg")
    from matplotlib.figure import Figure

    fig = Figure(figsize=(5.0, 2.2))
    ax = fig.add_subplot(111)
    ax.text(0.5, 0.5, message, ha="center", va="center")
    ax.axis("off")
    return fig


# --------------------------------------------------------------------------- #
# Block builders.
# --------------------------------------------------------------------------- #
def _summary_block(profile: dict, with_nulls: list, overview, sampled, n_total):
    rows = []
    gpct = _global_missing_pct(profile)
    rows.append(("Celdas faltantes (global)", _fmt_pct(gpct)))
    rows.append(("Columnas con faltantes", str(len(with_nulls))))
    all_null = profile.get("all_null_cols")
    if isinstance(all_null, (list, tuple)) and all_null:
        rows.append(("Columnas 100% faltantes", str(len(all_null))))
    if isinstance(overview, dict):
        cr = overview.get("complete_rows")
        ir = overview.get("incomplete_rows")
        suffix = ""
        if (isinstance(sampled, int) and isinstance(n_total, (int, float))
                and sampled and n_total and sampled < n_total):
            suffix = f" (sobre muestra de {_fmt_int(sampled)} filas)"
        if cr is not None:
            rows.append(("Filas completas (sin faltantes)",
                         f"{_fmt_int(cr)} ({_fmt_pct(overview.get('complete_pct'))})"
                         + suffix))
        if ir is not None:
            rows.append(("Filas con ≥1 faltante",
                         f"{_fmt_int(ir)} "
                         f"({_fmt_pct(overview.get('incomplete_pct'))})" + suffix))
    return model.KVTable(rows=rows, title="Resumen de datos faltantes")


def _ranking_block(with_nulls: list):
    header = ["Columna", "Faltantes", "% faltante"]
    rows = [[_truncate(n), _fmt_int(c), _fmt_pct(p)] for (n, c, p) in with_nulls]
    if not rows:
        return None
    return model.DataTable(
        header=header, rows=rows, title="Faltantes por columna",
        note="ordenado de más a menos faltante")


def _ranking_figure(with_nulls: list):
    names = [n for (n, _, p) in with_nulls if p is not None]
    pcts = [p for (_, _, p) in with_nulls if p is not None]
    if not names:
        return None
    return model.Figure(
        make=_rank_bar_make(names, pcts, "% de valores faltantes por columna"),
        caption="Porcentaje de valores faltantes por columna (barras).")


def _pairs_block(corr: dict):
    """Top column pairs whose absences co-occur, as a table, or None."""
    pairs = (corr or {}).get("pairs") or []
    header = ["Columna A", "Columna B", "Corr. ausencia", "Co-faltan", "Jaccard"]
    rows = []
    for p in pairs[:_TOP_PAIRS]:
        if not isinstance(p, dict):
            continue
        rows.append([
            _truncate(p.get("a")),
            _truncate(p.get("b")),
            _fmt_num(p.get("corr")),
            _fmt_int(p.get("co_missing")),
            _fmt_num(p.get("jaccard")),
        ])
    if not rows:
        return None
    shown = len(rows)
    total = len(pairs)
    note = ("correlación de las máscaras is-null entre columnas; "
            "«Co-faltan» = nº de filas en que ambas faltan a la vez")
    if total > shown:
        note += f" — top {shown} de {total} pares"
    return model.DataTable(header=header, rows=rows,
                           title="Pares de columnas que co-faltan", note=note)


def _heatmap_block(corr: dict):
    cols = (corr or {}).get("columns") or []
    matrix = (corr or {}).get("matrix") or []
    if len(cols) < 2 or not matrix:
        return None
    labels = [_truncate(c, 16) for c in cols]
    return model.Figure(
        make=_heatmap_make(matrix, labels, "Co-ocurrencia de ausencias"),
        caption=("Correlación de las ausencias entre columnas (azul = faltan "
                 "juntas; rojo = cuando una falta la otra tiende a estar)."))


def _patterns_block(patterns_res: dict):
    patterns = (patterns_res or {}).get("patterns") or []
    header = ["Columnas que faltan juntas", "Filas", "%"]
    rows = []
    for p in patterns[:_TOP_PATTERNS]:
        if not isinstance(p, dict):
            continue
        cols = p.get("missing_cols") or []
        if cols:
            label = ", ".join(_truncate(c, 18) for c in cols)
        else:
            label = "(fila completa — sin faltantes)"
        rows.append([label, _fmt_int(p.get("n_rows")), _fmt_pct(p.get("pct"))])
    if not rows:
        return None
    total = (patterns_res or {}).get("n_patterns")
    shown = len(rows)
    note = "cada fila es un patrón de «qué columnas faltan juntas»"
    if isinstance(total, int) and total > shown:
        note += f" — top {shown} de {total} patrones distintos"
    return model.DataTable(header=header, rows=rows,
                           title="Patrones de fila más comunes", note=note)


def _mcar_mar_note(corr: dict, mark: bool):
    """Interpretive, exploratory MCAR/MAR note from the absence correlations.

    Reads the absence correlations at two levels so the verdict never contradicts
    the visible evidence: a *strong* correlation flags a clear non-random (MAR)
    pattern; a *partial* overlap (many rows co-miss — high Jaccard — even if the
    correlation is diluted by one column being missing far more often) flags a
    localized possible-MAR and cites the concrete co-missing pair; only when
    neither holds does it read the absences as compatible with MCAR."""

    def _pairs_with(attr_ok):
        out = []
        for p in (corr or {}).get("pairs") or []:
            if isinstance(p, dict) and attr_ok(p):
                out.append(p)
        return out

    def _cf(v):
        try:
            return float(v)
        except (TypeError, ValueError):
            return 0.0

    strong = _pairs_with(lambda p: abs(_cf(p.get("corr"))) >= _CORR_STRONG)
    partial = _pairs_with(
        lambda p: _cf(p.get("corr")) > 0 and _cf(p.get("jaccard")) >= _JACCARD_NOTABLE)
    mcar = _term("mcar", "MCAR", mark)
    mar = _term("mar", "MAR", mark)
    head = (
        "**Lectura exploratoria MCAR/MAR.** Esta es una heurística basada en la "
        "correlación de las ausencias entre columnas, NO un test confirmatorio "
        "(como el de Little); orienta, no demuestra. ")
    if strong:
        top = strong[0]
        ev = (f"«{model._safe_str(top.get('a'))}» y "
              f"«{model._safe_str(top.get('b'))}» "
              f"(corr {_fmt_num(top.get('corr'))})")
        body = (
            f"Hay ausencias que co-ocurren con fuerza —{ev}—: las columnas no "
            f"faltan de forma independiente, lo que es un indicio de un patrón no "
            f"aleatorio ({mar}). Antes de imputar o descartar filas conviene "
            f"comprobar si la ausencia depende de otra variable observada; en ese "
            f"caso la imputación debería condicionar en ella para no sesgar.")
    elif partial:
        top = max(partial, key=lambda p: _cf(p.get("jaccard")))
        ev = (f"«{model._safe_str(top.get('a'))}» y "
              f"«{model._safe_str(top.get('b'))}» faltan a la vez en "
              f"{_fmt_int(top.get('co_missing'))} filas "
              f"(Jaccard {_fmt_num(top.get('jaccard'))})")
        body = (
            f"Hay co-ocurrencia parcial de ausencias —{ev}—: algunas columnas "
            f"tienden a faltar juntas aunque la correlación global sea modesta "
            f"(habitual cuando una columna falta mucho más que la otra). Es un "
            f"indicio de un posible patrón localizado no aleatorio ({mar}); "
            f"conviene revisar si esa ausencia depende de otra variable observada "
            f"antes de imputar, en lugar de asumir que faltan al azar.")
    else:
        body = (
            f"Las ausencias entre columnas no muestran correlación ni solape "
            f"relevante: parecen independientes, lo que es compatible con que "
            f"falten al azar ({mcar}). Aun así, la ausencia podría depender de "
            f"variables no observadas (la heurística no lo descarta).")
    return model.Markdown(text=head + body)


def _intro_block(mark: bool, source):
    missingness = _term("missingness", "missingness", mark)
    text = (
        f"Este capítulo analiza el {missingness} de la tabla: no solo cuánto "
        "falta (eso lo cubre la calidad), sino DÓNDE falta y si las columnas "
        "faltan juntas. La co-ocurrencia de ausencias se calcula sobre la matriz "
        "binaria «is-null» por fila.")
    if source == "raw_numeric":
        text += (" Nota: no se pudo leer la tabla cruda completa, así que la "
                 "co-ocurrencia se limita a las columnas numéricas disponibles.")
    return model.Markdown(text=text)


# --------------------------------------------------------------------------- #
# Entry point.
# --------------------------------------------------------------------------- #
def build_missingness(profile: dict, ctx: dict):
    """Build the missingness Chapter, or None if the table has no missing data."""
    if not isinstance(profile, dict):
        profile = {}
    ctx = ctx or {}

    with_nulls = _columns_with_nulls(profile)
    if not with_nulls:
        return None  # no missing data anywhere -> chapter does not apply.

    # Register glossary terms (if a collector is present) and mark them clickable.
    glossary = ctx.get("glossary")
    mark = False
    if isinstance(glossary, model.GlossaryCollector):
        for key, (label, definition) in _TERMS.items():
            glossary.add(key, label, definition)
        mark = True

    # Per-row is-null mask (sample) for co-occurrence and row patterns.
    mask, sampled, source = _null_mask(profile, ctx)
    overview = _overview(mask) if mask else None
    n_total = profile.get("n_rows")

    blocks = [
        model.Heading(text="Cuánto y dónde faltan datos", level=2),
        _intro_block(mark, source),
        _summary_block(profile, with_nulls, overview, sampled, n_total),
        model.Heading(text="Faltantes por columna", level=2),
    ]
    ranking = _ranking_block(with_nulls)
    if ranking is not None:
        blocks.append(ranking)
    rank_fig = _ranking_figure(with_nulls)
    if rank_fig is not None:
        blocks.append(rank_fig)

    # Co-occurrence + row patterns need the per-row mask. Without it, say so.
    if not mask:
        blocks.append(model.Note(
            "No se pudo construir la matriz «is-null» por fila (sin acceso a los "
            "datos crudos), así que no se analiza la co-ocurrencia de ausencias "
            "ni los patrones de fila en este informe."))
        return model.Chapter(id=CHAPTER_ID, title=CHAPTER_TITLE,
                             version=CHAPTER_VERSION, blocks=blocks)

    corr = _correlation(mask, _TOP_PAIRS) or {}
    co_blocks = [model.Heading(text="Co-ocurrencia de ausencias", level=2)]
    heatmap = _heatmap_block(corr)
    if heatmap is not None:
        co_blocks.append(heatmap)
    pairs = _pairs_block(corr)
    if pairs is not None:
        co_blocks.append(pairs)
    if heatmap is None and pairs is None:
        co_blocks.append(model.Note(
            "Ninguna pareja de columnas comparte ausencias con variación "
            "suficiente para correlacionarlas (p. ej. una sola columna con "
            "faltantes), así que no hay co-ocurrencia que mostrar."))
    # Keep the co-occurrence heading next to its heatmap and table.
    blocks.append(model.Group(blocks=co_blocks))

    patterns_res = _row_patterns(mask, _TOP_PATTERNS) or {}
    patterns = _patterns_block(patterns_res)
    if patterns is not None:
        blocks.append(model.Heading(text="Patrones de fila", level=2))
        blocks.append(patterns)

    blocks.append(model.Heading(text="Lectura MCAR / MAR", level=2))
    blocks.append(_mcar_mar_note(corr, mark))

    return model.Chapter(id=CHAPTER_ID, title=CHAPTER_TITLE,
                         version=CHAPTER_VERSION, blocks=blocks)
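The "Patrones de fila" table in the chapter reads a result with the keys `patterns` (each entry carrying `missing_cols`, `n_rows` and `pct`) and `n_patterns`. The sketch below shows one way such a result could be derived from the per-row is-null mask. It is an assumption-laden illustration: `row_patterns_sketch` is a hypothetical name, and the real `missingness_row_patterns` is not included in this diff and may differ (for example in tie-breaking or in how `pct` is scaled).

```python
from collections import Counter


def row_patterns_sketch(mask, top_n=12):
    """Hypothetical sketch: count 'which columns are missing together' per row.

    `mask` maps column name -> list of 0/1 flags (1 = value is null).
    """
    cols = list(mask)
    # Use the shortest column so ragged inputs never index out of range.
    n = min((len(mask[c]) for c in cols), default=0)
    # One pattern per row: the tuple of columns that are null in that row.
    counts = Counter(
        tuple(c for c in cols if mask[c][i]) for i in range(n)
    )
    patterns = [
        {"missing_cols": list(key), "n_rows": rows, "pct": 100.0 * rows / n}
        for key, rows in counts.most_common(top_n)
    ]
    return {"patterns": patterns, "n_patterns": len(counts)}


# Same shape as the raw_numeric test fixture: a and b are null in rows 0-2.
example = row_patterns_sketch({
    "a": [1, 1, 1, 0, 0, 0],
    "b": [1, 1, 1, 0, 0, 0],
})
print(example)
```

An empty tuple stands for a complete row, which matches how `_patterns_block` labels an entry whose `missing_cols` is empty.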
@@ -0,0 +1,162 @@
"""Tests for the MISSINGNESS chapter.

Covers the Definition of Done for this chapter:
  * Activates (non-None Chapter with the expected sections) when the profile has
    missing data, building the co-occurrence from the per-row is-null mask.
  * Returns None when the table has no missing data at all (edge case).
  * Registers the MCAR/MAR/missingness glossary terms.
  * The DuckDB push-down path covers categorical columns (not only numeric),
    so a categorical column that co-misses with a numeric one is detected.
"""

import os
import sys

_HERE = os.path.dirname(os.path.abspath(__file__))
_FUNCTIONS = os.path.abspath(os.path.join(_HERE, "..", "..", ".."))  # python/functions
if _FUNCTIONS not in sys.path:
    sys.path.insert(0, _FUNCTIONS)

from datascience.automatic_eda import model  # noqa: E402
from datascience.automatic_eda.chapters.missingness import (  # noqa: E402
    build_missingness,
)


def _titles(chapter):
    """Collect heading texts and table/figure titles for assertions."""
    out = []
    for b in chapter.blocks:
        kind = getattr(b, "kind", None)
        if kind == "heading":
            out.append(("heading", getattr(b, "text", "")))
        elif kind in ("data_table", "kv_table"):
            out.append((kind, getattr(b, "title", "")))
        elif kind == "group":
            for inner in getattr(b, "blocks", []):
                ik = getattr(inner, "kind", None)
                if ik == "heading":
                    out.append(("heading", getattr(inner, "text", "")))
                elif ik in ("data_table", "kv_table"):
                    out.append((ik, getattr(inner, "title", "")))
                elif ik == "figure":
                    out.append(("figure", getattr(inner, "caption", "")))
        elif kind == "figure":
            out.append(("figure", getattr(b, "caption", "")))
    return out


def _all_text(chapter):
    parts = []
    def walk(blocks):
        for b in blocks:
            for attr in ("text", "title", "note", "caption"):
                v = getattr(b, attr, None)
                if v:
                    parts.append(str(v))
            if getattr(b, "kind", None) == "group":
                walk(getattr(b, "blocks", []))
    walk(chapter.blocks)
    return "\n".join(parts)


def test_returns_none_when_no_missing_data():
    profile = {
        "n_rows": 4,
        "null_cell_pct": 0.0,
        "columns": [
            {"name": "a", "null_count": 0, "null_pct": 0.0, "n_rows": 4},
            {"name": "b", "null_count": 0, "null_pct": 0.0, "n_rows": 4},
        ],
    }
    assert build_missingness(profile, {}) is None


def test_activates_with_cooccurrence_via_raw_numeric():
    # a and b are missing in EXACTLY the same rows (0,1,2) -> perfect absence
    # correlation. c has no nulls. No db_path -> the chapter falls back to the
    # numeric raw_numeric mask.
    profile = {
        "n_rows": 6,
        "null_cell_pct": (0.5 + 0.5 + 0.0) / 3.0,
        "columns": [
            {"name": "a", "null_count": 3, "null_pct": 0.5, "n_rows": 6},
            {"name": "b", "null_count": 3, "null_pct": 0.5, "n_rows": 6},
            {"name": "c", "null_count": 0, "null_pct": 0.0, "n_rows": 6},
        ],
    }
    glossary = model.GlossaryCollector()
    ctx = {
        "raw_numeric": {
            "a": [None, None, None, 1.0, 2.0, 3.0],
            "b": [None, None, None, 4.0, 5.0, 6.0],
        },
        "glossary": glossary,
    }
    ch = build_missingness(profile, ctx)
    assert ch is not None
    assert ch.id == "missingness"
    assert ch.blocks

    titles = _titles(ch)
    headings = {t for (k, t) in titles if k == "heading"}
    # Core sections present.
    assert any("Cuánto y dónde" in h for h in headings)
    assert any("Faltantes por columna" in h for h in headings)
    assert any("Co-ocurrencia" in h for h in headings)
    assert any("MCAR" in h for h in headings)
    # A summary KVTable, a ranking DataTable, a co-occurrence figure and the
    # pairs table all exist.
    kinds = {k for (k, _) in titles}
    assert "kv_table" in kinds
    assert "data_table" in kinds
    assert "figure" in kinds

    # Glossary terms registered.
    keys = {t["key"] for t in glossary.terms()}
    assert {"missingness", "mcar", "mar"} <= keys

    # The MCAR/MAR note reads the co-occurrence; with a perfect overlap it must
    # flag the non-random (MAR) reading.
    text = _all_text(ch)
    assert "MAR" in text


def test_db_pushdown_covers_categorical_column(tmp_path):
    """The is-null mask push-down must cover a categorical column, so a
    categorical that co-misses with a numeric one shows up in the pairs."""
    import duckdb

    db = str(tmp_path / "miss.duckdb")
    con = duckdb.connect(db)
    con.execute("CREATE TABLE t (num1 DOUBLE, num2 DOUBLE, cat VARCHAR)")
    # num1 and cat are NULL together in the first 4 of 10 rows; num2 never null.
    rows = []
    for i in range(10):
        if i < 4:
            rows.append((None, float(i), None))
        else:
            rows.append((float(i), float(i), f"c{i}"))
    con.executemany("INSERT INTO t VALUES (?,?,?)", rows)
    con.close()

    profile = {
        "n_rows": 10,
        "null_cell_pct": (0.4 + 0.0 + 0.4) / 3.0,
        "columns": [
            {"name": "num1", "null_count": 4, "null_pct": 0.4, "n_rows": 10},
            {"name": "num2", "null_count": 0, "null_pct": 0.0, "n_rows": 10},
            {"name": "cat", "null_count": 4, "null_pct": 0.4, "n_rows": 10},
        ],
    }
    ctx = {"db_path": db, "table": "t", "glossary": model.GlossaryCollector()}
    ch = build_missingness(profile, ctx)
    assert ch is not None

    # The pairs table must mention both num1 and cat (they co-miss perfectly),
    # which is only possible if the mask covered the categorical column.
    text = _all_text(ch)
    assert "num1" in text and "cat" in text
    # Co-occurrence section + a pairs data table exist.
    titles = _titles(ch)
    assert any("co-faltan" in (t or "").lower() for (k, t) in titles)
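The categorical-coverage test exercises the is-null push-down, whose SQL lives in `extract_null_mask` and is not part of this diff. The sketch below is only a guess at the general shape of such a query: one `CASE WHEN col IS NULL` expression per column, capped with `LIMIT`. Everything here is an assumption made for illustration: the name `null_mask_sketch`, the idea that `query_fn` returns a list of row tuples, and the use of the standard-library `sqlite3` module as a stand-in for DuckDB so the example runs without extra dependencies. Only the result keys `status`, `mask` and `n` are taken from what `_null_mask` reads.

```python
import sqlite3


def quote_ident(name):
    """Double-quote an SQL identifier, escaping embedded double quotes."""
    return '"' + str(name).replace('"', '""') + '"'


def null_mask_sketch(query_fn, table, columns, max_rows=5000):
    """Hypothetical sketch of a per-row is-null mask built in one query."""
    if not columns:
        raise ValueError("at least one column is required")
    # One 0/1 expression per column, so categoricals are covered as well.
    select = ", ".join(
        f"CASE WHEN {quote_ident(c)} IS NULL THEN 1 ELSE 0 END" for c in columns
    )
    # LIMIT caps the rows read; it takes the first rows, not a random sample.
    sql = f"SELECT {select} FROM {quote_ident(table)} LIMIT {int(max_rows)}"
    rows = query_fn(sql)  # assumed to return a list of tuples
    mask = {c: [int(r[i]) for r in rows] for i, c in enumerate(columns)}
    return {"status": "ok", "mask": mask, "n": len(rows)}


# Stand-in database mirroring the test fixture: num1 and cat are NULL
# together in 4 of 10 rows, num2 is never NULL.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (num1 REAL, num2 REAL, cat TEXT)")
con.executemany(
    "INSERT INTO t VALUES (?,?,?)",
    [(None, float(i), None) if i < 4 else (float(i), float(i), f"c{i}")
     for i in range(10)],
)
result = null_mask_sketch(
    lambda sql: con.execute(sql).fetchall(), "t", ["num1", "num2", "cat"]
)
print(result["n"], result["mask"]["cat"])
```

Because the mask for `cat` is produced by the same query as the numeric columns, a pair such as `num1`/`cat` can reach the co-occurrence step, which is the property the test asserts.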