6f88f184f1
New dedicated `outliers` chapter for the AutomaticEDA engine. It gathers and deepens in one place the analysis of atypical values, currently scattered between `num_distr` (per-column count) and `modelos` (IsolationForest). It is registered in `chapters_registry.py` between `missingness` and `correlacion` (data-quality block: calidad → missingness → outliers).

Chapter contents:
- Univariate summary per column: number and percentage of atypical values by Tukey (1.5·IQR) and by z-score (|z| > 3), with lower/upper fences and extreme values. Sorted by contamination, with the most affected columns flagged. Reuses the registry functions `build_boxplot_stats` (fences from the profile percentiles) and `detect_outliers` (z-score rule over the raw sample in `ctx`).
- Tukey boxplots of the most contaminated columns (box, whiskers and atypical points), delegated to the new function `build_boxplots_figure`.
- Multivariate: anomalous rows considering all columns at once with `isolation_forest_outliers`: number and percentage of rows, the most anomalous rows with their score, and the dimensions that make them rare (top columns by |z|, via the new function `summarize_outlier_dims`). The detector runs live on `raw_numeric` so that the row indexing matches exactly the one used for the dimensions; it falls back to the precomputed block of the profile when there is no raw sample (lite preset).
- Exploratory interpretation: an atypical value is not necessarily an error (data error vs. genuine extreme value), plus recommendations (inspect, winsorize or re-express, linking to the Tukey re-expression in the profile).

Clickable terms registered in the shared glossary: `outlier`, `tukey_fence`, `zscore`, `isolation_forest`.

New registry functions (datascience domain, eda group):
- `build_boxplots_figure_py_datascience` (figure helper, impure)
- `summarize_outlier_dims_py_datascience` (pure)

The chapter activates with ≥1 numeric column and returns None otherwise; it reads everything defensively and never raises.

Tests: chapter (golden + edges + error path + PDF/PPTX render) and both new functions. AutomaticEDA non-regression suite green. Verified end-to-end with the Titanic dataset (Fare/Parch/SibSp as the most contaminated columns).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
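The two univariate rules named in the commit message can be sketched with the standard library alone. This is a minimal illustration, not the registry code: `tukey_fences`, `tukey_outliers` and `zscore_flags` are hypothetical names (they are not `build_boxplot_stats` or `detect_outliers`), and the quartile method (`statistics.quantiles` with `method="inclusive"`) and the use of the population standard deviation are assumptions.

```python
import statistics


def tukey_fences(values, k=1.5):
    """Return (lower, upper) fences: P25 - k*IQR and P75 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def tukey_outliers(values, k=1.5):
    """Values that fall strictly outside the Tukey fences."""
    lower, upper = tukey_fences(values, k)
    return [v for v in values if v < lower or v > upper]


def zscore_flags(values, thresh=3.0):
    """Parallel booleans: True where |z| > thresh (population std dev)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        # A constant column has no spread, so nothing can be flagged.
        return [False] * len(values)
    return [abs((v - mean) / sd) > thresh for v in values]


# 21 values: both rules agree that 250.0 is atypical.
sample = [8.0] * 10 + [10.0] * 10 + [250.0]
print(tukey_fences(sample))       # (5.0, 13.0)
print(tukey_outliers(sample))     # [250.0]
print(sum(zscore_flags(sample)))  # 1

# 10 values: the extreme value inflates the std dev so much that |z| stays
# below 3 (with n values, |z| can never exceed sqrt(n - 1)), while the
# quartile-based fences still catch it.
small = [7.0, 7.5, 8.0, 8.0, 8.5, 9.0, 9.5, 10.0, 10.5, 250.0]
print(any(zscore_flags(small)))   # False
print(tukey_outliers(small))      # [250.0]
```

The second sample shows why the chapter reports both criteria side by side: the z-score depends on the mean and standard deviation, which the atypical values themselves distort, whereas the fences depend only on the quartiles.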
594 lines
26 KiB
Python
"""Outliers chapter (OUTLIERS): univariate + multivariate atypical values.

Today the analysis of atypical values is scattered across the document: the
NUM DISTR chapter mentions the per-column outlier count inside each distribution
figure, and the MODELOS chapter runs Isolation Forest as one of several cheap
models. This chapter gathers and deepens the whole outlier story in a single
place, with its interpretation: an [[term:outlier]]outlier[[/term]] is **not
necessarily an error** (it can be a legitimate, extreme but real observation),
so the reading is exploratory (what to look at), never confirmatory (what to
delete).

Sections, in order:

1. **Resumen univariante por columna**: for every numeric column, the number
   and percentage of atypical values by two complementary criteria: Tukey's
   1.5·IQR rule ([[term:tukey_fence]]vallas de Tukey[[/term]]) and the
   [[term:zscore]]z-score[[/term]] rule (|z| > 3). The most contaminated columns
   are flagged. The fences come from the pure registry function
   ``build_boxplot_stats`` (derived from the profile percentiles); the per-column
   counts use the raw sample in ``ctx['raw_numeric']`` when available (the exact
   count), degrading to the profile's own z-score counts otherwise.
2. **Boxplots**: a single figure with the Tukey boxplots of the most
   contaminated columns (box, whiskers and atypical points), delegated to the
   reusable registry helper ``build_boxplots_figure``.
3. **Multivariante (filas anómalas)**: rows that are atypical considering ALL
   columns at once, via the registry function ``isolation_forest_outliers``: the
   count and percentage of anomalous rows, the most anomalous rows with their
   score, and the dimensions that make each one rare (top columns by |z|, via
   ``summarize_outlier_dims``). Run live on ``ctx['raw_numeric']`` (the same
   numeric columns ``summarize_outlier_dims`` uses, so the row indexing stays
   coherent and the dimension breakdown is correct); falls back to the
   precomputed ``profile['models']['outliers']`` only when no raw sample is
   available (e.g. the lite preset), where no per-row breakdown is shown.
4. **Interpretación**: outlier ≠ error: how to tell a data-entry error from a
   genuine extreme value, and what to do (inspect, winsorize, or re-express,
   linking to the Tukey re-expression the profile already computes).

The chapter activates whenever the table has at least one numeric column; with
no numeric column it returns ``None`` and disappears from the document.

Reads everything defensively (``.get``) and never raises: every registry
delegation is imported lazily and degraded to an honest note on any failure.

Contract: build_<id>(profile, ctx) -> Chapter | None ; CHAPTER_VERSION = "x.y.z".
"""

from __future__ import annotations

from .. import model

CHAPTER_VERSION = "1.0.0"
CHAPTER_ID = "outliers"
CHAPTER_TITLE = "Valores atípicos"

# z-score threshold for the univariate z rule: |z| > 3 flags a value ~3 standard
# deviations from the mean (≈99.7% of a normal distribution lies within ±3σ).
_Z_THRESH = 3.0
# How many columns to draw in the boxplots figure (most contaminated first) and
# how many anomalous rows to list in the multivariate table.
_TOP_BOX = 12
_TOP_ROWS = 12
# Cap on the raw atypical values passed as boxplot fliers, so a heavy-tailed
# column does not flood the figure with thousands of points.
_MAX_FLIERS = 200
# How many columns flagged as "most contaminated" in the summary note.
_TOP_FLAGGED = 3

# Glossary terms this chapter explains (contract §11.1). Registered in the shared
# collector and marked clickable on first appearance. ``isolation_forest`` and
# ``zscore`` may also be registered by the MODELOS chapter; ``add`` is
# idempotent (first definition wins), so registering them here is harmless and
# keeps this chapter self-contained when MODELOS does not render.
_TERM_DEFS = {
    "outlier": (
        "Valor atípico (outlier)",
        "Una observación que se aparta mucho del grueso de los datos. Un atípico "
        "NO es necesariamente un error: puede ser un fallo de medida o de "
        "registro, pero también un dato real extremo (un cliente que gasta diez "
        "veces la media, un día de ventas excepcional). Por eso se señalan para "
        "revisarlos, no para borrarlos automáticamente.",
    ),
    "tukey_fence": (
        "Vallas de Tukey (1,5·IQR)",
        "Regla clásica para marcar atípicos a partir de los cuartiles: se calcula "
        "el rango intercuartílico IQR = P75 − P25 y se trazan dos vallas, una "
        "inferior en P25 − 1,5·IQR y otra superior en P75 + 1,5·IQR. Los valores "
        "que caen fuera de esas vallas se consideran atípicos. Es robusta porque "
        "se apoya en la mediana y los cuartiles, no en la media.",
    ),
    "zscore": (
        "z-score (puntuación típica)",
        "Mide a cuántas desviaciones típicas está un valor de la media de su "
        "columna: z = (valor − media) / desviación típica. Un |z| grande (aquí > "
        "3) señala un valor alejado del centro. A diferencia de las vallas de "
        "Tukey, el z-score usa media y desviación, así que es más sensible a la "
        "presencia de los propios atípicos.",
    ),
    "isolation_forest": (
        "Isolation Forest (anomalías multivariantes)",
        "Algoritmo de detección de anomalías que considera TODAS las columnas a "
        "la vez: construye árboles que parten el espacio con cortes aleatorios y "
        "mide cuántos cortes hacen falta para aislar cada fila. Las filas raras "
        "se aíslan con muy pocos cortes y se marcan como atípicas según un umbral "
        "de contaminación. Detecta combinaciones de valores poco frecuentes que "
        "ninguna columna por separado revelaría.",
    ),
}


# --------------------------------------------------------------------------- #
# Lazy registry delegations (each degrades to None / no-op on any failure).
# --------------------------------------------------------------------------- #
def _load_build_boxplot_stats():
    try:
        from datascience.build_boxplot_stats import build_boxplot_stats
        return build_boxplot_stats
    except Exception:  # noqa: BLE001
        return None


def _load_detect_outliers():
    # detect_outliers lives in the monolithic ``datascience.datascience`` module
    # (file_path datascience.py), not in its own submodule; try both shapes.
    try:
        from datascience.datascience import detect_outliers
        return detect_outliers
    except Exception:  # noqa: BLE001
        try:
            from datascience import detect_outliers
            return detect_outliers
        except Exception:  # noqa: BLE001
            return None


def _load_isolation_forest():
    try:
        from datascience.isolation_forest_outliers import isolation_forest_outliers
        return isolation_forest_outliers
    except Exception:  # noqa: BLE001
        return None


def _load_summarize_dims():
    try:
        from datascience.summarize_outlier_dims import summarize_outlier_dims
        return summarize_outlier_dims
    except Exception:  # noqa: BLE001
        return None


# --------------------------------------------------------------------------- #
# Defensive formatters (own copy: the chapter never imports siblings).
# --------------------------------------------------------------------------- #
def _fmt_num(value, decimals: int = 3) -> str:
    if value is None:
        return "—"
    if isinstance(value, bool):
        return "sí" if value else "no"
    if isinstance(value, int):
        return f"{value:,}".replace(",", ".")
    if isinstance(value, float):
        if value != value:  # NaN
            return "—"
        if value in (float("inf"), float("-inf")):
            return str(value)
        text = f"{value:.{decimals}f}"
        # Strip trailing zeros only after a decimal point, so that decimals=0
        # does not turn "100" into "1".
        if "." in text:
            text = text.rstrip("0").rstrip(".")
        # Tiny negatives round to "-0"; show them as plain zero.
        return "0" if text in ("", "-0") else text
    return model._safe_str(value)


def _fmt_int(value) -> str:
    if value is None:
        return "—"
    try:
        return f"{int(round(float(value))):,}".replace(",", ".")
    except (TypeError, ValueError, OverflowError):
        # ValueError covers NaN and OverflowError covers ±inf, so the
        # formatter keeps the chapter's never-raise contract.
        return model._safe_str(value)


def _fmt_pct(value, decimals: int = 2) -> str:
    """Format an already-0-100 value as a percentage. None -> placeholder."""
    if value is None:
        return "—"
    try:
        return f"{float(value):.{decimals}f}%"
    except (TypeError, ValueError):
        return model._safe_str(value)


def _term(mark: bool, key: str, text: str) -> str:
    return f"[[term:{key}]]{text}[[/term]]" if mark else text


def _is_dict(v) -> bool:
    return isinstance(v, dict)


# --------------------------------------------------------------------------- #
# Profile reads.
# --------------------------------------------------------------------------- #
def _numeric_columns(profile: dict) -> list:
    """Return [(name, numeric_dict)] for numeric columns with usable stats."""
    out = []
    for col in profile.get("columns") or []:
        if not isinstance(col, dict):
            continue
        if col.get("inferred_type") != "numeric":
            continue
        num = col.get("numeric")
        if not isinstance(num, dict) or not num:
            continue
        if num.get("mean") is None and num.get("median") is None:
            continue
        out.append((col.get("name") or "(columna)", num))
    return out


def _clean_values(raw):
    """Return the finite float values of a raw column list (drop None/NaN/inf)."""
    if not isinstance(raw, (list, tuple)):
        return None
    vals = []
    for v in raw:
        if v is None or isinstance(v, bool):
            continue
        try:
            f = float(v)
        except (TypeError, ValueError):
            continue
        if f != f or f in (float("inf"), float("-inf")):
            continue
        vals.append(f)
    return vals


# --------------------------------------------------------------------------- #
# Per-column univariate summary.
# --------------------------------------------------------------------------- #
def _univariate_row(name, numeric, raw_vals, box_fn, detect_fn):
    """Compute one univariate summary row + boxplot inputs for a column.

    Returns a dict with the table cells and, when raw values are available, the
    exact Tukey/z counts and the list of atypical (flier) values; otherwise it
    degrades to the profile's own z-score counts and the fence flags. When no
    fence could be computed, the Tukey count stays unknown (``None``) rather
    than being reported as zero.
    """
    box = {}
    if box_fn is not None:
        try:
            box = box_fn(numeric) or {}
        except Exception:  # noqa: BLE001
            box = {}
    lf = box.get("lower_fence")
    uf = box.get("upper_fence")

    vals = _clean_values(raw_vals)
    n_tukey = pct_tukey = None
    n_z = pct_z = None
    low_extreme = high_extreme = None
    fliers = []
    contamination = None  # metric used to rank columns (prefer Tukey %).

    if vals:
        n = len(vals)
        tukey_out = []
        for v in vals:
            # Track each tail separately so the ↓/↑ extremes report the lowest
            # value under the lower fence and the highest value above the
            # upper fence, not the min/max of the pooled atypical values.
            if lf is not None and v < lf:
                tukey_out.append(v)
                low_extreme = v if low_extreme is None else min(low_extreme, v)
            elif uf is not None and v > uf:
                tukey_out.append(v)
                high_extreme = v if high_extreme is None else max(high_extreme, v)
        if lf is not None or uf is not None:
            n_tukey = len(tukey_out)
            pct_tukey = 100.0 * n_tukey / n if n else None
            fliers = tukey_out[:_MAX_FLIERS]
        # z-score rule via the registry function (returns parallel bools).
        if detect_fn is not None:
            try:
                flags = detect_fn(vals, _Z_THRESH) or []
                n_z = int(sum(1 for b in flags if b))
                pct_z = 100.0 * n_z / n if n else None
            except Exception:  # noqa: BLE001
                n_z = pct_z = None
        # Rank by Tukey % when the fences exist, else by z %.
        contamination = pct_tukey if pct_tukey is not None else pct_z
    else:
        # Degrade: no raw sample for this column. The profile's own outlier
        # count/pct come from the z-score block (build_boxplot_stats note); the
        # Tukey count is unknown, only the fence flags are.
        n_z = numeric.get("n_outliers")
        pct_z = numeric.get("outlier_pct")
        if box.get("has_low_outliers") and box.get("min") is not None:
            low_extreme = box.get("min")
        if box.get("has_high_outliers") and box.get("max") is not None:
            high_extreme = box.get("max")
        contamination = pct_z if isinstance(pct_z, (int, float)) else None

    # Compact "extremos atípicos" cell: down/up arrows for the low/high tail.
    extremes = []
    if low_extreme is not None:
        extremes.append(f"↓ {_fmt_num(low_extreme)}")
    if high_extreme is not None:
        extremes.append(f"↑ {_fmt_num(high_extreme)}")
    extremes_cell = " ".join(extremes) if extremes else "—"

    return {
        "name": model._safe_str(name),
        "n_tukey": n_tukey,
        "pct_tukey": pct_tukey,
        "n_z": n_z,
        "pct_z": pct_z,
        "lower_fence": lf,
        "upper_fence": uf,
        "extremes": extremes_cell,
        "box": box,
        "fliers": fliers,
        "has_raw": bool(vals),
        "contamination": contamination if isinstance(contamination, (int, float)) else -1.0,
    }


def _univariate_table(rows: list) -> model.DataTable:
    header = ["Columna", "Atípicos Tukey", "% Tukey", "Atípicos z", "% z",
              "Valla inf.", "Valla sup.", "Extremos atípicos"]
    table_rows = []
    for r in rows:
        table_rows.append([
            r["name"],
            _fmt_int(r["n_tukey"]) if r["n_tukey"] is not None else "—",
            _fmt_pct(r["pct_tukey"]) if r["pct_tukey"] is not None else "—",
            _fmt_int(r["n_z"]) if r["n_z"] is not None else "—",
            _fmt_pct(r["pct_z"]) if r["pct_z"] is not None else "—",
            _fmt_num(r["lower_fence"]),
            _fmt_num(r["upper_fence"]),
            r["extremes"],
        ])
    return model.DataTable(
        header=header, rows=table_rows,
        title="Valores atípicos por columna",
        note="Tukey = fuera de las vallas 1,5·IQR · z = |z-score| > 3 · "
             "ordenado de más a menos contaminada")


# --------------------------------------------------------------------------- #
# Multivariate (Isolation Forest) section.
# --------------------------------------------------------------------------- #
def _resolve_multivariate(profile: dict, ctx: dict, raw_numeric):
    """Return (outliers_dict_or_None, source).

    Prefers a LIVE Isolation Forest over ``raw_numeric`` so the detector and
    ``summarize_outlier_dims`` use EXACTLY the same numeric columns and the same
    valid-row indexing; otherwise the precomputed ``profile['models']
    ['outliers']`` (run by MODELOS over a possibly different column subset) would
    yield ``row_index`` values that no longer point at the rows
    ``summarize_outlier_dims`` reconstructs, mislabelling the "dimensions that
    make each row rare". Falls back to the precomputed block when no raw sample
    is available (e.g. the lite preset drops ``raw_numeric``).
    """
    if _is_dict(raw_numeric) and raw_numeric:
        iso = _load_isolation_forest()
        if iso is not None:
            try:
                out = iso(raw_numeric)
                if _is_dict(out) and out.get("n_outliers") is not None and out.get("n_rows_used"):
                    return out, "live"
            except Exception:  # noqa: BLE001
                pass
    # Fallback: the model the MODELOS chapter already computed (no raw sample to
    # recompute against, so no per-row dimension breakdown either).
    models = profile.get("models") if _is_dict(profile.get("models")) else {}
    pre = models.get("outliers") if _is_dict(models) else None
    if _is_dict(pre) and pre.get("n_outliers") is not None and pre.get("n_rows_used"):
        return pre, "precomputed"
    return None, "none"


def _multivariate_blocks(outliers: dict, raw_numeric, mark: bool) -> list:
    isof = _term(mark, "isolation_forest", "**Isolation Forest**")
    blocks = [
        model.Heading(text="Filas atípicas (multivariante)", level=2),
        model.Markdown(text=(
            f"Hasta aquí cada columna se ha mirado por separado. {isof} busca "
            "filas raras considerando **todas las columnas a la vez**: una fila "
            "puede ser normal en cada variable y aun así ser atípica por la "
            "**combinación** de sus valores (p. ej. una edad baja con una tarifa "
            "muy alta). La tabla resume cuántas filas se marcaron y el umbral de "
            "decisión.")),
        model.KVTable(rows=[
            ("Filas analizadas", _fmt_int(outliers.get("n_rows_used"))),
            ("Columnas consideradas", _fmt_int(outliers.get("n_features"))),
            ("Filas atípicas", _fmt_int(outliers.get("n_outliers"))),
            ("% filas atípicas", _fmt_pct(outliers.get("outlier_pct"))),
            ("Umbral de decisión", _fmt_num(outliers.get("threshold"), 4)),
        ], title="Anomalías multivariantes"),
    ]

    rows_in = outliers.get("outlier_rows")
    # Anything that is not a non-empty list/tuple cannot be sliced safely below.
    if not isinstance(rows_in, (list, tuple)) or not rows_in:
        return blocks

    # Enrich each anomalous row with the dimensions that make it rare, when the
    # raw sample is available (summarize_outlier_dims reconstructs the same
    # valid-row indexing as isolation_forest_outliers).
    dims_by_row = {}
    if _is_dict(raw_numeric) and raw_numeric:
        summ = _load_summarize_dims()
        if summ is not None:
            try:
                enriched = summ(raw_numeric, rows_in, top_k=3) or []
                for e in enriched:
                    if _is_dict(e) and e.get("row_index") is not None:
                        dims_by_row[e.get("row_index")] = e.get("dims") or []
            except Exception:  # noqa: BLE001
                dims_by_row = {}

    has_dims = bool(dims_by_row)
    header = ["Fila (entre válidas)", "Score"]
    if has_dims:
        header.append("Dimensiones que la hacen rara (col = valor, z)")
    table_rows = []
    for r in rows_in[:_TOP_ROWS]:
        if not _is_dict(r):
            continue
        ridx = r.get("row_index")
        cells = [_fmt_int(ridx), _fmt_num(r.get("score"), 4)]
        if has_dims:
            dims = dims_by_row.get(ridx) or []
            parts = []
            for d in dims:
                if not _is_dict(d):
                    continue
                parts.append(
                    f"{model._safe_str(d.get('col'))} = {_fmt_num(d.get('value'))} "
                    f"(z {_fmt_num(d.get('z'), 2)})")
            cells.append("; ".join(parts) if parts else "—")
        table_rows.append(cells)

    if table_rows:
        shown = len(table_rows)
        total = outliers.get("n_outliers")
        note = "las filas más anómalas primero (score más bajo = más rara)"
        if isinstance(total, int) and total > shown:
            note += f" — top {shown} de {total}"
        if not has_dims:
            note += (" · no se pudo recuperar la muestra cruda para explicar las "
                     "dimensiones de cada fila")
        blocks.append(model.DataTable(
            header=header, rows=table_rows,
            title="Filas más atípicas", note=note))
    return blocks


# --------------------------------------------------------------------------- #
# Interpretation section.
# --------------------------------------------------------------------------- #
def _interpretation_block(mark: bool) -> model.Markdown:
    outlier = _term(mark, "outlier", "atípico")
    text = (
        f"**Un {outlier} no es necesariamente un error.** Conviene distinguir "
        "dos casos antes de actuar:\n\n"
        "- **Error de dato** (medida, registro o unidad equivocada): una edad de "
        "200 años, un importe negativo donde no puede haberlo, un decimal "
        "desplazado. Estos sí se corrigen o se eliminan, idealmente en el origen.\n"
        "- **Dato real extremo**: una observación legítima de la cola de la "
        "distribución (un cliente que gasta mucho más, una tarifa de lujo, un día "
        "de ventas excepcional). Borrarla sesga el análisis y oculta información "
        "valiosa.\n\n"
        "**Qué hacer.** Primero, **revisar** los valores señalados arriba contra "
        "su origen para decidir cuál de los dos casos es. Si son errores, "
        "corregirlos. Si son datos reales que distorsionan medias y modelos, hay "
        "alternativas a borrarlos: **winsorizar** (recortar los extremos a un "
        "percentil), o **re-expresar** la variable (por ejemplo una "
        "transformación logarítmica o la escalera de re-expresión de Tukey que "
        "este mismo perfil ya calcula para las columnas asimétricas), que suele "
        "domar la cola sin perder ninguna fila. La elección depende del objetivo: "
        "esta lectura es **exploratoria** —orienta dónde mirar—, no una regla "
        "automática de limpieza.")
    return model.Markdown(text=text)


# --------------------------------------------------------------------------- #
# Entry point.
# --------------------------------------------------------------------------- #
def build_outliers(profile: dict, ctx: dict):
    """Build the OUTLIERS Chapter, or None if the dataset has no numeric column."""
    profile = profile or {}
    # A truthy non-dict ctx would make the ``ctx.get`` calls below raise, which
    # would break the never-raise contract; treat it as an empty context.
    ctx = ctx if isinstance(ctx, dict) else {}
    if not isinstance(profile, dict):
        return None

    numerics = _numeric_columns(profile)
    if not numerics:
        return None  # chapter does not apply to a dataset with no numerics.

    # Register glossary terms (if a collector is present) and mark them clickable.
    glossary = ctx.get("glossary")
    mark = False
    if isinstance(glossary, model.GlossaryCollector):
        for key, (label, definition) in _TERM_DEFS.items():
            glossary.add(key, label, definition)
        mark = True

    raw_numeric = ctx.get("raw_numeric")
    raw_numeric = raw_numeric if isinstance(raw_numeric, dict) else {}

    box_fn = _load_build_boxplot_stats()
    detect_fn = _load_detect_outliers()

    # --- Univariate summary ------------------------------------------------- #
    uni_rows = []
    for name, numeric in numerics:
        uni_rows.append(_univariate_row(
            name, numeric, raw_numeric.get(name), box_fn, detect_fn))
    # Rank columns by contamination (Tukey % when available, else z %).
    uni_rows.sort(key=lambda r: r.get("contamination", -1.0), reverse=True)

    intro = (
        "Este capítulo reúne en un solo sitio el análisis de los **valores "
        "atípicos** de la tabla, que en el resto del informe aparecen dispersos. "
        f"Un {_term(mark, 'outlier', 'atípico')} es una observación que se aparta "
        "mucho del grueso de los datos. Cada columna numérica se evalúa con dos "
        f"criterios complementarios: las {_term(mark, 'tukey_fence', 'vallas de Tukey')} "
        "(fuera de P25−1,5·IQR o P75+1,5·IQR, robusto a la propia cola) y el "
        f"{_term(mark, 'zscore', 'z-score')} (|z| > 3, sensible a la media). La "
        "tabla está ordenada de la columna más contaminada a la menos.")

    blocks = [
        model.Heading(text=CHAPTER_TITLE, level=1),
        model.Markdown(text=intro),
        _univariate_table(uni_rows),
    ]

    # Flag the most contaminated columns explicitly.
    flagged = [r["name"] for r in uni_rows
               if r.get("contamination", -1.0) > 0][:_TOP_FLAGGED]
    if flagged:
        names = ", ".join(f"**{n}**" for n in flagged)
        blocks.append(model.Markdown(text=(
            f"Las columnas con mayor proporción de atípicos son {names}: "
            "concentran el grueso de los valores fuera de las vallas y son las "
            "primeras a revisar.")))

    # --- Boxplots figure ---------------------------------------------------- #
    box_entries = [
        {"name": r["name"], "box": r["box"], "fliers": r.get("fliers")}
        for r in uni_rows
        if r.get("box")
    ][:_TOP_BOX]
    if box_entries:
        def _boxplots_make(entries=box_entries):
            try:
                from datascience.build_boxplots_figure import build_boxplots_figure
                return build_boxplots_figure(
                    entries, title="Boxplots de Tukey por columna",
                    max_boxes=_TOP_BOX)
            except Exception:  # noqa: BLE001 (minimal fallback figure)
                import matplotlib
                matplotlib.use("Agg")
                from matplotlib.figure import Figure
                fig = Figure(figsize=(5.0, 2.2))
                ax = fig.add_subplot(111)
                ax.text(0.5, 0.5, "(boxplots no disponibles)",
                        ha="center", va="center")
                ax.axis("off")
                return fig

        blocks.append(model.Group(blocks=[
            model.Heading(text="Boxplots", level=2),
            model.Markdown(text=(
                "Cada caja abarca del primer al tercer cuartil (P25–P75), la línea "
                "interior es la mediana y los bigotes llegan hasta 1,5·IQR; los "
                "puntos son los valores que caen fuera de las vallas (atípicos por "
                "Tukey).")),
            model.Figure(
                make=_boxplots_make,
                caption="Boxplots de Tukey de las columnas más contaminadas."),
        ]))

    # --- Multivariate ------------------------------------------------------- #
    outliers, _src = _resolve_multivariate(profile, ctx, raw_numeric)
    if outliers is not None:
        blocks.extend(_multivariate_blocks(outliers, raw_numeric, mark))
    else:
        blocks.append(model.Heading(text="Filas atípicas (multivariante)", level=2))
        blocks.append(model.Note(
            "No se pudo analizar la anomalía multivariante: hacen falta al menos "
            "dos columnas numéricas y la muestra cruda (o los modelos del perfil) "
            "para correr Isolation Forest."))

    # --- Interpretation ----------------------------------------------------- #
    blocks.append(model.Heading(text="Cómo interpretar los atípicos", level=2))
    blocks.append(_interpretation_block(mark))

    return model.Chapter(id=CHAPTER_ID, title=CHAPTER_TITLE,
                         version=CHAPTER_VERSION, blocks=blocks)