merge(eda): OUTLIERS chapter - univariate (Tukey/z) + multivariate (IsolationForest)

2026-06-30 21:15:05 +02:00
9 changed files with 1698 additions and 0 deletions
@@ -0,0 +1,593 @@
"""Outliers chapter (OUTLIERS) — univariate + multivariate atypical values.

Today the analysis of atypical values is scattered across the document: the
NUM DISTR chapter mentions the per-column outlier count inside each distribution
figure, and the MODELOS chapter runs Isolation Forest as one of several cheap
models. This chapter gathers and deepens the whole outlier story in a single
place, with its interpretation: an [[term:outlier]]outlier[[/term]] is **not
necessarily an error** (it can be a legitimate, extreme but real observation),
so the reading is exploratory (what to look at), never confirmatory (what to
delete).

Sections, in order:

1. **Resumen univariante por columna** (per-column univariate summary): for
   every numeric column, the number and percentage of atypical values by two
   complementary criteria, Tukey's 1.5·IQR rule
   ([[term:tukey_fence]]vallas de Tukey[[/term]]) and the
   [[term:zscore]]z-score[[/term]] rule (|z| > 3). The most contaminated
   columns are flagged. The fences come from the pure registry function
   ``build_boxplot_stats`` (derived from the profile percentiles). The
   per-column counts use the raw sample in ``ctx['raw_numeric']`` when
   available (the exact count) and degrade to the profile's own z-score counts
   otherwise.
2. **Boxplots**: a single figure with the Tukey boxplots of the most
   contaminated columns (box, whiskers and atypical points), delegated to the
   reusable registry helper ``build_boxplots_figure``.
3. **Multivariante (filas anómalas)** (multivariate, anomalous rows): rows that
   are atypical considering ALL columns at once, via the registry function
   ``isolation_forest_outliers``. It reports the count and percentage of
   anomalous rows, the most anomalous rows with their score, and the
   dimensions that make each one rare (top columns by |z|, via
   ``summarize_outlier_dims``). It runs live on ``ctx['raw_numeric']`` (the
   same numeric columns ``summarize_outlier_dims`` uses, so the row indexing
   stays coherent and the dimension breakdown is correct) and falls back to
   the precomputed ``profile['models']['outliers']`` only when no raw sample
   is available (e.g. the lite preset), in which case no per-row breakdown is
   shown.
4. **Interpretación** (interpretation): outlier ≠ error. How to tell a
   data-entry error from a genuine extreme value, and what to do (inspect,
   winsorize, or re-express, linking to the Tukey re-expression the profile
   already computes).

The chapter activates whenever the table has at least one numeric column; with
no numeric column it returns ``None`` and disappears from the document.

It reads everything defensively (``.get``) and never raises: every registry
delegation is imported lazily and degraded to an honest note on any failure.

Contract: build_<id>(profile, ctx) -> Chapter | None ; CHAPTER_VERSION = "x.y.z".
"""
from __future__ import annotations
from .. import model
CHAPTER_VERSION = "1.0.0"
CHAPTER_ID = "outliers"
CHAPTER_TITLE = "Valores atípicos"
# z-score threshold for the univariate z rule: |z| > 3 flags a value ~3 standard
# deviations from the mean (≈99.7% of a normal distribution lies within ±3σ).
_Z_THRESH = 3.0
# How many columns to draw in the boxplots figure (most contaminated first) and
# how many anomalous rows to list in the multivariate table.
_TOP_BOX = 12
_TOP_ROWS = 12
# Cap on the raw atypical values passed as boxplot fliers, so a heavy-tailed
# column does not flood the figure with thousands of points.
_MAX_FLIERS = 200
# How many columns flagged as "most contaminated" in the summary note.
_TOP_FLAGGED = 3
# Glossary terms this chapter explains (contract §11.1). Registered in the shared
# collector and marked clickable on first appearance. ``isolation_forest`` and
# ``zscore`` may also be registered by the MODELOS chapter — ``add`` is
# idempotent (first definition wins), so registering them here is harmless and
# keeps this chapter self-contained when MODELOS does not render.
_TERM_DEFS = {
    "outlier": (
        "Valor atípico (outlier)",
        "Una observación que se aparta mucho del grueso de los datos. Un atípico "
        "NO es necesariamente un error: puede ser un fallo de medida o de "
        "registro, pero también un dato real extremo (un cliente que gasta diez "
        "veces la media, un día de ventas excepcional). Por eso se señalan para "
        "revisarlos, no para borrarlos automáticamente.",
    ),
    "tukey_fence": (
        "Vallas de Tukey (1,5·IQR)",
        "Regla clásica para marcar atípicos a partir de los cuartiles: se calcula "
        "el rango intercuartílico IQR = P75 − P25 y se trazan dos vallas, una "
        "inferior en P25 − 1,5·IQR y otra superior en P75 + 1,5·IQR. Los valores "
        "que caen fuera de esas vallas se consideran atípicos. Es robusta porque "
        "se apoya en la mediana y los cuartiles, no en la media.",
    ),
    "zscore": (
        "z-score (puntuación típica)",
        "Mide a cuántas desviaciones típicas está un valor de la media de su "
        "columna: z = (valor − media) / desviación típica. Un |z| grande (aquí > "
        "3) señala un valor alejado del centro. A diferencia de las vallas de "
        "Tukey, el z-score usa media y desviación, así que es más sensible a la "
        "presencia de los propios atípicos.",
    ),
    "isolation_forest": (
        "Isolation Forest (anomalías multivariantes)",
        "Algoritmo de detección de anomalías que considera TODAS las columnas a "
        "la vez: construye árboles que parten el espacio con cortes aleatorios y "
        "mide cuántos cortes hacen falta para aislar cada fila. Las filas raras "
        "se aíslan con muy pocos cortes y se marcan como atípicas según un umbral "
        "de contaminación. Detecta combinaciones de valores poco frecuentes que "
        "ninguna columna por separado revelaría.",
    ),
}
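# Illustration only, not part of the chapter: a minimal sketch of the two
# univariate rules described in the glossary entries above. The helper names
# (tukey_fences, tukey_outliers, zscore_outliers) are hypothetical, and the
# quartiles come from statistics.quantiles(method="inclusive"), which is an
# assumption: the chapter takes its fences from build_boxplot_stats, whose
# percentile method may differ.

```python
import statistics


def tukey_fences(values, k=1.5):
    """Return (lower, upper) fences: P25 - k*IQR and P75 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def tukey_outliers(values, k=1.5):
    """Values strictly outside the Tukey fences."""
    lower, upper = tukey_fences(values, k)
    return [v for v in values if v < lower or v > upper]


def zscore_outliers(values, threshold=3.0):
    """Values whose |z| exceeds the threshold (population std)."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return []  # constant column: no value can be flagged
    return [v for v in values if abs((v - mean) / std) > threshold]


# One extreme value: both rules flag it.
single = [float(v) for v in range(10, 21)] + [500.0]
print(tukey_outliers(single), zscore_outliers(single))   # [500.0] [500.0]

# Two extreme values inflate the mean and the std, so the z rule misses
# them ("masking") while the quartile-based fences still catch both.
masked = [float(v) for v in range(10, 20)] + [500.0, 500.0]
print(tukey_outliers(masked), zscore_outliers(masked))   # [500.0, 500.0] []
```

# The second sample shows why the glossary calls the z-score "more sensitive
# to the outliers themselves", and why the two criteria are reported together.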
# --------------------------------------------------------------------------- #
# Lazy registry delegations (each degrades to None / no-op on any failure).
# --------------------------------------------------------------------------- #
def _load_build_boxplot_stats():
    try:
        from datascience.build_boxplot_stats import build_boxplot_stats
        return build_boxplot_stats
    except Exception:  # noqa: BLE001
        return None


def _load_detect_outliers():
    # detect_outliers lives in the monolithic ``datascience.datascience`` module
    # (file datascience.py), not in its own submodule, so try both import shapes.
    try:
        from datascience.datascience import detect_outliers
        return detect_outliers
    except Exception:  # noqa: BLE001
        try:
            from datascience import detect_outliers
            return detect_outliers
        except Exception:  # noqa: BLE001
            return None


def _load_isolation_forest():
    try:
        from datascience.isolation_forest_outliers import isolation_forest_outliers
        return isolation_forest_outliers
    except Exception:  # noqa: BLE001
        return None


def _load_summarize_dims():
    try:
        from datascience.summarize_outlier_dims import summarize_outlier_dims
        return summarize_outlier_dims
    except Exception:  # noqa: BLE001
        return None
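# Illustration only: the four loaders above share one pattern (import on
# demand, return None on any failure). A generic version could look like the
# sketch below. ``lazy_attr`` is a hypothetical helper, not something this
# commit defines; the explicit per-function loaders may well be preferred for
# readability and for the two-shape fallback of detect_outliers.

```python
import importlib


def lazy_attr(module_name, attr_name):
    """Import ``module_name`` on demand and return ``attr_name`` from it.

    Returns None when the module is missing, fails to import, or lacks the
    attribute, so the caller can degrade instead of raising.
    """
    try:
        module = importlib.import_module(module_name)
        return getattr(module, attr_name)
    except Exception:  # noqa: BLE001 (deliberately broad, mirrors the loaders)
        return None


sqrt = lazy_attr("math", "sqrt")
missing = lazy_attr("module_that_does_not_exist_xyz", "anything")
print(sqrt is not None, missing is None)
```

# Callers then branch on ``is not None`` exactly as the chapter does with
# box_fn and detect_fn.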
# --------------------------------------------------------------------------- #
# Defensive formatters (own copy: the chapter never imports siblings).
# --------------------------------------------------------------------------- #
def _fmt_num(value, decimals: int = 3) -> str:
if value is None:
return ""
if isinstance(value, bool):
return "" if value else "no"
if isinstance(value, int):
return f"{value:,}".replace(",", ".")
if isinstance(value, float):
if value != value: # NaN
return ""
if value in (float("inf"), float("-inf")):
return str(value)
text = f"{value:.{decimals}f}".rstrip("0").rstrip(".")
return text if text else "0"
return model._safe_str(value)
def _fmt_int(value) -> str:
if value is None:
return ""
try:
return f"{int(round(float(value))):,}".replace(",", ".")
except (TypeError, ValueError):
return model._safe_str(value)
def _fmt_pct(value, decimals: int = 2) -> str:
"""Format an already-0-100 value as a percentage. None -> placeholder."""
if value is None:
return ""
try:
return f"{float(value):.{decimals}f}%"
except (TypeError, ValueError):
return model._safe_str(value)
def _term(mark: bool, key: str, text: str) -> str:
return f"[[term:{key}]]{text}[[/term]]" if mark else text
def _is_dict(v) -> bool:
return isinstance(v, dict)
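# Illustration only: the formatting idioms used by the helpers above, isolated
# so their behavior is easy to check. ``dotted_thousands`` and
# ``trimmed_decimal`` are hypothetical names that restate the expressions in
# _fmt_num and _fmt_int; they are not additional API.

```python
def dotted_thousands(n: int) -> str:
    """Group thousands with dots (Spanish style) instead of commas."""
    return f"{n:,}".replace(",", ".")


def trimmed_decimal(x: float, decimals: int = 3) -> str:
    """Fixed-point text with trailing zeros (and a dangling dot) removed."""
    text = f"{x:.{decimals}f}".rstrip("0").rstrip(".")
    return text if text else "0"


def is_nan(x: float) -> bool:
    """NaN is the only float that differs from itself."""
    return x != x


print(dotted_thousands(1234567))   # 1.234.567
print(trimmed_decimal(3.14))       # 3.14
print(trimmed_decimal(2.0))        # 2
print(is_nan(float("nan")))        # True
```

# Caveat: with decimals=0 there is no decimal point, so rstrip("0") would eat
# significant zeros (100.0 becomes "1"). The chapter only calls _fmt_num with
# 2, 3 or 4 decimals, so the issue does not surface here.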
# --------------------------------------------------------------------------- #
# Profile reads.
# --------------------------------------------------------------------------- #
def _numeric_columns(profile: dict) -> list:
"""Return [(name, numeric_dict)] for numeric columns with usable stats."""
out = []
for col in profile.get("columns") or []:
if not isinstance(col, dict):
continue
if col.get("inferred_type") != "numeric":
continue
num = col.get("numeric")
if not isinstance(num, dict) or not num:
continue
if num.get("mean") is None and num.get("median") is None:
continue
out.append((col.get("name") or "(columna)", num))
return out
def _clean_values(raw):
"""Return the finite float values of a raw column list (drop None/NaN/inf)."""
if not isinstance(raw, (list, tuple)):
return None
vals = []
for v in raw:
if v is None or isinstance(v, bool):
continue
try:
f = float(v)
except (TypeError, ValueError):
continue
if f != f or f in (float("inf"), float("-inf")):
continue
vals.append(f)
return vals
# --------------------------------------------------------------------------- #
# Per-column univariate summary.
# --------------------------------------------------------------------------- #
def _univariate_row(name, numeric, raw_vals, box_fn, detect_fn):
    """Compute one univariate summary row + boxplot inputs for a column.

    Returns a dict with the table cells and, when raw values are available, the
    exact Tukey/z counts and the list of atypical (flier) values; otherwise it
    degrades to the profile's own z-score counts and the fence flags.
    """
    box = {}
    if box_fn is not None:
        try:
            box = box_fn(numeric) or {}
        except Exception:  # noqa: BLE001
            box = {}
    lf = box.get("lower_fence")
    uf = box.get("upper_fence")
    vals = _clean_values(raw_vals)
    n_tukey = pct_tukey = None
    n_z = pct_z = None
    low_extreme = high_extreme = None
    fliers = []
    contamination = None  # metric used to rank columns (prefer Tukey %).
    if vals:
        n = len(vals)
        # Keep the two tails apart so a column with only high outliers does not
        # report its smallest high outlier as a "low" extreme.
        lows = [v for v in vals if lf is not None and v < lf]
        highs = [v for v in vals if uf is not None and v > uf]
        tukey_out = [v for v in vals
                     if (lf is not None and v < lf) or (uf is not None and v > uf)]
        n_tukey = len(tukey_out)
        pct_tukey = 100.0 * n_tukey / n if n else None
        if lows:
            low_extreme = min(lows)
        if highs:
            high_extreme = max(highs)
        fliers = tukey_out[:_MAX_FLIERS]
        # z-score rule via the registry function (returns parallel bools).
        if detect_fn is not None:
            try:
                flags = detect_fn(vals, _Z_THRESH) or []
                n_z = int(sum(1 for b in flags if b))
                pct_z = 100.0 * n_z / n if n else None
            except Exception:  # noqa: BLE001
                n_z = pct_z = None
        contamination = pct_tukey
    else:
        # Degrade: no raw sample for this column. The profile's own outlier
        # count/pct come from the z-score block (build_boxplot_stats note); the
        # Tukey count is unknown, only the fence flags are.
        n_z = numeric.get("n_outliers")
        pct_z = numeric.get("outlier_pct")
        if box.get("has_low_outliers") and box.get("min") is not None:
            low_extreme = box.get("min")
        if box.get("has_high_outliers") and box.get("max") is not None:
            high_extreme = box.get("max")
        contamination = pct_z if isinstance(pct_z, (int, float)) else None
    # Compact "extremos atípicos" cell: down/up arrows for the low/high tail.
    extremes = []
    if low_extreme is not None:
        extremes.append(f"↓{_fmt_num(low_extreme)}")
    if high_extreme is not None:
        extremes.append(f"↑{_fmt_num(high_extreme)}")
    extremes_cell = " ".join(extremes) if extremes else ""
    return {
        "name": model._safe_str(name),
        "n_tukey": n_tukey,
        "pct_tukey": pct_tukey,
        "n_z": n_z,
        "pct_z": pct_z,
        "lower_fence": lf,
        "upper_fence": uf,
        "extremes": extremes_cell,
        "box": box,
        "fliers": fliers,
        "has_raw": bool(vals),
        "contamination": contamination if isinstance(contamination, (int, float)) else -1.0,
    }
def _univariate_table(rows: list) -> model.DataTable:
header = ["Columna", "Atípicos Tukey", "% Tukey", "Atípicos z", "% z",
"Valla inf.", "Valla sup.", "Extremos atípicos"]
table_rows = []
for r in rows:
table_rows.append([
r["name"],
_fmt_int(r["n_tukey"]) if r["n_tukey"] is not None else "",
_fmt_pct(r["pct_tukey"]) if r["pct_tukey"] is not None else "",
_fmt_int(r["n_z"]) if r["n_z"] is not None else "",
_fmt_pct(r["pct_z"]) if r["pct_z"] is not None else "",
_fmt_num(r["lower_fence"]),
_fmt_num(r["upper_fence"]),
r["extremes"],
])
return model.DataTable(
header=header, rows=table_rows,
title="Valores atípicos por columna",
note="Tukey = fuera de las vallas 1,5·IQR · z = |z-score| > 3 · "
"ordenado de más a menos contaminada")
# --------------------------------------------------------------------------- #
# Multivariate (Isolation Forest) section.
# --------------------------------------------------------------------------- #
def _resolve_multivariate(profile: dict, ctx: dict, raw_numeric):
"""Return (outliers_dict_or_None, source).
Prefers a LIVE Isolation Forest over ``raw_numeric`` so the detector and
``summarize_outlier_dims`` use EXACTLY the same numeric columns and the same
valid-row indexing — otherwise the precomputed ``profile['models']
['outliers']`` (run by MODELOS over a possibly different column subset) would
yield ``row_index`` values that no longer point at the rows
``summarize_outlier_dims`` reconstructs, mislabelling the "dimensions that
make each row rare". Falls back to the precomputed block when no raw sample
is available (e.g. the lite preset drops ``raw_numeric``)."""
if _is_dict(raw_numeric) and raw_numeric:
iso = _load_isolation_forest()
if iso is not None:
try:
out = iso(raw_numeric)
if _is_dict(out) and out.get("n_outliers") is not None and out.get("n_rows_used"):
return out, "live"
except Exception: # noqa: BLE001
pass
# Fallback: the model the MODELOS chapter already computed (no raw sample to
# recompute against, so no per-row dimension breakdown either).
models = profile.get("models") if _is_dict(profile.get("models")) else {}
pre = models.get("outliers") if _is_dict(models) else None
if _is_dict(pre) and pre.get("n_outliers") is not None and pre.get("n_rows_used"):
return pre, "precomputed"
return None, "none"
def _multivariate_blocks(outliers: dict, raw_numeric, mark: bool) -> list:
isof = _term(mark, "isolation_forest", "**Isolation Forest**")
blocks = [
model.Heading(text="Filas atípicas (multivariante)", level=2),
model.Markdown(text=(
f"Hasta aquí cada columna se ha mirado por separado. {isof} busca "
"filas raras considerando **todas las columnas a la vez**: una fila "
"puede ser normal en cada variable y aun así ser atípica por la "
"**combinación** de sus valores (p. ej. una edad baja con una tarifa "
"muy alta). La tabla resume cuántas filas se marcaron y el umbral de "
"decisión.")),
model.KVTable(rows=[
("Filas analizadas", _fmt_int(outliers.get("n_rows_used"))),
("Columnas consideradas", _fmt_int(outliers.get("n_features"))),
("Filas atípicas", _fmt_int(outliers.get("n_outliers"))),
("% filas atípicas", _fmt_pct(outliers.get("outlier_pct"))),
("Umbral de decisión", _fmt_num(outliers.get("threshold"), 4)),
], title="Anomalías multivariantes"),
]
rows_in = outliers.get("outlier_rows") or []
if not rows_in:
return blocks
# Enrich each anomalous row with the dimensions that make it rare, when the
# raw sample is available (summarize_outlier_dims reconstructs the same
# valid-row indexing as isolation_forest_outliers).
dims_by_row = {}
if _is_dict(raw_numeric) and raw_numeric:
summ = _load_summarize_dims()
if summ is not None:
try:
enriched = summ(raw_numeric, rows_in, top_k=3) or []
for e in enriched:
if _is_dict(e) and e.get("row_index") is not None:
dims_by_row[e.get("row_index")] = e.get("dims") or []
except Exception: # noqa: BLE001
dims_by_row = {}
has_dims = bool(dims_by_row)
header = ["Fila (entre válidas)", "Score"]
if has_dims:
header.append("Dimensiones que la hacen rara (col = valor, z)")
table_rows = []
for r in rows_in[:_TOP_ROWS]:
if not _is_dict(r):
continue
ridx = r.get("row_index")
cells = [_fmt_int(ridx), _fmt_num(r.get("score"), 4)]
if has_dims:
dims = dims_by_row.get(ridx) or []
parts = []
for d in dims:
if not _is_dict(d):
continue
parts.append(
f"{model._safe_str(d.get('col'))} = {_fmt_num(d.get('value'))} "
f"(z {_fmt_num(d.get('z'), 2)})")
cells.append("; ".join(parts) if parts else "")
table_rows.append(cells)
if table_rows:
shown = len(table_rows)
total = outliers.get("n_outliers")
note = "las filas más anómalas primero (score más bajo = más rara)"
if isinstance(total, int) and total > shown:
note += f" — top {shown} de {total}"
if not has_dims:
note += (" · no se pudo recuperar la muestra cruda para explicar las "
"dimensiones de cada fila")
blocks.append(model.DataTable(
header=header, rows=table_rows,
title="Filas más atípicas", note=note))
return blocks
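# Illustration only: the "dimensions that make each row rare" idea (top
# columns by |z| for one row), sketched in pure Python. ``top_dims_by_z`` is a
# hypothetical stand-in and is NOT the registry's summarize_outlier_dims, whose
# signature and row handling may differ. Only the output keys (col, value, z)
# follow what the table code above reads.

```python
import statistics


def top_dims_by_z(columns, row_index, top_k=3):
    """Rank the columns of one row by how far its value sits from the mean.

    ``columns`` maps a column name to an equal-length list of floats.
    Constant columns (std == 0) carry no signal and are skipped.
    """
    dims = []
    for name, values in columns.items():
        mean = statistics.fmean(values)
        std = statistics.pstdev(values)
        if std == 0:
            continue
        value = values[row_index]
        dims.append({"col": name, "value": value, "z": (value - mean) / std})
    dims.sort(key=lambda d: abs(d["z"]), reverse=True)
    return dims[:top_k]


sample = {
    "age": [30.0, 31.0, 29.0, 30.0, 30.0],
    "fare": [10.0, 12.0, 11.0, 10.0, 300.0],
    "flat": [1.0, 1.0, 1.0, 1.0, 1.0],
}
for dim in top_dims_by_z(sample, row_index=4, top_k=2):
    print(f"{dim['col']} = {dim['value']} (z {dim['z']:.2f})")
```

# For the last row, "fare" ranks first because 300 is far from that column's
# mean, while "age" contributes a z of 0 and "flat" is dropped as constant.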
# --------------------------------------------------------------------------- #
# Interpretation section.
# --------------------------------------------------------------------------- #
def _interpretation_block(mark: bool) -> model.Markdown:
outlier = _term(mark, "outlier", "atípico")
text = (
f"**Un {outlier} no es necesariamente un error.** Conviene distinguir "
"dos casos antes de actuar:\n\n"
"- **Error de dato** (medida, registro o unidad equivocada): una edad de "
"200 años, un importe negativo donde no puede haberlo, un decimal "
"desplazado. Estos sí se corrigen o se eliminan, idealmente en el origen.\n"
"- **Dato real extremo**: una observación legítima de la cola de la "
"distribución (un cliente que gasta mucho más, una tarifa de lujo, un día "
"de ventas excepcional). Borrarla sesga el análisis y oculta información "
"valiosa.\n\n"
"**Qué hacer.** Primero, **revisar** los valores señalados arriba contra "
"su origen para decidir cuál de los dos casos es. Si son errores, "
"corregirlos. Si son datos reales que distorsionan medias y modelos, hay "
"alternativas a borrarlos: **winsorizar** (recortar los extremos a un "
"percentil), o **re-expresar** la variable (por ejemplo una "
"transformación logarítmica o la escalera de re-expresión de Tukey que "
"este mismo perfil ya calcula para las columnas asimétricas), que suele "
"domar la cola sin perder ninguna fila. La elección depende del objetivo: "
"esta lectura es **exploratoria** —orienta dónde mirar—, no una regla "
"automática de limpieza.")
return model.Markdown(text=text)
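# Illustration only: the two alternatives to deletion named in the
# interpretation text (winsorizing and a logarithmic re-expression), as a
# minimal sketch. ``winsorize`` and ``log_reexpress`` are hypothetical helpers,
# the 5th/95th percentile cut is an arbitrary assumption, and this is not the
# Tukey re-expression ladder the profile computes.

```python
import math
import statistics


def winsorize(values, lower_pct=5, upper_pct=95):
    """Clip values to the given percentiles; no row is dropped."""
    cuts = statistics.quantiles(values, n=100, method="inclusive")
    low, high = cuts[lower_pct - 1], cuts[upper_pct - 1]
    return [min(max(v, low), high) for v in values]


def log_reexpress(values):
    """log(1 + v): compresses a long right tail. Requires v > -1."""
    return [math.log1p(v) for v in values]


values = [float(v) for v in range(1, 100)] + [1000.0]
clipped = winsorize(values)
print(len(clipped) == len(values), max(clipped) < 1000.0)   # True True

logged = log_reexpress(values)
print(max(values) / min(values) > max(logged) / min(logged))  # True
```

# Both transforms keep every row: winsorizing caps the extremes, while the log
# keeps their order but shrinks the gap between the bulk and the tail.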
# --------------------------------------------------------------------------- #
# Entry point.
# --------------------------------------------------------------------------- #
def build_outliers(profile: dict, ctx: dict):
"""Build the OUTLIERS Chapter, or None if the dataset has no numeric column."""
profile = profile or {}
ctx = ctx or {}
if not isinstance(profile, dict):
return None
numerics = _numeric_columns(profile)
if not numerics:
return None # chapter does not apply to a dataset with no numerics.
# Register glossary terms (if a collector is present) and mark them clickable.
glossary = ctx.get("glossary")
mark = False
if isinstance(glossary, model.GlossaryCollector):
for key, (label, definition) in _TERM_DEFS.items():
glossary.add(key, label, definition)
mark = True
raw_numeric = ctx.get("raw_numeric")
raw_numeric = raw_numeric if isinstance(raw_numeric, dict) else {}
box_fn = _load_build_boxplot_stats()
detect_fn = _load_detect_outliers()
# --- Univariate summary ------------------------------------------------- #
uni_rows = []
for name, numeric in numerics:
uni_rows.append(_univariate_row(
name, numeric, raw_numeric.get(name), box_fn, detect_fn))
# Rank columns by contamination (Tukey % when available, else z %).
uni_rows.sort(key=lambda r: r.get("contamination", -1.0), reverse=True)
intro = (
"Este capítulo reúne en un solo sitio el análisis de los **valores "
"atípicos** de la tabla, que en el resto del informe aparecen dispersos. "
f"Un {_term(mark, 'outlier', 'atípico')} es una observación que se aparta "
"mucho del grueso de los datos. Cada columna numérica se evalúa con dos "
f"criterios complementarios: las {_term(mark, 'tukey_fence', 'vallas de Tukey')} "
"(fuera de P25−1,5·IQR o P75+1,5·IQR, robusto a la propia cola) y el "
f"{_term(mark, 'zscore', 'z-score')} (|z| > 3, sensible a la media). La "
"tabla está ordenada de la columna más contaminada a la menos.")
blocks = [
model.Heading(text=CHAPTER_TITLE, level=1),
model.Markdown(text=intro),
_univariate_table(uni_rows),
]
# Flag the most contaminated columns explicitly.
flagged = [r["name"] for r in uni_rows
if r.get("contamination", -1.0) > 0][:_TOP_FLAGGED]
if flagged:
names = ", ".join(f"**{n}**" for n in flagged)
blocks.append(model.Markdown(text=(
f"Las columnas con mayor proporción de atípicos son {names}: "
"concentran el grueso de los valores fuera de las vallas y son las "
"primeras a revisar.")))
# --- Boxplots figure ---------------------------------------------------- #
box_entries = [
{"name": r["name"], "box": r["box"], "fliers": r.get("fliers")}
for r in uni_rows
if r.get("box")
][:_TOP_BOX]
if box_entries:
def _boxplots_make(entries=box_entries):
try:
from datascience.build_boxplots_figure import build_boxplots_figure
return build_boxplots_figure(
entries, title="Boxplots de Tukey por columna",
max_boxes=_TOP_BOX)
except Exception: # noqa: BLE001 — minimal fallback figure.
import matplotlib
matplotlib.use("Agg")
from matplotlib.figure import Figure
fig = Figure(figsize=(5.0, 2.2))
ax = fig.add_subplot(111)
ax.text(0.5, 0.5, "(boxplots no disponibles)",
ha="center", va="center")
ax.axis("off")
return fig
blocks.append(model.Group(blocks=[
model.Heading(text="Boxplots", level=2),
model.Markdown(text=(
"Cada caja abarca del primer al tercer cuartil (P25–P75), la línea "
"interior es la mediana y los bigotes llegan hasta 1,5·IQR; los "
"puntos son los valores que caen fuera de las vallas (atípicos por "
"Tukey).")),
model.Figure(
make=_boxplots_make,
caption="Boxplots de Tukey de las columnas más contaminadas."),
]))
# --- Multivariate ------------------------------------------------------- #
outliers, _src = _resolve_multivariate(profile, ctx, raw_numeric)
if outliers is not None:
blocks.extend(_multivariate_blocks(outliers, raw_numeric, mark))
else:
blocks.append(model.Heading(text="Filas atípicas (multivariante)", level=2))
blocks.append(model.Note(
"No se pudo analizar la anomalía multivariante: hacen falta al menos "
"dos columnas numéricas y la muestra cruda (o los modelos del perfil) "
"para correr Isolation Forest."))
# --- Interpretation ----------------------------------------------------- #
blocks.append(model.Heading(text="Cómo interpretar los atípicos", level=2))
blocks.append(_interpretation_block(mark))
return model.Chapter(id=CHAPTER_ID, title=CHAPTER_TITLE,
version=CHAPTER_VERSION, blocks=blocks)
@@ -0,0 +1,304 @@
"""Tests for the OUTLIERS chapter — DoD: golden + edges + error path.
Self-contained: builds synthetic ``numeric`` blocks + a raw_numeric sample (no
DuckDB) so the suite is fast and deterministic. Verifies that the chapter emits
the univariate per-column table, a boxplots figure, the multivariate Isolation
Forest section and the outlier≠error interpretation; that the most contaminated
column is ranked first; that a profile with no numeric column yields None; that
None/empty never raises; that the glossary terms are registered; and that the
chapter renders into both PDF and PPTX without cutting its title.
"""
import math
import os
import re
import tempfile
from pypdf import PdfReader
from datascience.automatic_eda.chapters.outliers import (
build_outliers, CHAPTER_VERSION, CHAPTER_TITLE, _TERM_DEFS,
)
from datascience.automatic_eda import model
from datascience.render_automatic_eda_pdf import render_automatic_eda_pdf
from datascience.render_automatic_eda_pptx import render_automatic_eda_pptx
def _percentile(sorted_vals, q):
"""Linear-interpolation percentile (q in 0..1) on an already-sorted list."""
if not sorted_vals:
return None
if len(sorted_vals) == 1:
return float(sorted_vals[0])
pos = q * (len(sorted_vals) - 1)
lo = int(math.floor(pos))
hi = int(math.ceil(pos))
if lo == hi:
return float(sorted_vals[lo])
frac = pos - lo
return float(sorted_vals[lo] * (1 - frac) + sorted_vals[hi] * frac)
def _col_from_values(values, nbins=10):
    """Build a ``numeric`` sub-block shaped like describe_numeric's output from a
    concrete list of raw values, so the profile percentiles and the raw sample
    are consistent (the boxplot fences match the raw sample)."""
    vals = [float(v) for v in values]
    s = sorted(vals)
    n = len(s)
    mean = sum(vals) / n
    var = sum((v - mean) ** 2 for v in vals) / n
    std = math.sqrt(var)
    median = _percentile(s, 0.5)
    p25 = _percentile(s, 0.25)
    p75 = _percentile(s, 0.75)
    mn, mx = s[0], s[-1]
    # z-score outlier count (population), what the profile's n_outliers carries.
    n_out = sum(1 for v in vals if std > 0 and abs((v - mean) / std) > 3.0)
    width = (mx - mn) / nbins if mx > mn else 1.0
    hist = [{"lo": mn + i * width, "hi": mn + (i + 1) * width, "count": 1}
            for i in range(nbins)]
    return {
        "min": mn, "max": mx, "mean": mean, "median": median, "std": std,
        "p25": p25, "p50": median, "p75": p75, "iqr": (p75 - p25),
        "n_outliers": n_out, "outlier_pct": 100.0 * n_out / n,
        "distribution_type": "right-skewed", "histogram": hist,
    }
def _fare_values():
    """A heavy-tailed column (bulk 7-31, a few 180-512): clear Tukey/z outliers."""
    base = [7.0 + (i % 25) for i in range(120)]   # bulk 7..31
    tail = [180.0, 210.0, 263.0, 512.0]           # extreme upper tail
    return base + tail


def _age_values():
    """A roughly symmetric column (bulk 22-61) with a few values appended at
    both ends: 0.5 and 1.0 on the low side, 74 and 80 on the high side."""
    base = [22.0 + (i % 40) for i in range(120)]  # 22..61
    return base + [80.0, 0.5, 74.0, 1.0]


def _quiet_values():
    """A clean column with no atypical values."""
    return [50.0 + (i % 5) for i in range(124)]
def _profile_and_ctx(with_models=True, with_raw=True):
fare = _fare_values()
age = _age_values()
quiet = _quiet_values()
cols = [
{"name": "Fare", "inferred_type": "numeric", "numeric": _col_from_values(fare)},
{"name": "Age", "inferred_type": "numeric", "numeric": _col_from_values(age)},
{"name": "Quiet", "inferred_type": "numeric", "numeric": _col_from_values(quiet)},
{"name": "Sexo", "inferred_type": "categorical",
"categorical": {"top": [{"value": "male", "count": 80}]}},
]
profile = {"table": "titanic", "n_rows": len(fare), "n_cols": len(cols),
"columns": cols}
if with_models:
profile["models"] = {
"outliers": {
"n_outliers": 4, "outlier_pct": 3.2,
"outlier_rows": [
{"row_index": 123, "score": -0.21},
{"row_index": 121, "score": -0.15},
],
"threshold": -0.02, "n_rows_used": 124, "n_features": 3,
}
}
ctx = {}
if with_raw:
ctx["raw_numeric"] = {"Fare": fare, "Age": age, "Quiet": quiet}
return profile, ctx
def _pdf_text(path: str) -> str:
txt = "".join((pg.extract_text() or "") for pg in PdfReader(path).pages)
return re.sub(r"\s+", " ", txt)
def _flatten(blocks):
out = []
for b in blocks:
if getattr(b, "kind", "") == "group":
out.extend(_flatten(getattr(b, "blocks", []) or []))
else:
out.append(b)
return out
# --------------------------------------------------------------------------- #
# Golden.
# --------------------------------------------------------------------------- #
def test_golden_estructura_y_secciones():
profile, ctx = _profile_and_ctx()
ctx["glossary"] = model.GlossaryCollector()
ch = build_outliers(profile, ctx)
assert ch is not None
assert ch.id == "outliers"
assert ch.version == CHAPTER_VERSION
flat = _flatten(ch.blocks)
kinds = [b.kind for b in flat]
# Title heading + univariate DataTable + boxplots Figure + multivariate
# KVTable + interpretation Markdown.
assert kinds[0] == "heading" and flat[0].text == CHAPTER_TITLE
tables = [b for b in flat if b.kind == "data_table"]
titles = [t.title for t in tables]
assert any(t and "atípicos por columna" in t for t in titles)
assert any(b.kind == "figure" for b in flat), "falta la figura de boxplots"
assert any(b.kind == "kv_table" for b in flat), "falta el resumen multivariante"
# The boxplots figure maker yields a real matplotlib figure (or its fallback).
fig = next(b for b in flat if b.kind == "figure").make()
assert fig is not None
import matplotlib.pyplot as plt
plt.close(fig)
def test_golden_fare_es_la_mas_contaminada():
# The univariate table must rank Fare (heavy tail) first and report a
# non-zero Tukey percentage for it.
profile, ctx = _profile_and_ctx()
ch = build_outliers(profile, ctx)
table = next(b for b in _flatten(ch.blocks)
if b.kind == "data_table" and b.title
and "atípicos por columna" in b.title)
first_col = table.rows[0][0]
assert first_col == "Fare", f"esperaba Fare primera, fue {first_col}"
# % Tukey column (index 2) of the first row must be > 0.
pct_cell = table.rows[0][2]
assert pct_cell not in ("", "0%", "0.00%"), f"% Tukey de Fare vacío: {pct_cell}"
# The z-score rule (detect_outliers) must actually run with raw_numeric: at
# least one column reports a non-empty z count/percentage (regression guard
# for the detect_outliers import path).
z_pcts = [r[4] for r in table.rows]
assert any(c not in ("",) for c in z_pcts), f"columna z toda vacía: {z_pcts}"
z_counts = [r[3] for r in table.rows]
assert any(c not in ("",) for c in z_counts), f"conteo z vacío: {z_counts}"
def test_golden_interpretacion_outlier_no_es_error():
profile, ctx = _profile_and_ctx()
ch = build_outliers(profile, ctx)
md = " ".join(b.text for b in _flatten(ch.blocks) if b.kind == "markdown")
assert "no es necesariamente un error" in md.lower()
# Mentions the actionable options (winsorize / re-express).
assert "winsoriz" in md.lower()
assert "re-expres" in md.lower() or "logarítmic" in md.lower()
def test_golden_terminos_glosario_registrados():
profile, ctx = _profile_and_ctx()
gloss = model.GlossaryCollector()
ctx["glossary"] = gloss
build_outliers(profile, ctx)
for key in _TERM_DEFS:
assert gloss.has(key), f"término '{key}' no registrado en el glosario"
# Terms are marked clickable in the body text.
md = " ".join(b.text for b in _flatten(build_outliers(profile, ctx).blocks)
if b.kind == "markdown")
assert "[[term:outlier]]" in md and "[[term:tukey_fence]]" in md
# --------------------------------------------------------------------------- #
# Multivariate.
# --------------------------------------------------------------------------- #
def test_multivariante_live_con_raw_y_dims():
# With a raw sample the chapter runs Isolation Forest live (over the same
# columns summarize_outlier_dims uses) and lists the anomalous rows with the
# dimensions that make each one rare.
profile, ctx = _profile_and_ctx(with_models=False, with_raw=True)
ch = build_outliers(profile, ctx)
flat = _flatten(ch.blocks)
kv = next(b for b in flat if b.kind == "kv_table")
flat_kv = " ".join(f"{k} {v}" for (k, v) in kv.rows)
assert "Filas atípicas" in flat_kv
# A non-zero number of anomalous rows is reported.
n_cell = dict(kv.rows).get("Filas atípicas")
assert n_cell not in (None, "", "0"), f"sin filas atípicas: {n_cell}"
# The anomalous-rows table carries the per-row dimension breakdown.
tbls = [b for b in flat if b.kind == "data_table" and b.title
and "más atípicas" in b.title]
assert tbls, "falta la tabla de filas más atípicas"
assert any("hacen rara" in h for h in tbls[0].header), \
f"falta la columna de dimensiones: {tbls[0].header}"
def test_multivariante_precomputed_sin_raw():
# Without a raw sample the chapter falls back to profile['models']['outliers']
# (lite preset path); the precomputed n_outliers (4) surfaces in the KV table.
profile, ctx = _profile_and_ctx(with_models=True, with_raw=False)
ch = build_outliers(profile, ctx)
kv = next(b for b in _flatten(ch.blocks) if b.kind == "kv_table")
assert any("4" in str(v) for (k, v) in kv.rows)
def test_multivariante_ausente_degrada_a_nota():
# No models and no raw sample → an honest note, never a crash.
profile, ctx = _profile_and_ctx(with_models=False, with_raw=False)
ch = build_outliers(profile, ctx)
assert ch is not None
notes = [b.text for b in _flatten(ch.blocks) if b.kind == "note"]
assert any("Isolation Forest" in n for n in notes)
# --------------------------------------------------------------------------- #
# Edges / error path.
# --------------------------------------------------------------------------- #
def test_edge_sin_columnas_numericas_devuelve_none():
    prof = {"columns": [{"name": "c", "inferred_type": "categorical",
                         "categorical": {"top": [{"value": "x", "count": 3}]}}]}
    assert build_outliers(prof, {}) is None


def test_edge_solo_texto_sintetico_devuelve_none():
    # A text-only synthetic table (no numeric column) yields None (does not break).
    prof = {"table": "notas", "n_rows": 3, "n_cols": 1,
            "columns": [{"name": "comentario", "inferred_type": "text",
                         "text": {"n_docs": 3}}]}
    assert build_outliers(prof, {}) is None


def test_edge_profile_none_y_vacio_no_revienta():
    assert build_outliers(None, None) is None
    assert build_outliers({}, {}) is None
    assert build_outliers({"columns": []}, {}) is None


def test_edge_sin_raw_numeric_degrada_a_perfil():
    # Without raw_numeric the chapter still builds, using the profile z-score
    # counts; the univariate table exists and Tukey counts degrade to '—'.
    profile, ctx = _profile_and_ctx(with_models=True, with_raw=False)
    ch = build_outliers(profile, ctx)
    assert ch is not None
    table = next(b for b in _flatten(ch.blocks)
                 if b.kind == "data_table" and b.title
                 and "atípicos por columna" in b.title)
    # z column comes from the profile; Tukey count is unknown ('—').
    assert all(len(r) == 8 for r in table.rows)
# --------------------------------------------------------------------------- #
# Anti-cut render.
# --------------------------------------------------------------------------- #
def test_render_pdf_y_pptx_incluyen_el_capitulo():
    profile, ctx = _profile_and_ctx()
    # The renderers build the whole document and reach this chapter through the
    # chapter registry, so pass the profile directly and check that the chapter
    # shows up in the rendered output.
    with tempfile.TemporaryDirectory() as d:
        pdf = os.path.join(d, "out.pdf")
        res_pdf = render_automatic_eda_pdf(profile, pdf,
                                           {"write_manifest": False, "ctx": ctx})
        assert res_pdf["path"] == pdf
        txt = _pdf_text(pdf)
        assert CHAPTER_TITLE in txt, "el capítulo OUTLIERS no aparece en el PDF"
        assert "Fare" in txt
        pptx = os.path.join(d, "out.pptx")
        res_pptx = render_automatic_eda_pptx(profile, pptx,
                                             {"write_manifest": False, "ctx": ctx})
        assert res_pptx["path"] == pptx
        assert res_pptx["n_slides"] >= 1
@@ -34,6 +34,7 @@ CHAPTER_ORDER = [
"text_distr", # free-text / NLP distributions (non-tabular content)
"calidad", # data quality
"missingness", # missing-data patterns (co-occurrence of absences; MCAR/MAR)
"outliers", # atypical values: univariate (Tukey/z) + multivariate (IsolationForest)
"correlacion", # correlations / associations
"relaciones", # key relations: declared/candidate PK + FK (inter/intra-table)
"modelos", # cheap models (PCA/KMeans/outliers)
@@ -0,0 +1,125 @@
---
id: build_boxplots_figure_py_datascience
name: build_boxplots_figure
kind: function
lang: py
domain: datascience
version: "1.0.0"
purity: impure
signature: "def build_boxplots_figure(boxes: list, title: str = \"\", max_boxes: int = 12) -> \"matplotlib.figure.Figure\""
description: "Construye una unica figura matplotlib con boxplots de Tukey HORIZONTALES (uno por columna) usando ax.bxp: caja Q1-Q3, bigotes hasta 1.5*IQR, linea de mediana y puntos atipicos. Consume la salida de build_boxplot_stats (un dict box por columna, leido con .get) mas una lista opcional de outliers crudos por columna; si vienen los dibuja como puntos (showfliers), si no marca solo box[min]/box[max] cuando hay outliers de cola (igual que num_distr). Dibuja como mucho max_boxes cajas (las primeras, ya ordenadas por contaminacion por el caller) y avisa de la truncacion con (mostrando N de M). Backend Agg sin pyplot global; alto adaptativo al nº de cajas. Defensiva: omite entradas invalidas y NUNCA lanza — sin cajas validas devuelve una figura placeholder (sin boxplots). Es la version small-multiples del capitulo num_distr para responder que columnas tienen mas outliers de un vistazo."
tags: [eda, outliers, boxplot, tukey, iqr, bxp, matplotlib, figure, visualization, small-multiples, datascience, impure]
uses_functions: []
uses_types: []
returns: []
returns_optional: false
error_type: "error_go_core"
imports: [matplotlib]
example: |
  from datascience.build_boxplot_stats import build_boxplot_stats
  from datascience.build_boxplots_figure import build_boxplots_figure
  boxes = [
      {"name": "ingresos", "box": build_boxplot_stats({"min": 1.0, "max": 9e3,
          "p25": 1e3, "median": 2e3, "p75": 3e3, "n_outliers": 7}), "fliers": None},
      {"name": "edad", "box": build_boxplot_stats({"min": 0.0, "max": 99.0,
          "p25": 25.0, "median": 38.0, "p75": 52.0}), "fliers": None},
  ]
  fig = build_boxplots_figure(boxes, title="Outliers por columna", max_boxes=12)
tested: true
tests:
- "test_returns_figure_with_axes"
- "test_empty_list_returns_placeholder_figure"
- "test_invalid_box_is_skipped_not_raised"
- "test_all_invalid_returns_placeholder"
- "test_raw_fliers_are_drawn"
- "test_max_boxes_truncates_and_does_not_raise"
test_file_path: "python/functions/datascience/build_boxplots_figure_test.py"
file_path: "python/functions/datascience/build_boxplots_figure.py"
params:
- name: boxes
desc: "Lista de dicts, cada uno {\"name\": str, \"box\": dict, \"fliers\": list|None}. box es EXACTAMENTE la salida de build_boxplot_stats (claves leidas con .get: q1, median, q3, whisker_lo, whisker_hi, min, max, has_low_outliers, has_high_outliers, lower_fence, upper_fence, n_outliers). fliers es la lista opcional de outliers crudos: si viene se dibuja como puntos; si es None/ausente solo se marcan los extremos box[min]/box[max] cuando hay outliers de cola. Entradas que no son dict, sin box dict, o sin q1/median/q3 se omiten. El caller las pasa ya ordenadas por contaminacion (la mayor primera)."
- name: title
desc: "Titulo de la figura (fig.suptitle, alineado a la izquierda). Vacio => sin titulo. Si len(boxes) > max_boxes se le anade una nota \"(mostrando N de M)\" para que la truncacion no sea silenciosa. Default \"\"."
- name: max_boxes
desc: "Numero maximo de cajas a dibujar (las primeras de la lista). Default 12. Un valor no entero o <= 0 cae a 12. Si la lista trae mas entradas, las sobrantes se descartan pero se reporta en el titulo con (mostrando N de M)."
output: "Un matplotlib.figure.Figure (figsize 7.0 x alto adaptativo = max(2.0, 0.5*n + 1.0), dpi 150) con un unico Axes que apila boxplots horizontales de Tukey (ax.bxp, orientation=horizontal con fallback vert=False), uno por columna valida, de arriba a abajo en el orden recibido. Cada caja: relleno #9ec6df, borde/bigotes/caps #5b8aa6, mediana #2e8b57, atipicos #c0392b. Etiquetas del eje Y = nombres de columna; eje X etiquetado \"valor\". Outliers dibujados desde fliers crudos (showfliers) o, si faltan, marcados en box[min]/box[max] segun has_low/high_outliers. Si no queda ninguna caja valida (lista vacia o todas invalidas) devuelve una Figure placeholder con texto centrado \"(sin boxplots)\"; cualquier error inesperado se captura y devuelve una Figure con el mensaje de error. NUNCA lanza. El caller rasteriza/cierra la figura; la funcion no la muestra ni la guarda."
---
## Ejemplo
```python
import sys, os
sys.path.insert(0, os.path.join("python", "functions"))
from datascience.build_boxplot_stats import build_boxplot_stats
from datascience.build_boxplots_figure import build_boxplots_figure
# One `box` per numeric column, derived from the profile's `numeric` sub-block
# (the output of describe_numeric). The caller passes them already sorted by
# outlier_pct.
boxes = [
    {
        "name": "ingresos",
        "box": build_boxplot_stats({
            "min": 1.0, "max": 9000.0,
            "p25": 1000.0, "median": 2000.0, "p75": 3000.0,
            "n_outliers": 7,
        }),
        "fliers": None,  # raw values unknown -> only the extreme is marked.
    },
    {
        "name": "edad",
        "box": build_boxplot_stats({
            "min": 0.0, "max": 99.0,
            "p25": 25.0, "median": 38.0, "p75": 52.0,
        }),
        # Raw outliers -> drawn as points. The upper fence is
        # 52 + 1.5 * (52 - 25) = 92.5, so only values above it belong here.
        "fliers": [95.0, 99.0],
    },
]
fig = build_boxplots_figure(boxes, title="Outliers por columna", max_boxes=12)
# The report renderer rasterizes it; here we only persist it for inspection.
fig.savefig("/tmp/boxplots.png")
```
## Cuando usarla
Use it in the outliers chapter of an EDA report when you want to compare at a
glance *which columns are most contaminated by atypical values*: unlike
`num_distr` (which draws one histogram+boxplot per column in separate figures),
here all the horizontal boxplots are stacked in **a single figure** (small
multiples). First derive each column's `box` with `build_boxplot_stats`, sort the
columns by `outlier_pct` in descending order, wrap them as
`{"name", "box", "fliers"}` and pass them in. If you have the raw values that
fall outside the fences, supply them as the `fliers` list and they are drawn as
points; otherwise the function marks only the `min`/`max` extremes when that tail
has outliers.
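
The sort-and-wrap step can be sketched as follows. This is a minimal illustration under stated assumptions, not the chapter's actual code: `wrap_boxes_by_contamination` and the `outlier_pct` / `raw_outliers` input keys are hypothetical names; only the `{"name", "box", "fliers"}` entry shape comes from this page.

```python
def wrap_boxes_by_contamination(columns):
    """Sort columns by outlier percentage (highest first) and wrap them as entries.

    `columns` is a hypothetical list of dicts with keys "name", "box" (a dict
    shaped like the output of build_boxplot_stats), "outlier_pct" (number or
    None) and, optionally, "raw_outliers" (raw out-of-fence values).
    """
    def pct(col):
        value = col.get("outlier_pct")
        # A missing percentage counts as 0 so such columns sort last.
        return value if isinstance(value, (int, float)) else 0.0

    ordered = sorted(columns, key=pct, reverse=True)
    return [
        {"name": c.get("name"), "box": c.get("box"), "fliers": c.get("raw_outliers")}
        for c in ordered
    ]


columns = [
    {"name": "edad", "box": {"q1": 25.0, "median": 38.0, "q3": 52.0},
     "outlier_pct": 0.4},
    {"name": "ingresos", "box": {"q1": 1e3, "median": 2e3, "q3": 3e3},
     "outlier_pct": 3.1, "raw_outliers": [8500.0, 9000.0]},
]
entries = wrap_boxes_by_contamination(columns)
print([e["name"] for e in entries])  # most contaminated column first
```

Sorting before the call matters because `max_boxes` keeps only the first entries, so the columns dropped by truncation should be the least contaminated ones.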
## Gotchas
- **Impure because of matplotlib.** It touches the rendering machinery. It uses
  the `Agg` backend and the object-oriented `Figure`/`add_subplot` API, NEVER
  `pyplot.*`, so it does not touch global state or leak figures between calls.
  `pyplot` is NOT thread-safe; this function builds the `Figure` directly, so it
  is safe to call in a loop from the renderer.
- **The caller releases the figure.** It returns the `Figure` but neither shows
  nor saves it. The consumer must rasterize it and then drop it so memory does
  not pile up in large batches. Because the figure is created without `pyplot`,
  it is never registered with pyplot's figure manager:
  `matplotlib.pyplot.close(fig)` is harmless but has nothing to unregister, and
  releasing the last reference is what frees it.
- **`fliers` is optional, with different semantics.** If you pass the list of raw
  outliers they are all drawn as points (`showfliers=True`). If it is
  `None`/absent the values are unknown and a single point is marked at
  `box["min"]` / `box["max"]` only when `has_low_outliers` / `has_high_outliers`
  is set (same convention as `num_distr`). Do not invent fliers from the profile:
  the `box` does not carry the raw values, only whether the extremes exceed the
  fences.
- **`ax.bxp` orientation API.** Recent matplotlib uses
  `orientation="horizontal"`; older versions use `vert=False`. The function tries
  the first and falls back to the second on `except TypeError`, so it works with
  both. If `bxp` fails altogether, the Axes degrades to the text
  "(boxplot no disponible)" instead of propagating the error.
- **Visible truncation.** `max_boxes` (default 12) limits the number of boxes so
  that none overlap; if the list holds more, the extra ones are dropped but the
  title says so with "(mostrando N de M)". Pass the columns already sorted by
  contamination so that the dropped ones are the least relevant.
- **Defensive, never raises.** An empty list, non-dict entries, entries without
  `box`, or entries missing `q1`/`median`/`q3` are skipped without propagating;
  with no valid box it returns a "(sin boxplots)" placeholder, and any unexpected
  error is captured in a figure carrying the error text. There is no need to wrap
  the call in try/except: it does not raise.
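
When the raw column values are available, the `fliers` list can be derived from the same 1.5·IQR fences. The sketch below is illustrative only: `tukey_fliers` is a hypothetical helper (not part of the registry), and it assumes the quartiles are already known, for example from the profile percentiles.

```python
def tukey_fliers(values, q1, q3, k=1.5):
    """Return the raw values outside the Tukey fences [q1 - k*IQR, q3 + k*IQR]."""
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    return [
        float(v)
        for v in values
        # Skip None, bools and NaN before comparing against the fences.
        if isinstance(v, (int, float)) and not isinstance(v, bool) and v == v
        and (v < lower_fence or v > upper_fence)
    ]


ages = [18, 25, 31, 38, 44, 52, 60, None, 95, 99]
# With q1=25 and q3=52 the fences are -15.5 and 92.5.
fliers = tukey_fliers(ages, q1=25.0, q3=52.0)
print(fliers)  # [95.0, 99.0]
```

A value such as 88 stays inside the upper fence (92.5) and is therefore not a flier, even though it sits far from the median.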
@@ -0,0 +1,250 @@
"""Impure EDA helper: a single figure of horizontal Tukey boxplots (`eda` group).
Draws, in one ``matplotlib.figure.Figure``, a stack of horizontal Tukey boxplots
(one per column) using ``ax.bxp``: each carries its box (Q1-Q3), whiskers (up to
1.5·IQR), the median line and its outlier points. It consumes the output of the
pure registry function ``build_boxplot_stats`` (one ``box`` dict per column) plus
an optional list of raw outlier values per column; it never recomputes anything.
It is the "small-multiples" companion of ``num_distr`` (which draws one
histogram+boxplot per column): here every column shares a single figure so the
caller can show, at a glance, *which* columns are the most contaminated by
outliers (the caller passes them already ordered by contamination).
Impure because it touches matplotlib's rendering machinery. It uses the headless
Agg backend and the object-oriented ``Figure`` API (no ``pyplot``) so it leaks no
global state and is safe to call repeatedly from a report renderer. It is fully
defensive and NEVER raises: invalid entries are skipped and, if nothing valid
remains, it returns a placeholder figure carrying a centered "(sin boxplots)".
"""
import matplotlib
matplotlib.use("Agg")
from matplotlib.figure import Figure # noqa: E402
# Blue palette shared with the ``num_distr`` chapter so the report stays coherent.
_BOX_FACE = "#9ec6df" # box fill.
_BOX_EDGE = "#5b8aa6" # box / whisker / cap border.
_MEDIAN = "#2e8b57" # median line (sea green).
_OUTLIER = "#c0392b" # outlier points (soft red).
# Muted gray for the placeholder / fallback message text.
_MUTED_TEXT = "#5f6b7a"
# Soft red for the error fallback message.
_ERROR_TEXT = "#b00020"
def _num(value):
    """Coerce ``value`` to float defensively; None for None/bool/non-numeric/NaN."""
    # bool is a subclass of int; a stat value is never a real bool, so treat
    # True/False as missing instead of silently coercing to 1.0/0.0.
    if value is None or isinstance(value, bool):
        return None
    try:
        f = float(value)
    except (TypeError, ValueError):
        return None
    if f != f:  # NaN guard.
        return None
    return f


def _placeholder_figure(message: str, color: str = _MUTED_TEXT) -> "Figure":
    """Return a fallback ``Figure`` carrying a single centered message."""
    fig = Figure(figsize=(7.0, 2.4), dpi=150)
    ax = fig.add_subplot(111)
    ax.axis("off")
    ax.text(
        0.5,
        0.5,
        message,
        ha="center",
        va="center",
        fontsize=12,
        color=color,
        wrap=True,
        transform=ax.transAxes,
    )
    fig.tight_layout()
    return fig


def build_boxplots_figure(
    boxes: list,
    title: str = "",
    max_boxes: int = 12,
) -> "matplotlib.figure.Figure":
    """Build one figure of stacked horizontal Tukey boxplots (one per column).

    For each entry the function builds a ``bxp`` stats record (``med, q1, q3,
    whislo, whishi, fliers, label``) from its ``box`` sub-dict (the output of
    ``build_boxplot_stats``) and draws all of them as horizontal boxplots sharing
    the X axis, top-to-bottom in the order received (the caller is expected to
    pass them already sorted by contamination).

    Outliers are shown two ways:

    - If an entry carries a ``fliers`` list (the raw out-of-fence values), they
      are drawn as red points via ``ax.bxp(..., showfliers=True)``.
    - If ``fliers`` is ``None``/absent, the raw values are unknown, so only the
      extremes are marked: a red point at ``box["min"]`` when
      ``box["has_low_outliers"]`` and at ``box["max"]`` when
      ``box["has_high_outliers"]`` (same convention as ``num_distr``).

    The function is fully defensive and NEVER raises. Entries that are not dicts,
    lack a ``box`` dict, or miss any of ``q1``/``median``/``q3`` are skipped. If
    after filtering no valid box remains it returns a placeholder ``Figure`` with
    a centered "(sin boxplots)"; any unexpected error is caught and turned into a
    fallback figure carrying the error text. It always returns a ``Figure``.

    Args:
        boxes: List of dicts ``{"name": str, "box": dict, "fliers": list|None}``.
            ``box`` is exactly the output of ``build_boxplot_stats`` (read with
            ``.get``: ``q1, median, q3, whisker_lo, whisker_hi, min, max,
            has_low_outliers, has_high_outliers, ...``). ``fliers`` is the
            optional list of raw outlier values; when present they are plotted,
            otherwise only the extremes are marked.
        title: Figure title (``fig.suptitle``). Empty => no title. When the list
            is longer than ``max_boxes`` a "(mostrando N de M)" note is appended.
        max_boxes: Draw at most the first ``max_boxes`` entries (default 12). The
            rest are dropped but their omission is surfaced in the title note, so
            the truncation is never silent.

    Returns:
        A ``matplotlib.figure.Figure`` with a single Axes holding the horizontal
        boxplots (height adaptive to the box count so none overlap). The caller is
        responsible for rasterizing/closing it; this function never shows nor
        saves it.
    """
    try:
        if not isinstance(boxes, (list, tuple)) or len(boxes) == 0:
            return _placeholder_figure("(sin boxplots)")
        total = len(boxes)
        # Cap the number of boxes; tolerate a non-int / non-positive max_boxes.
        try:
            cap = int(max_boxes)
        except (TypeError, ValueError):
            cap = 12
        if cap <= 0:
            cap = 12
        candidates = list(boxes)[:cap]
        stats_list = []      # bxp stats records, in draw order.
        labels = []          # Y tick labels (column names).
        manual_markers = []  # (position, box) for entries without raw fliers.
        any_fliers = False   # whether to enable showfliers in the bxp call.
        for entry in candidates:
            if not isinstance(entry, dict):
                continue
            box = entry.get("box")
            if not isinstance(box, dict):
                continue
            q1 = _num(box.get("q1"))
            med = _num(box.get("median"))
            q3 = _num(box.get("q3"))
            # Without the three quartiles a boxplot cannot be drawn: skip it.
            if q1 is None or med is None or q3 is None:
                continue
            # Whisker extremes fall back to the quartiles when missing.
            whislo = _num(box.get("whisker_lo"))
            whishi = _num(box.get("whisker_hi"))
            if whislo is None:
                whislo = q1
            if whishi is None:
                whishi = q3
            name = entry.get("name")
            label = "" if name is None else str(name)
            position = len(stats_list) + 1  # bxp positions are 1-indexed.
            fliers_raw = entry.get("fliers")
            if isinstance(fliers_raw, (list, tuple)):
                fliers = [v for v in (_num(x) for x in fliers_raw) if v is not None]
                if fliers:
                    any_fliers = True
            else:
                # Raw values unknown: draw no bxp fliers, mark min/max by hand.
                fliers = []
                manual_markers.append((position, box))
            stats_list.append({
                "med": med,
                "q1": q1,
                "q3": q3,
                "whislo": whislo,
                "whishi": whishi,
                "fliers": fliers,
                "label": label,
            })
            labels.append(label)
        if not stats_list:
            return _placeholder_figure("(sin boxplots)")
        n = len(stats_list)
        positions = list(range(1, n + 1))
        # Height grows with the box count so none of them overlap.
        height = max(2.0, 0.5 * n + 1.0)
        fig = Figure(figsize=(7.0, height), dpi=150)
        ax = fig.add_subplot(111)
        bxp_kw = dict(
            showfliers=any_fliers, widths=0.5, patch_artist=True,
            boxprops={"facecolor": _BOX_FACE, "edgecolor": _BOX_EDGE},
            medianprops={"color": _MEDIAN, "linewidth": 1.6},
            whiskerprops={"color": _BOX_EDGE},
            capprops={"color": _BOX_EDGE},
            flierprops={"marker": "o", "markersize": 3.5,
                        "markerfacecolor": _OUTLIER, "markeredgecolor": _OUTLIER,
                        "linestyle": "none"})
        try:
            # ``orientation`` is the current API; older matplotlib uses ``vert``.
            try:
                ax.bxp(stats_list, positions=positions,
                       orientation="horizontal", **bxp_kw)
            except TypeError:
                ax.bxp(stats_list, positions=positions, vert=False, **bxp_kw)
        except Exception:  # noqa: BLE001 (never let bxp kill the whole figure).
            ax.text(0.5, 0.5, "(boxplot no disponible)", ha="center",
                    va="center", fontsize=10, color=_MUTED_TEXT,
                    transform=ax.transAxes)
        # For entries without raw fliers, mark only the out-of-fence extremes.
        for position, box in manual_markers:
            mn = _num(box.get("min"))
            mx = _num(box.get("max"))
            if box.get("has_low_outliers") and mn is not None:
                ax.plot([mn], [position], marker="o", markersize=3.5,
                        color=_OUTLIER, zorder=5)
            if box.get("has_high_outliers") and mx is not None:
                ax.plot([mx], [position], marker="o", markersize=3.5,
                        color=_OUTLIER, zorder=5)
        # Pin the Y tick labels explicitly so they work across matplotlib
        # versions regardless of whether ``bxp`` consumed the ``label`` key.
        ax.set_yticks(positions)
        ax.set_yticklabels(labels, fontsize=8)
        ax.set_xlabel("valor", fontsize=9)
        ax.tick_params(labelsize=7)
        ax.margins(y=0.15)
        for spine in ("top", "right"):
            ax.spines[spine].set_visible(False)
        # Surface truncation in the title instead of silently dropping boxes.
        note = f"(mostrando {n} de {total})" if total > cap else ""
        heading = " ".join(p for p in (title, note) if p)
        if heading:
            fig.suptitle(heading, fontsize=12, x=0.02, ha="left")
        fig.tight_layout()
        return fig
    except Exception as exc:  # noqa: BLE001 (never raise from a figure builder).
        return _placeholder_figure(
            f"error al dibujar boxplots: {exc}", color=_ERROR_TEXT)
@@ -0,0 +1,109 @@
"""Tests for build_boxplots_figure (horizontal Tukey boxplots, eda group).

Uses the Agg backend with no display; it neither shows nor saves figures. Each
test explicitly closes the Figure it builds (matplotlib.pyplot.close) so that no
state accumulates between tests.
"""
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt # noqa: E402
from matplotlib.figure import Figure # noqa: E402
from build_boxplots_figure import build_boxplots_figure


def _box(name, q1, median, q3, mn, mx, low=False, high=False, fliers=None):
    """Build a {name, box, fliers} entry with a build_boxplot_stats-style box."""
    iqr = q3 - q1
    return {
        "name": name,
        "box": {
            "q1": q1,
            "median": median,
            "q3": q3,
            "iqr": iqr,
            "lower_fence": q1 - 1.5 * iqr,
            "upper_fence": q3 + 1.5 * iqr,
            "whisker_lo": max(mn, q1 - 1.5 * iqr),
            "whisker_hi": min(mx, q3 + 1.5 * iqr),
            "min": mn,
            "max": mx,
            "has_low_outliers": low,
            "has_high_outliers": high,
            "n_outliers": 0,
        },
        "fliers": fliers,
    }


def test_returns_figure_with_axes():
    boxes = [
        _box("edad", 10.0, 25.0, 40.0, 1.0, 100.0, high=True),
        _box("ingresos", 100.0, 200.0, 300.0, 50.0, 400.0),
        _box("score", -1.0, 0.0, 1.0, -5.0, 5.0, low=True, high=True),
    ]
    fig = build_boxplots_figure(boxes, title="Boxplots", max_boxes=12)
    assert isinstance(fig, Figure)
    assert len(fig.axes) >= 1
    # Three boxes -> three tick labels on the Y axis.
    ax = fig.axes[0]
    assert len(ax.get_yticks()) == 3
    plt.close(fig)


def test_empty_list_returns_placeholder_figure():
    fig = build_boxplots_figure([], title="vacío")
    assert isinstance(fig, Figure)
    assert len(fig.axes) >= 1
    plt.close(fig)


def test_invalid_box_is_skipped_not_raised():
    boxes = [
        {"name": "rota", "box": {"q1": None, "median": None, "q3": None}},
        {"name": "sin_box"},  # the box key is missing.
        "no_es_dict",         # non-dict entry.
        _box("buena", 1.0, 2.0, 3.0, 0.0, 10.0, high=True),
    ]
    fig = build_boxplots_figure(boxes)
    assert isinstance(fig, Figure)
    ax = fig.axes[0]
    # Only the valid box survives the filtering.
    assert len(ax.get_yticks()) == 1
    plt.close(fig)


def test_all_invalid_returns_placeholder():
    boxes = [
        {"name": "a", "box": {"q1": None, "median": 1.0, "q3": 2.0}},
        {"name": "b"},
    ]
    fig = build_boxplots_figure(boxes)
    assert isinstance(fig, Figure)
    assert len(fig.axes) >= 1
    plt.close(fig)


def test_raw_fliers_are_drawn():
    boxes = [
        _box("con_fliers", 10.0, 20.0, 30.0, 5.0, 200.0,
             high=True, fliers=[150.0, 180.0, 200.0]),
    ]
    fig = build_boxplots_figure(boxes)
    assert isinstance(fig, Figure)
    assert len(fig.axes) >= 1
    plt.close(fig)


def test_max_boxes_truncates_and_does_not_raise():
    boxes = [_box(f"c{i}", float(i), float(i + 1), float(i + 2),
                  float(i - 5), float(i + 10)) for i in range(20)]
    fig = build_boxplots_figure(boxes, title="muchos", max_boxes=5)
    assert isinstance(fig, Figure)
    ax = fig.axes[0]
    # Only the first 5 boxes are drawn.
    assert len(ax.get_yticks()) == 5
    plt.close(fig)
@@ -0,0 +1,79 @@
---
name: summarize_outlier_dims
kind: function
lang: py
domain: datascience
version: "1.0.0"
purity: pure
signature: "def summarize_outlier_dims(raw_numeric: dict, outlier_rows: list, top_k: int = 3) -> list"
description: "Explica QUE columnas hacen rara cada fila anomala detectada por isolation_forest_outliers. Para cada {row_index, score} reconstruye la fila valida (mismo filtro de columnas numericas y mismo descarte de filas con None que el detector, asi row_index coincide) y devuelve las top_k columnas de mayor |z-score| poblacional (ddof=0). Capa de explicabilidad del paso de outliers multivariante en EDA. Pura y determinista; ante entradas vacias/invalidas o sin filas validas devuelve [] sin petar."
tags: [eda, models, outliers, anomaly-detection, explainability, z-score, multivariate]
params:
- name: raw_numeric
desc: "dict {nombre_columna: [valores]} alineado por fila (como ctx['raw_numeric'] del motor AutomaticEDA). Solo se usan columnas con todos los valores numericos (None permitido por fila; bool/str/NaN/Inf descartan la columna entera) — filtro IDENTICO al de isolation_forest_outliers para que row_index coincida."
- name: outlier_rows
desc: "Lista de {row_index, score} tal cual la devuelve isolation_forest_outliers. row_index cuenta SOLO las filas validas (sin None) en orden de aparicion, base 0. Entradas fuera de rango o malformadas se ignoran defensivamente."
- name: top_k
desc: "Numero de columnas (las de mayor |z-score|) a reportar por outlier. Default 3. Valores invalidos (no-int, bool, <1) caen a 3."
output: "Lista paralela a outlier_rows (mismo orden) de dicts {row_index: int, score: float, dims: [{col: str, value: float, z: float}, ...]}. dims trae hasta top_k columnas ordenadas por |z| descendente, con z (z-score poblacional, ddof=0) redondeado a 3 decimales; si una columna tiene std==0 su z es 0. Las entradas de outlier_rows fuera de rango/malformadas se omiten. Ante raw_numeric vacio/no-dict, outlier_rows no-lista, 0 columnas numericas o 0 filas validas devuelve []."
uses_functions: []
uses_types: []
returns: []
returns_optional: false
error_type: ""
imports: []
tested: true
tests: ["test_row_index_skips_none_rows", "test_extreme_row_flagged_via_isolation", "test_out_of_range_row_index_is_ignored", "test_degrades_to_empty_on_invalid_inputs"]
test_file_path: "python/functions/datascience/summarize_outlier_dims_test.py"
file_path: "python/functions/datascience/summarize_outlier_dims.py"
---
## Ejemplo
```python
from datascience import isolation_forest_outliers, summarize_outlier_dims
# Dense cloud around the origin + 1 row with an extreme value in "c".
raw_numeric = {
    "a": [0.1, 0.2, -0.1, 0.0, 0.3, -0.2, 0.15, -0.05, 0.25, 0.2, -0.3, 0.1],
    "b": [1.0, 1.1, 0.9, 1.2, 0.8, 1.0, 1.1, 0.95, 1.05, 0.9, 1.15, 1.0],
    "c": [5.0, 5.2, 4.8, 5.1, 4.9, 5.0, 4.95, 5.05, 4.9, 500.0, 5.1, 5.0],
}
result = isolation_forest_outliers(raw_numeric, contamination=0.1)
summary = summarize_outlier_dims(raw_numeric, result["outlier_rows"], top_k=3)
for item in summary:
    top = item["dims"][0]
    print(item["row_index"], top["col"], top["value"], top["z"])
# The row holding 500 comes out with top dim "c" and a high |z|: that is what
# makes it rare.
```
## Cuando usarla
Right **after** `isolation_forest_outliers`, once you know WHICH rows are
anomalous and want to explain WHY: in which columns they deviate most from the
rest. Useful for filling the outliers section of an EDA report/notebook with
"row 9 is rare mostly because of `c` (z=+3.3)" instead of just an opaque
row_index. Pass the same `raw_numeric` you gave the detector and its
`outlier_rows` untouched; the `row_index` points to the same row because both
functions apply the same column filter and the same discard of rows with None.
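
The "z=+3.3" figure can be reproduced by hand. The sketch below is a stand-alone illustration of the population z-score (ddof=0) on column `c` of the example, written with the standard library only; `population_z_scores` is a hypothetical helper, not the registry function.

```python
import math


def population_z_scores(values):
    """Population z-scores (ddof=0); a constant column yields all zeros."""
    if not values:
        return []
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    if std == 0.0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


c = [5.0, 5.2, 4.8, 5.1, 4.9, 5.0, 4.95, 5.05, 4.9, 500.0, 5.1, 5.0]
z = population_z_scores(c)
print(round(z[9], 3))  # about 3.317
```

With a single extreme value among n otherwise near-identical values, the population z-score approaches sqrt(n - 1); here n = 12, so it cannot rise much above 3.32 however large the extreme value is.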
## Gotchas
- **Same `raw_numeric` as the detector**: the `row_index` only matches if you
  pass the same dict of columns (same order, same lists) you called
  `isolation_forest_outliers` with. If you change the columns or the order, the
  indices stop mapping.
- **`row_index` is relative to the valid rows**: rows with `None` in any used
  column are discarded and the indices are recomputed over the remaining ones
  (0-based, in order of appearance). It does not map 1:1 to the input lists when
  there are None values.
- **Population z-score (ddof=0)**: the population standard deviation is used,
  consistent with the detector's scaling. Columns with `std==0` (all values
  equal) give `z=0`, so they never show up as "rare".
- **Returns `[]` instead of failing**: non-dict/non-list input, 0 numeric
  columns, 0 valid rows, or every entry out of range -> empty list. It raises no
  exceptions.
- **Does not call `isolation_forest_outliers`**: it only consumes its output. It
  is an independent function (it does not import the detector), which is why
  `uses_functions` is empty.
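
To show a flagged row in terms of the original data, the valid-row index has to be mapped back to its position in the input lists. A minimal sketch follows, assuming every column passed in is one the detector actually used; `valid_to_original_index` is a hypothetical helper, not part of the registry.

```python
def valid_to_original_index(columns):
    """Map each valid-row index to its position in the original lists.

    A row is valid when none of the given columns holds None at that position
    (the discard rule described above). `columns` is a dict {name: [values]};
    rows beyond the shortest list are ignored.
    """
    if not columns:
        return []
    n_rows = min(len(values) for values in columns.values())
    return [
        i for i in range(n_rows)
        if all(values[i] is not None for values in columns.values())
    ]


raw = {
    "a": [0.1, 0.2, None, 0.0, 0.3],
    "c": [5.0, 5.2, 4.8, 500.0, 4.9],
}
mapping = valid_to_original_index(raw)
print(mapping)     # [0, 1, 3, 4]
print(mapping[2])  # 3: valid row 2 is original row 3 (the one with c == 500.0)
```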
@@ -0,0 +1,144 @@
"""Explain which dimensions (columns) make each anomalous row rare.

Takes the multivariate output of `isolation_forest_outliers` (a list of
`{row_index, score}`) and, for each outlier, returns the columns with the highest
|z-score| relative to the distribution of the valid rows. It is the
"explainability" layer of the multivariate outlier step in the EDA phase:
Isolation Forest says WHICH rows are rare, this function says WHY (in which
columns they deviate most).

Pure and deterministic: it rebuilds EXACTLY the same "valid rows" used by
`isolation_forest_outliers` (same numeric-column filter and same discard of rows
with None), so `row_index` points to the same row in both functions. It does no
I/O and depends on no state.
"""
import math
import numpy as np


def _is_finite_number(v) -> bool:
    """True if v is a finite int/float. bool does NOT count; neither do NaN/Inf."""
    if isinstance(v, bool):
        return False
    if not isinstance(v, (int, float)):
        return False
    if isinstance(v, float) and (math.isnan(v) or math.isinf(v)):
        return False
    return True


def summarize_outlier_dims(
    raw_numeric: dict,
    outlier_rows: list,
    top_k: int = 3,
) -> list:
    """Summarize the dimensions that deviate most for each anomalous row.

    Args:
        raw_numeric: dict {column_name: [values]} aligned by row (like
            ctx['raw_numeric'] in the AutomaticEDA engine). Only columns whose
            values are all numeric are used (None is allowed per row; bool,
            str, NaN and Inf discard the whole column). The filter is identical
            to the one in isolation_forest_outliers.
        outlier_rows: list of {row_index, score} as returned by
            isolation_forest_outliers. row_index counts ONLY the valid rows
            (without None) in order of appearance, starting at 0.
        top_k: number of columns (those with the highest |z-score|) to report
            per outlier. Default 3. Invalid values fall back to 3.

    Returns:
        A list parallel to outlier_rows (same order) of dicts
        {row_index, score, dims}, where dims is the list of up to top_k columns
        sorted by descending |z|: [{col, value, z}, ...] with z rounded to
        3 decimals. Out-of-range or malformed outlier_rows entries are omitted
        (defensive). With an empty/non-dict raw_numeric, a non-list
        outlier_rows, 0 numeric columns or 0 valid rows it returns [].
    """
    # Defensive validation of the main arguments.
    if not isinstance(raw_numeric, dict) or not isinstance(outlier_rows, list):
        return []
    if not isinstance(top_k, int) or isinstance(top_k, bool) or top_k < 1:
        top_k = 3
    # Numeric-column selection: identical to isolation_forest_outliers.
    # A column is kept only if all its values are numeric (None allowed per
    # row); any bool/str/NaN/Inf discards the whole column.
    numeric_cols: dict[str, list] = {}
    for name, values in raw_numeric.items():
        if not isinstance(values, (list, tuple)):
            continue
        ok = True
        for v in values:
            if v is None:
                continue
            if not _is_finite_number(v):
                ok = False
                break
        if ok:
            numeric_cols[name] = list(values)
    if len(numeric_cols) < 1:
        return []
    col_names = list(numeric_cols.keys())
    try:
        n_rows_total = min(len(numeric_cols[c]) for c in col_names)
    except ValueError:
        return []
    # Rebuild the valid rows with the SAME criterion as the detector: row i
    # takes one value per column; if any value is None, the row is discarded
    # and does NOT advance the valid index. That way the row_index values in
    # outlier_rows point into this same sequence (0-based, order of appearance).
    valid_rows: list[list[float]] = []
    for i in range(n_rows_total):
        row = [numeric_cols[c][i] for c in col_names]
        if any(v is None for v in row):
            continue
        valid_rows.append([float(v) for v in row])
    if not valid_rows:
        return []
    matrix = np.asarray(valid_rows, dtype=float)
    n_valid = matrix.shape[0]
    means = matrix.mean(axis=0)
    stds = matrix.std(axis=0, ddof=0)  # population (ddof=0)
    out: list = []
    for entry in outlier_rows:
        if not isinstance(entry, dict):
            continue
        ri = entry.get("row_index")
        # bool is a subclass of int: exclude it explicitly.
        if not isinstance(ri, int) or isinstance(ri, bool):
            continue
        if ri < 0 or ri >= n_valid:
            continue
        try:
            score = float(entry.get("score"))
        except (TypeError, ValueError):
            score = 0.0
        row = matrix[ri]
        dims = []
        for j, name in enumerate(col_names):
            std = stds[j]
            if std == 0.0:
                z = 0.0
            else:
                z = float((row[j] - means[j]) / std)
            dims.append({"col": name, "value": float(row[j]), "z": z})
        # Highest |z| first; stable sort, ties keep column order.
        dims.sort(key=lambda d: abs(d["z"]), reverse=True)
        dims = dims[:top_k]
        for d in dims:
            d["z"] = round(d["z"], 3)
        out.append({"row_index": int(ri), "score": score, "dims": dims})
    return out
@@ -0,0 +1,93 @@
"""Tests para summarize_outlier_dims."""
from isolation_forest_outliers import isolation_forest_outliers
from summarize_outlier_dims import summarize_outlier_dims
# Shared dataset: 3 columns, 13 rows. ORIGINAL row 6 has None in "a" (it is
# dropped), so ORIGINAL row 10 -- with an extreme value in "c" -- lands at
# VALID index 9 (not 10). This checks that None rows are skipped.
A = [0.1, 0.2, -0.1, 0.0, 0.3, -0.2, None, 0.15, -0.05, 0.25, 0.2, -0.3, 0.1]
B = [1.0, 1.1, 0.9, 1.2, 0.8, 1.0, 1.3, 1.1, 0.95, 1.05, 0.9, 1.15, 1.0]
C = [5.0, 5.2, 4.8, 5.1, 4.9, 5.0, 5.3, 4.95, 5.05, 4.9, 500.0, 5.1, 5.0]
RAW = {"a": A, "b": B, "c": C}
# Original -> valid map (skipping original 6):
# orig: 0 1 2 3 4 5 7 8 9 10 11 12
# valid: 0 1 2 3 4 5 6 7 8 9 10 11
# => the extreme in "c" (original 10) sits at valid index 9.
EXTREME_VALID_INDEX = 9
def test_row_index_skips_none_rows():
# Direct mapping (independent of IsolationForest randomness): valid index 9
# must correspond to the row with c == 500, which shows that the None in
# original row 6 was skipped correctly.
summary = summarize_outlier_dims(
RAW, [{"row_index": EXTREME_VALID_INDEX, "score": -0.5}], top_k=3
)
assert len(summary) == 1
entry = summary[0]
assert entry["row_index"] == EXTREME_VALID_INDEX
assert entry["score"] == -0.5
# The dominant dimension is "c", with its extreme value and a high |z|.
top = entry["dims"][0]
assert top["col"] == "c"
assert top["value"] == 500.0
assert abs(top["z"]) > 2.0
# top_k is respected: at most 3 dims.
assert len(entry["dims"]) <= 3
def test_extreme_row_flagged_via_isolation():
# Real integration: detect outliers and then explain them.
result = isolation_forest_outliers(RAW, contamination=0.1)
assert "note" not in result
outlier_rows = result["outlier_rows"]
assert outlier_rows # at least one outlier
summary = summarize_outlier_dims(RAW, outlier_rows, top_k=3)
# The summary is parallel to outlier_rows (every index is in range).
assert len(summary) == len(outlier_rows)
by_index = {e["row_index"]: e for e in summary}
# The extreme point must be among the detected outliers...
assert EXTREME_VALID_INDEX in by_index
# ...and its top dimension must be "c" (where it deviates by many sigmas).
extreme = by_index[EXTREME_VALID_INDEX]
assert extreme["dims"][0]["col"] == "c"
assert abs(extreme["dims"][0]["z"]) > 2.0
def test_out_of_range_row_index_is_ignored():
# Out-of-range indices are skipped instead of raising.
summary = summarize_outlier_dims(
RAW,
[
{"row_index": 999, "score": -1.0},
{"row_index": -1, "score": -1.0},
{"row_index": EXTREME_VALID_INDEX, "score": -0.5},
],
top_k=2,
)
# Only the valid index survives; the other two are discarded.
assert len(summary) == 1
assert summary[0]["row_index"] == EXTREME_VALID_INDEX
assert len(summary[0]["dims"]) <= 2
def test_degrades_to_empty_on_invalid_inputs():
# Empty raw_numeric + empty outlier_rows.
assert summarize_outlier_dims({}, [], 3) == []
# raw_numeric is not a dict.
assert summarize_outlier_dims("not a dict", [{"row_index": 0}], 3) == []
# outlier_rows is not a list.
assert summarize_outlier_dims(RAW, "not a list", 3) == []
# No numeric columns (all values are strings) -> [].
assert summarize_outlier_dims(
{"s": ["x", "y", "z"]}, [{"row_index": 0, "score": -1.0}], 3
) == []
# Malformed entries inside outlier_rows are ignored (they do not raise).
assert summarize_outlier_dims(
RAW, ["nope", 42, {"no_row_index": 1}], 3
) == []