feat(eda): capítulo OUTLIERS — valores atípicos univariantes + multivariantes
Nuevo capítulo dedicado `outliers` para el motor AutomaticEDA que reúne y profundiza en un solo sitio el análisis de valores atípicos, hoy disperso entre `num_distr` (conteo por columna) y `modelos` (IsolationForest). Se registra en `chapters_registry.py` entre `missingness` y `correlacion` (bloque de calidad de datos: calidad → missingness → outliers). Contenido del capítulo: - Resumen univariante por columna: nº y % de atípicos por Tukey (1.5·IQR) y por z-score (|z| > 3), con vallas inferior/superior y valores extremos. Ordenado por contaminación y marcando las columnas más afectadas. Reusa las funciones del registry `build_boxplot_stats` (vallas desde los percentiles del profile) y `detect_outliers` (regla z-score sobre la muestra cruda de `ctx`). - Boxplots de Tukey de las columnas más contaminadas (caja, bigotes y puntos atípicos), delegados a la función nueva `build_boxplots_figure`. - Multivariante: filas anómalas considerando todas las columnas a la vez con `isolation_forest_outliers` — nº y % de filas, las más anómalas con su score y las dimensiones que las hacen raras (top columnas por |z|, vía la función nueva `summarize_outlier_dims`). El detector se corre en vivo sobre `raw_numeric` para que el indexado de filas coincida exactamente con el de las dimensiones; cae al bloque precomputado del perfil cuando no hay muestra cruda (preset lite). - Interpretación exploratoria: un atípico no es necesariamente un error (distingue error de dato vs dato real extremo) y recomendaciones (revisar, winsorizar o re-expresar, enlazando con la re-expresión de Tukey del perfil). Términos clicables registrados en el glosario compartido: `outlier`, `tukey_fence`, `zscore`, `isolation_forest`. Funciones nuevas del registry (dominio datascience, grupo eda): - `build_boxplots_figure_py_datascience` (figure helper, impura) - `summarize_outlier_dims_py_datascience` (pura) El capítulo se activa con ≥1 columna numérica y devuelve None en su ausencia; lee todo defensivo y nunca lanza. Tests: capítulo (golden + edges + error path + render PDF/PPTX) y ambas funciones nuevas. Suite de no-regresión de AutomaticEDA verde. Verificado end-to-end con el dataset Titanic (Fare/Parch/SibSp como las columnas más contaminadas). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This commit is contained in:
@@ -0,0 +1,250 @@
|
||||
"""Impure EDA helper: a single figure of horizontal Tukey boxplots (`eda` group).
|
||||
|
||||
Draws, in one ``matplotlib.figure.Figure``, a stack of horizontal Tukey boxplots
|
||||
(one per column) using ``ax.bxp``: each carries its box (Q1–Q3), whiskers (up to
|
||||
1.5·IQR), the median line and its outlier points. It consumes the output of the
|
||||
pure registry function ``build_boxplot_stats`` (one ``box`` dict per column) plus
|
||||
an optional list of raw outlier values per column; it never recomputes anything.
|
||||
|
||||
It is the "small-multiples" companion of ``num_distr`` (which draws one
|
||||
histogram+boxplot per column): here every column shares a single figure so the
|
||||
caller can show, at a glance, *which* columns are the most contaminated by
|
||||
outliers (the caller passes them already ordered by contamination).
|
||||
|
||||
Impure because it touches matplotlib's rendering machinery. It uses the headless
|
||||
Agg backend and the object-oriented ``Figure`` API (no ``pyplot``) so it leaks no
|
||||
global state and is safe to call repeatedly from a report renderer. It is fully
|
||||
defensive and NEVER raises: invalid entries are skipped and, if nothing valid
|
||||
remains, it returns a placeholder figure carrying a centered "(sin boxplots)".
|
||||
"""
|
||||
|
||||
import matplotlib
|
||||
|
||||
matplotlib.use("Agg")
|
||||
|
||||
from matplotlib.figure import Figure # noqa: E402
|
||||
|
||||
# Blue palette shared with the ``num_distr`` chapter so the report stays coherent.
|
||||
_BOX_FACE = "#9ec6df" # box fill.
|
||||
_BOX_EDGE = "#5b8aa6" # box / whisker / cap border.
|
||||
_MEDIAN = "#2e8b57" # median line (sea green).
|
||||
_OUTLIER = "#c0392b" # outlier points (soft red).
|
||||
# Muted gray for the placeholder / fallback message text.
|
||||
_MUTED_TEXT = "#5f6b7a"
|
||||
# Soft red for the error fallback message.
|
||||
_ERROR_TEXT = "#b00020"
|
||||
|
||||
|
||||
def _num(value):
|
||||
"""Coerce ``value`` to float defensively; None for None/bool/non-numeric/NaN."""
|
||||
# bool is a subclass of int; a stat value is never a real bool, so treat
|
||||
# True/False as missing instead of silently coercing to 1.0/0.0.
|
||||
if value is None or isinstance(value, bool):
|
||||
return None
|
||||
try:
|
||||
f = float(value)
|
||||
except (TypeError, ValueError):
|
||||
return None
|
||||
if f != f: # NaN guard.
|
||||
return None
|
||||
return f
|
||||
|
||||
|
||||
def _placeholder_figure(message: str, color: str = _MUTED_TEXT) -> "Figure":
|
||||
"""Return a fallback ``Figure`` carrying a single centered message."""
|
||||
fig = Figure(figsize=(7.0, 2.4), dpi=150)
|
||||
ax = fig.add_subplot(111)
|
||||
ax.axis("off")
|
||||
ax.text(
|
||||
0.5,
|
||||
0.5,
|
||||
message,
|
||||
ha="center",
|
||||
va="center",
|
||||
fontsize=12,
|
||||
color=color,
|
||||
wrap=True,
|
||||
transform=ax.transAxes,
|
||||
)
|
||||
fig.tight_layout()
|
||||
return fig
|
||||
|
||||
|
||||
def build_boxplots_figure(
|
||||
boxes: list,
|
||||
title: str = "",
|
||||
max_boxes: int = 12,
|
||||
) -> "matplotlib.figure.Figure":
|
||||
"""Build one figure of stacked horizontal Tukey boxplots (one per column).
|
||||
|
||||
For each entry the function builds a ``bxp`` stats record (``med, q1, q3,
|
||||
whislo, whishi, fliers, label``) from its ``box`` sub-dict (the output of
|
||||
``build_boxplot_stats``) and draws all of them as horizontal boxplots sharing
|
||||
the X axis, top-to-bottom in the order received (the caller is expected to
|
||||
pass them already sorted by contamination).
|
||||
|
||||
Outliers are shown two ways:
|
||||
|
||||
- If an entry carries a ``fliers`` list (the raw out-of-fence values), they
|
||||
are drawn as red points via ``ax.bxp(..., showfliers=True)``.
|
||||
- If ``fliers`` is ``None``/absent, the raw values are unknown, so only the
|
||||
extremes are marked: a red point at ``box["min"]`` when
|
||||
``box["has_low_outliers"]`` and at ``box["max"]`` when
|
||||
``box["has_high_outliers"]`` (same convention as ``num_distr``).
|
||||
|
||||
The function is fully defensive and NEVER raises. Entries that are not dicts,
|
||||
lack a ``box`` dict, or miss any of ``q1``/``median``/``q3`` are skipped. If
|
||||
after filtering no valid box remains it returns a placeholder ``Figure`` with
|
||||
a centered "(sin boxplots)"; any unexpected error is caught and turned into a
|
||||
fallback figure carrying the error text. It always returns a ``Figure``.
|
||||
|
||||
Args:
|
||||
boxes: List of dicts ``{"name": str, "box": dict, "fliers": list|None}``.
|
||||
``box`` is exactly the output of ``build_boxplot_stats`` (read with
|
||||
``.get``: ``q1, median, q3, whisker_lo, whisker_hi, min, max,
|
||||
has_low_outliers, has_high_outliers, ...``). ``fliers`` is the
|
||||
optional list of raw outlier values; when present they are plotted,
|
||||
otherwise only the extremes are marked.
|
||||
title: Figure title (``fig.suptitle``). Empty => no title. When the list
|
||||
is longer than ``max_boxes`` a "(mostrando N de M)" note is appended.
|
||||
max_boxes: Draw at most the first ``max_boxes`` entries (default 12). The
|
||||
rest are dropped but their omission is surfaced in the title note, so
|
||||
the truncation is never silent.
|
||||
|
||||
Returns:
|
||||
A ``matplotlib.figure.Figure`` with a single Axes holding the horizontal
|
||||
boxplots (height adaptive to the box count so none overlap). The caller is
|
||||
responsible for rasterizing/closing it; this function never shows nor
|
||||
saves it.
|
||||
"""
|
||||
try:
|
||||
if not isinstance(boxes, (list, tuple)) or len(boxes) == 0:
|
||||
return _placeholder_figure("(sin boxplots)")
|
||||
|
||||
total = len(boxes)
|
||||
|
||||
# Cap the number of boxes; tolerate a non-int / non-positive max_boxes.
|
||||
try:
|
||||
cap = int(max_boxes)
|
||||
except (TypeError, ValueError):
|
||||
cap = 12
|
||||
if cap <= 0:
|
||||
cap = 12
|
||||
candidates = list(boxes)[:cap]
|
||||
|
||||
stats_list = [] # bxp stats records, in draw order.
|
||||
labels = [] # Y tick labels (column names).
|
||||
manual_markers = [] # (position, box) for entries without raw fliers.
|
||||
any_fliers = False # whether to enable showfliers in the bxp call.
|
||||
|
||||
for entry in candidates:
|
||||
if not isinstance(entry, dict):
|
||||
continue
|
||||
box = entry.get("box")
|
||||
if not isinstance(box, dict):
|
||||
continue
|
||||
|
||||
q1 = _num(box.get("q1"))
|
||||
med = _num(box.get("median"))
|
||||
q3 = _num(box.get("q3"))
|
||||
# Without the three quartiles a boxplot cannot be drawn — skip it.
|
||||
if q1 is None or med is None or q3 is None:
|
||||
continue
|
||||
|
||||
# Whisker extremes fall back to the quartiles when missing.
|
||||
whislo = _num(box.get("whisker_lo"))
|
||||
whishi = _num(box.get("whisker_hi"))
|
||||
if whislo is None:
|
||||
whislo = q1
|
||||
if whishi is None:
|
||||
whishi = q3
|
||||
|
||||
name = entry.get("name")
|
||||
label = "" if name is None else str(name)
|
||||
|
||||
position = len(stats_list) + 1 # bxp positions are 1-indexed.
|
||||
fliers_raw = entry.get("fliers")
|
||||
if isinstance(fliers_raw, (list, tuple)):
|
||||
fliers = [v for v in (_num(x) for x in fliers_raw) if v is not None]
|
||||
if fliers:
|
||||
any_fliers = True
|
||||
else:
|
||||
# Raw values unknown: draw no bxp fliers, mark min/max by hand.
|
||||
fliers = []
|
||||
manual_markers.append((position, box))
|
||||
|
||||
stats_list.append({
|
||||
"med": med,
|
||||
"q1": q1,
|
||||
"q3": q3,
|
||||
"whislo": whislo,
|
||||
"whishi": whishi,
|
||||
"fliers": fliers,
|
||||
"label": label,
|
||||
})
|
||||
labels.append(label)
|
||||
|
||||
if not stats_list:
|
||||
return _placeholder_figure("(sin boxplots)")
|
||||
|
||||
n = len(stats_list)
|
||||
positions = list(range(1, n + 1))
|
||||
|
||||
# Height grows with the box count so none of them overlap.
|
||||
height = max(2.0, 0.5 * n + 1.0)
|
||||
fig = Figure(figsize=(7.0, height), dpi=150)
|
||||
ax = fig.add_subplot(111)
|
||||
|
||||
bxp_kw = dict(
|
||||
showfliers=any_fliers, widths=0.5, patch_artist=True,
|
||||
boxprops={"facecolor": _BOX_FACE, "edgecolor": _BOX_EDGE},
|
||||
medianprops={"color": _MEDIAN, "linewidth": 1.6},
|
||||
whiskerprops={"color": _BOX_EDGE},
|
||||
capprops={"color": _BOX_EDGE},
|
||||
flierprops={"marker": "o", "markersize": 3.5,
|
||||
"markerfacecolor": _OUTLIER, "markeredgecolor": _OUTLIER,
|
||||
"linestyle": "none"})
|
||||
try:
|
||||
# ``orientation`` is the current API; older matplotlib uses ``vert``.
|
||||
try:
|
||||
ax.bxp(stats_list, positions=positions,
|
||||
orientation="horizontal", **bxp_kw)
|
||||
except TypeError:
|
||||
ax.bxp(stats_list, positions=positions, vert=False, **bxp_kw)
|
||||
except Exception: # noqa: BLE001 — never let bxp kill the whole figure.
|
||||
ax.text(0.5, 0.5, "(boxplot no disponible)", ha="center",
|
||||
va="center", fontsize=10, color=_MUTED_TEXT,
|
||||
transform=ax.transAxes)
|
||||
|
||||
# For entries without raw fliers, mark only the out-of-fence extremes.
|
||||
for position, box in manual_markers:
|
||||
mn = _num(box.get("min"))
|
||||
mx = _num(box.get("max"))
|
||||
if box.get("has_low_outliers") and mn is not None:
|
||||
ax.plot([mn], [position], marker="o", markersize=3.5,
|
||||
color=_OUTLIER, zorder=5)
|
||||
if box.get("has_high_outliers") and mx is not None:
|
||||
ax.plot([mx], [position], marker="o", markersize=3.5,
|
||||
color=_OUTLIER, zorder=5)
|
||||
|
||||
# Pin the Y tick labels explicitly so they work across matplotlib
|
||||
# versions regardless of whether ``bxp`` consumed the ``label`` key.
|
||||
ax.set_yticks(positions)
|
||||
ax.set_yticklabels(labels, fontsize=8)
|
||||
ax.set_xlabel("valor", fontsize=9)
|
||||
ax.tick_params(labelsize=7)
|
||||
ax.margins(y=0.15)
|
||||
for spine in ("top", "right"):
|
||||
ax.spines[spine].set_visible(False)
|
||||
|
||||
# Surface truncation in the title instead of silently dropping boxes.
|
||||
note = f"(mostrando {n} de {total})" if total > cap else ""
|
||||
heading = " ".join(p for p in (title, note) if p)
|
||||
if heading:
|
||||
fig.suptitle(heading, fontsize=12, x=0.02, ha="left")
|
||||
|
||||
fig.tight_layout()
|
||||
return fig
|
||||
except Exception as exc: # noqa: BLE001 — never raise from a figure builder.
|
||||
return _placeholder_figure(
|
||||
f"error al dibujar boxplots: {exc}", color=_ERROR_TEXT)
|
||||
Reference in New Issue
Block a user