From 6f88f184f11513278eb0f20ae2d26e32021b86be Mon Sep 17 00:00:00 2001
From: Egutierrez
Date: Tue, 30 Jun 2026 21:12:40 +0200
Subject: [PATCH] feat(eda): OUTLIERS chapter, univariate + multivariate outliers
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

New dedicated `outliers` chapter for the AutomaticEDA engine. It gathers
and deepens, in one place, the outlier analysis that is currently
scattered across `num_distr` (per-column count) and `modelos`
(IsolationForest). It is registered in `chapters_registry.py` between
`missingness` and `correlacion` (data-quality block: calidad →
missingness → outliers).

Chapter contents:
- Per-column univariate summary: count and % of outliers by Tukey
  (1.5·IQR) and by z-score (|z| > 3), with lower/upper fences and
  extreme values. Sorted by contamination, flagging the most affected
  columns. Reuses the registry functions `build_boxplot_stats` (fences
  from the profile percentiles) and `detect_outliers` (z-score rule on
  the raw sample in `ctx`).
- Tukey boxplots of the most contaminated columns (box, whiskers and
  outlier points), delegated to the new function
  `build_boxplots_figure`.
- Multivariate: anomalous rows considering all columns at once with
  `isolation_forest_outliers`: count and % of rows, the most anomalous
  rows with their score, and the dimensions that make them rare (top
  columns by |z|, via the new function `summarize_outlier_dims`). The
  detector runs live on `raw_numeric` so that the row indexing matches
  the dimension breakdown exactly; it falls back to the profile's
  precomputed block when there is no raw sample (lite preset).
- Exploratory interpretation: an outlier is not necessarily an error
  (it distinguishes a data error from a genuine extreme value), plus
  recommendations (inspect, winsorize or re-express, linking to the
  Tukey re-expression in the profile).
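
For reviewers, a minimal sketch of the two univariate rules named above
(Tukey 1.5·IQR and |z| > 3). It is illustrative only: `tukey_fences` and
`count_outliers` are hypothetical names, not the registry's
`build_boxplot_stats` / `detect_outliers`, and the quartile method
(linear interpolation) and the population standard deviation are
assumptions that may not match what the profile computes.

```python
import statistics


def tukey_fences(values, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    # "inclusive" interpolates linearly between order statistics.
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def count_outliers(values, z_thresh=3.0):
    """Count values flagged by each rule (hypothetical helper)."""
    lower, upper = tukey_fences(values)
    n_tukey = sum(1 for v in values if v < lower or v > upper)

    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population std (an assumption)
    n_z = sum(
        1 for v in values
        if std > 0 and abs((v - mean) / std) > z_thresh
    )
    return {
        "n_tukey": n_tukey,
        "n_z": n_z,
        "lower_fence": lower,
        "upper_fence": upper,
    }
```

On `list(range(1, 21)) + [100]` both rules flag only the value 100. The
two counts can diverge on heavier tails, because the z-score leans on
the mean and standard deviation that the outliers themselves inflate.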

Clickable terms registered in the shared glossary: `outlier`,
`tukey_fence`, `zscore`, `isolation_forest`.

New registry functions (datascience domain, eda group):
- `build_boxplots_figure_py_datascience` (figure helper, impure)
- `summarize_outlier_dims_py_datascience` (pure)

The chapter activates with ≥1 numeric column and returns None when there
is none; it reads everything defensively and never raises.

Tests: chapter (golden + edges + error path + PDF/PPTX render) and both
new functions. AutomaticEDA non-regression suite green. Verified
end-to-end with the Titanic dataset (Fare/Parch/SibSp as the most
contaminated columns).

Co-Authored-By: Claude Opus 4.8 (1M context)
---
 .../automatic_eda/chapters/outliers.py        | 593 ++++++++++++++++++
 .../automatic_eda/chapters/outliers_test.py   | 304 +++++++++
 .../automatic_eda/chapters_registry.py        |   1 +
 .../datascience/build_boxplots_figure.md      | 125 ++++
 .../datascience/build_boxplots_figure.py      | 250 ++++++++
 .../datascience/build_boxplots_figure_test.py | 109 ++++
 .../datascience/summarize_outlier_dims.md     |  79 +++
 .../datascience/summarize_outlier_dims.py     | 144 +++++
 .../summarize_outlier_dims_test.py            |  93 +++
 9 files changed, 1698 insertions(+)
 create mode 100644 python/functions/datascience/automatic_eda/chapters/outliers.py
 create mode 100644 python/functions/datascience/automatic_eda/chapters/outliers_test.py
 create mode 100644 python/functions/datascience/build_boxplots_figure.md
 create mode 100644 python/functions/datascience/build_boxplots_figure.py
 create mode 100644 python/functions/datascience/build_boxplots_figure_test.py
 create mode 100644 python/functions/datascience/summarize_outlier_dims.md
 create mode 100644 python/functions/datascience/summarize_outlier_dims.py
 create mode 100644 python/functions/datascience/summarize_outlier_dims_test.py

diff --git a/python/functions/datascience/automatic_eda/chapters/outliers.py b/python/functions/datascience/automatic_eda/chapters/outliers.py
new file mode 100644
index 
00000000..0522a2ca --- /dev/null +++ b/python/functions/datascience/automatic_eda/chapters/outliers.py @@ -0,0 +1,593 @@ +"""Outliers chapter (OUTLIERS) — univariate + multivariate atypical values. + +Today the analysis of atypical values is scattered across the document: the +NUM DISTR chapter mentions the per-column outlier count inside each distribution +figure, and the MODELOS chapter runs Isolation Forest as one of several cheap +models. This chapter gathers and deepens the whole outlier story in a single +place, with its interpretation: an [[term:outlier]]outlier[[/term]] is **not +necessarily an error** — it can be a legitimate, extreme but real observation — +so the reading is exploratory (what to look at), never confirmatory (what to +delete). + +Sections, in order: + +1. **Resumen univariante por columna** — for every numeric column, the number + and percentage of atypical values by two complementary criteria: Tukey's + 1.5·IQR rule ([[term:tukey_fence]]vallas de Tukey[[/term]]) and the + [[term:zscore]]z-score[[/term]] rule (|z| > 3). The most contaminated columns + are flagged. The fences come from the pure registry function + ``build_boxplot_stats`` (derived from the profile percentiles); the per-column + counts use the raw sample in ``ctx['raw_numeric']`` when available (the exact + count), degrading to the profile's own z-score counts otherwise. +2. **Boxplots** — a single figure with the Tukey boxplots of the most + contaminated columns (box, whiskers and atypical points), delegated to the + reusable registry helper ``build_boxplots_figure``. +3. **Multivariante (filas anómalas)** — rows that are atypical considering ALL + columns at once, via the registry function ``isolation_forest_outliers``: the + count and percentage of anomalous rows, the most anomalous rows with their + score, and the dimensions that make each one rare (top columns by |z|, via + ``summarize_outlier_dims``). 
Run live on ``ctx['raw_numeric']`` (the same + numeric columns ``summarize_outlier_dims`` uses, so the row indexing stays + coherent and the dimension breakdown is correct); falls back to the + precomputed ``profile['models']['outliers']`` only when no raw sample is + available (e.g. the lite preset), where no per-row breakdown is shown. +4. **Interpretación** — outlier ≠ error: how to tell a data-entry error from a + genuine extreme value, and what to do (inspect, winsorize, or re-express — + linking to the Tukey re-expression the profile already computes). + +The chapter activates whenever the table has at least one numeric column; with +no numeric column it returns ``None`` and disappears from the document. + +Reads everything defensively (``.get``) and never raises: every registry +delegation is imported lazily and degraded to an honest note on any failure. + +Contract: build_(profile, ctx) -> Chapter | None ; CHAPTER_VERSION = "x.y.z". +""" + +from __future__ import annotations + +from .. import model + +CHAPTER_VERSION = "1.0.0" +CHAPTER_ID = "outliers" +CHAPTER_TITLE = "Valores atípicos" + +# z-score threshold for the univariate z rule: |z| > 3 flags a value ~3 standard +# deviations from the mean (≈99.7% of a normal distribution lies within ±3σ). +_Z_THRESH = 3.0 +# How many columns to draw in the boxplots figure (most contaminated first) and +# how many anomalous rows to list in the multivariate table. +_TOP_BOX = 12 +_TOP_ROWS = 12 +# Cap on the raw atypical values passed as boxplot fliers, so a heavy-tailed +# column does not flood the figure with thousands of points. +_MAX_FLIERS = 200 +# How many columns flagged as "most contaminated" in the summary note. +_TOP_FLAGGED = 3 + +# Glossary terms this chapter explains (contract §11.1). Registered in the shared +# collector and marked clickable on first appearance. 
``isolation_forest`` and +# ``zscore`` may also be registered by the MODELOS chapter — ``add`` is +# idempotent (first definition wins), so registering them here is harmless and +# keeps this chapter self-contained when MODELOS does not render. +_TERM_DEFS = { + "outlier": ( + "Valor atípico (outlier)", + "Una observación que se aparta mucho del grueso de los datos. Un atípico " + "NO es necesariamente un error: puede ser un fallo de medida o de " + "registro, pero también un dato real extremo (un cliente que gasta diez " + "veces la media, un día de ventas excepcional). Por eso se señalan para " + "revisarlos, no para borrarlos automáticamente.", + ), + "tukey_fence": ( + "Vallas de Tukey (1,5·IQR)", + "Regla clásica para marcar atípicos a partir de los cuartiles: se calcula " + "el rango intercuartílico IQR = P75 − P25 y se trazan dos vallas, una " + "inferior en P25 − 1,5·IQR y otra superior en P75 + 1,5·IQR. Los valores " + "que caen fuera de esas vallas se consideran atípicos. Es robusta porque " + "se apoya en la mediana y los cuartiles, no en la media.", + ), + "zscore": ( + "z-score (puntuación típica)", + "Mide a cuántas desviaciones típicas está un valor de la media de su " + "columna: z = (valor − media) / desviación típica. Un |z| grande (aquí > " + "3) señala un valor alejado del centro. A diferencia de las vallas de " + "Tukey, el z-score usa media y desviación, así que es más sensible a la " + "presencia de los propios atípicos.", + ), + "isolation_forest": ( + "Isolation Forest (anomalías multivariantes)", + "Algoritmo de detección de anomalías que considera TODAS las columnas a " + "la vez: construye árboles que parten el espacio con cortes aleatorios y " + "mide cuántos cortes hacen falta para aislar cada fila. Las filas raras " + "se aíslan con muy pocos cortes y se marcan como atípicas según un umbral " + "de contaminación. 
Detecta combinaciones de valores poco frecuentes que " + "ninguna columna por separado revelaría.", + ), +} + + +# --------------------------------------------------------------------------- # +# Lazy registry delegations (each degrades to None / no-op on any failure). +# --------------------------------------------------------------------------- # +def _load_build_boxplot_stats(): + try: + from datascience.build_boxplot_stats import build_boxplot_stats + return build_boxplot_stats + except Exception: # noqa: BLE001 + return None + + +def _load_detect_outliers(): + # detect_outliers lives in the monolithic ``datascience.datascience`` module + # (file_path datascience.py), not in its own submodule — try both shapes. + try: + from datascience.datascience import detect_outliers + return detect_outliers + except Exception: # noqa: BLE001 + try: + from datascience import detect_outliers + return detect_outliers + except Exception: # noqa: BLE001 + return None + + +def _load_isolation_forest(): + try: + from datascience.isolation_forest_outliers import isolation_forest_outliers + return isolation_forest_outliers + except Exception: # noqa: BLE001 + return None + + +def _load_summarize_dims(): + try: + from datascience.summarize_outlier_dims import summarize_outlier_dims + return summarize_outlier_dims + except Exception: # noqa: BLE001 + return None + + +# --------------------------------------------------------------------------- # +# Defensive formatters (own copy: the chapter never imports siblings). 
+# --------------------------------------------------------------------------- # +def _fmt_num(value, decimals: int = 3) -> str: + if value is None: + return "—" + if isinstance(value, bool): + return "sí" if value else "no" + if isinstance(value, int): + return f"{value:,}".replace(",", ".") + if isinstance(value, float): + if value != value: # NaN + return "—" + if value in (float("inf"), float("-inf")): + return str(value) + text = f"{value:.{decimals}f}".rstrip("0").rstrip(".") + return text if text else "0" + return model._safe_str(value) + + +def _fmt_int(value) -> str: + if value is None: + return "—" + try: + return f"{int(round(float(value))):,}".replace(",", ".") + except (TypeError, ValueError): + return model._safe_str(value) + + +def _fmt_pct(value, decimals: int = 2) -> str: + """Format an already-0-100 value as a percentage. None -> placeholder.""" + if value is None: + return "—" + try: + return f"{float(value):.{decimals}f}%" + except (TypeError, ValueError): + return model._safe_str(value) + + +def _term(mark: bool, key: str, text: str) -> str: + return f"[[term:{key}]]{text}[[/term]]" if mark else text + + +def _is_dict(v) -> bool: + return isinstance(v, dict) + + +# --------------------------------------------------------------------------- # +# Profile reads. 
+# --------------------------------------------------------------------------- # +def _numeric_columns(profile: dict) -> list: + """Return [(name, numeric_dict)] for numeric columns with usable stats.""" + out = [] + for col in profile.get("columns") or []: + if not isinstance(col, dict): + continue + if col.get("inferred_type") != "numeric": + continue + num = col.get("numeric") + if not isinstance(num, dict) or not num: + continue + if num.get("mean") is None and num.get("median") is None: + continue + out.append((col.get("name") or "(columna)", num)) + return out + + +def _clean_values(raw): + """Return the finite float values of a raw column list (drop None/NaN/inf).""" + if not isinstance(raw, (list, tuple)): + return None + vals = [] + for v in raw: + if v is None or isinstance(v, bool): + continue + try: + f = float(v) + except (TypeError, ValueError): + continue + if f != f or f in (float("inf"), float("-inf")): + continue + vals.append(f) + return vals + + +# --------------------------------------------------------------------------- # +# Per-column univariate summary. +# --------------------------------------------------------------------------- # +def _univariate_row(name, numeric, raw_vals, box_fn, detect_fn): + """Compute one univariate summary row + boxplot inputs for a column. + + Returns a dict with the table cells and, when raw values are available, the + exact Tukey/z counts and the list of atypical (flier) values; otherwise it + degrades to the profile's own z-score counts and the fence flags. + """ + box = {} + if box_fn is not None: + try: + box = box_fn(numeric) or {} + except Exception: # noqa: BLE001 + box = {} + lf = box.get("lower_fence") + uf = box.get("upper_fence") + + vals = _clean_values(raw_vals) + n_tukey = pct_tukey = None + n_z = pct_z = None + low_extreme = high_extreme = None + fliers = [] + contamination = None # metric used to rank columns (prefer Tukey %). 
+ + if vals: + n = len(vals) + tukey_out = [] + for v in vals: + below = (lf is not None and v < lf) + above = (uf is not None and v > uf) + if below or above: + tukey_out.append(v) + n_tukey = len(tukey_out) + pct_tukey = 100.0 * n_tukey / n if n else None + if tukey_out: + low_extreme = min(tukey_out) + high_extreme = max(tukey_out) + fliers = tukey_out[:_MAX_FLIERS] + # z-score rule via the registry function (returns parallel bools). + if detect_fn is not None: + try: + flags = detect_fn(vals, _Z_THRESH) or [] + n_z = int(sum(1 for b in flags if b)) + pct_z = 100.0 * n_z / n if n else None + except Exception: # noqa: BLE001 + n_z = pct_z = None + contamination = pct_tukey + else: + # Degrade: no raw sample for this column. The profile's own outlier + # count/pct come from the z-score block (build_boxplot_stats note); the + # Tukey count is unknown, only the fence flags are. + n_z = numeric.get("n_outliers") + pct_z = numeric.get("outlier_pct") + if box.get("has_low_outliers") and box.get("min") is not None: + low_extreme = box.get("min") + if box.get("has_high_outliers") and box.get("max") is not None: + high_extreme = box.get("max") + contamination = pct_z if isinstance(pct_z, (int, float)) else None + + # Compact "extremos atípicos" cell: down/up arrows for the low/high tail. 
+ extremes = [] + if low_extreme is not None: + extremes.append(f"↓ {_fmt_num(low_extreme)}") + if high_extreme is not None: + extremes.append(f"↑ {_fmt_num(high_extreme)}") + extremes_cell = " ".join(extremes) if extremes else "—" + + return { + "name": model._safe_str(name), + "n_tukey": n_tukey, + "pct_tukey": pct_tukey, + "n_z": n_z, + "pct_z": pct_z, + "lower_fence": lf, + "upper_fence": uf, + "extremes": extremes_cell, + "box": box, + "fliers": fliers, + "has_raw": bool(vals), + "contamination": contamination if isinstance(contamination, (int, float)) else -1.0, + } + + +def _univariate_table(rows: list) -> model.DataTable: + header = ["Columna", "Atípicos Tukey", "% Tukey", "Atípicos z", "% z", + "Valla inf.", "Valla sup.", "Extremos atípicos"] + table_rows = [] + for r in rows: + table_rows.append([ + r["name"], + _fmt_int(r["n_tukey"]) if r["n_tukey"] is not None else "—", + _fmt_pct(r["pct_tukey"]) if r["pct_tukey"] is not None else "—", + _fmt_int(r["n_z"]) if r["n_z"] is not None else "—", + _fmt_pct(r["pct_z"]) if r["pct_z"] is not None else "—", + _fmt_num(r["lower_fence"]), + _fmt_num(r["upper_fence"]), + r["extremes"], + ]) + return model.DataTable( + header=header, rows=table_rows, + title="Valores atípicos por columna", + note="Tukey = fuera de las vallas 1,5·IQR · z = |z-score| > 3 · " + "ordenado de más a menos contaminada") + + +# --------------------------------------------------------------------------- # +# Multivariate (Isolation Forest) section. +# --------------------------------------------------------------------------- # +def _resolve_multivariate(profile: dict, ctx: dict, raw_numeric): + """Return (outliers_dict_or_None, source). 
+ + Prefers a LIVE Isolation Forest over ``raw_numeric`` so the detector and + ``summarize_outlier_dims`` use EXACTLY the same numeric columns and the same + valid-row indexing — otherwise the precomputed ``profile['models'] + ['outliers']`` (run by MODELOS over a possibly different column subset) would + yield ``row_index`` values that no longer point at the rows + ``summarize_outlier_dims`` reconstructs, mislabelling the "dimensions that + make each row rare". Falls back to the precomputed block when no raw sample + is available (e.g. the lite preset drops ``raw_numeric``).""" + if _is_dict(raw_numeric) and raw_numeric: + iso = _load_isolation_forest() + if iso is not None: + try: + out = iso(raw_numeric) + if _is_dict(out) and out.get("n_outliers") is not None and out.get("n_rows_used"): + return out, "live" + except Exception: # noqa: BLE001 + pass + # Fallback: the model the MODELOS chapter already computed (no raw sample to + # recompute against, so no per-row dimension breakdown either). + models = profile.get("models") if _is_dict(profile.get("models")) else {} + pre = models.get("outliers") if _is_dict(models) else None + if _is_dict(pre) and pre.get("n_outliers") is not None and pre.get("n_rows_used"): + return pre, "precomputed" + return None, "none" + + +def _multivariate_blocks(outliers: dict, raw_numeric, mark: bool) -> list: + isof = _term(mark, "isolation_forest", "**Isolation Forest**") + blocks = [ + model.Heading(text="Filas atípicas (multivariante)", level=2), + model.Markdown(text=( + f"Hasta aquí cada columna se ha mirado por separado. {isof} busca " + "filas raras considerando **todas las columnas a la vez**: una fila " + "puede ser normal en cada variable y aun así ser atípica por la " + "**combinación** de sus valores (p. ej. una edad baja con una tarifa " + "muy alta). 
La tabla resume cuántas filas se marcaron y el umbral de " + "decisión.")), + model.KVTable(rows=[ + ("Filas analizadas", _fmt_int(outliers.get("n_rows_used"))), + ("Columnas consideradas", _fmt_int(outliers.get("n_features"))), + ("Filas atípicas", _fmt_int(outliers.get("n_outliers"))), + ("% filas atípicas", _fmt_pct(outliers.get("outlier_pct"))), + ("Umbral de decisión", _fmt_num(outliers.get("threshold"), 4)), + ], title="Anomalías multivariantes"), + ] + + rows_in = outliers.get("outlier_rows") or [] + if not rows_in: + return blocks + + # Enrich each anomalous row with the dimensions that make it rare, when the + # raw sample is available (summarize_outlier_dims reconstructs the same + # valid-row indexing as isolation_forest_outliers). + dims_by_row = {} + if _is_dict(raw_numeric) and raw_numeric: + summ = _load_summarize_dims() + if summ is not None: + try: + enriched = summ(raw_numeric, rows_in, top_k=3) or [] + for e in enriched: + if _is_dict(e) and e.get("row_index") is not None: + dims_by_row[e.get("row_index")] = e.get("dims") or [] + except Exception: # noqa: BLE001 + dims_by_row = {} + + has_dims = bool(dims_by_row) + header = ["Fila (entre válidas)", "Score"] + if has_dims: + header.append("Dimensiones que la hacen rara (col = valor, z)") + table_rows = [] + for r in rows_in[:_TOP_ROWS]: + if not _is_dict(r): + continue + ridx = r.get("row_index") + cells = [_fmt_int(ridx), _fmt_num(r.get("score"), 4)] + if has_dims: + dims = dims_by_row.get(ridx) or [] + parts = [] + for d in dims: + if not _is_dict(d): + continue + parts.append( + f"{model._safe_str(d.get('col'))} = {_fmt_num(d.get('value'))} " + f"(z {_fmt_num(d.get('z'), 2)})") + cells.append("; ".join(parts) if parts else "—") + table_rows.append(cells) + + if table_rows: + shown = len(table_rows) + total = outliers.get("n_outliers") + note = "las filas más anómalas primero (score más bajo = más rara)" + if isinstance(total, int) and total > shown: + note += f" — top {shown} de {total}" + if 
not has_dims: + note += (" · no se pudo recuperar la muestra cruda para explicar las " + "dimensiones de cada fila") + blocks.append(model.DataTable( + header=header, rows=table_rows, + title="Filas más atípicas", note=note)) + return blocks + + +# --------------------------------------------------------------------------- # +# Interpretation section. +# --------------------------------------------------------------------------- # +def _interpretation_block(mark: bool) -> model.Markdown: + outlier = _term(mark, "outlier", "atípico") + text = ( + f"**Un {outlier} no es necesariamente un error.** Conviene distinguir " + "dos casos antes de actuar:\n\n" + "- **Error de dato** (medida, registro o unidad equivocada): una edad de " + "200 años, un importe negativo donde no puede haberlo, un decimal " + "desplazado. Estos sí se corrigen o se eliminan, idealmente en el origen.\n" + "- **Dato real extremo**: una observación legítima de la cola de la " + "distribución (un cliente que gasta mucho más, una tarifa de lujo, un día " + "de ventas excepcional). Borrarla sesga el análisis y oculta información " + "valiosa.\n\n" + "**Qué hacer.** Primero, **revisar** los valores señalados arriba contra " + "su origen para decidir cuál de los dos casos es. Si son errores, " + "corregirlos. Si son datos reales que distorsionan medias y modelos, hay " + "alternativas a borrarlos: **winsorizar** (recortar los extremos a un " + "percentil), o **re-expresar** la variable (por ejemplo una " + "transformación logarítmica o la escalera de re-expresión de Tukey que " + "este mismo perfil ya calcula para las columnas asimétricas), que suele " + "domar la cola sin perder ninguna fila. La elección depende del objetivo: " + "esta lectura es **exploratoria** —orienta dónde mirar—, no una regla " + "automática de limpieza.") + return model.Markdown(text=text) + + +# --------------------------------------------------------------------------- # +# Entry point. 
+# --------------------------------------------------------------------------- # +def build_outliers(profile: dict, ctx: dict): + """Build the OUTLIERS Chapter, or None if the dataset has no numeric column.""" + profile = profile or {} + ctx = ctx or {} + if not isinstance(profile, dict): + return None + + numerics = _numeric_columns(profile) + if not numerics: + return None # chapter does not apply to a dataset with no numerics. + + # Register glossary terms (if a collector is present) and mark them clickable. + glossary = ctx.get("glossary") + mark = False + if isinstance(glossary, model.GlossaryCollector): + for key, (label, definition) in _TERM_DEFS.items(): + glossary.add(key, label, definition) + mark = True + + raw_numeric = ctx.get("raw_numeric") + raw_numeric = raw_numeric if isinstance(raw_numeric, dict) else {} + + box_fn = _load_build_boxplot_stats() + detect_fn = _load_detect_outliers() + + # --- Univariate summary ------------------------------------------------- # + uni_rows = [] + for name, numeric in numerics: + uni_rows.append(_univariate_row( + name, numeric, raw_numeric.get(name), box_fn, detect_fn)) + # Rank columns by contamination (Tukey % when available, else z %). + uni_rows.sort(key=lambda r: r.get("contamination", -1.0), reverse=True) + + intro = ( + "Este capítulo reúne en un solo sitio el análisis de los **valores " + "atípicos** de la tabla, que en el resto del informe aparecen dispersos. " + f"Un {_term(mark, 'outlier', 'atípico')} es una observación que se aparta " + "mucho del grueso de los datos. Cada columna numérica se evalúa con dos " + f"criterios complementarios: las {_term(mark, 'tukey_fence', 'vallas de Tukey')} " + "(fuera de P25−1,5·IQR o P75+1,5·IQR, robusto a la propia cola) y el " + f"{_term(mark, 'zscore', 'z-score')} (|z| > 3, sensible a la media). 
La " + "tabla está ordenada de la columna más contaminada a la menos.") + + blocks = [ + model.Heading(text=CHAPTER_TITLE, level=1), + model.Markdown(text=intro), + _univariate_table(uni_rows), + ] + + # Flag the most contaminated columns explicitly. + flagged = [r["name"] for r in uni_rows + if r.get("contamination", -1.0) > 0][:_TOP_FLAGGED] + if flagged: + names = ", ".join(f"**{n}**" for n in flagged) + blocks.append(model.Markdown(text=( + f"Las columnas con mayor proporción de atípicos son {names}: " + "concentran el grueso de los valores fuera de las vallas y son las " + "primeras a revisar."))) + + # --- Boxplots figure ---------------------------------------------------- # + box_entries = [ + {"name": r["name"], "box": r["box"], "fliers": r.get("fliers")} + for r in uni_rows + if r.get("box") + ][:_TOP_BOX] + if box_entries: + def _boxplots_make(entries=box_entries): + try: + from datascience.build_boxplots_figure import build_boxplots_figure + return build_boxplots_figure( + entries, title="Boxplots de Tukey por columna", + max_boxes=_TOP_BOX) + except Exception: # noqa: BLE001 — minimal fallback figure. 
+ import matplotlib + matplotlib.use("Agg") + from matplotlib.figure import Figure + fig = Figure(figsize=(5.0, 2.2)) + ax = fig.add_subplot(111) + ax.text(0.5, 0.5, "(boxplots no disponibles)", + ha="center", va="center") + ax.axis("off") + return fig + + blocks.append(model.Group(blocks=[ + model.Heading(text="Boxplots", level=2), + model.Markdown(text=( + "Cada caja abarca del primer al tercer cuartil (P25–P75), la línea " + "interior es la mediana y los bigotes llegan hasta 1,5·IQR; los " + "puntos son los valores que caen fuera de las vallas (atípicos por " + "Tukey).")), + model.Figure( + make=_boxplots_make, + caption="Boxplots de Tukey de las columnas más contaminadas."), + ])) + + # --- Multivariate ------------------------------------------------------- # + outliers, _src = _resolve_multivariate(profile, ctx, raw_numeric) + if outliers is not None: + blocks.extend(_multivariate_blocks(outliers, raw_numeric, mark)) + else: + blocks.append(model.Heading(text="Filas atípicas (multivariante)", level=2)) + blocks.append(model.Note( + "No se pudo analizar la anomalía multivariante: hacen falta al menos " + "dos columnas numéricas y la muestra cruda (o los modelos del perfil) " + "para correr Isolation Forest.")) + + # --- Interpretation ----------------------------------------------------- # + blocks.append(model.Heading(text="Cómo interpretar los atípicos", level=2)) + blocks.append(_interpretation_block(mark)) + + return model.Chapter(id=CHAPTER_ID, title=CHAPTER_TITLE, + version=CHAPTER_VERSION, blocks=blocks) diff --git a/python/functions/datascience/automatic_eda/chapters/outliers_test.py b/python/functions/datascience/automatic_eda/chapters/outliers_test.py new file mode 100644 index 00000000..bff20166 --- /dev/null +++ b/python/functions/datascience/automatic_eda/chapters/outliers_test.py @@ -0,0 +1,304 @@ +"""Tests for the OUTLIERS chapter — DoD: golden + edges + error path. 
+ +Self-contained: builds synthetic ``numeric`` blocks + a raw_numeric sample (no +DuckDB) so the suite is fast and deterministic. Verifies that the chapter emits +the univariate per-column table, a boxplots figure, the multivariate Isolation +Forest section and the outlier≠error interpretation; that the most contaminated +column is ranked first; that a profile with no numeric column yields None; that +None/empty never raises; that the glossary terms are registered; and that the +chapter renders into both PDF and PPTX without cutting its title. +""" + +import math +import os +import re +import tempfile + +from pypdf import PdfReader + +from datascience.automatic_eda.chapters.outliers import ( + build_outliers, CHAPTER_VERSION, CHAPTER_TITLE, _TERM_DEFS, +) +from datascience.automatic_eda import model +from datascience.render_automatic_eda_pdf import render_automatic_eda_pdf +from datascience.render_automatic_eda_pptx import render_automatic_eda_pptx + + +def _percentile(sorted_vals, q): + """Linear-interpolation percentile (q in 0..1) on an already-sorted list.""" + if not sorted_vals: + return None + if len(sorted_vals) == 1: + return float(sorted_vals[0]) + pos = q * (len(sorted_vals) - 1) + lo = int(math.floor(pos)) + hi = int(math.ceil(pos)) + if lo == hi: + return float(sorted_vals[lo]) + frac = pos - lo + return float(sorted_vals[lo] * (1 - frac) + sorted_vals[hi] * frac) + + +def _col_from_values(values, nbins=10): + """Build a ``numeric`` sub-block shaped like describe_numeric's output from a + concrete list of raw values, so the profile percentiles and the raw sample + are consistent (the boxplot fences match the crudo).""" + vals = [float(v) for v in values] + s = sorted(vals) + n = len(s) + mean = sum(vals) / n + var = sum((v - mean) ** 2 for v in vals) / n + std = math.sqrt(var) + median = _percentile(s, 0.5) + p25 = _percentile(s, 0.25) + p75 = _percentile(s, 0.75) + mn, mx = s[0], s[-1] + # z-score outlier count (population), what the profile's 
n_outliers carries. + n_out = sum(1 for v in vals if std > 0 and abs((v - mean) / std) > 3.0) + width = (mx - mn) / nbins if mx > mn else 1.0 + hist = [{"lo": mn + i * width, "hi": mn + (i + 1) * width, "count": 1} + for i in range(nbins)] + return { + "min": mn, "max": mx, "mean": mean, "median": median, "std": std, + "p25": p25, "p50": median, "p75": p75, "iqr": (p75 - p25), + "n_outliers": n_out, "outlier_pct": 100.0 * n_out / n, + "distribution_type": "right-skewed", "histogram": hist, + } + + +def _fare_values(): + """A heavy-tailed column (most ~10-30, a few 200-512): clear Tukey/z outliers.""" + base = [7.0 + (i % 25) for i in range(120)] # bulk 7..31 + tail = [180.0, 210.0, 263.0, 512.0] # extreme upper tail + return base + tail + + +def _age_values(): + """A roughly symmetric column with one extreme low value.""" + base = [22.0 + (i % 40) for i in range(120)] # 22..61 + return base + [80.0, 0.5, 74.0, 1.0] + + +def _quiet_values(): + """A clean column with no atypical values.""" + return [50.0 + (i % 5) for i in range(124)] + + +def _profile_and_ctx(with_models=True, with_raw=True): + fare = _fare_values() + age = _age_values() + quiet = _quiet_values() + cols = [ + {"name": "Fare", "inferred_type": "numeric", "numeric": _col_from_values(fare)}, + {"name": "Age", "inferred_type": "numeric", "numeric": _col_from_values(age)}, + {"name": "Quiet", "inferred_type": "numeric", "numeric": _col_from_values(quiet)}, + {"name": "Sexo", "inferred_type": "categorical", + "categorical": {"top": [{"value": "male", "count": 80}]}}, + ] + profile = {"table": "titanic", "n_rows": len(fare), "n_cols": len(cols), + "columns": cols} + if with_models: + profile["models"] = { + "outliers": { + "n_outliers": 4, "outlier_pct": 3.2, + "outlier_rows": [ + {"row_index": 123, "score": -0.21}, + {"row_index": 121, "score": -0.15}, + ], + "threshold": -0.02, "n_rows_used": 124, "n_features": 3, + } + } + ctx = {} + if with_raw: + ctx["raw_numeric"] = {"Fare": fare, "Age": age, 
"Quiet": quiet} + return profile, ctx + + +def _pdf_text(path: str) -> str: + txt = "".join((pg.extract_text() or "") for pg in PdfReader(path).pages) + return re.sub(r"\s+", " ", txt) + + +def _flatten(blocks): + out = [] + for b in blocks: + if getattr(b, "kind", "") == "group": + out.extend(_flatten(getattr(b, "blocks", []) or [])) + else: + out.append(b) + return out + + +# --------------------------------------------------------------------------- # +# Golden. +# --------------------------------------------------------------------------- # +def test_golden_estructura_y_secciones(): + profile, ctx = _profile_and_ctx() + ctx["glossary"] = model.GlossaryCollector() + ch = build_outliers(profile, ctx) + assert ch is not None + assert ch.id == "outliers" + assert ch.version == CHAPTER_VERSION + + flat = _flatten(ch.blocks) + kinds = [b.kind for b in flat] + # Title heading + univariate DataTable + boxplots Figure + multivariate + # KVTable + interpretation Markdown. + assert kinds[0] == "heading" and flat[0].text == CHAPTER_TITLE + tables = [b for b in flat if b.kind == "data_table"] + titles = [t.title for t in tables] + assert any(t and "atípicos por columna" in t for t in titles) + assert any(b.kind == "figure" for b in flat), "falta la figura de boxplots" + assert any(b.kind == "kv_table" for b in flat), "falta el resumen multivariante" + + # The boxplots figure maker yields a real matplotlib figure (or its fallback). + fig = next(b for b in flat if b.kind == "figure").make() + assert fig is not None + import matplotlib.pyplot as plt + plt.close(fig) + + +def test_golden_fare_es_la_mas_contaminada(): + # The univariate table must rank Fare (heavy tail) first and report a + # non-zero Tukey percentage for it. 
+ profile, ctx = _profile_and_ctx() + ch = build_outliers(profile, ctx) + table = next(b for b in _flatten(ch.blocks) + if b.kind == "data_table" and b.title + and "atípicos por columna" in b.title) + first_col = table.rows[0][0] + assert first_col == "Fare", f"esperaba Fare primera, fue {first_col}" + # % Tukey column (index 2) of the first row must be > 0. + pct_cell = table.rows[0][2] + assert pct_cell not in ("—", "0%", "0.00%"), f"% Tukey de Fare vacío: {pct_cell}" + # The z-score rule (detect_outliers) must actually run with raw_numeric: at + # least one column reports a non-empty z count/percentage (regression guard + # for the detect_outliers import path). + z_pcts = [r[4] for r in table.rows] + assert any(c not in ("—",) for c in z_pcts), f"columna z toda vacía: {z_pcts}" + z_counts = [r[3] for r in table.rows] + assert any(c not in ("—",) for c in z_counts), f"conteo z vacío: {z_counts}" + + +def test_golden_interpretacion_outlier_no_es_error(): + profile, ctx = _profile_and_ctx() + ch = build_outliers(profile, ctx) + md = " ".join(b.text for b in _flatten(ch.blocks) if b.kind == "markdown") + assert "no es necesariamente un error" in md.lower() + # Mentions the actionable options (winsorize / re-express). + assert "winsoriz" in md.lower() + assert "re-expres" in md.lower() or "logarítmic" in md.lower() + + +def test_golden_terminos_glosario_registrados(): + profile, ctx = _profile_and_ctx() + gloss = model.GlossaryCollector() + ctx["glossary"] = gloss + build_outliers(profile, ctx) + for key in _TERM_DEFS: + assert gloss.has(key), f"término '{key}' no registrado en el glosario" + # Terms are marked clickable in the body text. + md = " ".join(b.text for b in _flatten(build_outliers(profile, ctx).blocks) + if b.kind == "markdown") + assert "[[term:outlier]]" in md and "[[term:tukey_fence]]" in md + + +# --------------------------------------------------------------------------- # +# Multivariate. 
+# --------------------------------------------------------------------------- # +def test_multivariante_live_con_raw_y_dims(): + # With a raw sample the chapter runs Isolation Forest live (over the same + # columns summarize_outlier_dims uses) and lists the anomalous rows with the + # dimensions that make each one rare. + profile, ctx = _profile_and_ctx(with_models=False, with_raw=True) + ch = build_outliers(profile, ctx) + flat = _flatten(ch.blocks) + kv = next(b for b in flat if b.kind == "kv_table") + flat_kv = " ".join(f"{k} {v}" for (k, v) in kv.rows) + assert "Filas atípicas" in flat_kv + # A non-zero number of anomalous rows is reported. + n_cell = dict(kv.rows).get("Filas atípicas") + assert n_cell not in (None, "—", "0"), f"sin filas atípicas: {n_cell}" + # The anomalous-rows table carries the per-row dimension breakdown. + tbls = [b for b in flat if b.kind == "data_table" and b.title + and "más atípicas" in b.title] + assert tbls, "falta la tabla de filas más atípicas" + assert any("hacen rara" in h for h in tbls[0].header), \ + f"falta la columna de dimensiones: {tbls[0].header}" + + +def test_multivariante_precomputed_sin_raw(): + # Without a raw sample the chapter falls back to profile['models']['outliers'] + # (lite preset path); the precomputed n_outliers (4) surfaces in the KV table. + profile, ctx = _profile_and_ctx(with_models=True, with_raw=False) + ch = build_outliers(profile, ctx) + kv = next(b for b in _flatten(ch.blocks) if b.kind == "kv_table") + assert any("4" in str(v) for (k, v) in kv.rows) + + +def test_multivariante_ausente_degrada_a_nota(): + # No models and no raw sample → an honest note, never a crash. 
+ profile, ctx = _profile_and_ctx(with_models=False, with_raw=False) + ch = build_outliers(profile, ctx) + assert ch is not None + notes = [b.text for b in _flatten(ch.blocks) if b.kind == "note"] + assert any("Isolation Forest" in n for n in notes) + + +# --------------------------------------------------------------------------- # +# Edges / error path. +# --------------------------------------------------------------------------- # +def test_edge_sin_columnas_numericas_devuelve_none(): + prof = {"columns": [{"name": "c", "inferred_type": "categorical", + "categorical": {"top": [{"value": "x", "count": 3}]}}]} + assert build_outliers(prof, {}) is None + + +def test_edge_solo_texto_sintetico_devuelve_none(): + # A text-only synthetic table (no numeric column) yields None (does not break). + prof = {"table": "notas", "n_rows": 3, "n_cols": 1, + "columns": [{"name": "comentario", "inferred_type": "text", + "text": {"n_docs": 3}}]} + assert build_outliers(prof, {}) is None + + +def test_edge_profile_none_y_vacio_no_revienta(): + assert build_outliers(None, None) is None + assert build_outliers({}, {}) is None + assert build_outliers({"columns": []}, {}) is None + + +def test_edge_sin_raw_numeric_degrada_a_perfil(): + # Without raw_numeric the chapter still builds, using the profile z-score + # counts; the univariate table exists and Tukey counts degrade to '—'. + profile, ctx = _profile_and_ctx(with_models=True, with_raw=False) + ch = build_outliers(profile, ctx) + assert ch is not None + table = next(b for b in _flatten(ch.blocks) + if b.kind == "data_table" and b.title + and "atípicos por columna" in b.title) + # z column comes from the profile; Tukey count is unknown ('—'). + assert all(len(r) == 8 for r in table.rows) + + +# --------------------------------------------------------------------------- # +# Anti-cut render. 
+# --------------------------------------------------------------------------- # +def test_render_pdf_y_pptx_incluyen_el_capitulo(): + profile, ctx = _profile_and_ctx() + # The renderers build the whole document; the chapter is reached via the + # registry. Render the chapter standalone through a one-chapter document by + # passing the profile directly (the renderers run the full chapter registry). + with tempfile.TemporaryDirectory() as d: + pdf = os.path.join(d, "out.pdf") + res_pdf = render_automatic_eda_pdf(profile, pdf, + {"write_manifest": False, "ctx": ctx}) + assert res_pdf["path"] == pdf + txt = _pdf_text(pdf) + assert CHAPTER_TITLE in txt, "el capítulo OUTLIERS no aparece en el PDF" + assert "Fare" in txt + pptx = os.path.join(d, "out.pptx") + res_pptx = render_automatic_eda_pptx(profile, pptx, + {"write_manifest": False, "ctx": ctx}) + assert res_pptx["path"] == pptx + assert res_pptx["n_slides"] >= 1 diff --git a/python/functions/datascience/automatic_eda/chapters_registry.py b/python/functions/datascience/automatic_eda/chapters_registry.py index 41097975..17d956db 100644 --- a/python/functions/datascience/automatic_eda/chapters_registry.py +++ b/python/functions/datascience/automatic_eda/chapters_registry.py @@ -34,6 +34,7 @@ CHAPTER_ORDER = [ "text_distr", # free-text / NLP distributions (non-tabular content) "calidad", # data quality "missingness", # missing-data patterns (co-occurrence of absences; MCAR/MAR) + "outliers", # atypical values: univariate (Tukey/z) + multivariate (IsolationForest) "correlacion", # correlations / associations "relaciones", # key relations: declared/candidate PK + FK (inter/intra-table) "modelos", # cheap models (PCA/KMeans/outliers) diff --git a/python/functions/datascience/build_boxplots_figure.md b/python/functions/datascience/build_boxplots_figure.md new file mode 100644 index 00000000..258d9986 --- /dev/null +++ b/python/functions/datascience/build_boxplots_figure.md @@ -0,0 +1,125 @@ +--- +id: 
build_boxplots_figure_py_datascience +name: build_boxplots_figure +kind: function +lang: py +domain: datascience +version: "1.0.0" +purity: impure +signature: "def build_boxplots_figure(boxes: list, title: str = \"\", max_boxes: int = 12) -> \"matplotlib.figure.Figure\"" +description: "Construye una unica figura matplotlib con boxplots de Tukey HORIZONTALES (uno por columna) usando ax.bxp: caja Q1-Q3, bigotes hasta 1.5*IQR, linea de mediana y puntos atipicos. Consume la salida de build_boxplot_stats (un dict box por columna, leido con .get) mas una lista opcional de outliers crudos por columna; si vienen los dibuja como puntos (showfliers), si no marca solo box[min]/box[max] cuando hay outliers de cola (igual que num_distr). Dibuja como mucho max_boxes cajas (las primeras, ya ordenadas por contaminacion por el caller) y avisa de la truncacion con (mostrando N de M). Backend Agg sin pyplot global; alto adaptativo al nº de cajas. Defensiva: omite entradas invalidas y NUNCA lanza — sin cajas validas devuelve una figura placeholder (sin boxplots). Es la version small-multiples del capitulo num_distr para responder que columnas tienen mas outliers de un vistazo." 
+tags: [eda, outliers, boxplot, tukey, iqr, bxp, matplotlib, figure, visualization, small-multiples, datascience, impure] +uses_functions: [] +uses_types: [] +returns: [] +returns_optional: false +error_type: "error_go_core" +imports: [matplotlib] +example: | + from datascience.build_boxplot_stats import build_boxplot_stats + from datascience.build_boxplots_figure import build_boxplots_figure + boxes = [ + {"name": "ingresos", "box": build_boxplot_stats({"min": 1.0, "max": 9e3, + "p25": 1e3, "median": 2e3, "p75": 3e3, "n_outliers": 7}), "fliers": None}, + {"name": "edad", "box": build_boxplot_stats({"min": 0.0, "max": 99.0, + "p25": 25.0, "median": 38.0, "p75": 52.0}), "fliers": None}, + ] + fig = build_boxplots_figure(boxes, title="Outliers por columna", max_boxes=12) +tested: true +tests: + - "test_returns_figure_with_axes" + - "test_empty_list_returns_placeholder_figure" + - "test_invalid_box_is_skipped_not_raised" + - "test_all_invalid_returns_placeholder" + - "test_raw_fliers_are_drawn" + - "test_max_boxes_truncates_and_does_not_raise" +test_file_path: "python/functions/datascience/build_boxplots_figure_test.py" +file_path: "python/functions/datascience/build_boxplots_figure.py" +params: + - name: boxes + desc: "Lista de dicts, cada uno {\"name\": str, \"box\": dict, \"fliers\": list|None}. box es EXACTAMENTE la salida de build_boxplot_stats (claves leidas con .get: q1, median, q3, whisker_lo, whisker_hi, min, max, has_low_outliers, has_high_outliers, lower_fence, upper_fence, n_outliers). fliers es la lista opcional de outliers crudos: si viene se dibuja como puntos; si es None/ausente solo se marcan los extremos box[min]/box[max] cuando hay outliers de cola. Entradas que no son dict, sin box dict, o sin q1/median/q3 se omiten. El caller las pasa ya ordenadas por contaminacion (la mayor primera)." + - name: title + desc: "Titulo de la figura (fig.suptitle, alineado a la izquierda). Vacio => sin titulo. 
Si len(boxes) > max_boxes se le anade una nota \"(mostrando N de M)\" para que la truncacion no sea silenciosa. Default \"\"." + - name: max_boxes + desc: "Numero maximo de cajas a dibujar (las primeras de la lista). Default 12. Un valor no entero o <= 0 cae a 12. Si la lista trae mas entradas, las sobrantes se descartan pero se reporta en el titulo con (mostrando N de M)." +output: "Un matplotlib.figure.Figure (figsize 7.0 x alto adaptativo = max(2.0, 0.5*n + 1.0), dpi 150) con un unico Axes que apila boxplots horizontales de Tukey (ax.bxp, orientation=horizontal con fallback vert=False), uno por columna valida, de arriba a abajo en el orden recibido. Cada caja: relleno #9ec6df, borde/bigotes/caps #5b8aa6, mediana #2e8b57, atipicos #c0392b. Etiquetas del eje Y = nombres de columna; eje X etiquetado \"valor\". Outliers dibujados desde fliers crudos (showfliers) o, si faltan, marcados en box[min]/box[max] segun has_low/high_outliers. Si no queda ninguna caja valida (lista vacia o todas invalidas) devuelve una Figure placeholder con texto centrado \"(sin boxplots)\"; cualquier error inesperado se captura y devuelve una Figure con el mensaje de error. NUNCA lanza. El caller rasteriza/cierra la figura; la funcion no la muestra ni la guarda." +--- + +## Ejemplo + +```python +import sys, os +sys.path.insert(0, os.path.join("python", "functions")) +from datascience.build_boxplot_stats import build_boxplot_stats +from datascience.build_boxplots_figure import build_boxplots_figure + +# Un `box` por columna numérica, derivado del sub-bloque `numeric` del profile +# (salida de describe_numeric). El caller los pasa ya ordenados por outlier_pct. +boxes = [ + { + "name": "ingresos", + "box": build_boxplot_stats({ + "min": 1.0, "max": 9000.0, + "p25": 1000.0, "median": 2000.0, "p75": 3000.0, + "n_outliers": 7, + }), + "fliers": None, # valores crudos desconocidos -> se marca solo el extremo. 
+ }, + { + "name": "edad", + "box": build_boxplot_stats({ + "min": 0.0, "max": 99.0, + "p25": 25.0, "median": 38.0, "p75": 52.0, + }), + "fliers": [88.0, 95.0, 99.0], # outliers crudos -> se dibujan como puntos. + }, +] + +fig = build_boxplots_figure(boxes, title="Outliers por columna", max_boxes=12) + +# El renderer del informe lo rasteriza; aquí solo persistimos para inspección. +fig.savefig("/tmp/boxplots.png") +``` + +## Cuando usarla + +Úsala en el capítulo de outliers de un informe EDA cuando quieras comparar de un +vistazo *qué columnas están más contaminadas por valores atípicos*: a diferencia +de `num_distr` (que dibuja un histograma+boxplot por columna en figuras +separadas), aquí apilas todos los boxplots horizontales en **una sola figura** +(small multiples). Primero deriva el `box` de cada columna con +`build_boxplot_stats`, ordénalas por `outlier_pct` descendente, envuélvelas como +`{"name", "box", "fliers"}` y pásaselas. Si tienes los valores crudos fuera de +las vallas, métele la lista `fliers` y se dibujarán como puntos; si no, la +función marca solo los extremos `min`/`max` cuando hay cola. + +## Gotchas + +- **Impura por matplotlib.** Toca la maquinaria de render. Usa el backend `Agg` + y la API orientada a objetos `Figure`/`add_subplot` — NUNCA `pyplot.*` aquí, + para no tocar el estado global ni filtrar figuras entre llamadas. `pyplot` NO + es thread-safe; esta función construye el `Figure` directamente, así que es + segura de llamar en bucle desde el renderer. +- **El caller cierra la figura.** Devuelve el `Figure` pero no lo muestra ni lo + guarda. Quien la consume debe rasterizarla y luego liberarla + (`matplotlib.pyplot.close(fig)`) para no acumular memoria en lotes grandes. +- **`fliers` opcional, semántica distinta.** Si pasas la lista de outliers + crudos se dibujan todos como puntos (`showfliers=True`). 
Si es `None`/ausente + los valores son desconocidos y solo se marca un punto en `box["min"]` / + `box["max"]` cuando `has_low_outliers` / `has_high_outliers` — mismo criterio + que `num_distr`. No inventes fliers a partir del profile: el `box` no trae los + valores crudos, solo si los extremos superan las vallas. +- **API de orientación de `ax.bxp`.** matplotlib reciente usa + `orientation="horizontal"`; las versiones antiguas usan `vert=False`. La + función prueba la primera y cae a la segunda en `except TypeError`, así que + funciona en ambas. Si `bxp` falla del todo, el Axes degrada a un texto + "(boxplot no disponible)" en vez de propagar. +- **Truncación visible.** `max_boxes` (default 12) limita el nº de cajas para que + ninguna se solape; si la lista trae más, las sobrantes se descartan pero se + avisa en el título con "(mostrando N de M)". Pasa las columnas ya ordenadas por + contaminación para que las descartadas sean las menos relevantes. +- **Defensiva, nunca lanza.** Lista vacía, entradas no-dict, sin `box`, o sin + `q1`/`median`/`q3` se omiten sin propagar; sin cajas válidas devuelve un + placeholder "(sin boxplots)" y cualquier error inesperado se captura en una + figura con el texto del error. No envuelvas la llamada en try/except por miedo + a un raise — no lo hay. diff --git a/python/functions/datascience/build_boxplots_figure.py b/python/functions/datascience/build_boxplots_figure.py new file mode 100644 index 00000000..579ebc49 --- /dev/null +++ b/python/functions/datascience/build_boxplots_figure.py @@ -0,0 +1,250 @@ +"""Impure EDA helper: a single figure of horizontal Tukey boxplots (`eda` group). + +Draws, in one ``matplotlib.figure.Figure``, a stack of horizontal Tukey boxplots +(one per column) using ``ax.bxp``: each carries its box (Q1–Q3), whiskers (up to +1.5·IQR), the median line and its outlier points. 
It consumes the output of the +pure registry function ``build_boxplot_stats`` (one ``box`` dict per column) plus +an optional list of raw outlier values per column; it never recomputes anything. + +It is the "small-multiples" companion of ``num_distr`` (which draws one +histogram+boxplot per column): here every column shares a single figure so the +caller can show, at a glance, *which* columns are the most contaminated by +outliers (the caller passes them already ordered by contamination). + +Impure because it touches matplotlib's rendering machinery. It uses the headless +Agg backend and the object-oriented ``Figure`` API (no ``pyplot``) so it leaks no +global state and is safe to call repeatedly from a report renderer. It is fully +defensive and NEVER raises: invalid entries are skipped and, if nothing valid +remains, it returns a placeholder figure carrying a centered "(sin boxplots)". +""" + +import matplotlib + +matplotlib.use("Agg") + +from matplotlib.figure import Figure # noqa: E402 + +# Blue palette shared with the ``num_distr`` chapter so the report stays coherent. +_BOX_FACE = "#9ec6df" # box fill. +_BOX_EDGE = "#5b8aa6" # box / whisker / cap border. +_MEDIAN = "#2e8b57" # median line (sea green). +_OUTLIER = "#c0392b" # outlier points (soft red). +# Muted gray for the placeholder / fallback message text. +_MUTED_TEXT = "#5f6b7a" +# Soft red for the error fallback message. +_ERROR_TEXT = "#b00020" + + +def _num(value): + """Coerce ``value`` to float defensively; None for None/bool/non-numeric/NaN.""" + # bool is a subclass of int; a stat value is never a real bool, so treat + # True/False as missing instead of silently coercing to 1.0/0.0. + if value is None or isinstance(value, bool): + return None + try: + f = float(value) + except (TypeError, ValueError): + return None + if f != f: # NaN guard. 
+ return None + return f + + +def _placeholder_figure(message: str, color: str = _MUTED_TEXT) -> "Figure": + """Return a fallback ``Figure`` carrying a single centered message.""" + fig = Figure(figsize=(7.0, 2.4), dpi=150) + ax = fig.add_subplot(111) + ax.axis("off") + ax.text( + 0.5, + 0.5, + message, + ha="center", + va="center", + fontsize=12, + color=color, + wrap=True, + transform=ax.transAxes, + ) + fig.tight_layout() + return fig + + +def build_boxplots_figure( + boxes: list, + title: str = "", + max_boxes: int = 12, +) -> "matplotlib.figure.Figure": + """Build one figure of stacked horizontal Tukey boxplots (one per column). + + For each entry the function builds a ``bxp`` stats record (``med, q1, q3, + whislo, whishi, fliers, label``) from its ``box`` sub-dict (the output of + ``build_boxplot_stats``) and draws all of them as horizontal boxplots sharing + the X axis, top-to-bottom in the order received (the caller is expected to + pass them already sorted by contamination). + + Outliers are shown two ways: + + - If an entry carries a ``fliers`` list (the raw out-of-fence values), they + are drawn as red points via ``ax.bxp(..., showfliers=True)``. + - If ``fliers`` is ``None``/absent, the raw values are unknown, so only the + extremes are marked: a red point at ``box["min"]`` when + ``box["has_low_outliers"]`` and at ``box["max"]`` when + ``box["has_high_outliers"]`` (same convention as ``num_distr``). + + The function is fully defensive and NEVER raises. Entries that are not dicts, + lack a ``box`` dict, or miss any of ``q1``/``median``/``q3`` are skipped. If + after filtering no valid box remains it returns a placeholder ``Figure`` with + a centered "(sin boxplots)"; any unexpected error is caught and turned into a + fallback figure carrying the error text. It always returns a ``Figure``. + + Args: + boxes: List of dicts ``{"name": str, "box": dict, "fliers": list|None}``. 
+ ``box`` is exactly the output of ``build_boxplot_stats`` (read with + ``.get``: ``q1, median, q3, whisker_lo, whisker_hi, min, max, + has_low_outliers, has_high_outliers, ...``). ``fliers`` is the + optional list of raw outlier values; when present they are plotted, + otherwise only the extremes are marked. + title: Figure title (``fig.suptitle``). Empty => no title. When the list + is longer than ``max_boxes`` a "(mostrando N de M)" note is appended. + max_boxes: Draw at most the first ``max_boxes`` entries (default 12). The + rest are dropped but their omission is surfaced in the title note, so + the truncation is never silent. + + Returns: + A ``matplotlib.figure.Figure`` with a single Axes holding the horizontal + boxplots (height adaptive to the box count so none overlap). The caller is + responsible for rasterizing/closing it; this function never shows nor + saves it. + """ + try: + if not isinstance(boxes, (list, tuple)) or len(boxes) == 0: + return _placeholder_figure("(sin boxplots)") + + total = len(boxes) + + # Cap the number of boxes; tolerate a non-int / non-positive max_boxes. + try: + cap = int(max_boxes) + except (TypeError, ValueError): + cap = 12 + if cap <= 0: + cap = 12 + candidates = list(boxes)[:cap] + + stats_list = [] # bxp stats records, in draw order. + labels = [] # Y tick labels (column names). + manual_markers = [] # (position, box) for entries without raw fliers. + any_fliers = False # whether to enable showfliers in the bxp call. + + for entry in candidates: + if not isinstance(entry, dict): + continue + box = entry.get("box") + if not isinstance(box, dict): + continue + + q1 = _num(box.get("q1")) + med = _num(box.get("median")) + q3 = _num(box.get("q3")) + # Without the three quartiles a boxplot cannot be drawn — skip it. + if q1 is None or med is None or q3 is None: + continue + + # Whisker extremes fall back to the quartiles when missing. 
+ whislo = _num(box.get("whisker_lo")) + whishi = _num(box.get("whisker_hi")) + if whislo is None: + whislo = q1 + if whishi is None: + whishi = q3 + + name = entry.get("name") + label = "" if name is None else str(name) + + position = len(stats_list) + 1 # bxp positions are 1-indexed. + fliers_raw = entry.get("fliers") + if isinstance(fliers_raw, (list, tuple)): + fliers = [v for v in (_num(x) for x in fliers_raw) if v is not None] + if fliers: + any_fliers = True + else: + # Raw values unknown: draw no bxp fliers, mark min/max by hand. + fliers = [] + manual_markers.append((position, box)) + + stats_list.append({ + "med": med, + "q1": q1, + "q3": q3, + "whislo": whislo, + "whishi": whishi, + "fliers": fliers, + "label": label, + }) + labels.append(label) + + if not stats_list: + return _placeholder_figure("(sin boxplots)") + + n = len(stats_list) + positions = list(range(1, n + 1)) + + # Height grows with the box count so none of them overlap. + height = max(2.0, 0.5 * n + 1.0) + fig = Figure(figsize=(7.0, height), dpi=150) + ax = fig.add_subplot(111) + + bxp_kw = dict( + showfliers=any_fliers, widths=0.5, patch_artist=True, + boxprops={"facecolor": _BOX_FACE, "edgecolor": _BOX_EDGE}, + medianprops={"color": _MEDIAN, "linewidth": 1.6}, + whiskerprops={"color": _BOX_EDGE}, + capprops={"color": _BOX_EDGE}, + flierprops={"marker": "o", "markersize": 3.5, + "markerfacecolor": _OUTLIER, "markeredgecolor": _OUTLIER, + "linestyle": "none"}) + try: + # ``orientation`` is the current API; older matplotlib uses ``vert``. + try: + ax.bxp(stats_list, positions=positions, + orientation="horizontal", **bxp_kw) + except TypeError: + ax.bxp(stats_list, positions=positions, vert=False, **bxp_kw) + except Exception: # noqa: BLE001 — never let bxp kill the whole figure. + ax.text(0.5, 0.5, "(boxplot no disponible)", ha="center", + va="center", fontsize=10, color=_MUTED_TEXT, + transform=ax.transAxes) + + # For entries without raw fliers, mark only the out-of-fence extremes. 
+ for position, box in manual_markers: + mn = _num(box.get("min")) + mx = _num(box.get("max")) + if box.get("has_low_outliers") and mn is not None: + ax.plot([mn], [position], marker="o", markersize=3.5, + color=_OUTLIER, zorder=5) + if box.get("has_high_outliers") and mx is not None: + ax.plot([mx], [position], marker="o", markersize=3.5, + color=_OUTLIER, zorder=5) + + # Pin the Y tick labels explicitly so they work across matplotlib + # versions regardless of whether ``bxp`` consumed the ``label`` key. + ax.set_yticks(positions) + ax.set_yticklabels(labels, fontsize=8) + ax.set_xlabel("valor", fontsize=9) + ax.tick_params(labelsize=7) + ax.margins(y=0.15) + for spine in ("top", "right"): + ax.spines[spine].set_visible(False) + + # Surface truncation in the title instead of silently dropping boxes. + note = f"(mostrando {n} de {total})" if total > cap else "" + heading = " ".join(p for p in (title, note) if p) + if heading: + fig.suptitle(heading, fontsize=12, x=0.02, ha="left") + + fig.tight_layout() + return fig + except Exception as exc: # noqa: BLE001 — never raise from a figure builder. + return _placeholder_figure( + f"error al dibujar boxplots: {exc}", color=_ERROR_TEXT) diff --git a/python/functions/datascience/build_boxplots_figure_test.py b/python/functions/datascience/build_boxplots_figure_test.py new file mode 100644 index 00000000..3cea0914 --- /dev/null +++ b/python/functions/datascience/build_boxplots_figure_test.py @@ -0,0 +1,109 @@ +"""Tests para build_boxplots_figure (boxplots horizontales de Tukey, grupo eda). + +Usa el backend Agg sin display; no muestra ni guarda figuras. Cada test cierra +explícitamente la Figure construida (matplotlib.pyplot.close) para no acumular +estado entre tests. 
+""" + +import matplotlib + +matplotlib.use("Agg") + +import matplotlib.pyplot as plt # noqa: E402 +from matplotlib.figure import Figure # noqa: E402 + +from build_boxplots_figure import build_boxplots_figure + + +def _box(name, q1, median, q3, mn, mx, low=False, high=False, fliers=None): + """Construye una entrada {name, box, fliers} con un box estilo build_boxplot_stats.""" + iqr = q3 - q1 + return { + "name": name, + "box": { + "q1": q1, + "median": median, + "q3": q3, + "iqr": iqr, + "lower_fence": q1 - 1.5 * iqr, + "upper_fence": q3 + 1.5 * iqr, + "whisker_lo": max(mn, q1 - 1.5 * iqr), + "whisker_hi": min(mx, q3 + 1.5 * iqr), + "min": mn, + "max": mx, + "has_low_outliers": low, + "has_high_outliers": high, + "n_outliers": 0, + }, + "fliers": fliers, + } + + +def test_returns_figure_with_axes(): + boxes = [ + _box("edad", 10.0, 25.0, 40.0, 1.0, 100.0, high=True), + _box("ingresos", 100.0, 200.0, 300.0, 50.0, 400.0), + _box("score", -1.0, 0.0, 1.0, -5.0, 5.0, low=True, high=True), + ] + fig = build_boxplots_figure(boxes, title="Boxplots", max_boxes=12) + assert isinstance(fig, Figure) + assert len(fig.axes) >= 1 + # Tres cajas -> tres etiquetas en el eje Y. + ax = fig.axes[0] + assert len(ax.get_yticks()) == 3 + plt.close(fig) + + +def test_empty_list_returns_placeholder_figure(): + fig = build_boxplots_figure([], title="vacío") + assert isinstance(fig, Figure) + assert len(fig.axes) >= 1 + plt.close(fig) + + +def test_invalid_box_is_skipped_not_raised(): + boxes = [ + {"name": "rota", "box": {"q1": None, "median": None, "q3": None}}, + {"name": "sin_box"}, # falta la clave box. + "no_es_dict", # entrada no-dict. + _box("buena", 1.0, 2.0, 3.0, 0.0, 10.0, high=True), + ] + fig = build_boxplots_figure(boxes) + assert isinstance(fig, Figure) + ax = fig.axes[0] + # Solo la caja válida sobrevive al filtrado. 
+ assert len(ax.get_yticks()) == 1 + plt.close(fig) + + +def test_all_invalid_returns_placeholder(): + boxes = [ + {"name": "a", "box": {"q1": None, "median": 1.0, "q3": 2.0}}, + {"name": "b"}, + ] + fig = build_boxplots_figure(boxes) + assert isinstance(fig, Figure) + assert len(fig.axes) >= 1 + plt.close(fig) + + +def test_raw_fliers_are_drawn(): + boxes = [ + _box("con_fliers", 10.0, 20.0, 30.0, 5.0, 200.0, + high=True, fliers=[150.0, 180.0, 200.0]), + ] + fig = build_boxplots_figure(boxes) + assert isinstance(fig, Figure) + assert len(fig.axes) >= 1 + plt.close(fig) + + +def test_max_boxes_truncates_and_does_not_raise(): + boxes = [_box(f"c{i}", float(i), float(i + 1), float(i + 2), + float(i - 5), float(i + 10)) for i in range(20)] + fig = build_boxplots_figure(boxes, title="muchos", max_boxes=5) + assert isinstance(fig, Figure) + ax = fig.axes[0] + # Solo se dibujan las primeras 5 cajas. + assert len(ax.get_yticks()) == 5 + plt.close(fig) diff --git a/python/functions/datascience/summarize_outlier_dims.md b/python/functions/datascience/summarize_outlier_dims.md new file mode 100644 index 00000000..b9ac5d49 --- /dev/null +++ b/python/functions/datascience/summarize_outlier_dims.md @@ -0,0 +1,79 @@ +--- +name: summarize_outlier_dims +kind: function +lang: py +domain: datascience +version: "1.0.0" +purity: pure +signature: "def summarize_outlier_dims(raw_numeric: dict, outlier_rows: list, top_k: int = 3) -> list" +description: "Explica QUE columnas hacen rara cada fila anomala detectada por isolation_forest_outliers. Para cada {row_index, score} reconstruye la fila valida (mismo filtro de columnas numericas y mismo descarte de filas con None que el detector, asi row_index coincide) y devuelve las top_k columnas de mayor |z-score| poblacional (ddof=0). Capa de explicabilidad del paso de outliers multivariante en EDA. Pura y determinista; ante entradas vacias/invalidas o sin filas validas devuelve [] sin petar." 
+tags: [eda, models, outliers, anomaly-detection, explainability, z-score, multivariate] +params: + - name: raw_numeric + desc: "dict {nombre_columna: [valores]} alineado por fila (como ctx['raw_numeric'] del motor AutomaticEDA). Solo se usan columnas con todos los valores numericos (None permitido por fila; bool/str/NaN/Inf descartan la columna entera) — filtro IDENTICO al de isolation_forest_outliers para que row_index coincida." + - name: outlier_rows + desc: "Lista de {row_index, score} tal cual la devuelve isolation_forest_outliers. row_index cuenta SOLO las filas validas (sin None) en orden de aparicion, base 0. Entradas fuera de rango o malformadas se ignoran defensivamente." + - name: top_k + desc: "Numero de columnas (las de mayor |z-score|) a reportar por outlier. Default 3. Valores invalidos (no-int, bool, <1) caen a 3." +output: "Lista paralela a outlier_rows (mismo orden) de dicts {row_index: int, score: float, dims: [{col: str, value: float, z: float}, ...]}. dims trae hasta top_k columnas ordenadas por |z| descendente, con z (z-score poblacional, ddof=0) redondeado a 3 decimales; si una columna tiene std==0 su z es 0. Las entradas de outlier_rows fuera de rango/malformadas se omiten. Ante raw_numeric vacio/no-dict, outlier_rows no-lista, 0 columnas numericas o 0 filas validas devuelve []." +uses_functions: [] +uses_types: [] +returns: [] +returns_optional: false +error_type: "" +imports: [] +tested: true +tests: ["test_row_index_skips_none_rows", "test_extreme_row_flagged_via_isolation", "test_out_of_range_row_index_is_ignored", "test_degrades_to_empty_on_invalid_inputs"] +test_file_path: "python/functions/datascience/summarize_outlier_dims_test.py" +file_path: "python/functions/datascience/summarize_outlier_dims.py" +--- + +## Ejemplo + +```python +from datascience import isolation_forest_outliers, summarize_outlier_dims + +# Nube densa alrededor del origen + 1 fila con un valor extremo en "c". 
+raw_numeric = {
+    "a": [0.1, 0.2, -0.1, 0.0, 0.3, -0.2, 0.15, -0.05, 0.25, 0.2, -0.3, 0.1],
+    "b": [1.0, 1.1, 0.9, 1.2, 0.8, 1.0, 1.1, 0.95, 1.05, 0.9, 1.15, 1.0],
+    "c": [5.0, 5.2, 4.8, 5.1, 4.9, 5.0, 4.95, 5.05, 4.9, 500.0, 5.1, 5.0],
+}
+
+result = isolation_forest_outliers(raw_numeric, contamination=0.1)
+summary = summarize_outlier_dims(raw_numeric, result["outlier_rows"], top_k=3)
+
+for item in summary:
+    top = item["dims"][0]
+    print(item["row_index"], top["col"], top["value"], top["z"])
+# The row holding 500 comes out with top dim "c" and a high |z|: that is what makes it unusual.
+```
+
+## When to use it
+
+Right **after** `isolation_forest_outliers`, once you know WHICH rows are
+anomalous and want to explain WHY: in which columns they deviate most from the
+rest. Useful for filling the outliers section of an EDA report/notebook with
+"row 9 is unusual mainly because of `c` (z=+3.3)" instead of just an opaque
+row_index. Pass the same `raw_numeric` you gave the detector and its
+`outlier_rows` untouched; `row_index` points at the same row because both
+functions apply the same column filter and drop the same rows containing None.
+
+## Gotchas
+
+- **Same `raw_numeric` as the detector**: `row_index` only matches if you pass
+  the same dict of columns (same order, same lists) that you used to call
+  `isolation_forest_outliers`. If you change the columns or the order, the
+  indices no longer map.
+- **`row_index` is relative to the valid rows**: rows with `None` in any used
+  column are dropped and the indices are recomputed over the remaining ones
+  (0-based, order of appearance). It does not map 1:1 to the input lists if
+  they contain None.
+- **Population z-score (ddof=0)**: the population standard deviation is used,
+  consistent with the detector's scaling. Columns with `std==0` (all values
+  equal) give `z=0`, so they never show up as "unusual".
+- **Returns `[]` instead of crashing**: non-dict/non-list input, 0 numeric
+  columns, 0 valid rows, or all entries out of range -> empty list.
+  It does not raise exceptions.
+- **Does not call `isolation_forest_outliers`**: it only consumes its output. It
+  is a standalone function (it does not import it), so `uses_functions` is empty.
diff --git a/python/functions/datascience/summarize_outlier_dims.py b/python/functions/datascience/summarize_outlier_dims.py
new file mode 100644
index 00000000..b6b2ca61
--- /dev/null
+++ b/python/functions/datascience/summarize_outlier_dims.py
@@ -0,0 +1,144 @@
+"""Explain which dimensions (columns) make each anomalous row unusual.
+
+Takes the multivariate output of `isolation_forest_outliers` (a list of
+`{row_index, score}`) and, for each outlier, returns the columns with the
+largest |z-score| relative to the distribution of the valid rows. It is the
+"explainability" layer of the multivariate outlier step in the EDA phase: the
+Isolation Forest says WHICH rows are unusual, this function says WHY (in which
+columns they deviate most).
+
+Pure and deterministic: it rebuilds EXACTLY the same "valid rows" used by
+`isolation_forest_outliers` (same numeric-column filter and same dropping of
+rows containing None), so `row_index` points at the same row in both
+functions. It does no I/O and depends on no state.
+"""
+
+import math
+
+import numpy as np
+
+
+def _is_finite_number(v) -> bool:
+    """True if v is a finite int/float. bool does NOT count; neither do NaN/Inf."""
+    if isinstance(v, bool):
+        return False
+    if not isinstance(v, (int, float)):
+        return False
+    if isinstance(v, float) and (math.isnan(v) or math.isinf(v)):
+        return False
+    return True
+
+
+def summarize_outlier_dims(
+    raw_numeric: dict,
+    outlier_rows: list,
+    top_k: int = 3,
+) -> list:
+    """Summarize the dimensions that deviate most for each anomalous row.
+
+    Args:
+        raw_numeric: dict {column_name: [values]} aligned by row (like
+            ctx['raw_numeric'] in the AutomaticEDA engine). Only columns whose
+            values are all numeric are used (None is allowed per row; bool,
+            str, NaN and Inf discard the whole column); this filter is
+            identical to the one in isolation_forest_outliers.
+        outlier_rows: list of {row_index, score} as returned by
+            isolation_forest_outliers. row_index counts ONLY valid rows
+            (no None) in order of appearance, starting at 0.
+        top_k: number of columns (those with the largest |z-score|) to report
+            per outlier. Default 3. Invalid values fall back to 3.
+
+    Returns:
+        List in the same order as outlier_rows of dicts
+        {row_index, score, dims}, where dims is the list of up to top_k columns
+        sorted by descending |z|: [{col, value, z}, ...] with z rounded to
+        3 decimals. Out-of-range or malformed entries of outlier_rows are
+        omitted (defensive). Returns [] when raw_numeric is empty or not a
+        dict, outlier_rows is not a list, or no numeric columns or valid rows exist.
+    """
+    # Defensive validation of the main arguments.
+    if not isinstance(raw_numeric, dict) or not isinstance(outlier_rows, list):
+        return []
+    if not isinstance(top_k, int) or isinstance(top_k, bool) or top_k < 1:
+        top_k = 3
+
+    # Numeric column selection: identical to isolation_forest_outliers.
+    # A column is kept only if all its values are numeric (None is allowed
+    # per row); any bool/str/NaN/Inf discards the whole column.
+    numeric_cols: dict[str, list] = {}
+    for name, values in raw_numeric.items():
+        if not isinstance(values, (list, tuple)):
+            continue
+        ok = True
+        for v in values:
+            if v is None:
+                continue
+            if not _is_finite_number(v):
+                ok = False
+                break
+        if ok:
+            numeric_cols[name] = list(values)
+
+    if len(numeric_cols) < 1:
+        return []
+
+    col_names = list(numeric_cols.keys())
+    try:
+        n_rows_total = min(len(numeric_cols[c]) for c in col_names)
+    except ValueError:
+        return []
+
+    # Rebuild the valid rows with the SAME criterion as the detector: row i
+    # takes one value per column; if any value is None, the row is dropped
+    # and does NOT advance the valid index. That way row_index in
+    # outlier_rows points into this same sequence (0-based, order of appearance).
+    valid_rows: list[list[float]] = []
+    for i in range(n_rows_total):
+        row = [numeric_cols[c][i] for c in col_names]
+        if any(v is None for v in row):
+            continue
+        valid_rows.append([float(v) for v in row])
+
+    if not valid_rows:
+        return []
+
+    matrix = np.asarray(valid_rows, dtype=float)
+    n_valid = matrix.shape[0]
+    means = matrix.mean(axis=0)
+    stds = matrix.std(axis=0, ddof=0)  # population (ddof=0)
+
+    out: list = []
+    for entry in outlier_rows:
+        if not isinstance(entry, dict):
+            continue
+        ri = entry.get("row_index")
+        # bool is a subclass of int: exclude it explicitly.
+        if not isinstance(ri, int) or isinstance(ri, bool):
+            continue
+        if ri < 0 or ri >= n_valid:
+            continue
+
+        try:
+            score = float(entry.get("score"))
+        except (TypeError, ValueError):
+            score = 0.0
+
+        row = matrix[ri]
+        dims = []
+        for j, name in enumerate(col_names):
+            std = stds[j]
+            if std == 0.0:
+                z = 0.0
+            else:
+                z = float((row[j] - means[j]) / std)
+            dims.append({"col": name, "value": float(row[j]), "z": z})
+
+        # Largest |z| first; the sort is stable, so ties keep column order.
+        dims.sort(key=lambda d: abs(d["z"]), reverse=True)
+        dims = dims[:top_k]
+        for d in dims:
+            d["z"] = round(d["z"], 3)
+
+        out.append({"row_index": int(ri), "score": score, "dims": dims})
+
+    return out
diff --git a/python/functions/datascience/summarize_outlier_dims_test.py b/python/functions/datascience/summarize_outlier_dims_test.py
new file mode 100644
index 00000000..019a4ddd
--- /dev/null
+++ b/python/functions/datascience/summarize_outlier_dims_test.py
@@ -0,0 +1,93 @@
+"""Tests for summarize_outlier_dims."""
+
+from isolation_forest_outliers import isolation_forest_outliers
+from summarize_outlier_dims import summarize_outlier_dims
+
+
+# Shared dataset: 3 columns, 13 rows. ORIGINAL row 6 has None in "a"
+# (it is dropped), so ORIGINAL row 10, which has an extreme value in "c",
+# lands at VALID index 9 (not 10). This checks that None rows are skipped.
+A = [0.1, 0.2, -0.1, 0.0, 0.3, -0.2, None, 0.15, -0.05, 0.25, 0.2, -0.3, 0.1]
+B = [1.0, 1.1, 0.9, 1.2, 0.8, 1.0, 1.3, 1.1, 0.95, 1.05, 0.9, 1.15, 1.0]
+C = [5.0, 5.2, 4.8, 5.1, 4.9, 5.0, 5.3, 4.95, 5.05, 4.9, 500.0, 5.1, 5.0]
+RAW = {"a": A, "b": B, "c": C}
+
+# Original -> valid map (skipping original 6):
+#   orig:  0 1 2 3 4 5 7 8 9 10 11 12
+#   valid: 0 1 2 3 4 5 6 7 8  9 10 11
+# => the extreme in "c" (original 10) sits at valid index 9.
+EXTREME_VALID_INDEX = 9
+
+
+def test_row_index_skips_none_rows():
+    # Direct mapping (independent of IsolationForest randomness): valid
+    # index 9 must correspond to the row with c == 500, which means the None
+    # in original row 6 was skipped correctly.
+    summary = summarize_outlier_dims(
+        RAW, [{"row_index": EXTREME_VALID_INDEX, "score": -0.5}], top_k=3
+    )
+    assert len(summary) == 1
+    entry = summary[0]
+    assert entry["row_index"] == EXTREME_VALID_INDEX
+    assert entry["score"] == -0.5
+    # The dominant dimension is "c", with its extreme value and a high |z|.
+    top = entry["dims"][0]
+    assert top["col"] == "c"
+    assert top["value"] == 500.0
+    assert abs(top["z"]) > 2.0
+    # top_k is honored: at most 3 dims.
+    assert len(entry["dims"]) <= 3
+
+
+def test_extreme_row_flagged_via_isolation():
+    # Real integration: detect outliers and explain them.
+    result = isolation_forest_outliers(RAW, contamination=0.1)
+    assert "note" not in result
+    outlier_rows = result["outlier_rows"]
+    assert outlier_rows  # at least one outlier
+
+    summary = summarize_outlier_dims(RAW, outlier_rows, top_k=3)
+    # Same length as outlier_rows (every index is in range).
+    assert len(summary) == len(outlier_rows)
+
+    by_index = {e["row_index"]: e for e in summary}
+    # The extreme point must be among the detected outliers...
+    assert EXTREME_VALID_INDEX in by_index
+    # ...and its top dimension must be "c" (where it deviates by several sigmas).
+    extreme = by_index[EXTREME_VALID_INDEX]
+    assert extreme["dims"][0]["col"] == "c"
+    assert abs(extreme["dims"][0]["z"]) > 2.0
+
+
+def test_out_of_range_row_index_is_ignored():
+    # Out-of-range indices are skipped instead of crashing.
+    summary = summarize_outlier_dims(
+        RAW,
+        [
+            {"row_index": 999, "score": -1.0},
+            {"row_index": -1, "score": -1.0},
+            {"row_index": EXTREME_VALID_INDEX, "score": -0.5},
+        ],
+        top_k=2,
+    )
+    # Only the valid index survives; the other two are discarded.
+    assert len(summary) == 1
+    assert summary[0]["row_index"] == EXTREME_VALID_INDEX
+    assert len(summary[0]["dims"]) <= 2
+
+
+def test_degrades_to_empty_on_invalid_inputs():
+    # Empty raw_numeric + empty outlier_rows.
+    assert summarize_outlier_dims({}, [], 3) == []
+    # raw_numeric is not a dict.
+    assert summarize_outlier_dims("not a dict", [{"row_index": 0}], 3) == []
+    # outlier_rows is not a list.
+    assert summarize_outlier_dims(RAW, "not a list", 3) == []
+    # No numeric columns (all strings) -> [].
+    assert summarize_outlier_dims(
+        {"s": ["x", "y", "z"]}, [{"row_index": 0, "score": -1.0}], 3
+    ) == []
+    # Malformed entries inside outlier_rows are ignored (they do not crash).
+    assert summarize_outlier_dims(
+        RAW, ["nope", 42, {"no_row_index": 1}], 3
+    ) == []