"""Outliers chapter (OUTLIERS) — univariate + multivariate atypical values. Today the analysis of atypical values is scattered across the document: the NUM DISTR chapter mentions the per-column outlier count inside each distribution figure, and the MODELOS chapter runs Isolation Forest as one of several cheap models. This chapter gathers and deepens the whole outlier story in a single place, with its interpretation: an [[term:outlier]]outlier[[/term]] is **not necessarily an error** — it can be a legitimate, extreme but real observation — so the reading is exploratory (what to look at), never confirmatory (what to delete). Sections, in order: 1. **Resumen univariante por columna** — for every numeric column, the number and percentage of atypical values by two complementary criteria: Tukey's 1.5·IQR rule ([[term:tukey_fence]]vallas de Tukey[[/term]]) and the [[term:zscore]]z-score[[/term]] rule (|z| > 3). The most contaminated columns are flagged. The fences come from the pure registry function ``build_boxplot_stats`` (derived from the profile percentiles); the per-column counts use the raw sample in ``ctx['raw_numeric']`` when available (the exact count), degrading to the profile's own z-score counts otherwise. 2. **Boxplots** — a single figure with the Tukey boxplots of the most contaminated columns (box, whiskers and atypical points), delegated to the reusable registry helper ``build_boxplots_figure``. 3. **Multivariante (filas anómalas)** — rows that are atypical considering ALL columns at once, via the registry function ``isolation_forest_outliers``: the count and percentage of anomalous rows, the most anomalous rows with their score, and the dimensions that make each one rare (top columns by |z|, via ``summarize_outlier_dims``). 
   Run live on ``ctx['raw_numeric']`` (the same numeric columns
   ``summarize_outlier_dims`` uses, so the row indexing stays coherent and the
   dimension breakdown is correct); falls back to the precomputed
   ``profile['models']['outliers']`` only when no raw sample is available
   (e.g. the lite preset), where no per-row breakdown is shown.
4. **Interpretación** — outlier ≠ error: how to tell a data-entry error from a
   genuine extreme value, and what to do (inspect, winsorize, or re-express —
   linking to the Tukey re-expression the profile already computes).

The chapter activates whenever the table has at least one numeric column; with
no numeric column it returns ``None`` and disappears from the document. Reads
everything defensively (``.get``) and never raises: every registry delegation
is imported lazily and degraded to an honest note on any failure.

Contract: build_(profile, ctx) -> Chapter | None ; CHAPTER_VERSION = "x.y.z".
"""
from __future__ import annotations

from .. import model

CHAPTER_VERSION = "1.0.0"
CHAPTER_ID = "outliers"
CHAPTER_TITLE = "Valores atípicos"

# z-score threshold for the univariate z rule: |z| > 3 flags a value ~3 standard
# deviations from the mean (≈99.7% of a normal distribution lies within ±3σ).
_Z_THRESH = 3.0

# How many columns to draw in the boxplots figure (most contaminated first) and
# how many anomalous rows to list in the multivariate table.
_TOP_BOX = 12
_TOP_ROWS = 12

# Cap on the raw atypical values passed as boxplot fliers, so a heavy-tailed
# column does not flood the figure with thousands of points.
_MAX_FLIERS = 200

# How many columns flagged as "most contaminated" in the summary note.
_TOP_FLAGGED = 3

# Glossary terms this chapter explains (contract §11.1). Registered in the shared
# collector and marked clickable on first appearance.
# ``isolation_forest`` and ``zscore`` may also be registered by the MODELOS
# chapter — ``add`` is idempotent (first definition wins), so registering them
# here is harmless and keeps this chapter self-contained when MODELOS does not
# render.
_TERM_DEFS = {
    "outlier": (
        "Valor atípico (outlier)",
        "Una observación que se aparta mucho del grueso de los datos. Un atípico "
        "NO es necesariamente un error: puede ser un fallo de medida o de "
        "registro, pero también un dato real extremo (un cliente que gasta diez "
        "veces la media, un día de ventas excepcional). Por eso se señalan para "
        "revisarlos, no para borrarlos automáticamente.",
    ),
    "tukey_fence": (
        "Vallas de Tukey (1,5·IQR)",
        "Regla clásica para marcar atípicos a partir de los cuartiles: se calcula "
        "el rango intercuartílico IQR = P75 − P25 y se trazan dos vallas, una "
        "inferior en P25 − 1,5·IQR y otra superior en P75 + 1,5·IQR. Los valores "
        "que caen fuera de esas vallas se consideran atípicos. Es robusta porque "
        "se apoya en la mediana y los cuartiles, no en la media.",
    ),
    "zscore": (
        "z-score (puntuación típica)",
        "Mide a cuántas desviaciones típicas está un valor de la media de su "
        "columna: z = (valor − media) / desviación típica. Un |z| grande (aquí > "
        "3) señala un valor alejado del centro. A diferencia de las vallas de "
        "Tukey, el z-score usa media y desviación, así que es más sensible a la "
        "presencia de los propios atípicos.",
    ),
    "isolation_forest": (
        "Isolation Forest (anomalías multivariantes)",
        "Algoritmo de detección de anomalías que considera TODAS las columnas a "
        "la vez: construye árboles que parten el espacio con cortes aleatorios y "
        "mide cuántos cortes hacen falta para aislar cada fila. Las filas raras "
        "se aíslan con muy pocos cortes y se marcan como atípicas según un umbral "
        "de contaminación. Detecta combinaciones de valores poco frecuentes que "
        "ninguna columna por separado revelaría.",
    ),
}


# --------------------------------------------------------------------------- #
# Lazy registry delegations (each degrades to None / no-op on any failure).
# --------------------------------------------------------------------------- #
def _load_build_boxplot_stats():
    try:
        from datascience.build_boxplot_stats import build_boxplot_stats
        return build_boxplot_stats
    except Exception:  # noqa: BLE001
        return None


def _load_detect_outliers():
    # detect_outliers lives in the monolithic ``datascience.datascience`` module
    # (file_path datascience.py), not in its own submodule — try both shapes.
    try:
        from datascience.datascience import detect_outliers
        return detect_outliers
    except Exception:  # noqa: BLE001
        try:
            from datascience import detect_outliers
            return detect_outliers
        except Exception:  # noqa: BLE001
            return None


def _load_isolation_forest():
    try:
        from datascience.isolation_forest_outliers import isolation_forest_outliers
        return isolation_forest_outliers
    except Exception:  # noqa: BLE001
        return None


def _load_summarize_dims():
    try:
        from datascience.summarize_outlier_dims import summarize_outlier_dims
        return summarize_outlier_dims
    except Exception:  # noqa: BLE001
        return None


# --------------------------------------------------------------------------- #
# Defensive formatters (own copy: the chapter never imports siblings).
# --------------------------------------------------------------------------- #
def _fmt_num(value, decimals: int = 3) -> str:
    if value is None:
        return "—"
    if isinstance(value, bool):
        return "sí" if value else "no"
    if isinstance(value, int):
        return f"{value:,}".replace(",", ".")
    if isinstance(value, float):
        if value != value:  # NaN
            return "—"
        if value in (float("inf"), float("-inf")):
            return str(value)
        text = f"{value:.{decimals}f}".rstrip("0").rstrip(".")
        return text if text else "0"
    return model._safe_str(value)


def _fmt_int(value) -> str:
    if value is None:
        return "—"
    try:
        return f"{int(round(float(value))):,}".replace(",", ".")
    except (TypeError, ValueError):
        return model._safe_str(value)


def _fmt_pct(value, decimals: int = 2) -> str:
    """Format an already-0-100 value as a percentage. None -> placeholder."""
    if value is None:
        return "—"
    try:
        return f"{float(value):.{decimals}f}%"
    except (TypeError, ValueError):
        return model._safe_str(value)


def _term(mark: bool, key: str, text: str) -> str:
    return f"[[term:{key}]]{text}[[/term]]" if mark else text


def _is_dict(v) -> bool:
    return isinstance(v, dict)


# --------------------------------------------------------------------------- #
# Profile reads.
# --------------------------------------------------------------------------- #
def _numeric_columns(profile: dict) -> list:
    """Return [(name, numeric_dict)] for numeric columns with usable stats."""
    out = []
    for col in profile.get("columns") or []:
        if not isinstance(col, dict):
            continue
        if col.get("inferred_type") != "numeric":
            continue
        num = col.get("numeric")
        if not isinstance(num, dict) or not num:
            continue
        if num.get("mean") is None and num.get("median") is None:
            continue
        out.append((col.get("name") or "(columna)", num))
    return out


def _clean_values(raw):
    """Return the finite float values of a raw column list (drop None/NaN/inf)."""
    if not isinstance(raw, (list, tuple)):
        return None
    vals = []
    for v in raw:
        if v is None or isinstance(v, bool):
            continue
        try:
            f = float(v)
        except (TypeError, ValueError):
            continue
        if f != f or f in (float("inf"), float("-inf")):
            continue
        vals.append(f)
    return vals


# --------------------------------------------------------------------------- #
# Per-column univariate summary.
# --------------------------------------------------------------------------- #
def _univariate_row(name, numeric, raw_vals, box_fn, detect_fn):
    """Compute one univariate summary row + boxplot inputs for a column.

    Returns a dict with the table cells and, when raw values are available, the
    exact Tukey/z counts and the list of atypical (flier) values; otherwise it
    degrades to the profile's own z-score counts and the fence flags.
    """
    box = {}
    if box_fn is not None:
        try:
            box = box_fn(numeric) or {}
        except Exception:  # noqa: BLE001
            box = {}
    lf = box.get("lower_fence")
    uf = box.get("upper_fence")

    vals = _clean_values(raw_vals)
    n_tukey = pct_tukey = None
    n_z = pct_z = None
    low_extreme = high_extreme = None
    fliers = []
    contamination = None  # metric used to rank columns (prefer Tukey %).
    if vals:
        n = len(vals)
        tukey_out = []
        for v in vals:
            below = (lf is not None and v < lf)
            above = (uf is not None and v > uf)
            if below or above:
                tukey_out.append(v)
        n_tukey = len(tukey_out)
        pct_tukey = 100.0 * n_tukey / n if n else None
        # Report each tail separately, so a column whose atypical values all
        # sit above the upper fence does not also show a "low" extreme (and
        # vice versa). This matches the fence-flag logic of the degraded path.
        lows = [v for v in tukey_out if lf is not None and v < lf]
        highs = [v for v in tukey_out if uf is not None and v > uf]
        if lows:
            low_extreme = min(lows)
        if highs:
            high_extreme = max(highs)
        fliers = tukey_out[:_MAX_FLIERS]
        # z-score rule via the registry function (returns parallel bools).
        if detect_fn is not None:
            try:
                flags = detect_fn(vals, _Z_THRESH) or []
                n_z = int(sum(1 for b in flags if b))
                pct_z = 100.0 * n_z / n if n else None
            except Exception:  # noqa: BLE001
                n_z = pct_z = None
        contamination = pct_tukey
    else:
        # Degrade: no raw sample for this column. The profile's own outlier
        # count/pct come from the z-score block (build_boxplot_stats note); the
        # Tukey count is unknown, only the fence flags are.
        n_z = numeric.get("n_outliers")
        pct_z = numeric.get("outlier_pct")
        if box.get("has_low_outliers") and box.get("min") is not None:
            low_extreme = box.get("min")
        if box.get("has_high_outliers") and box.get("max") is not None:
            high_extreme = box.get("max")
        contamination = pct_z if isinstance(pct_z, (int, float)) else None

    # Compact "extremos atípicos" cell: down/up arrows for the low/high tail.
    extremes = []
    if low_extreme is not None:
        extremes.append(f"↓ {_fmt_num(low_extreme)}")
    if high_extreme is not None:
        extremes.append(f"↑ {_fmt_num(high_extreme)}")
    extremes_cell = " ".join(extremes) if extremes else "—"

    return {
        "name": model._safe_str(name),
        "n_tukey": n_tukey,
        "pct_tukey": pct_tukey,
        "n_z": n_z,
        "pct_z": pct_z,
        "lower_fence": lf,
        "upper_fence": uf,
        "extremes": extremes_cell,
        "box": box,
        "fliers": fliers,
        "has_raw": bool(vals),
        "contamination": contamination if isinstance(contamination, (int, float)) else -1.0,
    }


def _univariate_table(rows: list) -> model.DataTable:
    header = ["Columna", "Atípicos Tukey", "% Tukey", "Atípicos z", "% z",
              "Valla inf.", "Valla sup.", "Extremos atípicos"]
    table_rows = []
    for r in rows:
        table_rows.append([
            r["name"],
            _fmt_int(r["n_tukey"]) if r["n_tukey"] is not None else "—",
            _fmt_pct(r["pct_tukey"]) if r["pct_tukey"] is not None else "—",
            _fmt_int(r["n_z"]) if r["n_z"] is not None else "—",
            _fmt_pct(r["pct_z"]) if r["pct_z"] is not None else "—",
            _fmt_num(r["lower_fence"]),
            _fmt_num(r["upper_fence"]),
            r["extremes"],
        ])
    return model.DataTable(
        header=header, rows=table_rows,
        title="Valores atípicos por columna",
        note="Tukey = fuera de las vallas 1,5·IQR · z = |z-score| > 3 · "
             "ordenado de más a menos contaminada")


# --------------------------------------------------------------------------- #
# Multivariate (Isolation Forest) section.
# --------------------------------------------------------------------------- #
def _resolve_multivariate(profile: dict, ctx: dict, raw_numeric):
    """Return (outliers_dict_or_None, source).

    Prefers a LIVE Isolation Forest over ``raw_numeric`` so the detector and
    ``summarize_outlier_dims`` use EXACTLY the same numeric columns and the same
    valid-row indexing — otherwise the precomputed
    ``profile['models']['outliers']`` (run by MODELOS over a possibly different
    column subset) would yield ``row_index`` values that no longer point at the
    rows ``summarize_outlier_dims`` reconstructs, mislabelling the "dimensions
    that make each row rare". Falls back to the precomputed block when no raw
    sample is available (e.g. the lite preset drops ``raw_numeric``)."""
    if _is_dict(raw_numeric) and raw_numeric:
        iso = _load_isolation_forest()
        if iso is not None:
            try:
                out = iso(raw_numeric)
                if (_is_dict(out) and out.get("n_outliers") is not None
                        and out.get("n_rows_used")):
                    return out, "live"
            except Exception:  # noqa: BLE001
                pass
    # Fallback: the model the MODELOS chapter already computed (no raw sample to
    # recompute against, so no per-row dimension breakdown either).
    models = profile.get("models") if _is_dict(profile.get("models")) else {}
    pre = models.get("outliers") if _is_dict(models) else None
    if _is_dict(pre) and pre.get("n_outliers") is not None and pre.get("n_rows_used"):
        return pre, "precomputed"
    return None, "none"


def _multivariate_blocks(outliers: dict, raw_numeric, mark: bool) -> list:
    isof = _term(mark, "isolation_forest", "**Isolation Forest**")
    blocks = [
        model.Heading(text="Filas atípicas (multivariante)", level=2),
        model.Markdown(text=(
            f"Hasta aquí cada columna se ha mirado por separado. {isof} busca "
            "filas raras considerando **todas las columnas a la vez**: una fila "
            "puede ser normal en cada variable y aun así ser atípica por la "
            "**combinación** de sus valores (p. ej. una edad baja con una tarifa "
            "muy alta). La tabla resume cuántas filas se marcaron y el umbral de "
            "decisión.")),
        model.KVTable(rows=[
            ("Filas analizadas", _fmt_int(outliers.get("n_rows_used"))),
            ("Columnas consideradas", _fmt_int(outliers.get("n_features"))),
            ("Filas atípicas", _fmt_int(outliers.get("n_outliers"))),
            ("% filas atípicas", _fmt_pct(outliers.get("outlier_pct"))),
            ("Umbral de decisión", _fmt_num(outliers.get("threshold"), 4)),
        ], title="Anomalías multivariantes"),
    ]
    rows_in = outliers.get("outlier_rows") or []
    if not rows_in:
        return blocks

    # Enrich each anomalous row with the dimensions that make it rare, when the
    # raw sample is available (summarize_outlier_dims reconstructs the same
    # valid-row indexing as isolation_forest_outliers).
    dims_by_row = {}
    if _is_dict(raw_numeric) and raw_numeric:
        summ = _load_summarize_dims()
        if summ is not None:
            try:
                enriched = summ(raw_numeric, rows_in, top_k=3) or []
                for e in enriched:
                    if _is_dict(e) and e.get("row_index") is not None:
                        dims_by_row[e.get("row_index")] = e.get("dims") or []
            except Exception:  # noqa: BLE001
                dims_by_row = {}

    has_dims = bool(dims_by_row)
    header = ["Fila (entre válidas)", "Score"]
    if has_dims:
        header.append("Dimensiones que la hacen rara (col = valor, z)")
    table_rows = []
    for r in rows_in[:_TOP_ROWS]:
        if not _is_dict(r):
            continue
        ridx = r.get("row_index")
        cells = [_fmt_int(ridx), _fmt_num(r.get("score"), 4)]
        if has_dims:
            dims = dims_by_row.get(ridx) or []
            parts = []
            for d in dims:
                if not _is_dict(d):
                    continue
                parts.append(
                    f"{model._safe_str(d.get('col'))} = {_fmt_num(d.get('value'))} "
                    f"(z {_fmt_num(d.get('z'), 2)})")
            cells.append("; ".join(parts) if parts else "—")
        table_rows.append(cells)

    if table_rows:
        shown = len(table_rows)
        total = outliers.get("n_outliers")
        note = "las filas más anómalas primero (score más bajo = más rara)"
        if isinstance(total, int) and total > shown:
            note += f" — top {shown} de {total}"
        if not has_dims:
            note += (" · no se pudo recuperar la muestra cruda para explicar las "
                     "dimensiones de cada fila")
        blocks.append(model.DataTable(
            header=header, rows=table_rows,
            title="Filas más atípicas", note=note))
    return blocks


# --------------------------------------------------------------------------- #
# Interpretation section.
# --------------------------------------------------------------------------- #
def _interpretation_block(mark: bool) -> model.Markdown:
    outlier = _term(mark, "outlier", "atípico")
    text = (
        f"**Un {outlier} no es necesariamente un error.** Conviene distinguir "
        "dos casos antes de actuar:\n\n"
        "- **Error de dato** (medida, registro o unidad equivocada): una edad de "
        "200 años, un importe negativo donde no puede haberlo, un decimal "
        "desplazado. Estos sí se corrigen o se eliminan, idealmente en el origen.\n"
        "- **Dato real extremo**: una observación legítima de la cola de la "
        "distribución (un cliente que gasta mucho más, una tarifa de lujo, un día "
        "de ventas excepcional). Borrarla sesga el análisis y oculta información "
        "valiosa.\n\n"
        "**Qué hacer.** Primero, **revisar** los valores señalados arriba contra "
        "su origen para decidir cuál de los dos casos es. Si son errores, "
        "corregirlos. Si son datos reales que distorsionan medias y modelos, hay "
        "alternativas a borrarlos: **winsorizar** (recortar los extremos a un "
        "percentil), o **re-expresar** la variable (por ejemplo una "
        "transformación logarítmica o la escalera de re-expresión de Tukey que "
        "este mismo perfil ya calcula para las columnas asimétricas), que suele "
        "domar la cola sin perder ninguna fila. La elección depende del objetivo: "
        "esta lectura es **exploratoria** —orienta dónde mirar—, no una regla "
        "automática de limpieza.")
    return model.Markdown(text=text)


# --------------------------------------------------------------------------- #
# Entry point.
# --------------------------------------------------------------------------- #
def build_outliers(profile: dict, ctx: dict):
    """Build the OUTLIERS Chapter, or None if the dataset has no numeric column."""
    profile = profile or {}
    ctx = ctx or {}
    if not isinstance(profile, dict):
        return None

    numerics = _numeric_columns(profile)
    if not numerics:
        return None  # chapter does not apply to a dataset with no numerics.

    # Register glossary terms (if a collector is present) and mark them clickable.
    glossary = ctx.get("glossary")
    mark = False
    if isinstance(glossary, model.GlossaryCollector):
        for key, (label, definition) in _TERM_DEFS.items():
            glossary.add(key, label, definition)
        mark = True

    raw_numeric = ctx.get("raw_numeric")
    raw_numeric = raw_numeric if isinstance(raw_numeric, dict) else {}

    box_fn = _load_build_boxplot_stats()
    detect_fn = _load_detect_outliers()

    # --- Univariate summary ------------------------------------------------- #
    uni_rows = []
    for name, numeric in numerics:
        uni_rows.append(_univariate_row(
            name, numeric, raw_numeric.get(name), box_fn, detect_fn))
    # Rank columns by contamination (Tukey % when available, else z %).
    uni_rows.sort(key=lambda r: r.get("contamination", -1.0), reverse=True)

    intro = (
        "Este capítulo reúne en un solo sitio el análisis de los **valores "
        "atípicos** de la tabla, que en el resto del informe aparecen dispersos. "
        f"Un {_term(mark, 'outlier', 'atípico')} es una observación que se aparta "
        "mucho del grueso de los datos. Cada columna numérica se evalúa con dos "
        f"criterios complementarios: las {_term(mark, 'tukey_fence', 'vallas de Tukey')} "
        "(fuera de P25−1,5·IQR o P75+1,5·IQR, robusto a la propia cola) y el "
        f"{_term(mark, 'zscore', 'z-score')} (|z| > 3, sensible a la media). La "
        "tabla está ordenada de la columna más contaminada a la menos.")

    blocks = [
        model.Heading(text=CHAPTER_TITLE, level=1),
        model.Markdown(text=intro),
        _univariate_table(uni_rows),
    ]

    # Flag the most contaminated columns explicitly.
    flagged = [r["name"] for r in uni_rows
               if r.get("contamination", -1.0) > 0][:_TOP_FLAGGED]
    if flagged:
        names = ", ".join(f"**{n}**" for n in flagged)
        blocks.append(model.Markdown(text=(
            f"Las columnas con mayor proporción de atípicos son {names}: "
            "concentran el grueso de los valores fuera de las vallas y son las "
            "primeras a revisar.")))

    # --- Boxplots figure ---------------------------------------------------- #
    box_entries = [
        {"name": r["name"], "box": r["box"], "fliers": r.get("fliers")}
        for r in uni_rows if r.get("box")
    ][:_TOP_BOX]
    if box_entries:
        def _boxplots_make(entries=box_entries):
            try:
                from datascience.build_boxplots_figure import build_boxplots_figure
                return build_boxplots_figure(
                    entries, title="Boxplots de Tukey por columna",
                    max_boxes=_TOP_BOX)
            except Exception:  # noqa: BLE001 — minimal fallback figure.
                import matplotlib
                matplotlib.use("Agg")
                from matplotlib.figure import Figure
                fig = Figure(figsize=(5.0, 2.2))
                ax = fig.add_subplot(111)
                ax.text(0.5, 0.5, "(boxplots no disponibles)",
                        ha="center", va="center")
                ax.axis("off")
                return fig

        blocks.append(model.Group(blocks=[
            model.Heading(text="Boxplots", level=2),
            model.Markdown(text=(
                "Cada caja abarca del primer al tercer cuartil (P25–P75), la línea "
                "interior es la mediana y los bigotes llegan hasta 1,5·IQR; los "
                "puntos son los valores que caen fuera de las vallas (atípicos por "
                "Tukey).")),
            model.Figure(
                make=_boxplots_make,
                caption="Boxplots de Tukey de las columnas más contaminadas."),
        ]))

    # --- Multivariate ------------------------------------------------------- #
    outliers, _src = _resolve_multivariate(profile, ctx, raw_numeric)
    if outliers is not None:
        blocks.extend(_multivariate_blocks(outliers, raw_numeric, mark))
    else:
        blocks.append(model.Heading(text="Filas atípicas (multivariante)", level=2))
        blocks.append(model.Note(
            "No se pudo analizar la anomalía multivariante: hacen falta al menos "
            "dos columnas numéricas y la muestra cruda (o los modelos del perfil) "
            "para correr Isolation Forest."))

    # --- Interpretation ----------------------------------------------------- #
    blocks.append(model.Heading(text="Cómo interpretar los atípicos", level=2))
    blocks.append(_interpretation_block(mark))

    return model.Chapter(id=CHAPTER_ID, title=CHAPTER_TITLE,
                         version=CHAPTER_VERSION, blocks=blocks)