# fn_registry/python/functions/datascience/automatic_eda/chapters/outliers.py

"""Outliers chapter (OUTLIERS) — univariate + multivariate atypical values.

Today the analysis of atypical values is scattered across the document: the
NUM DISTR chapter mentions the per-column outlier count inside each distribution
figure, and the MODELOS chapter runs Isolation Forest as one of several cheap
models. This chapter gathers and deepens the whole outlier story in a single
place, with its interpretation: an [[term:outlier]]outlier[[/term]] is **not
necessarily an error** — it can be a legitimate, extreme but real observation —
so the reading is exploratory (what to look at), never confirmatory (what to
delete).

Sections, in order:

1. **Resumen univariante por columna** — for every numeric column, the number
   and percentage of atypical values by two complementary criteria: Tukey's
   1.5·IQR rule ([[term:tukey_fence]]vallas de Tukey[[/term]]) and the
   [[term:zscore]]z-score[[/term]] rule (|z| > 3). The most contaminated columns
   are flagged. The fences come from the pure registry function
   ``build_boxplot_stats`` (derived from the profile percentiles); the per-column
   counts use the raw sample in ``ctx['raw_numeric']`` when available (the exact
   count), degrading to the profile's own z-score counts otherwise.
2. **Boxplots** — a single figure with the Tukey boxplots of the most
   contaminated columns (box, whiskers and atypical points), delegated to the
   reusable registry helper ``build_boxplots_figure``.
3. **Multivariante (filas anómalas)** — rows that are atypical considering ALL
   columns at once, via the registry function ``isolation_forest_outliers``: the
   count and percentage of anomalous rows, the most anomalous rows with their
   score, and the dimensions that make each one rare (top columns by |z|, via
   ``summarize_outlier_dims``). Run live on ``ctx['raw_numeric']`` (the same
   numeric columns ``summarize_outlier_dims`` uses, so the row indexing stays
   coherent and the dimension breakdown is correct); falls back to the
   precomputed ``profile['models']['outliers']`` only when no raw sample is
   available (e.g. the lite preset), where no per-row breakdown is shown.
4. **Interpretación** — outlier ≠ error: how to tell a data-entry error from a
   genuine extreme value, and what to do (inspect, winsorize, or re-express —
   linking to the Tukey re-expression the profile already computes).
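
A minimal sketch of the two univariate rules from section 1, for illustration
only. It is not the registry implementation: ``tukey_fences`` and
``zscore_flags`` are hypothetical names, the quartiles come from
``statistics.quantiles`` (inclusive method) rather than from the profile
percentiles, and the z rule assumes the population standard deviation.

```python
from statistics import mean, pstdev, quantiles


def tukey_fences(values, k=1.5):
    # Quartiles by the inclusive method (an assumption of this sketch).
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def zscore_flags(values, threshold=3.0):
    # One boolean per value: True when |z| exceeds the threshold.
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [False] * len(values)
    return [abs(v - mu) / sigma > threshold for v in values]


values = [10, 11, 12, 12, 13, 13, 14, 15, 16, 90]
low, high = tukey_fences(values)
tukey_out = [v for v in values if v < low or v > high]
z_out = [v for v, flag in zip(values, zscore_flags(values)) if flag]
```

In this ten-value sample the extreme value 90 falls outside the fences but is
not flagged by the z rule, because it inflates the standard deviation it is
measured against. This is why the chapter reports both criteria.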

The chapter activates whenever the table has at least one numeric column; with
no numeric column it returns ``None`` and disappears from the document.

Reads everything defensively (``.get``) and never raises: every registry
delegation is imported lazily and degraded to an honest note on any failure.
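
The lazy-import-and-degrade pattern can be sketched generically as follows.
``load_optional`` is a hypothetical helper used only for this illustration;
the module itself defines one explicit ``_load_*`` function per delegation.

```python
import importlib


def load_optional(module_name, attr_name):
    # Import on demand; any failure (missing module, missing attribute,
    # import-time error) degrades to None instead of propagating.
    try:
        module = importlib.import_module(module_name)
        return getattr(module, attr_name)
    except Exception:
        return None


sqrt = load_optional("math", "sqrt")
missing = load_optional("no_such_module_for_this_sketch", "anything")
result = sqrt(9.0) if sqrt is not None else None
```

Callers check for ``None`` and fall back to a note or a reduced output, so a
missing dependency never aborts the chapter.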

Contract: build_<id>(profile, ctx) -> Chapter | None ; CHAPTER_VERSION = "x.y.z".
"""

from __future__ import annotations

from .. import model

CHAPTER_VERSION = "1.0.0"
CHAPTER_ID = "outliers"
CHAPTER_TITLE = "Valores atípicos"

# z-score threshold for the univariate z rule: |z| > 3 flags a value ~3 standard
# deviations from the mean (≈99.7% of a normal distribution lies within ±3σ).
_Z_THRESH = 3.0
# How many columns to draw in the boxplots figure (most contaminated first) and
# how many anomalous rows to list in the multivariate table.
_TOP_BOX = 12
_TOP_ROWS = 12
# Cap on the raw atypical values passed as boxplot fliers, so a heavy-tailed
# column does not flood the figure with thousands of points.
_MAX_FLIERS = 200
# How many columns flagged as "most contaminated" in the summary note.
_TOP_FLAGGED = 3

# Glossary terms this chapter explains (contract §11.1). Registered in the shared
# collector and marked clickable on first appearance. ``isolation_forest`` and
# ``zscore`` may also be registered by the MODELOS chapter — ``add`` is
# idempotent (first definition wins), so registering them here is harmless and
# keeps this chapter self-contained when MODELOS does not render.
_TERM_DEFS = {
    "outlier": (
        "Valor atípico (outlier)",
        "Una observación que se aparta mucho del grueso de los datos. Un atípico "
        "NO es necesariamente un error: puede ser un fallo de medida o de "
        "registro, pero también un dato real extremo (un cliente que gasta diez "
        "veces la media, un día de ventas excepcional). Por eso se señalan para "
        "revisarlos, no para borrarlos automáticamente.",
    ),
    "tukey_fence": (
        "Vallas de Tukey (1,5·IQR)",
        "Regla clásica para marcar atípicos a partir de los cuartiles: se calcula "
        "el rango intercuartílico IQR = P75 − P25 y se trazan dos vallas, una "
        "inferior en P25 − 1,5·IQR y otra superior en P75 + 1,5·IQR. Los valores "
        "que caen fuera de esas vallas se consideran atípicos. Es robusta porque "
        "se apoya en la mediana y los cuartiles, no en la media.",
    ),
    "zscore": (
        "z-score (puntuación típica)",
        "Mide a cuántas desviaciones típicas está un valor de la media de su "
        "columna: z = (valor − media) / desviación típica. Un |z| grande (aquí > "
        "3) señala un valor alejado del centro. A diferencia de las vallas de "
        "Tukey, el z-score usa media y desviación, así que es más sensible a la "
        "presencia de los propios atípicos.",
    ),
    "isolation_forest": (
        "Isolation Forest (anomalías multivariantes)",
        "Algoritmo de detección de anomalías que considera TODAS las columnas a "
        "la vez: construye árboles que parten el espacio con cortes aleatorios y "
        "mide cuántos cortes hacen falta para aislar cada fila. Las filas raras "
        "se aíslan con muy pocos cortes y se marcan como atípicas según un umbral "
        "de contaminación. Detecta combinaciones de valores poco frecuentes que "
        "ninguna columna por separado revelaría.",
    ),
}


# --------------------------------------------------------------------------- #
# Lazy registry delegations (each degrades to None / no-op on any failure).
# --------------------------------------------------------------------------- #
def _load_build_boxplot_stats():
    try:
        from datascience.build_boxplot_stats import build_boxplot_stats
        return build_boxplot_stats
    except Exception:  # noqa: BLE001
        return None


def _load_detect_outliers():
    # detect_outliers lives in the monolithic ``datascience.datascience`` module
    # (file_path datascience.py), not in its own submodule — try both shapes.
    try:
        from datascience.datascience import detect_outliers
        return detect_outliers
    except Exception:  # noqa: BLE001
        try:
            from datascience import detect_outliers
            return detect_outliers
        except Exception:  # noqa: BLE001
            return None


def _load_isolation_forest():
    try:
        from datascience.isolation_forest_outliers import isolation_forest_outliers
        return isolation_forest_outliers
    except Exception:  # noqa: BLE001
        return None


def _load_summarize_dims():
    try:
        from datascience.summarize_outlier_dims import summarize_outlier_dims
        return summarize_outlier_dims
    except Exception:  # noqa: BLE001
        return None


# --------------------------------------------------------------------------- #
# Defensive formatters (own copy: the chapter never imports siblings).
# --------------------------------------------------------------------------- #
def _fmt_num(value, decimals: int = 3) -> str:
    if value is None:
        return "—"
    if isinstance(value, bool):
        return "sí" if value else "no"
    if isinstance(value, int):
        return f"{value:,}".replace(",", ".")
    if isinstance(value, float):
        if value != value:  # NaN
            return "—"
        if value in (float("inf"), float("-inf")):
            return str(value)
        text = f"{value:.{decimals}f}"
        if "." in text:  # strip trailing zeros only after a decimal point
            text = text.rstrip("0").rstrip(".")
        return "0" if text in ("", "-0") else text
    return model._safe_str(value)


def _fmt_int(value) -> str:
    if value is None:
        return "—"
    try:
        return f"{int(round(float(value))):,}".replace(",", ".")
    except (TypeError, ValueError, OverflowError):  # int(round(inf)) overflows
        return model._safe_str(value)


def _fmt_pct(value, decimals: int = 2) -> str:
    """Format an already-0-100 value as a percentage. None -> placeholder."""
    if value is None:
        return "—"
    try:
        return f"{float(value):.{decimals}f}%"
    except (TypeError, ValueError):
        return model._safe_str(value)


def _term(mark: bool, key: str, text: str) -> str:
    return f"[[term:{key}]]{text}[[/term]]" if mark else text


def _is_dict(v) -> bool:
    return isinstance(v, dict)


# --------------------------------------------------------------------------- #
# Profile reads.
# --------------------------------------------------------------------------- #
def _numeric_columns(profile: dict) -> list:
    """Return [(name, numeric_dict)] for numeric columns with usable stats."""
    out = []
    for col in profile.get("columns") or []:
        if not isinstance(col, dict):
            continue
        if col.get("inferred_type") != "numeric":
            continue
        num = col.get("numeric")
        if not isinstance(num, dict) or not num:
            continue
        if num.get("mean") is None and num.get("median") is None:
            continue
        out.append((col.get("name") or "(columna)", num))
    return out


def _clean_values(raw):
    """Return the finite float values of a raw column list (drop None/NaN/inf)."""
    if not isinstance(raw, (list, tuple)):
        return None
    vals = []
    for v in raw:
        if v is None or isinstance(v, bool):
            continue
        try:
            f = float(v)
        except (TypeError, ValueError):
            continue
        if f != f or f in (float("inf"), float("-inf")):
            continue
        vals.append(f)
    return vals


# --------------------------------------------------------------------------- #
# Per-column univariate summary.
# --------------------------------------------------------------------------- #
def _univariate_row(name, numeric, raw_vals, box_fn, detect_fn):
    """Compute one univariate summary row + boxplot inputs for a column.

    Returns a dict with the table cells and, when raw values are available, the
    exact Tukey/z counts and the list of atypical (flier) values; otherwise it
    degrades to the profile's own z-score counts and the fence flags.
    """
    box = {}
    if box_fn is not None:
        try:
            box = box_fn(numeric) or {}
        except Exception:  # noqa: BLE001
            box = {}
    lf = box.get("lower_fence")
    uf = box.get("upper_fence")

    vals = _clean_values(raw_vals)
    n_tukey = pct_tukey = None
    n_z = pct_z = None
    low_extreme = high_extreme = None
    fliers = []
    contamination = None  # metric used to rank columns (prefer Tukey %).

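    if vals:
        n = len(vals)
        # Count Tukey outliers only when at least one fence is known; without
        # fences the count stays None (unknown) instead of a misleading zero.
        if lf is not None or uf is not None:
            tukey_out = []
            for v in vals:
                if lf is not None and v < lf:
                    tukey_out.append(v)
                    # Lowest value below the lower fence (low tail only).
                    low_extreme = v if low_extreme is None else min(low_extreme, v)
                elif uf is not None and v > uf:
                    tukey_out.append(v)
                    # Highest value above the upper fence (high tail only).
                    high_extreme = v if high_extreme is None else max(high_extreme, v)
            n_tukey = len(tukey_out)
            pct_tukey = 100.0 * n_tukey / n
            fliers = tukey_out[:_MAX_FLIERS]
        # z-score rule via the registry function (returns parallel bools).
        if detect_fn is not None:
            try:
                flags = detect_fn(vals, _Z_THRESH) or []
                n_z = int(sum(1 for b in flags if b))
                pct_z = 100.0 * n_z / n
            except Exception:  # noqa: BLE001
                n_z = pct_z = None
        # Rank by Tukey % when known, else by z %.
        contamination = pct_tukey if pct_tukey is not None else pct_z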
    else:
        # Degrade: no raw sample for this column. The profile's own outlier
        # count/pct come from the z-score block (build_boxplot_stats note); the
        # Tukey count is unknown, only the fence flags are.
        n_z = numeric.get("n_outliers")
        pct_z = numeric.get("outlier_pct")
        if box.get("has_low_outliers") and box.get("min") is not None:
            low_extreme = box.get("min")
        if box.get("has_high_outliers") and box.get("max") is not None:
            high_extreme = box.get("max")
        contamination = pct_z if isinstance(pct_z, (int, float)) else None

    # Compact "extremos atípicos" cell: down/up arrows for the low/high tail.
    extremes = []
    if low_extreme is not None:
        extremes.append(f"↓ {_fmt_num(low_extreme)}")
    if high_extreme is not None:
        extremes.append(f"↑ {_fmt_num(high_extreme)}")
    extremes_cell = "  ".join(extremes) if extremes else "—"

    return {
        "name": model._safe_str(name),
        "n_tukey": n_tukey,
        "pct_tukey": pct_tukey,
        "n_z": n_z,
        "pct_z": pct_z,
        "lower_fence": lf,
        "upper_fence": uf,
        "extremes": extremes_cell,
        "box": box,
        "fliers": fliers,
        "has_raw": bool(vals),
        "contamination": contamination if isinstance(contamination, (int, float)) else -1.0,
    }


def _univariate_table(rows: list) -> model.DataTable:
    header = ["Columna", "Atípicos Tukey", "% Tukey", "Atípicos z", "% z",
              "Valla inf.", "Valla sup.", "Extremos atípicos"]
    table_rows = []
    for r in rows:
        table_rows.append([
            r["name"],
            _fmt_int(r["n_tukey"]) if r["n_tukey"] is not None else "—",
            _fmt_pct(r["pct_tukey"]) if r["pct_tukey"] is not None else "—",
            _fmt_int(r["n_z"]) if r["n_z"] is not None else "—",
            _fmt_pct(r["pct_z"]) if r["pct_z"] is not None else "—",
            _fmt_num(r["lower_fence"]),
            _fmt_num(r["upper_fence"]),
            r["extremes"],
        ])
    return model.DataTable(
        header=header, rows=table_rows,
        title="Valores atípicos por columna",
        note="Tukey = fuera de las vallas 1,5·IQR · z = |z-score| > 3 · "
             "ordenado de más a menos contaminada")


# --------------------------------------------------------------------------- #
# Multivariate (Isolation Forest) section.
# --------------------------------------------------------------------------- #
def _resolve_multivariate(profile: dict, ctx: dict, raw_numeric):
    """Return (outliers_dict_or_None, source).

    Prefers a LIVE Isolation Forest over ``raw_numeric`` so the detector and
    ``summarize_outlier_dims`` use EXACTLY the same numeric columns and the same
    valid-row indexing — otherwise the precomputed ``profile['models']
    ['outliers']`` (run by MODELOS over a possibly different column subset) would
    yield ``row_index`` values that no longer point at the rows
    ``summarize_outlier_dims`` reconstructs, mislabelling the "dimensions that
    make each row rare". Falls back to the precomputed block when no raw sample
    is available (e.g. the lite preset drops ``raw_numeric``)."""
    if _is_dict(raw_numeric) and raw_numeric:
        iso = _load_isolation_forest()
        if iso is not None:
            try:
                out = iso(raw_numeric)
                if _is_dict(out) and out.get("n_outliers") is not None and out.get("n_rows_used"):
                    return out, "live"
            except Exception:  # noqa: BLE001
                pass
    # Fallback: the model the MODELOS chapter already computed (no raw sample to
    # recompute against, so no per-row dimension breakdown either).
    models = profile.get("models")
    pre = models.get("outliers") if _is_dict(models) else None
    if _is_dict(pre) and pre.get("n_outliers") is not None and pre.get("n_rows_used"):
        return pre, "precomputed"
    return None, "none"


def _multivariate_blocks(outliers: dict, raw_numeric, mark: bool) -> list:
    isof = _term(mark, "isolation_forest", "**Isolation Forest**")
    blocks = [
        model.Heading(text="Filas atípicas (multivariante)", level=2),
        model.Markdown(text=(
            f"Hasta aquí cada columna se ha mirado por separado. {isof} busca "
            "filas raras considerando **todas las columnas a la vez**: una fila "
            "puede ser normal en cada variable y aun así ser atípica por la "
            "**combinación** de sus valores (p. ej. una edad baja con una tarifa "
            "muy alta). La tabla resume cuántas filas se marcaron y el umbral de "
            "decisión.")),
        model.KVTable(rows=[
            ("Filas analizadas", _fmt_int(outliers.get("n_rows_used"))),
            ("Columnas consideradas", _fmt_int(outliers.get("n_features"))),
            ("Filas atípicas", _fmt_int(outliers.get("n_outliers"))),
            ("% filas atípicas", _fmt_pct(outliers.get("outlier_pct"))),
            ("Umbral de decisión", _fmt_num(outliers.get("threshold"), 4)),
        ], title="Anomalías multivariantes"),
    ]

    rows_in = outliers.get("outlier_rows") or []
    if not rows_in:
        return blocks

    # Enrich each anomalous row with the dimensions that make it rare, when the
    # raw sample is available (summarize_outlier_dims reconstructs the same
    # valid-row indexing as isolation_forest_outliers).
    dims_by_row = {}
    if _is_dict(raw_numeric) and raw_numeric:
        summ = _load_summarize_dims()
        if summ is not None:
            try:
                enriched = summ(raw_numeric, rows_in, top_k=3) or []
                for e in enriched:
                    if _is_dict(e) and e.get("row_index") is not None:
                        dims_by_row[e.get("row_index")] = e.get("dims") or []
            except Exception:  # noqa: BLE001
                dims_by_row = {}

    has_dims = bool(dims_by_row)
    header = ["Fila (entre válidas)", "Score"]
    if has_dims:
        header.append("Dimensiones que la hacen rara (col = valor, z)")
    table_rows = []
    for r in rows_in[:_TOP_ROWS]:
        if not _is_dict(r):
            continue
        ridx = r.get("row_index")
        cells = [_fmt_int(ridx), _fmt_num(r.get("score"), 4)]
        if has_dims:
            dims = dims_by_row.get(ridx) or []
            parts = []
            for d in dims:
                if not _is_dict(d):
                    continue
                parts.append(
                    f"{model._safe_str(d.get('col'))} = {_fmt_num(d.get('value'))} "
                    f"(z {_fmt_num(d.get('z'), 2)})")
            cells.append("; ".join(parts) if parts else "—")
        table_rows.append(cells)

    if table_rows:
        shown = len(table_rows)
        total = outliers.get("n_outliers")
        note = "las filas más anómalas primero (score más bajo = más rara)"
        if isinstance(total, int) and total > shown:
            note += f" — top {shown} de {total}"
        if not has_dims:
            note += (" · no se pudo recuperar la muestra cruda para explicar las "
                     "dimensiones de cada fila")
        blocks.append(model.DataTable(
            header=header, rows=table_rows,
            title="Filas más atípicas", note=note))
    return blocks


# --------------------------------------------------------------------------- #
# Interpretation section.
# --------------------------------------------------------------------------- #
def _interpretation_block(mark: bool) -> model.Markdown:
    outlier = _term(mark, "outlier", "atípico")
    text = (
        f"**Un {outlier} no es necesariamente un error.** Conviene distinguir "
        "dos casos antes de actuar:\n\n"
        "- **Error de dato** (medida, registro o unidad equivocada): una edad de "
        "200 años, un importe negativo donde no puede haberlo, un decimal "
        "desplazado. Estos sí se corrigen o se eliminan, idealmente en el origen.\n"
        "- **Dato real extremo**: una observación legítima de la cola de la "
        "distribución (un cliente que gasta mucho más, una tarifa de lujo, un día "
        "de ventas excepcional). Borrarla sesga el análisis y oculta información "
        "valiosa.\n\n"
        "**Qué hacer.** Primero, **revisar** los valores señalados arriba contra "
        "su origen para decidir cuál de los dos casos es. Si son errores, "
        "corregirlos. Si son datos reales que distorsionan medias y modelos, hay "
        "alternativas a borrarlos: **winsorizar** (recortar los extremos a un "
        "percentil), o **re-expresar** la variable (por ejemplo una "
        "transformación logarítmica o la escalera de re-expresión de Tukey que "
        "este mismo perfil ya calcula para las columnas asimétricas), que suele "
        "domar la cola sin perder ninguna fila. La elección depende del objetivo: "
        "esta lectura es **exploratoria** —orienta dónde mirar—, no una regla "
        "automática de limpieza.")
    return model.Markdown(text=text)


# --------------------------------------------------------------------------- #
# Entry point.
# --------------------------------------------------------------------------- #
def build_outliers(profile: dict, ctx: dict):
    """Build the OUTLIERS Chapter, or None if the dataset has no numeric column."""
    profile = profile or {}
    ctx = ctx or {}
    if not isinstance(profile, dict):
        return None

    numerics = _numeric_columns(profile)
    if not numerics:
        return None  # chapter does not apply to a dataset with no numerics.

    # Register glossary terms (if a collector is present) and mark them clickable.
    glossary = ctx.get("glossary")
    mark = False
    if isinstance(glossary, model.GlossaryCollector):
        for key, (label, definition) in _TERM_DEFS.items():
            glossary.add(key, label, definition)
        mark = True

    raw_numeric = ctx.get("raw_numeric")
    raw_numeric = raw_numeric if isinstance(raw_numeric, dict) else {}

    box_fn = _load_build_boxplot_stats()
    detect_fn = _load_detect_outliers()

    # --- Univariate summary ------------------------------------------------- #
    uni_rows = []
    for name, numeric in numerics:
        uni_rows.append(_univariate_row(
            name, numeric, raw_numeric.get(name), box_fn, detect_fn))
    # Rank columns by contamination (Tukey % when available, else z %).
    uni_rows.sort(key=lambda r: r.get("contamination", -1.0), reverse=True)

    intro = (
        "Este capítulo reúne en un solo sitio el análisis de los **valores "
        "atípicos** de la tabla, que en el resto del informe aparecen dispersos. "
        f"Un {_term(mark, 'outlier', 'atípico')} es una observación que se aparta "
        "mucho del grueso de los datos. Cada columna numérica se evalúa con dos "
        f"criterios complementarios: las {_term(mark, 'tukey_fence', 'vallas de Tukey')} "
        "(fuera de P25−1,5·IQR o P75+1,5·IQR, robusto a la propia cola) y el "
        f"{_term(mark, 'zscore', 'z-score')} (|z| > 3, sensible a la media). La "
        "tabla está ordenada de la columna más contaminada a la menos.")

    blocks = [
        model.Heading(text=CHAPTER_TITLE, level=1),
        model.Markdown(text=intro),
        _univariate_table(uni_rows),
    ]

    # Flag the most contaminated columns explicitly.
    flagged = [r["name"] for r in uni_rows
               if r.get("contamination", -1.0) > 0][:_TOP_FLAGGED]
    if flagged:
        names = ", ".join(f"**{n}**" for n in flagged)
        blocks.append(model.Markdown(text=(
            f"Las columnas con mayor proporción de atípicos son {names}: "
            "concentran el grueso de los valores fuera de las vallas y son las "
            "primeras a revisar.")))

    # --- Boxplots figure ---------------------------------------------------- #
    box_entries = [
        {"name": r["name"], "box": r["box"], "fliers": r.get("fliers")}
        for r in uni_rows
        if r.get("box")
    ][:_TOP_BOX]
    if box_entries:
        def _boxplots_make(entries=box_entries):
            try:
                from datascience.build_boxplots_figure import build_boxplots_figure
                return build_boxplots_figure(
                    entries, title="Boxplots de Tukey por columna",
                    max_boxes=_TOP_BOX)
            except Exception:  # noqa: BLE001 — minimal fallback figure.
                import matplotlib
                matplotlib.use("Agg")
                from matplotlib.figure import Figure
                fig = Figure(figsize=(5.0, 2.2))
                ax = fig.add_subplot(111)
                ax.text(0.5, 0.5, "(boxplots no disponibles)",
                        ha="center", va="center")
                ax.axis("off")
                return fig

        blocks.append(model.Group(blocks=[
            model.Heading(text="Boxplots", level=2),
            model.Markdown(text=(
                "Cada caja abarca del primer al tercer cuartil (P25–P75), la línea "
                "interior es la mediana y los bigotes llegan hasta 1,5·IQR; los "
                "puntos son los valores que caen fuera de las vallas (atípicos por "
                "Tukey).")),
            model.Figure(
                make=_boxplots_make,
                caption="Boxplots de Tukey de las columnas más contaminadas."),
        ]))

    # --- Multivariate ------------------------------------------------------- #
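    outliers, src = _resolve_multivariate(profile, ctx, raw_numeric)
    if outliers is not None:
        # Explain row dimensions only for a live run: precomputed row indices may
        # not match the rows summarize_outlier_dims rebuilds from raw_numeric.
        dims_source = raw_numeric if src == "live" else None
        blocks.extend(_multivariate_blocks(outliers, dims_source, mark))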
    else:
        blocks.append(model.Heading(text="Filas atípicas (multivariante)", level=2))
        blocks.append(model.Note(
            "No se pudo analizar la anomalía multivariante: hacen falta al menos "
            "dos columnas numéricas y la muestra cruda (o los modelos del perfil) "
            "para correr Isolation Forest."))

    # --- Interpretation ----------------------------------------------------- #
    blocks.append(model.Heading(text="Cómo interpretar los atípicos", level=2))
    blocks.append(_interpretation_block(mark))

    return model.Chapter(id=CHAPTER_ID, title=CHAPTER_TITLE,
                         version=CHAPTER_VERSION, blocks=blocks)