fn_registry/python/functions/datascience/automatic_eda/chapters/outliers.py
egutierrez 6f88f184f1 feat(eda): OUTLIERS chapter (univariate + multivariate atypical values)
New dedicated `outliers` chapter for the AutomaticEDA engine. It gathers and
deepens in one place the analysis of atypical values, which is currently
scattered between `num_distr` (per-column count) and `modelos`
(IsolationForest). It is registered in `chapters_registry.py` between
`missingness` and `correlacion` (data-quality block: calidad → missingness →
outliers).

Chapter contents:
- Univariate summary per column: count and % of atypical values by Tukey
  (1.5·IQR) and by z-score (|z| > 3), with lower/upper fences and extreme
  values. Sorted by contamination, flagging the most affected columns. Reuses
  the registry functions `build_boxplot_stats` (fences from the profile
  percentiles) and `detect_outliers` (z-score rule on the raw sample in `ctx`).
- Tukey boxplots of the most contaminated columns (box, whiskers and atypical
  points), delegated to the new function `build_boxplots_figure`.
- Multivariate: anomalous rows considering all columns at once with
  `isolation_forest_outliers`: count and % of rows, the most anomalous rows with
  their score, and the dimensions that make them rare (top columns by |z|, via
  the new function `summarize_outlier_dims`). The detector runs live on
  `raw_numeric` so that the row indexing matches that of the dimensions exactly;
  it falls back to the precomputed block of the profile when there is no raw
  sample (lite preset).
- Exploratory interpretation: an atypical value is not necessarily an error
  (it distinguishes a data error from a real extreme value), plus
  recommendations (inspect, winsorize or re-express, linking to the Tukey
  re-expression in the profile).
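
The two univariate criteria in the list above can be sketched as follows. This is a minimal illustration that uses only the standard library; it is not the registry's `build_boxplot_stats` or `detect_outliers`. The helper names `tukey_fences` and `count_outliers` are hypothetical, and the choice of inclusive quartiles and the population standard deviation is an assumption (the registry functions may use different conventions).

```python
from statistics import mean, pstdev, quantiles


def tukey_fences(values, k=1.5):
    """Return the (lower, upper) Tukey fences computed from the quartiles."""
    # Assumption: inclusive quartiles; other quantile methods shift the fences.
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def count_outliers(values, z_thresh=3.0):
    """Count atypical values by the Tukey rule and by the |z| > z_thresh rule."""
    n = len(values)
    lower, upper = tukey_fences(values)
    n_tukey = sum(1 for v in values if v < lower or v > upper)

    mu = mean(values)
    sigma = pstdev(values)  # assumption: population standard deviation
    # A constant column has sigma == 0 and therefore no z-score outliers.
    n_z = sum(1 for v in values if sigma > 0 and abs(v - mu) / sigma > z_thresh)

    return {
        "lower_fence": lower,
        "upper_fence": upper,
        "n_tukey": n_tukey,
        "pct_tukey": 100.0 * n_tukey / n,
        "n_z": n_z,
        "pct_z": 100.0 * n_z / n,
    }


# Twenty ordinary values plus one extreme value.
sample = list(range(10, 30)) + [200]
summary = count_outliers(sample)
```

On this sample both rules flag the single extreme value. The z-score rule depends on the mean and the standard deviation, which the extreme value itself inflates, so on small samples the two rules can disagree.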

Clickable terms registered in the shared glossary: `outlier`,
`tukey_fence`, `zscore`, `isolation_forest`.

New registry functions (datascience domain, eda group):
- `build_boxplots_figure_py_datascience` (figure helper, impure)
- `summarize_outlier_dims_py_datascience` (pure)
The chapter activates with ≥1 numeric column and returns None when there is
none; it reads everything defensively and never raises. Tests: chapter (golden +
edges + error path + PDF/PPTX render) and both new functions. The AutomaticEDA
non-regression suite is green. Verified end-to-end with the Titanic dataset
(Fare/Parch/SibSp as the most contaminated columns).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-30 21:12:40 +02:00

594 lines
26 KiB
Python
"""Outliers chapter (OUTLIERS) — univariate + multivariate atypical values.

Today the analysis of atypical values is scattered across the document: the
NUM DISTR chapter mentions the per-column outlier count inside each distribution
figure, and the MODELOS chapter runs Isolation Forest as one of several cheap
models. This chapter gathers and deepens the whole outlier story in a single
place, with its interpretation: an [[term:outlier]]outlier[[/term]] is **not
necessarily an error** — it can be a legitimate, extreme but real observation —
so the reading is exploratory (what to look at), never confirmatory (what to
delete).

Sections, in order:

1. **Resumen univariante por columna** — for every numeric column, the number
   and percentage of atypical values by two complementary criteria: Tukey's
   1.5·IQR rule ([[term:tukey_fence]]vallas de Tukey[[/term]]) and the
   [[term:zscore]]z-score[[/term]] rule (|z| > 3). The most contaminated columns
   are flagged. The fences come from the pure registry function
   ``build_boxplot_stats`` (derived from the profile percentiles); the per-column
   counts use the raw sample in ``ctx['raw_numeric']`` when available (the exact
   count), degrading to the profile's own z-score counts otherwise.
2. **Boxplots** — a single figure with the Tukey boxplots of the most
   contaminated columns (box, whiskers and atypical points), delegated to the
   reusable registry helper ``build_boxplots_figure``.
3. **Multivariante (filas anómalas)** — rows that are atypical considering ALL
   columns at once, via the registry function ``isolation_forest_outliers``: the
   count and percentage of anomalous rows, the most anomalous rows with their
   score, and the dimensions that make each one rare (top columns by |z|, via
   ``summarize_outlier_dims``). Run live on ``ctx['raw_numeric']`` (the same
   numeric columns ``summarize_outlier_dims`` uses, so the row indexing stays
   coherent and the dimension breakdown is correct); falls back to the
   precomputed ``profile['models']['outliers']`` only when no raw sample is
   available (e.g. the lite preset), where no per-row breakdown is shown.
4. **Interpretación** — outlier ≠ error: how to tell a data-entry error from a
   genuine extreme value, and what to do (inspect, winsorize, or re-express —
   linking to the Tukey re-expression the profile already computes).

The chapter activates whenever the table has at least one numeric column; with
no numeric column it returns ``None`` and disappears from the document.

Reads everything defensively (``.get``) and never raises: every registry
delegation is imported lazily and degraded to an honest note on any failure.

Contract: build_<id>(profile, ctx) -> Chapter | None ; CHAPTER_VERSION = "x.y.z".
"""
from __future__ import annotations
from .. import model
CHAPTER_VERSION = "1.0.0"
CHAPTER_ID = "outliers"
CHAPTER_TITLE = "Valores atípicos"
# z-score threshold for the univariate z rule: |z| > 3 flags a value ~3 standard
# deviations from the mean (≈99.7% of a normal distribution lies within ±3σ).
_Z_THRESH = 3.0
# How many columns to draw in the boxplots figure (most contaminated first) and
# how many anomalous rows to list in the multivariate table.
_TOP_BOX = 12
_TOP_ROWS = 12
# Cap on the raw atypical values passed as boxplot fliers, so a heavy-tailed
# column does not flood the figure with thousands of points.
_MAX_FLIERS = 200
# How many columns flagged as "most contaminated" in the summary note.
_TOP_FLAGGED = 3
# Glossary terms this chapter explains (contract §11.1). Registered in the shared
# collector and marked clickable on first appearance. ``isolation_forest`` and
# ``zscore`` may also be registered by the MODELOS chapter — ``add`` is
# idempotent (first definition wins), so registering them here is harmless and
# keeps this chapter self-contained when MODELOS does not render.
_TERM_DEFS = {
    "outlier": (
        "Valor atípico (outlier)",
        "Una observación que se aparta mucho del grueso de los datos. Un atípico "
        "NO es necesariamente un error: puede ser un fallo de medida o de "
        "registro, pero también un dato real extremo (un cliente que gasta diez "
        "veces la media, un día de ventas excepcional). Por eso se señalan para "
        "revisarlos, no para borrarlos automáticamente.",
    ),
    "tukey_fence": (
        "Vallas de Tukey (1,5·IQR)",
        "Regla clásica para marcar atípicos a partir de los cuartiles: se calcula "
        "el rango intercuartílico IQR = P75 − P25 y se trazan dos vallas, una "
        "inferior en P25 − 1,5·IQR y otra superior en P75 + 1,5·IQR. Los valores "
        "que caen fuera de esas vallas se consideran atípicos. Es robusta porque "
        "se apoya en la mediana y los cuartiles, no en la media.",
    ),
    "zscore": (
        "z-score (puntuación típica)",
        "Mide a cuántas desviaciones típicas está un valor de la media de su "
        "columna: z = (valor − media) / desviación típica. Un |z| grande (aquí > "
        "3) señala un valor alejado del centro. A diferencia de las vallas de "
        "Tukey, el z-score usa media y desviación, así que es más sensible a la "
        "presencia de los propios atípicos.",
    ),
    "isolation_forest": (
        "Isolation Forest (anomalías multivariantes)",
        "Algoritmo de detección de anomalías que considera TODAS las columnas a "
        "la vez: construye árboles que parten el espacio con cortes aleatorios y "
        "mide cuántos cortes hacen falta para aislar cada fila. Las filas raras "
        "se aíslan con muy pocos cortes y se marcan como atípicas según un umbral "
        "de contaminación. Detecta combinaciones de valores poco frecuentes que "
        "ninguna columna por separado revelaría.",
    ),
}
# --------------------------------------------------------------------------- #
# Lazy registry delegations (each degrades to None / no-op on any failure).
# --------------------------------------------------------------------------- #
def _load_build_boxplot_stats():
    try:
        from datascience.build_boxplot_stats import build_boxplot_stats
        return build_boxplot_stats
    except Exception:  # noqa: BLE001
        return None


def _load_detect_outliers():
    # detect_outliers lives in the monolithic ``datascience.datascience`` module
    # (file_path datascience.py), not in its own submodule — try both shapes.
    try:
        from datascience.datascience import detect_outliers
        return detect_outliers
    except Exception:  # noqa: BLE001
        try:
            from datascience import detect_outliers
            return detect_outliers
        except Exception:  # noqa: BLE001
            return None


def _load_isolation_forest():
    try:
        from datascience.isolation_forest_outliers import isolation_forest_outliers
        return isolation_forest_outliers
    except Exception:  # noqa: BLE001
        return None


def _load_summarize_dims():
    try:
        from datascience.summarize_outlier_dims import summarize_outlier_dims
        return summarize_outlier_dims
    except Exception:  # noqa: BLE001
        return None


# --------------------------------------------------------------------------- #
# Defensive formatters (own copy: the chapter never imports siblings).
# --------------------------------------------------------------------------- #
def _fmt_num(value, decimals: int = 3) -> str:
    if value is None:
        return ""
    if isinstance(value, bool):
        return "" if value else "no"
    if isinstance(value, int):
        return f"{value:,}".replace(",", ".")
    if isinstance(value, float):
        if value != value:  # NaN
            return ""
        if value in (float("inf"), float("-inf")):
            return str(value)
        text = f"{value:.{decimals}f}".rstrip("0").rstrip(".")
        return text if text else "0"
    return model._safe_str(value)


def _fmt_int(value) -> str:
    if value is None:
        return ""
    try:
        return f"{int(round(float(value))):,}".replace(",", ".")
    except (TypeError, ValueError):
        return model._safe_str(value)


def _fmt_pct(value, decimals: int = 2) -> str:
    """Format an already-0-100 value as a percentage. None -> placeholder."""
    if value is None:
        return ""
    try:
        return f"{float(value):.{decimals}f}%"
    except (TypeError, ValueError):
        return model._safe_str(value)


def _term(mark: bool, key: str, text: str) -> str:
    return f"[[term:{key}]]{text}[[/term]]" if mark else text


def _is_dict(v) -> bool:
    return isinstance(v, dict)


# --------------------------------------------------------------------------- #
# Profile reads.
# --------------------------------------------------------------------------- #
def _numeric_columns(profile: dict) -> list:
    """Return [(name, numeric_dict)] for numeric columns with usable stats."""
    out = []
    for col in profile.get("columns") or []:
        if not isinstance(col, dict):
            continue
        if col.get("inferred_type") != "numeric":
            continue
        num = col.get("numeric")
        if not isinstance(num, dict) or not num:
            continue
        if num.get("mean") is None and num.get("median") is None:
            continue
        out.append((col.get("name") or "(columna)", num))
    return out


def _clean_values(raw):
    """Return the finite float values of a raw column list (drop None/NaN/inf)."""
    if not isinstance(raw, (list, tuple)):
        return None
    vals = []
    for v in raw:
        if v is None or isinstance(v, bool):
            continue
        try:
            f = float(v)
        except (TypeError, ValueError):
            continue
        if f != f or f in (float("inf"), float("-inf")):
            continue
        vals.append(f)
    return vals


# --------------------------------------------------------------------------- #
# Per-column univariate summary.
# --------------------------------------------------------------------------- #
def _univariate_row(name, numeric, raw_vals, box_fn, detect_fn):
    """Compute one univariate summary row + boxplot inputs for a column.

    Returns a dict with the table cells and, when raw values are available, the
    exact Tukey/z counts and the list of atypical (flier) values; otherwise it
    degrades to the profile's own z-score counts and the fence flags.
    """
    box = {}
    if box_fn is not None:
        try:
            box = box_fn(numeric) or {}
        except Exception:  # noqa: BLE001
            box = {}
    lf = box.get("lower_fence")
    uf = box.get("upper_fence")
    vals = _clean_values(raw_vals)
    n_tukey = pct_tukey = None
    n_z = pct_z = None
    low_extreme = high_extreme = None
    fliers = []
    contamination = None  # metric used to rank columns (prefer Tukey %).
    if vals:
        n = len(vals)
        tukey_out = []
        for v in vals:
            below = (lf is not None and v < lf)
            above = (uf is not None and v > uf)
            if below or above:
                tukey_out.append(v)
        n_tukey = len(tukey_out)
        pct_tukey = 100.0 * n_tukey / n if n else None
        if tukey_out:
            low_extreme = min(tukey_out)
            high_extreme = max(tukey_out)
            fliers = tukey_out[:_MAX_FLIERS]
        # z-score rule via the registry function (returns parallel bools).
        if detect_fn is not None:
            try:
                flags = detect_fn(vals, _Z_THRESH) or []
                n_z = int(sum(1 for b in flags if b))
                pct_z = 100.0 * n_z / n if n else None
            except Exception:  # noqa: BLE001
                n_z = pct_z = None
        contamination = pct_tukey
    else:
        # Degrade: no raw sample for this column. The profile's own outlier
        # count/pct come from the z-score block (build_boxplot_stats note); the
        # Tukey count is unknown, only the fence flags are.
        n_z = numeric.get("n_outliers")
        pct_z = numeric.get("outlier_pct")
        if box.get("has_low_outliers") and box.get("min") is not None:
            low_extreme = box.get("min")
        if box.get("has_high_outliers") and box.get("max") is not None:
            high_extreme = box.get("max")
        contamination = pct_z if isinstance(pct_z, (int, float)) else None
    # Compact "extremos atípicos" cell: down/up arrows for the low/high tail.
    extremes = []
    if low_extreme is not None:
        extremes.append(f"{_fmt_num(low_extreme)}")
    if high_extreme is not None:
        extremes.append(f"{_fmt_num(high_extreme)}")
    extremes_cell = " ".join(extremes) if extremes else ""
    return {
        "name": model._safe_str(name),
        "n_tukey": n_tukey,
        "pct_tukey": pct_tukey,
        "n_z": n_z,
        "pct_z": pct_z,
        "lower_fence": lf,
        "upper_fence": uf,
        "extremes": extremes_cell,
        "box": box,
        "fliers": fliers,
        "has_raw": bool(vals),
        "contamination": contamination if isinstance(contamination, (int, float)) else -1.0,
    }


def _univariate_table(rows: list) -> model.DataTable:
    header = ["Columna", "Atípicos Tukey", "% Tukey", "Atípicos z", "% z",
              "Valla inf.", "Valla sup.", "Extremos atípicos"]
    table_rows = []
    for r in rows:
        table_rows.append([
            r["name"],
            _fmt_int(r["n_tukey"]) if r["n_tukey"] is not None else "",
            _fmt_pct(r["pct_tukey"]) if r["pct_tukey"] is not None else "",
            _fmt_int(r["n_z"]) if r["n_z"] is not None else "",
            _fmt_pct(r["pct_z"]) if r["pct_z"] is not None else "",
            _fmt_num(r["lower_fence"]),
            _fmt_num(r["upper_fence"]),
            r["extremes"],
        ])
    return model.DataTable(
        header=header, rows=table_rows,
        title="Valores atípicos por columna",
        note="Tukey = fuera de las vallas 1,5·IQR · z = |z-score| > 3 · "
             "ordenado de más a menos contaminada")


# --------------------------------------------------------------------------- #
# Multivariate (Isolation Forest) section.
# --------------------------------------------------------------------------- #
def _resolve_multivariate(profile: dict, ctx: dict, raw_numeric):
    """Return (outliers_dict_or_None, source).

    Prefers a LIVE Isolation Forest over ``raw_numeric`` so the detector and
    ``summarize_outlier_dims`` use EXACTLY the same numeric columns and the same
    valid-row indexing — otherwise the precomputed ``profile['models']
    ['outliers']`` (run by MODELOS over a possibly different column subset) would
    yield ``row_index`` values that no longer point at the rows
    ``summarize_outlier_dims`` reconstructs, mislabelling the "dimensions that
    make each row rare". Falls back to the precomputed block when no raw sample
    is available (e.g. the lite preset drops ``raw_numeric``)."""
    if _is_dict(raw_numeric) and raw_numeric:
        iso = _load_isolation_forest()
        if iso is not None:
            try:
                out = iso(raw_numeric)
                if _is_dict(out) and out.get("n_outliers") is not None and out.get("n_rows_used"):
                    return out, "live"
            except Exception:  # noqa: BLE001
                pass
    # Fallback: the model the MODELOS chapter already computed (no raw sample to
    # recompute against, so no per-row dimension breakdown either).
    models = profile.get("models") if _is_dict(profile.get("models")) else {}
    pre = models.get("outliers") if _is_dict(models) else None
    if _is_dict(pre) and pre.get("n_outliers") is not None and pre.get("n_rows_used"):
        return pre, "precomputed"
    return None, "none"


def _multivariate_blocks(outliers: dict, raw_numeric, mark: bool) -> list:
    isof = _term(mark, "isolation_forest", "**Isolation Forest**")
    blocks = [
        model.Heading(text="Filas atípicas (multivariante)", level=2),
        model.Markdown(text=(
            f"Hasta aquí cada columna se ha mirado por separado. {isof} busca "
            "filas raras considerando **todas las columnas a la vez**: una fila "
            "puede ser normal en cada variable y aun así ser atípica por la "
            "**combinación** de sus valores (p. ej. una edad baja con una tarifa "
            "muy alta). La tabla resume cuántas filas se marcaron y el umbral de "
            "decisión.")),
        model.KVTable(rows=[
            ("Filas analizadas", _fmt_int(outliers.get("n_rows_used"))),
            ("Columnas consideradas", _fmt_int(outliers.get("n_features"))),
            ("Filas atípicas", _fmt_int(outliers.get("n_outliers"))),
            ("% filas atípicas", _fmt_pct(outliers.get("outlier_pct"))),
            ("Umbral de decisión", _fmt_num(outliers.get("threshold"), 4)),
        ], title="Anomalías multivariantes"),
    ]
    rows_in = outliers.get("outlier_rows") or []
    if not rows_in:
        return blocks
    # Enrich each anomalous row with the dimensions that make it rare, when the
    # raw sample is available (summarize_outlier_dims reconstructs the same
    # valid-row indexing as isolation_forest_outliers).
    dims_by_row = {}
    if _is_dict(raw_numeric) and raw_numeric:
        summ = _load_summarize_dims()
        if summ is not None:
            try:
                enriched = summ(raw_numeric, rows_in, top_k=3) or []
                for e in enriched:
                    if _is_dict(e) and e.get("row_index") is not None:
                        dims_by_row[e.get("row_index")] = e.get("dims") or []
            except Exception:  # noqa: BLE001
                dims_by_row = {}
    has_dims = bool(dims_by_row)
    header = ["Fila (entre válidas)", "Score"]
    if has_dims:
        header.append("Dimensiones que la hacen rara (col = valor, z)")
    table_rows = []
    for r in rows_in[:_TOP_ROWS]:
        if not _is_dict(r):
            continue
        ridx = r.get("row_index")
        cells = [_fmt_int(ridx), _fmt_num(r.get("score"), 4)]
        if has_dims:
            dims = dims_by_row.get(ridx) or []
            parts = []
            for d in dims:
                if not _is_dict(d):
                    continue
                parts.append(
                    f"{model._safe_str(d.get('col'))} = {_fmt_num(d.get('value'))} "
                    f"(z {_fmt_num(d.get('z'), 2)})")
            cells.append("; ".join(parts) if parts else "")
        table_rows.append(cells)
    if table_rows:
        shown = len(table_rows)
        total = outliers.get("n_outliers")
        note = "las filas más anómalas primero (score más bajo = más rara)"
        if isinstance(total, int) and total > shown:
            note += f" — top {shown} de {total}"
        if not has_dims:
            note += (" · no se pudo recuperar la muestra cruda para explicar las "
                     "dimensiones de cada fila")
        blocks.append(model.DataTable(
            header=header, rows=table_rows,
            title="Filas más atípicas", note=note))
    return blocks


# --------------------------------------------------------------------------- #
# Interpretation section.
# --------------------------------------------------------------------------- #
def _interpretation_block(mark: bool) -> model.Markdown:
    outlier = _term(mark, "outlier", "atípico")
    text = (
        f"**Un {outlier} no es necesariamente un error.** Conviene distinguir "
        "dos casos antes de actuar:\n\n"
        "- **Error de dato** (medida, registro o unidad equivocada): una edad de "
        "200 años, un importe negativo donde no puede haberlo, un decimal "
        "desplazado. Estos sí se corrigen o se eliminan, idealmente en el origen.\n"
        "- **Dato real extremo**: una observación legítima de la cola de la "
        "distribución (un cliente que gasta mucho más, una tarifa de lujo, un día "
        "de ventas excepcional). Borrarla sesga el análisis y oculta información "
        "valiosa.\n\n"
        "**Qué hacer.** Primero, **revisar** los valores señalados arriba contra "
        "su origen para decidir cuál de los dos casos es. Si son errores, "
        "corregirlos. Si son datos reales que distorsionan medias y modelos, hay "
        "alternativas a borrarlos: **winsorizar** (recortar los extremos a un "
        "percentil), o **re-expresar** la variable (por ejemplo una "
        "transformación logarítmica o la escalera de re-expresión de Tukey que "
        "este mismo perfil ya calcula para las columnas asimétricas), que suele "
        "domar la cola sin perder ninguna fila. La elección depende del objetivo: "
        "esta lectura es **exploratoria** —orienta dónde mirar—, no una regla "
        "automática de limpieza.")
    return model.Markdown(text=text)


# --------------------------------------------------------------------------- #
# Entry point.
# --------------------------------------------------------------------------- #
def build_outliers(profile: dict, ctx: dict):
    """Build the OUTLIERS Chapter, or None if the dataset has no numeric column."""
    profile = profile or {}
    ctx = ctx or {}
    if not isinstance(profile, dict):
        return None
    numerics = _numeric_columns(profile)
    if not numerics:
        return None  # chapter does not apply to a dataset with no numerics.
    # Register glossary terms (if a collector is present) and mark them clickable.
    glossary = ctx.get("glossary")
    mark = False
    if isinstance(glossary, model.GlossaryCollector):
        for key, (label, definition) in _TERM_DEFS.items():
            glossary.add(key, label, definition)
        mark = True
    raw_numeric = ctx.get("raw_numeric")
    raw_numeric = raw_numeric if isinstance(raw_numeric, dict) else {}
    box_fn = _load_build_boxplot_stats()
    detect_fn = _load_detect_outliers()

    # --- Univariate summary ------------------------------------------------- #
    uni_rows = []
    for name, numeric in numerics:
        uni_rows.append(_univariate_row(
            name, numeric, raw_numeric.get(name), box_fn, detect_fn))
    # Rank columns by contamination (Tukey % when available, else z %).
    uni_rows.sort(key=lambda r: r.get("contamination", -1.0), reverse=True)
    intro = (
        "Este capítulo reúne en un solo sitio el análisis de los **valores "
        "atípicos** de la tabla, que en el resto del informe aparecen dispersos. "
        f"Un {_term(mark, 'outlier', 'atípico')} es una observación que se aparta "
        "mucho del grueso de los datos. Cada columna numérica se evalúa con dos "
        f"criterios complementarios: las {_term(mark, 'tukey_fence', 'vallas de Tukey')} "
        "(fuera de P25−1,5·IQR o P75+1,5·IQR, robusto a la propia cola) y el "
        f"{_term(mark, 'zscore', 'z-score')} (|z| > 3, sensible a la media). La "
        "tabla está ordenada de la columna más contaminada a la menos.")
    blocks = [
        model.Heading(text=CHAPTER_TITLE, level=1),
        model.Markdown(text=intro),
        _univariate_table(uni_rows),
    ]
    # Flag the most contaminated columns explicitly.
    flagged = [r["name"] for r in uni_rows
               if r.get("contamination", -1.0) > 0][:_TOP_FLAGGED]
    if flagged:
        names = ", ".join(f"**{n}**" for n in flagged)
        blocks.append(model.Markdown(text=(
            f"Las columnas con mayor proporción de atípicos son {names}: "
            "concentran el grueso de los valores fuera de las vallas y son las "
            "primeras a revisar.")))

    # --- Boxplots figure ---------------------------------------------------- #
    box_entries = [
        {"name": r["name"], "box": r["box"], "fliers": r.get("fliers")}
        for r in uni_rows
        if r.get("box")
    ][:_TOP_BOX]
    if box_entries:
        def _boxplots_make(entries=box_entries):
            try:
                from datascience.build_boxplots_figure import build_boxplots_figure
                return build_boxplots_figure(
                    entries, title="Boxplots de Tukey por columna",
                    max_boxes=_TOP_BOX)
            except Exception:  # noqa: BLE001 — minimal fallback figure.
                import matplotlib
                matplotlib.use("Agg")
                from matplotlib.figure import Figure
                fig = Figure(figsize=(5.0, 2.2))
                ax = fig.add_subplot(111)
                ax.text(0.5, 0.5, "(boxplots no disponibles)",
                        ha="center", va="center")
                ax.axis("off")
                return fig

        blocks.append(model.Group(blocks=[
            model.Heading(text="Boxplots", level=2),
            model.Markdown(text=(
                "Cada caja abarca del primer al tercer cuartil (P25–P75), la línea "
                "interior es la mediana y los bigotes llegan hasta 1,5·IQR; los "
                "puntos son los valores que caen fuera de las vallas (atípicos por "
                "Tukey).")),
            model.Figure(
                make=_boxplots_make,
                caption="Boxplots de Tukey de las columnas más contaminadas."),
        ]))

    # --- Multivariate ------------------------------------------------------- #
    outliers, _src = _resolve_multivariate(profile, ctx, raw_numeric)
    if outliers is not None:
        blocks.extend(_multivariate_blocks(outliers, raw_numeric, mark))
    else:
        blocks.append(model.Heading(text="Filas atípicas (multivariante)", level=2))
        blocks.append(model.Note(
            "No se pudo analizar la anomalía multivariante: hacen falta al menos "
            "dos columnas numéricas y la muestra cruda (o los modelos del perfil) "
            "para correr Isolation Forest."))

    # --- Interpretation ----------------------------------------------------- #
    blocks.append(model.Heading(text="Cómo interpretar los atípicos", level=2))
    blocks.append(_interpretation_block(mark))
    return model.Chapter(id=CHAPTER_ID, title=CHAPTER_TITLE,
                         version=CHAPTER_VERSION, blocks=blocks)