feat(eda): outlier-free histogram (central view) in num_distr

describe_numeric emits a new additive key, histogram_clipped: a second histogram re-binned over the Tukey fence range [p25-1.5*IQR, p75+1.5*IQR], reusing the percentiles already computed. It is [] when the clip excludes nothing (no outliers), when iqr==0 (e.g. a constant column), or when the clipped sub-sample has no spread left, so the renderer never duplicates the full histogram.

The num_distr chapter consumes histogram_clipped as a second figure INSIDE the column's existing keep-together group: the central view is the one to read when a long tail squashes the scale of the full histogram.

Bump describe_numeric 1.0.0->1.1.0 (additive) and CHAPTER_VERSION num_distr 1.3.0->1.4.0.

Tests: golden (clips the tail), edges (no outliers -> [], constant -> []), key contract, and an e2e render smoke test.
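
For reviewers who want to poke at the degradation rules above without the rest of the module, here is a minimal standalone sketch. It is not the code in this commit: `percentile` and `clipped_histogram` are hypothetical names, the linear-interpolation percentile is an assumption about how p25/p75 are computed, and the binning loop stands in for the project's own `histogram` helper, which is not shown here.

```python
def percentile(sorted_vals, q):
    """Linear-interpolation percentile (q in [0, 100]) of a sorted, non-empty list.

    Assumption: the real module may use a different interpolation rule.
    """
    pos = (len(sorted_vals) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (pos - lo)


def clipped_histogram(values, bins=20):
    """Hypothetical stand-in for the histogram_clipped branch of describe_numeric."""
    data = sorted(float(v) for v in values)
    if not data:
        return []

    p25, p75 = percentile(data, 25), percentile(data, 75)
    iqr = p75 - p25
    if iqr <= 0:
        return []  # zero IQR: the fences collapse, nothing useful to show

    lower_fence = p25 - 1.5 * iqr
    upper_fence = p75 + 1.5 * iqr
    clipped = [v for v in data if lower_fence <= v <= upper_fence]
    if not clipped or len(clipped) == len(data):
        return []  # the clip removed nothing: would duplicate the full histogram

    c_min, c_max = clipped[0], clipped[-1]  # data is sorted, so is clipped
    if c_max <= c_min:
        return []  # trimmed sample has no spread

    width = (c_max - c_min) / bins
    counts = [0] * bins
    for v in clipped:
        # The maximum value falls into the last bin instead of overflowing.
        idx = min(int((v - c_min) / width), bins - 1)
        counts[idx] += 1

    return [
        {"lo": c_min + i * width, "hi": c_min + (i + 1) * width, "count": counts[i]}
        for i in range(bins)
    ]
```

Calling `clipped_histogram(list(range(1, 11)) + [1000], bins=5)` drops the single extreme value and re-bins the remaining ten over [1, 10], while a constant column or a sample with no values beyond the fences yields `[]`.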
2026-07-03 20:34:08 +02:00
parent 1fee225bff
commit a9a60cbf2c
4 changed files with 169 additions and 6 deletions
@@ -69,7 +69,9 @@ def describe_numeric(values: list, bins: int = 20) -> dict:
        Dict with the exact keys of the eda `numeric_sub` contract:
        {min, max, mean, median, mode, std, variance, cv, p1, p5, p25, p50,
         p75, p95, p99, iqr, skew, kurtosis, n_outliers, outlier_pct, zero_pct,
-         negative_pct, distribution_type, histogram}.
+         negative_pct, distribution_type, histogram, histogram_clipped}.
+        histogram_clipped re-bins the Tukey inner-fence range (outliers
+        trimmed); [] if iqr == 0, nothing is clipped, or no spread remains.
    """
    clean = _clean(values)
    n = len(clean)
@@ -77,6 +79,7 @@ def describe_numeric(values: list, bins: int = 20) -> dict:
    if n == 0:
        result = {k: None for k in _NULL_KEYS}
        result["histogram"] = []
+        result["histogram_clipped"] = []
        return result

    arr = np.array(clean, dtype=float)
@@ -131,6 +134,32 @@ def describe_numeric(values: list, bins: int = 20) -> dict:
                hi = minimum + (i + 1) * width
                hist.append({"lo": float(lo), "hi": float(hi), "count": int(count)})

+    # Clipped histogram: a second view of the central mass with the outliers
+    # trimmed away, re-binned over the Tukey inner-fence range [Q1-1.5*IQR,
+    # Q3+1.5*IQR] (coherent with the boxplot already drawn below the histogram).
+    # It answers "what does the bulk look like when the long tail no longer
+    # crushes the scale". Computed here because the raw sample (`clean`) is only
+    # alive at this point — the profile keeps aggregated bins, not raw values.
+    # Only emitted when the clip actually removes something *and* the trimmed
+    # sample still has spread; otherwise it degrades to [] and the renderer skips
+    # the second view (no redundant duplicate of the full histogram).
+    hist_clipped: list = []
+    lower_fence = p25 - 1.5 * iqr
+    upper_fence = p75 + 1.5 * iqr
+    if iqr > 0:
+        clipped = [v for v in clean if lower_fence <= v <= upper_fence]
+        if clipped and len(clipped) < len(clean):
+            c_counts = histogram(clipped, bins)
+            c_min = float(min(clipped))
+            c_max = float(max(clipped))
+            if c_counts and c_max > c_min:
+                c_width = (c_max - c_min) / bins
+                for i, count in enumerate(c_counts):
+                    lo = c_min + i * c_width
+                    hi = c_min + (i + 1) * c_width
+                    hist_clipped.append(
+                        {"lo": float(lo), "hi": float(hi), "count": int(count)})
+
    return {
        "min": minimum,
        "max": maximum,
@@ -156,4 +185,5 @@ def describe_numeric(values: list, bins: int = 20) -> dict:
        "negative_pct": negative_pct,
        "distribution_type": distribution_type,
        "histogram": hist,
+        "histogram_clipped": hist_clipped,
    }