feat(eda): outlier-free histogram (central view) in num_distr

describe_numeric emits a new additive key, histogram_clipped: a second histogram re-binned over the Tukey fence range [p25 - 1.5*IQR, p75 + 1.5*IQR], reusing the percentiles already computed. It is [] when the clip excludes nothing (no outliers), when the column is constant (iqr == 0), or when the clipped sub-sample loses its spread, so the renderer never duplicates the full histogram.

The num_distr chapter consumes histogram_clipped as a second figure INSIDE the column's existing keep-together group: the central view is the one to read when a long tail squashes the scale of the full histogram.

Bump describe_numeric 1.0.0 -> 1.1.0 (additive) and CHAPTER_VERSION num_distr 1.3.0 -> 1.4.0.

Tests: golden (clips the tail), edges (no outliers -> [], constant -> []), key contract, and an e2e render smoke test.
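The clipping rule described in the message can be sketched in isolation. This is a minimal illustration, not the project's code: `clipped_histogram` is a hypothetical name, `np.percentile` (default linear interpolation) and `np.histogram` stand in for the project's own percentile and `histogram` helpers, and the real percentile method may differ.

```python
import numpy as np


def clipped_histogram(values, bins=20):
    """Hypothetical stand-in for the histogram_clipped step of describe_numeric."""
    arr = np.asarray(values, dtype=float)
    if arr.size == 0:
        return []

    # Assumption: numpy's default (linear) percentile interpolation.
    p25, p75 = np.percentile(arr, [25, 75])
    iqr = p75 - p25
    if iqr <= 0:
        return []  # constant column: the fences collapse, nothing to show

    # Tukey inner fences, the same range a boxplot uses for its whiskers.
    lower_fence = p25 - 1.5 * iqr
    upper_fence = p75 + 1.5 * iqr
    clipped = arr[(arr >= lower_fence) & (arr <= upper_fence)]

    # No outliers removed: the second view would duplicate the full histogram.
    if clipped.size == 0 or clipped.size == arr.size:
        return []

    c_min, c_max = float(clipped.min()), float(clipped.max())
    if c_max <= c_min:
        return []  # trimmed sample lost its spread

    # Re-bin over the trimmed range so the bulk uses the whole x axis.
    counts, edges = np.histogram(clipped, bins=bins, range=(c_min, c_max))
    return [
        {"lo": float(edges[i]), "hi": float(edges[i + 1]), "count": int(c)}
        for i, c in enumerate(counts)
    ]
```

With a sample such as `list(range(1, 21)) + [1000]`, the single far value falls outside the upper fence, so the clipped view spans only 1 to 20 instead of 1 to 1000.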
@@ -69,7 +69,9 @@ def describe_numeric(values: list, bins: int = 20) -> dict:
         Dict with the exact keys of the eda `numeric_sub` contract:
         {min, max, mean, median, mode, std, variance, cv, p1, p5, p25, p50,
         p75, p95, p99, iqr, skew, kurtosis, n_outliers, outlier_pct, zero_pct,
-        negative_pct, distribution_type, histogram}.
+        negative_pct, distribution_type, histogram, histogram_clipped}.
+        histogram_clipped is a second histogram over the Tukey inner-fence
+        range (outliers trimmed) or [] when the clip removes nothing.
     """
     clean = _clean(values)
     n = len(clean)
@@ -77,6 +79,7 @@ def describe_numeric(values: list, bins: int = 20) -> dict:
     if n == 0:
         result = {k: None for k in _NULL_KEYS}
         result["histogram"] = []
+        result["histogram_clipped"] = []
         return result
 
     arr = np.array(clean, dtype=float)
@@ -131,6 +134,32 @@ def describe_numeric(values: list, bins: int = 20) -> dict:
         hi = minimum + (i + 1) * width
         hist.append({"lo": float(lo), "hi": float(hi), "count": int(count)})
 
+    # Clipped histogram: a second view of the central mass with the outliers
+    # trimmed away, re-binned over the Tukey inner-fence range [Q1-1.5*IQR,
+    # Q3+1.5*IQR] (coherent with the boxplot already drawn below the histogram).
+    # It answers "what does the bulk look like when the long tail no longer
+    # crushes the scale". Computed here because the raw sample (`clean`) is only
+    # alive at this point; the profile keeps aggregated bins, not raw values.
+    # Only emitted when the clip actually removes something *and* the trimmed
+    # sample still has spread; otherwise it degrades to [] and the renderer skips
+    # the second view (no redundant duplicate of the full histogram).
+    hist_clipped: list = []
+    lower_fence = p25 - 1.5 * iqr
+    upper_fence = p75 + 1.5 * iqr
+    if iqr > 0:
+        clipped = [v for v in clean if lower_fence <= v <= upper_fence]
+        if clipped and len(clipped) < len(clean):
+            c_counts = histogram(clipped, bins)
+            c_min = float(min(clipped))
+            c_max = float(max(clipped))
+            if c_counts and c_max > c_min:
+                c_width = (c_max - c_min) / bins
+                for i, count in enumerate(c_counts):
+                    lo = c_min + i * c_width
+                    hi = c_min + (i + 1) * c_width
+                    hist_clipped.append(
+                        {"lo": float(lo), "hi": float(hi), "count": int(count)})
+
     return {
         "min": minimum,
         "max": maximum,
@@ -156,4 +185,5 @@ def describe_numeric(values: list, bins: int = 20) -> dict:
         "negative_pct": negative_pct,
         "distribution_type": distribution_type,
         "histogram": hist,
+        "histogram_clipped": hist_clipped,
     }
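Because the key is additive, a consumer can treat a missing key and an empty list the same way. The sketch below shows one way a renderer might decide which figures to draw for a column. `figures_for_column` is a hypothetical helper, not part of this commit, and the assumption that older profiles simply lack the key is mine.

```python
def figures_for_column(numeric_sub: dict) -> list:
    """Hypothetical renderer helper: which histogram figures to draw for one column."""
    figures = []
    if numeric_sub.get("histogram"):
        figures.append("histogram")
    # .get() tolerates profiles produced before histogram_clipped existed
    # (assumed: such profiles have no key at all). An empty list means the
    # clip removed nothing, so the central view is skipped.
    if numeric_sub.get("histogram_clipped"):
        figures.append("histogram_clipped")
    return figures
```

Both figure names would then be placed in the same keep-together group, so the central view never drifts away from the full histogram it complements.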