fn_registry/python/functions/datascience/summarize_categorical.md at 295f90afaf78119e1ca8071228a2505b2bf02c11


egutierrez 763e06c127 feat(browser): auto-commit con 178 cambios

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-06-20 18:22:23 +02:00


| Field | Value |
| --- | --- |
| id | summarize_categorical_py_datascience |
| name | summarize_categorical |
| kind | function |
| lang | |
| domain | datascience |
| version | 1.0.0 |
| purity | pure |
| signature | `def summarize_categorical(values: list, top_k: int = 10) -> dict` |
| description | Profiles a categorical/text column for EDA: top-k frequencies, mode, distinct count, Shannon entropy (bits), imbalance ratio and string-length stats. None is dropped; empty string counts as a value. Produces the `categorical_sub` block of an eda ColumnProfile. |
| tags | eda, categorical, frequency, entropy, profiling, datascience, pure |
| uses_functions | |
| uses_types | |
| returns | |
| returns_optional | false |
| error_type | |
| imports | math, collections |
| example | `from summarize_categorical import summarize_categorical; result = summarize_categorical(["a", "a", "b", "c", "a", None, ""])` |
| tested | true |
| tests | test_summarize_categorical_repeated, test_summarize_categorical_empty, test_summarize_categorical_all_none, test_summarize_categorical_single_value, test_summarize_categorical_top_k, test_summarize_categorical_keys |
| test_file_path | python/functions/datascience/summarize_categorical_test.py |
| file_path | python/functions/datascience/summarize_categorical.py |

params

| name | desc |
| --- | --- |
| values | List of categorical/text values. None entries are discarded from every computation; an empty string "" is kept as the empty-string category (it is counted and contributes length 0). |
| top_k | Maximum number of most-frequent values to include in the `top` list. Default 10. Does not affect n_distinct/entropy/imbalance. |

output

Dict with the exact keys top, mode, mode_pct, n_distinct, entropy, imbalance, len_mean, len_min, len_max. `top` is a list of {value, count, pct} sorted by count descending (pct over the non-null total). When there are no non-null values, top=[] and every other key is None. With a single distinct value, entropy=0.0 and imbalance=1.0.
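The output contract described above can be turned into a small shape check. This is only a sketch: `check_summary_shape` and `EXPECTED_KEYS` are hypothetical names that are not part of the registry, and the check covers just the rules stated in the output description (exact keys, `top` entry fields and ordering, the empty case, and the single-value case).

```python
# Hypothetical helper, not part of the registry: validates a dict against
# the documented output contract of summarize_categorical.
EXPECTED_KEYS = {
    "top", "mode", "mode_pct", "n_distinct", "entropy",
    "imbalance", "len_mean", "len_min", "len_max",
}


def check_summary_shape(summary: dict) -> None:
    """Raise AssertionError if `summary` violates the documented contract."""
    assert set(summary) == EXPECTED_KEYS, "unexpected or missing keys"

    top = summary["top"]
    if not top:
        # No non-null values: top is [] and every other key is None.
        assert all(summary[k] is None for k in EXPECTED_KEYS - {"top"})
        return

    for entry in top:
        assert set(entry) == {"value", "count", "pct"}

    # `top` must be sorted by count, descending.
    counts = [entry["count"] for entry in top]
    assert counts == sorted(counts, reverse=True)

    if summary["n_distinct"] == 1:
        assert summary["entropy"] == 0.0
        assert summary["imbalance"] == 1.0


# Hand-written dict for the "no non-null values" case, used only to
# exercise the checker; it is not output captured from the real function.
empty_summary = {"top": [], **{k: None for k in EXPECTED_KEYS - {"top"}}}
check_summary_shape(empty_summary)
```

A check like this could sit next to the existing tests, but how the repository's own `test_summarize_categorical_keys` is written is not shown here.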

Example

from summarize_categorical import summarize_categorical

summarize_categorical(["a", "a", "b", "c", "a", None, ""])
# {
#   "top": [
#     {"value": "a", "count": 3, "pct": 0.5},
#     {"value": "b", "count": 1, "pct": 0.1666...},
#     {"value": "c", "count": 1, "pct": 0.1666...},
#     {"value": "",  "count": 1, "pct": 0.1666...},
#   ],
#   "mode": "a", "mode_pct": 0.5,
#   "n_distinct": 4,
#   "entropy": 1.79...,        # Shannon entropy in bits
#   "imbalance": 3.0,          # max_count(3) / min_count(1)
#   "len_mean": 0.833..., "len_min": 0, "len_max": 1,
# }

When to use it

Use it when profiling a categorical or text column during EDA: whenever you need the categorical block of a ColumnProfile in the eda group (top values, mode, cardinality, entropy/imbalance of the distribution, and string-length statistics). Pass it the list of raw column values; None is ignored automatically.
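Because None is dropped while the empty string counts as a value, how a column is extracted decides what gets profiled. The sketch below assumes the data comes from a CSV read with the standard `csv.DictReader`, which yields `""` for empty cells; `column_values` and its `empty_as_none` flag are hypothetical helpers for illustration, not part of the registry.

```python
import csv
import io


def column_values(rows, column, empty_as_none=False):
    """Collect the raw values of one column from a list of row dicts.

    With empty_as_none=True, empty cells become None, so a profiler that
    drops None would ignore them instead of counting an "" category.
    """
    out = []
    for row in rows:
        value = row.get(column)
        if empty_as_none and value == "":
            value = None
        out.append(value)
    return out


# Illustrative data only.
raw_csv = "city,color\nMadrid,red\nParis,\nMadrid,blue\n"
rows = list(csv.DictReader(io.StringIO(raw_csv)))

colors_kept = column_values(rows, "color")                      # ["red", "", "blue"]
colors_null = column_values(rows, "color", empty_as_none=True)  # ["red", None, "blue"]
```

Either list can then be passed as `values`; which one is appropriate depends on whether an empty cell should be treated as a category of its own or as missing data.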

Notes

Pure function, stdlib only (collections.Counter + math.log2). It does not use numpy or pandas. Entropy is Shannon entropy in base 2 (bits): 0.0 with a single distinct value, and maximal (log2(n) for n non-null values) when all values are distinct. imbalance is max_count / min_count over the distinct values (1.0 if there is only one).
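The entropy and imbalance definitions can be reproduced with the same two stdlib tools. This is a minimal sketch under the stated definitions, not the repository's implementation; `entropy_and_imbalance` is a hypothetical name, and returning `(None, None)` for input with no non-null values is an assumption modelled on the documented output.

```python
import math
from collections import Counter


def entropy_and_imbalance(values):
    """Return (Shannon entropy in bits, max_count / min_count).

    None entries are dropped first. Returns (None, None) when nothing
    is left (an assumption mirroring the documented empty case).
    """
    counts = Counter(v for v in values if v is not None)
    total = sum(counts.values())
    if total == 0:
        return None, None

    # H = sum over distinct values of -p * log2(p), with p = count / total.
    entropy = sum(-(c / total) * math.log2(c / total) for c in counts.values())
    imbalance = max(counts.values()) / min(counts.values())
    return entropy, imbalance


# Counts are a=3, b=1, c=1, ""=1 over 6 non-null values.
h, ratio = entropy_and_imbalance(["a", "a", "b", "c", "a", None, ""])
# h is 0.5 + 0.5 * log2(6), roughly 1.79 bits; ratio is 3 / 1 = 3.0
```

With four distinct values that each occur once, the same function gives 2.0 bits (log2(4)) and an imbalance of 1.0, which matches the boundary cases described in the notes.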
