| summarize_categorical_py_datascience |
summarize_categorical |
function |
py |
datascience |
1.0.0 |
pure |
def summarize_categorical(values: list, top_k: int = 10) -> dict |
Profiles a categorical/text column for EDA: top-k frequencies, mode, distinct count, Shannon entropy (bits), imbalance ratio and string-length stats. None is dropped; empty string counts as a value. Produces the `categorical_sub` block of an eda ColumnProfile. |
| eda |
| categorical |
| frequency |
| entropy |
| profiling |
| datascience |
| pure |
|
|
|
|
false |
|
|
from summarize_categorical import summarize_categorical
result = summarize_categorical(["a", "a", "b", "c", "a", None, ""])
|
true |
| test_summarize_categorical_repeated |
| test_summarize_categorical_empty |
| test_summarize_categorical_all_none |
| test_summarize_categorical_single_value |
| test_summarize_categorical_top_k |
| test_summarize_categorical_keys |
|
python/functions/datascience/summarize_categorical_test.py |
python/functions/datascience/summarize_categorical.py |
| name |
desc |
| values |
List of categorical/text values. None entries are excluded from every computation; an empty string "" is kept as its own category (it is counted and contributes a length of 0). |
|
| name |
desc |
| top_k |
Maximum number of most-frequent values to include in the `top` list. Default 10. Does not affect n_distinct/entropy/imbalance. |
|
|
Dict with the exact keys top, mode, mode_pct, n_distinct, entropy, imbalance, len_mean, len_min, len_max. `top` is a list of {value, count, pct} sorted by count descending (pct over the non-null total). When there are no non-null values, top=[] and every other key is None. With a single distinct value, entropy=0.0 and imbalance=1.0. |
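The return contract above can be illustrated with a minimal sketch. This is not the source in `summarize_categorical.py`; the name `summarize_categorical_sketch` is hypothetical, and several details the record does not specify are assumptions: `pct` and `mode_pct` are taken to be percentages on a 0 to 100 scale, `imbalance` is taken to be the most frequent count divided by the least frequent count (which is consistent with the stated 1.0 for a single distinct value), ties keep first-seen order, and lengths are measured on `str(value)`.

```python
import math
from collections import Counter

KEYS = ("top", "mode", "mode_pct", "n_distinct", "entropy",
        "imbalance", "len_mean", "len_min", "len_max")


def summarize_categorical_sketch(values: list, top_k: int = 10) -> dict:
    """Illustrative only; mirrors the documented contract under stated assumptions."""
    # None is dropped; the empty string "" stays in as a regular value.
    non_null = [v for v in values if v is not None]
    if not non_null:
        out = dict.fromkeys(KEYS)  # every key starts as None
        out["top"] = []            # documented: top=[] when nothing is non-null
        return out

    total = len(non_null)
    # Sorted by count descending; ties keep first-seen order.
    ranked = Counter(non_null).most_common()
    mode, mode_count = ranked[0]

    # ASSUMPTION: pct values are percentages (0 to 100) of the non-null total.
    top = [{"value": v, "count": c, "pct": 100.0 * c / total}
           for v, c in ranked[:top_k]]

    # Shannon entropy in bits, computed over ALL distinct values (not just top_k).
    entropy = sum(-(c / total) * math.log2(c / total) for _, c in ranked)

    # ASSUMPTION: imbalance ratio = largest count / smallest count.
    imbalance = ranked[0][1] / ranked[-1][1]

    # ASSUMPTION: string-length stats use len(str(v)) for each non-null value.
    lengths = [len(str(v)) for v in non_null]

    return {
        "top": top,
        "mode": mode,
        "mode_pct": 100.0 * mode_count / total,
        "n_distinct": len(ranked),
        "entropy": entropy,
        "imbalance": imbalance,
        "len_mean": sum(lengths) / len(lengths),
        "len_min": min(lengths),
        "len_max": max(lengths),
    }
```

Under these assumptions, the input `["a", "a", "b", "c", "a", None, ""]` has six non-null values and four distinct categories, with `"a"` as the mode. Passing a smaller `top_k` shortens only the `top` list and leaves `n_distinct`, `entropy` and `imbalance` unchanged.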