feat(browser): auto-commit with 178 changes
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -0,0 +1,73 @@
---
id: summarize_categorical_py_datascience
name: summarize_categorical
kind: function
lang: py
domain: datascience
version: "1.0.0"
purity: pure
signature: "def summarize_categorical(values: list, top_k: int = 10) -> dict"
description: "Profiles a categorical/text column for EDA: top-k frequencies, mode, distinct count, Shannon entropy (bits), imbalance ratio and string-length stats. None is dropped; empty string counts as a value. Produces the `categorical_sub` block of an eda ColumnProfile."
tags: [eda, categorical, frequency, entropy, profiling, datascience, pure]
uses_functions: []
uses_types: []
returns: []
returns_optional: false
error_type: ""
imports: [math, collections]
example: |
  from summarize_categorical import summarize_categorical
  result = summarize_categorical(["a", "a", "b", "c", "a", None, ""])
tested: true
tests:
  - "test_summarize_categorical_repeated"
  - "test_summarize_categorical_empty"
  - "test_summarize_categorical_all_none"
  - "test_summarize_categorical_single_value"
  - "test_summarize_categorical_top_k"
  - "test_summarize_categorical_keys"
test_file_path: "python/functions/datascience/summarize_categorical_test.py"
file_path: "python/functions/datascience/summarize_categorical.py"
params:
  - name: values
    desc: "List of categorical/text values. None entries are discarded from every computation; an empty string \"\" is kept as the empty-string category (counts and has length 0)."
  - name: top_k
    desc: "Maximum number of most-frequent values to include in the `top` list. Default 10. Does not affect n_distinct/entropy/imbalance."
output: "Dict with the exact keys top, mode, mode_pct, n_distinct, entropy, imbalance, len_mean, len_min, len_max. `top` is a list of {value, count, pct} sorted by count descending (pct over the non-null total). When there are no non-null values, top=[] and every other key is None. With a single distinct value, entropy=0.0 and imbalance=1.0."
---

## Example

```python
from summarize_categorical import summarize_categorical

summarize_categorical(["a", "a", "b", "c", "a", None, ""])
# {
#   "top": [
#     {"value": "a", "count": 3, "pct": 0.5},
#     {"value": "b", "count": 1, "pct": 0.1666...},
#     {"value": "c", "count": 1, "pct": 0.1666...},
#     {"value": "", "count": 1, "pct": 0.1666...},
#   ],
#   "mode": "a", "mode_pct": 0.5,
#   "n_distinct": 4,
#   "entropy": 1.79...,    # Shannon entropy in bits
#   "imbalance": 3.0,      # max_count(3) / min_count(1)
#   "len_mean": 0.833..., "len_min": 0, "len_max": 1,
# }
```
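
The figures in the example output can be checked by hand. The short sketch below recomputes `entropy`, `imbalance`, and `len_mean` from the counts alone; it does not call `summarize_categorical`, and the variable names are illustrative only.

```python
import math

# Counts of the non-null values in the example: "a" three times,
# and "b", "c" and "" once each (None is dropped, "" is kept).
counts = {"a": 3, "b": 1, "c": 1, "": 1}
total = sum(counts.values())  # 6 non-null values

# Shannon entropy in bits: sum of p * log2(1 / p) over the distinct values.
entropy = sum((c / total) * math.log2(total / c) for c in counts.values())

# Imbalance: largest count divided by smallest count.
imbalance = max(counts.values()) / min(counts.values())

# Mean string length, weighting each value's length by its count.
len_mean = sum(len(v) * c for v, c in counts.items()) / total

print(round(entropy, 4), imbalance, round(len_mean, 4))
# 1.7925 3.0 0.8333
```

The entropy works out to `0.5 + 0.5 * log2(6)`, which matches the `1.79...` shown in the example output.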

## When to use it

Use it when profiling a categorical or text column during EDA: whenever you need
the `categorical` block of a ColumnProfile in the `eda` group (top values, mode,
cardinality, entropy/imbalance of the distribution, and string-length
statistics). Pass it the column's raw list of values; `None` is ignored
automatically.

## Notes

Pure function, stdlib only (`collections.Counter` + `math.log2`). It uses neither
numpy nor pandas. The entropy is Shannon entropy in base 2 (bits): 0.0 with a
single distinct value, and maximal, at `log2(n)` for `n` non-null values, when
all values are distinct. `imbalance` is `max_count / min_count` over the
distinct values (1.0 if there is only one).
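
The source of `summarize_categorical.py` is not part of this document, so the following is only a minimal sketch of a function that satisfies the contract described above, built on `collections.Counter` and `math.log2`. The name `summarize_categorical_sketch`, the use of `str(v)` to measure lengths, and the tie order in `top` are assumptions, not the actual implementation.

```python
import math
from collections import Counter


def summarize_categorical_sketch(values: list, top_k: int = 10) -> dict:
    """Hypothetical stand-in that follows the documented output contract."""
    non_null = [v for v in values if v is not None]  # None is dropped, "" is kept
    scalar_keys = ("mode", "mode_pct", "n_distinct", "entropy",
                   "imbalance", "len_mean", "len_min", "len_max")
    if not non_null:
        # No non-null values: empty top list, every other key is None.
        return {"top": [], **{k: None for k in scalar_keys}}

    total = len(non_null)
    counts = Counter(non_null)
    # Sorted by count descending; equal counts keep first-seen order.
    ranked = counts.most_common()
    mode, mode_count = ranked[0]
    # Assumption: lengths come from str(v), so non-string values also work.
    lengths = [len(str(v)) for v in non_null]

    return {
        "top": [{"value": v, "count": c, "pct": c / total}
                for v, c in ranked[:top_k]],
        "mode": mode,
        "mode_pct": mode_count / total,
        "n_distinct": len(counts),
        # Shannon entropy in bits; a single distinct value gives 0.0.
        "entropy": sum((c / total) * math.log2(total / c)
                       for c in counts.values()),
        # Largest count over smallest count; 1.0 with one distinct value.
        "imbalance": mode_count / ranked[-1][1],
        "len_mean": sum(lengths) / total,
        "len_min": min(lengths),
        "len_max": max(lengths),
    }


result = summarize_categorical_sketch(["a", "a", "b", "c", "a", None, ""], top_k=2)
print([item["value"] for item in result["top"]], result["n_distinct"])
# ['a', 'b'] 4
```

Note that `top_k` only truncates the `top` list: `n_distinct`, `entropy`, and `imbalance` are still computed over all distinct values, as the parameter description states.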