---
id: compute_text_length_stats_py_datascience
name: compute_text_length_stats
kind: function
lang: py
domain: datascience
version: "1.0.0"
purity: pure
signature: "def compute_text_length_stats(texts, n_bins=20) -> dict"
description: "Profiles the length distribution of a corpus of text documents for EDA: per-document characters, words (unicode \\w+ tokens) and sentences (segments split on .!?… with a minimum of 1 per non-empty doc), each summarized with mean/p50/p90/p99/min/max (nearest-rank percentiles), plus an equal-width histogram of per-document word counts. None and non-str items are discarded. Dict-no-throw: never raises. Stdlib only (re)."
tags: [eda, datascience, text, nlp, length, statistics, pure, python]
uses_functions: []
uses_types: []
returns: []
returns_optional: false
error_type: ""
imports: [re, math]
example: |
  from datascience.compute_text_length_stats import compute_text_length_stats
  result = compute_text_length_stats(["Hola mundo.", "Una frase mas larga aqui."], n_bins=5)
tested: true
tests:
  - "test_basico"
  - "test_vacio"
  - "test_descarta_none"
  - "test_un_documento"
test_file_path: "python/functions/datascience/compute_text_length_stats_test.py"
file_path: "python/functions/datascience/compute_text_length_stats.py"
params:
  - name: texts
    desc: "List of text documents (str). None entries and any non-str items (ints, floats, etc.) are discarded before any computation. An empty string \"\" is kept (chars 0, words 0, sentences 0)."
  - name: n_bins
    desc: "Number of equal-width bins for the per-document word-count histogram. Default 20. When all docs have the same word count, there are <2 docs, or n_bins < 1, a single covering bin is returned instead."
output: "Dict with keys n_docs (int), chars, words, sentences and word_hist. Each of the three axis sub-dicts has the exact keys mean (float, 2 decimals), p50, p90, p99, min, max (ints). When there are no valid documents, n_docs is 0, every axis statistic is None and word_hist is [].
  word_hist is a list of {lo: float, hi: float, count: int} bins; the sum of all bin counts equals n_docs."
---

## Example

```python
from datascience.compute_text_length_stats import compute_text_length_stats

compute_text_length_stats(
    [
        "Hola mundo.",
        "Una frase mas larga con varias palabras aqui.",
        "Esto. Tiene. Tres frases distintas!",
    ],
    n_bins=5,
)
# {
#   "n_docs": 3,
#   "chars": {"mean": 30.33, "p50": 35, "p90": 45, "p99": 45, "min": 11, "max": 45},
#   "words": {"mean": 5.0, "p50": 5, "p90": 8, "p99": 8, "min": 2, "max": 8},
#   "sentences": {"mean": 1.67, "p50": 1, "p90": 3, "p99": 3, "min": 1, "max": 3},
#   "word_hist": [
#     {"lo": 2.0, "hi": 3.2, "count": 1},
#     {"lo": 3.2, "hi": 4.4, "count": 0},
#     {"lo": 4.4, "hi": 5.6, "count": 1},
#     {"lo": 5.6, "hi": 6.8, "count": 0},
#     {"lo": 6.8, "hi": 8.0, "count": 1},
#   ],
# }
```

## When to use it

Use it when profiling a free-text column or corpus during EDA: when you need to know how long the documents are (in characters, words and sentences) and how that length is distributed before tokenizing, vectorizing, or choosing truncation/window sizes for a model. Pass it the list of raw strings from the column; `None` and non-text values are discarded automatically. It fits in the `eda` group as the length block alongside `summarize_categorical`.

## Gotchas

- Pure function, stdlib only (`re`). It does not use numpy, pandas or sklearn.
- Percentiles use the **nearest-rank** method (they return an actual value from the list and do not interpolate); that is why p50/p90/p99/min/max are integers and `mean` is the only float (rounded to 2 decimals).
- The sentence count is an **approximation** based on punctuation (`.!?…`): a text without that punctuation counts as 1 sentence if it is non-empty; abbreviations or ellipses can inflate or reduce the count.
- `word_hist` is equal-width between the min and max word counts: when all docs have the same word count, there are fewer than 2 docs, or `n_bins < 1`, it returns a single bin.
- Dict-no-throw: on unexpected input it returns the empty shape (`n_docs` 0, axes `None`, `word_hist` `[]`) instead of raising.
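
The rules listed above (nearest-rank percentiles, punctuation-based sentence counting, equal-width bins with a single-bin fallback) can be illustrated with a minimal sketch. This is not the function's actual source: the helper names `nearest_rank`, `count_sentences` and `equal_width_hist` are hypothetical, and rounding bin edges to 2 decimals and treating punctuation-only text as 1 sentence are assumptions made for the illustration.

```python
import math
import re


def nearest_rank(values, pct):
    """Nearest-rank percentile: returns an element of `values`, never an interpolation."""
    if not values:
        return None
    ordered = sorted(values)
    # Rank = ceil(pct * n / 100), clamped to [1, n], then turned into a 0-based index.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[min(rank, len(ordered)) - 1]


def count_sentences(text):
    """Approximate sentence count: non-blank segments between runs of . ! ? …"""
    if not text:
        return 0
    segments = [s for s in re.split(r"[.!?…]+", text) if s.strip()]
    # Assumption: any non-empty text counts as at least one sentence.
    return max(1, len(segments))


def equal_width_hist(values, n_bins):
    """Equal-width bins between min and max, with a single-bin fallback."""
    if not values:
        return []
    lo, hi = min(values), max(values)
    if len(values) < 2 or lo == hi or n_bins < 1:
        # Degenerate case: one bin covering every value.
        return [{"lo": float(lo), "hi": float(hi), "count": len(values)}]
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # Clamp the index so that v == hi lands in the last bin.
        counts[min(int((v - lo) / width), n_bins - 1)] += 1
    # Assumption: edges are rounded to 2 decimals for readability.
    return [
        {"lo": round(lo + i * width, 2), "hi": round(lo + (i + 1) * width, 2), "count": c}
        for i, c in enumerate(counts)
    ]


word_counts = [2, 8, 5]
print(nearest_rank(word_counts, 50), nearest_rank(word_counts, 90))  # 5 8
print(count_sentences("Esto. Tiene. Tres frases distintas!"))  # 3
print([b["count"] for b in equal_width_hist(word_counts, 5)])  # [1, 0, 1, 0, 1]
```

Because the bin counts always sum to the number of values, the sketch preserves the invariant that the sum of `word_hist` counts equals `n_docs`.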