| name | kind | lang | domain | version | purity | signature | description | tags | params | output | uses_functions | uses_types | returns | returns_optional | error_type | imports | tested | tests | test_file_path | file_path | source_repo | source_license | source_file |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| words_to_dataset | function | py | datascience | 1.0.0 | pure | `def words_to_dataset(texts: Iterable[str \| None], min_ocurrencias: int = 1, eliminar_stopwords: bool = False) -> list[dict]` | Extracts words and their occurrence counts from an iterable of texts. Tokenizes with `\b\w+\b`, converts to uppercase, counts with Counter, filters by minimum occurrences, and optionally removes Spanish stopwords. No pandas. | | | List of dicts `{'palabra': str, 'ocurrencias': int}` sorted by occurrences in descending order. | | | | false | | | true | | python/functions/datascience/tests/test_words_to_dataset.py | python/functions/datascience/words_to_dataset.py | internal:footprint_aurgi | internal-aurgi | fuzzy_joins/arreglo_fuzzy.py |
Example

    from words_to_dataset import words_to_dataset

    texts = ["calle mayor", "calle del sol", "avenida principal"]
    result = words_to_dataset(texts)
    # [{"palabra": "CALLE", "ocurrencias": 2}, {"palabra": "MAYOR", "ocurrencias": 1}, ...]

    result_clean = words_to_dataset(texts, eliminar_stopwords=True)
    # "DEL" does not appear
Notes

Stopwords are embedded (a frozenset of ~40 Spanish words). Pure function: stdlib only (re, collections.Counter). Tokens are uppercased so that "Calle" and "CALLE" are counted as the same word.
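
The pipeline described above (tokenize, uppercase, count, filter) can be sketched as follows. This is a minimal illustration and not the actual contents of `words_to_dataset.py`. The name `words_to_dataset_sketch`, the abbreviated `_STOPWORDS_SKETCH` set, the skipping of `None` or empty texts, and the ordering of ties are all assumptions.

```python
import re
from collections import Counter
from collections.abc import Iterable
from typing import Optional

# Hypothetical, abbreviated stopword set; the real module is said to embed
# roughly 40 Spanish words, which are not listed in this card.
_STOPWORDS_SKETCH = frozenset({"DE", "DEL", "LA", "EL", "LOS", "LAS", "Y", "EN"})

# Token pattern taken from the description: runs of word characters.
_TOKEN_RE = re.compile(r"\b\w+\b")


def words_to_dataset_sketch(
    texts: Iterable[Optional[str]],
    min_ocurrencias: int = 1,
    eliminar_stopwords: bool = False,
) -> list[dict]:
    """Count uppercase tokens across texts and return them as a list of dicts."""
    counts = Counter()
    for text in texts:
        if not text:  # assumption: None and empty strings are skipped
            continue
        # Uppercase each token so "Calle" and "CALLE" share one counter key.
        counts.update(token.upper() for token in _TOKEN_RE.findall(text))

    # most_common() yields (word, count) pairs sorted by count, descending;
    # ties keep first-seen order (an assumption about the real function).
    return [
        {"palabra": word, "ocurrencias": n}
        for word, n in counts.most_common()
        if n >= min_ocurrencias
        and not (eliminar_stopwords and word in _STOPWORDS_SKETCH)
    ]


texts = ["calle mayor", "calle del sol", None, "avenida principal"]
print(words_to_dataset_sketch(texts, min_ocurrencias=2))
# [{'palabra': 'CALLE', 'ocurrencias': 2}]
```

Raising `min_ocurrencias` is a cheap way to drop one-off tokens before any further processing, and the stopword filter is applied after counting, so it does not affect the counts of the remaining words.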