feat: Python infra functions and Python types (core, datascience, infra)

Infra: cache_to_file, cache_to_sqlite, http_download_file, http_get_json, http_post_json, read_file_with_encoding, safe_extract_zip, scan_directory, setup_logger, normalize_zip_filenames. Types: 30+ core types (agent_action, context, task, message, parse_result...), 6 datascience types (entity_candidate, extraction_result...), 2 infra types. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-05 17:11:43 +02:00
parent 63a9cb5273
commit 9fd0ca9cac
110 changed files with 5714 additions and 0 deletions
@@ -0,0 +1,45 @@
+---
+name: read_file_with_encoding
+kind: function
+lang: py
+domain: infra
+version: "1.0.0"
+purity: impure
+signature: "read_file_with_encoding(path: str, encodings: list[str] | None = None) -> str"
+description: "Reads a text file, trying multiple encodings in order until one succeeds. Useful for files of unknown origin (Windows, Latin-1, with BOM, etc.)."
+tags: [file, encoding, io, text, utf8, latin1, cp1252, decode]
+uses_functions: []
+uses_types: []
+returns: []
+returns_optional: false
+error_type: "error_go_core"
+imports: []
+tested: true
+tests:
+  - "valid utf-8 file"
+  - "utf-8 file with BOM stripped by utf-8-sig"
+  - "latin-1 file"
+  - "binary file fails with ValueError"
+  - "custom encodings"
+  - "missing file raises FileNotFoundError"
+test_file_path: "python/functions/infra/read_file_with_encoding_test.py"
+file_path: "python/functions/infra/read_file_with_encoding.py"
+---
+
+## Example
+
+```python
+# Read a file of unknown origin
+content = read_file_with_encoding("/tmp/datos.csv")
+
+# Read a Windows file, handling the BOM explicitly
+content = read_file_with_encoding("/tmp/report.txt", encodings=["utf-8-sig", "cp1252"])
+```
+
+## Notes
+
+The default encodings are `["utf-8", "utf-8-sig", "latin-1", "cp1252"]`. Order matters: `utf-8` is tried first because it is the most common. A file with a BOM also decodes as plain `utf-8`, which leaves the BOM in the result as a leading U+FEFF character; to have it stripped automatically, pass `encodings=["utf-8-sig"]` or place `utf-8-sig` before `utf-8` in a custom list.
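The BOM behaviour described above can be checked directly with Python's built-in codecs. This is a minimal sketch that decodes an in-memory byte string rather than calling `read_file_with_encoding`; the sample content is made up for illustration.

```python
import codecs

# Bytes of a small UTF-8 file that starts with a byte order mark (BOM).
raw = codecs.BOM_UTF8 + "id,name\n".encode("utf-8")

# Plain utf-8 decodes successfully but keeps the BOM as U+FEFF.
as_utf8 = raw.decode("utf-8")

# utf-8-sig decodes the same bytes and drops the leading BOM.
as_utf8_sig = raw.decode("utf-8-sig")

print(repr(as_utf8))      # '\ufeffid,name\n'
print(repr(as_utf8_sig))  # 'id,name\n'
```

Because plain `utf-8` succeeds on such a file, a list that tries `utf-8` first never falls through to `utf-8-sig` for it.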
+
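+`latin-1` never raises `UnicodeDecodeError` because it maps every byte 0x00-0xFF to a character, so it acts as a universal fallback. As a consequence, assuming a decoding failure is the only reason an encoding is rejected, any encoding listed after `latin-1` (such as `cp1252` in the default list) is never reached, and the `ValueError` for undecodable content (for example, a binary file) is raised only when every encoding in the list fails, which requires a list without `latin-1`.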
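To illustrate why `latin-1` works as a catch-all while `cp1252` does not, the following sketch decodes raw byte values with both codecs. It uses only Python's standard codecs and does not depend on the function documented here.

```python
# Every possible byte value, 0x00 through 0xFF.
all_bytes = bytes(range(256))

# latin-1 maps each byte to the code point with the same number,
# so decoding can never fail.
decoded = all_bytes.decode("latin-1")

# cp1252 leaves a few byte values (such as 0x81) undefined,
# so decoding those bytes raises UnicodeDecodeError.
try:
    bytes([0x81]).decode("cp1252")
    cp1252_failed = False
except UnicodeDecodeError:
    cp1252_failed = True

print(len(decoded), cp1252_failed)
```

A successful `latin-1` decode therefore says nothing about whether the file was really Latin-1 text; it only guarantees that no exception is raised.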
+
+`FileNotFoundError` and other native `OSError` exceptions propagate unchanged if the file does not exist or an I/O error occurs; they are not wrapped in `ValueError`.
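The notes above suggest an ordered fallback loop. The sketch below is one way such a loop could look. It is not the committed implementation, and `read_with_fallback` and `DEFAULT_ENCODINGS` are hypothetical names chosen for illustration; the only assumptions are that the file is read once as bytes and that `UnicodeDecodeError` is the sole reason to move on to the next encoding.

```python
from pathlib import Path

# Default order taken from the notes above.
DEFAULT_ENCODINGS = ["utf-8", "utf-8-sig", "latin-1", "cp1252"]


def read_with_fallback(path, encodings=None):
    """Hypothetical stand-in for read_file_with_encoding (illustration only)."""
    candidates = encodings if encodings is not None else DEFAULT_ENCODINGS

    # Read the bytes once. FileNotFoundError and other OSError exceptions
    # are deliberately not caught, so they propagate unchanged.
    raw = Path(path).read_bytes()

    for encoding in candidates:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            # This encoding cannot represent the bytes; try the next one.
            continue

    # Reached only when every candidate encoding failed to decode.
    raise ValueError(f"could not decode {path!r} with any of {candidates}")
```

Reading the bytes once and decoding in memory avoids reopening the file for each candidate encoding, and keeping the I/O call outside the `try` block is what leaves `OSError` unwrapped.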