merge: quick/bulk-functions-sources-notebook (Go/Python/Bash functions, types, notebook, sources)

2026-04-05 17:13:13 +02:00
403 changed files with 28764 additions and 31 deletions
+215
@@ -0,0 +1,215 @@
# /extract-source: Extract functions from a repo in sources/
You are a function-extraction agent. Your job is to analyze a repository cloned under `sources/` and extract reusable functions into the registry, following the rules in `.claude/rules/sources.md`.
---
## Argument
`$ARGUMENTS`: name of the directory in `sources/` (e.g. `MiroFish`, `OpenViking`). If it is not provided, list the directories available in `sources/` and ask the user to choose one.
---
## STEP 0: Validate the source
```bash
ls sources/$ARGUMENTS/
```
If the directory does not exist, abort. Verify that the repo has a compatible license (MIT, Apache 2.0, BSD, ISC, MPL-2.0, Unlicense). If it is AGPL or GPL, or has no license, **warn the user** and ask for confirmation before continuing.
Identify:
- **License**: read LICENSE/LICENSE.md/COPYING
- **Primary language**: detect it from the files present (`*.go`, `*.py`, `*.rs`, `*.ts`, `*.js`, `Cargo.toml`, `go.mod`, `pyproject.toml`, `package.json`)
- **Repo URL**: look in the README, .git/config, or package.json
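The license rule above can be sketched in Go. This is only a minimal sketch, assuming the license has already been reduced to an SPDX-style identifier; `needsConfirmation` and the `compatible` table are hypothetical names, and treating any unlisted license like a missing one is an assumption that goes beyond the text.

```go
package main

import (
	"fmt"
	"strings"
)

// compatible lists SPDX-style identifiers for the licenses named in the text.
// The exact identifiers are an assumption; BSD is split into two common variants.
var compatible = map[string]bool{
	"MIT": true, "APACHE-2.0": true, "BSD-2-CLAUSE": true, "BSD-3-CLAUSE": true,
	"ISC": true, "MPL-2.0": true, "UNLICENSE": true,
}

// needsConfirmation (hypothetical helper) reports whether the agent must warn
// the user and wait for confirmation, plus a short reason.
func needsConfirmation(spdxID string) (bool, string) {
	id := strings.ToUpper(strings.TrimSpace(spdxID))
	switch {
	case id == "":
		return true, "no license found"
	case compatible[id]:
		return false, ""
	case strings.HasPrefix(id, "GPL"), strings.HasPrefix(id, "AGPL"):
		return true, fmt.Sprintf("%s is copyleft", id)
	default:
		// Assumption: anything not on the list is treated like a missing license.
		return true, fmt.Sprintf("%s is not on the compatible list", id)
	}
}

func main() {
	for _, id := range []string{"MIT", "gpl-3.0-only", ""} {
		ask, why := needsConfirmation(id)
		fmt.Printf("%q -> confirm=%t %s\n", id, ask, why)
	}
}
```

Working from an identifier rather than the raw license text avoids keyword false positives; the MPL-2.0 text, for instance, mentions the GNU General Public License by name.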
---
## STEP 1: Review the manifest
Read `sources/sources.yaml` to check whether this repo already has previous extractions. If it does, list them for the user and ask whether to keep extracting more or to re-evaluate the existing ones.
---
## STEP 2: Explore the repository
Analyze the repo structure to identify **all candidate functions**, pure and impure. The goal is to maximize the extraction of useful code.
### What to look for (by category)
**A. Pure functions** (algorithms, transformations, calculations, validations):
- Parsers, encoders/decoders, formatters
- Mathematical, statistical, and financial algorithms
- Data transformations, filters, mappers
- Validation and sanitization
**B. Impure functions** (I/O, network, external state):
- HTTP/API clients (REST, GraphQL, WebSocket)
- Filesystem operations (reading, writing, watching files)
- Database interactions (queries, migrations)
- Docker, cloud, and infrastructure operations
- Scraping, crawling, data collection
- Notifications, message sending
**C. Pipelines** (multi-step compositions):
- ETL flows (extract-transform-load)
- Setup/deploy/provision workflows
- Data-processing sequences
- Orchestrations that compose several functions
**D. Reusable types** (structs, enums, interfaces):
- Generic domain models
- Configuration types
- Well-defined interfaces/protocols
### Exploration strategy by language
- **Go**: `pkg/`, `internal/`, `utils/`, `lib/`, `cmd/` (exported functions, handlers, clients)
- **Python**: `src/`, `lib/`, `utils/`, `core/`, `api/` (functions, client classes, decorators)
- **Rust**: `crates/`, `src/lib.rs` (pub functions, implemented traits)
- **TypeScript/JS**: `src/`, `lib/`, `utils/`, `services/` (functions, hooks, components)
- **Bash**: `scripts/`, `bin/`, `tools/` (functions with a clear signature)
### What to ignore
- main() and CLI entry points (but extract the functions they invoke)
- Tests (but note which functions are well tested and mark them `tested: true`)
- Functions that depend on complex internal types that **cannot be adapted**
- Code with heavy external dependencies that are not in fn_registry
- Config loaders hardcoded to a specific project
---
## STEP 3: Query the registry to avoid duplicates
Before proposing any function, search registry.db with FTS5:
```bash
# For each candidate, search for similar entries
sqlite3 registry.db "SELECT id, kind, purity, description FROM functions WHERE id IN (SELECT id FROM functions_fts WHERE functions_fts MATCH 'name:NOMBRE* OR description:DESCRIPCION') ORDER BY name;"
```
If something similar already exists, discard the candidate or note that it is an improvement or variant.
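The MATCH expression in the query above can be assembled programmatically. The following is a minimal sketch, assuming the same `functions` and `functions_fts` tables; `ftsPhrase` and `buildMatch` are hypothetical helpers, and quoting the description as an exact phrase is a design choice of this sketch, not something the text prescribes.

```go
package main

import (
	"fmt"
	"strings"
)

// dupQuery is the duplicate lookup with the MATCH expression bound as a
// parameter, so no manual SQL quoting is needed.
const dupQuery = `SELECT id, kind, purity, description FROM functions
WHERE id IN (SELECT id FROM functions_fts WHERE functions_fts MATCH ?)
ORDER BY name;`

// ftsPhrase quotes s as an FTS5 string, doubling embedded double quotes.
func ftsPhrase(s string) string {
	return `"` + strings.ReplaceAll(s, `"`, `""`) + `"`
}

// buildMatch builds a prefix search on the name column OR a phrase search
// on the description column.
func buildMatch(name, description string) string {
	return fmt.Sprintf("name:%s* OR description:%s", ftsPhrase(name), ftsPhrase(description))
}

func main() {
	fmt.Println(dupQuery)
	fmt.Println(buildMatch("parse_cron", "cron expression"))
	// prints: name:"parse_cron"* OR description:"cron expression"
}
```

Passing the expression as a bound parameter keeps candidate names that contain quotes from breaking the SQL statement.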
---
## STEP 4: Present candidates to the user
Group the candidates by category and show them in separate tables:
### Pure functions
| # | Proposed name | Origin (file) | Target lang | Domain | Description |
|---|---|---|---|---|---|
### Impure functions
| # | Proposed name | Origin (file) | Target lang | Domain | I/O type | Description |
|---|---|---|---|---|---|---|
(I/O type: HTTP, filesystem, DB, Docker, network, etc.)
### Pipelines (compositions)
| # | Proposed name | Origin (file) | Target lang | Domain | Functions composed | Description |
|---|---|---|---|---|---|---|
### Types
| # | Proposed name | Origin (file) | Target lang | Domain | Algebraic | Description |
|---|---|---|---|---|---|---|
For each candidate, state:
- Why it passes the quality filter
- Whether it needs adaptation (renaming types, removing dependencies, translating to another language)
- Whether it is a translation from another language (e.g. Rust → Go)
- For impure functions: which `error_type` is appropriate
**Wait for the user's confirmation** before extracting. The user can:
- Approve all (`all`)
- Select by number (`1,3,5-8`)
- Select by category (`all the pure ones`, `pipelines only`)
- Ask to explore more areas of the repo
- Discard everything and finish
---
## STEP 5: Extract approved functions
For each approved function:
### 5a. Determine target and classification
| Nature | Target | kind | purity |
|---|---|---|---|
| Pure algorithm/logic | Go/Python `functions/{domain}/` | function | pure |
| Function with I/O (HTTP, DB, fs) | Go/Python `functions/{domain}/` | function | impure |
| System script/utility | Bash `bash/functions/{domain}/` | function | impure |
| UI/component | TypeScript `frontend/functions/{domain}/` | component | n/a |
| Multi-step composition | `functions/pipelines/` or `python/functions/pipelines/` | pipeline | impure |
| C/Rust/other language | Translate to Go or Python, preserving semantics | case by case | case by case |
### 5b. Create files
1. **Code**: copy and adapt:
   - Rename to snake_case
   - Use native types in the signature (no internal types from the repo)
   - Remove external dependencies, use stdlib
   - Fit the target Go package (name = directory name)
   - If it is a translation, preserve the semantics and document the origin
2. **Metadata .md**: create the complete frontmatter:
   - `source_repo`: URL of the original repo
   - `source_license`: license of the repo
   - `source_file`: path of the original file, relative to the repo root
   - All required fields for the kind (function/pipeline/component)
   - Purity rules:
     - `pure` → `returns_optional: false` + `error_type: ""`
     - `impure` → `error_type: "error_go_core"` (or the Python equivalent)
     - `pipeline` → `purity: impure` + `uses_functions` listing the functions it composes
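The purity rules lend themselves to a mechanical check before indexing. The following is a minimal sketch, not the indexer's actual validation: the `meta` struct and `checkPurity` are hypothetical, and requiring a non-empty `error_type` for impure entries is an assumption drawn from the rule above. It uses `errors.Join`, which needs Go 1.20 or later.

```go
package main

import (
	"errors"
	"fmt"
)

// meta holds only the frontmatter fields that the purity rules mention.
type meta struct {
	Kind            string
	Purity          string
	ReturnsOptional bool
	ErrorType       string
	UsesFunctions   []string
}

// checkPurity returns nil when m satisfies the purity rules, or a joined
// error listing every violated rule.
func checkPurity(m meta) error {
	var errs []error
	if m.Kind == "pipeline" {
		if m.Purity != "impure" {
			errs = append(errs, errors.New("pipeline must have purity: impure"))
		}
		if len(m.UsesFunctions) == 0 {
			errs = append(errs, errors.New("pipeline must list uses_functions"))
		}
	}
	switch m.Purity {
	case "pure":
		if m.ReturnsOptional {
			errs = append(errs, errors.New("pure function must have returns_optional: false"))
		}
		if m.ErrorType != "" {
			errs = append(errs, fmt.Errorf("pure function must have empty error_type, got %q", m.ErrorType))
		}
	case "impure":
		if m.ErrorType == "" {
			errs = append(errs, errors.New("impure function must set error_type"))
		}
	}
	return errors.Join(errs...)
}

func main() {
	bad := meta{Kind: "function", Purity: "pure", ErrorType: "error_go_core"}
	fmt.Println(checkPurity(bad))
}
```

Collecting every violation instead of stopping at the first one lets a single pass report everything that needs fixing in a metadata file.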
### 5c. Verificar integridad
```bash
# Index
./fn index
# Verify each extracted function
./fn show {id}
```
If the indexer reports errors, fix them before continuing.
---
## STEP 6: Update the manifest
Add the extracted functions to `sources/sources.yaml` under the corresponding repo:
```yaml
- repo: https://github.com/user/project
  license: MIT
  cloned_dir: nombre_directorio
  extracted:
    - id: funcion_go_core
      source_file: pkg/utils.go
      date: YYYY-MM-DD  # today's date
```
If the repo is not yet in the manifest, create the full entry.
---
## STEP 7: Summary
Show the user:
- Functions extracted successfully (with IDs)
- Functions discarded, and why
- Indexer warnings, if any
- Suggested areas of the repo worth exploring in the future
---
## Critical rules
- **NEVER extract without the user's approval**: always present candidates first
- **NEVER skip the quality filter**: if a function does not meet every criterion, it is not extracted
- **ALWAYS query registry.db** before proposing, to avoid duplicates
- **ALWAYS attribute**: source_repo, source_license, source_file in the .md
- **ALWAYS update sources.yaml**: it is the versioned manifest
- **Non-permissive licenses** (GPL, AGPL) require an explicit warning to the user
- **Language translation** is allowed; document the origin clearly
+12 -1
@@ -14,12 +14,21 @@ Una funcion externa solo se extrae si cumple TODOS estos criterios:
- **Generic signature**: does not depend on internal types of the source repo or on hardcoded config
- **No global state**: does not use global variables, singletons, or init() with side effects
- **Minimal dependencies**: only stdlib or dependencies already present in fn_registry
- **Pure if possible**: if the function can be pure, it must be extracted as pure
- **No credentials**: contains no secrets, API keys, or absolute paths
- **Testable**: the logic must be verifiable with unit tests
- **Not duplicated**: query registry.db with FTS5 before extracting to avoid duplicates
- **Compatible license**: the repo must have a permissive license (MIT, Apache 2.0, BSD, etc.)
### Purity classification when extracting
Extract both pure and impure functions. Correct classification is mandatory:
- **Pure**: no I/O, no mutable state, deterministic. Extract as `purity: pure`.
- **Impure**: performs I/O (network, disk, DB, HTTP), uses concurrency, or depends on external state. Extract as `purity: impure` with an appropriate `error_type`.
- **Pipeline**: composes multiple functions into a complete flow. Extract as `kind: pipeline`, always impure.
Do not discard useful functions just because they are impure. A function that makes HTTP requests, reads files, or interacts with databases is valuable if its signature is generic and reusable.
### Adaptation when extracting
- Rename to snake_case following the registry convention
@@ -44,6 +53,8 @@ SELECT id, source_repo, source_license FROM functions WHERE source_repo != '';
Any language can be analyzed as a source. The target depends on the nature of the function:
- Pure algorithms/logic → Go (functions/{domain}/) or Python (python/functions/{domain}/)
- Impure functions (I/O, HTTP, DB) → Go or Python, depending on the domain
- System scripts/utilities → Bash (bash/functions/{domain}/)
- UI/frontend → TypeScript (frontend/functions/{domain}/)
- Multi-step flows → Pipeline in the most natural language
- C/Rust/other → Translate to Go or Python, preserving the original semantics
+32
@@ -0,0 +1,32 @@
---
name: install_nbconvert
kind: function
lang: bash
domain: infra
version: "1.0.0"
purity: impure
signature: "install_nbconvert(project_dir: string) -> void"
description: "Installs nbconvert and playwright with chromium into an existing uv project. Idempotent: uv add does not reinstall packages that are already present."
tags: [jupyter, nbconvert, pdf, export, playwright, python, uv]
uses_functions: []
uses_types: []
returns: []
returns_optional: false
error_type: "error_go_core"
imports: []
tested: false
tests: []
test_file_path: ""
file_path: "bash/functions/infra/install_nbconvert.sh"
---
## Example
```bash
source install_nbconvert.sh
install_nbconvert /home/lucas/analysis/finanzas
```
## Notes
Requires the venv to exist already (run `init_uv_venv` first). Installing chromium via `uv run playwright install chromium` can take a while the first time. Playwright's output is suppressed on success and shown only when there is an error.
+32
@@ -0,0 +1,32 @@
# install_nbconvert
# ------------------
# Installs nbconvert and playwright with chromium into an existing uv project.
# Idempotent: uv add does not reinstall packages that are already present.
#
# USAGE (sourced):
#   source install_nbconvert.sh
#   install_nbconvert /path/to/project
install_nbconvert() {
    # ${1:-} keeps the argument check below reachable under `set -u`
    local project_dir="${1:-}"
    if [ -z "$project_dir" ]; then
        echo "install_nbconvert: project_dir is required" >&2
        return 1
    fi
    if [ ! -d "$project_dir/.venv" ]; then
        echo "install_nbconvert: no .venv in $project_dir; run init_uv_venv first" >&2
        return 1
    fi
    # Install nbconvert and playwright via uv add; stop here if it fails
    (cd "$project_dir" && uv add nbconvert playwright 2>&1) || return 1
    # Install chromium: capture the output and show it only on error
    local playwright_output
    if ! playwright_output=$(cd "$project_dir" && uv run playwright install chromium 2>&1); then
        echo "$playwright_output" >&2
        return 1
    fi
}
+37
@@ -0,0 +1,37 @@
---
name: notebook_to_pdf
kind: function
lang: bash
domain: infra
version: "1.0.0"
purity: impure
signature: "notebook_to_pdf(project_dir: string, [pattern: string], [output_dir: string]) -> string"
description: "Converts Jupyter notebooks to PDF using nbconvert webpdf with chromium. Lists the generated PDFs when it finishes."
tags: [jupyter, notebook, pdf, export, nbconvert, playwright]
uses_functions: []
uses_types: []
returns: []
returns_optional: false
error_type: "error_go_core"
imports: []
tested: false
tests: []
test_file_path: ""
file_path: "bash/functions/infra/notebook_to_pdf.sh"
---
## Example
```bash
source notebook_to_pdf.sh
# With defaults (notebooks/*.ipynb -> notebooks/pdf/)
notebook_to_pdf /home/lucas/analysis/finanzas
# With a custom pattern and output_dir
notebook_to_pdf /home/lucas/analysis/finanzas "notebooks/01_*.ipynb" "exports/pdf/"
```
## Notes
Requires nbconvert and playwright with chromium to be installed (run `install_nbconvert` first). Uses the project's venv directly (`.venv/bin/jupyter`). output_dir is relative to project_dir. Prints the paths of the generated PDFs when it finishes. Fails if the output directory contains no PDF afterwards.
+59
@@ -0,0 +1,59 @@
# notebook_to_pdf
# ----------------
# Converts Jupyter notebooks to PDF using nbconvert webpdf.
# Requires nbconvert and playwright with chromium to be installed.
#
# USAGE (sourced):
#   source notebook_to_pdf.sh
#   notebook_to_pdf /path/to/project
#   notebook_to_pdf /path/to/project "notebooks/*.ipynb" "notebooks/pdf/"
notebook_to_pdf() {
    # ${1:-} keeps the argument check below reachable under `set -u`
    local project_dir="${1:-}"
    local pattern="${2:-notebooks/*.ipynb}"
    local output_dir="${3:-notebooks/pdf/}"
    if [ -z "$project_dir" ]; then
        echo "notebook_to_pdf: project_dir is required" >&2
        return 1
    fi
    if [ ! -d "$project_dir/.venv" ]; then
        echo "notebook_to_pdf: no .venv in $project_dir" >&2
        return 1
    fi
    # Create the output directory if it does not exist
    mkdir -p "$project_dir/$output_dir"
    # Convert notebooks to PDF with nbconvert webpdf.
    # nbconvert can return a non-zero exit code for JSON validation warnings
    # that do not prevent the PDF from being generated, so the exit code is
    # ignored and the presence of PDFs is checked instead.
    # $pattern is left unquoted so the shell expands the glob inside project_dir.
    local nbconvert_output
    nbconvert_output=$(cd "$project_dir" && \
        .venv/bin/jupyter nbconvert \
        --to webpdf \
        --allow-chromium-download \
        --output-dir="$output_dir" \
        $pattern 2>&1) || true
    echo "$nbconvert_output"
    # List the PDFs in the output directory (this also counts PDFs left by earlier runs)
    echo ""
    echo "PDFs generated in ${project_dir}/${output_dir}:"
    local pdf_count=0
    while IFS= read -r -d '' pdf; do
        echo "  $pdf"
        pdf_count=$((pdf_count + 1))
    done < <(find "$project_dir/$output_dir" -name "*.pdf" -print0 2>/dev/null)
    if [ "$pdf_count" -eq 0 ]; then
        echo "  (none found; nbconvert may have failed)" >&2
        echo "$nbconvert_output" >&2
        return 1
    fi
    echo "  Total: $pdf_count PDFs"
}
@@ -0,0 +1,48 @@
---
name: export_analysis_pdfs
kind: pipeline
lang: bash
domain: pipelines
version: "1.0.0"
purity: impure
signature: "export_analysis_pdfs(nombre: string, [pattern: string]) -> void"
description: "Exports all the notebooks of a Jupyter analysis to PDF. Installs nbconvert and playwright automatically if they are not present."
tags: [pipeline, jupyter, pdf, export, nbconvert, launcher]
uses_functions:
- assert_command_exists_bash_shell
- install_nbconvert_bash_infra
- notebook_to_pdf_bash_infra
uses_types: []
returns: []
returns_optional: false
error_type: "error_go_core"
imports: []
tested: false
tests: []
test_file_path: ""
file_path: "bash/functions/pipelines/export_analysis_pdfs.sh"
---
## Example
```bash
# Export all the notebooks of an analysis
./export_analysis_pdfs.sh finanzas
# With a specific pattern
./export_analysis_pdfs.sh ml "notebooks/01_*.ipynb"
# Via fn run
fn run export_analysis_pdfs finanzas
fn run export_analysis_pdfs ml "notebooks/01_*.ipynb"
```
## Flow
1. `assert_command_exists uv`: verifies that uv is available
2. `install_nbconvert`: installs nbconvert and playwright with chromium (idempotent)
3. `notebook_to_pdf`: converts the notebooks matching the given pattern to PDF in `notebooks/pdf/`
## Notes
The analysis must already exist in `analysis/{nombre}/` with an initialized venv. PDFs are generated in `analysis/{nombre}/notebooks/pdf/` by default. The pipeline uses `set -euo pipefail`, so any failure stops execution.
+73
@@ -0,0 +1,73 @@
#!/usr/bin/env bash
# export_analysis_pdfs
# ---------------------
# Pipeline that exports all the notebooks of an analysis to PDF.
# Composes: assert_command_exists + install_nbconvert + notebook_to_pdf
#
# USAGE:
#   ./export_analysis_pdfs.sh <nombre_analysis> [pattern]
#
# EXAMPLES:
#   ./export_analysis_pdfs.sh finanzas
#   ./export_analysis_pdfs.sh ml "notebooks/01_*.ipynb"
set -euo pipefail
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
REGISTRY_ROOT="$(cd "$SCRIPT_DIR/../../.." && pwd)"
# Source funciones atomicas
source "$REGISTRY_ROOT/bash/functions/shell/assert_command_exists.sh"
source "$REGISTRY_ROOT/bash/functions/infra/install_nbconvert.sh"
source "$REGISTRY_ROOT/bash/functions/infra/notebook_to_pdf.sh"
# ── Argumentos ──────────────────────────────────────────────
NOMBRE="${1:-}"
if [ -z "$NOMBRE" ]; then
echo "Uso: $0 <nombre_analysis> [pattern]" >&2
echo " Ejemplo: $0 finanzas" >&2
echo " Ejemplo: $0 ml 'notebooks/01_*.ipynb'" >&2
exit 1
fi
shift
PATTERN="${1:-notebooks/*.ipynb}"
ANALYSIS_DIR="${REGISTRY_ROOT}/analysis/${NOMBRE}"
# Verificar que el analysis existe
if [ ! -d "$ANALYSIS_DIR" ]; then
echo "Error: analysis '${NOMBRE}' no existe en ${ANALYSIS_DIR}" >&2
exit 1
fi
echo ""
echo "════════════════════════════════════════════════════════════"
echo " EXPORT ANALYSIS PDFs: ${NOMBRE}"
echo " Directorio: ${ANALYSIS_DIR}"
echo "════════════════════════════════════════════════════════════"
echo ""
# ── 1. Verificar herramientas ───────────────────────────────
echo "[1/3] Verificando herramientas..."
assert_command_exists uv
echo " OK"
# ── 2. Instalar nbconvert + playwright ──────────────────────
echo "[2/3] Instalando dependencias de exportacion..."
install_nbconvert "$ANALYSIS_DIR"
echo " OK"
# ── 3. Convertir notebooks a PDF ────────────────────────────
echo "[3/3] Convirtiendo notebooks a PDF..."
notebook_to_pdf "$ANALYSIS_DIR" "$PATTERN"
# ── Resumen ─────────────────────────────────────────────────
echo ""
echo "════════════════════════════════════════════════════════════"
echo " EXPORT COMPLETADO"
echo "════════════════════════════════════════════════════════════"
+11
@@ -0,0 +1,11 @@
package core
// CronSchedule represents a parsed cron expression with expanded field values.
type CronSchedule struct {
	Minute     []int
	Hour       []int
	DayOfMonth []int
	Month      []int
	DayOfWeek  []int
	Raw        string // original expression
}
+116
@@ -0,0 +1,116 @@
package core
// JoinByKey joins two slices of map[string]any on a common key.
// It supports the four join types: inner, left, right, outer.
// Right-side fields (other than the key) that also exist on the left get a _right suffix.
// O(n+m) algorithm: index right by key, then iterate over left.
func JoinByKey(left, right []map[string]any, key, how string) []map[string]any {
// Determine which fields conflict between left and right
leftFields := map[string]bool{}
for _, row := range left {
for k := range row {
leftFields[k] = true
}
}
rightFields := map[string]bool{}
for _, row := range right {
for k := range row {
if k != key {
rightFields[k] = true
}
}
}
conflicting := map[string]bool{}
for k := range rightFields {
if leftFields[k] {
conflicting[k] = true
}
}
// Index right by key (one key can map to multiple rows)
rightIndex := map[any][]map[string]any{}
for _, row := range right {
k := row[key]
rightIndex[k] = append(rightIndex[k], row)
}
// Empty template for the right side (every right field set to nil)
emptyRight := func() map[string]any {
m := map[string]any{}
for k := range rightFields {
if conflicting[k] {
m[k+"_right"] = nil
} else {
m[k] = nil
}
}
return m
}
merge := func(l, r map[string]any) map[string]any {
out := map[string]any{}
if l != nil {
for k, v := range l {
out[k] = v
}
}
if r != nil {
for k, v := range r {
if k == key {
continue
}
if conflicting[k] {
out[k+"_right"] = v
} else {
out[k] = v
}
}
}
return out
}
matchedRightKeys := map[any]bool{}
var result []map[string]any
for _, l := range left {
k := l[key]
rRows, ok := rightIndex[k]
if ok {
matchedRightKeys[k] = true
for _, r := range rRows {
result = append(result, merge(l, r))
}
} else {
if how == "left" || how == "outer" {
row := merge(l, nil)
for rk, rv := range emptyRight() {
row[rk] = rv
}
result = append(result, row)
}
}
}
if how == "right" || how == "outer" {
for _, r := range right {
k := r[key]
if !matchedRightKeys[k] {
row := emptyRight()
row[key] = k
for rk, rv := range r {
if rk == key {
continue
}
if conflicting[rk] {
row[rk+"_right"] = rv
} else {
row[rk] = rv
}
}
result = append(result, row)
}
}
}
return result
}
+48
@@ -0,0 +1,48 @@
---
name: join_by_key
kind: function
lang: go
domain: core
version: "1.0.0"
purity: pure
signature: "func JoinByKey(left, right []map[string]any, key, how string) []map[string]any"
description: "Joins two slices of map[string]any on a common key. Supports inner, left, right and outer. Duplicate fields from the right side get a _right suffix. O(n+m) algorithm."
tags: [tabular, join, merge, go, core]
uses_functions: []
uses_types: []
returns: []
returns_optional: false
error_type: ""
imports: []
tested: true
tests:
- "Inner join solo matches"
- "Left join todos los left con nil para right sin match"
- "Right join"
- "Outer join"
- "Campos duplicados con sufijo _right"
- "Key ausente en alguna fila"
test_file_path: "functions/core/join_by_key_test.go"
file_path: "functions/core/join_by_key.go"
---
## Example
```go
left := []map[string]any{{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}}
right := []map[string]any{{"id": 1, "dept": "eng"}, {"id": 3, "dept": "sales"}}
result := JoinByKey(left, right, "id", "inner")
// [{"id": 1, "name": "Alice", "dept": "eng"}]
result = JoinByKey(left, right, "id", "left")
// [{"id": 1, "name": "Alice", "dept": "eng"},
// {"id": 2, "name": "Bob", "dept": nil}]
```
## Notes
Pure function with no external dependencies.
The algorithm indexes right in O(n) and then iterates over left in O(m), for a total of O(n+m) plus the size of the output.
Right-side fields that collide with left-side fields (other than the key) are renamed with the _right suffix.
A key can have multiple rows in right; each one produces its own row in the result (relational join behavior).
+107
@@ -0,0 +1,107 @@
package core
import "testing"
func TestJoinByKey(t *testing.T) {
t.Run("Inner join solo matches", func(t *testing.T) {
left := []map[string]any{{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}}
right := []map[string]any{{"id": 1, "dept": "eng"}, {"id": 3, "dept": "sales"}}
result := JoinByKey(left, right, "id", "inner")
if len(result) != 1 {
t.Fatalf("got %d rows, want 1", len(result))
}
if result[0]["id"] != 1 || result[0]["name"] != "Alice" || result[0]["dept"] != "eng" {
t.Errorf("unexpected row: %v", result[0])
}
})
t.Run("Left join todos los left con nil para right sin match", func(t *testing.T) {
left := []map[string]any{{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}}
right := []map[string]any{{"id": 1, "dept": "eng"}}
result := JoinByKey(left, right, "id", "left")
if len(result) != 2 {
t.Fatalf("got %d rows, want 2", len(result))
}
var alice, bob map[string]any
for _, r := range result {
if r["id"] == 1 {
alice = r
} else {
bob = r
}
}
if alice["dept"] != "eng" {
t.Errorf("alice dept = %v, want eng", alice["dept"])
}
if bob["dept"] != nil {
t.Errorf("bob dept = %v, want nil", bob["dept"])
}
})
t.Run("Right join", func(t *testing.T) {
left := []map[string]any{{"id": 1, "name": "Alice"}}
right := []map[string]any{{"id": 1, "dept": "eng"}, {"id": 2, "dept": "sales"}}
result := JoinByKey(left, right, "id", "right")
if len(result) != 2 {
t.Fatalf("got %d rows, want 2", len(result))
}
var eng, sales map[string]any
for _, r := range result {
if r["id"] == 1 {
eng = r
} else {
sales = r
}
}
if eng["name"] != "Alice" {
t.Errorf("eng name = %v, want Alice", eng["name"])
}
if sales["name"] != nil {
t.Errorf("sales name = %v, want nil", sales["name"])
}
})
t.Run("Outer join", func(t *testing.T) {
left := []map[string]any{{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}}
right := []map[string]any{{"id": 1, "dept": "eng"}, {"id": 3, "dept": "sales"}}
result := JoinByKey(left, right, "id", "outer")
ids := map[any]bool{}
for _, r := range result {
ids[r["id"]] = true
}
if len(ids) != 3 || !ids[1] || !ids[2] || !ids[3] {
t.Errorf("outer join ids = %v, want {1, 2, 3}", ids)
}
})
t.Run("Campos duplicados con sufijo _right", func(t *testing.T) {
left := []map[string]any{{"id": 1, "name": "Alice", "score": 90}}
right := []map[string]any{{"id": 1, "score": 85, "dept": "eng"}}
result := JoinByKey(left, right, "id", "inner")
if len(result) != 1 {
t.Fatalf("got %d rows, want 1", len(result))
}
if result[0]["score"] != 90 {
t.Errorf("score = %v, want 90", result[0]["score"])
}
if result[0]["score_right"] != 85 {
t.Errorf("score_right = %v, want 85", result[0]["score_right"])
}
if result[0]["dept"] != "eng" {
t.Errorf("dept = %v, want eng", result[0]["dept"])
}
})
t.Run("Key ausente en alguna fila", func(t *testing.T) {
left := []map[string]any{{"id": 1, "name": "Alice"}, {"name": "Bob"}} // Bob has no id
right := []map[string]any{{"id": 1, "dept": "eng"}}
result := JoinByKey(left, right, "id", "inner")
// Only Alice matches (Bob has key=nil, and right has no nil key)
if len(result) != 1 {
t.Fatalf("got %d rows, want 1", len(result))
}
if result[0]["name"] != "Alice" {
t.Errorf("name = %v, want Alice", result[0]["name"])
}
})
}
+116
@@ -0,0 +1,116 @@
package core
import (
"time"
)
// NextCronTime returns the next time.Time that satisfies schedule after the given time.
// It advances minute by minute, skipping ahead when a field does not match.
// Returns the zero value of time.Time if no match is found within 366 days (impossible schedule).
func NextCronTime(schedule CronSchedule, after time.Time) time.Time {
// Truncate to minute, then advance by 1 minute.
t := after.Truncate(time.Minute).Add(time.Minute)
limit := after.Add(366 * 24 * time.Hour)
for t.Before(limit) {
// Check month (1-12).
if !intIn(int(t.Month()), schedule.Month) {
// Advance to first day of next valid month.
t = nextValidMonth(t, schedule.Month)
if t.IsZero() {
return time.Time{}
}
continue
}
// Check day of month and day of week. POSIX cron uses OR semantics when both
// fields are restricted (non-wildcard): a match on either field is enough.
// For simplicity this implementation uses AND semantics (both must match),
// which gives the same result whenever at least one of the two fields is a wildcard.
domOK := intIn(t.Day(), schedule.DayOfMonth)
dowOK := intIn(int(t.Weekday()), schedule.DayOfWeek)
if !domOK || !dowOK {
// Advance to next day at midnight.
t = time.Date(t.Year(), t.Month(), t.Day()+1, 0, 0, 0, 0, t.Location())
continue
}
// Check hour.
if !intIn(t.Hour(), schedule.Hour) {
// Advance to next valid hour.
next := nextValidHour(t, schedule.Hour)
if next.IsZero() {
// No valid hour today; advance to tomorrow.
t = time.Date(t.Year(), t.Month(), t.Day()+1, 0, 0, 0, 0, t.Location())
} else {
t = next
}
continue
}
// Check minute.
if !intIn(t.Minute(), schedule.Minute) {
next := nextValidMinute(t, schedule.Minute)
if next.IsZero() {
// No more valid minutes this hour; advance to next hour.
t = time.Date(t.Year(), t.Month(), t.Day(), t.Hour()+1, 0, 0, 0, t.Location())
} else {
t = next
}
continue
}
// All fields match.
return t
}
return time.Time{}
}
// intIn reports whether v is in s (linear scan; s does not need to be sorted).
func intIn(v int, s []int) bool {
for _, x := range s {
if x == v {
return true
}
}
return false
}
// nextValidMonth advances t to the first moment of the next valid month.
// months must be sorted in ascending order.
func nextValidMonth(t time.Time, months []int) time.Time {
month := int(t.Month())
for _, m := range months {
if m > month {
return time.Date(t.Year(), time.Month(m), 1, 0, 0, 0, 0, t.Location())
}
}
// Wrap to next year.
if len(months) > 0 {
return time.Date(t.Year()+1, time.Month(months[0]), 1, 0, 0, 0, 0, t.Location())
}
return time.Time{}
}
// nextValidHour returns t at the next valid hour this day, or zero if none.
func nextValidHour(t time.Time, hours []int) time.Time {
h := t.Hour()
for _, hh := range hours {
if hh > h {
return time.Date(t.Year(), t.Month(), t.Day(), hh, 0, 0, 0, t.Location())
}
}
return time.Time{}
}
// nextValidMinute returns t at the next valid minute this hour, or zero if none.
func nextValidMinute(t time.Time, minutes []int) time.Time {
m := t.Minute()
for _, mm := range minutes {
if mm > m {
return time.Date(t.Year(), t.Month(), t.Day(), t.Hour(), mm, 0, 0, t.Location())
}
}
return time.Time{}
}
+43
@@ -0,0 +1,43 @@
---
name: next_cron_time
kind: function
lang: go
domain: core
version: "1.0.0"
purity: pure
signature: "func NextCronTime(schedule CronSchedule, after time.Time) time.Time"
description: "Computes the next run of a cron schedule after a given time. Advances field by field, skipping ahead when a field does not match. Returns zero time if there is no match within 366 days (impossible schedule)."
tags: [cron, scheduling, time, next, pure]
uses_functions: [parse_cron_expr_go_core]
uses_types: [cron_schedule_go_core]
returns: []
returns_optional: false
error_type: ""
imports: [time]
tested: true
tests:
- "0 * * * * desde :30 retorna la proxima hora en punto"
- "@weekly desde viernes retorna proximo domingo a medianoche"
- "0 9 * * 1-5 desde viernes retorna proximo lunes a las 9"
- "schedule imposible retorna zero time"
test_file_path: "functions/core/next_cron_time_test.go"
file_path: "functions/core/next_cron_time.go"
---
## Example
```go
sched, _ := ParseCronExpr("0 * * * *")
after := time.Date(2024, 1, 15, 14, 30, 0, 0, time.UTC)
next := NextCronTime(sched, after)
// next = 2024-01-15 15:00:00 UTC
weekdays, _ := ParseCronExpr("0 9 * * 1-5")
friday := time.Date(2024, 1, 19, 10, 0, 0, 0, time.UTC) // Friday
next2 := NextCronTime(weekdays, friday)
// next2 = 2024-01-22 09:00:00 UTC (Monday)
```
## Notes
Uses AND semantics for day_of_month and day_of_week: both fields must match. The 366-day limit prevents infinite loops on schedules that cannot match (e.g. February 30, or February 29 when no leap day falls within the next 366 days). Returns zero time instead of an error to keep the function pure: a false/zero value is the Go idiom for optional returns without an error.
+72
@@ -0,0 +1,72 @@
package core
import (
"testing"
"time"
)
func TestNextCronTime(t *testing.T) {
utc := time.UTC
t.Run("0 * * * * desde :30 retorna la proxima hora en punto", func(t *testing.T) {
sched, err := ParseCronExpr("0 * * * *")
if err != nil {
t.Fatalf("unexpected error: %v", err)
}
after := time.Date(2024, 1, 15, 14, 30, 0, 0, utc)
got := NextCronTime(sched, after)
want := time.Date(2024, 1, 15, 15, 0, 0, 0, utc)
if !got.Equal(want) {
t.Errorf("got %v, want %v", got, want)
}
})
t.Run("@weekly desde viernes retorna proximo domingo a medianoche", func(t *testing.T) {
// @weekly = "0 0 * * 0" (Sunday)
sched, err := ParseCronExpr("@weekly")
if err != nil {
t.Fatalf("unexpected error: %v", err)
}
// 2024-01-19 is a Friday
after := time.Date(2024, 1, 19, 10, 0, 0, 0, utc)
got := NextCronTime(sched, after)
// Next Sunday = 2024-01-21
want := time.Date(2024, 1, 21, 0, 0, 0, 0, utc)
if !got.Equal(want) {
t.Errorf("got %v, want %v", got, want)
}
})
t.Run("0 9 * * 1-5 desde viernes retorna proximo lunes a las 9", func(t *testing.T) {
sched, err := ParseCronExpr("0 9 * * 1-5")
if err != nil {
t.Fatalf("unexpected error: %v", err)
}
// 2024-01-19 is a Friday, after 9am so today is already past.
after := time.Date(2024, 1, 19, 10, 0, 0, 0, utc)
got := NextCronTime(sched, after)
// Next weekday = Monday 2024-01-22
want := time.Date(2024, 1, 22, 9, 0, 0, 0, utc)
if !got.Equal(want) {
t.Errorf("got %v, want %v", got, want)
}
})
t.Run("schedule imposible retorna zero time", func(t *testing.T) {
// February 30 never occurs, so the search must exhaust the 366-day limit.
// The schedule is built by hand to match only that impossible date.
sched := CronSchedule{
Minute: []int{0},
Hour: []int{0},
DayOfMonth: []int{30},
Month: []int{2},
DayOfWeek: []int{0, 1, 2, 3, 4, 5, 6},
Raw: "0 0 30 2 *",
}
after := time.Date(2023, 3, 1, 0, 0, 0, 0, utc)
got := NextCronTime(sched, after)
if !got.IsZero() {
t.Errorf("expected zero time for impossible schedule, got %v", got)
}
})
}
+192
View File
@@ -0,0 +1,192 @@
package core
import (
"fmt"
"strconv"
"strings"
)
// cronAliases maps cron shorthand expressions to their 5-field equivalents.
var cronAliases = map[string]string{
"@yearly": "0 0 1 1 *",
"@annually": "0 0 1 1 *",
"@monthly": "0 0 1 * *",
"@weekly": "0 0 * * 0",
"@daily": "0 0 * * *",
"@midnight": "0 0 * * *",
"@hourly": "0 * * * *",
}
// cronFieldLimits defines the valid [min, max] range for each cron field.
var cronFieldLimits = [5][2]int{
{0, 59}, // minute
{0, 23}, // hour
{1, 31}, // day of month
{1, 12}, // month
{0, 6}, // day of week
}
var cronFieldNames = [5]string{"minute", "hour", "day_of_month", "month", "day_of_week"}
// ParseCronExpr parses a standard 5-field cron expression into a CronSchedule.
// Supports *, ranges (1-5), lists (1,3,5), steps (*/15), and aliases
// (@hourly, @daily, @midnight, @weekly, @monthly, @yearly, @annually).
// Returns an error for invalid expressions or out-of-range values.
func ParseCronExpr(expr string) (CronSchedule, error) {
expr = strings.TrimSpace(expr)
// Resolve aliases.
if expanded, ok := cronAliases[expr]; ok {
expr = expanded
}
fields := strings.Fields(expr)
if len(fields) != 5 {
return CronSchedule{}, fmt.Errorf("parse_cron_expr: expected 5 fields, got %d in %q", len(fields), expr)
}
var result [5][]int
for i, field := range fields {
lo, hi := cronFieldLimits[i][0], cronFieldLimits[i][1]
values, err := parseCronField(field, lo, hi)
if err != nil {
return CronSchedule{}, fmt.Errorf("parse_cron_expr: field %s: %w", cronFieldNames[i], err)
}
result[i] = values
}
return CronSchedule{
Minute: result[0],
Hour: result[1],
DayOfMonth: result[2],
Month: result[3],
DayOfWeek: result[4],
Raw: strings.TrimSpace(strings.Join(fields, " ")),
}, nil
}
// parseCronField expands a single cron field token into the list of matching integers.
func parseCronField(field string, lo, hi int) ([]int, error) {
// Handle wildcard.
if field == "*" {
return rangeSlice(lo, hi), nil
}
var values []int
seen := make(map[int]bool)
// Handle comma-separated list.
parts := strings.Split(field, ",")
for _, part := range parts {
expanded, err := parseCronPart(part, lo, hi)
if err != nil {
return nil, err
}
for _, v := range expanded {
if !seen[v] {
seen[v] = true
values = append(values, v)
}
}
}
// Sort.
sortInts(values)
return values, nil
}
// parseCronPart handles a single part: plain int, range (a-b), or step (*/n or a-b/n).
func parseCronPart(part string, lo, hi int) ([]int, error) {
// Step: */n or a-b/n
if idx := strings.Index(part, "/"); idx != -1 {
stepStr := part[idx+1:]
step, err := strconv.Atoi(stepStr)
if err != nil || step <= 0 {
return nil, fmt.Errorf("invalid step %q", stepStr)
}
base := part[:idx]
var start, end int
if base == "*" {
start, end = lo, hi
} else if dashIdx := strings.Index(base, "-"); dashIdx != -1 {
var err2 error
start, end, err2 = parseRange(base, lo, hi)
if err2 != nil {
return nil, err2
}
} else {
v, err2 := parseValue(base, lo, hi)
if err2 != nil {
return nil, err2
}
start, end = v, hi
}
var result []int
for v := start; v <= end; v += step {
result = append(result, v)
}
return result, nil
}
// Range: a-b
if strings.Contains(part, "-") {
start, end, err := parseRange(part, lo, hi)
if err != nil {
return nil, err
}
return rangeSlice(start, end), nil
}
// Plain integer.
v, err := parseValue(part, lo, hi)
if err != nil {
return nil, err
}
return []int{v}, nil
}
func parseRange(s string, lo, hi int) (int, int, error) {
parts := strings.SplitN(s, "-", 2)
if len(parts) != 2 {
return 0, 0, fmt.Errorf("invalid range %q", s)
}
start, err := parseValue(parts[0], lo, hi)
if err != nil {
return 0, 0, err
}
end, err := parseValue(parts[1], lo, hi)
if err != nil {
return 0, 0, err
}
if start > end {
return 0, 0, fmt.Errorf("range start %d > end %d", start, end)
}
return start, end, nil
}
func parseValue(s string, lo, hi int) (int, error) {
v, err := strconv.Atoi(s)
if err != nil {
return 0, fmt.Errorf("invalid value %q: not an integer", s)
}
if v < lo || v > hi {
return 0, fmt.Errorf("value %d out of range [%d, %d]", v, lo, hi)
}
return v, nil
}
func rangeSlice(lo, hi int) []int {
s := make([]int, hi-lo+1)
for i := range s {
s[i] = lo + i
}
return s
}
// sortInts is a simple insertion sort for small slices (avoids importing sort).
func sortInts(a []int) {
for i := 1; i < len(a); i++ {
for j := i; j > 0 && a[j] < a[j-1]; j-- {
a[j], a[j-1] = a[j-1], a[j]
}
}
}
+45
View File
@@ -0,0 +1,45 @@
---
name: parse_cron_expr
kind: function
lang: go
domain: core
version: "1.0.0"
purity: pure
signature: "func ParseCronExpr(expr string) (CronSchedule, error)"
description: "Parses a standard 5-field cron expression into a CronSchedule with expanded values. Supports *, ranges (1-5), lists (1,3,5), steps (*/15), and aliases (@hourly, @daily, @midnight, @weekly, @monthly, @yearly, @annually). Does not support seconds or Quartz-style year fields."
tags: [cron, scheduling, parsing, time, pure]
uses_functions: []
uses_types: [cron_schedule_go_core]
returns: []
returns_optional: false
error_type: ""
imports: [fmt, strconv, strings]
tested: true
tests:
- "*/15 expande minutos a [0 15 30 45]"
- "@daily resuelve a 0 0 en todos los campos restantes"
- "0 9 1,15 * * expande dias a [1 15]"
- "0 9 * * 1-5 expande dia de semana a [1 2 3 4 5]"
- "expresion con 4 campos retorna error"
- "minuto fuera de rango retorna error"
test_file_path: "functions/core/parse_cron_expr_test.go"
file_path: "functions/core/parse_cron_expr.go"
---
## Example
```go
sched, err := ParseCronExpr("*/15 * * * *")
// sched.Minute = [0, 15, 30, 45]
// sched.Hour = [0, 1, ..., 23]
sched2, _ := ParseCronExpr("@daily")
// sched2.Minute = [0], sched2.Hour = [0]
sched3, _ := ParseCronExpr("0 9 * * 1-5")
// sched3.DayOfWeek = [1, 2, 3, 4, 5] (Monday through Friday)
```
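## Notes
Pure function. Each cron field is expanded into the full, sorted, de-duplicated list of valid integer values. Aliases are resolved before parsing. The limits are: minute [0,59], hour [0,23], day_of_month [1,31], month [1,12], day_of_week [0,6] (0 = Sunday). A bare value with a step, such as `5/15`, runs from that value to the field maximum. Names (`MON`, `JAN`) and `7` for Sunday are rejected as invalid values.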
+81
View File
@@ -0,0 +1,81 @@
package core
import (
"reflect"
"testing"
)
func TestParseCronExpr(t *testing.T) {
t.Run("*/15 expande minutos a [0 15 30 45]", func(t *testing.T) {
sched, err := ParseCronExpr("*/15 * * * *")
if err != nil {
t.Fatalf("unexpected error: %v", err)
}
want := []int{0, 15, 30, 45}
if !reflect.DeepEqual(sched.Minute, want) {
t.Errorf("Minute = %v, want %v", sched.Minute, want)
}
// Hour should be all 24 hours
if len(sched.Hour) != 24 {
t.Errorf("Hour len = %d, want 24", len(sched.Hour))
}
})
t.Run("@daily resuelve a 0 0 en todos los campos restantes", func(t *testing.T) {
sched, err := ParseCronExpr("@daily")
if err != nil {
t.Fatalf("unexpected error: %v", err)
}
if !reflect.DeepEqual(sched.Minute, []int{0}) {
t.Errorf("Minute = %v, want [0]", sched.Minute)
}
if !reflect.DeepEqual(sched.Hour, []int{0}) {
t.Errorf("Hour = %v, want [0]", sched.Hour)
}
// DayOfMonth should be all days
if len(sched.DayOfMonth) != 31 {
t.Errorf("DayOfMonth len = %d, want 31", len(sched.DayOfMonth))
}
})
t.Run("0 9 1,15 * * expande dias a [1 15]", func(t *testing.T) {
sched, err := ParseCronExpr("0 9 1,15 * *")
if err != nil {
t.Fatalf("unexpected error: %v", err)
}
if !reflect.DeepEqual(sched.Minute, []int{0}) {
t.Errorf("Minute = %v, want [0]", sched.Minute)
}
if !reflect.DeepEqual(sched.Hour, []int{9}) {
t.Errorf("Hour = %v, want [9]", sched.Hour)
}
if !reflect.DeepEqual(sched.DayOfMonth, []int{1, 15}) {
t.Errorf("DayOfMonth = %v, want [1, 15]", sched.DayOfMonth)
}
})
t.Run("0 9 * * 1-5 expande dia de semana a [1 2 3 4 5]", func(t *testing.T) {
sched, err := ParseCronExpr("0 9 * * 1-5")
if err != nil {
t.Fatalf("unexpected error: %v", err)
}
want := []int{1, 2, 3, 4, 5}
if !reflect.DeepEqual(sched.DayOfWeek, want) {
t.Errorf("DayOfWeek = %v, want %v", sched.DayOfWeek, want)
}
})
t.Run("expresion con 4 campos retorna error", func(t *testing.T) {
_, err := ParseCronExpr("0 9 * *")
if err == nil {
t.Error("expected error for 4-field expression, got nil")
}
})
t.Run("minuto fuera de rango retorna error", func(t *testing.T) {
_, err := ParseCronExpr("60 * * * *")
if err == nil {
t.Error("expected error for minute=60, got nil")
}
})
}
+233
View File
@@ -0,0 +1,233 @@
package core
import (
"fmt"
"regexp"
"strconv"
"strings"
)
// ValidateStructFields validates fields of a map against declarative rules.
// Each rule is a comma-separated string like "required,type=string,min=1,max=100".
//
// Supported rules:
// - required — field must exist and not be nil or ""
// - type=string|int|float|bool — validate underlying Go type
// - min=N, max=N — for numeric values
// - minlen=N, maxlen=N — for string values
// - oneof=a|b|c — value must be one of the listed options
// - pattern=regex — for string values
//
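// Returns (valid, errors). Errors accumulate: all fields are checked.
// Fields are visited in map iteration order, so the order of errors
// across fields is not deterministic.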
func ValidateStructFields(data map[string]any, rules map[string]string) (bool, []string) {
var errs []string
for field, ruleStr := range rules {
parts := strings.Split(ruleStr, ",")
for _, part := range parts {
part = strings.TrimSpace(part)
if part == "" {
continue
}
if err := applyRule(data, field, part); err != "" {
errs = append(errs, err)
// stop further checks on this field if required failed
if part == "required" {
break
}
}
}
}
return len(errs) == 0, errs
}
// applyRule applies a single rule to a field and returns an error string or "".
func applyRule(data map[string]any, field, rule string) string {
switch {
case rule == "required":
val, ok := data[field]
if !ok || val == nil {
return fmt.Sprintf("%s: required field missing", field)
}
if s, ok := val.(string); ok && s == "" {
return fmt.Sprintf("%s: required field is empty string", field)
}
return ""
case strings.HasPrefix(rule, "type="):
expectedType := rule[len("type="):]
val, ok := data[field]
if !ok || val == nil {
return "" // absence handled by required
}
return checkType(field, val, expectedType)
case strings.HasPrefix(rule, "min="):
n, err := strconv.ParseFloat(rule[len("min="):], 64)
if err != nil {
return fmt.Sprintf("%s: invalid rule min value: %s", field, rule)
}
val, ok := data[field]
if !ok || val == nil {
return ""
}
f, ok := toFloat(val)
if !ok {
return fmt.Sprintf("%s: cannot apply min to non-numeric value", field)
}
if f < n {
return fmt.Sprintf("%s: %v < min %v", field, val, n)
}
return ""
case strings.HasPrefix(rule, "max="):
n, err := strconv.ParseFloat(rule[len("max="):], 64)
if err != nil {
return fmt.Sprintf("%s: invalid rule max value: %s", field, rule)
}
val, ok := data[field]
if !ok || val == nil {
return ""
}
f, ok := toFloat(val)
if !ok {
return fmt.Sprintf("%s: cannot apply max to non-numeric value", field)
}
if f > n {
return fmt.Sprintf("%s: %v > max %v", field, val, n)
}
return ""
case strings.HasPrefix(rule, "minlen="):
n, err := strconv.Atoi(rule[len("minlen="):])
if err != nil {
return fmt.Sprintf("%s: invalid rule minlen value: %s", field, rule)
}
val, ok := data[field]
if !ok || val == nil {
return ""
}
s, ok := val.(string)
if !ok {
return fmt.Sprintf("%s: cannot apply minlen to non-string value", field)
}
if len(s) < n {
return fmt.Sprintf("%s: length %d < minlen %d", field, len(s), n)
}
return ""
case strings.HasPrefix(rule, "maxlen="):
n, err := strconv.Atoi(rule[len("maxlen="):])
if err != nil {
return fmt.Sprintf("%s: invalid rule maxlen value: %s", field, rule)
}
val, ok := data[field]
if !ok || val == nil {
return ""
}
s, ok := val.(string)
if !ok {
return fmt.Sprintf("%s: cannot apply maxlen to non-string value", field)
}
if len(s) > n {
return fmt.Sprintf("%s: length %d > maxlen %d", field, len(s), n)
}
return ""
case strings.HasPrefix(rule, "oneof="):
options := strings.Split(rule[len("oneof="):], "|")
val, ok := data[field]
if !ok || val == nil {
return ""
}
sval := fmt.Sprintf("%v", val)
for _, opt := range options {
if sval == opt {
return ""
}
}
return fmt.Sprintf("%s: value %q not in oneof [%s]", field, sval, rule[len("oneof="):])
case strings.HasPrefix(rule, "pattern="):
pat := rule[len("pattern="):]
val, ok := data[field]
if !ok || val == nil {
return ""
}
s, ok := val.(string)
if !ok {
return fmt.Sprintf("%s: cannot apply pattern to non-string value", field)
}
re, err := regexp.Compile(pat)
if err != nil {
return fmt.Sprintf("%s: invalid pattern %q: %v", field, pat, err)
}
if !re.MatchString(s) {
return fmt.Sprintf("%s: value %q does not match pattern %q", field, s, pat)
}
return ""
default:
return fmt.Sprintf("%s: unknown rule %q", field, rule)
}
}
func checkType(field string, val any, expected string) string {
var ok bool
switch expected {
case "string":
_, ok = val.(string)
case "int":
switch val.(type) {
case int, int8, int16, int32, int64, uint, uint8, uint16, uint32, uint64:
ok = true
}
case "float":
switch val.(type) {
case float32, float64:
ok = true
case int, int8, int16, int32, int64, uint, uint8, uint16, uint32, uint64:
ok = true // integers are valid floats
}
case "bool":
_, ok = val.(bool)
default:
return fmt.Sprintf("%s: unknown type rule %q", field, expected)
}
if !ok {
return fmt.Sprintf("%s: expected type %s, got %T", field, expected, val)
}
return ""
}
func toFloat(val any) (float64, bool) {
switch v := val.(type) {
case int:
return float64(v), true
case int8:
return float64(v), true
case int16:
return float64(v), true
case int32:
return float64(v), true
case int64:
return float64(v), true
case uint:
return float64(v), true
case uint8:
return float64(v), true
case uint16:
return float64(v), true
case uint32:
return float64(v), true
case uint64:
return float64(v), true
case float32:
return float64(v), true
case float64:
return v, true
}
return 0, false
}
+64
View File
@@ -0,0 +1,64 @@
---
name: validate_struct_fields
kind: function
lang: go
domain: core
version: "1.0.0"
purity: pure
signature: "func ValidateStructFields(data map[string]any, rules map[string]string) (bool, []string)"
description: "Validates the fields of a map[string]any against declarative rules such as 'required,min=1,max=100,type=string'. Supports required, type, min/max, minlen/maxlen, oneof, and pattern. Intended for validating entity metadata in operations.db or query results without defining Go structs. Accumulates all errors."
tags: [validation, map, rules, pure, core, operations]
uses_functions: []
uses_types: []
returns: []
returns_optional: false
error_type: ""
imports: [fmt, regexp, strconv, strings]
tested: true
tests:
- "campo required presente y ausente"
- "type validation string como int falla"
- "numeric ranges"
- "string lengths"
- "oneof validation"
- "pattern matching"
- "multiples reglas combinadas"
- "map vacio con reglas required"
test_file_path: "functions/core/validate_struct_fields_test.go"
file_path: "functions/core/validate_struct_fields.go"
---
## Example
```go
data := map[string]any{
"name": "Alice",
"age": 30,
"status": "active",
"email": "alice@example.com",
}
rules := map[string]string{
"name": "required,type=string,minlen=2,maxlen=100",
"age": "required,type=int,min=0,max=150",
"status": "required,oneof=active|inactive|pending",
"email": `required,type=string,pattern=^[^@]+@[^@]+$`,
}
valid, errs := ValidateStructFields(data, rules)
// valid = true, errs = []
data2 := map[string]any{"name": "A", "age": 200, "status": "deleted"}
valid2, errs2 := ValidateStructFields(data2, rules)
// valid2 = false
// errs2 = [   (order across fields varies: Go map iteration is unordered)
// "name: length 1 < minlen 2",
// "age: 200 > max 150",
// "status: value \"deleted\" not in oneof [active|inactive|pending]",
// "email: required field missing",
// ]
```
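## Notes
Pure function (no I/O or shared state) that uses only the standard library (fmt, regexp, strconv, strings). The rules of a field are evaluated in order and all errors are accumulated. If `required` fails, the remaining rules for that field are skipped to avoid false positives. Fields are visited in Go map iteration order, so the order of errors across fields can differ between calls. Go types accepted for type=int: int, int8..int64, uint..uint64; type=float also accepts integers. Rules are split on commas, so a `pattern` or `oneof` value cannot itself contain a comma. minlen and maxlen count bytes (`len`), not runes. `pattern` compiles the regex on every call; for heavy use, cache the compiled regexps outside the function.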
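
The caching advice above could look roughly like this. It is a minimal sketch under stated assumptions: `patternCache`, `newPatternCache`, and `get` are hypothetical names, and nothing here is wired into ValidateStructFields.

```go
package main

import (
	"fmt"
	"regexp"
	"sync"
)

// patternCache is a hypothetical helper that memoizes compiled patterns
// so repeated validations do not recompile the same regex.
type patternCache struct {
	mu sync.RWMutex
	m  map[string]*regexp.Regexp
}

func newPatternCache() *patternCache {
	return &patternCache{m: make(map[string]*regexp.Regexp)}
}

// get returns the compiled form of pat, reusing a cached value when one
// exists. Invalid patterns are reported and never cached. Two goroutines
// racing on the same new pattern may both compile it; either result is
// equivalent, and the later write simply replaces the earlier one.
func (c *patternCache) get(pat string) (*regexp.Regexp, error) {
	c.mu.RLock()
	re, ok := c.m[pat]
	c.mu.RUnlock()
	if ok {
		return re, nil
	}
	re, err := regexp.Compile(pat)
	if err != nil {
		return nil, err
	}
	c.mu.Lock()
	c.m[pat] = re
	c.mu.Unlock()
	return re, nil
}

func main() {
	cache := newPatternCache()
	re, err := cache.get(`^[^@]+@[^@]+$`)
	if err != nil {
		fmt.Println("compile error:", err)
		return
	}
	fmt.Println(re.MatchString("alice@example.com")) // true

	again, _ := cache.get(`^[^@]+@[^@]+$`)
	fmt.Println(re == again) // true: the second lookup reuses the compiled regexp
}
```

Keeping the cache outside the validator leaves ValidateStructFields free of shared state; the caller decides whether the extra bookkeeping is worth it.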
@@ -0,0 +1,131 @@
package core
import (
"strings"
"testing"
)
func TestValidateStructFields(t *testing.T) {
t.Run("campo required presente y ausente", func(t *testing.T) {
rules := map[string]string{"name": "required"}
valid, errs := ValidateStructFields(map[string]any{"name": "Alice"}, rules)
if !valid || len(errs) != 0 {
t.Errorf("expected valid, got errors: %v", errs)
}
valid2, errs2 := ValidateStructFields(map[string]any{}, rules)
if valid2 || len(errs2) == 0 {
t.Errorf("expected invalid for missing required field")
}
})
t.Run("type validation string como int falla", func(t *testing.T) {
rules := map[string]string{"count": "type=int"}
valid, _ := ValidateStructFields(map[string]any{"count": 5}, rules)
if !valid {
t.Error("expected int 5 to pass type=int")
}
valid2, errs2 := ValidateStructFields(map[string]any{"count": "five"}, rules)
if valid2 || len(errs2) == 0 {
t.Error("expected string to fail type=int")
}
})
t.Run("numeric ranges", func(t *testing.T) {
rules := map[string]string{"score": "min=0,max=100"}
valid, _ := ValidateStructFields(map[string]any{"score": 50}, rules)
if !valid {
t.Error("expected 50 to pass min=0,max=100")
}
valid2, errs2 := ValidateStructFields(map[string]any{"score": 150}, rules)
if valid2 || !strings.Contains(errs2[0], "max") {
t.Errorf("expected max violation, got: %v", errs2)
}
valid3, errs3 := ValidateStructFields(map[string]any{"score": -1}, rules)
if valid3 || !strings.Contains(errs3[0], "min") {
t.Errorf("expected min violation, got: %v", errs3)
}
})
t.Run("string lengths", func(t *testing.T) {
rules := map[string]string{"tag": "minlen=2,maxlen=10"}
valid, _ := ValidateStructFields(map[string]any{"tag": "go"}, rules)
if !valid {
t.Error("expected 'go' to pass minlen=2,maxlen=10")
}
valid2, errs2 := ValidateStructFields(map[string]any{"tag": "a"}, rules)
if valid2 || !strings.Contains(errs2[0], "minlen") {
t.Errorf("expected minlen violation, got: %v", errs2)
}
valid3, errs3 := ValidateStructFields(map[string]any{"tag": "averylongtag"}, rules)
if valid3 || !strings.Contains(errs3[0], "maxlen") {
t.Errorf("expected maxlen violation, got: %v", errs3)
}
})
t.Run("oneof validation", func(t *testing.T) {
rules := map[string]string{"status": "oneof=active|inactive|pending"}
valid, _ := ValidateStructFields(map[string]any{"status": "active"}, rules)
if !valid {
t.Error("expected 'active' to pass oneof")
}
valid2, errs2 := ValidateStructFields(map[string]any{"status": "deleted"}, rules)
if valid2 || len(errs2) == 0 {
t.Errorf("expected oneof violation, got: %v", errs2)
}
})
t.Run("pattern matching", func(t *testing.T) {
rules := map[string]string{"email": `pattern=^[^@]+@[^@]+\.[^@]+$`}
valid, _ := ValidateStructFields(map[string]any{"email": "user@example.com"}, rules)
if !valid {
t.Error("expected valid email to pass pattern")
}
valid2, errs2 := ValidateStructFields(map[string]any{"email": "not-an-email"}, rules)
if valid2 || !strings.Contains(errs2[0], "pattern") {
t.Errorf("expected pattern violation, got: %v", errs2)
}
})
t.Run("multiples reglas combinadas", func(t *testing.T) {
rules := map[string]string{
"name": "required,type=string,minlen=2,maxlen=50",
"score": "required,type=float,min=0,max=10",
}
valid, _ := ValidateStructFields(map[string]any{"name": "Alice", "score": float64(8.5)}, rules)
if !valid {
t.Error("expected all rules to pass")
}
valid2, errs2 := ValidateStructFields(map[string]any{"name": "A", "score": float64(11)}, rules)
if valid2 || len(errs2) < 2 {
t.Errorf("expected at least 2 errors, got: %v", errs2)
}
})
t.Run("map vacio con reglas required", func(t *testing.T) {
rules := map[string]string{
"id": "required",
"name": "required",
}
valid, errs := ValidateStructFields(map[string]any{}, rules)
if valid || len(errs) < 2 {
t.Errorf("expected 2 required errors, got: %v", errs)
}
})
}
+96
View File
@@ -0,0 +1,96 @@
package datascience
import "fmt"
// DiffEntities compares two snapshots of entities and returns field-level differences.
// Detects added, removed, modified, and unchanged entities.
// ignoreFields specifies fields to exclude from comparison (defaults to ["created_at", "updated_at"] when nil).
func DiffEntities(before, after []map[string]any, key string, ignoreFields []string) map[string]any {
if ignoreFields == nil {
ignoreFields = []string{"created_at", "updated_at"}
}
ignoreSet := make(map[string]bool, len(ignoreFields))
for _, f := range ignoreFields {
ignoreSet[f] = true
}
beforeMap := make(map[string]map[string]any, len(before))
for _, e := range before {
if k, ok := e[key]; ok {
beforeMap[fmt.Sprintf("%v", k)] = e
}
}
afterMap := make(map[string]map[string]any, len(after))
for _, e := range after {
if k, ok := e[key]; ok {
afterMap[fmt.Sprintf("%v", k)] = e
}
}
added := []map[string]any{}
for k, e := range afterMap {
if _, exists := beforeMap[k]; !exists {
added = append(added, e)
}
}
removed := []map[string]any{}
for k, e := range beforeMap {
if _, exists := afterMap[k]; !exists {
removed = append(removed, e)
}
}
modified := []map[string]any{}
unchanged := 0
for k, b := range beforeMap {
a, exists := afterMap[k]
if !exists {
continue
}
// Collect all fields from both entities
allFields := make(map[string]bool)
for f := range b {
allFields[f] = true
}
for f := range a {
allFields[f] = true
}
changes := map[string]any{}
for field := range allFields {
if ignoreSet[field] || field == key {
continue
}
oldVal := b[field]
newVal := a[field]
if fmt.Sprintf("%v", oldVal) != fmt.Sprintf("%v", newVal) {
changes[field] = map[string]any{"old": oldVal, "new": newVal}
}
}
if len(changes) > 0 {
modified = append(modified, map[string]any{"key": k, "changes": changes})
} else {
unchanged++
}
}
nAdded := len(added)
nRemoved := len(removed)
nModified := len(modified)
summary := fmt.Sprintf("%d added, %d removed, %d modified, %d unchanged",
nAdded, nRemoved, nModified, unchanged)
return map[string]any{
"added": added,
"removed": removed,
"modified": modified,
"unchanged": unchanged,
"summary": summary,
}
}
+52
View File
@@ -0,0 +1,52 @@
---
name: diff_entities
kind: function
lang: go
domain: datascience
version: "1.0.0"
purity: pure
signature: "func DiffEntities(before, after []map[string]any, key string, ignoreFields []string) map[string]any"
description: "Compares two snapshots of entities and returns field-level differences. Detects added, removed, modified, and unchanged entities. Ignores created_at and updated_at by default (pass nil to use the defaults)."
tags: [datascience, diff, entities, operations, snapshot, comparison]
uses_functions: []
uses_types: []
returns: []
returns_optional: false
error_type: ""
imports: ["fmt"]
tested: true
tests:
- "entity añadida"
- "entity eliminada"
- "entity modificada con detalle de campos"
- "entities identicas → unchanged"
- "ignore_fields funciona"
- "lista vacia vs lista con datos"
- "summary format correcto"
test_file_path: "functions/datascience/diff_entities_test.go"
file_path: "functions/datascience/diff_entities.go"
---
## Example
```go
before := []map[string]any{
{"id": "1", "name": "Alice", "status": "active"},
{"id": "2", "name": "Bob"},
}
after := []map[string]any{
{"id": "1", "name": "Alice", "status": "inactive"},
{"id": "3", "name": "Carol"},
}
result := DiffEntities(before, after, "id", nil)
// result["summary"] = "1 added, 1 removed, 1 modified, 0 unchanged"
// result["added"] = [{"id": "3", "name": "Carol"}]
// result["removed"] = [{"id": "2", "name": "Bob"}]
// result["modified"] = [{"key": "1", "changes": {"status": {"old": "active", "new": "inactive"}}}]
```
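## Notes
Pure function. Values are compared with fmt.Sprintf("%v", ...) to handle heterogeneous types in map[string]any; as a consequence, values with the same textual form (for example the int 1 and the string "1") are treated as equal.
A nil ignoreFields uses the defaults ["created_at", "updated_at"]. To ignore no fields at all, pass []string{}.
Entities that lack the key field are skipped, and if a key repeats within a snapshot the last entity wins. The order of the added, removed, and modified lists is not guaranteed, because they are built by iterating over maps.
Semantics are identical to diff_entities_py_datascience, which allows comparing results across runs of the same pipeline.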
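
The `%v`-based comparison is convenient for mixed JSON/SQL data, but it equates some values that differ in type. The short sketch below shows the effect; `sameValue` is a hypothetical helper that reproduces only the comparison rule, not DiffEntities itself.

```go
package main

import "fmt"

// sameValue treats two values as equal when their %v renderings match,
// which is the comparison rule described in the notes above.
func sameValue(a, b any) bool {
	return fmt.Sprintf("%v", a) == fmt.Sprintf("%v", b)
}

func main() {
	fmt.Println(sameValue(1, 1.0))                     // true: both render as "1"
	fmt.Println(sameValue(1, "1"))                     // true: int vs string
	fmt.Println(sameValue(nil, "<nil>"))               // true: nil renders as "<nil>"
	fmt.Println(sameValue([]int{1, 2}, []int{1, 2}))   // true: slices compare by content
	fmt.Println(sameValue("active", "inactive"))       // false
}
```

This is usually the desired behavior when one snapshot comes from JSON (numbers as float64) and the other from SQL (numbers as int64), but it hides genuine type changes.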
+138
View File
@@ -0,0 +1,138 @@
package datascience
import (
"testing"
)
func TestDiffEntities(t *testing.T) {
t.Run("entity añadida", func(t *testing.T) {
before := []map[string]any{
{"id": "1", "name": "Alice"},
}
after := []map[string]any{
{"id": "1", "name": "Alice"},
{"id": "2", "name": "Bob"},
}
result := DiffEntities(before, after, "id", nil)
added := result["added"].([]map[string]any)
if len(added) != 1 {
t.Errorf("expected 1 added, got %d", len(added))
}
if added[0]["id"] != "2" {
t.Errorf("expected added id=2, got %v", added[0]["id"])
}
if result["unchanged"].(int) != 1 {
t.Errorf("expected 1 unchanged, got %v", result["unchanged"])
}
})
t.Run("entity eliminada", func(t *testing.T) {
before := []map[string]any{
{"id": "1", "name": "Alice"},
{"id": "2", "name": "Bob"},
}
after := []map[string]any{
{"id": "1", "name": "Alice"},
}
result := DiffEntities(before, after, "id", nil)
removed := result["removed"].([]map[string]any)
if len(removed) != 1 {
t.Errorf("expected 1 removed, got %d", len(removed))
}
if removed[0]["id"] != "2" {
t.Errorf("expected removed id=2, got %v", removed[0]["id"])
}
})
t.Run("entity modificada con detalle de campos", func(t *testing.T) {
before := []map[string]any{
{"id": "1", "name": "Alice", "status": "active"},
}
after := []map[string]any{
{"id": "1", "name": "Alice", "status": "inactive"},
}
result := DiffEntities(before, after, "id", nil)
modified := result["modified"].([]map[string]any)
if len(modified) != 1 {
t.Errorf("expected 1 modified, got %d", len(modified))
}
changes := modified[0]["changes"].(map[string]any)
statusChange, ok := changes["status"].(map[string]any)
if !ok {
t.Fatalf("expected status change, got %v", changes)
}
if statusChange["old"] != "active" {
t.Errorf("expected old=active, got %v", statusChange["old"])
}
if statusChange["new"] != "inactive" {
t.Errorf("expected new=inactive, got %v", statusChange["new"])
}
})
t.Run("entities identicas → unchanged", func(t *testing.T) {
entities := []map[string]any{
{"id": "1", "name": "Alice"},
{"id": "2", "name": "Bob"},
}
result := DiffEntities(entities, entities, "id", nil)
if result["unchanged"].(int) != 2 {
t.Errorf("expected 2 unchanged, got %v", result["unchanged"])
}
if len(result["added"].([]map[string]any)) != 0 {
t.Errorf("expected 0 added")
}
if len(result["modified"].([]map[string]any)) != 0 {
t.Errorf("expected 0 modified")
}
})
t.Run("ignore_fields funciona", func(t *testing.T) {
before := []map[string]any{
{"id": "1", "name": "Alice", "updated_at": "2024-01-01"},
}
after := []map[string]any{
{"id": "1", "name": "Alice", "updated_at": "2024-06-01"},
}
// Default ignores updated_at
result := DiffEntities(before, after, "id", nil)
if result["unchanged"].(int) != 1 {
t.Errorf("expected 1 unchanged (updated_at ignored), got %v", result["unchanged"])
}
modified := result["modified"].([]map[string]any)
if len(modified) != 0 {
t.Errorf("expected 0 modified when updated_at is ignored, got %d", len(modified))
}
})
t.Run("lista vacia vs lista con datos", func(t *testing.T) {
before := []map[string]any{}
after := []map[string]any{
{"id": "1", "name": "Alice"},
}
result := DiffEntities(before, after, "id", nil)
added := result["added"].([]map[string]any)
if len(added) != 1 {
t.Errorf("expected 1 added, got %d", len(added))
}
if result["unchanged"].(int) != 0 {
t.Errorf("expected 0 unchanged")
}
})
t.Run("summary format correcto", func(t *testing.T) {
before := []map[string]any{
{"id": "1", "name": "Alice"},
{"id": "3", "name": "Carol"},
}
after := []map[string]any{
{"id": "1", "name": "Alice Changed"},
{"id": "2", "name": "Bob"},
}
result := DiffEntities(before, after, "id", nil)
summary := result["summary"].(string)
expected := "1 added, 1 removed, 1 modified, 0 unchanged"
if summary != expected {
t.Errorf("expected summary %q, got %q", expected, summary)
}
})
}
+110
View File
@@ -0,0 +1,110 @@
package datascience
// Pivot transforms data from long format to wide format (pivot table).
// It groups by index, expands the unique values of columns into new columns,
// and aggregates values with the given function.
// Supported aggregation functions: sum, count, mean, min, max, first, last.
// Missing cells are filled with 0.
func Pivot(rows []map[string]any, index, columns, values, agg string) []map[string]any {
// Preserve the order in which index and column values first appear
indexOrder := []any{}
seenIndex := map[any]bool{}
colOrder := []any{}
seenCols := map[any]bool{}
for _, row := range rows {
idx := row[index]
col := row[columns]
if !seenIndex[idx] {
seenIndex[idx] = true
indexOrder = append(indexOrder, idx)
}
if !seenCols[col] {
seenCols[col] = true
colOrder = append(colOrder, col)
}
}
// Accumulate: groups[key{indexVal, colVal}] = list of values
type key struct{ idx, col any }
groups := map[key][]any{}
for _, row := range rows {
idx := row[index]
col := row[columns]
val := row[values]
if val != nil {
k := key{idx, col}
groups[k] = append(groups[k], val)
}
}
aggregate := func(vals []any, fn string) any {
if len(vals) == 0 {
return 0
}
switch fn {
case "count":
return len(vals)
case "first":
return vals[0]
case "last":
return vals[len(vals)-1]
}
// Numeric functions: sum, mean, min, max
toFloat := func(v any) float64 {
switch n := v.(type) {
case float64:
return n
case float32:
return float64(n)
case int:
return float64(n)
case int64:
return float64(n)
case int32:
return float64(n)
}
return 0
}
sum := 0.0
mn := toFloat(vals[0])
mx := toFloat(vals[0])
for _, v := range vals {
f := toFloat(v)
sum += f
if f < mn {
mn = f
}
if f > mx {
mx = f
}
}
switch fn {
case "sum":
return sum
case "mean":
return sum / float64(len(vals))
case "min":
return mn
case "max":
return mx
}
return sum
}
result := make([]map[string]any, 0, len(indexOrder))
for _, idx := range indexOrder {
record := map[string]any{index: idx}
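		for _, col := range colOrder {
			// Column values become keys of the output record, so only
			// strings are usable; skip any other type instead of panicking.
			name, ok := col.(string)
			if !ok {
				continue
			}
			if vals := groups[key{idx, col}]; len(vals) > 0 {
				record[name] = aggregate(vals, agg)
			} else {
				record[name] = 0
			}
		}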
result = append(result, record)
}
return result
}
+43
View File
@@ -0,0 +1,43 @@
---
name: pivot
kind: function
lang: go
domain: datascience
version: "1.0.0"
purity: pure
signature: "func Pivot(rows []map[string]any, index, columns, values, agg string) []map[string]any"
description: "Dependency-free pivot table. Groups by index, expands the unique values of columns into new columns, and aggregates values with the given function (sum, count, mean, min, max, first, last). Missing cells are filled with 0."
tags: [datascience, tabular, pivot, transform, aggregation, go]
uses_functions: []
uses_types: []
returns: []
returns_optional: false
error_type: ""
imports: []
tested: true
tests:
- "Pivot basico con sum"
- "Pivot con count y mean"
- "Valores faltantes rellenados con 0"
- "Una sola fila"
- "Multiples valores por celda requieren agregacion"
test_file_path: "functions/datascience/pivot_test.go"
file_path: "functions/datascience/pivot.go"
---
## Example
```go
rows := []map[string]any{
{"region": "US", "product": "A", "sales": 10},
{"region": "US", "product": "B", "sales": 20},
{"region": "EU", "product": "A", "sales": 15},
}
result := Pivot(rows, "region", "product", "sales", "sum")
// [{"region": "US", "A": 10.0, "B": 20.0}, {"region": "EU", "A": 15.0, "B": 0}]
```
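## Notes
Pure function with no external dependencies. Uses map[string]any to work with deserialized JSON/SQL data.
The numeric aggregations (sum, mean, min, max) convert values to float64 through a type switch; values that are not float64, float32, int, int32, or int64 count as 0, and an unrecognized agg name falls back to sum. Missing cells are filled with the int 0, whereas numeric aggregations return float64. Column values must be strings, because they become keys of the output records.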
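
Since one output record can mix float64 cells (numeric aggregations) with int cells (count, or the 0 used for missing cells), callers may want a single way to read a cell as a number. The sketch below is one possible approach; `cellFloat` is a hypothetical helper and not part of the registry.

```go
package main

import "fmt"

// cellFloat converts a pivot cell to float64 when it holds one of the
// numeric types Pivot is described as producing or accepting.
// The second result is false for anything else (for example strings).
func cellFloat(v any) (float64, bool) {
	switch n := v.(type) {
	case float64:
		return n, true
	case float32:
		return float64(n), true
	case int:
		return float64(n), true
	case int32:
		return float64(n), true
	case int64:
		return float64(n), true
	}
	return 0, false
}

func main() {
	// Shape of one Pivot output record as described above: an aggregated
	// float64 cell and a missing cell filled with the int 0.
	record := map[string]any{"region": "EU", "A": 15.0, "B": 0}
	for _, col := range []string{"A", "B", "region"} {
		f, ok := cellFloat(record[col])
		fmt.Println(col, f, ok)
	}
	// A 15 true
	// B 0 true
	// region 0 false
}
```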
+111
View File
@@ -0,0 +1,111 @@
package datascience
import (
"testing"
)
func TestPivot(t *testing.T) {
t.Run("Pivot basico con sum", func(t *testing.T) {
rows := []map[string]any{
{"region": "US", "product": "A", "sales": 10},
{"region": "US", "product": "B", "sales": 20},
{"region": "EU", "product": "A", "sales": 15},
}
result := Pivot(rows, "region", "product", "sales", "sum")
if len(result) != 2 {
t.Fatalf("got %d rows, want 2", len(result))
}
var us, eu map[string]any
for _, r := range result {
if r["region"] == "US" {
us = r
} else {
eu = r
}
}
if us["A"] != 10.0 {
t.Errorf("US.A: got %v, want 10", us["A"])
}
if us["B"] != 20.0 {
t.Errorf("US.B: got %v, want 20", us["B"])
}
if eu["A"] != 15.0 {
t.Errorf("EU.A: got %v, want 15", eu["A"])
}
if eu["B"] != 0 {
t.Errorf("EU.B: got %v, want 0", eu["B"])
}
})
t.Run("Pivot con count y mean", func(t *testing.T) {
rows := []map[string]any{
{"region": "US", "product": "A", "sales": 10},
{"region": "US", "product": "A", "sales": 20},
{"region": "EU", "product": "A", "sales": 15},
}
resultCount := Pivot(rows, "region", "product", "sales", "count")
for _, r := range resultCount {
if r["region"] == "US" && r["A"] != 2 {
t.Errorf("count US.A: got %v, want 2", r["A"])
}
}
resultMean := Pivot(rows, "region", "product", "sales", "mean")
for _, r := range resultMean {
if r["region"] == "US" {
mean, ok := r["A"].(float64)
if !ok || mean != 15.0 {
t.Errorf("mean US.A: got %v, want 15.0", r["A"])
}
}
}
})
t.Run("Valores faltantes rellenados con 0", func(t *testing.T) {
rows := []map[string]any{
{"region": "US", "product": "A", "sales": 5},
{"region": "EU", "product": "B", "sales": 8},
}
result := Pivot(rows, "region", "product", "sales", "sum")
for _, r := range result {
if r["region"] == "US" && r["B"] != 0 {
t.Errorf("US.B: got %v, want 0", r["B"])
}
if r["region"] == "EU" && r["A"] != 0 {
t.Errorf("EU.A: got %v, want 0", r["A"])
}
}
})
t.Run("Una sola fila", func(t *testing.T) {
rows := []map[string]any{
{"region": "US", "product": "A", "sales": 42},
}
result := Pivot(rows, "region", "product", "sales", "sum")
if len(result) != 1 {
t.Fatalf("got %d rows, want 1", len(result))
}
if result[0]["A"] != 42.0 {
t.Errorf("got %v, want 42", result[0]["A"])
}
})
t.Run("Multiples valores por celda requieren agregacion", func(t *testing.T) {
rows := []map[string]any{
{"region": "US", "product": "A", "sales": 10},
{"region": "US", "product": "A", "sales": 30},
}
resultSum := Pivot(rows, "region", "product", "sales", "sum")
if resultSum[0]["A"] != 40.0 {
t.Errorf("sum: got %v, want 40.0", resultSum[0]["A"])
}
resultMin := Pivot(rows, "region", "product", "sales", "min")
if resultMin[0]["A"] != 10.0 {
t.Errorf("min: got %v, want 10.0", resultMin[0]["A"])
}
resultMax := Pivot(rows, "region", "product", "sales", "max")
if resultMax[0]["A"] != 30.0 {
t.Errorf("max: got %v, want 30.0", resultMax[0]["A"])
}
})
}
@@ -0,0 +1,156 @@
package infra
import (
"database/sql"
"encoding/json"
"fmt"
"sync"
"time"
_ "github.com/mattn/go-sqlite3"
)
// SQLiteCache is a key-value cache persisted in SQLite with TTL support.
// Values are stored as serialized JSON. The caller is responsible for
// deserializing the []byte returned by Get.
// Safe for concurrent use.
type SQLiteCache struct {
db *sql.DB
namespace string
mu sync.Mutex
}
const sqliteCacheSchema = `
CREATE TABLE IF NOT EXISTS cache (
namespace TEXT NOT NULL,
key TEXT NOT NULL,
value TEXT NOT NULL,
created_at REAL NOT NULL,
expires_at REAL,
PRIMARY KEY (namespace, key)
);`
// CacheToSQLite opens (or creates) a SQLite database at dbPath and returns
// a SQLiteCache for the given namespace.
func CacheToSQLite(dbPath, namespace string) (*SQLiteCache, error) {
db, err := sql.Open("sqlite3", dbPath+"?_journal_mode=WAL")
if err != nil {
return nil, fmt.Errorf("cache_to_sqlite: open db: %w", err)
}
if _, err := db.Exec(sqliteCacheSchema); err != nil {
db.Close()
return nil, fmt.Errorf("cache_to_sqlite: create schema: %w", err)
}
return &SQLiteCache{db: db, namespace: namespace}, nil
}
// evictExpired deletes the expired entries of the namespace. It must be
// called with the mutex already held. Errors from the DELETE are ignored
// (best effort).
func (c *SQLiteCache) evictExpired() {
now := float64(time.Now().UnixNano()) / 1e9
c.db.Exec(
"DELETE FROM cache WHERE namespace = ? AND expires_at IS NOT NULL AND expires_at <= ?",
c.namespace, now,
)
}
// Get returns the value stored under key and true, or nil and false if the
// key does not exist or has expired. The []byte holds JSON that the caller
// can deserialize.
func (c *SQLiteCache) Get(key string) ([]byte, bool) {
c.mu.Lock()
defer c.mu.Unlock()
c.evictExpired()
var value string
err := c.db.QueryRow(
"SELECT value FROM cache WHERE namespace = ? AND key = ?",
c.namespace, key,
).Scan(&value)
if err != nil {
return nil, false
}
return []byte(value), true
}
// Set stores value (JSON bytes) under key. ttl=0 means no expiration.
func (c *SQLiteCache) Set(key string, value []byte, ttl time.Duration) error {
c.mu.Lock()
defer c.mu.Unlock()
now := float64(time.Now().UnixNano()) / 1e9
var expiresAt any
if ttl > 0 {
expiresAt = now + ttl.Seconds()
}
_, err := c.db.Exec(
`INSERT INTO cache (namespace, key, value, created_at, expires_at)
VALUES (?, ?, ?, ?, ?)
ON CONFLICT(namespace, key) DO UPDATE SET
value = excluded.value,
created_at = excluded.created_at,
expires_at = excluded.expires_at`,
c.namespace, key, string(value), now, expiresAt,
)
if err != nil {
return fmt.Errorf("cache set: %w", err)
}
return nil
}
// Delete removes the entry stored under key. It returns an error if the query fails.
func (c *SQLiteCache) Delete(key string) error {
c.mu.Lock()
defer c.mu.Unlock()
_, err := c.db.Exec(
"DELETE FROM cache WHERE namespace = ? AND key = ?",
c.namespace, key,
)
if err != nil {
return fmt.Errorf("cache delete: %w", err)
}
return nil
}
// Clear removes every entry in the namespace. It returns the number of rows
// deleted.
func (c *SQLiteCache) Clear() (int64, error) {
c.mu.Lock()
defer c.mu.Unlock()
res, err := c.db.Exec(
"DELETE FROM cache WHERE namespace = ?",
c.namespace,
)
if err != nil {
return 0, fmt.Errorf("cache clear: %w", err)
}
n, err := res.RowsAffected()
if err != nil {
return 0, fmt.Errorf("cache clear: rows affected: %w", err)
}
return n, nil
}
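// GetOrSet returns the cached value, or calls factory() to obtain it,
// stores it with the given ttl and returns it. Get and Set take the lock
// separately, so concurrent callers that miss on the same key may each
// invoke factory.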
func (c *SQLiteCache) GetOrSet(key string, factory func() ([]byte, error), ttl time.Duration) ([]byte, error) {
if v, ok := c.Get(key); ok {
return v, nil
}
value, err := factory()
if err != nil {
return nil, fmt.Errorf("cache get_or_set factory: %w", err)
}
if err := c.Set(key, value, ttl); err != nil {
return nil, err
}
return value, nil
}
// SetJSON serializes v as JSON and stores it under key.
func (c *SQLiteCache) SetJSON(key string, v any, ttl time.Duration) error {
b, err := json.Marshal(v)
if err != nil {
return fmt.Errorf("cache set_json marshal: %w", err)
}
return c.Set(key, b, ttl)
}
// Close closes the database connection.
func (c *SQLiteCache) Close() error {
return c.db.Close()
}
@@ -0,0 +1,58 @@
---
name: cache_to_sqlite
kind: function
lang: go
domain: infra
version: "1.0.0"
purity: impure
signature: "func CacheToSQLite(dbPath, namespace string) (*SQLiteCache, error)"
description: "Key-value cache persisted in SQLite with TTL and lazy eviction. Values are stored as JSON bytes; the caller serializes and deserializes. Thread-safe via sync.Mutex. Supports Get, Set, Delete, Clear, GetOrSet, and SetJSON."
tags: [cache, sqlite, persistence, ttl, key-value, concurrent]
uses_functions: []
uses_types: []
returns: []
returns_optional: false
error_type: "error_go_core"
imports: ["database/sql", "encoding/json", "sync", "time", "fmt"]
tested: true
tests:
- "Set/Get basico"
- "TTL expirado"
- "GetOrSet con factory"
- "Concurrencia (goroutines)"
test_file_path: "functions/infra/cache_to_sqlite_test.go"
file_path: "functions/infra/cache_to_sqlite.go"
---
## Ejemplo
```go
cache, err := infra.CacheToSQLite("my_cache.db", "default")
if err != nil {
log.Fatal(err)
}
defer cache.Close()
// Store JSON bytes with a 1-hour TTL
payload, _ := json.Marshal(map[string]string{"result": "ok"})
cache.Set("key1", payload, time.Hour)
// Retrieve
if v, ok := cache.Get("key1"); ok {
var result map[string]string
json.Unmarshal(v, &result)
fmt.Println(result["result"]) // ok
}
// Factory pattern (computeExpensiveThing is a placeholder)
val, err := cache.GetOrSet("expensive_key", func() ([]byte, error) {
return json.Marshal(computeExpensiveThing())
}, time.Hour)
// Helper that serializes directly (userStruct is a placeholder)
cache.SetJSON("user:42", userStruct, 30*time.Minute)
```
## Notas
Uses WAL mode for better read concurrency. Lazy eviction deletes expired entries on every `Get`. The schema shares the `cache` table with `cache_to_sqlite_py_infra`: both implementations are interoperable on the same SQLite file if they use different namespaces. Requires `github.com/mattn/go-sqlite3` (already present in the registry).
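
Since `Get` returns raw JSON bytes, every caller repeats the same unmarshal step. The sketch below shows one way a typed read helper could look. `GetJSON`, `byteGetter`, and `memCache` are hypothetical names introduced here; none of them exist in `cache_to_sqlite.go`. The in-memory map only stands in for the SQLite-backed cache so the example runs without a database.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// byteGetter is a hypothetical interface that matches the shape of
// SQLiteCache.Get, so the helper works with any cache exposing that method.
type byteGetter interface {
	Get(key string) ([]byte, bool)
}

// GetJSON is a hypothetical counterpart to SetJSON: it fetches the bytes
// stored under key and unmarshals them into a value of type T.
// The bool reports whether the key was found.
func GetJSON[T any](c byteGetter, key string) (T, bool, error) {
	var out T
	raw, ok := c.Get(key)
	if !ok {
		return out, false, nil
	}
	if err := json.Unmarshal(raw, &out); err != nil {
		return out, true, fmt.Errorf("cache get_json unmarshal: %w", err)
	}
	return out, true, nil
}

// memCache is an in-memory stand-in used only to make the sketch runnable.
type memCache map[string][]byte

func (m memCache) Get(key string) ([]byte, bool) {
	v, ok := m[key]
	return v, ok
}

func main() {
	c := memCache{"user:42": []byte(`{"name":"Alice"}`)}
	user, found, err := GetJSON[map[string]string](c, "user:42")
	fmt.Println(user["name"], found, err) // Alice true <nil>
}
```

A generic function is used rather than a method because Go methods cannot declare their own type parameters.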
@@ -0,0 +1,134 @@
package infra
import (
"encoding/json"
"fmt"
"os"
"sync"
"testing"
"time"
)
func tempDB(t *testing.T) string {
t.Helper()
f, err := os.CreateTemp(t.TempDir(), "cache_*.db")
if err != nil {
t.Fatal(err)
}
f.Close()
return f.Name()
}
func TestCacheToSQLite_SetGet(t *testing.T) {
t.Run("Set/Get basico", func(t *testing.T) {
c, err := CacheToSQLite(tempDB(t), "default")
if err != nil {
t.Fatal(err)
}
defer c.Close()
payload, _ := json.Marshal(map[string]int{"x": 1})
if err := c.Set("foo", payload, 0); err != nil {
t.Fatal(err)
}
got, ok := c.Get("foo")
if !ok {
t.Fatal("expected cache hit")
}
var result map[string]int
json.Unmarshal(got, &result)
if result["x"] != 1 {
t.Errorf("got %v, want x=1", result)
}
})
}
func TestCacheToSQLite_TTLExpirado(t *testing.T) {
t.Run("TTL expirado", func(t *testing.T) {
c, err := CacheToSQLite(tempDB(t), "default")
if err != nil {
t.Fatal(err)
}
defer c.Close()
payload, _ := json.Marshal("hello")
c.Set("temp", payload, 50*time.Millisecond)
time.Sleep(100 * time.Millisecond)
_, ok := c.Get("temp")
if ok {
t.Error("expected cache miss after TTL expiry")
}
})
}
func TestCacheToSQLite_GetOrSet(t *testing.T) {
t.Run("GetOrSet con factory", func(t *testing.T) {
c, err := CacheToSQLite(tempDB(t), "default")
if err != nil {
t.Fatal(err)
}
defer c.Close()
calls := 0
factory := func() ([]byte, error) {
calls++
return json.Marshal("computed")
}
v1, err := c.GetOrSet("k", factory, time.Minute)
if err != nil {
t.Fatal(err)
}
v2, err := c.GetOrSet("k", factory, time.Minute)
if err != nil {
t.Fatal(err)
}
if string(v1) != string(v2) {
t.Errorf("v1=%s v2=%s, want equal", v1, v2)
}
if calls != 1 {
t.Errorf("factory called %d times, want 1", calls)
}
})
}
func TestCacheToSQLite_Concurrencia(t *testing.T) {
t.Run("Concurrencia (goroutines)", func(t *testing.T) {
c, err := CacheToSQLite(tempDB(t), "parallel")
if err != nil {
t.Fatal(err)
}
defer c.Close()
var wg sync.WaitGroup
errs := make(chan error, 40)
for i := 0; i < 20; i++ {
wg.Add(1)
go func(n int) {
defer wg.Done()
key := fmt.Sprintf("key_%d", n)
payload, _ := json.Marshal(n)
if err := c.Set(key, payload, 0); err != nil {
errs <- err
return
}
got, ok := c.Get(key)
if !ok {
errs <- fmt.Errorf("miss for key %s", key)
return
}
var val int
json.Unmarshal(got, &val)
if val != n {
errs <- fmt.Errorf("key %s: got %d want %d", key, val, n)
}
}(i)
}
wg.Wait()
close(errs)
for err := range errs {
t.Error(err)
}
})
}
@@ -0,0 +1,136 @@
package infra
import (
"context"
"time"
)
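// CronTickerSchedule is the schedule consumed by CronTicker. It mirrors
// core.CronSchedule and is duplicated here to respect the registry rule of no
// cross-domain imports between function packages. In practice, callers should
// use core.ParseCronExpr and copy the result into this struct.
//
// Each field lists the allowed values. Month, Hour, and Minute should be
// sorted in ascending order, because the next-match helpers return the first
// larger element they find. DayOfWeek uses time.Weekday numbering (0 = Sunday).
// A day matches only when both DayOfMonth and DayOfWeek match.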
type CronTickerSchedule struct {
Minute []int
Hour []int
DayOfMonth []int
Month []int
DayOfWeek []int
}
// CronTicker creates a channel that emits the current time whenever the given
// schedule fires. It uses time.NewTimer internally, recalculating the next tick
// after each emission. The channel is closed when ctx is cancelled.
func CronTicker(schedule CronTickerSchedule, ctx context.Context) <-chan time.Time {
ch := make(chan time.Time, 1)
go func() {
defer close(ch)
for {
next := cronTickerNext(schedule, time.Now())
if next.IsZero() {
// Impossible schedule — nothing to emit.
return
}
delay := time.Until(next)
timer := time.NewTimer(delay)
select {
case <-ctx.Done():
timer.Stop()
return
case tick := <-timer.C:
select {
case ch <- tick:
default:
// Drop if consumer is not ready.
}
}
}
}()
return ch
}
// cronTickerNext finds the next time after `after` that satisfies the schedule.
// Returns zero time if no match within 366 days.
func cronTickerNext(s CronTickerSchedule, after time.Time) time.Time {
t := after.Truncate(time.Minute).Add(time.Minute)
limit := after.Add(366 * 24 * time.Hour)
for t.Before(limit) {
if !cronIntIn(int(t.Month()), s.Month) {
t = cronNextMonth(t, s.Month)
if t.IsZero() {
return time.Time{}
}
continue
}
domOK := cronIntIn(t.Day(), s.DayOfMonth)
dowOK := cronIntIn(int(t.Weekday()), s.DayOfWeek)
if !domOK || !dowOK {
t = time.Date(t.Year(), t.Month(), t.Day()+1, 0, 0, 0, 0, t.Location())
continue
}
if !cronIntIn(t.Hour(), s.Hour) {
next := cronNextHour(t, s.Hour)
if next.IsZero() {
t = time.Date(t.Year(), t.Month(), t.Day()+1, 0, 0, 0, 0, t.Location())
} else {
t = next
}
continue
}
if !cronIntIn(t.Minute(), s.Minute) {
next := cronNextMinute(t, s.Minute)
if next.IsZero() {
t = time.Date(t.Year(), t.Month(), t.Day(), t.Hour()+1, 0, 0, 0, t.Location())
} else {
t = next
}
continue
}
return t
}
return time.Time{}
}
func cronIntIn(v int, s []int) bool {
for _, x := range s {
if x == v {
return true
}
}
return false
}
func cronNextMonth(t time.Time, months []int) time.Time {
month := int(t.Month())
for _, m := range months {
if m > month {
return time.Date(t.Year(), time.Month(m), 1, 0, 0, 0, 0, t.Location())
}
}
if len(months) > 0 {
return time.Date(t.Year()+1, time.Month(months[0]), 1, 0, 0, 0, 0, t.Location())
}
return time.Time{}
}
func cronNextHour(t time.Time, hours []int) time.Time {
h := t.Hour()
for _, hh := range hours {
if hh > h {
return time.Date(t.Year(), t.Month(), t.Day(), hh, 0, 0, 0, t.Location())
}
}
return time.Time{}
}
func cronNextMinute(t time.Time, minutes []int) time.Time {
m := t.Minute()
for _, mm := range minutes {
if mm > m {
return time.Date(t.Year(), t.Month(), t.Day(), t.Hour(), mm, 0, 0, t.Location())
}
}
return time.Time{}
}
@@ -0,0 +1,45 @@
---
name: cron_ticker
kind: function
lang: go
domain: infra
version: "1.0.0"
purity: impure
signature: "func CronTicker(schedule CronTickerSchedule, ctx context.Context) <-chan time.Time"
description: "Creates a channel that emits a time.Time on every tick of the cron schedule. Uses time.NewTimer internally, recalculating the next tick after each emission. The channel is closed when the context is cancelled. Includes CronTickerSchedule (a local mirror of CronSchedule that avoids a cross-package dependency)."
tags: [cron, scheduling, ticker, channel, goroutine, concurrency, impure]
uses_functions: [parse_cron_expr_go_core, next_cron_time_go_core]
uses_types: [cron_schedule_go_core]
returns: []
returns_optional: false
error_type: ""
imports: [context, time]
tested: true
tests:
- "context cancel cierra el channel"
- "ticker emite al llegar el momento del schedule"
test_file_path: "functions/infra/cron_ticker_test.go"
file_path: "functions/infra/cron_ticker.go"
---
## Ejemplo
```go
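// intRange(lo, hi) is assumed to return the integers lo..hi inclusive;
// it is not part of this package.
sched := CronTickerSchedule{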
Minute: []int{0, 15, 30, 45},
Hour: intRange(0, 23),
DayOfMonth: intRange(1, 31),
Month: intRange(1, 12),
DayOfWeek: intRange(0, 6),
}
ctx, cancel := context.WithCancel(context.Background())
defer cancel()
for tick := range CronTicker(sched, ctx) {
fmt.Println("tick:", tick)
}
```
## Notas
Impure function: it launches a goroutine and uses time.NewTimer and context. The CronTickerSchedule type is a local mirror of core.CronSchedule that avoids cross-package imports between Go domains. In real use, convert the result of core.ParseCronExpr manually. The channel has a buffer of 1 so the goroutine never blocks on a slow consumer; a tick that arrives while the buffer is full is dropped.
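
The example above calls `intRange`, which this package does not define. A minimal sketch of such a helper follows; the name and signature are assumptions taken from how the example uses it. It returns values in ascending order, which matters because `cronNextHour` and `cronNextMinute` pick the first listed element greater than the current one.

```go
package main

import "fmt"

// intRange is a hypothetical helper (not part of cron_ticker.go) that returns
// the integers lo..hi inclusive, in ascending order.
func intRange(lo, hi int) []int {
	if hi < lo {
		return nil
	}
	out := make([]int, 0, hi-lo+1)
	for i := lo; i <= hi; i++ {
		out = append(out, i)
	}
	return out
}

func main() {
	fmt.Println(intRange(0, 6))       // [0 1 2 3 4 5 6]
	fmt.Println(len(intRange(1, 31))) // 31
}
```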
@@ -0,0 +1,114 @@
package infra
import (
"context"
"testing"
"time"
)
func allMinutes() []int {
s := make([]int, 60)
for i := range s {
s[i] = i
}
return s
}
func allHours() []int {
s := make([]int, 24)
for i := range s {
s[i] = i
}
return s
}
func allDays() []int {
s := make([]int, 31)
for i := range s {
s[i] = i + 1
}
return s
}
func allMonths() []int {
s := make([]int, 12)
for i := range s {
s[i] = i + 1
}
return s
}
func allDOW() []int {
s := make([]int, 7)
for i := range s {
s[i] = i
}
return s
}
func TestCronTicker(t *testing.T) {
t.Run("context cancel cierra el channel", func(t *testing.T) {
sched := CronTickerSchedule{
Minute: allMinutes(),
Hour: allHours(),
DayOfMonth: allDays(),
Month: allMonths(),
DayOfWeek: allDOW(),
}
ctx, cancel := context.WithCancel(context.Background())
ch := CronTicker(sched, ctx)
// Cancel immediately.
cancel()
// Channel should close without blocking.
timeout := time.After(2 * time.Second)
select {
case _, ok := <-ch:
if ok {
// Might receive one tick before cancel propagates — acceptable.
}
// Drain remaining.
for range ch {
}
case <-timeout:
t.Error("channel did not close within 2s after context cancel")
}
})
t.Run("ticker emite al llegar el momento del schedule", func(t *testing.T) {
// Use a schedule that fires every minute (all minutes).
// The next tick is at most 60s away. We use a short-lived context
// to avoid waiting: instead we verify the channel is not nil and
// that cancellation closes it cleanly.
sched := CronTickerSchedule{
Minute: allMinutes(),
Hour: allHours(),
DayOfMonth: allDays(),
Month: allMonths(),
DayOfWeek: allDOW(),
}
ctx, cancel := context.WithTimeout(context.Background(), 100*time.Millisecond)
defer cancel()
ch := CronTicker(sched, ctx)
if ch == nil {
t.Fatal("CronTicker returned nil channel")
}
// Wait for context to expire, then confirm channel closes.
<-ctx.Done()
timeout := time.After(2 * time.Second)
for {
select {
case _, ok := <-ch:
if !ok {
return // channel closed, test passes
}
case <-timeout:
t.Error("channel did not close within 2s after context timeout")
return
}
}
})
}
@@ -0,0 +1,71 @@
package infra
import (
"fmt"
"io"
"net/http"
"os"
"path/filepath"
"time"
)
// HttpDownloadFile streams url into destPath with io.Copy.
// It creates intermediate directories with os.MkdirAll and writes to a temp
// file that is renamed into place, so a failed download never leaves a
// partial file at destPath. It returns the number of bytes written.
func HttpDownloadFile(url, destPath string, headers map[string]string, timeout time.Duration) (int64, error) {
client := &http.Client{Timeout: timeout}
req, err := http.NewRequest(http.MethodGet, url, nil)
if err != nil {
return 0, fmt.Errorf("http_download_file: build request: %w", err)
}
for k, v := range headers {
req.Header.Set(k, v)
}
resp, err := client.Do(req)
if err != nil {
return 0, fmt.Errorf("http_download_file: %w", err)
}
defer resp.Body.Close()
if resp.StatusCode >= 400 {
shortURL := url
if len(shortURL) > 100 {
shortURL = shortURL[:100]
}
return 0, fmt.Errorf("http_download_file: HTTP %d at %q", resp.StatusCode, shortURL)
}
dir := filepath.Dir(destPath)
if err := os.MkdirAll(dir, 0o755); err != nil {
return 0, fmt.Errorf("http_download_file: create dirs: %w", err)
}
// Temp file in the same directory so the rename is atomic
tmp, err := os.CreateTemp(dir, ".download-*")
if err != nil {
return 0, fmt.Errorf("http_download_file: create temp file: %w", err)
}
tmpPath := tmp.Name()
defer func() {
tmp.Close()
os.Remove(tmpPath) // harmless after a successful rename: the temp path no longer exists
}()
n, err := io.Copy(tmp, resp.Body)
if err != nil {
return 0, fmt.Errorf("http_download_file: write: %w", err)
}
if err := tmp.Close(); err != nil {
return 0, fmt.Errorf("http_download_file: close temp: %w", err)
}
if err := os.Rename(tmpPath, destPath); err != nil {
return 0, fmt.Errorf("http_download_file: rename: %w", err)
}
return n, nil
}
@@ -0,0 +1,44 @@
---
name: http_download_file
kind: function
lang: go
domain: infra
version: "1.0.0"
purity: impure
signature: "func HttpDownloadFile(url, destPath string, headers map[string]string, timeout time.Duration) (int64, error)"
description: "Streams url into destPath with io.Copy. Creates directories with os.MkdirAll. Uses a temp file plus rename for atomicity (a failed download leaves no partial file at destPath). Returns the number of bytes written."
tags: [http, download, file, streaming, atomic, network, stdlib, infra]
uses_functions: []
uses_types: []
returns: []
returns_optional: false
error_type: "error_go_core"
imports: ["fmt", "io", "net/http", "os", "path/filepath", "time"]
tested: true
tests:
- "httptest.Server sirve archivo binario"
- "Directorio creado automaticamente"
- "Archivo temporal + rename (no deja basura si falla)"
- "Size retornado coincide"
test_file_path: "functions/infra/http_download_file_test.go"
file_path: "functions/infra/http_download_file.go"
---
## Ejemplo
```go
n, err := HttpDownloadFile(
"https://example.com/report.pdf",
"/tmp/reports/report.pdf",
nil,
2*time.Minute,
)
if err != nil {
return err
}
fmt.Printf("Downloaded %d bytes\n", n)
```
## Notas
Uses only the stdlib. The temp file is created in the same directory as destPath so the rename is atomic (same filesystem). If the download fails, the deferred os.Remove deletes the temp file. Works with files of any size because it streams with io.Copy.
@@ -0,0 +1,99 @@
package infra
import (
"net/http"
"net/http/httptest"
"os"
"path/filepath"
"testing"
"time"
)
func TestHttpDownloadFile(t *testing.T) {
t.Run("httptest.Server sirve archivo binario", func(t *testing.T) {
content := []byte("\x00\x01\x02\x03binary content")
srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
w.Header().Set("Content-Type", "application/octet-stream")
w.Write(content)
}))
defer srv.Close()
tmp := t.TempDir()
dest := filepath.Join(tmp, "out.bin")
n, err := HttpDownloadFile(srv.URL, dest, nil, 5*time.Second)
if err != nil {
t.Fatalf("unexpected error: %v", err)
}
if n != int64(len(content)) {
t.Errorf("got %d bytes, want %d", n, len(content))
}
got, _ := os.ReadFile(dest)
if string(got) != string(content) {
t.Errorf("file content mismatch")
}
})
t.Run("Directorio creado automaticamente", func(t *testing.T) {
srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
w.Write([]byte("data"))
}))
defer srv.Close()
tmp := t.TempDir()
dest := filepath.Join(tmp, "nested", "deep", "file.bin")
_, err := HttpDownloadFile(srv.URL, dest, nil, 5*time.Second)
if err != nil {
t.Fatalf("unexpected error: %v", err)
}
if _, err := os.Stat(dest); os.IsNotExist(err) {
t.Error("dest file does not exist after download")
}
})
t.Run("Archivo temporal + rename (no deja basura si falla)", func(t *testing.T) {
// The server declares more bytes than it sends, so the connection is cut
// mid-transfer and io.Copy fails on the client side.
srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
w.Header().Set("Content-Length", "1000")
w.Write([]byte("partial"))
}))
defer srv.Close()
tmp := t.TempDir()
dest := filepath.Join(tmp, "file.bin")
if _, err := HttpDownloadFile(srv.URL, dest, nil, 5*time.Second); err == nil {
t.Fatal("expected error for truncated response, got nil")
}
// Neither the destination nor a .download-* temp file may remain.
entries, err := os.ReadDir(tmp)
if err != nil {
t.Fatal(err)
}
for _, e := range entries {
t.Errorf("unexpected file left behind: %s", e.Name())
}
})
t.Run("Size retornado coincide", func(t *testing.T) {
content := make([]byte, 10000)
for i := range content {
content[i] = byte(i % 256)
}
srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
w.Write(content)
}))
defer srv.Close()
tmp := t.TempDir()
dest := filepath.Join(tmp, "big.bin")
n, err := HttpDownloadFile(srv.URL, dest, nil, 5*time.Second)
if err != nil {
t.Fatalf("unexpected error: %v", err)
}
if n != int64(len(content)) {
t.Errorf("got %d bytes, want %d", n, len(content))
}
})
}
@@ -0,0 +1,56 @@
package infra
import (
"encoding/json"
"fmt"
"io"
"net/http"
"time"
)
// HttpGetJSON performs a GET request to url and parses the response as JSON.
// It adds Accept: application/json automatically. If the status is >= 400 it
// returns an error that includes the status code and the first 200 bytes of the body.
func HttpGetJSON(url string, headers map[string]string, timeout time.Duration) (map[string]any, error) {
client := &http.Client{Timeout: timeout}
req, err := http.NewRequest(http.MethodGet, url, nil)
if err != nil {
return nil, fmt.Errorf("http_get_json: build request: %w", err)
}
req.Header.Set("Accept", "application/json")
for k, v := range headers {
req.Header.Set(k, v)
}
resp, err := client.Do(req)
if err != nil {
return nil, fmt.Errorf("http_get_json: %w", err)
}
defer resp.Body.Close()
body, err := io.ReadAll(resp.Body)
if err != nil {
return nil, fmt.Errorf("http_get_json: read body: %w", err)
}
if resp.StatusCode >= 400 {
preview := body
if len(preview) > 200 {
preview = preview[:200]
}
shortURL := url
if len(shortURL) > 100 {
shortURL = shortURL[:100]
}
return nil, fmt.Errorf("http_get_json: HTTP %d at %q — %s", resp.StatusCode, shortURL, preview)
}
var result map[string]any
if err := json.Unmarshal(body, &result); err != nil {
return nil, fmt.Errorf("http_get_json: parse JSON: %w", err)
}
return result, nil
}
@@ -0,0 +1,43 @@
---
name: http_get_json
kind: function
lang: go
domain: infra
version: "1.0.0"
purity: impure
signature: "func HttpGetJSON(url string, headers map[string]string, timeout time.Duration) (map[string]any, error)"
description: "GET request that expects JSON. Adds Accept: application/json automatically. Returns an error with the status code if it is >= 400. Always closes the body with defer."
tags: [http, json, get, client, network, stdlib, infra]
uses_functions: []
uses_types: []
returns: []
returns_optional: false
error_type: "error_go_core"
imports: ["encoding/json", "fmt", "io", "net/http", "time"]
tested: true
tests:
- "httptest.Server con respuesta JSON"
- "Status 404 → error"
- "Timeout → error"
- "Headers custom"
test_file_path: "functions/infra/http_get_json_test.go"
file_path: "functions/infra/http_get_json.go"
---
## Ejemplo
```go
result, err := HttpGetJSON(
"https://api.example.com/users",
map[string]string{"X-Api-Key": "secret"},
10*time.Second,
)
if err != nil {
return nil, err
}
fmt.Println(result["total"])
```
## Notas
Uses only the stdlib (net/http, encoding/json). The timeout is set on the http.Client. The error includes the first 200 bytes of the body to ease debugging. Custom headers are merged with Accept: application/json (custom values take precedence).
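
The precedence rule follows from the order of the `Header.Set` calls: the default is set first and custom headers afterwards. The sketch below reproduces only that header setup, with no network call; `buildGetRequest` is a hypothetical helper used for illustration.

```go
package main

import (
	"fmt"
	"net/http"
)

// buildGetRequest is a hypothetical helper that mirrors the header setup in
// HttpGetJSON: default Accept first, then caller-supplied headers.
func buildGetRequest(url string, headers map[string]string) (*http.Request, error) {
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return nil, fmt.Errorf("build request: %w", err)
	}
	req.Header.Set("Accept", "application/json")
	for k, v := range headers {
		// Header.Set canonicalizes the key, so "accept" replaces "Accept".
		req.Header.Set(k, v)
	}
	return req, nil
}

func main() {
	req, err := buildGetRequest("https://api.example.com/users", map[string]string{
		"accept":    "application/vnd.api+json",
		"X-Api-Key": "secret",
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Header.Get("Accept"))    // application/vnd.api+json
	fmt.Println(req.Header.Get("X-Api-Key")) // secret
}
```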
@@ -0,0 +1,80 @@
package infra
import (
"encoding/json"
"net/http"
"net/http/httptest"
"strings"
"testing"
"time"
)
func TestHttpGetJSON(t *testing.T) {
t.Run("httptest.Server con respuesta JSON", func(t *testing.T) {
payload := map[string]any{"ok": true, "value": float64(42)}
srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
w.Header().Set("Content-Type", "application/json")
json.NewEncoder(w).Encode(payload)
}))
defer srv.Close()
result, err := HttpGetJSON(srv.URL, nil, 5*time.Second)
if err != nil {
t.Fatalf("unexpected error: %v", err)
}
if result["ok"] != true {
t.Errorf("got ok=%v, want true", result["ok"])
}
})
t.Run("Status 404 → error", func(t *testing.T) {
srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
http.Error(w, "not found", http.StatusNotFound)
}))
defer srv.Close()
_, err := HttpGetJSON(srv.URL, nil, 5*time.Second)
if err == nil {
t.Fatal("expected error, got nil")
}
if !strings.Contains(err.Error(), "404") {
t.Errorf("error should contain 404, got: %v", err)
}
})
t.Run("Timeout → error", func(t *testing.T) {
srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
// Never responds: blocks until the client cancels
<-r.Context().Done()
}))
defer srv.Close()
_, err := HttpGetJSON(srv.URL, nil, 50*time.Millisecond)
if err == nil {
t.Fatal("expected timeout error, got nil")
}
})
t.Run("Headers custom", func(t *testing.T) {
receivedHeaders := make(chan http.Header, 1)
srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
receivedHeaders <- r.Header.Clone()
w.Header().Set("Content-Type", "application/json")
w.Write([]byte(`{}`))
}))
defer srv.Close()
headers := map[string]string{"X-Api-Key": "mytoken"}
_, err := HttpGetJSON(srv.URL, headers, 5*time.Second)
if err != nil {
t.Fatalf("unexpected error: %v", err)
}
h := <-receivedHeaders
if h.Get("X-Api-Key") != "mytoken" {
t.Errorf("X-Api-Key not sent, got: %v", h.Get("X-Api-Key"))
}
if h.Get("Accept") != "application/json" {
t.Errorf("Accept header missing, got: %v", h.Get("Accept"))
}
})
}
@@ -0,0 +1,63 @@
package infra
import (
"bytes"
"encoding/json"
"fmt"
"io"
"net/http"
"time"
)
// HttpPostJSON performs a POST request with a JSON body and parses the response as JSON.
// It adds Content-Type: application/json and Accept: application/json automatically.
// If the status is >= 400 it returns an error that includes the status code and the first 200 bytes of the body.
func HttpPostJSON(url string, body any, headers map[string]string, timeout time.Duration) (map[string]any, error) {
data, err := json.Marshal(body)
if err != nil {
return nil, fmt.Errorf("http_post_json: marshal body: %w", err)
}
client := &http.Client{Timeout: timeout}
req, err := http.NewRequest(http.MethodPost, url, bytes.NewReader(data))
if err != nil {
return nil, fmt.Errorf("http_post_json: build request: %w", err)
}
req.Header.Set("Content-Type", "application/json")
req.Header.Set("Accept", "application/json")
for k, v := range headers {
req.Header.Set(k, v)
}
resp, err := client.Do(req)
if err != nil {
return nil, fmt.Errorf("http_post_json: %w", err)
}
defer resp.Body.Close()
respBody, err := io.ReadAll(resp.Body)
if err != nil {
return nil, fmt.Errorf("http_post_json: read body: %w", err)
}
if resp.StatusCode >= 400 {
preview := respBody
if len(preview) > 200 {
preview = preview[:200]
}
shortURL := url
if len(shortURL) > 100 {
shortURL = shortURL[:100]
}
return nil, fmt.Errorf("http_post_json: HTTP %d at %q — %s", resp.StatusCode, shortURL, preview)
}
var result map[string]any
if err := json.Unmarshal(respBody, &result); err != nil {
return nil, fmt.Errorf("http_post_json: parse JSON: %w", err)
}
return result, nil
}
@@ -0,0 +1,43 @@
---
name: http_post_json
kind: function
lang: go
domain: infra
version: "1.0.0"
purity: impure
signature: "func HttpPostJSON(url string, body any, headers map[string]string, timeout time.Duration) (map[string]any, error)"
description: "POST request with a JSON body serialized with json.Marshal. Adds Content-Type: application/json and Accept: application/json. Returns an error with the status code if it is >= 400."
tags: [http, json, post, client, network, stdlib, infra]
uses_functions: []
uses_types: []
returns: []
returns_optional: false
error_type: "error_go_core"
imports: ["bytes", "encoding/json", "fmt", "io", "net/http", "time"]
tested: true
tests:
- "httptest.Server recibe body correcto"
- "Status 201 → exito"
- "Status 500 → error con body parcial"
test_file_path: "functions/infra/http_post_json_test.go"
file_path: "functions/infra/http_post_json.go"
---
## Ejemplo
```go
result, err := HttpPostJSON(
"https://api.example.com/users",
map[string]any{"name": "Alice", "role": "admin"},
map[string]string{"X-Api-Key": "secret"},
10*time.Second,
)
if err != nil {
return nil, err
}
fmt.Println(result["id"])
```
## Notas
Uses only the stdlib. The body accepts `any` and is serialized with json.Marshal. Custom headers are merged with the default Content-Type and Accept. The error includes the first 200 bytes of the response body.
@@ -0,0 +1,67 @@
package infra

import (
	"encoding/json"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
	"testing"
	"time"
)

func TestHttpPostJSON(t *testing.T) {
	t.Run("httptest.Server recibe body correcto", func(t *testing.T) {
		received := make(chan map[string]any, 1)
		srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			var body map[string]any
			data, _ := io.ReadAll(r.Body)
			json.Unmarshal(data, &body)
			received <- body
			w.Header().Set("Content-Type", "application/json")
			w.Write([]byte(`{"ok": true}`))
		}))
		defer srv.Close()

		_, err := HttpPostJSON(srv.URL, map[string]any{"name": "Alice", "score": 100}, nil, 5*time.Second)
		if err != nil {
			t.Fatalf("unexpected error: %v", err)
		}
		body := <-received
		if body["name"] != "Alice" {
			t.Errorf("name not received correctly, got: %v", body["name"])
		}
	})

	t.Run("Status 201 → exito", func(t *testing.T) {
		srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			w.Header().Set("Content-Type", "application/json")
			w.WriteHeader(http.StatusCreated)
			w.Write([]byte(`{"id": 42}`))
		}))
		defer srv.Close()

		result, err := HttpPostJSON(srv.URL, map[string]any{"x": 1}, nil, 5*time.Second)
		if err != nil {
			t.Fatalf("unexpected error: %v", err)
		}
		if result["id"] != float64(42) {
			t.Errorf("got id=%v, want 42", result["id"])
		}
	})

	t.Run("Status 500 → error con body parcial", func(t *testing.T) {
		srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			http.Error(w, "internal server error details", http.StatusInternalServerError)
		}))
		defer srv.Close()

		_, err := HttpPostJSON(srv.URL, map[string]any{}, nil, 5*time.Second)
		if err == nil {
			t.Fatal("expected error, got nil")
		}
		if !strings.Contains(err.Error(), "500") {
			t.Errorf("error should contain 500, got: %v", err)
		}
	})
}
@@ -0,0 +1,48 @@
---
name: build_tree_from_headers
kind: function
lang: py
domain: core
version: "1.0.0"
purity: pure
signature: "def build_tree_from_headers(node_list: list[dict]) -> list[dict]"
description: "Builds a nested hierarchical tree from a flat list of markdown headers with levels (h1 > h2 > h3)."
tags: [tree, markdown, headers, hierarchy]
uses_functions: []
uses_types: []
returns: []
returns_optional: false
error_type: ""
imports: []
tested: false
tests: []
test_file_path: ""
file_path: "python/functions/core/core.py"
source_repo: "https://github.com/VectifyAI/PageIndex"
source_license: "MIT"
source_file: "pageindex/page_index_md.py"
---
## Example
```python
headers = [
    {"title": "Intro", "level": 1, "line_num": 1},
    {"title": "Background", "level": 2, "line_num": 5},
    {"title": "Details", "level": 3, "line_num": 10},
    {"title": "Methods", "level": 1, "line_num": 20},
]
tree = build_tree_from_headers(headers)
# [
#   {"title": "Intro", "node_id": "0001", "nodes": [
#     {"title": "Background", "node_id": "0002", "nodes": [
#       {"title": "Details", "node_id": "0003"}
#     ]}
#   ]},
#   {"title": "Methods", "node_id": "0004"}
# ]
```
## Notes
Pure function. Assigns sequential node_id values (0001, 0002, ...) automatically. Uses a stack to resolve the hierarchy by header level.
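A minimal sketch of the stack-based nesting mentioned in the notes, assuming headers arrive in document order and each one carries `title` and `level`. `sketch_build_tree` is a hypothetical name used for illustration; it is not the registry function nor the PageIndex implementation, and it keeps only the fields shown in the example output.

```python
def sketch_build_tree(node_list: list[dict]) -> list[dict]:
    """Nest a flat, ordered list of headers by level using a stack."""
    roots: list[dict] = []
    # Stack of (level, node) pairs for the headers that are still "open".
    stack: list[tuple[int, dict]] = []
    for counter, header in enumerate(node_list, start=1):
        node = {"title": header["title"], "node_id": f"{counter:04d}"}
        level = header["level"]
        # Close every open header at the same depth or deeper.
        while stack and stack[-1][0] >= level:
            stack.pop()
        if stack:
            # The top of the stack is the nearest shallower header: the parent.
            stack[-1][1].setdefault("nodes", []).append(node)
        else:
            roots.append(node)
        stack.append((level, node))
    return roots
```

Each header is pushed and popped at most once, so the pass is linear in the number of headers.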
@@ -0,0 +1,57 @@
---
name: cache_decorator
kind: function
lang: py
domain: core
version: "1.0.0"
purity: impure
signature: "def cache_decorator(store: Any, ttl: float = 0, key_fn: Callable | None = None)"
description: "Decorator that caches a function's result in any compatible persistent store (CacheStore or FileCache). The key is generated by hashing (func.__name__, args, sorted(kwargs)) with SHA-256. Supports sync and async functions."
tags: [cache, decorator, memoize, persistence, async, functional]
uses_functions: ["cache_to_sqlite_py_infra", "cache_to_file_py_infra"]
uses_types: []
returns: []
returns_optional: false
error_type: "error_go_core"
imports: ["asyncio", "functools", "hashlib", "json", "typing.Any", "typing.Callable"]
tested: true
tests:
- "Funcion llamada una vez, segunda vez desde cache"
- "TTL expirado → llama de nuevo"
- "key_fn custom"
- "Argumentos distintos → keys distintas"
- "Funciona con async"
test_file_path: "python/functions/core/cache_decorator_test.py"
file_path: "python/functions/core/cache_decorator.py"
---
## Example
```python
from infra.cache_to_sqlite import cache_to_sqlite
from core.cache_decorator import cache_decorator

store = cache_to_sqlite("cache.db", namespace="llm")

@cache_decorator(store, ttl=3600)
def call_llm(prompt: str) -> str:
    # expensive LLM call
    return client.complete(prompt)

result = call_llm("explain X")  # first call: hits the LLM
result = call_llm("explain X")  # second call: served from cache

# With a custom key_fn
@cache_decorator(store, ttl=600, key_fn=lambda fn, args, kw: args[0])
def fetch_user(user_id: str) -> dict:
    return api.get_user(user_id)

# With async
@cache_decorator(store, ttl=3600)
async def async_call(prompt: str) -> str:
    return await async_client.complete(prompt)
```
## Notes
The store must implement `get(key: str) -> Any | None` and `set(key: str, value: Any, ttl: float) -> None`. Async functions are detected automatically with `asyncio.iscoroutinefunction`. The default key uses `json.dumps(..., default=str)` so that arguments that are not JSON-serializable fall back to their string form. If `store.get()` returns `None`, the function is always executed (there is no distinction between "not in cache" and "None was stored"); for values that may be None, use `get_or_set` directly.
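To exercise the decorator without SQLite or files, any object that honors the `get`/`set` contract above should be enough. The following is a hedged sketch of such a store. `DictStore` is a hypothetical name, not part of the registry, and treating `ttl <= 0` as "never expires" is an assumption taken from the documented `ttl=0` default.

```python
from __future__ import annotations

import time
from typing import Any


class DictStore:
    """Hypothetical in-memory store exposing the get/set contract."""

    def __init__(self) -> None:
        # key -> (value, absolute expiry on the monotonic clock, or None)
        self._data: dict[str, tuple[Any, float | None]] = {}

    def get(self, key: str) -> Any | None:
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if expires_at is not None and time.monotonic() >= expires_at:
            del self._data[key]  # drop the expired entry
            return None
        return value

    def set(self, key: str, value: Any, ttl: float) -> None:
        # Assumption: ttl <= 0 means the entry never expires.
        expires_at = time.monotonic() + ttl if ttl > 0 else None
        self._data[key] = (value, expires_at)
```

Like the real stores, this sketch returns `None` for a miss, so it shares the limitation that a cached `None` is indistinguishable from an absent key.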
@@ -0,0 +1,67 @@
"""Decorator that caches a function's result in a persistent store."""

import asyncio
import functools
import hashlib
import json
from typing import Any, Callable


def _default_key(func: Callable, args: tuple, kwargs: dict) -> str:
    """Build a cache key from the function name and its arguments."""
    payload = json.dumps((func.__name__, args, sorted(kwargs.items())), default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def cache_decorator(store: Any, ttl: float = 0, key_fn: Callable | None = None):
    """Return a decorator that caches results in a persistent store.

    Args:
        store: Any object with get(key) and set(key, value, ttl) methods.
            Compatible with CacheStore (cache_to_sqlite) and FileCache (cache_to_file).
        ttl: Time to live in seconds. 0 = no expiration.
        key_fn: Optional function that builds the key. Receives (func, args, kwargs).
            If None, SHA-256 of (func.__name__, args, sorted(kwargs)) is used.

    Returns:
        Decorator applicable to sync or async functions.

    Example::

        store = cache_to_sqlite("cache.db")

        @cache_decorator(store, ttl=3600)
        def call_llm(prompt: str) -> str:
            ...  # expensive call

        result = call_llm("explain X")  # first call: runs the function
        result = call_llm("explain X")  # second call: served from cache
    """
    def decorator(func: Callable) -> Callable:
        if asyncio.iscoroutinefunction(func):
            @functools.wraps(func)
            async def async_wrapper(*args, **kwargs):
                make_key = key_fn or _default_key
                key = make_key(func, args, kwargs)
                cached = store.get(key)
                if cached is not None:
                    return cached
                result = await func(*args, **kwargs)
                store.set(key, result, ttl)
                return result
            return async_wrapper
        else:
            @functools.wraps(func)
            def sync_wrapper(*args, **kwargs):
                make_key = key_fn or _default_key
                key = make_key(func, args, kwargs)
                cached = store.get(key)
                if cached is not None:
                    return cached
                result = func(*args, **kwargs)
                store.set(key, result, ttl)
                return result
            return sync_wrapper
    return decorator
@@ -0,0 +1,96 @@
"""Tests for cache_decorator."""

import asyncio
import sys
import os
import tempfile
import time

import pytest

sys.path.insert(0, os.path.dirname(__file__))
sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "infra"))

from cache_decorator import cache_decorator
from cache_to_sqlite import cache_to_sqlite


@pytest.fixture
def store(tmp_path):
    return cache_to_sqlite(str(tmp_path / "test.db"))


def test_funcion_llamada_una_vez_segunda_vez_desde_cache(store):
    calls = []

    @cache_decorator(store, ttl=60)
    def compute(x: int) -> int:
        calls.append(x)
        return x * 10

    assert compute(5) == 50
    assert compute(5) == 50
    assert len(calls) == 1


def test_ttl_expirado_llama_de_nuevo(store):
    calls = []

    @cache_decorator(store, ttl=0.05)
    def work(n: int) -> int:
        calls.append(n)
        return n + 1

    work(3)
    time.sleep(0.1)
    work(3)
    assert len(calls) == 2


def test_key_fn_custom(store):
    calls = []

    def my_key_fn(func, args, kwargs):
        return f"custom:{args[0]}"

    @cache_decorator(store, ttl=60, key_fn=my_key_fn)
    def fn(x: int) -> str:
        calls.append(x)
        return f"result_{x}"

    fn(7)
    fn(7)
    assert len(calls) == 1


def test_argumentos_distintos_keys_distintas(store):
    calls = []

    @cache_decorator(store, ttl=60)
    def fn(x: int) -> int:
        calls.append(x)
        return x * 2

    fn(1)
    fn(2)
    fn(1)
    assert len(calls) == 2


def test_funciona_con_async(store):
    calls = []

    @cache_decorator(store, ttl=60)
    async def async_fn(x: int) -> int:
        calls.append(x)
        return x + 100

    async def run():
        r1 = await async_fn(5)
        r2 = await async_fn(5)
        return r1, r2

    r1, r2 = asyncio.run(run())
    assert r1 == 105
    assert r2 == 105
    assert len(calls) == 1
@@ -0,0 +1,48 @@
---
name: calculate_media_strategy
kind: function
lang: py
domain: core
version: "1.0.0"
purity: pure
signature: "def calculate_media_strategy(image_count: int, line_count: int) -> str"
description: "Determines the optimal media processing strategy for a document based on the ratio of images to text. Returns full_page_vlm, extract, or text_only."
tags: [media, strategy, document, vision, vlm, images, classification]
uses_functions: []
uses_types: []
returns: []
returns_optional: false
error_type: ""
imports: []
tested: true
tests:
- "0 imagenes text_only"
- "2 imagenes 100 lineas extract"
- "10 imagenes 20 lineas full_page_vlm"
- "5 imagenes 100 lineas full_page_vlm"
- "0 lineas division por cero evitada"
test_file_path: "python/functions/core/calculate_media_strategy_test.py"
file_path: "python/functions/core/calculate_media_strategy.py"
---
## Example
```python
calculate_media_strategy(0, 50)    # "text_only"
calculate_media_strategy(2, 100)   # "extract" (ratio 0.02, few images)
calculate_media_strategy(10, 20)   # "full_page_vlm" (ratio 0.5 > 0.3)
calculate_media_strategy(5, 100)   # "full_page_vlm" (>= 5 images)
calculate_media_strategy(3, 0)     # "text_only" (no text, no context)
```
## Notes
Three-tier classification logic:
1. `full_page_vlm`: image-dominated document, meaning `line_count > 0` and either an image/line ratio > 0.3 or at least 5 images. A vision-language model is run over the full page.
2. `extract`: few images in a document that has text; extract and process the images individually.
3. `text_only`: no images or no text lines; process only the text.
The `line_count > 0` guard avoids division by zero and treats documents without lines as `text_only` regardless of the image count, since without text there is not enough context to classify as `extract`.
Pure function with no external dependencies. Conceptually reimplemented from the media classification logic of OpenViking (AGPL-3.0).
@@ -0,0 +1,24 @@
"""Determine the optimal media processing strategy for a document."""


def calculate_media_strategy(image_count: int, line_count: int) -> str:
    """Determine the optimal media processing strategy.

    Classifies a document into one of three strategies based on the
    ratio of images to text:
    - full_page_vlm: image-dominated document, use a vision-language model
    - extract: few images, extract and process them individually
    - text_only: no images or no text lines, text only

    Args:
        image_count: number of images in the document.
        line_count: number of text lines in the document.

    Returns:
        "full_page_vlm", "extract" or "text_only".
    """
    if line_count > 0 and (image_count / line_count > 0.3 or image_count >= 5):
        return "full_page_vlm"
    if line_count > 0 and image_count > 0:
        return "extract"
    return "text_only"
@@ -0,0 +1,23 @@
"""Tests for calculate_media_strategy."""

from calculate_media_strategy import calculate_media_strategy


def test_0_imagenes_text_only():
    assert calculate_media_strategy(0, 50) == "text_only"


def test_2_imagenes_100_lineas_extract():
    assert calculate_media_strategy(2, 100) == "extract"


def test_10_imagenes_20_lineas_full_page_vlm():
    assert calculate_media_strategy(10, 20) == "full_page_vlm"


def test_5_imagenes_100_lineas_full_page_vlm():
    assert calculate_media_strategy(5, 100) == "full_page_vlm"


def test_0_lineas_division_por_cero_evitada():
    assert calculate_media_strategy(3, 0) == "text_only"
@@ -0,0 +1,40 @@
---
name: calculate_page_offset
kind: function
lang: py
domain: core
version: "1.0.0"
purity: pure
signature: "def calculate_page_offset(pairs: list[dict]) -> int"
description: "Computes the offset between logical and physical page numbers from reference pairs (mode of the differences)."
tags: [pagination, offset, calculation]
uses_functions: []
uses_types: []
returns: []
returns_optional: false
error_type: ""
imports: []
tested: false
tests: []
test_file_path: ""
file_path: "python/functions/core/core.py"
source_repo: "https://github.com/VectifyAI/PageIndex"
source_license: "MIT"
source_file: "pageindex/page_index.py"
---
## Example
```python
pairs = [
    {"page": 1, "physical_index": 5},
    {"page": 2, "physical_index": 6},
    {"page": 10, "physical_index": 14},
]
calculate_page_offset(pairs)
# 4 (the mode of the differences physical_index - page)
```
## Notes
Pure function. Each pair needs the fields 'page' (logical number) and 'physical_index' (physical index). Returns the most frequent difference (the mode). Returns 0 if there are no valid pairs.
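The "mode of the differences" idea can be sketched with `collections.Counter`. This is an illustration under assumptions, not the registry or PageIndex code: `sketch_page_offset` is a hypothetical name, and "valid pair" is assumed here to mean that both fields are present and are integers.

```python
from collections import Counter


def sketch_page_offset(pairs: list[dict]) -> int:
    """Return the most frequent physical_index - page difference, or 0."""
    diffs = [
        pair["physical_index"] - pair["page"]
        for pair in pairs
        # Assumption: a pair is valid only when both fields are integers.
        if isinstance(pair.get("page"), int)
        and isinstance(pair.get("physical_index"), int)
    ]
    if not diffs:
        return 0
    # most_common(1) yields [(difference, count)] for the top difference.
    return Counter(diffs).most_common(1)[0][0]
```

Taking the mode rather than the mean keeps a single mis-detected page number from shifting the result.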
@@ -0,0 +1,55 @@
---
name: call_batch_with_retry
kind: function
lang: py
domain: core
version: "1.0.0"
purity: impure
signature: "def call_batch_with_retry(items: list[T], process_func: Callable[[T], R], max_retries: int = 3, initial_delay: float = 1.0, max_delay: float = 30.0, backoff_factor: float = 2.0, exceptions: tuple[type[Exception], ...] = (Exception,), continue_on_failure: bool = True) -> tuple[list[R], list[dict]]"
description: "Processes a list of items with individual per-item retry and exponential backoff. Individual failures do not block the rest of the batch. Returns (results, failures), where failures holds the index, item, and error of each item that exhausted its retries."
tags: [retry, batch, backoff, resilience, error-handling, core]
uses_functions: []
uses_types: []
returns: []
returns_optional: false
error_type: "error_go_core"
imports: ["time", "random", "typing.Callable", "typing.TypeVar"]
tested: true
tests:
- "todos los items exito"
- "item falla permanentemente, continue True"
- "item falla, abort continue False"
- "item falla luego exito retry funciona"
- "failures contiene index correcto"
test_file_path: "python/functions/core/call_batch_with_retry_test.py"
file_path: "python/functions/core/call_batch_with_retry.py"
---
## Example
```python
results, failures = call_batch_with_retry(
    items=["url1", "url2", "url3"],
    process_func=fetch_url,
    max_retries=3,
    initial_delay=1.0,
    max_delay=30.0,
    backoff_factor=2.0,
    exceptions=(ConnectionError, TimeoutError),
    continue_on_failure=True,
)
for r in results:
    print("OK:", r)
for f in failures:
    print(f"FAIL index={f['index']} item={f['item']} error={f['error']}")
```
## Notes
Difference from `retry_sync_py_core`: that function retries a single call. This one handles whole lists where each item is retried independently; individual failures are recorded in `failures` without interrupting the batch (when `continue_on_failure=True`). `results` holds only the successful return values, in order, so its positions no longer line up with `items` once an item fails; use the `index` field in `failures` to map back.
The backoff uses the formula `min(initial_delay * backoff_factor^attempt, max_delay)` plus jitter of up to 10% of the computed delay to avoid a thundering herd. The first attempt is always immediate; a delay is applied before each retry, starting with attempt=0 (that is, `initial_delay`) before the first retry.
When `continue_on_failure=False`, the first item that exhausts its retries re-raises its last exception immediately, aborting the batch.
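To make the backoff formula concrete, here is a small sketch that lists the base delay before each retry, without the random jitter. `backoff_delays` is a hypothetical helper written for this note; its defaults simply mirror the documented defaults.

```python
def backoff_delays(
    max_retries: int = 3,
    initial_delay: float = 1.0,
    max_delay: float = 30.0,
    backoff_factor: float = 2.0,
) -> list[float]:
    """Base delay (seconds) slept before each retry, jitter excluded."""
    return [
        min(initial_delay * backoff_factor ** attempt, max_delay)
        for attempt in range(max_retries)
    ]


print(backoff_delays())               # [1.0, 2.0, 4.0]
print(backoff_delays(max_retries=6))  # [1.0, 2.0, 4.0, 8.0, 16.0, 30.0]
```

With the defaults, an item that never succeeds therefore costs roughly 7 seconds of sleep plus up to 10% jitter before it is recorded as a failure.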
@@ -0,0 +1,81 @@
"""Process a batch of items with per-item exponential backoff retry."""

import time
import random
from typing import Callable, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def call_batch_with_retry(
    items: list[T],
    process_func: Callable[[T], R],
    max_retries: int = 3,
    initial_delay: float = 1.0,
    max_delay: float = 30.0,
    backoff_factor: float = 2.0,
    exceptions: tuple[type[Exception], ...] = (Exception,),
    continue_on_failure: bool = True,
) -> tuple[list[R], list[dict]]:
    """Process a list of items with independent per-item retry and exponential backoff.

    Each item is processed by process_func. If it raises one of the specified
    exceptions, it is retried up to max_retries times with exponential backoff.
    If all retries are exhausted, the item is recorded as a failure.

    Args:
        items: List of items to process.
        process_func: Callable that takes a single item and returns a result.
        max_retries: Maximum number of retry attempts per item after first failure.
        initial_delay: Initial delay in seconds before the first retry.
        max_delay: Maximum delay cap in seconds between retries.
        backoff_factor: Multiplier applied to delay on each successive retry.
        exceptions: Tuple of exception types to catch and retry on.
        continue_on_failure: If True, continue processing remaining items when an
            item exhausts all retries. If False, re-raise the exception immediately.

    Returns:
        A tuple (results, failures) where:
        - results is a list of successful return values from process_func.
        - failures is a list of dicts with keys "index", "item", and "error"
          for each item that failed after all retries.

    Raises:
        Exception: The last exception for a failed item when continue_on_failure
            is False.
    """
    results = []
    failures = []
    for index, item in enumerate(items):
        last_exc = None
        succeeded = False
        for attempt in range(max_retries + 1):
            try:
                result = process_func(item)
                results.append(result)
                succeeded = True
                break
            except exceptions as exc:
                last_exc = exc
                if attempt < max_retries:
                    delay = min(
                        initial_delay * (backoff_factor ** attempt),
                        max_delay,
                    )
                    # Add small jitter (up to 10% of delay) to avoid thundering herd
                    delay += random.uniform(0, delay * 0.1)
                    time.sleep(delay)
        if not succeeded:
            if not continue_on_failure:
                raise last_exc
            failures.append({
                "index": index,
                "item": item,
                "error": str(last_exc),
            })
    return results, failures
@@ -0,0 +1,102 @@
"""Tests for call_batch_with_retry."""

import sys
import os

sys.path.insert(0, os.path.dirname(__file__))

from call_batch_with_retry import call_batch_with_retry


def test_todos_los_items_exito():
    results, failures = call_batch_with_retry(
        items=[1, 2, 3],
        process_func=lambda x: x * 2,
        max_retries=3,
    )
    assert results == [2, 4, 6]
    assert failures == []


def test_item_falla_permanentemente_continue_true():
    def process(x):
        if x == 2:
            raise ValueError("fallo permanente")
        return x * 10

    results, failures = call_batch_with_retry(
        items=[1, 2, 3],
        process_func=process,
        max_retries=2,
        initial_delay=0.0,
        continue_on_failure=True,
    )
    assert results == [10, 30]
    assert len(failures) == 1
    assert failures[0]["index"] == 1
    assert failures[0]["item"] == 2
    assert "fallo permanente" in failures[0]["error"]


def test_item_falla_abort_continue_false():
    call_count = {"n": 0}

    def process(x):
        call_count["n"] += 1
        if x == 2:
            raise RuntimeError("error fatal")
        return x

    try:
        call_batch_with_retry(
            items=[1, 2, 3],
            process_func=process,
            max_retries=1,
            initial_delay=0.0,
            continue_on_failure=False,
        )
        assert False, "Should have raised an exception"
    except RuntimeError as e:
        assert "error fatal" in str(e)
    # Item 3 was never processed: 1 call for item 1 + 2 attempts for item 2.
    assert call_count["n"] == 3


def test_item_falla_luego_exito_retry_funciona():
    attempt_counts = {}

    def process(x):
        attempt_counts[x] = attempt_counts.get(x, 0) + 1
        # Item 5 fails the first 2 times and succeeds on the third.
        if x == 5 and attempt_counts[x] < 3:
            raise ValueError("fallo temporal")
        return x * 2

    results, failures = call_batch_with_retry(
        items=[1, 5, 9],
        process_func=process,
        max_retries=3,
        initial_delay=0.0,
        continue_on_failure=True,
    )
    assert results == [2, 10, 18]
    assert failures == []
    assert attempt_counts[5] == 3


def test_failures_contiene_index_correcto():
    def process(x):
        if x in (0, 2, 4):
            raise ValueError(f"fallo en {x}")
        return x

    results, failures = call_batch_with_retry(
        items=[0, 1, 2, 3, 4],
        process_func=process,
        max_retries=0,
        initial_delay=0.0,
        continue_on_failure=True,
    )
    assert results == [1, 3]
    assert [f["index"] for f in failures] == [0, 2, 4]
    assert [f["item"] for f in failures] == [0, 2, 4]
@@ -0,0 +1,66 @@
---
name: circuit_breaker
kind: function
lang: py
domain: core
version: "1.0.0"
purity: impure
signature: "class CircuitBreaker:\n def __init__(self, failure_threshold: int = 5, reset_timeout: float = 300.0): ...\n def check(self) -> None: ...\n def record_success(self) -> None: ...\n def record_failure(self, error: Exception) -> None: ...\n @property\n def retry_after(self) -> float: ..."
description: "Thread-safe circuit breaker pattern for protecting calls to external APIs. Three states: CLOSED (normal), OPEN (blocking), HALF_OPEN (letting a probe request through). Integrates with classify_api_error to tell permanent errors from transient ones."
tags: [circuit-breaker, resilience, api, retry, error-handling, thread-safe]
uses_functions: [classify_api_error_py_core]
uses_types: []
returns: []
returns_optional: false
error_type: "error_go_core"
imports: [threading, time, enum]
tested: true
tests:
- "Transicion CLOSED → OPEN despues de N fallos"
- "Transicion OPEN → HALF_OPEN despues de timeout"
- "Transicion HALF_OPEN → CLOSED en exito"
- "Transicion HALF_OPEN → OPEN en fallo"
- "Error permanente abre inmediatamente"
- "Thread safety (concurrencia)"
- "retry_after retorna 0 cuando no esta OPEN"
test_file_path: "python/functions/core/circuit_breaker_test.py"
file_path: "python/functions/core/circuit_breaker.py"
---
## Example
```python
import requests

from circuit_breaker import CircuitBreaker, CircuitBreakerOpen

cb = CircuitBreaker(failure_threshold=3, reset_timeout=60.0)

def call_api() -> dict:
    cb.check()  # raises CircuitBreakerOpen if circuit is open
    try:
        result = requests.get("https://api.example.com/data").json()
        cb.record_success()
        return result
    except Exception as exc:
        cb.record_failure(exc)
        raise

# After 3 consecutive failures the circuit opens:
# CircuitBreakerOpen: Circuit breaker is open. Retry after 30.0s
try:
    cb.check()
except CircuitBreakerOpen as e:
    print(f"Circuit open, retry in {e.retry_after}s")

# retry_after property (capped at 30s):
print(cb.retry_after)  # e.g. 28.4
```
## Notes
- **CLOSED**: Requests pass through normally. After `failure_threshold` consecutive failures it transitions to OPEN.
- **OPEN**: Requests are blocked with `CircuitBreakerOpen`. Once `reset_timeout` seconds have passed, the next `check()` call transitions to HALF_OPEN.
- **HALF_OPEN**: Lets a probe request through. Success → CLOSED. Failure → OPEN. Note that `check()` as implemented does not cap how many callers are admitted while the state is HALF_OPEN.
- Permanent errors (400, 401, 403) open the circuit immediately without waiting for the threshold.
- `retry_after` returns 0.0 when the state is not OPEN; in OPEN it returns the remaining time, capped at 30s.
- Thread-safe via a `threading.Lock` that protects all internal state.
- The dependency on `classify_api_error` is optional: if it cannot be imported, a text-matching fallback is used.
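The state rules listed above can be summarized as a lookup table. This is only an illustrative model under assumptions: `State`, `TRANSITIONS`, `next_state`, and the event names are hypothetical and do not appear in `circuit_breaker.py`, which tracks counters and timestamps instead of named events.

```python
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


# (current state, event) -> next state. Pairs that are not listed leave the
# state unchanged, e.g. a failure in CLOSED that is still below the threshold.
TRANSITIONS = {
    (State.CLOSED, "threshold_reached"): State.OPEN,
    (State.CLOSED, "permanent_error"): State.OPEN,
    (State.OPEN, "timeout_elapsed"): State.HALF_OPEN,
    (State.HALF_OPEN, "success"): State.CLOSED,
    (State.HALF_OPEN, "failure"): State.OPEN,
    (State.HALF_OPEN, "permanent_error"): State.OPEN,
}


def next_state(state: State, event: str) -> State:
    """Look up the next state, defaulting to 'stay where you are'."""
    return TRANSITIONS.get((state, event), state)
```

Writing the rules as data makes it easy to see that the only route back to CLOSED in this model runs through HALF_OPEN and a successful probe.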
@@ -0,0 +1,141 @@
"""Circuit breaker pattern for protecting external API calls."""

import threading
import time
from enum import Enum


class CircuitBreakerState(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class CircuitBreakerOpen(Exception):
    """Raised when the circuit breaker is open and blocking requests."""

    def __init__(self, retry_after: float) -> None:
        self.retry_after = retry_after
        super().__init__(f"Circuit breaker is open. Retry after {retry_after:.1f}s")


def _is_permanent_error(error: Exception) -> bool:
    """Return True if the error is permanent (should open circuit immediately)."""
    try:
        from classify_api_error import classify_api_error
        return classify_api_error(error) == "permanent"
    except ImportError:
        # Fallback: inspect error text directly
        text = str(error)
        if error.__cause__ is not None:
            text += " " + str(error.__cause__)
        permanent_patterns = ["400", "401", "403", "Forbidden", "Unauthorized"]
        return any(p in text for p in permanent_patterns)


class CircuitBreaker:
    """Thread-safe circuit breaker for protecting external API calls.

    Implements three states:
    - CLOSED: requests pass through normally.
    - OPEN: requests are blocked with CircuitBreakerOpen.
    - HALF_OPEN: probe requests are allowed through.

    Args:
        failure_threshold: Consecutive failures before opening. Default 5.
        reset_timeout: Seconds to wait in OPEN before trying HALF_OPEN. Default 300.0.
    """

    def __init__(
        self,
        failure_threshold: int = 5,
        reset_timeout: float = 300.0,
    ) -> None:
        self._failure_threshold = failure_threshold
        self._reset_timeout = reset_timeout
        self._lock = threading.Lock()
        self._state = CircuitBreakerState.CLOSED
        self._failure_count = 0
        self._opened_at: float | None = None

    # ------------------------------------------------------------------
    # Public interface
    # ------------------------------------------------------------------

    def check(self) -> None:
        """Check whether a request is allowed through.

        Raises:
            CircuitBreakerOpen: If the circuit is open and reset_timeout
                has not elapsed yet.
        """
        with self._lock:
            if self._state is CircuitBreakerState.CLOSED:
                return
            if self._state is CircuitBreakerState.OPEN:
                elapsed = time.monotonic() - self._opened_at  # type: ignore[operator]
                if elapsed >= self._reset_timeout:
                    self._state = CircuitBreakerState.HALF_OPEN
                    return
                remaining = self._reset_timeout - elapsed
                raise CircuitBreakerOpen(min(remaining, 30.0))
            # HALF_OPEN: let the probe through. Nothing here limits the number
            # of callers admitted until a success or failure is recorded.
            if self._state is CircuitBreakerState.HALF_OPEN:
                return

    def record_success(self) -> None:
        """Record a successful request. Resets the breaker to CLOSED."""
        with self._lock:
            self._state = CircuitBreakerState.CLOSED
            self._failure_count = 0
            self._opened_at = None

    def record_failure(self, error: Exception) -> None:
        """Record a failed request.

        If the error is permanent (e.g. 401/403), opens immediately.
        Otherwise increments the failure counter and opens once it
        reaches failure_threshold.

        Args:
            error: The exception that was raised.
        """
        with self._lock:
            if _is_permanent_error(error):
                self._trip()
                return
            if self._state is CircuitBreakerState.HALF_OPEN:
                self._trip()
                return
            self._failure_count += 1
            if self._failure_count >= self._failure_threshold:
                self._trip()

    @property
    def retry_after(self) -> float:
        """Seconds until the circuit transitions to HALF_OPEN.

        Returns 0.0 when not in OPEN state, capped at 30 seconds.
        """
        with self._lock:
            if self._state is not CircuitBreakerState.OPEN:
                return 0.0
            elapsed = time.monotonic() - self._opened_at  # type: ignore[operator]
            remaining = self._reset_timeout - elapsed
            return min(max(remaining, 0.0), 30.0)

    # ------------------------------------------------------------------
    # Internal helpers
    # ------------------------------------------------------------------

    def _trip(self) -> None:
        """Open the circuit (must be called with _lock held)."""
        self._state = CircuitBreakerState.OPEN
        self._failure_count = 0
        self._opened_at = time.monotonic()
@@ -0,0 +1,156 @@
"""Tests for circuit_breaker."""

import sys
import os
import threading
import time

sys.path.insert(0, os.path.dirname(__file__))

from circuit_breaker import CircuitBreaker, CircuitBreakerOpen, CircuitBreakerState


# ---------------------------------------------------------------------------
# Helpers
# ---------------------------------------------------------------------------

def _transient_error() -> Exception:
    return Exception("HTTP 503 Service Unavailable")


def _permanent_error() -> Exception:
    return Exception("HTTP 401 Unauthorized")


# ---------------------------------------------------------------------------
# Tests
# ---------------------------------------------------------------------------

def test_closed_to_open_after_n_failures() -> None:
    """Transicion CLOSED → OPEN despues de N fallos"""
    cb = CircuitBreaker(failure_threshold=3, reset_timeout=60.0)
    cb.check()  # Should not raise
    cb.record_failure(_transient_error())
    cb.record_failure(_transient_error())
    assert cb._state is CircuitBreakerState.CLOSED  # Still closed after 2
    cb.record_failure(_transient_error())
    assert cb._state is CircuitBreakerState.OPEN
    try:
        cb.check()
        assert False, "Should have raised CircuitBreakerOpen"
    except CircuitBreakerOpen:
        pass
    print("PASS: Transicion CLOSED → OPEN despues de N fallos")


def test_open_to_half_open_after_timeout() -> None:
    """Transicion OPEN → HALF_OPEN despues de timeout"""
    cb = CircuitBreaker(failure_threshold=1, reset_timeout=0.05)
    cb.record_failure(_transient_error())
    assert cb._state is CircuitBreakerState.OPEN
    time.sleep(0.1)
    cb.check()  # Should not raise; transitions to HALF_OPEN
    assert cb._state is CircuitBreakerState.HALF_OPEN
    print("PASS: Transicion OPEN → HALF_OPEN despues de timeout")


def test_half_open_to_closed_on_success() -> None:
    """Transicion HALF_OPEN → CLOSED en exito"""
    cb = CircuitBreaker(failure_threshold=1, reset_timeout=0.05)
    cb.record_failure(_transient_error())
    time.sleep(0.1)
    cb.check()  # enters HALF_OPEN
    assert cb._state is CircuitBreakerState.HALF_OPEN
    cb.record_success()
    assert cb._state is CircuitBreakerState.CLOSED
    cb.check()  # Should not raise
    print("PASS: Transicion HALF_OPEN → CLOSED en exito")


def test_half_open_to_open_on_failure() -> None:
    """Transicion HALF_OPEN → OPEN en fallo"""
    cb = CircuitBreaker(failure_threshold=1, reset_timeout=0.05)
    cb.record_failure(_transient_error())
    time.sleep(0.1)
    cb.check()  # enters HALF_OPEN
    assert cb._state is CircuitBreakerState.HALF_OPEN
    cb.record_failure(_transient_error())
    assert cb._state is CircuitBreakerState.OPEN
    print("PASS: Transicion HALF_OPEN → OPEN en fallo")


def test_permanent_error_opens_immediately() -> None:
    """Error permanente abre inmediatamente"""
    cb = CircuitBreaker(failure_threshold=10, reset_timeout=60.0)
    assert cb._state is CircuitBreakerState.CLOSED
    cb.record_failure(_permanent_error())
    assert cb._state is CircuitBreakerState.OPEN
    print("PASS: Error permanente abre inmediatamente")


def test_thread_safety() -> None:
    """Thread safety (concurrencia)"""
    cb = CircuitBreaker(failure_threshold=5, reset_timeout=60.0)
    errors: list[Exception] = []

    def worker() -> None:
        try:
            for _ in range(10):
                cb.check()
                cb.record_failure(_transient_error())
        except CircuitBreakerOpen:
            pass
        except Exception as exc:
            errors.append(exc)

    threads = [threading.Thread(target=worker) for _ in range(20)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    assert not errors, f"Thread errors: {errors}"
    # No success is ever recorded and reset_timeout (60s) cannot elapse during
    # the test, so once the threshold is hit the circuit must end up OPEN.
    assert cb._state is CircuitBreakerState.OPEN
    print("PASS: Thread safety (concurrencia)")


def test_retry_after_returns_zero_when_not_open() -> None:
    """retry_after retorna 0 cuando no esta OPEN"""
    cb = CircuitBreaker(failure_threshold=5, reset_timeout=60.0)
    assert cb.retry_after == 0.0
    cb.record_failure(_transient_error())
    # Still CLOSED (threshold not reached)
    assert cb.retry_after == 0.0
    print("PASS: retry_after retorna 0 cuando no esta OPEN")


if __name__ == "__main__":
    test_closed_to_open_after_n_failures()
    test_open_to_half_open_after_timeout()
    test_half_open_to_closed_on_success()
    test_half_open_to_open_on_failure()
    test_permanent_error_opens_immediately()
    test_thread_safety()
    test_retry_after_returns_zero_when_not_open()
    print("\nAll tests passed.")
@@ -0,0 +1,41 @@
---
name: classify_api_error
kind: function
lang: py
domain: core
version: "1.0.0"
purity: pure
signature: "def classify_api_error(error: Exception) -> str"
description: "Classifies an API error as permanent (do not retry), transient (retry), or unknown. Permanent takes priority over transient."
tags: [retry, error, classification, api, backoff]
uses_functions: []
uses_types: []
returns: []
returns_optional: false
error_type: ""
imports: []
tested: true
tests: ["error 429 es transitorio", "error 401 es permanente", "error timeout es transitorio", "error desconocido retorna unknown", "error con __cause__ transitorio"]
test_file_path: "python/functions/core/classify_api_error_test.py"
file_path: "python/functions/core/classify_api_error.py"
---
## Example
```python
err = Exception("HTTP 429 TooManyRequests")
classify_api_error(err)  # "transient"

err = Exception("HTTP 401 Unauthorized")
classify_api_error(err)  # "permanent"

err = Exception("Connection timeout")
classify_api_error(err)  # "transient"

err = Exception("Something unexpected happened")
classify_api_error(err)  # "unknown"
```
## Notes
Pure function: it only inspects the text of the error and of its direct cause (`__cause__`). It has no I/O and no external dependencies. Matching is by plain substring, so a message that merely contains one of the patterns (for example `400` inside a longer number) is classified accordingly. The permanent > transient priority avoids retrying 400/401/403 errors that will never succeed.
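Since the patterns are looked up with a plain `in` check, a number such as `4000` also contains `400`. If that ever matters, one possible tightening is to require that a status code is not surrounded by other digits. The sketch below is an assumption-laden illustration: `has_status_code` is a hypothetical helper and is not used by `classify_api_error`.

```python
import re


def has_status_code(text: str, codes: list[str]) -> bool:
    """Return True if any code appears in text as a standalone number."""
    # (?<!\d) and (?!\d) reject matches embedded in a longer run of digits.
    alternatives = "|".join(re.escape(code) for code in codes)
    pattern = rf"(?<!\d)(?:{alternatives})(?!\d)"
    return re.search(pattern, text) is not None


permanent_codes = ["400", "401", "403"]
print(has_status_code("HTTP 401 Unauthorized", permanent_codes))  # True
print(has_status_code("request took 4000 ms", permanent_codes))   # False
```

The trade-off is that text patterns such as `Forbidden` or `timeout` would still need the substring check, so this would complement the existing lists rather than replace them.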
@@ -0,0 +1,38 @@
"""Classify an API exception as permanent, transient, or unknown."""
def classify_api_error(error: Exception) -> str:
"""Classify an API error as permanent, transient, or unknown.
Permanent errors should not be retried (e.g. auth failures, bad requests).
Transient errors are safe to retry (e.g. rate limits, timeouts, server errors).
Permanent classification takes priority over transient.
Args:
error: The exception to classify.
Returns:
"permanent" | "transient" | "unknown"
"""
parts = [str(error)]
if error.__cause__ is not None:
parts.append(str(error.__cause__))
text = " ".join(parts)
permanent_patterns = ["400", "401", "403", "Forbidden", "Unauthorized"]
transient_patterns = [
"429", "500", "502", "503", "504",
"TooManyRequests", "RateLimit",
"timeout", "Timeout",
"ConnectionError", "Connection refused", "Connection reset",
]
for pattern in permanent_patterns:
if pattern in text:
return "permanent"
for pattern in transient_patterns:
if pattern in text:
return "transient"
return "unknown"
@@ -0,0 +1,50 @@
"""Tests para classify_api_error."""
import sys
import os
sys.path.insert(0, os.path.dirname(__file__))
from classify_api_error import classify_api_error
def test_error_429_es_transitorio():
err = Exception("HTTP 429 TooManyRequests")
assert classify_api_error(err) == "transient"
def test_error_401_es_permanente():
err = Exception("HTTP 401 Unauthorized")
assert classify_api_error(err) == "permanent"
def test_error_timeout_es_transitorio():
err = Exception("Connection timeout occurred")
assert classify_api_error(err) == "transient"
def test_error_desconocido_retorna_unknown():
err = Exception("Something completely unexpected happened")
assert classify_api_error(err) == "unknown"
def test_error_con___cause___transitorio():
cause = Exception("Connection reset by peer")
err = Exception("Request failed")
err.__cause__ = cause
assert classify_api_error(err) == "transient"
def test_permanente_tiene_prioridad_sobre_transitorio():
# Mensaje que contiene patrones de ambos tipos: 401 (permanent) y 503 (transient)
err = Exception("401 503 mixed error")
assert classify_api_error(err) == "permanent"
def test_error_403_forbidden_es_permanente():
err = Exception("403 Forbidden")
assert classify_api_error(err) == "permanent"
def test_error_500_es_transitorio():
err = Exception("Internal server error 500")
assert classify_api_error(err) == "transient"
@@ -0,0 +1,49 @@
---
name: coerce_types
kind: function
lang: py
domain: core
version: "1.0.0"
purity: pure
signature: "def coerce_types(data: dict, schema: dict[str, str]) -> tuple[dict, list[str]]"
description: "Converts the values of a dict to the types expected by a declarative schema. Supports int, float, str, bool, datetime, list[str]. Useful for normalizing data from CSV, JSON, or query params. Never mutates the original. Impossible coercions produce a warning and keep the original value."
tags: [coercion, types, normalization, pure, core, csv, json]
uses_functions: []
uses_types: []
returns: []
returns_optional: false
error_type: ""
imports: [datetime]
tested: true
tests:
- "string 42 a int 42"
- "string 3.14 a float 3.14"
- "string true a bool true"
- "string iso8601 a datetime"
- "coercion fallida genera warning sin crash"
- "dict con mix de tipos ya correctos y strings"
- "campo ausente en schema pass through sin tocar"
- "string lista a list str"
test_file_path: "python/functions/core/coerce_types_test.py"
file_path: "python/functions/core/coerce_types.py"
---
## Example
```python
data = {"age": "25", "score": "9.5", "active": "yes", "tags": "go, python"}
schema = {"age": "int", "score": "float", "active": "bool", "tags": "list[str]"}
result, warnings = coerce_types(data, schema)
# result = {"age": 25, "score": 9.5, "active": True, "tags": ["go", "python"]}
# warnings = []
# Failed coercion: keeps the original value and emits a warning
result2, warnings2 = coerce_types({"n": "abc"}, {"n": "int"})
# result2 = {"n": "abc"}
# warnings2 = ["n: cannot coerce 'abc' to int: cannot parse 'abc' as int"]
```
## Notes
Pure function. It uses only `datetime` from the stdlib. It does not mutate the original dict; it returns a new one (a shallow copy, so nested values that pass through are shared with the input). The schema is flat (not nested); to validate complex structures, combine it with `validate_json_schema`. Lossy coercions (float 3.7 → int 3, or the string "3.7" → int 3) still convert the value but add a warning. A field that is absent from the schema is copied untouched.
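The lossy-coercion rule can be illustrated in isolation. This is a minimal sketch, not the registry implementation: `to_int_lossy` is a hypothetical name, and it covers only the string to int case by parsing through `float` and comparing against the truncated value.

```python
from typing import Optional, Tuple


def to_int_lossy(raw: str) -> Tuple[int, Optional[str]]:
    """Hypothetical helper: parse a numeric string as int, reporting truncation."""
    as_float = float(raw.strip())  # raises ValueError for non-numeric text
    truncated = int(as_float)      # truncates toward zero
    warning = None
    if as_float != truncated:
        # A non-zero fractional part was dropped: report it instead of failing.
        warning = f"lossy coercion str→int: {raw!r} → {truncated}"
    return truncated, warning
```

Returning the warning next to the value, instead of raising, mirrors the design described above: the caller always gets a usable result and decides how strict to be.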
@@ -0,0 +1,135 @@
"""Coerce the values of a dict to the types expected by a declarative schema."""
from datetime import datetime
def coerce_types(
    data: dict, schema: dict[str, str]
) -> tuple[dict, list[str]]:
    """Convert the values of a dict to the types expected by the schema.

    The schema is a dict of {field: type} where type is one of:
    "int", "float", "str", "bool", "datetime", "list[str]".

    Supported coercions (mainly from str; a few non-str inputs such as
    float → int are also handled):
    - str → int: integer parse, with a warning if a fractional part was dropped
    - str → float: float(v)
    - str → bool: "true/1/yes" → True, "false/0/no" → False (case-insensitive)
    - str → datetime: ISO 8601 parse
    - str → list[str]: split on "," and strip each element
    - Value already of the correct type → pass through
    - Field absent from the schema → pass through untouched
    - Impossible coercion → keep the original + warning

    Args:
        data: Dict with the values to coerce.
        schema: Dict of {field: expected_type}.

    Returns:
        (coerced_data, warnings): a new dict with corrected types (the
        original is not mutated) and a list of warnings for lossy or
        failed coercions.
    """
    result = dict(data)
    warnings: list[str] = []
    for field, target_type in schema.items():
        # Schema fields that are missing from the data are skipped
        if field not in data:
            continue
        value = data[field]
        try:
            result[field] = _coerce_value(value, target_type, field, warnings)
        except Exception as exc:
            warnings.append(
                f"{field}: cannot coerce {value!r} to {target_type}: {exc}"
            )
            result[field] = value
    return result, warnings
_BOOL_TRUE = {"true", "1", "yes"}
_BOOL_FALSE = {"false", "0", "no"}
def _coerce_value(
    value: object, target: str, field: str, warnings: list[str]
) -> object:
    # --- int ---
    if target == "int":
        if isinstance(value, int) and not isinstance(value, bool):
            return value
        if isinstance(value, float):
            if value != int(value):
                warnings.append(
                    f"{field}: lossy coercion float→int: {value} → {int(value)}"
                )
            return int(value)
        if isinstance(value, str):
            stripped = value.strip()
            # Exact integer strings are parsed directly, which avoids the
            # precision loss of a float round trip for very large values
            try:
                return int(stripped)
            except ValueError:
                pass
            # Otherwise go through float to detect a non-zero fractional part
            try:
                as_float = float(stripped)
                if as_float != int(as_float):
                    warnings.append(
                        f"{field}: lossy coercion str→int: {value!r} → {int(as_float)}"
                    )
                return int(as_float)
            except ValueError:
                raise ValueError(f"cannot parse {value!r} as int")
        raise TypeError(f"cannot coerce {type(value).__name__} to int")

    # --- float ---
    if target == "float":
        if isinstance(value, float):
            return value
        if isinstance(value, int) and not isinstance(value, bool):
            return float(value)
        if isinstance(value, str):
            return float(value.strip())
        raise TypeError(f"cannot coerce {type(value).__name__} to float")

    # --- str ---
    if target == "str":
        if isinstance(value, str):
            return value
        return str(value)

    # --- bool ---
    if target == "bool":
        if isinstance(value, bool):
            return value
        if isinstance(value, str):
            low = value.strip().lower()
            if low in _BOOL_TRUE:
                return True
            if low in _BOOL_FALSE:
                return False
            raise ValueError(
                f"cannot parse {value!r} as bool; expected true/false/1/0/yes/no"
            )
        if isinstance(value, int):
            return bool(value)
        raise TypeError(f"cannot coerce {type(value).__name__} to bool")

    # --- datetime ---
    if target == "datetime":
        if isinstance(value, datetime):
            return value
        if isinstance(value, str):
            s = value.strip()
            # Accept ISO 8601 with a trailing "Z" by rewriting it as +00:00
            if s.endswith("Z"):
                s = s[:-1] + "+00:00"
            return datetime.fromisoformat(s)
        raise TypeError(f"cannot coerce {type(value).__name__} to datetime")

    # --- list[str] ---
    if target == "list[str]":
        if isinstance(value, list):
            return [str(item) for item in value]
        if isinstance(value, str):
            return [item.strip() for item in value.split(",")]
        raise TypeError(f"cannot coerce {type(value).__name__} to list[str]")

    raise ValueError(f"unknown target type: {target!r}")
@@ -0,0 +1,84 @@
"""Tests for coerce_types."""
import sys
import os
from datetime import datetime, timezone
sys.path.insert(0, os.path.dirname(__file__))
from coerce_types import coerce_types
def test_string_42_a_int_42():
result, warnings = coerce_types({"n": "42"}, {"n": "int"})
assert result["n"] == 42
assert isinstance(result["n"], int)
assert warnings == []
def test_string_3_14_a_float_3_14():
result, warnings = coerce_types({"x": "3.14"}, {"x": "float"})
assert abs(result["x"] - 3.14) < 1e-9
assert warnings == []
def test_string_true_a_bool_true():
result, warnings = coerce_types({"flag": "true"}, {"flag": "bool"})
assert result["flag"] is True
assert warnings == []
result2, _ = coerce_types({"flag": "yes"}, {"flag": "bool"})
assert result2["flag"] is True
result3, _ = coerce_types({"flag": "1"}, {"flag": "bool"})
assert result3["flag"] is True
result4, _ = coerce_types({"flag": "false"}, {"flag": "bool"})
assert result4["flag"] is False
def test_string_iso8601_a_datetime():
result, warnings = coerce_types(
{"ts": "2024-01-15T10:30:00Z"}, {"ts": "datetime"}
)
assert isinstance(result["ts"], datetime)
assert result["ts"].year == 2024
assert result["ts"].month == 1
assert result["ts"].day == 15
assert warnings == []
def test_coercion_fallida_genera_warning_sin_crash():
result, warnings = coerce_types({"n": "not-a-number"}, {"n": "int"})
# mantiene el original
assert result["n"] == "not-a-number"
assert len(warnings) == 1
assert "n" in warnings[0]
def test_dict_con_mix_de_tipos_ya_correctos_y_strings():
data = {"a": "10", "b": 3.14, "c": True, "d": "hello"}
schema = {"a": "int", "b": "float", "c": "bool", "d": "str"}
result, warnings = coerce_types(data, schema)
assert result["a"] == 10
assert abs(result["b"] - 3.14) < 1e-9
assert result["c"] is True
assert result["d"] == "hello"
assert warnings == []
def test_campo_ausente_en_schema_pass_through_sin_tocar():
data = {"a": "42", "b": [1, 2, 3]}
schema = {"a": "int"} # "b" no esta en schema
result, warnings = coerce_types(data, schema)
assert result["a"] == 42
assert result["b"] == [1, 2, 3]
assert warnings == []
def test_string_lista_a_list_str():
result, warnings = coerce_types(
{"tags": "python, go, bash"}, {"tags": "list[str]"}
)
assert result["tags"] == ["python", "go", "bash"]
assert warnings == []
@@ -0,0 +1,41 @@
---
name: compute_backoff_delay
kind: function
lang: py
domain: core
version: "1.0.0"
purity: pure
signature: "def compute_backoff_delay(attempt: int, base_delay: float = 0.5, max_delay: float = 8.0, jitter: bool = True) -> float"
description: "Computes the delay for exponential backoff with optional jitter. delay = min(base_delay * 2^attempt, max_delay). With jitter, adds random.uniform(0, min(base_delay, delay))."
tags: [retry, backoff, exponential, delay, jitter]
uses_functions: []
uses_types: []
returns: []
returns_optional: false
error_type: ""
imports: [random]
tested: true
tests: ["attempt 0 retorna base_delay sin jitter", "attempt alto se cappea a max_delay", "sin jitter es determinista"]
test_file_path: "python/functions/core/compute_backoff_delay_test.py"
file_path: "python/functions/core/compute_backoff_delay.py"
---
## Example
```python
# First retry (attempt=0): delay = 0.5 * 2^0 = 0.5s
compute_backoff_delay(0, jitter=False) # 0.5
# Third retry (attempt=2): delay = 0.5 * 2^2 = 2.0s
compute_backoff_delay(2, jitter=False) # 2.0
# High attempt, capped at 8.0s
compute_backoff_delay(10, jitter=False) # 8.0
# With jitter (non-deterministic)
compute_backoff_delay(1) # between 1.0 and 1.5
```
## Notes
Uses `random` from the stdlib. With jitter=True the result is not deterministic; the function is nevertheless classified as pure in the registry, on the grounds that the jitter is intentional and there is no I/O. For deterministic tests use jitter=False. The jitter is added after the cap is applied, so the result can reach up to max_delay + base_delay.
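To make the formula concrete, the sketch below prints the whole schedule for the default parameters. It is illustrative only: `backoff_delay` is a local stand-in that restates the documented formula so the block runs on its own, and `backoff_schedule` is a hypothetical helper.

```python
import random
from typing import List


def backoff_delay(attempt: int, base_delay: float = 0.5,
                  max_delay: float = 8.0, jitter: bool = True) -> float:
    """Local stand-in restating the documented formula."""
    delay = min(base_delay * (2 ** attempt), max_delay)
    if jitter:
        # Jitter is added on top of the capped delay.
        delay += random.uniform(0, min(base_delay, delay))
    return delay


def backoff_schedule(attempts: int, base_delay: float = 0.5,
                     max_delay: float = 8.0) -> List[float]:
    """Hypothetical helper: deterministic delays for attempts 0..attempts-1."""
    return [
        backoff_delay(a, base_delay=base_delay, max_delay=max_delay, jitter=False)
        for a in range(attempts)
    ]


if __name__ == "__main__":
    print(backoff_schedule(6))  # [0.5, 1.0, 2.0, 4.0, 8.0, 8.0]
```

With the defaults, the cap is reached at attempt 4, so every later retry waits 8.0 s plus at most 0.5 s of jitter.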
@@ -0,0 +1,26 @@
"""Compute exponential backoff delay with optional jitter."""
import random
def compute_backoff_delay(
attempt: int,
base_delay: float = 0.5,
max_delay: float = 8.0,
jitter: bool = True,
) -> float:
"""Compute exponential backoff delay for a given attempt number.
Args:
attempt: Zero-based attempt index (0 = first retry).
base_delay: Base delay in seconds before exponential scaling.
max_delay: Maximum delay cap in seconds.
jitter: If True, adds random jitter to avoid thundering herd.
Returns:
Delay in seconds to wait before the next attempt.
"""
delay = min(base_delay * (2 ** attempt), max_delay)
if jitter:
delay += random.uniform(0, min(base_delay, delay))
return delay
@@ -0,0 +1,42 @@
"""Tests for compute_backoff_delay."""
import sys
import os
sys.path.insert(0, os.path.dirname(__file__))
from compute_backoff_delay import compute_backoff_delay
def test_attempt_0_retorna_base_delay_sin_jitter():
result = compute_backoff_delay(0, base_delay=0.5, max_delay=8.0, jitter=False)
assert result == 0.5
def test_attempt_alto_se_cappea_a_max_delay():
result = compute_backoff_delay(10, base_delay=0.5, max_delay=8.0, jitter=False)
assert result == 8.0
def test_sin_jitter_es_determinista():
r1 = compute_backoff_delay(3, base_delay=1.0, max_delay=16.0, jitter=False)
r2 = compute_backoff_delay(3, base_delay=1.0, max_delay=16.0, jitter=False)
assert r1 == r2
# attempt=3: 1.0 * 2^3 = 8.0
assert r1 == 8.0
def test_escala_exponencial():
d0 = compute_backoff_delay(0, base_delay=1.0, max_delay=100.0, jitter=False)
d1 = compute_backoff_delay(1, base_delay=1.0, max_delay=100.0, jitter=False)
d2 = compute_backoff_delay(2, base_delay=1.0, max_delay=100.0, jitter=False)
assert d0 == 1.0
assert d1 == 2.0
assert d2 == 4.0
def test_con_jitter_no_excede_max_delay_mas_base():
# With jitter: capped delay + jitter <= max_delay + base_delay
for attempt in range(5):
result = compute_backoff_delay(attempt, base_delay=0.5, max_delay=8.0, jitter=True)
assert result >= 0.5
assert result <= 8.0 + 0.5
@@ -0,0 +1,59 @@
---
name: convert_github_to_raw_url
kind: function
lang: py
domain: core
version: "1.0.0"
purity: pure
signature: "def convert_github_to_raw_url(url: str) -> str"
description: "Converts a GitHub/GitLab blob URL to its raw URL. E.g.: github.com/org/repo/blob/main/file.py → raw.githubusercontent.com/org/repo/main/file.py. Returns the URL unchanged if the conversion does not apply."
tags: [github, gitlab, url, raw, blob, convert, transform]
uses_functions: []
uses_types: []
returns: []
returns_optional: false
error_type: ""
imports: ["urllib.parse"]
tested: true
tests:
- "URL GitHub blob"
- "URL GitLab blob"
- "URL que no es blob retorna sin cambios"
- "URL no-GitHub retorna sin cambios"
test_file_path: "python/functions/core/convert_github_to_raw_url_test.py"
file_path: "python/functions/core/convert_github_to_raw_url.py"
---
## Example
```python
from core.convert_github_to_raw_url import convert_github_to_raw_url
# GitHub blob → raw.githubusercontent.com
url = convert_github_to_raw_url(
    "https://github.com/openai/whisper/blob/main/README.md"
)
# "https://raw.githubusercontent.com/openai/whisper/main/README.md"
# GitLab blob → raw
url = convert_github_to_raw_url(
    "https://gitlab.com/org/repo/-/blob/main/file.py"
)
# "https://gitlab.com/org/repo/-/raw/main/file.py"
# URL without blob → unchanged
url = convert_github_to_raw_url("https://github.com/org/repo")
# "https://github.com/org/repo"
```
## Notes
Algorithm:
1. Parse the URL with `urllib.parse.urlparse`.
2. If the host is `github.com` (or `www.github.com`): look for a `blob` segment in the path.
   - If present: remove the `blob` segment and change the domain to `raw.githubusercontent.com`.
3. If the host is `gitlab.com` or starts with `gitlab.`: replace `/-/blob/` with `/-/raw/`, or `/blob/` with `/raw/`.
4. Any other host: return the URL unchanged.

Pure function. It performs no I/O and has no side effects.
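The parsing step of the algorithm can be sketched separately. The helper below is hypothetical (`split_github_blob_url` is not part of the registry) and assumes the ref is a single path segment; a branch name that contains `/` cannot be told apart from the file path using the URL alone.

```python
from typing import Optional, Tuple
from urllib.parse import urlparse


def split_github_blob_url(url: str) -> Optional[Tuple[str, str, str, str]]:
    """Hypothetical helper: split a GitHub blob URL into (org, repo, ref, path)."""
    parsed = urlparse(url.strip())
    if parsed.hostname not in ("github.com", "www.github.com"):
        return None
    segments = [s for s in parsed.path.split("/") if s]
    # Expected layout: org / repo / "blob" / ref / at least one path segment.
    if len(segments) < 5 or segments[2] != "blob":
        return None
    org, repo, _, ref = segments[:4]
    return org, repo, ref, "/".join(segments[4:])


if __name__ == "__main__":
    print(split_github_blob_url(
        "https://github.com/openai/whisper/blob/main/README.md"
    ))  # ('openai', 'whisper', 'main', 'README.md')
```

Requiring `blob` at a fixed position is stricter than searching for the segment anywhere in the path; it avoids misreading an org or repo that happens to be named `blob`.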
@@ -0,0 +1,69 @@
"""Convert GitHub/GitLab blob URLs to their raw equivalent."""
from urllib.parse import urlparse, urlunparse
def convert_github_to_raw_url(url: str) -> str:
    """Convert a GitHub or GitLab blob URL to its raw URL.

    GitHub blob:
        https://github.com/org/repo/blob/main/path/file.py
        → https://raw.githubusercontent.com/org/repo/main/path/file.py

    GitLab blob:
        https://gitlab.com/org/repo/-/blob/main/path/file.py
        → https://gitlab.com/org/repo/-/raw/main/path/file.py

    If the URL does not contain a blob-style path, it is returned unchanged.

    Args:
        url: GitHub or GitLab URL, possibly pointing to a blob.

    Returns:
        The raw URL if the transformation applies; the original URL otherwise.
    """
    url = url.strip()
    if not url:
        return url
    parsed = urlparse(url)
    host = parsed.hostname or ""

    # --- GitHub ---
    if host in ("github.com", "www.github.com"):
        # Typical path: /org/repo/blob/ref/path/to/file
        segments = parsed.path.split("/")
        if "blob" in segments:
            blob_idx = segments.index("blob")
            # Drop the "blob" segment: /org/repo/ref/path/...
            new_segments = segments[:blob_idx] + segments[blob_idx + 1:]
            new_path = "/".join(new_segments)
            return urlunparse((
                "https",
                "raw.githubusercontent.com",
                new_path,
                parsed.params,
                parsed.query,
                parsed.fragment,
            ))
        return url

    # --- GitLab ---
    if host in ("gitlab.com", "www.gitlab.com") or host.startswith("gitlab."):
        # Typical path: /org/repo/-/blob/ref/path or /org/repo/blob/ref/path.
        # Only the first marker is replaced, so a directory named "blob"
        # further down the file path is left untouched.
        if "/-/blob/" in parsed.path:
            new_path = parsed.path.replace("/-/blob/", "/-/raw/", 1)
        else:
            new_path = parsed.path.replace("/blob/", "/raw/", 1)
        if new_path != parsed.path:
            return urlunparse((
                parsed.scheme,
                parsed.netloc,
                new_path,
                parsed.params,
                parsed.query,
                parsed.fragment,
            ))
        return url

    # No transformation applies
    return url
@@ -0,0 +1,77 @@
"""Tests for convert_github_to_raw_url."""
import sys
import os
sys.path.insert(0, os.path.join(os.path.dirname(__file__), ".."))
from core.convert_github_to_raw_url import convert_github_to_raw_url
def test_url_github_blob():
"""URL de GitHub blob se convierte correctamente a raw.githubusercontent.com."""
url = "https://github.com/openai/whisper/blob/main/README.md"
result = convert_github_to_raw_url(url)
assert result == "https://raw.githubusercontent.com/openai/whisper/main/README.md"
def test_url_github_blob_subdirectorio():
"""URL de GitHub blob con subdirectorio se convierte correctamente."""
url = "https://github.com/org/repo/blob/main/src/utils/helper.py"
result = convert_github_to_raw_url(url)
assert result == "https://raw.githubusercontent.com/org/repo/main/src/utils/helper.py"
def test_url_github_blob_otra_rama():
"""URL de GitHub blob con rama distinta a main se convierte correctamente."""
url = "https://github.com/org/repo/blob/develop/config.yaml"
result = convert_github_to_raw_url(url)
assert result == "https://raw.githubusercontent.com/org/repo/develop/config.yaml"
def test_url_gitlab_blob():
"""URL de GitLab blob se convierte a raw."""
url = "https://gitlab.com/org/repo/-/blob/main/README.md"
result = convert_github_to_raw_url(url)
assert result == "https://gitlab.com/org/repo/-/raw/main/README.md"
def test_url_gitlab_blob_sin_guion():
"""URL de GitLab blob sin '/-/' tambien se convierte."""
url = "https://gitlab.com/org/repo/blob/main/README.md"
result = convert_github_to_raw_url(url)
assert result == "https://gitlab.com/org/repo/raw/main/README.md"
def test_url_que_no_es_blob_retorna_sin_cambios():
"""URL de GitHub sin blob retorna sin cambios."""
url = "https://github.com/org/repo"
result = convert_github_to_raw_url(url)
assert result == url
def test_url_github_tree_retorna_sin_cambios():
"""URL de GitHub tree (no blob) retorna sin cambios."""
url = "https://github.com/org/repo/tree/main/src"
result = convert_github_to_raw_url(url)
assert result == url
def test_url_no_github_retorna_sin_cambios():
"""URL de otro dominio retorna sin cambios."""
url = "https://example.com/org/repo/blob/main/file.py"
result = convert_github_to_raw_url(url)
assert result == url
def test_url_vacia_retorna_sin_cambios():
"""URL vacia retorna string vacio."""
result = convert_github_to_raw_url("")
assert result == ""
def test_url_raw_githubusercontent_retorna_sin_cambios():
"""URL ya en raw.githubusercontent.com no se modifica."""
url = "https://raw.githubusercontent.com/org/repo/main/file.py"
result = convert_github_to_raw_url(url)
assert result == url
@@ -1,7 +1,9 @@
"""Core functional programming utilities — pure functions for list/collection operations."""
import hashlib
import re
from functools import reduce as _reduce
from typing import Any, Callable, Dict, List, Tuple
from typing import Any, Callable, Dict, List, Optional, Tuple
def filter_list(xs: list, pred: Callable) -> list:
@@ -133,3 +135,680 @@ def compose(*fns: Callable) -> Callable:
result = fn(result)
return result
return composed
# ── Tree manipulation ────────────────────────────────────────────────────────
def flatten_tree(structure: Any) -> List[Dict]:
"""Flatten a hierarchical tree (dict with 'nodes') to a list without children."""
import copy
if isinstance(structure, dict):
node = copy.deepcopy(structure)
node.pop('nodes', None)
nodes = [node]
for key in list(structure.keys()):
if 'nodes' in key:
nodes.extend(flatten_tree(structure[key]))
return nodes
elif isinstance(structure, list):
nodes = []
for item in structure:
nodes.extend(flatten_tree(item))
return nodes
return []
def tree_to_flat_list(structure: Any) -> List[Dict]:
"""Convert hierarchical tree to flat list preserving DFS order (keeps internal nodes)."""
if isinstance(structure, dict):
nodes = [structure]
if 'nodes' in structure:
nodes.extend(tree_to_flat_list(structure['nodes']))
return nodes
elif isinstance(structure, list):
nodes = []
for item in structure:
nodes.extend(tree_to_flat_list(item))
return nodes
return []
def get_leaf_nodes(structure: Any) -> List[Dict]:
"""Extract only leaf nodes (no children) from a hierarchical tree."""
import copy
if isinstance(structure, dict):
if not structure.get('nodes'):
node = copy.deepcopy(structure)
node.pop('nodes', None)
return [node]
leaf_nodes = []
for key in list(structure.keys()):
if 'nodes' in key:
leaf_nodes.extend(get_leaf_nodes(structure[key]))
return leaf_nodes
elif isinstance(structure, list):
leaf_nodes = []
for item in structure:
leaf_nodes.extend(get_leaf_nodes(item))
return leaf_nodes
return []
def write_node_ids(data: Any, node_id: int = 0) -> int:
"""Assign sequential zero-padded IDs to all nodes in a tree, starting at node_id (0000, 0001, ... by default). Returns next counter."""
if isinstance(data, dict):
data['node_id'] = str(node_id).zfill(4)
node_id += 1
for key in list(data.keys()):
if 'nodes' in key:
node_id = write_node_ids(data[key], node_id)
elif isinstance(data, list):
for item in data:
node_id = write_node_ids(item, node_id)
return node_id
def list_to_tree(data: List[Dict]) -> List[Dict]:
"""Convert flat list with structure codes ('1.2.3') to nested tree."""
def get_parent_structure(structure):
if not structure:
return None
parts = str(structure).split('.')
return '.'.join(parts[:-1]) if len(parts) > 1 else None
nodes = {}
root_nodes = []
for item in data:
structure = item.get('structure')
node = {
'title': item.get('title'),
'start_index': item.get('start_index'),
'end_index': item.get('end_index'),
'nodes': []
}
nodes[structure] = node
parent_structure = get_parent_structure(structure)
if parent_structure and parent_structure in nodes:
nodes[parent_structure]['nodes'].append(node)
else:
root_nodes.append(node)
def clean_node(node):
if not node['nodes']:
del node['nodes']
else:
for child in node['nodes']:
clean_node(child)
return node
return [clean_node(node) for node in root_nodes]
def remove_tree_fields(data: Any, fields: Optional[List[str]] = None) -> Any:
"""Recursively remove specified fields from a tree (dict/list)."""
if fields is None:
fields = ['text']
if isinstance(data, dict):
return {k: remove_tree_fields(v, fields) for k, v in data.items() if k not in fields}
elif isinstance(data, list):
return [remove_tree_fields(item, fields) for item in data]
return data
def format_tree_structure(structure: Any, order: Optional[List[str]] = None) -> Any:
"""Reorder fields of each node in a tree according to specified key order."""
if not order:
return structure
if isinstance(structure, dict):
if 'nodes' in structure:
structure['nodes'] = format_tree_structure(structure['nodes'], order)
if not structure.get('nodes'):
structure.pop('nodes', None)
return {key: structure[key] for key in order if key in structure}
elif isinstance(structure, list):
return [format_tree_structure(item, order) for item in structure]
return structure
def create_node_mapping(tree: List[Dict]) -> Dict[str, Dict]:
"""Create flat dict mapping node_id to node for O(1) lookup."""
mapping = {}
def _traverse(nodes):
for node in nodes:
if node.get('node_id'):
mapping[node['node_id']] = node
if node.get('nodes'):
_traverse(node['nodes'])
_traverse(tree)
return mapping
# ── Text / JSON extraction ───────────────────────────────────────────────────
def extract_json_from_llm(content: str) -> Dict:
    """Extract and parse JSON from LLM responses. Handles ```json blocks, trailing commas, None->null."""
    import json
    json_content = ""
    try:
        start_idx = content.find("```json")
        if start_idx != -1:
            start_idx += len("```json")
            end_idx = content.rfind("```")
            # If there is no closing fence, read to the end of the string
            if end_idx < start_idx:
                end_idx = len(content)
            json_content = content[start_idx:end_idx]
        else:
            json_content = content
        # Caveat: this also rewrites the text "None" inside string values
        json_content = json_content.replace('None', 'null')
        # Collapse newlines and runs of whitespace into single spaces
        json_content = ' '.join(json_content.split())
        return json.loads(json_content)
    except Exception:
        try:
            # Drop trailing commas, including those followed by whitespace
            json_content = re.sub(r',\s*([\]}])', r'\1', json_content)
            return json.loads(json_content)
        except Exception:
            return {}
def parse_page_range(pages: str) -> List[int]:
"""Parse page range string ('5-7', '3,8', '12') into sorted list of unique ints."""
result = []
for part in pages.split(','):
part = part.strip()
if '-' in part:
start, end = int(part.split('-', 1)[0].strip()), int(part.split('-', 1)[1].strip())
if start > end:
raise ValueError(f"Invalid range '{part}': start must be <= end")
result.extend(range(start, end + 1))
else:
result.append(int(part))
return sorted(set(result))
# ── Markdown parsing ─────────────────────────────────────────────────────────
def extract_markdown_headers(markdown_content: str) -> Tuple[List[Dict], List[str]]:
"""Extract all headers (h1-h6) from markdown with line numbers, skipping code blocks."""
import re
header_pattern = r'^(#{1,6})\s+(.+)$'
code_block_pattern = r'^```'
node_list = []
lines = markdown_content.split('\n')
in_code_block = False
for line_num, line in enumerate(lines, 1):
stripped_line = line.strip()
if re.match(code_block_pattern, stripped_line):
in_code_block = not in_code_block
continue
if not stripped_line:
continue
if not in_code_block:
match = re.match(header_pattern, stripped_line)
if match:
level = len(match.group(1))
title = match.group(2).strip()
node_list.append({'title': title, 'level': level, 'line_num': line_num})
return node_list, lines
def build_tree_from_headers(node_list: List[Dict]) -> List[Dict]:
"""Build nested tree from flat list of headers with levels (h1>h2>h3)."""
if not node_list:
return []
stack = []
root_nodes = []
node_counter = 1
for node in node_list:
current_level = node['level']
tree_node = {
'title': node['title'],
'node_id': str(node_counter).zfill(4),
'line_num': node['line_num'],
'nodes': []
}
node_counter += 1
while stack and stack[-1][1] >= current_level:
stack.pop()
if not stack:
root_nodes.append(tree_node)
else:
parent_node, _ = stack[-1]
parent_node['nodes'].append(tree_node)
stack.append((tree_node, current_level))
def clean_empty_nodes(nodes):
for n in nodes:
if n['nodes']:
clean_empty_nodes(n['nodes'])
else:
del n['nodes']
return nodes
return clean_empty_nodes(root_nodes)
# ── Pagination / chunking ────────────────────────────────────────────────────
def page_list_to_groups(page_contents: List[str], token_lengths: List[int],
max_tokens: int = 20000, overlap_pages: int = 1) -> List[str]:
"""Group pages into text chunks respecting token limit with configurable overlap."""
import math
num_tokens = sum(token_lengths)
if num_tokens <= max_tokens:
return ["".join(page_contents)]
subsets = []
current_subset = []
current_token_count = 0
expected_parts = math.ceil(num_tokens / max_tokens)
avg_tokens = math.ceil(((num_tokens / expected_parts) + max_tokens) / 2)
for i, (page_content, page_tokens) in enumerate(zip(page_contents, token_lengths)):
if current_token_count + page_tokens > avg_tokens:
subsets.append(''.join(current_subset))
overlap_start = max(i - overlap_pages, 0)
current_subset = list(page_contents[overlap_start:i])
current_token_count = sum(token_lengths[overlap_start:i])
current_subset.append(page_content)
current_token_count += page_tokens
if current_subset:
subsets.append(''.join(current_subset))
return subsets
def calculate_page_offset(pairs: List[Dict]) -> int:
"""Calculate offset between logical page numbers and physical indices using reference pairs."""
differences = []
for pair in pairs:
try:
difference = pair['physical_index'] - pair['page']
differences.append(difference)
except (KeyError, TypeError):
continue
if not differences:
return 0
counts: Dict[int, int] = {}
for diff in differences:
counts[diff] = counts.get(diff, 0) + 1
return max(counts.items(), key=lambda x: x[1])[0]
# ── Text preprocessing ───────────────────────────────────────────────────────
def preprocess_text(text: str) -> str:
    """Normalize whitespace and newlines in raw text.

    Args:
        text: Raw text to normalize.

    Returns:
        Normalized text with consistent newlines, stripped lines, and no
        excessive blank lines.
    """
    # Normalize line endings: \r\n and \r -> \n
    text = text.replace('\r\n', '\n').replace('\r', '\n')
    # Strip whitespace from each line first, so that whitespace-only lines
    # become empty and are counted as blank lines in the next step
    text = '\n'.join(line.strip() for line in text.split('\n'))
    # Reduce 3+ consecutive newlines to at most 2
    text = re.sub(r'\n{3,}', '\n\n', text)
    # Strip globally
    return text.strip()
def get_text_stats(text: str) -> dict:
"""Compute basic statistics of a text: characters, lines, words.
Args:
text: Input text to analyze.
Returns:
Dict with keys total_chars (int), total_lines (int), total_words (int).
"""
return {
'total_chars': len(text),
'total_lines': text.count('\n') + 1,
'total_words': len(text.split()),
}
# ── Git URL parsing ──────────────────────────────────────────────────────────
_DEFAULT_GIT_HOSTS = ["github.com", "gitlab.com"]
def _sanitize_git_segment(segment: str) -> str:
"""Strip .git suffix then keep only [a-zA-Z0-9_-] chars."""
if segment.endswith(".git"):
segment = segment[:-4]
return re.sub(r"[^a-zA-Z0-9_\-]", "", segment)
def parse_git_url(url: str, known_hosts: Optional[List[str]] = None) -> Optional[str]:
"""Parse a code-hosting URL and return the 'org/repo' path component.
Supports HTTPS, HTTP, git://, ssh:// and SSH shorthand (git@host:path).
Returns None if the URL does not match any known host or is malformed.
Args:
url: Repository URL in any supported format.
known_hosts: List of accepted hostnames. Defaults to github.com and gitlab.com.
Returns:
'org/repo' string or None.
"""
from urllib.parse import urlparse
hosts = known_hosts if known_hosts is not None else _DEFAULT_GIT_HOSTS
url = url.strip()
if url.startswith("git@"):
# git@github.com:org/repo.git
rest = url[len("git@"):]
if ":" not in rest:
return None
host, path = rest.split(":", 1)
if host not in hosts:
return None
segments = [s for s in path.split("/") if s]
if len(segments) < 2:
return None
org = _sanitize_git_segment(segments[0])
repo = _sanitize_git_segment(segments[1])
if not org or not repo:
return None
return f"{org}/{repo}"
for prefix in ("http://", "https://", "git://", "ssh://"):
if url.startswith(prefix):
parsed = urlparse(url)
netloc = parsed.hostname or ""
if netloc not in hosts:
return None
segments = [s for s in parsed.path.split("/") if s]
if len(segments) < 2:
return None
org = _sanitize_git_segment(segments[0])
repo = _sanitize_git_segment(segments[1])
if not org or not repo:
return None
return f"{org}/{repo}"
return None
def is_git_repo_url(url: str, known_hosts: Optional[List[str]] = None) -> bool:
"""Return True only if url points to a clonable git repository.
Accepts org/repo and org/repo/tree/<ref> paths.
Rejects paths that navigate to sub-resources (issues, blobs, PRs, etc.).
Args:
url: URL to verify.
known_hosts: Accepted hostnames. Defaults to github.com and gitlab.com.
Returns:
True if url is a clonable repository URL.
"""
from urllib.parse import urlparse
hosts = known_hosts if known_hosts is not None else _DEFAULT_GIT_HOSTS
url = url.strip()
# SSH shorthand — always repo-level if host matches
if url.startswith("git@"):
rest = url[len("git@"):]
if ":" not in rest:
return False
host, _ = rest.split(":", 1)
return host in hosts
# git:// and ssh:// — always repo-level if host matches
for prefix in ("ssh://", "git://"):
if url.startswith(prefix):
parsed = urlparse(url)
return (parsed.hostname or "") in hosts
# http:// and https:// — must have exactly org/repo or org/repo/tree/<ref>
for prefix in ("http://", "https://"):
if url.startswith(prefix):
parsed = urlparse(url)
if (parsed.hostname or "") not in hosts:
return False
segments = [s for s in parsed.path.split("/") if s]
if len(segments) == 2:
return True
if len(segments) == 4 and segments[2] == "tree":
return True
return False
return False
def validate_git_ssh_uri(url: str) -> None:
"""Validate a git SSH URI of the form git@host:path.
Raises ValueError with a descriptive message if the URI is malformed.
Args:
url: URI string to validate.
Raises:
ValueError: If the URI does not conform to git SSH format.
"""
if not url.startswith("git@"):
raise ValueError(f"git SSH URI must start with 'git@', got: {url!r}")
rest = url[len("git@"):]
if ":" not in rest:
raise ValueError(f"git SSH URI must contain ':', got: {url!r}")
_, path = rest.split(":", 1)
if not path:
raise ValueError(f"git SSH URI must have a non-empty path after ':', got: {url!r}")
# ---------------------------------------------------------------------------
# Markdown parsing utilities
# ---------------------------------------------------------------------------
def extract_frontmatter(content: str) -> Tuple[str, Optional[Dict]]:
"""Extract YAML frontmatter delimited by '---' from the start of a markdown string.
Args:
content: Raw markdown string, optionally starting with YAML frontmatter.
Returns:
Tuple of (content_without_frontmatter, frontmatter_dict).
frontmatter_dict is None when no frontmatter is found.
"""
pattern = re.compile(r'^---\n(.*?)\n---\n', re.DOTALL)
match = pattern.match(content)
if not match:
return content, None
raw = match.group(1)
remaining = content[match.end():]
try:
import yaml # type: ignore
data = yaml.safe_load(raw)
if not isinstance(data, dict):
data = None
except Exception:
# Fallback: simple key: value parser (no yaml dependency)
data = {}
for line in raw.splitlines():
if ':' in line:
key, _, value = line.partition(':')
data[key.strip()] = value.strip()
return remaining, data
def find_headings(content: str) -> List[Tuple[int, int, str, int]]:
"""Find all markdown headings (# to ######), excluding those inside code blocks,
HTML comments, and indented blocks.
Args:
content: Markdown text to search.
Returns:
List of (start_pos, end_pos, title, level) for each heading found.
"""
excluded: List[Tuple[int, int]] = []
# Code blocks (triple backtick)
for m in re.finditer(r'```.*?```', content, re.DOTALL):
excluded.append((m.start(), m.end()))
# HTML comments
for m in re.finditer(r'<!--.*?-->', content, re.DOTALL):
excluded.append((m.start(), m.end()))
# Indented blocks (lines starting with 4 spaces or a tab)
for m in re.finditer(r'^( {4}|\t).+$', content, re.MULTILINE):
excluded.append((m.start(), m.end()))
def is_excluded(pos: int) -> bool:
return any(start <= pos < end for start, end in excluded)
results: List[Tuple[int, int, str, int]] = []
for m in re.finditer(r'^(#{1,6})\s+(.+)$', content, re.MULTILINE):
# Skip escaped headings (\#)
before = content[m.start() - 1] if m.start() > 0 else ''
if before == '\\':
continue
if is_excluded(m.start()):
continue
level = len(m.group(1))
title = m.group(2).strip()
results.append((m.start(), m.end(), title, level))
return results
def estimate_token_count(content: str) -> int:
"""Estimate token count without a tokenizer.
CJK characters count as ~0.7 tokens each; other non-whitespace characters
count as ~0.3 tokens each.
Args:
content: Text to estimate.
Returns:
Estimated integer token count.
"""
cjk = re.findall(r'[\u4e00-\u9fff\u3040-\u30ff\uac00-\ud7af]', content)
without_cjk = re.sub(r'[\u4e00-\u9fff\u3040-\u30ff\uac00-\ud7af]', '', content)
others = re.findall(r'\S', without_cjk)
return int(len(cjk) * 0.7 + len(others) * 0.3)
def smart_split_content(
content: str,
max_tokens: int = 1024,
max_chars: int = 8000,
) -> List[str]:
"""Split large content into parts respecting token and character limits.
Splits by paragraphs (double newline). If a single paragraph exceeds the
limit it is force-cut into chunks of max_chars.
Args:
content: Text to split.
max_tokens: Maximum estimated tokens per part.
max_chars: Maximum characters per part.
Returns:
List of string parts.
"""
paragraphs = content.split('\n\n')
parts: List[str] = []
current_parts: List[str] = []
current_tokens = 0
current_chars = 0
def flush() -> None:
if current_parts:
parts.append('\n\n'.join(current_parts))
current_parts.clear()
for para in paragraphs:
para_tokens = estimate_token_count(para)
para_chars = len(para)
# Single paragraph exceeds limits — force-cut it
if para_tokens > max_tokens or para_chars > max_chars:
flush()
current_tokens = 0
current_chars = 0
for i in range(0, len(para), max_chars):
parts.append(para[i:i + max_chars])
continue
# Would exceed limits if added — flush first
if (current_tokens + para_tokens > max_tokens or
current_chars + para_chars > max_chars):
flush()
current_tokens = 0
current_chars = 0
current_parts.append(para)
current_tokens += para_tokens
current_chars += para_chars
flush()
return parts if parts else [content]
def sanitize_for_path(text: str, max_length: int = 50) -> str:
"""Convert text to a safe string for use in file paths.
Keeps word characters, CJK characters, spaces and hyphens. Replaces spaces
with underscores. Truncates with a sha256 suffix if the result exceeds
max_length.
Args:
text: Input text to sanitize.
max_length: Maximum length of the returned string.
Returns:
Safe path-friendly string.
"""
cleaned = re.sub(
r'[^\w\u4e00-\u9fff\u3040-\u30ff\uac00-\ud7af \-]',
'',
text,
)
cleaned = cleaned.replace(' ', '_').strip('_')
if not cleaned:
return 'section'
if len(cleaned) <= max_length:
return cleaned
suffix = '_' + hashlib.sha256(text.encode()).hexdigest()[:8]
return cleaned[:max_length - len(suffix)] + suffix
@@ -0,0 +1,36 @@
---
name: create_node_mapping
kind: function
lang: py
domain: core
version: "1.0.0"
purity: pure
signature: "def create_node_mapping(tree: list[dict]) -> dict[str, dict]"
description: "Builds a flat node_id -> node dict for O(1) lookup in a hierarchical tree."
tags: [tree, mapping, index, lookup]
uses_functions: []
uses_types: []
returns: []
returns_optional: false
error_type: ""
imports: []
tested: false
tests: []
test_file_path: ""
file_path: "python/functions/core/core.py"
source_repo: "https://github.com/VectifyAI/PageIndex"
source_license: "MIT"
source_file: "pageindex/utils.py"
---
## Example
```python
tree = [{"node_id": "0001", "title": "A", "nodes": [{"node_id": "0002", "title": "B"}]}]
mapping = create_node_mapping(tree)
mapping["0002"]["title"] # "B"
```
## Notes
Pure function. The values are references to the original nodes, not copies.
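The body of `create_node_mapping` is not shown in this entry, so the following is only a minimal sketch of how such a mapping could be built. It assumes that children live under a `nodes` key (as in the example above) and uses an explicit stack instead of recursion; the real implementation in PageIndex may differ.

```python
def create_node_mapping(tree: list[dict]) -> dict[str, dict]:
    """Flatten a nested tree into {node_id: node} (illustrative sketch)."""
    mapping: dict[str, dict] = {}
    stack = list(tree)  # shallow copy so the caller's list is not consumed
    while stack:
        node = stack.pop()
        node_id = node.get("node_id")
        if node_id is not None:
            mapping[node_id] = node  # reference to the original node, not a copy
        stack.extend(node.get("nodes", []))  # assumed children key
    return mapping


tree = [{"node_id": "0001", "title": "A", "nodes": [{"node_id": "0002", "title": "B"}]}]
mapping = create_node_mapping(tree)
mapping["0002"]["title"] = "B (edited)"
print(tree[0]["nodes"][0]["title"])  # B (edited)
```

Because the values are references, editing a node through the mapping also changes the tree, which is what the last two lines demonstrate.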
@@ -0,0 +1,66 @@
---
name: cursor_paginate
kind: function
lang: py
domain: core
version: "1.0.0"
purity: impure
signature: "def cursor_paginate(fetch_page: Callable[..., list[T]], get_cursor: Callable[[T], str | None], page_size: int = 100, max_items: int = 2000, max_retries: int = 3, retry_delay: float = 2.0, retryable_exceptions: tuple[type[Exception], ...] = (ConnectionError, TimeoutError, OSError)) -> list[T]"
description: "Generic cursor-based paginator that works with any API using cursor-based pagination. Each page is fetched with automatic retry and exponential backoff. It stops when the page is empty, the batch is smaller than page_size, max_items is reached, or the cursor of the last item is None."
tags: [pagination, cursor, retry, generic, api, backoff]
uses_functions: []
uses_types: []
returns: []
returns_optional: false
error_type: "error_go_core"
imports: ["time", "typing.Callable", "typing.TypeVar"]
tested: true
tests:
- "API that returns 3 pages of 10 items"
- "API that fails once per page (retry works)"
- "max_items limits correctly"
- "API that returns a partial page (last page)"
- "Cursor None on last item (stops)"
test_file_path: "python/functions/core/cursor_paginate_test.py"
file_path: "python/functions/core/cursor_paginate.py"
---
## Example
```python
import requests
from cursor_paginate import cursor_paginate
def fetch_users(limit: int, cursor: str | None) -> list[dict]:
params = {"limit": limit}
if cursor:
params["cursor"] = cursor
return requests.get("https://api.example.com/users", params=params).json()["items"]
def get_cursor(user: dict) -> str | None:
return user.get("next_cursor")
users = cursor_paginate(
fetch_page=fetch_users,
get_cursor=get_cursor,
page_size=100,
max_items=5000,
max_retries=3,
retry_delay=2.0,
)
```
## Notes
The caller only needs to provide two callables:
- `fetch_page(limit, cursor)`: receives `limit` and `cursor` as kwargs and returns a list of items.
- `get_cursor(item)`: extracts the cursor from the last item of the page; returning None signals the end of the data.
The internal exponential backoff applies `retry_delay * 2^attempt` without jitter. Only exceptions listed in `retryable_exceptions` are retried; any other exception propagates immediately.
Stop conditions (any one of them):
1. The returned page is empty.
2. The returned page has fewer items than `page_size` (partial page = last page).
3. The accumulated total reaches or exceeds `max_items` (the list is truncated and pagination stops).
4. `get_cursor(batch[-1])` returns `None`.
Impure function: it calls `fetch_page`, which typically performs network I/O, and uses `time.sleep` between retries.
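To make the backoff formula concrete, the sketch below lists the sleep intervals that `retry_delay * 2^attempt` produces for a single page. `backoff_delays` is a hypothetical helper written only for this illustration; it is not part of `cursor_paginate`.

```python
def backoff_delays(retry_delay: float, max_retries: int) -> list[float]:
    """Sleep intervals before each retry of one page (hypothetical helper)."""
    # A sleep happens after failed attempts 0 .. max_retries - 1;
    # a failure on the final attempt re-raises without sleeping.
    return [retry_delay * (2 ** attempt) for attempt in range(max_retries)]


delays = backoff_delays(retry_delay=2.0, max_retries=3)
print(delays)       # [2.0, 4.0, 8.0]
print(sum(delays))  # 14.0
```

With the default settings, a page that keeps failing therefore waits up to 14 seconds in total before the last exception is re-raised.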
@@ -0,0 +1,105 @@
"""Generic cursor-based paginator for any API that uses cursor pagination."""
import time
from typing import Callable, TypeVar
T = TypeVar("T")
def cursor_paginate(
fetch_page: Callable[..., list[T]],
get_cursor: Callable[[T], str | None],
page_size: int = 100,
max_items: int = 2000,
max_retries: int = 3,
retry_delay: float = 2.0,
retryable_exceptions: tuple[type[Exception], ...] = (
ConnectionError,
TimeoutError,
OSError,
),
) -> list[T]:
"""Paginate through a cursor-based API, collecting all items.
Fetches pages one at a time by calling fetch_page with limit and cursor
kwargs. Retries each page on transient errors using exponential backoff.
Stops when a page is empty, a partial page is returned, max_items is
reached, or the cursor from the last item is None.
Args:
fetch_page: Callable that accepts ``limit`` and ``cursor`` as keyword
arguments and returns a list of items for that page.
get_cursor: Callable that receives the last item of a page and returns
the cursor string to use for the next page, or None if there are
no more pages.
page_size: Number of items to request per page.
max_items: Hard cap on total items collected. Collection stops and the
list is truncated once this limit is reached.
max_retries: Maximum number of retry attempts per page after the first
failure.
retry_delay: Base delay in seconds between retries (doubled each
attempt — exponential backoff without jitter).
retryable_exceptions: Tuple of exception types that trigger a retry.
Any other exception propagates immediately.
Returns:
List of all collected items, in the order they were returned by the
API, truncated to max_items.
Raises:
Exception: Re-raises the last exception if all retries for a page are
exhausted.
"""
all_items: list[T] = []
cursor: str | None = None
while True:
batch = _fetch_with_retry(
fetch_page=fetch_page,
page_size=page_size,
cursor=cursor,
max_retries=max_retries,
retry_delay=retry_delay,
retryable_exceptions=retryable_exceptions,
)
if not batch:
break
all_items.extend(batch)
if len(all_items) >= max_items:
del all_items[max_items:]
break
if len(batch) < page_size:
break
cursor = get_cursor(batch[-1])
if cursor is None:
break
return all_items
def _fetch_with_retry(
fetch_page: Callable[..., list[T]],
page_size: int,
cursor: str | None,
max_retries: int,
retry_delay: float,
retryable_exceptions: tuple[type[Exception], ...],
) -> list[T]:
"""Call fetch_page once, retrying on retryable_exceptions with exponential backoff."""
last_exc: Exception | None = None
for attempt in range(max_retries + 1):
try:
return fetch_page(limit=page_size, cursor=cursor)
except retryable_exceptions as exc:
last_exc = exc
if attempt >= max_retries:
raise
delay = retry_delay * (2 ** attempt)
time.sleep(delay)
raise last_exc # unreachable; satisfies type checkers
@@ -0,0 +1,148 @@
"""Tests for cursor_paginate."""
import sys
import os
sys.path.insert(0, os.path.dirname(__file__))
import pytest
from cursor_paginate import cursor_paginate
# ---------------------------------------------------------------------------
# Helpers
# ---------------------------------------------------------------------------
def make_api(pages: list[list[dict]]) -> callable:
"""Return a fetch_page callable that serves pages from a pre-built list."""
call_count = [0]
def fetch_page(limit: int, cursor: str | None) -> list[dict]:
idx = call_count[0]
call_count[0] += 1
if idx >= len(pages):
return []
return pages[idx][:limit]
return fetch_page
def get_cursor(item: dict) -> str | None:
return item.get("cursor")
# ---------------------------------------------------------------------------
# Tests
# ---------------------------------------------------------------------------
def test_api_retorna_3_paginas_de_10_items():
pages = [
[{"id": i, "cursor": str(i)} for i in range(0, 10)],
[{"id": i, "cursor": str(i)} for i in range(10, 20)],
[{"id": i, "cursor": str(i)} for i in range(20, 30)],
[], # sentinel: empty page ends pagination
]
api = make_api(pages)
result = cursor_paginate(
fetch_page=api,
get_cursor=get_cursor,
page_size=10,
max_items=2000,
max_retries=0,
)
assert len(result) == 30
assert result[0]["id"] == 0
assert result[-1]["id"] == 29
def test_api_falla_1_vez_por_pagina_retry_funciona():
"""fetch_page fails on the first attempt for each page, but the retry recovers."""
call_counter = [0]
# Each page has 5 items. 2 pages in total, then empty.
items_by_page = [
[{"id": i, "cursor": str(i)} for i in range(0, 5)],
[{"id": i, "cursor": str(i)} for i in range(5, 10)],
]
page_idx = [0]
fail_flags = [True, True] # fail once per page
def fetch_page(limit: int, cursor: str | None) -> list[dict]:
idx = page_idx[0]
if idx < len(fail_flags) and fail_flags[idx]:
fail_flags[idx] = False
raise ConnectionError("transient failure")
page_idx[0] += 1
if idx >= len(items_by_page):
return []
return items_by_page[idx]
result = cursor_paginate(
fetch_page=fetch_page,
get_cursor=get_cursor,
page_size=5,
max_items=2000,
max_retries=3,
retry_delay=0.0,
retryable_exceptions=(ConnectionError, TimeoutError, OSError),
)
assert len(result) == 10
def test_max_items_limita_correctamente():
# 50 items available in 5 pages of 10, but max_items=25
pages = [
[{"id": i, "cursor": str(i)} for i in range(j * 10, j * 10 + 10)]
for j in range(5)
]
api = make_api(pages)
result = cursor_paginate(
fetch_page=api,
get_cursor=get_cursor,
page_size=10,
max_items=25,
max_retries=0,
)
assert len(result) == 25
assert result[-1]["id"] == 24
def test_api_retorna_pagina_parcial_ultima_pagina():
pages = [
[{"id": i, "cursor": str(i)} for i in range(10)], # full page
[{"id": i, "cursor": str(i)} for i in range(10, 17)], # partial — 7 items
]
api = make_api(pages)
result = cursor_paginate(
fetch_page=api,
get_cursor=get_cursor,
page_size=10,
max_items=2000,
max_retries=0,
)
assert len(result) == 17
assert result[-1]["id"] == 16
def test_cursor_none_en_ultimo_item_se_detiene():
"""When the last item has no cursor, pagination must stop."""
pages = [
[{"id": i, "cursor": str(i)} for i in range(10)],
# last item has no cursor — signals end of data
[{"id": i, "cursor": (str(i) if i < 19 else None)} for i in range(10, 20)],
]
api = make_api(pages)
def get_cursor_nullable(item: dict) -> str | None:
return item.get("cursor")
result = cursor_paginate(
fetch_page=api,
get_cursor=get_cursor_nullable,
page_size=10,
max_items=2000,
max_retries=0,
)
assert len(result) == 20
assert result[-1]["id"] == 19
@@ -0,0 +1,37 @@
---
name: detect_headings_by_font
kind: function
lang: py
domain: core
version: "1.0.0"
purity: impure
signature: "def detect_headings_by_font(pdf, min_delta: float = 2.0, max_levels: int = 4) -> list[dict]"
description: "Detects headings in a PDF by analyzing the font-size distribution. The most common font size is treated as body text; significantly larger sizes are classified as heading levels. Filters out repeated headers/footers."
tags: [pdf, headings, font, detection, parsing, pdfplumber]
uses_functions: []
uses_types: []
returns: []
returns_optional: false
error_type: "error_go_core"
imports: [pdfplumber, collections]
tested: false
tests: []
test_file_path: ""
file_path: "python/functions/core/detect_headings_by_font.py"
---
## Example
```python
import pdfplumber
from detect_headings_by_font import detect_headings_by_font
with pdfplumber.open("document.pdf") as pdf:
headings = detect_headings_by_font(pdf, min_delta=2.0, max_levels=4)
for h in headings:
print(f"Page {h['page_num']}: {'#' * h['level']} {h['title']}")
```
## Notes
Samples every 5th page to build the font-size Counter (a performance optimization). The body_size is the most frequent font size. Heading sizes must be >= body_size + min_delta AND have a character count below 50% of the body's. At most max_levels heading sizes are kept, sorted in descending order (the largest = level 1). Titles that appear on more than 30% of pages are treated as headers/footers and removed. Impure because it reads the internal state of an already-open PDF object.
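The size-selection rule can be tried on a toy distribution without opening a PDF. The sketch below mirrors the thresholds described in these notes; `pick_heading_sizes` and the sample counts are made up for illustration.

```python
from collections import Counter


def pick_heading_sizes(
    size_counter: Counter, min_delta: float = 2.0, max_levels: int = 4
) -> dict[float, int]:
    """Map heading font sizes to levels, largest size = level 1 (illustrative)."""
    body_size, body_count = size_counter.most_common(1)[0]
    candidates = sorted(
        (
            size
            for size, count in size_counter.items()
            if size >= body_size + min_delta and count < body_count * 0.5
        ),
        reverse=True,
    )[:max_levels]
    return {size: level for level, size in enumerate(candidates, start=1)}


# Made-up character counts per rounded font size
sizes = Counter({10.0: 5000, 11.0: 300, 14.0: 120, 18.0: 40, 24.0: 10})
levels = pick_heading_sizes(sizes)
print(levels)  # {24.0: 1, 18.0: 2, 14.0: 3}
```

Here 10.0 is the body size, 11.0 is rejected because it is less than `min_delta` above the body, and the remaining sizes become levels 1 to 3.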
@@ -0,0 +1,135 @@
"""Detect headings in a PDF by analyzing font size distribution."""
from collections import Counter
import pdfplumber
def detect_headings_by_font(
pdf: pdfplumber.PDF,
min_delta: float = 2.0,
max_levels: int = 4,
) -> list[dict]:
"""Detect headings by analyzing font size distribution across pages.
The most common font size is treated as body text. Font sizes at least
min_delta larger than the body size, and whose character count is below 50%
of the body size's count, are classified as heading levels.
Args:
pdf: An open pdfplumber.PDF object.
min_delta: Minimum size difference above body size to qualify as heading.
max_levels: Maximum number of heading levels to detect.
Returns:
list[dict]: List of {"level": int, "title": str, "page_num": int}
sorted by page number. Returns empty list if no headings detected.
"""
if not pdf.pages:
return []
# Step 1: Sample font sizes from every 5th page to determine body size
size_counter: Counter = Counter()
sample_pages = [pdf.pages[i] for i in range(0, len(pdf.pages), 5)]
if not sample_pages:
sample_pages = [pdf.pages[0]]
for page in sample_pages:
try:
chars = page.chars
for ch in chars:
size = ch.get("size")
if size is not None:
size_counter[round(float(size), 1)] += 1
except Exception:
continue
if not size_counter:
return []
# Step 2: Determine body size (most common font size)
body_size, body_count = size_counter.most_common(1)[0]
# Step 3: Identify heading sizes
# Must be >= body_size + min_delta and frequency < 50% of body count
heading_sizes = sorted(
[
size
for size, count in size_counter.items()
if size >= body_size + min_delta and count < body_count * 0.5
],
reverse=True,
)[:max_levels]
if not heading_sizes:
return []
# Build size -> level mapping
size_to_level = {size: i + 1 for i, size in enumerate(heading_sizes)}
# Step 4: Collect heading text per page
raw_headings: list[dict] = []
total_pages = len(pdf.pages)
for page_idx, page in enumerate(pdf.pages):
page_num = page_idx + 1
try:
chars = page.chars
except Exception:
continue
# Group consecutive chars of same heading size into text blocks
current_size = None
current_text = []
for ch in chars:
size = ch.get("size")
if size is None:
continue
rounded = round(float(size), 1)
if rounded in size_to_level:
if rounded == current_size:
current_text.append(ch.get("text", ""))
else:
if current_text and current_size is not None:
text = "".join(current_text).strip()
if text:
raw_headings.append({
"level": size_to_level[current_size],
"title": text,
"page_num": page_num,
})
current_size = rounded
current_text = [ch.get("text", "")]
else:
if current_text and current_size is not None:
text = "".join(current_text).strip()
if text:
raw_headings.append({
"level": size_to_level[current_size],
"title": text,
"page_num": page_num,
})
current_size = None
current_text = []
# Flush remaining
if current_text and current_size is not None:
text = "".join(current_text).strip()
if text:
raw_headings.append({
"level": size_to_level[current_size],
"title": text,
"page_num": page_num,
})
if not raw_headings:
return []
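# Step 5: Drop titles that appear on more than 30% of pages (headers/footers).
# Each title is counted once per page so repeats within a page do not inflate the total.
title_page_counts: Counter = Counter(
title for title, _page in {(h["title"], h["page_num"]) for h in raw_headings}
)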
threshold = total_pages * 0.3
filtered = [h for h in raw_headings if title_page_counts[h["title"]] <= threshold]
return filtered
@@ -0,0 +1,59 @@
---
name: detect_url_type
kind: function
lang: py
domain: core
version: "1.0.0"
purity: impure
signature: "def detect_url_type(url: str, timeout: float = 10.0) -> tuple[str, dict]"
description: "Detects the content type of a URL. Returns the type ('webpage', 'pdf', 'markdown', 'text', 'code_repository') and metadata. Makes an HTTP HEAD request only when the type cannot be determined from the URL pattern or extension."
tags: [url, content-type, http, detect, classification, head-request]
uses_functions: []
uses_types: []
returns: []
returns_optional: false
error_type: "error_go_core"
imports: ["urllib.parse", "httpx"]
tested: true
tests:
- "URL .pdf by extension"
- "URL github repo"
- "URL markdown by extension"
- "URL SSH git"
- "URL .html by extension"
test_file_path: "python/functions/core/detect_url_type_test.py"
file_path: "python/functions/core/detect_url_type.py"
---
## Example
```python
from core.detect_url_type import detect_url_type
# By URL pattern (no HTTP request)
url_type, meta = detect_url_type("https://github.com/openai/whisper")
# url_type = "code_repository", meta = {"detection": "url_pattern", ...}
# By extension (no HTTP request)
url_type, meta = detect_url_type("https://example.com/doc.pdf")
# url_type = "pdf", meta = {"detection": "extension", ...}
# By HTTP HEAD request (when the type cannot be determined without the network)
url_type, meta = detect_url_type("https://example.com/page")
# url_type = "webpage", meta = {"detection": "content_type_header", "content_type": "text/html", ...}
```
## Notes
Algorithm, in priority order:
1. SSH git shorthand (`git@host:path`) → `code_repository` immediately.
2. URL pattern of known repo hosts (github.com/org/repo, gitlab.com/org/repo) → `code_repository`.
3. Extension of the URL path (.pdf, .md, .txt, .html, .git) → corresponding type.
4. HTTP HEAD request → read the `Content-Type` header.
5. Default: `"webpage"`.
Hosts recognized as code repositories: github.com, gitlab.com, bitbucket.org, codeberg.org.
Sub-resources (issues, pulls, blob, tree, etc.) are NOT classified as `code_repository`.
Raises `Exception` with a descriptive message if the HEAD request fails (timeout, DNS, network).
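The sub-resource rule can be sketched in isolation. The snippet below is a simplified stand-in, not the function's actual code: `is_repo_root` is a hypothetical name, the host check is omitted, and `SUB_RESOURCES` is an abbreviated set.

```python
SUB_RESOURCES = {"issues", "pulls", "blob", "tree", "wiki"}  # abbreviated, illustrative


def is_repo_root(path: str) -> bool:
    """Return True if a URL path looks like org/repo rather than a sub-resource."""
    segments = [s for s in path.split("/") if s]
    if len(segments) < 2:
        return False
    if len(segments) >= 3 and segments[2] in SUB_RESOURCES:
        return False
    return True


print(is_repo_root("/openai/whisper"))           # True
print(is_repo_root("/openai/whisper.git"))       # True
print(is_repo_root("/openai/whisper/issues/1"))  # False
print(is_repo_root("/openai/whisper/wiki"))      # False

# Pitfall when normalizing segments: str.rstrip takes a set of characters,
# not a suffix, so it also eats the trailing "i" of "wiki".
print("wiki".rstrip(".git"))        # wik
print("wiki".removesuffix(".git"))  # wiki
```

If a segment ever needs its `.git` suffix removed before the lookup, `str.removesuffix` (Python 3.9+) is the safer choice for the reason shown in the last two lines.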
@@ -0,0 +1,144 @@
"""Detect the content type of a URL (webpage, pdf, markdown, text, code_repository)."""
import re
from urllib.parse import urlparse
# Code repository hosts, matched by hostname
_CODE_REPO_HOSTS = {"github.com", "gitlab.com", "bitbucket.org", "codeberg.org"}
# Recognized extensions → type
_EXT_TYPE_MAP = {
".pdf": "pdf",
".md": "markdown",
".markdown": "markdown",
".rst": "text",
".txt": "text",
".html": "webpage",
".htm": "webpage",
".xml": "text",
".json": "text",
".csv": "text",
".py": "text",
".js": "text",
".ts": "text",
".go": "text",
".rs": "text",
".cpp": "text",
".c": "text",
".java": "text",
".rb": "text",
".git": "code_repository",
}
# Content-Type header prefixes → type
_CONTENT_TYPE_MAP = {
"application/pdf": "pdf",
"text/markdown": "markdown",
"text/x-markdown": "markdown",
"text/plain": "text",
"text/html": "webpage",
"text/xml": "text",
"application/xml": "text",
"application/json": "text",
}
def _is_code_repo_url(parsed, path_segments: list[str]) -> bool:
    """Return True if the URL points to the root of a code repository."""
    host = parsed.hostname or ""
    if host not in _CODE_REPO_HOSTS:
        return False
    # Accept org/repo, org/repo/ or org/repo.git (at least 2 segments)
    if len(path_segments) < 2:
        return False
    # Reject known sub-resources: issues, pulls, blob, tree, releases, etc.
    _SUB_RESOURCES = {"issues", "pulls", "blob", "tree", "releases", "tags",
                      "commits", "compare", "wiki", "discussions", "actions",
                      "security", "pulse", "graphs", "-", "settings"}
    # removesuffix drops only a literal ".git" suffix; rstrip(".git") would strip
    # any trailing '.', 'g', 'i' or 't' characters and turn "wiki" into "wik".
    if len(path_segments) >= 3 and path_segments[2].removesuffix(".git") in _SUB_RESOURCES:
        return False
    return True


def _is_ssh_git_url(url: str) -> bool:
    """Return True if the URL is an SSH git shorthand (git@host:path)."""
    return url.strip().startswith("git@")


def _type_from_extension(path: str) -> str | None:
    """Detect the type from the extension of the URL path. Return None if no extension matches."""
    # Ignore query string / fragment
    clean_path = path.split("?")[0].split("#")[0]
    for ext, url_type in _EXT_TYPE_MAP.items():
        if clean_path.lower().endswith(ext):
            return url_type
    return None


def _type_from_content_type(content_type_header: str) -> str:
    """Map a Content-Type header to a URL type."""
    ct = content_type_header.lower().split(";")[0].strip()
    for prefix, url_type in _CONTENT_TYPE_MAP.items():
        if ct.startswith(prefix):
            return url_type
    return "webpage"
def detect_url_type(url: str, timeout: float = 10.0) -> tuple[str, dict]:
    """Detect the content type of a URL.

    Algorithm:
    1. Check whether the URL matches a code repo pattern (git@, github.com/org/repo).
    2. Check the extension in the URL path (.pdf, .md, .txt, .html, .git).
    3. If still undetermined: HTTP HEAD request to read the Content-Type header.
    4. Default: "webpage".

    Args:
        url: URL to analyze.
        timeout: Timeout in seconds for the HTTP HEAD request (if one is needed).

    Returns:
        Tuple of (type, metadata) where type is one of:
        "webpage", "pdf", "markdown", "text", "code_repository".
        metadata includes the available information (extension, content_type, etc.).

    Raises:
        Exception: If the HTTP connection fails when it is needed.
    """
    import httpx

    url = url.strip()
    metadata: dict = {"url": url}

    # 1. SSH git shorthand
    if _is_ssh_git_url(url):
        metadata["detection"] = "ssh_pattern"
        return "code_repository", metadata

    parsed = urlparse(url)
    path_segments = [s for s in parsed.path.split("/") if s]

    # 2. Code repo by URL pattern
    if _is_code_repo_url(parsed, path_segments):
        metadata["detection"] = "url_pattern"
        metadata["host"] = parsed.hostname
        return "code_repository", metadata

    # 3. Extension-based detection
    ext_type = _type_from_extension(parsed.path)
    if ext_type is not None:
        metadata["detection"] = "extension"
        metadata["path"] = parsed.path
        return ext_type, metadata

    # 4. HTTP HEAD request
    try:
        response = httpx.head(url, timeout=timeout, follow_redirects=True)
        content_type = response.headers.get("content-type", "")
        metadata["detection"] = "content_type_header"
        metadata["content_type"] = content_type
        metadata["status_code"] = response.status_code
        return _type_from_content_type(content_type), metadata
    except Exception as exc:
        raise Exception(f"detect_url_type: HEAD request failed for {url!r}: {exc}") from exc
@@ -0,0 +1,89 @@
"""Tests for detect_url_type (tests that do not require network access)."""
import sys
import os

sys.path.insert(0, os.path.join(os.path.dirname(__file__), ".."))

from core.detect_url_type import detect_url_type, _type_from_extension, _type_from_content_type, _is_ssh_git_url


def test_url_pdf_por_extension():
    """A .pdf URL is detected by extension without making an HTTP request."""
    url_type, metadata = detect_url_type("https://example.com/report.pdf")
    assert url_type == "pdf"
    assert metadata["detection"] == "extension"


def test_url_github_repo():
    """A GitHub org/repo URL is detected as code_repository by URL pattern."""
    url_type, metadata = detect_url_type("https://github.com/openai/whisper")
    assert url_type == "code_repository"
    assert metadata["detection"] == "url_pattern"


def test_url_github_con_git_suffix():
    """A GitHub URL ending in .git is detected as code_repository."""
    url_type, metadata = detect_url_type("https://github.com/openai/whisper.git")
    assert url_type == "code_repository"


def test_url_markdown_por_extension():
    """A .md URL is detected as markdown by extension."""
    url_type, metadata = detect_url_type("https://example.com/README.md")
    assert url_type == "markdown"
    assert metadata["detection"] == "extension"


def test_url_ssh_git():
    """An SSH git@ URL is detected as code_repository."""
    url_type, metadata = detect_url_type("git@github.com:openai/whisper.git")
    assert url_type == "code_repository"
    assert metadata["detection"] == "ssh_pattern"


def test_url_html_por_extension():
    """A .html URL is detected as webpage by extension."""
    url_type, metadata = detect_url_type("https://example.com/page.html")
    assert url_type == "webpage"
    assert metadata["detection"] == "extension"


def test_url_txt_por_extension():
    """A .txt URL is detected as text by extension."""
    url_type, metadata = detect_url_type("https://example.com/data.txt")
    assert url_type == "text"


def test_github_subrepo_no_es_repo():
    """A GitHub URL pointing at a blob is not treated as code_repository."""
    # The "blob" sub-resource fails the repo pattern check, so the .md
    # extension decides the type and no HEAD request is made.
    url = "https://github.com/openai/whisper/blob/main/README.md"
    url_type, metadata = detect_url_type(url)
    assert url_type == "markdown"


def test_helper_type_from_extension():
    """_type_from_extension works for known extensions."""
    assert _type_from_extension("/doc.pdf") == "pdf"
    assert _type_from_extension("/README.md") == "markdown"
    assert _type_from_extension("/notes.txt") == "text"
    assert _type_from_extension("/unknown.xyz") is None


def test_helper_type_from_content_type():
    """_type_from_content_type maps headers correctly."""
    assert _type_from_content_type("application/pdf; charset=utf-8") == "pdf"
    assert _type_from_content_type("text/html; charset=utf-8") == "webpage"
    assert _type_from_content_type("text/plain") == "text"
    assert _type_from_content_type("text/markdown") == "markdown"
    assert _type_from_content_type("application/octet-stream") == "webpage"


def test_helper_is_ssh_git_url():
    """_is_ssh_git_url detects the git@ format."""
    assert _is_ssh_git_url("git@github.com:org/repo.git") is True
    assert _is_ssh_git_url("https://github.com/org/repo") is False
    assert _is_ssh_git_url("ssh://git@github.com/org/repo") is False
@@ -0,0 +1,40 @@
---
name: docx_to_markdown
kind: function
lang: py
domain: core
version: "1.0.0"
purity: impure
signature: "def docx_to_markdown(docx_path: str) -> str"
description: "Converts a Word document (.docx) to markdown, preserving structure (headings), inline formatting (bold, italic, underline) and tables in their original position."
tags: [docx, markdown, word, conversion, document, parsing, text]
uses_functions: [format_table_to_markdown_py_core]
uses_types: []
returns: []
returns_optional: false
error_type: "error_go_core"
imports: [python-docx, lxml]
tested: true
tests: ["docx with headings and paragraphs", "docx with interleaved tables", "docx with bold/italic formatting", "empty docx", "missing file raises FileNotFoundError"]
test_file_path: "python/functions/core/docx_to_markdown_test.py"
file_path: "python/functions/core/docx_to_markdown.py"
---
## Example
```python
md = docx_to_markdown("informe.docx")
# # Title
#
# First paragraph.
#
# | Col1 | Col2 |
# | ---- | ---- |
# | a | b |
#
# Paragraph after the table.
```
## Notas
Walks `doc.element.body` in order (rather than `doc.paragraphs` and `doc.tables` separately) to preserve the original position of tables. Builds a `{id(tbl_element): Table}` map for O(1) lookup. Inline formatting applies underline (`<ins>`), italic (`*`) and bold (`**`) in that order, from innermost to outermost. Headings are detected from the paragraph style (`Heading 1`, `Heading 2`, etc.). Requires `python-docx` to be installed in the environment.
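The nesting order of the inline markers is easiest to see on a plain string. The sketch below reproduces only the wrapping step; `wrap_inline` is a hypothetical helper that takes the formatting flags directly instead of reading them from a Word run.

```python
def wrap_inline(
    text: str, bold: bool = False, italic: bool = False, underline: bool = False
) -> str:
    """Wrap text in markdown/HTML markers, innermost first (illustrative)."""
    if not text:
        return ""
    if underline:
        text = f"<ins>{text}</ins>"  # innermost
    if italic:
        text = f"*{text}*"
    if bold:
        text = f"**{text}**"  # outermost
    return text


print(wrap_inline("total", bold=True, italic=True, underline=True))
# ***<ins>total</ins>***
print(wrap_inline("total", italic=True))
# *total*
```

Keeping the HTML tag innermost means the markdown emphasis markers stay adjacent to each other, so a run that is both bold and italic renders as the usual `***...***`.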
@@ -0,0 +1,153 @@
"""Convert a Word .docx document to Markdown, preserving structure, inline
formatting and tables in their original document order."""
import os
from lxml import etree
from format_table_to_markdown import format_table_to_markdown
# XML namespace used by python-docx element tags
_W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
_TAG_P = f"{{{_W}}}p"
_TAG_TBL = f"{{{_W}}}tbl"
_TAG_TR = f"{{{_W}}}tr"
_TAG_TC = f"{{{_W}}}tc"
_TAG_R = f"{{{_W}}}r"
_TAG_T = f"{{{_W}}}t"
_TAG_RPR = f"{{{_W}}}rPr"
_TAG_B = f"{{{_W}}}b"
_TAG_I = f"{{{_W}}}i"
_TAG_U = f"{{{_W}}}u"
_TAG_PSTYLE = f"{{{_W}}}pStyle"
_TAG_PPR = f"{{{_W}}}pPr"
def _heading_level(paragraph) -> int:
    """Return heading level (1-6) if the paragraph is a heading, else 0."""
    pPr = paragraph._p.find(_TAG_PPR)
    if pPr is None:
        return 0
    pStyle = pPr.find(_TAG_PSTYLE)
    if pStyle is None:
        return 0
    val = pStyle.get(f"{{{_W}}}val", "")
    if val.lower().startswith("heading"):
        # Accept both "Heading 1" and the no-space form "Heading1"
        suffix = val[len("heading"):].strip()
        if suffix.isdecimal():
            # Markdown has only six heading levels, so clamp deeper Word levels
            return min(int(suffix), 6)
    return 0
def _run_to_md(run_elem) -> str:
    """Convert a single <w:r> element to a markdown-formatted string."""
    # Collect text from all <w:t> children
    text = "".join(t.text or "" for t in run_elem.findall(_TAG_T))
    if not text:
        return ""

    def _is_on(elem) -> bool:
        # A toggle such as <w:b/> is on unless w:val explicitly turns it off
        # (for example w:val="0" or w:val="false")
        if elem is None:
            return False
        return elem.get(f"{{{_W}}}val", "true") not in ("0", "false", "off")

    # Read formatting from <w:rPr>
    rPr = run_elem.find(_TAG_RPR)
    bold = italic = underline = False
    if rPr is not None:
        bold = _is_on(rPr.find(_TAG_B))
        italic = _is_on(rPr.find(_TAG_I))
        u_elem = rPr.find(_TAG_U)
        if u_elem is not None:
            u_val = u_elem.get(f"{{{_W}}}val", "")
            underline = u_val not in ("none", "")
    # Apply markdown formatting (innermost first: underline → italic → bold)
    if underline:
        text = f"<ins>{text}</ins>"
    if italic:
        text = f"*{text}*"
    if bold:
        text = f"**{text}**"
    return text
def _paragraph_to_md(paragraph) -> str:
"""Convert a python-docx Paragraph to a markdown string."""
level = _heading_level(paragraph)
runs_md = "".join(
_run_to_md(elem)
for elem in paragraph._p
if elem.tag == _TAG_R
)
if level:
return f"{'#' * level} {runs_md}"
return runs_md
def _table_to_md(table) -> str:
"""Convert a python-docx Table to a markdown table string."""
rows: list[list[str]] = []
for row in table.rows:
cells = []
for cell in row.cells:
# Join all paragraphs in the cell with a space
cell_text = " ".join(p.text for p in cell.paragraphs).strip()
cells.append(cell_text)
rows.append(cells)
return format_table_to_markdown(rows, has_header=True)
def docx_to_markdown(docx_path: str) -> str:
    """Convert a Word .docx document to Markdown.

    Preserves document structure (headings), inline formatting (bold, italic,
    underline) and tables in their original position.

    Args:
        docx_path: Absolute or relative path to the .docx file.

    Returns:
        Markdown string representing the document.

    Raises:
        FileNotFoundError: If the file does not exist.
        Exception: If the file cannot be parsed as a .docx document.
    """
    # Deferred so the module is importable without python-docx installed
    import docx
    from docx.text.paragraph import Paragraph

    if not os.path.exists(docx_path):
        raise FileNotFoundError(f"File not found: {docx_path}")
    doc = docx.Document(docx_path)
    # Build a mapping from the XML element id to the Table object for O(1) lookup
    table_map: dict[int, object] = {
        id(table._tbl): table for table in doc.tables
    }
    parts: list[str] = []
    for child in doc.element.body:
        if child.tag == _TAG_P:
            # Wrap in a temporary paragraph object to reuse _paragraph_to_md
            para = Paragraph(child, doc)
            md = _paragraph_to_md(para)
            if md.strip():
                parts.append(md)
        elif child.tag == _TAG_TBL:
            table = table_map.get(id(child))
            if table is not None:
                md = _table_to_md(table)
                if md:
                    parts.append(md)
    return "\n\n".join(parts)
@@ -0,0 +1,129 @@
"""Tests para docx_to_markdown."""
import os
import sys
import tempfile
import pytest
sys.path.insert(0, os.path.dirname(__file__))
import docx as python_docx
from docx_to_markdown import docx_to_markdown
def _make_docx(builder_fn) -> str:
"""Create a temporary .docx file using builder_fn(doc) and return its path."""
doc = python_docx.Document()
builder_fn(doc)
tmp = tempfile.NamedTemporaryFile(suffix=".docx", delete=False)
doc.save(tmp.name)
tmp.close()
return tmp.name
# ---------------------------------------------------------------------------
# Tests
# ---------------------------------------------------------------------------
def test_docx_con_headings_y_parrafos():
"""docx con headings y parrafos"""
def build(doc):
doc.add_heading("Titulo Principal", level=1)
doc.add_paragraph("Primer parrafo de contenido.")
doc.add_heading("Seccion", level=2)
doc.add_paragraph("Segundo parrafo.")
path = _make_docx(build)
try:
result = docx_to_markdown(path)
assert "# Titulo Principal" in result
assert "## Seccion" in result
assert "Primer parrafo de contenido." in result
assert "Segundo parrafo." in result
finally:
os.unlink(path)
def test_docx_con_tablas_intercaladas():
"""docx con tablas intercaladas"""
def build(doc):
doc.add_paragraph("Texto antes de la tabla.")
table = doc.add_table(rows=2, cols=3)
table.cell(0, 0).text = "Col1"
table.cell(0, 1).text = "Col2"
table.cell(0, 2).text = "Col3"
table.cell(1, 0).text = "a"
table.cell(1, 1).text = "b"
table.cell(1, 2).text = "c"
doc.add_paragraph("Texto despues de la tabla.")
path = _make_docx(build)
try:
result = docx_to_markdown(path)
# Table must appear BETWEEN the two paragraphs
before_idx = result.index("Texto antes de la tabla.")
table_idx = result.index("| Col1")
after_idx = result.index("Texto despues de la tabla.")
assert before_idx < table_idx < after_idx
assert "| Col2" in result
assert "| a" in result
finally:
os.unlink(path)
def test_docx_con_formato_bold_italic():
"""docx con formato bold/italic"""
def build(doc):
para = doc.add_paragraph()
run_bold = para.add_run("negrita")
run_bold.bold = True
run_normal = para.add_run(" texto normal ")
run_italic = para.add_run("cursiva")
run_italic.italic = True
path = _make_docx(build)
try:
result = docx_to_markdown(path)
assert "**negrita**" in result
assert "*cursiva*" in result
assert "texto normal" in result
finally:
os.unlink(path)
def test_docx_vacio():
"""docx vacio"""
def build(doc):
# python-docx adds a default empty paragraph; remove all content
# by just not adding anything — the default empty paragraph will
# produce an empty string that gets filtered out.
pass
path = _make_docx(build)
try:
result = docx_to_markdown(path)
# Empty document should produce empty or whitespace-only output
assert result.strip() == ""
finally:
os.unlink(path)
def test_archivo_no_encontrado():
"""archivo no encontrado lanza FileNotFoundError"""
with pytest.raises(FileNotFoundError):
docx_to_markdown("/tmp/nonexistent_file_fn_registry.docx")
if __name__ == "__main__":
test_docx_con_headings_y_parrafos()
test_docx_con_tablas_intercaladas()
test_docx_con_formato_bold_italic()
test_docx_vacio()
test_archivo_no_encontrado()
print("All tests passed.")
@@ -0,0 +1,52 @@
---
name: epub_to_markdown
kind: function
lang: py
domain: core
version: "1.0.0"
purity: impure
signature: "def epub_to_markdown(epub_path: str) -> str"
description: "Converts an EPUB ebook to Markdown. Tries ebooklib first for structured extraction (title, author, documents); falls back to manual extraction with zipfile if ebooklib is not installed."
tags: [epub, markdown, ebook, parsing, conversion, html, text-extraction]
uses_functions: []
uses_types: []
returns: []
returns_optional: false
error_type: "error_go_core"
imports: [zipfile, html, re, ebooklib]
tested: true
tests:
- "conversion de headings h1-h3"
- "conversion de bold e italic"
- "script y style se eliminan del output"
- "HTML entities se convierten a caracteres"
- "epub sin ebooklib extrae texto de archivos html"
- "epub con ebooklib incluye titulo y autor en el output"
- "epub corrupto lanza excepcion"
test_file_path: "python/functions/core/epub_to_markdown_test.py"
file_path: "python/functions/core/epub_to_markdown.py"
---
## Example
```python
md = epub_to_markdown("/path/to/book.epub")
print(md[:500])
# # Mi Libro
# **Author:** Ana Perez
#
# # Introduccion
# Primer parrafo...
```
## Notes
The HTML-to-Markdown conversion covers headings h1-h6, bold (`<strong>`/`<b>`), italic (`<em>`/`<i>`), paragraphs, and line breaks. It removes `<script>` and `<style>` blocks, unescapes HTML entities, and normalizes whitespace.
With ebooklib: extracts the DC metadata (title, author) from the OPF and processes only the `ITEM_DOCUMENT` items, in the order `get_items_of_type` returns them (the spine is not consulted).
Without ebooklib (ZIP fallback): lists the `.html`/`.xhtml`/`.htm` files in alphabetical order and extracts their content. No title/author metadata is available in this mode.
Optional dependency: `pip install ebooklib`. If it is not installed, the function still works via zipfile.
Conceptual reimplementation based on OpenViking `openviking/parse/parsers/epub.py` (AGPL-3.0). The code is original.
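
The ZIP fallback's file selection can be tried without a real book. The sketch below is illustrative only: `list_html_members` is a hypothetical helper, and the archive is built in memory instead of read from disk.

```python
import io
import zipfile


def list_html_members(epub_bytes: bytes) -> list[str]:
    """Return the HTML-like member names of a ZIP archive, sorted by name."""
    with zipfile.ZipFile(io.BytesIO(epub_bytes)) as zf:
        return sorted(
            name for name in zf.namelist()
            if name.lower().endswith((".html", ".xhtml", ".htm"))
        )


# Build a tiny archive in memory to exercise the helper
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("mimetype", "application/epub+zip")
    zf.writestr("ch10.xhtml", "<p>ten</p>")
    zf.writestr("ch2.xhtml", "<p>two</p>")

print(list_html_members(buf.getvalue()))
# ['ch10.xhtml', 'ch2.xhtml']
```

Plain alphabetical sorting places `ch10` before `ch2`, so in fallback mode the chapter order can differ from the book's reading order.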
@@ -0,0 +1,128 @@
"""Convert an EPUB file to markdown text."""
import re
import zipfile
from html import unescape
from html.parser import HTMLParser
def _remove_tags(html: str, tag: str) -> str:
"""Remove a tag and its content from HTML string."""
pattern = re.compile(rf'<{tag}[^>]*>.*?</{tag}>', re.IGNORECASE | re.DOTALL)
return pattern.sub('', html)
def _html_to_markdown(html: str) -> str:
    """Convert basic HTML to markdown.

    Handles headings, bold, italic, paragraphs, line breaks
    and strips remaining tags.

    Args:
        html: HTML string to convert.

    Returns:
        Markdown-formatted string.
    """
    flags = re.IGNORECASE | re.DOTALL
    # Remove script and style blocks
    text = _remove_tags(html, 'script')
    text = _remove_tags(text, 'style')
    # Headings h1-h6
    for level in range(6, 0, -1):
        hashes = '#' * level
        text = re.sub(
            rf'<h{level}\b[^>]*>(.*?)</h{level}>',
            lambda m, h=hashes: f'{h} {m.group(1).strip()}',
            text,
            flags=flags,
        )
    # The \b after each tag name stops <b> from matching <body> or <br>,
    # <i> from matching <img>, and <p> from matching <pre>
    # Bold
    text = re.sub(r'<strong\b[^>]*>(.*?)</strong>', r'**\1**', text, flags=flags)
    text = re.sub(r'<b\b[^>]*>(.*?)</b>', r'**\1**', text, flags=flags)
    # Italic
    text = re.sub(r'<em\b[^>]*>(.*?)</em>', r'*\1*', text, flags=flags)
    text = re.sub(r'<i\b[^>]*>(.*?)</i>', r'*\1*', text, flags=flags)
    # Paragraphs: append double newline after content
    text = re.sub(r'<p\b[^>]*>(.*?)</p>', lambda m: m.group(1).strip() + '\n\n', text, flags=flags)
    # Line breaks
    text = re.sub(r'<br\s*/?>', '\n', text, flags=re.IGNORECASE)
    # Strip remaining HTML tags
    text = re.sub(r'<[^>]+>', '', text)
    # Unescape HTML entities
    text = unescape(text)
    # Normalize whitespace: collapse multiple blank lines into two
    text = re.sub(r'\n{3,}', '\n\n', text)
    text = re.sub(r'[ \t]+', ' ', text)
    return text.strip()
def _epub_via_ebooklib(epub_path: str) -> str:
"""Extract markdown from EPUB using ebooklib."""
import ebooklib
from ebooklib import epub
book = epub.read_epub(epub_path)
# Metadata
title_meta = book.get_metadata('DC', 'title')
author_meta = book.get_metadata('DC', 'creator')
title = title_meta[0][0] if title_meta else 'Unknown Title'
author = author_meta[0][0] if author_meta else 'Unknown Author'
parts = [f'# {title}', f'**Author:** {author}']
for item in book.get_items_of_type(ebooklib.ITEM_DOCUMENT):
content = item.get_content().decode('utf-8', errors='replace')
md = _html_to_markdown(content)
if md:
parts.append(md)
return '\n\n'.join(parts)
def _epub_via_zipfile(epub_path: str) -> str:
"""Extract markdown from EPUB using zipfile (fallback)."""
parts = []
with zipfile.ZipFile(epub_path, 'r') as zf:
html_files = sorted(
name for name in zf.namelist()
if name.lower().endswith(('.html', '.xhtml', '.htm'))
)
for name in html_files:
raw = zf.read(name).decode('utf-8', errors='replace')
md = _html_to_markdown(raw)
if md:
parts.append(md)
return '\n\n'.join(parts)
def epub_to_markdown(epub_path: str) -> str:
"""Convert an EPUB ebook to markdown.
Attempts to use ebooklib for structured extraction (title, author,
document items). Falls back to manual ZIP extraction if ebooklib is
not installed.
Args:
epub_path: Path to the .epub file.
Returns:
Markdown string with the book content.
Raises:
Exception: If the file cannot be read or is not a valid EPUB.
"""
try:
return _epub_via_ebooklib(epub_path)
except ImportError:
return _epub_via_zipfile(epub_path)
@@ -0,0 +1,163 @@
"""Tests para epub_to_markdown."""
import io
import os
import struct
import sys
import zipfile
import pytest
sys.path.insert(0, os.path.dirname(__file__))
from epub_to_markdown import _html_to_markdown, _epub_via_zipfile, epub_to_markdown
# ---------------------------------------------------------------------------
# Helpers para construir EPUBs minimos en memoria
# ---------------------------------------------------------------------------
def _build_epub(files: dict[str, str]) -> str:
"""Crea un EPUB minimo como ZIP en disco y retorna el path."""
import tempfile
tmp = tempfile.NamedTemporaryFile(suffix='.epub', delete=False)
with zipfile.ZipFile(tmp, 'w') as zf:
for name, content in files.items():
zf.writestr(name, content)
tmp.close()
return tmp.name
def _build_epub_with_opf(title: str, author: str, body_html: str) -> str:
"""Crea un EPUB con OPF y un documento HTML valido para ebooklib."""
opf = f"""<?xml version='1.0' encoding='utf-8'?>
<package xmlns='http://www.idpf.org/2007/opf' unique-identifier='uid' version='2.0'>
<metadata xmlns:dc='http://purl.org/dc/elements/1.1/'>
<dc:title>{title}</dc:title>
<dc:creator>{author}</dc:creator>
<dc:identifier id='uid'>test-uid</dc:identifier>
<dc:language>en</dc:language>
</metadata>
<manifest>
<item id='ch1' href='chapter1.xhtml' media-type='application/xhtml+xml'/>
<item id='ncx' href='toc.ncx' media-type='application/x-dtbncx+xml'/>
</manifest>
<spine toc='ncx'>
<itemref idref='ch1'/>
</spine>
</package>"""
ncx = """<?xml version='1.0' encoding='utf-8'?>
<ncx xmlns='http://www.daisy.org/z3986/2005/ncx/' version='2005-1'>
<head><meta name='dtb:uid' content='test-uid'/></head>
<docTitle><text>Test</text></docTitle>
<navMap/>
</ncx>"""
chapter = f"""<?xml version='1.0' encoding='utf-8'?>
<!DOCTYPE html PUBLIC '-//W3C//DTD XHTML 1.1//EN' 'http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd'>
<html xmlns='http://www.w3.org/1999/xhtml'>
<head><title>Chapter</title></head>
<body>{body_html}</body>
</html>"""
return _build_epub({
'mimetype': 'application/epub+zip',
'META-INF/container.xml': """<?xml version='1.0'?>
<container version='1.0' xmlns='urn:oasis:names:tc:opendocument:xmlns:container'>
<rootfiles>
<rootfile full-path='content.opf' media-type='application/oebps-package+xml'/>
</rootfiles>
</container>""",
'content.opf': opf,
'toc.ncx': ncx,
'chapter1.xhtml': chapter,
})
# ---------------------------------------------------------------------------
# Tests de _html_to_markdown (pura, sin disco)
# ---------------------------------------------------------------------------
def test_html_heading_conversion():
"""conversion de headings h1-h3."""
html = '<h1>Titulo</h1><h2>Subtitulo</h2><h3>Seccion</h3>'
result = _html_to_markdown(html)
assert '# Titulo' in result
assert '## Subtitulo' in result
assert '### Seccion' in result
def test_html_bold_italic():
"""conversion de bold e italic."""
html = '<p><strong>negrita</strong> y <em>italica</em></p>'
result = _html_to_markdown(html)
assert '**negrita**' in result
assert '*italica*' in result
def test_html_script_style_removed():
"""script y style se eliminan del output."""
html = '<script>alert(1)</script><style>body{}</style><p>Contenido</p>'
result = _html_to_markdown(html)
assert 'alert' not in result
assert 'body{}' not in result
assert 'Contenido' in result
def test_html_entities_unescaped():
"""HTML entities se convierten a caracteres."""
html = '<p>Tom &amp; Jerry &lt;show&gt;</p>'
result = _html_to_markdown(html)
assert 'Tom & Jerry' in result
assert '<show>' in result
# ---------------------------------------------------------------------------
# Tests de epub_via_zipfile (sin ebooklib)
# ---------------------------------------------------------------------------
def test_epub_via_zipfile_extrae_html():
"""epub sin ebooklib extrae texto de archivos html."""
path = _build_epub({
'chapter.html': '<html><body><h1>Capitulo Uno</h1><p>Hola mundo.</p></body></html>',
})
try:
result = _epub_via_zipfile(path)
assert 'Capitulo Uno' in result
assert 'Hola mundo' in result
finally:
os.unlink(path)
# ---------------------------------------------------------------------------
# Tests de epub_to_markdown (integracion)
# ---------------------------------------------------------------------------
def test_epub_con_ebooklib_metadata():
"""epub con ebooklib incluye titulo y autor en el output."""
pytest.importorskip('ebooklib')
path = _build_epub_with_opf(
title='Mi Libro',
author='Ana Perez',
body_html='<h1>Introduccion</h1><p>Primer parrafo.</p>',
)
try:
result = epub_to_markdown(path)
assert '# Mi Libro' in result
assert 'Ana Perez' in result
assert 'Introduccion' in result
finally:
os.unlink(path)
def test_epub_corrupto_lanza_excepcion():
"""epub corrupto lanza Exception."""
import tempfile
tmp = tempfile.NamedTemporaryFile(suffix='.epub', delete=False)
tmp.write(b'esto no es un epub valido')
tmp.close()
try:
with pytest.raises(Exception):
epub_to_markdown(tmp.name)
finally:
os.unlink(tmp.name)
@@ -0,0 +1,37 @@
---
name: estimate_token_count
kind: function
lang: py
domain: core
version: "1.0.0"
purity: pure
signature: "def estimate_token_count(content: str) -> int"
description: "Fast token estimate without a tokenizer. CJK characters count as ~0.7 tokens each; other non-whitespace characters count as ~0.3 tokens each."
tags: [tokens, estimation, nlp, cjk, text]
uses_functions: []
uses_types: []
returns: []
returns_optional: false
error_type: ""
imports: [re]
tested: true
tests:
- "texto vacio retorna cero"
- "solo latin"
- "solo CJK"
- "texto mixto"
test_file_path: "python/functions/core/parse_markdown_test.py"
file_path: "python/functions/core/core.py"
---
## Example
```python
estimate_token_count("hello world") # 3 (10 non-whitespace chars * 0.3)
estimate_token_count("中文语") # 2 (3 * 0.7 = 2.1 -> 2)
estimate_token_count("") # 0
```
## Notes
Pure function with no external dependencies. Accuracy is approximate: useful as a context-limit guard before calling an LLM, not for exact BPE token counts. CJK range: `[\u4e00-\u9fff\u3040-\u30ff\uac00-\ud7af]` (CJK Unified Ideographs, Hiragana/Katakana, Hangul).
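
The heuristic can be sketched as follows. The real implementation lives in `core.py` and is not shown here, so this is a reconstruction under stated assumptions: `estimate_tokens_sketch` is a hypothetical name, and the rounding (floor, via integer arithmetic) is a guess that happens to agree with the examples above.

```python
import re

# Same character classes as the documented CJK range
_CJK = re.compile(r"[\u4e00-\u9fff\u3040-\u30ff\uac00-\ud7af]")


def estimate_tokens_sketch(content: str) -> int:
    """Rough token estimate: 0.7 per CJK char, 0.3 per other non-whitespace char."""
    cjk = len(_CJK.findall(content))
    non_ws = sum(1 for ch in content if not ch.isspace())
    other = non_ws - cjk
    # Integer arithmetic (weights 7/10 and 3/10) avoids float rounding surprises
    return (7 * cjk + 3 * other) // 10


print(estimate_tokens_sketch("hello world"))  # 3
print(estimate_tokens_sketch("中文语"))        # 2
print(estimate_tokens_sketch(""))             # 0
```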
@@ -0,0 +1,58 @@
---
name: excel_to_markdown
kind: function
lang: py
domain: core
version: "1.0.0"
purity: impure
signature: "excel_to_markdown(path: str, max_rows_per_sheet: int = 1000) -> str"
description: "Converts an Excel file (.xlsx, .xls, .xlsm) to Markdown with each sheet as an H2 section. Supported cell types: ISO dates, booleans, Excel errors, integers, and floats. Truncates sheets that exceed max_rows_per_sheet."
tags: [excel, markdown, xlsx, xls, conversion, parser, io]
uses_functions: []
uses_types: []
returns: []
returns_optional: false
error_type: "error_go_core"
imports: ["openpyxl", "xlrd"]
tested: true
tests:
- "xlsx con multiples sheets produce una seccion H2 por sheet"
- "sheet vacio produce nota de sheet vacio"
- "sheet truncado con nota de filas omitidas"
- "sheet con formulas data_only muestra valores calculados"
- "extension no soportada lanza ValueError"
- "archivo inexistente lanza FileNotFoundError"
- "dimensiones del sheet en metadata"
- "tabla markdown con formato correcto"
test_file_path: "python/functions/core/excel_to_markdown_test.py"
file_path: "python/functions/core/excel_to_markdown.py"
---
## Example
```python
from excel_to_markdown import excel_to_markdown
md = excel_to_markdown("report.xlsx")
print(md)
# ## Sheet: Ventas
#
# **Dimensions:** 101 x 4
#
# | Producto | Precio | Cantidad | Total |
# | --- | --- | --- | --- |
# | Manzana | 1 | 100 | 100 |
# ...
# Con limite de filas
md = excel_to_markdown("big_file.xlsx", max_rows_per_sheet=50)
```
## Notes
- `.xlsx` and `.xlsm`: uses `openpyxl` with `data_only=True` (reads computed values, not formulas).
- `.xls` (legacy): uses `xlrd`. Special cell types: EMPTY/BLANK → "", DATE → ISO 8601, BOOLEAN → "TRUE"/"FALSE", ERROR → Excel error code (#NULL!, #DIV/0!, etc.), NUMBER → integer when it has no fractional part.
- Dates without a time are formatted as `YYYY-MM-DD`; dates with a time as `YYYY-MM-DDTHH:MM:SS`.
- Pipes `|` inside cells are escaped as `\|`.
- If `xlwt` is not available, the .xls tests are skipped (xlwt is only needed to create fixtures, not to read files).
- Reimplemented from scratch, conceptually inspired by OpenViking (AGPL-3.0). No code was copied.
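
The cell-formatting rules listed above can be condensed into one small function. This is a simplified sketch, not the module's code: `format_cell` is a hypothetical helper that folds the type handling and the pipe escaping into a single step.

```python
import datetime


def format_cell(value) -> str:
    """Render one cell value following the rules in the notes (hypothetical helper)."""
    if value is None:
        return ""
    if isinstance(value, bool):  # check before numbers: bool is a subclass of int
        return "TRUE" if value else "FALSE"
    if isinstance(value, float) and value.is_integer():
        return str(int(value))   # 2.0 -> "2"
    if isinstance(value, datetime.datetime):
        if (value.hour, value.minute, value.second) == (0, 0, 0):
            return value.date().isoformat()   # date only
        return value.isoformat()              # date and time
    if isinstance(value, datetime.date):
        return value.isoformat()
    # Escape pipes and flatten newlines so the Markdown table stays intact
    return str(value).replace("|", "\\|").replace("\n", " ")


print(format_cell(datetime.datetime(2024, 1, 15)))         # 2024-01-15
print(format_cell(datetime.datetime(2024, 1, 15, 8, 30)))  # 2024-01-15T08:30:00
print(format_cell("a|b"))                                  # a\|b
```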
@@ -0,0 +1,211 @@
"""Convierte archivos Excel a Markdown con cada sheet como seccion H2."""
import os
from pathlib import Path
# Codigos de error Excel para xlrd
_XL_ERROR_CODES = {
0: "#NULL!",
7: "#DIV/0!",
15: "#VALUE!",
23: "#REF!",
29: "#NAME?",
36: "#NUM!",
42: "#N/A",
}
def _rows_to_markdown_table(rows: list[list[str]]) -> str:
"""Convierte filas de strings a tabla markdown."""
if not rows:
return ""
header = rows[0]
col_count = len(header)
# Normalizar todas las filas al mismo numero de columnas
normalized = []
for row in rows:
if len(row) < col_count:
row = row + [""] * (col_count - len(row))
normalized.append(row[:col_count])
# Escapar pipes en celdas
def escape(cell: str) -> str:
return cell.replace("|", "\\|").replace("\n", " ")
lines = []
# Header
lines.append("| " + " | ".join(escape(c) for c in normalized[0]) + " |")
# Separator
lines.append("| " + " | ".join("---" for _ in range(col_count)) + " |")
# Data rows
for row in normalized[1:]:
lines.append("| " + " | ".join(escape(c) for c in row) + " |")
return "\n".join(lines)
def _cell_value_xlrd(cell, workbook) -> str:
"""Convierte una celda xlrd a string segun su tipo."""
import xlrd
ctype = cell.ctype
if ctype in (xlrd.XL_CELL_EMPTY, xlrd.XL_CELL_BLANK):
return ""
elif ctype == xlrd.XL_CELL_DATE:
try:
dt = xlrd.xldate_as_datetime(cell.value, workbook.datemode)
if dt.hour == 0 and dt.minute == 0 and dt.second == 0:
return dt.date().isoformat()
return dt.isoformat()
except Exception:
return str(cell.value)
elif ctype == xlrd.XL_CELL_BOOLEAN:
return "TRUE" if cell.value else "FALSE"
elif ctype == xlrd.XL_CELL_ERROR:
return _XL_ERROR_CODES.get(int(cell.value), "#ERROR!")
elif ctype == xlrd.XL_CELL_NUMBER:
v = cell.value
if v == int(v):
return str(int(v))
return str(v)
elif ctype == xlrd.XL_CELL_TEXT:
return str(cell.value)
else:
return str(cell.value)
def _sheet_xlrd(sheet, workbook, max_rows: int) -> str:
"""Convierte un sheet xlrd a markdown."""
nrows = sheet.nrows
ncols = sheet.ncols
lines = []
lines.append(f"## Sheet: {sheet.name}")
lines.append("")
lines.append(f"**Dimensions:** {nrows} x {ncols}")
lines.append("")
if nrows == 0 or ncols == 0:
lines.append("*(empty sheet)*")
return "\n".join(lines)
display_rows = min(nrows, max_rows)
rows = []
for r in range(display_rows):
row_data = [_cell_value_xlrd(sheet.cell(r, c), workbook) for c in range(ncols)]
rows.append(row_data)
lines.append(_rows_to_markdown_table(rows))
if nrows > max_rows:
omitted = nrows - max_rows
lines.append("")
lines.append(f"*{omitted} rows omitted (max_rows_per_sheet={max_rows})*")
return "\n".join(lines)
def _cell_value_openpyxl(cell) -> str:
    """Convert an openpyxl cell to a string."""
    import datetime

    v = cell.value
    if v is None:
        return ""
    # bool must be checked before int: bool is a subclass of int
    if isinstance(v, bool):
        return "TRUE" if v else "FALSE"
    if isinstance(v, float):
        # is_integer() is False for nan/inf, where int(v) would raise
        return str(int(v)) if v.is_integer() else str(v)
    if isinstance(v, int):
        return str(v)
    # Dates and datetimes (datetime first: it is a subclass of date)
    if isinstance(v, datetime.datetime):
        if v.hour == 0 and v.minute == 0 and v.second == 0:
            return v.date().isoformat()
        return v.isoformat()
    if isinstance(v, datetime.date):
        return v.isoformat()
    return str(v)
def _sheet_openpyxl(ws, max_rows: int) -> str:
"""Convierte un worksheet openpyxl a markdown."""
all_rows = list(ws.iter_rows())
nrows = len(all_rows)
ncols = ws.max_column or 0
lines = []
lines.append(f"## Sheet: {ws.title}")
lines.append("")
lines.append(f"**Dimensions:** {nrows} x {ncols}")
lines.append("")
if nrows == 0 or ncols == 0:
lines.append("*(empty sheet)*")
return "\n".join(lines)
display_rows = min(nrows, max_rows)
rows = []
for row in all_rows[:display_rows]:
row_data = [_cell_value_openpyxl(cell) for cell in row]
rows.append(row_data)
lines.append(_rows_to_markdown_table(rows))
if nrows > max_rows:
omitted = nrows - max_rows
lines.append("")
lines.append(f"*{omitted} rows omitted (max_rows_per_sheet={max_rows})*")
return "\n".join(lines)
def excel_to_markdown(path: str, max_rows_per_sheet: int = 1000) -> str:
    """Convert an Excel file (.xlsx, .xls, .xlsm) to markdown.

    Each sheet becomes an H2 section and its rows are rendered as a
    markdown table. If a sheet has more rows than max_rows_per_sheet,
    it is truncated and a note is appended.

    Args:
        path: Path to the Excel file (.xlsx, .xls, .xlsm).
        max_rows_per_sheet: Maximum rows to include per sheet (default 1000).

    Returns:
        Markdown string covering every sheet in the file.

    Raises:
        FileNotFoundError: If the file does not exist.
        ValueError: If the extension is not supported.
        Exception: If the file cannot be read.
    """
    p = Path(path)
    if not p.exists():
        raise FileNotFoundError(f"File not found: {path}")
    ext = p.suffix.lower()
    if ext == ".xls":
        import xlrd
        wb = xlrd.open_workbook(path)
        sections = []
        for sheet_name in wb.sheet_names():
            sheet = wb.sheet_by_name(sheet_name)
            sections.append(_sheet_xlrd(sheet, wb, max_rows_per_sheet))
        return "\n\n".join(sections)
    elif ext in (".xlsx", ".xlsm"):
        import openpyxl
        wb = openpyxl.load_workbook(path, data_only=True)
        sections = []
        for ws in wb.worksheets:
            sections.append(_sheet_openpyxl(ws, max_rows_per_sheet))
        return "\n\n".join(sections)
    else:
        raise ValueError(f"Unsupported extension '{ext}'. Use .xlsx, .xls, or .xlsm.")
@@ -0,0 +1,142 @@
"""Tests para excel_to_markdown."""
import datetime
import os
import sys
import tempfile
import openpyxl
import pytest
sys.path.insert(0, os.path.dirname(__file__))
from excel_to_markdown import excel_to_markdown
def _make_xlsx(sheets: dict, filename: str) -> str:
"""Crea un archivo .xlsx temporal con los sheets dados."""
wb = openpyxl.Workbook()
first = True
for sheet_name, rows in sheets.items():
if first:
ws = wb.active
ws.title = sheet_name
first = False
else:
ws = wb.create_sheet(sheet_name)
for row in rows:
ws.append(row)
path = os.path.join(tempfile.mkdtemp(), filename)
wb.save(path)
return path
def test_xlsx_multiples_sheets():
"""xlsx con multiples sheets produce una seccion H2 por sheet."""
path = _make_xlsx(
{
"Ventas": [["Producto", "Precio", "Cantidad"], ["Manzana", 1.5, 100], ["Pera", 2.0, 50]],
"Resumen": [["Total", "Importe"], ["150", "225.0"]],
},
"multi.xlsx",
)
result = excel_to_markdown(path)
assert "## Sheet: Ventas" in result
assert "## Sheet: Resumen" in result
assert "Producto" in result
assert "Manzana" in result
assert "Total" in result
def test_sheet_vacio():
"""Sheet sin filas produce nota de sheet vacio."""
path = _make_xlsx({"Vacio": []}, "empty.xlsx")
result = excel_to_markdown(path)
assert "## Sheet: Vacio" in result
assert "empty sheet" in result
def test_sheet_truncado():
"""Sheet con mas filas que max_rows_per_sheet se trunca con nota."""
rows = [["col"]] + [[str(i)] for i in range(20)]
path = _make_xlsx({"Data": rows}, "big.xlsx")
result = excel_to_markdown(path, max_rows_per_sheet=5)
assert "omitted" in result
# 21 filas totales, 5 mostradas -> 16 omitidas
assert "16 rows omitted" in result
def test_sheet_con_formulas_data_only():
"""Archivo xlsx abierto con data_only=True muestra valores calculados (o None si no guardados)."""
wb = openpyxl.Workbook()
ws = wb.active
ws.title = "Formulas"
ws.append(["A", "B", "Suma"])
ws.append([1, 2, "=A2+B2"])
path = os.path.join(tempfile.mkdtemp(), "formulas.xlsx")
wb.save(path)
result = excel_to_markdown(path)
assert "## Sheet: Formulas" in result
# La celda formula puede ser None con data_only=True si no fue guardada con valor
assert "Suma" in result
def test_xls_legacy_con_fechas():
"""xls legacy: la funcion debe aceptar .xls (via xlrd) y manejar fechas."""
# Creamos un .xls usando xlwt si disponible, si no lo saltamos
pytest.importorskip("xlwt", reason="xlwt no disponible para crear .xls de prueba")
import xlwt
wb = xlwt.Workbook()
ws = wb.add_sheet("Fechas")
ws.write(0, 0, "Nombre")
ws.write(0, 1, "Fecha")
ws.write(1, 0, "Evento A")
date_format = xlwt.XFStyle()
date_format.num_format_str = "YYYY-MM-DD"
ws.write(1, 1, datetime.date(2024, 1, 15).toordinal() - 693594, date_format)
path = os.path.join(tempfile.mkdtemp(), "legacy.xls")
wb.save(path)
result = excel_to_markdown(path)
assert "## Sheet: Fechas" in result
assert "Evento A" in result
def test_extension_no_soportada():
"""Extension no soportada lanza ValueError."""
path = os.path.join(tempfile.mkdtemp(), "data.csv")
with open(path, "w") as f:
f.write("a,b\n1,2\n")
with pytest.raises(ValueError, match="Unsupported extension"):
excel_to_markdown(path)
def test_archivo_no_existe():
"""Archivo inexistente lanza FileNotFoundError."""
with pytest.raises(FileNotFoundError):
excel_to_markdown("/tmp/no_existe_para_nada.xlsx")
def test_dimensiones_en_metadata():
"""El markdown incluye dimensiones del sheet."""
path = _make_xlsx({"Hoja1": [["A", "B"], [1, 2], [3, 4]]}, "dims.xlsx")
result = excel_to_markdown(path)
assert "**Dimensions:**" in result
assert "3 x 2" in result
def test_tabla_markdown_formato():
"""La tabla tiene formato correcto con separador de header."""
path = _make_xlsx({"Datos": [["Col1", "Col2"], ["val1", "val2"]]}, "fmt.xlsx")
result = excel_to_markdown(path)
# Debe tener linea separadora con ---
assert "| --- |" in result or "| --- | --- |" in result
assert "Col1" in result
assert "val1" in result
@@ -0,0 +1,43 @@
---
name: extract_frontmatter
kind: function
lang: py
domain: core
version: "1.0.0"
purity: pure
signature: "def extract_frontmatter(content: str) -> tuple[str, dict | None]"
description: "Extracts YAML frontmatter (delimited by ---) from the start of a Markdown string. Returns the content without the frontmatter and the parsed dict (or None if there is none)."
tags: [markdown, frontmatter, yaml, parsing]
uses_functions: []
uses_types: []
returns: []
returns_optional: false
error_type: ""
imports: [re, yaml]
tested: true
tests:
- "contenido con frontmatter"
- "sin frontmatter retorna None"
- "frontmatter vacio"
- "frontmatter con listas"
test_file_path: "python/functions/core/parse_markdown_test.py"
file_path: "python/functions/core/core.py"
---
## Example
```python
content = "---\ntitle: Hello\nauthor: Alice\n---\n# Body\n"
remaining, data = extract_frontmatter(content)
# remaining = "# Body\n"
# data = {"title": "Hello", "author": "Alice"}
no_fm = "# Just markdown\n\nNo frontmatter."
remaining, data = extract_frontmatter(no_fm)
# remaining == no_fm
# data is None
```
## Notes
Pure function. Uses `yaml.safe_load` if PyYAML is available; otherwise falls back to a simple `key: value` parser. Frontmatter is recognized only at the very start of the string (position 0). The block must be delimited by an opening `---\n` and a closing `\n---\n`.
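
The delimiter rule can be sketched with the standard library alone. This is not the code in `core.py`: `split_frontmatter_sketch` is a hypothetical name, and it implements only the simple `key: value` fallback, not the `yaml.safe_load` path.

```python
import re
from typing import Optional

# Anchored at position 0: opening "---\n", lazy body, closing "\n---\n"
_FRONTMATTER = re.compile(r"\A---\n(.*?)\n---\n", re.DOTALL)


def split_frontmatter_sketch(content: str) -> tuple[str, Optional[dict]]:
    """Split leading frontmatter from the body using a key: value parser."""
    match = _FRONTMATTER.match(content)
    if match is None:
        return content, None
    data = {}
    for line in match.group(1).splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip lines without a colon
            data[key.strip()] = value.strip()
    return content[match.end():], data


body, meta = split_frontmatter_sketch("---\ntitle: Hello\nauthor: Alice\n---\n# Body\n")
print(body)  # "# Body" followed by a newline
print(meta)  # {'title': 'Hello', 'author': 'Alice'}
```

Because the pattern is anchored at the start, a `---` rule further down the document is never mistaken for frontmatter.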
@@ -0,0 +1,36 @@
---
name: extract_json_from_llm
kind: function
lang: py
domain: core
version: "1.0.0"
purity: pure
signature: "def extract_json_from_llm(content: str) -> dict"
description: "Extracts and parses JSON from LLM responses. Handles ```json blocks, trailing commas, and None->null."
tags: [json, llm, parsing, extraction]
uses_functions: []
uses_types: []
returns: []
returns_optional: false
error_type: ""
imports: [json]
tested: false
tests: []
test_file_path: ""
file_path: "python/functions/core/core.py"
source_repo: "https://github.com/VectifyAI/PageIndex"
source_license: "MIT"
source_file: "pageindex/utils.py"
---
## Example
```python
raw = '```json\n{"key": "value", "items": [1, 2, 3,]}\n```'
result = extract_json_from_llm(raw)
# {"key": "value", "items": [1, 2, 3]}
```
## Notes
Pure function. Handles common LLM output errors: trailing commas, `None` instead of `null`, extra whitespace. Returns an empty dict if the JSON cannot be recovered.
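
The repair steps named above could look roughly like this. It is a sketch under assumptions, not the PageIndex-derived implementation: `extract_json_sketch` is a hypothetical name, and the `None` replacement is deliberately naive (it would also rewrite the word inside string values).

```python
import json
import re

FENCE = "`" * 3  # triple backtick, built indirectly to keep this block readable


def extract_json_sketch(content: str) -> dict:
    """Best-effort JSON extraction from an LLM reply; {} when unrecoverable."""
    # Prefer the body of a json-tagged fenced block when one is present
    fenced = re.search(FENCE + r"json\s*(.*?)" + FENCE, content, re.DOTALL)
    text = (fenced.group(1) if fenced else content).strip()
    # Python-style None -> JSON null
    text = re.sub(r"\bNone\b", "null", text)
    # Drop trailing commas before a closing brace or bracket
    text = re.sub(r",\s*([}\]])", r"\1", text)
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError:
        return {}
    return parsed if isinstance(parsed, dict) else {}


raw = FENCE + 'json\n{"key": "value", "items": [1, 2, 3,]}\n' + FENCE
print(extract_json_sketch(raw))
# {'key': 'value', 'items': [1, 2, 3]}
```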
@@ -0,0 +1,36 @@
---
name: extract_markdown_headers
kind: function
lang: py
domain: core
version: "1.0.0"
purity: pure
signature: "def extract_markdown_headers(markdown_content: str) -> tuple[list[dict], list[str]]"
description: "Extracts all headers (h1-h6) from markdown with their level and line number, ignoring code blocks."
tags: [markdown, parsing, headers, extraction]
uses_functions: []
uses_types: []
returns: []
returns_optional: false
error_type: ""
imports: [re]
tested: false
tests: []
test_file_path: ""
file_path: "python/functions/core/core.py"
source_repo: "https://github.com/VectifyAI/PageIndex"
source_license: "MIT"
source_file: "pageindex/page_index_md.py"
---
## Example
```python
md = "# Title\n\nSome text\n\n## Section\n\n```\n# not a header\n```"
headers, lines = extract_markdown_headers(md)
# headers = [{"title": "Title", "level": 1, "line_num": 1}, {"title": "Section", "level": 2, "line_num": 5}]
```
## Notes
Pure function. Detects and skips code blocks (triple backtick). Returns a tuple: (list of headers, list of original lines).
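The fence-skipping scan can be sketched like this. It is an illustrative approximation under stated assumptions: `extract_headers_sketch` is a hypothetical name, only ATX-style `#` headers are matched, and a fence is any line whose stripped text starts with three backticks.

```python
import re

FENCE = "`" * 3  # triple backtick
_HEADER_RE = re.compile(r"^(#{1,6})\s+(.+?)\s*$")


def extract_headers_sketch(markdown_content: str):
    """Return (headers, lines); headers carry title, level and 1-based line_num."""
    lines = markdown_content.split("\n")
    headers = []
    in_code_block = False
    for line_num, line in enumerate(lines, start=1):
        if line.strip().startswith(FENCE):
            in_code_block = not in_code_block  # toggle on every fence line
            continue
        if in_code_block:
            continue  # '#' inside a code block is not a header
        match = _HEADER_RE.match(line)
        if match:
            headers.append(
                {"title": match.group(2), "level": len(match.group(1)), "line_num": line_num}
            )
    return headers, lines
```

Tracking a single boolean is enough because fenced blocks do not nest in this simplified model.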
@@ -0,0 +1,37 @@
---
name: extract_pdf_bookmarks
kind: function
lang: py
domain: core
version: "1.0.0"
purity: impure
signature: "def extract_pdf_bookmarks(pdf) -> list[dict]"
description: "Extracts the bookmark/outline structure from a PDF opened with pdfplumber. Returns a list of dicts with level (1-6), title, and page_num."
tags: [pdf, bookmarks, outlines, parsing, pdfplumber]
uses_functions: []
uses_types: []
returns: []
returns_optional: false
error_type: "error_go_core"
imports: [pdfplumber]
tested: false
tests: []
test_file_path: ""
file_path: "python/functions/core/extract_pdf_bookmarks.py"
---
## Example
```python
import pdfplumber
from extract_pdf_bookmarks import extract_pdf_bookmarks

with pdfplumber.open("document.pdf") as pdf:
    bookmarks = extract_pdf_bookmarks(pdf)
    for bm in bookmarks:
        print(f"{'#' * bm['level']} {bm['title']} (page {bm['page_num']})")
```
## Notes
Takes an already-open `pdfplumber.PDF` object (not a path). Builds an internal `objid -> page_number` mapping from `pdf.pages` to resolve outline destinations. The level is clamped to the range [1, 6] for markdown compatibility. Returns an empty list if the PDF has no outlines or if `get_outlines()` fails. Impure because it reads the internal state of an already-open PDF object.
@@ -0,0 +1,63 @@
"""Extract the bookmark/outline structure from a PDF opened with pdfplumber."""

import pdfplumber


def extract_pdf_bookmarks(pdf: pdfplumber.PDF) -> list[dict]:
    """Extract bookmarks/outlines from an open pdfplumber PDF object.

    Args:
        pdf: An open pdfplumber.PDF object.

    Returns:
        list[dict]: List of {"level": int, "title": str, "page_num": int | None}.
            Level is clamped to [1, 6]. Returns empty list if no outlines.
    """
    try:
        # Materialize the iterator so errors raised while walking the outline
        # tree are caught here and an empty outline is detectable.
        outlines = list(pdf.doc.get_outlines())
    except Exception:
        return []

    if not outlines:
        return []

    # Build objid -> page_number mapping
    objid_to_page: dict[int, int] = {}
    for i, page in enumerate(pdf.pages):
        obj = page.page_obj
        # pdfminer's PDFPage exposes its object id as `pageid`;
        # `objid` is kept as a fallback.
        page_id = getattr(obj, "pageid", None)
        if page_id is None:
            page_id = getattr(obj, "objid", None)
        if page_id is not None:
            objid_to_page[page_id] = i + 1  # 1-indexed page numbers

    bookmarks = []
    for item in outlines:
        try:
            level = item[0]  # integer level from get_outlines
            title = item[1]
            dest = item[2]  # destination: page reference or list

            # Clamp level to [1, 6]
            level = max(1, min(6, level))

            # Resolve destination to page number
            page_num = None
            if dest is not None:
                # In a list destination, dest[0] is the page reference
                target = dest[0] if isinstance(dest, list) and dest else dest
                try:
                    page_num = objid_to_page.get(target.objid)
                except Exception:
                    pass

            bookmarks.append({"level": level, "title": str(title), "page_num": page_num})
        except Exception:
            continue

    return bookmarks
@@ -0,0 +1,35 @@
---
name: extract_pdf_text
kind: function
lang: py
domain: core
version: "1.0.0"
purity: impure
signature: "def extract_pdf_text(pdf_path: str) -> str"
description: "Extracts all text from a PDF by concatenating every page. Uses PyPDF2."
tags: [pdf, text, extraction, parsing]
uses_functions: []
uses_types: []
returns: []
returns_optional: false
error_type: "error_go_core"
imports: [PyPDF2]
tested: false
tests: []
test_file_path: ""
file_path: "python/functions/core/extract_pdf_text.py"
source_repo: "https://github.com/VectifyAI/PageIndex"
source_license: "MIT"
source_file: "pageindex/utils.py"
---
## Example
```python
text = extract_pdf_text("/path/to/document.pdf")
print(len(text))  # total characters
```
## Notes
Requires `pip install PyPDF2`. Basic text extraction: does not handle OCR or scanned PDFs. For complex PDFs consider PyMuPDF.
@@ -0,0 +1,19 @@
"""Extract all text from a PDF file using PyPDF2."""

import PyPDF2


def extract_pdf_text(pdf_path: str) -> str:
    """Extract all text from a PDF file.

    Args:
        pdf_path: Path to the PDF file.

    Returns:
        str: Concatenated text from all pages.
    """
    pdf_reader = PyPDF2.PdfReader(pdf_path)
    # extract_text() may return None for pages without a text layer
    return "".join(page.extract_text() or "" for page in pdf_reader.pages)
@@ -0,0 +1,51 @@
---
name: extract_text_from_file
kind: function
lang: py
domain: core
version: "1.0.0"
purity: impure
signature: "def extract_text_from_file(file_path: str) -> str"
description: "Extracts plain text from a file. Supports PDF (PyMuPDF), Markdown, and TXT with automatic encoding detection."
tags: [text, pdf, markdown, txt, encoding, extraction, file, io]
uses_functions: []
uses_types: []
returns: []
returns_optional: false
error_type: "error_go_core"
imports: ["os", "fitz (PyMuPDF)", "charset_normalizer", "chardet"]
tested: true
tests:
- "PDF con texto extrae contenido correctamente"
- "archivo MD UTF-8 retorna contenido"
- "archivo TXT latin-1 detecta encoding"
- "archivo inexistente lanza FileNotFoundError"
- "extension no soportada lanza ValueError"
test_file_path: "python/functions/core/extract_text_from_file_test.py"
file_path: "python/functions/core/extract_text_from_file.py"
---
## Example
```python
# PDF
text = extract_text_from_file("report.pdf")

# Markdown
text = extract_text_from_file("README.md")

# TXT with unknown encoding
text = extract_text_from_file("notes.txt")
```
## Notes
For PDFs it uses PyMuPDF (`fitz`), which produces better text than PyPDF2, especially for PDFs with columns or complex layouts. Pages are joined with `\n\n`.

Encoding detection for text files follows this priority order:
1. Try UTF-8 directly
2. `charset_normalizer.from_bytes().best().encoding`
3. `chardet.detect(data)["encoding"]`
4. UTF-8 with `errors='replace'` as a last resort

Difference from `extract_pdf_text_py_core`: that function uses PyPDF2 and supports only PDF. This function uses PyMuPDF and also supports MD and TXT with encoding detection.
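To make the first and last steps of that priority order concrete, the stdlib-only sketch below shows what happens when neither detector library is available. The helper name `decode_with_last_resort` is hypothetical and skips steps 2 and 3 entirely.

```python
def decode_with_last_resort(data: bytes) -> str:
    """Try strict UTF-8 first; otherwise decode with replacement characters."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        # Undecodable bytes become U+FFFD instead of raising
        return data.decode("utf-8", errors="replace")


latin1_bytes = "café".encode("latin-1")  # b"caf\xe9" is not valid UTF-8
print(decode_with_last_resort(latin1_bytes))
print(decode_with_last_resort("café".encode("utf-8")))
```

The replacement path never fails, which is why it works as a final fallback, but it loses the original character; the detector steps exist to avoid reaching it.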
@@ -0,0 +1,92 @@
"""Extract plain text from PDF, Markdown, or TXT files."""

import os

SUPPORTED_EXTENSIONS = {".pdf", ".md", ".markdown", ".txt"}


def _detect_encoding(data: bytes) -> str:
    """Detect encoding of raw bytes using multiple fallback strategies."""
    # Strategy 1: UTF-8
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        pass

    # Strategy 2: charset_normalizer
    try:
        from charset_normalizer import from_bytes

        result = from_bytes(data).best()
        if result is not None and result.encoding:
            return result.encoding
    except ImportError:
        pass

    # Strategy 3: chardet
    try:
        import chardet

        detected = chardet.detect(data)
        if detected and detected.get("encoding"):
            return detected["encoding"]
    except ImportError:
        pass

    # Last resort: UTF-8 (the caller decodes with errors="replace")
    return "utf-8"


def extract_text_from_file(file_path: str) -> str:
    """Extract plain text from a file. Supports PDF, Markdown and TXT.

    For PDF files uses PyMuPDF (fitz) to extract text from each page,
    joining them with double newlines. For text-based files (.md, .markdown,
    .txt) reads the file with automatic encoding detection.

    Args:
        file_path: Absolute or relative path to the file.

    Returns:
        str: Extracted plain text content.

    Raises:
        FileNotFoundError: If the file does not exist.
        ValueError: If the file extension is not supported.
        ImportError: If PyMuPDF is not installed and a PDF is provided.
    """
    if not os.path.exists(file_path):
        raise FileNotFoundError(f"File not found: {file_path}")

    _, ext = os.path.splitext(file_path.lower())

    if ext == ".pdf":
        try:
            import fitz  # PyMuPDF
        except ImportError as e:
            raise ImportError(
                "PyMuPDF is required for PDF extraction. "
                "Install it with: pip install PyMuPDF"
            ) from e
        # Context manager closes the document once the text has been read
        with fitz.open(file_path) as doc:
            pages = [page.get_text() for page in doc]
        return "\n\n".join(pages)

    elif ext in {".md", ".markdown", ".txt"}:
        with open(file_path, "rb") as f:
            raw = f.read()
        encoding = _detect_encoding(raw)
        try:
            return raw.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            return raw.decode("utf-8", errors="replace")

    else:
        raise ValueError(
            f"Unsupported file extension: '{ext}'. "
            f"Supported: {', '.join(sorted(SUPPORTED_EXTENSIONS))}"
        )
@@ -0,0 +1,83 @@
"""Tests para extract_text_from_file."""

import os
import sys
import tempfile

import pytest

sys.path.insert(0, os.path.dirname(__file__))
from extract_text_from_file import extract_text_from_file


def test_pdf_con_texto_extrae_contenido_correctamente():
    """PDF con texto extrae contenido correctamente."""
    try:
        import fitz
    except ImportError:
        pytest.skip("PyMuPDF no instalado")

    # Create a minimal in-memory PDF using PyMuPDF and write it to a temp file
    doc = fitz.open()
    page = doc.new_page()
    page.insert_text((72, 72), "Hello from PDF")

    with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as f:
        tmp_path = f.name

    try:
        doc.save(tmp_path)
        doc.close()
        result = extract_text_from_file(tmp_path)
        assert "Hello from PDF" in result
    finally:
        os.unlink(tmp_path)


def test_archivo_md_utf8_retorna_contenido():
    """archivo MD UTF-8 retorna contenido."""
    content = "# Titulo\n\nParrafo con texto UTF-8: cafe, senor, japon.\n"
    with tempfile.NamedTemporaryFile(
        suffix=".md", mode="wb", delete=False
    ) as f:
        f.write(content.encode("utf-8"))
        tmp_path = f.name

    try:
        result = extract_text_from_file(tmp_path)
        assert "# Titulo" in result
        assert "cafe" in result
    finally:
        os.unlink(tmp_path)


def test_archivo_txt_latin1_detecta_encoding():
    """archivo TXT latin-1 detecta encoding."""
    content = "Texto en latin-1: cafe, hotel, naive\n"
    with tempfile.NamedTemporaryFile(
        suffix=".txt", mode="wb", delete=False
    ) as f:
        f.write(content.encode("latin-1"))
        tmp_path = f.name

    try:
        result = extract_text_from_file(tmp_path)
        # The word "cafe" or similar should appear in the decoded result
        assert len(result) > 0
        assert "cafe" in result or "caf" in result
    finally:
        os.unlink(tmp_path)


def test_archivo_inexistente_lanza_filenotfounderror():
    """archivo inexistente lanza FileNotFoundError."""
    with pytest.raises(FileNotFoundError):
        extract_text_from_file("/tmp/no_existe_este_archivo_12345.txt")


def test_extension_no_soportada_lanza_valueerror():
    """extension no soportada lanza ValueError."""
    with tempfile.NamedTemporaryFile(suffix=".docx", delete=False) as f:
        f.write(b"fake docx content")
        tmp_path = f.name

    try:
        with pytest.raises(ValueError, match="Unsupported file extension"):
            extract_text_from_file(tmp_path)
    finally:
        os.unlink(tmp_path)
@@ -0,0 +1,50 @@
---
name: fetch_and_parse_url
kind: function
lang: py
domain: core
version: "1.0.0"
purity: impure
signature: "def fetch_and_parse_url(url: str, timeout: float = 30.0) -> str"
description: "Downloads a web page and converts it to markdown. Combines detect_url_type + HTML fetch + html_to_markdown in a single operation."
tags: [http, fetch, html, markdown, parse, url, scraping]
uses_functions:
- detect_url_type_py_core
- html_to_markdown_py_core
uses_types: []
returns: []
returns_optional: false
error_type: "error_go_core"
imports: ["httpx"]
tested: false
tests: []
test_file_path: ""
file_path: "python/functions/core/fetch_and_parse_url.py"
---
## Example
```python
from core.fetch_and_parse_url import fetch_and_parse_url

# Download and convert a web page
md = fetch_and_parse_url("https://example.com")
print(md)

# With a custom timeout
md = fetch_and_parse_url("https://en.wikipedia.org/wiki/Python", timeout=15.0)
```
## Notes
Algorithm:
1. `detect_url_type(url)` determines the content type (by pattern, extension, or HEAD request).
2. If it is `code_repository` → raises Exception (requires git clone, not an HTTP fetch).
3. If it is `pdf` → raises Exception (requires pdfminer/pypdf, not included).
4. `httpx.get(url)` downloads the content with follow_redirects.
5. If it is `webpage` or the Content-Type is HTML → `html_to_markdown(raw_html)`.
6. If it is `markdown`, `text`, or code → returns the text as-is.

Raises `Exception` with a descriptive message on any network failure or unsupported type.
Impure function: performs I/O (HTTP requests).
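The branching in steps 2, 3, 5 and 6 can be isolated from the network calls and expressed as a small pure function, which makes it easy to test. The helper `choose_action` and its return labels are hypothetical and are not part of the registry function.

```python
def choose_action(url_type: str, content_type: str = "") -> str:
    """Map a detected URL type and Content-Type header to a handling strategy."""
    if url_type in ("code_repository", "pdf"):
        return "reject"  # needs git clone or a PDF parser, not a plain fetch
    if url_type == "webpage" or "text/html" in content_type.lower():
        return "html_to_markdown"
    return "passthrough"  # markdown, text or code is returned as-is


print(choose_action("webpage"))
print(choose_action("text", "text/html; charset=utf-8"))
print(choose_action("markdown", "text/plain"))
```

Checking the Content-Type as well as the detected type covers URLs whose type is guessed from the extension but whose server actually returns HTML.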
@@ -0,0 +1,64 @@
"""Descarga una pagina web y la convierte a markdown."""

from __future__ import annotations


def fetch_and_parse_url(url: str, timeout: float = 30.0) -> str:
    """Download a web page and convert it to markdown.

    Detects the URL type with detect_url_type, downloads the content with
    httpx, and converts it to the appropriate format:
    - webpage: fetch HTML → html_to_markdown
    - markdown: returns the text as-is
    - text/code: returns the text as-is
    - pdf: raises Exception (requires an external dependency)
    - code_repository: raises Exception (requires cloning the repo)

    Args:
        url: URL to download and parse.
        timeout: Timeout in seconds for HTTP requests.

    Returns:
        Content of the URL in markdown format.

    Raises:
        Exception: If the download fails (timeout, DNS, HTTP error) or the
            URL type is not supported.
    """
    import httpx
    from detect_url_type import detect_url_type
    from html_to_markdown import html_to_markdown

    # Detect the URL type (may issue a HEAD request)
    url_type, _meta = detect_url_type(url, timeout=timeout)

    if url_type == "code_repository":
        raise Exception(
            f"fetch_and_parse_url: code_repository URLs require git clone, not supported. url={url!r}"
        )

    if url_type == "pdf":
        raise Exception(
            f"fetch_and_parse_url: PDF parsing requires external dependency (pdfminer/pypdf). url={url!r}"
        )

    # Fetch content via GET
    try:
        response = httpx.get(url, timeout=timeout, follow_redirects=True)
        response.raise_for_status()
    except httpx.HTTPStatusError as exc:
        raise Exception(
            f"fetch_and_parse_url: HTTP {exc.response.status_code} for {url!r}"
        ) from exc
    except Exception as exc:
        raise Exception(f"fetch_and_parse_url: request failed for {url!r}: {exc}") from exc

    content_type = response.headers.get("content-type", "").lower()
    raw_text = response.text

    if url_type == "webpage" or "text/html" in content_type:
        return html_to_markdown(raw_text)

    # markdown, text, or code files: return as-is
    return raw_text
