fix(fn-run): propagate stdout/stderr from library-style bash functions #1
@@ -0,0 +1,215 @@
# /extract-source: extract functions from a repo in sources/

You are a function-extraction agent. Your job is to analyze a repository cloned into `sources/` and extract reusable functions into the registry, following the rules in `.claude/rules/sources.md`.

---

## Argument

`$ARGUMENTS`: name of the directory in `sources/` (e.g. `MiroFish`, `OpenViking`). If it is not provided, list the directories available in `sources/` and ask the user to choose one.

---

## STEP 0: Validate the source

```bash
ls "sources/$ARGUMENTS/"
```

If it does not exist, abort. Verify that it has a compatible license (MIT, Apache 2.0, BSD, ISC, MPL-2.0, Unlicense). If it is AGPL or GPL, or has no license, **warn the user** and ask for confirmation before continuing.

Identify:
- **License**: read LICENSE/LICENSE.md/COPYING
- **Main language**: detect from files (`*.go`, `*.py`, `*.rs`, `*.ts`, `*.js`, `Cargo.toml`, `go.mod`, `pyproject.toml`, `package.json`)
- **Repo URL**: look in the README, .git/config, or package.json
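
The license check in this step could be sketched as a small shell helper. This is only a sketch under stated assumptions: the function name `classify_license` and the keyword patterns are hypothetical and not part of the registry, and a keyword match is a hint that still has to be confirmed by reading the license text.

```bash
#!/usr/bin/env bash
# classify_license (hypothetical helper): prints "permissive", "copyleft",
# or "unknown" for a source directory, based on keywords in its license file.
classify_license() {
    local dir="$1" file="" candidate
    for candidate in LICENSE LICENSE.md COPYING; do
        if [ -f "$dir/$candidate" ]; then
            file="$dir/$candidate"
            break
        fi
    done
    if [ -z "$file" ]; then
        # No license file found: treat as unknown so the user gets warned.
        echo "unknown"
        return 0
    fi
    # MPL is tested first because its text also mentions the GNU licenses.
    if grep -qi 'mozilla public license' "$file"; then
        echo "permissive"
    elif grep -qi 'general public license' "$file"; then
        # Matches GPL, LGPL, and AGPL texts.
        echo "copyleft"
    elif grep -qiE 'MIT License|Permission is hereby granted, free of charge|Apache License|Redistribution and use in source and binary forms|ISC License|unlicense' "$file"; then
        echo "permissive"
    else
        echo "unknown"
    fi
}
```

Anything other than `permissive` would map to the "warn the user and ask for confirmation" branch described above.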

---

## STEP 1: Review the manifest

Read `sources/sources.yaml` to see whether this repo already has previous extractions. If it does, list them for the user and ask whether to continue extracting more or to re-evaluate the existing ones.

---

## STEP 2: Explore the repository

Analyze the repo structure to identify **all candidate functions**, pure and impure. The goal is to maximize the extraction of useful code.

### What to look for (by category)

**A. Pure functions** (algorithms, transformations, calculations, validations):
- Parsers, encoders/decoders, formatters
- Mathematical, statistical, and financial algorithms
- Data transformations, filters, mappers
- Validations, sanitizations

**B. Impure functions** (I/O, network, external state):
- HTTP/API clients (REST, GraphQL, WebSocket)
- Filesystem operations (reading, writing, watching files)
- Database interactions (queries, migrations)
- Docker, cloud, and infrastructure operations
- Scraping, crawling, data collection
- Notifications, message sending

**C. Pipelines** (multi-step compositions):
- ETL flows (extract-transform-load)
- Setup/deploy/provision workflows
- Data-processing sequences
- Orchestrations that compose several functions

**D. Reusable types** (structs, enums, interfaces):
- Generic domain models
- Configuration types
- Well-defined interfaces/protocols

### Exploration strategy by language
- **Go**: `pkg/`, `internal/`, `utils/`, `lib/`, `cmd/` (exported functions, handlers, clients)
- **Python**: `src/`, `lib/`, `utils/`, `core/`, `api/` (functions, client classes, decorators)
- **Rust**: `crates/`, `src/lib.rs` (pub functions, implemented traits)
- **TypeScript/JS**: `src/`, `lib/`, `utils/`, `services/` (functions, hooks, components)
- **Bash**: `scripts/`, `bin/`, `tools/` (functions with a clear signature)

### What to ignore
- main() and CLI entry points (but extract the functions they invoke)
- Tests (but note which functions are well tested and mark them `tested: true`)
- Functions that depend on complex internal types that are **not adaptable**
- Code with heavy external dependencies that are not in fn_registry
- Config loaders hardcoded to a specific project

---

## STEP 3: Query the registry to avoid duplicates

Before proposing any function, search registry.db with FTS5:

```bash
# For each candidate, search for similar functions
sqlite3 registry.db "SELECT id, kind, purity, description FROM functions WHERE id IN (SELECT id FROM functions_fts WHERE functions_fts MATCH 'name:NAME* OR description:DESCRIPTION') ORDER BY name;"
```

If something similar already exists, discard the candidate or note that it is an improvement/variant.

---

## STEP 4: Present candidates to the user

Group the candidates by category and show them in separate tables:

### Pure functions
| # | Proposed name | Origin (file) | Target lang | Domain | Description |
|---|---|---|---|---|---|

### Impure functions
| # | Proposed name | Origin (file) | Target lang | Domain | I/O type | Description |
|---|---|---|---|---|---|---|

(I/O type: HTTP, filesystem, DB, Docker, network, etc.)

### Pipelines (compositions)
| # | Proposed name | Origin (file) | Target lang | Domain | Functions composed | Description |
|---|---|---|---|---|---|---|

### Types
| # | Proposed name | Origin (file) | Target lang | Domain | Algebraic | Description |
|---|---|---|---|---|---|---|

For each candidate, indicate:
- Why it passes the quality filter
- Whether it requires adaptation (renaming types, removing dependencies, translating the language)
- Whether it is a translation from another language (e.g. Rust → Go)
- For impure functions: which `error_type` is appropriate

**Wait for user confirmation** before extracting. The user can:
- Approve all (`all`)
- Select by number (`1,3,5-8`)
- Select by category (`all pure`, `pipelines only`)
- Ask to explore more areas of the repo
- Discard and finish
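
The numeric selection syntax above (`all`, `1,3,5-8`) can be expanded mechanically. The following is a minimal sketch, assuming a hypothetical helper named `expand_selection` that takes the selection string and, for `all`, the total number of candidates; selection by category is free text and is not handled here.

```bash
#!/usr/bin/env bash
# expand_selection (hypothetical helper): expands "1,3,5-8" into
# "1 3 5 6 7 8". The keyword "all" expands to 1..total.
expand_selection() {
    local selection="${1// /}" total="${2:-0}"   # drop spaces, e.g. "1, 3"
    local -a out=()
    local part start end i

    if [ "$selection" = "all" ]; then
        for ((i = 1; i <= total; i++)); do out+=("$i"); done
        echo "${out[*]}"
        return 0
    fi

    local IFS=','                 # split the selection on commas
    for part in $selection; do
        if [[ "$part" =~ ^([0-9]+)-([0-9]+)$ ]]; then
            # 10# forces base 10 so "08" is not parsed as octal
            start=$((10#${BASH_REMATCH[1]}))
            end=$((10#${BASH_REMATCH[2]}))
            for ((i = start; i <= end; i++)); do out+=("$i"); done
        elif [[ "$part" =~ ^[0-9]+$ ]]; then
            out+=("$((10#$part))")
        else
            echo "expand_selection: invalid token '$part'" >&2
            return 1
        fi
    done
    IFS=' '                       # join the result with spaces
    echo "${out[*]}"
}
```

Rejecting unknown tokens instead of skipping them keeps a typo from silently approving the wrong candidates.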

---

## STEP 5: Extract approved functions

For each approved function:

### 5a. Determine destination and classification

| Nature | Destination | kind | purity |
|---|---|---|---|
| Pure algorithm/logic | Go/Python `functions/{domain}/` | function | pure |
| Function with I/O (HTTP, DB, fs) | Go/Python `functions/{domain}/` | function | impure |
| System script/utility | Bash `bash/functions/{domain}/` | function | impure |
| UI/component | TypeScript `frontend/functions/{domain}/` | component | n/a |
| Multi-step composition | `functions/pipelines/` or `python/functions/pipelines/` | pipeline | impure |
| C/Rust/other language | Translate to Go or Python, preserving semantics | case by case | case by case |

### 5b. Create files

1. **Code**: copy and adapt:
   - Rename to snake_case
   - Use native types in the signature (not the repo's internal types)
   - Remove external dependencies; use the stdlib
   - Match the target Go package (name = directory name)
   - If it is a translation, preserve the semantics and document the origin

2. **Metadata .md**: create complete frontmatter:
   - `source_repo`: URL of the original repo
   - `source_license`: license of the repo
   - `source_file`: relative path of the original file within the repo
   - All required fields for the type (function/pipeline/component)
   - Purity rules:
     - `pure` → `returns_optional: false` + `error_type: ""`
     - `impure` → `error_type: "error_go_core"` (or the Python equivalent)
     - `pipeline` → `purity: impure` + `uses_functions` listing the functions it composes
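
The purity rules listed above can be checked before indexing. This is a minimal sketch, assuming a hypothetical helper `check_purity_rules` that receives the four frontmatter values as arguments; it treats "impure requires a non-empty `error_type`" as an assumption, and it does not verify that a pipeline's `uses_functions` is populated.

```bash
#!/usr/bin/env bash
# check_purity_rules (hypothetical helper): validates the combination of
# kind, purity, returns_optional, and error_type from the frontmatter.
# Prints a reason to stderr and returns 1 when a rule is violated.
check_purity_rules() {
    local kind="$1" purity="$2" returns_optional="$3" error_type="$4"

    # Components carry no purity in the table from 5a, so skip them.
    if [ "$kind" = "component" ]; then
        return 0
    fi
    if [ "$kind" = "pipeline" ] && [ "$purity" != "impure" ]; then
        echo "pipeline must declare purity: impure" >&2
        return 1
    fi
    case "$purity" in
        pure)
            if [ "$returns_optional" != "false" ] || [ -n "$error_type" ]; then
                echo "pure requires returns_optional: false and error_type: \"\"" >&2
                return 1
            fi
            ;;
        impure)
            if [ -z "$error_type" ]; then
                echo "impure requires a non-empty error_type" >&2
                return 1
            fi
            ;;
        *)
            echo "unknown purity: $purity" >&2
            return 1
            ;;
    esac
}
```

A call such as `check_purity_rules function impure false error_go_core` succeeds, while a pure function that declares an `error_type` is rejected.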

### 5c. Verify integrity

```bash
# Index
./fn index

# Verify each extracted function
./fn show {id}
```

If the indexer reports errors, fix them before continuing.

---

## STEP 6: Update the manifest

Add the extracted functions to `sources/sources.yaml` under the corresponding repo:

```yaml
- repo: https://github.com/user/project
  license: MIT
  cloned_dir: directory_name
  extracted:
    - id: funcion_go_core
      source_file: pkg/utils.go
      date: YYYY-MM-DD  # today's date
```

If the repo does not exist in the manifest, create the complete entry.

---

## STEP 7: Summary

Show the user:
- Functions extracted successfully (with IDs)
- Functions discarded, and why
- Indexer warnings, if any
- Suggested areas of the repo that could be explored in the future

---

## Critical rules

- **NEVER extract without user approval**: always present candidates first
- **NEVER skip the quality filter**: if a function does not meet every criterion, it is not extracted
- **ALWAYS query registry.db** before proposing, to avoid duplicates
- **ALWAYS attribute**: source_repo, source_license, source_file in the .md
- **ALWAYS update sources.yaml**: it is the versioned manifest
- **Non-permissive licenses** (GPL, AGPL) require an explicit warning to the user
- **Language translation** is valid; document the origin clearly
@@ -14,12 +14,21 @@ An external function is extracted only if it meets ALL of these criteria:
- **Generic signature**: does not depend on internal types of the source repo or on hardcoded config
- **No global state**: does not use global variables, singletons, or init() with side effects
- **Minimal dependencies**: only stdlib or dependencies already present in fn_registry
- **Pure if possible**: if the function can be pure, it must be extracted as pure
- **No credentials**: contains no secrets, API keys, or absolute paths
- **Testable**: the logic must be verifiable with unit tests
- **Not duplicated**: query registry.db with FTS5 before extracting, to avoid duplicates
- **Compatible license**: the repo must have a permissive license (MIT, Apache 2.0, BSD, etc.)

### Purity classification when extracting

Extract both pure and impure functions. Correct classification is mandatory:

- **Pure**: no I/O, no mutable state, deterministic. Extract as `purity: pure`.
- **Impure**: performs I/O (network, disk, DB, HTTP), uses concurrency, or depends on external state. Extract as `purity: impure` with an appropriate `error_type`.
- **Pipeline**: composes multiple functions into a complete flow. Extract as `kind: pipeline`, always impure.

Do not discard useful functions just because they are impure. A function that makes HTTP requests, reads files, or interacts with databases is valuable if its signature is generic and reusable.

### Adaptation when extracting

- Rename to snake_case following the registry convention
@@ -44,6 +53,8 @@ SELECT id, source_repo, source_license FROM functions WHERE source_repo != '';

Any language can be analyzed as a source. The destination depends on the nature of the function:
- Pure algorithms/logic → Go (functions/{domain}/) or Python (python/functions/{domain}/)
- Impure functions (I/O, HTTP, DB) → Go or Python, depending on the domain
- System scripts/utilities → Bash (bash/functions/{domain}/)
- UI/frontend → TypeScript (frontend/functions/{domain}/)
- Multi-step flows → Pipeline in the most natural language
- C/Rust/others → Translate to Go or Python, preserving the original semantics

@@ -0,0 +1,32 @@
---
name: install_nbconvert
kind: function
lang: bash
domain: infra
version: "1.0.0"
purity: impure
signature: "install_nbconvert(project_dir: string) -> void"
description: "Installs nbconvert and playwright with chromium into an existing uv project. Idempotent: uv add does not reinstall packages that are already present."
tags: [jupyter, nbconvert, pdf, export, playwright, python, uv]
uses_functions: []
uses_types: []
returns: []
returns_optional: false
error_type: "error_go_core"
imports: []
tested: false
tests: []
test_file_path: ""
file_path: "bash/functions/infra/install_nbconvert.sh"
---

## Example

```bash
source install_nbconvert.sh
install_nbconvert /home/lucas/analysis/finanzas
```

## Notes

Requires the venv to exist already (run `init_uv_venv` first). Installing chromium via `uv run playwright install chromium` can take a while the first time. The playwright output is suppressed on success and shown only if there is an error.
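
The "suppress output on success, show it on error" behavior described in these notes can be isolated as a generic wrapper. This is a minimal sketch with a hypothetical helper name, `run_quiet`, that is not part of the registry.

```bash
#!/usr/bin/env bash
# run_quiet (hypothetical helper): runs a command, hiding its combined
# stdout/stderr on success and replaying it on stderr on failure.
run_quiet() {
    local output status=0
    # Declare first, assign second: `local output=$(...)` would report the
    # exit status of `local` itself and hide the command's failure.
    # The `|| status=$?` form also keeps this safe under `set -e`.
    output=$("$@" 2>&1) || status=$?
    if [ "$status" -ne 0 ]; then
        echo "$output" >&2
        return "$status"
    fi
}
```

A caller would write, for example, `run_quiet some_installer --flag`, where `some_installer` stands for any noisy command.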
@@ -0,0 +1,32 @@
# install_nbconvert
# ------------------
# Installs nbconvert and playwright with chromium into an existing uv project.
# Idempotent: uv add does not reinstall packages that are already present.
#
# USAGE (sourced):
#   source install_nbconvert.sh
#   install_nbconvert /path/to/project

install_nbconvert() {
    local project_dir="${1:-}"

    if [ -z "$project_dir" ]; then
        echo "install_nbconvert: project_dir is required" >&2
        return 1
    fi

    if [ ! -d "$project_dir/.venv" ]; then
        echo "install_nbconvert: no .venv in $project_dir; run init_uv_venv first" >&2
        return 1
    fi

    # Install nbconvert and playwright via uv add; stop here if it fails
    (cd "$project_dir" && uv add nbconvert playwright 2>&1) || return 1

    # Install chromium: capture the output and show it only on error
    local playwright_output
    if ! playwright_output=$(cd "$project_dir" && uv run playwright install chromium 2>&1); then
        echo "$playwright_output" >&2
        return 1
    fi
}
@@ -0,0 +1,37 @@
---
name: notebook_to_pdf
kind: function
lang: bash
domain: infra
version: "1.0.0"
purity: impure
signature: "notebook_to_pdf(project_dir: string, [pattern: string], [output_dir: string]) -> string"
description: "Converts Jupyter notebooks to PDF using nbconvert webpdf with chromium. Lists the generated PDFs when finished."
tags: [jupyter, notebook, pdf, export, nbconvert, playwright]
uses_functions: []
uses_types: []
returns: []
returns_optional: false
error_type: "error_go_core"
imports: []
tested: false
tests: []
test_file_path: ""
file_path: "bash/functions/infra/notebook_to_pdf.sh"
---

## Example

```bash
source notebook_to_pdf.sh

# With defaults (notebooks/*.ipynb -> notebooks/pdf/)
notebook_to_pdf /home/lucas/analysis/finanzas

# With a custom pattern and output_dir
notebook_to_pdf /home/lucas/analysis/finanzas "notebooks/01_*.ipynb" "exports/pdf/"
```

## Notes

Requires nbconvert and playwright with chromium to be installed (run `install_nbconvert` first). Uses the project's venv directly (`.venv/bin/jupyter`). The output_dir is relative to project_dir. Prints the generated PDFs with their paths when finished. Fails if no PDF is found in output_dir after the conversion.
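
The failure rule in these notes (ignore the converter's exit status, then require at least one PDF) can be sketched on its own. The helper names `count_files` and `convert_and_verify` are hypothetical and only illustrate the pattern; like the approach described here, the check verifies that a PDF is present, not that it is fresh, so a stale PDF from an earlier run would also satisfy it.

```bash
#!/usr/bin/env bash
# count_files (hypothetical helper): counts files under a directory whose
# names match a glob, without breaking on spaces or newlines in names.
count_files() {
    local dir="$1" name_glob="$2" count=0 path
    while IFS= read -r -d '' path; do
        count=$((count + 1))
    done < <(find "$dir" -type f -name "$name_glob" -print0 2>/dev/null)
    echo "$count"
}

# convert_and_verify (hypothetical helper): runs a converter command,
# ignores its exit status, and succeeds only if at least one PDF exists
# in the output directory afterwards.
convert_and_verify() {
    local out_dir="$1"
    shift
    mkdir -p "$out_dir"
    "$@" || true                  # the exit status is deliberately ignored
    [ "$(count_files "$out_dir" '*.pdf')" -gt 0 ]
}
```

Checking the artifact instead of the exit status trades strictness for robustness against tools that report warnings as failures.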
@@ -0,0 +1,59 @@
# notebook_to_pdf
# ----------------
# Converts Jupyter notebooks to PDF using nbconvert webpdf.
# Requires nbconvert and playwright with chromium to be installed.
#
# USAGE (sourced):
#   source notebook_to_pdf.sh
#   notebook_to_pdf /path/to/project
#   notebook_to_pdf /path/to/project "notebooks/*.ipynb" "notebooks/pdf/"

notebook_to_pdf() {
    local project_dir="${1:-}"
    local pattern="${2:-notebooks/*.ipynb}"
    local output_dir="${3:-notebooks/pdf/}"

    if [ -z "$project_dir" ]; then
        echo "notebook_to_pdf: project_dir is required" >&2
        return 1
    fi

    if [ ! -d "$project_dir/.venv" ]; then
        echo "notebook_to_pdf: no .venv in $project_dir" >&2
        return 1
    fi

    # Create the output directory if it does not exist
    mkdir -p "$project_dir/$output_dir"

    # Convert notebooks to PDF with nbconvert webpdf.
    # nbconvert can return a non-zero exit code for JSON validation warnings
    # that do not prevent PDF generation, so we ignore the exit code and
    # verify that PDFs exist. $pattern stays unquoted so the glob expands.
    local nbconvert_output
    nbconvert_output=$(cd "$project_dir" && \
        .venv/bin/jupyter nbconvert \
            --to webpdf \
            --allow-chromium-download \
            --output-dir="$output_dir" \
            $pattern 2>&1) || true

    echo "$nbconvert_output"

    # List generated PDFs
    echo ""
    echo "PDFs generated in ${project_dir}/${output_dir}:"
    local pdf_count=0 pdf
    while IFS= read -r -d '' pdf; do
        echo " $pdf"
        pdf_count=$((pdf_count + 1))
    done < <(find "$project_dir/$output_dir" -name "*.pdf" -print0 2>/dev/null)

    if [ "$pdf_count" -eq 0 ]; then
        echo " (none found; nbconvert may have failed)" >&2
        echo "$nbconvert_output" >&2
        return 1
    fi

    echo " Total: $pdf_count PDFs"
}
@@ -0,0 +1,48 @@
---
name: export_analysis_pdfs
kind: pipeline
lang: bash
domain: pipelines
version: "1.0.0"
purity: impure
signature: "export_analysis_pdfs(nombre: string, [pattern: string]) -> void"
description: "Exports all notebooks of a Jupyter analysis to PDF. Automatically installs nbconvert and playwright if they are not present."
tags: [pipeline, jupyter, pdf, export, nbconvert, launcher]
uses_functions:
  - assert_command_exists_bash_shell
  - install_nbconvert_bash_infra
  - notebook_to_pdf_bash_infra
uses_types: []
returns: []
returns_optional: false
error_type: "error_go_core"
imports: []
tested: false
tests: []
test_file_path: ""
file_path: "bash/functions/pipelines/export_analysis_pdfs.sh"
---

## Example

```bash
# Export all notebooks of an analysis
./export_analysis_pdfs.sh finanzas

# With a specific pattern
./export_analysis_pdfs.sh ml "notebooks/01_*.ipynb"

# Via fn run
fn run export_analysis_pdfs finanzas
fn run export_analysis_pdfs ml "notebooks/01_*.ipynb"
```

## Flow

1. `assert_command_exists uv`: verifies that uv is available
2. `install_nbconvert`: installs nbconvert and playwright with chromium (idempotent)
3. `notebook_to_pdf`: converts the notebooks matching the given pattern to PDF in `notebooks/pdf/`

## Notes

The analysis must already exist in `analysis/{nombre}/` with an initialized venv. PDFs are generated in `analysis/{nombre}/notebooks/pdf/` by default. The pipeline uses `set -euo pipefail`, so any failure stops execution.
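
The effect of `set -euo pipefail` on a sequence of steps can be seen in isolation. This is a toy sketch: `run_steps` and the inner `step` function are hypothetical stand-ins for the three pipeline stages, run in a child shell so the calling shell keeps its own options.

```bash
#!/usr/bin/env bash
# run_steps (hypothetical demo): runs three steps in a child shell under
# `set -euo pipefail`. The step whose number is passed in $1 fails, so the
# steps after it never run and the child exits non-zero.
run_steps() {
    local failing="${1:-none}"
    bash -c '
        set -euo pipefail
        step() {
            echo "step $1"
            if [ "$1" = "$2" ]; then return 1; fi
        }
        step 1 "$0"
        step 2 "$0"
        step 3 "$0"
        echo "done"
    ' "$failing"
}
```

Calling `run_steps 2` prints the first two step lines and stops, while `run_steps none` runs all three steps and prints `done`.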
@@ -0,0 +1,73 @@
#!/usr/bin/env bash
# export_analysis_pdfs
# ---------------------
# Pipeline that exports all notebooks of an analysis to PDF.
# Composes: assert_command_exists + install_nbconvert + notebook_to_pdf
#
# USAGE:
#   ./export_analysis_pdfs.sh <analysis_name> [pattern]
#
# EXAMPLES:
#   ./export_analysis_pdfs.sh finanzas
#   ./export_analysis_pdfs.sh ml "notebooks/01_*.ipynb"

set -euo pipefail

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
REGISTRY_ROOT="$(cd "$SCRIPT_DIR/../../.." && pwd)"

# Source the atomic functions
source "$REGISTRY_ROOT/bash/functions/shell/assert_command_exists.sh"
source "$REGISTRY_ROOT/bash/functions/infra/install_nbconvert.sh"
source "$REGISTRY_ROOT/bash/functions/infra/notebook_to_pdf.sh"

# ── Arguments ───────────────────────────────────────────────

NOMBRE="${1:-}"
if [ -z "$NOMBRE" ]; then
    echo "Usage: $0 <analysis_name> [pattern]" >&2
    echo " Example: $0 finanzas" >&2
    echo " Example: $0 ml 'notebooks/01_*.ipynb'" >&2
    exit 1
fi
shift
PATTERN="${1:-notebooks/*.ipynb}"

ANALYSIS_DIR="${REGISTRY_ROOT}/analysis/${NOMBRE}"

# Verify that the analysis exists
if [ ! -d "$ANALYSIS_DIR" ]; then
    echo "Error: analysis '${NOMBRE}' does not exist in ${ANALYSIS_DIR}" >&2
    exit 1
fi

echo ""
echo "════════════════════════════════════════════════════════════"
echo " EXPORT ANALYSIS PDFs: ${NOMBRE}"
echo " Directory: ${ANALYSIS_DIR}"
echo "════════════════════════════════════════════════════════════"
echo ""

# ── 1. Check tools ──────────────────────────────────────────

echo "[1/3] Checking tools..."
assert_command_exists uv
echo " OK"

# ── 2. Install nbconvert + playwright ───────────────────────

echo "[2/3] Installing export dependencies..."
install_nbconvert "$ANALYSIS_DIR"
echo " OK"

# ── 3. Convert notebooks to PDF ─────────────────────────────

echo "[3/3] Converting notebooks to PDF..."
notebook_to_pdf "$ANALYSIS_DIR" "$PATTERN"

# ── Summary ─────────────────────────────────────────────────

echo ""
echo "════════════════════════════════════════════════════════════"
echo " EXPORT COMPLETE"
echo "════════════════════════════════════════════════════════════"
@@ -0,0 +1,11 @@
package core

// CronSchedule represents a parsed cron expression with expanded field values.
type CronSchedule struct {
	Minute     []int
	Hour       []int
	DayOfMonth []int
	Month      []int
	DayOfWeek  []int
	Raw        string // original expression
}
@@ -0,0 +1,116 @@
package core

// JoinByKey joins two slices of map[string]any on a common key.
// It supports the four join types: inner, left, right, outer.
// Duplicate fields from the right side (other than the key) get the _right suffix.
// The algorithm is O(n+m): it indexes right by key, then iterates left.
func JoinByKey(left, right []map[string]any, key, how string) []map[string]any {
	// Find the fields that conflict between left and right
	leftFields := map[string]bool{}
	for _, row := range left {
		for k := range row {
			leftFields[k] = true
		}
	}
	rightFields := map[string]bool{}
	for _, row := range right {
		for k := range row {
			if k != key {
				rightFields[k] = true
			}
		}
	}
	conflicting := map[string]bool{}
	for k := range rightFields {
		if leftFields[k] {
			conflicting[k] = true
		}
	}

	// Index right by key (one key can have multiple rows)
	rightIndex := map[any][]map[string]any{}
	for _, row := range right {
		k := row[key]
		rightIndex[k] = append(rightIndex[k], row)
	}

	// Empty right template (every right field set to nil)
	emptyRight := func() map[string]any {
		m := map[string]any{}
		for k := range rightFields {
			if conflicting[k] {
				m[k+"_right"] = nil
			} else {
				m[k] = nil
			}
		}
		return m
	}

	merge := func(l, r map[string]any) map[string]any {
		out := map[string]any{}
		if l != nil {
			for k, v := range l {
				out[k] = v
			}
		}
		if r != nil {
			for k, v := range r {
				if k == key {
					continue
				}
				if conflicting[k] {
					out[k+"_right"] = v
				} else {
					out[k] = v
				}
			}
		}
		return out
	}

	matchedRightKeys := map[any]bool{}
	var result []map[string]any

	for _, l := range left {
		k := l[key]
		rRows, ok := rightIndex[k]
		if ok {
			matchedRightKeys[k] = true
			for _, r := range rRows {
				result = append(result, merge(l, r))
			}
		} else {
			if how == "left" || how == "outer" {
				row := merge(l, nil)
				for rk, rv := range emptyRight() {
					row[rk] = rv
				}
				result = append(result, row)
			}
		}
	}

	if how == "right" || how == "outer" {
		for _, r := range right {
			k := r[key]
			if !matchedRightKeys[k] {
				row := emptyRight()
				row[key] = k
				for rk, rv := range r {
					if rk == key {
						continue
					}
					if conflicting[rk] {
						row[rk+"_right"] = rv
					} else {
						row[rk] = rv
					}
				}
				result = append(result, row)
			}
		}
	}

	return result
}
@@ -0,0 +1,48 @@
---
name: join_by_key
kind: function
lang: go
domain: core
version: "1.0.0"
purity: pure
signature: "func JoinByKey(left, right []map[string]any, key, how string) []map[string]any"
description: "Joins two slices of map[string]any on a common key. Supports inner, left, right, and outer. Duplicate fields from the right side get the _right suffix. O(n+m) algorithm."
tags: [tabular, join, merge, go, core]
uses_functions: []
uses_types: []
returns: []
returns_optional: false
error_type: ""
imports: []
tested: true
tests:
  - "Inner join solo matches"
  - "Left join todos los left con nil para right sin match"
  - "Right join"
  - "Outer join"
  - "Campos duplicados con sufijo _right"
  - "Key ausente en alguna fila"
test_file_path: "functions/core/join_by_key_test.go"
file_path: "functions/core/join_by_key.go"
---

## Example

```go
left := []map[string]any{{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}}
right := []map[string]any{{"id": 1, "dept": "eng"}, {"id": 3, "dept": "sales"}}

result := JoinByKey(left, right, "id", "inner")
// [{"id": 1, "name": "Alice", "dept": "eng"}]

result = JoinByKey(left, right, "id", "left")
// [{"id": 1, "name": "Alice", "dept": "eng"},
//  {"id": 2, "name": "Bob", "dept": nil}]
```

## Notes

Pure function with no external dependencies.
The algorithm indexes right in O(n) and then iterates left in O(m), for a total of O(n+m).
Fields in right that collide with fields in left (except the key) are renamed with the _right suffix.
A key can have multiple rows in right; in that case the result contains multiple rows (relational join behavior).
@@ -0,0 +1,107 @@
package core

import "testing"

func TestJoinByKey(t *testing.T) {
	t.Run("Inner join solo matches", func(t *testing.T) {
		left := []map[string]any{{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}}
		right := []map[string]any{{"id": 1, "dept": "eng"}, {"id": 3, "dept": "sales"}}
		result := JoinByKey(left, right, "id", "inner")
		if len(result) != 1 {
			t.Fatalf("got %d rows, want 1", len(result))
		}
		if result[0]["id"] != 1 || result[0]["name"] != "Alice" || result[0]["dept"] != "eng" {
			t.Errorf("unexpected row: %v", result[0])
		}
	})

	t.Run("Left join todos los left con nil para right sin match", func(t *testing.T) {
		left := []map[string]any{{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}}
		right := []map[string]any{{"id": 1, "dept": "eng"}}
		result := JoinByKey(left, right, "id", "left")
		if len(result) != 2 {
			t.Fatalf("got %d rows, want 2", len(result))
		}
		var alice, bob map[string]any
		for _, r := range result {
			if r["id"] == 1 {
				alice = r
			} else {
				bob = r
			}
		}
		if alice["dept"] != "eng" {
			t.Errorf("alice dept = %v, want eng", alice["dept"])
		}
		if bob["dept"] != nil {
			t.Errorf("bob dept = %v, want nil", bob["dept"])
		}
	})

	t.Run("Right join", func(t *testing.T) {
		left := []map[string]any{{"id": 1, "name": "Alice"}}
		right := []map[string]any{{"id": 1, "dept": "eng"}, {"id": 2, "dept": "sales"}}
		result := JoinByKey(left, right, "id", "right")
		if len(result) != 2 {
			t.Fatalf("got %d rows, want 2", len(result))
		}
		var eng, sales map[string]any
		for _, r := range result {
			if r["id"] == 1 {
				eng = r
			} else {
				sales = r
			}
		}
		if eng["name"] != "Alice" {
			t.Errorf("eng name = %v, want Alice", eng["name"])
		}
		if sales["name"] != nil {
			t.Errorf("sales name = %v, want nil", sales["name"])
		}
	})

	t.Run("Outer join", func(t *testing.T) {
		left := []map[string]any{{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}}
		right := []map[string]any{{"id": 1, "dept": "eng"}, {"id": 3, "dept": "sales"}}
		result := JoinByKey(left, right, "id", "outer")
		ids := map[any]bool{}
		for _, r := range result {
			ids[r["id"]] = true
		}
		if len(ids) != 3 || !ids[1] || !ids[2] || !ids[3] {
			t.Errorf("outer join ids = %v, want {1, 2, 3}", ids)
		}
	})

	t.Run("Campos duplicados con sufijo _right", func(t *testing.T) {
		left := []map[string]any{{"id": 1, "name": "Alice", "score": 90}}
		right := []map[string]any{{"id": 1, "score": 85, "dept": "eng"}}
		result := JoinByKey(left, right, "id", "inner")
		if len(result) != 1 {
			t.Fatalf("got %d rows, want 1", len(result))
		}
		if result[0]["score"] != 90 {
			t.Errorf("score = %v, want 90", result[0]["score"])
		}
		if result[0]["score_right"] != 85 {
			t.Errorf("score_right = %v, want 85", result[0]["score_right"])
		}
		if result[0]["dept"] != "eng" {
			t.Errorf("dept = %v, want eng", result[0]["dept"])
		}
	})

	t.Run("Key ausente en alguna fila", func(t *testing.T) {
		left := []map[string]any{{"id": 1, "name": "Alice"}, {"name": "Bob"}} // Bob has no id
		right := []map[string]any{{"id": 1, "dept": "eng"}}
		result := JoinByKey(left, right, "id", "inner")
		// Only Alice matches (Bob has key=nil, and right has no nil key)
		if len(result) != 1 {
			t.Fatalf("got %d rows, want 1", len(result))
		}
		if result[0]["name"] != "Alice" {
			t.Errorf("name = %v, want Alice", result[0]["name"])
		}
	})
}
@@ -0,0 +1,116 @@
|
||||
package core
|
||||
|
||||
import (
|
||||
"time"
|
||||
)
|
||||
|
||||
// NextCronTime returns the next time.Time that satisfies schedule after the given time.
|
||||
// It advances minute by minute, skipping ahead when a field does not match.
|
||||
// Returns the zero value of time.Time if no match is found within 366 days (impossible schedule).
|
||||
func NextCronTime(schedule CronSchedule, after time.Time) time.Time {
|
||||
// Truncate to minute, then advance by 1 minute.
|
||||
t := after.Truncate(time.Minute).Add(time.Minute)
|
||||
|
||||
limit := after.Add(366 * 24 * time.Hour)
|
||||
|
||||
for t.Before(limit) {
|
||||
// Check month (1-12).
|
||||
if !intIn(int(t.Month()), schedule.Month) {
|
||||
// Advance to first day of next valid month.
|
||||
t = nextValidMonth(t, schedule.Month)
|
||||
if t.IsZero() {
|
||||
return time.Time{}
|
||||
}
|
||||
continue
|
||||
}
|
||||
|
||||
// Check day of month AND day of week. Traditional cron uses OR semantics when
// both fields are restricted (a match on either one is enough). For simplicity
// this implementation requires both to match; the two rules only differ when
// both fields are explicitly set, because a wildcard field matches every day.
|
||||
domOK := intIn(t.Day(), schedule.DayOfMonth)
|
||||
dowOK := intIn(int(t.Weekday()), schedule.DayOfWeek)
|
||||
if !domOK || !dowOK {
|
||||
// Advance to next day at midnight.
|
||||
t = time.Date(t.Year(), t.Month(), t.Day()+1, 0, 0, 0, 0, t.Location())
|
||||
continue
|
||||
}
|
||||
|
||||
// Check hour.
|
||||
if !intIn(t.Hour(), schedule.Hour) {
|
||||
// Advance to next valid hour.
|
||||
next := nextValidHour(t, schedule.Hour)
|
||||
if next.IsZero() {
|
||||
// No valid hour today; advance to tomorrow.
|
||||
t = time.Date(t.Year(), t.Month(), t.Day()+1, 0, 0, 0, 0, t.Location())
|
||||
} else {
|
||||
t = next
|
||||
}
|
||||
continue
|
||||
}
|
||||
|
||||
// Check minute.
|
||||
if !intIn(t.Minute(), schedule.Minute) {
|
||||
next := nextValidMinute(t, schedule.Minute)
|
||||
if next.IsZero() {
|
||||
// No more valid minutes this hour; advance to next hour.
|
||||
t = time.Date(t.Year(), t.Month(), t.Day(), t.Hour()+1, 0, 0, 0, t.Location())
|
||||
} else {
|
||||
t = next
|
||||
}
|
||||
continue
|
||||
}
|
||||
|
||||
// All fields match.
|
||||
return t
|
||||
}
|
||||
|
||||
return time.Time{}
|
||||
}
|
||||
|
||||
// intIn reports whether v is in s (a linear scan; s does not need to be sorted).
|
||||
func intIn(v int, s []int) bool {
|
||||
for _, x := range s {
|
||||
if x == v {
|
||||
return true
|
||||
}
|
||||
}
|
||||
return false
|
||||
}
|
||||
|
||||
// nextValidMonth advances t to the first moment of the next valid month.
// months must be sorted in ascending order, as ParseCronExpr produces it.
|
||||
func nextValidMonth(t time.Time, months []int) time.Time {
|
||||
month := int(t.Month())
|
||||
for _, m := range months {
|
||||
if m > month {
|
||||
return time.Date(t.Year(), time.Month(m), 1, 0, 0, 0, 0, t.Location())
|
||||
}
|
||||
}
|
||||
// Wrap to next year.
|
||||
if len(months) > 0 {
|
||||
return time.Date(t.Year()+1, time.Month(months[0]), 1, 0, 0, 0, 0, t.Location())
|
||||
}
|
||||
return time.Time{}
|
||||
}
|
||||
|
||||
// nextValidHour returns t at the next valid hour this day, or zero if none.
|
||||
func nextValidHour(t time.Time, hours []int) time.Time {
|
||||
h := t.Hour()
|
||||
for _, hh := range hours {
|
||||
if hh > h {
|
||||
return time.Date(t.Year(), t.Month(), t.Day(), hh, 0, 0, 0, t.Location())
|
||||
}
|
||||
}
|
||||
return time.Time{}
|
||||
}
|
||||
|
||||
// nextValidMinute returns t at the next valid minute this hour, or zero if none.
|
||||
func nextValidMinute(t time.Time, minutes []int) time.Time {
|
||||
m := t.Minute()
|
||||
for _, mm := range minutes {
|
||||
if mm > m {
|
||||
return time.Date(t.Year(), t.Month(), t.Day(), t.Hour(), mm, 0, 0, t.Location())
|
||||
}
|
||||
}
|
||||
return time.Time{}
|
||||
}
|
||||
@@ -0,0 +1,43 @@
|
||||
---
|
||||
name: next_cron_time
|
||||
kind: function
|
||||
lang: go
|
||||
domain: core
|
||||
version: "1.0.0"
|
||||
purity: pure
|
||||
signature: "func NextCronTime(schedule CronSchedule, after time.Time) time.Time"
|
||||
description: "Computes the next run time of a cron schedule after a given time. Checks fields from month down to minute, jumping past values that do not match. Returns the zero time if there is no match within 366 days (e.g. an impossible schedule)."
|
||||
tags: [cron, scheduling, time, next, pure]
|
||||
uses_functions: [parse_cron_expr_go_core]
|
||||
uses_types: [cron_schedule_go_core]
|
||||
returns: []
|
||||
returns_optional: false
|
||||
error_type: ""
|
||||
imports: [time]
|
||||
tested: true
|
||||
tests:
|
||||
- "0 * * * * desde :30 retorna la proxima hora en punto"
|
||||
- "@weekly desde viernes retorna proximo domingo a medianoche"
|
||||
- "0 9 * * 1-5 desde viernes retorna proximo lunes a las 9"
|
||||
- "schedule imposible retorna zero time"
|
||||
test_file_path: "functions/core/next_cron_time_test.go"
|
||||
file_path: "functions/core/next_cron_time.go"
|
||||
---
|
||||
|
||||
## Ejemplo
|
||||
|
||||
```go
|
||||
sched, _ := ParseCronExpr("0 * * * *")
|
||||
after := time.Date(2024, 1, 15, 14, 30, 0, 0, time.UTC)
|
||||
next := NextCronTime(sched, after)
|
||||
// next = 2024-01-15 15:00:00 UTC
|
||||
|
||||
weekdays, _ := ParseCronExpr("0 9 * * 1-5")
|
||||
friday := time.Date(2024, 1, 19, 10, 0, 0, 0, time.UTC) // Friday
|
||||
next2 := NextCronTime(weekdays, friday)
|
||||
// next2 = 2024-01-22 09:00:00 UTC (Monday)
|
||||
```
|
||||
|
||||
## Notas
|
||||
|
||||
Uses AND semantics for day_of_month and day_of_week: both fields must match. The 366-day limit prevents infinite loops on schedules that can never fire (e.g. February 30); a February 29 schedule also returns the zero time when no leap day falls inside that window. Returns the zero time instead of an error to preserve purity: a false/zero value is the Go idiom for optional returns without an error.
|
||||
@@ -0,0 +1,72 @@
|
||||
package core
|
||||
|
||||
import (
|
||||
"testing"
|
||||
"time"
|
||||
)
|
||||
|
||||
func TestNextCronTime(t *testing.T) {
|
||||
utc := time.UTC
|
||||
|
||||
t.Run("0 * * * * desde :30 retorna la proxima hora en punto", func(t *testing.T) {
|
||||
sched, err := ParseCronExpr("0 * * * *")
|
||||
if err != nil {
|
||||
t.Fatalf("unexpected error: %v", err)
|
||||
}
|
||||
after := time.Date(2024, 1, 15, 14, 30, 0, 0, utc)
|
||||
got := NextCronTime(sched, after)
|
||||
want := time.Date(2024, 1, 15, 15, 0, 0, 0, utc)
|
||||
if !got.Equal(want) {
|
||||
t.Errorf("got %v, want %v", got, want)
|
||||
}
|
||||
})
|
||||
|
||||
t.Run("@weekly desde viernes retorna proximo domingo a medianoche", func(t *testing.T) {
|
||||
// @weekly = "0 0 * * 0" (Sunday)
|
||||
sched, err := ParseCronExpr("@weekly")
|
||||
if err != nil {
|
||||
t.Fatalf("unexpected error: %v", err)
|
||||
}
|
||||
// 2024-01-19 is a Friday
|
||||
after := time.Date(2024, 1, 19, 10, 0, 0, 0, utc)
|
||||
got := NextCronTime(sched, after)
|
||||
// Next Sunday = 2024-01-21
|
||||
want := time.Date(2024, 1, 21, 0, 0, 0, 0, utc)
|
||||
if !got.Equal(want) {
|
||||
t.Errorf("got %v, want %v", got, want)
|
||||
}
|
||||
})
|
||||
|
||||
t.Run("0 9 * * 1-5 desde viernes retorna proximo lunes a las 9", func(t *testing.T) {
|
||||
sched, err := ParseCronExpr("0 9 * * 1-5")
|
||||
if err != nil {
|
||||
t.Fatalf("unexpected error: %v", err)
|
||||
}
|
||||
// 2024-01-19 is a Friday, after 9am so today is already past.
|
||||
after := time.Date(2024, 1, 19, 10, 0, 0, 0, utc)
|
||||
got := NextCronTime(sched, after)
|
||||
// Next weekday = Monday 2024-01-22
|
||||
want := time.Date(2024, 1, 22, 9, 0, 0, 0, utc)
|
||||
if !got.Equal(want) {
|
||||
t.Errorf("got %v, want %v", got, want)
|
||||
}
|
||||
})
|
||||
|
||||
t.Run("schedule imposible retorna zero time", func(t *testing.T) {
|
||||
// 30 Feb does not exist — will exhaust 366-day limit quickly for a specific year.
|
||||
// Use a schedule matching only Feb 30, which never occurs.
|
||||
sched := CronSchedule{
|
||||
Minute: []int{0},
|
||||
Hour: []int{0},
|
||||
DayOfMonth: []int{30},
|
||||
Month: []int{2},
|
||||
DayOfWeek: []int{0, 1, 2, 3, 4, 5, 6},
|
||||
Raw: "0 0 30 2 *",
|
||||
}
|
||||
after := time.Date(2023, 3, 1, 0, 0, 0, 0, utc)
|
||||
got := NextCronTime(sched, after)
|
||||
if !got.IsZero() {
|
||||
t.Errorf("expected zero time for impossible schedule, got %v", got)
|
||||
}
|
||||
})
|
||||
}
|
||||
@@ -0,0 +1,192 @@
|
||||
package core
|
||||
|
||||
import (
|
||||
"fmt"
|
||||
"strconv"
|
||||
"strings"
|
||||
)
|
||||
|
||||
// cronAliases maps cron shorthand expressions to their 5-field equivalents.
|
||||
var cronAliases = map[string]string{
|
||||
"@yearly": "0 0 1 1 *",
|
||||
"@annually": "0 0 1 1 *",
|
||||
"@monthly": "0 0 1 * *",
|
||||
"@weekly": "0 0 * * 0",
|
||||
"@daily": "0 0 * * *",
|
||||
"@midnight": "0 0 * * *",
|
||||
"@hourly": "0 * * * *",
|
||||
}
|
||||
|
||||
// cronFieldLimits defines the valid [min, max] range for each cron field.
|
||||
var cronFieldLimits = [5][2]int{
|
||||
{0, 59}, // minute
|
||||
{0, 23}, // hour
|
||||
{1, 31}, // day of month
|
||||
{1, 12}, // month
|
||||
{0, 6}, // day of week
|
||||
}
|
||||
|
||||
var cronFieldNames = [5]string{"minute", "hour", "day_of_month", "month", "day_of_week"}
|
||||
|
||||
// ParseCronExpr parses a standard 5-field cron expression into a CronSchedule.
|
||||
// Supports *, ranges (1-5), lists (1,3,5), steps (*/15), and aliases
// (@yearly, @annually, @monthly, @weekly, @daily, @midnight, @hourly).
|
||||
// Returns an error for invalid expressions or out-of-range values.
|
||||
func ParseCronExpr(expr string) (CronSchedule, error) {
|
||||
expr = strings.TrimSpace(expr)
|
||||
|
||||
// Resolve aliases.
|
||||
if expanded, ok := cronAliases[expr]; ok {
|
||||
expr = expanded
|
||||
}
|
||||
|
||||
fields := strings.Fields(expr)
|
||||
if len(fields) != 5 {
|
||||
return CronSchedule{}, fmt.Errorf("parse_cron_expr: expected 5 fields, got %d in %q", len(fields), expr)
|
||||
}
|
||||
|
||||
var result [5][]int
|
||||
for i, field := range fields {
|
||||
lo, hi := cronFieldLimits[i][0], cronFieldLimits[i][1]
|
||||
values, err := parseCronField(field, lo, hi)
|
||||
if err != nil {
|
||||
return CronSchedule{}, fmt.Errorf("parse_cron_expr: field %s: %w", cronFieldNames[i], err)
|
||||
}
|
||||
result[i] = values
|
||||
}
|
||||
|
||||
return CronSchedule{
|
||||
Minute: result[0],
|
||||
Hour: result[1],
|
||||
DayOfMonth: result[2],
|
||||
Month: result[3],
|
||||
DayOfWeek: result[4],
|
||||
Raw: strings.TrimSpace(strings.Join(fields, " ")),
|
||||
}, nil
|
||||
}
|
||||
|
||||
// parseCronField expands a single cron field token into the list of matching integers.
|
||||
func parseCronField(field string, lo, hi int) ([]int, error) {
|
||||
// Handle wildcard.
|
||||
if field == "*" {
|
||||
return rangeSlice(lo, hi), nil
|
||||
}
|
||||
|
||||
var values []int
|
||||
seen := make(map[int]bool)
|
||||
|
||||
// Handle comma-separated list.
|
||||
parts := strings.Split(field, ",")
|
||||
for _, part := range parts {
|
||||
expanded, err := parseCronPart(part, lo, hi)
|
||||
if err != nil {
|
||||
return nil, err
|
||||
}
|
||||
for _, v := range expanded {
|
||||
if !seen[v] {
|
||||
seen[v] = true
|
||||
values = append(values, v)
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
// Sort.
|
||||
sortInts(values)
|
||||
return values, nil
|
||||
}
|
||||
|
||||
// parseCronPart handles a single part: plain int, range (a-b), or step (*/n or a-b/n).
|
||||
func parseCronPart(part string, lo, hi int) ([]int, error) {
|
||||
// Step: */n or a-b/n
|
||||
if idx := strings.Index(part, "/"); idx != -1 {
|
||||
stepStr := part[idx+1:]
|
||||
step, err := strconv.Atoi(stepStr)
|
||||
if err != nil || step <= 0 {
|
||||
return nil, fmt.Errorf("invalid step %q", stepStr)
|
||||
}
|
||||
base := part[:idx]
|
||||
var start, end int
|
||||
if base == "*" {
|
||||
start, end = lo, hi
|
||||
} else if dashIdx := strings.Index(base, "-"); dashIdx != -1 {
|
||||
var err2 error
|
||||
start, end, err2 = parseRange(base, lo, hi)
|
||||
if err2 != nil {
|
||||
return nil, err2
|
||||
}
|
||||
} else {
|
||||
v, err2 := parseValue(base, lo, hi)
|
||||
if err2 != nil {
|
||||
return nil, err2
|
||||
}
|
||||
start, end = v, hi
|
||||
}
|
||||
var result []int
|
||||
for v := start; v <= end; v += step {
|
||||
result = append(result, v)
|
||||
}
|
||||
return result, nil
|
||||
}
|
||||
|
||||
// Range: a-b
|
||||
if strings.Contains(part, "-") {
|
||||
start, end, err := parseRange(part, lo, hi)
|
||||
if err != nil {
|
||||
return nil, err
|
||||
}
|
||||
return rangeSlice(start, end), nil
|
||||
}
|
||||
|
||||
// Plain integer.
|
||||
v, err := parseValue(part, lo, hi)
|
||||
if err != nil {
|
||||
return nil, err
|
||||
}
|
||||
return []int{v}, nil
|
||||
}
|
||||
|
||||
func parseRange(s string, lo, hi int) (int, int, error) {
|
||||
parts := strings.SplitN(s, "-", 2)
|
||||
if len(parts) != 2 {
|
||||
return 0, 0, fmt.Errorf("invalid range %q", s)
|
||||
}
|
||||
start, err := parseValue(parts[0], lo, hi)
|
||||
if err != nil {
|
||||
return 0, 0, err
|
||||
}
|
||||
end, err := parseValue(parts[1], lo, hi)
|
||||
if err != nil {
|
||||
return 0, 0, err
|
||||
}
|
||||
if start > end {
|
||||
return 0, 0, fmt.Errorf("range start %d > end %d", start, end)
|
||||
}
|
||||
return start, end, nil
|
||||
}
|
||||
|
||||
func parseValue(s string, lo, hi int) (int, error) {
|
||||
v, err := strconv.Atoi(s)
|
||||
if err != nil {
|
||||
return 0, fmt.Errorf("invalid value %q: not an integer", s)
|
||||
}
|
||||
if v < lo || v > hi {
|
||||
return 0, fmt.Errorf("value %d out of range [%d, %d]", v, lo, hi)
|
||||
}
|
||||
return v, nil
|
||||
}
|
||||
|
||||
func rangeSlice(lo, hi int) []int {
|
||||
s := make([]int, hi-lo+1)
|
||||
for i := range s {
|
||||
s[i] = lo + i
|
||||
}
|
||||
return s
|
||||
}
|
||||
|
||||
// sortInts is a simple insertion sort for small slices (avoids importing sort).
|
||||
func sortInts(a []int) {
|
||||
for i := 1; i < len(a); i++ {
|
||||
for j := i; j > 0 && a[j] < a[j-1]; j-- {
|
||||
a[j], a[j-1] = a[j-1], a[j]
|
||||
}
|
||||
}
|
||||
}
|
||||
@@ -0,0 +1,45 @@
|
||||
---
|
||||
name: parse_cron_expr
|
||||
kind: function
|
||||
lang: go
|
||||
domain: core
|
||||
version: "1.0.0"
|
||||
purity: pure
|
||||
signature: "func ParseCronExpr(expr string) (CronSchedule, error)"
|
||||
description: "Parses a standard 5-field cron expression into a CronSchedule with expanded values. Supports *, ranges (1-5), lists (1,3,5), steps (*/15), and aliases (@hourly, @daily, @weekly, @monthly, @yearly). Does not support Quartz-style seconds or year fields."
|
||||
tags: [cron, scheduling, parsing, time, pure]
|
||||
uses_functions: []
|
||||
uses_types: [cron_schedule_go_core]
|
||||
returns: []
|
||||
returns_optional: false
|
||||
error_type: ""
|
||||
imports: [fmt, strconv, strings]
|
||||
tested: true
|
||||
tests:
|
||||
- "*/15 expande minutos a [0 15 30 45]"
|
||||
- "@daily resuelve a 0 0 en todos los campos restantes"
|
||||
- "0 9 1,15 * * expande dias a [1 15]"
|
||||
- "0 9 * * 1-5 expande dia de semana a [1 2 3 4 5]"
|
||||
- "expresion con 4 campos retorna error"
|
||||
- "minuto fuera de rango retorna error"
|
||||
test_file_path: "functions/core/parse_cron_expr_test.go"
|
||||
file_path: "functions/core/parse_cron_expr.go"
|
||||
---
|
||||
|
||||
## Ejemplo
|
||||
|
||||
```go
|
||||
sched, err := ParseCronExpr("*/15 * * * *")
|
||||
// sched.Minute = [0, 15, 30, 45]
|
||||
// sched.Hour = [0, 1, ..., 23]
|
||||
|
||||
sched2, _ := ParseCronExpr("@daily")
|
||||
// sched2.Minute = [0], sched2.Hour = [0]
|
||||
|
||||
sched3, _ := ParseCronExpr("0 9 * * 1-5")
|
||||
// sched3.DayOfWeek = [1, 2, 3, 4, 5] (Monday to Friday)
|
||||
```
|
||||
|
||||
## Notas
|
||||
|
||||
Pure function. Each cron field is expanded into the full list of matching integer values. Aliases are resolved before parsing. The limits are: minute [0,59], hour [0,23], day_of_month [1,31], month [1,12], day_of_week [0,6] (0 = Sunday).
|
||||
@@ -0,0 +1,81 @@
|
||||
package core
|
||||
|
||||
import (
|
||||
"reflect"
|
||||
"testing"
|
||||
)
|
||||
|
||||
func TestParseCronExpr(t *testing.T) {
|
||||
t.Run("*/15 expande minutos a [0 15 30 45]", func(t *testing.T) {
|
||||
sched, err := ParseCronExpr("*/15 * * * *")
|
||||
if err != nil {
|
||||
t.Fatalf("unexpected error: %v", err)
|
||||
}
|
||||
want := []int{0, 15, 30, 45}
|
||||
if !reflect.DeepEqual(sched.Minute, want) {
|
||||
t.Errorf("Minute = %v, want %v", sched.Minute, want)
|
||||
}
|
||||
// Hour should be all 24 hours
|
||||
if len(sched.Hour) != 24 {
|
||||
t.Errorf("Hour len = %d, want 24", len(sched.Hour))
|
||||
}
|
||||
})
|
||||
|
||||
t.Run("@daily resuelve a 0 0 en todos los campos restantes", func(t *testing.T) {
|
||||
sched, err := ParseCronExpr("@daily")
|
||||
if err != nil {
|
||||
t.Fatalf("unexpected error: %v", err)
|
||||
}
|
||||
if !reflect.DeepEqual(sched.Minute, []int{0}) {
|
||||
t.Errorf("Minute = %v, want [0]", sched.Minute)
|
||||
}
|
||||
if !reflect.DeepEqual(sched.Hour, []int{0}) {
|
||||
t.Errorf("Hour = %v, want [0]", sched.Hour)
|
||||
}
|
||||
// DayOfMonth should be all days
|
||||
if len(sched.DayOfMonth) != 31 {
|
||||
t.Errorf("DayOfMonth len = %d, want 31", len(sched.DayOfMonth))
|
||||
}
|
||||
})
|
||||
|
||||
t.Run("0 9 1,15 * * expande dias a [1 15]", func(t *testing.T) {
|
||||
sched, err := ParseCronExpr("0 9 1,15 * *")
|
||||
if err != nil {
|
||||
t.Fatalf("unexpected error: %v", err)
|
||||
}
|
||||
if !reflect.DeepEqual(sched.Minute, []int{0}) {
|
||||
t.Errorf("Minute = %v, want [0]", sched.Minute)
|
||||
}
|
||||
if !reflect.DeepEqual(sched.Hour, []int{9}) {
|
||||
t.Errorf("Hour = %v, want [9]", sched.Hour)
|
||||
}
|
||||
if !reflect.DeepEqual(sched.DayOfMonth, []int{1, 15}) {
|
||||
t.Errorf("DayOfMonth = %v, want [1, 15]", sched.DayOfMonth)
|
||||
}
|
||||
})
|
||||
|
||||
t.Run("0 9 * * 1-5 expande dia de semana a [1 2 3 4 5]", func(t *testing.T) {
|
||||
sched, err := ParseCronExpr("0 9 * * 1-5")
|
||||
if err != nil {
|
||||
t.Fatalf("unexpected error: %v", err)
|
||||
}
|
||||
want := []int{1, 2, 3, 4, 5}
|
||||
if !reflect.DeepEqual(sched.DayOfWeek, want) {
|
||||
t.Errorf("DayOfWeek = %v, want %v", sched.DayOfWeek, want)
|
||||
}
|
||||
})
|
||||
|
||||
t.Run("expresion con 4 campos retorna error", func(t *testing.T) {
|
||||
_, err := ParseCronExpr("0 9 * *")
|
||||
if err == nil {
|
||||
t.Error("expected error for 4-field expression, got nil")
|
||||
}
|
||||
})
|
||||
|
||||
t.Run("minuto fuera de rango retorna error", func(t *testing.T) {
|
||||
_, err := ParseCronExpr("60 * * * *")
|
||||
if err == nil {
|
||||
t.Error("expected error for minute=60, got nil")
|
||||
}
|
||||
})
|
||||
}
|
||||
@@ -0,0 +1,233 @@
|
||||
package core
|
||||
|
||||
import (
|
||||
"fmt"
|
||||
"regexp"
|
||||
"strconv"
|
||||
"strings"
|
||||
)
|
||||
|
||||
// ValidateStructFields validates fields of a map against declarative rules.
|
||||
// Each rule is a comma-separated string like "required,type=string,min=1,max=100".
|
||||
//
|
||||
// Supported rules:
|
||||
// - required — field must exist and not be nil or ""
|
||||
// - type=string|int|float|bool — validate underlying Go type
|
||||
// - min=N, max=N — for numeric values
|
||||
// - minlen=N, maxlen=N — for string values
|
||||
// - oneof=a|b|c — value must be one of the listed options
|
||||
// - pattern=regex — for string values
|
||||
//
|
||||
// Returns (valid, errors). Errors accumulate — all fields are checked.
|
||||
func ValidateStructFields(data map[string]any, rules map[string]string) (bool, []string) {
|
||||
var errs []string
|
||||
|
||||
for field, ruleStr := range rules {
|
||||
parts := strings.Split(ruleStr, ",")
|
||||
for _, part := range parts {
|
||||
part = strings.TrimSpace(part)
|
||||
if part == "" {
|
||||
continue
|
||||
}
|
||||
|
||||
if err := applyRule(data, field, part); err != "" {
|
||||
errs = append(errs, err)
|
||||
// stop further checks on this field if required failed
|
||||
if part == "required" {
|
||||
break
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
return len(errs) == 0, errs
|
||||
}
|
||||
|
||||
// applyRule applies a single rule to a field and returns an error string or "".
|
||||
func applyRule(data map[string]any, field, rule string) string {
|
||||
switch {
|
||||
case rule == "required":
|
||||
val, ok := data[field]
|
||||
if !ok || val == nil {
|
||||
return fmt.Sprintf("%s: required field missing", field)
|
||||
}
|
||||
if s, ok := val.(string); ok && s == "" {
|
||||
return fmt.Sprintf("%s: required field is empty string", field)
|
||||
}
|
||||
return ""
|
||||
|
||||
case strings.HasPrefix(rule, "type="):
|
||||
expectedType := rule[len("type="):]
|
||||
val, ok := data[field]
|
||||
if !ok || val == nil {
|
||||
return "" // absence handled by required
|
||||
}
|
||||
return checkType(field, val, expectedType)
|
||||
|
||||
case strings.HasPrefix(rule, "min="):
|
||||
n, err := strconv.ParseFloat(rule[len("min="):], 64)
|
||||
if err != nil {
|
||||
return fmt.Sprintf("%s: invalid rule min value: %s", field, rule)
|
||||
}
|
||||
val, ok := data[field]
|
||||
if !ok || val == nil {
|
||||
return ""
|
||||
}
|
||||
f, ok := toFloat(val)
|
||||
if !ok {
|
||||
return fmt.Sprintf("%s: cannot apply min to non-numeric value", field)
|
||||
}
|
||||
if f < n {
|
||||
return fmt.Sprintf("%s: %v < min %v", field, val, n)
|
||||
}
|
||||
return ""
|
||||
|
||||
case strings.HasPrefix(rule, "max="):
|
||||
n, err := strconv.ParseFloat(rule[len("max="):], 64)
|
||||
if err != nil {
|
||||
return fmt.Sprintf("%s: invalid rule max value: %s", field, rule)
|
||||
}
|
||||
val, ok := data[field]
|
||||
if !ok || val == nil {
|
||||
return ""
|
||||
}
|
||||
f, ok := toFloat(val)
|
||||
if !ok {
|
||||
return fmt.Sprintf("%s: cannot apply max to non-numeric value", field)
|
||||
}
|
||||
if f > n {
|
||||
return fmt.Sprintf("%s: %v > max %v", field, val, n)
|
||||
}
|
||||
return ""
|
||||
|
||||
case strings.HasPrefix(rule, "minlen="):
|
||||
n, err := strconv.Atoi(rule[len("minlen="):])
|
||||
if err != nil {
|
||||
return fmt.Sprintf("%s: invalid rule minlen value: %s", field, rule)
|
||||
}
|
||||
val, ok := data[field]
|
||||
if !ok || val == nil {
|
||||
return ""
|
||||
}
|
||||
s, ok := val.(string)
|
||||
if !ok {
|
||||
return fmt.Sprintf("%s: cannot apply minlen to non-string value", field)
|
||||
}
|
||||
if len(s) < n {
|
||||
return fmt.Sprintf("%s: length %d < minlen %d", field, len(s), n)
|
||||
}
|
||||
return ""
|
||||
|
||||
case strings.HasPrefix(rule, "maxlen="):
|
||||
n, err := strconv.Atoi(rule[len("maxlen="):])
|
||||
if err != nil {
|
||||
return fmt.Sprintf("%s: invalid rule maxlen value: %s", field, rule)
|
||||
}
|
||||
val, ok := data[field]
|
||||
if !ok || val == nil {
|
||||
return ""
|
||||
}
|
||||
s, ok := val.(string)
|
||||
if !ok {
|
||||
return fmt.Sprintf("%s: cannot apply maxlen to non-string value", field)
|
||||
}
|
||||
if len(s) > n {
|
||||
return fmt.Sprintf("%s: length %d > maxlen %d", field, len(s), n)
|
||||
}
|
||||
return ""
|
||||
|
||||
case strings.HasPrefix(rule, "oneof="):
|
||||
options := strings.Split(rule[len("oneof="):], "|")
|
||||
val, ok := data[field]
|
||||
if !ok || val == nil {
|
||||
return ""
|
||||
}
|
||||
sval := fmt.Sprintf("%v", val)
|
||||
for _, opt := range options {
|
||||
if sval == opt {
|
||||
return ""
|
||||
}
|
||||
}
|
||||
return fmt.Sprintf("%s: value %q not in oneof [%s]", field, sval, rule[len("oneof="):])
|
||||
|
||||
case strings.HasPrefix(rule, "pattern="):
|
||||
pat := rule[len("pattern="):]
|
||||
val, ok := data[field]
|
||||
if !ok || val == nil {
|
||||
return ""
|
||||
}
|
||||
s, ok := val.(string)
|
||||
if !ok {
|
||||
return fmt.Sprintf("%s: cannot apply pattern to non-string value", field)
|
||||
}
|
||||
re, err := regexp.Compile(pat)
|
||||
if err != nil {
|
||||
return fmt.Sprintf("%s: invalid pattern %q: %v", field, pat, err)
|
||||
}
|
||||
if !re.MatchString(s) {
|
||||
return fmt.Sprintf("%s: value %q does not match pattern %q", field, s, pat)
|
||||
}
|
||||
return ""
|
||||
|
||||
default:
|
||||
return fmt.Sprintf("%s: unknown rule %q", field, rule)
|
||||
}
|
||||
}
|
||||
|
||||
func checkType(field string, val any, expected string) string {
|
||||
var ok bool
|
||||
switch expected {
|
||||
case "string":
|
||||
_, ok = val.(string)
|
||||
case "int":
|
||||
switch val.(type) {
|
||||
case int, int8, int16, int32, int64, uint, uint8, uint16, uint32, uint64:
|
||||
ok = true
|
||||
}
|
||||
case "float":
|
||||
switch val.(type) {
|
||||
case float32, float64:
|
||||
ok = true
|
||||
case int, int8, int16, int32, int64, uint, uint8, uint16, uint32, uint64:
|
||||
ok = true // integers are valid floats
|
||||
}
|
||||
case "bool":
|
||||
_, ok = val.(bool)
|
||||
default:
|
||||
return fmt.Sprintf("%s: unknown type rule %q", field, expected)
|
||||
}
|
||||
if !ok {
|
||||
return fmt.Sprintf("%s: expected type %s, got %T", field, expected, val)
|
||||
}
|
||||
return ""
|
||||
}
|
||||
|
||||
func toFloat(val any) (float64, bool) {
|
||||
switch v := val.(type) {
|
||||
case int:
|
||||
return float64(v), true
|
||||
case int8:
|
||||
return float64(v), true
|
||||
case int16:
|
||||
return float64(v), true
|
||||
case int32:
|
||||
return float64(v), true
|
||||
case int64:
|
||||
return float64(v), true
|
||||
case uint:
|
||||
return float64(v), true
|
||||
case uint8:
|
||||
return float64(v), true
|
||||
case uint16:
|
||||
return float64(v), true
|
||||
case uint32:
|
||||
return float64(v), true
|
||||
case uint64:
|
||||
return float64(v), true
|
||||
case float32:
|
||||
return float64(v), true
|
||||
case float64:
|
||||
return v, true
|
||||
}
|
||||
return 0, false
|
||||
}
|
||||
@@ -0,0 +1,64 @@
|
||||
---
|
||||
name: validate_struct_fields
|
||||
kind: function
|
||||
lang: go
|
||||
domain: core
|
||||
version: "1.0.0"
|
||||
purity: pure
|
||||
signature: "func ValidateStructFields(data map[string]any, rules map[string]string) (bool, []string)"
|
||||
description: "Validates the fields of a map[string]any against declarative rules such as 'required,min=1,max=100,type=string'. Supports required, type, min/max, minlen/maxlen, oneof, and pattern. Intended for validating entity metadata in operations.db or query results without defining Go structs. Accumulates all errors."
|
||||
tags: [validation, map, rules, pure, core, operations]
|
||||
uses_functions: []
|
||||
uses_types: []
|
||||
returns: []
|
||||
returns_optional: false
|
||||
error_type: ""
|
||||
imports: [fmt, regexp, strconv, strings]
|
||||
tested: true
|
||||
tests:
|
||||
- "campo required presente y ausente"
|
||||
- "type validation string como int falla"
|
||||
- "numeric ranges"
|
||||
- "string lengths"
|
||||
- "oneof validation"
|
||||
- "pattern matching"
|
||||
- "multiples reglas combinadas"
|
||||
- "map vacio con reglas required"
|
||||
test_file_path: "functions/core/validate_struct_fields_test.go"
|
||||
file_path: "functions/core/validate_struct_fields.go"
|
||||
---
|
||||
|
||||
## Ejemplo
|
||||
|
||||
```go
|
||||
data := map[string]any{
|
||||
"name": "Alice",
|
||||
"age": 30,
|
||||
"status": "active",
|
||||
"email": "alice@example.com",
|
||||
}
|
||||
|
||||
rules := map[string]string{
|
||||
"name": "required,type=string,minlen=2,maxlen=100",
|
||||
"age": "required,type=int,min=0,max=150",
|
||||
"status": "required,oneof=active|inactive|pending",
|
||||
"email": `required,type=string,pattern=^[^@]+@[^@]+$`,
|
||||
}
|
||||
|
||||
valid, errs := ValidateStructFields(data, rules)
|
||||
// valid = true, errs = []
|
||||
|
||||
data2 := map[string]any{"name": "A", "age": 200, "status": "deleted"}
|
||||
valid2, errs2 := ValidateStructFields(data2, rules)
|
||||
// valid2 = false
|
||||
// errs2 = [   (the order of fields may vary between calls)
|
||||
// "name: length 1 < minlen 2",
|
||||
// "age: 200 > max 150",
|
||||
// "status: value \"deleted\" not in oneof [active|inactive|pending]",
|
||||
// "email: required field missing",
|
||||
// ]
|
||||
```
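
The validator ranges over the `rules` map, and Go randomizes map iteration order, so errors for different fields can come back in a different order on each call. When a stable order matters (tests, logs), one option is to sort the returned slice. The sketch below uses a hard-coded slice as a stand-in for the validator's output; `stableErrors` is a hypothetical helper, not part of the package.

```go
package main

import (
	"fmt"
	"sort"
)

// stableErrors returns a sorted copy of errs, leaving the input untouched.
func stableErrors(errs []string) []string {
	out := append([]string(nil), errs...)
	sort.Strings(out)
	return out
}

func main() {
	// Stand-in for a validator result; the field order here is arbitrary.
	errs := []string{
		"status: value \"deleted\" not in oneof [active|inactive|pending]",
		"email: required field missing",
		"age: 200 > max 150",
		"name: length 1 < minlen 2",
	}
	// Prints the age, email, name, and status errors in that order.
	for _, e := range stableErrors(errs) {
		fmt.Println(e)
	}
}
```

Sorting lexically also reorders multiple errors that belong to the same field, so it suits comparisons better than preserving rule order.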
|
||||
|
||||
## Notas
|
||||
|
||||
Pure function. Uses only the standard library (fmt, regexp, strconv, strings). Within a field, rules are evaluated in the order written and all errors are accumulated; fields are visited in Go map iteration order, so the order of errors across fields is not guaranteed. If `required` fails, the remaining rules for that field are skipped to avoid false positives. Go types accepted for type=int: int, int8..int64, uint..uint64. type=float also accepts integers. minlen and maxlen count bytes, not characters. pattern compiles the regex on every call; for heavy use, cache the compiled regexps outside the function.
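
One way to follow the caching advice is a small memoizing wrapper around `regexp.Compile`. This is a minimal sketch under the assumption that callers validate patterns themselves before or alongside the validator; `patternCache`, `newPatternCache`, and `get` are hypothetical names and do not exist in the package.

```go
package main

import (
	"fmt"
	"regexp"
	"sync"
)

// patternCache memoizes compiled regular expressions by their source text.
type patternCache struct {
	mu sync.Mutex
	m  map[string]*regexp.Regexp
}

func newPatternCache() *patternCache {
	return &patternCache{m: make(map[string]*regexp.Regexp)}
}

// get compiles pat on first use and returns the cached value afterwards.
// Compilation errors are returned to the caller and are not cached.
func (c *patternCache) get(pat string) (*regexp.Regexp, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if re, ok := c.m[pat]; ok {
		return re, nil
	}
	re, err := regexp.Compile(pat)
	if err != nil {
		return nil, err
	}
	c.m[pat] = re
	return re, nil
}

func main() {
	cache := newPatternCache()
	a, _ := cache.get(`^[^@]+@[^@]+$`)
	b, _ := cache.get(`^[^@]+@[^@]+$`)
	fmt.Println(a == b)                             // true: same compiled instance
	fmt.Println(a.MatchString("alice@example.com")) // true
}
```

A mutex-guarded map keeps the sketch short; a compiled `*regexp.Regexp` is safe for concurrent use, so sharing the cached value across goroutines is fine.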
|
||||
@@ -0,0 +1,131 @@
|
||||
package core
|
||||
|
||||
import (
|
||||
"strings"
|
||||
"testing"
|
||||
)
|
||||
|
||||
func TestValidateStructFields(t *testing.T) {
|
||||
t.Run("campo required presente y ausente", func(t *testing.T) {
|
||||
rules := map[string]string{"name": "required"}
|
||||
|
||||
valid, errs := ValidateStructFields(map[string]any{"name": "Alice"}, rules)
|
||||
if !valid || len(errs) != 0 {
|
||||
t.Errorf("expected valid, got errors: %v", errs)
|
||||
}
|
||||
|
||||
valid2, errs2 := ValidateStructFields(map[string]any{}, rules)
|
||||
if valid2 || len(errs2) == 0 {
|
||||
t.Errorf("expected invalid for missing required field")
|
||||
}
|
||||
})
|
||||
|
||||
t.Run("type validation string como int falla", func(t *testing.T) {
|
||||
rules := map[string]string{"count": "type=int"}
|
||||
|
||||
valid, _ := ValidateStructFields(map[string]any{"count": 5}, rules)
|
||||
if !valid {
|
||||
t.Error("expected int 5 to pass type=int")
|
||||
}
|
||||
|
||||
valid2, errs2 := ValidateStructFields(map[string]any{"count": "five"}, rules)
|
||||
if valid2 || len(errs2) == 0 {
|
||||
t.Error("expected string to fail type=int")
|
||||
}
|
||||
})
|
||||
|
||||
t.Run("numeric ranges", func(t *testing.T) {
|
||||
rules := map[string]string{"score": "min=0,max=100"}
|
||||
|
||||
valid, _ := ValidateStructFields(map[string]any{"score": 50}, rules)
|
||||
if !valid {
|
||||
t.Error("expected 50 to pass min=0,max=100")
|
||||
}
|
||||
|
||||
valid2, errs2 := ValidateStructFields(map[string]any{"score": 150}, rules)
|
||||
if valid2 || !strings.Contains(errs2[0], "max") {
|
||||
t.Errorf("expected max violation, got: %v", errs2)
|
||||
}
|
||||
|
||||
valid3, errs3 := ValidateStructFields(map[string]any{"score": -1}, rules)
|
||||
if valid3 || !strings.Contains(errs3[0], "min") {
|
||||
t.Errorf("expected min violation, got: %v", errs3)
|
||||
}
|
||||
})
|
||||
|
||||
t.Run("string lengths", func(t *testing.T) {
|
||||
rules := map[string]string{"tag": "minlen=2,maxlen=10"}
|
||||
|
||||
valid, _ := ValidateStructFields(map[string]any{"tag": "go"}, rules)
|
||||
if !valid {
|
||||
t.Error("expected 'go' to pass minlen=2,maxlen=10")
|
||||
}
|
||||
|
||||
valid2, errs2 := ValidateStructFields(map[string]any{"tag": "a"}, rules)
|
||||
if valid2 || !strings.Contains(errs2[0], "minlen") {
|
||||
t.Errorf("expected minlen violation, got: %v", errs2)
|
||||
}
|
||||
|
||||
valid3, errs3 := ValidateStructFields(map[string]any{"tag": "averylongtag"}, rules)
|
||||
if valid3 || !strings.Contains(errs3[0], "maxlen") {
|
||||
t.Errorf("expected maxlen violation, got: %v", errs3)
|
||||
}
|
||||
})
|
||||
|
||||
t.Run("oneof validation", func(t *testing.T) {
|
||||
rules := map[string]string{"status": "oneof=active|inactive|pending"}
|
||||
|
||||
valid, _ := ValidateStructFields(map[string]any{"status": "active"}, rules)
|
||||
if !valid {
|
||||
t.Error("expected 'active' to pass oneof")
|
||||
}
|
||||
|
||||
valid2, errs2 := ValidateStructFields(map[string]any{"status": "deleted"}, rules)
|
||||
if valid2 || len(errs2) == 0 {
|
||||
t.Errorf("expected oneof violation, got: %v", errs2)
|
||||
}
|
||||
})
|
||||
|
||||
t.Run("pattern matching", func(t *testing.T) {
|
||||
rules := map[string]string{"email": `pattern=^[^@]+@[^@]+\.[^@]+$`}
|
||||
|
||||
valid, _ := ValidateStructFields(map[string]any{"email": "user@example.com"}, rules)
|
||||
if !valid {
|
||||
t.Error("expected valid email to pass pattern")
|
||||
}
|
||||
|
||||
valid2, errs2 := ValidateStructFields(map[string]any{"email": "not-an-email"}, rules)
|
||||
if valid2 || !strings.Contains(errs2[0], "pattern") {
|
||||
t.Errorf("expected pattern violation, got: %v", errs2)
|
||||
}
|
||||
})
|
||||
|
||||
t.Run("multiples reglas combinadas", func(t *testing.T) {
|
||||
rules := map[string]string{
|
||||
"name": "required,type=string,minlen=2,maxlen=50",
|
||||
"score": "required,type=float,min=0,max=10",
|
||||
}
|
||||
|
||||
valid, _ := ValidateStructFields(map[string]any{"name": "Alice", "score": float64(8.5)}, rules)
|
||||
if !valid {
|
||||
t.Error("expected all rules to pass")
|
||||
}
|
||||
|
||||
valid2, errs2 := ValidateStructFields(map[string]any{"name": "A", "score": float64(11)}, rules)
|
||||
if valid2 || len(errs2) < 2 {
|
||||
t.Errorf("expected at least 2 errors, got: %v", errs2)
|
||||
}
|
||||
})
|
||||
|
||||
t.Run("map vacio con reglas required", func(t *testing.T) {
|
||||
rules := map[string]string{
|
||||
"id": "required",
|
||||
"name": "required",
|
||||
}
|
||||
|
||||
valid, errs := ValidateStructFields(map[string]any{}, rules)
|
||||
if valid || len(errs) < 2 {
|
||||
t.Errorf("expected 2 required errors, got: %v", errs)
|
||||
}
|
||||
})
|
||||
}
|
||||
@@ -0,0 +1,96 @@
|
||||
package datascience
|
||||
|
||||
import "fmt"
|
||||
|
||||
// DiffEntities compares two snapshots of entities and returns field-level differences.
|
||||
// Detects added, removed, modified, and unchanged entities.
|
||||
// ignoreFields specifies fields to exclude from comparison (defaults to ["created_at", "updated_at"] when nil).
|
||||
func DiffEntities(before, after []map[string]any, key string, ignoreFields []string) map[string]any {
|
||||
if ignoreFields == nil {
|
||||
ignoreFields = []string{"created_at", "updated_at"}
|
||||
}
|
||||
|
||||
ignoreSet := make(map[string]bool, len(ignoreFields))
|
||||
for _, f := range ignoreFields {
|
||||
ignoreSet[f] = true
|
||||
}
|
||||
|
||||
beforeMap := make(map[string]map[string]any, len(before))
|
||||
for _, e := range before {
|
||||
if k, ok := e[key]; ok {
|
||||
beforeMap[fmt.Sprintf("%v", k)] = e
|
||||
}
|
||||
}
|
||||
|
||||
afterMap := make(map[string]map[string]any, len(after))
|
||||
for _, e := range after {
|
||||
if k, ok := e[key]; ok {
|
||||
afterMap[fmt.Sprintf("%v", k)] = e
|
||||
}
|
||||
}
|
||||
|
||||
added := []map[string]any{}
|
||||
for k, e := range afterMap {
|
||||
if _, exists := beforeMap[k]; !exists {
|
||||
added = append(added, e)
|
||||
}
|
||||
}
|
||||
|
||||
removed := []map[string]any{}
|
||||
for k, e := range beforeMap {
|
||||
if _, exists := afterMap[k]; !exists {
|
||||
removed = append(removed, e)
|
||||
}
|
||||
}
|
||||
|
||||
modified := []map[string]any{}
|
||||
unchanged := 0
|
||||
|
||||
for k, b := range beforeMap {
|
||||
a, exists := afterMap[k]
|
||||
if !exists {
|
||||
continue
|
||||
}
|
||||
|
||||
// Collect all fields from both entities
|
||||
allFields := make(map[string]bool)
|
||||
for f := range b {
|
||||
allFields[f] = true
|
||||
}
|
||||
for f := range a {
|
||||
allFields[f] = true
|
||||
}
|
||||
|
||||
changes := map[string]any{}
|
||||
for field := range allFields {
|
||||
if ignoreSet[field] || field == key {
|
||||
continue
|
||||
}
|
||||
oldVal := b[field]
|
||||
newVal := a[field]
|
||||
if fmt.Sprintf("%v", oldVal) != fmt.Sprintf("%v", newVal) {
|
||||
changes[field] = map[string]any{"old": oldVal, "new": newVal}
|
||||
}
|
||||
}
|
||||
|
||||
if len(changes) > 0 {
|
||||
modified = append(modified, map[string]any{"key": k, "changes": changes})
|
||||
} else {
|
||||
unchanged++
|
||||
}
|
||||
}
|
||||
|
||||
nAdded := len(added)
|
||||
nRemoved := len(removed)
|
||||
nModified := len(modified)
|
||||
summary := fmt.Sprintf("%d added, %d removed, %d modified, %d unchanged",
|
||||
nAdded, nRemoved, nModified, unchanged)
|
||||
|
||||
return map[string]any{
|
||||
"added": added,
|
||||
"removed": removed,
|
||||
"modified": modified,
|
||||
"unchanged": unchanged,
|
||||
"summary": summary,
|
||||
}
|
||||
}
|
||||
@@ -0,0 +1,52 @@
|
||||
---
|
||||
name: diff_entities
|
||||
kind: function
|
||||
lang: go
|
||||
domain: datascience
|
||||
version: "1.0.0"
|
||||
purity: pure
|
||||
signature: "func DiffEntities(before, after []map[string]any, key string, ignoreFields []string) map[string]any"
|
||||
description: "Compares two entity snapshots and returns field-level differences. Detects added, removed, modified, and unchanged entities. Ignores created_at and updated_at by default (pass nil to use the defaults)."
|
||||
tags: [datascience, diff, entities, operations, snapshot, comparison]
|
||||
uses_functions: []
|
||||
uses_types: []
|
||||
returns: []
|
||||
returns_optional: false
|
||||
error_type: ""
|
||||
imports: ["fmt"]
|
||||
tested: true
|
||||
tests:
|
||||
- "entity añadida"
|
||||
- "entity eliminada"
|
||||
- "entity modificada con detalle de campos"
|
||||
- "entities identicas → unchanged"
|
||||
- "ignore_fields funciona"
|
||||
- "lista vacia vs lista con datos"
|
||||
- "summary format correcto"
|
||||
test_file_path: "functions/datascience/diff_entities_test.go"
|
||||
file_path: "functions/datascience/diff_entities.go"
|
||||
---
|
||||
|
||||
## Example
|
||||
|
||||
```go
|
||||
before := []map[string]any{
|
||||
{"id": "1", "name": "Alice", "status": "active"},
|
||||
{"id": "2", "name": "Bob"},
|
||||
}
|
||||
after := []map[string]any{
|
||||
{"id": "1", "name": "Alice", "status": "inactive"},
|
||||
{"id": "3", "name": "Carol"},
|
||||
}
|
||||
result := DiffEntities(before, after, "id", nil)
|
||||
// result["summary"] = "1 added, 1 removed, 1 modified, 0 unchanged"
|
||||
// result["added"] = [{"id": "3", "name": "Carol"}]
|
||||
// result["removed"] = [{"id": "2", "name": "Bob"}]
|
||||
// result["modified"] = [{"key": "1", "changes": {"status": {"old": "active", "new": "inactive"}}}]
|
||||
```
|
||||
|
||||
## Notes
|
||||
|
||||
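Pure function. Values are compared with fmt.Sprintf("%v", ...) to handle heterogeneous types in map[string]any. The order of the entries in added, removed, and modified is not deterministic, because they are collected by iterating over Go maps.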
|
||||
A nil ignoreFields uses the defaults ["created_at", "updated_at"]. To ignore no fields at all, pass []string{}.
|
||||
Semantics are identical to diff_entities_py_datascience, which allows comparing results across runs of the same pipeline.
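The comparison rule described in these notes has a side effect worth seeing in isolation: two values count as equal whenever they print the same under `%v`, regardless of type. The sketch below is a minimal illustration, not part of the registry; `sameByFormat` is a hypothetical name that mirrors the comparison done inside `DiffEntities`.

```go
package main

import "fmt"

// sameByFormat is a hypothetical helper that mirrors the comparison used by
// DiffEntities: two values are equal when their %v renderings are equal.
func sameByFormat(a, b any) bool {
	return fmt.Sprintf("%v", a) == fmt.Sprintf("%v", b)
}

func main() {
	fmt.Println(sameByFormat(1, 1.0))       // true: int and float64 both print as 1
	fmt.Println(sameByFormat(1, "1"))       // true: int and string both print as 1
	fmt.Println(sameByFormat(nil, "<nil>")) // true: a missing field prints as <nil>
	fmt.Println(sameByFormat("a", "b"))     // false
}
```

This is convenient when one snapshot comes from JSON (numbers decoded as float64) and the other from SQL (integers), but it also means a change from the number 1 to the string "1" is not reported as a modification.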
|
||||
@@ -0,0 +1,138 @@
|
||||
package datascience
|
||||
|
||||
import (
|
||||
"testing"
|
||||
)
|
||||
|
||||
func TestDiffEntities(t *testing.T) {
|
||||
t.Run("entity añadida", func(t *testing.T) {
|
||||
before := []map[string]any{
|
||||
{"id": "1", "name": "Alice"},
|
||||
}
|
||||
after := []map[string]any{
|
||||
{"id": "1", "name": "Alice"},
|
||||
{"id": "2", "name": "Bob"},
|
||||
}
|
||||
result := DiffEntities(before, after, "id", nil)
|
||||
added := result["added"].([]map[string]any)
|
||||
if len(added) != 1 {
|
||||
t.Errorf("expected 1 added, got %d", len(added))
|
||||
}
|
||||
if added[0]["id"] != "2" {
|
||||
t.Errorf("expected added id=2, got %v", added[0]["id"])
|
||||
}
|
||||
if result["unchanged"].(int) != 1 {
|
||||
t.Errorf("expected 1 unchanged, got %v", result["unchanged"])
|
||||
}
|
||||
})
|
||||
|
||||
t.Run("entity eliminada", func(t *testing.T) {
|
||||
before := []map[string]any{
|
||||
{"id": "1", "name": "Alice"},
|
||||
{"id": "2", "name": "Bob"},
|
||||
}
|
||||
after := []map[string]any{
|
||||
{"id": "1", "name": "Alice"},
|
||||
}
|
||||
result := DiffEntities(before, after, "id", nil)
|
||||
removed := result["removed"].([]map[string]any)
|
||||
if len(removed) != 1 {
|
||||
t.Errorf("expected 1 removed, got %d", len(removed))
|
||||
}
|
||||
if removed[0]["id"] != "2" {
|
||||
t.Errorf("expected removed id=2, got %v", removed[0]["id"])
|
||||
}
|
||||
})
|
||||
|
||||
t.Run("entity modificada con detalle de campos", func(t *testing.T) {
|
||||
before := []map[string]any{
|
||||
{"id": "1", "name": "Alice", "status": "active"},
|
||||
}
|
||||
after := []map[string]any{
|
||||
{"id": "1", "name": "Alice", "status": "inactive"},
|
||||
}
|
||||
result := DiffEntities(before, after, "id", nil)
|
||||
modified := result["modified"].([]map[string]any)
|
||||
if len(modified) != 1 {
|
||||
t.Errorf("expected 1 modified, got %d", len(modified))
|
||||
}
|
||||
changes := modified[0]["changes"].(map[string]any)
|
||||
statusChange, ok := changes["status"].(map[string]any)
|
||||
if !ok {
|
||||
t.Fatalf("expected status change, got %v", changes)
|
||||
}
|
||||
if statusChange["old"] != "active" {
|
||||
t.Errorf("expected old=active, got %v", statusChange["old"])
|
||||
}
|
||||
if statusChange["new"] != "inactive" {
|
||||
t.Errorf("expected new=inactive, got %v", statusChange["new"])
|
||||
}
|
||||
})
|
||||
|
||||
t.Run("entities identicas → unchanged", func(t *testing.T) {
|
||||
entities := []map[string]any{
|
||||
{"id": "1", "name": "Alice"},
|
||||
{"id": "2", "name": "Bob"},
|
||||
}
|
||||
result := DiffEntities(entities, entities, "id", nil)
|
||||
if result["unchanged"].(int) != 2 {
|
||||
t.Errorf("expected 2 unchanged, got %v", result["unchanged"])
|
||||
}
|
||||
if len(result["added"].([]map[string]any)) != 0 {
|
||||
t.Errorf("expected 0 added")
|
||||
}
|
||||
if len(result["modified"].([]map[string]any)) != 0 {
|
||||
t.Errorf("expected 0 modified")
|
||||
}
|
||||
})
|
||||
|
||||
t.Run("ignore_fields funciona", func(t *testing.T) {
|
||||
before := []map[string]any{
|
||||
{"id": "1", "name": "Alice", "updated_at": "2024-01-01"},
|
||||
}
|
||||
after := []map[string]any{
|
||||
{"id": "1", "name": "Alice", "updated_at": "2024-06-01"},
|
||||
}
|
||||
// Default ignores updated_at
|
||||
result := DiffEntities(before, after, "id", nil)
|
||||
if result["unchanged"].(int) != 1 {
|
||||
t.Errorf("expected 1 unchanged (updated_at ignored), got %v", result["unchanged"])
|
||||
}
|
||||
modified := result["modified"].([]map[string]any)
|
||||
if len(modified) != 0 {
|
||||
t.Errorf("expected 0 modified when updated_at is ignored, got %d", len(modified))
|
||||
}
|
||||
})
|
||||
|
||||
t.Run("lista vacia vs lista con datos", func(t *testing.T) {
|
||||
before := []map[string]any{}
|
||||
after := []map[string]any{
|
||||
{"id": "1", "name": "Alice"},
|
||||
}
|
||||
result := DiffEntities(before, after, "id", nil)
|
||||
added := result["added"].([]map[string]any)
|
||||
if len(added) != 1 {
|
||||
t.Errorf("expected 1 added, got %d", len(added))
|
||||
}
|
||||
if result["unchanged"].(int) != 0 {
|
||||
t.Errorf("expected 0 unchanged")
|
||||
}
|
||||
})
|
||||
|
||||
t.Run("summary format correcto", func(t *testing.T) {
|
||||
before := []map[string]any{
|
||||
{"id": "1", "name": "Alice"},
|
||||
{"id": "3", "name": "Carol"},
|
||||
}
|
||||
after := []map[string]any{
|
||||
{"id": "1", "name": "Alice Changed"},
|
||||
{"id": "2", "name": "Bob"},
|
||||
}
|
||||
result := DiffEntities(before, after, "id", nil)
|
||||
summary := result["summary"].(string)
|
||||
expected := "1 added, 1 removed, 1 modified, 0 unchanged"
|
||||
if summary != expected {
|
||||
t.Errorf("expected summary %q, got %q", expected, summary)
|
||||
}
|
||||
})
|
||||
}
|
||||
@@ -0,0 +1,110 @@
|
||||
package datascience
|
||||
|
||||
// Pivot transforms data from long format to wide format (pivot table).
// It groups rows by index, expands the unique values of columns into new
// columns, and aggregates values with the given function.
// Supported aggregation functions: sum, count, mean, min, max, first, last.
// Any other agg value falls back to sum. Missing cells are filled with 0.
// Column values must be strings; a non-string or missing column value panics.
|
||||
func Pivot(rows []map[string]any, index, columns, values, agg string) []map[string]any {
|
||||
// Preserve the first-appearance order of index and column values
|
||||
indexOrder := []any{}
|
||||
seenIndex := map[any]bool{}
|
||||
colOrder := []any{}
|
||||
seenCols := map[any]bool{}
|
||||
|
||||
for _, row := range rows {
|
||||
idx := row[index]
|
||||
col := row[columns]
|
||||
if !seenIndex[idx] {
|
||||
seenIndex[idx] = true
|
||||
indexOrder = append(indexOrder, idx)
|
||||
}
|
||||
if !seenCols[col] {
|
||||
seenCols[col] = true
|
||||
colOrder = append(colOrder, col)
|
||||
}
|
||||
}
|
||||
|
||||
// Accumulate: groups[{idx, col}] = list of values
|
||||
type key struct{ idx, col any }
|
||||
groups := map[key][]any{}
|
||||
for _, row := range rows {
|
||||
idx := row[index]
|
||||
col := row[columns]
|
||||
val := row[values]
|
||||
if val != nil {
|
||||
k := key{idx, col}
|
||||
groups[k] = append(groups[k], val)
|
||||
}
|
||||
}
|
||||
|
||||
aggregate := func(vals []any, fn string) any {
|
||||
if len(vals) == 0 {
|
||||
return 0
|
||||
}
|
||||
switch fn {
|
||||
case "count":
|
||||
return len(vals)
|
||||
case "first":
|
||||
return vals[0]
|
||||
case "last":
|
||||
return vals[len(vals)-1]
|
||||
}
|
||||
// Numeric functions: sum, mean, min, max
|
||||
toFloat := func(v any) float64 {
|
||||
switch n := v.(type) {
|
||||
case float64:
|
||||
return n
|
||||
case float32:
|
||||
return float64(n)
|
||||
case int:
|
||||
return float64(n)
|
||||
case int64:
|
||||
return float64(n)
|
||||
case int32:
|
||||
return float64(n)
|
||||
}
|
||||
return 0
|
||||
}
|
||||
sum := 0.0
|
||||
mn := toFloat(vals[0])
|
||||
mx := toFloat(vals[0])
|
||||
for _, v := range vals {
|
||||
f := toFloat(v)
|
||||
sum += f
|
||||
if f < mn {
|
||||
mn = f
|
||||
}
|
||||
if f > mx {
|
||||
mx = f
|
||||
}
|
||||
}
|
||||
switch fn {
|
||||
case "sum":
|
||||
return sum
|
||||
case "mean":
|
||||
return sum / float64(len(vals))
|
||||
case "min":
|
||||
return mn
|
||||
case "max":
|
||||
return mx
|
||||
}
|
||||
return sum
|
||||
}
|
||||
|
||||
result := make([]map[string]any, 0, len(indexOrder))
|
||||
for _, idx := range indexOrder {
|
||||
record := map[string]any{index: idx}
|
||||
for _, col := range colOrder {
|
||||
k := key{idx, col}
|
||||
vals := groups[k]
|
||||
if len(vals) > 0 {
|
||||
record[col.(string)] = aggregate(vals, agg)
|
||||
} else {
|
||||
record[col.(string)] = 0
|
||||
}
|
||||
}
|
||||
result = append(result, record)
|
||||
}
|
||||
return result
|
||||
}
|
||||
@@ -0,0 +1,43 @@
|
||||
---
|
||||
name: pivot
|
||||
kind: function
|
||||
lang: go
|
||||
domain: datascience
|
||||
version: "1.0.0"
|
||||
purity: pure
|
||||
signature: "func Pivot(rows []map[string]any, index, columns, values, agg string) []map[string]any"
|
||||
description: "Dependency-free pivot table. Groups by index, expands the unique values of columns into new columns, and aggregates values with the given function (sum, count, mean, min, max, first, last). Missing values are filled with 0."
|
||||
tags: [datascience, tabular, pivot, transform, aggregation, go]
|
||||
uses_functions: []
|
||||
uses_types: []
|
||||
returns: []
|
||||
returns_optional: false
|
||||
error_type: ""
|
||||
imports: []
|
||||
tested: true
|
||||
tests:
|
||||
- "Pivot basico con sum"
|
||||
- "Pivot con count y mean"
|
||||
- "Valores faltantes rellenados con 0"
|
||||
- "Una sola fila"
|
||||
- "Multiples valores por celda requieren agregacion"
|
||||
test_file_path: "functions/datascience/pivot_test.go"
|
||||
file_path: "functions/datascience/pivot.go"
|
||||
---
|
||||
|
||||
## Example
|
||||
|
||||
```go
|
||||
rows := []map[string]any{
|
||||
{"region": "US", "product": "A", "sales": 10},
|
||||
{"region": "US", "product": "B", "sales": 20},
|
||||
{"region": "EU", "product": "A", "sales": 15},
|
||||
}
|
||||
result := Pivot(rows, "region", "product", "sales", "sum")
|
||||
// [{"region": "US", "A": 10.0, "B": 20.0}, {"region": "EU", "A": 15.0, "B": 0}]
|
||||
```
|
||||
|
||||
## Notes
|
||||
|
||||
Pure function with no external dependencies. Uses map[string]any to work with deserialized JSON/SQL data.
|
||||
The numeric aggregations (sum, mean, min, max) convert values to float64 via a type switch.
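To make the conversion rule concrete, the sketch below lifts the `toFloat` closure out of `pivot.go` into a standalone function so it can be run on its own. The `main` function and the sample inputs are illustrative assumptions; the switch itself follows the source.

```go
package main

import "fmt"

// toFloat reproduces the conversion used by Pivot's numeric aggregations:
// a type switch over a fixed set of numeric types, with 0 as the fallback.
func toFloat(v any) float64 {
	switch n := v.(type) {
	case float64:
		return n
	case float32:
		return float64(n)
	case int:
		return float64(n)
	case int64:
		return float64(n)
	case int32:
		return float64(n)
	}
	return 0
}

func main() {
	fmt.Println(toFloat(10))       // 10
	fmt.Println(toFloat(int64(7))) // 7
	fmt.Println(toFloat("12"))     // 0: strings are not parsed
	fmt.Println(toFloat(uint8(3))) // 0: unsigned integers are not in the switch
}
```

Because unsupported types silently become 0, a column that arrives as strings (for example from a CSV reader) sums to 0 and can pull `min` down to 0. Converting such values before calling `Pivot` avoids the surprise.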
|
||||
@@ -0,0 +1,111 @@
|
||||
package datascience
|
||||
|
||||
import (
|
||||
"testing"
|
||||
)
|
||||
|
||||
func TestPivot(t *testing.T) {
|
||||
t.Run("Pivot basico con sum", func(t *testing.T) {
|
||||
rows := []map[string]any{
|
||||
{"region": "US", "product": "A", "sales": 10},
|
||||
{"region": "US", "product": "B", "sales": 20},
|
||||
{"region": "EU", "product": "A", "sales": 15},
|
||||
}
|
||||
result := Pivot(rows, "region", "product", "sales", "sum")
|
||||
if len(result) != 2 {
|
||||
t.Fatalf("got %d rows, want 2", len(result))
|
||||
}
|
||||
var us, eu map[string]any
|
||||
for _, r := range result {
|
||||
if r["region"] == "US" {
|
||||
us = r
|
||||
} else {
|
||||
eu = r
|
||||
}
|
||||
}
|
||||
if us["A"] != 10.0 {
|
||||
t.Errorf("US.A: got %v, want 10", us["A"])
|
||||
}
|
||||
if us["B"] != 20.0 {
|
||||
t.Errorf("US.B: got %v, want 20", us["B"])
|
||||
}
|
||||
if eu["A"] != 15.0 {
|
||||
t.Errorf("EU.A: got %v, want 15", eu["A"])
|
||||
}
|
||||
if eu["B"] != 0 {
|
||||
t.Errorf("EU.B: got %v, want 0", eu["B"])
|
||||
}
|
||||
})
|
||||
|
||||
t.Run("Pivot con count y mean", func(t *testing.T) {
|
||||
rows := []map[string]any{
|
||||
{"region": "US", "product": "A", "sales": 10},
|
||||
{"region": "US", "product": "A", "sales": 20},
|
||||
{"region": "EU", "product": "A", "sales": 15},
|
||||
}
|
||||
resultCount := Pivot(rows, "region", "product", "sales", "count")
|
||||
for _, r := range resultCount {
|
||||
if r["region"] == "US" && r["A"] != 2 {
|
||||
t.Errorf("count US.A: got %v, want 2", r["A"])
|
||||
}
|
||||
}
|
||||
|
||||
resultMean := Pivot(rows, "region", "product", "sales", "mean")
|
||||
for _, r := range resultMean {
|
||||
if r["region"] == "US" {
|
||||
mean, ok := r["A"].(float64)
|
||||
if !ok || mean != 15.0 {
|
||||
t.Errorf("mean US.A: got %v, want 15.0", r["A"])
|
||||
}
|
||||
}
|
||||
}
|
||||
})
|
||||
|
||||
t.Run("Valores faltantes rellenados con 0", func(t *testing.T) {
|
||||
rows := []map[string]any{
|
||||
{"region": "US", "product": "A", "sales": 5},
|
||||
{"region": "EU", "product": "B", "sales": 8},
|
||||
}
|
||||
result := Pivot(rows, "region", "product", "sales", "sum")
|
||||
for _, r := range result {
|
||||
if r["region"] == "US" && r["B"] != 0 {
|
||||
t.Errorf("US.B: got %v, want 0", r["B"])
|
||||
}
|
||||
if r["region"] == "EU" && r["A"] != 0 {
|
||||
t.Errorf("EU.A: got %v, want 0", r["A"])
|
||||
}
|
||||
}
|
||||
})
|
||||
|
||||
t.Run("Una sola fila", func(t *testing.T) {
|
||||
rows := []map[string]any{
|
||||
{"region": "US", "product": "A", "sales": 42},
|
||||
}
|
||||
result := Pivot(rows, "region", "product", "sales", "sum")
|
||||
if len(result) != 1 {
|
||||
t.Fatalf("got %d rows, want 1", len(result))
|
||||
}
|
||||
if result[0]["A"] != 42.0 {
|
||||
t.Errorf("got %v, want 42", result[0]["A"])
|
||||
}
|
||||
})
|
||||
|
||||
t.Run("Multiples valores por celda requieren agregacion", func(t *testing.T) {
|
||||
rows := []map[string]any{
|
||||
{"region": "US", "product": "A", "sales": 10},
|
||||
{"region": "US", "product": "A", "sales": 30},
|
||||
}
|
||||
resultSum := Pivot(rows, "region", "product", "sales", "sum")
|
||||
if resultSum[0]["A"] != 40.0 {
|
||||
t.Errorf("sum: got %v, want 40.0", resultSum[0]["A"])
|
||||
}
|
||||
resultMin := Pivot(rows, "region", "product", "sales", "min")
|
||||
if resultMin[0]["A"] != 10.0 {
|
||||
t.Errorf("min: got %v, want 10.0", resultMin[0]["A"])
|
||||
}
|
||||
resultMax := Pivot(rows, "region", "product", "sales", "max")
|
||||
if resultMax[0]["A"] != 30.0 {
|
||||
t.Errorf("max: got %v, want 30.0", resultMax[0]["A"])
|
||||
}
|
||||
})
|
||||
}
|
||||
@@ -0,0 +1,156 @@
|
||||
package infra
|
||||
|
||||
import (
|
||||
"database/sql"
|
||||
"encoding/json"
|
||||
"fmt"
|
||||
"sync"
|
||||
"time"
|
||||
|
||||
_ "github.com/mattn/go-sqlite3"
|
||||
)
|
||||
|
||||
// SQLiteCache is a key-value cache persisted in SQLite with TTL support.
// Values are stored as serialized JSON. The caller is responsible for
// deserializing the []byte returned by Get.
// Safe for concurrent use.
|
||||
type SQLiteCache struct {
|
||||
db *sql.DB
|
||||
namespace string
|
||||
mu sync.Mutex
|
||||
}
|
||||
|
||||
const sqliteCacheSchema = `
|
||||
CREATE TABLE IF NOT EXISTS cache (
|
||||
namespace TEXT NOT NULL,
|
||||
key TEXT NOT NULL,
|
||||
value TEXT NOT NULL,
|
||||
created_at REAL NOT NULL,
|
||||
expires_at REAL,
|
||||
PRIMARY KEY (namespace, key)
|
||||
);`
|
||||
|
||||
// CacheToSQLite opens (or creates) a SQLite database at dbPath and returns
// a SQLiteCache for the given namespace.
|
||||
func CacheToSQLite(dbPath, namespace string) (*SQLiteCache, error) {
|
||||
db, err := sql.Open("sqlite3", dbPath+"?_journal_mode=WAL")
|
||||
if err != nil {
|
||||
return nil, fmt.Errorf("cache_to_sqlite: open db: %w", err)
|
||||
}
|
||||
if _, err := db.Exec(sqliteCacheSchema); err != nil {
|
||||
db.Close()
|
||||
return nil, fmt.Errorf("cache_to_sqlite: create schema: %w", err)
|
||||
}
|
||||
return &SQLiteCache{db: db, namespace: namespace}, nil
|
||||
}
|
||||
|
||||
// evictExpired deletes the expired entries of the namespace. It must be called
// with the mutex already held. Errors from the DELETE are ignored (best effort).
|
||||
func (c *SQLiteCache) evictExpired() {
|
||||
now := float64(time.Now().UnixNano()) / 1e9
|
||||
c.db.Exec(
|
||||
"DELETE FROM cache WHERE namespace = ? AND expires_at IS NOT NULL AND expires_at <= ?",
|
||||
c.namespace, now,
|
||||
)
|
||||
}
|
||||
|
||||
// Get returns the value associated with key and true, or nil and false if it
// does not exist or has expired. The []byte contains JSON that the caller can
// deserialize.
|
||||
func (c *SQLiteCache) Get(key string) ([]byte, bool) {
|
||||
c.mu.Lock()
|
||||
defer c.mu.Unlock()
|
||||
c.evictExpired()
|
||||
var value string
|
||||
err := c.db.QueryRow(
|
||||
"SELECT value FROM cache WHERE namespace = ? AND key = ?",
|
||||
c.namespace, key,
|
||||
).Scan(&value)
|
||||
if err != nil {
|
||||
return nil, false
|
||||
}
|
||||
return []byte(value), true
|
||||
}
|
||||
|
||||
// Set stores value (JSON bytes) under key. A ttl of 0 (or less) means no expiration.
|
||||
func (c *SQLiteCache) Set(key string, value []byte, ttl time.Duration) error {
|
||||
c.mu.Lock()
|
||||
defer c.mu.Unlock()
|
||||
now := float64(time.Now().UnixNano()) / 1e9
|
||||
var expiresAt any
|
||||
if ttl > 0 {
|
||||
expiresAt = now + ttl.Seconds()
|
||||
}
|
||||
_, err := c.db.Exec(
|
||||
`INSERT INTO cache (namespace, key, value, created_at, expires_at)
|
||||
VALUES (?, ?, ?, ?, ?)
|
||||
ON CONFLICT(namespace, key) DO UPDATE SET
|
||||
value = excluded.value,
|
||||
created_at = excluded.created_at,
|
||||
expires_at = excluded.expires_at`,
|
||||
c.namespace, key, string(value), now, expiresAt,
|
||||
)
|
||||
if err != nil {
|
||||
return fmt.Errorf("cache set: %w", err)
|
||||
}
|
||||
return nil
|
||||
}
|
||||
|
||||
// Delete removes the entry associated with key. It returns an error if the query fails.
|
||||
func (c *SQLiteCache) Delete(key string) error {
|
||||
c.mu.Lock()
|
||||
defer c.mu.Unlock()
|
||||
_, err := c.db.Exec(
|
||||
"DELETE FROM cache WHERE namespace = ? AND key = ?",
|
||||
c.namespace, key,
|
||||
)
|
||||
if err != nil {
|
||||
return fmt.Errorf("cache delete: %w", err)
|
||||
}
|
||||
return nil
|
||||
}
|
||||
|
||||
// Clear removes all entries in the namespace. It returns the number of rows
// deleted.
|
||||
func (c *SQLiteCache) Clear() (int64, error) {
|
||||
c.mu.Lock()
|
||||
defer c.mu.Unlock()
|
||||
res, err := c.db.Exec(
|
||||
"DELETE FROM cache WHERE namespace = ?",
|
||||
c.namespace,
|
||||
)
|
||||
if err != nil {
|
||||
return 0, fmt.Errorf("cache clear: %w", err)
|
||||
}
|
||||
n, _ := res.RowsAffected()
|
||||
return n, nil
|
||||
}
|
||||
|
||||
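// GetOrSet returns the cached value, or calls factory() to obtain it,
// stores it with the given ttl, and returns it. The lookup and the store are
// not done under a single lock, so concurrent callers that miss on the same
// key may each invoke factory.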
|
||||
func (c *SQLiteCache) GetOrSet(key string, factory func() ([]byte, error), ttl time.Duration) ([]byte, error) {
|
||||
if v, ok := c.Get(key); ok {
|
||||
return v, nil
|
||||
}
|
||||
value, err := factory()
|
||||
if err != nil {
|
||||
return nil, fmt.Errorf("cache get_or_set factory: %w", err)
|
||||
}
|
||||
if err := c.Set(key, value, ttl); err != nil {
|
||||
return nil, err
|
||||
}
|
||||
return value, nil
|
||||
}
|
||||
|
||||
// SetJSON serializes v as JSON and stores it under key.
|
||||
func (c *SQLiteCache) SetJSON(key string, v any, ttl time.Duration) error {
|
||||
b, err := json.Marshal(v)
|
||||
if err != nil {
|
||||
return fmt.Errorf("cache set_json marshal: %w", err)
|
||||
}
|
||||
return c.Set(key, b, ttl)
|
||||
}
|
||||
|
||||
// Close closes the database connection.
|
||||
func (c *SQLiteCache) Close() error {
|
||||
return c.db.Close()
|
||||
}
|
||||
@@ -0,0 +1,58 @@
|
||||
---
|
||||
name: cache_to_sqlite
|
||||
kind: function
|
||||
lang: go
|
||||
domain: infra
|
||||
version: "1.0.0"
|
||||
purity: impure
|
||||
signature: "func CacheToSQLite(dbPath, namespace string) (*SQLiteCache, error)"
|
||||
description: "Key-value cache persisted in SQLite with TTL and lazy eviction. Values are stored as JSON bytes; the caller serializes and deserializes. Thread-safe via sync.Mutex. Supports Get, Set, Delete, Clear, and GetOrSet."
|
||||
tags: [cache, sqlite, persistence, ttl, key-value, concurrent]
|
||||
uses_functions: []
|
||||
uses_types: []
|
||||
returns: []
|
||||
returns_optional: false
|
||||
error_type: "error_go_core"
|
||||
imports: ["database/sql", "encoding/json", "sync", "time", "fmt"]
|
||||
tested: true
|
||||
tests:
|
||||
- "Set/Get basico"
|
||||
- "TTL expirado"
|
||||
- "GetOrSet con factory"
|
||||
- "Concurrencia (goroutines)"
|
||||
test_file_path: "functions/infra/cache_to_sqlite_test.go"
|
||||
file_path: "functions/infra/cache_to_sqlite.go"
|
||||
---
|
||||
|
||||
## Example
|
||||
|
||||
```go
|
||||
cache, err := infra.CacheToSQLite("my_cache.db", "default")
|
||||
if err != nil {
|
||||
log.Fatal(err)
|
||||
}
|
||||
defer cache.Close()
|
||||
|
||||
// Store JSON bytes with a 1-hour TTL
|
||||
payload, _ := json.Marshal(map[string]string{"result": "ok"})
|
||||
cache.Set("key1", payload, time.Hour)
|
||||
|
||||
// Retrieve
|
||||
if v, ok := cache.Get("key1"); ok {
|
||||
var result map[string]string
|
||||
json.Unmarshal(v, &result)
|
||||
fmt.Println(result["result"]) // ok
|
||||
}
|
||||
|
||||
// Factory pattern
|
||||
val, err := cache.GetOrSet("expensive_key", func() ([]byte, error) {
|
||||
return json.Marshal(computeExpensiveThing())
|
||||
}, time.Hour)
|
||||
|
||||
// Helper to serialize directly
|
||||
cache.SetJSON("user:42", userStruct, 30*time.Minute)
|
||||
```
|
||||
|
||||
## Notes
|
||||
|
||||
Uses WAL mode for better read concurrency. Lazy eviction removes expired entries on every `Get`. The schema shares the `cache` table with `cache_to_sqlite_py_infra`; both implementations are interoperable on the same SQLite file if they use different namespaces. Requires `github.com/mattn/go-sqlite3` (already present in the registry).
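The cache offers `SetJSON` for writing but leaves decoding to the caller. A read-side counterpart could look like the sketch below. `GetJSON`, `byteGetter`, and `memCache` are hypothetical names introduced only for illustration and are not part of the registry; `memCache` is an in-memory stand-in so the example runs without SQLite, and `*SQLiteCache` would satisfy `byteGetter` through its existing `Get` method.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// byteGetter is the subset of the cache API this sketch needs (hypothetical).
type byteGetter interface {
	Get(key string) ([]byte, bool)
}

// GetJSON is a hypothetical counterpart to SetJSON: it fetches the raw bytes
// and decodes them into T. The bool reports whether the key was found.
func GetJSON[T any](c byteGetter, key string) (T, bool, error) {
	var out T
	raw, ok := c.Get(key)
	if !ok {
		return out, false, nil
	}
	if err := json.Unmarshal(raw, &out); err != nil {
		return out, true, fmt.Errorf("cache get_json unmarshal: %w", err)
	}
	return out, true, nil
}

// memCache is an in-memory stand-in for SQLiteCache, used only in this sketch.
type memCache map[string][]byte

func (m memCache) Get(key string) ([]byte, bool) {
	v, ok := m[key]
	return v, ok
}

func main() {
	c := memCache{"user:42": []byte(`{"name":"Alice"}`)}
	user, found, err := GetJSON[map[string]string](c, "user:42")
	fmt.Println(user["name"], found, err) // Alice true <nil>
}
```

Returning the found flag separately from the error keeps a cache miss distinct from a stored value that fails to decode.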
|
||||
@@ -0,0 +1,134 @@
|
||||
package infra
|
||||
|
||||
import (
|
||||
"encoding/json"
|
||||
"fmt"
|
||||
"os"
|
||||
"sync"
|
||||
"testing"
|
||||
"time"
|
||||
)
|
||||
|
||||
func tempDB(t *testing.T) string {
|
||||
t.Helper()
|
||||
f, err := os.CreateTemp(t.TempDir(), "cache_*.db")
|
||||
if err != nil {
|
||||
t.Fatal(err)
|
||||
}
|
||||
f.Close()
|
||||
return f.Name()
|
||||
}
|
||||
|
||||
func TestCacheToSQLite_SetGet(t *testing.T) {
|
||||
t.Run("Set/Get basico", func(t *testing.T) {
|
||||
c, err := CacheToSQLite(tempDB(t), "default")
|
||||
if err != nil {
|
||||
t.Fatal(err)
|
||||
}
|
||||
defer c.Close()
|
||||
|
||||
payload, _ := json.Marshal(map[string]int{"x": 1})
|
||||
if err := c.Set("foo", payload, 0); err != nil {
|
||||
t.Fatal(err)
|
||||
}
|
||||
got, ok := c.Get("foo")
|
||||
if !ok {
|
||||
t.Fatal("expected cache hit")
|
||||
}
|
||||
var result map[string]int
|
||||
json.Unmarshal(got, &result)
|
||||
if result["x"] != 1 {
|
||||
t.Errorf("got %v, want x=1", result)
|
||||
}
|
||||
})
|
||||
}
|
||||
|
||||
func TestCacheToSQLite_TTLExpirado(t *testing.T) {
|
||||
t.Run("TTL expirado", func(t *testing.T) {
|
||||
c, err := CacheToSQLite(tempDB(t), "default")
|
||||
if err != nil {
|
||||
t.Fatal(err)
|
||||
}
|
||||
defer c.Close()
|
||||
|
||||
payload, _ := json.Marshal("hello")
|
||||
c.Set("temp", payload, 50*time.Millisecond)
|
||||
time.Sleep(100 * time.Millisecond)
|
||||
|
||||
_, ok := c.Get("temp")
|
||||
if ok {
|
||||
t.Error("expected cache miss after TTL expiry")
|
||||
}
|
||||
})
|
||||
}
|
||||
|
||||
func TestCacheToSQLite_GetOrSet(t *testing.T) {
|
||||
t.Run("GetOrSet con factory", func(t *testing.T) {
|
||||
c, err := CacheToSQLite(tempDB(t), "default")
|
||||
if err != nil {
|
||||
t.Fatal(err)
|
||||
}
|
||||
defer c.Close()
|
||||
|
||||
calls := 0
|
||||
factory := func() ([]byte, error) {
|
||||
calls++
|
||||
return json.Marshal("computed")
|
||||
}
|
||||
|
||||
v1, err := c.GetOrSet("k", factory, time.Minute)
|
||||
if err != nil {
|
||||
t.Fatal(err)
|
||||
}
|
||||
v2, err := c.GetOrSet("k", factory, time.Minute)
|
||||
if err != nil {
|
||||
t.Fatal(err)
|
||||
}
|
||||
if string(v1) != string(v2) {
|
||||
t.Errorf("v1=%s v2=%s, want equal", v1, v2)
|
||||
}
|
||||
if calls != 1 {
|
||||
t.Errorf("factory called %d times, want 1", calls)
|
||||
}
|
||||
})
|
||||
}
|
||||
|
||||
func TestCacheToSQLite_Concurrencia(t *testing.T) {
|
||||
t.Run("Concurrencia (goroutines)", func(t *testing.T) {
|
||||
c, err := CacheToSQLite(tempDB(t), "parallel")
|
||||
if err != nil {
|
||||
t.Fatal(err)
|
||||
}
|
||||
defer c.Close()
|
||||
|
||||
var wg sync.WaitGroup
|
||||
errs := make(chan error, 40)
|
||||
for i := 0; i < 20; i++ {
|
||||
wg.Add(1)
|
||||
go func(n int) {
|
||||
defer wg.Done()
|
||||
key := fmt.Sprintf("key_%d", n)
|
||||
payload, _ := json.Marshal(n)
|
||||
if err := c.Set(key, payload, 0); err != nil {
|
||||
errs <- err
|
||||
return
|
||||
}
|
||||
got, ok := c.Get(key)
|
||||
if !ok {
|
||||
errs <- fmt.Errorf("miss for key %s", key)
|
||||
return
|
||||
}
|
||||
var val int
|
||||
json.Unmarshal(got, &val)
|
||||
if val != n {
|
||||
errs <- fmt.Errorf("key %s: got %d want %d", key, val, n)
|
||||
}
|
||||
}(i)
|
||||
}
|
||||
wg.Wait()
|
||||
close(errs)
|
||||
for err := range errs {
|
||||
t.Error(err)
|
||||
}
|
||||
})
|
||||
}
|
||||
@@ -0,0 +1,136 @@
|
||||
package infra
|
||||
|
||||
import (
|
||||
"context"
|
||||
"time"
|
||||
)
|
||||
|
||||
// CronTickerSchedule mirrors core.CronSchedule to avoid a cross-package import.
|
||||
// In practice, callers should use core.ParseCronExpr and pass the result here.
|
||||
// The struct is duplicated to respect the registry rule of no cross-domain imports
|
||||
// between function packages.
|
||||
//
|
||||
// CronTickerSchedule is the schedule consumed by CronTicker.
|
||||
type CronTickerSchedule struct {
|
||||
Minute []int
|
||||
Hour []int
|
||||
DayOfMonth []int
|
||||
Month []int
|
||||
DayOfWeek []int
|
||||
}
|
||||
|
||||
// CronTicker creates a channel that emits the current time whenever the given
|
||||
// schedule fires. It uses time.NewTimer internally, recalculating the next tick
|
||||
// after each emission. The channel is closed when ctx is cancelled.
|
||||
func CronTicker(schedule CronTickerSchedule, ctx context.Context) <-chan time.Time {
|
||||
ch := make(chan time.Time, 1)
|
||||
go func() {
|
||||
defer close(ch)
|
||||
for {
|
||||
next := cronTickerNext(schedule, time.Now())
|
||||
if next.IsZero() {
|
||||
// Impossible schedule — nothing to emit.
|
||||
return
|
||||
}
|
||||
delay := time.Until(next)
|
||||
timer := time.NewTimer(delay)
|
||||
select {
|
||||
case <-ctx.Done():
|
||||
timer.Stop()
|
||||
return
|
||||
case tick := <-timer.C:
|
||||
select {
|
||||
case ch <- tick:
|
||||
default:
|
||||
// Drop if consumer is not ready.
|
||||
}
|
||||
}
|
||||
}
|
||||
}()
|
||||
return ch
|
||||
}
|
||||
|
||||
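// cronTickerNext finds the next time after `after` that satisfies the schedule.
// A day matches only when both DayOfMonth and DayOfWeek match. The schedule
// slices are assumed to be sorted in ascending order.
// Returns zero time if no match within 366 days.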
|
||||
func cronTickerNext(s CronTickerSchedule, after time.Time) time.Time {
|
||||
t := after.Truncate(time.Minute).Add(time.Minute)
|
||||
limit := after.Add(366 * 24 * time.Hour)
|
||||
|
||||
for t.Before(limit) {
|
||||
if !cronIntIn(int(t.Month()), s.Month) {
|
||||
t = cronNextMonth(t, s.Month)
|
||||
if t.IsZero() {
|
||||
return time.Time{}
|
||||
}
|
||||
continue
|
||||
}
|
||||
domOK := cronIntIn(t.Day(), s.DayOfMonth)
|
||||
dowOK := cronIntIn(int(t.Weekday()), s.DayOfWeek)
|
||||
if !domOK || !dowOK {
|
||||
t = time.Date(t.Year(), t.Month(), t.Day()+1, 0, 0, 0, 0, t.Location())
|
||||
continue
|
||||
}
|
||||
if !cronIntIn(t.Hour(), s.Hour) {
|
||||
next := cronNextHour(t, s.Hour)
|
||||
if next.IsZero() {
|
||||
t = time.Date(t.Year(), t.Month(), t.Day()+1, 0, 0, 0, 0, t.Location())
|
||||
} else {
|
||||
t = next
|
||||
}
|
||||
continue
|
||||
}
|
||||
if !cronIntIn(t.Minute(), s.Minute) {
|
||||
next := cronNextMinute(t, s.Minute)
|
||||
if next.IsZero() {
|
||||
t = time.Date(t.Year(), t.Month(), t.Day(), t.Hour()+1, 0, 0, 0, t.Location())
|
||||
} else {
|
||||
t = next
|
||||
}
|
||||
continue
|
||||
}
|
||||
return t
|
||||
}
|
||||
return time.Time{}
|
||||
}
|
||||
|
||||
func cronIntIn(v int, s []int) bool {
|
||||
for _, x := range s {
|
||||
if x == v {
|
||||
return true
|
||||
}
|
||||
}
|
||||
return false
|
||||
}
|
||||
|
||||
func cronNextMonth(t time.Time, months []int) time.Time {
|
||||
month := int(t.Month())
|
||||
for _, m := range months {
|
||||
if m > month {
|
||||
return time.Date(t.Year(), time.Month(m), 1, 0, 0, 0, 0, t.Location())
|
||||
}
|
||||
}
|
||||
if len(months) > 0 {
|
||||
return time.Date(t.Year()+1, time.Month(months[0]), 1, 0, 0, 0, 0, t.Location())
|
||||
}
|
||||
return time.Time{}
|
||||
}
|
||||
|
||||
func cronNextHour(t time.Time, hours []int) time.Time {
|
||||
h := t.Hour()
|
||||
for _, hh := range hours {
|
||||
if hh > h {
|
||||
return time.Date(t.Year(), t.Month(), t.Day(), hh, 0, 0, 0, t.Location())
|
||||
}
|
||||
}
|
||||
return time.Time{}
|
||||
}
|
||||
|
||||
func cronNextMinute(t time.Time, minutes []int) time.Time {
|
||||
m := t.Minute()
|
||||
for _, mm := range minutes {
|
||||
if mm > m {
|
||||
return time.Date(t.Year(), t.Month(), t.Day(), t.Hour(), mm, 0, 0, t.Location())
|
||||
}
|
||||
}
|
||||
return time.Time{}
|
||||
}
|
||||
@@ -0,0 +1,45 @@
|
||||
---
|
||||
name: cron_ticker
|
||||
kind: function
|
||||
lang: go
|
||||
domain: infra
|
||||
version: "1.0.0"
|
||||
purity: impure
|
||||
signature: "func CronTicker(schedule CronTickerSchedule, ctx context.Context) <-chan time.Time"
|
||||
description: "Creates a channel that emits time.Time on each tick of the cron schedule. Uses time.NewTimer internally, recalculating the next tick after each emission. The channel is closed when the context is cancelled. Includes CronTickerSchedule (a local mirror of CronSchedule to avoid a cross-package dependency)."
|
||||
tags: [cron, scheduling, ticker, channel, goroutine, concurrency, impure]
|
||||
uses_functions: [parse_cron_expr_go_core, next_cron_time_go_core]
|
||||
uses_types: [cron_schedule_go_core]
|
||||
returns: []
|
||||
returns_optional: false
|
||||
error_type: "error_go_core"
|
||||
imports: [context, time]
|
||||
tested: true
|
||||
tests:
|
||||
- "context cancel cierra el channel"
|
||||
- "ticker emite al llegar el momento del schedule"
|
||||
test_file_path: "functions/infra/cron_ticker_test.go"
|
||||
file_path: "functions/infra/cron_ticker.go"
|
||||
---
|
||||
|
||||
## Ejemplo
|
||||
|
||||
```go
sched := CronTickerSchedule{
	Minute:     []int{0, 15, 30, 45},
	Hour:       intRange(0, 23),
	DayOfMonth: intRange(1, 31),
	Month:      intRange(1, 12),
	DayOfWeek:  intRange(0, 6),
}
ctx, cancel := context.WithCancel(context.Background())
defer cancel()

for tick := range CronTicker(sched, ctx) {
	fmt.Println("tick:", tick)
}
```
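
The example above calls `intRange`, which this file does not define. A minimal sketch of such a helper follows, assuming inclusive bounds. The name and call shape come from the example; the body is an assumption, not code from the registry. Ascending order matters because `cronNextMonth`, `cronNextHour` and `cronNextMinute` return the first slice value greater than the current one.

```go
package main

import "fmt"

// intRange is a hypothetical helper: it returns the integers from lo to hi,
// inclusive, in ascending order. It returns nil when hi < lo.
func intRange(lo, hi int) []int {
	if hi < lo {
		return nil
	}
	out := make([]int, 0, hi-lo+1)
	for i := lo; i <= hi; i++ {
		out = append(out, i)
	}
	return out
}

func main() {
	fmt.Println(intRange(0, 6))       // days of week, Sunday = 0
	fmt.Println(len(intRange(1, 31))) // number of days of month
}
```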
|
||||
|
||||
## Notas
|
||||
|
||||
Impure function: it starts a goroutine and uses time.NewTimer and context. The CronTickerSchedule type is a local mirror of core.CronSchedule, which avoids cross-package imports between Go domains. In real use, convert the result of core.ParseCronExpr manually. The channel has a buffer of 1 so the goroutine does not block when the consumer is slow; extra ticks are dropped.
|
||||
@@ -0,0 +1,114 @@
|
||||
package infra
|
||||
|
||||
import (
|
||||
"context"
|
||||
"testing"
|
||||
"time"
|
||||
)
|
||||
|
||||
func allMinutes() []int {
|
||||
s := make([]int, 60)
|
||||
for i := range s {
|
||||
s[i] = i
|
||||
}
|
||||
return s
|
||||
}
|
||||
|
||||
func allHours() []int {
|
||||
s := make([]int, 24)
|
||||
for i := range s {
|
||||
s[i] = i
|
||||
}
|
||||
return s
|
||||
}
|
||||
|
||||
func allDays() []int {
|
||||
s := make([]int, 31)
|
||||
for i := range s {
|
||||
s[i] = i + 1
|
||||
}
|
||||
return s
|
||||
}
|
||||
|
||||
func allMonths() []int {
|
||||
s := make([]int, 12)
|
||||
for i := range s {
|
||||
s[i] = i + 1
|
||||
}
|
||||
return s
|
||||
}
|
||||
|
||||
func allDOW() []int {
|
||||
s := make([]int, 7)
|
||||
for i := range s {
|
||||
s[i] = i
|
||||
}
|
||||
return s
|
||||
}
|
||||
|
||||
func TestCronTicker(t *testing.T) {
|
||||
t.Run("context cancel cierra el channel", func(t *testing.T) {
|
||||
sched := CronTickerSchedule{
|
||||
Minute: allMinutes(),
|
||||
Hour: allHours(),
|
||||
DayOfMonth: allDays(),
|
||||
Month: allMonths(),
|
||||
DayOfWeek: allDOW(),
|
||||
}
|
||||
ctx, cancel := context.WithCancel(context.Background())
|
||||
ch := CronTicker(sched, ctx)
|
||||
|
||||
// Cancel immediately.
|
||||
cancel()
|
||||
|
||||
// Channel should close without blocking.
|
||||
timeout := time.After(2 * time.Second)
|
||||
select {
|
||||
case _, ok := <-ch:
|
||||
if ok {
|
||||
// Might receive one tick before cancel propagates — acceptable.
|
||||
}
|
||||
// Drain remaining.
|
||||
for range ch {
|
||||
}
|
||||
case <-timeout:
|
||||
t.Error("channel did not close within 2s after context cancel")
|
||||
}
|
||||
})
|
||||
|
||||
t.Run("ticker emite al llegar el momento del schedule", func(t *testing.T) {
|
||||
// Use a schedule that fires every minute (all minutes).
|
||||
// The next tick is at most 60s away. We use a short-lived context
|
||||
// to avoid waiting: instead we verify the channel is not nil and
|
||||
// that cancellation closes it cleanly.
|
||||
sched := CronTickerSchedule{
|
||||
Minute: allMinutes(),
|
||||
Hour: allHours(),
|
||||
DayOfMonth: allDays(),
|
||||
Month: allMonths(),
|
||||
DayOfWeek: allDOW(),
|
||||
}
|
||||
ctx, cancel := context.WithTimeout(context.Background(), 100*time.Millisecond)
|
||||
defer cancel()
|
||||
|
||||
ch := CronTicker(sched, ctx)
|
||||
if ch == nil {
|
||||
t.Fatal("CronTicker returned nil channel")
|
||||
}
|
||||
|
||||
// Wait for context to expire, then confirm channel closes.
|
||||
<-ctx.Done()
|
||||
timeout := time.After(2 * time.Second)
|
||||
for {
|
||||
select {
|
||||
case _, ok := <-ch:
|
||||
if !ok {
|
||||
return // channel closed, test passes
|
||||
}
|
||||
case <-timeout:
|
||||
t.Error("channel did not close within 2s after context timeout")
|
||||
return
|
||||
}
|
||||
}
|
||||
})
|
||||
}
|
||||
@@ -0,0 +1,71 @@
|
||||
package infra

import (
	"fmt"
	"io"
	"net/http"
	"os"
	"path/filepath"
	"time"
)

// HttpDownloadFile streams url into destPath with io.Copy.
// It creates intermediate directories with os.MkdirAll and uses a temporary
// file + rename to guarantee atomicity (no corrupt file left behind if the
// download fails midway).
// It returns the number of bytes written.
func HttpDownloadFile(url, destPath string, headers map[string]string, timeout time.Duration) (int64, error) {
	client := &http.Client{Timeout: timeout}

	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return 0, fmt.Errorf("http_download_file: build request: %w", err)
	}
	for k, v := range headers {
		req.Header.Set(k, v)
	}

	resp, err := client.Do(req)
	if err != nil {
		return 0, fmt.Errorf("http_download_file: %w", err)
	}
	defer resp.Body.Close()

	if resp.StatusCode >= 400 {
		shortURL := url
		if len(shortURL) > 100 {
			shortURL = shortURL[:100]
		}
		return 0, fmt.Errorf("http_download_file: HTTP %d at %q", resp.StatusCode, shortURL)
	}

	dir := filepath.Dir(destPath)
	if err := os.MkdirAll(dir, 0o755); err != nil {
		return 0, fmt.Errorf("http_download_file: create dirs: %w", err)
	}

	// Temporary file in the same directory so that the rename is atomic.
	tmp, err := os.CreateTemp(dir, ".download-*")
	if err != nil {
		return 0, fmt.Errorf("http_download_file: create temp file: %w", err)
	}
	tmpPath := tmp.Name()
	defer func() {
		tmp.Close()
		os.Remove(tmpPath) // no-op if the rename succeeded
	}()

	n, err := io.Copy(tmp, resp.Body)
	if err != nil {
		return 0, fmt.Errorf("http_download_file: write: %w", err)
	}

	if err := tmp.Close(); err != nil {
		return 0, fmt.Errorf("http_download_file: close temp: %w", err)
	}

	if err := os.Rename(tmpPath, destPath); err != nil {
		return 0, fmt.Errorf("http_download_file: rename: %w", err)
	}

	return n, nil
}
|
||||
@@ -0,0 +1,44 @@
|
||||
---
|
||||
name: http_download_file
|
||||
kind: function
|
||||
lang: go
|
||||
domain: infra
|
||||
version: "1.0.0"
|
||||
purity: impure
|
||||
signature: "func HttpDownloadFile(url, destPath string, headers map[string]string, timeout time.Duration) (int64, error)"
|
||||
description: "Streams url into destPath with io.Copy. Creates directories with os.MkdirAll. Uses a temporary file + rename for atomicity (no corrupt file left behind on failure). Returns the bytes written."
|
||||
tags: [http, download, file, streaming, atomic, network, stdlib, infra]
|
||||
uses_functions: []
|
||||
uses_types: []
|
||||
returns: []
|
||||
returns_optional: false
|
||||
error_type: "error_go_core"
|
||||
imports: ["fmt", "io", "net/http", "os", "path/filepath", "time"]
|
||||
tested: true
|
||||
tests:
|
||||
- "httptest.Server sirve archivo binario"
|
||||
- "Directorio creado automaticamente"
|
||||
- "Archivo temporal + rename (no deja basura si falla)"
|
||||
- "Size retornado coincide"
|
||||
test_file_path: "functions/infra/http_download_file_test.go"
|
||||
file_path: "functions/infra/http_download_file.go"
|
||||
---
|
||||
|
||||
## Ejemplo
|
||||
|
||||
```go
n, err := HttpDownloadFile(
	"https://example.com/report.pdf",
	"/tmp/reports/report.pdf",
	nil,
	2*time.Minute,
)
if err != nil {
	return err
}
fmt.Printf("Downloaded %d bytes\n", n)
```
|
||||
|
||||
## Notas
|
||||
|
||||
Uses only the stdlib. The temporary file is created in the same directory as destPath so that the rename is atomic (same filesystem). If the download fails, the temporary file is removed with os.Remove (the defer guarantees it). Works with files of any size because it streams with io.Copy.
|
||||
@@ -0,0 +1,99 @@
|
||||
package infra
|
||||
|
||||
import (
|
||||
"net/http"
|
||||
"net/http/httptest"
|
||||
"os"
|
||||
"path/filepath"
|
||||
"testing"
|
||||
"time"
|
||||
)
|
||||
|
||||
func TestHttpDownloadFile(t *testing.T) {
|
||||
t.Run("httptest.Server sirve archivo binario", func(t *testing.T) {
|
||||
content := []byte("\x00\x01\x02\x03binary content")
|
||||
srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
|
||||
w.Header().Set("Content-Type", "application/octet-stream")
|
||||
w.Write(content)
|
||||
}))
|
||||
defer srv.Close()
|
||||
|
||||
tmp := t.TempDir()
|
||||
dest := filepath.Join(tmp, "out.bin")
|
||||
|
||||
n, err := HttpDownloadFile(srv.URL, dest, nil, 5*time.Second)
|
||||
if err != nil {
|
||||
t.Fatalf("unexpected error: %v", err)
|
||||
}
|
||||
if n != int64(len(content)) {
|
||||
t.Errorf("got %d bytes, want %d", n, len(content))
|
||||
}
|
||||
got, _ := os.ReadFile(dest)
|
||||
if string(got) != string(content) {
|
||||
t.Errorf("file content mismatch")
|
||||
}
|
||||
})
|
||||
|
||||
t.Run("Directorio creado automaticamente", func(t *testing.T) {
|
||||
srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
|
||||
w.Write([]byte("data"))
|
||||
}))
|
||||
defer srv.Close()
|
||||
|
||||
tmp := t.TempDir()
|
||||
dest := filepath.Join(tmp, "nested", "deep", "file.bin")
|
||||
|
||||
_, err := HttpDownloadFile(srv.URL, dest, nil, 5*time.Second)
|
||||
if err != nil {
|
||||
t.Fatalf("unexpected error: %v", err)
|
||||
}
|
||||
if _, err := os.Stat(dest); os.IsNotExist(err) {
|
||||
t.Error("dest file does not exist after download")
|
||||
}
|
||||
})
|
||||
|
||||
t.Run("Archivo temporal + rename (no deja basura si falla)", func(t *testing.T) {
|
||||
// The server replies successfully with a short body; a mid-transfer failure is not simulated here.
|
||||
srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
|
||||
w.Write([]byte("partial"))
|
||||
// Hijacking the connection to close it abruptly is not easily available, so the failure path is not exercised.
|
||||
}))
|
||||
defer srv.Close()
|
||||
|
||||
// Verify that a successful download leaves no .download-* temporary files behind
|
||||
tmp := t.TempDir()
|
||||
dest := filepath.Join(tmp, "file.bin")
|
||||
|
||||
HttpDownloadFile(srv.URL, dest, nil, 5*time.Second)
|
||||
|
||||
entries, _ := os.ReadDir(tmp)
|
||||
for _, e := range entries {
|
||||
if e.Name() != "file.bin" && filepath.Ext(e.Name()) != ".bin" {
|
||||
t.Errorf("unexpected temp file left: %s", e.Name())
|
||||
}
|
||||
}
|
||||
})
|
||||
|
||||
t.Run("Size retornado coincide", func(t *testing.T) {
|
||||
content := make([]byte, 10000)
|
||||
for i := range content {
|
||||
content[i] = byte(i % 256)
|
||||
}
|
||||
srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
|
||||
w.Write(content)
|
||||
}))
|
||||
defer srv.Close()
|
||||
|
||||
tmp := t.TempDir()
|
||||
dest := filepath.Join(tmp, "big.bin")
|
||||
|
||||
n, err := HttpDownloadFile(srv.URL, dest, nil, 5*time.Second)
|
||||
if err != nil {
|
||||
t.Fatalf("unexpected error: %v", err)
|
||||
}
|
||||
if n != int64(len(content)) {
|
||||
t.Errorf("got %d bytes, want %d", n, len(content))
|
||||
}
|
||||
})
|
||||
}
|
||||
@@ -0,0 +1,56 @@
|
||||
package infra

import (
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"time"
)

// HttpGetJSON performs a GET request to url and parses the response as JSON.
// It adds Accept: application/json automatically. It returns an error if
// status >= 400, including the status code and the first 200 bytes of the body.
func HttpGetJSON(url string, headers map[string]string, timeout time.Duration) (map[string]any, error) {
	client := &http.Client{Timeout: timeout}

	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return nil, fmt.Errorf("http_get_json: build request: %w", err)
	}

	req.Header.Set("Accept", "application/json")
	for k, v := range headers {
		req.Header.Set(k, v)
	}

	resp, err := client.Do(req)
	if err != nil {
		return nil, fmt.Errorf("http_get_json: %w", err)
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return nil, fmt.Errorf("http_get_json: read body: %w", err)
	}

	if resp.StatusCode >= 400 {
		preview := body
		if len(preview) > 200 {
			preview = preview[:200]
		}
		shortURL := url
		if len(shortURL) > 100 {
			shortURL = shortURL[:100]
		}
		return nil, fmt.Errorf("http_get_json: HTTP %d at %q — %s", resp.StatusCode, shortURL, preview)
	}

	var result map[string]any
	if err := json.Unmarshal(body, &result); err != nil {
		return nil, fmt.Errorf("http_get_json: parse JSON: %w", err)
	}

	return result, nil
}
|
||||
@@ -0,0 +1,43 @@
|
||||
---
|
||||
name: http_get_json
|
||||
kind: function
|
||||
lang: go
|
||||
domain: infra
|
||||
version: "1.0.0"
|
||||
purity: impure
|
||||
signature: "func HttpGetJSON(url string, headers map[string]string, timeout time.Duration) (map[string]any, error)"
|
||||
description: "GET request that expects JSON. Adds Accept: application/json automatically. Returns an error with the status code if >= 400. Always closes the body with defer."
|
||||
tags: [http, json, get, client, network, stdlib, infra]
|
||||
uses_functions: []
|
||||
uses_types: []
|
||||
returns: []
|
||||
returns_optional: false
|
||||
error_type: "error_go_core"
|
||||
imports: ["encoding/json", "fmt", "io", "net/http", "time"]
|
||||
tested: true
|
||||
tests:
|
||||
- "httptest.Server con respuesta JSON"
|
||||
- "Status 404 → error"
|
||||
- "Timeout → error"
|
||||
- "Headers custom"
|
||||
test_file_path: "functions/infra/http_get_json_test.go"
|
||||
file_path: "functions/infra/http_get_json.go"
|
||||
---
|
||||
|
||||
## Ejemplo
|
||||
|
||||
```go
result, err := HttpGetJSON(
	"https://api.example.com/users",
	map[string]string{"X-Api-Key": "secret"},
	10*time.Second,
)
if err != nil {
	return nil, err
}
fmt.Println(result["total"])
```
|
||||
|
||||
## Notas
|
||||
|
||||
Uses only the stdlib (net/http, encoding/json). The timeout is set on the http.Client. The error includes the first 200 bytes of the body to ease debugging. Custom headers are merged with Accept: application/json (custom headers take precedence).
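
Custom headers take precedence because the default is set first and `http.Header.Set` replaces any existing value for the same canonicalized key. A minimal sketch of that ordering follows; `buildHeaders` is a hypothetical helper written for this illustration, and no request is actually sent.

```go
package main

import (
	"fmt"
	"net/http"
)

// buildHeaders is a hypothetical helper that applies headers in the same
// order as HttpGetJSON: the default Accept first, custom headers afterwards.
func buildHeaders(custom map[string]string) (http.Header, error) {
	req, err := http.NewRequest(http.MethodGet, "https://api.example.com/users", nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Accept", "application/json")
	for k, v := range custom {
		// Set canonicalizes the key, so "accept" overrides "Accept".
		req.Header.Set(k, v)
	}
	return req.Header, nil
}

func main() {
	h, err := buildHeaders(map[string]string{"accept": "text/plain", "X-Api-Key": "secret"})
	if err != nil {
		panic(err)
	}
	fmt.Println(h.Get("Accept"))    // text/plain
	fmt.Println(h.Get("X-Api-Key")) // secret
}
```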
|
||||
@@ -0,0 +1,80 @@
|
||||
package infra
|
||||
|
||||
import (
|
||||
"encoding/json"
|
||||
"net/http"
|
||||
"net/http/httptest"
|
||||
"strings"
|
||||
"testing"
|
||||
"time"
|
||||
)
|
||||
|
||||
func TestHttpGetJSON(t *testing.T) {
|
||||
t.Run("httptest.Server con respuesta JSON", func(t *testing.T) {
|
||||
payload := map[string]any{"ok": true, "value": float64(42)}
|
||||
srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
|
||||
w.Header().Set("Content-Type", "application/json")
|
||||
json.NewEncoder(w).Encode(payload)
|
||||
}))
|
||||
defer srv.Close()
|
||||
|
||||
result, err := HttpGetJSON(srv.URL, nil, 5*time.Second)
|
||||
if err != nil {
|
||||
t.Fatalf("unexpected error: %v", err)
|
||||
}
|
||||
if result["ok"] != true {
|
||||
t.Errorf("got ok=%v, want true", result["ok"])
|
||||
}
|
||||
})
|
||||
|
||||
t.Run("Status 404 → error", func(t *testing.T) {
|
||||
srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
|
||||
http.Error(w, "not found", http.StatusNotFound)
|
||||
}))
|
||||
defer srv.Close()
|
||||
|
||||
_, err := HttpGetJSON(srv.URL, nil, 5*time.Second)
|
||||
if err == nil {
|
||||
t.Fatal("expected error, got nil")
|
||||
}
|
||||
if !strings.Contains(err.Error(), "404") {
|
||||
t.Errorf("error should contain 404, got: %v", err)
|
||||
}
|
||||
})
|
||||
|
||||
t.Run("Timeout → error", func(t *testing.T) {
|
||||
srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
|
||||
// Never responds: blocks until the client cancels
|
||||
<-r.Context().Done()
|
||||
}))
|
||||
defer srv.Close()
|
||||
|
||||
_, err := HttpGetJSON(srv.URL, nil, 50*time.Millisecond)
|
||||
if err == nil {
|
||||
t.Fatal("expected timeout error, got nil")
|
||||
}
|
||||
})
|
||||
|
||||
t.Run("Headers custom", func(t *testing.T) {
|
||||
receivedHeaders := make(chan http.Header, 1)
|
||||
srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
|
||||
receivedHeaders <- r.Header.Clone()
|
||||
w.Header().Set("Content-Type", "application/json")
|
||||
w.Write([]byte(`{}`))
|
||||
}))
|
||||
defer srv.Close()
|
||||
|
||||
headers := map[string]string{"X-Api-Key": "mytoken"}
|
||||
_, err := HttpGetJSON(srv.URL, headers, 5*time.Second)
|
||||
if err != nil {
|
||||
t.Fatalf("unexpected error: %v", err)
|
||||
}
|
||||
h := <-receivedHeaders
|
||||
if h.Get("X-Api-Key") != "mytoken" {
|
||||
t.Errorf("X-Api-Key not sent, got: %v", h.Get("X-Api-Key"))
|
||||
}
|
||||
if h.Get("Accept") != "application/json" {
|
||||
t.Errorf("Accept header missing, got: %v", h.Get("Accept"))
|
||||
}
|
||||
})
|
||||
}
|
||||
@@ -0,0 +1,63 @@
|
||||
package infra

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"time"
)

// HttpPostJSON performs a POST request with a JSON body and parses the response as JSON.
// It adds Content-Type: application/json and Accept: application/json automatically.
// It returns an error if status >= 400, including the status code and the first 200 bytes of the body.
func HttpPostJSON(url string, body any, headers map[string]string, timeout time.Duration) (map[string]any, error) {
	data, err := json.Marshal(body)
	if err != nil {
		return nil, fmt.Errorf("http_post_json: marshal body: %w", err)
	}

	client := &http.Client{Timeout: timeout}

	req, err := http.NewRequest(http.MethodPost, url, bytes.NewReader(data))
	if err != nil {
		return nil, fmt.Errorf("http_post_json: build request: %w", err)
	}

	req.Header.Set("Content-Type", "application/json")
	req.Header.Set("Accept", "application/json")
	for k, v := range headers {
		req.Header.Set(k, v)
	}

	resp, err := client.Do(req)
	if err != nil {
		return nil, fmt.Errorf("http_post_json: %w", err)
	}
	defer resp.Body.Close()

	respBody, err := io.ReadAll(resp.Body)
	if err != nil {
		return nil, fmt.Errorf("http_post_json: read body: %w", err)
	}

	if resp.StatusCode >= 400 {
		preview := respBody
		if len(preview) > 200 {
			preview = preview[:200]
		}
		shortURL := url
		if len(shortURL) > 100 {
			shortURL = shortURL[:100]
		}
		return nil, fmt.Errorf("http_post_json: HTTP %d at %q — %s", resp.StatusCode, shortURL, preview)
	}

	var result map[string]any
	if err := json.Unmarshal(respBody, &result); err != nil {
		return nil, fmt.Errorf("http_post_json: parse JSON: %w", err)
	}

	return result, nil
}
|
||||
@@ -0,0 +1,43 @@
|
||||
---
|
||||
name: http_post_json
|
||||
kind: function
|
||||
lang: go
|
||||
domain: infra
|
||||
version: "1.0.0"
|
||||
purity: impure
|
||||
signature: "func HttpPostJSON(url string, body any, headers map[string]string, timeout time.Duration) (map[string]any, error)"
|
||||
description: "POST request with a JSON body serialized with json.Marshal. Adds Content-Type: application/json and Accept: application/json. Returns an error with the status code if >= 400."
|
||||
tags: [http, json, post, client, network, stdlib, infra]
|
||||
uses_functions: []
|
||||
uses_types: []
|
||||
returns: []
|
||||
returns_optional: false
|
||||
error_type: "error_go_core"
|
||||
imports: ["bytes", "encoding/json", "fmt", "io", "net/http", "time"]
|
||||
tested: true
|
||||
tests:
|
||||
- "httptest.Server recibe body correcto"
|
||||
- "Status 201 → exito"
|
||||
- "Status 500 → error con body parcial"
|
||||
test_file_path: "functions/infra/http_post_json_test.go"
|
||||
file_path: "functions/infra/http_post_json.go"
|
||||
---
|
||||
|
||||
## Ejemplo
|
||||
|
||||
```go
result, err := HttpPostJSON(
	"https://api.example.com/users",
	map[string]any{"name": "Alice", "role": "admin"},
	map[string]string{"X-Api-Key": "secret"},
	10*time.Second,
)
if err != nil {
	return nil, err
}
fmt.Println(result["id"])
```
|
||||
|
||||
## Notas
|
||||
|
||||
Uses only the stdlib. The body accepts `any` and is serialized with json.Marshal. Custom headers are merged with the default Content-Type and Accept. The error includes the first 200 bytes of the response body.
|
||||
@@ -0,0 +1,67 @@
|
||||
package infra
|
||||
|
||||
import (
|
||||
"encoding/json"
|
||||
"io"
|
||||
"net/http"
|
||||
"net/http/httptest"
|
||||
"strings"
|
||||
"testing"
|
||||
"time"
|
||||
)
|
||||
|
||||
func TestHttpPostJSON(t *testing.T) {
|
||||
t.Run("httptest.Server recibe body correcto", func(t *testing.T) {
|
||||
received := make(chan map[string]any, 1)
|
||||
srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
|
||||
var body map[string]any
|
||||
data, _ := io.ReadAll(r.Body)
|
||||
json.Unmarshal(data, &body)
|
||||
received <- body
|
||||
w.Header().Set("Content-Type", "application/json")
|
||||
w.Write([]byte(`{"ok": true}`))
|
||||
}))
|
||||
defer srv.Close()
|
||||
|
||||
_, err := HttpPostJSON(srv.URL, map[string]any{"name": "Alice", "score": 100}, nil, 5*time.Second)
|
||||
if err != nil {
|
||||
t.Fatalf("unexpected error: %v", err)
|
||||
}
|
||||
body := <-received
|
||||
if body["name"] != "Alice" {
|
||||
t.Errorf("name not received correctly, got: %v", body["name"])
|
||||
}
|
||||
})
|
||||
|
||||
t.Run("Status 201 → exito", func(t *testing.T) {
|
||||
srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
|
||||
w.Header().Set("Content-Type", "application/json")
|
||||
w.WriteHeader(http.StatusCreated)
|
||||
w.Write([]byte(`{"id": 42}`))
|
||||
}))
|
||||
defer srv.Close()
|
||||
|
||||
result, err := HttpPostJSON(srv.URL, map[string]any{"x": 1}, nil, 5*time.Second)
|
||||
if err != nil {
|
||||
t.Fatalf("unexpected error: %v", err)
|
||||
}
|
||||
if result["id"] != float64(42) {
|
||||
t.Errorf("got id=%v, want 42", result["id"])
|
||||
}
|
||||
})
|
||||
|
||||
t.Run("Status 500 → error con body parcial", func(t *testing.T) {
|
||||
srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
|
||||
http.Error(w, "internal server error details", http.StatusInternalServerError)
|
||||
}))
|
||||
defer srv.Close()
|
||||
|
||||
_, err := HttpPostJSON(srv.URL, map[string]any{}, nil, 5*time.Second)
|
||||
if err == nil {
|
||||
t.Fatal("expected error, got nil")
|
||||
}
|
||||
if !strings.Contains(err.Error(), "500") {
|
||||
t.Errorf("error should contain 500, got: %v", err)
|
||||
}
|
||||
})
|
||||
}
|
||||
@@ -0,0 +1,48 @@
|
||||
---
|
||||
name: build_tree_from_headers
|
||||
kind: function
|
||||
lang: py
|
||||
domain: core
|
||||
version: "1.0.0"
|
||||
purity: pure
|
||||
signature: "def build_tree_from_headers(node_list: list[dict]) -> list[dict]"
|
||||
description: "Builds a nested hierarchical tree from a flat list of markdown headers with levels (h1>h2>h3)."
|
||||
tags: [tree, markdown, headers, hierarchy]
|
||||
uses_functions: []
|
||||
uses_types: []
|
||||
returns: []
|
||||
returns_optional: false
|
||||
error_type: ""
|
||||
imports: []
|
||||
tested: false
|
||||
tests: []
|
||||
test_file_path: ""
|
||||
file_path: "python/functions/core/core.py"
|
||||
source_repo: "https://github.com/VectifyAI/PageIndex"
|
||||
source_license: "MIT"
|
||||
source_file: "pageindex/page_index_md.py"
|
||||
---
|
||||
|
||||
## Ejemplo
|
||||
|
||||
```python
headers = [
    {"title": "Intro", "level": 1, "line_num": 1},
    {"title": "Background", "level": 2, "line_num": 5},
    {"title": "Details", "level": 3, "line_num": 10},
    {"title": "Methods", "level": 1, "line_num": 20},
]
tree = build_tree_from_headers(headers)
# [
#   {"title": "Intro", "node_id": "0001", "nodes": [
#     {"title": "Background", "node_id": "0002", "nodes": [
#       {"title": "Details", "node_id": "0003"}
#     ]}
#   ]},
#   {"title": "Methods", "node_id": "0004"}
# ]
```
|
||||
|
||||
## Notas
|
||||
|
||||
Pure function. Assigns a sequential node_id (0001...) automatically. Uses a stack to resolve the hierarchy by header level.
|
||||
@@ -0,0 +1,57 @@
|
||||
---
|
||||
name: cache_decorator
|
||||
kind: function
|
||||
lang: py
|
||||
domain: core
|
||||
version: "1.0.0"
|
||||
purity: impure
|
||||
signature: "def cache_decorator(store: Any, ttl: float = 0, key_fn: Callable | None = None)"
|
||||
description: "Decorator that caches a function's result in any compatible persistent store (CacheStore or FileCache). The key is generated by hashing (func.__name__, args, sorted(kwargs)) with SHA-256. Supports sync and async functions."
|
||||
tags: [cache, decorator, memoize, persistence, async, functional]
|
||||
uses_functions: ["cache_to_sqlite_py_infra", "cache_to_file_py_infra"]
|
||||
uses_types: []
|
||||
returns: []
|
||||
returns_optional: false
|
||||
error_type: "error_go_core"
|
||||
imports: ["asyncio", "functools", "hashlib", "json"]
|
||||
tested: true
|
||||
tests:
|
||||
- "Funcion llamada una vez, segunda vez desde cache"
|
||||
- "TTL expirado → llama de nuevo"
|
||||
- "key_fn custom"
|
||||
- "Argumentos distintos → keys distintas"
|
||||
- "Funciona con async"
|
||||
test_file_path: "python/functions/core/cache_decorator_test.py"
|
||||
file_path: "python/functions/core/cache_decorator.py"
|
||||
---
|
||||
|
||||
## Ejemplo
|
||||
|
||||
```python
from infra.cache_to_sqlite import cache_to_sqlite
from core.cache_decorator import cache_decorator

store = cache_to_sqlite("cache.db", namespace="llm")

@cache_decorator(store, ttl=3600)
def call_llm(prompt: str) -> str:
    # expensive LLM call
    return client.complete(prompt)

result = call_llm("explain X")  # first call: hits the LLM
result = call_llm("explain X")  # second call: served from cache

# With a custom key_fn
@cache_decorator(store, ttl=600, key_fn=lambda fn, args, kw: args[0])
def fetch_user(user_id: str) -> dict:
    return api.get_user(user_id)

# With async
@cache_decorator(store, ttl=3600)
async def async_call(prompt: str) -> str:
    return await async_client.complete(prompt)
```
|
||||
|
||||
## Notas
|
||||
|
||||
The store must implement `get(key: str) -> Any | None` and `set(key: str, value: Any, ttl: float) -> None`. Async functions are detected automatically with `asyncio.iscoroutinefunction`. The default key uses `json.dumps(..., default=str)` so that arguments that are not JSON-serializable are converted to strings. If `store.get()` returns `None`, the function is always executed (there is no distinction between "not in cache" and "stored None value"); for values that can be None, use `get_or_set` directly.
|
||||
@@ -0,0 +1,67 @@
|
||||
"""Decorator that caches a function's result in a persistent store."""

import asyncio
import functools
import hashlib
import json
from typing import Any, Callable


def _default_key(func: Callable, args: tuple, kwargs: dict) -> str:
    """Build a cache key from the function name and its arguments."""
    payload = json.dumps((func.__name__, args, sorted(kwargs.items())), default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def cache_decorator(store: Any, ttl: float = 0, key_fn: Callable | None = None):
    """Return a decorator that caches results in a persistent store.

    Args:
        store: Any object with get(key) and set(key, value, ttl) methods.
            Compatible with CacheStore (cache_to_sqlite) and FileCache (cache_to_file).
        ttl: Time to live in seconds. 0 = no expiration.
        key_fn: Optional function to build the key. Receives (func, args, kwargs).
            If None, SHA-256 of (func.__name__, args, sorted(kwargs)) is used.

    Returns:
        Decorator applicable to sync or async functions.

    Example::

        store = cache_to_sqlite("cache.db")

        @cache_decorator(store, ttl=3600)
        def call_llm(prompt: str) -> str:
            ...  # expensive call

        result = call_llm("explain X")  # first call: runs the function
        result = call_llm("explain X")  # second call: served from cache
    """

    def decorator(func: Callable) -> Callable:
        if asyncio.iscoroutinefunction(func):
            @functools.wraps(func)
            async def async_wrapper(*args, **kwargs):
                make_key = key_fn or _default_key
                key = make_key(func, args, kwargs)
                cached = store.get(key)
                if cached is not None:
                    return cached
                result = await func(*args, **kwargs)
                store.set(key, result, ttl)
                return result
            return async_wrapper
        else:
            @functools.wraps(func)
            def sync_wrapper(*args, **kwargs):
                make_key = key_fn or _default_key
                key = make_key(func, args, kwargs)
                cached = store.get(key)
                if cached is not None:
                    return cached
                result = func(*args, **kwargs)
                store.set(key, result, ttl)
                return result
            return sync_wrapper

    return decorator
|
||||
@@ -0,0 +1,96 @@
"""Tests for cache_decorator."""

import asyncio
import sys
import os
import time

import pytest

sys.path.insert(0, os.path.dirname(__file__))
sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "infra"))

from cache_decorator import cache_decorator
from cache_to_sqlite import cache_to_sqlite


@pytest.fixture
def store(tmp_path):
    return cache_to_sqlite(str(tmp_path / "test.db"))


def test_funcion_llamada_una_vez_segunda_vez_desde_cache(store):
    calls = []

    @cache_decorator(store, ttl=60)
    def compute(x: int) -> int:
        calls.append(x)
        return x * 10

    assert compute(5) == 50
    assert compute(5) == 50
    assert len(calls) == 1


def test_ttl_expirado_llama_de_nuevo(store):
    calls = []

    @cache_decorator(store, ttl=0.05)
    def work(n: int) -> int:
        calls.append(n)
        return n + 1

    work(3)
    time.sleep(0.1)
    work(3)
    assert len(calls) == 2


def test_key_fn_custom(store):
    calls = []

    def my_key_fn(func, args, kwargs):
        return f"custom:{args[0]}"

    @cache_decorator(store, ttl=60, key_fn=my_key_fn)
    def fn(x: int) -> str:
        calls.append(x)
        return f"result_{x}"

    fn(7)
    fn(7)
    assert len(calls) == 1


def test_argumentos_distintos_keys_distintas(store):
    calls = []

    @cache_decorator(store, ttl=60)
    def fn(x: int) -> int:
        calls.append(x)
        return x * 2

    fn(1)
    fn(2)
    fn(1)
    assert len(calls) == 2


def test_funciona_con_async(store):
    calls = []

    @cache_decorator(store, ttl=60)
    async def async_fn(x: int) -> int:
        calls.append(x)
        return x + 100

    async def run():
        r1 = await async_fn(5)
        r2 = await async_fn(5)
        return r1, r2

    r1, r2 = asyncio.run(run())
    assert r1 == 105
    assert r2 == 105
    assert len(calls) == 1
@@ -0,0 +1,48 @@
---
name: calculate_media_strategy
kind: function
lang: py
domain: core
version: "1.0.0"
purity: pure
signature: "def calculate_media_strategy(image_count: int, line_count: int) -> str"
description: "Determines the optimal media-processing strategy for a document from its ratio of images to text. Returns full_page_vlm, extract or text_only."
tags: [media, strategy, document, vision, vlm, images, classification]
uses_functions: []
uses_types: []
returns: []
returns_optional: false
error_type: ""
imports: []
tested: true
tests:
  - "0 imagenes text_only"
  - "2 imagenes 100 lineas extract"
  - "10 imagenes 20 lineas full_page_vlm"
  - "5 imagenes 100 lineas full_page_vlm"
  - "0 lineas division por cero evitada"
test_file_path: "python/functions/core/calculate_media_strategy_test.py"
file_path: "python/functions/core/calculate_media_strategy.py"
---

## Example

```python
calculate_media_strategy(0, 50)    # "text_only"
calculate_media_strategy(2, 100)   # "extract" (ratio 0.02, few images)
calculate_media_strategy(10, 20)   # "full_page_vlm" (ratio 0.5 > 0.3)
calculate_media_strategy(5, 100)   # "full_page_vlm" (>= 5 images)
calculate_media_strategy(3, 0)     # "text_only" (no text, no context)
```

## Notes

Three-tier classification logic:

1. `full_page_vlm`: image-dominated document, meaning an image-to-line ratio > 0.3 or at least 5 images. A vision-language model is run over the full page.
2. `extract`: a few images in a document with text; the images are extracted and processed individually.
3. `text_only`: no images or no lines of text; only the text is processed.

The `line_count > 0` guard avoids division by zero and treats documents with no lines as `text_only` regardless of the image count, since without text there is not enough context to classify them as `extract`.

Pure function with no external dependencies. Conceptually reimplemented from the media-classification logic in OpenViking (AGPL-3.0).
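The thresholds are easiest to check at their boundaries. The sketch below copies the three-tier logic so that it runs standalone and evaluates a few edge cases; the list of cases is chosen for illustration and is not part of the registry tests.

```python
def calculate_media_strategy(image_count: int, line_count: int) -> str:
    """Same three-tier logic as the registry function, copied for a standalone demo."""
    if line_count > 0 and (image_count / line_count > 0.3 or image_count >= 5):
        return "full_page_vlm"
    if line_count > 0 and image_count > 0:
        return "extract"
    return "text_only"


cases = [
    (3, 10),    # ratio 0.3 is not above the threshold -> "extract"
    (4, 10),    # ratio 0.4 > 0.3 -> "full_page_vlm"
    (4, 100),   # 4 images, ratio 0.04 -> "extract"
    (5, 1000),  # ratio 0.005, but 5 images -> "full_page_vlm"
    (7, 0),     # no lines, the guard short-circuits -> "text_only"
]
for image_count, line_count in cases:
    print(image_count, line_count, calculate_media_strategy(image_count, line_count))
```

Note that the ratio test is strict (`>`), while the image-count test is inclusive (`>=`).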
@@ -0,0 +1,24 @@
"""Determines the optimal media-processing strategy for a document."""


def calculate_media_strategy(image_count: int, line_count: int) -> str:
    """Determine the optimal media-processing strategy.

    Classifies a document into one of three strategies based on the
    ratio of images to text:
    - full_page_vlm: image-dominated document, use a vision-language model
    - extract: few images, extract and process them individually
    - text_only: no images or no lines of text, process text only

    Args:
        image_count: number of images in the document.
        line_count: number of lines of text in the document.

    Returns:
        "full_page_vlm", "extract" or "text_only".
    """
    if line_count > 0 and (image_count / line_count > 0.3 or image_count >= 5):
        return "full_page_vlm"
    if line_count > 0 and image_count > 0:
        return "extract"
    return "text_only"
@@ -0,0 +1,23 @@
"""Tests for calculate_media_strategy."""

from calculate_media_strategy import calculate_media_strategy


def test_0_imagenes_text_only():
    assert calculate_media_strategy(0, 50) == "text_only"


def test_2_imagenes_100_lineas_extract():
    assert calculate_media_strategy(2, 100) == "extract"


def test_10_imagenes_20_lineas_full_page_vlm():
    assert calculate_media_strategy(10, 20) == "full_page_vlm"


def test_5_imagenes_100_lineas_full_page_vlm():
    assert calculate_media_strategy(5, 100) == "full_page_vlm"


def test_0_lineas_division_por_cero_evitada():
    assert calculate_media_strategy(3, 0) == "text_only"
@@ -0,0 +1,40 @@
---
name: calculate_page_offset
kind: function
lang: py
domain: core
version: "1.0.0"
purity: pure
signature: "def calculate_page_offset(pairs: list[dict]) -> int"
description: "Computes the offset between logical and physical page numbers from reference pairs (mode of the differences)."
tags: [pagination, offset, calculation]
uses_functions: []
uses_types: []
returns: []
returns_optional: false
error_type: ""
imports: []
tested: false
tests: []
test_file_path: ""
file_path: "python/functions/core/core.py"
source_repo: "https://github.com/VectifyAI/PageIndex"
source_license: "MIT"
source_file: "pageindex/page_index.py"
---

## Example

```python
pairs = [
    {"page": 1, "physical_index": 5},
    {"page": 2, "physical_index": 6},
    {"page": 10, "physical_index": 14},
]
calculate_page_offset(pairs)
# 4 (the mode of the differences physical_index - page)
```

## Notes

Pure function. Each pair needs the fields `page` (logical page number) and `physical_index` (physical index). Returns the most frequent difference `physical_index - page` (the mode), or 0 when there are no valid pairs.
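The notes describe the computation only in words. A minimal sketch of the mode-of-differences idea follows; it is not the PageIndex implementation. The name `page_offset_sketch`, the integer check that decides which pairs count as valid, and the tie-breaking rule (the first difference seen wins) are assumptions made for illustration.

```python
from collections import Counter


def page_offset_sketch(pairs: list) -> int:
    """Return the most frequent `physical_index - page` difference, or 0."""
    differences = []
    for pair in pairs:
        page = pair.get("page")
        physical = pair.get("physical_index")
        # Assumption: a pair is valid only when both fields are integers.
        if isinstance(page, int) and isinstance(physical, int):
            differences.append(physical - page)
    if not differences:
        return 0
    # most_common(1) picks the highest count; ties keep first-seen order.
    return Counter(differences).most_common(1)[0][0]


pairs = [
    {"page": 1, "physical_index": 5},
    {"page": 2, "physical_index": 6},
    {"page": 10, "physical_index": 15},  # outlier: difference 5
    {"page": 3},                         # missing physical_index, skipped
]
print(page_offset_sketch(pairs))  # 4
print(page_offset_sketch([]))     # 0
```

Using the mode rather than the mean keeps a single misdetected page number from shifting the offset.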
@@ -0,0 +1,55 @@
---
name: call_batch_with_retry
kind: function
lang: py
domain: core
version: "1.0.0"
purity: impure
signature: "def call_batch_with_retry(items: list[T], process_func: Callable[[T], R], max_retries: int = 3, initial_delay: float = 1.0, max_delay: float = 30.0, backoff_factor: float = 2.0, exceptions: tuple[type[Exception], ...] = (Exception,), continue_on_failure: bool = True) -> tuple[list[R], list[dict]]"
description: "Processes a list of items with per-item retry and exponential backoff. Individual failures do not block the rest of the batch. Returns (results, failures), where failures holds the index, item and error of every item that exhausted its retries."
tags: [retry, batch, backoff, resilience, error-handling, core]
uses_functions: []
uses_types: []
returns: []
returns_optional: false
error_type: "error_go_core"
imports: ["time", "random", "typing.Callable", "typing.TypeVar"]
tested: true
tests:
  - "todos los items exito"
  - "item falla permanentemente, continue True"
  - "item falla, abort continue False"
  - "item falla luego exito retry funciona"
  - "failures contiene index correcto"
test_file_path: "python/functions/core/call_batch_with_retry_test.py"
file_path: "python/functions/core/call_batch_with_retry.py"
---

## Example

```python
# fetch_url stands for any callable that takes one item and returns a result.
results, failures = call_batch_with_retry(
    items=["url1", "url2", "url3"],
    process_func=fetch_url,
    max_retries=3,
    initial_delay=1.0,
    max_delay=30.0,
    backoff_factor=2.0,
    exceptions=(ConnectionError, TimeoutError),
    continue_on_failure=True,
)

for r in results:
    print("OK:", r)

for f in failures:
    print(f"FAIL index={f['index']} item={f['item']} error={f['error']}")
```

## Notes

Difference from `retry_sync_py_core`: that function retries a single call. This one handles whole lists in which each item is retried independently; individual failures are recorded in `failures` without interrupting the batch (when `continue_on_failure=True`). `results` holds only the successful values, in input order, so its positions no longer line up with `items` once any item fails.

The backoff follows `min(initial_delay * backoff_factor^attempt, max_delay)`, plus jitter of up to 10% of the computed delay to avoid a thundering herd. Because the jitter is added after the cap, a sleep can exceed `max_delay` by up to 10%. The first attempt is always immediate; the delay is applied only before retries, starting with `attempt=0` for the first retry.

When `continue_on_failure=False`, the first item that exhausts its retries re-raises its last exception immediately, aborting the batch.
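To make the backoff formula concrete, the sketch below computes the base delay slept before each retry, without the random jitter. The helper name `backoff_delays` is hypothetical; its defaults mirror the documented parameters.

```python
def backoff_delays(
    max_retries: int = 3,
    initial_delay: float = 1.0,
    max_delay: float = 30.0,
    backoff_factor: float = 2.0,
) -> list:
    """Base delay (before jitter) slept ahead of each retry."""
    return [
        min(initial_delay * backoff_factor**attempt, max_delay)
        for attempt in range(max_retries)
    ]


print(backoff_delays())               # [1.0, 2.0, 4.0]
print(backoff_delays(max_retries=6))  # [1.0, 2.0, 4.0, 8.0, 16.0, 30.0]
```

With the defaults, an item that fails every attempt therefore waits at least 7 seconds in total before it is recorded as a failure; jitter adds up to 10% on top of each delay.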
@@ -0,0 +1,81 @@
"""Process a batch of items with per-item exponential backoff retry."""

import time
import random
from typing import Callable, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def call_batch_with_retry(
    items: list[T],
    process_func: Callable[[T], R],
    max_retries: int = 3,
    initial_delay: float = 1.0,
    max_delay: float = 30.0,
    backoff_factor: float = 2.0,
    exceptions: tuple[type[Exception], ...] = (Exception,),
    continue_on_failure: bool = True,
) -> tuple[list[R], list[dict]]:
    """Process a list of items with independent per-item retry and exponential backoff.

    Each item is processed by process_func. If it raises one of the specified
    exceptions, it is retried up to max_retries times with exponential backoff.
    If all retries are exhausted, the item is recorded as a failure.

    Args:
        items: List of items to process.
        process_func: Callable that takes a single item and returns a result.
        max_retries: Maximum number of retry attempts per item after first failure.
        initial_delay: Initial delay in seconds before the first retry.
        max_delay: Maximum delay cap in seconds between retries (before jitter).
        backoff_factor: Multiplier applied to delay on each successive retry.
        exceptions: Tuple of exception types to catch and retry on.
        continue_on_failure: If True, continue processing remaining items when an
            item exhausts all retries. If False, re-raise the exception immediately.

    Returns:
        A tuple (results, failures) where:
        - results is a list of successful return values from process_func.
        - failures is a list of dicts with keys "index", "item", and "error"
          for each item that failed after all retries.

    Raises:
        Exception: The last exception for a failed item when continue_on_failure
            is False.
    """
    results = []
    failures = []

    for index, item in enumerate(items):
        last_exc = None
        succeeded = False

        for attempt in range(max_retries + 1):
            try:
                result = process_func(item)
                results.append(result)
                succeeded = True
                break
            except exceptions as exc:
                last_exc = exc
                if attempt < max_retries:
                    delay = min(
                        initial_delay * (backoff_factor ** attempt),
                        max_delay,
                    )
                    # Add small jitter (up to 10% of delay) to avoid thundering herd
                    delay += random.uniform(0, delay * 0.1)
                    time.sleep(delay)

        if not succeeded:
            if not continue_on_failure:
                raise last_exc
            failures.append({
                "index": index,
                "item": item,
                "error": str(last_exc),
            })

    return results, failures
@@ -0,0 +1,102 @@
"""Tests for call_batch_with_retry."""

import sys
import os

sys.path.insert(0, os.path.dirname(__file__))

from call_batch_with_retry import call_batch_with_retry


def test_todos_los_items_exito():
    results, failures = call_batch_with_retry(
        items=[1, 2, 3],
        process_func=lambda x: x * 2,
        max_retries=3,
    )
    assert results == [2, 4, 6]
    assert failures == []


def test_item_falla_permanentemente_continue_true():
    def process(x):
        if x == 2:
            raise ValueError("fallo permanente")
        return x * 10

    results, failures = call_batch_with_retry(
        items=[1, 2, 3],
        process_func=process,
        max_retries=2,
        initial_delay=0.0,
        continue_on_failure=True,
    )
    assert results == [10, 30]
    assert len(failures) == 1
    assert failures[0]["index"] == 1
    assert failures[0]["item"] == 2
    assert "fallo permanente" in failures[0]["error"]


def test_item_falla_abort_continue_false():
    call_count = {"n": 0}

    def process(x):
        call_count["n"] += 1
        if x == 2:
            raise RuntimeError("error fatal")
        return x

    try:
        call_batch_with_retry(
            items=[1, 2, 3],
            process_func=process,
            max_retries=1,
            initial_delay=0.0,
            continue_on_failure=False,
        )
        assert False, "Should have raised an exception"
    except RuntimeError as e:
        assert "error fatal" in str(e)
    # item 3 was never processed
    assert call_count["n"] == 3  # 1 ok + 2 attempts for item 2 + 0 for item 3


def test_item_falla_luego_exito_retry_funciona():
    attempt_counts = {}

    def process(x):
        attempt_counts[x] = attempt_counts.get(x, 0) + 1
        # item 5 fails the first 2 times and succeeds on the third
        if x == 5 and attempt_counts[x] < 3:
            raise ValueError("fallo temporal")
        return x * 2

    results, failures = call_batch_with_retry(
        items=[1, 5, 9],
        process_func=process,
        max_retries=3,
        initial_delay=0.0,
        continue_on_failure=True,
    )
    assert results == [2, 10, 18]
    assert failures == []
    assert attempt_counts[5] == 3


def test_failures_contiene_index_correcto():
    def process(x):
        if x in (0, 2, 4):
            raise ValueError(f"fallo en {x}")
        return x

    results, failures = call_batch_with_retry(
        items=[0, 1, 2, 3, 4],
        process_func=process,
        max_retries=0,
        initial_delay=0.0,
        continue_on_failure=True,
    )
    assert results == [1, 3]
    assert [f["index"] for f in failures] == [0, 2, 4]
    assert [f["item"] for f in failures] == [0, 2, 4]
@@ -0,0 +1,66 @@
---
name: circuit_breaker
kind: function
lang: py
domain: core
version: "1.0.0"
purity: impure
signature: "class CircuitBreaker:\n    def __init__(self, failure_threshold: int = 5, reset_timeout: float = 300.0): ...\n    def check(self) -> None: ...\n    def record_success(self) -> None: ...\n    def record_failure(self, error: Exception) -> None: ...\n    @property\n    def retry_after(self) -> float: ..."
description: "Thread-safe circuit breaker pattern for protecting calls to external APIs. Three states: CLOSED (normal), OPEN (blocking), HALF_OPEN (letting a probe request through). Integrates with classify_api_error to tell permanent errors from transient ones."
tags: [circuit-breaker, resilience, api, retry, error-handling, thread-safe]
uses_functions: [classify_api_error_py_core]
uses_types: []
returns: []
returns_optional: false
error_type: "error_go_core"
imports: [threading, time, enum]
tested: true
tests:
  - "Transicion CLOSED → OPEN despues de N fallos"
  - "Transicion OPEN → HALF_OPEN despues de timeout"
  - "Transicion HALF_OPEN → CLOSED en exito"
  - "Transicion HALF_OPEN → OPEN en fallo"
  - "Error permanente abre inmediatamente"
  - "Thread safety (concurrencia)"
  - "retry_after retorna 0 cuando no esta OPEN"
test_file_path: "python/functions/core/circuit_breaker_test.py"
file_path: "python/functions/core/circuit_breaker.py"
---

## Example

```python
import requests

from circuit_breaker import CircuitBreaker, CircuitBreakerOpen

cb = CircuitBreaker(failure_threshold=3, reset_timeout=60.0)

def call_api() -> dict:
    cb.check()  # raises CircuitBreakerOpen if circuit is open
    try:
        result = requests.get("https://api.example.com/data").json()
        cb.record_success()
        return result
    except Exception as exc:
        cb.record_failure(exc)
        raise

# After 3 consecutive failures the circuit opens:
#   CircuitBreakerOpen: Circuit breaker is open. Retry after 30.0s
try:
    cb.check()
except CircuitBreakerOpen as e:
    print(f"Circuit open, retry in {e.retry_after}s")

# retry_after property (capped at 30s):
print(cb.retry_after)  # e.g. 28.4
```

## Notes

- **CLOSED**: Requests pass through normally. After `failure_threshold` consecutive failures it transitions to OPEN.
- **OPEN**: Requests are blocked with `CircuitBreakerOpen`. After `reset_timeout` seconds the next `check()` transitions to HALF_OPEN.
- **HALF_OPEN**: Lets a probe request through. Success → CLOSED. Failure → OPEN. `check()` does not reserve the probe slot, so concurrent callers arriving in this state all pass.
- Permanent errors (400, 401, 403) open the circuit immediately, without waiting for the threshold.
- `retry_after` returns 0.0 when the state is not OPEN; in OPEN it returns the remaining time, capped at 30s.
- Thread-safe via a `threading.Lock` that guards all internal state.
- The dependency on `classify_api_error` is optional: if it cannot be imported, a text-matching fallback is used.
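The 30-second cap on `retry_after` can be surprising, so the arithmetic is sketched below as a standalone function. The name `remaining_wait` and the explicit `elapsed` argument are assumptions for illustration; the class itself measures elapsed time with `time.monotonic()`.

```python
def remaining_wait(reset_timeout: float, elapsed: float, cap: float = 30.0) -> float:
    """Seconds reported to callers: time left in OPEN, clamped to [0, cap]."""
    return min(max(reset_timeout - elapsed, 0.0), cap)


print(remaining_wait(60.0, 0.0))    # 30.0 (60s really remain, the cap applies)
print(remaining_wait(60.0, 45.0))   # 15.0
print(remaining_wait(60.0, 75.0))   # 0.0
print(remaining_wait(300.0, 10.0))  # 30.0 (290s really remain)
```

Because of the cap, a caller that waits exactly the reported value may still find the circuit open, notably with the default `reset_timeout` of 300 seconds, and should be prepared to handle `CircuitBreakerOpen` again.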
@@ -0,0 +1,141 @@
"""Circuit breaker pattern for protecting external API calls."""

import threading
import time
from enum import Enum


class CircuitBreakerState(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class CircuitBreakerOpen(Exception):
    """Raised when the circuit breaker is open and blocking requests."""

    def __init__(self, retry_after: float) -> None:
        self.retry_after = retry_after
        super().__init__(f"Circuit breaker is open. Retry after {retry_after:.1f}s")


def _is_permanent_error(error: Exception) -> bool:
    """Return True if the error is permanent (should open circuit immediately)."""
    try:
        from classify_api_error import classify_api_error

        return classify_api_error(error) == "permanent"
    except ImportError:
        # Fallback: inspect error text directly
        text = str(error)
        if error.__cause__ is not None:
            text += " " + str(error.__cause__)
        permanent_patterns = ["400", "401", "403", "Forbidden", "Unauthorized"]
        return any(p in text for p in permanent_patterns)


class CircuitBreaker:
    """Thread-safe circuit breaker for protecting external API calls.

    Implements three states:
    - CLOSED: requests pass through normally.
    - OPEN: requests are blocked with CircuitBreakerOpen.
    - HALF_OPEN: probe requests are allowed through.

    Args:
        failure_threshold: Consecutive failures before opening. Default 5.
        reset_timeout: Seconds to wait in OPEN before trying HALF_OPEN. Default 300.0.
    """

    def __init__(
        self,
        failure_threshold: int = 5,
        reset_timeout: float = 300.0,
    ) -> None:
        self._failure_threshold = failure_threshold
        self._reset_timeout = reset_timeout
        self._lock = threading.Lock()

        self._state = CircuitBreakerState.CLOSED
        self._failure_count = 0
        self._opened_at: float | None = None

    # ------------------------------------------------------------------
    # Public interface
    # ------------------------------------------------------------------

    def check(self) -> None:
        """Check whether a request is allowed through.

        Raises:
            CircuitBreakerOpen: If the circuit is open and reset_timeout
                has not elapsed yet.
        """
        with self._lock:
            if self._state is CircuitBreakerState.CLOSED:
                return

            if self._state is CircuitBreakerState.OPEN:
                elapsed = time.monotonic() - self._opened_at  # type: ignore[operator]
                if elapsed >= self._reset_timeout:
                    self._state = CircuitBreakerState.HALF_OPEN
                    return
                remaining = self._reset_timeout - elapsed
                raise CircuitBreakerOpen(min(remaining, 30.0))

            # HALF_OPEN: probe requests pass. No slot is reserved here, so
            # every caller that arrives while in this state is let through.
            if self._state is CircuitBreakerState.HALF_OPEN:
                return

    def record_success(self) -> None:
        """Record a successful request. Resets the breaker to CLOSED."""
        with self._lock:
            self._state = CircuitBreakerState.CLOSED
            self._failure_count = 0
            self._opened_at = None

    def record_failure(self, error: Exception) -> None:
        """Record a failed request.

        If the error is permanent (e.g. 400/401/403), opens immediately.
        Otherwise increments the failure counter and opens once it
        reaches failure_threshold.

        Args:
            error: The exception that was raised.
        """
        with self._lock:
            if _is_permanent_error(error):
                self._trip()
                return

            if self._state is CircuitBreakerState.HALF_OPEN:
                self._trip()
                return

            self._failure_count += 1
            if self._failure_count >= self._failure_threshold:
                self._trip()

    @property
    def retry_after(self) -> float:
        """Seconds until the circuit transitions to HALF_OPEN.

        Returns 0.0 when not in OPEN state, capped at 30 seconds.
        """
        with self._lock:
            if self._state is not CircuitBreakerState.OPEN:
                return 0.0
            elapsed = time.monotonic() - self._opened_at  # type: ignore[operator]
            remaining = self._reset_timeout - elapsed
            return min(max(remaining, 0.0), 30.0)

    # ------------------------------------------------------------------
    # Internal helpers
    # ------------------------------------------------------------------

    def _trip(self) -> None:
        """Open the circuit (must be called with _lock held)."""
        self._state = CircuitBreakerState.OPEN
        self._failure_count = 0
        self._opened_at = time.monotonic()
@@ -0,0 +1,156 @@
"""Tests for circuit_breaker."""

import sys
import os
import threading
import time

sys.path.insert(0, os.path.dirname(__file__))

from circuit_breaker import CircuitBreaker, CircuitBreakerOpen, CircuitBreakerState


# ---------------------------------------------------------------------------
# Helpers
# ---------------------------------------------------------------------------


def _transient_error() -> Exception:
    return Exception("HTTP 503 Service Unavailable")


def _permanent_error() -> Exception:
    return Exception("HTTP 401 Unauthorized")


# ---------------------------------------------------------------------------
# Tests
# ---------------------------------------------------------------------------


def test_closed_to_open_after_n_failures() -> None:
    """Transicion CLOSED → OPEN despues de N fallos"""
    cb = CircuitBreaker(failure_threshold=3, reset_timeout=60.0)

    cb.check()  # Should not raise

    cb.record_failure(_transient_error())
    cb.record_failure(_transient_error())
    assert cb._state is CircuitBreakerState.CLOSED  # Still closed after 2

    cb.record_failure(_transient_error())
    assert cb._state is CircuitBreakerState.OPEN

    try:
        cb.check()
        assert False, "Should have raised CircuitBreakerOpen"
    except CircuitBreakerOpen:
        pass

    print("PASS: Transicion CLOSED → OPEN despues de N fallos")


def test_open_to_half_open_after_timeout() -> None:
    """Transicion OPEN → HALF_OPEN despues de timeout"""
    cb = CircuitBreaker(failure_threshold=1, reset_timeout=0.05)
    cb.record_failure(_transient_error())
    assert cb._state is CircuitBreakerState.OPEN

    time.sleep(0.1)

    cb.check()  # Should not raise; transitions to HALF_OPEN
    assert cb._state is CircuitBreakerState.HALF_OPEN

    print("PASS: Transicion OPEN → HALF_OPEN despues de timeout")


def test_half_open_to_closed_on_success() -> None:
    """Transicion HALF_OPEN → CLOSED en exito"""
    cb = CircuitBreaker(failure_threshold=1, reset_timeout=0.05)
    cb.record_failure(_transient_error())
    time.sleep(0.1)
    cb.check()  # enters HALF_OPEN
    assert cb._state is CircuitBreakerState.HALF_OPEN

    cb.record_success()
    assert cb._state is CircuitBreakerState.CLOSED

    cb.check()  # Should not raise

    print("PASS: Transicion HALF_OPEN → CLOSED en exito")


def test_half_open_to_open_on_failure() -> None:
    """Transicion HALF_OPEN → OPEN en fallo"""
    cb = CircuitBreaker(failure_threshold=1, reset_timeout=0.05)
    cb.record_failure(_transient_error())
    time.sleep(0.1)
    cb.check()  # enters HALF_OPEN
    assert cb._state is CircuitBreakerState.HALF_OPEN

    cb.record_failure(_transient_error())
    assert cb._state is CircuitBreakerState.OPEN

    print("PASS: Transicion HALF_OPEN → OPEN en fallo")


def test_permanent_error_opens_immediately() -> None:
    """Error permanente abre inmediatamente"""
    cb = CircuitBreaker(failure_threshold=10, reset_timeout=60.0)
    assert cb._state is CircuitBreakerState.CLOSED

    cb.record_failure(_permanent_error())
    assert cb._state is CircuitBreakerState.OPEN

    print("PASS: Error permanente abre inmediatamente")


def test_thread_safety() -> None:
    """Thread safety (concurrencia)"""
    cb = CircuitBreaker(failure_threshold=5, reset_timeout=60.0)
    errors: list[Exception] = []

    def worker() -> None:
        try:
            for _ in range(10):
                cb.check()
                cb.record_failure(_transient_error())
        except CircuitBreakerOpen:
            pass
        except Exception as exc:
            errors.append(exc)

    threads = [threading.Thread(target=worker) for _ in range(20)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    assert not errors, f"Thread errors: {errors}"
    # No success is recorded and reset_timeout is 60s, so the circuit must end OPEN
    assert cb._state is CircuitBreakerState.OPEN

    print("PASS: Thread safety (concurrencia)")


def test_retry_after_returns_zero_when_not_open() -> None:
    """retry_after retorna 0 cuando no esta OPEN"""
    cb = CircuitBreaker(failure_threshold=5, reset_timeout=60.0)
    assert cb.retry_after == 0.0

    cb.record_failure(_transient_error())
    # Still CLOSED (threshold not reached)
    assert cb.retry_after == 0.0

    print("PASS: retry_after retorna 0 cuando no esta OPEN")


if __name__ == "__main__":
    test_closed_to_open_after_n_failures()
    test_open_to_half_open_after_timeout()
    test_half_open_to_closed_on_success()
    test_half_open_to_open_on_failure()
    test_permanent_error_opens_immediately()
    test_thread_safety()
    test_retry_after_returns_zero_when_not_open()
    print("\nAll tests passed.")
@@ -0,0 +1,41 @@
|
||||
---
|
||||
name: classify_api_error
|
||||
kind: function
|
||||
lang: py
|
||||
domain: core
|
||||
version: "1.0.0"
|
||||
purity: pure
|
||||
signature: "def classify_api_error(error: Exception) -> str"
|
||||
description: "Clasifica un error de API como permanente (no reintentar), transitorio (reintentar) o desconocido. Permanente tiene prioridad sobre transitorio."
|
||||
tags: [retry, error, classification, api, backoff]
|
||||
uses_functions: []
|
||||
uses_types: []
|
||||
returns: []
|
||||
returns_optional: false
|
||||
error_type: ""
|
||||
imports: []
|
||||
tested: true
|
||||
tests: ["error 429 es transitorio", "error 401 es permanente", "error timeout es transitorio", "error desconocido retorna unknown", "error con __cause__ transitorio"]
|
||||
test_file_path: "python/functions/core/classify_api_error_test.py"
|
||||
file_path: "python/functions/core/classify_api_error.py"
|
||||
---
|
||||
|
||||
## Ejemplo
|
||||
|
||||
```python
|
||||
err = Exception("HTTP 429 TooManyRequests")
|
||||
classify_api_error(err) # "transient"
|
||||
|
||||
err = Exception("HTTP 401 Unauthorized")
|
||||
classify_api_error(err) # "permanent"
|
||||
|
||||
err = Exception("Connection timeout")
|
||||
classify_api_error(err) # "transient"
|
||||
|
||||
err = Exception("Something unexpected happened")
|
||||
classify_api_error(err) # "unknown"
|
||||
```
|
||||
|
||||
## Notas
|
||||
|
||||
Funcion pura: solo inspecciona el texto del error y su causa directa (`__cause__`). No tiene I/O ni dependencias externas. La prioridad permanente > transitorio evita reintentar errores 400/401/403 que nunca tendran exito.
@@ -0,0 +1,38 @@
"""Classify an API exception as permanent, transient, or unknown."""


def classify_api_error(error: Exception) -> str:
    """Classify an API error as permanent, transient, or unknown.

    Permanent errors should not be retried (e.g. auth failures, bad requests).
    Transient errors are safe to retry (e.g. rate limits, timeouts, server errors).
    Permanent classification takes priority over transient.

    Args:
        error: The exception to classify.

    Returns:
        "permanent" | "transient" | "unknown"
    """
    parts = [str(error)]
    if error.__cause__ is not None:
        parts.append(str(error.__cause__))
    text = " ".join(parts)

    permanent_patterns = ["400", "401", "403", "Forbidden", "Unauthorized"]
    transient_patterns = [
        "429", "500", "502", "503", "504",
        "TooManyRequests", "RateLimit",
        "timeout", "Timeout",
        "ConnectionError", "Connection refused", "Connection reset",
    ]

    for pattern in permanent_patterns:
        if pattern in text:
            return "permanent"

    for pattern in transient_patterns:
        if pattern in text:
            return "transient"

    return "unknown"
@@ -0,0 +1,50 @@
"""Tests for classify_api_error."""

import sys
import os

sys.path.insert(0, os.path.dirname(__file__))
from classify_api_error import classify_api_error


def test_error_429_es_transitorio():
    err = Exception("HTTP 429 TooManyRequests")
    assert classify_api_error(err) == "transient"


def test_error_401_es_permanente():
    err = Exception("HTTP 401 Unauthorized")
    assert classify_api_error(err) == "permanent"


def test_error_timeout_es_transitorio():
    err = Exception("Connection timeout occurred")
    assert classify_api_error(err) == "transient"


def test_error_desconocido_retorna_unknown():
    err = Exception("Something completely unexpected happened")
    assert classify_api_error(err) == "unknown"


def test_error_con___cause___transitorio():
    cause = Exception("Connection reset by peer")
    err = Exception("Request failed")
    err.__cause__ = cause
    assert classify_api_error(err) == "transient"


def test_permanente_tiene_prioridad_sobre_transitorio():
    # Message containing patterns of both kinds: 401 (permanent) and 503 (transient)
    err = Exception("401 503 mixed error")
    assert classify_api_error(err) == "permanent"


def test_error_403_forbidden_es_permanente():
    err = Exception("403 Forbidden")
    assert classify_api_error(err) == "permanent"


def test_error_500_es_transitorio():
    err = Exception("Internal server error 500")
    assert classify_api_error(err) == "transient"
@@ -0,0 +1,49 @@
---
name: coerce_types
kind: function
lang: py
domain: core
version: "1.0.0"
purity: pure
signature: "def coerce_types(data: dict, schema: dict[str, str]) -> tuple[dict, list[str]]"
description: "Converts the values of a dict to the types expected by a declarative schema. Supports int, float, str, bool, datetime, list[str]. Useful for normalizing data from CSV, JSON, or query params. Never mutates the original. Impossible coercions produce a warning and keep the original value."
tags: [coercion, types, normalization, pure, core, csv, json]
uses_functions: []
uses_types: []
returns: []
returns_optional: false
error_type: ""
imports: [datetime]
tested: true
tests:
  - "string 42 a int 42"
  - "string 3.14 a float 3.14"
  - "string true a bool true"
  - "string iso8601 a datetime"
  - "coercion fallida genera warning sin crash"
  - "dict con mix de tipos ya correctos y strings"
  - "campo ausente en schema pass through sin tocar"
  - "string lista a list str"
test_file_path: "python/functions/core/coerce_types_test.py"
file_path: "python/functions/core/coerce_types.py"
---

## Example

```python
data = {"age": "25", "score": "9.5", "active": "yes", "tags": "go, python"}
schema = {"age": "int", "score": "float", "active": "bool", "tags": "list[str]"}

result, warnings = coerce_types(data, schema)
# result = {"age": 25, "score": 9.5, "active": True, "tags": ["go", "python"]}
# warnings = []

# Failed coercion: keeps the original value and warns
result2, warnings2 = coerce_types({"n": "abc"}, {"n": "int"})
# result2 = {"n": "abc"}
# warnings2 = ["n: cannot coerce 'abc' to int: cannot parse 'abc' as int"]
```

## Notes

Pure function. It only uses `datetime` from the stdlib. It does not mutate the original dict; it returns a new (shallow-copied) one. The schema is flat (not nested); to validate complex structures, combine it with `validate_json_schema`. Lossy coercions (e.g. `"3.7"` or `3.7` → int `3`) add an extra warning. A field that is absent from the schema is copied through untouched.
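
The lossy-coercion check mentioned above can be sketched in isolation as follows. `to_int_checked` is a hypothetical helper written for this note, not a function in the registry; it mirrors the idea of parsing through `float` and comparing against the truncated value.

```python
def to_int_checked(raw: str) -> tuple[int, bool]:
    """Parse raw as an int; the flag is True when a fractional part was dropped.

    Hypothetical helper illustrating the str -> int lossy check.
    """
    as_float = float(raw.strip())
    as_int = int(as_float)  # truncates toward zero
    return as_int, as_float != as_int


print(to_int_checked("42"))   # (42, False)
print(to_int_checked("3.7"))  # (3, True)
```

One trade-off of parsing through `float` is that integers larger than 2**53 cannot be represented exactly, so a caller that expects very large IDs may prefer to try `int(raw)` first.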
@@ -0,0 +1,135 @@
"""Coerce the values of a dict to expected types according to a declarative schema."""

from datetime import datetime


def coerce_types(
    data: dict, schema: dict[str, str]
) -> tuple[dict, list[str]]:
    """Convert the values of a dict to the types expected by the schema.

    The schema is a dict of {field: type} where type is one of:
    "int", "float", "str", "bool", "datetime", "list[str]".

    Supported coercions (all from str):
    - str → int: int(v), warning if it had decimals
    - str → float: float(v)
    - str → bool: "true/1/yes" → True, "false/0/no" → False (case-insensitive)
    - str → datetime: ISO 8601 parse
    - str → list[str]: split on "," and strip each element
    - Value already of the correct type → pass through
    - Field absent from schema → pass through untouched
    - Impossible coercion → keep original + warning

    Args:
        data: Dict with the values to coerce.
        schema: Dict of {field: expected_type}.

    Returns:
        (coerced_data, warnings): new dict with corrected types (the original
        is not mutated) and a list of warnings for lossy or failed coercions.
    """
    result = dict(data)
    warnings: list[str] = []

    for field, target_type in schema.items():
        if field not in data:
            continue

        value = data[field]
        try:
            result[field] = _coerce_value(value, target_type, field, warnings)
        except Exception as exc:
            warnings.append(
                f"{field}: cannot coerce {value!r} to {target_type}: {exc}"
            )
            result[field] = value

    return result, warnings


_BOOL_TRUE = {"true", "1", "yes"}
_BOOL_FALSE = {"false", "0", "no"}


def _coerce_value(
    value: object, target: str, field: str, warnings: list[str]
) -> object:
    # --- int ---
    if target == "int":
        if isinstance(value, int) and not isinstance(value, bool):
            return value
        if isinstance(value, float):
            if value != int(value):
                warnings.append(
                    f"{field}: lossy coercion float→int: {value} → {int(value)}"
                )
            return int(value)
        if isinstance(value, str):
            stripped = value.strip()
            # detect a non-zero fractional part
            try:
                as_float = float(stripped)
                if as_float != int(as_float):
                    warnings.append(
                        f"{field}: lossy coercion str→int: {value!r} → {int(as_float)}"
                    )
                return int(as_float)
            except ValueError:
                raise ValueError(f"cannot parse {value!r} as int")
        raise TypeError(f"cannot coerce {type(value).__name__} to int")

    # --- float ---
    if target == "float":
        if isinstance(value, float):
            return value
        if isinstance(value, int) and not isinstance(value, bool):
            return float(value)
        if isinstance(value, str):
            return float(value.strip())
        raise TypeError(f"cannot coerce {type(value).__name__} to float")

    # --- str ---
    if target == "str":
        if isinstance(value, str):
            return value
        return str(value)

    # --- bool ---
    if target == "bool":
        if isinstance(value, bool):
            return value
        if isinstance(value, str):
            low = value.strip().lower()
            if low in _BOOL_TRUE:
                return True
            if low in _BOOL_FALSE:
                return False
            raise ValueError(
                f"cannot parse {value!r} as bool; expected true/false/1/0/yes/no"
            )
        if isinstance(value, int):
            return bool(value)
        raise TypeError(f"cannot coerce {type(value).__name__} to bool")

    # --- datetime ---
    if target == "datetime":
        if isinstance(value, datetime):
            return value
        if isinstance(value, str):
            s = value.strip()
            # Try ISO 8601 parsing with and without a trailing Z
            if s.endswith("Z"):
                s = s[:-1] + "+00:00"
            return datetime.fromisoformat(s)
        raise TypeError(f"cannot coerce {type(value).__name__} to datetime")

    # --- list[str] ---
    if target == "list[str]":
        if isinstance(value, list):
            return [str(item) for item in value]
        if isinstance(value, str):
            return [item.strip() for item in value.split(",")]
        raise TypeError(f"cannot coerce {type(value).__name__} to list[str]")

    raise ValueError(f"unknown target type: {target!r}")
@@ -0,0 +1,84 @@
"""Tests for coerce_types."""

import sys
import os
from datetime import datetime

sys.path.insert(0, os.path.dirname(__file__))

from coerce_types import coerce_types


def test_string_42_a_int_42():
    result, warnings = coerce_types({"n": "42"}, {"n": "int"})
    assert result["n"] == 42
    assert isinstance(result["n"], int)
    assert warnings == []


def test_string_3_14_a_float_3_14():
    result, warnings = coerce_types({"x": "3.14"}, {"x": "float"})
    assert abs(result["x"] - 3.14) < 1e-9
    assert warnings == []


def test_string_true_a_bool_true():
    result, warnings = coerce_types({"flag": "true"}, {"flag": "bool"})
    assert result["flag"] is True
    assert warnings == []

    result2, _ = coerce_types({"flag": "yes"}, {"flag": "bool"})
    assert result2["flag"] is True

    result3, _ = coerce_types({"flag": "1"}, {"flag": "bool"})
    assert result3["flag"] is True

    result4, _ = coerce_types({"flag": "false"}, {"flag": "bool"})
    assert result4["flag"] is False


def test_string_iso8601_a_datetime():
    result, warnings = coerce_types(
        {"ts": "2024-01-15T10:30:00Z"}, {"ts": "datetime"}
    )
    assert isinstance(result["ts"], datetime)
    assert result["ts"].year == 2024
    assert result["ts"].month == 1
    assert result["ts"].day == 15
    assert warnings == []


def test_coercion_fallida_genera_warning_sin_crash():
    result, warnings = coerce_types({"n": "not-a-number"}, {"n": "int"})
    # keeps the original value
    assert result["n"] == "not-a-number"
    assert len(warnings) == 1
    assert "n" in warnings[0]


def test_dict_con_mix_de_tipos_ya_correctos_y_strings():
    data = {"a": "10", "b": 3.14, "c": True, "d": "hello"}
    schema = {"a": "int", "b": "float", "c": "bool", "d": "str"}
    result, warnings = coerce_types(data, schema)
    assert result["a"] == 10
    assert abs(result["b"] - 3.14) < 1e-9
    assert result["c"] is True
    assert result["d"] == "hello"
    assert warnings == []


def test_campo_ausente_en_schema_pass_through_sin_tocar():
    data = {"a": "42", "b": [1, 2, 3]}
    schema = {"a": "int"}  # "b" is not in the schema
    result, warnings = coerce_types(data, schema)
    assert result["a"] == 42
    assert result["b"] == [1, 2, 3]
    assert warnings == []


def test_string_lista_a_list_str():
    result, warnings = coerce_types(
        {"tags": "python, go, bash"}, {"tags": "list[str]"}
    )
    assert result["tags"] == ["python", "go", "bash"]
    assert warnings == []
@@ -0,0 +1,41 @@
---
name: compute_backoff_delay
kind: function
lang: py
domain: core
version: "1.0.0"
purity: pure
signature: "def compute_backoff_delay(attempt: int, base_delay: float = 0.5, max_delay: float = 8.0, jitter: bool = True) -> float"
description: "Computes the delay for exponential backoff with optional jitter. delay = min(base_delay * 2^attempt, max_delay). With jitter it adds random.uniform(0, min(base_delay, delay))."
tags: [retry, backoff, exponential, delay, jitter]
uses_functions: []
uses_types: []
returns: []
returns_optional: false
error_type: ""
imports: [random]
tested: true
tests: ["attempt 0 retorna base_delay sin jitter", "attempt alto se cappea a max_delay", "sin jitter es determinista"]
test_file_path: "python/functions/core/compute_backoff_delay_test.py"
file_path: "python/functions/core/compute_backoff_delay.py"
---

## Example

```python
# First retry (attempt=0): delay = 0.5 * 2^0 = 0.5s
compute_backoff_delay(0, jitter=False)  # 0.5

# Third retry (attempt=2): delay = 0.5 * 2^2 = 2.0s
compute_backoff_delay(2, jitter=False)  # 2.0

# High attempt, capped at 8.0s
compute_backoff_delay(10, jitter=False)  # 8.0

# With jitter (non-deterministic)
compute_backoff_delay(1)  # between 1.0 and 1.5
```

## Notes

Uses `random` from the stdlib. With jitter=True the result is non-deterministic, but the function is conceptually classified as pure because the jitter is intentional and there is no I/O. Because the jitter is added after the cap, the result can exceed `max_delay` by up to `min(base_delay, max_delay)`. For deterministic tests use jitter=False.
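
To make the formula concrete, the sketch below tabulates the jitter-free schedule and the range a jittered delay can fall in. `backoff_schedule` and `jitter_bounds` are hypothetical helpers written for this note; they restate the formula from the description rather than call the registry function.

```python
def backoff_schedule(
    attempts: int, base_delay: float = 0.5, max_delay: float = 8.0
) -> list[float]:
    """Return the jitter-free delays for attempts 0..attempts-1 (hypothetical helper)."""
    return [min(base_delay * (2 ** a), max_delay) for a in range(attempts)]


def jitter_bounds(delay: float, base_delay: float = 0.5) -> tuple[float, float]:
    """Return the (low, high) range of a jittered delay, per the formula above."""
    return delay, delay + min(base_delay, delay)


print(backoff_schedule(6))  # [0.5, 1.0, 2.0, 4.0, 8.0, 8.0]
print(jitter_bounds(8.0))   # (8.0, 8.5)
```

With the default parameters the cap is reached at attempt 4, and every later attempt waits between 8.0 and 8.5 seconds when jitter is enabled.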
@@ -0,0 +1,26 @@
"""Compute exponential backoff delay with optional jitter."""

import random


def compute_backoff_delay(
    attempt: int,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
    jitter: bool = True,
) -> float:
    """Compute exponential backoff delay for a given attempt number.

    Args:
        attempt: Zero-based attempt index (0 = first retry).
        base_delay: Base delay in seconds before exponential scaling.
        max_delay: Maximum delay cap in seconds.
        jitter: If True, adds random jitter to avoid thundering herd.

    Returns:
        Delay in seconds to wait before the next attempt.
    """
    delay = min(base_delay * (2 ** attempt), max_delay)
    if jitter:
        delay += random.uniform(0, min(base_delay, delay))
    return delay
@@ -0,0 +1,42 @@
"""Tests for compute_backoff_delay."""

import sys
import os

sys.path.insert(0, os.path.dirname(__file__))
from compute_backoff_delay import compute_backoff_delay


def test_attempt_0_retorna_base_delay_sin_jitter():
    result = compute_backoff_delay(0, base_delay=0.5, max_delay=8.0, jitter=False)
    assert result == 0.5


def test_attempt_alto_se_cappea_a_max_delay():
    result = compute_backoff_delay(10, base_delay=0.5, max_delay=8.0, jitter=False)
    assert result == 8.0


def test_sin_jitter_es_determinista():
    r1 = compute_backoff_delay(3, base_delay=1.0, max_delay=16.0, jitter=False)
    r2 = compute_backoff_delay(3, base_delay=1.0, max_delay=16.0, jitter=False)
    assert r1 == r2
    # attempt=3: 1.0 * 2^3 = 8.0
    assert r1 == 8.0


def test_escala_exponencial():
    d0 = compute_backoff_delay(0, base_delay=1.0, max_delay=100.0, jitter=False)
    d1 = compute_backoff_delay(1, base_delay=1.0, max_delay=100.0, jitter=False)
    d2 = compute_backoff_delay(2, base_delay=1.0, max_delay=100.0, jitter=False)
    assert d0 == 1.0
    assert d1 == 2.0
    assert d2 == 4.0


def test_con_jitter_no_excede_max_delay_mas_base():
    # With jitter, base delay + jitter <= max_delay + base_delay
    for attempt in range(5):
        result = compute_backoff_delay(attempt, base_delay=0.5, max_delay=8.0, jitter=True)
        assert result >= 0.5
        assert result <= 8.0 + 0.5
@@ -0,0 +1,59 @@
---
name: convert_github_to_raw_url
kind: function
lang: py
domain: core
version: "1.0.0"
purity: pure
signature: "def convert_github_to_raw_url(url: str) -> str"
description: "Converts a GitHub/GitLab blob URL to its raw URL. E.g.: github.com/org/repo/blob/main/file.py → raw.githubusercontent.com/org/repo/main/file.py. Returns the URL unchanged if the conversion does not apply."
tags: [github, gitlab, url, raw, blob, convert, transform]
uses_functions: []
uses_types: []
returns: []
returns_optional: false
error_type: ""
imports: ["urllib.parse"]
tested: true
tests:
  - "URL GitHub blob"
  - "URL GitLab blob"
  - "URL que no es blob retorna sin cambios"
  - "URL no-GitHub retorna sin cambios"
test_file_path: "python/functions/core/convert_github_to_raw_url_test.py"
file_path: "python/functions/core/convert_github_to_raw_url.py"
---

## Example

```python
from core.convert_github_to_raw_url import convert_github_to_raw_url

# GitHub blob → raw.githubusercontent.com
url = convert_github_to_raw_url(
    "https://github.com/openai/whisper/blob/main/README.md"
)
# "https://raw.githubusercontent.com/openai/whisper/main/README.md"

# GitLab blob → raw
url = convert_github_to_raw_url(
    "https://gitlab.com/org/repo/-/blob/main/file.py"
)
# "https://gitlab.com/org/repo/-/raw/main/file.py"

# URL without blob → unchanged
url = convert_github_to_raw_url("https://github.com/org/repo")
# "https://github.com/org/repo"
```

## Notes

Algorithm:
1. Parse the URL with `urllib.parse.urlparse`.
2. If the host is `github.com` (or `www.github.com`): look for a `blob` segment in the path.
   - If present: remove the `blob` segment and change the domain to `raw.githubusercontent.com`.
3. If the host is `gitlab.com` or starts with `gitlab.`: replace `/-/blob/` with `/-/raw/`
   or `/blob/` with `/raw/`.
4. Any other host: return the URL unchanged.

Pure function. It performs no I/O and has no side effects.
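
Steps 1 and 2 of the algorithm can be seen in isolation with the short sketch below. `split_blob_url` is a hypothetical helper written for this note, not part of the registry; it only shows what `urlparse` yields for a GitHub blob URL and where the `blob` segment sits.

```python
from urllib.parse import urlparse


def split_blob_url(url: str) -> tuple[str, list[str]]:
    """Return (host, path segments) for a URL (hypothetical helper)."""
    parsed = urlparse(url)
    # The path starts with "/", so the first segment is an empty string.
    return parsed.hostname or "", parsed.path.split("/")


host, segments = split_blob_url("https://github.com/org/repo/blob/main/src/app.py")
print(host)      # github.com
print(segments)  # ['', 'org', 'repo', 'blob', 'main', 'src', 'app.py']
```

Dropping the `blob` entry from that list and joining the rest with `/` gives the path used on `raw.githubusercontent.com`.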
@@ -0,0 +1,72 @@
"""Convert GitHub/GitLab blob URLs to their raw equivalent."""

from urllib.parse import urlparse, urlunparse


def convert_github_to_raw_url(url: str) -> str:
    """Convert a GitHub or GitLab blob URL to its raw URL.

    GitHub blob:
        https://github.com/org/repo/blob/main/path/file.py
        → https://raw.githubusercontent.com/org/repo/main/path/file.py

    GitLab blob:
        https://gitlab.com/org/repo/-/blob/main/path/file.py
        → https://gitlab.com/org/repo/-/raw/main/path/file.py

    If the URL does not contain a blob-style path, it is returned unchanged.

    Args:
        url: GitHub or GitLab URL, possibly pointing to a blob.

    Returns:
        The raw URL if the transformation applies; the original URL otherwise.
    """
    url = url.strip()
    if not url:
        return url

    parsed = urlparse(url)
    host = parsed.hostname or ""

    # --- GitHub ---
    if host in ("github.com", "www.github.com"):
        # Typical path: /org/repo/blob/ref/path/to/file
        segments = parsed.path.split("/")
        if "blob" in segments:
            blob_idx = segments.index("blob")
            # Drop the "blob" segment: /org/repo/ref/path/...
            new_segments = segments[:blob_idx] + segments[blob_idx + 1:]
            new_path = "/".join(new_segments)
            raw_url = urlunparse((
                "https",
                "raw.githubusercontent.com",
                new_path,
                parsed.params,
                parsed.query,
                parsed.fragment,
            ))
            return raw_url
        return url

    # --- GitLab ---
    if host in ("gitlab.com", "www.gitlab.com") or host.startswith("gitlab."):
        # Typical path: /org/repo/-/blob/ref/path or /org/repo/blob/ref/path
        # Replace only the first marker so a later /blob/ directory in the path survives
        new_path = parsed.path.replace("/-/blob/", "/-/raw/", 1)
        if new_path == parsed.path:
            new_path = parsed.path.replace("/blob/", "/raw/", 1)
        if new_path != parsed.path:
            raw_url = urlunparse((
                parsed.scheme,
                parsed.netloc,
                new_path,
                parsed.params,
                parsed.query,
                parsed.fragment,
            ))
            return raw_url
        return url

    # No transformation applies
    return url
@@ -0,0 +1,77 @@
"""Tests for convert_github_to_raw_url."""

import sys
import os

sys.path.insert(0, os.path.join(os.path.dirname(__file__), ".."))

from core.convert_github_to_raw_url import convert_github_to_raw_url


def test_url_github_blob():
    """A GitHub blob URL is converted correctly to raw.githubusercontent.com."""
    url = "https://github.com/openai/whisper/blob/main/README.md"
    result = convert_github_to_raw_url(url)
    assert result == "https://raw.githubusercontent.com/openai/whisper/main/README.md"


def test_url_github_blob_subdirectorio():
    """A GitHub blob URL with a subdirectory is converted correctly."""
    url = "https://github.com/org/repo/blob/main/src/utils/helper.py"
    result = convert_github_to_raw_url(url)
    assert result == "https://raw.githubusercontent.com/org/repo/main/src/utils/helper.py"


def test_url_github_blob_otra_rama():
    """A GitHub blob URL on a branch other than main is converted correctly."""
    url = "https://github.com/org/repo/blob/develop/config.yaml"
    result = convert_github_to_raw_url(url)
    assert result == "https://raw.githubusercontent.com/org/repo/develop/config.yaml"


def test_url_gitlab_blob():
    """A GitLab blob URL is converted to raw."""
    url = "https://gitlab.com/org/repo/-/blob/main/README.md"
    result = convert_github_to_raw_url(url)
    assert result == "https://gitlab.com/org/repo/-/raw/main/README.md"


def test_url_gitlab_blob_sin_guion():
    """A GitLab blob URL without '/-/' is also converted."""
    url = "https://gitlab.com/org/repo/blob/main/README.md"
    result = convert_github_to_raw_url(url)
    assert result == "https://gitlab.com/org/repo/raw/main/README.md"


def test_url_que_no_es_blob_retorna_sin_cambios():
    """A GitHub URL without blob is returned unchanged."""
    url = "https://github.com/org/repo"
    result = convert_github_to_raw_url(url)
    assert result == url


def test_url_github_tree_retorna_sin_cambios():
    """A GitHub tree URL (not blob) is returned unchanged."""
    url = "https://github.com/org/repo/tree/main/src"
    result = convert_github_to_raw_url(url)
    assert result == url


def test_url_no_github_retorna_sin_cambios():
    """A URL on another domain is returned unchanged."""
    url = "https://example.com/org/repo/blob/main/file.py"
    result = convert_github_to_raw_url(url)
    assert result == url


def test_url_vacia_retorna_sin_cambios():
    """An empty URL returns an empty string."""
    result = convert_github_to_raw_url("")
    assert result == ""


def test_url_raw_githubusercontent_retorna_sin_cambios():
    """A URL already on raw.githubusercontent.com is not modified."""
    url = "https://raw.githubusercontent.com/org/repo/main/file.py"
    result = convert_github_to_raw_url(url)
    assert result == url
@@ -1,7 +1,9 @@
 """Core functional programming utilities — pure functions for list/collection operations."""
 
+import hashlib
+import re
 from functools import reduce as _reduce
-from typing import Any, Callable, Dict, List, Tuple
+from typing import Any, Callable, Dict, List, Optional, Tuple
 
 
 def filter_list(xs: list, pred: Callable) -> list:
@@ -133,3 +135,680 @@ def compose(*fns: Callable) -> Callable:
            result = fn(result)
        return result
    return composed


# ── Tree manipulation ────────────────────────────────────────────────────────


def flatten_tree(structure: Any) -> List[Dict]:
    """Flatten a hierarchical tree (dict with 'nodes') to a list without children."""
    import copy
    if isinstance(structure, dict):
        node = copy.deepcopy(structure)
        node.pop('nodes', None)
        nodes = [node]
        for key in list(structure.keys()):
            if 'nodes' in key:
                nodes.extend(flatten_tree(structure[key]))
        return nodes
    elif isinstance(structure, list):
        nodes = []
        for item in structure:
            nodes.extend(flatten_tree(item))
        return nodes
    return []


def tree_to_flat_list(structure: Any) -> List[Dict]:
    """Convert hierarchical tree to flat list preserving DFS order (keeps internal nodes)."""
    if isinstance(structure, dict):
        nodes = [structure]
        if 'nodes' in structure:
            nodes.extend(tree_to_flat_list(structure['nodes']))
        return nodes
    elif isinstance(structure, list):
        nodes = []
        for item in structure:
            nodes.extend(tree_to_flat_list(item))
        return nodes
    return []


def get_leaf_nodes(structure: Any) -> List[Dict]:
    """Extract only leaf nodes (no children) from a hierarchical tree."""
    import copy
    if isinstance(structure, dict):
        if not structure.get('nodes'):
            node = copy.deepcopy(structure)
            node.pop('nodes', None)
            return [node]
        leaf_nodes = []
        for key in list(structure.keys()):
            if 'nodes' in key:
                leaf_nodes.extend(get_leaf_nodes(structure[key]))
        return leaf_nodes
    elif isinstance(structure, list):
        leaf_nodes = []
        for item in structure:
            leaf_nodes.extend(get_leaf_nodes(item))
        return leaf_nodes
    return []


def write_node_ids(data: Any, node_id: int = 0) -> int:
    """Assign sequential zero-padded IDs ('0000', '0001', ... by default) to all nodes in a tree, in place. Returns next counter."""
    if isinstance(data, dict):
        data['node_id'] = str(node_id).zfill(4)
        node_id += 1
        for key in list(data.keys()):
            if 'nodes' in key:
                node_id = write_node_ids(data[key], node_id)
    elif isinstance(data, list):
        for item in data:
            node_id = write_node_ids(item, node_id)
    return node_id


def list_to_tree(data: List[Dict]) -> List[Dict]:
    """Convert flat list with structure codes ('1.2.3') to nested tree."""
    def get_parent_structure(structure):
        if not structure:
            return None
        parts = str(structure).split('.')
        return '.'.join(parts[:-1]) if len(parts) > 1 else None

    nodes = {}
    root_nodes = []

    for item in data:
        structure = item.get('structure')
        node = {
            'title': item.get('title'),
            'start_index': item.get('start_index'),
            'end_index': item.get('end_index'),
            'nodes': []
        }
        nodes[structure] = node
        parent_structure = get_parent_structure(structure)

        if parent_structure and parent_structure in nodes:
            nodes[parent_structure]['nodes'].append(node)
        else:
            root_nodes.append(node)

    def clean_node(node):
        if not node['nodes']:
            del node['nodes']
        else:
            for child in node['nodes']:
                clean_node(child)
        return node

    return [clean_node(node) for node in root_nodes]


def remove_tree_fields(data: Any, fields: Optional[List[str]] = None) -> Any:
    """Recursively remove specified fields from a tree (dict/list)."""
    if fields is None:
        fields = ['text']
    if isinstance(data, dict):
        return {k: remove_tree_fields(v, fields) for k, v in data.items() if k not in fields}
    elif isinstance(data, list):
        return [remove_tree_fields(item, fields) for item in data]
    return data


def format_tree_structure(structure: Any, order: Optional[List[str]] = None) -> Any:
    """Reorder fields of each node in a tree according to specified key order."""
    if not order:
        return structure
    if isinstance(structure, dict):
        if 'nodes' in structure:
            structure['nodes'] = format_tree_structure(structure['nodes'], order)
        if not structure.get('nodes'):
            structure.pop('nodes', None)
        return {key: structure[key] for key in order if key in structure}
    elif isinstance(structure, list):
        return [format_tree_structure(item, order) for item in structure]
    return structure


def create_node_mapping(tree: List[Dict]) -> Dict[str, Dict]:
    """Create flat dict mapping node_id to node for O(1) lookup."""
    mapping = {}
    def _traverse(nodes):
        for node in nodes:
            if node.get('node_id'):
                mapping[node['node_id']] = node
            if node.get('nodes'):
                _traverse(node['nodes'])
    _traverse(tree)
    return mapping


# ── Text / JSON extraction ───────────────────────────────────────────────────


def extract_json_from_llm(content: str) -> Dict:
    """Extract and parse JSON from LLM responses. Handles ```json blocks, trailing commas, None->null."""
    import json
    try:
        start_idx = content.find("```json")
        if start_idx != -1:
            start_idx += 7
            end_idx = content.rfind("```")
            json_content = content[start_idx:end_idx].strip()
        else:
            json_content = content.strip()

        json_content = json_content.replace('None', 'null')
        json_content = json_content.replace('\n', ' ').replace('\r', ' ')
        json_content = ' '.join(json_content.split())

        return json.loads(json_content)
    except Exception:
        try:
            # Drop trailing commas, allowing whitespace before the closing bracket
            json_content = re.sub(r',\s*([\]}])', r'\1', json_content)
            return json.loads(json_content)
        except Exception:
            return {}


def parse_page_range(pages: str) -> List[int]:
    """Parse page range string ('5-7', '3,8', '12') into sorted list of unique ints."""
    result = []
    for part in pages.split(','):
        part = part.strip()
        if '-' in part:
            start, end = int(part.split('-', 1)[0].strip()), int(part.split('-', 1)[1].strip())
            if start > end:
                raise ValueError(f"Invalid range '{part}': start must be <= end")
            result.extend(range(start, end + 1))
        else:
            result.append(int(part))
    return sorted(set(result))


# ── Markdown parsing ─────────────────────────────────────────────────────────


def extract_markdown_headers(markdown_content: str) -> Tuple[List[Dict], List[str]]:
    """Extract all headers (h1-h6) from markdown with line numbers, skipping code blocks."""
    import re
    header_pattern = r'^(#{1,6})\s+(.+)$'
    code_block_pattern = r'^```'
    node_list = []
    lines = markdown_content.split('\n')
    in_code_block = False

    for line_num, line in enumerate(lines, 1):
        stripped_line = line.strip()
        if re.match(code_block_pattern, stripped_line):
            in_code_block = not in_code_block
            continue
        if not stripped_line:
            continue
        if not in_code_block:
            match = re.match(header_pattern, stripped_line)
            if match:
                level = len(match.group(1))
                title = match.group(2).strip()
                node_list.append({'title': title, 'level': level, 'line_num': line_num})

    return node_list, lines


def build_tree_from_headers(node_list: List[Dict]) -> List[Dict]:
    """Build nested tree from flat list of headers with levels (h1>h2>h3)."""
    if not node_list:
        return []

    stack = []
    root_nodes = []
    node_counter = 1

    for node in node_list:
        current_level = node['level']
        tree_node = {
            'title': node['title'],
            'node_id': str(node_counter).zfill(4),
            'line_num': node['line_num'],
            'nodes': []
        }
        node_counter += 1

        while stack and stack[-1][1] >= current_level:
            stack.pop()

        if not stack:
            root_nodes.append(tree_node)
        else:
            parent_node, _ = stack[-1]
            parent_node['nodes'].append(tree_node)

        stack.append((tree_node, current_level))

    def clean_empty_nodes(nodes):
        for n in nodes:
            if n['nodes']:
                clean_empty_nodes(n['nodes'])
            else:
                del n['nodes']
        return nodes

    return clean_empty_nodes(root_nodes)
|
||||
|
||||
|
||||
# ── Pagination / chunking ────────────────────────────────────────────────────
|
||||
|
||||
|
||||
def page_list_to_groups(page_contents: List[str], token_lengths: List[int],
|
||||
max_tokens: int = 20000, overlap_pages: int = 1) -> List[str]:
|
||||
"""Group pages into text chunks respecting token limit with configurable overlap."""
|
||||
import math
|
||||
num_tokens = sum(token_lengths)
|
||||
|
||||
if num_tokens <= max_tokens:
|
||||
return ["".join(page_contents)]
|
||||
|
||||
subsets = []
|
||||
current_subset = []
|
||||
current_token_count = 0
|
||||
|
||||
expected_parts = math.ceil(num_tokens / max_tokens)
|
||||
avg_tokens = math.ceil(((num_tokens / expected_parts) + max_tokens) / 2)
|
||||
|
||||
for i, (page_content, page_tokens) in enumerate(zip(page_contents, token_lengths)):
|
||||
if current_token_count + page_tokens > avg_tokens:
|
||||
subsets.append(''.join(current_subset))
|
||||
overlap_start = max(i - overlap_pages, 0)
|
||||
current_subset = list(page_contents[overlap_start:i])
|
||||
current_token_count = sum(token_lengths[overlap_start:i])
|
||||
|
||||
current_subset.append(page_content)
|
||||
current_token_count += page_tokens
|
||||
|
||||
if current_subset:
|
||||
subsets.append(''.join(current_subset))
|
||||
|
||||
return subsets
|
||||
|
||||
|
||||
def calculate_page_offset(pairs: List[Dict]) -> int:
|
||||
"""Calculate offset between logical page numbers and physical indices using reference pairs."""
|
||||
differences = []
|
||||
for pair in pairs:
|
||||
try:
|
||||
difference = pair['physical_index'] - pair['page']
|
||||
differences.append(difference)
|
||||
except (KeyError, TypeError):
|
||||
continue
|
||||
|
||||
if not differences:
|
||||
return 0
|
||||
|
||||
counts: Dict[int, int] = {}
|
||||
for diff in differences:
|
||||
counts[diff] = counts.get(diff, 0) + 1
|
||||
|
||||
return max(counts.items(), key=lambda x: x[1])[0]
|
||||
|
||||
|
||||
# ── Text preprocessing ───────────────────────────────────────────────────────
|
||||
|
||||
|
||||
def preprocess_text(text: str) -> str:
|
||||
"""Normalize whitespace and newlines in raw text.
|
||||
|
||||
Args:
|
||||
text: Raw text to normalize.
|
||||
|
||||
Returns:
|
||||
Normalized text with consistent newlines, stripped lines, and no
|
||||
excessive blank lines.
|
||||
"""
|
||||
# Normalize line endings: \r\n and \r -> \n
|
||||
text = text.replace('\r\n', '\n').replace('\r', '\n')
|
||||
# Reduce 3+ consecutive newlines to at most 2
|
||||
text = re.sub(r'\n{3,}', '\n\n', text)
|
||||
# Strip whitespace from each line
|
||||
text = '\n'.join(line.strip() for line in text.split('\n'))
|
||||
# Strip globally
|
||||
return text.strip()
|
||||
|
||||
|
||||
def get_text_stats(text: str) -> dict:
|
||||
"""Compute basic statistics of a text: characters, lines, words.
|
||||
|
||||
Args:
|
||||
text: Input text to analyze.
|
||||
|
||||
Returns:
|
||||
Dict with keys total_chars (int), total_lines (int), total_words (int).
|
||||
"""
|
||||
return {
|
||||
'total_chars': len(text),
|
||||
'total_lines': text.count('\n') + 1,
|
||||
'total_words': len(text.split()),
|
||||
}
|
||||
|
||||
|
||||
# ── Git URL parsing ──────────────────────────────────────────────────────────
|
||||
|
||||
_DEFAULT_GIT_HOSTS = ["github.com", "gitlab.com"]
|
||||
|
||||
|
||||
def _sanitize_git_segment(segment: str) -> str:
|
||||
"""Strip .git suffix then keep only [a-zA-Z0-9_-] chars."""
|
||||
if segment.endswith(".git"):
|
||||
segment = segment[:-4]
|
||||
return re.sub(r"[^a-zA-Z0-9_\-]", "", segment)
|
||||
|
||||
|
||||
def parse_git_url(url: str, known_hosts: Optional[List[str]] = None) -> Optional[str]:
|
||||
"""Parse a code-hosting URL and return the 'org/repo' path component.
|
||||
|
||||
Supports HTTPS, HTTP, git://, ssh:// and SSH shorthand (git@host:path).
|
||||
Returns None if the URL does not match any known host or is malformed.
|
||||
|
||||
Args:
|
||||
url: Repository URL in any supported format.
|
||||
known_hosts: List of accepted hostnames. Defaults to github.com and gitlab.com.
|
||||
|
||||
Returns:
|
||||
'org/repo' string or None.
|
||||
"""
|
||||
from urllib.parse import urlparse
|
||||
|
||||
hosts = known_hosts if known_hosts is not None else _DEFAULT_GIT_HOSTS
|
||||
url = url.strip()
|
||||
|
||||
if url.startswith("git@"):
|
||||
# git@github.com:org/repo.git
|
||||
rest = url[len("git@"):]
|
||||
if ":" not in rest:
|
||||
return None
|
||||
host, path = rest.split(":", 1)
|
||||
if host not in hosts:
|
||||
return None
|
||||
segments = [s for s in path.split("/") if s]
|
||||
if len(segments) < 2:
|
||||
return None
|
||||
org = _sanitize_git_segment(segments[0])
|
||||
repo = _sanitize_git_segment(segments[1])
|
||||
if not org or not repo:
|
||||
return None
|
||||
return f"{org}/{repo}"
|
||||
|
||||
for prefix in ("http://", "https://", "git://", "ssh://"):
|
||||
if url.startswith(prefix):
|
||||
parsed = urlparse(url)
|
||||
netloc = parsed.hostname or ""
|
||||
if netloc not in hosts:
|
||||
return None
|
||||
segments = [s for s in parsed.path.split("/") if s]
|
||||
if len(segments) < 2:
|
||||
return None
|
||||
org = _sanitize_git_segment(segments[0])
|
||||
repo = _sanitize_git_segment(segments[1])
|
||||
if not org or not repo:
|
||||
return None
|
||||
return f"{org}/{repo}"
|
||||
|
||||
return None
|
||||
|
||||
|
||||
def is_git_repo_url(url: str, known_hosts: Optional[List[str]] = None) -> bool:
|
||||
"""Return True only if url points to a clonable git repository.
|
||||
|
||||
Accepts org/repo and org/repo/tree/<ref> paths.
|
||||
Rejects paths that navigate to sub-resources (issues, blobs, PRs, etc.).
|
||||
|
||||
Args:
|
||||
url: URL to verify.
|
||||
known_hosts: Accepted hostnames. Defaults to github.com and gitlab.com.
|
||||
|
||||
Returns:
|
||||
True if url is a clonable repository URL.
|
||||
"""
|
||||
from urllib.parse import urlparse
|
||||
|
||||
hosts = known_hosts if known_hosts is not None else _DEFAULT_GIT_HOSTS
|
||||
url = url.strip()
|
||||
|
||||
# SSH shorthand — always repo-level if host matches
|
||||
if url.startswith("git@"):
|
||||
rest = url[len("git@"):]
|
||||
if ":" not in rest:
|
||||
return False
|
||||
host, _ = rest.split(":", 1)
|
||||
return host in hosts
|
||||
|
||||
# git:// and ssh:// — always repo-level if host matches
|
||||
for prefix in ("ssh://", "git://"):
|
||||
if url.startswith(prefix):
|
||||
parsed = urlparse(url)
|
||||
return (parsed.hostname or "") in hosts
|
||||
|
||||
# http:// and https:// — must have exactly org/repo or org/repo/tree/<ref>
|
||||
for prefix in ("http://", "https://"):
|
||||
if url.startswith(prefix):
|
||||
parsed = urlparse(url)
|
||||
if (parsed.hostname or "") not in hosts:
|
||||
return False
|
||||
segments = [s for s in parsed.path.split("/") if s]
|
||||
if len(segments) == 2:
|
||||
return True
|
||||
if len(segments) == 4 and segments[2] == "tree":
|
||||
return True
|
||||
return False
|
||||
|
||||
return False
|
||||
|
||||
|
||||
def validate_git_ssh_uri(url: str) -> None:
|
||||
"""Validate a git SSH URI of the form git@host:path.
|
||||
|
||||
Raises ValueError with a descriptive message if the URI is malformed.
|
||||
|
||||
Args:
|
||||
url: URI string to validate.
|
||||
|
||||
Raises:
|
||||
ValueError: If the URI does not conform to git SSH format.
|
||||
"""
|
||||
if not url.startswith("git@"):
|
||||
raise ValueError(f"git SSH URI must start with 'git@', got: {url!r}")
|
||||
rest = url[len("git@"):]
|
||||
if ":" not in rest:
|
||||
raise ValueError(f"git SSH URI must contain ':', got: {url!r}")
|
||||
_, path = rest.split(":", 1)
|
||||
if not path:
|
||||
raise ValueError(f"git SSH URI must have a non-empty path after ':', got: {url!r}")
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Markdown parsing utilities
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
def extract_frontmatter(content: str) -> Tuple[str, Optional[Dict]]:
|
||||
"""Extract YAML frontmatter delimited by '---' from the start of a markdown string.
|
||||
|
||||
Args:
|
||||
content: Raw markdown string, optionally starting with YAML frontmatter.
|
||||
|
||||
Returns:
|
||||
Tuple of (content_without_frontmatter, frontmatter_dict).
|
||||
frontmatter_dict is None when no frontmatter is found.
|
||||
"""
|
||||
pattern = re.compile(r'^---\n(.*?)\n---\n', re.DOTALL)
|
||||
match = pattern.match(content)
|
||||
if not match:
|
||||
return content, None
|
||||
|
||||
raw = match.group(1)
|
||||
remaining = content[match.end():]
|
||||
|
||||
try:
|
||||
import yaml # type: ignore
|
||||
data = yaml.safe_load(raw)
|
||||
if not isinstance(data, dict):
|
||||
data = None
|
||||
except Exception:
|
||||
# Fallback: simple key: value parser (no yaml dependency)
|
||||
data = {}
|
||||
for line in raw.splitlines():
|
||||
if ':' in line:
|
||||
key, _, value = line.partition(':')
|
||||
data[key.strip()] = value.strip()
|
||||
|
||||
return remaining, data
|
||||
|
||||
|
||||
def find_headings(content: str) -> List[Tuple[int, int, str, int]]:
|
||||
"""Find all markdown headings (# to ######), excluding those inside code blocks,
|
||||
HTML comments, and indented blocks.
|
||||
|
||||
Args:
|
||||
content: Markdown text to search.
|
||||
|
||||
Returns:
|
||||
List of (start_pos, end_pos, title, level) for each heading found.
|
||||
"""
|
||||
excluded: List[Tuple[int, int]] = []
|
||||
|
||||
# Code blocks (triple backtick)
|
||||
for m in re.finditer(r'```.*?```', content, re.DOTALL):
|
||||
excluded.append((m.start(), m.end()))
|
||||
|
||||
# HTML comments
|
||||
for m in re.finditer(r'<!--.*?-->', content, re.DOTALL):
|
||||
excluded.append((m.start(), m.end()))
|
||||
|
||||
# Indented blocks (lines starting with 4 spaces or a tab)
|
||||
for m in re.finditer(r'^(    |\t).+$', content, re.MULTILINE):
|
||||
excluded.append((m.start(), m.end()))
|
||||
|
||||
def is_excluded(pos: int) -> bool:
|
||||
return any(start <= pos < end for start, end in excluded)
|
||||
|
||||
results: List[Tuple[int, int, str, int]] = []
|
||||
for m in re.finditer(r'^(#{1,6})\s+(.+)$', content, re.MULTILINE):
|
||||
# Skip escaped headings (\#)
|
||||
before = content[m.start() - 1] if m.start() > 0 else ''
|
||||
if before == '\\':
|
||||
continue
|
||||
if is_excluded(m.start()):
|
||||
continue
|
||||
level = len(m.group(1))
|
||||
title = m.group(2).strip()
|
||||
results.append((m.start(), m.end(), title, level))
|
||||
|
||||
return results
|
||||
|
||||
|
||||
def estimate_token_count(content: str) -> int:
|
||||
"""Estimate token count without a tokenizer.
|
||||
|
||||
CJK characters count as ~0.7 tokens each; other non-whitespace characters
|
||||
count as ~0.3 tokens each.
|
||||
|
||||
Args:
|
||||
content: Text to estimate.
|
||||
|
||||
Returns:
|
||||
Estimated integer token count.
|
||||
"""
|
||||
cjk = re.findall(r'[\u4e00-\u9fff\u3040-\u30ff\uac00-\ud7af]', content)
|
||||
without_cjk = re.sub(r'[\u4e00-\u9fff\u3040-\u30ff\uac00-\ud7af]', '', content)
|
||||
others = re.findall(r'\S', without_cjk)
|
||||
return int(len(cjk) * 0.7 + len(others) * 0.3)
|
||||
|
||||
|
||||
def smart_split_content(
|
||||
content: str,
|
||||
max_tokens: int = 1024,
|
||||
max_chars: int = 8000,
|
||||
) -> List[str]:
|
||||
"""Split large content into parts respecting token and character limits.
|
||||
|
||||
Splits by paragraphs (double newline). If a single paragraph exceeds the
|
||||
limit it is force-cut into chunks of max_chars.
|
||||
|
||||
Args:
|
||||
content: Text to split.
|
||||
max_tokens: Maximum estimated tokens per part.
|
||||
max_chars: Maximum characters per part.
|
||||
|
||||
Returns:
|
||||
List of string parts.
|
||||
"""
|
||||
paragraphs = content.split('\n\n')
|
||||
parts: List[str] = []
|
||||
current_parts: List[str] = []
|
||||
current_tokens = 0
|
||||
current_chars = 0
|
||||
|
||||
def flush() -> None:
|
||||
if current_parts:
|
||||
parts.append('\n\n'.join(current_parts))
|
||||
current_parts.clear()
|
||||
|
||||
for para in paragraphs:
|
||||
para_tokens = estimate_token_count(para)
|
||||
para_chars = len(para)
|
||||
|
||||
# Single paragraph exceeds limits — force-cut it
|
||||
if para_tokens > max_tokens or para_chars > max_chars:
|
||||
flush()
|
||||
current_tokens = 0
|
||||
current_chars = 0
|
||||
for i in range(0, len(para), max_chars):
|
||||
parts.append(para[i:i + max_chars])
|
||||
continue
|
||||
|
||||
# Would exceed limits if added — flush first
|
||||
if (current_tokens + para_tokens > max_tokens or
|
||||
current_chars + para_chars > max_chars):
|
||||
flush()
|
||||
current_tokens = 0
|
||||
current_chars = 0
|
||||
|
||||
current_parts.append(para)
|
||||
current_tokens += para_tokens
|
||||
current_chars += para_chars
|
||||
|
||||
flush()
|
||||
return parts if parts else [content]
|
||||
|
||||
|
||||
def sanitize_for_path(text: str, max_length: int = 50) -> str:
|
||||
"""Convert text to a safe string for use in file paths.
|
||||
|
||||
Keeps word characters, CJK characters, spaces and hyphens. Replaces spaces
|
||||
with underscores. Truncates with a sha256 suffix if the result exceeds
|
||||
max_length.
|
||||
|
||||
Args:
|
||||
text: Input text to sanitize.
|
||||
max_length: Maximum length of the returned string.
|
||||
|
||||
Returns:
|
||||
Safe path-friendly string.
|
||||
"""
|
||||
cleaned = re.sub(
|
||||
r'[^\w\u4e00-\u9fff\u3040-\u30ff\uac00-\ud7af \-]',
|
||||
'',
|
||||
text,
|
||||
)
|
||||
cleaned = cleaned.replace(' ', '_').strip('_')
|
||||
|
||||
if not cleaned:
|
||||
return 'section'
|
||||
|
||||
if len(cleaned) <= max_length:
|
||||
return cleaned
|
||||
|
||||
suffix = '_' + hashlib.sha256(text.encode()).hexdigest()[:8]
|
||||
return cleaned[:max_length - len(suffix)] + suffix
|
||||
|
||||
@@ -0,0 +1,36 @@
|
||||
---
|
||||
name: create_node_mapping
|
||||
kind: function
|
||||
lang: py
|
||||
domain: core
|
||||
version: "1.0.0"
|
||||
purity: pure
|
||||
signature: "def create_node_mapping(tree: list[dict]) -> dict[str, dict]"
|
||||
description: "Builds a flat node_id->node dict for O(1) lookup in a hierarchical tree."
|
||||
tags: [tree, mapping, index, lookup]
|
||||
uses_functions: []
|
||||
uses_types: []
|
||||
returns: []
|
||||
returns_optional: false
|
||||
error_type: ""
|
||||
imports: []
|
||||
tested: false
|
||||
tests: []
|
||||
test_file_path: ""
|
||||
file_path: "python/functions/core/core.py"
|
||||
source_repo: "https://github.com/VectifyAI/PageIndex"
|
||||
source_license: "MIT"
|
||||
source_file: "pageindex/utils.py"
|
||||
---
|
||||
|
||||
## Example
|
||||
|
||||
```python
|
||||
tree = [{"node_id": "0001", "title": "A", "nodes": [{"node_id": "0002", "title": "B"}]}]
|
||||
mapping = create_node_mapping(tree)
|
||||
mapping["0002"]["title"] # "B"
|
||||
```
|
||||
|
||||
## Notes
|
||||
|
||||
Pure function. The values are references to the original nodes, not copies.
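
The reference semantics noted above can be checked with a small sketch. The function body is restated here only so the snippet runs on its own; it is assumed to match `core.py`, and the sample tree and the edited title are illustrative.

```python
def create_node_mapping(tree):
    """Flat node_id -> node index (restated so the sketch is self-contained)."""
    mapping = {}

    def _traverse(nodes):
        for node in nodes:
            if node.get("node_id"):
                mapping[node["node_id"]] = node
            if node.get("nodes"):
                _traverse(node["nodes"])

    _traverse(tree)
    return mapping


tree = [{"node_id": "0001", "title": "A", "nodes": [{"node_id": "0002", "title": "B"}]}]
mapping = create_node_mapping(tree)

# Editing through the mapping also changes the tree: both hold the same dict object.
mapping["0002"]["title"] = "B (edited)"
same_object = mapping["0002"] is tree[0]["nodes"][0]
edited_title = tree[0]["nodes"][0]["title"]
```

If callers need an independent snapshot, copying the tree (for example with `copy.deepcopy`) before building the mapping is one option.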
|
||||
@@ -0,0 +1,66 @@
|
||||
---
|
||||
name: cursor_paginate
|
||||
kind: function
|
||||
lang: py
|
||||
domain: core
|
||||
version: "1.0.0"
|
||||
purity: impure
|
||||
signature: "def cursor_paginate(fetch_page: Callable[..., list[T]], get_cursor: Callable[[T], str | None], page_size: int = 100, max_items: int = 2000, max_retries: int = 3, retry_delay: float = 2.0, retryable_exceptions: tuple[type[Exception], ...] = (ConnectionError, TimeoutError, OSError)) -> list[T]"
|
||||
description: "Generic cursor-based paginator that works with any API using cursor-based pagination. Each page is fetched with automatic retries and exponential backoff. It stops when a page is empty, the batch is smaller than page_size, max_items is reached, or the cursor of the last item is None."
|
||||
tags: [pagination, cursor, retry, generic, api, backoff]
|
||||
uses_functions: []
|
||||
uses_types: []
|
||||
returns: []
|
||||
returns_optional: false
|
||||
error_type: "error_go_core"
|
||||
imports: ["time", "typing.Callable", "typing.TypeVar"]
|
||||
tested: true
|
||||
tests:
|
||||
- "API returning 3 pages of 10 items"
|
||||
- "API that fails once per page (retry works)"
|
||||
- "max_items limits correctly"
|
||||
- "API returning a partial page (last page)"
|
||||
- "Cursor None on the last item (stops)"
|
||||
test_file_path: "python/functions/core/cursor_paginate_test.py"
|
||||
file_path: "python/functions/core/cursor_paginate.py"
|
||||
---
|
||||
|
||||
## Example
|
||||
|
||||
```python
|
||||
import requests
|
||||
from cursor_paginate import cursor_paginate
|
||||
|
||||
def fetch_users(limit: int, cursor: str | None) -> list[dict]:
|
||||
params = {"limit": limit}
|
||||
if cursor:
|
||||
params["cursor"] = cursor
|
||||
return requests.get("https://api.example.com/users", params=params).json()["items"]
|
||||
|
||||
def get_cursor(user: dict) -> str | None:
|
||||
return user.get("next_cursor")
|
||||
|
||||
users = cursor_paginate(
|
||||
fetch_page=fetch_users,
|
||||
get_cursor=get_cursor,
|
||||
page_size=100,
|
||||
max_items=5000,
|
||||
max_retries=3,
|
||||
retry_delay=2.0,
|
||||
)
|
||||
```
|
||||
|
||||
## Notes
|
||||
|
||||
The caller only needs to provide two callables:
|
||||
- `fetch_page(limit, cursor)`: receives `limit` and `cursor` as kwargs and returns a list of items.
|
||||
- `get_cursor(item)`: extracts the cursor from the last item of the page; returning None signals the end of the data.
|
||||
|
||||
The internal exponential backoff sleeps `retry_delay * 2^attempt` seconds between attempts, without jitter. Only the exceptions in `retryable_exceptions` are retried; any other exception propagates immediately.
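
As a worked illustration of the backoff formula, the sketch below lists the delays slept before each retry. `backoff_schedule` is a hypothetical helper, not part of `cursor_paginate`; it assumes a sleep happens only before a retry, so `max_retries` retries produce `max_retries` delays.

```python
def backoff_schedule(retry_delay: float, max_retries: int) -> list[float]:
    """Delays slept before each retry (hypothetical helper, for illustration only)."""
    # attempt 0 is the first failure; the delay doubles on every later failure
    return [retry_delay * (2 ** attempt) for attempt in range(max_retries)]


delays = backoff_schedule(retry_delay=2.0, max_retries=3)
worst_case_wait = sum(delays)  # total sleep time if every attempt for a page fails
```

With the defaults (`retry_delay=2.0`, `max_retries=3`) a single page can therefore block for up to 14 seconds of sleep before the last exception is re-raised.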
|
||||
|
||||
Stop conditions (any one of them):
|
||||
1. The returned page is empty.
|
||||
2. The returned page has fewer items than `page_size` (partial page = last page).
|
||||
3. The accumulated total reaches or exceeds `max_items` (the list is truncated and pagination stops).
|
||||
4. `get_cursor(batch[-1])` returns `None`.
|
||||
|
||||
Impure function: it calls `fetch_page`, which typically performs network I/O, and uses `time.sleep` between retries.
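
The stop conditions can be summarized as a small decision function. This is only a sketch: `stop_reason` and its return labels are hypothetical names, and the check order is assumed to follow the loop in `cursor_paginate.py` (empty page, then `max_items`, then partial page, then cursor).

```python
from typing import Optional


def stop_reason(batch_len: int, total_after: int, page_size: int,
                max_items: int, next_cursor: Optional[str]) -> Optional[str]:
    """Return why pagination would stop after a page, or None to keep going."""
    if batch_len == 0:
        return "empty_page"
    if total_after >= max_items:   # total already includes the current batch
        return "max_items"
    if batch_len < page_size:
        return "partial_page"
    if next_cursor is None:
        return "no_cursor"
    return None


reasons = [
    stop_reason(0, 0, 10, 100, None),     # nothing returned
    stop_reason(10, 100, 10, 100, "c"),   # cap reached
    stop_reason(7, 17, 10, 100, "c"),     # short page
    stop_reason(10, 20, 10, 100, None),   # last item has no cursor
    stop_reason(10, 10, 10, 100, "c"),    # full page with a cursor: continue
]
```

Because any single condition ends the loop, the order only matters for which reason is reported, not for the collected items.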
|
||||
@@ -0,0 +1,105 @@
|
||||
"""Generic cursor-based paginator for any API that uses cursor pagination."""
|
||||
|
||||
import time
|
||||
from typing import Callable, TypeVar
|
||||
|
||||
T = TypeVar("T")
|
||||
|
||||
|
||||
def cursor_paginate(
|
||||
fetch_page: Callable[..., list[T]],
|
||||
get_cursor: Callable[[T], str | None],
|
||||
page_size: int = 100,
|
||||
max_items: int = 2000,
|
||||
max_retries: int = 3,
|
||||
retry_delay: float = 2.0,
|
||||
retryable_exceptions: tuple[type[Exception], ...] = (
|
||||
ConnectionError,
|
||||
TimeoutError,
|
||||
OSError,
|
||||
),
|
||||
) -> list[T]:
|
||||
"""Paginate through a cursor-based API, collecting all items.
|
||||
|
||||
Fetches pages one at a time by calling fetch_page with limit and cursor
|
||||
kwargs. Retries each page on transient errors using exponential backoff.
|
||||
Stops when a page is empty, a partial page is returned, max_items is
|
||||
reached, or the cursor from the last item is None.
|
||||
|
||||
Args:
|
||||
fetch_page: Callable that accepts ``limit`` and ``cursor`` as keyword
|
||||
arguments and returns a list of items for that page.
|
||||
get_cursor: Callable that receives the last item of a page and returns
|
||||
the cursor string to use for the next page, or None if there are
|
||||
no more pages.
|
||||
page_size: Number of items to request per page.
|
||||
max_items: Hard cap on total items collected. Collection stops and the
|
||||
list is truncated once this limit is reached.
|
||||
max_retries: Maximum number of retry attempts per page after the first
|
||||
failure.
|
||||
retry_delay: Base delay in seconds between retries (doubled each
|
||||
attempt — exponential backoff without jitter).
|
||||
retryable_exceptions: Tuple of exception types that trigger a retry.
|
||||
Any other exception propagates immediately.
|
||||
|
||||
Returns:
|
||||
List of all collected items, in the order they were returned by the
|
||||
API, truncated to max_items.
|
||||
|
||||
Raises:
|
||||
Exception: Re-raises the last exception if all retries for a page are
|
||||
exhausted.
|
||||
"""
|
||||
all_items: list[T] = []
|
||||
cursor: str | None = None
|
||||
|
||||
while True:
|
||||
batch = _fetch_with_retry(
|
||||
fetch_page=fetch_page,
|
||||
page_size=page_size,
|
||||
cursor=cursor,
|
||||
max_retries=max_retries,
|
||||
retry_delay=retry_delay,
|
||||
retryable_exceptions=retryable_exceptions,
|
||||
)
|
||||
|
||||
if not batch:
|
||||
break
|
||||
|
||||
all_items.extend(batch)
|
||||
|
||||
if len(all_items) >= max_items:
|
||||
del all_items[max_items:]
|
||||
break
|
||||
|
||||
if len(batch) < page_size:
|
||||
break
|
||||
|
||||
cursor = get_cursor(batch[-1])
|
||||
if cursor is None:
|
||||
break
|
||||
|
||||
return all_items
|
||||
|
||||
|
||||
def _fetch_with_retry(
|
||||
fetch_page: Callable[..., list[T]],
|
||||
page_size: int,
|
||||
cursor: str | None,
|
||||
max_retries: int,
|
||||
retry_delay: float,
|
||||
retryable_exceptions: tuple[type[Exception], ...],
|
||||
) -> list[T]:
|
||||
"""Call fetch_page once, retrying on retryable_exceptions with exponential backoff."""
|
||||
last_exc: Exception | None = None
|
||||
for attempt in range(max_retries + 1):
|
||||
try:
|
||||
return fetch_page(limit=page_size, cursor=cursor)
|
||||
except retryable_exceptions as exc:
|
||||
last_exc = exc
|
||||
if attempt >= max_retries:
|
||||
raise
|
||||
delay = retry_delay * (2 ** attempt)
|
||||
time.sleep(delay)
|
||||
|
||||
raise last_exc # unreachable; satisfies type checkers
|
||||
@@ -0,0 +1,148 @@
|
||||
"""Tests for cursor_paginate."""
|
||||
|
||||
import sys
|
||||
import os
|
||||
|
||||
sys.path.insert(0, os.path.dirname(__file__))
|
||||
|
||||
import pytest
|
||||
from cursor_paginate import cursor_paginate
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Helpers
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
def make_api(pages: list[list[dict]]) -> callable:
|
||||
"""Return a fetch_page callable that serves pages from a pre-built list."""
|
||||
call_count = [0]
|
||||
|
||||
def fetch_page(limit: int, cursor: str | None) -> list[dict]:
|
||||
idx = call_count[0]
|
||||
call_count[0] += 1
|
||||
if idx >= len(pages):
|
||||
return []
|
||||
return pages[idx][:limit]
|
||||
|
||||
return fetch_page
|
||||
|
||||
|
||||
def get_cursor(item: dict) -> str | None:
|
||||
return item.get("cursor")
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Tests
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
def test_api_retorna_3_paginas_de_10_items():
|
||||
pages = [
|
||||
[{"id": i, "cursor": str(i)} for i in range(0, 10)],
|
||||
[{"id": i, "cursor": str(i)} for i in range(10, 20)],
|
||||
[{"id": i, "cursor": str(i)} for i in range(20, 30)],
|
||||
[], # sentinel: empty page ends pagination
|
||||
]
|
||||
api = make_api(pages)
|
||||
result = cursor_paginate(
|
||||
fetch_page=api,
|
||||
get_cursor=get_cursor,
|
||||
page_size=10,
|
||||
max_items=2000,
|
||||
max_retries=0,
|
||||
)
|
||||
assert len(result) == 30
|
||||
assert result[0]["id"] == 0
|
||||
assert result[-1]["id"] == 29
|
||||
|
||||
|
||||
def test_api_falla_1_vez_por_pagina_retry_funciona():
|
||||
"""fetch_page fails on the first attempt of each call, but the retry recovers."""
|
||||
# Each page has 5 items: 2 pages in total, then an empty page.
|
||||
items_by_page = [
|
||||
[{"id": i, "cursor": str(i)} for i in range(0, 5)],
|
||||
[{"id": i, "cursor": str(i)} for i in range(5, 10)],
|
||||
]
|
||||
page_idx = [0]
|
||||
fail_flags = [True, True] # fail once per page
|
||||
|
||||
def fetch_page(limit: int, cursor: str | None) -> list[dict]:
|
||||
idx = page_idx[0]
|
||||
if idx < len(fail_flags) and fail_flags[idx]:
|
||||
fail_flags[idx] = False
|
||||
raise ConnectionError("transient failure")
|
||||
page_idx[0] += 1
|
||||
if idx >= len(items_by_page):
|
||||
return []
|
||||
return items_by_page[idx]
|
||||
|
||||
result = cursor_paginate(
|
||||
fetch_page=fetch_page,
|
||||
get_cursor=get_cursor,
|
||||
page_size=5,
|
||||
max_items=2000,
|
||||
max_retries=3,
|
||||
retry_delay=0.0,
|
||||
retryable_exceptions=(ConnectionError, TimeoutError, OSError),
|
||||
)
|
||||
assert len(result) == 10
|
||||
|
||||
|
||||
def test_max_items_limita_correctamente():
|
||||
# 50 items available in 5 pages of 10, but max_items=25
|
||||
pages = [
|
||||
[{"id": i, "cursor": str(i)} for i in range(j * 10, j * 10 + 10)]
|
||||
for j in range(5)
|
||||
]
|
||||
api = make_api(pages)
|
||||
result = cursor_paginate(
|
||||
fetch_page=api,
|
||||
get_cursor=get_cursor,
|
||||
page_size=10,
|
||||
max_items=25,
|
||||
max_retries=0,
|
||||
)
|
||||
assert len(result) == 25
|
||||
assert result[-1]["id"] == 24
|
||||
|
||||
|
||||
def test_api_retorna_pagina_parcial_ultima_pagina():
|
||||
pages = [
|
||||
[{"id": i, "cursor": str(i)} for i in range(10)], # full page
|
||||
[{"id": i, "cursor": str(i)} for i in range(10, 17)], # partial — 7 items
|
||||
]
|
||||
api = make_api(pages)
|
||||
result = cursor_paginate(
|
||||
fetch_page=api,
|
||||
get_cursor=get_cursor,
|
||||
page_size=10,
|
||||
max_items=2000,
|
||||
max_retries=0,
|
||||
)
|
||||
assert len(result) == 17
|
||||
assert result[-1]["id"] == 16
|
||||
|
||||
|
||||
def test_cursor_none_en_ultimo_item_se_detiene():
|
||||
"""When the last item has no cursor, pagination must stop."""
|
||||
pages = [
|
||||
[{"id": i, "cursor": str(i)} for i in range(10)],
|
||||
# last item has no cursor — signals end of data
|
||||
[{"id": i, "cursor": (str(i) if i < 19 else None)} for i in range(10, 20)],
|
||||
]
|
||||
api = make_api(pages)
|
||||
|
||||
def get_cursor_nullable(item: dict) -> str | None:
|
||||
return item.get("cursor")
|
||||
|
||||
result = cursor_paginate(
|
||||
fetch_page=api,
|
||||
get_cursor=get_cursor_nullable,
|
||||
page_size=10,
|
||||
max_items=2000,
|
||||
max_retries=0,
|
||||
)
|
||||
assert len(result) == 20
|
||||
assert result[-1]["id"] == 19
|
||||
@@ -0,0 +1,37 @@
|
||||
---
|
||||
name: detect_headings_by_font
|
||||
kind: function
|
||||
lang: py
|
||||
domain: core
|
||||
version: "1.0.0"
|
||||
purity: impure
|
||||
signature: "def detect_headings_by_font(pdf, min_delta: float = 2.0, max_levels: int = 4) -> list[dict]"
|
||||
description: "Detects headings in a PDF by analyzing the distribution of font sizes. The most common font size is the body; significantly larger sizes are classified as heading levels. Filters out repetitive headers/footers."
|
||||
tags: [pdf, headings, font, detection, parsing, pdfplumber]
|
||||
uses_functions: []
|
||||
uses_types: []
|
||||
returns: []
|
||||
returns_optional: false
|
||||
error_type: "error_go_core"
|
||||
imports: [pdfplumber, collections]
|
||||
tested: false
|
||||
tests: []
|
||||
test_file_path: ""
|
||||
file_path: "python/functions/core/detect_headings_by_font.py"
|
||||
---
|
||||
|
||||
## Example
|
||||
|
||||
```python
|
||||
import pdfplumber
|
||||
from detect_headings_by_font import detect_headings_by_font
|
||||
|
||||
with pdfplumber.open("document.pdf") as pdf:
|
||||
headings = detect_headings_by_font(pdf, min_delta=2.0, max_levels=4)
|
||||
for h in headings:
|
||||
print(f"Page {h['page_num']}: {'#' * h['level']} {h['title']}")
|
||||
```
|
||||
|
||||
## Notes
|
||||
|
||||
Samples every 5th page to build the Counter of font sizes (a performance optimization). The body_size is the most frequent font size. Heading sizes must be >= body_size + min_delta AND have a frequency below 50% of the body's. At most max_levels heading sizes are kept, sorted in descending order (the largest = level 1). Titles that appear on >30% of pages are treated as headers/footers and removed. Impure because it accesses the internal state of an already-open PDF object.
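
The size-to-level selection described above can be sketched without a PDF. `heading_levels` is a hypothetical helper that takes a ready-made Counter of rounded font sizes; the sample counts are invented for illustration and do not come from any real document.

```python
from collections import Counter


def heading_levels(size_counter: Counter, min_delta: float = 2.0,
                   max_levels: int = 4) -> dict:
    """Map heading font sizes to levels (1 = largest); expects a non-empty Counter."""
    body_size, body_count = size_counter.most_common(1)[0]
    heading_sizes = sorted(
        (size for size, count in size_counter.items()
         if size >= body_size + min_delta and count < body_count * 0.5),
        reverse=True,
    )[:max_levels]
    return {size: level for level, size in enumerate(heading_sizes, start=1)}


# Invented character counts per font size: 10.0 dominates, so it is the body size.
counts = Counter({10.0: 5000, 11.0: 300, 14.0: 120, 18.0: 40, 24.0: 10})
levels = heading_levels(counts)
```

Here 11.0 is rejected because it is less than `min_delta` above the body size, which shows why a small `min_delta` may promote emphasized body text to headings.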
|
||||
@@ -0,0 +1,135 @@
|
||||
"""Detect headings in a PDF by analyzing font size distribution."""
|
||||
|
||||
from collections import Counter
|
||||
|
||||
import pdfplumber
|
||||
|
||||
|
||||
def detect_headings_by_font(
|
||||
pdf: pdfplumber.PDF,
|
||||
min_delta: float = 2.0,
|
||||
max_levels: int = 4,
|
||||
) -> list[dict]:
|
||||
"""Detect headings by analyzing font size distribution across pages.
|
||||
|
||||
The most common font size is treated as body text. Font sizes larger than
|
||||
the body size by at least min_delta, and occurring less than half as often as
|
||||
the body size in the sampled chars, are classified as heading levels.
|
||||
|
||||
Args:
|
||||
pdf: An open pdfplumber.PDF object.
|
||||
min_delta: Minimum size difference above body size to qualify as heading.
|
||||
max_levels: Maximum number of heading levels to detect.
|
||||
|
||||
Returns:
|
||||
list[dict]: List of {"level": int, "title": str, "page_num": int}
|
||||
sorted by page number. Returns empty list if no headings detected.
|
||||
"""
|
||||
if not pdf.pages:
|
||||
return []
|
||||
|
||||
# Step 1: Sample font sizes from every 5th page to determine body size
|
||||
size_counter: Counter = Counter()
|
||||
sample_pages = [pdf.pages[i] for i in range(0, len(pdf.pages), 5)]
|
||||
if not sample_pages:
|
||||
sample_pages = [pdf.pages[0]]
|
||||
|
||||
for page in sample_pages:
|
||||
try:
|
||||
chars = page.chars
|
||||
for ch in chars:
|
||||
size = ch.get("size")
|
||||
if size is not None:
|
||||
size_counter[round(float(size), 1)] += 1
|
||||
except Exception:
|
||||
continue
|
||||
|
||||
if not size_counter:
|
||||
return []
|
||||
|
||||
# Step 2: Determine body size (most common font size)
|
||||
body_size, body_count = size_counter.most_common(1)[0]
|
||||
|
||||
# Step 3: Identify heading sizes
|
||||
# Must be >= body_size + min_delta and frequency < 50% of body count
|
||||
heading_sizes = sorted(
|
||||
[
|
||||
size
|
||||
for size, count in size_counter.items()
|
||||
if size >= body_size + min_delta and count < body_count * 0.5
|
||||
],
|
||||
reverse=True,
|
||||
)[:max_levels]
|
||||
|
||||
if not heading_sizes:
|
||||
return []
|
||||
|
||||
# Build size -> level mapping
|
||||
size_to_level = {size: i + 1 for i, size in enumerate(heading_sizes)}
|
||||
|
||||
# Step 4: Collect heading text per page
|
||||
raw_headings: list[dict] = []
|
||||
total_pages = len(pdf.pages)
|
||||
|
||||
for page_idx, page in enumerate(pdf.pages):
|
||||
page_num = page_idx + 1
|
||||
try:
|
||||
chars = page.chars
|
||||
except Exception:
|
||||
continue
|
||||
|
||||
# Group consecutive chars of same heading size into text blocks
|
||||
current_size = None
|
||||
current_text = []
|
||||
|
||||
for ch in chars:
|
||||
size = ch.get("size")
|
||||
if size is None:
|
||||
continue
|
||||
rounded = round(float(size), 1)
|
||||
if rounded in size_to_level:
|
||||
if rounded == current_size:
|
||||
current_text.append(ch.get("text", ""))
|
||||
else:
|
||||
if current_text and current_size is not None:
|
||||
text = "".join(current_text).strip()
|
||||
if text:
|
||||
raw_headings.append({
|
||||
"level": size_to_level[current_size],
|
||||
"title": text,
|
||||
"page_num": page_num,
|
||||
})
|
||||
current_size = rounded
|
||||
current_text = [ch.get("text", "")]
|
||||
else:
|
||||
if current_text and current_size is not None:
|
||||
text = "".join(current_text).strip()
|
||||
if text:
|
||||
raw_headings.append({
|
||||
"level": size_to_level[current_size],
|
||||
"title": text,
|
||||
"page_num": page_num,
|
||||
})
|
||||
current_size = None
|
||||
current_text = []
|
||||
|
||||
# Flush remaining
|
||||
if current_text and current_size is not None:
|
||||
text = "".join(current_text).strip()
|
||||
if text:
|
||||
raw_headings.append({
|
||||
"level": size_to_level[current_size],
|
||||
"title": text,
|
||||
"page_num": page_num,
|
||||
})
|
||||
|
||||
if not raw_headings:
|
||||
return []
|
||||
|
||||
# Step 5: Deduplicate — remove titles appearing on > 30% of pages (headers/footers)
|
||||
title_page_counts: Counter = Counter(h["title"] for h in raw_headings)
|
||||
threshold = max(total_pages * 0.3, 1)  # on PDFs of 1-3 pages, never drop titles that occur only once
|
||||
|
||||
filtered = [h for h in raw_headings if title_page_counts[h["title"]] <= threshold]
|
||||
|
||||
return filtered
|
||||
@@ -0,0 +1,59 @@
|
||||
---
|
||||
name: detect_url_type
|
||||
kind: function
|
||||
lang: py
|
||||
domain: core
|
||||
version: "1.0.0"
|
||||
purity: impure
|
||||
signature: "detect_url_type(url: str, timeout: float = 10.0) -> tuple[str, dict]"
|
||||
description: "Detects the content type of a URL. Returns the type ('webpage', 'pdf', 'markdown', 'text', 'code_repository') and metadata. Issues an HTTP HEAD request only when the type cannot be determined from the URL pattern or extension."
|
||||
tags: [url, content-type, http, detect, classification, head-request]
|
||||
uses_functions: []
|
||||
uses_types: []
|
||||
returns: []
|
||||
returns_optional: false
|
||||
error_type: "error_go_core"
|
||||
imports: ["urllib.parse", "httpx"]
|
||||
tested: true
|
||||
tests:
|
||||
- "URL .pdf por extension"
|
||||
- "URL github repo"
|
||||
- "URL markdown por extension"
|
||||
- "URL SSH git"
|
||||
- "URL .html por extension"
|
||||
test_file_path: "python/functions/core/detect_url_type_test.py"
|
||||
file_path: "python/functions/core/detect_url_type.py"
|
||||
---
|
||||
|
||||
## Ejemplo
|
||||
|
||||
```python
|
||||
from core.detect_url_type import detect_url_type
|
||||
|
||||
# By URL pattern (no HTTP request)
|
||||
url_type, meta = detect_url_type("https://github.com/openai/whisper")
|
||||
# url_type = "code_repository", meta = {"detection": "url_pattern", ...}
|
||||
|
||||
# By extension (no HTTP request)
|
||||
url_type, meta = detect_url_type("https://example.com/doc.pdf")
|
||||
# url_type = "pdf", meta = {"detection": "extension", ...}
|
||||
|
||||
# By HTTP HEAD request (when the type cannot be determined without the network)
|
||||
url_type, meta = detect_url_type("https://example.com/page")
|
||||
# url_type = "webpage", meta = {"detection": "content_type_header", "content_type": "text/html", ...}
|
||||
```
|
||||
|
||||
## Notas
|
||||
|
||||
Algorithm, in priority order:
|
||||
1. SSH git shorthand (`git@host:path`) → `code_repository` immediately.
|
||||
2. URL pattern of known repo hosts (github.com/org/repo, gitlab.com/org/repo) → `code_repository`.
|
||||
3. Extension of the URL path (.pdf, .md, .txt, .html, .git) → the corresponding type.
|
||||
4. HTTP HEAD request → read the `Content-Type` header.
|
||||
5. Default: `"webpage"`.
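
The offline part of this priority order (steps 1 to 3) can be sketched as follows. This is a simplified illustration, not the registry implementation: `classify_offline`, `REPO_HOSTS` and `EXT_TYPES` are hypothetical names, the repo check only accepts exactly two path segments, and a result of `(None, None)` stands for "a HEAD request would be needed".

```python
from urllib.parse import urlparse

# Hypothetical, reduced tables for illustration only.
REPO_HOSTS = {"github.com", "gitlab.com", "bitbucket.org", "codeberg.org"}
EXT_TYPES = {
    ".pdf": "pdf",
    ".md": "markdown",
    ".txt": "text",
    ".html": "webpage",
    ".git": "code_repository",
}


def classify_offline(url):
    """Return (type, detection), or (None, None) if the network is required."""
    url = url.strip()

    # 1. SSH shorthand wins before any URL parsing.
    if url.startswith("git@"):
        return "code_repository", "ssh_pattern"

    parsed = urlparse(url)
    segments = [s for s in parsed.path.split("/") if s]

    # 2. Known host with exactly org/repo in the path (simplified rule).
    if parsed.hostname in REPO_HOSTS and len(segments) == 2:
        return "code_repository", "url_pattern"

    # 3. Extension of the path, case-insensitive.
    lowered = parsed.path.lower()
    for ext, kind in EXT_TYPES.items():
        if lowered.endswith(ext):
            return kind, "extension"

    # 4./5. would need a HEAD request or fall back to the default.
    return None, None


print(classify_offline("https://github.com/openai/whisper"))
print(classify_offline("https://example.com/doc.pdf"))
print(classify_offline("https://example.com/page"))
```

Keeping the cheap string checks first means most URLs are classified without any network access.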
|
||||
|
||||
Hosts recognized as code repositories: github.com, gitlab.com, bitbucket.org, codeberg.org.
|
||||
|
||||
Sub-resources (issues, pulls, blob, tree, etc.) are NOT classified as `code_repository`.
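
A minimal sketch of the sub-resource check, assuming the path has already been split into segments. `is_repo_root` and the reduced `SUB_RESOURCES` set are hypothetical names for illustration. The sketch uses `str.removesuffix` (Python 3.9+) to drop a trailing `.git`, because `str.rstrip(".git")` strips any trailing run of the characters `.`, `g`, `i`, `t` rather than the literal suffix.

```python
# Hypothetical, reduced set for illustration only.
SUB_RESOURCES = {"issues", "pulls", "blob", "tree", "wiki"}


def is_repo_root(segments):
    """Return True if the path segments look like org/repo, not a sub-resource."""
    if len(segments) < 2:
        return False
    # removesuffix drops the literal ".git" suffix only.
    if len(segments) >= 3 and segments[2].removesuffix(".git") in SUB_RESOURCES:
        return False
    return True


# rstrip treats its argument as a set of characters, so it mangles "wiki".
print("wiki".rstrip(".git"))        # wik
print("wiki".removesuffix(".git"))  # wiki

print(is_repo_root(["openai", "whisper"]))
print(is_repo_root(["openai", "whisper", "wiki"]))
```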
|
||||
|
||||
Raises `Exception` with a descriptive message if the HEAD request fails (timeout, DNS, network).
|
||||
@@ -0,0 +1,144 @@
|
||||
"""Detecta el tipo de contenido de una URL (webpage, pdf, markdown, text, code_repository)."""
|
||||
|
||||
import re
|
||||
from urllib.parse import urlparse
|
||||
|
||||
|
||||
# Patrones de repos de codigo por hostname
|
||||
_CODE_REPO_HOSTS = {"github.com", "gitlab.com", "bitbucket.org", "codeberg.org"}
|
||||
|
||||
# Extensiones reconocidas → tipo
|
||||
_EXT_TYPE_MAP = {
|
||||
".pdf": "pdf",
|
||||
".md": "markdown",
|
||||
".markdown": "markdown",
|
||||
".rst": "text",
|
||||
".txt": "text",
|
||||
".html": "webpage",
|
||||
".htm": "webpage",
|
||||
".xml": "text",
|
||||
".json": "text",
|
||||
".csv": "text",
|
||||
".py": "text",
|
||||
".js": "text",
|
||||
".ts": "text",
|
||||
".go": "text",
|
||||
".rs": "text",
|
||||
".cpp": "text",
|
||||
".c": "text",
|
||||
".java": "text",
|
||||
".rb": "text",
|
||||
".git": "code_repository",
|
||||
}
|
||||
|
||||
# Content-Type header prefixes → tipo
|
||||
_CONTENT_TYPE_MAP = {
|
||||
"application/pdf": "pdf",
|
||||
"text/markdown": "markdown",
|
||||
"text/x-markdown": "markdown",
|
||||
"text/plain": "text",
|
||||
"text/html": "webpage",
|
||||
"text/xml": "text",
|
||||
"application/xml": "text",
|
||||
"application/json": "text",
|
||||
}
|
||||
|
||||
|
||||
def _is_code_repo_url(parsed, path_segments: list[str]) -> bool:
|
||||
"""Return True si la URL apunta a la raiz de un repositorio de codigo."""
|
||||
host = parsed.hostname or ""
|
||||
if host not in _CODE_REPO_HOSTS:
|
||||
return False
|
||||
# Acepta org/repo o org/repo/ o org/repo.git (2 segmentos minimos)
|
||||
if len(path_segments) < 2:
|
||||
return False
|
||||
# Rechaza sub-recursos conocidos: issues, pulls, blob, tree, releases, etc.
|
||||
_SUB_RESOURCES = {"issues", "pulls", "blob", "tree", "releases", "tags",
|
||||
"commits", "compare", "wiki", "discussions", "actions",
|
||||
"security", "pulse", "graphs", "-", "settings"}
|
||||
if len(path_segments) >= 3 and path_segments[2].removesuffix(".git") in _SUB_RESOURCES:
|
||||
return False
|
||||
return True
|
||||
|
||||
|
||||
def _is_ssh_git_url(url: str) -> bool:
|
||||
"""Return True si la URL es un SSH git shorthand (git@host:path)."""
|
||||
return url.strip().startswith("git@")
|
||||
|
||||
|
||||
def _type_from_extension(path: str) -> str | None:
|
||||
"""Detecta tipo segun la extension del path de la URL. Retorna None si no aplica."""
|
||||
# Ignorar query string / fragment
|
||||
clean_path = path.split("?")[0].split("#")[0]
|
||||
for ext, url_type in _EXT_TYPE_MAP.items():
|
||||
if clean_path.lower().endswith(ext):
|
||||
return url_type
|
||||
return None
|
||||
|
||||
|
||||
def _type_from_content_type(content_type_header: str) -> str:
|
||||
"""Mapea un Content-Type header al tipo de URL."""
|
||||
ct = content_type_header.lower().split(";")[0].strip()
|
||||
for prefix, url_type in _CONTENT_TYPE_MAP.items():
|
||||
if ct.startswith(prefix):
|
||||
return url_type
|
||||
return "webpage"
|
||||
|
||||
|
||||
def detect_url_type(url: str, timeout: float = 10.0) -> tuple[str, dict]:
|
||||
"""Detecta el tipo de contenido de una URL.
|
||||
|
||||
Algoritmo:
|
||||
1. Verificar si la URL es un patron de repo de codigo (git@, github.com/org/repo).
|
||||
2. Verificar extension en el path de la URL (.pdf, .md, .txt, .html, .git).
|
||||
3. Si no se determino: HTTP HEAD request para leer Content-Type header.
|
||||
4. Default: "webpage".
|
||||
|
||||
Args:
|
||||
url: URL a analizar.
|
||||
timeout: Timeout en segundos para el HTTP HEAD request (si es necesario).
|
||||
|
||||
Returns:
|
||||
Tuple de (tipo, metadata) donde tipo es uno de:
|
||||
"webpage", "pdf", "markdown", "text", "code_repository".
|
||||
metadata incluye la informacion disponible (extension, content_type, etc.).
|
||||
|
||||
Raises:
|
||||
Exception: Si falla la conexion HTTP cuando es necesaria.
|
||||
"""
|
||||
import httpx
|
||||
|
||||
url = url.strip()
|
||||
metadata: dict = {"url": url}
|
||||
|
||||
# 1. SSH git shorthand
|
||||
if _is_ssh_git_url(url):
|
||||
metadata["detection"] = "ssh_pattern"
|
||||
return "code_repository", metadata
|
||||
|
||||
parsed = urlparse(url)
|
||||
path_segments = [s for s in parsed.path.split("/") if s]
|
||||
|
||||
# 2. Code repo by URL pattern
|
||||
if _is_code_repo_url(parsed, path_segments):
|
||||
metadata["detection"] = "url_pattern"
|
||||
metadata["host"] = parsed.hostname
|
||||
return "code_repository", metadata
|
||||
|
||||
# 3. Extension-based detection
|
||||
ext_type = _type_from_extension(parsed.path)
|
||||
if ext_type is not None:
|
||||
metadata["detection"] = "extension"
|
||||
metadata["path"] = parsed.path
|
||||
return ext_type, metadata
|
||||
|
||||
# 4. HTTP HEAD request
|
||||
try:
|
||||
response = httpx.head(url, timeout=timeout, follow_redirects=True)
|
||||
content_type = response.headers.get("content-type", "")
|
||||
metadata["detection"] = "content_type_header"
|
||||
metadata["content_type"] = content_type
|
||||
metadata["status_code"] = response.status_code
|
||||
return _type_from_content_type(content_type), metadata
|
||||
except Exception as exc:
|
||||
raise Exception(f"detect_url_type: HEAD request failed for {url!r}: {exc}") from exc
|
||||
@@ -0,0 +1,89 @@
|
||||
"""Tests para detect_url_type (tests que no requieren red)."""
|
||||
|
||||
import sys
|
||||
import os
|
||||
|
||||
sys.path.insert(0, os.path.join(os.path.dirname(__file__), ".."))
|
||||
|
||||
from core.detect_url_type import detect_url_type, _type_from_extension, _type_from_content_type, _is_ssh_git_url
|
||||
|
||||
|
||||
def test_url_pdf_por_extension():
|
||||
"""URL .pdf se detecta por extension sin hacer request HTTP."""
|
||||
url_type, metadata = detect_url_type("https://example.com/report.pdf")
|
||||
assert url_type == "pdf"
|
||||
assert metadata["detection"] == "extension"
|
||||
|
||||
|
||||
def test_url_github_repo():
|
||||
"""URL de GitHub org/repo se detecta como code_repository por patron URL."""
|
||||
url_type, metadata = detect_url_type("https://github.com/openai/whisper")
|
||||
assert url_type == "code_repository"
|
||||
assert metadata["detection"] == "url_pattern"
|
||||
|
||||
|
||||
def test_url_github_con_git_suffix():
|
||||
"""URL github terminada en .git se detecta como code_repository."""
|
||||
url_type, metadata = detect_url_type("https://github.com/openai/whisper.git")
|
||||
assert url_type == "code_repository"
|
||||
|
||||
|
||||
def test_url_markdown_por_extension():
|
||||
"""URL .md se detecta como markdown por extension."""
|
||||
url_type, metadata = detect_url_type("https://example.com/README.md")
|
||||
assert url_type == "markdown"
|
||||
assert metadata["detection"] == "extension"
|
||||
|
||||
|
||||
def test_url_ssh_git():
|
||||
"""URL SSH git@ se detecta como code_repository."""
|
||||
url_type, metadata = detect_url_type("git@github.com:openai/whisper.git")
|
||||
assert url_type == "code_repository"
|
||||
assert metadata["detection"] == "ssh_pattern"
|
||||
|
||||
|
||||
def test_url_html_por_extension():
|
||||
"""URL .html se detecta como webpage por extension."""
|
||||
url_type, metadata = detect_url_type("https://example.com/page.html")
|
||||
assert url_type == "webpage"
|
||||
assert metadata["detection"] == "extension"
|
||||
|
||||
|
||||
def test_url_txt_por_extension():
|
||||
"""URL .txt se detecta como text por extension."""
|
||||
url_type, metadata = detect_url_type("https://example.com/data.txt")
|
||||
assert url_type == "text"
|
||||
|
||||
|
||||
def test_github_subrepo_no_es_repo():
|
||||
"""URL de GitHub apuntando a un issue/blob no se trata como code_repository."""
|
||||
# A blob URL points at a sub-resource, so the URL pattern must not classify it as a repo.
|
||||
# The .md extension is matched next, so no HEAD request is made and the test needs no network.
|
||||
url = "https://github.com/openai/whisper/blob/main/README.md"
|
||||
url_type, metadata = detect_url_type(url)
|
||||
assert url_type == "markdown"
|
||||
|
||||
|
||||
def test_helper_type_from_extension():
|
||||
"""_type_from_extension funciona para extensiones conocidas."""
|
||||
assert _type_from_extension("/doc.pdf") == "pdf"
|
||||
assert _type_from_extension("/README.md") == "markdown"
|
||||
assert _type_from_extension("/notes.txt") == "text"
|
||||
assert _type_from_extension("/unknown.xyz") is None
|
||||
|
||||
|
||||
def test_helper_type_from_content_type():
|
||||
"""_type_from_content_type mapea headers correctamente."""
|
||||
assert _type_from_content_type("application/pdf; charset=utf-8") == "pdf"
|
||||
assert _type_from_content_type("text/html; charset=utf-8") == "webpage"
|
||||
assert _type_from_content_type("text/plain") == "text"
|
||||
assert _type_from_content_type("text/markdown") == "markdown"
|
||||
assert _type_from_content_type("application/octet-stream") == "webpage"
|
||||
|
||||
|
||||
def test_helper_is_ssh_git_url():
|
||||
"""_is_ssh_git_url detecta formato git@."""
|
||||
assert _is_ssh_git_url("git@github.com:org/repo.git") is True
|
||||
assert _is_ssh_git_url("https://github.com/org/repo") is False
|
||||
assert _is_ssh_git_url("ssh://git@github.com/org/repo") is False
|
||||
@@ -0,0 +1,40 @@
|
||||
---
|
||||
name: docx_to_markdown
|
||||
kind: function
|
||||
lang: py
|
||||
domain: core
|
||||
version: "1.0.0"
|
||||
purity: impure
|
||||
signature: "docx_to_markdown(docx_path: str) -> str"
|
||||
description: "Converts a Word (.docx) document to markdown, preserving structure (headings), inline formatting (bold, italic, underline) and tables in their original position."
|
||||
tags: [docx, markdown, word, conversion, document, parsing, text]
|
||||
uses_functions: [format_table_to_markdown_py_core]
|
||||
uses_types: []
|
||||
returns: []
|
||||
returns_optional: false
|
||||
error_type: "error_go_core"
|
||||
imports: [python-docx, lxml]
|
||||
tested: true
|
||||
tests: ["docx con headings y parrafos", "docx con tablas intercaladas", "docx con formato bold/italic", "docx vacio", "archivo no encontrado lanza FileNotFoundError"]
|
||||
test_file_path: "python/functions/core/docx_to_markdown_test.py"
|
||||
file_path: "python/functions/core/docx_to_markdown.py"
|
||||
---
|
||||
|
||||
## Ejemplo
|
||||
|
||||
```python
|
||||
md = docx_to_markdown("informe.docx")
|
||||
# # Titulo
|
||||
#
|
||||
# Primer parrafo.
|
||||
#
|
||||
# | Col1 | Col2 |
|
||||
# | ---- | ---- |
|
||||
# | a | b |
|
||||
#
|
||||
# Parrafo despues de la tabla.
|
||||
```
|
||||
|
||||
## Notas
|
||||
|
||||
Walks `doc.element.body` in order (rather than `doc.paragraphs` and `doc.tables` separately) to preserve the original position of tables. Builds a `{id(tbl_element): Table}` map for O(1) lookup. Inline formatting applies underline (`<ins>`), italic (`*`) and bold (`**`) in that order, from innermost to outermost. Headings are detected from the paragraph style (`Heading 1`, `Heading 2`, etc.). Requires `python-docx` to be installed in the environment.
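
The nesting order of the inline markers can be illustrated without python-docx. This is a sketch under stated assumptions: `wrap_inline` is a hypothetical helper that takes the formatting flags as booleans instead of reading them from a `<w:rPr>` element.

```python
def wrap_inline(text, bold=False, italic=False, underline=False):
    """Wrap text in markdown/HTML markers, innermost first.

    Hypothetical helper: the flags stand in for what the converter reads
    from the run properties of a Word document.
    """
    if not text:
        return ""
    if underline:
        text = f"<ins>{text}</ins>"  # innermost
    if italic:
        text = f"*{text}*"
    if bold:
        text = f"**{text}**"  # outermost
    return text


print(wrap_inline("hola", bold=True))
print(wrap_inline("hola", bold=True, italic=True, underline=True))
```

Applying the wrappers in a fixed order keeps the output stable when a run combines several styles.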
|
||||
@@ -0,0 +1,153 @@
|
||||
"""Convert a Word .docx document to Markdown, preserving structure, inline
|
||||
formatting and tables in their original document order."""
|
||||
|
||||
import os
|
||||
from lxml import etree
|
||||
|
||||
from format_table_to_markdown import format_table_to_markdown
|
||||
|
||||
|
||||
# XML namespace used by python-docx element tags
|
||||
_W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
|
||||
_TAG_P = f"{{{_W}}}p"
|
||||
_TAG_TBL = f"{{{_W}}}tbl"
|
||||
_TAG_TR = f"{{{_W}}}tr"
|
||||
_TAG_TC = f"{{{_W}}}tc"
|
||||
_TAG_R = f"{{{_W}}}r"
|
||||
_TAG_T = f"{{{_W}}}t"
|
||||
_TAG_RPR = f"{{{_W}}}rPr"
|
||||
_TAG_B = f"{{{_W}}}b"
|
||||
_TAG_I = f"{{{_W}}}i"
|
||||
_TAG_U = f"{{{_W}}}u"
|
||||
_TAG_PSTYLE = f"{{{_W}}}pStyle"
|
||||
_TAG_PPR = f"{{{_W}}}pPr"
|
||||
|
||||
|
||||
def _heading_level(paragraph) -> int:
|
||||
"""Return heading level (1-6) if the paragraph is a heading, else 0."""
|
||||
pPr = paragraph._p.find(_TAG_PPR)
|
||||
if pPr is None:
|
||||
return 0
|
||||
pStyle = pPr.find(_TAG_PSTYLE)
|
||||
if pStyle is None:
|
||||
return 0
|
||||
val = pStyle.get(f"{{{_W}}}val", "")
|
||||
if val.lower().startswith("heading"):
|
||||
parts = val.split()
|
||||
if len(parts) == 2:
|
||||
try:
|
||||
return int(parts[1])
|
||||
except ValueError:
|
||||
pass
|
||||
# Style IDs are usually written without a space (e.g. "Heading1")
|
||||
suffix = val[len("heading"):]
|
||||
if suffix.isdigit():
|
||||
return int(suffix)
|
||||
return 0
|
||||
|
||||
|
||||
def _run_to_md(run_elem) -> str:
|
||||
"""Convert a single <w:r> element to a markdown-formatted string."""
|
||||
# Collect text
|
||||
text_parts = []
|
||||
for t in run_elem.findall(_TAG_T):
|
||||
text_parts.append(t.text or "")
|
||||
text = "".join(text_parts)
|
||||
if not text:
|
||||
return ""
|
||||
|
||||
# Read formatting from <w:rPr>
|
||||
rPr = run_elem.find(_TAG_RPR)
|
||||
bold = False
|
||||
italic = False
|
||||
underline = False
|
||||
if rPr is not None:
|
||||
bold = rPr.find(_TAG_B) is not None
|
||||
italic = rPr.find(_TAG_I) is not None
|
||||
u_elem = rPr.find(_TAG_U)
|
||||
if u_elem is not None:
|
||||
u_val = u_elem.get(f"{{{_W}}}val", "")
|
||||
underline = u_val not in ("none", "")
|
||||
|
||||
# Apply markdown formatting (innermost first: underline → italic → bold)
|
||||
if underline:
|
||||
text = f"<ins>{text}</ins>"
|
||||
if italic:
|
||||
text = f"*{text}*"
|
||||
if bold:
|
||||
text = f"**{text}**"
|
||||
return text
|
||||
|
||||
|
||||
def _paragraph_to_md(paragraph) -> str:
|
||||
"""Convert a python-docx Paragraph to a markdown string."""
|
||||
level = _heading_level(paragraph)
|
||||
runs_md = "".join(
|
||||
_run_to_md(elem)
|
||||
for elem in paragraph._p
|
||||
if elem.tag == _TAG_R
|
||||
)
|
||||
if level:
|
||||
return f"{'#' * level} {runs_md}"
|
||||
return runs_md
|
||||
|
||||
|
||||
def _table_to_md(table) -> str:
|
||||
"""Convert a python-docx Table to a markdown table string."""
|
||||
rows: list[list[str]] = []
|
||||
for row in table.rows:
|
||||
cells = []
|
||||
for cell in row.cells:
|
||||
# Join all paragraphs in the cell with a space
|
||||
cell_text = " ".join(p.text for p in cell.paragraphs).strip()
|
||||
cells.append(cell_text)
|
||||
rows.append(cells)
|
||||
return format_table_to_markdown(rows, has_header=True)
|
||||
|
||||
|
||||
def docx_to_markdown(docx_path: str) -> str:
|
||||
"""Convert a Word .docx document to Markdown.
|
||||
|
||||
Preserves document structure (headings), inline formatting (bold, italic,
|
||||
underline) and tables in their original position.
|
||||
|
||||
Args:
|
||||
docx_path: Absolute or relative path to the .docx file.
|
||||
|
||||
Returns:
|
||||
Markdown string representing the document.
|
||||
|
||||
Raises:
|
||||
FileNotFoundError: If the file does not exist.
|
||||
Exception: If the file cannot be parsed as a .docx document.
|
||||
"""
|
||||
import docx # deferred so the module is importable without python-docx installed
|
||||
|
||||
if not os.path.exists(docx_path):
|
||||
raise FileNotFoundError(f"File not found: {docx_path}")
|
||||
|
||||
doc = docx.Document(docx_path)
|
||||
|
||||
# Build a mapping from the XML element id to the Table object for O(1) lookup
|
||||
table_map: dict[int, object] = {
|
||||
id(table._tbl): table for table in doc.tables
|
||||
}
|
||||
|
||||
parts: list[str] = []
|
||||
|
||||
for child in doc.element.body:
|
||||
if child.tag == _TAG_P:
|
||||
# Wrap in a temporary paragraph object to reuse _paragraph_to_md
|
||||
from docx.text.paragraph import Paragraph
|
||||
para = Paragraph(child, doc)
|
||||
md = _paragraph_to_md(para)
|
||||
if md.strip():
|
||||
parts.append(md)
|
||||
elif child.tag == _TAG_TBL:
|
||||
table = table_map.get(id(child))
|
||||
if table is not None:
|
||||
md = _table_to_md(table)
|
||||
if md:
|
||||
parts.append(md)
|
||||
|
||||
return "\n\n".join(parts)
|
||||
@@ -0,0 +1,129 @@
|
||||
"""Tests para docx_to_markdown."""
|
||||
|
||||
import os
|
||||
import sys
|
||||
import tempfile
|
||||
|
||||
import pytest
|
||||
|
||||
sys.path.insert(0, os.path.dirname(__file__))
|
||||
|
||||
import docx as python_docx
|
||||
from docx_to_markdown import docx_to_markdown
|
||||
|
||||
|
||||
def _make_docx(builder_fn) -> str:
|
||||
"""Create a temporary .docx file using builder_fn(doc) and return its path."""
|
||||
doc = python_docx.Document()
|
||||
builder_fn(doc)
|
||||
tmp = tempfile.NamedTemporaryFile(suffix=".docx", delete=False)
|
||||
doc.save(tmp.name)
|
||||
tmp.close()
|
||||
return tmp.name
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Tests
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
def test_docx_con_headings_y_parrafos():
|
||||
"""docx con headings y parrafos"""
|
||||
|
||||
def build(doc):
|
||||
doc.add_heading("Titulo Principal", level=1)
|
||||
doc.add_paragraph("Primer parrafo de contenido.")
|
||||
doc.add_heading("Seccion", level=2)
|
||||
doc.add_paragraph("Segundo parrafo.")
|
||||
|
||||
path = _make_docx(build)
|
||||
try:
|
||||
result = docx_to_markdown(path)
|
||||
assert "# Titulo Principal" in result
|
||||
assert "## Seccion" in result
|
||||
assert "Primer parrafo de contenido." in result
|
||||
assert "Segundo parrafo." in result
|
||||
finally:
|
||||
os.unlink(path)
|
||||
|
||||
|
||||
def test_docx_con_tablas_intercaladas():
|
||||
"""docx con tablas intercaladas"""
|
||||
|
||||
def build(doc):
|
||||
doc.add_paragraph("Texto antes de la tabla.")
|
||||
table = doc.add_table(rows=2, cols=3)
|
||||
table.cell(0, 0).text = "Col1"
|
||||
table.cell(0, 1).text = "Col2"
|
||||
table.cell(0, 2).text = "Col3"
|
||||
table.cell(1, 0).text = "a"
|
||||
table.cell(1, 1).text = "b"
|
||||
table.cell(1, 2).text = "c"
|
||||
doc.add_paragraph("Texto despues de la tabla.")
|
||||
|
||||
path = _make_docx(build)
|
||||
try:
|
||||
result = docx_to_markdown(path)
|
||||
# Table must appear BETWEEN the two paragraphs
|
||||
before_idx = result.index("Texto antes de la tabla.")
|
||||
table_idx = result.index("| Col1")
|
||||
after_idx = result.index("Texto despues de la tabla.")
|
||||
assert before_idx < table_idx < after_idx
|
||||
assert "| Col2" in result
|
||||
assert "| a" in result
|
||||
finally:
|
||||
os.unlink(path)
|
||||
|
||||
|
||||
def test_docx_con_formato_bold_italic():
|
||||
"""docx con formato bold/italic"""
|
||||
|
||||
def build(doc):
|
||||
para = doc.add_paragraph()
|
||||
run_bold = para.add_run("negrita")
|
||||
run_bold.bold = True
|
||||
run_normal = para.add_run(" texto normal ")
|
||||
run_italic = para.add_run("cursiva")
|
||||
run_italic.italic = True
|
||||
|
||||
path = _make_docx(build)
|
||||
try:
|
||||
result = docx_to_markdown(path)
|
||||
assert "**negrita**" in result
|
||||
assert "*cursiva*" in result
|
||||
assert "texto normal" in result
|
||||
finally:
|
||||
os.unlink(path)
|
||||
|
||||
|
||||
def test_docx_vacio():
|
||||
"""docx vacio"""
|
||||
|
||||
def build(doc):
|
||||
# python-docx adds a default empty paragraph; remove all content
|
||||
# by just not adding anything — the default empty paragraph will
|
||||
# produce an empty string that gets filtered out.
|
||||
pass
|
||||
|
||||
path = _make_docx(build)
|
||||
try:
|
||||
result = docx_to_markdown(path)
|
||||
# Empty document should produce empty or whitespace-only output
|
||||
assert result.strip() == ""
|
||||
finally:
|
||||
os.unlink(path)
|
||||
|
||||
|
||||
def test_archivo_no_encontrado():
|
||||
"""archivo no encontrado lanza FileNotFoundError"""
|
||||
with pytest.raises(FileNotFoundError):
|
||||
docx_to_markdown("/tmp/nonexistent_file_fn_registry.docx")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
test_docx_con_headings_y_parrafos()
|
||||
test_docx_con_tablas_intercaladas()
|
||||
test_docx_con_formato_bold_italic()
|
||||
test_docx_vacio()
|
||||
test_archivo_no_encontrado()
|
||||
print("All tests passed.")
|
||||
@@ -0,0 +1,52 @@
|
||||
---
|
||||
name: epub_to_markdown
|
||||
kind: function
|
||||
lang: py
|
||||
domain: core
|
||||
version: "1.0.0"
|
||||
purity: impure
|
||||
signature: "epub_to_markdown(epub_path: str) -> str"
|
||||
description: "Converts an EPUB ebook to markdown. Tries ebooklib first for structured extraction (title, author, documents); falls back to manual extraction with zipfile if ebooklib is not installed."
|
||||
tags: [epub, markdown, ebook, parsing, conversion, html, text-extraction]
|
||||
uses_functions: []
|
||||
uses_types: []
|
||||
returns: []
|
||||
returns_optional: false
|
||||
error_type: "error_go_core"
|
||||
imports: [zipfile, html, re, ebooklib]
|
||||
tested: true
|
||||
tests:
|
||||
- "conversion de headings h1-h3"
|
||||
- "conversion de bold e italic"
|
||||
- "script y style se eliminan del output"
|
||||
- "HTML entities se convierten a caracteres"
|
||||
- "epub sin ebooklib extrae texto de archivos html"
|
||||
- "epub con ebooklib incluye titulo y autor en el output"
|
||||
- "epub corrupto lanza excepcion"
|
||||
test_file_path: "python/functions/core/epub_to_markdown_test.py"
|
||||
file_path: "python/functions/core/epub_to_markdown.py"
|
||||
---
|
||||
|
||||
## Ejemplo
|
||||
|
||||
```python
|
||||
md = epub_to_markdown("/path/to/book.epub")
|
||||
print(md[:500])
|
||||
# # Mi Libro
|
||||
# **Author:** Ana Perez
|
||||
#
|
||||
# # Introduccion
|
||||
# Primer parrafo...
|
||||
```
|
||||
|
||||
## Notas
|
||||
|
||||
The HTML to markdown conversion covers: headings h1-h6, bold (`<strong>`/`<b>`), italic (`<em>`/`<i>`), paragraphs and line breaks. It removes `<script>` and `<style>` blocks, unescapes HTML entities and normalizes whitespace.
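
A compact sketch of this kind of regex-based conversion, using only the standard library. It is an illustration rather than the registry code: `html_to_md` is a hypothetical name, it adds a blank line after each heading, and it uses `\b` after short tag names so that `<b>` does not also match `<body>` or `<br>`.

```python
import re
from html import unescape

FLAGS = re.IGNORECASE | re.DOTALL


def html_to_md(html):
    """Convert a small subset of HTML to markdown (hypothetical sketch)."""
    # Drop script/style blocks together with their content.
    text = re.sub(r"<(script|style)\b[^>]*>.*?</\1>", "", html, flags=FLAGS)

    # Headings h6..h1 become 6..1 leading hashes.
    for level in range(6, 0, -1):
        text = re.sub(
            rf"<h{level}\b[^>]*>(.*?)</h{level}>",
            lambda m, h="#" * level: f"{h} {m.group(1).strip()}\n\n",
            text,
            flags=FLAGS,
        )

    # Inline emphasis; \b keeps <b> from matching <body> and <i> from <img>.
    text = re.sub(r"<(strong|b)\b[^>]*>(.*?)</\1>", r"**\2**", text, flags=FLAGS)
    text = re.sub(r"<(em|i)\b[^>]*>(.*?)</\1>", r"*\2*", text, flags=FLAGS)

    # Paragraphs and line breaks.
    text = re.sub(
        r"<p\b[^>]*>(.*?)</p>",
        lambda m: m.group(1).strip() + "\n\n",
        text,
        flags=FLAGS,
    )
    text = re.sub(r"<br\s*/?>", "\n", text, flags=re.IGNORECASE)

    # Strip whatever tags remain, then decode entities and tidy whitespace.
    text = re.sub(r"<[^>]+>", "", text)
    text = unescape(text)
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


sample = (
    "<body><h1>Title</h1><p>Hi <b>bold</b> &amp; <em>it</em></p>"
    "<script>x()</script></body>"
)
print(html_to_md(sample))
```

Regex conversion like this is only suitable for simple, well-formed markup; nested or malformed HTML would need a real parser.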
|
||||
|
||||
With ebooklib: extracts the DC metadata (title, author) from the OPF and processes only the `ITEM_DOCUMENT` items returned by `get_items_of_type`.
|
||||
|
||||
Without ebooklib (ZIP fallback): lists the `.html`/`.xhtml`/`.htm` files in alphabetical order and extracts their content. No title/author metadata is available in this mode.
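
The ZIP fallback can be sketched with the standard library alone. This is a minimal illustration with an in-memory archive: `list_html_members` and the member names are hypothetical. Note that alphabetical order is only an approximation of reading order; for example `ch10.xhtml` sorts before `ch2.xhtml`.

```python
import io
import zipfile

HTML_SUFFIXES = (".html", ".xhtml", ".htm")


def list_html_members(zf):
    """Return the HTML-like member names of an open ZipFile, sorted by name."""
    return sorted(n for n in zf.namelist() if n.lower().endswith(HTML_SUFFIXES))


# Build a tiny EPUB-like archive in memory (hypothetical content).
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("mimetype", "application/epub+zip")
    zf.writestr("OEBPS/ch2.xhtml", "<p>two</p>")
    zf.writestr("OEBPS/ch1.xhtml", "<p>one</p>")
    zf.writestr("OEBPS/style.css", "p {}")

# Read it back: only the HTML members are kept, in alphabetical order.
with zipfile.ZipFile(buf) as zf:
    members = list_html_members(zf)
    texts = [zf.read(n).decode("utf-8", errors="replace") for n in members]

print(members)
print(texts)
```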
|
||||
|
||||
Optional dependency: `pip install ebooklib`. If it is not installed, the function still works via zipfile.
|
||||
|
||||
Conceptual reimplementation based on OpenViking `openviking/parse/parsers/epub.py` (AGPL-3.0). The code is original.
|
||||
@@ -0,0 +1,128 @@
|
||||
"""Convert an EPUB file to markdown text."""
|
||||
|
||||
import re
|
||||
import zipfile
|
||||
from html import unescape
|
||||
from html.parser import HTMLParser
|
||||
|
||||
|
||||
def _remove_tags(html: str, tag: str) -> str:
|
||||
"""Remove a tag and its content from HTML string."""
|
||||
pattern = re.compile(rf'<{tag}[^>]*>.*?</{tag}>', re.IGNORECASE | re.DOTALL)
|
||||
return pattern.sub('', html)
|
||||
|
||||
|
||||
def _html_to_markdown(html: str) -> str:
|
||||
"""Convert basic HTML to markdown.
|
||||
|
||||
Handles headings, bold, italic, paragraphs, line breaks
|
||||
and strips remaining tags.
|
||||
|
||||
Args:
|
||||
html: HTML string to convert.
|
||||
|
||||
Returns:
|
||||
Markdown-formatted string.
|
||||
"""
|
||||
# Remove script and style blocks
|
||||
text = _remove_tags(html, 'script')
|
||||
text = _remove_tags(text, 'style')
|
||||
|
||||
# Headings h1-h6
|
||||
for level in range(6, 0, -1):
|
||||
hashes = '#' * level
|
||||
text = re.sub(
|
||||
rf'<h{level}[^>]*>(.*?)</h{level}>',
|
||||
lambda m, h=hashes: f'{h} {m.group(1).strip()}',
|
||||
text,
|
||||
flags=re.IGNORECASE | re.DOTALL,
|
||||
)
|
||||
|
||||
# Bold
|
||||
text = re.sub(r'<strong[^>]*>(.*?)</strong>', r'**\1**', text, flags=re.IGNORECASE | re.DOTALL)
|
||||
text = re.sub(r'<b\b[^>]*>(.*?)</b>', r'**\1**', text, flags=re.IGNORECASE | re.DOTALL)
|
||||
|
||||
# Italic
|
||||
text = re.sub(r'<em\b[^>]*>(.*?)</em>', r'*\1*', text, flags=re.IGNORECASE | re.DOTALL)
|
||||
text = re.sub(r'<i\b[^>]*>(.*?)</i>', r'*\1*', text, flags=re.IGNORECASE | re.DOTALL)
|
||||
|
||||
# Paragraphs — append double newline after content
|
||||
text = re.sub(r'<p\b[^>]*>(.*?)</p>', lambda m: m.group(1).strip() + '\n\n', text, flags=re.IGNORECASE | re.DOTALL)
|
||||
|
||||
# Line breaks
|
||||
text = re.sub(r'<br\s*/?>', '\n', text, flags=re.IGNORECASE)
|
||||
|
||||
# Strip remaining HTML tags
|
||||
text = re.sub(r'<[^>]+>', '', text)
|
||||
|
||||
# Unescape HTML entities
|
||||
text = unescape(text)
|
||||
|
||||
# Normalize whitespace: collapse multiple blank lines into two
|
||||
text = re.sub(r'\n{3,}', '\n\n', text)
|
||||
text = re.sub(r'[ \t]+', ' ', text)
|
||||
|
||||
return text.strip()
|
||||
|
||||
|
||||
def _epub_via_ebooklib(epub_path: str) -> str:
|
||||
"""Extract markdown from EPUB using ebooklib."""
|
||||
import ebooklib
|
||||
from ebooklib import epub
|
||||
|
||||
book = epub.read_epub(epub_path)
|
||||
|
||||
# Metadata
|
||||
title_meta = book.get_metadata('DC', 'title')
|
||||
author_meta = book.get_metadata('DC', 'creator')
|
||||
title = title_meta[0][0] if title_meta else 'Unknown Title'
|
||||
author = author_meta[0][0] if author_meta else 'Unknown Author'
|
||||
|
||||
parts = [f'# {title}', f'**Author:** {author}']
|
||||
|
||||
for item in book.get_items_of_type(ebooklib.ITEM_DOCUMENT):
|
||||
content = item.get_content().decode('utf-8', errors='replace')
|
||||
md = _html_to_markdown(content)
|
||||
if md:
|
||||
parts.append(md)
|
||||
|
||||
return '\n\n'.join(parts)
|
||||
|
||||
|
||||
def _epub_via_zipfile(epub_path: str) -> str:
|
||||
"""Extract markdown from EPUB using zipfile (fallback)."""
|
||||
parts = []
|
||||
with zipfile.ZipFile(epub_path, 'r') as zf:
|
||||
html_files = sorted(
|
||||
name for name in zf.namelist()
|
||||
if name.lower().endswith(('.html', '.xhtml', '.htm'))
|
||||
)
|
||||
for name in html_files:
|
||||
raw = zf.read(name).decode('utf-8', errors='replace')
|
||||
md = _html_to_markdown(raw)
|
||||
if md:
|
||||
parts.append(md)
|
||||
|
||||
return '\n\n'.join(parts)
|
||||
|
||||
|
||||
def epub_to_markdown(epub_path: str) -> str:
|
||||
"""Convert an EPUB ebook to markdown.
|
||||
|
||||
Attempts to use ebooklib for structured extraction (title, author,
|
||||
document items). Falls back to manual ZIP extraction if ebooklib is
|
||||
not installed.
|
||||
|
||||
Args:
|
||||
epub_path: Path to the .epub file.
|
||||
|
||||
Returns:
|
||||
Markdown string with the book content.
|
||||
|
||||
Raises:
|
||||
Exception: If the file cannot be read or is not a valid EPUB.
|
||||
"""
|
||||
try:
|
||||
return _epub_via_ebooklib(epub_path)
|
||||
except ImportError:
|
||||
return _epub_via_zipfile(epub_path)
|
||||
@@ -0,0 +1,163 @@
|
||||
"""Tests para epub_to_markdown."""
|
||||
|
||||
import io
|
||||
import os
|
||||
import struct
|
||||
import sys
|
||||
import zipfile
|
||||
|
||||
import pytest
|
||||
|
||||
sys.path.insert(0, os.path.dirname(__file__))
|
||||
from epub_to_markdown import _html_to_markdown, _epub_via_zipfile, epub_to_markdown
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Helpers para construir EPUBs minimos en memoria
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
def _build_epub(files: dict[str, str]) -> str:
|
||||
"""Crea un EPUB minimo como ZIP en disco y retorna el path."""
|
||||
import tempfile
|
||||
tmp = tempfile.NamedTemporaryFile(suffix='.epub', delete=False)
|
||||
with zipfile.ZipFile(tmp, 'w') as zf:
|
||||
for name, content in files.items():
|
||||
zf.writestr(name, content)
|
||||
tmp.close()
|
||||
return tmp.name
|
||||
|
||||
|
||||
def _build_epub_with_opf(title: str, author: str, body_html: str) -> str:
|
||||
"""Crea un EPUB con OPF y un documento HTML valido para ebooklib."""
|
||||
opf = f"""<?xml version='1.0' encoding='utf-8'?>
|
||||
<package xmlns='http://www.idpf.org/2007/opf' unique-identifier='uid' version='2.0'>
|
||||
<metadata xmlns:dc='http://purl.org/dc/elements/1.1/'>
|
||||
<dc:title>{title}</dc:title>
|
||||
<dc:creator>{author}</dc:creator>
|
||||
<dc:identifier id='uid'>test-uid</dc:identifier>
|
||||
<dc:language>en</dc:language>
|
||||
</metadata>
|
||||
<manifest>
|
||||
<item id='ch1' href='chapter1.xhtml' media-type='application/xhtml+xml'/>
|
||||
<item id='ncx' href='toc.ncx' media-type='application/x-dtbncx+xml'/>
|
||||
</manifest>
|
||||
<spine toc='ncx'>
|
||||
<itemref idref='ch1'/>
|
||||
</spine>
|
||||
</package>"""
|
||||
|
||||
ncx = """<?xml version='1.0' encoding='utf-8'?>
|
||||
<ncx xmlns='http://www.daisy.org/z3986/2005/ncx/' version='2005-1'>
|
||||
<head><meta name='dtb:uid' content='test-uid'/></head>
|
||||
<docTitle><text>Test</text></docTitle>
|
||||
<navMap/>
|
||||
</ncx>"""
|
||||
|
||||
chapter = f"""<?xml version='1.0' encoding='utf-8'?>
|
||||
<!DOCTYPE html PUBLIC '-//W3C//DTD XHTML 1.1//EN' 'http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd'>
|
||||
<html xmlns='http://www.w3.org/1999/xhtml'>
|
||||
<head><title>Chapter</title></head>
|
||||
<body>{body_html}</body>
|
||||
</html>"""
|
||||
|
||||
return _build_epub({
|
||||
'mimetype': 'application/epub+zip',
|
||||
'META-INF/container.xml': """<?xml version='1.0'?>
|
||||
<container version='1.0' xmlns='urn:oasis:names:tc:opendocument:xmlns:container'>
|
||||
<rootfiles>
|
||||
<rootfile full-path='content.opf' media-type='application/oebps-package+xml'/>
|
||||
</rootfiles>
|
||||
</container>""",
|
||||
'content.opf': opf,
|
||||
'toc.ncx': ncx,
|
||||
'chapter1.xhtml': chapter,
|
||||
})
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Tests for _html_to_markdown (pure, no disk access)
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
def test_html_heading_conversion():
|
||||
"""conversion de headings h1-h3."""
|
||||
html = '<h1>Titulo</h1><h2>Subtitulo</h2><h3>Seccion</h3>'
|
||||
result = _html_to_markdown(html)
|
||||
assert '# Titulo' in result
|
||||
assert '## Subtitulo' in result
|
||||
assert '### Seccion' in result
|
||||
|
||||
|
||||
def test_html_bold_italic():
|
||||
"""conversion de bold e italic."""
|
||||
html = '<p><strong>negrita</strong> y <em>italica</em></p>'
|
||||
result = _html_to_markdown(html)
|
||||
assert '**negrita**' in result
|
||||
assert '*italica*' in result
|
||||
|
||||
|
||||
def test_html_script_style_removed():
|
||||
"""script y style se eliminan del output."""
|
||||
html = '<script>alert(1)</script><style>body{}</style><p>Contenido</p>'
|
||||
result = _html_to_markdown(html)
|
||||
assert 'alert' not in result
|
||||
assert 'body{}' not in result
|
||||
assert 'Contenido' in result
|
||||
|
||||
|
||||
def test_html_entities_unescaped():
|
||||
"""HTML entities se convierten a caracteres."""
|
||||
    html = '<p>Tom &amp; Jerry &lt;show&gt;</p>'
|
||||
result = _html_to_markdown(html)
|
||||
assert 'Tom & Jerry' in result
|
||||
assert '<show>' in result
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Tests for _epub_via_zipfile (no ebooklib)
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
def test_epub_via_zipfile_extrae_html():
|
||||
"""epub sin ebooklib extrae texto de archivos html."""
|
||||
path = _build_epub({
|
||||
'chapter.html': '<html><body><h1>Capitulo Uno</h1><p>Hola mundo.</p></body></html>',
|
||||
})
|
||||
try:
|
||||
result = _epub_via_zipfile(path)
|
||||
assert 'Capitulo Uno' in result
|
||||
assert 'Hola mundo' in result
|
||||
finally:
|
||||
os.unlink(path)
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Tests for epub_to_markdown (integration)
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
def test_epub_con_ebooklib_metadata():
|
||||
"""epub con ebooklib incluye titulo y autor en el output."""
|
||||
pytest.importorskip('ebooklib')
|
||||
path = _build_epub_with_opf(
|
||||
title='Mi Libro',
|
||||
author='Ana Perez',
|
||||
body_html='<h1>Introduccion</h1><p>Primer parrafo.</p>',
|
||||
)
|
||||
try:
|
||||
result = epub_to_markdown(path)
|
||||
assert '# Mi Libro' in result
|
||||
assert 'Ana Perez' in result
|
||||
assert 'Introduccion' in result
|
||||
finally:
|
||||
os.unlink(path)
|
||||
|
||||
|
||||
def test_epub_corrupto_lanza_excepcion():
|
||||
"""epub corrupto lanza Exception."""
|
||||
import tempfile
|
||||
tmp = tempfile.NamedTemporaryFile(suffix='.epub', delete=False)
|
||||
tmp.write(b'esto no es un epub valido')
|
||||
tmp.close()
|
||||
try:
|
||||
with pytest.raises(Exception):
|
||||
epub_to_markdown(tmp.name)
|
||||
finally:
|
||||
os.unlink(tmp.name)
|
||||
@@ -0,0 +1,37 @@
|
||||
---
|
||||
name: estimate_token_count
|
||||
kind: function
|
||||
lang: py
|
||||
domain: core
|
||||
version: "1.0.0"
|
||||
purity: pure
|
||||
signature: "def estimate_token_count(content: str) -> int"
|
||||
description: "Fast token estimate without a tokenizer. CJK characters count as ~0.7 tokens each; other non-whitespace characters count as ~0.3 tokens each."
|
||||
tags: [tokens, estimation, nlp, cjk, text]
|
||||
uses_functions: []
|
||||
uses_types: []
|
||||
returns: []
|
||||
returns_optional: false
|
||||
error_type: ""
|
||||
imports: [re]
|
||||
tested: true
|
||||
tests:
|
||||
- "texto vacio retorna cero"
|
||||
- "solo latin"
|
||||
- "solo CJK"
|
||||
- "texto mixto"
|
||||
test_file_path: "python/functions/core/parse_markdown_test.py"
|
||||
file_path: "python/functions/core/core.py"
|
||||
---
|
||||
|
||||
## Example
|
||||
|
||||
```python
|
||||
estimate_token_count("hello world") # 3
|
||||
estimate_token_count("中文语") # 2 (3 * 0.7 = 2.1 -> 2)
|
||||
estimate_token_count("") # 0
|
||||
```
|
||||
|
||||
## Notes
|
||||
|
||||
Pure function with no external dependencies. Accuracy is approximate: useful as a context-limit guard before calling LLMs, not for exact BPE token counts. CJK range: `[\u4e00-\u9fff\u3040-\u30ff\uac00-\ud7af]` (CJK Unified Ideographs, Hiragana/Katakana, Hangul).
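
The heuristic described in these notes can be sketched in a few lines. This is an illustration only, not the registry implementation: the name `estimate_tokens_sketch`, the integer arithmetic, and the decision to round down are assumptions. Only the character ranges and the 0.7/0.3 weights come from the notes above.

```python
import re

# Character ranges quoted in the notes: CJK Unified Ideographs,
# Hiragana/Katakana, and Hangul syllables.
_CJK_RE = re.compile(r"[\u4e00-\u9fff\u3040-\u30ff\uac00-\ud7af]")


def estimate_tokens_sketch(content: str) -> int:
    """Hypothetical restatement of the heuristic; rounding down is an assumption."""
    cjk_chars = len(_CJK_RE.findall(content))
    non_space_chars = sum(1 for ch in content if not ch.isspace())
    other_chars = non_space_chars - cjk_chars
    # Work in tenths of a token so the result does not depend on float rounding.
    return (cjk_chars * 7 + other_chars * 3) // 10
```

Counting in tenths of a token keeps the result deterministic, which matters when the estimate is compared against a hard context limit.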
|
||||
@@ -0,0 +1,58 @@
|
||||
---
|
||||
name: excel_to_markdown
|
||||
kind: function
|
||||
lang: py
|
||||
domain: core
|
||||
version: "1.0.0"
|
||||
purity: impure
|
||||
signature: "def excel_to_markdown(path: str, max_rows_per_sheet: int = 1000) -> str"
|
||||
description: "Converts an Excel file (.xlsx, .xls, .xlsm) to markdown, with each sheet as an H2 section. Supported cell types: ISO dates, booleans, Excel errors, integers and floats. Sheets longer than max_rows_per_sheet are truncated."
|
||||
tags: [excel, markdown, xlsx, xls, conversion, parser, io]
|
||||
uses_functions: []
|
||||
uses_types: []
|
||||
returns: []
|
||||
returns_optional: false
|
||||
error_type: "error_go_core"
|
||||
imports: ["openpyxl", "xlrd"]
|
||||
tested: true
|
||||
tests:
|
||||
- "xlsx con multiples sheets produce una seccion H2 por sheet"
|
||||
- "sheet vacio produce nota de sheet vacio"
|
||||
- "sheet truncado con nota de filas omitidas"
|
||||
- "sheet con formulas data_only muestra valores calculados"
|
||||
- "extension no soportada lanza ValueError"
|
||||
- "archivo inexistente lanza FileNotFoundError"
|
||||
- "dimensiones del sheet en metadata"
|
||||
- "tabla markdown con formato correcto"
|
||||
test_file_path: "python/functions/core/excel_to_markdown_test.py"
|
||||
file_path: "python/functions/core/excel_to_markdown.py"
|
||||
---
|
||||
|
||||
## Example
|
||||
|
||||
```python
|
||||
from excel_to_markdown import excel_to_markdown
|
||||
|
||||
md = excel_to_markdown("report.xlsx")
|
||||
print(md)
|
||||
# ## Sheet: Ventas
|
||||
#
|
||||
# **Dimensions:** 101 x 4
|
||||
#
|
||||
# | Producto | Precio | Cantidad | Total |
|
||||
# | --- | --- | --- | --- |
|
||||
# | Manzana | 1 | 100 | 100 |
|
||||
# ...
|
||||
|
||||
# With a row limit
|
||||
md = excel_to_markdown("big_file.xlsx", max_rows_per_sheet=50)
|
||||
```
|
||||
|
||||
## Notes
|
||||
|
||||
- `.xlsx` and `.xlsm`: uses `openpyxl` with `data_only=True` (reads the stored computed values, not the formulas).
- `.xls` (legacy): uses `xlrd`. Special type handling: EMPTY/BLANK → "", DATE → ISO 8601, BOOLEAN → "TRUE"/"FALSE", ERROR → Excel error code (#NULL!, #DIV/0!, etc.), NUMBER → integer when it has no fractional part.
- Dates without a time component are formatted as `YYYY-MM-DD`; dates with a time as `YYYY-MM-DDTHH:MM:SS`.
- Pipes `|` inside cells are escaped as `\|`, and newlines inside cells are replaced with spaces.
- If `xlwt` is not available, the .xls tests are skipped (xlwt is only needed to create fixtures, not to read files).
- Reimplemented from scratch, conceptually inspired by OpenViking (AGPL-3.0). No code was copied.
|
||||
@@ -0,0 +1,211 @@
|
||||
"""Convert Excel files to Markdown, with each sheet as an H2 section."""
|
||||
|
||||
import os
|
||||
from pathlib import Path
|
||||
|
||||
|
||||
# Excel error codes as reported by xlrd
|
||||
_XL_ERROR_CODES = {
|
||||
0: "#NULL!",
|
||||
7: "#DIV/0!",
|
||||
15: "#VALUE!",
|
||||
23: "#REF!",
|
||||
29: "#NAME?",
|
||||
36: "#NUM!",
|
||||
42: "#N/A",
|
||||
}
|
||||
|
||||
|
||||
def _rows_to_markdown_table(rows: list[list[str]]) -> str:
|
||||
    """Convert rows of strings into a markdown table."""
|
||||
if not rows:
|
||||
return ""
|
||||
|
||||
header = rows[0]
|
||||
col_count = len(header)
|
||||
|
||||
    # Normalize every row to the same number of columns
|
||||
normalized = []
|
||||
for row in rows:
|
||||
if len(row) < col_count:
|
||||
row = row + [""] * (col_count - len(row))
|
||||
normalized.append(row[:col_count])
|
||||
|
||||
    # Escape pipes and flatten newlines inside cells
|
||||
def escape(cell: str) -> str:
|
||||
return cell.replace("|", "\\|").replace("\n", " ")
|
||||
|
||||
lines = []
|
||||
# Header
|
||||
lines.append("| " + " | ".join(escape(c) for c in normalized[0]) + " |")
|
||||
# Separator
|
||||
lines.append("| " + " | ".join("---" for _ in range(col_count)) + " |")
|
||||
# Data rows
|
||||
for row in normalized[1:]:
|
||||
lines.append("| " + " | ".join(escape(c) for c in row) + " |")
|
||||
|
||||
return "\n".join(lines)
|
||||
|
||||
|
||||
def _cell_value_xlrd(cell, workbook) -> str:
|
||||
    """Convert an xlrd cell to a string according to its type."""
|
||||
import xlrd
|
||||
|
||||
ctype = cell.ctype
|
||||
|
||||
if ctype in (xlrd.XL_CELL_EMPTY, xlrd.XL_CELL_BLANK):
|
||||
return ""
|
||||
elif ctype == xlrd.XL_CELL_DATE:
|
||||
try:
|
||||
dt = xlrd.xldate_as_datetime(cell.value, workbook.datemode)
|
||||
if dt.hour == 0 and dt.minute == 0 and dt.second == 0:
|
||||
return dt.date().isoformat()
|
||||
return dt.isoformat()
|
||||
except Exception:
|
||||
return str(cell.value)
|
||||
elif ctype == xlrd.XL_CELL_BOOLEAN:
|
||||
return "TRUE" if cell.value else "FALSE"
|
||||
elif ctype == xlrd.XL_CELL_ERROR:
|
||||
return _XL_ERROR_CODES.get(int(cell.value), "#ERROR!")
|
||||
elif ctype == xlrd.XL_CELL_NUMBER:
|
||||
v = cell.value
|
||||
if v == int(v):
|
||||
return str(int(v))
|
||||
return str(v)
|
||||
elif ctype == xlrd.XL_CELL_TEXT:
|
||||
return str(cell.value)
|
||||
else:
|
||||
return str(cell.value)
|
||||
|
||||
|
||||
def _sheet_xlrd(sheet, workbook, max_rows: int) -> str:
|
||||
    """Convert an xlrd sheet to markdown."""
|
||||
nrows = sheet.nrows
|
||||
ncols = sheet.ncols
|
||||
|
||||
lines = []
|
||||
lines.append(f"## Sheet: {sheet.name}")
|
||||
lines.append("")
|
||||
lines.append(f"**Dimensions:** {nrows} x {ncols}")
|
||||
lines.append("")
|
||||
|
||||
if nrows == 0 or ncols == 0:
|
||||
lines.append("*(empty sheet)*")
|
||||
return "\n".join(lines)
|
||||
|
||||
display_rows = min(nrows, max_rows)
|
||||
rows = []
|
||||
for r in range(display_rows):
|
||||
row_data = [_cell_value_xlrd(sheet.cell(r, c), workbook) for c in range(ncols)]
|
||||
rows.append(row_data)
|
||||
|
||||
lines.append(_rows_to_markdown_table(rows))
|
||||
|
||||
if nrows > max_rows:
|
||||
omitted = nrows - max_rows
|
||||
lines.append("")
|
||||
lines.append(f"*{omitted} rows omitted (max_rows_per_sheet={max_rows})*")
|
||||
|
||||
return "\n".join(lines)
|
||||
|
||||
|
||||
def _cell_value_openpyxl(cell) -> str:
|
||||
    """Convert an openpyxl cell to a string."""
|
||||
v = cell.value
|
||||
if v is None:
|
||||
return ""
|
||||
if isinstance(v, bool):
|
||||
return "TRUE" if v else "FALSE"
|
||||
if isinstance(v, float):
|
||||
if v == int(v):
|
||||
return str(int(v))
|
||||
return str(v)
|
||||
if isinstance(v, int):
|
||||
return str(v)
|
||||
    # Dates and datetimes
|
||||
import datetime
|
||||
if isinstance(v, datetime.datetime):
|
||||
if v.hour == 0 and v.minute == 0 and v.second == 0:
|
||||
return v.date().isoformat()
|
||||
return v.isoformat()
|
||||
if isinstance(v, datetime.date):
|
||||
return v.isoformat()
|
||||
return str(v)
|
||||
|
||||
|
||||
def _sheet_openpyxl(ws, max_rows: int) -> str:
|
||||
    """Convert an openpyxl worksheet to markdown."""
|
||||
all_rows = list(ws.iter_rows())
|
||||
nrows = len(all_rows)
|
||||
ncols = ws.max_column or 0
|
||||
|
||||
lines = []
|
||||
lines.append(f"## Sheet: {ws.title}")
|
||||
lines.append("")
|
||||
lines.append(f"**Dimensions:** {nrows} x {ncols}")
|
||||
lines.append("")
|
||||
|
||||
if nrows == 0 or ncols == 0:
|
||||
lines.append("*(empty sheet)*")
|
||||
return "\n".join(lines)
|
||||
|
||||
display_rows = min(nrows, max_rows)
|
||||
rows = []
|
||||
for row in all_rows[:display_rows]:
|
||||
row_data = [_cell_value_openpyxl(cell) for cell in row]
|
||||
rows.append(row_data)
|
||||
|
||||
lines.append(_rows_to_markdown_table(rows))
|
||||
|
||||
if nrows > max_rows:
|
||||
omitted = nrows - max_rows
|
||||
lines.append("")
|
||||
lines.append(f"*{omitted} rows omitted (max_rows_per_sheet={max_rows})*")
|
||||
|
||||
return "\n".join(lines)
|
||||
|
||||
|
||||
def excel_to_markdown(path: str, max_rows_per_sheet: int = 1000) -> str:
|
||||
    """Convert an Excel file (.xlsx, .xls, .xlsm) to markdown.

    Each sheet becomes an H2 section and its rows are rendered as a
    markdown table. If a sheet has more rows than max_rows_per_sheet,
    it is truncated and a note is appended.

    Args:
        path: Path to the Excel file (.xlsx, .xls, .xlsm).
        max_rows_per_sheet: Maximum number of rows to include per sheet (default 1000).

    Returns:
        Markdown string with every sheet in the file.

    Raises:
        ValueError: If the extension is not supported.
        FileNotFoundError: If the file does not exist.
        Exception: If the file cannot be read.
|
||||
"""
|
||||
p = Path(path)
|
||||
if not p.exists():
|
||||
raise FileNotFoundError(f"File not found: {path}")
|
||||
|
||||
ext = p.suffix.lower()
|
||||
|
||||
if ext == ".xls":
|
||||
import xlrd
|
||||
wb = xlrd.open_workbook(path)
|
||||
sections = []
|
||||
for sheet_name in wb.sheet_names():
|
||||
sheet = wb.sheet_by_name(sheet_name)
|
||||
sections.append(_sheet_xlrd(sheet, wb, max_rows_per_sheet))
|
||||
return "\n\n".join(sections)
|
||||
|
||||
elif ext in (".xlsx", ".xlsm"):
|
||||
import openpyxl
|
||||
wb = openpyxl.load_workbook(path, data_only=True)
|
||||
sections = []
|
||||
for ws in wb.worksheets:
|
||||
sections.append(_sheet_openpyxl(ws, max_rows_per_sheet))
|
||||
return "\n\n".join(sections)
|
||||
|
||||
else:
|
||||
raise ValueError(f"Unsupported extension '{ext}'. Use .xlsx, .xls, or .xlsm.")
|
||||
@@ -0,0 +1,142 @@
|
||||
"""Tests for excel_to_markdown."""
|
||||
|
||||
import datetime
|
||||
import os
|
||||
import sys
|
||||
import tempfile
|
||||
|
||||
import openpyxl
|
||||
import pytest
|
||||
|
||||
sys.path.insert(0, os.path.dirname(__file__))
|
||||
from excel_to_markdown import excel_to_markdown
|
||||
|
||||
|
||||
def _make_xlsx(sheets: dict, filename: str) -> str:
|
||||
"""Crea un archivo .xlsx temporal con los sheets dados."""
|
||||
wb = openpyxl.Workbook()
|
||||
first = True
|
||||
for sheet_name, rows in sheets.items():
|
||||
if first:
|
||||
ws = wb.active
|
||||
ws.title = sheet_name
|
||||
first = False
|
||||
else:
|
||||
ws = wb.create_sheet(sheet_name)
|
||||
for row in rows:
|
||||
ws.append(row)
|
||||
path = os.path.join(tempfile.mkdtemp(), filename)
|
||||
wb.save(path)
|
||||
return path
|
||||
|
||||
|
||||
def test_xlsx_multiples_sheets():
|
||||
"""xlsx con multiples sheets produce una seccion H2 por sheet."""
|
||||
path = _make_xlsx(
|
||||
{
|
||||
"Ventas": [["Producto", "Precio", "Cantidad"], ["Manzana", 1.5, 100], ["Pera", 2.0, 50]],
|
||||
"Resumen": [["Total", "Importe"], ["150", "225.0"]],
|
||||
},
|
||||
"multi.xlsx",
|
||||
)
|
||||
result = excel_to_markdown(path)
|
||||
|
||||
assert "## Sheet: Ventas" in result
|
||||
assert "## Sheet: Resumen" in result
|
||||
assert "Producto" in result
|
||||
assert "Manzana" in result
|
||||
assert "Total" in result
|
||||
|
||||
|
||||
def test_sheet_vacio():
|
||||
"""Sheet sin filas produce nota de sheet vacio."""
|
||||
path = _make_xlsx({"Vacio": []}, "empty.xlsx")
|
||||
result = excel_to_markdown(path)
|
||||
|
||||
assert "## Sheet: Vacio" in result
|
||||
assert "empty sheet" in result
|
||||
|
||||
|
||||
def test_sheet_truncado():
|
||||
"""Sheet con mas filas que max_rows_per_sheet se trunca con nota."""
|
||||
rows = [["col"]] + [[str(i)] for i in range(20)]
|
||||
path = _make_xlsx({"Data": rows}, "big.xlsx")
|
||||
result = excel_to_markdown(path, max_rows_per_sheet=5)
|
||||
|
||||
assert "omitted" in result
|
||||
    # 21 rows in total, 5 shown -> 16 omitted
|
||||
assert "16 rows omitted" in result
|
||||
|
||||
|
||||
def test_sheet_con_formulas_data_only():
|
||||
"""Archivo xlsx abierto con data_only=True muestra valores calculados (o None si no guardados)."""
|
||||
wb = openpyxl.Workbook()
|
||||
ws = wb.active
|
||||
ws.title = "Formulas"
|
||||
ws.append(["A", "B", "Suma"])
|
||||
ws.append([1, 2, "=A2+B2"])
|
||||
path = os.path.join(tempfile.mkdtemp(), "formulas.xlsx")
|
||||
wb.save(path)
|
||||
|
||||
result = excel_to_markdown(path)
|
||||
assert "## Sheet: Formulas" in result
|
||||
    # With data_only=True the formula cell may be None if no computed value was saved
|
||||
assert "Suma" in result
|
||||
|
||||
|
||||
def test_xls_legacy_con_fechas():
|
||||
"""xls legacy: la funcion debe aceptar .xls (via xlrd) y manejar fechas."""
|
||||
    # Create a .xls with xlwt if it is available; otherwise skip the test
|
||||
    pytest.importorskip("xlwt", reason="xlwt is not available to create the .xls test fixture")
|
||||
import xlwt
|
||||
|
||||
wb = xlwt.Workbook()
|
||||
ws = wb.add_sheet("Fechas")
|
||||
ws.write(0, 0, "Nombre")
|
||||
ws.write(0, 1, "Fecha")
|
||||
ws.write(1, 0, "Evento A")
|
||||
|
||||
date_format = xlwt.XFStyle()
|
||||
date_format.num_format_str = "YYYY-MM-DD"
|
||||
ws.write(1, 1, datetime.date(2024, 1, 15).toordinal() - 693594, date_format)
|
||||
|
||||
path = os.path.join(tempfile.mkdtemp(), "legacy.xls")
|
||||
wb.save(path)
|
||||
|
||||
result = excel_to_markdown(path)
|
||||
assert "## Sheet: Fechas" in result
|
||||
assert "Evento A" in result
|
||||
|
||||
|
||||
def test_extension_no_soportada():
|
||||
"""Extension no soportada lanza ValueError."""
|
||||
path = os.path.join(tempfile.mkdtemp(), "data.csv")
|
||||
with open(path, "w") as f:
|
||||
f.write("a,b\n1,2\n")
|
||||
|
||||
with pytest.raises(ValueError, match="Unsupported extension"):
|
||||
excel_to_markdown(path)
|
||||
|
||||
|
||||
def test_archivo_no_existe():
|
||||
"""Archivo inexistente lanza FileNotFoundError."""
|
||||
with pytest.raises(FileNotFoundError):
|
||||
excel_to_markdown("/tmp/no_existe_para_nada.xlsx")
|
||||
|
||||
|
||||
def test_dimensiones_en_metadata():
|
||||
"""El markdown incluye dimensiones del sheet."""
|
||||
path = _make_xlsx({"Hoja1": [["A", "B"], [1, 2], [3, 4]]}, "dims.xlsx")
|
||||
result = excel_to_markdown(path)
|
||||
assert "**Dimensions:**" in result
|
||||
assert "3 x 2" in result
|
||||
|
||||
|
||||
def test_tabla_markdown_formato():
|
||||
"""La tabla tiene formato correcto con separador de header."""
|
||||
path = _make_xlsx({"Datos": [["Col1", "Col2"], ["val1", "val2"]]}, "fmt.xlsx")
|
||||
result = excel_to_markdown(path)
|
||||
    # There must be a separator line made of ---
|
||||
assert "| --- |" in result or "| --- | --- |" in result
|
||||
assert "Col1" in result
|
||||
assert "val1" in result
|
||||
@@ -0,0 +1,43 @@
|
||||
---
|
||||
name: extract_frontmatter
|
||||
kind: function
|
||||
lang: py
|
||||
domain: core
|
||||
version: "1.0.0"
|
||||
purity: pure
|
||||
signature: "def extract_frontmatter(content: str) -> tuple[str, dict | None]"
|
||||
description: "Extracts YAML frontmatter (delimited by ---) from the start of a markdown string. Returns the content without the frontmatter and the parsed dict (or None if there is none)."
|
||||
tags: [markdown, frontmatter, yaml, parsing]
|
||||
uses_functions: []
|
||||
uses_types: []
|
||||
returns: []
|
||||
returns_optional: false
|
||||
error_type: ""
|
||||
imports: [re, yaml]
|
||||
tested: true
|
||||
tests:
|
||||
- "contenido con frontmatter"
|
||||
- "sin frontmatter retorna None"
|
||||
- "frontmatter vacio"
|
||||
- "frontmatter con listas"
|
||||
test_file_path: "python/functions/core/parse_markdown_test.py"
|
||||
file_path: "python/functions/core/core.py"
|
||||
---
|
||||
|
||||
## Example
|
||||
|
||||
```python
|
||||
content = "---\ntitle: Hello\nauthor: Alice\n---\n# Body\n"
|
||||
remaining, data = extract_frontmatter(content)
|
||||
# remaining = "# Body\n"
|
||||
# data = {"title": "Hello", "author": "Alice"}
|
||||
|
||||
no_fm = "# Just markdown\n\nNo frontmatter."
|
||||
remaining, data = extract_frontmatter(no_fm)
|
||||
# remaining == no_fm
|
||||
# data is None
|
||||
```
|
||||
|
||||
## Notes
|
||||
|
||||
Pure function. Uses `yaml.safe_load` when PyYAML is available; otherwise it falls back to a simple `key: value` parser. Frontmatter is recognized only at the very start of the string (position 0). The block must be delimited by an opening `---\n` and a closing `\n---\n`.
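
The delimiter rule and the PyYAML fallback described in these notes could look roughly like the sketch below. It is an assumption-laden illustration, not the code in `core.py`: the names `split_frontmatter_sketch` and `_parse_key_values` are hypothetical, and the fallback parser only understands flat `key: value` lines.

```python
import re

# Frontmatter must start at position 0: opening "---\n", closing "\n---\n".
_FRONTMATTER_RE = re.compile(r"\A---\n(.*?)\n---\n", re.DOTALL)


def _parse_key_values(raw: str) -> dict:
    """Fallback parser: flat 'key: value' lines only, values kept as strings."""
    data = {}
    for line in raw.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip():
            data[key.strip()] = value.strip()
    return data


def split_frontmatter_sketch(content: str):
    """Return (content without frontmatter, parsed dict or None)."""
    match = _FRONTMATTER_RE.match(content)
    if match is None:
        return content, None
    raw = match.group(1)
    try:
        import yaml  # PyYAML is optional in this sketch
    except ImportError:
        data = _parse_key_values(raw)
    else:
        data = yaml.safe_load(raw) or {}
    return content[match.end():], data
```

Anchoring the pattern with `\A` is what enforces the "position 0" rule: a `---` block further down the document is treated as ordinary markdown.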
|
||||
@@ -0,0 +1,36 @@
|
||||
---
|
||||
name: extract_json_from_llm
|
||||
kind: function
|
||||
lang: py
|
||||
domain: core
|
||||
version: "1.0.0"
|
||||
purity: pure
|
||||
signature: "def extract_json_from_llm(content: str) -> dict"
|
||||
description: "Extracts and parses JSON from LLM responses. Handles ```json blocks, trailing commas, None->null."
|
||||
tags: [json, llm, parsing, extraction]
|
||||
uses_functions: []
|
||||
uses_types: []
|
||||
returns: []
|
||||
returns_optional: false
|
||||
error_type: ""
|
||||
imports: [json]
|
||||
tested: false
|
||||
tests: []
|
||||
test_file_path: ""
|
||||
file_path: "python/functions/core/core.py"
|
||||
source_repo: "https://github.com/VectifyAI/PageIndex"
|
||||
source_license: "MIT"
|
||||
source_file: "pageindex/utils.py"
|
||||
---
|
||||
|
||||
## Example
|
||||
|
||||
```python
|
||||
raw = '```json\n{"key": "value", "items": [1, 2, 3,]}\n```'
|
||||
result = extract_json_from_llm(raw)
|
||||
# {"key": "value", "items": [1, 2, 3]}
|
||||
```
|
||||
|
||||
## Notes
|
||||
|
||||
Pure function. Handles common LLM output errors: trailing commas, `None` instead of `null`, extra whitespace. Returns an empty dict if the JSON cannot be recovered.
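
The clean-up steps listed in these notes can be sketched as follows. This is not the PageIndex-derived implementation: the name `parse_llm_json_sketch`, the regular expressions, and the order of the steps are assumptions chosen for illustration. Note that the `None` replacement is deliberately naive and would also rewrite the word `None` inside string values.

```python
import json
import re

FENCE = "`" * 3  # triple backtick, built this way to keep the example fence-safe


def parse_llm_json_sketch(content: str) -> dict:
    """Hypothetical sketch: unwrap a code fence, repair, parse, or return {}."""
    text = content.strip()
    fenced = re.search(FENCE + r"(?:json)?\s*(.*?)" + FENCE, text, re.DOTALL)
    if fenced:
        text = fenced.group(1).strip()
    # Python-style None -> JSON null (naive: also touches string contents).
    text = re.sub(r"\bNone\b", "null", text)
    # Drop trailing commas before a closing brace or bracket.
    text = re.sub(r",\s*([}\]])", r"\1", text)
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError:
        return {}
    return parsed if isinstance(parsed, dict) else {}
```

Returning `{}` instead of raising keeps callers simple, at the cost of making "empty object" and "unparseable response" indistinguishable.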
|
||||
@@ -0,0 +1,36 @@
|
||||
---
|
||||
name: extract_markdown_headers
|
||||
kind: function
|
||||
lang: py
|
||||
domain: core
|
||||
version: "1.0.0"
|
||||
purity: pure
|
||||
signature: "def extract_markdown_headers(markdown_content: str) -> tuple[list[dict], list[str]]"
|
||||
description: "Extracts all headers (h1-h6) from markdown with their level and line number, ignoring code blocks."
|
||||
tags: [markdown, parsing, headers, extraction]
|
||||
uses_functions: []
|
||||
uses_types: []
|
||||
returns: []
|
||||
returns_optional: false
|
||||
error_type: ""
|
||||
imports: [re]
|
||||
tested: false
|
||||
tests: []
|
||||
test_file_path: ""
|
||||
file_path: "python/functions/core/core.py"
|
||||
source_repo: "https://github.com/VectifyAI/PageIndex"
|
||||
source_license: "MIT"
|
||||
source_file: "pageindex/page_index_md.py"
|
||||
---
|
||||
|
||||
## Example
|
||||
|
||||
```python
|
||||
md = "# Title\n\nSome text\n\n## Section\n\n```\n# not a header\n```"
|
||||
headers, lines = extract_markdown_headers(md)
|
||||
# headers = [{"title": "Title", "level": 1, "line_num": 1}, {"title": "Section", "level": 2, "line_num": 5}]
|
||||
```
|
||||
|
||||
## Notes
|
||||
|
||||
Pure function. Detects and skips code blocks (triple backtick). Returns a tuple: (list of headers, list of original lines).
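
A minimal sketch of the scan described in these notes is shown below. The name `scan_headers_sketch` and the header regex are assumptions; the output shape (`title`, `level`, 1-indexed `line_num`) follows the example above, but the real function may differ in edge cases such as indented fences.

```python
import re

FENCE = "`" * 3  # triple backtick marks the start or end of a code block
_HEADER_RE = re.compile(r"^(#{1,6})\s+(.+?)\s*$")


def scan_headers_sketch(markdown_content: str):
    """Return (headers, lines); headers inside fenced code blocks are skipped."""
    lines = markdown_content.split("\n")
    headers = []
    in_code_block = False
    for line_num, line in enumerate(lines, start=1):
        if line.strip().startswith(FENCE):
            in_code_block = not in_code_block  # toggle on every fence line
            continue
        if in_code_block:
            continue
        match = _HEADER_RE.match(line)
        if match:
            headers.append({
                "title": match.group(2),
                "level": len(match.group(1)),
                "line_num": line_num,
            })
    return headers, lines
```

Returning the split lines alongside the headers lets a caller slice section bodies by `line_num` without splitting the document a second time.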
|
||||
@@ -0,0 +1,37 @@
|
||||
---
|
||||
name: extract_pdf_bookmarks
|
||||
kind: function
|
||||
lang: py
|
||||
domain: core
|
||||
version: "1.0.0"
|
||||
purity: impure
|
||||
signature: "def extract_pdf_bookmarks(pdf) -> list[dict]"
|
||||
description: "Extracts the bookmark/outline structure from a PDF opened with pdfplumber. Returns a list of dicts with level (1-6), title and page_num."
|
||||
tags: [pdf, bookmarks, outlines, parsing, pdfplumber]
|
||||
uses_functions: []
|
||||
uses_types: []
|
||||
returns: []
|
||||
returns_optional: false
|
||||
error_type: "error_go_core"
|
||||
imports: [pdfplumber]
|
||||
tested: false
|
||||
tests: []
|
||||
test_file_path: ""
|
||||
file_path: "python/functions/core/extract_pdf_bookmarks.py"
|
||||
---
|
||||
|
||||
## Example
|
||||
|
||||
```python
|
||||
import pdfplumber
|
||||
from extract_pdf_bookmarks import extract_pdf_bookmarks
|
||||
|
||||
with pdfplumber.open("document.pdf") as pdf:
|
||||
bookmarks = extract_pdf_bookmarks(pdf)
|
||||
for bm in bookmarks:
|
||||
print(f"{'#' * bm['level']} {bm['title']} (page {bm['page_num']})")
|
||||
```
|
||||
|
||||
## Notes
|
||||
|
||||
Takes an already-open `pdfplumber.PDF` object (not a path). Builds an internal `objid -> page_number` mapping from `pdf.pages` to resolve outline destinations. The level is clamped to [1, 6] for markdown compatibility. Returns an empty list if the PDF has no outlines or if `get_outlines()` fails. Marked impure because it reads the internal state of an already-open PDF object.
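
Because the returned levels are already clamped to markdown heading depth, the result can be rendered as an outline without touching the PDF again. The helper below is a hypothetical sketch (`render_outline_sketch` is not part of the registry); it only assumes the documented dict shape with `level`, `title` and an optional `page_num`.

```python
def render_outline_sketch(bookmarks: list[dict]) -> str:
    """Render bookmark dicts as markdown headings; page_num may be None."""
    lines = []
    for bm in bookmarks:
        # Clamp again defensively so hand-built input cannot exceed h6.
        level = max(1, min(6, int(bm["level"])))
        page = bm.get("page_num")
        suffix = f" (page {page})" if page is not None else ""
        lines.append(f"{'#' * level} {bm['title']}{suffix}")
    return "\n".join(lines)
```

Entries whose destination could not be resolved keep their heading but omit the page suffix, rather than printing `None`.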
|
||||
@@ -0,0 +1,63 @@
|
||||
"""Extract the bookmark/outline structure from a PDF opened with pdfplumber."""
|
||||
|
||||
import pdfplumber
|
||||
|
||||
|
||||
def extract_pdf_bookmarks(pdf: pdfplumber.PDF) -> list[dict]:
|
||||
"""Extract bookmarks/outlines from an open pdfplumber PDF object.
|
||||
|
||||
Args:
|
||||
pdf: An open pdfplumber.PDF object.
|
||||
|
||||
Returns:
|
||||
list[dict]: List of {"level": int, "title": str, "page_num": int | None}.
|
||||
Level is clamped to [1, 6]. Returns empty list if no outlines.
|
||||
"""
|
||||
try:
|
||||
outlines = pdf.doc.get_outlines()
|
||||
except Exception:
|
||||
return []
|
||||
|
||||
if not outlines:
|
||||
return []
|
||||
|
||||
# Build objid -> page_number mapping
|
||||
objid_to_page: dict[int, int] = {}
|
||||
for i, page in enumerate(pdf.pages):
|
||||
try:
|
||||
obj = page.page_obj
|
||||
            objid_to_page[obj.pageid] = i + 1  # 1-indexed; pdfminer's PDFPage stores its object id as pageid
|
||||
except Exception:
|
||||
pass
|
||||
|
||||
bookmarks = []
|
||||
for item in outlines:
|
||||
try:
|
||||
level = item[0] # integer level from get_outlines
|
||||
title = item[1]
|
||||
dest = item[2] # destination: page object or list
|
||||
|
||||
# Clamp level to [1, 6]
|
||||
level = max(1, min(6, level))
|
||||
|
||||
# Resolve destination to page number
|
||||
page_num = None
|
||||
if dest is not None:
|
||||
if isinstance(dest, list) and len(dest) > 0:
|
||||
# dest[0] is the page object
|
||||
page_obj = dest[0]
|
||||
try:
|
||||
page_num = objid_to_page.get(page_obj.objid)
|
||||
except Exception:
|
||||
pass
|
||||
else:
|
||||
try:
|
||||
page_num = objid_to_page.get(dest.objid)
|
||||
except Exception:
|
||||
pass
|
||||
|
||||
bookmarks.append({"level": level, "title": str(title), "page_num": page_num})
|
||||
except Exception:
|
||||
continue
|
||||
|
||||
return bookmarks
|
||||
@@ -0,0 +1,35 @@
|
||||
---
|
||||
name: extract_pdf_text
|
||||
kind: function
|
||||
lang: py
|
||||
domain: core
|
||||
version: "1.0.0"
|
||||
purity: impure
|
||||
signature: "def extract_pdf_text(pdf_path: str) -> str"
|
||||
description: "Extracts all text from a PDF by concatenating every page. Uses PyPDF2."
|
||||
tags: [pdf, text, extraction, parsing]
|
||||
uses_functions: []
|
||||
uses_types: []
|
||||
returns: []
|
||||
returns_optional: false
|
||||
error_type: "error_go_core"
|
||||
imports: [PyPDF2]
|
||||
tested: false
|
||||
tests: []
|
||||
test_file_path: ""
|
||||
file_path: "python/functions/core/extract_pdf_text.py"
|
||||
source_repo: "https://github.com/VectifyAI/PageIndex"
|
||||
source_license: "MIT"
|
||||
source_file: "pageindex/utils.py"
|
||||
---
|
||||
|
||||
## Example
|
||||
|
||||
```python
|
||||
text = extract_pdf_text("/path/to/document.pdf")
|
||||
print(len(text)) # total characters
|
||||
```
|
||||
|
||||
## Notes
|
||||
|
||||
Requires `pip install PyPDF2`. Basic text extraction: it does not handle OCR or scanned PDFs. For complex PDFs, consider PyMuPDF.
|
||||
@@ -0,0 +1,19 @@
|
||||
"""Extract all text from a PDF file using PyPDF2."""
|
||||
|
||||
import PyPDF2
|
||||
|
||||
|
||||
def extract_pdf_text(pdf_path: str) -> str:
|
||||
"""Extract all text from a PDF file.
|
||||
|
||||
Args:
|
||||
pdf_path: Path to the PDF file.
|
||||
|
||||
Returns:
|
||||
str: Concatenated text from all pages.
|
||||
"""
|
||||
pdf_reader = PyPDF2.PdfReader(pdf_path)
|
||||
text = ""
|
||||
for page in pdf_reader.pages:
|
||||
text += page.extract_text() or ""
|
||||
return text
|
||||
@@ -0,0 +1,51 @@
|
||||
---
|
||||
name: extract_text_from_file
|
||||
kind: function
|
||||
lang: py
|
||||
domain: core
|
||||
version: "1.0.0"
|
||||
purity: impure
|
||||
signature: "def extract_text_from_file(file_path: str) -> str"
|
||||
description: "Extracts plain text from a file. Supports PDF (PyMuPDF), Markdown and TXT with automatic encoding detection."
|
||||
tags: [text, pdf, markdown, txt, encoding, extraction, file, io]
|
||||
uses_functions: []
|
||||
uses_types: []
|
||||
returns: []
|
||||
returns_optional: false
|
||||
error_type: "error_go_core"
|
||||
imports: ["os", "fitz (PyMuPDF)", "charset_normalizer", "chardet"]
|
||||
tested: true
|
||||
tests:
|
||||
- "PDF con texto extrae contenido correctamente"
|
||||
- "archivo MD UTF-8 retorna contenido"
|
||||
- "archivo TXT latin-1 detecta encoding"
|
||||
- "archivo inexistente lanza FileNotFoundError"
|
||||
- "extension no soportada lanza ValueError"
|
||||
test_file_path: "python/functions/core/extract_text_from_file_test.py"
|
||||
file_path: "python/functions/core/extract_text_from_file.py"
|
||||
---
|
||||
|
||||
## Example
|
||||
|
||||
```python
|
||||
# PDF
|
||||
text = extract_text_from_file("report.pdf")
|
||||
|
||||
# Markdown
|
||||
text = extract_text_from_file("README.md")
|
||||
|
||||
# TXT with unknown encoding
|
||||
text = extract_text_from_file("notes.txt")
|
||||
```
|
||||
|
||||
## Notes
|
||||
|
||||
For PDFs it uses PyMuPDF (`fitz`), which produces better text than PyPDF2, especially for PDFs with columns or complex layouts. Pages are joined with `\n\n`.
|
||||
|
||||
Encoding detection for text files follows this priority order:
1. Try UTF-8 directly
2. `charset_normalizer.from_bytes().best().encoding`
3. `chardet.detect(data)["encoding"]`
4. UTF-8 with `errors='replace'` as a last resort
|
||||
|
||||
Diferencia con `extract_pdf_text_py_core`: esa funcion usa PyPDF2 y solo soporta PDF. Esta funcion usa PyMuPDF y soporta ademas MD y TXT con deteccion de encoding.
|
||||
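
The priority order described in the notes can be sketched as a single decode helper. This is only an illustration, not the registry implementation: `decode_with_fallback` is a hypothetical name, and unlike the function in this PR it tries every detector guess before falling back. Both detectors are treated as optional and skipped when not installed.

```python
from __future__ import annotations


def decode_with_fallback(data: bytes) -> tuple[str, str]:
    """Decode bytes and return (text, encoding_used).

    Hypothetical helper that mirrors the priority order in the notes.
    """
    # 1. Strict UTF-8 first: cheap, and unambiguous when it succeeds.
    try:
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        pass

    # 2. and 3. Collect guesses from the optional detectors.
    candidates: list[str] = []
    try:
        from charset_normalizer import from_bytes

        best = from_bytes(data).best()
        if best is not None and best.encoding:
            candidates.append(best.encoding)
    except ImportError:
        pass
    try:
        import chardet

        guess = chardet.detect(data).get("encoding")
        if guess:
            candidates.append(guess)
    except ImportError:
        pass

    # Try each guess; a wrong or unknown codec name just moves on.
    for encoding in candidates:
        try:
            return data.decode(encoding), encoding
        except (UnicodeDecodeError, LookupError):
            continue

    # 4. Last resort: never raise, substitute U+FFFD for undecodable bytes.
    return data.decode("utf-8", errors="replace"), "utf-8 (replace)"
```

Returning the encoding alongside the text is a design choice of this sketch; it makes it easier to log which strategy actually decoded a given file.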
@@ -0,0 +1,92 @@
"""Extract plain text from PDF, Markdown, or TXT files."""


SUPPORTED_EXTENSIONS = {".pdf", ".md", ".markdown", ".txt"}


def _detect_encoding(data: bytes) -> str:
    """Detect encoding of raw bytes using multiple fallback strategies."""
    # Strategy 1: UTF-8
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        pass

    # Strategy 2: charset_normalizer
    try:
        from charset_normalizer import from_bytes

        result = from_bytes(data).best()
        if result is not None and result.encoding:
            return result.encoding
    except ImportError:
        pass

    # Strategy 3: chardet
    try:
        import chardet

        detected = chardet.detect(data)
        if detected and detected.get("encoding"):
            return detected["encoding"]
    except ImportError:
        pass

    # Last resort: report UTF-8; the caller decodes with errors="replace"
    return "utf-8"


def extract_text_from_file(file_path: str) -> str:
    """Extract plain text from a file. Supports PDF, Markdown and TXT.

    For PDF files uses PyMuPDF (fitz) to extract text from each page,
    joining them with double newlines. For text-based files (.md, .markdown,
    .txt) reads the file with automatic encoding detection.

    Args:
        file_path: Absolute or relative path to the file.

    Returns:
        str: Extracted plain text content.

    Raises:
        FileNotFoundError: If the file does not exist.
        ValueError: If the file extension is not supported.
        ImportError: If PyMuPDF is not installed and a PDF is provided.
    """
    import os

    if not os.path.exists(file_path):
        raise FileNotFoundError(f"File not found: {file_path}")

    _, ext = os.path.splitext(file_path.lower())

    if ext == ".pdf":
        try:
            import fitz  # PyMuPDF
        except ImportError as e:
            raise ImportError(
                "PyMuPDF is required for PDF extraction. "
                "Install it with: pip install PyMuPDF"
            ) from e

        # Use the document as a context manager so the file handle is closed
        with fitz.open(file_path) as doc:
            pages = [page.get_text() for page in doc]
        return "\n\n".join(pages)

    elif ext in {".md", ".markdown", ".txt"}:
        with open(file_path, "rb") as f:
            raw = f.read()

        encoding = _detect_encoding(raw)
        try:
            return raw.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            return raw.decode("utf-8", errors="replace")

    else:
        raise ValueError(
            f"Unsupported file extension: '{ext}'. "
            f"Supported: {', '.join(sorted(SUPPORTED_EXTENSIONS))}"
        )
@@ -0,0 +1,83 @@
"""Tests for extract_text_from_file."""

import os
import sys
import tempfile

import pytest

sys.path.insert(0, os.path.dirname(__file__))
from extract_text_from_file import extract_text_from_file


def test_pdf_con_texto_extrae_contenido_correctamente():
    """PDF con texto extrae contenido correctamente."""
    try:
        import fitz
    except ImportError:
        pytest.skip("PyMuPDF not installed")

    # Create a minimal in-memory PDF using PyMuPDF and write it to a temp file
    doc = fitz.open()
    page = doc.new_page()
    page.insert_text((72, 72), "Hello from PDF")
    with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as f:
        tmp_path = f.name
    try:
        doc.save(tmp_path)
        doc.close()
        result = extract_text_from_file(tmp_path)
        assert "Hello from PDF" in result
    finally:
        os.unlink(tmp_path)


def test_archivo_md_utf8_retorna_contenido():
    """archivo MD UTF-8 retorna contenido."""
    content = "# Titulo\n\nParrafo con texto UTF-8: cafe, senor, japon.\n"
    with tempfile.NamedTemporaryFile(
        suffix=".md", mode="wb", delete=False
    ) as f:
        f.write(content.encode("utf-8"))
        tmp_path = f.name
    try:
        result = extract_text_from_file(tmp_path)
        assert "# Titulo" in result
        assert "cafe" in result
    finally:
        os.unlink(tmp_path)


def test_archivo_txt_latin1_detecta_encoding():
    """archivo TXT latin-1 detecta encoding."""
    content = "Texto en latin-1: cafe, hotel, naive\n"
    with tempfile.NamedTemporaryFile(
        suffix=".txt", mode="wb", delete=False
    ) as f:
        f.write(content.encode("latin-1"))
        tmp_path = f.name
    try:
        result = extract_text_from_file(tmp_path)
        # The word "cafe" or similar should appear in the decoded result
        assert len(result) > 0
        assert "cafe" in result or "caf" in result
    finally:
        os.unlink(tmp_path)


def test_archivo_inexistente_lanza_filenotfounderror():
    """archivo inexistente lanza FileNotFoundError."""
    with pytest.raises(FileNotFoundError):
        extract_text_from_file("/tmp/no_existe_este_archivo_12345.txt")


def test_extension_no_soportada_lanza_valueerror():
    """extension no soportada lanza ValueError."""
    with tempfile.NamedTemporaryFile(suffix=".docx", delete=False) as f:
        f.write(b"fake docx content")
        tmp_path = f.name
    try:
        with pytest.raises(ValueError, match="Unsupported file extension"):
            extract_text_from_file(tmp_path)
    finally:
        os.unlink(tmp_path)
@@ -0,0 +1,50 @@
---
name: fetch_and_parse_url
kind: function
lang: py
domain: core
version: "1.0.0"
purity: impure
signature: "fetch_and_parse_url(url: str, timeout: float = 30.0) -> str"
description: "Downloads a web page and converts it to markdown. Combines detect_url_type + HTML fetch + html_to_markdown in a single operation."
tags: [http, fetch, html, markdown, parse, url, scraping]
uses_functions:
  - detect_url_type_py_core
  - html_to_markdown_py_core
uses_types: []
returns: []
returns_optional: false
error_type: "error_go_core"
imports: ["httpx"]
tested: false
tests: []
test_file_path: ""
file_path: "python/functions/core/fetch_and_parse_url.py"
---

## Example

```python
from core.fetch_and_parse_url import fetch_and_parse_url

# Download and convert a web page
md = fetch_and_parse_url("https://example.com")
print(md)

# With a custom timeout
md = fetch_and_parse_url("https://en.wikipedia.org/wiki/Python", timeout=15.0)
```

## Notes

Algorithm:
1. `detect_url_type(url)` determines the content type (by pattern, extension, or HEAD request).
2. If it is `code_repository` → raises Exception (requires git clone, not HTTP fetch).
3. If it is `pdf` → raises Exception (requires pdfminer/pypdf, not included).
4. `httpx.get(url)` downloads the content with `follow_redirects`.
5. If it is `webpage` or the Content-Type is HTML → `html_to_markdown(raw_html)`.
6. If it is `markdown`, `text`, or code → returns the text as-is.

Raises `Exception` with a descriptive message on any network failure or unsupported type.

Impure function: performs I/O (HTTP requests).
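
The dispatch in steps 2, 3, 5 and 6 can be isolated from the network call, which makes it easy to test without HTTP. The sketch below is an assumption-laden illustration, not the registry code: `convert_fetched_content` is a hypothetical helper, the converter is passed in as a parameter instead of importing `html_to_markdown`, and it raises `ValueError` where the function in this PR raises a plain `Exception` before fetching.

```python
from typing import Callable


def convert_fetched_content(
    url_type: str,
    content_type: str,
    raw_text: str,
    to_markdown: Callable[[str], str],
) -> str:
    """Hypothetical helper: decide how a fetched body becomes markdown."""
    # Steps 2-3: types that a plain HTTP fetch cannot handle are rejected.
    if url_type in {"code_repository", "pdf"}:
        raise ValueError(f"unsupported url_type: {url_type!r}")
    # Step 5: HTML is converted, whether flagged by URL type or by header.
    if url_type == "webpage" or "text/html" in content_type.lower():
        return to_markdown(raw_text)
    # Step 6: markdown, text, and code pass through unchanged.
    return raw_text


# Usage with a stand-in converter (not the real html_to_markdown).
def fake_converter(html: str) -> str:
    return html.replace("<h1>", "# ").replace("</h1>", "")


print(convert_fetched_content("webpage", "text/html", "<h1>Hi</h1>", fake_converter))
```

Injecting the converter keeps the sketch self-contained; in the registry function the same branch calls `html_to_markdown` directly.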
@@ -0,0 +1,64 @@
"""Download a web page and convert it to markdown."""

from __future__ import annotations


def fetch_and_parse_url(url: str, timeout: float = 30.0) -> str:
    """Download a web page and convert it to markdown.

    Detects the URL type with detect_url_type, downloads the content with
    httpx and converts it to the appropriate format:
    - webpage: fetch HTML → html_to_markdown
    - markdown: returns the text as-is
    - text/code: returns the text as-is
    - pdf: raises Exception (requires an external dependency)
    - code_repository: raises Exception (requires cloning the repo)

    Args:
        url: URL to download and parse.
        timeout: Timeout in seconds for HTTP requests.

    Returns:
        Content of the URL in markdown format.

    Raises:
        Exception: If the download fails (timeout, DNS, HTTP error) or the
            URL type is not supported.
    """
    import httpx

    from detect_url_type import detect_url_type
    from html_to_markdown import html_to_markdown

    # Detect the URL type (may issue a HEAD request)
    url_type, _meta = detect_url_type(url, timeout=timeout)

    if url_type == "code_repository":
        raise Exception(
            f"fetch_and_parse_url: code_repository URLs require git clone, not supported. url={url!r}"
        )

    if url_type == "pdf":
        raise Exception(
            f"fetch_and_parse_url: PDF parsing requires external dependency (pdfminer/pypdf). url={url!r}"
        )

    # Fetch content via GET
    try:
        response = httpx.get(url, timeout=timeout, follow_redirects=True)
        response.raise_for_status()
    except httpx.HTTPStatusError as exc:
        raise Exception(
            f"fetch_and_parse_url: HTTP {exc.response.status_code} for {url!r}"
        ) from exc
    except Exception as exc:
        raise Exception(f"fetch_and_parse_url: request failed for {url!r}: {exc}") from exc

    content_type = response.headers.get("content-type", "").lower()
    raw_text = response.text

    if url_type == "webpage" or "text/html" in content_type:
        return html_to_markdown(raw_text)

    # markdown, text, or code files: return as-is
    return raw_text