fn_registry/python/functions/datascience/translate_es_to_en.md at 3f6b652f3f05b2f0cb4e70142fc69b9f53ccd180

egutierrez 47fac22230 chore: auto-commit (799 files)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-14 00:28:20 +02:00

3.5 KiB


name: translate_es_to_en
kind: function
lang:
domain: datascience
version: 1.0.0
purity: impure
signature: def translate_es_to_en(text: str, tokenizer: Any, model: Any, max_length: int = 512, num_beams: int = 4) -> str
description: Translates Spanish text to English sentence by sentence using MarianMT. Splits on sentence boundaries, translates each sentence independently, and joins the results with a space. Preserves proper names better than passing the whole paragraph at once.
tags: marianmt, translation, es-en, nlp, datascience, python, pendiente-usar
uses_functions: marianmt_es_en_load_model_py_datascience
uses_types:
returns:
returns_optional: false
error_type: error_go_core
imports:
params:
  text: Spanish text to translate; may be a single sentence or a multi-sentence paragraph
  tokenizer: MarianMT tokenizer loaded with marianmt_es_en_load_model
  model: MarianMT model loaded with marianmt_es_en_load_model
  max_length: maximum length in tokens per sentence, used for both tokenization and generation (default 512)
  num_beams: number of beams for beam search; higher means better quality but slower (default 4)
output: Text translated into English. Sentences are joined with a single space. Empty string if the input is empty.
tested: true
tests:
  - empty text returns an empty string
  - a Spanish sentence produces non-empty output
test_file_path: python/functions/datascience/tests/test_translate_es_to_en.py
file_path: python/functions/datascience/translate_es_to_en.py
notes: impure: calls model.generate, which depends on the state of the model (weights, device). Sentence splitting uses a regex lookaround on [.!?] followed by whitespace. This preserves proper names containing periods (S.A., U.S.A.) better than NLTK sent_tokenize because no abbreviation rules are applied: the text is split only where whitespace follows terminal punctuation. Note that an abbreviation followed by a space in the middle of a sentence is still treated as a boundary. Useful as a preprocessor for rebel_load_model (English-only, Apache 2.0): ES text -> translate_es_to_en -> EN text -> REBEL -> triplets. Direct alternative: mrebel_load_model (multilingual, CC BY-NC-SA).
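The splitting rule described in the notes can be illustrated with a minimal sketch. The exact pattern used in translate_es_to_en.py is not shown in this entry, so the regex and the helper name `split_sentences` below are assumptions, not the repository's code.

```python
import re
from typing import List

# Assumed pattern: whitespace that is immediately preceded by '.', '!' or '?'.
_SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text: str) -> List[str]:
    """Hypothetical helper: split after terminal punctuation followed by whitespace."""
    return [s for s in _SENTENCE_BOUNDARY.split(text.strip()) if s]


# Periods with no following whitespace (as inside "U.S.A.") never split,
# but an abbreviation followed by a space does.
print(split_sentences("Pablo Isla es presidente de Inditex. La empresa tiene sede en Arteixo."))
print(split_sentences("La sede de Inditex S.A. esta en Arteixo."))
```

The second call shows the limit of the approach: with no abbreviation rules, "S.A." followed by a space is split like any other sentence end.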

Example

from python.functions.datascience.marianmt_es_en_load_model import marianmt_es_en_load_model
from python.functions.datascience.translate_es_to_en import translate_es_to_en

tokenizer, model = marianmt_es_en_load_model()

text = "Pablo Isla es presidente de Inditex. La empresa tiene sede en Arteixo."
translated = translate_es_to_en(text, tokenizer, model)
# "Pablo Isla is president of Inditex. The company is headquartered in Arteixo."

Why sentence by sentence

Passing a whole paragraph to MarianMT can degrade the translation of proper names, because the model spreads its attention over the full context. Splitting by sentence gives:

- Shorter context, so less confusion around proper names.
- Less chance of truncation (512 tokens is enough for ordinary sentences).
- A more predictable pipeline for debugging (each sentence can be inspected individually).
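To make the sentence-by-sentence loop concrete, here is a rough sketch of how such a function could be written. It is not the contents of translate_es_to_en.py: the function name `translate_sentence_by_sentence` and the split regex are assumptions, and the tokenizer and model are only expected to follow the usual Hugging Face call, `generate` and `decode` conventions.

```python
import re
from typing import Any


def translate_sentence_by_sentence(
    text: str,
    tokenizer: Any,
    model: Any,
    max_length: int = 512,
    num_beams: int = 4,
) -> str:
    """Hypothetical sketch: translate each sentence independently and join with a space."""
    if not text.strip():
        return ""  # empty input yields an empty string

    # Assumed boundary rule: whitespace preceded by terminal punctuation.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

    translated = []
    for sentence in sentences:
        # Tokenize one sentence at a time, truncating at max_length tokens.
        inputs = tokenizer(
            sentence, return_tensors="pt", truncation=True, max_length=max_length
        )
        # Beam search over the single sentence.
        output_ids = model.generate(**inputs, max_length=max_length, num_beams=num_beams)
        translated.append(tokenizer.decode(output_ids[0], skip_special_tokens=True))

    return " ".join(translated)
```

Because each sentence goes through its own `generate` call, an intermediate list such as `translated` can be logged or inspected when a single sentence comes out wrong.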

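Pipeline pattern: ES -> EN -> REBEL

# Import paths follow the convention of the example above; the rebel_load_model
# path is assumed, and parse_rebel_output must be provided by the caller.
from python.functions.datascience.marianmt_es_en_load_model import marianmt_es_en_load_model
from python.functions.datascience.rebel_load_model import rebel_load_model
from python.functions.datascience.translate_es_to_en import translate_es_to_en

es_text = "Pablo Isla es presidente de Inditex. La empresa tiene sede en Arteixo."

# Step 1: load the models
mt_tok, mt_model = marianmt_es_en_load_model()
rebel_tok, rebel_model = rebel_load_model()

# Step 2: translate Spanish to English
en_text = translate_es_to_en(es_text, mt_tok, mt_model)

# Step 3: extract relations (special tokens are kept because the parser needs them)
inputs = rebel_tok(en_text, return_tensors="pt", max_length=512, truncation=True)
generated = rebel_model.generate(**inputs, num_beams=4, max_length=256)
decoded = rebel_tok.decode(generated[0], skip_special_tokens=False)
triplets = parse_rebel_output(decoded)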
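The pipeline calls parse_rebel_output without showing it. The sketch below is one possible implementation, assuming REBEL's linearized output of the form `<triplet> head <subj> tail <obj> relation`. Both the format assumption and the returned dictionary keys are illustrative and may differ from the helper actually used in this registry.

```python
from typing import Dict, List


def parse_rebel_output(decoded: str) -> List[Dict[str, str]]:
    """Hypothetical parser for '<triplet> head <subj> tail <obj> relation' sequences."""
    triplets: List[Dict[str, str]] = []
    head: List[str] = []
    tail: List[str] = []
    relation: List[str] = []
    slot = None  # list currently receiving tokens

    def flush() -> None:
        # Emit a triplet only when all three parts are present.
        if head and tail and relation:
            triplets.append(
                {"head": " ".join(head), "type": " ".join(relation), "tail": " ".join(tail)}
            )

    cleaned = decoded.replace("<s>", " ").replace("</s>", " ").replace("<pad>", " ")
    for marker in ("<triplet>", "<subj>", "<obj>"):
        cleaned = cleaned.replace(marker, f" {marker} ")  # isolate markers as tokens

    for token in cleaned.split():
        if token == "<triplet>":
            flush()
            head, tail, relation = [], [], []
            slot = head
        elif token == "<subj>":
            flush()  # a second <subj> starts a new tail for the same head
            tail, relation = [], []
            slot = tail
        elif token == "<obj>":
            relation = []
            slot = relation
        elif slot is not None:
            slot.append(token)

    flush()
    return triplets


# Hand-written input for illustration only, not real model output.
print(parse_rebel_output("<s><triplet> Pablo Isla <subj> Inditex <obj> employer</s>"))
```

This is why the pipeline decodes with skip_special_tokens=False: the marker tokens carry the triplet structure and would otherwise be lost.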
