Transformer: functions that clean, refine, or aggregate data
Tag: transformer. A group of functions that transform data: clean, dedup, aggregate, feature-engineer, filter, impute, normalize. Both input and output are data (no external side effects). Second link in the data_factory flow (Factorio analogy: assemblers).
MCP filter: mcp__registry__fn_search query="" tag="transformer".
Functions in the group
| ID | Lang | What it does |
|---|---|---|
| aggregate_by_group_py_datascience | py | GROUP BY + agg |
| deduplicate_entities_py_datascience | py | Dedup entities |
| deduplicate_relations_py_datascience | py | Dedup relations |
| align_relations_to_entities_py_datascience | py | Reconciliation |
| clip_py_datascience | py | Cap out-of-range values |
| clip_go_datascience | go | Cap out-of-range values |
| impute_py_datascience | py | Fill nulls |
| impute_go_datascience | go | Fill nulls |
| histogram_py_datascience | py | Bin frequencies |
| histogram_go_datascience | go | Bin frequencies |
| group_by_go_datascience | go | Group rows by key |
| coerce_types_py_core | py | Convert column types |
| csv_to_parquet_duckdb_py_core | py | Format conversion |
| diff_entities_py_datascience | py | Compare entity snapshots |
| diff_relations_py_datascience | py | Compare relation snapshots |
Canonical example
Pipeline to clean a dataframe: impute nulls -> clip outliers -> dedup -> aggregate.
from datascience import impute, clip, deduplicate_entities, aggregate_by_group
# 1. Fill nulls
df = impute(df, columns=["price"], strategy="median")
# 2. Cap extreme values
df = clip(df, column="price", lower=0.0, upper=10000.0)
# 3. Deduplicate by key
df = deduplicate_entities(df, key_columns=["sku"])
# 4. Aggregate by category
out = aggregate_by_group(df, group_by=["category"], aggs={"price": "mean", "qty": "sum"})
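To try the same four steps without the registry installed, the sketch below uses hypothetical stand-ins that work on a list of dicts. The function names and keyword arguments mirror the example above, but the bodies are assumptions about the semantics (median fill, min/max capping, first-occurrence dedup, mean/sum aggregation) and are not the registry implementations.

```python
from statistics import mean, median

# Hypothetical stand-ins: illustrative only, NOT the registry code.

def impute(rows, columns, strategy="median"):
    """Fill None values in each column with that column's median."""
    if strategy != "median":
        raise ValueError("only 'median' is sketched here")
    out = [dict(r) for r in rows]  # copy so the input is not mutated
    for col in columns:
        fill = median(r[col] for r in out if r[col] is not None)
        for r in out:
            if r[col] is None:
                r[col] = fill
    return out

def clip(rows, column, lower, upper):
    """Cap values of `column` to the closed range [lower, upper]."""
    return [{**r, column: min(max(r[column], lower), upper)} for r in rows]

def deduplicate_entities(rows, key_columns):
    """Keep the first row seen for each key."""
    seen, out = set(), []
    for r in rows:
        key = tuple(r[k] for k in key_columns)
        if key not in seen:
            seen.add(key)
            out.append(r)
    return out

def aggregate_by_group(rows, group_by, aggs):
    """Group rows by key and apply 'mean' or 'sum' per column."""
    funcs = {"mean": mean, "sum": sum}
    groups = {}
    for r in rows:
        groups.setdefault(tuple(r[g] for g in group_by), []).append(r)
    result = []
    for key, members in groups.items():
        row = dict(zip(group_by, key))
        for col, agg in aggs.items():
            row[col] = funcs[agg]([m[col] for m in members])
        result.append(row)
    return result

rows = [
    {"sku": "a", "category": "x", "price": 10.0, "qty": 1},
    {"sku": "b", "category": "x", "price": None, "qty": 2},
    {"sku": "b", "category": "x", "price": None, "qty": 2},
    {"sku": "c", "category": "y", "price": 50000.0, "qty": 3},
]

rows = impute(rows, columns=["price"], strategy="median")
rows = clip(rows, column="price", lower=0.0, upper=10000.0)
rows = deduplicate_entities(rows, key_columns=["sku"])
out = aggregate_by_group(rows, group_by=["category"], aggs={"price": "mean", "qty": "sum"})
```

Each stand-in returns a new list instead of mutating its argument, which matches the "data in, data out, no external effects" contract of the group.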
Group boundaries
Does NOT cover:
- Extract (reading from an external source) -> extractor.
- Sink (writing to an external destination) -> sink.
- Validation (range/null/drift checks) -> validator. Transformers ASSUME valid input data.
- ML training (that is trainer, deferred to v2 of data_factory).
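Because transformers assume valid input, a validation check normally has to run before them. The following is a minimal sketch of that ordering; `check_no_nulls`, `clip_prices`, and `run_step` are hypothetical names invented for illustration and do not correspond to registry functions.

```python
# Hypothetical helpers: names and signatures are illustrative only.

def check_no_nulls(rows, column):
    """Validator-style check: return indexes of rows where `column` is None."""
    return [i for i, r in enumerate(rows) if r.get(column) is None]

def clip_prices(rows, lower, upper):
    """Transformer-style step: assumes every price is already a number."""
    return [{**r, "price": min(max(r["price"], lower), upper)} for r in rows]

def run_step(rows):
    """Run the validator first and transform only if it reports no problems."""
    bad = check_no_nulls(rows, "price")
    if bad:
        raise ValueError(f"validation failed for rows {bad}")
    return clip_prices(rows, 0.0, 10000.0)
```

Keeping the check outside the transformer leaves the transformer free of error-handling branches and keeps the two tags cleanly separated.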
When NOT to use transformer
- If the function only reads/writes but does not transform -> it is an extractor or sink.
- If it only verifies a condition (bool/list) -> it is a validator.
- If it only chains several functions -> it is a pipeline (kind), not a plain transformer.
Purity
Most transformers are pure (same input -> same output). Some are impure because they use SDKs with caches (duckdb, polars). The transformer tag does not enforce purity; check purity in each function's frontmatter.
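When the frontmatter is missing or in doubt, a quick empirical check can flag obvious impurity. The helper below is a heuristic sketch, not part of the registry: `looks_pure` is a hypothetical name, and passing it only shows that repeated calls agreed and the argument was left untouched on this one sample, not that the function is pure in general.

```python
import copy

def looks_pure(fn, sample, runs=3):
    """Heuristic: same output on repeated calls and no mutation of the input."""
    first = fn(copy.deepcopy(sample))
    for _ in range(runs - 1):
        if fn(copy.deepcopy(sample)) != first:
            return False  # output changed between calls
    arg = copy.deepcopy(sample)
    fn(arg)
    return arg == sample  # False if fn modified its argument in place

# Illustrative functions to exercise the check.
_calls = {"n": 0}

def tagged(rows):
    """Impure: output depends on a hidden call counter."""
    _calls["n"] += 1
    return [(r, _calls["n"]) for r in rows]

def sort_in_place(rows):
    """Impure: mutates the caller's list."""
    rows.sort()
    return rows

pure_result = looks_pure(sorted, [3, 1, 2])
counter_result = looks_pure(tagged, [1, 2])
mutation_result = looks_pure(sort_in_place, [3, 1, 2])
```

A cache-backed transformer can still pass this check when the cache only affects speed, so the result complements the frontmatter and does not replace it.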
Consumers
- data_factory: Transformers tab.
- dag_engine: DAG steps with function: <transformer_id>.
- Notebooks analysis/ pipelines.
Notes
- Go/Py transformers are duplicated intentionally; the choice depends on stack and performance.
- If the registry adopts polars/duckdb at scale, consider creating sub-groups.