# Transformer: functions that clean, refine, or aggregate data

Tag: `transformer`. A group of functions that **transform data**: clean, dedup, aggregate, feature-engineer, filter, impute, normalize. Both input and output are data (no external side effects). It is the second link in the `data_factory` flow (Factorio analogy: assemblers).

MCP filter: `mcp__registry__fn_search query="" tag="transformer"`.

## Functions in the group

| ID | Lang | What it does |
|---|---|---|
| [aggregate_by_group_py_datascience](../../python/functions/datascience/aggregate_by_group.md) | py | GROUP BY + agg |
| [deduplicate_entities_py_datascience](../../python/functions/datascience/deduplicate_entities.md) | py | Dedup entities |
| [deduplicate_relations_py_datascience](../../python/functions/datascience/deduplicate_relations.md) | py | Dedup relations |
| [align_relations_to_entities_py_datascience](../../python/functions/datascience/align_relations_to_entities.md) | py | Reconciliation |
| [clip_py_datascience](../../python/functions/datascience/clip.md) | py | Cap out-of-range values |
| [clip_go_datascience](../../functions/datascience/clip.md) | go | Cap out-of-range values |
| [impute_py_datascience](../../python/functions/datascience/impute.md) | py | Fill nulls |
| [impute_go_datascience](../../functions/datascience/impute.md) | go | Fill nulls |
| [histogram_py_datascience](../../python/functions/datascience/histogram.md) | py | Bin frequencies |
| [histogram_go_datascience](../../functions/datascience/histogram.md) | go | Bin frequencies |
| [group_by_go_datascience](../../functions/datascience/group_by.md) | go | Group rows by key |
| [coerce_types_py_core](../../python/functions/core/coerce_types.md) | py | Convert column types |
| [csv_to_parquet_duckdb_py_core](../../python/functions/core/csv_to_parquet_duckdb.md) | py | Format conversion |
| [diff_entities_py_datascience](../../python/functions/datascience/diff_entities.md) | py | Compare entity snapshots |
| [diff_relations_py_datascience](../../python/functions/datascience/diff_relations.md) | py | Compare relation snapshots |

## Canonical example

Dataframe-cleaning pipeline: impute nulls -> clip outliers -> dedup -> aggregate.

```python
from datascience import impute, clip, deduplicate_entities, aggregate_by_group

# `df` is the input dataframe, assumed to be loaded upstream (e.g. by an extractor).

# 1. Fill nulls
df = impute(df, columns=["price"], strategy="median")

# 2. Cap extreme values
df = clip(df, column="price", lower=0.0, upper=10000.0)

# 3. Deduplicate by key
df = deduplicate_entities(df, key_columns=["sku"])

# 4. Aggregate by category
out = aggregate_by_group(df, group_by=["category"], aggs={"price": "mean", "qty": "sum"})
```
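
To make the semantics of the four steps concrete without the registry installed, here is a minimal standard-library sketch over a list of dicts. It is not the registry's implementation: the helper names (`impute_median`, `clip_column`, `dedup_by_key`, `aggregate`), the keep-first dedup policy, and the toy rows are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean, median


def impute_median(rows, column):
    """Replace None in `column` with the median of the non-null values."""
    fill = median(r[column] for r in rows if r[column] is not None)
    return [{**r, column: fill if r[column] is None else r[column]} for r in rows]


def clip_column(rows, column, lower, upper):
    """Cap values of `column` to the closed range [lower, upper]."""
    return [{**r, column: min(max(r[column], lower), upper)} for r in rows]


def dedup_by_key(rows, key_columns):
    """Keep the first row seen for each key tuple (assumed policy)."""
    seen, kept = set(), []
    for r in rows:
        key = tuple(r[c] for c in key_columns)
        if key not in seen:
            seen.add(key)
            kept.append(r)
    return kept


def aggregate(rows, group_col):
    """Per group: mean of `price` and sum of `qty`."""
    groups = defaultdict(list)
    for r in rows:
        groups[r[group_col]].append(r)
    return {
        g: {"price": mean(m["price"] for m in members),
            "qty": sum(m["qty"] for m in members)}
        for g, members in groups.items()
    }


# Toy input (hypothetical values); column names follow the example above.
rows = [
    {"sku": "a", "category": "x", "price": 10.0, "qty": 1},
    {"sku": "a", "category": "x", "price": 10.0, "qty": 1},   # duplicate sku
    {"sku": "b", "category": "x", "price": None, "qty": 2},   # null price
    {"sku": "c", "category": "y", "price": 20000.0, "qty": 3},  # outlier
]

rows = impute_median(rows, "price")
rows = clip_column(rows, "price", lower=0.0, upper=10000.0)
rows = dedup_by_key(rows, ["sku"])
out = aggregate(rows, "category")
```

Each helper returns a new list instead of mutating its input, so every step takes data in and hands data out, which matches the data-in/data-out contract of the group.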

## Group boundaries

Does NOT cover:
- **Extract** (reading from an external source) -> [[extractor]].
- **Sink** (writing to an external destination) -> [[sink]].
- **Validation** (range/null/drift checks) -> [[validator]]. Transformers ASSUME valid input data.
- ML training (that is `trainer`, deferred to v2 of data_factory).

## When NOT to use `transformer`

- If the function only reads/writes but does not transform -> it is an extractor or sink.
- If it only checks a condition (bool/list) -> it is a validator.
- If it only chains several functions -> it is a `pipeline` (kind), not a plain transformer.
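
The three rules above can be read as a small decision procedure. The sketch below is only an illustration: `suggest_group` and its boolean flags are hypothetical names, not part of the registry, and the order in which the rules are checked is an assumption.

```python
def suggest_group(*, touches_external: bool, transforms_data: bool,
                  only_checks: bool = False, only_chains: bool = False) -> str:
    """Map a function's behaviour to a group name using the rules above."""
    if only_chains:
        # Only chains other functions: a `pipeline` kind, not a plain transformer.
        return "pipeline"
    if only_checks:
        # Only verifies a condition and returns bool/list.
        return "validator"
    if touches_external and not transforms_data:
        # Reads or writes without transforming.
        return "extractor or sink"
    if transforms_data:
        return "transformer"
    return "unclassified"


print(suggest_group(touches_external=False, transforms_data=True))
print(suggest_group(touches_external=True, transforms_data=False))
```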

## Purity

Most transformers are **pure** (same input -> same output). Some are impure because they use SDKs with caches (duckdb, polars). The `transformer` tag does not enforce purity; check `purity` in each function's frontmatter.
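
As a rough illustration of that check, the sketch below reads a `purity` key from a function's Markdown file. It assumes the frontmatter is a leading block delimited by `---` lines with simple `key: value` pairs; the sample document and the value `pure` are hypothetical, and `read_purity` is not a registry function.

```python
def read_purity(markdown_text):
    """Return the `purity` value from a leading `---` frontmatter block, or None."""
    lines = markdown_text.splitlines()
    if not lines or lines[0].strip() != "---":
        return None  # no frontmatter at all
    for line in lines[1:]:
        if line.strip() == "---":
            break  # end of frontmatter, key not found
        key, sep, value = line.partition(":")
        if sep and key.strip() == "purity":
            return value.strip().strip("'\"") or None
    return None


# Hypothetical function doc; the real frontmatter layout may differ.
doc = """---
id: clip_py_datascience
purity: pure
---
# clip
"""

print(read_purity(doc))
```

For anything beyond flat `key: value` pairs, a real YAML parser would be the safer choice.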

## Consumers

- `data_factory`: Transformers tab.
- `dag_engine`: DAG steps with `function: <transformer_id>`.
- Notebooks and analysis/ pipelines.

## Notes

- Go/Py transformers are intentionally duplicated; the choice depends on stack and performance.
- If the registry adopts `polars`/`duckdb` at scale, consider creating sub-groups.