# Transformer: functions that clean, refine, or aggregate data

Tag: `transformer`.

A group of functions that **transform data**: clean, dedup, aggregate, feature-engineer, filter, impute, normalize. Both input and output are data (no external effects). This is the second link in the `data_factory` flow (Factorio analogy: assemblers).

MCP filter: `mcp__registry__fn_search query="" tag="transformer"`.

## Functions in the group

| ID | Lang | What it does |
|---|---|---|
| [aggregate_by_group_py_datascience](../../python/functions/datascience/aggregate_by_group.md) | py | GROUP BY + agg |
| [deduplicate_entities_py_datascience](../../python/functions/datascience/deduplicate_entities.md) | py | Dedup entities |
| [deduplicate_relations_py_datascience](../../python/functions/datascience/deduplicate_relations.md) | py | Dedup relations |
| [align_relations_to_entities_py_datascience](../../python/functions/datascience/align_relations_to_entities.md) | py | Reconciliation |
| [clip_py_datascience](../../python/functions/datascience/clip.md) | py | Cap out-of-range values |
| [clip_go_datascience](../../functions/datascience/clip.md) | go | Cap out-of-range values |
| [impute_py_datascience](../../python/functions/datascience/impute.md) | py | Fill nulls |
| [impute_go_datascience](../../functions/datascience/impute.md) | go | Fill nulls |
| [histogram_py_datascience](../../python/functions/datascience/histogram.md) | py | Bin frequencies |
| [histogram_go_datascience](../../functions/datascience/histogram.md) | go | Bin frequencies |
| [group_by_go_datascience](../../functions/datascience/group_by.md) | go | Group rows by key |
| [coerce_types_py_core](../../python/functions/core/coerce_types.md) | py | Convert column types |
| [csv_to_parquet_duckdb_py_core](../../python/functions/core/csv_to_parquet_duckdb.md) | py | Format conversion |
| [diff_entities_py_datascience](../../python/functions/datascience/diff_entities.md) | py | Compare entity snapshots |
| [diff_relations_py_datascience](../../python/functions/datascience/diff_relations.md) | py | Compare relation snapshots |

## Canonical example

Pipeline to clean a dataframe: impute nulls -> clip outliers -> dedup -> aggregate.

```python
from datascience import impute, clip, deduplicate_entities, aggregate_by_group

# `df` is the input dataframe, loaded upstream (e.g. by an extractor).

# 1. Fill nulls
df = impute(df, columns=["price"], strategy="median")

# 2. Cap extreme values
df = clip(df, column="price", lower=0.0, upper=10000.0)

# 3. Deduplicate by key
df = deduplicate_entities(df, key_columns=["sku"])

# 4. Aggregate by category
out = aggregate_by_group(df, group_by=["category"], aggs={"price": "mean", "qty": "sum"})
```

## Group boundaries

Does NOT cover:

- **Extract** (reading from an external source) -> [[extractor]].
- **Sink** (writing to an external destination) -> [[sink]].
- **Validation** (range/null/drift checks) -> [[validator]]. Transformers ASSUME valid input data.
- ML training (that is `trainer`, deferred to v2 of `data_factory`).

## When NOT to use `transformer`

- If the function only reads or writes but does not transform -> it is an extractor or a sink.
- If it only checks a condition (bool/list) -> it is a validator.
- If it only chains several functions -> it is a `pipeline` (kind), not a plain transformer.

## Purity

Most transformers are **pure** (same input -> same output). Some are impure because they use SDKs with caches (duckdb, polars). The `transformer` tag does not enforce purity; check `purity` in each function's frontmatter.

## Consumers

- `data_factory`: Transformers tab.
- `dag_engine` DAG steps with `function: `.
- Notebooks and analysis/ pipelines.

## Notes

- Go and Python transformers are duplicated intentionally; choose between them by stack and performance.
- If the registry adopts `polars`/`duckdb` at scale, consider creating sub-groups.
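
Because the tag does not enforce purity, a consumer that needs pure transformers has to read the `purity` field itself. The following is a minimal sketch of that check, not the registry's actual tooling. It assumes each function doc opens with a flat `key: value` frontmatter block delimited by `---` and that the field takes the values `pure` or `impure`; the helper names `read_frontmatter` and `is_pure`, the sample document, and its `example_fn` ID are all hypothetical.

```python
from typing import Dict, Optional


def read_frontmatter(markdown_text: str) -> Dict[str, str]:
    """Parse flat `key: value` pairs from a leading `---` block.

    Assumption: frontmatter is flat (no nested YAML, no multi-line values).
    """
    lines = markdown_text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    fields: Dict[str, str] = {}
    for line in lines[1:]:
        if line.strip() == "---":
            return fields
        key, sep, value = line.partition(":")
        if sep and key.strip():
            fields[key.strip()] = value.strip().strip("\"'")
    return {}  # no closing delimiter: treat as no frontmatter


def is_pure(markdown_text: str) -> Optional[bool]:
    """Return True/False from the `purity` field, or None if it is absent."""
    purity = read_frontmatter(markdown_text).get("purity")
    if purity is None:
        return None
    # Assumption: the registry spells the pure value as "pure".
    return purity.lower() == "pure"


# Hypothetical function doc, for illustration only.
SAMPLE_DOC = """---
id: example_fn
purity: pure
---
# example_fn
"""

print(is_pure(SAMPLE_DOC))
```

Returning `None` when the field is missing keeps "unknown" separate from "impure", so a caller can decide whether to skip undocumented functions or treat them as impure.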