# Extractor: functions that read data from external sources
Tag: `extractor`. A group of functions that **read raw data** from external sources (DB, API, files, web) and return it to the application. They are the first node of the data flow in `data_factory` (issue 0097; in the Factorio analogy, extractors are the drills).
MCP filter: `mcp__registry__fn_search query="" tag="extractor"`.
## Functions in the group
| ID | Lang | Source |
|---|---|---|
| [bq_query_py_infra](../../python/functions/infra/bq_query.md) | py | BigQuery SQL |
| [bq_preview_rows_py_infra](../../python/functions/infra/bq_preview_rows.md) | py | BigQuery table preview |
| [bq_get_table_py_infra](../../python/functions/infra/bq_get_table.md) | py | BigQuery metadata |
| [bq_list_tables_py_infra](../../python/functions/infra/bq_list_tables.md) | py | BigQuery catalog |
| [metabase_execute_card_py_infra](../../python/functions/infra/metabase_execute_card.md) | py | Metabase card |
| [metabase_execute_query_py_infra](../../python/functions/infra/metabase_execute_query.md) | py | Metabase ad-hoc |
| [load_csv_go_datascience](../../functions/datascience/load_csv.md) | go | CSV file |
| [load_parquet_go_datascience](../../functions/datascience/load_parquet.md) | go | Parquet file |
| [from_csv_py_core](../../python/functions/core/from_csv.md) | py | CSV file |
| [fetch_data_frame_go_datascience](../../functions/datascience/fetch_data_frame.md) | go | Generic dataframe loader |
| [http_get_json_py_infra](../../python/functions/infra/http_get_json.md) | py | HTTP JSON GET |
| [http_get_json_go_infra](../../functions/infra/http_get_json.md) | go | HTTP JSON GET |
| [http_download_file_py_infra](../../python/functions/infra/http_download_file.md) | py | HTTP file download |
| [http_download_file_go_infra](../../functions/infra/http_download_file.md) | go | HTTP file download |
| [jupyter_read_py_notebook](../../python/functions/notebook/jupyter_read.md) | py | Jupyter notebook cells |
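
The IDs in the table appear to follow a `<name>_<lang>_<domain>` pattern, and the doc links mirror it. A minimal sketch of that mapping, inferred only from the rows above; `split_id` and `doc_path` are hypothetical helpers, not part of the registry API:

```python
from typing import NamedTuple


class FunctionId(NamedTuple):
    name: str
    lang: str
    domain: str


def split_id(function_id: str) -> FunctionId:
    """Split an ID such as 'bq_query_py_infra'.

    Splitting from the right keeps underscores inside the name intact.
    """
    name, lang, domain = function_id.rsplit("_", 2)
    return FunctionId(name, lang, domain)


def doc_path(function_id: str) -> str:
    """Rebuild the relative doc link as it appears in the table (assumed layout)."""
    fid = split_id(function_id)
    # In the table, Python docs live under python/functions and Go docs under functions.
    root = "python/functions" if fid.lang == "py" else "functions"
    return f"../../{root}/{fid.domain}/{fid.name}.md"
```

For example, `doc_path("load_csv_go_datascience")` yields the same link as the `load_csv_go_datascience` row.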
## Canonical example
BigQuery extractor -> local dataframe.
```python
from infra import bq_query
# 1. Extract
df = bq_query(
    project_id="my-gcp-project",
    query="SELECT * FROM `dataset.events` WHERE event_date = CURRENT_DATE() LIMIT 10000",
)
print(f"extracted {len(df)} rows, {df.memory_usage(deep=True).sum() / 1024:.1f} KB")
```
## Group boundaries
Does NOT cover:
- **Transform** (clean / dedup / aggregate) -> [[transformer]].
- **Sink** (writing to an external destination) -> [[sink]].
- **Validation** of the data (range / null / drift) -> [[validator]].
- Extracting functions from external repos (that is `sources/`, not this group).
- Continuous streaming (Kafka consumer, etc.): extractors do a one-off or batch fetch.
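
To make these boundaries concrete, here is a minimal stdlib-only sketch. All four functions are hypothetical stand-ins, not registry functions; only `extract_csv` plays the extractor role, because it is the only one that reads raw data from a source outside the process.

```python
import csv
import tempfile
from pathlib import Path


def extract_csv(path: Path) -> list[dict[str, str]]:
    """Extractor role: read raw rows from an external file and return them unchanged."""
    with path.open(newline="", encoding="utf-8") as fh:
        return list(csv.DictReader(fh))


def dedup(rows: list[dict[str, str]]) -> list[dict[str, str]]:
    """Transformer role: operates only on data already in memory."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def has_no_nulls(rows: list[dict[str, str]], column: str) -> bool:
    """Validator role: inspects the data without changing it."""
    return all(row.get(column) not in (None, "") for row in rows)


def sink_csv(rows: list[dict[str, str]], path: Path) -> None:
    """Sink role: write rows to an external destination."""
    if not rows:
        return
    with path.open("w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)


# One-off batch run: extract -> transform -> validate -> sink.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "events.csv"
    src.write_text("id,name\n1,a\n1,a\n2,b\n", encoding="utf-8")
    raw = extract_csv(src)
    clean = dedup(raw)
    ok = has_no_nulls(clean, "name")
    sink_csv(clean, Path(tmp) / "out.csv")
    written = (Path(tmp) / "out.csv").read_text(encoding="utf-8")
```

The extractor returns every row as found, duplicates included; cleaning, checking, and writing are left to the other groups.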
## When NOT to use `extractor`
- If the function reads from the registry's internal DB (`registry.db`, `operations.db`) -> tag `registry` or `doctor`, not `extractor`. Extractor implies an external source.
- If it only parses bytes already loaded in memory -> it is a `transformer`.
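
The two rules above can be read as a small decision on where the data comes from. A sketch under that reading; `DataOrigin` and `candidate_tags` are hypothetical names used only for illustration:

```python
from enum import Enum


class DataOrigin(Enum):
    EXTERNAL_SOURCE = "external"   # DB, API, file, web
    REGISTRY_DB = "registry_db"    # registry.db / operations.db
    IN_MEMORY = "in_memory"        # bytes already loaded by the caller


def candidate_tags(origin: DataOrigin) -> tuple[str, ...]:
    """Return the tag(s) this document suggests for a function reading from `origin`."""
    if origin is DataOrigin.EXTERNAL_SOURCE:
        return ("extractor",)
    if origin is DataOrigin.REGISTRY_DB:
        # The document names both tags without saying how to pick between them.
        return ("registry", "doctor")
    return ("transformer",)
```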
## Consumers
- `data_factory` C++ ImGui app: the Extractors tab lists the whole group and allows Run Now per node.
- `dag_engine`: a DAG step can call an extractor via `function: <id>`.
- Ad-hoc pipelines in notebooks.
## Notes
- BQ extractors are also tagged `bigquery` (another group); Metabase extractors are tagged `metabase`. A function can belong to multiple groups.
- HTTP extractors are duplicated in Go and Python; choose according to the consumer's stack.
- The group grows as the registry incorporates new sources (Mongo, Kafka, S3...). Keep the tag in the frontmatter when creating a new function.
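
A quick way to check the last note is to verify that a function doc carries the tag. The sketch below assumes a `---`-delimited frontmatter with an inline `tags: [a, b]` line; the registry's real frontmatter schema is not shown in this document, so both the layout and the helper names are assumptions.

```python
# Hypothetical function doc; the field names are assumed, not taken from the registry.
SAMPLE_DOC = """---
id: bq_query_py_infra
tags: [extractor, bigquery]
---
# bq_query
"""


def frontmatter_tags(markdown: str) -> list[str]:
    """Return tags from an inline `tags: [..]` line inside the leading frontmatter block."""
    lines = markdown.splitlines()
    if not lines or lines[0].strip() != "---":
        return []  # no frontmatter at all
    for line in lines[1:]:
        stripped = line.strip()
        if stripped == "---":
            break  # end of frontmatter without a tags line
        if stripped.startswith("tags:"):
            raw = stripped[len("tags:"):].strip().strip("[]")
            return [t.strip().strip("'\"") for t in raw.split(",") if t.strip()]
    return []


def has_extractor_tag(markdown: str) -> bool:
    """True when the doc's frontmatter lists the `extractor` tag."""
    return "extractor" in frontmatter_tags(markdown)
```

A check like this could run when a new function is added, so that new sources show up under the `extractor` filter.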