# Extractor: functions that read data from external sources

Tag: `extractor`. A group of functions that **read raw data** from external sources (DB, API, files, web) and return it to the application. They are the first node of the data flow in `data_factory` (issue 0097; in the Factorio analogy, extractors are the drills).

MCP filter: `mcp__registry__fn_search query="" tag="extractor"`.

## Functions in the group

| ID | Lang | Source |
|---|---|---|
| [bq_query_py_infra](../../python/functions/infra/bq_query.md) | py | BigQuery SQL |
| [bq_preview_rows_py_infra](../../python/functions/infra/bq_preview_rows.md) | py | BigQuery table preview |
| [bq_get_table_py_infra](../../python/functions/infra/bq_get_table.md) | py | BigQuery metadata |
| [bq_list_tables_py_infra](../../python/functions/infra/bq_list_tables.md) | py | BigQuery catalog |
| [metabase_execute_card_py_infra](../../python/functions/infra/metabase_execute_card.md) | py | Metabase card |
| [metabase_execute_query_py_infra](../../python/functions/infra/metabase_execute_query.md) | py | Metabase ad-hoc |
| [load_csv_go_datascience](../../functions/datascience/load_csv.md) | go | CSV file |
| [load_parquet_go_datascience](../../functions/datascience/load_parquet.md) | go | Parquet file |
| [from_csv_py_core](../../python/functions/core/from_csv.md) | py | CSV file |
| [fetch_data_frame_go_datascience](../../functions/datascience/fetch_data_frame.md) | go | Generic dataframe loader |
| [http_get_json_py_infra](../../python/functions/infra/http_get_json.md) | py | HTTP JSON GET |
| [http_get_json_go_infra](../../functions/infra/http_get_json.md) | go | HTTP JSON GET |
| [http_download_file_py_infra](../../python/functions/infra/http_download_file.md) | py | HTTP file download |
| [http_download_file_go_infra](../../functions/infra/http_download_file.md) | go | HTTP file download |
| [jupyter_read_py_notebook](../../python/functions/notebook/jupyter_read.md) | py | Jupyter notebook cells |

## Canonical example

BigQuery extractor -> local dataframe.

```python
from infra import bq_query

# 1. Extract: run the query and load the result into a local dataframe
df = bq_query(
    project_id="my-gcp-project",
    query="SELECT * FROM `dataset.events` WHERE event_date = CURRENT_DATE() LIMIT 10000",
)

# 2. Report row count and in-memory size (memory_usage is the pandas DataFrame API)
print(f"extracted {len(df)} rows, {df.memory_usage(deep=True).sum() / 1024:.1f} KB")
```

## Group boundaries

Does NOT cover:
- **Transform** (clean / dedup / aggregate) -> [[transformer]].
- **Sink** (writing to an external destination) -> [[sink]].
- **Validation** of the data (range / null / drift) -> [[validator]].
- Extracting functions from external repos (that is `sources/`, not this group).
- Continuous streaming (Kafka consumer, etc.): extractors are one-shot or batch fetches.
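
The boundaries above can be sketched as four separate stages. This is a minimal illustration using only the Python standard library; the function names (`extract_csv`, `transform_dedup`, `validate_no_empty`, `sink_csv`) and the sample data are hypothetical and are not registry functions.

```python
import csv
import tempfile
from pathlib import Path


def extract_csv(path):
    """Extractor: one-shot read of raw rows from an external source (a file)."""
    with open(path, newline="", encoding="utf-8") as fh:
        return list(csv.DictReader(fh))


def transform_dedup(rows):
    """Transformer: works only on rows that are already in memory."""
    seen, unique = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def validate_no_empty(rows, column):
    """Validator: inspects the data without fetching or changing it."""
    return all(row[column] != "" for row in rows)


def sink_csv(rows, path):
    """Sink: writes rows to an external destination (a file)."""
    with open(path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)


with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "events.csv"
    dst = Path(tmp) / "clean.csv"
    # Hypothetical sample data with one duplicated row.
    src.write_text("id,name\n1,a\n1,a\n2,b\n", encoding="utf-8")

    raw = extract_csv(src)                  # extractor: the only stage that reads the source
    clean = transform_dedup(raw)            # transformer
    ok = validate_no_empty(clean, "name")   # validator
    sink_csv(clean, dst)                    # sink
    written = dst.read_text(encoding="utf-8")
```

Only `extract_csv` touches the external source; every later stage receives data that is already in memory, which is the line the list above draws.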
## When NOT to use `extractor`

- If the function reads from the registry's internal DB (`registry.db`, `operations.db`) -> tag `registry` or `doctor`, not `extractor`. Extractor implies an external source.
- If it only parses bytes already loaded in memory -> it is a `transformer`.

## Consumers

- `data_factory` C++ ImGui app: the Extractors tab lists the whole group and allows Run Now per node.
- `dag_engine`: a DAG step can call an extractor via `function: <id>`.
- Ad-hoc pipelines in notebooks.
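
To make the `function: <id>` reference concrete, here is a rough sketch of how a step could be resolved to a callable. It is not the `dag_engine` implementation: only the `function` key comes from the text above. The `name` and `args` keys, the `FUNCTIONS` table, `run_step`, and the stand-in `fake_extractor` are all assumptions made for illustration.

```python
def fake_extractor(limit):
    """Stand-in that fakes an external read; NOT a real registry function."""
    return [{"id": i} for i in range(limit)]


# Hypothetical id -> callable table. The id string is taken from the group table,
# but it is mapped to the stand-in above, not to the real implementation.
FUNCTIONS = {"http_get_json_py_infra": fake_extractor}

# Hypothetical step shape: only "function" is documented; "name" and "args" are assumed.
step = {
    "name": "load_events",
    "function": "http_get_json_py_infra",
    "args": {"limit": 3},
}


def run_step(step, functions):
    """Look up the step's function id and call it with the step's arguments."""
    fn_id = step["function"]
    if fn_id not in functions:
        raise KeyError(f"unknown function id: {fn_id}")
    return functions[fn_id](**step.get("args", {}))


rows = run_step(step, FUNCTIONS)
```

Failing fast on an unknown id keeps a mistyped `function` value from being silently skipped.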
## Notes

- BQ extractors are also tagged `bigquery` (another group); Metabase extractors are tagged `metabase`. A function can belong to multiple groups.
- HTTP extractors are duplicated in Go and Python; choose according to the consumer's stack.
- The group grows as the registry adds new sources (Mongo, Kafka, S3...). Add the `extractor` tag to the frontmatter when creating a new function.
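
A small check like the following could catch a new function doc that is missing the tag. It is a sketch under a loud assumption: the registry's actual frontmatter format is not shown here, so the code assumes a `---`-delimited block with a single-line `tags: [a, b]` entry. The `id` field and the helper name `frontmatter_tags` are hypothetical.

```python
def frontmatter_tags(markdown_text):
    """Return the tags from an assumed '---' frontmatter block, or [] if none."""
    lines = markdown_text.splitlines()
    if not lines or lines[0].strip() != "---":
        return []
    for line in lines[1:]:
        stripped = line.strip()
        if stripped == "---":  # end of frontmatter reached without a tags line
            break
        if stripped.startswith("tags:"):
            value = stripped[len("tags:"):].strip().strip("[]")
            return [t.strip().strip("'\"") for t in value.split(",") if t.strip()]
    return []


# Hypothetical function doc; the field layout is an assumption.
doc = """---
id: bq_query_py_infra
tags: [extractor, bigquery]
---
# bq_query
"""

tags = frontmatter_tags(doc)
is_extractor = "extractor" in tags
```

Because a function can belong to several groups, the check tests membership rather than equality.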