chore: auto-commit (26 files)

- python/functions/bigquery/bq_auth.md
- python/functions/bigquery/bq_load_from_file.md
- python/functions/bigquery/bq_load_from_gcs.md
- python/functions/bigquery/client.py
- python/functions/bigquery/queries.py
- python/functions/datascience/__init__.py
- python/functions/datascience/decode_qr_image.py
- python/functions/datascience/load_bq_table_to_duckdb.md
- python/functions/datascience/load_bq_table_to_duckdb.py
- python/functions/pipelines/profile_bq_table.md
- ...

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
chore: auto-commit (8 files)
2026-07-02 19:00:13 +02:00 · 2026-07-01 19:00:06 +02:00 · 2026-07-01 17:58:03 +02:00 · 2026-07-01 12:45:39 +02:00 · 2026-07-01 11:42:49 +02:00 · 2026-07-01 11:41:56 +02:00
88 changed files with 9497 additions and 201 deletions
@@ -41,12 +41,13 @@ reconocido se degrada a `Note`, nunca lanza).
 | `Heading(text, level=1)` | section title, `level` 1 (large) … 3 (small) | one or more lines in bold; level 1 carries an accent underline |
 | `Markdown(text)` | lightweight markdown text | see the subset below; **never cuts mid-line** |
 | `KVTable(rows, title=None)` | `rows = [(key, value), ...]` | 2-column label/value table; the value wraps |
-| `DataTable(header, rows, title=None, note=None)` | `header=[...]`, `rows=[[...],...]` | table with a header; **split by rows, repeating the header**; long cells wrap within their column |
+| `DataTable(header, rows, title=None, note=None)` | `header=[...]`, `rows=[[...],...]` | table with a header; **if it fits** as text it is split by rows, repeating the header; **if it does NOT fit** (too many columns) it is rasterized whole as a high-resolution image that can be zoomed. See §11.4 |
 | `Figure(fig=None, make=None, caption=None, height_in=None)` | an already-built `matplotlib.figure.Figure` (`fig`) or a callable `make()->Figure` (lazy) | rasterized and scaled to fit whole (never cropped) |
 | `Image(path, caption=None, height_in=None)` | path to a PNG/JPG | scaled to fit whole |
 | `Caption(text)` / `Note(text)` | small auxiliary text | caption/note in gray; `Note` is also the fallback for anything unknown |
-| `Group(blocks, title=None)` | **keep-together** unit: its blocks stay together | the renderer measures the whole group and moves it intact to the next page/slide if it does not fit; it shrinks the figure to leave room for the title and text. See §11 |
+| `Group(blocks, title=None, page_break_before=False, layout="stack")` | **keep-together** unit: its blocks stay together | the renderer measures the whole group and moves it intact to the next page/slide if it does not fit; it shrinks the figure to leave room for the title and text. `layout="side_by_side"` places table and figure in two columns (PPTX only). See §11 and §11.4 |
 | `GlossaryEntry(key, label, definition)` | one glossary entry (clickable destination) | generated by the `glosario` chapter; registers its position as the destination for marked terms. See §11 |
+| `TocEntry(label, target_id)` | one **clickable table-of-contents** entry on the cover | generated by the `portada` chapter; the renderer wires it as a jump to the start of the chapter whose `id` or `title` matches `target_id`. See §11.4 |

 `Figure`/`Image` accept `height_in` (a hint): the renderer **clamps** the figure to that maximum height (`Group` uses it to shrink the figure). Every figure scales to leave room for its caption on the same page/slide; in PPTX the caption is **always** visible (if no `caption` is given, it falls back to the last heading or to "Figura").

@@ -397,6 +398,65 @@ cabecera con su fondo propio. Es automático en PDF y PPTX; el patrón se mantie
 when a long table is split and its header repeated (the row index is logical, not per
 page). Chapters need no changes.

+### 11.4 Global render quality: high DPI, wide table → image, figure beside table, clickable index
+
+Four cross-cutting engine capabilities, **all automatic except `layout`** (which a
+chapter enables explicitly). They apply to both PDF and PPTX unless noted otherwise.
+
+**(a) High DPI (automatic).** Every embedded figure/image is rasterized at **220 dpi**
+(constant `_RASTER_DPI` in both renderers; in PDF it is also applied to the page's
+`savefig`, because matplotlib re-rasterizes each `imshow` when it writes the page). The goal
+is to zoom in on a phone and read detail (axes, cells) without pixelation. Text stays
+vector and selectable. Chapters need no changes.
+
+**(b) Wide table → high-resolution image (automatic).** When a `DataTable` has
+**too many columns to be legible as text** within the usable width (criterion
+`_table_fits_as_text`: the table does not fit when minimum legible width per column × number
+of columns > usable width; in practice this triggers on `df.head`-style tables with many
+columns), the columns are not squeezed until they become illegible. Instead the table is drawn
+**whole as a high-resolution image** (function `render_table_as_figure_py_datascience`:
+shaded header + zebra striping), scaled to fit completely, so the reader can **zoom** and read
+it without losing data. If the table **does fit**, it stays as selectable text (PDF) / a native
+table (PPTX). `KVTable`s (2 columns) always fit and stay as text. Chapters need no
+changes.
+
+**(c) Figure beside the table: `Group(layout="side_by_side")`.** A layout hint that a
+chapter enables so that its **table sits on the left and its figure on the right** on the
+same slide, instead of stacked:
+
+```python
+model.Group(
+    layout="side_by_side",
+    blocks=[
+        model.Heading(text=str(name), level=2),       # full width, on top
+        model.DataTable(header=..., rows=...),         # LEFT column (~55%)
+        model.Figure(make=_grafico_perezoso(...)),     # RIGHT column (~45%)
+        model.Markdown(text="explicación…"),           # full width, at the bottom
+    ])
+```
+
+Exact contract of the field:
+
+| Field | Value | Effect |
+|---|---|---|
+| `layout` | `"stack"` (default) | historical behavior: vertical stacking (keep-together). |
+| `layout` | `"side_by_side"` | **PPTX**: the table (rasterized to an image) takes the left column (~55% of the usable width) and the figure the right (~45%); any other block (heading, markdown) goes full width above/below. If there is no table+figure pair, or the two do not fit side by side on one slide, it **falls back automatically to stacking**. **PDF**: treated **the same as `stack`** (the mobile A5 width cannot hold two legible columns). Unknown values degrade to `"stack"`. |
+
+It is **backward compatible**: a `Group` without `layout` (or with `layout="stack"`) behaves exactly
+as before. The `cat_distr` chapter is the intended consumer (chart to the right of the
+category table in PPTX); this engine only provides the support.
+
+**(d) Clickable index on the cover: `TocEntry`.** The cover emits a `Heading("Índice")`
+followed by one `TocEntry(label, target_id)` per chapter. The renderer records the
+start page/slide of **every** chapter (indexed by `id` **and** by `title`) and wires
+each `TocEntry` as a real jump to that start: in **PDF** via
+`add_pdf_internal_links_py_datascience` (PyMuPDF GOTO link), in **PPTX** via
+`pptx_link_run_to_slide_py_datascience` (native slide jump). Because the cover only knows
+the chapter **titles**, `target_id` is matched against the destination's `title` (or
+`id`). If a destination does not resolve, the entry is still shown as text
+(in link color) and is never cut off. This is the same mechanism as the glossary's clickable
+terms (§11.1), reused in the cover → chapter direction.
+
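The decisions described in (b) and (c) can be sketched in a few lines. This is only an illustration under stated assumptions: `table_fits_as_text` and `resolve_group_layout` are hypothetical stand-ins (the real `_table_fits_as_text` is not shown in this diff), and the 0.6 in minimum column width and the 5.0 in usable width are made-up values.

```python
def table_fits_as_text(n_columns: int, usable_width_in: float,
                       min_col_width_in: float = 0.6) -> bool:
    """Hypothetical version of the criterion in (b).

    The table does NOT fit when
    minimum legible width per column * number of columns > usable width.
    The 0.6 in default is an assumed value, not the engine's.
    """
    if n_columns <= 0:
        return True
    return n_columns * min_col_width_in <= usable_width_in


def resolve_group_layout(layout: str, target: str,
                         has_table_figure_pair: bool,
                         fits_side_by_side: bool) -> str:
    """Hypothetical resolution of the `layout` hint described in (c)."""
    if layout != "side_by_side":
        return "stack"            # default and unknown values degrade to stack
    if target != "pptx":
        return "stack"            # PDF treats side_by_side the same as stack
    if not (has_table_figure_pair and fits_side_by_side):
        return "stack"            # automatic fallback to stacking
    return "side_by_side"


# A 2-column KVTable fits; a 20-column df.head-style table does not.
print(table_fits_as_text(2, usable_width_in=5.0))
print(table_fits_as_text(20, usable_width_in=5.0))
print(resolve_group_layout("side_by_side", "pdf", True, True))
```

Every fallback branch returns `"stack"`, which mirrors the backward-compatibility claim above: a chapter that never sets `layout` is unaffected.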
 ---

 ## 10. Future integration with `profile_table` (next phase)
@@ -3,11 +3,11 @@ name: bq_auth
 kind: function
 lang: py
 domain: infra
-version: "1.0.0"
+version: "1.1.0"
 purity: impure
-signature: "def bq_auth(project_id: str = '', credentials_path: str = '') -> BQClient"
-description: "Authenticates against Google BigQuery with ADC or a service account JSON. Returns a BQClient ready to use with all CRUD functions."
-tags: [bigquery, gcp, auth, google-cloud, python, pendiente-usar]
+signature: "def bq_auth(project_id: str = '', credentials_path: str = '', drop_quota_project: bool = False) -> BQClient"
+description: "Authenticates against Google BigQuery with ADC or a service account JSON. Returns a BQClient ready to use with all CRUD functions. With drop_quota_project=True it drops the quota project from the user's ADC (creds.with_quota_project(None)) to avoid the 403 USER_PROJECT_DENIED raised when the ADC carries a foreign quota_project_id."
+tags: [bigquery, gcp, auth, google-cloud, python, forecast, pendiente-usar]
 uses_functions: []
 uses_types: []
 returns: []
@@ -19,6 +19,8 @@ params:
    desc: "GCP project ID (empty = detect from credentials/environment)"
  - name: credentials_path
    desc: "path to the service account JSON file (empty = Application Default Credentials)"
+  - name: drop_quota_project
+    desc: "if True and no credentials_path, resolves ADC via google.auth.default and drops the quota project from the ADC (with_quota_project(None)); avoids the 403 USER_PROJECT_DENIED raised when the user's ADC carries a foreign quota_project_id. Default False = original behavior"
 output: "BQClient: authenticated client with the project resolved"
 tested: false
 tests: []
@@ -40,6 +42,9 @@ client = bq_auth("my-project-id")
 # Service account
 client = bq_auth(credentials_path="/path/to/service-account.json")

+# Without a quota project (avoids 403 USER_PROJECT_DENIED with user ADC)
+client = bq_auth("autingo-159109", drop_quota_project=True)
+
 # Context manager
 with bq_auth() as client:
    # client is closed automatically
@@ -48,9 +53,14 @@ with bq_auth() as client:

 ## Notes

-Three authentication modes:
+Authentication modes:
 - No arguments: uses Application Default Credentials (ADC); requires `gcloud auth application-default login`
 - With project_id: uses ADC but forces the project
 - With credentials_path: reads the service account JSON directly
+- With drop_quota_project=True (and no credentials_path): resolves ADC via `google.auth.default(scopes=[".../bigquery"])`, applies `creds.with_quota_project(None)` if the attribute exists, and builds the client with those credentials. This fixes a known gotcha: the user's ADC (`egutierrez`) carries a foreign `quota_project_id=autingo` and BigQuery returns `403 USER_PROJECT_DENIED`; dropping the quota project resolves it.

 The BQClient wraps `google.cloud.bigquery.Client` and exposes `_client` so the module's functions can use it internally.
+
+## Capability growth log
+
+- v1.1.0 (2026-07-02): adds `drop_quota_project` to drop the quota project from the user's ADC (`creds.with_quota_project(None)`) and avoid the 403 USER_PROJECT_DENIED. Default False = behavior identical to before.
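The `hasattr` guard behind `drop_quota_project` can be illustrated without any Google libraries. The sketch below is an assumption-laden stand-in: `FakeUserCredentials`, `FakeCredentialsWithoutQuota` and `drop_quota` are hypothetical names invented for this example; only the `with_quota_project(None)` call and the `hasattr` check come from the change itself.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class FakeUserCredentials:
    """Hypothetical stand-in for a user ADC credential object."""
    quota_project_id: Optional[str] = None

    def with_quota_project(self, quota_project_id: Optional[str]):
        # Return a copy, leaving the original credentials untouched.
        return replace(self, quota_project_id=quota_project_id)


class FakeCredentialsWithoutQuota:
    """Hypothetical credential type with no with_quota_project method."""


def drop_quota(creds):
    """Drop the quota project only when the credential type supports it."""
    if hasattr(creds, "with_quota_project"):
        return creds.with_quota_project(None)
    return creds


user_creds = FakeUserCredentials(quota_project_id="autingo")
clean = drop_quota(user_creds)
print(clean.quota_project_id)        # quota project removed on the copy
print(user_creds.quota_project_id)   # original still carries it
```

The guard matters because not every credential class is assumed to expose `with_quota_project`; objects without it pass through unchanged instead of raising `AttributeError`.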
@@ -3,7 +3,7 @@ name: bq_load_from_file
 kind: function
 lang: py
 domain: infra
-version: "1.0.0"
+version: "1.0.1"
 purity: impure
 signature: "def bq_load_from_file(client: BQClient, file_path: str, dataset_id: str, table_id: str, source_format: str = 'CSV', write_disposition: str = 'WRITE_APPEND', autodetect: bool = True, skip_leading_rows: int = 0) -> dict"
 description: "Loads data from a local file into a BigQuery table using the SDK's load_table_from_file. Equivalent to bq_load_from_gcs but for local disk."
@@ -73,3 +73,7 @@ cargar desde ahi es mas eficiente y permite paralelismo.

 The function blocks until the job finishes (`job.result()`). Parquet and
 Avro files do not support `skip_leading_rows`; that parameter applies only to CSV.
+
+## Capability growth log
+
+- v1.0.1 (2026-07-02): fix: `skip_leading_rows` is sent to the LoadJobConfig only when `source_format` is CSV; BigQuery rejected the job for JSON/Avro/Parquet even with a value of 0.
@@ -3,7 +3,7 @@ name: bq_load_from_gcs
 kind: function
 lang: py
 domain: infra
-version: "1.0.0"
+version: "1.0.1"
 purity: impure
 signature: "def bq_load_from_gcs(client: BQClient, uri: str | list[str], dataset_id: str, table_id: str, source_format: str = 'CSV', write_disposition: str = 'WRITE_APPEND', autodetect: bool = True, skip_leading_rows: int = 0) -> dict"
 description: "Loads data from one or more Google Cloud Storage URIs into a BigQuery table by configuring a LoadJob. Waits for the job to finish."
@@ -75,3 +75,7 @@ acepta la lista de archivos resultante como una sola carga atomica.
 `autodetect=True` is convenient but can infer types incorrectly for columns
 with null or mixed values. For production, define the schema explicitly via
 `job_config.schema`.
+
+## Capability growth log
+
+- v1.0.1 (2026-07-02): fix: `skip_leading_rows` is sent to the LoadJobConfig only when `source_format` is CSV; BigQuery rejected the job for JSON/Avro/Parquet even with a value of 0.
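The v1.0.1 fix comes down to one conditional: the field is omitted entirely for non-CSV formats rather than sent as 0. A minimal sketch using a plain dict in place of the real `LoadJobConfig` (the helper name `build_load_options` is hypothetical and exists only for this illustration):

```python
def build_load_options(source_format: str = "CSV",
                       write_disposition: str = "WRITE_APPEND",
                       autodetect: bool = True,
                       skip_leading_rows: int = 0) -> dict:
    """Hypothetical helper: a dict standing in for the load job config."""
    options = {
        "source_format": source_format,
        "write_disposition": write_disposition,
        "autodetect": autodetect,
    }
    # Only CSV may carry skip_leading_rows; for any other format the key is
    # left out altogether, because even a value of 0 counts as "set".
    if source_format == "CSV":
        options["skip_leading_rows"] = skip_leading_rows
    return options


print(build_load_options("CSV", skip_leading_rows=1))
print(build_load_options("PARQUET"))
```

Leaving the key out, instead of passing a default, is the whole point: the changelog entry notes the job was rejected for JSON/Avro/Parquet even when the value was 0.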
@@ -1,6 +1,7 @@
 """Cliente base para Google BigQuery."""

 from dataclasses import dataclass, field
+import google.auth
 from google.cloud import bigquery
 from google.oauth2 import service_account

@@ -27,7 +28,11 @@ class BQClient:
        self.close()


-def bq_auth(project_id: str = "", credentials_path: str = "") -> BQClient:
+def bq_auth(
+    project_id: str = "",
+    credentials_path: str = "",
+    drop_quota_project: bool = False,
+) -> BQClient:
    """Autentica contra Google BigQuery.

    Three authentication modes:
@@ -35,9 +40,18 @@ def bq_auth(project_id: str = "", credentials_path: str = "") -> BQClient:
    2. Service account JSON: with credentials_path
    3. Explicit project: with project_id (uses ADC for credentials)

+    With drop_quota_project=True (and no credentials_path) it resolves the ADC
+    credentials via google.auth.default and removes the quota project pinned in the
+    user's ADC (creds.with_quota_project(None)). This avoids the 403
+    USER_PROJECT_DENIED error when the ADC carries a quota_project_id foreign to the
+    project being queried.
+
    Args:
        project_id: GCP project ID. Empty = detect from credentials.
        credentials_path: Path to the service account JSON file. Empty = ADC.
+        drop_quota_project: If True and no credentials_path, resolves ADC with
+            google.auth.default and drops the quota project from the ADC
+            (with_quota_project(None)). Default False = original behavior.

    Returns:
        Authenticated BQClient, ready to use.
@@ -50,11 +64,19 @@ def bq_auth(project_id: str = "", credentials_path: str = "") -> BQClient:
        >>> client = bq_auth()  # ADC
        >>> client = bq_auth("my-project")  # ADC with explicit project
        >>> client = bq_auth(credentials_path="/path/to/sa.json")  # Service account
+        >>> client = bq_auth("autingo-159109", drop_quota_project=True)  # without a quota project
    """
    if credentials_path:
        creds = service_account.Credentials.from_service_account_file(credentials_path)
        proj = project_id or creds.project_id
        client = bigquery.Client(credentials=creds, project=proj)
+    elif drop_quota_project:
+        creds, adc_project = google.auth.default(
+            scopes=["https://www.googleapis.com/auth/bigquery"]
+        )
+        if hasattr(creds, "with_quota_project"):
+            creds = creds.with_quota_project(None)
+        client = bigquery.Client(project=project_id or adc_project, credentials=creds)
    elif project_id:
        client = bigquery.Client(project=project_id)
    else:
@@ -173,11 +173,14 @@ def bq_load_from_gcs(

    job_config = bigquery.LoadJobConfig(
        source_format=format_map.get(source_format, bigquery.SourceFormat.CSV),
-        write_disposition=disposition_map.get(source_format, bigquery.WriteDisposition.WRITE_APPEND),
+        write_disposition=disposition_map.get(write_disposition, bigquery.WriteDisposition.WRITE_APPEND),
        autodetect=autodetect,
-        skip_leading_rows=skip_leading_rows,
    )
-    job_config.write_disposition = disposition_map.get(write_disposition, bigquery.WriteDisposition.WRITE_APPEND)
+    # skip_leading_rows is valid only for CSV: BigQuery rejects the job
+    # ("Only CSV imports may specify leading rows to skip") if the field is
+    # set with any other format, even to 0.
+    if source_format == "CSV":
+        job_config.skip_leading_rows = skip_leading_rows

    table_ref = client._client.dataset(dataset_id).table(table_id)
    uris = uri if isinstance(uri, list) else [uri]
@@ -251,8 +254,12 @@ def bq_load_from_file(
        source_format=format_map.get(source_format, bigquery.SourceFormat.CSV),
        write_disposition=disposition_map.get(write_disposition, bigquery.WriteDisposition.WRITE_APPEND),
        autodetect=autodetect,
-        skip_leading_rows=skip_leading_rows,
    )
+    # skip_leading_rows is valid only for CSV: BigQuery rejects the job
+    # ("Only CSV imports may specify leading rows to skip") if the field is
+    # set with any other format, even to 0.
+    if source_format == "CSV":
+        job_config.skip_leading_rows = skip_leading_rows

    table_ref = client._client.dataset(dataset_id).table(table_id)

@@ -25,6 +25,7 @@ from .describe_numeric import describe_numeric
 from .summarize_categorical import summarize_categorical
 from .infer_semantic_type import infer_semantic_type
 from .column_quality_score import column_quality_score
+from .build_column_dictionary import build_column_dictionary
 from .select_groupby_keys import select_groupby_keys
 from .render_eda_markdown import render_eda_markdown
 from .detect_distribution_type import detect_distribution_type
@@ -77,8 +78,18 @@ from .add_pdf_internal_links import add_pdf_internal_links
 from .suggest_intratable_fk_candidates import suggest_intratable_fk_candidates
 from .render_paper_pdf import render_paper_pdf
 from .draw_join_graph_figure import draw_join_graph_figure
+from .generate_synthetic_eda_table import generate_synthetic_eda_table
+from .generate_synthetic_eda_folder import generate_synthetic_eda_folder
+from .load_bq_table_to_duckdb import load_bq_table_to_duckdb
+from .list_bq_dataset_tables import list_bq_dataset_tables
+from .forecast_seasonal_median import forecast_seasonal_median

 __all__ = [
+    "forecast_seasonal_median",
+    "load_bq_table_to_duckdb",
+    "list_bq_dataset_tables",
+    "generate_synthetic_eda_table",
+    "generate_synthetic_eda_folder",
    "render_paper_pdf",
    "draw_join_graph_figure",
    "suggest_intratable_fk_candidates",
@@ -135,6 +146,7 @@ __all__ = [
    "summarize_categorical",
    "infer_semantic_type",
    "column_quality_score",
+    "build_column_dictionary",
    "select_groupby_keys",
    "render_eda_markdown",
    "detect_distribution_type",
@@ -29,6 +29,7 @@ from .model import (  # noqa: F401
    KVTable,
    Markdown,
    Note,
+    TocEntry,
    as_blocks,
    as_chapters,
    merge_manifest,
@@ -52,6 +53,7 @@ __all__ = [
    "Group",
    "GlossaryEntry",
    "GlossaryCollector",
+    "TocEntry",
    "Chapter",
    "as_blocks",
    "as_chapters",
@@ -5,28 +5,32 @@ page (PDF) / slide (PPTX)**: every column is wrapped in a keep-together
 ``model.Group`` with ``page_break_before=True`` (except the first, which may share
 the intro's page), so its chart sits next to its tables and no column is split.

-A short intro names the clickable **[[term:entropia]]entropía[[/term]]** term —
-the full definition lives in the GLOSARIO chapter, so it is NOT repeated inline
-here (one click jumps to the glossary entry). The intro also carries the dataset
-row total used as a comparison baseline.
+Per column the Group is laid out ``side_by_side`` (PPTX: cardinality table LEFT,
+chart RIGHT; PDF: stacked) and contains, in order:

-Per column the Group contains, in order:
-
-1. A cardinality key/value table: distinct values, ``% distinct`` (distinct /
+1. The column name plus, when the LLM layer ran, its business **description** and
+   **unit** (read from ``profile['llm']['dictionary']``, matched by column name).
+2. A cardinality key/value table: distinct values, ``% distinct`` (distinct /
   total rows), total dataset rows, singleton values (frequency 1), entropy with
   its theoretical maximum and the normalized ratio, mode, imbalance and
   string-length stats.
-2. A short note flagging problematic cardinality (id-like ≈100% distinct, or a
+3. A short note flagging problematic cardinality (id-like ≈100% distinct, or a
   single dominating category).
-3. A ``top-k`` table (value / count / %).
-4. A **donut pie chart** of the most common categories (top-k + an "Otros"
+4. A ``top-k`` table (value / count / %).
+5. A **horizontal bar chart** of the most common categories (top-k + an "Otros"
   bucket), drawn lazily so the renderers scale it to fit entirely.

+A short intro names the clickable **[[term:entropia]]entropía[[/term]]** and
+**[[term:pagina_categorica]]page-layout[[/term]]** terms; their full
+definitions live in the GLOSARIO chapter, so they are NOT repeated inline here
+(one click jumps to the glossary entry). The intro does not carry the dataset row
+total: each column's cardinality table already reports it.
+
 Data comes from the ``eda`` group: each ``columns[i]['categorical']`` is the
 output of ``summarize_categorical`` (``top[{value,count,pct}]``, ``mode``,
 ``n_distinct``, ``entropy``, ``imbalance``, ``len_min/mean/max``). The derived
-cardinality metrics and the pie figure are delegated to two registry functions
-(``categorical_cardinality_block`` and ``categorical_top_pie_figure``); both are
+cardinality metrics and the bar figure are delegated to two registry functions
+(``categorical_cardinality_block`` and ``categorical_top_bar_figure``); both are
 imported lazily and degrade to a minimal inline fallback so this chapter never
 raises even if they are unavailable.

@@ -39,10 +43,21 @@ import math

 from .. import model

-CHAPTER_VERSION = "1.2.0"
+CHAPTER_VERSION = "1.3.0"
 CHAPTER_ID = "cat_distr"
 CHAPTER_TITLE = "Distribuciones categóricas"

+# Key under which eda_llm_insights stores its interpretive block in the profile.
+LLM_KEY = "llm"
+
+# Second glossary term this chapter names: "how each categorical page is laid
+# out". The long paragraph that used to describe it inline in the intro now lives
+# in the GLOSARIO chapter (canonical definition in ``glosario._BASELINE_TERMS``);
+# the intro only names the clickable term, relocating the explanation, not losing
+# it. The chapter only needs to register key+label here.
+_TERM_PAGINA_KEY = "pagina_categorica"
+_TERM_PAGINA_LABEL = "Cómo se organiza cada página categórica"
+
 # Glossary term this chapter explains. Registered in the shared collector and
 # marked clickable on its first appearance (end-to-end glossary example —
 # mejora 6). Other chapters hook their own terms the same way (see the contract).
@@ -59,14 +74,14 @@ _TERM_ENTROPIA_DEF = (
 # Cap the number of categorical columns rendered to keep the document bounded;
 # the rest are summarized in a closing note (no silent truncation).
 MAX_COLS = 40
-# Rows shown in each top-k table and explicit slices in the pie. Kept moderate so
-# the whole column — cardinality table + top-k table + donut — fits on ONE
+# Rows shown in each top-k table and explicit bars in the chart. Kept moderate so
+# the whole column — cardinality table + top-k table + bar chart — fits on ONE
 # page/slide with the chart next to its tables; the table note still reports
 # "top N of M" so nothing is silently hidden. For id-like columns (≈100%
 # distinct) the top-k table is dropped entirely (it would be a list of unique
-# values — pure noise), which also frees the room the donut needs (see build).
+# values — pure noise), which also frees the room the chart needs (see build).
 TOP_TABLE_ROWS = 8
-PIE_TOP_K = 6
+CHART_TOP_K = 6
 # Truncate very long category labels in tables (the renderer also wraps). Kept
 # tight so a column with long id-like values (names, tickets) still fits its page.
 LABEL_MAX = 28
@@ -208,26 +223,74 @@ def _fallback_cardinality(cat: dict, n_rows) -> dict:
    }


-def _pie_make(top, n_distinct, title, n_rows):
-    """Return a zero-arg callable that builds the donut figure lazily."""
+def _llm_index(profile: dict, ctx: dict) -> dict:
+    """Map column name -> its LLM dictionary entry (description/unit/...).
+
+    Reads the ``llm.dictionary`` list that ``eda_llm_insights`` stored in the
+    profile (``profile['llm']``; falls back to ``ctx['llm']``). Returns an empty
+    dict when ``run_llm`` did not run, so the caller degrades cleanly. Fully
+    defensive: never raises on malformed input.
+    """
+    llm = profile.get(LLM_KEY)
+    if not isinstance(llm, dict):
+        llm = ctx.get(LLM_KEY)
+    if not isinstance(llm, dict):
+        return {}
+    entries = llm.get("dictionary")
+    if not isinstance(entries, (list, tuple)):
+        return {}
+    index: dict = {}
+    for e in entries:
+        if not isinstance(e, dict):
+            continue
+        col = e.get("column")
+        if col is None:
+            continue
+        index[model._safe_str(col)] = e
+    return index
+
+
+def _llm_desc_unit_block(name: str, llm_index: dict):
+    """Markdown block with the LLM business description + unit of a column, or
+    None when no LLM entry matches the column (clean fallback without LLM)."""
+    entry = llm_index.get(model._safe_str(name))
+    if not isinstance(entry, dict):
+        return None
+    raw_desc = entry.get("description") or entry.get("business_meaning")
+    desc = " ".join(model._safe_str(raw_desc).split()) if raw_desc else ""
+    raw_unit = entry.get("unit")
+    unit = " ".join(model._safe_str(raw_unit).split()) if raw_unit else ""
+    parts = []
+    if desc:
+        parts.append(f"**Descripción:** {desc}")
+    if unit:
+        parts.append(f"**Unidad:** {unit}")
+    if not parts:
+        return None
+    return model.Markdown(text=" · ".join(parts))
+
+
+def _bar_make(top, n_distinct, title, n_rows):
+    """Return a zero-arg callable that builds the bar figure lazily."""

    def make():
        try:
-            from datascience.categorical_top_pie_figure import (
-                categorical_top_pie_figure,
+            from datascience.categorical_top_bar_figure import (
+                categorical_top_bar_figure,
            )

-            return categorical_top_pie_figure(
+            return categorical_top_bar_figure(
                top=top, n_distinct=n_distinct or 0, title=title,
-                top_k=PIE_TOP_K, n_rows=n_rows)
+                top_k=CHART_TOP_K, n_rows=n_rows)
        except Exception:  # noqa: BLE001 — minimal local fallback figure.
-            return _fallback_pie(top, title)
+            return _fallback_bar(top, title)

    return make


-def _fallback_pie(top, title):
-    """Minimal donut figure used only if the registry function is unavailable."""
+def _fallback_bar(top, title):
+    """Minimal horizontal-bar figure used only if the registry function is
+    unavailable. Largest category on top, the rest folded into "Otros"."""
    import matplotlib

    matplotlib.use("Agg")
@@ -238,8 +301,8 @@ def _fallback_pie(top, title):
    items = [t for t in (top or [])
             if isinstance(t, dict) and isinstance(t.get("count"), (int, float))]
    items = sorted(items, key=lambda t: t.get("count") or 0, reverse=True)
-    head = items[:PIE_TOP_K]
-    rest = items[PIE_TOP_K:]
+    head = items[:CHART_TOP_K]
+    rest = items[CHART_TOP_K:]
    labels = [_truncate(t.get("value"), 20) for t in head]
    sizes = [float(t.get("count") or 0) for t in head]
    if rest:
@@ -249,10 +312,13 @@ def _fallback_pie(top, title):
        ax.text(0.5, 0.5, "sin datos categóricos", ha="center", va="center")
        ax.axis("off")
        return fig
-    ax.pie(sizes, labels=None, wedgeprops={"width": 0.42},
-           autopct=lambda p: f"{p:.0f}%" if p >= 4 else "")
-    ax.legend(labels, loc="center left", bbox_to_anchor=(1.0, 0.5),
-              fontsize=7, frameon=False)
+    # barh draws bottom-up, so reverse to put the largest category on top.
+    y_pos = range(len(labels))
+    ax.barh(list(y_pos), list(reversed(sizes)), color="#4C72B0",
+            edgecolor="white")
+    ax.set_yticks(list(y_pos))
+    ax.set_yticklabels(list(reversed(labels)), fontsize=7)
+    ax.set_xlabel("conteo", fontsize=8)
    ax.set_title(_truncate(title, 40))
    fig.tight_layout()
    return fig
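The data preparation behind the fallback bar chart (keep the top-k, fold the remainder into one bucket, then reverse because `barh` draws bottom-up) can be shown without matplotlib. This is a sketch under assumptions: `fold_top_k` and `barh_order` are hypothetical helper names, and the `"Otros"` label for the folded bucket follows the docstring rather than code visible in this hunk.

```python
def fold_top_k(top, top_k=6, other_label="Otros"):
    """Keep the top_k largest categories and fold the rest into one bucket."""
    items = [t for t in (top or [])
             if isinstance(t, dict) and isinstance(t.get("count"), (int, float))]
    items.sort(key=lambda t: t["count"], reverse=True)
    head, rest = items[:top_k], items[top_k:]
    labels = [str(t.get("value")) for t in head]
    sizes = [float(t["count"]) for t in head]
    if rest:
        labels.append(other_label)
        sizes.append(float(sum(t["count"] for t in rest)))
    return labels, sizes


def barh_order(labels, sizes):
    """Reverse both lists so the largest category ends up on top in barh."""
    return list(reversed(labels)), list(reversed(sizes))


top = [
    {"value": "a", "count": 5},
    {"value": "b", "count": 9},
    {"value": "c", "count": 1},
    {"value": "x", "count": "bad"},   # malformed entry, silently skipped
]
labels, sizes = fold_top_k(top, top_k=2)
print(labels, sizes)
print(barh_order(labels, sizes))
```

Reversing labels and sizes together keeps each bar aligned with its tick label; reversing only one of them would mislabel every bar.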
@@ -373,22 +439,17 @@ def _topk_table(cat: dict):
                           note=note)


-def _intro_blocks(n_rows, mark_term: bool = False):
-    total = _fmt_int(n_rows)
-    # Mark the first appearance of the term as a clickable glossary jump when the
-    # term was registered (mark_term). The full definition of entropy lives in the
-    # GLOSARIO chapter, so the intro only names the clickable term here instead of
-    # repeating the long explanation (avoids the redundancy with the glossary).
+def _intro_blocks(mark_term: bool = False):
+    # The full explanation of entropy AND of how each categorical page is laid out
+    # lives in the GLOSARIO chapter; the chapter body keeps only the minimal
+    # clickable terms — no descriptive prose — to avoid duplicating the glossary.
+    # The dataset row total is not repeated here: each column's cardinality table
+    # already carries "Total filas (dataset)".
    entropia = ("[[term:entropia]]entropía[[/term]]" if mark_term
                else "entropía")
-    text = (
-        f"Cada columna categórica ocupa su propia página: sus métricas de "
-        f"cardinalidad —incluida la {entropia}—, una nota que señala cardinalidad "
-        "problemática, la tabla de las categorías más frecuentes y un gráfico de "
-        "tarta (donut) de las más comunes, todo junto."
-    )
-    if n_rows is not None:
-        text += f" El dataset tiene {total} filas en total como referencia."
+    pagina = ("[[term:pagina_categorica]]cómo se organiza cada página[[/term]]"
+              if mark_term else "cómo se organiza cada página")
+    text = f"Términos: {entropia} · {pagina}."
    return [
        model.Heading(text="Entropía y cardinalidad", level=2),
        model.Markdown(text=text),
@@ -406,15 +467,22 @@ def build_cat_distr(profile: dict, ctx: dict):
        return None

    n_rows = profile.get("n_rows")
-    # Register "entropía" in the shared glossary collector (if present) and mark
-    # its first appearance clickable. End-to-end glossary example (mejora 6).
+    # Register "entropía" and the "how each categorical page is laid out" term in
+    # the shared glossary collector (if present) and mark their first appearance
+    # clickable. End-to-end glossary example (mejora 6).
    glossary = ctx.get("glossary")
    mark_term = False
    if isinstance(glossary, model.GlossaryCollector):
        glossary.add(_TERM_ENTROPIA_KEY, _TERM_ENTROPIA_LABEL,
                     _TERM_ENTROPIA_DEF)
+        glossary.add(_TERM_PAGINA_KEY, _TERM_PAGINA_LABEL)
        mark_term = True
-    blocks = list(_intro_blocks(n_rows, mark_term=mark_term))
+    blocks = list(_intro_blocks(mark_term=mark_term))
+
+    # Business description + unit per column come from the LLM dictionary
+    # (profile['llm']['dictionary'], matched by column name); absent without
+    # run_llm, in which case the per-column description block is simply omitted.
+    llm_index = _llm_index(profile, ctx)

    rendered = cat_cols[:MAX_COLS]
    for idx, col in enumerate(rendered):
@@ -422,31 +490,36 @@ def build_cat_distr(profile: dict, ctx: dict):
        cat = col.get("categorical") or {}
        card = _normalize_card(_cardinality(cat, n_rows))

-        # One Group per categorical column: heading + cardinality table + flag
-        # note + top-k table + donut figure are kept together and the renderer
-        # starts each on a fresh page/slide (page_break_before) so every column
-        # gets its own page with its chart next to its tables. The first column
-        # may share the intro's page (no forced break) to avoid a near-empty page.
-        col_blocks = [
-            model.Heading(text=str(name), level=2),
-            _cardinality_block(card),
-        ]
+        # One Group per categorical column: heading + (optional) LLM description +
+        # cardinality table + flag note + top-k table + bar figure are kept
+        # together and the renderer starts each on a fresh page/slide
+        # (page_break_before) so every column gets its own page with its chart next
+        # to its tables. The first column may share the intro's page (no forced
+        # break) to avoid a near-empty page.
+        col_blocks = [model.Heading(text=str(name), level=2)]
+        desc_block = _llm_desc_unit_block(name, llm_index)
+        if desc_block is not None:
+            col_blocks.append(desc_block)
+        col_blocks.append(_cardinality_block(card))
        note = _flag_note(card)
        if note is not None:
            col_blocks.append(note)
        # For id-like columns (≈100% distinct) the top-k is a list of unique
        # values — pure noise; skip it (the flag note already explains why) and
-        # let the donut take that room so the whole column fits one page/slide.
+        # let the bar chart take that room so the whole column fits one page/slide.
        if not card.get("id_like"):
            topk = _topk_table(cat)
            if topk is not None:
                col_blocks.append(topk)
        col_blocks.append(model.Figure(
-            make=_pie_make(cat.get("top") or [], card.get("n_distinct"),
+            make=_bar_make(cat.get("top") or [], card.get("n_distinct"),
                           str(name), n_rows),
            caption=(f"Categorías más comunes de «{_truncate(name, 32)}» "
-                     "(donut: top-k + «Otros»)")))
-        blocks.append(model.Group(blocks=col_blocks,
+                     "(barras: top-k + «Otros»)")))
+        # layout="side_by_side": in PPTX the cardinality table goes to the LEFT and
+        # the bar chart to the RIGHT of the same slide; the PDF renderer stacks the
+        # blocks (the A5 mobile page is too narrow for two readable columns).
+        blocks.append(model.Group(blocks=col_blocks, layout="side_by_side",
                                  page_break_before=(idx > 0)))

    if len(cat_cols) > len(rendered):
@@ -2,12 +2,14 @@

 Self-contained: builds synthetic TableProfiles (no DuckDB) so the suite is fast
 and deterministic. Verifies that ``build_cat_distr`` emits the blocks the user
-asked for (distinct/total/%-distinct/unique metrics, top-k table and a donut
+asked for (distinct/total/%-distinct/unique metrics, top-k table and a bar
 figure), that EACH categorical column is wrapped in its own keep-together
-``Group`` that starts on a fresh page/slide (one column per page, chart next to
-its tables), that the long entropy explanation is NOT repeated inline (it lives
-in the glossary — only the clickable term is kept), that the chapter renders
-inside the full document to both PDF and PPTX showing that content, that a
+``Group`` laid out ``side_by_side`` (PPTX: table left / bars right) that starts on
+a fresh page/slide (one column per page, chart next to its tables), that the LLM
+business description + unit are shown per column when the profile carries an LLM
+block, that the long entropy / page-layout explanations are NOT repeated inline
+(they live in the glossary — only the clickable terms are kept), that the chapter
+renders inside the full document to both PDF and PPTX showing that content, that a
 profile with no categorical columns yields ``None`` without raising, and that
 long labels / many columns are never cut in either output.
 """
@@ -116,6 +118,10 @@ def test_golden_build_cat_distr_emite_bloques_pedidos():
    assert "log2" not in md.text          # redundant explanation removed.
    assert "máxima diversidad" not in md.text

+    # The donut/pie is gone: the intro no longer mentions tarta/donut (the chart
+    # is now a bar chart; the long page-layout explanation moved to the glossary).
+    assert "donut" not in md.text and "tarta" not in md.text
+
    # Per-column blocks are wrapped in keep-together Groups: flatten to inspect.
    flat = _flatten(ch.blocks)
    kv = next(b for b in flat if isinstance(b, KVTable))
@@ -128,11 +134,13 @@ def test_golden_build_cat_distr_emite_bloques_pedidos():
    assert any("Entropía" in lbl for lbl in labels)
    assert "únicos" in values and "%" in values
    assert "bits" in values and "norm" in values   # entropy + max + normalized.
-    # Top-k table + pie figure.
+    # Top-k table + bar figure.
    dt = next(b for b in flat if isinstance(b, DataTable))
    assert dt.header == ["Valor", "Conteo", "%"]
    assert any("neumaticos" in str(cell) for row in dt.rows for cell in row)
    assert any(isinstance(b, Figure) for b in flat)
+    # Each per-column Group is laid out side_by_side (table left / bars right).
+    assert all(g.layout == "side_by_side" for g in _column_groups(ch))
    # id-like column flagged with a Note that also explains the top-k is dropped.
    idnote = next((b for b in flat
                   if isinstance(b, Note) and "identificador" in b.text), None)
@@ -140,9 +148,9 @@ def test_golden_build_cat_distr_emite_bloques_pedidos():
    assert "No se lista el top" in idnote.text


-def test_golden_idlike_omite_topk_y_conserva_donut():
+def test_golden_idlike_omite_topk_y_conserva_grafico():
    # The id-like column (uuid, 100% distinct) must NOT carry a top-k DataTable
-    # (it would be a list of unique values), but must still keep its donut Figure
+    # (it would be a list of unique values), but must still keep its bar Figure
    # and its cardinality table so it stays a full per-column page.
    ch = build_cat_distr(_profile(), {})
    groups = _column_groups(ch)
@@ -151,7 +159,7 @@ def test_golden_idlike_omite_topk_y_conserva_donut():
    kinds = [b.kind for b in uuid_group.blocks]
    assert "data_table" not in kinds      # top-k of unique values dropped.
    assert "kv_table" in kinds            # cardinality kept.
-    assert "figure" in kinds              # donut kept (chart per column).
+    assert "figure" in kinds              # bar chart kept (chart per column).
    # A non-id-like column keeps its top-k table.
    cat_group = next(g for g in groups
                     if any(getattr(b, "text", "") == "categoria"
@@ -205,7 +213,7 @@ def test_golden_render_pdf_una_pagina_por_columna():
        assert "Entrop" in txt
        assert "distintos" in txt
        assert "categoria" in txt and "neumaticos" in txt
-        assert "donut" in txt           # figure caption rendered as text.
+        assert "barras" in txt          # bar-chart caption rendered as text (PDF).
        assert "identificador" in txt   # id-like note rendered.


@@ -258,9 +266,11 @@ def _profile_high_card() -> dict:


 def test_golden_pptx_una_slide_por_columna_con_su_grafico():
-    """Each categorical column occupies EXACTLY ONE cat_distr slide that carries
-    BOTH its cardinality table and its donut figure (picture) — i.e. the chart is
-    never separated from its table, even for a high-cardinality column."""
+    """Each categorical column occupies EXACTLY ONE cat_distr slide that carries
+    its chart (picture) on that same slide: the chart is never separated from its
+    column, even for a high-cardinality column. With layout side_by_side the table
+    is rasterized to an image, so the check looks for a picture shape (not for the
+    table's text)."""
    from pptx.enum.shapes import MSO_SHAPE_TYPE

    prof = _profile_high_card()
@@ -272,7 +282,7 @@ def test_golden_pptx_una_slide_por_columna_con_su_grafico():
        prs = Presentation(out)

        # Per column: the cat_distr slides whose text mentions it, and whether the
-        # owning slide also has the donut caption + an actual picture shape.
+        # owning slide also carries an actual picture shape (its chart).
        slides_with_col = {n: [] for n in cat_names}
        owner_has_chart = {n: False for n in cat_names}
        for i, sl in enumerate(prs.slides):
@@ -288,15 +298,106 @@ def test_golden_pptx_una_slide_por_columna_con_su_grafico():
            for n in cat_names:
                if n in txt:
                    slides_with_col[n].append(i)
-                    has_table = "Cardinalidad" in txt or "distintos" in txt
-                    if has_pic and "donut" in txt and has_table:
+                    if has_pic:
                        owner_has_chart[n] = True

        for n in cat_names:
            # Exactly one slide carries the column (not split across slides).
            assert len(slides_with_col[n]) == 1, (n, slides_with_col[n])
-            # That single slide also holds its table AND its donut picture.
-            assert owner_has_chart[n], (n, "tabla y donut no están en el mismo slide")
+            # That single slide also holds its chart picture.
+            assert owner_has_chart[n], (n, "the chart is not on the column's slide")
+
+
+def test_golden_pptx_columna_side_by_side_tabla_izq_barra_der():
+    """With layout side_by_side, a categorical column places its cardinality
+    table (image) in the left half and its bar chart (image) in the right half of
+    the SAME slide. Verifies that at least one column ends up in two columns
+    (table-left / bars-right), evidence of side_by_side in PPTX."""
+    from pptx.enum.shapes import MSO_SHAPE_TYPE
+    from pptx.util import Inches
+
+    with tempfile.TemporaryDirectory() as d:
+        out = os.path.join(d, "eda.pptx")
+        render_automatic_eda_pptx(_profile(), out, {"title": "EDA"})
+        prs = Presentation(out)
+        centre = int(Inches(13.333 / 2.0))   # half of the 16:9 slide width.
+        two_col_slides = 0
+        for sl in prs.slides:
+            texts, lefts = [], []
+            for sh in sl.shapes:
+                if sh.has_text_frame:
+                    texts.append(sh.text_frame.text)
+                if (sh.shape_type == MSO_SHAPE_TYPE.PICTURE
+                        and sh.left is not None):
+                    lefts.append(sh.left)
+            txt = re.sub(r"\s+", " ", " ".join(texts))
+            if "Distribuciones categ" not in txt:
+                continue
+            # One picture starts in the left half, another in the right half.
+            if len(lefts) >= 2 and min(lefts) < centre and max(lefts) > centre:
+                two_col_slides += 1
+        assert two_col_slides >= 1, (
+            "no column ended up table-left / bars-right (side_by_side)")
+
+
+def _profile_with_llm() -> dict:
+    """The base profile plus an ``llm`` block (as eda_llm_insights would store it
+    with run_llm=True): a data dictionary with description/unit per column."""
+    prof = _profile()
+    prof["llm"] = {
+        "dictionary": [
+            {"column": "categoria",
+             "description": "Familia de producto del recambio",
+             "business_meaning": "Agrupa el catálogo por tipo de pieza",
+             "unit": "categoría"},
+            {"column": "uuid",
+             "description": "Identificador único de registro",
+             "unit": ""},
+        ],
+    }
+    return prof
+
+
+def test_llm_descripcion_y_unidad_por_columna():
+    # With an LLM dictionary, each categorical column whose name matches shows its
+    # business description and unit in a per-column markdown block.
+    ch = build_cat_distr(_profile_with_llm(), {})
+    groups = _column_groups(ch)
+    cat_group = next(g for g in groups
+                     if any(getattr(b, "text", "") == "categoria"
+                            for b in g.blocks))
+    md = " ".join(b.text for b in cat_group.blocks
+                  if getattr(b, "kind", "") == "markdown")
+    assert "Descripción" in md and "Familia de producto" in md
+    assert "Unidad" in md and "categoría" in md
+
+
+def test_edge_sin_llm_no_anade_descripcion():
+    # Without an LLM block the per-column description markdown is simply omitted;
+    # the column still renders its cardinality table and bar figure.
+    ch = build_cat_distr(_profile(), {})
+    for g in _column_groups(ch):
+        mds = [b.text for b in g.blocks if getattr(b, "kind", "") == "markdown"]
+        assert not any("Descripción" in t for t in mds)
+
+
+def test_pagina_categorica_clicable_y_definicion_en_glosario():
+    # The "how each categorical page is laid out" term is registered + marked
+    # clickable in the intro, and its full definition lands in the glossary
+    # chapter (canonical baseline catalog), not inline.
+    from datascience.automatic_eda.chapters.glosario import build_glosario
+
+    gc = GlossaryCollector()
+    ch = build_cat_distr(_profile(), {"glossary": gc})
+    md = next(b for b in ch.blocks if isinstance(b, Markdown))
+    assert "[[term:pagina_categorica]]" in md.text
+    assert gc.has("pagina_categorica")
+    glos = build_glosario(_profile(), {"glossary": gc})
+    entry = next(b for b in glos.blocks
+                 if getattr(b, "kind", "") == "glossary_entry"
+                 and b.key == "pagina_categorica")
+    assert "barras" in entry.definition
+    assert "identificador" in entry.definition


 def test_edge_sin_categoricas_devuelve_none():
@@ -17,10 +17,63 @@ from __future__ import annotations

 from .. import model

-CHAPTER_VERSION = "1.0.0"
+CHAPTER_VERSION = "1.1.0"
 CHAPTER_ID = "glosario"
 CHAPTER_TITLE = "Glosario"

+# Canonical definitions for cross-cutting terms — the "how to read it" entries
+# that do not belong to a single chapter. A chapter only needs to *register* the
+# term (``ctx['glossary'].add(key, label)``) and mark its in-text appearance with
+# ``[[term:key]]…[[/term]]``; this chapter supplies the full definition here when
+# the collector carries the term without one. Keeping the prose in a single place
+# avoids repeating a long paragraph inline in every chapter that names the term
+# (the explanations moved out of the NUM DISTR and CAT DISTR intros live here).
+_BASELINE_TERMS = {
+    "histograma_boxplot": {
+        "label": "Cómo leer el histograma y el boxplot",
+        "definition": (
+            "Para cada columna numérica se muestra su histograma con tres líneas "
+            "de referencia: la media (línea roja discontinua), la mediana (línea "
+            "verde continua) y la banda ±1σ (zona sombreada que cubre una "
+            "desviación estándar a cada lado de la media). Debajo, alineado al "
+            "mismo eje horizontal, un boxplot de Tukey: la caja abarca del primer "
+            "al tercer cuartil (P25–P75), la línea interior es la mediana y los "
+            "bigotes llegan hasta 1,5·IQR; los puntos rojos señalan que hay "
+            "valores más allá de las vallas (posibles atípicos). Comparar la media "
+            "con la mediana revela la asimetría: si la media supera a la mediana la "
+            "cola larga cae hacia los valores altos (asimetría a la derecha), y al "
+            "revés hacia los bajos."),
+    },
+    "pagina_categorica": {
+        "label": "Cómo se organiza cada página categórica",
+        "definition": (
+            "Cada columna categórica ocupa su propia página: muestra sus métricas "
+            "de cardinalidad —incluida la entropía—, una nota que señala "
+            "cardinalidad problemática (columnas que se comportan como "
+            "identificador, con casi todos los valores distintos, o dominadas por "
+            "una sola categoría), la tabla de las categorías más frecuentes (top-k, "
+            "con su conteo y porcentaje) y un gráfico de barras de las categorías "
+            "más comunes (top-k más una barra «Otros» que agrupa la cola). El total "
+            "de filas del dataset se usa como referencia para interpretar los "
+            "conteos."),
+    },
+}
+
+
+def _resolve_term(term: dict) -> tuple:
+    """Return (label, definition) for a collected term, completing a missing
+    definition (and, if absent, the label) from the canonical baseline catalog."""
+    key = model._safe_str(term.get("key"))
+    label = model._safe_str(term.get("label"))
+    definition = model._safe_str(term.get("definition"))
+    base = _BASELINE_TERMS.get(key)
+    if base:
+        if not definition.strip():
+            definition = model._safe_str(base.get("definition"))
+        if not label.strip() or label == key:
+            label = model._safe_str(base.get("label")) or label
+    return label, definition
+

 def build_glosario(profile: dict, ctx: dict):
    """Build the glossary Chapter from the shared collector, or None if empty."""
@@ -36,12 +89,14 @@ def build_glosario(profile: dict, ctx: dict):
            "Cada término va resaltado en el texto y, al pulsarlo, salta a su "
            "definición en esta sección.")),
    ]
-    # One clickable destination per term, alphabetically by visible label.
+    # One clickable destination per term, alphabetically by visible label. A term
+    # registered without a definition is completed from the canonical baseline.
    for term in glossary.terms(by="label"):
+        label, definition = _resolve_term(term)
        blocks.append(model.GlossaryEntry(
            key=model._safe_str(term.get("key")),
-            label=model._safe_str(term.get("label")),
-            definition=model._safe_str(term.get("definition"))))
+            label=label,
+            definition=definition))

    return model.Chapter(id=CHAPTER_ID, title=CHAPTER_TITLE,
                         version=CHAPTER_VERSION, blocks=blocks)
@@ -35,10 +35,21 @@ try:
 except Exception:  # noqa: BLE001 — keep the chapter importable no matter what.
    build_boxplot_stats = None  # type: ignore[assignment]

-CHAPTER_VERSION = "1.2.0"
+CHAPTER_VERSION = "1.3.0"
 CHAPTER_ID = "num_distr"
 CHAPTER_TITLE = "Distribuciones numéricas"

+# Glossary term this chapter explains. The long "how to read the histogram and
+# the boxplot" paragraph used to live inline in the intro; it now lives in the
+# GLOSARIO chapter (canonical definition in ``glosario._BASELINE_TERMS``) and the
+# intro only names the clickable term — one click jumps to the full explanation,
+# so the information is relocated, not lost (mejora glosario).
+_TERM_HISTOBOX_KEY = "histograma_boxplot"
+_TERM_HISTOBOX_LABEL = "Cómo leer el histograma y el boxplot"
+
+# Key under which eda_llm_insights stores its interpretive block in the profile.
+LLM_KEY = "llm"
+
 # Plain-Spanish gloss for every label ``detect_distribution_type`` can emit, so a
 # non-expert reader understands the shape and the suggested next step (MUST-4.3).
 _DIST_GLOSS = {
@@ -99,6 +110,53 @@ def _numeric_columns(profile: dict) -> list:
    return out


+def _llm_index(profile: dict, ctx: dict) -> dict:
+    """Map column name -> its LLM dictionary entry (description/unit/...).
+
+    Reads the ``llm.dictionary`` list that ``eda_llm_insights`` stored in the
+    profile (``profile['llm']``; falls back to ``ctx['llm']``). Returns an empty
+    dict when ``run_llm`` did not run, so the caller degrades cleanly. Fully
+    defensive: never raises on malformed input.
+    """
+    llm = profile.get(LLM_KEY)
+    if not isinstance(llm, dict):
+        llm = ctx.get(LLM_KEY)
+    if not isinstance(llm, dict):
+        return {}
+    entries = llm.get("dictionary")
+    if not isinstance(entries, (list, tuple)):
+        return {}
+    index: dict = {}
+    for e in entries:
+        if not isinstance(e, dict):
+            continue
+        col = e.get("column")
+        if col is None:
+            continue
+        index[model._safe_str(col)] = e
+    return index
+
+
+def _llm_desc_unit_block(name: str, llm_index: dict):
+    """Markdown block with the LLM business description + unit of a column, or
+    None when no LLM entry matches the column (clean fallback without LLM)."""
+    entry = llm_index.get(model._safe_str(name))
+    if not isinstance(entry, dict):
+        return None
+    raw_desc = entry.get("description") or entry.get("business_meaning")
+    desc = " ".join(model._safe_str(raw_desc).split()) if raw_desc else ""
+    raw_unit = entry.get("unit")
+    unit = " ".join(model._safe_str(raw_unit).split()) if raw_unit else ""
+    parts = []
+    if desc:
+        parts.append(f"**Descripción:** {desc}")
+    if unit:
+        parts.append(f"**Unidad:** {unit}")
+    if not parts:
+        return None
+    return model.Markdown(text=" · ".join(parts))
+
+
 def _make_hist_box(name: str, numeric: dict, box: dict):
    """Build the histogram (with mean/median/±σ lines) + boxplot figure.

@@ -271,15 +329,26 @@ def build_num_distr(profile: dict, ctx: dict):
    if not numerics:
        return None  # chapter does not apply to a dataset with no numerics.

+    # Register the "how to read the histogram and boxplot" term in the shared
+    # glossary collector (if present) and mark its first appearance clickable. The
+    # full explanation (colour code, 1,5·IQR rule, asymmetry reading) lives in the
+    # GLOSARIO chapter instead of inline here: the intro only names the term.
+    glossary = ctx.get("glossary")
+    mark_term = False
+    if isinstance(glossary, model.GlossaryCollector):
+        glossary.add(_TERM_HISTOBOX_KEY, _TERM_HISTOBOX_LABEL)
+        mark_term = True
+    como_leer = ("[[term:histograma_boxplot]]cómo leer estos gráficos[[/term]]"
+                 if mark_term else "cómo leer estos gráficos")
    intro = (
-        "Para cada columna numérica se muestra su **histograma** con tres líneas "
-        "de referencia: la **media** (línea roja discontinua), la **mediana** "
-        "(línea verde continua) y la banda **±1σ** (zona sombreada). Debajo, "
-        "alineado al mismo eje, un **boxplot de Tukey**: la caja abarca del "
-        "primer al tercer cuartil (P25–P75), la línea interior es la mediana y "
-        "los bigotes llegan hasta 1,5·IQR; los puntos rojos señalan que hay "
-        "valores más allá de las vallas. Comparar media y mediana revela la "
-        "asimetría de la distribución.")
+        "Cada columna numérica muestra su **histograma** (con la **media**, la "
+        "**mediana** y la banda **±1σ**) y, debajo y al mismo eje, su **boxplot "
+        f"de Tukey** — {como_leer}.")
+
+    # Business description + unit per column come from the LLM dictionary
+    # (profile['llm']['dictionary'], matched by column name); absent without
+    # run_llm, in which case the per-column description block is simply omitted.
+    llm_index = _llm_index(profile, ctx)

    blocks = [
        model.Heading(text=CHAPTER_TITLE, level=1),
@@ -293,17 +362,20 @@ def build_num_distr(profile: dict, ctx: dict):
                box = build_boxplot_stats(numeric) or {}
            except Exception:  # noqa: BLE001 — degrade, never raise.
                box = {}
-        # Keep the column heading, its figure and its stats note together on the
-        # same page/slide (mejora 3 — keep-together): the renderers measure the
-        # whole Group and move it whole when it would not fit.
-        blocks.append(model.Group(blocks=[
-            model.Heading(text=str(name), level=2),
-            model.Figure(
+        # Keep the column heading, its (optional) LLM description, its figure and
+        # its stats note together on the same page/slide (mejora 3 —
+        # keep-together): the renderers measure the whole Group and move it whole
+        # when it would not fit.
+        col_blocks = [model.Heading(text=str(name), level=2)]
+        desc_block = _llm_desc_unit_block(name, llm_index)
+        if desc_block is not None:
+            col_blocks.append(desc_block)
+        col_blocks.append(model.Figure(
            make=_figure_maker(name, numeric, box),
            caption=f"Distribución de «{name}» — histograma "
-                        f"(media/mediana/±σ) y boxplot."),
-            model.Markdown(text=_stats_note(name, numeric, box)),
-        ]))
+                    "(media/mediana/±σ) y boxplot."))
+        col_blocks.append(model.Markdown(text=_stats_note(name, numeric, box)))
+        blocks.append(model.Group(blocks=col_blocks))

    return model.Chapter(id=CHAPTER_ID, title=CHAPTER_TITLE,
                         version=CHAPTER_VERSION, blocks=blocks)
@@ -101,7 +101,7 @@ def test_golden_chapter_estructura_y_bloques():


 def test_golden_media_mediana_sigma_y_boxplot_presentes():
-    # The intro documents the three reference lines and the Tukey boxplot; the
+    # The short intro names the three reference lines and the Tukey boxplot; the
    # per-column note carries the actual mean/median/σ numbers and the shape.
    ch = build_num_distr(_profile(n_numeric=1, extra_categorical=False), {})
    md_texts = " ".join(b.text for b in _flatten(ch.blocks)
@@ -110,10 +110,58 @@ def test_golden_media_mediana_sigma_y_boxplot_presentes():
    assert "±1σ" in md_texts or "σ" in md_texts
    assert "boxplot" in md_texts.lower()
    assert "Tukey" in md_texts
+    # The long "how to read it" explanation moved to the glossary: the colour-code
+    # / 1,5·IQR walkthrough is no longer inline in the chapter body.
+    assert "1,5·IQR" not in md_texts
+    assert "línea roja" not in md_texts
    # distribution_type gloss surfaced for the column (right-skewed preset).
    assert _DIST_GLOSS["right-skewed"].split(";")[0][:20] in md_texts


+def test_glosario_histograma_boxplot_clicable_y_definicion():
+    # With a glossary collector the intro marks the clickable term and the FULL
+    # explanation (the long paragraph removed from the body) lands in the glossary.
+    from datascience.automatic_eda.chapters.glosario import build_glosario
+
+    gc = model.GlossaryCollector()
+    prof = _profile(n_numeric=1, extra_categorical=False)
+    ch = build_num_distr(prof, {"glossary": gc})
+    intro = next(b for b in ch.blocks if b.kind == "markdown")
+    assert "[[term:histograma_boxplot]]" in intro.text
+    assert gc.has("histograma_boxplot")
+    glos = build_glosario(prof, {"glossary": gc})
+    entry = next(b for b in glos.blocks
+                 if getattr(b, "kind", "") == "glossary_entry"
+                 and b.key == "histograma_boxplot")
+    assert "boxplot" in entry.definition.lower()
+    assert "1,5·IQR" in entry.definition
+
+
+def test_llm_descripcion_y_unidad_por_columna():
+    # With an LLM dictionary, each numeric column whose name matches shows its
+    # business description and unit in a per-column markdown block.
+    prof = _profile(n_numeric=2)
+    prof["llm"] = {"dictionary": [
+        {"column": "precio", "description": "Precio de venta del producto",
+         "unit": "EUR"},
+        {"column": "alcohol", "business_meaning": "Grado alcohólico",
+         "unit": "% vol"},
+    ]}
+    ch = build_num_distr(prof, {})
+    md_all = " ".join(b.text for b in _flatten(ch.blocks)
+                      if b.kind == "markdown")
+    assert "Precio de venta" in md_all and "EUR" in md_all
+    assert "Grado alcohólico" in md_all and "% vol" in md_all
+
+
+def test_edge_sin_llm_no_anade_descripcion():
+    # Without an LLM block the per-column description markdown is simply omitted.
+    ch = build_num_distr(_profile(n_numeric=2), {})
+    md_all = " ".join(b.text for b in _flatten(ch.blocks)
+                      if b.kind == "markdown")
+    assert "Descripción" not in md_all
+
+
 def test_boxplot_stats_se_consumen_del_registry():
    # The chapter must feed build_boxplot_stats (group eda) and the resulting
    # box must carry the Tukey fences for the figure.
@@ -7,11 +7,21 @@ as needed, the renderers paginate):
   NOT carry the raw head, so this is read from ``ctx['head_rows']`` /
   ``profile['head_rows']`` (a list of row dicts). When absent the chapter shows
   an honest placeholder documenting the missing key instead of inventing data.
-2. Column dictionary — name / type / nulls / non-null examples. Examples come
+2. Column dictionary — name / type / nulls / non-null examples plus, when the
+   LLM layer ran, the business **description** and **unit** of each column so the
+   reader knows at a glance what every column is and in which unit. Examples come
   from ``columns[i]['examples']`` when present; otherwise they are derived from
   real non-null profile values (categorical top values, numeric min/median/max)
   so the cell is never empty nor fabricated.
-3. ``df.describe`` — mean / median / min / max / std for every numeric column.
+3. ``df.describe`` — mean / median / min / max / std for every numeric column,
+   plus its **unit** (same LLM source) so the stats read in context.
+
+The description/unit come from the ``llm`` block that ``eda_llm_insights`` (group
+``eda``) already stored in the profile (``profile['llm']['dictionary']``, a list
+of ``{"column","description","business_meaning","unit"}`` entries) — this chapter
+only **consumes** it, matching by column name; it never calls the LLM nor
+recomputes anything. When the block is absent (``run_llm`` did not run) those
+cells degrade to ``"—"`` and the tables still render.

 Contract: build_<id>(profile, ctx) -> Chapter | None ; CHAPTER_VERSION = "x.y.z".
 """
@@ -20,13 +30,59 @@ from __future__ import annotations

 from .. import model

-CHAPTER_VERSION = "1.1.0"
+CHAPTER_VERSION = "1.2.0"
 CHAPTER_ID = "overview"
 CHAPTER_TITLE = "Overview"

 # Profile/ctx keys the calculation phase must add for a full head + examples.
 HEAD_KEY = "head_rows"          # list[dict] — df.head(n)
 EXAMPLES_KEY = "examples"       # per column: list of non-null sample values
+LLM_KEY = "llm"                 # interpretive block from eda_llm_insights
+
+
+def _llm_dict_index(profile: dict, ctx: dict) -> dict:
+    """Map column name -> its LLM dictionary entry (description/unit/...).
+
+    Reads the ``llm.dictionary`` list that ``eda_llm_insights`` stored in the
+    profile (``profile['llm']``; falls back to ``ctx['llm']``). Returns an empty
+    dict when no LLM block ran, so the caller degrades to "—" cells. Fully
+    defensive: never raises on malformed input.
+    """
+    llm = profile.get(LLM_KEY)
+    if not isinstance(llm, dict):
+        llm = ctx.get(LLM_KEY)
+    if not isinstance(llm, dict):
+        return {}
+    entries = llm.get("dictionary")
+    if not isinstance(entries, (list, tuple)):
+        return {}
+    index: dict = {}
+    for e in entries:
+        if not isinstance(e, dict):
+            continue
+        col = e.get("column")
+        if col is None:
+            continue
+        index[model._safe_str(col)] = e
+    return index
+
+
+def _llm_desc(entry) -> str:
+    """Business description of a column from its LLM entry, or "—"."""
+    if not isinstance(entry, dict):
+        return "—"
+    raw = entry.get("description") or entry.get("business_meaning")
+    text = " ".join(model._safe_str(raw).split()) if raw is not None else ""
+    return text or "—"
+
+
+def _llm_unit(entry) -> str:
+    """Unit of a column from its LLM entry, or "—"."""
+    if not isinstance(entry, dict):
+        return "—"
+    raw = entry.get("unit")
+    text = " ".join(model._safe_str(raw).split()) if raw is not None else ""
+    return text or "—"


 def _fmt_num(value, decimals: int = 3) -> str:
@@ -104,9 +160,12 @@ def _head_block(profile: dict, ctx: dict):
        "pasarlo en ctx['head_rows'] para mostrar las primeras filas.")


-def _columns_block(profile: dict):
+def _columns_block(profile: dict, llm_index: dict):
    cols = profile.get("columns") or []
-    header = ["Columna", "Tipo", "Nulos", "Ejemplos (no nulos)"]
+    # Descripción / Unidad come from the LLM dictionary (matched by column name);
+    # they read "—" when run_llm did not run, so the table always renders.
+    header = ["Columna", "Tipo", "Nulos", "Ejemplos (no nulos)",
+              "Descripción", "Unidad"]
    rows = []
    for c in cols:
        if not isinstance(c, dict):
@@ -126,15 +185,18 @@ def _columns_block(profile: dict):
            nulls = str(null_count)
        else:
            nulls = "—"
-        rows.append([name, ctype, nulls, _examples_for(c)])
+        entry = llm_index.get(model._safe_str(name))
+        rows.append([name, ctype, nulls, _examples_for(c),
+                     _llm_desc(entry), _llm_unit(entry)])
    if not rows:
        return None
    return model.DataTable(header=header, rows=rows, title="Columnas")


-def _describe_block(profile: dict):
+def _describe_block(profile: dict, llm_index: dict):
    cols = profile.get("columns") or []
-    header = ["Columna", "mean", "median", "min", "max", "std"]
+    # "Unidad" (from the LLM dictionary) tells the reader the unit of each stat.
+    header = ["Columna", "mean", "median", "min", "max", "std", "Unidad"]
    rows = []
    for c in cols:
        if not isinstance(c, dict) or c.get("inferred_type") != "numeric":
@@ -142,13 +204,16 @@ def _describe_block(profile: dict):
        num = c.get("numeric") or {}
        if not num:
            continue
+        name = c.get("name") or "(col)"
+        entry = llm_index.get(model._safe_str(name))
        rows.append([
-            c.get("name") or "(col)",
+            name,
            _fmt_num(num.get("mean")),
            _fmt_num(num.get("median")),
            _fmt_num(num.get("min")),
            _fmt_num(num.get("max")),
            _fmt_num(num.get("std")),
+            _llm_unit(entry),
        ])
    if not rows:
        return None
@@ -163,16 +228,18 @@ def build_overview(profile: dict, ctx: dict):
    if not cols and not (ctx.get(HEAD_KEY) or profile.get(HEAD_KEY)):
        return None

+    llm_index = _llm_dict_index(profile, ctx)
+
    blocks = [
        model.Heading(text="Primeras filas (df.head)", level=2),
        _head_block(profile, ctx),
    ]
-    cols_block = _columns_block(profile)
+    cols_block = _columns_block(profile, llm_index)
    if cols_block is not None:
        blocks.append(model.Heading(
            text="Diccionario de columnas", level=2))
        blocks.append(cols_block)
-    desc_block = _describe_block(profile)
+    desc_block = _describe_block(profile, llm_index)
    if desc_block is not None:
        blocks.append(model.Heading(
            text="Resumen estadístico numérico", level=2))
@@ -56,7 +56,21 @@ def _head_rows() -> list:
    ]


-def _profile(with_head: bool = True) -> dict:
+def _llm() -> dict:
+    """Interpretive block as eda_llm_insights stores it under profile['llm']."""
+    return {
+        "summary": "Pasajeros del Titanic.",
+        "dictionary": [
+            {"column": "PassengerId", "description": "Identificador del pasajero",
+             "business_meaning": "Clave única de cada pasajero", "unit": "id"},
+            {"column": "Pclass", "description": "Clase del billete",
+             "business_meaning": "Clase socioeconómica", "unit": "clase (1-3)"},
+            # No entry for Survived/Name/Sex on purpose -> they degrade to "—".
+        ],
+    }
+
+
+def _profile(with_head: bool = True, with_llm: bool = False) -> dict:
    prof = {
        "table": "titanic",
        "source": "/data/titanic.csv",
@@ -68,6 +82,8 @@ def _profile(with_head: bool = True) -> dict:
    }
    if with_head:
        prof["head_rows"] = _head_rows()
+    if with_llm:
+        prof["llm"] = _llm()
    return prof


@@ -185,3 +201,70 @@ def test_edge_none_y_vacio_no_rompen():
    assert ch is not None
    tables = [b for b in _flatten(ch.blocks) if isinstance(b, DataTable)]
    assert tables and len(tables[0].rows) == 3
+
+
+def _table_by_header(blocks, marker: str):
+    """Return the first DataTable whose header contains ``marker``."""
+    for b in _flatten(blocks):
+        if isinstance(b, DataTable) and marker in b.header:
+            return b
+    return None
+
+
+def test_golden_diccionario_lleva_descripcion_y_unidad_del_llm():
+    # With run_llm: the column dictionary gains "Descripción" and "Unidad"
+    # columns populated from profile['llm']['dictionary'], matched by name.
+    ch = build_overview(_profile(with_llm=True), {})
+    assert ch is not None
+    dic = _table_by_header(ch.blocks, "Descripción")
+    assert dic is not None
+    assert dic.header == ["Columna", "Tipo", "Nulos", "Ejemplos (no nulos)",
+                          "Descripción", "Unidad"]
+    by_name = {row[0]: row for row in dic.rows}
+    # PassengerId has an LLM entry -> description + unit populated.
+    assert by_name["PassengerId"][4] == "Identificador del pasajero"
+    assert by_name["PassengerId"][5] == "id"
+    assert by_name["Pclass"][5] == "clase (1-3)"
+    # Columns with no LLM entry degrade to "—" without breaking the row.
+    assert by_name["Survived"][4] == "—" and by_name["Survived"][5] == "—"
+
+
+def test_golden_describe_lleva_unidad_del_llm():
+    ch = build_overview(_profile(with_llm=True), {})
+    desc = _table_by_header(ch.blocks, "std")
+    assert desc is not None
+    assert desc.header[-1] == "Unidad"
+    by_name = {row[0]: row for row in desc.rows}
+    assert by_name["PassengerId"][-1] == "id"
+    assert by_name["Pclass"][-1] == "clase (1-3)"
+    # Numeric column with no LLM unit still renders, unit "—".
+    assert by_name["Survived"][-1] == "—"
+
+
+def test_edge_sin_llm_descripcion_unidad_son_guion():
+    # No profile['llm'] at all: the new cells degrade to "—" and nothing breaks.
+    ch = build_overview(_profile(), {})
+    assert ch is not None
+    dic = _table_by_header(ch.blocks, "Unidad")
+    assert dic is not None
+    for row in dic.rows:
+        assert row[4] == "—" and row[5] == "—"
+    desc = _table_by_header(ch.blocks, "std")
+    assert all(row[-1] == "—" for row in desc.rows)
+
+
+def test_golden_llm_via_ctx_tambien_funciona():
+    # LLM block arriving through ctx['llm'] (fallback path) is consumed too.
+    ch = build_overview(_profile(with_llm=False), {"llm": _llm()})
+    dic = _table_by_header(ch.blocks, "Descripción")
+    by_name = {row[0]: row for row in dic.rows}
+    assert by_name["PassengerId"][5] == "id"
+
+
+def test_golden_render_pdf_muestra_descripcion_y_unidad():
+    with tempfile.TemporaryDirectory() as d:
+        out = os.path.join(d, "eda.pdf")
+        render_automatic_eda_pdf(_profile(with_llm=True), out, {"title": "EDA"})
+        txt = _pdf_text(out)
+        assert "Descripción" in txt and "Unidad" in txt
+        assert "Identificador del pasajero" in txt
@@ -26,7 +26,7 @@ from datetime import datetime, timezone

 from .. import model

-CHAPTER_VERSION = "1.2.0"
+CHAPTER_VERSION = "1.4.0"
 CHAPTER_ID = "portada"
 CHAPTER_TITLE = "Portada"

@@ -35,12 +35,9 @@ CHAPTER_TITLE = "Portada"
 # row represents) from it when the LLM layer ran (``run_llm``).
 _LLM_KEY = "llm"

-# Default human description of what the table quality score measures. Chapters
-# can override it via ctx["quality_criteria"].
-_DEFAULT_QUALITY_CRITERIA = (
-    "media de los scores por columna (0–100): completitud (sin nulos/vacíos), "
-    "validez (tipo y rango coherentes) y consistencia (sin duplicados/constantes)."
-)
+# Font size (pt) for the dataset name on the PPTX cover slide — notably larger
+# than the default H1 so the dataset name stands out (shown underlined too).
+_PPTX_TITLE_PT = 44.0


 def _storage_from_source(source: str) -> str:
@@ -120,11 +117,20 @@ def _summary_blocks(summary) -> list:

    blocks = [model.Heading(text="Resumen del análisis", level=2)]
    if rows:
-        blocks.append(model.KVTable(rows=rows))
+        # Values pinned to the right margin (numbers flush right, label left).
+        blocks.append(model.KVTable(rows=rows, value_align="right"))
    if titles:
-        bullets = "\n".join(f"- {model._safe_str(t)}" for t in titles)
-        blocks.append(model.Markdown(
-            text="Este informe incluye los siguientes capítulos:\n" + bullets))
+        # Clickable index ("Índice"): one TocEntry per chapter title. Each entry
+        # becomes a real jump to that chapter's first page/slide once the document
+        # is laid out (the renderers register every chapter start and wire the
+        # links; ``target_id`` is matched against the chapter title). The cover only
+        # knows chapter titles, so the title doubles as the link target.
+        blocks.append(model.Heading(text="Índice", level=2))
+        for t in titles:
+            label = model._safe_str(t)
+            if not label:
+                continue
+            blocks.append(model.TocEntry(label=label, target_id=label))
    return blocks


@@ -213,9 +219,7 @@ def _derive_description(profile: dict, ctx: dict) -> str:
    score = profile.get("quality_score")
    if score is not None:
        parts.append(f"Calidad media estimada: {score}/100.")
-    parts.append(
-        "Resumen derivado del perfil; active la interpretación LLM (`run_llm`) "
-        "para una descripción de negocio más rica.")
+    parts.append("Resumen derivado del perfil.")
    return " ".join(parts)


@@ -259,7 +263,6 @@ def build_portada(profile: dict, ctx: dict):
    shape = f"{_fmt_int(n_rows)} filas × {_fmt_int(n_cols)} columnas"

    score = profile.get("quality_score")
-    quality_criteria = ctx.get("quality_criteria") or _DEFAULT_QUALITY_CRITERIA
    quality_value = "—" if score is None else f"{score} / 100"

    llm = _llm_block(profile, ctx)
@@ -282,8 +285,11 @@ def build_portada(profile: dict, ctx: dict):

    # Title + dataset size shown together and BIG (Heading) at the top, kept on
    # the same page (Group). The size is no longer buried in the metadata table.
+    # The dataset name is shown big and underlined on the PPTX cover slide
+    # (size_pt/underline are honoured by the PPTX renderer; the PDF ignores them).
    cover = [
-        model.Heading(text=str(dataset_name), level=1),
+        model.Heading(text=str(dataset_name), level=1, underline=True,
+                      size_pt=_PPTX_TITLE_PT),
        model.Markdown(text="**Automatic-EDA** · informe exploratorio automático"),
        model.Heading(text=shape, level=2),
    ]
@@ -295,7 +301,6 @@ def build_portada(profile: dict, ctx: dict):
            ("Almacenamiento", storage),
            ("Generado", when),
            ("Calidad", quality_value),
-            ("Criterios de calidad", quality_criteria),
        ]),
        model.Heading(text="Descripción", level=2),
        model.Markdown(text=str(description)),
@@ -38,10 +38,18 @@ ENGINE_NAME = "AutomaticEDA"
 # --------------------------------------------------------------------------- #
@dataclass
 class Heading:
-    """A section heading. ``level`` 1 (largest) .. 3 (smallest)."""
+    """A section heading. ``level`` 1 (largest) .. 3 (smallest).
+
+    ``underline`` and ``size_pt`` are optional emphasis hints honoured by the
+    PPTX renderer (the cover uses them to show the dataset name big and
+    underlined). ``size_pt`` overrides the per-level font size when set; the PDF
+    renderer ignores both so its layout is unchanged.
+    """

    text: str = ""
    level: int = 1
+    underline: bool = False
+    size_pt: Optional[float] = None
    kind: str = field(default="heading", init=False)


@@ -62,10 +70,17 @@ class Markdown:

@dataclass
 class KVTable:
-    """A two-column key/value table. ``rows`` is a list of ``(label, value)``."""
+    """A two-column key/value table. ``rows`` is a list of ``(label, value)``.
+
+    ``value_align`` controls the horizontal alignment of the value column in the
+    PDF renderer: ``"left"`` (default) keeps values next to the label column;
+    ``"right"`` pins them to the right margin (used by the cover's analysis
+    summary so the numbers line up flush right).
+    """

    rows: list = field(default_factory=list)
    title: Optional[str] = None
+    value_align: str = "left"
    kind: str = field(default="kv_table", init=False)


@@ -145,11 +160,21 @@ class Group:
    a chapter can give each unit its own page — e.g. one categorical column per
    page (see CAT DISTR). It is purely additive: the default False keeps the plain
    keep-together behaviour for every existing chapter.
+
+    ``layout`` is a hint for how the group's children are arranged:
+    ``"stack"`` (default) keeps the historical top-to-bottom flow; ``"side_by_side"``
+    asks the PPTX renderer to place the group's table to the LEFT and its figure to
+    the RIGHT of the same slide (table ~55% width, figure ~45%), measuring so both
+    fit and falling back to stacking when they do not. The PDF renderer treats
+    ``"side_by_side"`` exactly like ``"stack"`` (the A5 mobile page is too narrow for
+    two readable columns). Unknown values degrade to ``"stack"``. Purely additive:
+    the default keeps every existing chapter unchanged.
    """

    blocks: list = field(default_factory=list)
    title: Optional[str] = None
    page_break_before: bool = False
+    layout: str = "stack"
    kind: str = field(default="group", init=False)


@@ -168,6 +193,22 @@ class GlossaryEntry:
    kind: str = field(default="glossary_entry", init=False)


+@dataclass
+class TocEntry:
+    """One clickable index (table-of-contents) entry shown on the cover.
+
+    Rendered as a single line — the chapter ``label`` in the accent link colour —
+    that, once the document is laid out, becomes a real click jumping to the first
+    page/slide of the target chapter (PDF link annotation via PyMuPDF; PPTX native
+    slide jump). ``target_id`` is matched against each chapter's ``id`` *and* its
+    ``title`` (the cover only knows chapter titles), so either resolves. If the
+    target cannot be resolved the entry still renders as plain text (never cut)."""
+
+    label: str = ""
+    target_id: str = ""
+    kind: str = field(default="toc_entry", init=False)
+
+
@dataclass
 class Chapter:
    """An ordered set of blocks with an id, a title and a generation version."""
@@ -192,13 +233,14 @@ _BLOCK_BY_KIND = {
    "note": Note,
    "group": Group,
    "glossary_entry": GlossaryEntry,
+    "toc_entry": TocEntry,
 }


 def as_block(obj: Any):
    """Coerce a value into a block dataclass. Unknown values become a Note."""
    if isinstance(obj, (Heading, Markdown, KVTable, DataTable, Figure, Image,
-                        Caption, Note, Group, GlossaryEntry)):
+                        Caption, Note, Group, GlossaryEntry, TocEntry)):
        if isinstance(obj, Group):
            obj.blocks = as_blocks(obj.blocks)
        return obj
@@ -210,13 +252,20 @@ def as_block(obj: Any):
        # Build only with fields the dataclass accepts (ignore extras).
        try:
            if cls is Heading:
+                size_pt = obj.get("size_pt")
                return Heading(text=_safe_str(obj.get("text")),
-                               level=int(obj.get("level", 1) or 1))
+                               level=int(obj.get("level", 1) or 1),
+                               underline=bool(obj.get("underline", False)),
+                               size_pt=(None if isinstance(size_pt, bool)
+                                        or not isinstance(size_pt, (int, float))
+                                        else float(size_pt)))
            if cls is Markdown:
                return Markdown(text=_safe_str(obj.get("text")))
            if cls is KVTable:
                return KVTable(rows=list(obj.get("rows") or []),
-                               title=obj.get("title"))
+                               title=obj.get("title"),
+                               value_align=_safe_str(
+                                   obj.get("value_align")) or "left")
            if cls is DataTable:
                return DataTable(header=list(obj.get("header") or []),
                                 rows=list(obj.get("rows") or []),
@@ -237,11 +286,15 @@ def as_block(obj: Any):
                return Group(blocks=as_blocks(obj.get("blocks")),
                             title=obj.get("title"),
                             page_break_before=bool(
-                                 obj.get("page_break_before", False)))
+                                 obj.get("page_break_before", False)),
+                             layout=_safe_str(obj.get("layout")) or "stack")
            if cls is GlossaryEntry:
                return GlossaryEntry(key=_safe_str(obj.get("key")),
                                     label=_safe_str(obj.get("label")),
                                     definition=_safe_str(obj.get("definition")))
+            if cls is TocEntry:
+                return TocEntry(label=_safe_str(obj.get("label")),
+                                target_id=_safe_str(obj.get("target_id")))
        except Exception:  # noqa: BLE001 — never raise on a malformed block.
            return Note(text=_safe_str(obj))
    return Note(text=_safe_str(obj))
@@ -298,11 +298,16 @@ def test_cover_first_glossary_last_with_summary():
    headings = [b.text for b in cover.blocks if b.kind == "heading"]
    assert any("Resumen" in h for h in headings), \
        "la portada no incluye el resumen agregado"
-    # The summary reflects the body chapters (e.g. the numeric/categorical ones).
-    cover_text = " ".join(
-        b.text for b in cover.blocks if getattr(b, "kind", "") == "markdown")
-    assert "Distribuciones" in cover_text, \
-        "el resumen de portada no menciona los capítulos del cuerpo"
+    # The index ("Índice") is now a clickable list of TocEntry blocks (one per
+    # body chapter), not a markdown bullet list. Verify both the heading and that
+    # the entries name the body chapters.
+    assert any("Índice" in h for h in headings), \
+        "la portada no incluye la sección Índice"
+    toc_labels = " ".join(
+        getattr(b, "label", "") for b in cover.blocks
+        if getattr(b, "kind", "") == "toc_entry")
+    assert "Distribuciones" in toc_labels, \
+        "el índice de portada no menciona los capítulos del cuerpo"


 # --------------------------------------------------------------------------- #
@@ -46,11 +46,23 @@ _MUTED = "#8a8a8a"
 _RULE = "#cccccc"
 _HEAD_BG = "#eef3f6"

+# Rasterization DPI for every embedded raster (figure/table image) AND for the
+# page save itself. Raised from the old 150/default-100 to 220 so a reader can
+# pinch-zoom on a phone and still see crisp detail (axis labels, table cells)
+# without pixelation. Text stays vector-based (pdf.fonttype=42) so it remains
+# selectable regardless of DPI — only the embedded images gain resolution. 220 is
+# a deliberate balance: noticeably sharper than 150 while keeping the file size
+# reasonable. ``savefig.dpi`` matters because matplotlib re-rasterizes each
+# ``imshow`` when PdfPages writes the page; without it the final image would land
+# at ~100 dpi no matter how sharp the intermediate PNG was.
+_RASTER_DPI = 220
+
 _RC = {
    "font.size": 10,
    "font.family": "sans-serif",
    "figure.facecolor": "white",
    "savefig.facecolor": "white",
+    "savefig.dpi": _RASTER_DPI,
    "pdf.fonttype": 42,  # embed TrueType — text stays selectable on mobile.
 }

@@ -80,6 +92,10 @@ class _PdfState:
        # points (1/72") with a top-left origin — same convention as PyMuPDF.
        self.term_sources = []       # [{key, page, rect:[x0,y0,x1,y1]}]
        self.term_dests = {}         # key -> {page, point:[x,y]}
+        # Clickable index (cover → chapter). Sources are the cover's TocEntry
+        # rects; chapter_starts maps a chapter id AND its title to its first page.
+        self.toc_sources = []        # [{target_id, page, rect:[x0,y0,x1,y1]}]
+        self.chapter_starts = {}     # id|title -> {page, point:[x,y]}


 # --------------------------------------------------------------------------- #
@@ -317,10 +333,18 @@ def _place_kv_table(st: _PdfState, block) -> None:
    if title:
        _place_heading(st, model.Heading(title, level=2))
    rows = getattr(block, "rows", []) or []
+    # ``value_align="right"`` pins the value column to the right margin (label
+    # left, number flush right) — used by the cover's analysis summary.
+    right = str(getattr(block, "value_align", "left")).lower() == "right"
    key_w = 1.9  # inches reserved for the label column.
+    # Both alignments wrap against the same width (usable width minus the label
+    # column); ``value_align`` only changes where each wrapped line is anchored.
    val_chars = tl.chars_per_line(_USABLE_W - key_w - 0.1, _FS_BODY)
    lh = tl.line_height_in(_FS_BODY)
-    for row in rows:
+    # ``data_idx`` is the 0-based row index. Rows with an odd index (the 2nd,
+    # 4th, ... data row) are zebra-shaded, matching the data-table convention so
+    # every table in the document carries the same striping.
+    for data_idx, row in enumerate(rows):
        try:
            label, value = row[0], row[1]
        except Exception:  # noqa: BLE001
@@ -329,11 +353,25 @@ def _place_kv_table(st: _PdfState, block) -> None:
        row_h = lh * len(v_lines) + _ROW_VPAD
        _ensure_space(st, row_h)
        y0 = st.y
+        # Faint zebra fill for even rows, drawn first (zorder 0) so striping
+        # never hides the text/value drawn on top.
+        if data_idx % 2 == 1:
+            st.fig.add_artist(Rectangle(
+                (_xf(_ML), _yf(y0 + row_h)), _xf(_ML + _USABLE_W) - _xf(_ML),
+                _yf(y0) - _yf(y0 + row_h), transform=st.fig.transFigure,
+                color=_ZEBRA, lw=0, zorder=0))
        st.fig.text(_xf(_ML), _yf(y0), tl.strip_inline_md(model._safe_str(label)),
-                    fontsize=_FS_BODY, color=_MUTED, ha="left", va="top")
+                    fontsize=_FS_BODY, color=_MUTED, ha="left", va="top",
+                    zorder=2)
        for k, vl in enumerate(v_lines):
+            if right:
+                st.fig.text(_xf(_ML + _USABLE_W), _yf(y0 + k * lh), vl,
+                            fontsize=_FS_BODY, color=_INK, ha="right",
+                            va="top", zorder=2)
+            else:
                st.fig.text(_xf(_ML + key_w), _yf(y0 + k * lh), vl,
-                        fontsize=_FS_BODY, color=_INK, ha="left", va="top")
+                            fontsize=_FS_BODY, color=_INK, ha="left",
+                            va="top", zorder=2)
        st.y = y0 + row_h
    st.y += _GAP

@@ -363,6 +401,57 @@ def _col_widths(header: list, rows: list, fs: float) -> list:
    return widths


+# Minimal legible characters reserved per column when deciding whether a table
+# can be shown as selectable text. Below this width per column the cells become
+# unreadable, so the table is rasterized to a zoomable high-res image instead.
+_MIN_LEGIBLE_CHARS = 8
+
+
+def _table_fits_as_text(header: list, rows: list) -> bool:
+    """True when the table fits the usable width as readable text.
+
+    A table whose columns cannot each get a minimal legible width within the A5
+    usable width (typically many columns, e.g. a 19-column ``df.head``) is flagged
+    so it is rendered as a single high-resolution image — the reader zooms in on
+    the phone and reads every cell, nothing cut — instead of being squeezed until
+    unreadable. Narrow tables (few columns) keep the selectable-text rendering."""
+    header = header or []
+    rows = rows or []
+    ncol = len(header) if header else (len(rows[0]) if rows else 1)
+    ncol = max(1, ncol)
+    cw = tl.avg_char_width_in(_FS_CELL)
+    min_needed = ncol * (_MIN_LEGIBLE_CHARS * cw + _CELL_PAD * 2)
+    return min_needed <= _USABLE_W
+
+
+def _table_figure_block(block):
+    """Wrap a too-wide table as a lazily-rasterized Figure (cached on the block).
+
+    The table is drawn once via ``render_table_as_figure`` (header shading + zebra)
+    and embedded as one high-res image scaled to fit entirely. The same Figure is
+    reused for measuring and placing so keep-together stays consistent. The table
+    title/note are drawn inside the image (self-describing when zoomed/shared), so
+    the block-level caption is left empty to avoid a duplicate title."""
+    cached = getattr(block, "_aeda_tablefig", None)
+    if cached is not None:
+        return cached
+    header = list(getattr(block, "header", []) or [])
+    rows = list(getattr(block, "rows", []) or [])
+    title = getattr(block, "title", None)
+    note = getattr(block, "note", None)
+
+    def _make():
+        from datascience.render_table_as_figure import render_table_as_figure
+        return render_table_as_figure(header, rows, title=title, note=note)
+
+    fig = model.Figure(make=_make, caption=None)
+    try:
+        block._aeda_tablefig = fig
+    except Exception:  # noqa: BLE001 — block may reject attributes; degrade.
+        pass
+    return fig
+
+
 def _wrap_row(cells: list, widths: list, fs: float) -> list:
    """Wrap each cell to its column width → list of line-lists per cell."""
    out = []
@@ -402,11 +491,16 @@ def _draw_table_row(st: _PdfState, cells_lines: list, widths: list, fs: float,


 def _place_data_table(st: _PdfState, block) -> None:
+    header = list(getattr(block, "header", []) or [])
+    rows = list(getattr(block, "rows", []) or [])
+    # Too many columns to be legible as text → render the whole table as one
+    # high-res image, scaled to fit entirely (the reader zooms to read it).
+    if not _table_fits_as_text(header, rows):
+        _place_figure(st, _table_figure_block(block))
+        return
    title = getattr(block, "title", None)
    if title:
        _place_heading(st, model.Heading(title, level=2))
-    header = list(getattr(block, "header", []) or [])
-    rows = list(getattr(block, "rows", []) or [])
    fs = _FS_CELL
    widths = _col_widths(header, rows, fs)
    header_lines = _wrap_row(header, widths, fs) if header else None
@@ -464,8 +558,11 @@ def _resolve_figure(block):


 def _png_from_figure(fig) -> bytes:
+    # ``bbox_inches='tight'`` is kept so the real aspect ratio is what we measure
+    # and place. The page save (savefig.dpi in _RC) re-rasterizes this at the same
+    # high DPI, so the embedded image stays crisp for phone zoom.
    buf = io.BytesIO()
-    fig.savefig(buf, format="png", dpi=150, bbox_inches="tight")
+    fig.savefig(buf, format="png", dpi=_RASTER_DPI, bbox_inches="tight")
    buf.seek(0)
    return buf.read()

@@ -707,12 +804,16 @@ def _measure_data_table(block) -> float:
    Counts the optional title heading, the wrapped header row, every wrapped data
    row (per-column wrap via the same ``_col_widths``/``_wrap_row`` the placer
    uses) and the optional note. Keep this in sync with ``_place_data_table``."""
+    header = list(getattr(block, "header", []) or [])
+    rows = list(getattr(block, "rows", []) or [])
+    # Mirror the placer: a too-wide table is drawn as a single image, so its
+    # keep-together height is the image's, not the (squeezed) text layout's.
+    if not _table_fits_as_text(header, rows):
+        return _measure_figure_like(_table_figure_block(block))
    h = 0.0
    title = getattr(block, "title", None)
    if title:
        h += _measure_heading_text(title, 2)
-    header = list(getattr(block, "header", []) or [])
-    rows = list(getattr(block, "rows", []) or [])
    fs = _FS_CELL
    widths = _col_widths(header, rows, fs)
    lh = tl.line_height_in(fs)
@@ -744,6 +845,10 @@ def _measure_block(st: _PdfState, block) -> float:
            lines = tl.wrap(getattr(block, "text", ""),
                            tl.chars_per_line(_USABLE_W, _FS_NOTE))
            return tl.line_height_in(_FS_NOTE) * len(lines) + _GAP
+        if kind == "toc_entry":
+            lines = tl.wrap(tl.strip_inline_md(getattr(block, "label", "")),
+                            tl.chars_per_line(_USABLE_W - 0.22, _FS_BODY)) or [""]
+            return tl.line_height_in(_FS_BODY) * len(lines) + _GAP * 0.4
        if kind == "kv_table":
            return _measure_kv_table(block)
        if kind == "data_table":
@@ -828,6 +933,38 @@ def _place_glossary_entry(st: _PdfState, block) -> None:
    st.y += _GAP * 0.5


+def _place_toc_entry(st: _PdfState, block) -> None:
+    """Render one clickable index line and record it as a link source.
+
+    Drawn as a bulleted line in the accent link colour; its rectangle is recorded
+    in ``st.toc_sources`` so the post-processor turns it into a real jump to the
+    target chapter's first page. If the target is never resolved the line still
+    shows as plain (accent) text — never cut, never broken."""
+    label = tl.strip_inline_md(getattr(block, "label", "")) or ""
+    target_id = getattr(block, "target_id", "") or ""
+    fs = _FS_BODY
+    lh = tl.line_height_in(fs)
+    bullet = "•  "
+    indent = 0.22
+    max_chars = tl.chars_per_line(_USABLE_W - indent, fs)
+    lines = tl.wrap(label, max_chars) or [""]
+    for idx, ln in enumerate(lines):
+        _ensure_space(st, lh)
+        x = _ML
+        st.fig.text(_xf(x), _yf(st.y), bullet if idx == 0 else "   ",
+                    fontsize=fs, color=_LINK, ha="left", va="top")
+        x += indent
+        w = _text_width_in(st, ln, fs, False)
+        st.fig.text(_xf(x), _yf(st.y), ln, fontsize=fs, color=_LINK,
+                    ha="left", va="top")
+        if target_id and idx == 0:
+            st.toc_sources.append({
+                "target_id": target_id, "page": st.page - 1,
+                "rect": _pt_rect(_ML, st.y, x + w, st.y + lh)})
+        st.y += lh
+    st.y += _GAP * 0.4
+
+
 _PLACERS = {
    "heading": _place_heading,
    "markdown": _place_markdown,
@@ -839,6 +976,7 @@ _PLACERS = {
    "note": _place_note,
    "group": _place_group,
    "glossary_entry": _place_glossary_entry,
+    "toc_entry": _place_toc_entry,
 }


@@ -870,6 +1008,15 @@ def render_pdf(chapters: list, out_path: str, meta: dict = None) -> dict:
                    st.chapter = ch
                    st.chapter_pages = 0
                    _new_page(st)  # each chapter starts on a fresh page.
+                    # Record this chapter's first page as a link target for the
+                    # cover index (keyed by id AND title, since the cover only
+                    # knows titles). Point is the top of the content area.
+                    _start = {"page": st.page - 1,
+                              "point": [_ML * 72.0, _CONTENT_TOP * 72.0]}
+                    if ch.id:
+                        st.chapter_starts[ch.id] = _start
+                    if getattr(ch, "title", ""):
+                        st.chapter_starts.setdefault(ch.title, _start)
                    for block in ch.blocks:
                        placer = _PLACERS.get(getattr(block, "kind", ""),
                                              _place_note)
@@ -902,7 +1049,7 @@ def render_pdf(chapters: list, out_path: str, meta: dict = None) -> dict:

    note = f"{n_pages} páginas"
    if n_links:
-        note += f" · {n_links} enlaces de glosario"
+        note += f" · {n_links} enlaces internos"
    if notes:
        note += " · " + "; ".join(notes)
    return {"path": out_path, "n_pages": n_pages, "chapters": chapters_meta,
@@ -910,9 +1057,11 @@ def render_pdf(chapters: list, out_path: str, meta: dict = None) -> dict:


 def _wire_glossary_links(st: _PdfState, out_path: str, notes: list) -> int:
-    """Build {source rect → glossary dest} links and apply them via PyMuPDF.
+    """Apply internal PDF links via PyMuPDF: glossary terms + the cover index.

-    Returns the number of links applied (0 if there is nothing to wire or the
+    Builds two sets of GOTO links — every in-text glossary term → its entry, and
+    every cover ``TocEntry`` → its chapter's first page — and applies them in one
+    pass. Returns the number of links applied (0 if there is nothing to wire or the
    post-processor is unavailable). Never raises."""
    try:
        links = []
@@ -923,6 +1072,14 @@ def _wire_glossary_links(st: _PdfState, out_path: str, notes: list) -> int:
            links.append({
                "src_page": src["page"], "src_rect": src["rect"],
                "dst_page": dest["page"], "dst_point": dest["point"]})
+        # Cover index → chapter first page (clickable, navigable table of contents).
+        for src in st.toc_sources:
+            dest = st.chapter_starts.get(src.get("target_id"))
+            if not dest:
+                continue
+            links.append({
+                "src_page": src["page"], "src_rect": src["rect"],
+                "dst_page": dest["page"], "dst_point": dest["point"]})
        if not links:
            return 0
        from datascience.add_pdf_internal_links import add_pdf_internal_links
@@ -930,7 +1087,7 @@ def _wire_glossary_links(st: _PdfState, out_path: str, notes: list) -> int:
        if isinstance(res, dict) and res.get("status") == "ok":
            return int(res.get("n_links") or 0)
        if isinstance(res, dict) and res.get("error"):
-            notes.append(f"glosario sin enlaces: {res.get('error')}")
+            notes.append(f"enlaces internos no aplicados: {res.get('error')}")
    except Exception as e:  # noqa: BLE001 — links are best-effort.
-        notes.append(f"glosario sin enlaces: {e}")
+        notes.append(f"enlaces internos no aplicados: {e}")
    return 0
@@ -51,6 +51,12 @@ _FS_H1, _FS_H2, _FS_H3 = 20, 16, 13
 _FS_BODY, _FS_CELL, _FS_NOTE = 14, 11, 11
 _GAP = 0.12

+# Rasterization DPI for every embedded figure/table image. Raised from 150 to 220
+# so a viewer can zoom into a slide (or a shared picture) and read crisp detail —
+# axis labels, table cells — without pixelation. Kept moderate so the deck size
+# stays reasonable. Same value as the PDF renderer.
+_RASTER_DPI = 220
+

 class _PptxState:
    def __init__(self, prs, title: str):
@@ -65,6 +71,10 @@ class _PptxState:
        # Glossary wiring (improvement 6): runs to link and per-term target slide.
        self.term_runs = []           # [(key, run)]
        self.term_anchor_slide = {}   # key -> Slide (glossary entry)
+        # Clickable index (cover → chapter). toc_runs are the cover's index runs;
+        # chapter_starts maps a chapter id AND its title to its first slide.
+        self.toc_runs = []            # [(target_id, run, src_slide)]
+        self.chapter_starts = {}      # id|title -> Slide (chapter first slide)


 def _rgb(c):
@@ -135,7 +145,7 @@ def _ensure(st: _PptxState, height: float) -> None:


 def _add_text(st: _PptxState, lines: list, fs: float, color, bold=False,
-              italic=False, indent=0.0, bullet=False) -> None:
+              italic=False, indent=0.0, bullet=False, underline=False) -> None:
    lh = tl.line_height_in(fs)
    height = lh * len(lines) + 0.05
    _ensure(st, height)
@@ -153,6 +163,7 @@ def _add_text(st: _PptxState, lines: list, fs: float, color, bold=False,
        run.font.size = Pt(fs)
        run.font.bold = bold
        run.font.italic = italic
+        run.font.underline = underline
        run.font.color.rgb = _rgb(color)
    st.y += height

@@ -206,10 +217,16 @@ def _add_rich_text(st: _PptxState, rich_lines: list, fs: float, color,
 def _place_heading(st: _PptxState, block) -> None:
    level = max(1, min(3, int(getattr(block, "level", 1) or 1)))
    fs = {1: _FS_H1, 2: _FS_H2, 3: _FS_H3}[level]
+    # Optional per-heading emphasis (cover dataset name): a larger font and an
+    # underline. ``size_pt`` overrides the per-level size when set.
+    size_override = getattr(block, "size_pt", None)
+    if isinstance(size_override, (int, float)) and size_override > 0:
+        fs = float(size_override)
+    underline = bool(getattr(block, "underline", False))
    text = tl.strip_inline_md(getattr(block, "text", ""))
    st.last_heading = text or st.last_heading
    lines = tl.wrap(text, tl.chars_per_line(_USABLE_W, fs))
-    _add_text(st, lines, fs, _INK, bold=True)
+    _add_text(st, lines, fs, _INK, bold=True, underline=underline)
    st.y += 0.04


@@ -302,6 +319,58 @@ def _col_widths(header, rows):
    return [_USABLE_W * w / total for w in clamped]


+# Minimum legible characters reserved per column when deciding whether a table
+# can be shown as a native (selectable) PowerPoint table. Below this width per
+# column the cells become unreadable, so the table is rasterized to a zoomable
+# high-res image instead. The 16:9 slide is wide, so more columns fit than on A5.
+_MIN_LEGIBLE_CHARS = 8
+_CELL_PAD = 0.05
+
+
+def _table_fits_as_text(header: list, rows: list) -> bool:
+    """True when the table fits the usable slide width as a readable table.
+
+    A table whose columns cannot each get a minimum legible width within the slide
+    usable width (typically many columns, e.g. a 19-column ``df.head``) is flagged
+    so it is rendered as one high-resolution image — the viewer zooms in and reads
+    every cell — instead of being squeezed unreadable. Narrow tables keep the
+    native selectable table."""
+    header = header or []
+    rows = rows or []
+    ncol = len(header) if header else (len(rows[0]) if rows else 1)
+    ncol = max(1, ncol)
+    cw = tl.avg_char_width_in(_FS_CELL)
+    min_needed = ncol * (_MIN_LEGIBLE_CHARS * cw + _CELL_PAD * 2)
+    return min_needed <= _USABLE_W
+
+
+def _table_figure_block(block):
+    """Wrap a too-wide table as a lazily-rasterized Figure (cached on the block).
+
+    Drawn once via ``render_table_as_figure`` (header shading + zebra) and embedded
+    as one high-res image scaled to fit entirely. The title/note are drawn inside
+    the image (self-describing when zoomed/shared), so no separate caption is
+    emitted. Reused for measuring and placing so keep-together stays consistent."""
+    cached = getattr(block, "_aeda_tablefig", None)
+    if cached is not None:
+        return cached
+    header = list(getattr(block, "header", []) or [])
+    rows = list(getattr(block, "rows", []) or [])
+    title = getattr(block, "title", None)
+    note = getattr(block, "note", None)
+
+    def _make():
+        from datascience.render_table_as_figure import render_table_as_figure
+        return render_table_as_figure(header, rows, title=title, note=note)
+
+    fig = model.Figure(make=_make, caption=None)
+    try:
+        block._aeda_tablefig = fig
+    except Exception:  # noqa: BLE001 — block may reject attributes; degrade.
+        pass
+    return fig
+
+
 def _row_height_in(cells, widths, fs) -> float:
    lh = tl.line_height_in(fs)
    maxlines = 1
@@ -365,11 +434,27 @@ def _style_cell(cell, fs, color, bold, fill) -> None:

 def _place_data_table(st: _PptxState, block, shaded_header=True,
                      key_value=False) -> None:
+    header = list(getattr(block, "header", []) or [])
+    rows = list(getattr(block, "rows", []) or [])
+    # Too many columns to be legible as a native table → render the whole table as
+    # one high-res picture, scaled to fit entirely (the viewer zooms to read it).
+    # KVTables (rendered here as a 2-column Campo/Valor table) are excluded: they
+    # always fit in width and stay as a selectable table.
+    if not key_value and not _table_fits_as_text(header, rows):
+        figblock = _table_figure_block(block)
+        data, _asp = _figure_bytes_cached(figblock)
+        if data is None:
+            _add_text(st, ["(tabla no disponible)"], _FS_NOTE, _MUTED,
+                      italic=True)
+            st.y += _GAP
+            return
+        _place_picture_bytes(st, data, None,
+                             max_h_in=getattr(figblock, "height_in", None),
+                             force_caption=False)
+        return
    title = getattr(block, "title", None)
    if title:
        _place_heading(st, model.Heading(title, level=2))
-    header = list(getattr(block, "header", []) or [])
-    rows = list(getattr(block, "rows", []) or [])
    fs = _FS_CELL
    widths = _col_widths(header, rows)
    header_h = _row_height_in(header, widths, fs) if header else 0.0
@@ -429,7 +514,7 @@ def _resolve_png(block):
    try:
        import matplotlib.pyplot as plt
        buf = io.BytesIO()
-        f.savefig(buf, format="png", dpi=150, bbox_inches="tight")
+        f.savefig(buf, format="png", dpi=_RASTER_DPI, bbox_inches="tight")
        buf.seek(0)
        return buf.read()
    except Exception:  # noqa: BLE001
@@ -476,12 +561,15 @@ def _figure_bytes_cached(block):


 def _place_picture_bytes(st: _PptxState, data: bytes, caption,
-                         max_h_in=None) -> None:
+                         max_h_in=None, force_caption=True) -> None:
    # Improvement 4: every figure on a slide has a visible caption/title. If the
    # block has no caption, fall back to the current section heading, then to a
-    # generic label, so no image is ever shown untitled.
-    caption = (model._safe_str(caption).strip()
-               or model._safe_str(st.last_heading).strip() or "Figura")
+    # generic label, so no image is ever shown untitled. ``force_caption=False``
+    # suppresses that fallback (used for table images, whose title is inside the
+    # picture) so no redundant caption is drawn.
+    caption = model._safe_str(caption).strip()
+    if not caption and force_caption:
+        caption = model._safe_str(st.last_heading).strip() or "Figura"
    w_px, h_px = _img_size_px(data)
    aspect = (h_px / w_px) if w_px else 0.66
    # Reserve the caption's REAL (possibly multi-line) height FIRST, then scale
@@ -489,9 +577,11 @@ def _place_picture_bytes(st: _PptxState, data: bytes, caption,
    # so its caption always fits on the SAME slide and no image is untitled.
    # cap_real = what _add_text consumes; cap_reserve adds the post-image gap and
    # a small cushion so the caption never spills to the next slide.
-    cap_lines = tl.wrap(caption, tl.chars_per_line(_USABLE_W, _FS_NOTE))
-    cap_real = tl.line_height_in(_FS_NOTE) * len(cap_lines) + 0.05
-    cap_reserve = cap_real + 0.05 + 0.10
+    cap_lines = tl.wrap(caption, tl.chars_per_line(_USABLE_W, _FS_NOTE)) \
+        if caption else []
+    cap_real = (tl.line_height_in(_FS_NOTE) * len(cap_lines) + 0.05) \
+        if cap_lines else 0.0
+    cap_reserve = (cap_real + 0.05 + 0.10) if cap_lines else 0.05
    max_h = _CONTENT_BOTTOM - _CONTENT_TOP
    # height_in hint (model.Figure/Image): cap the target height so a figure in a
    # keep-together Group shrinks to leave room for its heading and text.
@@ -510,6 +600,7 @@ def _place_picture_bytes(st: _PptxState, data: bytes, caption,
    st.slide.shapes.add_picture(io.BytesIO(data), Inches(left), Inches(st.y),
                                width=Inches(target_w), height=Inches(target_h))
    st.y += target_h + 0.05
+    if cap_lines:
        _add_text(st, cap_lines, _FS_NOTE, _MUTED, italic=True)
    st.y += _GAP

@@ -552,9 +643,11 @@ def _place_note(st: _PptxState, block) -> None:
 # WITHOUT drawing it so a Group can move whole to the next slide before drawing.
 # Over-estimating only triggers an earlier slide break, never a content cut.
 # --------------------------------------------------------------------------- #
-def _measure_heading_text(text: str, level: int) -> float:
+def _measure_heading_text(text: str, level: int, size_pt=None) -> float:
    level = max(1, min(3, int(level or 1)))
    fs = {1: _FS_H1, 2: _FS_H2, 3: _FS_H3}[level]
+    if isinstance(size_pt, (int, float)) and size_pt > 0:
+        fs = float(size_pt)
    lines = tl.wrap(tl.strip_inline_md(text), tl.chars_per_line(_USABLE_W, fs))
    return tl.line_height_in(fs) * len(lines) + 0.05 + 0.04

@@ -654,12 +747,16 @@ def _measure_kv_table(block) -> float:
 def _measure_data_table(block) -> float:
    """Faithful DataTable height — matches ``_place_data_table`` (title heading +
    wrapped header + every wrapped row + optional note). Keep in sync."""
+    header = list(getattr(block, "header", []) or [])
+    rows = list(getattr(block, "rows", []) or [])
+    # Mirror the placer: a too-wide table is drawn as one image, so its
+    # keep-together height is the image's, not the (squeezed) table layout's.
+    if not _table_fits_as_text(header, rows):
+        return _measure_figure_like(_table_figure_block(block))
    h = 0.0
    title = getattr(block, "title", None)
    if title:
        h += _measure_heading_text(title, 2)
-    header = list(getattr(block, "header", []) or [])
-    rows = list(getattr(block, "rows", []) or [])
    fs = _FS_CELL
    widths = _col_widths(header, rows)
    if header:
@@ -679,7 +776,8 @@ def _measure_block(st: _PptxState, block) -> float:
    try:
        if kind == "heading":
            return _measure_heading_text(getattr(block, "text", ""),
-                                         getattr(block, "level", 1))
+                                         getattr(block, "level", 1),
+                                         size_pt=getattr(block, "size_pt", None))
        if kind == "markdown":
            return _measure_markdown(block)
        if kind in ("figure", "image"):
@@ -688,6 +786,10 @@ def _measure_block(st: _PptxState, block) -> float:
            lines = tl.wrap(getattr(block, "text", ""),
                            tl.chars_per_line(_USABLE_W, _FS_NOTE))
            return tl.line_height_in(_FS_NOTE) * len(lines) + 0.05 + _GAP
+        if kind == "toc_entry":
+            lines = tl.wrap(tl.strip_inline_md(getattr(block, "label", "")),
+                            tl.chars_per_line(_USABLE_W - 0.3, _FS_BODY)) or [""]
+            return tl.line_height_in(_FS_BODY) * len(lines) + 0.05
        if kind == "kv_table":
            return _measure_kv_table(block)
        if kind == "data_table":
@@ -800,6 +902,73 @@ def _fit_group_blocks(st: _PptxState, blocks: list, avail_full: float) -> list:
    return out


+def _fit_img(width_col: float, aspect: float, max_h: float):
+    """Scale an image to ``width_col`` then clamp to ``max_h`` keeping aspect."""
+    w = width_col
+    h = w * aspect
+    if h > max_h:
+        h = max_h
+        w = (h / aspect) if aspect else width_col
+    return w, h
+
+
+def _place_group_side_by_side(st: _PptxState, block, avail_full: float) -> bool:
+    """Place a Group's table (left ~55%) next to its figure (right ~45%).
+
+    Both the table and the figure are rasterized to high-res images and placed in
+    two columns of the SAME slide; leading blocks (e.g. a heading) render full
+    width above the pair, the rest below. Returns True on success; returns False
+    (so the caller falls back to stacking) when the group has no table+figure pair
+    or the pair cannot fit side by side on one slide. Never raises by itself."""
+    blocks = getattr(block, "blocks", []) or []
+    tbl = next((b for b in blocks
+                if getattr(b, "kind", "") in ("data_table", "kv_table")), None)
+    fig = next((b for b in blocks
+                if getattr(b, "kind", "") in ("figure", "image")), None)
+    if tbl is None or fig is None:
+        return False
+    gap_col = 0.3
+    left_w = _USABLE_W * 0.55 - gap_col / 2.0
+    right_w = _USABLE_W * 0.45 - gap_col / 2.0
+    if left_w <= 1.0 or right_w <= 1.0:
+        return False
+    tdata, tasp = _figure_bytes_cached(_table_figure_block(tbl))
+    fdata, fasp = _figure_bytes_cached(fig)
+    if not tdata or not fdata:
+        return False
+    ti, fi = blocks.index(tbl), blocks.index(fig)
+    lo = min(ti, fi)
+    lead = list(blocks[:lo])
+    rest = [b for b in blocks[lo + 1:] if b is not tbl and b is not fig]
+    lead_h = sum(_measure_block(st, b) for b in lead)
+    rest_h = sum(_measure_block(st, b) for b in rest)
+    col_max_h = avail_full - lead_h - rest_h - _GAP * 2
+    if col_max_h < 1.2:
+        return False  # not enough vertical room to put the pair side by side.
+    tw, th = _fit_img(left_w, tasp, col_max_h)
+    fw, fh = _fit_img(right_w, fasp, col_max_h)
+    band = max(th, fh)
+    needed = lead_h + band + rest_h + _GAP * 2
+    if needed > avail_full:
+        return False  # taller than a whole slide even side by side → stack.
+    if needed > _remaining(st):
+        _new_slide(st, cont=True)
+    for b in lead:
+        _PLACERS.get(getattr(b, "kind", ""), _place_note)(st, b)
+    top = st.y
+    f_left = _ML + left_w + gap_col
+    st.slide.shapes.add_picture(
+        io.BytesIO(tdata), Inches(_ML + (left_w - tw) / 2.0),
+        Inches(top + (band - th) / 2.0), width=Inches(tw), height=Inches(th))
+    st.slide.shapes.add_picture(
+        io.BytesIO(fdata), Inches(f_left + (right_w - fw) / 2.0),
+        Inches(top + (band - fh) / 2.0), width=Inches(fw), height=Inches(fh))
+    st.y = top + band + _GAP
+    for b in rest:
+        _PLACERS.get(getattr(b, "kind", ""), _place_note)(st, b)
+    return True
+
+
 def _place_group(st: _PptxState, block) -> None:
    """Render a keep-together Group: move it whole to the next slide if needed."""
    blocks = getattr(block, "blocks", []) or []
@@ -810,6 +979,14 @@ def _place_group(st: _PptxState, block) -> None:
    if getattr(block, "page_break_before", False) and st.y > _CONTENT_TOP + 1e-6:
        _new_slide(st, cont=True)
    avail_full = _CONTENT_BOTTOM - _CONTENT_TOP
+    # layout="side_by_side": try table-left / figure-right on one slide; if that
+    # fails for any reason, fall through to the stacked keep-together below.
+    if str(getattr(block, "layout", "stack")).lower() == "side_by_side":
+        try:
+            if _place_group_side_by_side(st, block, avail_full):
+                return
+        except Exception:  # noqa: BLE001 — degrade to stacking, never abort.
+            pass
    # Trim oversized tables first (keeps the chart on the same slide), then shrink
    # the figure to share the remaining room.
    blocks = _fit_group_blocks(st, blocks, avail_full)
@@ -843,6 +1020,44 @@ def _place_glossary_entry(st: _PptxState, block) -> None:
    st.y += _GAP


+def _place_toc_entry(st: _PptxState, block) -> None:
+    """Render one clickable index line and record its run as a link source.
+
+    Drawn as a bulleted line in the accent link colour; the run is recorded in
+    ``st.toc_runs`` so it later becomes a native slide-jump to the target chapter's
+    first slide. If the target is never resolved the line still shows as plain
+    (accent) text — never cut."""
+    label = tl.strip_inline_md(getattr(block, "label", "")) or ""
+    target_id = getattr(block, "target_id", "") or ""
+    fs = _FS_BODY
+    lines = tl.wrap(label, tl.chars_per_line(_USABLE_W - 0.3, fs)) or [""]
+    lh = tl.line_height_in(fs)
+    height = lh * len(lines) + 0.05
+    _ensure(st, height)
+    box = st.slide.shapes.add_textbox(
+        Inches(_ML), Inches(st.y), Inches(_USABLE_W), Inches(height))
+    tf = box.text_frame
+    tf.word_wrap = True
+    first = True
+    link_run = None
+    for idx, ln in enumerate(lines):
+        p = tf.paragraphs[0] if first else tf.add_paragraph()
+        first = False
+        r0 = p.add_run()
+        r0.text = "•  " if idx == 0 else "   "
+        r0.font.size = Pt(fs)
+        r0.font.color.rgb = _rgb(_LINK)
+        run = p.add_run()
+        run.text = ln
+        run.font.size = Pt(fs)
+        run.font.color.rgb = _rgb(_LINK)
+        if idx == 0:
+            link_run = run
+    if target_id and link_run is not None:
+        st.toc_runs.append((target_id, link_run, st.slide))
+    st.y += height
+
+
 _PLACERS = {
    "heading": _place_heading,
    "markdown": _place_markdown,
@@ -854,6 +1069,7 @@ _PLACERS = {
    "note": _place_note,
    "group": _place_group,
    "glossary_entry": _place_glossary_entry,
+    "toc_entry": _place_toc_entry,
 }


@@ -889,6 +1105,12 @@ def render_pptx(chapters: list, out_path: str, meta: dict = None) -> dict:
            st.chapter = ch
            st.chapter_slides = 0
            _new_slide(st, cont=False)
+            # Record this chapter's first slide as a link target for the cover
+            # index (keyed by id AND title, since the cover only knows titles).
+            if ch.id:
+                st.chapter_starts[ch.id] = st.slide
+            if getattr(ch, "title", ""):
+                st.chapter_starts.setdefault(ch.title, st.slide)
            for block in ch.blocks:
                placer = _PLACERS.get(getattr(block, "kind", ""), _place_note)
                try:
@@ -916,7 +1138,7 @@ def render_pptx(chapters: list, out_path: str, meta: dict = None) -> dict:

    note = f"{n_slides} slides"
    if n_links:
-        note += f" · {n_links} enlaces de glosario"
+        note += f" · {n_links} enlaces internos"
    if notes:
        note += " · " + "; ".join(notes)
    return {"path": out_path, "n_slides": n_slides, "chapters": chapters_meta,
@@ -924,19 +1146,21 @@ def render_pptx(chapters: list, out_path: str, meta: dict = None) -> dict:


 def _wire_glossary_links(st: _PptxState, notes: list) -> int:
-    """Turn each recorded term run into a native jump to its glossary slide.
+    """Apply native slide-jumps: glossary terms + the cover index.

-    Returns the number of links applied. A term whose only appearance is inside
-    its own glossary entry (source slide == target slide) is skipped. Never
+    Each in-text glossary term run jumps to its glossary entry slide, and each
+    cover ``TocEntry`` run jumps to its chapter's first slide. Returns the total
+    number of links applied. A run whose target is its own slide is skipped. Never
    raises."""
-    if not st.term_runs or not st.term_anchor_slide:
+    if not (st.term_runs and st.term_anchor_slide) and not (
+            st.toc_runs and st.chapter_starts):
        return 0
-    linked = 0
    try:
        from datascience.pptx_link_run_to_slide import pptx_link_run_to_slide
    except Exception as e:  # noqa: BLE001
-        notes.append(f"glosario sin enlaces: {e}")
+        notes.append(f"enlaces internos no aplicados: {e}")
        return 0
+    linked = 0
    for key, run, src_slide in st.term_runs:
        tgt = st.term_anchor_slide.get(key)
        if tgt is None or tgt is src_slide:
@@ -946,4 +1170,14 @@ def _wire_glossary_links(st: _PptxState, notes: list) -> int:
                linked += 1
        except Exception:  # noqa: BLE001 — links are best-effort.
            pass
+    # Cover index → chapter first slide (clickable, navigable table of contents).
+    for target_id, run, src_slide in st.toc_runs:
+        tgt = st.chapter_starts.get(target_id)
+        if tgt is None or tgt is src_slide:
+            continue
+        try:
+            if pptx_link_run_to_slide(run, src_slide, tgt):
+                linked += 1
+        except Exception:  # noqa: BLE001 — links are best-effort.
+            pass
    return linked
@@ -0,0 +1,283 @@
+"""Golden tests for the global render-quality features (issue: eda-render-quality).
+
+Covers, with executable evidence:
+  * High DPI: every embedded figure is rasterized at 220 dpi, so a phone reader
+    can zoom in and still see crisp detail.
+  * Wide table → image: a table too wide to be legible as text (e.g. a 19-column
+    df.head) is rendered as one high-res image that scales to fit entirely, while
+    a narrow table keeps its selectable-text/native-table rendering.
+  * ``Group(layout="side_by_side")``: in PPTX the table and figure are placed in
+    two columns of the same slide; in PDF the same group stacks vertically.
+  * Backward compatibility: a Group without ``layout`` defaults to ``"stack"`` and
+    a fitting table renders exactly as before.
+
+Renderers are invoked for real; PDFs are inspected with PyMuPDF and PPTX decks
+with python-pptx.
+"""
+
+from __future__ import annotations
+
+import os
+import tempfile
+
+import matplotlib
+
+matplotlib.use("Agg")
+import matplotlib.pyplot as plt  # noqa: E402
+
+import pytest  # noqa: E402
+
+from datascience.automatic_eda import model  # noqa: E402
+from datascience.automatic_eda.render_pdf_impl import (  # noqa: E402
+    render_pdf, _RASTER_DPI as _PDF_DPI, _table_fits_as_text as _pdf_fits)
+from datascience.automatic_eda.render_pptx_impl import (  # noqa: E402
+    render_pptx, _RASTER_DPI as _PPTX_DPI, _table_fits_as_text as _pptx_fits)
+
+
+# --------------------------------------------------------------------------- #
+# Helpers.
+# --------------------------------------------------------------------------- #
+def _simple_fig():
+    """A small, real matplotlib figure for the figure blocks."""
+    fig, ax = plt.subplots(figsize=(4, 3))
+    ax.plot([0, 1, 2, 3], [1, 3, 2, 4])
+    ax.set_title("demo")
+    return fig
+
+
+def _wide_table(n_cols=19, n_rows=5):
+    header = [f"columna_{i}" for i in range(n_cols)]
+    rows = [[f"v{r}_{c}" for c in range(n_cols)] for r in range(n_rows)]
+    return model.DataTable(header=header, rows=rows, title="Primeras filas")
+
+
+def _narrow_table():
+    return model.DataTable(header=["a", "b", "c"],
+                           rows=[["1", "2", "3"], ["4", "5", "6"]],
+                           title="Tabla estrecha")
+
+
+def _chapter(blocks, cid="cap", title="Capítulo"):
+    return [model.Chapter(id=cid, title=title, version="1.0.0", blocks=blocks)]
+
+
+# --------------------------------------------------------------------------- #
+# 1) High DPI — the unit constant and a real embedded image.
+# --------------------------------------------------------------------------- #
+def test_raster_dpi_is_high_both_renderers():
+    assert _PDF_DPI >= 200, "el DPI del PDF debe ser alto (>=200)"
+    assert _PPTX_DPI >= 200, "el DPI del PPTX debe ser alto (>=200)"
+
+
+def test_pdf_embedded_figure_is_high_resolution(tmp_path):
+    fitz = pytest.importorskip("fitz")
+    out = str(tmp_path / "fig.pdf")
+    res = render_pdf(_chapter([model.Figure(make=_simple_fig, caption="demo")]),
+                     out, {"title": "T"})
+    assert res["path"] == out
+    doc = fitz.open(out)
+    try:
+        widths = []
+        for page in doc:
+            for img in page.get_images(full=True):
+                xref = img[0]
+                info = doc.extract_image(xref)
+                widths.append(info.get("width", 0))
+        assert widths, "no se incrustó ninguna imagen en el PDF"
+        # A ~4" figure rasterized at 220 dpi is about 880 px wide (4 x 220);
+        # at the old 150 dpi it would be ~600 px, so >= 800 proves the DPI bump.
+        assert max(widths) >= 800, \
+            f"la figura embebida no es de alta resolución: {max(widths)} px"
+    finally:
+        doc.close()
+
+
+# --------------------------------------------------------------------------- #
+# 2) Wide table → image (PDF and PPTX); narrow table stays text.
+# --------------------------------------------------------------------------- #
+def test_fit_criterion_flags_wide_and_keeps_narrow():
+    wide = _wide_table()
+    narrow = _narrow_table()
+    assert not _pdf_fits(wide.header, wide.rows), \
+        "una tabla de 19 columnas debería NO caber como texto en A5"
+    assert not _pptx_fits(wide.header, wide.rows), \
+        "una tabla de 19 columnas debería NO caber como tabla nativa en 16:9"
+    assert _pdf_fits(narrow.header, narrow.rows), \
+        "una tabla de 3 columnas debería caber como texto en A5"
+    assert _pptx_fits(narrow.header, narrow.rows), \
+        "una tabla de 3 columnas debería caber como tabla nativa en 16:9"
+
+
+def test_wide_table_rendered_as_image_pdf(tmp_path):
+    fitz = pytest.importorskip("fitz")
+    out = str(tmp_path / "wide.pdf")
+    res = render_pdf(_chapter([_wide_table()]), out, {"title": "T"})
+    assert res["path"] == out
+    doc = fitz.open(out)
+    try:
+        n_images = sum(len(page.get_images(full=True)) for page in doc)
+        text = "".join(page.get_text() for page in doc)
+    finally:
+        doc.close()
+    assert n_images >= 1, "la tabla ancha no se rasterizó como imagen en el PDF"
+    # The cells are now inside the image, not selectable text. A unique cell value
+    # must therefore NOT appear as extractable text (it lives in the picture).
+    assert "v4_18" not in text, \
+        "la tabla ancha sigue como texto seleccionable (no se hizo imagen)"
+
+
+def test_narrow_table_stays_selectable_text_pdf(tmp_path):
+    fitz = pytest.importorskip("fitz")
+    out = str(tmp_path / "narrow.pdf")
+    render_pdf(_chapter([_narrow_table()]), out, {"title": "T"})
+    doc = fitz.open(out)
+    try:
+        text = "".join(page.get_text() for page in doc)
+    finally:
+        doc.close()
+    # Narrow table is selectable text: its header/cells are extractable.
+    for v in ("a", "b", "c", "1", "6"):
+        assert v in text, f"la celda '{v}' debería ser texto seleccionable"
+
+
+def test_wide_table_rendered_as_picture_pptx(tmp_path):
+    pptx = pytest.importorskip("pptx")
+    from pptx.enum.shapes import MSO_SHAPE_TYPE
+    out = str(tmp_path / "wide.pptx")
+    res = render_pptx(_chapter([_wide_table()]), out, {"title": "T"})
+    assert res["path"] == out
+    prs = pptx.Presentation(out)
+    pics = sum(1 for s in prs.slides for sh in s.shapes
+               if sh.shape_type == MSO_SHAPE_TYPE.PICTURE)
+    assert pics >= 1, "la tabla ancha no se colocó como imagen en el PPTX"
+
+
+# --------------------------------------------------------------------------- #
+# 3) Group(layout="side_by_side"): two columns in PPTX, stacked in PDF.
+# --------------------------------------------------------------------------- #
+def _side_by_side_group():
+    return model.Group(
+        blocks=[model.Heading(text="Columna X", level=2),
+                _narrow_table(),
+                model.Figure(make=_simple_fig, caption="grafico")],
+        layout="side_by_side")
+
+
+def test_side_by_side_places_two_columns_pptx(tmp_path):
+    pptx = pytest.importorskip("pptx")
+    from pptx.enum.shapes import MSO_SHAPE_TYPE
+    from pptx.util import Inches
+    out = str(tmp_path / "sbs.pptx")
+    render_pptx(_chapter([_side_by_side_group()]), out, {"title": "T"})
+    prs = pptx.Presentation(out)
+    # Find the slide that holds the pair (table image + figure image).
+    centre_emu = int(Inches(13.333 / 2.0))
+    placed = False
+    for s in prs.slides:
+        lefts = [sh.left for sh in s.shapes
+                 if sh.shape_type == MSO_SHAPE_TYPE.PICTURE
+                 and sh.left is not None]
+        if len(lefts) >= 2:
+            # one picture starts in the left half, another in the right half.
+            if min(lefts) < centre_emu and max(lefts) > centre_emu:
+                placed = True
+                break
+    assert placed, \
+        "side_by_side no colocó tabla y figura en dos columnas de la misma slide"
+
+
+def test_side_by_side_stacks_in_pdf(tmp_path):
+    fitz = pytest.importorskip("fitz")
+    out = str(tmp_path / "sbs.pdf")
+    res = render_pdf(_chapter([_side_by_side_group()]), out, {"title": "T"})
+    assert res["path"] == out and res["n_pages"] >= 1
+    doc = fitz.open(out)
+    try:
+        n_images = sum(len(page.get_images(full=True)) for page in doc)
+        text = "".join(page.get_text() for page in doc)
+    finally:
+        doc.close()
+    # PDF stacks: the narrow table stays selectable text (1 of its cells is
+    # extractable) and the figure is the single embedded image — not a 2-column
+    # pair of pictures like PPTX.
+    assert n_images == 1, "el PDF no debería usar el layout de dos imágenes"
+    assert "Columna X" in text and "1" in text, \
+        "la tabla del grupo debería seguir como texto apilado en el PDF"
+
+
+# --------------------------------------------------------------------------- #
+# 4) Backward compatibility — default layout stacks, fitting table unchanged.
+# --------------------------------------------------------------------------- #
+def test_group_default_layout_is_stack():
+    g = model.Group(blocks=[_narrow_table()])
+    assert g.layout == "stack", "el layout por defecto debe ser 'stack'"
+
+
+# --------------------------------------------------------------------------- #
+# 5) Clickable cover index ("Índice") → chapter first page/slide.
+# --------------------------------------------------------------------------- #
+def _doc_with_index():
+    portada = model.Chapter(id="portada", title="Portada", version="1.0.0",
+                            blocks=[model.Heading(text="Índice", level=2),
+                                    model.TocEntry(label="Distribuciones",
+                                                   target_id="Distribuciones")])
+    cap = model.Chapter(id="num", title="Distribuciones", version="1.0.0",
+                        blocks=[model.Markdown(text="contenido del capítulo")])
+    return [portada, cap]
+
+
+def test_cover_index_is_clickable_pdf(tmp_path):
+    fitz = pytest.importorskip("fitz")
+    out = str(tmp_path / "idx.pdf")
+    res = render_pdf(_doc_with_index(), out, {"title": "T"})
+    assert res["path"] == out
+    doc = fitz.open(out)
+    try:
+        # The cover (page 0) must carry a GOTO link jumping to a later page.
+        goto = [lk for lk in doc[0].get_links()
+                if lk.get("kind") == fitz.LINK_GOTO and lk.get("page", 0) > 0]
+    finally:
+        doc.close()
+    assert goto, "the cover index produced no clickable links in the PDF"
+
+
+def test_cover_index_shows_heading_pdf(tmp_path):
+    fitz = pytest.importorskip("fitz")
+    out = str(tmp_path / "idxh.pdf")
+    render_pdf(_doc_with_index(), out, {"title": "T"})
+    doc = fitz.open(out)
+    try:
+        text = "".join(page.get_text() for page in doc)
+    finally:
+        doc.close()
+    assert "Índice" in text, "the cover does not show the 'Índice' heading"
+    assert "Este informe incluye" not in text, \
+        "the cover still shows the old 'Este informe incluye' text"
+
+
+def test_cover_index_is_clickable_pptx(tmp_path):
+    pptx = pytest.importorskip("pptx")
+    out = str(tmp_path / "idx.pptx")
+    render_pptx(_doc_with_index(), out, {"title": "T"})
+    prs = pptx.Presentation(out)
+    cover_xml = prs.slides[0]._element.xml
+    assert "hlinksldjump" in cover_xml, \
+        "the cover index produced no native slide jump in the PPTX"
+
+
+def test_default_group_renders_like_before_pptx(tmp_path):
+    pptx = pytest.importorskip("pptx")
+    from pptx.enum.shapes import MSO_SHAPE_TYPE
+    out = str(tmp_path / "stack.pptx")
+    grp = model.Group(blocks=[model.Heading(text="Y", level=2),
+                              _narrow_table(),
+                              model.Figure(make=_simple_fig, caption="g")])
+    render_pptx(_chapter([grp]), out, {"title": "T"})
+    prs = pptx.Presentation(out)
+    # Stacked group: the narrow table is a NATIVE table (selectable), and there is
+    # exactly one picture (the figure) — not the two-image side-by-side layout.
+    n_tables = sum(1 for s in prs.slides for sh in s.shapes if sh.has_table)
+    n_pics = sum(1 for s in prs.slides for sh in s.shapes
+                 if sh.shape_type == MSO_SHAPE_TYPE.PICTURE)
+    assert n_tables >= 1, "the stacked group should use a native table"
+    assert n_pics == 1, "the stacked group should not duplicate images"
@@ -0,0 +1,142 @@
+---
+id: build_column_dictionary_py_datascience
+name: build_column_dictionary
+kind: function
+lang: py
+domain: datascience
+version: "1.0.0"
+purity: pure
+signature: "def build_column_dictionary(db_profile: dict) -> dict"
+description: "Builds the SEARCHABLE column dictionary of an entire database from the DatabaseProfile emitted by profile_database (eda group). Flattens db_profile['table_profiles'] (list of TableProfile with table and columns) into one entry per column with table, inferred type, semantic type, PII flag (GDPR/LOPDGDD), null fraction, cardinality and top values. Answers, at database level, 'where is the customer_id / phone / IBAN'. Also emits pii_columns and a grep-able markdown sorted by column, preceded by the columns that share a name across tables (cross-table join key candidates). Pure function, dict-no-throw, does not mutate the input."
+tags: [eda, relations]
+uses_functions: []
+uses_types: []
+returns: []
+returns_optional: false
+error_type: ""
+imports: []
+example: |
+  from datascience import build_column_dictionary
+  db_profile = {"table_profiles": [
+      {"table": "clientes", "columns": [
+          {"name": "email", "inferred_type": "text", "semantic_type": "email",
+           "null_pct": 0.05, "distinct_count": 990}]}]}
+  res = build_column_dictionary(db_profile)
+  # res["pii_columns"] -> [{"table": "clientes", "column": "email", "is_pii": True, ...}]
+tested: true
+tests:
+  - "test_golden_flattens_two_tables"
+  - "test_pii_flagged_from_semantic_type"
+  - "test_empty_semantic_type_maps_to_none_and_not_pii"
+  - "test_shared_column_names_detected_as_join_keys"
+  - "test_top_values_from_categorical_block"
+  - "test_empty_profile_returns_empty_ok"
+  - "test_malformed_input_returns_empty_ok"
+  - "test_missing_keys_read_defensively"
+  - "test_does_not_mutate_input"
+test_file_path: "python/functions/datascience/build_column_dictionary_test.py"
+file_path: "python/functions/datascience/build_column_dictionary.py"
+params:
+  - name: db_profile
+    desc: >
+      DatabaseProfile of the eda group as returned by profile_database under its
+      db_profile key (the dict with table_profiles). table_profiles is a list
+      of TableProfile; from each one, table (name) and columns (list of
+      ColumnProfile) are read. From each ColumnProfile these keys are read
+      defensively with .get(...): name, inferred_type
+      (numeric|categorical|datetime|text|boolean), semantic_type ("" is
+      normalized to None; values emitted by infer_semantic_type: email, iban,
+      credit_card, phone_intl, postal_code_es, ...), null_pct (fraction 0-1),
+      distinct_count (cardinality, exposed as n_distinct) and the
+      categorical.top block (for top_values). An empty, None or malformed input
+      produces the empty result with status ok (never raises).
+output: >
+  dict (dict-no-throw style) with status (always "ok"), n_tables (int, tables
+  whose columns were processed), n_columns (int, total columns), entries
+  (list[dict], one per column, with table, column, inferred_type,
+  semantic_type|None, is_pii (bool), null_pct (float 0-1|None), n_distinct
+  (int|None), top_values (list[str]|None)), pii_columns (subset of entries with
+  is_pii=True: personal data per [POL-MMNSEG-001-1.0]) and markdown (str,
+  grep-able table sorted by column name, preceded by the columns that share a
+  name across tables). Empty or malformed input -> n_tables/n_columns 0, empty
+  lists, markdown "".
+---
+
+## Ejemplo
+
+```python
+from datascience import build_column_dictionary
+
+# Minimal toy db_profile (shape of the db_profile key of profile_database).
+db_profile = {
+    "table_profiles": [
+        {
+            "table": "clientes",
+            "columns": [
+                {"name": "customer_id", "inferred_type": "numeric",
+                 "semantic_type": "", "null_pct": 0.0, "distinct_count": 1000},
+                {"name": "email", "inferred_type": "text",
+                 "semantic_type": "email", "null_pct": 0.05, "distinct_count": 990},
+                {"name": "ciudad", "inferred_type": "categorical",
+                 "semantic_type": "", "null_pct": 0.0, "distinct_count": 3,
+                 "categorical": {"top": [
+                     {"value": "Madrid", "count": 5, "pct": 0.5},
+                     {"value": "Bilbao", "count": 3, "pct": 0.3}]}},
+            ],
+        },
+        {
+            "table": "pedidos",
+            "columns": [
+                {"name": "customer_id", "inferred_type": "numeric",
+                 "semantic_type": "", "null_pct": 0.0, "distinct_count": 800},
+                {"name": "iban", "inferred_type": "text",
+                 "semantic_type": "iban", "null_pct": 0.1, "distinct_count": 795},
+            ],
+        },
+    ]
+}
+
+res = build_column_dictionary(db_profile)
+print(res["n_tables"], res["n_columns"])          # 2 5
+print([(e["table"], e["column"]) for e in res["pii_columns"]])
+# [('clientes', 'email'), ('pedidos', 'iban')]
+print(res["markdown"])   # grep-able table + join-key section (customer_id)
+```
+
+Real-world use, composed with `profile_database` (profiles the database, then builds the dictionary):
+
+```python
+from pipelines.profile_database import profile_database
+from datascience import build_column_dictionary
+
+r = profile_database("mi_base.duckdb", write_report=False)
+if r["status"] == "ok":
+    dicc = build_column_dictionary(r["db_profile"])
+    # grep dicc["markdown"] to locate where each piece of data lives;
+    # dicc["pii_columns"] gives the database's GDPR inventory.
+```
+
+## Cuando usarla
+
+Use it when you need a table.column index of an ENTIRE database: to locate by
+search "where is the customer_id / phone / IBAN" before writing a join, to
+discover cross-table join keys (columns with the same name in several tables),
+or to build an inventory of columns holding personal data (GDPR/LOPDGDD) to
+audit against. It is the natural step after `profile_database`: it takes its
+`db_profile` and turns it into a searchable dictionary.
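
The lookup described above can be sketched directly over `entries`, assuming the entry shape documented for this function (`table`, `column`, `is_pii`, ...). `find_columns` is a hypothetical helper, not part of the module, and the sample entries are made up for illustration.

```python
def find_columns(entries, needle):
    """Return "table.column" for every entry whose column name contains needle.

    Hypothetical helper (not part of build_column_dictionary): it relies only
    on the documented `table` and `column` keys of each entry.
    """
    needle = needle.lower()
    hits = []
    for e in entries:
        column = e.get("column")
        # Column names may be None in defensive outputs; skip those.
        if isinstance(column, str) and needle in column.lower():
            hits.append(f"{e.get('table')}.{column}")
    return sorted(hits)


# Illustrative entries in the documented shape (values are made up).
entries = [
    {"table": "clientes", "column": "customer_id", "is_pii": False},
    {"table": "clientes", "column": "email", "is_pii": True},
    {"table": "pedidos", "column": "customer_id", "is_pii": False},
    {"table": "pedidos", "column": "iban", "is_pii": True},
]

print(find_columns(entries, "customer"))
# ['clientes.customer_id', 'pedidos.customer_id']
```

Searching the structured `entries` avoids parsing the markdown; grepping `markdown` remains the quicker option for ad hoc inspection.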
+
+## Gotchas
+
+- The PII criterion relies ONLY on the `semantic_type` that the `eda` group
+  emits today (`infer_semantic_type`): email, phone_intl, iban, credit_card and
+  postal_code_es are flagged. The regex catalog does NOT currently detect
+  person names or DNI/NIE, so those columns fall through as text/categorical
+  and are NOT flagged automatically. Policy [POL-MMNSEG-001-1.0]: when in any
+  doubt about whether a column holds personal data, treat it as PII and warn
+  before exposing it; `pii_columns` is an aid, not an exhaustive GDPR inventory.
+- `n_distinct` is read from the ColumnProfile's `distinct_count` key (not from
+  `categorical.n_distinct`); on large tables it may come from `approx_unique`
+  (HyperLogLog) capped at n_rows, so it is not exact.
+- `top_values` is only filled when the column carries a `categorical` block
+  (`profile_table` adds it for categorical/text columns); numeric/datetime
+  columns leave it as None.
+- Pure function: it does not touch disk or mutate the input. It does NOT
+  profile the database (that is the job of `profile_database`); here its
+  output is only FLATTENED.
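
Since person names and DNI/NIE are not flagged by `semantic_type`, one way to follow the "when in doubt, treat it as PII" policy is a second pass over column names. The sketch below is only an illustration under loud assumptions: `suspect_pii_by_name` and its name patterns are hypothetical, not part of the module or of any official catalog, and a name match is a prompt for human review rather than a detection.

```python
import re

# ASSUMPTION: illustrative name tokens only; not an official PII catalog.
_SUSPECT_NAME = re.compile(
    r"(^|_)(dni|nie|nif|nombre|apellidos?|name|surname)($|_)", re.IGNORECASE
)


def suspect_pii_by_name(entries):
    """Return entries not already flagged as PII whose column name looks personal.

    Hypothetical helper: it reads only the documented `column` and `is_pii`
    keys and never changes the entries it receives.
    """
    suspects = []
    for e in entries:
        column = e.get("column")
        if e.get("is_pii") or not isinstance(column, str):
            continue
        if _SUSPECT_NAME.search(column):
            suspects.append(e)
    return suspects


sample = [
    {"table": "clientes", "column": "dni", "is_pii": False},
    {"table": "clientes", "column": "nombre_completo", "is_pii": False},
    {"table": "clientes", "column": "email", "is_pii": True},
    {"table": "clientes", "column": "ciudad", "is_pii": False},
]
print([e["column"] for e in suspect_pii_by_name(sample)])
# ['dni', 'nombre_completo']
```

Tokens are matched only at underscore or string boundaries, so a column such as `niebla` is not reported just because it starts with `nie`.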
@@ -0,0 +1,245 @@
+"""build_column_dictionary: SEARCHABLE column dictionary of an entire database.
+
+Pure function, stdlib-only. It does no I/O, depends on nothing external and does
+NOT mutate the input. It takes the ``db_profile`` (DatabaseProfile) emitted by
+``profile_database`` of the ``eda`` capability group and flattens its
+``table_profiles`` (list of TableProfile, each with ``table`` and ``columns``: a
+list of ColumnProfile) into one entry per column. It answers, at DATABASE level,
+"where is the customer_id / phone / IBAN in this dataset?": a grep-able
+table.column index with each column's type, inferred semantic type, PII flag,
+null percentage, cardinality and top values.
+
+Besides the flat listing it emits:
+  - ``pii_columns``: subset flagged as personal data (GDPR/LOPDGDD).
+  - ``markdown``: grep-able table sorted by column name, preceded by a section
+    that groups columns with the SAME name present in several tables
+    (cross-table join key candidates).
+
+dict-no-throw style of the ``eda`` group: it never raises. It reads every key
+defensively with ``.get(...)`` and tolerates None values / malformed structures;
+given an empty or corrupt input it returns the empty result with status ``ok``.
+
+PII criterion (policy [POL-MMNSEG-001-1.0]): ``is_pii=True`` is set when the
+actual ``semantic_type`` emitted by the ``eda`` group (see ``infer_semantic_type``)
+belongs to the set of personal-data types detectable today: email, international
+phone, IBAN, credit card and postal code (an address component). The group's
+regex catalog does NOT currently detect person names or DNI/NIE, so those columns
+fall through as text/categorical and are not flagged automatically: when in any
+doubt about whether a column holds personal data, treat it as PII and warn before
+exposing it.
+"""
+
+# semantic_types of the eda group (infer_semantic_type) that are personal data.
+# The group emits today: email, url, ipv4, ipv6, uuid, iban, credit_card, phone_intl,
+# postal_code_es, currency, datetime_iso, date_eu, integer, decimal, boolean,
+# hex_color. Of those, the ones that identify a natural person (GDPR/LOPDGDD) are:
+_PII_SEMANTIC_TYPES = frozenset(
+    {
+        "email",
+        "phone_intl",
+        "iban",
+        "credit_card",
+        "postal_code_es",  # postal code: address component (location data)
+    }
+)
+
+# Maximum number of frequent values listed per categorical column.
+_TOP_VALUES_LIMIT = 5
+
+
+def _empty_result() -> dict:
+    """Empty result with status ok for empty or malformed inputs."""
+    return {
+        "status": "ok",
+        "n_tables": 0,
+        "n_columns": 0,
+        "entries": [],
+        "pii_columns": [],
+        "markdown": "",
+    }
+
+
+def _top_values(col: dict) -> list | None:
+    """Extract up to _TOP_VALUES_LIMIT frequent values from the categorical block.
+
+    ``summarize_categorical`` leaves ``col["categorical"]["top"]`` as a list of
+    ``{value, count, pct}`` sorted by frequency. Returns only the values as
+    strings, or None if the column has no usable categorical block.
+    """
+    cat = col.get("categorical")
+    if not isinstance(cat, dict):
+        return None
+    top = cat.get("top")
+    if not isinstance(top, list) or not top:
+        return None
+    values = []
+    for item in top[:_TOP_VALUES_LIMIT]:
+        if isinstance(item, dict):
+            values.append(str(item.get("value")))
+        else:
+            values.append(str(item))
+    return values or None
+
+
+def _column_entry(table_name, col: dict) -> dict:
+    """Build the dictionary entry for one ColumnProfile.
+
+    Reads the eda-contract keys defensively: name, inferred_type,
+    semantic_type ("" is normalized to None), null_pct (fraction 0-1),
+    distinct_count (exposed as n_distinct) and the categorical block (top).
+    """
+    sem_raw = col.get("semantic_type")
+    semantic_type = sem_raw if sem_raw else None  # "" -> None
+
+    null_pct = col.get("null_pct")
+    if isinstance(null_pct, bool) or not isinstance(null_pct, (int, float)):
+        null_pct = None
+    else:
+        null_pct = float(null_pct)
+
+    n_distinct = col.get("distinct_count")
+    if isinstance(n_distinct, bool) or not isinstance(n_distinct, int):
+        n_distinct = None
+
+    return {
+        "table": table_name,
+        "column": col.get("name"),
+        "inferred_type": col.get("inferred_type"),
+        "semantic_type": semantic_type,
+        "is_pii": semantic_type in _PII_SEMANTIC_TYPES,
+        "null_pct": null_pct,
+        "n_distinct": n_distinct,
+        "top_values": _top_values(col),
+    }
+
+
+def _render_markdown(entries: list) -> str:
+    """Render the dictionary as grep-able markdown.
+
+    First a section that groups columns with the SAME name present in several
+    tables (cross-table join key candidates), then the full table sorted by
+    column name.
+    """
+    lines = ["# Diccionario de columnas", ""]
+
+    # Section: columns shared by name (join key candidates).
+    by_name: dict = {}
+    for e in entries:
+        by_name.setdefault(e["column"], set()).add(e["table"])
+    shared = {
+        name: tables
+        for name, tables in by_name.items()
+        if name is not None and len(tables) > 1
+    }
+
+    lines.append("## Columnas presentes en varias tablas (candidatas a join key)")
+    lines.append("")
+    if shared:
+        lines.append("| Columna | Tablas |")
+        lines.append("|---|---|")
+        for name in sorted(shared, key=lambda s: str(s).lower()):
+            tbls = ", ".join(sorted((str(t) for t in shared[name]), key=str.lower))
+            lines.append(f"| {name} | {tbls} |")
+    else:
+        lines.append(
+            "_Ninguna columna aparece con el mismo nombre en mas de una tabla._"
+        )
+    lines.append("")
+
+    # Full table sorted by column name (table name as tie-breaker).
+    lines.append("## Columnas")
+    lines.append("")
+    lines.append(
+        "| Columna | Tabla | Tipo | Tipo semantico | PII | %null | Distinct |"
+    )
+    lines.append("|---|---|---|---|---|---|---|")
+    for e in sorted(
+        entries, key=lambda e: (str(e["column"]).lower(), str(e["table"]).lower())
+    ):
+        sem = e["semantic_type"] or "—"
+        pii = "SI" if e["is_pii"] else ""
+        null_s = (
+            f"{e['null_pct'] * 100:.1f}%"
+            if isinstance(e["null_pct"], (int, float))
+            else ""
+        )
+        distinct_s = str(e["n_distinct"]) if e["n_distinct"] is not None else ""
+        itype = e["inferred_type"] or ""
+        lines.append(
+            f"| {e['column']} | {e['table']} | {itype} | {sem} | {pii} "
+            f"| {null_s} | {distinct_s} |"
+        )
+    lines.append("")
+    return "\n".join(lines)
+
+
+def build_column_dictionary(db_profile: dict) -> dict:
+    """Build the searchable column dictionary of an entire database.
+
+    Walks ``db_profile["table_profiles"]`` (list of TableProfile of the eda group,
+    each with ``table`` and ``columns``) and emits one entry per column with its
+    inferred type, semantic type, PII flag, null fraction, cardinality and
+    frequent values. Answers, at database level, where each piece of data lives.
+
+    Args:
+        db_profile: DatabaseProfile as returned by ``profile_database`` under
+            its ``db_profile`` key (the dict with ``table_profiles``). It is
+            read defensively; an empty, None or malformed input produces the
+            empty result with status ``ok``.
+
+    Returns:
+        Dict (dict-no-throw style, never raises) with the keys:
+        - ``status`` (str): always ``"ok"``.
+        - ``n_tables`` (int): tables whose columns were processed.
+        - ``n_columns`` (int): total columns indexed.
+        - ``entries`` (list[dict]): one entry per column with
+          ``{table, column, inferred_type, semantic_type|None, is_pii,
+          null_pct|None, n_distinct|None, top_values|None}``.
+        - ``pii_columns`` (list[dict]): subset of ``entries`` with
+          ``is_pii=True`` (personal data per [POL-MMNSEG-001-1.0]).
+        - ``markdown`` (str): grep-able table sorted by column name,
+          preceded by the columns that share a name across tables.
+    """
+    try:
+        if not isinstance(db_profile, dict):
+            return _empty_result()
+
+        table_profiles = db_profile.get("table_profiles")
+        if not isinstance(table_profiles, list) or not table_profiles:
+            return _empty_result()
+
+        entries: list = []
+        n_tables = 0
+        for tp in table_profiles:
+            if not isinstance(tp, dict):
+                continue
+            columns = tp.get("columns")
+            if not isinstance(columns, list):
+                continue
+            n_tables += 1
+            table_name = tp.get("table")
+            for col in columns:
+                if not isinstance(col, dict):
+                    continue
+                entries.append(_column_entry(table_name, col))
+
+        if not entries:
+            return {
+                "status": "ok",
+                "n_tables": n_tables,
+                "n_columns": 0,
+                "entries": [],
+                "pii_columns": [],
+                "markdown": "",
+            }
+
+        pii_columns = [e for e in entries if e["is_pii"]]
+        return {
+            "status": "ok",
+            "n_tables": n_tables,
+            "n_columns": len(entries),
+            "entries": entries,
+            "pii_columns": pii_columns,
+            "markdown": _render_markdown(entries),
+        }
+    except Exception:  # noqa: BLE001
+        return _empty_result()
@@ -0,0 +1,193 @@
+"""Tests for build_column_dictionary.
+
+Checks the flattening of an eda-group DatabaseProfile into a searchable column
+dictionary: per-column entries, PII flag from the semantic_type, detection of
+columns shared by name (join keys), defensive reading, and that the function
+is pure (does not mutate the input).
+"""
+
+import copy
+import os
+import sys
+
+sys.path.insert(0, os.path.dirname(__file__))
+
+from build_column_dictionary import build_column_dictionary
+
+
+def _col(name, inferred_type="categorical", semantic_type="", null_pct=0.0,
+         distinct_count=10, categorical=None) -> dict:
+    """Minimal ColumnProfile with the eda-contract keys used by the function."""
+    return {
+        "name": name,
+        "physical_type": "VARCHAR",
+        "inferred_type": inferred_type,
+        "semantic_type": semantic_type,
+        "null_pct": null_pct,
+        "distinct_count": distinct_count,
+        "flags": [],
+        "numeric": None,
+        "categorical": categorical,
+        "datetime": None,
+    }
+
+
+def _db_profile() -> dict:
+    """Toy DatabaseProfile with two tables and one shared join column."""
+    return {
+        "db_path": "toy.duckdb",
+        "n_tables": 2,
+        "table_profiles": [
+            {
+                "table": "clientes",
+                "columns": [
+                    _col("customer_id", "numeric", "", 0.0, 1000),
+                    _col("email", "text", "email", 0.05, 990),
+                    _col(
+                        "ciudad",
+                        "categorical",
+                        "",
+                        0.0,
+                        3,
+                        categorical={
+                            "top": [
+                                {"value": "Madrid", "count": 5, "pct": 0.5},
+                                {"value": "Bilbao", "count": 3, "pct": 0.3},
+                            ]
+                        },
+                    ),
+                ],
+            },
+            {
+                "table": "pedidos",
+                "columns": [
+                    _col("customer_id", "numeric", "", 0.0, 800),
+                    _col("iban", "text", "iban", 0.1, 795),
+                ],
+            },
+        ],
+    }
+
+
+# --------------------------------------------------------------------------- #
+# Golden
+# --------------------------------------------------------------------------- #
+def test_golden_flattens_two_tables():
+    res = build_column_dictionary(_db_profile())
+    assert res["status"] == "ok"
+    assert res["n_tables"] == 2
+    assert res["n_columns"] == 5
+    # One entry per column, with the contract keys.
+    keys = {
+        "table", "column", "inferred_type", "semantic_type",
+        "is_pii", "null_pct", "n_distinct", "top_values",
+    }
+    for e in res["entries"]:
+        assert keys.issubset(e.keys())
+    # The markdown has the table and the join-key section.
+    assert "## Columnas" in res["markdown"]
+    assert "candidatas a join key" in res["markdown"]
+
+
+# --------------------------------------------------------------------------- #
+# PII from the group's actual semantic_type
+# --------------------------------------------------------------------------- #
+def test_pii_flagged_from_semantic_type():
+    res = build_column_dictionary(_db_profile())
+    pii_cols = {(e["table"], e["column"]) for e in res["pii_columns"]}
+    assert ("clientes", "email") in pii_cols
+    assert ("pedidos", "iban") in pii_cols
+    # customer_id / ciudad are NOT PII.
+    assert ("clientes", "customer_id") not in pii_cols
+    assert ("clientes", "ciudad") not in pii_cols
+    # Consistency between is_pii in entries and the pii_columns list.
+    assert res["pii_columns"] == [e for e in res["entries"] if e["is_pii"]]
+
+
+def test_empty_semantic_type_maps_to_none_and_not_pii():
+    res = build_column_dictionary(_db_profile())
+    ciudad = next(
+        e for e in res["entries"]
+        if e["table"] == "clientes" and e["column"] == "ciudad"
+    )
+    assert ciudad["semantic_type"] is None
+    assert ciudad["is_pii"] is False
+
+
+# --------------------------------------------------------------------------- #
+# Columns shared by name = join key candidates
+# --------------------------------------------------------------------------- #
+def test_shared_column_names_detected_as_join_keys():
+    res = build_column_dictionary(_db_profile())
+    md = res["markdown"]
+    # customer_id appears in both tables -> listed in the join-key section.
+    join_section = md.split("## Columnas\n")[0]
+    assert "customer_id" in join_section
+    assert "clientes" in join_section and "pedidos" in join_section
+    # email is in only one table -> it does not appear in the join-key section.
+    assert "email" not in join_section
+
+
+# --------------------------------------------------------------------------- #
+# top_values from the categorical block
+# --------------------------------------------------------------------------- #
+def test_top_values_from_categorical_block():
+    res = build_column_dictionary(_db_profile())
+    ciudad = next(e for e in res["entries"] if e["column"] == "ciudad")
+    assert ciudad["top_values"] == ["Madrid", "Bilbao"]
+    # Columns without a categorical block -> None.
+    email = next(e for e in res["entries"] if e["column"] == "email")
+    assert email["top_values"] is None
+
+
+# --------------------------------------------------------------------------- #
+# Empty / malformed input -> empty result with status ok
+# --------------------------------------------------------------------------- #
+def test_empty_profile_returns_empty_ok():
+    empty = build_column_dictionary({})
+    assert empty == {
+        "status": "ok", "n_tables": 0, "n_columns": 0,
+        "entries": [], "pii_columns": [], "markdown": "",
+    }
+
+
+def test_malformed_input_returns_empty_ok():
+    for bad in (None, [], "nope", 42, {"table_profiles": "x"}):
+        res = build_column_dictionary(bad)
+        assert res["status"] == "ok"
+        assert res["n_columns"] == 0
+        assert res["entries"] == []
+        assert res["markdown"] == ""
+
+
+def test_missing_keys_read_defensively():
+    # TableProfiles and columns with missing keys / junk do not break anything.
+    profile = {
+        "table_profiles": [
+            {"table": "t1", "columns": [{"name": "a"}, "no-dict", None]},
+            "no-dict",
+            {"table": "t2"},  # sin columns
+            {"columns": [{}]},  # sin table, columna vacia
+        ]
+    }
+    res = build_column_dictionary(profile)
+    assert res["status"] == "ok"
+    # t1 (1 valid dict column; "no-dict" and None are skipped) + the table with
+    # no table key (1 column {}). t2 has no columns -> it does not count as a table.
+    assert res["n_tables"] == 2
+    assert res["n_columns"] == 2
+    a = next(e for e in res["entries"] if e["column"] == "a")
+    assert a["semantic_type"] is None
+    assert a["null_pct"] is None
+    assert a["n_distinct"] is None
+    assert a["top_values"] is None
+
+
+# --------------------------------------------------------------------------- #
+# Purity
+# --------------------------------------------------------------------------- #
+def test_does_not_mutate_input():
+    profile = _db_profile()
+    snapshot = copy.deepcopy(profile)
+    build_column_dictionary(profile)
+    assert profile == snapshot
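
The `isinstance(..., bool)` guards in `_column_entry` exist because `bool` is a subclass of `int` in Python, so a stray `True` would otherwise pass as a numeric `null_pct` or `distinct_count`. The sketch below reproduces that normalization in isolation; `normalize_null_pct` is a stand-in name used here for illustration, not a function of the module.

```python
def normalize_null_pct(value):
    """Stand-in mirroring the null_pct normalization done in _column_entry."""
    # bool must be rejected explicitly: isinstance(True, int) is True.
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        return None
    return float(value)


print(isinstance(True, int))       # True
print(normalize_null_pct(True))    # None
print(normalize_null_pct(0))       # 0.0
print(normalize_null_pct("0.5"))   # None
```

A test along these lines would pin the behavior for boolean inputs, which the current suite does not exercise.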
@@ -0,0 +1,111 @@
+---
+id: categorical_top_bar_figure_py_datascience
+name: categorical_top_bar_figure
+kind: function
+lang: py
+domain: datascience
+version: "1.0.0"
+purity: impure
+signature: "def categorical_top_bar_figure(top: list, n_distinct: int = 0, title: str = \"\", top_k: int = 6, n_rows=None) -> \"matplotlib.figure.Figure\""
+description: "Builds a matplotlib horizontal bar figure of the top_k most frequent categories of a categorical column, largest on top, aggregating the rest into a gray bar labeled \"Otros (N categorías)\". Input contract identical to categorical_top_pie_figure (direct donut↔bars swap): it consumes the `top` block of summarize_categorical and returns a matplotlib.figure.Figure ready to be rasterized by the EDA report renderer. Agg backend with no global pyplot; fully defensive against an empty/None top, never raises."
+tags: [eda, categorical, bar, barh, matplotlib, figure, visualization, datascience, impure]
+uses_functions: []
+uses_types: []
+returns: []
+returns_optional: false
+error_type: "error_go_core"
+imports: [matplotlib]
+example: |
+  from categorical_top_bar_figure import categorical_top_bar_figure
+  top = [
+      {"value": "rojo", "count": 40, "pct": 0.4},
+      {"value": "azul", "count": 30, "pct": 0.3},
+      {"value": "verde", "count": 20, "pct": 0.2},
+  ]
+  fig = categorical_top_bar_figure(top, n_distinct=12, title="color", top_k=6, n_rows=100)
+tested: true
+tests:
+  - "test_returns_figure"
+  - "test_ten_items_topk_six_yields_seven_bars"
+  - "test_empty_top_does_not_raise_and_returns_figure"
+  - "test_long_value_truncated"
+  - "test_none_value_and_none_count_are_handled"
+  - "test_n_rows_adds_exact_others_bar"
+test_file_path: "python/functions/datascience/categorical_top_bar_figure_test.py"
+file_path: "python/functions/datascience/categorical_top_bar_figure.py"
+params:
+  - name: top
+    desc: "List of {value, count, pct} dicts sorted by count in descending order (output of the `top` block of summarize_categorical). It may be empty or contain incomplete dicts: non-dict items and items with no count, a None count or count <= 0 are discarded. A None value is accepted (empty label)."
+  - name: n_distinct
+    desc: "Total number of distinct categories in the column. Labels the aggregated bar as \"Otros (n_distinct - top_k)\" (minimum 0). If it does not exceed the number of bars shown, the actual overflow of `top` is used as the number of aggregated categories. Default 0."
+  - name: title
+    desc: "Figure title (column name). Truncated to ~48 chars with an ellipsis if too long. Default \"\" (no title)."
+  - name: top_k
+    desc: "Maximum number of explicit bars. Default 6. The \"Otros\" bar does not count against this limit. With top_k <= 0, at least the largest category is shown."
+  - name: n_rows
+    desc: "Optional. Total number of rows in the dataset. If given and the sum of the counts shown is < n_rows, the \"Otros\" bar uses (n_rows - sum_shown) as its count so that it is exact with respect to the real total. If omitted, \"Otros\" uses the sum of the counts outside the top_k shown (only when top carries more than top_k items). Default None."
+output: "A matplotlib.figure.Figure (figsize 6.4 x a height scaled with the number of bars, dpi 150) with one horizontal-bar Axes: the most frequent category on top, the gray \"Otros (N categorías)\" bar at the bottom, each bar annotated at its end with its count and percentage, and category labels (yticklabels) truncated to ~22 chars. If there are no valid counts it still returns a Figure with the centered text \"sin datos categóricos\" (never raises); any unexpected error falls back to a Figure carrying the error text. The caller rasterizes/closes the figure; the function neither shows nor saves it."
+---
+
+## Ejemplo
+
+```python
+from categorical_top_bar_figure import categorical_top_bar_figure
+
+# `top` is the output of the "top" block of summarize_categorical (already sorted desc).
+top = [
+    {"value": "rojo", "count": 40, "pct": 0.40},
+    {"value": "azul", "count": 30, "pct": 0.30},
+    {"value": "verde", "count": 20, "pct": 0.20},
+    {"value": "amarillo", "count": 5, "pct": 0.05},
+]
+
+fig = categorical_top_bar_figure(
+    top,
+    n_distinct=12,            # 12 distinct categories in total
+    title="color_producto",
+    top_k=6,                  # up to 6 explicit bars
+    n_rows=100,               # "Otros" = 100 - 95 = 5, over 8 aggregated categories
+)
+
+# The report renderer rasterizes it; here we only persist it for inspection.
+fig.savefig("/tmp/barras_color.png")
+```
+
+## Cuando usarla
+
+Use it inside an EDA report when you want to compare the **magnitudes** of the
+dominant categories of a categorical column: which category leads and by how
+much over the next ones. Pass it the `top` block of `summarize_categorical`
+directly (already sorted in descending order) plus `n_distinct` so that the
+"Otros" bar states how many categories are grouped. It is the "bar" clone of
+the `categorical_top_pie_figure` donut with an **identical input contract**:
+you can swap one for the other without touching the caller. Choose bars when
+comparing exact sizes matters; choose the donut when the share of the total
+matters.
+
+## Gotchas
+
+- **Impure because of matplotlib.** It touches the rendering machinery. It uses
+  the `Agg` backend and the object-oriented `Figure`/`add_subplot` API, NEVER
+  `pyplot.*`, so it neither touches global state nor leaks figures between
+  calls. `pyplot` is NOT thread-safe; this function avoids that risk by building
+  the `Figure` directly, so it is safe to call in a loop from the renderer.
+- **The caller closes the figure.** The function returns the `Figure` but
+  neither shows nor saves it. The consumer must rasterize it and then release it
+  (`fig.clf()`, or `matplotlib.pyplot.close(fig)` if the caller used pyplot) so
+  memory does not pile up over large batches of columns.
+- **`barh` draws bottom-up.** The most frequent category ends up on top because
+  the display order is reversed before plotting; the "Otros" bar always sits at
+  the bottom. Do not reorder `top` expecting a different layout: the function
+  assumes it already arrives sorted by count, descending.
+- **Exact "Otros" magnitude only with `n_rows`.** Without `n_rows`, the "Otros"
+  bar is computed from the overflow present in `top`; if the producer already
+  trimmed `top` to `top_k`, there will be no "Otros" bar even when more
+  categories exist. Pass `n_rows` (total rows in the dataset) to get a bar that
+  is correct with respect to the real total.
+- **Defensive, never raises.** `top=[]`, `value=None`, `count=None` or
+  non-numeric counts are handled without error: in the worst case it returns a
+  `Figure` with "sin datos categóricos", and any unexpected exception falls back
+  to a `Figure` carrying the error text. There is no need to wrap the call in
+  try/except for fear of a raise.
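
A minimal sketch of how the "Otros" bar is sized, to make the `n_rows` gotcha above concrete. `others_bar` is a hypothetical helper written for this note, not part of the module. It assumes `top` is already clean (positive numeric counts) and `top_k >= 1`, and it mirrors only the arithmetic, with no plotting.

```python
def others_bar(top, n_distinct, top_k=6, n_rows=None):
    """Return (count, n_categories) for the aggregated "Otros" bar.

    Hypothetical helper for illustration only; assumes clean input.
    """
    counts = [float(item["count"]) for item in top]
    shown, overflow = counts[:top_k], counts[top_k:]
    sum_shown = sum(shown)

    # Categories folded into "Otros": n_distinct minus the bars shown, falling
    # back to the overflow items actually present in `top`.
    others_categories = max(n_distinct - len(shown), 0) or len(overflow)

    # Count: exact remainder when n_rows is known, otherwise the overflow sum.
    others_count = 0.0
    if n_rows is not None and n_rows > sum_shown:
        others_count = n_rows - sum_shown
    if others_count <= 0:
        others_count = sum(overflow, 0.0)
    return others_count, others_categories


top = [
    {"value": "rojo", "count": 40},
    {"value": "azul", "count": 30},
    {"value": "verde", "count": 20},
    {"value": "amarillo", "count": 5},
]

with_total = others_bar(top, n_distinct=12, top_k=6, n_rows=100)
without_total = others_bar(top, n_distinct=12, top_k=6)
```

With `n_rows=100` the remainder is 5 rows spread over 8 categories. Without `n_rows` the count is 0 because `top` carries no overflow, and since the bar is drawn only when both the count and the category number are positive, no "Otros" bar would appear in that case.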
@@ -0,0 +1,233 @@
+"""Impure EDA helper: horizontal bar figure of the most common categories (`eda` group).
+
+Builds a horizontal bar chart of the ``top_k`` most frequent categories of a
+categorical column, folding everything else into a single gray
+"Otros (N categorías)" bar. The most frequent category sits at the top, each bar
+labelled with its count (and percentage) at the end. Returns a ready-to-rasterize
+``matplotlib.figure.Figure``; it never shows nor saves it.
+
+This is the "magnitude" twin of ``categorical_top_pie_figure``: identical input
+contract (same ``top``/``n_distinct``/``title``/``top_k``/``n_rows`` signature) so
+it can be swapped in directly, but it communicates comparable magnitudes via bars
+instead of proportions via wedges.
+
+Impure because it touches matplotlib's rendering machinery. It uses the headless
+Agg backend and the object-oriented ``Figure`` API (no ``pyplot``) so it leaks no
+global state and is safe to call repeatedly from a report renderer.
+"""
+
+import matplotlib
+
+matplotlib.use("Agg")
+
+from matplotlib.figure import Figure  # noqa: E402
+
+
+# Gray reserved for the aggregated "Otros" bar.
+_OTHER_COLOR = "#9e9e9e"
+# Muted gray for secondary text (title fallback, no-data message).
+_MUTED_TEXT = "#5f6b7a"
+# Soft red for the error fallback message.
+_ERROR_TEXT = "#b00020"
+# Pleasant, colour-blind-friendly qualitative palette for the explicit bars.
+_PALETTE = [
+    "#4C72B0",
+    "#DD8452",
+    "#55A868",
+    "#C44E52",
+    "#8172B3",
+    "#937860",
+    "#DA8BC3",
+    "#8C8C8C",
+    "#CCB974",
+    "#64B5CD",
+]
+
+
+def _truncate(text, width: int = 22) -> str:
+    """Truncate ``text`` to ``width`` chars, appending an ellipsis if cut."""
+    s = "" if text is None else str(text)
+    if len(s) <= width:
+        return s
+    if width <= 1:
+        return s[:width]
+    return s[: width - 1] + "…"
+
+
+def _message_figure(message: str, color: str = _MUTED_TEXT, title: str = "") -> "Figure":
+    """Return a fallback ``Figure`` carrying a single centered message."""
+    fig = Figure(figsize=(6.4, 4.0), dpi=150)
+    ax = fig.add_subplot(111)
+    ax.axis("off")
+    ax.text(
+        0.5,
+        0.5,
+        message,
+        ha="center",
+        va="center",
+        fontsize=12,
+        color=color,
+        wrap=True,
+        transform=ax.transAxes,
+    )
+    if title:
+        ax.set_title(_truncate(title, 48), fontsize=12, loc="center", pad=8)
+    fig.tight_layout()
+    return fig
+
+
+def categorical_top_bar_figure(
+    top: list,
+    n_distinct: int = 0,
+    title: str = "",
+    top_k: int = 6,
+    n_rows=None,
+) -> "matplotlib.figure.Figure":
+    """Build a horizontal bar figure of the most common categories of a column.
+
+    Renders the ``top_k`` most frequent categories as explicit horizontal bars,
+    largest at the top, and aggregates every remaining category into a single
+    gray "Otros (N categorías)" bar at the bottom. Each bar is annotated with its
+    count and percentage of the total at the end of the bar; the category names
+    are truncated Y tick labels.
+
+    The function shares the exact input contract of
+    ``categorical_top_pie_figure`` (the donut twin) so it is a drop-in swap. It is
+    fully defensive: empty input, missing/``None`` values or counts never raise.
+    When there is nothing valid to draw it still returns a ``Figure`` carrying a
+    centered "sin datos categóricos" message, and any unexpected error is caught
+    and turned into a fallback ``Figure`` carrying the error text.
+
+    Args:
+        top: List of ``{value, count, pct}`` dicts, already sorted by ``count``
+            descending (the ``top`` block of ``summarize_categorical``). May be
+            empty or carry incomplete/``None`` entries; non-dict items, items
+            without a positive numeric ``count`` and ``None`` counts are skipped.
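+        n_distinct: Total number of distinct categories in the column. Used to
+            label the aggregated bar as "Otros (N categorías)", where N is
+            ``n_distinct`` minus the number of bars actually shown (floored at
+            0). When it does not exceed the number of shown bars, N falls back
+            to the number of overflow items present in ``top``.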
+        title: Figure title (the column name). Truncated when too long.
+        top_k: Maximum number of explicit bars. Default 6. The "Otros" bar does
+            not count against this limit.
+        n_rows: Optional total row count of the dataset. When given and the sum of
+            shown counts is below ``n_rows``, the "Otros" bar uses
+            ``n_rows - sum_shown`` as its count so it is exact with respect to the
+            real total. When omitted, "Otros" uses the sum of the counts that fall
+            outside the shown ``top_k`` (only when ``top`` carries more than
+            ``top_k`` items).
+
+    Returns:
+        A ``matplotlib.figure.Figure`` with a single horizontal-bar Axes. The
+        caller is responsible for rasterizing/closing it.
+    """
+    try:
+        safe_title = _truncate(title, 48)
+
+        # --- Defensive parse: keep only well-formed {value, count} with count > 0.
+        cleaned = []
+        if isinstance(top, list):
+            for item in top:
+                if not isinstance(item, dict):
+                    continue
+                count = item.get("count")
+                if count is None:
+                    continue
+                try:
+                    count = float(count)
+                except (TypeError, ValueError):
+                    continue
+                if count <= 0:
+                    continue
+                cleaned.append((item.get("value"), count))
+
+        if not cleaned:
+            return _message_figure("sin datos categóricos", title=title)
+
+        # --- Split into shown bars and the aggregated remainder.
+        shown = cleaned[: max(int(top_k), 0)]
+        if not shown:  # top_k <= 0 — show at least the largest category.
+            shown = cleaned[:1]
+
+        sum_shown = sum(c for _, c in shown)
+        overflow_count = sum(c for _, c in cleaned[len(shown):])
+
+        # How many categories are folded into "Otros".
+        try:
+            nd = int(n_distinct)
+        except (TypeError, ValueError):
+            nd = 0
+        others_categories = max(nd - len(shown), 0)
+        # If n_distinct is unknown/too small, fall back to the overflow we
+        # actually have in `top` beyond the shown bars.
+        overflow_items = len(cleaned) - len(shown)
+        if others_categories == 0 and overflow_items > 0:
+            others_categories = overflow_items
+
+        # Count attributed to the "Otros" bar.
+        others_count = 0.0
+        if n_rows is not None:
+            try:
+                total_rows = float(n_rows)
+            except (TypeError, ValueError):
+                total_rows = None
+            if total_rows is not None and total_rows > sum_shown:
+                others_count = total_rows - sum_shown
+        if others_count <= 0:
+            others_count = overflow_count
+
+        # --- Build the display order (top to bottom): largest .. smallest, Otros.
+        display_labels = [_truncate(v, 22) for v, _ in shown]
+        display_values = [c for _, c in shown]
+        display_colors = [_PALETTE[i % len(_PALETTE)] for i in range(len(shown))]
+
+        has_others = others_count > 0 and others_categories > 0
+        if has_others:
+            display_labels.append(f"Otros ({others_categories} categorías)")
+            display_values.append(others_count)
+            display_colors.append(_OTHER_COLOR)
+
+        total = sum(display_values) or 1.0
+
+        # barh draws bottom-up, so reverse the display order before plotting to
+        # land the largest category on top and "Otros" at the bottom.
+        labels = list(reversed(display_labels))
+        values = list(reversed(display_values))
+        colors = list(reversed(display_colors))
+        y_pos = range(len(values))
+
+        # Height scales with the number of bars so dense reports stay readable.
+        n_bars = len(values)
+        height = max(2.4, min(0.4 * n_bars + 1.2, 14.0))
+        fig = Figure(figsize=(6.4, height), dpi=150)
+        ax = fig.add_subplot(111)
+
+        ax.barh(list(y_pos), values, color=colors, edgecolor="white")
+        ax.set_yticks(list(y_pos))
+        ax.set_yticklabels(labels, fontsize=8)
+        ax.set_xlabel("conteo", fontsize=9)
+
+        max_val = max(values) if values else 1.0
+        ax.set_xlim(0, max_val * 1.18 if max_val > 0 else 1.0)
+
+        # Annotate each bar with its count and percentage at the end of the bar.
+        for y, val in zip(y_pos, values):
+            pct = val / total * 100.0
+            ax.text(
+                val + max_val * 0.012,
+                y,
+                f"{int(round(val))} ({pct:.0f}%)",
+                va="center",
+                ha="left",
+                fontsize=7,
+                color="#202020",
+            )
+
+        if safe_title:
+            ax.set_title(safe_title, fontsize=13, loc="left", pad=10)
+
+        fig.tight_layout()
+        return fig
+    except Exception as exc:  # noqa: BLE001 — never raise from a figure builder.
+        return _message_figure(
+            f"error al dibujar barras: {exc}", color=_ERROR_TEXT
+        )
@@ -0,0 +1,103 @@
+"""Tests para categorical_top_bar_figure (barras de categorías top, grupo eda).
+
+Usa el backend Agg sin pyplot; no muestra ni guarda figuras. Cada test cierra
+explícitamente la Figure construida (matplotlib.pyplot.close) para no acumular
+estado entre tests.
+"""
+
+import matplotlib
+
+matplotlib.use("Agg")
+
+import matplotlib.pyplot as plt  # noqa: E402
+from matplotlib.figure import Figure  # noqa: E402
+
+from categorical_top_bar_figure import categorical_top_bar_figure
+
+
+def _make_top(n):
+    """n items {value, count, pct} ordenados desc por count."""
+    return [
+        {"value": f"cat_{i}", "count": n - i, "pct": (n - i) / sum(range(1, n + 1))}
+        for i in range(n)
+    ]
+
+
+def _bar_count(ax):
+    """Devuelve el nº de barras (longitud del primer BarContainer del Axes)."""
+    if ax.containers:
+        return len(ax.containers[0])
+    return 0
+
+
+def test_returns_figure():
+    fig = categorical_top_bar_figure(_make_top(3), n_distinct=3, title="col")
+    assert isinstance(fig, Figure)
+    plt.close(fig)
+
+
+def test_ten_items_topk_six_yields_seven_bars():
+    top = _make_top(10)
+    fig = categorical_top_bar_figure(top, n_distinct=10, title="muchas", top_k=6)
+    ax = fig.axes[0]
+    # 6 categorías explícitas + 1 barra "Otros".
+    assert _bar_count(ax) == 7
+    plt.close(fig)
+
+
+def test_empty_top_does_not_raise_and_returns_figure():
+    fig = categorical_top_bar_figure([], n_distinct=0, title="vacía")
+    assert isinstance(fig, Figure)
+    # Sin datos: no debe haber barras.
+    assert _bar_count(fig.axes[0]) == 0
+    plt.close(fig)
+
+
+def test_long_value_truncated():
+    long_value = "una_categoria_con_un_nombre_larguisimo_que_excede_el_limite"
+    top = [
+        {"value": long_value, "count": 10, "pct": 0.5},
+        {"value": "corta", "count": 10, "pct": 0.5},
+    ]
+    fig = categorical_top_bar_figure(top, n_distinct=2, title="col", top_k=6)
+    ax = fig.axes[0]
+    tick_texts = [t.get_text() for t in ax.get_yticklabels()]
+    # El valor largo aparece truncado con elipsis y NO en su forma completa.
+    assert any("…" in t for t in tick_texts)
+    assert long_value not in " ".join(tick_texts)
+    plt.close(fig)
+
+
+def test_none_value_and_none_count_are_handled():
+    top = [
+        {"value": None, "count": 5, "pct": 0.5},
+        {"value": "b", "count": None, "pct": 0.0},  # count None -> se descarta
+        {"value": "c", "count": 5, "pct": 0.5},
+    ]
+    fig = categorical_top_bar_figure(top, n_distinct=2, title="con nones", top_k=6)
+    assert isinstance(fig, Figure)
+    # Solo 2 items válidos, sin overflow -> 2 barras, sin "Otros".
+    assert _bar_count(fig.axes[0]) == 2
+    plt.close(fig)
+
+
+def test_n_rows_adds_exact_others_bar():
+    # 3 categorías mostradas suman 30, dataset real 100 -> "Otros" = 70.
+    top = [
+        {"value": "a", "count": 15, "pct": 0.15},
+        {"value": "b", "count": 10, "pct": 0.10},
+        {"value": "c", "count": 5, "pct": 0.05},
+    ]
+    fig = categorical_top_bar_figure(
+        top, n_distinct=20, title="col", top_k=3, n_rows=100
+    )
+    ax = fig.axes[0]
+    # 3 explícitas + Otros.
+    assert _bar_count(ax) == 4
+    tick_texts = [t.get_text() for t in ax.get_yticklabels()]
+    # La barra Otros refleja n_distinct - top_k = 17 categorías.
+    assert any("Otros (17 categorías)" in t for t in tick_texts)
+    # Su anotación lleva el count 70.
+    annotation_texts = [t.get_text() for t in ax.texts]
+    assert any("70" in t for t in annotation_texts)
+    plt.close(fig)
@@ -23,15 +23,20 @@ from __future__ import annotations

 import sys

-import cv2
 import numpy as np

+# OpenCV (cv2) se importa de forma perezosa dentro de las funciones que lo usan:
+# un import a nivel de módulo rompería `import datascience` en entornos sin
+# opencv instalado (p. ej. venvs de analysis que solo usan las funciones de
+# series temporales o perfilado del paquete).
+

 # --------------------------------------------------------------------------------------------
 # Detectores. Cada uno se normaliza a una función run(img) -> list[str] que nunca lanza.
 # --------------------------------------------------------------------------------------------
 def _make_opencv_runner(detector):
    """Envuelve un cv2.QRCodeDetector(Aruco) en run(img) -> list[str]."""
+    import cv2

    def run(img):
        out: list[str] = []
@@ -89,6 +94,8 @@ def _make_pyzbar_runner(zbar_decode):

 def _build_detectors(debug=False):
    """Construye la lista de (nombre, runner) de detectores disponibles, en orden de preferencia."""
+    import cv2
+
    detectors = []

    # OpenCV Aruco (preferido): no requiere libs de sistema ni descarga de modelos.
@@ -135,6 +142,8 @@ def _build_detectors(debug=False):
 # --------------------------------------------------------------------------------------------
 def _load_bgr(image_path):
    """Carga la imagen como BGR (uint8). Devuelve None si no se puede leer."""
+    import cv2
+
    bgr = cv2.imread(image_path, cv2.IMREAD_COLOR)
    if bgr is not None:
        return bgr
@@ -150,6 +159,8 @@ def _load_bgr(image_path):

 def _build_variants(image_path, upscale):
    """Genera (nombre, ndarray) de variantes preprocesadas, en orden de prioridad."""
+    import cv2
+
    bgr = _load_bgr(image_path)
    if bgr is None:
        return []
@@ -0,0 +1,94 @@
+---
+name: forecast_seasonal_median
+kind: function
+lang: py
+domain: datascience
+version: "1.0.0"
+purity: pure
+signature: "def forecast_seasonal_median(history: list[dict], horizon_dates: list[str], as_of: str, dow_weeks: int = 8, trend_recent_weeks: int = 4, trend_clip: tuple = (0.5, 2.0)) -> list[dict]"
+description: "Forecast diario por mediana estacional (mismo dia de semana) mas factor de tendencia acotado, para una o varias series temporales. Base estacional = mediana del valor en las ultimas dow_weeks fechas con el mismo dia de semana que la fecha objetivo (dias ausentes = 0, para series intermitentes). Factor de tendencia por serie = razon de la suma de las ultimas trend_recent_weeks semanas frente a las trend_recent_weeks anteriores, clipped a trend_clip. y_pred = max(0, base * factor). Funcion pura y determinista (solo stdlib, sin I/O ni datetime.now). Nucleo del forecast de ventas diarias Aurgi (dia x centro x subcategoria CGQ)."
+tags: [forecast, bigquery, timeseries, seasonal, median, baseline, sales, aurgi, python]
+uses_functions: []
+uses_types: []
+returns: []
+returns_optional: false
+error_type: ""
+imports: []
+params:
+  - name: history
+    desc: "lista de observaciones {series_id: str, date: 'YYYY-MM-DD', value: float}. Filas duplicadas (misma serie+fecha) se suman. Los dias sin fila dentro de las ventanas cuentan como valor 0 (series intermitentes: sin fila = sin venta)"
+  - name: horizon_dates
+    desc: "fechas futuras a predecir, strings ISO 'YYYY-MM-DD'. Tipicamente as_of+1..as_of+horizon"
+  - name: as_of
+    desc: "fecha de corte 'YYYY-MM-DD': ultimo dia de historia utilizable, inclusive. Todas las ventanas se calculan hacia atras desde aqui"
+  - name: dow_weeks
+    desc: "numero de fechas del mismo dia de semana que la objetivo a promediar (mediana) para la base estacional. Default 8 (8 semanas)"
+  - name: trend_recent_weeks
+    desc: "tamano en semanas de cada una de las dos ventanas de tendencia (reciente y anterior). Default 4: compara 4 semanas recientes vs las 4 previas"
+  - name: trend_clip
+    desc: "tupla (min, max) al que se acota el factor de tendencia. Default (0.5, 2.0): la prediccion no puede caer a menos de la mitad ni superar el doble por tendencia"
+output: "list[dict]: una fila {series_id: str, date: str, y_pred: float} por cada serie presente en history y cada fecha de horizon_dates. Ordenada por series_id (asc) y luego por el orden de horizon_dates. y_pred siempre >= 0.0"
+tested: true
+tests:
+  - "serie regular con patron semanal claro da la mediana correcta"
+  - "serie intermitente: los dias ausentes cuentan como 0 en la mediana"
+  - "serie con tendencia creciente aplica factor >1 acotado a trend_clip"
+  - "sin datos en la ventana anterior, el factor de tendencia es 1.0"
+  - "horizon de 7 dias produce una fila por serie y fecha, ordenadas"
+test_file_path: "python/functions/datascience/forecast_seasonal_median_test.py"
+file_path: "python/functions/datascience/forecast_seasonal_median.py"
+---
+
+## Ejemplo
+
+```python
+from datascience import forecast_seasonal_median
+
+# Historia diaria por serie (centro|subcategoria). Sin fila = sin venta = 0.
+history = [
+    {"series_id": "12|NEUMATICOS", "date": "2026-06-23", "value": 1450.0},
+    {"series_id": "12|NEUMATICOS", "date": "2026-06-16", "value": 1380.0},
+    {"series_id": "12|NEUMATICOS", "date": "2026-06-09", "value": 1500.0},
+    # ... mas historia (idealmente >= 8 semanas para la base estacional) ...
+]
+
+# as_of = ultimo dia cerrado; predice los 7 dias siguientes.
+horizon = ["2026-06-30", "2026-07-01", "2026-07-02", "2026-07-03",
+           "2026-07-04", "2026-07-05", "2026-07-06"]
+
+preds = forecast_seasonal_median(history, horizon, as_of="2026-06-29")
+for p in preds:
+    print(p["series_id"], p["date"], round(p["y_pred"], 2))
+```
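
The horizon in the example above is typed by hand. It can also be derived from `as_of`, as in this small sketch. `horizon_after` is a hypothetical helper, not part of the `datascience` package; it only builds the ISO strings `as_of+1 .. as_of+days`.

```python
from datetime import date, timedelta


def horizon_after(as_of: str, days: int = 7) -> list[str]:
    """ISO dates from as_of+1 to as_of+days (hypothetical helper)."""
    start = date.fromisoformat(as_of)
    return [(start + timedelta(days=k)).isoformat() for k in range(1, days + 1)]


horizon = horizon_after("2026-06-29")
```

The result can be passed directly as `horizon_dates`, which keeps the horizon consistent with the cutoff when `as_of` moves.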
+
+## Cuando usarla
+
+Use it when you need a robust, explainable daily forecast baseline for series
+with strong weekly seasonality (sales by day of the week) and possible gaps
+(days with no sales). It is the pure core of the `run_sales_forecast` pipeline:
+it is called once with the whole aggregated history and returns all predictions
+in one go. Use it as a starting point before heavier models (Prophet, ARIMA,
+gradient boosting): it captures the day-of-week pattern plus a bounded trend
+correction, with no external dependencies and no training. It suits many series
+at once (thousands of center x subcategory pairs), where training one model per
+series does not pay off.
+
+## Notas
+
+- Pure, deterministic function: it does no I/O and never calls `datetime.now()`;
+  the time cutoff is always the explicit `as_of` argument. Standard library only
+  (datetime, statistics), no numpy or pandas.
+- The seasonal base uses EXACT calendar dates: the most recent date <= as_of
+  with the same weekday as the target, then 7 days back per point (up to
+  `dow_weeks` points). A date missing from `history` counts as 0, so the median
+  reflects intermittent series well.
+- The trend factor is computed ONCE per series (it does not depend on the target
+  date) as the ratio of the sums over two contiguous windows of
+  `trend_recent_weeks` weeks each. A zero denominator gives a factor of 1.0
+  (this avoids division by zero and does not inflate series that are just
+  starting). Clipping to `trend_clip` keeps a recent spike from blowing up the
+  prediction.
+- `y_pred = max(0.0, base * factor)`: never negative. It does not model holidays
+  or one-off events; that would need an additional calendar layer.
+- For a reliable seasonal base, provide at least `dow_weeks` weeks of history.
+  With less history, the missing points (=0) pull the median down.
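
The notes above can be traced by hand for a single series. The sketch below re-derives the seasonal base and the trend factor with the standard library only. `seasonal_base` and `trend_factor` are hypothetical names used for illustration (the real function does both steps internally), and `values` is assumed to be a `{date: value}` map for one series.

```python
from datetime import date, timedelta
from statistics import median


def seasonal_base(values, target, as_of, dow_weeks=8):
    """Median over the last `dow_weeks` same-weekday dates <= as_of (missing = 0)."""
    back = (as_of.weekday() - target.weekday()) % 7
    anchor = as_of - timedelta(days=back)  # most recent matching weekday
    return median(
        values.get(anchor - timedelta(days=7 * i), 0.0) for i in range(dow_weeks)
    )


def trend_factor(values, as_of, weeks=4, clip=(0.5, 2.0)):
    """Ratio of the recent window sum to the prior window sum, clipped."""
    span = timedelta(days=7 * weeks)
    recent = sum(v for d, v in values.items() if as_of - span < d <= as_of)
    prior = sum(v for d, v in values.items() if as_of - 2 * span < d <= as_of - span)
    factor = 1.0 if prior == 0 else recent / prior
    return min(clip[1], max(clip[0], factor))


as_of = date(2026, 6, 30)  # a Tuesday
target = as_of + timedelta(days=7)

# Series that only has data on the 4 most recent Tuesdays.
young = {as_of - timedelta(days=7 * w): 50.0 for w in range(4)}
young_pred = max(0.0, seasonal_base(young, target, as_of) * trend_factor(young, as_of))

# Series that tripled: 30 on the 4 recent Tuesdays, 10 on the 4 before.
growing = {as_of - timedelta(days=7 * w): (30.0 if w < 4 else 10.0) for w in range(8)}
growing_pred = max(0.0, seasonal_base(growing, target, as_of) * trend_factor(growing, as_of))
```

For the young series the four missing Tuesdays count as 0, so the base is the median of four 50s and four 0s (25.0), and the empty prior window gives a factor of 1.0. For the growing series the raw ratio 120/40 = 3.0 is clipped to 2.0 and applied to a base of 20.0.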
@@ -0,0 +1,126 @@
+"""forecast_seasonal_median — forecast diario por mediana estacional + tendencia.
+
+Funcion PURA (sin I/O, sin datetime.now(), determinista). Predice el valor futuro
+de una o varias series temporales diarias combinando dos senales:
+
+  1. Base estacional: la mediana del valor en las ultimas `dow_weeks` fechas con el
+     MISMO dia de semana que la fecha objetivo (dias ausentes = 0, para series
+     intermitentes donde "sin fila" significa "sin venta").
+  2. Factor de tendencia por serie: cuanto ha crecido/caido la actividad reciente
+     respecto al periodo inmediatamente anterior (razon de sumas), acotado a un
+     rango para no amplificar ruido.
+
+Disenada para el forecast de ventas diarias de Aurgi (dia x centro x subcategoria
+CGQ): cada serie es un par centro|subcategoria y el patron semanal domina la
+demanda (los sabados venden distinto que los martes). Solo usa stdlib
+(datetime, statistics).
+"""
+
+from datetime import date, datetime, timedelta
+from statistics import median
+
+
+def _to_date(value: str) -> date:
+    """Convierte una fecha ISO 'YYYY-MM-DD' (o datetime.date) a datetime.date."""
+    if isinstance(value, date) and not isinstance(value, datetime):
+        return value
+    if isinstance(value, datetime):
+        return value.date()
+    return datetime.strptime(value[:10], "%Y-%m-%d").date()
+
+
+def forecast_seasonal_median(
+    history: list[dict],
+    horizon_dates: list[str],
+    as_of: str,
+    dow_weeks: int = 8,
+    trend_recent_weeks: int = 4,
+    trend_clip: tuple = (0.5, 2.0),
+) -> list[dict]:
+    """Predice el valor de cada serie para cada fecha del horizonte.
+
+    Para cada serie presente en `history` y cada fecha objetivo del horizonte:
+
+    1. Base estacional = mediana del valor en las ultimas `dow_weeks` fechas con el
+       MISMO dia de semana que la fecha objetivo, todas <= `as_of`. Se toman las
+       fechas EXACTAS del calendario (la mas reciente <= as_of con ese dia de
+       semana, y de ahi 7 dias hacia atras por punto); una fecha ausente en la
+       historia cuenta como 0 (series intermitentes).
+    2. Factor de tendencia por serie = suma de los valores de las ultimas
+       `trend_recent_weeks` semanas (desde `as_of` hacia atras) dividida entre la
+       suma de las `trend_recent_weeks` semanas anteriores a esas. Si el
+       denominador es 0 el factor es 1.0. Se acota a `trend_clip`.
+    3. y_pred = max(0.0, base * factor).
+
+    Args:
+        history: observaciones {"series_id": str, "date": "YYYY-MM-DD",
+            "value": float}. Filas duplicadas (misma serie y fecha) se suman. Los
+            dias sin fila dentro de las ventanas se tratan como valor 0.
+        horizon_dates: fechas futuras a predecir (strings ISO 'YYYY-MM-DD').
+        as_of: fecha de corte (ultimo dia de historia utilizable, inclusive).
+        dow_weeks: numero de fechas del mismo dia de semana a promediar para la
+            base estacional. Default 8.
+        trend_recent_weeks: tamano (en semanas) de cada una de las dos ventanas de
+            tendencia (reciente y anterior). Default 4.
+        trend_clip: (min, max) al que se acota el factor de tendencia. Default
+            (0.5, 2.0): la tendencia no puede reducir la prediccion a menos de la
+            mitad ni aumentarla a mas del doble.
+
+    Returns:
+        Lista de {"series_id": str, "date": str, "y_pred": float}, una fila por
+        cada serie presente en `history` y cada fecha del horizonte. Ordenada por
+        series_id (asc) y luego por el orden de `horizon_dates`.
+    """
+    as_of_d = _to_date(as_of)
+    lo_clip, hi_clip = trend_clip
+
+    # Mapa (series_id, date) -> valor acumulado + conjunto de series presentes.
+    values: dict[tuple[str, date], float] = {}
+    series_ids: set[str] = set()
+    for obs in history:
+        sid = obs["series_id"]
+        d = _to_date(obs["date"])
+        v = float(obs.get("value", 0.0) or 0.0)
+        series_ids.add(sid)
+        values[(sid, d)] = values.get((sid, d), 0.0) + v
+
+    # Ventanas de tendencia (en dias) relativas a as_of.
+    span = 7 * trend_recent_weeks
+    recent_lo = as_of_d - timedelta(days=span)      # reciente: recent_lo < d <= as_of
+    prior_lo = as_of_d - timedelta(days=2 * span)   # anterior: prior_lo  < d <= recent_lo
+
+    # Factor de tendencia por serie (una sola vez por serie, no depende del horizonte).
+    # Una unica pasada sobre `values` acumula ambas ventanas: O(N) en lugar de
+    # O(series x N), relevante cuando hay miles de series.
+    recent_sum: dict[str, float] = dict.fromkeys(series_ids, 0.0)
+    prior_sum: dict[str, float] = dict.fromkeys(series_ids, 0.0)
+    for (s, d), v in values.items():
+        if recent_lo < d <= as_of_d:
+            recent_sum[s] += v
+        elif prior_lo < d <= recent_lo:
+            prior_sum[s] += v
+
+    trend_factor: dict[str, float] = {}
+    for sid in series_ids:
+        if prior_sum[sid] == 0.0:
+            factor = 1.0
+        else:
+            factor = recent_sum[sid] / prior_sum[sid]
+        trend_factor[sid] = min(hi_clip, max(lo_clip, factor))
+
+    horizon = [_to_date(h) for h in horizon_dates]
+    out: list[dict] = []
+    for sid in sorted(series_ids):
+        factor = trend_factor[sid]
+        for h_str, h_d in zip(horizon_dates, horizon):
+            # Fecha mas reciente <= as_of con el mismo dia de semana que la objetivo.
+            back = (as_of_d.weekday() - h_d.weekday()) % 7
+            anchor = as_of_d - timedelta(days=back)
+            dow_values = [
+                values.get((sid, anchor - timedelta(days=7 * i)), 0.0)
+                for i in range(dow_weeks)
+            ]
+            base = median(dow_values)
+            y_pred = max(0.0, base * factor)
+            out.append({"series_id": sid, "date": h_str, "y_pred": y_pred})
+
+    return out
@@ -0,0 +1,113 @@
+"""Tests para forecast_seasonal_median."""
+
+import os
+import sys
+from datetime import date, timedelta
+
+sys.path.insert(0, os.path.dirname(__file__))
+
+from forecast_seasonal_median import forecast_seasonal_median
+
+
+def _iso(d: date) -> str:
+    return d.isoformat()
+
+
+def test_serie_regular_patron_semanal_mediana_correcta():
+    """serie regular con patron semanal claro da la mediana correcta."""
+    # as_of martes; historia diaria de 12 semanas con valor fijo por dia de semana
+    # y patron constante (tendencia neutra -> factor 1).
+    as_of = date(2026, 6, 30)  # martes
+    by_weekday = {0: 100.0, 1: 10.0, 2: 20.0, 3: 30.0, 4: 40.0, 5: 5.0, 6: 0.0}
+    history = []
+    for i in range(84):  # 12 semanas de dias
+        d = as_of - timedelta(days=i)
+        history.append({"series_id": "c1|sub", "date": _iso(d), "value": by_weekday[d.weekday()]})
+
+    horizon = [_iso(as_of + timedelta(days=k)) for k in range(1, 8)]  # 7 dias
+    result = forecast_seasonal_median(history, horizon, _iso(as_of))
+
+    # 1 serie x 7 fechas de horizonte
+    assert len(result) == 7
+    for row in result:
+        wd = date.fromisoformat(row["date"]).weekday()
+        # base = mediana de 8 semanas del mismo valor constante; factor = 1.
+        assert row["y_pred"] == by_weekday[wd]
+        assert row["series_id"] == "c1|sub"
+
+
+def test_serie_intermitente_con_ceros():
+    """serie intermitente: los dias ausentes cuentan como 0 en la mediana."""
+    # as_of martes. La serie solo vende martes alternos (w=0,2,4,6), el resto 0.
+    as_of = date(2026, 6, 30)  # martes
+    history = []
+    for w in (0, 2, 4, 6):
+        history.append({"series_id": "s", "date": _iso(as_of - timedelta(days=7 * w)), "value": 40.0})
+
+    horizon = [_iso(as_of + timedelta(days=7))]  # proximo martes
+    result = forecast_seasonal_median(history, horizon, _iso(as_of))
+
+    # dow_weeks=8 martes: [40,0,40,0,40,0,40,0] -> mediana (0+40)/2 = 20.
+    # tendencia: reciente (w0..3)=40+40=80, anterior (w4..7)=40+40=80 -> factor 1.
+    assert len(result) == 1
+    assert result[0]["y_pred"] == 20.0
+
+
+def test_serie_con_tendencia_creciente_factor_clipped():
+    """serie con tendencia creciente aplica factor >1 acotado a trend_clip."""
+    as_of = date(2026, 6, 30)  # martes
+    # Reciente (4 martes) = 30 c/u, anterior (4 martes) = 10 c/u.
+    vals = {0: 30.0, 1: 30.0, 2: 30.0, 3: 30.0, 4: 10.0, 5: 10.0, 6: 10.0, 7: 10.0}
+    history = [
+        {"series_id": "s", "date": _iso(as_of - timedelta(days=7 * w)), "value": v}
+        for w, v in vals.items()
+    ]
+
+    horizon = [_iso(as_of + timedelta(days=7))]
+    result = forecast_seasonal_median(history, horizon, _iso(as_of))
+
+    # base = mediana de [30,30,30,30,10,10,10,10] = 20.
+    # factor = 120/40 = 3.0 -> clipped a 2.0 (trend_clip=(0.5,2.0)).
+    # y_pred = 20 * 2.0 = 40.
+    assert result[0]["y_pred"] == 40.0
+
+
+def test_serie_sin_datos_en_denominador_tendencia_factor_1():
+    """sin datos en la ventana anterior, el factor de tendencia es 1.0."""
+    as_of = date(2026, 6, 30)  # martes
+    # Solo hay datos en las 4 semanas recientes (w=0..3); nada mas antiguo.
+    history = [
+        {"series_id": "s", "date": _iso(as_of - timedelta(days=7 * w)), "value": 50.0}
+        for w in range(4)
+    ]
+
+    horizon = [_iso(as_of + timedelta(days=7))]
+    result = forecast_seasonal_median(history, horizon, _iso(as_of))
+
+    # denominador (semanas anteriores) = 0 -> factor 1.0 (no crashea).
+    # base = mediana [50,50,50,50,0,0,0,0] = (0+50)/2 = 25 -> y_pred = 25.
+    assert result[0]["y_pred"] == 25.0
+
+
+def test_horizon_de_7_dias_una_fila_por_serie_y_fecha():
+    """horizon de 7 dias produce una fila por serie y fecha, ordenadas."""
+    as_of = date(2026, 6, 30)
+    history = [
+        {"series_id": "b|x", "date": _iso(as_of - timedelta(days=7 * w)), "value": 12.0}
+        for w in range(8)
+    ] + [
+        {"series_id": "a|x", "date": _iso(as_of - timedelta(days=7 * w)), "value": 8.0}
+        for w in range(8)
+    ]
+
+    horizon = [_iso(as_of + timedelta(days=k)) for k in range(1, 8)]
+    result = forecast_seasonal_median(history, horizon, _iso(as_of))
+
+    # 2 series x 7 fechas = 14 filas, ordenadas por series_id asc.
+    assert len(result) == 14
+    assert [r["series_id"] for r in result[:7]] == ["a|x"] * 7
+    assert [r["series_id"] for r in result[7:]] == ["b|x"] * 7
+    # el orden de fechas dentro de cada serie respeta horizon_dates
+    assert [r["date"] for r in result[:7]] == horizon
+    # y_pred >= 0 siempre
+    assert all(r["y_pred"] >= 0.0 for r in result)
@@ -0,0 +1,77 @@
+---
+name: generate_synthetic_eda_folder
+kind: function
+lang: py
+domain: datascience
+version: "1.0.0"
+purity: impure
+signature: "def generate_synthetic_eda_folder(out_dir: str, n_rows: int = 2000, seed: int = 42) -> dict"
+description: "Generates a folder with 3 RELATED CSVs (customers, orders, reviews), deterministic per seed (Faker + numpy), to exercise the multi-table AutomaticEDA engine / profile_database. orders.customer_id and reviews.customer_id are 100% contained in customers.customer_id (uuid PK), so FK detection by containment (min_inclusion=0.9) discovers both relations. customers is the parent table; the function reuses helpers from generate_synthetic_eda_table (multi-language text, valid lat/lon, amount with outliers). dict-no-throw style: never raises."
+tags: [eda, synthetic, faker, testing, fixture, datascience]
+params:
+  - name: out_dir
+    desc: "Output folder. Created with mkdir -p if it does not exist. Receives customers.csv, orders.csv and reviews.csv."
+  - name: n_rows
+    desc: "Number of customers (rows of customers). orders has 2*n_rows rows and reviews has n_rows rows. Default 2000."
+  - name: seed
+    desc: "Seed for Faker (Faker.seed) and numpy (np.random.default_rng). Same seed -> byte-identical CSVs. Default 42."
+output: "dict (dict-no-throw). On success {status:'ok', out_dir, files:{customers,orders,reviews}, n_customers, n_orders, n_reviews, expected_relations:[{from_table,from_col,to_table,to_col}, ...], seed}. On error (without raising, e.g. n_rows<=0) {status:'error', error:str}. expected_relations declares the 2 FKs orders->customers and reviews->customers (both via customer_id)."
+uses_functions: []
+uses_types: []
+returns: []
+returns_optional: false
+error_type: "error_go_core"
+imports: []
+tested: true
+tests: ["test_genera_ok_y_archivos", "test_determinismo_mismo_seed", "test_seeds_distintos_difieren", "test_fk_containment", "test_review_text_mediana_palabras", "test_n_rows_invalido"]
+test_file_path: "python/functions/datascience/generate_synthetic_eda_folder_test.py"
+file_path: "python/functions/datascience/generate_synthetic_eda_folder.py"
+---
+
+## Example
+
+```bash
+# Generates /tmp/eda_folder/{customers,orders,reviews}.csv (300 customers, seed 42)
+fn run generate_synthetic_eda_folder /tmp/eda_folder 300 42
+```
+
+```python
+import sys, os
+sys.path.insert(0, os.path.join("python", "functions"))
+from datascience import generate_synthetic_eda_folder
+
+res = generate_synthetic_eda_folder("/tmp/eda_folder", n_rows=300, seed=42)
+# res["files"] -> {"customers": ".../customers.csv", "orders": ..., "reviews": ...}
+# res["expected_relations"] -> orders.customer_id and reviews.customer_id -> customers.customer_id
+# Then profile the folder/database with the eda group:
+#   fn run profile_database /tmp/eda_folder
+```
+
+## When to use it
+
+- When you need a REPRODUCIBLE multi-table fixture to evaluate the folder/database EDA (`profile_database`, join graph, inter-table relations chapter) with real, detectable FK relations.
+- When you write tests for foreign-key detection by containment: orders and reviews reference a customer_id that is 100% contained in customers (inclusion 1.0 >= min_inclusion 0.9).
+- As the multi-table counterpart of `generate_synthetic_eda_table` (which covers the EDA of ONE table).
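The containment criterion mentioned above can be illustrated with plain sets. This is a minimal sketch under an assumption: inclusion is taken to be the fraction of distinct child values that also appear in the parent column. The real detector behind `profile_database` may compute it differently, and `inclusion` is a hypothetical helper name.

```python
def inclusion(child_values, parent_values):
    """Fraction of DISTINCT child values that also appear in the parent column."""
    child = set(child_values)
    if not child:
        return 0.0
    return len(child & set(parent_values)) / len(child)


MIN_INCLUSION = 0.9  # threshold quoted in this document

customers = ["c1", "c2", "c3", "c4"]
orders = ["c1", "c1", "c3", "c4", "c2", "c3"]  # every id sampled from customers
stray = ["c1", "c9", "c8", "c7"]               # mostly ids unknown to customers

orders_is_fk = inclusion(orders, customers) >= MIN_INCLUSION
stray_is_fk = inclusion(stray, customers) >= MIN_INCLUSION
```

Because orders and reviews only ever sample ids that exist in customers, their inclusion is 1.0 by construction, which is why the fixture clears the 0.9 threshold regardless of seed.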
+
+## Gotchas
+
+- **Impure**: writes 3 CSVs to disk (`mkdir -p` on the folder). Overwrites existing CSVs with the same name.
+- **Requires `faker`, `numpy` and `pandas`** in the venv. Without `faker` it returns `{status:'error'}` (it does not raise).
+- **Containment depends on order**: customers is generated FIRST and orders/reviews sample their `customer_id` from it. If the order is reversed, the FK is no longer contained and the detector does not find it.
+- **`signup_date`/`ts` are written as ISO text in the CSV** (`YYYY-MM-DD` / `YYYY-MM-DD HH:MM:SS`): CSV carries no types, so everything is text; the profiler promotes them to datetime on read.
+- **Determinism depends on call order**: `Faker.seed(seed)` + `np.random.default_rng(seed)` are seeded at the start, so the same seed yields byte-identical CSVs; changing the order of the draws changes the output even with the same seed.
+- **Reuses private helpers** from `generate_synthetic_eda_table` (`_make_fakers`, `_make_latlon`, `_make_reviews`, `_amount_with_outliers`): do not break those signatures without updating this function.
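The call-order gotcha applies to any seeded generator. The sketch below uses the standard library's `random.Random` as a stand-in for the Faker and numpy generators (an assumption made to keep the example dependency-free), and `draw_columns` is a hypothetical helper.

```python
import random


def draw_columns(seed, ids_first):
    """Draw two 3-value 'columns' from one seeded stream, in either order."""
    rng = random.Random(seed)
    if ids_first:
        ids = [rng.random() for _ in range(3)]
        names = [rng.random() for _ in range(3)]
    else:
        names = [rng.random() for _ in range(3)]
        ids = [rng.random() for _ in range(3)]
    return ids, names


same_a = draw_columns(42, ids_first=True)
same_b = draw_columns(42, ids_first=True)    # same seed, same order
swapped = draw_columns(42, ids_first=False)  # same seed, different order
```

Both runs with the same order agree exactly, while the swapped run assigns the first three draws to the other column, so the output differs even though the seed did not change.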
+
+## Notes
+
+Generated structure:
+
+| File | PK | FK | Key columns |
+|---|---|---|---|
+| customers.csv | customer_id (uuid) | (none) | name, country, signup_date, latitude, longitude, email |
+| orders.csv | order_id (uuid) | customer_id -> customers | amount (lognormal + outliers), category, ts |
+| reviews.csv | review_id (uuid) | customer_id -> customers | review_text (multi-language, median words>=20), rating (1..5) |
+
+orders has 2x as many rows as customers and reviews has the same number. Every `customer_id` in orders
+and reviews is contained in customers (containment ⊆), so FK detection by
+inclusion discovers the two relations declared in `expected_relations`.
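The child tables draw their `customer_id` values with `rng.choice(customer_ids, n)`, which samples with replacement by default in numpy. Not every customer therefore ends up referenced. A back-of-the-envelope estimate of the expected coverage, assuming uniform independent draws (`expected_distinct_fraction` is a hypothetical helper):

```python
def expected_distinct_fraction(n_parents, n_draws):
    """Expected share of parents hit at least once by uniform draws with replacement."""
    return 1.0 - (1.0 - 1.0 / n_parents) ** n_draws


n_customers = 2000
reviews_coverage = expected_distinct_fraction(n_customers, n_customers)     # about 0.63
orders_coverage = expected_distinct_fraction(n_customers, 2 * n_customers)  # about 0.86
```

Containment (child ⊆ parent) holds regardless of coverage. The estimate only shows that the reverse inclusion is well below 1.0, so under these assumptions a detector should not mistake customers for the child side.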
@@ -0,0 +1,177 @@
+"""generate_synthetic_eda_folder — fixture multi-tabla relacionado para el EDA de base/carpeta.
+
+Funcion impura (escribe CSVs a disco) y determinista por ``seed``: crea una
+carpeta con 3 CSV RELACIONADOS (customers, orders, reviews) cuyo contenido esta
+disenado para que el motor AutomaticEDA multi-tabla / `profile_database` detecte
+las relaciones FK por containment de valores (orders.customer_id y
+reviews.customer_id contenidos al 100% en customers.customer_id, por encima del
+``min_inclusion=0.9`` que usa la deteccion).
+
+Reutiliza los helpers de ``generate_synthetic_eda_table`` (texto multi-idioma,
+lat/lon validas, amount con outliers, listas fijas de paises/categorias) para no
+reimplementar logica.
+
+Estilo dict-no-throw del grupo `eda`: NUNCA lanza; devuelve
+``{"status": "error", "error": str}`` ante cualquier fallo.
+"""
+
+import os
+
+from .generate_synthetic_eda_table import (
+    _CATEGORIES,
+    _COUNTRIES,
+    _amount_with_outliers,
+    _make_fakers,
+    _make_latlon,
+    _make_reviews,
+)
+
+
+def generate_synthetic_eda_folder(out_dir, n_rows=2000, seed=42):
+    """Genera una carpeta con 3 CSV relacionados (customers/orders/reviews).
+
+    customers es la tabla padre (PK ``customer_id`` uuid unica). orders y reviews
+    referencian ``customer_id`` muestreandolo de customers, de modo que TODOS sus
+    valores estan contenidos en customers (inclusion 1.0 -> FK detectable).
+
+    Funcion impura (escribe a disco) y determinista por ``seed``. NUNCA lanza.
+
+    Args:
+        out_dir: carpeta de salida. Se crea con ``mkdir -p`` si no existe.
+        n_rows: numero de clientes (customers). orders ~= 2*n_rows, reviews ~= n_rows.
+            Default 2000.
+        seed: semilla para Faker y numpy. Default 42.
+
+    Returns:
+        dict dict-no-throw. En exito::
+
+            {"status": "ok", "out_dir": ..., "files": {customers, orders, reviews},
+             "n_customers": ..., "n_orders": ..., "n_reviews": ...,
+             "expected_relations": [{from_table, from_col, to_table, to_col}, ...],
+             "seed": seed}
+
+        En error (sin lanzar)::
+
+            {"status": "error", "error": str}
+    """
+    try:
+        import numpy as np
+        import pandas as pd
+
+        n = int(n_rows)
+        if n <= 0:
+            return {"status": "error", "error": f"n_rows debe ser > 0, dado {n_rows!r}"}
+
+        os.makedirs(out_dir, exist_ok=True)
+
+        fakers = _make_fakers(seed)
+        rng = np.random.default_rng(seed)
+
+        # ---------------- customers (tabla padre) ----------------
+        n_cust = n
+        customer_ids = [fakers["en_US"].uuid4() for _ in range(n_cust)]
+        names = [fakers["en_US"].name() for _ in range(n_cust)]
+        cust_country = rng.choice(_COUNTRIES, n_cust)
+        base = np.datetime64("2022-01-01")
+        signup_offsets = rng.integers(0, 730, n_cust)
+        signup_date = pd.to_datetime(base) + pd.to_timedelta(signup_offsets, unit="D")
+        signup_iso = [d.strftime("%Y-%m-%d") for d in signup_date]
+        lat, lon = _make_latlon(cust_country, rng)
+        cust_email = [fakers["en_US"].email() for _ in range(n_cust)]
+
+        customers = pd.DataFrame(
+            {
+                "customer_id": customer_ids,
+                "name": names,
+                "country": cust_country,
+                "signup_date": signup_iso,
+                "latitude": lat,
+                "longitude": lon,
+                "email": cust_email,
+            }
+        )
+
+        # ---------------- orders (FK -> customers) ----------------
+        n_orders = n_cust * 2
+        order_ids = [fakers["en_US"].uuid4() for _ in range(n_orders)]
+        order_cust = rng.choice(customer_ids, n_orders)  # subset/multiset de customers
+        amount = _amount_with_outliers(n_orders, rng, n_extreme=10)
+        order_cat = rng.choice(_CATEGORIES, n_orders)
+        ts_offsets = rng.integers(0, 730 * 24 * 3600, n_orders)
+        ts = pd.to_datetime(np.datetime64("2022-01-01T00:00:00")) + pd.to_timedelta(
+            ts_offsets, unit="s"
+        )
+        ts_iso = [t.strftime("%Y-%m-%d %H:%M:%S") for t in ts]
+
+        orders = pd.DataFrame(
+            {
+                "order_id": order_ids,
+                "customer_id": order_cust,
+                "amount": amount,
+                "category": order_cat,
+                "ts": ts_iso,
+            }
+        )
+
+        # ---------------- reviews (FK -> customers) ----------------
+        n_reviews = n_cust
+        review_ids = [fakers["en_US"].uuid4() for _ in range(n_reviews)]
+        # Sampled WITH replacement from customers: the distinct ids are a subset (⊆) of customers, usually not all of them.
+        rev_cust = rng.choice(customer_ids, n_reviews)
+        review_text = _make_reviews(n_reviews, rng, fakers, null_frac=0.0)
+        rating = rng.integers(1, 6, n_reviews)
+
+        reviews = pd.DataFrame(
+            {
+                "review_id": review_ids,
+                "customer_id": rev_cust,
+                "review_text": review_text,
+                "rating": rating,
+            }
+        )
+
+        files = {
+            "customers": os.path.join(out_dir, "customers.csv"),
+            "orders": os.path.join(out_dir, "orders.csv"),
+            "reviews": os.path.join(out_dir, "reviews.csv"),
+        }
+        customers.to_csv(files["customers"], index=False)
+        orders.to_csv(files["orders"], index=False)
+        reviews.to_csv(files["reviews"], index=False)
+
+        return {
+            "status": "ok",
+            "out_dir": out_dir,
+            "files": files,
+            "n_customers": n_cust,
+            "n_orders": n_orders,
+            "n_reviews": n_reviews,
+            "expected_relations": [
+                {
+                    "from_table": "orders",
+                    "from_col": "customer_id",
+                    "to_table": "customers",
+                    "to_col": "customer_id",
+                },
+                {
+                    "from_table": "reviews",
+                    "from_col": "customer_id",
+                    "to_table": "customers",
+                    "to_col": "customer_id",
+                },
+            ],
+            "seed": seed,
+        }
+    except Exception as exc:  # noqa: BLE001 — dict-no-throw del grupo eda.
+        return {"status": "error", "error": str(exc)}
+
+
+if __name__ == "__main__":
+    import json
+    import sys
+
+    args = sys.argv[1:]
+    out = args[0] if len(args) > 0 else "/tmp/synthetic_eda_folder"
+    rows = int(args[1]) if len(args) > 1 else 2000
+    sd = int(args[2]) if len(args) > 2 else 42
+    print(json.dumps(generate_synthetic_eda_folder(out, rows, sd), indent=2))
@@ -0,0 +1,74 @@
+"""Tests para generate_synthetic_eda_folder."""
+
+import os
+import statistics
+
+import pandas as pd
+
+from datascience.generate_synthetic_eda_folder import generate_synthetic_eda_folder
+
+
+def test_genera_ok_y_archivos(tmp_path):
+    out = str(tmp_path / "folder")
+    res = generate_synthetic_eda_folder(out, n_rows=300, seed=42)
+    assert res["status"] == "ok"
+    assert res["n_customers"] == 300
+    assert res["n_orders"] == 600
+    assert res["n_reviews"] == 300
+    for key in ("customers", "orders", "reviews"):
+        assert os.path.exists(res["files"][key])
+    # Relaciones esperadas declaradas.
+    rels = {(r["from_table"], r["to_table"]) for r in res["expected_relations"]}
+    assert ("orders", "customers") in rels
+    assert ("reviews", "customers") in rels
+
+
+def test_determinismo_mismo_seed(tmp_path):
+    out1 = str(tmp_path / "f1")
+    out2 = str(tmp_path / "f2")
+    generate_synthetic_eda_folder(out1, n_rows=250, seed=11)
+    generate_synthetic_eda_folder(out2, n_rows=250, seed=11)
+    for name in ("customers.csv", "orders.csv", "reviews.csv"):
+        with open(os.path.join(out1, name), "rb") as fh:
+            a = fh.read()
+        with open(os.path.join(out2, name), "rb") as fh:
+            b = fh.read()
+        assert a == b, f"{name} difiere entre dos generaciones con el mismo seed"
+
+
+def test_seeds_distintos_difieren(tmp_path):
+    out1 = str(tmp_path / "f1")
+    out2 = str(tmp_path / "f2")
+    generate_synthetic_eda_folder(out1, n_rows=250, seed=11)
+    generate_synthetic_eda_folder(out2, n_rows=250, seed=12)
+    with open(os.path.join(out1, "customers.csv"), "rb") as fh:
+        a = fh.read()
+    with open(os.path.join(out2, "customers.csv"), "rb") as fh:
+        b = fh.read()
+    assert a != b
+
+
+def test_fk_containment(tmp_path):
+    out = str(tmp_path / "folder")
+    res = generate_synthetic_eda_folder(out, n_rows=300, seed=42)
+    customers = pd.read_csv(res["files"]["customers"])
+    orders = pd.read_csv(res["files"]["orders"])
+    reviews = pd.read_csv(res["files"]["reviews"])
+    cust_ids = set(customers["customer_id"])
+    # Todos los customer_id de orders y reviews ⊆ customers.
+    assert set(orders["customer_id"]) <= cust_ids
+    assert set(reviews["customer_id"]) <= cust_ids
+    # customer_id es PK unica en customers.
+    assert customers["customer_id"].is_unique
+    assert orders["order_id"].is_unique
+    assert reviews["review_id"].is_unique
+
+
+def test_review_text_mediana_palabras(tmp_path):
+    out = str(tmp_path / "folder")
+    res = generate_synthetic_eda_folder(out, n_rows=300, seed=42)
+    reviews = pd.read_csv(res["files"]["reviews"])
+    words = [len(str(t).split()) for t in reviews["review_text"].dropna()]
+    assert statistics.median(words) >= 20
+
+
+def test_n_rows_invalido(tmp_path):
+    out = str(tmp_path / "folder")
+    res = generate_synthetic_eda_folder(out, n_rows=0, seed=42)
+    assert res["status"] == "error"
@@ -0,0 +1,82 @@
+---
+name: generate_synthetic_eda_table
+kind: function
+lang: py
+domain: datascience
+version: "1.0.0"
+purity: impure
+signature: "def generate_synthetic_eda_table(out_db_path: str, table: str = 'synthetic', n_rows: int = 2000, seed: int = 42) -> dict"
+description: "Generates a synthetic DuckDB table (Faker + numpy, deterministic per seed) whose content is designed to ACTIVATE as many chapters as possible of the AutomaticEDA engine of the eda group: continuous numerics with linear/non-linear correlation, numerics with outliers, imbalanced categoricals, multi-language free text with duplicates, a date for time series, valid lat/lon, semantic/PII columns (uuid/email/iban/phone) and nulls with MCAR/MAR patterns. Fixture to evaluate the EDA end to end. dict-no-throw style: never raises."
+tags: [eda, synthetic, faker, testing, fixture, datascience]
+params:
+  - name: out_db_path
+    desc: "Path to the output DuckDB file. It is created (or reused) and the table is replaced with CREATE OR REPLACE TABLE if it already exists."
+  - name: table
+    desc: "Name of the table to create. Validated against ^[A-Za-z_][A-Za-z0-9_]*$ and quoted in the DDL. Default 'synthetic'."
+  - name: n_rows
+    desc: "Number of rows (unique customers). Each row is a customer with its own id/email/iban/phone. Default 2000."
+  - name: seed
+    desc: "Seed for Faker (Faker.seed) and numpy (np.random.default_rng). Same seed -> byte-identical table. Default 42."
+output: "dict (dict-no-throw). On success {status:'ok', db_path, table, n_rows, columns:[19 column names], seed}. On error (without raising, e.g. invalid table name or n_rows<=0) {status:'error', error:str}. Columns: customer_id,email,iban,phone,income,spending,age,risk_score,tenure_months,engagement_quad,amount,n_purchases,country,category,plan,review,signup_date,latitude,longitude."
+uses_functions: []
+uses_types: []
+returns: []
+returns_optional: false
+error_type: "error_go_core"
+imports: []
+tested: true
+tests: ["test_genera_ok_y_columnas", "test_determinismo_mismo_seed", "test_seeds_distintos_difieren", "test_latlon_en_rango", "test_plan_solo_niveles_validos", "test_income_spending_co_nulos", "test_review_mediana_palabras_y_signup_datetime", "test_phone_matchea_regex_internacional", "test_outliers_y_correlaciones", "test_tabla_invalida_devuelve_error"]
+test_file_path: "python/functions/datascience/generate_synthetic_eda_table_test.py"
+file_path: "python/functions/datascience/generate_synthetic_eda_table.py"
+---
+
+## Example
+
+```bash
+# Generates /tmp/x.duckdb with the table `synthetic` (2000 rows, seed 42)
+fn run generate_synthetic_eda_table /tmp/x.duckdb synthetic 2000 42
+```
+
+```python
+import sys, os
+sys.path.insert(0, os.path.join("python", "functions"))
+from datascience import generate_synthetic_eda_table
+
+res = generate_synthetic_eda_table("/tmp/x.duckdb", "synthetic", n_rows=2000, seed=42)
+# res == {"status":"ok", "db_path":"/tmp/x.duckdb", "table":"synthetic",
+#         "n_rows":2000, "columns":[...19...], "seed":42}
+# Then profile it with the eda group:
+#   fn run profile_table /tmp/x.duckdb synthetic
+```
+
+## When to use it
+
+- When you need a REPRODUCIBLE test dataset to evaluate the AutomaticEDA engine end to end: its content deliberately triggers num_distr, cat_distr, text_distr, correlacion, missingness (MCAR/MAR), modelos (PCA/KMeans/outliers), timeseries, geospatial, calidad, agregacion and the semantic / PII detectors (`infer_semantic_type`).
+- When you write tests for EDA chapters and want a table with a column that activates EACH detector without building data by hand.
+- When you want a deterministic fixture (same seed -> same table) to compare the EDA render across versions.
+
+## Gotchas
+
+- **Impure**: writes to disk (creates/reuses the DuckDB file). Replaces the target table with `CREATE OR REPLACE`.
+- **Requires `faker`, `duckdb`, `numpy` and `pandas`** installed in the venv. Without `faker` the generation returns `{status:'error'}` (it does not raise).
+- **`signup_date` is stored as TIMESTAMP/DATE in DuckDB** (it is built as `datetime64[ns]`), NOT VARCHAR. This is required for `detect_time_column` to pick it and for the timeseries chapter to activate; as VARCHAR, date detection would fail.
+- **The `review` text must pass the text_distr gate**: mean characters >= 50 and median words >= 20. That is why each review concatenates two Faker paragraphs (~50 words median); do not reduce the number of sentences or the text_distr chapter will not activate.
+- **Determinism depends on call order**: `Faker.seed(seed)` + `np.random.default_rng(seed)` are seeded at the start; changing the order of the draws changes the output even with the same seed.
+- **Realistic PII**: `email`/`iban`/`phone`/`customer_id` match the `infer_semantic_type` regexes (email/iban/phone_intl/uuid) 100% of the time; they are synthetic Faker data, not real people.
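The text_distr gate quoted above (mean characters >= 50 and median words >= 20) can be checked on a sample of documents before relying on the chapter. A minimal sketch: the thresholds come from this document, while `passes_text_gate` and the choice to ignore null or empty documents are assumptions, not the engine's actual code.

```python
from statistics import mean, median


def passes_text_gate(docs, min_mean_chars=50, min_median_words=20):
    """Return True if the non-null documents clear both text_distr thresholds."""
    texts = [d for d in docs if d]               # assumption: nulls are ignored
    if not texts:
        return False
    mean_chars = mean([len(t) for t in texts])
    median_words = median([len(t.split()) for t in texts])
    return mean_chars >= min_mean_chars and median_words >= min_median_words


long_review = " ".join(["word"] * 25)            # 25 words, 124 characters
short_review = "good product"                    # 2 words, 12 characters

long_ok = passes_text_gate([long_review, long_review, None])
short_ok = passes_text_gate([short_review, short_review, long_review])
```

Two-paragraph reviews sit comfortably above both thresholds, whereas a column dominated by short phrases fails on the median even if a few long documents lift the mean.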
+
+## Notes
+
+Column -> detector map:
+
+| Column(s) | Type | Detector / chapter |
+|---|---|---|
+| income, spending | continuous numeric | strong POSITIVE correlation (Pearson > 0.8) |
+| age, risk_score | continuous numeric | NEGATIVE correlation |
+| tenure_months, engagement_quad | continuous numeric | NON-LINEAR relation (quadratic) |
+| amount, n_purchases | numeric + outliers | num_distr / outliers (heavy tail + injected extremes) |
+| country (12), category (6), plan (3, imbalanced) | categorical | cat_distr / agregacion (low entropy in plan) |
+| review | multi-language free text | text_distr (len_mean>=50, median words>=20) + exact duplicates |
+| signup_date | DATE/TIMESTAMP | timeseries |
+| latitude, longitude | numeric [-90,90]/[-180,180] | geospatial (detect_latlon_columns) |
+| customer_id, email, iban, phone | text | semantic_type uuid/email/iban/phone_intl (PII) |
+| income+spending (co-null 12%), risk_score (null if plan=alta), review (8%) | patterned nulls | missingness MCAR/MAR |
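The correlation claims in the table can be sanity-checked from the generator's constants in `generate_synthetic_eda_table.py`: `spending = 0.35 * income + N(0, 2000)` with `income ~ N(40000, 12000)`, and `risk_score = 90 - 0.7 * age + N(0, 5)` with integer `age` uniform on 18..90. For `y = slope * x + noise` with independent noise, the Pearson correlation is `slope * sd_x / sqrt((slope * sd_x)^2 + sd_noise^2)`. The sketch below applies that formula and, as a simplifying assumption, ignores the clipping of `income` and the injected nulls.

```python
import math


def linear_corr(slope, sd_x, sd_noise):
    """Theoretical Pearson r for y = slope * x + independent noise."""
    signal = slope * sd_x
    return signal / math.sqrt(signal ** 2 + sd_noise ** 2)


# income vs spending: slope 0.35, sd(income) = 12000, noise sd = 2000
r_income_spending = linear_corr(0.35, 12000.0, 2000.0)

# age vs risk_score: age is a discrete uniform over 73 integers (18..90),
# whose variance is (73**2 - 1) / 12; slope -0.7, noise sd = 5
sd_age = math.sqrt((73 ** 2 - 1) / 12)
r_age_risk = linear_corr(-0.7, sd_age, 5.0)
```

Both theoretical values are well beyond 0.8 in magnitude, which is consistent with the "Pearson > 0.8" claim for income/spending; sample correlations will vary slightly around them.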
@@ -0,0 +1,314 @@
+"""generate_synthetic_eda_table — fixture sintetico para ejercitar el motor AutomaticEDA.
+
+Funcion impura (escribe un archivo DuckDB a disco) y determinista por ``seed``:
+construye una unica tabla cuyo CONTENIDO esta disenado para ACTIVAR el maximo
+numero de capitulos del motor AutomaticEDA del grupo `eda` (num_distr, cat_distr,
+text_distr, correlacion, missingness, modelos, timeseries, geospatial, relaciones,
+calidad, agregacion) y los detectores semanticos / PII (`infer_semantic_type`).
+
+Estilo dict-no-throw del grupo `eda`: NUNCA lanza; captura cualquier error y
+devuelve ``{"status": "error", "error": str}``.
+
+Determinismo: con el mismo ``seed`` el DataFrame y, por tanto, la tabla DuckDB
+resultante son identicos byte a byte. Se siembra Faker (``Faker.seed``) y numpy
+(``np.random.default_rng(seed)``) al inicio de cada generacion.
+"""
+
+import re
+
+# Lista fija de paises (12 -> cardinalidad media para cat_distr / agregacion).
+_COUNTRIES = [
+    "ES", "FR", "DE", "IT", "PT", "NL",
+    "BE", "US", "GB", "IE", "SE", "PL",
+]
+
+# Lista fija de categorias de producto (6 -> cardinalidad media).
+_CATEGORIES = [
+    "electronics", "clothing", "home", "sports", "books", "toys",
+]
+
+# Niveles de plan con probabilidades DESBALANCEADAS (entropia baja para cat_distr).
+_PLANS = ["baja", "media", "alta"]
+_PLAN_PROBS = [0.70, 0.25, 0.05]
+
+# Centroides (lat, lon) aproximados por pais: muestrean coordenadas validas
+# dentro de [-90, 90] x [-180, 180] para que detect_latlon_columns las acepte.
+_CENTROIDS = {
+    "ES": (40.4, -3.7), "FR": (46.6, 2.2), "DE": (51.1, 10.4), "IT": (41.9, 12.5),
+    "PT": (39.4, -8.2), "NL": (52.1, 5.3), "BE": (50.5, 4.5), "US": (39.0, -98.0),
+    "GB": (54.0, -2.0), "IE": (53.4, -8.0), "SE": (60.1, 18.6), "PL": (52.0, 19.1),
+}
+
+# Locales rotados para generar texto multi-idioma (es/en/fr).
+_TEXT_LOCALES = ["es_ES", "en_US", "fr_FR"]
+
+# Identificador SQL valido (DuckDB no parametriza el nombre de tabla en DDL).
+_IDENT_RE = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")
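A note on anchoring: in Python's `re`, `$` matches at the end of the string and also just before a trailing newline, so `match` with this pattern accepts a name such as `"synthetic\n"`. `fullmatch` requires the whole string to match. The sketch below shows the difference with the same pattern; `IDENT_RE` is a local copy for illustration.

```python
import re

IDENT_RE = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")

name_with_newline = "synthetic\n"

accepted_by_match = IDENT_RE.match(name_with_newline) is not None
accepted_by_fullmatch = IDENT_RE.fullmatch(name_with_newline) is not None

# Ordinary identifiers behave the same under both methods.
plain_ok = IDENT_RE.fullmatch("synthetic_2") is not None
injection_rejected = IDENT_RE.fullmatch('x"; DROP TABLE t; --') is None
```

Since the table name is interpolated into the DDL string, the stricter `fullmatch` is the safer choice for the validation step.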
+
+
+def _make_fakers(seed):
+    """Crea los Faker por locale tras sembrar el generador compartido.
+
+    ``Faker.seed(seed)`` siembra el ``random.Random`` compartido por todas las
+    instancias Faker que usan el generador por defecto, asi que el orden de
+    llamadas determina por completo la salida (determinismo).
+    """
+    from faker import Faker
+
+    Faker.seed(seed)
+    es_es, en_us, fr_fr = (Faker(loc) for loc in _TEXT_LOCALES)
+    return {"es_ES": es_es, "en_US": en_us, "fr_FR": fr_fr}
+
+
+# Texto duplicado canonico (multi-idioma, > 20 palabras) que se inyecta en una
+# fraccion de las filas para que el analisis de duplicados exactos lo detecte.
+_DUP_REVIEW = (
+    "Servicio excelente y entrega muy rapida, el producto llego en perfecto "
+    "estado y coincide con la descripcion publicada en la tienda. The customer "
+    "support team answered every question quickly and the packaging was solid "
+    "and well protected during shipping. Je recommande vivement ce vendeur a "
+    "tous mes amis, la qualite est vraiment au rendez-vous cette fois."
+)
+
+
+def _make_reviews(n, rng, fakers, dup_frac=0.04, null_frac=0.08):
+    """Genera ``n`` reviews de texto libre largo multi-idioma (es/en/fr).
+
+    Cada review concatena dos parrafos de Faker en el idioma rotado por fila, de
+    modo que la MEDIANA de palabras por documento queda muy por encima de 20 y la
+    media de caracteres por encima de 50 (gates del capitulo text_distr). Se
+    inyectan duplicados exactos (``dup_frac``) y nulos (``null_frac``).
+
+    Devuelve una ``list`` de ``str`` o ``None`` (nulos) de longitud ``n``.
+    """
+    # Numero de frases por parrafo precomputado con numpy (determinista) para no
+    # interleavar draws de rng dentro del bucle de faker.
+    nb1 = rng.integers(4, 8, n)
+    nb2 = rng.integers(3, 7, n)
+
+    reviews = []
+    for i in range(n):
+        fk = fakers[_TEXT_LOCALES[i % 3]]
+        p1 = fk.paragraph(nb_sentences=int(nb1[i]))
+        p2 = fk.paragraph(nb_sentences=int(nb2[i]))
+        reviews.append(f"{p1} {p2}")
+
+    # Duplicados exactos: una fraccion de filas comparte un review identico.
+    if n > 0 and dup_frac > 0:
+        k_dup = max(1, int(n * dup_frac))
+        dup_idx = rng.choice(n, size=min(k_dup, n), replace=False)
+        for j in dup_idx:
+            reviews[int(j)] = _DUP_REVIEW
+
+    # Nulos MCAR-ish: una fraccion de filas al azar queda en None.
+    if n > 0 and null_frac > 0:
+        k_null = max(1, int(n * null_frac))
+        null_idx = rng.choice(n, size=min(k_null, n), replace=False)
+        for j in null_idx:
+            reviews[int(j)] = None
+
+    return reviews
+
+
+def _make_phone_intl(rng):
+    """Construye un telefono en formato internacional que casa phone_intl.
+
+    Regex objetivo (fullmatch): ``\\+\\d[\\d\\s()-]{6,}\\d``. Empieza por '+',
+    digito, bloques de digitos separados por espacios y termina en digito.
+    """
+    cc = int(rng.integers(1, 99))
+    a = int(rng.integers(100, 999))
+    b = int(rng.integers(100, 999))
+    c = int(rng.integers(100, 999))
+    return f"+{cc} {a} {b} {c}"
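The phone format above can be checked against the target regex from the docstring. This sketch uses the standard library's `random.Random` in place of the numpy generator (an assumption to keep it dependency-free); `randrange(1, 99)` mirrors `rng.integers(1, 99)`, whose upper bound is exclusive, so country codes range over 1..98 and each block over 100..998.

```python
import random
import re

PHONE_INTL = re.compile(r"\+\d[\d\s()-]{6,}\d")


def make_phone_intl(rng):
    """Stand-in for _make_phone_intl using random.Random instead of numpy."""
    cc = rng.randrange(1, 99)                       # 1..98, upper bound exclusive
    a, b, c = (rng.randrange(100, 999) for _ in range(3))
    return f"+{cc} {a} {b} {c}"


rng = random.Random(42)
phones = [make_phone_intl(rng) for _ in range(1000)]
all_match = all(PHONE_INTL.fullmatch(p) is not None for p in phones)

# Shortest possible output: one-digit country code and three 3-digit blocks.
shortest_ok = PHONE_INTL.fullmatch("+1 100 100 100") is not None
```

Every generated value starts with `+` and a digit, has at least six characters from the allowed class in the middle, and ends in a digit, so the whole column matches by construction.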
+
+
+def _make_latlon(countries, rng):
+    """Devuelve (latitudes, longitudes) muestreando centroides de pais + jitter.
+
+    Mantiene los valores dentro de [-90, 90] y [-180, 180] (validez exigida por
+    detect_latlon_columns). El jitter es pequeno para no salirse del rango.
+    """
+    import numpy as np
+
+    lats = np.empty(len(countries), dtype=float)
+    lons = np.empty(len(countries), dtype=float)
+    jitter_lat = rng.normal(0.0, 0.5, len(countries))
+    jitter_lon = rng.normal(0.0, 0.5, len(countries))
+    for i, code in enumerate(countries):
+        base_lat, base_lon = _CENTROIDS[code]
+        lats[i] = float(np.clip(base_lat + jitter_lat[i], -90.0, 90.0))
+        lons[i] = float(np.clip(base_lon + jitter_lon[i], -180.0, 180.0))
+    return lats, lons
+
+
+def _amount_with_outliers(n, rng, n_extreme=6, factor=50.0):
+    """Serie lognormal de cola pesada con ~``n_extreme`` outliers altos (x``factor``)."""
+    import numpy as np
+
+    amount = rng.lognormal(mean=4.0, sigma=1.0, size=n)
+    if n > 0 and n_extreme > 0:
+        idx = rng.choice(n, size=min(n_extreme, n), replace=False)
+        amount[idx] = amount[idx] * factor
+    return amount
+
+
+def generate_synthetic_eda_table(
+    out_db_path, table="synthetic", n_rows=2000, seed=42
+):
+    """Genera una tabla DuckDB sintetica que activa el maximo de capitulos del EDA.
+
+    Construye un DataFrame de ``n_rows`` clientes unicos con columnas elegidas para
+    disparar detectores concretos del motor AutomaticEDA (numericas continuas con
+    correlaciones lineal/no-lineal, numericas con outliers, categoricas
+    desbalanceadas, texto libre multi-idioma con duplicados, fecha para serie
+    temporal, lat/lon validas, semanticos/PII y nulos con patron MCAR/MAR), y la
+    materializa en ``out_db_path`` con ``CREATE OR REPLACE TABLE``.
+
+    Funcion impura (escribe a disco) y determinista por ``seed``: con el mismo
+    seed la tabla resultante es identica byte a byte. NUNCA lanza.
+
+    Args:
+        out_db_path: ruta al archivo DuckDB de salida. Se crea (o reutiliza) y la
+            tabla se reemplaza si ya existe.
+        table: nombre de la tabla a crear. Se valida contra
+            ``^[A-Za-z_][A-Za-z0-9_]*$`` y se cita en el DDL.
+        n_rows: numero de filas (clientes unicos). Default 2000.
+        seed: semilla para Faker y numpy. Default 42.
+
+    Returns:
+        dict dict-no-throw. En exito::
+
+            {"status": "ok", "db_path": out_db_path, "table": table,
+             "n_rows": n_rows, "columns": [<nombres de columna>], "seed": seed}
+
+        En error (sin lanzar)::
+
+            {"status": "error", "error": str}
+    """
+    try:
+        import duckdb
+        import numpy as np
+        import pandas as pd
+
+        # fullmatch: with match(), the trailing `$` would also accept a name ending in a newline.
+        if not _IDENT_RE.fullmatch(table or ""):
+            return {
+                "status": "error",
+                "error": (
+                    f"nombre de tabla invalido: {table!r} "
+                    "(debe casar con ^[A-Za-z_][A-Za-z0-9_]*$)"
+                ),
+            }
+        n = int(n_rows)
+        if n <= 0:
+            return {"status": "error", "error": f"n_rows debe ser > 0, dado {n_rows!r}"}
+
+        fakers = _make_fakers(seed)
+        rng = np.random.default_rng(seed)
+
+        # --- Numericas continuas (distinct alto, correlaciones) ---
+        income = np.clip(rng.normal(40000.0, 12000.0, n), 1000.0, None)
+        spending = income * 0.35 + rng.normal(0.0, 2000.0, n)  # corr POSITIVA fuerte
+        age = rng.integers(18, 91, n)
+        risk_score = 90.0 - age * 0.7 + rng.normal(0.0, 5.0, n)  # corr NEGATIVA con age
+        tenure_months = rng.uniform(0.0, 60.0, n)
+        engagement_quad = ((tenure_months - 30.0) ** 2) / 30.0 + rng.normal(0.0, 1.0, n)
+
+        # --- Numericas con outliers claros ---
+        amount = _amount_with_outliers(n, rng)
+        n_purchases = rng.poisson(3.0, n).astype(float)
+        if n > 0:
+            k_hi = min(max(1, int(n * 0.002)) + 2, n)  # max(1, 0.2% of n) + 2 extreme values, capped at n (6 when n=2000)
+            hi_idx = rng.choice(n, size=k_hi, replace=False)
+            n_purchases[hi_idx] = rng.integers(200, 400, len(hi_idx)).astype(float)
+
+        # --- Categoricas ---
+        country = rng.choice(_COUNTRIES, n)
+        category = rng.choice(_CATEGORIES, n)
+        plan = rng.choice(_PLANS, n, p=_PLAN_PROBS)
+
+        # --- Texto libre multi-idioma con duplicados ---
+        review = _make_reviews(n, rng, fakers)
+
+        # --- Fecha / serie temporal (rango ~2 anios, cadencia ~diaria) ---
+        base = np.datetime64("2022-01-01")
+        offsets = rng.integers(0, 730, n)
+        signup_date = pd.to_datetime(base) + pd.to_timedelta(offsets, unit="D")
+
+        # --- Geo lat/lon validas ---
+        latitude, longitude = _make_latlon(country, rng)
+
+        # --- Semanticos / PII (>=80% match para infer_semantic_type) ---
+        customer_id = [fakers["en_US"].uuid4() for _ in range(n)]
+        email = [fakers["en_US"].email() for _ in range(n)]
+        iban = [fakers["en_US"].iban() for _ in range(n)]
+        phone = [_make_phone_intl(rng) for _ in range(n)]
+
+        df = pd.DataFrame(
+            {
+                "customer_id": customer_id,
+                "email": email,
+                "iban": iban,
+                "phone": phone,
+                "income": income,
+                "spending": spending,
+                "age": age,
+                "risk_score": risk_score,
+                "tenure_months": tenure_months,
+                "engagement_quad": engagement_quad,
+                "amount": amount,
+                "n_purchases": n_purchases,
+                "country": country,
+                "category": category,
+                "plan": plan,
+                "review": review,
+                "signup_date": signup_date,
+                "latitude": latitude,
+                "longitude": longitude,
+            }
+        )
+
+        # --- Patterned nulls ---
+        # income and spending are missing TOGETHER in the SAME rows (co-occurrence -> MAR).
+        k_co = max(1, int(n * 0.12))
+        co_idx = rng.choice(n, size=min(k_co, n), replace=False)
+        df.loc[co_idx, "income"] = np.nan
+        df.loc[co_idx, "spending"] = np.nan
+        # risk_score is missing when plan == "alta" (plus a pinch of randomness) -> MAR.
+        risk_mask = (df["plan"] == "alta").to_numpy() | (rng.random(n) < 0.02)
+        df.loc[risk_mask, "risk_score"] = np.nan
+
+        columns = list(df.columns)
+
+        con = duckdb.connect(out_db_path)
+        try:
+            con.register("df_synth_eda", df)
+            con.execute(
+                f'CREATE OR REPLACE TABLE "{table}" AS SELECT * FROM df_synth_eda'
+            )
+            con.unregister("df_synth_eda")
+        finally:
+            con.close()
+
+        return {
+            "status": "ok",
+            "db_path": out_db_path,
+            "table": table,
+            "n_rows": n,
+            "columns": columns,
+            "seed": seed,
+        }
+    except Exception as exc:  # noqa: BLE001 - dict-no-throw style of the eda group.
+        return {"status": "error", "error": str(exc)}
+
+
+if __name__ == "__main__":
+    import json
+    import sys
+
+    args = sys.argv[1:]
+    db_path = args[0] if len(args) > 0 else "/tmp/synthetic_eda.duckdb"
+    tbl = args[1] if len(args) > 1 else "synthetic"
+    rows = int(args[2]) if len(args) > 2 else 2000
+    sd = int(args[3]) if len(args) > 3 else 42
+    print(json.dumps(generate_synthetic_eda_table(db_path, tbl, rows, sd), indent=2))
@@ -0,0 +1,129 @@
+"""Tests for generate_synthetic_eda_table."""
+
+import os
+import re
+import statistics
+
+import duckdb
+
+from datascience.generate_synthetic_eda_table import generate_synthetic_eda_table
+
+_EXPECTED_COLS = [
+    "customer_id", "email", "iban", "phone", "income", "spending", "age",
+    "risk_score", "tenure_months", "engagement_quad", "amount", "n_purchases",
+    "country", "category", "plan", "review", "signup_date", "latitude", "longitude",
+]
+_PHONE_RE = re.compile(r"\+\d[\d\s()-]{6,}\d")
+
+
+def _load(db_path, table="synthetic"):
+    con = duckdb.connect(db_path, read_only=True)
+    try:
+        return con.execute(f'SELECT * FROM "{table}"').fetch_df()
+    finally:
+        con.close()
+
+
+def test_genera_ok_y_columnas(tmp_path):
+    db = str(tmp_path / "t.duckdb")
+    res = generate_synthetic_eda_table(db, "synthetic", n_rows=500, seed=42)
+    assert res["status"] == "ok"
+    assert res["table"] == "synthetic"
+    assert res["n_rows"] == 500
+    assert res["columns"] == _EXPECTED_COLS
+    assert os.path.exists(db)
+    df = _load(db)
+    assert list(df.columns) == _EXPECTED_COLS
+    assert len(df) == 500
+
+
+def test_determinismo_mismo_seed(tmp_path):
+    db1 = str(tmp_path / "a.duckdb")
+    db2 = str(tmp_path / "b.duckdb")
+    generate_synthetic_eda_table(db1, "synthetic", n_rows=400, seed=7)
+    generate_synthetic_eda_table(db2, "synthetic", n_rows=400, seed=7)
+    df1 = _load(db1).astype(str)
+    df2 = _load(db2).astype(str)
+    # Same seed -> identical table, row by row.
+    assert df1.equals(df2)
+
+
+def test_seeds_distintos_difieren(tmp_path):
+    db1 = str(tmp_path / "a.duckdb")
+    db2 = str(tmp_path / "b.duckdb")
+    generate_synthetic_eda_table(db1, "synthetic", n_rows=400, seed=7)
+    generate_synthetic_eda_table(db2, "synthetic", n_rows=400, seed=8)
+    df1 = _load(db1).astype(str)
+    df2 = _load(db2).astype(str)
+    assert not df1.equals(df2)
+
+
+def test_latlon_en_rango(tmp_path):
+    db = str(tmp_path / "t.duckdb")
+    generate_synthetic_eda_table(db, "synthetic", n_rows=500, seed=42)
+    df = _load(db)
+    assert df["latitude"].between(-90, 90).all()
+    assert df["longitude"].between(-180, 180).all()
+
+
+def test_plan_solo_niveles_validos(tmp_path):
+    db = str(tmp_path / "t.duckdb")
+    generate_synthetic_eda_table(db, "synthetic", n_rows=500, seed=42)
+    df = _load(db)
+    assert set(df["plan"].unique()) <= {"baja", "media", "alta"}
+
+
+def test_income_spending_co_nulos(tmp_path):
+    db = str(tmp_path / "t.duckdb")
+    generate_synthetic_eda_table(db, "synthetic", n_rows=600, seed=42)
+    df = _load(db)
+    inc_null = df["income"].isna()
+    sp_null = df["spending"].isna()
+    # income and spending are missing in exactly the SAME rows.
+    assert (inc_null == sp_null).all()
+    assert inc_null.sum() > 0
+
+
+def test_review_mediana_palabras_y_signup_datetime(tmp_path):
+    db = str(tmp_path / "t.duckdb")
+    generate_synthetic_eda_table(db, "synthetic", n_rows=500, seed=42)
+    df = _load(db)
+    words = [len(str(r).split()) for r in df["review"].dropna()]
+    assert statistics.median(words) >= 20
+    # signup_date must be a datetime/date type in DuckDB (not VARCHAR).
+    con = duckdb.connect(db, read_only=True)
+    try:
+        dtype = con.execute(
+            "SELECT column_type FROM (DESCRIBE synthetic) WHERE column_name='signup_date'"
+        ).fetchone()[0]
+    finally:
+        con.close()
+    assert dtype.upper().startswith(("DATE", "TIMESTAMP"))
+
+
+def test_phone_matchea_regex_internacional(tmp_path):
+    db = str(tmp_path / "t.duckdb")
+    generate_synthetic_eda_table(db, "synthetic", n_rows=500, seed=42)
+    df = _load(db)
+    phones = [p for p in df["phone"].tolist() if p is not None]
+    assert all(_PHONE_RE.fullmatch(str(p)) for p in phones)
+
+
+def test_outliers_y_correlaciones(tmp_path):
+    db = str(tmp_path / "t.duckdb")
+    generate_synthetic_eda_table(db, "synthetic", n_rows=800, seed=42)
+    df = _load(db)
+    # amount has a tail with obvious high outliers.
+    assert df["amount"].max() > df["amount"].median() * 20
+    # strong positive income~spending correlation and negative age~risk_score correlation.
+    sub = df[["income", "spending"]].dropna()
+    assert sub["income"].corr(sub["spending"]) > 0.8
+    sub2 = df[["age", "risk_score"]].dropna()
+    assert sub2["age"].corr(sub2["risk_score"]) < -0.6
+
+
+def test_tabla_invalida_devuelve_error(tmp_path):
+    db = str(tmp_path / "t.duckdb")
+    res = generate_synthetic_eda_table(db, "bad name;", n_rows=10, seed=42)
+    assert res["status"] == "error"
+    assert "invalido" in res["error"]
@@ -0,0 +1,79 @@
+---
+name: list_bq_dataset_tables
+kind: function
+lang: py
+domain: datascience
+version: "1.0.0"
+purity: impure
+signature: "def list_bq_dataset_tables(project_id: str, dataset: str, include_views: bool = True, location: str = None) -> dict"
+description: "Lists every table and view in a BigQuery dataset and enriches the BASE TABLE entries with row count and on-disk size. Discovery layer of the eda group: what the dataset contains, how large each table is, and which objects are tables vs views, before profiling a specific one. Query 1 reads INFORMATION_SCHEMA.TABLES (full catalog); query 2 reads __TABLES__ (row_count, size_bytes). Views keep n_rows/size_mb as None (counting them would require a full scan). ADC auth with the quota project fix (403 USER_PROJECT_DENIED)."
+tags: [eda, bigquery]
+params:
+  - name: project_id
+    desc: "GCP project that contains the dataset (e.g. `autingo-159109`). Also used as the billing project for both queries."
+  - name: dataset
+    desc: "Name of the BigQuery dataset to list (e.g. `customer_marts`). Dataset only, without project or table."
+  - name: include_views
+    desc: "True (DEFAULT) includes tables and views. False filters the result down to BASE TABLE entries only."
+  - name: location
+    desc: "Dataset region for the queries (e.g. `europe-west1`, `EU`). None (DEFAULT) lets the client resolve the location. Required if the dataset lives in a non-US region."
+output: "dict-no-throw dict. On success {status:'ok', project_id, dataset, n_tables:int, tables:[{table, fqn:'project.dataset.table', table_type:'BASE TABLE'|'VIEW'|..., n_rows:int|None, size_mb:float|None, created:str|None}]}. On error {status:'error', error:str}."
+uses_functions: []
+uses_types: []
+returns: []
+returns_optional: false
+error_type: "error_go_core"
+imports: []
+tested: false
+tests: []
+test_file_path: ""
+file_path: "python/functions/datascience/list_bq_dataset_tables.py"
+---
+
+## Example
+
+```python
+from datascience import list_bq_dataset_tables
+
+# Full dataset catalog (tables + views) with row counts and sizes.
+r = list_bq_dataset_tables("autingo-159109", "customer_marts")
+print(r["status"], r["n_tables"])
+for t in r["tables"]:
+    print(t["table"], t["table_type"], t["n_rows"], t["size_mb"], "MB")
+
+# Base tables only, dataset in europe-west1 (needs location).
+r = list_bq_dataset_tables(
+    "autingo-159109", "customer_marts",
+    include_views=False, location="europe-west1",
+)
+```
+
+## When to use it
+
+- Before profiling a specific table with the `eda` group (`profile_bq_table`, `load_bq_table_to_duckdb`): find out which tables and views the dataset holds and how large each one is, to decide which to analyze.
+- When you need a quick inventory of a BigQuery dataset (name, type, rows, size, creation date) without opening the GCP console.
+- When you want to tell base tables from views before a load or a join (views carry no row count).
+
+## Gotchas
+
+- **Impure**: performs network I/O against the BigQuery API (two queries). Requires configured ADC (`gcloud auth application-default login`).
+- **403 USER_PROJECT_DENIED**: avoided by applying `creds.with_quota_project(None)` when the user's ADC carries a foreign quota project (memory `bq_direct_quota_project`). Same pattern as `load_bq_table_to_duckdb`.
+- **Dataset region**: if the dataset lives in `europe-west1` (or any region other than the client's default) and you do not pass `location`, the queries fail with "Not found: Dataset ... was not found in location US". Pass `location="europe-west1"` or `location="EU"` as appropriate. Many Aurgi datasets are in `europe-west1`; others are in the `EU` multi-region.
+- **Views carry no n_rows or size_mb**: `__TABLES__` gives no reliable count for views, and counting them would require a full scan per view (cost + latency). That is why `n_rows`/`size_mb` are None for anything that is not a `BASE TABLE`.
+- **size_mb is the logical on-disk size** (bytes from `__TABLES__` / 1024²), not the cost of a query over the table.
+- **dict-no-throw**: never raises; on any failure (invalid project/dataset, auth, region, permissions) it returns `{status:'error', error:str}`.
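
A minimal sketch of how a caller might consume the returned dict to shortlist tables for profiling. The helper `pick_tables_to_profile`, the `max_mb` threshold and the sample data are hypothetical illustrations, not part of the function; only the result shape follows the documented output.

```python
def pick_tables_to_profile(result, max_mb=500.0):
    """Return names of base tables at or under max_mb, smallest first."""
    if result.get("status") != "ok":
        # dict-no-throw: failures arrive as a dict, never as an exception.
        return []
    candidates = [
        t for t in result["tables"]
        if t["table_type"] == "BASE TABLE"
        and t["size_mb"] is not None  # views carry None here
        and t["size_mb"] <= max_mb
    ]
    return [t["table"] for t in sorted(candidates, key=lambda t: t["size_mb"])]


# Hypothetical result, shaped like the documented output.
sample = {
    "status": "ok",
    "project_id": "my-project",
    "dataset": "my_dataset",
    "n_tables": 3,
    "tables": [
        {"table": "orders", "fqn": "my-project.my_dataset.orders",
         "table_type": "BASE TABLE", "n_rows": 1200000, "size_mb": 310.5,
         "created": None},
        {"table": "orders_v", "fqn": "my-project.my_dataset.orders_v",
         "table_type": "VIEW", "n_rows": None, "size_mb": None,
         "created": None},
        {"table": "events", "fqn": "my-project.my_dataset.events",
         "table_type": "BASE TABLE", "n_rows": 90000000, "size_mb": 9000.0,
         "created": None},
    ],
}

print(pick_tables_to_profile(sample))
print(pick_tables_to_profile({"status": "error", "error": "dataset not found"}))
```

Checking `status` first mirrors the dict-no-throw contract: the caller branches on the dict instead of wrapping the call in `try/except`.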
+
+## Notes
+
+Discovery layer of the `eda` capability group. It complements
+`load_bq_table_to_duckdb` (which brings ONE table into DuckDB) and `profile_bq_table`
+(which profiles ONE table end to end): this function answers "which tables are in
+this dataset and which are worth profiling?". `project_id` and `dataset` are
+validated with regexes (`^[A-Za-z0-9\-]+$` and `^[A-Za-z0-9_]+$`) before being
+interpolated into the backtick-quoted identifiers of both queries, to
+close the injection surface.
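
The whitelist validation described above can be sketched as follows. This is an illustration under stated assumptions, not the function's code: `identifiers_are_safe` is a hypothetical name, and the sketch uses `re.fullmatch` instead of `re.match` with `^...$`, because in Python `$` also matches just before a trailing newline.

```python
import re

# Character whitelists equivalent to the patterns quoted in the notes.
_PROJECT_RE = re.compile(r"[A-Za-z0-9\-]+")
_DATASET_RE = re.compile(r"[A-Za-z0-9_]+")


def identifiers_are_safe(project_id: str, dataset: str) -> bool:
    """True only if both identifiers consist solely of whitelisted characters."""
    # fullmatch requires the whole string to match, so no anchors are needed.
    return bool(
        project_id
        and dataset
        and _PROJECT_RE.fullmatch(project_id)
        and _DATASET_RE.fullmatch(dataset)
    )


print(identifiers_are_safe("autingo-159109", "customer_marts"))
# A backtick would close the quoted identifier, so it must be rejected.
print(identifiers_are_safe("autingo-159109", "marts`; DROP TABLE x; --"))
# A trailing newline passes re.match(r"^...$") but not fullmatch.
print(identifiers_are_safe("autingo-159109", "customer_marts\n"))
```

Rejecting anything outside the whitelist, rather than trying to escape it, keeps the check simple: a backtick can never reach the interpolated query.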
+
+Unlike `bq_list_tables_py_infra` (infra domain; it uses the SDK's `BQClient`
+wrapper and does not enrich with rows or size), this function is
+standalone (its own ADC auth with the quota project fix) and returns the row
+count and size per table in the dict-no-throw style of the `eda` group.
@@ -0,0 +1,134 @@
+"""list_bq_dataset_tables: catalog of tables and views in a BigQuery dataset.
+
+Lists every table and view in a Google BigQuery dataset and enriches the
+BASE TABLE entries with their row count and on-disk size. It is the discovery
+layer of the `eda` group: before profiling a specific table (with
+`profile_bq_table` / `load_bq_table_to_duckdb`) you need to know what the dataset
+contains, how many rows each table holds, and which objects are tables vs views.
+
+Two-query strategy:
+  1. `INFORMATION_SCHEMA.TABLES` for the dataset -> table_name, table_type,
+     creation_time for ALL objects (tables and views).
+  2. `__TABLES__` for the dataset (a single extra query) -> row_count and
+     size_bytes per table. Only BASE TABLE entries are enriched; views keep
+     n_rows and size_mb as None (counting them would require a full scan per
+     view, with cost and latency that a catalog does not justify).
+
+Authentication: ADC (gcloud auth). Applies `creds.with_quota_project(None)` to
+avoid the 403 USER_PROJECT_DENIED returned when the user's ADC carries a foreign
+quota project. Same pattern as `load_bq_table_to_duckdb`.
+
+dict-no-throw style of the `eda` group: never raises; returns
+{status:'error', ...} on any failure.
+"""
+
+import re
+
+_PROJECT_RE = re.compile(r"^[A-Za-z0-9\-]+$")
+_DATASET_RE = re.compile(r"^[A-Za-z0-9_]+$")
+
+
+def list_bq_dataset_tables(
+    project_id: str,
+    dataset: str,
+    include_views: bool = True,
+    location: str = None,
+) -> dict:
+    try:
+        import google.auth
+        from google.cloud import bigquery
+
+        if not project_id or not _PROJECT_RE.match(project_id):
+            return {"status": "error", "error": f"project_id inválido: {project_id!r}"}
+        if not dataset or not _DATASET_RE.match(dataset):
+            return {"status": "error", "error": f"dataset inválido: {dataset!r}"}
+
+        # ADC auth with the quota project fix (403 USER_PROJECT_DENIED).
+        creds, adc_project = google.auth.default(
+            scopes=["https://www.googleapis.com/auth/bigquery"]
+        )
+        if hasattr(creds, "with_quota_project"):
+            creds = creds.with_quota_project(None)
+        proj = project_id or adc_project
+        client = bigquery.Client(project=proj, credentials=creds)
+
+        # Query 1: catalog of objects (tables + views) in the dataset.
+        info_sql = (
+            "SELECT table_name, table_type, creation_time "
+            f"FROM `{proj}.{dataset}`.INFORMATION_SCHEMA.TABLES "
+            "ORDER BY table_name"
+        )
+        info_rows = list(client.query(info_sql, location=location).result())
+
+        # Query 2: enrichment (row_count, size_bytes) from __TABLES__.
+        stats_sql = (
+            "SELECT table_id, row_count, size_bytes "
+            f"FROM `{proj}.{dataset}`.__TABLES__"
+        )
+        stats = {}
+        for row in client.query(stats_sql, location=location).result():
+            stats[row["table_id"]] = (row["row_count"], row["size_bytes"])
+
+        tables = []
+        for row in info_rows:
+            table_name = row["table_name"]
+            table_type = row["table_type"]
+            is_base_table = table_type == "BASE TABLE"
+
+            if not include_views and not is_base_table:
+                continue
+
+            created = row["creation_time"]
+            created_iso = created.isoformat() if created is not None else None
+
+            n_rows = None
+            size_mb = None
+            if is_base_table and table_name in stats:
+                raw_rows, raw_bytes = stats[table_name]
+                if raw_rows is not None:
+                    n_rows = int(raw_rows)
+                if raw_bytes is not None:
+                    size_mb = round(int(raw_bytes) / (1024 * 1024), 3)
+
+            tables.append(
+                {
+                    "table": table_name,
+                    "fqn": f"{proj}.{dataset}.{table_name}",
+                    "table_type": table_type,
+                    "n_rows": n_rows,
+                    "size_mb": size_mb,
+                    "created": created_iso,
+                }
+            )
+
+        return {
+            "status": "ok",
+            "project_id": proj,
+            "dataset": dataset,
+            "n_tables": len(tables),
+            "tables": tables,
+        }
+    except Exception as e:  # noqa: BLE001
+        return {"status": "error", "error": str(e)}
+
+
+if __name__ == "__main__":
+    import json
+    import sys
+
+    args = sys.argv[1:]
+    if len(args) < 2:
+        print(
+            "uso: list_bq_dataset_tables.py <project_id> <dataset> [--no-views] [--location LOC]",
+            file=sys.stderr,
+        )
+        sys.exit(2)
+    proj_arg, dataset_arg = args[0], args[1]
+    include_views_arg = "--no-views" not in args
+    loc_arg = None
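+    if "--location" in args:
+        loc_idx = args.index("--location") + 1
+        # Guard against "--location" passed as the last argument (IndexError otherwise).
+        if loc_idx >= len(args):
+            print("--location requires a value", file=sys.stderr)
+            sys.exit(2)
+        loc_arg = args[loc_idx]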
+    result = list_bq_dataset_tables(
+        proj_arg, dataset_arg, include_views=include_views_arg, location=loc_arg
+    )
+    print(json.dumps(result, ensure_ascii=False, indent=2, default=str))
@@ -0,0 +1,124 @@
+---
+name: load_bq_table_to_duckdb
+kind: function
+lang: py
+domain: datascience
+version: "1.3.0"
+purity: impure
+signature: "def load_bq_table_to_duckdb(table_fqn: str, duckdb_path: str, dest_table: str = '', sample_frac: float = None, max_rows: int = 0, project_id: str = '', pseudonymize_cols: list = None, where_sql: str = '', select_sql: str = '') -> dict"
+description: "BigQuery -> local DuckDB adapter for the eda group. Brings a Google BigQuery table or view into a local DuckDB file (COMPLETE by default, all rows; opt-in sampling with sample_frac) so that the functions of the eda capability group (which only speak DuckDB/PostgreSQL) can profile it. Streaming Arrow -> DuckDB ingestion in batches (pyarrow.RecordBatch) keeps RAM bounded on tables with tens of millions of rows; falls back to the full DataFrame path if pyarrow is not available. Filters the source with where_sql and projects/casts with select_sql. Pseudonymizes PII columns with a truncated SHA-1 hash before materializing (LOPDGDD/RGPD)."
+tags: [eda, bigquery, duckdb, datascience]
+params:
+  - name: table_fqn
+    desc: "Full FQN of the BigQuery table/view: `project.dataset.table`."
+  - name: duckdb_path
+    desc: "Path of the local DuckDB file in which to materialize the table (the destination table is created or overwritten)."
+  - name: dest_table
+    desc: "Name of the destination DuckDB table. Empty = last segment of the FQN, sanitized."
+  - name: sample_frac
+    desc: "None (DEFAULT) = FULL, fetches all rows. A float in (0,1) enables opt-in sampling with `WHERE rand() < frac` (~frac of the total). Views do not support TABLESAMPLE, hence rand()."
+  - name: max_rows
+    desc: "Optional hard row cap (LIMIT). 0 (DEFAULT) = no cap. Combined with sample_frac if both are passed."
+  - name: project_id
+    desc: "GCP billing project. Empty = first segment of the FQN, or the ADC project."
+  - name: pseudonymize_cols
+    desc: "List of PII columns to pseudonymize with a truncated SHA-1 hash before materializing (LOPDGDD/RGPD). Preserves nulls and cardinality. On the streaming path it is applied PER BATCH before inserting."
+  - name: where_sql
+    desc: "SQL WHERE clause (WITHOUT the WHERE keyword) applied to the SELECT over the source and also to the COUNT behind n_rows_source (it counts the filtered source). Combined with sampling (sample_frac) via AND. E.g. `fecha <= CURRENT_DATE() AND venta_n IS NOT NULL`. Interpolated as-is: do NOT use with untrusted input."
+  - name: select_sql
+    desc: "List of SELECT expressions (WITHOUT the SELECT keyword). Empty (DEFAULT) = `*`. Lets you project/cast problematic types, e.g. `fecha, idCentro, CAST(venta_n AS FLOAT64) AS venta_n` (useful for casting BIGNUMERIC to FLOAT64 before ingesting). Interpolated as-is: do NOT use with untrusted input."
+output: "dict-no-throw dict. On success {status:'ok', duckdb_path, table, n_rows_source, n_rows_fetched, sampled, sample_frac, columns, pseudonymized, streamed, auto_casts}. On error {status:'error', error, stage?}. streamed=True if ingestion went through Arrow batches; n_rows_fetched = total rows across the inserted batches."
+uses_functions: []
+uses_types: []
+returns: []
+returns_optional: false
+error_type: "error_go_core"
+imports: []
+tested: true
+tests:
+  - "test_default_selects_star_no_where_no_limit"
+  - "test_select_sql_replaces_star"
+  - "test_select_sql_blank_and_whitespace_fall_back_to_star"
+  - "test_where_sql_only"
+  - "test_sample_frac_only"
+  - "test_where_sql_and_sample_frac_combined_with_and_parenthesized"
+  - "test_single_condition_not_parenthesized"
+  - "test_max_rows_appends_limit"
+  - "test_max_rows_zero_or_negative_no_limit"
+  - "test_all_combined_order_where_then_limit"
+  - "test_sample_frac_out_of_range_ignored"
+  - "test_dest_empty_uses_last_fqn_segment"
+  - "test_dest_explicit_valid_kept"
+  - "test_dest_invalid_chars_replaced_with_underscore"
+  - "test_dest_from_fqn_segment_with_hyphen_sanitized"
+test_file_path: "python/functions/datascience/load_bq_table_to_duckdb_test.py"
+file_path: "python/functions/datascience/load_bq_table_to_duckdb.py"
+---
+
+## Example
+
+```python
+from datascience import load_bq_table_to_duckdb
+
+# FULL by default: fetches ALL rows of the view (3.8M) into DuckDB.
+r = load_bq_table_to_duckdb(
+    "autingo-159109.customer_marts.customer_profile",
+    "/tmp/eda_bq.duckdb",
+    pseudonymize_cols=["document_number", "full_name", "email", "phone"],
+)
+print(r["table"], r["n_rows_fetched"], "de", r["n_rows_source"], "sampled=", r["sampled"])
+
+# Opt-in sampling: ~5% of the rows.
+r = load_bq_table_to_duckdb(
+    "autingo-159109.customer_marts.customer_profile",
+    "/tmp/eda_bq_sample.duckdb",
+    sample_frac=0.05,
+    pseudonymize_cols=["document_number", "full_name", "email", "phone"],
+)
+
+# Filter the source and cast problematic columns before ingesting. The COUNT behind
+# n_rows_source honors the same where_sql (it counts the filtered source). Arrow
+# streaming in batches: RAM stays bounded even with tens of millions of rows.
+r = load_bq_table_to_duckdb(
+    "autingo-159109.data.ventas_39M",
+    "/tmp/eda_ventas.duckdb",
+    where_sql="fecha <= CURRENT_DATE() AND venta_n IS NOT NULL",
+    select_sql="fecha, idCentro, CAST(importe_bignumeric AS FLOAT64) AS importe",
+)
+print(r["n_rows_fetched"], "de", r["n_rows_source"], "streamed=", r["streamed"])
+```
+
+## When to use it
+
+- Before profiling a BigQuery table/view with the `eda` group (which only speaks DuckDB/PostgreSQL): it brings the COMPLETE source into local DuckDB (or a sample with `sample_frac`) with PII pseudonymization.
+- When you need a single BigQuery -> local DuckDB -> `eda` group bridge without writing the bridge inline every time.
+- When you want an EDA over business data to keep its analytical value (cardinality, nulls, distribution) without embedding real personal data.
+
+## Gotchas
+
+- **Impure**: performs network I/O (BigQuery) plus disk writes (DuckDB). Requires configured ADC (`gcloud auth application-default login`).
+- **403 USER_PROJECT_DENIED**: avoided by applying `creds.with_quota_project(None)` when the ADC carries a foreign quota project (memory `bq_direct_quota_project`).
+- **TABLESAMPLE does not work on views**: sampling (opt-in, `sample_frac`) uses `WHERE rand() < frac` (applicable to tables and views). `max_rows` is a `LIMIT` acting as an optional hard cap. `where_sql` and sampling are combined with AND (each condition in parentheses when there are several, to respect precedence).
+- **Streaming Arrow ingestion (bounded RAM)**: when `pyarrow` + `to_arrow_iterable` are available, the result is materialized per `pyarrow.RecordBatch` (first batch `CREATE OR REPLACE TABLE ... AS SELECT`, subsequent ones `INSERT INTO`), with `streamed=True` in the return value. A table with tens of millions of rows is therefore never loaded whole into RAM. The BigQuery Storage client is created with the same corrected credentials (`with_quota_project(None)`).
+- **Full DataFrame fallback (loads EVERYTHING into RAM)**: if `pyarrow` or `to_arrow_iterable` are not available, it falls back to the old path: a full `to_dataframe()` before materializing (`streamed=False`), which can consume several GB on large tables. To bound it, pass `sample_frac`, `max_rows` or `where_sql`.
+- **Auto-cast of problematic types (v1.3.0)**: if `select_sql` is NOT passed, the function inspects the source schema (`client.get_table`) and casts automatically in the SELECT: BIGNUMERIC -> `CAST(col AS FLOAT64)` (Arrow decimal256, which DuckDB does not ingest), REPEATED/RECORD/JSON -> `TO_JSON_STRING(col)` (LIST/STRUCT values break downstream profiling with "unhashable type: 'list'"), GEOGRAPHY -> `ST_ASTEXT(col)`. The applied transformations are reported in `auto_casts` in the return value. If an explicit `select_sql` is passed, it is honored as-is (no auto-cast). If the schema cannot be read, it degrades to `SELECT *`. The decimal256 guard in ingestion is kept as a backstop (`{status:'error', stage:'stream_schema'|'stream_insert'}`).
+- **SQL injection**: `where_sql` and `select_sql` (like `table_fqn`) are interpolated AS-IS into the query, unescaped. Do NOT build them from untrusted input.
+- **db-dtypes only on the DataFrame path**: normalizing `dbdate`/`dbtime` to types DuckDB recognizes only applies to the pandas fallback. On the Arrow path, DATE/TIME values arrive as native Arrow types that DuckDB ingests directly.
+- **Pseudonymization is a one-way hash** (SHA-1 truncated to 12 hex chars, i.e. 48 bits): it is not reversible, which is appropriate for EDA. It preserves nulls, the missingness pattern and cardinality (up to rare collisions of the truncated hash), but does NOT allow recovering the original value. In streaming it is applied per batch (non-PII columns keep their Arrow type; PII columns are rewritten as strings).
+- **dict-no-throw**: never raises; on any failure (invalid FQN, auth, query, ingestion) it returns `{status:'error', error:str}` (with `stage` on streaming ingestion failures).
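
A minimal sketch of the pseudonymization property described above: equal inputs map to equal tokens, so null positions and distinct counts survive. It is a simplified illustration, not the function's code. `pseudonymize` is a hypothetical name, and it treats only `None` as null, whereas the real helper also handles pandas NaN/NA values.

```python
import hashlib


def pseudonymize(values):
    """Truncated SHA-1 (12 hex chars) of each non-null value; None stays None."""
    return [
        None if v is None else hashlib.sha1(str(v).encode("utf-8")).hexdigest()[:12]
        for v in values
    ]


# Hypothetical PII column with one null and one repeated value.
emails = ["ana@example.com", None, "luis@example.com", "ana@example.com"]
hashed = pseudonymize(emails)

# Compare distinct counts before and after hashing.
distinct_before = len({v for v in emails if v is not None})
distinct_after = len({h for h in hashed if h is not None})
print(hashed[1], distinct_before, distinct_after)
```

Because the hash is deterministic, joins and group-bys on the pseudonymized column behave like those on the original, which is what keeps the EDA meaningful.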
+
+## Notes
+
+Adapter of the `eda` capability group: the rest of the group's functions profile
+DuckDB/PostgreSQL but do not speak BigQuery natively. This function fills that
+gap by materializing a single DuckDB table from the BigQuery query result,
+in Arrow batches when possible. The SELECT over the source is composed by the pure
+helper `_build_source_sql` (testable without network), and the destination table name
+is sanitized with `_sanitize_dest_table` (`^[A-Za-z_][A-Za-z0-9_]*$`) before being
+quoted in the `CREATE OR REPLACE TABLE`.
+
+## Capability growth log
+
+- v1.3.0 (2026-07-02): Auto-cast of problematic types when `select_sql` is not passed: inspects the source schema and casts BIGNUMERIC->FLOAT64, REPEATED/RECORD/JSON->TO_JSON_STRING and GEOGRAPHY->ST_ASTEXT (removes the decimal256 gotcha and the "unhashable type: 'list'" error from profile_table on array columns). New `auto_casts` key in the return value. Discovered in the AEDA pilot on the external_datasets dataset (product_info_mat with BIGNUMERIC, product_object with arrays).
+- v1.2.0 (2026-07-02): Adds `where_sql` (source-side WHERE clause, combined with sampling via AND and also applied to the `n_rows_source` COUNT) and `select_sql` (column projection/casting, useful for casting BIGNUMERIC->FLOAT64). Streaming Arrow -> DuckDB ingestion in batches (`pyarrow.RecordBatch`, RAM bounded by the batch size) for tables with tens of millions of rows that do not fit as a DataFrame; falls back to the full DataFrame path if pyarrow/`to_arrow_iterable` are not available. The decimal256 (BIGNUMERIC) gotcha is returned as an error with a recommendation to cast via `select_sql`. New `streamed` key in the return value. Network-free unit tests for the SQL builder and the destination sanitizer.
+- v1.1.0 (2026-07-01): FULL becomes the DEFAULT: `max_rows=300000, sample=True` is replaced by `sample_frac=None` (None = all rows) + `max_rows=0` (optional hard cap). Sampling is explicitly opt-in. Fetch accelerated via the BigQuery Storage Read API (Arrow) with a REST fallback. Standard user preference: EDAs run over the full data unless asked otherwise.
@@ -0,0 +1,419 @@
+"""load_bq_table_to_duckdb: BigQuery -> local DuckDB adapter for the `eda` group.
+
+Brings a Google BigQuery table or view into a local DuckDB file (COMPLETE by
+default, i.e. all rows, or a sample if `sample_frac` is passed) so that the
+functions of the `eda` capability group (which profile DuckDB/PostgreSQL)
+can analyze it without a native BigQuery adapter. Materializes a single
+DuckDB table from the query result.
+
+Default mode = FULL: `sample_frac=None` fetches the whole view/table (standard
+user preference: EDAs run over the full data unless asked otherwise).
+Sampling is explicitly opt-in: `sample_frac=0.05` fetches ~5%; `max_rows`
+is an optional hard cap (0 = no cap).
+
+Streaming Arrow -> DuckDB ingestion in batches: when `pyarrow` and the
+`to_arrow_iterable` iterator are available, the result is fetched and materialized
+per `pyarrow.RecordBatch`, inserting batch by batch into DuckDB. RAM therefore stays
+bounded by the size of one batch, and a table with tens of millions of rows fits
+without being loaded whole as a pandas DataFrame. If `pyarrow`/`to_arrow_iterable` are
+not available, it falls back to the full DataFrame path (which does load everything into RAM).
+
+Source-side filtering: `where_sql` applies a SQL WHERE clause to the source table
+(and also to the source COUNT, so that it counts the filtered rows). `select_sql`
+lets you project/cast specific expressions in the SELECT (empty = `*`), useful
+for casting problematic types (e.g. BIGNUMERIC -> FLOAT64) before ingesting.
+
+LOPDGDD/RGPD pseudonymization: the columns listed in `pseudonymize_cols` are
+transformed with a truncated SHA-1 hash BEFORE writing to disk, preserving
+nulls, cardinality and the missingness pattern without dumping the real value
+(national ID, name, email, phone, etc.). On the streaming path it is applied PER BATCH
+before inserting. The EDA keeps its analytical value without embedding real personal data.
+
+Authentication: ADC (gcloud auth). Applies creds.with_quota_project(None) to
+avoid the 403 USER_PROJECT_DENIED returned when the ADC carries a foreign quota
+project. The BigQuery Storage client (used by Arrow streaming) is created with
+those SAME corrected credentials.
+
+dict-no-throw style of the `eda` group: never raises; returns {status:'error', ...}.
+"""
+
+import hashlib
+import re
+
+_FQN_RE = re.compile(r"^[A-Za-z0-9_.\-]+$")
+
+
+def _pseudonymize_series(values):
+    """Truncated SHA-1 hash (12 hex chars) of each non-null value; None/NaN become None."""
+    import pandas as pd
+    out = []
+    for v in values:
+        if v is None or (not isinstance(v, (list, dict)) and _safe_isna(v)):
+            out.append(None)
+        else:
+            h = hashlib.sha1(str(v).encode("utf-8")).hexdigest()[:12]
+            out.append(h)
+    return out
+
+
+def _safe_isna(v):
+    import pandas as pd
+    try:
+        return bool(pd.isna(v))
+    except (TypeError, ValueError):
+        return False
+
+
+def _sanitize_dest_table(dest_table: str, table_fqn: str) -> str:
+    """Sanitized destination DuckDB table name (pure helper, testable without network).
+
+    Rules:
+      - Empty `dest_table` -> last segment of the FQN.
+      - If the result does not match `^[A-Za-z_][A-Za-z0-9_]*$`, every invalid character
+        is replaced with `_`; if the result ends up empty, `bq_table` is used.
+    """
+    dest = dest_table or table_fqn.split(".")[-1]
+    if not re.match(r"^[A-Za-z_][A-Za-z0-9_]*$", dest):
+        dest = re.sub(r"[^A-Za-z0-9_]", "_", dest) or "bq_table"
+    return dest
+
+
+def _build_source_sql(
+    table_fqn: str,
+    select_sql: str = "",
+    where_sql: str = "",
+    sample_frac: float = None,
+    max_rows: int = 0,
+) -> str:
+    """Builds the SELECT over the BigQuery source table/view (pure helper).
+
+    No side effects: it only builds the SQL string, testable without network.
+
+    SECURITY: `select_sql` and `where_sql` are interpolated AS-IS (not escaped),
+    just like `table_fqn`, so they must NOT be built from untrusted
+    input (SQL injection risk).
+
+    Rules:
+      - Empty `select_sql` -> `SELECT *`; otherwise `SELECT <select_sql>`.
+      - `where_sql` and sampling (`rand() < sample_frac`, for `sample_frac` in
+        (0,1)) are combined with AND. If there is more than one condition, each is
+        wrapped in parentheses to respect operator precedence.
+      - `max_rows` > 0 appends a `LIMIT` as a hard cap.
+    """
+    select_expr = select_sql.strip() if (select_sql and select_sql.strip()) else "*"
+
+    conditions = []
+    ws = (where_sql or "").strip()
+    if ws:
+        conditions.append(ws)
+    if sample_frac is not None and 0 < float(sample_frac) < 1:
+        conditions.append(f"rand() < {float(sample_frac)}")
+
+    if len(conditions) > 1:
+        where = " WHERE " + " AND ".join(f"({c})" for c in conditions)
+    elif conditions:
+        where = " WHERE " + conditions[0]
+    else:
+        where = ""
+
+    limit = f" LIMIT {int(max_rows)}" if max_rows and int(max_rows) > 0 else ""
+    return f"SELECT {select_expr} FROM `{table_fqn}`{where}{limit}"
+
+
+def _decimal256_columns(schema) -> list:
+    """Names of the Arrow columns of type decimal256 (BigQuery BIGNUMERIC).
+
+    DuckDB does not ingest decimal256 directly; the result is used to return a
+    clear error recommending a cast of those columns to FLOAT64 via `select_sql`.
+    """
+    import pyarrow as pa
+    return [f.name for f in schema if pa.types.is_decimal256(f.type)]
+
+
+def _auto_select_exprs(schema_fields) -> tuple:
+    """Builds the auto-cast SELECT list from the BigQuery schema (pure helper).
+
+    Takes the top-level fields of the BigQuery schema
+    (`google.cloud.bigquery.SchemaField` or any object with `.name`,
+    `.field_type` and `.mode`) and returns `(select_sql, auto_casts)`:
+
+      - REPEATED / RECORD / STRUCT / JSON -> TO_JSON_STRING(col)   (arrays/structs break
+                                             profile_table: "unhashable type: 'list'")
+      - BIGNUMERIC                        -> CAST(col AS FLOAT64)  (Arrow decimal256, which
+                                             DuckDB does not ingest)
+      - GEOGRAPHY                         -> ST_ASTEXT(col)        (WKT string)
+      - anything else                     -> col untouched
+
+    The REPEATED/RECORD check runs first, so a REPEATED BIGNUMERIC column is
+    serialized with TO_JSON_STRING rather than cast.
+
+    If no column needs a transformation it returns ("", {}) so the caller uses
+    `SELECT *` (previous behavior unchanged).
+    """
+    exprs = []
+    auto_casts = {}
+    for f in schema_fields:
+        name = f.name
+        ftype = (f.field_type or "").upper()
+        mode = (getattr(f, "mode", "") or "").upper()
+        if mode == "REPEATED" or ftype in ("RECORD", "STRUCT", "JSON"):
+            exprs.append(f"TO_JSON_STRING(`{name}`) AS `{name}`")
+            auto_casts[name] = "TO_JSON_STRING"
+        elif ftype == "BIGNUMERIC":
+            exprs.append(f"CAST(`{name}` AS FLOAT64) AS `{name}`")
+            auto_casts[name] = "CAST_FLOAT64"
+        elif ftype == "GEOGRAPHY":
+            exprs.append(f"ST_ASTEXT(`{name}`) AS `{name}`")
+            auto_casts[name] = "ST_ASTEXT"
+        else:
+            exprs.append(f"`{name}`")
+    if not auto_casts:
+        return "", {}
+    return ", ".join(exprs), auto_casts
+
+
+def _pseudonymize_arrow_table(batch, pseudo_set: set, pseudo_applied: list):
+    """Wraps a `pyarrow.RecordBatch` in a `pyarrow.Table`, hashing the PII columns.
+
+    Columns not listed in `pseudo_set` keep their NATIVE Arrow type (DATE, TIME
+    and TIMESTAMP included), which DuckDB ingests directly without normalization.
+    Only the PII columns are rewritten as strings holding the truncated SHA-1 hash.
+
+    Mutates `pseudo_applied` in place (appends the name of each pseudonymized
+    column the first time it appears).
+    """
+    import pyarrow as pa
+    if not pseudo_set:
+        return pa.Table.from_batches([batch])
+    names = list(batch.schema.names)
+    arrays = []
+    for i, name in enumerate(names):
+        col = batch.column(i)
+        if name in pseudo_set:
+            hashed = _pseudonymize_series(col.to_pylist())
+            arrays.append(pa.array(hashed, type=pa.string()))
+            if name not in pseudo_applied:
+                pseudo_applied.append(name)
+        else:
+            arrays.append(col)
+    new_batch = pa.RecordBatch.from_arrays(arrays, names=names)
+    return pa.Table.from_batches([new_batch])
+
+
+def load_bq_table_to_duckdb(
+    table_fqn: str,
+    duckdb_path: str,
+    dest_table: str = "",
+    sample_frac: float = None,
+    max_rows: int = 0,
+    project_id: str = "",
+    pseudonymize_cols: list = None,
+    where_sql: str = "",
+    select_sql: str = "",
+) -> dict:
+    try:
+        import duckdb
+        import google.auth
+        from google.cloud import bigquery
+
+        if not table_fqn or not _FQN_RE.match(table_fqn):
+            return {"status": "error", "error": f"table_fqn inválido: {table_fqn!r}"}
+
+        # dest_table: derived from the last FQN segment when not given, sanitized.
+        dest = _sanitize_dest_table(dest_table, table_fqn)
+
+        # ADC auth with the quota-project fix (403 USER_PROJECT_DENIED).
+        creds, adc_project = google.auth.default(
+            scopes=["https://www.googleapis.com/auth/bigquery"]
+        )
+        if hasattr(creds, "with_quota_project"):
+            creds = creds.with_quota_project(None)
+        proj = project_id or table_fqn.split(".")[0] or adc_project
+        client = bigquery.Client(project=proj, credentials=creds)
+
+        # Auto-cast of problematic types: when the caller does not project its
+        # own select_sql, the source schema is inspected and the following are
+        # cast automatically: BIGNUMERIC -> FLOAT64 (Arrow decimal256, which
+        # DuckDB does not ingest), REPEATED/RECORD/JSON -> TO_JSON_STRING
+        # (LIST/STRUCT values break the downstream profiling) and GEOGRAPHY ->
+        # ST_ASTEXT. Best-effort: if the schema cannot be read, it falls back to
+        # SELECT * as before.
+        auto_casts = {}
+        if not (select_sql and select_sql.strip()):
+            try:
+                src = client.get_table(table_fqn)
+                auto_sel, auto_casts = _auto_select_exprs(src.schema)
+                if auto_sel:
+                    select_sql = auto_sel
+            except Exception:  # noqa: BLE001
+                auto_casts = {}
+
+        # Row count of the FILTERED source: applies the same `where_sql` (counts
+        # the rows that match the filter, not the whole table). Sampling and
+        # max_rows are NOT part of the count, so n_rows_fetched can be lower.
+        count_where = ""
+        _ws = (where_sql or "").strip()
+        if _ws:
+            count_where = f" WHERE {_ws}"
+        cnt = client.query(
+            f"SELECT COUNT(*) AS n FROM `{table_fqn}`{count_where}"
+        ).result()
+        n_source = 0
+        for row in cnt:
+            n_source = int(row["n"])
+
+        # Default mode = FULL (sample_frac=None -> all rows). Sampling is
+        # opt-in: sample_frac in (0,1) samples that fraction with
+        # `rand() < frac`, combined with `where_sql` via AND. max_rows>0 is an
+        # optional hard cap (LIMIT). `select_sql` projects expressions (empty = `*`).
+        sampled = sample_frac is not None and 0 < float(sample_frac) < 1
+        sql = _build_source_sql(table_fqn, select_sql, where_sql, sample_frac, max_rows)
+
+        # Is pyarrow available? Pick the ingestion path BEFORE consuming the
+        # result (Arrow streaming in batches vs. a full DataFrame).
+        try:
+            import pyarrow  # noqa: F401
+            has_pyarrow = True
+        except Exception:  # noqa: BLE001
+            has_pyarrow = False
+
+        job = client.query(sql)
+        result = job.result()
+        use_stream = has_pyarrow and hasattr(result, "to_arrow_iterable")
+
+        pseudo_set = set(pseudonymize_cols or [])
+        pseudo_applied = []
+        n_fetched = 0
+        columns = []
+        streamed = False
+
+        con = duckdb.connect(duckdb_path)
+        try:
+            if use_stream:
+                # BigQuery Storage client with the SAME corrected creds
+                # (quota None). If the library is missing, to_arrow_iterable
+                # falls back to the REST-Arrow transport with bqstorage_client=None.
+                try:
+                    from google.cloud import bigquery_storage
+                    bqstorage_client = bigquery_storage.BigQueryReadClient(
+                        credentials=creds
+                    )
+                except Exception:  # noqa: BLE001
+                    bqstorage_client = None
+
+                first = True
+                for batch in result.to_arrow_iterable(
+                    bqstorage_client=bqstorage_client
+                ):
+                    # PII pseudonymization PER BATCH; non-PII columns keep their Arrow type.
+                    tbl = _pseudonymize_arrow_table(batch, pseudo_set, pseudo_applied)
+
+                    # BIGNUMERIC gotcha: DuckDB cannot ingest Arrow decimal256.
+                    # The auto-cast above covers the default path, so this only
+                    # triggers when the caller passes its own select_sql or the
+                    # schema could not be read. Detect it on the first batch and
+                    # return a clear error recommending a cast to FLOAT64.
+                    if first:
+                        dcols = _decimal256_columns(tbl.schema)
+                        if dcols:
+                            return {
+                                "status": "error",
+                                "error": (
+                                    "Ingesta Arrow bloqueada: columnas BIGNUMERIC "
+                                    f"(Arrow decimal256) que DuckDB no ingiere: {dcols}. "
+                                    "Castéalas a FLOAT64 con select_sql, p. ej. "
+                                    "select_sql='..., CAST(col AS FLOAT64) AS col, ...'."
+                                ),
+                                "stage": "stream_schema",
+                            }
+
+                    con.register("_batch_arrow", tbl)
+                    try:
+                        if first:
+                            con.execute(
+                                f'CREATE OR REPLACE TABLE "{dest}" '
+                                f"AS SELECT * FROM _batch_arrow"
+                            )
+                            columns = list(tbl.schema.names)
+                            first = False
+                        else:
+                            con.execute(
+                                f'INSERT INTO "{dest}" SELECT * FROM _batch_arrow'
+                            )
+                    except Exception as ie:  # noqa: BLE001
+                        msg = str(ie).lower()
+                        if "decimal256" in msg or ("decimal" in msg and "256" in msg):
+                            return {
+                                "status": "error",
+                                "error": (
+                                    "Ingesta Arrow falló por columna BIGNUMERIC "
+                                    "(Arrow decimal256) que DuckDB no ingiere. Castea "
+                                    "esas columnas a FLOAT64 con select_sql. Detalle: "
+                                    + str(ie)
+                                ),
+                                "stage": "stream_insert",
+                            }
+                        raise
+                    finally:
+                        con.unregister("_batch_arrow")
+                    n_fetched += tbl.num_rows
+
+                # Empty source: if the iterable yielded no batch, materialize an
+                # empty table with the source schema (so downstream code does not
+                # fail with "table does not exist"). job.result() gives a fresh iterator.
+                if first:
+                    empty_df = job.result().to_dataframe(create_bqstorage_client=False)
+                    con.register("_empty_df", empty_df)
+                    con.execute(
+                        f'CREATE OR REPLACE TABLE "{dest}" AS SELECT * FROM _empty_df'
+                    )
+                    con.unregister("_empty_df")
+                    columns = list(empty_df.columns)
+                streamed = True
+            else:
+                # Fallback: full DataFrame path (loads the WHOLE result into
+                # RAM). Same behavior as before the Arrow streaming path.
+                try:
+                    df = result.to_dataframe(create_bqstorage_client=True)
+                except Exception:  # noqa: BLE001
+                    df = job.result().to_dataframe(create_bqstorage_client=False)
+                n_fetched = len(df)
+
+                # Normalize db-dtypes dtypes (pandas path only): the REST converter
+                # maps DATE/TIME to the `dbdate`/`dbtime` extension dtypes from
+                # db-dtypes, which DuckDB does NOT recognize when registering the
+                # DataFrame. They are converted to standard types: DATE ->
+                # datetime64[ns], TIME -> string. This does not apply to the Arrow
+                # path (native Arrow types).
+                import pandas as pd
+                for col in df.columns:
+                    dt = str(df[col].dtype)
+                    if dt == "dbdate":
+                        df[col] = pd.to_datetime(df[col], errors="coerce")
+                    elif dt == "dbtime":
+                        df[col] = df[col].astype("string").astype(object)
+
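+                # Pseudonymize PII columns before writing to disk. Iterating
+                # df.columns (not the caller's list) hashes each column exactly
+                # once even if pseudonymize_cols has duplicates, like the Arrow path.
+                for col in df.columns:
+                    if col in pseudo_set:
+                        df[col] = _pseudonymize_series(df[col].tolist())
+                        pseudo_applied.append(col)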
+
+                con.register("_src_df", df)
+                con.execute(
+                    f'CREATE OR REPLACE TABLE "{dest}" AS SELECT * FROM _src_df'
+                )
+                con.unregister("_src_df")
+                columns = list(df.columns)
+        finally:
+            con.close()
+
+        return {
+            "status": "ok",
+            "duckdb_path": duckdb_path,
+            "table": dest,
+            "n_rows_source": n_source,
+            "n_rows_fetched": n_fetched,
+            "sampled": sampled,
+            "sample_frac": float(sample_frac) if sampled else None,
+            "columns": columns,
+            "pseudonymized": pseudo_applied,
+            "streamed": streamed,
+            "auto_casts": auto_casts,
+        }
+    except Exception as e:  # noqa: BLE001
+        return {"status": "error", "error": str(e)}
@@ -0,0 +1,135 @@
+"""Tests for load_bq_table_to_duckdb.
+
+Covers the PURE logic that can be exercised without network or BigQuery: building
+the SELECT over the source (`_build_source_sql`: where_sql + sample_frac combined
+with AND, select_sql replacing `*`, hard limit) and sanitizing the destination
+table name (`_sanitize_dest_table`). No network access: importing the module only
+loads `hashlib`/`re` at top level (BigQuery/DuckDB/pyarrow are imported inside the
+impure function, which is not called here).
+"""
+
+import os
+import sys
+
+sys.path.insert(0, os.path.dirname(__file__))
+
+from load_bq_table_to_duckdb import _build_source_sql, _sanitize_dest_table
+
+_FQN = "autingo-159109.data.ventas"
+
+
+# --------------------------------------------------------------------------- #
+# _build_source_sql — golden / defaults
+# --------------------------------------------------------------------------- #
+def test_default_selects_star_no_where_no_limit():
+    sql = _build_source_sql(_FQN)
+    assert sql == "SELECT * FROM `autingo-159109.data.ventas`"
+
+
+def test_select_sql_replaces_star():
+    sql = _build_source_sql(
+        _FQN,
+        select_sql="fecha, idCentro, CAST(venta_n AS FLOAT64) AS venta_n",
+    )
+    assert sql == (
+        "SELECT fecha, idCentro, CAST(venta_n AS FLOAT64) AS venta_n "
+        "FROM `autingo-159109.data.ventas`"
+    )
+
+
+def test_select_sql_blank_and_whitespace_fall_back_to_star():
+    assert _build_source_sql(_FQN, select_sql="").startswith("SELECT * FROM")
+    assert _build_source_sql(_FQN, select_sql="   ").startswith("SELECT * FROM")
+
+
+# --------------------------------------------------------------------------- #
+# where_sql and sample_frac: alone and combined with AND
+# --------------------------------------------------------------------------- #
+def test_where_sql_only():
+    sql = _build_source_sql(_FQN, where_sql="fecha <= CURRENT_DATE()")
+    assert sql == (
+        "SELECT * FROM `autingo-159109.data.ventas` "
+        "WHERE fecha <= CURRENT_DATE()"
+    )
+
+
+def test_sample_frac_only():
+    sql = _build_source_sql(_FQN, sample_frac=0.05)
+    assert sql == "SELECT * FROM `autingo-159109.data.ventas` WHERE rand() < 0.05"
+
+
+def test_where_sql_and_sample_frac_combined_with_and_parenthesized():
+    sql = _build_source_sql(
+        _FQN,
+        where_sql="fecha <= CURRENT_DATE() AND venta_n IS NOT NULL",
+        sample_frac=0.1,
+    )
+    # Two conditions -> each one in parentheses, joined with AND.
+    assert sql == (
+        "SELECT * FROM `autingo-159109.data.ventas` "
+        "WHERE (fecha <= CURRENT_DATE() AND venta_n IS NOT NULL) "
+        "AND (rand() < 0.1)"
+    )
+
+
+def test_single_condition_not_parenthesized():
+    # A single condition is not wrapped in parentheses (cleaner SQL).
+    assert " WHERE fecha = 1" in _build_source_sql(_FQN, where_sql="fecha = 1")
+
+
+# --------------------------------------------------------------------------- #
+# max_rows (LIMIT): alone and combined
+# --------------------------------------------------------------------------- #
+def test_max_rows_appends_limit():
+    sql = _build_source_sql(_FQN, max_rows=1000)
+    assert sql == "SELECT * FROM `autingo-159109.data.ventas` LIMIT 1000"
+
+
+def test_max_rows_zero_or_negative_no_limit():
+    assert "LIMIT" not in _build_source_sql(_FQN, max_rows=0)
+    assert "LIMIT" not in _build_source_sql(_FQN, max_rows=-5)
+
+
+def test_all_combined_order_where_then_limit():
+    sql = _build_source_sql(
+        _FQN,
+        select_sql="a, b",
+        where_sql="a > 0",
+        sample_frac=0.2,
+        max_rows=500,
+    )
+    assert sql == (
+        "SELECT a, b FROM `autingo-159109.data.ventas` "
+        "WHERE (a > 0) AND (rand() < 0.2) LIMIT 500"
+    )
+
+
+# --------------------------------------------------------------------------- #
+# sample_frac out of range -> no sampling
+# --------------------------------------------------------------------------- #
+def test_sample_frac_out_of_range_ignored():
+    # >=1, <=0 and None do not add the rand() clause.
+    assert "rand()" not in _build_source_sql(_FQN, sample_frac=1.0)
+    assert "rand()" not in _build_source_sql(_FQN, sample_frac=0.0)
+    assert "rand()" not in _build_source_sql(_FQN, sample_frac=None)
+
+
+# --------------------------------------------------------------------------- #
+# _sanitize_dest_table
+# --------------------------------------------------------------------------- #
+def test_dest_empty_uses_last_fqn_segment():
+    assert _sanitize_dest_table("", "proj.dataset.customer_profile") == "customer_profile"
+
+
+def test_dest_explicit_valid_kept():
+    assert _sanitize_dest_table("mi_tabla", _FQN) == "mi_tabla"
+
+
+def test_dest_invalid_chars_replaced_with_underscore():
+    assert _sanitize_dest_table("my-table", _FQN) == "my_table"
+    assert _sanitize_dest_table("weird!!name", _FQN) == "weird__name"
+
+
+def test_dest_from_fqn_segment_with_hyphen_sanitized():
+    # A last segment with hyphens is sanitized (a hyphen is not valid in an identifier).
+    assert _sanitize_dest_table("", "proj.dataset.tabla-con-guiones") == "tabla_con_guiones"
@@ -0,0 +1,121 @@
+---
+id: render_table_as_figure_py_datascience
+name: render_table_as_figure
+kind: function
+lang: py
+domain: datascience
+version: "1.0.0"
+purity: impure
+signature: "def render_table_as_figure(header, rows, title=None, note=None, fontsize=9.0, max_cell_chars=40) -> \"matplotlib.figure.Figure\""
+description: "Draws a tabular block (header + rows) as a crisp matplotlib.figure.Figure ready to be rasterized at high DPI. Intended for tables that do NOT fit as text on a page/slide of the EDA report: the table is rasterized at high resolution (the caller uses dpi=220, bbox_inches='tight') and the reader zooms in on a phone to read it in full without losing data. Shaded (#eef3f6) bold header, soft zebra (#f6f8fa) on even (1-based) rows, dark ink (#1b1b1b) on white, very thin gray grid (#cccccc). Each value is str()-ed (None -> \"\") and each cell is truncated to max_cell_chars with an ellipsis. figsize is proportional to the content (width from the number and length of columns, height from the number of rows) so it stays legible under zoom. Agg backend, no global pyplot. Defensive: ragged rows are padded to the widest row; header and rows both empty or None, or any internal error, return a placeholder Figure with the centered text \"(tabla no disponible)\". NEVER raises."
+tags: [eda, table, figure, matplotlib, visualization, rasterize, zoom, render, datascience, impure]
+uses_functions: []
+uses_types: []
+returns: []
+returns_optional: false
+error_type: "error_go_core"
+imports: [matplotlib]
+example: |
+  from datascience.render_table_as_figure import render_table_as_figure
+  header = ["columna", "n_nulos", "%_nulos", "distintos", "tipo", "ejemplo"]
+  rows = [
+      ["ingresos", 12, "1.2%", 980, "float64", "2345.67"],
+      ["edad", 0, "0.0%", 88, "int64", "37"],
+      ["ciudad", 5, "0.5%", 412, "object", "Madrid"],
+  ]
+  fig = render_table_as_figure(header, rows, title="Resumen de columnas",
+                               note="rasteriza a dpi=220 y haz zoom")
+  fig.savefig("/tmp/tabla.png", dpi=220, bbox_inches="tight")
+tested: true
+tests:
+  - "test_returns_figure_with_table"
+  - "test_rows_none_does_not_raise"
+  - "test_header_none_does_not_raise"
+  - "test_empty_lists_return_placeholder_figure"
+  - "test_both_none_return_placeholder_figure"
+  - "test_long_cell_is_truncated"
+  - "test_none_cells_become_empty_strings"
+  - "test_can_rasterize_to_png_high_dpi"
+  - "test_placeholder_can_rasterize"
+  - "test_ragged_rows_are_padded"
+test_file_path: "python/functions/datascience/render_table_as_figure_test.py"
+file_path: "python/functions/datascience/render_table_as_figure.py"
+params:
+  - name: header
+    desc: "List of column names (may be [] or None). Each name is str()-ed, truncated to max_cell_chars and drawn in the shaded, bold header row. If it is empty/None no header row is drawn (body only)."
+  - name: rows
+    desc: "List of rows; each row is a list of cells holding arbitrary values (they are str()-ed; None -> \"\"). Accepts None (treated as []), scalar rows (wrapped into a single cell) and rows of different lengths (the grid is made rectangular at the maximum width, padding with empty cells). Newlines/tabs inside a cell are collapsed to spaces so it does not spill into other rows."
+  - name: title
+    desc: "Optional title drawn above the table, bold, ink #1b1b1b, left-aligned. None or \"\" => no title. Default None."
+  - name: note
+    desc: "Optional note at the bottom of the figure, gray #8a8a8a and italic. None or \"\" => no note. Default None."
+  - name: fontsize
+    desc: "Base font size (pt) of the body cells. The header uses fontsize+3 and the note max(7, fontsize-1). A non-numeric or <= 0 value falls back to 9.0. Default 9.0."
+  - name: max_cell_chars
+    desc: "Truncates the text of each cell to this number of chars (with a trailing … when cut) so the width does not blow up. A non-integer value falls back to 40; <= 0 leaves the cells empty. Default 40."
+output: "A matplotlib.figure.Figure (figsize proportional to the content: width ≈ 0.9-1.6\" per column depending on its text, total clamped to 3-26\"; height ≈ 0.32\" per row + header + room for title/note, clamped) with an axis-less Axes holding an ax.table(...), NOT closed. Header background #eef3f6, text #1b1b1b bold; even (1-based) rows zebra #f6f8fa, odd rows white; ink #1b1b1b; borders/grid #cccccc lw 0.4; left-aligned text. Title above (bold) and note below (gray italic) when given. If header and rows are both empty or None, or on any internal error, it returns a small placeholder Figure with the centered text \"(tabla no disponible)\". NEVER raises. The caller rasterizes it (dpi=220, bbox_inches='tight') and closes it; the function neither shows nor saves it."
+---
+
+## Example
+
+```python
+import sys, os
+sys.path.insert(0, os.path.join("python", "functions"))
+from datascience.render_table_as_figure import render_table_as_figure
+
+# A table that does not fit as text on the slide -> rasterize it and read it zoomed in.
+header = ["columna", "n_nulos", "%_nulos", "distintos", "tipo", "ejemplo"]
+rows = [
+    ["ingresos", 12, "1.2%", 980, "float64", "2345.67"],
+    ["edad", 0, "0.0%", 88, "int64", "37"],
+    ["ciudad", 5, "0.5%", 412, "object", "Madrid"],
+    ["categoria_producto", 0, "0.0%", 1840, "object",
+     "un_valor_categorico_muy_largo_que_se_trunca"],
+]
+
+fig = render_table_as_figure(
+    header,
+    rows,
+    title="Resumen de columnas",
+    note="rasteriza a dpi=220 y haz zoom en el móvil",
+    fontsize=9.0,
+    max_cell_chars=40,
+)
+
+# The report renderer rasterizes it at high resolution; here we just save it.
+fig.savefig("/tmp/tabla.png", dpi=220, bbox_inches="tight")
+```
+
+## When to use it
+
+Use it in an EDA report when a table **does not fit as text** on a page or
+slide and you prefer a crisp image the reader can enlarge on a phone to read it
+in full (column profiles, count matrices, frequency tables with many rows or
+wide columns). Pass the header and rows as they are (the values are `str()`-ed
+for you) plus an optional `title`/`note`; the caller rasterizes it at `dpi=220`
+with `bbox_inches='tight'`. It is the "table-as-image" counterpart of the
+`build_boxplots_figure` / `categorical_top_pie_figure` charts: same palette and
+same contract (Agg, no `pyplot`, the caller closes the figure).
+
+## Gotchas
+
+- **Impure because of matplotlib.** It touches the rendering machinery. It uses
+  the `Agg` backend and the object-oriented `Figure`/`add_subplot` API, NEVER
+  `pyplot.*`, so it does not touch pyplot's global state or leak figures
+  between calls. `pyplot` is NOT thread-safe; this function builds the `Figure`
+  directly, so it is safe to call in a loop from the renderer.
+- **The caller releases the figure.** It returns the `Figure` but neither shows
+  nor saves it. Because the figure is built without `pyplot`, it is not
+  registered in pyplot's figure manager, so `matplotlib.pyplot.close(fig)` has
+  nothing to release: after rasterizing, drop the reference (optionally calling
+  `fig.clear()` first) so large batches do not accumulate memory.
+- **Meant to be rasterized at high DPI.** The `figsize` is proportional to the
+  content, but actual legibility comes from the DPI: rasterize with `dpi=220`
+  and `bbox_inches='tight'`. A table with very many rows grows in height
+  (capped at ~60"); for thousands of rows, split or summarize the table first.
+- **Cell truncation is visible.** Each cell is cut to `max_cell_chars`
+  (default 40) with a trailing `…`, and newlines/tabs are collapsed to spaces so
+  that no cell spills into other rows. Raise `max_cell_chars` if you need to
+  see the full value (at the cost of width).
+- **Defensive, never raises.** Empty or `None` `header`/`rows`, scalar rows,
+  rows of different lengths and any internal error are handled without
+  propagating: in the worst case it returns a placeholder `Figure` reading
+  "(tabla no disponible)". There is no need to wrap the call in try/except.
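
To make the padding and truncation rules above concrete, here is a minimal, standard-library-only sketch of the grid normalization the function is documented to perform. `cell_text` and `normalize_grid` are hypothetical names chosen for this illustration, not part of the module's public API, and the real implementation may differ in details such as how it validates `max_cell_chars`.

```python
def cell_text(value, max_cell_chars=40):
    """Stringify one cell: None -> "", flatten newlines/tabs, truncate with an ellipsis."""
    s = "" if value is None else str(value)
    s = s.replace("\n", " ").replace("\r", " ").replace("\t", " ")
    if max_cell_chars <= 0:
        return ""
    if len(s) <= max_cell_chars:
        return s
    return s[: max_cell_chars - 1] + "…"


def normalize_grid(header, rows, max_cell_chars=40):
    """Return (header_cells, body_cells) as a rectangular grid of strings."""
    header = list(header) if isinstance(header, (list, tuple)) else []
    clean_rows = []
    for row in rows if isinstance(rows, (list, tuple)) else []:
        if isinstance(row, (list, tuple)):
            clean_rows.append(list(row))
        elif row is None:
            clean_rows.append([])
        else:
            clean_rows.append([row])  # a scalar row becomes a single-cell row

    # Grid width = widest of the header and every row.
    n_cols = max([len(header)] + [len(r) for r in clean_rows])

    def pad(cells):
        # Missing trailing cells are filled with empty strings.
        return [
            cell_text(cells[c] if c < len(cells) else "", max_cell_chars)
            for c in range(n_cols)
        ]

    return pad(header), [pad(r) for r in clean_rows]


header_cells, body_cells = normalize_grid(
    ["col", "value"],
    [["a", None, "line1\nline2"], 7, ["a_very_long_cell_value"]],
    max_cell_chars=10,
)
print(header_cells)
print(body_cells)
```

The first row is wider than the header, so the header gains an empty third cell; the scalar row `7` and the one-cell row are padded to the same width, which is why ragged input still yields a rectangular table rather than the placeholder.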
@@ -0,0 +1,241 @@
+"""Impure EDA helper: a crisp table rendered as a matplotlib Figure (`eda` group).
+
+Draws a tabular block (header + rows) as a sharp ``matplotlib.figure.Figure``
+ready to be rasterized at high DPI, so a table that does NOT fit as text on a
+page/slide can still be read in full by zooming into the rasterized image on a
+phone. The header is shaded and bold, even rows carry a soft zebra stripe, the
+ink is dark on white and the grid is very thin.
+
+Impure because it touches matplotlib's rendering machinery. It uses the headless
+Agg backend and the object-oriented ``Figure`` API (no ``pyplot``) so it leaks no
+global state and is safe to call repeatedly from a report renderer. It is fully
+defensive and NEVER raises: empty/invalid input or any internal error returns a
+small placeholder figure carrying a centered "(tabla no disponible)".
+"""
+
+import matplotlib
+
+matplotlib.use("Agg")
+
+from matplotlib.figure import Figure  # noqa: E402
+
+# Palette shared with the EDA report renderer so the document stays coherent.
+_HEADER_BG = "#eef3f6"   # header cell background.
+_HEADER_TEXT = "#1b1b1b"  # header cell text (bold).
+_ZEBRA_BG = "#f6f8fa"    # even (1-based) row background stripe.
+_BODY_BG = "#ffffff"     # odd row background.
+_INK = "#1b1b1b"         # body text + title ink.
+_GRID = "#cccccc"        # cell borders / grid (thin).
+_NOTE_TEXT = "#8a8a8a"   # muted gray for the note (italic).
+
+
+def _placeholder_figure(message: str = "(tabla no disponible)") -> "Figure":
+    """Return a small fallback ``Figure`` carrying a single centered message."""
+    fig = Figure(figsize=(6.0, 1.6), dpi=150)
+    ax = fig.add_subplot(111)
+    ax.axis("off")
+    ax.text(
+        0.5,
+        0.5,
+        message,
+        ha="center",
+        va="center",
+        fontsize=11,
+        color=_NOTE_TEXT,
+        style="italic",
+        wrap=True,
+        transform=ax.transAxes,
+    )
+    fig.tight_layout()
+    return fig
+
+
+def _cell_text(value, max_cell_chars: int) -> str:
+    """``str()`` a cell value defensively, None -> "", truncate with an ellipsis."""
+    s = "" if value is None else str(value)
+    # Collapse newlines/tabs so a single cell never spills across table rows.
+    s = s.replace("\n", " ").replace("\r", " ").replace("\t", " ")
+    try:
+        limit = int(max_cell_chars)
+    except (TypeError, ValueError):
+        limit = 40
+    if limit <= 0:
+        return ""
+    if len(s) <= limit:
+        return s
+    if limit == 1:
+        return "…"
+    return s[: limit - 1] + "…"
+
+
+def render_table_as_figure(
+    header,
+    rows,
+    title=None,
+    note=None,
+    fontsize=9.0,
+    max_cell_chars=40,
+):
+    """Draw a crisp table as a matplotlib.figure.Figure, ready to rasterize at high DPI.
+
+    Intended for tables that do NOT fit as text on a page/slide: the figure is
+    rasterized at high resolution and the reader zooms in on a phone to read it in
+    full without losing data. Shaded bold header, soft zebra on even rows, dark
+    ink on white, very thin grid.
+
+    Args:
+        header: list of column names (may be [] or None).
+        rows: list of rows; each row is a list of cells (arbitrary values, str()-ed).
+        title: optional title drawn above the table (or None).
+        note: optional gray/italic note below the table (or None).
+        fontsize: base font size (pt) of the cells.
+        max_cell_chars: truncates the cell text to this number of chars (with a trailing …) so the width does not blow up.
+
+    Returns:
+        matplotlib.figure.Figure, NOT closed (the caller rasterizes and closes it).
+        Never raises: on any error it returns a Figure with the text "(tabla no disponible)".
+    """
+    try:
+        # --- Defensive normalization of header/rows into a rectangular grid.
+        header_list = list(header) if isinstance(header, (list, tuple)) else []
+        raw_rows = list(rows) if isinstance(rows, (list, tuple)) else []
+
+        clean_rows = []
+        for row in raw_rows:
+            if isinstance(row, (list, tuple)):
+                clean_rows.append(list(row))
+            elif row is None:
+                clean_rows.append([])
+            else:
+                # A scalar row becomes a single-cell row instead of being dropped.
+                clean_rows.append([row])
+
+        # Nothing to draw at all -> placeholder.
+        if not header_list and not clean_rows:
+            return _placeholder_figure()
+
+        # Number of columns = widest of header / any row.
+        n_cols = len(header_list)
+        for row in clean_rows:
+            if len(row) > n_cols:
+                n_cols = len(row)
+        if n_cols <= 0:
+            return _placeholder_figure()
+
+        # Base font size, tolerate a bad value.
+        try:
+            base_fs = float(fontsize)
+        except (TypeError, ValueError):
+            base_fs = 9.0
+        if base_fs <= 0:
+            base_fs = 9.0
+
+        # --- Build the truncated, padded text matrix.
+        header_cells = [
+            _cell_text(header_list[c] if c < len(header_list) else "", max_cell_chars)
+            for c in range(n_cols)
+        ]
+        body_cells = []
+        for row in clean_rows:
+            body_cells.append(
+                [
+                    _cell_text(row[c] if c < len(row) else "", max_cell_chars)
+                    for c in range(n_cols)
+                ]
+            )
+
+        has_header = any(t for t in header_cells)
+        n_body = len(body_cells)
+        # Total drawn table rows (header counts as one when present).
+        n_table_rows = n_body + (1 if has_header else 0)
+        if n_table_rows <= 0:
+            return _placeholder_figure()
+
+        # --- figsize proportional to content so it reads under zoom.
+        # Width: per-column width scales with the longest text in that column,
+        # clamped to a sensible per-column range, total capped.
+        per_col_widths = []
+        for c in range(n_cols):
+            col_texts = [header_cells[c]] if has_header else []
+            col_texts += [body_cells[r][c] for r in range(n_body)]
+            longest = max((len(t) for t in col_texts), default=0)
+            # ~0.085" per char at the base font, clamped to [0.9, 1.6] inches.
+            w = 0.9 + 0.085 * max(longest - 6, 0)
+            w = max(0.9, min(1.6, w))
+            per_col_widths.append(w)
+        fig_w = sum(per_col_widths)
+        fig_w = max(3.0, min(26.0, fig_w))
+
+        # Height: ~0.32" per row + room for title / note.
+        fig_h = 0.32 * n_table_rows + 0.30
+        if title is not None and str(title) != "":
+            fig_h += 0.45
+        if note is not None and str(note) != "":
+            fig_h += 0.30
+        fig_h = max(1.0, min(60.0, fig_h))
+
+        fig = Figure(figsize=(fig_w, fig_h), dpi=150)
+        ax = fig.add_subplot(111)
+        ax.axis("off")
+
+        # Reserve vertical bands for the optional title (top) and note (bottom)
+        # so the table itself never overlaps them.
+        title_band = 0.10 if (title is not None and str(title) != "") else 0.0
+        note_band = 0.07 if (note is not None and str(note) != "") else 0.0
+        table_bbox = [0.0, note_band, 1.0, max(0.05, 1.0 - title_band - note_band)]
+
+        cell_text = ([header_cells] if has_header else []) + body_cells
+
+        col_widths = [w / fig_w for w in per_col_widths]
+
+        table = ax.table(
+            cellText=cell_text,
+            colWidths=col_widths,
+            cellLoc="left",
+            loc="center",
+            bbox=table_bbox,
+        )
+        table.auto_set_font_size(False)
+        table.set_fontsize(base_fs)
+
+        # --- Style every cell: zebra body, shaded bold header, thin gray grid.
+        for (r, _c), cell in table.get_celld().items():
+            cell.set_edgecolor(_GRID)
+            cell.set_linewidth(0.4)
+            # Small horizontal padding so text does not touch the border.
+            cell.PAD = 0.04
+            if has_header and r == 0:
+                cell.set_facecolor(_HEADER_BG)
+                cell.set_text_props(color=_HEADER_TEXT, fontweight="bold", ha="left")
+            else:
+                body_index = r - 1 if has_header else r  # 0-based body row.
+                # 1-based even rows get the zebra stripe.
+                is_even = ((body_index + 1) % 2) == 0
+                cell.set_facecolor(_ZEBRA_BG if is_even else _BODY_BG)
+                cell.set_text_props(color=_INK, ha="left")
+
+        if title is not None and str(title) != "":
+            ax.set_title(
+                str(title),
+                fontsize=base_fs + 3.0,
+                fontweight="bold",
+                color=_INK,
+                loc="left",
+                pad=8,
+            )
+
+        if note is not None and str(note) != "":
+            fig.text(
+                0.01,
+                0.01,
+                str(note),
+                ha="left",
+                va="bottom",
+                fontsize=max(7.0, base_fs - 1.0),
+                color=_NOTE_TEXT,
+                style="italic",
+            )
+
+        return fig
+    except Exception:  # noqa: BLE001 — never raise from a figure builder.
+        return _placeholder_figure()
@@ -0,0 +1,119 @@
+"""Tests for render_table_as_figure (crisp table as a Figure, eda group).
+
+Uses the Agg backend with no display; figures are never shown or saved to disk,
+only to an in-memory BytesIO. Each test explicitly closes the Figure it builds
+(matplotlib.pyplot.close) so no state accumulates between tests.
+"""
+
+from io import BytesIO
+
+import matplotlib
+
+matplotlib.use("Agg")
+
+import matplotlib.pyplot as plt  # noqa: E402
+from matplotlib.figure import Figure  # noqa: E402
+
+from render_table_as_figure import render_table_as_figure  # noqa: E402
+
+
+def _grid(n_cols, n_rows):
+    """Header with n_cols columns plus n_rows rows of cells."""
+    header = [f"col_{c}" for c in range(n_cols)]
+    rows = [[f"r{r}c{c}" for c in range(n_cols)] for r in range(n_rows)]
+    return header, rows
+
+
+def test_returns_figure_with_table():
+    header, rows = _grid(6, 5)
+    fig = render_table_as_figure(header, rows, title="Tabla", note="nota al pie")
+    assert isinstance(fig, Figure)
+    # There is at least one Axes, and that Axes contains a table with cells.
+    assert len(fig.axes) >= 1
+    ax = fig.axes[0]
+    assert len(ax.tables) >= 1
+    # 6 columns x (1 header + 5 rows) = 36 cells.
+    assert len(ax.tables[0].get_celld()) == 6 * (5 + 1)
+    plt.close(fig)
+
+
+def test_rows_none_does_not_raise():
+    fig = render_table_as_figure(["a", "b"], None)
+    assert isinstance(fig, Figure)
+    assert len(fig.axes) >= 1
+    plt.close(fig)
+
+
+def test_header_none_does_not_raise():
+    fig = render_table_as_figure(None, [["x", "y"], ["z", "w"]])
+    assert isinstance(fig, Figure)
+    assert len(fig.axes) >= 1
+    plt.close(fig)
+
+
+def test_empty_lists_return_placeholder_figure():
+    fig = render_table_as_figure([], [])
+    assert isinstance(fig, Figure)
+    # Placeholder: one Axes with text, no table.
+    assert len(fig.axes) >= 1
+    assert len(fig.axes[0].tables) == 0
+    plt.close(fig)
+
+
+def test_both_none_return_placeholder_figure():
+    fig = render_table_as_figure(None, None)
+    assert isinstance(fig, Figure)
+    assert len(fig.axes[0].tables) == 0
+    plt.close(fig)
+
+
+def test_long_cell_is_truncated():
+    long_value = "x" * 200
+    header, _ = _grid(2, 0)
+    fig = render_table_as_figure(header, [[long_value, "ok"]], max_cell_chars=20)
+    assert isinstance(fig, Figure)
+    ax = fig.axes[0]
+    texts = [c.get_text().get_text() for c in ax.tables[0].get_celld().values()]
+    # The long cell appears truncated with an ellipsis and never in its full form.
+    assert any(t.endswith("…") and len(t) <= 20 for t in texts)
+    assert long_value not in texts
+    plt.close(fig)
+
+
+def test_none_cells_become_empty_strings():
+    fig = render_table_as_figure(["a", "b"], [[None, "v"], ["w", None]])
+    assert isinstance(fig, Figure)
+    ax = fig.axes[0]
+    texts = [c.get_text().get_text() for c in ax.tables[0].get_celld().values()]
+    # There are empty cells (the None values) and cells with a value.
+    assert "" in texts
+    assert "v" in texts
+    plt.close(fig)
+
+
+def test_can_rasterize_to_png_high_dpi():
+    header, rows = _grid(6, 8)
+    fig = render_table_as_figure(header, rows, title="Render", note="zoom me")
+    buf = BytesIO()
+    # Rasterizing at high DPI with a tight bbox must not raise.
+    fig.savefig(buf, format="png", dpi=220, bbox_inches="tight")
+    assert buf.getbuffer().nbytes > 0
+    plt.close(fig)
+
+
+def test_placeholder_can_rasterize():
+    fig = render_table_as_figure([], [])
+    buf = BytesIO()
+    fig.savefig(buf, format="png", dpi=220, bbox_inches="tight")
+    assert buf.getbuffer().nbytes > 0
+    plt.close(fig)
+
+
+def test_ragged_rows_are_padded():
+    # Rows of different lengths: the grid is padded out to the maximum width.
+    fig = render_table_as_figure(["a", "b", "c"], [["1"], ["1", "2", "3", "4"]])
+    assert isinstance(fig, Figure)
+    ax = fig.axes[0]
+    # 4 columns (the widest row) x (1 header + 2 rows) = 12 cells.
+    assert len(ax.tables[0].get_celld()) == 4 * (2 + 1)
+    plt.close(fig)
@@ -0,0 +1,466 @@
+"""ACCEPTANCE test battery for AutomaticEDA: "every AEDA comes out the way we want".
+
+This suite is the safety net of the EDA subsystem in the `eda` group: it
+guarantees that EVERY chapter of an AutomaticEDA report comes out populated with
+its essential content, that the standalone-chapter feature (``only_chapters``)
+resolves its compute dependencies, that optional chapters return None when they
+do not apply, that the multi-table folder report detects the FK, and that the
+Markdown carries the full appendix (entire association matrix + describe with
+skew/kurtosis). Unlike the per-chapter unit tests, this suite exercises the
+pipeline END-TO-END on a deterministic synthetic dataset that activates all
+chapters at once.
+
+Determinism: the dataset is generated with a fixed ``seed`` and the pipeline
+runs without an LLM (``profile_level='standard'``), so the manifest and the
+Markdown are reproducible across runs. A single `standard` render is reused
+through a module-scoped fixture to avoid repeating the expensive computation.
+
+dict-no-throw: pipelines in the `eda` group never raise; the tests assert on
+``status == 'ok'`` and then on the concrete manifest / Markdown content.
+
+Honesty (DoD): the asserts check REAL content (each chapter's essential text),
+not just the heading. If a chapter stopped emitting its content (say a change
+broke the numeric distribution, the Isolation Forest, the full correlation
+matrix, ...), the corresponding test FAILS naming the chapter and the missing
+fragment; it is not softened to make it pass.
+"""
+
+import json
+import os
+import subprocess
+import sys
+
+import pytest
+
+_HERE = os.path.dirname(os.path.abspath(__file__))
+_FUNCTIONS = os.path.abspath(os.path.join(_HERE, "..", ".."))  # python/functions
+if _FUNCTIONS not in sys.path:
+    sys.path.insert(0, _FUNCTIONS)
+
+from datascience.automatic_eda import CHAPTER_ORDER  # noqa: E402
+from datascience.generate_synthetic_eda_folder import (  # noqa: E402
+    generate_synthetic_eda_folder,
+)
+from datascience.generate_synthetic_eda_table import (  # noqa: E402
+    generate_synthetic_eda_table,
+)
+from pipelines.render_automatic_eda import render_automatic_eda  # noqa: E402
+from pipelines.render_automatic_eda_folder import (  # noqa: E402
+    render_automatic_eda_folder,
+)
+
+# --------------------------------------------------------------------------- #
+# Deterministic parameters for the golden fixture.
+# --------------------------------------------------------------------------- #
+SEED = 42
+N_ROWS = 800
+TABLE = "synthetic"
+
+# The `analisis_llm` chapter is ONLY computed with run_llm=True; in the
+# `standard` preset (no LLM, which is what this suite uses) it must not appear.
+# So the chapters expected in a `standard` report are all of CHAPTER_ORDER
+# EXCEPT analisis_llm. CHAPTER_ORDER is the source of truth for the engine's
+# 16 chapters (portada … glosario).
+LLM_ONLY_CHAPTERS = {"analisis_llm"}
+EXPECTED_STANDARD = [c for c in CHAPTER_ORDER if c not in LLM_ONLY_CHAPTERS]
+
+
+def _pdf_text(path):
+    """Text of the PDF via pdftotext, or None if the tool is unavailable or fails."""
+    try:
+        out = subprocess.run(
+            ["pdftotext", "-layout", path, "-"],
+            capture_output=True, text=True, timeout=60,
+        )
+        return out.stdout if out.returncode == 0 else None
+    except Exception:  # noqa: BLE001 - the main verification runs on the MD.
+        return None
+
+
+def _manifest_chapters(result):
+    """Set of chapter ids present in the result's manifest."""
+    with open(result["manifest_path"], encoding="utf-8") as fh:
+        return set((json.load(fh).get("chapters") or {}).keys())
+
+
+# --------------------------------------------------------------------------- #
+# Module-scoped fixtures: the synthetic dataset is generated ONCE and the
+# `standard` render is computed ONCE; all content tests reuse them.
+# --------------------------------------------------------------------------- #
+@pytest.fixture(scope="module")
+def synth_db(tmp_path_factory):
+    """Deterministic synthetic table that activates the engine's 16 chapters."""
+    d = tmp_path_factory.mktemp("aeda_accept_synth")
+    db = str(d / "synthetic.duckdb")
+    g = generate_synthetic_eda_table(db, TABLE, n_rows=N_ROWS, seed=SEED)
+    assert g["status"] == "ok", g.get("error")
+    return {"db": db, "table": TABLE, "gen": g}
+
+
+@pytest.fixture(scope="module")
+def standard_run(synth_db, tmp_path_factory):
+    """`standard` AutomaticEDA render (no LLM) on the synthetic dataset.
+
+    Returns the pipeline dict plus the loaded manifest, the Markdown text
+    and the PDF text (if pdftotext is available). Reused by most tests.
+    """
+    out = str(tmp_path_factory.mktemp("aeda_accept_std"))
+    r = render_automatic_eda(
+        synth_db["db"], synth_db["table"],
+        profile_level="standard", out_dir=out, basename="synth_std",
+    )
+    assert r["status"] == "ok", r.get("error")
+    with open(r["manifest_path"], encoding="utf-8") as fh:
+        manifest = json.load(fh)
+    md = open(r["aeda_md_path"], encoding="utf-8").read()
+    return {
+        "r": r,
+        "manifest": manifest,
+        "chapters": manifest.get("chapters") or {},
+        "md": md,
+        "pdf_text": _pdf_text(r["pdf_path"]),
+    }
+
+
+@pytest.fixture(scope="module")
+def minimal_db(tmp_path_factory):
+    """Minimal table with NO free text, NO date and NO lat/lon.
+
+    Used to check that text_distr / timeseries / geospatial return None
+    (they do not appear in the manifest) and the EDA does not crash. Only
+    numeric columns plus one low-cardinality categorical.
+    """
+    import random
+
+    import duckdb
+
+    d = tmp_path_factory.mktemp("aeda_accept_min")
+    db = str(d / "minimal.duckdb")
+    con = duckdb.connect(db)
+    con.execute("CREATE TABLE minimal (a DOUBLE, b DOUBLE, c INTEGER, grp VARCHAR)")
+    random.seed(7)
+    rows = [
+        (round(random.gauss(10, 2), 3), round(random.gauss(50, 5), 3),
+         random.randint(1, 100), ["x", "y", "z"][i % 3])
+        for i in range(120)
+    ]
+    con.executemany("INSERT INTO minimal VALUES (?,?,?,?)", rows)
+    con.close()
+    return {"db": db, "table": "minimal"}
+
+
+# --------------------------------------------------------------------------- #
+# 1) CHAPTER COVERAGE (golden): the standard manifest carries the 15 expected
+#    non-LLM chapters, none missing, and analisis_llm does NOT appear without an LLM.
+# --------------------------------------------------------------------------- #
+def test_standard_cubre_todos_los_capitulos_esperados(standard_run):
+    chapters = set(standard_run["chapters"].keys())
+    expected = set(EXPECTED_STANDARD)
+    missing = expected - chapters
+    assert not missing, (
+        "capítulos esperados ausentes del manifest standard: "
+        f"{sorted(missing)} (presentes: {sorted(chapters)})"
+    )
+    # analisis_llm requires run_llm=True: in standard it must NOT appear.
+    assert "analisis_llm" not in chapters, (
+        "analisis_llm apareció sin LLM: el preset standard no debería computarlo"
+    )
+
+
+def test_manifest_top_level_es_valido(standard_run):
+    """The manifest declares the engine and a dict of chapters with per-id metadata."""
+    man = standard_run["manifest"]
+    assert man.get("engine") == "AutomaticEDA"
+    assert man.get("engine_version")
+    chapters = standard_run["chapters"]
+    # Each chapter carries a version + page/slide count (manifest format).
+    for cid, meta in chapters.items():
+        assert meta.get("version"), f"capítulo {cid} sin version en el manifest"
+        assert (meta.get("n_pages") or 0) > 0, f"capítulo {cid} con 0 páginas"
+
+
+# --------------------------------------------------------------------------- #
+# 2) KEY CONTENT PER CHAPTER (acceptance): each chapter carries its ESSENTIAL
+#    content in the Markdown, not just the heading. A missing fragment names the
+#    chapter and the missing text.
+# --------------------------------------------------------------------------- #
+# STABLE text fragments that each chapter emits in the Markdown for the
+# synthetic dataset. They are not fragile numbers: they are chapter labels and
+# structure plus column names from the fixture. If a chapter stops populating
+# its content, its fragment disappears and the test fails naming it.
+CHAPTER_NEEDLES = {
+    "portada":      ["800 filas", "19 columnas"],
+    "overview":     ["Primeras filas (df.head)", "Diccionario de columnas",
+                     "customer_id", "signup_date"],
+    "num_distr":    ["Distribuciones numéricas", "vallas Tukey", "income"],
+    "cat_distr":    ["Distribuciones categóricas", "Entropía", "Top categorías",
+                     "country"],
+    "text_distr":   ["Texto libre (NLP)", "TTR", "Términos más frecuentes",
+                     "Idioma dominante"],
+    "calidad":      ["Cómo se calcula la calidad", "Calidad global"],
+    "missingness":  ["Datos faltantes", "Celdas faltantes (global)",
+                     "Faltantes por columna"],
+    "outliers":     ["Valores atípicos por columna", "Filas atípicas (multivariante)",
+                     "Isolation Forest", "Filas analizadas"],
+    "correlacion":  ["Matriz de asociación", "Pares más correlacionados"],
+    "relaciones":   ["Candidatas a clave primaria", "customer_id"],
+    "modelos":      ["PCA — varianza explicada", "Segmentación (KMeans)"],
+    "timeseries":   ["Series temporales", "Columna de fecha", "signup_date"],
+    "geospatial":   ["Análisis geoespacial", "Extensión geográfica", "Centroide"],
+    "agregacion":   ["Agregación por grupos", "Agrupado por"],
+    "glosario":     ["Glosario de términos",
+                     "### Isolation Forest (anomalías multivariantes)",
+                     "### PCA (componentes principales)"],
+}
+
+
+def test_needles_cubren_exactamente_los_capitulos_standard():
+    """Maintenance guard: the needles cover exactly the same 15 non-LLM chapters.
+
+    If someone adds a new chapter to CHAPTER_ORDER, this test is a reminder that
+    its essential content must be documented here (or it must be marked LLM-only)."""
+    assert set(CHAPTER_NEEDLES.keys()) == set(EXPECTED_STANDARD), (
+        "CHAPTER_NEEDLES desincronizado con los capítulos esperados de standard: "
+        f"falta needles para {set(EXPECTED_STANDARD) - set(CHAPTER_NEEDLES)}, "
+        f"sobra {set(CHAPTER_NEEDLES) - set(EXPECTED_STANDARD)}"
+    )
+
+
+@pytest.mark.parametrize("chapter_id", list(CHAPTER_NEEDLES.keys()))
+def test_capitulo_trae_su_contenido_esencial(standard_run, chapter_id):
+    md = standard_run["md"]
+    # Precondition: the chapter is in the manifest (coverage). If it is not, that
+    # is a coverage failure, not a content failure, and it is reported as such.
+    assert chapter_id in standard_run["chapters"], (
+        f"capítulo {chapter_id} ausente del manifest (fallo de cobertura)"
+    )
+    for needle in CHAPTER_NEEDLES[chapter_id]:
+        assert needle in md, (
+            f"capítulo '{chapter_id}': falta su contenido esencial en el Markdown "
+            f"— fragmento ausente: {needle!r}"
+        )
+
+
+def test_outliers_isolation_forest_poblado_no_degradado(standard_run):
+    """The multivariate block (Isolation Forest) comes out populated, not degraded."""
+    md = standard_run["md"]
+    assert "Anomalías multivariantes" in md
+    assert "Filas analizadas" in md, "el Isolation Forest no trae su tabla poblada"
+    assert "No se pudo analizar la anomalía multivariante" not in md, (
+        "el bloque multivariante salió degradado en el informe completo"
+    )
+    # The profile carries the models block with the multivariate outliers.
+    models = (standard_run["r"]["profile"] or {}).get("models") or {}
+    assert models.get("outliers") is not None, "profile['models']['outliers'] vacío"
+
+
+# --------------------------------------------------------------------------- #
+# 3) STANDALONE CHAPTERS WITH RESOLVED DEPS (only_chapters acceptance):
+#    requesting a standalone chapter leaves it POPULATED because dependency
+#    resolution enables the computation it needs, even if the caller did not ask.
+# --------------------------------------------------------------------------- #
+def test_only_outliers_isolation_forest_poblado(synth_db, tmp_path):
+    """only_chapters=['outliers'] without explicit run_models → populated IsolationForest."""
+    out = str(tmp_path / "only_out")
+    r = render_automatic_eda(
+        synth_db["db"], synth_db["table"],
+        only_chapters=["outliers"], out_dir=out, basename="only_outliers",
+    )
+    assert r["status"] == "ok", r.get("error")
+    # Document = portada + outliers + glosario, nothing else.
+    assert _manifest_chapters(r) == {"portada", "outliers", "glosario"}
+    md = open(r["aeda_md_path"], encoding="utf-8").read()
+    assert "Filas atípicas (multivariante)" in md
+    assert "Filas analizadas" in md, "Isolation Forest sin tabla poblada"
+    assert "No se pudo analizar la anomalía multivariante" not in md, (
+        "el multivariante salió degradado pese a resolver las deps"
+    )
+    # Dependency resolution enabled run_models → the profile carries the models block.
+    assert ((r["profile"] or {}).get("models") or {}).get("outliers") is not None
+
+
+def test_only_timeseries_rango_temporal_presente(synth_db, tmp_path):
+    """only_chapters=['timeseries'] → populated time range (run_series resolved)."""
+    out = str(tmp_path / "only_ts")
+    r = render_automatic_eda(
+        synth_db["db"], synth_db["table"],
+        only_chapters=["timeseries"], out_dir=out, basename="only_ts",
+    )
+    assert r["status"] == "ok", r.get("error")
+    assert "timeseries" in _manifest_chapters(r)
+    md = open(r["aeda_md_path"], encoding="utf-8").read()
+    assert "Columna de fecha" in md
+    assert "signup_date" in md, "la serie no nombra su columna de fecha"
+    # run_series resolved via deps → the profile carries the series analysis.
+    assert (r["profile"] or {}).get("series") is not None, (
+        "only=['timeseries'] debe activar run_series por dependencias"
+    )
+
+
+def test_only_correlacion_scatters_presentes(synth_db, tmp_path):
+    """only_chapters=['correlacion'] → matrix + scatters of the strong pairs."""
+    out = str(tmp_path / "only_corr")
+    r = render_automatic_eda(
+        synth_db["db"], synth_db["table"],
+        only_chapters=["correlacion"], out_dir=out, basename="only_corr",
+    )
+    assert r["status"] == "ok", r.get("error")
+    assert _manifest_chapters(r) == {"portada", "correlacion", "glosario"}
+    md = open(r["aeda_md_path"], encoding="utf-8").read()
+    assert "Matriz de asociación" in md
+    assert "Relaciones más fuertes (scatter)" in md, "faltan los scatters"
+    assert "Dispersión de" in md, "no se emitió ninguna figura de dispersión"
+
+
+# --------------------------------------------------------------------------- #
+# 4) NONE WHEN NOT APPLICABLE: on a table with no long text, no date and no
+#    lat/lon, text_distr / timeseries / geospatial do NOT appear and the EDA does not crash.
+# --------------------------------------------------------------------------- #
+def test_capitulos_opcionales_ausentes_cuando_no_aplican(minimal_db, tmp_path):
+    out = str(tmp_path / "minimal_out")
+    r = render_automatic_eda(
+        minimal_db["db"], minimal_db["table"],
+        profile_level="standard", out_dir=out, basename="minimal",
+    )
+    assert r["status"] == "ok", r.get("error")
+    chapters = _manifest_chapters(r)
+    for absent in ("text_distr", "timeseries", "geospatial"):
+        assert absent not in chapters, (
+            f"capítulo {absent} apareció en una tabla que no lo justifica "
+            f"(presentes: {sorted(chapters)})"
+        )
+    # The document is still valid: portada + glosario + the chapters that do
+    # apply (at least overview/num_distr).
+    assert {"portada", "glosario", "overview", "num_distr"} <= chapters
+
+
+# --------------------------------------------------------------------------- #
+# 5) MULTI-TABLE FOLDER (acceptance): the folder report profiles the N tables
+#    and the relations chapter detects the FK by containment.
+# --------------------------------------------------------------------------- #
+def test_folder_multitabla_con_fk_detectada(tmp_path):
+    fdir = str(tmp_path / "folder")
+    g = generate_synthetic_eda_folder(fdir, n_rows=300, seed=SEED)
+    assert g["status"] == "ok", g.get("error")
+
+    out = str(tmp_path / "fout")
+    rf = render_automatic_eda_folder(fdir, out_dir=out, basename="folder")
+    assert rf["status"] == "ok", rf.get("error")
+
+    # All 3 tables were profiled.
+    assert rf["n_tables"] == 3, f"esperadas 3 tablas, vistas {rf['n_tables']}"
+
+    # The base manifest carries the inter-table relations chapter.
+    with open(rf["manifest_path"], encoding="utf-8") as fh:
+        chapters = set((json.load(fh).get("chapters") or {}).keys())
+    assert "relaciones" in chapters, (
+        f"el documento de carpeta no incluye el capítulo de relaciones: {chapters}"
+    )
+
+    # The Markdown names the 3 tables and declares the FK detected by containment.
+    md = open(rf["md_path"], encoding="utf-8").read()
+    for tbl in ("customers", "orders", "reviews"):
+        assert tbl in md, f"la tabla {tbl} no aparece en el informe de carpeta"
+    assert "FK candidatas" in md, "no se declaran las FK candidatas"
+    assert "orders.customer_id" in md and "customers.customer_id" in md, (
+        "la FK orders→customers no se detectó por containment"
+    )
+    assert "reviews.customer_id" in md, "la FK reviews→customers no se detectó"
+
+
+# --------------------------------------------------------------------------- #
+# 6) MD COMPLETENESS (regression): the Markdown carries the appendix with the
+#    COMPLETE association matrix (all pairs, not just the top) and the describe
+#    with skew/kurtosis for all numeric columns. Protects an already merged fix.
+# --------------------------------------------------------------------------- #
+def test_md_apendice_matriz_correlacion_completa(standard_run):
+    md = standard_run["md"]
+    assert "Matriz de asociación — todos los pares" in md, (
+        "falta el apéndice con la matriz de asociación completa"
+    )
+    # A LOW-correlation num-num pair that the chapter's top list would NEVER show:
+    # its presence proves the appendix lists ALL pairs, not just the top.
+    assert "income ↔ longitude" in md, (
+        "el apéndice no contiene los pares de baja correlación: no es la matriz "
+        "completa, solo el top-k del capítulo"
+    )
+
+
+def test_md_apendice_describe_con_skew_kurtosis(standard_run):
+    md = standard_run["md"]
+    assert "Estadísticos numéricos completos (describe)" in md, (
+        "falta el apéndice describe completo"
+    )
+    # The header of the appendix describe carries the skew and kurtosis columns
+    # (a substring unique to that header). Without them the describe is incomplete.
+    assert "| skew | kurtosis |" in md, (
+        "el describe del apéndice no trae las columnas skew/kurtosis"
+    )
+
+
+# --------------------------------------------------------------------------- #
+# 7) ALL 3 OUTPUTS NON-EMPTY: PDF with pages, PPTX with slides, MD with a minimum
+#    character count, and all three files on disk. Valid manifest.
+# --------------------------------------------------------------------------- #
+def test_tres_salidas_no_vacias(standard_run):
+    r = standard_run["r"]
+    assert r["pdf_path"] and os.path.exists(r["pdf_path"])
+    assert r["pptx_path"] and os.path.exists(r["pptx_path"])
+    assert r["aeda_md_path"] and os.path.exists(r["aeda_md_path"])
+    assert (r["n_pages"] or 0) > 0, "el PDF no tiene páginas"
+    assert (r["n_slides"] or 0) > 0, "el PPTX no tiene slides"
+    # The full report is large: a generous minimum guards against an empty or
+    # truncated MD without tying the test to an exact size.
+    assert (r["md_chars"] or 0) > 10000, f"MD demasiado corto: {r['md_chars']} chars"
+    assert r["manifest_path"] and os.path.exists(r["manifest_path"])
+
+
+def test_pdf_texto_extraible_con_contenido(standard_run):
+    """If pdftotext is available, the PDF must carry real text (not just
+    images): the cover names the dataset and its shape. If the tool is
+    missing, the test is skipped (that is not an EDA failure)."""
+    txt = standard_run["pdf_text"]
+    if txt is None:
+        pytest.skip("pdftotext no disponible")
+    assert len(txt) > 5000, "el PDF apenas tiene texto extraíble"
+    assert "Portada" in txt or "synthetic" in txt, (
+        "el texto del PDF no contiene la portada esperada"
+    )
+
+
+# --------------------------------------------------------------------------- #
+# DETERMINISM: two renders of the SAME dataset produce the SAME manifest
+# (same chapters and same n_pages/n_slides per chapter). generated_at differs
+# by timestamp, so the chapters dict is compared rather than the file.
+# --------------------------------------------------------------------------- #
+def test_render_es_determinista(synth_db, tmp_path):
+    out1 = str(tmp_path / "det1")
+    out2 = str(tmp_path / "det2")
+    r1 = render_automatic_eda(synth_db["db"], synth_db["table"],
+                              profile_level="standard", out_dir=out1, basename="d1")
+    r2 = render_automatic_eda(synth_db["db"], synth_db["table"],
+                              profile_level="standard", out_dir=out2, basename="d2")
+    assert r1["status"] == "ok" and r2["status"] == "ok"
+    c1 = json.load(open(r1["manifest_path"], encoding="utf-8")).get("chapters")
+    c2 = json.load(open(r2["manifest_path"], encoding="utf-8")).get("chapters")
+    assert c1 == c2, "el manifest no es determinista entre dos renders del mismo dataset"
+
+
+# --------------------------------------------------------------------------- #
+# SLOW (optional, skippable): `full` report with LLM narrative. Requires network /
+# credentials and is NOT deterministic, so it is off unless explicitly opted in
+# via the EDA_ACCEPT_LLM=1 environment variable. It is skipped with skipif (not a
+# custom marker) so it does not depend on mark registration in the repo config.
+# --------------------------------------------------------------------------- #
+@pytest.mark.skipif(
+    os.environ.get("EDA_ACCEPT_LLM") != "1",
+    reason="full+LLM es lento/no determinista; exporta EDA_ACCEPT_LLM=1 para correrlo",
+)
+def test_full_incluye_capitulo_analisis_llm(synth_db, tmp_path):
+    out = str(tmp_path / "full")
+    r = render_automatic_eda(synth_db["db"], synth_db["table"],
+                             profile_level="full", out_dir=out, basename="full")
+    assert r["status"] == "ok", r.get("error")
+    assert "analisis_llm" in _manifest_chapters(r), (
+        "el preset full debe incluir el capítulo de análisis LLM"
+    )
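
The determinism test above compares only the `chapters` entry of each manifest because `generated_at` changes on every run. A minimal sketch of that idea follows. It assumes a manifest shaped like the hypothetical dicts below (the real manifest schema is not shown in this diff), and `stable_view` is an illustrative helper, not a function from the repo.

```python
# Assumption: "generated_at" is the only key that legitimately varies per run.
VOLATILE_KEYS = {"generated_at"}


def stable_view(manifest: dict) -> dict:
    """Return a copy of the manifest without run-dependent keys."""
    return {k: v for k, v in manifest.items() if k not in VOLATILE_KEYS}


# Two hypothetical manifests from renders of the same dataset.
m1 = {
    "generated_at": "2026-07-02T19:00:13",
    "chapters": {"portada": {"n_pages": 1}, "univariado": {"n_pages": 4}},
}
m2 = {
    "generated_at": "2026-07-02T19:05:40",
    "chapters": {"portada": {"n_pages": 1}, "univariado": {"n_pages": 4}},
}

# The raw files differ (timestamp), the stable view does not.
same_raw = m1 == m2
same_stable = stable_view(m1) == stable_view(m2)
```

Comparing a filtered view rather than the raw file keeps the test strict about content while tolerating the one field that is expected to change.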
@@ -0,0 +1,133 @@
+---
+name: profile_bq_dataset
+kind: pipeline
+lang: py
+domain: pipelines
+purity: impure
+version: "1.1.0"
+signature: "def profile_bq_dataset(project_id: str, dataset: str, tables: list = None, include_views: bool = False, sample_frac: float = None, max_rows: int = 0, pseudonymize_cols: dict = None, report_dir: str = \"reports\", duckdb_path: str = \"\", keep_duckdb: bool = True, min_inclusion: float = 0.9, emit_pdf: bool = True, run_llm: bool = False) -> dict"
+description: "EDA one-shot de un dataset BigQuery ENTERO con descubrimiento cross-tabla: materializa CADA tabla del dataset (COMPLETA por defecto; muestreo opt-in con sample_frac; seudonimizacion PII por tabla, LOPDGDD/RGPD) en UN MISMO DuckDB compartido con load_bq_table_to_duckdb, lo perfila entero con profile_database (perfiles por tabla + relaciones FK inter-tabla por containment + join graph Mermaid + report markdown/JSON + PDF movil), y construye el diccionario de columnas del dataset con build_column_dictionary. Es el analogo BigQuery de profile_database a nivel de dataset, resuelto por composicion estricta (list_bq_dataset_tables -> load_bq_table_to_duckdb x N -> profile_database -> build_column_dictionary) sin duplicar descubrimiento, perfilado ni inferencia de relaciones. Es el hazme un EDA de este dataset BigQuery entero y descubre como se relacionan sus tablas, en una sola llamada."
+tags: [eda, bigquery, relations, launcher]
+uses_functions:
+  - list_bq_dataset_tables_py_datascience
+  - load_bq_table_to_duckdb_py_datascience
+  - profile_database_py_pipelines
+  - build_column_dictionary_py_datascience
+uses_types: []
+returns: []
+returns_optional: false
+error_type: error_go_core
+imports: []
+tested: false
+tests: []
+test_file_path: ""
+file_path: "python/functions/pipelines/profile_bq_dataset.py"
+params:
+  - name: project_id
+    desc: "Proyecto GCP (facturacion + primer segmento del FQN). Ej: 'autingo-159109'."
+  - name: dataset
+    desc: "Dataset BigQuery a perfilar entero. Ej: 'customer_marts'."
+  - name: tables
+    desc: "Lista de NOMBRES de tabla del dataset. None (DEFAULT) = todas las del dataset (filtradas por include_views)."
+  - name: include_views
+    desc: "Si True incluye las VIEW ademas de las BASE TABLE cuando tables=None. Default False (solo BASE TABLE, coherente con profile_database que salta las VIEWs)."
+  - name: sample_frac
+    desc: "None (DEFAULT) = FULL, perfila TODAS las filas de cada tabla. Un float en (0,1) activa el muestreo opt-in por tabla (`WHERE rand() < frac`)."
+  - name: max_rows
+    desc: "Tope duro de filas por tabla (LIMIT). 0 (DEFAULT) = sin tope. Se aplica a cada tabla materializada."
+  - name: pseudonymize_cols
+    desc: "Dict {\"tabla\": [\"col1\", \"col2\"]} de columnas PII a seudonimizar (hash) por tabla ANTES de materializar (LOPDGDD/RGPD [POL-MMNSEG-001-1.0]). Preserva nulos y cardinalidad."
+  - name: report_dir
+    desc: "Directorio de salida de los reports + del DuckDB compartido por defecto. Default 'reports' (artefacto local gitignored). Se crea si no existe."
+  - name: duckdb_path
+    desc: "Ruta del DuckDB compartido donde se materializan todas las tablas. Vacio (DEFAULT) = report_dir/eda_bq_<dataset>.duckdb (se limpia si ya existia)."
+  - name: keep_duckdb
+    desc: "Si True (DEFAULT) conserva el DuckDB materializado: es el artefacto explorable post-EDA (notebook Jupyter, joins ad-hoc). Con False se borra al terminar."
+  - name: min_inclusion
+    desc: "Umbral de inclusion (0-1) para emitir una FK candidata cross-tabla (se pasa a profile_database -> infer_fk_containment_duckdb). Default 0.9."
+  - name: emit_pdf
+    desc: "Si True (DEFAULT) emite el PDF movil DB-level (resumen de tablas + relaciones FK + join graph). Se pasa a profile_database."
+  - name: run_llm
+    desc: "Si True (default False) activa la capa LLM interpretativa por tabla (se pasa a profile_database -> profile_table): una llamada LLM por tabla sobre el perfil agregado, nunca filas crudas."
+output: "dict (dict-no-throw). On success {status:'ok', project_id, dataset, n_tables_loaded, n_tables_profiled, tables_skipped:[...], errors:[...], duckdb_path, db_profile:<DatabaseProfile with tables[summary], table_profiles[full], fk_candidates, join_graph{nodes,edges,mermaid,hubs}>, column_dictionary:{entries,pii_columns} (without markdown), report_md_path, report_json_path, report_pdf_path, dict_md_path, dict_json_path}. On error {status:'error', error, stage}; when stage is 'load' the dict also carries errors:[...]."
+---
+
+## Ejemplo
+
+```python
+from pipelines.profile_bq_dataset import profile_bq_dataset
+
+# FULL by default: EDA of the ENTIRE dataset (all tables, all rows), with
+# cross-table FKs + join graph + column dictionary. The shared DuckDB stays at
+# reports/eda_bq_customer_marts.duckdb for further exploration.
+r = profile_bq_dataset("autingo-159109", "customer_marts")
+print(r["n_tables_loaded"], "tablas materializadas,", r["n_tables_profiled"], "perfiladas")
+print("FKs:", [f"{fk['from_table']}.{fk['from_col']}->{fk['to_table']}.{fk['to_col']}"
+               for fk in r["db_profile"]["fk_candidates"]])
+print(r["report_md_path"]); print(r["report_pdf_path"]); print(r["dict_md_path"])
+print("DuckDB explorable:", r["duckdb_path"])
+
+# Dataset with huge tables: opt-in sampling + per-table PII pseudonymization.
+r = profile_bq_dataset(
+    "autingo-159109",
+    "customer_marts",
+    sample_frac=0.05,
+    pseudonymize_cols={
+        "customer_profile": ["document_number", "full_name", "email", "phone"],
+    },
+)
+```
+
+## Cuando usarla
+
+Use it when asked for an EDA of an ENTIRE BigQuery dataset rather than a single
+table: you want the profile of every table PLUS the relational schema (which
+table references which, and with what cardinality), discovered cross-table in a
+single call. It is the dataset-level step above `profile_bq_table` (one table)
+and the BigQuery adapter for `profile_database` (one DuckDB database). Use it
+when you receive an unknown BigQuery dataset, to document a data mart, to
+discover the star schema (the hub tables of the join graph), or before writing
+joins when no declared model exists. For datasets with huge tables, pass
+`sample_frac` or `max_rows` and state that limit in the report.
+
+## Gotchas
+
+- Impure: requires BigQuery ADC (Application Default Credentials) to be
+  configured. If the user's ADC carries a foreign quota project,
+  `load_bq_table_to_duckdb` / `list_bq_dataset_tables` apply
+  `creds.with_quota_project(None)` to avoid the 403 USER_PROJECT_DENIED; see the
+  gotchas of those functions for details.
+- Cost of pulling an ENTIRE dataset: FULL by default materializes ALL rows of
+  EVERY table into RAM (via the BigQuery Storage Read API/Arrow) before writing
+  them to the shared DuckDB. A dataset with several multi-million-row tables can
+  be expensive in time, BigQuery bytes scanned, and GBs of RAM/disk. Bound it
+  with `sample_frac` in (0,1) (opt-in per-table sampling) or `max_rows` (hard
+  per-table cap). If resource limits prevent loading the total, state that
+  explicitly in the report, along with the maximum that was actually loaded.
+- The shared DuckDB can take up GBs: all tables of the dataset live in ONE file
+  (required for FK inference to work cross-table). With `keep_duckdb=True`
+  (default) it stays on disk as an explorable artifact; pass `keep_duckdb=False`
+  to delete it when the run finishes. An explicit `duckdb_path` is used as given
+  and is not wiped at start; only the default path is cleared at start so that
+  tables from previous runs do not mix.
+- FK by containment is a HEURISTIC (false positives/negatives are possible), and
+  `profile_database` SKIPS FK pairs that point to tables with more than 200k rows
+  (the expensive side of the INTERSECT): those relationships are left
+  unevaluated. Treat the result as a starting map of the schema, not as
+  authoritative DDL.
+- Views are excluded by default (`include_views=False`, consistent with
+  `profile_database`, which skips VIEWs; profiling them inflates n_tables and
+  multiplies false FKs). Pass `include_views=True` only if you need to profile
+  them as if they were materialized tables.
+- Pseudonymize PII with `pseudonymize_cols` (a per-table dict) to comply with
+  LOPDGDD/GDPR [POL-MMNSEG-001-1.0] BEFORE writing to disk: names, DNI/NIE,
+  email, phone, address, customer IDs, IBAN, etc. Without pseudonymization, the
+  shared DuckDB and the reports contain real personal data.
+- Tolerates per-table failures: if a load fails, it is recorded in `errors[]` and
+  `tables_skipped[]` and the pipeline continues with the rest; `n_tables_profiled`
+  counts only the tables profiled successfully. Check `errors` to see what was
+  left out.
+- Writes to `report_dir` (default 'reports', a gitignored local artifact): the
+  DB-level report from `profile_database` (markdown + JSON + PDF if emit_pdf)
+  PLUS the column dictionary (`..._dict.md` + `..._dict.json`).
+
+## Capability growth log
+
+- v1.1.0 (2026-07-02): adds `run_llm` (passthrough to profile_database -> profile_table: one LLM call per table over the aggregated profile). Default False, no breaking changes.
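
The `min_inclusion` threshold and the containment heuristic mentioned in the gotchas can be illustrated with plain Python sets. This is a sketch of the general idea only. It assumes the inclusion ratio is the share of distinct non-null child values that also appear in the parent column, which may differ from what `infer_fk_containment_duckdb` actually computes. `inclusion_ratio` and the sample columns are hypothetical.

```python
def inclusion_ratio(child_values, parent_values) -> float:
    """Share of distinct non-null child values that also exist in the parent."""
    child = {v for v in child_values if v is not None}
    if not child:
        return 0.0  # an all-null column cannot support an FK candidate
    parent = {v for v in parent_values if v is not None}
    return len(child & parent) / len(child)


# Hypothetical columns: orders.customer_id -> customers.id
orders_customer_id = [1, 2, 2, 3, 99, None]  # 99 has no matching customer
customers_id = [1, 2, 3, 4, 5]

ratio = inclusion_ratio(orders_customer_id, customers_id)  # 3 of 4 distinct values
min_inclusion = 0.9                                        # pipeline default
is_candidate = ratio >= min_inclusion
```

With one orphan value among four distinct keys, the ratio falls below the default threshold and the pair would not be reported, which is the kind of false negative the heuristic warning refers to.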
@@ -0,0 +1,263 @@
+"""profile_bq_dataset: one-shot EDA of an ENTIRE BigQuery dataset (cross-table).
+
+Impure pipeline: profiles a COMPLETE Google BigQuery dataset with cross-table
+discovery (inter-table FK relationships + join graph + column dictionary). It is
+the dataset-level analog of `profile_bq_table` (which profiles ONE table) and the
+BigQuery adapter for `profile_database` (which profiles an entire DuckDB
+database), built by strict composition of registry functions, without
+reimplementing discovery, profiling, or relationship inference.
+
+Key design point: it materializes EVERY table of the dataset into ONE shared
+DuckDB file, so that FK inference by containment (which needs all tables in the
+same database) works cross-table. The shared DuckDB remains as the explorable
+post-EDA artifact (keep_duckdb=True by default).
+
+Registry functions composed (their logic is NOT reimplemented):
+  - list_bq_dataset_tables : catalog of the dataset's tables/views (discovery).
+  - load_bq_table_to_duckdb: materializes each BigQuery table into the shared
+                             DuckDB (complete by default, sampled if sample_frac;
+                             PII pseudonymized per table before writing to disk).
+  - profile_database       : profiles the entire shared DuckDB (per-table
+                             profiles + FK by containment + Mermaid join graph +
+                             markdown/JSON report + DB-level mobile PDF).
+  - build_column_dictionary: searchable column dictionary (semantic type, PII)
+                             built from the assembled DatabaseProfile.
+
+Default mode = FULL: `sample_frac=None` brings each table in whole (the user's
+standard preference: EDAs run over the total unless requested otherwise).
+Sampling is an explicit opt-in per dataset: `sample_frac=0.05` brings ~5% of
+each table; `max_rows` is a hard per-table cap (0 = no cap).
+
+LOPDGDD/GDPR pseudonymization [POL-MMNSEG-001-1.0]: `pseudonymize_cols` is a dict
+`{"table": ["col1", "col2"]}` applied per table BEFORE materializing.
+
+dict-no-throw style of the `eda` group: never raises; it catches any error and
+returns {status:'error', error, stage}. Failures on individual tables are
+tolerated: they are recorded in errors[]/tables_skipped[] and the run continues
+with the rest.
+"""
+
+import json
+import os
+from datetime import datetime, timezone
+
+from datascience import (
+    build_column_dictionary,
+    list_bq_dataset_tables,
+    load_bq_table_to_duckdb,
+)
+from pipelines.profile_database import profile_database
+
+
+def profile_bq_dataset(
+    project_id: str,
+    dataset: str,
+    tables: list = None,
+    include_views: bool = False,
+    sample_frac: float = None,
+    max_rows: int = 0,
+    pseudonymize_cols: dict = None,
+    report_dir: str = "reports",
+    duckdb_path: str = "",
+    keep_duckdb: bool = True,
+    min_inclusion: float = 0.9,
+    emit_pdf: bool = True,
+    run_llm: bool = False,
+) -> dict:
+    """One-shot EDA of an entire BigQuery dataset, with cross-table discovery.
+
+    Materializes each table of the dataset into a shared DuckDB, profiles the
+    whole database with `profile_database` (per-table profiles + inter-table FKs
+    + join graph + reports + PDF), and builds the dataset's column dictionary. By
+    default it profiles ALL rows of each table (`sample_frac=None`, FULL mode)
+    and only BASE TABLEs (views are excluded unless `include_views=True`).
+
+    Args:
+        project_id: GCP project (billing + first segment of the FQN).
+        dataset: BigQuery dataset to profile.
+        tables: list of table NAMES in the dataset. None (default) = all tables
+            in the dataset (filtered by include_views).
+        include_views: if True, includes VIEWs in addition to BASE TABLEs when
+            tables=None. Default False (BASE TABLEs only, consistent with
+            profile_database, which skips VIEWs).
+        sample_frac: None (default) = FULL, profiles all rows of each table.
+            A float in (0,1) enables opt-in per-table sampling.
+        max_rows: hard per-table row cap (LIMIT). 0 (default) = no cap.
+        pseudonymize_cols: dict {"table": ["col1", "col2"]} of PII columns to
+            pseudonymize (hash) per table BEFORE materializing (LOPDGDD/GDPR).
+        report_dir: output directory for the reports + the default DuckDB.
+        duckdb_path: path of the shared DuckDB. Empty = report_dir/eda_bq_<dataset>.duckdb.
+        keep_duckdb: if True (default), keeps the materialized DuckDB (the
+            explorable post-EDA artifact). With False it is deleted at the end.
+        min_inclusion: inclusion threshold (0-1) for emitting a candidate FK
+            (passed to profile_database -> infer_fk_containment_duckdb).
+        emit_pdf: if True (default), emits the DB-level mobile PDF (passed to
+            profile_database).
+        run_llm: if True (default False), enables the per-table interpretive LLM
+            layer (passed to profile_database -> profile_table; one LLM call
+            per table over the aggregated profile, never raw rows).
+
+    Returns:
+        dict in dict-no-throw style with the pipeline result (see output in the .md).
+    """
+    try:
+        # 1) Resolve the list of tables to materialize as (name, fqn).
+        if tables is None:
+            lst = list_bq_dataset_tables(
+                project_id, dataset, include_views=include_views
+            )
+            if lst.get("status") != "ok":
+                return {
+                    "status": "error",
+                    "error": lst.get("error", "list_bq_dataset_tables fallo"),
+                    "stage": "list_tables",
+                }
+            table_specs = [
+                (t["table"], t.get("fqn") or f"{project_id}.{dataset}.{t['table']}")
+                for t in lst.get("tables", [])
+            ]
+        elif isinstance(tables, list):
+            table_specs = [
+                (name, f"{project_id}.{dataset}.{name}") for name in tables
+            ]
+        else:
+            return {
+                "status": "error",
+                "error": "tables debe ser una lista de nombres o None",
+                "stage": "list_tables",
+            }
+
+        if not table_specs:
+            return {
+                "status": "error",
+                "error": (
+                    "el dataset no tiene tablas que perfilar "
+                    "(revisa include_views / el nombre del dataset)"
+                ),
+                "stage": "list_tables",
+            }
+
+        # 2) Resolve the shared DuckDB. Empty -> report_dir/eda_bq_<dataset>.duckdb.
+        os.makedirs(report_dir, exist_ok=True)
+        created_default = False
+        if not duckdb_path:
+            duckdb_path = os.path.join(report_dir, f"eda_bq_{dataset}.duckdb")
+            created_default = True
+        # If this is the default path and it already exists from a previous run,
+        # start clean so tables from earlier runs of this dataset do not mix in.
+        if created_default and os.path.exists(duckdb_path):
+            try:
+                os.remove(duckdb_path)
+            except OSError:
+                pass
+
+        # 3) Materialize EVERY table into the SAME DuckDB (tolerating failures).
+        pseudo_map = pseudonymize_cols or {}
+        loaded_tables = []       # table names inside the DuckDB
+        tables_skipped = []
+        errors = []
+        for name, fqn in table_specs:
+            load = load_bq_table_to_duckdb(
+                fqn,
+                duckdb_path,
+                sample_frac=sample_frac,
+                max_rows=max_rows,
+                project_id=project_id,
+                pseudonymize_cols=pseudo_map.get(name),
+            )
+            if load.get("status") == "ok":
+                loaded_tables.append(load["table"])
+            else:
+                errors.append(
+                    {
+                        "table": name,
+                        "stage": "load",
+                        "error": load.get("error", "load fallo"),
+                    }
+                )
+                tables_skipped.append(name)
+
+        if not loaded_tables:
+            return {
+                "status": "error",
+                "error": "ninguna tabla del dataset se pudo materializar",
+                "stage": "load",
+                "errors": errors,
+            }
+
+        # 4) Profile the entire shared DuckDB: per-table profiles + cross-table
+        # FKs + join graph + markdown/JSON report + PDF (if emit_pdf).
+        prof = profile_database(
+            duckdb_path,
+            tables=loaded_tables,
+            report_dir=report_dir,
+            write_report=True,
+            min_inclusion=min_inclusion,
+            emit_pdf=emit_pdf,
+            run_llm=run_llm,
+        )
+        if prof.get("status") != "ok":
+            return {
+                "status": "error",
+                "error": prof.get("error", "profile_database fallo"),
+                "stage": "profile",
+            }
+        db_profile = prof.get("db_profile", {})
+
+        # 5) Column dictionary of the dataset, built from the DatabaseProfile.
+        dict_md_path = None
+        dict_json_path = None
+        column_dictionary = None
+        dict_res = build_column_dictionary(db_profile)
+        if dict_res.get("status") == "ok":
+            ts = datetime.now(timezone.utc).strftime("%Y%m%d-%H%M%S")
+            dict_md_path = os.path.join(
+                report_dir, f"eda_bq_dataset_{dataset}_{ts}_dict.md"
+            )
+            dict_json_path = os.path.join(
+                report_dir, f"eda_bq_dataset_{dataset}_{ts}_dict.json"
+            )
+            with open(dict_md_path, "w", encoding="utf-8") as fh:
+                fh.write(dict_res.get("markdown", ""))
+            # JSON sidecar of the dictionary without the markdown (compact).
+            dict_payload = {k: v for k, v in dict_res.items() if k != "markdown"}
+            with open(dict_json_path, "w", encoding="utf-8") as fh:
+                fh.write(
+                    json.dumps(dict_payload, ensure_ascii=False, indent=1, default=str)
+                )
+            column_dictionary = dict_payload
+        else:
+            errors.append(
+                {
+                    "stage": "column_dictionary",
+                    "error": dict_res.get("error", "build_column_dictionary fallo"),
+                }
+            )
+
+        # 6) Clean up the shared DuckDB unless asked to keep it.
+        final_duckdb = duckdb_path
+        if not keep_duckdb and os.path.exists(duckdb_path):
+            try:
+                os.remove(duckdb_path)
+                final_duckdb = None
+            except OSError:
+                # Removal failed: the file is still on disk, so keep reporting it.
+                pass
+
+        return {
+            "status": "ok",
+            "project_id": project_id,
+            "dataset": dataset,
+            "n_tables_loaded": len(loaded_tables),
+            "n_tables_profiled": db_profile.get("n_tables", 0),
+            "tables_skipped": tables_skipped,
+            "errors": errors,
+            "duckdb_path": final_duckdb,
+            "db_profile": db_profile,
+            "column_dictionary": column_dictionary,
+            "report_md_path": prof.get("report_md_path"),
+            "report_json_path": prof.get("report_json_path"),
+            "report_pdf_path": prof.get("report_pdf_path"),
+            "dict_md_path": dict_md_path,
+            "dict_json_path": dict_json_path,
+        }
+    except Exception as e:  # noqa: BLE001
+        return {"status": "error", "error": str(e), "stage": "unexpected"}
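
Because the pipeline follows the dict-no-throw style, callers branch on `status` instead of catching exceptions. The sketch below shows one way a caller might do that. `summarize_result` is a hypothetical helper and both result dicts are hand-written to mirror the documented keys; neither comes from a real run.

```python
def summarize_result(r: dict) -> str:
    """Turn a dict-no-throw result into a one-line, human-readable summary."""
    if r.get("status") != "ok":
        # Documented error shape: {status:'error', error, stage}.
        return f"failed at stage '{r.get('stage')}': {r.get('error')}"
    line = f"{r['n_tables_profiled']}/{r['n_tables_loaded']} loaded tables profiled"
    skipped = r.get("tables_skipped") or []
    if skipped:
        line += f"; skipped: {', '.join(skipped)}"
    return line


# Hypothetical results for illustration only.
ok_result = {
    "status": "ok",
    "n_tables_loaded": 3,
    "n_tables_profiled": 3,
    "tables_skipped": ["big_events"],
    "errors": [{"table": "big_events", "stage": "load", "error": "timeout"}],
}
error_result = {"status": "error", "error": "dataset not found", "stage": "list_tables"}

ok_summary = summarize_result(ok_result)
error_summary = summarize_result(error_result)
```

Note that a run can report `status: 'ok'` and still have entries in `errors` and `tables_skipped`, so a caller that cares about completeness should inspect both.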
@@ -0,0 +1,128 @@
+---
+name: profile_bq_table
+kind: pipeline
+lang: py
+domain: pipelines
+purity: impure
+version: "1.2.0"
+signature: "def profile_bq_table(table_fqn: str, sample_frac: float = None, max_rows: int = 0, pseudonymize_cols: list = None, run_models: bool = True, run_series: bool = False, run_llm: bool = False, project_id: str = \"\", report_dir: str = \"reports\", duckdb_path: str = \"\", keep_duckdb: bool = False, where_sql: str = \"\", select_sql: str = \"\") -> dict"
+description: "EDA one-shot de una tabla o vista de BigQuery: materializa el origen COMPLETO por defecto (todas las filas; muestreo opt-in con sample_frac; seudonimizacion PII opcional, LOPDGDD/RGPD) a un DuckDB local con load_bq_table_to_duckdb y lo perfila end-to-end con profile_table del grupo de capacidad eda, emitiendo el informe AutomaticEDA (PDF A5 movil + PPTX 16:9), Markdown y JSON sidecar. Es el adaptador BigQuery que faltaba en el grupo eda, resuelto por composicion (BigQuery -> DuckDB local -> profile_table) sin duplicar la logica de perfilado ni de render. Es el hazme un EDA de esta tabla BigQuery en una sola llamada, sobre el total de filas por defecto."
+tags: [eda, bigquery, launcher]
+uses_functions:
+  - load_bq_table_to_duckdb_py_datascience
+  - profile_table_py_pipelines
+uses_types: []
+returns: []
+returns_optional: false
+error_type: error_go_core
+imports: []
+tested: false
+tests: []
+test_file_path: ""
+file_path: "python/functions/pipelines/profile_bq_table.py"
+params:
+  - name: table_fqn
+    desc: "FQN de la tabla/vista BigQuery: `project.dataset.table`."
+  - name: sample_frac
+    desc: "None (DEFAULT) = FULL, perfila TODAS las filas del origen. Un float en (0,1) activa el muestreo opt-in (`WHERE rand() < frac`, ~frac del total)."
+  - name: max_rows
+    desc: "Tope duro opcional de filas (LIMIT). 0 (DEFAULT) = sin tope. Se combina con sample_frac si ambos se pasan."
+  - name: pseudonymize_cols
+    desc: "Columnas PII a seudonimizar (hash) antes de materializar (LOPDGDD/RGPD). Preserva nulos y cardinalidad."
+  - name: run_models
+    desc: "PCA/KMeans/IsolationForest/normalidad sobre numericas. Default True (informe AutomaticEDA completo)."
+  - name: run_series
+    desc: "Analisis de serie temporal por columna numerica. Default False."
+  - name: run_llm
+    desc: "1 llamada LLM sobre el perfil agregado (nunca filas crudas). Default False."
+  - name: project_id
+    desc: "Proyecto GCP de facturacion. Vacio = primer segmento del FQN."
+  - name: report_dir
+    desc: "Directorio de salida de los reports. Default 'reports' (artefacto local gitignored)."
+  - name: duckdb_path
+    desc: "Ruta DuckDB a usar. Vacio = temporal autogestionado."
+  - name: keep_duckdb
+    desc: "Si True conserva el DuckDB materializado (para el notebook Jupyter). Default False."
+  - name: where_sql
+    desc: "Clausula WHERE SQL (sin la palabra WHERE) aplicada al origen y a su COUNT. Pass-through a load_bq_table_to_duckdb; se combina con sample_frac via AND. Ej: `fecha <= CURRENT_DATE() AND venta_n IS NOT NULL`. Se interpola tal cual: no usar con input no confiable."
+  - name: select_sql
+    desc: "Expresiones del SELECT (sin la palabra SELECT); vacio (DEFAULT) = `*`. Pass-through a load_bq_table_to_duckdb. Util para castear tipos problematicos (p. ej. BIGNUMERIC->FLOAT64) antes de perfilar. Se interpola tal cual: no usar con input no confiable."
+output: "dict dict-no-throw. En exito {status:'ok', table_fqn, load:{n_rows_source,n_rows_fetched,sampled,sample_frac,pseudonymized,table,streamed, where_sql?, select_sql?}, duckdb_path, report_md_path, report_json_path, aeda_pdf_path, aeda_pptx_path, aeda_manifest_path, profile}. En error {status:'error', error, stage}. where_sql/select_sql aparecen en load solo si vienen informados."
+---
+
+## Ejemplo
+
+```python
+from pipelines.profile_bq_table import profile_bq_table
+
+# FULL by default: EDA over ALL rows of the view (3.8M).
+r = profile_bq_table(
+    "autingo-159109.customer_marts.customer_profile",
+    pseudonymize_cols=["document_number", "full_name", "email", "phone", "postal_code", "salesforce_customer_id"],
+    run_models=True,
+)
+print(r["load"]["n_rows_fetched"], "filas perfiladas, sampled=", r["load"]["sampled"])
+print(r["aeda_pdf_path"]); print(r["aeda_pptx_path"]); print(r["report_md_path"])
+
+# Opt-in sampling: EDA over ~5% of the rows (huge table / fast iteration).
+r = profile_bq_table(
+    "autingo-159109.customer_marts.customer_profile",
+    sample_frac=0.05,
+    pseudonymize_cols=["document_number", "full_name", "email", "phone", "postal_code", "salesforce_customer_id"],
+)
+
+# Filter the source + cast a BIGNUMERIC column before profiling. where_sql and
+# select_sql are passed to the loader (which ingests in Arrow batches, bounded RAM).
+r = profile_bq_table(
+    "autingo-159109.data.ventas_39M",
+    where_sql="fecha <= CURRENT_DATE() AND venta_n IS NOT NULL",
+    select_sql="fecha, idCentro, CAST(importe_bignumeric AS FLOAT64) AS importe",
+    run_models=True,
+)
+print(r["load"]["n_rows_fetched"], "filas, streamed=", r["load"].get("streamed"))
+```
+
+## Cuando usarla
+
+Use it when asked for an EDA of a BigQuery table or view ("give me an EDA of
+this BigQuery table"). It is the BigQuery adapter of the `eda` capability group,
+built by composition: it brings the COMPLETE source (all rows, by default) into
+a local DuckDB and delegates all profiling and rendering to `profile_table`,
+with no native BigQuery adapter and no duplicated EDA logic. Use it as a first
+step when you receive an unknown BigQuery table or view, before modeling or
+cleaning, or to audit the quality of a view already in production. For fast
+iteration or tables that do not fit in RAM, pass `sample_frac` (opt-in sampling).
+
+## Gotchas
+
+- Impure: requires BigQuery ADC (Application Default Credentials) to be
+  configured so that `load_bq_table_to_duckdb` can authenticate against the
+  project.
+- FULL by default: `sample_frac=None` profiles ALL rows of the source. The
+  loader `load_bq_table_to_duckdb` (v1.2.0) ingests in Arrow batches into DuckDB
+  when `pyarrow` is available, with RAM bounded to the size of one batch (a table
+  with tens of millions of rows fits without being loaded whole); otherwise it
+  falls back to the full-DataFrame path (everything in RAM, possibly several
+  GB). To bound cost/memory pass `sample_frac` in (0,1), `max_rows` (hard cap),
+  or `where_sql` (filters the source). If resource limits prevent loading the
+  total, state that explicitly, along with the maximum that was actually loaded.
+- `where_sql` / `select_sql` (pass-through to the loader) are interpolated AS IS
+  into the BigQuery query: do NOT build them from untrusted input (SQL
+  injection). `select_sql` is the way to cast BIGNUMERIC columns (Arrow
+  decimal256, which DuckDB does not ingest) to FLOAT64 before profiling; if you
+  do not cast them, the loader returns
+  `{status:'error', stage:'stream_schema'|'stream_insert'}`.
+- Pseudonymize PII with `pseudonymize_cols` to comply with LOPDGDD/GDPR BEFORE
+  writing to disk: names, DNI/NIE, email, phone, address, customer IDs, etc.
+  Values are hashed, preserving nulls and cardinality. Without pseudonymization,
+  the materialized data (DuckDB + reports) contains real personal data
+  [POL-MMNSEG-001-1.0].
+- The temporary DuckDB is deleted when the run finishes unless
+  `keep_duckdb=True` (pass it to keep exploring the materialized data from a
+  Jupyter notebook). If you pass an explicit `duckdb_path`, that path is used
+  and the file is kept only with `keep_duckdb=True`.
+- Writes reports to `report_dir` (default 'reports', a gitignored local
+  artifact): Markdown + JSON sidecar + mobile A5 PDF + 16:9 PPTX of the
+  AutomaticEDA report.
+- `run_llm=True` spends tokens (haiku) but sends only the aggregated profile,
+  never raw rows or personal data.
+
+## Capability growth log
+
+- v1.2.0 (2026-07-02): adds `where_sql` and `select_sql` as pass-through to the loader `load_bq_table_to_duckdb`: they filter/project the source before profiling (`where_sql` also bounds the source COUNT; `select_sql` allows casting BIGNUMERIC->FLOAT64). Both are reflected in the `load` block of the return value (only when provided), together with the new `streamed` key. Inherits from loader v1.2.0 the streaming Arrow -> DuckDB ingestion in batches (bounded RAM) for tables that do not fit in RAM, with a full-DataFrame fallback.
+- v1.1.0 (2026-07-01): FULL becomes the pipeline DEFAULT: `max_rows=300000, sample=True` is replaced by `sample_frac=None` (None = profile all rows) + `max_rows=0` (optional hard cap). Sampling is an explicit opt-in (`sample_frac`). This aligns with the user's standard preference: EDAs run over the total unless requested otherwise. Inherits the accelerated fetch (Arrow/bqstorage) from `load_bq_table_to_duckdb` v1.1.0.
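
The parameter descriptions above say that `select_sql` defaults to `*`, that `where_sql` combines with `sample_frac` via AND, and that `max_rows` acts as a LIMIT. The sketch below shows how those pieces might compose into one source query. `build_source_query` is a hypothetical helper written for illustration; the loader's real query construction is not shown in this diff and may differ. Like the real parameters, it interpolates its inputs as is, so it must not be fed untrusted input.

```python
def build_source_query(table_fqn, select_sql="", where_sql="",
                       sample_frac=None, max_rows=0) -> str:
    """Compose a BigQuery SELECT from the pass-through parameters (illustrative)."""
    select = select_sql.strip() or "*"          # empty select_sql means all columns
    conditions = []
    if where_sql.strip():
        # Parenthesize so an OR inside where_sql cannot escape the AND below.
        conditions.append(f"({where_sql.strip()})")
    if sample_frac is not None:
        if not 0 < sample_frac < 1:
            raise ValueError("sample_frac must be in (0, 1)")
        conditions.append(f"RAND() < {sample_frac}")
    query = f"SELECT {select} FROM `{table_fqn}`"
    if conditions:
        query += " WHERE " + " AND ".join(conditions)
    if max_rows > 0:
        query += f" LIMIT {int(max_rows)}"      # 0 means no hard cap
    return query


# Hypothetical table name and columns.
full_query = build_source_query("my-project.my_dataset.sales")
filtered_query = build_source_query(
    "my-project.my_dataset.sales",
    select_sql="fecha, CAST(importe AS FLOAT64) AS importe",
    where_sql="fecha <= CURRENT_DATE()",
    sample_frac=0.05,
    max_rows=1000,
)
```

Wrapping `where_sql` in parentheses is a design choice of this sketch: it keeps the sampling predicate applied to every row the filter admits, even when the filter contains an `OR`.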
@@ -0,0 +1,159 @@
+"""profile_bq_table: one-shot EDA of a BigQuery table/view with the `eda` group.
+
+Impure pipeline: materializes a BigQuery table or view (COMPLETE by default, i.e.
+all rows, or a sample if `sample_frac` is passed; with optional PII
+pseudonymization, LOPDGDD/GDPR) into a local DuckDB with
+`load_bq_table_to_duckdb`, and profiles it end-to-end with `profile_table` from
+the `eda` capability group, emitting the AutomaticEDA report (mobile A5 PDF +
+16:9 PPTX), Markdown, and a JSON sidecar. It is the BigQuery adapter that was
+missing from the `eda` group, built by composition
+(BigQuery -> local DuckDB -> profile_table) without duplicating the profiling or
+rendering logic.
+
+Default mode = FULL: `sample_frac=None` profiles ALL rows of the source (the
+user's standard preference: EDAs run over the total unless requested
+otherwise). Sampling is an explicit opt-in: `sample_frac=0.05` profiles ~5% of
+the rows; `max_rows` is an optional hard cap (0 = no cap).
+
+Registry functions composed (their logic is NOT reimplemented):
+  - load_bq_table_to_duckdb : brings the BigQuery table/view into a local DuckDB
+                              (complete by default, or sampled if sample_frac).
+  - profile_table           : one-shot orchestrator of the `eda` group that
+                              profiles the materialized DuckDB and emits the
+                              AutomaticEDA report.
+
+dict-no-throw style of the `eda` group: never raises; returns {status:'error', ...}.
+"""
+
+import os
+import tempfile
+
+from datascience import load_bq_table_to_duckdb
+from pipelines.profile_table import profile_table
+
+
+def profile_bq_table(
+    table_fqn: str,
+    sample_frac: float = None,
+    max_rows: int = 0,
+    pseudonymize_cols: list = None,
+    run_models: bool = True,
+    run_series: bool = False,
+    run_llm: bool = False,
+    project_id: str = "",
+    report_dir: str = "reports",
+    duckdb_path: str = "",
+    keep_duckdb: bool = False,
+    where_sql: str = "",
+    select_sql: str = "",
+) -> dict:
+    """EDA one-shot de una tabla/vista BigQuery.
+
+    By default it profiles ALL source rows (`sample_frac=None`, FULL mode).
+    Materializes the source (with optional PII pseudonymization) into a local
+    DuckDB and profiles it with `profile_table` from the `eda` group, emitting the
+    AutomaticEDA report (mobile A5 PDF + 16:9 PPTX) + Markdown + JSON sidecar.
+
+    Args:
+        table_fqn: FQN of the BigQuery table/view ("project.dataset.table").
+        sample_frac: None (default) = FULL, profiles every row. A float in
+            (0,1) enables opt-in sampling (`WHERE rand() < frac`, ~frac of the total).
+        max_rows: Optional hard row cap (LIMIT). 0 (default) = no cap.
+        pseudonymize_cols: PII columns to pseudonymize (hash) before materializing.
+        run_models: Cheap models (PCA/KMeans/IsolationForest/normality).
+        run_series: Time-series analysis per numeric column.
+        run_llm: 1 LLM call over the aggregated profile (never raw rows).
+        project_id: GCP billing project. Empty = first segment of the FQN.
+        report_dir: Output directory for the reports.
+        duckdb_path: DuckDB path to use. Empty = self-managed temporary file.
+        keep_duckdb: If True, keeps the materialized DuckDB.
+        where_sql: SQL WHERE clause (without the WHERE keyword) applied to the
+            source and to its COUNT. Passed through to `load_bq_table_to_duckdb`.
+            E.g. "fecha <= CURRENT_DATE() AND venta_n IS NOT NULL". It is
+            interpolated verbatim: do not use with untrusted input.
+        select_sql: SELECT expressions (without the SELECT keyword); empty = `*`.
+            Passed through to `load_bq_table_to_duckdb`. Useful for casting
+            problematic types (e.g. BIGNUMERIC->FLOAT64) before profiling.
+
+    Returns:
+        dict (dict-no-throw) with the pipeline result (see output in the .md).
+    """
+    tmp_created = False
+    try:
+        # Temporary DuckDB if no path is given.
+        if not duckdb_path:
+            fd, duckdb_path = tempfile.mkstemp(prefix="eda_bq_", suffix=".duckdb")
+            os.close(fd)
+            os.remove(duckdb_path)  # let DuckDB create the file from scratch
+            tmp_created = True
+
+        load = load_bq_table_to_duckdb(
+            table_fqn,
+            duckdb_path,
+            sample_frac=sample_frac,
+            max_rows=max_rows,
+            project_id=project_id,
+            pseudonymize_cols=pseudonymize_cols,
+            where_sql=where_sql,
+            select_sql=select_sql,
+        )
+        if load.get("status") != "ok":
+            return {
+                "status": "error",
+                "error": load.get("error", "load fallo"),
+                "stage": "load",
+            }
+
+        prof = profile_table(
+            duckdb_path,
+            load["table"],
+            backend="duckdb",
+            run_models=run_models,
+            run_series=run_series,
+            run_llm=run_llm,
+            emit_automatic=True,   # mobile A5 PDF + 16:9 PPTX
+            emit_pdf=False,
+            write_report=True,     # Markdown + JSON sidecar
+            report_dir=report_dir,
+        )
+        if prof.get("status") != "ok":
+            return {
+                "status": "error",
+                "error": prof.get("error", "profile fallo"),
+                "stage": "profile",
+                "load": load,
+            }
+
+        load_block = {
+            k: load[k]
+            for k in (
+                "n_rows_source", "n_rows_fetched", "sampled", "sample_frac",
+                "pseudonymized", "table", "streamed",
+            )
+            if k in load
+        }
+        # Traceability of the source filters (only when provided).
+        if where_sql:
+            load_block["where_sql"] = where_sql
+        if select_sql:
+            load_block["select_sql"] = select_sql
+
+        return {
+            "status": "ok",
+            "table_fqn": table_fqn,
+            "load": load_block,
+            # A caller-supplied DuckDB is never deleted, so its path is reported too.
+            "duckdb_path": duckdb_path if (keep_duckdb or not tmp_created) else None,
+            "report_md_path": prof.get("report_md_path"),
+            "report_json_path": prof.get("report_json_path"),
+            "aeda_pdf_path": prof.get("aeda_pdf_path"),
+            "aeda_pptx_path": prof.get("aeda_pptx_path"),
+            "aeda_manifest_path": prof.get("aeda_manifest_path"),
+            "profile": prof.get("profile"),
+        }
+    except Exception as e:  # noqa: BLE001
+        return {"status": "error", "error": str(e)}
+    finally:
+        # Clean up the temporary DuckDB unless asked to keep it.
+        if tmp_created and not keep_duckdb and duckdb_path and os.path.exists(duckdb_path):
+            try:
+                os.remove(duckdb_path)
+            except OSError:
+                pass
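Review note: the docstring above describes how `sample_frac`, `max_rows`, `where_sql` and `select_sql` shape the source query, but `load_bq_table_to_duckdb` itself is not part of this hunk. The sketch below shows one way those four parameters could compose into a single BigQuery statement. The helper name `build_source_sql` and the exact clause layout are assumptions made for illustration; the registry function may build its query differently.

```python
def build_source_sql(table_fqn, sample_frac=None, max_rows=0,
                     where_sql="", select_sql=""):
    """Hypothetical helper: compose the SELECT used to pull rows from BigQuery."""
    if sample_frac is not None and not (0.0 < sample_frac < 1.0):
        raise ValueError("sample_frac must be None or a float in (0, 1)")
    if max_rows < 0:
        raise ValueError("max_rows must be >= 0")

    # Empty select_sql means "all columns".
    select_part = select_sql.strip() or "*"

    predicates = []
    if where_sql.strip():
        # Interpolated verbatim, as the docstring warns: trusted input only.
        predicates.append(f"({where_sql.strip()})")
    if sample_frac is not None:
        # Opt-in row-level sampling: keeps roughly sample_frac of the rows.
        predicates.append(f"RAND() < {sample_frac!r}")

    sql = f"SELECT {select_part} FROM `{table_fqn}`"
    if predicates:
        sql += " WHERE " + " AND ".join(predicates)
    if max_rows > 0:
        # 0 means no cap, so LIMIT is only emitted for positive values.
        sql += f" LIMIT {int(max_rows)}"
    return sql


full_sql = build_source_sql("project.dataset.table")
sampled_sql = build_source_sql(
    "project.dataset.table",
    sample_frac=0.05,
    max_rows=1000,
    where_sql="venta_n IS NOT NULL",
)
```

With the defaults the statement has neither `WHERE` nor `LIMIT`, which is the FULL mode this version makes the default.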
@@ -4,8 +4,8 @@ kind: pipeline
 lang: py
 domain: pipelines
 purity: impure
-version: "1.0.0"
-signature: "def profile_database(db_path: str, tables: list = None, sample: int = 5000, report_dir: str = \"reports\", write_report: bool = True, min_inclusion: float = 0.9) -> dict"
+version: "1.1.0"
+signature: "def profile_database(db_path: str, tables: list = None, sample: int = 5000, report_dir: str = \"reports\", write_report: bool = True, min_inclusion: float = 0.9, emit_pdf: bool = False, run_llm: bool = False) -> dict"
 description: "Orquestador one-shot del grupo eda a nivel de BASE: perfila TODA una base DuckDB (todas las tablas o las indicadas) componiendo profile_table por tabla, infiere las relaciones FK inter-tabla por containment y construye el join graph con diagrama Mermaid. Ensambla un DatabaseProfile (resumen por tabla + TableProfiles completos + fk_candidates + join_graph) y opcionalmente emite un report markdown DB-level + JSON sidecar. Es la composicion canonica para hazme un EDA de esta base de datos y entender su esquema relacional."
 tags: [eda, relations, duckdb, profiling, data-quality, pipeline, dataops]
 uses_functions:
@@ -38,6 +38,10 @@ params:
    desc: "Si True (default) escribe report markdown DB-level + JSON sidecar timestamped en report_dir; si False no toca disco y los paths del retorno son None."
  - name: min_inclusion
    desc: "Umbral minimo de inclusion (0-1) para emitir una FK candidata (se pasa a infer_fk_containment_duckdb). Default 0.9."
+  - name: emit_pdf
+    desc: "If True (default False) renders the DB-level mobile PDF with render_eda_pdf_relational alongside the reports (report_pdf_path in the return value)."
+  - name: run_llm
+    desc: "If True (default False) enables the interpretive LLM layer of profile_table for EACH table: one LLM call per table over the aggregated profile, never raw rows."
 output: "dict {status:'ok', db_profile:<DatabaseProfile con db_path, profiled_at, n_tables, tables[resumen], table_profiles[completos], fk_candidates, join_graph{nodes,edges,mermaid,hubs}, errors>, report_md_path:str|None, report_json_path:str|None} o {status:'error', error:str} (dict-no-throw)."
 ---

@@ -101,3 +105,7 @@ se infieren las FK y se dibuja el diagrama de relaciones.
  perfiladas con exito. Revisa `errors` para saber que quedo fuera.
 - `db_path` debe existir: DuckDB read-only NO crea la base. El muestreo de cada
  tabla usa el sandbox read-only por defecto (sin acceso a FS/red).
+
+## Capability growth log
+
+- v1.1.0 (2026-07-02): adds `run_llm` (passthrough to profile_table: per-table interpretive LLM layer) and documents `emit_pdf` in the frontmatter (it had existed in the code since the relational renderer). No breaking changes: both default to False.
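Review note: the frontmatter says FK candidates are inferred "by containment" and filtered with `min_inclusion` (default 0.9). As a rough illustration of that idea, the sketch below scores a child column by the share of its distinct non-null values that also appear in a parent column. The names `inclusion_ratio` and `fk_candidates` are hypothetical and the code works on in-memory lists; the real `infer_fk_containment_duckdb` runs against DuckDB and may define inclusion differently.

```python
def inclusion_ratio(child_values, parent_values):
    """Share of distinct non-null child values that exist in the parent column."""
    child = {v for v in child_values if v is not None}
    if not child:
        return 0.0
    parent = {v for v in parent_values if v is not None}
    return len(child & parent) / len(child)


def fk_candidates(columns, min_inclusion=0.9):
    """columns: dict mapping (table, column) -> list of values."""
    found = []
    for child_key, child_vals in columns.items():
        for parent_key, parent_vals in columns.items():
            if child_key[0] == parent_key[0]:
                continue  # only inter-table relations are of interest
            ratio = inclusion_ratio(child_vals, parent_vals)
            if ratio >= min_inclusion:
                found.append(
                    {"child": child_key, "parent": parent_key, "inclusion": ratio}
                )
    return found


columns = {
    ("orders", "customer_id"): [1, 2, 2, 3, None],
    ("customers", "id"): [1, 2, 3, 4],
}
candidates = fk_candidates(columns)
```

Here `orders.customer_id` is fully contained in `customers.id`, while the reverse direction only reaches 0.75 and is dropped by the 0.9 threshold.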
@@ -121,6 +121,7 @@ def profile_database(
    write_report: bool = True,
    min_inclusion: float = 0.9,
    emit_pdf: bool = False,
+    run_llm: bool = False,
 ) -> dict:
    """Perfila una base DuckDB entera + sus relaciones inter-tabla.

@@ -141,6 +142,9 @@ def profile_database(
            render_eda_pdf_relational (resumen de tablas + relaciones FK + join
            graph) junto a los reports y devuelve su ruta en report_pdf_path. Con
            False no se toca el PDF (retrocompatible) y report_pdf_path es None.
+        run_llm: if True (default False) enables the interpretive LLM layer of
+            profile_table for EACH table (one LLM call per table over the
+            aggregated profile, never raw rows).

    Returns:
        dict dict-no-throw. En exito:
@@ -177,7 +181,9 @@ def profile_database(

        # 2) Perfilar cada tabla (tolerando fallos individuales).
        for table in tables:
-            r = profile_table(db_path, table, sample=sample, write_report=False)
+            r = profile_table(
+                db_path, table, sample=sample, write_report=False, run_llm=run_llm
+            )
            if r.get("status") == "ok":
                prof = r["profile"]
                table_profiles.append(prof)
@@ -0,0 +1,103 @@
+---
+name: run_sales_forecast
+kind: pipeline
+lang: py
+domain: pipelines
+purity: impure
+version: "1.1.0"
+signature: "def run_sales_forecast(as_of: str = '', horizon: int = 7, model: str = 'baseline_v1', author: str = 'egutierrez', dry_run: bool = False) -> dict"
+description: "Daily Aurgi sales forecast (day x center x CGQ subcategory) written to BigQuery autingo-159109.sales_forecast.predictions in a single call. Composes registry functions: bq_auth(drop_quota_project=True) for a client without a foreign quota project, bq_query to read the aggregated history from the mart bi_ventas_mart.base_margenes_aa (18 weeks, sanitized venta_n) and run the idempotency DELETE, forecast_seasonal_median (PURE model: seasonal median + bounded trend) to generate all predictions, and bq_load_from_file to load the JSONL into the predictions table. After a successful load it refreshes sales_forecast.actuals_daily (10-day rolling window). Usable history runs up to as_of-1 (the as_of day is still partial when the cron runs at 21:00); it predicts as_of+1..as_of+horizon; run_date=as_of. Only active series are predicted (sales>0 in the last 8 weeks). Idempotent per (run_date, model, author). --dry-run writes nothing."
+tags: [forecast, bigquery, sales, aurgi, pipeline, launcher]
+uses_functions:
+  - forecast_seasonal_median_py_datascience
+  - bq_auth_py_infra
+  - bq_query_py_infra
+  - bq_load_from_file_py_infra
+uses_types: []
+returns: []
+returns_optional: false
+error_type: error_go_core
+imports: [google-cloud-bigquery]
+tested: false
+tests: []
+test_file_path: ""
+file_path: "python/functions/pipelines/run_sales_forecast.py"
+params:
+  - name: as_of
+    desc: "cutoff date 'YYYY-MM-DD' (the day of the run). Empty (DEFAULT) = today. Usable history runs up to as_of-1 day (the as_of day is still partial at the 21:00 cron); the forecast covers as_of+1..as_of+horizon; run_date=as_of"
+  - name: horizon
+    desc: "number of future days to predict, starting at as_of+1. Default 7"
+  - name: model
+    desc: "model label written to the model column of every row. Default 'baseline_v1'. Part of the idempotency key"
+  - name: author
+    desc: "author of the run (author column). Default 'egutierrez'. Part of the idempotency key"
+  - name: dry_run
+    desc: "if True, nothing is written to BigQuery (neither DELETE nor load): returns the summary plus a sample of up to 5 rows. Default False"
+output: "dict (dict-no-throw). On success {status:'ok', run_date, series:N (active series), rows:N (predicted rows), model, author, rows_loaded, job_id, actuals_refreshed (plus actuals_error if the actuals refresh failed)}; with dry_run=True it includes sample:[up to 5 rows] and omits rows_loaded/job_id/actuals_refreshed. On error {status:'error', error, stage}. The summary JSON is printed to stdout; exit 0 if ok, 1 on error"
+---
+
+## Example
+
+```bash
+# Real run (21:00 cron): predicts the 7 days after today and loads them into BigQuery.
+./fn run run_sales_forecast
+
+# Explicit cutoff date and horizon, without writing (inspect the sample):
+./fn run run_sales_forecast --as-of 2026-07-01 --horizon 7 --dry-run
+
+# Alternative model (different idempotency key: does not overwrite baseline_v1):
+./fn run run_sales_forecast --model baseline_v2 --author egutierrez
+```
+
+```python
+# Programmatic use (project venv, PYTHONPATH=python/functions):
+from pipelines.run_sales_forecast import run_sales_forecast
+
+r = run_sales_forecast(as_of="2026-07-01", horizon=7, dry_run=True)
+print(r["series"], "active series,", r["rows"], "rows")
+for row in r["sample"]:
+    print(row["forecast_date"], row["center_id"], row["subcat_cgq"], row["y_pred"])
+```
+
+## When to use it
+
+When you want to (re)generate the daily Aurgi sales forecast per center and CGQ
+subcategory and leave it in `autingo-159109.sales_forecast.predictions` in a
+single call. This is the pipeline triggered by the nightly cron (21:00): it reads
+the history from the mart, applies the seasonal baseline, and loads the
+predictions idempotently. Use `--dry-run` to inspect the sample before writing,
+or to test after a change in the mart or in the model. Change `--model` to try a
+variant without overwriting the current model's predictions (the idempotency key
+is run_date + model + author).
+
+## Gotchas
+
+- Impure: requires BigQuery ADC to be configured
+  (`gcloud auth application-default login`) with access to `autingo-159109`. It
+  uses `bq_auth(drop_quota_project=True)` to discard the quota project from the
+  ADC of user `egutierrez` and avoid the `403 USER_PROJECT_DENIED` (a known gotcha
+  in this repo).
+- Writes to production: in real mode it runs a `DELETE` of the previous predictions
+  for `(run_date, model, author)` and then loads (WRITE_APPEND). It is idempotent
+  for that combination: re-running the same run does not duplicate rows. Changing
+  `model` or `author` creates a parallel set of predictions. Use `--dry-run` if you
+  only want to look.
+- After a successful load it also rewrites the last 10 days of
+  `sales_forecast.actuals_daily`. A failure there does not fail the run: the
+  summary reports `actuals_refreshed: false` plus `actuals_error`.
+- The `sales_forecast.predictions` table must already exist with a fixed schema and
+  exactly the columns the pipeline emits: `run_ts` (TIMESTAMP), `run_date` (DATE),
+  `forecast_date` (DATE), `lag_days` (INT64), `center_id` (STRING), `center_name`
+  (STRING), `ambito` (STRING), `subcat_cgq` (STRING), `model` (STRING), `author`
+  (STRING), `y_pred` (FLOAT64). The load uses `autodetect=False`: the JSONL field
+  names must match the table's column names or the load fails.
+- `center_id` is emitted as STRING (str(idCentro)); `subcat_cgq` takes its value from
+  the mart column `subcat_cqq` (the name differs between source and destination on
+  purpose). center_name/ambito are the last known values per series (max date).
+- History up to as_of-1: the `as_of` day is NOT part of the history (it is partial
+  at the 21:00 cron). If you need to include the current day, pass `--as-of` with
+  the following day; note that `run_date` and the predicted window shift by one
+  day as well.
+- Only series with sales > 0 in the last 8 weeks are predicted: dead series are
+  skipped (they do not appear in the table). `series` in the output counts the
+  active ones.
+- Reads 18 weeks of mart history: this covers the seasonal window (8 weeks of the
+  same weekday) plus the trend (4+4 weeks) with some margin. venta_n is filtered
+  with `ABS < 1e9` to drop the mart's poison rows.
+
+## Capability growth log
+
+- v1.1.0 (2026-07-02): adds step 9, a refresh of `sales_forecast.actuals_daily` (physical table of actual sales, 10-day rolling window) after the predictions are loaded; `forecast_eval` and the competition dashboard compare against it.
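Review note: the date rules in this document (history ends at as_of-1, 18 weeks of history, 8-week activity window, forecast over as_of+1..as_of+horizon) are easy to get off by one. The sketch below restates them with plain `datetime` arithmetic. `forecast_windows` is a hypothetical helper written for this note, not a function of the pipeline; the constants are taken from the text above.

```python
from datetime import date, timedelta


def forecast_windows(as_of, horizon=7, history_weeks=18, active_weeks=8):
    """Hypothetical helper mirroring the date arithmetic described above."""
    history_end = as_of - timedelta(days=1)            # as_of itself is still partial
    history_start = as_of - timedelta(weeks=history_weeks)
    active_cutoff = history_end - timedelta(weeks=active_weeks)
    horizon_dates = [as_of + timedelta(days=k) for k in range(1, horizon + 1)]
    return {
        "run_date": as_of,
        "history_start": history_start,
        "history_end": history_end,
        # A series counts as active if it has sales strictly after this date.
        "active_cutoff": active_cutoff,
        "horizon_dates": horizon_dates,
        "lag_days": [(d - as_of).days for d in horizon_dates],
    }


w = forecast_windows(date(2026, 7, 1))
```

For `as_of = 2026-07-01` the history spans 2026-02-25 to 2026-06-30 and the forecast covers 2026-07-02 to 2026-07-08, with `lag_days` running from 1 to 7.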
@@ -0,0 +1,298 @@
+"""run_sales_forecast — forecast diario de ventas Aurgi a BigQuery (one-shot).
+
+IMPURE pipeline that produces the daily sales forecast (day x center x CGQ
+subcategory) and writes it to `autingo-159109.sales_forecast.predictions`.
+It composes registry functions without reimplementing their logic:
+
+  - bq_auth(..., drop_quota_project=True): BigQuery client without a foreign quota
+    project (avoids the 403 USER_PROJECT_DENIED from the user's ADC).
+  - bq_query: reads the aggregated history from the mart
+    `bi_ventas_mart.base_margenes_aa` and runs the idempotency DELETE (typed
+    parameters).
+  - forecast_seasonal_median: PURE model (seasonal median + bounded trend) that
+    generates all predictions in one go.
+  - bq_load_from_file: loads the rows (JSONL) into the predictions table.
+
+Intended cron: 21:00. That is why usable history runs up to as_of - 1 day (the
+as_of day is still partial) and the forecast covers as_of + 1 .. as_of + horizon.
+
+Dict-no-throw style: never raises; it catches errors and returns
+{status:'error', error, stage}. Idempotent per (run_date, model, author): it
+deletes the previous predictions for that combination before loading.
+"""
+
+import json
+import os
+import sys
+import tempfile
+from datetime import date, datetime, timedelta, timezone
+
+sys.path.insert(0, os.path.join(os.path.dirname(__file__), ".."))
+
+from bigquery import bq_auth, bq_query, bq_load_from_file
+from datascience import forecast_seasonal_median
+
+PROJECT_ID = "autingo-159109"
+SOURCE_TABLE = "autingo-159109.bi_ventas_mart.base_margenes_aa"
+DEST_DATASET = "sales_forecast"
+DEST_TABLE = "predictions"
+
+HISTORY_SQL = f"""
+SELECT fecha, idCentro, subcat_cqq,
+       ANY_VALUE(NombreCentro) AS center_name, ANY_VALUE(Ambito) AS ambito,
+       SUM(CAST(venta_n AS FLOAT64)) AS venta
+FROM `{SOURCE_TABLE}`
+WHERE fecha BETWEEN DATE_SUB(@as_of, INTERVAL 18 WEEK) AND DATE_SUB(@as_of, INTERVAL 1 DAY)
+  AND venta_n IS NOT NULL AND ABS(CAST(venta_n AS FLOAT64)) < 1e9
+  AND subcat_cqq IS NOT NULL AND idCentro IS NOT NULL
+GROUP BY fecha, idCentro, subcat_cqq
+"""
+
+DELETE_SQL = (
+    f"DELETE FROM `{PROJECT_ID}.{DEST_DATASET}.{DEST_TABLE}` "
+    "WHERE run_date = @d AND model = @m AND author = @a"
+)
+
+# Refresh of the physical actuals table (sales_forecast.actuals_daily), consumed
+# by the forecast_eval view and by the competition dashboards. Rolling window so
+# that retroactive corrections in the mart are picked up.
+ACTUALS_DELETE_SQL = (
+    f"DELETE FROM `{PROJECT_ID}.{DEST_DATASET}.actuals_daily` "
+    "WHERE fecha BETWEEN DATE_SUB(@as_of, INTERVAL @w DAY) AND DATE_SUB(@as_of, INTERVAL 1 DAY)"
+)
+
+ACTUALS_INSERT_SQL = f"""
+INSERT INTO `{PROJECT_ID}.{DEST_DATASET}.actuals_daily`
+  (fecha, center_id, center_name, ambito, subcat_cgq, y_real, unidades, loaded_ts)
+SELECT forecast_date, IFNULL(center_id, 'SIN_CENTRO'), center_name, ambito,
+       IFNULL(subcat_cgq, 'Sin subcategoria'),
+       y_real, unidades, CURRENT_TIMESTAMP()
+FROM `{PROJECT_ID}.{DEST_DATASET}.actuals`
+WHERE forecast_date BETWEEN DATE_SUB(@as_of, INTERVAL @w DAY) AND DATE_SUB(@as_of, INTERVAL 1 DAY)
+"""
+
+
+def _refresh_actuals(client, as_of: date, window_days: int = 10) -> None:
+    """Rehace los ultimos `window_days` dias de actuals_daily desde la vista actuals."""
+    params = [
+        {"name": "as_of", "type": "DATE", "value": as_of},
+        {"name": "w", "type": "INT64", "value": window_days},
+    ]
+    bq_query(client, ACTUALS_DELETE_SQL, params=params)
+    bq_query(client, ACTUALS_INSERT_SQL, params=params)
+
+
+def _as_date(value) -> date:
+    if isinstance(value, date) and not isinstance(value, datetime):
+        return value
+    if isinstance(value, datetime):
+        return value.date()
+    return datetime.strptime(str(value)[:10], "%Y-%m-%d").date()
+
+
+def run_sales_forecast(
+    as_of: str = "",
+    horizon: int = 7,
+    model: str = "baseline_v1",
+    author: str = "egutierrez",
+    dry_run: bool = False,
+) -> dict:
+    """Genera el forecast diario de ventas y lo escribe en BigQuery.
+
+    Args:
+        as_of: cutoff date 'YYYY-MM-DD' (the day of the run). Empty = today.
+            Usable history runs up to as_of - 1 day; the forecast covers
+            as_of + 1 .. as_of + horizon. run_date = as_of.
+        horizon: number of future days to predict. Default 7.
+        model: model label written to every row (model column). Default
+            'baseline_v1'.
+        author: author of the run (author column). Default 'egutierrez'.
+        dry_run: if True, writes nothing to BigQuery; returns the summary plus a
+            sample of rows.
+
+    Returns:
+        dict (dict-no-throw). On success {status:'ok', run_date, series, rows,
+        model, author}, plus sample when dry_run=True, or rows_loaded, job_id and
+        actuals_refreshed (with actuals_error if the refresh failed) after a real
+        load. On error {status:'error', error, stage}.
+    """
+    try:
+        run_d = _as_date(as_of) if as_of else date.today()
+        # Last day of usable history (inclusive): as_of - 1 day.
+        hist_as_of = run_d - timedelta(days=1)
+        horizon_dates = [
+            (run_d + timedelta(days=k)).isoformat() for k in range(1, horizon + 1)
+        ]
+
+        # 1) BigQuery client without a quota project (avoids 403 USER_PROJECT_DENIED).
+        client = bq_auth(PROJECT_ID, drop_quota_project=True)
+
+        # 2) Aggregated mart history (up to run_d - 1 via the query's WHERE).
+        q = bq_query(
+            client,
+            HISTORY_SQL,
+            params=[{"name": "as_of", "type": "DATE", "value": run_d}],
+        )
+        cols = {name: i for i, name in enumerate(q["columns"])}
+
+        # Per-series history + last known center_name/ambito + 8-week sales.
+        history = []
+        last_meta = {}          # series_id -> (max_date, center_name, ambito, center_id, subcat)
+        recent_sum = {}         # series_id -> sales accumulated over the last 8 weeks
+        active_cutoff = hist_as_of - timedelta(weeks=8)
+        for row in q["rows"]:
+            fecha = _as_date(row[cols["fecha"]])
+            center_id = str(row[cols["idCentro"]])
+            subcat = row[cols["subcat_cqq"]]
+            center_name = row[cols["center_name"]]
+            ambito = row[cols["ambito"]]
+            venta = float(row[cols["venta"]] or 0.0)
+            series_id = f"{center_id}|{subcat}"
+
+            history.append(
+                {"series_id": series_id, "date": fecha.isoformat(), "value": venta}
+            )
+            prev = last_meta.get(series_id)
+            if prev is None or fecha > prev[0]:
+                last_meta[series_id] = (fecha, center_name, ambito, center_id, subcat)
+            if fecha > active_cutoff:
+                recent_sum[series_id] = recent_sum.get(series_id, 0.0) + venta
+
+        # 3) Active series: sales > 0 in the last 8 weeks.
+        active = {sid for sid, s in recent_sum.items() if s > 0.0}
+        history = [h for h in history if h["series_id"] in active]
+
+        if not history:
+            result = {
+                "status": "ok",
+                "run_date": run_d.isoformat(),
+                "series": 0,
+                "rows": 0,
+                "model": model,
+                "author": author,
+            }
+            if dry_run:
+                result["sample"] = []
+            return result
+
+        # 4) Pure model: all predictions in one go.
+        preds = forecast_seasonal_median(
+            history, horizon_dates, as_of=hist_as_of.isoformat()
+        )
+
+        # 5) Rows for the predictions table.
+        run_ts = datetime.now(timezone.utc).isoformat()
+        rows_out = []
+        for p in preds:
+            sid = p["series_id"]
+            meta = last_meta.get(sid)
+            _, center_name, ambito, center_id, subcat = meta
+            forecast_date = _as_date(p["date"])
+            rows_out.append(
+                {
+                    "run_ts": run_ts,
+                    "run_date": run_d.isoformat(),
+                    "forecast_date": forecast_date.isoformat(),
+                    "lag_days": (forecast_date - run_d).days,
+                    "center_id": center_id,
+                    "center_name": center_name,
+                    "ambito": ambito,
+                    "subcat_cgq": subcat,
+                    "model": model,
+                    "author": author,
+                    "y_pred": round(float(p["y_pred"]), 4),
+                }
+            )
+
+        summary = {
+            "status": "ok",
+            "run_date": run_d.isoformat(),
+            "series": len(active),
+            "rows": len(rows_out),
+            "model": model,
+            "author": author,
+        }
+
+        # 6) dry-run: writes nothing; returns summary + sample.
+        if dry_run:
+            summary["sample"] = rows_out[:5]
+            return summary
+
+        # 7) Idempotency: delete the previous predictions for (run_date, model, author).
+        bq_query(
+            client,
+            DELETE_SQL,
+            params=[
+                {"name": "d", "type": "DATE", "value": run_d},
+                {"name": "m", "type": "STRING", "value": model},
+                {"name": "a", "type": "STRING", "value": author},
+            ],
+        )
+
+        # 8) Load JSONL into the table (WRITE_APPEND, fixed table schema).
+        tmp_path = None
+        try:
+            fd, tmp_path = tempfile.mkstemp(prefix="sales_forecast_", suffix=".jsonl")
+            with os.fdopen(fd, "w", encoding="utf-8") as fh:
+                for r in rows_out:
+                    fh.write(json.dumps(r, ensure_ascii=False) + "\n")
+            load = bq_load_from_file(
+                client,
+                tmp_path,
+                DEST_DATASET,
+                DEST_TABLE,
+                source_format="NEWLINE_DELIMITED_JSON",
+                write_disposition="WRITE_APPEND",
+                autodetect=False,
+            )
+        finally:
+            if tmp_path and os.path.exists(tmp_path):
+                os.remove(tmp_path)
+
+        if load.get("status") != "DONE":
+            return {
+                "status": "error",
+                "error": f"load job no termino DONE: {load}",
+                "stage": "load",
+            }
+
+        summary["rows_loaded"] = load.get("rows_loaded")
+        summary["job_id"] = load.get("job_id")
+
+        # 9) Refresh the physical actuals table (10-day rolling window) so that
+        # forecast_eval and the competition dashboard compare against the
+        # latest state of the mart.
+        try:
+            _refresh_actuals(client, run_d)
+            summary["actuals_refreshed"] = True
+        except Exception as e:  # noqa: BLE001
+            # Does not invalidate the predictions already loaded: report it and carry on.
+            summary["actuals_refreshed"] = False
+            summary["actuals_error"] = str(e)
+
+        return summary
+    except Exception as e:  # noqa: BLE001
+        return {"status": "error", "error": str(e), "stage": "unexpected"}
+
+
+if __name__ == "__main__":
+    import argparse
+
+    parser = argparse.ArgumentParser(
+        description="Forecast diario de ventas Aurgi -> BigQuery sales_forecast.predictions."
+    )
+    parser.add_argument("--as-of", default="", help="Fecha de corte YYYY-MM-DD (vacio = hoy).")
+    parser.add_argument("--horizon", type=int, default=7, help="Dias a predecir. Default 7.")
+    parser.add_argument("--model", default="baseline_v1", help="Etiqueta del modelo.")
+    parser.add_argument("--author", default="egutierrez", help="Autor de la corrida.")
+    parser.add_argument(
+        "--dry-run", action="store_true", help="No escribe en BigQuery; imprime muestra."
+    )
+    args = parser.parse_args()
+
+    out = run_sales_forecast(
+        as_of=args.as_of,
+        horizon=args.horizon,
+        model=args.model,
+        author=args.author,
+        dry_run=args.dry_run,
+    )
+    print(json.dumps(out, ensure_ascii=False, default=str))
+    sys.exit(0 if out.get("status") == "ok" else 1)
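Review note: `forecast_seasonal_median` is imported here but not shown in this diff; the docs only describe it as a seasonal median (8 weeks of the same weekday) with a bounded trend (4+4 weeks). The single-series sketch below is one plausible reading of that description, meant only to make the idea concrete. Everything in it is an assumption: the function name `seasonal_median_forecast`, treating missing days as 0.0, the ratio-of-means trend, and the clamp bounds (0.8, 1.25).

```python
from datetime import date, timedelta
from statistics import median


def seasonal_median_forecast(series, as_of, target, trend_bounds=(0.8, 1.25)):
    """Hypothetical single-series sketch, not the registry implementation.

    series: dict mapping date -> value (missing days count as 0.0).
    as_of:  last day of usable history (inclusive).
    target: date to predict (after as_of).
    """
    # Seasonal level: median of the last 8 observations on the target's weekday.
    last_same_weekday = target
    while last_same_weekday > as_of:
        last_same_weekday -= timedelta(weeks=1)
    same_day = [
        series.get(last_same_weekday - timedelta(weeks=k), 0.0) for k in range(8)
    ]
    level = median(same_day)

    # Trend: mean of the last 4 weeks over the mean of the 4 weeks before them.
    recent = [series.get(as_of - timedelta(days=k), 0.0) for k in range(28)]
    prior = [series.get(as_of - timedelta(days=k), 0.0) for k in range(28, 56)]
    prior_mean = sum(prior) / len(prior)
    ratio = (sum(recent) / len(recent)) / prior_mean if prior_mean > 0 else 1.0

    # Bound the trend so one unusual month cannot dominate the forecast.
    low, high = trend_bounds
    ratio = min(max(ratio, low), high)

    return max(0.0, level * ratio)


as_of = date(2026, 6, 30)
flat = {as_of - timedelta(days=k): 100.0 for k in range(56)}
y_flat = seasonal_median_forecast(flat, as_of, date(2026, 7, 2))
```

A flat history returns its own level, since the median is the constant value and the trend ratio is exactly 1. Clamping the ratio is the "bounded" part: a doubling of sales in the last four weeks would lift the forecast by at most the upper bound.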
@@ -9,6 +9,7 @@ dependencies = [
    "contextily>=1.7.0",
    "cryptography>=46.0.6",
    "duckdb>=1.5.2",
+    "faker>=40.27.0",
    "fpdf2>=2.8.7",
    "geopandas>=1.1.3",
    "google-api-python-client>=2.197.0",
@@ -839,6 +839,18 @@ wheels = [
    { url = "https://files.pythonhosted.org/packages/c1/ea/53f2148663b321f21b5a606bd5f191517cf40b7072c0497d3c92c4a13b1e/executing-2.2.1-py2.py3-none-any.whl", hash = "sha256:760643d3452b4d777d295bb167ccc74c64a81df23fb5e08eff250c425a4b2017", size = 28317, upload-time = "2025-09-01T09:48:08.5Z" },
 ]

+[[package]]
+name = "faker"
+version = "40.27.0"
+source = { registry = "https://pypi.org/simple" }
+dependencies = [
+    { name = "tzdata", marker = "sys_platform == 'win32'" },
+]
+sdist = { url = "https://files.pythonhosted.org/packages/1a/7b/c62c98764137c949be240ad83f763b6f96cf76055952a3e2835359acc3af/faker-40.27.0.tar.gz", hash = "sha256:f697cf07f461474ad7d511164c21f45317e69f1d531d25f3e0f872b639e346a1", size = 2018361, upload-time = "2026-06-30T18:05:17.775Z" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/c6/b2/788aae329da3d7e4f08f8e1a82e82243c3376c0f3f49b75ae29eea40b371/faker-40.27.0-py3-none-any.whl", hash = "sha256:6099bd6d7bc79041b46c28e100815e2558952bcf384b76ce6c71c8bdca744256", size = 2057897, upload-time = "2026-06-30T18:05:15.555Z" },
+]
+
 [[package]]
 name = "fastapi"
 version = "0.136.3"
@@ -890,6 +902,7 @@ dependencies = [
    { name = "contextily" },
    { name = "cryptography" },
    { name = "duckdb" },
+    { name = "faker" },
    { name = "fpdf2" },
    { name = "geopandas" },
    { name = "google-api-python-client" },
@@ -949,6 +962,7 @@ requires-dist = [
    { name = "contextily", specifier = ">=1.7.0" },
    { name = "cryptography", specifier = ">=46.0.6" },
    { name = "duckdb", specifier = ">=1.5.2" },
+    { name = "faker", specifier = ">=40.27.0" },
    { name = "fpdf2", specifier = ">=2.8.7" },
    { name = "geopandas", specifier = ">=1.1.3" },
    { name = "gliner", marker = "extra == 'nlp'", specifier = ">=0.2.13" },
@@ -0,0 +1,7 @@
+import google.auth
+from google.cloud import bigquery
+_creds, _ = google.auth.default(scopes=['https://www.googleapis.com/auth/bigquery'])
+_creds = _creds.with_quota_project(None)
+client = bigquery.Client(project='autingo-159109', location='europe-west1', credentials=_creds)
+def q(sql):
+    return client.query(sql).result().to_dataframe()
@@ -0,0 +1 @@
+{"c1": 12363, "c2": 12364, "c3": 12365}
@@ -0,0 +1,61 @@
+ensena,year,mes,diego,bq_neto,match
+Aurgi,2023,feb,80.52,,
+Aurgi,2023,mar,89.94,,
+Aurgi,2023,abr,76.87,,
+Aurgi,2023,may,87.95,,
+Aurgi,2023,jun,97.84,,
+Aurgi,2023,jul,138.24,,
+Aurgi,2023,ago,89.7,,
+Aurgi,2023,sep,61.53,,
+Aurgi,2023,oct,56.48,,
+Aurgi,2023,nov,73.2,,
+Aurgi,2023,dic,78.81,,
+Aurgi,2024,ene,75.34,75.35,100.0
+Aurgi,2024,feb,60.21,60.21,100.0
+Aurgi,2024,mar,70.62,71.26,99.1
+Aurgi,2024,abr,70.46,70.46,100.0
+Aurgi,2024,may,84.76,84.76,100.0
+Aurgi,2024,jun,108.7,108.7,100.0
+Aurgi,2024,jul,141.2,141.2,100.0
+Aurgi,2024,ago,100.18,100.18,100.0
+Aurgi,2024,sep,67.91,67.91,100.0
+Aurgi,2024,oct,81.31,81.31,100.0
+Aurgi,2024,nov,71.57,71.57,100.0
+Aurgi,2024,dic,74.33,74.33,100.0
+Aurgi,2025,ene,86.28,86.28,100.0
+Aurgi,2025,feb,53.05,53.05,100.0
+Aurgi,2025,mar,86.75,86.75,100.0
+Aurgi,2025,abr,83.89,83.89,100.0
+Aurgi,2025,may,84.24,84.24,100.0
+Aurgi,2025,jun,134.46,134.46,100.0
+Aurgi,2025,jul,101.17,174.32,58.0
+MT,2023,feb,30.19,,
+MT,2023,mar,41.89,,
+MT,2023,abr,36.16,,
+MT,2023,may,42.01,,
+MT,2023,jun,44.24,,
+MT,2023,jul,63.61,,
+MT,2023,ago,40.7,,
+MT,2023,sep,28.6,,
+MT,2023,oct,28.79,,
+MT,2023,nov,30.3,,
+MT,2023,dic,35.21,,
+MT,2024,ene,38.13,38.13,100.0
+MT,2024,feb,32.44,32.44,100.0
+MT,2024,mar,35.17,35.18,100.0
+MT,2024,abr,35.38,35.38,100.0
+MT,2024,may,37.58,37.58,100.0
+MT,2024,jun,44.54,44.54,100.0
+MT,2024,jul,58.92,58.92,100.0
+MT,2024,ago,40.97,40.98,100.0
+MT,2024,sep,35.03,35.03,100.0
+MT,2024,oct,38.86,38.86,100.0
+MT,2024,nov,36.48,36.48,100.0
+MT,2024,dic,40.52,40.52,100.0
+MT,2025,ene,39.16,39.16,100.0
+MT,2025,feb,28.16,28.16,100.0
+MT,2025,mar,42.26,42.26,100.0
+MT,2025,abr,44.04,44.04,100.0
+MT,2025,may,52.71,52.71,100.0
+MT,2025,jun,63.54,63.54,100.0
+MT,2025,jul,49.47,84.94,58.2
@@ -0,0 +1 @@
+https://reports.autingo.es/dashboard/1142
@@ -0,0 +1,60 @@
+STRUCT(DATE(2023,2,1) AS mes, 80.515 AS diego_neto_k, 1 AS company_id),
+    STRUCT(DATE(2023,3,1) AS mes, 89.936 AS diego_neto_k, 1 AS company_id),
+    STRUCT(DATE(2023,4,1) AS mes, 76.866 AS diego_neto_k, 1 AS company_id),
+    STRUCT(DATE(2023,5,1) AS mes, 87.952 AS diego_neto_k, 1 AS company_id),
+    STRUCT(DATE(2023,6,1) AS mes, 97.84 AS diego_neto_k, 1 AS company_id),
+    STRUCT(DATE(2023,7,1) AS mes, 138.24 AS diego_neto_k, 1 AS company_id),
+    STRUCT(DATE(2023,8,1) AS mes, 89.7 AS diego_neto_k, 1 AS company_id),
+    STRUCT(DATE(2023,9,1) AS mes, 61.53 AS diego_neto_k, 1 AS company_id),
+    STRUCT(DATE(2023,10,1) AS mes, 56.48 AS diego_neto_k, 1 AS company_id),
+    STRUCT(DATE(2023,11,1) AS mes, 73.2 AS diego_neto_k, 1 AS company_id),
+    STRUCT(DATE(2023,12,1) AS mes, 78.81 AS diego_neto_k, 1 AS company_id),
+    STRUCT(DATE(2024,1,1) AS mes, 75.345 AS diego_neto_k, 1 AS company_id),
+    STRUCT(DATE(2024,2,1) AS mes, 60.211 AS diego_neto_k, 1 AS company_id),
+    STRUCT(DATE(2024,3,1) AS mes, 70.62 AS diego_neto_k, 1 AS company_id),
+    STRUCT(DATE(2024,4,1) AS mes, 70.456 AS diego_neto_k, 1 AS company_id),
+    STRUCT(DATE(2024,5,1) AS mes, 84.759 AS diego_neto_k, 1 AS company_id),
+    STRUCT(DATE(2024,6,1) AS mes, 108.702 AS diego_neto_k, 1 AS company_id),
+    STRUCT(DATE(2024,7,1) AS mes, 141.204 AS diego_neto_k, 1 AS company_id),
+    STRUCT(DATE(2024,8,1) AS mes, 100.181 AS diego_neto_k, 1 AS company_id),
+    STRUCT(DATE(2024,9,1) AS mes, 67.91 AS diego_neto_k, 1 AS company_id),
+    STRUCT(DATE(2024,10,1) AS mes, 81.307 AS diego_neto_k, 1 AS company_id),
+    STRUCT(DATE(2024,11,1) AS mes, 71.569 AS diego_neto_k, 1 AS company_id),
+    STRUCT(DATE(2024,12,1) AS mes, 74.329 AS diego_neto_k, 1 AS company_id),
+    STRUCT(DATE(2025,1,1) AS mes, 86.277 AS diego_neto_k, 1 AS company_id),
+    STRUCT(DATE(2025,2,1) AS mes, 53.054 AS diego_neto_k, 1 AS company_id),
+    STRUCT(DATE(2025,3,1) AS mes, 86.749 AS diego_neto_k, 1 AS company_id),
+    STRUCT(DATE(2025,4,1) AS mes, 83.888 AS diego_neto_k, 1 AS company_id),
+    STRUCT(DATE(2025,5,1) AS mes, 84.24 AS diego_neto_k, 1 AS company_id),
+    STRUCT(DATE(2025,6,1) AS mes, 134.464 AS diego_neto_k, 1 AS company_id),
+    STRUCT(DATE(2025,7,1) AS mes, 101.168 AS diego_neto_k, 1 AS company_id),
+    STRUCT(DATE(2023,2,1) AS mes, 30.189 AS diego_neto_k, 2 AS company_id),
+    STRUCT(DATE(2023,3,1) AS mes, 41.89 AS diego_neto_k, 2 AS company_id),
+    STRUCT(DATE(2023,4,1) AS mes, 36.16 AS diego_neto_k, 2 AS company_id),
+    STRUCT(DATE(2023,5,1) AS mes, 42.011 AS diego_neto_k, 2 AS company_id),
+    STRUCT(DATE(2023,6,1) AS mes, 44.24 AS diego_neto_k, 2 AS company_id),
+    STRUCT(DATE(2023,7,1) AS mes, 63.61 AS diego_neto_k, 2 AS company_id),
+    STRUCT(DATE(2023,8,1) AS mes, 40.7 AS diego_neto_k, 2 AS company_id),
+    STRUCT(DATE(2023,9,1) AS mes, 28.6 AS diego_neto_k, 2 AS company_id),
+    STRUCT(DATE(2023,10,1) AS mes, 28.79 AS diego_neto_k, 2 AS company_id),
+    STRUCT(DATE(2023,11,1) AS mes, 30.3 AS diego_neto_k, 2 AS company_id),
+    STRUCT(DATE(2023,12,1) AS mes, 35.207 AS diego_neto_k, 2 AS company_id),
+    STRUCT(DATE(2024,1,1) AS mes, 38.132 AS diego_neto_k, 2 AS company_id),
+    STRUCT(DATE(2024,2,1) AS mes, 32.438 AS diego_neto_k, 2 AS company_id),
+    STRUCT(DATE(2024,3,1) AS mes, 35.174 AS diego_neto_k, 2 AS company_id),
+    STRUCT(DATE(2024,4,1) AS mes, 35.382 AS diego_neto_k, 2 AS company_id),
+    STRUCT(DATE(2024,5,1) AS mes, 37.584 AS diego_neto_k, 2 AS company_id),
+    STRUCT(DATE(2024,6,1) AS mes, 44.54 AS diego_neto_k, 2 AS company_id),
+    STRUCT(DATE(2024,7,1) AS mes, 58.921 AS diego_neto_k, 2 AS company_id),
+    STRUCT(DATE(2024,8,1) AS mes, 40.974 AS diego_neto_k, 2 AS company_id),
+    STRUCT(DATE(2024,9,1) AS mes, 35.029 AS diego_neto_k, 2 AS company_id),
+    STRUCT(DATE(2024,10,1) AS mes, 38.861 AS diego_neto_k, 2 AS company_id),
+    STRUCT(DATE(2024,11,1) AS mes, 36.48 AS diego_neto_k, 2 AS company_id),
+    STRUCT(DATE(2024,12,1) AS mes, 40.522 AS diego_neto_k, 2 AS company_id),
+    STRUCT(DATE(2025,1,1) AS mes, 39.161 AS diego_neto_k, 2 AS company_id),
+    STRUCT(DATE(2025,2,1) AS mes, 28.16 AS diego_neto_k, 2 AS company_id),
+    STRUCT(DATE(2025,3,1) AS mes, 42.263 AS diego_neto_k, 2 AS company_id),
+    STRUCT(DATE(2025,4,1) AS mes, 44.04 AS diego_neto_k, 2 AS company_id),
+    STRUCT(DATE(2025,5,1) AS mes, 52.71 AS diego_neto_k, 2 AS company_id),
+    STRUCT(DATE(2025,6,1) AS mes, 63.544 AS diego_neto_k, 2 AS company_id),
+    STRUCT(DATE(2025,7,1) AS mes, 49.469 AS diego_neto_k, 2 AS company_id)
@@ -0,0 +1,8 @@
+import sys, json
+from google.cloud import bigquery
+import google.auth
+creds=google.auth.default(scopes=['https://www.googleapis.com/auth/bigquery'])[0].with_quota_project(None)
+c=bigquery.Client(project='autingo-159109', location='europe-west1', credentials=creds)
+sql=sys.stdin.read()
+for r in c.query(sql).result():
+    print(json.dumps(dict(r), default=str))
@@ -0,0 +1,152 @@
+import json, os, sys, urllib.error, urllib.request
+
+MB = os.environ["MB"]; KEY = os.environ["KEY"]
+
+def api(method, path, body=None, timeout=180):
+    data = json.dumps(body).encode() if body is not None else None
+    req = urllib.request.Request(MB + path, data=data, method=method,
+        headers={"X-API-KEY": KEY, "Content-Type": "application/json"})
+    try:
+        return json.load(urllib.request.urlopen(req, timeout=timeout))
+    except urllib.error.HTTPError as e:
+        print(f"HTTP {e.code} on {method} {path}:", e.read().decode()[:1200]); raise
+
+# Bridge from document to service_request (channel + charged), exactly as in 1094 card 11751.
+BASE = r"""
+WITH vf AS (
+  SELECT document_id, LOGICAL_OR(is_pw) is_pw FROM (
+    SELECT CAST(document_id AS STRING) document_id, ANY_VALUE(is_precaweb) is_pw
+      FROM `autingo-159109.anjana_bi_datamart.VENTAS_aurgi` GROUP BY 1
+    UNION ALL
+    SELECT CAST(document_id AS STRING), ANY_VALUE(is_precaweb)
+      FROM `autingo-159109.anjana_bi_datamart.VENTAS_Motortown` GROUP BY 1
+  ) GROUP BY 1
+),
+lineas AS (
+  SELECT
+    CAST(s.numeroDocumento AS STRING) AS numdoc,
+    CAST(s.idCentro AS STRING)        AS idcentro,
+    DATE(s.Fecha)                     AS fecha,
+    s.Base_imponible_linea            AS bil
+  FROM {{#4494}} s
+  WHERE DATE(s.Fecha) >= DATE_SUB(CURRENT_DATE(), INTERVAL 365 DAY)
+    [[AND DATE(s.Fecha) >= {{fecha_desde}}]]
+    [[AND DATE(s.Fecha) <= {{fecha_hasta}}]]
+),
+web AS (
+  SELECT l.numdoc, l.fecha, l.bil, oc.name AS centro, oc.Companies__name AS ambito
+  FROM lineas l
+  LEFT JOIN vf ON l.numdoc = vf.document_id
+  LEFT JOIN `autingo-159109.rag_datasets.Objeto_Centros` oc
+    ON l.idcentro = CAST(oc.nav_id AS STRING)
+  WHERE (COALESCE(vf.is_pw, FALSE) OR oc.name IN ('Aurgi Web','MT Web'))
+    AND (oc.Companies__name IS NULL OR oc.Companies__name NOT IN ('Aurgi Glass','MotorTown Glass'))
+    [[AND oc.name IN ({{centro}})]]
+    [[AND oc.Companies__name IN ({{ensena}})]]
+),
+sr_link AS (
+  SELECT CAST(inv.nav_id AS STRING) numdoc, CAST(j.service_request_id AS STRING) sr_id
+  FROM `autingo-159109.psql_dcpublic.tpv_orders_invoice` inv
+  JOIN `autingo-159109.psql_dcpublic.tpv_precawebs_servicerequestjob` j ON j.order_id = inv.order_id
+  WHERE inv.nav_id IS NOT NULL
+  UNION DISTINCT
+  SELECT CAST(invoice_number AS STRING), CAST(service_request_id AS STRING)
+  FROM `autingo-159109.psql_dcpublic.logistic_orders`
+  WHERE invoice_number IS NOT NULL AND invoice_number != ''
+),
+sr_link1 AS (SELECT numdoc, MIN(sr_id) sr_id FROM sr_link GROUP BY 1),
+sr AS (
+  SELECT CAST(id AS STRING) sr_id, channel_id, charged
+  FROM `autingo-159109.psql_dcpublic.service_requests`
+),
+doc AS (
+  SELECT
+    w.numdoc,
+    ANY_VALUE(w.fecha)      AS fecha,
+    SUM(w.bil)              AS venta,
+    ANY_VALUE(sl.sr_id)     AS sr_id,
+    ANY_VALUE(sr.channel_id) AS channel_id,
+    ANY_VALUE(sr.charged)   AS charged
+  FROM web w
+  LEFT JOIN sr_link1 sl USING (numdoc)
+  LEFT JOIN sr ON sr.sr_id = sl.sr_id
+  GROUP BY w.numdoc
+),
+fin AS (
+  SELECT
+    numdoc, fecha, venta,
+    CASE WHEN sr_id IS NULL THEN 'Sin solicitud'
+         WHEN channel_id = 1 THEN 'aurgi.com'
+         WHEN channel_id = 2 THEN 'motortown.es'
+         WHEN channel_id = 3 THEN 'Autingo'
+         WHEN channel_id IN (11,13,14,15,6,8) THEN 'Marketplaces'
+         WHEN channel_id = 10 THEN 'Talleres Digitales'
+         ELSE 'Otros' END AS canal,
+    CASE WHEN sr_id IS NULL THEN 'Sin solicitud'
+         WHEN charged THEN 'Pago web'
+         ELSE 'Pago tienda' END AS forma_pago
+  FROM doc
+)
+"""
+
+CARDS = {
+  "total": {
+    "name": "Venta web total (facturacion NAV / modelo 4494)",
+    "sql": BASE + "SELECT ROUND(SUM(venta),0) AS venta_web_eur, COUNT(DISTINCT numdoc) AS documentos FROM fin",
+    "display": "scalar",
+  },
+  "canal": {
+    "name": "Venta web por canal",
+    "sql": BASE + "SELECT canal, ROUND(SUM(venta),0) AS venta_eur, COUNT(DISTINCT numdoc) AS documentos FROM fin GROUP BY canal ORDER BY venta_eur DESC",
+    "display": "bar",
+  },
+  "pago": {
+    "name": "Venta web por forma de pago",
+    "sql": BASE + "SELECT forma_pago, ROUND(SUM(venta),0) AS venta_eur, COUNT(DISTINCT numdoc) AS documentos FROM fin GROUP BY forma_pago ORDER BY venta_eur DESC",
+    "display": "row",
+  },
+  "matriz": {
+    "name": "Venta web: matriz canal x forma de pago",
+    "sql": BASE + "SELECT canal, forma_pago, ROUND(SUM(venta),0) AS venta_eur, COUNT(DISTINCT numdoc) AS documentos FROM fin GROUP BY canal, forma_pago ORDER BY venta_eur DESC",
+    "display": "table",
+  },
+  "evolutivo": {
+    "name": "Venta web mensual por canal",
+    "sql": BASE + "SELECT DATE_TRUNC(fecha, MONTH) AS mes, canal, ROUND(SUM(venta),0) AS venta_eur FROM fin GROUP BY mes, canal ORDER BY mes, venta_eur DESC",
+    "display": "bar",
+  },
+}
+
+TAGS = {
+  "#4494": {"type":"card","name":"#4494","id":"card__4494","display-name":"#4494","card-id":4494},
+  "fecha_desde": {"type":"date","name":"fecha_desde","id":"tag-fecha-desde","display-name":"Fecha desde"},
+  "fecha_hasta": {"type":"date","name":"fecha_hasta","id":"tag-fecha-hasta","display-name":"Fecha hasta"},
+  "centro": {"type":"text","name":"centro","id":"tag-centro","display-name":"Centro"},
+  "ensena": {"type":"text","name":"ensena","id":"tag-ensena","display-name":"Ensena"},
+}
+
+def dq(sql):
+    return {"type":"native","database":6,"native":{"query":sql,"template-tags":TAGS}}
+
+def test_query(sql, params=None):
+    body = dq(sql)
+    body["parameters"] = params or []
+    r = api("POST", "/api/dataset", body)
+    if r.get("error"):
+        print("QUERY ERROR:", r.get("error")); return None
+    cols = [c["name"] for c in r["data"]["cols"]]
+    rows = r["data"]["rows"]
+    return cols, rows
+
+if __name__ == "__main__":
+    which = sys.argv[1] if len(sys.argv) > 1 else "all"
+    # YTD 2026 parameter, used to verify that the totals reconcile
+    p_ytd = [{"type":"date/single","value":"2026-01-01","target":["variable",["template-tag","fecha_desde"]]}]
+    for k, c in CARDS.items():
+        if which != "all" and which != k: continue
+        print(f"\n===== TEST {k}: {c['name']} =====")
+        res = test_query(c["sql"], p_ytd)
+        if res:
+            cols, rows = res
+            print("cols:", cols)
+            for row in rows[:15]: print(" ", row)
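The `BASE` query above relies on Metabase's optional-clause syntax: a `[[ ... ]]` block is kept only when the `{{variable}}` inside it receives a value, which is why the test run can pass `fecha_desde` alone. The following is a minimal sketch of that behaviour, assuming a simplified regex-based expansion; `expand_optional` is a hypothetical helper written for illustration, not Metabase's actual parser, and it does not handle card references such as `{{#4494}}`.

```python
import re


def expand_optional(sql: str, params: dict) -> str:
    """Toy model of optional clauses (hypothetical; not Metabase's real parser)."""

    def keep_or_drop(match):
        clause = match.group(1)
        names = re.findall(r"\{\{\s*(\w+)\s*\}\}", clause)
        # Keep the clause only when every variable inside it has a value.
        if all(params.get(name) is not None for name in names):
            return clause
        return ""

    # Resolve the [[ ... ]] blocks first, then substitute the remaining tags.
    sql = re.sub(r"\[\[(.*?)\]\]", keep_or_drop, sql, flags=re.S)
    return re.sub(r"\{\{\s*(\w+)\s*\}\}", lambda m: params[m.group(1)], sql)


template = (
    "WHERE x >= 1 "
    "[[AND fecha >= {{fecha_desde}}]] "
    "[[AND fecha <= {{fecha_hasta}}]]"
)
# Only fecha_desde is supplied, so the fecha_hasta clause disappears.
result = expand_optional(template, {"fecha_desde": "DATE '2026-01-01'"})
print(result)
```

In the real cards the substitution is done by Metabase itself from the `parameters` list sent to `/api/dataset`; the sketch only illustrates why omitted filters do not break the query.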
@@ -0,0 +1 @@
+{"total": 12367, "canal": 12368, "pago": 12369, "matriz": 12370, "evolutivo": 12371}
@@ -0,0 +1,42 @@
+import json, sys
+sys.path.insert(0, "scratchpad/exf")
+from build import api, BASE, CARDS, TAGS, dq
+
+COLLECTION = 583  # "Claude" (next to 1094)
+
+CUR = {"number_style":"currency","currency":"EUR","currency_style":"symbol","decimals":0}
+
+def viz(kind):
+    if kind == "total":
+        return {"column_settings":{'["name","venta_web_eur"]':CUR},
+                "scalar.field":"venta_web_eur"}
+    if kind == "canal":
+        return {"graph.dimensions":["canal"],"graph.metrics":["venta_eur"],
+                "graph.x_axis.title_text":"Canal","graph.y_axis.title_text":"Venta web (EUR)",
+                "column_settings":{'["name","venta_eur"]':CUR},"graph.show_values":True}
+    if kind == "pago":
+        return {"graph.dimensions":["forma_pago"],"graph.metrics":["venta_eur"],
+                "column_settings":{'["name","venta_eur"]':CUR},"graph.show_values":True}
+    if kind == "matriz":
+        return {"column_settings":{'["name","venta_eur"]':CUR},
+                "table.columns":[
+                  {"name":"canal","enabled":True},{"name":"forma_pago","enabled":True},
+                  {"name":"venta_eur","enabled":True},{"name":"documentos","enabled":True}]}
+    if kind == "evolutivo":
+        return {"graph.dimensions":["mes","canal"],"graph.metrics":["venta_eur"],
+                "stackable.stack_type":"stacked","column_settings":{'["name","venta_eur"]':CUR},
+                "graph.x_axis.title_text":"Mes","graph.y_axis.title_text":"Venta web (EUR)"}
+    return {}
+
+created = {}
+for k, c in CARDS.items():
+    body = {"name": c["name"], "display": c["display"],
+            "dataset_query": dq(c["sql"]),
+            "visualization_settings": viz(k),
+            "collection_id": COLLECTION}
+    r = api("POST", "/api/card", body)
+    created[k] = r["id"]
+    print(f"card {k}: id {r['id']}  {c['name']}")
+
+json.dump(created, open("scratchpad/exf/cards.json","w"))
+print("CARDS:", created)
@@ -0,0 +1 @@
+{"dashboard_id": 1143}
@@ -0,0 +1,54 @@
+import json, sys
+sys.path.insert(0, "scratchpad/exf")
+from build import api
+
+C = json.load(open("scratchpad/exf/cards.json"))
+COLLECTION = 583
+
+# 1) create an empty dashboard
+dash = api("POST", "/api/dashboard", {
+    "name": "Venta Web por Canal y Forma de Pago (facturacion NAV / modelo 4494)",
+    "collection_id": COLLECTION,
+    "description": "Solo venta web (origen precaweb) tomada del modelo 4494 (SUM Base_imponible_linea, facturacion NAV neta), desglosada por canal (channel_id) y forma de pago (pago web vs pago tienda), segun las convenciones del dashboard 1094. Glass excluido. Default: YTD 2026.",
+})
+DID = dash["id"]
+print("dashboard id:", DID)
+
+# 2) dashboard parameters
+PARAMS = [
+  {"id":"p_desde","name":"Fecha desde","slug":"fecha_desde","type":"date/single","default":"2026-01-01"},
+  {"id":"p_hasta","name":"Fecha hasta","slug":"fecha_hasta","type":"date/single"},
+  {"id":"p_centro","name":"Centro","slug":"centro","type":"string/=","sectionId":"string"},
+  {"id":"p_ensena","name":"Ensena","slug":"ensena","type":"string/=","sectionId":"string"},
+]
+
+def mappings(cid):
+    return [
+      {"parameter_id":"p_desde","card_id":cid,"target":["variable",["template-tag","fecha_desde"]]},
+      {"parameter_id":"p_hasta","card_id":cid,"target":["variable",["template-tag","fecha_hasta"]]},
+      {"parameter_id":"p_centro","card_id":cid,"target":["variable",["template-tag","centro"]]},
+      {"parameter_id":"p_ensena","card_id":cid,"target":["variable",["template-tag","ensena"]]},
+    ]
+
+# 3) layout (24-column grid)
+LAYOUT = {
+  "total":     (0, 0, 6, 4),
+  "pago":      (0, 6, 18, 4),
+  "canal":     (4, 0, 12, 7),
+  "matriz":    (4, 12, 12, 7),
+  "evolutivo": (11, 0, 24, 7),
+}
+dashcards = []
+neg = -1
+for k,(row,col,sx,sy) in LAYOUT.items():
+    cid = C[k]
+    dashcards.append({
+      "id": neg, "card_id": cid, "row": row, "col": col, "size_x": sx, "size_y": sy,
+      "series": [], "parameter_mappings": mappings(cid), "visualization_settings": {}
+    })
+    neg -= 1
+
+r = api("PUT", f"/api/dashboard/{DID}", {"dashcards": dashcards, "parameters": PARAMS})
+print("dashcards saved:", len(r.get("dashcards",[])))
+print("URL: https://reports.autingo.es/dashboard/%d" % DID)
+json.dump({"dashboard_id":DID}, open("scratchpad/exf/dash.json","w"))
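The `LAYOUT` tuples in the script above are `(row, col, size_x, size_y)` positions on a 24-column grid. Before sending them to the API it can help to check that no card runs past the grid or overlaps another. This is a minimal sketch under that assumption; `overlaps` and `layout_problems` are hypothetical helpers, and the positions are copied from the script.

```python
from itertools import combinations

GRID_COLS = 24  # grid width stated in the layout comment of the script

# (row, col, size_x, size_y), copied from LAYOUT in the script
LAYOUT = {
    "total": (0, 0, 6, 4),
    "pago": (0, 6, 18, 4),
    "canal": (4, 0, 12, 7),
    "matriz": (4, 12, 12, 7),
    "evolutivo": (11, 0, 24, 7),
}


def overlaps(a, b):
    """True when two (row, col, size_x, size_y) rectangles share any cell."""
    row_a, col_a, w_a, h_a = a
    row_b, col_b, w_b, h_b = b
    return (col_a < col_b + w_b and col_b < col_a + w_a
            and row_a < row_b + h_b and row_b < row_a + h_a)


def layout_problems(layout, grid_cols=GRID_COLS):
    """Return human-readable problems; an empty list means the layout is clean."""
    problems = []
    for name, (_row, col, size_x, _size_y) in layout.items():
        if col < 0 or col + size_x > grid_cols:
            problems.append(f"{name} exceeds the {grid_cols}-column grid")
    for (name_a, rect_a), (name_b, rect_b) in combinations(layout.items(), 2):
        if overlaps(rect_a, rect_b):
            problems.append(f"{name_a} overlaps {name_b}")
    return problems


problems = layout_problems(LAYOUT)
print(problems or "layout OK")
```

The check is local and makes no API calls, so it could run before the `PUT /api/dashboard/{DID}` request.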
@@ -0,0 +1,313 @@
+"""Generates the lineage documentation folder on the Windows Desktop.
+
+From the traced graph (scratchpad/lineage_graph.json) it writes:
+  00_INDICE.txt                         summary + layer map + table of all objects
+  01_marts/<view>.txt                   one per customer_marts view: what it is + lineage tree + SQL
+  02_intermedio_clientes_intel/*.txt    base tables of the customer intelligence pipeline
+  03_producto/*.txt                     product catalog chain (views with SQL + base tables)
+  04_fuentes/*.txt                      source tables (Postgres replica, Navision, images, rates)
+
+All .txt files are written with CRLF so they open cleanly in Windows Notepad.
+"""
+import json
+import os
+import textwrap
+
+DEST = "/mnt/c/Users/egutierrez/Desktop/linaje_customer_marts"
+DATA = json.load(open("scratchpad/lineage_graph.json"))
+G = DATA["graph"]
+PROJECT = DATA["project"]
+
+# ---------------------------------------------------------------------------
+# Descriptions ("what it is") per object. The SQL/DDL included in each file is the
+# source of truth; these lines are a summary to orient the reader.
+# ---------------------------------------------------------------------------
+DESC = {
+    # ---- customer_marts (final marts, grain = persona_id / customer) ----
+    "customer_marts.customer_profile":
+        "Ficha maestra 360 del cliente: identidad + features agregadas + score CLV + segmento. Vista de perfil que consolida todo lo demas.",
+    "customer_marts.customer_monetary":
+        "Metricas monetarias del cliente (gasto total, ticket medio, recencia/frecuencia/valor). Componente M del RFM.",
+    "customer_marts.customer_channel":
+        "Canal del cliente: canal preferido transaccional, mix aurgi/motortown/web/servicio, canal de entrada (canal8) y fuentes de origen.",
+    "customer_marts.customer_contactability":
+        "Contactabilidad del cliente: disponibilidad de email/telefono y consentimientos, a partir de la dimension persona + features + segmento.",
+    "customer_marts.customer_category_spend":
+        "Gasto del cliente desglosado por categoria de producto, a partir de la tabla de hechos de transaccion.",
+    "customer_marts.customer_brand_affinity":
+        "Afinidad de marca del cliente: que marcas compra y con que peso, cruzando transacciones con el catalogo de producto (Objeto_productos).",
+    "customer_marts.customer_product":
+        "Productos comprados por el cliente (detalle de que ha adquirido) desde la tabla de hechos de transaccion.",
+    "customer_marts.customer_store_spend":
+        "Gasto del cliente por centro/tienda desde la tabla de hechos de transaccion.",
+    "customer_marts.customer_temporal":
+        "Patrones temporales de compra del cliente (estacionalidad, recencia, frecuencia) desde transacciones + features.",
+    "customer_marts.customer_vehicles":
+        "Vehiculos asociados al cliente: dimension vehiculo + features de vehiculo + mapping N:N persona-vehiculo.",
+    "customer_marts.customer_payment_method":
+        "Metodo de pago del cliente reconstruido desde los pedidos TPV (orders/invoice/payment/payment_types).",
+    "customer_marts.customer_promo_usage":
+        "Uso de promociones/descuentos por el cliente (pedidos con descuento) desde transacciones + pedidos TPV + segmento.",
+    "customer_marts.customer_promo_tolerance":
+        "Tolerancia del cliente a promociones: respuesta a campanas + sensibilidad a descuentos en pedidos.",
+    "customer_marts.customer_predictive":
+        "Senales predictivas del cliente: score CLV, proxima mejor accion (recomendaciones) y segmento.",
+
+    # ---- clientes_intel (intermediate layer; base tables of the customer intelligence pipeline) ----
+    "clientes_intel.dim_persona":
+        "Dimension PERSONA: identidad de cliente consolidada (una fila por persona_id). Nucleo de la doble identidad persona+vehiculo.",
+    "clientes_intel.dim_vehiculo":
+        "Dimension VEHICULO: una fila por vehiculo (matricula/bastidor) con sus atributos.",
+    "clientes_intel.fact_transaccion":
+        "Tabla de HECHOS de transaccion: linea/venta por cliente. Base de casi todos los marts monetarios y de producto.",
+    "clientes_intel.fact_campana_respuesta":
+        "Tabla de HECHOS de respuesta a campanas de marketing (envio/apertura/conversion) por cliente.",
+    "clientes_intel.feat_cliente_persona":
+        "Features agregadas a nivel PERSONA (RFM, mix de canal, indicadores derivados). Alimenta perfil, monetary, channel, temporal, contactability.",
+    "clientes_intel.feat_cliente_vehiculo":
+        "Features agregadas a nivel VEHICULO. Alimenta customer_vehicles.",
+    "clientes_intel.seg_cliente_360":
+        "Segmentacion 360 del cliente (segmentos de negocio / clusters). Alimenta perfil, channel, contactability, predictive, promo_usage.",
+    "clientes_intel.score_clv":
+        "Score de valor de vida del cliente (CLV). Alimenta perfil y predictive.",
+    "clientes_intel.reco_acciones":
+        "Recomendaciones / proxima mejor accion (NBA) por cliente. Alimenta customer_predictive.",
+    "clientes_intel.map_persona_canal8":
+        "Mapeo persona -> canal8 (canal de entrada). Puente para customer_channel.",
+    "clientes_intel.map_persona_fuente":
+        "Mapeo persona -> fuente(s) de origen (de que sistema/canal proviene el cliente). Puente para customer_channel.",
+    "clientes_intel.map_persona_vehiculo":
+        "Mapeo N:N persona <-> vehiculo. Puente para customer_vehicles.",
+
+    # ---- product catalog chain ----
+    "anjana_bi_datamart.Objeto_productos":
+        "Vista maestra de PRODUCTO: catalogo Navision + categorias CGQ + imagenes + tasa/margen por material. Se usa para afinidad de marca.",
+    "anjana_bi_datamart.Cruce_16_07_cgq":
+        "Tabla de cruce de categorias CGQ (categoria/subcategoria/tipo) usada por Objeto_productos.",
+    "claude_bi.productos_tasa_mat":
+        "Tabla de tasa/margen por material de producto. La consume Objeto_productos.",
+    "external_datasets.product_object_images":
+        "Imagenes de producto (imagen principal/secundaria). Dataset externo. La consume Objeto_productos.",
+    "stg_anjana_bi.producto":
+        "Staging de producto: cruza item de Navision con equivalencias de matriculas (SAF). Capa de preparacion sobre las tablas de SQL Server.",
+
+    # ---- base sources ----
+    "psql_dcpublic.products":
+        "Catalogo de productos. Replica en BigQuery de la BBDD Postgres ANJANA (DCPublic).",
+    "psql_dcpublic.product_categories":
+        "Categorias de producto. Replica Postgres ANJANA (DCPublic).",
+    "psql_dcpublic.product_groups":
+        "Grupos de producto. Replica Postgres ANJANA (DCPublic).",
+    "psql_dcpublic.tpv_orders_order":
+        "Pedidos TPV (cabecera de pedido). Replica Postgres ANJANA (DCPublic).",
+    "psql_dcpublic.tpv_orders_orderitem":
+        "Lineas de pedido TPV. Replica Postgres ANJANA (DCPublic).",
+    "psql_dcpublic.tpv_orders_invoice":
+        "Facturas TPV. Replica Postgres ANJANA (DCPublic).",
+    "psql_dcpublic.tpv_orders_payment":
+        "Pagos de pedidos TPV. Replica Postgres ANJANA (DCPublic).",
+    "psql_dcpublic.tpv_payment_types":
+        "Tipos de pago TPV (catalogo). Replica Postgres ANJANA (DCPublic).",
+    "mssql2022_dbo.item":
+        "Catalogo de articulos de Navision (SQL Server 2022, esquema dbo).",
+    "mssql2022_dbo.equivalencias_matriculas_saf":
+        "Equivalencias de matriculas (SAF) en Navision (SQL Server 2022, esquema dbo).",
+}
+
+TYPE_ES = {
+    "VIEW": "VISTA (tiene SQL propio)",
+    "MATERIALIZED VIEW": "VISTA MATERIALIZADA (tiene SQL propio)",
+    "BASE TABLE": "TABLA BASE (datos materializados; sin SQL de definicion, solo esquema)",
+    "EXTERNAL": "TABLA EXTERNA",
+    "UNKNOWN": "DESCONOCIDO",
+}
+
+# Destination folder per object.
+def folder_of(key: str) -> str:
+    ds = key.split(".", 1)[0]
+    if ds == "customer_marts":
+        return "01_marts"
+    if ds == "clientes_intel":
+        return "02_intermedio_clientes_intel"
+    if ds in ("anjana_bi_datamart", "claude_bi", "external_datasets", "stg_anjana_bi"):
+        return "03_producto"
+    return "04_fuentes"
+
+def fname_of(key: str) -> str:
+    return key.replace(".", "__") + ".txt"
+
+def relpath_of(key: str) -> str:
+    return f"{folder_of(key)}/{fname_of(key)}"
+
+def desc_of(key: str) -> str:
+    return DESC.get(key, "(sin descripcion)")
+
+# ---------------------------------------------------------------------------
+# Recursive lineage tree (for the marts).
+# ---------------------------------------------------------------------------
+def render_tree(key: str, prefix: str | None = None, is_last: bool = True, seen=None) -> list[str]:
+    if seen is None:
+        seen = set()
+    tag = {"VIEW": "[vista]", "MATERIALIZED VIEW": "[vista mat]",
+           "BASE TABLE": "[TABLA BASE/FUENTE]", "EXTERNAL": "[externa]",
+           "UNKNOWN": "[?]"}.get(G.get(key, {"type": "UNKNOWN"})["type"], "")
+
+    if prefix is None:  # root
+        lines = [f"{key}  {tag}"]
+        child_prefix = ""
+    else:
+        connector = "└── " if is_last else "├── "
+        lines = [f"{prefix}{connector}{key}  {tag}"]
+        child_prefix = prefix + ("    " if is_last else "│   ")
+
+    if key in seen:
+        lines[-1] += "  (ya expandido arriba)"
+        return lines
+    seen.add(key)
+    refs = G.get(key, {"refs": []}).get("refs", [])
+    for i, r in enumerate(refs):
+        lines += render_tree(r, child_prefix, i == len(refs) - 1, seen)
+    return lines
+
+# ---------------------------------------------------------------------------
+# Writing.
+# ---------------------------------------------------------------------------
+def w(path: str, text: str):
+    full = os.path.join(DEST, path)
+    os.makedirs(os.path.dirname(full), exist_ok=True)
+    with open(full, "w", newline="\r\n", encoding="utf-8") as f:
+        f.write(text)
+
+SEP = "=" * 78 + "\n"
+
+def object_file(key: str, include_tree: bool) -> str:
+    node = G[key]
+    out = []
+    out.append(SEP)
+    out.append(f"OBJETO : {PROJECT}.{key}\n")
+    out.append(f"TIPO   : {TYPE_ES.get(node['type'], node['type'])}\n")
+    out.append(f"DATASET: {key.split('.',1)[0]}\n")
+    out.append(SEP)
+    out.append("\nQUE ES\n------\n")
+    out.append(textwrap.fill(desc_of(key), width=78) + "\n")
+
+    if node.get("refs"):
+        out.append("\nDEPENDE DIRECTAMENTE DE\n-----------------------\n")
+        for r in node["refs"]:
+            out.append(f"  - {PROJECT}.{r}   -> ver {relpath_of(r)}\n")
+
+    if include_tree:
+        out.append("\nLINAJE COMPLETO (hasta la fuente)\n---------------------------------\n")
+        out.append("\n".join(render_tree(key)) + "\n")
+
+    out.append("\nSQL / DDL\n---------\n")
+    if node["type"] in ("VIEW", "MATERIALIZED VIEW"):
+        out.append("(Definicion de la vista. Este es el SQL que puedes copiar.)\n\n")
+    else:
+        out.append("(Tabla base: no tiene SQL de transformacion. Se incluye el CREATE TABLE\n"
+                    " con el esquema de columnas para referencia.)\n\n")
+    out.append(node["ddl"].strip() + "\n")
+    return "".join(out)
+
+# Marts: include the lineage tree.
+marts = sorted(k for k in G if k.startswith("customer_marts."))
+for k in marts:
+    w(f"01_marts/{fname_of(k)}", object_file(k, include_tree=True))
+
+# Remaining objects: no tree (or a tree only for views that have dependencies).
+for k in sorted(G):
+    if k.startswith("customer_marts."):
+        continue
+    include_tree = G[k]["type"] in ("VIEW", "MATERIALIZED VIEW") and bool(G[k].get("refs"))
+    w(relpath_of(k), object_file(k, include_tree=include_tree))
+
+# ---------------------------------------------------------------------------
+# INDEX.
+# ---------------------------------------------------------------------------
+idx = []
+idx.append(SEP)
+idx.append("INDICE - LINAJE DEL DATASET customer_marts\n")
+idx.append(f"Proyecto BigQuery: {PROJECT}\n")
+idx.append(SEP)
+idx.append("""
+QUE ES ESTA CARPETA
+-------------------
+Documenta, para cada tabla/vista del dataset `customer_marts`, de donde salen sus
+datos: la cadena completa desde el mart final hasta las tablas fuente, con el SQL
+de cada vista listo para copiar y compartir.
+
+Cada objeto tiene su propio .txt con:
+  - QUE ES (resumen de una linea; la SQL es la fuente de verdad)
+  - DE QUE DEPENDE (dependencias directas, con la ruta a su archivo)
+  - LINAJE COMPLETO (arbol hasta la fuente) -- solo en los marts y vistas
+  - SQL / DDL (el codigo: definicion de la vista, o el esquema si es tabla base)
+
+MAPA DE CAPAS
+-------------
+  customer_marts (VISTAS finales, grano = cliente/persona_id)
+        |
+        v
+  clientes_intel (TABLAS BASE: capa intermedia construida por el pipeline de
+        |          inteligencia de clientes -- dim_*, feat_*, seg_*, score_*,
+        |          reco_*, fact_*, map_*)
+        v
+  Fuentes:
+     - psql_dcpublic.*   Replica en BigQuery de la BBDD Postgres ANJANA (TPV + catalogo)
+     - anjana_bi_datamart / claude_bi / external_datasets / stg_anjana_bi
+                          Cadena de catalogo de PRODUCTO (Objeto_productos y sus fuentes)
+     - mssql2022_dbo.*   Navision (SQL Server 2022, esquema dbo)
+
+NOTA: las tablas de `clientes_intel` son TABLAS BASE: no son vistas, sino tablas que
+un pipeline reconstruye cada dia con sentencias CREATE TABLE AS SELECT (CTAS). Su
+esquema esta en 02_intermedio_clientes_intel/. El SQL REAL que las construye (y que
+baja hasta TPV / customers / users / Navision / Salesforce) esta en la carpeta
+05_construccion_clientes_intel/ -- ver tambien 00b_FUENTES_DE_CLIENTE.txt.
+
+""")
+
+idx.append(SEP)
+idx.append("CARPETAS\n")
+idx.append(SEP)
+idx.append("""
+  01_marts/                      Las 14 vistas de customer_marts (con arbol de linaje)
+  02_intermedio_clientes_intel/  Las 12 tablas base intermedias (esquema)
+  03_producto/                   Cadena de catalogo de producto (vistas + bases)
+  04_fuentes/                    Tablas fuente (replica Postgres, Navision, imagenes, tasas)
+  05_construccion_clientes_intel/  El SQL (CTAS) que construye cada tabla de clientes_intel
+  00b_FUENTES_DE_CLIENTE.txt     Que consulta lee cada fuente de cliente (TPV/customers/
+                                 users/Navision/Salesforce)
+
+""")
+
+def index_block(title, keys):
+    lines = [SEP, title + "\n", SEP, "\n"]
+    for k in keys:
+        t = {"VIEW": "vista", "MATERIALIZED VIEW": "vista_mat", "BASE TABLE": "tabla",
+             "EXTERNAL": "externa", "UNKNOWN": "?"}.get(G[k]["type"], "")
+        lines.append(f"[{t:9s}] {k}\n")
+        lines.append(f"            {desc_of(k)}\n")
+        lines.append(f"            archivo: {relpath_of(k)}\n\n")
+    return "".join(lines)
+
+idx.append(index_block("1) MARTS FINALES (customer_marts)", marts))
+idx.append(index_block("2) CAPA INTERMEDIA (clientes_intel)",
+                       sorted(k for k in G if k.startswith("clientes_intel."))))
+idx.append(index_block("3) CADENA DE PRODUCTO",
+                       sorted(k for k in G if folder_of(k) == "03_producto")))
+idx.append(index_block("4) FUENTES BASE",
+                       sorted(k for k in G if folder_of(k) == "04_fuentes")))
+
+w("00_INDICE.txt", "".join(idx))
+
+# Final count (only .txt files, to match the message printed below)
+n_files = sum(f.endswith(".txt") for _, _, files in os.walk(DEST) for f in files)
+print(f"Escrito en: {DEST}")
+print(f"Archivos .txt generados: {n_files}")
+print("Estructura:")
+for root, dirs, files in sorted(os.walk(DEST)):
+    rel = os.path.relpath(root, DEST)
+    if rel == ".":
+        for f in sorted(files):
+            print(f"  {f}")
+    else:
+        print(f"  {rel}/  ({len(files)} archivos)")
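The recursive `render_tree` above is easier to follow on a small input. The sketch below re-implements the same traversal in a self-contained form, assuming the graph is passed as an argument and omitting the type tags; the toy graph and its object names (`mart.a`, `mid.x`, `mid.y`, `src.t`) are hypothetical and exist only to show how a shared dependency is expanded once and then marked on later visits.

```python
def render_tree(graph, key, prefix=None, is_last=True, seen=None):
    """Render the dependencies of `key` as box-drawing tree lines."""
    if seen is None:
        seen = set()
    if prefix is None:  # root node: printed without a connector
        lines = [key]
        child_prefix = ""
    else:
        connector = "└── " if is_last else "├── "
        lines = [f"{prefix}{connector}{key}"]
        child_prefix = prefix + ("    " if is_last else "│   ")

    if key in seen:
        # Shared dependency: do not expand it a second time.
        lines[-1] += "  (already expanded above)"
        return lines
    seen.add(key)

    refs = graph.get(key, {}).get("refs", [])
    for i, ref in enumerate(refs):
        lines += render_tree(graph, ref, child_prefix, i == len(refs) - 1, seen)
    return lines


# Hypothetical toy graph: two intermediate tables share one source table.
toy_graph = {
    "mart.a": {"refs": ["mid.x", "mid.y"]},
    "mid.x": {"refs": ["src.t"]},
    "mid.y": {"refs": ["src.t"]},
    "src.t": {"refs": []},
}

tree = render_tree(toy_graph, "mart.a")
print("\n".join(tree))
```

Sharing one `seen` set across the whole traversal keeps the output finite even if the graph contained a cycle, at the cost of showing each shared subtree only under its first parent.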
@@ -0,0 +1,164 @@
+"""Generates 05_construccion_clientes_intel/ (the CTAS SQL for each clientes_intel table)
+and 00b_FUENTES_DE_CLIENTE.txt (a map from each customer source to the query that reads it).
+
+Data sources: scratchpad/intel_build.json (build SQL captured from
+INFORMATION_SCHEMA.JOBS) and scratchpad/intel_lineage.json (tables involved).
+"""
+import json
+import os
+import textwrap
+
+DEST = "/mnt/c/Users/egutierrez/Desktop/linaje_customer_marts"
+PROJECT = "autingo-159109"
+builds = json.load(open("scratchpad/intel_build.json"))
+lin = json.load(open("scratchpad/intel_lineage.json"))
+
+# Tables whose build SQL we write out: those in the customer_marts lineage
+# plus those that read customer/Salesforce sources.
+EXTRA = ["seg_vega_persona", "fact_campana_respuesta__sfnew"]
+want = sorted(set(lin["intel_involved"]) | set(EXTRA))
+want = [t for t in want if t in builds]  # solo las que tienen SQL capturado
+
+DESC = {
+    "_persona_records":
+        "IDENTIDAD DEL CLIENTE (nucleo). UNION de 7 fuentes -> normaliza DNI/NIE/CIF, email y "
+        "telefono -> resuelve persona_id (FARM_FINGERPRINT de persona_key) con nivel de confianza. "
+        "AQUI es donde se juntan TPV customers, customers web, OTR, Navision, citaprevia, users y "
+        "Salesforce contacts_latest.",
+    "dim_persona":
+        "Dimension PERSONA final: una fila por persona_id, elegida desde _persona_records "
+        "(prioriza el mejor registro por fuente/confianza) + banderas de contacto.",
+    "dim_vehiculo":
+        "Dimension VEHICULO: una fila por vehiculo (matricula/bastidor) desde TPV vehicles, OTR, "
+        "citaprevia matriculas y calibrado de ano de matricula.",
+    "map_persona_fuente":
+        "Mapeo persona -> fuente(s) de origen (tpv/web/otr/navision/citaprevia/users/salesforce). "
+        "Registra de que sistemas proviene cada persona.",
+    "map_persona_vehiculo":
+        "Mapeo N:N persona <-> vehiculo (quien conduce/posee que coche) desde OTR, TPV vehicleowner "
+        "y citaprevia matriculas.",
+    "map_persona_canal8":
+        "Mapeo persona -> canal8 (canal de entrada del cliente).",
+    "fact_transaccion":
+        "Tabla de HECHOS de transaccion (linea/venta por persona). Base de los marts monetarios.",
+    "fact_visita":
+        "Tabla de HECHOS de visita (visitas del cliente al taller/tienda).",
+    "fact_campana_respuesta":
+        "HECHOS de respuesta a campanas: cruza envios/aperturas/clics/sms de Salesforce con personas.",
+    "fact_campana_respuesta__sfnew":
+        "Variante de fact_campana_respuesta con el esquema nuevo de Salesforce (email_sent/opened/clicked/sms).",
+    "feat_cliente_persona":
+        "Features agregadas por PERSONA (RFM, mix de canal, ticket medio, margen, recencia...).",
+    "feat_cliente_vehiculo":
+        "Features agregadas por VEHICULO.",
+    "seg_cliente_360":
+        "Segmentacion 360 del cliente (segmentos/clusters de negocio).",
+    "seg_vega_persona":
+        "Segmentacion VEGA por persona (contactabilidad/valor); lee fuentes de cliente para calcular "
+        "disponibilidad de contacto.",
+    "seg_cluster_persona":
+        "Clustering de personas (asignacion de cluster) que alimenta la segmentacion.",
+    "reco_acciones":
+        "Recomendaciones / proxima mejor accion (NBA) por cliente.",
+    "data_points_contacto":
+        "Puntos de dato de contacto (email/telefono) consolidados y calidad por persona.",
+    "_margen_rate_producto":
+        "Tasa de margen por producto (auxiliar para features monetarias).",
+    "_plate_year_calib":
+        "Calibrado del ano a partir de la matricula (auxiliar para dim_vehiculo).",
+    "dim_cp_provincia":
+        "Diccionario codigo postal -> provincia/CCAA.",
+    "tipologia_cliente":
+        "Tipologia de cliente (clasificacion de negocio).",
+}
+
+# Short description of each customer source.
+SRC_DESC = {
+    "psql_dcpublic.tpv_customers": "Clientes del TPV (mostrador). Replica Postgres ANJANA (DCPublic).",
+    "psql_dcpublic.customers": "Clientes web (e-commerce). Replica Postgres ANJANA (DCPublic).",
+    "psql_dcpublic.otr_customers": "Clientes de OTR (ordenes de reparacion/taller). Replica Postgres ANJANA.",
+    "psql_dcpublic.users": "Usuarios (cuentas). Replica Postgres ANJANA (DCPublic).",
+    "mssql2022_dbo.anjana_customer": "Cliente de NAVISION (SQL Server 2022, esquema dbo). Campos no_/e_mail/movil/name/post_code.",
+    "salesforce_ew1.contacts_latest": "Contactos de SALESFORCE (ultima version). Dataset en europe-west1.",
+    "salesforce_ew1.email_sent": "Envios de email de Salesforce (Marketing Cloud).",
+    "salesforce_ew1.email_opened": "Aperturas de email de Salesforce.",
+    "salesforce_ew1.email_clicked": "Clics de email de Salesforce.",
+    "salesforce_ew1.sms": "SMS de Salesforce.",
+    "citaprevia_aurphcp.clientes": "Clientes de CITA PREVIA (aurphcp).",
+    "citaprevia_aurphcp.clientes_matriculas": "Matriculas por cliente en cita previa.",
+    "psql_dcpublic.tpv_vehicles_vehicle": "Vehiculos del TPV. Replica Postgres ANJANA.",
+    "psql_dcpublic.tpv_vehicles_vehicleowner": "Propietarios de vehiculo del TPV (N:N). Replica Postgres ANJANA.",
+}
+
+CUST_SOURCES = list(SRC_DESC.keys())
+
+SEP = "=" * 78 + "\n"
+
+def w(path, text):
+    full = os.path.join(DEST, path)
+    os.makedirs(os.path.dirname(full), exist_ok=True)
+    with open(full, "w", newline="\r\n", encoding="utf-8") as f:
+        f.write(text)
+
+def build_file(tbl):
+    b = builds[tbl]
+    out = [SEP, f"OBJETO : {PROJECT}.clientes_intel.{tbl}\n",
+           f"TIPO   : TABLA BASE construida por {b['stmt']} (se reconstruye periodicamente)\n",
+           f"ULTIMA EJECUCION CAPTURADA: {b['last_run']}\n", SEP,
+           "\nQUE ES\n------\n",
+           textwrap.fill(DESC.get(tbl, "(sin descripcion)"), width=78) + "\n"]
+    if b["refs"]:
+        out.append("\nLEE DE (tablas fuente / intermedias)\n------------------------------------\n")
+        for r in b["refs"]:
+            note = "   << FUENTE DE CLIENTE" if r in SRC_DESC else ""
+            out.append(f"  - {PROJECT}.{r}{note}\n")
+    out.append("\nSQL DE CONSTRUCCION (copiable)\n------------------------------\n\n")
+    out.append(b["query"].strip() + "\n")
+    return "".join(out)
+
+for t in want:
+    w(f"05_construccion_clientes_intel/{t}.txt", build_file(t))
+
+# 00b_FUENTES_DE_CLIENTE.txt: customer source -> clientes_intel tables that read it
+f = [SEP, "FUENTES DE CLIENTE  ->  QUE CONSULTA DE clientes_intel LAS USA\n", SEP,
+     "\nResponde a: de donde salen los clientes (TPV, web, OTR, Navision, Salesforce, cita\n"
+     "previa) y en que consulta se juntan. El punto de union de identidades es\n"
+     "_persona_records (ver 05_construccion_clientes_intel/_persona_records.txt).\n\n"]
+
+f.append(SEP + "RESUMEN: LO QUE PEDISTE\n" + SEP + "\n")
+mapping = [
+    ("TPV customers", "psql_dcpublic.tpv_customers"),
+    ("customers (web)", "psql_dcpublic.customers"),
+    ("customers (OTR / taller)", "psql_dcpublic.otr_customers"),
+    ("users", "psql_dcpublic.users"),
+    ("customer de NAVISION", "mssql2022_dbo.anjana_customer"),
+    ("SALESFORCE (contactos)", "salesforce_ew1.contacts_latest"),
+]
+for label, src in mapping:
+    f.append(f"  {label:26s} -> {PROJECT}.{src}\n")
+f.append("\n  SI: tenemos Salesforce. El dataset es `salesforce_ew1` (europe-west1):\n"
+         "      contactos en contacts_latest; marketing en email_sent/opened/clicked y sms.\n\n")
+
+for src in CUST_SOURCES:
+    consumers = sorted(t for t, b in builds.items() if src in b["refs"])
+    f.append(SEP)
+    f.append(f"{PROJECT}.{src}\n")
+    f.append(SEP)
+    f.append(f"  {SRC_DESC[src]}\n")
+    f.append("  La leen estas tablas de clientes_intel (con su SQL en 05_construccion_...):\n")
+    if consumers:
+        for t in consumers:
+            star = "  [SQL disponible]" if t in want else ""
+            f.append(f"     - {t}  ({builds[t]['stmt']}){star}\n")
+    else:
+        f.append("     (ninguna la referencia directamente)\n")
+    f.append("\n")
+
+w("00b_FUENTES_DE_CLIENTE.txt", "".join(f))
+
+print("Generado:")
+print(f"  05_construccion_clientes_intel/  -> {len(want)} archivos con SQL de construccion")
+print("  00b_FUENTES_DE_CLIENTE.txt")
+print("\nTablas con SQL de construccion escrito:")
+for t in want:
+    print(f"  - {t}")
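The source-to-consumer lookup that feeds `00b_FUENTES_DE_CLIENTE.txt` can be sketched as a pure function, which makes it checkable without the JSON files. This is a minimal sketch, assuming `builds` has the `{table: {"refs": [...]}}` shape used in the script above; the name `consumers_by_source` and the sample data are hypothetical.

```python
def consumers_by_source(builds, sources):
    """Map each source to the sorted list of tables whose refs include it."""
    return {
        src: sorted(t for t, b in builds.items() if src in b.get("refs", []))
        for src in sources
    }


# Illustrative data only; not taken from intel_build.json.
sample_builds = {
    "_persona_records": {"refs": ["psql_dcpublic.customers", "psql_dcpublic.users"]},
    "seg_vega_persona": {"refs": ["psql_dcpublic.customers"]},
    "dim_persona": {"refs": ["clientes_intel._persona_records"]},
}
result = consumers_by_source(
    sample_builds, ["psql_dcpublic.customers", "salesforce_ew1.sms"]
)
print(result)
```

A source with no consumers maps to an empty list, which corresponds to the "(ninguna la referencia directamente)" branch in the script.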
@@ -0,0 +1,126 @@
+"""Generates 00c_VERIFICACION.txt (lineage completeness check) and
+06_otros_outputs_clientes_intel/ (SQL for the clientes_intel tables that do NOT end up
+in customer_marts, so that none is left behind).
+"""
+import json
+import os
+import textwrap
+
+DEST = "/mnt/c/Users/egutierrez/Desktop/linaje_customer_marts"
+PROJECT = "autingo-159109"
+builds = json.load(open("scratchpad/intel_build.json"))
+lin = json.load(open("scratchpad/intel_lineage.json"))
+involved = set(lin["intel_involved"])
+
+# Full clientes_intel catalog (40 objects), reconstructed: involved + known leftovers.
+LEFTOVER = [
+    "_presupuesto_persona", "_veh_cluster_feat", "_veh_tec_feat", "audit_persona_divergencias",
+    "calidad_email_snapshot", "f0_audit_keys", "fact_impacto_campana", "map_mutualista_particular",
+    "reco_promo_personalizada", "reco_promo_segmento", "rpt_campana", "rpt_campana_lift",
+    "rpt_campana_usuario", "rpt_impacto_persona", "seg_audiencia", "seg_vega_persona",
+    "sf_contact_map", "tipologia_cliente_resumen", "veh_cluster",
+]
+
+# Classification by purpose (where each leftover goes).
+CATEGORY = {
+    "rpt_campana": "Informe de campanas (BI / dashboards de marketing)",
+    "rpt_campana_lift": "Informe de campanas: lift (BI / dashboards)",
+    "rpt_campana_usuario": "Informe de campanas por usuario (BI / dashboards)",
+    "rpt_impacto_persona": "Informe de impacto por persona (BI / dashboards)",
+    "fact_impacto_campana": "Hechos de impacto de campana (base de los informes)",
+    "reco_promo_personalizada": "Recomendacion de promo personalizada (activacion)",
+    "reco_promo_segmento": "Recomendacion de promo por segmento (activacion)",
+    "seg_audiencia": "Audiencias para activacion (probable push a Salesforce/Marketing)",
+    "sf_contact_map": "Mapa de contactos Salesforce (sincronizacion de IDs)",
+    "audit_persona_divergencias": "Auditoria de calidad: divergencias en resolucion de persona",
+    "calidad_email_snapshot": "Auditoria de calidad: snapshot de emails",
+    "f0_audit_keys": "Auditoria de claves (control interno del pipeline)",
+    "_presupuesto_persona": "Auxiliar: presupuestos por persona (interim)",
+    "_veh_cluster_feat": "Auxiliar: features para clustering de vehiculo (interim)",
+    "_veh_tec_feat": "Auxiliar: features tecnicas de vehiculo (interim)",
+    "veh_cluster": "Clustering de vehiculo (resultado; no lo usan los marts hoy)",
+    "tipologia_cliente_resumen": "Resumen de tipologia de cliente (BI)",
+    "map_mutualista_particular": "Vista auxiliar: mapa mutualista/particular",
+    "seg_vega_persona": "Segmentacion VEGA por persona (contactabilidad; lee fuentes de cliente)",
+}
+
+SEP = "=" * 78 + "\n"
+
+def w(path, text):
+    full = os.path.join(DEST, path)
+    os.makedirs(os.path.dirname(full), exist_ok=True)
+    with open(full, "w", newline="\r\n", encoding="utf-8") as f:
+        f.write(text)
+
+# --- 06: SQL for the leftovers that have a captured build ---
+written = []
+for t in LEFTOVER:
+    b = builds.get(t)
+    if not b:
+        continue
+    out = [SEP, f"OBJETO : {PROJECT}.clientes_intel.{t}\n",
+           f"TIPO   : {b['stmt']}   (NO alimenta customer_marts)\n",
+           f"ULTIMA EJECUCION CAPTURADA: {b['last_run']}\n", SEP,
+           "\nQUE ES / A DONDE VA\n-------------------\n",
+           textwrap.fill(CATEGORY.get(t, "(sin clasificar)"), width=78) + "\n"]
+    if b["refs"]:
+        out.append("\nLEE DE\n------\n")
+        for r in b["refs"]:
+            out.append(f"  - {PROJECT}.{r}\n")
+    out.append("\nSQL DE CONSTRUCCION (copiable)\n------------------------------\n\n")
+    out.append(b["query"].strip() + "\n")
+    w(f"06_otros_outputs_clientes_intel/{t}.txt", "".join(out))
+    written.append(t)
+
+# --- 00c: completeness check ---
+v = [SEP, "VERIFICACION DE COMPLETITUD DEL LINAJE\n", SEP, "\n"]
+v.append("PREGUNTA: todo esto acaba en customer_marts? Comprobado.\n\n")
+v.append("""RESPUESTA CORTA
+---------------
+La cadena customer_marts -> fuentes esta COMPLETA (todas las referencias resueltas,
+0 tablas sin identificar). PERO customer_marts NO es el unico destino: es UNO de los
+consumidores de la capa clientes_intel.
+
+  - clientes_intel tiene 40 objetos.
+  - 21 de ellos alimentan (directa o indirectamente) las 14 vistas de customer_marts.
+  - 19 NO van a customer_marts: son OTRAS salidas del mismo pipeline (informes de
+    campana, recomendaciones de promo, audiencias, auditorias, auxiliares).
+
+El unico dataset MODELADO que lee clientes_intel es customer_marts. El resto de lo que
+lee clientes_intel y customer_marts son consultas de BI / ad-hoc (tablas temporales
+_hexhash / anon...), es decir Metabase u otros lo consumen directamente. En ese sentido
+customer_marts SI es terminal en el modelo (aguas abajo solo hay BI).
+
+""")
+
+v.append(SEP + f"1) LAS {len(involved)} TABLAS DE clientes_intel QUE SI ALIMENTAN customer_marts\n" + SEP + "\n")
+for t in sorted(involved):
+    b = builds.get(t, {})
+    v.append(f"  - {t}  ({b.get('stmt','(sin job)')})\n")
+
+v.append("\n" + SEP + f"2) LAS {len(LEFTOVER)} TABLAS DE clientes_intel QUE NO VAN A customer_marts\n" + SEP + "\n")
+v.append("   (SQL de cada una en 06_otros_outputs_clientes_intel/)\n\n")
+for t in LEFTOVER:
+    sql_note = "" if t in written else "   [sin SQL de job capturado]"
+    v.append(f"  - {t:28s} {CATEGORY.get(t,'')}{sql_note}\n")
+
+v.append("\n" + SEP + "3) FUENTES BASE ALCANZADAS (fin del linaje)\n" + SEP + "\n")
+v.append("   Fuera de clientes_intel, el pipeline lee de:\n\n")
+for s in sorted(lin["external_sources"]):
+    v.append(f"  - {PROJECT}.{s}\n")
+
+v.append("\n" + SEP + "4) NOTAS DE COBERTURA\n" + SEP + "\n")
+v.append("""  - score_clv y seg_cluster_vehiculo: usadas por customer_marts pero sin CTAS reciente
+    en el historial de jobs (son modelos ML / cargas antiguas). Su esquema esta en
+    02_intermedio_clientes_intel/; no hay un SQL de un solo job que las reconstruya.
+  - El SQL de construccion se tomo del ULTIMO job exitoso de cada tabla
+    (INFORMATION_SCHEMA.JOBS, region europe-west1, ventana 120 dias). Si una tabla se
+    reconstruye con otra logica fuera de esa ventana, no se captura aqui.
+  - customer_marts: 14 vistas = el dataset entero (no falta ninguna).
+""")
+
+w("00c_VERIFICACION.txt", "".join(v))
+
+print(f"06_otros_outputs_clientes_intel/ -> {len(written)} archivos")
+print("00c_VERIFICACION.txt -> escrito")
+print("\nLeftovers sin SQL capturado:", [t for t in LEFTOVER if t not in written] or "ninguno")
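The report above states its counts (40 objects, 21 feeding the marts, 19 not) as literals. One way to keep such figures honest is to derive both groups from a catalog and fail loudly on a mismatch. This is a minimal sketch, assuming a full list of `clientes_intel` table names is available from somewhere (for example `INFORMATION_SCHEMA.TABLES`); `partition_catalog` and the sample names passed to it are illustrative, not part of the script.

```python
def partition_catalog(catalog, involved):
    """Split a catalog into (feeds the marts, does not feed the marts), both sorted.

    Raises ValueError if `involved` names a table that is missing from the catalog.
    """
    catalog, involved = set(catalog), set(involved)
    unknown = involved - catalog
    if unknown:
        raise ValueError(f"not in catalog: {sorted(unknown)}")
    return sorted(involved), sorted(catalog - involved)


# Illustrative names only.
feeds, leftover = partition_catalog(
    ["dim_persona", "rpt_campana", "seg_audiencia"], ["dim_persona"]
)
print(f"{len(feeds)} feed customer_marts, {len(leftover)} do not")
```

Because the two groups are computed from one set, they are disjoint by construction and their sizes always add up to the catalog size.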
@@ -0,0 +1,53 @@
+{
+  "intel_involved": [
+    "_margen_rate_producto",
+    "_persona_records",
+    "_plate_year_calib",
+    "data_points_contacto",
+    "dim_cp_provincia",
+    "dim_persona",
+    "dim_vehiculo",
+    "fact_campana_respuesta",
+    "fact_transaccion",
+    "fact_visita",
+    "feat_cliente_persona",
+    "feat_cliente_vehiculo",
+    "map_persona_canal8",
+    "map_persona_fuente",
+    "map_persona_vehiculo",
+    "reco_acciones",
+    "score_clv",
+    "seg_cliente_360",
+    "seg_cluster_persona",
+    "seg_cluster_vehiculo",
+    "tipologia_cliente"
+  ],
+  "external_sources": [
+    "anjana_bi_amg.margenes_mat",
+    "citaprevia_aurphcp.clientes",
+    "citaprevia_aurphcp.clientes_matriculas",
+    "claude_bi.churn_scores_current",
+    "claude_bi.conversion_cqg_base_mat",
+    "claude_bi.todos_datos_lineas_mat",
+    "mssql2022_dbo.anjana_customer",
+    "ontologia.aurgiCitas_mat",
+    "psql_dcpublic.call_transactions",
+    "psql_dcpublic.car_makes",
+    "psql_dcpublic.car_model_families",
+    "psql_dcpublic.car_models",
+    "psql_dcpublic.car_versions",
+    "psql_dcpublic.customers",
+    "psql_dcpublic.otr_customers",
+    "psql_dcpublic.otr_vehicles",
+    "psql_dcpublic.tecrmi_license_plates",
+    "psql_dcpublic.tpv_customers",
+    "psql_dcpublic.tpv_vehicles_vehicle",
+    "psql_dcpublic.tpv_vehicles_vehicleowner",
+    "psql_dcpublic.users",
+    "salesforce_ew1.contacts_latest",
+    "salesforce_ew1.email_clicked",
+    "salesforce_ew1.email_opened",
+    "salesforce_ew1.email_sent",
+    "salesforce_ew1.sms"
+  ]
+}
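The flat `external_sources` list is easier to scan when grouped by dataset. This is a minimal sketch of that grouping, assuming every entry has the `dataset.table` form shown in the JSON above; `group_by_dataset` is a hypothetical helper and the sample is a small subset of the listed sources.

```python
from collections import defaultdict


def group_by_dataset(sources):
    """Group 'dataset.table' strings into {dataset: [sorted table names]}."""
    grouped = defaultdict(list)
    for ref in sources:
        dataset, table = ref.split(".", 1)
        grouped[dataset].append(table)
    return {ds: sorted(tables) for ds, tables in sorted(grouped.items())}


sample = [
    "salesforce_ew1.sms",
    "psql_dcpublic.users",
    "psql_dcpublic.customers",
    "salesforce_ew1.email_sent",
]
grouped = group_by_dataset(sample)
for dataset, tables in grouped.items():
    print(f"{dataset}: {', '.join(tables)}")
```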
@@ -0,0 +1,51 @@
+"""Helper: run SQL against Metabase BigQuery db=6 via REST API.
+
+Usage:
+    python3 mbq.py "SELECT 1"
+    python3 mbq.py < query.sql
+Reads API key from `pass metabase/aurgi-api-key`.
+Prints columns header + rows as TSV.
+"""
+import json
+import subprocess
+import sys
+
+import httpx
+
+sys.path.insert(0, "python/functions")
+from metabase import MetabaseClient, metabase_execute_query
+
+MB_URL = "https://reports.autingo.es"
+DB_ID = 6
+
+
+def get_key():
+    return subprocess.check_output(["pass", "show", "metabase/aurgi-api-key"]).decode().splitlines()[0].strip()
+
+
+def run(sql, max_results=2000):
+    c = MetabaseClient(MB_URL, get_key())
+    try:
+        res = metabase_execute_query(c, DB_ID, sql, max_results=max_results)
+    except httpx.HTTPStatusError as e:
+        print("HTTP", e.response.status_code, e.response.text[:3000], file=sys.stderr)
+        return
+    # Check for a query-level error before reading the result set; errors go to stderr so stdout stays pure TSV.
+    if res.get("error") or (res.get("status") and res.get("status") != "completed"):
+        print("ERROR:", json.dumps(res.get("error") or res, ensure_ascii=False)[:2000], file=sys.stderr)
+        return
+    data = res.get("data", {})
+    cols = [col.get("display_name") or col.get("name") for col in data.get("cols", [])]
+    rows = data.get("rows", [])
+    print("\t".join(str(x) for x in cols))
+    for r in rows:
+        print("\t".join("" if v is None else str(v) for v in r))
+    print(f"-- {len(rows)} rows", file=sys.stderr)
+
+
+if __name__ == "__main__":
+    if len(sys.argv) > 1:
+        sql = sys.argv[1]
+    else:
+        sql = sys.stdin.read()
+    run(sql)
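The helper joins cell values with a literal tab, so a value that itself contains a tab or newline would shift columns in the output. A possible alternative is to let the standard `csv` module handle quoting. This is a sketch under that assumption; `rows_to_tsv` is a hypothetical name and is not part of `mbq.py`.

```python
import csv
import io


def rows_to_tsv(cols, rows):
    """Serialize a header and rows as TSV, quoting cells that contain tabs or newlines."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerow(cols)
    for row in rows:
        # None becomes an empty cell, matching the helper's behaviour.
        writer.writerow(["" if v is None else v for v in row])
    return buf.getvalue()


tsv = rows_to_tsv(["id", "note"], [[1, "a\tb"], [2, None]])
print(tsv, end="")
```

The trade-off is that consumers must parse the output with a CSV-aware reader (`csv.reader(..., delimiter="\t")`) rather than a plain split on tabs.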
@@ -0,0 +1,106 @@
+"""Traces how clientes_intel is built: for each table, fetches the SQL of the last
+job that wrote it (INFORMATION_SCHEMA.JOBS) plus its referenced_tables, and walks
+back to the source tables (TPV, customers, users, Navision, Salesforce).
+
+Dumps the builds to scratchpad/intel_build.json and the lineage to scratchpad/intel_lineage.json.
+"""
+import json
+import warnings
+
+warnings.filterwarnings("ignore")
+import google.auth
+from google.cloud import bigquery
+
+PROJECT = "autingo-159109"
+REGION = "region-europe-west1"
+
+creds, _ = google.auth.default(scopes=["https://www.googleapis.com/auth/bigquery"])
+creds = creds.with_quota_project(None)
+c = bigquery.Client(project=PROJECT, credentials=creds)
+
+# Latest job per destination table in clientes_intel: query + referenced_tables + stmt.
+sql = f"""
+WITH j AS (
+  SELECT
+    dest.table_id AS tbl,
+    query,
+    statement_type AS stmt,
+    creation_time,
+    ARRAY(
+      SELECT AS STRUCT rt.project_id, rt.dataset_id, rt.table_id
+      FROM UNNEST(referenced_tables) rt
+    ) AS refs,
+    ROW_NUMBER() OVER (PARTITION BY dest.table_id ORDER BY creation_time DESC) AS rn
+  FROM `{PROJECT}`.`{REGION}`.INFORMATION_SCHEMA.JOBS_BY_PROJECT,
+       UNNEST([destination_table]) dest
+  WHERE dest.dataset_id = 'clientes_intel'
+    AND state = 'DONE' AND error_result IS NULL
+    AND statement_type IS NOT NULL
+    AND creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 120 DAY)
+)
+SELECT tbl, query, stmt, creation_time, refs FROM j WHERE rn = 1
+ORDER BY tbl
+"""
+
+builds = {}
+for r in c.query(sql).result():
+    refs = []
+    for rt in r.refs:
+        refs.append(f"{rt['dataset_id']}.{rt['table_id']}")
+    builds[r.tbl] = {
+        "query": r.query or "",
+        "stmt": r.stmt,
+        "last_run": str(r.creation_time),
+        # Drop only the self-reference; a same-named table in another dataset is kept.
+        "refs": sorted(set(x for x in refs if x != f"clientes_intel.{r.tbl}")),
+    }
+
+json.dump(builds, open("scratchpad/intel_build.json", "w"), indent=2, ensure_ascii=False)
+print(f"tablas clientes_intel con SQL de construccion capturado: {len(builds)}\n")
+
+# Iterative (stack-based) traversal starting from the 12 tables used by customer_marts.
+SEED = [
+    "dim_persona", "dim_vehiculo", "fact_transaccion", "fact_campana_respuesta",
+    "feat_cliente_persona", "feat_cliente_vehiculo", "seg_cliente_360", "score_clv",
+    "reco_acciones", "map_persona_canal8", "map_persona_fuente", "map_persona_vehiculo",
+]
+intel_involved = set()
+external_sources = set()
+stack = list(SEED)
+while stack:
+    t = stack.pop()
+    if t in intel_involved:
+        continue
+    intel_involved.add(t)
+    b = builds.get(t)
+    if not b:
+        continue
+    for ref in b["refs"]:
+        ds, tbl = ref.split(".", 1)
+        if ds == "clientes_intel":
+            if tbl not in intel_involved:
+                stack.append(tbl)
+        else:
+            external_sources.add(ref)
+
+print("== tablas clientes_intel implicadas en el linaje de customer_marts ==")
+for t in sorted(intel_involved):
+    b = builds.get(t, {})
+    print(f"  {t:26s} {b.get('stmt','(sin job)')}")
+
+print("\n== FUENTES EXTERNAS (fuera de clientes_intel) usadas por el pipeline ==")
+for s in sorted(external_sources):
+    print(f"  {s}")
+
+# Flag the CUSTOMER sources the user asked about.
+KEYS = ["customer", "customers", "cliente", "user", "usuario", "tpv", "salesforce",
+        "sf_", "contact", "mkt_cloud", "persona"]
+print("\n== fuentes que parecen de CLIENTE/usuario ==")
+for s in sorted(external_sources):
+    low = s.lower()
+    if any(k in low for k in KEYS):
+        print(f"  {s}")
+
+json.dump({
+    "intel_involved": sorted(intel_involved),
+    "external_sources": sorted(external_sources),
+}, open("scratchpad/intel_lineage.json", "w"), indent=2, ensure_ascii=False)
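The traversal in this script can be isolated as a pure function and exercised offline, without BigQuery credentials. This is a minimal sketch, assuming the `{table: {"refs": ["dataset.table", ...]}}` shape written to `intel_build.json`; `walk_lineage` and the sample data are hypothetical.

```python
def walk_lineage(builds, seeds, dataset="clientes_intel"):
    """Walk refs from the seeds; return (tables inside `dataset`, external sources)."""
    involved, external = set(), set()
    stack = list(seeds)
    while stack:
        table = stack.pop()
        if table in involved:
            continue  # already visited; this also breaks cycles
        involved.add(table)
        # A table with no captured job simply contributes no refs.
        for ref in builds.get(table, {}).get("refs", []):
            ds, tbl = ref.split(".", 1)
            if ds == dataset:
                stack.append(tbl)
            else:
                external.add(ref)
    return sorted(involved), sorted(external)


# Illustrative data, including a cycle and a seed with no captured job.
sample_builds = {
    "dim_persona": {"refs": ["clientes_intel._persona_records"]},
    "_persona_records": {
        "refs": ["psql_dcpublic.customers", "clientes_intel.dim_persona"]
    },
}
involved, external = walk_lineage(sample_builds, ["dim_persona", "score_clv"])
print(involved, external)
```

Seeds without a captured job (such as `score_clv` in the sample) still count as involved, which matches how the script reports them as "(sin job)".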
@@ -0,0 +1,158 @@
+"""Traces the recursive lineage of the customer_marts views down to the source tables.
+
+For each object: gets its type (VIEW/BASE TABLE/EXTERNAL/MATERIALIZED VIEW) and its DDL
+via INFORMATION_SCHEMA.TABLES, extracts references to other tables from the DDL, and
+recurses into the ones that are views. Dumps the full graph to a JSON file in scratchpad.
+"""
+import json
+import re
+import sys
+import warnings
+
+warnings.filterwarnings("ignore")
+
+import google.auth
+from google.cloud import bigquery
+
+PROJECT = "autingo-159109"
+
+creds, _ = google.auth.default(scopes=["https://www.googleapis.com/auth/bigquery"])
+creds = creds.with_quota_project(None)
+client = bigquery.Client(project=PROJECT, credentials=creds)
+
+# Per-dataset metadata cache: {dataset: {table_name: {"type":..., "ddl":...}}}
+dataset_cache: dict[str, dict] = {}
+
+
+def load_dataset(dataset: str) -> dict:
+    """Loads all tables/views of a dataset (one query per dataset)."""
+    if dataset in dataset_cache:
+        return dataset_cache[dataset]
+    result: dict[str, dict] = {}
+    try:
+        sql = f"""
+        SELECT table_name, table_type, ddl
+        FROM `{PROJECT}`.`{dataset}`.INFORMATION_SCHEMA.TABLES
+        """
+        for r in client.query(sql).result():
+            result[r.table_name] = {"type": r.table_type, "ddl": r.ddl or ""}
+    except Exception as e:  # noqa: BLE001
+        print(f"  [warn] no se pudo leer dataset {dataset}: {e}", file=sys.stderr)
+    dataset_cache[dataset] = result
+    return result
+
+
+# In the DDL emitted by INFORMATION_SCHEMA, references to other tables ALWAYS appear
+# in backticks and fully qualified: `project.dataset.table`. CTE/JOIN aliases
+# (dp, fcp, f...) never carry backticks, so restricting the search to backticked
+# text removes all the noise.
+BACKTICK_RE = re.compile(r"`([^`]+)`")
+# Variant with each part in its own backticks: `proj`.`dataset`.`table`
+MULTIPART_RE = re.compile(
+    r"`([A-Za-z0-9_-]+)`\.`([A-Za-z0-9_-]+)`(?:\.`([A-Za-z0-9_-]+)`)?"
+)
+
+
+def _norm(proj: str, ds: str, tbl: str) -> tuple[str, str] | None:
+    if ds.upper() == "INFORMATION_SCHEMA" or tbl.upper() == "INFORMATION_SCHEMA":
+        return None
+    return (ds, tbl)
+
+
+def extract_refs(ddl: str) -> set[tuple[str, str]]:
+    """Returns the set of (dataset, table) pairs referenced in the body of the DDL.
+
+    Keeps only the SELECT (after the first 'AS') so the object's own name is not captured.
+    """
+    body = ddl
+    m = re.search(r"\bAS\b", ddl, flags=re.IGNORECASE)
+    if m:
+        body = ddl[m.end():]
+
+    refs: set[tuple[str, str]] = set()
+
+    # Style `project.dataset.table` (everything inside one pair of backticks).
+    for tok in BACKTICK_RE.findall(body):
+        parts = [p for p in tok.split(".") if p]
+        if len(parts) == 3:
+            r = _norm(parts[0], parts[1], parts[2])
+        elif len(parts) == 2:
+            r = _norm(PROJECT, parts[0], parts[1])
+        else:
+            r = None
+        if r:
+            refs.add(r)
+
+    # Style `proj`.`dataset`.`table` (one pair of backticks per part, 3 qualified parts).
+    # NOTE: `alias`.`column` (2 parts, each in its own backticks) is a column
+    # reference, NOT a table reference; it is discarded by requiring all 3 parts.
+    for mt in MULTIPART_RE.finditer(body):
+        g1, g2, g3 = mt.group(1), mt.group(2), mt.group(3)
+        if g3:
+            r = _norm(g1, g2, g3)
+            if r:
+                refs.add(r)
+
+    return refs
+
+
+graph: dict[str, dict] = {}  # key "dataset.table" -> {type, ddl, refs:[...]}
+visited: set[str] = set()
+
+
+def visit(dataset: str, table: str, depth: int = 0):
+    key = f"{dataset}.{table}"
+    if key in visited:
+        return
+    visited.add(key)
+    meta = load_dataset(dataset).get(table)
+    if meta is None:
+        graph[key] = {"type": "UNKNOWN", "ddl": "", "refs": [], "depth": depth}
+        return
+    ddl = meta["ddl"]
+    ttype = meta["type"]
+    refs: list[str] = []
+    if ttype in ("VIEW", "MATERIALIZED VIEW"):
+        for ds, tbl in sorted(extract_refs(ddl)):
+            # Avoid self-reference
+            if ds == dataset and tbl == table:
+                continue
+            refs.append(f"{ds}.{tbl}")
+    graph[key] = {"type": ttype, "ddl": ddl, "refs": refs, "depth": depth}
+    for ref in refs:
+        rds, rtbl = ref.split(".", 1)
+        visit(rds, rtbl, depth + 1)
+
+
+# Seeds: the 14 customer_marts views.
+SEEDS = [
+    "customer_brand_affinity", "customer_category_spend", "customer_channel",
+    "customer_contactability", "customer_monetary", "customer_payment_method",
+    "customer_predictive", "customer_product", "customer_profile",
+    "customer_promo_tolerance", "customer_promo_usage", "customer_store_spend",
+    "customer_temporal", "customer_vehicles",
+]
+for s in SEEDS:
+    visit("customer_marts", s, 0)
+
+out = {
+    "project": PROJECT,
+    "seeds": [f"customer_marts.{s}" for s in SEEDS],
+    "graph": graph,
+}
+with open("scratchpad/lineage_graph.json", "w") as f:
+    json.dump(out, f, indent=2, ensure_ascii=False)
+
+# Summary
+n_view = sum(1 for v in graph.values() if v["type"] in ("VIEW", "MATERIALIZED VIEW"))
+n_base = sum(1 for v in graph.values() if v["type"] == "BASE TABLE")
+n_ext = sum(1 for v in graph.values() if v["type"] == "EXTERNAL")
+n_unk = sum(1 for v in graph.values() if v["type"] == "UNKNOWN")
+print(f"objetos totales: {len(graph)}  vistas: {n_view}  base: {n_base}  external: {n_ext}  desconocidos: {n_unk}")
+print("\n== objetos por dataset ==")
+by_ds: dict[str, int] = {}
+for k in graph:
+    ds = k.split(".", 1)[0]
+    by_ds[ds] = by_ds.get(ds, 0) + 1
+for ds, n in sorted(by_ds.items(), key=lambda x: -x[1]):
+    print(f"  {n:3d}  {ds}")
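The backtick-based extraction is the part of this script most worth checking against a concrete DDL string. This is a simplified sketch covering only the single-backtick style (`project.dataset.table` or `dataset.table`); it omits the multi-backtick variant and the `INFORMATION_SCHEMA` filter that the script handles. `refs_in_ddl` and the sample DDL are hypothetical.

```python
import re

BACKTICK_RE = re.compile(r"`([^`]+)`")


def refs_in_ddl(ddl):
    """Return sorted 'dataset.table' refs found in backticks after the first AS."""
    m = re.search(r"\bAS\b", ddl, flags=re.IGNORECASE)
    body = ddl[m.end():] if m else ddl  # skip the object's own name
    refs = set()
    for token in BACKTICK_RE.findall(body):
        parts = [p for p in token.split(".") if p]
        if len(parts) in (2, 3):
            refs.add(".".join(parts[-2:]))  # drop the project when present
    return sorted(refs)


# Illustrative DDL, written by hand for this example.
sample_ddl = (
    "CREATE VIEW `autingo-159109.customer_marts.customer_profile` AS\n"
    "SELECT dp.persona_id, f.total\n"
    "FROM `autingo-159109.clientes_intel.dim_persona` dp\n"
    "JOIN `clientes_intel.feat_cliente_persona` f USING (persona_id)"
)
found = refs_in_ddl(sample_ddl)
print(found)
```

The unquoted aliases (`dp`, `f`) and the view's own name before `AS` are not captured, which is the noise reduction the comments in the script describe.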
Author	SHA1	Message	Date
egutierrez	5a4f82cf76	chore: auto-commit (26 archivos) - python/functions/bigquery/bq_auth.md - python/functions/bigquery/bq_load_from_file.md - python/functions/bigquery/bq_load_from_gcs.md - python/functions/bigquery/client.py - python/functions/bigquery/queries.py - python/functions/datascience/__init__.py - python/functions/datascience/decode_qr_image.py - python/functions/datascience/load_bq_table_to_duckdb.md - python/functions/datascience/load_bq_table_to_duckdb.py - python/functions/pipelines/profile_bq_table.md - ... Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-07-02 19:00:13 +02:00
egutierrez	2ebc9efeb2	chore: auto-commit (8 archivos) - scratchpad/gen_docs.py - scratchpad/gen_intel.py - scratchpad/gen_verify.py - scratchpad/intel_build.json - scratchpad/intel_lineage.json - scratchpad/lineage_graph.json - scratchpad/trace_intel.py - scratchpad/trace_lineage.py Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-07-01 19:00:06 +02:00
egutierrez	fbdf80bd71	chore: auto-commit (10 archivos) - scratchpad/ap.parquet - scratchpad/bq.py - scratchpad/cards.json - scratchpad/citas_recon.csv - scratchpad/dash.txt - scratchpad/diego.parquet - scratchpad/diego_literals.sql - scratchpad/exf/ - scratchpad/va.parquet - scratchpad/vm.parquet Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-07-01 17:58:03 +02:00
egutierrez	8408863cfa	feat(eda): pipeline BQ-EDA sobre tablas BigQuery (grupo eda) Añade el conector y el pipeline para hacer EDA automático sobre tablas/vistas de BigQuery, reutilizando profile_table del grupo eda sin duplicar profiling: - load_bq_table_to_duckdb (datascience): trae una tabla BQ a DuckDB con seudonimización SHA-1 de columnas PII y normalización de dtypes. Por defecto carga el total de filas (sample_frac=None); el muestreo es opt-in explícito. - profile_bq_table (pipeline): orquesta load -> profile_table -> render report (JSON + Markdown + PDF/PPTX). Full por defecto. Ambas tageadas eda+bigquery, v1.1.0. El default full responde a la preferencia del operador: los EDA se corren sobre el total salvo indicación contraria. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-07-01 12:45:39 +02:00
egutierrez	7273823087	Merge remote-tracking branch 'origin/master' # Conflicts: # .claude/settings.local.json	2026-07-01 11:42:49 +02:00
egutierrez	76592e4dc0	chore: auto-commit (2 archivos) - .claude/settings.local.json - scratchpad/mbq.py Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-07-01 11:41:56 +02:00
egutierrez	26569c7015	chore: auto-commit (1 archivos) - logs/ardour_mcp_server.log Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-07-01 02:16:25 +02:00
egutierrez	44622339fa	merge(eda): cap4/cap5 distribuciones — parrafos al glosario, desc LLM+unidad por columna, donut->barras, PPT side_by_side	2026-07-01 02:11:53 +02:00
egutierrez	c0d44a6352	fix(eda): cat_distr — intro del cuerpo reducida a términos clicables mínimos Quita la frase descriptiva del cuerpo del capítulo ('Cada columna categórica ocupa su propia página — ...: cardinalidad, top de categorías y gráfico de barras. El dataset tiene N filas...'); ya vivía duplicada en la entrada de glosario 'pagina_categorica'. El intro deja solo los términos clicables mínimos ([[term:entropia]] · [[term:pagina_categorica]]) bajo el heading 'Entropía y cardinalidad'. El total de filas del dataset sigue disponible por columna en la tabla de cardinalidad ('Total filas (dataset)').	2026-07-01 02:10:39 +02:00
egutierrez	cab0fbf0a3	feat(eda): CAP4/CAP5 distribuciones — párrafos al glosario, desc LLM + unidad por columna, donut→barras, PPT figura a la derecha CAP4 num_distr: - Mueve el párrafo introductorio largo del histograma/boxplot al glosario (nuevo término clicable "histograma_boxplot"); el cuerpo del capítulo solo nombra el término con [[term:histograma_boxplot]] y la explicación completa (código de colores, 1,5·IQR, lectura de asimetría) vive en la entrada del glosario. La información se traslada, no se pierde. - Añade por columna numérica la descripción de negocio del LLM y la unidad, leídas de profile['llm']['dictionary'] (empareja por nombre de columna). Sin bloque LLM el bloque de descripción se omite limpiamente. CAP5 cat_distr: - Mueve el párrafo "Cada columna categórica ocupa su propia página..." al glosario (nuevo término clicable "pagina_categorica"); el intro solo nombra los términos entropía y pagina_categorica. - Añade descripción LLM + unidad por columna (misma fuente que CAP4). - Cambia el donut/pie por gráfico de barras horizontales (nueva función del registry categorical_top_bar_figure_py_datascience, contrato de entrada idéntico al donut para swap directo) más su fallback inline de barras. - Marca cada Group de columna con layout="side_by_side": en PPTX la tabla de cardinalidad queda a la izquierda y la barra a la derecha; en PDF se apila (A5 estrecho). No toca los renderers — el soporte de layout ya existía. Glosario: - Catálogo canónico _BASELINE_TERMS con las definiciones de los dos términos nuevos; build_glosario completa la definición de un término registrado sin ella desde el catálogo (los chapters solo registran clave+label). Tests actualizados (donut→barras, side_by_side, LLM desc/unidad, glosario) y nueva función con sus tests. Suite del subsistema + acceptance verde.	2026-07-01 02:01:07 +02:00
egutierrez	7f304adc9c	merge(eda): global render quality: DPI 220, wide tables as images, side_by_side layout, clickable index	2026-07-01 01:36:10 +02:00
egutierrez	a74a5a047f	feat(eda): global render quality: DPI 220, wide tables as images, side_by_side layout, clickable index. Cross-cutting improvements to the AutomaticEDA engine (PDF + PPTX) on top of the block model: 1. Global high DPI: every embedded figure/image is rasterized at 220 dpi (previously 150, and in PDF the page was saved at ~100 dpi, re-rasterizing the imshow output). In PDF, savefig.dpi=220 is applied to the page; text remains vector and selectable. This allows zooming on mobile without pixelation. Measured embedded image: ~1081px (previously ~492px). 2. Wide table → high-resolution image: when a DataTable has too many columns to be legible as text (criterion _table_fits_as_text), it is drawn whole as a sharp image (new function render_table_as_figure_py_datascience: shaded header + zebra) scaled to fit completely, so the reader can zoom in and read it without losing data. Tables that do fit remain selectable text / native tables. Applies to PDF and PPTX. The 19-column df.head of the synthetic dataset is no longer cut off: it is rendered as an image. 3. Group.layout: new backward-compatible hint (default "stack"). "side_by_side" places the table on the left (~55%) and the figure on the right (~45%) on the same PPTX slide (falls back to stacked if there is no table+figure pair or they do not fit); in PDF it is treated as "stack" (the mobile A5 width cannot hold two columns). Intended to let the cat_distr chapter put the chart next to the table in PPT. 4. Cover with clickable index: the chapter list changes from "Este informe incluye..." (markdown) to an "Índice" Heading + one TocEntry per chapter. The renderer records the start of each chapter and wires each entry as a real jump (PDF: PyMuPDF GOTO link; PPTX: native slide jump), reusing the clickable-glossary mechanism. Model: Group gains `layout`; new TocEntry block; normalizers and __init__ updated. 
Contract: documented in docs/automatic_eda_contract.md §11.4 (includes the exact contract of the layout field for the cat_distr agent). Tests: new render_quality_test.py (13 golden: real high DPI, wide table→image PDF/PPTX, narrow→text, side_by_side PPTX two columns / PDF stacked, clickable index PDF+PPTX, backward compatibility of the default layout). render_features_test updated for the new index. Suite: 188 passed (module) + 38 passed/1 skipped (acceptance + pipeline). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-07-01 01:34:21 +02:00
egutierrez	44be1d6b58	merge(eda): cap2 overview enriches dictionary and describe with the LLM's description+unit	2026-07-01 01:14:37 +02:00
egutierrez	64306f3b1c	feat(eda): overview enriches dictionary and describe with the LLM's description+unit. The overview chapter's column DICCIONARIO table gains "Descripcion" and "Unidad" columns, and the DESCRIBE table gains "Unidad", consuming profile['llm']['dictionary'] (column/description/business_meaning/unit entries produced by eda_llm_insights) matched by column name. Defensive read: without an LLM block (run_llm did not run) the cells degrade to "—" and the tables still render. It recomputes nothing and does not call the LLM. CHAPTER_VERSION 1.1.0 -> 1.2.0. Tests: golden (description+unit populated for income), edge (no LLM -> "—"), ctx['llm'] fallback, and PDF render with the new columns visible. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-07-01 01:13:02 +02:00
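The defensive read described in 64306f3b1c (match profile['llm']['dictionary'] entries by column name, degrade to "—" when the LLM block is absent) might look something like this minimal sketch. The `llm_cells` helper name and the sample entry values are assumptions made for illustration; only the key names column/description/unit come from the commit message.

```python
# Hypothetical sketch of the defensive lookup; the helper name and sample data are assumed.
MISSING = "—"  # placeholder cell used when no LLM data is available


def llm_cells(profile, column):
    """Return (description, unit) for a column, degrading to MISSING instead of raising."""
    llm = (profile or {}).get("llm") or {}
    entries = llm.get("dictionary") or []
    for entry in entries:
        # Entries are matched by column name only.
        if isinstance(entry, dict) and entry.get("column") == column:
            description = entry.get("description") or MISSING
            unit = entry.get("unit") or MISSING
            return description, unit
    return MISSING, MISSING


# Invented sample profile, not real output of eda_llm_insights.
profile = {"llm": {"dictionary": [
    {"column": "income", "description": "Annual income", "unit": "EUR"},
]}}
with_llm = llm_cells(profile, "income")
without_llm = llm_cells({}, "income")
```

Because every lookup goes through `.get(...)` with an `or` fallback, a profile with no 'llm' key, an empty dictionary, or an unmatched column all yield the same placeholder cells.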
egutierrez	f2eb782a5f	merge(eda): cover v2 (no Criterios, LLM description, summary on the right) + global PDF zebra + large/underlined PPTX name	2026-06-30 22:53:46 +02:00
egutierrez	80d10010f5	feat(eda): cap01 cover + global zebra and render emphasis. Iterates on the AutomaticEDA PORTADA chapter and adds two global improvements to the PDF/PPTX renderers: 1. Global zebra (PDF): _place_kv_table now shades even rows just as DataTable does, so every table in the document is striped (not only DataTables). The same pattern stays consistent when splitting/repeating the header. 2. The cover uses the rich LLM description (profile['llm']['summary']) when the profile has it; the noise text "active la interpretación LLM (run_llm)…" is removed from the derived fallback. It does not force LLM calls in the chapter; it only consumes profile['llm'] if present. 3. The "Criterios de calidad" block is removed from the cover (PDF and PPTX); the "Calidad" score is kept. 4. "Resumen del análisis" (PDF): values are right-aligned via the new KVTable.value_align="right". 5. The dataset name on the PPTX cover is larger (44pt) and underlined via the new hints Heading.underline / Heading.size_pt (the PDF ignores them). Bumps the cover CHAPTER_VERSION 1.2.0 -> 1.3.0. Verified: suite 213 passed / 1 skipped (incl. acceptance of the 16 chapters); golden zebra = 185 zebra rows across 13 chapters of the full PDF; cover with run_llm without "Criterios de calidad", with the rich LLM description and right-aligned values; PPTX with the name at 44pt underlined; edge case without LLM falls back to the derived fallback without noise; fn index without errors. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-30 22:44:33 +02:00
egutierrez	ecc22d6d57	merge(eda): acceptance suite for the 16 chapters (29 passed, rescued from an executor with auth down)	2026-06-30 22:07:21 +02:00
agent	7bdb8bffb5	test(eda): acceptance suite for the 16 AutomaticEDA chapters. A test battery that hardens the subsystem: coverage of the 16 chapters on the synthetic Faker dataset, essential content per chapter (parametrized needles), standalone chapters with dependency resolution (only_chapters=[outliers] populates IsolationForest without run_models; timeseries; correlacion), None when not applicable, multi-table folder with FK, completeness of the MD (full correlation matrix + skew/kurtosis), 3 non-empty outputs, determinism. The full+LLM test is skippable. 29 passed, 1 skipped. No findings: the 16 chapters come out as they should.	2026-06-30 22:07:15 +02:00
egutierrez	4139394326	merge(eda): only_chapters with automatic resolution of per-chapter compute dependencies	2026-06-30 21:37:16 +02:00
egutierrez	4773781323	merge(eda): synthetic Faker generators (all-in-one table + multi-table folder) that activate all chapters	2026-06-30 21:26:20 +02:00
egutierrez	ea6678ec23	feat(eda): synthetic Faker dataset generators that exercise AutomaticEDA. Adds two impure dict-no-throw functions, deterministic by seed, to the datascience domain (eda group): - generate_synthetic_eda_table: a 19-column DuckDB table (correlated numerics + outliers, imbalanced categoricals, long multi-language text es/en/fr, a DATE column, valid lat/lon, PII email/iban/phone/uuid, nulls with co-occurring MCAR/MAR patterns). Activates 14 chapters of the AutomaticEDA engine (num_distr, cat_distr, text_distr, calidad, missingness, correlacion, relaciones, modelos, timeseries, geospatial, agregacion, glosario + portada/overview). - generate_synthetic_eda_folder: 3 related CSVs (customers/orders/reviews) with a customer FK detectable by containment, for the multi-table folder EDA. Determinism via Faker.seed_instance + numpy.random.default_rng. Tests: 16 passed (including determinism by hash, lat/lon ranges, income/spending co-nulls, review median word count >=20, international phone format, FK containment). Adds faker (40.27.0) to python/pyproject.toml + uv.lock. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-30 21:25:31 +02:00
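The "FK detectable by containment" check mentioned in ea6678ec23 can be illustrated with a small stdlib-only sketch. It assumes containment means the fraction of distinct non-null child values that also appear in the parent key column; the function names, the 0.95 threshold, and the sample values are hypothetical and not taken from the repository.

```python
# Hypothetical sketch of FK detection by containment; threshold and names are assumptions.
def containment(child_values, parent_values):
    """Fraction of distinct non-null child values that also appear in the parent."""
    child = {v for v in child_values if v is not None}
    if not child:
        return 0.0
    parent = set(parent_values)
    return len(child & parent) / len(child)


def looks_like_fk(child_values, parent_values, threshold=0.95):
    """Treat the child column as a foreign-key candidate above the threshold."""
    return containment(child_values, parent_values) >= threshold


# Invented sample data standing in for customers.id and orders.customer_id.
customers = [1, 2, 3, 4]
orders_customer_id = [1, 1, 2, 4, None]
score = containment(orders_customer_id, customers)
```

Nulls are ignored on the child side so that optional foreign keys are not penalized; a stricter detector could also require the parent column to be unique.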
				`@@ -0,0 +1 @@`
				`{"total": 12367, "canal": 12368, "pago": 12369, "matriz": 12370, "evolutivo": 12371}`