feat(eda): nueva fórmula de calidad de datos (report 2046) + capítulo calidad

Implementa el modelo de calidad del report 2046 en el grupo eda. Score de columna: 0.6·completeness + 0.4·validity con renormalización por aplicabilidad (si la validez no es medible —texto libre o columna 100% nula— el score se basa solo en completeness). Validez = conformidad real al tipo: nativo numérico/fecha/bool = 1.0; texto promovido a número/fecha = parse rate (validity_rate); texto con semantic_type = match_rate; texto libre = no aplica. Outliers, columnas constantes e identificadores salen del score a un bloque de observaciones analíticas (no son defectos de calidad). Se elimina el doble conteo de la falta de datos (mostly_null ya no castiga validez) y el bug de escala de outliers (que además ya no entran en el score). Score de dataset: 100·(0.85·cell_quality + 0.15·row_uniqueness) en vez de la media simple. Se pobla duplicate_rows/duplicate_pct push-down en summarize_table_duckdb (COUNT sobre DISTINCT *, sin RAM) para habilitar la unicidad de registro; renormaliza a solo cell_quality si no se puede calcular. Capítulo calidad (v2.0.0): intro de dos dimensiones (60/40) que declara que los outliers no bajan el score; tabla de scores Columna|Calidad|Completitud|Validez (sin Consistencia, n/a cuando no aplica); DOS tablas separadas (Problemas de calidad vs Observaciones analíticas); resumen con Unicidad de registro; glosario clicable de completitud, validez, unicidad de registro y calidad de datos. Verificado: 123 tests verdes (automatic_eda + render_automatic_eda + column_quality_score + summarize_table_duckdb + profile_table). Golden EDA de titanic (run_models+run_llm) con score recomputado a mano, outliers separados en observaciones y glosario clicable (5 links GOTO en el PDF). column_quality_score v2.0.0, summarize_table_duckdb v1.1.0, profile_table v1.1.0. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-30 18:10:23 +02:00
20 changed files with 779 additions and 1957 deletions
@@ -34,7 +34,6 @@ from .theils_u import theils_u
 from .correlation_ratio import correlation_ratio
 from .mutual_info_columns import mutual_info_columns
 from .infer_fk_containment_duckdb import infer_fk_containment_duckdb
-from .detect_declared_keys_duckdb import detect_declared_keys_duckdb
 from .build_join_graph import build_join_graph
 from .association_matrix import association_matrix
 from .correlation_matrix_duckdb import correlation_matrix_duckdb
@@ -70,10 +69,8 @@ from .build_eda_render_ctx import build_eda_render_ctx
 from .profile_datetime import profile_datetime
 from .resample_timeseries import resample_timeseries
 from .add_pdf_internal_links import add_pdf_internal_links
-from .suggest_intratable_fk_candidates import suggest_intratable_fk_candidates

 __all__ = [
-    "suggest_intratable_fk_candidates",
    "detect_time_column",
    "extract_timeseries_raw",
    "build_eda_render_ctx",
@@ -100,7 +97,6 @@ __all__ = [
    "correlation_ratio",
    "mutual_info_columns",
    "infer_fk_containment_duckdb",
-    "detect_declared_keys_duckdb",
    "build_join_graph",
    "association_matrix",
    "correlation_matrix_duckdb",
@@ -1,22 +1,26 @@
 """Data-quality chapter (CALIDAD) for AutomaticEDA.

 Builds the quality chapter from a ``TableProfile`` of the ``eda`` group. The
-chapter answers, in Spanish and as tables, the three things the user asked for:
+chapter implements the quality model of report 2046:

-1. **En qué se basa la calidad** — an intro paragraph explaining the criteria and
-   their weights (completeness, validity, consistency) before any number, plus a
-   table-level summary (global score and aggregates).
+1. **En qué se basa la calidad** — an intro paragraph explaining the two scored
+   dimensions and their weights (completitud 60%, validez 40%) plus the
+   table-level row uniqueness, BEFORE any number, and stating explicitly that
+   outliers are reported as observations and do **not** lower the score. The
+   criteria terms (calidad de datos, completitud, validez, unicidad de registro)
+   are hooked into the shared glossary as clickable jumps.
 2. **Scores por columna** — a table with, per column, the total quality score and
-   its breakdown into completeness / validity / consistency.
-3. **Problemas en español** — a second table listing, per column, the readable
-   issues in Spanish (kept separate from the type ``flags``).
+   its breakdown into completeness / validity (no consistency dimension).
+3. **Problemas de calidad** — a table listing ONLY real quality defects
+   (nulls, empty cells, values not conforming to their type/semantics).
+4. **Observaciones analíticas** — a SEPARATE table for outliers, constant
+   columns, high-cardinality ids and strong skew, with an explicit note that
+   these do not affect the score.

-The breakdown and the issues are NOT recomputed here: they come from the registry
-function ``column_quality_score`` (group ``eda``), which already derives
-``{score, completeness, validity, consistency, issues}`` from the ColumnProfile.
-This chapter is render-only — it consumes that function and lays the result out
-as model blocks; the renderers paginate tables (splitting by rows, repeating the
-header) and wrap long cells so nothing is ever cut.
+The breakdown, issues and observations are NOT recomputed here: they come from
+the registry function ``column_quality_score`` (group ``eda``), which derives
+``{score, completeness, validity, dimensions, applicable, issues,
+observations}`` from the ColumnProfile. This chapter is render-only.

 Contract: build_<id>(profile, ctx) -> Chapter | None ; CHAPTER_VERSION = "x.y.z".
 """
@@ -33,28 +37,47 @@ try:  # pragma: no cover - import wiring
 except Exception:  # noqa: BLE001 - never let an import error abort the document.
    _column_quality_score = None

-CHAPTER_VERSION = "1.0.0"
+CHAPTER_VERSION = "2.0.0"
 CHAPTER_ID = "calidad"
 CHAPTER_TITLE = "Calidad"

-# Weights mirror column_quality_score: completeness 0.5, validity 0.3,
-# consistency 0.2. Kept here only to render the human explanation; the actual
-# numbers always come from the function so the two never drift in computation.
-_CRITERIA_INTRO = (
-    "La calidad de cada columna es un score de 0 a 100 que combina tres "
-    "criterios, cada uno con un peso:\n\n"
-    "- **Completitud (peso 50%)**: proporción de valores presentes (sin nulos "
-    "ni vacíos). Una columna con muchos nulos baja de score.\n"
-    "- **Validez (peso 30%)**: los valores son coherentes con su tipo y rango "
-    "esperado (penaliza outliers y semánticas declaradas que no coinciden).\n"
-    "- **Consistencia (peso 20%)**: la columna aporta información útil (penaliza "
-    "columnas constantes o identificadores de cardinalidad muy alta).\n\n"
-    "Score = 100 × (0,5·completitud + 0,3·validez + 0,2·consistencia). "
-    "Los problemas detectados por columna se listan en español más abajo."
-)
+# Glossary terms this chapter explains (report 2046 §6). Registered in the shared
+# collector and marked clickable on their first appearance (contract §11.1).
+_TERMS = {
+    "calidad_datos": (
+        "Calidad de datos (score 0-100)",
+        "Mide hasta qué punto los datos están presentes y son utilizables tal "
+        "cual, no si son «buenos para el análisis». Se compone solo de "
+        "dimensiones medibles automáticamente desde el perfil de la tabla, sin "
+        "fuente externa de verdad: completitud (60%), validez (40%, cuando es "
+        "medible) y, a nivel de tabla, unicidad de registro. Los valores "
+        "atípicos NO bajan la calidad: se listan aparte como observaciones.",
+    ),
+    "completitud": (
+        "Completitud",
+        "Proporción de valores realmente presentes en una columna (1 − % de "
+        "nulos; en texto, las celdas vacías también cuentan como faltantes). Los "
+        "nulos y vacíos bajan el score porque falta información que debería "
+        "estar. Pesa el 60% del score de columna.",
+    ),
+    "validez": (
+        "Validez",
+        "Proporción de valores que encajan con su tipo o formato esperado: un "
+        "número que parsea, una fecha legible, un email con forma de email. Los "
+        "valores que no parsean a su tipo bajan el score. Si la columna es texto "
+        "libre sin formato esperado, la validez no se puede medir y el score se "
+        "basa solo en la completitud. Pesa el 40% del score cuando es medible.",
+    ),
+    "unicidad_registro": (
+        "Unicidad de registro",
+        "A nivel de tabla, las filas duplicadas restan calidad al conjunto "
+        "(1 − % de filas duplicadas). Es distinta de que una columna no-clave "
+        "repita valores, que no es un defecto de calidad.",
+    ),
+}

-# Cap for the joined issues cell so a single row never grows taller than a page;
-# the remainder is summarized as "(+N más)" instead of being silently dropped.
+# Cap for the joined cell so a single row never grows taller than a page; the
+# remainder is summarized as "(+N más)" instead of being silently dropped.
 _ISSUES_MAXLEN = 160


@@ -82,12 +105,19 @@ def _fmt_unit_pct(value) -> str:
        return str(value)


+def _fmt_validity(value) -> str:
+    """Validity is ``None`` when not applicable: show ``n/a`` not a fake 0%."""
+    if value is None:
+        return "n/a"
+    return _fmt_unit_pct(value)
+
+
 def _quality_of(col: dict) -> dict:
-    """Return ``{score, completeness, validity, consistency, issues}`` for a column.
+    """Return the quality dict for a column.

    Uses the registry ``column_quality_score`` when available; otherwise falls
    back to the per-column ``quality_score`` already in the profile (number only,
-    empty breakdown/issues). Never raises.
+    empty breakdown/issues/observations). Never raises.
    """
    if not isinstance(col, dict):
        col = {}
@@ -98,26 +128,25 @@ def _quality_of(col: dict) -> dict:
                return res
        except Exception:  # noqa: BLE001 - degrade instead of aborting.
            pass
-    # Fallback: only the final score is available pre-computed in the profile.
    return {
        "score": col.get("quality_score"),
        "completeness": None,
        "validity": None,
-        "consistency": None,
        "issues": [],
+        "observations": [],
    }


-def _join_issues(issues) -> str:
-    """Join Spanish issue strings into one cell, truncating overly long lists.
+def _join_cells(items) -> str:
+    """Join Spanish strings into one cell, truncating overly long lists.

-    The renderer wraps cell text, but a column with many long issues could make a
-    single row taller than a whole page; cap the length and append ``(+N más)``
-    so the count of hidden issues is honest rather than silently lost.
+    The renderer wraps cell text, but a column with many long entries could make
+    a single row taller than a whole page; cap the length and append ``(+N más)``
+    so the count of hidden entries is honest rather than silently lost.
    """
-    if not isinstance(issues, (list, tuple)) or not issues:
+    if not isinstance(items, (list, tuple)) or not items:
        return ""
-    parts = [model._safe_str(i).strip() for i in issues]
+    parts = [model._safe_str(i).strip() for i in items]
    parts = [p for p in parts if p]
    if not parts:
        return ""
@@ -142,6 +171,33 @@ def _columns_with_quality(profile: dict):
            yield c, _quality_of(c)


+def _fmt_unit_pct_or_pct(value) -> str:
+    """Format a value that may be a 0-1 fraction or an already-0-100 percentage."""
+    try:
+        num = float(value)
+    except (TypeError, ValueError):
+        return model._safe_str(value)
+    if num != num:  # NaN
+        return "—"
+    pct = num * 100 if num <= 1.0 else num
+    text = f"{pct:.1f}".rstrip("0").rstrip(".")
+    return f"{text}%"
+
+
+def _row_uniqueness(profile: dict):
+    """Return row uniqueness (1 - duplicate_pct) in [0,1], or None if unknown."""
+    dup = profile.get("duplicate_pct")
+    if dup is None:
+        return None
+    try:
+        d = float(dup)
+    except (TypeError, ValueError):
+        return None
+    if d > 1.0:  # tolerate a 0-100 scale
+        d = d / 100.0
+    return max(0.0, min(1.0, 1.0 - d))
+
+
 def _summary_block(profile: dict, evaluated: list):
    """Table-level KVTable: global score and quality aggregates."""
    rows = []
@@ -153,14 +209,15 @@ def _summary_block(profile: dict, evaluated: list):
             if isinstance(q.get("completeness"), (int, float))]
    vals = [q.get("validity") for _, q in evaluated
            if isinstance(q.get("validity"), (int, float))]
-    cons = [q.get("consistency") for _, q in evaluated
-            if isinstance(q.get("consistency"), (int, float))]
    if comps:
        rows.append(("Completitud media", _fmt_unit_pct(sum(comps) / len(comps))))
    if vals:
-        rows.append(("Validez media", _fmt_unit_pct(sum(vals) / len(vals))))
-    if cons:
-        rows.append(("Consistencia media", _fmt_unit_pct(sum(cons) / len(cons))))
+        rows.append(("Validez media (donde aplica)",
+                     _fmt_unit_pct(sum(vals) / len(vals))))
+
+    ru = _row_uniqueness(profile)
+    if ru is not None:
+        rows.append(("Unicidad de registro", _fmt_unit_pct(ru)))

    n_problem = sum(1 for _, q in evaluated if q.get("issues"))
    rows.append(("Columnas con problemas", str(n_problem)))
@@ -182,22 +239,9 @@ def _summary_block(profile: dict, evaluated: list):
    return model.KVTable(rows=rows, title="Resumen de calidad")


-def _fmt_unit_pct_or_pct(value) -> str:
-    """Format a value that may be a 0-1 fraction or an already-0-100 percentage."""
-    try:
-        num = float(value)
-    except (TypeError, ValueError):
-        return model._safe_str(value)
-    if num != num:  # NaN
-        return "—"
-    pct = num * 100 if num <= 1.0 else num
-    text = f"{pct:.1f}".rstrip("0").rstrip(".")
-    return f"{text}%"
-
-
 def _scores_block(evaluated: list):
-    """DataTable with per-column score and its three-criteria breakdown."""
-    header = ["Columna", "Calidad", "Completitud", "Validez", "Consistencia"]
+    """DataTable with per-column score and its completeness/validity breakdown."""
+    header = ["Columna", "Calidad", "Completitud", "Validez"]
    rows = []
    # Worst columns first so the reader sees the problems at the top.
    ordered = sorted(
@@ -210,22 +254,22 @@ def _scores_block(evaluated: list):
            col.get("name") or "(col)",
            _fmt_score(q.get("score")),
            _fmt_unit_pct(q.get("completeness")),
-            _fmt_unit_pct(q.get("validity")),
-            _fmt_unit_pct(q.get("consistency")),
+            _fmt_validity(q.get("validity")),
        ])
    if not rows:
        return None
    return model.DataTable(header=header, rows=rows,
                           title="Scores de calidad por columna",
-                           note="0 = peor, 100 = mejor; ordenado de peor a mejor")
+                           note="0 = peor, 100 = mejor; «n/a» = dimensión no "
+                                "medible; ordenado de peor a mejor")


 def _issues_block(evaluated: list):
-    """DataTable listing Spanish issues per column, or a Note when there are none."""
-    header = ["Columna", "Problemas detectados (español)"]
+    """DataTable listing ONLY real quality defects per column, or a Note."""
+    header = ["Columna", "Problemas de calidad (español)"]
    rows = []
    for col, q in evaluated:
-        joined = _join_issues(q.get("issues"))
+        joined = _join_cells(q.get("issues"))
        if joined:
            rows.append([col.get("name") or "(col)", joined])
    if not rows:
@@ -235,6 +279,63 @@ def _issues_block(evaluated: list):
                           title="Problemas de calidad por columna")


+def _observations_block(evaluated: list):
+    """DataTable listing analytical observations per column, or None.
+
+    Observations (outliers, constant columns, ids, strong skew) are NOT quality
+    defects: they do not affect the score. Returned as a separate table from the
+    issues so the report never presents a legitimate outlier as a problem.
+    """
+    header = ["Columna", "Observaciones analíticas"]
+    rows = []
+    for col, q in evaluated:
+        joined = _join_cells(q.get("observations"))
+        if joined:
+            rows.append([col.get("name") or "(col)", joined])
+    if not rows:
+        return None
+    return model.DataTable(
+        header=header, rows=rows,
+        title="Observaciones analíticas por columna",
+        note="No son defectos de calidad y NO afectan al score; orientan el "
+             "análisis (atípicos, columnas constantes, identificadores).")
+
+
+def _term(key: str, label: str, mark: bool) -> str:
+    """Render a term as a clickable glossary span when marking is enabled."""
+    if mark:
+        return f"[[term:{key}]]**{label}**[[/term]]"
+    return f"**{label}**"
+
+
+def _criteria_intro(mark: bool) -> str:
+    """Intro paragraph explaining the two scored dimensions and the principle."""
+    calidad = _term("calidad_datos", "calidad de datos", mark)
+    completitud = _term("completitud", "Completitud (peso 60%)", mark)
+    validez = _term("validez", "Validez (peso 40%, cuando es medible)", mark)
+    unicidad = _term("unicidad_registro", "unicidad de registro", mark)
+    return (
+        f"La {calidad} de cada columna es un score de 0 a 100 que combina solo "
+        "dimensiones medibles desde el perfil de la tabla, sin fuente externa "
+        "de verdad:\n\n"
+        f"- {completitud}: proporción de valores presentes (1 − % de nulos; en "
+        "texto, las celdas vacías cuentan como faltantes). Los nulos y vacíos "
+        "bajan el score.\n"
+        f"- {validez}: proporción de valores que encajan con su tipo o formato "
+        "(un número que parsea, una fecha legible, un email con forma de email). "
+        "Si una columna es texto libre sin formato esperado, la validez no se "
+        "mide y el score se basa solo en la completitud.\n\n"
+        f"Score de columna = 100 × (0,6·completitud + 0,4·validez), "
+        "renormalizado cuando la validez no aplica. A nivel de tabla se añade "
+        f"la {unicidad} (1 − % de filas duplicadas).\n\n"
+        "**Los valores atípicos (outliers) NO bajan la calidad.** Un valor "
+        "extremo puede ser real y correcto; detectar atípicos es parte del "
+        "análisis de la distribución, no un juicio de corrección. Por eso, junto "
+        "con las columnas constantes y los identificadores, se listan aparte "
+        "como **observaciones analíticas** que no afectan al score."
+    )
+
+
 def build_calidad(profile: dict, ctx: dict):
    """Build the data-quality Chapter, or None if the profile has no columns.

@@ -250,17 +351,35 @@ def build_calidad(profile: dict, ctx: dict):
    if not evaluated:
        return None  # no columns to score -> chapter does not apply.

+    # Register the criteria terms in the shared glossary (if present) and mark
+    # their first appearance clickable. Contract §11.1.
+    glossary = ctx.get("glossary")
+    mark = False
+    if isinstance(glossary, model.GlossaryCollector):
+        for key, (label, definition) in _TERMS.items():
+            glossary.add(key, label, definition)
+        mark = True
+
    blocks = [
        model.Heading(text="Cómo se calcula la calidad", level=2),
-        model.Markdown(text=_CRITERIA_INTRO),
+        model.Markdown(text=_criteria_intro(mark)),
        _summary_block(profile, evaluated),
        model.Heading(text="Scores por columna", level=2),
    ]
    scores = _scores_block(evaluated)
    if scores is not None:
        blocks.append(scores)
-    blocks.append(model.Heading(text="Problemas detectados", level=2))
+
+    blocks.append(model.Heading(text="Problemas de calidad", level=2))
    blocks.append(_issues_block(evaluated))

+    observations = _observations_block(evaluated)
+    if observations is not None:
+        blocks.append(model.Heading(text="Observaciones analíticas", level=2))
+        blocks.append(model.Note(
+            "Las observaciones siguientes NO son defectos de calidad y no "
+            "afectan al score: son señales para orientar el análisis."))
+        blocks.append(observations)
+
    return model.Chapter(id=CHAPTER_ID, title=CHAPTER_TITLE,
                         version=CHAPTER_VERSION, blocks=blocks)
@@ -1,11 +1,12 @@
-"""Tests for the CALIDAD chapter — DoD: golden + edges + anti-cut.
+"""Tests for the CALIDAD chapter — DoD: golden + edges + anti-cut + glossary.

 Self-contained: builds synthetic TableProfiles (no DuckDB) so the suite is fast
-and deterministic. Verifies that the chapter explains the quality criteria, shows
-per-column scores with the completeness/validity/consistency breakdown, lists the
-issues in Spanish (separate from the type flags), returns None when it does not
-apply, and that a wide profile with long names renders to PDF and PPTX without
-cutting any cell text (long content wraps, it is never truncated).
+and deterministic. Verifies the report-2046 quality model: the chapter explains
+the two scored dimensions (completitud 60% / validez 40%), shows per-column
+scores without a consistency column, keeps quality DEFECTS (issues) separate
+from analytical OBSERVATIONS (outliers, constant, ids), hooks the criteria terms
+into the glossary, returns None when it does not apply, and renders a wide
+profile to PDF and PPTX without cutting any cell text.
 """

 import os
@@ -20,28 +21,30 @@ from datascience.automatic_eda.chapters.calidad import (
    CHAPTER_VERSION,
 )
 from datascience.automatic_eda import build_document, render_pdf, render_pptx
+from datascience.automatic_eda import model


 def _profile() -> dict:
    """A small profile with one column per quality problem (nulls, outliers,
-    constant, high-cardinality id) plus one clean column."""
+    constant, high-cardinality id) plus one clean column. ``outlier_pct`` is in
+    the 0-100 scale that describe_numeric actually emits."""
    return {
        "table": "demo",
-        "quality_score": 72.5,
+        "quality_score": 82.0,
        "duplicate_pct": 0.04,
        "null_cell_pct": 0.11,
        "constant_cols": ["flag_const"],
        "all_null_cols": [],
        "columns": [
-            {"name": "edad", "inferred_type": "integer", "null_pct": 0.2,
-             "numeric": {"outlier_pct": 0.15, "min": 0, "max": 99},
-             "quality_score": 60},
+            {"name": "edad", "inferred_type": "numeric", "null_pct": 0.2,
+             "n_rows": 100, "unique_pct": 0.5,
+             "numeric": {"outlier_pct": 15.0, "min": 0, "max": 99}},
            {"name": "nombre", "inferred_type": "text", "null_pct": 0.0,
-             "unique_pct": 0.98, "quality_score": 80},
+             "unique_pct": 0.98, "flags": ["possible_id"]},
            {"name": "flag_const", "inferred_type": "text", "null_pct": 0.0,
-             "flags": ["constant"], "quality_score": 50},
-            {"name": "limpia", "inferred_type": "float", "null_pct": 0.0,
-             "numeric": {"outlier_pct": 0.0}, "quality_score": 100},
+             "unique_pct": 0.01, "flags": ["constant"]},
+            {"name": "limpia", "inferred_type": "numeric", "null_pct": 0.0,
+             "unique_pct": 0.5, "numeric": {"outlier_pct": 0.0}},
        ],
    }

@@ -50,16 +53,9 @@ def _tables(chapter):
    return [b for b in chapter.blocks if getattr(b, "kind", None) == "data_table"]


-def _scores_table(chapter):
+def _table_by_title(chapter, needle):
    for t in _tables(chapter):
-        if "Scores" in (t.title or ""):
-            return t
-    return None
-
-
-def _issues_table(chapter):
-    for t in _tables(chapter):
-        if "Problemas" in (t.title or ""):
+        if needle in (t.title or ""):
            return t
    return None

@@ -73,41 +69,84 @@ def test_golden_chapter_estructura_y_version():
    assert ch.id == "calidad"
    assert ch.version == CHAPTER_VERSION
    kinds = [b.kind for b in ch.blocks]
-    # intro heading + markdown criteria + summary kv + scores table + issues table
    assert "markdown" in kinds and "kv_table" in kinds and "data_table" in kinds


-def test_golden_intro_explica_criterios_y_pesos():
+def test_golden_intro_explica_dos_dimensiones_y_pesos():
    ch = build_calidad(_profile(), {})
    intro = [b for b in ch.blocks if b.kind == "markdown"][0].text
-    for needle in ("Completitud", "Validez", "Consistencia",
-                   "50%", "30%", "20%"):
+    for needle in ("Completitud", "Validez", "60%", "40%",
+                   "unicidad de registro"):
        assert needle in intro, f"falta {needle!r} en la intro de criterios"
+    # El principio: los outliers NO bajan la calidad.
+    assert "atípicos" in intro and "NO bajan" in intro
+    # Ya no se menciona la dimensión consistencia eliminada.
+    assert "20%" not in intro


-def test_golden_scores_incluyen_desglose_por_criterio():
+def test_golden_scores_sin_columna_consistencia():
    ch = build_calidad(_profile(), {})
-    scores = _scores_table(ch)
+    scores = _table_by_title(ch, "Scores")
    assert scores is not None
-    assert scores.header == ["Columna", "Calidad", "Completitud",
-                             "Validez", "Consistencia"]
-    # 4 columns scored, none dropped.
+    assert scores.header == ["Columna", "Calidad", "Completitud", "Validez"]
+    assert "Consistencia" not in scores.header
    assert len(scores.rows) == 4
    names = {r[0] for r in scores.rows}
    assert names == {"edad", "nombre", "flag_const", "limpia"}


-def test_golden_issues_en_espanol_separados_de_flags():
+def test_golden_outliers_en_observaciones_no_en_problemas():
    ch = build_calidad(_profile(), {})
-    issues = _issues_table(ch)
-    assert issues is not None
-    flat = " | ".join(" ".join(r) for r in issues.rows)
-    assert "nulos" in flat            # completeness issue (ES)
-    assert "outliers" in flat         # validity issue (ES)
-    assert "columna constante" in flat
-    assert "posible id de alta cardinalidad" in flat
-    # The raw type flag string must NOT leak as a "problem".
-    assert "constant" not in flat or "columna constante" in flat
+    problemas = _table_by_title(ch, "Problemas de calidad")
+    observaciones = _table_by_title(ch, "Observaciones")
+    assert problemas is not None
+    assert observaciones is not None
+
+    problemas_txt = " | ".join(" ".join(r) for r in problemas.rows)
+    observaciones_txt = " | ".join(" ".join(r) for r in observaciones.rows)
+
+    # Los nulos SÍ son problema de calidad.
+    assert "nulos" in problemas_txt
+    # Los outliers NO aparecen como problema...
+    assert "atípic" not in problemas_txt and "outlier" not in problemas_txt
+    # ...sino como observación analítica.
+    assert "atípic" in observaciones_txt
+    # Constante e id: observaciones, no problemas.
+    assert "constante" in observaciones_txt
+    assert "identificador" in observaciones_txt
+    assert "constante" not in problemas_txt
+
+
+def test_golden_score_columna_limpia_es_100():
+    """Columna sin nulos, numérica nativa: score 100 aunque tenga (o no) outliers."""
+    ch = build_calidad(_profile(), {})
+    scores = _table_by_title(ch, "Scores")
+    by_name = {r[0]: r for r in scores.rows}
+    assert by_name["limpia"][1] == "100 / 100"
+    # edad: 20% nulos -> 100*(0.6*0.8 + 0.4*1.0) = 88; los outliers no bajan nada.
+    assert by_name["edad"][1] == "88 / 100"
+
+
+# --------------------------------------------------------------------------- #
+# Glosario (contrato §11.1)
+# --------------------------------------------------------------------------- #
+def test_glosario_registra_los_cuatro_terminos_y_marca_clicable():
+    glossary = model.GlossaryCollector()
+    ch = build_calidad(_profile(), {"glossary": glossary})
+    for key in ("calidad_datos", "completitud", "validez", "unicidad_registro"):
+        assert glossary.has(key), f"término {key!r} no registrado en el glosario"
+    intro = [b for b in ch.blocks if b.kind == "markdown"][0].text
+    # Con colector presente, la primera aparición se marca clicable.
+    assert "[[term:completitud]]" in intro
+    assert "[[term:validez]]" in intro
+    assert "[[term:calidad_datos]]" in intro
+    assert "[[term:unicidad_registro]]" in intro
+
+
+def test_sin_glosario_no_marca_terminos():
+    ch = build_calidad(_profile(), {})  # ctx sin glossary
+    intro = [b for b in ch.blocks if b.kind == "markdown"][0].text
+    assert "[[term:" not in intro


 # --------------------------------------------------------------------------- #
@@ -124,17 +163,17 @@ def test_edge_perfil_limpio_sin_problemas_usa_nota():
    prof = {
        "quality_score": 100,
        "columns": [
-            {"name": "a", "inferred_type": "float", "null_pct": 0.0,
-             "numeric": {"outlier_pct": 0.0}},
-            {"name": "b", "inferred_type": "float", "null_pct": 0.0,
-             "numeric": {"outlier_pct": 0.0}},
+            {"name": "a", "inferred_type": "numeric", "null_pct": 0.0,
+             "unique_pct": 0.5, "numeric": {"outlier_pct": 0.0}},
+            {"name": "b", "inferred_type": "numeric", "null_pct": 0.0,
+             "unique_pct": 0.5, "numeric": {"outlier_pct": 0.0}},
        ],
    }
    ch = build_calidad(prof, {})
    assert ch is not None
-    assert _issues_table(ch) is None  # no issues table
+    assert _table_by_title(ch, "Problemas de calidad") is None  # no issues table
    notes = [b for b in ch.blocks if b.kind == "note"]
-    assert notes and "No se detectaron problemas" in notes[0].text
+    assert any("No se detectaron problemas" in n.text for n in notes)


 # --------------------------------------------------------------------------- #
@@ -143,44 +182,42 @@ def test_edge_perfil_limpio_sin_problemas_usa_nota():
 def _wide_profile(ncols: int = 22) -> dict:
    cols = [
        {"name": "identificador_unico_de_transaccion_con_nombre_muy_largo",
-         "inferred_type": "text", "null_pct": 0.0, "unique_pct": 0.99},
+         "inferred_type": "text", "null_pct": 0.0, "unique_pct": 0.99,
+         "flags": ["possible_id"]},
        {"name": "columna_constante_sin_ninguna_variacion_de_valor",
-         "inferred_type": "text", "null_pct": 0.0, "flags": ["constant"]},
+         "inferred_type": "text", "null_pct": 0.0, "unique_pct": 0.01,
+         "flags": ["constant"]},
    ]
    for k in range(ncols - 2):
        cols.append({
            "name": f"metrica_numerica_de_negocio_{k:02d}_con_nombre_largo",
-            "inferred_type": "float", "null_pct": 0.1 + (k % 3) * 0.05,
-            "numeric": {"outlier_pct": 0.08, "min": 0, "max": 1000},
+            "inferred_type": "numeric", "null_pct": 0.1 + (k % 3) * 0.05,
+            "unique_pct": 0.5,
+            "numeric": {"outlier_pct": 8.0, "min": 0, "max": 1000},
        })
-    return {"table": "ancha", "quality_score": 70.0, "columns": cols}
+    return {"table": "ancha", "quality_score": 70.0, "duplicate_pct": 0.0,
+            "columns": cols}


 def test_anticut_pdf_y_pptx_no_truncan_nombres_largos():
    prof = _wide_profile(22)
    full = build_document(prof, {"dataset_name": "ancha"})
    assert any(c.id == "calidad" for c in full)
-    # Render ONLY the calidad chapter so the anti-cut assertions are scoped to
-    # this chapter (other chapters, e.g. portada, legitimately contain '…').
    chapters = [c for c in full if c.id == "calidad"]
    long_name = "metrica_numerica_de_negocio_00_con_nombre_largo"
    with tempfile.TemporaryDirectory() as d:
        pdf = os.path.join(d, "q.pdf")
        pptx = os.path.join(d, "q.pptx")
        rp = render_pdf(chapters, pdf, {"title": "EDA"})
-        rx = render_pptx(chapters, pptx, {"title": "EDA"})
+        render_pptx(chapters, pptx, {"title": "EDA"})
        assert os.path.exists(pdf) and os.path.exists(pptx)
-        # The wide table forces pagination across several pages/slides.
        assert (rp or {}).get("n_pages", 0) >= 2

-        # PDF: the long name survives whole once wraps (spaces/newlines) removed,
-        # and there is no truncation marker.
        pdf_txt = "".join((pg.extract_text() or "") for pg in PdfReader(pdf).pages)
        assert "…" not in pdf_txt and "..." not in pdf_txt
        norm = re.sub(r"\s+", "", pdf_txt)
        assert long_name in norm, "el nombre largo se cortó en el PDF"

-        # PPTX: long name present in some cell, untruncated.
        allt = []
        for s in Presentation(pptx).slides:
            for sh in s.shapes:
@@ -1,500 +0,0 @@
-"""Key-relations chapter (RELACIONES) — the keys / join structure of the data.
-
-This chapter is the *relational* section of an AutomaticEDA report. It answers a
-single question for the table (or the whole DuckDB source it lives in): **how do
-the keys relate?** It composes, without reimplementing them, the registry's
-relation primitives and degrades honestly when a layer does not apply.
-
-It renders, in order, only the layers that have something to say:
-
-1. **Declared keys** (real schema constraints) — when the DuckDB source declares
-   PRIMARY KEY / FOREIGN KEY / UNIQUE constraints, they are read verbatim via
-   ``detect_declared_keys_duckdb`` and shown as ground truth: which column is the
-   PK, which columns are FKs and the table/column they point to.
-2. **Primary-key candidates** — the ``key_candidates`` the TableProfile already
-   carries (columns whose cardinality equals the row count, with no nulls). These
-   are *candidates*: a column that could serve as the row identifier.
-3. **Foreign-key candidates** when none are declared:
-   - **Inter-table** (the DuckDB source has several tables): real FK candidates by
-     name signal + value containment via ``infer_fk_containment_duckdb``, plus the
-     join graph (roles + a pasteable Mermaid diagram) via ``build_join_graph``.
-   - **Intra-table** (a single table): columns that *look* like a foreign key by a
-     name+cardinality heuristic (``suggest_intratable_fk_candidates``). This is a
-     **suggestion**, explicitly flagged as a heuristic, never an assertion.
-
-``build_relaciones(profile, ctx) -> Chapter | None``: returns ``None`` when there
-is nothing to say (no declared key, no key candidates, and no FK candidate —
-inter- or intra-table). Reads everything defensively (``.get``) and never raises:
-anything missing degrades to a note or is omitted; a failing registry call drops
-its layer instead of aborting the chapter.
-
-ctx keys this chapter consumes (all optional):
-    db_path, table : str — the DuckDB file and table being profiled (set by
-        ``build_eda_render_ctx``). ``db_path`` is needed to read declared
-        constraints, to list the sibling tables, and to run the containment-based
-        FK inference. Without it, only the profile-derived layers (PK candidates,
-        intra-table FK heuristic) are available.
-    glossary : model.GlossaryCollector — shared glossary; the chapter registers
-        the relational terms (PK, FK, containment, cardinality) and marks their
-        first appearance clickable.
-
-Contract: build_<id>(profile, ctx) -> Chapter | None ; CHAPTER_VERSION = "x.y.z".
-"""
-
-from __future__ import annotations
-
-from .. import model
-
-# Pure/impure registry functions (group ``eda``) this chapter composes. Imported
-# defensively (module-leaf imports, like the AGREGACION chapter) so the chapter
-# still builds — degrading the affected layer to nothing — if a function is
-# somehow unavailable / not indexed yet.
-try:
-    from datascience.detect_declared_keys_duckdb import detect_declared_keys_duckdb
-except Exception:  # noqa: BLE001 — keep the chapter importable no matter what.
-    detect_declared_keys_duckdb = None  # type: ignore[assignment]
-try:
-    from datascience.infer_fk_containment_duckdb import infer_fk_containment_duckdb
-except Exception:  # noqa: BLE001
-    infer_fk_containment_duckdb = None  # type: ignore[assignment]
-try:
-    from datascience.build_join_graph import build_join_graph
-except Exception:  # noqa: BLE001
-    build_join_graph = None  # type: ignore[assignment]
-try:
-    from datascience.suggest_intratable_fk_candidates import (
-        suggest_intratable_fk_candidates,
-    )
-except Exception:  # noqa: BLE001
-    suggest_intratable_fk_candidates = None  # type: ignore[assignment]
-try:
-    from infra import duckdb_list_tables
-except Exception:  # noqa: BLE001
-    duckdb_list_tables = None  # type: ignore[assignment]
-
-CHAPTER_VERSION = "1.0.0"
-CHAPTER_ID = "relaciones"
-CHAPTER_TITLE = "Relaciones de clave"
-
-# Cap the inter-table FK table so a wide schema does not blow up the page; the
-# rest is summarized in a closing note (no silent truncation).
-MAX_FK_ROWS = 40
-
-# --------------------------------------------------------------------------- #
-# Glossary terms this chapter explains. Registered in the shared collector and
-# marked clickable on their first appearance (contract §11.1).
-# --------------------------------------------------------------------------- #
-_TERMS = {
-    "pk": (
-        "Clave primaria (PK)",
-        "Columna (o conjunto de columnas) que identifica de forma única cada fila "
-        "de una tabla: sus valores no se repiten y no son nulos. Una tabla tiene "
-        "como mucho una clave primaria; es el ancla por la que otras tablas la "
-        "referencian.",
-    ),
-    "fk": (
-        "Clave foránea (FK)",
-        "Columna de una tabla cuyos valores apuntan a la clave primaria de otra "
-        "tabla (o de la misma), creando una relación entre ambas. Una FK suele ser "
-        "N:1: muchas filas de la tabla origen comparten el mismo valor de la tabla "
-        "destino.",
-    ),
-    "containment": (
-        "Containment / inclusión",
-        "Señal con la que se infiere una clave foránea sin que la base la declare: "
-        "la fracción de valores distintos de una columna A que también aparecen "
-        "como valores de otra columna B. Si casi todos los valores de A están "
-        "contenidos en B (inclusión ≈ 1) y B parece una clave, A → B es una FK "
-        "candidata.",
-    ),
-    "cardinalidad": (
-        "Cardinalidad",
-        "Número de valores distintos de una columna. Cardinalidad igual al número "
-        "de filas (y sin nulos) señala un identificador (candidato a clave "
-        "primaria); cardinalidad alta pero menor que el número de filas, con "
-        "valores repetidos, es típica de una clave foránea.",
-    ),
-}
-
-
-def _register_terms(ctx: dict) -> bool:
-    """Register the relational terms in the shared glossary. Returns whether the
-    in-text appearances should be marked clickable."""
-    glossary = ctx.get("glossary")
-    if not isinstance(glossary, model.GlossaryCollector):
-        return False
-    for key, (label, definition) in _TERMS.items():
-        glossary.add(key, label, definition)
-    return True
-
-
-# --------------------------------------------------------------------------- #
-# Formatting helpers (mirror the other chapters' defensive style).
-# --------------------------------------------------------------------------- #
-def _fmt_int(value) -> str:
-    if value is None:
-        return "—"
-    try:
-        return f"{int(value):,}".replace(",", ".")
-    except (TypeError, ValueError):
-        return model._safe_str(value)
-
-
-def _fmt_pct_fraction(value, decimals: int = 1) -> str:
-    """Format a 0–1 fraction as a percentage. None -> placeholder."""
-    if value is None:
-        return "—"
-    try:
-        v = float(value)
-    except (TypeError, ValueError):
-        return model._safe_str(value)
-    if v <= 1.0:
-        v *= 100.0
-    return f"{v:.{decimals}f}%"
-
-
-def _fmt_ratio(value, decimals: int = 3) -> str:
-    """Format an already-0–1 ratio (inclusion) as a plain number."""
-    if value is None:
-        return "—"
-    try:
-        return f"{float(value):.{decimals}f}".rstrip("0").rstrip(".")
-    except (TypeError, ValueError):
-        return model._safe_str(value)
-
-
-def _is_dict(v) -> bool:
-    return isinstance(v, dict)
-
-
-def _columns_by_name(profile: dict) -> dict:
-    """Index the profile columns by name for quick metric lookup."""
-    out = {}
-    for col in (profile.get("columns") or []):
-        if _is_dict(col) and col.get("name") is not None:
-            out[col.get("name")] = col
-    return out
-
-
-# --------------------------------------------------------------------------- #
-# Layer 1 — declared keys (real schema constraints).
-# --------------------------------------------------------------------------- #
-def _declared_keys(db_path: str, table: str):
-    """Read declared PK/FK/UNIQUE for the source, or None if unavailable."""
-    if not db_path or detect_declared_keys_duckdb is None:
-        return None
-    try:
-        out = detect_declared_keys_duckdb(db_path, table)
-    except Exception:  # noqa: BLE001 — dict-no-throw: treat as unavailable.
-        return None
-    if not _is_dict(out) or out.get("status") != "ok":
-        return None
-    return out
-
-
-def _declared_section(declared: dict) -> list:
-    """Blocks for the declared-keys layer, or [] if there is nothing declared."""
-    pks = [p for p in (declared.get("primary_keys") or []) if _is_dict(p)]
-    fks = [f for f in (declared.get("foreign_keys") or []) if _is_dict(f)]
-    uqs = [u for u in (declared.get("unique") or []) if _is_dict(u)]
-    if not (pks or fks or uqs):
-        return []
-
-    blocks = [
-        model.Heading(text="Claves declaradas en el esquema", level=2),
-        model.Markdown(text=(
-            "La base **declara** estas relaciones de clave como restricciones "
-            "reales del esquema (constraints). Son la verdad de referencia: no se "
-            "infieren, se leen tal cual de la definición de las tablas.")),
-    ]
-
-    if pks:
-        rows = [[model._safe_str(p.get("table")),
-                 ", ".join(model._safe_str(c) for c in (p.get("columns") or []))]
-                for p in pks]
-        blocks.append(model.DataTable(
-            header=["Tabla", "Columna(s) PK"], rows=rows,
-            title="Claves primarias declaradas",
-            note="Cada fila: la clave primaria declarada de una tabla."))
-
-    if fks:
-        rows = []
-        for f in fks:
-            src = ", ".join(model._safe_str(c) for c in (f.get("columns") or []))
-            dst = ", ".join(
-                model._safe_str(c) for c in (f.get("referenced_columns") or []))
-            rows.append([
-                model._safe_str(f.get("table")), src,
-                model._safe_str(f.get("referenced_table")), dst])
-        blocks.append(model.DataTable(
-            header=["Tabla origen", "Columna(s) FK", "→ Tabla destino",
-                    "Columna(s) destino"],
-            rows=rows, title="Claves foráneas declaradas",
-            note="Cada fila: una FK declarada — origen → destino."))
-
-    if uqs:
-        rows = [[model._safe_str(u.get("table")),
-                 ", ".join(model._safe_str(c) for c in (u.get("columns") or []))]
-                for u in uqs]
-        blocks.append(model.DataTable(
-            header=["Tabla", "Columna(s) UNIQUE"], rows=rows,
-            title="Restricciones UNIQUE declaradas"))
-
-    return blocks
-
-
-# --------------------------------------------------------------------------- #
-# Layer 2 — primary-key candidates (from the profile).
-# --------------------------------------------------------------------------- #
-def _pk_candidates_section(profile: dict, mark: bool) -> list:
-    """Blocks for the PK-candidates layer, or [] if there are none."""
-    keys = [k for k in (profile.get("key_candidates") or []) if k is not None]
-    if not keys:
-        return []
-    by_name = _columns_by_name(profile)
-
-    pk = ("[[term:pk]]**clave primaria**[[/term]]" if mark
-          else "**clave primaria**")
-    intro = (
-        f"Estas columnas son **candidatas a {pk}**: su "
-        "[[term:cardinalidad]]cardinalidad[[/term]] iguala al número de filas y no "
-        "tienen nulos, así que cada valor identifica una fila distinta. Son "
-        "candidatas, no una clave declarada: la base no las marca como tal."
-        if mark else
-        "Estas columnas son **candidatas a clave primaria**: su cardinalidad "
-        "iguala al número de filas y no tienen nulos, así que cada valor "
-        "identifica una fila distinta.")
-
-    rows = []
-    for name in keys:
-        col = by_name.get(name) or {}
-        rows.append([
-            model._safe_str(name),
-            _fmt_int(col.get("distinct_count")),
-            _fmt_pct_fraction(col.get("unique_pct")),
-            model._safe_str(col.get("inferred_type") or col.get("physical_type") or "—"),
-        ])
-    return [
-        model.Heading(text="Candidatos a clave primaria", level=2),
-        model.Markdown(text=intro),
-        model.DataTable(
-            header=["Columna", "Valores distintos", "% único", "Tipo"],
-            rows=rows, title="Candidatas a clave primaria",
-            note=f"{_fmt_int(profile.get('n_rows'))} filas en total como referencia."),
-    ]
-
-
-# --------------------------------------------------------------------------- #
-# Layer 3a — inter-table FK candidates (containment) + join graph.
-# --------------------------------------------------------------------------- #
-def _list_source_tables(db_path: str) -> list:
-    """List the tables in the DuckDB source, or [] if it can't be listed."""
-    if not db_path or duckdb_list_tables is None:
-        return []
-    try:
-        out = duckdb_list_tables(db_path)
-    except Exception:  # noqa: BLE001
-        return []
-    if not _is_dict(out) or out.get("status") != "ok":
-        return []
-    return [t for t in (out.get("tables") or []) if isinstance(t, str)]
-
-
-def _inter_table_section(db_path: str, tables: list, mark: bool) -> list:
-    """Blocks for the inter-table FK layer (containment + join graph), or []."""
-    if infer_fk_containment_duckdb is None or len(tables) < 2:
-        return []
-    try:
-        fk = infer_fk_containment_duckdb(db_path, tables=tables)
-    except Exception:  # noqa: BLE001
-        return []
-    if not _is_dict(fk) or fk.get("status") != "ok":
-        return []
-    candidates = [c for c in (fk.get("fk_candidates") or []) if _is_dict(c)]
-    if not candidates:
-        return []
-
-    containment = ("[[term:containment]]containment (inclusión de valores)[[/term]]"
-                   if mark else "containment (inclusión de valores)")
-    fk_term = "[[term:fk]]**claves foráneas**[[/term]]" if mark else "**claves foráneas**"
-    blocks = [
-        model.Heading(text="Claves foráneas candidatas (inter-tabla)", level=2),
-        model.Markdown(text=(
-            f"La fuente tiene varias tablas. Estas {fk_term} candidatas se infieren "
-            f"por señal de nombre y por {containment}: una columna de una tabla cuyos "
-            "valores están contenidos en la clave de otra. No están declaradas por "
-            "la base; son la relación más probable según los datos.")),
-    ]
-
-    shown = candidates[:MAX_FK_ROWS]
-    rows = []
-    for c in shown:
-        rows.append([
-            f"{model._safe_str(c.get('from_table'))}.{model._safe_str(c.get('from_col'))}",
-            f"{model._safe_str(c.get('to_table'))}.{model._safe_str(c.get('to_col'))}",
-            _fmt_ratio(c.get("inclusion")),
-            model._safe_str(c.get("cardinality") or "—"),
-            "sí" if c.get("name_match") else "no",
-        ])
-    note = "Ordenadas por señal de nombre e inclusión."
-    if len(candidates) > len(shown):
-        note += f" Se muestran {len(shown)} de {len(candidates)} candidatas."
-    blocks.append(model.DataTable(
-        header=["Origen", "→ Destino", "Inclusión", "Cardinalidad", "Coincide nombre"],
-        rows=rows, title="FK candidatas por containment", note=note))
-
-    # Join graph: node roles + a pasteable Mermaid diagram, kept together.
-    if build_join_graph is not None:
-        try:
-            graph = build_join_graph(candidates, tables=tables)
-        except Exception:  # noqa: BLE001
-            graph = None
-        if _is_dict(graph):
-            graph_blocks = [model.Heading(text="Grafo de relaciones", level=3)]
-            nodes = [n for n in (graph.get("nodes") or []) if _is_dict(n)]
-            if nodes:
-                node_rows = [[
-                    model._safe_str(n.get("table")),
-                    model._safe_str(n.get("role") or "—"),
-                    _fmt_int(n.get("out_degree")),
-                    _fmt_int(n.get("in_degree")),
-                ] for n in nodes]
-                graph_blocks.append(model.DataTable(
-                    header=["Tabla", "Rol", "FK salientes", "FK entrantes"],
-                    rows=node_rows, title="Tablas y su rol en el grafo",
-                    note="Rol: fact (apunta a otras), dimension (referenciada), "
-                         "bridge (ambas), standalone (aislada)."))
-            hubs = [h for h in (graph.get("hubs") or []) if h]
-            if hubs:
-                graph_blocks.append(model.Markdown(text=(
-                    "Tablas con más relaciones salientes (candidatas a tabla de "
-                    "hechos): " + ", ".join(model._safe_str(h) for h in hubs) + ".")))
-            mermaid = model._safe_str(graph.get("mermaid")).strip()
-            if mermaid:
-                graph_blocks.append(model.Markdown(text=(
-                    "Diagrama de las relaciones (pegable en un bloque Mermaid):")))
-                graph_blocks.append(model.Markdown(
-                    text="```mermaid\n" + mermaid + "\n```"))
-            if len(graph_blocks) > 1:
-                blocks.append(model.Group(blocks=graph_blocks,
-                                          title="Grafo de relaciones"))
-
-    skipped = [s for s in (fk.get("skipped") or []) if s]
-    if skipped:
-        blocks.append(model.Note(
-            "Algunos pares se omitieron por tamaño: "
-            + "; ".join(model._safe_str(s) for s in skipped) + "."))
-    return blocks
-
-
-# --------------------------------------------------------------------------- #
-# Layer 3b — intra-table FK candidates (name+cardinality heuristic).
-# --------------------------------------------------------------------------- #
-def _intra_table_section(profile: dict, mark: bool) -> list:
-    """Blocks for the intra-table FK heuristic layer, or [] if no candidates."""
-    if suggest_intratable_fk_candidates is None:
-        return []
-    try:
-        cands = suggest_intratable_fk_candidates(profile)
-    except Exception:  # noqa: BLE001
-        return []
-    cands = [c for c in (cands or []) if _is_dict(c)]
-    if not cands:
-        return []
-
-    fk_term = "[[term:fk]]**claves foráneas**[[/term]]" if mark else "**claves foráneas**"
-    blocks = [
-        model.Heading(text="Posibles claves foráneas (heurística de nombre)", level=2),
-        model.Markdown(text=(
-            f"No hay otras tablas que referenciar, pero algunas columnas **parecen** "
-            f"{fk_term} por su nombre (terminan en «id») y su cardinalidad (muchos "
-            "valores repetidos, N:1). Es una **sugerencia heurística**, no una "
-            "afirmación: el nombre de la tabla destino es una conjetura y no se "
-            "comprueba inclusión de valores contra ninguna tabla real.")),
-    ]
-    rows = []
-    for c in cands:
-        rows.append([
-            model._safe_str(c.get("column")),
-            model._safe_str(c.get("ref_table_guess") or "—"),
-            _fmt_int(c.get("distinct_count")),
-            _fmt_pct_fraction(c.get("unique_pct")),
-            model._safe_str(c.get("inferred_type") or c.get("physical_type") or "—"),
-            model._safe_str(c.get("reason") or ""),
-        ])
-    blocks.append(model.DataTable(
-        header=["Columna", "Posible tabla", "Valores distintos", "% único",
-                "Tipo", "Motivo"],
-        rows=rows, title="Posibles FK por nombre y cardinalidad",
-        note="Heurística: posibles falsos positivos/negativos. No confirma containment."))
-    blocks.append(model.Note(
-        "Estas sugerencias se basan solo en el nombre y la cardinalidad. Para "
-        "confirmarlas haría falta la tabla destino y comprobar la inclusión de "
-        "valores (containment)."))
-    return blocks
-
-
-# --------------------------------------------------------------------------- #
-# Entry point.
-# --------------------------------------------------------------------------- #
-def _intro_blocks(mark: bool) -> list:
-    pk = "[[term:pk]]clave primaria[[/term]]" if mark else "clave primaria"
-    fk = "[[term:fk]]clave foránea[[/term]]" if mark else "clave foránea"
-    text = (
-        f"Este capítulo analiza las **relaciones de clave** de la tabla: qué columna "
-        f"identifica cada fila (la {pk}) y qué columnas referencian a otra tabla (las "
-        f"{fk}). Cuando la base las **declara** como restricciones del esquema, se "
-        "muestran tal cual; cuando no, se proponen las más probables a partir de los "
-        "datos —por inclusión de valores entre tablas (containment) o, en una sola "
-        "tabla, por una heurística de nombre y cardinalidad— siempre marcadas como "
-        "candidatas, nunca como hechos.")
-    return [model.Heading(text=CHAPTER_TITLE, level=1), model.Markdown(text=text)]
-
-
-def build_relaciones(profile: dict, ctx: dict):
-    """Build the RELACIONES Chapter, or None if there is nothing to say.
-
-    Args:
-        profile: the ``eda`` group TableProfile dict (may be None/empty).
-        ctx: presentation context. Consumes ``db_path`` + ``table`` (to read
-            declared constraints, list sibling tables and run the containment FK
-            inference) and ``glossary`` (to register the relational terms).
-
-    Returns:
-        A ``model.Chapter`` with the applicable relation layers; or ``None`` when
-        the dataset has no declared key, no key candidates and no FK candidate
-        (neither inter- nor intra-table).
-    """
-    if not isinstance(profile, dict):
-        profile = {}
-    ctx = ctx if isinstance(ctx, dict) else {}
-    db_path = ctx.get("db_path")
-    table = ctx.get("table")
-
-    mark = _register_terms(ctx)
-
-    # Build each layer; the chapter is the concatenation of the non-empty ones.
-    declared = _declared_keys(db_path, table)
-    declared_blocks = _declared_section(declared) if declared else []
-    declared_has_fk = bool(declared and declared.get("foreign_keys"))
-
-    pk_blocks = _pk_candidates_section(profile, mark)
-
-    tables = _list_source_tables(db_path)
-    inter_blocks = _inter_table_section(db_path, tables, mark)
-
-    # The intra-table heuristic only makes sense when no real FK is available for
-    # this table — neither declared nor inferred inter-table. Otherwise the real
-    # relations already answer the question and the heuristic is just noise.
-    if declared_has_fk or inter_blocks:
-        intra_blocks = []
-    else:
-        intra_blocks = _intra_table_section(profile, mark)
-
-    body = declared_blocks + pk_blocks + inter_blocks + intra_blocks
-    if not body:
-        return None  # chapter does not apply: nothing to say about relations.
-
-    blocks = _intro_blocks(mark) + body
-    return model.Chapter(id=CHAPTER_ID, title=CHAPTER_TITLE,
-                         version=CHAPTER_VERSION, blocks=blocks)
@@ -1,273 +0,0 @@
-"""Tests for the RELACIONES chapter — DoD: golden(s) + edges + no-cut render.
-
-Two goldens covering the two real paths of the chapter:
-
- **Intra-table** (a single table, no db source for relations): the chapter shows
-  the primary-key candidates from the profile and the heuristic foreign-key
-  suggestions (name + cardinality), explicitly flagged as a heuristic. Renders to
-  PDF and PPTX with nothing cut.
- **Inter-table** (a real DuckDB file with two related tables, customers/orders,
-  with a declared FK): the chapter shows the declared keys, the containment-based
-  FK candidates and the join graph (roles + a pasteable Mermaid diagram).
-
-Edges: a profile with no key candidate and no FK-looking column returns None;
-``None`` / ``{}`` profiles do not raise. The chapter registers its glossary terms.
-
-Layers that depend on the sibling registry functions delegated alongside this
-chapter (``detect_declared_keys_duckdb``, ``suggest_intratable_fk_candidates``)
-are asserted **conditionally on the function being importable**, so the chapter's
-honest-degradation contract is what is tested, never a hard dependency on import
-timing.
-"""
-
-import os
-import tempfile
-
-import duckdb
-from pptx import Presentation
-from pypdf import PdfReader
-
-from datascience.automatic_eda.chapters.relaciones import build_relaciones
-from datascience.automatic_eda.model import Chapter, Group, GlossaryCollector
-from datascience.render_automatic_eda_pdf import render_automatic_eda_pdf
-from datascience.render_automatic_eda_pptx import render_automatic_eda_pptx
-
-# The optional sibling functions: their layers are asserted only when present.
-try:
-    from datascience.detect_declared_keys_duckdb import detect_declared_keys_duckdb
-except Exception:  # noqa: BLE001
-    detect_declared_keys_duckdb = None
-try:
-    from datascience.suggest_intratable_fk_candidates import (
-        suggest_intratable_fk_candidates,
-    )
-except Exception:  # noqa: BLE001
-    suggest_intratable_fk_candidates = None
-
-
-# --------------------------------------------------------------------------- #
-# Helpers.
-# --------------------------------------------------------------------------- #
-def _flatten(blocks) -> list:
-    """Flatten Group blocks so a test can inspect every leaf block."""
-    out = []
-    for b in blocks:
-        if isinstance(b, Group):
-            out.extend(_flatten(b.blocks))
-        else:
-            out.append(b)
-    return out
-
-
-def _text_of(chapter: Chapter) -> str:
-    """Collect all visible text of a chapter's blocks into one string."""
-    parts = []
-    for b in _flatten(chapter.blocks):
-        for attr in ("text", "title", "note"):
-            v = getattr(b, attr, None)
-            if isinstance(v, str):
-                parts.append(v)
-        header = getattr(b, "header", None)
-        if isinstance(header, list):
-            parts.extend(str(c) for c in header)
-        rows = getattr(b, "rows", None)
-        if isinstance(rows, list):
-            for r in rows:
-                if isinstance(r, (list, tuple)):
-                    parts.extend(str(c) for c in r)
-                else:
-                    parts.append(str(r))
-    return "\n".join(parts)
-
-
-def _render_both(chapter: Chapter, tag: str):
-    """Render the chapter to PDF and PPTX; return (pdf_text, n_slides)."""
-    tmp = tempfile.mkdtemp(prefix=f"relaciones_{tag}_")
-    pdf_path = os.path.join(tmp, "out.pdf")
-    pptx_path = os.path.join(tmp, "out.pptx")
-    meta = {"title": f"EDA — {tag}"}
-    render_automatic_eda_pdf([chapter], pdf_path, meta)
-    render_automatic_eda_pptx([chapter], pptx_path, meta)
-    assert os.path.exists(pdf_path) and os.path.getsize(pdf_path) > 0
-    assert os.path.exists(pptx_path) and os.path.getsize(pptx_path) > 0
-    text = "".join(p.extract_text() or "" for p in PdfReader(pdf_path).pages)
-    n_slides = len(Presentation(pptx_path).slides)
-    return text, n_slides
-
-
-# --------------------------------------------------------------------------- #
-# Fixtures.
-# --------------------------------------------------------------------------- #
-def _titanic_profile() -> dict:
-    """A single-table profile: a PK candidate + a column that looks like a FK."""
-    return {
-        "table": "titanic",
-        "source": "/data/titanic.csv",
-        "n_rows": 891,
-        "n_cols": 4,
-        "key_candidates": ["PassengerId"],
-        "columns": [
-            {"name": "PassengerId", "inferred_type": "numeric",
-             "physical_type": "BIGINT", "distinct_count": 891,
-             "unique_pct": 1.0, "flags": ["possible_id"]},
-            {"name": "ticket_id", "inferred_type": "numeric",
-             "physical_type": "BIGINT", "distinct_count": 681,
-             "unique_pct": 0.76, "flags": []},
-            {"name": "fare", "inferred_type": "numeric",
-             "physical_type": "DOUBLE", "distinct_count": 248,
-             "unique_pct": 0.28, "flags": []},
-            {"name": "sex", "inferred_type": "categorical",
-             "physical_type": "VARCHAR", "distinct_count": 2,
-             "unique_pct": 0.002, "flags": []},
-        ],
-    }
-
-
-def _make_relational_db(path: str) -> None:
-    """Create a small DuckDB with customers(id) <- orders(customer_id), real FK."""
-    con = duckdb.connect(path)
-    con.execute("CREATE TABLE customers(id INTEGER PRIMARY KEY, name TEXT)")
-    con.execute(
-        "CREATE TABLE orders(id INTEGER PRIMARY KEY, "
-        "customer_id INTEGER REFERENCES customers(id), amount DOUBLE)")
-    con.execute("INSERT INTO customers VALUES "
-                "(1,'a'),(2,'b'),(3,'c'),(4,'d'),(5,'e')")
-    con.execute("INSERT INTO orders VALUES "
-                "(1,1,10.0),(2,1,20.0),(3,2,30.0),(4,3,40.0),"
-                "(5,3,50.0),(6,4,60.0),(7,5,70.0),(8,2,80.0)")
-    con.close()
-
-
-def _orders_profile() -> dict:
-    """A profile for the `orders` table of the relational DB."""
-    return {
-        "table": "orders",
-        "source": "orders",
-        "n_rows": 8,
-        "n_cols": 3,
-        "key_candidates": ["id"],
-        "columns": [
-            {"name": "id", "inferred_type": "numeric", "physical_type": "INTEGER",
-             "distinct_count": 8, "unique_pct": 1.0, "flags": ["possible_id"]},
-            {"name": "customer_id", "inferred_type": "numeric",
-             "physical_type": "INTEGER", "distinct_count": 5, "unique_pct": 0.625,
-             "flags": []},
-            {"name": "amount", "inferred_type": "numeric", "physical_type": "DOUBLE",
-             "distinct_count": 8, "unique_pct": 1.0, "flags": []},
-        ],
-    }
-
-
-# --------------------------------------------------------------------------- #
-# Golden 1 — intra-table.
-# --------------------------------------------------------------------------- #
-def test_golden_intra_table_pk_and_fk_heuristic():
-    """Single table: PK candidate shown; FK heuristic shown (if fn available);
-    renders to PDF + PPTX with nothing cut."""
-    prof = _titanic_profile()
-    glossary = GlossaryCollector()
-    # No db_path: only the profile-derived layers apply (no declared, no inter).
-    chapter = build_relaciones(prof, {"glossary": glossary})
-
-    assert isinstance(chapter, Chapter)
-    assert chapter.id == "relaciones"
-    text = _text_of(chapter)
-
-    # PK candidate is always present (comes from the profile).
-    assert "Candidatos a clave primaria" in text
-    assert "PassengerId" in text
-
-    # Glossary terms got registered.
-    for key in ("pk", "fk", "cardinalidad"):
-        assert glossary.has(key)
-
-    # FK heuristic layer: present iff the delegated function is importable.
-    if suggest_intratable_fk_candidates is not None:
-        assert "Posibles claves foráneas" in text
-        assert "ticket_id" in text
-        # The float measure and the PK itself are NOT suggested as FKs.
-        assert "Posibles FK por nombre" in text
-
-    pdf_text, n_slides = _render_both(chapter, "intra")
-    assert "PassengerId" in pdf_text
-    assert n_slides >= 1
-
-
-# --------------------------------------------------------------------------- #
-# Golden 2 — inter-table (real DuckDB).
-# --------------------------------------------------------------------------- #
-def test_golden_inter_table_containment_and_join_graph():
-    """Two related tables: declared FK (if fn available) + containment FK
-    candidate + Mermaid join graph."""
-    tmp = tempfile.mkdtemp(prefix="relaciones_db_")
-    db_path = os.path.join(tmp, "shop.duckdb")
-    _make_relational_db(db_path)
-
-    prof = _orders_profile()
-    glossary = GlossaryCollector()
-    chapter = build_relaciones(
-        prof, {"db_path": db_path, "table": "orders", "glossary": glossary})
-
-    assert isinstance(chapter, Chapter)
-    text = _text_of(chapter)
-
-    # Inter-table containment FK candidate: customer_id -> customers.id. This path
-    # uses infer_fk_containment_duckdb + build_join_graph, both already in the
-    # registry, so it must be present.
-    assert "Claves foráneas candidatas (inter-tabla)" in text
-    assert "orders.customer_id" in text
-    assert "customers.id" in text
-    # Join graph with a pasteable Mermaid diagram.
-    assert "Grafo de relaciones" in text
-    assert "mermaid" in text
-    assert "graph LR" in text
-    assert "containment" in text.lower()
-
-    # Declared-keys layer: present iff the delegated function is importable.
-    if detect_declared_keys_duckdb is not None:
-        assert "Claves declaradas en el esquema" in text
-        assert "Claves foráneas declaradas" in text
-
-    pdf_text, n_slides = _render_both(chapter, "inter")
-    assert "customer_id" in pdf_text
-    assert n_slides >= 1
-
-
-# --------------------------------------------------------------------------- #
-# Edges.
-# --------------------------------------------------------------------------- #
-def test_none_when_no_relations():
-    """No key candidates, no FK-looking columns, no db source -> None."""
-    prof = {
-        "table": "flat", "n_rows": 100, "n_cols": 2, "key_candidates": [],
-        "columns": [
-            {"name": "value", "inferred_type": "numeric", "physical_type": "DOUBLE",
-             "distinct_count": 50, "unique_pct": 0.5, "flags": []},
-            {"name": "label", "inferred_type": "categorical",
-             "physical_type": "VARCHAR", "distinct_count": 3, "unique_pct": 0.03,
-             "flags": []},
-        ],
-    }
-    assert build_relaciones(prof, {}) is None
-
-
-def test_empty_and_none_profile_do_not_raise():
-    """None / {} profile and missing ctx degrade to None without raising."""
-    assert build_relaciones(None, None) is None
-    assert build_relaciones({}, {}) is None
-    assert build_relaciones({}, {"glossary": GlossaryCollector()}) is None
-
-
-def test_pk_candidate_only_builds_chapter():
-    """A profile with only a key candidate (no FK anything, no db) still builds:
-    the relations chapter applies because there is a PK candidate to report."""
-    prof = {
-        "table": "t", "n_rows": 10, "n_cols": 1, "key_candidates": ["row_id"],
-        "columns": [
-            {"name": "row_id", "inferred_type": "numeric", "physical_type": "BIGINT",
-             "distinct_count": 10, "unique_pct": 1.0, "flags": ["possible_id"]},
-        ],
-    }
-    chapter = build_relaciones(prof, {})
-    assert isinstance(chapter, Chapter)
-    assert "Candidatos a clave primaria" in _text_of(chapter)
@@ -33,7 +33,6 @@ CHAPTER_ORDER = [
    "cat_distr",     # categorical distributions
    "calidad",       # data quality
    "correlacion",   # correlations / associations
-    "relaciones",    # key relations: declared/candidate PK + FK (inter/intra-table)
    "modelos",       # cheap models (PCA/KMeans/outliers)
    "timeseries",    # time-series analysis
    "geospatial",    # geospatial
@@ -4,10 +4,10 @@ name: column_quality_score
 kind: function
 lang: py
 domain: datascience
-version: "1.0.0"
+version: "2.0.0"
 purity: pure
 signature: "def column_quality_score(col: dict) -> dict"
-description: "Calcula un score de calidad de datos 0-100 para un ColumnProfile del grupo eda, con desglose completeness/validity/consistency y lista de issues legibles. Funcion pura, no muta el input."
+description: "Calcula un score de calidad de datos 0-100 para un ColumnProfile del grupo eda. Combina completeness (0.6) y validity (0.4) con renormalizacion por aplicabilidad; los outliers, columnas constantes e ids NO bajan el score (van a observations). Devuelve desglose por dimension, issues (defectos) y observations (señales analiticas). Funcion pura, no muta el input."
 tags: [eda, data-quality, profiling, scoring, datascience]
 uses_functions: []
 uses_types: []
@@ -17,20 +17,26 @@ error_type: ""
 imports: []
 example: |
  from datascience import column_quality_score
-  col = {"name": "precio", "inferred_type": "float", "null_pct": 0.2,
-         "unique_pct": 0.4, "flags": [], "numeric": {"outlier_pct": 0.08}}
+  col = {"name": "precio", "inferred_type": "numeric", "null_pct": 0.2,
+         "unique_pct": 0.4, "flags": [], "numeric": {"outlier_pct": 8.0}}
  column_quality_score(col)
-  # {"score": 86.8, "completeness": 0.8, "validity": 0.92,
-  #  "consistency": 1.0, "issues": ["20% nulos", "8% outliers"]}
+  # {"score": 88.0, "completeness": 0.8, "validity": 1.0,
+  #  "applicable": ["completeness", "validity"], "issues": ["20% nulos"],
+  #  "observations": ["8% de valores atípicos (z-score>3): ..."]}
 tested: true
 tests:
  - "test_clean_column_high_score"
-  - "test_half_null_lowers_completeness_and_score"
-  - "test_constant_column_flags_issue"
+  - "test_weights_60_40_native_type"
+  - "test_outliers_do_not_penalize_score"
+  - "test_nulls_lower_score_more_than_outliers"
+  - "test_validity_from_parse_rate_lowers_score"
+  - "test_validity_from_match_rate"
+  - "test_free_text_renormalizes_to_completeness_only"
+  - "test_all_null_column_scores_zero"
+  - "test_constant_column_scores_full_and_is_observation"
+  - "test_high_cardinality_id_scores_full_and_is_observation"
+  - "test_mostly_null_no_double_counts_validity"
  - "test_empty_dict_does_not_crash"
-  - "test_outliers_penalize_validity"
-  - "test_mostly_null_flag_halves_validity"
-  - "test_high_cardinality_text_flagged_as_id"
  - "test_none_values_treated_defensively"
  - "test_does_not_mutate_input"
 test_file_path: "python/functions/datascience/column_quality_score_test.py"
@@ -38,16 +44,22 @@ file_path: "python/functions/datascience/column_quality_score.py"
 params:
  - name: col
    desc: >
-      ColumnProfile dict del grupo eda (p.ej. salida de summarize_table_duckdb).
-      Se leen sus claves de forma defensiva con .get(...) y se toleran valores
-      None. Claves usadas: null_pct (0-1), inferred_type, semantic_type,
-      unique_pct (0-1), flags (list[str], reconoce "constant"/"mostly_null"),
-      numeric ({outlier_pct: 0-1, ...}|None) y match_rate (opcional, 0-1).
+      ColumnProfile dict del grupo eda (p.ej. salida de summarize_table_duckdb /
+      profile_table). Se leen sus claves de forma defensiva con .get(...) y se
+      toleran valores None. Claves usadas: null_pct (0-1), n_rows, empty_count
+      (texto), inferred_type, semantic_type, validity_rate (0-1, lo expone
+      profile_table al promocionar texto a numero/fecha), match_rate (0-1),
+      unique_pct (0-1), flags (list[str], reconoce
+      "constant"/"possible_id"/"high_cardinality") y numeric ({outlier_pct: 0-100,
+      skew, ...}|None).
 output: >
-  dict con score (float 0-100, redondeado a 1 decimal), completeness (0-1),
-  validity (0-1), consistency (0-1) e issues (list[str] de descripciones
-  legibles de los problemas detectados). score = round(100 * (0.5*completeness
-  + 0.3*validity + 0.2*consistency), 1).
+  dict con score (float 0-100, 1 decimal), completeness (0-1), validity (0-1 o
+  None si no aplicable), dimensions ({completeness, validity}), applicable
+  (list[str] de dimensiones que entraron en el score), issues (list[str] SOLO de
+  defectos de calidad: nulos, vacios, valores no conformes) y observations
+  (list[str] de señales analiticas que NO bajan el score: outliers, columna
+  constante, posible id, asimetria). score = round(100 * (0.6*completeness +
+  0.4*validity) / pesos_aplicables, 1), renormalizado cuando validity no aplica.
 ---

 ## Ejemplo
@@ -59,51 +71,71 @@ from datascience import column_quality_score
 col = {
    "name": "precio",
    "physical_type": "DOUBLE",
-    "inferred_type": "float",
+    "inferred_type": "numeric",
    "semantic_type": "",
-    "count": 800,
    "n_rows": 1000,
    "null_count": 200,
    "null_pct": 0.20,
    "distinct_count": 400,
    "unique_pct": 0.40,
    "flags": [],
-    "numeric": {"outlier_pct": 0.08},
+    "numeric": {"outlier_pct": 8.0, "skew": 0.3},
    "categorical": None,
    "datetime": None,
 }

 column_quality_score(col)
 # {
-#   "score": 86.8,
-#   "completeness": 0.8,    # 1 - 0.20
-#   "validity": 0.92,       # 1 - min(0.08, 0.3)
-#   "consistency": 1.0,
-#   "issues": ["20% nulos", "8% outliers"],
+#   "score": 88.0,            # 100 * (0.6*0.8 + 0.4*1.0)
+#   "completeness": 0.8,      # 1 - 0.20
+#   "validity": 1.0,          # numerica nativa: el tipo es conforme
+#   "dimensions": {"completeness": 0.8, "validity": 1.0},
+#   "applicable": ["completeness", "validity"],
+#   "issues": ["20% nulos"],                       # SOLO defectos de calidad
+#   "observations": ["8% de valores atípicos (z-score>3): ..."],  # NO bajan score
 # }
 ```

 ## Cuando usarla

 Cuando hayas perfilado una tabla con el grupo `eda` (p.ej.
-`summarize_table_duckdb`) y necesites un numero 0-100 por columna para
-ordenar/priorizar limpieza de datos, pintar semaforos de calidad en un
-dashboard, o decidir que columnas descartar antes de modelar. Es la capa de
-scoring sobre el ColumnProfile crudo: lee el perfil, no toca los datos.
+`summarize_table_duckdb` / `profile_table`) y necesites un numero 0-100 por
+columna para ordenar/priorizar limpieza de datos, pintar semaforos de calidad,
+o decidir que columnas descartar antes de modelar. Separa los **defectos de
+calidad reales** (`issues`: nulos, vacios, valores que no parsean a su tipo) de
+las **observaciones analiticas** (`observations`: outliers, columnas constantes,
+ids), que se reportan pero no penalizan. Es la capa de scoring sobre el
+ColumnProfile crudo: lee el perfil, no toca los datos.

-## Notas
+## Gotchas

-Funcion pura, sin I/O ni dependencias externas, no muta `col`. Lee todas las
-claves con `.get(...)` y tolera que vengan en `None` (un ColumnProfile recien
-salido de `summarize_table_duckdb` trae muchas claves a `None`), por lo que
-nunca falla por claves ausentes — un `{}` produce un resultado bien definido.
+Funcion pura, sin I/O, no muta `col`. Aun asi conviene saber:

-Pesos del score: completeness 0.5, validity 0.3, consistency 0.2.
+- **Los outliers NO bajan el score.** Un valor extremo puede ser real y correcto
+  (un cliente que compra mucho); detectar atipicos es analisis de la
+  distribucion, no un juicio de correccion. Salen en `observations`, no en
+  `issues`. Mismo trato para columnas constantes e identificadores de alta
+  cardinalidad: son observaciones, no defectos.
+- **`validity` puede ser `None`** (no aplicable): texto libre sin `semantic_type`
+  ni `validity_rate`, o columna 100% nula. En ese caso el score se renormaliza a
+  solo `completeness` (la columna no se premia ni castiga por algo no medible).
+- **`outlier_pct` se interpreta en escala 0-100** (la que emite
+  `describe_numeric`, z-score>3). Pasar una fraccion 0-1 produce un texto de
+  observacion con el % equivocado, pero NUNCA afecta al score.
+- **`validity_rate` lo puebla `profile_table`** al promocionar una columna de
+  texto a numero/fecha (fraccion que parsea). Si no esta presente y el tipo es
+  nativo numerico/fecha/bool, `validity = 1.0`.
+- Sin doble conteo: la falta de datos cuenta solo en `completeness` (el antiguo
+  castigo de `mostly_null` sobre `validity` se elimino).

- **completeness** = `1 - null_pct` (None -> 0 nulls -> 1.0).
- **validity**: parte de 1.0 y penaliza `min(outlier_pct, 0.3)` en columnas
-  numericas, `0.5 * (1 - match_rate)` si hay `semantic_type` declarado con
-  `match_rate` bajo disponible, y multiplica por 0.5 si el flag `mostly_null`
-  esta presente.
- **consistency**: 1.0 salvo flag `constant` (-> 0.3, columna poco informativa)
-  o texto con `unique_pct > 0.9` (-> 0.6, posible id de alta cardinalidad).
+## Capability growth log
+
+- v2.0.0 (2026-06-30) — nueva formula de calidad (report 2046): pesos 60/40
+  (completeness/validity) con renormalizacion por aplicabilidad; se elimina la
+  dimension `consistency`-como-informatividad y el doble castigo de
+  `mostly_null`; los outliers/constantes/ids salen del score a `observations`;
+  validity mide conformidad real (parse rate / match rate / tipo nativo). Salida
+  ampliada con `dimensions`, `applicable` y `observations`.
+- v1.0.0 — version inicial: pesos 50/30/20 (completeness/validity/consistency),
+  los outliers penalizaban validity (con bug de escala) y consistency penalizaba
+  informatividad.
@@ -1,34 +1,78 @@
 """Score de calidad de datos (0-100) para un ColumnProfile del grupo eda.

 Funcion pura: dado el perfil de una columna producido por el grupo de
-capacidad `eda` (p.ej. summarize_table_duckdb), calcula un score agregado
-de calidad junto a su desglose en completeness / validity / consistency y
-una lista de issues legibles. No realiza I/O ni muta el input.
+capacidad `eda` (p.ej. summarize_table_duckdb / profile_table), calcula un
+score agregado de calidad junto a su desglose por dimension y dos listas
+legibles separadas: `issues` (defectos de calidad reales que SI bajan el
+score) y `observations` (señales analiticas que NO bajan el score). No
+realiza I/O ni muta el input.
+
+Modelo (DAMA-DMBOK / ISO 8000), ver report 2046:
+
+- Solo entran en el score las dimensiones medibles automaticamente desde el
+  perfil, sin fuente externa de verdad: completeness y validity por columna.
+- Renormalizacion por aplicabilidad: si una dimension no es medible en la
+  columna (texto libre sin semantica -> validity no aplica; columna 100% nula
+  -> validity no medible), se excluye y los pesos se renormalizan sobre las
+  aplicables. Una columna ni se premia ni se castiga por algo no medible.
+- Sin doble conteo: la falta de datos cuenta solo en completeness (se elimino
+  el antiguo castigo extra de `mostly_null` sobre validity).
+- Los OUTLIERS NO bajan la calidad. Un valor extremo puede ser real y
+  correcto; detectar atipicos es analisis de la distribucion, no un juicio de
+  coreccion. Outliers, columnas constantes e identificadores de alta
+  cardinalidad pasan a `observations`, nunca a `issues`.
 """


+# Pesos base de las dimensiones de columna (se renormalizan por aplicabilidad).
+_W_COMPLETENESS = 0.6
+_W_VALIDITY = 0.4
+
+# Tipos inferidos cuyo almacen garantiza la conformidad de tipo (validity=1.0)
+# cuando NO vienen de una promocion de texto (en cuyo caso manda validity_rate).
+_NATIVE_TYPED = ("numeric", "integer", "float", "datetime", "date", "boolean", "bool")
+
+
 def column_quality_score(col: dict) -> dict:
    """Calcula un score de calidad de datos 0-100 para un ColumnProfile.

-    El score pondera tres dimensiones:
-      - completeness (0.5): proporcion de valores no nulos.
-      - validity     (0.3): ausencia de outliers / heuristicas de validez.
-      - consistency  (0.2): la columna aporta informacion (no constante, no ruido).
+    El score combina solo dimensiones de calidad medibles desde el perfil, con
+    renormalizacion por aplicabilidad:
+
+      - completeness (peso base 0.6, siempre aplica): proporcion de valores
+        presentes = 1 - null_pct. En texto, las celdas vacias (`empty_count`)
+        tambien cuentan como faltantes.
+      - validity (peso base 0.4, cuando hay un criterio de validacion real):
+        fraccion de valores no nulos conformes a su tipo/semantica. Tipo nativo
+        numerico/fecha/bool = 1.0; texto promovido a numero/fecha = parse rate
+        (`validity_rate`); texto con `semantic_type` regexable = `match_rate`;
+        texto libre o columna 100% nula = NO aplicable (renormaliza a solo
+        completeness).
+
+    Los outliers, columnas constantes, identificadores y asimetria fuerte NO
+    bajan el score: se devuelven en `observations`.

    Args:
        col: ColumnProfile dict del grupo eda. Se leen las claves de forma
            defensiva con .get(...) y se tolera que muchas vengan en None.
-            Claves relevantes: null_pct, inferred_type, semantic_type,
-            unique_pct, flags (list[str]), numeric ({outlier_pct, ...}|None),
-            match_rate (opcional).
+            Claves relevantes: null_pct (0-1), n_rows, empty_count,
+            inferred_type, semantic_type, validity_rate (0-1, lo expone
+            profile_table al promocionar texto a numero/fecha), match_rate
+            (0-1), unique_pct (0-1), flags (list[str], reconoce
+            "constant"/"possible_id"/"high_cardinality"), numeric
+            ({outlier_pct: 0-100, skew, ...}|None).

    Returns:
        dict con:
-          score        (float, 0-100, redondeado a 1 decimal),
-          completeness (float, 0-1),
-          validity     (float, 0-1),
-          consistency  (float, 0-1),
-          issues       (list[str]) descripciones legibles de los problemas.
+          score        (float 0-100, redondeado a 1 decimal),
+          completeness (float 0-1),
+          validity     (float 0-1 | None si no aplicable),
+          dimensions   ({completeness, validity}),
+          applicable   (list[str] de dimensiones que entraron en el score),
+          issues       (list[str]) SOLO defectos de calidad (nulos, vacios,
+                       valores no conformes a su tipo/semantica),
+          observations (list[str]) señales analiticas que NO bajan el score
+                       (outliers, columna constante, posible id, asimetria).
    """
    if not isinstance(col, dict):
        col = {}
@@ -39,103 +83,153 @@ def column_quality_score(col: dict) -> dict:
    flags = set(flags)

    issues: list[str] = []
+    observations: list[str] = []
+
+    inferred_type = col.get("inferred_type") or ""
+    semantic_type = col.get("semantic_type") or ""

    # --- completeness -------------------------------------------------
-    null_pct = col.get("null_pct")
-    if null_pct is None:
-        null_pct = 0.0
-    try:
-        null_pct = float(null_pct)
-    except (TypeError, ValueError):
-        null_pct = 0.0
-    null_pct = _clamp(null_pct, 0.0, 1.0)
+    # Falta de datos = nulos + (en texto) celdas vacias. Es el unico sitio
+    # donde la falta de datos cuenta: nunca se duplica en validity.
+    null_pct = _clamp(_num(col.get("null_pct"), 0.0), 0.0, 1.0)
    completeness = 1.0 - null_pct
    if null_pct > 0:
-        issues.append(f"{round(null_pct * 100)}% nulos")
+        issues.append(f"{_pct(null_pct)} nulos")

-    # --- validity -----------------------------------------------------
-    validity = 1.0
-    inferred_type = col.get("inferred_type") or ""
+    empty_frac = 0.0
+    n_rows = col.get("n_rows")
+    empty_count = col.get("empty_count")
+    if (
+        isinstance(n_rows, (int, float)) and not isinstance(n_rows, bool) and n_rows > 0
+        and isinstance(empty_count, (int, float)) and not isinstance(empty_count, bool)
+        and empty_count > 0
+    ):
+        empty_frac = _clamp(float(empty_count) / float(n_rows), 0.0, 1.0)
+        completeness = _clamp(completeness - empty_frac, 0.0, 1.0)
+        issues.append(f"{_pct(empty_frac)} vacíos")

-    numeric = col.get("numeric")
-    is_numeric = inferred_type in ("integer", "float", "numeric") or isinstance(numeric, dict)
-    if isinstance(numeric, dict):
-        outlier_pct = numeric.get("outlier_pct")
-        if outlier_pct is not None:
-            try:
-                outlier_pct = float(outlier_pct)
-            except (TypeError, ValueError):
-                outlier_pct = 0.0
-            outlier_pct = _clamp(outlier_pct, 0.0, 1.0)
-            if outlier_pct > 0:
-                penalty = min(outlier_pct, 0.3)
-                validity -= penalty
-                issues.append(f"{round(outlier_pct * 100)}% outliers")
-
-    # semantic_type declarado pero con baja tasa de match (si la conocemos).
-    semantic_type = col.get("semantic_type") or ""
-    match_rate = col.get("match_rate")
-    if semantic_type and match_rate is not None:
-        try:
-            match_rate = float(match_rate)
-        except (TypeError, ValueError):
-            match_rate = None
-        if match_rate is not None:
-            match_rate = _clamp(match_rate, 0.0, 1.0)
-            if match_rate < 1.0:
-                shortfall = 1.0 - match_rate
-                validity -= 0.5 * shortfall
-                issues.append(
-                    f"semantic_type '{semantic_type}' con baja coincidencia "
-                    f"({round(match_rate * 100)}%)"
-                )
-
-    if "mostly_null" in flags:
-        validity *= 0.5
-        issues.append("mayoritariamente nula")
-
-    validity = _clamp(validity, 0.0, 1.0)
-
-    # --- consistency --------------------------------------------------
-    consistency = 1.0
-    if "constant" in flags:
-        consistency = 0.3
-        issues.append("columna constante")
+    # --- validity (con renormalizacion por aplicabilidad) -------------
+    # None = no medible -> se excluye del score (no penaliza ni premia).
+    validity = None
+    if completeness <= 0.0:
+        # Columna 100% faltante: no hay valores no nulos sobre los que medir
+        # conformidad. validity no aplica -> el score sale solo de completeness
+        # (= 0). Es el peor defecto de calidad posible.
+        validity = None
    else:
-        unique_pct = col.get("unique_pct")
-        if unique_pct is not None:
-            try:
-                unique_pct = float(unique_pct)
-            except (TypeError, ValueError):
-                unique_pct = None
-        if (
-            inferred_type == "text"
+        validity_rate = col.get("validity_rate")
+        match_rate = col.get("match_rate")
+        if validity_rate is not None:
+            # Texto promovido a numero/fecha: parse rate real de la muestra.
+            v = _num(validity_rate, None)
+            if v is not None:
+                validity = _clamp(v, 0.0, 1.0)
+                if validity < 1.0:
+                    kind = (
+                        "número" if inferred_type == "numeric"
+                        else "fecha" if inferred_type == "datetime"
+                        else inferred_type or "su tipo"
+                    )
+                    issues.append(
+                        f"{_pct(1.0 - validity)} no parsea al tipo {kind}"
+                    )
+        elif inferred_type in _NATIVE_TYPED:
+            # Tipo nativo garantizado por el almacen: no hay valores que no
+            # parseen. validity = 1.0 (no se confunde con tener outliers).
+            validity = 1.0
+        elif semantic_type and match_rate is not None:
+            v = _num(match_rate, None)
+            if v is not None:
+                validity = _clamp(v, 0.0, 1.0)
+                if validity < 1.0:
+                    issues.append(
+                        f"{_pct(1.0 - validity)} no casa con el "
+                        f"formato «{semantic_type}»"
+                    )
+        else:
+            # Texto libre / categorica sin semantica: no hay criterio honesto
+            # de validez. No aplica.
+            validity = None
+
+    # --- observations (NO bajan el score) -----------------------------
+    numeric = col.get("numeric")
+    if isinstance(numeric, dict):
+        # outlier_pct viene en escala 0-100 desde describe_numeric (z-score>3).
+        outlier_pct = _num(numeric.get("outlier_pct"), None)
+        if outlier_pct is not None and outlier_pct >= 0.05:
+            observations.append(
+                f"{_pct(outlier_pct / 100.0)} de valores atípicos (z-score>3): "
+                "revisar si son errores u observaciones legítimas"
+            )
+        skew = _num(numeric.get("skew"), None)
+        if skew is not None and abs(skew) >= 1.0:
+            observations.append(
+                f"asimetría fuerte (skew={round(skew, 2)}): considerar "
+                "re-expresión antes de modelar"
+            )
+
+    if "constant" in flags:
+        observations.append(
+            "columna constante: aporta poca información para el análisis"
+        )
+
+    unique_pct = _num(col.get("unique_pct"), None)
+    is_id = (
+        "possible_id" in flags
+        or "high_cardinality" in flags
+        or (
+            inferred_type in ("text", "categorical")
            and unique_pct is not None
            and _clamp(unique_pct, 0.0, 1.0) > 0.9
-        ):
-            consistency = 0.6
-            issues.append("posible id de alta cardinalidad")
-
-    consistency = _clamp(consistency, 0.0, 1.0)
-
-    # --- score agregado ----------------------------------------------
-    score = round(
-        100.0 * (0.5 * completeness + 0.3 * validity + 0.2 * consistency),
-        1,
+        )
    )
+    if is_id:
+        observations.append(
+            "valores casi únicos: posible identificador (no es un defecto de calidad)"
+        )

-    # Silencia warnings sobre la variable de tipo no usada.
-    _ = is_numeric
+    # --- score agregado con renormalizacion ---------------------------
+    applicable = ["completeness"]
+    num = _W_COMPLETENESS * completeness
+    den = _W_COMPLETENESS
+    if validity is not None:
+        applicable.append("validity")
+        num += _W_VALIDITY * validity
+        den += _W_VALIDITY
+    score = round(100.0 * num / den, 1) if den > 0 else 0.0

    return {
        "score": score,
        "completeness": completeness,
        "validity": validity,
-        "consistency": consistency,
+        "dimensions": {"completeness": completeness, "validity": validity},
+        "applicable": applicable,
        "issues": issues,
+        "observations": observations,
    }


+def _pct(frac: float) -> str:
+    """Formatea una fraccion 0-1 como porcentaje honesto: «N%» si >=1%, «0.N%»
+    por debajo (para no mostrar «0%» cuando hay un defecto real pequeño)."""
+    p = frac * 100.0
+    if p >= 1.0:
+        return f"{round(p)}%"
+    return f"{p:.1f}%"
+
+
+def _num(x, default):
+    """Convierte x a float; devuelve `default` si es None o no parseable."""
+    if x is None:
+        return default
+    if isinstance(x, bool):
+        return default
+    try:
+        return float(x)
+    except (TypeError, ValueError):
+        return default
+
+
 def _clamp(x: float, lo: float, hi: float) -> float:
    """Recorta x al rango [lo, hi]."""
    if x < lo:
@@ -1,4 +1,12 @@
-"""Tests para column_quality_score."""
+"""Tests para column_quality_score (nueva fórmula, report 2046).
+
+Verifica las invariantes de la fórmula de calidad:
+  - completeness (0.6) + validity (0.4) con renormalización por aplicabilidad.
+  - Los OUTLIERS no bajan el score (van a observations, no a issues).
+  - Columnas constantes e ids no bajan el score (observations).
+  - Sin doble conteo de la falta de datos.
+  - all-null -> score 0; función pura (no muta el input).
+"""

 import os
 import sys
@@ -9,11 +17,11 @@ from column_quality_score import column_quality_score


 def _clean_numeric_col() -> dict:
-    """ColumnProfile de una columna numerica sana, sin problemas."""
+    """ColumnProfile de una columna numérica nativa sana, sin problemas."""
    return {
        "name": "edad",
        "physical_type": "INTEGER",
-        "inferred_type": "integer",
+        "inferred_type": "numeric",
        "semantic_type": "",
        "count": 1000,
        "n_rows": 1000,
@@ -28,85 +36,163 @@ def _clean_numeric_col() -> dict:
    }


+# --------------------------------------------------------------------------- #
+# Golden
+# --------------------------------------------------------------------------- #
 def test_clean_column_high_score():
    out = column_quality_score(_clean_numeric_col())
-    assert out["score"] > 90
+    assert out["score"] == 100.0
    assert out["completeness"] == 1.0
    assert out["validity"] == 1.0
-    assert out["consistency"] == 1.0
+    assert out["applicable"] == ["completeness", "validity"]
    assert out["issues"] == []
+    assert out["observations"] == []


-def test_half_null_lowers_completeness_and_score():
+def test_weights_60_40_native_type():
+    """30% nulos en numérica nativa: score = 100*(0.6*0.7 + 0.4*1.0) = 82."""
    col = _clean_numeric_col()
-    col["null_count"] = 500
-    col["null_pct"] = 0.5
-    clean_score = column_quality_score(_clean_numeric_col())["score"]
+    col["null_pct"] = 0.30
+    col["null_count"] = 300
    out = column_quality_score(col)
-    assert out["completeness"] == 0.5
-    assert out["score"] < clean_score
-    assert any("nulos" in issue for issue in out["issues"])
+    assert out["completeness"] == 0.7
+    assert out["validity"] == 1.0
+    assert out["score"] == 82.0
+    assert any("nulos" in i for i in out["issues"])


-def test_constant_column_flags_issue():
+# --------------------------------------------------------------------------- #
+# Outliers FUERA del score
+# --------------------------------------------------------------------------- #
+def test_outliers_do_not_penalize_score():
+    """Columna con outliers pero sin nulos -> score máximo; outliers en observations."""
+    col = _clean_numeric_col()
+    col["numeric"] = {"outlier_pct": 18.0, "skew": 0.2}  # 18% atípicos (escala 0-100)
+    out = column_quality_score(col)
+    assert out["score"] == 100.0  # los outliers NO bajan la calidad
+    assert out["validity"] == 1.0
+    # No aparecen como problema de calidad...
+    assert not any("atípic" in i or "outlier" in i for i in out["issues"])
+    # ...sino como observación analítica.
+    assert any("atípic" in o for o in out["observations"])
+
+
+def test_nulls_lower_score_more_than_outliers():
+    """Vacíos sí penalizan; outliers no: comparar las dos columnas."""
+    con_nulos = _clean_numeric_col()
+    con_nulos["null_pct"] = 0.30
+    con_outliers = _clean_numeric_col()
+    con_outliers["numeric"] = {"outlier_pct": 30.0}
+    assert column_quality_score(con_nulos)["score"] < \
+        column_quality_score(con_outliers)["score"]
+
+
+# --------------------------------------------------------------------------- #
+# Validity: aplicabilidad y renormalización
+# --------------------------------------------------------------------------- #
+def test_validity_from_parse_rate_lowers_score():
+    """Numérica como texto con 20% basura: validity=0.8 -> score=92."""
+    col = {
+        "name": "precio_txt", "inferred_type": "numeric", "semantic_type": "decimal",
+        "null_pct": 0.0, "validity_rate": 0.80, "flags": [], "numeric": None,
+    }
+    out = column_quality_score(col)
+    assert out["validity"] == 0.8
+    assert out["score"] == 92.0  # 100*(0.6 + 0.4*0.8)
+    assert any("no parsea" in i for i in out["issues"])
+
+
+def test_validity_from_match_rate():
+    """Texto con semantic_type y 5% no conforme: validity=0.95."""
+    col = {
+        "name": "email", "inferred_type": "text", "semantic_type": "email",
+        "null_pct": 0.0, "match_rate": 0.95, "unique_pct": 0.5, "flags": [],
+    }
+    out = column_quality_score(col)
+    assert out["validity"] == 0.95
+    assert out["score"] == 98.0  # 100*(0.6 + 0.4*0.95)
+    assert any("no casa" in i for i in out["issues"])
+
+
+def test_free_text_renormalizes_to_completeness_only():
+    """Texto libre sin semántica: validity no aplica -> score = 100*completeness."""
+    col = {
+        "name": "comentario", "inferred_type": "text", "semantic_type": "",
+        "null_pct": 0.30, "unique_pct": 0.5, "flags": [], "numeric": None,
+    }
+    out = column_quality_score(col)
+    assert out["validity"] is None
+    assert out["applicable"] == ["completeness"]
+    assert out["completeness"] == 0.7
+    assert out["score"] == 70.0  # renormalizado a solo completeness
+
+
+# --------------------------------------------------------------------------- #
+# Casos límite (report §4.6)
+# --------------------------------------------------------------------------- #
+def test_all_null_column_scores_zero():
+    col = _clean_numeric_col()
+    col["null_pct"] = 1.0
+    col["null_count"] = 1000
+    out = column_quality_score(col)
+    assert out["completeness"] == 0.0
+    assert out["validity"] is None  # no medible sin valores no nulos
+    assert out["score"] == 0.0
+
+
+def test_constant_column_scores_full_and_is_observation():
+    """Columna constante: dato válido y completo -> score 100; baja info = observación."""
    col = _clean_numeric_col()
    col["flags"] = ["constant"]
    col["distinct_count"] = 1
    col["unique_pct"] = 0.001
    out = column_quality_score(col)
-    assert out["consistency"] == 0.3
-    assert any("constante" in issue for issue in out["issues"])
+    assert out["score"] == 100.0  # NO se castiga la baja informatividad
+    assert not any("constante" in i for i in out["issues"])
+    assert any("constante" in o for o in out["observations"])


+def test_high_cardinality_id_scores_full_and_is_observation():
+    """Id de alta cardinalidad: unicidad perfecta -> score 100; posible id = observación."""
+    col = {
+        "name": "uuid", "inferred_type": "text", "semantic_type": "",
+        "null_pct": 0.0, "unique_pct": 0.99, "flags": ["possible_id"],
+        "numeric": None,
+    }
+    out = column_quality_score(col)
+    assert out["score"] == 100.0
+    assert not any("identificador" in i for i in out["issues"])
+    assert any("identificador" in o for o in out["observations"])
+
+
+def test_mostly_null_no_double_counts_validity():
+    """85% nulos: solo completeness penaliza; validity nativa sigue 1.0 (sin doble castigo)."""
+    col = _clean_numeric_col()
+    col["null_pct"] = 0.85
+    col["flags"] = ["mostly_null"]
+    out = column_quality_score(col)
+    assert out["validity"] == 1.0  # ya no se multiplica por 0.5
+    # score = 100*(0.6*0.15 + 0.4*1.0) = 49
+    assert out["score"] == 49.0
+    assert not any("mayoritariamente" in o for o in out["observations"])
+
+
+# --------------------------------------------------------------------------- #
+# Robustez
+# --------------------------------------------------------------------------- #
 def test_empty_dict_does_not_crash():
    out = column_quality_score({})
    assert isinstance(out["score"], float)
    assert out["completeness"] == 1.0
    assert 0.0 <= out["score"] <= 100.0
    assert isinstance(out["issues"], list)
-
-
-def test_outliers_penalize_validity():
-    col = _clean_numeric_col()
-    col["numeric"] = {"outlier_pct": 0.2}
-    out = column_quality_score(col)
-    assert out["validity"] < 1.0
-    assert any("outliers" in issue for issue in out["issues"])
-
-
-def test_mostly_null_flag_halves_validity():
-    col = _clean_numeric_col()
-    col["null_pct"] = 0.85
-    col["flags"] = ["mostly_null"]
-    out = column_quality_score(col)
-    assert out["validity"] == 0.5
-    assert any("mayoritariamente nula" in issue for issue in out["issues"])
-
-
-def test_high_cardinality_text_flagged_as_id():
-    col = {
-        "name": "uuid",
-        "inferred_type": "text",
-        "semantic_type": "",
-        "null_pct": 0.0,
-        "unique_pct": 0.99,
-        "flags": [],
-        "numeric": None,
-    }
-    out = column_quality_score(col)
-    assert out["consistency"] < 1.0
-    assert any("alta cardinalidad" in issue for issue in out["issues"])
+    assert isinstance(out["observations"], list)


 def test_none_values_treated_defensively():
    col = {
-        "name": "x",
-        "inferred_type": None,
-        "semantic_type": None,
-        "null_pct": None,
-        "unique_pct": None,
-        "flags": None,
-        "numeric": None,
+        "name": "x", "inferred_type": None, "semantic_type": None,
+        "null_pct": None, "unique_pct": None, "flags": None, "numeric": None,
    }
    out = column_quality_score(col)
    assert out["completeness"] == 1.0
@@ -1,107 +0,0 @@
---
-name: detect_declared_keys_duckdb
-kind: function
-lang: py
-domain: datascience
-version: "1.0.0"
-purity: impure
-signature: "def detect_declared_keys_duckdb(db_path: str, table: str = None) -> dict"
-description: "Detecta las claves DECLARADAS (constraints reales) de un schema DuckDB leyendo la table function duckdb_constraints(): extrae PRIMARY KEY, FOREIGN KEY y UNIQUE (ignora NOT NULL y CHECK) y las devuelve normalizadas con sus columnas, y para las FK con su tabla y columnas referenciadas. Con table=None procesa todas las tablas; con table='X' filtra a PK/UNIQUE de X y a FK cuyo origen es X (case-sensitive). A diferencia de infer_fk_containment_duckdb (que INFIERE FKs candidatas por containment de valores cuando el schema no las declara), esta funcion devuelve las relaciones de clave REALES del schema. Estilo dict-no-throw: nunca lanza. Parte del grupo eda (relaciones de clave)."
-tags: [eda, duckdb, datascience, relations, primary-key, foreign-key, schema, exploratory-data-analysis]
-params:
-  - name: db_path
-    desc: "Ruta al archivo DuckDB. Debe existir (lectura read-only via duckdb_query_readonly; no se crea). Un path inexistente devuelve {status:'error', ...}."
-  - name: table
-    desc: "Si se pasa, filtra los resultados a esa tabla: incluye PRIMARY KEY y UNIQUE cuya tabla sea `table`, y FOREIGN KEY cuya tabla ORIGEN sea `table` (no la referenciada). None (default) devuelve los constraints de todas las tablas. La comparacion es case-sensitive (nombres tal cual los devuelve DuckDB)."
-output: "dict dict-no-throw. En exito {status:'ok', primary_keys:[{table:str, columns:[str,...]}, ...], foreign_keys:[{table:str, columns:[str,...], referenced_table:str, referenced_columns:[str,...]}, ...], unique:[{table:str, columns:[str,...]}, ...], tables:[str,...]} donde tables es la lista ordenada de tablas (origen) que poseen al menos un constraint PK/FK/UNIQUE emitido. Solo se emiten constraints de clave: NOT NULL y CHECK se ignoran. En error {status:'error', error:str}."
-uses_functions: [duckdb_query_readonly_py_infra]
-uses_types: []
-returns: []
-returns_optional: false
-error_type: "error_go_core"
-imports: []
-tested: true
-tests: ["test_golden_detecta_pks_y_fk", "test_golden_ignora_not_null_y_check", "test_edge_filtra_por_tabla_orders", "test_edge_filtra_por_tabla_customers", "test_edge_unique_declarado", "test_edge_sin_constraints_listas_vacias", "test_error_db_inexistente_no_lanza", "test_shape_resultado"]
-test_file_path: "python/functions/datascience/detect_declared_keys_duckdb_test.py"
-file_path: "python/functions/datascience/detect_declared_keys_duckdb.py"
---
-
-## Ejemplo
-
-```python
-import sys, os, duckdb
-sys.path.insert(0, os.path.join("python", "functions"))
-from datascience import detect_declared_keys_duckdb
-
-# Base de ejemplo en /tmp: orders.customer_id -> customers.id (FK declarada)
-path = "/tmp/declared_keys_demo.duckdb"
-if os.path.exists(path):
-    os.remove(path)
-con = duckdb.connect(path)
-con.execute("CREATE TABLE customers(id INTEGER PRIMARY KEY, name TEXT)")
-con.execute(
-    "CREATE TABLE orders("
-    "  id INTEGER PRIMARY KEY,"
-    "  customer_id INTEGER REFERENCES customers(id),"
-    "  amt DOUBLE)"
-)
-con.close()
-
-res = detect_declared_keys_duckdb(path)
-if res["status"] == "ok":
-    for pk in res["primary_keys"]:
-        print(f"PK  {pk['table']}({', '.join(pk['columns'])})")
-    for fk in res["foreign_keys"]:
-        print(f"FK  {fk['table']}({', '.join(fk['columns'])}) -> "
-              f"{fk['referenced_table']}({', '.join(fk['referenced_columns'])})")
-    # PK  customers(id)
-    # PK  orders(id)
-    # FK  orders(customer_id) -> customers(id)
-else:
-    print("error:", res["error"])
-
-# Filtrar a una tabla concreta (PK/UNIQUE de orders + FK con origen orders):
-solo_orders = detect_declared_keys_duckdb(path, table="orders")
-print(solo_orders["tables"])  # ['orders']
-```
-
-## Cuando usarla
-
- Cuando exploras un esquema DuckDB y quieres mostrar las relaciones de clave REALES (PK/FK/UNIQUE) que el schema ha declarado, sin inferir nada.
- Como paso del capitulo RELACIONES del grupo `eda`: primero mira las claves declaradas con esta funcion; si el schema no declara FKs, complementa con `infer_fk_containment_duckdb` (inferencia por containment).
- Antes de documentar o migrar un esquema, para listar el contrato de integridad referencial que el motor ya conoce.
- Para validar que las constraints que esperas (esa FK que creaste con `REFERENCES`) realmente estan declaradas en la base materializada.
-
-## Gotchas
-
- **Impura**: lee de disco via la primitiva read-only `duckdb_query_readonly` (no crea ni modifica la base). El `db_path` debe existir; un path inexistente devuelve `{status:'error'}` (read_only NO crea la base).
- **Requiere `duckdb_constraints()`**: usa la table function `duckdb_constraints()`, disponible en DuckDB modernos (verificado en 1.5.2). En versiones antiguas sin esa funcion, la query falla y se devuelve `{status:'error'}`.
- **Solo claves DECLARADAS**: devuelve lo que el schema declaro con `PRIMARY KEY` / `FOREIGN KEY (... REFERENCES ...)` / `UNIQUE`. Una tabla materializada con `CREATE TABLE AS SELECT` NO lleva constraints — para esos casos no habra claves que mostrar y hay que INFERIRLAS (`infer_fk_containment_duckdb`).
- **NOT NULL y CHECK se ignoran**: `duckdb_constraints()` tambien emite filas `NOT NULL` (DuckDB genera una por cada columna PK) y `CHECK`; esta funcion las descarta y solo conserva PK/FK/UNIQUE.
- **Nombres case-sensitive**: el filtro `table='Orders'` no casa con una tabla `orders`. Se comparan los nombres tal cual los devuelve DuckDB.
- **FK atribuida al origen**: una FOREIGN KEY se atribuye a su tabla ORIGEN (el `table` de la entrada), no a la referenciada. El filtro `table='X'` trae las FK cuyo origen es X, no las que apuntan a X.
- **`tables` = tablas dueñas de constraints emitidos**: la lista `tables` contiene solo las tablas que poseen al menos un PK/FK/UNIQUE en el resultado (su campo `table`), ordenadas. No incluye tablas referenciadas que no tengan constraint propio en la salida.
- **Columnas como listas**: `constraint_column_names` y `referenced_column_names` son columnas LIST de DuckDB; en 1.5.2 llegan como listas Python. La funcion las normaliza a listas de strings con una red de seguridad por si llegaran como string.
-
-## Notas
-
-`duckdb_constraints()` devuelve una fila por constraint con los campos
-`table_name`, `constraint_type`, `constraint_column_names`, `referenced_table`,
-`referenced_column_names`. Mapeo a la salida:
-
-```text
-PRIMARY KEY -> primary_keys[]: {table, columns}
-UNIQUE      -> unique[]:       {table, columns}
-FOREIGN KEY -> foreign_keys[]: {table, columns, referenced_table, referenced_columns}
-NOT NULL    -> ignorado
-CHECK       -> ignorado
-```
-
-Para una FK, `referenced_table` y `referenced_column_names` vienen poblados; para
-PK/UNIQUE, `referenced_table` es NULL y `referenced_column_names` una lista vacia.
-
-Complementa a `infer_fk_containment_duckdb`: esta funcion devuelve las relaciones
-de clave REALES del schema (declaradas); la otra INFIERE FKs candidatas por
-containment de valores cuando el schema no las declaro. En el capitulo RELACIONES
-de AutomaticEDA se usan en orden: primero las declaradas, luego la inferencia como
-respaldo.
@@ -1,127 +0,0 @@
-"""detect_declared_keys_duckdb — lee las claves DECLARADAS de un schema DuckDB.
-
-Funcion impura: lee de disco a traves de la primitiva read-only del grupo
-`duckdb` (duckdb_query_readonly). Pertenece al grupo de capacidad `eda`
-(relaciones de clave): a diferencia de infer_fk_containment_duckdb, que INFIERE
-FOREIGN KEYs candidatas por containment de valores, esta funcion devuelve las
-constraints REALES que el schema ha declarado (PRIMARY KEY / FOREIGN KEY /
-UNIQUE) leyendo la table function `duckdb_constraints()`.
-
-Es la pieza del capitulo RELACIONES de AutomaticEDA que muestra las relaciones de
-clave reales cuando existen — frente a la inferencia, que se usa cuando el schema
-no las declaro.
-
-Estilo dict-no-throw del grupo duckdb: nunca lanza; captura cualquier error y
-devuelve {status:'error', error:str}.
-"""
-
-from infra import duckdb_query_readonly
-
-
-def _as_list(value) -> list:
-    """Normaliza el valor de una columna LIST de DuckDB a una lista de strings.
-
-    En DuckDB 1.5.2, `constraint_column_names` y `referenced_column_names` llegan
-    ya como listas Python a traves de duckdb_query_readonly. Este helper es solo
-    una red de seguridad: si por cualquier motivo llegara como string (p.ej. la
-    representacion `[id, customer_id]`), la parsea de forma defensiva.
-    """
-    if value is None:
-        return []
-    if isinstance(value, (list, tuple)):
-        return [str(v) for v in value]
-    if isinstance(value, str):
-        s = value.strip()
-        if s.startswith("[") and s.endswith("]"):
-            s = s[1:-1]
-        if not s.strip():
-            return []
-        return [
-            part.strip().strip("'\"")
-            for part in s.split(",")
-            if part.strip().strip("'\"")
-        ]
-    return [str(value)]
-
-
-def detect_declared_keys_duckdb(db_path: str, table: str = None) -> dict:
-    """Detecta las claves PRIMARY KEY / FOREIGN KEY / UNIQUE declaradas en DuckDB.
-
-    Lee la table function `duckdb_constraints()` y extrae solo las constraints de
-    clave (PRIMARY KEY, FOREIGN KEY, UNIQUE), ignorando NOT NULL y CHECK.
-
-    Args:
-        db_path: ruta al archivo DuckDB. Debe existir (lectura read-only; no se
-            crea). Un path inexistente devuelve {status:'error', ...} sin lanzar.
-        table: si se pasa, filtra los resultados a esa tabla: incluye PRIMARY KEY
-            y UNIQUE cuya tabla sea `table`, y FOREIGN KEY cuya tabla ORIGEN sea
-            `table`. None (default) devuelve los constraints de todas las tablas.
-            La comparacion de nombres es case-sensitive (tal cual los devuelve
-            DuckDB).
-
-    Returns:
-        dict dict-no-throw. En exito:
-            {status:'ok',
-             primary_keys:[{table:str, columns:[str, ...]}, ...],
-             foreign_keys:[{table:str, columns:[str, ...],
-                            referenced_table:str,
-                            referenced_columns:[str, ...]}, ...],
-             unique:[{table:str, columns:[str, ...]}, ...],
-             tables:[str, ...]}   # tablas (origen) con algun PK/FK/UNIQUE emitido
-        En error (sin lanzar): {status:'error', error:str}.
-    """
-    try:
-        sql = (
-            "SELECT table_name, constraint_type, constraint_column_names, "
-            "referenced_table, referenced_column_names FROM duckdb_constraints()"
-        )
-        res = duckdb_query_readonly(db_path, sql)
-        if res["status"] != "ok":
-            return {"status": "error", "error": res["error"]}
-
-        primary_keys = []
-        foreign_keys = []
-        unique = []
-        tables = set()
-
-        for row in res["rows"]:
-            ctype = row["constraint_type"]
-            tname = row["table_name"]
-
-            # Filtro por tabla origen: para PK/FK/UNIQUE el dueño del constraint es
-            # `table_name`. Una FK se atribuye a su tabla origen (no a la
-            # referenciada), igual que el filtro pide.
-            if table is not None and tname != table:
-                continue
-
-            cols = _as_list(row["constraint_column_names"])
-
-            if ctype == "PRIMARY KEY":
-                primary_keys.append({"table": tname, "columns": cols})
-                tables.add(tname)
-            elif ctype == "UNIQUE":
-                unique.append({"table": tname, "columns": cols})
-                tables.add(tname)
-            elif ctype == "FOREIGN KEY":
-                foreign_keys.append(
-                    {
-                        "table": tname,
-                        "columns": cols,
-                        "referenced_table": row["referenced_table"],
-                        "referenced_columns": _as_list(
-                            row["referenced_column_names"]
-                        ),
-                    }
-                )
-                tables.add(tname)
-            # NOT NULL y CHECK se ignoran: no son relaciones de clave.
-
-        return {
-            "status": "ok",
-            "primary_keys": primary_keys,
-            "foreign_keys": foreign_keys,
-            "unique": unique,
-            "tables": sorted(tables),
-        }
-    except Exception as e:  # noqa: BLE001
-        return {"status": "error", "error": str(e)}
@@ -1,167 +0,0 @@
-"""Tests para detect_declared_keys_duckdb."""
-
-import duckdb
-import pytest
-
-from .detect_declared_keys_duckdb import detect_declared_keys_duckdb
-
-
-@pytest.fixture
-def db(tmp_path):
-    """DuckDB temporal con claves declaradas.
-
-    - customers(id PRIMARY KEY, name)
-    - orders(id PRIMARY KEY, customer_id REFERENCES customers(id), amt)
-
-    Esto declara dos PRIMARY KEY (customers.id, orders.id) y una FOREIGN KEY
-    (orders.customer_id -> customers.id). DuckDB ademas genera constraints
-    NOT NULL para las columnas PK, que la funcion debe ignorar.
-    """
-    path = str(tmp_path / "keys_test.duckdb")
-    con = duckdb.connect(path)
-    con.execute("CREATE TABLE customers(id INTEGER PRIMARY KEY, name TEXT)")
-    con.execute(
-        "CREATE TABLE orders("
-        "  id INTEGER PRIMARY KEY,"
-        "  customer_id INTEGER REFERENCES customers(id),"
-        "  amt DOUBLE"
-        ")"
-    )
-    con.close()
-    return path
-
-
-def _pk_for(res, table):
-    """Devuelve la entrada primary_keys cuya tabla es `table`, o None."""
-    for pk in res["primary_keys"]:
-        if pk["table"] == table:
-            return pk
-    return None
-
-
-def test_golden_detecta_pks_y_fk(db):
-    """Golden: detecta las dos PK y la FK declaradas, con valores concretos."""
-    res = detect_declared_keys_duckdb(db)
-    assert res["status"] == "ok"
-
-    # PRIMARY KEY de customers y de orders.
-    pk_customers = _pk_for(res, "customers")
-    pk_orders = _pk_for(res, "orders")
-    assert pk_customers is not None
-    assert pk_customers["columns"] == ["id"]
-    assert pk_orders is not None
-    assert pk_orders["columns"] == ["id"]
-
-    # FOREIGN KEY orders.customer_id -> customers.id.
-    assert len(res["foreign_keys"]) == 1
-    fk = res["foreign_keys"][0]
-    assert fk["table"] == "orders"
-    assert fk["columns"] == ["customer_id"]
-    assert fk["referenced_table"] == "customers"
-    assert fk["referenced_columns"] == ["id"]
-
-    # tables incluye ambas (origen de algun constraint).
-    assert res["tables"] == ["customers", "orders"]
-
-
-def test_golden_ignora_not_null_y_check(db):
-    """NOT NULL (auto-generado por las PK) no aparece como clave."""
-    res = detect_declared_keys_duckdb(db)
-    assert res["status"] == "ok"
-    # Solo 2 PK reales (no las NOT NULL que DuckDB genera por cada columna PK).
-    assert len(res["primary_keys"]) == 2
-    # No hay UNIQUE declarado en este schema.
-    assert res["unique"] == []
-
-
-def test_edge_filtra_por_tabla_orders(db):
-    """Edge table='orders': PK de orders + su FK; NO la PK de customers."""
-    res = detect_declared_keys_duckdb(db, table="orders")
-    assert res["status"] == "ok"
-
-    # Solo la PK de orders.
-    assert len(res["primary_keys"]) == 1
-    assert res["primary_keys"][0]["table"] == "orders"
-    assert res["primary_keys"][0]["columns"] == ["id"]
-    # La PK de customers NO esta.
-    assert _pk_for(res, "customers") is None
-
-    # La FK de orders si esta (origen = orders).
-    assert len(res["foreign_keys"]) == 1
-    assert res["foreign_keys"][0]["table"] == "orders"
-    assert res["foreign_keys"][0]["referenced_table"] == "customers"
-
-    # tables solo contiene orders (la dueña de los constraints emitidos).
-    assert res["tables"] == ["orders"]
-
-
-def test_edge_filtra_por_tabla_customers(db):
-    """Edge table='customers': solo su PK; ninguna FK (orders queda fuera)."""
-    res = detect_declared_keys_duckdb(db, table="customers")
-    assert res["status"] == "ok"
-    assert len(res["primary_keys"]) == 1
-    assert res["primary_keys"][0]["table"] == "customers"
-    assert res["foreign_keys"] == []
-    assert res["tables"] == ["customers"]
-
-
-def test_edge_unique_declarado(tmp_path):
-    """Edge: una constraint UNIQUE declarada aparece en `unique`."""
-    path = str(tmp_path / "unique_test.duckdb")
-    con = duckdb.connect(path)
-    con.execute("CREATE TABLE products(sku INTEGER UNIQUE, name TEXT)")
-    con.close()
-
-    res = detect_declared_keys_duckdb(path)
-    assert res["status"] == "ok"
-    assert len(res["unique"]) == 1
-    assert res["unique"][0]["table"] == "products"
-    assert res["unique"][0]["columns"] == ["sku"]
-    assert res["primary_keys"] == []
-    assert res["foreign_keys"] == []
-    assert res["tables"] == ["products"]
-
-
-def test_edge_sin_constraints_listas_vacias(tmp_path):
-    """Edge: tabla sin PK/FK/UNIQUE -> todas las listas vacias, status ok."""
-    path = str(tmp_path / "no_keys.duckdb")
-    con = duckdb.connect(path)
-    con.execute("CREATE TABLE log(a INTEGER, b INTEGER)")
-    con.close()
-
-    res = detect_declared_keys_duckdb(path)
-    assert res["status"] == "ok"
-    assert res["primary_keys"] == []
-    assert res["foreign_keys"] == []
-    assert res["unique"] == []
-    assert res["tables"] == []
-
-
-def test_error_db_inexistente_no_lanza(tmp_path):
-    """Error: db_path inexistente -> status error, sin lanzar excepcion."""
-    path = str(tmp_path / "does_not_exist.duckdb")
-    res = detect_declared_keys_duckdb(path)
-    assert res["status"] == "error"
-    assert isinstance(res["error"], str)
-    assert res["error"] != ""
-
-
-def test_shape_resultado(db):
-    """El retorno tiene exactamente las claves esperadas."""
-    res = detect_declared_keys_duckdb(db)
-    assert set(res.keys()) == {
-        "status",
-        "primary_keys",
-        "foreign_keys",
-        "unique",
-        "tables",
-    }
-    for pk in res["primary_keys"]:
-        assert set(pk.keys()) == {"table", "columns"}
-    for fk in res["foreign_keys"]:
-        assert set(fk.keys()) == {
-            "table",
-            "columns",
-            "referenced_table",
-            "referenced_columns",
-        }
@@ -1,91 +0,0 @@
---
-name: suggest_intratable_fk_candidates
-kind: function
-lang: py
-domain: datascience
-version: "1.0.0"
-purity: pure
-signature: "def suggest_intratable_fk_candidates(profile: dict, max_candidates: int = 20) -> list"
-description: "Sobre el TableProfile de UNA tabla (el dict de profile_table), sugiere por heuristica de nombre + cardinalidad que columnas PARECEN una clave foranea hacia otra tabla, cuando no hay relaciones inter-tabla que medir (una sola tabla). Es una SUGERENCIA, no una afirmacion: el ref_table_guess es el stem del nombre (customer_id -> customer) y NO confirma containment. Pura: solo lee el dict, sin I/O; nunca lanza (devuelve [])."
-tags: [eda, datascience, relationships, foreign-key, fk, heuristic, schema, python]
-uses_functions: []
-uses_types: []
-returns: []
-returns_optional: false
-error_type: ""
-imports: []
-params:
-  - name: profile
-    desc: "TableProfile (dict que produce profile_table / summarize_table_*). Se leen de forma defensiva `columns` (lista de ColumnProfile con name/inferred_type/physical_type/distinct_count/unique_pct/flags), `n_rows` (int) y `key_candidates` (lista de nombres de columna ya candidatos a PK, que se excluyen). Si no es dict o no trae columns -> []."
-  - name: max_candidates
-    desc: "Tope de sugerencias devueltas (default 20). Las columnas candidatas se ordenan por distinct_count descendente (mas informativas primero) antes de cortar a este maximo."
-output: "list (posiblemente vacia) de dicts, uno por columna sugerida, con claves: `column` (nombre), `ref_table_guess` (tabla conjeturada por el stem del nombre, p.ej. customer_id -> 'customer'), `reason` (frase humana que deja claro que es heuristica sin confirmar containment), `distinct_count` (int|None), `unique_pct` (float|None, fraccion 0-1 tal como viene del profile), `inferred_type` (str), `physical_type` (str). Nunca lanza."
-tested: true
-tests: ["test_golden_customer_id_detectado_otras_no", "test_camelcase_albumid_detectado", "test_constante_status_id_no_aparece", "test_profile_vacio_y_none_devuelven_lista_vacia", "test_category_id_casi_unico_parece_pk_no_aparece", "test_ref_table_guess_multitoken_y_orden_por_distinct", "test_max_candidates_corta_la_lista", "test_id_generico_solo_nunca_es_fk"]
-test_file_path: "python/functions/datascience/suggest_intratable_fk_candidates_test.py"
-file_path: "python/functions/datascience/suggest_intratable_fk_candidates.py"
---
-
-## Ejemplo
-
-```python
-from datascience import suggest_intratable_fk_candidates
-
-# TableProfile de UNA tabla (tipo titanic): customer_id es FK N:1; id es la PK;
-# amount es una medida float; name es categorica sin sufijo de id.
-profile = {
-    "n_rows": 891,
-    "key_candidates": ["id"],
-    "columns": [
-        {"name": "id", "inferred_type": "numeric", "physical_type": "BIGINT",
-         "distinct_count": 891, "unique_pct": 1.0, "flags": ["possible_id"]},
-        {"name": "customer_id", "inferred_type": "numeric", "physical_type": "BIGINT",
-         "distinct_count": 137, "unique_pct": 0.15, "flags": []},
-        {"name": "amount", "inferred_type": "numeric", "physical_type": "DOUBLE",
-         "distinct_count": 400, "unique_pct": 0.45, "flags": []},
-        {"name": "name", "inferred_type": "categorical", "physical_type": "VARCHAR",
-         "distinct_count": 700, "unique_pct": 0.78, "flags": []},
-    ],
-}
-
-out = suggest_intratable_fk_candidates(profile)
-[c["column"] for c in out]              # -> ["customer_id"]
-out[0]["ref_table_guess"]               # -> "customer"
-out[0]["reason"]
-# -> "el nombre termina en '_id' y es N:1 (137 valores distintos < 891 filas):
-#     parece (heuristica por nombre, sin confirmar containment) una referencia a
-#     una tabla «customer»"
-```
-
-## Cuando usarla
-
-Cuando el EDA tiene SOLO UNA tabla y, por tanto, no se puede inferir una FK
-inter-tabla por containment (no hay otra tabla cuyos valores contener). Es el plan B
-del capitulo RELACIONES de AutomaticEDA: en vez de medir solapamiento de valores
-entre tablas (lo correcto cuando hay varias, ver `infer_fk_containment_duckdb` /
-`build_join_graph`), conjetura por el NOMBRE de la columna (`<algo>_id`) y por su
-CARDINALIDAD N:1 que columnas parecen apuntar a una entidad externa. Usala para
-enriquecer el reporte con "estas columnas parecen referencias a otras tablas" sin
-prometer que esa tabla exista. NO la uses si tienes varias tablas: ahi mide
-containment de verdad.
-
-## Gotchas
-
- Es **heuristica**, no una verdad: produce **falsos positivos** (una columna
-  `period_id` que en realidad es un codigo libre, no una FK) y **falsos negativos**
-  (una FK que no se llama `*_id`, p.ej. `parent`, `owner`, `sku`). No la trates como
-  una afirmacion de esquema.
- `ref_table_guess` es una **conjetura por el nombre** (el stem sin el sufijo id):
-  `customer_id` -> `customer`, `AlbumId` -> `album`, `manager_staff_id` ->
-  `manager_staff`. Puede no coincidir con el nombre real de la tabla (plurales,
-  prefijos, alias). Es una pista, no un join garantizado.
- **NO confirma containment**: no comprueba que los valores de la columna existan en
-  ninguna otra tabla (no puede — solo recibe el perfil de una tabla). Para confirmar
-  una FK real con varias tablas usa `infer_fk_containment_duckdb`.
- Excluye deliberadamente: el `id`/`Id`/`ID` generico a secas (suele ser la PK
-  propia, no una referencia), las columnas constantes, las que parecen unicas
-  (`unique_pct >= 0.99`, mas PK que FK) y los tipos no-clave (float/decimal son
-  medidas; date/time/timestamp y boolean no son claves). En camelCase, `paid`,
-  `valid`, `grid` (con `id` en minuscula y sin separador) NO se confunden con FK.
- `unique_pct` se interpreta como **fraccion 0-1** (tal como la emite el profile), no
-  como porcentaje 0-100.
@@ -1,202 +0,0 @@
-"""suggest_intratable_fk_candidates — heuristica de FK intra-tabla del grupo `eda`.
-
-Sobre el TableProfile de UNA tabla (el dict que produce ``profile_table``), sugiere
-por heuristica de NOMBRE + CARDINALIDAD que columnas PARECEN una clave foranea hacia
-otra tabla, util cuando no hay relaciones inter-tabla disponibles (una sola tabla y,
-por tanto, sin containment cruzado que medir). Es una SUGERENCIA, no una afirmacion:
-no confirma que exista la tabla referida ni que los valores esten contenidos en ella.
-
-La consume el capitulo RELACIONES de AutomaticEDA cuando solo hay una tabla.
-
-Funcion PURA: solo lee el dict (lectura defensiva con ``.get``), no hace I/O y nunca
-lanza por inputs raros (devuelve ``[]``).
-"""
-
-# inferred_type que es compatible con una clave foranea (entero/categorico).
-_FK_INFERRED_OK = {"numeric", "categorical", "integer"}
-
-# Prefijos de physical_type que admiten ser clave foranea (enteros, texto, uuid).
-_FK_PHYSICAL_PREFIXES = (
-    "int", "bigint", "smallint", "tinyint", "hugeint", "uint",
-    "varchar", "text", "char", "bpchar", "string", "uuid",
-)
-
-# Prefijos de physical_type que EXCLUYEN ser clave foranea: medidas en coma flotante
-# (float/double/decimal/numeric/real), temporales (date/time/timestamp/interval) y
-# boolean. Se comprueban ANTES que las senales positivas (la exclusion gana: una
-# columna numeric con physical DOUBLE es una medida, no una FK).
-_FK_PHYSICAL_EXCLUDE = (
-    "float", "double", "decimal", "numeric", "real",
-    "date", "time", "timestamp", "interval",
-    "bool",
-)
-
-
-def _fk_name_signal(name):
-    """Detecta el sufijo de clave foranea en el nombre y devuelve ``(stem, sufijo)``.
-
-    Reconoce ``<algo>_id`` (snake), ``<Algo>Id`` y ``<algo>ID`` (camel). NO reconoce
-    el ``id``/``Id``/``ID`` generico a secas (suele ser la PK propia de la tabla, no
-    una referencia). En camelCase la ``I`` mayuscula marca el limite de palabra, asi
-    que ``paid``/``valid``/``grid`` (``id`` en minuscula y sin separador) NO matchean.
-
-    El ``stem`` se devuelve en minusculas y sirve de ``ref_table_guess`` (la tabla a
-    la que probablemente apunta): ``customer_id`` -> ``"customer"``, ``AlbumId`` ->
-    ``"album"``, ``manager_staff_id`` -> ``"manager_staff"``. Devuelve ``None`` si no
-    hay senal de nombre.
-    """
-    if not isinstance(name, str):
-        return None
-    raw = name.strip()
-    if not raw:
-        return None
-    # Snake: termina en "_id" (indiferente a mayusculas en la parte "id").
-    if raw.lower().endswith("_id"):
-        stem = raw[:-3].rstrip("_-. ")
-        if not stem:
-            return None
-        return (stem.lower(), "_id")
-    # Camel todo-mayuscula: "...ID" (p.ej. customerID).
-    if raw.endswith("ID"):
-        stem = raw[:-2].rstrip("_-. ")
-        if not stem:
-            return None
-        return (stem.lower(), "ID")
-    # Camel: "...Id" (p.ej. AlbumId).
-    if raw.endswith("Id"):
-        stem = raw[:-2].rstrip("_-. ")
-        if not stem:
-            return None
-        return (stem.lower(), "Id")
-    return None
-
-
-def _fk_type_compatible(col):
-    """True si el tipo de la columna admite ser clave foranea.
-
-    Compatible si el ``physical_type`` NO es una medida flotante, una temporal ni
-    boolean, Y ademas (``inferred_type`` en {numeric, categorical, integer} O el
-    ``physical_type`` empieza por entero/varchar/text/char/uuid). La comparacion es
-    indistinta a mayusculas/minusculas.
-    """
-    phys = (col.get("physical_type") or "").strip().lower()
-    inferred = (col.get("inferred_type") or "").strip().lower()
-    # Exclusion por tipo fisico (gana sobre cualquier senal positiva).
-    for bad in _FK_PHYSICAL_EXCLUDE:
-        if phys.startswith(bad):
-            return False
-    # Senal positiva por tipo inferido.
-    if inferred in _FK_INFERRED_OK:
-        return True
-    # Senal positiva por tipo fisico (entero/texto/uuid).
-    for good in _FK_PHYSICAL_PREFIXES:
-        if phys.startswith(good):
-            return True
-    return False
-
-
-def suggest_intratable_fk_candidates(profile: dict, max_candidates: int = 20) -> list:
-    """Sugiere columnas que parecen una FK intra-tabla por nombre + cardinalidad.
-
-    Heuristica (no afirma nada): una columna es candidata a clave foranea si su nombre
-    tiene sufijo de id con stem no vacio (``<algo>_id`` / ``<Algo>Id`` / ``<algo>ID``,
-    NUNCA el ``id`` generico), no es ya candidata a PK, no es constante, tiene
-    cardinalidad alta pero por debajo del numero de filas (N:1, no unica) y un tipo
-    compatible con clave (entero/categorico/texto/uuid; nunca float/fecha/boolean).
-
-    Args:
-        profile: TableProfile (dict de ``profile_table``). Se leen, de forma
-            defensiva, ``columns`` (lista de ColumnProfile), ``n_rows`` y
-            ``key_candidates`` (nombres de columna ya candidatos a PK).
-        max_candidates: tope de sugerencias devueltas (default 20). Las columnas se
-            ordenan por ``distinct_count`` descendente (mas informativas primero)
-            antes de cortar.
-
-    Returns:
-        list de dicts (posiblemente vacia), uno por columna sugerida, con claves:
-        ``column``, ``ref_table_guess`` (stem del nombre), ``reason`` (frase humana),
-        ``distinct_count``, ``unique_pct`` (fraccion 0-1 tal como viene del profile),
-        ``inferred_type``, ``physical_type``. Nunca lanza: si ``profile`` no es dict o
-        no hay columnas, devuelve ``[]``.
-    """
-    if not isinstance(profile, dict):
-        return []
-    columns = profile.get("columns")
-    if not isinstance(columns, list):
-        return []
-
-    n_rows = profile.get("n_rows")
-    has_n_rows = (
-        isinstance(n_rows, int) and not isinstance(n_rows, bool) and n_rows > 0
-    )
-
-    key_candidates = profile.get("key_candidates")
-    if not isinstance(key_candidates, (list, tuple, set)):
-        key_candidates = []
-    key_set = set(key_candidates)
-
-    out = []
-    for col in columns:
-        if not isinstance(col, dict):
-            continue
-        name = col.get("name")
-
-        # 1) Senal de nombre: sufijo de id con stem no vacio.
-        signal = _fk_name_signal(name)
-        if signal is None:
-            continue
-        ref_guess, suffix = signal
-
-        # 2) No es ya candidata a PK (clave primaria de la propia tabla).
-        if name in key_set:
-            continue
-
-        # 3) No constante y con >= 2 valores distintos.
-        flags = col.get("flags") or []
-        if "constant" in flags:
-            continue
-        dc = col.get("distinct_count")
-        if not (isinstance(dc, int) and not isinstance(dc, bool) and dc >= 2):
-            continue
-
-        # 4) Cardinalidad alta pero < n_rows (no es PK) y no parece unica.
-        if has_n_rows and dc >= n_rows:
-            continue
-        unique_pct = col.get("unique_pct")
-        has_unique = (
-            isinstance(unique_pct, (int, float)) and not isinstance(unique_pct, bool)
-        )
-        if has_unique and unique_pct >= 0.99:
-            continue
-
-        # 5) Tipo compatible con clave foranea (entero/categorico/texto; no medida).
-        if not _fk_type_compatible(col):
-            continue
-
-        out.append(
-            {
-                "column": name,
-                "ref_table_guess": ref_guess,
-                "reason": _build_reason(suffix, dc, n_rows if has_n_rows else None, ref_guess),
-                "distinct_count": dc,
-                "unique_pct": float(unique_pct) if has_unique else None,
-                "inferred_type": col.get("inferred_type") or "",
-                "physical_type": col.get("physical_type") or "",
-            }
-        )
-
-    # Mas informativas primero (mayor cardinalidad), luego corte.
-    out.sort(key=lambda d: d.get("distinct_count") or 0, reverse=True)
-    return out[: max(0, int(max_candidates))]
-
-
-def _build_reason(suffix, dc, n_rows, ref_guess):
-    """Frase humana que deja claro que la sugerencia es heuristica, no confirmada."""
-    if n_rows is not None:
-        card = f"es N:1 ({dc} valores distintos < {n_rows} filas)"
-    else:
-        card = f"tiene {dc} valores distintos que se repiten (cardinalidad N:1)"
-    return (
-        f"el nombre termina en '{suffix}' y {card}: parece (heuristica por nombre, "
-        f"sin confirmar containment) una referencia a una tabla «{ref_guess}»"
-    )
@@ -1,157 +0,0 @@
-"""Tests para suggest_intratable_fk_candidates (funcion pura, sin I/O)."""
-
-from suggest_intratable_fk_candidates import suggest_intratable_fk_candidates
-
-
-def _col(name, inferred_type="numeric", physical_type="BIGINT", distinct_count=10,
-         unique_pct=0.1, flags=None):
-    """Construye un ColumnProfile minimo a mano (el dict que emite profile_table)."""
-    return {
-        "name": name,
-        "inferred_type": inferred_type,
-        "physical_type": physical_type,
-        "semantic_type": "",
-        "distinct_count": distinct_count,
-        "unique_pct": unique_pct,
-        "null_count": 0,
-        "null_pct": 0.0,
-        "flags": list(flags) if flags else [],
-    }
-
-
-def test_golden_customer_id_detectado_otras_no():
-    # Tabla tipo titanic: customer_id es FK N:1; id es la PK; amount es medida;
-    # name es categorica sin sufijo de id. Solo customer_id debe aparecer.
-    profile = {
-        "n_rows": 891,
-        "key_candidates": ["id"],
-        "columns": [
-            _col("id", inferred_type="numeric", physical_type="BIGINT",
-                 distinct_count=891, unique_pct=1.0, flags=["possible_id"]),
-            _col("customer_id", inferred_type="numeric", physical_type="BIGINT",
-                 distinct_count=137, unique_pct=0.15, flags=[]),
-            _col("amount", inferred_type="numeric", physical_type="DOUBLE",
-                 distinct_count=400, unique_pct=0.45),
-            _col("name", inferred_type="categorical", physical_type="VARCHAR",
-                 distinct_count=700, unique_pct=0.78),
-        ],
-    }
-    out = suggest_intratable_fk_candidates(profile)
-    assert isinstance(out, list)
-    assert [c["column"] for c in out] == ["customer_id"]
-    cand = out[0]
-    assert cand["ref_table_guess"] == "customer"
-    assert cand["distinct_count"] == 137
-    assert cand["unique_pct"] == 0.15
-    assert cand["inferred_type"] == "numeric"
-    assert cand["physical_type"] == "BIGINT"
-    # La razon deja claro que es heuristica + cita el sufijo y la tabla.
-    assert "customer" in cand["reason"]
-    assert "_id" in cand["reason"]
-
-
-def test_camelcase_albumid_detectado():
-    # AlbumId (camelCase, VARCHAR) -> detectada, ref_table_guess "album".
-    profile = {
-        "n_rows": 3503,
-        "key_candidates": ["TrackId"],
-        "columns": [
-            _col("AlbumId", inferred_type="categorical", physical_type="VARCHAR",
-                 distinct_count=347, unique_pct=0.10),
-        ],
-    }
-    out = suggest_intratable_fk_candidates(profile)
-    # TrackId es PK candidata (en key_candidates), AlbumId no -> AlbumId aparece.
-    assert [c["column"] for c in out] == ["AlbumId"]
-    assert out[0]["ref_table_guess"] == "album"
-
-
-def test_constante_status_id_no_aparece():
-    # status_id constante (flag "constant", distinct_count 1) NO es FK util.
-    profile = {
-        "n_rows": 1000,
-        "key_candidates": [],
-        "columns": [
-            _col("status_id", inferred_type="numeric", physical_type="INTEGER",
-                 distinct_count=1, unique_pct=0.001, flags=["constant"]),
-        ],
-    }
-    out = suggest_intratable_fk_candidates(profile)
-    assert out == []
-
-
-def test_profile_vacio_y_none_devuelven_lista_vacia():
-    # Lectura defensiva: ni {} ni None lanzan; devuelven [].
-    assert suggest_intratable_fk_candidates({}) == []
-    assert suggest_intratable_fk_candidates(None) == []
-    # profile sin columns o con columns no-lista tampoco lanza.
-    assert suggest_intratable_fk_candidates({"n_rows": 10}) == []
-    assert suggest_intratable_fk_candidates({"columns": "no-soy-lista"}) == []
-
-
-def test_category_id_casi_unico_parece_pk_no_aparece():
-    # unique_pct 0.999 -> parece PK (no N:1) -> NO se sugiere como FK.
-    profile = {
-        "n_rows": 891,
-        "key_candidates": [],
-        "columns": [
-            _col("category_id", inferred_type="numeric", physical_type="BIGINT",
-                 distinct_count=890, unique_pct=0.999),
-        ],
-    }
-    out = suggest_intratable_fk_candidates(profile)
-    assert out == []
-
-
-def test_ref_table_guess_multitoken_y_orden_por_distinct():
-    # manager_staff_id conserva los underscores del stem -> "manager_staff".
-    # Ademas, con varias candidatas, se ordenan por distinct_count descendente.
-    profile = {
-        "n_rows": 10000,
-        "key_candidates": ["staff_id"],  # staff_id es PK aqui, no debe aparecer
-        "columns": [
-            _col("staff_id", inferred_type="numeric", physical_type="BIGINT",
-                 distinct_count=10000, unique_pct=1.0, flags=["possible_id"]),
-            _col("store_id", inferred_type="numeric", physical_type="INTEGER",
-                 distinct_count=2, unique_pct=0.0002),
-            _col("manager_staff_id", inferred_type="numeric", physical_type="INTEGER",
-                 distinct_count=40, unique_pct=0.004),
-        ],
-    }
-    out = suggest_intratable_fk_candidates(profile)
-    cols = [c["column"] for c in out]
-    # staff_id excluida (PK); las otras dos ordenadas por distinct desc.
-    assert cols == ["manager_staff_id", "store_id"]
-    refs = {c["column"]: c["ref_table_guess"] for c in out}
-    assert refs["manager_staff_id"] == "manager_staff"
-    assert refs["store_id"] == "store"
-
-
-def test_max_candidates_corta_la_lista():
-    # max_candidates limita el numero de sugerencias devueltas.
-    profile = {
-        "n_rows": 10000,
-        "key_candidates": [],
-        "columns": [
-            _col("a_id", distinct_count=300, unique_pct=0.03),
-            _col("b_id", distinct_count=200, unique_pct=0.02),
-            _col("c_id", distinct_count=100, unique_pct=0.01),
-        ],
-    }
-    out = suggest_intratable_fk_candidates(profile, max_candidates=2)
-    assert [c["column"] for c in out] == ["a_id", "b_id"]
-
-
-def test_id_generico_solo_nunca_es_fk():
-    # 'id'/'Id'/'ID' a secas (sin stem) jamas se sugieren como FK.
-    profile = {
-        "n_rows": 500,
-        "key_candidates": [],
-        "columns": [
-            _col("id", distinct_count=500, unique_pct=1.0),
-            _col("Id", distinct_count=120, unique_pct=0.24),
-            _col("ID", distinct_count=80, unique_pct=0.16),
-        ],
-    }
-    out = suggest_intratable_fk_candidates(profile)
-    assert out == []
@@ -3,7 +3,7 @@ name: summarize_table_duckdb
 kind: function
 lang: py
 domain: datascience
-version: "1.0.0"
+version: "1.1.0"
 purity: impure
 signature: "def summarize_table_duckdb(db_path: str, table: str, high_card_ratio: float = 0.9) -> dict"
 description: "Perfila una tabla DuckDB en una sola pasada SQL (SUMMARIZE, push-down sin traer filas a RAM) y devuelve el esqueleto de un TableProfile con el perfil base por columna. Corazon del grupo eda: base barata sobre la que otras funciones anaden lo estadistico fino (skew/kurtosis/histograma sobre muestra)."
@@ -64,6 +64,7 @@ else:
 - **`distinct_count` exacto para tablas <=200k filas, aproximado+capado por encima**: `SUMMARIZE` usa HyperLogLog (`approx_unique`), que SOBREESTIMA y en tablas pequenas puede reportar mas distintos que filas (inflando `unique_pct` por encima de 1.0 y disparando flags `possible_id` falsos). Por eso, para `n_rows <= 200000` la funcion calcula `COUNT(DISTINCT)` EXACTO en una sola query combinada (barata) y usa ese valor. Para tablas mas grandes mantiene `approx_unique` pero lo CAPA a `n_rows` (`distinct_count = min(approx_unique, n_rows)`). En ambos casos `unique_pct = min(distinct_count / n_rows, 1.0)`, asi que `distinct_count` nunca supera las filas ni `unique_pct` pasa de 1.0. Los flags `possible_id` / `high_cardinality` derivan de ese `distinct_count` ya corregido (exacto y fiable por debajo de 200k filas; aproximado y conservador por encima).
 - **`SUMMARIZE` NO da skew, kurtosis ni histograma**, ni percentiles finos (p1/p5/p95/p99), moda, outliers, correlaciones, key_candidates ni quality_score. Esas claves quedan en `None`/`[]` a proposito: las rellena otra funcion del grupo `eda` sobre una muestra. El sub-dict `numeric` solo trae min, max, mean, std, p25, p50, p75.
 - **`SUMMARIZE.count` es el total de filas, no el no-nulo**: la funcion deriva el `count` no-nulo del ColumnProfile como `n_rows - null_count` (con `null_count` redondeado de `null_percentage`).
+- **`duplicate_rows`/`duplicate_pct` se pueblan push-down** (desde v1.1.0) con `count(*)` sobre `SELECT DISTINCT *` (sin traer filas a RAM): `duplicate_rows = n_rows - filas_distintas`, `duplicate_pct` en fraccion 0-1. Habilitan la dimension de unicidad de registro del score de dataset (`profile_table` paso 6). Si la tabla tiene tipos no comparables con `DISTINCT` (BLOB/LIST/MAP) la query degrada y ambas vuelven a `None` (renormaliza el score a solo `cell_quality`).
 - **min/max/avg/std/q25/q50/q75 vienen como strings** desde DuckDB; se convierten a float (None si la columna no es numerica).
 - **Requiere DuckDB 1.5.2** (columnas de `SUMMARIZE` validadas con esa version: column_name, column_type, min, max, approx_unique, avg, std, q25, q50, q75, count, null_percentage).
 - **El identificador de tabla se interpola** (no parametrizable en `SUMMARIZE`): por eso se valida contra `^[A-Za-z_][A-Za-z0-9_]*$` antes de citarlo. Un nombre invalido (p.ej. con `;` o espacios) devuelve `{status:'error'}` sin tocar la base.
@@ -196,6 +196,21 @@ def summarize_table_duckdb(
            sum(c["null_pct"] for c in columns) / len(columns) if columns else 0.0
        )

+        # Unicidad de registro: filas duplicadas via COUNT de filas distintas
+        # push-down (DISTINCT *), sin traer filas a RAM. Habilita la dimension
+        # de uniqueness del score de dataset (1 - duplicate_pct). Degrada a None
+        # si la tabla tiene tipos no comparables con DISTINCT (BLOB/LIST/MAP).
+        duplicate_rows = None
+        duplicate_pct = None
+        if n_rows > 0:
+            dup_res = duckdb_query_readonly(
+                db_path, f"SELECT count(*) AS c FROM (SELECT DISTINCT * FROM {quoted})"
+            )
+            if dup_res["status"] == "ok" and dup_res["rows"]:
+                distinct_rows = int(dup_res["rows"][0]["c"])
+                duplicate_rows = max(0, n_rows - distinct_rows)
+                duplicate_pct = duplicate_rows / n_rows  # fraccion 0-1
+
        profile = {
            "table": table,
            "source": "duckdb",
@@ -203,8 +218,8 @@ def summarize_table_duckdb(
            "n_rows": n_rows,
            "n_cols": len(columns),
            "size_bytes": None,
-            "duplicate_rows": None,
-            "duplicate_pct": None,
+            "duplicate_rows": duplicate_rows,
+            "duplicate_pct": duplicate_pct,
            "constant_cols": constant_cols,
            "all_null_cols": all_null_cols,
            "null_cell_pct": null_cell_pct,
@@ -54,6 +54,30 @@ def test_shape_y_metadatos_tabla(db):
    assert profile["correlations"] is None


+def test_duplicate_pct_sin_duplicados(db):
+    """Tabla con todas las filas distintas: duplicate_pct = 0, no None."""
+    profile = summarize_table_duckdb(db, "ventas")["profile"]
+    assert profile["duplicate_rows"] == 0
+    assert profile["duplicate_pct"] == 0.0
+
+
+def test_duplicate_pct_con_duplicados(tmp_path):
+    """Filas repetidas: duplicate_rows/duplicate_pct se pueblan push-down."""
+    path = str(tmp_path / "dups.duckdb")
+    con = duckdb.connect(path)
+    con.execute("CREATE TABLE t (a INTEGER, b VARCHAR)")
+    # 5 filas, 2 de ellas idénticas a otras -> 2 duplicadas sobre 5 = 0.4.
+    con.execute(
+        "INSERT INTO t VALUES "
+        "(1,'x'), (2,'y'), (1,'x'), (3,'z'), (2,'y')"
+    )
+    con.close()
+    profile = summarize_table_duckdb(path, "t")["profile"]
+    assert profile["n_rows"] == 5
+    assert profile["duplicate_rows"] == 2
+    assert profile["duplicate_pct"] == 0.4
+
+
 def test_column_profile_shape(db):
    profile = summarize_table_duckdb(db, "ventas")["profile"]
    by_name = {c["name"]: c for c in profile["columns"]}
@@ -4,7 +4,7 @@ kind: pipeline
 lang: py
 domain: pipelines
 purity: impure
-version: "1.0.0"
+version: "1.1.0"
 signature: "def profile_table(db_path: str, table: str, backend: str = \"duckdb\", sample: int = 5000, run_models: bool = False, run_llm: bool = False, run_series: bool = False, emit_pdf: bool = False, emit_automatic: bool = False, report_dir: str = \"reports\", write_report: bool = True) -> dict"
 description: "Orquestador one-shot del grupo de capacidad eda: perfila UNA tabla (DuckDB o PostgreSQL) end-to-end componiendo las funciones del grupo (perfil base SQL + muestreo read-only + inferencia semantica + promocion de tipo + estadistica numerica/categorica + score de calidad + correlaciones con correccion FDR + re-expresion de Tukey + avisos exploratorios) y, opcional, modelos baratos (run_models), interpretacion LLM (run_llm) y analisis de serie temporal por columna (run_series: estacionariedad ADF+KPSS, ACF/PACF, STL, retornos). Emite el TableProfile completo mas (opcional) report markdown + JSON sidecar + PDF movil (emit_pdf). Es la composicion canonica para hazme un EDA de esta tabla."
 tags: [eda, duckdb, postgres, profiling, data-quality, pipeline, dataops, timeseries]
@@ -114,3 +114,12 @@ para auditar la calidad de una tabla ya productiva. Reemplaza orquestar a mano
  Formatos exoticos pueden descartarse silenciosamente del calculo numerico.
 - `db_path` debe existir: DuckDB read-only NO crea la base. El muestreo usa el
  sandbox por defecto de `duckdb_query_readonly` (sin acceso a FS/red).
+- **Score de calidad (report 2046, desde v1.1.0).** Paso 5: cada columna recibe
+  `quality_score` de `column_quality_score` con la formula 60/40
+  (completeness/validity); al promocionar texto a numero/fecha se expone
+  `col["validity_rate"]` (parse rate de la muestra) para alimentar la dimension
+  validity. Paso 6: el score de dataset NO es la media simple — es
+  `100 * (0.85*cell_quality + 0.15*row_uniqueness)`, donde
+  `cell_quality = media(score_col/100)` y `row_uniqueness = 1 - duplicate_pct`.
+  Si `duplicate_pct` es `None` (backend sin calcularlo) el score se renormaliza
+  a solo `cell_quality`. Los outliers NO bajan el score (van a `observations`).
@@ -477,9 +477,18 @@ def profile_table(
                    if vals and (len(ok) / len(vals)) >= _PROMOTE_MIN_PARSE:
                        col["inferred_type"] = "numeric"
                        inferred = "numeric"
+                        # Tasa de parseo real de la muestra: alimenta la
+                        # dimension validity de column_quality_score (fraccion
+                        # de valores conformes al tipo numerico promovido).
+                        col["validity_rate"] = len(ok) / len(vals)
                elif semantic in _DATETIME_SEMANTIC:
                    col["inferred_type"] = "datetime"
                    inferred = "datetime"
+                    # Tasa de parseo de la muestra a fecha (mismo papel que el
+                    # parse rate numerico) para la dimension validity.
+                    parsed_dt = [_to_ordinal_days(v) for v in vals]
+                    ok_dt = [d for d in parsed_dt if d is not None]
+                    col["validity_rate"] = (len(ok_dt) / len(vals)) if vals else None

            # 4) Enriquecer segun el inferred_type final.
            if inferred == "numeric":
@@ -506,11 +515,36 @@ def profile_table(
            # 5) Score de calidad por columna.
            col["quality_score"] = column_quality_score(col).get("score")

-        # 6) Score agregado de la tabla (media de columnas).
+        # 6) Score agregado de la tabla (report 2046): NO media simple.
+        #   cell_quality   = media de los scores de columna, en [0,1].
+        #   row_uniqueness = 1 - duplicate_pct (unicidad de registro).
+        #   score = 100 * (0.85*cell_quality + 0.15*row_uniqueness).
+        # Renormaliza a solo cell_quality si duplicate_pct no se pudo calcular.
        scores = [
            c["quality_score"] for c in cols if c.get("quality_score") is not None
        ]
-        prof["quality_score"] = round(sum(scores) / len(scores), 1) if scores else None
+        if scores:
+            cell_quality = (sum(scores) / len(scores)) / 100.0
+            dup_pct = prof.get("duplicate_pct")
+            if dup_pct is not None:
+                try:
+                    d = float(dup_pct)
+                except (TypeError, ValueError):
+                    d = None
+            else:
+                d = None
+            if d is not None:
+                # Tolerar escala 0-100 por si algun backend la entrega asi.
+                if d > 1.0:
+                    d = d / 100.0
+                row_uniqueness = max(0.0, min(1.0, 1.0 - d))
+                prof["quality_score"] = round(
+                    100.0 * (0.85 * cell_quality + 0.15 * row_uniqueness), 1
+                )
+            else:
+                prof["quality_score"] = round(100.0 * cell_quality, 1)
+        else:
+            prof["quality_score"] = None

        # 7) Candidatos a clave.
        key_candidates = []