Compare commits

..

1 Commits

Author SHA1 Message Date
egutierrez a2074a0167 feat(eda): nueva fórmula de calidad de datos (report 2046) + capítulo calidad
Implementa el modelo de calidad del report 2046 en el grupo eda.

Score de columna: 0.6·completeness + 0.4·validity con renormalización por
aplicabilidad (si la validez no es medible —texto libre o columna 100% nula— el
score se basa solo en completeness). Validez = conformidad real al tipo: nativo
numérico/fecha/bool = 1.0; texto promovido a número/fecha = parse rate
(validity_rate); texto con semantic_type = match_rate; texto libre = no aplica.

Outliers, columnas constantes e identificadores salen del score a un bloque de
observaciones analíticas (no son defectos de calidad). Se elimina el doble
conteo de la falta de datos (mostly_null ya no castiga validez) y el bug de
escala de outliers (que además ya no entran en el score).

Score de dataset: 100·(0.85·cell_quality + 0.15·row_uniqueness) en vez de la
media simple. Se pobla duplicate_rows/duplicate_pct push-down en
summarize_table_duckdb (COUNT sobre DISTINCT *, sin RAM) para habilitar la
unicidad de registro; renormaliza a solo cell_quality si no se puede calcular.

Capítulo calidad (v2.0.0): intro de dos dimensiones (60/40) que declara que los
outliers no bajan el score; tabla de scores Columna|Calidad|Completitud|Validez
(sin Consistencia, n/a cuando no aplica); DOS tablas separadas (Problemas de
calidad vs Observaciones analíticas); resumen con Unicidad de registro; glosario
clicable de completitud, validez, unicidad de registro y calidad de datos.

Verificado: 123 tests verdes (automatic_eda + render_automatic_eda +
column_quality_score + summarize_table_duckdb + profile_table). Golden EDA de
titanic (run_models+run_llm) con score recomputado a mano, outliers separados en
observaciones y glosario clicable (5 links GOTO en el PDF).

column_quality_score v2.0.0, summarize_table_duckdb v1.1.0, profile_table v1.1.0.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-30 18:10:23 +02:00
20 changed files with 779 additions and 1957 deletions
-4
View File
@@ -34,7 +34,6 @@ from .theils_u import theils_u
from .correlation_ratio import correlation_ratio
from .mutual_info_columns import mutual_info_columns
from .infer_fk_containment_duckdb import infer_fk_containment_duckdb
from .detect_declared_keys_duckdb import detect_declared_keys_duckdb
from .build_join_graph import build_join_graph
from .association_matrix import association_matrix
from .correlation_matrix_duckdb import correlation_matrix_duckdb
@@ -70,10 +69,8 @@ from .build_eda_render_ctx import build_eda_render_ctx
from .profile_datetime import profile_datetime
from .resample_timeseries import resample_timeseries
from .add_pdf_internal_links import add_pdf_internal_links
from .suggest_intratable_fk_candidates import suggest_intratable_fk_candidates
__all__ = [
"suggest_intratable_fk_candidates",
"detect_time_column",
"extract_timeseries_raw",
"build_eda_render_ctx",
@@ -100,7 +97,6 @@ __all__ = [
"correlation_ratio",
"mutual_info_columns",
"infer_fk_containment_duckdb",
"detect_declared_keys_duckdb",
"build_join_graph",
"association_matrix",
"correlation_matrix_duckdb",
@@ -1,22 +1,26 @@
"""Data-quality chapter (CALIDAD) for AutomaticEDA.
Builds the quality chapter from a ``TableProfile`` of the ``eda`` group. The
chapter answers, in Spanish and as tables, the three things the user asked for:
chapter implements the quality model of report 2046:
1. **En qué se basa la calidad** — an intro paragraph explaining the criteria and
their weights (completeness, validity, consistency) before any number, plus a
table-level summary (global score and aggregates).
1. **En qué se basa la calidad** — an intro paragraph explaining the two scored
dimensions and their weights (completitud 60%, validez 40%) plus the
table-level row uniqueness, BEFORE any number, and stating explicitly that
outliers are reported as observations and do **not** lower the score. The
criteria terms (calidad de datos, completitud, validez, unicidad de registro)
are hooked into the shared glossary as clickable jumps.
2. **Scores por columna** — a table with, per column, the total quality score and
its breakdown into completeness / validity / consistency.
3. **Problemas en español** — a second table listing, per column, the readable
issues in Spanish (kept separate from the type ``flags``).
its breakdown into completeness / validity (no consistency dimension).
3. **Problemas de calidad** — a table listing ONLY real quality defects
(nulls, empty cells, values not conforming to their type/semantics).
4. **Observaciones analíticas** — a SEPARATE table for outliers, constant
columns, high-cardinality ids and strong skew, with an explicit note that
these do not affect the score.
The breakdown and the issues are NOT recomputed here: they come from the registry
function ``column_quality_score`` (group ``eda``), which already derives
``{score, completeness, validity, consistency, issues}`` from the ColumnProfile.
This chapter is render-only — it consumes that function and lays the result out
as model blocks; the renderers paginate tables (splitting by rows, repeating the
header) and wrap long cells so nothing is ever cut.
The breakdown, issues and observations are NOT recomputed here: they come from
the registry function ``column_quality_score`` (group ``eda``), which derives
``{score, completeness, validity, dimensions, applicable, issues,
observations}`` from the ColumnProfile. This chapter is render-only.
Contract: build_<id>(profile, ctx) -> Chapter | None ; CHAPTER_VERSION = "x.y.z".
"""
@@ -33,28 +37,47 @@ try: # pragma: no cover - import wiring
except Exception: # noqa: BLE001 - never let an import error abort the document.
_column_quality_score = None
CHAPTER_VERSION = "1.0.0"
CHAPTER_VERSION = "2.0.0"
CHAPTER_ID = "calidad"
CHAPTER_TITLE = "Calidad"
# Weights mirror column_quality_score: completeness 0.5, validity 0.3,
# consistency 0.2. Kept here only to render the human explanation; the actual
# numbers always come from the function so the two never drift in computation.
_CRITERIA_INTRO = (
"La calidad de cada columna es un score de 0 a 100 que combina tres "
"criterios, cada uno con un peso:\n\n"
"- **Completitud (peso 50%)**: proporción de valores presentes (sin nulos "
"ni vacíos). Una columna con muchos nulos baja de score.\n"
"- **Validez (peso 30%)**: los valores son coherentes con su tipo y rango "
"esperado (penaliza outliers y semánticas declaradas que no coinciden).\n"
"- **Consistencia (peso 20%)**: la columna aporta información útil (penaliza "
"columnas constantes o identificadores de cardinalidad muy alta).\n\n"
"Score = 100 × (0,5·completitud + 0,3·validez + 0,2·consistencia). "
"Los problemas detectados por columna se listan en español más abajo."
)
# Glossary terms this chapter explains (report 2046 §6). Registered in the shared
# collector and marked clickable on their first appearance (contract §11.1).
_TERMS = {
"calidad_datos": (
"Calidad de datos (score 0-100)",
"Mide hasta qué punto los datos están presentes y son utilizables tal "
"cual, no si son «buenos para el análisis». Se compone solo de "
"dimensiones medibles automáticamente desde el perfil de la tabla, sin "
"fuente externa de verdad: completitud (60%), validez (40%, cuando es "
"medible) y, a nivel de tabla, unicidad de registro. Los valores "
"atípicos NO bajan la calidad: se listan aparte como observaciones.",
),
"completitud": (
"Completitud",
"Proporción de valores realmente presentes en una columna (1 % de "
"nulos; en texto, las celdas vacías también cuentan como faltantes). Los "
"nulos y vacíos bajan el score porque falta información que debería "
"estar. Pesa el 60% del score de columna.",
),
"validez": (
"Validez",
"Proporción de valores que encajan con su tipo o formato esperado: un "
"número que parsea, una fecha legible, un email con forma de email. Los "
"valores que no parsean a su tipo bajan el score. Si la columna es texto "
"libre sin formato esperado, la validez no se puede medir y el score se "
"basa solo en la completitud. Pesa el 40% del score cuando es medible.",
),
"unicidad_registro": (
"Unicidad de registro",
"A nivel de tabla, las filas duplicadas restan calidad al conjunto "
"(1 % de filas duplicadas). Es distinta de que una columna no-clave "
"repita valores, que no es un defecto de calidad.",
),
}
# Cap for the joined issues cell so a single row never grows taller than a page;
# the remainder is summarized as "(+N más)" instead of being silently dropped.
# Cap for the joined cell so a single row never grows taller than a page; the
# remainder is summarized as "(+N más)" instead of being silently dropped.
_ISSUES_MAXLEN = 160
@@ -82,12 +105,19 @@ def _fmt_unit_pct(value) -> str:
return str(value)
def _fmt_validity(value) -> str:
"""Validity is ``None`` when not applicable: show ``n/a`` not a fake 0%."""
if value is None:
return "n/a"
return _fmt_unit_pct(value)
def _quality_of(col: dict) -> dict:
"""Return ``{score, completeness, validity, consistency, issues}`` for a column.
"""Return the quality dict for a column.
Uses the registry ``column_quality_score`` when available; otherwise falls
back to the per-column ``quality_score`` already in the profile (number only,
empty breakdown/issues). Never raises.
empty breakdown/issues/observations). Never raises.
"""
if not isinstance(col, dict):
col = {}
@@ -98,26 +128,25 @@ def _quality_of(col: dict) -> dict:
return res
except Exception: # noqa: BLE001 - degrade instead of aborting.
pass
# Fallback: only the final score is available pre-computed in the profile.
return {
"score": col.get("quality_score"),
"completeness": None,
"validity": None,
"consistency": None,
"issues": [],
"observations": [],
}
def _join_issues(issues) -> str:
"""Join Spanish issue strings into one cell, truncating overly long lists.
def _join_cells(items) -> str:
"""Join Spanish strings into one cell, truncating overly long lists.
The renderer wraps cell text, but a column with many long issues could make a
single row taller than a whole page; cap the length and append ``(+N más)``
so the count of hidden issues is honest rather than silently lost.
The renderer wraps cell text, but a column with many long entries could make
a single row taller than a whole page; cap the length and append ``(+N más)``
so the count of hidden entries is honest rather than silently lost.
"""
if not isinstance(issues, (list, tuple)) or not issues:
if not isinstance(items, (list, tuple)) or not items:
return ""
parts = [model._safe_str(i).strip() for i in issues]
parts = [model._safe_str(i).strip() for i in items]
parts = [p for p in parts if p]
if not parts:
return ""
@@ -142,6 +171,33 @@ def _columns_with_quality(profile: dict):
yield c, _quality_of(c)
def _fmt_unit_pct_or_pct(value) -> str:
"""Format a value that may be a 0-1 fraction or an already-0-100 percentage."""
try:
num = float(value)
except (TypeError, ValueError):
return model._safe_str(value)
if num != num: # NaN
return ""
pct = num * 100 if num <= 1.0 else num
text = f"{pct:.1f}".rstrip("0").rstrip(".")
return f"{text}%"
def _row_uniqueness(profile: dict):
"""Return row uniqueness (1 - duplicate_pct) in [0,1], or None if unknown."""
dup = profile.get("duplicate_pct")
if dup is None:
return None
try:
d = float(dup)
except (TypeError, ValueError):
return None
if d > 1.0: # tolerate a 0-100 scale
d = d / 100.0
return max(0.0, min(1.0, 1.0 - d))
def _summary_block(profile: dict, evaluated: list):
"""Table-level KVTable: global score and quality aggregates."""
rows = []
@@ -153,14 +209,15 @@ def _summary_block(profile: dict, evaluated: list):
if isinstance(q.get("completeness"), (int, float))]
vals = [q.get("validity") for _, q in evaluated
if isinstance(q.get("validity"), (int, float))]
cons = [q.get("consistency") for _, q in evaluated
if isinstance(q.get("consistency"), (int, float))]
if comps:
rows.append(("Completitud media", _fmt_unit_pct(sum(comps) / len(comps))))
if vals:
rows.append(("Validez media", _fmt_unit_pct(sum(vals) / len(vals))))
if cons:
rows.append(("Consistencia media", _fmt_unit_pct(sum(cons) / len(cons))))
rows.append(("Validez media (donde aplica)",
_fmt_unit_pct(sum(vals) / len(vals))))
ru = _row_uniqueness(profile)
if ru is not None:
rows.append(("Unicidad de registro", _fmt_unit_pct(ru)))
n_problem = sum(1 for _, q in evaluated if q.get("issues"))
rows.append(("Columnas con problemas", str(n_problem)))
@@ -182,22 +239,9 @@ def _summary_block(profile: dict, evaluated: list):
return model.KVTable(rows=rows, title="Resumen de calidad")
def _fmt_unit_pct_or_pct(value) -> str:
"""Format a value that may be a 0-1 fraction or an already-0-100 percentage."""
try:
num = float(value)
except (TypeError, ValueError):
return model._safe_str(value)
if num != num: # NaN
return ""
pct = num * 100 if num <= 1.0 else num
text = f"{pct:.1f}".rstrip("0").rstrip(".")
return f"{text}%"
def _scores_block(evaluated: list):
"""DataTable with per-column score and its three-criteria breakdown."""
header = ["Columna", "Calidad", "Completitud", "Validez", "Consistencia"]
"""DataTable with per-column score and its completeness/validity breakdown."""
header = ["Columna", "Calidad", "Completitud", "Validez"]
rows = []
# Worst columns first so the reader sees the problems at the top.
ordered = sorted(
@@ -210,22 +254,22 @@ def _scores_block(evaluated: list):
col.get("name") or "(col)",
_fmt_score(q.get("score")),
_fmt_unit_pct(q.get("completeness")),
_fmt_unit_pct(q.get("validity")),
_fmt_unit_pct(q.get("consistency")),
_fmt_validity(q.get("validity")),
])
if not rows:
return None
return model.DataTable(header=header, rows=rows,
title="Scores de calidad por columna",
note="0 = peor, 100 = mejor; ordenado de peor a mejor")
note="0 = peor, 100 = mejor; «n/a» = dimensión no "
"medible; ordenado de peor a mejor")
def _issues_block(evaluated: list):
"""DataTable listing Spanish issues per column, or a Note when there are none."""
header = ["Columna", "Problemas detectados (español)"]
"""DataTable listing ONLY real quality defects per column, or a Note."""
header = ["Columna", "Problemas de calidad (español)"]
rows = []
for col, q in evaluated:
joined = _join_issues(q.get("issues"))
joined = _join_cells(q.get("issues"))
if joined:
rows.append([col.get("name") or "(col)", joined])
if not rows:
@@ -235,6 +279,63 @@ def _issues_block(evaluated: list):
title="Problemas de calidad por columna")
def _observations_block(evaluated: list):
"""DataTable listing analytical observations per column, or None.
Observations (outliers, constant columns, ids, strong skew) are NOT quality
defects: they do not affect the score. Returned as a separate table from the
issues so the report never presents a legitimate outlier as a problem.
"""
header = ["Columna", "Observaciones analíticas"]
rows = []
for col, q in evaluated:
joined = _join_cells(q.get("observations"))
if joined:
rows.append([col.get("name") or "(col)", joined])
if not rows:
return None
return model.DataTable(
header=header, rows=rows,
title="Observaciones analíticas por columna",
note="No son defectos de calidad y NO afectan al score; orientan el "
"análisis (atípicos, columnas constantes, identificadores).")
def _term(key: str, label: str, mark: bool) -> str:
"""Render a term as a clickable glossary span when marking is enabled."""
if mark:
return f"[[term:{key}]]**{label}**[[/term]]"
return f"**{label}**"
def _criteria_intro(mark: bool) -> str:
"""Intro paragraph explaining the two scored dimensions and the principle."""
calidad = _term("calidad_datos", "calidad de datos", mark)
completitud = _term("completitud", "Completitud (peso 60%)", mark)
validez = _term("validez", "Validez (peso 40%, cuando es medible)", mark)
unicidad = _term("unicidad_registro", "unicidad de registro", mark)
return (
f"La {calidad} de cada columna es un score de 0 a 100 que combina solo "
"dimensiones medibles desde el perfil de la tabla, sin fuente externa "
"de verdad:\n\n"
f"- {completitud}: proporción de valores presentes (1 % de nulos; en "
"texto, las celdas vacías cuentan como faltantes). Los nulos y vacíos "
"bajan el score.\n"
f"- {validez}: proporción de valores que encajan con su tipo o formato "
"(un número que parsea, una fecha legible, un email con forma de email). "
"Si una columna es texto libre sin formato esperado, la validez no se "
"mide y el score se basa solo en la completitud.\n\n"
f"Score de columna = 100 × (0,6·completitud + 0,4·validez), "
"renormalizado cuando la validez no aplica. A nivel de tabla se añade "
f"la {unicidad} (1 % de filas duplicadas).\n\n"
"**Los valores atípicos (outliers) NO bajan la calidad.** Un valor "
"extremo puede ser real y correcto; detectar atípicos es parte del "
"análisis de la distribución, no un juicio de corrección. Por eso, junto "
"con las columnas constantes y los identificadores, se listan aparte "
"como **observaciones analíticas** que no afectan al score."
)
def build_calidad(profile: dict, ctx: dict):
"""Build the data-quality Chapter, or None if the profile has no columns.
@@ -250,17 +351,35 @@ def build_calidad(profile: dict, ctx: dict):
if not evaluated:
return None # no columns to score -> chapter does not apply.
# Register the criteria terms in the shared glossary (if present) and mark
# their first appearance clickable. Contract §11.1.
glossary = ctx.get("glossary")
mark = False
if isinstance(glossary, model.GlossaryCollector):
for key, (label, definition) in _TERMS.items():
glossary.add(key, label, definition)
mark = True
blocks = [
model.Heading(text="Cómo se calcula la calidad", level=2),
model.Markdown(text=_CRITERIA_INTRO),
model.Markdown(text=_criteria_intro(mark)),
_summary_block(profile, evaluated),
model.Heading(text="Scores por columna", level=2),
]
scores = _scores_block(evaluated)
if scores is not None:
blocks.append(scores)
blocks.append(model.Heading(text="Problemas detectados", level=2))
blocks.append(model.Heading(text="Problemas de calidad", level=2))
blocks.append(_issues_block(evaluated))
observations = _observations_block(evaluated)
if observations is not None:
blocks.append(model.Heading(text="Observaciones analíticas", level=2))
blocks.append(model.Note(
"Las observaciones siguientes NO son defectos de calidad y no "
"afectan al score: son señales para orientar el análisis."))
blocks.append(observations)
return model.Chapter(id=CHAPTER_ID, title=CHAPTER_TITLE,
version=CHAPTER_VERSION, blocks=blocks)
@@ -1,11 +1,12 @@
"""Tests for the CALIDAD chapter — DoD: golden + edges + anti-cut.
"""Tests for the CALIDAD chapter — DoD: golden + edges + anti-cut + glossary.
Self-contained: builds synthetic TableProfiles (no DuckDB) so the suite is fast
and deterministic. Verifies that the chapter explains the quality criteria, shows
per-column scores with the completeness/validity/consistency breakdown, lists the
issues in Spanish (separate from the type flags), returns None when it does not
apply, and that a wide profile with long names renders to PDF and PPTX without
cutting any cell text (long content wraps, it is never truncated).
and deterministic. Verifies the report-2046 quality model: the chapter explains
the two scored dimensions (completitud 60% / validez 40%), shows per-column
scores without a consistency column, keeps quality DEFECTS (issues) separate
from analytical OBSERVATIONS (outliers, constant, ids), hooks the criteria terms
into the glossary, returns None when it does not apply, and renders a wide
profile to PDF and PPTX without cutting any cell text.
"""
import os
@@ -20,28 +21,30 @@ from datascience.automatic_eda.chapters.calidad import (
CHAPTER_VERSION,
)
from datascience.automatic_eda import build_document, render_pdf, render_pptx
from datascience.automatic_eda import model
def _profile() -> dict:
"""A small profile with one column per quality problem (nulls, outliers,
constant, high-cardinality id) plus one clean column."""
constant, high-cardinality id) plus one clean column. ``outlier_pct`` is in
the 0-100 scale that describe_numeric actually emits."""
return {
"table": "demo",
"quality_score": 72.5,
"quality_score": 82.0,
"duplicate_pct": 0.04,
"null_cell_pct": 0.11,
"constant_cols": ["flag_const"],
"all_null_cols": [],
"columns": [
{"name": "edad", "inferred_type": "integer", "null_pct": 0.2,
"numeric": {"outlier_pct": 0.15, "min": 0, "max": 99},
"quality_score": 60},
{"name": "edad", "inferred_type": "numeric", "null_pct": 0.2,
"n_rows": 100, "unique_pct": 0.5,
"numeric": {"outlier_pct": 15.0, "min": 0, "max": 99}},
{"name": "nombre", "inferred_type": "text", "null_pct": 0.0,
"unique_pct": 0.98, "quality_score": 80},
"unique_pct": 0.98, "flags": ["possible_id"]},
{"name": "flag_const", "inferred_type": "text", "null_pct": 0.0,
"flags": ["constant"], "quality_score": 50},
{"name": "limpia", "inferred_type": "float", "null_pct": 0.0,
"numeric": {"outlier_pct": 0.0}, "quality_score": 100},
"unique_pct": 0.01, "flags": ["constant"]},
{"name": "limpia", "inferred_type": "numeric", "null_pct": 0.0,
"unique_pct": 0.5, "numeric": {"outlier_pct": 0.0}},
],
}
@@ -50,16 +53,9 @@ def _tables(chapter):
return [b for b in chapter.blocks if getattr(b, "kind", None) == "data_table"]
def _scores_table(chapter):
def _table_by_title(chapter, needle):
for t in _tables(chapter):
if "Scores" in (t.title or ""):
return t
return None
def _issues_table(chapter):
for t in _tables(chapter):
if "Problemas" in (t.title or ""):
if needle in (t.title or ""):
return t
return None
@@ -73,41 +69,84 @@ def test_golden_chapter_estructura_y_version():
assert ch.id == "calidad"
assert ch.version == CHAPTER_VERSION
kinds = [b.kind for b in ch.blocks]
# intro heading + markdown criteria + summary kv + scores table + issues table
assert "markdown" in kinds and "kv_table" in kinds and "data_table" in kinds
def test_golden_intro_explica_criterios_y_pesos():
def test_golden_intro_explica_dos_dimensiones_y_pesos():
ch = build_calidad(_profile(), {})
intro = [b for b in ch.blocks if b.kind == "markdown"][0].text
for needle in ("Completitud", "Validez", "Consistencia",
"50%", "30%", "20%"):
for needle in ("Completitud", "Validez", "60%", "40%",
"unicidad de registro"):
assert needle in intro, f"falta {needle!r} en la intro de criterios"
# El principio: los outliers NO bajan la calidad.
assert "atípicos" in intro and "NO bajan" in intro
# Ya no se menciona la dimensión consistencia eliminada.
assert "20%" not in intro
def test_golden_scores_incluyen_desglose_por_criterio():
def test_golden_scores_sin_columna_consistencia():
ch = build_calidad(_profile(), {})
scores = _scores_table(ch)
scores = _table_by_title(ch, "Scores")
assert scores is not None
assert scores.header == ["Columna", "Calidad", "Completitud",
"Validez", "Consistencia"]
# 4 columns scored, none dropped.
assert scores.header == ["Columna", "Calidad", "Completitud", "Validez"]
assert "Consistencia" not in scores.header
assert len(scores.rows) == 4
names = {r[0] for r in scores.rows}
assert names == {"edad", "nombre", "flag_const", "limpia"}
def test_golden_issues_en_espanol_separados_de_flags():
def test_golden_outliers_en_observaciones_no_en_problemas():
ch = build_calidad(_profile(), {})
issues = _issues_table(ch)
assert issues is not None
flat = " | ".join(" ".join(r) for r in issues.rows)
assert "nulos" in flat # completeness issue (ES)
assert "outliers" in flat # validity issue (ES)
assert "columna constante" in flat
assert "posible id de alta cardinalidad" in flat
# The raw type flag string must NOT leak as a "problem".
assert "constant" not in flat or "columna constante" in flat
problemas = _table_by_title(ch, "Problemas de calidad")
observaciones = _table_by_title(ch, "Observaciones")
assert problemas is not None
assert observaciones is not None
problemas_txt = " | ".join(" ".join(r) for r in problemas.rows)
observaciones_txt = " | ".join(" ".join(r) for r in observaciones.rows)
# Los nulos SÍ son problema de calidad.
assert "nulos" in problemas_txt
# Los outliers NO aparecen como problema...
assert "atípic" not in problemas_txt and "outlier" not in problemas_txt
# ...sino como observación analítica.
assert "atípic" in observaciones_txt
# Constante e id: observaciones, no problemas.
assert "constante" in observaciones_txt
assert "identificador" in observaciones_txt
assert "constante" not in problemas_txt
def test_golden_score_columna_limpia_es_100():
"""Columna sin nulos, numérica nativa: score 100 aunque tenga (o no) outliers."""
ch = build_calidad(_profile(), {})
scores = _table_by_title(ch, "Scores")
by_name = {r[0]: r for r in scores.rows}
assert by_name["limpia"][1] == "100 / 100"
# edad: 20% nulos -> 100*(0.6*0.8 + 0.4*1.0) = 88; los outliers no bajan nada.
assert by_name["edad"][1] == "88 / 100"
# --------------------------------------------------------------------------- #
# Glosario (contrato §11.1)
# --------------------------------------------------------------------------- #
def test_glosario_registra_los_cuatro_terminos_y_marca_clicable():
glossary = model.GlossaryCollector()
ch = build_calidad(_profile(), {"glossary": glossary})
for key in ("calidad_datos", "completitud", "validez", "unicidad_registro"):
assert glossary.has(key), f"término {key!r} no registrado en el glosario"
intro = [b for b in ch.blocks if b.kind == "markdown"][0].text
# Con colector presente, la primera aparición se marca clicable.
assert "[[term:completitud]]" in intro
assert "[[term:validez]]" in intro
assert "[[term:calidad_datos]]" in intro
assert "[[term:unicidad_registro]]" in intro
def test_sin_glosario_no_marca_terminos():
ch = build_calidad(_profile(), {}) # ctx sin glossary
intro = [b for b in ch.blocks if b.kind == "markdown"][0].text
assert "[[term:" not in intro
# --------------------------------------------------------------------------- #
@@ -124,17 +163,17 @@ def test_edge_perfil_limpio_sin_problemas_usa_nota():
prof = {
"quality_score": 100,
"columns": [
{"name": "a", "inferred_type": "float", "null_pct": 0.0,
"numeric": {"outlier_pct": 0.0}},
{"name": "b", "inferred_type": "float", "null_pct": 0.0,
"numeric": {"outlier_pct": 0.0}},
{"name": "a", "inferred_type": "numeric", "null_pct": 0.0,
"unique_pct": 0.5, "numeric": {"outlier_pct": 0.0}},
{"name": "b", "inferred_type": "numeric", "null_pct": 0.0,
"unique_pct": 0.5, "numeric": {"outlier_pct": 0.0}},
],
}
ch = build_calidad(prof, {})
assert ch is not None
assert _issues_table(ch) is None # no issues table
assert _table_by_title(ch, "Problemas de calidad") is None # no issues table
notes = [b for b in ch.blocks if b.kind == "note"]
assert notes and "No se detectaron problemas" in notes[0].text
assert any("No se detectaron problemas" in n.text for n in notes)
# --------------------------------------------------------------------------- #
@@ -143,44 +182,42 @@ def test_edge_perfil_limpio_sin_problemas_usa_nota():
def _wide_profile(ncols: int = 22) -> dict:
cols = [
{"name": "identificador_unico_de_transaccion_con_nombre_muy_largo",
"inferred_type": "text", "null_pct": 0.0, "unique_pct": 0.99},
"inferred_type": "text", "null_pct": 0.0, "unique_pct": 0.99,
"flags": ["possible_id"]},
{"name": "columna_constante_sin_ninguna_variacion_de_valor",
"inferred_type": "text", "null_pct": 0.0, "flags": ["constant"]},
"inferred_type": "text", "null_pct": 0.0, "unique_pct": 0.01,
"flags": ["constant"]},
]
for k in range(ncols - 2):
cols.append({
"name": f"metrica_numerica_de_negocio_{k:02d}_con_nombre_largo",
"inferred_type": "float", "null_pct": 0.1 + (k % 3) * 0.05,
"numeric": {"outlier_pct": 0.08, "min": 0, "max": 1000},
"inferred_type": "numeric", "null_pct": 0.1 + (k % 3) * 0.05,
"unique_pct": 0.5,
"numeric": {"outlier_pct": 8.0, "min": 0, "max": 1000},
})
return {"table": "ancha", "quality_score": 70.0, "columns": cols}
return {"table": "ancha", "quality_score": 70.0, "duplicate_pct": 0.0,
"columns": cols}
def test_anticut_pdf_y_pptx_no_truncan_nombres_largos():
prof = _wide_profile(22)
full = build_document(prof, {"dataset_name": "ancha"})
assert any(c.id == "calidad" for c in full)
# Render ONLY the calidad chapter so the anti-cut assertions are scoped to
# this chapter (other chapters, e.g. portada, legitimately contain '…').
chapters = [c for c in full if c.id == "calidad"]
long_name = "metrica_numerica_de_negocio_00_con_nombre_largo"
with tempfile.TemporaryDirectory() as d:
pdf = os.path.join(d, "q.pdf")
pptx = os.path.join(d, "q.pptx")
rp = render_pdf(chapters, pdf, {"title": "EDA"})
rx = render_pptx(chapters, pptx, {"title": "EDA"})
render_pptx(chapters, pptx, {"title": "EDA"})
assert os.path.exists(pdf) and os.path.exists(pptx)
# The wide table forces pagination across several pages/slides.
assert (rp or {}).get("n_pages", 0) >= 2
# PDF: the long name survives whole once wraps (spaces/newlines) removed,
# and there is no truncation marker.
pdf_txt = "".join((pg.extract_text() or "") for pg in PdfReader(pdf).pages)
assert "" not in pdf_txt and "..." not in pdf_txt
norm = re.sub(r"\s+", "", pdf_txt)
assert long_name in norm, "el nombre largo se cortó en el PDF"
# PPTX: long name present in some cell, untruncated.
allt = []
for s in Presentation(pptx).slides:
for sh in s.shapes:
@@ -1,500 +0,0 @@
"""Key-relations chapter (RELACIONES) — the keys / join structure of the data.
This chapter is the *relational* section of an AutomaticEDA report. It answers a
single question for the table (or the whole DuckDB source it lives in): **how do
the keys relate?** It composes, without reimplementing them, the registry's
relation primitives and degrades honestly when a layer does not apply.
It renders, in order, only the layers that have something to say:
1. **Declared keys** (real schema constraints) — when the DuckDB source declares
PRIMARY KEY / FOREIGN KEY / UNIQUE constraints, they are read verbatim via
``detect_declared_keys_duckdb`` and shown as ground truth: which column is the
PK, which columns are FKs and the table/column they point to.
2. **Primary-key candidates** — the ``key_candidates`` the TableProfile already
carries (columns whose cardinality equals the row count, with no nulls). These
are *candidates*: a column that could serve as the row identifier.
3. **Foreign-key candidates** when none are declared:
- **Inter-table** (the DuckDB source has several tables): real FK candidates by
name signal + value containment via ``infer_fk_containment_duckdb``, plus the
join graph (roles + a pasteable Mermaid diagram) via ``build_join_graph``.
- **Intra-table** (a single table): columns that *look* like a foreign key by a
name+cardinality heuristic (``suggest_intratable_fk_candidates``). This is a
**suggestion**, explicitly flagged as a heuristic, never an assertion.
``build_relaciones(profile, ctx) -> Chapter | None``: returns ``None`` when there
is nothing to say (no declared key, no key candidates, and no FK candidate —
inter- or intra-table). Reads everything defensively (``.get``) and never raises:
anything missing degrades to a note or is omitted; a failing registry call drops
its layer instead of aborting the chapter.
ctx keys this chapter consumes (all optional):
db_path, table : str — the DuckDB file and table being profiled (set by
``build_eda_render_ctx``). ``db_path`` is needed to read declared
constraints, to list the sibling tables, and to run the containment-based
FK inference. Without it, only the profile-derived layers (PK candidates,
intra-table FK heuristic) are available.
glossary : model.GlossaryCollector — shared glossary; the chapter registers
the relational terms (PK, FK, containment, cardinality) and marks their
first appearance clickable.
Contract: build_<id>(profile, ctx) -> Chapter | None ; CHAPTER_VERSION = "x.y.z".
"""
from __future__ import annotations
from .. import model
# Pure/impure registry functions (group ``eda``) this chapter composes. Imported
# defensively (module-leaf imports, like the AGREGACION chapter) so the chapter
# still builds — degrading the affected layer to nothing — if a function is
# somehow unavailable / not indexed yet.
try:
from datascience.detect_declared_keys_duckdb import detect_declared_keys_duckdb
except Exception: # noqa: BLE001 — keep the chapter importable no matter what.
detect_declared_keys_duckdb = None # type: ignore[assignment]
try:
from datascience.infer_fk_containment_duckdb import infer_fk_containment_duckdb
except Exception: # noqa: BLE001
infer_fk_containment_duckdb = None # type: ignore[assignment]
try:
from datascience.build_join_graph import build_join_graph
except Exception: # noqa: BLE001
build_join_graph = None # type: ignore[assignment]
try:
from datascience.suggest_intratable_fk_candidates import (
suggest_intratable_fk_candidates,
)
except Exception: # noqa: BLE001
suggest_intratable_fk_candidates = None # type: ignore[assignment]
try:
from infra import duckdb_list_tables
except Exception: # noqa: BLE001
duckdb_list_tables = None # type: ignore[assignment]
CHAPTER_VERSION = "1.0.0"
CHAPTER_ID = "relaciones"
CHAPTER_TITLE = "Relaciones de clave"
# Cap the inter-table FK table so a wide schema does not blow up the page; the
# rest is summarized in a closing note (no silent truncation).
MAX_FK_ROWS = 40
# --------------------------------------------------------------------------- #
# Glossary terms this chapter explains. Registered in the shared collector and
# marked clickable on their first appearance (contract §11.1).
# --------------------------------------------------------------------------- #
_TERMS = {
"pk": (
"Clave primaria (PK)",
"Columna (o conjunto de columnas) que identifica de forma única cada fila "
"de una tabla: sus valores no se repiten y no son nulos. Una tabla tiene "
"como mucho una clave primaria; es el ancla por la que otras tablas la "
"referencian.",
),
"fk": (
"Clave foránea (FK)",
"Columna de una tabla cuyos valores apuntan a la clave primaria de otra "
"tabla (o de la misma), creando una relación entre ambas. Una FK suele ser "
"N:1: muchas filas de la tabla origen comparten el mismo valor de la tabla "
"destino.",
),
"containment": (
"Containment / inclusión",
"Señal con la que se infiere una clave foránea sin que la base la declare: "
"la fracción de valores distintos de una columna A que también aparecen "
"como valores de otra columna B. Si casi todos los valores de A están "
"contenidos en B (inclusión ≈ 1) y B parece una clave, A → B es una FK "
"candidata.",
),
"cardinalidad": (
"Cardinalidad",
"Número de valores distintos de una columna. Cardinalidad igual al número "
"de filas (y sin nulos) señala un identificador (candidato a clave "
"primaria); cardinalidad alta pero menor que el número de filas, con "
"valores repetidos, es típica de una clave foránea.",
),
}
def _register_terms(ctx: dict) -> bool:
"""Register the relational terms in the shared glossary. Returns whether the
in-text appearances should be marked clickable."""
glossary = ctx.get("glossary")
if not isinstance(glossary, model.GlossaryCollector):
return False
for key, (label, definition) in _TERMS.items():
glossary.add(key, label, definition)
return True
# --------------------------------------------------------------------------- #
# Formatting helpers (mirror the other chapters' defensive style).
# --------------------------------------------------------------------------- #
def _fmt_int(value) -> str:
if value is None:
return ""
try:
return f"{int(value):,}".replace(",", ".")
except (TypeError, ValueError):
return model._safe_str(value)
def _fmt_pct_fraction(value, decimals: int = 1) -> str:
"""Format a 01 fraction as a percentage. None -> placeholder."""
if value is None:
return ""
try:
v = float(value)
except (TypeError, ValueError):
return model._safe_str(value)
if v <= 1.0:
v *= 100.0
return f"{v:.{decimals}f}%"
def _fmt_ratio(value, decimals: int = 3) -> str:
"""Format an already-01 ratio (inclusion) as a plain number."""
if value is None:
return ""
try:
return f"{float(value):.{decimals}f}".rstrip("0").rstrip(".")
except (TypeError, ValueError):
return model._safe_str(value)
def _is_dict(v) -> bool:
return isinstance(v, dict)
def _columns_by_name(profile: dict) -> dict:
"""Index the profile columns by name for quick metric lookup."""
out = {}
for col in (profile.get("columns") or []):
if _is_dict(col) and col.get("name") is not None:
out[col.get("name")] = col
return out
# --------------------------------------------------------------------------- #
# Layer 1 — declared keys (real schema constraints).
# --------------------------------------------------------------------------- #
def _declared_keys(db_path: str, table: str):
"""Read declared PK/FK/UNIQUE for the source, or None if unavailable."""
if not db_path or detect_declared_keys_duckdb is None:
return None
try:
out = detect_declared_keys_duckdb(db_path, table)
except Exception: # noqa: BLE001 — dict-no-throw: treat as unavailable.
return None
if not _is_dict(out) or out.get("status") != "ok":
return None
return out
def _declared_section(declared: dict) -> list:
"""Blocks for the declared-keys layer, or [] if there is nothing declared."""
pks = [p for p in (declared.get("primary_keys") or []) if _is_dict(p)]
fks = [f for f in (declared.get("foreign_keys") or []) if _is_dict(f)]
uqs = [u for u in (declared.get("unique") or []) if _is_dict(u)]
if not (pks or fks or uqs):
return []
blocks = [
model.Heading(text="Claves declaradas en el esquema", level=2),
model.Markdown(text=(
"La base **declara** estas relaciones de clave como restricciones "
"reales del esquema (constraints). Son la verdad de referencia: no se "
"infieren, se leen tal cual de la definición de las tablas.")),
]
if pks:
rows = [[model._safe_str(p.get("table")),
", ".join(model._safe_str(c) for c in (p.get("columns") or []))]
for p in pks]
blocks.append(model.DataTable(
header=["Tabla", "Columna(s) PK"], rows=rows,
title="Claves primarias declaradas",
note="Cada fila: la clave primaria declarada de una tabla."))
if fks:
rows = []
for f in fks:
src = ", ".join(model._safe_str(c) for c in (f.get("columns") or []))
dst = ", ".join(
model._safe_str(c) for c in (f.get("referenced_columns") or []))
rows.append([
model._safe_str(f.get("table")), src,
model._safe_str(f.get("referenced_table")), dst])
blocks.append(model.DataTable(
header=["Tabla origen", "Columna(s) FK", "→ Tabla destino",
"Columna(s) destino"],
rows=rows, title="Claves foráneas declaradas",
note="Cada fila: una FK declarada — origen → destino."))
if uqs:
rows = [[model._safe_str(u.get("table")),
", ".join(model._safe_str(c) for c in (u.get("columns") or []))]
for u in uqs]
blocks.append(model.DataTable(
header=["Tabla", "Columna(s) UNIQUE"], rows=rows,
title="Restricciones UNIQUE declaradas"))
return blocks
# --------------------------------------------------------------------------- #
# Layer 2 — primary-key candidates (from the profile).
# --------------------------------------------------------------------------- #
def _pk_candidates_section(profile: dict, mark: bool) -> list:
"""Blocks for the PK-candidates layer, or [] if there are none."""
keys = [k for k in (profile.get("key_candidates") or []) if k is not None]
if not keys:
return []
by_name = _columns_by_name(profile)
pk = ("[[term:pk]]**clave primaria**[[/term]]" if mark
else "**clave primaria**")
intro = (
f"Estas columnas son **candidatas a {pk}**: su "
"[[term:cardinalidad]]cardinalidad[[/term]] iguala al número de filas y no "
"tienen nulos, así que cada valor identifica una fila distinta. Son "
"candidatas, no una clave declarada: la base no las marca como tal."
if mark else
"Estas columnas son **candidatas a clave primaria**: su cardinalidad "
"iguala al número de filas y no tienen nulos, así que cada valor "
"identifica una fila distinta.")
rows = []
for name in keys:
col = by_name.get(name) or {}
rows.append([
model._safe_str(name),
_fmt_int(col.get("distinct_count")),
_fmt_pct_fraction(col.get("unique_pct")),
model._safe_str(col.get("inferred_type") or col.get("physical_type") or ""),
])
return [
model.Heading(text="Candidatos a clave primaria", level=2),
model.Markdown(text=intro),
model.DataTable(
header=["Columna", "Valores distintos", "% único", "Tipo"],
rows=rows, title="Candidatas a clave primaria",
note=f"{_fmt_int(profile.get('n_rows'))} filas en total como referencia."),
]
# --------------------------------------------------------------------------- #
# Layer 3a — inter-table FK candidates (containment) + join graph.
# --------------------------------------------------------------------------- #
def _list_source_tables(db_path: str) -> list:
"""List the tables in the DuckDB source, or [] if it can't be listed."""
if not db_path or duckdb_list_tables is None:
return []
try:
out = duckdb_list_tables(db_path)
except Exception: # noqa: BLE001
return []
if not _is_dict(out) or out.get("status") != "ok":
return []
return [t for t in (out.get("tables") or []) if isinstance(t, str)]
def _inter_table_section(db_path: str, tables: list, mark: bool) -> list:
"""Blocks for the inter-table FK layer (containment + join graph), or []."""
if infer_fk_containment_duckdb is None or len(tables) < 2:
return []
try:
fk = infer_fk_containment_duckdb(db_path, tables=tables)
except Exception: # noqa: BLE001
return []
if not _is_dict(fk) or fk.get("status") != "ok":
return []
candidates = [c for c in (fk.get("fk_candidates") or []) if _is_dict(c)]
if not candidates:
return []
containment = ("[[term:containment]]containment (inclusión de valores)[[/term]]"
if mark else "containment (inclusión de valores)")
fk_term = "[[term:fk]]**claves foráneas**[[/term]]" if mark else "**claves foráneas**"
blocks = [
model.Heading(text="Claves foráneas candidatas (inter-tabla)", level=2),
model.Markdown(text=(
f"La fuente tiene varias tablas. Estas {fk_term} candidatas se infieren "
f"por señal de nombre y por {containment}: una columna de una tabla cuyos "
"valores están contenidos en la clave de otra. No están declaradas por "
"la base; son la relación más probable según los datos.")),
]
shown = candidates[:MAX_FK_ROWS]
rows = []
for c in shown:
rows.append([
f"{model._safe_str(c.get('from_table'))}.{model._safe_str(c.get('from_col'))}",
f"{model._safe_str(c.get('to_table'))}.{model._safe_str(c.get('to_col'))}",
_fmt_ratio(c.get("inclusion")),
model._safe_str(c.get("cardinality") or ""),
"" if c.get("name_match") else "no",
])
note = "Ordenadas por señal de nombre e inclusión."
if len(candidates) > len(shown):
note += f" Se muestran {len(shown)} de {len(candidates)} candidatas."
blocks.append(model.DataTable(
header=["Origen", "→ Destino", "Inclusión", "Cardinalidad", "Coincide nombre"],
rows=rows, title="FK candidatas por containment", note=note))
# Join graph: node roles + a pasteable Mermaid diagram, kept together.
if build_join_graph is not None:
try:
graph = build_join_graph(candidates, tables=tables)
except Exception: # noqa: BLE001
graph = None
if _is_dict(graph):
graph_blocks = [model.Heading(text="Grafo de relaciones", level=3)]
nodes = [n for n in (graph.get("nodes") or []) if _is_dict(n)]
if nodes:
node_rows = [[
model._safe_str(n.get("table")),
model._safe_str(n.get("role") or ""),
_fmt_int(n.get("out_degree")),
_fmt_int(n.get("in_degree")),
] for n in nodes]
graph_blocks.append(model.DataTable(
header=["Tabla", "Rol", "FK salientes", "FK entrantes"],
rows=node_rows, title="Tablas y su rol en el grafo",
note="Rol: fact (apunta a otras), dimension (referenciada), "
"bridge (ambas), standalone (aislada)."))
hubs = [h for h in (graph.get("hubs") or []) if h]
if hubs:
graph_blocks.append(model.Markdown(text=(
"Tablas con más relaciones salientes (candidatas a tabla de "
"hechos): " + ", ".join(model._safe_str(h) for h in hubs) + ".")))
mermaid = model._safe_str(graph.get("mermaid")).strip()
if mermaid:
graph_blocks.append(model.Markdown(text=(
"Diagrama de las relaciones (pegable en un bloque Mermaid):")))
graph_blocks.append(model.Markdown(
text="```mermaid\n" + mermaid + "\n```"))
if len(graph_blocks) > 1:
blocks.append(model.Group(blocks=graph_blocks,
title="Grafo de relaciones"))
skipped = [s for s in (fk.get("skipped") or []) if s]
if skipped:
blocks.append(model.Note(
"Algunos pares se omitieron por tamaño: "
+ "; ".join(model._safe_str(s) for s in skipped) + "."))
return blocks
# --------------------------------------------------------------------------- #
# Layer 3b — intra-table FK candidates (name+cardinality heuristic).
# --------------------------------------------------------------------------- #
def _intra_table_section(profile: dict, mark: bool) -> list:
"""Blocks for the intra-table FK heuristic layer, or [] if no candidates."""
if suggest_intratable_fk_candidates is None:
return []
try:
cands = suggest_intratable_fk_candidates(profile)
except Exception: # noqa: BLE001
return []
cands = [c for c in (cands or []) if _is_dict(c)]
if not cands:
return []
fk_term = "[[term:fk]]**claves foráneas**[[/term]]" if mark else "**claves foráneas**"
blocks = [
model.Heading(text="Posibles claves foráneas (heurística de nombre)", level=2),
model.Markdown(text=(
f"No hay otras tablas que referenciar, pero algunas columnas **parecen** "
f"{fk_term} por su nombre (terminan en «id») y su cardinalidad (muchos "
"valores repetidos, N:1). Es una **sugerencia heurística**, no una "
"afirmación: el nombre de la tabla destino es una conjetura y no se "
"comprueba inclusión de valores contra ninguna tabla real.")),
]
rows = []
for c in cands:
rows.append([
model._safe_str(c.get("column")),
model._safe_str(c.get("ref_table_guess") or ""),
_fmt_int(c.get("distinct_count")),
_fmt_pct_fraction(c.get("unique_pct")),
model._safe_str(c.get("inferred_type") or c.get("physical_type") or ""),
model._safe_str(c.get("reason") or ""),
])
blocks.append(model.DataTable(
header=["Columna", "Posible tabla", "Valores distintos", "% único",
"Tipo", "Motivo"],
rows=rows, title="Posibles FK por nombre y cardinalidad",
note="Heurística: posibles falsos positivos/negativos. No confirma containment."))
blocks.append(model.Note(
"Estas sugerencias se basan solo en el nombre y la cardinalidad. Para "
"confirmarlas haría falta la tabla destino y comprobar la inclusión de "
"valores (containment)."))
return blocks
# --------------------------------------------------------------------------- #
# Entry point.
# --------------------------------------------------------------------------- #
def _intro_blocks(mark: bool) -> list:
pk = "[[term:pk]]clave primaria[[/term]]" if mark else "clave primaria"
fk = "[[term:fk]]clave foránea[[/term]]" if mark else "clave foránea"
text = (
f"Este capítulo analiza las **relaciones de clave** de la tabla: qué columna "
f"identifica cada fila (la {pk}) y qué columnas referencian a otra tabla (las "
f"{fk}). Cuando la base las **declara** como restricciones del esquema, se "
"muestran tal cual; cuando no, se proponen las más probables a partir de los "
"datos —por inclusión de valores entre tablas (containment) o, en una sola "
"tabla, por una heurística de nombre y cardinalidad— siempre marcadas como "
"candidatas, nunca como hechos.")
return [model.Heading(text=CHAPTER_TITLE, level=1), model.Markdown(text=text)]
def build_relaciones(profile: dict, ctx: dict):
"""Build the RELACIONES Chapter, or None if there is nothing to say.
Args:
profile: the ``eda`` group TableProfile dict (may be None/empty).
ctx: presentation context. Consumes ``db_path`` + ``table`` (to read
declared constraints, list sibling tables and run the containment FK
inference) and ``glossary`` (to register the relational terms).
Returns:
A ``model.Chapter`` with the applicable relation layers; or ``None`` when
the dataset has no declared key, no key candidates and no FK candidate
(neither inter- nor intra-table).
"""
if not isinstance(profile, dict):
profile = {}
ctx = ctx if isinstance(ctx, dict) else {}
db_path = ctx.get("db_path")
table = ctx.get("table")
mark = _register_terms(ctx)
# Build each layer; the chapter is the concatenation of the non-empty ones.
declared = _declared_keys(db_path, table)
declared_blocks = _declared_section(declared) if declared else []
declared_has_fk = bool(declared and declared.get("foreign_keys"))
pk_blocks = _pk_candidates_section(profile, mark)
tables = _list_source_tables(db_path)
inter_blocks = _inter_table_section(db_path, tables, mark)
# The intra-table heuristic only makes sense when no real FK is available for
# this table — neither declared nor inferred inter-table. Otherwise the real
# relations already answer the question and the heuristic is just noise.
if declared_has_fk or inter_blocks:
intra_blocks = []
else:
intra_blocks = _intra_table_section(profile, mark)
body = declared_blocks + pk_blocks + inter_blocks + intra_blocks
if not body:
return None # chapter does not apply: nothing to say about relations.
blocks = _intro_blocks(mark) + body
return model.Chapter(id=CHAPTER_ID, title=CHAPTER_TITLE,
version=CHAPTER_VERSION, blocks=blocks)
@@ -1,273 +0,0 @@
"""Tests for the RELACIONES chapter — DoD: golden(s) + edges + no-cut render.
Two goldens covering the two real paths of the chapter:
- **Intra-table** (a single table, no db source for relations): the chapter shows
the primary-key candidates from the profile and the heuristic foreign-key
suggestions (name + cardinality), explicitly flagged as a heuristic. Renders to
PDF and PPTX with nothing cut.
- **Inter-table** (a real DuckDB file with two related tables, customers/orders,
with a declared FK): the chapter shows the declared keys, the containment-based
FK candidates and the join graph (roles + a pasteable Mermaid diagram).
Edges: a profile with no key candidate and no FK-looking column returns None;
``None`` / ``{}`` profiles do not raise. The chapter registers its glossary terms.
Layers that depend on the sibling registry functions delegated alongside this
chapter (``detect_declared_keys_duckdb``, ``suggest_intratable_fk_candidates``)
are asserted **conditionally on the function being importable**, so the chapter's
honest-degradation contract is what is tested, never a hard dependency on import
timing.
"""
import os
import tempfile
import duckdb
from pptx import Presentation
from pypdf import PdfReader
from datascience.automatic_eda.chapters.relaciones import build_relaciones
from datascience.automatic_eda.model import Chapter, Group, GlossaryCollector
from datascience.render_automatic_eda_pdf import render_automatic_eda_pdf
from datascience.render_automatic_eda_pptx import render_automatic_eda_pptx
# The optional sibling functions: their layers are asserted only when present.
try:
from datascience.detect_declared_keys_duckdb import detect_declared_keys_duckdb
except Exception: # noqa: BLE001
detect_declared_keys_duckdb = None
try:
from datascience.suggest_intratable_fk_candidates import (
suggest_intratable_fk_candidates,
)
except Exception: # noqa: BLE001
suggest_intratable_fk_candidates = None
# --------------------------------------------------------------------------- #
# Helpers.
# --------------------------------------------------------------------------- #
def _flatten(blocks) -> list:
"""Flatten Group blocks so a test can inspect every leaf block."""
out = []
for b in blocks:
if isinstance(b, Group):
out.extend(_flatten(b.blocks))
else:
out.append(b)
return out
def _text_of(chapter: Chapter) -> str:
"""Collect all visible text of a chapter's blocks into one string."""
parts = []
for b in _flatten(chapter.blocks):
for attr in ("text", "title", "note"):
v = getattr(b, attr, None)
if isinstance(v, str):
parts.append(v)
header = getattr(b, "header", None)
if isinstance(header, list):
parts.extend(str(c) for c in header)
rows = getattr(b, "rows", None)
if isinstance(rows, list):
for r in rows:
if isinstance(r, (list, tuple)):
parts.extend(str(c) for c in r)
else:
parts.append(str(r))
return "\n".join(parts)
def _render_both(chapter: Chapter, tag: str):
"""Render the chapter to PDF and PPTX; return (pdf_text, n_slides)."""
tmp = tempfile.mkdtemp(prefix=f"relaciones_{tag}_")
pdf_path = os.path.join(tmp, "out.pdf")
pptx_path = os.path.join(tmp, "out.pptx")
meta = {"title": f"EDA — {tag}"}
render_automatic_eda_pdf([chapter], pdf_path, meta)
render_automatic_eda_pptx([chapter], pptx_path, meta)
assert os.path.exists(pdf_path) and os.path.getsize(pdf_path) > 0
assert os.path.exists(pptx_path) and os.path.getsize(pptx_path) > 0
text = "".join(p.extract_text() or "" for p in PdfReader(pdf_path).pages)
n_slides = len(Presentation(pptx_path).slides)
return text, n_slides
# --------------------------------------------------------------------------- #
# Fixtures.
# --------------------------------------------------------------------------- #
def _titanic_profile() -> dict:
"""A single-table profile: a PK candidate + a column that looks like a FK."""
return {
"table": "titanic",
"source": "/data/titanic.csv",
"n_rows": 891,
"n_cols": 4,
"key_candidates": ["PassengerId"],
"columns": [
{"name": "PassengerId", "inferred_type": "numeric",
"physical_type": "BIGINT", "distinct_count": 891,
"unique_pct": 1.0, "flags": ["possible_id"]},
{"name": "ticket_id", "inferred_type": "numeric",
"physical_type": "BIGINT", "distinct_count": 681,
"unique_pct": 0.76, "flags": []},
{"name": "fare", "inferred_type": "numeric",
"physical_type": "DOUBLE", "distinct_count": 248,
"unique_pct": 0.28, "flags": []},
{"name": "sex", "inferred_type": "categorical",
"physical_type": "VARCHAR", "distinct_count": 2,
"unique_pct": 0.002, "flags": []},
],
}
def _make_relational_db(path: str) -> None:
"""Create a small DuckDB with customers(id) <- orders(customer_id), real FK."""
con = duckdb.connect(path)
con.execute("CREATE TABLE customers(id INTEGER PRIMARY KEY, name TEXT)")
con.execute(
"CREATE TABLE orders(id INTEGER PRIMARY KEY, "
"customer_id INTEGER REFERENCES customers(id), amount DOUBLE)")
con.execute("INSERT INTO customers VALUES "
"(1,'a'),(2,'b'),(3,'c'),(4,'d'),(5,'e')")
con.execute("INSERT INTO orders VALUES "
"(1,1,10.0),(2,1,20.0),(3,2,30.0),(4,3,40.0),"
"(5,3,50.0),(6,4,60.0),(7,5,70.0),(8,2,80.0)")
con.close()
def _orders_profile() -> dict:
"""A profile for the `orders` table of the relational DB."""
return {
"table": "orders",
"source": "orders",
"n_rows": 8,
"n_cols": 3,
"key_candidates": ["id"],
"columns": [
{"name": "id", "inferred_type": "numeric", "physical_type": "INTEGER",
"distinct_count": 8, "unique_pct": 1.0, "flags": ["possible_id"]},
{"name": "customer_id", "inferred_type": "numeric",
"physical_type": "INTEGER", "distinct_count": 5, "unique_pct": 0.625,
"flags": []},
{"name": "amount", "inferred_type": "numeric", "physical_type": "DOUBLE",
"distinct_count": 8, "unique_pct": 1.0, "flags": []},
],
}
# --------------------------------------------------------------------------- #
# Golden 1 — intra-table.
# --------------------------------------------------------------------------- #
def test_golden_intra_table_pk_and_fk_heuristic():
"""Single table: PK candidate shown; FK heuristic shown (if fn available);
renders to PDF + PPTX with nothing cut."""
prof = _titanic_profile()
glossary = GlossaryCollector()
# No db_path: only the profile-derived layers apply (no declared, no inter).
chapter = build_relaciones(prof, {"glossary": glossary})
assert isinstance(chapter, Chapter)
assert chapter.id == "relaciones"
text = _text_of(chapter)
# PK candidate is always present (comes from the profile).
assert "Candidatos a clave primaria" in text
assert "PassengerId" in text
# Glossary terms got registered.
for key in ("pk", "fk", "cardinalidad"):
assert glossary.has(key)
# FK heuristic layer: present iff the delegated function is importable.
if suggest_intratable_fk_candidates is not None:
assert "Posibles claves foráneas" in text
assert "ticket_id" in text
# The float measure and the PK itself are NOT suggested as FKs.
assert "Posibles FK por nombre" in text
pdf_text, n_slides = _render_both(chapter, "intra")
assert "PassengerId" in pdf_text
assert n_slides >= 1
# --------------------------------------------------------------------------- #
# Golden 2 — inter-table (real DuckDB).
# --------------------------------------------------------------------------- #
def test_golden_inter_table_containment_and_join_graph():
"""Two related tables: declared FK (if fn available) + containment FK
candidate + Mermaid join graph."""
tmp = tempfile.mkdtemp(prefix="relaciones_db_")
db_path = os.path.join(tmp, "shop.duckdb")
_make_relational_db(db_path)
prof = _orders_profile()
glossary = GlossaryCollector()
chapter = build_relaciones(
prof, {"db_path": db_path, "table": "orders", "glossary": glossary})
assert isinstance(chapter, Chapter)
text = _text_of(chapter)
# Inter-table containment FK candidate: customer_id -> customers.id. This path
# uses infer_fk_containment_duckdb + build_join_graph, both already in the
# registry, so it must be present.
assert "Claves foráneas candidatas (inter-tabla)" in text
assert "orders.customer_id" in text
assert "customers.id" in text
# Join graph with a pasteable Mermaid diagram.
assert "Grafo de relaciones" in text
assert "mermaid" in text
assert "graph LR" in text
assert "containment" in text.lower()
# Declared-keys layer: present iff the delegated function is importable.
if detect_declared_keys_duckdb is not None:
assert "Claves declaradas en el esquema" in text
assert "Claves foráneas declaradas" in text
pdf_text, n_slides = _render_both(chapter, "inter")
assert "customer_id" in pdf_text
assert n_slides >= 1
# --------------------------------------------------------------------------- #
# Edges.
# --------------------------------------------------------------------------- #
def test_none_when_no_relations():
"""No key candidates, no FK-looking columns, no db source -> None."""
prof = {
"table": "flat", "n_rows": 100, "n_cols": 2, "key_candidates": [],
"columns": [
{"name": "value", "inferred_type": "numeric", "physical_type": "DOUBLE",
"distinct_count": 50, "unique_pct": 0.5, "flags": []},
{"name": "label", "inferred_type": "categorical",
"physical_type": "VARCHAR", "distinct_count": 3, "unique_pct": 0.03,
"flags": []},
],
}
assert build_relaciones(prof, {}) is None
def test_empty_and_none_profile_do_not_raise():
"""None / {} profile and missing ctx degrade to None without raising."""
assert build_relaciones(None, None) is None
assert build_relaciones({}, {}) is None
assert build_relaciones({}, {"glossary": GlossaryCollector()}) is None
def test_pk_candidate_only_builds_chapter():
"""A profile with only a key candidate (no FK anything, no db) still builds:
the relations chapter applies because there is a PK candidate to report."""
prof = {
"table": "t", "n_rows": 10, "n_cols": 1, "key_candidates": ["row_id"],
"columns": [
{"name": "row_id", "inferred_type": "numeric", "physical_type": "BIGINT",
"distinct_count": 10, "unique_pct": 1.0, "flags": ["possible_id"]},
],
}
chapter = build_relaciones(prof, {})
assert isinstance(chapter, Chapter)
assert "Candidatos a clave primaria" in _text_of(chapter)
@@ -33,7 +33,6 @@ CHAPTER_ORDER = [
"cat_distr", # categorical distributions
"calidad", # data quality
"correlacion", # correlations / associations
"relaciones", # key relations: declared/candidate PK + FK (inter/intra-table)
"modelos", # cheap models (PCA/KMeans/outliers)
"timeseries", # time-series analysis
"geospatial", # geospatial
@@ -4,10 +4,10 @@ name: column_quality_score
kind: function
lang: py
domain: datascience
version: "1.0.0"
version: "2.0.0"
purity: pure
signature: "def column_quality_score(col: dict) -> dict"
description: "Calcula un score de calidad de datos 0-100 para un ColumnProfile del grupo eda, con desglose completeness/validity/consistency y lista de issues legibles. Funcion pura, no muta el input."
description: "Calcula un score de calidad de datos 0-100 para un ColumnProfile del grupo eda. Combina completeness (0.6) y validity (0.4) con renormalizacion por aplicabilidad; los outliers, columnas constantes e ids NO bajan el score (van a observations). Devuelve desglose por dimension, issues (defectos) y observations (señales analiticas). Funcion pura, no muta el input."
tags: [eda, data-quality, profiling, scoring, datascience]
uses_functions: []
uses_types: []
@@ -17,20 +17,26 @@ error_type: ""
imports: []
example: |
from datascience import column_quality_score
col = {"name": "precio", "inferred_type": "float", "null_pct": 0.2,
"unique_pct": 0.4, "flags": [], "numeric": {"outlier_pct": 0.08}}
col = {"name": "precio", "inferred_type": "numeric", "null_pct": 0.2,
"unique_pct": 0.4, "flags": [], "numeric": {"outlier_pct": 8.0}}
column_quality_score(col)
# {"score": 86.8, "completeness": 0.8, "validity": 0.92,
# "consistency": 1.0, "issues": ["20% nulos", "8% outliers"]}
# {"score": 88.0, "completeness": 0.8, "validity": 1.0,
# "applicable": ["completeness", "validity"], "issues": ["20% nulos"],
# "observations": ["8% de valores atípicos (z-score>3): ..."]}
tested: true
tests:
- "test_clean_column_high_score"
- "test_half_null_lowers_completeness_and_score"
- "test_constant_column_flags_issue"
- "test_weights_60_40_native_type"
- "test_outliers_do_not_penalize_score"
- "test_nulls_lower_score_more_than_outliers"
- "test_validity_from_parse_rate_lowers_score"
- "test_validity_from_match_rate"
- "test_free_text_renormalizes_to_completeness_only"
- "test_all_null_column_scores_zero"
- "test_constant_column_scores_full_and_is_observation"
- "test_high_cardinality_id_scores_full_and_is_observation"
- "test_mostly_null_no_double_counts_validity"
- "test_empty_dict_does_not_crash"
- "test_outliers_penalize_validity"
- "test_mostly_null_flag_halves_validity"
- "test_high_cardinality_text_flagged_as_id"
- "test_none_values_treated_defensively"
- "test_does_not_mutate_input"
test_file_path: "python/functions/datascience/column_quality_score_test.py"
@@ -38,16 +44,22 @@ file_path: "python/functions/datascience/column_quality_score.py"
params:
- name: col
desc: >
ColumnProfile dict del grupo eda (p.ej. salida de summarize_table_duckdb).
Se leen sus claves de forma defensiva con .get(...) y se toleran valores
None. Claves usadas: null_pct (0-1), inferred_type, semantic_type,
unique_pct (0-1), flags (list[str], reconoce "constant"/"mostly_null"),
numeric ({outlier_pct: 0-1, ...}|None) y match_rate (opcional, 0-1).
ColumnProfile dict del grupo eda (p.ej. salida de summarize_table_duckdb /
profile_table). Se leen sus claves de forma defensiva con .get(...) y se
toleran valores None. Claves usadas: null_pct (0-1), n_rows, empty_count
(texto), inferred_type, semantic_type, validity_rate (0-1, lo expone
profile_table al promocionar texto a numero/fecha), match_rate (0-1),
unique_pct (0-1), flags (list[str], reconoce
"constant"/"possible_id"/"high_cardinality") y numeric ({outlier_pct: 0-100,
skew, ...}|None).
output: >
dict con score (float 0-100, redondeado a 1 decimal), completeness (0-1),
validity (0-1), consistency (0-1) e issues (list[str] de descripciones
legibles de los problemas detectados). score = round(100 * (0.5*completeness
+ 0.3*validity + 0.2*consistency), 1).
dict con score (float 0-100, 1 decimal), completeness (0-1), validity (0-1 o
None si no aplicable), dimensions ({completeness, validity}), applicable
(list[str] de dimensiones que entraron en el score), issues (list[str] SOLO de
defectos de calidad: nulos, vacios, valores no conformes) y observations
(list[str] de señales analiticas que NO bajan el score: outliers, columna
constante, posible id, asimetria). score = round(100 * (0.6*completeness +
0.4*validity) / pesos_aplicables, 1), renormalizado cuando validity no aplica.
---
## Ejemplo
@@ -59,51 +71,71 @@ from datascience import column_quality_score
col = {
"name": "precio",
"physical_type": "DOUBLE",
"inferred_type": "float",
"inferred_type": "numeric",
"semantic_type": "",
"count": 800,
"n_rows": 1000,
"null_count": 200,
"null_pct": 0.20,
"distinct_count": 400,
"unique_pct": 0.40,
"flags": [],
"numeric": {"outlier_pct": 0.08},
"numeric": {"outlier_pct": 8.0, "skew": 0.3},
"categorical": None,
"datetime": None,
}
column_quality_score(col)
# {
# "score": 86.8,
# "completeness": 0.8, # 1 - 0.20
# "validity": 0.92, # 1 - min(0.08, 0.3)
# "consistency": 1.0,
# "issues": ["20% nulos", "8% outliers"],
# "score": 88.0, # 100 * (0.6*0.8 + 0.4*1.0)
# "completeness": 0.8, # 1 - 0.20
# "validity": 1.0, # numerica nativa: el tipo es conforme
# "dimensions": {"completeness": 0.8, "validity": 1.0},
# "applicable": ["completeness", "validity"],
# "issues": ["20% nulos"], # SOLO defectos de calidad
# "observations": ["8% de valores atípicos (z-score>3): ..."], # NO bajan score
# }
```
## Cuando usarla
Cuando hayas perfilado una tabla con el grupo `eda` (p.ej.
`summarize_table_duckdb`) y necesites un numero 0-100 por columna para
ordenar/priorizar limpieza de datos, pintar semaforos de calidad en un
dashboard, o decidir que columnas descartar antes de modelar. Es la capa de
scoring sobre el ColumnProfile crudo: lee el perfil, no toca los datos.
`summarize_table_duckdb` / `profile_table`) y necesites un numero 0-100 por
columna para ordenar/priorizar limpieza de datos, pintar semaforos de calidad,
o decidir que columnas descartar antes de modelar. Separa los **defectos de
calidad reales** (`issues`: nulos, vacios, valores que no parsean a su tipo) de
las **observaciones analiticas** (`observations`: outliers, columnas constantes,
ids), que se reportan pero no penalizan. Es la capa de scoring sobre el
ColumnProfile crudo: lee el perfil, no toca los datos.
## Notas
## Gotchas
Funcion pura, sin I/O ni dependencias externas, no muta `col`. Lee todas las
claves con `.get(...)` y tolera que vengan en `None` (un ColumnProfile recien
salido de `summarize_table_duckdb` trae muchas claves a `None`), por lo que
nunca falla por claves ausentes — un `{}` produce un resultado bien definido.
Funcion pura, sin I/O, no muta `col`. Aun asi conviene saber:
Pesos del score: completeness 0.5, validity 0.3, consistency 0.2.
- **Los outliers NO bajan el score.** Un valor extremo puede ser real y correcto
(un cliente que compra mucho); detectar atipicos es analisis de la
distribucion, no un juicio de correccion. Salen en `observations`, no en
`issues`. Mismo trato para columnas constantes e identificadores de alta
cardinalidad: son observaciones, no defectos.
- **`validity` puede ser `None`** (no aplicable): texto libre sin `semantic_type`
ni `validity_rate`, o columna 100% nula. En ese caso el score se renormaliza a
solo `completeness` (la columna no se premia ni castiga por algo no medible).
- **`outlier_pct` se interpreta en escala 0-100** (la que emite
`describe_numeric`, z-score>3). Pasar una fraccion 0-1 produce un texto de
observacion con el % equivocado, pero NUNCA afecta al score.
- **`validity_rate` lo puebla `profile_table`** al promocionar una columna de
texto a numero/fecha (fraccion que parsea). Si no esta presente y el tipo es
nativo numerico/fecha/bool, `validity = 1.0`.
- Sin doble conteo: la falta de datos cuenta solo en `completeness` (el antiguo
castigo de `mostly_null` sobre `validity` se elimino).
- **completeness** = `1 - null_pct` (None -> 0 nulls -> 1.0).
- **validity**: parte de 1.0 y penaliza `min(outlier_pct, 0.3)` en columnas
numericas, `0.5 * (1 - match_rate)` si hay `semantic_type` declarado con
`match_rate` bajo disponible, y multiplica por 0.5 si el flag `mostly_null`
esta presente.
- **consistency**: 1.0 salvo flag `constant` (-> 0.3, columna poco informativa)
o texto con `unique_pct > 0.9` (-> 0.6, posible id de alta cardinalidad).
## Capability growth log
- v2.0.0 (2026-06-30) — nueva formula de calidad (report 2046): pesos 60/40
(completeness/validity) con renormalizacion por aplicabilidad; se elimina la
dimension `consistency`-como-informatividad y el doble castigo de
`mostly_null`; los outliers/constantes/ids salen del score a `observations`;
validity mide conformidad real (parse rate / match rate / tipo nativo). Salida
ampliada con `dimensions`, `applicable` y `observations`.
- v1.0.0 — version inicial: pesos 50/30/20 (completeness/validity/consistency),
los outliers penalizaban validity (con bug de escala) y consistency penalizaba
informatividad.
@@ -1,34 +1,78 @@
"""Score de calidad de datos (0-100) para un ColumnProfile del grupo eda.
Funcion pura: dado el perfil de una columna producido por el grupo de
capacidad `eda` (p.ej. summarize_table_duckdb), calcula un score agregado
de calidad junto a su desglose en completeness / validity / consistency y
una lista de issues legibles. No realiza I/O ni muta el input.
capacidad `eda` (p.ej. summarize_table_duckdb / profile_table), calcula un
score agregado de calidad junto a su desglose por dimension y dos listas
legibles separadas: `issues` (defectos de calidad reales que SI bajan el
score) y `observations` (señales analiticas que NO bajan el score). No
realiza I/O ni muta el input.
Modelo (DAMA-DMBOK / ISO 8000), ver report 2046:
- Solo entran en el score las dimensiones medibles automaticamente desde el
perfil, sin fuente externa de verdad: completeness y validity por columna.
- Renormalizacion por aplicabilidad: si una dimension no es medible en la
columna (texto libre sin semantica -> validity no aplica; columna 100% nula
-> validity no medible), se excluye y los pesos se renormalizan sobre las
aplicables. Una columna ni se premia ni se castiga por algo no medible.
- Sin doble conteo: la falta de datos cuenta solo en completeness (se elimino
el antiguo castigo extra de `mostly_null` sobre validity).
- Los OUTLIERS NO bajan la calidad. Un valor extremo puede ser real y
correcto; detectar atipicos es analisis de la distribucion, no un juicio de
coreccion. Outliers, columnas constantes e identificadores de alta
cardinalidad pasan a `observations`, nunca a `issues`.
"""
# Pesos base de las dimensiones de columna (se renormalizan por aplicabilidad).
_W_COMPLETENESS = 0.6
_W_VALIDITY = 0.4
# Tipos inferidos cuyo almacen garantiza la conformidad de tipo (validity=1.0)
# cuando NO vienen de una promocion de texto (en cuyo caso manda validity_rate).
_NATIVE_TYPED = ("numeric", "integer", "float", "datetime", "date", "boolean", "bool")
def column_quality_score(col: dict) -> dict:
"""Calcula un score de calidad de datos 0-100 para un ColumnProfile.
El score pondera tres dimensiones:
- completeness (0.5): proporcion de valores no nulos.
- validity (0.3): ausencia de outliers / heuristicas de validez.
- consistency (0.2): la columna aporta informacion (no constante, no ruido).
El score combina solo dimensiones de calidad medibles desde el perfil, con
renormalizacion por aplicabilidad:
- completeness (peso base 0.6, siempre aplica): proporcion de valores
presentes = 1 - null_pct. En texto, las celdas vacias (`empty_count`)
tambien cuentan como faltantes.
- validity (peso base 0.4, cuando hay un criterio de validacion real):
fraccion de valores no nulos conformes a su tipo/semantica. Tipo nativo
numerico/fecha/bool = 1.0; texto promovido a numero/fecha = parse rate
(`validity_rate`); texto con `semantic_type` regexable = `match_rate`;
texto libre o columna 100% nula = NO aplicable (renormaliza a solo
completeness).
Los outliers, columnas constantes, identificadores y asimetria fuerte NO
bajan el score: se devuelven en `observations`.
Args:
col: ColumnProfile dict del grupo eda. Se leen las claves de forma
defensiva con .get(...) y se tolera que muchas vengan en None.
Claves relevantes: null_pct, inferred_type, semantic_type,
unique_pct, flags (list[str]), numeric ({outlier_pct, ...}|None),
match_rate (opcional).
Claves relevantes: null_pct (0-1), n_rows, empty_count,
inferred_type, semantic_type, validity_rate (0-1, lo expone
profile_table al promocionar texto a numero/fecha), match_rate
(0-1), unique_pct (0-1), flags (list[str], reconoce
"constant"/"possible_id"/"high_cardinality"), numeric
({outlier_pct: 0-100, skew, ...}|None).
Returns:
dict con:
score (float, 0-100, redondeado a 1 decimal),
completeness (float, 0-1),
validity (float, 0-1),
consistency (float, 0-1),
issues (list[str]) descripciones legibles de los problemas.
score (float 0-100, redondeado a 1 decimal),
completeness (float 0-1),
validity (float 0-1 | None si no aplicable),
dimensions ({completeness, validity}),
applicable (list[str] de dimensiones que entraron en el score),
issues (list[str]) SOLO defectos de calidad (nulos, vacios,
valores no conformes a su tipo/semantica),
observations (list[str]) señales analiticas que NO bajan el score
(outliers, columna constante, posible id, asimetria).
"""
if not isinstance(col, dict):
col = {}
@@ -39,103 +83,153 @@ def column_quality_score(col: dict) -> dict:
flags = set(flags)
issues: list[str] = []
observations: list[str] = []
inferred_type = col.get("inferred_type") or ""
semantic_type = col.get("semantic_type") or ""
# --- completeness -------------------------------------------------
null_pct = col.get("null_pct")
if null_pct is None:
null_pct = 0.0
try:
null_pct = float(null_pct)
except (TypeError, ValueError):
null_pct = 0.0
null_pct = _clamp(null_pct, 0.0, 1.0)
# Falta de datos = nulos + (en texto) celdas vacias. Es el unico sitio
# donde la falta de datos cuenta: nunca se duplica en validity.
null_pct = _clamp(_num(col.get("null_pct"), 0.0), 0.0, 1.0)
completeness = 1.0 - null_pct
if null_pct > 0:
issues.append(f"{round(null_pct * 100)}% nulos")
issues.append(f"{_pct(null_pct)} nulos")
# --- validity -----------------------------------------------------
validity = 1.0
inferred_type = col.get("inferred_type") or ""
empty_frac = 0.0
n_rows = col.get("n_rows")
empty_count = col.get("empty_count")
if (
isinstance(n_rows, (int, float)) and not isinstance(n_rows, bool) and n_rows > 0
and isinstance(empty_count, (int, float)) and not isinstance(empty_count, bool)
and empty_count > 0
):
empty_frac = _clamp(float(empty_count) / float(n_rows), 0.0, 1.0)
completeness = _clamp(completeness - empty_frac, 0.0, 1.0)
issues.append(f"{_pct(empty_frac)} vacíos")
numeric = col.get("numeric")
is_numeric = inferred_type in ("integer", "float", "numeric") or isinstance(numeric, dict)
if isinstance(numeric, dict):
outlier_pct = numeric.get("outlier_pct")
if outlier_pct is not None:
try:
outlier_pct = float(outlier_pct)
except (TypeError, ValueError):
outlier_pct = 0.0
outlier_pct = _clamp(outlier_pct, 0.0, 1.0)
if outlier_pct > 0:
penalty = min(outlier_pct, 0.3)
validity -= penalty
issues.append(f"{round(outlier_pct * 100)}% outliers")
# semantic_type declarado pero con baja tasa de match (si la conocemos).
semantic_type = col.get("semantic_type") or ""
match_rate = col.get("match_rate")
if semantic_type and match_rate is not None:
try:
match_rate = float(match_rate)
except (TypeError, ValueError):
match_rate = None
if match_rate is not None:
match_rate = _clamp(match_rate, 0.0, 1.0)
if match_rate < 1.0:
shortfall = 1.0 - match_rate
validity -= 0.5 * shortfall
issues.append(
f"semantic_type '{semantic_type}' con baja coincidencia "
f"({round(match_rate * 100)}%)"
)
if "mostly_null" in flags:
validity *= 0.5
issues.append("mayoritariamente nula")
validity = _clamp(validity, 0.0, 1.0)
# --- consistency --------------------------------------------------
consistency = 1.0
if "constant" in flags:
consistency = 0.3
issues.append("columna constante")
# --- validity (con renormalizacion por aplicabilidad) -------------
# None = no medible -> se excluye del score (no penaliza ni premia).
validity = None
if completeness <= 0.0:
# Columna 100% faltante: no hay valores no nulos sobre los que medir
# conformidad. validity no aplica -> el score sale solo de completeness
# (= 0). Es el peor defecto de calidad posible.
validity = None
else:
unique_pct = col.get("unique_pct")
if unique_pct is not None:
try:
unique_pct = float(unique_pct)
except (TypeError, ValueError):
unique_pct = None
if (
inferred_type == "text"
validity_rate = col.get("validity_rate")
match_rate = col.get("match_rate")
if validity_rate is not None:
# Texto promovido a numero/fecha: parse rate real de la muestra.
v = _num(validity_rate, None)
if v is not None:
validity = _clamp(v, 0.0, 1.0)
if validity < 1.0:
kind = (
"número" if inferred_type == "numeric"
else "fecha" if inferred_type == "datetime"
else inferred_type or "su tipo"
)
issues.append(
f"{_pct(1.0 - validity)} no parsea al tipo {kind}"
)
elif inferred_type in _NATIVE_TYPED:
# Tipo nativo garantizado por el almacen: no hay valores que no
# parseen. validity = 1.0 (no se confunde con tener outliers).
validity = 1.0
elif semantic_type and match_rate is not None:
v = _num(match_rate, None)
if v is not None:
validity = _clamp(v, 0.0, 1.0)
if validity < 1.0:
issues.append(
f"{_pct(1.0 - validity)} no casa con el "
f"formato «{semantic_type}»"
)
else:
# Texto libre / categorica sin semantica: no hay criterio honesto
# de validez. No aplica.
validity = None
# --- observations (NO bajan el score) -----------------------------
numeric = col.get("numeric")
if isinstance(numeric, dict):
# outlier_pct viene en escala 0-100 desde describe_numeric (z-score>3).
outlier_pct = _num(numeric.get("outlier_pct"), None)
if outlier_pct is not None and outlier_pct >= 0.05:
observations.append(
f"{_pct(outlier_pct / 100.0)} de valores atípicos (z-score>3): "
"revisar si son errores u observaciones legítimas"
)
skew = _num(numeric.get("skew"), None)
if skew is not None and abs(skew) >= 1.0:
observations.append(
f"asimetría fuerte (skew={round(skew, 2)}): considerar "
"re-expresión antes de modelar"
)
if "constant" in flags:
observations.append(
"columna constante: aporta poca información para el análisis"
)
unique_pct = _num(col.get("unique_pct"), None)
is_id = (
"possible_id" in flags
or "high_cardinality" in flags
or (
inferred_type in ("text", "categorical")
and unique_pct is not None
and _clamp(unique_pct, 0.0, 1.0) > 0.9
):
consistency = 0.6
issues.append("posible id de alta cardinalidad")
consistency = _clamp(consistency, 0.0, 1.0)
# --- score agregado ----------------------------------------------
score = round(
100.0 * (0.5 * completeness + 0.3 * validity + 0.2 * consistency),
1,
)
)
if is_id:
observations.append(
"valores casi únicos: posible identificador (no es un defecto de calidad)"
)
# Silencia warnings sobre la variable de tipo no usada.
_ = is_numeric
# --- score agregado con renormalizacion ---------------------------
applicable = ["completeness"]
num = _W_COMPLETENESS * completeness
den = _W_COMPLETENESS
if validity is not None:
applicable.append("validity")
num += _W_VALIDITY * validity
den += _W_VALIDITY
score = round(100.0 * num / den, 1) if den > 0 else 0.0
return {
"score": score,
"completeness": completeness,
"validity": validity,
"consistency": consistency,
"dimensions": {"completeness": completeness, "validity": validity},
"applicable": applicable,
"issues": issues,
"observations": observations,
}
def _pct(frac: float) -> str:
"""Formatea una fraccion 0-1 como porcentaje honesto: «N%» si >=1%, «0.N%»
por debajo (para no mostrar «0%» cuando hay un defecto real pequeño)."""
p = frac * 100.0
if p >= 1.0:
return f"{round(p)}%"
return f"{p:.1f}%"
def _num(x, default):
"""Convierte x a float; devuelve `default` si es None o no parseable."""
if x is None:
return default
if isinstance(x, bool):
return default
try:
return float(x)
except (TypeError, ValueError):
return default
def _clamp(x: float, lo: float, hi: float) -> float:
"""Recorta x al rango [lo, hi]."""
if x < lo:
@@ -1,4 +1,12 @@
"""Tests para column_quality_score."""
"""Tests para column_quality_score (nueva fórmula, report 2046).
Verifica las invariantes de la fórmula de calidad:
- completeness (0.6) + validity (0.4) con renormalización por aplicabilidad.
- Los OUTLIERS no bajan el score (van a observations, no a issues).
- Columnas constantes e ids no bajan el score (observations).
- Sin doble conteo de la falta de datos.
- all-null -> score 0; función pura (no muta el input).
"""
import os
import sys
@@ -9,11 +17,11 @@ from column_quality_score import column_quality_score
def _clean_numeric_col() -> dict:
"""ColumnProfile de una columna numerica sana, sin problemas."""
"""ColumnProfile de una columna numérica nativa sana, sin problemas."""
return {
"name": "edad",
"physical_type": "INTEGER",
"inferred_type": "integer",
"inferred_type": "numeric",
"semantic_type": "",
"count": 1000,
"n_rows": 1000,
@@ -28,85 +36,163 @@ def _clean_numeric_col() -> dict:
}
# --------------------------------------------------------------------------- #
# Golden
# --------------------------------------------------------------------------- #
def test_clean_column_high_score():
out = column_quality_score(_clean_numeric_col())
assert out["score"] > 90
assert out["score"] == 100.0
assert out["completeness"] == 1.0
assert out["validity"] == 1.0
assert out["consistency"] == 1.0
assert out["applicable"] == ["completeness", "validity"]
assert out["issues"] == []
assert out["observations"] == []
def test_half_null_lowers_completeness_and_score():
def test_weights_60_40_native_type():
"""30% nulos en numérica nativa: score = 100*(0.6*0.7 + 0.4*1.0) = 82."""
col = _clean_numeric_col()
col["null_count"] = 500
col["null_pct"] = 0.5
clean_score = column_quality_score(_clean_numeric_col())["score"]
col["null_pct"] = 0.30
col["null_count"] = 300
out = column_quality_score(col)
assert out["completeness"] == 0.5
assert out["score"] < clean_score
assert any("nulos" in issue for issue in out["issues"])
assert out["completeness"] == 0.7
assert out["validity"] == 1.0
assert out["score"] == 82.0
assert any("nulos" in i for i in out["issues"])
def test_constant_column_flags_issue():
# --------------------------------------------------------------------------- #
# Outliers FUERA del score
# --------------------------------------------------------------------------- #
def test_outliers_do_not_penalize_score():
"""Columna con outliers pero sin nulos -> score máximo; outliers en observations."""
col = _clean_numeric_col()
col["numeric"] = {"outlier_pct": 18.0, "skew": 0.2} # 18% atípicos (escala 0-100)
out = column_quality_score(col)
assert out["score"] == 100.0 # los outliers NO bajan la calidad
assert out["validity"] == 1.0
# No aparecen como problema de calidad...
assert not any("atípic" in i or "outlier" in i for i in out["issues"])
# ...sino como observación analítica.
assert any("atípic" in o for o in out["observations"])
def test_nulls_lower_score_more_than_outliers():
"""Vacíos sí penalizan; outliers no: comparar las dos columnas."""
con_nulos = _clean_numeric_col()
con_nulos["null_pct"] = 0.30
con_outliers = _clean_numeric_col()
con_outliers["numeric"] = {"outlier_pct": 30.0}
assert column_quality_score(con_nulos)["score"] < \
column_quality_score(con_outliers)["score"]
# --------------------------------------------------------------------------- #
# Validity: aplicabilidad y renormalización
# --------------------------------------------------------------------------- #
def test_validity_from_parse_rate_lowers_score():
"""Numérica como texto con 20% basura: validity=0.8 -> score=92."""
col = {
"name": "precio_txt", "inferred_type": "numeric", "semantic_type": "decimal",
"null_pct": 0.0, "validity_rate": 0.80, "flags": [], "numeric": None,
}
out = column_quality_score(col)
assert out["validity"] == 0.8
assert out["score"] == 92.0 # 100*(0.6 + 0.4*0.8)
assert any("no parsea" in i for i in out["issues"])
def test_validity_from_match_rate():
"""Texto con semantic_type y 5% no conforme: validity=0.95."""
col = {
"name": "email", "inferred_type": "text", "semantic_type": "email",
"null_pct": 0.0, "match_rate": 0.95, "unique_pct": 0.5, "flags": [],
}
out = column_quality_score(col)
assert out["validity"] == 0.95
assert out["score"] == 98.0 # 100*(0.6 + 0.4*0.95)
assert any("no casa" in i for i in out["issues"])
def test_free_text_renormalizes_to_completeness_only():
"""Texto libre sin semántica: validity no aplica -> score = 100*completeness."""
col = {
"name": "comentario", "inferred_type": "text", "semantic_type": "",
"null_pct": 0.30, "unique_pct": 0.5, "flags": [], "numeric": None,
}
out = column_quality_score(col)
assert out["validity"] is None
assert out["applicable"] == ["completeness"]
assert out["completeness"] == 0.7
assert out["score"] == 70.0 # renormalizado a solo completeness
# --------------------------------------------------------------------------- #
# Casos límite (report §4.6)
# --------------------------------------------------------------------------- #
def test_all_null_column_scores_zero():
col = _clean_numeric_col()
col["null_pct"] = 1.0
col["null_count"] = 1000
out = column_quality_score(col)
assert out["completeness"] == 0.0
assert out["validity"] is None # no medible sin valores no nulos
assert out["score"] == 0.0
def test_constant_column_scores_full_and_is_observation():
"""Columna constante: dato válido y completo -> score 100; baja info = observación."""
col = _clean_numeric_col()
col["flags"] = ["constant"]
col["distinct_count"] = 1
col["unique_pct"] = 0.001
out = column_quality_score(col)
assert out["consistency"] == 0.3
assert any("constante" in issue for issue in out["issues"])
assert out["score"] == 100.0 # NO se castiga la baja informatividad
assert not any("constante" in i for i in out["issues"])
assert any("constante" in o for o in out["observations"])
def test_high_cardinality_id_scores_full_and_is_observation():
"""Id de alta cardinalidad: unicidad perfecta -> score 100; posible id = observación."""
col = {
"name": "uuid", "inferred_type": "text", "semantic_type": "",
"null_pct": 0.0, "unique_pct": 0.99, "flags": ["possible_id"],
"numeric": None,
}
out = column_quality_score(col)
assert out["score"] == 100.0
assert not any("identificador" in i for i in out["issues"])
assert any("identificador" in o for o in out["observations"])
def test_mostly_null_no_double_counts_validity():
"""85% nulos: solo completeness penaliza; validity nativa sigue 1.0 (sin doble castigo)."""
col = _clean_numeric_col()
col["null_pct"] = 0.85
col["flags"] = ["mostly_null"]
out = column_quality_score(col)
assert out["validity"] == 1.0 # ya no se multiplica por 0.5
# score = 100*(0.6*0.15 + 0.4*1.0) = 49
assert out["score"] == 49.0
assert not any("mayoritariamente" in o for o in out["observations"])
# --------------------------------------------------------------------------- #
# Robustez
# --------------------------------------------------------------------------- #
def test_empty_dict_does_not_crash():
out = column_quality_score({})
assert isinstance(out["score"], float)
assert out["completeness"] == 1.0
assert 0.0 <= out["score"] <= 100.0
assert isinstance(out["issues"], list)
def test_outliers_penalize_validity():
col = _clean_numeric_col()
col["numeric"] = {"outlier_pct": 0.2}
out = column_quality_score(col)
assert out["validity"] < 1.0
assert any("outliers" in issue for issue in out["issues"])
def test_mostly_null_flag_halves_validity():
col = _clean_numeric_col()
col["null_pct"] = 0.85
col["flags"] = ["mostly_null"]
out = column_quality_score(col)
assert out["validity"] == 0.5
assert any("mayoritariamente nula" in issue for issue in out["issues"])
def test_high_cardinality_text_flagged_as_id():
col = {
"name": "uuid",
"inferred_type": "text",
"semantic_type": "",
"null_pct": 0.0,
"unique_pct": 0.99,
"flags": [],
"numeric": None,
}
out = column_quality_score(col)
assert out["consistency"] < 1.0
assert any("alta cardinalidad" in issue for issue in out["issues"])
assert isinstance(out["observations"], list)
def test_none_values_treated_defensively():
col = {
"name": "x",
"inferred_type": None,
"semantic_type": None,
"null_pct": None,
"unique_pct": None,
"flags": None,
"numeric": None,
"name": "x", "inferred_type": None, "semantic_type": None,
"null_pct": None, "unique_pct": None, "flags": None, "numeric": None,
}
out = column_quality_score(col)
assert out["completeness"] == 1.0
@@ -1,107 +0,0 @@
---
name: detect_declared_keys_duckdb
kind: function
lang: py
domain: datascience
version: "1.0.0"
purity: impure
signature: "def detect_declared_keys_duckdb(db_path: str, table: str = None) -> dict"
description: "Detecta las claves DECLARADAS (constraints reales) de un schema DuckDB leyendo la table function duckdb_constraints(): extrae PRIMARY KEY, FOREIGN KEY y UNIQUE (ignora NOT NULL y CHECK) y las devuelve normalizadas con sus columnas, y para las FK con su tabla y columnas referenciadas. Con table=None procesa todas las tablas; con table='X' filtra a PK/UNIQUE de X y a FK cuyo origen es X (case-sensitive). A diferencia de infer_fk_containment_duckdb (que INFIERE FKs candidatas por containment de valores cuando el schema no las declara), esta funcion devuelve las relaciones de clave REALES del schema. Estilo dict-no-throw: nunca lanza. Parte del grupo eda (relaciones de clave)."
tags: [eda, duckdb, datascience, relations, primary-key, foreign-key, schema, exploratory-data-analysis]
params:
- name: db_path
desc: "Ruta al archivo DuckDB. Debe existir (lectura read-only via duckdb_query_readonly; no se crea). Un path inexistente devuelve {status:'error', ...}."
- name: table
desc: "Si se pasa, filtra los resultados a esa tabla: incluye PRIMARY KEY y UNIQUE cuya tabla sea `table`, y FOREIGN KEY cuya tabla ORIGEN sea `table` (no la referenciada). None (default) devuelve los constraints de todas las tablas. La comparacion es case-sensitive (nombres tal cual los devuelve DuckDB)."
output: "dict dict-no-throw. En exito {status:'ok', primary_keys:[{table:str, columns:[str,...]}, ...], foreign_keys:[{table:str, columns:[str,...], referenced_table:str, referenced_columns:[str,...]}, ...], unique:[{table:str, columns:[str,...]}, ...], tables:[str,...]} donde tables es la lista ordenada de tablas (origen) que poseen al menos un constraint PK/FK/UNIQUE emitido. Solo se emiten constraints de clave: NOT NULL y CHECK se ignoran. En error {status:'error', error:str}."
uses_functions: [duckdb_query_readonly_py_infra]
uses_types: []
returns: []
returns_optional: false
error_type: "error_go_core"
imports: []
tested: true
tests: ["test_golden_detecta_pks_y_fk", "test_golden_ignora_not_null_y_check", "test_edge_filtra_por_tabla_orders", "test_edge_filtra_por_tabla_customers", "test_edge_unique_declarado", "test_edge_sin_constraints_listas_vacias", "test_error_db_inexistente_no_lanza", "test_shape_resultado"]
test_file_path: "python/functions/datascience/detect_declared_keys_duckdb_test.py"
file_path: "python/functions/datascience/detect_declared_keys_duckdb.py"
---
## Ejemplo
```python
import sys, os, duckdb
sys.path.insert(0, os.path.join("python", "functions"))
from datascience import detect_declared_keys_duckdb
# Base de ejemplo en /tmp: orders.customer_id -> customers.id (FK declarada)
path = "/tmp/declared_keys_demo.duckdb"
if os.path.exists(path):
os.remove(path)
con = duckdb.connect(path)
con.execute("CREATE TABLE customers(id INTEGER PRIMARY KEY, name TEXT)")
con.execute(
"CREATE TABLE orders("
" id INTEGER PRIMARY KEY,"
" customer_id INTEGER REFERENCES customers(id),"
" amt DOUBLE)"
)
con.close()
res = detect_declared_keys_duckdb(path)
if res["status"] == "ok":
for pk in res["primary_keys"]:
print(f"PK {pk['table']}({', '.join(pk['columns'])})")
for fk in res["foreign_keys"]:
print(f"FK {fk['table']}({', '.join(fk['columns'])}) -> "
f"{fk['referenced_table']}({', '.join(fk['referenced_columns'])})")
# PK customers(id)
# PK orders(id)
# FK orders(customer_id) -> customers(id)
else:
print("error:", res["error"])
# Filtrar a una tabla concreta (PK/UNIQUE de orders + FK con origen orders):
solo_orders = detect_declared_keys_duckdb(path, table="orders")
print(solo_orders["tables"]) # ['orders']
```
## Cuando usarla
- Cuando exploras un esquema DuckDB y quieres mostrar las relaciones de clave REALES (PK/FK/UNIQUE) que el schema ha declarado, sin inferir nada.
- Como paso del capitulo RELACIONES del grupo `eda`: primero mira las claves declaradas con esta funcion; si el schema no declara FKs, complementa con `infer_fk_containment_duckdb` (inferencia por containment).
- Antes de documentar o migrar un esquema, para listar el contrato de integridad referencial que el motor ya conoce.
- Para validar que las constraints que esperas (esa FK que creaste con `REFERENCES`) realmente estan declaradas en la base materializada.
## Gotchas
- **Impura**: lee de disco via la primitiva read-only `duckdb_query_readonly` (no crea ni modifica la base). El `db_path` debe existir; un path inexistente devuelve `{status:'error'}` (read_only NO crea la base).
- **Requiere `duckdb_constraints()`**: usa la table function `duckdb_constraints()`, disponible en DuckDB modernos (verificado en 1.5.2). En versiones antiguas sin esa funcion, la query falla y se devuelve `{status:'error'}`.
- **Solo claves DECLARADAS**: devuelve lo que el schema declaro con `PRIMARY KEY` / `FOREIGN KEY (... REFERENCES ...)` / `UNIQUE`. Una tabla materializada con `CREATE TABLE AS SELECT` NO lleva constraints — para esos casos no habra claves que mostrar y hay que INFERIRLAS (`infer_fk_containment_duckdb`).
- **NOT NULL y CHECK se ignoran**: `duckdb_constraints()` tambien emite filas `NOT NULL` (DuckDB genera una por cada columna PK) y `CHECK`; esta funcion las descarta y solo conserva PK/FK/UNIQUE.
- **Nombres case-sensitive**: el filtro `table='Orders'` no casa con una tabla `orders`. Se comparan los nombres tal cual los devuelve DuckDB.
- **FK atribuida al origen**: una FOREIGN KEY se atribuye a su tabla ORIGEN (el `table` de la entrada), no a la referenciada. El filtro `table='X'` trae las FK cuyo origen es X, no las que apuntan a X.
- **`tables` = tablas dueñas de constraints emitidos**: la lista `tables` contiene solo las tablas que poseen al menos un PK/FK/UNIQUE en el resultado (su campo `table`), ordenadas. No incluye tablas referenciadas que no tengan constraint propio en la salida.
- **Columnas como listas**: `constraint_column_names` y `referenced_column_names` son columnas LIST de DuckDB; en 1.5.2 llegan como listas Python. La funcion las normaliza a listas de strings con una red de seguridad por si llegaran como string.
## Notas
`duckdb_constraints()` devuelve una fila por constraint con los campos
`table_name`, `constraint_type`, `constraint_column_names`, `referenced_table`,
`referenced_column_names`. Mapeo a la salida:
```text
PRIMARY KEY -> primary_keys[]: {table, columns}
UNIQUE -> unique[]: {table, columns}
FOREIGN KEY -> foreign_keys[]: {table, columns, referenced_table, referenced_columns}
NOT NULL -> ignorado
CHECK -> ignorado
```
Para una FK, `referenced_table` y `referenced_column_names` vienen poblados; para
PK/UNIQUE, `referenced_table` es NULL y `referenced_column_names` una lista vacia.
Complementa a `infer_fk_containment_duckdb`: esta funcion devuelve las relaciones
de clave REALES del schema (declaradas); la otra INFIERE FKs candidatas por
containment de valores cuando el schema no las declaro. En el capitulo RELACIONES
de AutomaticEDA se usan en orden: primero las declaradas, luego la inferencia como
respaldo.
@@ -1,127 +0,0 @@
"""detect_declared_keys_duckdb — lee las claves DECLARADAS de un schema DuckDB.
Funcion impura: lee de disco a traves de la primitiva read-only del grupo
`duckdb` (duckdb_query_readonly). Pertenece al grupo de capacidad `eda`
(relaciones de clave): a diferencia de infer_fk_containment_duckdb, que INFIERE
FOREIGN KEYs candidatas por containment de valores, esta funcion devuelve las
constraints REALES que el schema ha declarado (PRIMARY KEY / FOREIGN KEY /
UNIQUE) leyendo la table function `duckdb_constraints()`.
Es la pieza del capitulo RELACIONES de AutomaticEDA que muestra las relaciones de
clave reales cuando existen — frente a la inferencia, que se usa cuando el schema
no las declaro.
Estilo dict-no-throw del grupo duckdb: nunca lanza; captura cualquier error y
devuelve {status:'error', error:str}.
"""
from infra import duckdb_query_readonly
def _as_list(value) -> list:
"""Normaliza el valor de una columna LIST de DuckDB a una lista de strings.
En DuckDB 1.5.2, `constraint_column_names` y `referenced_column_names` llegan
ya como listas Python a traves de duckdb_query_readonly. Este helper es solo
una red de seguridad: si por cualquier motivo llegara como string (p.ej. la
representacion `[id, customer_id]`), la parsea de forma defensiva.
"""
if value is None:
return []
if isinstance(value, (list, tuple)):
return [str(v) for v in value]
if isinstance(value, str):
s = value.strip()
if s.startswith("[") and s.endswith("]"):
s = s[1:-1]
if not s.strip():
return []
return [
part.strip().strip("'\"")
for part in s.split(",")
if part.strip().strip("'\"")
]
return [str(value)]
def detect_declared_keys_duckdb(db_path: str, table: str = None) -> dict:
"""Detecta las claves PRIMARY KEY / FOREIGN KEY / UNIQUE declaradas en DuckDB.
Lee la table function `duckdb_constraints()` y extrae solo las constraints de
clave (PRIMARY KEY, FOREIGN KEY, UNIQUE), ignorando NOT NULL y CHECK.
Args:
db_path: ruta al archivo DuckDB. Debe existir (lectura read-only; no se
crea). Un path inexistente devuelve {status:'error', ...} sin lanzar.
table: si se pasa, filtra los resultados a esa tabla: incluye PRIMARY KEY
y UNIQUE cuya tabla sea `table`, y FOREIGN KEY cuya tabla ORIGEN sea
`table`. None (default) devuelve los constraints de todas las tablas.
La comparacion de nombres es case-sensitive (tal cual los devuelve
DuckDB).
Returns:
dict dict-no-throw. En exito:
{status:'ok',
primary_keys:[{table:str, columns:[str, ...]}, ...],
foreign_keys:[{table:str, columns:[str, ...],
referenced_table:str,
referenced_columns:[str, ...]}, ...],
unique:[{table:str, columns:[str, ...]}, ...],
tables:[str, ...]} # tablas (origen) con algun PK/FK/UNIQUE emitido
En error (sin lanzar): {status:'error', error:str}.
"""
try:
sql = (
"SELECT table_name, constraint_type, constraint_column_names, "
"referenced_table, referenced_column_names FROM duckdb_constraints()"
)
res = duckdb_query_readonly(db_path, sql)
if res["status"] != "ok":
return {"status": "error", "error": res["error"]}
primary_keys = []
foreign_keys = []
unique = []
tables = set()
for row in res["rows"]:
ctype = row["constraint_type"]
tname = row["table_name"]
# Filtro por tabla origen: para PK/FK/UNIQUE el dueño del constraint es
# `table_name`. Una FK se atribuye a su tabla origen (no a la
# referenciada), igual que el filtro pide.
if table is not None and tname != table:
continue
cols = _as_list(row["constraint_column_names"])
if ctype == "PRIMARY KEY":
primary_keys.append({"table": tname, "columns": cols})
tables.add(tname)
elif ctype == "UNIQUE":
unique.append({"table": tname, "columns": cols})
tables.add(tname)
elif ctype == "FOREIGN KEY":
foreign_keys.append(
{
"table": tname,
"columns": cols,
"referenced_table": row["referenced_table"],
"referenced_columns": _as_list(
row["referenced_column_names"]
),
}
)
tables.add(tname)
# NOT NULL y CHECK se ignoran: no son relaciones de clave.
return {
"status": "ok",
"primary_keys": primary_keys,
"foreign_keys": foreign_keys,
"unique": unique,
"tables": sorted(tables),
}
except Exception as e: # noqa: BLE001
return {"status": "error", "error": str(e)}
@@ -1,167 +0,0 @@
"""Tests para detect_declared_keys_duckdb."""
import duckdb
import pytest
from .detect_declared_keys_duckdb import detect_declared_keys_duckdb
@pytest.fixture
def db(tmp_path):
"""DuckDB temporal con claves declaradas.
- customers(id PRIMARY KEY, name)
- orders(id PRIMARY KEY, customer_id REFERENCES customers(id), amt)
Esto declara dos PRIMARY KEY (customers.id, orders.id) y una FOREIGN KEY
(orders.customer_id -> customers.id). DuckDB ademas genera constraints
NOT NULL para las columnas PK, que la funcion debe ignorar.
"""
path = str(tmp_path / "keys_test.duckdb")
con = duckdb.connect(path)
con.execute("CREATE TABLE customers(id INTEGER PRIMARY KEY, name TEXT)")
con.execute(
"CREATE TABLE orders("
" id INTEGER PRIMARY KEY,"
" customer_id INTEGER REFERENCES customers(id),"
" amt DOUBLE"
")"
)
con.close()
return path
def _pk_for(res, table):
"""Devuelve la entrada primary_keys cuya tabla es `table`, o None."""
for pk in res["primary_keys"]:
if pk["table"] == table:
return pk
return None
def test_golden_detecta_pks_y_fk(db):
"""Golden: detecta las dos PK y la FK declaradas, con valores concretos."""
res = detect_declared_keys_duckdb(db)
assert res["status"] == "ok"
# PRIMARY KEY de customers y de orders.
pk_customers = _pk_for(res, "customers")
pk_orders = _pk_for(res, "orders")
assert pk_customers is not None
assert pk_customers["columns"] == ["id"]
assert pk_orders is not None
assert pk_orders["columns"] == ["id"]
# FOREIGN KEY orders.customer_id -> customers.id.
assert len(res["foreign_keys"]) == 1
fk = res["foreign_keys"][0]
assert fk["table"] == "orders"
assert fk["columns"] == ["customer_id"]
assert fk["referenced_table"] == "customers"
assert fk["referenced_columns"] == ["id"]
# tables incluye ambas (origen de algun constraint).
assert res["tables"] == ["customers", "orders"]
def test_golden_ignora_not_null_y_check(db):
"""NOT NULL (auto-generado por las PK) no aparece como clave."""
res = detect_declared_keys_duckdb(db)
assert res["status"] == "ok"
# Solo 2 PK reales (no las NOT NULL que DuckDB genera por cada columna PK).
assert len(res["primary_keys"]) == 2
# No hay UNIQUE declarado en este schema.
assert res["unique"] == []
def test_edge_filtra_por_tabla_orders(db):
"""Edge table='orders': PK de orders + su FK; NO la PK de customers."""
res = detect_declared_keys_duckdb(db, table="orders")
assert res["status"] == "ok"
# Solo la PK de orders.
assert len(res["primary_keys"]) == 1
assert res["primary_keys"][0]["table"] == "orders"
assert res["primary_keys"][0]["columns"] == ["id"]
# La PK de customers NO esta.
assert _pk_for(res, "customers") is None
# La FK de orders si esta (origen = orders).
assert len(res["foreign_keys"]) == 1
assert res["foreign_keys"][0]["table"] == "orders"
assert res["foreign_keys"][0]["referenced_table"] == "customers"
# tables solo contiene orders (la dueña de los constraints emitidos).
assert res["tables"] == ["orders"]
def test_edge_filtra_por_tabla_customers(db):
"""Edge table='customers': solo su PK; ninguna FK (orders queda fuera)."""
res = detect_declared_keys_duckdb(db, table="customers")
assert res["status"] == "ok"
assert len(res["primary_keys"]) == 1
assert res["primary_keys"][0]["table"] == "customers"
assert res["foreign_keys"] == []
assert res["tables"] == ["customers"]
def test_edge_unique_declarado(tmp_path):
"""Edge: una constraint UNIQUE declarada aparece en `unique`."""
path = str(tmp_path / "unique_test.duckdb")
con = duckdb.connect(path)
con.execute("CREATE TABLE products(sku INTEGER UNIQUE, name TEXT)")
con.close()
res = detect_declared_keys_duckdb(path)
assert res["status"] == "ok"
assert len(res["unique"]) == 1
assert res["unique"][0]["table"] == "products"
assert res["unique"][0]["columns"] == ["sku"]
assert res["primary_keys"] == []
assert res["foreign_keys"] == []
assert res["tables"] == ["products"]
def test_edge_sin_constraints_listas_vacias(tmp_path):
"""Edge: tabla sin PK/FK/UNIQUE -> todas las listas vacias, status ok."""
path = str(tmp_path / "no_keys.duckdb")
con = duckdb.connect(path)
con.execute("CREATE TABLE log(a INTEGER, b INTEGER)")
con.close()
res = detect_declared_keys_duckdb(path)
assert res["status"] == "ok"
assert res["primary_keys"] == []
assert res["foreign_keys"] == []
assert res["unique"] == []
assert res["tables"] == []
def test_error_db_inexistente_no_lanza(tmp_path):
"""Error: db_path inexistente -> status error, sin lanzar excepcion."""
path = str(tmp_path / "does_not_exist.duckdb")
res = detect_declared_keys_duckdb(path)
assert res["status"] == "error"
assert isinstance(res["error"], str)
assert res["error"] != ""
def test_shape_resultado(db):
"""El retorno tiene exactamente las claves esperadas."""
res = detect_declared_keys_duckdb(db)
assert set(res.keys()) == {
"status",
"primary_keys",
"foreign_keys",
"unique",
"tables",
}
for pk in res["primary_keys"]:
assert set(pk.keys()) == {"table", "columns"}
for fk in res["foreign_keys"]:
assert set(fk.keys()) == {
"table",
"columns",
"referenced_table",
"referenced_columns",
}
@@ -1,91 +0,0 @@
---
name: suggest_intratable_fk_candidates
kind: function
lang: py
domain: datascience
version: "1.0.0"
purity: pure
signature: "def suggest_intratable_fk_candidates(profile: dict, max_candidates: int = 20) -> list"
description: "Sobre el TableProfile de UNA tabla (el dict de profile_table), sugiere por heuristica de nombre + cardinalidad que columnas PARECEN una clave foranea hacia otra tabla, cuando no hay relaciones inter-tabla que medir (una sola tabla). Es una SUGERENCIA, no una afirmacion: el ref_table_guess es el stem del nombre (customer_id -> customer) y NO confirma containment. Pura: solo lee el dict, sin I/O; nunca lanza (devuelve [])."
tags: [eda, datascience, relationships, foreign-key, fk, heuristic, schema, python]
uses_functions: []
uses_types: []
returns: []
returns_optional: false
error_type: ""
imports: []
params:
- name: profile
desc: "TableProfile (dict que produce profile_table / summarize_table_*). Se leen de forma defensiva `columns` (lista de ColumnProfile con name/inferred_type/physical_type/distinct_count/unique_pct/flags), `n_rows` (int) y `key_candidates` (lista de nombres de columna ya candidatos a PK, que se excluyen). Si no es dict o no trae columns -> []."
- name: max_candidates
desc: "Tope de sugerencias devueltas (default 20). Las columnas candidatas se ordenan por distinct_count descendente (mas informativas primero) antes de cortar a este maximo."
output: "list (posiblemente vacia) de dicts, uno por columna sugerida, con claves: `column` (nombre), `ref_table_guess` (tabla conjeturada por el stem del nombre, p.ej. customer_id -> 'customer'), `reason` (frase humana que deja claro que es heuristica sin confirmar containment), `distinct_count` (int|None), `unique_pct` (float|None, fraccion 0-1 tal como viene del profile), `inferred_type` (str), `physical_type` (str). Nunca lanza."
tested: true
tests: ["test_golden_customer_id_detectado_otras_no", "test_camelcase_albumid_detectado", "test_constante_status_id_no_aparece", "test_profile_vacio_y_none_devuelven_lista_vacia", "test_category_id_casi_unico_parece_pk_no_aparece", "test_ref_table_guess_multitoken_y_orden_por_distinct", "test_max_candidates_corta_la_lista", "test_id_generico_solo_nunca_es_fk"]
test_file_path: "python/functions/datascience/suggest_intratable_fk_candidates_test.py"
file_path: "python/functions/datascience/suggest_intratable_fk_candidates.py"
---
## Ejemplo
```python
from datascience import suggest_intratable_fk_candidates
# TableProfile de UNA tabla (tipo titanic): customer_id es FK N:1; id es la PK;
# amount es una medida float; name es categorica sin sufijo de id.
profile = {
"n_rows": 891,
"key_candidates": ["id"],
"columns": [
{"name": "id", "inferred_type": "numeric", "physical_type": "BIGINT",
"distinct_count": 891, "unique_pct": 1.0, "flags": ["possible_id"]},
{"name": "customer_id", "inferred_type": "numeric", "physical_type": "BIGINT",
"distinct_count": 137, "unique_pct": 0.15, "flags": []},
{"name": "amount", "inferred_type": "numeric", "physical_type": "DOUBLE",
"distinct_count": 400, "unique_pct": 0.45, "flags": []},
{"name": "name", "inferred_type": "categorical", "physical_type": "VARCHAR",
"distinct_count": 700, "unique_pct": 0.78, "flags": []},
],
}
out = suggest_intratable_fk_candidates(profile)
[c["column"] for c in out] # -> ["customer_id"]
out[0]["ref_table_guess"] # -> "customer"
out[0]["reason"]
# -> "el nombre termina en '_id' y es N:1 (137 valores distintos < 891 filas):
# parece (heuristica por nombre, sin confirmar containment) una referencia a
# una tabla «customer»"
```
## Cuando usarla
Cuando el EDA tiene SOLO UNA tabla y, por tanto, no se puede inferir una FK
inter-tabla por containment (no hay otra tabla cuyos valores contener). Es el plan B
del capitulo RELACIONES de AutomaticEDA: en vez de medir solapamiento de valores
entre tablas (lo correcto cuando hay varias, ver `infer_fk_containment_duckdb` /
`build_join_graph`), conjetura por el NOMBRE de la columna (`<algo>_id`) y por su
CARDINALIDAD N:1 que columnas parecen apuntar a una entidad externa. Usala para
enriquecer el reporte con "estas columnas parecen referencias a otras tablas" sin
prometer que esa tabla exista. NO la uses si tienes varias tablas: ahi mide
containment de verdad.
## Gotchas
- Es **heuristica**, no una verdad: produce **falsos positivos** (una columna
`period_id` que en realidad es un codigo libre, no una FK) y **falsos negativos**
(una FK que no se llama `*_id`, p.ej. `parent`, `owner`, `sku`). No la trates como
una afirmacion de esquema.
- `ref_table_guess` es una **conjetura por el nombre** (el stem sin el sufijo id):
`customer_id` -> `customer`, `AlbumId` -> `album`, `manager_staff_id` ->
`manager_staff`. Puede no coincidir con el nombre real de la tabla (plurales,
prefijos, alias). Es una pista, no un join garantizado.
- **NO confirma containment**: no comprueba que los valores de la columna existan en
ninguna otra tabla (no puede — solo recibe el perfil de una tabla). Para confirmar
una FK real con varias tablas usa `infer_fk_containment_duckdb`.
- Excluye deliberadamente: el `id`/`Id`/`ID` generico a secas (suele ser la PK
propia, no una referencia), las columnas constantes, las que parecen unicas
(`unique_pct >= 0.99`, mas PK que FK) y los tipos no-clave (float/decimal son
medidas; date/time/timestamp y boolean no son claves). En camelCase, `paid`,
`valid`, `grid` (con `id` en minuscula y sin separador) NO se confunden con FK.
- `unique_pct` se interpreta como **fraccion 0-1** (tal como la emite el profile), no
como porcentaje 0-100.
@@ -1,202 +0,0 @@
"""suggest_intratable_fk_candidates — heuristica de FK intra-tabla del grupo `eda`.
Sobre el TableProfile de UNA tabla (el dict que produce ``profile_table``), sugiere
por heuristica de NOMBRE + CARDINALIDAD que columnas PARECEN una clave foranea hacia
otra tabla, util cuando no hay relaciones inter-tabla disponibles (una sola tabla y,
por tanto, sin containment cruzado que medir). Es una SUGERENCIA, no una afirmacion:
no confirma que exista la tabla referida ni que los valores esten contenidos en ella.
La consume el capitulo RELACIONES de AutomaticEDA cuando solo hay una tabla.
Funcion PURA: solo lee el dict (lectura defensiva con ``.get``), no hace I/O y nunca
lanza por inputs raros (devuelve ``[]``).
"""
# inferred_type que es compatible con una clave foranea (entero/categorico).
_FK_INFERRED_OK = {"numeric", "categorical", "integer"}
# Prefijos de physical_type que admiten ser clave foranea (enteros, texto, uuid).
_FK_PHYSICAL_PREFIXES = (
"int", "bigint", "smallint", "tinyint", "hugeint", "uint",
"varchar", "text", "char", "bpchar", "string", "uuid",
)
# Prefijos de physical_type que EXCLUYEN ser clave foranea: medidas en coma flotante
# (float/double/decimal/numeric/real), temporales (date/time/timestamp/interval) y
# boolean. Se comprueban ANTES que las senales positivas (la exclusion gana: una
# columna numeric con physical DOUBLE es una medida, no una FK).
_FK_PHYSICAL_EXCLUDE = (
"float", "double", "decimal", "numeric", "real",
"date", "time", "timestamp", "interval",
"bool",
)
def _fk_name_signal(name):
"""Detecta el sufijo de clave foranea en el nombre y devuelve ``(stem, sufijo)``.
Reconoce ``<algo>_id`` (snake), ``<Algo>Id`` y ``<algo>ID`` (camel). NO reconoce
el ``id``/``Id``/``ID`` generico a secas (suele ser la PK propia de la tabla, no
una referencia). En camelCase la ``I`` mayuscula marca el limite de palabra, asi
que ``paid``/``valid``/``grid`` (``id`` en minuscula y sin separador) NO matchean.
El ``stem`` se devuelve en minusculas y sirve de ``ref_table_guess`` (la tabla a
la que probablemente apunta): ``customer_id`` -> ``"customer"``, ``AlbumId`` ->
``"album"``, ``manager_staff_id`` -> ``"manager_staff"``. Devuelve ``None`` si no
hay senal de nombre.
"""
if not isinstance(name, str):
return None
raw = name.strip()
if not raw:
return None
# Snake: termina en "_id" (indiferente a mayusculas en la parte "id").
if raw.lower().endswith("_id"):
stem = raw[:-3].rstrip("_-. ")
if not stem:
return None
return (stem.lower(), "_id")
# Camel todo-mayuscula: "...ID" (p.ej. customerID).
if raw.endswith("ID"):
stem = raw[:-2].rstrip("_-. ")
if not stem:
return None
return (stem.lower(), "ID")
# Camel: "...Id" (p.ej. AlbumId).
if raw.endswith("Id"):
stem = raw[:-2].rstrip("_-. ")
if not stem:
return None
return (stem.lower(), "Id")
return None
def _fk_type_compatible(col):
"""True si el tipo de la columna admite ser clave foranea.
Compatible si el ``physical_type`` NO es una medida flotante, una temporal ni
boolean, Y ademas (``inferred_type`` en {numeric, categorical, integer} O el
``physical_type`` empieza por entero/varchar/text/char/uuid). La comparacion es
indistinta a mayusculas/minusculas.
"""
phys = (col.get("physical_type") or "").strip().lower()
inferred = (col.get("inferred_type") or "").strip().lower()
# Exclusion por tipo fisico (gana sobre cualquier senal positiva).
for bad in _FK_PHYSICAL_EXCLUDE:
if phys.startswith(bad):
return False
# Senal positiva por tipo inferido.
if inferred in _FK_INFERRED_OK:
return True
# Senal positiva por tipo fisico (entero/texto/uuid).
for good in _FK_PHYSICAL_PREFIXES:
if phys.startswith(good):
return True
return False
def suggest_intratable_fk_candidates(profile: dict, max_candidates: int = 20) -> list:
"""Sugiere columnas que parecen una FK intra-tabla por nombre + cardinalidad.
Heuristica (no afirma nada): una columna es candidata a clave foranea si su nombre
tiene sufijo de id con stem no vacio (``<algo>_id`` / ``<Algo>Id`` / ``<algo>ID``,
NUNCA el ``id`` generico), no es ya candidata a PK, no es constante, tiene
cardinalidad alta pero por debajo del numero de filas (N:1, no unica) y un tipo
compatible con clave (entero/categorico/texto/uuid; nunca float/fecha/boolean).
Args:
profile: TableProfile (dict de ``profile_table``). Se leen, de forma
defensiva, ``columns`` (lista de ColumnProfile), ``n_rows`` y
``key_candidates`` (nombres de columna ya candidatos a PK).
max_candidates: tope de sugerencias devueltas (default 20). Las columnas se
ordenan por ``distinct_count`` descendente (mas informativas primero)
antes de cortar.
Returns:
list de dicts (posiblemente vacia), uno por columna sugerida, con claves:
``column``, ``ref_table_guess`` (stem del nombre), ``reason`` (frase humana),
``distinct_count``, ``unique_pct`` (fraccion 0-1 tal como viene del profile),
``inferred_type``, ``physical_type``. Nunca lanza: si ``profile`` no es dict o
no hay columnas, devuelve ``[]``.
"""
if not isinstance(profile, dict):
return []
columns = profile.get("columns")
if not isinstance(columns, list):
return []
n_rows = profile.get("n_rows")
has_n_rows = (
isinstance(n_rows, int) and not isinstance(n_rows, bool) and n_rows > 0
)
key_candidates = profile.get("key_candidates")
if not isinstance(key_candidates, (list, tuple, set)):
key_candidates = []
key_set = set(key_candidates)
out = []
for col in columns:
if not isinstance(col, dict):
continue
name = col.get("name")
# 1) Senal de nombre: sufijo de id con stem no vacio.
signal = _fk_name_signal(name)
if signal is None:
continue
ref_guess, suffix = signal
# 2) No es ya candidata a PK (clave primaria de la propia tabla).
if name in key_set:
continue
# 3) No constante y con >= 2 valores distintos.
flags = col.get("flags") or []
if "constant" in flags:
continue
dc = col.get("distinct_count")
if not (isinstance(dc, int) and not isinstance(dc, bool) and dc >= 2):
continue
# 4) Cardinalidad alta pero < n_rows (no es PK) y no parece unica.
if has_n_rows and dc >= n_rows:
continue
unique_pct = col.get("unique_pct")
has_unique = (
isinstance(unique_pct, (int, float)) and not isinstance(unique_pct, bool)
)
if has_unique and unique_pct >= 0.99:
continue
# 5) Tipo compatible con clave foranea (entero/categorico/texto; no medida).
if not _fk_type_compatible(col):
continue
out.append(
{
"column": name,
"ref_table_guess": ref_guess,
"reason": _build_reason(suffix, dc, n_rows if has_n_rows else None, ref_guess),
"distinct_count": dc,
"unique_pct": float(unique_pct) if has_unique else None,
"inferred_type": col.get("inferred_type") or "",
"physical_type": col.get("physical_type") or "",
}
)
# Mas informativas primero (mayor cardinalidad), luego corte.
out.sort(key=lambda d: d.get("distinct_count") or 0, reverse=True)
return out[: max(0, int(max_candidates))]
def _build_reason(suffix, dc, n_rows, ref_guess):
"""Frase humana que deja claro que la sugerencia es heuristica, no confirmada."""
if n_rows is not None:
card = f"es N:1 ({dc} valores distintos < {n_rows} filas)"
else:
card = f"tiene {dc} valores distintos que se repiten (cardinalidad N:1)"
return (
f"el nombre termina en '{suffix}' y {card}: parece (heuristica por nombre, "
f"sin confirmar containment) una referencia a una tabla «{ref_guess}»"
)
@@ -1,157 +0,0 @@
"""Tests para suggest_intratable_fk_candidates (funcion pura, sin I/O)."""
from suggest_intratable_fk_candidates import suggest_intratable_fk_candidates
def _col(name, inferred_type="numeric", physical_type="BIGINT", distinct_count=10,
unique_pct=0.1, flags=None):
"""Construye un ColumnProfile minimo a mano (el dict que emite profile_table)."""
return {
"name": name,
"inferred_type": inferred_type,
"physical_type": physical_type,
"semantic_type": "",
"distinct_count": distinct_count,
"unique_pct": unique_pct,
"null_count": 0,
"null_pct": 0.0,
"flags": list(flags) if flags else [],
}
def test_golden_customer_id_detectado_otras_no():
# Tabla tipo titanic: customer_id es FK N:1; id es la PK; amount es medida;
# name es categorica sin sufijo de id. Solo customer_id debe aparecer.
profile = {
"n_rows": 891,
"key_candidates": ["id"],
"columns": [
_col("id", inferred_type="numeric", physical_type="BIGINT",
distinct_count=891, unique_pct=1.0, flags=["possible_id"]),
_col("customer_id", inferred_type="numeric", physical_type="BIGINT",
distinct_count=137, unique_pct=0.15, flags=[]),
_col("amount", inferred_type="numeric", physical_type="DOUBLE",
distinct_count=400, unique_pct=0.45),
_col("name", inferred_type="categorical", physical_type="VARCHAR",
distinct_count=700, unique_pct=0.78),
],
}
out = suggest_intratable_fk_candidates(profile)
assert isinstance(out, list)
assert [c["column"] for c in out] == ["customer_id"]
cand = out[0]
assert cand["ref_table_guess"] == "customer"
assert cand["distinct_count"] == 137
assert cand["unique_pct"] == 0.15
assert cand["inferred_type"] == "numeric"
assert cand["physical_type"] == "BIGINT"
# La razon deja claro que es heuristica + cita el sufijo y la tabla.
assert "customer" in cand["reason"]
assert "_id" in cand["reason"]
def test_camelcase_albumid_detectado():
# AlbumId (camelCase, VARCHAR) -> detectada, ref_table_guess "album".
profile = {
"n_rows": 3503,
"key_candidates": ["TrackId"],
"columns": [
_col("AlbumId", inferred_type="categorical", physical_type="VARCHAR",
distinct_count=347, unique_pct=0.10),
],
}
out = suggest_intratable_fk_candidates(profile)
# TrackId es PK candidata (en key_candidates), AlbumId no -> AlbumId aparece.
assert [c["column"] for c in out] == ["AlbumId"]
assert out[0]["ref_table_guess"] == "album"
def test_constante_status_id_no_aparece():
# status_id constante (flag "constant", distinct_count 1) NO es FK util.
profile = {
"n_rows": 1000,
"key_candidates": [],
"columns": [
_col("status_id", inferred_type="numeric", physical_type="INTEGER",
distinct_count=1, unique_pct=0.001, flags=["constant"]),
],
}
out = suggest_intratable_fk_candidates(profile)
assert out == []
def test_profile_vacio_y_none_devuelven_lista_vacia():
# Lectura defensiva: ni {} ni None lanzan; devuelven [].
assert suggest_intratable_fk_candidates({}) == []
assert suggest_intratable_fk_candidates(None) == []
# profile sin columns o con columns no-lista tampoco lanza.
assert suggest_intratable_fk_candidates({"n_rows": 10}) == []
assert suggest_intratable_fk_candidates({"columns": "no-soy-lista"}) == []
def test_category_id_casi_unico_parece_pk_no_aparece():
# unique_pct 0.999 -> parece PK (no N:1) -> NO se sugiere como FK.
profile = {
"n_rows": 891,
"key_candidates": [],
"columns": [
_col("category_id", inferred_type="numeric", physical_type="BIGINT",
distinct_count=890, unique_pct=0.999),
],
}
out = suggest_intratable_fk_candidates(profile)
assert out == []
def test_ref_table_guess_multitoken_y_orden_por_distinct():
# manager_staff_id conserva los underscores del stem -> "manager_staff".
# Ademas, con varias candidatas, se ordenan por distinct_count descendente.
profile = {
"n_rows": 10000,
"key_candidates": ["staff_id"], # staff_id es PK aqui, no debe aparecer
"columns": [
_col("staff_id", inferred_type="numeric", physical_type="BIGINT",
distinct_count=10000, unique_pct=1.0, flags=["possible_id"]),
_col("store_id", inferred_type="numeric", physical_type="INTEGER",
distinct_count=2, unique_pct=0.0002),
_col("manager_staff_id", inferred_type="numeric", physical_type="INTEGER",
distinct_count=40, unique_pct=0.004),
],
}
out = suggest_intratable_fk_candidates(profile)
cols = [c["column"] for c in out]
# staff_id excluida (PK); las otras dos ordenadas por distinct desc.
assert cols == ["manager_staff_id", "store_id"]
refs = {c["column"]: c["ref_table_guess"] for c in out}
assert refs["manager_staff_id"] == "manager_staff"
assert refs["store_id"] == "store"
def test_max_candidates_corta_la_lista():
# max_candidates limita el numero de sugerencias devueltas.
profile = {
"n_rows": 10000,
"key_candidates": [],
"columns": [
_col("a_id", distinct_count=300, unique_pct=0.03),
_col("b_id", distinct_count=200, unique_pct=0.02),
_col("c_id", distinct_count=100, unique_pct=0.01),
],
}
out = suggest_intratable_fk_candidates(profile, max_candidates=2)
assert [c["column"] for c in out] == ["a_id", "b_id"]
def test_id_generico_solo_nunca_es_fk():
# 'id'/'Id'/'ID' a secas (sin stem) jamas se sugieren como FK.
profile = {
"n_rows": 500,
"key_candidates": [],
"columns": [
_col("id", distinct_count=500, unique_pct=1.0),
_col("Id", distinct_count=120, unique_pct=0.24),
_col("ID", distinct_count=80, unique_pct=0.16),
],
}
out = suggest_intratable_fk_candidates(profile)
assert out == []
@@ -3,7 +3,7 @@ name: summarize_table_duckdb
kind: function
lang: py
domain: datascience
version: "1.0.0"
version: "1.1.0"
purity: impure
signature: "def summarize_table_duckdb(db_path: str, table: str, high_card_ratio: float = 0.9) -> dict"
description: "Perfila una tabla DuckDB en una sola pasada SQL (SUMMARIZE, push-down sin traer filas a RAM) y devuelve el esqueleto de un TableProfile con el perfil base por columna. Corazon del grupo eda: base barata sobre la que otras funciones anaden lo estadistico fino (skew/kurtosis/histograma sobre muestra)."
@@ -64,6 +64,7 @@ else:
- **`distinct_count` exacto para tablas <=200k filas, aproximado+capado por encima**: `SUMMARIZE` usa HyperLogLog (`approx_unique`), que SOBREESTIMA y en tablas pequenas puede reportar mas distintos que filas (inflando `unique_pct` por encima de 1.0 y disparando flags `possible_id` falsos). Por eso, para `n_rows <= 200000` la funcion calcula `COUNT(DISTINCT)` EXACTO en una sola query combinada (barata) y usa ese valor. Para tablas mas grandes mantiene `approx_unique` pero lo CAPA a `n_rows` (`distinct_count = min(approx_unique, n_rows)`). En ambos casos `unique_pct = min(distinct_count / n_rows, 1.0)`, asi que `distinct_count` nunca supera las filas ni `unique_pct` pasa de 1.0. Los flags `possible_id` / `high_cardinality` derivan de ese `distinct_count` ya corregido (exacto y fiable por debajo de 200k filas; aproximado y conservador por encima).
- **`SUMMARIZE` NO da skew, kurtosis ni histograma**, ni percentiles finos (p1/p5/p95/p99), moda, outliers, correlaciones, key_candidates ni quality_score. Esas claves quedan en `None`/`[]` a proposito: las rellena otra funcion del grupo `eda` sobre una muestra. El sub-dict `numeric` solo trae min, max, mean, std, p25, p50, p75.
- **`SUMMARIZE.count` es el total de filas, no el no-nulo**: la funcion deriva el `count` no-nulo del ColumnProfile como `n_rows - null_count` (con `null_count` redondeado de `null_percentage`).
- **`duplicate_rows`/`duplicate_pct` se pueblan push-down** (desde v1.1.0) con `count(*)` sobre `SELECT DISTINCT *` (sin traer filas a RAM): `duplicate_rows = n_rows - filas_distintas`, `duplicate_pct` en fraccion 0-1. Habilitan la dimension de unicidad de registro del score de dataset (`profile_table` paso 6). Si la tabla tiene tipos no comparables con `DISTINCT` (BLOB/LIST/MAP) la query degrada y ambas vuelven a `None` (renormaliza el score a solo `cell_quality`).
- **min/max/avg/std/q25/q50/q75 vienen como strings** desde DuckDB; se convierten a float (None si la columna no es numerica).
- **Requiere DuckDB 1.5.2** (columnas de `SUMMARIZE` validadas con esa version: column_name, column_type, min, max, approx_unique, avg, std, q25, q50, q75, count, null_percentage).
- **El identificador de tabla se interpola** (no parametrizable en `SUMMARIZE`): por eso se valida contra `^[A-Za-z_][A-Za-z0-9_]*$` antes de citarlo. Un nombre invalido (p.ej. con `;` o espacios) devuelve `{status:'error'}` sin tocar la base.
@@ -196,6 +196,21 @@ def summarize_table_duckdb(
sum(c["null_pct"] for c in columns) / len(columns) if columns else 0.0
)
# Unicidad de registro: filas duplicadas via COUNT de filas distintas
# push-down (DISTINCT *), sin traer filas a RAM. Habilita la dimension
# de uniqueness del score de dataset (1 - duplicate_pct). Degrada a None
# si la tabla tiene tipos no comparables con DISTINCT (BLOB/LIST/MAP).
duplicate_rows = None
duplicate_pct = None
if n_rows > 0:
dup_res = duckdb_query_readonly(
db_path, f"SELECT count(*) AS c FROM (SELECT DISTINCT * FROM {quoted})"
)
if dup_res["status"] == "ok" and dup_res["rows"]:
distinct_rows = int(dup_res["rows"][0]["c"])
duplicate_rows = max(0, n_rows - distinct_rows)
duplicate_pct = duplicate_rows / n_rows # fraccion 0-1
profile = {
"table": table,
"source": "duckdb",
@@ -203,8 +218,8 @@ def summarize_table_duckdb(
"n_rows": n_rows,
"n_cols": len(columns),
"size_bytes": None,
"duplicate_rows": None,
"duplicate_pct": None,
"duplicate_rows": duplicate_rows,
"duplicate_pct": duplicate_pct,
"constant_cols": constant_cols,
"all_null_cols": all_null_cols,
"null_cell_pct": null_cell_pct,
@@ -54,6 +54,30 @@ def test_shape_y_metadatos_tabla(db):
assert profile["correlations"] is None
def test_duplicate_pct_sin_duplicados(db):
"""Tabla con todas las filas distintas: duplicate_pct = 0, no None."""
profile = summarize_table_duckdb(db, "ventas")["profile"]
assert profile["duplicate_rows"] == 0
assert profile["duplicate_pct"] == 0.0
def test_duplicate_pct_con_duplicados(tmp_path):
"""Filas repetidas: duplicate_rows/duplicate_pct se pueblan push-down."""
path = str(tmp_path / "dups.duckdb")
con = duckdb.connect(path)
con.execute("CREATE TABLE t (a INTEGER, b VARCHAR)")
# 5 filas, 2 de ellas idénticas a otras -> 2 duplicadas sobre 5 = 0.4.
con.execute(
"INSERT INTO t VALUES "
"(1,'x'), (2,'y'), (1,'x'), (3,'z'), (2,'y')"
)
con.close()
profile = summarize_table_duckdb(path, "t")["profile"]
assert profile["n_rows"] == 5
assert profile["duplicate_rows"] == 2
assert profile["duplicate_pct"] == 0.4
def test_column_profile_shape(db):
profile = summarize_table_duckdb(db, "ventas")["profile"]
by_name = {c["name"]: c for c in profile["columns"]}
+10 -1
View File
@@ -4,7 +4,7 @@ kind: pipeline
lang: py
domain: pipelines
purity: impure
version: "1.0.0"
version: "1.1.0"
signature: "def profile_table(db_path: str, table: str, backend: str = \"duckdb\", sample: int = 5000, run_models: bool = False, run_llm: bool = False, run_series: bool = False, emit_pdf: bool = False, emit_automatic: bool = False, report_dir: str = \"reports\", write_report: bool = True) -> dict"
description: "Orquestador one-shot del grupo de capacidad eda: perfila UNA tabla (DuckDB o PostgreSQL) end-to-end componiendo las funciones del grupo (perfil base SQL + muestreo read-only + inferencia semantica + promocion de tipo + estadistica numerica/categorica + score de calidad + correlaciones con correccion FDR + re-expresion de Tukey + avisos exploratorios) y, opcional, modelos baratos (run_models), interpretacion LLM (run_llm) y analisis de serie temporal por columna (run_series: estacionariedad ADF+KPSS, ACF/PACF, STL, retornos). Emite el TableProfile completo mas (opcional) report markdown + JSON sidecar + PDF movil (emit_pdf). Es la composicion canonica para hazme un EDA de esta tabla."
tags: [eda, duckdb, postgres, profiling, data-quality, pipeline, dataops, timeseries]
@@ -114,3 +114,12 @@ para auditar la calidad de una tabla ya productiva. Reemplaza orquestar a mano
Formatos exoticos pueden descartarse silenciosamente del calculo numerico.
- `db_path` debe existir: DuckDB read-only NO crea la base. El muestreo usa el
sandbox por defecto de `duckdb_query_readonly` (sin acceso a FS/red).
- **Score de calidad (report 2046, desde v1.1.0).** Paso 5: cada columna recibe
`quality_score` de `column_quality_score` con la formula 60/40
(completeness/validity); al promocionar texto a numero/fecha se expone
`col["validity_rate"]` (parse rate de la muestra) para alimentar la dimension
validity. Paso 6: el score de dataset NO es la media simple — es
`100 * (0.85*cell_quality + 0.15*row_uniqueness)`, donde
`cell_quality = media(score_col/100)` y `row_uniqueness = 1 - duplicate_pct`.
Si `duplicate_pct` es `None` (backend sin calcularlo) el score se renormaliza
a solo `cell_quality`. Los outliers NO bajan el score (van a `observations`).
+36 -2
View File
@@ -477,9 +477,18 @@ def profile_table(
if vals and (len(ok) / len(vals)) >= _PROMOTE_MIN_PARSE:
col["inferred_type"] = "numeric"
inferred = "numeric"
# Tasa de parseo real de la muestra: alimenta la
# dimension validity de column_quality_score (fraccion
# de valores conformes al tipo numerico promovido).
col["validity_rate"] = len(ok) / len(vals)
elif semantic in _DATETIME_SEMANTIC:
col["inferred_type"] = "datetime"
inferred = "datetime"
# Tasa de parseo de la muestra a fecha (mismo papel que el
# parse rate numerico) para la dimension validity.
parsed_dt = [_to_ordinal_days(v) for v in vals]
ok_dt = [d for d in parsed_dt if d is not None]
col["validity_rate"] = (len(ok_dt) / len(vals)) if vals else None
# 4) Enriquecer segun el inferred_type final.
if inferred == "numeric":
@@ -506,11 +515,36 @@ def profile_table(
# 5) Score de calidad por columna.
col["quality_score"] = column_quality_score(col).get("score")
# 6) Score agregado de la tabla (media de columnas).
# 6) Score agregado de la tabla (report 2046): NO media simple.
# cell_quality = media de los scores de columna, en [0,1].
# row_uniqueness = 1 - duplicate_pct (unicidad de registro).
# score = 100 * (0.85*cell_quality + 0.15*row_uniqueness).
# Renormaliza a solo cell_quality si duplicate_pct no se pudo calcular.
scores = [
c["quality_score"] for c in cols if c.get("quality_score") is not None
]
prof["quality_score"] = round(sum(scores) / len(scores), 1) if scores else None
if scores:
cell_quality = (sum(scores) / len(scores)) / 100.0
dup_pct = prof.get("duplicate_pct")
if dup_pct is not None:
try:
d = float(dup_pct)
except (TypeError, ValueError):
d = None
else:
d = None
if d is not None:
# Tolerar escala 0-100 por si algun backend la entrega asi.
if d > 1.0:
d = d / 100.0
row_uniqueness = max(0.0, min(1.0, 1.0 - d))
prof["quality_score"] = round(
100.0 * (0.85 * cell_quality + 0.15 * row_uniqueness), 1
)
else:
prof["quality_score"] = round(100.0 * cell_quality, 1)
else:
prof["quality_score"] = None
# 7) Candidatos a clave.
key_candidates = []