fn_registry/python/functions/datascience/eda_llm_insights.md at 8e16202935218dc68da159e2384bb4c02f50392e


egutierrez 763e06c127 feat(browser): auto-commit con 178 cambios

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-06-20 18:22:23 +02:00


name, kind, lang, domain, version, purity, signature, description, tags, params, output, uses_functions, uses_types, returns, returns_optional, error_type, imports, tested, tests, test_file_path, file_path

name: eda_llm_insights
kind: function
domain: datascience
version: 1.0.0
purity: impure
signature: def eda_llm_insights(profile: dict, model: str = "claude-haiku-4-5-20251001") -> dict

description: Interpretive LLM layer of the eda group. Takes an ALREADY COMPUTED TableProfile (the dict from profile_table) and, with a SINGLE LLM call, generates the 'llm' block: table summary, meaning of a row, data dictionary, PII detection (GDPR), cleaning suggestions, and suggested analyses. Key cost/privacy point: it does NOT send raw rows to the LLM, only the AGGREGATED profile (names, types, % nulls, distinct counts, aggregated top values of categorical columns, numeric stats, strongly correlated pairs). Reuses ask_llm from the claude-direct group (direct API with a Claude OAuth token). Impure, dict-no-throw.

tags: eda, llm, claude-direct, datascience, profiling, pii, data-dictionary

params:
name	desc
profile	Already computed TableProfile (the dict returned by profile_table()['profile']). Expected shape: {table, n_rows, columns:[{name, inferred_type, semantic_type, null_pct, distinct_count, numeric:{min,max,mean,p50,...}, categorical:{top:[{value,count,pct}], mode,...}}], correlations:{strong:[{a,b,method,value}]} | None}. Only an aggregated summary is sent to the LLM; never raw rows.
model	Anthropic model id to use. Defaults to 'claude-haiku-4-5-20251001' (haiku, low cost). For higher interpretive quality, pass e.g. 'claude-opus-4-8'.
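To make the "aggregated summary only" idea concrete, here is a minimal sketch of how a profile of the shape described above could be flattened into prompt text without touching raw rows. `summarize_profile` is a hypothetical helper written for illustration, not the function's actual prompt builder, and the sample profile values are made up.

```python
def summarize_profile(profile: dict) -> str:
    """Render a row-free text summary of a TableProfile-like dict.

    Hypothetical helper for illustration; the real prompt builder may differ.
    """
    lines = [f"table: {profile.get('table', '?')} (rows: {profile.get('n_rows', '?')})"]
    for col in profile.get("columns") or []:
        parts = [
            f"- {col.get('name')}",
            f"type={col.get('inferred_type')}",
            f"null_pct={col.get('null_pct')}",
            f"distinct={col.get('distinct_count')}",
        ]
        # Numeric columns: only summary statistics, never individual values.
        numeric = col.get("numeric") or {}
        stats = [f"{k}={numeric[k]}" for k in ("min", "max", "mean", "p50") if k in numeric]
        if stats:
            parts.append("numeric[" + ", ".join(stats) + "]")
        # Categorical columns: only already-aggregated top values, capped at 5.
        top = (col.get("categorical") or {}).get("top") or []
        if top:
            tops = [f"{t['value']}={t['pct']}" for t in top[:5]]
            parts.append("top[" + ", ".join(tops) + "]")
        lines.append(" ".join(parts))
    # 'correlations' may be None, so fall back to an empty mapping.
    for c in (profile.get("correlations") or {}).get("strong") or []:
        lines.append(f"corr {c['a']}~{c['b']} ({c['method']}) = {c['value']}")
    return "\n".join(lines)


# Made-up profile, only to exercise the sketch.
example_profile = {
    "table": "ventas",
    "n_rows": 1000,
    "columns": [
        {"name": "importe", "inferred_type": "float", "null_pct": 0.5,
         "distinct_count": 870,
         "numeric": {"min": 1.0, "max": 950.0, "mean": 82.3, "p50": 60.0}},
        {"name": "canal", "inferred_type": "string", "null_pct": 0.0,
         "distinct_count": 3,
         "categorical": {"top": [{"value": "web", "count": 600, "pct": 60.0}]}},
    ],
    "correlations": None,
}
summary = summarize_profile(example_profile)
print(summary)
```

Whatever the real builder looks like, the point is the same: the text that reaches the LLM is derived from per-column aggregates, so its size depends on the number of columns rather than the number of rows.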

output: dict, dict-no-throw. On success: {status:'ok', llm:{summary:str, row_meaning:str, dictionary:[{column,description,business_meaning,unit}], pii:[{column,kind,severity}], cleaning:[str], analyses:[str]}}. Keys that the LLM omits are filled with empty defaults. On error (without raising): {status:'error', error:str}.
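The output contract above can be sketched as follows. This is an illustrative reading of the documented shape, not the actual implementation: `LLM_DEFAULTS`, `fill_defaults`, `ok`, and `error` are hypothetical names, and letting unknown extra keys pass through is an assumption.

```python
import copy

# Empty default for every documented key of the 'llm' block.
LLM_DEFAULTS = {
    "summary": "",
    "row_meaning": "",
    "dictionary": [],
    "pii": [],
    "cleaning": [],
    "analyses": [],
}


def fill_defaults(llm: dict) -> dict:
    """Return a new dict where every missing key gets its empty default."""
    filled = copy.deepcopy(LLM_DEFAULTS)  # deep copy so the default lists are never shared
    filled.update(llm)                    # assumption: extra keys pass through unchanged
    return filled


def ok(llm: dict) -> dict:
    """Success shape of the dict-no-throw contract."""
    return {"status": "ok", "llm": fill_defaults(llm)}


def error(message: str) -> dict:
    """Error shape of the dict-no-throw contract (returned, never raised)."""
    return {"status": "error", "error": message}


partial = ok({"summary": "Sales by channel."})
print(partial["llm"]["pii"])  # key omitted by the LLM, filled with []
```

Callers can then index `out["llm"]["pii"]` or `out["llm"]["cleaning"]` without guarding against missing keys, which is what the usage example relies on.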

uses_functions: ask_llm_py_core
returns_optional: false
error_type: error_go_core
tested: true

tests:
test_build_prompt_includes_table_and_columns
test_build_prompt_includes_numeric_stats_and_top_values
test_build_prompt_handles_empty_profile
test_parse_llm_json_plain
test_parse_llm_json_with_fences
test_parse_llm_json_with_surrounding_text
test_parse_llm_json_nested_braces_in_strings
test_parse_llm_json_raises_without_object
test_eda_llm_insights_ok_with_monkeypatched_llm
test_eda_llm_insights_fills_missing_keys
test_eda_llm_insights_error_on_empty_profile
test_eda_llm_insights_error_on_empty_llm_response
test_eda_llm_insights_error_on_unparseable_llm_response

test_file_path: python/functions/datascience/eda_llm_insights_test.py
file_path: python/functions/datascience/eda_llm_insights.py

Example

import sys, os
sys.path.insert(0, os.path.join("python", "functions"))

from pipelines.profile_table import profile_table
from datascience import eda_llm_insights

# 1) Profile the table (aggregate computation, no LLM).
r = profile_table("data/ventas.duckdb", "ventas", write_report=False)
profile = r["profile"]

# 2) Interpret the profile with ONE LLM call (only the aggregated profile is sent).
out = eda_llm_insights(profile)                       # haiku by default
# out = eda_llm_insights(profile, model="claude-opus-4-8")  # higher quality

if out["status"] == "ok":
    llm = out["llm"]
    print(llm["summary"])         # what the table is, 2-3 sentences
    print(llm["row_meaning"])     # what a single row represents
    for d in llm["dictionary"]:   # per-column data dictionary
        print(d["column"], "->", d["description"], f"({d['unit']})")
    for p in llm["pii"]:          # personal/sensitive data (GDPR)
        print("PII:", p["column"], p["kind"], p["severity"])
    print(llm["cleaning"])        # cleaning suggestions
    print(llm["analyses"])        # suggested analyses + hypotheses
else:
    print("error:", out["error"])

When to use it

Use it when you need to understand an already profiled table SEMANTICALLY: to generate a readable data dictionary, detect PII and GDPR-sensitive data, and get cleaning suggestions plus a list of analyses and hypotheses to explore. It is the interpretive step that follows profile_table: profile_table computes the metrics, and eda_llm_insights translates them into business language. The result fits into the llm key of the TableProfile (the one render_eda_markdown renders in the "Analisis LLM" section).
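As a small illustration of how the result slots into the profile, the sketch below copies a profile and stores the insights under its llm key only when the call succeeded. `attach_insights` is a hypothetical helper, and the sample dicts are made up; how render_eda_markdown treats a profile without an llm key is not covered here.

```python
def attach_insights(profile: dict, out: dict) -> dict:
    """Return a shallow copy of profile with out['llm'] stored under 'llm'.

    On a failed call (status != 'ok') the copy is returned without an
    'llm' key, so the original profile is never mutated either way.
    """
    merged = dict(profile)
    if out.get("status") == "ok":
        merged["llm"] = out["llm"]
    return merged


# Made-up inputs, only to exercise the sketch.
base = {"table": "ventas", "n_rows": 1000, "columns": []}
enriched = attach_insights(base, {"status": "ok", "llm": {"summary": "Sales."}})
unchanged = attach_insights(base, {"status": "error", "error": "no token"})
print("llm" in enriched, "llm" in unchanged)
```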

Gotchas

Impure: makes one network call to the LLM. It is neither deterministic nor free.
Requires a Claude OAuth token in ~/.claude/.credentials.json (via ask_llm / the claude-direct group). Without a token, it returns {status:'error'}.
Does NOT send raw rows to the LLM, only the AGGREGATED profile (names, types, % nulls, distinct counts, already aggregated top values, numeric stats, strong correlations). This keeps privacy exposure and cost low by design, but it requires the profile to be already computed by profile_table.
Haiku is the default model for low cost; switch to claude-opus-4-8 if you need finer interpretation (more expensive and slower).
The LLM may omit keys: missing ones are filled with empty defaults ("" or []), and the function never raises on an incomplete shape.
Parsing tolerates ```json fences and text around the object, but if the model returns no JSON object at all, it returns {status:'error'}.
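The tolerant parsing described in the last gotcha can be approximated with the standard library alone. The sketch below is a stand-in, not the module's own parser: it scans for each `{` and lets `json.JSONDecoder.raw_decode` decide where the object ends, which copes with fences, surrounding prose, and braces inside string values. The name `parse_llm_json` is borrowed from the test names; the body is an assumption.

```python
import json


def parse_llm_json(text: str) -> dict:
    """Extract the first decodable JSON object from an LLM reply.

    Raises ValueError when the reply contains no JSON object.
    """
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start` and
            # ignores whatever follows it (closing fences, prose, ...).
            obj, _end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            pass  # not a valid object here; try the next '{'
        else:
            if isinstance(obj, dict):
                return obj
        start = text.find("{", start + 1)
    raise ValueError("no JSON object found in LLM response")


fenced = '```json\n{"summary": "ok", "note": "a } inside a string"}\n```'
print(parse_llm_json(fenced)["summary"])
```

In a dict-no-throw function, the caller would wrap this in `try/except ValueError` and turn the exception into `{status:'error', error:...}` rather than letting it propagate.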
