gliner_glirel_tuning/notebooks/01_gliner_glirel_tuning.ipynb
2026-05-04 23:44:11 +02:00

{
"cells": [
{
"cell_type": "markdown",
"id": "6a7ef2a5",
"metadata": {},
"source": [
"# GLiNER + GLiREL: empirical calibration\n",
"\n",
"**Goal:** understand empirically how **GLiNER** (entities) and **GLiREL** (relations) behave, in order to set operational thresholds for the `extract_graph_hybrid` pipeline (the _Paste & Extract_ panel of `graph_explorer`).\n",
"\n",
"**Earlier finding (merge 0013 session):** a single `confidence_threshold=0.6` filters both GLiNER (which easily scores 0.92-0.99) AND GLiREL (max 0.21 in that test). As a result, the panel never shows relations even though GLiREL does detect them. This notebook validates that the thresholds need to be split and measures healthy score ranges.\n",
"\n",
"**Plan:**\n",
"1. Load the models\n",
"2. **GLiNER**: threshold sweep over an EN/ES corpus, plus sensitivity to label sets\n",
"3. **GLiREL**: unfiltered score distribution, plus sensitivity to label phrasing\n",
"4. Operational recommendations\n",
"\n",
"**Stack:** gliner==0.2.26, glirel==1.2.1, transformers==5.1, huggingface_hub==1.13. Models `urchade/gliner_multi-v2.1` (~600 MB) and `jackboyla/glirel-large-v0` (~1.5 GB), both cached in `~/.cache/huggingface/`."
]
},
{
"cell_type": "markdown",
"id": "2423c283",
"metadata": {},
"source": [
"## 1. Setup\n",
"\n",
"The kernel automatically loads `FN_REGISTRY_ROOT` and adds `python/functions/` to `sys.path` (see `.ipython/profile_default/startup/00_fn_registry.py`)."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "67f48818",
"metadata": {
"execution": {
"iopub.execute_input": "2026-05-04T12:58:37.640753Z",
"iopub.status.busy": "2026-05-04T12:58:37.640602Z",
"iopub.status.idle": "2026-05-04T12:58:37.853224Z",
"shell.execute_reply": "2026-05-04T12:58:37.852377Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"FN_REGISTRY_ROOT: /home/lucas/fn_registry\n",
"results.json keys: ['gliner_threshold_sweep', 'glirel_score_distribution', 'glirel_topk_sweep', 'corpus', 'entity_labels', 'relation_labels']\n"
]
}
],
"source": [
"import os, sys, json, time, warnings\n",
"warnings.filterwarnings('ignore')\n",
"os.environ.setdefault('HF_HUB_DISABLE_PROGRESS_BARS', '1')\n",
"from pathlib import Path\n",
"\n",
"# Clean up sys.path: the kernel startup adds every subdirectory of\n",
"# python/functions/ as a top-level entry, so bigquery/datasets.py shadows\n",
"# the HuggingFace `datasets` package that transformers needs.\n",
"# Keep only the parent directory 'python/functions/' so that package-style\n",
"# imports such as 'from datascience.gliner_load_model import ...' work.\n",
"_pf = '/home/lucas/fn_registry/python/functions'\n",
"sys.path = [p for p in sys.path if not (p.startswith(_pf + '/'))]\n",
"if _pf not in sys.path:\n",
"    sys.path.insert(0, _pf)\n",
"\n",
"import pandas as pd\n",
"from datascience.gliner_load_model import gliner_load_model\n",
"from datascience.glirel_load_model import glirel_load_model\n",
"\n",
"RESULTS = json.loads(Path('../results.json').read_text())\n",
"print('FN_REGISTRY_ROOT:', os.environ.get('FN_REGISTRY_ROOT'))\n",
"print('results.json keys:', list(RESULTS.keys()))"
]
},
{
"cell_type": "markdown",
"id": "6dc6a22b",
"metadata": {},
"source": [
"## 2. Test corpus\n",
"\n",
"Four short texts covering different domains (ES/EN; corporate, OSINT, journalism). They help detect quality drift by language and by content type."
]
},
{
"cell_type": "markdown",
"id": "0f208d97",
"metadata": {},
"source": [
"### `es_corporate`\n",
"```\n",
"Pablo Isla, expresidente de Inditex, ha sido nombrado consejero de Telefonica. La operacion fue anunciada por el presidente Jose Maria Alvarez-Pallete en Madrid el pasado lunes. Inditex factura mas de 30.000 millones anuales y tiene su sede en Arteixo, A Coruna.\n",
"```\n",
"\n",
"### `en_corporate`\n",
"```\n",
"Pablo Isla, the former chairman of Inditex, has been appointed as a director of Telefonica. The announcement was made by Jose Maria Alvarez-Pallete, the chairman of Telefonica, in Madrid last Monday. Inditex has its headquarters in Arteixo, A Coruna.\n",
"```\n",
"\n",
"### `en_osint`\n",
"```\n",
"On 2024-08-15, attacker IP 185.220.101.45 connected to victim host 10.0.5.22 over TLS. Reverse DNS pointed to tor-exit-relay-3.onionrouter.net. Operator handle @phantomzero claimed responsibility on a forum. The C2 panel was hosted on hxxps://malwareops[.]biz/control behind Cloudflare.\n",
"```\n",
"\n",
"### `es_journalism`\n",
"```\n",
"Iberdrola y Endesa firmaron un acuerdo de colaboracion en proyectos eolicos en Galicia. El presidente de Iberdrola, Ignacio Galan, se reunio con la CEO de Endesa, Marina Serrano, en Bilbao. El acuerdo movilizara 2.000 millones de euros en cinco anos.\n",
"```\n"
]
},
{
"cell_type": "markdown",
"id": "8cbf0f22",
"metadata": {},
"source": [
"## 3. Loading the models\n",
"\n",
"Cold load: ~50 s per model (includes the download). Warm load: ~6-8 s. The loaders keep a global cache keyed by (model_name, device)."
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "cf04dfad",
"metadata": {
"execution": {
"iopub.execute_input": "2026-05-04T12:58:37.855378Z",
"iopub.status.busy": "2026-05-04T12:58:37.855198Z",
"iopub.status.idle": "2026-05-04T12:58:52.254428Z",
"shell.execute_reply": "2026-05-04T12:58:52.253490Z"
}
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"\u001b[0;93m2026-05-04 14:58:38.910665577 [W:onnxruntime:Default, device_discovery.cc:283 GetGpuDevices] Failed to detect devices under \"/sys/class/drm/card0\": device_discovery.cc:93 ReadFileContents Failed to open file: \"/sys/class/drm/card0/device/vendor\"\u001b[m\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"\u001b[1mDebertaV2Model LOAD REPORT\u001b[0m from: microsoft/deberta-v3-large\n",
"Key | Status | | \n",
"----------------------------------------+------------+--+-\n",
"mask_predictions.LayerNorm.bias | UNEXPECTED | | \n",
"lm_predictions.lm_head.bias | UNEXPECTED | | \n",
"lm_predictions.lm_head.LayerNorm.weight | UNEXPECTED | | \n",
"lm_predictions.lm_head.dense.weight | UNEXPECTED | | \n",
"lm_predictions.lm_head.dense.bias | UNEXPECTED | | \n",
"mask_predictions.classifier.bias | UNEXPECTED | | \n",
"mask_predictions.dense.weight | UNEXPECTED | | \n",
"mask_predictions.LayerNorm.weight | UNEXPECTED | | \n",
"mask_predictions.dense.bias | UNEXPECTED | | \n",
"mask_predictions.classifier.weight | UNEXPECTED | | \n",
"lm_predictions.lm_head.LayerNorm.bias | UNEXPECTED | | \n",
"\n",
"\u001b[3mNotes:\n",
"- UNEXPECTED\u001b[3m\t:can be ignored when loading from different task/architecture; not ok if you expect identical arch.\u001b[0m\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"GLiNER ready in 8.2s\n",
"GLiREL ready in 6.2s\n"
]
}
],
"source": [
"t0 = time.time(); gliner = gliner_load_model(); t_gliner = time.time()-t0\n",
"t0 = time.time(); glirel = glirel_load_model(); t_glirel = time.time()-t0\n",
"print(f'GLiNER ready in {t_gliner:.1f}s')\n",
"print(f'GLiREL ready in {t_glirel:.1f}s')"
]
},
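{
"cell_type": "markdown",
"id": "sketch-model-cache",
"metadata": {},
"source": [
"A cache keyed by (model_name, device) can be sketched with `functools.lru_cache`. How `gliner_load_model` and `glirel_load_model` actually implement their cache is not shown in this notebook, so the names `load_model_cached` and `LOAD_CALLS` below are hypothetical, and the loader returns a placeholder dict instead of a real model.\n",
"\n",
"```python\n",
"from functools import lru_cache\n",
"\n",
"LOAD_CALLS = []  # records each real load so the cache effect is visible\n",
"\n",
"\n",
"@lru_cache(maxsize=None)\n",
"def load_model_cached(model_name, device):\n",
"    '''Hypothetical stand-in loader; a real one would build the model here.'''\n",
"    LOAD_CALLS.append((model_name, device))\n",
"    return {'model_name': model_name, 'device': device}\n",
"\n",
"\n",
"first = load_model_cached('urchade/gliner_multi-v2.1', 'cpu')\n",
"second = load_model_cached('urchade/gliner_multi-v2.1', 'cpu')\n",
"print(first is second, len(LOAD_CALLS))\n",
"```\n",
"\n",
"A repeated call with the same (model_name, device) pair returns the same object without running the loader again."
]
},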
{
"cell_type": "markdown",
"id": "08107c78",
"metadata": {},
"source": [
"## 4. GLiNER: threshold sweep\n",
"\n",
"For each (corpus, label_set) pair we run `predict_entities(threshold=0.0)` and filter afterwards at {0.1, 0.3, 0.5, 0.7, 0.9}. This exposes the full score distribution without reloading the model."
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "46598320",
"metadata": {
"execution": {
"iopub.execute_input": "2026-05-04T12:58:52.257688Z",
"iopub.status.busy": "2026-05-04T12:58:52.257083Z",
"iopub.status.idle": "2026-05-04T12:58:52.284240Z",
"shell.execute_reply": "2026-05-04T12:58:52.283211Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>corpus</th>\n",
" <th>labels</th>\n",
" <th>t=.1</th>\n",
" <th>t=.3</th>\n",
" <th>t=.5</th>\n",
" <th>t=.7</th>\n",
" <th>t=.9</th>\n",
" <th>max_score</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>es_corporate</td>\n",
" <td>generic_en</td>\n",
" <td>8</td>\n",
" <td>8</td>\n",
" <td>8</td>\n",
" <td>8</td>\n",
" <td>8</td>\n",
" <td>0.994</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>es_corporate</td>\n",
" <td>generic_es</td>\n",
" <td>8</td>\n",
" <td>8</td>\n",
" <td>8</td>\n",
" <td>8</td>\n",
" <td>8</td>\n",
" <td>0.990</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>en_corporate</td>\n",
" <td>generic_en</td>\n",
" <td>9</td>\n",
" <td>9</td>\n",
" <td>9</td>\n",
" <td>9</td>\n",
" <td>9</td>\n",
" <td>0.995</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>en_corporate</td>\n",
" <td>specific_en</td>\n",
" <td>9</td>\n",
" <td>9</td>\n",
" <td>9</td>\n",
" <td>9</td>\n",
" <td>8</td>\n",
" <td>0.991</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>en_osint</td>\n",
" <td>generic_en</td>\n",
" <td>12</td>\n",
" <td>6</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0.604</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>en_osint</td>\n",
" <td>osint_en</td>\n",
" <td>13</td>\n",
" <td>8</td>\n",
" <td>6</td>\n",
" <td>2</td>\n",
" <td>2</td>\n",
" <td>0.953</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>es_journalism</td>\n",
" <td>generic_en</td>\n",
" <td>9</td>\n",
" <td>8</td>\n",
" <td>8</td>\n",
" <td>8</td>\n",
" <td>8</td>\n",
" <td>0.995</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>es_journalism</td>\n",
" <td>generic_es</td>\n",
" <td>9</td>\n",
" <td>8</td>\n",
" <td>8</td>\n",
" <td>8</td>\n",
" <td>7</td>\n",
" <td>0.992</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" corpus labels t=.1 t=.3 t=.5 t=.7 t=.9 max_score\n",
"0 es_corporate generic_en 8 8 8 8 8 0.994\n",
"1 es_corporate generic_es 8 8 8 8 8 0.990\n",
"2 en_corporate generic_en 9 9 9 9 9 0.995\n",
"3 en_corporate specific_en 9 9 9 9 8 0.991\n",
"4 en_osint generic_en 12 6 1 0 0 0.604\n",
"5 en_osint osint_en 13 8 6 2 2 0.953\n",
"6 es_journalism generic_en 9 8 8 8 8 0.995\n",
"7 es_journalism generic_es 9 8 8 8 7 0.992"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"thresholds = [0.1, 0.3, 0.5, 0.7, 0.9]\n",
"rows = []\n",
"for corpus_key, cdata in RESULTS['gliner_threshold_sweep'].items():\n",
"    for ls_key, sdata in cdata.items():\n",
"        scored = sdata['scored_at_t0']\n",
"        # Index 2 of each scored item holds the score.\n",
"        max_s = max((s[2] for s in scored), default=0.0)\n",
"        rows.append([corpus_key, ls_key, *[len(sdata[f't={t}']) for t in thresholds], round(max_s,3)])\n",
"df = pd.DataFrame(rows, columns=['corpus','labels','t=.1','t=.3','t=.5','t=.7','t=.9','max_score'])\n",
"df"
]
},
{
"cell_type": "markdown",
"id": "eed12fb4",
"metadata": {},
"source": [
"**Reading:**\n",
"\n",
"- On **structured narrative** (corporate, journalism), GLiNER returns a stable 8-9 entities with scores of 0.92-0.99. **`threshold=0.5` or `0.7` are both safe**; the count barely moves.\n",
"- On **OSINT** (IPs, domains, URLs) with generic labels (`person`, `organization`...), scores _collapse_ to a max of 0.60. **Only 1 of the 12 entities found at t=0.1 survives t=0.5, and none survives t=0.7.**\n",
"- The same OSINT text with specific labels (`ip_address`, `domain`, `url`): max 0.95, and threshold 0.5 keeps 6.\n",
"- ES vs EN: practically identical. `gliner_multi-v2.1` is genuinely multilingual. **EN labels work just as well on ES text.**\n",
"\n",
"**Conclusion 1:** `entity_threshold = 0.5` is a safe default. But the **label set must fit the domain**: a poor choice of labels costs more than a badly placed threshold."
]
},
{
"cell_type": "markdown",
"id": "fed8f100",
"metadata": {},
"source": [
"### 4.1 Concrete entities (en_corporate, generic_en, t=0.5)\n",
"\n",
"To check that they are not noise."
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "5358e303",
"metadata": {
"execution": {
"iopub.execute_input": "2026-05-04T12:58:52.286116Z",
"iopub.status.busy": "2026-05-04T12:58:52.285916Z",
"iopub.status.idle": "2026-05-04T12:58:52.300382Z",
"shell.execute_reply": "2026-05-04T12:58:52.299264Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>text</th>\n",
" <th>label</th>\n",
" <th>score</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Pablo Isla</td>\n",
" <td>person</td>\n",
" <td>0.989302</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Inditex</td>\n",
" <td>organization</td>\n",
" <td>0.992379</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Telefonica</td>\n",
" <td>organization</td>\n",
" <td>0.992698</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Jose Maria Alvarez-Pallete</td>\n",
" <td>person</td>\n",
" <td>0.975533</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Telefonica</td>\n",
" <td>organization</td>\n",
" <td>0.990853</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>Madrid</td>\n",
" <td>location</td>\n",
" <td>0.966069</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>Inditex</td>\n",
" <td>organization</td>\n",
" <td>0.994649</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>Arteixo</td>\n",
" <td>location</td>\n",
" <td>0.968921</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>A Coruna</td>\n",
" <td>location</td>\n",
" <td>0.920429</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" text label score\n",
"0 Pablo Isla person 0.989302\n",
"1 Inditex organization 0.992379\n",
"2 Telefonica organization 0.992698\n",
"3 Jose Maria Alvarez-Pallete person 0.975533\n",
"4 Telefonica organization 0.990853\n",
"5 Madrid location 0.966069\n",
"6 Inditex organization 0.994649\n",
"7 Arteixo location 0.968921\n",
"8 A Coruna location 0.920429"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ents = RESULTS['gliner_threshold_sweep']['en_corporate']['generic_en']['t=0.5']\n",
"pd.DataFrame(ents, columns=['text','label','score','start','end'])[['text','label','score']]"
]
},
{
"cell_type": "markdown",
"id": "f4019283",
"metadata": {},
"source": [
"## 5. GLiREL — distribucion de scores\n",
"\n",
"Aqui esta el quid del bug: pasamos `threshold=0.0`, `top_k=5` y vemos los scores naturales que emite GLiREL. Comparamos dos estilos de label:\n",
"\n",
"- `snake_short`: `works_at`, `located_in`, `appointed_as`, ...\n",
"- `natural_long`: `person works at organization`, ...\n",
"\n",
"El folklore dice que el segundo deberia funcionar mejor (porque GLiREL es tipo zero-shot). Vamos a ver."
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "b0516987",
"metadata": {
"execution": {
"iopub.execute_input": "2026-05-04T12:58:52.302264Z",
"iopub.status.busy": "2026-05-04T12:58:52.302062Z",
"iopub.status.idle": "2026-05-04T12:58:52.313997Z",
"shell.execute_reply": "2026-05-04T12:58:52.312964Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>corpus</th>\n",
" <th>n_ents</th>\n",
" <th>label_style</th>\n",
" <th>n_rels</th>\n",
" <th>max_score</th>\n",
" <th>median_score</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>es_corporate</td>\n",
" <td>8</td>\n",
" <td>snake_short</td>\n",
" <td>280</td>\n",
" <td>0.169</td>\n",
" <td>0.017</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>es_corporate</td>\n",
" <td>8</td>\n",
" <td>natural_long</td>\n",
" <td>280</td>\n",
" <td>0.061</td>\n",
" <td>0.010</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>en_corporate</td>\n",
" <td>9</td>\n",
" <td>snake_short</td>\n",
" <td>360</td>\n",
" <td>0.233</td>\n",
" <td>0.016</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>en_corporate</td>\n",
" <td>9</td>\n",
" <td>natural_long</td>\n",
" <td>360</td>\n",
" <td>0.080</td>\n",
" <td>0.007</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>es_journalism</td>\n",
" <td>8</td>\n",
" <td>snake_short</td>\n",
" <td>280</td>\n",
" <td>0.195</td>\n",
" <td>0.011</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>es_journalism</td>\n",
" <td>8</td>\n",
" <td>natural_long</td>\n",
" <td>280</td>\n",
" <td>0.138</td>\n",
" <td>0.007</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" corpus n_ents label_style n_rels max_score median_score\n",
"0 es_corporate 8 snake_short 280 0.169 0.017\n",
"1 es_corporate 8 natural_long 280 0.061 0.010\n",
"2 en_corporate 9 snake_short 360 0.233 0.016\n",
"3 en_corporate 9 natural_long 360 0.080 0.007\n",
"4 es_journalism 8 snake_short 280 0.195 0.011\n",
"5 es_journalism 8 natural_long 280 0.138 0.007"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"rows=[]\n",
"for corpus, cdata in RESULTS['glirel_score_distribution'].items():\n",
"    n_ents = len(cdata.get('entities', []))\n",
"    for style, rels in cdata.get('styles', {}).items():\n",
"        if isinstance(rels, list) and rels:\n",
"            # Sorted descending: first element is the max, middle element the median.\n",
"            scores = sorted([r['score'] for r in rels], reverse=True)\n",
"            rows.append([corpus, n_ents, style, len(rels), round(scores[0],3), round(scores[len(scores)//2],3)])\n",
"        else:\n",
"            rows.append([corpus, n_ents, style, 0, 0.0, 0.0])\n",
"df = pd.DataFrame(rows, columns=['corpus','n_ents','label_style','n_rels','max_score','median_score'])\n",
"df"
]
},
{
"cell_type": "markdown",
"id": "80cb8f95",
"metadata": {},
"source": [
"**Reading (two surprises and one caveat):**\n",
"\n",
"1. **`snake_short` clearly beats `natural_long`**, by roughly 1.4-2.9× in max score depending on the corpus. Passing `\"person works at organization\"` lowers the max score from 0.23 to 0.08 on `en_corporate`. This is consistent with GLiREL having been trained on Wikidata-style labels (`P54`, `member_of_political_party`...) rather than natural phrases. For prompt engineering here, _less_ is _more_.\n",
"2. **EN > ES**: `en_corporate` max 0.233 vs `es_corporate` max 0.169 for the same factual content, so the ES max is about 27% lower. GLiREL appears to have better coverage of English.\n",
"3. **OSINT text** is absent from the table: with generic labels, GLiNER multi-v2.1 keeps at most 1 entity at t=0.5 and none at t=0.7 (see section 4), so there are no entity pairs for GLiREL. (For OSINT, the better option is to replace GLiNER with the regex that already covers IoCs and keep GLiREL for narrative.)\n",
"\n",
"**Conclusion 2:** **`relation_threshold` should sit at 0.10-0.15**, NOT at 0.6. The pipeline's global `confidence_threshold` has to be split in two."
]
},
{
"cell_type": "markdown",
"id": "e535e84b",
"metadata": {},
"source": [
"### 5.1 Efecto de `top_k`\n",
"\n",
"Subir `top_k` ¿descubre relaciones nuevas o solo añade ruido?"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "cc6855a0",
"metadata": {
"execution": {
"iopub.execute_input": "2026-05-04T12:58:52.315945Z",
"iopub.status.busy": "2026-05-04T12:58:52.315750Z",
"iopub.status.idle": "2026-05-04T12:58:52.325915Z",
"shell.execute_reply": "2026-05-04T12:58:52.324821Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>top_k</th>\n",
" <th>n_total</th>\n",
" <th>max</th>\n",
" <th>median</th>\n",
" <th>min</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>top_k=1</td>\n",
" <td>72</td>\n",
" <td>0.233</td>\n",
" <td>0.129</td>\n",
" <td>0.036</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>top_k=3</td>\n",
" <td>216</td>\n",
" <td>0.233</td>\n",
" <td>0.045</td>\n",
" <td>0.003</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>top_k=5</td>\n",
" <td>360</td>\n",
" <td>0.233</td>\n",
" <td>0.016</td>\n",
" <td>0.000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>top_k=10</td>\n",
" <td>360</td>\n",
" <td>0.233</td>\n",
" <td>0.016</td>\n",
" <td>0.000</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" top_k n_total max median min\n",
"0 top_k=1 72 0.233 0.129 0.036\n",
"1 top_k=3 216 0.233 0.045 0.003\n",
"2 top_k=5 360 0.233 0.016 0.000\n",
"3 top_k=10 360 0.233 0.016 0.000"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"rows=[]\n",
"for tk, rels in RESULTS['glirel_topk_sweep']['by_topk'].items():\n",
"    s = sorted([r['score'] for r in rels], reverse=True)\n",
"    rows.append([tk, len(rels), round(s[0],3), round(s[len(s)//2],3), round(s[-1],3)])\n",
"df = pd.DataFrame(rows, columns=['top_k','n_total','max','median','min'])\n",
"df"
]
},
{
"cell_type": "markdown",
"id": "52f63ef3",
"metadata": {},
"source": [
"**Lectura:** `max` no se mueve. Solo crece `n_total` con peor score. **`top_k=1` o `top_k=3` es suficiente** para la app — subirlo solo añade ruido por debajo del threshold.\n",
"\n",
"**Conclusion 3:** dejar `top_k=1` por defecto en el panel. Si el usuario quiere ver alternativas, abrir un control avanzado."
]
},
{
"cell_type": "markdown",
"id": "163a20d2",
"metadata": {},
"source": [
"## 6. Recomendaciones operativas\n",
"\n",
"### Para `extract_graph_hybrid` y `paste_extract`\n",
"\n",
"| Param | Valor recomendado | Razon |\n",
"|---|---|---|\n",
"| `entity_threshold` | **0.50** (general) / **0.70** (narrativa estructurada) | GLiNER da 0.92-0.99 en narrativa; 0.5 deja margen para casos limite |\n",
"| `relation_threshold` | **0.15** (EN) / **0.10** (ES) | GLiREL tiene scores naturalmente bajos; 0.6 es absurdo |\n",
"| `top_k` | **1** | Subirlo solo añade peor evidencia |\n",
"| `relation_labels` | **snake_case corto** (`works_at`) | Frases naturales empeoran scores 3-4× |\n",
"| `entity_labels` | **dominio-especificas si OSINT** | Labels genericas hunden recall en texto OSINT |\n",
"\n",
"### Cambios concretos en el codigo\n",
"\n",
"1. **Issue nuevo en `graph_explorer`** — `0041-split-confidence-thresholds.md`:\n",
" - En `python/functions/pipelines/extract_graph_hybrid.py`: separar `confidence_threshold` en `entity_threshold` y `relation_threshold`.\n",
" - En `enrichers/paste_extract/run.py`: aceptar ambos parametros desde el manifest/ctx.\n",
" - En el panel C++ (`extract_panel.cpp`): dos sliders en lugar de uno, defaults 0.50 y 0.15.\n",
"2. **Test pytest existente** (`tests/test_paste_extract.py`) ya monkeypatchea el pipeline; añadir un test del path real con threshold separado cuando los modelos esten disponibles (skip si no).\n",
"3. **Documentar en `app.md`** que el path hybrid descarga ~2 GB la primera vez y queda en `~/.cache/huggingface/`.\n",
"\n",
"### Decisiones que NO se confirman aqui\n",
"\n",
"- Que pasa con texto > 512 tokens (GLiNER tiene window). Ver `extract_graph_hybrid` que ya hace chunking.\n",
"- Calidad real con LLM fallback activo (no probado en este notebook).\n",
"- Comportamiento con corpus mucho mas grande (este analysis prueba 4 textos cortos)."
]
},
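{
"cell_type": "markdown",
"id": "sketch-recommended-params",
"metadata": {},
"source": [
"The table in section 6 could be encoded as defaults along these lines. This is only a sketch: `ExtractionParams` and `recommended_params` are hypothetical names, not part of `extract_graph_hybrid`, and only the two languages measured in this notebook are handled.\n",
"\n",
"```python\n",
"from dataclasses import dataclass\n",
"\n",
"\n",
"@dataclass(frozen=True)\n",
"class ExtractionParams:\n",
"    entity_threshold: float\n",
"    relation_threshold: float\n",
"    top_k: int = 1\n",
"\n",
"\n",
"def recommended_params(lang, structured_narrative=False):\n",
"    '''Map the section 6 table to concrete values for lang in {en, es}.'''\n",
"    relation_thresholds = {'en': 0.15, 'es': 0.10}\n",
"    if lang not in relation_thresholds:\n",
"        # Other languages were not measured here, so refuse to guess.\n",
"        raise ValueError(f'no measured recommendation for lang={lang!r}')\n",
"    entity_threshold = 0.70 if structured_narrative else 0.50\n",
"    return ExtractionParams(entity_threshold, relation_thresholds[lang])\n",
"\n",
"\n",
"print(recommended_params('en'))\n",
"print(recommended_params('es', structured_narrative=True))\n",
"```\n",
"\n",
"Raising an error for unmeasured languages keeps the defaults tied to what the four test texts actually support."
]
},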
{
"cell_type": "markdown",
"id": "1546f0f8",
"metadata": {},
"source": [
"## 7. Appendix: reproducible script\n",
"\n",
"The data comes from `../results.json`, generated by `../run_experiments.py`. To regenerate it (after changing the corpus, labels, etc.):\n",
"\n",
"```bash\n",
"cd analysis/gliner_glirel_tuning\n",
"./.venv/bin/python3 run_experiments.py # ~30s with warm models\n",
"./.venv/bin/python3 build_notebook.py # rebuild the .ipynb with outputs\n",
"```"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.13.7"
}
},
"nbformat": 4,
"nbformat_minor": 5
}