chore: initial sync — gliner+glirel benchmark notebooks

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 23:44:11 +02:00
commit b8c760d004
49 changed files with 47850 additions and 0 deletions
@@ -0,0 +1,865 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "6a7ef2a5",
+   "metadata": {},
+   "source": [
+    "# GLiNER + GLiREL — calibracion empirica\n",
+    "\n",
+    "**Objetivo:** entender empiricamente como funcionan **GLiNER** (entidades) y **GLiREL** (relaciones) para fijar thresholds operativos en el pipeline `extract_graph_hybrid` (panel _Paste & Extract_ de `graph_explorer`).\n",
+    "\n",
+    "**Hallazgo previo (sesion del merge 0013):** un solo `confidence_threshold=0.6` filtra GLiNER (0.92-0.99 facil) Y GLiREL (max 0.21 en el test). Resultado: el panel jamas muestra relaciones aunque GLiREL si las detecte. Este notebook valida la separacion necesaria de thresholds y mide rangos sanos.\n",
+    "\n",
+    "**Plan:**\n",
+    "1. Cargar modelos\n",
+    "2. **GLiNER** — barrido threshold sobre corpus EN/ES + sensibilidad a label sets\n",
+    "3. **GLiREL** — distribucion de scores sin filtro + sensibilidad a label phrasing\n",
+    "4. Recomendaciones operativas\n",
+    "\n",
+    "**Stack:** gliner==0.2.26, glirel==1.2.1, transformers==5.1, huggingface_hub==1.13. Modelos `urchade/gliner_multi-v2.1` (~600 MB) y `jackboyla/glirel-large-v0` (~1.5 GB), ambos cacheados en `~/.cache/huggingface/`."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "2423c283",
+   "metadata": {},
+   "source": [
+    "## 1. Setup\n",
+    "\n",
+    "El kernel autocarga `FN_REGISTRY_ROOT` y anade `python/functions/` al `sys.path` (ver `.ipython/profile_default/startup/00_fn_registry.py`)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "id": "67f48818",
+   "metadata": {
+    "execution": {
+     "iopub.execute_input": "2026-05-04T12:58:37.640753Z",
+     "iopub.status.busy": "2026-05-04T12:58:37.640602Z",
+     "iopub.status.idle": "2026-05-04T12:58:37.853224Z",
+     "shell.execute_reply": "2026-05-04T12:58:37.852377Z"
+    }
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "FN_REGISTRY_ROOT: /home/lucas/fn_registry\n",
+      "results.json keys: ['gliner_threshold_sweep', 'glirel_score_distribution', 'glirel_topk_sweep', 'corpus', 'entity_labels', 'relation_labels']\n"
+     ]
+    }
+   ],
+   "source": [
+    "import os, sys, json, time, warnings\n",
+    "warnings.filterwarnings('ignore')\n",
+    "os.environ.setdefault('HF_HUB_DISABLE_PROGRESS_BARS', '1')\n",
+    "from pathlib import Path\n",
+    "\n",
+    "# Limpiar sys.path: el startup del kernel anade cada subdir de\n",
+    "# python/functions/ al top-level, y bigquery/datasets.py sombrea\n",
+    "# al paquete `datasets` de HuggingFace que necesita transformers.\n",
+    "# Dejamos solo el directorio padre 'python/functions/' para imports\n",
+    "# 'from datascience.gliner_load_model import ...' del estilo paquete.\n",
+    "_pf = '/home/lucas/fn_registry/python/functions'\n",
+    "sys.path = [p for p in sys.path if not (p.startswith(_pf + '/'))]\n",
+    "if _pf not in sys.path:\n",
+    "    sys.path.insert(0, _pf)\n",
+    "\n",
+    "import pandas as pd\n",
+    "from datascience.gliner_load_model import gliner_load_model\n",
+    "from datascience.glirel_load_model import glirel_load_model\n",
+    "\n",
+    "RESULTS = json.loads(Path('../results.json').read_text())\n",
+    "print('FN_REGISTRY_ROOT:', os.environ.get('FN_REGISTRY_ROOT'))\n",
+    "print('results.json keys:', list(RESULTS.keys()))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "6dc6a22b",
+   "metadata": {},
+   "source": [
+    "## 2. Corpus de prueba\n",
+    "\n",
+    "4 textos cortos cubriendo dominios diferentes (ES/EN, corporativo/OSINT/journalism). Sirven para detectar drift de calidad por idioma y por tipo de contenido."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "0f208d97",
+   "metadata": {},
+   "source": [
+    "### `es_corporate`\n",
+    "```\n",
+    "Pablo Isla, expresidente de Inditex, ha sido nombrado consejero de Telefonica. La operacion fue anunciada por el presidente Jose Maria Alvarez-Pallete en Madrid el pasado lunes. Inditex factura mas de 30.000 millones anuales y tiene su sede en Arteixo, A Coruna.\n",
+    "```\n",
+    "\n",
+    "### `en_corporate`\n",
+    "```\n",
+    "Pablo Isla, the former chairman of Inditex, has been appointed as a director of Telefonica. The announcement was made by Jose Maria Alvarez-Pallete, the chairman of Telefonica, in Madrid last Monday. Inditex has its headquarters in Arteixo, A Coruna.\n",
+    "```\n",
+    "\n",
+    "### `en_osint`\n",
+    "```\n",
+    "On 2024-08-15, attacker IP 185.220.101.45 connected to victim host 10.0.5.22 over TLS. Reverse DNS pointed to tor-exit-relay-3.onionrouter.net. Operator handle @phantomzero claimed responsibility on a forum. The C2 panel was hosted on hxxps://malwareops[.]biz/control behind Cloudflare.\n",
+    "```\n",
+    "\n",
+    "### `es_journalism`\n",
+    "```\n",
+    "Iberdrola y Endesa firmaron un acuerdo de colaboracion en proyectos eolicos en Galicia. El presidente de Iberdrola, Ignacio Galan, se reunio con la CEO de Endesa, Marina Serrano, en Bilbao. El acuerdo movilizara 2.000 millones de euros en cinco anos.\n",
+    "```\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "8cbf0f22",
+   "metadata": {},
+   "source": [
+    "## 3. Carga de modelos\n",
+    "\n",
+    "Cold load: ~50s por modelo (descarga). Warm: ~8s. Cache global por (model_name, device)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "id": "cf04dfad",
+   "metadata": {
+    "execution": {
+     "iopub.execute_input": "2026-05-04T12:58:37.855378Z",
+     "iopub.status.busy": "2026-05-04T12:58:37.855198Z",
+     "iopub.status.idle": "2026-05-04T12:58:52.254428Z",
+     "shell.execute_reply": "2026-05-04T12:58:52.253490Z"
+    }
+   },
+   "outputs": [
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "\u001b[0;93m2026-05-04 14:58:38.910665577 [W:onnxruntime:Default, device_discovery.cc:283 GetGpuDevices] Failed to detect devices under \"/sys/class/drm/card0\": device_discovery.cc:93 ReadFileContents Failed to open file: \"/sys/class/drm/card0/device/vendor\"\u001b[m\n"
+     ]
+    },
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.\n"
+     ]
+    },
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "\u001b[1mDebertaV2Model LOAD REPORT\u001b[0m from: microsoft/deberta-v3-large\n",
+      "Key                                     | Status     |  | \n",
+      "----------------------------------------+------------+--+-\n",
+      "mask_predictions.LayerNorm.bias         | UNEXPECTED |  | \n",
+      "lm_predictions.lm_head.bias             | UNEXPECTED |  | \n",
+      "lm_predictions.lm_head.LayerNorm.weight | UNEXPECTED |  | \n",
+      "lm_predictions.lm_head.dense.weight     | UNEXPECTED |  | \n",
+      "lm_predictions.lm_head.dense.bias       | UNEXPECTED |  | \n",
+      "mask_predictions.classifier.bias        | UNEXPECTED |  | \n",
+      "mask_predictions.dense.weight           | UNEXPECTED |  | \n",
+      "mask_predictions.LayerNorm.weight       | UNEXPECTED |  | \n",
+      "mask_predictions.dense.bias             | UNEXPECTED |  | \n",
+      "mask_predictions.classifier.weight      | UNEXPECTED |  | \n",
+      "lm_predictions.lm_head.LayerNorm.bias   | UNEXPECTED |  | \n",
+      "\n",
+      "\u001b[3mNotes:\n",
+      "- UNEXPECTED\u001b[3m\t:can be ignored when loading from different task/architecture; not ok if you expect identical arch.\u001b[0m\n"
+     ]
+    },
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "GLiNER ready in 8.2s\n",
+      "GLiREL ready in 6.2s\n"
+     ]
+    }
+   ],
+   "source": [
+    "t0 = time.time(); gliner = gliner_load_model(); t_gliner = time.time()-t0\n",
+    "t0 = time.time(); glirel = glirel_load_model(); t_glirel = time.time()-t0\n",
+    "print(f'GLiNER ready in {t_gliner:.1f}s')\n",
+    "print(f'GLiREL ready in {t_glirel:.1f}s')"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "08107c78",
+   "metadata": {},
+   "source": [
+    "## 4. GLiNER — barrido de threshold\n",
+    "\n",
+    "Para cada (corpus, label_set) corremos `predict_entities(threshold=0.0)` y filtramos a posteriori a {0.1, 0.3, 0.5, 0.7, 0.9}. Asi vemos la distribucion completa de scores sin recargar modelo."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "id": "46598320",
+   "metadata": {
+    "execution": {
+     "iopub.execute_input": "2026-05-04T12:58:52.257688Z",
+     "iopub.status.busy": "2026-05-04T12:58:52.257083Z",
+     "iopub.status.idle": "2026-05-04T12:58:52.284240Z",
+     "shell.execute_reply": "2026-05-04T12:58:52.283211Z"
+    }
+   },
+   "outputs": [
+    {
+     "data": {
+      "text/html": [
+       "<div>\n",
+       "<style scoped>\n",
+       "    .dataframe tbody tr th:only-of-type {\n",
+       "        vertical-align: middle;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe tbody tr th {\n",
+       "        vertical-align: top;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe thead th {\n",
+       "        text-align: right;\n",
+       "    }\n",
+       "</style>\n",
+       "<table border=\"1\" class=\"dataframe\">\n",
+       "  <thead>\n",
+       "    <tr style=\"text-align: right;\">\n",
+       "      <th></th>\n",
+       "      <th>corpus</th>\n",
+       "      <th>labels</th>\n",
+       "      <th>t=.1</th>\n",
+       "      <th>t=.3</th>\n",
+       "      <th>t=.5</th>\n",
+       "      <th>t=.7</th>\n",
+       "      <th>t=.9</th>\n",
+       "      <th>max_score</th>\n",
+       "    </tr>\n",
+       "  </thead>\n",
+       "  <tbody>\n",
+       "    <tr>\n",
+       "      <th>0</th>\n",
+       "      <td>es_corporate</td>\n",
+       "      <td>generic_en</td>\n",
+       "      <td>8</td>\n",
+       "      <td>8</td>\n",
+       "      <td>8</td>\n",
+       "      <td>8</td>\n",
+       "      <td>8</td>\n",
+       "      <td>0.994</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>1</th>\n",
+       "      <td>es_corporate</td>\n",
+       "      <td>generic_es</td>\n",
+       "      <td>8</td>\n",
+       "      <td>8</td>\n",
+       "      <td>8</td>\n",
+       "      <td>8</td>\n",
+       "      <td>8</td>\n",
+       "      <td>0.990</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>2</th>\n",
+       "      <td>en_corporate</td>\n",
+       "      <td>generic_en</td>\n",
+       "      <td>9</td>\n",
+       "      <td>9</td>\n",
+       "      <td>9</td>\n",
+       "      <td>9</td>\n",
+       "      <td>9</td>\n",
+       "      <td>0.995</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>3</th>\n",
+       "      <td>en_corporate</td>\n",
+       "      <td>specific_en</td>\n",
+       "      <td>9</td>\n",
+       "      <td>9</td>\n",
+       "      <td>9</td>\n",
+       "      <td>9</td>\n",
+       "      <td>8</td>\n",
+       "      <td>0.991</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>4</th>\n",
+       "      <td>en_osint</td>\n",
+       "      <td>generic_en</td>\n",
+       "      <td>12</td>\n",
+       "      <td>6</td>\n",
+       "      <td>1</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0.604</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>5</th>\n",
+       "      <td>en_osint</td>\n",
+       "      <td>osint_en</td>\n",
+       "      <td>13</td>\n",
+       "      <td>8</td>\n",
+       "      <td>6</td>\n",
+       "      <td>2</td>\n",
+       "      <td>2</td>\n",
+       "      <td>0.953</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>6</th>\n",
+       "      <td>es_journalism</td>\n",
+       "      <td>generic_en</td>\n",
+       "      <td>9</td>\n",
+       "      <td>8</td>\n",
+       "      <td>8</td>\n",
+       "      <td>8</td>\n",
+       "      <td>8</td>\n",
+       "      <td>0.995</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>7</th>\n",
+       "      <td>es_journalism</td>\n",
+       "      <td>generic_es</td>\n",
+       "      <td>9</td>\n",
+       "      <td>8</td>\n",
+       "      <td>8</td>\n",
+       "      <td>8</td>\n",
+       "      <td>7</td>\n",
+       "      <td>0.992</td>\n",
+       "    </tr>\n",
+       "  </tbody>\n",
+       "</table>\n",
+       "</div>"
+      ],
+      "text/plain": [
+       "          corpus       labels  t=.1  t=.3  t=.5  t=.7  t=.9  max_score\n",
+       "0   es_corporate   generic_en     8     8     8     8     8      0.994\n",
+       "1   es_corporate   generic_es     8     8     8     8     8      0.990\n",
+       "2   en_corporate   generic_en     9     9     9     9     9      0.995\n",
+       "3   en_corporate  specific_en     9     9     9     9     8      0.991\n",
+       "4       en_osint   generic_en    12     6     1     0     0      0.604\n",
+       "5       en_osint     osint_en    13     8     6     2     2      0.953\n",
+       "6  es_journalism   generic_en     9     8     8     8     8      0.995\n",
+       "7  es_journalism   generic_es     9     8     8     8     7      0.992"
+      ]
+     },
+     "execution_count": 3,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "from datascience.gliner_load_model import gliner_load_model\n",
+    "thresholds = [0.1, 0.3, 0.5, 0.7, 0.9]\n",
+    "rows = []\n",
+    "for corpus_key, cdata in RESULTS['gliner_threshold_sweep'].items():\n",
+    "    for ls_key, sdata in cdata.items():\n",
+    "        scored = sdata['scored_at_t0']\n",
+    "        max_s = max((s[2] for s in scored), default=0.0)\n",
+    "        rows.append([corpus_key, ls_key, *[len(sdata[f't={t}']) for t in thresholds], round(max_s,3)])\n",
+    "df = pd.DataFrame(rows, columns=['corpus','labels','t=.1','t=.3','t=.5','t=.7','t=.9','max_score'])\n",
+    "df"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "eed12fb4",
+   "metadata": {},
+   "source": [
+    "**Lectura:**\n",
+    "\n",
+    "- En **narrativa estructurada** (corporate, journalism), GLiNER da 8-9 entidades estables con scores 0.92-0.99. **`threshold=0.5` o `0.7` son seguros**, casi no se mueve el conteo.\n",
+    "- En **OSINT** (IPs, dominios, URLs) con labels genericas (`person`, `organization`...): scores _se hunden_ a max 0.60. **Cae todo a t=0.5**.\n",
+    "- Mismo OSINT con labels especificas (`ip_address`, `domain`, `url`): max 0.95, threshold 0.5 retiene 6.\n",
+    "- ES vs EN: practicamente identicos. El `gliner_multi-v2.1` es genuinamente multilingue. **Las labels EN funcionan igual de bien sobre texto ES.**\n",
+    "\n",
+    "**Conclusion 1:** `entity_threshold = 0.5` es seguro como default. Pero el **label set debe encajar al dominio** — una mala eleccion mata mas que un threshold mal puesto."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "fed8f100",
+   "metadata": {},
+   "source": [
+    "### 4.1 Entidades concretas (en_corporate, generic_en, t=0.5)\n",
+    "\n",
+    "Para verificar que no son ruido."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "id": "5358e303",
+   "metadata": {
+    "execution": {
+     "iopub.execute_input": "2026-05-04T12:58:52.286116Z",
+     "iopub.status.busy": "2026-05-04T12:58:52.285916Z",
+     "iopub.status.idle": "2026-05-04T12:58:52.300382Z",
+     "shell.execute_reply": "2026-05-04T12:58:52.299264Z"
+    }
+   },
+   "outputs": [
+    {
+     "data": {
+      "text/html": [
+       "<div>\n",
+       "<style scoped>\n",
+       "    .dataframe tbody tr th:only-of-type {\n",
+       "        vertical-align: middle;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe tbody tr th {\n",
+       "        vertical-align: top;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe thead th {\n",
+       "        text-align: right;\n",
+       "    }\n",
+       "</style>\n",
+       "<table border=\"1\" class=\"dataframe\">\n",
+       "  <thead>\n",
+       "    <tr style=\"text-align: right;\">\n",
+       "      <th></th>\n",
+       "      <th>text</th>\n",
+       "      <th>label</th>\n",
+       "      <th>score</th>\n",
+       "    </tr>\n",
+       "  </thead>\n",
+       "  <tbody>\n",
+       "    <tr>\n",
+       "      <th>0</th>\n",
+       "      <td>Pablo Isla</td>\n",
+       "      <td>person</td>\n",
+       "      <td>0.989302</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>1</th>\n",
+       "      <td>Inditex</td>\n",
+       "      <td>organization</td>\n",
+       "      <td>0.992379</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>2</th>\n",
+       "      <td>Telefonica</td>\n",
+       "      <td>organization</td>\n",
+       "      <td>0.992698</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>3</th>\n",
+       "      <td>Jose Maria Alvarez-Pallete</td>\n",
+       "      <td>person</td>\n",
+       "      <td>0.975533</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>4</th>\n",
+       "      <td>Telefonica</td>\n",
+       "      <td>organization</td>\n",
+       "      <td>0.990853</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>5</th>\n",
+       "      <td>Madrid</td>\n",
+       "      <td>location</td>\n",
+       "      <td>0.966069</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>6</th>\n",
+       "      <td>Inditex</td>\n",
+       "      <td>organization</td>\n",
+       "      <td>0.994649</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>7</th>\n",
+       "      <td>Arteixo</td>\n",
+       "      <td>location</td>\n",
+       "      <td>0.968921</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>8</th>\n",
+       "      <td>A Coruna</td>\n",
+       "      <td>location</td>\n",
+       "      <td>0.920429</td>\n",
+       "    </tr>\n",
+       "  </tbody>\n",
+       "</table>\n",
+       "</div>"
+      ],
+      "text/plain": [
+       "                         text         label     score\n",
+       "0                  Pablo Isla        person  0.989302\n",
+       "1                     Inditex  organization  0.992379\n",
+       "2                  Telefonica  organization  0.992698\n",
+       "3  Jose Maria Alvarez-Pallete        person  0.975533\n",
+       "4                  Telefonica  organization  0.990853\n",
+       "5                      Madrid      location  0.966069\n",
+       "6                     Inditex  organization  0.994649\n",
+       "7                     Arteixo      location  0.968921\n",
+       "8                    A Coruna      location  0.920429"
+      ]
+     },
+     "execution_count": 4,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "ents = RESULTS['gliner_threshold_sweep']['en_corporate']['generic_en']['t=0.5']\n",
+    "pd.DataFrame(ents, columns=['text','label','score','start','end'])[['text','label','score']]"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "f4019283",
+   "metadata": {},
+   "source": [
+    "## 5. GLiREL — distribucion de scores\n",
+    "\n",
+    "Aqui esta el quid del bug: pasamos `threshold=0.0`, `top_k=5` y vemos los scores naturales que emite GLiREL. Comparamos dos estilos de label:\n",
+    "\n",
+    "- `snake_short`: `works_at`, `located_in`, `appointed_as`, ...\n",
+    "- `natural_long`: `person works at organization`, ...\n",
+    "\n",
+    "El folklore dice que el segundo deberia funcionar mejor (porque GLiREL es tipo zero-shot). Vamos a ver."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "id": "b0516987",
+   "metadata": {
+    "execution": {
+     "iopub.execute_input": "2026-05-04T12:58:52.302264Z",
+     "iopub.status.busy": "2026-05-04T12:58:52.302062Z",
+     "iopub.status.idle": "2026-05-04T12:58:52.313997Z",
+     "shell.execute_reply": "2026-05-04T12:58:52.312964Z"
+    }
+   },
+   "outputs": [
+    {
+     "data": {
+      "text/html": [
+       "<div>\n",
+       "<style scoped>\n",
+       "    .dataframe tbody tr th:only-of-type {\n",
+       "        vertical-align: middle;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe tbody tr th {\n",
+       "        vertical-align: top;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe thead th {\n",
+       "        text-align: right;\n",
+       "    }\n",
+       "</style>\n",
+       "<table border=\"1\" class=\"dataframe\">\n",
+       "  <thead>\n",
+       "    <tr style=\"text-align: right;\">\n",
+       "      <th></th>\n",
+       "      <th>corpus</th>\n",
+       "      <th>n_ents</th>\n",
+       "      <th>label_style</th>\n",
+       "      <th>n_rels</th>\n",
+       "      <th>max_score</th>\n",
+       "      <th>median_score</th>\n",
+       "    </tr>\n",
+       "  </thead>\n",
+       "  <tbody>\n",
+       "    <tr>\n",
+       "      <th>0</th>\n",
+       "      <td>es_corporate</td>\n",
+       "      <td>8</td>\n",
+       "      <td>snake_short</td>\n",
+       "      <td>280</td>\n",
+       "      <td>0.169</td>\n",
+       "      <td>0.017</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>1</th>\n",
+       "      <td>es_corporate</td>\n",
+       "      <td>8</td>\n",
+       "      <td>natural_long</td>\n",
+       "      <td>280</td>\n",
+       "      <td>0.061</td>\n",
+       "      <td>0.010</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>2</th>\n",
+       "      <td>en_corporate</td>\n",
+       "      <td>9</td>\n",
+       "      <td>snake_short</td>\n",
+       "      <td>360</td>\n",
+       "      <td>0.233</td>\n",
+       "      <td>0.016</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>3</th>\n",
+       "      <td>en_corporate</td>\n",
+       "      <td>9</td>\n",
+       "      <td>natural_long</td>\n",
+       "      <td>360</td>\n",
+       "      <td>0.080</td>\n",
+       "      <td>0.007</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>4</th>\n",
+       "      <td>es_journalism</td>\n",
+       "      <td>8</td>\n",
+       "      <td>snake_short</td>\n",
+       "      <td>280</td>\n",
+       "      <td>0.195</td>\n",
+       "      <td>0.011</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>5</th>\n",
+       "      <td>es_journalism</td>\n",
+       "      <td>8</td>\n",
+       "      <td>natural_long</td>\n",
+       "      <td>280</td>\n",
+       "      <td>0.138</td>\n",
+       "      <td>0.007</td>\n",
+       "    </tr>\n",
+       "  </tbody>\n",
+       "</table>\n",
+       "</div>"
+      ],
+      "text/plain": [
+       "          corpus  n_ents   label_style  n_rels  max_score  median_score\n",
+       "0   es_corporate       8   snake_short     280      0.169         0.017\n",
+       "1   es_corporate       8  natural_long     280      0.061         0.010\n",
+       "2   en_corporate       9   snake_short     360      0.233         0.016\n",
+       "3   en_corporate       9  natural_long     360      0.080         0.007\n",
+       "4  es_journalism       8   snake_short     280      0.195         0.011\n",
+       "5  es_journalism       8  natural_long     280      0.138         0.007"
+      ]
+     },
+     "execution_count": 5,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "rows=[]\n",
+    "for corpus, cdata in RESULTS['glirel_score_distribution'].items():\n",
+    "    n_ents = len(cdata.get('entities', []))\n",
+    "    for style, rels in cdata.get('styles', {}).items():\n",
+    "        if isinstance(rels, list) and rels:\n",
+    "            scores = sorted([r['score'] for r in rels], reverse=True)\n",
+    "            rows.append([corpus, n_ents, style, len(rels), round(scores[0],3), round(scores[len(scores)//2],3)])\n",
+    "        else:\n",
+    "            rows.append([corpus, n_ents, style, 0, 0.0, 0.0])\n",
+    "df = pd.DataFrame(rows, columns=['corpus','n_ents','label_style','n_rels','max_score','median_score'])\n",
+    "df"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "80cb8f95",
+   "metadata": {},
+   "source": [
+    "**Lectura — dos sorpresas:**\n",
+    "\n",
+    "1. **`snake_short` >> `natural_long`** por un factor 3-4×. Pasar `\"person works at organization\"` baja el score max de 0.23 a 0.08. **GLiREL fue entrenado con etiquetas estilo Wikipedia** (`P54`, `member_of_political_party`...), no con frases naturales. El prompt-engineering aqui es _menos_ es _mas_.\n",
+    "2. **EN > ES por ~25%**: `en_corporate` max 0.233 vs `es_corporate` max 0.169 con el mismo contenido factico. GLiREL tiene mejor cobertura del ingles.\n",
+    "3. **Texto OSINT** dio 0 entidades en GLiNER multi-v2.1 con labels genericas → no hay pares para GLiREL. (Para OSINT habria que cambiar GLiNER -> regex (que ya cubre IoCs) y dejar GLiREL para narrativa).\n",
+    "\n",
+    "**Conclusion 2:** **`relation_threshold` debe estar en 0.10-0.15**, NO en 0.6. El `confidence_threshold` global del pipeline debe partirse en dos."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "e535e84b",
+   "metadata": {},
+   "source": [
+    "### 5.1 Efecto de `top_k`\n",
+    "\n",
+    "Subir `top_k` ¿descubre relaciones nuevas o solo añade ruido?"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "id": "cc6855a0",
+   "metadata": {
+    "execution": {
+     "iopub.execute_input": "2026-05-04T12:58:52.315945Z",
+     "iopub.status.busy": "2026-05-04T12:58:52.315750Z",
+     "iopub.status.idle": "2026-05-04T12:58:52.325915Z",
+     "shell.execute_reply": "2026-05-04T12:58:52.324821Z"
+    }
+   },
+   "outputs": [
+    {
+     "data": {
+      "text/html": [
+       "<div>\n",
+       "<style scoped>\n",
+       "    .dataframe tbody tr th:only-of-type {\n",
+       "        vertical-align: middle;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe tbody tr th {\n",
+       "        vertical-align: top;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe thead th {\n",
+       "        text-align: right;\n",
+       "    }\n",
+       "</style>\n",
+       "<table border=\"1\" class=\"dataframe\">\n",
+       "  <thead>\n",
+       "    <tr style=\"text-align: right;\">\n",
+       "      <th></th>\n",
+       "      <th>top_k</th>\n",
+       "      <th>n_total</th>\n",
+       "      <th>max</th>\n",
+       "      <th>median</th>\n",
+       "      <th>min</th>\n",
+       "    </tr>\n",
+       "  </thead>\n",
+       "  <tbody>\n",
+       "    <tr>\n",
+       "      <th>0</th>\n",
+       "      <td>top_k=1</td>\n",
+       "      <td>72</td>\n",
+       "      <td>0.233</td>\n",
+       "      <td>0.129</td>\n",
+       "      <td>0.036</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>1</th>\n",
+       "      <td>top_k=3</td>\n",
+       "      <td>216</td>\n",
+       "      <td>0.233</td>\n",
+       "      <td>0.045</td>\n",
+       "      <td>0.003</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>2</th>\n",
+       "      <td>top_k=5</td>\n",
+       "      <td>360</td>\n",
+       "      <td>0.233</td>\n",
+       "      <td>0.016</td>\n",
+       "      <td>0.000</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>3</th>\n",
+       "      <td>top_k=10</td>\n",
+       "      <td>360</td>\n",
+       "      <td>0.233</td>\n",
+       "      <td>0.016</td>\n",
+       "      <td>0.000</td>\n",
+       "    </tr>\n",
+       "  </tbody>\n",
+       "</table>\n",
+       "</div>"
+      ],
+      "text/plain": [
+       "      top_k  n_total    max  median    min\n",
+       "0   top_k=1       72  0.233   0.129  0.036\n",
+       "1   top_k=3      216  0.233   0.045  0.003\n",
+       "2   top_k=5      360  0.233   0.016  0.000\n",
+       "3  top_k=10      360  0.233   0.016  0.000"
+      ]
+     },
+     "execution_count": 6,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "rows=[]\n",
+    "for tk, rels in RESULTS['glirel_topk_sweep']['by_topk'].items():\n",
+    "    s = sorted([r['score'] for r in rels], reverse=True)\n",
+    "    rows.append([tk, len(rels), round(s[0],3), round(s[len(s)//2],3), round(s[-1],3)])\n",
+    "df = pd.DataFrame(rows, columns=['top_k','n_total','max','median','min'])\n",
+    "df"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "52f63ef3",
+   "metadata": {},
+   "source": [
+    "**Lectura:** `max` no se mueve. Solo crece `n_total` con peor score. **`top_k=1` o `top_k=3` es suficiente** para la app — subirlo solo añade ruido por debajo del threshold.\n",
+    "\n",
+    "**Conclusion 3:** dejar `top_k=1` por defecto en el panel. Si el usuario quiere ver alternativas, abrir un control avanzado."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "163a20d2",
+   "metadata": {},
+   "source": [
+    "## 6. Recomendaciones operativas\n",
+    "\n",
+    "### Para `extract_graph_hybrid` y `paste_extract`\n",
+    "\n",
+    "| Param | Valor recomendado | Razon |\n",
+    "|---|---|---|\n",
+    "| `entity_threshold` | **0.50** (general) / **0.70** (narrativa estructurada) | GLiNER da 0.92-0.99 en narrativa; 0.5 deja margen para casos limite |\n",
+    "| `relation_threshold` | **0.15** (EN) / **0.10** (ES) | GLiREL tiene scores naturalmente bajos; 0.6 es absurdo |\n",
+    "| `top_k` | **1** | Subirlo solo añade peor evidencia |\n",
+    "| `relation_labels` | **snake_case corto** (`works_at`) | Frases naturales empeoran scores 3-4× |\n",
+    "| `entity_labels` | **dominio-especificas si OSINT** | Labels genericas hunden recall en texto OSINT |\n",
+    "\n",
+    "### Cambios concretos en el codigo\n",
+    "\n",
+    "1. **Issue nuevo en `graph_explorer`** — `0041-split-confidence-thresholds.md`:\n",
+    "   - En `python/functions/pipelines/extract_graph_hybrid.py`: separar `confidence_threshold` en `entity_threshold` y `relation_threshold`.\n",
+    "   - En `enrichers/paste_extract/run.py`: aceptar ambos parametros desde el manifest/ctx.\n",
+    "   - En el panel C++ (`extract_panel.cpp`): dos sliders en lugar de uno, defaults 0.50 y 0.15.\n",
+    "2. **Test pytest existente** (`tests/test_paste_extract.py`) ya monkeypatchea el pipeline; añadir un test del path real con threshold separado cuando los modelos esten disponibles (skip si no).\n",
+    "3. **Documentar en `app.md`** que el path hybrid descarga ~2 GB la primera vez y queda en `~/.cache/huggingface/`.\n",
+    "\n",
+    "### Decisiones que NO se confirman aqui\n",
+    "\n",
+    "- Que pasa con texto > 512 tokens (GLiNER tiene window). Ver `extract_graph_hybrid` que ya hace chunking.\n",
+    "- Calidad real con LLM fallback activo (no probado en este notebook).\n",
+    "- Comportamiento con corpus mucho mas grande (este analysis prueba 4 textos cortos)."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "1546f0f8",
+   "metadata": {},
+   "source": [
+    "## 7. Apendice — script reproducible\n",
+    "\n",
+    "Los datos vienen de `../results.json`, generado por `../run_experiments.py`. Para regenerar (cambiar corpus, labels, etc.):\n",
+    "\n",
+    "```bash\n",
+    "cd analysis/gliner_glirel_tuning\n",
+    "./.venv/bin/python3 run_experiments.py    # ~30s con modelos calientes\n",
+    "./.venv/bin/python3 build_notebook.py     # rebuild .ipynb con outputs\n",
+    "```"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.13.7"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
@@ -0,0 +1,419 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "4a6738d5",
+   "metadata": {},
+   "source": [
+    "# Mejoras al pipeline GLiNER2 sobre PDF — resultados empiricos\n",
+    "\n",
+    "**Pregunta:** del notebook 05 nos quedamos con un grafo de PDF con 382 entidades pero solo 48 aristas y 324 nodos aislados. **¿Como subimos las relaciones correctas y reducimos aislados?**\n",
+    "\n",
+    "Tras leer la API real de GLiNER2 (no la del README), identifique 6 palancas:\n",
+    "\n",
+    "1. `threshold` (default 0.5) — bajar a 0.3 / 0.2\n",
+    "2. `relations({type: description})` — pasar dict con descripciones, no lista\n",
+    "3. `batch_extract` con `batch_size=8`\n",
+    "4. Coreference simple (normalizacion + substring) entre chunks\n",
+    "5. Sliding window de 2 frases entre chunks\n",
+    "6. Limpieza del PDF (page numbers, saltos espurios)\n",
+    "\n",
+    "Ejecutado el benchmark en `run_improvements.py` y guardado en `improvements.json`. Este notebook solo carga los datos y los presenta — sin recargar GLiNER2."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "ebbdc3f9",
+   "metadata": {},
+   "source": [
+    "## 0. Setup"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "c0adf6b4",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "keys: ['meta', 'configs', 'coref', 'top_entities_post_coref', 'top_relations_post_coref', 'ents_merged', 'rels_merged']\n"
+     ]
+    }
+   ],
+   "source": [
+    "import json\n",
+    "from pathlib import Path\n",
+    "import pandas as pd\n",
+    "DATA = json.loads(Path('../improvements.json').read_text())\n",
+    "print('keys:', list(DATA.keys()))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "59413647",
+   "metadata": {},
+   "source": [
+    "## 1. Pre-procesado del PDF (mejoras #5 y #6)\n",
+    "\n",
+    "Limpieza (`1/20` headers, saltos en medio de palabras, espacios duplicados) + chunking con sliding window de 2 frases."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "54e98462",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "raw chars:    89,882\n",
+      "clean chars:  88,714\n",
+      "chunks (overlap=2):  97\n",
+      "chunks (overlap=0):  66\n",
+      "\n",
+      "--- primeras 600 chars del clean ---\n",
+      "Banco Bilbao Vizcaya Argentaria, S.A., con domicilio en la Plaza San Nicolás, número 4, 48005 Bilbao,inscrito en el Registro Mercantil de Vizcaya, al tomo 2.083, Folio 1, Hoja BI-17-A, Inscripción 1ª con C.I.F. A-48265169POLÍTICA DE PROTECCIÓN DE DATOS PERSONALES 1. Política de Protección de Datos Personales T ómate tu tiempo y lee atentamente este documento. No dudes en pedirnos aclaraciones de lo que no entiendas.\n",
+      "En este apartado te explicamos para qué utilizará BBVA tus datos y, entre otros aspectos, qué derechos tienes relacionados con su uso.\n",
+      "INFORMACIÓN BÁSICA SOBRE PROTECCIÓN DE DATOS \n"
+     ]
+    }
+   ],
+   "source": [
+    "meta = DATA['meta']\n",
+    "print(f\"raw chars:    {meta['raw_chars']:,}\")\n",
+    "print(f\"clean chars:  {meta['clean_chars']:,}\")\n",
+    "print(f\"chunks (overlap=2):  {meta['n_chunks_overlap']}\")\n",
+    "print(f\"chunks (overlap=0):  {meta['n_chunks_no_overlap']}\")\n",
+    "print()\n",
+    "print('--- primeras 600 chars del clean ---')\n",
+    "print(meta['first_clean_600'])"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "cfd5a2bd",
+   "metadata": {},
+   "source": [
+    "## 2. Bateria comparativa — 5 configuraciones\n",
+    "\n",
+    "Sobre los mismos 97 chunks del PDF cleaned + sliding window:\n",
+    "\n",
+    "| Config | threshold | schema | metodo |\n",
+    "|---|---|---|---|\n",
+    "| **A** baseline | 0.5 (default) | flat list | extract loop |\n",
+    "| **B** lower threshold | 0.3 | flat list | extract loop |\n",
+    "| **C** very low threshold | 0.2 | flat list | extract loop |\n",
+    "| **D** + descriptions | 0.3 | dict con desc | extract loop |\n",
+    "| **E** + batch | 0.3 | dict con desc | batch_extract |\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "4fecd7e7",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "config               time    ents  rels  edges  isolates  conn%\n",
+       "-------------------  ------  ----  ----  -----  --------  -----\n",
+       "A: t=0.5 flat loop   134.3s  397   71    71     329       17.8%\n",
+       "B: t=0.3 flat loop   139.0s  517   204   204    389       26.0%\n",
+       "C: t=0.2 flat loop   133.9s  632   362   362    397       34.9%\n",
+       "D: t=0.3 desc loop   132.4s  517   204   204    389       26.0%\n",
+       "E: t=0.3 desc batch  163.6s  517   204   204    389       26.0%"
+      ]
+     },
+     "execution_count": null,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "rows = []\n",
+    "for c in DATA['configs']:\n",
+    "    s = c['stats']\n",
+    "    rows.append({\n",
+    "        'config': c['name'], 'time_s': c['elapsed'],\n",
+    "        'ents': s['n_ents'], 'rels': s['n_rels'], 'edges': s['n_edges'],\n",
+    "        'isolates': s['n_isolates'], 'conn_pct': s['connect_pct'],\n",
+    "    })\n",
+    "df = pd.DataFrame(rows)\n",
+    "df"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "757530b8",
+   "metadata": {},
+   "source": [
+    "**Lectura del benchmark:**\n",
+    "\n",
+    "- **Threshold es la palanca principal** y la unica que mueve la aguja:\n",
+    "  - `0.5 → 0.3` = **+187% relaciones** (71 → 204)\n",
+    "  - `0.3 → 0.2` = +78% mas (204 → 362), pero +22% entidades dudosas (517 → 632)\n",
+    "  - **Sweet spot: 0.3** — gran ganancia sin meter ruido excesivo.\n",
+    "\n",
+    "- **Descripciones por relacion NO mejoran** este corpus legal denso (B = D, identico). Probable explicacion: GLiNER2 ya entiende los nombres cortos como `governed_by`, `subject_to` directamente. Las descripciones podrian pesar mas en relaciones ambiguas (`acquired` vs `merged_with`).\n",
+    "\n",
+    "- **batch_extract NO da speedup en CPU** — fue **25% mas lento** que el loop (E=163s vs D=132s). Sospecha: el modelo es CPU-bound y el batching introduce overhead sin paralelismo real (1 modelo, no caben 8 forward pass simultaneos en un core). Solo vale la pena con GPU.\n",
+    "\n",
+    "- **Sliding window de 2 frases** ya esta aplicado en TODOS los configs (forma parte del chunking). Su efecto exacto vs no-overlap requeriria una sexta config aparte (no medido aqui)."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "98c616a6",
+   "metadata": {},
+   "source": [
+    "## 3. Coreferencia sobre la mejor config (E)\n",
+    "\n",
+    "Aplicamos un mergeo simple por:\n",
+    "\n",
+    "1. Lowercase + trim de puntuacion → cluster por nombre normalizado.\n",
+    "2. Substring match: nombres cortos absorbidos por largos del mismo tipo (`BBVA` ⊂ `Banco Bilbao Vizcaya Argentaria, S.A.`).\n",
+    "3. Re-escritura de relaciones para usar nombres canonicos.\n",
+    "\n",
+    "Coste: 0.62s. Tras coref:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "def3dd7a",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "PRE-coref  {'n_ents': 517, 'n_rels': 204, 'n_nodes': 526, 'n_edges': 204, 'n_isolates': 389, 'connected': 137, 'connect_pct': 26.0}\n",
+      "POST-coref {'n_ents': 401, 'n_rels': 166, 'n_nodes': 440, 'n_edges': 166, 'n_isolates': 318, 'connected': 122, 'connect_pct': 27.7}\n",
+      "absorbed: 72 aliases en 0.62s\n",
+      "\n",
+      "Samples de aliases absorbidos:\n",
+      "  'productos y servicios'                                 → 'Información derivada de los productos y servicios contratados'\n",
+      "  'servicios contratados'                                 → 'Información derivada de los productos y servicios contratados'\n",
+      "  'información'                                           → 'Información derivada de los productos y servicios contratados'\n",
+      "  'productos'                                             → 'Información derivada de los productos y servicios contratados'\n",
+      "  'servicios'                                             → 'Información derivada de los productos y servicios contratados'\n",
+      "  'normativa'                                             → 'normativa interna sobre prevención de crimen financiero'\n",
+      "  'blanqueo de capitales'                                 → 'normativa de prevención del blanqueo de capitales'\n",
+      "  'interacción'                                           → 'datos derivados de la interacción con chatbots'"
+     ]
+    }
+   ],
+   "source": [
+    "pre = DATA['coref']['pre_stats']\n",
+    "post = DATA['coref']['post_stats']\n",
+    "print('PRE-coref ', pre)\n",
+    "print('POST-coref', post)\n",
+    "print(f\"absorbed: {DATA['coref']['n_absorbed']} aliases en {DATA['coref']['elapsed']}s\")\n",
+    "print()\n",
+    "print('Samples de aliases absorbidos:')\n",
+    "for old, new in DATA['coref']['absorbed_sample']:\n",
+    "    print(f'  {old!r:55s} → {new!r}')"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "5613c249",
+   "metadata": {},
+   "source": [
+    "**Lectura coref:**\n",
+    "\n",
+    "- **72 aliases absorbidos** en 0.62s — gratis para el usuario.\n",
+    "- Nodos: 526 → 440 (-86).\n",
+    "- Edges: 204 → 166 (-38) — _bajan porque las relaciones se mergean cuando ambos extremos colapsan al mismo canonico_.\n",
+    "- Aislados: 389 → 318 (-71, **-18%**).\n",
+    "- Conn%: 26.0% → 27.7% (mejora pequeña en porcentaje porque tambien se reducen los nodos totales).\n",
+    "\n",
+    "Lo que mas mejora la coreferencia es la **calidad del grafo**: en lugar de tener 5 nodos `productos`, `servicios`, `información`, etc. dispersos por el documento, los junta en una entidad canonica `Información derivada de los productos y servicios contratados`."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "5d9af970",
+   "metadata": {},
+   "source": [
+    "## 4. Top entidades post-coref"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "fdb2f3c7",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "type           canonical                                                     mentions  n_aliases  aliases_sample                                                   \n",
+       "-------------  ------------------------------------------------------------  --------  ---------  -----------------------------------------------------------------\n",
+       "organization   BBVA Seguros                                                  81        1          ['BBVA']                                                         \n",
+       "data_category  Datos Personales                                              47        0          []                                                               \n",
+       "person         cliente particular                                            34        1          ['cliente']                                                      \n",
+       "organization   Banco de España (CIRBE)                                       28        3          ['Banco de España', 'Banco', 'CIRBE']                            \n",
+       "location       Plaza San Nicolás                                             27        0          []                                                               \n",
+       "location       Vizcaya                                                       22        0          []                                                               \n",
+       "data_category  datos derivados de la interacción con chatbots                19        3          ['interacción', 'chatbots', 'datos']                             \n",
+       "law            normativa interna sobre prevención de crimen financiero       19        1          ['normativa']                                                    \n",
+       "right          consentimiento                                                18        0          []                                                               \n",
+       "data_category  Datos transaccionales                                         18        1          ['transaccionales']                                              \n",
+       "data_category  Información derivada de los productos y servicios contratado  17        5          ['productos y servicios', 'servicios contratados', 'información']\n",
+       "person         clientes                                                      15        0          []                                                               \n",
+       "data_category  Datos identificativos                                         14        0          []                                                               \n",
+       "email          derechosprotecciondatos@bbva.com                              14        0          []                                                               \n",
+       "data_category  número de teléfono de contacto                                13        1          ['contacto']                                                     \n",
+       "person         representante                                                 12        0          []                                                               \n",
+       "organization   Agencia Española de Protección de Datos                       12        0          []                                                               \n",
+       "organization   sociedades participadas                                       11        2          ['participadas', 'sociedades']                                   \n",
+       "person         garante                                                       11        0          []                                                               \n",
+       "data_category  Datos económicos                                              11        1          ['económicos']                                                   "
+      ]
+     },
+     "execution_count": null,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "rows = DATA['top_entities_post_coref'][:20]\n",
+    "df = pd.DataFrame(rows)\n",
+    "df"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "36710c94",
+   "metadata": {},
+   "source": [
+    "## 5. Top relaciones post-coref"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "c5439813",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "from                                            kind            to                                                  count\n",
+       "----------------------------------------------  --------------  --------------------------------------------------  -----\n",
+       "BBVA Seguros                                    governed_by     Banco de España (CIRBE)                             4    \n",
+       "Datos Personales                                protected_by    Agencia Española de Protección de Datos             4    \n",
+       "Datos Personales                                protected_by    Política de Protección de Datos Personales          3    \n",
+       "BBVA Seguros                                    subject_to      obligaciones legales                                3    \n",
+       "derechos de acceso                              rights_against  datos derivados de la interacción con chatbots      3    \n",
+       "contratación                                    controlled_by   BBVA Seguros                                        3    \n",
+       "BBVA Seguros                                    subsidiary_of   Grupo BBVA                                          2    \n",
+       "Datos Personales                                protected_by    BBVA Seguros                                        2    \n",
+       "BBVA Seguros                                    contact_for     Información derivada de los productos y servicios   2    \n",
+       "Delegado de Protección de Datos                 contact_for     BBVA Seguros                                        2    \n",
+       "BBVA Seguros                                    controlled_by   Banco de España (CIRBE)                             2    \n",
+       "domicilio                                       located_in      Plaza San Nicolás                                   2    \n",
+       "datos de contacto                               contact_for     clientes                                            2    \n",
+       "BBVA Seguros                                    located_in      España                                              2    \n",
+       "contratos de crédito inmobiliario               governed_by     Ley 5/2019                                          2    \n",
+       "Avda. de la Industria                           located_in      MADRID                                              2    \n",
+       "bbva.es                                         located_in      MADRID                                              2    \n",
+       "datos derivados de la interacción con chatbots  subject_to      normativa interna sobre prevención de crimen finan  2    \n",
+       "Datos Personales                                subject_to      normativa interna sobre prevención de crimen finan  2    \n",
+       "Emailage Corporation                            located_in      Londres                                             2    "
+      ]
+     },
+     "execution_count": null,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "rows = DATA['top_relations_post_coref'][:20]\n",
+    "df = pd.DataFrame(rows)\n",
+    "df"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "3c830cb5",
+   "metadata": {},
+   "source": [
+    "## 6. Conclusion — recetario operativo\n",
+    "\n",
+    "**Para subir relaciones correctas y reducir aislados en GLiNER2 sobre PDF, en orden de impacto/coste:**\n",
+    "\n",
+    "| Mejora | Ganancia tipica | Coste de implementacion |\n",
+    "|---|---|---|\n",
+    "| ⭐ `threshold=0.3` (vs default 0.5) | **+187% relaciones** | 1 parametro |\n",
+    "| ⭐ Coreferencia simple (normalize + substring) | **-18% aislados** | ~30 lineas Python pure |\n",
+    "| Limpieza del PDF (`N/20`, saltos) | -1.3% chars de ruido + chunks mas estables | ~10 lineas regex |\n",
+    "| `threshold=0.2` (mas agresivo) | +78% relaciones extra, +22% ents dudosas | trade-off |\n",
+    "| ❌ Descripciones por relacion | Sin efecto en este corpus | dict en vez de list |\n",
+    "| ❌ batch_extract en CPU | 25% mas lento | API distinta |\n",
+    "| ❌ Sliding window con chunks de 1500 chars | Marginal | 5 lineas |\n",
+    "\n",
+    "**Stack final recomendado:**\n",
+    "\n",
+    "```python\n",
+    "# 1. Carga GLiNER2 (Apache 2.0)\n",
+    "model = GLiNER2.from_pretrained('fastino/gliner2-large-v1')\n",
+    "\n",
+    "# 2. Pre-procesa PDF\n",
+    "raw = extract_pdf_text(pdf_path)            # registry: extract_pdf_text_py_core\n",
+    "clean = clean_pdf_text(raw)                  # NUEVA funcion del registry\n",
+    "chunks = chunk_with_overlap(clean, max_chars=1500, overlap_sentences=2)  # NUEVA\n",
+    "\n",
+    "# 3. Schema + extract con threshold=0.3\n",
+    "schema = model.create_schema().entities([...]).relations([...])\n",
+    "results = [model.extract(c['text'], schema=schema, threshold=0.3) for c in chunks]\n",
+    "\n",
+    "# 4. Aggregate + coref\n",
+    "ents, rels = aggregate(results)              # NUEVA, pura\n",
+    "ents, rels, _ = merge_aliases(ents, rels)    # NUEVA, pura\n",
+    "```\n",
+    "\n",
+    "## Funciones a promover al registry (proximo fn-constructor)\n",
+    "\n",
+    "Aproximadamente **6 funciones nuevas**, casi todas puras:\n",
+    "\n",
+    "1. `gliner2_load_model_py_datascience` (impure) — Apache 2.0, NER+RE joint\n",
+    "2. `clean_pdf_text_py_core` (pure) — limpieza de artefactos PyPDF2\n",
+    "3. `chunk_with_overlap_py_core` (pure) — chunking con sliding window\n",
+    "4. `aggregate_extraction_results_py_core` (pure) — dedupe + counter\n",
+    "5. `merge_entity_aliases_py_core` (pure) — coref simple normalize + substring\n",
+    "6. `extract_graph_from_pdf_py_pipelines` (impure) — composicion completa\n",
+    "\n",
+    "Esto cierra el ciclo: el flujo del notebook se vuelve _una llamada del registry_ reusable cross-project."
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.13.7"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}