{ "cells": [ { "cell_type": "markdown", "id": "64ca5797", "metadata": {}, "source": [ "# GLiNER2: the single model for `graph_explorer`\n", "\n", "After ruling out GLiREL (notebook 02) and accepting mREBEL with a license caveat (notebook 03), we found **`fastino/gliner2-large-v1`**: NER + RE in a single model, **Apache 2.0** licensed, with native Spanish support, and **20-30× faster** than mREBEL.\n", "\n", "| | GLiNER + GLiREL | GLiNER + mREBEL | **GLiNER2** |\n", "|---|---|---|---|\n", "| Models | 2 | 2 | **1** |\n", "| Total size | 2.1 GB | 3.0 GB | **0.7 GB** |\n", "| Latency, 8 ES sentences | 1.0 s | 25 s | **1.2 s** |\n", "| Latency, 30 ES sentences | ~3 s | ~90 s | **4.2 s** |\n", "| Quality, ES corporate | 1 false relation | 4/5 OK | **5-6/8 OK** |\n", "| Quality, ES OSINT | not tested | not tested | **works** |\n", "| License | Apache 2.0 | CC BY-NC-SA 4.0 | **Apache 2.0** |\n", "| Languages | EN-centric | 18 languages | EN/ES/FR |\n", "\n", "This notebook embeds the benchmark v2 data (`benchmark_v2.json`) and builds the final graph." ] }, { "cell_type": "markdown", "id": "1ec3450e", "metadata": {}, "source": [ "## 1. 
Setup" ] }, { "cell_type": "code", "execution_count": 1, "id": "ac1a949e", "metadata": { "execution": { "iopub.execute_input": "2026-05-04T13:42:13.793562Z", "iopub.status.busy": "2026-05-04T13:42:13.793426Z", "iopub.status.idle": "2026-05-04T13:42:18.411410Z", "shell.execute_reply": "2026-05-04T13:42:18.410428Z" } }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "\u001b[0;93m2026-05-04 15:42:15.263782788 [W:onnxruntime:Default, device_discovery.cc:283 GetGpuDevices] Failed to detect devices under \"/sys/class/drm/card0\": device_discovery.cc:93 ReadFileContents Failed to open file: \"/sys/class/drm/card0/device/vendor\"\u001b[m\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "corpora benchmarked: ['es_corporate_short', 'es_corporate_long', 'es_osint', 'en_corporate_short']\n" ] } ], "source": [ "import os, sys, json, warnings, time\n", "warnings.filterwarnings('ignore')\n", "os.environ.setdefault('HF_HUB_DISABLE_PROGRESS_BARS', '1')\n", "from pathlib import Path\n", "\n", "_pf = '/home/lucas/fn_registry/python/functions'\n", "sys.path = [p for p in sys.path if not p.startswith(_pf + '/')]\n", "if _pf not in sys.path: sys.path.insert(0, _pf)\n", "\n", "import pandas as pd\n", "import networkx as nx\n", "import matplotlib.pyplot as plt\n", "from gliner2 import GLiNER2\n", "\n", "BENCH = json.loads(Path('../benchmark_v2.json').read_text())\n", "print('corpora benchmarked:', list(BENCH.keys()))" ] }, { "cell_type": "markdown", "id": "8ad929f7", "metadata": {}, "source": [ "## 2. 
Load GLiNER2 (warm: model already cached)" ] }, { "cell_type": "code", "execution_count": 2, "id": "998dd198", "metadata": { "execution": { "iopub.execute_input": "2026-05-04T13:42:18.413727Z", "iopub.status.busy": "2026-05-04T13:42:18.413052Z", "iopub.status.idle": "2026-05-04T13:42:31.934909Z", "shell.execute_reply": "2026-05-04T13:42:31.934026Z" } }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "You are using a model of type extractor to instantiate a model of type . This is not supported for all configurations of models and can yield errors.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "============================================================\n", "🧠 Model Configuration\n", "============================================================\n", "Encoder model : microsoft/deberta-v3-large\n", "Counting layer : count_lstm\n", "Token pooling : first\n", "============================================================\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "GLiNER2 ready in 13.5s\n" ] } ], "source": [ "t0 = time.time()\n", "model = GLiNER2.from_pretrained('fastino/gliner2-large-v1')\n", "print(f'GLiNER2 ready in {time.time()-t0:.1f}s')" ] }, { "cell_type": "markdown", "id": "101c5829", "metadata": {}, "source": [ "## 3. Benchmark summary across 4 corpora\n", "\n", "Data from `run_benchmark_v2.py`, run on 2026-05-04. Each row is one GLiNER2 pass over a corpus using that corpus's schema (entities + relations)." 
] }, { "cell_type": "code", "execution_count": 3, "id": "a3673e52", "metadata": { "execution": { "iopub.execute_input": "2026-05-04T13:42:31.936688Z", "iopub.status.busy": "2026-05-04T13:42:31.936494Z", "iopub.status.idle": "2026-05-04T13:42:31.947737Z", "shell.execute_reply": "2026-05-04T13:42:31.946991Z" } }, "outputs": [ { "data": { "text/html": [ "
| | corpus | chars | words | time_s | ents | rels | rels/word |
|---|---|---|---|---|---|---|---|
| 0 | es_corporate_short | 658 | 104 | 1.185 | 14 | 8 | 0.0769 |
| 1 | es_corporate_long | 2582 | 400 | 4.212 | 60 | 6 | 0.0150 |
| 2 | es_osint | 724 | 98 | 1.071 | 11 | 5 | 0.0510 |
| 3 | en_corporate_short | 314 | 49 | 0.767 | 9 | 9 | 0.1837 |