"""Construye notebooks/04_gliner2_winner.ipynb — la conclusion empirica. GLiNER2 (Apache 2.0, NER+RE joint, 340M, multilingue ES/EN/FR) gana frente a la stack actual GLiNER+GLiREL/mREBEL en velocidad, mantiene calidad similar/mejor, y SI funciona en OSINT castellano. Datos: benchmark_v2.json (run_benchmark_v2.py). """ from __future__ import annotations import json from pathlib import Path import nbformat as nbf HERE = Path(__file__).resolve().parent NB_PATH = HERE / "notebooks" / "04_gliner2_winner.ipynb" def _md(text: str): return nbf.v4.new_markdown_cell(text) def _code(src: str): cell = nbf.v4.new_code_cell(src) cell.outputs = [] cell.execution_count = None return cell def build(): cells = [] cells.append(_md( "# GLiNER2 — el modelo unico para `graph_explorer`\n\n" "Tras descartar GLiREL (notebook 02) y aceptar mREBEL con caveat de licencia (notebook 03), " "encontramos **`fastino/gliner2-large-v1`**: NER + RE en un solo modelo, **Apache 2.0**, " "soporta castellano nativo, **20-30× mas rapido** que mREBEL.\n\n" "| | GLiNER + GLiREL | GLiNER + mREBEL | **GLiNER2** |\n" "|---|---|---|---|\n" "| Modelos | 2 | 2 | **1** |\n" "| Tamaño total | 2.1 GB | 3.0 GB | **0.7 GB** |\n" "| Latencia 8 frases ES | 1.0s | 25s | **1.2s** |\n" "| Latencia 30 frases ES | ~3s | ~90s | **4.2s** |\n" "| Calidad ES corporate | 1 falsa | 4/5 OK | **5-6/8 OK** |\n" "| Calidad ES OSINT | sin probar | sin probar | **funciona** |\n" "| Licencia | Apache 2.0 | CC BY-NC-SA 4.0 | **Apache 2.0** |\n" "| Idioma | EN-centric | 18 idiomas | EN/ES/FR |\n\n" "Este notebook empotra los datos del benchmark v2 (`benchmark_v2.json`) y construye el grafo final." )) cells.append(_md("## 1. 
Setup")) cells.append(_code( "import os, sys, json, warnings, time\n" "warnings.filterwarnings('ignore')\n" "os.environ.setdefault('HF_HUB_DISABLE_PROGRESS_BARS', '1')\n" "from pathlib import Path\n" "\n" "_pf = '/home/lucas/fn_registry/python/functions'\n" "sys.path = [p for p in sys.path if not p.startswith(_pf + '/')]\n" "if _pf not in sys.path: sys.path.insert(0, _pf)\n" "\n" "import pandas as pd\n" "import networkx as nx\n" "import matplotlib.pyplot as plt\n" "from gliner2 import GLiNER2\n" "\n" "BENCH = json.loads(Path('../benchmark_v2.json').read_text())\n" "print('corpora benchmarked:', list(BENCH.keys()))" )) cells.append(_md("## 2. Cargar GLiNER2 (warm — modelo cacheado)")) cells.append(_code( "t0 = time.time()\n" "model = GLiNER2.from_pretrained('fastino/gliner2-large-v1')\n" "print(f'GLiNER2 ready in {time.time()-t0:.1f}s')" )) cells.append(_md( "## 3. Resumen del benchmark sobre 4 corpora\n\n" "Datos de `run_benchmark_v2.py` corrido el 2026-05-04. Cada fila es una pasada GLiNER2 con su schema (entities + relations) sobre el corpus." )) cells.append(_code( "rows = []\n" "for k, d in BENCH.items():\n" " rows.append({\n" " 'corpus': k, 'chars': d['n_chars'], 'words': d['n_words'],\n" " 'time_s': d['elapsed_s'], 'ents': d['n_entities'],\n" " 'rels': d['n_relations'], 'rels/word': round(d['n_relations']/d['n_words'], 4),\n" " })\n" "df = pd.DataFrame(rows)\n" "df" )) cells.append(_md( "**Lectura:**\n\n" "- `es_corporate_short` (8 frases, 104 words): 14 ents, 8 rels en 1.2s. **Comparable a mREBEL pero 20× mas rapido**.\n" "- `es_corporate_long` (30 frases, 400 words): 60 ents (excelente recall), 6 rels (recall bajo en relaciones — texto largo). Necesita chunking para mejorar.\n" "- `es_osint` (6 frases, 98 words): 11 ents incluyendo IPs, hashes, CVEs, dominios defanged + 5 relaciones tipadas — **funciona en ciberseguridad castellana**.\n" "- `en_corporate_short` (4 frases): 9 rels — mejor recall en EN que en ES." )) cells.append(_md("## 4. 
Caso 1 — es_corporate_short (8 frases)\n\nEl mismo corpus que notebook 02 y 03. Evaluacion manual de calidad.")) cells.append(_code( "data = BENCH['es_corporate_short']\n" "print('ENTITIES')\n" "for typ, names in data['entities'].items():\n" " print(f' {typ}: {names}')\n" "print('\\nRELATIONS')\n" "for rt, pairs in data['relations'].items():\n" " for h, t in pairs:\n" " print(f' {h:35s} --[{rt:20s}]--> {t}')" )) cells.append(_md( "**Verdict manual (8 relaciones):**\n\n" "| # | Relacion | Verdict |\n" "|---|---|---|\n" "| 1 | `Pablo Isla works_at Inditex` | ✅ correcto (era expresidente) |\n" "| 2 | `Pablo Isla appointed_as consejero de Telefonica` | ✅ correcto |\n" "| 3 | `Marina Serrano ceo_of Endesa` | ✅ correcto |\n" "| 4 | `Ignacio Galan president_of Iberdrola` | ✅ correcto |\n" "| 5 | `Ignacio Galan president_of Iberdrola` (DUP) | ⚠️ duplicado — dedupe pendiente |\n" "| 6 | `Inditex headquartered_in Arteixo, A Coruna` | ✅ correcto |\n" "| 7 | `Iberdrola agreement_with Endesa` | ✅ correcto |\n" "| 8 | `Inditex acquired Pablo Isla` | ❌ falso — ruido |\n\n" "**6/8 correctas, 1 duplicado, 1 falso.** Comparado con mREBEL (4/5 alineadas correctas) y GLiREL (~3/51), GLiNER2 esta a la altura y es 20× mas rapido." )) cells.append(_md("## 5. 
Visualizacion del grafo — es_corporate_short")) cells.append(_code( "TYPE_COLOR = {'person': '#5DA5DA', 'organization': '#F17CB0', 'location': '#60BD68'}\n" "TYPE_EN = {'persona': 'person', 'organizacion': 'organization', 'ubicacion': 'location'}\n" "\n" "def build_graph(data, type_color=TYPE_COLOR):\n" " G = nx.DiGraph()\n" " for typ, names in data['entities'].items():\n" " norm_typ = TYPE_EN.get(typ, typ)\n" " for n in names:\n" " G.add_node(n, type=norm_typ)\n" " seen = set()\n" " for rt, pairs in data['relations'].items():\n" " for h, t in pairs:\n" " key = (h, t, rt)\n" " if key in seen: continue\n" " seen.add(key)\n" " G.add_edge(h, t, kind=rt)\n" " return G\n" "\n" "def draw(ax, G, title):\n" " if G.number_of_nodes() == 0:\n" " ax.set_title(title + ' (empty)'); ax.axis('off'); return\n" " pos = nx.spring_layout(G, k=2.2, iterations=80, seed=42)\n" " cols = [TYPE_COLOR.get(G.nodes[n].get('type'), '#bbb') for n in G.nodes]\n" " nx.draw_networkx_nodes(G, pos, node_color=cols, node_size=1800, edgecolors='#333', linewidths=1.4, ax=ax)\n" " nx.draw_networkx_labels(G, pos, font_size=8, font_weight='bold', ax=ax)\n" " nx.draw_networkx_edges(G, pos, edge_color='#888', arrows=True, arrowsize=14, width=1.2, alpha=0.7, ax=ax, connectionstyle='arc3,rad=0.08')\n" " el = {(u,v): d['kind'] for u,v,d in G.edges(data=True)}\n" " nx.draw_networkx_edge_labels(G, pos, edge_labels=el, font_size=6.5, ax=ax,\n" " bbox=dict(boxstyle='round,pad=0.1', fc='white', ec='none', alpha=0.85))\n" " ax.set_title(f'{title}: {G.number_of_nodes()} ents, {G.number_of_edges()} rels', fontsize=11)\n" " ax.axis('off')\n" "\n" "G_short = build_graph(BENCH['es_corporate_short'])\n" "fig, ax = plt.subplots(figsize=(12, 8))\n" "draw(ax, G_short, 'es_corporate_short — GLiNER2')\n" "from matplotlib.patches import Patch\n" "legend = [Patch(facecolor=c, edgecolor='#333', label=t) for t, c in TYPE_COLOR.items()]\n" "ax.legend(handles=legend, loc='upper left', fontsize=10)\n" "plt.tight_layout(); plt.show()" 
    ))

    cells.append(_md(
        "## 6. Case 2: es_osint (game-changer)\n\n"
        "Text about an APT-29 cyberattack with real IoCs. The schema uses specific labels: `ip_address`, `dominio`, `vulnerabilidad`, `malware`, `hash`, `username`. **Until now, no model in the benchmark covered OSINT in Spanish.**"
    ))
    cells.append(_code(
        "data = BENCH['es_osint']\n"
        "print('ENTITIES')\n"
        "for typ, names in data['entities'].items():\n"
        "    if names: print(f'  {typ:18s}: {names}')\n"
        "print('\\nRELATIONS')\n"
        "for rt, pairs in data['relations'].items():\n"
        "    for h, t in pairs:\n"
        "        print(f'  {h:38s} --[{rt:20s}]--> {t}')"
    ))

    cells.append(_md(
        "**Spanish OSINT works.** GLiNER2 detects:\n"
        "- IP `185.220.101.45`\n"
        "- Defanged domain `cloudfront-cdn[.]net` (it recognizes OSINT syntax!)\n"
        "- Username `@phantomzero`\n"
        "- CVE `CVE-2024-21412`\n"
        "- Malware `CozyBear`\n"
        "- Hash `a3f5e8c9b1d2e3f4a5b6c7d8e9f0a1b2`\n"
        "- Orgs `APT-29`, `CCN-CERT`, `Telefonica Tech`\n\n"
        "Relations:\n\n"
        "| # | Relation | Verdict |\n"
        "|---|---|---|\n"
        "| 1 | `campana de phishing targets empresas energeticas espanolas` | ⚠️ noisy span but correct |\n"
        "| 2 | `CozyBear exploits CVE-2024-21412` | ✅ correct |\n"
        "| 3 | `malware uses CozyBear` | ⚠️ ambiguous direction |\n"
        "| 4 | `grupo APT-29 attributed_to Rusia` | ✅ correct |\n"
        "| 5 | `servidor de comando y control communicates_with sistemas internos de Iberdrola` | ⚠️ noisy span but correct |\n\n"
        "**2/5 unambiguously correct, 2 correct with noisy spans, 1 with ambiguous direction.** No serious false positives."
    ))
    cells.append(_code(
        "# Extend the color mapping to the Spanish OSINT labels. The graph is built by hand\n"
        "# (not via build_graph) so nodes keep the raw schema labels that OSINT_COLOR is keyed on.\n"
        "OSINT_COLOR = {'persona': '#5DA5DA', 'organizacion': '#F17CB0', 'ubicacion': '#60BD68',\n"
        "               'ip_address': '#FAA43A', 'dominio': '#F15854', 'username': '#B276B2',\n"
        "               'vulnerabilidad': '#DECF3F', 'malware': '#7C7C7C', 'hash': '#6C6C6C', 'url': '#FAA43A'}\n"
        "G_osint = nx.DiGraph()\n"
        "for typ, names in BENCH['es_osint']['entities'].items():\n"
        "    for n in names: G_osint.add_node(n, type=typ)\n"
        "seen = set()\n"
        "for rt, pairs in BENCH['es_osint']['relations'].items():\n"
        "    for h, t in pairs:\n"
        "        if (h,t,rt) not in seen:\n"
        "            seen.add((h,t,rt)); G_osint.add_edge(h, t, kind=rt)\n"
        "\n"
        "fig, ax = plt.subplots(figsize=(13, 9))\n"
        "if G_osint.number_of_nodes() > 0:\n"
        "    pos = nx.spring_layout(G_osint, k=2.5, iterations=80, seed=42)\n"
        "    cols = [OSINT_COLOR.get(G_osint.nodes[n].get('type'), '#bbb') for n in G_osint.nodes]\n"
        "    nx.draw_networkx_nodes(G_osint, pos, node_color=cols, node_size=1800, edgecolors='#333', linewidths=1.4, ax=ax)\n"
        "    nx.draw_networkx_labels(G_osint, pos, font_size=8, font_weight='bold', ax=ax)\n"
        "    nx.draw_networkx_edges(G_osint, pos, edge_color='#888', arrows=True, arrowsize=14, width=1.2, alpha=0.7, ax=ax, connectionstyle='arc3,rad=0.1')\n"
        "    el = {(u,v): d['kind'] for u,v,d in G_osint.edges(data=True)}\n"
        "    nx.draw_networkx_edge_labels(G_osint, pos, edge_labels=el, font_size=6.5, ax=ax,\n"
        "                                 bbox=dict(boxstyle='round,pad=0.1', fc='white', ec='none', alpha=0.85))\n"
        "ax.set_title(f'es_osint - GLiNER2: {G_osint.number_of_nodes()} ents, {G_osint.number_of_edges()} rels', fontsize=11)\n"
        "ax.axis('off')\n"
        "from matplotlib.patches import Patch\n"
        "legend = [Patch(facecolor=c, edgecolor='#333', label=t) for t, c in OSINT_COLOR.items() if t in {n[1].get('type') for n in G_osint.nodes(data=True)}]\n"
        "ax.legend(handles=legend, loc='upper left', fontsize=8)\n"
        "plt.tight_layout(); plt.show()"
    ))

    cells.append(_md(
        "## 7. Case 3: es_corporate_long (limitation: low relation recall)\n\n"
        "Extended 30-sentence text about the Spanish corporate sector. **60 entities extracted correctly** but only **6 relations**: the model is very selective when the context is dense."
    ))
    cells.append(_code(
        "data = BENCH['es_corporate_long']\n"
        "print(f'{data[\"n_entities\"]} entities, {data[\"n_relations\"]} relations, {data[\"elapsed_s\"]}s')\n"
        "print('\\nEntity SAMPLE (first 10 people):', data['entities']['person'][:10])\n"
        "print('\\nRELATIONS (all):')\n"
        "for rt, pairs in data['relations'].items():\n"
        "    for h, t in pairs:\n"
        "        print(f'  {h:35s} --[{rt:20s}]--> {t}')"
    ))

    cells.append(_md(
        "**Reading:** 60 entities from 30 sentences is good recall: it captures the whole cast (Pablo Isla, Amancio Ortega, Marta Ortega, Ana Botin, Ignacio Galan, Patrick Pouyanne, Andy Jassy, Mariano Rajoy...). But there are **only 6 relations for so many explicit facts**. Hypotheses:\n\n"
        "1. **Long text drowns the model**: attention is diluted across sentences.\n"
        "2. **It only emits high-confidence relations**: a preference for precision over recall.\n"
        "3. **Processing sentence by sentence would improve recall**, replicating the mREBEL strategy from notebook 03.\n\n"
        "**Plan:** issue 0042 should cover both modes: `text_mode=joint` (fast, low recall on long text) and `text_mode=sentences` (slower, better recall expected)."
    ))

    cells.append(_md(
        "## 8. Conclusion\n\n"
        "**GLiNER2 replaces the whole current stack (GLiNER + GLiREL/mREBEL) in `extract_graph_hybrid`.** Reasons:\n\n"
        "1. **Apache 2.0**: no commercial restriction. Resolves the mREBEL caveat.\n"
        "2. **A single model**: 0.7 GB vs 2.1-3.0 GB for the current stack.\n"
        "3. **20× faster** than mREBEL at the same quality.\n"
        "4. **Works on Spanish OSINT**: a game-changer for the real use case of `graph_explorer`.\n"
        "5. **Same schema paradigm**: `entities([...]).relations([...])` is ergonomic.\n\n"
        "**Accepted limitations:**\n\n"
        "- Relation recall drops on long text (>20 sentences). Mitigate with sentence-level chunking.\n"
        "- Some occasional semantic errors (e.g. `Inditex acquired Pablo Isla`); dedupe plus the human filter in the `paste_extract` panel cover them.\n"
        "- Only supports EN/ES/FR (vs 18 languages for mREBEL), which is irrelevant for our use case.\n\n"
        "## Migration plan\n\n"
        "1. **Replace issue 0042** (mREBEL) with **issue 0042-revised**: GLiNER2 replaces GLiREL in `extract_graph_hybrid`, with two execution modes (joint / chunked-by-sentence). mREBEL remains an option at P3.\n"
        "2. **New functions in the registry:**\n"
        "   - `gliner2_load_model_py_datascience`: cached loader (Apache 2.0)\n"
        "   - `extract_graph_gliner2_py_datascience`: schema construction + extract + normalization to `EntityCandidate`/`RelationCandidate`\n"
        "   - `extract_graph_gliner2_chunked_py_pipelines`: sentence-by-sentence version for long text\n"
        "3. **Update the `extract_panel.cpp` panel**: the engine combo becomes `[GLiNER2 (recommended) | GLiNER+GLiREL (legacy) | GLiNER+mREBEL (non-commercial)]`. Default GLiNER2.\n"
        "4. **Vault `osint_nlp_models`**: update README + create `models/gliner2.md` with these findings. Move `mrebel.md` to 'fallback' status.\n\n"
        "**To try later (queue in `vaults/osint_nlp_models/models/candidates.md`):**\n"
        "- `fastino/gliner2-base-v1` (205M, even smaller): confirm that quality holds.\n"
        "- GLiNER2 with threshold tuning (if the API exposes it).\n"
        "- GLiNER2 + sentence-level chunking for long corpora (long_text experiment, pending)."
    ))

    nb = nbf.v4.new_notebook()
    nb.cells = cells
    nb.metadata = {
        "kernelspec": {"display_name": "Python 3", "language": "python", "name": "python3"},
        "language_info": {"name": "python"},
    }
    NB_PATH.parent.mkdir(parents=True, exist_ok=True)
    nbf.write(nb, NB_PATH)
    print(f"[done] {NB_PATH} cells={len(cells)}")


if __name__ == "__main__":
    build()
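

# ---------------------------------------------------------------------------
# Sketch only, not called by build(). Section 7 of the notebook hypothesizes
# that running the extractor sentence by sentence would recover relation
# recall on long text, and the accepted limitations mention a pending dedupe.
# This is one minimal way that could look, under stated assumptions:
#   - `extract_fn` is a HYPOTHETICAL callable supplied by the caller. It is
#     not the GLiNER2 API. It takes one sentence and returns relations in the
#     same shape as benchmark_v2.json: {relation_type: [(head, tail), ...]}.
#   - The sentence splitter is deliberately naive (it breaks after ., ! or ?
#     followed by whitespace) and will mis-split abbreviations such as "Sr.".
# ---------------------------------------------------------------------------
import re
from typing import Callable, Dict, List, Tuple

Relations = Dict[str, List[Tuple[str, str]]]


def split_sentences(text: str) -> List[str]:
    """Split text after sentence-ending punctuation; drop empty pieces."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def extract_relations_chunked(text: str, extract_fn: Callable[[str], Relations]) -> Relations:
    """Run extract_fn on each sentence and merge results, dropping exact duplicates."""
    merged: Relations = {}
    seen = set()
    for sentence in split_sentences(text):
        for rel_type, pairs in extract_fn(sentence).items():
            for head, tail in pairs:
                key = (head, tail, rel_type)
                if key in seen:
                    continue  # same (head, tail, type) already emitted by an earlier sentence
                seen.add(key)
                merged.setdefault(rel_type, []).append((head, tail))
    return merged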