b8c760d004
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
309 lines
16 KiB
Python
"""Builds notebooks/04_gliner2_winner.ipynb: the empirical conclusion.

GLiNER2 (Apache 2.0, joint NER+RE, 340M, multilingual ES/EN/FR) beats the
current GLiNER+GLiREL/mREBEL stack on speed (roughly 20× faster than mREBEL,
on par with GLiREL), keeps similar or better quality, and DOES work on
Spanish-language OSINT.

Data: benchmark_v2.json (run_benchmark_v2.py).
"""
from __future__ import annotations

import json
from pathlib import Path

import nbformat as nbf

HERE = Path(__file__).resolve().parent
NB_PATH = HERE / "notebooks" / "04_gliner2_winner.ipynb"


def _md(text: str):
    """Return a markdown cell containing `text`."""
    return nbf.v4.new_markdown_cell(text)


def _code(src: str):
    """Return a code cell with no outputs and no execution count."""
    cell = nbf.v4.new_code_cell(src)
    cell.outputs = []
    cell.execution_count = None
    return cell


def build():
    """Assemble the notebook cells and write the notebook to NB_PATH."""
    cells = []

    cells.append(_md(
        "# GLiNER2: the single model for `graph_explorer`\n\n"
        "After discarding GLiREL (notebook 02) and accepting mREBEL with a license caveat (notebook 03), "
        "we found **`fastino/gliner2-large-v1`**: NER + RE in a single model, **Apache 2.0**, "
        "native Spanish support, and **about 20× faster** than mREBEL.\n\n"
        "| | GLiNER + GLiREL | GLiNER + mREBEL | **GLiNER2** |\n"
        "|---|---|---|---|\n"
        "| Models | 2 | 2 | **1** |\n"
        "| Total size | 2.1 GB | 3.0 GB | **0.7 GB** |\n"
        "| Latency, 8 ES sentences | 1.0 s | 25 s | **1.2 s** |\n"
        "| Latency, 30 ES sentences | ~3 s | ~90 s | **4.2 s** |\n"
        "| Quality, ES corporate | 1 false | 4/5 OK | **5-6/8 OK** |\n"
        "| Quality, ES OSINT | untested | untested | **works** |\n"
        "| License | Apache 2.0 | CC BY-NC-SA 4.0 | **Apache 2.0** |\n"
        "| Languages | EN-centric | 18 languages | EN/ES/FR |\n\n"
        "This notebook loads the benchmark v2 data (`benchmark_v2.json`) and builds the final graph."
    ))

    cells.append(_md("## 1. Setup"))

    cells.append(_code(
        "import os, sys, json, warnings, time\n"
        "warnings.filterwarnings('ignore')\n"
        "os.environ.setdefault('HF_HUB_DISABLE_PROGRESS_BARS', '1')\n"
        "from pathlib import Path\n"
        "\n"
        "_pf = '/home/lucas/fn_registry/python/functions'\n"
        "sys.path = [p for p in sys.path if not p.startswith(_pf + '/')]\n"
        "if _pf not in sys.path: sys.path.insert(0, _pf)\n"
        "\n"
        "import pandas as pd\n"
        "import networkx as nx\n"
        "import matplotlib.pyplot as plt\n"
        "from gliner2 import GLiNER2\n"
        "\n"
        "BENCH = json.loads(Path('../benchmark_v2.json').read_text())\n"
        "print('corpora benchmarked:', list(BENCH.keys()))"
    ))

    cells.append(_md("## 2. Load GLiNER2 (warm: model already cached)"))

    cells.append(_code(
        "t0 = time.time()\n"
        "model = GLiNER2.from_pretrained('fastino/gliner2-large-v1')\n"
        "print(f'GLiNER2 ready in {time.time()-t0:.1f}s')"
    ))

    cells.append(_md(
        "## 3. Benchmark summary across 4 corpora\n\n"
        "Data from `run_benchmark_v2.py`, run on 2026-05-04. Each row is one GLiNER2 pass over the corpus with its schema (entities + relations)."
    ))

    cells.append(_code(
        "rows = []\n"
        "for k, d in BENCH.items():\n"
        "    rows.append({\n"
        "        'corpus': k, 'chars': d['n_chars'], 'words': d['n_words'],\n"
        "        'time_s': d['elapsed_s'], 'ents': d['n_entities'],\n"
        "        'rels': d['n_relations'], 'rels/word': round(d['n_relations']/d['n_words'], 4),\n"
        "    })\n"
        "df = pd.DataFrame(rows)\n"
        "df"
    ))

    cells.append(_md(
        "**Reading:**\n\n"
        "- `es_corporate_short` (8 sentences, 104 words): 14 ents, 8 rels in 1.2 s. **Comparable to mREBEL but about 20× faster**.\n"
        "- `es_corporate_long` (30 sentences, 400 words): 60 ents (excellent recall), 6 rels (low relation recall on long text). Needs chunking to improve.\n"
        "- `es_osint` (6 sentences, 98 words): 11 ents including IPs, hashes, CVEs and defanged domains, plus 5 typed relations. **It works on Spanish-language cybersecurity text**.\n"
        "- `en_corporate_short` (4 sentences): 9 rels, so more relations per sentence in EN than in ES (8 rels from 8 sentences)."
    ))

    cells.append(_md("## 4. Case 1: es_corporate_short (8 sentences)\n\nThe same corpus as notebooks 02 and 03. Manual quality evaluation."))

    cells.append(_code(
        "data = BENCH['es_corporate_short']\n"
        "print('ENTITIES')\n"
        "for typ, names in data['entities'].items():\n"
        "    print(f'  {typ}: {names}')\n"
        "print('\\nRELATIONS')\n"
        "for rt, pairs in data['relations'].items():\n"
        "    for h, t in pairs:\n"
        "        print(f'  {h:35s} --[{rt:20s}]--> {t}')"
    ))

    cells.append(_md(
        "**Manual verdict (8 relations):**\n\n"
        "| # | Relation | Verdict |\n"
        "|---|---|---|\n"
        "| 1 | `Pablo Isla works_at Inditex` | ✅ correct (he was its former president) |\n"
        "| 2 | `Pablo Isla appointed_as consejero de Telefonica` | ✅ correct |\n"
        "| 3 | `Marina Serrano ceo_of Endesa` | ✅ correct |\n"
        "| 4 | `Ignacio Galan president_of Iberdrola` | ✅ correct |\n"
        "| 5 | `Ignacio Galan president_of Iberdrola` (DUP) | ⚠️ duplicate, dedupe pending |\n"
        "| 6 | `Inditex headquartered_in Arteixo, A Coruna` | ✅ correct |\n"
        "| 7 | `Iberdrola agreement_with Endesa` | ✅ correct |\n"
        "| 8 | `Inditex acquired Pablo Isla` | ❌ false, noise |\n\n"
        "**6/8 correct, 1 duplicate, 1 false.** Compared with mREBEL (4/5 aligned relations correct) and GLiREL (~3/51), GLiNER2 holds its own and is about 20× faster than mREBEL."
    ))

    cells.append(_md("## 5. Graph visualization: es_corporate_short"))

    cells.append(_code(
        "TYPE_COLOR = {'person': '#5DA5DA', 'organization': '#F17CB0', 'location': '#60BD68'}\n"
        "# Map Spanish schema labels to the English keys used by TYPE_COLOR.\n"
        "TYPE_EN = {'persona': 'person', 'organizacion': 'organization', 'ubicacion': 'location'}\n"
        "\n"
        "def build_graph(data, type_color=TYPE_COLOR):\n"
        "    G = nx.DiGraph()\n"
        "    for typ, names in data['entities'].items():\n"
        "        norm_typ = TYPE_EN.get(typ, typ)\n"
        "        for n in names:\n"
        "            G.add_node(n, type=norm_typ)\n"
        "    # Skip duplicate (head, tail, relation) triples.\n"
        "    seen = set()\n"
        "    for rt, pairs in data['relations'].items():\n"
        "        for h, t in pairs:\n"
        "            key = (h, t, rt)\n"
        "            if key in seen: continue\n"
        "            seen.add(key)\n"
        "            G.add_edge(h, t, kind=rt)\n"
        "    return G\n"
        "\n"
        "def draw(ax, G, title):\n"
        "    if G.number_of_nodes() == 0:\n"
        "        ax.set_title(title + ' (empty)'); ax.axis('off'); return\n"
        "    pos = nx.spring_layout(G, k=2.2, iterations=80, seed=42)\n"
        "    cols = [TYPE_COLOR.get(G.nodes[n].get('type'), '#bbb') for n in G.nodes]\n"
        "    nx.draw_networkx_nodes(G, pos, node_color=cols, node_size=1800, edgecolors='#333', linewidths=1.4, ax=ax)\n"
        "    nx.draw_networkx_labels(G, pos, font_size=8, font_weight='bold', ax=ax)\n"
        "    nx.draw_networkx_edges(G, pos, edge_color='#888', arrows=True, arrowsize=14, width=1.2, alpha=0.7, ax=ax, connectionstyle='arc3,rad=0.08')\n"
        "    el = {(u,v): d['kind'] for u,v,d in G.edges(data=True)}\n"
        "    nx.draw_networkx_edge_labels(G, pos, edge_labels=el, font_size=6.5, ax=ax,\n"
        "                                 bbox=dict(boxstyle='round,pad=0.1', fc='white', ec='none', alpha=0.85))\n"
        "    ax.set_title(f'{title}: {G.number_of_nodes()} ents, {G.number_of_edges()} rels', fontsize=11)\n"
        "    ax.axis('off')\n"
        "\n"
        "G_short = build_graph(BENCH['es_corporate_short'])\n"
        "fig, ax = plt.subplots(figsize=(12, 8))\n"
        "draw(ax, G_short, 'es_corporate_short (GLiNER2)')\n"
        "from matplotlib.patches import Patch\n"
        "legend = [Patch(facecolor=c, edgecolor='#333', label=t) for t, c in TYPE_COLOR.items()]\n"
        "ax.legend(handles=legend, loc='upper left', fontsize=10)\n"
        "plt.tight_layout(); plt.show()"
    ))

    cells.append(_md(
        "## 6. Case 2: es_osint (game-changer)\n\n"
        "Text about an APT-29 cyberattack with real IoCs. Schema with specific labels: `ip_address`, `dominio`, `vulnerabilidad`, `malware`, `hash`, `username`. **Until now, no model in the benchmark covered OSINT in Spanish.**"
    ))

    cells.append(_code(
        "data = BENCH['es_osint']\n"
        "print('ENTITIES')\n"
        "for typ, names in data['entities'].items():\n"
        "    if names: print(f'  {typ:18s}: {names}')\n"
        "print('\\nRELATIONS')\n"
        "for rt, pairs in data['relations'].items():\n"
        "    for h, t in pairs:\n"
        "        print(f'  {h:38s} --[{rt:20s}]--> {t}')"
    ))

    cells.append(_md(
        "**OSINT in Spanish works.** GLiNER2 detects:\n\n"
        "- IP `185.220.101.45`\n"
        "- Defanged domain `cloudfront-cdn[.]net` (it recognizes OSINT syntax!)\n"
        "- Username `@phantomzero`\n"
        "- CVE `CVE-2024-21412`\n"
        "- Malware `CozyBear`\n"
        "- Hash `a3f5e8c9b1d2e3f4a5b6c7d8e9f0a1b2`\n"
        "- Orgs `APT-29`, `CCN-CERT`, `Telefonica Tech`\n\n"
        "Relations:\n\n"
        "| # | Relation | Verdict |\n"
        "|---|---|---|\n"
        "| 1 | `campana de phishing targets empresas energeticas espanolas` | ⚠️ noisy span but correct |\n"
        "| 2 | `CozyBear exploits CVE-2024-21412` | ✅ correct |\n"
        "| 3 | `malware uses CozyBear` | ⚠️ ambiguous direction |\n"
        "| 4 | `grupo APT-29 attributed_to Rusia` | ✅ correct |\n"
        "| 5 | `servidor de comando y control communicates_with sistemas internos de Iberdrola` | ⚠️ noisy span but correct |\n\n"
        "**2/5 unequivocally correct, 2 correct with noisy spans, 1 with ambiguous direction.** No serious false positives."
    ))

    cells.append(_code(
        "# Extend the color mapping to the Spanish OSINT labels.\n"
        "OSINT_COLOR = {'persona': '#5DA5DA', 'organizacion': '#F17CB0', 'ubicacion': '#60BD68',\n"
        "               'ip_address': '#FAA43A', 'dominio': '#F15854', 'username': '#B276B2',\n"
        "               'vulnerabilidad': '#DECF3F', 'malware': '#7C7C7C', 'hash': '#6C6C6C', 'url': '#FAA43A'}\n"
        "# Build the graph inline instead of calling build_graph(): that helper remaps\n"
        "# labels through TYPE_EN, and OSINT_COLOR is keyed on the original labels.\n"
        "G_osint = nx.DiGraph()\n"
        "for typ, names in BENCH['es_osint']['entities'].items():\n"
        "    for n in names: G_osint.add_node(n, type=typ)\n"
        "seen = set()\n"
        "for rt, pairs in BENCH['es_osint']['relations'].items():\n"
        "    for h, t in pairs:\n"
        "        if (h,t,rt) not in seen:\n"
        "            seen.add((h,t,rt)); G_osint.add_edge(h, t, kind=rt)\n"
        "\n"
        "fig, ax = plt.subplots(figsize=(13, 9))\n"
        "if G_osint.number_of_nodes() > 0:\n"
        "    pos = nx.spring_layout(G_osint, k=2.5, iterations=80, seed=42)\n"
        "    cols = [OSINT_COLOR.get(G_osint.nodes[n].get('type'), '#bbb') for n in G_osint.nodes]\n"
        "    nx.draw_networkx_nodes(G_osint, pos, node_color=cols, node_size=1800, edgecolors='#333', linewidths=1.4, ax=ax)\n"
        "    nx.draw_networkx_labels(G_osint, pos, font_size=8, font_weight='bold', ax=ax)\n"
        "    nx.draw_networkx_edges(G_osint, pos, edge_color='#888', arrows=True, arrowsize=14, width=1.2, alpha=0.7, ax=ax, connectionstyle='arc3,rad=0.1')\n"
        "    el = {(u,v): d['kind'] for u,v,d in G_osint.edges(data=True)}\n"
        "    nx.draw_networkx_edge_labels(G_osint, pos, edge_labels=el, font_size=6.5, ax=ax,\n"
        "                                 bbox=dict(boxstyle='round,pad=0.1', fc='white', ec='none', alpha=0.85))\n"
        "ax.set_title(f'es_osint (GLiNER2): {G_osint.number_of_nodes()} ents, {G_osint.number_of_edges()} rels', fontsize=11)\n"
        "ax.axis('off')\n"
        "from matplotlib.patches import Patch\n"
        "legend = [Patch(facecolor=c, edgecolor='#333', label=t) for t, c in OSINT_COLOR.items() if t in {n[1].get('type') for n in G_osint.nodes(data=True)}]\n"
        "ax.legend(handles=legend, loc='upper left', fontsize=8)\n"
        "plt.tight_layout(); plt.show()"
    ))

    cells.append(_md(
        "## 7. Case 3: es_corporate_long (limitation: low relation recall)\n\n"
        "Extended 30-sentence text about the Spanish business sector. **60 entities extracted correctly** but only **6 relations**: the model is very selective when the context is dense."
    ))

    cells.append(_code(
        "data = BENCH['es_corporate_long']\n"
        "print(f'{data[\"n_entities\"]} entities, {data[\"n_relations\"]} relations, {data[\"elapsed_s\"]}s')\n"
        "print('\\nSAMPLE of entities (first 10 people):', data['entities']['person'][:10])\n"
        "print('\\nRELATIONS (all):')\n"
        "for rt, pairs in data['relations'].items():\n"
        "    for h, t in pairs:\n"
        "        print(f'  {h:35s} --[{rt:20s}]--> {t}')"
    ))

    cells.append(_md(
        "**Reading:** 60 entities from 30 sentences is good recall: it captures the whole cast (Pablo Isla, Amancio Ortega, Marta Ortega, Ana Botin, Ignacio Galan, Patrick Pouyanne, Andy Jassy, Mariano Rajoy...). But there are **only 6 relations for that many explicit facts**. Hypotheses:\n\n"
        "1. **Long text overwhelms the model**: attention is diluted across sentences.\n"
        "2. **It only emits high-confidence relations**: it prefers precision over recall.\n"
        "3. **Processing sentence by sentence would improve recall**, replicating the mREBEL strategy from notebook 03.\n\n"
        "**Plan:** issue 0042 should cover both modes: `text_mode=joint` (fast, low recall on long text) and `text_mode=sentences` (slower, better recall)."
    ))

    cells.append(_md(
        "## 8. Conclusion\n\n"
        "**GLiNER2 replaces the entire current stack (GLiNER + GLiREL/mREBEL) in `extract_graph_hybrid`.** Reasons:\n\n"
        "1. **Apache 2.0**: no commercial restriction. This resolves the mREBEL caveat.\n"
        "2. **A single model**: 0.7 GB vs. 2.1-3.0 GB for the current stack.\n"
        "3. **About 20× faster** than mREBEL at comparable quality.\n"
        "4. **Works on Spanish-language OSINT**: a game-changer for the real use case of `graph_explorer`.\n"
        "5. **Same schema paradigm**: `entities([...]).relations([...])` is ergonomic.\n\n"
        "**Accepted limitations:**\n\n"
        "- Relation recall drops on long text (>20 sentences). Mitigate with per-sentence chunking.\n"
        "- Occasional semantic errors (e.g. `Inditex acquired Pablo Isla`): dedupe plus the human filter in the `paste_extract` panel cover them.\n"
        "- Only supports EN/ES/FR (vs. 18 languages for mREBEL), which is irrelevant for our use case.\n\n"
        "## Migration plan\n\n"
        "1. **Replace issue 0042** (mREBEL) with **issue 0042-revised**: GLiNER2 replaces GLiREL in `extract_graph_hybrid`, with two execution modes (joint / chunked-by-sentence). mREBEL remains as a P3 option.\n"
        "2. **New functions in the registry:**\n"
        "   - `gliner2_load_model_py_datascience`: cached loader (Apache 2.0)\n"
        "   - `extract_graph_gliner2_py_datascience`: schema construction + extract + normalization to `EntityCandidate`/`RelationCandidate`\n"
        "   - `extract_graph_gliner2_chunked_py_pipelines`: sentence-by-sentence version for long text\n"
        "3. **Update the `extract_panel.cpp` panel**: the engine combo becomes `[GLiNER2 (recommended) | GLiNER+GLiREL (legacy) | GLiNER+mREBEL (non-commercial)]`. Default: GLiNER2.\n"
        "4. **Vault `osint_nlp_models`**: update the README and create `models/gliner2.md` with these findings. Move `mrebel.md` to 'fallback' status.\n\n"
        "**To try in the future (queue in `vaults/osint_nlp_models/models/candidates.md`):**\n\n"
        "- `fastino/gliner2-base-v1` (205M, smaller still): confirm that quality holds.\n"
        "- GLiNER2 with threshold tuning (if the API exposes it).\n"
        "- GLiNER2 + per-sentence chunking for long corpora (long_text experiment, pending)."
    ))

    nb = nbf.v4.new_notebook()
    nb.cells = cells
    nb.metadata = {
        "kernelspec": {"display_name": "Python 3", "language": "python", "name": "python3"},
        "language_info": {"name": "python"},
    }
    NB_PATH.parent.mkdir(parents=True, exist_ok=True)
    nbf.write(nb, NB_PATH)
    print(f"[done] {NB_PATH} cells={len(cells)}")


if __name__ == "__main__":
    build()