gliner_glirel_tuning/build_notebook_gliner2.py
2026-05-04 23:44:11 +02:00

"""Build notebooks/04_gliner2_winner.ipynb: the empirical conclusion.

GLiNER2 (Apache 2.0, joint NER+RE, 340M, multilingual ES/EN/FR) beats the
current GLiNER+GLiREL/mREBEL stack on speed, keeps similar or better quality,
and DOES work on Spanish-language OSINT.

Data: benchmark_v2.json (run_benchmark_v2.py).
"""
from __future__ import annotations

import json
from pathlib import Path

import nbformat as nbf

HERE = Path(__file__).resolve().parent
NB_PATH = HERE / "notebooks" / "04_gliner2_winner.ipynb"


def _md(text: str):
    """Return a markdown cell containing `text`."""
    return nbf.v4.new_markdown_cell(text)


def _code(src: str):
    """Return a code cell with no stored outputs or execution count."""
    cell = nbf.v4.new_code_cell(src)
    cell.outputs = []
    cell.execution_count = None
    return cell
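

# Illustrative sketch, not part of the original script.
# The notebook built below proposes a `text_mode=sentences` mode that feeds the
# model one sentence at a time to recover relation recall on long text. A
# minimal splitter for that mode could look like the helper below. The name
# `_split_sentences` and the regex rule are assumptions made for illustration;
# a real pipeline may prefer a proper sentence tokenizer.
import re


def _split_sentences(text: str) -> list:
    """Split after '.', '!' or '?' when followed by whitespace; drop empty pieces."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


# Usage (hypothetical): run one extraction per item of _split_sentences(corpus)
# and merge the results. Dots inside IPs or defanged domains are not followed
# by whitespace, so they do not trigger a split.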


def build():
    cells = []

    cells.append(_md(
        "# GLiNER2: the single model for `graph_explorer`\n\n"
        "After discarding GLiREL (notebook 02) and accepting mREBEL with a licensing caveat (notebook 03), "
        "we found **`fastino/gliner2-large-v1`**: NER + RE in a single model, **Apache 2.0**, "
        "native Spanish support, and **about 20× faster** than mREBEL (25s vs 1.2s on 8 sentences).\n\n"
        "| Criterion | GLiNER + GLiREL | GLiNER + mREBEL | **GLiNER2** |\n"
        "|---|---|---|---|\n"
        "| Models | 2 | 2 | **1** |\n"
        "| Total size | 2.1 GB | 3.0 GB | **0.7 GB** |\n"
        "| Latency, 8 ES sentences | 1.0s | 25s | **1.2s** |\n"
        "| Latency, 30 ES sentences | ~3s | ~90s | **4.2s** |\n"
        "| Quality, ES corporate | 1 false | 4/5 OK | **5-6/8 OK** |\n"
        "| Quality, ES OSINT | untested | untested | **works** |\n"
        "| License | Apache 2.0 | CC BY-NC-SA 4.0 | **Apache 2.0** |\n"
        "| Languages | EN-centric | 18 languages | EN/ES/FR |\n\n"
        "This notebook embeds the benchmark v2 data (`benchmark_v2.json`) and builds the final graph."
    ))

    cells.append(_md("## 1. Setup"))

    cells.append(_code(
        "import os, sys, json, warnings, time\n"
        "warnings.filterwarnings('ignore')\n"
        "os.environ.setdefault('HF_HUB_DISABLE_PROGRESS_BARS', '1')\n"
        "from pathlib import Path\n"
        "\n"
        "# Put the local function registry on sys.path, dropping stale sub-paths first.\n"
        "_pf = '/home/lucas/fn_registry/python/functions'\n"
        "sys.path = [p for p in sys.path if not p.startswith(_pf + '/')]\n"
        "if _pf not in sys.path:\n"
        "    sys.path.insert(0, _pf)\n"
        "\n"
        "import pandas as pd\n"
        "import networkx as nx\n"
        "import matplotlib.pyplot as plt\n"
        "from gliner2 import GLiNER2\n"
        "\n"
        "BENCH = json.loads(Path('../benchmark_v2.json').read_text())\n"
        "print('corpora benchmarked:', list(BENCH.keys()))"
    ))

    cells.append(_md("## 2. Load GLiNER2 (warm: model already cached)"))

    cells.append(_code(
        "t0 = time.time()\n"
        "model = GLiNER2.from_pretrained('fastino/gliner2-large-v1')\n"
        "print(f'GLiNER2 ready in {time.time()-t0:.1f}s')"
    ))

    cells.append(_md(
        "## 3. Benchmark summary across 4 corpora\n\n"
        "Data from `run_benchmark_v2.py`, run on 2026-05-04. Each row is one GLiNER2 pass over the corpus with its schema (entities + relations)."
    ))

    cells.append(_code(
        "rows = []\n"
        "for k, d in BENCH.items():\n"
        "    rows.append({\n"
        "        'corpus': k, 'chars': d['n_chars'], 'words': d['n_words'],\n"
        "        'time_s': d['elapsed_s'], 'ents': d['n_entities'],\n"
        "        'rels': d['n_relations'], 'rels/word': round(d['n_relations']/d['n_words'], 4),\n"
        "    })\n"
        "df = pd.DataFrame(rows)\n"
        "df"
    ))

    cells.append(_md(
        "**Reading the table:**\n\n"
        "- `es_corporate_short` (8 sentences, 104 words): 14 entities and 8 relations in 1.2s. **Comparable to mREBEL but 20× faster**.\n"
        "- `es_corporate_long` (30 sentences, 400 words): 60 entities (excellent recall) but only 6 relations (low relation recall on long text). Needs chunking to improve.\n"
        "- `es_osint` (6 sentences, 98 words): 11 entities including IPs, hashes, CVEs and defanged domains, plus 5 typed relations. **It works on Spanish-language cybersecurity text**.\n"
        "- `en_corporate_short` (4 sentences): 9 relations, so recall is better in EN than in ES."
    ))

    cells.append(_md("## 4. Case 1: es_corporate_short (8 sentences)\n\nThe same corpus as notebooks 02 and 03. Manual quality evaluation."))

    cells.append(_code(
        "data = BENCH['es_corporate_short']\n"
        "print('ENTITIES')\n"
        "for typ, names in data['entities'].items():\n"
        "    print(f'  {typ}: {names}')\n"
        "print('\\nRELATIONS')\n"
        "for rt, pairs in data['relations'].items():\n"
        "    for h, t in pairs:\n"
        "        print(f'  {h:35s} --[{rt:20s}]--> {t}')"
    ))

    cells.append(_md(
        "**Manual verdict (8 relations):**\n\n"
        "| # | Relation | Verdict |\n"
        "|---|---|---|\n"
        "| 1 | `Pablo Isla works_at Inditex` | ✅ correct (he was its former chairman) |\n"
        "| 2 | `Pablo Isla appointed_as consejero de Telefonica` | ✅ correct |\n"
        "| 3 | `Marina Serrano ceo_of Endesa` | ✅ correct |\n"
        "| 4 | `Ignacio Galan president_of Iberdrola` | ✅ correct |\n"
        "| 5 | `Ignacio Galan president_of Iberdrola` (DUP) | ⚠️ duplicate, dedupe pending |\n"
        "| 6 | `Inditex headquartered_in Arteixo, A Coruna` | ✅ correct |\n"
        "| 7 | `Iberdrola agreement_with Endesa` | ✅ correct |\n"
        "| 8 | `Inditex acquired Pablo Isla` | ❌ false, noise |\n\n"
        "**6/8 correct, 1 duplicate, 1 false.** Compared with mREBEL (4/5 aligned relations correct) and GLiREL (~3/51), GLiNER2 holds its own on quality and is 20× faster than mREBEL."
    ))

    cells.append(_md("## 5. Graph visualization: es_corporate_short"))

    cells.append(_code(
        "from matplotlib.patches import Patch\n"
        "\n"
        "TYPE_COLOR = {'person': '#5DA5DA', 'organization': '#F17CB0', 'location': '#60BD68'}\n"
        "# Map Spanish schema labels to the English keys used in TYPE_COLOR.\n"
        "TYPE_EN = {'persona': 'person', 'organizacion': 'organization', 'ubicacion': 'location'}\n"
        "\n"
        "def build_graph(data):\n"
        "    G = nx.DiGraph()\n"
        "    for typ, names in data['entities'].items():\n"
        "        norm_typ = TYPE_EN.get(typ, typ)\n"
        "        for n in names:\n"
        "            G.add_node(n, type=norm_typ)\n"
        "    seen = set()  # skip duplicate (head, tail, relation) triples\n"
        "    for rt, pairs in data['relations'].items():\n"
        "        for h, t in pairs:\n"
        "            key = (h, t, rt)\n"
        "            if key in seen:\n"
        "                continue\n"
        "            seen.add(key)\n"
        "            G.add_edge(h, t, kind=rt)\n"
        "    return G\n"
        "\n"
        "def draw(ax, G, title, type_color=TYPE_COLOR):\n"
        "    if G.number_of_nodes() == 0:\n"
        "        ax.set_title(title + ' (empty)')\n"
        "        ax.axis('off')\n"
        "        return\n"
        "    pos = nx.spring_layout(G, k=2.2, iterations=80, seed=42)\n"
        "    cols = [type_color.get(G.nodes[n].get('type'), '#bbb') for n in G.nodes]\n"
        "    nx.draw_networkx_nodes(G, pos, node_color=cols, node_size=1800, edgecolors='#333', linewidths=1.4, ax=ax)\n"
        "    nx.draw_networkx_labels(G, pos, font_size=8, font_weight='bold', ax=ax)\n"
        "    nx.draw_networkx_edges(G, pos, edge_color='#888', arrows=True, arrowsize=14, width=1.2, alpha=0.7, ax=ax, connectionstyle='arc3,rad=0.08')\n"
        "    el = {(u, v): d['kind'] for u, v, d in G.edges(data=True)}\n"
        "    nx.draw_networkx_edge_labels(G, pos, edge_labels=el, font_size=6.5, ax=ax,\n"
        "                                 bbox=dict(boxstyle='round,pad=0.1', fc='white', ec='none', alpha=0.85))\n"
        "    ax.set_title(f'{title}: {G.number_of_nodes()} ents, {G.number_of_edges()} rels', fontsize=11)\n"
        "    ax.axis('off')\n"
        "\n"
        "G_short = build_graph(BENCH['es_corporate_short'])\n"
        "fig, ax = plt.subplots(figsize=(12, 8))\n"
        "draw(ax, G_short, 'es_corporate_short (GLiNER2)')\n"
        "legend = [Patch(facecolor=c, edgecolor='#333', label=t) for t, c in TYPE_COLOR.items()]\n"
        "ax.legend(handles=legend, loc='upper left', fontsize=10)\n"
        "plt.tight_layout()\n"
        "plt.show()"
    ))

    cells.append(_md(
        "## 6. Case 2: es_osint (game-changer)\n\n"
        "Text about an APT-29 cyberattack with real IoCs. The schema uses specific labels: `ip_address`, `dominio`, `vulnerabilidad`, `malware`, `hash`, `username`. **Until now, no model in the benchmark covered OSINT in Spanish.**"
    ))

    cells.append(_code(
        "data = BENCH['es_osint']\n"
        "print('ENTITIES')\n"
        "for typ, names in data['entities'].items():\n"
        "    if names:\n"
        "        print(f'  {typ:18s}: {names}')\n"
        "print('\\nRELATIONS')\n"
        "for rt, pairs in data['relations'].items():\n"
        "    for h, t in pairs:\n"
        "        print(f'  {h:38s} --[{rt:20s}]--> {t}')"
    ))

    cells.append(_md(
        "**OSINT in Spanish works.** GLiNER2 detects:\n"
        "- IP `185.220.101.45`\n"
        "- Defanged domain `cloudfront-cdn[.]net` (it recognizes OSINT defanging syntax!)\n"
        "- Username `@phantomzero`\n"
        "- CVE `CVE-2024-21412`\n"
        "- Malware `CozyBear`\n"
        "- Hash `a3f5e8c9b1d2e3f4a5b6c7d8e9f0a1b2`\n"
        "- Orgs `APT-29`, `CCN-CERT`, `Telefonica Tech`\n\n"
        "Relations:\n\n"
        "| # | Relation | Verdict |\n"
        "|---|---|---|\n"
        "| 1 | `campana de phishing targets empresas energeticas espanolas` | ⚠️ noisy span but correct |\n"
        "| 2 | `CozyBear exploits CVE-2024-21412` | ✅ correct |\n"
        "| 3 | `malware uses CozyBear` | ⚠️ ambiguous direction |\n"
        "| 4 | `grupo APT-29 attributed_to Rusia` | ✅ correct |\n"
        "| 5 | `servidor de comando y control communicates_with sistemas internos de Iberdrola` | ⚠️ noisy span but correct |\n\n"
        "**2/5 cleanly correct, 2 correct with noisy spans, 1 with ambiguous direction.** No serious false positives."
    ))

    cells.append(_code(
        "from matplotlib.patches import Patch\n"
        "\n"
        "# Extend the color mapping to the Spanish OSINT labels.\n"
        "OSINT_COLOR = {'persona': '#5DA5DA', 'organizacion': '#F17CB0', 'ubicacion': '#60BD68',\n"
        "               'ip_address': '#FAA43A', 'dominio': '#F15854', 'username': '#B276B2',\n"
        "               'vulnerabilidad': '#DECF3F', 'malware': '#7C7C7C', 'hash': '#6C6C6C', 'url': '#FAA43A'}\n"
        "# Build the graph with the raw labels (build_graph would remap them through TYPE_EN).\n"
        "G_osint = nx.DiGraph()\n"
        "for typ, names in BENCH['es_osint']['entities'].items():\n"
        "    for n in names:\n"
        "        G_osint.add_node(n, type=typ)\n"
        "seen = set()  # skip duplicate (head, tail, relation) triples\n"
        "for rt, pairs in BENCH['es_osint']['relations'].items():\n"
        "    for h, t in pairs:\n"
        "        if (h, t, rt) not in seen:\n"
        "            seen.add((h, t, rt))\n"
        "            G_osint.add_edge(h, t, kind=rt)\n"
        "\n"
        "fig, ax = plt.subplots(figsize=(13, 9))\n"
        "if G_osint.number_of_nodes() > 0:\n"
        "    pos = nx.spring_layout(G_osint, k=2.5, iterations=80, seed=42)\n"
        "    cols = [OSINT_COLOR.get(G_osint.nodes[n].get('type'), '#bbb') for n in G_osint.nodes]\n"
        "    nx.draw_networkx_nodes(G_osint, pos, node_color=cols, node_size=1800, edgecolors='#333', linewidths=1.4, ax=ax)\n"
        "    nx.draw_networkx_labels(G_osint, pos, font_size=8, font_weight='bold', ax=ax)\n"
        "    nx.draw_networkx_edges(G_osint, pos, edge_color='#888', arrows=True, arrowsize=14, width=1.2, alpha=0.7, ax=ax, connectionstyle='arc3,rad=0.1')\n"
        "    el = {(u, v): d['kind'] for u, v, d in G_osint.edges(data=True)}\n"
        "    nx.draw_networkx_edge_labels(G_osint, pos, edge_labels=el, font_size=6.5, ax=ax,\n"
        "                                 bbox=dict(boxstyle='round,pad=0.1', fc='white', ec='none', alpha=0.85))\n"
        "ax.set_title(f'es_osint (GLiNER2): {G_osint.number_of_nodes()} ents, {G_osint.number_of_edges()} rels', fontsize=11)\n"
        "ax.axis('off')\n"
        "# Only list the entity types that actually appear in the graph.\n"
        "present = {d.get('type') for _, d in G_osint.nodes(data=True)}\n"
        "legend = [Patch(facecolor=c, edgecolor='#333', label=t) for t, c in OSINT_COLOR.items() if t in present]\n"
        "ax.legend(handles=legend, loc='upper left', fontsize=8)\n"
        "plt.tight_layout()\n"
        "plt.show()"
    ))

    cells.append(_md(
        "## 7. Case 3: es_corporate_long (limitation: low relation recall)\n\n"
        "A longer 30-sentence text about the Spanish business sector. **60 entities extracted correctly** but only **6 relations**: the model is very selective when the context is dense."
    ))

    cells.append(_code(
        "data = BENCH['es_corporate_long']\n"
        "print(f'{data[\"n_entities\"]} entities, {data[\"n_relations\"]} relations, {data[\"elapsed_s\"]}s')\n"
        "# The person label may be 'person' or 'persona' depending on the corpus schema.\n"
        "people = data['entities'].get('person') or data['entities'].get('persona', [])\n"
        "print('\\nSAMPLE of entities (first 10 people):', people[:10])\n"
        "print('\\nRELATIONS (all):')\n"
        "for rt, pairs in data['relations'].items():\n"
        "    for h, t in pairs:\n"
        "        print(f'  {h:35s} --[{rt:20s}]--> {t}')"
    ))

    cells.append(_md(
        "**Reading:** 60 entities from 30 sentences is good recall: it captures the whole cast (Pablo Isla, Amancio Ortega, Marta Ortega, Ana Botin, Ignacio Galan, Patrick Pouyanne, Andy Jassy, Mariano Rajoy...). But there are **only 6 relations for so many explicit facts**. Hypotheses:\n\n"
        "1. **Long text overwhelms the model**: attention is diluted across sentences.\n"
        "2. **It only emits high-confidence relations**: a preference for precision over recall.\n"
        "3. **Processing sentence by sentence would improve recall**: replicate the mREBEL strategy from notebook 03.\n\n"
        "**Plan:** issue 0042 should cover both modes: `text_mode=joint` (fast, low recall on long text) and `text_mode=sentences` (slower, better recall)."
    ))

    cells.append(_md(
        "## 8. Conclusion\n\n"
        "**GLiNER2 replaces the entire current stack (GLiNER + GLiREL/mREBEL) in `extract_graph_hybrid`.** Reasons:\n\n"
        "1. **Apache 2.0**: no commercial restriction. This resolves the mREBEL caveat.\n"
        "2. **A single model**: 0.7 GB vs 2.1-3.0 GB for the current stack.\n"
        "3. **20× faster** than mREBEL at comparable quality.\n"
        "4. **Works on Spanish OSINT**: a game-changer for the real use case of `graph_explorer`.\n"
        "5. **Same schema paradigm**: `entities([...]).relations([...])` is ergonomic.\n\n"
        "**Accepted limitations:**\n\n"
        "- Relation recall drops on long text (>20 sentences). Mitigate with per-sentence chunking.\n"
        "- Some isolated semantic errors (e.g. `Inditex acquired Pablo Isla`): deduplication plus the human filter in the `paste_extract` panel cover them.\n"
        "- Only supports EN/ES/FR (vs 18 languages for mREBEL), which is irrelevant for our use case.\n\n"
        "## Migration plan\n\n"
        "1. **Replace issue 0042** (mREBEL) with **issue 0042-revised**: GLiNER2 replaces GLiREL in `extract_graph_hybrid`, with two execution modes (joint / chunked-by-sentence). mREBEL remains as an option at P3.\n"
        "2. **New functions in the registry:**\n"
        "   - `gliner2_load_model_py_datascience`: cached loader (Apache 2.0)\n"
        "   - `extract_graph_gliner2_py_datascience`: schema construction + extraction + normalization to `EntityCandidate`/`RelationCandidate`\n"
        "   - `extract_graph_gliner2_chunked_py_pipelines`: sentence-by-sentence version for long text\n"
        "3. **Update the `extract_panel.cpp` panel**: the engine combo becomes `[GLiNER2 (recommended) | GLiNER+GLiREL (legacy) | GLiNER+mREBEL (non-commercial)]`. Default: GLiNER2.\n"
        "4. **Vault `osint_nlp_models`**: update the README and create `models/gliner2.md` with these findings. Move `mrebel.md` to 'fallback' status.\n\n"
        "**To test in the future (queued in `vaults/osint_nlp_models/models/candidates.md`):**\n"
        "- `fastino/gliner2-base-v1` (205M, smaller still): confirm that quality holds.\n"
        "- GLiNER2 with threshold tuning (if the API exposes it).\n"
        "- GLiNER2 + per-sentence chunking for long corpora (long_text experiment, pending)."
    ))

    nb = nbf.v4.new_notebook()
    nb.cells = cells
    nb.metadata = {
        "kernelspec": {"display_name": "Python 3", "language": "python", "name": "python3"},
        "language_info": {"name": "python"},
    }
    NB_PATH.parent.mkdir(parents=True, exist_ok=True)
    nbf.write(nb, NB_PATH)
    print(f"[done] {NB_PATH} cells={len(cells)}")


if __name__ == "__main__":
    build()