"""Build notebooks/03_mrebel_vs_glirel.ipynb: a side-by-side comparison
of GLiNER+GLiREL vs GLiNER+mREBEL on the same Spanish text.
mREBEL (Babelscape) is a seq2seq mBART model that GENERATES triplets directly
from the text instead of enumerating pairs×labels like GLiREL. Cost: 600M
params, ~3 s/sentence latency. Quality: far better on Spanish in this comparison.
mREBEL license: CC BY-NC-SA 4.0 (non-commercial).
"""
from __future__ import annotations
import json
from pathlib import Path
import nbformat as nbf
HERE = Path(__file__).resolve().parent
NB_PATH = HERE / "notebooks" / "03_mrebel_vs_glirel.ipynb"


def _md(text: str):
    return nbf.v4.new_markdown_cell(text)


def _code(src: str):
    cell = nbf.v4.new_code_cell(src)
    cell.outputs = []
    cell.execution_count = None
    return cell


SPANISH_TEXT = (
"Pablo Isla, expresidente de Inditex, ha sido nombrado consejero de Telefonica. "
"La operacion fue anunciada por el presidente Jose Maria Alvarez-Pallete en Madrid el pasado lunes. "
"Inditex factura mas de 30.000 millones anuales y tiene su sede en Arteixo, A Coruna. "
"En paralelo, Iberdrola y Endesa firmaron un acuerdo de colaboracion en proyectos eolicos en Galicia. "
"El presidente de Iberdrola, Ignacio Galan, se reunio con la CEO de Endesa, Marina Serrano, en Bilbao. "
"El acuerdo movilizara 2.000 millones de euros en cinco anos. "
"El BBVA, presidido por Carlos Torres, mostro interes en participar en la financiacion del proyecto. "
"Su sede central esta en Bilbao."
)
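

# Illustrative sketch, not called by build(): a standalone, dependency-free
# version of the span-to-entity alignment that section 5.4 of the generated
# notebook applies to mREBEL output. The name `align_span_to_entity` is
# hypothetical. It assumes the same policy as the notebook cell: exact
# case-insensitive match first, then substring containment in either
# direction, preferring the longest entity name.
def align_span_to_entity(span, entity_names):
    """Return the entity name matching `span`, or None when nothing matches."""
    needle = span.strip().lower()
    if not needle:
        return None
    # Longest names first, so a full name wins over a shorter fragment.
    candidates = sorted(entity_names, key=len, reverse=True)
    for name in candidates:
        if name.lower() == needle:
            return name
    for name in candidates:
        lowered = name.lower()
        if lowered in needle or needle in lowered:
            return name
    return None
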


def build():
    cells = []
    cells.append(_md(
        "# GLiREL vs mREBEL: a comparison on Spanish text\n\n"
        "Following the finding in notebook 02 (GLiREL emits ~50 spurious relations on "
        "Spanish business narrative), we look for a better relation model.\n\n"
        "**Candidate:** [`Babelscape/mrebel-large`](https://huggingface.co/Babelscape/mrebel-large), "
        "a seq2seq mBART model that **generates triplets directly** from the text instead of "
        "enumerating pairs×labels.\n\n"
        "| | GLiREL `jackboyla/glirel-large-v0` | mREBEL `Babelscape/mrebel-large` |\n"
        "|---|---|---|\n"
        "| Size | ~1.5 GB | ~2.4 GB (600M params) |\n"
        "| Architecture | Pair classifier (DeBERTa) | Seq2seq generator (mBART) |\n"
        "| Languages | EN-centric | 18 languages (Spanish natively) |\n"
        "| Output | Score per (head, tail, label) ∈ Cartesian product | Generated triplets (subject-rel-object) |\n"
        "| Relation vocabulary | Configurable (you supply the labels) | Closed (~400 Wikidata types) |\n"
        "| Latency | ~50 ms for a 15-entity graph | ~3 s per sentence |\n"
        "| License | Apache 2.0 | **CC BY-NC-SA 4.0 (non-commercial)** |\n\n"
        "We run both on the same Spanish text and compare the resulting graphs."
    ))
    cells.append(_md("## 1. Setup"))
    cells.append(_code(
        "import os, sys, json, time, warnings, re\n"
        "warnings.filterwarnings('ignore')\n"
        "os.environ.setdefault('HF_HUB_DISABLE_PROGRESS_BARS', '1')\n"
        "from pathlib import Path\n"
        "\n"
        "_pf = '/home/lucas/fn_registry/python/functions'\n"
        "sys.path = [p for p in sys.path if not p.startswith(_pf + '/')]\n"
        "if _pf not in sys.path:\n"
        "    sys.path.insert(0, _pf)\n"
        "\n"
        "import pandas as pd\n"
        "import networkx as nx\n"
        "import matplotlib.pyplot as plt\n"
        "from transformers import AutoTokenizer, AutoModelForSeq2SeqLM\n"
        "from datascience.gliner_load_model import gliner_load_model\n"
        "from datascience.glirel_load_model import glirel_load_model\n"
        "from pipelines.extract_graph_hybrid import extract_graph_hybrid\n"
        "print('imports OK')"
    ))
    cells.append(_md("## 2. Input text (same as notebook 02)"))
    cells.append(_code(
        f"TEXTO = {SPANISH_TEXT!r}\n"
        "print(TEXTO)"
    ))
    cells.append(_md(
        "## 3. Load models: GLiNER + GLiREL + mREBEL\n\n"
        "GLiNER and GLiREL load warm. mREBEL loads cold, ~60 s the first time (2.4 GB download)."
    ))
    cells.append(_code(
        "t0 = time.time(); gliner = gliner_load_model(); print(f'GLiNER {time.time()-t0:.1f}s')\n"
        "t0 = time.time(); glirel = glirel_load_model(); print(f'GLiREL {time.time()-t0:.1f}s')\n"
        "t0 = time.time()\n"
        "mrebel_tok = AutoTokenizer.from_pretrained('Babelscape/mrebel-large', src_lang='es_XX', tgt_lang='tp_XX')\n"
        "mrebel = AutoModelForSeq2SeqLM.from_pretrained('Babelscape/mrebel-large')\n"
        "print(f'mREBEL {time.time()-t0:.1f}s')"
    ))
    cells.append(_md("## 4. Pipeline A: GLiNER + GLiREL (notebook 02 baseline, t=0.30)"))
    cells.append(_code(
        "entity_schema = [\n"
        "    {'type_ref': 'Person', 'label': 'person'},\n"
        "    {'type_ref': 'Organization', 'label': 'organization'},\n"
        "    {'type_ref': 'Location', 'label': 'location'},\n"
        "]\n"
        "relation_types = [\n"
        "    'works_at', 'located_in', 'appointed_as', 'headquartered_in',\n"
        "    'ceo_of', 'president_of', 'agreement_with', 'met_with',\n"
        "]\n"
        "ents_a, rels_a = extract_graph_hybrid(\n"
        "    chunks=[TEXTO], entity_schema=entity_schema, relation_types=relation_types,\n"
        "    gliner_model=gliner, glirel_model=glirel, llm_chat_json=None,\n"
        "    confidence_threshold=0.30,\n"
        ")\n"
        "print(f'GLiNER+GLiREL: {len(ents_a)} ents, {len(rels_a)} rels')"
    ))
    cells.append(_md(
        "## 5. Pipeline B: GLiNER + mREBEL\n\n"
        "Hybrid strategy:\n"
        "1. **GLiNER** still extracts the typed entities (it is excellent at this).\n"
        "2. **mREBEL sentence by sentence**: the seq2seq model stops generating early when given the whole text, so we split on sentence boundaries.\n"
        "3. For each mREBEL triplet we do a **fuzzy string match** between head/tail and the GLiNER entity names. Only triplets with both ends in the graph are kept.\n"
        "4. Triplets that do not hook onto GLiNER entities are ignored (mREBEL sometimes emits raw spans such as `\"esta en Bilbao\"`; those are dropped)."
    ))
    cells.append(_code(
        "# 5.1 GLiNER entities (same as pipeline A)\n"
        "ents_b = ents_a  # GLiNER output is identical\n"
        "ent_names = sorted({e.name for e in ents_b}, key=len, reverse=True)\n"
        "name_to_ent = {e.name: e for e in ents_b}\n"
        "print(f'GLiNER ents: {len(ent_names)}')\n"
        "\n"
        "# 5.2 mREBEL sentence by sentence\n"
        "def mrebel_extract_triplets(decoded_text):\n"
        "    \"\"\"Parser adapted from the official README.\"\"\"\n"
        "    triplets = []\n"
        "    # Strip sequence markers and language codes before tokenizing on whitespace.\n"
        "    text = decoded_text.replace('<s>','').replace('<pad>','').replace('</s>','').replace('tp_XX','').replace('__en__','').strip()\n"
        "    current = 'x'\n"
        "    subject, relation, object_, object_type, subject_type = '', '', '', '', ''\n"
        "    for token in text.split():\n"
        "        if token == '<triplet>' or token == '<relation>':\n"
        "            # A new triplet starts: flush the pending one, then read the subject.\n"
        "            current = 't'\n"
        "            if relation:\n"
        "                triplets.append({'head':subject.strip(),'head_type':subject_type,'type':relation.strip(),'tail':object_.strip(),'tail_type':object_type})\n"
        "                relation = ''\n"
        "            subject = ''\n"
        "        elif token.startswith('<') and token.endswith('>'):\n"
        "            # Any other <tag> is an entity type; it closes the subject or the object.\n"
        "            if current in ('t','o'):\n"
        "                current = 's'\n"
        "                if relation:\n"
        "                    triplets.append({'head':subject.strip(),'head_type':subject_type,'type':relation.strip(),'tail':object_.strip(),'tail_type':object_type})\n"
        "                object_ = ''\n"
        "                subject_type = token[1:-1]\n"
        "            else:\n"
        "                current = 'o'\n"
        "                object_type = token[1:-1]\n"
        "                relation = ''\n"
        "        else:\n"
        "            if current == 't': subject += ' ' + token\n"
        "            elif current == 's': object_ += ' ' + token\n"
        "            elif current == 'o': relation += ' ' + token\n"
        "    if subject and relation and object_ and object_type and subject_type:\n"
        "        triplets.append({'head':subject.strip(),'head_type':subject_type,'type':relation.strip(),'tail':object_.strip(),'tail_type':object_type})\n"
        "    return triplets\n"
        "\n"
        "sentences = [s.strip() for s in re.split(r'(?<=[\\.])\\s+', TEXTO) if len(s.strip()) > 20]\n"
        "raw_triplets = []\n"
        "t0 = time.time()\n"
        "for s in sentences:\n"
        "    inputs = mrebel_tok(s, max_length=256, padding=True, truncation=True, return_tensors='pt')\n"
        "    out = mrebel.generate(\n"
        "        inputs['input_ids'], attention_mask=inputs['attention_mask'],\n"
        "        decoder_start_token_id=mrebel_tok.convert_tokens_to_ids('tp_XX'),\n"
        "        max_length=256, num_beams=4, length_penalty=1.0,\n"
        "    )\n"
        "    decoded = mrebel_tok.batch_decode(out, skip_special_tokens=False)[0]\n"
        "    raw_triplets.extend(mrebel_extract_triplets(decoded))\n"
        "print(f'mREBEL: {len(raw_triplets)} triplets in {time.time()-t0:.1f}s ({len(sentences)} sentences)')"
    ))
    cells.append(_md("### 5.3 Raw mREBEL triplets (before matching)"))
    cells.append(_code(
        "df_raw = pd.DataFrame(raw_triplets)\n"
        "df_raw"
    ))
    cells.append(_md(
        "### 5.4 Match against GLiNER entities\n\n"
        "For each mREBEL triplet, I check whether head and tail match a GLiNER entity name: "
        "first exactly, then as a case-insensitive substring in either direction. I keep only "
        "the triplets where both ends match."
    ))
    cells.append(_code(
        "def match_to_ent(span: str):\n"
        "    s = span.strip().lower()\n"
        "    if not s: return None\n"
        "    # exact match first\n"
        "    for n in ent_names:\n"
        "        if n.lower() == s:\n"
        "            return n\n"
        "    # substring (longest entity wins; ent_names is already sorted by length, descending)\n"
        "    for n in ent_names:\n"
        "        if n.lower() in s or s in n.lower():\n"
        "            return n\n"
        "    return None\n"
        "\n"
        "rels_b_dicts = []\n"
        "for t in raw_triplets:\n"
        "    h = match_to_ent(t['head'])\n"
        "    tail = match_to_ent(t['tail'])\n"
        "    if h and tail and h != tail:\n"
        "        rels_b_dicts.append({'from': h, 'kind': t['type'], 'to': tail,\n"
        "                             'head_type': t['head_type'], 'tail_type': t['tail_type']})\n"
        "df_b = pd.DataFrame(rels_b_dicts)\n"
        "print(f'triplets aligned with GLiNER: {len(rels_b_dicts)} of {len(raw_triplets)}')\n"
        "df_b"
    ))
    cells.append(_md("## 6. Side-by-side visualization"))
    cells.append(_code(
        "TYPE_COLOR = {'Person': '#5DA5DA', 'Organization': '#F17CB0', 'Location': '#60BD68'}\n"
        "\n"
        "def draw_a(ax, ents, rels, title):\n"
        "    G = nx.DiGraph()\n"
        "    for e in ents: G.add_node(e.name, type=e.type_ref)\n"
        "    for r in rels: G.add_edge(r.from_name, r.to_name, kind=r.relation_type)\n"
        "    pos = nx.spring_layout(G, k=2.2, iterations=80, seed=42)\n"
        "    cols = [TYPE_COLOR.get(G.nodes[n].get('type'), '#bbb') for n in G.nodes]\n"
        "    nx.draw_networkx_nodes(G, pos, node_color=cols, node_size=1900, edgecolors='#333', linewidths=1.4, ax=ax)\n"
        "    nx.draw_networkx_labels(G, pos, font_size=8, font_weight='bold', ax=ax)\n"
        "    nx.draw_networkx_edges(G, pos, edge_color='#888', arrows=True, arrowsize=14, width=1.2, alpha=0.65, ax=ax, connectionstyle='arc3,rad=0.08')\n"
        "    el = {(u,v): d['kind'] for u,v,d in G.edges(data=True)}\n"
        "    nx.draw_networkx_edge_labels(G, pos, edge_labels=el, font_size=6.5, ax=ax,\n"
        "        bbox=dict(boxstyle='round,pad=0.1', fc='white', ec='none', alpha=0.85))\n"
        "    ax.set_title(f'{title}: {G.number_of_nodes()} ents, {G.number_of_edges()} rels', fontsize=11)\n"
        "    ax.axis('off')\n"
        "\n"
        "def draw_b(ax, ents, rel_dicts, title):\n"
        "    G = nx.DiGraph()\n"
        "    for e in ents: G.add_node(e.name, type=e.type_ref)\n"
        "    for d in rel_dicts: G.add_edge(d['from'], d['to'], kind=d['kind'])\n"
        "    # drop isolated nodes so the graph stays readable\n"
        "    isolates = list(nx.isolates(G))\n"
        "    G.remove_nodes_from(isolates)\n"
        "    pos = nx.spring_layout(G, k=2.2, iterations=80, seed=42)\n"
        "    cols = [TYPE_COLOR.get(G.nodes[n].get('type'), '#bbb') for n in G.nodes]\n"
        "    nx.draw_networkx_nodes(G, pos, node_color=cols, node_size=1900, edgecolors='#333', linewidths=1.4, ax=ax)\n"
        "    nx.draw_networkx_labels(G, pos, font_size=8, font_weight='bold', ax=ax)\n"
        "    nx.draw_networkx_edges(G, pos, edge_color='#888', arrows=True, arrowsize=14, width=1.2, alpha=0.65, ax=ax, connectionstyle='arc3,rad=0.08')\n"
        "    el = {(u,v): d['kind'] for u,v,d in G.edges(data=True)}\n"
        "    nx.draw_networkx_edge_labels(G, pos, edge_labels=el, font_size=6.5, ax=ax,\n"
        "        bbox=dict(boxstyle='round,pad=0.1', fc='white', ec='none', alpha=0.85))\n"
        "    ax.set_title(f'{title}: {G.number_of_nodes()} ents, {G.number_of_edges()} rels', fontsize=11)\n"
        "    ax.axis('off')\n"
        "\n"
        "fig, axes = plt.subplots(1, 2, figsize=(20, 9))\n"
        "draw_a(axes[0], ents_a, rels_a, 'A: GLiNER + GLiREL (t=0.30)')\n"
        "draw_b(axes[1], ents_b, rels_b_dicts, 'B: GLiNER + mREBEL (aligned)')\n"
        "from matplotlib.patches import Patch\n"
        "legend = [Patch(facecolor=c, edgecolor='#333', label=t) for t, c in TYPE_COLOR.items()]\n"
        "axes[0].legend(handles=legend, loc='upper left', frameon=True, fontsize=10)\n"
        "plt.tight_layout(); plt.show()"
    ))
    cells.append(_md(
        "## 7. Reading the results\n\n"
        "**mREBEL wins on this text.** The triplets that survive the match are semantically correct (real chairmanships, real headquarters, real positions) and the relation types come from the Wikidata vocabulary (`employer`, `chairperson`, `chief executive officer`, `headquarters location`...), which is richer and more semantic than the labels we pass to GLiREL.\n\n"
        "GLiREL at `t=0.30` is left with 1 relation (a false one). Lowering the threshold to `t=0.15` yields 51, most of them spurious. **There is no useful sweet spot.**\n\n"
        "### Operational trade-offs\n\n"
        "| Aspect | Verdict |\n"
        "|---|---|\n"
        "| Semantic quality on Spanish | mREBEL >> GLiREL (not comparable) |\n"
        "| Latency | mREBEL ~3 s/sentence, GLiREL ~50 ms total. mREBEL is at least ~60× slower (3 s vs 50 ms, and that is per sentence), but its relations are useful. |\n"
        "| Size on disk | mREBEL 2.4 GB, GLiREL 1.5 GB |\n"
        "| Relation vocabulary | mREBEL fixed (~400 Wikidata types). GLiREL free-form. For business narrative the Wikidata vocabulary covers what we need. |\n"
        "| License | mREBEL CC BY-NC-SA 4.0 (non-commercial). GLiREL Apache 2.0. **A blocker if this becomes a commercial product.** |\n"
        "| Mapping to entities | mREBEL emits raw spans → needs matching against GLiNER (already implemented in cell 5.4). GLiREL already returns names. |\n\n"
        "### Implications for the pipeline\n\n"
        "1. **For personal/research use** (the current case): swap GLiREL for mREBEL in `extract_graph_hybrid` when the chunk is Spanish. New issue in `graph_explorer`: `0042-mrebel-relation-extractor.md`.\n"
        "2. **The `paste_extract` panel** must warn about latency: long text (10+ sentences) takes ~30 s or more. UI: per-sentence progress bar.\n"
        "3. **For commercial use** (future): mREBEL cannot be used as-is. Alternatives:\n"
        "   - LLM (issue already planned; any provider with a commercial license works).\n"
        "   - Fine-tune monolingual REBEL on Spanish if you have data (assumed Apache 2.0; verify the license on the model card).\n"
        "   - Look for another open model (REDFM has a different license; check it).\n"
        "4. **Recommended pre-mREBEL layer:** since mREBEL emits better relation types (Wikidata) than the labels I pass by hand (`works_at`...), **the `paste_extract` panel should not force a fixed vocabulary and should use whatever mREBEL returns**. The graph taxonomy then grows on its own.\n\n"
        "### What remains to be tested\n\n"
        "- Same benchmark on a larger corpus (10+ articles).\n"
        "- Evaluation on OSINT text (IPs, domains, indicators), where the Wikidata vocabulary may not fit.\n"
        "- Integration with an LLM as a third tier (the layer the pipeline already supports). Today the pipeline falls back from GLiREL to the LLM only when GLiREL fails; with mREBEL it may make more sense to use the LLM as a _refiner_ on top."
    ))
    nb = nbf.v4.new_notebook()
    nb.cells = cells
    nb.metadata = {
        "kernelspec": {"display_name": "Python 3", "language": "python", "name": "python3"},
        "language_info": {"name": "python"},
    }
    NB_PATH.parent.mkdir(parents=True, exist_ok=True)
    nbf.write(nb, NB_PATH)
    print(f"[done] {NB_PATH} cells={len(cells)}")


if __name__ == "__main__":
    build()