{
"cells": [
{
"cell_type": "markdown",
"id": "4a6738d5",
"metadata": {},
"source": [
"# Improving the GLiNER2 pipeline on PDF: empirical results\n",
"\n",
"**Question:** notebook 05 left us with a PDF graph of 382 entities but only 48 edges and 324 isolated nodes. **How do we extract more correct relations and reduce the number of isolated nodes?**\n",
"\n",
"After reading the actual GLiNER2 API (rather than the one described in the README), I identified six levers:\n",
"\n",
"1. `threshold` (default 0.5): lower it to 0.3 or 0.2\n",
"2. `relations({type: description})`: pass a dict with descriptions instead of a list\n",
"3. `batch_extract` with `batch_size=8`\n",
"4. Simple coreference (normalization + substring match) across chunks\n",
"5. Sliding window of 2 sentences between chunks\n",
"6. PDF cleanup (page numbers, spurious line breaks)\n",
"\n",
"The benchmark was run with `run_improvements.py` and its results were saved to `improvements.json`. This notebook only loads and presents that data; it never reloads GLiNER2."
]
},
{
"cell_type": "markdown",
"id": "ebbdc3f9",
"metadata": {},
"source": [
"## 0. Setup"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c0adf6b4",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"keys: ['meta', 'configs', 'coref', 'top_entities_post_coref', 'top_relations_post_coref', 'ents_merged', 'rels_merged']\n"
]
}
],
"source": [
"import json\n",
"from pathlib import Path\n",
"import pandas as pd\n",
"DATA = json.loads(Path('../improvements.json').read_text())\n",
"print('keys:', list(DATA.keys()))"
]
},
{
"cell_type": "markdown",
"id": "59413647",
"metadata": {},
"source": [
"## 1. PDF preprocessing (improvements #5 and #6)\n",
"\n",
"Cleanup (`1/20`-style page headers, line breaks in the middle of words, duplicated spaces) followed by chunking with a 2-sentence sliding window."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "54e98462",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"raw chars: 89,882\n",
"clean chars: 88,714\n",
"chunks (overlap=2): 97\n",
"chunks (overlap=0): 66\n",
"\n",
"--- first 600 chars of the cleaned text ---\n",
"Banco Bilbao Vizcaya Argentaria, S.A., con domicilio en la Plaza San Nicolás, número 4, 48005 Bilbao,inscrito en el Registro Mercantil de Vizcaya, al tomo 2.083, Folio 1, Hoja BI-17-A, Inscripción 1ª con C.I.F. A-48265169POLÍTICA DE PROTECCIÓN DE DATOS PERSONALES 1. Política de Protección de Datos Personales T ómate tu tiempo y lee atentamente este documento. No dudes en pedirnos aclaraciones de lo que no entiendas.\n",
"En este apartado te explicamos para qué utilizará BBVA tus datos y, entre otros aspectos, qué derechos tienes relacionados con su uso.\n",
"INFORMACIÓN BÁSICA SOBRE PROTECCIÓN DE DATOS \n"
]
}
],
"source": [
"# Text size before/after cleanup and chunk counts with/without overlap\n",
"meta = DATA['meta']\n",
"print(f\"raw chars: {meta['raw_chars']:,}\")\n",
"print(f\"clean chars: {meta['clean_chars']:,}\")\n",
"print(f\"chunks (overlap=2): {meta['n_chunks_overlap']}\")\n",
"print(f\"chunks (overlap=0): {meta['n_chunks_no_overlap']}\")\n",
"print()\n",
"print('--- first 600 chars of the cleaned text ---')\n",
"print(meta['first_clean_600'])"
]
},
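{
"cell_type": "markdown",
"id": "b7d2e4a1",
"metadata": {},
"source": [
"The cleanup and chunking steps live in `run_improvements.py`, which this notebook does not show. As a rough illustration only, the sketch below shows one way the two steps could look. The function names follow the recommended stack in section 6, but the bodies, the regexes and the `{'text': ...}` chunk shape are assumptions, not the code that produced `improvements.json`.\n",
"\n",
"```python\n",
"import re\n",
"\n",
"def clean_pdf_text(raw):\n",
"    # Assumed artifacts: 'N/20' page markers, hyphenated line breaks, repeated spaces.\n",
"    text = re.sub(r'\\b\\d{1,2}/20\\b', ' ', raw)  # page markers such as '1/20'\n",
"    text = re.sub(r'(\\w)-\\s*\\n\\s*(\\w)', r'\\1\\2', text)  # rejoin words split by a line break\n",
"    text = re.sub(r'[ \\t]+', ' ', text)  # collapse duplicated spaces\n",
"    return text.strip()\n",
"\n",
"def chunk_with_overlap(text, max_chars=1500, overlap_sentences=2):\n",
"    # Naive sentence split on ., ! or ? followed by whitespace (an assumption).\n",
"    sentences = [s.strip() for s in re.split(r'(?<=[.!?])\\s+', text) if s.strip()]\n",
"    chunks, current, fresh = [], [], 0  # fresh = sentences not yet emitted in any chunk\n",
"    for sent in sentences:\n",
"        if fresh and len(' '.join(current + [sent])) > max_chars:\n",
"            chunks.append({'text': ' '.join(current)})\n",
"            # Carry the last sentences over so relations spanning a boundary stay visible.\n",
"            current = current[-overlap_sentences:] if overlap_sentences > 0 else []\n",
"            fresh = 0\n",
"        current.append(sent)\n",
"        fresh += 1\n",
"    if fresh:\n",
"        chunks.append({'text': ' '.join(current)})\n",
"    return chunks\n",
"\n",
"demo = chunk_with_overlap(clean_pdf_text('1/20 One.  Two. Three. Four.'), max_chars=10, overlap_sentences=1)\n",
"print([c['text'] for c in demo])\n",
"```\n",
"\n",
"With `overlap_sentences=0` the chunks are disjoint; with a positive overlap each chunk repeats the tail of the previous one, which is consistent with the overlapped run above producing more chunks (97 vs 66)."
]
},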
{
"cell_type": "markdown",
"id": "cfd5a2bd",
"metadata": {},
"source": [
"## 2. Comparative benchmark: 5 configurations\n",
"\n",
"All configurations run on the same 97 chunks of the cleaned PDF with the sliding window:\n",
"\n",
"| Config | threshold | schema | method |\n",
"|---|---|---|---|\n",
"| **A** baseline | 0.5 (default) | flat list | extract loop |\n",
"| **B** lower threshold | 0.3 | flat list | extract loop |\n",
"| **C** very low threshold | 0.2 | flat list | extract loop |\n",
"| **D** + descriptions | 0.3 | dict with descriptions | extract loop |\n",
"| **E** + batch | 0.3 | dict with descriptions | batch_extract |\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4fecd7e7",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"config time ents rels edges isolates conn%\n",
"------------------- ------ ---- ---- ----- -------- -----\n",
"A: t=0.5 flat loop 134.3s 397 71 71 329 17.8%\n",
"B: t=0.3 flat loop 139.0s 517 204 204 389 26.0%\n",
"C: t=0.2 flat loop 133.9s 632 362 362 397 34.9%\n",
"D: t=0.3 desc loop 132.4s 517 204 204 389 26.0%\n",
"E: t=0.3 desc batch 163.6s 517 204 204 389 26.0%"
]
},
"execution_count": null,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"rows = []\n",
"for c in DATA['configs']:\n",
"    s = c['stats']\n",
"    rows.append({\n",
"        'config': c['name'], 'time_s': c['elapsed'],\n",
"        'ents': s['n_ents'], 'rels': s['n_rels'], 'edges': s['n_edges'],\n",
"        'isolates': s['n_isolates'], 'conn_pct': s['connect_pct'],\n",
"    })\n",
"df = pd.DataFrame(rows)\n",
"df"
]
},
{
"cell_type": "markdown",
"id": "757530b8",
"metadata": {},
"source": [
"**Reading the benchmark:**\n",
"\n",
"- **Threshold is the main lever**, and the only one that changes the results:\n",
"  - `0.5 → 0.3` = **+187% relations** (71 → 204)\n",
"  - `0.3 → 0.2` = a further +78% (204 → 362), at the cost of +22% entities of doubtful quality (517 → 632)\n",
"  - **Sweet spot: 0.3**, a large gain without excessive noise.\n",
"\n",
"- **Per-relation descriptions do NOT help** on this dense legal corpus (B and D give identical counts). Likely explanation: GLiNER2 already understands short names such as `governed_by` and `subject_to` directly. Descriptions might matter more for ambiguous relations (`acquired` vs `merged_with`).\n",
"\n",
"- **`batch_extract` gives NO speedup on CPU**: it was **about 24% slower** than the loop (E = 163.6 s vs D = 132.4 s). Suspected cause: the model is CPU-bound and batching adds overhead without real parallelism (a single model instance, and 8 simultaneous forward passes do not fit on one core). It is probably only worth it on a GPU.\n",
"\n",
"- **The 2-sentence sliding window** is already applied in ALL configs (it is part of the chunking). Measuring its exact effect against no overlap would require a separate sixth config (not measured here)."
]
},
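{
"cell_type": "markdown",
"id": "c4f9a6d3",
"metadata": {},
"source": [
"The `isolates` and `conn%` columns in the table above are graph statistics. As a minimal sketch, assuming that nodes are the union of extracted entities and relation endpoints and that `conn%` is the share of nodes touching at least one edge (which matches the 137 / 526 = 26.0% reported for the t=0.3 runs in section 3), they could be computed as follows. The function name and the input shapes are hypothetical.\n",
"\n",
"```python\n",
"def graph_stats(ents, rels):\n",
"    # ents: iterable of entity names; rels: iterable of (head, kind, tail) triples (assumed shapes)\n",
"    edges = set(rels)  # identical triples count once\n",
"    connected = {node for head, _, tail in edges for node in (head, tail)}\n",
"    nodes = set(ents) | connected  # endpoints missing from ents still become nodes\n",
"    pct = round(100 * len(connected) / len(nodes), 1) if nodes else 0.0\n",
"    return {\n",
"        'n_nodes': len(nodes),\n",
"        'n_edges': len(edges),\n",
"        'n_isolates': len(nodes - connected),\n",
"        'connected': len(connected),\n",
"        'connect_pct': pct,\n",
"    }\n",
"\n",
"print(graph_stats(['a', 'b', 'c', 'd'], [('a', 'governed_by', 'b')]))\n",
"```\n",
"\n",
"Under this definition, lowering the threshold can raise the number of isolates and `conn%` at the same time, because new entities and new edges are both added."
]
},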
{
"cell_type": "markdown",
"id": "98c616a6",
"metadata": {},
"source": [
"## 3. Coreference on the chosen config (E)\n",
"\n",
"We apply a simple merge:\n",
"\n",
"1. Lowercase and trim punctuation, then cluster by normalized name.\n",
"2. Substring match: short names are absorbed by longer names of the same type (`BBVA` ⊂ `BBVA Seguros`).\n",
"3. Rewrite the relations to use the canonical names.\n",
"\n",
"Cost: 0.62 s. After coref:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "def3dd7a",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"PRE-coref {'n_ents': 517, 'n_rels': 204, 'n_nodes': 526, 'n_edges': 204, 'n_isolates': 389, 'connected': 137, 'connect_pct': 26.0}\n",
"POST-coref {'n_ents': 401, 'n_rels': 166, 'n_nodes': 440, 'n_edges': 166, 'n_isolates': 318, 'connected': 122, 'connect_pct': 27.7}\n",
"absorbed: 72 aliases in 0.62s\n",
"\n",
"Sample of absorbed aliases:\n",
" 'productos y servicios' → 'Información derivada de los productos y servicios contratados'\n",
" 'servicios contratados' → 'Información derivada de los productos y servicios contratados'\n",
" 'información' → 'Información derivada de los productos y servicios contratados'\n",
" 'productos' → 'Información derivada de los productos y servicios contratados'\n",
" 'servicios' → 'Información derivada de los productos y servicios contratados'\n",
" 'normativa' → 'normativa interna sobre prevención de crimen financiero'\n",
" 'blanqueo de capitales' → 'normativa de prevención del blanqueo de capitales'\n",
" 'interacción' → 'datos derivados de la interacción con chatbots'"
]
}
],
"source": [
"# Graph statistics before and after alias merging, plus a sample of merged aliases\n",
"pre = DATA['coref']['pre_stats']\n",
"post = DATA['coref']['post_stats']\n",
"print('PRE-coref ', pre)\n",
"print('POST-coref', post)\n",
"print(f\"absorbed: {DATA['coref']['n_absorbed']} aliases in {DATA['coref']['elapsed']}s\")\n",
"print()\n",
"print('Sample of absorbed aliases:')\n",
"for old, new in DATA['coref']['absorbed_sample']:\n",
"    print(f' {old!r:55s} → {new!r}')"
]
},
{
"cell_type": "markdown",
"id": "5613c249",
"metadata": {},
"source": [
"**Reading the coref results:**\n",
"\n",
"- **72 aliases absorbed** in 0.62 s, essentially free for the user.\n",
"- Nodes: 526 → 440 (-86).\n",
"- Edges: 204 → 166 (-38). _They drop because relations are merged when their endpoints collapse onto the same canonical names_.\n",
"- Isolates: 389 → 318 (-71, **-18%**).\n",
"- Conn%: 26.0% → 27.7% (a small gain in percentage terms because the total number of nodes also shrinks).\n",
"\n",
"What coreference improves most is the **quality of the graph**: instead of five alias nodes (`productos`, `servicios`, `información`, etc.) scattered across the document, there is a single canonical entity, `Información derivada de los productos y servicios contratados`. The substring rule can also over-merge (in this run `BBVA` is absorbed into `BBVA Seguros` and `Banco` into `Banco de España (CIRBE)`), so the canonical names are worth a manual check."
]
},
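{
"cell_type": "markdown",
"id": "d8a1b5e2",
"metadata": {},
"source": [
"The three merge steps from section 3 can be sketched as follows. This is an illustration under stated assumptions, not the implementation behind the numbers above: the `(name, type)` and `(head, kind, tail)` input shapes are hypothetical, canonical names are kept in normalized (lowercase) form, and each name is assumed to carry a single entity type.\n",
"\n",
"```python\n",
"import string\n",
"from collections import defaultdict\n",
"\n",
"def normalize(name):\n",
"    # Step 1a: lowercase and trim surrounding punctuation/whitespace\n",
"    return name.lower().strip(string.punctuation + string.whitespace)\n",
"\n",
"def merge_aliases(ents, rels):\n",
"    # ents: list of (name, type); rels: list of (head, kind, tail). Assumed shapes.\n",
"    by_type = defaultdict(set)\n",
"    for name, etype in ents:\n",
"        by_type[etype].add(normalize(name))  # step 1b: cluster by normalized name\n",
"\n",
"    canonical = {}\n",
"    for names in by_type.values():\n",
"        ordered = sorted(names, key=lambda n: (-len(n), n))  # longest first, deterministic\n",
"        for i, short in enumerate(ordered):\n",
"            # Step 2: absorb a name into the longest same-type name that contains it\n",
"            host = next((long for long in ordered[:i] if short in long), short)\n",
"            canonical[short] = canonical.get(host, host)\n",
"\n",
"    def canon(name):\n",
"        key = normalize(name)\n",
"        return canonical.get(key, key)\n",
"\n",
"    # Step 3: rewrite entities and relations with canonical names, dropping duplicates\n",
"    merged_ents = sorted({(canon(n), t) for n, t in ents})\n",
"    merged_rels = sorted({(canon(h), k, canon(t)) for h, k, t in rels})\n",
"    absorbed = {alias: target for alias, target in canonical.items() if alias != target}\n",
"    return merged_ents, merged_rels, absorbed\n",
"\n",
"ents = [('BBVA', 'organization'), ('BBVA Seguros', 'organization'), ('bbva.', 'organization')]\n",
"rels = [('BBVA', 'located_in', 'España'), ('bbva.', 'located_in', 'España')]\n",
"print(merge_aliases(ents, rels))\n",
"```\n",
"\n",
"In the toy input the two relations become identical once their heads are rewritten, so only one survives. That is the same mechanism by which the edge count drops after coref."
]
},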
{
"cell_type": "markdown",
"id": "5d9af970",
"metadata": {},
"source": [
"## 4. Top entities after coref"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "fdb2f3c7",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"type canonical mentions n_aliases aliases_sample \n",
"------------- ------------------------------------------------------------ -------- --------- -----------------------------------------------------------------\n",
"organization BBVA Seguros 81 1 ['BBVA'] \n",
"data_category Datos Personales 47 0 [] \n",
"person cliente particular 34 1 ['cliente'] \n",
"organization Banco de España (CIRBE) 28 3 ['Banco de España', 'Banco', 'CIRBE'] \n",
"location Plaza San Nicolás 27 0 [] \n",
"location Vizcaya 22 0 [] \n",
"data_category datos derivados de la interacción con chatbots 19 3 ['interacción', 'chatbots', 'datos'] \n",
"law normativa interna sobre prevención de crimen financiero 19 1 ['normativa'] \n",
"right consentimiento 18 0 [] \n",
"data_category Datos transaccionales 18 1 ['transaccionales'] \n",
"data_category Información derivada de los productos y servicios contratado 17 5 ['productos y servicios', 'servicios contratados', 'información']\n",
"person clientes 15 0 [] \n",
"data_category Datos identificativos 14 0 [] \n",
"email derechosprotecciondatos@bbva.com 14 0 [] \n",
"data_category número de teléfono de contacto 13 1 ['contacto'] \n",
"person representante 12 0 [] \n",
"organization Agencia Española de Protección de Datos 12 0 [] \n",
"organization sociedades participadas 11 2 ['participadas', 'sociedades'] \n",
"person garante 11 0 [] \n",
"data_category Datos económicos 11 1 ['económicos'] "
]
},
"execution_count": null,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"rows = DATA['top_entities_post_coref'][:20]\n",
"df = pd.DataFrame(rows)\n",
"df"
]
},
{
"cell_type": "markdown",
"id": "36710c94",
"metadata": {},
"source": [
"## 5. Top relations after coref"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c5439813",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"from kind to count\n",
"---------------------------------------------- -------------- -------------------------------------------------- -----\n",
"BBVA Seguros governed_by Banco de España (CIRBE) 4 \n",
"Datos Personales protected_by Agencia Española de Protección de Datos 4 \n",
"Datos Personales protected_by Política de Protección de Datos Personales 3 \n",
"BBVA Seguros subject_to obligaciones legales 3 \n",
"derechos de acceso rights_against datos derivados de la interacción con chatbots 3 \n",
"contratación controlled_by BBVA Seguros 3 \n",
"BBVA Seguros subsidiary_of Grupo BBVA 2 \n",
"Datos Personales protected_by BBVA Seguros 2 \n",
"BBVA Seguros contact_for Información derivada de los productos y servicios 2 \n",
"Delegado de Protección de Datos contact_for BBVA Seguros 2 \n",
"BBVA Seguros controlled_by Banco de España (CIRBE) 2 \n",
"domicilio located_in Plaza San Nicolás 2 \n",
"datos de contacto contact_for clientes 2 \n",
"BBVA Seguros located_in España 2 \n",
"contratos de crédito inmobiliario governed_by Ley 5/2019 2 \n",
"Avda. de la Industria located_in MADRID 2 \n",
"bbva.es located_in MADRID 2 \n",
"datos derivados de la interacción con chatbots subject_to normativa interna sobre prevención de crimen finan 2 \n",
"Datos Personales subject_to normativa interna sobre prevención de crimen finan 2 \n",
"Emailage Corporation located_in Londres 2 "
]
},
"execution_count": null,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"rows = DATA['top_relations_post_coref'][:20]\n",
"df = pd.DataFrame(rows)\n",
"df"
]
},
{
"cell_type": "markdown",
"id": "3c830cb5",
"metadata": {},
"source": [
"## 6. Conclusion: operational cookbook\n",
"\n",
"**To extract more correct relations and reduce isolated nodes with GLiNER2 on PDF, ordered by impact versus cost:**\n",
"\n",
"| Improvement | Gain on this corpus | Implementation cost |\n",
"|---|---|---|\n",
"| ⭐ `threshold=0.3` (vs default 0.5) | **+187% relations** | 1 parameter |\n",
"| ⭐ Simple coreference (normalize + substring) | **-18% isolates** | ~30 lines of pure Python |\n",
"| PDF cleanup (`N/20`, line breaks) | -1.3% chars of noise + more stable chunks | ~10 lines of regex |\n",
"| `threshold=0.2` (more aggressive) | +78% extra relations, +22% doubtful entities | trade-off |\n",
"| ❌ Per-relation descriptions | No effect on this corpus | dict instead of list |\n",
"| ❌ `batch_extract` on CPU | ~24% slower | different API |\n",
"| ❌ Sliding window with 1500-char chunks | Not measured in isolation (assumed marginal) | 5 lines |\n",
"\n",
"**Recommended final stack:**\n",
"\n",
"```python\n",
"from gliner2 import GLiNER2\n",
"\n",
"# 1. Load GLiNER2 (Apache 2.0)\n",
"model = GLiNER2.from_pretrained('fastino/gliner2-large-v1')\n",
"\n",
"# 2. Preprocess the PDF\n",
"raw = extract_pdf_text(pdf_path)  # registry: extract_pdf_text_py_core\n",
"clean = clean_pdf_text(raw)  # NEW registry function\n",
"chunks = chunk_with_overlap(clean, max_chars=1500, overlap_sentences=2)  # NEW\n",
"\n",
"# 3. Schema + extract with threshold=0.3\n",
"schema = model.create_schema().entities([...]).relations([...])\n",
"results = [model.extract(c['text'], schema=schema, threshold=0.3) for c in chunks]\n",
"\n",
"# 4. Aggregate + coref\n",
"ents, rels = aggregate(results)  # NEW, pure\n",
"ents, rels, _ = merge_aliases(ents, rels)  # NEW, pure\n",
"```\n",
"\n",
"## Functions to promote to the registry (next fn-constructor)\n",
"\n",
"**Six new functions**, four of them pure:\n",
"\n",
"1. `gliner2_load_model_py_datascience` (impure): Apache 2.0, joint NER + RE\n",
"2. `clean_pdf_text_py_core` (pure): cleanup of PyPDF2 artifacts\n",
"3. `chunk_with_overlap_py_core` (pure): chunking with a sliding window\n",
"4. `aggregate_extraction_results_py_core` (pure): dedupe + counter\n",
"5. `merge_entity_aliases_py_core` (pure): simple coref, normalize + substring\n",
"6. `extract_graph_from_pdf_py_pipelines` (impure): full composition\n",
"\n",
"This closes the loop: the notebook's workflow becomes _a single registry call_, reusable across projects."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.13.7"
}
},
"nbformat": 4,
"nbformat_minor": 5
}