{ "cells": [ { "cell_type": "markdown", "id": "3687e8d1", "metadata": {}, "source": [ "# NuExtract 2.0-2B (GPU) vs GLiNER2: comparison with visualization\n", "\n", "**Question:** is an LLM that runs generative inference (NuExtract 2.0) worth it in a project where we previously chose GLiNER2 for its speed?\n", "\n", "**Setup:**\n", "- NuExtract 2.0-2B (Qwen2-VL-2B base, **MIT license**, 2B params, GPU BF16 on an RTX 3070).\n", "- GLiNER2-large-v1 (Apache 2.0, 340M params, CPU).\n", "- Same corpora: `es_corporate_short` (8 sentences), `LONG_TEXT_ES` (25 sentences), 5 chunks from the BBVA PDF.\n", "\n", "**Paradigm difference:**\n", "- **GLiNER2** = classifier. Output: flat lists `{entities: {type: [names]}, relations: {type: [(h, t)]}}`.\n", "- **NuExtract** = generative LLM. Output: arbitrary JSON that you define in the `template`. Relations are modeled as attributes of the objects (`{org: {ceo: \"X\", headquartered_in: \"Y\"}}`).\n", "\n", "**Hypothesis:** NuExtract wins on _structural richness_ (per-entity attributes in a single pass) but loses on speed, even with a GPU." ] }, { "cell_type": "markdown", "id": "5691cee5", "metadata": {}, "source": [ "## 1. 
Setup" ] }, { "cell_type": "code", "execution_count": 1, "id": "cd75a1d8", "metadata": { "execution": { "iopub.execute_input": "2026-05-04T19:36:55.012511Z", "iopub.status.busy": "2026-05-04T19:36:55.012317Z", "iopub.status.idle": "2026-05-04T19:36:55.652234Z", "shell.execute_reply": "2026-05-04T19:36:55.651410Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "NuExtract keys: ['meta', 'cpu_baseline', 'T1_corp_short_flat', 'T2_corp_short_rich', 'T3_long_text_rich', 'pdf_meta', 'T4_pdf_chunks', 'full_pdf_extrapolation']\n", "GLiNER2 keys: ['meta', 'configs', 'coref', 'top_entities_post_coref', 'top_relations_post_coref', 'ents_merged', 'rels_merged']\n", "\n", "NuExtract device: cuda torch.bfloat16\n" ] } ], "source": [ "import os, sys, json, warnings\n", "warnings.filterwarnings('ignore')\n", "from pathlib import Path\n", "from collections import defaultdict\n", "\n", "# Put the local function registry first on sys.path, dropping any of its sub-paths.\n", "_pf = '/home/lucas/fn_registry/python/functions'\n", "sys.path = [p for p in sys.path if not p.startswith(_pf + '/')]\n", "if _pf not in sys.path: sys.path.insert(0, _pf)\n", "\n", "import pandas as pd\n", "import networkx as nx\n", "import matplotlib.pyplot as plt\n", "from matplotlib.patches import Patch\n", "\n", "NUEX = json.loads(Path('../nuextract_results.json').read_text())\n", "\n", "# Re-parse each test's raw_text with a corrected parser (the original\n", "# script used rfind and only captured the last small object).\n", "# Starts at the first '{' and tries the longest candidate span first.\n", "def reparse(text):\n", "    if not text: return None\n", "    s = text.find('{')\n", "    if s < 0: return None\n", "    for end in range(len(text), s, -1):\n", "        try: return json.loads(text[s:end])\n", "        except Exception: continue\n", "    return None\n", "for key in ['T1_corp_short_flat', 'T2_corp_short_rich', 'T3_long_text_rich']:\n", "    if key in NUEX:\n", "        NUEX[key]['parsed'] = reparse(NUEX[key].get('raw_text', ''))\n", "for cr in NUEX.get('T4_pdf_chunks', []):\n", "    cr['parsed'] = reparse(cr.get('raw_text', ''))\n", "GLNR_CORPUS = 
json.loads(Path('../benchmark_v2.json').read_text()) # GLiNER2 on 4 corpora\n", "GLNR = json.loads(Path('../improvements.json').read_text()) # GLiNER2 on the PDF + improvements\n", "print('NuExtract keys:', list(NUEX.keys()))\n", "print('GLiNER2 keys: ', list(GLNR.keys()))\n", "print()\n", "print('NuExtract device:', NUEX['meta']['device'], NUEX['meta']['dtype'])" ] }, { "cell_type": "markdown", "id": "7c1d64c1", "metadata": {}, "source": [ "## 2. Timing table: CPU vs GPU vs GLiNER2\n", "\n", "We compare NuExtract's 4 passes (T1-T4) against GLiNER2 on the same corpora." ] }, { "cell_type": "code", "execution_count": 2, "id": "9d4c55ad", "metadata": { "execution": { "iopub.execute_input": "2026-05-04T19:36:55.654408Z", "iopub.status.busy": "2026-05-04T19:36:55.654139Z", "iopub.status.idle": "2026-05-04T19:36:55.669174Z", "shell.execute_reply": "2026-05-04T19:36:55.668310Z" } }, "outputs": [ { "data": { "text/html": [ "
|   | test | engine | time_s | in_tok | out_tok |
|---|---|---|---|---|---|
| 0 | T1 corp_short flat | NuExtract CPU | 24.98 | 245 | 79 |
| 1 | T2 corp_short rich | NuExtract CPU | 117.51 | 351 | 370 |
| 2 | T1 corp_short flat | NuExtract GPU | 2.88 | 245 | 79 |
| 3 | T2 corp_short rich | NuExtract GPU | 9.94 | 351 | 363 |
| 4 | T3 long_text rich | NuExtract GPU | 53.56 | 952 | 2048 |
| 5 | PDF (97 chunks) | GLiNER2 CPU | 134.30 | - | - |
| 6 | PDF (97 chunks) | GLiNER2 CPU t=0.3 | 139.00 | - | - |