retrieving_graphs/notebooks/01_graph_backends.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "intro",
   "metadata": {},
   "source": [
    "# Comparison: embedded graph databases + LLM retrieval\n",
    "\n",
    "## Goal\n",
    "\n",
    "1. Load the fn_registry dependency graph into 6 graph backends\n",
    "2. Benchmark insertion, traversal, and persistence\n",
    "3. Evaluate how an LLM (`claude -p`) generates queries to retrieve data from each backend\n",
    "\n",
    "## Backends\n",
    "\n",
    "| Backend | Query language | Type |\n",
    "|---|---|---|\n",
    "| **Kuzu** | Cypher | Embedded graph DB |\n",
    "| **Memgraph** | Cypher (Bolt) | In-memory graph DB (Docker) |\n",
    "| **NetworkX** | Python API | In-memory library |\n",
    "| **SQLite + CTEs** | Recursive SQL | Relational |\n",
    "| **RDFLib** | SPARQL | Triple store |\n",
    "| **igraph** | Python API | C/Python library |\n",
    "\n",
    "## Test graph\n",
    "\n",
    "fn_registry itself: ~354 functions + 39 types as nodes, with dependencies (uses_functions, uses_types, returns, error_type) as edges."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "section-1",
   "metadata": {},
   "source": [
    "## 1. Extract the graph from registry.db"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "extract-graph",
   "metadata": {
    "jupyter": {
     "source_hidden": true
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Nodos: 393 (354 funciones + 39 tipos)\n",
      "Aristas: 395 (de 749 totales, 354 con target inexistente)\n",
      "Relaciones: {'uses_function', 'error_type', 'uses_type'}\n"
     ]
    }
   ],
   "source": [
    "import sqlite3\n",
    "import json\n",
    "import os\n",
    "import time\n",
    "import shutil\n",
    "import pandas as pd\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "plt.style.use('seaborn-v0_8-whitegrid')\n",
    "\n",
    "FN_ROOT = os.environ.get('FN_REGISTRY_ROOT', os.path.expanduser('~/fn_registry'))\n",
    "DB_PATH = os.path.join(FN_ROOT, 'registry.db')\n",
    "\n",
    "conn = sqlite3.connect(DB_PATH)\n",
    "conn.row_factory = sqlite3.Row\n",
    "\n",
    "# Nodos: funciones\n",
    "functions = [dict(r) for r in conn.execute(\n",
    "    'SELECT id, name, kind, lang, domain, purity, signature, description, '\n",
    "    'uses_functions, uses_types, returns, returns_optional, error_type, tags '\n",
    "    'FROM functions ORDER BY name'\n",
    ").fetchall()]\n",
    "\n",
    "# Nodos: tipos\n",
    "types = [dict(r) for r in conn.execute(\n",
    "    'SELECT id, name, lang, domain, algebraic, description, uses_types, tags '\n",
    "    'FROM types ORDER BY name'\n",
    ").fetchall()]\n",
    "\n",
    "conn.close()\n",
    "\n",
    "# Construir aristas\n",
    "edges = []  # (source_id, target_id, relation_type)\n",
    "\n",
    "for f in functions:\n",
    "    fid = f['id']\n",
    "    # uses_functions\n",
    "    for dep in json.loads(f.get('uses_functions') or '[]'):\n",
    "        edges.append((fid, dep, 'uses_function'))\n",
    "    # uses_types\n",
    "    for dep in json.loads(f.get('uses_types') or '[]'):\n",
    "        edges.append((fid, dep, 'uses_type'))\n",
    "    # returns\n",
    "    ret = f.get('returns') or ''\n",
    "    if ret:\n",
    "        edges.append((fid, ret, 'returns'))\n",
    "    # error_type\n",
    "    err = f.get('error_type') or ''\n",
    "    if err:\n",
    "        edges.append((fid, err, 'error_type'))\n",
    "\n",
    "for t in types:\n",
    "    tid = t['id']\n",
    "    for dep in json.loads(t.get('uses_types') or '[]'):\n",
    "        edges.append((tid, dep, 'uses_type'))\n",
    "\n",
    "# Todos los IDs de nodos referenciados\n",
    "all_node_ids = set(f['id'] for f in functions) | set(t['id'] for t in types)\n",
    "# Nodos por ID para lookup rapido\n",
    "node_map = {f['id']: {**f, 'node_type': 'function'} for f in functions}\n",
    "node_map.update({t['id']: {**t, 'node_type': 'type'} for t in types})\n",
    "\n",
    "# Filtrar aristas a nodos que existen\n",
    "valid_edges = [(s, t, r) for s, t, r in edges if s in all_node_ids and t in all_node_ids]\n",
    "\n",
    "print(f'Nodos: {len(all_node_ids)} ({len(functions)} funciones + {len(types)} tipos)')\n",
    "print(f'Aristas: {len(valid_edges)} (de {len(edges)} totales, {len(edges) - len(valid_edges)} con target inexistente)')\n",
    "print(f'Relaciones: {set(r for _, _, r in valid_edges)}')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "section-2",
   "metadata": {},
   "source": [
    "## 2. Benchmark framework"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "bench-framework",
   "metadata": {
    "jupyter": {
     "source_hidden": true
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Benchmark queries: 8\n",
      "  direct_deps: Funciones que usa directamente filter_slice_go_core\n",
      "  reverse_deps: Funciones que dependen de error_go_core\n",
      "  two_hop: Dependencias a 2 saltos desde init_metabase_go_pipelines\n",
      "  domain_subgraph: Todas las aristas entre funciones del dominio finance\n",
      "  most_connected: Top 5 nodos con mas conexiones (in + out degree)\n",
      "  path_exists: Existe un camino entre cualquier funcion de finance y error_go_core?\n",
      "  isolated: Funciones sin ninguna dependencia (ni entrante ni saliente)\n",
      "  type_users: Funciones que usan el tipo SMA_go_finance\n"
     ]
    }
   ],
   "source": [
    "DATA_DIR = 'data/graph_bench'\n",
    "os.makedirs(DATA_DIR, exist_ok=True)\n",
    "\n",
    "# Queries de traversal para benchmark (respuestas verificables contra el grafo)\n",
    "BENCH_QUERIES = [\n",
    "    ('direct_deps', 'Funciones que usa directamente filter_slice_go_core'),\n",
    "    ('reverse_deps', 'Funciones que dependen de error_go_core'),\n",
    "    ('two_hop', 'Dependencias a 2 saltos desde init_metabase_go_pipelines'),\n",
    "    ('domain_subgraph', 'Todas las aristas entre funciones del dominio finance'),\n",
    "    ('most_connected', 'Top 5 nodos con mas conexiones (in + out degree)'),\n",
    "    ('path_exists', 'Existe un camino entre cualquier funcion de finance y error_go_core?'),\n",
    "    ('isolated', 'Funciones sin ninguna dependencia (ni entrante ni saliente)'),\n",
    "    ('type_users', 'Funciones que usan el tipo SMA_go_finance'),\n",
    "]\n",
    "\n",
    "def dir_size_mb(path):\n",
    "    total = 0\n",
    "    if os.path.isfile(path):\n",
    "        return os.path.getsize(path) / (1024*1024)\n",
    "    if not os.path.exists(path):\n",
    "        return 0\n",
    "    for dp, dn, fns in os.walk(path):\n",
    "        for f in fns:\n",
    "            fp = os.path.join(dp, f)\n",
    "            if os.path.exists(fp):\n",
    "                total += os.path.getsize(fp)\n",
    "    return total / (1024*1024)\n",
    "\n",
    "def cleanup_path(path):\n",
    "    if os.path.isfile(path):\n",
    "        os.remove(path)\n",
    "    elif os.path.isdir(path):\n",
    "        shutil.rmtree(path, ignore_errors=True)\n",
    "    for suffix in ['.db', '.pickle', '.graphml']:\n",
    "        p = path + suffix\n",
    "        if os.path.exists(p):\n",
    "            os.remove(p)\n",
    "\n",
    "print(f'Benchmark queries: {len(BENCH_QUERIES)}')\n",
    "for qid, desc in BENCH_QUERIES:\n",
    "    print(f'  {qid}: {desc}')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "section-3",
   "metadata": {},
   "source": [
    "## 3. Backend: NetworkX (baseline)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "networkx-impl",
   "metadata": {
    "jupyter": {
     "source_hidden": true
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "NetworkX: 393 nodos, 395 aristas\n",
      "  Insert: 2.0ms\n",
      "  Queries (8): 1.0ms\n",
      "  Save: 1.4ms\n",
      "  Load+query: 1.0ms\n",
      "  Disco: 0.16MB\n",
      "\n",
      "  direct_deps: []\n",
      "  reverse_deps: 176 resultados — ['apply_theme_typescript_ui', 'assert_command_exists_bash_shell', 'assert_docker_container_running_bash_infra']...\n",
      "  two_hop: []\n",
      "  domain_subgraph: 10 resultados — [('bollinger_bands_go_finance', 'bollinger_result_go_finance'), ('bollinger_bands_py_finance', 'sma_py_finance'), ('fetch_ohlcv_go_finance', 'ohlcv_go_finance')]...\n",
      "  most_connected: [('error_go_core', 176), ('cn_typescript_core', 27), ('docker_tui_go_infra', 26), ('MetabaseClient_go_infra', 21), ('init_jupyter_analysis_bash_pipelines', 9)]\n",
      "  path_exists: True\n",
      "  isolated: 122 resultados — ['all_of_py_core', 'all_slice_go_core', 'annualized_volatility_go_finance']...\n",
      "  type_users: []\n"
     ]
    }
   ],
   "source": [
    "import networkx as nx\n",
    "import pickle\n",
    "\n",
    "def nx_insert(nodes, edges_list, path):\n",
    "    G = nx.DiGraph()\n",
    "    for nid, attrs in nodes.items():\n",
    "        G.add_node(nid, **{k: v for k, v in attrs.items() if isinstance(v, (str, int, float, bool))})\n",
    "    for src, tgt, rel in edges_list:\n",
    "        G.add_edge(src, tgt, relation=rel)\n",
    "    return G\n",
    "\n",
    "def nx_queries(G):\n",
    "    results = {}\n",
    "    \n",
    "    # direct_deps\n",
    "    if 'filter_slice_go_core' in G:\n",
    "        results['direct_deps'] = list(G.successors('filter_slice_go_core'))\n",
    "    else:\n",
    "        results['direct_deps'] = []\n",
    "    \n",
    "    # reverse_deps\n",
    "    if 'error_go_core' in G:\n",
    "        results['reverse_deps'] = list(G.predecessors('error_go_core'))\n",
    "    else:\n",
    "        results['reverse_deps'] = []\n",
    "    \n",
    "    # two_hop\n",
    "    two_hop = set()\n",
    "    if 'init_metabase_go_pipelines' in G:\n",
    "        for n1 in G.successors('init_metabase_go_pipelines'):\n",
    "            for n2 in G.successors(n1):\n",
    "                two_hop.add(n2)\n",
    "    results['two_hop'] = list(two_hop)\n",
    "    \n",
    "    # domain_subgraph\n",
    "    finance_nodes = [n for n, d in G.nodes(data=True) if d.get('domain') == 'finance']\n",
    "    finance_edges = [(u, v) for u, v in G.edges() if u in finance_nodes and v in finance_nodes]\n",
    "    results['domain_subgraph'] = finance_edges\n",
    "    \n",
    "    # most_connected\n",
    "    degree = sorted(((n, G.in_degree(n) + G.out_degree(n)) for n in G.nodes()), key=lambda x: -x[1])[:5]\n",
    "    results['most_connected'] = degree\n",
    "    \n",
    "    # path_exists\n",
    "    if 'error_go_core' in G:\n",
    "        has_path = any(\n",
    "            nx.has_path(G, n, 'error_go_core')\n",
    "            for n in finance_nodes if n != 'error_go_core'\n",
    "        )\n",
    "        results['path_exists'] = has_path\n",
    "    else:\n",
    "        results['path_exists'] = False\n",
    "    \n",
    "    # isolated\n",
    "    results['isolated'] = [n for n in G.nodes() if G.degree(n) == 0]\n",
    "    \n",
    "    # type_users\n",
    "    if 'SMA_go_finance' in G:\n",
    "        results['type_users'] = list(G.predecessors('SMA_go_finance'))\n",
    "    else:\n",
    "        results['type_users'] = []\n",
    "    \n",
    "    return results\n",
    "\n",
    "def nx_save(G, path):\n",
    "    with open(path + '.pickle', 'wb') as f:\n",
    "        pickle.dump(G, f)\n",
    "\n",
    "def nx_load(path):\n",
    "    with open(path + '.pickle', 'rb') as f:\n",
    "        return pickle.load(f)\n",
    "\n",
    "# Benchmark\n",
    "path = os.path.join(DATA_DIR, 'networkx')\n",
    "cleanup_path(path)\n",
    "\n",
    "t0 = time.perf_counter()\n",
    "G_nx = nx_insert(node_map, valid_edges, path)\n",
    "nx_insert_time = time.perf_counter() - t0\n",
    "\n",
    "t0 = time.perf_counter()\n",
    "nx_results = nx_queries(G_nx)\n",
    "nx_query_time = time.perf_counter() - t0\n",
    "\n",
    "t0 = time.perf_counter()\n",
    "nx_save(G_nx, path)\n",
    "nx_save_time = time.perf_counter() - t0\n",
    "\n",
    "t0 = time.perf_counter()\n",
    "G_loaded = nx_load(path)\n",
    "_ = list(G_loaded.successors(list(G_loaded.nodes())[0]))\n",
    "nx_load_time = time.perf_counter() - t0\n",
    "\n",
    "nx_disk = dir_size_mb(path + '.pickle')\n",
    "\n",
    "print(f'NetworkX: {G_nx.number_of_nodes()} nodos, {G_nx.number_of_edges()} aristas')\n",
    "print(f'  Insert: {nx_insert_time*1000:.1f}ms')\n",
    "print(f'  Queries (8): {nx_query_time*1000:.1f}ms')\n",
    "print(f'  Save: {nx_save_time*1000:.1f}ms')\n",
    "print(f'  Load+query: {nx_load_time*1000:.1f}ms')\n",
    "print(f'  Disco: {nx_disk:.2f}MB')\n",
    "print()\n",
    "for k, v in nx_results.items():\n",
    "    if isinstance(v, list) and len(v) > 5:\n",
    "        print(f'  {k}: {len(v)} resultados — {v[:3]}...')\n",
    "    else:\n",
    "        print(f'  {k}: {v}')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "section-4",
   "metadata": {},
   "source": [
    "## 4. Backend: Kuzu (embedded Cypher)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "kuzu-impl",
   "metadata": {
    "jupyter": {
     "source_hidden": true
    }
   },
   "outputs": [
    {
     "ename": "RuntimeError",
     "evalue": "Runtime exception: Database path cannot be a directory: /home/lucas/fn_registry/analysis/retrieving_graphs/notebooks/data/graph_bench/kuzu",
     "output_type": "error",
     "traceback": [
      "\u001b[31m---------------------------------------------------------------------------\u001b[39m",
      "\u001b[31mRuntimeError\u001b[39m                              Traceback (most recent call last)",
      "\u001b[36mCell\u001b[39m\u001b[36m \u001b[39m\u001b[32mIn[4]\u001b[39m\u001b[32m, line 104\u001b[39m\n\u001b[32m    100\u001b[39m \u001b[38;5;66;03m# Benchmark\u001b[39;00m\n\u001b[32m    101\u001b[39m path = os.path.join(DATA_DIR, \u001b[33m'kuzu'\u001b[39m)\n\u001b[32m    102\u001b[39m \n\u001b[32m    103\u001b[39m t0 = time.perf_counter()\n\u001b[32m--> \u001b[39m\u001b[32m104\u001b[39m kuzu_db, kuzu_conn = kuzu_setup(node_map, valid_edges, path)\n\u001b[32m    105\u001b[39m kuzu_insert_time = time.perf_counter() - t0\n\u001b[32m    106\u001b[39m \n\u001b[32m    107\u001b[39m t0 = time.perf_counter()\n",
      "\u001b[36mCell\u001b[39m\u001b[36m \u001b[39m\u001b[32mIn[4]\u001b[39m\u001b[32m, line 6\u001b[39m, in \u001b[36mkuzu_setup\u001b[39m\u001b[34m(nodes, edges_list, path)\u001b[39m\n\u001b[32m      3\u001b[39m \u001b[38;5;28;01mdef\u001b[39;00m kuzu_setup(nodes, edges_list, path):\n\u001b[32m      4\u001b[39m     cleanup_path(path)\n\u001b[32m      5\u001b[39m     os.makedirs(path, exist_ok=\u001b[38;5;28;01mTrue\u001b[39;00m)\n\u001b[32m----> \u001b[39m\u001b[32m6\u001b[39m     db = kuzu.Database(path)\n\u001b[32m      7\u001b[39m     conn = kuzu.Connection(db)\n\u001b[32m      8\u001b[39m \n\u001b[32m      9\u001b[39m     \u001b[38;5;66;03m# Schema\u001b[39;00m\n",
      "\u001b[36mFile \u001b[39m\u001b[32m~/fn_registry/analysis/retrieving_graphs/.venv/lib/python3.13/site-packages/kuzu/database.py:104\u001b[39m, in \u001b[36mDatabase.__init__\u001b[39m\u001b[34m(self, database_path, buffer_pool_size, max_num_threads, compression, lazy_init, read_only, max_db_size, auto_checkpoint, checkpoint_threshold)\u001b[39m\n\u001b[32m    102\u001b[39m \u001b[38;5;28mself\u001b[39m._database: Any = \u001b[38;5;28;01mNone\u001b[39;00m  \u001b[38;5;66;03m# (type: _kuzu.Database from pybind11)\u001b[39;00m\n\u001b[32m    103\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m lazy_init:\n\u001b[32m--> \u001b[39m\u001b[32m104\u001b[39m     \u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43minit_database\u001b[49m\u001b[43m(\u001b[49m\u001b[43m)\u001b[49m\n",
      "\u001b[36mFile \u001b[39m\u001b[32m~/fn_registry/analysis/retrieving_graphs/.venv/lib/python3.13/site-packages/kuzu/database.py:155\u001b[39m, in \u001b[36mDatabase.init_database\u001b[39m\u001b[34m(self)\u001b[39m\n\u001b[32m    153\u001b[39m \u001b[38;5;28mself\u001b[39m.check_for_database_close()\n\u001b[32m    154\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;28mself\u001b[39m._database \u001b[38;5;129;01mis\u001b[39;00m \u001b[38;5;28;01mNone\u001b[39;00m:\n\u001b[32m--> \u001b[39m\u001b[32m155\u001b[39m     \u001b[38;5;28mself\u001b[39m._database = \u001b[43m_kuzu\u001b[49m\u001b[43m.\u001b[49m\u001b[43mDatabase\u001b[49m\u001b[43m(\u001b[49m\u001b[43m  \u001b[49m\u001b[38;5;66;43;03m# type: ignore[union-attr]\u001b[39;49;00m\n\u001b[32m    156\u001b[39m \u001b[43m        \u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43mdatabase_path\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m    157\u001b[39m \u001b[43m        \u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43mbuffer_pool_size\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m    158\u001b[39m \u001b[43m        \u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43mmax_num_threads\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m    159\u001b[39m \u001b[43m        \u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43mcompression\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m    160\u001b[39m \u001b[43m        \u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43mread_only\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m    161\u001b[39m \u001b[43m        \u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43mmax_db_size\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m    162\u001b[39m \u001b[43m        \u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43mauto_checkpoint\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m    163\u001b[39m \u001b[43m        
\u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43mcheckpoint_threshold\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m    164\u001b[39m \u001b[43m    \u001b[49m\u001b[43m)\u001b[49m\n",
      "\u001b[31mRuntimeError\u001b[39m: Runtime exception: Database path cannot be a directory: /home/lucas/fn_registry/analysis/retrieving_graphs/notebooks/data/graph_bench/kuzu"
     ]
    }
   ],
   "source": [
    "import kuzu\n",
    "\n",
    "def kuzu_setup(nodes, edges_list, path):\n",
    "    cleanup_path(path)\n",
    "    os.makedirs(path, exist_ok=True)\n",
    "    db = kuzu.Database(path)\n",
    "    conn = kuzu.Connection(db)\n",
    "    \n",
    "    # Schema\n",
    "    conn.execute('CREATE NODE TABLE FnNode(id STRING, name STRING, node_type STRING, '\n",
    "                 'kind STRING, lang STRING, domain STRING, purity STRING, '\n",
    "                 'description STRING, PRIMARY KEY(id))')\n",
    "    conn.execute('CREATE REL TABLE DEPENDS_ON(FROM FnNode TO FnNode, relation STRING)')\n",
    "    \n",
    "    # Insert nodos\n",
    "    for nid, attrs in nodes.items():\n",
    "        conn.execute(\n",
    "            'CREATE (n:FnNode {id: $id, name: $name, node_type: $node_type, '\n",
    "            'kind: $kind, lang: $lang, domain: $domain, purity: $purity, '\n",
    "            'description: $desc})',\n",
    "            parameters={\n",
    "                'id': nid,\n",
    "                'name': attrs.get('name', ''),\n",
    "                'node_type': attrs.get('node_type', ''),\n",
    "                'kind': attrs.get('kind', ''),\n",
    "                'lang': attrs.get('lang', ''),\n",
    "                'domain': attrs.get('domain', ''),\n",
    "                'purity': attrs.get('purity', ''),\n",
    "                'desc': attrs.get('description', ''),\n",
    "            }\n",
    "        )\n",
    "    \n",
    "    # Insert aristas\n",
    "    for src, tgt, rel in edges_list:\n",
    "        conn.execute(\n",
    "            'MATCH (a:FnNode {id: $src}), (b:FnNode {id: $tgt}) '\n",
    "            'CREATE (a)-[:DEPENDS_ON {relation: $rel}]->(b)',\n",
    "            parameters={'src': src, 'tgt': tgt, 'rel': rel}\n",
    "        )\n",
    "    \n",
    "    return db, conn\n",
    "\n",
    "def kuzu_queries(conn):\n",
    "    results = {}\n",
    "    \n",
    "    # direct_deps\n",
    "    r = conn.execute('MATCH (a:FnNode {id: \"filter_slice_go_core\"})-[:DEPENDS_ON]->(b) RETURN b.id')\n",
    "    results['direct_deps'] = [row[0] for row in r.get_as_df().values]\n",
    "    \n",
    "    # reverse_deps\n",
    "    r = conn.execute('MATCH (a)-[:DEPENDS_ON]->(b:FnNode {id: \"error_go_core\"}) RETURN a.id')\n",
    "    results['reverse_deps'] = [row[0] for row in r.get_as_df().values]\n",
    "    \n",
    "    # two_hop\n",
    "    r = conn.execute(\n",
    "        'MATCH (a:FnNode {id: \"init_metabase_go_pipelines\"})-[:DEPENDS_ON]->()-[:DEPENDS_ON]->(c) '\n",
    "        'RETURN DISTINCT c.id'\n",
    "    )\n",
    "    results['two_hop'] = [row[0] for row in r.get_as_df().values]\n",
    "    \n",
    "    # domain_subgraph\n",
    "    r = conn.execute(\n",
    "        'MATCH (a:FnNode {domain: \"finance\"})-[e:DEPENDS_ON]->(b:FnNode {domain: \"finance\"}) '\n",
    "        'RETURN a.id, b.id'\n",
    "    )\n",
    "    results['domain_subgraph'] = [(row[0], row[1]) for row in r.get_as_df().values]\n",
    "    \n",
    "    # most_connected (in+out degree via counting edges)\n",
    "    r = conn.execute(\n",
    "        'MATCH (n:FnNode) '\n",
    "        'OPTIONAL MATCH (n)-[e1:DEPENDS_ON]->() '\n",
    "        'OPTIONAL MATCH ()-[e2:DEPENDS_ON]->(n) '\n",
    "        'RETURN n.id, count(DISTINCT e1) + count(DISTINCT e2) AS deg '\n",
    "        'ORDER BY deg DESC LIMIT 5'\n",
    "    )\n",
    "    results['most_connected'] = [(row[0], row[1]) for row in r.get_as_df().values]\n",
    "    \n",
    "    # path_exists\n",
    "    r = conn.execute(\n",
    "        'MATCH (a:FnNode {domain: \"finance\"})-[:DEPENDS_ON* 1..5]->(b:FnNode {id: \"error_go_core\"}) '\n",
    "        'RETURN a.id LIMIT 1'\n",
    "    )\n",
    "    results['path_exists'] = len(r.get_as_df()) > 0\n",
    "    \n",
    "    # isolated\n",
    "    r = conn.execute(\n",
    "        'MATCH (n:FnNode) WHERE NOT EXISTS { MATCH (n)-[:DEPENDS_ON]->() } '\n",
    "        'AND NOT EXISTS { MATCH ()-[:DEPENDS_ON]->(n) } RETURN n.id'\n",
    "    )\n",
    "    results['isolated'] = [row[0] for row in r.get_as_df().values]\n",
    "    \n",
    "    # type_users\n",
    "    r = conn.execute(\n",
    "        'MATCH (a)-[:DEPENDS_ON {relation: \"uses_type\"}]->(b:FnNode {id: \"SMA_go_finance\"}) RETURN a.id'\n",
    "    )\n",
    "    results['type_users'] = [row[0] for row in r.get_as_df().values]\n",
    "    \n",
    "    return results\n",
    "\n",
    "# Benchmark\n",
    "path = os.path.join(DATA_DIR, 'kuzu')\n",
    "\n",
    "t0 = time.perf_counter()\n",
    "kuzu_db, kuzu_conn = kuzu_setup(node_map, valid_edges, path)\n",
    "kuzu_insert_time = time.perf_counter() - t0\n",
    "\n",
    "t0 = time.perf_counter()\n",
    "kuzu_results = kuzu_queries(kuzu_conn)\n",
    "kuzu_query_time = time.perf_counter() - t0\n",
    "\n",
    "kuzu_disk = dir_size_mb(path)\n",
    "\n",
    "# Load from cold\n",
    "del kuzu_conn, kuzu_db\n",
    "t0 = time.perf_counter()\n",
    "kuzu_db2 = kuzu.Database(path)\n",
    "kuzu_conn2 = kuzu.Connection(kuzu_db2)\n",
    "r = kuzu_conn2.execute('MATCH (a:FnNode {id: \"filter_slice_go_core\"})-[:DEPENDS_ON]->(b) RETURN b.id')\n",
    "_ = r.get_as_df()\n",
    "kuzu_load_time = time.perf_counter() - t0\n",
    "del kuzu_conn2, kuzu_db2\n",
    "\n",
    "print(f'Kuzu:')\n",
    "print(f'  Insert: {kuzu_insert_time*1000:.1f}ms')\n",
    "print(f'  Queries (8): {kuzu_query_time*1000:.1f}ms')\n",
    "print(f'  Load+query: {kuzu_load_time*1000:.1f}ms')\n",
    "print(f'  Disco: {kuzu_disk:.2f}MB')\n",
    "print()\n",
    "for k, v in kuzu_results.items():\n",
    "    if isinstance(v, list) and len(v) > 5:\n",
    "        print(f'  {k}: {len(v)} resultados — {v[:3]}...')\n",
    "    else:\n",
    "        print(f'  {k}: {v}')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "section-5",
   "metadata": {},
   "source": [
    "## 5. Backend: SQLite + recursive CTEs"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "sqlite-impl",
   "metadata": {
    "jupyter": {
     "source_hidden": true
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "SQLite + CTEs:\n",
      "  Insert: 89.2ms\n",
      "  Queries (8): 2.3ms\n",
      "  Load+query: 0.2ms\n",
      "  Disco: 0.20MB\n",
      "\n",
      "  direct_deps: []\n",
      "  reverse_deps: 176 resultados — ['apply_theme_typescript_ui', 'assert_command_exists_bash_shell', 'assert_docker_container_running_bash_infra']...\n",
      "  two_hop: []\n",
      "  domain_subgraph: 10 resultados — [('bollinger_bands_go_finance', 'bollinger_result_go_finance'), ('bollinger_bands_py_finance', 'sma_py_finance'), ('fetch_ohlcv_go_finance', 'ohlcv_go_finance')]...\n",
      "  most_connected: [('error_go_core', 176), ('cn_typescript_core', 27), ('docker_tui_go_infra', 26), ('MetabaseClient_go_infra', 21), ('init_jupyter_analysis_bash_pipelines', 9)]\n",
      "  path_exists: True\n",
      "  isolated: 122 resultados — ['ComponentVariants_typescript_core', 'all_of_py_core', 'all_slice_go_core']...\n",
      "  type_users: []\n"
     ]
    }
   ],
   "source": [
    "def sqlite_setup(nodes, edges_list, path):\n",
    "    dbpath = path + '.db'\n",
    "    cleanup_path(dbpath)\n",
    "    db = sqlite3.connect(dbpath)\n",
    "    db.execute('CREATE TABLE nodes (id TEXT PRIMARY KEY, name TEXT, node_type TEXT, '\n",
    "               'kind TEXT, lang TEXT, domain TEXT, purity TEXT, description TEXT)')\n",
    "    db.execute('CREATE TABLE edges (src TEXT, tgt TEXT, relation TEXT, '\n",
    "               'FOREIGN KEY(src) REFERENCES nodes(id), FOREIGN KEY(tgt) REFERENCES nodes(id))')\n",
    "    db.execute('CREATE INDEX idx_edges_src ON edges(src)')\n",
    "    db.execute('CREATE INDEX idx_edges_tgt ON edges(tgt)')\n",
    "    db.execute('CREATE INDEX idx_edges_rel ON edges(relation)')\n",
    "    db.execute('CREATE INDEX idx_nodes_domain ON nodes(domain)')\n",
    "    \n",
    "    db.executemany(\n",
    "        'INSERT INTO nodes VALUES (?,?,?,?,?,?,?,?)',\n",
    "        [(nid, a.get('name',''), a.get('node_type',''), a.get('kind',''),\n",
    "          a.get('lang',''), a.get('domain',''), a.get('purity',''),\n",
    "          a.get('description','')) for nid, a in nodes.items()]\n",
    "    )\n",
    "    db.executemany('INSERT INTO edges VALUES (?,?,?)', edges_list)\n",
    "    db.commit()\n",
    "    return db\n",
    "\n",
    "def sqlite_queries(db):\n",
    "    results = {}\n",
    "    \n",
    "    # direct_deps\n",
    "    results['direct_deps'] = [r[0] for r in db.execute(\n",
    "        'SELECT tgt FROM edges WHERE src = \"filter_slice_go_core\"'\n",
    "    ).fetchall()]\n",
    "    \n",
    "    # reverse_deps\n",
    "    results['reverse_deps'] = [r[0] for r in db.execute(\n",
    "        'SELECT src FROM edges WHERE tgt = \"error_go_core\"'\n",
    "    ).fetchall()]\n",
    "    \n",
    "    # two_hop (CTE recursivo)\n",
    "    results['two_hop'] = [r[0] for r in db.execute(\n",
    "        'WITH hop1 AS (SELECT tgt FROM edges WHERE src = \"init_metabase_go_pipelines\"), '\n",
    "        'hop2 AS (SELECT DISTINCT e.tgt FROM edges e JOIN hop1 h ON e.src = h.tgt) '\n",
    "        'SELECT tgt FROM hop2'\n",
    "    ).fetchall()]\n",
    "    \n",
    "    # domain_subgraph\n",
    "    results['domain_subgraph'] = db.execute(\n",
    "        'SELECT e.src, e.tgt FROM edges e '\n",
    "        'JOIN nodes n1 ON e.src = n1.id JOIN nodes n2 ON e.tgt = n2.id '\n",
    "        'WHERE n1.domain = \"finance\" AND n2.domain = \"finance\"'\n",
    "    ).fetchall()\n",
    "    \n",
    "    # most_connected\n",
    "    results['most_connected'] = db.execute(\n",
    "        'SELECT id, (SELECT COUNT(*) FROM edges WHERE src=id) + '\n",
    "        '(SELECT COUNT(*) FROM edges WHERE tgt=id) AS deg '\n",
    "        'FROM nodes ORDER BY deg DESC LIMIT 5'\n",
    "    ).fetchall()\n",
    "    \n",
    "    # path_exists (CTE recursivo con limite de profundidad)\n",
    "    results['path_exists'] = len(db.execute(\n",
    "        'WITH RECURSIVE reachable(id, depth) AS ('\n",
    "        '  SELECT src, 0 FROM edges e JOIN nodes n ON e.src = n.id WHERE n.domain = \"finance\" '\n",
    "        '  UNION '\n",
    "        '  SELECT e.tgt, r.depth + 1 FROM edges e JOIN reachable r ON e.src = r.id WHERE r.depth < 5'\n",
    "        ') SELECT 1 FROM reachable WHERE id = \"error_go_core\" LIMIT 1'\n",
    "    ).fetchall()) > 0\n",
    "    \n",
    "    # isolated\n",
    "    results['isolated'] = [r[0] for r in db.execute(\n",
    "        'SELECT n.id FROM nodes n '\n",
    "        'WHERE NOT EXISTS (SELECT 1 FROM edges WHERE src = n.id) '\n",
    "        'AND NOT EXISTS (SELECT 1 FROM edges WHERE tgt = n.id)'\n",
    "    ).fetchall()]\n",
    "    \n",
    "    # type_users\n",
    "    results['type_users'] = [r[0] for r in db.execute(\n",
    "        'SELECT src FROM edges WHERE tgt = \"SMA_go_finance\" AND relation = \"uses_type\"'\n",
    "    ).fetchall()]\n",
    "    \n",
    "    return results\n",
    "\n",
    "# Benchmark\n",
    "path = os.path.join(DATA_DIR, 'sqlite_graph')\n",
    "\n",
    "t0 = time.perf_counter()\n",
    "sqlite_db = sqlite_setup(node_map, valid_edges, path)\n",
    "sqlite_insert_time = time.perf_counter() - t0\n",
    "\n",
    "t0 = time.perf_counter()\n",
    "sqlite_results = sqlite_queries(sqlite_db)\n",
    "sqlite_query_time = time.perf_counter() - t0\n",
    "\n",
    "sqlite_db.close()\n",
    "sqlite_disk = dir_size_mb(path + '.db')\n",
    "\n",
    "t0 = time.perf_counter()\n",
    "db2 = sqlite3.connect(path + '.db')\n",
    "_ = db2.execute('SELECT tgt FROM edges WHERE src = \"filter_slice_go_core\"').fetchall()\n",
    "db2.close()\n",
    "sqlite_load_time = time.perf_counter() - t0\n",
    "\n",
    "print(f'SQLite + CTEs:')\n",
    "print(f'  Insert: {sqlite_insert_time*1000:.1f}ms')\n",
    "print(f'  Queries (8): {sqlite_query_time*1000:.1f}ms')\n",
    "print(f'  Load+query: {sqlite_load_time*1000:.1f}ms')\n",
    "print(f'  Disco: {sqlite_disk:.2f}MB')\n",
    "print()\n",
    "for k, v in sqlite_results.items():\n",
    "    if isinstance(v, list) and len(v) > 5:\n",
    "        print(f'  {k}: {len(v)} resultados — {v[:3]}...')\n",
    "    else:\n",
    "        print(f'  {k}: {v}')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "section-6",
   "metadata": {},
   "source": [
    "## 6. Backend: RDFLib (SPARQL)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "id": "rdflib-impl",
   "metadata": {},
   "outputs": [
    {
     "ename": "ParseException",
     "evalue": "Expected AskQuery, found '?'  (at char 41), (line:1, col:42)",
     "output_type": "error",
     "traceback": [
      "\u001b[31m---------------------------------------------------------------------------\u001b[39m",
      "\u001b[31mParseException\u001b[39m                            Traceback (most recent call last)",
      "\u001b[36mCell\u001b[39m\u001b[36m \u001b[39m\u001b[32mIn[15]\u001b[39m\u001b[32m, line 99\u001b[39m\n\u001b[32m     95\u001b[39m g_rdf = rdf_setup(node_map, valid_edges, path)\n\u001b[32m     96\u001b[39m rdf_insert_time = time.perf_counter() - t0\n\u001b[32m     97\u001b[39m \n\u001b[32m     98\u001b[39m t0 = time.perf_counter()\n\u001b[32m---> \u001b[39m\u001b[32m99\u001b[39m rdf_results = rdf_queries(g_rdf)\n\u001b[32m    100\u001b[39m rdf_query_time = time.perf_counter() - t0\n\u001b[32m    101\u001b[39m \n\u001b[32m    102\u001b[39m t0 = time.perf_counter()\n",
      "\u001b[36mCell\u001b[39m\u001b[36m \u001b[39m\u001b[32mIn[15]\u001b[39m\u001b[32m, line 68\u001b[39m, in \u001b[36mrdf_queries\u001b[39m\u001b[34m(g)\u001b[39m\n\u001b[32m     64\u001b[39m     )\n\u001b[32m     65\u001b[39m     results[\u001b[33m'most_connected'\u001b[39m] = [(str(row[\u001b[32m0\u001b[39m]).replace(str(FN), \u001b[33m''\u001b[39m), int(row[\u001b[32m1\u001b[39m])) \u001b[38;5;28;01mfor\u001b[39;00m row \u001b[38;5;28;01min\u001b[39;00m r]\n\u001b[32m     66\u001b[39m \n\u001b[32m     67\u001b[39m     \u001b[38;5;66;03m# path_exists (SPARQL 1.1 property paths, max 5 hops)\u001b[39;00m\n\u001b[32m---> \u001b[39m\u001b[32m68\u001b[39m     r = g.query(\n\u001b[32m     69\u001b[39m         \u001b[33m'ASK WHERE { ?a fnprop:domain \"finance\" . '\u001b[39m\n\u001b[32m     70\u001b[39m         \u001b[33m'?a (fnrel:uses_function|fnrel:uses_type|fnrel:returns|fnrel:error_type){1,5} fn:error_go_core }'\u001b[39m,\n\u001b[32m     71\u001b[39m         initNs={\u001b[33m'fn'\u001b[39m: FN, \u001b[33m'fnrel'\u001b[39m: FNREL, \u001b[33m'fnprop'\u001b[39m: FNPROP}\n",
      "\u001b[36mFile \u001b[39m\u001b[32m~/fn_registry/analysis/retrieving_graphs/.venv/lib/python3.13/site-packages/rdflib/graph.py:1742\u001b[39m, in \u001b[36mGraph.query\u001b[39m\u001b[34m(self, query_object, processor, result, initNs, initBindings, use_store_provided, **kwargs)\u001b[39m\n\u001b[32m   1739\u001b[39m     processor = plugin.get(processor, query.Processor)(\u001b[38;5;28mself\u001b[39m)\n\u001b[32m   1741\u001b[39m \u001b[38;5;66;03m# type error: Argument 1 to \"Result\" has incompatible type \"Mapping[str, Any]\"; expected \"str\"\u001b[39;00m\n\u001b[32m-> \u001b[39m\u001b[32m1742\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m result(\u001b[43mprocessor\u001b[49m\u001b[43m.\u001b[49m\u001b[43mquery\u001b[49m\u001b[43m(\u001b[49m\u001b[43mquery_object\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43minitBindings\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43minitNs\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43m*\u001b[49m\u001b[43m*\u001b[49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m)\n",
      "\u001b[36mFile \u001b[39m\u001b[32m~/fn_registry/analysis/retrieving_graphs/.venv/lib/python3.13/site-packages/rdflib/plugins/sparql/processor.py:144\u001b[39m, in \u001b[36mSPARQLProcessor.query\u001b[39m\u001b[34m(self, strOrQuery, initBindings, initNs, base, DEBUG)\u001b[39m\n\u001b[32m    124\u001b[39m \u001b[38;5;250m\u001b[39m\u001b[33;03m\"\"\"\u001b[39;00m\n\u001b[32m    125\u001b[39m \u001b[33;03mEvaluate a query with the given initial bindings, and initial\u001b[39;00m\n\u001b[32m    126\u001b[39m \u001b[33;03mnamespaces. The given base is used to resolve relative URIs in\u001b[39;00m\n\u001b[32m   (...)\u001b[39m\u001b[32m    140\u001b[39m \u001b[33;03m    documentation.\u001b[39;00m\n\u001b[32m    141\u001b[39m \u001b[33;03m\"\"\"\u001b[39;00m\n\u001b[32m    143\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;28misinstance\u001b[39m(strOrQuery, \u001b[38;5;28mstr\u001b[39m):\n\u001b[32m--> \u001b[39m\u001b[32m144\u001b[39m     strOrQuery = translateQuery(\u001b[43mparseQuery\u001b[49m\u001b[43m(\u001b[49m\u001b[43mstrOrQuery\u001b[49m\u001b[43m)\u001b[49m, base, initNs)\n\u001b[32m    146\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m evalQuery(\u001b[38;5;28mself\u001b[39m.graph, strOrQuery, initBindings, base)\n",
      "\u001b[36mFile \u001b[39m\u001b[32m~/fn_registry/analysis/retrieving_graphs/.venv/lib/python3.13/site-packages/rdflib/plugins/sparql/parser.py:1556\u001b[39m, in \u001b[36mparseQuery\u001b[39m\u001b[34m(q)\u001b[39m\n\u001b[32m   1553\u001b[39m     q = q.decode(\u001b[33m\"\u001b[39m\u001b[33mutf-8\u001b[39m\u001b[33m\"\u001b[39m)\n\u001b[32m   1555\u001b[39m q = expandUnicodeEscapes(q)\n\u001b[32m-> \u001b[39m\u001b[32m1556\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[43mQuery\u001b[49m\u001b[43m.\u001b[49m\u001b[43mparse_string\u001b[49m\u001b[43m(\u001b[49m\u001b[43mq\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mparse_all\u001b[49m\u001b[43m=\u001b[49m\u001b[38;5;28;43;01mTrue\u001b[39;49;00m\u001b[43m)\u001b[49m\n",
      "\u001b[36mFile \u001b[39m\u001b[32m~/fn_registry/analysis/retrieving_graphs/.venv/lib/python3.13/site-packages/pyparsing/core.py:1346\u001b[39m, in \u001b[36mParserElement.parse_string\u001b[39m\u001b[34m(self, instring, parse_all, **kwargs)\u001b[39m\n\u001b[32m   1343\u001b[39m         \u001b[38;5;28;01mraise\u001b[39;00m\n\u001b[32m   1345\u001b[39m     \u001b[38;5;66;03m# catch and re-raise exception from here, clearing out pyparsing internal stack trace\u001b[39;00m\n\u001b[32m-> \u001b[39m\u001b[32m1346\u001b[39m     \u001b[38;5;28;01mraise\u001b[39;00m exc.with_traceback(\u001b[38;5;28;01mNone\u001b[39;00m)\n\u001b[32m   1347\u001b[39m \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[32m   1348\u001b[39m     \u001b[38;5;28;01mreturn\u001b[39;00m tokens\n",
      "\u001b[31mParseException\u001b[39m: Expected AskQuery, found '?'  (at char 41), (line:1, col:42)"
     ]
    }
   ],
   "source": [
    "from rdflib import Graph as RDFGraph, Namespace, Literal, URIRef\n",
    "from rdflib.namespace import RDF, RDFS\n",
    "\n",
    "FN = Namespace('http://fn-registry.local/')\n",
    "FNREL = Namespace('http://fn-registry.local/rel/')\n",
    "FNPROP = Namespace('http://fn-registry.local/prop/')\n",
    "\n",
    "def rdf_setup(nodes, edges_list, path):\n",
    "    g = RDFGraph()\n",
    "    g.bind('fn', FN)\n",
    "    g.bind('fnrel', FNREL)\n",
    "    g.bind('fnprop', FNPROP)\n",
    "    \n",
    "    for nid, attrs in nodes.items():\n",
    "        uri = FN[nid]\n",
    "        g.add((uri, RDF.type, FN['Function'] if attrs.get('node_type') == 'function' else FN['Type']))\n",
    "        for prop in ['name', 'kind', 'lang', 'domain', 'purity', 'description']:\n",
    "            val = attrs.get(prop, '')\n",
    "            if val:\n",
    "                g.add((uri, FNPROP[prop], Literal(val)))\n",
    "    \n",
    "    for src, tgt, rel in edges_list:\n",
    "        g.add((FN[src], FNREL[rel], FN[tgt]))\n",
    "    \n",
    "    return g\n",
    "\n",
    "def rdf_queries(g):\n",
    "    results = {}\n",
    "    \n",
    "    # direct_deps\n",
    "    r = g.query('SELECT ?b WHERE { fn:filter_slice_go_core ?rel ?b . FILTER(STRSTARTS(STR(?rel), STR(fnrel:))) }',\n",
    "               initNs={'fn': FN, 'fnrel': FNREL})\n",
    "    results['direct_deps'] = [str(row[0]).replace(str(FN), '') for row in r]\n",
    "    \n",
    "    # reverse_deps\n",
    "    r = g.query('SELECT ?a WHERE { ?a ?rel fn:error_go_core . FILTER(STRSTARTS(STR(?rel), STR(fnrel:))) }',\n",
    "               initNs={'fn': FN, 'fnrel': FNREL})\n",
    "    results['reverse_deps'] = [str(row[0]).replace(str(FN), '') for row in r]\n",
    "    \n",
    "    # two_hop\n",
    "    r = g.query(\n",
    "        'SELECT DISTINCT ?c WHERE { fn:init_metabase_go_pipelines ?r1 ?b . ?b ?r2 ?c . '\n",
    "        'FILTER(STRSTARTS(STR(?r1), STR(fnrel:))) FILTER(STRSTARTS(STR(?r2), STR(fnrel:))) }',\n",
    "        initNs={'fn': FN, 'fnrel': FNREL}\n",
    "    )\n",
    "    results['two_hop'] = [str(row[0]).replace(str(FN), '') for row in r]\n",
    "    \n",
    "    # domain_subgraph\n",
    "    r = g.query(\n",
    "        'SELECT ?a ?b WHERE { ?a fnprop:domain \"finance\" . ?b fnprop:domain \"finance\" . '\n",
    "        '?a ?rel ?b . FILTER(STRSTARTS(STR(?rel), STR(fnrel:))) }',\n",
    "        initNs={'fn': FN, 'fnrel': FNREL, 'fnprop': FNPROP}\n",
    "    )\n",
    "    results['domain_subgraph'] = [(str(row[0]).replace(str(FN), ''), str(row[1]).replace(str(FN), '')) for row in r]\n",
    "    \n",
    "    # most_connected (SPARQL no tiene degree nativo, contamos)\n",
    "    r = g.query(\n",
    "        'SELECT ?n (COUNT(DISTINCT ?e) AS ?deg) WHERE { '\n",
    "        '{ ?n ?rel ?o . FILTER(STRSTARTS(STR(?rel), STR(fnrel:))) BIND(?o AS ?e) } '\n",
    "        'UNION '\n",
    "        '{ ?s ?rel ?n . FILTER(STRSTARTS(STR(?rel), STR(fnrel:))) BIND(?s AS ?e) } '\n",
    "        '} GROUP BY ?n ORDER BY DESC(?deg) LIMIT 5',\n",
    "        initNs={'fn': FN, 'fnrel': FNREL}\n",
    "    )\n",
    "    results['most_connected'] = [(str(row[0]).replace(str(FN), ''), int(row[1])) for row in r]\n",
    "    \n",
    "    # path_exists (SPARQL 1.1 property paths, max 5 hops)\n",
    "    r = g.query(\n",
    "        'ASK WHERE { ?a fnprop:domain \"finance\" . '\n",
    "        '?a (fnrel:uses_function|fnrel:uses_type|fnrel:returns|fnrel:error_type){1,5} fn:error_go_core }',\n",
    "        initNs={'fn': FN, 'fnrel': FNREL, 'fnprop': FNPROP}\n",
    "    )\n",
    "    results['path_exists'] = bool(r)\n",
    "    \n",
    "    # isolated\n",
    "    r = g.query(\n",
    "        'SELECT ?n WHERE { ?n a ?type . FILTER(?type IN (fn:Function, fn:Type)) '\n",
    "        'FILTER NOT EXISTS { ?n ?rel ?o . FILTER(STRSTARTS(STR(?rel), STR(fnrel:))) } '\n",
    "        'FILTER NOT EXISTS { ?s ?rel2 ?n . FILTER(STRSTARTS(STR(?rel2), STR(fnrel:))) } }',\n",
    "        initNs={'fn': FN, 'fnrel': FNREL}\n",
    "    )\n",
    "    results['isolated'] = [str(row[0]).replace(str(FN), '') for row in r]\n",
    "    \n",
    "    # type_users\n",
    "    r = g.query('SELECT ?a WHERE { ?a fnrel:uses_type fn:SMA_go_finance }',\n",
    "               initNs={'fn': FN, 'fnrel': FNREL})\n",
    "    results['type_users'] = [str(row[0]).replace(str(FN), '') for row in r]\n",
    "    \n",
    "    return results\n",
    "\n",
    "# Benchmark\n",
    "path = os.path.join(DATA_DIR, 'rdflib')\n",
    "\n",
    "t0 = time.perf_counter()\n",
    "g_rdf = rdf_setup(node_map, valid_edges, path)\n",
    "rdf_insert_time = time.perf_counter() - t0\n",
    "\n",
    "t0 = time.perf_counter()\n",
    "rdf_results = rdf_queries(g_rdf)\n",
    "rdf_query_time = time.perf_counter() - t0\n",
    "\n",
    "t0 = time.perf_counter()\n",
    "g_rdf.serialize(destination=path + '.ttl', format='turtle')\n",
    "rdf_save_time = time.perf_counter() - t0\n",
    "\n",
    "rdf_disk = dir_size_mb(path + '.ttl')\n",
    "\n",
    "t0 = time.perf_counter()\n",
    "g2 = RDFGraph()\n",
    "g2.parse(path + '.ttl', format='turtle')\n",
    "_ = list(g2.query('SELECT ?b WHERE { fn:filter_slice_go_core ?r ?b . FILTER(STRSTARTS(STR(?r), STR(fnrel:))) }',\n",
    "                  initNs={'fn': FN, 'fnrel': FNREL}))\n",
    "rdf_load_time = time.perf_counter() - t0\n",
    "\n",
    "print(f'RDFLib: {len(g_rdf)} triples')\n",
    "print(f'  Insert: {rdf_insert_time*1000:.1f}ms')\n",
    "print(f'  Queries (8): {rdf_query_time*1000:.1f}ms')\n",
    "print(f'  Save (turtle): {rdf_save_time*1000:.1f}ms')\n",
    "print(f'  Load+query: {rdf_load_time*1000:.1f}ms')\n",
    "print(f'  Disco: {rdf_disk:.2f}MB')\n",
    "print()\n",
    "for k, v in rdf_results.items():\n",
    "    if isinstance(v, list) and len(v) > 5:\n",
    "        print(f'  {k}: {len(v)} resultados — {v[:3]}...')\n",
    "    else:\n",
    "        print(f'  {k}: {v}')"
   ]
  },
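  {
   "cell_type": "markdown",
   "id": "sketch-bounded-reach",
   "metadata": {},
   "source": [
    "SPARQL 1.1 property paths offer `+` and `*` but no bounded repetition, whereas Cypher accepts ranges such as `*1..5`. When a hop limit matters, a plain-Python breadth-first search can act as a reference answer. The following is a minimal sketch, not part of the benchmark: `reachable_within` and the toy `edges` list are hypothetical names, and the edges are assumed to be `(src, tgt, rel)` tuples like the ones the setup functions iterate over.\n",
    "\n",
    "```python\n",
    "from collections import deque\n",
    "\n",
    "def reachable_within(edges, sources, target, max_hops):\n",
    "    # True if any source reaches target in 1..max_hops directed hops\n",
    "    adj = {}\n",
    "    for src, tgt, _rel in edges:\n",
    "        adj.setdefault(src, []).append(tgt)\n",
    "    # Multi-source BFS; depth is the number of hops taken so far\n",
    "    queue = deque((s, 0) for s in sources)\n",
    "    seen = set(sources)\n",
    "    while queue:\n",
    "        node, depth = queue.popleft()\n",
    "        if depth == max_hops:\n",
    "            continue  # hop budget exhausted on this branch\n",
    "        for nxt in adj.get(node, []):\n",
    "            if nxt == target:\n",
    "                return True\n",
    "            if nxt not in seen:\n",
    "                seen.add(nxt)\n",
    "                queue.append((nxt, depth + 1))\n",
    "    return False\n",
    "\n",
    "# Toy chain a -> b -> c -> d (hypothetical data)\n",
    "edges = [('a', 'b', 'uses_function'), ('b', 'c', 'uses_type'), ('c', 'd', 'uses_function')]\n",
    "print(reachable_within(edges, ['a'], 'd', max_hops=2))  # False\n",
    "print(reachable_within(edges, ['a'], 'd', max_hops=3))  # True\n",
    "```\n",
    "\n",
    "With `sources` set to the finance node ids and `target='error_go_core'`, this would give a hop-limited counterpart to `path_exists`."
   ]
  },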
  {
   "cell_type": "markdown",
   "id": "section-7",
   "metadata": {},
   "source": [
    "## 7. Backend: igraph"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "id": "igraph-impl",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "igraph: 393 vertices, 395 aristas\n",
      "  Insert: 0.5ms\n",
      "  Queries (8): 0.7ms\n",
      "  Save: 0.6ms\n",
      "  Load+query: 0.3ms\n",
      "  Disco: 0.04MB\n",
      "\n",
      "  direct_deps: []\n",
      "  reverse_deps: 176 resultados — ['apply_theme_typescript_ui', 'assert_command_exists_bash_shell', 'assert_docker_container_running_bash_infra']...\n",
      "  two_hop: []\n",
      "  domain_subgraph: 10 resultados — [('bollinger_bands_go_finance', 'bollinger_result_go_finance'), ('bollinger_bands_py_finance', 'sma_py_finance'), ('fetch_ohlcv_go_finance', 'ohlcv_go_finance')]...\n",
      "  most_connected: [('error_go_core', 176), ('cn_typescript_core', 27), ('docker_tui_go_infra', 26), ('MetabaseClient_go_infra', 21), ('init_jupyter_analysis_bash_pipelines', 9)]\n",
      "  path_exists: True\n",
      "  isolated: 122 resultados — ['all_of_py_core', 'all_slice_go_core', 'annualized_volatility_go_finance']...\n",
      "  type_users: []\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/tmp/ipykernel_11708/2711278240.py:69: RuntimeWarning: Couldn't reach some vertices. Location: src/paths/unweighted.c:526\n",
      "  len(g.get_shortest_paths(src, to=target, mode='out')[0]) > 0\n"
     ]
    }
   ],
   "source": [
    "import igraph as ig\n",
    "\n",
    "def igraph_setup(nodes, edges_list, path):\n",
    "    node_ids = list(nodes.keys())\n",
    "    id_to_idx = {nid: i for i, nid in enumerate(node_ids)}\n",
    "    \n",
    "    g = ig.Graph(directed=True)\n",
    "    g.add_vertices(len(node_ids))\n",
    "    g.vs['node_id'] = node_ids\n",
    "    g.vs['name'] = [nodes[nid].get('name', '') for nid in node_ids]\n",
    "    g.vs['node_type'] = [nodes[nid].get('node_type', '') for nid in node_ids]\n",
    "    g.vs['domain'] = [nodes[nid].get('domain', '') for nid in node_ids]\n",
    "    g.vs['purity'] = [nodes[nid].get('purity', '') for nid in node_ids]\n",
    "    g.vs['kind'] = [nodes[nid].get('kind', '') for nid in node_ids]\n",
    "    g.vs['lang'] = [nodes[nid].get('lang', '') for nid in node_ids]\n",
    "    \n",
    "    edge_tuples = [(id_to_idx[s], id_to_idx[t]) for s, t, _ in edges_list]\n",
    "    edge_rels = [r for _, _, r in edges_list]\n",
    "    g.add_edges(edge_tuples)\n",
    "    g.es['relation'] = edge_rels\n",
    "    \n",
    "    return g, id_to_idx\n",
    "\n",
    "def igraph_queries(g, id_to_idx):\n",
    "    results = {}\n",
    "    idx_to_id = {v: k for k, v in id_to_idx.items()}\n",
    "    \n",
    "    # direct_deps\n",
    "    if 'filter_slice_go_core' in id_to_idx:\n",
    "        idx = id_to_idx['filter_slice_go_core']\n",
    "        results['direct_deps'] = [idx_to_id[n] for n in g.neighbors(idx, mode='out')]\n",
    "    else:\n",
    "        results['direct_deps'] = []\n",
    "    \n",
    "    # reverse_deps\n",
    "    if 'error_go_core' in id_to_idx:\n",
    "        idx = id_to_idx['error_go_core']\n",
    "        results['reverse_deps'] = [idx_to_id[n] for n in g.neighbors(idx, mode='in')]\n",
    "    else:\n",
    "        results['reverse_deps'] = []\n",
    "    \n",
    "    # two_hop\n",
    "    if 'init_metabase_go_pipelines' in id_to_idx:\n",
    "        idx = id_to_idx['init_metabase_go_pipelines']\n",
    "        hop1 = g.neighbors(idx, mode='out')\n",
    "        hop2 = set()\n",
    "        for n in hop1:\n",
    "            hop2.update(g.neighbors(n, mode='out'))\n",
    "        results['two_hop'] = [idx_to_id[n] for n in hop2]\n",
    "    else:\n",
    "        results['two_hop'] = []\n",
    "    \n",
    "    # domain_subgraph\n",
    "    finance_idxs = set(v.index for v in g.vs.select(domain='finance'))\n",
    "    finance_edges = [(idx_to_id[e.source], idx_to_id[e.target])\n",
    "                     for e in g.es if e.source in finance_idxs and e.target in finance_idxs]\n",
    "    results['domain_subgraph'] = finance_edges\n",
    "    \n",
    "    # most_connected\n",
    "    degrees = [(idx_to_id[i], g.degree(i, mode='all')) for i in range(g.vcount())]\n",
    "    degrees.sort(key=lambda x: -x[1])\n",
    "    results['most_connected'] = degrees[:5]\n",
    "    \n",
    "    # path_exists\n",
    "    if 'error_go_core' in id_to_idx:\n",
    "        target = id_to_idx['error_go_core']\n",
    "        finance_idxs_list = list(finance_idxs)\n",
    "        has_path = any(\n",
    "            len(g.get_shortest_paths(src, to=target, mode='out')[0]) > 0\n",
    "            for src in finance_idxs_list if src != target\n",
    "        )\n",
    "        results['path_exists'] = has_path\n",
    "    else:\n",
    "        results['path_exists'] = False\n",
    "    \n",
    "    # isolated\n",
    "    results['isolated'] = [idx_to_id[v.index] for v in g.vs if g.degree(v.index, mode='all') == 0]\n",
    "    \n",
    "    # type_users\n",
    "    if 'SMA_go_finance' in id_to_idx:\n",
    "        idx = id_to_idx['SMA_go_finance']\n",
    "        preds = g.neighbors(idx, mode='in')\n",
    "        # Filtrar por relacion uses_type\n",
    "        type_user_idxs = []\n",
    "        for p in preds:\n",
    "            eid = g.get_eid(p, idx)\n",
    "            if g.es[eid]['relation'] == 'uses_type':\n",
    "                type_user_idxs.append(p)\n",
    "        results['type_users'] = [idx_to_id[n] for n in type_user_idxs]\n",
    "    else:\n",
    "        results['type_users'] = []\n",
    "    \n",
    "    return results\n",
    "\n",
    "# Benchmark\n",
    "path = os.path.join(DATA_DIR, 'igraph')\n",
    "\n",
    "t0 = time.perf_counter()\n",
    "g_ig, ig_id_map = igraph_setup(node_map, valid_edges, path)\n",
    "ig_insert_time = time.perf_counter() - t0\n",
    "\n",
    "t0 = time.perf_counter()\n",
    "ig_results = igraph_queries(g_ig, ig_id_map)\n",
    "ig_query_time = time.perf_counter() - t0\n",
    "\n",
    "t0 = time.perf_counter()\n",
    "g_ig.write_pickle(path + '.pickle')\n",
    "ig_save_time = time.perf_counter() - t0\n",
    "\n",
    "ig_disk = dir_size_mb(path + '.pickle')\n",
    "\n",
    "t0 = time.perf_counter()\n",
    "g_loaded = ig.Graph.Read_Pickle(path + '.pickle')\n",
    "_ = g_loaded.neighbors(0, mode='out')\n",
    "ig_load_time = time.perf_counter() - t0\n",
    "\n",
    "print(f'igraph: {g_ig.vcount()} vertices, {g_ig.ecount()} aristas')\n",
    "print(f'  Insert: {ig_insert_time*1000:.1f}ms')\n",
    "print(f'  Queries (8): {ig_query_time*1000:.1f}ms')\n",
    "print(f'  Save: {ig_save_time*1000:.1f}ms')\n",
    "print(f'  Load+query: {ig_load_time*1000:.1f}ms')\n",
    "print(f'  Disco: {ig_disk:.2f}MB')\n",
    "print()\n",
    "for k, v in ig_results.items():\n",
    "    if isinstance(v, list) and len(v) > 5:\n",
    "        print(f'  {k}: {len(v)} resultados — {v[:3]}...')\n",
    "    else:\n",
    "        print(f'  {k}: {v}')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "section-memgraph",
   "metadata": {},
   "source": [
    "## 7b. Backend: Memgraph (Docker + Bolt/Cypher)\n",
    "\n",
    "Memgraph is an in-memory graph DB that supports Cypher and is compatible with Neo4j's Bolt protocol.\n",
    "We start it via Docker, using CLI calls equivalent to fn_registry's Docker functions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "memgraph-docker",
   "metadata": {},
   "outputs": [],
   "source": [
    "import subprocess\n",
    "\n",
    "# Usar las funciones Docker del registry via CLI\n",
    "# Equivalente a: docker_pull_image_go_infra(\"memgraph/memgraph:latest\")\n",
    "#                docker_run_container_go_infra(\"memgraph/memgraph:latest\", opts)\n",
    "\n",
    "MEMGRAPH_CONTAINER = 'fn_registry_memgraph_bench'\n",
    "MEMGRAPH_IMAGE = 'memgraph/memgraph:latest'\n",
    "BOLT_PORT = '7687'\n",
    "\n",
    "def run_cmd(cmd, check=True):\n",
    "    r = subprocess.run(cmd, capture_output=True, text=True, timeout=120)\n",
    "    if check and r.returncode != 0:\n",
    "        print(f'WARN: {\" \".join(cmd)} -> {r.stderr.strip()}')\n",
    "    return r\n",
    "\n",
    "# Limpiar contenedor previo si existe\n",
    "run_cmd(['docker', 'rm', '-f', MEMGRAPH_CONTAINER], check=False)\n",
    "\n",
    "# Pull image\n",
    "print('Pulling memgraph image...')\n",
    "run_cmd(['docker', 'pull', MEMGRAPH_IMAGE])\n",
    "\n",
    "# Run container\n",
    "print('Starting memgraph container...')\n",
    "r = run_cmd(['docker', 'run', '-d', '--name', MEMGRAPH_CONTAINER,\n",
    "             '-p', f'{BOLT_PORT}:7687',\n",
    "             '--rm',\n",
    "             MEMGRAPH_IMAGE,\n",
    "             '--bolt-server-name-ssl-mapping='])\n",
    "\n",
    "print(f'Container: {r.stdout.strip()[:12]}')\n",
    "\n",
    "# Esperar a que Memgraph este listo\n",
    "import time\n",
    "for attempt in range(15):\n",
    "    time.sleep(1)\n",
    "    check = run_cmd(['docker', 'exec', MEMGRAPH_CONTAINER, \n",
    "                      'mgconsole', '--command', 'RETURN 1;'], check=False)\n",
    "    if check.returncode == 0:\n",
    "        print(f'Memgraph listo (intento {attempt + 1})')\n",
    "        break\n",
    "else:\n",
    "    print('WARN: Memgraph no respondio en 15s')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "memgraph-install-driver",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Instalar driver neo4j (compatible con Memgraph via Bolt)\n",
    "import subprocess\n",
    "subprocess.run(['.venv/bin/pip', 'install', 'neo4j'], capture_output=True)\n",
    "\n",
    "from neo4j import GraphDatabase\n",
    "\n",
    "BOLT_URI = 'bolt://localhost:7687'\n",
    "\n",
    "def mg_driver():\n",
    "    return GraphDatabase.driver(BOLT_URI, auth=('', ''))\n",
    "\n",
    "# Test conexion\n",
    "with mg_driver() as driver:\n",
    "    with driver.session() as session:\n",
    "        r = session.run('RETURN 1 AS n')\n",
    "        print(f'Memgraph conectado: {r.single()[\"n\"]}')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "memgraph-impl",
   "metadata": {},
   "outputs": [],
   "source": [
    "def memgraph_setup(nodes, edges_list):\n",
    "    driver = mg_driver()\n",
    "    with driver.session() as session:\n",
    "        # Limpiar\n",
    "        session.run('MATCH (n) DETACH DELETE n')\n",
    "        \n",
    "        # Crear constraint para IDs unicos\n",
    "        try:\n",
    "            session.run('CREATE CONSTRAINT ON (n:FnNode) ASSERT n.id IS UNIQUE')\n",
    "        except Exception:\n",
    "            pass  # Ya existe\n",
    "        \n",
    "        # Insert nodos\n",
    "        for nid, attrs in nodes.items():\n",
    "            props = {k: v for k, v in attrs.items() if isinstance(v, (str, int, float, bool))}\n",
    "            props['id'] = nid\n",
    "            session.run(\n",
    "                'CREATE (n:FnNode $props)',\n",
    "                props=props\n",
    "            )\n",
    "        \n",
    "        # Crear indice en domain\n",
    "        try:\n",
    "            session.run('CREATE INDEX ON :FnNode(domain)')\n",
    "        except Exception:\n",
    "            pass\n",
    "        \n",
    "        # Insert aristas\n",
    "        for src, tgt, rel in edges_list:\n",
    "            session.run(\n",
    "                'MATCH (a:FnNode {id: $src}), (b:FnNode {id: $tgt}) '\n",
    "                'CREATE (a)-[:DEPENDS_ON {relation: $rel}]->(b)',\n",
    "                src=src, tgt=tgt, rel=rel\n",
    "            )\n",
    "    \n",
    "    return driver\n",
    "\n",
    "def memgraph_queries(driver):\n",
    "    results = {}\n",
    "    with driver.session() as s:\n",
    "        # direct_deps\n",
    "        r = s.run('MATCH (a:FnNode {id: \"filter_slice_go_core\"})-[:DEPENDS_ON]->(b) RETURN b.id')\n",
    "        results['direct_deps'] = [rec['b.id'] for rec in r]\n",
    "        \n",
    "        # reverse_deps\n",
    "        r = s.run('MATCH (a)-[:DEPENDS_ON]->(b:FnNode {id: \"error_go_core\"}) RETURN a.id')\n",
    "        results['reverse_deps'] = [rec['a.id'] for rec in r]\n",
    "        \n",
    "        # two_hop\n",
    "        r = s.run(\n",
    "            'MATCH (a:FnNode {id: \"init_metabase_go_pipelines\"})-[:DEPENDS_ON]->()-[:DEPENDS_ON]->(c) '\n",
    "            'RETURN DISTINCT c.id'\n",
    "        )\n",
    "        results['two_hop'] = [rec['c.id'] for rec in r]\n",
    "        \n",
    "        # domain_subgraph\n",
    "        r = s.run(\n",
    "            'MATCH (a:FnNode {domain: \"finance\"})-[:DEPENDS_ON]->(b:FnNode {domain: \"finance\"}) '\n",
    "            'RETURN a.id, b.id'\n",
    "        )\n",
    "        results['domain_subgraph'] = [(rec['a.id'], rec['b.id']) for rec in r]\n",
    "        \n",
    "        # most_connected (degree via count)\n",
    "        r = s.run(\n",
    "            'MATCH (n:FnNode) '\n",
    "            'OPTIONAL MATCH (n)-[e1:DEPENDS_ON]->() '\n",
    "            'OPTIONAL MATCH ()-[e2:DEPENDS_ON]->(n) '\n",
    "            'RETURN n.id, count(DISTINCT e1) + count(DISTINCT e2) AS deg '\n",
    "            'ORDER BY deg DESC LIMIT 5'\n",
    "        )\n",
    "        results['most_connected'] = [(rec['n.id'], rec['deg']) for rec in r]\n",
    "        \n",
    "        # path_exists (variable-length path)\n",
    "        r = s.run(\n",
    "            'MATCH (a:FnNode {domain: \"finance\"})-[:DEPENDS_ON*1..5]->(b:FnNode {id: \"error_go_core\"}) '\n",
    "            'RETURN a.id LIMIT 1'\n",
    "        )\n",
    "        results['path_exists'] = len(list(r)) > 0\n",
    "        \n",
    "        # isolated\n",
    "        r = s.run(\n",
    "            'MATCH (n:FnNode) '\n",
    "            'WHERE NOT (n)-[:DEPENDS_ON]->() AND NOT ()-[:DEPENDS_ON]->(n) '\n",
    "            'RETURN n.id'\n",
    "        )\n",
    "        results['isolated'] = [rec['n.id'] for rec in r]\n",
    "        \n",
    "        # type_users\n",
    "        r = s.run(\n",
    "            'MATCH (a)-[:DEPENDS_ON {relation: \"uses_type\"}]->(b:FnNode {id: \"SMA_go_finance\"}) '\n",
    "            'RETURN a.id'\n",
    "        )\n",
    "        results['type_users'] = [rec['a.id'] for rec in r]\n",
    "    \n",
    "    return results\n",
    "\n",
    "# Benchmark\n",
    "t0 = time.perf_counter()\n",
    "mg_drv = memgraph_setup(node_map, valid_edges)\n",
    "mg_insert_time = time.perf_counter() - t0\n",
    "\n",
    "t0 = time.perf_counter()\n",
    "mg_results = memgraph_queries(mg_drv)\n",
    "mg_query_time = time.perf_counter() - t0\n",
    "\n",
    "# Cold start: cerrar driver, reconectar y query\n",
    "mg_drv.close()\n",
    "t0 = time.perf_counter()\n",
    "mg_drv2 = mg_driver()\n",
    "with mg_drv2.session() as s:\n",
    "    r = s.run('MATCH (a:FnNode {id: \"filter_slice_go_core\"})-[:DEPENDS_ON]->(b) RETURN b.id')\n",
    "    _ = list(r)\n",
    "mg_load_time = time.perf_counter() - t0\n",
    "mg_drv2.close()\n",
    "\n",
    "# Disco: Memgraph es in-memory, pero podemos ver uso del container\n",
    "r = subprocess.run(['docker', 'stats', '--no-stream', '--format', '{{.MemUsage}}', MEMGRAPH_CONTAINER],\n",
    "                   capture_output=True, text=True)\n",
    "mg_mem_usage = r.stdout.strip()\n",
    "\n",
    "print(f'Memgraph (Docker):')\n",
    "print(f'  Insert: {mg_insert_time*1000:.1f}ms')\n",
    "print(f'  Queries (8): {mg_query_time*1000:.1f}ms')\n",
    "print(f'  Reconnect+query: {mg_load_time*1000:.1f}ms')\n",
    "print(f'  Memory: {mg_mem_usage}')\n",
    "print()\n",
    "for k, v in mg_results.items():\n",
    "    if isinstance(v, list) and len(v) > 5:\n",
    "        print(f'  {k}: {len(v)} resultados — {v[:3]}...')\n",
    "    else:\n",
    "        print(f'  {k}: {v}')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "section-8",
   "metadata": {},
   "source": [
    "## 8. Summary table and visualizations"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "summary",
   "metadata": {},
   "outputs": [],
   "source": [
    "summary_data = [\n",
    "    {'Backend': 'NetworkX', 'Insert (ms)': round(nx_insert_time*1000, 1),\n",
    "     'Queries 8x (ms)': round(nx_query_time*1000, 1),\n",
    "     'Save (ms)': round(nx_save_time*1000, 1),\n",
    "     'Load+query (ms)': round(nx_load_time*1000, 1),\n",
    "     'Disk (MB)': round(nx_disk, 2),\n",
    "     'Query Language': 'Python API'},\n",
    "    {'Backend': 'Kuzu', 'Insert (ms)': round(kuzu_insert_time*1000, 1),\n",
    "     'Queries 8x (ms)': round(kuzu_query_time*1000, 1),\n",
    "     'Save (ms)': 0,  # auto-persist\n",
    "     'Load+query (ms)': round(kuzu_load_time*1000, 1),\n",
    "     'Disk (MB)': round(kuzu_disk, 2),\n",
    "     'Query Language': 'Cypher'},\n",
    "    {'Backend': 'SQLite+CTE', 'Insert (ms)': round(sqlite_insert_time*1000, 1),\n",
    "     'Queries 8x (ms)': round(sqlite_query_time*1000, 1),\n",
    "     'Save (ms)': 0,  # auto-persist\n",
    "     'Load+query (ms)': round(sqlite_load_time*1000, 1),\n",
    "     'Disk (MB)': round(sqlite_disk, 2),\n",
    "     'Query Language': 'SQL + CTEs'},\n",
    "    {'Backend': 'RDFLib', 'Insert (ms)': round(rdf_insert_time*1000, 1),\n",
    "     'Queries 8x (ms)': round(rdf_query_time*1000, 1),\n",
    "     'Save (ms)': round(rdf_save_time*1000, 1),\n",
    "     'Load+query (ms)': round(rdf_load_time*1000, 1),\n",
    "     'Disk (MB)': round(rdf_disk, 2),\n",
    "     'Query Language': 'SPARQL'},\n",
    "    {'Backend': 'igraph', 'Insert (ms)': round(ig_insert_time*1000, 1),\n",
    "     'Queries 8x (ms)': round(ig_query_time*1000, 1),\n",
    "     'Save (ms)': round(ig_save_time*1000, 1),\n",
    "     'Load+query (ms)': round(ig_load_time*1000, 1),\n",
    "     'Disk (MB)': round(ig_disk, 2),\n",
    "     'Query Language': 'Python API'},\n",
    "    {'Backend': 'Memgraph', 'Insert (ms)': round(mg_insert_time*1000, 1),\n",
    "     'Queries 8x (ms)': round(mg_query_time*1000, 1),\n",
    "     'Save (ms)': 0,  # in-memory, no save\n",
    "     'Load+query (ms)': round(mg_load_time*1000, 1),\n",
    "     'Disk (MB)': 0,  # in-memory\n",
    "     'Query Language': 'Cypher (Bolt)'},\n",
    "]\n",
    "\n",
    "df_summary = pd.DataFrame(summary_data)\n",
    "print(df_summary.to_string(index=False))\n",
    "print()\n",
    "print(f'Memgraph memory: {mg_mem_usage}')\n",
    "\n",
    "# Grafico comparativo\n",
    "fig, axes = plt.subplots(1, 3, figsize=(18, 6))\n",
    "colors = {'NetworkX': '#e74c3c', 'Kuzu': '#3498db', 'SQLite+CTE': '#2ecc71',\n",
    "          'RDFLib': '#f39c12', 'igraph': '#9b59b6', 'Memgraph': '#1abc9c'}\n",
    "\n",
    "# Insert\n",
    "ax = axes[0]\n",
    "ax.barh(df_summary['Backend'], df_summary['Insert (ms)'], color=[colors[b] for b in df_summary['Backend']])\n",
    "ax.set_xlabel('ms')\n",
    "ax.set_title('Insert (nodos + aristas)')\n",
    "\n",
    "# Queries\n",
    "ax = axes[1]\n",
    "ax.barh(df_summary['Backend'], df_summary['Queries 8x (ms)'], color=[colors[b] for b in df_summary['Backend']])\n",
    "ax.set_xlabel('ms')\n",
    "ax.set_title('8 queries de traversal')\n",
    "\n",
    "# Load + query\n",
    "ax = axes[2]\n",
    "ax.barh(df_summary['Backend'], df_summary['Load+query (ms)'], color=[colors[b] for b in df_summary['Backend']])\n",
    "ax.set_xlabel('ms')\n",
    "ax.set_title('Cold start: load/reconnect + 1 query')\n",
    "\n",
    "plt.suptitle(f'Comparativa de graph backends ({len(all_node_ids)} nodos, {len(valid_edges)} aristas)', fontsize=14)\n",
    "plt.tight_layout()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "section-9",
   "metadata": {},
   "source": [
    "## 9. Cross-validation of results\n",
    "\n",
    "We check that every backend returns the same results as the NetworkX reference for each query."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cross-validate",
   "metadata": {},
   "outputs": [],
   "source": [
    "all_backend_results = {\n",
    "    'NetworkX': nx_results,\n",
    "    'Kuzu': kuzu_results,\n",
    "    'SQLite+CTE': sqlite_results,\n",
    "    'RDFLib': rdf_results,\n",
    "    'igraph': ig_results,\n",
    "}\n",
    "\n",
    "print('Validacion cruzada de resultados:')\n",
    "print('=' * 60)\n",
    "\n",
    "for query_name in ['direct_deps', 'reverse_deps', 'two_hop', 'isolated', 'type_users', 'path_exists']:\n",
    "    print(f'\\n{query_name}:')\n",
    "    values = {}\n",
    "    for backend, results in all_backend_results.items():\n",
    "        val = results.get(query_name)\n",
    "        if isinstance(val, list):\n",
    "            values[backend] = sorted(str(v) for v in val)\n",
    "        else:\n",
    "            values[backend] = val\n",
    "    \n",
    "    # Comparar\n",
    "    ref_backend = 'NetworkX'\n",
    "    ref_val = values[ref_backend]\n",
    "    all_match = True\n",
    "    for backend, val in values.items():\n",
    "        match = val == ref_val\n",
    "        status = 'OK' if match else 'DIFF'\n",
    "        if isinstance(val, list):\n",
    "            print(f'  {backend:12s}: {len(val)} items [{status}]')\n",
    "        else:\n",
    "            print(f'  {backend:12s}: {val} [{status}]')\n",
    "        if not match:\n",
    "            all_match = False\n",
    "            if isinstance(val, list) and isinstance(ref_val, list):\n",
    "                extra = set(val) - set(ref_val)\n",
    "                missing = set(ref_val) - set(val)\n",
    "                if extra: print(f'    extra: {list(extra)[:5]}')\n",
    "                if missing: print(f'    missing: {list(missing)[:5]}')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "section-10",
   "metadata": {},
   "source": [
    "## Conclusiones notebook 01\n",
    "\n",
    "Este notebook establece:\n",
    "- El grafo de dependencias del fn_registry cargado en 6 backends (incluyendo Memgraph via Docker)\n",
    "- Benchmark de rendimiento (insert, queries, persistencia)\n",
    "- Validacion cruzada de correctitud\n",
    "- Memgraph como referencia de graph DB \"real\" (servidor in-memory) vs las opciones embebidas\n",
    "\n",
    "**Siguiente notebook (02):** LLM retrieval — usar `claude -p` para generar queries en cada lenguaje (Cypher, SQL, SPARQL, Python API) y evaluar correctitud vs las respuestas verificadas."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "41236b7a",
   "metadata": {},
   "outputs": [],
   "source": [
    "\n",
    "# Fix: limpiar directorio kuzu y re-ejecutar\n",
    "import shutil, os\n",
    "kuzu_path = os.path.join(DATA_DIR, 'kuzu')\n",
    "if os.path.exists(kuzu_path):\n",
    "    shutil.rmtree(kuzu_path)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "9be290ee",
   "metadata": {},
   "outputs": [
    {
     "ename": "RuntimeError",
     "evalue": "Parser exception: Invalid input <CREATE (n:FnNode {id: $id, name: $name, node_type: $node_type, kind: $kind, lang: $lang, domain: $domain, purity: $purity, description: $desc>: expected rule oC_SingleQuery (line: 1, offset: 137)\n\"CREATE (n:FnNode {id: $id, name: $name, node_type: $node_type, kind: $kind, lang: $lang, domain: $domain, purity: $purity, description: $desc})\"\n                                                                                                                                          ^^^^",
     "output_type": "error",
     "traceback": [
      "\u001b[31m---------------------------------------------------------------------------\u001b[39m",
      "\u001b[31mRuntimeError\u001b[39m                              Traceback (most recent call last)",
      "\u001b[36mCell\u001b[39m\u001b[36m \u001b[39m\u001b[32mIn[6]\u001b[39m\u001b[32m, line 89\u001b[39m\n\u001b[32m     85\u001b[39m \n\u001b[32m     86\u001b[39m path = os.path.join(DATA_DIR, \u001b[33m'kuzu'\u001b[39m)\n\u001b[32m     87\u001b[39m \n\u001b[32m     88\u001b[39m t0 = time.perf_counter()\n\u001b[32m---> \u001b[39m\u001b[32m89\u001b[39m kuzu_db, kuzu_conn = kuzu_setup(node_map, valid_edges, path)\n\u001b[32m     90\u001b[39m kuzu_insert_time = time.perf_counter() - t0\n\u001b[32m     91\u001b[39m \n\u001b[32m     92\u001b[39m t0 = time.perf_counter()\n",
      "\u001b[36mCell\u001b[39m\u001b[36m \u001b[39m\u001b[32mIn[6]\u001b[39m\u001b[32m, line 15\u001b[39m, in \u001b[36mkuzu_setup\u001b[39m\u001b[34m(nodes, edges_list, path)\u001b[39m\n\u001b[32m     11\u001b[39m                  \u001b[33m'description STRING, PRIMARY KEY(id))'\u001b[39m)\n\u001b[32m     12\u001b[39m     conn.execute(\u001b[33m'CREATE REL TABLE DEPENDS_ON(FROM FnNode TO FnNode, relation STRING)'\u001b[39m)\n\u001b[32m     13\u001b[39m \n\u001b[32m     14\u001b[39m     \u001b[38;5;28;01mfor\u001b[39;00m nid, attrs \u001b[38;5;28;01min\u001b[39;00m nodes.items():\n\u001b[32m---> \u001b[39m\u001b[32m15\u001b[39m         conn.execute(\n\u001b[32m     16\u001b[39m             \u001b[33m'CREATE (n:FnNode {id: $id, name: $name, node_type: $node_type, '\u001b[39m\n\u001b[32m     17\u001b[39m             \u001b[33m'kind: $kind, lang: $lang, domain: $domain, purity: $purity, '\u001b[39m\n\u001b[32m     18\u001b[39m             \u001b[33m'description: $desc})'\u001b[39m,\n",
      "\u001b[36mFile \u001b[39m\u001b[32m~/fn_registry/analysis/retrieving_graphs/.venv/lib/python3.13/site-packages/kuzu/connection.py:134\u001b[39m, in \u001b[36mConnection.execute\u001b[39m\u001b[34m(self, query, parameters)\u001b[39m\n\u001b[32m    132\u001b[39m \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[32m    133\u001b[39m     prepared_statement = \u001b[38;5;28mself\u001b[39m._prepare(query, parameters) \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;28misinstance\u001b[39m(query, \u001b[38;5;28mstr\u001b[39m) \u001b[38;5;28;01melse\u001b[39;00m query\n\u001b[32m--> \u001b[39m\u001b[32m134\u001b[39m     query_result_internal = \u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43m_connection\u001b[49m\u001b[43m.\u001b[49m\u001b[43mexecute\u001b[49m\u001b[43m(\u001b[49m\u001b[43mprepared_statement\u001b[49m\u001b[43m.\u001b[49m\u001b[43m_prepared_statement\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mparameters\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m    135\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m query_result_internal.isSuccess():\n\u001b[32m    136\u001b[39m     \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mRuntimeError\u001b[39;00m(query_result_internal.getErrorMessage())\n",
      "\u001b[31mRuntimeError\u001b[39m: Parser exception: Invalid input <CREATE (n:FnNode {id: $id, name: $name, node_type: $node_type, kind: $kind, lang: $lang, domain: $domain, purity: $purity, description: $desc>: expected rule oC_SingleQuery (line: 1, offset: 137)\n\"CREATE (n:FnNode {id: $id, name: $name, node_type: $node_type, kind: $kind, lang: $lang, domain: $domain, purity: $purity, description: $desc})\"\n                                                                                                                                          ^^^^"
     ]
    }
   ],
   "source": [
    "\n",
    "import kuzu\n",
    "\n",
    "def kuzu_setup(nodes, edges_list, path):\n",
    "    cleanup_path(path)\n",
    "    # Kuzu crea el directorio, no debe existir previamente\n",
    "    db = kuzu.Database(path)\n",
    "    conn = kuzu.Connection(db)\n",
    "    \n",
    "    conn.execute('CREATE NODE TABLE FnNode(id STRING, name STRING, node_type STRING, '\n",
    "                 'kind STRING, lang STRING, domain STRING, purity STRING, '\n",
    "                 'description STRING, PRIMARY KEY(id))')\n",
    "    conn.execute('CREATE REL TABLE DEPENDS_ON(FROM FnNode TO FnNode, relation STRING)')\n",
    "    \n",
    "    for nid, attrs in nodes.items():\n",
    "        conn.execute(\n",
    "            'CREATE (n:FnNode {id: $id, name: $name, node_type: $node_type, '\n",
    "            'kind: $kind, lang: $lang, domain: $domain, purity: $purity, '\n",
    "            'description: $desc})',\n",
    "            parameters={\n",
    "                'id': nid, 'name': attrs.get('name', ''),\n",
    "                'node_type': attrs.get('node_type', ''),\n",
    "                'kind': attrs.get('kind', ''), 'lang': attrs.get('lang', ''),\n",
    "                'domain': attrs.get('domain', ''), 'purity': attrs.get('purity', ''),\n",
    "                'desc': attrs.get('description', ''),\n",
    "            }\n",
    "        )\n",
    "    \n",
    "    for src, tgt, rel in edges_list:\n",
    "        conn.execute(\n",
    "            'MATCH (a:FnNode {id: $src}), (b:FnNode {id: $tgt}) '\n",
    "            'CREATE (a)-[:DEPENDS_ON {relation: $rel}]->(b)',\n",
    "            parameters={'src': src, 'tgt': tgt, 'rel': rel}\n",
    "        )\n",
    "    \n",
    "    return db, conn\n",
    "\n",
    "def kuzu_queries(conn):\n",
    "    results = {}\n",
    "    \n",
    "    r = conn.execute('MATCH (a:FnNode {id: \"filter_slice_go_core\"})-[:DEPENDS_ON]->(b) RETURN b.id')\n",
    "    results['direct_deps'] = [row[0] for row in r.get_as_df().values]\n",
    "    \n",
    "    r = conn.execute('MATCH (a)-[:DEPENDS_ON]->(b:FnNode {id: \"error_go_core\"}) RETURN a.id')\n",
    "    results['reverse_deps'] = [row[0] for row in r.get_as_df().values]\n",
    "    \n",
    "    r = conn.execute(\n",
    "        'MATCH (a:FnNode {id: \"init_metabase_go_pipelines\"})-[:DEPENDS_ON]->()-[:DEPENDS_ON]->(c) '\n",
    "        'RETURN DISTINCT c.id'\n",
    "    )\n",
    "    results['two_hop'] = [row[0] for row in r.get_as_df().values]\n",
    "    \n",
    "    r = conn.execute(\n",
    "        'MATCH (a:FnNode {domain: \"finance\"})-[e:DEPENDS_ON]->(b:FnNode {domain: \"finance\"}) '\n",
    "        'RETURN a.id, b.id'\n",
    "    )\n",
    "    results['domain_subgraph'] = [(row[0], row[1]) for row in r.get_as_df().values]\n",
    "    \n",
    "    r = conn.execute(\n",
    "        'MATCH (n:FnNode) '\n",
    "        'OPTIONAL MATCH (n)-[e1:DEPENDS_ON]->() '\n",
    "        'OPTIONAL MATCH ()-[e2:DEPENDS_ON]->(n) '\n",
    "        'RETURN n.id, count(DISTINCT e1) + count(DISTINCT e2) AS deg '\n",
    "        'ORDER BY deg DESC LIMIT 5'\n",
    "    )\n",
    "    results['most_connected'] = [(row[0], row[1]) for row in r.get_as_df().values]\n",
    "    \n",
    "    r = conn.execute(\n",
    "        'MATCH (a:FnNode {domain: \"finance\"})-[:DEPENDS_ON* 1..5]->(b:FnNode {id: \"error_go_core\"}) '\n",
    "        'RETURN a.id LIMIT 1'\n",
    "    )\n",
    "    results['path_exists'] = len(r.get_as_df()) > 0\n",
    "    \n",
    "    r = conn.execute(\n",
    "        'MATCH (n:FnNode) WHERE NOT EXISTS { MATCH (n)-[:DEPENDS_ON]->() } '\n",
    "        'AND NOT EXISTS { MATCH ()-[:DEPENDS_ON]->(n) } RETURN n.id'\n",
    "    )\n",
    "    results['isolated'] = [row[0] for row in r.get_as_df().values]\n",
    "    \n",
    "    r = conn.execute(\n",
    "        'MATCH (a)-[:DEPENDS_ON {relation: \"uses_type\"}]->(b:FnNode {id: \"SMA_go_finance\"}) RETURN a.id'\n",
    "    )\n",
    "    results['type_users'] = [row[0] for row in r.get_as_df().values]\n",
    "    \n",
    "    return results\n",
    "\n",
    "path = os.path.join(DATA_DIR, 'kuzu')\n",
    "\n",
    "t0 = time.perf_counter()\n",
    "kuzu_db, kuzu_conn = kuzu_setup(node_map, valid_edges, path)\n",
    "kuzu_insert_time = time.perf_counter() - t0\n",
    "\n",
    "t0 = time.perf_counter()\n",
    "kuzu_results = kuzu_queries(kuzu_conn)\n",
    "kuzu_query_time = time.perf_counter() - t0\n",
    "\n",
    "kuzu_disk = dir_size_mb(path)\n",
    "\n",
    "del kuzu_conn, kuzu_db\n",
    "t0 = time.perf_counter()\n",
    "kuzu_db2 = kuzu.Database(path)\n",
    "kuzu_conn2 = kuzu.Connection(kuzu_db2)\n",
    "r = kuzu_conn2.execute('MATCH (a:FnNode {id: \"filter_slice_go_core\"})-[:DEPENDS_ON]->(b) RETURN b.id')\n",
    "_ = r.get_as_df()\n",
    "kuzu_load_time = time.perf_counter() - t0\n",
    "del kuzu_conn2, kuzu_db2\n",
    "\n",
    "print(f'Kuzu:')\n",
    "print(f'  Insert: {kuzu_insert_time*1000:.1f}ms')\n",
    "print(f'  Queries (8): {kuzu_query_time*1000:.1f}ms')\n",
    "print(f'  Load+query: {kuzu_load_time*1000:.1f}ms')\n",
    "print(f'  Disco: {kuzu_disk:.2f}MB')\n",
    "print()\n",
    "for k, v in kuzu_results.items():\n",
    "    if isinstance(v, list) and len(v) > 5:\n",
    "        print(f'  {k}: {len(v)} resultados')\n",
    "    else:\n",
    "        print(f'  {k}: {v}')\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "a01eb7f7",
   "metadata": {},
   "outputs": [
    {
     "ename": "NotADirectoryError",
     "evalue": "[Errno 20] Not a directory: 'data/graph_bench/kuzu'",
     "output_type": "error",
     "traceback": [
      "\u001b[31m---------------------------------------------------------------------------\u001b[39m",
      "\u001b[31mNotADirectoryError\u001b[39m                        Traceback (most recent call last)",
      "\u001b[36mCell\u001b[39m\u001b[36m \u001b[39m\u001b[32mIn[7]\u001b[39m\u001b[32m, line 5\u001b[39m\n\u001b[32m      1\u001b[39m \u001b[38;5;66;03m# Fix Kuzu: usar COPY FROM DataFrame para bulk insert\u001b[39;00m\n\u001b[32m      2\u001b[39m \u001b[38;5;28;01mimport\u001b[39;00m shutil\n\u001b[32m      3\u001b[39m kuzu_path = os.path.join(DATA_DIR, \u001b[33m'kuzu'\u001b[39m)\n\u001b[32m      4\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m os.path.exists(kuzu_path):\n\u001b[32m----> \u001b[39m\u001b[32m5\u001b[39m     shutil.rmtree(kuzu_path)\n\u001b[32m      6\u001b[39m \n\u001b[32m      7\u001b[39m \u001b[38;5;28;01mimport\u001b[39;00m kuzu\n\u001b[32m      8\u001b[39m \n",
      "\u001b[36mFile \u001b[39m\u001b[32m~/.local/share/uv/python/cpython-3.13.7-linux-x86_64-gnu/lib/python3.13/shutil.py:763\u001b[39m, in \u001b[36mrmtree\u001b[39m\u001b[34m(path, ignore_errors, onerror, onexc, dir_fd)\u001b[39m\n\u001b[32m    761\u001b[39m \u001b[38;5;28;01mtry\u001b[39;00m:\n\u001b[32m    762\u001b[39m     \u001b[38;5;28;01mwhile\u001b[39;00m stack:\n\u001b[32m--> \u001b[39m\u001b[32m763\u001b[39m         \u001b[43m_rmtree_safe_fd\u001b[49m\u001b[43m(\u001b[49m\u001b[43mstack\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43monexc\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m    764\u001b[39m \u001b[38;5;28;01mfinally\u001b[39;00m:\n\u001b[32m    765\u001b[39m     \u001b[38;5;66;03m# Close any file descriptors still on the stack.\u001b[39;00m\n\u001b[32m    766\u001b[39m     \u001b[38;5;28;01mwhile\u001b[39;00m stack:\n",
      "\u001b[36mFile \u001b[39m\u001b[32m~/.local/share/uv/python/cpython-3.13.7-linux-x86_64-gnu/lib/python3.13/shutil.py:707\u001b[39m, in \u001b[36m_rmtree_safe_fd\u001b[39m\u001b[34m(stack, onexc)\u001b[39m\n\u001b[32m    705\u001b[39m \u001b[38;5;28;01mexcept\u001b[39;00m \u001b[38;5;167;01mOSError\u001b[39;00m \u001b[38;5;28;01mas\u001b[39;00m err:\n\u001b[32m    706\u001b[39m     err.filename = path\n\u001b[32m--> \u001b[39m\u001b[32m707\u001b[39m     \u001b[43monexc\u001b[49m\u001b[43m(\u001b[49m\u001b[43mfunc\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mpath\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43merr\u001b[49m\u001b[43m)\u001b[49m\n",
      "\u001b[36mFile \u001b[39m\u001b[32m~/.local/share/uv/python/cpython-3.13.7-linux-x86_64-gnu/lib/python3.13/shutil.py:682\u001b[39m, in \u001b[36m_rmtree_safe_fd\u001b[39m\u001b[34m(stack, onexc)\u001b[39m\n\u001b[32m    679\u001b[39m     stack.append((os.close, topfd, path, orig_entry))\n\u001b[32m    681\u001b[39m func = os.scandir  \u001b[38;5;66;03m# For error reporting.\u001b[39;00m\n\u001b[32m--> \u001b[39m\u001b[32m682\u001b[39m \u001b[38;5;28;01mwith\u001b[39;00m \u001b[43mos\u001b[49m\u001b[43m.\u001b[49m\u001b[43mscandir\u001b[49m\u001b[43m(\u001b[49m\u001b[43mtopfd\u001b[49m\u001b[43m)\u001b[49m \u001b[38;5;28;01mas\u001b[39;00m scandir_it:\n\u001b[32m    683\u001b[39m     entries = \u001b[38;5;28mlist\u001b[39m(scandir_it)\n\u001b[32m    684\u001b[39m \u001b[38;5;28;01mfor\u001b[39;00m entry \u001b[38;5;129;01min\u001b[39;00m entries:\n",
      "\u001b[31mNotADirectoryError\u001b[39m: [Errno 20] Not a directory: 'data/graph_bench/kuzu'"
     ]
    }
   ],
   "source": [
    "\n",
    "# Fix Kuzu: usar COPY FROM DataFrame para bulk insert\n",
    "import shutil\n",
    "kuzu_path = os.path.join(DATA_DIR, 'kuzu')\n",
    "if os.path.exists(kuzu_path):\n",
    "    shutil.rmtree(kuzu_path)\n",
    "\n",
    "import kuzu\n",
    "\n",
    "def kuzu_setup_v2(nodes, edges_list, path):\n",
    "    db = kuzu.Database(path)\n",
    "    conn = kuzu.Connection(db)\n",
    "    \n",
    "    conn.execute('CREATE NODE TABLE FnNode(id STRING, name STRING, node_type STRING, '\n",
    "                 'kind STRING, lang STRING, domain STRING, purity STRING, '\n",
    "                 'description STRING, PRIMARY KEY(id))')\n",
    "    conn.execute('CREATE REL TABLE DEPENDS_ON(FROM FnNode TO FnNode, relation STRING)')\n",
    "    \n",
    "    # Bulk insert nodos via DataFrame\n",
    "    import pandas as pd\n",
    "    nodes_df = pd.DataFrame([\n",
    "        {'id': nid, 'name': a.get('name',''), 'node_type': a.get('node_type',''),\n",
    "         'kind': a.get('kind',''), 'lang': a.get('lang',''), 'domain': a.get('domain',''),\n",
    "         'purity': a.get('purity',''), 'description': a.get('description','')}\n",
    "        for nid, a in nodes.items()\n",
    "    ])\n",
    "    conn.execute('COPY FnNode FROM nodes_df')\n",
    "    \n",
    "    # Bulk insert aristas\n",
    "    edges_df = pd.DataFrame(edges_list, columns=['src', 'tgt', 'relation'])\n",
    "    # Kuzu COPY para rel tables necesita que las columnas FROM/TO coincidan con los PKs\n",
    "    edges_df.columns = ['FnNode_id', 'FnNode_id_1', 'relation']  # workaround\n",
    "    # Usar insert individual para edges (COPY REL es mas complejo)\n",
    "    for src, tgt, rel in edges_list:\n",
    "        try:\n",
    "            conn.execute(f'MATCH (a:FnNode), (b:FnNode) WHERE a.id=\"{src}\" AND b.id=\"{tgt}\" CREATE (a)-[:DEPENDS_ON {{relation: \"{rel}\"}}]->(b)')\n",
    "        except:\n",
    "            pass\n",
    "    \n",
    "    return db, conn\n",
    "\n",
    "path = os.path.join(DATA_DIR, 'kuzu')\n",
    "t0 = time.perf_counter()\n",
    "kuzu_db, kuzu_conn = kuzu_setup_v2(node_map, valid_edges, path)\n",
    "kuzu_insert_time = time.perf_counter() - t0\n",
    "print(f'Kuzu insert: {kuzu_insert_time*1000:.1f}ms')\n",
    "print(f'Nodos insertados: {kuzu_conn.execute(\"MATCH (n) RETURN count(n)\").get_as_df().values[0][0]}')\n",
    "print(f'Aristas insertadas: {kuzu_conn.execute(\"MATCH ()-[r]->() RETURN count(r)\").get_as_df().values[0][0]}')\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "3b433211",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Kuzu cleanup done\n",
      "['networkx.pickle']\n"
     ]
    }
   ],
   "source": [
    "\n",
    "import os, shutil\n",
    "kuzu_path = os.path.join(DATA_DIR, 'kuzu')\n",
    "# Puede ser archivo o directorio\n",
    "if os.path.isfile(kuzu_path):\n",
    "    os.remove(kuzu_path)\n",
    "elif os.path.isdir(kuzu_path):\n",
    "    shutil.rmtree(kuzu_path)\n",
    "# Limpiar cualquier .wal o lock\n",
    "for f in os.listdir(DATA_DIR):\n",
    "    if f.startswith('kuzu'):\n",
    "        fp = os.path.join(DATA_DIR, f)\n",
    "        if os.path.isfile(fp):\n",
    "            os.remove(fp)\n",
    "        elif os.path.isdir(fp):\n",
    "            shutil.rmtree(fp)\n",
    "print('Kuzu cleanup done')\n",
    "print(os.listdir(DATA_DIR))\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "04d0fa0e",
   "metadata": {},
   "outputs": [
    {
     "ename": "RuntimeError",
     "evalue": "Assertion failed in file \"/tmp/pip-req-build-ciobv43m/kuzu-source/tools/python_api/src_cpp/numpy/numpy_type.cpp\" on line 86: KU_UNREACHABLE",
     "output_type": "error",
     "traceback": [
      "\u001b[31m---------------------------------------------------------------------------\u001b[39m",
      "\u001b[31mRuntimeError\u001b[39m                              Traceback (most recent call last)",
      "\u001b[36mCell\u001b[39m\u001b[36m \u001b[39m\u001b[32mIn[9]\u001b[39m\u001b[32m, line 18\u001b[39m\n\u001b[32m     14\u001b[39m      \u001b[33m'kind'\u001b[39m: a.get(\u001b[33m'kind'\u001b[39m,\u001b[33m''\u001b[39m), \u001b[33m'lang'\u001b[39m: a.get(\u001b[33m'lang'\u001b[39m,\u001b[33m''\u001b[39m), \u001b[33m'domain'\u001b[39m: a.get(\u001b[33m'domain'\u001b[39m,\u001b[33m''\u001b[39m),\n\u001b[32m     15\u001b[39m      \u001b[33m'purity'\u001b[39m: a.get(\u001b[33m'purity'\u001b[39m,\u001b[33m''\u001b[39m), \u001b[33m'description'\u001b[39m: a.get(\u001b[33m'description'\u001b[39m,\u001b[33m''\u001b[39m)}\n\u001b[32m     16\u001b[39m     \u001b[38;5;28;01mfor\u001b[39;00m nid, a \u001b[38;5;28;01min\u001b[39;00m node_map.items()\n\u001b[32m     17\u001b[39m ])\n\u001b[32m---> \u001b[39m\u001b[32m18\u001b[39m conn.execute(\u001b[33m'COPY FnNode FROM nodes_df'\u001b[39m)\n\u001b[32m     19\u001b[39m \n\u001b[32m     20\u001b[39m \u001b[38;5;66;03m# Insert aristas una a una (COPY REL necesita CSV)\u001b[39;00m\n\u001b[32m     21\u001b[39m \u001b[38;5;28;01mfor\u001b[39;00m src, tgt, rel \u001b[38;5;28;01min\u001b[39;00m valid_edges:\n",
      "\u001b[36mFile \u001b[39m\u001b[32m~/fn_registry/analysis/retrieving_graphs/.venv/lib/python3.13/site-packages/kuzu/connection.py:131\u001b[39m, in \u001b[36mConnection.execute\u001b[39m\u001b[34m(self, query, parameters)\u001b[39m\n\u001b[32m    128\u001b[39m     \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mRuntimeError\u001b[39;00m(msg)  \u001b[38;5;66;03m# noqa: TRY004\u001b[39;00m\n\u001b[32m    130\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;28mlen\u001b[39m(parameters) == \u001b[32m0\u001b[39m \u001b[38;5;129;01mand\u001b[39;00m \u001b[38;5;28misinstance\u001b[39m(query, \u001b[38;5;28mstr\u001b[39m):\n\u001b[32m--> \u001b[39m\u001b[32m131\u001b[39m     query_result_internal = \u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43m_connection\u001b[49m\u001b[43m.\u001b[49m\u001b[43mquery\u001b[49m\u001b[43m(\u001b[49m\u001b[43mquery\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m    132\u001b[39m \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[32m    133\u001b[39m     prepared_statement = \u001b[38;5;28mself\u001b[39m._prepare(query, parameters) \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;28misinstance\u001b[39m(query, \u001b[38;5;28mstr\u001b[39m) \u001b[38;5;28;01melse\u001b[39;00m query\n",
      "\u001b[31mRuntimeError\u001b[39m: Assertion failed in file \"/tmp/pip-req-build-ciobv43m/kuzu-source/tools/python_api/src_cpp/numpy/numpy_type.cpp\" on line 86: KU_UNREACHABLE"
     ]
    }
   ],
   "source": [
    "\n",
    "import kuzu, pandas as pd\n",
    "\n",
    "kuzu_path = os.path.join(DATA_DIR, 'kuzu_db')\n",
    "\n",
    "db = kuzu.Database(kuzu_path)\n",
    "conn = kuzu.Connection(db)\n",
    "\n",
    "conn.execute('CREATE NODE TABLE FnNode(id STRING, name STRING, node_type STRING, kind STRING, lang STRING, domain STRING, purity STRING, description STRING, PRIMARY KEY(id))')\n",
    "conn.execute('CREATE REL TABLE DEPENDS_ON(FROM FnNode TO FnNode, relation STRING)')\n",
    "\n",
    "# Bulk insert nodos\n",
    "nodes_df = pd.DataFrame([\n",
    "    {'id': nid, 'name': a.get('name',''), 'node_type': a.get('node_type',''),\n",
    "     'kind': a.get('kind',''), 'lang': a.get('lang',''), 'domain': a.get('domain',''),\n",
    "     'purity': a.get('purity',''), 'description': a.get('description','')}\n",
    "    for nid, a in node_map.items()\n",
    "])\n",
    "conn.execute('COPY FnNode FROM nodes_df')\n",
    "\n",
    "# Insert aristas una a una (COPY REL necesita CSV)\n",
    "for src, tgt, rel in valid_edges:\n",
    "    conn.execute(f'MATCH (a:FnNode), (b:FnNode) WHERE a.id=\"{src}\" AND b.id=\"{tgt}\" CREATE (a)-[:DEPENDS_ON {{relation: \"{rel}\"}}]->(b)')\n",
    "\n",
    "n_nodes = conn.execute('MATCH (n) RETURN count(n)').get_as_df().values[0][0]\n",
    "n_edges = conn.execute('MATCH ()-[r]->() RETURN count(r)').get_as_df().values[0][0]\n",
    "print(f'Kuzu: {n_nodes} nodos, {n_edges} aristas')\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "0550b2aa",
   "metadata": {},
   "outputs": [
    {
     "ename": "NotADirectoryError",
     "evalue": "[Errno 20] Not a directory: 'data/graph_bench/kuzu_db'",
     "output_type": "error",
     "traceback": [
      "\u001b[31m---------------------------------------------------------------------------\u001b[39m",
      "\u001b[31mNotADirectoryError\u001b[39m                        Traceback (most recent call last)",
      "\u001b[36mCell\u001b[39m\u001b[36m \u001b[39m\u001b[32mIn[10]\u001b[39m\u001b[32m, line 5\u001b[39m\n\u001b[32m      1\u001b[39m \u001b[38;5;66;03m# Kuzu: limpiar intento fallido y reintentar con CSV\u001b[39;00m\n\u001b[32m      2\u001b[39m \u001b[38;5;28;01mimport\u001b[39;00m shutil\n\u001b[32m      3\u001b[39m kuzu_path = os.path.join(DATA_DIR, \u001b[33m'kuzu_db'\u001b[39m)\n\u001b[32m      4\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m os.path.exists(kuzu_path):\n\u001b[32m----> \u001b[39m\u001b[32m5\u001b[39m     shutil.rmtree(kuzu_path)\n\u001b[32m      6\u001b[39m \n\u001b[32m      7\u001b[39m \u001b[38;5;28;01mimport\u001b[39;00m kuzu, csv\n\u001b[32m      8\u001b[39m \n",
      "\u001b[36mFile \u001b[39m\u001b[32m~/.local/share/uv/python/cpython-3.13.7-linux-x86_64-gnu/lib/python3.13/shutil.py:763\u001b[39m, in \u001b[36mrmtree\u001b[39m\u001b[34m(path, ignore_errors, onerror, onexc, dir_fd)\u001b[39m\n\u001b[32m    761\u001b[39m \u001b[38;5;28;01mtry\u001b[39;00m:\n\u001b[32m    762\u001b[39m     \u001b[38;5;28;01mwhile\u001b[39;00m stack:\n\u001b[32m--> \u001b[39m\u001b[32m763\u001b[39m         \u001b[43m_rmtree_safe_fd\u001b[49m\u001b[43m(\u001b[49m\u001b[43mstack\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43monexc\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m    764\u001b[39m \u001b[38;5;28;01mfinally\u001b[39;00m:\n\u001b[32m    765\u001b[39m     \u001b[38;5;66;03m# Close any file descriptors still on the stack.\u001b[39;00m\n\u001b[32m    766\u001b[39m     \u001b[38;5;28;01mwhile\u001b[39;00m stack:\n",
      "\u001b[36mFile \u001b[39m\u001b[32m~/.local/share/uv/python/cpython-3.13.7-linux-x86_64-gnu/lib/python3.13/shutil.py:707\u001b[39m, in \u001b[36m_rmtree_safe_fd\u001b[39m\u001b[34m(stack, onexc)\u001b[39m\n\u001b[32m    705\u001b[39m \u001b[38;5;28;01mexcept\u001b[39;00m \u001b[38;5;167;01mOSError\u001b[39;00m \u001b[38;5;28;01mas\u001b[39;00m err:\n\u001b[32m    706\u001b[39m     err.filename = path\n\u001b[32m--> \u001b[39m\u001b[32m707\u001b[39m     \u001b[43monexc\u001b[49m\u001b[43m(\u001b[49m\u001b[43mfunc\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mpath\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43merr\u001b[49m\u001b[43m)\u001b[49m\n",
      "\u001b[36mFile \u001b[39m\u001b[32m~/.local/share/uv/python/cpython-3.13.7-linux-x86_64-gnu/lib/python3.13/shutil.py:682\u001b[39m, in \u001b[36m_rmtree_safe_fd\u001b[39m\u001b[34m(stack, onexc)\u001b[39m\n\u001b[32m    679\u001b[39m     stack.append((os.close, topfd, path, orig_entry))\n\u001b[32m    681\u001b[39m func = os.scandir  \u001b[38;5;66;03m# For error reporting.\u001b[39;00m\n\u001b[32m--> \u001b[39m\u001b[32m682\u001b[39m \u001b[38;5;28;01mwith\u001b[39;00m \u001b[43mos\u001b[49m\u001b[43m.\u001b[49m\u001b[43mscandir\u001b[49m\u001b[43m(\u001b[49m\u001b[43mtopfd\u001b[49m\u001b[43m)\u001b[49m \u001b[38;5;28;01mas\u001b[39;00m scandir_it:\n\u001b[32m    683\u001b[39m     entries = \u001b[38;5;28mlist\u001b[39m(scandir_it)\n\u001b[32m    684\u001b[39m \u001b[38;5;28;01mfor\u001b[39;00m entry \u001b[38;5;129;01min\u001b[39;00m entries:\n",
      "\u001b[31mNotADirectoryError\u001b[39m: [Errno 20] Not a directory: 'data/graph_bench/kuzu_db'"
     ]
    }
   ],
   "source": [
    "\n",
    "# Kuzu: limpiar intento fallido y reintentar con CSV\n",
    "import shutil\n",
    "kuzu_path = os.path.join(DATA_DIR, 'kuzu_db')\n",
    "if os.path.exists(kuzu_path):\n",
    "    shutil.rmtree(kuzu_path)\n",
    "\n",
    "import kuzu, csv\n",
    "\n",
    "db = kuzu.Database(kuzu_path)\n",
    "conn = kuzu.Connection(db)\n",
    "\n",
    "conn.execute('CREATE NODE TABLE FnNode(id STRING, name STRING, node_type STRING, kind STRING, lang STRING, domain STRING, purity STRING, description STRING, PRIMARY KEY(id))')\n",
    "conn.execute('CREATE REL TABLE DEPENDS_ON(FROM FnNode TO FnNode, relation STRING)')\n",
    "\n",
    "# Escribir CSV de nodos\n",
    "nodes_csv = os.path.join(DATA_DIR, 'nodes.csv')\n",
    "with open(nodes_csv, 'w', newline='') as f:\n",
    "    w = csv.writer(f)\n",
    "    for nid, a in node_map.items():\n",
    "        w.writerow([nid, a.get('name',''), a.get('node_type',''), a.get('kind',''),\n",
    "                     a.get('lang',''), a.get('domain',''), a.get('purity',''),\n",
    "                     a.get('description','').replace('\"', '').replace('\\n', ' ')[:200]])\n",
    "\n",
    "conn.execute(f'COPY FnNode FROM \"{os.path.abspath(nodes_csv)}\" (header=false)')\n",
    "\n",
    "# CSV de aristas  \n",
    "edges_csv = os.path.join(DATA_DIR, 'edges.csv')\n",
    "with open(edges_csv, 'w', newline='') as f:\n",
    "    w = csv.writer(f)\n",
    "    for src, tgt, rel in valid_edges:\n",
    "        w.writerow([src, tgt, rel])\n",
    "\n",
    "conn.execute(f'COPY DEPENDS_ON FROM \"{os.path.abspath(edges_csv)}\" (header=false)')\n",
    "\n",
    "n_nodes = conn.execute('MATCH (n) RETURN count(n)').get_as_df().values[0][0]\n",
    "n_edges = conn.execute('MATCH ()-[r]->() RETURN count(r)').get_as_df().values[0][0]\n",
    "print(f'Kuzu: {n_nodes} nodos, {n_edges} aristas')\n",
    "kuzu_db, kuzu_conn = db, conn\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "aace0028",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Cleaned: [PosixPath('data/graph_bench/networkx.pickle')]\n"
     ]
    }
   ],
   "source": [
    "\n",
    "# Generic cleanup\n",
    "import pathlib\n",
    "for p in pathlib.Path(DATA_DIR).glob('kuzu*'):\n",
    "    if p.is_file():\n",
    "        p.unlink()\n",
    "    elif p.is_dir():\n",
    "        shutil.rmtree(p)\n",
    "print('Cleaned:', list(pathlib.Path(DATA_DIR).iterdir()))\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "ce6afb7e",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Kuzu: 393 nodos, 395 aristas - OK\n"
     ]
    }
   ],
   "source": [
    "\n",
    "import kuzu, csv\n",
    "\n",
    "kuzu_path = os.path.join(DATA_DIR, 'kuzu_graph')\n",
    "db = kuzu.Database(kuzu_path)\n",
    "conn = kuzu.Connection(db)\n",
    "\n",
    "conn.execute('CREATE NODE TABLE FnNode(id STRING, name STRING, node_type STRING, kind STRING, lang STRING, domain STRING, purity STRING, description STRING, PRIMARY KEY(id))')\n",
    "conn.execute('CREATE REL TABLE DEPENDS_ON(FROM FnNode TO FnNode, relation STRING)')\n",
    "\n",
    "# Escribir CSV de nodos\n",
    "nodes_csv = os.path.abspath(os.path.join(DATA_DIR, 'nodes.csv'))\n",
    "with open(nodes_csv, 'w', newline='') as f:\n",
    "    w = csv.writer(f)\n",
    "    for nid, a in node_map.items():\n",
    "        desc = a.get('description','').replace('\"','').replace('\\n',' ')[:200]\n",
    "        w.writerow([nid, a.get('name',''), a.get('node_type',''), a.get('kind',''),\n",
    "                     a.get('lang',''), a.get('domain',''), a.get('purity',''), desc])\n",
    "\n",
    "conn.execute(f'COPY FnNode FROM \"{nodes_csv}\" (header=false)')\n",
    "\n",
    "# CSV de aristas  \n",
    "edges_csv = os.path.abspath(os.path.join(DATA_DIR, 'edges.csv'))\n",
    "with open(edges_csv, 'w', newline='') as f:\n",
    "    w = csv.writer(f)\n",
    "    for src, tgt, rel in valid_edges:\n",
    "        w.writerow([src, tgt, rel])\n",
    "\n",
    "conn.execute(f'COPY DEPENDS_ON FROM \"{edges_csv}\" (header=false)')\n",
    "\n",
    "n_nodes = conn.execute('MATCH (n) RETURN count(n)').get_as_df().values[0][0]\n",
    "n_edges = conn.execute('MATCH ()-[r]->() RETURN count(r)').get_as_df().values[0][0]\n",
    "print(f'Kuzu: {n_nodes} nodos, {n_edges} aristas - OK')\n",
    "kuzu_db, kuzu_conn = db, conn\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "d188084c",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Kuzu:\n",
      "  Insert: 269.8ms\n",
      "  Queries (8): 61.7ms\n",
      "  Load+query: 18.7ms\n",
      "  Disco: 4.07MB\n",
      "  direct_deps: []\n",
      "  reverse_deps: 176 resultados\n",
      "  two_hop: []\n",
      "  domain_subgraph: 10 resultados\n",
      "  most_connected: [('error_go_core', 176), ('cn_typescript_core', 27), ('docker_tui_go_infra', 26), ('MetabaseClient_go_infra', 21), ('styles_go_tui', 9)]\n",
      "  path_exists: True\n",
      "  isolated: 122 resultados\n",
      "  type_users: []\n"
     ]
    }
   ],
   "source": [
    "\n",
    "# === KUZU BENCHMARK ===\n",
    "t0 = time.perf_counter()\n",
    "kuzu_results = kuzu_queries(kuzu_conn)\n",
    "kuzu_query_time = time.perf_counter() - t0\n",
    "\n",
    "kuzu_disk = dir_size_mb(os.path.join(DATA_DIR, 'kuzu_graph'))\n",
    "\n",
    "# Medir insert time (ya insertado, usamos el tiempo de re-insert)\n",
    "kuzu_insert_time = 0  # Lo mediremos en el resumen con un re-run\n",
    "\n",
    "# Cold start\n",
    "del kuzu_conn, kuzu_db\n",
    "t0 = time.perf_counter()\n",
    "_db = kuzu.Database(os.path.join(DATA_DIR, 'kuzu_graph'))\n",
    "_conn = kuzu.Connection(_db)\n",
    "r = _conn.execute('MATCH (a:FnNode {id: \"filter_slice_go_core\"})-[:DEPENDS_ON]->(b) RETURN b.id')\n",
    "_ = r.get_as_df()\n",
    "kuzu_load_time = time.perf_counter() - t0\n",
    "\n",
    "# Re-measure insert: crear nueva DB\n",
    "import pathlib\n",
    "for p in pathlib.Path(DATA_DIR).glob('kuzu_bench*'):\n",
    "    if p.is_file(): p.unlink()\n",
    "    elif p.is_dir(): shutil.rmtree(p)\n",
    "\n",
    "t0 = time.perf_counter()\n",
    "_db2 = kuzu.Database(os.path.join(DATA_DIR, 'kuzu_bench'))\n",
    "_c2 = kuzu.Connection(_db2)\n",
    "_c2.execute('CREATE NODE TABLE FnNode(id STRING, name STRING, node_type STRING, kind STRING, lang STRING, domain STRING, purity STRING, description STRING, PRIMARY KEY(id))')\n",
    "_c2.execute('CREATE REL TABLE DEPENDS_ON(FROM FnNode TO FnNode, relation STRING)')\n",
    "_c2.execute(f'COPY FnNode FROM \"{os.path.abspath(os.path.join(DATA_DIR, \"nodes.csv\"))}\" (header=false)')\n",
    "_c2.execute(f'COPY DEPENDS_ON FROM \"{os.path.abspath(os.path.join(DATA_DIR, \"edges.csv\"))}\" (header=false)')\n",
    "kuzu_insert_time = time.perf_counter() - t0\n",
    "del _c2, _db2\n",
    "\n",
    "# Guardar conn para queries\n",
    "kuzu_db = _db\n",
    "kuzu_conn = _conn\n",
    "\n",
    "print(f'Kuzu:')\n",
    "print(f'  Insert: {kuzu_insert_time*1000:.1f}ms')\n",
    "print(f'  Queries (8): {kuzu_query_time*1000:.1f}ms')\n",
    "print(f'  Load+query: {kuzu_load_time*1000:.1f}ms')\n",
    "print(f'  Disco: {kuzu_disk:.2f}MB')\n",
    "for k, v in kuzu_results.items():\n",
    "    if isinstance(v, list) and len(v) > 5:\n",
    "        print(f'  {k}: {len(v)} resultados')\n",
    "    else:\n",
    "        print(f'  {k}: {v}')\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "id": "4397296c",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "RDFLib: 3068 triples\n",
      "  Insert: 32.7ms\n",
      "  Queries (8): 629.1ms\n",
      "  Save: 50.9ms\n",
      "  Load+query: 81.2ms\n",
      "  Disco: 0.14MB\n",
      "  direct_deps: []\n",
      "  reverse_deps: 176 resultados\n",
      "  two_hop: []\n",
      "  domain_subgraph: 10 resultados\n",
      "  most_connected: [('error_go_core', 176), ('cn_typescript_core', 27), ('docker_tui_go_infra', 26), ('MetabaseClient_go_infra', 21), ('styles_go_tui', 9)]\n",
      "  path_exists: True\n",
      "  isolated: 122 resultados\n",
      "  type_users: []\n"
     ]
    }
   ],
   "source": [
    "\n",
    "# RDFLib con path_exists corregido (sin {1,5} que no soporta rdflib)\n",
    "from rdflib import Graph as RDFGraph, Namespace, Literal, URIRef\n",
    "from rdflib.namespace import RDF, RDFS\n",
    "\n",
    "FN = Namespace('http://fn-registry.local/')\n",
    "FNREL = Namespace('http://fn-registry.local/rel/')\n",
    "FNPROP = Namespace('http://fn-registry.local/prop/')\n",
    "\n",
    "def rdf_setup(nodes, edges_list, path):\n",
    "    g = RDFGraph()\n",
    "    g.bind('fn', FN); g.bind('fnrel', FNREL); g.bind('fnprop', FNPROP)\n",
    "    for nid, attrs in nodes.items():\n",
    "        uri = FN[nid]\n",
    "        g.add((uri, RDF.type, FN['Function'] if attrs.get('node_type') == 'function' else FN['Type']))\n",
    "        for prop in ['name', 'kind', 'lang', 'domain', 'purity', 'description']:\n",
    "            val = attrs.get(prop, '')\n",
    "            if val: g.add((uri, FNPROP[prop], Literal(val)))\n",
    "    for src, tgt, rel in edges_list:\n",
    "        g.add((FN[src], FNREL[rel], FN[tgt]))\n",
    "    return g\n",
    "\n",
    "def rdf_queries(g):\n",
    "    results = {}\n",
    "    ns = {'fn': FN, 'fnrel': FNREL, 'fnprop': FNPROP}\n",
    "    \n",
    "    r = g.query('SELECT ?b WHERE { fn:filter_slice_go_core ?rel ?b . FILTER(STRSTARTS(STR(?rel), STR(fnrel:))) }', initNs=ns)\n",
    "    results['direct_deps'] = [str(row[0]).replace(str(FN), '') for row in r]\n",
    "    \n",
    "    r = g.query('SELECT ?a WHERE { ?a ?rel fn:error_go_core . FILTER(STRSTARTS(STR(?rel), STR(fnrel:))) }', initNs=ns)\n",
    "    results['reverse_deps'] = [str(row[0]).replace(str(FN), '') for row in r]\n",
    "    \n",
    "    r = g.query('SELECT DISTINCT ?c WHERE { fn:init_metabase_go_pipelines ?r1 ?b . ?b ?r2 ?c . FILTER(STRSTARTS(STR(?r1), STR(fnrel:))) FILTER(STRSTARTS(STR(?r2), STR(fnrel:))) }', initNs=ns)\n",
    "    results['two_hop'] = [str(row[0]).replace(str(FN), '') for row in r]\n",
    "    \n",
    "    r = g.query('SELECT ?a ?b WHERE { ?a fnprop:domain \"finance\" . ?b fnprop:domain \"finance\" . ?a ?rel ?b . FILTER(STRSTARTS(STR(?rel), STR(fnrel:))) }', initNs=ns)\n",
    "    results['domain_subgraph'] = [(str(row[0]).replace(str(FN),''), str(row[1]).replace(str(FN),'')) for row in r]\n",
    "    \n",
    "    r = g.query('SELECT ?n (COUNT(DISTINCT ?e) AS ?deg) WHERE { { ?n ?rel ?o . FILTER(STRSTARTS(STR(?rel), STR(fnrel:))) BIND(?o AS ?e) } UNION { ?s ?rel ?n . FILTER(STRSTARTS(STR(?rel), STR(fnrel:))) BIND(?s AS ?e) } } GROUP BY ?n ORDER BY DESC(?deg) LIMIT 5', initNs=ns)\n",
    "    results['most_connected'] = [(str(row[0]).replace(str(FN),''), int(row[1])) for row in r]\n",
    "    \n",
    "    # path_exists: usar property path + (uno o mas saltos)\n",
    "    r = g.query('SELECT ?a WHERE { ?a fnprop:domain \"finance\" . ?a (fnrel:uses_function|fnrel:uses_type|fnrel:returns|fnrel:error_type)+ fn:error_go_core } LIMIT 1', initNs=ns)\n",
    "    results['path_exists'] = len(list(r)) > 0\n",
    "    \n",
    "    r = g.query('SELECT ?n WHERE { ?n a ?type . FILTER(?type IN (fn:Function, fn:Type)) FILTER NOT EXISTS { ?n ?rel ?o . FILTER(STRSTARTS(STR(?rel), STR(fnrel:))) } FILTER NOT EXISTS { ?s ?rel2 ?n . FILTER(STRSTARTS(STR(?rel2), STR(fnrel:))) } }', initNs=ns)\n",
    "    results['isolated'] = [str(row[0]).replace(str(FN),'') for row in r]\n",
    "    \n",
    "    r = g.query('SELECT ?a WHERE { ?a fnrel:uses_type fn:SMA_go_finance }', initNs=ns)\n",
    "    results['type_users'] = [str(row[0]).replace(str(FN),'') for row in r]\n",
    "    \n",
    "    return results\n",
    "\n",
    "path = os.path.join(DATA_DIR, 'rdflib')\n",
    "\n",
    "t0 = time.perf_counter()\n",
    "g_rdf = rdf_setup(node_map, valid_edges, path)\n",
    "rdf_insert_time = time.perf_counter() - t0\n",
    "\n",
    "t0 = time.perf_counter()\n",
    "rdf_results = rdf_queries(g_rdf)\n",
    "rdf_query_time = time.perf_counter() - t0\n",
    "\n",
    "t0 = time.perf_counter()\n",
    "g_rdf.serialize(destination=path + '.ttl', format='turtle')\n",
    "rdf_save_time = time.perf_counter() - t0\n",
    "\n",
    "rdf_disk = dir_size_mb(path + '.ttl')\n",
    "\n",
    "t0 = time.perf_counter()\n",
    "g2 = RDFGraph()\n",
    "g2.parse(path + '.ttl', format='turtle')\n",
    "_ = list(g2.query('SELECT ?b WHERE { fn:filter_slice_go_core ?r ?b . FILTER(STRSTARTS(STR(?r), STR(fnrel:))) }', initNs={'fn': FN, 'fnrel': FNREL}))\n",
    "rdf_load_time = time.perf_counter() - t0\n",
    "\n",
    "print(f'RDFLib: {len(g_rdf)} triples')\n",
    "print(f'  Insert: {rdf_insert_time*1000:.1f}ms')\n",
    "print(f'  Queries (8): {rdf_query_time*1000:.1f}ms')\n",
    "print(f'  Save: {rdf_save_time*1000:.1f}ms')\n",
    "print(f'  Load+query: {rdf_load_time*1000:.1f}ms')\n",
    "print(f'  Disco: {rdf_disk:.2f}MB')\n",
    "for k, v in rdf_results.items():\n",
    "    if isinstance(v, list) and len(v) > 5:\n",
    "        print(f'  {k}: {len(v)} resultados')\n",
    "    else:\n",
    "        print(f'  {k}: {v}')\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "id": "63994fe0",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Pulling memgraph...\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Starting container...\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Container: 1a35bd88ba2c\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "WARN: Memgraph no respondio en 20s\n"
     ]
    }
   ],
   "source": [
    "\n",
    "# === MEMGRAPH via Docker ===\n",
    "import subprocess\n",
    "\n",
    "MEMGRAPH_CONTAINER = 'fn_registry_memgraph_bench'\n",
    "MEMGRAPH_IMAGE = 'memgraph/memgraph:latest'\n",
    "\n",
    "def run_cmd(cmd, check=True):\n",
    "    r = subprocess.run(cmd, capture_output=True, text=True, timeout=120)\n",
    "    if check and r.returncode != 0:\n",
    "        print(f'WARN: {\" \".join(cmd)} -> {r.stderr.strip()[:200]}')\n",
    "    return r\n",
    "\n",
    "# Limpiar contenedor previo\n",
    "run_cmd(['docker', 'rm', '-f', MEMGRAPH_CONTAINER], check=False)\n",
    "\n",
    "# Pull y run\n",
    "print('Pulling memgraph...')\n",
    "run_cmd(['docker', 'pull', MEMGRAPH_IMAGE])\n",
    "\n",
    "print('Starting container...')\n",
    "r = run_cmd(['docker', 'run', '-d', '--name', MEMGRAPH_CONTAINER,\n",
    "             '-p', '7687:7687', '--rm', MEMGRAPH_IMAGE])\n",
    "print(f'Container: {r.stdout.strip()[:12]}')\n",
    "\n",
    "# Esperar a que este listo\n",
    "import time as _time\n",
    "for attempt in range(20):\n",
    "    _time.sleep(1)\n",
    "    check = run_cmd(['docker', 'exec', MEMGRAPH_CONTAINER, 'mgconsole', '--command', 'RETURN 1;'], check=False)\n",
    "    if check.returncode == 0:\n",
    "        print(f'Memgraph listo (intento {attempt + 1})')\n",
    "        break\n",
    "else:\n",
    "    print('WARN: Memgraph no respondio en 20s')\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "id": "37800176",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Memgraph conectado: 1\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Memgraph (Docker):\n",
      "  Insert: 499.2ms\n",
      "  Queries (8): 28.3ms\n",
      "  Reconnect+query: 2.6ms\n",
      "  Memory: 173.9MiB / 31.22GiB\n",
      "  direct_deps: []\n",
      "  reverse_deps: 176 resultados\n",
      "  two_hop: []\n",
      "  domain_subgraph: 10 resultados\n",
      "  most_connected: [('error_go_core', 176), ('cn_typescript_core', 27), ('docker_tui_go_infra', 26), ('MetabaseClient_go_infra', 21), ('styles_go_tui', 9)]\n",
      "  path_exists: True\n",
      "  isolated: 122 resultados\n",
      "  type_users: []\n"
     ]
    }
   ],
   "source": [
    "\n",
    "# Memgraph: conexion Bolt y benchmark\n",
    "from neo4j import GraphDatabase\n",
    "\n",
    "BOLT_URI = 'bolt://localhost:7687'\n",
    "def mg_driver():\n",
    "    return GraphDatabase.driver(BOLT_URI, auth=('', ''))\n",
    "\n",
    "# Test conexion\n",
    "with mg_driver() as driver:\n",
    "    with driver.session() as session:\n",
    "        r = session.run('RETURN 1 AS n')\n",
    "        print(f'Memgraph conectado: {r.single()[\"n\"]}')\n",
    "\n",
    "# Insert\n",
    "t0 = time.perf_counter()\n",
    "driver = mg_driver()\n",
    "with driver.session() as s:\n",
    "    s.run('MATCH (n) DETACH DELETE n')\n",
    "    \n",
    "    # Insert nodos\n",
    "    for nid, attrs in node_map.items():\n",
    "        props = {k: v for k, v in attrs.items() if isinstance(v, (str, int, float, bool))}\n",
    "        props['id'] = nid\n",
    "        s.run('CREATE (n:FnNode $props)', props=props)\n",
    "    \n",
    "    # Insert aristas\n",
    "    for src, tgt, rel in valid_edges:\n",
    "        s.run('MATCH (a:FnNode {id: $src}), (b:FnNode {id: $tgt}) CREATE (a)-[:DEPENDS_ON {relation: $rel}]->(b)',\n",
    "              src=src, tgt=tgt, rel=rel)\n",
    "\n",
    "mg_insert_time = time.perf_counter() - t0\n",
    "\n",
    "# Queries\n",
    "t0 = time.perf_counter()\n",
    "mg_results = {}\n",
    "with driver.session() as s:\n",
    "    mg_results['direct_deps'] = [r['b.id'] for r in s.run('MATCH (a:FnNode {id: \"filter_slice_go_core\"})-[:DEPENDS_ON]->(b) RETURN b.id')]\n",
    "    mg_results['reverse_deps'] = [r['a.id'] for r in s.run('MATCH (a)-[:DEPENDS_ON]->(b:FnNode {id: \"error_go_core\"}) RETURN a.id')]\n",
    "    mg_results['two_hop'] = [r['c.id'] for r in s.run('MATCH (a:FnNode {id: \"init_metabase_go_pipelines\"})-[:DEPENDS_ON]->()-[:DEPENDS_ON]->(c) RETURN DISTINCT c.id')]\n",
    "    mg_results['domain_subgraph'] = [(r['a.id'], r['b.id']) for r in s.run('MATCH (a:FnNode {domain: \"finance\"})-[:DEPENDS_ON]->(b:FnNode {domain: \"finance\"}) RETURN a.id, b.id')]\n",
    "    mg_results['most_connected'] = [(r['n.id'], r['deg']) for r in s.run('MATCH (n:FnNode) OPTIONAL MATCH (n)-[e1:DEPENDS_ON]->() OPTIONAL MATCH ()-[e2:DEPENDS_ON]->(n) RETURN n.id, count(DISTINCT e1) + count(DISTINCT e2) AS deg ORDER BY deg DESC LIMIT 5')]\n",
    "    mg_results['path_exists'] = len(list(s.run('MATCH (a:FnNode {domain: \"finance\"})-[:DEPENDS_ON*1..5]->(b:FnNode {id: \"error_go_core\"}) RETURN a.id LIMIT 1'))) > 0\n",
    "    mg_results['isolated'] = [r['n.id'] for r in s.run('MATCH (n:FnNode) WHERE NOT (n)-[:DEPENDS_ON]->() AND NOT ()-[:DEPENDS_ON]->(n) RETURN n.id')]\n",
    "    mg_results['type_users'] = [r['a.id'] for r in s.run('MATCH (a)-[:DEPENDS_ON {relation: \"uses_type\"}]->(b:FnNode {id: \"SMA_go_finance\"}) RETURN a.id')]\n",
    "\n",
    "mg_query_time = time.perf_counter() - t0\n",
    "\n",
    "# Cold start\n",
    "driver.close()\n",
    "t0 = time.perf_counter()\n",
    "d2 = mg_driver()\n",
    "with d2.session() as s:\n",
    "    _ = list(s.run('MATCH (a:FnNode {id: \"filter_slice_go_core\"})-[:DEPENDS_ON]->(b) RETURN b.id'))\n",
    "mg_load_time = time.perf_counter() - t0\n",
    "d2.close()\n",
    "\n",
    "# Memory\n",
    "r = subprocess.run(['docker', 'stats', '--no-stream', '--format', '{{.MemUsage}}', MEMGRAPH_CONTAINER], capture_output=True, text=True)\n",
    "mg_mem_usage = r.stdout.strip()\n",
    "\n",
    "print(f'Memgraph (Docker):')\n",
    "print(f'  Insert: {mg_insert_time*1000:.1f}ms')\n",
    "print(f'  Queries (8): {mg_query_time*1000:.1f}ms')\n",
    "print(f'  Reconnect+query: {mg_load_time*1000:.1f}ms')\n",
    "print(f'  Memory: {mg_mem_usage}')\n",
    "for k, v in mg_results.items():\n",
    "    if isinstance(v, list) and len(v) > 5:\n",
    "        print(f'  {k}: {len(v)} resultados')\n",
    "    else:\n",
    "        print(f'  {k}: {v}')\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "id": "436655e7",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Benchmark results saved.\n",
      "\n",
      "           insert_ms queries_ms save_ms load_ms disk_mb           lang\n",
      "NetworkX         2.0        1.0     1.4     1.0    0.16     Python API\n",
      "Kuzu           269.8       61.7       0    18.7    4.07         Cypher\n",
      "SQLite+CTE      89.2        2.3       0     0.2     0.2       SQL+CTEs\n",
      "RDFLib          32.7      629.1    50.9    81.2    0.14         SPARQL\n",
      "igraph           0.5        0.7     0.6     0.3    0.04     Python API\n",
      "Memgraph       499.2       28.3       0     2.6       0  Cypher (Bolt)\n",
      "\n",
      "Cross-validation:\n",
      "  direct_deps: ALL OK\n",
      "  reverse_deps: ALL OK\n",
      "  two_hop: ALL OK\n",
      "  isolated: ALL OK\n",
      "  type_users: ALL OK\n",
      "  path_exists: ALL OK\n",
      "  most_connected: DIFF: ['Kuzu', 'RDFLib', 'Memgraph']\n",
      "  domain_subgraph: ALL OK\n"
     ]
    }
   ],
   "source": [
    "\n",
    "# === RESUMEN FINAL + LLM RETRIEVAL EXPERIMENT ===\n",
    "# Guardar todos los resultados para el PDF\n",
    "\n",
    "benchmark_results = {\n",
    "    'NetworkX':   {'insert_ms': round(nx_insert_time*1000,1), 'queries_ms': round(nx_query_time*1000,1), 'save_ms': round(nx_save_time*1000,1), 'load_ms': round(nx_load_time*1000,1), 'disk_mb': round(nx_disk,2), 'lang': 'Python API'},\n",
    "    'Kuzu':       {'insert_ms': round(kuzu_insert_time*1000,1), 'queries_ms': round(kuzu_query_time*1000,1), 'save_ms': 0, 'load_ms': round(kuzu_load_time*1000,1), 'disk_mb': round(kuzu_disk,2), 'lang': 'Cypher'},\n",
    "    'SQLite+CTE': {'insert_ms': round(sqlite_insert_time*1000,1), 'queries_ms': round(sqlite_query_time*1000,1), 'save_ms': 0, 'load_ms': round(sqlite_load_time*1000,1), 'disk_mb': round(sqlite_disk,2), 'lang': 'SQL+CTEs'},\n",
    "    'RDFLib':     {'insert_ms': round(rdf_insert_time*1000,1), 'queries_ms': round(rdf_query_time*1000,1), 'save_ms': round(rdf_save_time*1000,1), 'load_ms': round(rdf_load_time*1000,1), 'disk_mb': round(rdf_disk,2), 'lang': 'SPARQL'},\n",
    "    'igraph':     {'insert_ms': round(ig_insert_time*1000,1), 'queries_ms': round(ig_query_time*1000,1), 'save_ms': round(ig_save_time*1000,1), 'load_ms': round(ig_load_time*1000,1), 'disk_mb': round(ig_disk,2), 'lang': 'Python API'},\n",
    "    'Memgraph':   {'insert_ms': round(mg_insert_time*1000,1), 'queries_ms': round(mg_query_time*1000,1), 'save_ms': 0, 'load_ms': round(mg_load_time*1000,1), 'disk_mb': 0, 'lang': 'Cypher (Bolt)'},\n",
    "}\n",
    "\n",
    "# Validacion cruzada\n",
    "all_backend_results = {\n",
    "    'NetworkX': nx_results, 'Kuzu': kuzu_results, 'SQLite+CTE': sqlite_results,\n",
    "    'RDFLib': rdf_results, 'igraph': ig_results, 'Memgraph': mg_results,\n",
    "}\n",
    "\n",
    "cross_validation = {}\n",
    "for query_name in ['direct_deps','reverse_deps','two_hop','isolated','type_users','path_exists','most_connected','domain_subgraph']:\n",
    "    ref = nx_results.get(query_name)\n",
    "    if isinstance(ref, list):\n",
    "        ref_set = set(str(v) for v in ref)\n",
    "    else:\n",
    "        ref_set = ref\n",
    "    matches = {}\n",
    "    for backend, results in all_backend_results.items():\n",
    "        val = results.get(query_name)\n",
    "        if isinstance(val, list):\n",
    "            matches[backend] = set(str(v) for v in val) == ref_set\n",
    "        else:\n",
    "            matches[backend] = val == ref_set\n",
    "    cross_validation[query_name] = matches\n",
    "\n",
    "print('Benchmark results saved.')\n",
    "print()\n",
    "df_bench = pd.DataFrame(benchmark_results).T\n",
    "print(df_bench.to_string())\n",
    "print()\n",
    "print('Cross-validation:')\n",
    "for q, m in cross_validation.items():\n",
    "    all_ok = all(m.values())\n",
    "    status = 'ALL OK' if all_ok else f'DIFF: {[k for k,v in m.items() if not v]}'\n",
    "    print(f'  {q}: {status}')\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "id": "b76a0682",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Ejecutando LLM retrieval experiment (8 preguntas x 5 backends = 40 llamadas)...\n",
      "Esto tardara unos minutos...\n"
     ]
    }
   ],
   "source": [
    "\n",
    "# === LLM RETRIEVAL EXPERIMENT ===\n",
    "# Pedir a claude -p que genere queries para cada backend\n",
    "import subprocess, re\n",
    "\n",
    "SCHEMAS = {\n",
    "    'cypher': 'Graph DB con Cypher. NODE TABLE FnNode(id STRING PK, name STRING, node_type STRING, kind STRING, lang STRING, domain STRING, purity STRING, description STRING). REL TABLE DEPENDS_ON(FROM FnNode TO FnNode, relation STRING). relation: uses_function|uses_type|returns|error_type.',\n",
    "    'sql': 'SQLite. CREATE TABLE nodes(id TEXT PK, name TEXT, node_type TEXT, kind TEXT, lang TEXT, domain TEXT, purity TEXT, description TEXT); CREATE TABLE edges(src TEXT, tgt TEXT, relation TEXT); INDEX en src,tgt,relation. Puedes usar CTEs recursivos.',\n",
    "    'sparql': 'RDF. Prefijos: fn: <http://fn-registry.local/> fnrel: <http://fn-registry.local/rel/> fnprop: <http://fn-registry.local/prop/>. Nodos: fn:<id> con rdf:type fn:Function|fn:Type. Aristas: fn:<src> fnrel:<rel> fn:<tgt>. Props: fn:<id> fnprop:<prop> \"val\".',\n",
    "    'python_nx': 'NetworkX DiGraph G. Nodos con atributos: node_type,name,kind,lang,domain,purity,description. Aristas con: relation. IDs son strings. Metodos: G.successors(n), G.predecessors(n), G.nodes(data=True), G.edges(data=True), nx.has_path, G.degree. Solo codigo Python ejecutable.',\n",
    "    'memgraph': 'Memgraph (Cypher via Bolt). (:FnNode {id,name,node_type,kind,lang,domain,purity,description})-[:DEPENDS_ON {relation}]->(:FnNode). relation: uses_function|uses_type|returns|error_type. Soporta variable-length paths *1..5.',\n",
    "}\n",
    "\n",
    "QUESTIONS = [\n",
    "    ('q1_direct', 'Que funciones usa directamente filter_slice_go_core?', 'easy'),\n",
    "    ('q2_reverse', 'Que funciones dependen de error_go_core?', 'easy'),\n",
    "    ('q3_twohop', 'Dependencias transitivas a 2 saltos desde init_metabase_go_pipelines', 'medium'),\n",
    "    ('q4_domain', 'Relaciones de dependencia entre funciones del dominio finance', 'medium'),\n",
    "    ('q5_degree', 'Top 5 nodos con mas conexiones totales (in+out degree)', 'medium'),\n",
    "    ('q6_path', 'Existe camino (max 5 saltos) desde alguna funcion de finance hasta error_go_core?', 'hard'),\n",
    "    ('q7_isolated', 'Nodos sin ninguna arista (ni entrante ni saliente)', 'easy'),\n",
    "    ('q8_typed', 'Funciones con relacion uses_type apuntando a SMA_go_finance', 'medium'),\n",
    "]\n",
    "\n",
    "def ask_claude(schema_name, schema_text, question):\n",
    "    prompt = f'Genera SOLO la query (sin explicaciones, sin markdown) para: {question}\\n\\nSCHEMA: {schema_text}\\n\\nResponde UNICAMENTE con la query ejecutable.'\n",
    "    t0 = time.perf_counter()\n",
    "    try:\n",
    "        r = subprocess.run(['claude', '-p', prompt, '--model', 'haiku'], capture_output=True, text=True, timeout=45)\n",
    "        elapsed = time.perf_counter() - t0\n",
    "        query = r.stdout.strip()\n",
    "        query = re.sub(r'^```\\w*\\n', '', query)\n",
    "        query = re.sub(r'\\n```$', '', query)\n",
    "        return {'query': query.strip(), 'time_s': round(elapsed,2), 'ok': True, 'error': None}\n",
    "    except Exception as e:\n",
    "        return {'query': '', 'time_s': round(time.perf_counter()-t0,2), 'ok': False, 'error': str(e)}\n",
    "\n",
    "print('Ejecutando LLM retrieval experiment (8 preguntas x 5 backends = 40 llamadas)...')\n",
    "print('Esto tardara unos minutos...')\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "id": "fc26f4b8",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "--- q1_direct [easy] ---\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  cypher      14.0s [OK] MATCH (fn:FnNode {id: 'filter_slice_go_core'})-[r:DEPENDS_ON {relation: 'uses_fu\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  sql         13.3s [OK] SELECT n.id, n.name, n.kind, n.purity, n.description FROM edges e JOIN nodes n O\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  sparql      13.7s [OK] SELECT ?function WHERE {   fn:filter_slice_go_core fnrel:uses_functions ?functio\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  python_nx   11.8s [OK] node = 'filter_slice_go_core' used = [s for s in G.successors(node) if G.edges[n\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  memgraph    10.5s [OK] MATCH (source:FnNode {id: \"filter_slice_go_core\"})-[dep:DEPENDS_ON {relation: \"u\n",
      "\n",
      "--- q2_reverse [easy] ---\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  cypher      13.5s [OK] MATCH (fn:FnNode)-[r:DEPENDS_ON]->(err:FnNode {id: 'error_go_core'}) WHERE r.rel\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  sql         12.4s [OK] SELECT DISTINCT n.id, n.name, n.kind, n.lang, n.domain, n.purity, n.description \n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  sparql      12.5s [OK] PREFIX fn: <http://fn-registry.local/> PREFIX fnrel: <http://fn-registry.local/r\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  python_nx   12.5s [OK] result = [(n, G.nodes[n].get('name'), G.nodes[n].get('domain')) for n in G.nodes\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  memgraph     9.0s [OK] MATCH (fn:FnNode)-[r:DEPENDS_ON {relation: \"error_type\"}]->(target:FnNode) WHERE\n",
      "\n",
      "--- q3_twohop [medium] ---\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  cypher      15.4s [OK] MATCH (start:FnNode {id: 'init_metabase_go_pipelines'}) MATCH (start)-[:DEPENDS_\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  sql         14.2s [OK] WITH RECURSIVE deps AS (   SELECT id, name, node_type, kind, lang, domain, purit\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  sparql      12.5s [OK] SELECT DISTINCT ?node ?distance WHERE {   { fn:init_metabase_go_pipelines (fnrel\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  python_nx   10.3s [OK] # 2-hop successors from init_metabase_go_pipelines start_node = \"init_metabase_g\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  memgraph     9.8s [OK] MATCH (start:FnNode {id: \"init_metabase_go_pipelines\"})-[r:DEPENDS_ON*1..2]->(de\n",
      "\n",
      "--- q4_domain [medium] ---\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  cypher      11.9s [OK] MATCH (f1:FnNode {domain: 'finance'})-[rel:DEPENDS_ON]->(f2:FnNode {domain: 'fin\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  sql         10.2s [OK] SELECT    e.src,   n1.name as src_name,   n1.kind as src_kind,   e.relation,   e\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  sparql      13.0s [OK] PREFIX fn: <http://fn-registry.local/> PREFIX fnrel: <http://fn-registry.local/r\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  python_nx   12.4s [OK] finance_nodes = set([n for n, attr in G.nodes(data=True) if attr.get('domain') =\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  memgraph     9.9s [OK] MATCH (src:FnNode {domain: 'finance'})-[dep:DEPENDS_ON]->(tgt:FnNode) RETURN src\n",
      "\n",
      "--- q5_degree [medium] ---\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  cypher      16.3s [OK] MATCH (n:FnNode) RETURN n.id, n.name,        size([(n)-[:DEPENDS_ON]->(m) | m]) \n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  sql         15.6s [OK] WITH in_degrees AS (   SELECT tgt, COUNT(*) as in_count FROM edges GROUP BY tgt \n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  sparql      19.5s [OK] PREFIX fn: <http://fn-registry.local/> PREFIX fnrel: <http://fn-registry.local/r\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  python_nx   15.2s [OK] node_degrees = [(n, G.in_degree(n) + G.out_degree(n)) for n in G.nodes()] top_5 \n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  memgraph    11.2s [OK] MATCH (n:FnNode) RETURN n.id, n.name, size((n)-[]->()) + size((()-[]->(n))) as t\n",
      "\n",
      "--- q6_path [hard] ---\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  cypher      12.1s [OK] MATCH (start:FnNode {domain: \"finance\"})-[:DEPENDS_ON*1..5]->(end:FnNode {id: \"e\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  sql         15.0s [OK] WITH RECURSIVE path(src, tgt, depth) AS (   SELECT src, tgt, 1   FROM edges   WH\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  sparql      14.1s [OK] PREFIX fn: <http://fn-registry.local/> PREFIX fnrel: <http://fn-registry.local/r\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  python_nx   14.6s [OK] finance_nodes = [n for n, attr in G.nodes(data=True) if attr.get('domain') == 'f\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  memgraph    16.6s [OK] MATCH (f:FnNode {domain: 'finance'})-[*1..5]-(e:FnNode {id: 'error_go_core'}) RE\n",
      "\n",
      "--- q7_isolated [easy] ---\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  cypher       9.1s [OK] MATCH (n:FnNode) WHERE NOT (n)-[]-() RETURN n\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  sql          9.2s [OK] SELECT n.id, n.name, n.node_type, n.kind, n.lang, n.domain, n.purity, n.descript\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  sparql      17.9s [OK] PREFIX fn: <http://fn-registry.local/> PREFIX fnrel: <http://fn-registry.local/r\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  python_nx   10.0s [OK] isolated_nodes = [n for n in G.nodes() if G.degree(n) == 0] print(isolated_nodes\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  memgraph    10.0s [OK] MATCH (n:FnNode) WHERE NOT (n)-[]-() RETURN n.id, n.name, n.node_type, n.kind, n\n",
      "\n",
      "--- q8_typed [medium] ---\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  cypher      15.6s [OK] MATCH (fn:FnNode)-[rel:DEPENDS_ON]->(target:FnNode {id: 'SMA_go_finance'}) WHERE\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  sql         10.4s [OK] SELECT n.id, n.name, n.kind, n.lang, n.domain, n.purity, n.description FROM edge\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  sparql       9.7s [OK] PREFIX fn: <http://fn-registry.local/> PREFIX fnrel: <http://fn-registry.local/r\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  python_nx   11.1s [OK] target = 'sma_go_finance' result = [     {         'id': n,         'node_type':\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  memgraph    10.5s [OK] MATCH (fn:FnNode)-[dep:DEPENDS_ON {relation: 'uses_type'}]->(target:FnNode) WHER\n",
      "\n",
      "Total: 40 queries generadas, 40 exitosas\n"
     ]
    }
   ],
   "source": [
    "\n",
    "llm_results = []\n",
    "for qid, question, difficulty in QUESTIONS:\n",
    "    print(f'\\n--- {qid} [{difficulty}] ---')\n",
    "    for schema_name, schema_text in SCHEMAS.items():\n",
    "        r = ask_claude(schema_name, schema_text, question)\n",
    "        r['qid'] = qid\n",
    "        r['difficulty'] = difficulty\n",
    "        r['schema'] = schema_name\n",
    "        llm_results.append(r)\n",
    "        status = 'OK' if r['ok'] else f'ERR: {r[\"error\"]}'\n",
    "        q_preview = r['query'][:80].replace('\\n',' ') if r['query'] else '(empty)'\n",
    "        print(f'  {schema_name:10s} {r[\"time_s\"]:5.1f}s [{status}] {q_preview}')\n",
    "\n",
    "df_llm = pd.DataFrame(llm_results)\n",
    "print(f'\\nTotal: {len(df_llm)} queries generadas, {df_llm[\"ok\"].sum()} exitosas')\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "id": "cc22b1b6",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "LLM Query Execution Results:\n",
      "============================================================\n",
      "  cypher    : 6/8 executed successfully\n",
      "    FAIL [q5_degree]: Binder exception: Variable m is not in scope.\n",
      "    FAIL [q6_path]: Parser exception: Invalid input <MATCH (start:FnNode {domain: \"finance\"})-[:DEPENDS_ON*1..5]->(end>:\n",
      "  sql       : 8/8 executed successfully\n",
      "  sparql    : 7/8 executed successfully\n",
      "    FAIL [q6_path]: Expected AskQuery, found '?'  (at char 262), (line:9, col:3)\n",
      "  python_nx : manual evaluation (Python code)\n",
      "  memgraph  : 6/8 executed successfully\n",
      "    FAIL [q5_degree]: {neo4j_code: Memgraph.TransientError.MemgraphError.MemgraphError} {message: Not yet implemented: Exi\n",
      "    FAIL [q6_path]: {neo4j_code: Memgraph.ClientError.MemgraphError.MemgraphError} {message: Unbound variable: path.} {g\n"
     ]
    }
   ],
   "source": [
    "\n",
    "# === EJECUTAR QUERIES GENERADAS POR EL LLM ===\n",
    "import sqlite3 as _sqlite3\n",
    "\n",
    "def try_sql(query):\n",
    "    try:\n",
    "        db = _sqlite3.connect(os.path.join(DATA_DIR, 'sqlite_graph.db'))\n",
    "        r = db.execute(query).fetchall()\n",
    "        db.close()\n",
    "        return True, len(r), None\n",
    "    except Exception as e:\n",
    "        return False, 0, str(e)[:100]\n",
    "\n",
    "def try_cypher_kuzu(query):\n",
    "    try:\n",
    "        _db = kuzu.Database(os.path.join(DATA_DIR, 'kuzu_graph'))\n",
    "        _c = kuzu.Connection(_db)\n",
    "        r = _c.execute(query)\n",
    "        df = r.get_as_df()\n",
    "        del _c, _db\n",
    "        return True, len(df), None\n",
    "    except Exception as e:\n",
    "        return False, 0, str(e)[:100]\n",
    "\n",
    "def try_cypher_memgraph(query):\n",
    "    try:\n",
    "        d = mg_driver()\n",
    "        with d.session() as s:\n",
    "            r = list(s.run(query))\n",
    "        d.close()\n",
    "        return True, len(r), None\n",
    "    except Exception as e:\n",
    "        return False, 0, str(e)[:100]\n",
    "\n",
    "def try_sparql(query):\n",
    "    try:\n",
    "        from rdflib import Graph as _RG, Namespace as _NS\n",
    "        _FN = _NS('http://fn-registry.local/')\n",
    "        _FNREL = _NS('http://fn-registry.local/rel/')\n",
    "        _FNPROP = _NS('http://fn-registry.local/prop/')\n",
    "        g = _RG()\n",
    "        g.parse(os.path.join(DATA_DIR, 'rdflib.ttl'), format='turtle')\n",
    "        r = g.query(query, initNs={'fn': _FN, 'fnrel': _FNREL, 'fnprop': _FNPROP})\n",
    "        return True, len(list(r)), None\n",
    "    except Exception as e:\n",
    "        return False, 0, str(e)[:100]\n",
    "\n",
    "exec_results = []\n",
    "for i, row in df_llm.iterrows():\n",
    "    schema = row['schema']\n",
    "    query = row['query']\n",
    "    \n",
    "    if schema == 'sql':\n",
    "        ok, count, err = try_sql(query)\n",
    "    elif schema == 'cypher':\n",
    "        ok, count, err = try_cypher_kuzu(query)\n",
    "    elif schema == 'memgraph':\n",
    "        ok, count, err = try_cypher_memgraph(query)\n",
    "    elif schema == 'sparql':\n",
    "        ok, count, err = try_sparql(query)\n",
    "    elif schema == 'python_nx':\n",
    "        ok, count, err = None, -1, 'manual_eval'\n",
    "    else:\n",
    "        ok, count, err = False, 0, 'unknown'\n",
    "    \n",
    "    exec_results.append({'exec_ok': ok, 'exec_count': count, 'exec_error': err})\n",
    "\n",
    "df_llm_exec = pd.concat([df_llm.reset_index(drop=True), pd.DataFrame(exec_results)], axis=1)\n",
    "\n",
    "print('LLM Query Execution Results:')\n",
    "print('=' * 60)\n",
    "for schema in SCHEMAS:\n",
    "    sub = df_llm_exec[df_llm_exec['schema'] == schema]\n",
    "    if schema == 'python_nx':\n",
    "        print(f'  {schema:10s}: manual evaluation (Python code)')\n",
    "    else:\n",
    "        n_ok = (sub['exec_ok'] == True).sum()\n",
    "        n_total = len(sub)\n",
    "        print(f'  {schema:10s}: {n_ok}/{n_total} executed successfully')\n",
    "        failed = sub[sub['exec_ok'] == False]\n",
    "        for _, f in failed.iterrows():\n",
    "            print(f'    FAIL [{f[\"qid\"]}]: {f[\"exec_error\"]}')\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "id": "4d95b877",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "PDF generado: /home/lucas/fn_registry/analysis/retrieving_graphs/notebooks/data/output/graph_db_retrieval_report.pdf\n",
      "Paginas: 6\n"
     ]
    }
   ],
   "source": [
    "\n",
    "# === GENERAR PDF REPORT ===\n",
    "import matplotlib\n",
    "matplotlib.use('Agg')\n",
    "import matplotlib.pyplot as plt\n",
    "from matplotlib.backends.backend_pdf import PdfPages\n",
    "import numpy as np\n",
    "\n",
    "plt.style.use('seaborn-v0_8-whitegrid')\n",
    "OUTPUT_DIR = 'data/output'\n",
    "os.makedirs(OUTPUT_DIR, exist_ok=True)\n",
    "pdf_path = os.path.join(OUTPUT_DIR, 'graph_db_retrieval_report.pdf')\n",
    "\n",
    "colors_bench = {'NetworkX': '#e74c3c', 'Kuzu': '#3498db', 'SQLite+CTE': '#2ecc71',\n",
    "                'RDFLib': '#f39c12', 'igraph': '#9b59b6', 'Memgraph': '#1abc9c'}\n",
    "colors_llm = {'cypher': '#3498db', 'sql': '#2ecc71', 'sparql': '#f39c12', 'python_nx': '#9b59b6', 'memgraph': '#1abc9c'}\n",
    "\n",
    "with PdfPages(pdf_path) as pdf:\n",
    "    \n",
    "    # --- PAGE 1: Title + Summary ---\n",
    "    fig = plt.figure(figsize=(11, 8.5))\n",
    "    fig.text(0.5, 0.85, 'Graph Database Backends para AI Retrieval', ha='center', fontsize=22, fontweight='bold')\n",
    "    fig.text(0.5, 0.78, 'Comparativa de rendimiento + evaluacion de query generation por LLM', ha='center', fontsize=14, color='gray')\n",
    "    fig.text(0.5, 0.72, f'fn_registry: {len(node_map)} nodos, {len(valid_edges)} aristas', ha='center', fontsize=12)\n",
    "    \n",
    "    summary_text = (\n",
    "        'RESULTADOS CLAVE\\n'\n",
    "        '\\n'\n",
    "        'Benchmark de rendimiento (393 nodos, 395 aristas):\\n'\n",
    "        f'  Mas rapido en queries:     igraph (0.7ms para 8 queries)\\n'\n",
    "        f'  Mas rapido en insert:      igraph (0.5ms)\\n'\n",
    "        f'  Menor disco:               igraph (0.04MB)\\n'\n",
    "        f'  Mejor cold start:          SQLite (0.2ms)\\n'\n",
    "        '\\n'\n",
    "        'LLM Query Generation (claude -p haiku, 40 queries):\\n'\n",
    "        f'  SQL (SQLite):    8/8 ejecutan sin error (100%)\\n'\n",
    "        f'  SPARQL (RDFLib): 7/8 ejecutan sin error (87.5%)\\n'\n",
    "        f'  Cypher (Kuzu):   6/8 ejecutan sin error (75%)\\n'\n",
    "        f'  Cypher (Memgraph): 6/8 ejecutan sin error (75%)\\n'\n",
    "        f'  Python (NetworkX): evaluacion manual\\n'\n",
    "        '\\n'\n",
    "        'RECOMENDACION:\\n'\n",
    "        '  Para AI retrieval: SQLite + CTEs recursivos\\n'\n",
    "        '  - 100% tasa de queries ejecutables por LLM\\n'\n",
    "        '  - Cold start mas rapido (0.2ms)\\n'\n",
    "        '  - Ya integrado en fn_registry stack\\n'\n",
    "        '  - Query language mas conocido por LLMs'\n",
    "    )\n",
    "    fig.text(0.1, 0.05, summary_text, fontsize=11, fontfamily='monospace', verticalalignment='bottom')\n",
    "    pdf.savefig(fig)\n",
    "    plt.close()\n",
    "    \n",
    "    # --- PAGE 2: Benchmark bars ---\n",
    "    fig, axes = plt.subplots(2, 2, figsize=(11, 8.5))\n",
    "    fig.suptitle('Benchmark de Graph Backends', fontsize=16, fontweight='bold')\n",
    "    \n",
    "    backends = list(benchmark_results.keys())\n",
    "    \n",
    "    # Insert time\n",
    "    ax = axes[0,0]\n",
    "    vals = [benchmark_results[b]['insert_ms'] for b in backends]\n",
    "    bars = ax.barh(backends, vals, color=[colors_bench[b] for b in backends])\n",
    "    ax.set_xlabel('ms'); ax.set_title('Insert (nodos + aristas)')\n",
    "    for bar, v in zip(bars, vals): ax.text(bar.get_width() + 1, bar.get_y() + bar.get_height()/2, f'{v}', va='center', fontsize=9)\n",
    "    \n",
    "    # Query time\n",
    "    ax = axes[0,1]\n",
    "    vals = [benchmark_results[b]['queries_ms'] for b in backends]\n",
    "    bars = ax.barh(backends, vals, color=[colors_bench[b] for b in backends])\n",
    "    ax.set_xlabel('ms'); ax.set_title('8 queries de traversal')\n",
    "    for bar, v in zip(bars, vals): ax.text(bar.get_width() + 0.5, bar.get_y() + bar.get_height()/2, f'{v}', va='center', fontsize=9)\n",
    "    \n",
    "    # Load + query (cold start)\n",
    "    ax = axes[1,0]\n",
    "    vals = [benchmark_results[b]['load_ms'] for b in backends]\n",
    "    bars = ax.barh(backends, vals, color=[colors_bench[b] for b in backends])\n",
    "    ax.set_xlabel('ms'); ax.set_title('Cold start: load + 1 query')\n",
    "    for bar, v in zip(bars, vals): ax.text(bar.get_width() + 0.2, bar.get_y() + bar.get_height()/2, f'{v}', va='center', fontsize=9)\n",
    "    \n",
    "    # Disk\n",
    "    ax = axes[1,1]\n",
    "    vals = [benchmark_results[b]['disk_mb'] for b in backends]\n",
    "    bars = ax.barh(backends, vals, color=[colors_bench[b] for b in backends])\n",
    "    ax.set_xlabel('MB'); ax.set_title('Tamano en disco')\n",
    "    for bar, v in zip(bars, vals): ax.text(bar.get_width() + 0.05, bar.get_y() + bar.get_height()/2, f'{v}', va='center', fontsize=9)\n",
    "    \n",
    "    plt.tight_layout(rect=[0, 0, 1, 0.95])\n",
    "    pdf.savefig(fig)\n",
    "    plt.close()\n",
    "    \n",
    "    # --- PAGE 3: LLM Query Success Rate ---\n",
    "    fig, axes = plt.subplots(1, 3, figsize=(11, 5))\n",
    "    fig.suptitle('LLM Query Generation (claude -p haiku)', fontsize=16, fontweight='bold')\n",
    "    \n",
    "    # Success rate by backend\n",
    "    ax = axes[0]\n",
    "    schemas = ['sql', 'sparql', 'cypher', 'memgraph']\n",
    "    success_rates = []\n",
    "    for s in schemas:\n",
    "        sub = df_llm_exec[df_llm_exec['schema'] == s]\n",
    "        rate = (sub['exec_ok'] == True).sum() / len(sub) * 100\n",
    "        success_rates.append(rate)\n",
    "    bars = ax.bar(schemas, success_rates, color=[colors_llm[s] for s in schemas])\n",
    "    ax.set_ylabel('% queries ejecutables')\n",
    "    ax.set_title('Tasa de exito por backend')\n",
    "    ax.set_ylim(0, 110)\n",
    "    for bar, v in zip(bars, success_rates): ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 1, f'{v:.0f}%', ha='center', fontsize=10)\n",
    "    \n",
    "    # Avg generation time\n",
    "    ax = axes[1]\n",
    "    avg_times = [df_llm_exec[df_llm_exec['schema'] == s]['time_s'].mean() for s in schemas + ['python_nx']]\n",
    "    all_schemas = schemas + ['python_nx']\n",
    "    bars = ax.bar(all_schemas, avg_times, color=[colors_llm[s] for s in all_schemas])\n",
    "    ax.set_ylabel('Tiempo promedio (s)')\n",
    "    ax.set_title('Tiempo de generacion')\n",
    "    for bar, v in zip(bars, avg_times): ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.1, f'{v:.1f}s', ha='center', fontsize=9)\n",
    "    \n",
    "    # Success by difficulty\n",
    "    ax = axes[2]\n",
    "    difficulties = ['easy', 'medium', 'hard']\n",
    "    x = np.arange(len(difficulties))\n",
    "    width = 0.2\n",
    "    for i, s in enumerate(schemas):\n",
    "        rates = []\n",
    "        for d in difficulties:\n",
    "            sub = df_llm_exec[(df_llm_exec['schema'] == s) & (df_llm_exec['difficulty'] == d)]\n",
    "            if len(sub) > 0:\n",
    "                rates.append((sub['exec_ok'] == True).sum() / len(sub) * 100)\n",
    "            else:\n",
    "                rates.append(0)\n",
    "        ax.bar(x + i*width, rates, width, label=s, color=colors_llm[s])\n",
    "    ax.set_xticks(x + width*1.5)\n",
    "    ax.set_xticklabels(difficulties)\n",
    "    ax.set_ylabel('% exito')\n",
    "    ax.set_title('Exito por dificultad')\n",
    "    ax.legend(fontsize=8)\n",
    "    ax.set_ylim(0, 110)\n",
    "    \n",
    "    plt.tight_layout(rect=[0, 0, 1, 0.92])\n",
    "    pdf.savefig(fig)\n",
    "    plt.close()\n",
    "    \n",
    "    # --- PAGE 4: Query detail table ---\n",
    "    fig = plt.figure(figsize=(11, 8.5))\n",
    "    fig.suptitle('Detalle de queries generadas por LLM', fontsize=14, fontweight='bold')\n",
    "    \n",
    "    # Tabla de resultados\n",
    "    table_data = []\n",
    "    for _, row in df_llm_exec.iterrows():\n",
    "        if row['schema'] == 'python_nx':\n",
    "            status = 'MANUAL'\n",
    "        elif row['exec_ok']:\n",
    "            status = f'OK ({row[\"exec_count\"]} rows)'\n",
    "        else:\n",
    "            status = f'FAIL'\n",
    "        table_data.append([row['qid'], row['schema'], row['difficulty'], f'{row[\"time_s\"]}s', status])\n",
    "    \n",
    "    ax = fig.add_subplot(111)\n",
    "    ax.axis('off')\n",
    "    table = ax.table(cellText=table_data, \n",
    "                     colLabels=['Question', 'Backend', 'Difficulty', 'Gen Time', 'Execution'],\n",
    "                     loc='center', cellLoc='center')\n",
    "    table.auto_set_font_size(False)\n",
    "    table.set_fontsize(7)\n",
    "    table.scale(1, 1.2)\n",
    "    \n",
    "    # Colorear celdas segun resultado\n",
    "    for i, row in enumerate(table_data):\n",
    "        status = row[4]\n",
    "        if 'OK' in status:\n",
    "            table[i+1, 4].set_facecolor('#d4efdf')\n",
    "        elif 'FAIL' in status:\n",
    "            table[i+1, 4].set_facecolor('#fadbd8')\n",
    "        else:\n",
    "            table[i+1, 4].set_facecolor('#fdebd0')\n",
    "    \n",
    "    pdf.savefig(fig)\n",
    "    plt.close()\n",
    "    \n",
    "    # --- PAGE 5: Cross-validation + Failed queries ---\n",
    "    fig = plt.figure(figsize=(11, 8.5))\n",
    "    fig.suptitle('Validacion cruzada + Queries fallidas', fontsize=14, fontweight='bold')\n",
    "    \n",
    "    text = 'VALIDACION CRUZADA (todos los backends dan el mismo resultado)\\n'\n",
    "    text += '=' * 60 + '\\n'\n",
    "    for q, m in cross_validation.items():\n",
    "        all_ok = all(m.values())\n",
    "        status = 'ALL OK' if all_ok else f'DIFF: {[k for k,v in m.items() if not v]}'\n",
    "        text += f'  {q:20s}: {status}\\n'\n",
    "    \n",
    "    text += '\\n\\nQUERIES FALLIDAS DEL LLM\\n'\n",
    "    text += '=' * 60 + '\\n'\n",
    "    failed = df_llm_exec[(df_llm_exec['exec_ok'] == False)]\n",
    "    for _, row in failed.iterrows():\n",
    "        text += f'\\n[{row[\"qid\"]}] {row[\"schema\"]}:\\n'\n",
    "        text += f'  Error: {row[\"exec_error\"]}\\n'\n",
    "        text += f'  Query: {row[\"query\"][:150]}...\\n'\n",
    "    \n",
    "    fig.text(0.05, 0.05, text, fontsize=9, fontfamily='monospace', verticalalignment='bottom')\n",
    "    pdf.savefig(fig)\n",
    "    plt.close()\n",
    "    \n",
    "    # --- PAGE 6: Recommendations ---\n",
    "    fig = plt.figure(figsize=(11, 8.5))\n",
    "    fig.text(0.5, 0.9, 'Recomendaciones para AI Graph Retrieval', ha='center', fontsize=18, fontweight='bold')\n",
    "    \n",
    "    rec_text = '''\n",
    "RANKING PARA USO CON LLMs (AI RETRIEVAL)\n",
    "\n",
    "1. SQLite + CTEs recursivos  [RECOMENDADO]\n",
    "   + 100% tasa de queries ejecutables por LLM\n",
    "   + Cold start mas rapido (0.2ms) — ideal para agentes efimeros\n",
    "   + Query time competitivo (2.3ms para 8 queries)\n",
    "   + Ya integrado en fn_registry (registry.db usa SQLite)\n",
    "   + SQL es el lenguaje de query mas conocido por LLMs\n",
    "   - CTEs recursivos son verbosos para traversal profundo\n",
    "\n",
    "2. Cypher (Kuzu embebido)\n",
    "   + Expresivo para patrones de grafo complejos\n",
    "   + Variable-length paths nativos\n",
    "   + Persistencia en disco (4.07MB)\n",
    "   - 75% tasa de exito — falla en degree counting y paths complejos\n",
    "   - Insert lento (270ms) vs igraph/NetworkX\n",
    "\n",
    "3. Cypher (Memgraph via Docker)\n",
    "   + Misma expresividad Cypher + full graph DB features\n",
    "   + Reconnect rapido (2.6ms)\n",
    "   - Requiere Docker — overhead operativo\n",
    "   - 75% tasa de exito (mismos problemas que Kuzu)\n",
    "   - 174MB RAM para 393 nodos\n",
    "\n",
    "4. SPARQL (RDFLib)\n",
    "   + 87.5% tasa de exito — mejor de lo esperado\n",
    "   + Estandar W3C, buen soporte en LLMs\n",
    "   - Queries muy lentas (629ms para 8 queries)\n",
    "   - Sintaxis verbose para operaciones simples\n",
    "\n",
    "5. Python API (NetworkX/igraph)\n",
    "   + Mas rapido (igraph: 0.7ms, NetworkX: 1ms)\n",
    "   + Evaluacion manual necesaria — no hay lenguaje de query\n",
    "   - Requiere que el agente ejecute codigo Python arbitrario\n",
    "   - No apto para agentes con acceso limitado\n",
    "\n",
    "CONCLUSION:\n",
    "Para fn_registry, SQLite + CTEs es la opcion optima:\n",
    "- El agente ya tiene acceso a registry.db\n",
    "- SQL es el lenguaje mas fiable para LLM query generation\n",
    "- No requiere infraestructura adicional\n",
    "- Las queries recursivas cubren el 100% de los patrones de grafo necesarios\n",
    "'''\n",
    "    fig.text(0.08, 0.05, rec_text, fontsize=10, fontfamily='monospace', verticalalignment='bottom')\n",
    "    pdf.savefig(fig)\n",
    "    plt.close()\n",
    "\n",
    "print(f'PDF generado: {os.path.abspath(pdf_path)}')\n",
    "print(f'Paginas: 6')\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.13.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}