domain_coverage_gaps/notebooks/02_topic_gap_analysis.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "95651c14",
   "metadata": {},
   "source": [
    "# 02 — Gap analysis: 8 temas\n",
    "\n",
    "Para cada tema: **(A) lo que YA tenemos**, **(B) lo que falta**, **(C) primer paso** (funciones concretas a delegar a `fn-constructor`).\n",
    "\n",
    "Temas: trading · scraping_web · analisis_quantitativo · monitorizacion_realtime · generacion_imagenes_ia · generacion_texto_ia · generacion_audio · audio_realtime_voiceconversion."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8f0bbd20",
   "metadata": {},
   "outputs": [],
   "source": [
    "import os, sqlite3, pandas as pd\n",
    "ROOT = os.environ['FN_REGISTRY_ROOT']\n",
    "conn = sqlite3.connect(f'file:{ROOT}/registry.db?mode=ro', uri=True)\n",
    "pd.set_option('display.max_colwidth', 120)\n",
    "\n",
    "def show(ids, title=''):\n",
    "    if not ids: print(f'{title}: (vacio)'); return None\n",
    "    qm = ','.join('?'*len(ids))\n",
    "    df = pd.read_sql_query(\n",
    "        f\"SELECT id, lang, purity, description FROM functions WHERE id IN ({qm})\",\n",
    "        conn, params=ids)\n",
    "    if title: print(f'=== {title} ({len(df)}/{len(ids)}) ===')\n",
    "    return df\n",
    "\n",
    "def fts(q, limit=15):\n",
    "    return pd.read_sql_query(\n",
    "        '''SELECT f.id, f.lang, f.purity, f.description\n",
    "             FROM functions_fts JOIN functions f ON f.id = functions_fts.id\n",
    "            WHERE functions_fts MATCH ? ORDER BY rank LIMIT ?''',\n",
    "        conn, params=[q, limit])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e367b10d",
   "metadata": {},
   "source": [
    "---\n",
    "## 1) trading\n",
    "\n",
    "**Lo que tenemos** — `finance` ya cubre indicators + OHLCV + persistencia y un simulador de mercado.\n",
    "\n",
    "**Falta** para un stack de trading real: conectores exchange (REST + WS) por venue concreto, libro de ordenes, ejecucion paper/real, gestion de riesgo, backtester vectorizado, sizing/portfolio."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "640b287d",
   "metadata": {},
   "outputs": [],
   "source": [
    "show([\n",
    "  'fetch_ohlcv_go_finance','tick_to_ohlcv_go_finance','stream_ticks_go_finance',\n",
    "  'sma_go_finance','ema_go_finance','rsi_go_finance','vwap_go_finance',\n",
    "  'bollinger_bands_go_finance','sharpe_ratio_go_finance','max_drawdown_go_finance',\n",
    "  'log_return_go_finance','annualized_volatility_go_finance','normalize_ohlcv_go_finance',\n",
    "  'write_ohlcv_to_parquet_go_finance','load_ohlcv_from_duckdb_go_finance',\n",
    "  'avellaneda_stoikov_quotes_py_finance','generate_taker_order_py_finance',\n",
    "  'hawkes_intensity_py_finance','generate_gbm_prices_py_finance',\n",
    "  'run_market_sim_py_pipelines','monte_carlo_market_py_pipelines'\n",
    "], 'trading — YA')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "dacc42b2",
   "metadata": {},
   "source": [
    "**Gap & primer batch (delegar a fn-constructor, tag `trading`):**\n",
    "\n",
    "| # | id propuesto | proposito |\n",
    "|---|---|---|\n",
    "| 1 | `binance_rest_client_py_finance` | client REST autenticado (klines, balance, order) |\n",
    "| 2 | `binance_ws_stream_py_finance`   | WS streams trade/depth/kline reconectable |\n",
    "| 3 | `orderbook_l2_py_finance`        | book L2 con snapshot+delta, BBO, walk-the-book |\n",
    "| 4 | `paper_broker_py_finance`        | simulador FIFO con slippage configurable |\n",
    "| 5 | `position_sizer_py_finance`      | Kelly fraccional + cap por riesgo |\n",
    "| 6 | `backtest_vectorized_py_finance` | apply de signal sobre OHLCV → equity curve |\n",
    "| 7 | `risk_metrics_py_finance`        | VaR/ES/Calmar (los 3 que faltan respecto a sharpe/drawdown) |\n",
    "| 8 | `signal_crossover_go_finance`    | golden/death cross + zscore mean reversion (puras) |"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b3b3735e",
   "metadata": {},
   "source": [
    "---\n",
    "## 2) scraping_web\n",
    "\n",
    "**Lo que tenemos** — domain `browser` con CDP completo en Go puro + `http_*` en infra. Excelente base.\n",
    "\n",
    "**Falta** — parsing HTML/CSS-select sin browser, robots/sitemap, deduplicacion, rate-limit por host, persistencia incremental, captchas. Y un tag `scraping` que agrupe."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "90b5b349",
   "metadata": {},
   "outputs": [],
   "source": [
    "show([\n",
    "  'chrome_launch_go_browser','cdp_connect_go_browser','cdp_navigate_go_browser',\n",
    "  'cdp_evaluate_go_browser','cdp_get_html_go_browser','cdp_screenshot_go_browser',\n",
    "  'cdp_click_go_browser','cdp_click_text_go_browser','cdp_find_by_text_go_browser',\n",
    "  'cdp_type_text_go_browser','cdp_wait_element_go_browser','cdp_wait_load_go_browser',\n",
    "  'cdp_har_record_go_browser','cdp_set_cookie_go_browser','cdp_new_tab_go_browser',\n",
    "  'http_get_json_go_infra','http_download_file_go_infra','extract_urls_go_cybersecurity'\n",
    "], 'scraping_web — YA')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "081dc473",
   "metadata": {},
   "source": [
    "**Primer batch (tag `scraping`):**\n",
    "\n",
    "| # | id propuesto | proposito |\n",
    "|---|---|---|\n",
    "| 1 | `html_css_select_go_browser`     | goquery-like, devuelve nodos por selector CSS |\n",
    "| 2 | `html_to_text_go_browser`        | strip tags conservando estructura semantica |\n",
    "| 3 | `robots_txt_check_go_browser`    | parse + match user-agent/path antes de fetch |\n",
    "| 4 | `sitemap_iter_go_browser`        | descubre URLs desde sitemap.xml (+ index) |\n",
    "| 5 | `host_rate_limiter_go_infra`     | token-bucket por hostname con backoff 429 |\n",
    "| 6 | `crawl_frontier_go_browser`      | cola con dedupe + politeness por dominio |\n",
    "| 7 | `cdp_intercept_request_go_browser` | bloquear assets (img/font) para acelerar |\n",
    "| 8 | `scrape_pagination_py_browser`   | helper next-page con xpath/css o cursor JSON |\n",
    "\n",
    "Promover `apps/scraper_*` apps despues."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "22b7a80c",
   "metadata": {},
   "source": [
    "---\n",
    "## 3) analisis_quantitativo\n",
    "\n",
    "**Lo que tenemos** — Monte Carlo de mercado, Hawkes, GBM, Avellaneda-Stoikov, sharpe/drawdown. Suficiente para microestructura.\n",
    "\n",
    "**Falta** — todo lo que NO es microestructura: regresion, cointegration, PCA, portfolio optimization, GARCH, risk parity, distribuciones (kurtosis/skew)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "884a7570",
   "metadata": {},
   "outputs": [],
   "source": [
    "show([\n",
    "  'run_market_sim_py_pipelines','monte_carlo_market_py_pipelines',\n",
    "  'hawkes_intensity_py_finance','generate_gbm_prices_py_finance',\n",
    "  'avellaneda_stoikov_quotes_py_finance','generate_taker_order_py_finance',\n",
    "  'sharpe_ratio_py_finance','max_drawdown_py_finance',\n",
    "  'annualized_volatility_py_finance','log_return_py_finance'\n",
    "], 'quant — YA')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "75317f4c",
   "metadata": {},
   "outputs": [],
   "source": [
    "fts('regression OR cointegration OR portfolio OR garch OR pca')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8e045748",
   "metadata": {},
   "source": [
    "**Primer batch (tag `quant`):**\n",
    "\n",
    "| # | id | proposito |\n",
    "|---|---|---|\n",
    "| 1 | `linear_regression_py_datascience` | OLS con stats (R2, t, p) |\n",
    "| 2 | `engle_granger_test_py_finance`    | cointegracion 2 series |\n",
    "| 3 | `johansen_test_py_finance`         | cointegracion n series |\n",
    "| 4 | `garch_fit_py_finance`             | GARCH(1,1) volatilidad condicional |\n",
    "| 5 | `markowitz_optim_py_finance`       | min-variance / max-sharpe |\n",
    "| 6 | `risk_parity_py_finance`           | pesos por contribucion de riesgo |\n",
    "| 7 | `pca_explained_var_py_datascience` | PCA sobre returns + varianza explicada |\n",
    "| 8 | `var_es_historical_py_finance`     | VaR/Expected Shortfall historicos |\n",
    "| 9 | `pairs_zscore_py_finance`          | spread y zscore para pairs trading |"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "93fce6ed",
   "metadata": {},
   "source": [
    "---\n",
    "## 4) monitorizacion_realtime\n",
    "\n",
    "**Lo que tenemos** — SSE handlers, WS hub, rate limit, logger middleware, health check. Plomeria casi completa.\n",
    "\n",
    "**Falta** — la capa de **semantica**: metricas (counter/gauge/histogram), alerting, anomaly detection online, ring-buffers de series, exporter Prometheus, panel de tail de logs."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8a0e7780",
   "metadata": {},
   "outputs": [],
   "source": [
    "show([\n",
    "  'sse_handler_go_infra','sse_send_go_infra','sse_keepalive_go_infra',\n",
    "  'ws_handler_go_infra','ws_upgrader_go_infra',\n",
    "  'http_logger_middleware_go_infra','logger_middleware_go_infra',\n",
    "  'rate_limit_middleware_go_infra','rate_limiter_by_key_go_infra',\n",
    "  'health_check_http_go_infra'\n",
    "], 'realtime — YA (transporte)')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "018870d6",
   "metadata": {},
   "outputs": [],
   "source": [
    "fts('metric OR prometheus OR alert OR anomaly')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7d5eef7f",
   "metadata": {},
   "source": [
    "**Primer batch (tag `realtime` / `metrics`):**\n",
    "\n",
    "| # | id | proposito |\n",
    "|---|---|---|\n",
    "| 1 | `metric_counter_go_infra`          | atomic counter thread-safe |\n",
    "| 2 | `metric_gauge_go_infra`            | gauge con set/inc/dec |\n",
    "| 3 | `metric_histogram_go_infra`        | buckets configurables, sum/count |\n",
    "| 4 | `prometheus_exporter_go_infra`     | handler /metrics text format |\n",
    "| 5 | `ringbuffer_series_go_core`        | buffer circular para timeseries (pure) |\n",
    "| 6 | `ewma_anomaly_go_datascience`      | EWMA + 3-sigma deteccion outliers |\n",
    "| 7 | `alert_rule_evaluator_go_infra`    | expresion threshold → notif (compose con `slack_send`/email) |\n",
    "| 8 | `log_tail_sse_go_infra`            | broadcaster de log lines via SSE |"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4c1fe07a",
   "metadata": {},
   "source": [
    "---\n",
    "## 5) generacion_imagenes_ia\n",
    "\n",
    "**Lo que tenemos** — solo **tipos** (`image_generator`, `model_ref`, `lora_ref`, `generation_config`, `image_gen_result` × Go+Py). El contrato esta listo, **las implementaciones no existen**."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ec66c272",
   "metadata": {},
   "outputs": [],
   "source": [
    "pd.read_sql_query(\n",
    "  \"SELECT id, lang, algebraic, description FROM types WHERE domain='ml' AND \"\n",
    "  \"(id LIKE '%image%' OR id LIKE '%lora%' OR id LIKE '%model_ref%' OR id LIKE '%generation%')\",\n",
    "  conn)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "61dd77c0",
   "metadata": {},
   "outputs": [],
   "source": [
    "fts('diffusion OR stable OR sdxl OR comfy OR flux', 20)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d46ec553",
   "metadata": {},
   "source": [
    "**Primer batch (tag `image-gen`):**\n",
    "\n",
    "| # | id | proposito |\n",
    "|---|---|---|\n",
    "| 1 | `diffusers_generate_py_ml`        | impl local con `diffusers` cumpliendo `image_generator_py_ml` |\n",
    "| 2 | `comfyui_generate_py_ml`          | impl HTTP contra ComfyUI server local |\n",
    "| 3 | `openai_image_generate_py_ml`     | DALL-E / gpt-image-1 client |\n",
    "| 4 | `replicate_image_generate_py_ml`  | API generica replicate.com |\n",
    "| 5 | `image_to_image_py_ml`            | init image + strength sobre stack actual |\n",
    "| 6 | `controlnet_generate_py_ml`       | preprocessor + condicionamiento |\n",
    "| 7 | `image_grid_py_ml`                | helper PIL: grid NxM con seeds |\n",
    "| 8 | `prompt_template_render_py_core`  | Jinja-like prompt + LoRA tags + weights |\n",
    "\n",
    "Pipeline `image_gen_batch_py_pipelines` componiendo prompt → generator → save+meta."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0f77fa34",
   "metadata": {},
   "source": [
    "---\n",
    "## 6) generacion_texto_ia\n",
    "\n",
    "**Lo que tenemos** — solo **tipos** en `core`: `message`, `part`, `tool_part`, `text_part`, `context_part`, `query_plan`. No hay cliente."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a1692778",
   "metadata": {},
   "outputs": [],
   "source": [
    "pd.read_sql_query(\n",
    "  \"SELECT id, lang, algebraic, description FROM types \"\n",
    "  \"WHERE id IN ('message_py_core','part_py_core','text_part_py_core','tool_part_py_core',\"\n",
    "  \"'context_part_py_core','query_plan_py_core','matched_context_py_core')\",\n",
    "  conn)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "687d64ba",
   "metadata": {},
   "source": [
    "**Primer batch (tag `llm`):**\n",
    "\n",
    "| # | id | proposito |\n",
    "|---|---|---|\n",
    "| 1 | `anthropic_client_py_ml`          | client Claude (messages API + streaming SSE) |\n",
    "| 2 | `openai_client_py_ml`             | client GPT (chat completions + responses) |\n",
    "| 3 | `ollama_client_py_ml`             | local LLM via Ollama HTTP |\n",
    "| 4 | `llm_stream_to_sse_py_infra`      | bridge stream LLM → SSE para UI |\n",
    "| 5 | `tool_use_dispatcher_py_core`     | ejecuta tool_part contra registry de funciones |\n",
    "| 6 | `embedding_openai_py_ml`          | embeddings + cosine search |\n",
    "| 7 | `prompt_cache_anthropic_py_ml`    | ephemeral cache_control breakpoint |\n",
    "| 8 | `token_count_py_core`             | tiktoken / claude tokenizer |\n",
    "| 9 | `chat_session_jsonl_py_core`      | persistir/cargar `message[]` JSONL |"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ef57c750",
   "metadata": {},
   "source": [
    "---\n",
    "## 7) generacion_audio\n",
    "\n",
    "**Lo que tenemos** — solo **playback** en gamedev (`audio_engine_cpp_gamedev`, `audio_play_cpp_gamedev`, miniaudio). **0 generacion**, **0 STT/TTS**, sin dominio `audio`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "339b8fb6",
   "metadata": {},
   "outputs": [],
   "source": [
    "show(['audio_engine_cpp_gamedev','audio_play_cpp_gamedev'], 'audio — YA (solo playback)')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4bc34071",
   "metadata": {},
   "source": [
    "**Primer batch (nuevo dominio `audio`, tag `audio-gen`):**\n",
    "\n",
    "| # | id | proposito |\n",
    "|---|---|---|\n",
    "| 1 | `wav_read_py_audio`               | sf.read → np.ndarray + sample_rate |\n",
    "| 2 | `wav_write_py_audio`              | np.ndarray → wav PCM16 |\n",
    "| 3 | `resample_audio_py_audio`         | librosa/scipy resample (pure salvo IO) |\n",
    "| 4 | `tts_piper_py_audio`              | TTS offline Piper, multi-voz |\n",
    "| 5 | `tts_elevenlabs_py_audio`         | client API ElevenLabs |\n",
    "| 6 | `tts_openai_py_audio`             | client API OpenAI tts-1 |\n",
    "| 7 | `stt_whisper_local_py_audio`      | faster-whisper local |\n",
    "| 8 | `stt_whisper_api_py_audio`        | OpenAI whisper API |\n",
    "| 9 | `musicgen_generate_py_audio`      | facebook/musicgen via transformers |\n",
    "| 10| `audio_concat_py_audio`           | concatenar wavs con crossfade ms |"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1a991579",
   "metadata": {},
   "source": [
    "---\n",
    "## 8) audio_realtime_voiceconversion\n",
    "\n",
    "**Lo que tenemos** — **nada**. Sin captura, sin streaming, sin VC.\n",
    "\n",
    "Es el tema con mayor coste de entrada: requiere binario nativo (cmake/CUDA), latencia <100ms, ring-buffers PortAudio/miniaudio en input."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3cebc01a",
   "metadata": {},
   "source": [
    "**Primer batch (tag `audio-rt`, dominio `audio`):**\n",
    "\n",
    "| # | id | proposito |\n",
    "|---|---|---|\n",
    "| 1 | `audio_input_cpp_audio`           | captura miniaudio device → ring buffer (mirror de `audio_engine`) |\n",
    "| 2 | `audio_ring_buffer_cpp_core`      | spsc lock-free para samples float32 |\n",
    "| 3 | `vad_silero_py_audio`             | Voice Activity Detection on chunks 30ms |\n",
    "| 4 | `rvc_infer_py_audio`              | Retrieval-based Voice Conversion local (torch) |\n",
    "| 5 | `seed_vc_infer_py_audio`          | Seed-VC zero-shot baseline |\n",
    "| 6 | `audio_ws_stream_go_infra`        | WS server que recibe PCM y devuelve PCM convertido |\n",
    "| 7 | `audio_chunker_py_audio`          | dividir stream en chunks 320 samples para inferencia |\n",
    "| 8 | `pitch_shift_psola_py_audio`      | pitch shift sin neural, fallback rapido |\n",
    "\n",
    "Mas un app `apps/voice_changer/` (C++ ImGui + Go service) que componga el pipeline."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bf80cad5",
   "metadata": {},
   "source": [
    "---\n",
    "## Resumen\n",
    "\n",
    "| Tema | Cobertura actual | Esfuerzo proximo |\n",
    "|------|------------------|------------------|\n",
    "| trading | media-alta | conectores exchange + paper broker (8 fn) |\n",
    "| scraping_web | alta (CDP completo) | parser HTML + politeness + frontier (8 fn) |\n",
    "| quant | baja-media | regresion/coint/portfolio/risk (9 fn) |\n",
    "| realtime | alta (transporte) | metrics + alerting (8 fn) |\n",
    "| image_gen | cero (solo tipos) | implementaciones diffusers/comfy/openai (8 fn) |\n",
    "| text_gen | cero (solo tipos) | clientes LLM + streaming (9 fn) |\n",
    "| audio_gen | cero (solo playback) | dominio nuevo `audio`, TTS/STT/music (10 fn) |\n",
    "| audio_rt_vc | cero | el mas costoso, requiere C++ (8 fn + app) |\n",
    "\n",
    "**Total**: ~70 funciones nuevas para cubrir los 8 temas con un primer baseline funcional.\n",
    "\n",
    "**Prioridad sugerida** (por ratio valor / coste):\n",
    "1. text_gen (clientes LLM ya bloquean muchas otras apps).\n",
    "2. realtime metrics + alerting (acelera el propio fn_monitoring).\n",
    "3. trading conectores + paper broker (cierra el stack que ya esta a medias).\n",
    "4. scraping HTML parser + politeness (multiplicador para osint_graph y data ingest).\n",
    "5. image_gen (alto valor demo, dependencias pesadas).\n",
    "6. quant (puede vivir como funciones puras Py sin infra).\n",
    "7. audio_gen.\n",
    "8. audio_rt_vc (ultimo: nuevo dominio C++ + dep nativa)."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e7a722e1",
   "metadata": {},
   "source": [
    "---\n",
    "## Apendice — workaround FTS5\n",
    "\n",
    "`functions_fts` esta desfasada del contenido (`fts5: missing row N from content table 'main'.'functions'`).\n",
    "Las celdas `fts(...)` de arriba pueden petar. Solucion: regenerar el indice con `cd $FN_REGISTRY_ROOT && ./fn index`.\n",
    "\n",
    "Mientras, override con LIKE para que las busquedas funcionen sin FTS:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "658421c3",
   "metadata": {},
   "outputs": [],
   "source": [
    "def fts(q, limit=15):\n",
    "    \"\"\"Override seguro: busca term1|term2|... en name+description+tags via LIKE.\"\"\"\n",
    "    terms = [t.strip().lower() for t in q.replace(' OR ', '|').split('|') if t.strip()]\n",
    "    if not terms: return pd.DataFrame()\n",
    "    where = ' OR '.join([\"lower(name||' '||description||' '||tags) LIKE ?\"] * len(terms))\n",
    "    params = [f'%{t}%' for t in terms] + [limit]\n",
    "    return pd.read_sql_query(\n",
    "        f\"SELECT id, lang, purity, description FROM functions WHERE {where} LIMIT ?\",\n",
    "        conn, params=params)\n",
    "\n",
    "# Verifica que ahora encuentra funciones para los 3 gaps:\n",
    "for q in ['regression OR cointegration OR portfolio OR garch OR pca',\n",
    "          'metric OR prometheus OR alert OR anomaly',\n",
    "          'diffusion OR stable OR sdxl OR comfy OR flux']:\n",
    "    df = fts(q, limit=20)\n",
    "    print(f'--- {q} -> {len(df)} hits ---')\n",
    "    print(df.to_string(index=False) if len(df) else '(ninguno)')"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.13.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}