domain_coverage_gaps/notebooks/02_topic_gap_analysis.ipynb
2026-05-14 02:06:42 +02:00

{
"cells": [
{
"cell_type": "markdown",
"id": "95651c14",
"metadata": {},
"source": [
"# 02: Gap analysis for 8 topics\n",
"\n",
"For each topic: **(A) what we ALREADY have**, **(B) what is missing**, **(C) first step** (concrete functions to delegate to `fn-constructor`).\n",
"\n",
"Topics: trading · scraping_web · analisis_quantitativo · monitorizacion_realtime · generacion_imagenes_ia · generacion_texto_ia · generacion_audio · audio_realtime_voiceconversion."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8f0bbd20",
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"import sqlite3\n",
"\n",
"import pandas as pd\n",
"\n",
"# Open the registry read-only so the notebook cannot modify it.\n",
"ROOT = os.environ['FN_REGISTRY_ROOT']\n",
"conn = sqlite3.connect(f'file:{ROOT}/registry.db?mode=ro', uri=True)\n",
"pd.set_option('display.max_colwidth', 120)\n",
"\n",
"def show(ids, title=''):\n",
"    \"\"\"Return the registry rows for `ids`; the header shows found/requested.\"\"\"\n",
"    if not ids:\n",
"        print(f'{title}: (empty)')\n",
"        return None\n",
"    qm = ','.join('?' * len(ids))\n",
"    df = pd.read_sql_query(\n",
"        f\"SELECT id, lang, purity, description FROM functions WHERE id IN ({qm})\",\n",
"        conn, params=ids)\n",
"    if title:\n",
"        print(f'=== {title} ({len(df)}/{len(ids)}) ===')\n",
"    return df\n",
"\n",
"def fts(q, limit=15):\n",
"    \"\"\"Full-text search over `functions_fts` (FTS5), best matches first.\"\"\"\n",
"    return pd.read_sql_query(\n",
"        '''SELECT f.id, f.lang, f.purity, f.description\n",
"           FROM functions_fts JOIN functions f ON f.id = functions_fts.id\n",
"           WHERE functions_fts MATCH ? ORDER BY rank LIMIT ?''',\n",
"        conn, params=[q, limit])"
]
},
{
"cell_type": "markdown",
"id": "e367b10d",
"metadata": {},
"source": [
"---\n",
"## 1) trading\n",
"\n",
"**What we have:** `finance` already covers indicators, OHLCV, persistence, and a market simulator.\n",
"\n",
"**Missing** for a real trading stack: exchange connectors (REST + WS) for each specific venue, an order book, paper/live execution, risk management, a vectorized backtester, and sizing/portfolio."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "640b287d",
"metadata": {},
"outputs": [],
"source": [
"show([\n",
" 'fetch_ohlcv_go_finance','tick_to_ohlcv_go_finance','stream_ticks_go_finance',\n",
" 'sma_go_finance','ema_go_finance','rsi_go_finance','vwap_go_finance',\n",
" 'bollinger_bands_go_finance','sharpe_ratio_go_finance','max_drawdown_go_finance',\n",
" 'log_return_go_finance','annualized_volatility_go_finance','normalize_ohlcv_go_finance',\n",
" 'write_ohlcv_to_parquet_go_finance','load_ohlcv_from_duckdb_go_finance',\n",
" 'avellaneda_stoikov_quotes_py_finance','generate_taker_order_py_finance',\n",
" 'hawkes_intensity_py_finance','generate_gbm_prices_py_finance',\n",
" 'run_market_sim_py_pipelines','monte_carlo_market_py_pipelines'\n",
"], 'trading: already have')"
]
},
{
"cell_type": "markdown",
"id": "dacc42b2",
"metadata": {},
"source": [
"**Gap and first batch (delegate to fn-constructor, tag `trading`):**\n",
"\n",
"| # | proposed id | purpose |\n",
"|---|---|---|\n",
"| 1 | `binance_rest_client_py_finance` | authenticated REST client (klines, balance, order) |\n",
"| 2 | `binance_ws_stream_py_finance` | reconnecting WS streams for trade/depth/kline |\n",
"| 3 | `orderbook_l2_py_finance` | L2 book with snapshot+delta, BBO, walk-the-book |\n",
"| 4 | `paper_broker_py_finance` | FIFO simulator with configurable slippage |\n",
"| 5 | `position_sizer_py_finance` | fractional Kelly + risk cap |\n",
"| 6 | `backtest_vectorized_py_finance` | apply a signal to OHLCV → equity curve |\n",
"| 7 | `risk_metrics_py_finance` | VaR/ES/Calmar (the 3 still missing next to sharpe/drawdown) |\n",
"| 8 | `signal_crossover_go_finance` | golden/death cross + z-score mean reversion (pure) |"
]
},
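{
"cell_type": "markdown",
"id": "a7c31f02",
"metadata": {},
"source": [
"As an illustration of row 5, here is a minimal sketch of what `position_sizer_py_finance` might compute, assuming a simple binary win/loss model. The names `kelly_fraction` and `position_size`, their parameters, and the defaults are hypothetical, not the registry's API.\n",
"\n",
"```python\n",
"def kelly_fraction(win_prob, win_loss_ratio):\n",
"    # Kelly criterion for a binary bet: f* = p - (1 - p) / b\n",
"    return win_prob - (1.0 - win_prob) / win_loss_ratio\n",
"\n",
"def position_size(equity, win_prob, win_loss_ratio, kelly_scale=0.5, max_risk=0.02):\n",
"    # Hypothetical defaults: half Kelly, capped at 2% of equity per position.\n",
"    fraction = max(0.0, kelly_scale * kelly_fraction(win_prob, win_loss_ratio))\n",
"    return equity * min(fraction, max_risk)\n",
"```\n",
"\n",
"With `equity=10_000`, `win_prob=0.6` and `win_loss_ratio=2.0`, half Kelly suggests 20% of equity, so the 2% cap is what binds; a negative edge yields a size of zero."
]
},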
{
"cell_type": "markdown",
"id": "b3b3735e",
"metadata": {},
"source": [
"---\n",
"## 2) scraping_web\n",
"\n",
"**What we have:** the `browser` domain with full CDP in pure Go, plus `http_*` in infra. An excellent base.\n",
"\n",
"**Missing:** HTML parsing and CSS selection without a browser, robots/sitemap handling, deduplication, per-host rate limiting, incremental persistence, captchas. Also a `scraping` tag to group them all."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "90b5b349",
"metadata": {},
"outputs": [],
"source": [
"show([\n",
" 'chrome_launch_go_browser','cdp_connect_go_browser','cdp_navigate_go_browser',\n",
" 'cdp_evaluate_go_browser','cdp_get_html_go_browser','cdp_screenshot_go_browser',\n",
" 'cdp_click_go_browser','cdp_click_text_go_browser','cdp_find_by_text_go_browser',\n",
" 'cdp_type_text_go_browser','cdp_wait_element_go_browser','cdp_wait_load_go_browser',\n",
" 'cdp_har_record_go_browser','cdp_set_cookie_go_browser','cdp_new_tab_go_browser',\n",
" 'http_get_json_go_infra','http_download_file_go_infra','extract_urls_go_cybersecurity'\n",
"], 'scraping_web: already have')"
]
},
{
"cell_type": "markdown",
"id": "081dc473",
"metadata": {},
"source": [
"**First batch (tag `scraping`):**\n",
"\n",
"| # | proposed id | purpose |\n",
"|---|---|---|\n",
"| 1 | `html_css_select_go_browser` | goquery-like, returns nodes by CSS selector |\n",
"| 2 | `html_to_text_go_browser` | strip tags while preserving semantic structure |\n",
"| 3 | `robots_txt_check_go_browser` | parse + match user-agent/path before fetching |\n",
"| 4 | `sitemap_iter_go_browser` | discover URLs from sitemap.xml (+ index) |\n",
"| 5 | `host_rate_limiter_go_infra` | per-hostname token bucket with backoff on 429 |\n",
"| 6 | `crawl_frontier_go_browser` | queue with dedupe + per-domain politeness |\n",
"| 7 | `cdp_intercept_request_go_browser` | block assets (img/font) to speed up loads |\n",
"| 8 | `scrape_pagination_py_browser` | next-page helper using xpath/css or a JSON cursor |\n",
"\n",
"Promote the `apps/scraper_*` apps afterwards."
]
},
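{
"cell_type": "markdown",
"id": "b2e84d19",
"metadata": {},
"source": [
"Row 3 can be prototyped in Python with the standard library before the Go version exists. A minimal sketch, assuming the robots.txt body has already been fetched; the helper name `robots_allows`, the `fn-bot` user agent, and the sample rules are illustrative only.\n",
"\n",
"```python\n",
"from urllib.robotparser import RobotFileParser\n",
"\n",
"def robots_allows(robots_txt, user_agent, url):\n",
"    # Parse an already-downloaded robots.txt body; no network access here.\n",
"    parser = RobotFileParser()\n",
"    parser.parse(robots_txt.splitlines())\n",
"    return parser.can_fetch(user_agent, url)\n",
"\n",
"# Illustrative rules, not taken from any real site.\n",
"SAMPLE_ROBOTS = '''User-agent: *\n",
"Disallow: /private/\n",
"'''\n",
"```\n",
"\n",
"Under these sample rules, `robots_allows(SAMPLE_ROBOTS, 'fn-bot', 'https://example.com/private/x')` should come back `False`, while a path outside `/private/` is allowed."
]
},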
{
"cell_type": "markdown",
"id": "22b7a80c",
"metadata": {},
"source": [
"---\n",
"## 3) analisis_quantitativo\n",
"\n",
"**What we have:** market Monte Carlo, Hawkes, GBM, Avellaneda-Stoikov, sharpe/drawdown. Enough for microstructure.\n",
"\n",
"**Missing:** everything that is NOT microstructure: regression, cointegration, PCA, portfolio optimization, GARCH, risk parity, distribution statistics (kurtosis/skew)."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "884a7570",
"metadata": {},
"outputs": [],
"source": [
"show([\n",
" 'run_market_sim_py_pipelines','monte_carlo_market_py_pipelines',\n",
" 'hawkes_intensity_py_finance','generate_gbm_prices_py_finance',\n",
" 'avellaneda_stoikov_quotes_py_finance','generate_taker_order_py_finance',\n",
" 'sharpe_ratio_py_finance','max_drawdown_py_finance',\n",
" 'annualized_volatility_py_finance','log_return_py_finance'\n",
"], 'quant: already have')"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "75317f4c",
"metadata": {},
"outputs": [],
"source": [
"fts('regression OR cointegration OR portfolio OR garch OR pca')"
]
},
{
"cell_type": "markdown",
"id": "8e045748",
"metadata": {},
"source": [
"**First batch (tag `quant`):**\n",
"\n",
"| # | id | purpose |\n",
"|---|---|---|\n",
"| 1 | `linear_regression_py_datascience` | OLS with stats (R2, t, p) |\n",
"| 2 | `engle_granger_test_py_finance` | cointegration of 2 series |\n",
"| 3 | `johansen_test_py_finance` | cointegration of n series |\n",
"| 4 | `garch_fit_py_finance` | GARCH(1,1) conditional volatility |\n",
"| 5 | `markowitz_optim_py_finance` | min-variance / max-sharpe |\n",
"| 6 | `risk_parity_py_finance` | weights by risk contribution |\n",
"| 7 | `pca_explained_var_py_datascience` | PCA on returns + explained variance |\n",
"| 8 | `var_es_historical_py_finance` | historical VaR/Expected Shortfall |\n",
"| 9 | `pairs_zscore_py_finance` | spread and z-score for pairs trading |"
]
},
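{
"cell_type": "markdown",
"id": "c95a06f3",
"metadata": {},
"source": [
"Row 8 is among the easiest of the batch to pin down. Below is a minimal sketch of historical VaR and Expected Shortfall in pure Python, assuming simple per-period returns and a plain order-statistic tail. The function name `historical_var_es` and the way the tail size is rounded are assumptions; quantile conventions vary between libraries.\n",
"\n",
"```python\n",
"def historical_var_es(returns, alpha=0.95):\n",
"    # Losses are negated returns, sorted from the worst loss downwards.\n",
"    losses = sorted((-r for r in returns), reverse=True)\n",
"    # Number of observations in the (1 - alpha) tail, at least one.\n",
"    k = max(1, int(round(len(losses) * (1.0 - alpha))))\n",
"    tail = losses[:k]\n",
"    var = tail[-1]              # smallest loss that still falls in the tail\n",
"    es = sum(tail) / len(tail)  # mean loss inside the tail\n",
"    return var, es\n",
"```\n",
"\n",
"Both values are reported as positive losses, so Expected Shortfall is never smaller than VaR."
]
},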
{
"cell_type": "markdown",
"id": "93fce6ed",
"metadata": {},
"source": [
"---\n",
"## 4) monitorizacion_realtime\n",
"\n",
"**What we have:** SSE handlers, WS hub, rate limiting, logger middleware, health check. The plumbing is nearly complete.\n",
"\n",
"**Missing:** the **semantic** layer: metrics (counter/gauge/histogram), alerting, online anomaly detection, ring buffers for series, a Prometheus exporter, a log-tail panel."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8a0e7780",
"metadata": {},
"outputs": [],
"source": [
"show([\n",
" 'sse_handler_go_infra','sse_send_go_infra','sse_keepalive_go_infra',\n",
" 'ws_handler_go_infra','ws_upgrader_go_infra',\n",
" 'http_logger_middleware_go_infra','logger_middleware_go_infra',\n",
" 'rate_limit_middleware_go_infra','rate_limiter_by_key_go_infra',\n",
" 'health_check_http_go_infra'\n",
"], 'realtime: already have (transport)')"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "018870d6",
"metadata": {},
"outputs": [],
"source": [
"fts('metric OR prometheus OR alert OR anomaly')"
]
},
{
"cell_type": "markdown",
"id": "7d5eef7f",
"metadata": {},
"source": [
"**First batch (tag `realtime` / `metrics`):**\n",
"\n",
"| # | id | purpose |\n",
"|---|---|---|\n",
"| 1 | `metric_counter_go_infra` | thread-safe atomic counter |\n",
"| 2 | `metric_gauge_go_infra` | gauge with set/inc/dec |\n",
"| 3 | `metric_histogram_go_infra` | configurable buckets, sum/count |\n",
"| 4 | `prometheus_exporter_go_infra` | /metrics handler in text format |\n",
"| 5 | `ringbuffer_series_go_core` | circular buffer for time series (pure) |\n",
"| 6 | `ewma_anomaly_go_datascience` | EWMA + 3-sigma outlier detection |\n",
"| 7 | `alert_rule_evaluator_go_infra` | threshold expression → notification (composes with `slack_send`/email) |\n",
"| 8 | `log_tail_sse_go_infra` | broadcaster of log lines over SSE |"
]
},
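{
"cell_type": "markdown",
"id": "d1f3b7a8",
"metadata": {},
"source": [
"Row 6 names a Go function, but the logic is easy to prototype in Python, the language of this notebook. The sketch below keeps an online EWMA of the mean and variance and flags points beyond n sigma. The class name `EwmaAnomalyDetector`, its parameters, and the warm-up rule are assumptions made for illustration.\n",
"\n",
"```python\n",
"import math\n",
"\n",
"class EwmaAnomalyDetector:\n",
"    def __init__(self, alpha=0.1, n_sigma=3.0, warmup=5):\n",
"        self.alpha = alpha      # smoothing factor in (0, 1]\n",
"        self.n_sigma = n_sigma  # threshold in standard deviations\n",
"        self.warmup = warmup    # samples to observe before flagging anything\n",
"        self.mean = None\n",
"        self.var = 0.0\n",
"        self.count = 0\n",
"\n",
"    def update(self, x):\n",
"        # Returns True when x lies more than n_sigma from the running mean.\n",
"        self.count += 1\n",
"        if self.mean is None:\n",
"            self.mean = x\n",
"            return False\n",
"        diff = x - self.mean\n",
"        std = math.sqrt(self.var)\n",
"        is_anomaly = (self.count > self.warmup and std > 0\n",
"                      and abs(diff) > self.n_sigma * std)\n",
"        # Exponentially weighted updates of mean and variance.\n",
"        self.mean += self.alpha * diff\n",
"        self.var = (1.0 - self.alpha) * (self.var + self.alpha * diff * diff)\n",
"        return is_anomaly\n",
"```\n",
"\n",
"The test is made against the statistics from before the new point is absorbed, so a spike cannot hide by inflating its own variance estimate."
]
},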
{
"cell_type": "markdown",
"id": "4c1fe07a",
"metadata": {},
"source": [
"---\n",
"## 5) generacion_imagenes_ia\n",
"\n",
"**What we have:** only **types** (`image_generator`, `model_ref`, `lora_ref`, `generation_config`, `image_gen_result`, each in Go and Py). The contract is ready; **the implementations do not exist**."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ec66c272",
"metadata": {},
"outputs": [],
"source": [
"pd.read_sql_query(\n",
" \"SELECT id, lang, algebraic, description FROM types WHERE domain='ml' AND \"\n",
" \"(id LIKE '%image%' OR id LIKE '%lora%' OR id LIKE '%model_ref%' OR id LIKE '%generation%')\",\n",
" conn)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "61dd77c0",
"metadata": {},
"outputs": [],
"source": [
"fts('diffusion OR stable OR sdxl OR comfy OR flux', 20)"
]
},
{
"cell_type": "markdown",
"id": "d46ec553",
"metadata": {},
"source": [
"**First batch (tag `image-gen`):**\n",
"\n",
"| # | id | purpose |\n",
"|---|---|---|\n",
"| 1 | `diffusers_generate_py_ml` | local implementation with `diffusers` satisfying `image_generator_py_ml` |\n",
"| 2 | `comfyui_generate_py_ml` | HTTP implementation against a local ComfyUI server |\n",
"| 3 | `openai_image_generate_py_ml` | DALL-E / gpt-image-1 client |\n",
"| 4 | `replicate_image_generate_py_ml` | generic replicate.com API |\n",
"| 5 | `image_to_image_py_ml` | init image + strength on the current stack |\n",
"| 6 | `controlnet_generate_py_ml` | preprocessor + conditioning |\n",
"| 7 | `image_grid_py_ml` | PIL helper: NxM grid with seeds |\n",
"| 8 | `prompt_template_render_py_core` | Jinja-like prompt + LoRA tags + weights |\n",
"\n",
"Plus a pipeline `image_gen_batch_py_pipelines` composing prompt → generator → save+meta."
]
},
{
"cell_type": "markdown",
"id": "0f77fa34",
"metadata": {},
"source": [
"---\n",
"## 6) generacion_texto_ia\n",
"\n",
"**What we have:** only **types** in `core`: `message`, `part`, `tool_part`, `text_part`, `context_part`, `query_plan`. There is no client."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a1692778",
"metadata": {},
"outputs": [],
"source": [
"pd.read_sql_query(\n",
" \"SELECT id, lang, algebraic, description FROM types \"\n",
" \"WHERE id IN ('message_py_core','part_py_core','text_part_py_core','tool_part_py_core',\"\n",
" \"'context_part_py_core','query_plan_py_core','matched_context_py_core')\",\n",
" conn)"
]
},
{
"cell_type": "markdown",
"id": "687d64ba",
"metadata": {},
"source": [
"**First batch (tag `llm`):**\n",
"\n",
"| # | id | purpose |\n",
"|---|---|---|\n",
"| 1 | `anthropic_client_py_ml` | Claude client (messages API + SSE streaming) |\n",
"| 2 | `openai_client_py_ml` | GPT client (chat completions + responses) |\n",
"| 3 | `ollama_client_py_ml` | local LLM via Ollama HTTP |\n",
"| 4 | `llm_stream_to_sse_py_infra` | bridge LLM stream → SSE for the UI |\n",
"| 5 | `tool_use_dispatcher_py_core` | runs a tool_part against the function registry |\n",
"| 6 | `embedding_openai_py_ml` | embeddings + cosine search |\n",
"| 7 | `prompt_cache_anthropic_py_ml` | ephemeral cache_control breakpoint |\n",
"| 8 | `token_count_py_core` | tiktoken / claude tokenizer |\n",
"| 9 | `chat_session_jsonl_py_core` | persist/load `message[]` as JSONL |"
]
},
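{
"cell_type": "markdown",
"id": "e64c92b0",
"metadata": {},
"source": [
"Row 9 is mostly plumbing, so a sketch helps fix the file format. The example assumes each message is a plain JSON-serializable dict; the helper names and the `role`/`content` keys in the usage note are hypothetical, since the fields of the real `message` type are not shown in this notebook.\n",
"\n",
"```python\n",
"import json\n",
"\n",
"def save_messages(path, messages):\n",
"    # JSONL: one JSON object per line.\n",
"    with open(path, 'w', encoding='utf-8') as fh:\n",
"        for msg in messages:\n",
"            print(json.dumps(msg, ensure_ascii=False), file=fh)\n",
"\n",
"def load_messages(path):\n",
"    # Blank lines are skipped, so a trailing newline is harmless.\n",
"    with open(path, encoding='utf-8') as fh:\n",
"        return [json.loads(line) for line in fh if line.strip()]\n",
"```\n",
"\n",
"Saving `[{'role': 'user', 'content': 'hi'}]` and loading the file back should return an equal list, which makes the pair easy to cover with a round-trip test."
]
},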
{
"cell_type": "markdown",
"id": "ef57c750",
"metadata": {},
"source": [
"---\n",
"## 7) generacion_audio\n",
"\n",
"**What we have:** only **playback** in gamedev (`audio_engine_cpp_gamedev`, `audio_play_cpp_gamedev`, miniaudio). **No generation**, **no STT/TTS**, and no `audio` domain."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "339b8fb6",
"metadata": {},
"outputs": [],
"source": [
"show(['audio_engine_cpp_gamedev','audio_play_cpp_gamedev'], 'audio: already have (playback only)')"
]
},
{
"cell_type": "markdown",
"id": "4bc34071",
"metadata": {},
"source": [
"**First batch (new domain `audio`, tag `audio-gen`):**\n",
"\n",
"| # | id | purpose |\n",
"|---|---|---|\n",
"| 1 | `wav_read_py_audio` | sf.read → np.ndarray + sample_rate |\n",
"| 2 | `wav_write_py_audio` | np.ndarray → wav PCM16 |\n",
"| 3 | `resample_audio_py_audio` | librosa/scipy resample (pure except for IO) |\n",
"| 4 | `tts_piper_py_audio` | offline TTS with Piper, multi-voice |\n",
"| 5 | `tts_elevenlabs_py_audio` | ElevenLabs API client |\n",
"| 6 | `tts_openai_py_audio` | OpenAI tts-1 API client |\n",
"| 7 | `stt_whisper_local_py_audio` | local faster-whisper |\n",
"| 8 | `stt_whisper_api_py_audio` | OpenAI whisper API |\n",
"| 9 | `musicgen_generate_py_audio` | facebook/musicgen via transformers |\n",
"| 10 | `audio_concat_py_audio` | concatenate wavs with a crossfade in ms |"
]
},
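{
"cell_type": "markdown",
"id": "f08d5e47",
"metadata": {},
"source": [
"The table proposes numpy arrays for row 2. As a dependency-free stand-in, the sketch below writes mono PCM16 with the standard-library `wave` and `struct` modules from a plain list of floats. The function name `write_wav_pcm16`, the default sample rate, and the clipping behaviour are assumptions.\n",
"\n",
"```python\n",
"import struct\n",
"import wave\n",
"\n",
"def write_wav_pcm16(path, samples, sample_rate=16000):\n",
"    # Clip floats to [-1, 1] and scale them to signed 16-bit integers.\n",
"    ints = [int(max(-1.0, min(1.0, s)) * 32767) for s in samples]\n",
"    with wave.open(path, 'wb') as wf:\n",
"        wf.setnchannels(1)   # mono\n",
"        wf.setsampwidth(2)   # 2 bytes per sample = 16 bit\n",
"        wf.setframerate(sample_rate)\n",
"        # Little-endian signed shorts, one per sample.\n",
"        wf.writeframes(struct.pack('<%dh' % len(ints), *ints))\n",
"```\n",
"\n",
"A numpy-based version would replace the list comprehension with vectorized clipping and casting, keeping the same file layout."
]
},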
{
"cell_type": "markdown",
"id": "1a991579",
"metadata": {},
"source": [
"---\n",
"## 8) audio_realtime_voiceconversion\n",
"\n",
"**What we have:** **nothing**. No capture, no streaming, no VC.\n",
"\n",
"This is the topic with the highest entry cost: it requires a native binary (cmake/CUDA), latency <100ms, and PortAudio/miniaudio ring buffers on the input side."
]
},
{
"cell_type": "markdown",
"id": "3cebc01a",
"metadata": {},
"source": [
"**First batch (tag `audio-rt`, domain `audio`):**\n",
"\n",
"| # | id | purpose |\n",
"|---|---|---|\n",
"| 1 | `audio_input_cpp_audio` | miniaudio device capture → ring buffer (mirror of `audio_engine`) |\n",
"| 2 | `audio_ring_buffer_cpp_core` | lock-free SPSC buffer for float32 samples |\n",
"| 3 | `vad_silero_py_audio` | Voice Activity Detection on 30ms chunks |\n",
"| 4 | `rvc_infer_py_audio` | local Retrieval-based Voice Conversion (torch) |\n",
"| 5 | `seed_vc_infer_py_audio` | Seed-VC zero-shot baseline |\n",
"| 6 | `audio_ws_stream_go_infra` | WS server that receives PCM and returns converted PCM |\n",
"| 7 | `audio_chunker_py_audio` | split the stream into 320-sample chunks for inference |\n",
"| 8 | `pitch_shift_psola_py_audio` | non-neural pitch shift, fast fallback |\n",
"\n",
"Plus an app `apps/voice_changer/` (C++ ImGui + Go service) that composes the pipeline."
]
},
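{
"cell_type": "markdown",
"id": "0a3b6c9d",
"metadata": {},
"source": [
"Row 7 is small enough to sketch directly. The generator below splits a sample sequence into fixed 320-sample chunks, which is 20 ms at an assumed 16 kHz sample rate. The function name `chunk_samples` and the choice to zero-pad the last chunk are assumptions; dropping or buffering the remainder are equally plausible policies.\n",
"\n",
"```python\n",
"def chunk_samples(samples, chunk_size=320):\n",
"    # Yield lists of exactly chunk_size samples, in order.\n",
"    for start in range(0, len(samples), chunk_size):\n",
"        chunk = list(samples[start:start + chunk_size])\n",
"        if len(chunk) < chunk_size:\n",
"            # Zero-pad the final partial chunk so every chunk has the same length.\n",
"            chunk.extend([0.0] * (chunk_size - len(chunk)))\n",
"        yield chunk\n",
"```\n",
"\n",
"For 700 input samples this yields three chunks, the last one holding 60 real samples followed by 260 zeros."
]
},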
{
"cell_type": "markdown",
"id": "bf80cad5",
"metadata": {},
"source": [
"---\n",
"## Summary\n",
"\n",
"| Topic | Current coverage | Next effort |\n",
"|------|------------------|------------------|\n",
"| trading | medium-high | exchange connectors + paper broker (8 fn) |\n",
"| scraping_web | high (full CDP) | HTML parser + politeness + frontier (8 fn) |\n",
"| quant | low-medium | regression/coint/portfolio/risk (9 fn) |\n",
"| realtime | high (transport) | metrics + alerting (8 fn) |\n",
"| image_gen | none (types only) | diffusers/comfy/openai implementations (8 fn) |\n",
"| text_gen | none (types only) | LLM clients + streaming (9 fn) |\n",
"| audio_gen | none (playback only) | new `audio` domain, TTS/STT/music (10 fn) |\n",
"| audio_rt_vc | none | the most expensive, requires C++ (8 fn + app) |\n",
"\n",
"**Total**: ~70 new functions (68 listed in the batches above) to cover the 8 topics with a first working baseline.\n",
"\n",
"**Suggested priority** (by value/cost ratio):\n",
"1. text_gen (the missing LLM clients already block many other apps).\n",
"2. realtime metrics + alerting (speeds up fn_monitoring itself).\n",
"3. trading connectors + paper broker (completes the stack that is already half built).\n",
"4. scraping HTML parser + politeness (a multiplier for osint_graph and data ingest).\n",
"5. image_gen (high demo value, heavy dependencies).\n",
"6. quant (can live as pure Py functions with no infra).\n",
"7. audio_gen.\n",
"8. audio_rt_vc (last: new C++ domain + native dependency)."
]
},
{
"cell_type": "markdown",
"id": "e7a722e1",
"metadata": {},
"source": [
"---\n",
"## Appendix: FTS5 workaround\n",
"\n",
"`functions_fts` is out of sync with its content table (`fts5: missing row N from content table 'main'.'functions'`).\n",
"The `fts(...)` cells above may fail. Fix: regenerate the index with `cd $FN_REGISTRY_ROOT && ./fn index`.\n",
"\n",
"Until then, override `fts` with a LIKE-based version so the searches work without FTS; run the cell below, then re-run the `fts(...)` cells:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "658421c3",
"metadata": {},
"outputs": [],
"source": [
"def fts(q, limit=15):\n",
"    \"\"\"Safe override: search term1|term2|... in name+description+tags via LIKE.\"\"\"\n",
"    terms = [t.strip().lower() for t in q.replace(' OR ', '|').split('|') if t.strip()]\n",
"    if not terms:\n",
"        return pd.DataFrame()\n",
"    # COALESCE keeps one NULL column from turning the whole concatenation into NULL.\n",
"    haystack = \"lower(coalesce(name,'')||' '||coalesce(description,'')||' '||coalesce(tags,''))\"\n",
"    where = ' OR '.join([f'{haystack} LIKE ?'] * len(terms))\n",
"    params = [f'%{t}%' for t in terms] + [limit]\n",
"    return pd.read_sql_query(\n",
"        f\"SELECT id, lang, purity, description FROM functions WHERE {where} LIMIT ?\",\n",
"        conn, params=params)\n",
"\n",
"# Check that the search now finds functions for the 3 gaps:\n",
"for q in ['regression OR cointegration OR portfolio OR garch OR pca',\n",
"          'metric OR prometheus OR alert OR anomaly',\n",
"          'diffusion OR stable OR sdxl OR comfy OR flux']:\n",
"    df = fts(q, limit=20)\n",
"    print(f'--- {q} -> {len(df)} hits ---')\n",
"    print(df.to_string(index=False) if len(df) else '(none)')"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.13.7"
}
},
"nbformat": 4,
"nbformat_minor": 5
}