{ "cells": [ { "cell_type": "markdown", "id": "95651c14", "metadata": {}, "source": [ "# 02 - Gap analysis: 8 themes\n", "\n", "For each theme: **(A) what we ALREADY have**, **(B) what is missing**, **(C) first step** (concrete functions to delegate to `fn-constructor`).\n", "\n", "Themes: trading · scraping_web · analisis_quantitativo · monitorizacion_realtime · generacion_imagenes_ia · generacion_texto_ia · generacion_audio · audio_realtime_voiceconversion." ] }, { "cell_type": "code", "execution_count": null, "id": "8f0bbd20", "metadata": {}, "outputs": [], "source": [ "import os\n", "import sqlite3\n", "\n", "import pandas as pd\n", "\n", "# Open the registry database read-only (SQLite URI with mode=ro).\n", "ROOT = os.environ['FN_REGISTRY_ROOT']\n", "conn = sqlite3.connect(f'file:{ROOT}/registry.db?mode=ro', uri=True)\n", "pd.set_option('display.max_colwidth', 120)\n", "\n", "def show(ids, title=''):\n", "    \"\"\"Fetch the given function ids; the header shows found/requested counts.\"\"\"\n", "    if not ids:\n", "        print(f'{title}: (empty)')\n", "        return None\n", "    qm = ','.join('?' * len(ids))  # one placeholder per id\n", "    df = pd.read_sql_query(\n", "        f\"SELECT id, lang, purity, description FROM functions WHERE id IN ({qm})\",\n", "        conn, params=ids)\n", "    if title:\n", "        print(f'=== {title} ({len(df)}/{len(ids)}) ===')\n", "    return df\n", "\n", "def fts(q, limit=15):\n", "    \"\"\"Full-text search over `functions_fts`, best-ranked matches first.\"\"\"\n", "    return pd.read_sql_query(\n", "        '''SELECT f.id, f.lang, f.purity, f.description\n", "           FROM functions_fts JOIN functions f ON f.id = functions_fts.id\n", "           WHERE functions_fts MATCH ? ORDER BY rank LIMIT ?''',\n", "        conn, params=[q, limit])" ] }, { "cell_type": "markdown", "id": "e367b10d", "metadata": {}, "source": [ "---\n", "## 1) trading\n", "\n", "**What we have:** `finance` already covers indicators + OHLCV + persistence, plus a market simulator.\n", "\n", "**Missing** for a real trading stack: exchange connectors (REST + WS) per specific venue, order book, paper/live execution, risk management, vectorized backtester, sizing/portfolio." 
] }, { "cell_type": "code", "execution_count": null, "id": "640b287d", "metadata": {}, "outputs": [], "source": [ "show([\n", "    'fetch_ohlcv_go_finance','tick_to_ohlcv_go_finance','stream_ticks_go_finance',\n", "    'sma_go_finance','ema_go_finance','rsi_go_finance','vwap_go_finance',\n", "    'bollinger_bands_go_finance','sharpe_ratio_go_finance','max_drawdown_go_finance',\n", "    'log_return_go_finance','annualized_volatility_go_finance','normalize_ohlcv_go_finance',\n", "    'write_ohlcv_to_parquet_go_finance','load_ohlcv_from_duckdb_go_finance',\n", "    'avellaneda_stoikov_quotes_py_finance','generate_taker_order_py_finance',\n", "    'hawkes_intensity_py_finance','generate_gbm_prices_py_finance',\n", "    'run_market_sim_py_pipelines','monte_carlo_market_py_pipelines'\n", "], 'trading: already have')" ] }, { "cell_type": "markdown", "id": "dacc42b2", "metadata": {}, "source": [ "**Gap & first batch (delegate to fn-constructor, tag `trading`):**\n", "\n", "| # | proposed id | purpose |\n", "|---|---|---|\n", "| 1 | `binance_rest_client_py_finance` | authenticated REST client (klines, balance, order) |\n", "| 2 | `binance_ws_stream_py_finance` | reconnectable WS streams for trade/depth/kline |\n", "| 3 | `orderbook_l2_py_finance` | L2 book with snapshot+delta, BBO, walk-the-book |\n", "| 4 | `paper_broker_py_finance` | FIFO simulator with configurable slippage |\n", "| 5 | `position_sizer_py_finance` | fractional Kelly + risk cap |\n", "| 6 | `backtest_vectorized_py_finance` | apply a signal over OHLCV → equity curve |\n", "| 7 | `risk_metrics_py_finance` | VaR/ES/Calmar (the 3 still missing alongside sharpe/drawdown) |\n", "| 8 | `signal_crossover_go_finance` | golden/death cross + zscore mean reversion (pure) |" ] }, { "cell_type": "markdown", "id": "b3b3735e", "metadata": {}, "source": [ "---\n", "## 2) scraping_web\n", "\n", "**What we have:** the `browser` domain with full CDP in pure Go + `http_*` in infra. 
An excellent base.\n", "\n", "**Missing:** HTML parsing / CSS selection without a browser, robots/sitemap, deduplication, per-host rate limiting, incremental persistence, captchas. Also a `scraping` tag to group it all." ] }, { "cell_type": "code", "execution_count": null, "id": "90b5b349", "metadata": {}, "outputs": [], "source": [ "show([\n", "    'chrome_launch_go_browser','cdp_connect_go_browser','cdp_navigate_go_browser',\n", "    'cdp_evaluate_go_browser','cdp_get_html_go_browser','cdp_screenshot_go_browser',\n", "    'cdp_click_go_browser','cdp_click_text_go_browser','cdp_find_by_text_go_browser',\n", "    'cdp_type_text_go_browser','cdp_wait_element_go_browser','cdp_wait_load_go_browser',\n", "    'cdp_har_record_go_browser','cdp_set_cookie_go_browser','cdp_new_tab_go_browser',\n", "    'http_get_json_go_infra','http_download_file_go_infra','extract_urls_go_cybersecurity'\n", "], 'scraping_web: already have')" ] }, { "cell_type": "markdown", "id": "081dc473", "metadata": {}, "source": [ "**First batch (tag `scraping`):**\n", "\n", "| # | proposed id | purpose |\n", "|---|---|---|\n", "| 1 | `html_css_select_go_browser` | goquery-like, returns nodes by CSS selector |\n", "| 2 | `html_to_text_go_browser` | strips tags while preserving semantic structure |\n", "| 3 | `robots_txt_check_go_browser` | parse + match user-agent/path before fetching |\n", "| 4 | `sitemap_iter_go_browser` | discovers URLs from sitemap.xml (+ index) |\n", "| 5 | `host_rate_limiter_go_infra` | per-hostname token bucket with backoff on 429 |\n", "| 6 | `crawl_frontier_go_browser` | queue with dedupe + per-domain politeness |\n", "| 7 | `cdp_intercept_request_go_browser` | block assets (img/font) to speed up page loads |\n", "| 8 | `scrape_pagination_py_browser` | next-page helper using xpath/css or a JSON cursor |\n", "\n", "Promote the `apps/scraper_*` apps afterwards." 
] }, { "cell_type": "markdown", "id": "22b7a80c", "metadata": {}, "source": [ "---\n", "## 3) analisis_quantitativo\n", "\n", "**What we have:** market Monte Carlo, Hawkes, GBM, Avellaneda-Stoikov, sharpe/drawdown. Enough for microstructure.\n", "\n", "**Missing:** everything that is NOT microstructure: regression, cointegration, PCA, portfolio optimization, GARCH, risk parity, distribution statistics (kurtosis/skew)." ] }, { "cell_type": "code", "execution_count": null, "id": "884a7570", "metadata": {}, "outputs": [], "source": [ "show([\n", "    'run_market_sim_py_pipelines','monte_carlo_market_py_pipelines',\n", "    'hawkes_intensity_py_finance','generate_gbm_prices_py_finance',\n", "    'avellaneda_stoikov_quotes_py_finance','generate_taker_order_py_finance',\n", "    'sharpe_ratio_py_finance','max_drawdown_py_finance',\n", "    'annualized_volatility_py_finance','log_return_py_finance'\n", "], 'quant: already have')" ] }, { "cell_type": "code", "execution_count": null, "id": "75317f4c", "metadata": {}, "outputs": [], "source": [ "fts('regression OR cointegration OR portfolio OR garch OR pca')" ] }, { "cell_type": "markdown", "id": "8e045748", "metadata": {}, "source": [ "**First batch (tag `quant`):**\n", "\n", "| # | id | purpose |\n", "|---|---|---|\n", "| 1 | `linear_regression_py_datascience` | OLS with stats (R2, t, p) |\n", "| 2 | `engle_granger_test_py_finance` | cointegration of 2 series |\n", "| 3 | `johansen_test_py_finance` | cointegration of n series |\n", "| 4 | `garch_fit_py_finance` | GARCH(1,1) conditional volatility |\n", "| 5 | `markowitz_optim_py_finance` | min-variance / max-sharpe |\n", "| 6 | `risk_parity_py_finance` | weights by risk contribution |\n", "| 7 | `pca_explained_var_py_datascience` | PCA over returns + explained variance |\n", "| 8 | `var_es_historical_py_finance` | historical VaR/Expected Shortfall |\n", "| 9 | `pairs_zscore_py_finance` | spread and zscore for pairs trading |" ] }, { "cell_type": "markdown", "id": "93fce6ed", "metadata": {}, 
"source": [ "---\n", "## 4) monitorizacion_realtime\n", "\n", "**What we have:** SSE handlers, WS hub, rate limit, logger middleware, health check. The plumbing is almost complete.\n", "\n", "**Missing:** the **semantic** layer, i.e. metrics (counter/gauge/histogram), alerting, online anomaly detection, ring buffers for series, a Prometheus exporter, a log-tail panel." ] }, { "cell_type": "code", "execution_count": null, "id": "8a0e7780", "metadata": {}, "outputs": [], "source": [ "show([\n", "    'sse_handler_go_infra','sse_send_go_infra','sse_keepalive_go_infra',\n", "    'ws_handler_go_infra','ws_upgrader_go_infra',\n", "    'http_logger_middleware_go_infra','logger_middleware_go_infra',\n", "    'rate_limit_middleware_go_infra','rate_limiter_by_key_go_infra',\n", "    'health_check_http_go_infra'\n", "], 'realtime: already have (transport)')" ] }, { "cell_type": "code", "execution_count": null, "id": "018870d6", "metadata": {}, "outputs": [], "source": [ "fts('metric OR prometheus OR alert OR anomaly')" ] }, { "cell_type": "markdown", "id": "7d5eef7f", "metadata": {}, "source": [ "**First batch (tag `realtime` / `metrics`):**\n", "\n", "| # | id | purpose |\n", "|---|---|---|\n", "| 1 | `metric_counter_go_infra` | thread-safe atomic counter |\n", "| 2 | `metric_gauge_go_infra` | gauge with set/inc/dec |\n", "| 3 | `metric_histogram_go_infra` | configurable buckets, sum/count |\n", "| 4 | `prometheus_exporter_go_infra` | /metrics handler, text format |\n", "| 5 | `ringbuffer_series_go_core` | circular buffer for time series (pure) |\n", "| 6 | `ewma_anomaly_go_datascience` | EWMA + 3-sigma outlier detection |\n", "| 7 | `alert_rule_evaluator_go_infra` | threshold expression → notification (composes with `slack_send`/email) |\n", "| 8 | `log_tail_sse_go_infra` | broadcaster of log lines via SSE |" ] }, { "cell_type": "markdown", "id": "4c1fe07a", "metadata": {}, "source": [ "---\n", "## 5) generacion_imagenes_ia\n", "\n", "**What we have:** only **types** (`image_generator`, `model_ref`, 
`lora_ref`, `generation_config`, `image_gen_result` × Go+Py). The contract is ready, but **the implementations do not exist**." ] }, { "cell_type": "code", "execution_count": null, "id": "ec66c272", "metadata": {}, "outputs": [], "source": [ "pd.read_sql_query(\n", "    \"SELECT id, lang, algebraic, description FROM types WHERE domain='ml' AND \"\n", "    \"(id LIKE '%image%' OR id LIKE '%lora%' OR id LIKE '%model_ref%' OR id LIKE '%generation%')\",\n", "    conn)" ] }, { "cell_type": "code", "execution_count": null, "id": "61dd77c0", "metadata": {}, "outputs": [], "source": [ "fts('diffusion OR stable OR sdxl OR comfy OR flux', 20)" ] }, { "cell_type": "markdown", "id": "d46ec553", "metadata": {}, "source": [ "**First batch (tag `image-gen`):**\n", "\n", "| # | id | purpose |\n", "|---|---|---|\n", "| 1 | `diffusers_generate_py_ml` | local impl with `diffusers` satisfying `image_generator_py_ml` |\n", "| 2 | `comfyui_generate_py_ml` | HTTP impl against a local ComfyUI server |\n", "| 3 | `openai_image_generate_py_ml` | DALL-E / gpt-image-1 client |\n", "| 4 | `replicate_image_generate_py_ml` | generic replicate.com API |\n", "| 5 | `image_to_image_py_ml` | init image + strength on the current stack |\n", "| 6 | `controlnet_generate_py_ml` | preprocessor + conditioning |\n", "| 7 | `image_grid_py_ml` | PIL helper: NxM grid with seeds |\n", "| 8 | `prompt_template_render_py_core` | Jinja-like prompt + LoRA tags + weights |\n", "\n", "Plus a pipeline `image_gen_batch_py_pipelines` composing prompt → generator → save+meta." ] }, { "cell_type": "markdown", "id": "0f77fa34", "metadata": {}, "source": [ "---\n", "## 6) generacion_texto_ia\n", "\n", "**What we have:** only **types** in `core`: `message`, `part`, `tool_part`, `text_part`, `context_part`, `query_plan`. There is no client." 
] }, { "cell_type": "code", "execution_count": null, "id": "a1692778", "metadata": {}, "outputs": [], "source": [ "pd.read_sql_query(\n", "    \"SELECT id, lang, algebraic, description FROM types \"\n", "    \"WHERE id IN ('message_py_core','part_py_core','text_part_py_core','tool_part_py_core',\"\n", "    \"'context_part_py_core','query_plan_py_core','matched_context_py_core')\",\n", "    conn)" ] }, { "cell_type": "markdown", "id": "687d64ba", "metadata": {}, "source": [ "**First batch (tag `llm`):**\n", "\n", "| # | id | purpose |\n", "|---|---|---|\n", "| 1 | `anthropic_client_py_ml` | Claude client (messages API + SSE streaming) |\n", "| 2 | `openai_client_py_ml` | GPT client (chat completions + responses) |\n", "| 3 | `ollama_client_py_ml` | local LLM via Ollama HTTP |\n", "| 4 | `llm_stream_to_sse_py_infra` | bridge from LLM stream → SSE for the UI |\n", "| 5 | `tool_use_dispatcher_py_core` | executes a tool_part against the function registry |\n", "| 6 | `embedding_openai_py_ml` | embeddings + cosine search |\n", "| 7 | `prompt_cache_anthropic_py_ml` | ephemeral cache_control breakpoint |\n", "| 8 | `token_count_py_core` | tiktoken / claude tokenizer |\n", "| 9 | `chat_session_jsonl_py_core` | persist/load `message[]` as JSONL |" ] }, { "cell_type": "markdown", "id": "ef57c750", "metadata": {}, "source": [ "---\n", "## 7) generacion_audio\n", "\n", "**What we have:** only **playback** in gamedev (`audio_engine_cpp_gamedev`, `audio_play_cpp_gamedev`, miniaudio). **0 generation**, **0 STT/TTS**, and no `audio` domain." 
] }, { "cell_type": "code", "execution_count": null, "id": "339b8fb6", "metadata": {}, "outputs": [], "source": [ "show(['audio_engine_cpp_gamedev','audio_play_cpp_gamedev'], 'audio: already have (playback only)')" ] }, { "cell_type": "markdown", "id": "4bc34071", "metadata": {}, "source": [ "**First batch (new `audio` domain, tag `audio-gen`):**\n", "\n", "| # | id | purpose |\n", "|---|---|---|\n", "| 1 | `wav_read_py_audio` | sf.read → np.ndarray + sample_rate |\n", "| 2 | `wav_write_py_audio` | np.ndarray → wav PCM16 |\n", "| 3 | `resample_audio_py_audio` | librosa/scipy resample (pure except for IO) |\n", "| 4 | `tts_piper_py_audio` | offline Piper TTS, multi-voice |\n", "| 5 | `tts_elevenlabs_py_audio` | ElevenLabs API client |\n", "| 6 | `tts_openai_py_audio` | OpenAI tts-1 API client |\n", "| 7 | `stt_whisper_local_py_audio` | local faster-whisper |\n", "| 8 | `stt_whisper_api_py_audio` | OpenAI whisper API |\n", "| 9 | `musicgen_generate_py_audio` | facebook/musicgen via transformers |\n", "| 10 | `audio_concat_py_audio` | concatenate wavs with a crossfade in ms |" ] }, { "cell_type": "markdown", "id": "1a991579", "metadata": {}, "source": [ "---\n", "## 8) audio_realtime_voiceconversion\n", "\n", "**What we have:** **nothing**. No capture, no streaming, no VC.\n", "\n", "This is the theme with the highest entry cost: it requires a native binary (cmake/CUDA), latency <100ms, and PortAudio/miniaudio ring buffers on the input side." 
] }, { "cell_type": "markdown", "id": "3cebc01a", "metadata": {}, "source": [ "**First batch (tag `audio-rt`, domain `audio`):**\n", "\n", "| # | id | purpose |\n", "|---|---|---|\n", "| 1 | `audio_input_cpp_audio` | miniaudio device capture → ring buffer (mirror of `audio_engine`) |\n", "| 2 | `audio_ring_buffer_cpp_core` | lock-free SPSC buffer for float32 samples |\n", "| 3 | `vad_silero_py_audio` | Voice Activity Detection on 30 ms chunks |\n", "| 4 | `rvc_infer_py_audio` | local Retrieval-based Voice Conversion (torch) |\n", "| 5 | `seed_vc_infer_py_audio` | Seed-VC zero-shot baseline |\n", "| 6 | `audio_ws_stream_go_infra` | WS server that receives PCM and returns converted PCM |\n", "| 7 | `audio_chunker_py_audio` | split the stream into 320-sample chunks for inference |\n", "| 8 | `pitch_shift_psola_py_audio` | non-neural pitch shift, fast fallback |\n", "\n", "Plus an app `apps/voice_changer/` (C++ ImGui + Go service) that composes the pipeline." ] }, { "cell_type": "markdown", "id": "bf80cad5", "metadata": {}, "source": [ "---\n", "## Summary\n", "\n", "| Theme | Current coverage | Next effort |\n", "|------|------------------|------------------|\n", "| trading | medium-high | exchange connectors + paper broker (8 fn) |\n", "| scraping_web | high (full CDP) | HTML parser + politeness + frontier (8 fn) |\n", "| quant | low-medium | regression/coint/portfolio/risk (9 fn) |\n", "| realtime | high (transport) | metrics + alerting (8 fn) |\n", "| image_gen | zero (types only) | diffusers/comfy/openai implementations (8 fn) |\n", "| text_gen | zero (types only) | LLM clients + streaming (9 fn) |\n", "| audio_gen | zero (playback only) | new `audio` domain, TTS/STT/music (10 fn) |\n", "| audio_rt_vc | zero | the most costly, requires C++ (8 fn + app) |\n", "\n", "**Total**: 68 new functions to cover the 8 themes with a first functional baseline.\n", "\n", "**Suggested priority** (by value/cost ratio):\n", "1. 
text_gen (the missing LLM clients already block many other apps).\n", "2. realtime metrics + alerting (speeds up fn_monitoring itself).\n", "3. trading connectors + paper broker (completes the stack that is already half-built).\n", "4. scraping HTML parser + politeness (a multiplier for osint_graph and data ingest).\n", "5. image_gen (high demo value, heavy dependencies).\n", "6. quant (can live as pure Py functions with no infra).\n", "7. audio_gen.\n", "8. audio_rt_vc (last: new C++ domain + native dependency)." ] }, { "cell_type": "markdown", "id": "e7a722e1", "metadata": {}, "source": [ "---\n", "## Appendix: FTS5 workaround\n", "\n", "`functions_fts` is out of sync with its content table (`fts5: missing row N from content table 'main'.'functions'`).\n", "The `fts(...)` cells above may fail. Fix: rebuild the index with `cd $FN_REGISTRY_ROOT && ./fn index`.\n", "\n", "In the meantime, override `fts` with a LIKE-based version so that searches work without FTS:" ] }, { "cell_type": "code", "execution_count": null, "id": "658421c3", "metadata": {}, "outputs": [], "source": [ "def fts(q, limit=15):\n", " \"\"\"Safe override: searches term1|term2|... 
in name+description+tags via LIKE.\"\"\"\n", " terms = [t.strip().lower() for t in q.replace(' OR ', '|').split('|') if t.strip()]\n", " if not terms: return pd.DataFrame()\n", " # coalesce() keeps rows with a NULL column searchable (NULL || x is NULL in SQLite).\n", " cols = \"coalesce(name,'')||' '||coalesce(description,'')||' '||coalesce(tags,'')\"\n", " where = ' OR '.join([f'lower({cols}) LIKE ?'] * len(terms))\n", " params = [f'%{t}%' for t in terms] + [limit]\n", " return pd.read_sql_query(\n", " f\"SELECT id, lang, purity, description FROM functions WHERE {where} LIMIT ?\",\n", " conn, params=params)\n", "\n", "# Check that it now finds functions for the 3 gaps:\n", "for q in ['regression OR cointegration OR portfolio OR garch OR pca',\n", " 'metric OR prometheus OR alert OR anomaly',\n", " 'diffusion OR stable OR sdxl OR comfy OR flux']:\n", " df = fts(q, limit=20)\n", " print(f'--- {q} -> {len(df)} hits ---')\n", " print(df.to_string(index=False) if len(df) else '(none)')" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.13.7" } }, "nbformat": 4, "nbformat_minor": 5 }