feat: extraccion masiva footprint_aurgi (41 funcs + 4 types + stack Docker geo)

Extrae al registry funciones del proyecto interno footprint_aurgi:
- core (6): slugify_ascii, normalize_for_join, cp_provincia_es, infer_provincia_from_cp, safe_read_csv_fallback, csv_to_parquet_duckdb
- geo puras (7): haversine_km, point_in_ring, point_in_polygon, point_in_polygons_bbox, polygon_bbox, extent_with_padding, distance_bucket
- geo I/O (4): load_geojson_polygons, load_boundary_gdf, add_basemap_osm, add_basemap_with_timeout
- valhalla client (4): valhalla_route, valhalla_isochrone, valhalla_isochrones_async, valhalla_matrix_1_to_n
- datascience stats (7): trimmed_mean, geometric_mean, detect_distribution_type, best_central_tendency, summary_stats, kde_density_levels, alpha_shape_concave_hull
- datascience fuzzy (3): fuzzy_merge_adaptive (rapidfuzz), words_to_dataset, remove_words_from_column
- datascience viz (2): plot_kde_2d, plot_heatmap_log
- infra (4): compress_pdf_ghostscript, render_table_page_pdfpages, add_header_logo, osm2pgsql_ingest
- pipelines (4): setup_geo_stack_docker, compute_centers_reachability, generate_isochrones_by_zone, count_points_per_zone
- types geo (4): LonLat, BBox, IsochroneRequest, Centro

Incluye:
- apps/footprint_geo_stack/ (PostGIS + Martin + Valhalla via docker-compose)
- 131/132 tests pasan (1 skip esperado: osm2pgsql en PATH)
- Issue tracker dev/issues/0052-footprint-aurgi-extraction.md
- Atribucion uniforme: source_repo internal:footprint_aurgi, source_license internal-aurgi
- Build con 9 agentes en paralelo (8 wave 1 + 1 wave 2 pipelines)

Tambien commitea trabajo previo no commiteado: aggregate_extraction_results, chunk_with_overlap, clean_pdf_text, merge_entity_aliases, extract_graph_gliner2, extract_relations_mrebel, extract_triples_spacy_es, gliner2/mrebel/marianmt/rebel/spacy_es load_model, parse_rebel_output, translate_es_to_en, issue 0050/0051.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
2026-05-04 23:35:22 +02:00
parent f73ea072bd
commit faac610745
193 changed files with 13146 additions and 3 deletions
+103
View File
@@ -30,6 +30,109 @@ Para contexto detallado del trabajo diario ver `docs/diary/`. Para decisiones ar
- ADR `0003-orphan-tu-as-separate-function-entry.md` — cuando una funcion del registry necesita partir su `.cpp` en varios TUs por testabilidad o separacion ImGui-vs-puro, cada TU adicional se registra como entrada propia con su `.md` en lugar de extender `file_path` para listar varios archivos. El parent declara la nueva entrada en `uses_functions`. Razon: el indexer asume `1 .cpp = 1 .md`; un `file_path` multi-archivo rompe la convencion y deja apps nuevas sin saber que TUs enlazar.
### Added — sesion NER+RE para graph_explorer (tarde, 980 → 990 funciones)
**18 funciones nuevas** sobre el ecosistema NER+RE, en dos rondas de `fn-constructor`:
Ronda 1 — extraccion de relaciones (mREBEL/REBEL/MarianMT):
- `python/functions/datascience/parse_rebel_output.py` (pure) — parser wire `<triplet>` REBEL/mREBEL.
- `python/functions/datascience/align_relations_to_entities.py` (pure) — string-match aligner.
- `python/functions/datascience/mrebel_load_model.py` (impure, **CC BY-NC-SA 4.0 — NO comercial**).
- `python/functions/datascience/mrebel_base_load_model.py` (impure, misma licencia).
- `python/functions/datascience/rebel_load_model.py` (impure, **Apache 2.0**, EN-only).
- `python/functions/datascience/marianmt_es_en_load_model.py` (impure) — Helsinki-NLP/opus-mt-es-en.
- `python/functions/datascience/translate_es_to_en.py` (impure) — wrapper traduccion frase a frase.
- `python/functions/datascience/extract_relations_mrebel.py` (impure) — pipeline mREBEL frase-a-frase + alineamiento.
- 21 tests pytest verdes.
Ronda 2 — pipeline GLiNER2 + OpenIE schema-less + composicion (tarde):
- `python/functions/core/clean_pdf_text.py` (pure) — limpia artefactos PyPDF2.
- `python/functions/core/chunk_with_overlap.py` (pure) — sliding window con avance forzado.
- `python/functions/core/merge_entity_aliases.py` (pure) — coreferencia normalize+substring.
- `python/functions/core/filter_relations_by_entity_types.py` (pure) — post-filter typed.
- `python/functions/core/aggregate_extraction_results.py` (pure) — dedupe + Counter sobre N chunks.
- `python/functions/datascience/gliner2_load_model.py` (impure, **Apache 2.0**) — `fastino/gliner2-large-v1`.
- `python/functions/datascience/extract_graph_gliner2.py` (impure) — wrapper schema + threshold + include_confidence.
- `python/functions/datascience/spacy_es_load_model.py` (impure) — `es_core_news_md` cacheado.
- `python/functions/datascience/extract_triples_spacy_es.py` (impure) — OpenIE schema-less ES por reglas de dependencia (verbo del texto = predicado).
- `python/functions/pipelines/extract_graph_from_text.py` (impure pipeline) — composicion E2E: chunk → extract_graph_gliner2 (×N) → aggregate → filter typed → merge aliases → grafo final.
- 39 tests pytest verdes.
### Added — analysis `gliner_glirel_tuning`
`projects/osint_graph/analysis/gliner_glirel_tuning/` — investigacion empirica de modelos NER/RE. **9 notebooks** ejecutados:
| # | Notebook | Hallazgo clave |
|---|---|---|
| 01 | `01_gliner_glirel_tuning.ipynb` | Calibracion de thresholds GLiNER+GLiREL |
| 02 | `02_e2e_spanish_graph.ipynb` | E2E texto ES — descubrimiento del fail de GLiREL en castellano |
| 03 | `03_mrebel_vs_glirel.ipynb` | mREBEL gana a GLiREL pero CC BY-NC-SA |
| 04 | `04_gliner2_winner.ipynb` ⭐ | **GLiNER2 (Apache 2.0, NER+RE joint, 340M)** elegido como motor principal |
| 05 | `05_long_text_and_pdf.ipynb` | Pipeline PDF E2E sobre `politica_proteccion_datos.pdf` (BBVA, 89.882 chars) |
| 06 | `06_improvements.ipynb` | Threshold 0.3 (vs default 0.5) → +187% relaciones; coref reduce 18% aislados |
| 07 | `07_nuextract_vs_gliner2.ipynb` | NuExtract GPU 2.6× mas lento, calidad similar — descartado por defecto |
| 08 | `08_improving_gliner2.ipynb` | snake_case verbal labels + post-filter typed = mejor combo |
| 09 | `09_spacy_es_openie.ipynb` | spaCy ES dep-rules: schema-less, predicado = verbo del texto |
### Added — vault `osint_nlp_models`
`projects/osint_graph/vaults/osint_nlp_models` (symlink a `~/vaults/osint_nlp_models/`):
- `models/` — fichas de gliner, glirel, mrebel, gliner2, candidates a probar.
- `decisions/` — 3 ADRs cortos del 2026-05-04 (mrebel-over-glirel mañana, gliner2-over-mrebel tarde, license-constraint).
- `benchmarks/corpus_v1.md` + `results_log.csv` (15 filas de experimentos).
- `test_documents/politica_proteccion_datos.pdf` (PDF de BBVA copiado para reproducibilidad).
### Added — playground HTML
`projects/osint_graph/analysis/gliner_glirel_tuning/playground/`:
- `server.py` — FastAPI con GLiNER2 cacheado, endpoints `GET /` (HTML) y `POST /extract` (texto → grafo).
- `index.html` — UI: textarea, KPIs (nodos/aristas/tiempo), grafo Sigma.js, JSON exportable.
- `static/sigma.min.js` + `graphology.umd.min.js` (servidos localmente para evitar bloqueo CDN por extensiones tipo MetaMask/SES).
Stack aplicado por el server:
1. snake_case verbal labels (`works_at`, `ceo_of`, `headquartered_in`, `agreement_with`...)
2. threshold 0.3 (configurable)
3. chunking automatico > 1500 chars
4. post-filter typed (`(person, organization)` validos por relacion)
5. coreferencia normalize+substring
6. layout server-side via `networkx.spring_layout`
7. render Sigma.js (sin fisica → sin loops de ResizeObserver)
### Added — issues
- `dev/issues/0050-jupyter-exec-collab-client-failure.md` — bug `jupyter_exec` con cliente colaborativo + workaround documentado.
- `projects/osint_graph/apps/graph_explorer/issues/0041-split-confidence-thresholds.md` — split `confidence_threshold` en `entity_threshold` + `relation_threshold`.
- `projects/osint_graph/apps/graph_explorer/issues/0042-gliner2-unified-extractor.md` ⭐ — sustituir GLiREL por GLiNER2 en `extract_graph_hybrid`. Reemplaza 0042-mrebel.
- `projects/osint_graph/apps/graph_explorer/issues/0042-mrebel-relation-extractor.md.superseded` — version mREBEL del 0042 archivada al ganar GLiNER2.
### Changed
- `cpp/CMakeLists.txt``_GE_DIR` y `_DASH_DIR` sobreescribibles via `-D<...>=<path>` para builds en worktrees (commit `e72d6364`). Habilita `parallel-fix-issues` sobre apps C++.
- `python/functions/datascience/glirel_load_model.py` — workaround compat `huggingface_hub` 1.x: classmethod monkey-patch idempotente para inyectar `proxies`/`resume_download` que el HF nuevo dejo de pasar (commit `3b3378cf`).
- Sub-repo `dataforge/graph_explorer` master local: merges `--no-ff` de `issue/0035e-polish-and-tests` (commit `f614a51`) + `issue/0013-paste-extract-panel` (commit `2a49c2b`). 125/125 tests pytest verdes. **Sin push aun** — pendiente confirmacion + validacion Windows.
### Fixed (bugs encontrados + raiz + fix)
| Bug | Raiz | Fix |
|---|---|---|
| `chunk_with_overlap` bucle infinito | Frase mas larga que `max_chars`, no avanzaba `i`, OOM-killed por overlap acumulado | Avance forzado: meter al menos UNA frase aunque exceda `max_chars` |
| NuExtract degenera en texto largo | Sin `repetition_penalty`, decoder entra en bucle de tokens repetidos hasta agotar 2048 max_new_tokens | `repetition_penalty=1.15` + chunking obligatorio (179/179 chunks parsed OK tras fix) |
| NuExtract `AutoProcessor.from_pretrained` rota en transformers 5.x | Sub-processor de video tira `TypeError: argument of type 'NoneType' is not iterable` (Qwen2-VL) | Bypass: `AutoTokenizer` + `AutoModelForImageTextToText` directamente |
| Vis-network ResizeObserver loop spam (en SES/MetaMask) | Vis-network usa physics simulation → ResizeObserver dispara warnings amplificados por SES | Migrar a Sigma.js + layout server-side via `networkx.spring_layout` (sin fisica frontend) |
| `jupyter_exec append` HTTP 405 | `jupyter_nbmodel_client` espera collab WebSocket Y.js, no soportado al 100% por jupyter-collaboration nuevo | Documentado en issue 0050; workaround actual: build_notebook scripts con `nbformat` + `nbconvert --execute` |
| Kernel startup shadows pip packages | `00_fn_registry.py` añade cada subdir de `python/functions/` a sys.path top-level → `bigquery/datasets.py` shadows HF `datasets` package needed by transformers | Workaround per-notebook: `sys.path = [p for p in sys.path if not p.startswith(_pf+'/')]` + añadir solo el padre. Issue futuro pendiente. |
### Decisions — vault ADRs
| Decision | Razon |
|---|---|
| **GLiNER2 (Apache 2.0)** sustituye a GLiREL en `extract_graph_hybrid` | 6/8 relaciones correctas vs 0/1 de GLiREL en es_corporate_short, 1.18s vs 22s de mREBEL, NER+RE en una pasada |
| mREBEL queda como fallback (no comercial) | 4/5 correctas pero CC BY-NC-SA 4.0 + 25× mas lento |
| spaCy ES dep-rules para OpenIE schema-less | Predicado = verbo del texto (`querer`, `abrazar`), 5ms/frase, sin alucinaciones |
| Threshold `0.3` (vs default `0.5`) sweet spot | +187% relaciones manteniendo precision; 0.2 mete +22% entidades dudosas |
| Coreferencia normalize+substring + post-filter typed = **gratis y decisivos** | Coref 18% aislados; post-filter elimina `Madrid president_of Persona` |
| Translate ES→EN + triplet-extract EN **NO** vale la pena | Pierdes verbos del texto (`querer``loves`), +500ms-1s, +300MB MarianMT, riesgo nombres propios |
## 2026-04-28
### Added