feat: extraccion masiva footprint_aurgi (41 funcs + 4 types + stack Docker geo)
Extrae al registry funciones del proyecto interno footprint_aurgi: - core (6): slugify_ascii, normalize_for_join, cp_provincia_es, infer_provincia_from_cp, safe_read_csv_fallback, csv_to_parquet_duckdb - geo puras (7): haversine_km, point_in_ring, point_in_polygon, point_in_polygons_bbox, polygon_bbox, extent_with_padding, distance_bucket - geo I/O (4): load_geojson_polygons, load_boundary_gdf, add_basemap_osm, add_basemap_with_timeout - valhalla client (4): valhalla_route, valhalla_isochrone, valhalla_isochrones_async, valhalla_matrix_1_to_n - datascience stats (7): trimmed_mean, geometric_mean, detect_distribution_type, best_central_tendency, summary_stats, kde_density_levels, alpha_shape_concave_hull - datascience fuzzy (3): fuzzy_merge_adaptive (rapidfuzz), words_to_dataset, remove_words_from_column - datascience viz (2): plot_kde_2d, plot_heatmap_log - infra (4): compress_pdf_ghostscript, render_table_page_pdfpages, add_header_logo, osm2pgsql_ingest - pipelines (4): setup_geo_stack_docker, compute_centers_reachability, generate_isochrones_by_zone, count_points_per_zone - types geo (4): LonLat, BBox, IsochroneRequest, Centro Incluye: - apps/footprint_geo_stack/ (PostGIS + Martin + Valhalla via docker-compose) - 131/132 tests pasan (1 skip esperado: osm2pgsql en PATH) - Issue tracker dev/issues/0052-footprint-aurgi-extraction.md - Atribucion uniforme: source_repo internal:footprint_aurgi, source_license internal-aurgi - Build con 9 agentes en paralelo (8 wave 1 + 1 wave 2 pipelines) Tambien commitea trabajo previo no commiteado: aggregate_extraction_results, chunk_with_overlap, clean_pdf_text, merge_entity_aliases, extract_graph_gliner2, extract_relations_mrebel, extract_triples_spacy_es, gliner2/mrebel/marianmt/rebel/spacy_es load_model, parse_rebel_output, translate_es_to_en, issue 0050/0051. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
+103
@@ -30,6 +30,109 @@ Para contexto detallado del trabajo diario ver `docs/diary/`. Para decisiones ar
|
||||
|
||||
- ADR `0003-orphan-tu-as-separate-function-entry.md` — cuando una funcion del registry necesita partir su `.cpp` en varios TUs por testabilidad o separacion ImGui-vs-puro, cada TU adicional se registra como entrada propia con su `.md` en lugar de extender `file_path` para listar varios archivos. El parent declara la nueva entrada en `uses_functions`. Razon: el indexer asume `1 .cpp = 1 .md`; un `file_path` multi-archivo rompe la convencion y deja apps nuevas sin saber que TUs enlazar.
|
||||
|
||||
### Added — sesion NER+RE para graph_explorer (tarde, 980 → 990 funciones)
|
||||
|
||||
**18 funciones nuevas** sobre el ecosistema NER+RE, en dos rondas de `fn-constructor`:
|
||||
|
||||
Ronda 1 — extraccion de relaciones (mREBEL/REBEL/MarianMT):
|
||||
- `python/functions/datascience/parse_rebel_output.py` (pure) — parser wire `<triplet>` REBEL/mREBEL.
|
||||
- `python/functions/datascience/align_relations_to_entities.py` (pure) — string-match aligner.
|
||||
- `python/functions/datascience/mrebel_load_model.py` (impure, **CC BY-NC-SA 4.0 — NO comercial**).
|
||||
- `python/functions/datascience/mrebel_base_load_model.py` (impure, misma licencia).
|
||||
- `python/functions/datascience/rebel_load_model.py` (impure, **Apache 2.0**, EN-only).
|
||||
- `python/functions/datascience/marianmt_es_en_load_model.py` (impure) — Helsinki-NLP/opus-mt-es-en.
|
||||
- `python/functions/datascience/translate_es_to_en.py` (impure) — wrapper traduccion frase a frase.
|
||||
- `python/functions/datascience/extract_relations_mrebel.py` (impure) — pipeline mREBEL frase-a-frase + alineamiento.
|
||||
- 21 tests pytest verdes.
|
||||
|
||||
Ronda 2 — pipeline GLiNER2 + OpenIE schema-less + composicion (tarde):
|
||||
- `python/functions/core/clean_pdf_text.py` (pure) — limpia artefactos PyPDF2.
|
||||
- `python/functions/core/chunk_with_overlap.py` (pure) — sliding window con avance forzado.
|
||||
- `python/functions/core/merge_entity_aliases.py` (pure) — coreferencia normalize+substring.
|
||||
- `python/functions/core/filter_relations_by_entity_types.py` (pure) — post-filter typed.
|
||||
- `python/functions/core/aggregate_extraction_results.py` (pure) — dedupe + Counter sobre N chunks.
|
||||
- `python/functions/datascience/gliner2_load_model.py` (impure, **Apache 2.0**) — `fastino/gliner2-large-v1`.
|
||||
- `python/functions/datascience/extract_graph_gliner2.py` (impure) — wrapper schema + threshold + include_confidence.
|
||||
- `python/functions/datascience/spacy_es_load_model.py` (impure) — `es_core_news_md` cacheado.
|
||||
- `python/functions/datascience/extract_triples_spacy_es.py` (impure) — OpenIE schema-less ES por reglas de dependencia (verbo del texto = predicado).
|
||||
- `python/functions/pipelines/extract_graph_from_text.py` (impure pipeline) — composicion E2E: chunk → extract_graph_gliner2 (×N) → aggregate → filter typed → merge aliases → grafo final.
|
||||
- 39 tests pytest verdes.
|
||||
|
||||
### Added — analysis `gliner_glirel_tuning`
|
||||
|
||||
`projects/osint_graph/analysis/gliner_glirel_tuning/` — investigacion empirica de modelos NER/RE. **9 notebooks** ejecutados:
|
||||
|
||||
| # | Notebook | Hallazgo clave |
|
||||
|---|---|---|
|
||||
| 01 | `01_gliner_glirel_tuning.ipynb` | Calibracion de thresholds GLiNER+GLiREL |
|
||||
| 02 | `02_e2e_spanish_graph.ipynb` | E2E texto ES — descubrimiento del fail de GLiREL en castellano |
|
||||
| 03 | `03_mrebel_vs_glirel.ipynb` | mREBEL gana a GLiREL pero CC BY-NC-SA |
|
||||
| 04 | `04_gliner2_winner.ipynb` ⭐ | **GLiNER2 (Apache 2.0, NER+RE joint, 340M)** elegido como motor principal |
|
||||
| 05 | `05_long_text_and_pdf.ipynb` | Pipeline PDF E2E sobre `politica_proteccion_datos.pdf` (BBVA, 89.882 chars) |
|
||||
| 06 | `06_improvements.ipynb` | Threshold 0.3 (vs default 0.5) → +187% relaciones; coref reduce 18% aislados |
|
||||
| 07 | `07_nuextract_vs_gliner2.ipynb` | NuExtract GPU 2.6× mas lento, calidad similar — descartado por defecto |
|
||||
| 08 | `08_improving_gliner2.ipynb` | snake_case verbal labels + post-filter typed = mejor combo |
|
||||
| 09 | `09_spacy_es_openie.ipynb` | spaCy ES dep-rules: schema-less, predicado = verbo del texto |
|
||||
|
||||
### Added — vault `osint_nlp_models`
|
||||
|
||||
`projects/osint_graph/vaults/osint_nlp_models` (symlink a `~/vaults/osint_nlp_models/`):
|
||||
- `models/` — fichas de gliner, glirel, mrebel, gliner2, candidates a probar.
|
||||
- `decisions/` — 3 ADRs cortos del 2026-05-04 (mrebel-over-glirel mañana, gliner2-over-mrebel tarde, license-constraint).
|
||||
- `benchmarks/corpus_v1.md` + `results_log.csv` (15 filas de experimentos).
|
||||
- `test_documents/politica_proteccion_datos.pdf` (PDF de BBVA copiado para reproducibilidad).
|
||||
|
||||
### Added — playground HTML
|
||||
|
||||
`projects/osint_graph/analysis/gliner_glirel_tuning/playground/`:
|
||||
- `server.py` — FastAPI con GLiNER2 cacheado, endpoints `GET /` (HTML) y `POST /extract` (texto → grafo).
|
||||
- `index.html` — UI: textarea, KPIs (nodos/aristas/tiempo), grafo Sigma.js, JSON exportable.
|
||||
- `static/sigma.min.js` + `graphology.umd.min.js` (servidos localmente para evitar bloqueo CDN por extensiones tipo MetaMask/SES).
|
||||
|
||||
Stack aplicado por el server:
|
||||
1. snake_case verbal labels (`works_at`, `ceo_of`, `headquartered_in`, `agreement_with`...)
|
||||
2. threshold 0.3 (configurable)
|
||||
3. chunking automatico > 1500 chars
|
||||
4. post-filter typed (`(person, organization)` validos por relacion)
|
||||
5. coreferencia normalize+substring
|
||||
6. layout server-side via `networkx.spring_layout`
|
||||
7. render Sigma.js (sin fisica → sin loops de ResizeObserver)
|
||||
|
||||
### Added — issues
|
||||
|
||||
- `dev/issues/0050-jupyter-exec-collab-client-failure.md` — bug `jupyter_exec` con cliente colaborativo + workaround documentado.
|
||||
- `projects/osint_graph/apps/graph_explorer/issues/0041-split-confidence-thresholds.md` — split `confidence_threshold` en `entity_threshold` + `relation_threshold`.
|
||||
- `projects/osint_graph/apps/graph_explorer/issues/0042-gliner2-unified-extractor.md` ⭐ — sustituir GLiREL por GLiNER2 en `extract_graph_hybrid`. Reemplaza 0042-mrebel.
|
||||
- `projects/osint_graph/apps/graph_explorer/issues/0042-mrebel-relation-extractor.md.superseded` — version mREBEL del 0042 archivada al ganar GLiNER2.
|
||||
|
||||
### Changed
|
||||
|
||||
- `cpp/CMakeLists.txt` — `_GE_DIR` y `_DASH_DIR` sobreescribibles via `-D<...>=<path>` para builds en worktrees (commit `e72d6364`). Habilita `parallel-fix-issues` sobre apps C++.
|
||||
- `python/functions/datascience/glirel_load_model.py` — workaround compat `huggingface_hub` 1.x: classmethod monkey-patch idempotente para inyectar `proxies`/`resume_download` que el HF nuevo dejo de pasar (commit `3b3378cf`).
|
||||
- Sub-repo `dataforge/graph_explorer` master local: merges `--no-ff` de `issue/0035e-polish-and-tests` (commit `f614a51`) + `issue/0013-paste-extract-panel` (commit `2a49c2b`). 125/125 tests pytest verdes. **Sin push aun** — pendiente confirmacion + validacion Windows.
|
||||
|
||||
### Fixed (bugs encontrados + raiz + fix)
|
||||
|
||||
| Bug | Raiz | Fix |
|
||||
|---|---|---|
|
||||
| `chunk_with_overlap` bucle infinito | Frase mas larga que `max_chars`, no avanzaba `i`, OOM-killed por overlap acumulado | Avance forzado: meter al menos UNA frase aunque exceda `max_chars` |
|
||||
| NuExtract degenera en texto largo | Sin `repetition_penalty`, decoder entra en bucle de tokens repetidos hasta agotar 2048 max_new_tokens | `repetition_penalty=1.15` + chunking obligatorio (179/179 chunks parsed OK tras fix) |
|
||||
| NuExtract `AutoProcessor.from_pretrained` rota en transformers 5.x | Sub-processor de video tira `TypeError: argument of type 'NoneType' is not iterable` (Qwen2-VL) | Bypass: `AutoTokenizer` + `AutoModelForImageTextToText` directamente |
|
||||
| Vis-network ResizeObserver loop spam (en SES/MetaMask) | Vis-network usa physics simulation → ResizeObserver dispara warnings amplificados por SES | Migrar a Sigma.js + layout server-side via `networkx.spring_layout` (sin fisica frontend) |
|
||||
| `jupyter_exec append` HTTP 405 | `jupyter_nbmodel_client` espera collab WebSocket Y.js, no soportado al 100% por jupyter-collaboration nuevo | Documentado en issue 0050; workaround actual: build_notebook scripts con `nbformat` + `nbconvert --execute` |
|
||||
| Kernel startup shadows pip packages | `00_fn_registry.py` añade cada subdir de `python/functions/` a sys.path top-level → `bigquery/datasets.py` shadows HF `datasets` package needed by transformers | Workaround per-notebook: `sys.path = [p for p in sys.path if not p.startswith(_pf+'/')]` + añadir solo el padre. Issue futuro pendiente. |
|
||||
|
||||
### Decisions — vault ADRs
|
||||
|
||||
| Decision | Razon |
|
||||
|---|---|
|
||||
| **GLiNER2 (Apache 2.0)** sustituye a GLiREL en `extract_graph_hybrid` | 6/8 relaciones correctas vs 0/1 de GLiREL en es_corporate_short, 1.18s vs 22s de mREBEL, NER+RE en una pasada |
|
||||
| mREBEL queda como fallback (no comercial) | 4/5 correctas pero CC BY-NC-SA 4.0 + 25× mas lento |
|
||||
| spaCy ES dep-rules para OpenIE schema-less | Predicado = verbo del texto (`querer`, `abrazar`), 5ms/frase, sin alucinaciones |
|
||||
| Threshold `0.3` (vs default `0.5`) sweet spot | +187% relaciones manteniendo precision; 0.2 mete +22% entidades dudosas |
|
||||
| Coreferencia normalize+substring + post-filter typed = **gratis y decisivos** | Coref −18% aislados; post-filter elimina `Madrid president_of Persona` |
|
||||
| Translate ES→EN + triplet-extract EN **NO** vale la pena | Pierdes verbos del texto (`querer` → `loves`), +500ms-1s, +300MB MarianMT, riesgo nombres propios |
|
||||
|
||||
## 2026-04-28
|
||||
|
||||
### Added
|
||||
|
||||
Reference in New Issue
Block a user