feat: bulk extraction of footprint_aurgi (41 funcs + 4 types + Docker geo stack)

Extracts functions from the internal project footprint_aurgi into the registry:

- core (6): slugify_ascii, normalize_for_join, cp_provincia_es, infer_provincia_from_cp, safe_read_csv_fallback, csv_to_parquet_duckdb
- geo pure (7): haversine_km, point_in_ring, point_in_polygon, point_in_polygons_bbox, polygon_bbox, extent_with_padding, distance_bucket
- geo I/O (4): load_geojson_polygons, load_boundary_gdf, add_basemap_osm, add_basemap_with_timeout
- valhalla client (4): valhalla_route, valhalla_isochrone, valhalla_isochrones_async, valhalla_matrix_1_to_n
- datascience stats (7): trimmed_mean, geometric_mean, detect_distribution_type, best_central_tendency, summary_stats, kde_density_levels, alpha_shape_concave_hull
- datascience fuzzy (3): fuzzy_merge_adaptive (rapidfuzz), words_to_dataset, remove_words_from_column
- datascience viz (2): plot_kde_2d, plot_heatmap_log
- infra (4): compress_pdf_ghostscript, render_table_page_pdfpages, add_header_logo, osm2pgsql_ingest
- pipelines (4): setup_geo_stack_docker, compute_centers_reachability, generate_isochrones_by_zone, count_points_per_zone
- types geo (4): LonLat, BBox, IsochroneRequest, Centro

Includes:

- apps/footprint_geo_stack/ (PostGIS + Martin + Valhalla via docker-compose)
- 131/132 tests pass (1 expected skip: needs osm2pgsql on PATH)
- Issue tracker dev/issues/0052-footprint-aurgi-extraction.md
- Uniform attribution: source_repo internal:footprint_aurgi, source_license internal-aurgi
- Built with 9 agents in parallel (8 in wave 1 + 1 in wave 2 for pipelines)

Also commits previously uncommitted work: aggregate_extraction_results, chunk_with_overlap, clean_pdf_text, merge_entity_aliases, extract_graph_gliner2, extract_relations_mrebel, extract_triples_spacy_es, gliner2/mrebel/marianmt/rebel/spacy_es load_model, parse_rebel_output, translate_es_to_en, issue 0050/0051.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 23:35:22 +02:00
parent f73ea072bd
commit faac610745
193 changed files with 13146 additions and 3 deletions
@@ -0,0 +1,166 @@
+# 0050: `jupyter_exec` fails because of the collaborative client (workaround documented)
+
+## APP Metadata
+
+| Field | Value |
+|-------|-------|
+| **ID** | 0050 |
+| **Status** | pending |
+| **Priority** | medium |
+| **Type** | bug: `python/functions/notebook/jupyter_exec.py` |
+
+## Dependencies
+
+None. Independent of the rest.
+
+---
+
+## Symptom
+
+Running `jupyter_exec.py append <notebook> <code>` against a Jupyter Lab
+started with the standard analysis launcher (`run-jupyter-lab.sh`,
+flag `--collaborative`) fails with:
+
+```
+{"error": "HTTP Error 405: Method Not Allowed"}
+```
+
+`jupyter_write.py append-code` and `append-markdown` DO work (they do not use
+the collaborative channel). The bug only affects `jupyter_exec`, which has to
+execute the cell in the kernel and does so through `jupyter_nbmodel_client`
+over a Y.js websocket.
+
+Reproduced on `2026-05-04` while building the analysis
+`projects/osint_graph/analysis/gliner_glirel_tuning/`. The write functions
+of the `notebook/` module are unaffected; the other `jupyter_exec`
+subcommands were not tested:
+
+```bash
+$JX append <nb> <code>            # ❌ HTTP 405
+$JW append-code <nb> <code>       # ✅ OK (no execution)
+$JW append-markdown <nb> <md>     # ✅ OK
+$JX cell <nb> <idx>               # 🔁 Not tested, but uses the same client
+$JX kernel <code>                 # 🔁 Not tested
+```
+
+---
+
+## Diagnosis (partial)
+
+`jupyter_nbmodel_client` expects the server to have the
+`jupyter_collaboration` extension active and mounted at `/api/collaboration/...`. The
+launcher starts Jupyter with the CLI flag `--collaborative`, which in
+recent versions (`jupyter_server >= 2.x`, `jupyter-collaboration >= 4.x`)
+**is no longer sufficient**: the extension is loaded via entry point and
+controlled with different flags (`--YDocExtension.disable_rtc` or equivalent),
+or requires an explicit config file.
+
+The output of `jupyter_discover.py` confirms the symptom indirectly:
+
+```json
+{ "url": "http://localhost:8888", "collaborative": false, ... }
+```
+
+even though `--collaborative` is in the launch command. In other words, the
+server starts and exposes the REST API, but the collaborative layer is NOT active.
+
+---
+
+## Workaround used in `gliner_glirel_tuning`
+
+Change of tactics: instead of building the notebook cell by cell with
+`jupyter_exec append`, **the experiments run in an external script** and
+the cells (code plus already generated outputs) are embedded directly
+into the file with `nbformat`. The resulting notebook is persistent and
+does not need the collaborative channel.
+
+```python
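+# build_notebook.py
+import nbformat as nbf
+
+# (source, captured stdout) pairs produced by the external script; placeholder data
+cells = [("print('hola')", "hola\n")]
+
+nb = nbf.v4.new_notebook()
+for src, stdout in cells:
+    cell = nbf.v4.new_code_cell(src)
+    # Attach the pre-computed stdout as a static stream output
+    cell.outputs = [nbf.v4.new_output("stream", name="stdout", text=stdout)]
+    nb.cells.append(cell)
+nbf.write(nb, "notebooks/01_foo.ipynb")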
+```
+
+If real outputs are needed (DataFrames as HTML, matplotlib figures),
+execute the notebook afterwards with `nbconvert`:
+
+```bash
+IPYTHONDIR=$(pwd)/.ipython ./.venv/bin/jupyter nbconvert \
+  --to notebook --execute notebooks/01_foo.ipynb \
+  --output 01_foo.ipynb --ExecutePreprocessor.timeout=300
+```
+
+This bypasses the collaborative channel entirely and produces a working `.ipynb`
+that can be opened in Jupyter Lab to view, iterate on, or re-run.
+
+See `projects/osint_graph/analysis/gliner_glirel_tuning/build_notebook.py`
+and `build_notebook_e2e.py` for live examples.
+
+---
+
+## Root causes to investigate
+
+1. **Check the version of `jupyter-collaboration`** in the analysis
+   venv. If it is >=4.x, the `--collaborative` flag no longer applies and the
+   launcher (`write_jupyter_launcher_bash_io`) has to be updated.
+2. **The client** `jupyter_nbmodel_client` may have its own
+   window of supported versions: check the pinning in
+   `python/.venv` and in the analysis venvs.
+3. **The endpoint** `/api/collaboration/document` should answer a
+   `GET` with HTTP 200 when the extension is active. If it answers
+   `405`, the client is attempting an operation (POST/PUT) on an endpoint
+   that only accepts GET, which points to a version mismatch.
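
The check in item 3 can be sketched with the standard library alone. This is a minimal sketch, assuming the server listens on `http://localhost:8888` (as in the `jupyter_discover.py` output) and that no auth token is needed, which a real server will usually require. `probe_status` and `interpret_status` are hypothetical helper names, and the reading given to each status code is an assumption, not documented behaviour of `jupyter-collaboration`.

```python
import urllib.error
import urllib.request


def probe_status(url: str, method: str = "GET", timeout: float = 5.0) -> int:
    """Return the HTTP status code the server gives for `method` on `url`.

    Connection problems (server down, wrong port) raise urllib.error.URLError.
    """
    request = urllib.request.Request(url, method=method)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as exc:
        # 4xx/5xx responses arrive as exceptions; the status code is what we want.
        return exc.code


def interpret_status(status: int) -> str:
    """Map a status code to a rough diagnosis (assumed readings, not an official table)."""
    if status == 200:
        return "endpoint answers: collaboration layer looks active"
    if status == 404:
        return "endpoint missing: extension probably not mounted"
    if status == 405:
        return "route exists but rejects this method: client/server mismatch"
    return f"unexpected status {status}: inspect the server log"
```

With a server running, `interpret_status(probe_status("http://localhost:8888/api/collaboration/document"))` gives a first hint; repeating the probe with `method="PUT"` or `method="POST"` would show which verb triggers the 405.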
+
+---
+
+## Tasks
+
+1. Reproduce the `HTTP 405` with a new notebook and a new kernel
+   in a freshly created analysis.
+2. Capture the exact URL and HTTP method that trigger the 405
+   (add logging to `jupyter_exec.py` around lines ~192/229, where it calls
+   `get_jupyter_notebook_websocket_url`).
+3. Check the version of `jupyter-collaboration` in the venv and compare it
+   with the compatibility matrix of `jupyter_nbmodel_client`.
+4. One of two options:
+   - **(a)** Fix the flag/config in `write_jupyter_launcher_bash_io`
+     so that collaboration is enabled correctly in newer versions.
+   - **(b)** If the collaborative API has changed a lot, **migrate
+     `jupyter_exec.py` to the classic `JupyterClient`** (REST + WebSocket
+     straight to the kernel, without Y.js), which is stable across versions.
+     `jupyter_kernel.py` already does something similar and works.
+5. Add a basic e2e test in `tests/` that starts Jupyter, runs
+   `jupyter_exec append`, and verifies that the cell executed and that its
+   stdout was captured. Without this the bug can regress.
+
+---
+
+## Out of scope
+
+- Rewriting the whole notebook collaboration system.
+- Migrating to an MCP. The rule `notebook_collaboration.md` is explicit:
+  these functions replace the Jupyter MCP.
+
+---
+
+## Risks
+
+- If the cause is the version matrix, option (a) may cause recurring
+  friction every time `jupyter-collaboration` ships a breaking
+  change. Option (b) is more robust in the long run, although it loses the
+  ability to watch changes in real time from the browser.
+
+## Operational notes
+
+While this bug exists, the recommended pattern for building notebooks
+from a Claude agent in an analysis is:
+
+1. `build_notebook.py` with `nbformat` for structure + static outputs.
+2. `nbconvert --execute` for real outputs (HTML, plots).
+3. If you need real-time interaction with the browser, open the generated
+   notebook in Jupyter Lab and re-run it by hand.
+
+The `gliner_glirel_tuning` analysis itself serves as the reference.
@@ -0,0 +1,313 @@
+# 0051: Pending functions of the extraction pipeline (NER+RE+OpenIE)
+
+## APP Metadata
+
+| Field | Value |
+|-------|-------|
+| **ID** | 0051 |
+| **Status** | pending |
+| **Priority** | medium |
+| **Type** | feature: `python/functions/{datascience,pipelines,core}/` |
+
+## Dependencies
+
+- The 18 NER/RE functions created on 2026-05-04 (gliner2_load_model, extract_graph_gliner2, extract_triples_spacy_es, etc.): the base is already built.
+- `extract_pdf_text_py_core`: already exists and is reused.
+
+**Unblocks:** full integration of the GLiNER2 + spaCy ES + chunking + coref + post-filter pipeline into the `paste_extract` panel of `graph_explorer` (issues 0041 and 0042 of the sub-repo).
+
+---
+
+## Context
+
+During the 2026-05-04 session, 18 NER+RE functions were built (see CHANGELOG.md and `vaults/osint_nlp_models/`). **5 gaps** remain that were not built in that round and should exist to close the loop:
+
+1. NuExtract loader + extractor: discarded for speed, but useful as an optional "Rich extraction" engine when a GPU is available.
+2. `extract_graph_from_pdf` pipeline: the composition `extract_pdf_text + clean_pdf_text + chunk_with_overlap + extract_graph_gliner2 + ...`.
+3. spaCy ES V2 rules: support the reflexive passive, copular sentences, and simple pronoun coreference.
+4. Fix for the kernel startup that shadows pip packages (`bigquery/datasets.py` breaks `import datasets` from HF).
+5. `extract_relations_rebel` (parallel to `extract_relations_mrebel`) for English text under an Apache license.
+
+Each gap is broken down below with enough of a template for a future `fn-constructor` to build it without opening the original conversation.
+
+---
+
+## A. NuExtract 2.0 loader + extractor
+
+### Rationale
+
+NuExtract 2.0-2B (`numind/NuExtract-2.0-2B`, **MIT license**) emits **structured JSON** by filling in a template. It is useful when the user wants a rich record per entity (e.g. for each company: `{name, ceo, headquarters, subsidiaries, founded_in}`). It is slower than GLiNER2 (310s vs 139s on the BBVA PDF) but has better recall of per-entity attributes.
+
+See `notebooks/07_nuextract_vs_gliner2.ipynb` and `vaults/osint_nlp_models/models/` (there is no NuExtract md there yet; add one).
+
+### Functions to create
+
+**A1. `nuextract_load_model_py_datascience` (impure)**
+
+```python
+"""LICENSE: MIT (NuExtract-2.0-2B). Commercial use OK.
+
+The 4B version is CC BY-NC-Qwen-Research (non-commercial).
+The 8B version is MIT.
+"""
+from typing import Any
+_MODEL_CACHE: dict = {}
+
+def nuextract_load_model(
+    model_name: str = "numind/NuExtract-2.0-2B",
+    device: str = "auto",
+) -> tuple[Any, Any]:
+    """Loads (and caches) NuExtract tokenizer + model.
+    
+    Returns (tokenizer, model).
+    Note: AutoProcessor is broken in transformers 5.x for Qwen2-VL — use AutoTokenizer + AutoModelForImageTextToText directly (no AutoProcessor).
+    
+    For GPU: bfloat16, attn_implementation='sdpa'.
+    For CPU: float32, attn_implementation='eager' (much slower, 10-30s/extraction).
+    """
+    import torch
+    from transformers import AutoTokenizer, AutoModelForImageTextToText
+    use_gpu = device == "cuda" or (device == "auto" and torch.cuda.is_available())
+    resolved = "cuda" if use_gpu else "cpu"
+    dtype = torch.bfloat16 if use_gpu else torch.float32
+    attn = "sdpa" if use_gpu else "eager"
+    key = (model_name, resolved)
+    if key in _MODEL_CACHE: return _MODEL_CACHE[key]
+    tok = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True, padding_side="left")
+    mdl = AutoModelForImageTextToText.from_pretrained(
+        model_name, trust_remote_code=True,
+        torch_dtype=dtype, attn_implementation=attn,
+    )
+    if use_gpu: mdl = mdl.to(resolved)
+    mdl.eval()
+    _MODEL_CACHE[key] = (tok, mdl)
+    return tok, mdl
+```
+
+**A2. `extract_structured_nuextract_py_datascience` (impure)**
+
+```python
+def extract_structured_nuextract(
+    text: str,
+    template: str,                 # JSON schema as string
+    tokenizer,
+    model,
+    max_new_tokens: int = 1024,
+    repetition_penalty: float = 1.15,   # CRITICAL: without this it degenerates into loops
+    num_beams: int = 1,
+) -> dict:
+    """Extract structured info from text using NuExtract 2.0 with a JSON template.
+    
+    Returns:
+        {"raw_text": str (the raw JSON emitted by the model),
+         "parsed": dict | None (parsed with find('{') + progressive truncation),
+         "elapsed_s": float,
+         "n_input_tokens": int,
+         "n_output_tokens": int}
+
+    IMPORTANT: NuExtract degenerates on long text if repetition_penalty < ~1.1.
+    Use repetition_penalty=1.15 (default) and split long text with chunk_with_overlap.
+    """
+```
+
+The output parser lives in `run_nuextract_full.py`, around line ~120 (find('{') + progressive truncation).
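
The code in `run_nuextract_full.py` is not reproduced in this issue, so the following is only a minimal sketch of what "find('{') + progressive truncation" could look like: start at the first `{`, then try ever shorter candidates ending at a `}` until `json.loads` accepts one. The name `parse_first_json_object` is hypothetical.

```python
import json
from typing import Optional


def parse_first_json_object(raw: str) -> Optional[dict]:
    """Recover a JSON object from model output that may carry extra text.

    Starts at the first '{' and tries candidates ending at each '}' from the
    last one backwards, returning the first candidate that parses as a dict.
    """
    start = raw.find("{")
    if start == -1:
        return None
    end = len(raw)
    while end > start:
        # Last '}' strictly before `end`; each failed attempt shortens the candidate.
        end = raw.rfind("}", start, end)
        if end == -1:
            return None
        try:
            parsed = json.loads(raw[start:end + 1])
        except json.JSONDecodeError:
            continue
        return parsed if isinstance(parsed, dict) else None
    return None
```

This handles a complete object followed by a truncated or repeated tail, which is the failure mode the docstring warns about. It does not repair an object that is itself cut off; such output yields `None`.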
+
+### Tests
+
+- A1: cache hit/miss.
+- A2: with a model stub, validate the JSON parser. With a real corpus only if a GPU is available (skip otherwise).
+
+---
+
+## B. `extract_graph_from_pdf_py_pipelines`
+
+### Rationale
+
+Natural composition: `extract_pdf_text` already exists in `python/functions/core/`. Combine it with everything new:
+
+```
+extract_pdf_text          (existing)
+   → clean_pdf_text       (NEW 2026-05-04)
+   → chunk_with_overlap   (NEW 2026-05-04)
+   → extract_graph_gliner2 (×N, NEW 2026-05-04)
+   → aggregate_extraction_results
+   → filter_relations_by_entity_types
+   → merge_entity_aliases
+   → final graph
+```
+
+### Signature
+
+```python
+def extract_graph_from_pdf(
+    pdf_path: str,
+    entity_labels: list[str],
+    relation_labels,
+    allowed: dict,
+    model,                              # GLiNER2 model
+    threshold: float = 0.3,
+    max_chars_per_chunk: int = 1500,
+    overlap_sentences: int = 2,
+) -> dict:
+    """End-to-end pipeline: PDF -> graph.
+    
+    Internally:
+      1. extract_pdf_text (existing)
+      2. clean_pdf_text
+      3. chunk_with_overlap if len(text) > max_chars_per_chunk
+      4. extract_graph_gliner2 per chunk
+      5. aggregate_extraction_results
+      6. filter_relations_by_entity_types
+      7. merge_entity_aliases
+    
+    Returns: same shape as extract_graph_from_text.
+    """
+```
+
+This is essentially `extract_graph_from_text(extract_pdf_text(path), ...)` with the cleanup step in between.
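
The control flow of steps 2 to 5 can be sketched with the stages passed in as callables, so it runs without the registry or a model. Everything here is an assumption for illustration: `run_text_stages` and the stand-in stages are hypothetical, the real registry functions have their own signatures, and the length check is assumed to apply to the cleaned text.

```python
from typing import Callable


def run_text_stages(
    text: str,
    clean: Callable[[str], str],
    chunk: Callable[[str], list],
    extract: Callable[[str], dict],
    aggregate: Callable[[list], dict],
    max_chars_per_chunk: int = 1500,
) -> dict:
    """Clean, optionally chunk, extract per chunk, then aggregate."""
    cleaned = clean(text)
    # Chunk only when the cleaned text exceeds the limit; short text is one chunk.
    if len(cleaned) > max_chars_per_chunk:
        chunks = chunk(cleaned)
    else:
        chunks = [cleaned]
    return aggregate([extract(piece) for piece in chunks])


# Stand-in stages (hypothetical) so the sketch is runnable on its own.
def fake_clean(text: str) -> str:
    return " ".join(text.split())


def fake_chunk(text: str) -> list:
    return [text[i:i + 10] for i in range(0, len(text), 10)]


def fake_extract(piece: str) -> dict:
    return {"entities": [piece], "relations": []}


def fake_aggregate(results: list) -> dict:
    return {
        "entities": [e for r in results for e in r["entities"]],
        "relations": [x for r in results for x in r["relations"]],
    }
```

Steps 6 and 7 (`filter_relations_by_entity_types`, `merge_entity_aliases`) would be applied to the aggregated result in the same way. Injecting the stages also makes the chunking-gate test cheap to write with stubs.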
+
+### Tests
+
+- Smoke test with a small fixture PDF (1-2 pages).
+- Test that the chunking fallback only triggers when `len(text) > max_chars_per_chunk`.
+
+### Where to put the fixture PDF
+
+`python/functions/pipelines/tests/fixtures/sample.pdf`: a short, freely usable PDF. Alternatively, reuse `vaults/osint_nlp_models/test_documents/politica_proteccion_datos.pdf` via an absolute path in the test (skip if it does not exist).
+
+---
+
+## C. spaCy ES V2: improved rules
+
+### Rationale
+
+Notebook 09 showed that the V1 rules (`extract_triples_spacy_es_py_datascience`) fail on:
+
+1. **Reflexive passive**: `Se firmaron acuerdos entre Iberdrola y Endesa.` → empty. It should emit `(Iberdrola, firmar[pass], Endesa)` or similar.
+2. **Copular sentences**: `Pablo Isla es expresidente de Inditex.` → empty. It should emit `(Pablo Isla, ser, expresidente de Inditex)`.
+3. **Pronoun coreference**: `Sara llamo a su madre Lucia.` → triple with the span `'su madre Lucia'`. It should resolve `su` to the preceding subject (Sara).
+4. **Lemmatization**: `movilizara` → `movilizarar` (incorrect lemma from the `es_core_news_md` model). Consider `es_core_news_lg` or a post-processing step.
+
+### Functions to create
+
+**C1. `extract_triples_spacy_es_v2_py_datascience` (impure)**
+
+Same pattern as V1, but with additional rules:
+
+```python
+def extract_triples_spacy_es_v2(text: str, nlp: Any, resolve_pronouns: bool = True) -> dict:
+    """Improved Spanish OpenIE via spaCy dependency parsing.
+    
+    V2 changes vs V1:
+    - Reflexive passive: detect 'se' + conjugated verb -> treat agent as subject if available
+    - Copular sentences: 'X es Y', 'X esta Y' -> emit (X, ser/estar, Y)
+    - Coref simple: track previous subject, resolve 'su X' to that subject
+    - Lemma override: hardcoded fixes for common errors (movilizarar -> movilizar)
+    
+    Returns: same shape as extract_triples_spacy_es V1.
+    """
+```
+
+### Tests
+
+- Reflexive passive test: `'Se firmaron acuerdos entre Iberdrola y Endesa'` -> triple with `firmar[pass]`.
+- Copular test: `'Pablo es presidente'` -> `(Pablo, ser, presidente)`.
+- Coref test: `'Sara llamo a su madre Lucia'` -> canonical subject Sara (not `'su madre Lucia'`).
+- Lemma override test: `movilizara` -> lemma `movilizar`.
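
Of the four V2 rules, the lemma override is the simplest to illustrate in isolation. This is a minimal sketch, assuming the override is a plain lookup applied to whatever lemma spaCy returns. `LEMMA_OVERRIDES` and `fix_lemma` are hypothetical names, and the table holds only the single bad lemma reported above.

```python
# Known-bad lemmas mapped to their corrections; only the case from this issue is listed.
LEMMA_OVERRIDES = {"movilizarar": "movilizar"}


def fix_lemma(lemma: str, overrides: dict = LEMMA_OVERRIDES) -> str:
    """Return the corrected lemma if one is registered, else the lemma unchanged."""
    return overrides.get(lemma.lower(), lemma)
```

In the extractor this would wrap each `token.lemma_` before the triple is emitted. A lookup keeps the fix independent of the spaCy model, at the cost of having to maintain the table by hand.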
+
+---
+
+## D. Fix the kernel startup shadowing of pip packages
+
+### Symptom
+
+`.ipython/profile_default/startup/00_fn_registry.py` adds every subdirectory of `python/functions/` to the top level of sys.path. Because the registry contains a `bigquery/datasets.py`, it **shadows** the HuggingFace `datasets` package that `transformers` needs. As a result, every notebook has to apply a workaround:
+
+```python
+import sys
+
+_pf = '/home/lucas/fn_registry/python/functions'
+# Drop the per-subdir entries that shadow pip packages; keep only the parent dir
+sys.path = [p for p in sys.path if not p.startswith(_pf + '/')]
+if _pf not in sys.path: sys.path.insert(0, _pf)
+```
+
+### Proposed fix
+
+Modify the `write_jupyter_registry_kernel` template (the registry function that generates that startup file in every new analysis) so that it does:
+
+```python
+# Only the parent directory 'python/functions/' (not the subdirs)
+sys.path.insert(0, str(_python_functions))
+
+# The user imports through the package:
+#   from datascience.gliner_load_model import gliner_load_model
+#   from core.extract_pdf_text import extract_pdf_text
+# (not a direct `from gliner_load_model import ...`)
+```
+
+This requires:
+1. Updating the registry function that generates the startup file.
+2. Regenerating the startup file in existing analyses (migration script).
+3. Documenting in `.claude/CLAUDE.md` that imports in analysis notebooks follow the pattern `from <domain>.<function_name> import <function_name>`.
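
The path rewrite that both the workaround and the proposed startup file rely on can be written as a pure function, which makes it easy to unit-test. This is a minimal sketch, assuming POSIX-style paths as in the workaround above; `registry_sys_path` is a hypothetical name, not the existing template code.

```python
def registry_sys_path(paths: list, functions_root: str) -> list:
    """Return a sys.path-like list with only the registry root, not its subdirs.

    Entries under `functions_root` (e.g. .../functions/bigquery) are dropped so
    their modules cannot shadow pip packages; the root itself goes first.
    """
    root = functions_root.rstrip("/")
    prefix = root + "/"
    kept = [p for p in paths if p != root and not p.startswith(prefix)]
    return [root] + kept
```

A startup file or migration script could then assign `sys.path[:] = registry_sys_path(sys.path, root)`. Keeping the function free of side effects means the "no shadowing" test needs no live kernel.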
+
+### Tests
+
+- Test that the new startup allows `import datasets` (HuggingFace) without shadowing.
+- Test that `from datascience.gliner_load_model import gliner_load_model` still works.
+
+---
+
+## E. `extract_relations_rebel_py_datascience`
+
+### Rationale
+
+`extract_relations_mrebel` already exists (created in round 1 on 2026-05-04). For **English** text and cases that need a commercial license without caveats, REBEL (`Babelscape/rebel-large`, **Apache 2.0**) is the alternative.
+
+### Signature
+
+```python
+def extract_relations_rebel(
+    text: str,
+    entities: list,                    # list[EntityCandidate]
+    tokenizer,
+    model,
+    sentence_split_re: str = r"(?<=[\.])\s+",
+    min_sentence_chars: int = 20,
+    num_beams: int = 4,
+    max_length: int = 256,
+) -> list:
+    """Extract relations from English text using REBEL, sentence by sentence.
+    
+    Same wire format as mREBEL — reuses `parse_rebel_output` and
+    `align_relations_to_entities` from the registry.
+    
+    LICENSE: Apache 2.0 (commercial OK).
+    """
+```
+
+Practically identical to `extract_relations_mrebel`, but without `tgt_lang='tp_XX'` (REBEL is monolingual).
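
The sentence-by-sentence part of the signature can be sketched without the model. This is a minimal sketch that reuses the default `sentence_split_re` and `min_sentence_chars` from the signature above. It assumes sentences shorter than the minimum are simply dropped, which the issue does not state; `split_sentences` is a hypothetical helper name.

```python
import re


def split_sentences(
    text: str,
    sentence_split_re: str = r"(?<=[\.])\s+",
    min_sentence_chars: int = 20,
) -> list:
    """Split on whitespace that follows a period and drop very short pieces."""
    parts = re.split(sentence_split_re, text.strip())
    # Assumption: fragments below the minimum length are skipped, not merged.
    return [p.strip() for p in parts if len(p.strip()) >= min_sentence_chars]
```

Each returned sentence would then go through the tokenizer and `model.generate`, with the decoded output handed to `parse_rebel_output`. Note that the default pattern also splits after abbreviations such as "Inc.", so a stricter pattern may be needed for company-heavy text.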
+
+---
+
+## Suggested prioritization
+
+| # | Item | Impact | Cost | When |
+|---|---|---|---|---|
+| B | `extract_graph_from_pdf` pipeline | ⭐⭐⭐ (the most used composition) | Low (composes existing functions) | Immediately |
+| C | spaCy ES V2 rules | ⭐⭐ (unblocks more ES cases) | Medium (rules + tests) | When V1 becomes limiting |
+| D | Kernel startup fix | ⭐⭐ (cleans up the notebook flow) | Medium (refactor + migration) | When a new analysis is created |
+| A | NuExtract loader/extractor | ⭐ (optional alternative engine) | Medium (GPU testing) | When "Rich mode" is wanted |
+| E | REBEL EN extractor | ⭐ (only if a commercial EN case arrives) | Low (copy of mREBEL) | When the case comes up |
+
+---
+
+## Definition of done (all items)
+
+- Functions implemented + frontmatter + green pytest tests.
+- `./fn index` grows by exactly the declared functions.
+- `./fn check params` flags no new function as missing params_schema.
+- Documented in `vaults/osint_nlp_models/models/` or the corresponding section of the vault.
+- Operational notes in the `app.md` of the consumer (graph_explorer) if uses_functions is affected.
+
+## Explicitly out of scope
+
+- LLM-as-validator to improve relations (Claude Haiku post-NuExtract). The user explicitly said they do not want heavy LLMs in the flow.
+- GLiDRE / ReLiK / AlignRE: only if a concrete need arises. Listed in `vaults/osint_nlp_models/models/candidates.md`.
@@ -0,0 +1,46 @@
+---
+title: "Bulk extraction of footprint_aurgi → registry"
+status: in_progress
+created: 2026-05-04
+---
+
+# 0052: Extraction of functions from `sources/footprint_aurgi/`
+
+Extraction of 45 functions + 4 types from the internal project `footprint_aurgi` (Aurgi's own code, no LICENSE file; `source_license: internal-aurgi`).
+
+## Capabilities covered
+
+1. Geocoding and routing (Valhalla)
+2. Isochrone generation (sync + async batch)
+3. Docker geo stack (PostGIS + Martin + Valhalla)
+4. Spatial primitives (haversine, point-in-polygon, bbox, sindex)
+5. Map visualization (OSM basemap, KDE, alpha-shape hulls)
+6. PDF reporting (ghostscript compression, table pages)
+7. Statistics for real-world distributions (skew/kurt, trimmed/geo means)
+8. Adaptive fuzzy joining
+9. Spain normalization (postal code → province)
+10. Data prep (CSV→Parquet via duckdb)
+
+## Batches
+
+| # | Domain | Functions | Owner |
+|---|---|---|---|
+| 1 | geo (pure) + types | haversine, point_in_polygon, bbox, extent, distance_bucket + LonLat, BBox, IsochroneRequest, Centro | agent-A |
+| 2 | core (string ES) | slugify_ascii, normalize_for_join, cp_provincia_es, infer_provincia_from_cp | agent-B |
+| 3 | datascience (stats) | trimmed_mean, geometric_mean, detect_distribution_type, best_central_tendency, summary_stats, kde_density_levels, alpha_shape_concave_hull | agent-C |
+| 4 | datascience (fuzzy) | fuzzy_merge_adaptive, words_to_dataset, remove_words_from_column | agent-D |
+| 5 | geo (Valhalla client) | valhalla_route, valhalla_matrix_1_to_n, valhalla_isochrone, valhalla_isochrones_async | agent-E |
+| 6 | geo (I/O + viz) | load_geojson_polygons, load_boundary_gdf, add_basemap_osm, add_basemap_with_timeout, plot_kde_2d, plot_heatmap_log | agent-F |
+| 7 | infra (PDF + data) | compress_pdf_ghostscript, render_table_page_pdfpages, add_header_logo, safe_read_csv_fallback, csv_to_parquet_duckdb, osm2pgsql_ingest | agent-G |
+| 8 | infra (docker stack) | docker-compose footprint geo (PostGIS + Martin + Valhalla): bring up and verify | agent-H |
+| 9 | pipelines | setup_geo_stack_docker, compute_centers_reachability, generate_isochrones_by_zone, count_points_per_zone | agent-I (wave 2) |
+
+## Source
+
+- Path: `sources/footprint_aurgi/`
+- Sub-projects: aurgi_mapas, better_maps, frontend_mapas, fuzzy_joins, ponderacion_isochronas, zonas_mapas_aurgi
+- Uniform attribution: `source_repo: "internal:footprint_aurgi"`, `source_license: "internal-aurgi"`
+
+## Expected result
+
+Final report per function: ✅ tests pass / ❌ tests fail / ⚠️ stub (requires external infrastructure).
@@ -66,3 +66,5 @@
 | [0049i](completed/0049i-graph-layouts-static.md) | graph_layouts (radial/hierarchical/fixed) + viewport multi-select | completed | medium | feature | part of 0049 |
 | [0049j](completed/0049j-graph-labels.md) | graph_labels: render labels with LabelPolicy | completed | medium | feature | part of 0049 |
 | [0049k](completed/0049k-graph-explorer-app.md) | App graph_explorer (osint_graph project): final integration | completed | high | feature | part of 0049 |
+| [0050](0050-jupyter-exec-collab-client-failure.md) | `jupyter_exec` fails because of the collaborative client (workaround documented) | pending | medium | bug | - |
+| [0051](0051-extraction-pipeline-followups.md) | Pending functions of the NER+RE pipeline (NuExtract, extract_graph_from_pdf, spaCy ES V2, kernel startup fix, REBEL EN) | pending | medium | feature | - |