fn_registry/python/functions/datascience/parse_rebel_output.md
egutierrez faac610745 feat: bulk extraction from footprint_aurgi (41 funcs + 4 types + Docker geo stack)
Extracts functions from the internal footprint_aurgi project into the registry:
- core (6): slugify_ascii, normalize_for_join, cp_provincia_es, infer_provincia_from_cp, safe_read_csv_fallback, csv_to_parquet_duckdb
- pure geo (7): haversine_km, point_in_ring, point_in_polygon, point_in_polygons_bbox, polygon_bbox, extent_with_padding, distance_bucket
- geo I/O (4): load_geojson_polygons, load_boundary_gdf, add_basemap_osm, add_basemap_with_timeout
- valhalla client (4): valhalla_route, valhalla_isochrone, valhalla_isochrones_async, valhalla_matrix_1_to_n
- datascience stats (7): trimmed_mean, geometric_mean, detect_distribution_type, best_central_tendency, summary_stats, kde_density_levels, alpha_shape_concave_hull
- datascience fuzzy (3): fuzzy_merge_adaptive (rapidfuzz), words_to_dataset, remove_words_from_column
- datascience viz (2): plot_kde_2d, plot_heatmap_log
- infra (4): compress_pdf_ghostscript, render_table_page_pdfpages, add_header_logo, osm2pgsql_ingest
- pipelines (4): setup_geo_stack_docker, compute_centers_reachability, generate_isochrones_by_zone, count_points_per_zone
- types geo (4): LonLat, BBox, IsochroneRequest, Centro

Includes:
- apps/footprint_geo_stack/ (PostGIS + Martin + Valhalla via docker-compose)
- 131/132 tests pass (1 expected skip: osm2pgsql in PATH)
- Issue tracker dev/issues/0052-footprint-aurgi-extraction.md
- Uniform attribution: source_repo internal:footprint_aurgi, source_license internal-aurgi
- Built with 9 agents in parallel (8 in wave 1 + 1 in wave 2 for the pipelines)

Also commits previously uncommitted work: aggregate_extraction_results, chunk_with_overlap, clean_pdf_text, merge_entity_aliases, extract_graph_gliner2, extract_relations_mrebel, extract_triples_spacy_es, gliner2/mrebel/marianmt/rebel/spacy_es load_model, parse_rebel_output, translate_es_to_en, issue 0050/0051.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 23:35:22 +02:00


name: parse_rebel_output
kind: function
lang: py
domain: datascience
version: 1.0.0
purity: pure
signature: def parse_rebel_output(decoded_text: str) -> list[dict]
description: Pure parser for the REBEL / mREBEL wire format. Converts the string decoded by the tokenizer (with skip_special_tokens=False) into a list of typed triplets {head, head_type, type, tail, tail_type}. Never raises an exception.
tags: rebel, mrebel, relation-extraction, nlp, parser, knowledge-graph, datascience, python
uses_functions: (empty)
uses_types: (empty)
returns: (empty)
returns_optional: false
error_type: (empty)
imports: (empty)
params:
  • decoded_text: raw string produced by tokenizer.decode(..., skip_special_tokens=False); it includes special tokens such as <triplet>, <per>, <org>, <loc>, tp_XX, etc.
output: list of dicts with keys head (str), head_type (str), type (str), tail (str), tail_type (str). Empty list if there are no complete triplets or the input is empty.
tested: true
tests:
  • empty string returns an empty list
  • one complete triplet returns a dict with the correct fields
  • two triplets return two dicts
  • an incomplete triplet with no closing token does not break the parser
  • unknown angle-bracket tokens do not raise an exception
test_file_path: python/functions/datascience/tests/test_parse_rebel_output.py
file_path: python/functions/datascience/parse_rebel_output.py
notes: Pure function. Adapts the official parser from the Babelscape/rebel README to the registry style. Compatible with mREBEL (tp_XX prefix, language tokens __es__, __en__) and REBEL (no language prefix). The wire format uses <triplet> to separate triplets and <type> tokens to close the head/tail spans. The state machine has three states: t=reading head, s=reading tail, o=reading relation.

Example

from python.functions.datascience.parse_rebel_output import parse_rebel_output

decoded = "tp_XX<triplet> Pablo Isla <per> Inditex <org> employer"
triplets = parse_rebel_output(decoded)
# [{'head': 'Pablo Isla', 'head_type': 'per', 'type': 'employer',
#   'tail': 'Inditex', 'tail_type': 'org'}]

REBEL / mREBEL wire format

tp_XX<triplet> HEAD_TOKENS <HEAD_TYPE> TAIL_TOKENS <TAIL_TYPE> RELATION_TOKENS<triplet> ...
  • <triplet>: marks the start of a new triplet (and closes the previous one).
  • <HEAD_TYPE>: closes the head span and opens the tail span.
  • <TAIL_TYPE>: closes the tail span and opens the relation span.
  • The last triplet is closed by </s> (already removed before the split).
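
The layout above maps onto a small three-state machine. The following is a minimal sketch under stated assumptions, not the registry's actual parse_rebel_output: the name parse_rebel_sketch, the set of markers stripped before tokenizing, and the decision to ignore type tokens that appear outside the head/tail positions are all illustrative choices.

```python
import re

# Markers stripped before tokenizing (assumed set; adjust to the model in use).
_MARKERS = ("<s>", "</s>", "<pad>", "tp_XX", "__en__", "__es__")


def parse_rebel_sketch(decoded_text: str) -> list[dict]:
    """Hypothetical stand-in for parse_rebel_output, for illustration only."""
    text = decoded_text
    for marker in _MARKERS:
        text = text.replace(marker, " ")
    # Pad angle-bracket tokens so "employer<triplet>" splits into two tokens.
    text = re.sub(r"<[^<>\s]+>", lambda m: f" {m.group(0)} ", text)

    triplets: list[dict] = []
    cur = None   # triplet currently being read
    state = "x"  # x=before first <triplet>, t=head, s=tail, o=relation

    def finish(t):
        # Keep only complete triplets; partial ones are dropped silently.
        if t and all(t.values()):
            triplets.append(
                {k: " ".join(v) if isinstance(v, list) else v for k, v in t.items()}
            )

    for token in text.split():
        if token == "<triplet>":
            finish(cur)  # a new triplet closes the previous one
            cur = {"head": [], "head_type": "", "type": [], "tail": [], "tail_type": ""}
            state = "t"
        elif token.startswith("<") and token.endswith(">"):
            if state == "t":    # <HEAD_TYPE>: close head, open tail
                cur["head_type"] = token[1:-1]
                state = "s"
            elif state == "s":  # <TAIL_TYPE>: close tail, open relation
                cur["tail_type"] = token[1:-1]
                state = "o"
            # Type tokens in any other position are ignored (assumption).
        elif state == "t":
            cur["head"].append(token)
        elif state == "s":
            cur["tail"].append(token)
        elif state == "o":
            cur["type"].append(token)

    finish(cur)  # the last triplet has no following <triplet> to close it
    return triplets
```

Type names are taken from the token itself with the angle brackets removed, so an unknown token such as <weird> yields head_type "weird" instead of raising.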

Notes

  • Does not validate or filter head_type/tail_type; they are returned exactly as the model emits them.
  • Compatible with any seq2seq variant that uses the same wire format (Babelscape/rebel, Babelscape/mrebel-large, Babelscape/mrebel-base).
  • To use the output in the graph, pass it through align_relations_to_entities, which resolves head/tail to canonical names from the known entity set.
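
Because head_type and tail_type come back unvalidated, a caller that only wants certain entity types has to filter afterwards. A minimal sketch of such a filter follows; the helper name filter_triplets_by_type, the default allowed set and the sample triplets are hypothetical and not part of the registry.

```python
def filter_triplets_by_type(triplets, allowed=frozenset({"per", "org", "loc"})):
    """Keep triplets whose head_type and tail_type are both in `allowed`."""
    return [
        t for t in triplets
        if t.get("head_type") in allowed and t.get("tail_type") in allowed
    ]


# Made-up sample in the shape returned by parse_rebel_output.
sample = [
    {"head": "Pablo Isla", "head_type": "per", "type": "employer",
     "tail": "Inditex", "tail_type": "org"},
    {"head": "Inditex", "head_type": "org", "type": "inception",
     "tail": "1985", "tail_type": "date"},
]
kept = filter_triplets_by_type(sample)  # only the first triplet survives
```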