chore: add dev/ directory with issues and implemented functions
Tracks completed issues (jupyter tools) and implemented functions (design specs that are already resolved).
@@ -0,0 +1,49 @@
# Functions for FuzzyGraph: automatic extraction of entities and relations

Original design for `apps/fuzzygraph`. Full pipeline:

```
document → extract_text → preprocess → split_chunks
  → extract_entities_llm (per chunk) → deduplicate_entities
  → extract_relations_llm (per chunk + entities) → deduplicate_relations
  → insert operations.db
```
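
The stage order in the diagram can be sketched as a single orchestration function. This is a minimal illustration, not the `extraction_pipeline` implementation: the function name `extraction_pipeline_sketch`, the dict shapes (`name`, `source`, `type`, `target`), and the fake extractors are all assumptions made for the example. File reading and the insert into `operations.db` are left out, and deduplication is reduced to exact matching on lowercase names.

```python
from typing import Callable


def extraction_pipeline_sketch(
    text: str,
    split_chunks: Callable[[str], list],
    extract_entities: Callable[[str], list],
    extract_relations: Callable[[str, list], list],
) -> dict:
    """Run the stages in diagram order on already-extracted text (hypothetical sketch)."""
    # preprocess: collapse runs of whitespace before chunking
    chunks = split_chunks(" ".join(text.split()))

    # extract entities per chunk, then deduplicate by lowercase name
    entities = {}
    for chunk in chunks:
        for ent in extract_entities(chunk):
            entities.setdefault(ent["name"].lower(), ent)
    entity_list = list(entities.values())

    # extract relations per chunk, passing the deduplicated entities as context
    relations = {}
    for chunk in chunks:
        for rel in extract_relations(chunk, entity_list):
            key = (rel["source"].lower(), rel["type"], rel["target"].lower())
            relations.setdefault(key, rel)

    return {"entities": entity_list, "relations": list(relations.values())}


# Stand-ins for the LLM calls, used only to exercise the sketch.
def fake_entities(chunk: str) -> list:
    return [{"name": word} for word in chunk.split() if word.istitle()]


def fake_relations(chunk: str, entities: list) -> list:
    words = chunk.split()
    names = [e["name"] for e in entities if e["name"] in words]
    return [
        {"source": a, "type": "co_occurs", "target": b}
        for a, b in zip(names, names[1:])
    ]


result = extraction_pipeline_sketch(
    "Alice met Bob   and Alice called Bob",
    split_chunks=lambda t: [t],
    extract_entities=fake_entities,
    extract_relations=fake_relations,
)
```

Passing the stages as callables keeps the orchestrator testable without an LLM; whether the real pipeline is wired this way is not stated in this document.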

## Existing registry dependencies

| Existing function | ID | Purpose |
|---|---|---|
| levenshtein_distance | `levenshtein_distance_py_cybersecurity` | Fuzzy matching of names |
| jaccard_similarity | `jaccard_similarity_py_cybersecurity` | Fuzzy matching of tokens |
| extract_urls | `extract_urls_py_cybersecurity` | Pre-extract URLs as Domain entities |
| normalize_url | `normalize_url_py_cybersecurity` | Normalize URLs before deduplication |
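
To make the two fuzzy-matching roles concrete, here is a self-contained sketch of both measures. These are textbook implementations written for illustration; the registry functions may have different signatures and behavior. The `names_match` helper and its thresholds are hypothetical.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(
                min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost)
            )
        previous = current
    return previous[-1]


def jaccard_similarity(a: str, b: str) -> float:
    """Overlap of lowercase whitespace-separated tokens: |A ∩ B| / |A ∪ B|."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def names_match(a: str, b: str, max_distance: int = 2, min_jaccard: float = 0.5) -> bool:
    """Hypothetical helper: treat two names as the same entity if either measure agrees."""
    return (
        levenshtein_distance(a.lower(), b.lower()) <= max_distance
        or jaccard_similarity(a, b) >= min_jaccard
    )
```

Character-level distance catches typos and punctuation differences, while token overlap catches reordered or partially shared names, which is presumably why both appear as dependencies.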

## Dependencies on pending specs (OpenViking/MiroFish)

| Spec | Purpose |
|---|---|
| mf_09 extract_text_from_file | Extract text from PDF/MD/TXT |
| mf_01 split_text_into_chunks | Chunks with overlap |
| mf_02 preprocess_text | Normalize whitespace |
| mf_04 parse_llm_json | Clean up the LLM's JSON output |
| mf_06 retry_with_backoff | Retry LLM calls |
| mf_11 call_batch_with_retry | Process chunks in batches |
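
Three of these specs are small enough to sketch directly. The code below is one possible reading of each one-line description, not the spec itself: the parameter names, the defaults, character-based chunking, and the bracket-scanning JSON cleanup are all assumptions.

```python
import json
import time


def split_text_into_chunks(text: str, chunk_size: int = 500, overlap: int = 50) -> list:
    """Split text into character windows where each chunk repeats the last `overlap` chars."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    if not text:
        return []
    step = chunk_size - overlap
    return [
        text[i : i + chunk_size]
        for i in range(0, max(len(text) - overlap, 1), step)
    ]


def parse_llm_json(raw: str):
    """Drop any chatter around the first JSON object or array and parse it."""
    starts = [i for i in (raw.find("{"), raw.find("[")) if i != -1]
    end = max(raw.rfind("}"), raw.rfind("]"))
    if not starts or end < min(starts):
        raise ValueError("no JSON object or array found")
    return json.loads(raw[min(starts) : end + 1])


def retry_with_backoff(fn, attempts: int = 3, base_delay: float = 0.5, sleep=time.sleep):
    """Call fn(); on failure wait base_delay * 2**attempt and retry, re-raising at the end."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2**attempt)
```

The `sleep` parameter exists only so the backoff schedule can be checked without waiting; a production version would likely also restrict which exceptions trigger a retry.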

## New functions (this directory)

| # | File | Domain | Function |
|---|------|--------|----------|
| 01 | fg_extract_entities_llm.md | datascience | extract_entities_llm |
| 02 | fg_extract_relations_llm.md | datascience | extract_relations_llm |
| 03 | fg_deduplicate_entities.md | datascience | deduplicate_entities |
| 04 | fg_deduplicate_relations.md | datascience | deduplicate_relations |
| 05 | fg_build_entity_schema_prompt.md | datascience | build_entity_schema_prompt, build_relation_schema_prompt |
| 06 | fg_normalize_entity_name.md | core | normalize_entity_name |
| 07 | fg_merge_entity_attributes.md | core | merge_entity_attributes |
| 08 | fg_extraction_pipeline.md | pipelines | extraction_pipeline (full orchestrator) |
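
The two `core` helpers and `deduplicate_entities` fit together naturally, so a rough sketch of that interaction may help. The individual spec files are not shown here, so everything below is an assumption: the entity dict shape (`name`, `type`, `attributes`), the normalization rules, and the "first non-empty value wins" merge policy. Matching is exact after normalization; a fuzzy pass is omitted.

```python
import re
import unicodedata


def normalize_entity_name(name: str) -> str:
    """Strip accents, collapse whitespace, and casefold (assumed normalization rules)."""
    decomposed = unicodedata.normalize("NFKD", name)
    without_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return re.sub(r"\s+", " ", without_accents).strip().casefold()


def merge_entity_attributes(base: dict, other: dict) -> dict:
    """Keep values from base; fill missing or empty ones from other (assumed policy)."""
    merged = dict(base)
    for key, value in other.items():
        if merged.get(key) in (None, "", [], {}):
            merged[key] = value
    return merged


def deduplicate_entities(entities: list) -> list:
    """Group entities by (type, normalized name) and merge their attributes."""
    by_key = {}
    for ent in entities:
        key = (ent.get("type", ""), normalize_entity_name(ent["name"]))
        attrs = ent.get("attributes", {})
        if key in by_key:
            kept = by_key[key]
            kept["attributes"] = merge_entity_attributes(kept["attributes"], attrs)
        else:
            # Copy so the input entities are not mutated.
            by_key[key] = {**ent, "attributes": dict(attrs)}
    return list(by_key.values())
```

Including the entity type in the key keeps a person and an organization with the same name apart; whether the actual design does this is not stated in this index.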

## New types

| # | File | Domain | Types |
|---|------|--------|-------|
| 09 | fg_type_extraction.md | datascience | EntityCandidate, RelationCandidate, ExtractionResult, DeduplicationResult |
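
Only the type names are given here, so the following dataclasses are a guess at what they might carry. Every field is hypothetical and chosen to match the pipeline stages described above; `fg_type_extraction.md` is the authoritative definition.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class EntityCandidate:
    name: str
    type: str
    attributes: dict = field(default_factory=dict)
    chunk_index: Optional[int] = None  # hypothetical: chunk the entity was found in


@dataclass
class RelationCandidate:
    source: str
    target: str
    type: str
    attributes: dict = field(default_factory=dict)


@dataclass
class ExtractionResult:
    entities: list = field(default_factory=list)   # list of EntityCandidate
    relations: list = field(default_factory=list)  # list of RelationCandidate


@dataclass
class DeduplicationResult:
    kept: list = field(default_factory=list)  # surviving candidates
    merged_count: int = 0                     # hypothetical: duplicates folded into kept


extraction = ExtractionResult()
extraction.entities.append(EntityCandidate(name="Acme", type="Organization"))
```

Using `default_factory` for the mutable fields avoids sharing one list or dict across instances.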