chore: auto-commit (286 files)
- .claude/agents/fn-orquestador/SKILL.md
- .claude/commands/fn_claude.md
- .claude/rules/INDEX.md
- .claude/rules/cpp_apps.md
- .claude/rules/ids_naming.md
- CHANGELOG.md
- apps/dag_engine/README.md
- apps/dag_engine/api.go
- apps/dag_engine/dags_migrated/example.yaml
- apps/dag_engine/dags_migrated/example_lineage_tracking.yaml
- ...

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -1,6 +1,6 @@
# Capability: nlp

_(Group description: edit by hand)_
NLP pipeline for entity/relation extraction from Spanish-language documents (primarily the OSINT project). It covers: PDF reading (`extract_pdf_text`, `clean_pdf_text`), OCR fallback, chunking (`chunk_with_overlap`), GLiNER (NER) + GLiREL (relation extraction) inference, deduplication (`dedup_entities`, `dedup_relations`), aggregation (`aggregate_extraction_results`), and extraction of specific items (URLs, crypto wallets, IPs, domains).
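
As a rough illustration of the chunking step, the sketch below splits text into fixed-size windows that share a margin with their neighbour. It is not the registry's `chunk_with_overlap`: `sketch_chunk_with_overlap` is a hypothetical stand-in, and it assumes `size` and `overlap` are measured in characters, which the source does not state.

```python
def sketch_chunk_with_overlap(text: str, size: int = 512, overlap: int = 64) -> list[str]:
    """Hypothetical stand-in: character windows where neighbours share `overlap` chars."""
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")

    step = size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # this window already reached the end of the text
    return chunks


print(sketch_chunk_with_overlap("abcdefghij", size=4, overlap=1))
# → ['abcd', 'defg', 'ghij']
```

The overlap means an entity that straddles a window boundary still appears whole in at least one chunk, at the cost of duplicate extractions that have to be removed later.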

## Functions

@@ -43,8 +43,37 @@ _(Group description: edit by hand)_

## Canonical example

_(Add 1-2 end-to-end code blocks)_
### Full pipeline: PDF -> entities + relations

```python
import os
import sys

# Make the registry's Python functions importable.
sys.path.insert(0, os.path.join(os.environ["FN_REGISTRY_ROOT"], "python", "functions"))

from core import extract_pdf_text, clean_pdf_text, chunk_with_overlap
from datascience import (
    gliner_extract_entities, glirel_extract_relations,
    dedup_entities, dedup_relations, aggregate_extraction_results,
)

# Read the PDF, clean the raw text, and split it into overlapping chunks.
raw = extract_pdf_text("/path/to/doc.pdf")
text = clean_pdf_text(raw)
chunks = chunk_with_overlap(text, size=512, overlap=64)

entities = []
relations = []
for ch in chunks:
    # Run NER on the chunk, then relation extraction over this chunk's entities only,
    # so entities found in earlier chunks are not paired against unrelated text.
    chunk_entities = gliner_extract_entities(ch, labels=["PERSON", "ORG", "ACCOUNT"])
    entities.extend(chunk_entities)
    relations.extend(glirel_extract_relations(ch, entity_pairs=chunk_entities))

# Deduplicate across chunks (the overlap produces repeats), then aggregate.
result = aggregate_extraction_results(
    entities=dedup_entities(entities),
    relations=dedup_relations(relations),
)
```
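
Because consecutive chunks overlap, the same mention can be extracted more than once, which is why the pipeline deduplicates before aggregating. A minimal sketch of one way that could work is shown here. It is not the registry's `dedup_entities`: `sketch_dedup_entities` is a hypothetical name, and the entity shape (dicts with `text` and `label` keys) is an assumption, since the source does not show what `gliner_extract_entities` returns.

```python
def sketch_dedup_entities(entities: list[dict]) -> list[dict]:
    """Hypothetical stand-in: keep the first occurrence of each (text, label) pair."""
    seen = set()
    unique = []
    for ent in entities:
        # Normalise case and whitespace so trivial variants collapse together.
        key = (" ".join(ent["text"].split()).casefold(), ent["label"])
        if key not in seen:
            seen.add(key)
            unique.append(ent)
    return unique


found = [
    {"text": "Banco Santander", "label": "ORG"},
    {"text": "banco  santander", "label": "ORG"},   # same entity, from an overlapping chunk
    {"text": "Santander", "label": "PERSON"},       # different text and label, kept
]
print(len(sketch_dedup_entities(found)))
# → 2
```

Keying on the label as well as the text keeps homonyms with different types apart; whether the real function also merges scores or spans is not known from this diff.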

## Boundaries

_(What this group does NOT cover)_
- **Does NOT train models.** Inference only, with pre-trained GLiNER/GLiREL models (HuggingFace).
- **Does NOT handle dense embeddings** (sentence-transformers / e5). For vectors, use the functions in the `ml` group.
- **Does NOT do LLM translation or summarization.** NER + RE only. For LLMs, see the `llm` tag.
- **Does NOT write to a database** automatically. Persistence (vault, sqlite, parquet) is handled by the caller via functions from `infra`/`datascience`.
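
Since persistence is left to the caller, the sketch below shows one way a caller might store deduplicated entities in SQLite using only the standard library. It does not use the `infra`/`datascience` helpers, whose signatures are not shown here. The function name `sketch_persist_entities`, the table layout, and the entity shape (dicts with `text` and `label` keys) are all assumptions for illustration.

```python
import sqlite3


def sketch_persist_entities(conn: sqlite3.Connection, entities: list[dict]) -> int:
    """Hypothetical helper: insert entities, skip exact repeats, return the row count."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS entities ("
        "text TEXT NOT NULL, label TEXT NOT NULL, UNIQUE(text, label))"
    )
    rows = [(e["text"], e["label"]) for e in entities]
    with conn:  # commits on success, rolls back if an insert raises
        conn.executemany(
            "INSERT OR IGNORE INTO entities (text, label) VALUES (?, ?)", rows
        )
    return conn.execute("SELECT COUNT(*) FROM entities").fetchone()[0]


conn = sqlite3.connect(":memory:")  # in-memory database, for the example only
stored = sketch_persist_entities(
    conn,
    [
        {"text": "Banco Santander", "label": "ORG"},
        {"text": "Banco Santander", "label": "ORG"},  # ignored by the UNIQUE constraint
    ],
)
print(stored)
# → 1
```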