| name | kind | lang | domain | version | purity | signature | description | tags | uses_functions | uses_types | returns | returns_optional | error_type | imports | params | output | tested | tests | test_file_path | file_path |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| vault_pdf_extract | function | py | datascience | 1.0.0 | impure | def vault_pdf_extract(vault_path: str, rel_path: str, db_path: str \| None = None, dump_text: bool = True) -> dict | Extracts the text of a PDF in the vault with PyMuPDF; persists page_count and text_len in pdf_extracts; dumps the text to a .txt file in data/processed/ or .vault_extracts/; updates files_fts so the content is searchable. | | | | | false | error_go_core | | | Dict with: rel_path (str), page_count (int), text_len (int), extracted_to (path to the .txt file relative to the vault, or None), persisted (bool). | true | | python/functions/datascience/tests/test_vault_pdf_extract.py | python/functions/datascience/vault_pdf_extract.py |
Example
from vault_pdf_extract import vault_pdf_extract
result = vault_pdf_extract("/vaults/mi_vault", "docs/informe_anual.pdf")
# {
# "rel_path": "docs/informe_anual.pdf",
# "page_count": 24,
# "text_len": 45210,
# "extracted_to": "data/processed/informe_anual.txt",
# "persisted": True
# }
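The shape of the returned dict can be written down as a type, which makes it easier to consume the result safely. This is only a sketch: `PdfExtractResult` and `describe` are hypothetical names, not part of the documented module, and the fields simply mirror the output description above.

```python
from typing import Optional, TypedDict


class PdfExtractResult(TypedDict):
    """Hypothetical type mirroring the documented return dict."""
    rel_path: str
    page_count: int
    text_len: int
    extracted_to: Optional[str]  # None when no .txt dump was written
    persisted: bool              # False when nothing was written to the index


def describe(result: PdfExtractResult) -> str:
    """Build a one-line summary of an extraction result."""
    dump = result["extracted_to"] or "no text dump"
    status = "persisted" if result["persisted"] else "not persisted"
    return (
        f"{result['rel_path']}: {result['page_count']} pages, "
        f"text_len={result['text_len']}, dump={dump}, {status}"
    )
```

A caller would typically check `persisted` and `extracted_to` before relying on the index or on the dumped file.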
Notes
- Requires PyMuPDF (package `pymupdf`, imported as `fitz`). Already installed in python/.venv.
- The text is truncated to 10 MB before it is inserted into files_fts, to avoid massive FTS5 tables.
- Dump layout: if `<vault_path>/data/processed/` exists, it is used; otherwise `<vault_path>/.vault_extracts/` is created.
- Corrupt PDFs raise RuntimeError with a descriptive message.
- The files_fts rowid is anchored to the rowid of the files table (via a subquery) so that vault_search works correctly.
- If vault_index.db does not exist, the function returns the dict without attempting to persist (persisted=False).
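The dump layout, truncation, and rowid anchoring described in these notes can be sketched roughly as follows. This is not the module's actual code: the helper names are hypothetical, the schema (`files` with a `rel_path` column, `files_fts` with a single `content` column) is assumed, and the 10 MB limit is assumed to be counted in UTF-8 bytes. It also requires an SQLite build with FTS5 enabled.

```python
import sqlite3
from contextlib import closing
from pathlib import Path

MAX_FTS_BYTES = 10 * 1024 * 1024  # assumed meaning of "10 MB"


def choose_dump_dir(vault_path: str) -> Path:
    """Use data/processed/ if it exists, else create .vault_extracts/."""
    processed = Path(vault_path) / "data" / "processed"
    if processed.is_dir():
        return processed
    fallback = Path(vault_path) / ".vault_extracts"
    fallback.mkdir(exist_ok=True)
    return fallback


def truncate_utf8(text: str, max_bytes: int = MAX_FTS_BYTES) -> str:
    """Cut text to at most max_bytes of UTF-8, dropping a split trailing char."""
    data = text.encode("utf-8")
    if len(data) <= max_bytes:
        return text
    return data[:max_bytes].decode("utf-8", errors="ignore")


def index_text(db_path: str, rel_path: str, text: str) -> bool:
    """Insert text into files_fts under the rowid of the matching files row.

    Returns False without touching anything if the index DB does not exist.
    """
    if not Path(db_path).exists():
        return False
    anchor = "(SELECT rowid FROM files WHERE rel_path = ?)"
    with closing(sqlite3.connect(db_path)) as conn:
        with conn:  # commit on success, roll back on error
            # Remove any previous entry for this file before re-inserting.
            conn.execute(
                f"DELETE FROM files_fts WHERE rowid = {anchor}", (rel_path,)
            )
            conn.execute(
                f"INSERT INTO files_fts(rowid, content) VALUES ({anchor}, ?)",
                (rel_path, truncate_utf8(text)),
            )
    return True
```

Sharing the rowid means a match in `files_fts` can be joined back to `files` directly, which is presumably what lets a search function resolve hits to file records.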