chore: Python dependencies, sources manifest, extraction rules, and extract-source command

Updates pyproject.toml with new dependencies (pdfplumber, python-docx, ebooklib, openpyxl, etc.).
Updates sources.yaml with the functions extracted from external repos.
Improves the extraction rules in sources.md.
Adds the Claude command extract-source for the extraction workflow.
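
For illustration only, a minimal sketch of how the entries recorded in sources.yaml could be sanity-checked and counted per source file. It assumes the manifest has already been parsed into plain dicts and lists; `summarize` and `REQUIRED_KEYS` are hypothetical names, not part of this repository, and treating `id`, `source_file`, and `date` as required is an assumption based on the entries in this commit. Only a subset of the entries is shown.

```python
from collections import Counter

# Assumption: sources.yaml was already parsed into dicts/lists by a YAML library.
# Subset of the PageIndex entries recorded in this commit.
manifest = {
    "repos": [
        {
            "repo": "https://github.com/VectifyAI/PageIndex",
            "license": "MIT",
            "cloned_dir": "PageIndex",
            "extracted": [
                {"id": "flatten_tree_py_core", "source_file": "pageindex/utils.py", "date": "2026-04-05"},
                {"id": "parse_page_range_py_core", "source_file": "pageindex/retrieve.py", "date": "2026-04-05"},
                {"id": "extract_markdown_headers_py_core", "source_file": "pageindex/page_index_md.py", "date": "2026-04-05"},
                {"id": "extract_pdf_text_py_core", "source_file": "pageindex/utils.py", "date": "2026-04-05"},
            ],
        }
    ]
}

# Assumption: every extracted entry carries these three keys.
REQUIRED_KEYS = {"id", "source_file", "date"}


def summarize(manifest):
    """Hypothetical helper: count extracted functions per (cloned_dir, source_file)."""
    counts = Counter()
    for repo in manifest["repos"]:
        for entry in repo.get("extracted", []):
            missing = REQUIRED_KEYS - entry.keys()
            if missing:
                raise ValueError(f"{entry.get('id', '?')}: missing {sorted(missing)}")
            counts[(repo["cloned_dir"], entry["source_file"])] += 1
    return counts


summary = summarize(manifest)
for (cloned_dir, source_file), n in sorted(summary.items()):
    print(f"{cloned_dir}/{source_file}: {n}")
```

A check like this could run before committing manifest changes, so an entry missing its source file or date is caught early.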

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-05 17:12:05 +02:00
parent ac05aa489c
commit 462c1b7a66
5 changed files with 623 additions and 2 deletions
@@ -18,6 +18,71 @@
# 5. Update this manifest with the extracted functions
repos:
  - repo: https://github.com/VectifyAI/PageIndex
    license: MIT
    cloned_dir: PageIndex
    extracted:
      # Pure — tree manipulation (8)
      - id: flatten_tree_py_core
        source_file: pageindex/utils.py
        date: 2026-04-05
      - id: tree_to_flat_list_py_core
        source_file: pageindex/utils.py
        date: 2026-04-05
      - id: get_leaf_nodes_py_core
        source_file: pageindex/utils.py
        date: 2026-04-05
      - id: write_node_ids_py_core
        source_file: pageindex/utils.py
        date: 2026-04-05
      - id: list_to_tree_py_core
        source_file: pageindex/utils.py
        date: 2026-04-05
      - id: remove_tree_fields_py_core
        source_file: pageindex/utils.py
        date: 2026-04-05
      - id: format_tree_structure_py_core
        source_file: pageindex/utils.py
        date: 2026-04-05
      - id: create_node_mapping_py_core
        source_file: pageindex/utils.py
        date: 2026-04-05
      # Pure — text/JSON extraction (2)
      - id: extract_json_from_llm_py_core
        source_file: pageindex/utils.py
        date: 2026-04-05
      - id: parse_page_range_py_core
        source_file: pageindex/retrieve.py
        date: 2026-04-05
      # Pure — markdown parsing (2)
      - id: extract_markdown_headers_py_core
        source_file: pageindex/page_index_md.py
        date: 2026-04-05
      - id: build_tree_from_headers_py_core
        source_file: pageindex/page_index_md.py
        date: 2026-04-05
      # Pure — pagination/chunking (2)
      - id: page_list_to_groups_py_core
        source_file: pageindex/page_index.py
        date: 2026-04-05
      - id: calculate_page_offset_py_core
        source_file: pageindex/page_index.py
        date: 2026-04-05
      # Impure — LLM wrappers (2)
      - id: llm_completion_retry_py_core
        source_file: pageindex/utils.py
        date: 2026-04-05
      - id: llm_acompletion_retry_py_core
        source_file: pageindex/utils.py
        date: 2026-04-05
      # Impure — PDF extraction (2)
      - id: extract_pdf_text_py_core
        source_file: pageindex/utils.py
        date: 2026-04-05
      - id: get_pdf_page_tokens_py_core
        source_file: pageindex/utils.py
        date: 2026-04-05
  - repo: https://gitea-dgg044oo04woo4ggcsws4gk0.organic-machine.com/Bl4cksmith/Frontend_Library
    license: MIT
    cloned_dir: frontend_library