feat: Python functions for datascience, finance, cybersecurity and pipelines

Datascience: aggregate_by_group, deduplicate_entities/relations, detect_drift, diff_entities/relations, extract_entities/relations_llm, hotness_score, melt, merge_graphs, pivot, build_entity/relation_schema_prompt.
Finance: avellaneda_stoikov_quotes, generate_gbm_prices, generate_taker_order, hawkes_intensity + finance.py module.
Cybersecurity: envelope_encrypt/decrypt + cybersecurity.py module.
Pipelines: extraction_pipeline, monte_carlo_market, run_market_sim.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-05 17:11:32 +02:00
parent 25a392df48
commit 63a9cb5273
62 changed files with 5376 additions and 0 deletions
@@ -0,0 +1,87 @@
+---
+name: extract_entities_llm
+kind: function
+lang: py
+domain: datascience
+version: "1.0.0"
+purity: impure
+signature: "def extract_entities_llm(text: str, entity_schema: list[dict], llm_chat_json: Callable[[list[dict]], dict], language_instruction: str = 'Respond in English.') -> list[EntityCandidate]"
+description: "Extracts entities from a text chunk using an injected LLM. Builds the system prompt from the schema, calls the LLM, and validates the response, returning a list of EntityCandidate. Invalid JSON or a type_ref outside the schema is discarded with a warning."
+tags: [llm, extraction, entity, nlp, osint, graph, fuzzygraph, datascience, prompt]
+uses_functions: []
+uses_types: [entity_candidate_py_datascience]
+returns: []
+returns_optional: false
+error_type: "error_go_core"
+imports: [warnings, typing.Callable]
+tested: true
+tests:
+  - "text with clear entities returns EntityCandidate"
+  - "text without entities returns empty list"
+  - "llm returning malformed json returns empty list with warning"
+  - "invalid type_ref in response is discarded with warning"
+  - "confidence is propagated correctly"
+  - "empty schema raises ValueError"
+test_file_path: "python/functions/datascience/extract_entities_llm_test.py"
+file_path: "python/functions/datascience/extract_entities_llm.py"
+---
+
+## Example
+
+```python
+import json
+from extract_entities_llm import extract_entities_llm
+
+# LLM stub for tests; in production use litellm or similar
+def mock_llm(messages: list[dict]) -> dict:
+    return {
+        "entities": [
+            {
+                "name": "John Smith",
+                "type_ref": "osint_person_go_cybersecurity",
+                "attributes": {"full_name": "John Smith", "nationality": "US"},
+                "confidence": 0.95,
+            },
+            {
+                "name": "evil-corp.com",
+                "type_ref": "osint_domain_go_cybersecurity",
+                "attributes": {"fqdn": "evil-corp.com"},
+                "confidence": 0.88,
+            },
+        ]
+    }
+
+schema = [
+    {
+        "type_ref": "osint_person_go_cybersecurity",
+        "label": "Person",
+        "metadata_fields": ["full_name", "alias", "nationality", "dob", "risk_score"],
+    },
+    {
+        "type_ref": "osint_domain_go_cybersecurity",
+        "label": "Domain",
+        "metadata_fields": ["fqdn", "registrar", "created_date"],
+    },
+]
+
+text = "John Smith, a US citizen, was linked to the domain evil-corp.com."
+candidates = extract_entities_llm(text, schema, mock_llm)
+# [EntityCandidate(name='John Smith', type_ref='osint_person_go_cybersecurity', confidence=0.95),
+#  EntityCandidate(name='evil-corp.com', type_ref='osint_domain_go_cybersecurity', confidence=0.88)]
+```
+
+## Notes
+
+**LLM dependency injection:** `llm_chat_json` receives messages in OpenAI format (`[{"role": "system", "content": "..."}, ...]`) and returns a `dict` with the response already parsed as JSON. This decouples the function from any specific client: it can be used with OpenAI, Anthropic via litellm, or any mock.
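As an illustration of that contract, here is a minimal sketch of an adapter that turns a client returning raw text into a callable shaped like `llm_chat_json`. The names `make_llm_chat_json`, `complete_text`, and `fake_complete_text` are hypothetical and not part of this commit; a real client call would replace the stand-in.

```python
import json
from typing import Callable


def make_llm_chat_json(
    complete_text: Callable[[list[dict]], str],
) -> Callable[[list[dict]], dict]:
    """Wrap a text-returning chat client so it matches the llm_chat_json shape."""

    def llm_chat_json(messages: list[dict]) -> dict:
        raw = complete_text(messages)
        # json.loads raises json.JSONDecodeError (a ValueError) on malformed output.
        parsed = json.loads(raw)
        if not isinstance(parsed, dict):
            raise ValueError("expected a JSON object at the top level")
        return parsed

    return llm_chat_json


# Hypothetical stand-in for a real client call (assumption, for illustration only).
def fake_complete_text(messages: list[dict]) -> str:
    return json.dumps(
        {
            "entities": [
                {
                    "name": "John Smith",
                    "type_ref": "osint_person_go_cybersecurity",
                    "attributes": {"full_name": "John Smith"},
                    "confidence": 0.95,
                }
            ]
        }
    )


llm = make_llm_chat_json(fake_complete_text)
result = llm(
    [
        {"role": "system", "content": "Extract entities as JSON."},
        {"role": "user", "content": "John Smith, a US citizen."},
    ]
)
```

Because the adapter raises on malformed output, the exception path described in the notes is the one a caller of the extraction function would exercise.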
+
+**type_ref validation:** Only entities whose `type_ref` appears in `entity_schema` are accepted. Entities with an unknown type_ref are discarded with `warnings.warn` (no exception is raised), which keeps the function resilient to LLM hallucinations.
+
+**Invalid JSON handling:** If `llm_chat_json` raises an exception or returns a dict without the `entities` key, an empty list is returned and a warning is emitted. The caller can decide whether to retry.
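One way a caller might act on that is sketched below: it retries only when the call returned nothing and also emitted a warning, so a legitimately empty result is not retried. `extract_with_retry` and `flaky_extract` are hypothetical helpers, not part of this commit, and the sketch assumes failures are signalled through `warnings.warn` as described above.

```python
import warnings
from typing import Callable


def extract_with_retry(extract: Callable[[], list], max_attempts: int = 3) -> list:
    """Re-run extract() while it returns an empty list and emits a warning."""
    result: list = []
    for _ in range(max_attempts):
        # Record warnings instead of printing them so the failure can be detected.
        # Note: recorded warnings are swallowed here; a real caller may want to log them.
        with warnings.catch_warnings(record=True) as caught:
            warnings.simplefilter("always")
            result = extract()
        if result or not caught:
            return result
    return result


# Hypothetical extractor that fails once, then succeeds (for illustration only).
calls = {"count": 0}


def flaky_extract() -> list:
    calls["count"] += 1
    if calls["count"] == 1:
        warnings.warn("LLM returned malformed JSON")
        return []
    return ["entity"]


out = extract_with_retry(flaky_extract)
```

In real use the zero-argument callable would typically be a closure such as `lambda: extract_entities_llm(text, schema, llm)`.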
+
+**Confidence clamping:** The confidence value is automatically clamped to the range [0.0, 1.0].
+
+**Null attributes:** Attributes whose value is `None` are filtered out of the attributes dict to keep the output clean.
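The validation rules in these notes (schema membership, missing `entities` key, clamping, and dropping `None` attributes) can be sketched as a single pass over the response. This is not the committed implementation: `validate_entities` is a hypothetical name, plain dicts stand in for `EntityCandidate` because its fields are not shown here, and the default confidence of 0.0 is an assumption.

```python
import warnings


def validate_entities(response: dict, entity_schema: list[dict]) -> list[dict]:
    """Filter and normalise raw LLM entities according to the rules in the notes."""
    allowed = {item["type_ref"] for item in entity_schema}
    raw_entities = response.get("entities") if isinstance(response, dict) else None
    if not isinstance(raw_entities, list):
        warnings.warn("LLM response has no 'entities' list; returning no candidates")
        return []

    validated = []
    for raw in raw_entities:
        if not isinstance(raw, dict):
            continue
        type_ref = raw.get("type_ref")
        if type_ref not in allowed:
            warnings.warn(f"discarding entity with unknown type_ref: {type_ref!r}")
            continue
        # Clamp confidence to [0.0, 1.0]; the 0.0 default is an assumption.
        confidence = min(1.0, max(0.0, float(raw.get("confidence", 0.0))))
        # Drop attributes whose value is None.
        attributes = {
            key: value
            for key, value in (raw.get("attributes") or {}).items()
            if value is not None
        }
        validated.append(
            {
                "name": raw.get("name"),
                "type_ref": type_ref,
                "attributes": attributes,
                "confidence": confidence,
            }
        )
    return validated


schema = [{"type_ref": "osint_person_go_cybersecurity"}]
response = {
    "entities": [
        {
            "name": "John Smith",
            "type_ref": "osint_person_go_cybersecurity",
            "attributes": {"full_name": "John Smith", "alias": None},
            "confidence": 1.7,
        },
        {"name": "Mars", "type_ref": "planet", "attributes": {}, "confidence": 0.9},
    ]
}
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    validated = validate_entities(response, schema)
```

Here the out-of-range confidence is clamped, the `alias` attribute is dropped, and the entity with the unknown `type_ref` is discarded with one warning.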
+
+**source_chunk_indices:** This function does not set `source_chunk_indices`; that field is filled in by the outer pipeline, which knows the index of the current chunk.
+
+This function is the atomic building block of extraction. The full graph pipeline calls it for each chunk of the document and then deduplicates the resulting candidates.
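A rough sketch of that outer loop follows, assuming candidates are plain dicts and that duplicates are merged on a case-insensitive `(name, type_ref)` key. The commit ships its own `extraction_pipeline` and `deduplicate_entities`, whose signatures are not shown here, so `extract_from_chunks` and `stub_extract` are hypothetical stand-ins and the merge rule is an assumption.

```python
from typing import Callable


def extract_from_chunks(
    chunks: list[str],
    extract_chunk: Callable[[str], list[dict]],
) -> list[dict]:
    """Run a per-chunk extractor, record chunk indices, and merge duplicates."""
    merged: dict[tuple[str, str], dict] = {}
    for index, chunk in enumerate(chunks):
        for candidate in extract_chunk(chunk):
            key = (candidate["name"].strip().lower(), candidate["type_ref"])
            if key not in merged:
                # The pipeline, not the extractor, knows the chunk index.
                merged[key] = {**candidate, "source_chunk_indices": [index]}
                continue
            existing = merged[key]
            if index not in existing["source_chunk_indices"]:
                existing["source_chunk_indices"].append(index)
            # Keeping the highest confidence is an assumed merge rule.
            existing["confidence"] = max(existing["confidence"], candidate["confidence"])
    return list(merged.values())


# Hypothetical keyword-based extractor standing in for extract_entities_llm.
def stub_extract(chunk: str) -> list[dict]:
    found = []
    if "John Smith" in chunk:
        found.append(
            {
                "name": "John Smith",
                "type_ref": "osint_person_go_cybersecurity",
                "confidence": 0.9 if "citizen" in chunk else 0.7,
            }
        )
    if "evil-corp.com" in chunk:
        found.append(
            {
                "name": "evil-corp.com",
                "type_ref": "osint_domain_go_cybersecurity",
                "confidence": 0.88,
            }
        )
    return found


chunks = ["John Smith, a US citizen.", "John Smith was linked to evil-corp.com."]
entities = extract_from_chunks(chunks, stub_extract)
```

With these two chunks, the person appears once with both chunk indices recorded, and the domain appears once with only the second index.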