| field | value |
|---|---|
| name | gpu_reduce |
| kind | function |
| lang | cpp |
| domain | gfx |
| version | 1.0.0 |
| purity | impure |
| signature | GpuReduce gpu_reduce_create(int max_n_samples); float gpu_reduce_run(GpuReduce&, ReduceOp op, const Ssbo& samples, int count); float gpu_reduce_mean(GpuReduce&, const Ssbo& samples, int count); void gpu_reduce_destroy(GpuReduce&) |
| description | Parallel reduction over a float[] SSBO: sum, min, max, mean. Workgroup-shared tree reduction (local size 256). Each workgroup writes one partial; the final reduction runs CPU-side over the N/256 partials. |
| tags | |
| uses_functions | |
| uses_types | |
| returns | |
| returns_optional | false |
| error_type | error_go_core |
| imports | |
| tested | false |
| tests | |
| test_file_path | |
| file_path | cpp/functions/gfx/gpu_reduce.cpp |
| framework | opengl |
| params | |
| output | Reduced scalar. Blocks (includes the readback of the ~N/256 partials to the CPU). For N = 10^6 that is about 3,907 partials, roughly 15 KiB of readback (negligible). |
gpu_reduce
Parallel GPU reduction with CPU-side finalization. Useful for summary metrics over an SSBO of samples without reading the whole buffer back to the CPU.
Pattern
auto r = fn::gfx::gpu_reduce_create(/*max_n=*/10'000'000);
// After a dispatch that fills samples_ssbo:
float total = fn::gfx::gpu_reduce_run(r, fn::gfx::ReduceOp::Sum, samples, N);
float lo = fn::gfx::gpu_reduce_run(r, fn::gfx::ReduceOp::Min, samples, N);
float hi = fn::gfx::gpu_reduce_run(r, fn::gfx::ReduceOp::Max, samples, N);
float mean = fn::gfx::gpu_reduce_mean(r, samples, N);
fn::gfx::gpu_reduce_destroy(r);
Performance
Workgroup-shared tree reduction: each workgroup processes 256 elements in log2(256) = 8 steps over shared memory (no atomics). For N = 10^7 that is ceil(10^7 / 256) = 39,063 workgroups and a readback of 39,063 floats (about 153 KiB); the total is about 2 ms on an RTX 3070.
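The two-stage scheme described above can be sketched on the CPU to make the data flow concrete. This is only an illustrative simulation, not the module's shader or its host code: `workgroup_sum`, `reduce_partials` and `finalize_sum` are hypothetical names, only the Sum operation is shown, and the inner loop stands in for what would be parallel invocations separated by barriers on the GPU.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t kLocalSize = 256;  // assumed to match the "local 256" workgroup size

// Simulates one workgroup: load up to 256 samples into "shared memory",
// pad the tail with the Sum identity (0), then halve the active range
// log2(256) = 8 times.
inline float workgroup_sum(const float* samples, std::size_t count) {
    assert(count <= kLocalSize);
    float shared[kLocalSize];
    for (std::size_t i = 0; i < kLocalSize; ++i)
        shared[i] = (i < count) ? samples[i] : 0.0f;
    for (std::size_t stride = kLocalSize / 2; stride > 0; stride /= 2) {
        // On a GPU each i would be a separate invocation, followed by a barrier.
        for (std::size_t i = 0; i < stride; ++i)
            shared[i] += shared[i + stride];
    }
    return shared[0];
}

// Stage 1: one partial per workgroup, ceil(n / 256) entries in total.
inline std::vector<float> reduce_partials(const std::vector<float>& samples) {
    const std::size_t groups = (samples.size() + kLocalSize - 1) / kLocalSize;
    std::vector<float> partials(groups);
    for (std::size_t g = 0; g < groups; ++g) {
        const std::size_t begin = g * kLocalSize;
        const std::size_t len = std::min(kLocalSize, samples.size() - begin);
        partials[g] = workgroup_sum(samples.data() + begin, len);
    }
    return partials;
}

// Stage 2: CPU-side finalization over the partials.
// An empty partial list yields the Sum identity, 0.
inline float finalize_sum(const std::vector<float>& partials) {
    float total = 0.0f;
    for (float p : partials) total += p;
    return total;
}
```

Padding the last workgroup with the identity keeps the tree loop branch-free when N is not a multiple of 256; Min and Max would pad with +inf and -inf instead.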
Notes
- The readback is synchronous. If you call several reductions on the same SSBO in succession (sum, min, max), each one pays the full round trip. For multi-output metrics, consider a custom kernel that computes them in a single pass.
- Variance / std are not included: they depend on the mean, so they require two passes. Implement them as a custom function on top of this reduce.
- count <= 0 or empty partials return the identity (Sum=0, Min=+inf, Max=-inf).
- For uint reductions (histogram counts) this module does not apply; use gpu_histogram_1d/2d, which already emit counts directly.
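As a rough CPU reference for the first three notes, the sketch below computes sum, min and max in one fused pass, starts from the identities listed above, and derives variance with a second pass over the data. `Stats`, `fused_stats` and `variance_two_pass` are hypothetical names, not part of this module; returning 0 for the variance of an empty input and using the population (divide by N) form are assumptions of this sketch. A GPU version would follow the same structure with one kernel per pass.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Accumulators start at the identities: Sum=0, Min=+inf, Max=-inf.
struct Stats {
    float sum = 0.0f;
    float min = std::numeric_limits<float>::infinity();
    float max = -std::numeric_limits<float>::infinity();
};

// Pass 1 (fused): sum, min and max in a single sweep, so the data is read once
// instead of once per operation.
inline Stats fused_stats(const std::vector<float>& samples) {
    Stats s;
    for (float x : samples) {
        s.sum += x;
        s.min = std::min(s.min, x);
        s.max = std::max(s.max, x);
    }
    return s;
}

// Pass 2: variance needs the mean first, hence the second sweep.
// Assumption: population variance (divide by N); empty input returns 0.
inline float variance_two_pass(const std::vector<float>& samples) {
    if (samples.empty()) return 0.0f;
    const float n = static_cast<float>(samples.size());
    const float mean = fused_stats(samples).sum / n;
    float sq_dev = 0.0f;
    for (float x : samples) {
        const float d = x - mean;
        sq_dev += d * d;
    }
    return sq_dev / n;
}
```

For example, the samples {1, 2, 3, 4} give sum 10, min 1, max 4 and a population variance of 1.25.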