fn_registry/dev/issues/completed/0049h-graph-force-layout-gpu.md

---
id: "0049h"
title: "`graph_force_layout_gpu`: compute shader + spatial hash"
status: completado
type: feature
domain: []
scope: multi-app
priority: media
depends: []
blocks: []
related: []
created: 2026-05-17
updated: 2026-05-17
tags: []
---
# 0049h — `graph_force_layout_gpu`: compute shader + spatial hash

## Metadata

| Campo | Valor |
|-------|-------|
| **ID** | 0049h |
| **Estado** | pendiente |
| **Prioridad** | media-alta |
| **Tipo** | feature performance — parte de [#0049](0049-osint-graph-viewer.md) |

## Dependencias

**Bloqueada por:** [0049b](0049b-cpp-bump-gl-43.md) (compute shaders), [0049e](0049e-graph-types-extended.md) (`flags.pinned`).

---

## Objetivo

Implementar el force-directed layout en GPU usando compute shaders y SSBOs. Repulsion via spatial hash grid (no Barnes-Hut), atraccion por aristas, integracion con clamp y respeto de `NF_PINNED`. API consistente con la version CPU para que el consumer pueda swappear.

## Contexto

A 50k+ nodos el CPU layout (Barnes-Hut, ~5-10 ms/frame con 20k nodos) impide 60fps con margen. GPU compute con grid hash:
- Cada celda contiene punteros a sus nodos (atomic insert).
- Repulsion: cada nodo lee solo su celda + 8 vecinas (3×3) → O(N · density) en vez de O(N log N).
- Sin estructuras dinamicas → todo arrays planos en SSBO.

## Arquitectura

```
cpp/functions/viz/
├── graph_force_layout_gpu.h           # NEW
├── graph_force_layout_gpu.cpp         # NEW
├── graph_force_layout_gpu.md          # NEW

cpp/functions/viz/shaders/              # NEW dir (o inline strings)
├── force_clear_grid.comp               # NEW: zero counters
├── force_build_grid.comp               # NEW: insert nodes en celdas (atomic counters)
├── force_repulsion.comp                # NEW: leer celdas vecinas, acumular fuerza
├── force_attraction.comp               # NEW: por arista, acumular spring force
└── force_integrate.comp                # NEW: v += f, clamp, x += v, respeta pinned

cpp/tests/
└── test_graph_force_layout_gpu.cpp     # NEW
```

### API

```cpp
struct ForceLayoutGPU;

ForceLayoutGPU* graph_force_layout_gpu_create(int max_nodes, int max_edges,
                                              int grid_cells_per_side = 64);
void            graph_force_layout_gpu_upload(ForceLayoutGPU*, const GraphData&);   // copy positions/edges to GPU
float           graph_force_layout_gpu_step  (ForceLayoutGPU*, GraphData&,
                                              const ForceLayoutConfig&);            // returns total energy
void            graph_force_layout_gpu_readback(ForceLayoutGPU*, GraphData&);       // sync GPU positions back to CPU mirror
void            graph_force_layout_gpu_destroy(ForceLayoutGPU*);
```

### Buffers GPU

| SSBO | Layout | Tamaño |
|---|---|---|
| `positions`  | `vec2[N]`  | 8 × N |
| `velocities` | `vec2[N]`  | 8 × N |
| `forces`     | `vec2[N]`  | 8 × N (working) |
| `flags`      | `uint[N]`  | 4 × N |
| `edges`      | `uvec2[E]` | 8 × E |
| `weights`    | `float[E]` | 4 × E |
| `grid_counts`| `uint[G²]` | 4 × 64² = 16 KB |
| `grid_cells` | `uint[G²][K]` | 4 × G² × max_nodes_per_cell (32) |

Para 100k nodos: ~3 MB en SSBOs — trivial.

## Tareas

### Fase 1 — Esqueleto + buffer alloc

- [ ] **1.1** Crear `ForceLayoutGPU` con `max_nodes`/`max_edges`. Crear todos los SSBOs.
- [ ] **1.2** `_upload`: empaqueta posiciones/aristas/flags y llama `glBufferSubData`.
- [ ] **1.3** Compilar los 5 compute shaders al crear (helper `compile_compute(src)`).

### Fase 2 — Compute shaders

- [ ] **2.1** `force_clear_grid.comp`: 1 thread por celda, zero counter.
- [ ] **2.2** `force_build_grid.comp`: 1 thread por nodo, calcula celda, `atomicAdd(grid_counts[ci], 1)`, escribe en `grid_cells[ci][slot]` si `slot < K`.
- [ ] **2.3** `force_repulsion.comp`: 1 thread por nodo. Recorre las 9 celdas vecinas, para cada otro nodo calcula `F = repulsion / dist²`. Acumula en `forces[i]`.
- [ ] **2.4** `force_attraction.comp`: 1 thread por arista, atomic add a `forces[source]` y `forces[target]`.
- [ ] **2.5** `force_integrate.comp`: 1 thread por nodo. Si `flags & NF_PINNED`, skip. `v = damping*v + F`, clamp a `max_velocity`, `x += v`. Atomic add a `total_energy_ssbo`.

### Fase 3 — Step pipeline

- [ ] **3.1** `_step` despacha los 5 compute shaders en orden con `glMemoryBarrier(GL_SHADER_STORAGE_BARRIER_BIT)` entre cada uno.
- [ ] **3.2** Ultimo barrier: `GL_BUFFER_UPDATE_BARRIER_BIT` para que `_readback` vea valores frescos.
- [ ] **3.3** Bounds calculados en CPU tras readback (cheap).

### Fase 4 — Readback eficiente

- [ ] **4.1** `_readback` usa `glGetBufferSubData` solo de `positions` (8 × N bytes). Para 50k nodos = 400 KB; CPU mirror se actualiza cada frame.
- [ ] **4.2** Documentar tradeoff: si el consumer no necesita mirror cada frame (p.ej. solo la app que dibuja con la GPU), `_readback` es opcional.

### Fase 5 — Tests

- [ ] **5.1** Test smoke: 100 nodos, 200 aristas; correr 100 steps; verificar que la energia decrece.
- [ ] **5.2** Test pinned: pinear 1 nodo en (0,0); tras 100 steps debe seguir en (0,0).
- [ ] **5.3** Test consistency CPU-vs-GPU: para un grafo pequeño (50 nodos), la energia tras N steps debe diferir < 5% entre la version CPU y la GPU (no exacto por floating-point + grid approximation, pero comparable).
- [ ] **5.4** Test SKIP en WSL si no hay context con compute (el harness ya skipea si no hay GL context).

### Fase 6 — Demo

- [ ] **6.1** Anadir toggle "GPU layout" en `demos_graph` que swappea CPU ↔ GPU. Mostrar fps y energia en tiempo real.
- [ ] **6.2** Probar con 50k nodos.

### Fase 7 — Cleanup

- [ ] Frontmatter `.md` con `purity: impure`, `uses_types: [graph_data_cpp_viz, force_layout_config_cpp_viz]`.
- [ ] `fn index`.
- [ ] Commit `feat(viz): graph_force_layout_gpu compute + spatial hash`.

## Criterio de done

- [ ] 50k nodos + 200k aristas a 60fps con layout iterativo corriendo.
- [ ] Tests verdes (smoke + pinned + consistency).
- [ ] CPU usage < 10% durante el layout (medible con Tracy).
- [ ] Toggle CPU/GPU en `demos_graph` operativo.

## Riesgos

| Riesgo | Mitigacion |
|---|---|
| `max_nodes_per_cell` overflow | Si una celda satura, los excedentes se ignoran (con warning); ajustar `grid_cells_per_side` mas grande o aumentar `K` |
| Atomic contention en repulsion (escrituras a `forces[i]`) | Cada nodo es escrito por un solo thread (su propio); lecturas si concurrentes pero readonly de `positions` |
| Atomic add sobre `forces` en attraction (multiple aristas tocan el mismo nodo) | Usar `atomicAdd` real sobre uvec2 packed, o serializar con un buffer temp y sumar despues. Alternativa: un compute pass por dimension |
| Diferencias driver entre vendors (Mesa, NV, AMD) | Usar `#version 430 core`, evitar features vendor-specific. Test en al menos un GPU NVIDIA y uno AMD/Mesa |