---
id: "0106"
title: "App services_monitor: cross-PC dashboard of active services"
status: pendiente
type: app
domain:
  - apps-infra
  - cpp-stack
  - telemetry
  - deploy
scope: multi-app
priority: alta
depends:
  - "0105"
blocks: []
related:
  - "0085"
created: 2026-05-17
updated: 2026-05-17
tags: [services, monitoring, cross-pc, ssh, systemd, healthcheck, dashboard]
---

# 0106 - App `services_monitor`

## Problem

`fn doctor services` gives a point-in-time snapshot of the local PC. What is missing is a live cross-PC view:

- Which of my 10 services are alive on aurgi-pc?
- Which on organic-machine.com?
- Which died without my noticing (the sqlite_api case, 2026-05-17)?

## Decision

ImGui app `services_monitor` consuming a Go backend `services_api` (port 8485). It reconciles the expected state (`service_targets` + `apps.*` from the registry) against the actual state (systemd state + port listening + HTTP health) on each target PC. Historical persistence = transitions + hourly aggregate.

## Components

### Backend `apps/services_api/` (Go, tag: service, port 8485)

Endpoints:

- `GET /api/services`: flat list `(app_id, pc_id, expected, actual, port, last_check_ts, last_healthy_ts, transitions_24h)`
- `GET /api/services/:app/:pc`: detail + last N transitions + journalctl tail
- `POST /api/services/:app/:pc/check`: forces an immediate check
- `POST /api/services/:app/:pc/action` (action=start|stop|restart): feature flag OFF in v1
- `GET /api/pcs`: state per PC (reachable, lag_ms, version_uname)
- `GET /api/ws/services`: WS push of the delta on every check

Worker pool: 10s cycle per PC, run in parallel.

Local checker (is_local_only=true or PC = self): exec `systemctl --user is-active ` + `ss -tln | grep :` + `curl -m `.

Remote checker: `ssh_exec_go_infra` with the same commands + output parsing.
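
The checker path selection and the reduction of the three signals to a single `actual` value could be sketched as follows. This is a minimal sketch under assumptions, not the planned implementation: the `Runner` interface, `LocalRunner`, `PickRunner`, `PortListening`, `Probe` and `Actual` are hypothetical names, the `"degraded"` state is an assumption that does not appear in this issue, and the remote runner is left as an interface because the signature of `ssh_exec_go_infra` is not given here. The port check parses the `ss -tln` output by field instead of using `grep :`, so that port 8485 does not match a listener on 84850.

```go
package main

import (
	"context"
	"os/exec"
	"strconv"
	"strings"
)

// Runner abstracts how a check command is executed (hypothetical interface).
// A remote implementation would wrap ssh_exec_go_infra.
type Runner interface {
	Run(ctx context.Context, cmd string) (stdout string, err error)
}

// LocalRunner runs the command through sh -c on this machine.
type LocalRunner struct{}

func (LocalRunner) Run(ctx context.Context, cmd string) (string, error) {
	// Output still returns the captured stdout when the command exits non-zero,
	// which matters because `systemctl is-active` does so for inactive units.
	out, err := exec.CommandContext(ctx, "sh", "-c", cmd).Output()
	return strings.TrimSpace(string(out)), err
}

// PickRunner selects local exec for the PC flagged as self, SSH otherwise.
func PickRunner(pcIsSelf bool, remote Runner) Runner {
	if pcIsSelf {
		return LocalRunner{}
	}
	return remote
}

// PortListening reports whether `ss -tln` output has a LISTEN socket whose
// local address ends in exactly ":<port>".
func PortListening(ssOutput string, port int) bool {
	want := strconv.Itoa(port)
	for _, line := range strings.Split(ssOutput, "\n") {
		f := strings.Fields(line)
		if len(f) < 4 || f[0] != "LISTEN" {
			continue // header or unrelated line
		}
		i := strings.LastIndex(f[3], ":")
		if i >= 0 && f[3][i+1:] == want {
			return true
		}
	}
	return false
}

// Probe holds the three raw signals collected for one (app, pc) pair.
type Probe struct {
	SystemdState  string // stdout of `systemctl --user is-active`
	PortListening bool
	HTTPStatus    int // 0 when the health request failed
}

// Actual collapses the signals into one state. "degraded" is an assumed
// intermediate state for an active unit that is not serving correctly.
func Actual(p Probe) string {
	switch {
	case p.SystemdState != "active":
		return "inactive"
	case !p.PortListening || p.HTTPStatus < 200 || p.HTTPStatus >= 300:
		return "degraded"
	default:
		return "running"
	}
}
```

Keeping the parsing and classification as pure functions means the local and remote paths differ only in the `Runner` they use, and both can be unit-tested without systemd or SSH.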

DB: `services_api.db`:

- `service_check`: append-only (ts, app_id, pc_id, systemd_state, port_listening, http_status, latency_ms)
- `service_transition`: (ts, app_id, pc_id, from, to)
- `service_state_hourly`: (hour_bucket, app_id, pc_id, healthy_ratio, transitions)

### Frontend `apps/services_monitor/` (C++ ImGui)

`data_factory` pattern. Panels:

1. **Overview**: grid `pcs x apps`. Cell = traffic-light status. Click => Detail.
2. **PC Detail**: apps expected on the PC, expected vs actual drift, restart action (disabled in v1).
3. **App Detail**: per app, the state on each PC, transitions over the last 7d, mini chart of hourly healthy_ratio.
4. **Live (WS)**: stream of transitions.
5. **Alerts**: apps with expected=running AND actual=inactive > 5min. (v1 only lists them; notifications are handled separately.)

UI: `data_table_cpp_viz`, `badge_cpp_core`, `empty_state_cpp_core`.

## Closed decisions (2026-05-17)

1. **Local is special**: the local PC is NOT checked via SSH. `pc_is_self` flag per PC. The checker selects the path: local exec vs ssh exec.
2. **Persistence**: transitions + hourly aggregate. Append-only `service_check` with a 7d TTL (nightly vacuum job).
3. **Auto-start**: NOT in v1. Alert only. Feature flag `services_monitor.auto_fix` OFF.

## Tasks (in order)

- [ ] Migration for `services_api.db`: tables `service_check`, `service_transition`, `service_state_hourly`
- [ ] Registry functions: `port_listening_check_go_infra`, `http_health_probe_go_infra` (if they do not exist yet) via parallel fn-constructor
- [ ] `services_api` MVP: worker loop + `/api/services` + WS
- [ ] systemd unit + Restart=always + update issue 0105 with the 11th service
- [ ] C++ app `services_monitor` scaffold via `fn run init_cpp_app services_monitor`
- [ ] Overview panel + WS client
- [ ] PC Detail + App Detail
- [ ] Alerts panel

## DoD

- 10 services visible in Overview with the correct traffic-light status against ground truth.
- Simulated crash (kill -9 sqlite_api) detected in <15s.
- Recovery (auto-restart via Restart=always) detected and reflected in transitions.
- App launchable on aurgi-pc + home-wsl (without SSH to self).
- Backend `services_api` running as `tag: service` (full dogfooding).