---
id: "0106"
title: "App services_monitor: cross-PC dashboard of active services"
status: pendiente
type: app
domain:
  - apps-infra
  - cpp-stack
  - telemetry
  - deploy
scope: multi-app
priority: alta
depends:
  - "0105"
blocks: []
related:
  - "0085"
created: 2026-05-17
updated: 2026-05-17
tags: [services, monitoring, cross-pc, ssh, systemd, healthcheck, dashboard]
---

# 0106: App `services_monitor`

## Problem

`fn doctor services` gives a point-in-time snapshot of the local PC only. What is missing is a live, cross-PC view:
- Which of my 10 services are alive on aurgi-pc?
- Which on organic-machine.com?
- Which died without my noticing (the sqlite_api case, 2026-05-17)?

## Decision

An ImGui app, `services_monitor`, consuming a Go backend, `services_api` (port 8485). It reconciles the expected state (`service_targets` + `apps.*` from the registry) against the actual state (systemd state + port listening + HTTP health) on each target PC. Historical persistence = transitions + hourly aggregate.
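
The reconciliation described above could look roughly like the following Go sketch. The `Observed` struct, the `Reconcile` function, and the four verdict labels are hypothetical names chosen for illustration; the issue does not define how the three signals are combined, so the rule here (unit active + port listening + 2xx health response = OK) is an assumption.

```go
package main

import "fmt"

// Observed holds the three signals collected for one (app, pc) pair.
type Observed struct {
	SystemdState  string // output of `systemctl is-active`, e.g. "active", "inactive", "failed"
	PortListening bool   // true when the expected port is in LISTEN state
	HTTPStatus    int    // status code of the health probe, 0 when there was no response
}

// Verdict labels are hypothetical; the issue does not name them.
const (
	VerdictOK       = "ok"       // expected and actual agree
	VerdictDegraded = "degraded" // unit is active but the port or health probe fails
	VerdictDown     = "down"     // expected to run, but the unit is not active
	VerdictDrift    = "drift"    // not expected on this PC, yet the unit is active
)

// Reconcile compares the expected state with the observed signals.
func Reconcile(expectedRunning bool, o Observed) string {
	active := o.SystemdState == "active"
	healthy := active && o.PortListening && o.HTTPStatus >= 200 && o.HTTPStatus < 300
	switch {
	case expectedRunning && healthy:
		return VerdictOK
	case expectedRunning && active:
		return VerdictDegraded
	case expectedRunning:
		return VerdictDown
	case active:
		return VerdictDrift
	default:
		return VerdictOK // not expected and not running
	}
}

func main() {
	fmt.Println(Reconcile(true, Observed{SystemdState: "active", PortListening: true, HTTPStatus: 200}))
	fmt.Println(Reconcile(true, Observed{SystemdState: "active", PortListening: false}))
	fmt.Println(Reconcile(true, Observed{SystemdState: "inactive"}))
	fmt.Println(Reconcile(false, Observed{SystemdState: "active", PortListening: true, HTTPStatus: 200}))
}
```

Keeping the rule in one pure function makes it easy to map each verdict to an Overview cell color later without touching the checkers.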

## Components

### Backend `apps/services_api/` (Go, tag: service, port 8485)

Endpoints:
- `GET /api/services`: flat list `(app_id, pc_id, expected, actual, port, last_check_ts, last_healthy_ts, transitions_24h)`
- `GET /api/services/:app/:pc`: detail + last N transitions + journalctl tail
- `POST /api/services/:app/:pc/check`: forces an immediate check
- `POST /api/services/:app/:pc/action` (action=start|stop|restart): behind a feature flag, OFF in v1
- `GET /api/pcs`: per-PC status (reachable, lag_ms, version_uname)
- `GET /api/ws/services`: WebSocket push of the delta after each check

Worker pool: 10 s cycle per PC, with PCs checked in parallel.
Local checker (is_local_only=true or PC = self): runs `systemctl --user is-active <unit>` + `ss -tln | grep :<port>` + `curl -m <timeout> <health_endpoint>`.
Remote checker: `ssh_exec_go_infra` with the same commands, then parses the output.
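
Since both checkers end up parsing the same command output, the parsing step can be sketched in Go independently of how the command was run. `ParseIsActive` and `PortListening` are hypothetical helper names, and the column layout of `ss -tln` (state first, local address fourth) is assumed from typical iproute2 output.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// ParseIsActive normalizes the single-word output of `systemctl is-active`.
func ParseIsActive(out string) string {
	return strings.TrimSpace(out)
}

// PortListening reports whether `ss -tln` output contains a LISTEN socket
// bound to exactly the given local port.
func PortListening(ssOutput string, port int) bool {
	for _, line := range strings.Split(ssOutput, "\n") {
		f := strings.Fields(line)
		if len(f) < 4 || f[0] != "LISTEN" {
			continue // header, blank line, or unexpected row
		}
		local := f[3] // e.g. "127.0.0.1:8485" or "[::]:22"
		i := strings.LastIndex(local, ":")
		if i < 0 {
			continue
		}
		p, err := strconv.Atoi(local[i+1:])
		if err == nil && p == port {
			return true
		}
	}
	return false
}

// sampleSS is illustrative output, not captured from a real host.
const sampleSS = `State   Recv-Q  Send-Q  Local Address:Port  Peer Address:Port
LISTEN  0       4096    127.0.0.1:8485      0.0.0.0:*
LISTEN  0       128     [::]:22             [::]:*
`

func main() {
	fmt.Println(ParseIsActive("active\n"))
	fmt.Println(PortListening(sampleSS, 8485))
	fmt.Println(PortListening(sampleSS, 848))
}
```

Comparing the parsed port number, rather than grepping for `:<port>`, avoids a substring match such as `:8485` also hitting a socket on port 84850.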

DB: `services_api.db`:
- `service_check`: append-only (ts, app_id, pc_id, systemd_state, port_listening, http_status, latency_ms)
- `service_transition`: (ts, app_id, pc_id, from, to)
- `service_state_hourly`: (hour_bucket, app_id, pc_id, healthy_ratio, transitions)
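
To show how the two derived tables relate to `service_check`, the following Go sketch computes transitions and hourly buckets from an ordered series of checks for a single (app, pc) pair. The `Check`, `Transition`, and `HourBucket` types, the `healthy`/`unhealthy` labels, and the use of Unix seconds are assumptions made for the example; the real schema may store richer state.

```go
package main

import "fmt"

// Check is a reduced service_check row for one (app, pc) pair.
type Check struct {
	TS      int64 // Unix seconds (assumed)
	Healthy bool
}

// Transition mirrors a service_transition row.
type Transition struct {
	TS       int64
	From, To string
}

// HourBucket mirrors a service_state_hourly row.
type HourBucket struct {
	Hour         int64 // start of the hour, Unix seconds
	HealthyRatio float64
	Transitions  int
}

func label(healthy bool) string {
	if healthy {
		return "healthy"
	}
	return "unhealthy"
}

// Aggregate expects checks sorted by TS and returns the state changes
// plus one bucket per hour that contains at least one check.
func Aggregate(checks []Check) ([]Transition, []HourBucket) {
	var trs []Transition
	var out []HourBucket
	var healthy, total []int // per-bucket counters, parallel to out
	for i, c := range checks {
		hour := c.TS - c.TS%3600
		if len(out) == 0 || out[len(out)-1].Hour != hour {
			out = append(out, HourBucket{Hour: hour})
			healthy = append(healthy, 0)
			total = append(total, 0)
		}
		last := len(out) - 1
		total[last]++
		if c.Healthy {
			healthy[last]++
		}
		if i > 0 && checks[i-1].Healthy != c.Healthy {
			trs = append(trs, Transition{TS: c.TS, From: label(checks[i-1].Healthy), To: label(c.Healthy)})
			out[last].Transitions++
		}
	}
	for i := range out {
		out[i].HealthyRatio = float64(healthy[i]) / float64(total[i])
	}
	return trs, out
}

func main() {
	checks := []Check{{0, true}, {10, true}, {20, false}, {30, true}, {3600, true}}
	trs, hours := Aggregate(checks)
	fmt.Println(trs)
	fmt.Println(hours)
}
```

With the sample series, the first hour holds four checks (three healthy) and two transitions, and the second hour holds one healthy check.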

### Frontend `apps/services_monitor/` (C++ ImGui)

`data_factory` pattern. Panels:

1. **Overview**: `pcs x apps` grid. Each cell is a traffic light. Clicking a cell opens Detail.
2. **PC Detail**: apps expected on the PC, expected-vs-actual drift, restart action (disabled in v1).
3. **App Detail**: per app, the state on each PC, transitions over the last 7 days, and a mini chart of hourly healthy_ratio.
4. **Live (WS)**: stream of transitions.
5. **Alerts**: apps with expected=running AND actual=inactive for > 5 min. (v1 only lists them; notifications are handled separately.)

UI: `data_table_cpp_viz`, `badge_cpp_core`, `empty_state_cpp_core`.

## Closed decisions (2026-05-17)

1. **Local is special**: the local PC is NOT checked over SSH. A per-PC `pc_is_self` flag lets the checker pick the path: local exec vs SSH exec.
2. **Persistence**: transitions + hourly aggregate. Append-only `service_check` with a 7-day TTL (nightly vacuum job).
3. **Auto-start**: NOT in v1. Alert only. Feature flag `services_monitor.auto_fix` is OFF.

## Tasks (in order)

- [ ] Migration for `services_api.db`: tables `service_check`, `service_transition`, `service_state_hourly`
- [ ] Registry functions: `port_listening_check_go_infra`, `http_health_probe_go_infra` (if they do not exist yet) via a parallel fn-constructor
- [ ] `services_api` MVP: worker loop + `/api/services` + WS
- [ ] systemd unit with Restart=always + update issue 0105 with the 11th service
- [ ] C++ app `services_monitor` scaffold via `fn run init_cpp_app services_monitor`
- [ ] Overview panel + WS client
- [ ] PC Detail + App Detail
- [ ] Alerts panel

## DoD

- 10 services visible in Overview with the correct traffic light against ground truth.
- Simulated failure (kill -9 sqlite_api) detected in <15 s.
- Recovery (auto-restart via Restart=always) detected and reflected in transitions.
- App launchable on aurgi-pc + home-wsl (no SSH to self).
- Backend `services_api` running as `tag: service` (full dogfooding).