fn_registry/dev/issues/0106-services-monitor-app.md
egutierrez b9716a7cd6 chore: snapshot previous WIP + flow 0008 + 7 sub-issues (0112-0119)
Snapshot of WIP accumulated from previous sessions before the wave 1 merge
of flow 0008 (kanban_cpp + agent_runner_api + DoD schema).

Includes:
- dev/flows/0008-kanban-cpp-and-agent-workflows.md
- dev/issues/0112-0119*.md (7 sub-issues)
- Previous WIP in cmd/fn/doctor.go, registry/*, modules/, cpp/, etc.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 18:17:08 +02:00


---
id: "0106"
title: "App services_monitor: cross-PC dashboard of active services"
status: pendiente
type: app
domain:
- apps-infra
- cpp-stack
- telemetry
- deploy
scope: multi-app
priority: alta
depends:
- "0105"
blocks: []
related:
- "0085"
created: 2026-05-17
updated: 2026-05-17
tags: [services, monitoring, cross-pc, ssh, systemd, healthcheck, dashboard]
---
# 0106 — App `services_monitor`
## Problem
`fn doctor services` gives a point-in-time snapshot of the local PC only. A live cross-PC view is missing:
- Which of my 10 services are alive on aurgi-pc?
- Which on organic-machine.com?
- Which died without my noticing (the sqlite_api case, 2026-05-17)?
## Decision
ImGui app `services_monitor` consuming the Go backend `services_api` (port 8485). It reconciles expected state (`service_targets` + `apps.*` from the registry) against actual state (systemd state + port listening + HTTP health) on each target PC. Historical persistence = transitions + hourly aggregate.
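
The expected-vs-actual reconciliation could look roughly like the sketch below. The `Observed` struct, the `Verdict` labels and the precedence between signals are assumptions made for illustration; the issue only fixes which three signals count as "actual".

```go
package main

import "fmt"

// Observed holds the three "actual" signals named in the decision.
type Observed struct {
	SystemdActive bool // systemd unit reports active
	PortListening bool // the expected port is listening
	HTTPHealthy   bool // the health endpoint answered successfully
}

// Verdict is a hypothetical label for one (app, pc) cell.
type Verdict string

const (
	VerdictOK         Verdict = "ok"         // expected and all signals healthy
	VerdictDegraded   Verdict = "degraded"   // expected, unit active, port or health failing
	VerdictDown       Verdict = "down"       // expected but unit not active
	VerdictUnexpected Verdict = "unexpected" // running where it is not expected
	VerdictIdle       Verdict = "idle"       // not expected and not running
)

// Reconcile compares the registry expectation with the observed signals.
func Reconcile(expected bool, o Observed) Verdict {
	switch {
	case expected && o.SystemdActive && o.PortListening && o.HTTPHealthy:
		return VerdictOK
	case expected && o.SystemdActive:
		return VerdictDegraded
	case expected:
		return VerdictDown
	case o.SystemdActive || o.PortListening:
		return VerdictUnexpected
	default:
		return VerdictIdle
	}
}

func main() {
	fmt.Println(Reconcile(true, Observed{SystemdActive: true, PortListening: true, HTTPHealthy: true}))
	fmt.Println(Reconcile(true, Observed{}))
	fmt.Println(Reconcile(false, Observed{SystemdActive: true}))
}
```

The `unexpected` case is what would surface drift in the PC Detail panel; whether v1 needs it is open.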
## Components
### Backend `apps/services_api/` (Go, tag: service, port 8485)
Endpoints:
- `GET /api/services`: flat list `(app_id, pc_id, expected, actual, port, last_check_ts, last_healthy_ts, transitions_24h)`
- `GET /api/services/:app/:pc`: detail + last N transitions + journalctl tail
- `POST /api/services/:app/:pc/check`: forces an immediate check
- `POST /api/services/:app/:pc/action` (action=start|stop|restart): feature flag OFF in v1
- `GET /api/pcs`: per-PC status (reachable, lag_ms, version_uname)
- `GET /api/ws/services`: WebSocket push of the delta on every check
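
A minimal sketch of the `GET /api/services` row and handler, using only the standard library. The JSON field names come from the tuple above; the Go types (strings for `expected`/`actual`, Unix seconds for the timestamps) and the helper names `servicesHandler`, `fetchServices` and `decodeRows` are assumptions.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// ServiceRow mirrors the flat tuple returned by GET /api/services.
// Field types are assumed; the issue only names the columns.
type ServiceRow struct {
	AppID          string `json:"app_id"`
	PCID           string `json:"pc_id"`
	Expected       string `json:"expected"`
	Actual         string `json:"actual"`
	Port           int    `json:"port"`
	LastCheckTS    int64  `json:"last_check_ts"`
	LastHealthyTS  int64  `json:"last_healthy_ts"`
	Transitions24h int    `json:"transitions_24h"`
}

// servicesHandler serves whatever rows() returns as a JSON array.
func servicesHandler(rows func() []ServiceRow) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		if r.Method != http.MethodGet {
			http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
			return
		}
		body, err := json.Marshal(rows())
		if err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}
		w.Header().Set("Content-Type", "application/json")
		_, _ = w.Write(body)
	}
}

// fetchServices calls the handler in-process and returns status and body.
func fetchServices(h http.Handler) (int, string) {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/api/services", nil))
	return rec.Code, rec.Body.String()
}

// decodeRows parses a response body back into rows.
func decodeRows(body string) ([]ServiceRow, error) {
	var rows []ServiceRow
	err := json.Unmarshal([]byte(body), &rows)
	return rows, err
}

func main() {
	rows := func() []ServiceRow {
		return []ServiceRow{{AppID: "services_api", PCID: "aurgi-pc", Expected: "running", Actual: "active", Port: 8485}}
	}
	mux := http.NewServeMux()
	mux.Handle("/api/services", servicesHandler(rows))
	code, body := fetchServices(mux)
	fmt.Println(code, body)
}
```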
Worker pool: one 10 s cycle per PC, run in parallel.
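
One way the worker loop could be shaped, as a sketch: `CheckAll` fans out one goroutine per PC and waits for all of them, and `Run` repeats that on a ticker. The names `CheckAll` and `Run`, the `bool` result and the short demo interval are assumptions; the real cycle would be 10 s.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// CheckAll runs check once per PC, all PCs in parallel, and waits for every result.
func CheckAll(pcs []string, check func(pc string) bool) map[string]bool {
	var (
		mu  sync.Mutex
		wg  sync.WaitGroup
		out = make(map[string]bool, len(pcs))
	)
	for _, pc := range pcs {
		wg.Add(1)
		go func(pc string) {
			defer wg.Done()
			ok := check(pc)
			mu.Lock()
			out[pc] = ok
			mu.Unlock()
		}(pc)
	}
	wg.Wait()
	return out
}

// Run performs one cycle immediately, then one per tick, until ctx is cancelled.
func Run(ctx context.Context, every time.Duration, pcs []string,
	check func(pc string) bool, report func(map[string]bool)) {
	ticker := time.NewTicker(every)
	defer ticker.Stop()
	for {
		report(CheckAll(pcs, check))
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
		}
	}
}

func main() {
	// Demo values: 100 ms cycle for 250 ms instead of the 10 s production cycle.
	ctx, cancel := context.WithTimeout(context.Background(), 250*time.Millisecond)
	defer cancel()
	pcs := []string{"aurgi-pc", "home-wsl"}
	check := func(pc string) bool { return pc != "home-wsl" } // stub checker
	Run(ctx, 100*time.Millisecond, pcs, check, func(r map[string]bool) { fmt.Println(r) })
}
```

Because `CheckAll` waits for every PC, one slow or unreachable host stretches the whole cycle; a per-check timeout or a fully independent loop per PC would avoid that.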
Local checker (is_local_only=true or PC = self): executes `systemctl --user is-active <unit>` + `ss -tln | grep :<port>` + `curl -m <timeout> <health_endpoint>`.
Remote checker: `ssh_exec_go_infra` with the same commands + output parsing.
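
Since both checkers end up parsing the same command output, the parsing step can be sketched independently of how the command was run. The helper names `SystemdActive` and `PortListening` and the sample output are illustrative. Matching the local-address column exactly is a stricter variant of `grep :<port>`, which would also match port 84850 or a peer address.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// SystemdActive interprets the stdout of `systemctl --user is-active <unit>`.
func SystemdActive(out string) bool {
	return strings.TrimSpace(out) == "active"
}

// PortListening scans `ss -tln` output for a LISTEN socket whose local
// address ends in exactly ":<port>".
func PortListening(ssOutput string, port int) bool {
	suffix := ":" + strconv.Itoa(port)
	for _, line := range strings.Split(ssOutput, "\n") {
		fields := strings.Fields(line)
		// Columns: State Recv-Q Send-Q Local-Address:Port Peer-Address:Port
		if len(fields) >= 4 && fields[0] == "LISTEN" && strings.HasSuffix(fields[3], suffix) {
			return true
		}
	}
	return false
}

// sampleSS is illustrative output, not captured from a real host.
const sampleSS = `State   Recv-Q  Send-Q  Local Address:Port  Peer Address:Port
LISTEN  0       4096    127.0.0.1:8485      0.0.0.0:*
LISTEN  0       128     [::]:22             [::]:*
`

func main() {
	fmt.Println(SystemdActive("active\n"), SystemdActive("failed\n"))
	fmt.Println(PortListening(sampleSS, 8485), PortListening(sampleSS, 85))
}
```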
DB: `services_api.db`:
- `service_check` append-only (ts, app_id, pc_id, systemd_state, port_listening, http_status, latency_ms)
- `service_transition` (ts, app_id, pc_id, from, to)
- `service_state_hourly` (hour_bucket, app_id, pc_id, healthy_ratio, transitions)
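
How `service_state_hourly` might be derived from `service_check` rows, as a sketch. It assumes that a check collapses to a single `Healthy` boolean, that timestamps are Unix seconds, that input is sorted by `ts` for one `(app_id, pc_id)` pair, and that a transition is counted in the bucket of the check where the state changed. None of these are stated in the issue.

```go
package main

import "fmt"

// Check is a reduced service_check row for a single (app_id, pc_id) pair.
type Check struct {
	TS      int64 // Unix seconds
	Healthy bool  // assumed: systemd active, port listening, HTTP OK
}

// Hourly is one service_state_hourly row.
type Hourly struct {
	HourBucket   int64   // start of the hour, Unix seconds
	HealthyRatio float64 // healthy checks / total checks in the bucket
	Transitions  int     // state changes observed in the bucket
}

// Aggregate folds checks (sorted by TS ascending) into hourly rows.
func Aggregate(checks []Check) []Hourly {
	var out []Hourly
	var healthy, total int
	for i, c := range checks {
		bucket := c.TS - c.TS%3600
		if len(out) == 0 || out[len(out)-1].HourBucket != bucket {
			out = append(out, Hourly{HourBucket: bucket})
			healthy, total = 0, 0
		}
		cur := &out[len(out)-1]
		total++
		if c.Healthy {
			healthy++
		}
		if i > 0 && checks[i-1].Healthy != c.Healthy {
			cur.Transitions++
		}
		cur.HealthyRatio = float64(healthy) / float64(total)
	}
	return out
}

func main() {
	rows := Aggregate([]Check{
		{TS: 0, Healthy: true}, {TS: 10, Healthy: true},
		{TS: 20, Healthy: false}, {TS: 30, Healthy: true},
		{TS: 3600, Healthy: true},
	})
	for _, r := range rows {
		fmt.Printf("%d ratio=%.2f transitions=%d\n", r.HourBucket, r.HealthyRatio, r.Transitions)
	}
}
```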
### Frontend `apps/services_monitor/` (C++ ImGui)
`data_factory` pattern. Panels:
1. **Overview**: `pcs x apps` grid. Cell = traffic light. Click => Detail.
2. **PC Detail**: apps expected on the PC, expected vs actual drift, restart action (disabled in v1).
3. **App Detail**: per app, state on each PC, transitions over the last 7d, mini chart of hourly healthy_ratio.
4. **Live (WS)**: transition stream.
5. **Alerts**: apps with expected=running AND actual=inactive for > 5min. (v1 only lists them; notifications are separate.)
UI: `data_table_cpp_viz`, `badge_cpp_core`, `empty_state_cpp_core`.
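
The Alerts rule (expected=running AND actual=inactive for more than 5 minutes) is simple enough to evaluate on the backend and hand to the panel as a list. A sketch, where the `Cell` struct, the `ActualSince` field and Unix-second timestamps are assumptions:

```go
package main

import "fmt"

// alertAfterSeconds is the 5-minute threshold from the Alerts panel.
const alertAfterSeconds = 5 * 60

// Cell is a hypothetical view of one (app, pc) pair.
type Cell struct {
	AppID       string
	PCID        string
	Expected    string
	Actual      string
	ActualSince int64 // Unix seconds when Actual last changed (assumed available from service_transition)
}

// Alerts returns the cells that are expected to run but have been
// inactive for strictly more than five minutes at time now.
func Alerts(cells []Cell, now int64) []Cell {
	var out []Cell
	for _, c := range cells {
		if c.Expected == "running" && c.Actual == "inactive" && now-c.ActualSince > alertAfterSeconds {
			out = append(out, c)
		}
	}
	return out
}

func main() {
	cells := []Cell{
		{AppID: "sqlite_api", PCID: "aurgi-pc", Expected: "running", Actual: "inactive", ActualSince: 1000},
		{AppID: "services_api", PCID: "aurgi-pc", Expected: "running", Actual: "active", ActualSince: 1000},
	}
	for _, a := range Alerts(cells, 1000+6*60) {
		fmt.Println("ALERT", a.AppID, a.PCID)
	}
}
```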
## Closed decisions (2026-05-17)
1. **Local is special**: the local PC is NOT checked via SSH. `pc_is_self` flag per PC. The checker selects the path: local exec vs ssh exec.
2. **Persistence**: transitions + hourly aggregate. Append-only `service_check` with a 7d TTL (nightly vacuum job).
3. **Auto-start**: NOT in v1. Alert only. Feature flag `services_monitor.auto_fix` OFF.
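
Decision 1 (local exec vs ssh exec behind one flag) maps naturally onto a small interface. In this sketch the `Runner` interface, `RunnerFor`, and the `sshExec` function type are hypothetical; the actual signature of `ssh_exec_go_infra` is not given in this issue, so it is modelled as an injected function. The unit name in `main` is also made up.

```go
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

// Runner executes one command and returns its stdout.
type Runner interface {
	Run(name string, args ...string) (string, error)
	Kind() string
}

// LocalRunner executes directly on this machine (pc_is_self = true).
type LocalRunner struct{}

func (LocalRunner) Kind() string { return "local" }

func (LocalRunner) Run(name string, args ...string) (string, error) {
	out, err := exec.Command(name, args...).Output()
	return string(out), err
}

// SSHRunner forwards the command line to an injected exec function.
// Arguments are joined with spaces and not shell-quoted; acceptable only
// for fixed, trusted commands like the ones in this issue.
type SSHRunner struct {
	Host string
	Exec func(host, command string) (string, error) // stand-in for ssh_exec_go_infra
}

func (SSHRunner) Kind() string { return "ssh" }

func (r SSHRunner) Run(name string, args ...string) (string, error) {
	return r.Exec(r.Host, strings.Join(append([]string{name}, args...), " "))
}

// PC carries the pc_is_self flag from the decision above.
type PC struct {
	ID     string
	IsSelf bool
}

// RunnerFor selects the execution path for a PC.
func RunnerFor(pc PC, sshExec func(host, command string) (string, error)) Runner {
	if pc.IsSelf {
		return LocalRunner{}
	}
	return SSHRunner{Host: pc.ID, Exec: sshExec}
}

func main() {
	stub := func(host, command string) (string, error) { return "active\n", nil }
	fmt.Println(RunnerFor(PC{ID: "aurgi-pc", IsSelf: true}, stub).Kind())
	remote := RunnerFor(PC{ID: "organic-machine.com"}, stub)
	out, err := remote.Run("systemctl", "--user", "is-active", "services_api.service")
	fmt.Println(remote.Kind(), strings.TrimSpace(out), err)
}
```

Keeping the parser independent of the runner means the same checks apply to both paths, which matches "the same commands + output parsing" for the remote checker.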
## Tasks (in order)
- [ ] Migration `services_api.db`: tables `service_check`, `service_transition`, `service_state_hourly`
- [ ] Registry functions: `port_listening_check_go_infra`, `http_health_probe_go_infra` (if they do not exist yet) via parallel fn-constructor
- [ ] `services_api` MVP: worker loop + `/api/services` + WS
- [ ] systemd unit + Restart=always + update issue 0105 with the 11th service
- [ ] C++ app `services_monitor` scaffold via `fn run init_cpp_app services_monitor`
- [ ] Panel Overview + WS client
- [ ] PC Detail + App Detail
- [ ] Alerts panel
## DoD
- 10 services visible in Overview with the correct traffic light against ground truth.
- Simulated crash (kill -9 sqlite_api) detected in <15s.
- Recovery (auto-restart via Restart=always) detected and reflected in transitions.
- App launchable on aurgi-pc + home-wsl (without SSH to self).
- Backend `services_api` running as `tag: service` (full dogfooding).