fn_registry/dev/issues/0106-services-monitor-app.md
id: 0106
title: App services_monitor: cross-PC dashboard of active services
status: pending
type: app
domain: apps-infra, cpp-stack, telemetry, deploy
scope: multi-app
priority: high
depends / blocks / related: 0105, 0085
created: 2026-05-17
updated: 2026-05-17
tags: services, monitoring, cross-pc, ssh, systemd, healthcheck, dashboard

0106 — App services_monitor

Problem

fn doctor services gives a point-in-time snapshot of the local PC only. What is missing is a live, cross-PC view that answers:

  • Which of my 10 services are alive on aurgi-pc?
  • Which are alive on organic-machine.com?
  • Which ones died without my noticing (as happened with sqlite_api on 2026-05-17)?

Decision

An ImGui app, services_monitor, consumes a Go backend, services_api (port 8485). The backend reconciles the expected state (service_targets + apps.* from the registry) against the actual state (systemd state + port listening + HTTP health) on each target PC. Historical persistence consists of state transitions plus an hourly aggregate.
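
A minimal sketch of the expected-vs-actual reconciliation, assuming the three observed signals collapse into a single traffic-light value. The Status names and the rule that a healthy service needs an active unit, a listening port, and a 2xx response are assumptions for illustration, not part of this issue.

```go
package main

import "fmt"

// Observed groups the three "actual" signals named in the decision.
type Observed struct {
	SystemdState  string // output of `systemctl is-active`, e.g. "active", "inactive", "failed"
	PortListening bool
	HTTPStatus    int // 0 when the health probe got no response
}

// Status is a hypothetical traffic-light value; the names are assumptions.
type Status string

const (
	Healthy  Status = "healthy"
	Degraded Status = "degraded" // unit is active but port or HTTP health fails
	Down     Status = "down"     // expected on this PC but unit is not active
	Drift    Status = "drift"    // unit is active but not expected on this PC
	Idle     Status = "idle"     // not expected and not running
)

// Reconcile compares the expected flag with the observed signals.
func Reconcile(expected bool, o Observed) Status {
	running := o.SystemdState == "active"
	switch {
	case !expected && !running:
		return Idle
	case !expected && running:
		return Drift
	case !running:
		return Down
	case o.PortListening && o.HTTPStatus >= 200 && o.HTTPStatus < 300:
		return Healthy
	default:
		return Degraded
	}
}

func main() {
	fmt.Println(Reconcile(true, Observed{"active", true, 200}))
	fmt.Println(Reconcile(true, Observed{"inactive", false, 0}))
	fmt.Println(Reconcile(false, Observed{"active", true, 200}))
}
```

Keeping the rule in one pure function means the same logic can back the Overview cells and the drift column in PC Detail.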

Components

Backend apps/services_api/ (Go, tag: service, port 8485)

Endpoints:

  • GET /api/services: flat list (app_id, pc_id, expected, actual, port, last_check_ts, last_healthy_ts, transitions_24h)
  • GET /api/services/:app/:pc: detail + last N transitions + journalctl tail
  • POST /api/services/:app/:pc/check: forces an immediate check
  • POST /api/services/:app/:pc/action (action=start|stop|restart): behind a feature flag, OFF in v1
  • GET /api/pcs: per-PC status (reachable, lag_ms, version_uname)
  • GET /api/ws/services: WebSocket push of the delta after each check

Worker pool: one 10 s check cycle per PC, with PCs checked in parallel. Local checker (is_local_only=true or PC = self): runs systemctl --user is-active <unit>, ss -tln | grep :<port>, and curl -m <timeout> <health_endpoint>. Remote checker: runs the same commands through ssh_exec_go_infra and parses their output.
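
A sketch of the output-parsing side of the checker, assuming both paths (local exec and SSH) reduce to a function that runs a command string and returns its stdout. The Runner type, pickRunner, and the unit name services_api.service are hypothetical; the signature of ssh_exec_go_infra is not known here, so it is represented only as an injected Runner. Matching the port against the Local Address column in Go, rather than with grep, is a swapped-in choice that avoids prefix matches such as :84850.

```go
package main

import (
	"fmt"
	"strings"
)

// Runner executes a shell command and returns its stdout.
// Hypothetical abstraction over local exec and the SSH path.
type Runner func(cmd string) (string, error)

// pickRunner selects the local path for the PC running the backend.
func pickRunner(pcIsSelf bool, local, remote Runner) Runner {
	if pcIsSelf {
		return local
	}
	return remote
}

// portListening reports whether `ss -tln` output contains a LISTEN socket
// whose local address ends in ":<port>".
func portListening(ssOutput string, port int) bool {
	suffix := fmt.Sprintf(":%d", port)
	for _, line := range strings.Split(ssOutput, "\n") {
		fields := strings.Fields(line)
		// Data rows: State Recv-Q Send-Q Local-Address:Port Peer-Address:Port
		if len(fields) >= 4 && fields[0] == "LISTEN" && strings.HasSuffix(fields[3], suffix) {
			return true
		}
	}
	return false
}

func main() {
	// Canned outputs so the sketch runs without systemd or ss installed.
	fake := Runner(func(cmd string) (string, error) {
		switch {
		case strings.HasPrefix(cmd, "systemctl"):
			return "active\n", nil
		case strings.HasPrefix(cmd, "ss "):
			return "State  Recv-Q Send-Q Local Address:Port Peer Address:Port\n" +
				"LISTEN 0      4096   127.0.0.1:8485     0.0.0.0:*\n", nil
		}
		return "", fmt.Errorf("unexpected command %q", cmd)
	})
	run := pickRunner(true, fake, nil)
	state, _ := run("systemctl --user is-active services_api.service")
	ssOut, _ := run("ss -tln")
	fmt.Println(strings.TrimSpace(state), portListening(ssOut, 8485))
}
```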

DB: services_api.db:

  • service_check append-only (ts, app_id, pc_id, systemd_state, port_listening, http_status, latency_ms)
  • service_transition (ts, app_id, pc_id, from, to)
  • service_state_hourly (hour_bucket, app_id, pc_id, healthy_ratio, transitions)

Frontend apps/services_monitor/ (C++ ImGui)

Pattern: data_factory. Panels:

  1. Overview: grid of PCs x apps. Each cell is a traffic light. Clicking a cell opens its Detail view.
  2. PC Detail: apps expected on the PC, drift between expected and actual, restart action (disabled in v1).
  3. App Detail: per app, state on each PC, transitions over the last 7 days, mini chart of hourly healthy_ratio.
  4. Live (WS): stream of transitions.
  5. Alerts: apps with expected=running AND actual=inactive for more than 5 min. (v1 only lists them; notifications are handled separately.)

UI: data_table_cpp_viz, badge_cpp_core, empty_state_cpp_core.

Closed decisions (2026-05-17)

  1. Local special case: the local PC is NOT checked via SSH. Each PC carries a pc_is_self flag, and the checker selects the path accordingly: local exec vs SSH exec.
  2. Persistence: transitions + hourly aggregate. service_check is append-only with a 7-day TTL (nightly vacuum job).
  3. Auto-start: NOT in v1; alert only. The services_monitor.auto_fix feature flag stays OFF.

Tasks (in order)

  • Migration for services_api.db: tables service_check, service_transition, service_state_hourly
  • Registry functions: port_listening_check_go_infra, http_health_probe_go_infra (if they do not exist yet), built via fn-constructor in parallel
  • services_api MVP: worker loop + /api/services + WS
  • systemd unit with Restart=always + update issue 0105 with the 11th service
  • C++ app services_monitor scaffolded via fn run init_cpp_app services_monitor
  • Overview panel + WS client
  • PC Detail + App Detail
  • Alerts panel

DoD

  • All 10 services visible in Overview with the correct traffic light, checked against ground truth.
  • A simulated crash (kill -9 sqlite_api) is detected in under 15 s.
  • Recovery (auto-restart via Restart=always) is detected and reflected in transitions.
  • App can be launched on aurgi-pc and home-wsl (without SSH to self).
  • Backend services_api running as tag: service (full dogfooding).