feat(embeddednats): UNIBUS_NATS_MONITOR flag decoupled from debug log

Add a dedicated UNIBUS_NATS_MONITOR=1 toggle that opens the embedded nats-server monitoring HTTP endpoint (127.0.0.1:8222, loopback only) so a local metrics scraper can read /varz, /connz and /jsz for server-level metrics (msgs/s, connections, KV bucket msgs, RAFT leader per stream, restarts).

Previously the monitoring endpoint was only reachable via UNIBUS_NATS_DEBUG=1, which is coupled to the verbose nats-server debug log: enabling the endpoint also wrote routes/RAFT/room subjects to journald in the clear, which regresses the hardened posture (issue 0007). The two concerns are now decoupled.

The toggle computation is extracted to a pure function natsLogOpts(debugEnv, monitorEnv) (noLog, debug, trace, monitor): MONITOR=1 opens the endpoint while keeping the log quiet (NoLog true / Debug false). The inverse coupling is preserved for backward compatibility (DEBUG still implies MONITOR). The 127.0.0.1 bind stays hardcoded: the monitoring endpoint has no auth and must never be reachable from the network.

Deploy wiring versioned: additive systemd drop-in membershipd-cluster.service.d/nats-monitor.conf (Environment=UNIBUS_NATS_MONITOR=1) plus a "NATS server metrics" section in the cluster README with the rolling activation runbook (magnus -> homer -> datardos) gated on R3 reconvergence (followers 2/2) between nodes.

Tests: pure decoupling table (monitor on => log NOT debug; debug => monitor; default closed) + a real embedded server with MONITOR=1 asserting /varz answers 200 on loopback:8222, and a server without the flag with the endpoint closed.

100% additive: behavior is identical without the flag. Bump app.md 0.10.0 -> 0.11.0.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
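
The decoupling described in the message above can be illustrated with a small truth-table function. This is only a sketch under stated assumptions, not the code from this commit: the signature follows the one quoted in the message, but the rule that a flag counts as "on" only when its value is exactly `1`, the default of a quiet log, and the choice to let `trace` follow `debug` are all assumptions made for illustration.

```go
package main

import "fmt"

// natsLogOpts sketches the pure toggle computation named in the commit message.
// Assumption: a flag is "on" only when its environment value is exactly "1".
// Assumption: the log is quiet (noLog true) unless DEBUG is on.
// Assumption: trace follows debug; the commit message does not specify it.
func natsLogOpts(debugEnv, monitorEnv string) (noLog, debug, trace, monitor bool) {
	debug = debugEnv == "1"
	trace = debug  // assumption, see above
	noLog = !debug // MONITOR alone never turns logging on
	// DEBUG still implies MONITOR (backward compatibility);
	// MONITOR alone opens the endpoint without touching the log.
	monitor = debug || monitorEnv == "1"
	return noLog, debug, trace, monitor
}

func main() {
	// Print the full decision table for the two toggles.
	cases := []struct{ debugEnv, monitorEnv string }{
		{"", ""}, {"", "1"}, {"1", ""}, {"1", "1"},
	}
	for _, c := range cases {
		noLog, debug, trace, monitor := natsLogOpts(c.debugEnv, c.monitorEnv)
		fmt.Printf("DEBUG=%q MONITOR=%q -> noLog=%t debug=%t trace=%t monitor=%t\n",
			c.debugEnv, c.monitorEnv, noLog, debug, trace, monitor)
	}
}
```

Keeping the computation free of `os.Getenv` calls is what makes a table-driven test like the one mentioned in the message straightforward: the caller reads the environment once and passes the two strings in.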
2026-06-07 20:57:46 +02:00
parent b4f3118e85
commit 1c9325104c
5 changed files with 274 additions and 12 deletions
@@ -283,3 +283,61 @@ ssh dd 'sudo systemctl start membershipd-cluster'   # rejoins, catches up
 the unit and start it without `--store kv`/`--cluster-name`; the KV buckets remain
 for a later retry. To rotate the cluster CA, re-run `generate-cluster-certs.sh
 --force` and re-stage (every node must get the new `cluster-ca.crt` together).
+
+## NATS server metrics (loopback monitoring — optional)
+
+The embedded NATS server can expose its own monitoring HTTP endpoint so a local
+scraper reads server-level metrics that `/healthz` does not surface: msgs/s,
+connections, slow consumers, memory, KV bucket message counts, the RAFT leader per
+stream and per-stream restarts. This feeds the `unibus-nats` dashboard in
+`fleet_monitoring` (the scraper hits `127.0.0.1:8222/varz|/connz|/jsz` over
+loopback and pushes to VictoriaMetrics).
+
+The endpoint is opened by the **dedicated** environment toggle `UNIBUS_NATS_MONITOR=1`
+(0.11.0+ binary). It is **decoupled** from `UNIBUS_NATS_DEBUG`: it opens the
+monitoring endpoint WITHOUT enabling the verbose nats-server debug log, so no room
+subjects or routing metadata leak to journald (keeps the hardened posture, issue
+0007). The endpoint binds `127.0.0.1:8222` **only** — the binary hardcodes the
+loopback bind, so it is never reachable from the network and needs no auth. Never
+use `UNIBUS_NATS_DEBUG` in production just to get the endpoint.
+
+### Enable it (HUMAN — requires the 0.11.0+ binary on the node)
+
+The clean way is the additive systemd drop-in in this directory:
+
+```bash
+# On each node, AFTER the 0.11.0+ binary is in /opt/unibus/membershipd:
+ssh <node> 'sudo mkdir -p /etc/systemd/system/membershipd-cluster.service.d'
+scp membershipd-cluster.service.d/nats-monitor.conf <node>:/tmp/nats-monitor.conf
+ssh <node> 'sudo cp /tmp/nats-monitor.conf /etc/systemd/system/membershipd-cluster.service.d/ \
+  && sudo systemctl daemon-reload && sudo systemctl restart membershipd-cluster'
+```
+
+(Equivalently, add `UNIBUS_NATS_MONITOR=1` to `/opt/unibus/cluster.env`, which the
+unit already sources via `EnvironmentFile`; the drop-in is preferred because it is
+self-documenting and does not edit the generated env file.)
+
+### Rolling restart with the R3 reconvergence gate (CRITICAL)
+
+`systemctl restart membershipd-cluster` restarts that node's JetStream RAFT member.
+**Never restart two nodes at once**: with only one of three members up, the
+cluster falls below quorum (2/3) and the control plane fails closed. Roll **one
+node at a time**, in the order
+`magnus → homer → datardos`, and between each node wait until the cluster has
+reconverged to R3 (every control-plane bucket back to `followers_current=2/2`):
+
+```bash
+# After restarting ONE node, gate on R3 reconvergence before touching the next:
+ssh root@magnus 'for s in KV_UNIBUS_users KV_UNIBUS_rooms KV_UNIBUS_members \
+  KV_UNIBUS_room_keys KV_UNIBUS_rooms_by_member KV_UNIBUS_nonces; do
+    nats --server nats://127.0.0.1:4250 stream info "$s" -j \
+      | jq -r --arg s "$s" "\"\\(\$s): followers_current=\\([.cluster.replicas[]? | select(.current)] | length)/\\(.cluster.replicas | length) leader=\\(.cluster.leader)\""
+  done'
+# .cluster.replicas lists only the followers (the leader is reported separately),
+# so a healthy R3 stream shows followers_current=2/2. Proceed to the next node ONLY
+# when all six show 2/2 and a non-null leader. Also confirm healthz is green on the
+# just-restarted node first:
+ssh <node> 'curl -fsS https://127.0.0.1:8470/healthz --cacert /opt/unibus/tls/ca.crt'
+```
+
+This restart is normally **not** done as a standalone step: the 0.11.0 binary that
+carries the flag is rolled to the three nodes in the consolidated rollout, and the
+drop-in is installed during that same rolling restart.
@@ -0,0 +1,27 @@
+# Drop-in: enable the embedded NATS server monitoring HTTP endpoint so a local
+# metrics scraper can read /varz, /connz and /jsz for server-level metrics
+# (msgs/s, connections, KV bucket msgs, RAFT leader per stream, restarts).
+#
+# ADDITIVE and minimal: it only sets one environment variable; the base unit
+# (membershipd-cluster.service) is otherwise unchanged.
+#
+# UNIBUS_NATS_MONITOR is DECOUPLED from UNIBUS_NATS_DEBUG: it opens the monitoring
+# endpoint WITHOUT enabling the verbose nats-server debug log, so no room subjects
+# or routing metadata are written to journald (keeps the hardened posture, issue
+# 0007). Do NOT use UNIBUS_NATS_DEBUG in production just to get the endpoint.
+#
+# The endpoint binds 127.0.0.1:8222 ONLY — the binary hardcodes the loopback bind,
+# so it is never reachable from the network and needs no auth. The scraper runs on
+# the same host and reads it over loopback.
+#
+# Requires the 0.11.0+ membershipd binary (the one that honors UNIBUS_NATS_MONITOR).
+# Install on a node:
+#   sudo mkdir -p /etc/systemd/system/membershipd-cluster.service.d
+#   sudo cp nats-monitor.conf /etc/systemd/system/membershipd-cluster.service.d/
+#   sudo systemctl daemon-reload && sudo systemctl restart membershipd-cluster
+#
+# Restarting a node restarts its JetStream RAFT member, so roll ONE node at a time
+# and wait for R3 reconvergence (followers 2/2) before touching the next. See the
+# "NATS server metrics" section of this directory's README for the full runbook.
+[Service]
+Environment=UNIBUS_NATS_MONITOR=1