feat(embeddednats): UNIBUS_NATS_MONITOR flag decoupled from debug log

Add a dedicated UNIBUS_NATS_MONITOR=1 toggle that opens the embedded
nats-server monitoring HTTP endpoint (127.0.0.1:8222, loopback only) so a
local metrics scraper can read /varz, /connz and /jsz for server-level
metrics (msgs/s, connections, KV bucket msgs, RAFT leader per stream,
restarts).

Previously the monitoring endpoint was only reachable via UNIBUS_NATS_DEBUG=1,
which is coupled to the verbose nats-server debug log: enabling the endpoint
also wrote routes/RAFT/room subjects to journald in clear, which regresses the
hardened posture (issue 0007). The two concerns are now decoupled.

The toggle computation is extracted to a pure function
natsLogOpts(debugEnv, monitorEnv) (noLog, debug, trace, monitor): MONITOR=1
opens the endpoint while keeping the log quiet (NoLog true / Debug false). The
inverse coupling is preserved for backward compatibility (DEBUG still implies
MONITOR). The 127.0.0.1 bind stays hardcoded — the monitoring endpoint has no
auth and must never be reachable from the network.

Deploy wiring versioned: additive systemd drop-in
membershipd-cluster.service.d/nats-monitor.conf (Environment=UNIBUS_NATS_MONITOR=1)
plus a "NATS server metrics" section in the cluster README with the rolling
activation runbook (magnus -> homer -> datardos) gated on R3 reconvergence
(followers 2/2) between nodes.

Tests: pure decoupling table (monitor on => log NOT debug; debug => monitor;
default closed) + a real embedded server with MONITOR=1 asserting /varz answers
200 on loopback:8222, and a server without the flag with the endpoint closed.
100% additive: behavior is identical without the flag. Bump app.md 0.10.0 ->
0.11.0.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This commit is contained in:
Egutierrez
2026-06-07 20:57:46 +02:00
parent b4f3118e85
commit 1c9325104c
5 changed files with 274 additions and 12 deletions
+58
View File
@@ -283,3 +283,61 @@ ssh dd 'sudo systemctl start membershipd-cluster' # rejoins, catches up
the unit and start it without `--store kv`/`--cluster-name`; the KV buckets remain
for a later retry. To rotate the cluster CA, re-run `generate-cluster-certs.sh
--force` and re-stage (every node must get the new `cluster-ca.crt` together).
## NATS server metrics (loopback monitoring — optional)
The embedded NATS server can expose its own monitoring HTTP endpoint so a local
scraper reads server-level metrics that `/healthz` does not surface: msgs/s,
connections, slow consumers, memory, KV bucket message counts, the RAFT leader per
stream and per-stream restarts. This feeds the `unibus-nats` dashboard in
`fleet_monitoring` (the scraper hits `127.0.0.1:8222/varz|/connz|/jsz` over
loopback and pushes to VictoriaMetrics).
The endpoint is opened by the **dedicated** environment toggle `UNIBUS_NATS_MONITOR=1`
(0.11.0+ binary). It is **decoupled** from `UNIBUS_NATS_DEBUG`: it opens the
monitoring endpoint WITHOUT enabling the verbose nats-server debug log, so no room
subjects or routing metadata leak to journald (keeps the hardened posture, issue
0007). The endpoint binds `127.0.0.1:8222` **only** — the binary hardcodes the
loopback bind, so it is never reachable from the network and needs no auth. Never
use `UNIBUS_NATS_DEBUG` in production just to get the endpoint.
### Enable it (HUMAN — requires the 0.11.0+ binary on the node)
The clean way is the additive systemd drop-in in this directory:
```bash
# On each node, AFTER the 0.11.0+ binary is in /opt/unibus/membershipd:
ssh <node> 'sudo mkdir -p /etc/systemd/system/membershipd-cluster.service.d'
scp membershipd-cluster.service.d/nats-monitor.conf <node>:/tmp/nats-monitor.conf
ssh <node> 'sudo cp /tmp/nats-monitor.conf /etc/systemd/system/membershipd-cluster.service.d/ \
&& sudo systemctl daemon-reload && sudo systemctl restart membershipd-cluster'
```
(Equivalently, add `UNIBUS_NATS_MONITOR=1` to `/opt/unibus/cluster.env`, which the
unit already sources via `EnvironmentFile`; the drop-in is preferred because it is
self-documenting and does not edit the generated env file.)
### Rolling restart with the R3 reconvergence gate (CRITICAL)
`systemctl restart membershipd-cluster` restarts that node's JetStream RAFT member.
**Never restart two nodes at once** — that would drop the cluster below quorum
(2/3) and fail the control plane closed. Roll **one node at a time**, in the order
`magnus → homer → datardos`, and between each node wait until the cluster has
reconverged to R3 (every control-plane bucket back to `followers_current=2/2`):
```bash
# After restarting ONE node, gate on R3 reconvergence before touching the next:
ssh root@magnus 'for s in KV_UNIBUS_users KV_UNIBUS_rooms KV_UNIBUS_members \
KV_UNIBUS_room_keys KV_UNIBUS_rooms_by_member KV_UNIBUS_nonces; do
nats --server nats://127.0.0.1:4250 stream info "$s" -j \
| jq -r --arg s "$s" \"\\($s): replicas=\\(.cluster.replicas|length) leader=\\(.cluster.leader)\"
done'
# Proceed to the next node ONLY when all six show 3 replicas with a leader
# (i.e. 2/2 followers current). Also confirm healthz is green on the just-restarted
# node first:
ssh <node> 'curl -fsS https://127.0.0.1:8470/healthz --cacert /opt/unibus/tls/ca.crt'
```
This restart is normally **not** done as a standalone step: the 0.11.0 binary that
carries the flag is rolled to the three nodes in the consolidated rollout, and the
drop-in is installed during that same rolling restart.
@@ -0,0 +1,27 @@
# Drop-in: enable the embedded NATS server monitoring HTTP endpoint so a local
# metrics scraper can read /varz, /connz and /jsz for server-level metrics
# (msgs/s, connections, KV bucket msgs, RAFT leader per stream, restarts).
#
# ADDITIVE and minimal: it only sets one environment variable; the base unit
# (membershipd-cluster.service) is otherwise unchanged.
#
# UNIBUS_NATS_MONITOR is DECOUPLED from UNIBUS_NATS_DEBUG: it opens the monitoring
# endpoint WITHOUT enabling the verbose nats-server debug log, so no room subjects
# or routing metadata are written to journald (keeps the hardened posture, issue
# 0007). Do NOT use UNIBUS_NATS_DEBUG in production just to get the endpoint.
#
# The endpoint binds 127.0.0.1:8222 ONLY — the binary hardcodes the loopback bind,
# so it is never reachable from the network and needs no auth. The scraper runs on
# the same host and reads it over loopback.
#
# Requires the 0.11.0+ membershipd binary (the one that honors UNIBUS_NATS_MONITOR).
# Install on a node:
# sudo mkdir -p /etc/systemd/system/membershipd-cluster.service.d
# sudo cp nats-monitor.conf /etc/systemd/system/membershipd-cluster.service.d/
# sudo systemctl daemon-reload && sudo systemctl restart membershipd-cluster
#
# Restarting a node restarts its JetStream RAFT member, so roll ONE node at a time
# and wait for R3 reconvergence (followers 2/2) before touching the next. See the
# "NATS server metrics" section of this directory's README for the full runbook.
[Service]
Environment=UNIBUS_NATS_MONITOR=1