feat(embeddednats): UNIBUS_NATS_MONITOR flag decoupled from debug log

Add a dedicated UNIBUS_NATS_MONITOR=1 toggle that opens the embedded nats-server monitoring HTTP endpoint (127.0.0.1:8222, loopback only) so a local metrics scraper can read /varz, /connz and /jsz for server-level metrics (msgs/s, connections, KV bucket msgs, RAFT leader per stream, restarts).

Previously the monitoring endpoint was only reachable via UNIBUS_NATS_DEBUG=1, which is coupled to the verbose nats-server debug log: enabling the endpoint also wrote routes/RAFT/room subjects to journald in cleartext, which regresses the hardened posture (issue 0007). The two concerns are now decoupled.

The toggle computation is extracted to a pure function natsLogOpts(debugEnv, monitorEnv) (noLog, debug, trace, monitor): MONITOR=1 opens the endpoint while keeping the log quiet (NoLog true / Debug false). The inverse coupling is preserved for backward compatibility (DEBUG still implies MONITOR). The 127.0.0.1 bind stays hardcoded: the monitoring endpoint has no auth and must never be reachable from the network.

Deploy wiring, versioned: an additive systemd drop-in membershipd-cluster.service.d/nats-monitor.conf (Environment=UNIBUS_NATS_MONITOR=1) plus a "NATS server metrics" section in the cluster README with the rolling activation runbook (magnus -> homer -> datardos), gated on R3 reconvergence (followers 2/2) between nodes.

Tests: a pure decoupling table (monitor on => log NOT debug; debug => monitor; default closed), plus a real embedded server with MONITOR=1 asserting that /varz answers 200 on loopback:8222, and a server without the flag asserting that the endpoint is closed.

100% additive: behavior is identical without the flag.

Bump app.md 0.10.0 -> 0.11.0.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@@ -283,3 +283,61 @@ ssh dd 'sudo systemctl start membershipd-cluster' # rejoins, catches up
the unit and start it without `--store kv`/`--cluster-name`; the KV buckets remain
for a later retry. To rotate the cluster CA, re-run `generate-cluster-certs.sh
--force` and re-stage (every node must get the new `cluster-ca.crt` together).

## NATS server metrics (loopback monitoring, optional)

The embedded NATS server can expose its own monitoring HTTP endpoint so a local
scraper reads server-level metrics that `/healthz` does not surface: msgs/s,
connections, slow consumers, memory, KV bucket message counts, the RAFT leader per
stream and per-stream restarts. This feeds the `unibus-nats` dashboard in
`fleet_monitoring` (the scraper hits `127.0.0.1:8222/varz|/connz|/jsz` over
loopback and pushes to VictoriaMetrics).

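As an illustration of what a loopback scraper reads, here is a minimal Go sketch that fetches `/varz` and derives a msgs/s rate from two counter samples. It is not the `fleet_monitoring` scraper: the names `varz`, `parseVarz`, `msgsPerSecond` and `scrapeVarz` are hypothetical, and the struct decodes only a handful of the fields nats-server reports on `/varz`.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"time"
)

// varz is a small subset of the /varz document (hypothetical type name).
type varz struct {
	Connections   int   `json:"connections"`
	InMsgs        int64 `json:"in_msgs"`
	OutMsgs       int64 `json:"out_msgs"`
	SlowConsumers int64 `json:"slow_consumers"`
	Mem           int64 `json:"mem"`
}

// parseVarz decodes the subset above from a raw /varz response body.
func parseVarz(body []byte) (varz, error) {
	var v varz
	err := json.Unmarshal(body, &v)
	return v, err
}

// msgsPerSecond turns two in_msgs samples taken elapsedSeconds apart into a
// rate. It returns 0 when the counter went backwards (for example after a
// restart) or when the interval is not positive.
func msgsPerSecond(prev, cur int64, elapsedSeconds float64) float64 {
	if elapsedSeconds <= 0 || cur < prev {
		return 0
	}
	return float64(cur-prev) / elapsedSeconds
}

// scrapeVarz performs one GET against <baseURL>/varz.
func scrapeVarz(client *http.Client, baseURL string) (varz, error) {
	resp, err := client.Get(baseURL + "/varz")
	if err != nil {
		return varz{}, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return varz{}, fmt.Errorf("GET /varz: %s", resp.Status)
	}
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return varz{}, err
	}
	return parseVarz(body)
}

func main() {
	client := &http.Client{Timeout: 2 * time.Second}
	v, err := scrapeVarz(client, "http://127.0.0.1:8222")
	if err != nil {
		// Expected when UNIBUS_NATS_MONITOR is not set: the endpoint is closed.
		fmt.Println("monitoring endpoint not reachable:", err)
		return
	}
	fmt.Printf("connections=%d in_msgs=%d out_msgs=%d slow_consumers=%d mem=%d\n",
		v.Connections, v.InMsgs, v.OutMsgs, v.SlowConsumers, v.Mem)
}
```

Because `in_msgs` is a cumulative counter, a rate such as msgs/s has to be computed from the difference between two scrapes, which is what `msgsPerSecond` does.
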
The endpoint is opened by the **dedicated** environment toggle `UNIBUS_NATS_MONITOR=1`
(0.11.0+ binary). It is **decoupled** from `UNIBUS_NATS_DEBUG`: it opens the
monitoring endpoint WITHOUT enabling the verbose nats-server debug log, so no room
subjects or routing metadata leak to journald (keeps the hardened posture, issue
0007). The endpoint binds `127.0.0.1:8222` **only**: it has no authentication, so
the binary hardcodes the loopback bind and the endpoint is never reachable from the
network. Never use `UNIBUS_NATS_DEBUG` in production just to get the endpoint.

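The commit message describes the toggle computation as a pure function `natsLogOpts(debugEnv, monitorEnv) (noLog, debug, trace, monitor)`. The sketch below reconstructs only the rules stated there (MONITOR alone opens the endpoint with a quiet log, DEBUG still implies MONITOR, default closed). It is not the actual implementation: treating only the value `"1"` as "on", deriving `noLog` as the negation of `debug`, and leaving `trace` off are assumptions.

```go
package main

import "fmt"

// natsLogOpts is a hypothetical reconstruction of the decoupling rules.
func natsLogOpts(debugEnv, monitorEnv string) (noLog, debug, trace, monitor bool) {
	debug = debugEnv == "1"               // assumption: only "1" enables a toggle
	monitor = debug || monitorEnv == "1"  // DEBUG still implies MONITOR
	noLog = !debug                        // MONITOR alone keeps the log quiet
	trace = false                         // assumption: the source does not say how trace is derived
	return
}

func main() {
	for _, env := range [][2]string{{"", ""}, {"", "1"}, {"1", ""}, {"1", "1"}} {
		noLog, debug, trace, monitor := natsLogOpts(env[0], env[1])
		fmt.Printf("DEBUG=%q MONITOR=%q -> noLog=%t debug=%t trace=%t monitor=%t\n",
			env[0], env[1], noLog, debug, trace, monitor)
	}
}
```

Keeping the function free of I/O is what makes the decoupling table in the commit's tests possible: each row is a pair of environment values and the four expected booleans.
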
### Enable it (HUMAN: requires the 0.11.0+ binary on the node)

The clean way is the additive systemd drop-in in this directory:

```bash
# On each node, AFTER the 0.11.0+ binary is in /opt/unibus/membershipd:
ssh <node> 'sudo mkdir -p /etc/systemd/system/membershipd-cluster.service.d'
scp membershipd-cluster.service.d/nats-monitor.conf <node>:/tmp/nats-monitor.conf
ssh <node> 'sudo cp /tmp/nats-monitor.conf /etc/systemd/system/membershipd-cluster.service.d/ \
  && sudo systemctl daemon-reload && sudo systemctl restart membershipd-cluster'
```

(Equivalently, add `UNIBUS_NATS_MONITOR=1` to `/opt/unibus/cluster.env`, which the
unit already sources via `EnvironmentFile`; the drop-in is preferred because it is
self-documenting and does not edit the generated env file.)

### Rolling restart with the R3 reconvergence gate (CRITICAL)

`systemctl restart membershipd-cluster` restarts that node's JetStream RAFT member.
**Never restart two nodes at once**: that would drop the cluster below quorum
(2/3) and fail the control plane closed. Roll **one node at a time**, in the order
`magnus → homer → datardos`, and between each node wait until the cluster has
reconverged to R3 (every control-plane bucket back to `followers_current=2/2`):

```bash
# After restarting ONE node, gate on R3 reconvergence before touching the next.
# The jq program is double-quoted for the remote shell, so its quotes, backslashes
# and the jq variable $s are escaped:
ssh root@magnus 'for s in KV_UNIBUS_users KV_UNIBUS_rooms KV_UNIBUS_members \
    KV_UNIBUS_room_keys KV_UNIBUS_rooms_by_member KV_UNIBUS_nonces; do
  nats --server nats://127.0.0.1:4250 stream info "$s" -j \
    | jq -r --arg s "$s" "\"\\(\$s): replicas=\\(.cluster.replicas|length) leader=\\(.cluster.leader)\""
done'
# Proceed to the next node ONLY when all six show 3 replicas with a leader
# (i.e. 2/2 followers current). Also confirm healthz is green on the just-restarted
# node first:
ssh <node> 'curl -fsS https://127.0.0.1:8470/healthz --cacert /opt/unibus/tls/ca.crt'
```

This restart is normally **not** done as a standalone step: the 0.11.0 binary that
carries the flag is rolled to the three nodes in the consolidated rollout, and the
drop-in is installed during that same rolling restart.

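The reconvergence gate can also be expressed as a predicate over the JSON that `nats stream info -j` prints for one stream. The Go sketch below is illustrative only: `r3Converged` and the two types are hypothetical names, and it assumes that `.cluster.replicas` lists the followers (the peers other than the leader), each with `current` and `offline` flags, so an R3 stream is healthy when it has a leader and two current followers.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// peer mirrors the per-follower fields this sketch relies on (assumed layout).
type peer struct {
	Name    string `json:"name"`
	Current bool   `json:"current"`
	Offline bool   `json:"offline"`
}

// streamInfo keeps only the cluster part of the stream info document.
type streamInfo struct {
	Cluster struct {
		Leader   string `json:"leader"`
		Replicas []peer `json:"replicas"`
	} `json:"cluster"`
}

// r3Converged reports whether the stream has a leader and exactly
// wantFollowers followers that are current and not offline.
func r3Converged(infoJSON []byte, wantFollowers int) (bool, error) {
	var si streamInfo
	if err := json.Unmarshal(infoJSON, &si); err != nil {
		return false, err
	}
	if si.Cluster.Leader == "" {
		return false, nil // no leader elected yet
	}
	current := 0
	for _, p := range si.Cluster.Replicas {
		if p.Current && !p.Offline {
			current++
		}
	}
	return current == wantFollowers, nil
}

func main() {
	// Sample document: datardos has rejoined but is still catching up.
	sample := []byte(`{"cluster":{"leader":"magnus","replicas":[
		{"name":"homer","current":true},
		{"name":"datardos","current":false}]}}`)
	ok, err := r3Converged(sample, 2)
	fmt.Println(ok, err) // prints "false <nil>"
}
```

Under that assumption, the next node would be restarted only once this predicate holds for all six `KV_UNIBUS_*` streams and `/healthz` is green on the node that was just restarted.
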
@@ -0,0 +1,27 @@
# Drop-in: enable the embedded NATS server monitoring HTTP endpoint so a local
# metrics scraper can read /varz, /connz and /jsz for server-level metrics
# (msgs/s, connections, KV bucket msgs, RAFT leader per stream, restarts).
#
# ADDITIVE and minimal: it only sets one environment variable; the base unit
# (membershipd-cluster.service) is otherwise unchanged.
#
# UNIBUS_NATS_MONITOR is DECOUPLED from UNIBUS_NATS_DEBUG: it opens the monitoring
# endpoint WITHOUT enabling the verbose nats-server debug log, so no room subjects
# or routing metadata are written to journald (keeps the hardened posture, issue
# 0007). Do NOT use UNIBUS_NATS_DEBUG in production just to get the endpoint.
#
# The endpoint binds 127.0.0.1:8222 ONLY: the binary hardcodes the loopback bind,
# so it is never reachable from the network and needs no auth. The scraper runs on
# the same host and reads it over loopback.
#
# Requires the 0.11.0+ membershipd binary (the one that honors UNIBUS_NATS_MONITOR).
# Install on a node:
#   sudo mkdir -p /etc/systemd/system/membershipd-cluster.service.d
#   sudo cp nats-monitor.conf /etc/systemd/system/membershipd-cluster.service.d/
#   sudo systemctl daemon-reload && sudo systemctl restart membershipd-cluster
#
# Restarting a node restarts its JetStream RAFT member, so roll ONE node at a time
# and wait for R3 reconvergence (followers 2/2) before touching the next. See the
# "NATS server metrics" section of this directory's README for the full runbook.
[Service]
Environment=UNIBUS_NATS_MONITOR=1