Egutierrez 1c9325104c feat(embeddednats): UNIBUS_NATS_MONITOR flag decoupled from debug log
Add a dedicated UNIBUS_NATS_MONITOR=1 toggle that opens the embedded
nats-server monitoring HTTP endpoint (127.0.0.1:8222, loopback only) so a
local metrics scraper can read /varz, /connz and /jsz for server-level
metrics (msgs/s, connections, KV bucket msgs, RAFT leader per stream,
restarts).

Previously the monitoring endpoint was only reachable via UNIBUS_NATS_DEBUG=1,
which is coupled to the verbose nats-server debug log: enabling the endpoint
also wrote routes/RAFT/room subjects to journald in clear, which regresses the
hardened posture (issue 0007). The two concerns are now decoupled.

The toggle computation is extracted to a pure function
natsLogOpts(debugEnv, monitorEnv) (noLog, debug, trace, monitor): MONITOR=1
opens the endpoint while keeping the log quiet (NoLog true / Debug false). The
inverse coupling is preserved for backward compatibility (DEBUG still implies
MONITOR). The 127.0.0.1 bind stays hardcoded — the monitoring endpoint has no
auth and must never be reachable from the network.

Deploy wiring versioned: additive systemd drop-in
membershipd-cluster.service.d/nats-monitor.conf (Environment=UNIBUS_NATS_MONITOR=1)
plus a "NATS server metrics" section in the cluster README with the rolling
activation runbook (magnus -> homer -> datardos) gated on R3 reconvergence
(followers 2/2) between nodes.

Tests: pure decoupling table (monitor on => log NOT debug; debug => monitor;
default closed) + a real embedded server with MONITOR=1 asserting /varz answers
200 on loopback:8222, and a server without the flag with the endpoint closed.
100% additive: behavior is identical without the flag. Bump app.md 0.10.0 ->
0.11.0.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-07 20:57:46 +02:00


unibus cluster — 3-node deploy runbook (issue 0006g)

This directory holds the material to bring up unibus as a 3-node cluster (magnus + homer + datardos) for real HA: with R3 replication the control plane (rooms/members/keys/users on JetStream KV + the anti-replay nonce bucket) survives the loss of any one node (quorum 2/3).

Status: this cluster is DEPLOYED in production (magnus + homer + datardos, R3, enforce+ACL+TLS) — see report 0011. The runbook below was authored before any VPS existed and has since been corrected against the real deploy (report 0012): the start ordering, the R1→R3 reality, and the live user-add path were all wrong or missing. Steps that change a remote host are marked HUMAN; deploy-cluster.sh still defaults to a dry run.

Files

| File | What it is |
| --- | --- |
| nodes.env | Topology: cluster name, ports, and the per-node rows (name, ssh host, public IP, WG IP). HUMAN fills the placeholders. |
| generate-cluster-certs.sh | Mints a separate cluster route CA + a route cert per node, and a data-plane server cert per node signed by the client CA (../tls/ca.*). |
| membershipd-cluster.service | One systemd unit, parameterized per node by /opt/unibus/cluster.env. enforce + per-subject ACL + TLS + --store kv, Restart=always. |
| deploy-cluster.sh | Cross-builds the linux binary, generates each node's cluster.env, and (with --yes) rsyncs everything + installs the unit. Staggered start is manual. |

Generated keys/secrets (out/, build/, secrets/) are gitignored — they are secret and never leave the operator's trusted machine except over the secure rsync channel.

Topology (as deployed, report 0011)

| Node | SSH | Public IP | Role |
| --- | --- | --- | --- |
| magnus | magnus (root) | 135.125.201.30 | node (= organic-machine.com = om, the critical host running caddy + gitea + registry-api + monitoring; the bus runs alongside, untouched) |
| homer | homer (ubuntu+sudo) | 141.94.69.66 | node |
| datardos | dd (ubuntu+sudo) | 51.91.100.142 | node |

ROUTE_NETWORK=public, not wg: there is no WireGuard mesh between the three nodes (homer and datardos do not even have the wg binary; om's only WG peers are the operator's PCs). The server-to-server routes therefore travel over the public IPs, protected by the separate cluster route CA (mutual route TLS) — a client data-plane cert can never be presented to the route port. The client data plane and the HTTP control plane are also reached over the public IPs. There is no fixed "seed" node: with R3 the three are peers (see "Bring up" for why a lone node cannot self-serve).

Prerequisites (HUMAN, once)

  1. Fill nodes.env — replace every <PLACEHOLDER> (magnus public IP, all WG IPs). The scripts refuse to run while any remain.
  2. Client CA exists: ../tls/ca.crt + ../tls/ca.key. If not, run ../tls/generate-certs.sh on the CA host (om) first. The cluster reuses this CA for the data plane so existing clients keep trusting the bus.
  3. Mint cluster TLS:
    ./generate-cluster-certs.sh        # writes out/<name>/ ; --force to rotate the cluster CA
    
  4. Create the route secret (out of argv, shared by all nodes):
    mkdir -p secrets && openssl rand -hex 32 > secrets/cluster.pass
    
  5. SSH to each node's SSH host works (ssh magnus true, ssh dd true, ...): as root on magnus, as ubuntu with sudo on homer and datardos.
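The refusal described in step 1 can be reproduced by hand before running anything. The following is a minimal sketch, assuming placeholders are written as an upper-case name in angle brackets; the function name find_placeholders is hypothetical and is not part of the shipped scripts.

```shell
# Hypothetical helper, not part of deploy-cluster.sh: list unfilled
# <PLACEHOLDER> tokens in an env file. Assumes a placeholder is an
# upper-case name in angle brackets, e.g. <MAGNUS_WG_IP>.
find_placeholders() {
  # -n: prefix each hit with its line number, -o: print only the token.
  grep -nEo '<[A-Z][A-Z0-9_]*>' "$1"
}

# Exit status 0 means at least one placeholder remains (grep matched).
# Example:
#   find_placeholders nodes.env && echo "fill nodes.env first"
```

The check only looks at the shape of the token; it says nothing about whether a filled-in IP is correct.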

Stage the nodes

./deploy-cluster.sh            # DRY RUN — prints the full plan, touches nothing
./deploy-cluster.sh --yes      # HUMAN: actually rsync + install the unit on all 3 nodes

This cross-builds membershipd (linux/amd64, CGO_ENABLED=0), writes each node's cluster.env (its NODE_NAME and the --routes to the OTHER two nodes), and ships the binary, the node's TLS material, the secret, the env file and the unit. It does not start anything.
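The "routes to the OTHER two nodes" rule can be illustrated with a small sketch. The IPs are taken from the topology table; the nats-route:// scheme, the port argument and the function name routes_for are assumptions for illustration, and the real values come from nodes.env and deploy-cluster.sh.

```shell
# Sketch only: build a --routes value for one node as "every node except
# itself". The URL scheme and the port are assumptions, not the script's
# actual output format.
routes_for() {
  local self=$1 port=$2 out="" name ip
  while read -r name ip; do
    [ "$name" = "$self" ] && continue          # never route to yourself
    out="${out:+$out,}nats-route://$ip:$port"  # comma-join the others
  done <<EOF
magnus 135.125.201.30
homer 141.94.69.66
datardos 51.91.100.142
EOF
  printf '%s\n' "$out"
}

# Example (the port is a placeholder):
#   routes_for homer 6222
```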

Seed the first admin into the KV (HUMAN — loopback bootstrap)

The empty KV control plane has no users, and under enforce no external tool can write the FIRST admin over NATS (it would need to be an admin already, a chicken-and-egg problem). Without a live cluster to target, the user CLI writes only to a local SQLite file, not the KV. So the first admin is seeded on one node (magnus in the example below) through a loopback, no-auth bootstrap that populates the same JetStream store the cluster unit then reuses:

ssh root@magnus 'bash -s' <<'SEED'
set -euo pipefail
cd /opt/unibus
# a) Put the first admin into a local SQLite seed file.
./membershipd user add --db ./seed.db --handle root --sign-pub <ADMIN_SIGN_PUB_HEX> --role admin
# b) Bring up a TEMPORARY loopback, no-auth, single-node KV server on the cluster's
#    own JetStream store dir (not exposed; bus-auth off is allowed on 127.0.0.1).
./membershipd --store kv --bus-auth off --bind 127.0.0.1 \
  --nats-store ./local_files/jetstream --db ./seed.db >/tmp/seed-boot.log 2>&1 &
BOOT=$!; sleep 2
# c) Migrate the admin from SQLite into the replicated KV (loopback — no --ca needed).
./membershipd migrate-to-kv --db ./seed.db --nats-url nats://127.0.0.1:4250 --replicas 1
# d) Stop the bootstrap server. The KV buckets persist in ./local_files/jetstream.
kill "$BOOT"; wait "$BOOT" 2>/dev/null || true
rm -f ./seed.db
SEED

The KV written here lives in ./local_files/jetstream, which the cluster unit reuses (--nats-store default), so the admin is present when the enforce cluster starts. This loopback bootstrap is needed ONLY for the very first admin (the chicken-and-egg). Every user after that is added with the cluster live — no stop-seed-restart — via user add --store kv (see "Add users to the live cluster" below, report 0012).

Bring up (HUMAN)

CORRECTION (report 0012). The original instruction — "start magnus alone and verify healthz, then add the others" — is WRONG and will look like a hung deploy. A 3-node JetStream cluster forms a RAFT meta-group that needs a quorum (2 of 3) to elect a leader. A single started node has no quorum, so its JetStream meta never becomes current: --store kv blocks creating the KV buckets and /healthz never returns ok until a second node joins. Waiting for magnus to "go green" before starting the others therefore deadlocks the rollout.
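The quorum arithmetic behind this correction is the usual majority rule, floor(n/2) + 1. A minimal sketch follows; the function names are illustrative only.

```shell
# Majority quorum for an n-member RAFT group: floor(n/2) + 1.
quorum() { echo $(( $1 / 2 + 1 )); }

# How many members may be down while a leader can still be elected.
tolerated_failures() { echo $(( $1 - ($1 / 2 + 1) )); }

# can_elect UP TOTAL: succeeds when UP started members reach quorum.
can_elect() { [ "$1" -ge "$(quorum "$2")" ]; }

# Example:
#   can_elect 1 3 || echo "a lone node of 3 has no quorum"
```

For the 3-node cluster this gives a quorum of 2 and one tolerated failure, which is why a single started node (1 of 3) waits until a second node joins.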

Start the nodes so a quorum forms. On a clean cluster the simplest correct procedure is to start all three close together and let the meta-group converge:

# Start all three (order does not matter); each blocks on the others until a
# 2/3 quorum elects a JetStream meta leader, then the KV buckets are created.
# Hosts are the SSH aliases from the topology table (datardos is reached as dd).
for h in magnus homer dd; do ssh "$h" 'sudo systemctl enable --now membershipd-cluster'; done

# Only NOW does healthz return ok, once the meta-group has a leader (give it
# ~10-30s on a cold start). Poll, do not assume the first node is broken.
for h in magnus homer dd; do
  echo "== $h =="; ssh "$h" 'curl -fsS https://127.0.0.1:8470/healthz --cacert /opt/unibus/tls/ca.crt || echo "(not ready yet, needs quorum)"'
done

A staggered start also works, but only because membershipd's KV open RETRIES the bucket creation for a 120s bootstrap budget (issue 0006g, fix #3): the first node sits in that retry loop — NOT serving healthz — until the second node makes a quorum, then both converge and the third catches up. Either way, a lone node never self-serves; do not gate the next node's start on the previous one's healthz.
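Since healthz only turns green after quorum, a bounded poll is safer than a single probe. The helper below is a sketch with a hypothetical name (poll_until); the timeout in the usage example is an assumption loosely based on the 120s bootstrap budget mentioned above.

```shell
# poll_until TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
# Re-runs COMMAND until it succeeds, or gives up once the accumulated
# wait would exceed TIMEOUT_SECONDS.
poll_until() {
  local timeout=$1 interval=$2 waited=0
  shift 2
  while ! "$@"; do
    waited=$((waited + interval))
    [ "$waited" -gt "$timeout" ] && return 1
    sleep "$interval"
  done
  return 0
}

# Example (run AFTER all three units were started, never between them):
#   poll_until 120 5 ssh magnus \
#     'curl -fsS https://127.0.0.1:8470/healthz --cacert /opt/unibus/tls/ca.crt'
```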

A cold multi-node start only converges because of three cold-start fixes (report 0011): route pooling off (PoolSize=-1), NoAdvertise=true (Docker bridge IPs not gossiped), and the KV-open retry loop above. Without them the meta-group re-elects leaders forever and bucket creation hangs. If a fresh cluster will not form, confirm the running binary contains these fixes before touching config.

Promote an existing single-node (SQLite) deployment (HUMAN, optional)

Instead of seeding fresh, you can migrate an existing single-node unibus.db into the KV — loopback only (the allowlist would otherwise travel cleartext; the command refuses a remote target without --ca). Use the same loopback-bootstrap shape as the seed step (temporary --bus-auth off server on 127.0.0.1, then migrate-to-kv --db /opt/unibus/local_files/unibus.db).

Verify

# Posture on every node: all must be enforce+acl+tls+cluster, store=kv.
# Use the SSH aliases from the topology table (homer and dd log in as ubuntu, not root).
for h in magnus homer dd; do
  echo "== $h =="
  ssh "$h" 'curl -fsS https://127.0.0.1:8470/healthz --cacert /opt/unibus/tls/ca.crt'
done

# Cluster + JetStream meta-group health (needs the `nats` CLI on a node):
ssh root@magnus 'nats --server nats://127.0.0.1:4250 server report jetstream'
ssh root@magnus 'nats --server nats://127.0.0.1:4250 server list'   # 3 servers, routes up

A healthy cluster shows 3 routed servers and a JetStream meta-group with a leader.

Add users to the live cluster (HUMAN — user add --store kv)

With the cluster up, add (and revoke) bus users without stopping anything, directly against the replicated KV allowlist. This replaces the stop-seed-restart procedure the original runbook implied for every user beyond the first admin.

The mechanism is the cluster's own privileged internal connection: under enforce every bus user is confined by the per-subject ACL to its own rooms, so no ordinary identity may write the control-plane buckets. The only identity the authenticator grants full JetStream permissions is membershipd's internal service identity. The unit persists that identity to ${INTERNAL_ID_FILE} (/opt/unibus/secrets/internal.id, 0600) via --internal-id-file, so the same key is available to the CLI. Run the CLI on a node, over loopback (the data-plane TLS cert SAN covers 127.0.0.1); reading the identity file requires root on that node, which already implies full control of it, so this adds no practical exposure.

# Add a member to the live cluster's replicated allowlist (run on any node).
ssh root@magnus 'sudo /opt/unibus/membershipd user add --store kv \
  --handle alice --role member --sign-pub <64-hex-ed25519-pub>'
#   -> added user "alice" (...) role=member
#   -> KV_UNIBUS_users: leader=<node> followers_current=2/2 msgs=N   (replicated, HA)

# List / revoke against the same live KV:
ssh root@magnus 'sudo /opt/unibus/membershipd user list   --store kv'
ssh root@magnus 'sudo /opt/unibus/membershipd user revoke --store kv <64-hex-ed25519-pub>'

Defaults assume an on-node invocation (--nats-url nats://127.0.0.1:4250, --internal-id-file /opt/unibus/secrets/internal.id, --ca /opt/unibus/tls/ca.crt, --kv-replicas 3). Semantics:

  • Idempotent / non-destructive: re-adding the same key is an explicit already registered error (exit 1), never a silent overwrite — a re-add cannot flip a member to admin. To replace a user, revoke then add.
  • HA: the write commits through the JetStream quorum, so it succeeds even with one node down (2/3); the printed followers_current shows replication.
  • No hard delete: revoke flips status to revoked (denied on both planes, auditable); the KV has no row deletion, matching the SQLite store.
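Because a re-add is an error rather than an overwrite, it is worth checking the key's shape before calling user add. The sketch below validates the 64-hex-character form of the --sign-pub placeholder; the function name is hypothetical, and whether the CLI itself accepts upper-case hex is an assumption.

```shell
# Succeeds only for exactly 64 hexadecimal characters (a 32-byte key in hex).
# This checks the shape only, not that the bytes are a valid ed25519 key.
is_sign_pub_hex() {
  case $1 in
    *[!0-9a-fA-F]*) return 1 ;;   # contains a non-hex character
  esac
  [ "${#1}" -eq 64 ]
}

# Example:
#   is_sign_pub_hex "$PUB" || { echo "bad --sign-pub" >&2; exit 1; }
```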

Rollout note (report 0012): the live verification deployed this binary + --internal-id-file to datardos only (the non-critical node). magnus and homer still run the 0011 binary. Rolling the new binary and the updated unit to all three is recommended but not urgent, since the posture is identical either way. Roll it with backups, one node at a time, verifying healthz between each:

for h in homer magnus; do
  ssh "$h" 'sudo cp -a /opt/unibus/membershipd /opt/unibus/membershipd.bak'   # backup
  scp build/membershipd "$h:/tmp/m" && ssh "$h" 'sudo install -o ubuntu -g ubuntu -m0775 /tmp/m /opt/unibus/membershipd'
  # add INTERNAL_ID_FILE=/opt/unibus/secrets/internal.id to /opt/unibus/cluster.env
  # add `--internal-id-file ${INTERNAL_ID_FILE} \` to the unit before `--store kv`
  ssh "$h" 'sudo systemctl daemon-reload && sudo systemctl restart membershipd-cluster'
  ssh "$h" 'curl -fsS https://127.0.0.1:8470/healthz --cacert /opt/unibus/tls/ca.crt'  # green before next
done

(deploy-cluster.sh + the unit template already emit INTERNAL_ID_FILE and the flag, so a fresh ./deploy-cluster.sh --yes is correct for all three.)

Replication: go straight to R3 (HUMAN — real HA)

CORRECTION (report 0012). The original "start at R1, then scale to R3" plan assumed R1 is a usable interim state. It is not, in this cluster. At R1 all six control-plane buckets (KV_UNIBUS_users/rooms/members/room_keys/rooms_by_member + KV_UNIBUS_nonces) live on a SINGLE node, a hard SPOF for authentication: if that node dies, the nonce/KV control plane is unreachable and EVERY authenticated request fails closed (auth DoS). Worse, the cold multi-node start only converges at all because of the three cold-start fixes (see "Bring up"); the real deploy never ran a healthy R1 and jumped straight to R3 once the cluster formed. Treat R1 as a transient artifact of bucket creation, not a milestone.

The deployed config already sets KV_REPLICAS=3 in nodes.env. If buckets were created at R1 (e.g. only one node was up when --store kv first opened them), raise every control-plane stream to R3 IN PLACE (no data loss) once all three nodes are routed:

for s in KV_UNIBUS_users KV_UNIBUS_rooms KV_UNIBUS_members KV_UNIBUS_room_keys \
         KV_UNIBUS_rooms_by_member KV_UNIBUS_nonces; do
  ssh root@magnus "nats --server nats://127.0.0.1:4250 stream update $s --replicas 3 -f"
done
# (also OBJ_UNIBUS_blobs if the object store is in use)

After this each bucket shows followers_current=2/2 (quorum 2/3). The user add --store kv command prints that figure for KV_UNIBUS_users on every add, which is a cheap live HA check.
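The followers_current figure can be turned into a pass/fail check. The sketch below parses the line format shown in the user add example above (followers_current=A/B); the function name is hypothetical, and treating a bucket with zero followers as not replicated is an assumption about what an R1 bucket would print.

```shell
# Reads CLI output on stdin. Succeeds if a followers_current=A/B field is
# present, B is at least 1, and A equals B (every follower is current).
followers_current_ok() {
  local field cur total
  field=$(grep -Eo 'followers_current=[0-9]+/[0-9]+' | head -n 1 || true)
  [ -n "$field" ] || return 1          # field missing from the output
  field=${field#followers_current=}    # keep "A/B"
  cur=${field%/*}
  total=${field#*/}
  [ "$total" -ge 1 ] && [ "$cur" -eq "$total" ]
}

# Example:
#   ssh root@magnus 'sudo /opt/unibus/membershipd user list --store kv' \
#     | followers_current_ok || echo "replication not caught up"
```

Whether user list prints the same figure as user add is not stated here, so the usage line is illustrative only.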

Chaos test (HUMAN — requires the 3 live VPS)

Validate quorum tolerance after R3:

# Kill one node; the cluster keeps serving (quorum 2/3). On ubuntu nodes use sudo.
ssh dd 'sudo systemctl stop membershipd-cluster'
#   -> clients fail over (multiple seed URLs); reads/writes still succeed.
ssh dd 'sudo systemctl start membershipd-cluster'   # rejoins, catches up

# Kill two nodes; quorum is LOST — the control plane should fail CLOSED (deny),
# never fail open. Verify a request is rejected, not silently served.

Validated (report 0012). The 0011 chaos run checked only the control plane (healthz + meta/stream-leader failover + KV readable with 2/3). Report 0012 added the missing data-plane proofs against the live cluster: a real authenticated client (cmd/clientcheck, operator identity, nkey+TLS) creating an E2E room and publishing/subscribing — including a node stopped mid-stream, where the client failed over to a survivor and kept receiving with zero loss (quorum 2/3) — and user add --store kv committing with one node (the KV leader) down. The kill-2/3 fail-closed case remains a documented manual step.

Rollback

membershipd does not delete data. To revert a node to standalone SQLite, stop the unit and start it without --store kv/--cluster-name; the KV buckets remain for a later retry. To rotate the cluster CA, re-run generate-cluster-certs.sh --force and re-stage (every node must get the new cluster-ca.crt together).

NATS server metrics (loopback monitoring — optional)

The embedded NATS server can expose its own monitoring HTTP endpoint so a local scraper reads server-level metrics that /healthz does not surface: msgs/s, connections, slow consumers, memory, KV bucket message counts, the RAFT leader per stream and per-stream restarts. This feeds the unibus-nats dashboard in fleet_monitoring (the scraper hits 127.0.0.1:8222/varz|/connz|/jsz over loopback and pushes to VictoriaMetrics).

The endpoint is opened by the dedicated environment toggle UNIBUS_NATS_MONITOR=1 (0.11.0+ binary). It is decoupled from UNIBUS_NATS_DEBUG: it opens the monitoring endpoint WITHOUT enabling the verbose nats-server debug log, so no room subjects or routing metadata leak to journald (keeps the hardened posture, issue 0007). The endpoint binds 127.0.0.1:8222 only — the binary hardcodes the loopback bind, so it is never reachable from the network and needs no auth. Never use UNIBUS_NATS_DEBUG in production just to get the endpoint.
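The loopback-only claim can be verified on a node from the listening-socket table. The sketch below assumes the column layout of `ss -ltnH` (state, recv-q, send-q, local address:port, peer) and uses a hypothetical function name.

```shell
# Reads `ss -ltnH` output on stdin. Succeeds only if something listens on
# PORT and every such listener is bound to 127.0.0.1.
loopback_only() {
  awk -v port="$1" '
    $4 ~ (":" port "$") {
      seen = 1
      if ($4 != "127.0.0.1:" port) exposed = 1
    }
    END { exit ((seen && !exposed) ? 0 : 1) }
  '
}

# Example (on the node itself):
#   ss -ltnH | loopback_only 8222 && echo "monitoring is loopback only"
```

A failure means either nothing listens on the port (flag not set, or an older binary) or a non-loopback bind exists; the two cases are not distinguished here.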

Enable it (HUMAN — requires the 0.11.0+ binary on the node)

The clean way is the additive systemd drop-in in this directory:

# On each node, AFTER the 0.11.0+ binary is in /opt/unibus/membershipd:
ssh <node> 'sudo mkdir -p /etc/systemd/system/membershipd-cluster.service.d'
scp membershipd-cluster.service.d/nats-monitor.conf <node>:/tmp/nats-monitor.conf
ssh <node> 'sudo cp /tmp/nats-monitor.conf /etc/systemd/system/membershipd-cluster.service.d/ \
  && sudo systemctl daemon-reload && sudo systemctl restart membershipd-cluster'

(Equivalently, add UNIBUS_NATS_MONITOR=1 to /opt/unibus/cluster.env, which the unit already sources via EnvironmentFile; the drop-in is preferred because it is self-documenting and does not edit the generated env file.)
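The versioned nats-monitor.conf is not reproduced in this README. Judging by the directive named in the commit description (Environment=UNIBUS_NATS_MONITOR=1), it presumably amounts to a two-line drop-in; the sketch below writes such a file and is an assumption about its contents, not a copy of it. The function name is hypothetical.

```shell
# Writes a minimal drop-in into DIR (defaults to the unit's drop-in
# directory). Assumed contents; prefer the versioned file when available.
write_nats_monitor_dropin() {
  local dir=${1:-/etc/systemd/system/membershipd-cluster.service.d}
  mkdir -p "$dir" || return 1
  cat > "$dir/nats-monitor.conf" <<'EOF'
[Service]
Environment=UNIBUS_NATS_MONITOR=1
EOF
}

# Example (needs root for the default directory; reload and restart after):
#   write_nats_monitor_dropin && systemctl daemon-reload
```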

Rolling restart with the R3 reconvergence gate (CRITICAL)

systemctl restart membershipd-cluster restarts that node's JetStream RAFT member. Never restart two nodes at once — that would drop the cluster below quorum (2/3) and fail the control plane closed. Roll one node at a time, in the order magnus → homer → datardos, and between each node wait until the cluster has reconverged to R3 (every control-plane bucket back to followers_current=2/2):

# After restarting ONE node, gate on R3 reconvergence before touching the next.
# The quoted heredoc keeps the jq filter's quotes and \( ) interpolation intact.
ssh root@magnus 'bash -s' <<'GATE'
for s in KV_UNIBUS_users KV_UNIBUS_rooms KV_UNIBUS_members \
         KV_UNIBUS_room_keys KV_UNIBUS_rooms_by_member KV_UNIBUS_nonces; do
  nats --server nats://127.0.0.1:4250 stream info "$s" -j \
    | jq -r --arg s "$s" '"\($s): leader=\(.cluster.leader) followers_current=\([.cluster.replicas[]? | select(.current)] | length)/\(.cluster.replicas | length)"'
done
GATE
# .cluster.replicas lists the followers only (the leader is reported separately),
# so an R3 bucket shows 2 entries. Proceed to the next node ONLY when all six
# show a leader and followers_current=2/2. Also confirm healthz is green on the
# just-restarted node first:
ssh <node> 'curl -fsS https://127.0.0.1:8470/healthz --cacert /opt/unibus/tls/ca.crt'

This restart is normally not done as a standalone step: the 0.11.0 binary that carries the flag is rolled to the three nodes in the consolidated rollout, and the drop-in is installed during that same rolling restart.
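The one-node-at-a-time rule with a gate between nodes can be sketched as a small driver that stops at the first failure. The function names are hypothetical; the restart and gate steps are passed in as callbacks so the skeleton makes no assumption about how each is implemented.

```shell
# rolling_restart RESTART_FN GATE_FN NODE...
# For each node in order: run RESTART_FN node, then GATE_FN node. Stops at
# the first failure so that at most one node is ever down at a time.
rolling_restart() {
  local restart_fn=$1 gate_fn=$2 node
  shift 2
  for node in "$@"; do
    "$restart_fn" "$node" || { echo "restart failed on $node" >&2; return 1; }
    "$gate_fn" "$node"    || { echo "gate not met after $node, stopping" >&2; return 1; }
  done
}

# Example wiring (callbacks are placeholders to fill in):
#   restart_node() { ssh "$1" 'sudo systemctl restart membershipd-cluster'; }
#   gate_r3()      { :; }  # e.g. poll until every bucket shows 2/2 followers
#   rolling_restart restart_node gate_r3 magnus homer dd
```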