# unibus cluster — 3-node deploy runbook (issue 0006g)

This directory holds the material to bring up unibus as a **3-node cluster**
(`magnus` + `homer` + `datardos`) for real HA: with **R3** replication the control
plane (rooms/members/keys/users on JetStream KV + the anti-replay nonce bucket)
survives the loss of any one node (quorum 2/3).

> **Status: this cluster is DEPLOYED in production** (magnus + homer + datardos,
> R3, enforce+ACL+TLS) — see report 0011. The runbook below was authored before any
> VPS existed and has since been **corrected against the real deploy** (report 0012):
> the start ordering, the R1→R3 reality, and the live user-add path were all wrong
> or missing. Steps that change a remote host are marked **HUMAN**; `deploy-cluster.sh`
> still defaults to a dry run.

## Files

| File | What it is |
|---|---|
| `nodes.env` | Topology: cluster name, ports, and the per-node rows (name, ssh host, public IP, WG IP). **HUMAN fills the placeholders.** |
| `generate-cluster-certs.sh` | Mints a **separate cluster route CA** + a route cert per node, and a data-plane server cert per node signed by the **client CA** (`../tls/ca.*`). |
| `membershipd-cluster.service` | One systemd unit, parameterized per node by `/opt/unibus/cluster.env`. enforce + per-subject ACL + TLS + `--store kv`, `Restart=always`. |
| `deploy-cluster.sh` | Cross-builds the linux binary, generates each node's `cluster.env`, and (with `--yes`) rsyncs everything + installs the unit. Staggered start is manual. |

Generated keys/secrets (`out/`, `build/`, `secrets/`) are **gitignored** — they are
secret and never leave the operator's trusted machine except over the secure
rsync channel.

## Topology (as deployed, report 0011)

| Node | SSH | Public IP | Role |
|---|---|---|---|
| magnus | `magnus` (root) | `135.125.201.30` | node — **= organic-machine.com = `om`**, the critical host (caddy + gitea + registry-api + monitoring); the bus runs alongside, untouched |
| homer | `homer` (ubuntu+sudo) | `141.94.69.66` | node |
| datardos | `dd` (ubuntu+sudo) | `51.91.100.142` | node |

`ROUTE_NETWORK=public`, **not `wg`**: there is no WireGuard mesh between the three
nodes (homer and datardos do not even have the `wg` binary; om's only WG peers are
the operator's PCs). The server-to-server routes therefore travel over the public
IPs, protected by the **separate cluster route CA** (mutual route TLS) — a client
data-plane cert can never be presented to the route port. The client data plane and
the HTTP control plane are also reached over the public IPs. There is no fixed
"seed" node: with R3 the three are peers (see "Bring up" for why a lone node cannot
self-serve).

## Prerequisites (HUMAN, once)

1. **Fill `nodes.env`** — replace every `<PLACEHOLDER>` (magnus public IP, all WG
   IPs). The scripts refuse to run while any remain.
2. **Client CA exists** — `../tls/ca.crt` + `../tls/ca.key`. If not, run
   `../tls/generate-certs.sh` on the CA host (om) first. The cluster reuses this CA
   for the data plane so existing clients keep trusting the bus.
3. **Mint cluster TLS**:
   ```bash
   ./generate-cluster-certs.sh   # writes out/<name>/ ; --force to rotate the cluster CA
   ```
4. **Create the route secret** (out of argv, shared by all nodes):
   ```bash
   mkdir -p secrets && openssl rand -hex 32 > secrets/cluster.pass
   ```
5. **SSH** to each node's SSH host works with the login listed under "Topology":
   `root` on magnus, `ubuntu` with sudo on homer and datardos
   (`ssh magnus true`, `ssh dd true`, ...).

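The placeholder rule in step 1 can also be checked by hand before running any script. The following is a minimal sketch, not the check the scripts themselves perform: it assumes placeholders are uppercase tokens in angle brackets (for example `<MAGNUS_PUBLIC_IP>`, a made-up name), and `find_placeholders` is a hypothetical helper.

```bash
# Hypothetical helper: list unfilled <PLACEHOLDER> tokens in a file.
# Prints "line:token" for every hit; returns 1 if any remain, 0 if none.
find_placeholders() {
  file=$1
  if grep -nEo '<[A-Z][A-Z0-9_]*>' "$file"; then
    return 1
  fi
  return 0
}

# Usage (assumed file name from the Files table):
#   find_placeholders nodes.env || echo "fill nodes.env first"
```
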
## Stage the nodes

```bash
./deploy-cluster.sh          # DRY RUN — prints the full plan, touches nothing
./deploy-cluster.sh --yes    # HUMAN: actually rsync + install the unit on all 3 nodes
```

This cross-builds `membershipd` (linux/amd64, `CGO_ENABLED=0`), writes each node's
`cluster.env` (its `NODE_NAME` and the `--routes` to the OTHER two nodes), and
ships the binary, the node's TLS material, the secret, the env file and the unit.
It does **not** start anything.

## Seed the first admin into the KV (HUMAN — loopback bootstrap)

The empty KV control plane has no users, and under `enforce` no external tool can
write the FIRST admin over NATS (it would need to be an admin already — a
chicken-and-egg problem). The `user` CLI also writes only to a local SQLite file,
not the KV. So the first admin is seeded on a single node (magnus in the example
below) through a **loopback, no-auth bootstrap** that populates the same JetStream
store the cluster unit then reuses:

```bash
ssh root@magnus 'bash -s' <<'SEED'
set -euo pipefail
cd /opt/unibus
# a) Put the first admin into a local SQLite seed file.
./membershipd user add --db ./seed.db --handle root --sign-pub <ADMIN_SIGN_PUB_HEX> --role admin
# b) Bring up a TEMPORARY loopback, no-auth, single-node KV server on the cluster's
#    own JetStream store dir (not exposed; bus-auth off is allowed on 127.0.0.1).
./membershipd --store kv --bus-auth off --bind 127.0.0.1 \
  --nats-store ./local_files/jetstream --db ./seed.db >/tmp/seed-boot.log 2>&1 &
BOOT=$!; sleep 2
# c) Migrate the admin from SQLite into the replicated KV (loopback — no --ca needed).
./membershipd migrate-to-kv --db ./seed.db --nats-url nats://127.0.0.1:4250 --replicas 1
# d) Stop the bootstrap server. The KV buckets persist in ./local_files/jetstream.
kill "$BOOT"; wait "$BOOT" 2>/dev/null || true
rm -f ./seed.db
SEED
```

> The KV written here lives in `./local_files/jetstream`, which the cluster unit
> reuses (`--nats-store` default), so the admin is present when the enforce cluster
> starts. This loopback bootstrap is needed ONLY for the very first admin (the
> chicken-and-egg). **Every user after that is added with the cluster live** — no
> stop-seed-restart — via `user add --store kv` (see "Add users to the live
> cluster" below, report 0012).

## Bring up (HUMAN)

> **CORRECTION (report 0012).** The original instruction — "start magnus alone and
> verify healthz, then add the others" — is **WRONG and will look like a hung
> deploy.** A 3-node JetStream cluster forms a RAFT meta-group that needs a quorum
> (2 of 3) to elect a leader. A single started node has no quorum, so its JetStream
> meta never becomes current: `--store kv` blocks creating the KV buckets and
> **`/healthz` never returns ok** until a second node joins. Waiting for magnus to
> "go green" before starting the others therefore deadlocks the rollout.

Start the nodes so a quorum forms. On a **clean cluster** the simplest correct
procedure is to start all three close together and let the meta-group converge:

```bash
# Start all three (order does not matter); each blocks on the others until a
# 2/3 quorum elects a JetStream meta leader, then the KV buckets are created.
for h in magnus homer datardos; do ssh "$h" 'sudo systemctl enable --now membershipd-cluster'; done

# Only NOW does healthz return ok — once the meta-group has a leader (give it
# ~10-30s on a cold start). Poll, do not assume the first node is broken.
for h in magnus homer datardos; do
  echo "== $h =="; ssh "$h" 'curl -fsS https://127.0.0.1:8470/healthz --cacert /opt/unibus/tls/ca.crt || echo "(not ready yet — needs quorum)"'
done
```

A **staggered** start also works, but only because `membershipd`'s KV open RETRIES
the bucket creation for a 120s bootstrap budget (issue 0006g, fix #3): the first
node sits in that retry loop — NOT serving healthz — until the second node makes a
quorum, then both converge and the third catches up. Either way, a lone node never
self-serves; do not gate the next node's start on the previous one's healthz.

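Because healthz only turns green once a quorum exists, a bounded poll is more informative than a single probe. A minimal sketch of such a poll follows; `wait_until` is a hypothetical helper, and the 60 attempts and 2 s interval in the usage comment are illustrative values, not tuned ones.

```bash
# Hypothetical helper: retry a command until it succeeds or attempts run out.
# usage: wait_until <attempts> <sleep_seconds> <command> [args...]
wait_until() {
  local attempts=$1 delay=$2 i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0              # command succeeded
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1                  # gave up: still not ready
}

# Usage (run after ALL nodes have been started, never as a gate between them):
#   wait_until 60 2 ssh magnus \
#     'curl -fsS https://127.0.0.1:8470/healthz --cacert /opt/unibus/tls/ca.crt'
```
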
> A cold multi-node start only converges because of **three cold-start fixes**
> (report 0011): route pooling off (`PoolSize=-1`), `NoAdvertise=true` (Docker
> bridge IPs not gossiped), and the KV-open retry loop above. Without them the
> meta-group re-elects leaders forever and bucket creation hangs. If a fresh
> cluster will not form, confirm the running binary contains these fixes before
> touching config.

## Promote an existing single-node (SQLite) deployment (HUMAN, optional)

Instead of seeding fresh, you can migrate an existing single-node `unibus.db` into
the KV — **loopback only** (the allowlist would otherwise travel cleartext; the
command refuses a remote target without `--ca`). Use the same loopback-bootstrap
shape as the seed step (temporary `--bus-auth off` server on 127.0.0.1, then
`migrate-to-kv --db /opt/unibus/local_files/unibus.db`).

## Verify

```bash
# Posture on every node — all must be enforce+acl+tls+cluster, store=kv.
# Uses each node's own SSH login (see "Topology"); the probe itself needs no root.
for h in magnus homer datardos; do
  echo "== $h =="
  ssh "$h" 'curl -fsS https://127.0.0.1:8470/healthz --cacert /opt/unibus/tls/ca.crt'
done

# Cluster + JetStream meta-group health (needs the `nats` CLI on a node):
ssh root@magnus 'nats --server nats://127.0.0.1:4250 server report jetstream'
ssh root@magnus 'nats --server nats://127.0.0.1:4250 server list'   # 3 servers, routes up
```

A healthy cluster shows 3 routed servers and a JetStream meta-group with a leader.

## Add users to the live cluster (HUMAN — `user add --store kv`)

With the cluster up, add (and revoke) bus users **without stopping anything**,
directly against the replicated KV allowlist. This replaces the stop-seed-restart
procedure the original runbook implied for every user beyond the first admin.

The mechanism is the cluster's own **privileged internal connection**: under
`enforce` every bus user is confined by the per-subject ACL to its own rooms, so no
ordinary identity may write the control-plane buckets. The only identity to which
the authenticator grants full JetStream permissions is `membershipd`'s internal
service identity. The unit persists that identity to `${INTERNAL_ID_FILE}`
(`/opt/unibus/secrets/internal.id`, 0600) via `--internal-id-file`, so the same key
is available to the CLI. Run the CLI **on a node, over loopback** (the data-plane
TLS cert SAN covers `127.0.0.1`); reading the identity file requires root on that
node, which already implies full control of it, so this adds no practical exposure.

```bash
# Add a member to the live cluster's replicated allowlist (run on any node).
ssh root@magnus 'sudo /opt/unibus/membershipd user add --store kv \
  --handle alice --role member --sign-pub <64-hex-ed25519-pub>'
# -> added user "alice" (...) role=member
# -> KV_UNIBUS_users: leader=<node> followers_current=2/2 msgs=N (replicated, HA)

# List / revoke against the same live KV:
ssh root@magnus 'sudo /opt/unibus/membershipd user list --store kv'
ssh root@magnus 'sudo /opt/unibus/membershipd user revoke --store kv <64-hex-ed25519-pub>'
```

Defaults assume an on-node invocation (`--nats-url nats://127.0.0.1:4250`,
`--internal-id-file /opt/unibus/secrets/internal.id`, `--ca /opt/unibus/tls/ca.crt`,
`--kv-replicas 3`). Semantics:

- **Non-destructive (safe to retry)**: re-adding the same key is an explicit
  `already registered` error (exit 1), never a silent overwrite — a re-add cannot
  flip a member to admin. To replace a user, `revoke` then add.
- **HA**: the write commits through the JetStream quorum, so it succeeds even with
  one node down (2/3); the printed `followers_current` shows replication.
- **No hard delete**: `revoke` flips status to `revoked` (denied on both planes,
  auditable); the KV has no row deletion, matching the SQLite store.

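Since an entry is never hard-deleted, a mistyped key stays in the allowlist as a revoked record, so it can be worth checking the key's shape before calling `user add`. A minimal sketch, assuming only what the placeholder above states (64 hex characters, the encoding of a 32-byte Ed25519 public key); `is_sign_pub` is a hypothetical helper and does not reproduce whatever validation `membershipd` itself performs.

```bash
# Hypothetical pre-flight check: is $1 exactly 64 hexadecimal characters?
is_sign_pub() {
  key=$1
  [ "${#key}" -eq 64 ] || return 1   # wrong length
  case $key in
    *[!0-9a-fA-F]*) return 1 ;;      # contains a non-hex character
  esac
  return 0
}

# Usage: is_sign_pub "$PUB" || { echo "not a 64-hex key" >&2; exit 1; }
```
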
> **Rollout note (report 0012):** the live verification deployed this binary +
> `--internal-id-file` to **datardos only** (the non-critical node). magnus and
> homer still run the 0011 binary. To make the capability (and the updated unit)
> available on all three — recommended, the posture is identical so there is no
> urgency — roll the new binary with backups, one node at a time, verifying healthz
> between each:
> ```bash
> for h in homer magnus; do
>   ssh "$h" 'sudo cp -a /opt/unibus/membershipd /opt/unibus/membershipd.bak'   # backup
>   scp build/membershipd "$h:/tmp/m" && ssh "$h" 'sudo install -o ubuntu -g ubuntu -m0775 /tmp/m /opt/unibus/membershipd'
>   # add INTERNAL_ID_FILE=/opt/unibus/secrets/internal.id to /opt/unibus/cluster.env
>   # add `--internal-id-file ${INTERNAL_ID_FILE} \` to the unit before `--store kv`
>   ssh "$h" 'sudo systemctl daemon-reload && sudo systemctl restart membershipd-cluster'
>   ssh "$h" 'curl -fsS https://127.0.0.1:8470/healthz --cacert /opt/unibus/tls/ca.crt'   # green before next
> done
> ```
> (`deploy-cluster.sh` + the unit template already emit `INTERNAL_ID_FILE` and the
> flag, so a fresh `./deploy-cluster.sh --yes` is correct for all three.)

## Replication: go straight to R3 (HUMAN — real HA)

> **CORRECTION (report 0012).** The original "start at R1, then scale to R3" plan
> assumed R1 is a usable interim state. **It is not, in this cluster.** At R1 all six
> control-plane buckets (`KV_UNIBUS_users/rooms/members/room_keys/rooms_by_member`
> + `KV_UNIBUS_nonces`) live on a SINGLE node — a hard **SPOF for authentication**:
> if that node dies, the nonce/KV control plane is unreachable and EVERY
> authenticated request fails closed (auth DoS). Worse, the cold multi-node start
> only converges at all because of the three cold-start fixes (see "Bring up"); the
> real deploy never ran a healthy R1 and **jumped straight to R3 once the cluster
> formed.** Treat R1 as a transient artifact of bucket creation, not a milestone.

The deployed config already sets `KV_REPLICAS=3` in `nodes.env`. If buckets were
created at R1 (e.g. only one node was up when `--store kv` first opened them), raise
every control-plane stream to R3 IN PLACE (no data loss) once all three nodes are
routed:

```bash
for s in KV_UNIBUS_users KV_UNIBUS_rooms KV_UNIBUS_members KV_UNIBUS_room_keys \
         KV_UNIBUS_rooms_by_member KV_UNIBUS_nonces; do
  ssh root@magnus "nats --server nats://127.0.0.1:4250 stream update $s --replicas 3 -f"
done
# (also OBJ_UNIBUS_blobs if the object store is in use)
```

After this each bucket shows `followers_current=2/2` (quorum 2/3). The
`user add --store kv` command prints that figure for `KV_UNIBUS_users` on every add,
which is a cheap live HA check.

## Chaos test (HUMAN — requires the 3 live VPS)

Validate quorum tolerance after R3:

```bash
# Kill one node; the cluster keeps serving (quorum 2/3). On ubuntu nodes use sudo.
ssh dd 'sudo systemctl stop membershipd-cluster'
# -> clients fail over (multiple seed URLs); reads/writes still succeed.
ssh dd 'sudo systemctl start membershipd-cluster'   # rejoins, catches up

# Kill two nodes; quorum is LOST — the control plane should fail CLOSED (deny),
# never fail open. Verify a request is rejected, not silently served.
```

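Both cases follow from majority-quorum arithmetic: a RAFT group of n peers needs n/2 + 1 of them (integer division), so it tolerates the loss of the remainder. A small sketch of that arithmetic; `quorum` and `tolerated_failures` are hypothetical names used only for illustration.

```bash
# Majority quorum for a group of $1 peers (integer division).
quorum() { echo $(( $1 / 2 + 1 )); }

# How many peers can be down while a quorum still exists.
tolerated_failures() { echo $(( $1 - ($1 / 2 + 1) )); }
```

For the three-node cluster, `quorum 3` prints `2` and `tolerated_failures 3` prints `1`, which is why stopping one node is survivable and stopping two is not.
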
> **Validated (report 0012).** The 0011 chaos run checked only the control plane
> (healthz + meta/stream-leader failover + KV readable with 2/3). Report 0012 added
> the missing data-plane proofs against the live cluster: a real authenticated
> client (`cmd/clientcheck`, operator identity, nkey+TLS) creating an E2E room and
> publishing/subscribing — including a node stopped mid-stream, where the client
> failed over to a survivor and kept receiving with zero loss (quorum 2/3) — and
> `user add --store kv` committing with one node (the KV leader) down. The kill-2/3
> fail-closed case remains a documented manual step.

## Rollback

`membershipd` does not delete data. To revert a node to standalone SQLite, stop
the unit and start it without `--store kv`/`--cluster-name`; the KV buckets remain
for a later retry. To rotate the cluster CA, re-run
`generate-cluster-certs.sh --force` and re-stage (every node must get the new
`cluster-ca.crt` together).

## NATS server metrics (loopback monitoring — optional)

The embedded NATS server can expose its own monitoring HTTP endpoint so a local
scraper reads server-level metrics that `/healthz` does not surface: msgs/s,
connections, slow consumers, memory, KV bucket message counts, the RAFT leader per
stream and per-stream restarts. This feeds the `unibus-nats` dashboard in
`fleet_monitoring` (the scraper hits `127.0.0.1:8222/varz|/connz|/jsz` over
loopback and pushes to VictoriaMetrics).

The endpoint is opened by the **dedicated** environment toggle `UNIBUS_NATS_MONITOR=1`
(0.11.0+ binary). It is **decoupled** from `UNIBUS_NATS_DEBUG`: it opens the
monitoring endpoint WITHOUT enabling the verbose nats-server debug log, so no room
subjects or routing metadata leak to journald (keeps the hardened posture, issue
0007). The endpoint binds `127.0.0.1:8222` **only** — the binary hardcodes the
loopback bind, so it is never reachable from the network and needs no auth. Never
use `UNIBUS_NATS_DEBUG` in production just to get the endpoint.

### Enable it (HUMAN — requires the 0.11.0+ binary on the node)

The clean way is the additive systemd drop-in in this directory:

```bash
# On each node, AFTER the 0.11.0+ binary is in /opt/unibus/membershipd:
ssh <node> 'sudo mkdir -p /etc/systemd/system/membershipd-cluster.service.d'
scp membershipd-cluster.service.d/nats-monitor.conf <node>:/tmp/nats-monitor.conf
ssh <node> 'sudo cp /tmp/nats-monitor.conf /etc/systemd/system/membershipd-cluster.service.d/ \
  && sudo systemctl daemon-reload && sudo systemctl restart membershipd-cluster'
```

(Equivalently, add `UNIBUS_NATS_MONITOR=1` to `/opt/unibus/cluster.env`, which the
unit already sources via `EnvironmentFile`; the drop-in is preferred because it is
self-documenting and does not edit the generated env file.)

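The contents of `nats-monitor.conf` are not reproduced in this README. For orientation only, a drop-in that does nothing but set the toggle would look roughly like the file written by the sketch below; `write_monitor_dropin` is a hypothetical helper, and the shipped file may carry extra comments or settings.

```bash
# Hypothetical: write a toggle-only drop-in named nats-monitor.conf into dir $1.
# A [Service] section header is required for Environment= to take effect.
write_monitor_dropin() {
  dir=$1
  mkdir -p "$dir"
  cat > "$dir/nats-monitor.conf" <<'EOF'
[Service]
Environment=UNIBUS_NATS_MONITOR=1
EOF
}

# Usage (local scratch directory, for inspection only):
#   write_monitor_dropin ./scratch/membershipd-cluster.service.d
```
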
### Rolling restart with the R3 reconvergence gate (CRITICAL)

`systemctl restart membershipd-cluster` restarts that node's JetStream RAFT member.
**Never restart two nodes at once** — that would drop the cluster below quorum
(2/3) and fail the control plane closed. Roll **one node at a time**, in the order
`magnus → homer → datardos`, and between each node wait until the cluster has
reconverged to R3 (every control-plane bucket back to `followers_current=2/2`):

```bash
# After restarting ONE node, gate on R3 reconvergence before touching the next.
# The quoted heredoc runs the loop on the node with no local expansion, which
# avoids nested-quote escaping of the jq program. In `stream info -j`,
# .cluster.replicas lists the followers only; the leader is reported separately.
ssh root@magnus 'bash -s' <<'GATE'
for s in KV_UNIBUS_users KV_UNIBUS_rooms KV_UNIBUS_members \
         KV_UNIBUS_room_keys KV_UNIBUS_rooms_by_member KV_UNIBUS_nonces; do
  nats --server nats://127.0.0.1:4250 stream info "$s" -j \
    | jq -r --arg s "$s" '"\($s): leader=\(.cluster.leader) followers_current=\([.cluster.replicas[]? | select(.current)] | length)/\(.cluster.replicas | length)"'
done
GATE
# Proceed to the next node ONLY when all six show a leader and
# followers_current=2/2. Also confirm healthz is green on the just-restarted
# node first:
ssh <node> 'curl -fsS https://127.0.0.1:8470/healthz --cacert /opt/unibus/tls/ca.crt'
```

This restart is normally **not** done as a standalone step: the 0.11.0 binary that
carries the flag is rolled to the three nodes in the consolidated rollout, and the
drop-in is installed during that same rolling restart.

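After a node has been restarted with the toggle, a quick on-node check is that the three monitoring paths answer over loopback. A sketch under the assumption that the endpoint speaks plain HTTP on `127.0.0.1:8222`; `check_monitor` is a hypothetical helper, not something shipped in this directory.

```bash
# Hypothetical on-node check: every monitoring path must answer HTTP 200.
# Optional $1 overrides the base URL (default: the loopback monitoring port).
check_monitor() {
  base=${1:-http://127.0.0.1:8222}
  for path in varz connz jsz; do
    # curl prints the status code; on a connection failure fall back to 000.
    code=$(curl -s -o /dev/null -w '%{http_code}' "$base/$path") || code=000
    if [ "$code" != 200 ]; then
      echo "$path: HTTP $code" >&2
      return 1
    fi
  done
  return 0
}

# Usage (on the node itself): check_monitor && echo "monitoring endpoint open"
```

On a node without the toggle the same call is expected to fail, since the endpoint stays closed by default.
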