docs(cluster): correct runbook + wire --internal-id-file into deploy

Corrections learned from the real 0011 deploy:
- Bring up: the "start magnus alone and verify healthz" order deadlocks — a
  lone node of a 3-node cluster has no meta-group quorum and never serves
  healthz until a second node joins. Document a quorum-forming start and that
  a node never self-serves.
- Replication: R1 is an unusable SPOF (all six control-plane buckets on one
  node) and the cold start only converges with the three cold-start fixes;
  go straight to R3 once the cluster forms.
- Add a "user add --store kv" section: the live user-add path that replaces
  stop-seed-restart, with its security model and idempotency/HA/no-delete
  semantics.
- Topology: real IPs, ROUTE_NETWORK=public (no WireGuard mesh exists).
- Chaos test: mark the data-plane client + failover proofs as validated (0012).

Deploy machinery now emits the persisted internal identity: the unit gains
--internal-id-file ${INTERNAL_ID_FILE} and deploy-cluster.sh writes
INTERNAL_ID_FILE into each node's cluster.env, so a fresh deploy enables the
live user-add path on every node.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This commit is contained in:
2026-06-07 19:41:56 +02:00
parent 3aa5a2c9a9
commit ce72131ddf
3 changed files with 156 additions and 47 deletions
+143 -39
View File
@@ -5,9 +5,12 @@ This directory holds the material to bring up unibus as a **3-node cluster**
plane (rooms/members/keys/users on JetStream KV + the anti-replay nonce bucket)
survives the loss of any one node (quorum 2/3).
> **The agent that authored this never touched a VPS.** Every step that changes a
> remote host is marked **HUMAN** and is executed by the operator. `deploy-cluster.sh`
> defaults to a dry run.
> **Status: this cluster is DEPLOYED in production** (magnus + homer + datardos,
> R3, enforce+ACL+TLS) — see report 0011. The runbook below was authored before any
> VPS existed and has since been **corrected against the real deploy** (report 0012):
> the start ordering, the R1→R3 reality, and the live user-add path were all wrong
> or missing. Steps that change a remote host are marked **HUMAN**; `deploy-cluster.sh`
> still defaults to a dry run.
## Files
@@ -22,18 +25,22 @@ Generated keys/secrets (`out/`, `build/`, `secrets/`) are **gitignored** — the
secret and never leave the operator's trusted machine except over the secure
rsync channel.
## Topology
## Topology (as deployed, report 0011)
| Node | SSH | Public IP | WireGuard IP | Role |
|---|---|---|---|---|
| magnus | `magnus` | `<MAGNUS_PUBLIC_IP>` | `<MAGNUS_WG_IP>` | seed (first up) |
| homer | `homer` | `141.94.69.66` | `<HOMER_WG_IP>` | replica |
| datardos | `dd` | `51.91.100.142` | `<DATARDOS_WG_IP>` (10.21.0.x) | replica |
| Node | SSH | Public IP | Role |
|---|---|---|---|
| magnus | `magnus` (root) | `135.125.201.30` | node — **= organic-machine.com = `om`**, the critical host (caddy + gitea + registry-api + monitoring); the bus runs alongside, untouched |
| homer | `homer` (ubuntu+sudo) | `141.94.69.66` | node |
| datardos | `dd` (ubuntu+sudo) | `51.91.100.142` | node |
The route layer (server-to-server) prefers the **WireGuard mesh**
(`ROUTE_NETWORK=wg`); the client data plane and the HTTP control plane are reached
over the public IPs. The route CA is **separate** from the client CA, so a client
cert can never be presented to the route port.
`ROUTE_NETWORK=public`, **not `wg`**: there is no WireGuard mesh between the three
nodes (homer and datardos do not even have the `wg` binary; om's only WG peers are
the operator's PCs). The server-to-server routes therefore travel over the public
IPs, protected by the **separate cluster route CA** (mutual route TLS) — a client
data-plane cert can never be presented to the route port. The client data plane and
the HTTP control plane are also reached over the public IPs. There is no fixed
"seed" node: with R3 the three are peers (see "Bring up" for why a lone node cannot
self-serve).
## Prerequisites (HUMAN, once)
@@ -93,25 +100,48 @@ SEED
> The KV written here lives in `./local_files/jetstream`, which the cluster unit
> reuses (`--nats-store` default), so the admin is present when the enforce cluster
> starts. Additional users are added the same loopback way until a
> `user add --store kv` exists (see GAP in report 0009).
> starts. This loopback bootstrap is needed ONLY for the very first admin (the
> chicken-and-egg). **Every user after that is added with the cluster live** — no
> stop-seed-restart — via `user add --store kv` (see "Add users to the live
> cluster" below, report 0012).
## Bring up (HUMAN — staggered)
## Bring up (HUMAN)
Bring up the seed first, then the replicas one at a time, checking each joins.
> **CORRECTION (report 0012).** The original instruction — "start magnus alone and
> verify healthz, then add the others" — is **WRONG and will look like a hung
> deploy.** A 3-node JetStream cluster forms a RAFT meta-group that needs a quorum
> (2 of 3) to elect a leader. A single started node has no quorum, so its JetStream
> meta never becomes current: `--store kv` blocks creating the KV buckets and
> **`/healthz` never returns ok** until a second node joins. Waiting for magnus to
> "go green" before starting the others therefore deadlocks the rollout.
Start the nodes so a quorum forms. On a **clean cluster** the simplest correct
procedure is to start all three close together and let the meta-group converge:
```bash
# 1. Seed node (after the seed step above).
ssh root@magnus 'systemctl enable --now membershipd-cluster'
ssh root@magnus 'curl -fsS https://127.0.0.1:8470/healthz --cacert /opt/unibus/tls/ca.crt'
# Start all three (order does not matter); each blocks on the others until a
# 2/3 quorum elects a JetStream meta leader, then the KV buckets are created.
for h in magnus homer datardos; do ssh "$h" 'sudo systemctl enable --now membershipd-cluster'; done
# 2. Replicas, one at a time.
ssh root@homer 'systemctl enable --now membershipd-cluster'
ssh root@datardos 'systemctl enable --now membershipd-cluster'
# Only NOW does healthz return ok — once the meta-group has a leader (give it
# ~10-30s on a cold start). Poll, do not assume the first node is broken.
for h in magnus homer datardos; do
echo "== $h =="; ssh "$h" 'curl -fsS https://127.0.0.1:8470/healthz --cacert /opt/unibus/tls/ca.crt || echo "(not ready yet — needs quorum)"'
done
```
> Initial rollout runs at **R1** (`KV_REPLICAS=1` in `nodes.env`): the buckets live
> on the seed only. This is NOT HA yet — see "Scale to R3".
A **staggered** start also works, but only because `membershipd`'s KV open RETRIES
the bucket creation for a 120s bootstrap budget (issue 0006g, fix #3): the first
node sits in that retry loop — NOT serving healthz — until the second node makes a
quorum, then both converge and the third catches up. Either way, a lone node never
self-serves; do not gate the next node's start on the previous one's healthz.
> A cold multi-node start only converges because of **three cold-start fixes**
> (report 0011): route pooling off (`PoolSize=-1`), `NoAdvertise=true` (Docker
> bridge IPs not gossiped), and the KV-open retry loop above. Without them the
> meta-group re-elects leaders forever and bucket creation hangs. If a fresh
> cluster will not form, confirm the running binary contains these fixes before
> touching config.
## Promote an existing single-node (SQLite) deployment (HUMAN, optional)
@@ -137,11 +167,80 @@ ssh root@magnus 'nats --server nats://127.0.0.1:4250 server list' # 3 servers,
A healthy cluster shows 3 routed servers and a JetStream meta-group with a leader.
## Scale to R3 (HUMAN — real HA)
## Add users to the live cluster (HUMAN — `user add --store kv`)
Once all three nodes are up and routed, raise the replication factor of every
control-plane stream from 1 to 3 IN PLACE (no data loss), then flip `KV_REPLICAS=3`
in `nodes.env` so future (re)deploys keep it:
With the cluster up, add (and revoke) bus users **without stopping anything**,
directly against the replicated KV allowlist. This replaces the stop-seed-restart
procedure the original runbook implied for every user beyond the first admin.
The mechanism is the cluster's own **privileged internal connection**: under
`enforce` every bus user is confined by the per-subject ACL to its own rooms, so no
ordinary identity may write the control-plane buckets. The only identity the
authenticator grants full JetStream permissions is `membershipd`'s internal service
identity. The unit persists that identity to `${INTERNAL_ID_FILE}`
(`/opt/unibus/secrets/internal.id`, 0600) via `--internal-id-file`, so the same key
is available to the CLI. Run the CLI **on a node, over loopback** (the data-plane
TLS cert SAN covers `127.0.0.1`); reading the identity file requires root on that
node, which already implies full control of it, so this adds no practical exposure.
```bash
# Add a member to the live cluster's replicated allowlist (run on any node).
ssh root@magnus 'sudo /opt/unibus/membershipd user add --store kv \
--handle alice --role member --sign-pub <64-hex-ed25519-pub>'
# -> added user "alice" (...) role=member
# -> KV_UNIBUS_users: leader=<node> followers_current=2/2 msgs=N (replicated, HA)
# List / revoke against the same live KV:
ssh root@magnus 'sudo /opt/unibus/membershipd user list --store kv'
ssh root@magnus 'sudo /opt/unibus/membershipd user revoke --store kv <64-hex-ed25519-pub>'
```
Defaults assume an on-node invocation (`--nats-url nats://127.0.0.1:4250`,
`--internal-id-file /opt/unibus/secrets/internal.id`, `--ca /opt/unibus/tls/ca.crt`,
`--kv-replicas 3`). Semantics:
- **Idempotent / non-destructive**: re-adding the same key is an explicit
`already registered` error (exit 1), never a silent overwrite — a re-add cannot
flip a member to admin. To replace a user, `revoke` then add.
- **HA**: the write commits through the JetStream quorum, so it succeeds even with
one node down (2/3); the printed `followers_current` shows replication.
- **No hard delete**: `revoke` flips status to `revoked` (denied on both planes,
auditable); the KV has no row deletion, matching the SQLite store.
> **Rollout note (report 0012):** the live verification deployed this binary +
> `--internal-id-file` to **datardos only** (the non-critical node). magnus and
> homer still run the 0011 binary. To make the capability available (and the unit)
> on all three — recommended, the posture is identical so there is no urgency — roll
> the new binary with backups, one node at a time, verifying healthz between each:
> ```bash
> for h in homer magnus; do
> ssh "$h" 'sudo cp -a /opt/unibus/membershipd /opt/unibus/membershipd.bak' # backup
> scp build/membershipd "$h:/tmp/m" && ssh "$h" 'sudo install -o ubuntu -g ubuntu -m0775 /tmp/m /opt/unibus/membershipd'
> # add INTERNAL_ID_FILE=/opt/unibus/secrets/internal.id to /opt/unibus/cluster.env
> # add `--internal-id-file ${INTERNAL_ID_FILE} \` to the unit before `--store kv`
> ssh "$h" 'sudo systemctl daemon-reload && sudo systemctl restart membershipd-cluster'
> ssh "$h" 'curl -fsS https://127.0.0.1:8470/healthz --cacert /opt/unibus/tls/ca.crt' # green before next
> done
> ```
> (`deploy-cluster.sh` + the unit template already emit `INTERNAL_ID_FILE` and the
> flag, so a fresh `./deploy-cluster.sh --yes` is correct for all three.)
## Replication: go straight to R3 (HUMAN — real HA)
> **CORRECTION (report 0012).** The original "start at R1, then scale to R3" plan
> assumed R1 is a usable interim state. **It is not, in this cluster.** At R1 all six
> control-plane buckets (`KV_UNIBUS_users/rooms/members/room_keys/rooms_by_member`
> + `KV_UNIBUS_nonces`) live on a SINGLE node — a hard **SPOF for authentication**:
> if that node dies, the nonce/KV control plane is unreachable and EVERY
> authenticated request fails closed (auth DoS). Worse, the cold multi-node start
> only converges at all because of the three cold-start fixes (see "Bring up"); the
> real deploy never ran a healthy R1 and **jumped straight to R3 once the cluster
> formed.** Treat R1 as a transient artifact of bucket creation, not a milestone.
The deployed config already sets `KV_REPLICAS=3` in `nodes.env`. If buckets were
created at R1 (e.g. only one node was up when `--store kv` first opened them), raise
every control-plane stream to R3 IN PLACE (no data loss) once all three nodes are
routed:
```bash
for s in KV_UNIBUS_users KV_UNIBUS_rooms KV_UNIBUS_members KV_UNIBUS_room_keys \
@@ -151,27 +250,32 @@ done
# (also OBJ_UNIBUS_blobs if the object store is in use)
```
Until this is done, R1 means the seed node is a **single point of failure for
authentication**: if it dies, the nonce/KV control plane is unreachable and every
authenticated request fails closed (auth DoS). R1 is a rollout step, not HA.
After this each bucket shows `followers_current=2/2` (quorum 2/3). The
`user add --store kv` command prints that figure for `KV_UNIBUS_users` on every add,
which is a cheap live HA check.
## Chaos test (HUMAN — requires the 3 live VPS; NOT run here)
## Chaos test (HUMAN — requires the 3 live VPS)
Validate quorum tolerance after R3:
```bash
# Kill one node; the cluster keeps serving (quorum 2/3).
ssh root@datardos 'systemctl stop membershipd-cluster'
# Kill one node; the cluster keeps serving (quorum 2/3). On ubuntu nodes use sudo.
ssh dd 'sudo systemctl stop membershipd-cluster'
# -> clients fail over (multiple seed URLs); reads/writes still succeed.
ssh root@datardos 'systemctl start membershipd-cluster' # rejoins, catches up
ssh dd 'sudo systemctl start membershipd-cluster' # rejoins, catches up
# Kill two nodes; quorum is LOST — the control plane should fail CLOSED (deny),
# never fail open. Verify a request is rejected, not silently served.
```
This network-level chaos test (kill 1/3, kill 2/3, partition/split-brain) is part
of the deploy validation (issue 0003f) and runs against the real VPS — it is
deliberately out of scope for the authoring agent.
> **Validated (report 0012).** The 0011 chaos run checked only the control plane
> (healthz + meta/stream-leader failover + KV readable with 2/3). Report 0012 added
> the missing data-plane proofs against the live cluster: a real authenticated
> client (`cmd/clientcheck`, operator identity, nkey+TLS) creating an E2E room and
> publishing/subscribing — including a node stopped mid-stream, where the client
> failed over to a survivor and kept receiving with zero loss (quorum 2/3) — and
> `user add --store kv` committing with one node (the KV leader) down. The kill-2/3
> fail-closed case remains a documented manual step.
## Rollback
+12 -8
View File
@@ -97,6 +97,7 @@ TLS_KEY=${REMOTE_DIR}/tls/server-${name}.key
ROUTE_TLS_CERT=${REMOTE_DIR}/tls/route-${name}.crt
ROUTE_TLS_KEY=${REMOTE_DIR}/tls/route-${name}.key
ROUTE_TLS_CA=${REMOTE_DIR}/tls/cluster-ca.crt
INTERNAL_ID_FILE=${REMOTE_DIR}/secrets/internal.id
EOF
run ssh "$target" "mkdir -p ${REMOTE_DIR}/tls ${REMOTE_DIR}/secrets"
@@ -114,13 +115,16 @@ if [[ $APPLY -eq 0 ]]; then
fi
cat <<'NEXT'
HUMAN — staggered start (do NOT enable all at once; see README "Bring up"):
1. Seed node first (e.g. magnus):
ssh root@magnus 'systemctl enable --now membershipd-cluster'
ssh root@magnus '/opt/unibus/membershipd user add --admin ...' # seed admin
2. Then the other two, one at a time, checking quorum after each:
ssh root@homer 'systemctl enable --now membershipd-cluster'
ssh root@datardos 'systemctl enable --now membershipd-cluster'
HUMAN — bring up (see README "Bring up" — a LONE node has no quorum and never
serves healthz, so do NOT gate the next node on the previous one going green):
1. Seed the FIRST admin into the KV via the loopback bootstrap (README
"Seed the first admin"); this is needed only for the chicken-and-egg admin.
2. Start all three so a 2/3 quorum forms (order does not matter); healthz
turns ok only once the meta-group elects a leader (~10-30s cold):
for h in magnus homer datardos; do ssh "$h" 'sudo systemctl enable --now membershipd-cluster'; done
3. Verify posture + quorum (README "Verify").
4. Scale replicas 1 -> 3 once all three are up (README "Scale to R3").
4. Ensure R3 on every control-plane stream (README "Replication: go straight to
R3"); R1 is a SPOF, not a milestone.
5. Add further users with the cluster LIVE — no restart — via
`membershipd user add --store kv` (README "Add users to the live cluster").
NEXT
@@ -33,6 +33,7 @@ ExecStart=/opt/unibus/membershipd \
--route-tls-cert ${ROUTE_TLS_CERT} \
--route-tls-key ${ROUTE_TLS_KEY} \
--route-tls-ca ${ROUTE_TLS_CA} \
--internal-id-file ${INTERNAL_ID_FILE} \
--store kv \
--kv-replicas ${KV_REPLICAS}
# Restart=always (NOT on-failure): a clean SIGTERM exits success, and on-failure