From ce72131ddfe5c43520e8d3316b203c63b190f656 Mon Sep 17 00:00:00 2001
From: Egutierrez
Date: Sun, 7 Jun 2026 19:41:56 +0200
Subject: [PATCH] docs(cluster): correct runbook + wire --internal-id-file into deploy
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Corrections learned from the real 0011 deploy:
- Bring up: the "start magnus alone and verify healthz" order deadlocks — a
  lone node of a 3-node cluster has no meta-group quorum and never serves
  healthz until a second node joins. Document a quorum-forming start and
  that a node never self-serves.
- Replication: R1 is an unusable SPOF (all six control-plane buckets on one
  node) and the cold start only converges with the three cold-start fixes;
  go straight to R3 once the cluster forms.
- Add a "user add --store kv" section: the live user-add path that replaces
  stop-seed-restart, with its security model and idempotency/HA/no-delete
  semantics.
- Topology: real IPs, ROUTE_NETWORK=public (no WireGuard mesh exists).
- Chaos test: mark the data-plane client + failover proofs as validated
  (0012).

Deploy machinery now emits the persisted internal identity: the unit gains
--internal-id-file ${INTERNAL_ID_FILE} and deploy-cluster.sh writes
INTERNAL_ID_FILE into each node's cluster.env, so a fresh deploy enables the
live user-add path on every node.

Co-Authored-By: Claude Opus 4.8 (1M context)
---
 deploy/cluster/README.md                   | 182 ++++++++++++++++-----
 deploy/cluster/deploy-cluster.sh           |  20 ++-
 deploy/cluster/membershipd-cluster.service |   1 +
 3 files changed, 156 insertions(+), 47 deletions(-)

diff --git a/deploy/cluster/README.md b/deploy/cluster/README.md
index 5caf865..a1777ec 100644
--- a/deploy/cluster/README.md
+++ b/deploy/cluster/README.md
@@ -5,9 +5,12 @@ This directory holds the material to bring up unibus as a **3-node cluster**
 plane (rooms/members/keys/users on JetStream KV + the anti-replay nonce bucket)
 survives the loss of any one node (quorum 2/3).
 
-> **The agent that authored this never touched a VPS.** Every step that changes a
-> remote host is marked **HUMAN** and is executed by the operator. `deploy-cluster.sh`
-> defaults to a dry run.
+> **Status: this cluster is DEPLOYED in production** (magnus + homer + datardos,
+> R3, enforce+ACL+TLS) — see report 0011. The runbook below was authored before any
+> VPS existed and has since been **corrected against the real deploy** (report 0012):
+> the start ordering, the R1→R3 reality, and the live user-add path were all wrong
+> or missing. Steps that change a remote host are marked **HUMAN**; `deploy-cluster.sh`
+> still defaults to a dry run.
 
 ## Files
 
@@ -22,18 +25,22 @@ Generated keys/secrets (`out/`, `build/`, `secrets/`) are **gitignored** — the
 secret and never leave the operator's trusted machine except over the secure
 rsync channel.
 
-## Topology
+## Topology (as deployed, report 0011)
 
-| Node | SSH | Public IP | WireGuard IP | Role |
-|---|---|---|---|---|
-| magnus | `magnus` | `` | `` | seed (first up) |
-| homer | `homer` | `141.94.69.66` | `` | replica |
-| datardos | `dd` | `51.91.100.142` | `` (10.21.0.x) | replica |
+| Node | SSH | Public IP | Role |
+|---|---|---|---|
+| magnus | `magnus` (root) | `135.125.201.30` | node — **= organic-machine.com = `om`**, the critical host (caddy + gitea + registry-api + monitoring); the bus runs alongside, untouched |
+| homer | `homer` (ubuntu+sudo) | `141.94.69.66` | node |
+| datardos | `dd` (ubuntu+sudo) | `51.91.100.142` | node |
 
-The route layer (server-to-server) prefers the **WireGuard mesh**
-(`ROUTE_NETWORK=wg`); the client data plane and the HTTP control plane are reached
-over the public IPs. The route CA is **separate** from the client CA, so a client
-cert can never be presented to the route port.
+`ROUTE_NETWORK=public`, **not `wg`**: there is no WireGuard mesh between the three
+nodes (homer and datardos do not even have the `wg` binary; om's only WG peers are
+the operator's PCs). The server-to-server routes therefore travel over the public
+IPs, protected by the **separate cluster route CA** (mutual route TLS) — a client
+data-plane cert can never be presented to the route port. The client data plane and
+the HTTP control plane are also reached over the public IPs. There is no fixed
+"seed" node: with R3 the three are peers (see "Bring up" for why a lone node cannot
+self-serve).
 
 ## Prerequisites (HUMAN, once)
 
@@ -93,25 +100,48 @@ SEED
 
 > The KV written here lives in `./local_files/jetstream`, which the cluster unit
 > reuses (`--nats-store` default), so the admin is present when the enforce cluster
-> starts. Additional users are added the same loopback way until a
-> `user add --store kv` exists (see GAP in report 0009).
+> starts. This loopback bootstrap is needed ONLY for the very first admin (the
+> chicken-and-egg). **Every user after that is added with the cluster live** — no
+> stop-seed-restart — via `user add --store kv` (see "Add users to the live
+> cluster" below, report 0012).
 
-## Bring up (HUMAN — staggered)
+## Bring up (HUMAN)
 
-Bring up the seed first, then the replicas one at a time, checking each joins.
+> **CORRECTION (report 0012).** The original instruction — "start magnus alone and
+> verify healthz, then add the others" — is **WRONG and will look like a hung
+> deploy.** A 3-node JetStream cluster forms a RAFT meta-group that needs a quorum
+> (2 of 3) to elect a leader. A single started node has no quorum, so its JetStream
+> meta never becomes current: `--store kv` blocks creating the KV buckets and
+> **`/healthz` never returns ok** until a second node joins. Waiting for magnus to
+> "go green" before starting the others therefore deadlocks the rollout.
+
+Start the nodes so a quorum forms. On a **clean cluster** the simplest correct
+procedure is to start all three close together and let the meta-group converge:
 
 ```bash
-# 1. Seed node (after the seed step above).
-ssh root@magnus 'systemctl enable --now membershipd-cluster'
-ssh root@magnus 'curl -fsS https://127.0.0.1:8470/healthz --cacert /opt/unibus/tls/ca.crt'
+# Start all three (order does not matter); each blocks on the others until a
+# 2/3 quorum elects a JetStream meta leader, then the KV buckets are created.
+for h in magnus homer datardos; do ssh "$h" 'sudo systemctl enable --now membershipd-cluster'; done
 
-# 2. Replicas, one at a time.
-ssh root@homer 'systemctl enable --now membershipd-cluster'
-ssh root@datardos 'systemctl enable --now membershipd-cluster'
+# Only NOW does healthz return ok — once the meta-group has a leader (give it
+# ~10-30s on a cold start). Poll, do not assume the first node is broken.
+for h in magnus homer datardos; do
+  echo "== $h =="; ssh "$h" 'curl -fsS https://127.0.0.1:8470/healthz --cacert /opt/unibus/tls/ca.crt || echo "(not ready yet — needs quorum)"'
+done
 ```
 
-> Initial rollout runs at **R1** (`KV_REPLICAS=1` in `nodes.env`): the buckets live
-> on the seed only. This is NOT HA yet — see "Scale to R3".
+A **staggered** start also works, but only because `membershipd`'s KV open RETRIES
+the bucket creation for a 120s bootstrap budget (issue 0006g, fix #3): the first
+node sits in that retry loop — NOT serving healthz — until the second node makes a
+quorum, then both converge and the third catches up. Either way, a lone node never
+self-serves; do not gate the next node's start on the previous one's healthz.
+
+> A cold multi-node start only converges because of **three cold-start fixes**
+> (report 0011): route pooling off (`PoolSize=-1`), `NoAdvertise=true` (Docker
+> bridge IPs not gossiped), and the KV-open retry loop above. Without them the
+> meta-group re-elects leaders forever and bucket creation hangs. If a fresh
+> cluster will not form, confirm the running binary contains these fixes before
+> touching config.
 
 ## Promote an existing single-node (SQLite) deployment (HUMAN, optional)
 
@@ -137,11 +167,80 @@ ssh root@magnus 'nats --server nats://127.0.0.1:4250 server list' # 3 servers,
 
 A healthy cluster shows 3 routed servers and a JetStream meta-group with a leader.
 
-## Scale to R3 (HUMAN — real HA)
+## Add users to the live cluster (HUMAN — `user add --store kv`)
 
-Once all three nodes are up and routed, raise the replication factor of every
-control-plane stream from 1 to 3 IN PLACE (no data loss), then flip `KV_REPLICAS=3`
-in `nodes.env` so future (re)deploys keep it:
+With the cluster up, add (and revoke) bus users **without stopping anything**,
+directly against the replicated KV allowlist. This replaces the stop-seed-restart
+procedure the original runbook implied for every user beyond the first admin.
+
+The mechanism is the cluster's own **privileged internal connection**: under
+`enforce` every bus user is confined by the per-subject ACL to its own rooms, so no
+ordinary identity may write the control-plane buckets. The only identity the
+authenticator grants full JetStream permissions is `membershipd`'s internal service
+identity. The unit persists that identity to `${INTERNAL_ID_FILE}`
+(`/opt/unibus/secrets/internal.id`, 0600) via `--internal-id-file`, so the same key
+is available to the CLI. Run the CLI **on a node, over loopback** (the data-plane
+TLS cert SAN covers `127.0.0.1`); reading the identity file requires root on that
+node, which already implies full control of it, so this adds no practical exposure.
+
+```bash
+# Add a member to the live cluster's replicated allowlist (run on any node).
+ssh root@magnus 'sudo /opt/unibus/membershipd user add --store kv \
+  --handle alice --role member --sign-pub <64-hex-ed25519-pub>'
+# -> added user "alice" (...) role=member
+# -> KV_UNIBUS_users: leader= followers_current=2/2 msgs=N (replicated, HA)
+
+# List / revoke against the same live KV:
+ssh root@magnus 'sudo /opt/unibus/membershipd user list --store kv'
+ssh root@magnus 'sudo /opt/unibus/membershipd user revoke --store kv <64-hex-ed25519-pub>'
+```
+
+Defaults assume an on-node invocation (`--nats-url nats://127.0.0.1:4250`,
+`--internal-id-file /opt/unibus/secrets/internal.id`, `--ca /opt/unibus/tls/ca.crt`,
+`--kv-replicas 3`). Semantics:
+
+- **Idempotent / non-destructive**: re-adding the same key is an explicit
+  `already registered` error (exit 1), never a silent overwrite — a re-add cannot
+  flip a member to admin. To replace a user, `revoke` then add.
+- **HA**: the write commits through the JetStream quorum, so it succeeds even with
+  one node down (2/3); the printed `followers_current` shows replication.
+- **No hard delete**: `revoke` flips status to `revoked` (denied on both planes,
+  auditable); the KV has no row deletion, matching the SQLite store.
+
+> **Rollout note (report 0012):** the live verification deployed this binary +
+> `--internal-id-file` to **datardos only** (the non-critical node). magnus and
+> homer still run the 0011 binary. To make the capability available (and the unit)
+> on all three — recommended, the posture is identical so there is no urgency — roll
+> the new binary with backups, one node at a time, verifying healthz between each:
+> ```bash
+> for h in homer magnus; do
+>   ssh "$h" 'sudo cp -a /opt/unibus/membershipd /opt/unibus/membershipd.bak'  # backup
+>   scp build/membershipd "$h:/tmp/m" && ssh "$h" 'sudo install -o ubuntu -g ubuntu -m0775 /tmp/m /opt/unibus/membershipd'
+>   # add INTERNAL_ID_FILE=/opt/unibus/secrets/internal.id to /opt/unibus/cluster.env
+>   # add `--internal-id-file ${INTERNAL_ID_FILE} \` to the unit before `--store kv`
+>   ssh "$h" 'sudo systemctl daemon-reload && sudo systemctl restart membershipd-cluster'
+>   ssh "$h" 'curl -fsS https://127.0.0.1:8470/healthz --cacert /opt/unibus/tls/ca.crt'  # green before next
+> done
+> ```
+> (`deploy-cluster.sh` + the unit template already emit `INTERNAL_ID_FILE` and the
+> flag, so a fresh `./deploy-cluster.sh --yes` is correct for all three.)
+
+## Replication: go straight to R3 (HUMAN — real HA)
+
+> **CORRECTION (report 0012).** The original "start at R1, then scale to R3" plan
+> assumed R1 is a usable interim state. **It is not, in this cluster.** At R1 all six
+> control-plane buckets (`KV_UNIBUS_users/rooms/members/room_keys/rooms_by_member`
+> + `KV_UNIBUS_nonces`) live on a SINGLE node — a hard **SPOF for authentication**:
+> if that node dies, the nonce/KV control plane is unreachable and EVERY
+> authenticated request fails closed (auth DoS). Worse, the cold multi-node start
+> only converges at all because of the three cold-start fixes (see "Bring up"); the
+> real deploy never ran a healthy R1 and **jumped straight to R3 once the cluster
+> formed.** Treat R1 as a transient artifact of bucket creation, not a milestone.
+
+The deployed config already sets `KV_REPLICAS=3` in `nodes.env`. If buckets were
+created at R1 (e.g. only one node was up when `--store kv` first opened them), raise
+every control-plane stream to R3 IN PLACE (no data loss) once all three nodes are
+routed:
 
 ```bash
 for s in KV_UNIBUS_users KV_UNIBUS_rooms KV_UNIBUS_members KV_UNIBUS_room_keys \
@@ -151,27 +250,32 @@ done
 # (also OBJ_UNIBUS_blobs if the object store is in use)
 ```
 
-Until this is done, R1 means the seed node is a **single point of failure for
-authentication**: if it dies, the nonce/KV control plane is unreachable and every
-authenticated request fails closed (auth DoS). R1 is a rollout step, not HA.
+After this each bucket shows `followers_current=2/2` (quorum 2/3). The
+`user add --store kv` command prints that figure for `KV_UNIBUS_users` on every add,
+which is a cheap live HA check.
 
-## Chaos test (HUMAN — requires the 3 live VPS; NOT run here)
+## Chaos test (HUMAN — requires the 3 live VPS)
 
 Validate quorum tolerance after R3:
 
 ```bash
-# Kill one node; the cluster keeps serving (quorum 2/3).
-ssh root@datardos 'systemctl stop membershipd-cluster'
+# Kill one node; the cluster keeps serving (quorum 2/3). On ubuntu nodes use sudo.
+ssh dd 'sudo systemctl stop membershipd-cluster'
 # -> clients fail over (multiple seed URLs); reads/writes still succeed.
-ssh root@datardos 'systemctl start membershipd-cluster' # rejoins, catches up
+ssh dd 'sudo systemctl start membershipd-cluster' # rejoins, catches up
 
 # Kill two nodes; quorum is LOST — the control plane should fail CLOSED (deny),
 # never fail open. Verify a request is rejected, not silently served.
 ```
 
-This network-level chaos test (kill 1/3, kill 2/3, partition/split-brain) is part
-of the deploy validation (issue 0003f) and runs against the real VPS — it is
-deliberately out of scope for the authoring agent.
+> **Validated (report 0012).** The 0011 chaos run checked only the control plane
+> (healthz + meta/stream-leader failover + KV readable with 2/3). Report 0012 added
+> the missing data-plane proofs against the live cluster: a real authenticated
+> client (`cmd/clientcheck`, operator identity, nkey+TLS) creating an E2E room and
+> publishing/subscribing — including a node stopped mid-stream, where the client
+> failed over to a survivor and kept receiving with zero loss (quorum 2/3) — and
+> `user add --store kv` committing with one node (the KV leader) down. The kill-2/3
+> fail-closed case remains a documented manual step.
 
 ## Rollback
 
diff --git a/deploy/cluster/deploy-cluster.sh b/deploy/cluster/deploy-cluster.sh
index 46f583e..f14fba0 100755
--- a/deploy/cluster/deploy-cluster.sh
+++ b/deploy/cluster/deploy-cluster.sh
@@ -97,6 +97,7 @@ TLS_KEY=${REMOTE_DIR}/tls/server-${name}.key
 ROUTE_TLS_CERT=${REMOTE_DIR}/tls/route-${name}.crt
 ROUTE_TLS_KEY=${REMOTE_DIR}/tls/route-${name}.key
 ROUTE_TLS_CA=${REMOTE_DIR}/tls/cluster-ca.crt
+INTERNAL_ID_FILE=${REMOTE_DIR}/secrets/internal.id
 EOF
 
   run ssh "$target" "mkdir -p ${REMOTE_DIR}/tls ${REMOTE_DIR}/secrets"
@@ -114,13 +115,16 @@ if [[ $APPLY -eq 0 ]]; then
 fi
 
 cat <<'NEXT'
-HUMAN — staggered start (do NOT enable all at once; see README "Bring up"):
-  1. Seed node first (e.g. magnus):
-       ssh root@magnus 'systemctl enable --now membershipd-cluster'
-       ssh root@magnus '/opt/unibus/membershipd user add --admin ...' # seed admin
-  2. Then the other two, one at a time, checking quorum after each:
-       ssh root@homer 'systemctl enable --now membershipd-cluster'
-       ssh root@datardos 'systemctl enable --now membershipd-cluster'
+HUMAN — bring up (see README "Bring up" — a LONE node has no quorum and never
+serves healthz, so do NOT gate the next node on the previous one going green):
+  1. Seed the FIRST admin into the KV via the loopback bootstrap (README
+     "Seed the first admin"); this is needed only for the chicken-and-egg admin.
+  2. Start all three so a 2/3 quorum forms (order does not matter); healthz
+     turns ok only once the meta-group elects a leader (~10-30s cold):
+       for h in magnus homer datardos; do ssh "$h" 'sudo systemctl enable --now membershipd-cluster'; done
   3. Verify posture + quorum (README "Verify").
-  4. Scale replicas 1 -> 3 once all three are up (README "Scale to R3").
+  4. Ensure R3 on every control-plane stream (README "Replication: go straight to
+     R3"); R1 is a SPOF, not a milestone.
+  5. Add further users with the cluster LIVE — no restart — via
+     `membershipd user add --store kv` (README "Add users to the live cluster").
 NEXT
diff --git a/deploy/cluster/membershipd-cluster.service b/deploy/cluster/membershipd-cluster.service
index 45ee329..ddb88c4 100644
--- a/deploy/cluster/membershipd-cluster.service
+++ b/deploy/cluster/membershipd-cluster.service
@@ -33,6 +33,7 @@ ExecStart=/opt/unibus/membershipd \
   --route-tls-cert ${ROUTE_TLS_CERT} \
   --route-tls-key ${ROUTE_TLS_KEY} \
   --route-tls-ca ${ROUTE_TLS_CA} \
+  --internal-id-file ${INTERNAL_ID_FILE} \
   --store kv \
   --kv-replicas ${KV_REPLICAS}
 # Restart=always (NOT on-failure): a clean SIGTERM exits success, and on-failure