docs(cluster): correct runbook + wire --internal-id-file into deploy

Corrections learned from the real 0011 deploy:

- Bring up: the "start magnus alone and verify healthz" order deadlocks, because a
  lone node of a 3-node cluster has no meta-group quorum and never serves healthz
  until a second node joins. Document a quorum-forming start and that a node never
  self-serves.
- Replication: R1 is an unusable SPOF (all six control-plane buckets on one node)
  and the cold start only converges with the three cold-start fixes; go straight to
  R3 once the cluster forms.
- Add a "user add --store kv" section: the live user-add path that replaces
  stop-seed-restart, with its security model and idempotency/HA/no-delete semantics.
- Topology: real IPs, ROUTE_NETWORK=public (no WireGuard mesh exists).
- Chaos test: mark the data-plane client + failover proofs as validated (0012).

Deploy machinery now emits the persisted internal identity: the unit gains
--internal-id-file ${INTERNAL_ID_FILE} and deploy-cluster.sh writes INTERNAL_ID_FILE
into each node's cluster.env, so a fresh deploy enables the live user-add path on
every node.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-07 19:41:56 +02:00
parent 3aa5a2c9a9
commit ce72131ddf
3 changed files with 156 additions and 47 deletions
@@ -5,9 +5,12 @@ This directory holds the material to bring up unibus as a **3-node cluster**
 plane (rooms/members/keys/users on JetStream KV + the anti-replay nonce bucket)
 survives the loss of any one node (quorum 2/3).

-> **The agent that authored this never touched a VPS.** Every step that changes a
-> remote host is marked **HUMAN** and is executed by the operator. `deploy-cluster.sh`
-> defaults to a dry run.
+> **Status: this cluster is DEPLOYED in production** (magnus + homer + datardos,
+> R3, enforce+ACL+TLS) — see report 0011. The runbook below was authored before any
+> VPS existed and has since been **corrected against the real deploy** (report 0012):
+> the start ordering, the R1→R3 reality, and the live user-add path were all wrong
+> or missing. Steps that change a remote host are marked **HUMAN**; `deploy-cluster.sh`
+> still defaults to a dry run.

 ## Files

@@ -22,18 +25,22 @@ Generated keys/secrets (`out/`, `build/`, `secrets/`) are **gitignored** — the
 secret and never leave the operator's trusted machine except over the secure
 rsync channel.

-## Topology
+## Topology (as deployed, report 0011)

-| Node | SSH | Public IP | WireGuard IP | Role |
-|---|---|---|---|---|
-| magnus | `magnus` | `<MAGNUS_PUBLIC_IP>` | `<MAGNUS_WG_IP>` | seed (first up) |
-| homer | `homer` | `141.94.69.66` | `<HOMER_WG_IP>` | replica |
-| datardos | `dd` | `51.91.100.142` | `<DATARDOS_WG_IP>` (10.21.0.x) | replica |
+| Node | SSH | Public IP | Role |
+|---|---|---|---|
+| magnus | `magnus` (root) | `135.125.201.30` | node — **= organic-machine.com = `om`**, the critical host (caddy + gitea + registry-api + monitoring); the bus runs alongside, untouched |
+| homer | `homer` (ubuntu+sudo) | `141.94.69.66` | node |
+| datardos | `dd` (ubuntu+sudo) | `51.91.100.142` | node |

-The route layer (server-to-server) prefers the **WireGuard mesh**
-(`ROUTE_NETWORK=wg`); the client data plane and the HTTP control plane are reached
-over the public IPs. The route CA is **separate** from the client CA, so a client
-cert can never be presented to the route port.
+`ROUTE_NETWORK=public`, **not `wg`**: there is no WireGuard mesh between the three
+nodes (homer and datardos do not even have the `wg` binary; om's only WG peers are
+the operator's PCs). The server-to-server routes therefore travel over the public
+IPs, protected by the **separate cluster route CA** (mutual route TLS) — a client
+data-plane cert can never be presented to the route port. The client data plane and
+the HTTP control plane are also reached over the public IPs. There is no fixed
+"seed" node: with R3 the three are peers (see "Bring up" for why a lone node cannot
+self-serve).

 ## Prerequisites (HUMAN, once)

@@ -93,25 +100,48 @@ SEED

 > The KV written here lives in `./local_files/jetstream`, which the cluster unit
 > reuses (`--nats-store` default), so the admin is present when the enforce cluster
-> starts. Additional users are added the same loopback way until a
-> `user add --store kv` exists (see GAP in report 0009).
+> starts. This loopback bootstrap is needed ONLY for the very first admin (the
+> chicken-and-egg problem). **Every user after that is added with the cluster live**,
+> with no stop-seed-restart, via `user add --store kv` (see "Add users to the live
+> cluster" below, report 0012).

-## Bring up (HUMAN — staggered)
+## Bring up (HUMAN)

-Bring up the seed first, then the replicas one at a time, checking each joins.
+> **CORRECTION (report 0012).** The original instruction — "start magnus alone and
+> verify healthz, then add the others" — is **WRONG and will look like a hung
+> deploy.** A 3-node JetStream cluster forms a RAFT meta-group that needs a quorum
+> (2 of 3) to elect a leader. A single started node has no quorum, so its JetStream
+> meta never becomes current: `--store kv` blocks while creating the KV buckets and
+> **`/healthz` never returns ok** until a second node joins. Waiting for magnus to
+> "go green" before starting the others therefore deadlocks the rollout.
+
+Start the nodes so a quorum forms. On a **clean cluster** the simplest correct
+procedure is to start all three close together and let the meta-group converge:

 ```bash
-# 1. Seed node (after the seed step above).
-ssh root@magnus 'systemctl enable --now membershipd-cluster'
-ssh root@magnus 'curl -fsS https://127.0.0.1:8470/healthz --cacert /opt/unibus/tls/ca.crt'
+# Start all three (order does not matter); each blocks on the others until a
+# 2/3 quorum elects a JetStream meta leader, then the KV buckets are created.
+for h in magnus homer dd; do ssh "$h" 'sudo systemctl enable --now membershipd-cluster'; done   # dd = datardos (SSH alias, see Topology)

-# 2. Replicas, one at a time.
-ssh root@homer    'systemctl enable --now membershipd-cluster'
-ssh root@datardos 'systemctl enable --now membershipd-cluster'
+# Only NOW does healthz return ok — once the meta-group has a leader (give it
+# ~10-30s on a cold start). Poll, do not assume the first node is broken.
+for h in magnus homer dd; do   # dd = datardos (SSH alias, see Topology)
+  echo "== $h =="; ssh "$h" 'curl -fsS https://127.0.0.1:8470/healthz --cacert /opt/unibus/tls/ca.crt || echo "(not ready yet — needs quorum)"'
+done
 ```

-> Initial rollout runs at **R1** (`KV_REPLICAS=1` in `nodes.env`): the buckets live
-> on the seed only. This is NOT HA yet — see "Scale to R3".
+A **staggered** start also works, but only because `membershipd`'s KV open RETRIES
+the bucket creation for a 120s bootstrap budget (issue 0006g, fix #3): the first
+node sits in that retry loop — NOT serving healthz — until the second node makes a
+quorum, then both converge and the third catches up. Either way, a lone node never
+self-serves; do not gate the next node's start on the previous one's healthz.
+
+> A cold multi-node start only converges because of **three cold-start fixes**
+> (report 0011): route pooling off (`PoolSize=-1`), `NoAdvertise=true` (Docker
+> bridge IPs not gossiped), and the KV-open retry loop above. Without them the
+> meta-group re-elects leaders forever and bucket creation hangs. If a fresh
+> cluster will not form, confirm the running binary contains these fixes before
+> touching config.
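
The "poll, do not gate" advice above can be wrapped in a small retry helper. The following is a minimal sketch and not part of the repo: `poll_until_ok` is a hypothetical name, and the 120-second budget and 5-second interval in the usage comment are assumptions chosen to match the bootstrap budget mentioned above. The node names in the usage comment are the SSH aliases from the Topology table.

```bash
# poll_until_ok BUDGET_SECONDS INTERVAL_SECONDS CMD [ARGS...]
# Hypothetical helper: re-run a probe until it succeeds or the budget is used up.
# Returns 0 on success, 1 on timeout. Only the sleep time is counted against
# the budget, so the real wall-clock wait can be longer if the probe is slow.
poll_until_ok() {
  _budget=$1; _interval=$2; shift 2
  _waited=0
  until "$@"; do
    _waited=$((_waited + _interval))
    [ "$_waited" -ge "$_budget" ] && return 1
    sleep "$_interval"
  done
  return 0
}

# Example use (left commented so that sourcing this file touches no host):
#   for h in magnus homer dd; do
#     poll_until_ok 120 5 ssh "$h" \
#       'curl -fsS https://127.0.0.1:8470/healthz --cacert /opt/unibus/tls/ca.crt' \
#       >/dev/null || echo "$h: no healthz within budget"
#   done
```

Run the polling loop only after all three units have been started; polling a lone node would just exhaust the budget, for the quorum reason given above.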

 ## Promote an existing single-node (SQLite) deployment (HUMAN, optional)

@@ -137,11 +167,80 @@ ssh root@magnus 'nats --server nats://127.0.0.1:4250 server list'   # 3 servers,

 A healthy cluster shows 3 routed servers and a JetStream meta-group with a leader.

-## Scale to R3 (HUMAN — real HA)
+## Add users to the live cluster (HUMAN — `user add --store kv`)

-Once all three nodes are up and routed, raise the replication factor of every
-control-plane stream from 1 to 3 IN PLACE (no data loss), then flip `KV_REPLICAS=3`
-in `nodes.env` so future (re)deploys keep it:
+With the cluster up, add (and revoke) bus users **without stopping anything**,
+directly against the replicated KV allowlist. This replaces the stop-seed-restart
+procedure the original runbook implied for every user beyond the first admin.
+
+The mechanism is the cluster's own **privileged internal connection**: under
+`enforce` every bus user is confined by the per-subject ACL to its own rooms, so no
+ordinary identity may write the control-plane buckets. The only identity the
+authenticator grants full JetStream permissions is `membershipd`'s internal service
+identity. The unit persists that identity to `${INTERNAL_ID_FILE}`
+(`/opt/unibus/secrets/internal.id`, 0600) via `--internal-id-file`, so the same key
+is available to the CLI. Run the CLI **on a node, over loopback** (the data-plane
+TLS cert SAN covers `127.0.0.1`); reading the identity file requires root on that
+node, which already implies full control of it, so this adds no practical exposure.
+
+```bash
+# Add a member to the live cluster's replicated allowlist (run on any node).
+ssh root@magnus 'sudo /opt/unibus/membershipd user add --store kv \
+  --handle alice --role member --sign-pub <64-hex-ed25519-pub>'
+#   -> added user "alice" (...) role=member
+#   -> KV_UNIBUS_users: leader=<node> followers_current=2/2 msgs=N   (replicated, HA)
+
+# List / revoke against the same live KV:
+ssh root@magnus 'sudo /opt/unibus/membershipd user list   --store kv'
+ssh root@magnus 'sudo /opt/unibus/membershipd user revoke --store kv <64-hex-ed25519-pub>'
+```
+
+Defaults assume an on-node invocation (`--nats-url nats://127.0.0.1:4250`,
+`--internal-id-file /opt/unibus/secrets/internal.id`, `--ca /opt/unibus/tls/ca.crt`,
+`--kv-replicas 3`). Semantics:
+
+- **Idempotent / non-destructive**: re-adding the same key is an explicit
+  `already registered` error (exit 1), never a silent overwrite — a re-add cannot
+  flip a member to admin. To replace a user, `revoke` then add.
+- **HA**: the write commits through the JetStream quorum, so it succeeds even with
+  one node down (2/3); the printed `followers_current` shows replication.
+- **No hard delete**: `revoke` flips status to `revoked` (denied on both planes,
+  auditable); the KV has no row deletion, matching the SQLite store.
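
Because a repeated add exits 1 with `already registered`, a provisioning script that may run twice has to tell that refusal apart from a real failure. The following is a minimal sketch, assuming the phrase `already registered` appears verbatim in the command's output; `ensure_user` is a hypothetical wrapper, not a `membershipd` feature.

```bash
# ensure_user CMD [ARGS...]
# Hypothetical wrapper: run a `user add` command and treat the documented
# "already registered" refusal as "nothing to do". Any other non-zero exit is
# passed through unchanged. The command's output is always echoed.
ensure_user() {
  _out=$("$@" 2>&1) && _rc=0 || _rc=$?
  printf '%s\n' "$_out"
  [ "$_rc" -eq 0 ] && return 0
  case $_out in
    *"already registered"*) return 0 ;;   # key is already in the allowlist
  esac
  return "$_rc"
}

# Example use (commented out; the key is a placeholder, as in the README):
#   ensure_user ssh root@magnus 'sudo /opt/unibus/membershipd user add --store kv \
#     --handle alice --role member --sign-pub <64-hex-ed25519-pub>'
```

The wrapper does not inspect the existing entry, so a revoked or differently privileged record for the same key is also reported as success; `user list --store kv` remains the way to confirm role and status.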
+
+> **Rollout note (report 0012):** the live verification deployed this binary +
+> `--internal-id-file` to **datardos only** (the non-critical node). magnus and
+> homer still run the 0011 binary. Rolling the new binary and the unit flag out to
+> all three is recommended but not urgent, since the posture is identical. Do it
+> with backups, one node at a time, verifying healthz between each:
+> ```bash
+> for h in homer magnus; do
+>   ssh "$h" 'sudo cp -a /opt/unibus/membershipd /opt/unibus/membershipd.bak'   # backup
+>   scp build/membershipd "$h:/tmp/m" && ssh "$h" 'sudo install -o ubuntu -g ubuntu -m0775 /tmp/m /opt/unibus/membershipd'
+>   # add INTERNAL_ID_FILE=/opt/unibus/secrets/internal.id to /opt/unibus/cluster.env
+>   # add `--internal-id-file ${INTERNAL_ID_FILE} \` to the unit before `--store kv`
+>   ssh "$h" 'sudo systemctl daemon-reload && sudo systemctl restart membershipd-cluster'
+>   ssh "$h" 'curl -fsS --retry 10 --retry-delay 3 --retry-connrefused https://127.0.0.1:8470/healthz --cacert /opt/unibus/tls/ca.crt' || break  # stop here unless green
+> done
+> ```
+> (`deploy-cluster.sh` + the unit template already emit `INTERNAL_ID_FILE` and the
+> flag, so a fresh `./deploy-cluster.sh --yes` is correct for all three.)
+
+## Replication: go straight to R3 (HUMAN — real HA)
+
+> **CORRECTION (report 0012).** The original "start at R1, then scale to R3" plan
+> assumed R1 is a usable interim state. **It is not, in this cluster.** At R1 all six
+> control-plane buckets (`KV_UNIBUS_users/rooms/members/room_keys/rooms_by_member`
+> + `KV_UNIBUS_nonces`) live on a SINGLE node — a hard **SPOF for authentication**:
+> if that node dies, the nonce/KV control plane is unreachable and EVERY
+> authenticated request fails closed (auth DoS). Worse, the cold multi-node start
+> only converges at all because of the three cold-start fixes (see "Bring up"); the
+> real deploy never ran a healthy R1 and **jumped straight to R3 once the cluster
+> formed.** Treat R1 as a transient artifact of bucket creation, not a milestone.
+
+The deployed config already sets `KV_REPLICAS=3` in `nodes.env`. If buckets were
+created at R1 (e.g. only one node was up when `--store kv` first opened them), raise
+every control-plane stream to R3 IN PLACE (no data loss) once all three nodes are
+routed:

 ```bash
 for s in KV_UNIBUS_users KV_UNIBUS_rooms KV_UNIBUS_members KV_UNIBUS_room_keys \
@@ -151,27 +250,32 @@ done
 # (also OBJ_UNIBUS_blobs if the object store is in use)
 ```

-Until this is done, R1 means the seed node is a **single point of failure for
-authentication**: if it dies, the nonce/KV control plane is unreachable and every
-authenticated request fails closed (auth DoS). R1 is a rollout step, not HA.
+After this each bucket shows `followers_current=2/2` (quorum 2/3). The
+`user add --store kv` command prints that figure for `KV_UNIBUS_users` on every add,
+which is a cheap live HA check.
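
The `followers_current` figure can also be checked mechanically. The following is a minimal sketch, assuming the output line has the shape shown in the `user add` example (`followers_current=2/2`); `fully_replicated` is a hypothetical helper that reads command output on stdin.

```bash
# fully_replicated: succeed only if stdin contains a followers_current=A/B
# field where A equals B and B is greater than zero. A missing field, or a
# bucket with no followers at all, counts as "not replicated".
fully_replicated() {
  _field=$(grep -o 'followers_current=[0-9][0-9]*/[0-9][0-9]*' | head -n 1)
  [ -n "$_field" ] || return 1
  _pair=${_field#followers_current=}   # e.g. "2/2"
  _cur=${_pair%/*}                     # followers that are current
  _all=${_pair#*/}                     # followers expected
  [ "$_all" -gt 0 ] && [ "$_cur" -eq "$_all" ]
}
```

Piping the output of a `user add --store kv` run into `fully_replicated` turns the printed figure into an exit status that a script can act on. With one node down the figure would presumably drop below `2/2`, so a failure here is a prompt to look at the cluster, not proof that the write was lost.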

-## Chaos test (HUMAN — requires the 3 live VPS; NOT run here)
+## Chaos test (HUMAN — requires the 3 live VPS)

 Validate quorum tolerance after R3:

 ```bash
-# Kill one node; the cluster keeps serving (quorum 2/3).
-ssh root@datardos 'systemctl stop membershipd-cluster'
+# Kill one node; the cluster keeps serving (quorum 2/3). On ubuntu nodes use sudo.
+ssh dd 'sudo systemctl stop membershipd-cluster'
 #   -> clients fail over (multiple seed URLs); reads/writes still succeed.
-ssh root@datardos 'systemctl start membershipd-cluster'   # rejoins, catches up
+ssh dd 'sudo systemctl start membershipd-cluster'   # rejoins, catches up

 # Kill two nodes; quorum is LOST — the control plane should fail CLOSED (deny),
 # never fail open. Verify a request is rejected, not silently served.
 ```

-This network-level chaos test (kill 1/3, kill 2/3, partition/split-brain) is part
-of the deploy validation (issue 0003f) and runs against the real VPS — it is
-deliberately out of scope for the authoring agent.
+> **Validated (report 0012).** The 0011 chaos run checked only the control plane
+> (healthz + meta/stream-leader failover + KV readable with 2/3). Report 0012 added
+> the missing data-plane proofs against the live cluster: a real authenticated
+> client (`cmd/clientcheck`, operator identity, nkey+TLS) creating an E2E room and
+> publishing/subscribing — including a node stopped mid-stream, where the client
+> failed over to a survivor and kept receiving with zero loss (quorum 2/3) — and
+> `user add --store kv` committing with one node (the KV leader) down. The kill-2/3
+> fail-closed case remains a documented manual step.

 ## Rollback

@@ -97,6 +97,7 @@ TLS_KEY=${REMOTE_DIR}/tls/server-${name}.key
 ROUTE_TLS_CERT=${REMOTE_DIR}/tls/route-${name}.crt
 ROUTE_TLS_KEY=${REMOTE_DIR}/tls/route-${name}.key
 ROUTE_TLS_CA=${REMOTE_DIR}/tls/cluster-ca.crt
+INTERNAL_ID_FILE=${REMOTE_DIR}/secrets/internal.id
 EOF

  run ssh "$target" "mkdir -p ${REMOTE_DIR}/tls ${REMOTE_DIR}/secrets"
@@ -114,13 +115,16 @@ if [[ $APPLY -eq 0 ]]; then
 fi
 cat <<'NEXT'

-HUMAN — staggered start (do NOT enable all at once; see README "Bring up"):
-  1. Seed node first (e.g. magnus):
-       ssh root@magnus 'systemctl enable --now membershipd-cluster'
-       ssh root@magnus '/opt/unibus/membershipd user add --admin ...'   # seed admin
-  2. Then the other two, one at a time, checking quorum after each:
-       ssh root@homer    'systemctl enable --now membershipd-cluster'
-       ssh root@datardos 'systemctl enable --now membershipd-cluster'
+HUMAN — bring up (see README "Bring up" — a LONE node has no quorum and never
+serves healthz, so do NOT gate the next node on the previous one going green):
+  1. Seed the FIRST admin into the KV via the loopback bootstrap (README
+     "Seed the first admin"); this is needed only for the chicken-and-egg admin.
+  2. Start all three so a 2/3 quorum forms (order does not matter); healthz
+     turns ok only once the meta-group elects a leader (~10-30s cold):
+       for h in magnus homer dd; do ssh "$h" 'sudo systemctl enable --now membershipd-cluster'; done   # dd = datardos
  3. Verify posture + quorum (README "Verify").
-  4. Scale replicas 1 -> 3 once all three are up (README "Scale to R3").
+  4. Ensure R3 on every control-plane stream (README "Replication: go straight to
+     R3"); R1 is a SPOF, not a milestone.
+  5. Add further users with the cluster LIVE — no restart — via
+     `membershipd user add --store kv` (README "Add users to the live cluster").
 NEXT
@@ -33,6 +33,7 @@ ExecStart=/opt/unibus/membershipd \
  --route-tls-cert ${ROUTE_TLS_CERT} \
  --route-tls-key ${ROUTE_TLS_KEY} \
  --route-tls-ca ${ROUTE_TLS_CA} \
+  --internal-id-file ${INTERNAL_ID_FILE} \
  --store kv \
  --kv-replicas ${KV_REPLICAS}
 # Restart=always (NOT on-failure): a clean SIGTERM exits success, and on-failure