docs(cluster): correct runbook + wire --internal-id-file into deploy

Corrections learned from the real 0011 deploy:
- Bring up: the "start magnus alone and verify healthz" order deadlocks: a lone node of a 3-node cluster has no meta-group quorum and never serves healthz until a second node joins. Document a quorum-forming start and that a node never self-serves.
- Replication: R1 is an unusable SPOF (all six control-plane buckets on one node) and the cold start only converges with the three cold-start fixes; go straight to R3 once the cluster forms.
- Add a "user add --store kv" section: the live user-add path that replaces stop-seed-restart, with its security model and idempotency/HA/no-delete semantics.
- Topology: real IPs, ROUTE_NETWORK=public (no WireGuard mesh exists).
- Chaos test: mark the data-plane client + failover proofs as validated (0012).

Deploy machinery now emits the persisted internal identity: the unit gains --internal-id-file ${INTERNAL_ID_FILE} and deploy-cluster.sh writes INTERNAL_ID_FILE into each node's cluster.env, so a fresh deploy enables the live user-add path on every node.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-07 19:41:56 +02:00
parent 3aa5a2c9a9
commit ce72131ddf
3 changed files with 156 additions and 47 deletions
@@ -97,6 +97,7 @@ TLS_KEY=${REMOTE_DIR}/tls/server-${name}.key
 ROUTE_TLS_CERT=${REMOTE_DIR}/tls/route-${name}.crt
 ROUTE_TLS_KEY=${REMOTE_DIR}/tls/route-${name}.key
 ROUTE_TLS_CA=${REMOTE_DIR}/tls/cluster-ca.crt
+INTERNAL_ID_FILE=${REMOTE_DIR}/secrets/internal.id
 EOF

  run ssh "$target" "mkdir -p ${REMOTE_DIR}/tls ${REMOTE_DIR}/secrets"
@@ -114,13 +115,16 @@ if [[ $APPLY -eq 0 ]]; then
 fi
 cat <<'NEXT'

-HUMAN — staggered start (do NOT enable all at once; see README "Bring up"):
-  1. Seed node first (e.g. magnus):
-       ssh root@magnus 'systemctl enable --now membershipd-cluster'
-       ssh root@magnus '/opt/unibus/membershipd user add --admin ...'   # seed admin
-  2. Then the other two, one at a time, checking quorum after each:
-       ssh root@homer    'systemctl enable --now membershipd-cluster'
-       ssh root@datardos 'systemctl enable --now membershipd-cluster'
+HUMAN — bring up (see README "Bring up" — a LONE node has no quorum and never
+serves healthz, so do NOT gate the next node on the previous one going green):
+  1. Seed the FIRST admin into the KV via the loopback bootstrap (README
+     "Seed the first admin"); this is needed only for the chicken-and-egg admin.
+  2. Start all three so a 2/3 quorum forms (order does not matter); healthz
+     turns ok only once the meta-group elects a leader (~10-30s cold):
+       for h in magnus homer datardos; do ssh "$h" 'sudo systemctl enable --now membershipd-cluster'; done
  3. Verify posture + quorum (README "Verify").
-  4. Scale replicas 1 -> 3 once all three are up (README "Scale to R3").
+  4. Ensure R3 on every control-plane stream (README "Replication: go straight to
+     R3"); R1 is a SPOF, not a milestone.
+  5. Add further users with the cluster LIVE — no restart — via
+     `membershipd user add --store kv` (README "Add users to the live cluster").
 NEXT
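
Reviewer sketch (not part of this commit): the new step 2 says not to gate one node on the previous one going green. One way to express that in a script is to poll every node each round and succeed once a strict majority passes. The function names and the probe argument below are hypothetical; the real healthz endpoint and port are not shown in this diff, so the probe is left to the caller.

```shell
#!/usr/bin/env bash
# Sketch only. CHECK_CMD is a hypothetical per-node health probe supplied by
# the caller: it receives a host name and exits 0 when that node is healthy.

# count_healthy CHECK_CMD HOST...
# Prints how many of the given hosts pass the probe.
count_healthy() {
  local check=$1; shift
  local ok=0 h
  for h in "$@"; do
    if "$check" "$h" >/dev/null 2>&1; then
      ok=$((ok + 1))
    fi
  done
  echo "$ok"
}

# wait_for_quorum CHECK_CMD TIMEOUT_SECS HOST...
# Probes all hosts every round (never waiting on a single node) and returns 0
# once a strict majority passes, or 1 when the timeout expires.
wait_for_quorum() {
  local check=$1 timeout=$2; shift 2
  local need=$(( $# / 2 + 1 ))
  local deadline=$(( SECONDS + timeout ))
  while :; do
    if (( $(count_healthy "$check" "$@") >= need )); then
      return 0
    fi
    (( SECONDS >= deadline )) && return 1
    sleep 1
  done
}
```

With a caller-defined probe function `probe`, the call would look like `wait_for_quorum probe 60 magnus homer datardos`; the 60-second timeout is an arbitrary choice that sits above the ~10-30s cold election time quoted in the message.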