docs(cluster): correct runbook + wire --internal-id-file into deploy
Corrections learned from the real 0011 deploy:
- Bring up: the "start magnus alone and verify healthz" order deadlocks. A
  lone node of a 3-node cluster has no meta-group quorum and never serves
  healthz until a second node joins. Document a quorum-forming start and
  that a single node never serves healthz on its own.
- Replication: R1 is an unusable SPOF (all six control-plane buckets on one
  node), and the cold start only converges with the three cold-start fixes;
  go straight to R3 once the cluster forms.
- Add a "user add --store kv" section: the live user-add path that replaces
  stop-seed-restart, with its security model and idempotency/HA/no-delete
  semantics.
- Topology: real IPs, ROUTE_NETWORK=public (no WireGuard mesh exists).
- Chaos test: mark the data-plane client + failover proofs as validated (0012).
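
The bring-up correction above comes down to majority arithmetic. The following is a minimal sketch, assuming the meta-group uses a simple majority quorum (floor(n/2) + 1, which matches the 2/3 quorum for three nodes). The function names are illustrative and the health URL is a hypothetical argument; neither is taken from this repository.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical, not part of deploy-cluster.sh.

# Smallest majority of an n-node group: floor(n/2) + 1.
quorum_size() {
  local n=$1
  echo $(( n / 2 + 1 ))
}

# Exit 0 when `up` running nodes can form a quorum in a cluster of `total`.
has_quorum() {
  local up=$1 total=$2
  (( up >= $(quorum_size "$total") ))
}

# Poll a health URL until it answers successfully or the timeout expires.
# ASSUMPTION: the caller supplies the URL; the real endpoint and port are
# not shown in this commit.
wait_healthz() {
  local url=$1 timeout=${2:-60} deadline
  deadline=$(( SECONDS + timeout ))
  until curl -fsS --max-time 2 "$url" >/dev/null 2>&1; do
    (( SECONDS >= deadline )) && return 1
    sleep 2
  done
}
```

With three nodes, `has_quorum 1 3` fails and `has_quorum 2 3` succeeds, which is why gating the second node on the first node's healthz can never complete. A check like `wait_healthz` is only meaningful after at least two nodes have been started.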
Deploy machinery now emits the persisted internal identity: the unit gains
--internal-id-file ${INTERNAL_ID_FILE} and deploy-cluster.sh writes
INTERNAL_ID_FILE into each node's cluster.env, so a fresh deploy enables the
live user-add path on every node.
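
As a pre-flight check, one could confirm that a generated cluster.env actually carries the new key before starting the unit. This is a hedged sketch, assuming cluster.env is a plain KEY=VALUE file as the diff suggests; `env_value` is a hypothetical helper, not something deploy-cluster.sh provides.

```shell
#!/usr/bin/env bash
# Sketch only: a stand-alone check, not part of the deploy machinery.

# Print the value of KEY from a KEY=VALUE env file.
# Fails (non-zero exit) if the key is absent or its value is empty.
env_value() {
  local file=$1 key=$2 line value
  line=$(grep -m1 "^${key}=" "$file") || return 1
  value=${line#*=}          # strip everything up to the first '='
  [[ -n $value ]] || return 1
  printf '%s\n' "$value"
}
```

For example, `env_value /path/to/cluster.env INTERNAL_ID_FILE` would print the configured path, or fail if an older deploy never wrote the key. The path here is a placeholder.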
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@@ -97,6 +97,7 @@ TLS_KEY=${REMOTE_DIR}/tls/server-${name}.key
 ROUTE_TLS_CERT=${REMOTE_DIR}/tls/route-${name}.crt
 ROUTE_TLS_KEY=${REMOTE_DIR}/tls/route-${name}.key
 ROUTE_TLS_CA=${REMOTE_DIR}/tls/cluster-ca.crt
+INTERNAL_ID_FILE=${REMOTE_DIR}/secrets/internal.id
 EOF
 
 run ssh "$target" "mkdir -p ${REMOTE_DIR}/tls ${REMOTE_DIR}/secrets"
@@ -114,13 +115,16 @@ if [[ $APPLY -eq 0 ]]; then
 fi
 cat <<'NEXT'
 
-HUMAN — staggered start (do NOT enable all at once; see README "Bring up"):
-1. Seed node first (e.g. magnus):
-ssh root@magnus 'systemctl enable --now membershipd-cluster'
-ssh root@magnus '/opt/unibus/membershipd user add --admin ...' # seed admin
-2. Then the other two, one at a time, checking quorum after each:
-ssh root@homer 'systemctl enable --now membershipd-cluster'
-ssh root@datardos 'systemctl enable --now membershipd-cluster'
+HUMAN — bring up (see README "Bring up" — a LONE node has no quorum and never
+serves healthz, so do NOT gate the next node on the previous one going green):
+1. Seed the FIRST admin into the KV via the loopback bootstrap (README
+"Seed the first admin"); this is needed only for the chicken-and-egg admin.
+2. Start all three so a 2/3 quorum forms (order does not matter); healthz
+turns ok only once the meta-group elects a leader (~10-30s cold):
+for h in magnus homer datardos; do ssh "$h" 'sudo systemctl enable --now membershipd-cluster'; done
 3. Verify posture + quorum (README "Verify").
-4. Scale replicas 1 -> 3 once all three are up (README "Scale to R3").
+4. Ensure R3 on every control-plane stream (README "Replication: go straight to
+R3"); R1 is a SPOF, not a milestone.
+5. Add further users with the cluster LIVE — no restart — via
+`membershipd user add --store kv` (README "Add users to the live cluster").
 NEXT