docs(cluster): correct runbook + wire --internal-id-file into deploy

Corrections learned from the real 0011 deploy:
- Bring up: the "start magnus alone and verify healthz" order deadlocks — a
  lone node of a 3-node cluster has no meta-group quorum and never serves
  healthz until a second node joins. Document a quorum-forming start and that
  a node never self-serves.
- Replication: R1 is an unusable SPOF (all six control-plane buckets on one
  node) and the cold start only converges with the three cold-start fixes;
  go straight to R3 once the cluster forms.
- Add a "user add --store kv" section: the live user-add path that replaces
  stop-seed-restart, with its security model and idempotency/HA/no-delete
  semantics.
- Topology: real IPs, ROUTE_NETWORK=public (no WireGuard mesh exists).
- Chaos test: mark the data-plane client + failover proofs as validated (0012).

Deploy machinery now emits the persisted internal identity: the unit gains
--internal-id-file ${INTERNAL_ID_FILE} and deploy-cluster.sh writes
INTERNAL_ID_FILE into each node's cluster.env, so a fresh deploy enables the
live user-add path on every node.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
+143 -39
@@ -5,9 +5,12 @@ This directory holds the material to bring up unibus as a **3-node cluster**
 plane (rooms/members/keys/users on JetStream KV + the anti-replay nonce bucket)
 survives the loss of any one node (quorum 2/3).
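
The "quorum 2/3" figure follows from the usual majority rule for a Raft-style group: with n voters, floor(n/2) + 1 of them must be reachable to elect a leader. A minimal sketch of that arithmetic is below. The function names `quorum_size` and `has_quorum` are illustrative only and are not part of the repository.

```bash
#!/usr/bin/env bash
# Illustrative helpers (hypothetical names, not part of unibus).

# Majority needed in a group of N voters: floor(N/2) + 1.
quorum_size() {
  local n=$1
  echo $(( n / 2 + 1 ))
}

# Succeeds (exit 0) when UP reachable voters out of N still form a majority.
has_quorum() {
  local n=$1 up=$2
  (( up >= n / 2 + 1 ))
}

quorum_size 3                                  # prints 2
has_quorum 3 2 && echo "2 of 3: quorum"        # one node lost, still a majority
has_quorum 3 1 || echo "1 of 3: no quorum"     # a lone node cannot elect a leader
```

The last line is the case that matters for bring-up: one started node out of three is below the majority, so it has no leader to serve from until a second node joins.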

-> **The agent that authored this never touched a VPS.** Every step that changes a
-> remote host is marked **HUMAN** and is executed by the operator. `deploy-cluster.sh`
-> defaults to a dry run.
+> **Status: this cluster is DEPLOYED in production** (magnus + homer + datardos,
+> R3, enforce+ACL+TLS) — see report 0011. The runbook below was authored before any
+> VPS existed and has since been **corrected against the real deploy** (report 0012):
+> the start ordering, the R1→R3 reality, and the live user-add path were all wrong
+> or missing. Steps that change a remote host are marked **HUMAN**; `deploy-cluster.sh`
+> still defaults to a dry run.

 ## Files

@@ -22,18 +25,22 @@ Generated keys/secrets (`out/`, `build/`, `secrets/`) are **gitignored** — the
 secret and never leave the operator's trusted machine except over the secure
 rsync channel.

-## Topology
+## Topology (as deployed, report 0011)

-| Node | SSH | Public IP | WireGuard IP | Role |
-|---|---|---|---|---|
-| magnus | `magnus` | `<MAGNUS_PUBLIC_IP>` | `<MAGNUS_WG_IP>` | seed (first up) |
-| homer | `homer` | `141.94.69.66` | `<HOMER_WG_IP>` | replica |
-| datardos | `dd` | `51.91.100.142` | `<DATARDOS_WG_IP>` (10.21.0.x) | replica |
+| Node | SSH | Public IP | Role |
+|---|---|---|---|
+| magnus | `magnus` (root) | `135.125.201.30` | node — **= organic-machine.com = `om`**, the critical host (caddy + gitea + registry-api + monitoring); the bus runs alongside, untouched |
+| homer | `homer` (ubuntu+sudo) | `141.94.69.66` | node |
+| datardos | `dd` (ubuntu+sudo) | `51.91.100.142` | node |

-The route layer (server-to-server) prefers the **WireGuard mesh**
-(`ROUTE_NETWORK=wg`); the client data plane and the HTTP control plane are reached
-over the public IPs. The route CA is **separate** from the client CA, so a client
-cert can never be presented to the route port.
+`ROUTE_NETWORK=public`, **not `wg`**: there is no WireGuard mesh between the three
+nodes (homer and datardos do not even have the `wg` binary; om's only WG peers are
+the operator's PCs). The server-to-server routes therefore travel over the public
+IPs, protected by the **separate cluster route CA** (mutual route TLS) — a client
+data-plane cert can never be presented to the route port. The client data plane and
+the HTTP control plane are also reached over the public IPs. There is no fixed
+"seed" node: with R3 the three are peers (see "Bring up" for why a lone node cannot
+self-serve).

 ## Prerequisites (HUMAN, once)

@@ -93,25 +100,48 @@ SEED

 > The KV written here lives in `./local_files/jetstream`, which the cluster unit
 > reuses (`--nats-store` default), so the admin is present when the enforce cluster
-> starts. Additional users are added the same loopback way until a
-> `user add --store kv` exists (see GAP in report 0009).
+> starts. This loopback bootstrap is needed ONLY for the very first admin (the
+> chicken-and-egg). **Every user after that is added with the cluster live** — no
+> stop-seed-restart — via `user add --store kv` (see "Add users to the live
+> cluster" below, report 0012).

-## Bring up (HUMAN — staggered)
+## Bring up (HUMAN)

-Bring up the seed first, then the replicas one at a time, checking each joins.
+> **CORRECTION (report 0012).** The original instruction — "start magnus alone and
+> verify healthz, then add the others" — is **WRONG and will look like a hung
+> deploy.** A 3-node JetStream cluster forms a RAFT meta-group that needs a quorum
+> (2 of 3) to elect a leader. A single started node has no quorum, so its JetStream
+> meta never becomes current: `--store kv` blocks creating the KV buckets and
+> **`/healthz` never returns ok** until a second node joins. Waiting for magnus to
+> "go green" before starting the others therefore deadlocks the rollout.
+
+Start the nodes so a quorum forms. On a **clean cluster** the simplest correct
+procedure is to start all three close together and let the meta-group converge:

 ```bash
-# 1. Seed node (after the seed step above).
-ssh root@magnus 'systemctl enable --now membershipd-cluster'
-ssh root@magnus 'curl -fsS https://127.0.0.1:8470/healthz --cacert /opt/unibus/tls/ca.crt'
+# Start all three (order does not matter); each blocks on the others until a
+# 2/3 quorum elects a JetStream meta leader, then the KV buckets are created.
+for h in magnus homer datardos; do ssh "$h" 'sudo systemctl enable --now membershipd-cluster'; done

-# 2. Replicas, one at a time.
-ssh root@homer 'systemctl enable --now membershipd-cluster'
-ssh root@datardos 'systemctl enable --now membershipd-cluster'
+# Only NOW does healthz return ok — once the meta-group has a leader (give it
+# ~10-30s on a cold start). Poll, do not assume the first node is broken.
+for h in magnus homer datardos; do
+  echo "== $h =="; ssh "$h" 'curl -fsS https://127.0.0.1:8470/healthz --cacert /opt/unibus/tls/ca.crt || echo "(not ready yet — needs quorum)"'
+done
 ```
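
The one-shot poll in the block above can be wrapped in a bounded wait, so that a slow cold start is not mistaken for a failure and a cluster that really has not formed still surfaces as a timeout. The helper below is a sketch only: `wait_until` and its arguments are hypothetical names, not part of the repository, and the 60s/2s values in the usage comment are assumptions chosen to sit above the "~10-30s" cold-start figure.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not in the repo): retry a probe command until it
# succeeds or a deadline passes. Returns 0 on success, 1 on timeout.
#   wait_until <timeout_seconds> <interval_seconds> <command> [args...]
wait_until() {
  local timeout_s=$1 interval_s=$2
  shift 2
  local deadline=$(( SECONDS + timeout_s ))
  while true; do
    # Run the probe; its output is discarded, only the exit status matters.
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    if (( SECONDS >= deadline )); then
      return 1
    fi
    sleep "$interval_s"
  done
}

# Usage sketch (start ALL nodes first; never gate one node's start on another):
# for h in magnus homer datardos; do
#   if wait_until 60 2 ssh "$h" 'curl -fsS https://127.0.0.1:8470/healthz --cacert /opt/unibus/tls/ca.crt'; then
#     echo "$h: healthz ok"
#   else
#     echo "$h: still not ready after 60s (check quorum / cold-start fixes)"
#   fi
# done
```

The wait is applied only after every node has been started, which keeps it consistent with the rule that a lone node never serves healthz.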

-> Initial rollout runs at **R1** (`KV_REPLICAS=1` in `nodes.env`): the buckets live
-> on the seed only. This is NOT HA yet — see "Scale to R3".
+A **staggered** start also works, but only because `membershipd`'s KV open RETRIES
+the bucket creation for a 120s bootstrap budget (issue 0006g, fix #3): the first
+node sits in that retry loop — NOT serving healthz — until the second node makes a
+quorum, then both converge and the third catches up. Either way, a lone node never
+self-serves; do not gate the next node's start on the previous one's healthz.
+
+> A cold multi-node start only converges because of **three cold-start fixes**
+> (report 0011): route pooling off (`PoolSize=-1`), `NoAdvertise=true` (Docker
+> bridge IPs not gossiped), and the KV-open retry loop above. Without them the
+> meta-group re-elects leaders forever and bucket creation hangs. If a fresh
+> cluster will not form, confirm the running binary contains these fixes before
+> touching config.

 ## Promote an existing single-node (SQLite) deployment (HUMAN, optional)

@@ -137,11 +167,80 @@ ssh root@magnus 'nats --server nats://127.0.0.1:4250 server list' # 3 servers,

 A healthy cluster shows 3 routed servers and a JetStream meta-group with a leader.

-## Scale to R3 (HUMAN — real HA)
+## Add users to the live cluster (HUMAN — `user add --store kv`)

-Once all three nodes are up and routed, raise the replication factor of every
-control-plane stream from 1 to 3 IN PLACE (no data loss), then flip `KV_REPLICAS=3`
-in `nodes.env` so future (re)deploys keep it:
+With the cluster up, add (and revoke) bus users **without stopping anything**,
+directly against the replicated KV allowlist. This replaces the stop-seed-restart
+procedure the original runbook implied for every user beyond the first admin.
+
+The mechanism is the cluster's own **privileged internal connection**: under
+`enforce` every bus user is confined by the per-subject ACL to its own rooms, so no
+ordinary identity may write the control-plane buckets. The only identity the
+authenticator grants full JetStream permissions is `membershipd`'s internal service
+identity. The unit persists that identity to `${INTERNAL_ID_FILE}`
+(`/opt/unibus/secrets/internal.id`, 0600) via `--internal-id-file`, so the same key
+is available to the CLI. Run the CLI **on a node, over loopback** (the data-plane
+TLS cert SAN covers `127.0.0.1`); reading the identity file requires root on that
+node, which already implies full control of it, so this adds no practical exposure.
+
+```bash
+# Add a member to the live cluster's replicated allowlist (run on any node).
+ssh root@magnus 'sudo /opt/unibus/membershipd user add --store kv \
+  --handle alice --role member --sign-pub <64-hex-ed25519-pub>'
+# -> added user "alice" (...) role=member
+# -> KV_UNIBUS_users: leader=<node> followers_current=2/2 msgs=N (replicated, HA)
+
+# List / revoke against the same live KV:
+ssh root@magnus 'sudo /opt/unibus/membershipd user list --store kv'
+ssh root@magnus 'sudo /opt/unibus/membershipd user revoke --store kv <64-hex-ed25519-pub>'
+```
+
+Defaults assume an on-node invocation (`--nats-url nats://127.0.0.1:4250`,
+`--internal-id-file /opt/unibus/secrets/internal.id`, `--ca /opt/unibus/tls/ca.crt`,
+`--kv-replicas 3`). Semantics:
+
+- **Idempotent / non-destructive**: re-adding the same key is an explicit
+  `already registered` error (exit 1), never a silent overwrite — a re-add cannot
+  flip a member to admin. To replace a user, `revoke` then add.
+- **HA**: the write commits through the JetStream quorum, so it succeeds even with
+  one node down (2/3); the printed `followers_current` shows replication.
+- **No hard delete**: `revoke` flips status to `revoked` (denied on both planes,
+  auditable); the KV has no row deletion, matching the SQLite store.
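
Because a re-add exits 1 with an `already registered` error, a provisioning script that wants "make sure this key is present" behaviour has to tell that case apart from a real failure. The wrapper below is a sketch under stated assumptions: `ensure_added` is a hypothetical name, and it assumes the error message contains the literal text `already registered` on stdout or stderr (the exact message format is not specified here).

```bash
#!/usr/bin/env bash
# Hypothetical wrapper (not part of membershipd): run an add command and
# treat the "already registered" error as "nothing to do".
#   ensure_added <command> [args...]
ensure_added() {
  local out rc
  # Capture stdout+stderr and the exit status without tripping `set -e`.
  out=$("$@" 2>&1) && rc=0 || rc=$?
  if (( rc == 0 )); then
    printf '%s\n' "$out"
    return 0
  fi
  # ASSUMPTION: the duplicate-key error contains this literal string.
  if [[ $out == *"already registered"* ]]; then
    echo "unchanged: key already registered (role NOT modified)"
    return 0
  fi
  # Any other failure (no quorum, bad key, TLS error, ...) is passed through.
  printf '%s\n' "$out" >&2
  return "$rc"
}

# Usage sketch, reusing the command shown in the section above:
# ensure_added ssh root@magnus 'sudo /opt/unibus/membershipd user add --store kv \
#   --handle alice --role member --sign-pub <64-hex-ed25519-pub>'
```

Note the trade-off: swallowing the duplicate error also hides the case where the existing entry has a different role than the one requested. Since a re-add never changes the role, a caller that needs a specific role still has to `revoke` and add again.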
+
+> **Rollout note (report 0012):** the live verification deployed this binary +
+> `--internal-id-file` to **datardos only** (the non-critical node). magnus and
+> homer still run the 0011 binary. To make the capability available (and the unit)
+> on all three — recommended, the posture is identical so there is no urgency — roll
+> the new binary with backups, one node at a time, verifying healthz between each:
+> ```bash
+> for h in homer magnus; do
+>   ssh "$h" 'sudo cp -a /opt/unibus/membershipd /opt/unibus/membershipd.bak' # backup
+>   scp build/membershipd "$h:/tmp/m" && ssh "$h" 'sudo install -o ubuntu -g ubuntu -m0775 /tmp/m /opt/unibus/membershipd'
+>   # add INTERNAL_ID_FILE=/opt/unibus/secrets/internal.id to /opt/unibus/cluster.env
+>   # add `--internal-id-file ${INTERNAL_ID_FILE} \` to the unit before `--store kv`
+>   ssh "$h" 'sudo systemctl daemon-reload && sudo systemctl restart membershipd-cluster'
+>   ssh "$h" 'curl -fsS https://127.0.0.1:8470/healthz --cacert /opt/unibus/tls/ca.crt' # green before next
+> done
+> ```
+> (`deploy-cluster.sh` + the unit template already emit `INTERNAL_ID_FILE` and the
+> flag, so a fresh `./deploy-cluster.sh --yes` is correct for all three.)
+
+## Replication: go straight to R3 (HUMAN — real HA)
+
+> **CORRECTION (report 0012).** The original "start at R1, then scale to R3" plan
+> assumed R1 is a usable interim state. **It is not, in this cluster.** At R1 all six
+> control-plane buckets (`KV_UNIBUS_users/rooms/members/room_keys/rooms_by_member`
+> + `KV_UNIBUS_nonces`) live on a SINGLE node — a hard **SPOF for authentication**:
+> if that node dies, the nonce/KV control plane is unreachable and EVERY
+> authenticated request fails closed (auth DoS). Worse, the cold multi-node start
+> only converges at all because of the three cold-start fixes (see "Bring up"); the
+> real deploy never ran a healthy R1 and **jumped straight to R3 once the cluster
+> formed.** Treat R1 as a transient artifact of bucket creation, not a milestone.
+
+The deployed config already sets `KV_REPLICAS=3` in `nodes.env`. If buckets were
+created at R1 (e.g. only one node was up when `--store kv` first opened them), raise
+every control-plane stream to R3 IN PLACE (no data loss) once all three nodes are
+routed:

 ```bash
 for s in KV_UNIBUS_users KV_UNIBUS_rooms KV_UNIBUS_members KV_UNIBUS_room_keys \
@@ -151,27 +250,32 @@ done
 # (also OBJ_UNIBUS_blobs if the object store is in use)
 ```

-Until this is done, R1 means the seed node is a **single point of failure for
-authentication**: if it dies, the nonce/KV control plane is unreachable and every
-authenticated request fails closed (auth DoS). R1 is a rollout step, not HA.
+After this each bucket shows `followers_current=2/2` (quorum 2/3). The
+`user add --store kv` command prints that figure for `KV_UNIBUS_users` on every add,
+which is a cheap live HA check.
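
The `followers_current` figure can also be checked mechanically rather than by eye. The sketch below assumes the status line has the shape shown in the user-add example (`... followers_current=2/2 msgs=N ...`). `followers_ok` is a hypothetical helper, not part of the repository, and treating `0/0` as "not replicated" is an assumption of this sketch.

```bash
#!/usr/bin/env bash
# Hypothetical check (not in the repo): read command output on stdin and
# succeed only if the first `followers_current=A/B` field shows every
# follower current (A == B, B > 0).
# Exit codes: 0 = fully replicated, 1 = followers lagging or none, 2 = field not found.
followers_ok() {
  local field cur total
  field=$(grep -o 'followers_current=[0-9]*/[0-9]*' | head -n 1 || true)
  [[ -n $field ]] || return 2
  cur=${field#followers_current=}   # "2/2"
  cur=${cur%/*}                     # "2"  (followers that are current)
  total=${field##*/}                # "2"  (followers expected)
  [[ -n $cur && -n $total ]] || return 2
  (( total > 0 && cur == total ))
}

# Usage sketch with a captured sample line:
sample='KV_UNIBUS_users: leader=node-a followers_current=2/2 msgs=7 (replicated, HA)'
if printf '%s\n' "$sample" | followers_ok; then
  echo "users bucket: all followers current"
fi
```

A result of exit code 1 on a live add (for example `1/2`) would be expected while one node is down; the write still commits at quorum 2/3, so this check reports replication health rather than write success.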

-## Chaos test (HUMAN — requires the 3 live VPS; NOT run here)
+## Chaos test (HUMAN — requires the 3 live VPS)

 Validate quorum tolerance after R3:

 ```bash
-# Kill one node; the cluster keeps serving (quorum 2/3).
-ssh root@datardos 'systemctl stop membershipd-cluster'
+# Kill one node; the cluster keeps serving (quorum 2/3). On ubuntu nodes use sudo.
+ssh dd 'sudo systemctl stop membershipd-cluster'
 # -> clients fail over (multiple seed URLs); reads/writes still succeed.
-ssh root@datardos 'systemctl start membershipd-cluster' # rejoins, catches up
+ssh dd 'sudo systemctl start membershipd-cluster' # rejoins, catches up

 # Kill two nodes; quorum is LOST — the control plane should fail CLOSED (deny),
 # never fail open. Verify a request is rejected, not silently served.
 ```

-This network-level chaos test (kill 1/3, kill 2/3, partition/split-brain) is part
-of the deploy validation (issue 0003f) and runs against the real VPS — it is
-deliberately out of scope for the authoring agent.
+> **Validated (report 0012).** The 0011 chaos run checked only the control plane
+> (healthz + meta/stream-leader failover + KV readable with 2/3). Report 0012 added
+> the missing data-plane proofs against the live cluster: a real authenticated
+> client (`cmd/clientcheck`, operator identity, nkey+TLS) creating an E2E room and
+> publishing/subscribing — including a node stopped mid-stream, where the client
+> failed over to a survivor and kept receiving with zero loss (quorum 2/3) — and
+> `user add --store kv` committing with one node (the KV leader) down. The kill-2/3
+> fail-closed case remains a documented manual step.

 ## Rollback

@@ -97,6 +97,7 @@ TLS_KEY=${REMOTE_DIR}/tls/server-${name}.key
 ROUTE_TLS_CERT=${REMOTE_DIR}/tls/route-${name}.crt
 ROUTE_TLS_KEY=${REMOTE_DIR}/tls/route-${name}.key
 ROUTE_TLS_CA=${REMOTE_DIR}/tls/cluster-ca.crt
+INTERNAL_ID_FILE=${REMOTE_DIR}/secrets/internal.id
 EOF

 run ssh "$target" "mkdir -p ${REMOTE_DIR}/tls ${REMOTE_DIR}/secrets"
@@ -114,13 +115,16 @@ if [[ $APPLY -eq 0 ]]; then
 fi
 cat <<'NEXT'

-HUMAN — staggered start (do NOT enable all at once; see README "Bring up"):
-  1. Seed node first (e.g. magnus):
-       ssh root@magnus 'systemctl enable --now membershipd-cluster'
-       ssh root@magnus '/opt/unibus/membershipd user add --admin ...' # seed admin
-  2. Then the other two, one at a time, checking quorum after each:
-       ssh root@homer 'systemctl enable --now membershipd-cluster'
-       ssh root@datardos 'systemctl enable --now membershipd-cluster'
+HUMAN — bring up (see README "Bring up" — a LONE node has no quorum and never
+serves healthz, so do NOT gate the next node on the previous one going green):
+  1. Seed the FIRST admin into the KV via the loopback bootstrap (README
+     "Seed the first admin"); this is needed only for the chicken-and-egg admin.
+  2. Start all three so a 2/3 quorum forms (order does not matter); healthz
+     turns ok only once the meta-group elects a leader (~10-30s cold):
+       for h in magnus homer datardos; do ssh "$h" 'sudo systemctl enable --now membershipd-cluster'; done
   3. Verify posture + quorum (README "Verify").
-  4. Scale replicas 1 -> 3 once all three are up (README "Scale to R3").
+  4. Ensure R3 on every control-plane stream (README "Replication: go straight to
+     R3"); R1 is a SPOF, not a milestone.
+  5. Add further users with the cluster LIVE — no restart — via
+     `membershipd user add --store kv` (README "Add users to the live cluster").
 NEXT
@@ -33,6 +33,7 @@ ExecStart=/opt/unibus/membershipd \
   --route-tls-cert ${ROUTE_TLS_CERT} \
   --route-tls-key ${ROUTE_TLS_KEY} \
   --route-tls-ca ${ROUTE_TLS_CA} \
+  --internal-id-file ${INTERNAL_ID_FILE} \
   --store kv \
   --kv-replicas ${KV_REPLICAS}
 # Restart=always (NOT on-failure): a clean SIGTERM exits success, and on-failure