Merge issue/0006g-deploy: cluster deploy material (magnus+homer+datardos, R3 HA)

2026-06-07 17:31:13 +02:00
parent 24ff45ca7e 48a3d6be33
commit ae39e35fb4
6 changed files with 523 additions and 0 deletions
@@ -0,0 +1,7 @@
+# Generated TLS material and secrets — NEVER commit (audit 0008: keys/secret).
+out/
+build/
+secrets/
+*.key
+*.srl
+cluster-ca.crt
@@ -0,0 +1,181 @@
+# unibus cluster — 3-node deploy runbook (issue 0006g)
+
+This directory holds the material to bring up unibus as a **3-node cluster**
+(`magnus` + `homer` + `datardos`) for real HA: with **R3** replication the control
+plane (rooms/members/keys/users on JetStream KV + the anti-replay nonce bucket)
+survives the loss of any one node (quorum 2/3).
+
+> **The agent that authored this never touched a VPS.** Every step that changes a
+> remote host is marked **HUMAN** and is executed by the operator. `deploy-cluster.sh`
+> defaults to a dry run.
+
+## Files
+
+| File | What it is |
+|---|---|
+| `nodes.env` | Topology: cluster name, ports, and the per-node rows (name, ssh host, public IP, WG IP). **HUMAN fills the placeholders.** |
+| `generate-cluster-certs.sh` | Mints a **separate cluster route CA** + a route cert per node, and a data-plane server cert per node signed by the **client CA** (`../tls/ca.*`). |
+| `membershipd-cluster.service` | One systemd unit, parameterized per node by `/opt/unibus/cluster.env`. enforce + per-subject ACL + TLS + `--store kv`, `Restart=always`. |
+| `deploy-cluster.sh` | Cross-builds the linux binary, generates each node's `cluster.env`, and (with `--yes`) rsyncs everything + installs the unit. Staggered start is manual. |
+
+Generated keys/secrets (`out/`, `build/`, `secrets/`) are **gitignored** — they are
+secret and never leave the operator's trusted machine except over the secure
+rsync channel.
+
+## Topology
+
+| Node | SSH | Public IP | WireGuard IP | Role |
+|---|---|---|---|---|
+| magnus | `magnus` | `<MAGNUS_PUBLIC_IP>` | `<MAGNUS_WG_IP>` | seed (first up) |
+| homer | `homer` | `141.94.69.66` | `<HOMER_WG_IP>` | replica |
+| datardos | `dd` | `51.91.100.142` | `<DATARDOS_WG_IP>` (10.21.0.x) | replica |
+
+The route layer (server-to-server) prefers the **WireGuard mesh**
+(`ROUTE_NETWORK=wg`); the client data plane and the HTTP control plane are reached
+over the public IPs. The route CA is **separate** from the client CA, so a client
+cert is never trusted on the route port.
+
+## Prerequisites (HUMAN, once)
+
+1. **Fill `nodes.env`** — replace every `<PLACEHOLDER>` (magnus public IP, all WG
+   IPs). The scripts refuse to run while any remain.
+2. **Client CA exists** — `../tls/ca.crt` + `../tls/ca.key`. If not, run
+   `../tls/generate-certs.sh` on the CA host (om) first. The cluster reuses this CA
+   for the data plane so existing clients keep trusting the bus.
+3. **Mint cluster TLS**:
+   ```bash
+   ./generate-cluster-certs.sh        # writes out/<name>/ ; --force to rotate the cluster CA
+   ```
+4. **Create the route secret** (out of argv, shared by all nodes):
+   ```bash
+   mkdir -p secrets && openssl rand -hex 32 > secrets/cluster.pass
+   ```
+5. **SSH access** as `root` to each node's SSH host works (`ssh magnus true`, `ssh dd true`, ...).
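+
+The `dd` alias for datardos has to resolve on the operator's machine. A minimal
+`~/.ssh/config` sketch, assuming the alias points at the public IP from the
+topology table (an assumption; it could equally point at a WireGuard address or a
+DNS name):
+
+```text
+Host dd
+    HostName 51.91.100.142
+    User root
+```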
+
+## Stage the nodes
+
+```bash
+./deploy-cluster.sh            # DRY RUN: prints the full plan, touches no remote host
+./deploy-cluster.sh --yes      # HUMAN: actually rsync + install the unit on all 3 nodes
+```
+
+This cross-builds `membershipd` (linux/amd64, `CGO_ENABLED=0`), writes each node's
+`cluster.env` (its `NODE_NAME` and the `--routes` to the OTHER two nodes), and
+ships the binary, the node's TLS material, the secret, the env file and the unit.
+It does **not** start anything.
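+
+For orientation, a sketch of the `cluster.env` this step would generate for
+`magnus` with the values committed in `nodes.env` (`ROUTE_NETWORK=wg`,
+`KV_REPLICAS=1`). The WG placeholders are left unfilled here on purpose; the real
+file carries the addresses the operator supplied:
+
+```bash
+NODE_NAME=magnus
+CLUSTER_NAME=unibus
+CLUSTER_USER=unibus-cluster
+KV_REPLICAS=1
+HTTP_PORT=8470
+NATS_CLIENT_PORT=4250
+NATS_ROUTE_PORT=6250
+ROUTES=nats://<HOMER_WG_IP>:6250,nats://<DATARDOS_WG_IP>:6250
+CLUSTER_PASS_FILE=/opt/unibus/secrets/cluster.pass
+TLS_CERT=/opt/unibus/tls/server-magnus.crt
+TLS_KEY=/opt/unibus/tls/server-magnus.key
+ROUTE_TLS_CERT=/opt/unibus/tls/route-magnus.crt
+ROUTE_TLS_KEY=/opt/unibus/tls/route-magnus.key
+ROUTE_TLS_CA=/opt/unibus/tls/cluster-ca.crt
+```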
+
+## Seed the first admin into the KV (HUMAN — loopback bootstrap)
+
+The empty KV control plane has no users, and under `enforce` no external tool can
+write the FIRST admin over NATS (it would need to be an admin already, a
+chicken-and-egg problem). The `user` CLI also writes only to a local SQLite file, not the
+KV. So the first admin is seeded on the seed node through a **loopback, no-auth
+bootstrap** that populates the same JetStream store the cluster unit then reuses:
+
+```bash
+ssh root@magnus 'bash -s' <<'SEED'
+set -euo pipefail
+cd /opt/unibus
+# a) Put the first admin into a local SQLite seed file.
+./membershipd user add --db ./seed.db --handle root --sign-pub <ADMIN_SIGN_PUB_HEX> --role admin
+# b) Bring up a TEMPORARY loopback, no-auth, single-node KV server on the cluster's
+#    own JetStream store dir (not exposed; bus-auth off is allowed on 127.0.0.1).
+./membershipd --store kv --bus-auth off --bind 127.0.0.1 \
+  --nats-store ./local_files/jetstream --db ./seed.db >/tmp/seed-boot.log 2>&1 &
+BOOT=$!; sleep 2
+# c) Migrate the admin from SQLite into the replicated KV (loopback — no --ca needed).
+./membershipd migrate-to-kv --db ./seed.db --nats-url nats://127.0.0.1:4250 --replicas 1
+# d) Stop the bootstrap server. The KV buckets persist in ./local_files/jetstream.
+kill "$BOOT"; wait "$BOOT" 2>/dev/null || true
+rm -f ./seed.db
+SEED
+```
+
+> The KV written here lives in `./local_files/jetstream`, which the cluster unit
+> reuses (`--nats-store` default), so the admin is present when the enforce cluster
+> starts. Additional users are added the same loopback way until a
+> `user add --store kv` exists (see GAP in report 0009).
+
+## Bring up (HUMAN — staggered)
+
+Bring up the seed first, then the replicas one at a time, checking each joins.
+
+```bash
+# 1. Seed node (after the seed step above).
+ssh root@magnus 'systemctl enable --now membershipd-cluster'
+ssh root@magnus 'curl -fsS https://127.0.0.1:8470/healthz --cacert /opt/unibus/tls/ca.crt'
+
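+# 2. Replicas, one at a time; confirm each has joined before starting the next.
+#    datardos is reached through its SSH alias `dd` (see Topology).
+ssh root@homer 'systemctl enable --now membershipd-cluster'
+ssh root@magnus 'nats --server nats://127.0.0.1:4250 server list'   # homer routed?
+ssh root@dd    'systemctl enable --now membershipd-cluster'
+ssh root@magnus 'nats --server nats://127.0.0.1:4250 server list'   # datardos routed?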
+```
+
+> Initial rollout runs at **R1** (`KV_REPLICAS=1` in `nodes.env`): the buckets live
+> on the seed only. This is NOT HA yet — see "Scale to R3".
+
+## Promote an existing single-node (SQLite) deployment (HUMAN, optional)
+
+Instead of seeding fresh, you can migrate an existing single-node `unibus.db` into
+the KV — **loopback only** (the allowlist would otherwise travel cleartext; the
+command refuses a remote target without `--ca`). Use the same loopback-bootstrap
+shape as the seed step (temporary `--bus-auth off` server on 127.0.0.1, then
+`migrate-to-kv --db /opt/unibus/local_files/unibus.db`).
+
+## Verify
+
+```bash
+# Posture on every node — all must be enforce+acl+tls+cluster, store=kv.
+for h in magnus homer dd; do   # SSH aliases from nodes.env (datardos is reached as dd)
+  echo "== $h =="
+  ssh root@$h 'curl -fsS https://127.0.0.1:8470/healthz --cacert /opt/unibus/tls/ca.crt'
+done
+
+# Cluster + JetStream meta-group health (needs the `nats` CLI on a node):
+ssh root@magnus 'nats --server nats://127.0.0.1:4250 server report jetstream'
+ssh root@magnus 'nats --server nats://127.0.0.1:4250 server list'   # 3 servers, routes up
+```
+
+A healthy cluster shows 3 routed servers and a JetStream meta-group with a leader.
+
+## Scale to R3 (HUMAN — real HA)
+
+Once all three nodes are up and routed, raise the replication factor of every
+control-plane stream from 1 to 3 IN PLACE (no data loss), then flip `KV_REPLICAS=3`
+in `nodes.env` so future (re)deploys keep it:
+
+```bash
+for s in KV_UNIBUS_users KV_UNIBUS_rooms KV_UNIBUS_members KV_UNIBUS_room_keys \
+         KV_UNIBUS_rooms_by_member KV_UNIBUS_nonces; do
+  ssh root@magnus "nats --server nats://127.0.0.1:4250 stream update $s --replicas 3 -f"
+done
+# (also OBJ_UNIBUS_blobs if the object store is in use)
+```
+
+Until this is done, R1 means the seed node is a **single point of failure for
+authentication**: if it dies, the nonce/KV control plane is unreachable and every
+authenticated request fails closed (auth DoS). R1 is a rollout step, not HA.
+
+## Chaos test (HUMAN — requires the 3 live VPS; NOT run here)
+
+Validate quorum tolerance after R3:
+
+```bash
+# Kill one node; the cluster keeps serving (quorum 2/3).
+ssh root@dd 'systemctl stop membershipd-cluster'    # dd = SSH alias of datardos
+#   -> clients fail over (multiple seed URLs); reads/writes still succeed.
+ssh root@dd 'systemctl start membershipd-cluster'   # rejoins, catches up
+
+# Kill two nodes; quorum is LOST — the control plane should fail CLOSED (deny),
+# never fail open. Verify a request is rejected, not silently served.
+```
+
+This network-level chaos test (kill 1/3, kill 2/3, partition/split-brain) is part
+of the deploy validation (issue 0003f) and runs against the real VPS — it is
+deliberately out of scope for the authoring agent.
+
+## Rollback
+
+`membershipd` does not delete data. To revert a node to standalone SQLite, stop
+the unit and start it without `--store kv`/`--cluster-name`; the KV buckets remain
+for a later retry. To rotate the cluster CA, re-run `generate-cluster-certs.sh
+--force` and re-stage (every node must get the new `cluster-ca.crt` together).
@@ -0,0 +1,126 @@
+#!/usr/bin/env bash
+#
+# deploy-cluster.sh — cross-build membershipd and stage it onto the three cluster
+# nodes (issue 0006g). DEFAULT IS DRY-RUN: it prints the plan and touches no remote
+# host (it only writes build/cluster-<name>.env locally). Pass --yes to actually
+# build, rsync + run remote commands. Steps that a HUMAN must run (or confirm) are
+# marked "HUMAN:".
+#
+# Prerequisites (HUMAN, once):
+#   1. Fill nodes.env (no <PLACEHOLDER> left).
+#   2. ./generate-cluster-certs.sh   (mints out/<name>/ TLS material)
+#   3. Create the route secret locally:  mkdir -p secrets && openssl rand -hex 32 > secrets/cluster.pass
+#      (secrets/ is gitignored; it is rsynced to each node as cluster.pass)
+#   4. SSH access to every node's SSH_HOST with sudo-less root (SSH_USER=root).
+#
+# What it does per node (with --yes):
+#   - rsync the membershipd binary, the node's TLS material, the unit, the
+#     generated cluster.env and the route secret into REMOTE_DIR.
+#   - install + daemon-reload the systemd unit.
+# Start is STAGGERED and left to the human (see README): seed the admin on the
+# seed node (loopback bootstrap), start the seed node, then start the rest.
+set -euo pipefail
+
+DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+cd "$DIR"
+
+# shellcheck source=/dev/null
+source ./nodes.env
+
+APPLY=0
+[[ "${1:-}" == "--yes" ]] && APPLY=1
+
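+# Match placeholders only outside comments: the comments in nodes.env mention
+# <PLACEHOLDER> and <SSH_HOST> literally and would otherwise always trip the check.
+if grep -Eq '^[^#]*<[A-Z_]+>' nodes.env; then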
+  echo "ERROR: nodes.env still has <PLACEHOLDER> values — fill them in first." >&2
+  exit 2
+fi
+
+SECRET_FILE="secrets/cluster.pass"
+if [[ ! -f "$SECRET_FILE" ]]; then
+  echo "ERROR: $SECRET_FILE missing. HUMAN: mkdir -p secrets && openssl rand -hex 32 > $SECRET_FILE" >&2
+  exit 2
+fi
+
+run() {
+  # Echo every action; only execute it under --yes.
+  echo "  + $*"
+  if [[ $APPLY -eq 1 ]]; then
+    "$@"
+  fi
+}
+
+echo "==> [1/3] cross-build membershipd (linux/amd64, CGO disabled)"
+run env CGO_ENABLED=0 GOOS=linux GOARCH=amd64 go build -o build/membershipd ../../cmd/membershipd
+
+# Build the comma-separated route list for a node = the OTHER nodes' addresses on
+# the chosen network, with NO userinfo (the secret is injected by membershipd from
+# the file). Prints the list on stdout.
+routes_for() {
+  local self="$1" out=""
+  local row name _ssh pub wg addr
+  for row in "${CLUSTER_NODES[@]}"; do
+    read -r name _ssh pub wg <<<"$row"
+    [[ "$name" == "$self" ]] && continue
+    if [[ "$ROUTE_NETWORK" == "public" ]]; then addr="$pub"; else addr="$wg"; fi
+    out+="nats://${addr}:${NATS_ROUTE_PORT},"
+  done
+  echo "${out%,}"
+}
+
+echo "==> [2/3] stage each node (REMOTE_DIR=$REMOTE_DIR)"
+for row in "${CLUSTER_NODES[@]}"; do
+  read -r name ssh _pub _wg <<<"$row"
+  target="${SSH_USER}@${ssh}"
+  nodedir="out/${name}"
+  if [[ ! -d "$nodedir" ]]; then
+    echo "ERROR: $nodedir missing — run ./generate-cluster-certs.sh first." >&2
+    exit 2
+  fi
+  routes="$(routes_for "$name")"
+
+  echo "-- node ${name} (ssh ${ssh}) routes=${routes}"
+
+  # Generate this node's cluster.env locally, then ship it.
+  envfile="build/cluster-${name}.env"
+  mkdir -p build
+  cat > "$envfile" <<EOF
+NODE_NAME=${name}
+CLUSTER_NAME=${CLUSTER_NAME}
+CLUSTER_USER=${CLUSTER_USER}
+KV_REPLICAS=${KV_REPLICAS}
+HTTP_PORT=${HTTP_PORT}
+NATS_CLIENT_PORT=${NATS_CLIENT_PORT}
+NATS_ROUTE_PORT=${NATS_ROUTE_PORT}
+ROUTES=${routes}
+CLUSTER_PASS_FILE=${REMOTE_DIR}/secrets/cluster.pass
+TLS_CERT=${REMOTE_DIR}/tls/server-${name}.crt
+TLS_KEY=${REMOTE_DIR}/tls/server-${name}.key
+ROUTE_TLS_CERT=${REMOTE_DIR}/tls/route-${name}.crt
+ROUTE_TLS_KEY=${REMOTE_DIR}/tls/route-${name}.key
+ROUTE_TLS_CA=${REMOTE_DIR}/tls/cluster-ca.crt
+EOF
+
+  run ssh "$target" "mkdir -p ${REMOTE_DIR}/tls ${REMOTE_DIR}/secrets"
+  run rsync -az build/membershipd "${target}:${REMOTE_DIR}/membershipd"
+  run rsync -az "${nodedir}/" "${target}:${REMOTE_DIR}/tls/"
+  run rsync -az "$SECRET_FILE" "${target}:${REMOTE_DIR}/secrets/cluster.pass"
+  run rsync -az "$envfile" "${target}:${REMOTE_DIR}/cluster.env"
+  run rsync -az membershipd-cluster.service "${target}:/etc/systemd/system/membershipd-cluster.service"
+  run ssh "$target" "chmod 600 ${REMOTE_DIR}/secrets/cluster.pass ${REMOTE_DIR}/tls/*.key && systemctl daemon-reload"
+done
+
+echo "==> [3/3] staged."
+if [[ $APPLY -eq 0 ]]; then
+  echo "    DRY-RUN: nothing was sent. Re-run with --yes to apply."
+fi
+cat <<'NEXT'
+
+HUMAN: staggered start (do NOT enable all at once; see README "Bring up"):
+  1. Seed the first admin on the seed node BEFORE starting it (README "Seed the
+     first admin into the KV": loopback bootstrap + migrate-to-kv).
+  2. Start the seed node (e.g. magnus):
+       ssh root@magnus 'systemctl enable --now membershipd-cluster'
+  3. Then the other two, one at a time, checking quorum after each
+     (datardos is reached through its SSH alias dd):
+       ssh root@homer 'systemctl enable --now membershipd-cluster'
+       ssh root@dd    'systemctl enable --now membershipd-cluster'
+  4. Verify posture + quorum (README "Verify").
+  5. Scale replicas 1 -> 3 once all three are up (README "Scale to R3").
+NEXT
@@ -0,0 +1,120 @@
+#!/usr/bin/env bash
+#
+# generate-cluster-certs.sh — mint the TLS material for a unibus 3-node cluster
+# (issue 0006g). Run ONCE on a trusted machine (e.g. om, which custodies the bus
+# CA); distribute the per-node output to each node over a secure channel. This
+# script touches NO remote host.
+#
+# It produces two trust roots, kept SEPARATE on purpose (audit 0008 N1-low):
+#
+#   1. The CLUSTER route CA (cluster-ca.crt/key, generated here): signs each
+#      node's ROUTE certificate. The route layer authenticates NODES, not bus
+#      users, so it must NOT share the client data-plane CA — a client cert can
+#      then never be presented to the route port.
+#   2. The CLIENT data-plane CA (../tls/ca.crt/key, the one clients pin): signs
+#      each node's DATA-PLANE server certificate. Reused, not regenerated, so
+#      existing clients keep trusting the bus.
+#
+# Per node it emits, under out/<name>/:
+#   route-<name>.crt/key   route cert (cluster CA),  EKU server+clientAuth
+#                          (each node is BOTH server and dialer to its peers)
+#   server-<name>.crt/key  data-plane cert (client CA), EKU serverAuth
+#   cluster-ca.crt         the route CA cert (for --route-tls-ca)
+#   ca.crt                 the client CA cert (for clients / control-plane TLS)
+#
+# SANs per node = its public IP + its WireGuard IP + its hostname + localhost.
+#
+# Key material: EC P-256 (Go crypto/tls + nats-server friendly), matching
+# ../tls/generate-certs.sh.
+set -euo pipefail
+
+DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+cd "$DIR"
+
+# shellcheck source=/dev/null
+source ./nodes.env
+
+# Refuse to run while any placeholder remains (HUMAN must fill nodes.env first).
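+# Only lines outside comments count: the comments in nodes.env mention
+# <PLACEHOLDER> and <SSH_HOST> literally and would otherwise always match.
+if grep -Eq '^[^#]*<[A-Z_]+>' nodes.env; then
+  echo "ERROR: nodes.env still has <PLACEHOLDER> values; fill them in first." >&2
+  grep -En '^[^#]*<[A-Z_]+>' nodes.env >&2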
+  exit 2
+fi
+
+CLIENT_CA_CRT="../tls/ca.crt"
+CLIENT_CA_KEY="../tls/ca.key"
+if [[ ! -f "$CLIENT_CA_CRT" || ! -f "$CLIENT_CA_KEY" ]]; then
+  echo "ERROR: client data-plane CA not found at ../tls/ca.{crt,key}." >&2
+  echo "       Run ../tls/generate-certs.sh first (it mints the client CA)." >&2
+  exit 2
+fi
+
+DAYS_CA=3650
+DAYS_CRT=825
+
+force=0
+[[ "${1:-}" == "--force" ]] && force=1
+
+# --- cluster route CA (separate trust root) ---
+if [[ ! -f cluster-ca.crt || ! -f cluster-ca.key || $force -eq 1 ]]; then
+  echo "==> generating cluster route CA (separate from the client CA)"
+  openssl ecparam -name prime256v1 -genkey -noout -out cluster-ca.key
+  chmod 600 cluster-ca.key
+  openssl req -x509 -new -key cluster-ca.key -sha256 -days "$DAYS_CA" \
+    -subj "/CN=unibus-cluster-ca" -out cluster-ca.crt
+else
+  echo "==> reusing existing cluster route CA (pass --force to regenerate)"
+fi
+
+# mint <out_key> <out_crt> <subject_cn> <san> <eku> <ca_crt> <ca_key>
+mint_cert() {
+  local out_key="$1" out_crt="$2" cn="$3" san="$4" eku="$5" ca_crt="$6" ca_key="$7"
+  local csr ext
+  csr="$(mktemp)"
+  ext="$(mktemp)"
+  openssl ecparam -name prime256v1 -genkey -noout -out "$out_key"
+  chmod 600 "$out_key"
+  openssl req -new -key "$out_key" -subj "/CN=${cn}" -out "$csr"
+  cat > "$ext" <<EOF
+subjectAltName=${san}
+extendedKeyUsage=${eku}
+keyUsage=digitalSignature,keyEncipherment
+EOF
+  openssl x509 -req -in "$csr" -CA "$ca_crt" -CAkey "$ca_key" -CAcreateserial \
+    -sha256 -days "$DAYS_CRT" -extfile "$ext" -out "$out_crt"
+  rm -f "$csr" "$ext"
+}
+
+for row in "${CLUSTER_NODES[@]}"; do
+  read -r name _ssh pub wg <<<"$row"
+  echo "==> node ${name}: SAN IP:${pub}, IP:${wg}, DNS:${name}, localhost, 127.0.0.1"
+  nodedir="out/${name}"
+  mkdir -p "$nodedir"
+  san="IP:${pub},IP:${wg},DNS:${name},DNS:localhost,IP:127.0.0.1"
+
+  # Route cert: signed by the cluster CA; server+client auth (mutual routes).
+  mint_cert "${nodedir}/route-${name}.key" "${nodedir}/route-${name}.crt" \
+    "unibus-route-${name}" "$san" "serverAuth,clientAuth" \
+    cluster-ca.crt cluster-ca.key
+
+  # Data-plane server cert: signed by the client CA; serverAuth only.
+  mint_cert "${nodedir}/server-${name}.key" "${nodedir}/server-${name}.crt" \
+    "unibus-${name}" "$san" "serverAuth" \
+    "$CLIENT_CA_CRT" "$CLIENT_CA_KEY"
+
+  # Co-locate the two CA certs each node needs.
+  cp cluster-ca.crt "${nodedir}/cluster-ca.crt"
+  cp "$CLIENT_CA_CRT" "${nodedir}/ca.crt"
+done
+
+rm -f cluster-ca.srl ../tls/ca.srl 2>/dev/null || true
+
+echo
+echo "==> done. Per-node material under out/<name>/ (KEYS ARE SECRET — never git):"
+for row in "${CLUSTER_NODES[@]}"; do
+  read -r name _rest <<<"$row"
+  echo "    out/${name}/  (route-${name}.*, server-${name}.*, cluster-ca.crt, ca.crt)"
+done
+echo
+echo "verify a SAN with:"
+echo "    openssl x509 -in out/<name>/server-<name>.crt -noout -text | grep -A1 'Subject Alternative Name'"
@@ -0,0 +1,45 @@
+[Unit]
+# unibus membershipd — cluster node (issue 0006g).
+#
+# One unit, parameterized per node by /opt/unibus/cluster.env (generated by
+# deploy-cluster.sh): NODE_NAME, ROUTES and the cert paths differ per node, the
+# rest of the posture (enforce + per-subject ACL + TLS + --store kv) is identical
+# on every node, which is the homogeneous posture a secure cluster requires
+# (audit 0008 N1).
+Description=unibus membershipd (cluster node)
+After=network-online.target
+Wants=network-online.target
+
+[Service]
+Type=simple
+WorkingDirectory=/opt/unibus
+EnvironmentFile=/opt/unibus/cluster.env
+# The route password comes from a FILE referenced by ${CLUSTER_PASS_FILE}, never
+# from argv (audit 0008 N1-low). The peer --routes carry no userinfo; membershipd
+# injects the credentials from the file/user.
+ExecStart=/opt/unibus/membershipd \
+  --bind 0.0.0.0 \
+  --bus-auth enforce \
+  --http-port ${HTTP_PORT} \
+  --nats-port ${NATS_CLIENT_PORT} \
+  --tls-cert ${TLS_CERT} \
+  --tls-key ${TLS_KEY} \
+  --cluster-name ${CLUSTER_NAME} \
+  --server-name ${NODE_NAME} \
+  --cluster-port ${NATS_ROUTE_PORT} \
+  --routes ${ROUTES} \
+  --cluster-user ${CLUSTER_USER} \
+  --cluster-pass-file ${CLUSTER_PASS_FILE} \
+  --route-tls-cert ${ROUTE_TLS_CERT} \
+  --route-tls-key ${ROUTE_TLS_KEY} \
+  --route-tls-ca ${ROUTE_TLS_CA} \
+  --store kv \
+  --kv-replicas ${KV_REPLICAS}
+# Restart=always (NOT on-failure): a clean SIGTERM exits success, and on-failure
+# would then NOT restart, leaving the node silently dead (see function_tags.md).
+Restart=always
+RestartSec=2
+LimitNOFILE=65536
+
+[Install]
+WantedBy=multi-user.target
@@ -0,0 +1,44 @@
+# Cluster topology for the unibus 3-node deployment (issue 0006g).
+#
+# This file is SOURCED by generate-cluster-certs.sh and deploy-cluster.sh.
+#
+# HUMAN: fill in every <PLACEHOLDER> with the real value before running the
+# scripts. The public IPs known at authoring time are pre-filled; the WireGuard
+# mesh IPs and magnus's public IP must be supplied. The scripts refuse to run
+# while any <PLACEHOLDER> remains.
+
+# Cluster identity (must be identical on every node).
+CLUSTER_NAME="unibus"
+# Route-secret username; the password is NOT here — it lives in a file (see
+# CLUSTER_PASS_FILE in deploy-cluster.sh) so it never lands in argv or git.
+CLUSTER_USER="unibus-cluster"
+
+# KV/nonce replication factor. START AT 1 for the initial 1->3 rollout, then raise
+# to 3 IN PLACE (see README "Scale to R3") once all three nodes have joined. Only
+# set this to 3 here after the third node is up and the streams have been updated to R3.
+KV_REPLICAS=1
+
+# Ports (same on every node; the route port is server-to-server only).
+NATS_CLIENT_PORT=4250
+NATS_ROUTE_PORT=6250
+HTTP_PORT=8470
+
+# Remote install layout and SSH login user.
+REMOTE_DIR="/opt/unibus"
+SSH_USER="root"
+
+# Which network the inter-node routes use. "wg" builds --routes from the
+# WireGuard mesh IPs (private server-to-server links, preferred); "public" uses
+# the public IPs. The route layer is always mutual-TLS regardless.
+ROUTE_NETWORK="wg"
+
+# One row per node: NAME  SSH_HOST  PUBLIC_IP  WG_IP
+#   NAME      -> --server-name and the per-node cert filenames (unique).
+#   SSH_HOST  -> the `ssh <SSH_HOST>` alias (see ~/.ssh/config).
+#   PUBLIC_IP -> public address; goes in the cert SANs (client-facing data plane).
+#   WG_IP     -> WireGuard mesh address; cert SAN + route target when ROUTE_NETWORK=wg.
+CLUSTER_NODES=(
+  "magnus    magnus  <MAGNUS_PUBLIC_IP>  <MAGNUS_WG_IP>"
+  "homer     homer   141.94.69.66        <HOMER_WG_IP>"
+  "datardos  dd      51.91.100.142       <DATARDOS_WG_IP>"
+)