Merge issue/0006g-deploy: cluster deploy material (magnus+homer+datardos, R3 HA)
@@ -0,0 +1,7 @@
# Generated TLS material and secrets: NEVER commit (audit 0008: keys/secret).
out/
build/
secrets/
*.key
*.srl
cluster-ca.crt
@@ -0,0 +1,181 @@
# unibus cluster: 3-node deploy runbook (issue 0006g)

This directory holds the material to bring up unibus as a **3-node cluster**
(`magnus` + `homer` + `datardos`) for real HA: with **R3** replication the control
plane (rooms/members/keys/users on JetStream KV + the anti-replay nonce bucket)
survives the loss of any one node (quorum 2/3).

> **The agent that authored this never touched a VPS.** Every step that changes a
> remote host is marked **HUMAN** and is executed by the operator. `deploy-cluster.sh`
> defaults to a dry run.

## Files

| File | What it is |
|---|---|
| `nodes.env` | Topology: cluster name, ports, and the per-node rows (name, ssh host, public IP, WG IP). **HUMAN fills the placeholders.** |
| `generate-cluster-certs.sh` | Mints a **separate cluster route CA** + a route cert per node, and a data-plane server cert per node signed by the **client CA** (`../tls/ca.*`). |
| `membershipd-cluster.service` | One systemd unit, parameterized per node by `/opt/unibus/cluster.env`. enforce + per-subject ACL + TLS + `--store kv`, `Restart=always`. |
| `deploy-cluster.sh` | Cross-builds the linux binary, generates each node's `cluster.env`, and (with `--yes`) rsyncs everything + installs the unit. Staggered start is manual. |

Generated keys/secrets (`out/`, `build/`, `secrets/`) are **gitignored**: they are
secret and never leave the operator's trusted machine except over the secure
rsync channel.

## Topology

| Node | SSH host | Public IP | WireGuard IP | Role |
|---|---|---|---|---|
| magnus | `magnus` | `<MAGNUS_PUBLIC_IP>` | `<MAGNUS_WG_IP>` | seed (first up) |
| homer | `homer` | `141.94.69.66` | `<HOMER_WG_IP>` | replica |
| datardos | `dd` | `51.91.100.142` | `<DATARDOS_WG_IP>` (10.21.0.x) | replica |

The route layer (server-to-server) prefers the **WireGuard mesh**
(`ROUTE_NETWORK=wg`); the client data plane and the HTTP control plane are reached
over the public IPs. The route CA is **separate** from the client CA, so a cert
issued by the client CA never validates on the route port.

## Prerequisites (HUMAN, once)

1. **Fill `nodes.env`**: replace every `<PLACEHOLDER>` (magnus public IP, all WG
   IPs). The scripts refuse to run while any remain.
2. **Client CA exists**: `../tls/ca.crt` + `../tls/ca.key`. If not, run
   `../tls/generate-certs.sh` on the CA host (om) first. The cluster reuses this CA
   for the data plane so existing clients keep trusting the bus.
3. **Mint cluster TLS**:
   ```bash
   ./generate-cluster-certs.sh   # writes out/<name>/ ; --force to rotate the cluster CA
   ```
4. **Create the route secret** (kept in a file so it never appears in argv; shared by all nodes):
   ```bash
   mkdir -p secrets && openssl rand -hex 32 > secrets/cluster.pass
   ```
5. **Check SSH access**: logging in as `root` on each node's SSH host must work
   (`ssh magnus true`, `ssh dd true`, ...).

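The placeholder guard from step 1 can be approximated by hand before running the scripts. The sketch below is a hypothetical helper, not part of this directory: the name `unfilled_placeholders` is invented here, and it assumes that placeholders always look like `<UPPER_CASE>` and that comment lines start with `#`. The sample file it writes is throwaway data with made-up addresses.

```shell
#!/usr/bin/env bash
# Hypothetical helper (an assumption, not shipped with the repo): list the
# <PLACEHOLDER> tokens still present on non-comment lines of a topology file.
unfilled_placeholders() {
  local file="$1"
  # Drop comment lines first, then print each remaining <TOKEN> once.
  grep -Ev '^[[:space:]]*#' "$file" | grep -Eo '<[A-Z_]+>' | sort -u
}

# Demo on a throwaway sample; the addresses are illustrative only.
sample="$(mktemp)"
cat > "$sample" <<'EOF'
# HUMAN: fill in every <PLACEHOLDER>
CLUSTER_NODES=(
  "homer homer 192.0.2.11 <HOMER_WG_IP>"
)
EOF
unfilled_placeholders "$sample"
rm -f "$sample"
```

Run against the real `nodes.env`, an empty result would suggest that every row has been filled in; the comment lines that mention `<PLACEHOLDER>` literally are ignored.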
## Stage the nodes

```bash
./deploy-cluster.sh          # DRY RUN: prints the full plan, changes no remote host
./deploy-cluster.sh --yes    # HUMAN: actually rsync + install the unit on all 3 nodes
```

This cross-builds `membershipd` (linux/amd64, `CGO_ENABLED=0`), writes each node's
`cluster.env` (its `NODE_NAME` and the `--routes` to the OTHER two nodes), and
ships the binary, the node's TLS material, the secret, the env file and the unit.
It does **not** start anything.

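To make "the `--routes` to the OTHER two nodes" concrete, here is a minimal worked example of how a per-node route list can be derived from the topology rows. It mirrors the shape of the logic in `deploy-cluster.sh`, but every address below is a made-up sample (documentation ranges), not the real topology.

```shell
#!/usr/bin/env bash
# Sample topology rows: NAME SSH_HOST PUBLIC_IP WG_IP (all addresses are illustrative).
CLUSTER_NODES=(
  "magnus magnus 192.0.2.10 10.21.0.1"
  "homer homer 192.0.2.11 10.21.0.2"
  "datardos dd 192.0.2.12 10.21.0.3"
)
ROUTE_NETWORK="wg"
NATS_ROUTE_PORT=6250

# Print the comma-separated route URLs for every node except <self>.
routes_for() {
  local self="$1" out="" row name _ssh pub wg addr
  for row in "${CLUSTER_NODES[@]}"; do
    read -r name _ssh pub wg <<<"$row"
    if [[ "$name" == "$self" ]]; then
      continue
    fi
    if [[ "$ROUTE_NETWORK" == "public" ]]; then
      addr="$pub"
    else
      addr="$wg"
    fi
    out+="nats://${addr}:${NATS_ROUTE_PORT},"
  done
  echo "${out%,}"   # strip the trailing comma
}

routes_for homer   # prints nats://10.21.0.1:6250,nats://10.21.0.3:6250
```

The URLs carry no userinfo; per the unit file, the route credentials come from `--cluster-user` and `--cluster-pass-file` instead.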
## Seed the first admin into the KV (HUMAN, loopback bootstrap)

The empty KV control plane has no users, and under `enforce` no external tool can
write the FIRST admin over NATS (it would need to be an admin already, a
chicken-and-egg problem). The `user` CLI also writes only to a local SQLite file, not the
KV. So the first admin is seeded on the seed node through a **loopback, no-auth
bootstrap** that populates the same JetStream store the cluster unit then reuses:

```bash
ssh root@magnus 'bash -s' <<'SEED'
set -euo pipefail
cd /opt/unibus
# a) Put the first admin into a local SQLite seed file.
./membershipd user add --db ./seed.db --handle root --sign-pub <ADMIN_SIGN_PUB_HEX> --role admin
# b) Bring up a TEMPORARY loopback, no-auth, single-node KV server on the cluster's
#    own JetStream store dir (not exposed; bus-auth off is allowed on 127.0.0.1).
./membershipd --store kv --bus-auth off --bind 127.0.0.1 \
  --nats-store ./local_files/jetstream --db ./seed.db >/tmp/seed-boot.log 2>&1 &
BOOT=$!; sleep 2
# c) Migrate the admin from SQLite into the replicated KV (loopback, so no --ca needed).
./membershipd migrate-to-kv --db ./seed.db --nats-url nats://127.0.0.1:4250 --replicas 1
# d) Stop the bootstrap server. The KV buckets persist in ./local_files/jetstream.
kill "$BOOT"; wait "$BOOT" 2>/dev/null || true
rm -f ./seed.db
SEED
```

> The KV written here lives in `./local_files/jetstream`, which the cluster unit
> reuses (`--nats-store` default), so the admin is present when the enforce cluster
> starts. Additional users are added the same loopback way until a
> `user add --store kv` exists (see GAP in report 0009).

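The seed script above waits a fixed `sleep 2` for the temporary server to come up, which can be too short on a slow host. One possible alternative is a small polling helper; the sketch below is an assumption-laden illustration (`wait_until` is a hypothetical name, not something `membershipd` or the deploy scripts provide).

```shell
#!/usr/bin/env bash
# Hypothetical helper: re-run a command once per second until it succeeds,
# giving up after <timeout_seconds>. Returns 0 on success, 1 on timeout.
wait_until() {
  local timeout="$1"; shift
  local deadline=$(( SECONDS + timeout ))
  until "$@"; do
    if (( SECONDS >= deadline )); then
      return 1
    fi
    sleep 1
  done
}

# Trivial demonstration: `true` succeeds at once, `false` times out.
wait_until 1 true  && echo "ready"
wait_until 1 false || echo "timed out"
```

In the seed script, the `sleep 2` could then become something like `wait_until 15 bash -c 'exec 3<>/dev/tcp/127.0.0.1/4250'`, assuming bash's `/dev/tcp` redirection is available on the node and that the bootstrap server listens on port 4250 as the `migrate-to-kv` URL suggests.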
## Bring up (HUMAN, staggered)

Bring up the seed first, then the replicas one at a time, checking that each one joins.

```bash
# 1. Seed node (after the seed step above).
ssh root@magnus 'systemctl enable --now membershipd-cluster'
ssh root@magnus 'curl -fsS https://127.0.0.1:8470/healthz --cacert /opt/unibus/tls/ca.crt'

# 2. Replicas, one at a time (datardos is reached through its SSH host, dd).
ssh root@homer 'systemctl enable --now membershipd-cluster'
ssh root@dd 'systemctl enable --now membershipd-cluster'
```

> Initial rollout runs at **R1** (`KV_REPLICAS=1` in `nodes.env`): the buckets live
> on the seed only. This is NOT HA yet; see "Scale to R3".

## Promote an existing single-node (SQLite) deployment (HUMAN, optional)

Instead of seeding fresh, you can migrate an existing single-node `unibus.db` into
the KV, **loopback only** (the allowlist would otherwise travel cleartext; the
command refuses a remote target without `--ca`). Use the same loopback-bootstrap
shape as the seed step (temporary `--bus-auth off` server on 127.0.0.1, then
`migrate-to-kv --db /opt/unibus/local_files/unibus.db`).

## Verify

```bash
# Posture on every node: all must be enforce+acl+tls+cluster, store=kv.
# The loop iterates over SSH hosts, so datardos appears as dd.
for h in magnus homer dd; do
  echo "== $h =="
  ssh "root@$h" 'curl -fsS https://127.0.0.1:8470/healthz --cacert /opt/unibus/tls/ca.crt'
done

# Cluster + JetStream meta-group health (needs the `nats` CLI on a node):
ssh root@magnus 'nats --server nats://127.0.0.1:4250 server report jetstream'
ssh root@magnus 'nats --server nats://127.0.0.1:4250 server list'   # 3 servers, routes up
```

A healthy cluster shows 3 routed servers and a JetStream meta-group with a leader.

## Scale to R3 (HUMAN, real HA)

Once all three nodes are up and routed, raise the replication factor of every
control-plane stream from 1 to 3 IN PLACE (no data loss), then flip `KV_REPLICAS=3`
in `nodes.env` so future (re)deploys keep it:

```bash
for s in KV_UNIBUS_users KV_UNIBUS_rooms KV_UNIBUS_members KV_UNIBUS_room_keys \
         KV_UNIBUS_rooms_by_member KV_UNIBUS_nonces; do
  ssh root@magnus "nats --server nats://127.0.0.1:4250 stream update $s --replicas 3 -f"
done
# (also OBJ_UNIBUS_blobs if the object store is in use)
```

Until this is done, R1 means the seed node is a **single point of failure for
authentication**: if it dies, the nonce/KV control plane is unreachable and every
authenticated request fails closed (auth DoS). R1 is a rollout step, not HA.

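The difference between R1 and R3 comes down to majority-quorum arithmetic: a replica group of size n needs floor(n/2) + 1 members to make progress, so it tolerates n minus that many failures. A small worked example (the function names are illustrative only):

```shell
#!/usr/bin/env bash
# Majority quorum for a replica group of size n: floor(n/2) + 1.
quorum() {
  echo $(( $1 / 2 + 1 ))
}

# Number of replicas that can be lost while a majority still exists.
tolerated_failures() {
  echo $(( $1 - ($1 / 2 + 1) ))
}

for r in 1 3; do
  echo "R$r: quorum=$(quorum "$r") tolerated_failures=$(tolerated_failures "$r")"
done
# prints:
#   R1: quorum=1 tolerated_failures=0
#   R3: quorum=2 tolerated_failures=1
```

This is why R1 leaves the seed as a single point of failure, while R3 matches the "quorum 2/3" figure used throughout this runbook.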
## Chaos test (HUMAN, requires the 3 live VPS; NOT run here)

Validate quorum tolerance after R3:

```bash
# Kill one node; the cluster keeps serving (quorum 2/3).
ssh root@dd 'systemctl stop membershipd-cluster'    # datardos, via its SSH host dd
# -> clients fail over (multiple seed URLs); reads/writes still succeed.
ssh root@dd 'systemctl start membershipd-cluster'   # rejoins, catches up

# Kill two nodes; quorum is LOST. The control plane should fail CLOSED (deny),
# never fail open. Verify a request is rejected, not silently served.
```

This network-level chaos test (kill 1/3, kill 2/3, partition/split-brain) is part
of the deploy validation (issue 0003f) and runs against the real VPS; it is
deliberately out of scope for the authoring agent.

## Rollback

`membershipd` does not delete data. To revert a node to standalone SQLite, stop
the unit and start it without `--store kv`/`--cluster-name`; the KV buckets remain
for a later retry. To rotate the cluster CA, re-run
`generate-cluster-certs.sh --force` and re-stage (every node must get the new
`cluster-ca.crt` together).
Executable
+126
@@ -0,0 +1,126 @@
#!/usr/bin/env bash
#
# deploy-cluster.sh: cross-build membershipd and stage it onto the three cluster
# nodes (issue 0006g). DEFAULT IS DRY-RUN: it prints the plan and changes no remote host.
# Pass --yes to actually rsync + run remote commands. Steps that a HUMAN must run
# (or confirm) are marked "HUMAN:".
#
# Prerequisites (HUMAN, once):
#   1. Fill nodes.env (no <PLACEHOLDER> left).
#   2. ./generate-cluster-certs.sh   (mints out/<name>/ TLS material)
#   3. Create the route secret locally: mkdir -p secrets && openssl rand -hex 32 > secrets/cluster.pass
#      (secrets/ is gitignored; it is rsynced to each node as cluster.pass)
#   4. SSH access to every node's SSH_HOST as root directly, without sudo (SSH_USER=root).
#
# What it does per node (with --yes):
#   - rsync the membershipd binary, the node's TLS material, the unit, the
#     generated cluster.env and the route secret into REMOTE_DIR.
#   - install + daemon-reload the systemd unit.
# Start is STAGGERED and left to the human (see README): seed the admin on the
# seed node (loopback bootstrap), start the seed node, then start the rest.
set -euo pipefail

DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
cd "$DIR"

# shellcheck source=/dev/null
source ./nodes.env

APPLY=0
[[ "${1:-}" == "--yes" ]] && APPLY=1

# Check non-comment lines only: the comments in nodes.env mention <PLACEHOLDER>
# literally, so an unanchored match would make this guard fire forever.
if grep -Eq '^[^#]*<[A-Z_]+>' nodes.env; then
  echo "ERROR: nodes.env still has <PLACEHOLDER> values; fill them in first." >&2
  exit 2
fi

SECRET_FILE="secrets/cluster.pass"
if [[ ! -f "$SECRET_FILE" ]]; then
  echo "ERROR: $SECRET_FILE missing. HUMAN: mkdir -p secrets && openssl rand -hex 32 > $SECRET_FILE" >&2
  exit 2
fi

run() {
  # Echo every action; only execute it under --yes.
  echo "  + $*"
  if [[ $APPLY -eq 1 ]]; then
    "$@"
  fi
}

echo "==> [1/3] cross-build membershipd (linux/amd64, CGO disabled)"
run env CGO_ENABLED=0 GOOS=linux GOARCH=amd64 go build -o build/membershipd ../../cmd/membershipd

# Build the comma-separated route list for a node = the OTHER nodes' addresses on
# the chosen network, with NO userinfo (the secret is injected by membershipd from
# the file). Prints only the value on stdout.
routes_for() {
  local self="$1" out=""
  local row name _ssh pub wg addr
  for row in "${CLUSTER_NODES[@]}"; do
    read -r name _ssh pub wg <<<"$row"
    [[ "$name" == "$self" ]] && continue
    if [[ "$ROUTE_NETWORK" == "public" ]]; then addr="$pub"; else addr="$wg"; fi
    out+="nats://${addr}:${NATS_ROUTE_PORT},"
  done
  echo "${out%,}"
}

echo "==> [2/3] stage each node (REMOTE_DIR=$REMOTE_DIR)"
for row in "${CLUSTER_NODES[@]}"; do
  read -r name ssh_host _pub _wg <<<"$row"
  target="${SSH_USER}@${ssh_host}"
  nodedir="out/${name}"
  if [[ ! -d "$nodedir" ]]; then
    echo "ERROR: $nodedir missing; run ./generate-cluster-certs.sh first." >&2
    exit 2
  fi
  routes="$(routes_for "$name")"

  echo "-- node ${name} (ssh ${ssh_host}) routes=${routes}"

  # Generate this node's cluster.env locally, then ship it.
  envfile="build/cluster-${name}.env"
  mkdir -p build
  cat > "$envfile" <<EOF
NODE_NAME=${name}
CLUSTER_NAME=${CLUSTER_NAME}
CLUSTER_USER=${CLUSTER_USER}
KV_REPLICAS=${KV_REPLICAS}
HTTP_PORT=${HTTP_PORT}
NATS_CLIENT_PORT=${NATS_CLIENT_PORT}
NATS_ROUTE_PORT=${NATS_ROUTE_PORT}
ROUTES=${routes}
CLUSTER_PASS_FILE=${REMOTE_DIR}/secrets/cluster.pass
TLS_CERT=${REMOTE_DIR}/tls/server-${name}.crt
TLS_KEY=${REMOTE_DIR}/tls/server-${name}.key
ROUTE_TLS_CERT=${REMOTE_DIR}/tls/route-${name}.crt
ROUTE_TLS_KEY=${REMOTE_DIR}/tls/route-${name}.key
ROUTE_TLS_CA=${REMOTE_DIR}/tls/cluster-ca.crt
EOF

  run ssh "$target" "mkdir -p ${REMOTE_DIR}/tls ${REMOTE_DIR}/secrets"
  run rsync -az build/membershipd "${target}:${REMOTE_DIR}/membershipd"
  run rsync -az "${nodedir}/" "${target}:${REMOTE_DIR}/tls/"
  run rsync -az "$SECRET_FILE" "${target}:${REMOTE_DIR}/secrets/cluster.pass"
  run rsync -az "$envfile" "${target}:${REMOTE_DIR}/cluster.env"
  run rsync -az membershipd-cluster.service "${target}:/etc/systemd/system/membershipd-cluster.service"
  run ssh "$target" "chmod 600 ${REMOTE_DIR}/secrets/cluster.pass ${REMOTE_DIR}/tls/*.key && systemctl daemon-reload"
done

echo "==> [3/3] staged."
if [[ $APPLY -eq 0 ]]; then
  echo "    DRY-RUN: nothing was sent. Re-run with --yes to apply."
fi
cat <<'NEXT'

HUMAN: staggered start (do NOT enable all at once; see README "Bring up"):
  1. Seed the first admin on the seed node BEFORE starting it
     (README "Seed the first admin into the KV", loopback bootstrap).
  2. Start the seed node (e.g. magnus):
       ssh root@magnus 'systemctl enable --now membershipd-cluster'
  3. Then the other two, one at a time, checking quorum after each
     (datardos is reached through its SSH host, dd):
       ssh root@homer 'systemctl enable --now membershipd-cluster'
       ssh root@dd 'systemctl enable --now membershipd-cluster'
  4. Verify posture + quorum (README "Verify").
  5. Scale replicas 1 -> 3 once all three are up (README "Scale to R3").
NEXT
Executable
+120
@@ -0,0 +1,120 @@
#!/usr/bin/env bash
#
# generate-cluster-certs.sh: mint the TLS material for a unibus 3-node cluster
# (issue 0006g). Run ONCE on a trusted machine (e.g. om, which custodies the bus
# CA); distribute the per-node output to each node over a secure channel. This
# script touches NO remote host.
#
# It produces two trust roots, kept SEPARATE on purpose (audit 0008 N1-low):
#
#   1. The CLUSTER route CA (cluster-ca.crt/key, generated here): signs each
#      node's ROUTE certificate. The route layer authenticates NODES, not bus
#      users, so it must NOT share the client data-plane CA; a client cert can
#      then never be presented to the route port.
#   2. The CLIENT data-plane CA (../tls/ca.crt/key, the one clients pin): signs
#      each node's DATA-PLANE server certificate. Reused, not regenerated, so
#      existing clients keep trusting the bus.
#
# Per node it emits, under out/<name>/:
#   route-<name>.crt/key    route cert (cluster CA), EKU server+clientAuth
#                           (each node is BOTH server and dialer to its peers)
#   server-<name>.crt/key   data-plane cert (client CA), EKU serverAuth
#   cluster-ca.crt          the route CA cert (for --route-tls-ca)
#   ca.crt                  the client CA cert (for clients / control-plane TLS)
#
# SANs per node = its public IP + its WireGuard IP + its hostname + localhost.
#
# Key material: EC P-256 (Go crypto/tls + nats-server friendly), matching
# ../tls/generate-certs.sh.
set -euo pipefail

DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
cd "$DIR"

# shellcheck source=/dev/null
source ./nodes.env

# Refuse to run while any placeholder remains (HUMAN must fill nodes.env first).
# Only non-comment lines are checked: the comments in nodes.env mention
# <PLACEHOLDER> literally and would otherwise always trip this guard.
if grep -Eq '^[^#]*<[A-Z_]+>' nodes.env; then
  echo "ERROR: nodes.env still has <PLACEHOLDER> values; fill them in first." >&2
  grep -En '^[^#]*<[A-Z_]+>' nodes.env >&2
  exit 2
fi

CLIENT_CA_CRT="../tls/ca.crt"
CLIENT_CA_KEY="../tls/ca.key"
if [[ ! -f "$CLIENT_CA_CRT" || ! -f "$CLIENT_CA_KEY" ]]; then
  echo "ERROR: client data-plane CA not found at ../tls/ca.{crt,key}." >&2
  echo "       Run ../tls/generate-certs.sh first (it mints the client CA)." >&2
  exit 2
fi

DAYS_CA=3650
DAYS_CRT=825

force=0
[[ "${1:-}" == "--force" ]] && force=1

# --- cluster route CA (separate trust root) ---
if [[ ! -f cluster-ca.crt || ! -f cluster-ca.key || $force -eq 1 ]]; then
  echo "==> generating cluster route CA (separate from the client CA)"
  openssl ecparam -name prime256v1 -genkey -noout -out cluster-ca.key
  chmod 600 cluster-ca.key
  openssl req -x509 -new -key cluster-ca.key -sha256 -days "$DAYS_CA" \
    -subj "/CN=unibus-cluster-ca" -out cluster-ca.crt
else
  echo "==> reusing existing cluster route CA (pass --force to regenerate)"
fi

# mint_cert <out_key> <out_crt> <subject_cn> <san> <eku> <ca_crt> <ca_key>
mint_cert() {
  local out_key="$1" out_crt="$2" cn="$3" san="$4" eku="$5" ca_crt="$6" ca_key="$7"
  local csr ext
  csr="$(mktemp)"
  ext="$(mktemp)"
  openssl ecparam -name prime256v1 -genkey -noout -out "$out_key"
  chmod 600 "$out_key"
  openssl req -new -key "$out_key" -subj "/CN=${cn}" -out "$csr"
  cat > "$ext" <<EOF
subjectAltName=${san}
extendedKeyUsage=${eku}
keyUsage=digitalSignature,keyEncipherment
EOF
  openssl x509 -req -in "$csr" -CA "$ca_crt" -CAkey "$ca_key" -CAcreateserial \
    -sha256 -days "$DAYS_CRT" -extfile "$ext" -out "$out_crt"
  rm -f "$csr" "$ext"
}

for row in "${CLUSTER_NODES[@]}"; do
  read -r name _ssh pub wg <<<"$row"
  echo "==> node ${name}: SAN IP:${pub}, IP:${wg}, DNS:${name}, localhost, 127.0.0.1"
  nodedir="out/${name}"
  mkdir -p "$nodedir"
  san="IP:${pub},IP:${wg},DNS:${name},DNS:localhost,IP:127.0.0.1"

  # Route cert: signed by the cluster CA; server+client auth (mutual routes).
  mint_cert "${nodedir}/route-${name}.key" "${nodedir}/route-${name}.crt" \
    "unibus-route-${name}" "$san" "serverAuth,clientAuth" \
    cluster-ca.crt cluster-ca.key

  # Data-plane server cert: signed by the client CA; serverAuth only.
  mint_cert "${nodedir}/server-${name}.key" "${nodedir}/server-${name}.crt" \
    "unibus-${name}" "$san" "serverAuth" \
    "$CLIENT_CA_CRT" "$CLIENT_CA_KEY"

  # Co-locate the two CA certs each node needs.
  cp cluster-ca.crt "${nodedir}/cluster-ca.crt"
  cp "$CLIENT_CA_CRT" "${nodedir}/ca.crt"
done

rm -f cluster-ca.srl ../tls/ca.srl 2>/dev/null || true

echo
echo "==> done. Per-node material under out/<name>/ (KEYS ARE SECRET, never git):"
for row in "${CLUSTER_NODES[@]}"; do
  read -r name _rest <<<"$row"
  echo "    out/${name}/  (route-${name}.*, server-${name}.*, cluster-ca.crt, ca.crt)"
done
echo
echo "verify a SAN with:"
echo "  openssl x509 -in out/<name>/server-<name>.crt -noout -text | grep -A1 'Subject Alternative Name'"
@@ -0,0 +1,45 @@
[Unit]
# unibus membershipd: cluster node (issue 0006g).
#
# One unit, parameterized per node by /opt/unibus/cluster.env (generated by
# deploy-cluster.sh): NODE_NAME, ROUTES and the cert paths differ per node, the
# rest of the posture (enforce + per-subject ACL + TLS + --store kv) is identical
# on every node, which is the homogeneous posture a secure cluster requires
# (audit 0008 N1).
Description=unibus membershipd (cluster node)
After=network-online.target
Wants=network-online.target

[Service]
Type=simple
WorkingDirectory=/opt/unibus
EnvironmentFile=/opt/unibus/cluster.env
# The route password comes from a FILE referenced by ${CLUSTER_PASS_FILE}, never
# from argv (audit 0008 N1-low). The peer --routes carry no userinfo; membershipd
# injects the credentials from the file/user.
ExecStart=/opt/unibus/membershipd \
  --bind 0.0.0.0 \
  --bus-auth enforce \
  --http-port ${HTTP_PORT} \
  --nats-port ${NATS_CLIENT_PORT} \
  --tls-cert ${TLS_CERT} \
  --tls-key ${TLS_KEY} \
  --cluster-name ${CLUSTER_NAME} \
  --server-name ${NODE_NAME} \
  --cluster-port ${NATS_ROUTE_PORT} \
  --routes ${ROUTES} \
  --cluster-user ${CLUSTER_USER} \
  --cluster-pass-file ${CLUSTER_PASS_FILE} \
  --route-tls-cert ${ROUTE_TLS_CERT} \
  --route-tls-key ${ROUTE_TLS_KEY} \
  --route-tls-ca ${ROUTE_TLS_CA} \
  --store kv \
  --kv-replicas ${KV_REPLICAS}
# Restart=always (NOT on-failure): a clean SIGTERM exits success, and on-failure
# would then NOT restart, leaving the node silently dead (see function_tags.md).
Restart=always
RestartSec=2
LimitNOFILE=65536

[Install]
WantedBy=multi-user.target
@@ -0,0 +1,44 @@
# Cluster topology for the unibus 3-node deployment (issue 0006g).
#
# This file is SOURCED by generate-cluster-certs.sh and deploy-cluster.sh.
#
# HUMAN: fill in every <PLACEHOLDER> with the real value before running the
# scripts. The public IPs known at authoring time are pre-filled; the WireGuard
# mesh IPs and magnus's public IP must be supplied. The scripts refuse to run
# while any <PLACEHOLDER> remains.

# Cluster identity (must be identical on every node).
CLUSTER_NAME="unibus"
# Route-secret username; the password is NOT here. It lives in a file (see
# CLUSTER_PASS_FILE in deploy-cluster.sh) so it never lands in argv or git.
CLUSTER_USER="unibus-cluster"

# KV/nonce replication factor. START AT 1 for the initial 1->3 rollout, then raise
# to 3 IN PLACE (see README "Scale to R3") once all three nodes have joined. Only
# set this to 3 here after the third node is up and the stream update from that
# README section has been run.
KV_REPLICAS=1

# Ports (same on every node; the route port is server-to-server only).
NATS_CLIENT_PORT=4250
NATS_ROUTE_PORT=6250
HTTP_PORT=8470

# Remote install layout and SSH login user.
REMOTE_DIR="/opt/unibus"
SSH_USER="root"

# Which network the inter-node routes use. "wg" builds --routes from the
# WireGuard mesh IPs (private server-to-server links, preferred); "public" uses
# the public IPs. The route layer is always mutual-TLS regardless.
ROUTE_NETWORK="wg"

# One row per node:  NAME  SSH_HOST  PUBLIC_IP  WG_IP
#   NAME      -> --server-name and the per-node cert filenames (unique).
#   SSH_HOST  -> the `ssh <SSH_HOST>` alias (see ~/.ssh/config).
#   PUBLIC_IP -> public address; goes in the cert SANs (client-facing data plane).
#   WG_IP     -> WireGuard mesh address; cert SAN + route target when ROUTE_NETWORK=wg.
CLUSTER_NODES=(
  "magnus   magnus <MAGNUS_PUBLIC_IP> <MAGNUS_WG_IP>"
  "homer    homer  141.94.69.66       <HOMER_WG_IP>"
  "datardos dd     51.91.100.142      <DATARDOS_WG_IP>"
)