Bringing up the 3-node cluster from clean stores never converged: every node
looped on `open KV bucket "UNIBUS_rooms" (replicas=1): context deadline exceeded`.
Three independent defects in the clustered bootstrap path, none of which surface
on a single node (where JetStream is ready instantly), caused it:
1. embeddednats: route connection pooling (nats-server 2.10 default pool of 3)
   churned with "duplicate route"/"client closed" reconnects on the small cluster,
   interrupting the meta-group RAFT heartbeats and forcing perpetual leader
   re-elections. Set Cluster.PoolSize = -1 (single route per peer).
2. embeddednats: the cluster nodes are Docker hosts, so NATS advertised the Docker
   bridge IPs (172.x / 10.0.x) to peers, which then tried to dial those private,
   mutually unreachable addresses. Set Cluster.NoAdvertise = true so only the
   explicit public-IP routes are used. Also added a UNIBUS_NATS_DEBUG env toggle
   (off by default) that enables the embedded server's logger and loopback
   monitoring port for debugging the route/meta layer.
3. membership.OpenJetStream: a KV op is a NATS request/reply; on a cold cluster the
   op was published once, before the node had contact with the meta leader, so the
   request was dropped and the single long-context call just blocked until timeout.
   Retry each bucket op with short per-attempt contexts until it succeeds or an
   overall bootstrap budget (120s) is exhausted, so the op lands once the meta
   group settles.
With these fixes the cluster forms cleanly, creates the KV buckets, scales them
from R1 to R3 in place, and survives the loss of one node (quorum 2/3). Verified
on magnus+homer+datardos.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add high-availability cluster support to the embedded NATS server
(issue 0003a, first phase of decentralization).
pkg/embeddednats:
- ServerConfig gains ServerName (unique per node, required by JetStream
  RAFT) and an optional *ClusterConfig (cluster name, route host/port,
  peer route URLs, shared-secret Username/Password, and a mutual-TLS
  *tls.Config). applyClusterOpts maps it onto server.Options.Cluster +
  Routes. Nil Cluster keeps the legacy standalone server.
pkg/busauth:
- RouteTLSConfig builds the route layer's mutual-TLS config: the node
  presents its CA-signed certificate AND verifies the peer's certificate
  against the bus CA (RequireAndVerifyClientCert), reusing the issue-0001
  CA. Routes authenticate NODES; they never pass through the client nkey
  authenticator.
cmd/membershipd:
- Cluster flags (--cluster-name/--server-name/--cluster-port/--routes/
  --cluster-user/--cluster-pass/--route-tls-cert/-key/-ca) wire a node
  into the cluster. validateClusterConfig refuses a public cluster
  without a route secret and complete mutual route TLS, and rejects
  partial route-TLS flags (all-or-nothing). splitRoutes parses the CSV.
Tests (DoD: golden + 2 edge + error path):
- TestClusterForwardsAcrossNodes: 2-node cluster forwards a client
  subject from one node to a subscriber on the other.
- TestClusterThreeNodesForward: 3-node (HA shape) cross-node forwarding.
- TestClusterMutualTLSForwards: forwarding over mutual-TLS routes.
- TestClusterRejectsBadRouteAuth: wrong cluster password -> no route.
- TestClusterRejectsUnsignedNode: cert not signed by the bus CA -> no route.
- TestClusterConfigPolicy / TestSplitRoutes: boot-guard + CSV parsing.
Master stays green: standalone (no --cluster-name) is unchanged.
Collapses Start/StartHost/StartHostAuth onto StartServer(ServerConfig) so
auth and a TLS config can be set without growing the parameter list further.
When TLS is set the server presents the certificate and requires TLS on the
data plane; the wrappers preserve the existing no-auth/no-TLS behavior.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
busauth.NewNkeyAuthenticator verifies a client's nkey signature over the
server nonce (decoding the signature as nats-server does: raw URL-safe base64
first, then standard base64), maps the nkey to the hex form of its Ed25519 key,
and consults an injected IsAuthorized predicate.
Checking on every connection (rather than a static Options.Nkeys map) means
revoking a user denies its next connection with no restart. embeddednats
gains StartHostAuth(auth) and sets AlwaysEnableNonce so the server advertises
the nonce nkey clients need; Start/StartHost stay open (auth=nil) for dev.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add a --bind flag (default 127.0.0.1) to membershipd that controls which
network interface both the control-plane HTTP API and the embedded NATS data
plane listen on. Use 0.0.0.0 to expose the stack to the LAN so remote peers
(phones, other PCs) can connect; keep the default for a loopback-only dev stack.
embeddednats gains StartHost(storeDir, host, port) for explicit interface
control; Start stays a backward-compatible wrapper (host "" = nats default
0.0.0.0) so the playground and tests are untouched.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>