membershipd never called Server.UseReplicatedNonces, so every node kept a
per-process anti-replay cache and a signed request accepted on node A could be
replayed to node B (both returned 200). This wires the shared JetStream KV nonce bucket on
any clustered node, closing the cross-node replay hole.
Bootstrap: under enforce the service needs JetStream on its own embedded server,
but the data plane only accepts allowlisted clients. Resolved with an ephemeral
internal service identity the authenticator recognizes and grants full
permissions (NewNkeyAuthenticatorACLInternal), connected over the in-process
transport (no TLS/CA needed for the self-connection).
Hard rule: --cluster-name != "" means the replicated nonce bucket is mandatory;
if it cannot be created the node refuses to start (wireReplicatedNonces returns a
fatal error) rather than run insecurely. Standalone nodes keep the in-memory
cache unchanged (branch-by-abstraction: no JetStream dependency added).
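The hard rule above can be sketched roughly as follows. This is an illustration only, not the code in cmd/membershipd: NonceStore, memoryNonces and chooseNonceStore are hypothetical names, and an in-memory store stands in for the JetStream KV bucket so the sketch runs without a server.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// NonceStore is a hypothetical abstraction over the anti-replay cache.
// Seen reports true when the nonce was already recorded (a replay).
type NonceStore interface {
	Seen(nonce string) (bool, error)
}

// memoryNonces is a per-process cache, sufficient for a standalone node.
type memoryNonces struct {
	mu   sync.Mutex
	seen map[string]struct{}
}

func newMemoryNonces() *memoryNonces {
	return &memoryNonces{seen: make(map[string]struct{})}
}

func (m *memoryNonces) Seen(nonce string) (bool, error) {
	m.mu.Lock()
	defer m.mu.Unlock()
	if _, ok := m.seen[nonce]; ok {
		return true, nil
	}
	m.seen[nonce] = struct{}{}
	return false, nil
}

var errNoSharedStore = errors.New("clustered node requires a replicated nonce store")

// chooseNonceStore applies the rule: a standalone node gets a local cache,
// a clustered node must be handed a shared store or startup fails.
func chooseNonceStore(clusterName string, shared NonceStore) (NonceStore, error) {
	if clusterName == "" {
		return newMemoryNonces(), nil
	}
	if shared == nil {
		return nil, errNoSharedStore
	}
	return shared, nil
}

func main() {
	shared := newMemoryNonces() // stands in for the replicated bucket
	nodeA, _ := chooseNonceStore("c1", shared)
	nodeB, _ := chooseNonceStore("c1", shared)

	first, _ := nodeA.Seen("n-1")
	replay, _ := nodeB.Seen("n-1")
	fmt.Println(first, replay) // false true

	_, err := chooseNonceStore("c1", nil)
	fmt.Println(err != nil) // true
}
```

Because both clustered nodes consult the same store, the nonce recorded on node A is already known on node B, which is the property the cross-node replay test checks.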
Changes:
- busauth: NewNkeyAuthenticatorACLInternal + fullPermissions for the internal id.
- cmd/membershipd: connectInternalJS (in-process, privileged) / connectExternalJS;
wireReplicatedNonces helper; main wires it when clustered; --kv-replicas flag.
Tests (regression of audit 0008 N3):
- TestAttack0008_N3: 2 clustered nodes share the bucket, cross-node replay -> 401.
- TestAttack0008_N3_StandaloneKeepsLocalCache: standalone needs no JetStream,
same-node replay still 401.
- TestAttack0008_N3_ClusteredRequiresJetStream: clustered + no JetStream -> fatal.
- TestInternalConnPrivilegedUnderEnforce / ...OutsiderRejected: the privileged
self-connection works under enforce and no other identity can claim it.
CGO_ENABLED=0 go build/vet/test green; govulncheck 0 reachable.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The per-subject data-plane ACL has existed since 0003e (membership.SubjectACLFor +
busauth.NewNkeyAuthenticatorACL, unit-tested in TestSubjectACLIsolation) but the
binary never used it: cmd/membershipd installed the plain NewNkeyAuthenticator, so
in production a registered NON-member could open a raw NATS connection,
Subscribe(">"), and harvest every room's subject plus JetStream stream/advisory
activity. Payloads stayed E2E ciphertext, but metadata leaked. This is the
re-audit's H4 vector (report 0006).
Fix:
- New busauth.PermissionsFromSubjects adapts a subject-deriving function into the
PermissionsFunc the ACL authenticator expects (subjects granted as both the
publish and subscribe allow set; a derivation error fails closed). It lives in
busauth so membership stays free of the nats-server dependency.
- cmd/membershipd, under enforce, now installs
NewNkeyAuthenticatorACL(store.IsAuthorized,
PermissionsFromSubjects(membership.SubjectACLFor(store)))
so every connection is confined to the subjects of the rooms it belongs to plus
the client-infra subjects.
- pkg/membership/acl_test.go's helper now delegates to the production wiring
(PermissionsFromSubjects) instead of a test-only reimplementation, so the tests
exercise the real path.
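The adapter described in the fix can be pictured with the sketch below. It is not the busauth source: the Permissions struct is a local stand-in for the publish/subscribe allow lists, and the type and function names (SubjectsFunc, permissionsFromSubjects) are assumptions chosen for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// Permissions is a local stand-in for the per-connection allow lists.
type Permissions struct {
	PublishAllow   []string
	SubscribeAllow []string
}

// SubjectsFunc derives the subjects a signing pubkey may use.
type SubjectsFunc func(pubHex string) ([]string, error)

// PermissionsFunc is the shape the ACL authenticator is assumed to expect.
type PermissionsFunc func(pubHex string) (*Permissions, error)

var errStoreDown = errors.New("membership store unavailable")

// permissionsFromSubjects grants the derived subjects as both the publish
// and subscribe allow set. A derivation error yields no permissions at all,
// so the caller can deny the connection (fail closed).
func permissionsFromSubjects(subjects SubjectsFunc) PermissionsFunc {
	return func(pubHex string) (*Permissions, error) {
		allowed, err := subjects(pubHex)
		if err != nil {
			return nil, fmt.Errorf("derive subjects for %s: %w", pubHex, err)
		}
		return &Permissions{
			PublishAllow:   append([]string(nil), allowed...),
			SubscribeAllow: append([]string(nil), allowed...),
		}, nil
	}
}

func main() {
	ok := permissionsFromSubjects(func(string) ([]string, error) {
		return []string{"room.A", "_INBOX.>"}, nil
	})
	p, err := ok("ab12")
	fmt.Println(p.SubscribeAllow, err) // [room.A _INBOX.>] <nil>

	broken := permissionsFromSubjects(func(string) ([]string, error) {
		return nil, errStoreDown
	})
	p, err = broken("ab12")
	fmt.Println(p == nil, err != nil) // true true
}
```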
Verification (pkg/membership/acl_test.go):
- TestReaudit_H4_WildcardMetadataLeak: a non-member's Subscribe(">") and any
foreign-subject subscribe raise permission violations; the member still pub/subs
her own room and the non-member captures nothing. With the plain authenticator
(the pre-0005e wiring) the test fails ("wildcard metadata leak still open"),
confirming the wiring is what closes it.
- TestSubjectACLIsolation / TestRefreshSessionGainsNewRoom still green.
- CGO_ENABLED=0 go build ./... && go vet ./... && go test -count=1 ./... green.
Residual (documented): the client-infra grant includes "$JS.API.>", shared by all
peers so per-connection JetStream works; a peer that subscribes specifically to
"$JS.API.>" can still observe stream-management requests whose subjects embed the
room-derived stream name. Fully closing that needs NATS accounts/permissions per
identity (deferred to the 0003 decentralization line). Operational note: NATS
freezes permissions at connect time, so clients must call client.RefreshSession
after a membership change to gain a new room's subject. cmd/chat and cmd/worker do
not yet call it, which is a functional gap to close before an enforce+ACL deployment.
Refs: report 0006 H4, issue 0005e.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Closes the residual the 0004 hardening deferred: the NATS authenticator
can now confine a registered peer to the subjects of the rooms it
belongs to, instead of letting any registered identity sub/pub on any
subject. The dynamic-membership reconnection model the audit named is
provided by client.RefreshSession.
pkg/busauth:
- verifyNkey factors out the shared nkey verification.
- NewNkeyAuthenticatorACL + PermissionsFunc: an authenticator that, after
authorizing, derives and RegisterUser()s per-subject permissions. A
derivation error denies the connection (fail closed).
pkg/membership:
- SubjectACLFor(store) maps a signing pubkey to the subjects it may use:
the subject of every room it belongs to, plus the client infrastructure
subjects (_INBOX.>, $JS.API.> for request/reply and the persisted plane).
pkg/client:
- RefreshSession() rebuilds the data-plane connection so the authenticator
re-derives permissions after a membership change (NATS freezes
permissions at connect time). It retains the seeds/options to reconnect;
active subscriptions are dropped and must be re-made (documented).
Tests (DoD: isolation + refresh):
- TestSubjectACLIsolation: alice (member of room.A) may sub/pub room.A but
is DENIED sub and pub on room.B (permissions violation), and never reads
bob's room.B traffic; bob never receives alice's cross-room publish.
- TestRefreshSessionGainsNewRoom: alice has no permission for room B until
she is added and calls RefreshSession; the reconnect grants the subject
and she then receives room B traffic.
Scope note: the per-subject ACL authenticator is opt-in (NewServer/
membershipd keep the open authenticator by default) and is wired in with
the decentralized boot path; auto-RefreshSession on every membership
change (fully transparent) remains for 0003f. Master behavior unchanged.
Add high-availability cluster support to the embedded NATS server
(issue 0003a, first phase of decentralization).
pkg/embeddednats:
- ServerConfig gains ServerName (unique per node, required by JetStream
RAFT) and an optional *ClusterConfig (cluster name, route host/port,
peer route URLs, shared-secret Username/Password, and a mutual-TLS
*tls.Config). applyClusterOpts maps it onto server.Options.Cluster +
Routes. Nil Cluster keeps the legacy standalone server.
pkg/busauth:
- RouteTLSConfig builds the route layer's mutual-TLS config: the node
presents its CA-signed certificate AND verifies the peer's certificate
against the bus CA (RequireAndVerifyClientCert), reusing the issue-0001
CA. Routes authenticate NODES and never go through the client nkey authenticator.
cmd/membershipd:
- Cluster flags (--cluster-name/--server-name/--cluster-port/--routes/
--cluster-user/--cluster-pass/--route-tls-cert/-key/-ca) wire a node
into the cluster. validateClusterConfig refuses a public cluster
without a route secret and complete mutual route TLS, and rejects
partial route-TLS flags (all-or-nothing). splitRoutes parses the CSV.
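The CSV parsing and the all-or-nothing route-TLS rule can be sketched as below. These are illustrative helpers with hypothetical names (splitRoutesSketch, checkRouteTLSFlags), not the functions in cmd/membershipd, and the exact validation the binary performs may differ.

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
	"strings"
)

// splitRoutesSketch parses a comma-separated list of route URLs, skipping
// empty entries and rejecting entries without a host.
func splitRoutesSketch(csv string) ([]*url.URL, error) {
	var routes []*url.URL
	for _, part := range strings.Split(csv, ",") {
		part = strings.TrimSpace(part)
		if part == "" {
			continue
		}
		u, err := url.Parse(part)
		if err != nil {
			return nil, fmt.Errorf("route %q: %w", part, err)
		}
		if u.Host == "" {
			return nil, fmt.Errorf("route %q: missing host", part)
		}
		routes = append(routes, u)
	}
	return routes, nil
}

// checkRouteTLSFlags enforces all-or-nothing: either no route-TLS flag is
// set (TLS off) or cert, key and CA are all set (TLS on).
func checkRouteTLSFlags(cert, key, ca string) (enabled bool, err error) {
	set := 0
	for _, v := range []string{cert, key, ca} {
		if v != "" {
			set++
		}
	}
	switch set {
	case 0:
		return false, nil
	case 3:
		return true, nil
	default:
		return false, errors.New("route TLS needs cert, key and CA together")
	}
}

func main() {
	routes, err := splitRoutesSketch("nats://10.0.0.2:6222, nats://10.0.0.3:6222")
	fmt.Println(len(routes), err) // 2 <nil>

	_, err = checkRouteTLSFlags("node.crt", "node.key", "")
	fmt.Println(err != nil) // true
}
```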
Tests (DoD: golden + 2 edge + error path):
- TestClusterForwardsAcrossNodes: 2-node cluster forwards a client
subject from one node to a subscriber on the other.
- TestClusterThreeNodesForward: 3-node (HA shape) cross-node forwarding.
- TestClusterMutualTLSForwards: forwarding over mutual-TLS routes.
- TestClusterRejectsBadRouteAuth: wrong cluster password -> no route.
- TestClusterRejectsUnsignedNode: cert not signed by the bus CA -> no route.
- TestClusterConfigPolicy / TestSplitRoutes: boot-guard + CSV parsing.
Master stays green: standalone (no --cluster-name) is unchanged.
client/tls_test: mints a throwaway CA + server cert in-memory; a client
pinning the CA completes the handshake and operates (golden), a client
without the CA fails the handshake (error path). busauth/tls_test: golden
load of a CA PEM and a server keypair, plus error paths (missing file,
non-PEM). Harness body extracted to bootHarness(ctrlMode, natsAuth, natsTLS).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
busauth.LoadCATLSConfig turns a ca.crt path into a *tls.Config trusting only
that private CA (clients must pin it; the system roots would reject a
self-signed server cert). busauth.ServerTLSConfig loads the server keypair.
client.Options gains TLS; NewWithOptions calls nats.Secure when set, so the
data-plane connection is encrypted and the server pinned.
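A CA-pinning config in the spirit of LoadCATLSConfig can be sketched as follows. The sketch takes PEM bytes rather than a path so it runs without files; the function name and the MinVersion setting are assumptions.

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"errors"
	"fmt"
)

// caTLSConfigFromPEM returns a config that trusts only the certificates in
// caPEM. Setting RootCAs replaces the system roots for this connection.
func caTLSConfigFromPEM(caPEM []byte) (*tls.Config, error) {
	pool := x509.NewCertPool()
	if !pool.AppendCertsFromPEM(caPEM) {
		return nil, errors.New("no CA certificate found in PEM input")
	}
	return &tls.Config{
		RootCAs:    pool,
		MinVersion: tls.VersionTLS12, // assumption: not stated in the commit
	}, nil
}

func main() {
	// Non-PEM input is one of the error paths the tests cover.
	_, err := caTLSConfigFromPEM([]byte("not a certificate"))
	fmt.Println(err != nil) // true
}
```

With the nats.go client, a config like this is the kind of value passed to nats.Secure when connecting.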
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
busauth.NewNkeyAuthenticator verifies a client's nkey signature over the
server nonce (decoding like nats-server: raw-url then std base64), maps the
nkey to its Ed25519 hex, and consults an injected IsAuthorized predicate.
Checking on every connection (rather than a static Options.Nkeys map) means
revoking a user denies its next connection with no restart. embeddednats
gains StartHostAuth(auth) and sets AlwaysEnableNonce so the server advertises
the nonce nkey clients need; Start/StartHost stay open (auth=nil) for dev.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
A NATS nkey is an Ed25519 keypair, so the bus reuses each peer's signing
identity for the data plane instead of minting new key material. ClientNkey
derives the user nkey public string and a nonce-signing callback from the
peer's Ed25519 private key (its first 32 bytes are the nkey seed);
SignPubHexFromNkey maps a presented nkey back to the allowlist's hex key;
NkeyPublicFromSignPub is the public-only derivation.
This is NATS-specific transport glue kept in the app, not promoted to the
registry, to avoid pulling nats-io/nkeys into the multi-domain registry
module. The dedicated round-trip test runs first (spec requirement): it
proves the nkey signature equals the identity's raw Ed25519 signature and
that the nkey maps back to the identity's hex.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>