A secured bus freezes per-subject permissions at connect time, so a peer that
creates or joins a room after connecting cannot pub/sub on it until it reconnects
via RefreshSession. No client called RefreshSession, so under enforce+ACL the
demos failed closed, which pushed operators toward disabling the ACL (a security
regression left to the operator's discretion).
Wire the membership-change contract into every client:
- cmd/worker: RefreshSession after CreateRoom, before publishing.
- cmd/chat (simple): RefreshSession after CreateRoom+Join, before Subscribe.
- cmd/chat (encrypted demo): A refreshes after CreateRoom; B refreshes after the
invite+join, both before pub/sub.
- local_files/bridge (gateway): RefreshSession after CreateRoom+Join, before Subscribe.
- mobile: new Session.RefreshSession wrapper + the contract documented for callers.
Contract (documented on the wrappers): after ANY membership change, call
RefreshSession BEFORE pub/sub on the new room (it drops active subs, so it must
precede Subscribe). On an unsecured/dev bus it is a harmless reconnect.
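The ordering the contract asks for can be illustrated with a small in-memory model. This is only a sketch: fakeSession and its methods are hypothetical stand-ins that mimic frozen permissions, not the pkg/client API.

```go
package main

import (
	"errors"
	"fmt"
)

// fakeSession is a HYPOTHETICAL stand-in for a bus client whose permissions
// are frozen when the connection is made, as on a secured bus.
type fakeSession struct {
	member  map[string]bool // rooms the peer belongs to (control plane)
	allowed map[string]bool // rooms the current connection may use
	subs    map[string]bool // active subscriptions
}

func newFakeSession() *fakeSession {
	return &fakeSession{
		member:  map[string]bool{},
		allowed: map[string]bool{},
		subs:    map[string]bool{},
	}
}

var errPermission = errors.New("permissions violation")

// CreateRoom changes membership only; the live connection is not updated.
func (s *fakeSession) CreateRoom(room string) { s.member[room] = true }

// RefreshSession models the reconnect: permissions are re-derived from
// membership and every active subscription is dropped.
func (s *fakeSession) RefreshSession() {
	s.allowed = map[string]bool{}
	for room := range s.member {
		s.allowed[room] = true
	}
	s.subs = map[string]bool{}
}

// Subscribe fails closed when the frozen permissions do not cover the room.
func (s *fakeSession) Subscribe(room string) error {
	if !s.allowed[room] {
		return errPermission
	}
	s.subs[room] = true
	return nil
}

func main() {
	s := newFakeSession()
	s.CreateRoom("room.A")
	fmt.Println(s.Subscribe("room.A")) // permissions violation
	s.RefreshSession()
	fmt.Println(s.Subscribe("room.A")) // <nil>
}
```

Because the refresh drops subscriptions, calling it after Subscribe would silently lose the subscription, which is why the contract places it first.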
Test:
- TestClientCreateRoomRefreshPublishFlow: end-to-end under enforce+ACL, a peer
creates a room, refreshes, invites a second peer who joins+refreshes+subscribes,
and the publish is received — no manual intervention, the ACL stays on.
CGO_ENABLED=0 go build/vet/test green; govulncheck 0 reachable.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
A cluster is only as secure as its weakest node: the data plane forwards every
subject between nodes, so one node running without enforced auth lets an
unauthenticated peer Subscribe(">") on it and harvest the traffic forwarded from
the ACL'd nodes.
- validateClusterConfig now takes the auth mode and REFUSES to join a cluster
unless --bus-auth enforce, regardless of bind (a clustered node is a production
node; there is no safe dev cluster without auth). This binary therefore cannot
BE the weak node.
- Server.Posture {enforce,acl,tls,cluster,store} is published on /healthz (non-secret
operational metadata; the probe stays unauthenticated) so a monitor or peer
can detect a cluster member not running enforce+ACL+TLS — covering a peer that
runs a tampered/old binary outside this node's control.
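A minimal sketch of the cluster rule, reduced to the one decision described above. checkClusterAuth is a hypothetical helper, not validateClusterConfig itself, whose real signature and other checks are not shown here.

```go
package main

import (
	"errors"
	"fmt"
)

// Auth modes named in the commit text.
const (
	modeOff     = "off"
	modeSoft    = "soft"
	modeEnforce = "enforce"
)

// checkClusterAuth is a HYPOTHETICAL reduction of the cluster guard: a
// clustered node must run with enforced auth, whatever address it binds to.
func checkClusterAuth(clusterName, mode string) error {
	if clusterName == "" {
		return nil // standalone node: this rule does not apply
	}
	if mode != modeEnforce {
		return errors.New("cluster requires --bus-auth enforce, got " + mode)
	}
	return nil
}

func main() {
	for _, mode := range []string{modeOff, modeSoft, modeEnforce} {
		fmt.Printf("%-8s refused=%v\n", mode, checkClusterAuth("prod", mode) != nil)
	}
}
```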
Tests:
- TestAttack0008_N1: a clustered node with --bus-auth off is refused; the same
node with enforce + full route security is allowed.
- TestClusterConfigPolicy: extended with off/soft clustered cases (refused) and
the mode parameter throughout.
- TestHealthExposesPosture: /healthz returns the posture booleans + store backend.
CGO_ENABLED=0 go build/vet/test green; govulncheck 0 reachable.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
0003 built the JetStream KV store (jetstreamStore) but the binary never selected
it: membership.Open (SQLite) was hardcoded and OpenJetStream was only reached by
migrate-to-kv. This completes the wiring so a node actually serves its control
plane from the replicated KV.
- New flag --store kv|sqlite (default sqlite). kv opens the JetStream KV control
plane over the privileged internal connection; sqlite is the unchanged baseline
(branch-by-abstraction: the full suite's SQLite paths are untouched).
- Bootstrap cycle resolved with storeHolder: the authenticator consults the holder
(fail-closed until set), so it can be built before the KV store exists. The KV
store opens after NATS is up and is published into the holder. The only client
that can connect in that window is the internal identity, which bypasses the
store by key. In SQLite mode the store is set before StartServer, so the window
does not exist.
- needJS now covers --store kv as well as --cluster-name; the JetStream client is
shared by the KV store and the replicated nonce bucket.
- feature_flags.json: decentralized wiring documented as complete, realized via
--store kv (opt-in per deploy; default stays sqlite).
Fail-closed preserved: jetstreamStore.IsAuthorized already denies on any backend
error; the holder denies while unset.
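The holder idea can be sketched as follows, assuming the authenticator only needs an IsAuthorized lookup. The names holder, authorizer and allowlist are illustrative, and the real storeHolder may differ.

```go
package main

import (
	"fmt"
	"sync"
)

// authorizer is the ASSUMED slice of the store the authenticator needs.
type authorizer interface {
	IsAuthorized(pubKey string) bool
}

// holder denies every lookup until a store is published into it, so the
// authenticator can be built before the store exists.
type holder struct {
	mu    sync.RWMutex
	store authorizer
}

// Set publishes the store; lookups are delegated from then on.
func (h *holder) Set(s authorizer) {
	h.mu.Lock()
	defer h.mu.Unlock()
	h.store = s
}

// IsAuthorized fails closed while no store has been set.
func (h *holder) IsAuthorized(pubKey string) bool {
	h.mu.RLock()
	defer h.mu.RUnlock()
	if h.store == nil {
		return false
	}
	return h.store.IsAuthorized(pubKey)
}

// allowlist is a toy store used only for the demo.
type allowlist map[string]bool

func (a allowlist) IsAuthorized(pubKey string) bool { return a[pubKey] }

func main() {
	h := &holder{}
	fmt.Println(h.IsAuthorized("alice")) // false: holder still empty
	h.Set(allowlist{"alice": true})
	fmt.Println(h.IsAuthorized("alice")) // true
	fmt.Println(h.IsAuthorized("eve"))   // false
}
```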
Tests:
- TestStoreHolderFailClosed: empty holder denies; serves after set.
- TestKVStoreBootstrapUnderEnforce: end-to-end decentralized boot — KV-seeded user
authenticates over nkey under enforce; outsider denied.
- TestKVStoreDecentralizedConsistency: a room/user created on one node's KV store
is visible to another's (ends the per-node SQLite divergence, audit 0008 N5).
CGO_ENABLED=0 go build/vet/test green; govulncheck 0 reachable.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The client-infra grant was {"_INBOX.>", "$JS.API.>"}. The broad "$JS.API.>" let
any registered peer drive the whole JetStream API and read the control-plane KV
buckets (KV_UNIBUS_users/rooms/members/room_keys) and the object store directly
over NATS, bypassing the HTTP authorization (requireMember + own-endpoint
checks): a full leak of the allowlist, room graph and sealed-key metadata once the
decentralized control plane is active.
Fix: replace the broad grant with a CLOSED, per-room allow set.
- clientInfraSubjects shrinks to {"_INBOX.>", "$JS.API.INFO"} ($JS.API.INFO is
account counters only — no room/user/key contents).
- SubjectACLFor now grants, per room the peer belongs to, the room subject plus
the minimal JetStream API subjects of THAT room's stream (jsSubjectsFor:
STREAM.*, CONSUMER.*, $JS.ACK scoped to UNIBUS_<roomID>).
- Because KV_UNIBUS_* and OBJ_UNIBUS_* are never a room stream, they fall outside
the closed allow set and are denied by default. Clients reach blobs over the
HTTP control plane, not the NATS object store, so OBJ needs no client grant.
roomStreamName mirrors pkg/client.streamName so the authorizer and the producer
never drift.
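To see why a closed per-room set denies the KV buckets by default, here is a sketch with a small NATS-style wildcard matcher. The exact patterns returned by jsSubjects are an assumption for illustration (the real jsSubjectsFor may grant different ones), and the room subject itself is left out.

```go
package main

import (
	"fmt"
	"strings"
)

// roomStream follows the UNIBUS_<roomID> naming from the commit text.
func roomStream(roomID string) string { return "UNIBUS_" + roomID }

// jsSubjects lists an ASSUMED minimal JetStream API set for one stream.
func jsSubjects(stream string) []string {
	return []string{
		"$JS.API.STREAM.*." + stream,
		"$JS.API.CONSUMER.*." + stream,
		"$JS.API.CONSUMER.*." + stream + ".>",
		"$JS.ACK." + stream + ".>",
	}
}

// matches applies NATS wildcard rules: "*" matches exactly one token and
// ">" matches one or more trailing tokens.
func matches(pattern, subject string) bool {
	p := strings.Split(pattern, ".")
	s := strings.Split(subject, ".")
	for i, tok := range p {
		if tok == ">" {
			return len(s) > i
		}
		if i >= len(s) || (tok != "*" && tok != s[i]) {
			return false
		}
	}
	return len(p) == len(s)
}

// allowed reports whether any grant covers the subject (deny by default).
func allowed(grants []string, subject string) bool {
	for _, g := range grants {
		if matches(g, subject) {
			return true
		}
	}
	return false
}

// grantsFor builds the closed set for a peer that belongs to the given rooms.
func grantsFor(roomIDs ...string) []string {
	grants := []string{"_INBOX.>", "$JS.API.INFO"}
	for _, id := range roomIDs {
		grants = append(grants, jsSubjects(roomStream(id))...)
	}
	return grants
}

func main() {
	grants := grantsFor("r1")
	fmt.Println(allowed(grants, "$JS.API.STREAM.INFO.UNIBUS_r1"))       // true
	fmt.Println(allowed(grants, "$JS.API.STREAM.INFO.KV_UNIBUS_users")) // false
	fmt.Println(allowed(grants, "$JS.API.STREAM.INFO.UNIBUS_r2"))       // false
}
```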
Tests:
- TestAttack0008_N2: eve (registered, member of no room) cannot bind the KV users
bucket nor subscribe $KV.UNIBUS_users.> (permissions violation); golden: the
room owner can still drive her OWN room stream's JetStream API; edge: eve cannot
reach a foreign room's stream.
- TestReaudit_H4 residual note updated: the $JS.API.> leak it deferred is closed.
CGO_ENABLED=0 go build/vet/test green; govulncheck 0 reachable.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
membershipd never called Server.UseReplicatedNonces, so every node kept a
per-process anti-replay cache and a signed request accepted on node A could be
replayed to node B (200+200). This wires the shared JetStream KV nonce bucket on
any clustered node, closing the cross-node replay hole.
Bootstrap: under enforce the service needs JetStream on its own embedded server,
but the data plane only accepts allowlisted clients. Resolved with an ephemeral
internal service identity the authenticator recognizes and grants full
permissions (NewNkeyAuthenticatorACLInternal), connected over the in-process
transport (no TLS/CA needed for the self-connection).
Hard rule: --cluster-name != "" means the replicated nonce bucket is mandatory;
if it cannot be created the node refuses to start (wireReplicatedNonces returns a
fatal error) rather than run insecurely. Standalone nodes keep the in-memory
cache unchanged (branch-by-abstraction: no JetStream dependency added).
Changes:
- busauth: NewNkeyAuthenticatorACLInternal + fullPermissions for the internal id.
- cmd/membershipd: connectInternalJS (in-process, privileged) / connectExternalJS;
wireReplicatedNonces helper; main wires it when clustered; --kv-replicas flag.
Tests (regression of audit 0008 N3):
- TestAttack0008_N3: 2 clustered nodes share the bucket, cross-node replay -> 401.
- TestAttack0008_N3_StandaloneKeepsLocalCache: standalone needs no JetStream,
same-node replay still 401.
- TestAttack0008_N3_ClusteredRequiresJetStream: clustered + no JetStream -> fatal.
- TestInternalConnPrivilegedUnderEnforce / ...OutsiderRejected: the privileged
self-connection works under enforce and no other identity can claim it.
CGO_ENABLED=0 go build/vet/test green; govulncheck 0 reachable.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Hardening 2 (issue 0005, phases 0005a-0005e) closes the new findings of the
red-team re-audit (report 0006): nats-server + toolchain bump (16 CVEs ->
0 reachable), dropping unsigned frames in SignMsgs rooms, a global in-flight
byte limiter against the concurrency DoS, mandatory TLS on public binds, and
wiring of the per-subject ACL that closes the wildcard metadata leak. Per-phase
detail is in the capability growth log in app.md and in report 0007.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The per-subject data-plane ACL existed since 0003e (membership.SubjectACLFor +
busauth.NewNkeyAuthenticatorACL, unit-tested in TestSubjectACLIsolation) but the
binary never used it: cmd/membershipd installed the plain NewNkeyAuthenticator, so
in production a registered NON-member could open a raw NATS connection,
Subscribe(">"), and harvest every room's subject plus JetStream stream/advisory
activity (payload stayed E2E ciphertext, metadata leaked) — the re-audit's H4
vector (report 0006).
Fix:
- New busauth.PermissionsFromSubjects adapts a subject-deriving function into the
PermissionsFunc the ACL authenticator expects (subjects granted as both the
publish and subscribe allow set; a derivation error fails closed). It lives in
busauth so membership stays free of the nats-server dependency.
- cmd/membershipd, under enforce, now installs
    NewNkeyAuthenticatorACL(store.IsAuthorized,
        PermissionsFromSubjects(membership.SubjectACLFor(store)))
so every connection is confined to the subjects of the rooms it belongs to plus
the client-infra subjects.
- pkg/membership/acl_test.go's helper now delegates to the production wiring
(PermissionsFromSubjects) instead of a test-only reimplementation, so the tests
exercise the real path.
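The adapter shape described in the first bullet might look roughly like this. The perms type is a local stand-in for the server's permissions type, and fromSubjects, subjectsFunc and permsFunc are hypothetical names rather than the busauth signatures.

```go
package main

import (
	"errors"
	"fmt"
)

// perms is a LOCAL stand-in for the server's per-connection permissions.
type perms struct {
	PubAllow []string
	SubAllow []string
}

type subjectsFunc func(pubKey string) ([]string, error)
type permsFunc func(pubKey string) (*perms, error)

// fromSubjects grants the derived subjects as both the publish and the
// subscribe allow set. A derivation error is returned unchanged so the
// caller can deny the connection (fail closed).
func fromSubjects(derive subjectsFunc) permsFunc {
	return func(pubKey string) (*perms, error) {
		subjects, err := derive(pubKey)
		if err != nil {
			return nil, fmt.Errorf("derive subjects: %w", err)
		}
		return &perms{
			PubAllow: append([]string(nil), subjects...),
			SubAllow: append([]string(nil), subjects...),
		}, nil
	}
}

// demoDerive is a toy derivation: an empty key is an unknown identity.
func demoDerive(pubKey string) ([]string, error) {
	if pubKey == "" {
		return nil, errors.New("unknown identity")
	}
	return []string{"room.A", "_INBOX.>"}, nil
}

func main() {
	pf := fromSubjects(demoDerive)
	p, err := pf("alice")
	fmt.Println(p.PubAllow, p.SubAllow, err)
	_, err = pf("")
	fmt.Println(err != nil) // true: the connection would be denied
}
```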
Verification (pkg/membership/acl_test.go):
- TestReaudit_H4_WildcardMetadataLeak: a non-member's Subscribe(">") and any
foreign-subject subscribe raise permission violations; the member still pub/subs
her own room and the non-member captures nothing. With the plain authenticator
(the pre-0005e wiring) the test fails ("wildcard metadata leak still open"),
confirming the wiring is what closes it.
- TestSubjectACLIsolation / TestRefreshSessionGainsNewRoom still green.
- CGO_ENABLED=0 go build ./... && go vet ./... && go test -count=1 ./... green.
Residual (documented): the client-infra grant includes "$JS.API.>", shared by all
peers so per-connection JetStream works; a peer that subscribes specifically to
"$JS.API.>" can still observe stream-management requests whose subjects embed the
room-derived stream name. Fully closing that needs NATS accounts/permissions per
identity (deferred to the 0003 decentralization line). Operational note: NATS
freezes permissions at connect time, so clients must client.RefreshSession after a
membership change to gain a new room's subject; cmd/chat and cmd/worker do not yet
call it, a functional gap to close before an enforce+ACL deployment.
Refs: report 0006 H4, issue 0005e.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The H2 guard refused "public bind without enforce" and "TLS flags without
enforce", but it still ALLOWED a public bind with enforce and no --tls-cert: the
control plane then served metadata (subjects, pubkeys, sealed keys, the social
graph) over plaintext HTTP publicly, so audit H5 reappeared as the N4 gap (TLS
was a capability, not a requirement; report 0006).
Fix: validateBootConfig now also refuses a non-loopback --bind unless both
--tls-cert and --tls-key are set. Public deployments must serve HTTPS; loopback
dev is unaffected (no TLS still allowed there).
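The resulting boot policy can be sketched as a pure function. checkBoot and isLoopback are hypothetical names; treating "localhost" and loopback IPs as the only loopback binds is an assumption about how the real validateBootConfig classifies addresses.

```go
package main

import (
	"errors"
	"fmt"
	"net"
)

// isLoopback is an ASSUMED classification: "localhost" or a loopback IP.
func isLoopback(bind string) bool {
	if bind == "localhost" {
		return true
	}
	ip := net.ParseIP(bind)
	return ip != nil && ip.IsLoopback()
}

// checkBoot is a HYPOTHETICAL sketch of the three refusals described in the
// H2 and N4 fixes; it returns nil when the configuration may boot.
func checkBoot(bind, authMode, tlsCert, tlsKey string) error {
	public := !isLoopback(bind)
	enforce := authMode == "enforce"
	anyTLS := tlsCert != "" || tlsKey != ""
	switch {
	case public && !enforce:
		return errors.New("non-loopback --bind requires --bus-auth enforce")
	case anyTLS && !enforce:
		return errors.New("--tls-cert/--tls-key require --bus-auth enforce")
	case public && (tlsCert == "" || tlsKey == ""):
		return errors.New("non-loopback --bind requires both --tls-cert and --tls-key")
	}
	return nil
}

func main() {
	fmt.Println(checkBoot("0.0.0.0", "enforce", "", ""))           // refused
	fmt.Println(checkBoot("0.0.0.0", "enforce", "c.pem", "k.pem")) // <nil>
	fmt.Println(checkBoot("127.0.0.1", "enforce", "", ""))         // <nil>
}
```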
Verification (cmd/membershipd/config_test.go):
- TestGap_PublicEnforceNoTLS: validateBootConfig("0.0.0.0", enforce, "", "")
now returns an error mentioning --tls-cert (golden public+enforce+TLS allowed;
edge loopback-without-TLS still allowed).
- TestBootConfigPolicy table updated: public+enforce+notls / +certonly / +keyonly
and lan-ip+enforce+notls are now refused; public+enforce+tls and
loopback+enforce+tls allowed.
- CGO_ENABLED=0 go build ./... && go vet ./... && go test -count=1 ./... green.
Refs: report 0006 N4, issue 0005d.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The H1 fix bounds each request (1 MiB control / 16 MiB blob) and the per-IP rate
limiter throttles a single source, but neither bounds the AGGREGATE memory across
concurrent requests. The re-audit (report 0006, N2) drove RSS to ~1.42 GB with 40
concurrent 16 MiB uploads, and noted that a multi-IP (botnet) flood scales without
a ceiling because the rate limit is per-IP.
Fix: a global, non-blocking, byte-counting limiter (pkg/membership/inflight.go).
ServeHTTP reserves a POST's worst-case buffered size (its route ceiling) from the
limiter before reading the body, and releases it when the request finishes. When
the global cap (maxInflightBytes = 128 MiB) is reached, further POSTs are shed
with 503 (backpressure) rather than parking goroutines, so total bytes buffered
in flight stays bounded regardless of connection count or source-IP spread. GETs
carry no body and do not consume the budget.
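A non-blocking byte budget of this kind can be sketched with a compare-and-swap loop. byteLimiter is an illustrative type, not the code in pkg/membership/inflight.go, and it omits the disabled mode and over-release handling that the tests mention.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// byteLimiter bounds the total bytes reserved by in-flight requests.
type byteLimiter struct {
	limit    int64
	inFlight atomic.Int64
}

// tryReserve claims n bytes without blocking. A false result means the
// caller should shed the request (for example with a 503).
func (l *byteLimiter) tryReserve(n int64) bool {
	for {
		cur := l.inFlight.Load()
		if cur+n > l.limit {
			return false
		}
		if l.inFlight.CompareAndSwap(cur, cur+n) {
			return true
		}
		// Another goroutine changed the counter; retry with the new value.
	}
}

// release returns n previously reserved bytes to the budget.
func (l *byteLimiter) release(n int64) { l.inFlight.Add(-n) }

// admitted counts how many of `requests` reservations of `size` bytes fit.
func admitted(l *byteLimiter, requests int, size int64) int {
	ok := 0
	for i := 0; i < requests; i++ {
		if l.tryReserve(size) {
			ok++
		}
	}
	return ok
}

func main() {
	const MiB = 1 << 20
	l := &byteLimiter{limit: 48 * MiB}
	ok := admitted(l, 40, 16*MiB)
	fmt.Println(ok, 40-ok) // 3 37
	for i := 0; i < ok; i++ {
		l.release(16 * MiB)
	}
	fmt.Println(l.inFlight.Load()) // 0
}
```

Reserving the route ceiling before the body is read is what keeps the bound independent of how many connections or source IPs are involved.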
The limiter is implemented inside unibus (not delegated to the fn-registry, where
a generic concurrency primitive would normally live) because functions/core pulls
transitive deps requiring CGO (mattn/go-sqlite3) and external modules that are
incompatible with unibus's CGO_ENABLED=0 build, and because this work is scoped
to the unibus sub-repo. The type/method comments document this.
Verification:
- pkg/membership/inflight_test.go: TestInflightLimiter{Basics,Disabled,Concurrent}
cover golden/edge/error/disabled/over-release and a -race concurrency invariant
(inFlight returns to 0, never exceeds cap).
- pkg/membership/dos_concurrency_test.go: TestReaudit_DoSConcurrency fires 40
concurrent 16 MiB uploads from distinct IPs (the multi-IP shape) against a 48 MiB
test cap -> 200=3 503=37, RSS delta ~93 MiB (bound 256 MiB), inFlight()==0, and a
fresh upload still 200. With the limiter disabled the test fails (200=40 503=0),
confirming it is a real regression guard.
- CGO_ENABLED=0 go build ./... && go vet ./... && go test -count=1 ./... green;
CGO_ENABLED=1 go test -race ./pkg/membership/ green.
Residual (documented): under enforce the body is buffered twice (auth verify +
handler), so real RSS is ~2x the reserved bytes; closing that fully means
streaming blobs to disk (overlaps H9 / issue 0002).
Refs: report 0006 N2, issue 0005c.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
client.processFrame verified a frame's signature only when one was present
(`info.Policy.SignMsgs && f.Sig != nil`). In a room whose policy REQUIRES
per-message signatures, an attacker with data-plane access could publish a raw
frame with Sig==nil and a forged Sender, and the receiver accepted it as
authentic because the verification block was skipped (audit N3, report 0006).
On a signed-but-cleartext room any peer that knows the subject could thus
impersonate any sender.
Fix: in a SignMsgs room a missing signature is itself a rejection. processFrame
now drops any frame with Sig==nil before attempting verification:
    if info.Policy.SignMsgs {
        if f.Sig == nil { return } // signature required but absent: drop
        // verify ...
    }
Non-signed rooms (ModeNATS) are unaffected: unsigned frames there are still
delivered, so the plain-NATS path is unchanged.
Verification (pkg/client/sig_nil_spoof_test.go, TestReaudit_SigNilSpoof):
- golden: a properly signed frame from a member is delivered.
- error : an unsigned frame with a forged Sender in a SignMsgs room is dropped
(the test fails with "SIG-NIL SPOOF: receiver accepted ..." when the fix is
reverted, confirming it is a real regression guard).
- edge : a non-signed room still delivers an unsigned frame.
- CGO_ENABLED=0 go build ./... && go vet ./... && go test -count=1 ./... green.
Refs: report 0006 N3, issue 0005b.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
govulncheck reported 16 reachable vulnerabilities (re-audit finding N1, report 0006):
14 in github.com/nats-io/nats-server/v2@v2.10.22 -- the embedded NATS server, which
is exposed to the internet in the chosen deployment -- and 2 in the Go standard
library (GO-2026-5039 net/textproto, GO-2026-5037 crypto/x509).
Changes:
- go get github.com/nats-io/nats-server/v2@v2.11.15 (covers all 14 server CVEs;
pulls nats.go v1.49.0, nkeys v0.4.15, jwt v2.8.1, klauspost/compress v1.18.4
and friends transitively).
- go directive 1.25.0 -> 1.26.4 so the toolchain ships the two stdlib fixes.
This is a go.mod/go.sum change justified purely by CVE remediation; it is the
explicit exception to the "do not touch deps" rule for a CVE bump.
Verification:
- CGO_ENABLED=0 go build ./... && go vet ./... && go test -count=1 ./... -> green,
including the 0003 multi-node cluster/JetStream e2e in pkg/embeddednats, so the
server bump did not break the cluster or the durable plane.
- govulncheck ./... -> "No vulnerabilities found" (0 reachable; the 13 that remain
are in required-but-not-called modules).
Refs: report 0006 N1, issue 0005a.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Cluster NATS routes (auth + mutual TLS), Store/blobstore interfaces with
replicated JetStream KV and Object Store backends, idempotent
migrate-to-kv with backup, client failover over seed/control-plane lists,
replicated nonce store (closes the multi-node replay hole), and the
per-subject membership ACL (audit H4 residual). All behind the
`decentralized` flag (off); single-node SQLite+disk behavior unchanged.
The multi-node deploy (0003f) is left to the human operator; the runbook is in report 0006.
Closes the residual the 0004 hardening deferred: the NATS authenticator
can now confine a registered peer to the subjects of the rooms it
belongs to, instead of letting any registered identity sub/pub on any
subject. The dynamic-membership reconnection model the audit named is
provided by client.RefreshSession.
pkg/busauth:
- verifyNkey factors out the shared nkey verification.
- NewNkeyAuthenticatorACL + PermissionsFunc: an authenticator that, after
authorizing, derives and RegisterUser()s per-subject permissions. A
derivation error denies the connection (fail closed).
pkg/membership:
- SubjectACLFor(store) maps a signing pubkey to the subjects it may use:
the subject of every room it belongs to, plus the client infrastructure
subjects (_INBOX.>, $JS.API.> for request/reply and the persisted plane).
pkg/client:
- RefreshSession() rebuilds the data-plane connection so the authenticator
re-derives permissions after a membership change (NATS freezes
permissions at connect time). It retains the seeds/options to reconnect;
active subscriptions are dropped and must be re-made (documented).
Tests (DoD: isolation + refresh):
- TestSubjectACLIsolation: alice (member of room.A) may sub/pub room.A but
is DENIED sub and pub on room.B (permissions violation), and never reads
bob's room.B traffic; bob never receives alice's cross-room publish.
- TestRefreshSessionGainsNewRoom: alice has no permission for room B until
she is added and calls RefreshSession; the reconnect grants the subject
and she then receives room B traffic.
Scope note: the per-subject ACL authenticator is opt-in (NewServer/
membershipd keep the open authenticator by default) and is wired in with
the decentralized boot path; auto-RefreshSession on every membership
change (fully transparent) remains for 0003f. Master behavior unchanged.
The per-process nonce cache breaks anti-replay under multi-node failover
(audit 0004): a request captured on one node can be replayed to a
DIFFERENT node whose local cache never saw the nonce, and is accepted.
This makes the nonce state shared so a replay is rejected cluster-wide.
pkg/membership:
- nonceStore is now an interface. The in-memory cache is renamed
memNonceCache (still the default, single-node behavior).
- kvNonceStore (new) claims each nonce with an atomic KV Create on a
shared bucket: first sight wins (accept), any later sight on any node
rejects (replay). A backend error fails CLOSED (reject), so a KV outage
never silently disables anti-replay. The bucket carries a TTL =
nonceTTL (2*clockSkew) so a key expires exactly when its replay window
closes; raw base64 nonces are mapped to KV-safe keys via sha256-hex.
- Server.UseReplicatedNonces(js, replicas) swaps the store on a node;
every node in a cluster calls it. NewServer still defaults to the
in-memory cache (master behavior unchanged).
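The first-sight-wins claim can be sketched against any create-if-absent store. Here memKV is an in-memory stand-in for the shared bucket and the creator interface is an assumed shape; the bucket TTL is not modelled.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"errors"
	"fmt"
	"sync"
)

// creator is the one operation the claim needs: create only if absent.
// This is an ASSUMED shape, not the JetStream KV interface.
type creator interface {
	Create(key string) error
}

var errExists = errors.New("key exists")

// memKV is an in-memory stand-in for the shared bucket.
type memKV struct {
	mu   sync.Mutex
	keys map[string]bool
	down bool // simulates a backend outage
}

func newMemKV() *memKV { return &memKV{keys: map[string]bool{}} }

func (m *memKV) Create(key string) error {
	m.mu.Lock()
	defer m.mu.Unlock()
	if m.down {
		return errors.New("backend unavailable")
	}
	if m.keys[key] {
		return errExists
	}
	m.keys[key] = true
	return nil
}

// nonceKey maps a raw nonce to a KV-safe key via sha256-hex.
func nonceKey(nonce string) string {
	sum := sha256.Sum256([]byte(nonce))
	return hex.EncodeToString(sum[:])
}

// claim accepts a nonce only on first sight. A replay and a backend error
// both reject, so an outage never disables anti-replay.
func claim(kv creator, nonce string) bool {
	return kv.Create(nonceKey(nonce)) == nil
}

func main() {
	shared := newMemKV()            // one bucket seen by both nodes
	fmt.Println(claim(shared, "n1")) // node A, first sight: true
	fmt.Println(claim(shared, "n1")) // node B, replay: false
	shared.down = true
	fmt.Println(claim(shared, "n2")) // backend error: false
}
```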
Test (DoD error path — the issue's cross-node replay case):
- TestReplicatedNonceRejectsCrossNodeReplay: two membershipd nodes share
one KV bucket; a request accepted (200) on node A, replayed with the
same ts+nonce to node B, is rejected (401) — and replaying to A again
is rejected too.
The client (issue 0003e, part 1) accepts a LIST of NATS seeds and a LIST
of control-plane URLs so a node loss is transparent.
pkg/client:
- Options.NatsServers: extra NATS seeds beyond the primary. The client
connects to the joined seed list with MaxReconnects(-1) +
RetryOnFailedConnect, so nats.go fails over to a surviving node when the
node a client is attached to dies, and can reconnect to a node that comes back.
- Options.CtrlURLs: extra control-plane endpoints. doJSON/putBlob/getBlob
now try each endpoint in order, falling over on a transport error to the
next (an HTTP response from any node is authoritative — every node
serves the same state under the KV store). newSignedRequest becomes
newSignedRequestTo(base, ...); each failover attempt mints a fresh nonce
(the signature covers method+path+ts+nonce+body, not the host), so a
retried request is never seen as a replay.
- ConnectedServer()/IsConnected(): observability for which node the data
plane is attached to, for ops and failover tests.
- New/Connect/NewWithOptions keep their signatures (a single URL = a
one-element list), so worker/chat/mobile/playground are unchanged.
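The control-plane failover loop can be sketched without any HTTP machinery. doWithFailover, sendFunc and the example URLs are hypothetical; the point is the order of attempts and the fresh nonce on each one.

```go
package main

import (
	"errors"
	"fmt"
)

// sendFunc performs one signed request against one base URL. It returns an
// error only for transport failures; any HTTP status counts as a response.
type sendFunc func(base, nonce string) (status int, err error)

// doWithFailover tries each endpoint in order and stops at the first one
// that answers. Every attempt mints a fresh nonce so a retry is never
// mistaken for a replay.
func doWithFailover(bases []string, newNonce func() string, send sendFunc) (int, error) {
	lastErr := errors.New("no control-plane endpoints configured")
	for _, base := range bases {
		status, err := send(base, newNonce())
		if err == nil {
			return status, nil
		}
		lastErr = err
	}
	return 0, lastErr
}

// counterNonce is a deterministic nonce source for the demo only; a real
// client would use random bytes.
func counterNonce() func() string {
	n := 0
	return func() string {
		n++
		return fmt.Sprintf("nonce-%d", n)
	}
}

// deadFirst simulates the first node being unreachable and records the
// nonce used on every attempt.
func deadFirst(seen *[]string) sendFunc {
	return func(base, nonce string) (int, error) {
		*seen = append(*seen, nonce)
		if base == "https://a.example" {
			return 0, errors.New("connection refused")
		}
		return 200, nil
	}
}

func main() {
	var seen []string
	bases := []string{"https://a.example", "https://b.example"}
	status, err := doWithFailover(bases, counterNonce(), deadFirst(&seen))
	fmt.Println(status, err, seen) // 200 <nil> [nonce-1 nonce-2]
}
```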
Test (DoD edge — the issue's "kill node A" case):
- TestClientFailoverAcrossNodes: client A is seeded with two clustered nodes,
subscribes, and receives a cross-node message; the node A is attached to is
KILLED; A reconnects to the survivor and still receives messages, session intact.
Branch-by-abstraction for the blob store (issue 0003d): media ciphertext
can live in a replicated JetStream Object Store instead of local disk, so
a blob uploaded to one node survives a node loss and is reachable from
any node.
pkg/blobstore:
- Store is now an interface (Put/Get/Has). The filesystem backend is
renamed diskStore and stays the default: New(dir) returns it.
- objectStore (new) implements Store over a NATS Object Store bucket with
a configurable replication factor (R1..R5), matching the KV store's
R1->R3 rollout. Content-addressing (sha256-hex) is identical, so the
wire contract is unchanged.
pkg/membership:
- Server.blobs and NewServer take the blobstore.Store interface instead
of the concrete type; no behavior change with the disk default.
Tests (DoD: golden + edge + contract):
- TestObjectStoreRoundTrip: put/get/has + content-addressed dedup.
- TestObjectStoreMissing: unknown hash is absent and unreadable.
- TestObjectStoreAddressMatchesDisk: the Object Store and disk backends
address identical bytes to the IDENTICAL hash (portable blob refs).
Like the KV store (0003b), wiring membershipd to select the Object Store
is deferred to the decentralized boot path (flag off); disk stays default.
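Why two backends yield portable blob refs follows from the address being a pure function of the bytes. A sketch, where the Store method signatures are assumed (only the names Put/Get/Has come from the text) and memStore stands in for either backend:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"errors"
	"fmt"
)

// Store uses ASSUMED signatures for the Put/Get/Has interface.
type Store interface {
	Put(data []byte) (hash string, err error)
	Get(hash string) ([]byte, error)
	Has(hash string) bool
}

// address is the content address: the sha256 of the bytes, hex-encoded.
func address(data []byte) string {
	sum := sha256.Sum256(data)
	return hex.EncodeToString(sum[:])
}

// memStore is an in-memory stand-in for a backend (disk or Object Store).
type memStore map[string][]byte

func (m memStore) Put(data []byte) (string, error) {
	h := address(data)
	m[h] = append([]byte(nil), data...) // same bytes overwrite: dedup
	return h, nil
}

func (m memStore) Get(hash string) ([]byte, error) {
	b, ok := m[hash]
	if !ok {
		return nil, errors.New("blob not found")
	}
	return b, nil
}

func (m memStore) Has(hash string) bool {
	_, ok := m[hash]
	return ok
}

func main() {
	var a, b Store = memStore{}, memStore{}
	h1, _ := a.Put([]byte("ciphertext"))
	h2, _ := b.Put([]byte("ciphertext"))
	fmt.Println(h1 == h2, len(h1), a.Has(h1), b.Has("missing")) // true 64 true false
}
```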
The one-time data move decentralization needs (issue 0003c): copy the
entire control-plane state from the local SQLite database into the
replicated JetStream KV buckets, with a backup taken first.
pkg/membership:
- Snapshot / SealedKeyRecord: a backend-agnostic dump of the whole
control plane (rooms with their real epoch, members, every sealed-key
row across epochs, users with status).
- (*sqliteStore).ExportSnapshot and (*jetstreamStore).ExportSnapshot read
a full Snapshot from each backend; (*jetstreamStore).importSnapshot
writes one with raw Puts (preserving epoch/status, not resetting to
defaults) so the migration is faithful and idempotent (every write is
an overwrite, so re-running converges).
- MigrateSQLiteToKV orchestrates export -> import; BackupSQLite makes a
consistent copy via SQLite's VACUUM INTO before any migration.
cmd/membershipd:
- `membershipd migrate-to-kv --db <path> --nats-url <url> [--replicas N]
[--ca <cert>] [--no-backup]` backs up the SQLite file, connects to the
cluster's NATS, and migrates. Dispatched on the host like `user`.
Tests (DoD: golden + edge + parity):
- TestMigrateSQLiteToKVParity: seed a representative SQLite (two rooms,
one rekeyed to epoch 2, members, a revoked user); after migration the
KV ExportSnapshot equals the SQLite ExportSnapshot.
- TestMigrateSQLiteToKVIdempotent: running the migration twice yields the
same KV state.
- TestBackupSQLiteCreatesConsistentCopy: the backup reopens with
identical data.
Plus a binary smoke test (seed user -> run server -> migrate-to-kv -> re-run):
backup written, 1 user migrated, second run identical.
Branch-by-abstraction for the control-plane store (issue 0003b), so the
membership state can move off process-local SQLite onto replicated
JetStream KV without rewriting callers and without breaking master.
pkg/membership:
- Store is now an interface (rooms/members/keys + user allowlist +
Close). The existing SQLite implementation is renamed sqliteStore and
stays the default: Open(path) still returns it. openSQLite keeps the
concrete type for internal callers (the 0003c migration).
- ErrNotFound is a storage-agnostic "no such record" sentinel; both
backends return it (the SQLite store maps sql.ErrNoRows to it). The
control plane now branches on ErrNotFound instead of sql.ErrNoRows, so
server.go no longer imports database/sql.
- jetstreamStore (new) implements Store over five replicated KV buckets:
rooms, members, rooms_by_member (reverse index for ListRoomsForEndpoint),
room_keys, users. Replication factor is configurable (R1..R5) for the
R1->R3 rollout. Every read is bounded by OpTimeout and IsAuthorized /
HasAdmin FAIL CLOSED on any backend error (a KV quorum loss denies,
never admits), per the audit's requirement for the decentralized store.
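The storage-agnostic sentinel can be sketched as a small mapping at the SQLite boundary. mapSQLErr and lookupMissing are hypothetical helper names; only ErrNotFound and sql.ErrNoRows come from the text.

```go
package main

import (
	"database/sql"
	"errors"
	"fmt"
)

// ErrNotFound is the backend-neutral "no such record" sentinel.
var ErrNotFound = errors.New("membership: not found")

// mapSQLErr translates the SQL driver's miss into the sentinel so callers
// never need to import database/sql. Other errors pass through unchanged.
func mapSQLErr(err error) error {
	if errors.Is(err, sql.ErrNoRows) {
		return ErrNotFound
	}
	return err
}

// lookupMissing simulates a SQLite lookup that finds no row.
func lookupMissing() error { return mapSQLErr(sql.ErrNoRows) }

func main() {
	err := lookupMissing()
	fmt.Println(errors.Is(err, ErrNotFound)) // true
	fmt.Println(mapSQLErr(nil) == nil)       // true
}
```

The control plane can then branch on errors.Is(err, ErrNotFound) regardless of which backend produced the miss.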
dev/feature_flags.json:
- Add the `decentralized` flag (OFF): sqliteStore default while off,
jetstreamStore behind it. The membershipd boot wiring that selects the
KV store is deliberately deferred to 0003e/0003f (the embedded-NATS
authenticator<->store bootstrap is part of the session/deploy redesign);
OFF keeps the single-node SQLite control plane unchanged.
Tests (DoD: golden + edges + error path):
- TestJetStreamStoreRoomsCRUD: encrypted room + owner + invited member
round-trip through every room/member/key method, including latest-epoch
resolution and rekey.
- TestJetStreamStoreUsers: add/get/authorize/list/revoke + admin gate,
with case-insensitive key normalization and duplicate rejection.
- TestJetStreamStoreNotFound: ErrNotFound mapping for misses.
- TestJetStreamStoreIsAuthorizedFailClosed: NATS backend shut down ->
IsAuthorized and HasAdmin both DENY within the bounded timeout.
The full existing suite stays green: sqliteStore is unchanged behavior.
Add high-availability cluster support to the embedded NATS server
(issue 0003a, first phase of decentralization).
pkg/embeddednats:
- ServerConfig gains ServerName (unique per node, required by JetStream
RAFT) and an optional *ClusterConfig (cluster name, route host/port,
peer route URLs, shared-secret Username/Password, and a mutual-TLS
*tls.Config). applyClusterOpts maps it onto server.Options.Cluster +
Routes. Nil Cluster keeps the legacy standalone server.
pkg/busauth:
- RouteTLSConfig builds the route layer's mutual-TLS config: the node
presents its CA-signed certificate AND verifies the peer's certificate
against the bus CA (RequireAndVerifyClientCert), reusing the issue-0001
CA. Routes authenticate NODES, never the client nkey authenticator.
cmd/membershipd:
- Cluster flags (--cluster-name/--server-name/--cluster-port/--routes/
--cluster-user/--cluster-pass/--route-tls-cert/-key/-ca) wire a node
into the cluster. validateClusterConfig refuses a public cluster
without a route secret and complete mutual route TLS, and rejects
partial route-TLS flags (all-or-nothing). splitRoutes parses the CSV.
Tests (DoD: golden + 2 edge + error path):
- TestClusterForwardsAcrossNodes: 2-node cluster forwards a client
subject from one node to a subscriber on the other.
- TestClusterThreeNodesForward: 3-node (HA shape) cross-node forwarding.
- TestClusterMutualTLSForwards: forwarding over mutual-TLS routes.
- TestClusterRejectsBadRouteAuth: wrong cluster password -> no route.
- TestClusterRejectsUnsignedNode: cert not signed by the bus CA -> no route.
- TestClusterConfigPolicy / TestSplitRoutes: boot-guard + CSV parsing.
Master stays green: standalone (no --cluster-name) is unchanged.
Issue 0004 (security hardening) done across 0004a-0004f. app.md version 0.5.0
with the capability growth log entry; dev/0004d-dataplane-acl.md documents the
chosen minimum-defense strategy for the NATS data plane and its residual limit
(per-subject ACL deferred to 0003). Full work report in
projects/message_bus/reports/0005-2026-06-07-unibus-security-hardening.md.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Owner of a created room must be the signer; the replay cache is populated only
after authorization (with bounded, O(expired) pruning); internal errors no
longer leak to clients.
TestAudit_OwnerSpoof: a body declaring a foreign owner endpoint or signing key
is 403; a self-owned create is 201.
TestAudit_NonceCachePoisonPreAuth: an unregistered identity's repeated nonce
still fails 'not authorized' (never 'replayed'), proving it was not cached, while
an authorized identity's replay is still rejected.
Nonce cache unit tests: prune-after-TTL and cap-bounded memory.
Three medium audit findings.
H6 (owner spoof): handleCreateRoom now binds the body's declared owner to the
authenticated signer — both the endpoint id and the signing key must be the
signer's — so a registered peer cannot create rooms in another identity's name.
Enforced only when an authenticated signer is present.
H7 (nonce-cache poison pre-auth): IsAuthorized now runs BEFORE the replay cache
is touched, so an unregistered identity (Ed25519 keys are free) can no longer
seed nonces into it. The cache is rewritten with O(expired) pruning (insertion
order equals expiry order under a constant TTL) instead of the previous O(n)
full-map scan under the mutex, plus a size cap with oldest-eviction. This is the
prerequisite the 0003 replicated nonce store builds on.
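The O(expired) pruning relies on the queue being ordered by expiry, so only the expired prefix is ever visited. A sketch under simplifying assumptions: times are plain Unix seconds supplied by the caller and assumed non-decreasing, the mutex is omitted, and nonceCache here is illustrative rather than the real type.

```go
package main

import "fmt"

type nonceEntry struct {
	nonce   string
	expires int64 // Unix seconds
}

// nonceCache keeps entries in insertion order, which equals expiry order
// when every entry gets the same TTL.
type nonceCache struct {
	ttl   int64
	max   int
	seen  map[string]struct{}
	order []nonceEntry
}

func newNonceCache(ttl int64, max int) *nonceCache {
	return &nonceCache{ttl: ttl, max: max, seen: map[string]struct{}{}}
}

// checkAndAdd returns false for a replayed nonce and true for a fresh one.
func (c *nonceCache) checkAndAdd(nonce string, now int64) bool {
	// Prune only the expired prefix: cost is proportional to what expired.
	i := 0
	for i < len(c.order) && c.order[i].expires <= now {
		delete(c.seen, c.order[i].nonce)
		i++
	}
	c.order = c.order[i:]

	if _, replay := c.seen[nonce]; replay {
		return false
	}
	// Size cap: evict the oldest entry to keep memory bounded.
	if len(c.order) >= c.max {
		delete(c.seen, c.order[0].nonce)
		c.order = c.order[1:]
	}
	c.seen[nonce] = struct{}{}
	c.order = append(c.order, nonceEntry{nonce: nonce, expires: now + c.ttl})
	return true
}

func main() {
	c := newNonceCache(10, 1000)
	fmt.Println(c.checkAndAdd("a", 0))  // true: first sight
	fmt.Println(c.checkAndAdd("a", 5))  // false: replay inside the window
	fmt.Println(c.checkAndAdd("a", 10)) // true: the entry expired and was pruned
}
```

The size cap trades a little replay coverage under flood for bounded memory, since an evicted nonce is forgotten before its window closes.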
H12 (error leak): internal store/blob errors are logged and replaced with a
generic client message via writeServerErr, so SQL fragments and filesystem paths
no longer reach the caller. Crafted 4xx messages (owner-sig, validation) are kept.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Audit H5 (High, public). The control plane was signed but plaintext, so a
network MITM could read all metadata (subjects, endpoints, public keys, sealed
keys, blob hashes, the social graph) and drop requests. Signing gives integrity,
not confidentiality.
- membershipd serves the control plane over TLS (ListenAndServeTLS, MinVersion
1.2) with the same CA-signed cert as the data plane when --tls-cert is set; the
fail-open guard already requires --bus-auth enforce alongside it.
- The client gets a separate Options.CtrlTLS so the HTTP client pins the bus CA,
independent of the NATS data-plane TLS. Connect now sets both planes' TLS from
the one CA and REFUSES a plaintext http:// control-plane URL when a CA is
provided, so metadata is never sent in the clear when TLS is expected.
Connect's signature is unchanged; callers (worker/chat --ca, mobile NewSession)
must pass an https:// control-plane URL when they pass a CA. Documented for the
deploy step.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
pkg/membership TestRequireEncryptedRoomsRejectsCleartext: cleartext create ->
403, encrypted -> 201, flag off -> cleartext allowed again.
pkg/client TestAudit_NoSubjectACL: under the public posture a ModeNATS room is
refused; bob (member) decrypts the secret; eve raw-subscribes to the subject off
the data plane and receives only ciphertext (non-empty AEAD nonce, no plaintext
substring) — closing the auditor's 'eve reads internal: salary numbers'.
Audit H4 (High). The embedded NATS has a single account with no per-subject
permissions, so any registered peer can subscribe to any subject — a cleartext
(ModeNATS) room's payload is readable by anyone who knows the subject.
A complete per-subject ACL derived from membership does not fit here: NATS
evaluates a connection's permissions once at connect time and never re-evaluates
them, but unibus clients connect-then-create/join-then-publish on one connection
(TestSecureBusEndToEnd). Static permissions would forbid the owner from
publishing to a room it just created; the dynamic reconnection model belongs to
the 0003 decentralization redesign. See dev/0004d-dataplane-acl.md.
Minimum defense implemented: Server.RequireEncryptedRooms (set by membershipd on
any non-loopback bind) refuses to create cleartext rooms, so every room on a
public deployment is end-to-end encrypted. Message CONTENT stays confidential
even with no subject isolation; residual traffic-metadata exposure is documented
and tracked for 0003.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Room metadata, member lists, room directories and sealed keys are now served
only to members of the room (and a sealed key only to its own endpoint),
closing the horizontal metadata leak.
Audit H3 (High). 'Authorized' meant 'registered in the allowlist', not 'member
of the room', so any registered peer could read another room's subject, its
full member list (every member's sign_pub + kex_pub), any endpoint's room
directory, and even another member's sealed key.
The middleware now carries the authenticated signer's endpoint id into the
handler via request context. Room handlers enforce membership:
- GET /rooms/{id} and /rooms/{id}/members require the signer to be a member;
- GET /rooms/{id}/key serves the sealed key only to its own endpoint
(endpoint == signer) and only to a member;
- GET /members/{endpoint}/rooms is restricted to the signer's own endpoint.
Authorization is skipped only when no authenticated signer is present (AuthOff
dev, or a soft-mode pass-through), preserving legacy/dev behavior. Internal
errors no longer echo store messages to the client on these paths.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Audit H2 (High). The binary defaulted to --bus-auth off, the NATS nkey
authenticator only turned on under enforce, and TLS was an independent flag.
Booting --bind 0.0.0.0 --tls-cert … without --bus-auth enforce left both
planes open while looking secure.
validateBootConfig is a pure guard, called right after flag parsing, that
log.Fatals on two insecure shapes:
- a non-loopback --bind without --bus-auth enforce, and
- --tls-cert/--tls-key without --bus-auth enforce.
An insecure public startup is now impossible (the process exits), so a
fail-open data plane never comes up for an unregistered client to reach.
TestAudit_FailOpenTLSWithoutAuth plus a full policy table cover golden
(public+enforce, dev loopback) and every refused shape.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>