The one-time data move decentralization needs (issue 0003c): copy the
entire control-plane state from the local SQLite database into the
replicated JetStream KV buckets, with a backup taken first.
pkg/membership:
- Snapshot / SealedKeyRecord: a backend-agnostic dump of the whole
control plane (rooms with their real epoch, members, every sealed-key
row across epochs, users with status).
- (*sqliteStore).ExportSnapshot and (*jetstreamStore).ExportSnapshot read
a full Snapshot from each backend; (*jetstreamStore).importSnapshot
writes one with raw Puts (preserving epoch/status, not resetting to
defaults) so the migration is faithful and idempotent (every write is
an overwrite, so re-running converges).
- MigrateSQLiteToKV orchestrates export -> import; BackupSQLite makes a
consistent copy via SQLite's VACUUM INTO before any migration.
cmd/membershipd:
- `membershipd migrate-to-kv --db <path> --nats-url <url> [--replicas N]
[--ca <cert>] [--no-backup]` backs up the SQLite file, connects to the
cluster's NATS, and migrates. Dispatched on the host like `user`.
Tests (DoD: golden + edge + parity):
- TestMigrateSQLiteToKVParity: seed a representative SQLite (two rooms,
one rekeyed to epoch 2, members, a revoked user); after migration the
KV ExportSnapshot equals the SQLite ExportSnapshot.
- TestMigrateSQLiteToKVIdempotent: running the migration twice yields the
same KV state.
- TestBackupSQLiteCreatesConsistentCopy: the backup reopens with
identical data.
Plus a binary smoke (seed user -> run server -> migrate-to-kv -> re-run):
backup written, 1 user migrated, second run identical.
Branch-by-abstraction for the control-plane store (issue 0003b), so the
membership state can move off process-local SQLite onto replicated
JetStream KV without rewriting callers and without breaking master.
pkg/membership:
- Store is now an interface (rooms/members/keys + user allowlist +
Close). The existing SQLite implementation is renamed sqliteStore and
stays the default: Open(path) still returns it. openSQLite keeps the
concrete type for internal callers (the 0003c migration).
- ErrNotFound is a storage-agnostic "no such record" sentinel; both
backends return it (the SQLite store maps sql.ErrNoRows to it). The
control plane now branches on ErrNotFound instead of sql.ErrNoRows, so
server.go no longer imports database/sql.
- jetstreamStore (new) implements Store over five replicated KV buckets:
rooms, members, rooms_by_member (reverse index for ListRoomsForEndpoint),
room_keys, users. Replication factor is configurable (R1..R5) for the
R1->R3 rollout. Every read is bounded by OpTimeout and IsAuthorized /
HasAdmin FAIL CLOSED on any backend error (a KV quorum loss denies,
never admits), per the audit's requirement for the decentralized store.
dev/feature_flags.json:
- Add the `decentralized` flag (OFF): sqliteStore default while off,
jetstreamStore behind it. The membershipd boot wiring that selects the
KV store is deliberately deferred to 0003e/0003f (the embedded-NATS
authenticator<->store bootstrap is part of the session/deploy redesign);
OFF keeps the single-node SQLite control plane unchanged.
Tests (DoD: golden + edges + error path):
- TestJetStreamStoreRoomsCRUD: encrypted room + owner + invited member
round-trip through every room/member/key method, including latest-epoch
resolution and rekey.
- TestJetStreamStoreUsers: add/get/authorize/list/revoke + admin gate,
with case-insensitive key normalization and duplicate rejection.
- TestJetStreamStoreNotFound: ErrNotFound mapping for misses.
- TestJetStreamStoreIsAuthorizedFailClosed: NATS backend shut down ->
IsAuthorized and HasAdmin both DENY within the bounded timeout.
The full existing suite stays green: sqliteStore is unchanged behavior.
Add high-availability cluster support to the embedded NATS server
(issue 0003a, first phase of decentralization).
pkg/embeddednats:
- ServerConfig gains ServerName (unique per node, required by JetStream
RAFT) and an optional *ClusterConfig (cluster name, route host/port,
peer route URLs, shared-secret Username/Password, and a mutual-TLS
*tls.Config). applyClusterOpts maps it onto server.Options.Cluster +
Routes. Nil Cluster keeps the legacy standalone server.
pkg/busauth:
- RouteTLSConfig builds the route layer's mutual-TLS config: the node
presents its CA-signed certificate AND verifies the peer's certificate
against the bus CA (RequireAndVerifyClientCert), reusing the issue-0001
CA. Routes authenticate NODES, never the client nkey authenticator.
cmd/membershipd:
- Cluster flags (--cluster-name/--server-name/--cluster-port/--routes/
--cluster-user/--cluster-pass/--route-tls-cert/-key/-ca) wire a node
into the cluster. validateClusterConfig refuses a public cluster
without a route secret and complete mutual route TLS, and rejects
partial route-TLS flags (all-or-nothing). splitRoutes parses the CSV.
Tests (DoD: golden + 2 edge + error path):
- TestClusterForwardsAcrossNodes: 2-node cluster forwards a client
subject from one node to a subscriber on the other.
- TestClusterThreeNodesForward: 3-node (HA shape) cross-node forwarding.
- TestClusterMutualTLSForwards: forwarding over mutual-TLS routes.
- TestClusterRejectsBadRouteAuth: wrong cluster password -> no route.
- TestClusterRejectsUnsignedNode: cert not signed by the bus CA -> no route.
- TestClusterConfigPolicy / TestSplitRoutes: boot-guard + CSV parsing.
Master stays green: standalone (no --cluster-name) is unchanged.
Audit H5 (Alto, public). The control plane was signed but plaintext, so a
network MITM could read all metadata (subjects, endpoints, public keys, sealed
keys, blob hashes, the social graph) and drop requests. Signing gives integrity,
not confidentiality.
- membershipd serves the control plane over TLS (ListenAndServeTLS, MinVersion
1.2) with the same CA-signed cert as the data plane when --tls-cert is set; the
fail-open guard already requires --bus-auth enforce alongside it.
- The client gets a separate Options.CtrlTLS so the HTTP client pins the bus CA,
independent of the NATS data-plane TLS. Connect now sets both planes' TLS from
the one CA and REFUSES a plaintext http:// control-plane URL when a CA is
provided, so metadata is never sent in the clear when TLS is expected.
Connect's signature is unchanged; callers (worker/chat --ca, mobile NewSession)
must pass an https:// control-plane URL when they pass a CA. Documented for the
deploy step.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Audit H4 (Alto). The embedded NATS has a single account with no per-subject
permissions, so any registered peer can subscribe to any subject — a cleartext
(ModeNATS) room's payload is readable by anyone who knows the subject.
A complete per-subject ACL derived from membership does not fit here: NATS
evaluates a connection's permissions once at connect time and never re-evaluates
them, but unibus clients connect-then-create/join-then-publish on one connection
(TestSecureBusEndToEnd). Static permissions would forbid the owner from
publishing to a room it just created; the dynamic reconnection model belongs to
the 0003 decentralization redesign. See dev/0004d-dataplane-acl.md.
Minimum defense implemented: Server.RequireEncryptedRooms (set by membershipd on
any non-loopback bind) refuses to create cleartext rooms, so every room on a
public deployment is end-to-end encrypted. Message CONTENT stays confidential
even with no subject isolation; residual traffic-metadata exposure is documented
and tracked for 0003.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Audit H2 (Alto). The binary defaulted to --bus-auth off, the NATS nkey
authenticator only turned on under enforce, and TLS was an independent flag.
Booting --bind 0.0.0.0 --tls-cert … without --bus-auth enforce left both
planes open while looking secure.
validateBootConfig is a pure guard, called right after flag parsing, that
log.Fatals on two insecure shapes:
- a non-loopback --bind without --bus-auth enforce, and
- --tls-cert/--tls-key without --bus-auth enforce.
An insecure public startup is now impossible (the process exits), so a
fail-open data plane never comes up for an unregistered client to reach.
TestAudit_FailOpenTLSWithoutAuth plus a full policy table cover golden
(public+enforce, dev loopback) and every refused shape.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Pre-auth DoS hardening (audit H1, Critical). The control-plane middleware
read the request body with io.ReadAll before authenticating and with no size
cap, so an unauthenticated peer could force the server to buffer an arbitrary
body in RAM (the auditor sent 400 MB and watched RSS climb to ~898 MB).
- ServeHTTP now caps the buffered body before reading: a per-route ceiling
(1 MiB JSON, 16 MiB /blobs) rejects an over-declared Content-Length outright
and wraps the body in http.MaxBytesReader so a lying/chunked sender trips at
the ceiling instead of unbounded.
- handlePutBlob maps the MaxBytesReader cutoff to 413 in every auth mode.
- Per-IP token-bucket rate limiter (golang.org/x/time/rate, already in the
module graph) sheds floods before auth or body reads. Loopback dev stacks are
unaffected (burst >> any single client's rate). Kept in-package as transport
glue, not promoted to the registry, mirroring the nonceCache decision in 0003.
- membershipd sets http.Server.MaxHeaderBytes and ReadHeaderTimeout.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Opens the store before NATS so the authenticator can consult IsAuthorized.
Under --bus-auth enforce the embedded NATS gets the nkey authenticator (only
allowlisted identities connect); --tls-cert/--tls-key make it present the
server certificate and require TLS. External NATS manages its own auth/TLS.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
client.Connect is the single migration seam: a non-empty caPath connects with
TLS pinned to the bus CA plus nkey auth (matching enforce + bus-tls), an empty
caPath keeps the legacy plaintext dev connection; control-plane requests are
signed either way. worker and chat gain a --ca flag; the gomobile NewSession
gains a caPath parameter so the Android app bundles ca.crt and connects
securely. Every peer now flows through one code path.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Maps off|soft|enforce to membership.AuthMode and wires it into NewServer.
Defaults to off so existing deployments are unaffected until the operator
opts into the rollout.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Local administration surface for the user allowlist, dispatched before the
server flag set parses os.Args. It opens the SQLite store directly with no
network or auth: running on the bus host is trusted by design, which is how
the first admin is seeded (breaking the chicken-egg of needing an admin to
add an admin). Validates that sign-pub is a 32-byte Ed25519 key in hex and
tolerates the sign-pub positional appearing before or after --db.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add a --bind flag (default 127.0.0.1) to membershipd that controls which
network interface both the control-plane HTTP API and the embedded NATS data
plane listen on. Use 0.0.0.0 to expose the stack to the LAN so remote peers
(phones, other PCs) can connect; keep the default for a loopback-only dev stack.
embeddednats gains StartHost(storeDir, host, port) for explicit interface
control; Start stays a backward-compatible wrapper (host "" = nats default
0.0.0.0) so the playground and tests are untouched.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>