The per-process nonce cache breaks anti-replay under multi-node failover
(audit 0004): a request captured on one node can be replayed to a
DIFFERENT node whose local cache never saw the nonce, and is accepted.
This makes the nonce state shared so a replay is rejected cluster-wide.
pkg/membership:
- nonceStore is now an interface. The in-memory cache is renamed
memNonceCache (still the default, single-node behavior).
- kvNonceStore (new) claims each nonce with an atomic KV Create on a
shared bucket: first sight wins (accept), any later sight on any node
rejects (replay). A backend error fails CLOSED (reject), so a KV outage
never silently disables anti-replay. The bucket carries a TTL =
nonceTTL (2*clockSkew) so a key expires exactly when its replay window
closes; raw base64 nonces are mapped to KV-safe keys via sha256-hex.
- Server.UseReplicatedNonces(js, replicas) swaps the store on a node;
every node in a cluster calls it. NewServer still defaults to the
in-memory cache (master behavior unchanged).
Test (DoD error path — the issue's cross-node replay case):
- TestReplicatedNonceRejectsCrossNodeReplay: two membershipd nodes share
one KV bucket; a request accepted (200) on node A, replayed with the
same ts+nonce to node B, is rejected (401) — and replaying to A again
is rejected too.
Three medium audit findings.
H6 (owner spoof): handleCreateRoom now binds the body's declared owner to the
authenticated signer — both the endpoint id and the signing key must be the
signer's — so a registered peer cannot create rooms in another identity's name.
Enforced only when an authenticated signer is present.
H7 (nonce-cache poison pre-auth): IsAuthorized now runs BEFORE the replay cache
is touched, so an unregistered identity (Ed25519 keys are free) can no longer
seed nonces into it. The cache is rewritten with O(expired) pruning (insertion
order equals expiry order under a constant TTL) instead of the previous O(n)
full-map scan under the mutex, plus a size cap with oldest-eviction. This is the
prerequisite the 0003 replicated nonce store builds on.
H12 (error leak): internal store/blob errors are logged and replaced with a
generic client message via writeServerErr, so SQL fragments and filesystem paths
no longer reach the caller. Crafted 4xx messages (owner-sig, validation) are kept.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Audit H3 (Alto). 'Authorized' meant 'registered in the allowlist', not 'member
of the room', so any registered peer could read another room's subject, its
full member list (every member's sign_pub + kex_pub), any endpoint's room
directory, and even another member's sealed key.
The middleware now carries the authenticated signer's endpoint id into the
handler via request context. Room handlers enforce membership:
- GET /rooms/{id} and /rooms/{id}/members require the signer to be a member;
- GET /rooms/{id}/key serves the sealed key only to its own endpoint
(endpoint == signer) and only to a member;
- GET /members/{endpoint}/rooms is restricted to the signer's own endpoint.
Authorization is skipped only when no authenticated signer is present (AuthOff
dev, or a soft-mode pass-through), preserving legacy/dev behavior. Internal
errors no longer echo store messages to the client on these paths.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Adds the bus-auth rollout (off|soft|enforce) to the control plane. The
middleware verifies an Ed25519 request signature over CanonicalRequest
(method, request-URI, ts, nonce, sha256(body)), checks the timestamp is
within +/-30s, rejects replayed nonces via an in-memory TTL cache (60s), and
requires the signer to be an active user in the allowlist. soft logs
rejections but lets requests through so clients can migrate without an
outage; off is the legacy no-op default. /healthz is exempt so health probes
work before any identity exists. CanonicalRequest is exported as the single
source of truth shared with the client.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>