104 Commits

Author SHA1 Message Date
agent 5ea8fa1c20 feat(web): wire the SPA to the live bus via the gateway (drop mock)
Replace the mock data source with a real data layer that talks to the webgw
gateway over REST + SSE. The UI components keep their look and props; only
where the data comes from changed.

- src/api.ts: the single repository layer. fetch wrappers (same-origin cookie)
  for login/logout/me and rooms list/create/join/send, plus streamRoom() which
  opens an EventSource and yields each decrypted message. Wire->UI mappers
  (roomFromWire, messageFromWire).
- src/types.ts: add the gateway wire shapes (MeInfo, RoomWire, MsgWire) next to
  the existing UI types.
- App.tsx: probe /api/me on mount to resume an existing session; otherwise show
  Login. Logout calls the gateway.
- Login.tsx: the password field now unlocks the gateway session (operator
  passphrase); shows a basic error and a loading state. Wallet-per-browser is
  phase 2.
- ChatShell.tsx: load rooms from /api/rooms with loading / empty / error states;
  same Flex layout.
- ChatPanel.tsx: stream messages over SSE for the active room (dedup by id),
  composer sends through the gateway; no optimistic insert (the peer's own echo
  returns over SSE with the real frame id).
- vite.config.ts: dev proxy /api (REST + SSE) -> the gateway on :8481.

mock.ts is left untouched (no longer imported) to avoid churn with the parallel
styling work on master.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-07 21:14:19 +02:00
agent fb8a03cf0c feat(webgw): web gateway peer (REST + SSE) for the chat SPA
Add cmd/webgw: a single Go binary that holds the operator's bus identity,
connects to the bus as a real authenticated peer (pkg/client), and exposes a
small REST + SSE API the browser consumes. The browser never signs, never
speaks NATS, and never sees a private key.

Endpoints (all under /api, gated by a session cookie except login):
  POST /api/login            unlock a session with the operator passphrase
  POST /api/logout
  GET  /api/me               operator identity the gateway acts as
  GET  /api/rooms            ListMyRooms
  POST /api/rooms            CreateRoom (default policy: encrypted+persisted+signed)
  POST /api/rooms/{id}/join  Join (fetch room key)
  POST /api/rooms/{id}/send  Publish (sealed + signed by the peer)
  GET  /api/rooms/{id}/stream  SSE of decrypted frames (history then live)

Design notes:
- One fan-out hub per room: a single bus subscription is multiplexed to N SSE
  clients, avoiding the per-(room,endpoint) durable-consumer contention that
  multiple Subscribe calls would cause.
- Posture seam mirrors unibus_admin/clientcheck: empty --ca = plaintext dev,
  non-empty = TLS+nkey on both planes; RefreshSession after a membership change
  only under the secured (ACL) posture.
- Identity loaded from `pass` or a 0600 file, held only in memory.
- Session auth: passphrase compared in constant time; opaque HttpOnly cookie
  so EventSource (which cannot set headers) can authenticate the stream.

TRUST MODEL: room content stays end-to-end encrypted on the bus. The gateway
reads plaintext only because it acts AS the operator's client — a legitimate
member of each room holding the room key. The per-browser wallet (WebCrypto)
that moves decryption into the browser is phase 2.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-07 21:14:08 +02:00
egutierrez b4f3118e85 Merge quick/users-http-admin: HTTP admin-only users API + client methods (report 0014) 2026-06-07 20:46:44 +02:00
egutierrez e9053169da Merge quick/0011-deploy-gaps: live user-add --store kv + clientcheck E2E + runbook fixes (report 0012) 2026-06-07 20:46:44 +02:00
Egutierrez b983e43090 docs(0007): spec encryption-at-rest del control plane (JetStream/SQLite en disco) 2026-06-07 20:34:35 +02:00
egutierrez b379730225 docs(app): document users HTTP admin model, bump 0.10.0
Add a gotcha describing the unified-storage model (the server writes
users to the same store/KV as rooms), the admin-only HTTP surface, and
the CLI-seeds-admin-#0 bootstrap. Bump the version 0.9.0 -> 0.10.0 and
add the capability growth log entry for the new HTTP admin users API.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-07 20:32:05 +02:00
egutierrez 450ca01baf feat(membership,client): HTTP admin-only users API
Close the last control-plane asymmetry: rooms had a signed HTTP surface
but users were only manageable via the local CLI or direct store access.
Add admin-only HTTP endpoints, symmetric with rooms, executed against the
same privileged store the server already serves (SQLite single-node, the
replicated JetStream KV in cluster) — no new KV connection, no internal
identity, so the admin panel can manage the allowlist by signing as an
admin instead of needing --db / direct KV access.

Endpoints (all behind requireAdmin, on top of the existing
signature+nonce+TLS+enforce middleware):
  - GET  /users                    list the full allowlist (incl. revoked)
  - POST /users                    add {sign_pub, handle, role}
  - POST /users/{signpub}/revoke   revoke (status flip, no hard delete)

requireAdmin is default-deny with no dev relaxation: it allows a request
only when the authenticated signer is confirmed by the store as an active
admin; any other case (no signer, non-admin, revoked, store error) is 403,
fail-closed. The request context now also carries the signer's sign_pub
hex, because the endpoint id is a one-way hash of the key and cannot be
reversed to look the signer up in the allowlist.

Validation/idempotency mirror the CLL: sign_pub must be 64-hex, role must
be admin|member (empty defaults to member), re-adding an existing key is a
409 that leaves the row untouched. The hex check is unified into
membership.ValidateSignPubHex, reused by the CLI and the handlers.

pkg/client gains ListUsers/AddUser/RevokeUser (flat UserInfo type) signed
via doJSON, so the panel plugs in directly.

Tests: non-admin -> 403 on all three endpoints; admin add->list->revoke
roundtrip; validation (400 hex, 400 role, 409 re-add, row untouched); plus
a client test against an embedded membershipd under enforce.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-07 20:31:57 +02:00
egutierrez e1a7402ff1 chore: bump unibus to 0.9.0 (live user-add + clientcheck)
New capability membershipd user add --store kv against a live cluster plus
cmd/clientcheck end-to-end verification (issue 0011 gaps, report 0012). Adds
the capability growth log entry.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-07 19:41:56 +02:00
egutierrez ce72131ddf docs(cluster): correct runbook + wire --internal-id-file into deploy
Corrections learned from the real 0011 deploy:
- Bring up: the "start magnus alone and verify healthz" order deadlocks — a
  lone node of a 3-node cluster has no meta-group quorum and never serves
  healthz until a second node joins. Document a quorum-forming start and that
  a node never self-serves.
- Replication: R1 is an unusable SPOF (all six control-plane buckets on one
  node) and the cold start only converges with the three cold-start fixes;
  go straight to R3 once the cluster forms.
- Add a "user add --store kv" section: the live user-add path that replaces
  stop-seed-restart, with its security model and idempotency/HA/no-delete
  semantics.
- Topology: real IPs, ROUTE_NETWORK=public (no WireGuard mesh exists).
- Chaos test: mark the data-plane client + failover proofs as validated (0012).

Deploy machinery now emits the persisted internal identity: the unit gains
--internal-id-file ${INTERNAL_ID_FILE} and deploy-cluster.sh writes
INTERNAL_ID_FILE into each node's cluster.env, so a fresh deploy enables the
live user-add path on every node.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-07 19:41:56 +02:00
egutierrez 3aa5a2c9a9 feat(clientcheck): end-to-end client verification (E2E room + failover)
The 0011 chaos test validated only the control plane (healthz + leader
failover + KV readable with 2/3); it never connected an authenticated bus
client to the data plane. cmd/clientcheck is a reusable verification tool: it
connects with a real identity (nkey + TLS on both planes, multi-node seed
lists), creates an ephemeral E2E room (encrypted + signed, no durable stream),
and either publishes N messages and asserts all come back decrypted (golden)
or publishes a counter for a duration while logging the attached node (loop),
so stopping a node mid-run shows the client fail over to a survivor and keep
receiving with quorum 2/3.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-07 19:41:56 +02:00
egutierrez 02c2004ebd feat(membershipd): user add/list/revoke --store kv against a live cluster
Closes the most valuable 0011 deploy gap: adding users to the running
cluster's replicated allowlist with no stop-seed-restart. Under enforce the
per-subject ACL confines every bus user to its own rooms, so no ordinary
identity may write the control-plane KV buckets; the only identity the
authenticator grants full JetStream permissions is membershipd's internal
service identity.

- main.go: --internal-id-file persists that identity (load-or-create, 0600)
  instead of a fresh ephemeral key, so the same nkey is available out of
  process. Empty keeps the ephemeral default (single-node/dev unchanged).
- users_kv.go: connectKVStore loads the persisted identity, presents its
  nkey (recognized as internal -> full perms), opens the KV store and
  writes. Defaults assume an on-node loopback invocation; a remote target
  without --ca is refused (allowlist must not travel cleartext, audit N6).
  Prints KV_UNIBUS_users replication (followers_current) after a write.
- users_cli.go: --store kv on add/list/revoke. Re-adding a key is an explicit
  ErrUserExists (no silent overwrite / role flip); revoke is a status flip.
- pkg/client: LoadIdentity (load-only) extracted from LoadOrCreateIdentity,
  preserving its "corrupt file is an error, not silently regenerated" guard.
- kv_useradd_test.go: golden write under enforce, idempotency, unreachable
  endpoint, and remote-without-CA refusal against an embedded node.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-07 19:41:38 +02:00
egutierrez ff580ac031 Merge quick/cluster-coldstart-fixes: 3-node cluster cold-start fixes + real topology 2026-06-07 18:56:28 +02:00
egutierrez 9fbff79df4 chore(deploy): fill cluster nodes.env with the real 3-node topology
Set magnus's public IP (135.125.201.30) and switch ROUTE_NETWORK to "public":
the three nodes have no WireGuard mesh (homer/datardos do not even have wg
installed), so server-to-server routes go over the public IPs, still protected
by the separate cluster route CA (mutual TLS). KV_REPLICAS is raised to 3 now
that the cluster runs at R3.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-07 18:56:28 +02:00
egutierrez 33746d9962 fix(cluster): make the JetStream control-plane survive a cold multi-node start
Bringing up the 3-node cluster from clean stores never converged: every node
looped on `open KV bucket "UNIBUS_rooms" (replicas=1): context deadline exceeded`.
Three independent defects in the clustered bootstrap path, none of which surface
on a single node (where JetStream is ready instantly), caused it:

1. embeddednats: route connection pooling (nats-server 2.10 default pool of 3)
   churned with "duplicate route"/"client closed" reconnects on the small cluster,
   interrupting the meta-group RAFT heartbeats and forcing perpetual leader
   re-elections. Set Cluster.PoolSize = -1 (single route per peer).

2. embeddednats: the cluster nodes are Docker hosts, so NATS advertised the docker
   bridge IPs (172.x / 10.0.x) to peers, which then tried to dial those private,
   mutually-unreachable addresses. Set Cluster.NoAdvertise = true so only the
   explicit public-IP routes are used. Also added a UNIBUS_NATS_DEBUG env toggle
   (off by default) that enables the embedded server's logger and loopback
   monitoring port for debugging the route/meta layer.

3. membership.OpenJetStream: a KV op is a NATS request/reply; on a cold cluster the
   op was published once, before the node had contact with the meta leader, so the
   request was dropped and the single long-context call just blocked until timeout.
   Retry each bucket op with short per-attempt contexts until it succeeds or an
   overall bootstrap budget (120s) is exhausted, so it lands once the meta settles.

With these the cluster forms cleanly, creates the KV buckets, scales R1->R3 in
place, and survives loss of one node (quorum 2/3). Verified on magnus+homer+datardos.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-07 18:56:28 +02:00
agent caf005f04b feat(web): frontend v1 — login (handle+contraseña), sidebar rooms+buscador, chat estilo Element
SPA React 19 + Vite + Mantine v9 en modo oscuro (acento índigo), datos mock para
iterar el diseño antes de cablear el gateway. Login con identidad + contraseña
(la contraseña desbloqueará la identidad Ed25519 cifrada en el dispositivo).
Sidebar: avatar de usuario, buscador (rooms/usuarios/mensajes) y lista de rooms
con candado E2E / hash cleartext / badges de no leídos. Panel de chat estilo
Element (avatar+nombre+hora+texto) con composer interactivo.
2026-06-07 17:57:50 +02:00
agent 9787c218ac chore: remove experimental frontends (web, android, playground, mobile)
Limpieza de los frontends de prueba (SPA React, app Kotlin, gateway playground,
binding gomobile) tras la fase de exploración. El bus (cmd/membershipd + pkg/*)
queda intacto y verde. Empezamos un frontend web nuevo desde cero, construido
de forma incremental. Todo lo borrado permanece en el historial git por si hay
que recuperar algo.
2026-06-07 17:38:07 +02:00
egutierrez 926b8e96af chore(0006): bump unibus to 0.8.0, close issue 0006 (cluster hardening + wiring)
All seven phases (0006a–0006g) merged: blockers N3 (replicated nonce) and N2
($JS.API.> KV leak) closed, decentralized KV store wired (--store kv), homogeneous
cluster posture enforced (N1), RefreshSession in all clients (N4), the lows
(secret out of argv, migrate guard, R1/CA docs), and the 3-node deploy material.

Full suite + every audit-0008 attack regression green; govulncheck 0 reachable.
See report 0009.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-07 17:33:03 +02:00
egutierrez ae39e35fb4 Merge issue/0006g-deploy: cluster deploy material (magnus+homer+datardos, R3 HA) 2026-06-07 17:31:13 +02:00
egutierrez 48a3d6be33 docs(0006g): cluster deploy material for magnus+homer+datardos (R3 HA)
Parameterized, NO-VPS-touched material to bring up unibus as a 3-node cluster.
The authoring agent ran none of it on a host; every remote-changing step is
marked HUMAN and deploy-cluster.sh defaults to a dry run.

deploy/cluster/:
- nodes.env — topology (cluster name, ports, per-node rows). Public IPs known
  (homer 141.94.69.66, datardos 51.91.100.142) pre-filled; magnus public IP and
  all WireGuard IPs are <PLACEHOLDER> for the human; scripts refuse to run while
  any remain.
- generate-cluster-certs.sh — mints a SEPARATE cluster route CA + a route cert per
  node (server+clientAuth, mutual routes) and a data-plane server cert per node
  signed by the reused client CA (../tls/ca.*); SAN = public + WG + hostname.
- membershipd-cluster.service — one unit, parameterized per node via
  /opt/unibus/cluster.env: enforce + per-subject ACL + TLS + --store kv,
  --cluster-pass-file (secret out of argv), Restart=always.
- deploy-cluster.sh — cross-build linux/amd64, generate each node's cluster.env
  (routes to the other two on the WG mesh, no userinfo), rsync + install (only
  with --yes); staggered start is manual.
- README.md — runbook: prerequisites, loopback bootstrap to seed the first admin
  into the KV (works around the user-CLI/KV chicken-and-egg), staggered bring-up,
  verify posture+quorum, scale R1->R3 in place, and the chaos test (left to 0003f
  on the real VPS).
- .gitignore — out/, build/, secrets/, *.key never committed.

bash -n passes on both scripts; go build/test unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-07 17:31:13 +02:00
egutierrez 24ff45ca7e Merge issue/0006f-lows: cluster secret out of argv + migrate guard + docs (audit 0008 lows) 2026-06-07 17:24:46 +02:00
egutierrez b8201a82cd fix(0006f): cluster secret out of argv, migrate-to-kv TLS guard, R1/CA docs (audit 0008 lows)
Low-severity cluster hardening from audit 0008:

- Route secret out of argv (N1-low): --cluster-pass and a nats://user:pass@host in
  --routes are visible in ps/journald. New --cluster-pass-file and the
  UNIBUS_CLUSTER_PASS env var (precedence file > env > flag); the resolved secret
  guards the route layer and is injected into bare --routes entries
  (injectRouteCreds), so peers can be listed as nats://host:6250 with no secret in
  argv. The legacy --cluster-pass stays for dev/compat.
- migrate-to-kv confidentiality (N6): refuse a remote --nats-url without --ca (the
  allowlist would travel cleartext); loopback targets are exempt (isLoopbackURL).
- Docs (N1 route CA, N3 DoS): deploy/README gains a Clustering section — use a
  SEPARATE cluster CA for routes (not the client CA), keep the secret out of argv,
  run migrate-to-kv loopback/TLS only, and R1 is a SPOF of auth (not HA); R3
  quorum is real HA. The generated cert material lives in deploy/cluster/ (0006g).

Tests:
- TestResolveClusterPass (file > env > flag precedence; missing file errors),
- TestInjectRouteCreds (injects only into userinfo-less routes; preserves overrides),
- TestIsLoopbackURL (loopback vs remote vs malformed).

CGO_ENABLED=0 go build/vet/test green; govulncheck 0 reachable.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-07 17:24:46 +02:00
egutierrez 3a33656cac Merge issue/0006e-refresh: RefreshSession in all clients (audit 0008 N4) 2026-06-07 17:21:14 +02:00
egutierrez 2f5b372a80 fix(0006e): call RefreshSession after membership changes in all clients (audit 0008 N4)
A secured bus freezes per-subject permissions at connect time, so a peer that
creates or joins a room after connecting cannot pub/sub on it until it reconnects
(RefreshSession). No client called it, so under enforce+ACL the demos failed
closed — pushing the operator to disable the ACL (a security regression at the
operator's discretion).

Wire the membership-change contract into every client:
- cmd/worker: RefreshSession after CreateRoom, before publishing.
- cmd/chat (simple): RefreshSession after CreateRoom+Join, before Subscribe.
- cmd/chat (encrypted demo): A refreshes after CreateRoom; B refreshes after the
  invite+join, both before pub/sub.
- local_files/bridge (gateway): RefreshSession after CreateRoom+Join, before Subscribe.
- mobile: new Session.RefreshSession wrapper + the contract documented for callers.

Contract (documented on the wrappers): after ANY membership change, call
RefreshSession BEFORE pub/sub on the new room (it drops active subs, so it must
precede Subscribe). On an unsecured/dev bus it is a harmless reconnect.

Test:
- TestClientCreateRoomRefreshPublishFlow: end-to-end under enforce+ACL, a peer
  creates a room, refreshes, invites a second peer who joins+refreshes+subscribes,
  and the publish is received — no manual intervention, the ACL stays on.

CGO_ENABLED=0 go build/vet/test green; govulncheck 0 reachable.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-07 17:21:14 +02:00
egutierrez 32bec75665 Merge issue/0006d-posture: homogeneous cluster posture + /healthz posture (audit 0008 N1) 2026-06-07 17:17:37 +02:00
egutierrez 9b96537aa6 fix(0006d): enforce homogeneous cluster posture + publish posture on /healthz (audit 0008 N1)
A cluster is only as secure as its weakest node: the data plane forwards every
subject between nodes, so one node running without enforced auth lets an
unauthenticated peer Subscribe(">") on it and harvest the traffic forwarded from
the ACL'd nodes.

- validateClusterConfig now takes the auth mode and REFUSES to join a cluster
  unless --bus-auth enforce, regardless of bind (a clustered node is a production
  node; there is no safe dev cluster without auth). This binary therefore cannot
  BE the weak node.
- Server.Posture {enforce,acl,tls,cluster,store} is published on /healthz (non
  secret operational metadata, probe stays unauthenticated) so a monitor or peer
  can detect a cluster member not running enforce+ACL+TLS — covering a peer that
  runs a tampered/old binary outside this node's control.

Tests:
- TestAttack0008_N1: a clustered node with --bus-auth off is refused; the same
  node with enforce + full route security is allowed.
- TestClusterConfigPolicy: extended with off/soft clustered cases (refused) and
  the mode parameter throughout.
- TestHealthExposesPosture: /healthz returns the posture booleans + store backend.

CGO_ENABLED=0 go build/vet/test green; govulncheck 0 reachable.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-07 17:17:37 +02:00
egutierrez 18ee7c469b Merge issue/0006c-kv-store: wire decentralized control-plane KV store (--store kv) 2026-06-07 17:14:20 +02:00
egutierrez e9ad719424 feat(0006c): wire the decentralized control-plane KV store (--store kv)
0003 built the JetStream KV store (jetstreamStore) but the binary never selected
it: membership.Open (SQLite) was hardcoded and OpenJetStream was only reached by
migrate-to-kv. This completes the wiring so a node actually serves its control
plane from the replicated KV.

- New flag --store kv|sqlite (default sqlite). kv opens the JetStream KV control
  plane over the privileged internal connection; sqlite is the unchanged baseline
  (branch-by-abstraction: the full suite's SQLite paths are untouched).
- Bootstrap cycle resolved with storeHolder: the authenticator consults the holder
  (fail-closed until set), so it can be built before the KV store exists. The KV
  store opens after NATS is up and is published into the holder. The only client
  that can connect in that window is the internal identity, which bypasses the
  store by key. In SQLite mode the store is set before StartServer, so the window
  does not exist.
- needJS now covers --store kv as well as --cluster-name; the JetStream client is
  shared by the KV store and the replicated nonce bucket.
- feature_flags.json: decentralized wiring documented as complete, realized via
  --store kv (opt-in per deploy; default stays sqlite).

Fail-closed preserved: jetstreamStore.IsAuthorized already denies on any backend
error; the holder denies while unset.

Tests:
- TestStoreHolderFailClosed: empty holder denies; serves after set.
- TestKVStoreBootstrapUnderEnforce: end-to-end decentralized boot — KV-seeded user
  authenticates over nkey under enforce; outsider denied.
- TestKVStoreDecentralizedConsistency: a room/user created on one node's KV store
  is visible to another's (ends the per-node SQLite divergence, audit 0008 N5).

CGO_ENABLED=0 go build/vet/test green; govulncheck 0 reachable.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-07 17:14:20 +02:00
egutierrez d1e1a478f8 Merge issue/0006b-kv-acl: scope JetStream ACL per-room (audit 0008 N2) 2026-06-07 17:08:54 +02:00
egutierrez cacf608fde fix(0006b): scope JetStream ACL per-room, close $JS.API.> KV leak (audit 0008 N2)
The client-infra grant was {"_INBOX.>", "$JS.API.>"}. The broad "$JS.API.>" let
any registered peer drive the whole JetStream API and read the control-plane KV
buckets (KV_UNIBUS_users/rooms/members/room_keys) and the object store directly
over NATS, bypassing the HTTP authorization (requireMember + own-endpoint
checks): a full leak of the allowlist, room graph and sealed-key metadata once the
decentralized control plane is active.

Fix: replace the broad grant with a CLOSED, per-room allow set.
- clientInfraSubjects shrinks to {"_INBOX.>", "$JS.API.INFO"} ($JS.API.INFO is
  account counters only — no room/user/key contents).
- SubjectACLFor now grants, per room the peer belongs to, the room subject plus
  the minimal JetStream API subjects of THAT room's stream (jsSubjectsFor:
  STREAM.*, CONSUMER.*, $JS.ACK scoped to UNIBUS_<roomID>).
- Because KV_UNIBUS_* and OBJ_UNIBUS_* are never a room stream, they fall outside
  the closed allow set and are denied by default. Clients reach blobs over the
  HTTP control plane, not the NATS object store, so OBJ needs no client grant.

roomStreamName mirrors pkg/client.streamName so the authorizer and the producer
never drift.

Tests:
- TestAttack0008_N2: eve (registered, member of no room) cannot bind the KV users
  bucket nor subscribe $KV.UNIBUS_users.> (permissions violation); golden: the
  room owner can still drive her OWN room stream's JetStream API; edge: eve cannot
  reach a foreign room's stream.
- TestReaudit_H4 residual note updated: the $JS.API.> leak it deferred is closed.

CGO_ENABLED=0 go build/vet/test green; govulncheck 0 reachable.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-07 17:08:54 +02:00
egutierrez a9c245d468 Merge issue/0006a-replicated-nonce: wire replicated nonce store (audit 0008 N3) 2026-06-07 17:02:19 +02:00
egutierrez 8b6a01d280 fix(0006a): wire replicated nonce store on clustered nodes (audit 0008 N3)
membershipd never called Server.UseReplicatedNonces, so every node kept a
per-process anti-replay cache and a signed request accepted on node A could be
replayed to node B (200+200). This wires the shared JetStream KV nonce bucket on
any clustered node, closing the cross-node replay hole.

Bootstrap: under enforce the service needs JetStream on its own embedded server,
but the data plane only accepts allowlisted clients. Resolved with an ephemeral
internal service identity the authenticator recognizes and grants full
permissions (NewNkeyAuthenticatorACLInternal), connected over the in-process
transport (no TLS/CA needed for the self-connection).

Hard rule: --cluster-name != "" means the replicated nonce bucket is mandatory;
if it cannot be created the node refuses to start (wireReplicatedNonces returns a
fatal error) rather than run insecurely. Standalone nodes keep the in-memory
cache unchanged (branch-by-abstraction: no JetStream dependency added).

Changes:
- busauth: NewNkeyAuthenticatorACLInternal + fullPermissions for the internal id.
- cmd/membershipd: connectInternalJS (in-process, privileged) / connectExternalJS;
  wireReplicatedNonces helper; main wires it when clustered; --kv-replicas flag.

Tests (regression of audit 0008 N3):
- TestAttack0008_N3: 2 clustered nodes share the bucket, cross-node replay -> 401.
- TestAttack0008_N3_StandaloneKeepsLocalCache: standalone needs no JetStream,
  same-node replay still 401.
- TestAttack0008_N3_ClusteredRequiresJetStream: clustered + no JetStream -> fatal.
- TestInternalConnPrivilegedUnderEnforce / ...OutsiderRejected: the privileged
  self-connection works under enforce and no other identity can claim it.

CGO_ENABLED=0 go build/vet/test green; govulncheck 0 reachable.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-07 17:02:19 +02:00
agent 5df99fa4c4 docs(issue): 0006 completar+endurecer cluster — wiring KV + N1-N6 auditoría 0008 + material deploy magnus/homer/datardos 2026-06-07 16:48:07 +02:00
egutierrez df3b62a601 Merge quick/0005-bump-close: unibus 0.7.0 + close issue 0005 2026-06-07 16:17:41 +02:00
egutierrez 6976537842 chore(0005): bump unibus to 0.7.0, close issue 0005 (hardening 2)
Hardening 2 (issue 0005, fases 0005a-0005e) cierra los hallazgos nuevos de la
re-auditoría red-team (report 0006): bump de nats-server + toolchain (16 CVEs ->
0 alcanzables), drop de frames sin firma en rooms SignMsgs, limiter global de
bytes en vuelo contra el DoS por concurrencia, TLS obligatorio en bind publico, y
cableado de la ACL por subject que cierra el wildcard metadata leak. Detalle por
fase en el capability growth log del app.md y en el report 0007.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-07 16:17:41 +02:00
egutierrez a4bbe8209b Merge issue/0005e-acl-wire: wire per-subject ACL into membershipd (audit H4) 2026-06-07 16:15:52 +02:00
egutierrez 87ef52cc80 fix(0005e): wire per-subject ACL into membershipd (close H4 wildcard metadata leak)
The per-subject data-plane ACL existed since 0003e (membership.SubjectACLFor +
busauth.NewNkeyAuthenticatorACL, unit-tested in TestSubjectACLIsolation) but the
binary never used it: cmd/membershipd installed the plain NewNkeyAuthenticator, so
in production a registered NON-member could open a raw NATS connection,
Subscribe(">"), and harvest every room's subject plus JetStream stream/advisory
activity (payload stayed E2E ciphertext, metadata leaked) — the re-audit's H4
vector (report 0006).

Fix:
- New busauth.PermissionsFromSubjects adapts a subject-deriving function into the
  PermissionsFunc the ACL authenticator expects (subjects granted as both the
  publish and subscribe allow set; a derivation error fails closed). It lives in
  busauth so membership stays free of the nats-server dependency.
- cmd/membershipd, under enforce, now installs
  NewNkeyAuthenticatorACL(store.IsAuthorized,
    PermissionsFromSubjects(membership.SubjectACLFor(store)))
  so every connection is confined to the subjects of the rooms it belongs to plus
  the client-infra subjects.
- pkg/membership/acl_test.go's helper now delegates to the production wiring
  (PermissionsFromSubjects) instead of a test-only reimplementation, so the tests
  exercise the real path.

Verification (pkg/membership/acl_test.go):
- TestReaudit_H4_WildcardMetadataLeak: a non-member's Subscribe(">") and any
  foreign-subject subscribe raise permission violations; the member still pub/subs
  her own room and the non-member captures nothing. With the plain authenticator
  (the pre-0005e wiring) the test fails ("wildcard metadata leak still open"),
  confirming the wiring is what closes it.
- TestSubjectACLIsolation / TestRefreshSessionGainsNewRoom still green.
- CGO_ENABLED=0 go build ./... && go vet ./... && go test -count=1 ./...  green.

Residual (documented): the client-infra grant includes "$JS.API.>", shared by all
peers so per-connection JetStream works; a peer that subscribes specifically to
"$JS.API.>" can still observe stream-management requests whose subjects embed the
room-derived stream name. Fully closing that needs NATS accounts/permissions per
identity (deferred to the 0003 decentralization line). Operational note: NATS
freezes permissions at connect time, so clients must client.RefreshSession after a
membership change to gain a new room's subject; cmd/chat and cmd/worker do not yet
call it, a functional gap to close before an enforce+ACL deployment.

Refs: report 0006 H4, issue 0005e.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-07 16:15:52 +02:00
egutierrez a2ec78c81d Merge issue/0005d-tls-guard: require TLS on public bind (audit N4) 2026-06-07 16:11:45 +02:00
egutierrez d01da9d396 fix(0005d): require TLS on a public bind (close N4 plaintext control plane)
The H2 guard refused "public bind without enforce" and "TLS flags without
enforce", but it still ALLOWED a public bind with enforce and no --tls-cert: the
control plane then served metadata (subjects, pubkeys, sealed keys, the social
graph) over plaintext HTTP publicly, so audit H5 reappeared as the N4 gap (TLS
was a capability, not a requirement; report 0006).

Fix: validateBootConfig now also refuses a non-loopback --bind unless both
--tls-cert and --tls-key are set. Public deployments must serve HTTPS; loopback
dev is unaffected (no TLS still allowed there).

Verification (cmd/membershipd/config_test.go):
- TestGap_PublicEnforceNoTLS: validateBootConfig("0.0.0.0", enforce, "", "")
  now returns an error mentioning --tls-cert (golden public+enforce+TLS allowed;
  edge loopback-without-TLS still allowed).
- TestBootConfigPolicy table updated: public+enforce+notls / +certonly / +keyonly
  and lan-ip+enforce+notls are now refused; public+enforce+tls and
  loopback+enforce+tls allowed.
- CGO_ENABLED=0 go build ./... && go vet ./... && go test -count=1 ./...  green.

Refs: report 0006 N4, issue 0005d.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-07 16:11:45 +02:00
egutierrez db8618ddc3 Merge issue/0005c-inflight: global in-flight byte limiter bounds aggregate memory (audit N2) 2026-06-07 16:09:58 +02:00
egutierrez e7d59fd01d fix(0005c): bound aggregate buffered memory with a global in-flight byte limiter
The H1 fix bounds each request (1 MiB control / 16 MiB blob) and the per-IP rate
limiter throttles a single source, but neither bounds the AGGREGATE memory across
concurrent requests. The re-audit (report 0006, N2) drove RSS to ~1.42 GB with 40
concurrent 16 MiB uploads, and noted that a multi-IP (botnet) flood scales without
a ceiling because the rate limit is per-IP.

Fix: a global, non-blocking, byte-counting limiter (pkg/membership/inflight.go).
ServeHTTP reserves a POST's worst-case buffered size (its route ceiling) from the
limiter before reading the body, and releases it when the request finishes. When
the global cap (maxInflightBytes = 128 MiB) is reached, further POSTs are shed
with 503 (backpressure) rather than parking goroutines, so total bytes buffered
in flight stays bounded regardless of connection count or source-IP spread. GETs
carry no body and do not consume the budget.

The limiter is implemented inside unibus (not delegated to the fn-registry, where
a generic concurrency primitive would normally live) because functions/core pulls
transitive deps requiring CGO (mattn/go-sqlite3) and external modules that are
incompatible with unibus's CGO_ENABLED=0 build, and because this work is scoped
to the unibus sub-repo. The type/method comments document this.

Verification:
- pkg/membership/inflight_test.go: TestInflightLimiter{Basics,Disabled,Concurrent}
  cover golden/edge/error/disabled/over-release and a -race concurrency invariant
  (inFlight returns to 0, never exceeds cap).
- pkg/membership/dos_concurrency_test.go: TestReaudit_DoSConcurrency fires 40
  concurrent 16 MiB uploads from distinct IPs (the multi-IP shape) against a 48 MiB
  test cap -> 200=3 503=37, RSS delta ~93 MiB (bound 256 MiB), inFlight()==0, and a
  fresh upload still 200. With the limiter disabled the test fails (200=40 503=0),
  confirming it is a real regression guard.
- CGO_ENABLED=0 go build ./... && go vet ./... && go test -count=1 ./...  green;
  CGO_ENABLED=1 go test -race ./pkg/membership/ green.

Residual (documented): under enforce the body is buffered twice (auth verify +
handler), so real RSS is ~2x the reserved bytes; closing that fully means
streaming blobs to disk (overlaps H9 / issue 0002).

Refs: report 0006 N2, issue 0005c.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-07 16:09:58 +02:00
egutierrez 0f79708338 Merge issue/0005b-sig-nil: drop unsigned frames in SignMsgs rooms (audit N3) 2026-06-07 15:58:10 +02:00
egutierrez ef3af6dfd1 fix(0005b): drop unsigned frames in SignMsgs rooms (close sig-nil spoof)
client.processFrame verified a frame's signature only when one was present
(`info.Policy.SignMsgs && f.Sig != nil`). In a room whose policy REQUIRES
per-message signatures, an attacker with data-plane access could publish a raw
frame with Sig==nil and a forged Sender, and the receiver accepted it as
authentic because the verification block was skipped (audit N3, report 0006).
On a signed-but-cleartext room any peer that knows the subject could thus
impersonate any sender.

Fix: in a SignMsgs room a missing signature is itself a rejection. processFrame
now drops any frame with Sig==nil before attempting verification:

    if info.Policy.SignMsgs {
        if f.Sig == nil { return }   // signature required but absent: drop
        // verify ...
    }

Non-signed rooms (ModeNATS) are unaffected: unsigned frames there are still
delivered, so the plain-NATS path is unchanged.

Verification (pkg/client/sig_nil_spoof_test.go, TestReaudit_SigNilSpoof):
- golden: a properly signed frame from a member is delivered.
- error : an unsigned frame with a forged Sender in a SignMsgs room is dropped
  (the test fails with "SIG-NIL SPOOF: receiver accepted ..." when the fix is
  reverted, confirming it is a real regression guard).
- edge  : a non-signed room still delivers an unsigned frame.
- CGO_ENABLED=0 go build ./... && go vet ./... && go test -count=1 ./...  green.

Refs: report 0006 N3, issue 0005b.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-07 15:58:10 +02:00
egutierrez 88b47912bd Merge issue/0005a-cve-bump: bump nats-server to v2.11.15 + go1.26.4 (16 CVEs -> 0 reachable) 2026-06-07 15:55:32 +02:00
egutierrez a3ac58fb70 fix(0005a): bump nats-server v2.10.22->v2.11.15 + toolchain go1.26.4 (close 16 CVEs)
govulncheck reported 16 reachable vulnerabilities (re-audit finding N1, report 0006):
14 in github.com/nats-io/nats-server/v2@v2.10.22 -- the embedded NATS server, which
is exposed to the internet in the chosen deployment -- and 2 in the Go standard
library (GO-2026-5039 net/textproto, GO-2026-5037 crypto/x509).

Changes:
- go get github.com/nats-io/nats-server/v2@v2.11.15 (covers all 14 server CVEs;
  pulls nats.go v1.49.0, nkeys v0.4.15, jwt v2.8.1, klauspost/compress v1.18.4
  and friends transitively).
- go directive 1.25.0 -> 1.26.4 so the toolchain ships the two stdlib fixes.

This is a go.mod/go.sum change justified purely by CVE remediation; it is the
explicit exception to the "do not touch deps" rule for a CVE bump.

Verification:
- CGO_ENABLED=0 go build ./... && go vet ./... && go test -count=1 ./...  -> green,
  including the 0003 multi-node cluster/JetStream e2e in pkg/embeddednats, so the
  server bump did not break the cluster or the durable plane.
- govulncheck ./...  -> "No vulnerabilities found" (0 reachable; the 13 that remain
  are in required-but-not-called modules).

Refs: report 0006 N1, issue 0005a.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-07 15:55:32 +02:00
agent fb0291ad8a docs(issue): 0005 hardening 2 — CVEs, sig-nil spoof, DoS concurrencia, TLS forzado (re-auditoría 0006) 2026-06-07 15:48:49 +02:00
agent d821bc1794 chore(0003): bump unibus to 0.6.0 — decentralization / HA (0003a-0003e)
Cluster NATS routes (auth + mutual TLS), Store/blobstore interfaces with
replicated JetStream KV and Object Store backends, idempotent
migrate-to-kv with backup, client failover over seed/control-plane lists,
replicated nonce store (closes the multi-node replay hole), and the
per-subject membership ACL (audit H4 residual). All behind the
`decentralized` flag (off); single-node SQLite+disk behavior unchanged.
The multi-node deploy (0003f) is the human's; runbook in report 0006.
2026-06-07 15:31:14 +02:00
egutierrez da420513b6 Merge issue/0003e-client-failover: client failover + replicated nonce store + subject ACL (H4) 2026-06-07 15:27:45 +02:00
agent 96abb75a2e feat(0003e/3): per-subject data-plane ACL from room membership (audit H4)
Closes the residual the 0004 hardening deferred: the NATS authenticator
can now confine a registered peer to the subjects of the rooms it
belongs to, instead of letting any registered identity sub/pub on any
subject. The dynamic-membership reconnection model the audit named is
provided by client.RefreshSession.

pkg/busauth:
- verifyNkey factors out the shared nkey verification.
- NewNkeyAuthenticatorACL + PermissionsFunc: an authenticator that, after
  authorizing, derives and RegisterUser()s per-subject permissions. A
  derivation error denies the connection (fail closed).

pkg/membership:
- SubjectACLFor(store) maps a signing pubkey to the subjects it may use:
  the subject of every room it belongs to, plus the client infrastructure
  subjects (_INBOX.>, $JS.API.> for request/reply and the persisted plane).

pkg/client:
- RefreshSession() rebuilds the data-plane connection so the authenticator
  re-derives permissions after a membership change (NATS freezes
  permissions at connect time). It retains the seeds/options to reconnect;
  active subscriptions are dropped and must be re-made (documented).

Tests (DoD: isolation + refresh):
- TestSubjectACLIsolation: alice (member of room.A) may sub/pub room.A but
  is DENIED sub and pub on room.B (permissions violation), and never reads
  bob's room.B traffic; bob never receives alice's cross-room publish.
- TestRefreshSessionGainsNewRoom: alice has no permission for room B until
  she is added and calls RefreshSession; the reconnect grants the subject
  and she then receives room B traffic.

Scope note: the per-subject ACL authenticator is opt-in (NewServer/
membershipd keep the open authenticator by default) and is wired in with
the decentralized boot path; auto-RefreshSession on every membership
change (fully transparent) remains for 0003f. Master behavior unchanged.
2026-06-07 15:27:45 +02:00
agent 37c778ca9a feat(0003e/2): replicated anti-replay nonce store on JetStream KV
The per-process nonce cache breaks anti-replay under multi-node failover
(audit 0004): a request captured on one node can be replayed to a
DIFFERENT node whose local cache never saw the nonce, and is accepted.
This makes the nonce state shared so a replay is rejected cluster-wide.

pkg/membership:
- nonceStore is now an interface. The in-memory cache is renamed
  memNonceCache (still the default, single-node behavior).
- kvNonceStore (new) claims each nonce with an atomic KV Create on a
  shared bucket: first sight wins (accept), any later sight on any node
  rejects (replay). A backend error fails CLOSED (reject), so a KV outage
  never silently disables anti-replay. The bucket carries a TTL =
  nonceTTL (2*clockSkew) so a key expires exactly when its replay window
  closes; raw base64 nonces are mapped to KV-safe keys via sha256-hex.
- Server.UseReplicatedNonces(js, replicas) swaps the store on a node;
  every node in a cluster calls it. NewServer still defaults to the
  in-memory cache (master behavior unchanged).

Test (DoD error path — the issue's cross-node replay case):
- TestReplicatedNonceRejectsCrossNodeReplay: two membershipd nodes share
  one KV bucket; a request accepted (200) on node A, replayed with the
  same ts+nonce to node B, is rejected (401) — and replaying to A again
  is rejected too.
2026-06-07 15:21:45 +02:00
agent c6ad63059f feat(0003e/1): client failover over a list of seeds and control planes
The client (issue 0003e, part 1) accepts a LIST of NATS seeds and a LIST
of control-plane URLs so a node loss is transparent.

pkg/client:
- Options.NatsServers: extra NATS seeds beyond the primary. The client
  connects to the joined seed list with MaxReconnects(-1) +
  RetryOnFailedConnect, so nats.go fails over to a surviving node when the
  one a client is attached to dies and rejoins a node that comes back.
- Options.CtrlURLs: extra control-plane endpoints. doJSON/putBlob/getBlob
  now try each endpoint in order, falling over on a transport error to the
  next (an HTTP response from any node is authoritative — every node
  serves the same state under the KV store). newSignedRequest becomes
  newSignedRequestTo(base, ...); each failover attempt mints a fresh nonce
  (the signature covers method+path+ts+nonce+body, not the host), so a
  retried request is never seen as a replay.
- ConnectedServer()/IsConnected(): observability for which node the data
  plane is attached to, for ops and failover tests.
- New/Connect/NewWithOptions keep their signatures (a single URL = a
  one-element list), so worker/chat/mobile/playground are unchanged.

Test (DoD edge — the issue's "kill node A" case):
- TestClientFailoverAcrossNodes: A seeds two clustered nodes, subscribes,
  receives a cross-node message; the node A is attached to is KILLED; A
  reconnects to the survivor and still receives messages — session intact.
2026-06-07 15:18:18 +02:00
egutierrez 649dc9e244 Merge issue/0003d-objectstore: replicated blobs on NATS Object Store 2026-06-07 15:12:45 +02:00
agent d6e668b984 feat(0003d): replicated blob store on NATS Object Store
Branch-by-abstraction for the blob store (issue 0003d): media ciphertext
can live in a replicated JetStream Object Store instead of local disk, so
a blob uploaded to one node survives a node loss and is reachable from
any node.

pkg/blobstore:
- Store is now an interface (Put/Get/Has). The filesystem backend is
  renamed diskStore and stays the default: New(dir) returns it.
- objectStore (new) implements Store over a NATS Object Store bucket with
  a configurable replication factor (R1..R5), matching the KV store's
  R1->R3 rollout. Content-addressing (sha256-hex) is identical, so the
  wire contract is unchanged.

pkg/membership:
- Server.blobs and NewServer take the blobstore.Store interface instead
  of the concrete type; no behavior change with the disk default.

Tests (DoD: golden + edge + contract):
- TestObjectStoreRoundTrip: put/get/has + content-addressed dedup.
- TestObjectStoreMissing: unknown hash is absent and unreadable.
- TestObjectStoreAddressMatchesDisk: the Object Store and disk backends
  address identical bytes to the IDENTICAL hash (portable blob refs).

Like the KV store (0003b), wiring membershipd to select the Object Store
is deferred to the decentralized boot path (flag off); disk stays default.
2026-06-07 15:12:45 +02:00
egutierrez 94e7ced1ef Merge issue/0003c-migrate-kv: idempotent SQLite->KV migration + backup 2026-06-07 15:09:56 +02:00
agent 9013ea5e33 feat(0003c): membershipd migrate-to-kv (idempotent SQLite -> JetStream KV)
The one-time data move decentralization needs (issue 0003c): copy the
entire control-plane state from the local SQLite database into the
replicated JetStream KV buckets, with a backup taken first.

pkg/membership:
- Snapshot / SealedKeyRecord: a backend-agnostic dump of the whole
  control plane (rooms with their real epoch, members, every sealed-key
  row across epochs, users with status).
- (*sqliteStore).ExportSnapshot and (*jetstreamStore).ExportSnapshot read
  a full Snapshot from each backend; (*jetstreamStore).importSnapshot
  writes one with raw Puts (preserving epoch/status, not resetting to
  defaults) so the migration is faithful and idempotent (every write is
  an overwrite, so re-running converges).
- MigrateSQLiteToKV orchestrates export -> import; BackupSQLite makes a
  consistent copy via SQLite's VACUUM INTO before any migration.

cmd/membershipd:
- `membershipd migrate-to-kv --db <path> --nats-url <url> [--replicas N]
  [--ca <cert>] [--no-backup]` backs up the SQLite file, connects to the
  cluster's NATS, and migrates. Dispatched on the host like `user`.

Tests (DoD: golden + edge + parity):
- TestMigrateSQLiteToKVParity: seed a representative SQLite (two rooms,
  one rekeyed to epoch 2, members, a revoked user); after migration the
  KV ExportSnapshot equals the SQLite ExportSnapshot.
- TestMigrateSQLiteToKVIdempotent: running the migration twice yields the
  same KV state.
- TestBackupSQLiteCreatesConsistentCopy: the backup reopens with
  identical data.
Plus a binary smoke (seed user -> run server -> migrate-to-kv -> re-run):
backup written, 1 user migrated, second run identical.
2026-06-07 15:09:56 +02:00
egutierrez b8c9b2b652 Merge issue/0003b-jetstream-store: Store interface + JetStream KV backend (fail-closed) 2026-06-07 15:04:52 +02:00
agent 6b3ace1d39 feat(0003b): membership.Store interface + JetStream KV implementation
Branch-by-abstraction for the control-plane store (issue 0003b), so the
membership state can move off process-local SQLite onto replicated
JetStream KV without rewriting callers and without breaking master.

pkg/membership:
- Store is now an interface (rooms/members/keys + user allowlist +
  Close). The existing SQLite implementation is renamed sqliteStore and
  stays the default: Open(path) still returns it. openSQLite keeps the
  concrete type for internal callers (the 0003c migration).
- ErrNotFound is a storage-agnostic "no such record" sentinel; both
  backends return it (the SQLite store maps sql.ErrNoRows to it). The
  control plane now branches on ErrNotFound instead of sql.ErrNoRows, so
  server.go no longer imports database/sql.
- jetstreamStore (new) implements Store over five replicated KV buckets:
  rooms, members, rooms_by_member (reverse index for ListRoomsForEndpoint),
  room_keys, users. Replication factor is configurable (R1..R5) for the
  R1->R3 rollout. Every read is bounded by OpTimeout and IsAuthorized /
  HasAdmin FAIL CLOSED on any backend error (a KV quorum loss denies,
  never admits), per the audit's requirement for the decentralized store.

dev/feature_flags.json:
- Add the `decentralized` flag (OFF): sqliteStore default while off,
  jetstreamStore behind it. The membershipd boot wiring that selects the
  KV store is deliberately deferred to 0003e/0003f (the embedded-NATS
  authenticator<->store bootstrap is part of the session/deploy redesign);
  OFF keeps the single-node SQLite control plane unchanged.

Tests (DoD: golden + edges + error path):
- TestJetStreamStoreRoomsCRUD: encrypted room + owner + invited member
  round-trip through every room/member/key method, including latest-epoch
  resolution and rekey.
- TestJetStreamStoreUsers: add/get/authorize/list/revoke + admin gate,
  with case-insensitive key normalization and duplicate rejection.
- TestJetStreamStoreNotFound: ErrNotFound mapping for misses.
- TestJetStreamStoreIsAuthorizedFailClosed: NATS backend shut down ->
  IsAuthorized and HasAdmin both DENY within the bounded timeout.

The full existing suite stays green: sqliteStore is unchanged behavior.
2026-06-07 15:04:52 +02:00
egutierrez 3230b31ade Merge issue/0003a-cluster: NATS cluster routes (auth + mutual TLS) 2026-06-07 14:54:53 +02:00
agent c90f145a05 feat(0003a): NATS cluster routes with shared-secret auth + mutual route TLS
Add high-availability cluster support to the embedded NATS server
(issue 0003a, first phase of decentralization).

pkg/embeddednats:
- ServerConfig gains ServerName (unique per node, required by JetStream
  RAFT) and an optional *ClusterConfig (cluster name, route host/port,
  peer route URLs, shared-secret Username/Password, and a mutual-TLS
  *tls.Config). applyClusterOpts maps it onto server.Options.Cluster +
  Routes. Nil Cluster keeps the legacy standalone server.

pkg/busauth:
- RouteTLSConfig builds the route layer's mutual-TLS config: the node
  presents its CA-signed certificate AND verifies the peer's certificate
  against the bus CA (RequireAndVerifyClientCert), reusing the issue-0001
  CA. Routes authenticate NODES, never the client nkey authenticator.

cmd/membershipd:
- Cluster flags (--cluster-name/--server-name/--cluster-port/--routes/
  --cluster-user/--cluster-pass/--route-tls-cert/-key/-ca) wire a node
  into the cluster. validateClusterConfig refuses a public cluster
  without a route secret and complete mutual route TLS, and rejects
  partial route-TLS flags (all-or-nothing). splitRoutes parses the CSV.

Tests (DoD: golden + 2 edge + error path):
- TestClusterForwardsAcrossNodes: 2-node cluster forwards a client
  subject from one node to a subscriber on the other.
- TestClusterThreeNodesForward: 3-node (HA shape) cross-node forwarding.
- TestClusterMutualTLSForwards: forwarding over mutual-TLS routes.
- TestClusterRejectsBadRouteAuth: wrong cluster password -> no route.
- TestClusterRejectsUnsignedNode: cert not signed by the bus CA -> no route.
- TestClusterConfigPolicy / TestSplitRoutes: boot-guard + CSV parsing.

Master stays green: standalone (no --cluster-name) is unchanged.
2026-06-07 14:54:53 +02:00
egutierrez 618f6b61da chore(0004): close issue, bump unibus to 0.5.0, record dataplane-acl decision
Issue 0004 (security hardening) done across 0004a-0004f. app.md version 0.5.0
with the capability growth log entry; dev/0004d-dataplane-acl.md documents the
chosen minimum-defense strategy for the NATS data plane and its residual limit
(per-subject ACL deferred to 0003). Full work report in
projects/message_bus/reports/0005-2026-06-07-unibus-security-hardening.md.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-07 14:40:39 +02:00
egutierrez d483c90356 Merge issue/0004f-medium-fixes: owner binding, nonce-cache pre-auth, error leak (H6/H7/H12)
Owner of a created room must be the signer; the replay cache is populated only
after authorization (with bounded, O(expired) pruning); internal errors no
longer leak to clients.
2026-06-07 14:36:22 +02:00
egutierrez 1bcca987a4 test(membership): regression for H6 owner spoof and H7 nonce-cache poison
TestAudit_OwnerSpoof: a body declaring a foreign owner endpoint or signing key
is 403; a self-owned create is 201.
TestAudit_NonceCachePoisonPreAuth: an unregistered identity's repeated nonce
still fails 'not authorized' (never 'replayed'), proving it was not cached, while
an authorized identity's replay is still rejected.
Nonce cache unit tests: prune-after-TTL and cap-bounded memory.
2026-06-07 14:36:22 +02:00
egutierrez 0aa2caae43 feat(membership): owner binding, pre-auth nonce-cache fix, generic errors
Three medium audit findings.

H6 (owner spoof): handleCreateRoom now binds the body's declared owner to the
authenticated signer — both the endpoint id and the signing key must be the
signer's — so a registered peer cannot create rooms in another identity's name.
Enforced only when an authenticated signer is present.

H7 (nonce-cache poison pre-auth): IsAuthorized now runs BEFORE the replay cache
is touched, so an unregistered identity (Ed25519 keys are free) can no longer
seed nonces into it. The cache is rewritten with O(expired) pruning (insertion
order equals expiry order under a constant TTL) instead of the previous O(n)
full-map scan under the mutex, plus a size cap with oldest-eviction. This is the
prerequisite the 0003 replicated nonce store builds on.

H12 (error leak): internal store/blob errors are logged and replaced with a
generic client message via writeServerErr, so SQL fragments and filesystem paths
no longer reach the caller. Crafted 4xx messages (owner-sig, validation) are kept.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-07 14:36:22 +02:00
egutierrez 957b728160 Merge issue/0004e-control-tls: TLS on the HTTP control plane (H5)
membershipd serves https with the bus CA cert; the client pins the CA and
refuses a plaintext control plane when a CA is provided.
2026-06-07 14:30:15 +02:00
egutierrez 07f4af817e feat(client,membershipd): TLS on the HTTP control plane (H5)
Audit H5 (Alto, public). The control plane was signed but plaintext, so a
network MITM could read all metadata (subjects, endpoints, public keys, sealed
keys, blob hashes, the social graph) and drop requests. Signing gives integrity,
not confidentiality.

- membershipd serves the control plane over TLS (ListenAndServeTLS, MinVersion
  1.2) with the same CA-signed cert as the data plane when --tls-cert is set; the
  fail-open guard already requires --bus-auth enforce alongside it.
- The client gets a separate Options.CtrlTLS so the HTTP client pins the bus CA,
  independent of the NATS data-plane TLS. Connect now sets both planes' TLS from
  the one CA and REFUSES a plaintext http:// control-plane URL when a CA is
  provided, so metadata is never sent in the clear when TLS is expected.

Connect's signature is unchanged; callers (worker/chat --ca, mobile NewSession)
must pass an https:// control-plane URL when they pass a CA. Documented for the
deploy step.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-07 14:30:15 +02:00
egutierrez 0d56c3c81d Merge issue/0004d-dataplane-acl: data-plane content confidentiality (H4)
Public deployments refuse cleartext rooms, forcing E2E so a data-plane sniffer
gets only ciphertext. Per-subject ACL deferred to 0003 (documented).
2026-06-07 14:26:45 +02:00
egutierrez fb6c796059 test: regression for H4 data-plane content confidentiality
pkg/membership TestRequireEncryptedRoomsRejectsCleartext: cleartext create ->
403, encrypted -> 201, flag off -> cleartext allowed again.

pkg/client TestAudit_NoSubjectACL: under the public posture a ModeNATS room is
refused; bob (member) decrypts the secret; eve raw-subscribes to the subject off
the data plane and receives only ciphertext (non-empty AEAD nonce, no plaintext
substring) — closing the auditor's 'eve reads internal: salary numbers'.
2026-06-07 14:26:45 +02:00
egutierrez e502b16675 feat(membership): forbid cleartext rooms on public deployments (H4 min defense)
Audit H4 (Alto). The embedded NATS has a single account with no per-subject
permissions, so any registered peer can subscribe to any subject — a cleartext
(ModeNATS) room's payload is readable by anyone who knows the subject.

A complete per-subject ACL derived from membership does not fit here: NATS
evaluates a connection's permissions once at connect time and never re-evaluates
them, but unibus clients connect-then-create/join-then-publish on one connection
(TestSecureBusEndToEnd). Static permissions would forbid the owner from
publishing to a room it just created; the dynamic reconnection model belongs to
the 0003 decentralization redesign. See dev/0004d-dataplane-acl.md.

Minimum defense implemented: Server.RequireEncryptedRooms (set by membershipd on
any non-loopback bind) refuses to create cleartext rooms, so every room on a
public deployment is end-to-end encrypted. Message CONTENT stays confidential
even with no subject isolation; residual traffic-metadata exposure is documented
and tracked for 0003.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-07 14:26:45 +02:00
egutierrez 47ff74d837 Merge issue/0004c-membership-authz: membership-based authorization (H3)
Room metadata, member lists, room directories and sealed keys are now served
only to members of the room (and a sealed key only to its own endpoint),
closing the horizontal metadata leak.
2026-06-07 14:21:55 +02:00
egutierrez b81e5f26f1 feat(membership): authorize room reads by membership, not registration
Audit H3 (Alto). 'Authorized' meant 'registered in the allowlist', not 'member
of the room', so any registered peer could read another room's subject, its
full member list (every member's sign_pub + kex_pub), any endpoint's room
directory, and even another member's sealed key.

The middleware now carries the authenticated signer's endpoint id into the
handler via request context. Room handlers enforce membership:
  - GET /rooms/{id} and /rooms/{id}/members require the signer to be a member;
  - GET /rooms/{id}/key serves the sealed key only to its own endpoint
    (endpoint == signer) and only to a member;
  - GET /members/{endpoint}/rooms is restricted to the signer's own endpoint.

Authorization is skipped only when no authenticated signer is present (AuthOff
dev, or a soft-mode pass-through), preserving legacy/dev behavior. Internal
errors no longer echo store messages to the client on these paths.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-07 14:21:55 +02:00
egutierrez d742f91881 Merge issue/0004b-failopen-guard: close the fail-open startup (H2)
A non-loopback bind, or any TLS flag, now requires --bus-auth enforce or the
service refuses to start.
2026-06-07 14:17:37 +02:00
egutierrez 30577145ce feat(membershipd): refuse fail-open startup configs
Audit H2 (Alto). The binary defaulted to --bus-auth off, the NATS nkey
authenticator only turned on under enforce, and TLS was an independent flag.
Booting --bind 0.0.0.0 --tls-cert … without --bus-auth enforce left both
planes open while looking secure.

validateBootConfig is a pure guard, called right after flag parsing, that
log.Fatals on two insecure shapes:
  - a non-loopback --bind without --bus-auth enforce, and
  - --tls-cert/--tls-key without --bus-auth enforce.

An insecure public startup is now impossible (the process exits), so a
fail-open data plane never comes up for an unregistered client to reach.
TestAudit_FailOpenTLSWithoutAuth plus a full policy table cover golden
(public+enforce, dev loopback) and every refused shape.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-07 14:17:37 +02:00
egutierrez 01e2ee1aa0 Merge issue/0004a-dos-limit: pre-auth DoS hardening (H1)
Body-size ceilings (MaxBytesReader, per-route + Content-Length pre-reject),
Server.MaxHeaderBytes, and a per-IP token-bucket rate limit shut the critical
pre-auth memory-exhaustion vector. Regression test asserts a bounded RSS.
2026-06-07 14:16:13 +02:00
egutierrez e7bdcc978c test(membership): regression for H1 pre-auth DoS body limit
Ports the auditor's TestAudit_DoSBodyLimitNoAuth: an unsigned oversized POST
to /blobs is now rejected 413 without the resident set spiking (measured via
/proc/self/status, delta bounded to <96 MiB vs the attack's 400 MB+). Covers
both a truthful over-ceiling Content-Length (rejected pre-read) and a chunked
unknown-length sender (MaxBytesReader caps the read). Plus golden (normal blob
stored), boundary (exactly at the ceiling accepted), the 1 MiB control-plane
ceiling, and the per-IP rate limit (flood -> 429, distinct IPs not throttled).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-07 14:16:04 +02:00
egutierrez 60d6a86655 feat(membership): bound request bodies and add per-IP rate limit
Pre-auth DoS hardening (audit H1, Critical). The control-plane middleware
read the request body with io.ReadAll before authenticating and with no size
cap, so an unauthenticated peer could force the server to buffer an arbitrary
body in RAM (the auditor sent 400 MB and watched RSS climb to ~898 MB).

- ServeHTTP now caps the buffered body before reading: a per-route ceiling
  (1 MiB JSON, 16 MiB /blobs) rejects an over-declared Content-Length outright
  and wraps the body in http.MaxBytesReader so a lying/chunked sender trips at
  the ceiling instead of unbounded.
- handlePutBlob maps the MaxBytesReader cutoff to 413 in every auth mode.
- Per-IP token-bucket rate limiter (golang.org/x/time/rate, already in the
  module graph) sheds floods before auth or body reads. Loopback dev stacks are
  unaffected (burst >> any single client's rate). Kept in-package as transport
  glue, not promoted to the registry, mirroring the nonceCache decision in 0003.
- membershipd sets http.Server.MaxHeaderBytes and ReadHeaderTimeout.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-07 14:16:04 +02:00
agent bcd02716d5 docs(issues): encolar 0002 (media v2), 0003 (descentralización HA), 0004 (hardening seguridad)
Specs de los tres issues siguientes del bus, derivados de esta sesión:
- 0002 media v2: chunking, mimetype, GC del object store, exponer en clientes.
- 0003 descentralización/HA: cluster NATS magnus+homer (R1→R3), control plane
  SQLite→JetStream KV, quorum, failover. Tercer nodo = homer (141.94.69.66).
- 0004 hardening: cierra los hallazgos de la auditoría red-team (report 0004):
  DoS pre-auth, fail-open, autorización por pertenencia, ACL NATS, TLS control plane.
2026-06-07 14:04:33 +02:00
egutierrez 484a07d6fd Merge issue/0001e-migrate-clients: secure-by-default clients, bus-auth enforce
Phase 0001e of issue 0001. client.Connect(caPath) is the single seam every
peer uses: with the bundled ca.crt it connects with TLS + nkey and signs the
control plane (enforce); without it, legacy plaintext dev. worker/chat gain
--ca, the mobile NewSession gains caPath, membershipd gains --tls-cert/--tls-key
and turns on the nkey authenticator under enforce. dev/feature_flags.json
declares the target state (bus-auth enforce, bus-tls on); the gateway and
unibots migrations are documented as notes (dev/0001e-remaining-clients.md).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-07 12:49:42 +02:00
egutierrez 04e27518af test(client): secure bus end-to-end (auth + TLS + E2E together)
TestSecureBusEndToEnd boots the server with control-plane enforce, NATS nkey
auth, and TLS all on; two registered peers connect with nkey+TLS, A creates a
Matrix room, invites B, publishes, and B decrypts — proving the three layers
compose. This is the headline golden of issue 0001.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-07 12:49:19 +02:00
egutierrez 6b0916f1fa docs(0001e): note remaining gateway and unibots migration
The web gateway (playground) and unibots (in the agents repo) are not migrated
here: the gateway stays a local dev tool at AuthOff, and the bot transport
lives outside this sub-repo. dev/0001e-remaining-clients.md records exactly
what each needs (client.Connect with ca.crt, identity registered in the
allowlist) and the operator server flags for phase 0001f.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-07 12:49:19 +02:00
egutierrez 87dbc421cd chore(flags): flip bus-auth to enforce and bus-tls on (target state)
Declares the project's target rollout: bus-auth enforce, bus-tls enabled.
Flags are declarative; the operator activates them at deploy via membershipd
--bus-auth/--tls-cert/--tls-key. CLI defaults stay off so dev and tests run
unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-07 12:49:19 +02:00
egutierrez b647779521 feat(membershipd): enable NATS nkey auth (enforce) and TLS flags
Opens the store before NATS so the authenticator can consult IsAuthorized.
Under --bus-auth enforce the embedded NATS gets the nkey authenticator (only
allowlisted identities connect); --tls-cert/--tls-key make it present the
server certificate and require TLS. External NATS manages its own auth/TLS.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-07 12:49:19 +02:00
egutierrez 74c8d4f941 feat(client,cmd,mobile): connect securely via client.Connect(caPath)
client.Connect is the single migration seam: a non-empty caPath connects with
TLS pinned to the bus CA plus nkey auth (matching enforce + bus-tls), an empty
caPath keeps the legacy plaintext dev connection; control-plane requests are
signed either way. worker and chat gain a --ca flag; the gomobile NewSession
gains a caPath parameter so the Android app bundles ca.crt and connects
securely. Every peer now flows through one code path.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-07 12:49:19 +02:00
egutierrez 2ccd11b68c Merge issue/0001d-tls: TLS on the NATS data plane (self-signed CA)
Phase 0001d of issue 0001. embeddednats grows a ServerConfig with an optional
TLS config; the client can pin the bus's self-signed CA via Options.TLS built
from busauth.LoadCATLSConfig. deploy/tls/generate-certs.sh mints the CA and a
server cert (SAN: public IP, WG IP, om, localhost) — only the public ca.crt is
versioned, private keys are gitignored. A client trusting the CA completes the
handshake; one without it fails. TLS stays off until phase 0001e wires it in.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-07 12:44:22 +02:00
egutierrez 75939a192c test: TLS data plane end to end + CA/keypair loaders
client/tls_test: mints a throwaway CA + server cert in-memory; a client
pinning the CA completes the handshake and operates (golden), a client
without the CA fails the handshake (error path). busauth/tls_test: golden
load of a CA PEM and a server keypair, plus error paths (missing file,
non-PEM). Harness body extracted to bootHarness(ctrlMode, natsAuth, natsTLS).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-07 12:44:13 +02:00
egutierrez 1b56f14c20 feat(deploy/tls): self-signed CA + server cert generator
generate-certs.sh mints the bus CA and a NATS server certificate whose SANs
cover the public IP (135.125.201.30), the WireGuard IP (10.42.0.1), the om
hostname, and localhost/127.0.0.1 for on-host smoke tests (all overridable via
env). Only the public ca.crt is committed; ca.key, server.key and server.crt
are gitignored and distributed out of band. README documents generation, use
and rotation.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-07 12:44:13 +02:00
egutierrez 2786ae2dde feat(busauth,client): pin the bus CA over TLS
busauth.LoadCATLSConfig turns a ca.crt path into a *tls.Config trusting only
that private CA (clients must pin it; the system roots would reject a
self-signed server cert). busauth.ServerTLSConfig loads the server keypair.
client.Options gains TLS; NewWithOptions calls nats.Secure when set, so the
data-plane connection is encrypted and the server pinned.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-07 12:44:13 +02:00
egutierrez 6d3d6d2562 refactor(embeddednats): ServerConfig with optional TLS
Collapses Start/StartHost/StartHostAuth onto StartServer(ServerConfig) so
auth and a TLS config can be set without growing the parameter list further.
When TLS is set the server presents the certificate and requires TLS on the
data plane; the wrappers preserve the existing no-auth/no-TLS behavior.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-07 12:44:13 +02:00
egutierrez 217daae472 Merge issue/0001c-nats-nkey: NATS nkey authentication
Phase 0001c of issue 0001. The data plane now authenticates with each peer's
Ed25519 identity reused as a NATS nkey: busauth converts the identity to an
nkey and back, embeddednats installs a CustomClientAuthentication that verifies
the nkey signature and checks the user allowlist on every connection (live
revocation, no restart), and the client opts into nkey via NewWithOptions. The
embedded server stays open by default so dev stacks and existing tests are
unaffected.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-07 12:37:59 +02:00
egutierrez 00058ea0af test(client): NATS nkey auth end to end
Harness gains newHarnessFull(ctrlMode, natsAuth) wiring the nkey authenticator
to the user allowlist; NATS auth and HTTP auth are independent so each plane
is tested in isolation. TestNatsNkeyAuth: registered peer connects with nkey
and operates (golden); unregistered peer and no-nkey client refused at connect
(error paths); peer revoked at runtime refused on its next connection without
a restart (edge).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-07 12:37:59 +02:00
egutierrez 1630f6f163 feat(client): opt-in nkey NATS connection via NewWithOptions
nats.go refuses to connect with an nkey to a server that does not advertise
nkey auth, so the connection cannot blindly always present one. New keeps the
legacy plain connection; NewWithOptions(Options{UseNkey:true}) presents the
peer's identity-derived nkey. NewWithOptions is the single place the data-plane
connection is built, so every peer gets identical behavior from the same
Options (TLS fields arrive in phase 0001d).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-07 12:37:59 +02:00
egutierrez b09bafe242 feat(embeddednats): nkey CustomClientAuthentication against the allowlist
busauth.NewNkeyAuthenticator verifies a client's nkey signature over the
server nonce (decoding like nats-server: raw-url then std base64), maps the
nkey to its Ed25519 hex, and consults an injected IsAuthorized predicate.
Checking on every connection (rather than a static Options.Nkeys map) means
revoking a user denies its next connection with no restart. embeddednats
gains StartHostAuth(auth) and sets AlwaysEnableNonce so the server advertises
the nonce nkey clients need; Start/StartHost stay open (auth=nil) for dev.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-07 12:37:46 +02:00
egutierrez 413dd61041 feat(busauth): Ed25519<->NATS nkey conversion with round-trip test
A NATS nkey is an Ed25519 keypair, so the bus reuses each peer's signing
identity for the data plane instead of minting new key material. ClientNkey
derives the user nkey public string and a nonce-signing callback from the
peer's Ed25519 private key (its first 32 bytes are the nkey seed);
SignPubHexFromNkey maps a presented nkey back to the allowlist's hex key;
NkeyPublicFromSignPub is the public-only derivation.

This is NATS-specific transport glue kept in the app, not promoted to the
registry, to avoid pulling nats-io/nkeys into the multi-domain registry
module. The dedicated round-trip test runs first (spec requirement): it
proves the nkey signature equals the identity's raw Ed25519 signature and
that the nkey maps back to the identity's hex.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-07 12:37:46 +02:00
egutierrez 89e0d0e64a Merge issue/0001b-ctrl-auth: signed control-plane auth + anti-replay
Phase 0001b of issue 0001. The control plane (membershipd HTTP) now supports
the bus-auth rollout off->soft->enforce: clients sign every request with their
Ed25519 identity (headers over method/path/ts/nonce/sha256(body)); the server
verifies the signature, clock skew (+/-30s), nonce replay (60s TTL cache), and
the user allowlist. Revocation denies access on the next request without a
restart. Default stays off so master keeps working.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-07 12:32:06 +02:00
egutierrez 2130eaa44d test: control-plane auth middleware + end-to-end enforce
membership/auth_test: golden (signed+registered accepted), error paths
(unregistered 401, replayed nonce 401, clock skew 401, tampered body 401,
missing headers 401), exemptions (healthz, soft allows, off no-op).
client_test: end-to-end with the real client against an enforce server —
registered peer accepted, unregistered rejected, revoked peer denied without
a server restart.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-07 12:31:58 +02:00
egutierrez 567e604fc7 chore(playground): pass AuthOff to NewServer (gateway not yet migrated)
The local dev gateway has not adopted signed requests; tracked for phase
0001e. Keeps it working while the NewServer signature gains the auth mode.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-07 12:31:58 +02:00
egutierrez 0f8a38d62b feat(membershipd): --bus-auth flag selects control-plane auth mode
Maps off|soft|enforce to membership.AuthMode and wires it into NewServer.
Defaults to off so existing deployments are unaffected until the operator
opts into the rollout.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-07 12:31:58 +02:00
egutierrez e0ef3a27cc feat(client): sign every control-plane request (transport auth headers)
doJSON, putBlob and getBlob now go through newSignedRequest, which attaches
X-Unibus-Pub/Ts/Nonce/Sig signing membership.CanonicalRequest with the peer's
Ed25519 key. GETs are signed too so the server can authenticate the caller
uniformly under enforce. The payload-level owner signature (invite/rekey)
is unchanged and coexists with this transport-level signature.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-07 12:31:50 +02:00
egutierrez 3e39e23fe0 feat(membership): signed control-plane auth middleware + anti-replay
Adds the bus-auth rollout (off|soft|enforce) to the control plane. The
middleware verifies an Ed25519 request signature over CanonicalRequest
(method, request-URI, ts, nonce, sha256(body)), checks the timestamp is
within +/-30s, rejects replayed nonces via an in-memory TTL cache (60s), and
requires the signer to be an active user in the allowlist. soft logs
rejections but lets requests through so clients can migrate without an
outage; off is the legacy no-op default. /healthz is exempt so health probes
work before any identity exists. CanonicalRequest is exported as the single
source of truth shared with the client.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-07 12:31:50 +02:00
egutierrez e9711bf74b Merge issue/0001a-users: bus user allowlist (store + CLI + migration 002)
Phase 0001a of issue 0001 (bus auth + TLS). Adds the users table, store CRUD
(AddUser/GetUser/ListUsers/RevokeUser/IsAuthorized/HasAdmin), the local
'membershipd user' admin CLI for seeding the first admin, and the bus-auth /
bus-tls feature flags (both off). No behavior change yet: the allowlist is
not consulted until phase 0001b wires the control-plane middleware.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-07 12:23:29 +02:00
egutierrez 822982b71b test(membership): cover user store golden/edge/error paths
Golden: add -> get -> IsAuthorized true, admin seeded. Edge: empty role
defaults to member, case-insensitive hex lookup, list ordering, revoke
denies authorization and stamps revoked_at. Error: duplicate key
(ErrUserExists), invalid role, empty sign_pub, unknown user not authorized,
revoke of unknown/already-revoked. Plus users-table migration idempotency.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-07 12:23:23 +02:00
egutierrez ddc6cabc24 feat(flags): declare bus-auth and bus-tls feature flags (off)
bus-auth carries the off -> soft -> enforce rollout state; bus-tls is a
boolean. Both start disabled so master keeps compiling and passing tests
while the auth/TLS code lands behind them across phases 0001a-0001e.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-07 12:23:23 +02:00
egutierrez 0d7ab22d4a feat(membershipd): add 'user add/list/revoke' local admin CLI
Local administration surface for the user allowlist, dispatched before the
server flag set parses os.Args. It opens the SQLite store directly with no
network or auth: running on the bus host is trusted by design, which is how
the first admin is seeded (breaking the chicken-egg of needing an admin to
add an admin). Validates that sign-pub is a 32-byte Ed25519 key in hex and
tolerates the sign-pub positional appearing before or after --db.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-07 12:23:16 +02:00
egutierrez c5387028e0 feat(membership): add 002_users.sql migration and user CRUD store
Bus-level user allowlist (issue 0001a): the authoritative directory of
Ed25519 signing identities permitted to use the bus, independent of room
membership. Migration is additive and mirrored byte-for-byte between the
module-root migrations/ and the embedded pkg/membership/migrations/.

Store adds AddUser/GetUser/ListUsers/RevokeUser/IsAuthorized/HasAdmin.
IsAuthorized is the single fail-closed predicate both the control plane and
the NATS data plane will consult, so revocation is a status flip that denies
access on both without a restart.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-07 12:23:11 +02:00
agent 7de05c8591 docs(security): fijar exposición pública en el spec 0001
Decisión del operador: el bus se expone a internet protegido por auth+TLS
(WireGuard pasa a ser una vía más, no la barrera). ufw en om abre 8470/4250;
el server cert lleva SAN con la IP pública 135.125.201.30 + la IP WG 10.42.0.1
+ hostname; los clientes controlados embeben el ca.crt propio (sin Let's Encrypt).
La fase de despliegue 0001f la ejecuta el humano; el agente entrega 0001a-0001e.
2026-06-07 12:15:32 +02:00
agent 9a915839c8 docs(security): spec issue 0001 — sistema de usuarios + auth firmada del control plane + NATS nkey + TLS
Diseño de las tres capas de seguridad del bus para que WireGuard pase a ser
opcional: tabla users (allowlist Ed25519 con roles/revocación), middleware de
firma Ed25519 + anti-replay en el control plane (generaliza signRequest/
verifyOwnerSig ya existentes), y NATS endurecido con CustomClientAuthentication
(nkey sobre la identidad del peer, revocación dinámica) + TLS con CA propia.
Incluye 6 fases TBD con feature flag bus-auth (off->soft->enforce), migración de
clientes (pkg/client centraliza el cambio), plan de despliegue a om y matriz de
tests (golden/edge/error).
2026-06-07 12:12:19 +02:00
138 changed files with 13969 additions and 4123 deletions
-12
View File
@@ -1,12 +0,0 @@
.gradle/
build/
local.properties
*.iml
.idea/
captures/
.cxx/
# The gomobile binding is a build artifact (~24 MB). Regenerate it from ../mobile
# with `gomobile bind` (see README.md); it is not versioned.
app/libs/*.aar
app/libs/*.jar
-83
View File
@@ -1,83 +0,0 @@
# unibus · app Android
Cliente móvil nativo de unibus. La app no habla con un gateway: embebe un **peer
real** del bus a través del binding gomobile `mobile/unibus.go`, de modo que el
cifrado extremo a extremo corre **en el dispositivo**. Cada teléfono es un peer
de primera clase del bus, igual que cualquier peer Go.
## Arquitectura
```
Kotlin/Compose UI ──> BusViewModel ──> com.unibus.core.mobile.Session (.aar)
│ (NATS data plane + E2E crypto, en Go)
membershipd (control plane HTTP :8470)
NATS (data plane :4250)
```
- `BusViewModel` traduce intents de UI en llamadas al binding. Las llamadas de red
(`newSession`, `createRoom`, `join`, `publish`) corren en `Dispatchers.IO`.
- Los frames entrantes llegan por `FrameListener.onFrame` en una goroutine NATS
(hilo JNI); se publican en un `StateFlow` (thread-safe) que Compose recolecta en
el hilo principal.
## Requisitos
- Android SDK (compileSdk 34), NDK (para regenerar el `.aar`), JDK 17.
- El binding `app/libs/unibus.aar` (no versionado: es un artefacto de ~24 MB).
## 1. Generar el binding (.aar)
Desde la raíz del repo de la app (`projects/message_bus/apps/unibus`):
```bash
export ANDROID_HOME=$HOME/android-sdk
export ANDROID_NDK_HOME=$HOME/android-sdk/ndk/26.3.11579264
mkdir -p android/app/libs
gomobile bind -target=android -androidapi 21 -javapkg com.unibus.core \
-o android/app/libs/unibus.aar ./mobile
```
Esto produce `unibus.aar` con la clase estática `com.unibus.core.mobile.Mobile`
(`generateIdentity`, `newSession`) y los tipos `Session` y `FrameListener`.
## 2. Compilar el APK
```bash
cd android
export JAVA_HOME=$HOME/android-sdk/jdk-17/jdk-17.0.19+10
export ANDROID_HOME=$HOME/android-sdk
./gradlew assembleDebug
# APK: app/build/outputs/apk/debug/app-debug.apk
```
`local.properties` apunta a `sdk.dir`; ajústalo si tu SDK está en otra ruta.
## 3. Arrancar el bus y probar en el emulador
```bash
# 1. En el PC: control plane + NATS embebido (HTTP :8470, NATS :4250)
cd projects/message_bus/apps/unibus && go run ./cmd/membershipd
# 2. Emulador Pixel_API34
$ANDROID_HOME/emulator/emulator -avd Pixel_API34 &
# 3. Instalar + lanzar
adb install -r app/build/outputs/apk/debug/app-debug.apk
adb shell am start -n com.unibus.app/.MainActivity
```
En la pantalla de conexión, desde el emulador el host del PC es `10.0.2.2`:
- **Host (control plane):** `http://10.0.2.2:8470`
- **NATS (data plane):** `nats://10.0.2.2:4250`
Para un teléfono físico en la misma LAN, usa la IP LAN del PC en lugar de
`10.0.2.2`.
## Notas
- La identidad del peer se guarda en `filesDir/peer.id` (claves privadas
Ed25519 + X25519). No se sincroniza ni se respalda.
- Una room creada en modo "cifrar (E2E)" usa la política Matrix (cifrada,
persistida, firmada); en modo normal usa NATS cleartext.
-66
View File
@@ -1,66 +0,0 @@
plugins {
id("com.android.application")
id("org.jetbrains.kotlin.android")
id("org.jetbrains.kotlin.plugin.compose")
}
android {
namespace = "com.unibus.app"
compileSdk = 34
defaultConfig {
applicationId = "com.unibus.app"
minSdk = 21
targetSdk = 34
versionCode = 1
versionName = "0.1.0"
}
buildFeatures {
compose = true
}
compileOptions {
sourceCompatibility = JavaVersion.VERSION_17
targetCompatibility = JavaVersion.VERSION_17
}
kotlinOptions {
jvmTarget = "17"
}
buildTypes {
getByName("release") {
isMinifyEnabled = false
proguardFiles(
getDefaultProguardFile("proguard-android-optimize.txt"),
"proguard-rules.pro",
)
}
}
packaging {
resources {
excludes += "/META-INF/{AL2.0,LGPL2.1}"
}
}
}
dependencies {
// The unibus gomobile binding: a real bus peer that does NATS + E2E crypto
// on the device. All protocol logic lives here, shared with every other peer.
implementation(files("libs/unibus.aar"))
val composeBom = platform("androidx.compose:compose-bom:2024.09.03")
implementation(composeBom)
implementation("androidx.compose.ui:ui")
implementation("androidx.compose.ui:ui-graphics")
implementation("androidx.compose.ui:ui-tooling-preview")
implementation("androidx.compose.material3:material3")
implementation("androidx.compose.material:material-icons-extended")
implementation("androidx.activity:activity-compose:1.9.2")
implementation("androidx.lifecycle:lifecycle-viewmodel-compose:2.8.6")
implementation("androidx.lifecycle:lifecycle-runtime-ktx:2.8.6")
implementation("org.jetbrains.kotlinx:kotlinx-coroutines-android:1.8.1")
debugImplementation("androidx.compose.ui:ui-tooling")
}
-4
View File
@@ -1,4 +0,0 @@
# gomobile generates JNI-bound classes under com.unibus.core.mobile and go.*.
# They are reached from native code, so keep them intact even when minifying.
-keep class com.unibus.core.mobile.** { *; }
-keep class go.** { *; }
-25
View File
@@ -1,25 +0,0 @@
<?xml version="1.0" encoding="utf-8"?>
<manifest xmlns:android="http://schemas.android.com/apk/res/android">
<uses-permission android:name="android.permission.INTERNET" />
<uses-permission android:name="android.permission.ACCESS_NETWORK_STATE" />
<application
android:allowBackup="true"
android:label="@string/app_name"
android:supportsRtl="true"
android:usesCleartextTraffic="true"
android:theme="@style/Theme.Unibus">
<activity
android:name=".MainActivity"
android:exported="true"
android:label="@string/app_name"
android:windowSoftInputMode="adjustResize">
<intent-filter>
<action android:name="android.intent.action.MAIN" />
<category android:name="android.intent.category.LAUNCHER" />
</intent-filter>
</activity>
</application>
</manifest>
@@ -1,162 +0,0 @@
package com.unibus.app
import android.app.Application
import androidx.lifecycle.AndroidViewModel
import androidx.lifecycle.viewModelScope
import com.unibus.core.mobile.FrameListener
import com.unibus.core.mobile.Mobile
import com.unibus.core.mobile.Session
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.flow.MutableStateFlow
import kotlinx.coroutines.flow.StateFlow
import kotlinx.coroutines.flow.asStateFlow
import kotlinx.coroutines.flow.update
import kotlinx.coroutines.launch
import java.io.File
/** One chat message shown in the UI. */
data class ChatMessage(
val sender: String,
val text: String,
val mine: Boolean,
val ts: Long,
)
/** The whole observable UI state of the app. */
data class BusState(
val connecting: Boolean = false,
val connected: Boolean = false,
val endpointId: String = "",
val roomId: String = "",
val roomSubject: String = "",
val status: String = "",
val error: String? = null,
val messages: List<ChatMessage> = emptyList(),
)
/**
* BusViewModel drives a real unibus peer on the device through the gomobile
* binding. The binding performs NATS transport and end-to-end crypto natively;
* this class only translates UI intents into binding calls and exposes the
* incoming frames as observable state.
*
* Threading: every binding call that touches the network (newSession, createRoom,
* join, publish) runs off the main thread on Dispatchers.IO to avoid
* NetworkOnMainThreadException. Incoming frames arrive on a JNI-attached NATS
* goroutine via [onFrame]; we only append to a thread-safe StateFlow there, and
* Compose collects that flow on the main thread.
*/
class BusViewModel(app: Application) : AndroidViewModel(app), FrameListener {
private val _state = MutableStateFlow(BusState())
val state: StateFlow<BusState> = _state.asStateFlow()
private var session: Session? = null
private var myEndpoint: String = ""
private val idPath: String
get() = File(getApplication<Application>().filesDir, "peer.id").absolutePath
override fun onFrame(roomID: String, sender: String, msgID: String, text: String) {
_state.update {
it.copy(
messages = it.messages + ChatMessage(
sender = sender,
text = text,
mine = sender == myEndpoint,
ts = System.currentTimeMillis(),
),
)
}
}
fun connect(host: String, nats: String, peerName: String) {
if (_state.value.connecting) return
_state.update { it.copy(connecting = true, error = null, status = "Conectando…") }
viewModelScope.launch(Dispatchers.IO) {
try {
val s = Mobile.newSession(idPath, nats.trim(), host.trim())
session = s
myEndpoint = s.endpointID()
_state.update {
it.copy(
connecting = false,
connected = true,
endpointId = myEndpoint,
status = "Conectado como $peerName",
)
}
} catch (e: Exception) {
_state.update {
it.copy(connecting = false, connected = false, error = e.message ?: "error desconocido")
}
}
}
}
fun createRoom(subject: String, encrypted: Boolean) {
val s = session ?: return
viewModelScope.launch(Dispatchers.IO) {
try {
val mode = if (encrypted) "matrix" else "nats"
val roomId = s.createRoom(subject.trim(), mode)
s.subscribe(roomId, this@BusViewModel)
_state.update {
it.copy(
roomId = roomId,
roomSubject = subject.trim(),
messages = emptyList(),
status = "Room creada",
)
}
} catch (e: Exception) {
_state.update { it.copy(error = e.message ?: "error al crear room") }
}
}
}
fun joinRoom(roomId: String) {
val s = session ?: return
viewModelScope.launch(Dispatchers.IO) {
try {
val rid = roomId.trim()
s.join(rid)
s.subscribe(rid, this@BusViewModel)
_state.update {
it.copy(roomId = rid, roomSubject = "(unida)", messages = emptyList(), status = "Unido a la room")
}
} catch (e: Exception) {
_state.update { it.copy(error = e.message ?: "error al unirse") }
}
}
}
fun publish(text: String) {
val s = session ?: return
val room = _state.value.roomId
if (room.isEmpty() || text.isBlank()) return
viewModelScope.launch(Dispatchers.IO) {
try {
s.publish(room, text)
} catch (e: Exception) {
_state.update { it.copy(error = e.message ?: "error al publicar") }
}
}
}
/** card returns this peer's shareable public identity (no secret). */
fun card(): String = try {
session?.card() ?: ""
} catch (_: Exception) {
""
}
fun clearError() = _state.update { it.copy(error = null) }
override fun onCleared() {
try {
session?.close()
} catch (_: Exception) {
}
session = null
}
}
@@ -1,307 +0,0 @@
package com.unibus.app
import android.os.Bundle
import androidx.activity.ComponentActivity
import androidx.activity.compose.setContent
import androidx.activity.viewModels
import androidx.compose.foundation.layout.Arrangement
import androidx.compose.foundation.layout.Box
import androidx.compose.foundation.layout.Column
import androidx.compose.foundation.layout.Row
import androidx.compose.foundation.layout.Spacer
import androidx.compose.foundation.layout.fillMaxSize
import androidx.compose.foundation.layout.fillMaxWidth
import androidx.compose.foundation.layout.height
import androidx.compose.foundation.layout.padding
import androidx.compose.foundation.layout.width
import androidx.compose.foundation.lazy.LazyColumn
import androidx.compose.foundation.lazy.itemsIndexed
import androidx.compose.foundation.lazy.rememberLazyListState
import androidx.compose.material.icons.Icons
import androidx.compose.material.icons.automirrored.filled.Send
import androidx.compose.material.icons.filled.Add
import androidx.compose.material.icons.filled.Lock
import androidx.compose.material3.Button
import androidx.compose.material3.Card
import androidx.compose.material3.CircularProgressIndicator
import androidx.compose.material3.ExperimentalMaterial3Api
import androidx.compose.material3.Icon
import androidx.compose.material3.IconButton
import androidx.compose.material3.MaterialTheme
import androidx.compose.material3.OutlinedButton
import androidx.compose.material3.OutlinedTextField
import androidx.compose.material3.Scaffold
import androidx.compose.material3.Surface
import androidx.compose.material3.Switch
import androidx.compose.material3.Text
import androidx.compose.material3.TopAppBar
import androidx.compose.material3.darkColorScheme
import androidx.compose.runtime.Composable
import androidx.compose.runtime.LaunchedEffect
import androidx.compose.runtime.collectAsState
import androidx.compose.runtime.getValue
import androidx.compose.runtime.mutableStateOf
import androidx.compose.runtime.remember
import androidx.compose.runtime.saveable.rememberSaveable
import androidx.compose.runtime.setValue
import androidx.compose.ui.Alignment
import androidx.compose.ui.Modifier
import androidx.compose.ui.text.style.TextOverflow
import androidx.compose.ui.unit.dp
import java.text.SimpleDateFormat
import java.util.Date
import java.util.Locale
class MainActivity : ComponentActivity() {
private val vm: BusViewModel by viewModels()
override fun onCreate(savedInstanceState: Bundle?) {
super.onCreate(savedInstanceState)
setContent {
MaterialTheme(colorScheme = darkColorScheme()) {
Surface(modifier = Modifier.fillMaxSize()) {
UnibusApp(vm)
}
}
}
}
}
@Composable
fun UnibusApp(vm: BusViewModel) {
val state by vm.state.collectAsState()
if (!state.connected) {
ConnectScreen(
connecting = state.connecting,
error = state.error,
onConnect = { host, nats, name -> vm.connect(host, nats, name) },
)
} else {
ChatScreen(state = state, vm = vm)
}
}
@Composable
fun ConnectScreen(
connecting: Boolean,
error: String?,
onConnect: (String, String, String) -> Unit,
) {
var host by rememberSaveable { mutableStateOf("http://10.0.2.2:8470") }
var nats by rememberSaveable { mutableStateOf("nats://10.0.2.2:4250") }
var name by rememberSaveable { mutableStateOf("android") }
Column(
modifier = Modifier
.fillMaxSize()
.padding(24.dp),
verticalArrangement = Arrangement.Center,
) {
Text("unibus", style = MaterialTheme.typography.headlineMedium)
Text(
"chat cifrado extremo a extremo sobre NATS",
style = MaterialTheme.typography.bodyMedium,
color = MaterialTheme.colorScheme.onSurfaceVariant,
)
Spacer(Modifier.height(24.dp))
OutlinedTextField(
value = host,
onValueChange = { host = it },
label = { Text("Host (control plane)") },
singleLine = true,
modifier = Modifier.fillMaxWidth(),
)
Spacer(Modifier.height(12.dp))
OutlinedTextField(
value = nats,
onValueChange = { nats = it },
label = { Text("NATS (data plane)") },
singleLine = true,
modifier = Modifier.fillMaxWidth(),
)
Spacer(Modifier.height(12.dp))
OutlinedTextField(
value = name,
onValueChange = { name = it },
label = { Text("Identidad") },
singleLine = true,
modifier = Modifier.fillMaxWidth(),
)
if (error != null) {
Spacer(Modifier.height(12.dp))
Text(error, color = MaterialTheme.colorScheme.error)
}
Spacer(Modifier.height(24.dp))
Button(
onClick = { onConnect(host, nats, name) },
enabled = !connecting,
modifier = Modifier.fillMaxWidth(),
) {
if (connecting) {
CircularProgressIndicator(modifier = Modifier.height(18.dp).width(18.dp), strokeWidth = 2.dp)
Spacer(Modifier.width(8.dp))
}
Text(if (connecting) "Conectando…" else "Conectar")
}
}
}
@OptIn(ExperimentalMaterial3Api::class)
@Composable
fun ChatScreen(state: BusState, vm: BusViewModel) {
var subject by rememberSaveable { mutableStateOf("room.general") }
var encrypt by rememberSaveable { mutableStateOf(false) }
var joinId by rememberSaveable { mutableStateOf("") }
var draft by rememberSaveable { mutableStateOf("") }
val listState = rememberLazyListState()
LaunchedEffect(state.messages.size) {
if (state.messages.isNotEmpty()) listState.animateScrollToItem(state.messages.size - 1)
}
Scaffold(
topBar = {
TopAppBar(
title = {
Column {
Text("unibus", style = MaterialTheme.typography.titleMedium)
Text(
state.status.ifEmpty { state.endpointId.take(12) + "" },
style = MaterialTheme.typography.bodySmall,
maxLines = 1,
overflow = TextOverflow.Ellipsis,
)
}
},
)
},
) { inner ->
Column(
modifier = Modifier
.fillMaxSize()
.padding(inner)
.padding(horizontal = 12.dp),
) {
// Room controls.
Card(modifier = Modifier.fillMaxWidth().padding(vertical = 8.dp)) {
Column(Modifier.padding(12.dp)) {
Row(verticalAlignment = Alignment.CenterVertically) {
OutlinedTextField(
value = subject,
onValueChange = { subject = it },
label = { Text("subject") },
singleLine = true,
modifier = Modifier.weight(1f),
)
Spacer(Modifier.width(8.dp))
Button(onClick = { vm.createRoom(subject, encrypt) }) {
Icon(Icons.Filled.Add, contentDescription = "crear")
}
}
Row(verticalAlignment = Alignment.CenterVertically) {
Switch(checked = encrypt, onCheckedChange = { encrypt = it })
Spacer(Modifier.width(8.dp))
Icon(Icons.Filled.Lock, contentDescription = null, modifier = Modifier.height(16.dp))
Text("cifrar (E2E)", style = MaterialTheme.typography.bodySmall)
}
Spacer(Modifier.height(4.dp))
Row(verticalAlignment = Alignment.CenterVertically) {
OutlinedTextField(
value = joinId,
onValueChange = { joinId = it },
label = { Text("unirse por room id") },
singleLine = true,
modifier = Modifier.weight(1f),
)
Spacer(Modifier.width(8.dp))
OutlinedButton(onClick = { if (joinId.isNotBlank()) vm.joinRoom(joinId) }) {
Text("Unir")
}
}
if (state.roomId.isNotEmpty()) {
Spacer(Modifier.height(4.dp))
Text(
"room: ${state.roomSubject} · ${state.roomId}",
style = MaterialTheme.typography.bodySmall,
color = MaterialTheme.colorScheme.onSurfaceVariant,
maxLines = 1,
overflow = TextOverflow.Ellipsis,
)
}
}
}
if (state.error != null) {
Text(
state.error,
color = MaterialTheme.colorScheme.error,
modifier = Modifier.fillMaxWidth().padding(vertical = 4.dp),
)
}
// Messages.
LazyColumn(
state = listState,
modifier = Modifier.weight(1f).fillMaxWidth(),
verticalArrangement = Arrangement.spacedBy(6.dp),
) {
itemsIndexed(state.messages, key = { i, m -> "${m.ts}-$i" }) { _, m ->
MessageBubble(m)
}
}
// Composer.
Row(
modifier = Modifier.fillMaxWidth().padding(vertical = 8.dp),
verticalAlignment = Alignment.CenterVertically,
) {
OutlinedTextField(
value = draft,
onValueChange = { draft = it },
placeholder = { Text("Mensaje…") },
singleLine = true,
enabled = state.roomId.isNotEmpty(),
modifier = Modifier.weight(1f),
)
Spacer(Modifier.width(8.dp))
IconButton(
onClick = {
vm.publish(draft)
draft = ""
},
enabled = state.roomId.isNotEmpty() && draft.isNotBlank(),
) {
Icon(Icons.AutoMirrored.Filled.Send, contentDescription = "enviar")
}
}
}
}
}
private val timeFmt = SimpleDateFormat("HH:mm:ss", Locale.getDefault())
@Composable
fun MessageBubble(m: ChatMessage) {
val align = if (m.mine) Alignment.End else Alignment.Start
Column(modifier = Modifier.fillMaxWidth(), horizontalAlignment = align) {
Card(
modifier = Modifier.fillMaxWidth(0.8f),
) {
Column(Modifier.padding(8.dp)) {
if (!m.mine) {
Text(
m.sender.take(12) + "",
style = MaterialTheme.typography.labelSmall,
color = MaterialTheme.colorScheme.primary,
)
}
Text(m.text, style = MaterialTheme.typography.bodyMedium)
Text(
timeFmt.format(Date(m.ts)),
style = MaterialTheme.typography.labelSmall,
color = MaterialTheme.colorScheme.onSurfaceVariant,
)
}
}
}
}
@@ -1,4 +0,0 @@
<?xml version="1.0" encoding="utf-8"?>
<resources>
<string name="app_name">unibus</string>
</resources>
@@ -1,6 +0,0 @@
<?xml version="1.0" encoding="utf-8"?>
<resources>
<!-- A minimal Material3 base theme; the real UI styling is driven by Compose
Material3 (MaterialTheme) at runtime. -->
<style name="Theme.Unibus" parent="android:Theme.Material.NoActionBar" />
</resources>
-8
View File
@@ -1,8 +0,0 @@
// Top-level build file. Plugin versions are declared here and applied in the
// module build scripts. AGP 8.5 + Kotlin 2.0 (with the dedicated Compose
// compiler plugin) target the locally installed SDK (compileSdk 34).
plugins {
id("com.android.application") version "8.5.2" apply false
id("org.jetbrains.kotlin.android") version "2.0.21" apply false
id("org.jetbrains.kotlin.plugin.compose") version "2.0.21" apply false
}
-5
View File
@@ -1,5 +0,0 @@
org.gradle.jvmargs=-Xmx2048m -Dfile.encoding=UTF-8
org.gradle.caching=true
android.useAndroidX=true
android.nonTransitiveRClass=true
kotlin.code.style=official
Binary file not shown.
-7
View File
@@ -1,7 +0,0 @@
distributionBase=GRADLE_USER_HOME
distributionPath=wrapper/dists
distributionUrl=https\://services.gradle.org/distributions/gradle-8.9-bin.zip
networkTimeout=10000
validateDistributionUrl=true
zipStoreBase=GRADLE_USER_HOME
zipStorePath=wrapper/dists
-252
View File
@@ -1,252 +0,0 @@
#!/bin/sh
#
# Copyright © 2015-2021 the original authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# SPDX-License-Identifier: Apache-2.0
#
##############################################################################
#
# Gradle start up script for POSIX generated by Gradle.
#
# Important for running:
#
# (1) You need a POSIX-compliant shell to run this script. If your /bin/sh is
# noncompliant, but you have some other compliant shell such as ksh or
# bash, then to run this script, type that shell name before the whole
# command line, like:
#
# ksh Gradle
#
# Busybox and similar reduced shells will NOT work, because this script
# requires all of these POSIX shell features:
# * functions;
# * expansions «$var», «${var}», «${var:-default}», «${var+SET}»,
# «${var#prefix}», «${var%suffix}», and «$( cmd )»;
# * compound commands having a testable exit status, especially «case»;
# * various built-in commands including «command», «set», and «ulimit».
#
# Important for patching:
#
# (2) This script targets any POSIX shell, so it avoids extensions provided
# by Bash, Ksh, etc; in particular arrays are avoided.
#
# The "traditional" practice of packing multiple parameters into a
# space-separated string is a well documented source of bugs and security
# problems, so this is (mostly) avoided, by progressively accumulating
# options in "$@", and eventually passing that to Java.
#
# Where the inherited environment variables (DEFAULT_JVM_OPTS, JAVA_OPTS,
# and GRADLE_OPTS) rely on word-splitting, this is performed explicitly;
# see the in-line comments for details.
#
# There are tweaks for specific operating systems such as AIX, CygWin,
# Darwin, MinGW, and NonStop.
#
# (3) This script is generated from the Groovy template
# https://github.com/gradle/gradle/blob/HEAD/platforms/jvm/plugins-application/src/main/resources/org/gradle/api/internal/plugins/unixStartScript.txt
# within the Gradle project.
#
# You can find Gradle at https://github.com/gradle/gradle/.
#
##############################################################################
# Attempt to set APP_HOME
# Resolve links: $0 may be a link
app_path=$0
# Need this for daisy-chained symlinks.
while
APP_HOME=${app_path%"${app_path##*/}"} # leaves a trailing /; empty if no leading path
[ -h "$app_path" ]
do
ls=$( ls -ld "$app_path" )
link=${ls#*' -> '}
case $link in #(
/*) app_path=$link ;; #(
*) app_path=$APP_HOME$link ;;
esac
done
# This is normally unused
# shellcheck disable=SC2034
APP_BASE_NAME=${0##*/}
# Discard cd standard output in case $CDPATH is set (https://github.com/gradle/gradle/issues/25036)
APP_HOME=$( cd -P "${APP_HOME:-./}" > /dev/null && printf '%s
' "$PWD" ) || exit
# Use the maximum available, or set MAX_FD != -1 to use that value.
MAX_FD=maximum
warn () {
echo "$*"
} >&2
die () {
echo
echo "$*"
echo
exit 1
} >&2
# OS specific support (must be 'true' or 'false').
cygwin=false
msys=false
darwin=false
nonstop=false
case "$( uname )" in #(
CYGWIN* ) cygwin=true ;; #(
Darwin* ) darwin=true ;; #(
MSYS* | MINGW* ) msys=true ;; #(
NONSTOP* ) nonstop=true ;;
esac
CLASSPATH=$APP_HOME/gradle/wrapper/gradle-wrapper.jar
# Determine the Java command to use to start the JVM.
if [ -n "$JAVA_HOME" ] ; then
if [ -x "$JAVA_HOME/jre/sh/java" ] ; then
# IBM's JDK on AIX uses strange locations for the executables
JAVACMD=$JAVA_HOME/jre/sh/java
else
JAVACMD=$JAVA_HOME/bin/java
fi
if [ ! -x "$JAVACMD" ] ; then
die "ERROR: JAVA_HOME is set to an invalid directory: $JAVA_HOME
Please set the JAVA_HOME variable in your environment to match the
location of your Java installation."
fi
else
JAVACMD=java
if ! command -v java >/dev/null 2>&1
then
die "ERROR: JAVA_HOME is not set and no 'java' command could be found in your PATH.
Please set the JAVA_HOME variable in your environment to match the
location of your Java installation."
fi
fi
# Increase the maximum file descriptors if we can.
if ! "$cygwin" && ! "$darwin" && ! "$nonstop" ; then
case $MAX_FD in #(
max*)
# In POSIX sh, ulimit -H is undefined. That's why the result is checked to see if it worked.
# shellcheck disable=SC2039,SC3045
MAX_FD=$( ulimit -H -n ) ||
warn "Could not query maximum file descriptor limit"
esac
case $MAX_FD in #(
'' | soft) :;; #(
*)
# In POSIX sh, ulimit -n is undefined. That's why the result is checked to see if it worked.
# shellcheck disable=SC2039,SC3045
ulimit -n "$MAX_FD" ||
warn "Could not set maximum file descriptor limit to $MAX_FD"
esac
fi
# Collect all arguments for the java command, stacking in reverse order:
# * args from the command line
# * the main class name
# * -classpath
# * -D...appname settings
# * --module-path (only if needed)
# * DEFAULT_JVM_OPTS, JAVA_OPTS, and GRADLE_OPTS environment variables.
# For Cygwin or MSYS, switch paths to Windows format before running java
if "$cygwin" || "$msys" ; then
APP_HOME=$( cygpath --path --mixed "$APP_HOME" )
CLASSPATH=$( cygpath --path --mixed "$CLASSPATH" )
JAVACMD=$( cygpath --unix "$JAVACMD" )
# Now convert the arguments - kludge to limit ourselves to /bin/sh
for arg do
if
case $arg in #(
-*) false ;; # don't mess with options #(
/?*) t=${arg#/} t=/${t%%/*} # looks like a POSIX filepath
[ -e "$t" ] ;; #(
*) false ;;
esac
then
arg=$( cygpath --path --ignore --mixed "$arg" )
fi
# Roll the args list around exactly as many times as the number of
# args, so each arg winds up back in the position where it started, but
# possibly modified.
#
# NB: a `for` loop captures its iteration list before it begins, so
# changing the positional parameters here affects neither the number of
# iterations, nor the values presented in `arg`.
shift # remove old arg
set -- "$@" "$arg" # push replacement arg
done
fi
# Add default JVM options here. You can also use JAVA_OPTS and GRADLE_OPTS to pass JVM options to this script.
DEFAULT_JVM_OPTS='"-Xmx64m" "-Xms64m"'
# Collect all arguments for the java command:
# * DEFAULT_JVM_OPTS, JAVA_OPTS, JAVA_OPTS, and optsEnvironmentVar are not allowed to contain shell fragments,
# and any embedded shellness will be escaped.
# * For example: A user cannot expect ${Hostname} to be expanded, as it is an environment variable and will be
# treated as '${Hostname}' itself on the command line.
set -- \
"-Dorg.gradle.appname=$APP_BASE_NAME" \
-classpath "$CLASSPATH" \
org.gradle.wrapper.GradleWrapperMain \
"$@"
# Stop when "xargs" is not available.
if ! command -v xargs >/dev/null 2>&1
then
die "xargs is not available"
fi
# Use "xargs" to parse quoted args.
#
# With -n1 it outputs one arg per line, with the quotes and backslashes removed.
#
# In Bash we could simply go:
#
# readarray ARGS < <( xargs -n1 <<<"$var" ) &&
# set -- "${ARGS[@]}" "$@"
#
# but POSIX shell has neither arrays nor command substitution, so instead we
# post-process each arg (as a line of input to sed) to backslash-escape any
# character that might be a shell metacharacter, then use eval to reverse
# that process (while maintaining the separation between arguments), and wrap
# the whole thing up as a single "set" statement.
#
# This will of course break if any of these variables contains a newline or
# an unmatched quote.
#
eval "set -- $(
printf '%s\n' "$DEFAULT_JVM_OPTS $JAVA_OPTS $GRADLE_OPTS" |
xargs -n1 |
sed ' s~[^-[:alnum:]+,./:=@_]~\\&~g; ' |
tr '\n' ' '
)" '"$@"'
exec "$JAVACMD" "$@"
-94
View File
@@ -1,94 +0,0 @@
@rem
@rem Copyright 2015 the original author or authors.
@rem
@rem Licensed under the Apache License, Version 2.0 (the "License");
@rem you may not use this file except in compliance with the License.
@rem You may obtain a copy of the License at
@rem
@rem https://www.apache.org/licenses/LICENSE-2.0
@rem
@rem Unless required by applicable law or agreed to in writing, software
@rem distributed under the License is distributed on an "AS IS" BASIS,
@rem WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
@rem See the License for the specific language governing permissions and
@rem limitations under the License.
@rem
@rem SPDX-License-Identifier: Apache-2.0
@rem
@if "%DEBUG%"=="" @echo off
@rem ##########################################################################
@rem
@rem Gradle startup script for Windows
@rem
@rem ##########################################################################
@rem Set local scope for the variables with windows NT shell
if "%OS%"=="Windows_NT" setlocal
set DIRNAME=%~dp0
if "%DIRNAME%"=="" set DIRNAME=.
@rem This is normally unused
set APP_BASE_NAME=%~n0
set APP_HOME=%DIRNAME%
@rem Resolve any "." and ".." in APP_HOME to make it shorter.
for %%i in ("%APP_HOME%") do set APP_HOME=%%~fi
@rem Add default JVM options here. You can also use JAVA_OPTS and GRADLE_OPTS to pass JVM options to this script.
set DEFAULT_JVM_OPTS="-Xmx64m" "-Xms64m"
@rem Find java.exe
if defined JAVA_HOME goto findJavaFromJavaHome
set JAVA_EXE=java.exe
%JAVA_EXE% -version >NUL 2>&1
if %ERRORLEVEL% equ 0 goto execute
echo. 1>&2
echo ERROR: JAVA_HOME is not set and no 'java' command could be found in your PATH. 1>&2
echo. 1>&2
echo Please set the JAVA_HOME variable in your environment to match the 1>&2
echo location of your Java installation. 1>&2
goto fail
:findJavaFromJavaHome
set JAVA_HOME=%JAVA_HOME:"=%
set JAVA_EXE=%JAVA_HOME%/bin/java.exe
if exist "%JAVA_EXE%" goto execute
echo. 1>&2
echo ERROR: JAVA_HOME is set to an invalid directory: %JAVA_HOME% 1>&2
echo. 1>&2
echo Please set the JAVA_HOME variable in your environment to match the 1>&2
echo location of your Java installation. 1>&2
goto fail
:execute
@rem Setup the command line
set CLASSPATH=%APP_HOME%\gradle\wrapper\gradle-wrapper.jar
@rem Execute Gradle
"%JAVA_EXE%" %DEFAULT_JVM_OPTS% %JAVA_OPTS% %GRADLE_OPTS% "-Dorg.gradle.appname=%APP_BASE_NAME%" -classpath "%CLASSPATH%" org.gradle.wrapper.GradleWrapperMain %*
:end
@rem End local scope for the variables with windows NT shell
if %ERRORLEVEL% equ 0 goto mainEnd
:fail
rem Set variable GRADLE_EXIT_CONSOLE if you need the _script_ return code instead of
rem the _cmd.exe /c_ return code!
set EXIT_CODE=%ERRORLEVEL%
if %EXIT_CODE% equ 0 set EXIT_CODE=1
if not ""=="%GRADLE_EXIT_CONSOLE%" exit %EXIT_CODE%
exit /b %EXIT_CODE%
:mainEnd
if "%OS%"=="Windows_NT" endlocal
:omega
-23
View File
@@ -1,23 +0,0 @@
pluginManagement {
repositories {
google {
content {
includeGroupByRegex("com\\.android.*")
includeGroupByRegex("com\\.google.*")
includeGroupByRegex("androidx.*")
}
}
mavenCentral()
gradlePluginPortal()
}
}
dependencyResolutionManagement {
repositoriesMode.set(RepositoriesMode.FAIL_ON_PROJECT_REPOS)
repositories {
google()
mavenCentral()
}
}
rootProject.name = "unibus"
include(":app")
+140 -1
View File
@@ -2,7 +2,7 @@
name: unibus
lang: go
domain: infra
version: 0.4.0
version: 0.10.0
description: "Bus de mensajería unificado sobre NATS+JetStream con cifrado E2E por room (megolm/olm reducido): service de membresía/claves, librería cliente y peers demo."
tags: [service, messaging, nats, e2e]
uses_functions:
@@ -122,6 +122,21 @@ Para apuntar a un NATS externo en producción: `--nats-url nats://host:4222` en
las rutas GET de lectura. Confía en la red interna. Las rutas mutantes
(`/rooms`, `/invite`, `/rekey`) sí exigen firma Ed25519 del owner sobre los
bytes canónicos de la request. Endurecer es fase posterior.
- **Gestión de usuarios: storage unificado, alta por dos vías.** El allowlist de
usuarios vive en el MISMO store que las rooms (`pkg/membership.Store`): SQLite en
single-node, JetStream KV replicado (`UNIBUS_users`) en cluster. El `Server` ya
tiene ese store privilegiado abierto (es quien sirve el KV en cada nodo), así que
expone `GET/POST /users` y `POST /users/{signpub}/revoke` como API HTTP admin-only,
simétrica con las rutas de rooms: el panel de administración firma como admin y el
server ejecuta la mutación contra el mismo store. El panel NO necesita `--db`, ni la
identidad interna, ni correr en un nodo del cluster; funciona idéntico en single-node
y cluster. La autorización es default-deny: solo un firmante que el store confirma como
`role == "admin"` activo pasa, cualquier otro recibe 403 (encima de la firma+nonce+TLS
ya existentes). La CLI `membershipd user add --store kv` sigue existiendo SOLO para
sembrar el admin #0 (bootstrap del huevo-gallina: sin un admin sembrado no hay quién
firme el primer `POST /users`); a partir de ahí toda la gestión es HTTP admin-only. El
alta es idempotente igual que la CLI: re-alta de una clave ya registrada = 409, sin
sobrescribir ni elevar rol; el revoke es un flip de status (sin hard-delete), auditable.
- **Identidad = secreto crítico.** El archivo de identidad (`worker.id`,
`chat.id`) contiene las claves privadas (Ed25519 + X25519). Se escribe 0600.
Perderlo = mensajes ilegibles, sin recuperación. Trátalo como una clave SSH.
@@ -154,6 +169,130 @@ agent.<nombre>.{in,out} inbox/outbox de agente LLM (agent.scout.in)
## Capability growth log
- v0.10.0 (2026-06-07) — API HTTP admin-only de gestión de usuarios, cerrando la
última asimetría del control plane: las rooms tenían superficie HTTP firmada
(`POST /rooms`, etc.) pero los users solo se gestionaban por CLI local o acceso
directo al store. Se añaden `GET /users` (lista completa, incluidos revocados),
`POST /users` (alta `{sign_pub, handle, role}`: valida hex de 64 chars + role en
`{admin, member}`, 409 idempotente que no sobrescribe ni eleva rol) y
`POST /users/{signpub}/revoke` (flip de status, sin hard-delete). Los tres pasan por
un helper `requireAdmin` default-deny que confirma contra el store que el firmante
autenticado es un user `role == "admin"` activo (el endpoint id es un hash one-way de
la clave, así que el contexto lleva ahora también el `sign_pub` hex del firmante para
resolver `GetUser`); cualquier otro firmante recibe 403, encima de la firma+nonce+TLS+
enforce ya heredadas del middleware. NO se abre conexión KV nueva ni se usa la identidad
interna: el server escribe vía su `s.store` privilegiado, el MISMO que las rooms (SQLite
single-node, KV `UNIBUS_users` en cluster). `pkg/client` gana `ListUsers/AddUser/RevokeUser`
(tipo plano `UserInfo`) firmando como admin, así la pestaña Users del panel deja de
necesitar `--db`/acceso KV directo. La CLI `membershipd user add --store kv` queda SOLO
para sembrar el admin #0 (bootstrap). La validación de `sign_pub` se unifica en
`membership.ValidateSignPubHex`, reusada por la CLI y los handlers. Tests nuevos:
no-admin → 403 en los tres endpoints, roundtrip admin add→list→revoke, y validación
(hex inválido → 400, role inválido → 400, re-alta → 409), más un test de cliente contra
un membershipd embebido. Cambios 100% aditivos: el comportamiento single-node y de las
rutas de rooms no cambia; vet/build/test verdes.
- v0.9.0 (2026-06-07) — cierre de los gaps que el despliegue del cluster (report
0011) dejó abiertos (report 0012). (GAP A) Nueva capability `membershipd user
add|list|revoke --store kv`: alta/baja de usuarios contra el KV replicado del
cluster EN MARCHA, sin el procedimiento de parar-sembrar-rearrancar. Usa la
conexión interna privilegiada — el daemon persiste su identidad de servicio con
`--internal-id-file` (cada nodo genera/carga la suya, 0600 junto a las claves TLS)
y la CLI, ejecutada por loopback en un nodo, presenta esa nkey que el
autenticador reconoce con permisos plenos de JetStream; ninguna identidad de
usuario normal puede tocar los buckets `KV_UNIBUS_*` bajo la ACL por-subject. El
alta es idempotente (re-alta de la misma clave = `ErrUserExists` explícito, sin
sobrescribir ni elevar rol), commitea con quórum 2/3 (HA, imprime
`followers_current`) y rechaza un destino remoto sin `--ca` (igual que
`migrate-to-kv`). (GAP B) Nuevo `cmd/clientcheck`: verificación end-to-end real
con un cliente autenticado (identidad operator, nkey+TLS+https) que crea una room
E2E, publica y recibe descifrado contra el cluster vivo, incluido un nodo parado a
media transmisión donde el cliente hace failover a un superviviente y sigue
recibiendo con cero pérdida (quórum 2/3) — el plano de datos que el chaos test del
0011 nunca probó. (GAP C) Runbook `deploy/cluster/README.md` corregido: el orden
de arranque "magnus solo y verifica healthz" deadlockeaba (un nodo solo no tiene
quórum del meta-group y nunca sirve healthz); se documenta el arranque por quórum,
que R1 es un SPOF inservible (ir directo a R3) y la nueva vía de alta con el
cluster vivo. La plantilla de deploy (unit + `deploy-cluster.sh`) emite ya
`INTERNAL_ID_FILE` y el flag. Verificado contra los 3 VPS reales (magnus + homer +
datardos); posture enforce+ACL+TLS+R3 intacta.
- v0.8.0 (2026-06-07) — completar y endurecer el cluster (issue 0006, fases
0006a0006g) que cierra los bloqueantes de la auditoría dedicada del cluster
(report 0008) y cablea el control plane descentralizado que 0003 dejó a medias.
(0006a) Se cablea el nonce replicado en el binario: un nodo con `--cluster-name`
usa el bucket JetStream KV compartido obligatoriamente (fail-fast si no se crea),
cerrando el replay cross-node (N3); el "ciclo bootstrap" se resuelve con una
identidad interna efímera que el authenticator reconoce (full perms) y una
conexión in-process privilegiada. (0006b) Se cierra la fuga del control plane
por `$JS.API.>` (N2): la ACL pasa a un allow-set cerrado por-room (JS API solo de
los streams `UNIBUS_<room>` del peer), dejando `KV_UNIBUS_*`/`OBJ_*` fuera del
set y, por tanto, denegados. (0006c) Se cablea el store KV descentralizado
(`--store kv|sqlite`, default sqlite = baseline idéntico) con un `storeHolder`
fail-closed que rompe el ciclo bootstrap del authenticator. (0006d) Posture
homogénea: un nodo rechaza unirse al cluster sin `enforce`, y `/healthz` publica
la posture (N1). (0006e) Todos los clientes llaman `RefreshSession` tras cambios
de membresía (N4), de modo que la ACL es usable bajo enforce sin desactivarla.
(0006f) Bajos: secreto de cluster fuera de argv (`--cluster-pass-file`/env +
inyección en routes), `migrate-to-kv` rechaza target remoto sin `--ca`, y docs
de CA separada para routes + R1 SPOF vs R3 HA. (0006g) Material de deploy del
cluster de 3 nodos (magnus+homer+datardos) en `deploy/cluster/` (certs, unit,
script de despliegue dry-run, runbook) — sin tocar ningún VPS. Toda la
regresión de auditorías previas + los ataques 0008 siguen verdes; govulncheck 0
alcanzables. Branch-by-abstraction: con `--store sqlite` el single-node sigue
idéntico y desplegable en todo momento.
- v0.7.0 (2026-06-07) — hardening de seguridad 2 (issue 0005, fases 0005a0005e)
que cierra los hallazgos nuevos de la re-auditoría red-team (report 0006) y
lleva el veredicto de exposición pública a "sí-con-condiciones". (0005a) Bump de
`github.com/nats-io/nats-server/v2` v2.10.22→v2.11.15 y de la toolchain a
go1.26.4: `govulncheck ./...` pasa de 16 vulnerabilidades alcanzables (14 del
servidor NATS embebido + 2 de la stdlib) a 0. (0005b) `client.processFrame`
ahora descarta cualquier frame sin firma en una room `SignMsgs` (antes verificaba
solo si la firma venía presente, lo que permitía suplantar `Sender` con
`Sig==nil`). (0005c) Nuevo limiter global de bytes en vuelo
(`pkg/membership.inflightLimiter`) que acota la memoria agregada que el control
plane bufferiza bajo concurrencia (el límite por-request y el rate-limit por-IP
no acotaban el total): un flood concurrente multi-IP se descarta con 503 en vez
de crecer sin techo (el RSS deja de escalar con N). (0005d) El guard de arranque
`validateBootConfig` ahora exige `--tls-cert/--tls-key` en bind no-loopback (un
control plane público sin TLS servía metadata en claro). (0005e) Se cablea por
fin en `membershipd` la ACL por subject que ya existía huérfana desde 0003e
(`busauth.NewNkeyAuthenticatorACL` + nuevo adaptador `busauth.PermissionsFromSubjects`
sobre `membership.SubjectACLFor`): un registrado no-miembro ya no puede
`Subscribe(">")` y captar los subjects/advisories de rooms ajenas. Residuales
documentados: `$JS.API.>` sigue compartido (cierre completo = NATS accounts por
identidad, diferido) y los clientes deben `RefreshSession` tras cambios de
membresía (chat/worker aún no lo hacen). El comportamiento de un solo nodo no
cambia y master sigue verde.
- v0.6.0 (2026-06-07) — descentralización / alta disponibilidad (issue 0003,
fases 0003a0003e), report 0006. El servidor NATS embebido gana soporte de
cluster con routes autenticadas (secreto de cluster) y TLS mutuo de nodo
(`pkg/embeddednats.ClusterConfig` + `busauth.RouteTLSConfig`, reusando la CA
del 0001). El control plane (`pkg/membership.Store`) pasa a interfaz por
branch-by-abstraction: `sqliteStore` (default) + `jetstreamStore` nuevo sobre
JetStream KV replicado (réplicas configurables R1→R3), con `IsAuthorized`
fail-closed ante pérdida de quorum. `membershipd migrate-to-kv` mueve el
estado SQLite→KV de forma idempotente con backup previo. Los blobs
(`pkg/blobstore.Store`, ahora interfaz) ganan un backend NATS Object Store
replicado además del disco. El cliente acepta listas de seeds NATS y de
control planes con failover/reconnect nativo, el anti-replay pasa a un store
de nonces compartido en KV con TTL (cierra el agujero de replay multi-nodo), y
se implementa la ACL por subject derivada de pertenencia (audit H4 residual:
`busauth.NewNkeyAuthenticatorACL` + `membership.SubjectACLFor` +
`client.RefreshSession`). Todo viaja detrás del flag `decentralized` (off):
el comportamiento de un solo nodo (SQLite + disco) no cambia y master sigue
verde. El despliegue multi-nodo real (0003f) lo ejecuta el humano.
- v0.5.0 (2026-06-07) — hardening de seguridad (issue 0004) que cierra los
hallazgos de la auditoría red-team (report 0004) y lleva el veredicto de
exposición pública de "NO" a "sí-con-condiciones". Anti-DoS pre-auth
(`http.MaxBytesReader` por ruta + rechazo por `Content-Length` + rate-limit
por IP + `MaxHeaderBytes`); guard de fail-open que prohíbe arrancar con bind
público o TLS sin `--bus-auth enforce`; autorización por pertenencia en los GET
de room (metadata y clave sellada solo para miembros / el propio endpoint);
rooms cleartext deshabilitadas en bind público (contenido siempre E2E, mínimo
defensivo del data plane mientras la ACL por subject llega con 0003); TLS en el
control plane HTTP con la CA propia y cliente que exige `https` cuando hay CA;
y los medios H6/H7/H12 (owner ligado al firmante, `IsAuthorized` antes del
nonce-cache con poda O(expired) + cap, errores genéricos al cliente). Cada
hallazgo lleva su test adversarial `TestAudit_*` portado como regresión.
- v0.4.0 (2026-06-07) — descubrimiento de rooms: `GET /members/{endpoint}/rooms`
lista las rooms de un endpoint con su metadata y rol, y `client.ListMyRooms()`
lo consume. El control plane es pull (no hay push de invitaciones), así que un
+28 -12
View File
@@ -27,11 +27,12 @@ import (
func main() {
var (
natsURL = flag.String("nats-url", "nats://127.0.0.1:4250", "NATS url")
ctrlURL = flag.String("ctrl-url", "http://127.0.0.1:8470", "membershipd control-plane url")
roomSub = flag.String("room", "proc.test.ticks", "room subject to subscribe to")
idFile = flag.String("id-file", "./local_files/chat.id", "identity file path")
demoEnc = flag.Bool("demo-encrypted", false, "run the encrypted forward-secrecy demo")
natsURL = flag.String("nats-url", "nats://127.0.0.1:4250", "NATS url")
ctrlURL = flag.String("ctrl-url", "http://127.0.0.1:8470", "membershipd control-plane url")
roomSub = flag.String("room", "proc.test.ticks", "room subject to subscribe to")
idFile = flag.String("id-file", "./local_files/chat.id", "identity file path")
demoEnc = flag.Bool("demo-encrypted", false, "run the encrypted forward-secrecy demo")
caFile = flag.String("ca", "", "path to the bus CA cert (ca.crt); set to connect with TLS + nkey to a secured bus")
)
flag.Parse()
@@ -39,19 +40,19 @@ func main() {
log.SetPrefix("[chat] ")
if *demoEnc {
runEncryptedDemo(*natsURL, *ctrlURL)
runEncryptedDemo(*natsURL, *ctrlURL, *caFile)
return
}
runSimple(*natsURL, *ctrlURL, *roomSub, *idFile)
runSimple(*natsURL, *ctrlURL, *roomSub, *idFile, *caFile)
}
// runSimple subscribes to a cleartext subject and prints messages live.
func runSimple(natsURL, ctrlURL, roomSub, idFile string) {
func runSimple(natsURL, ctrlURL, roomSub, idFile, caFile string) {
id, err := client.LoadOrCreateIdentity(idFile)
if err != nil {
log.Fatalf("identity: %v", err)
}
c, err := client.New(natsURL, ctrlURL, id)
c, err := client.Connect(natsURL, ctrlURL, id, caFile)
if err != nil {
log.Fatalf("connect: %v", err)
}
@@ -68,6 +69,12 @@ func runSimple(natsURL, ctrlURL, roomSub, idFile string) {
if err := c.Join(roomID); err != nil {
log.Fatalf("join: %v", err)
}
// Membership-change contract (issue 0006e): refresh so the just-created room's
// subject is subscribable under enforce+ACL (permissions are frozen at connect
// time). Must run BEFORE Subscribe — RefreshSession drops active subscriptions.
if err := c.RefreshSession(); err != nil {
log.Fatalf("refresh session after create room: %v", err)
}
sub, err := c.Subscribe(roomID, func(f frame.Frame, plaintext []byte) {
fmt.Printf("[%s] %s: %s\n", f.Subject, shortID(f.Sender), string(plaintext))
})
@@ -91,7 +98,7 @@ func shortID(id string) string {
}
// runEncryptedDemo proves E2E encryption + forward secrecy end-to-end.
func runEncryptedDemo(natsURL, ctrlURL string) {
func runEncryptedDemo(natsURL, ctrlURL, caFile string) {
log.Printf("=== encrypted forward-secrecy demo ===")
pass := true
check := func(name string, ok bool) {
@@ -109,10 +116,10 @@ func runEncryptedDemo(natsURL, ctrlURL string) {
idB, err := newEphemeralIdentity()
must(err, "generate B identity")
a, err := client.New(natsURL, ctrlURL, idA)
a, err := client.Connect(natsURL, ctrlURL, idA, caFile)
must(err, "connect A")
defer a.Close()
b, err := client.New(natsURL, ctrlURL, idB)
b, err := client.Connect(natsURL, ctrlURL, idB, caFile)
must(err, "connect B")
defer b.Close()
@@ -121,12 +128,21 @@ func runEncryptedDemo(natsURL, ctrlURL string) {
must(err, "A create room")
fmt.Printf(" room.test -> %s (E2E, persisted, signed)\n", roomID)
// Membership-change contract (issue 0006e): A only became a member of this room
// after connecting, so refresh to gain its subject + per-room JetStream API
// under enforce+ACL before publishing.
must(a.RefreshSession(), "A refresh after create room")
// A invites B (seals K to B's X25519 key).
must(a.Invite(roomID, b.Endpoint()), "A invite B")
// B joins (fetches + decrypts K).
must(b.Join(roomID), "B join")
// B became a member via the invite above; refresh so B can subscribe to the
// room's subject under enforce+ACL (before subscribing — refresh drops subs).
must(b.RefreshSession(), "B refresh after join")
// B subscribes; capture received plaintexts.
recv := make(chan string, 4)
subB, err := b.Subscribe(roomID, func(f frame.Frame, plaintext []byte) {
+260
View File
@@ -0,0 +1,260 @@
// Command clientcheck is an end-to-end verification client for a live unibus
// cluster (issue 0011 GAP B). The 0011 chaos test validated only the control
// plane (healthz + meta/stream-leader failover + KV readable with 2/3); it never
// connected an authenticated bus client (nkey + TLS) to create a room and
// publish/subscribe through it, least of all across a node loss. clientcheck does
// exactly that with a real identity (the operator), so the data-plane end-to-end
// path — connect, create an E2E room, publish, receive decrypted — is exercised
// against the running cluster, including while a node is stopped.
//
// It is a reusable tool, not a throwaway script: point it at the cluster's CA,
// an identity file, and the NATS + control-plane seed lists.
//
// # golden: connect, create an E2E room, publish N, confirm N decrypted back
// clientcheck --ca ca.crt --identity-file operator.id \
// --nats-seeds nats://A:4250,nats://B:4250,nats://C:4250 \
// --ctrl-seeds https://A:8470,https://B:8470,https://C:8470 --messages 5
//
// # loop: publish a counter every interval for the duration, logging the node
// # it is attached to — stop a node mid-run (systemctl stop membershipd-cluster)
// # and watch it fail over to a survivor and keep receiving (quorum 2/3).
// clientcheck ... --mode loop --duration 45s --interval 1s
package main
import (
"crypto/rand"
"encoding/hex"
"flag"
"fmt"
"log"
"sort"
"strings"
"sync"
"time"
"github.com/enmanuel/unibus/pkg/busauth"
"github.com/enmanuel/unibus/pkg/client"
"github.com/enmanuel/unibus/pkg/frame"
"github.com/enmanuel/unibus/pkg/room"
)
func main() {
var (
caPath = flag.String("ca", "", "bus CA cert pinning TLS on both planes (required for a secured cluster)")
idFile = flag.String("identity-file", "", "path to the client identity JSON (e.g. `pass show unibus/operator-identity` written 0600) (required)")
natsSeeds = flag.String("nats-seeds", "", "comma-separated NATS urls of the cluster nodes (required)")
ctrlSeeds = flag.String("ctrl-seeds", "", "comma-separated control-plane https urls of the cluster nodes (required)")
subject = flag.String("subject", "test.gapcheck", "test room subject PREFIX; a random token is appended so runs never collide with real rooms")
messages = flag.Int("messages", 5, "golden mode: number of messages to publish and expect back")
mode = flag.String("mode", "golden", "golden (publish N, verify N decrypted) | loop (publish a counter for --duration, for failover testing)")
duration = flag.Duration("duration", 30*time.Second, "loop mode: how long to keep publishing")
interval = flag.Duration("interval", 1*time.Second, "loop mode: delay between published messages")
)
flag.Parse()
if *idFile == "" || *natsSeeds == "" || *ctrlSeeds == "" {
log.Fatalf("clientcheck: --identity-file, --nats-seeds and --ctrl-seeds are required")
}
id, err := client.LoadIdentity(*idFile)
if err != nil {
log.Fatalf("clientcheck: load identity: %v", err)
}
natsList := splitCSV(*natsSeeds)
ctrlList := splitCSV(*ctrlSeeds)
if len(natsList) == 0 || len(ctrlList) == 0 {
log.Fatalf("clientcheck: empty --nats-seeds or --ctrl-seeds")
}
// Build the secure client options: nkey on the data plane, TLS pinned to the
// bus CA on both planes, and the FULL seed lists so nats.go fails over to a
// surviving node when the attached one dies (the failover this tool verifies).
opts := client.Options{
NatsServers: natsList[1:],
CtrlURLs: ctrlList[1:],
}
if *caPath != "" {
tlsCfg, err := busauth.LoadCATLSConfig(*caPath)
if err != nil {
log.Fatalf("clientcheck: load CA: %v", err)
}
opts.UseNkey = true
opts.TLS = tlsCfg
opts.CtrlTLS = tlsCfg
for _, u := range ctrlList {
if !strings.HasPrefix(u, "https://") {
log.Fatalf("clientcheck: control URL %q must be https:// when --ca is set", u)
}
}
}
c, err := client.NewWithOptions(natsList[0], ctrlList[0], id, opts)
if err != nil {
log.Fatalf("clientcheck: connect: %v", err)
}
defer c.Close()
log.Printf("connected: endpoint=%s nats=%s", c.Endpoint().ID, c.ConnectedServer())
// Create an EPHEMERAL E2E room (encrypted + signed, NOT persisted): the test
// stays end-to-end encrypted (the cluster requires encryption on a public
// bind) while leaving no durable JetStream stream behind. The random subject
// token guarantees the room is unique and never a real room.
rnd := make([]byte, 8)
if _, err := rand.Read(rnd); err != nil {
log.Fatalf("clientcheck: random: %v", err)
}
subj := fmt.Sprintf("%s.%s", *subject, hex.EncodeToString(rnd))
policy := room.Policy{Encrypt: true, Persist: false, SignMsgs: true}
roomID, err := c.CreateRoom(subj, policy)
if err != nil {
log.Fatalf("clientcheck: create room: %v", err)
}
log.Printf("created E2E room: id=%s subject=%s (encrypt=%v sign=%v persist=%v)", roomID, subj, policy.Encrypt, policy.SignMsgs, policy.Persist)
// Under the per-subject ACL, NATS freezes permissions at connect time, so the
// just-created room's subject is not yet publishable/subscribable on the live
// connection. RefreshSession reconnects so the authenticator re-derives the
// ACL (now including this room) — the post-0006 contract every client follows
// after a membership change.
if err := c.RefreshSession(); err != nil {
log.Fatalf("clientcheck: refresh session: %v", err)
}
switch *mode {
case "golden":
runGolden(c, roomID, *messages)
case "loop":
runLoop(c, roomID, *duration, *interval)
default:
log.Fatalf("clientcheck: --mode must be golden or loop, got %q", *mode)
}
}
// runGolden subscribes, publishes n messages, and asserts all n come back
// decrypted. Exits non-zero if any are missing.
func runGolden(c *client.Client, roomID string, n int) {
var mu sync.Mutex
got := map[string]bool{}
sub, err := c.Subscribe(roomID, func(_ frame.Frame, plaintext []byte) {
mu.Lock()
got[string(plaintext)] = true
mu.Unlock()
})
if err != nil {
log.Fatalf("clientcheck: subscribe: %v", err)
}
defer sub.Unsubscribe()
time.Sleep(300 * time.Millisecond) // let the subscription settle
want := make([]string, n)
for i := 0; i < n; i++ {
msg := fmt.Sprintf("gapcheck-e2e-%d", i)
want[i] = msg
if err := c.Publish(roomID, []byte(msg)); err != nil {
log.Fatalf("clientcheck: publish %d: %v", i, err)
}
}
log.Printf("published %d messages to %s; waiting for decrypted echoes...", n, roomID)
deadline := time.Now().Add(15 * time.Second)
for time.Now().Before(deadline) {
mu.Lock()
have := len(got)
mu.Unlock()
if have >= n {
break
}
time.Sleep(100 * time.Millisecond)
}
mu.Lock()
defer mu.Unlock()
missing := 0
for _, w := range want {
if !got[w] {
missing++
log.Printf(" MISSING: %q", w)
}
}
log.Printf("connected node at finish: %s", c.ConnectedServer())
if missing > 0 {
log.Fatalf("GOLDEN FAIL: %d/%d messages not received decrypted", missing, n)
}
log.Printf("GOLDEN OK: all %d messages received and decrypted end-to-end", n)
}
// runLoop publishes a numbered message every interval for the duration and logs
// the count received plus the node currently attached, so an operator stopping a
// cluster node mid-run sees the client fail over to a survivor and keep receiving
// (quorum 2/3). It is the live failover-with-a-connected-client test the 0011
// chaos run never performed.
func runLoop(c *client.Client, roomID string, duration, interval time.Duration) {
var mu sync.Mutex
received := 0
servers := map[string]int{} // node -> #ticks observed attached
sub, err := c.Subscribe(roomID, func(_ frame.Frame, _ []byte) {
mu.Lock()
received++
mu.Unlock()
})
if err != nil {
log.Fatalf("clientcheck: subscribe: %v", err)
}
defer sub.Unsubscribe()
time.Sleep(300 * time.Millisecond)
log.Printf("loop: publishing every %s for %s — stop a node now to test failover", interval, duration)
end := time.Now().Add(duration)
sent := 0
for time.Now().Before(end) {
msg := fmt.Sprintf("gapcheck-loop-%d", sent)
err := c.Publish(roomID, []byte(msg))
sent++
mu.Lock()
recv := received
mu.Unlock()
node := c.ConnectedServer()
up := c.IsConnected()
if node != "" {
mu.Lock()
servers[node]++
mu.Unlock()
}
pubStatus := "ok"
if err != nil {
pubStatus = "ERR:" + err.Error()
}
log.Printf(" t=%2ds sent=%d recv=%d up=%v node=%s publish=%s",
sent, sent, recv, up, node, pubStatus)
time.Sleep(interval)
}
mu.Lock()
defer mu.Unlock()
log.Printf("loop done: sent=%d received=%d", sent, received)
nodes := make([]string, 0, len(servers))
for n := range servers {
nodes = append(nodes, n)
}
sort.Strings(nodes)
for _, n := range nodes {
log.Printf(" attached to %s for %d ticks", n, servers[n])
}
if len(servers) > 1 {
log.Printf("FAILOVER OBSERVED: client was attached to %d distinct nodes across the run", len(servers))
}
if received == 0 {
log.Fatalf("LOOP FAIL: received 0 messages")
}
log.Printf("LOOP OK: client kept receiving across the run (received=%d)", received)
}
func splitCSV(s string) []string {
var out []string
for _, p := range strings.Split(s, ",") {
if p = strings.TrimSpace(p); p != "" {
out = append(out, p)
}
}
return out
}
+221
View File
@@ -0,0 +1,221 @@
package main
// Regression for audit report 0008, vector N3: the binary must wire the
// replicated nonce store on a clustered node so a signed request accepted on one
// node cannot be replayed to another. The auditor's ephemeral attack showed the
// OLD binary never called UseReplicatedNonces (each node kept a per-process
// cache), so a captured request replayed to a second node with 200+200. These
// tests drive the SAME helper the binary uses (wireReplicatedNonces) so they
// prove the WIRING, not just the underlying API.
import (
"bytes"
"crypto/rand"
"encoding/base64"
"encoding/hex"
"io"
"net"
"net/http"
"net/http/httptest"
"path/filepath"
"strconv"
"testing"
"time"
cs "fn-registry/functions/cybersecurity"
"github.com/enmanuel/unibus/pkg/blobstore"
"github.com/enmanuel/unibus/pkg/embeddednats"
"github.com/enmanuel/unibus/pkg/frame"
"github.com/enmanuel/unibus/pkg/membership"
"github.com/nats-io/nats.go"
"github.com/nats-io/nats.go/jetstream"
)
func freePort(t *testing.T) int {
t.Helper()
l, err := net.Listen("tcp", "127.0.0.1:0")
if err != nil {
t.Fatalf("free port: %v", err)
}
defer l.Close()
return l.Addr().(*net.TCPAddr).Port
}
// signed008 builds a transport-signed control-plane request with a caller-chosen
// ts+nonce, so a test can reuse the exact same signed bytes against two nodes to
// exercise replay.
func signed008(t *testing.T, baseURL, method, path string, body []byte, id cs.Identity, ts int64, nonce string) *http.Request {
t.Helper()
canonical := membership.CanonicalRequest(method, path, strconv.FormatInt(ts, 10), nonce, body)
sig := cs.SignEd25519(id.SignPriv, canonical)
var rdr io.Reader
if body != nil {
rdr = bytes.NewReader(body)
}
req, err := http.NewRequest(method, baseURL+path, rdr)
if err != nil {
t.Fatalf("new request: %v", err)
}
req.Header.Set("X-Unibus-Pub", hex.EncodeToString(id.SignPub))
req.Header.Set("X-Unibus-Ts", strconv.FormatInt(ts, 10))
req.Header.Set("X-Unibus-Nonce", nonce)
req.Header.Set("X-Unibus-Sig", base64.StdEncoding.EncodeToString(sig))
return req
}
func randNonce(t *testing.T) string {
t.Helper()
raw := make([]byte, 16)
if _, err := rand.Read(raw); err != nil {
t.Fatalf("nonce: %v", err)
}
return base64.StdEncoding.EncodeToString(raw)
}
// TestAttack0008_N3 is the blocker regression: two clustered membershipd nodes
// wired through wireReplicatedNonces share a JetStream KV nonce bucket, so a
// request accepted on node A is rejected (401) when replayed to node B. Before
// the fix the binary never wired this and the replay returned 200.
func TestAttack0008_N3(t *testing.T) {
// One NATS+JetStream backing the shared nonce bucket (no client auth needed:
// the test drives the membership.Server's nonce store directly via HTTP).
ns, err := embeddednats.StartServer(embeddednats.ServerConfig{
StoreDir: t.TempDir(), Host: "127.0.0.1", Port: freePort(t),
})
if err != nil {
t.Fatalf("nats: %v", err)
}
t.Cleanup(func() { ns.Shutdown(); ns.WaitForShutdown() })
nc, err := nats.Connect(ns.ClientURL())
if err != nil {
t.Fatalf("connect: %v", err)
}
t.Cleanup(nc.Close)
js, err := jetstream.New(nc)
if err != nil {
t.Fatalf("jetstream: %v", err)
}
// Shared control-plane state (stand-in for the replicated store) + two nodes.
dir := t.TempDir()
store, err := membership.Open(filepath.Join(dir, "unibus.db"))
if err != nil {
t.Fatalf("store: %v", err)
}
t.Cleanup(func() { store.Close() })
alice, err := cs.GenerateIdentity()
if err != nil {
t.Fatalf("identity: %v", err)
}
if err := store.AddUser(hex.EncodeToString(alice.SignPub), "alice", membership.RoleAdmin); err != nil {
t.Fatalf("add alice: %v", err)
}
blobs, _ := blobstore.New(filepath.Join(dir, "blobs"))
// Each node is wired EXACTLY as the binary wires a clustered node.
mkNode := func() *httptest.Server {
srv := membership.NewServer(store, blobs, membership.AuthEnforce)
if err := wireReplicatedNonces(srv, js, true /*clustered*/, 1); err != nil {
t.Fatalf("wireReplicatedNonces: %v", err)
}
return httptest.NewServer(srv)
}
nodeA := mkNode()
t.Cleanup(nodeA.Close)
nodeB := mkNode()
t.Cleanup(nodeB.Close)
ts := time.Now().Unix()
nonce := randNonce(t)
path := "/members/" + frame.EndpointID(alice.SignPub) + "/rooms"
// Golden: alice's signed request is accepted on node A.
respA, err := http.DefaultClient.Do(signed008(t, nodeA.URL, "GET", path, nil, alice, ts, nonce))
if err != nil {
t.Fatalf("do A: %v", err)
}
respA.Body.Close()
if respA.StatusCode != http.StatusOK {
t.Fatalf("node A first use: status %d, want 200", respA.StatusCode)
}
// Error path (the attack): replay the SAME signed bytes to node B → 401.
respB, err := http.DefaultClient.Do(signed008(t, nodeB.URL, "GET", path, nil, alice, ts, nonce))
if err != nil {
t.Fatalf("do B: %v", err)
}
respB.Body.Close()
if respB.StatusCode != http.StatusUnauthorized {
t.Fatalf("cross-node replay to node B: status %d, want 401 (replayed nonce must be rejected)", respB.StatusCode)
}
}
// TestAttack0008_N3_StandaloneKeepsLocalCache is the edge: a NON-clustered node
// must NOT require JetStream — wireReplicatedNonces is a no-op and the node keeps
// its in-memory cache, which still rejects a same-node replay (the single-node
// guarantee is unchanged). This proves the fix does not add a JetStream
// dependency to standalone deployments.
func TestAttack0008_N3_StandaloneKeepsLocalCache(t *testing.T) {
dir := t.TempDir()
store, err := membership.Open(filepath.Join(dir, "unibus.db"))
if err != nil {
t.Fatalf("store: %v", err)
}
t.Cleanup(func() { store.Close() })
alice, err := cs.GenerateIdentity()
if err != nil {
t.Fatalf("identity: %v", err)
}
if err := store.AddUser(hex.EncodeToString(alice.SignPub), "alice", membership.RoleAdmin); err != nil {
t.Fatalf("add alice: %v", err)
}
blobs, _ := blobstore.New(filepath.Join(dir, "blobs"))
srv := membership.NewServer(store, blobs, membership.AuthEnforce)
// Standalone: clustered=false, js=nil. Must succeed (no JetStream needed).
if err := wireReplicatedNonces(srv, nil, false /*clustered*/, 1); err != nil {
t.Fatalf("standalone wireReplicatedNonces must be a no-op, got: %v", err)
}
node := httptest.NewServer(srv)
t.Cleanup(node.Close)
ts := time.Now().Unix()
nonce := randNonce(t)
path := "/members/" + frame.EndpointID(alice.SignPub) + "/rooms"
resp1, err := http.DefaultClient.Do(signed008(t, node.URL, "GET", path, nil, alice, ts, nonce))
if err != nil {
t.Fatalf("do 1: %v", err)
}
resp1.Body.Close()
if resp1.StatusCode != http.StatusOK {
t.Fatalf("first use: status %d, want 200", resp1.StatusCode)
}
// Same-node replay is still rejected by the in-memory cache.
resp2, err := http.DefaultClient.Do(signed008(t, node.URL, "GET", path, nil, alice, ts, nonce))
if err != nil {
t.Fatalf("do 2: %v", err)
}
resp2.Body.Close()
if resp2.StatusCode != http.StatusUnauthorized {
t.Fatalf("same-node replay: status %d, want 401", resp2.StatusCode)
}
}
// TestAttack0008_N3_ClusteredRequiresJetStream proves the hard rule: a clustered
// node with NO JetStream available refuses (error), so the binary fails fast
// instead of silently running with a per-process cache.
func TestAttack0008_N3_ClusteredRequiresJetStream(t *testing.T) {
dir := t.TempDir()
store, err := membership.Open(filepath.Join(dir, "unibus.db"))
if err != nil {
t.Fatalf("store: %v", err)
}
t.Cleanup(func() { store.Close() })
blobs, _ := blobstore.New(filepath.Join(dir, "blobs"))
srv := membership.NewServer(store, blobs, membership.AuthEnforce)
if err := wireReplicatedNonces(srv, nil, true /*clustered*/, 1); err == nil {
t.Fatalf("clustered node with no JetStream must fail, got nil")
}
}
+198
View File
@@ -0,0 +1,198 @@
package main
import (
"fmt"
"net"
"net/url"
"os"
"strings"
"github.com/enmanuel/unibus/pkg/membership"
)
// splitRoutes parses the comma-separated --routes flag into a clean slice of
// route URLs, dropping empty entries and surrounding whitespace so a trailing
// comma or a spaced list does not yield a bogus empty route.
func splitRoutes(csv string) []string {
var out []string
for _, r := range strings.Split(csv, ",") {
if r = strings.TrimSpace(r); r != "" {
out = append(out, r)
}
}
return out
}
// resolveClusterPass resolves the cluster route secret WITHOUT leaking it through
// argv (audit 0008 N1-low: --cluster-pass in argv is visible in ps/journald).
// Precedence: --cluster-pass-file (read + trim the file), then the env var
// UNIBUS_CLUSTER_PASS, then the legacy --cluster-pass flag (argv-visible, kept for
// dev/compat). env is injected (os.Getenv result) so the function stays testable.
// It returns the secret and a short source label for logging (never the secret).
func resolveClusterPass(passFlag, passFile, env string) (secret, source string, err error) {
if passFile != "" {
b, rerr := os.ReadFile(passFile)
if rerr != nil {
return "", "", fmt.Errorf("read --cluster-pass-file %q: %w", passFile, rerr)
}
return strings.TrimSpace(string(b)), "file", nil
}
if env != "" {
return env, "env", nil
}
if passFlag != "" {
return passFlag, "flag", nil
}
return "", "none", nil
}
// injectRouteCreds rewrites each route URL that carries NO userinfo to embed
// user:pass, so the cluster secret is supplied once (via file/env) instead of
// repeated in every --routes argv entry where ps/journald would expose it. A route
// that already carries userinfo is left untouched (operator override). With an
// empty user it is a no-op. A malformed route URL is an error (configuration bug)
// rather than a silently dropped peer.
func injectRouteCreds(routes []string, user, pass string) ([]string, error) {
if user == "" {
return routes, nil
}
out := make([]string, 0, len(routes))
for _, r := range routes {
u, err := url.Parse(r)
if err != nil {
return nil, fmt.Errorf("parse route %q: %w", r, err)
}
if u.User == nil {
u.User = url.UserPassword(user, pass)
}
out = append(out, u.String())
}
return out, nil
}
// isLoopbackURL reports whether a NATS url targets this host only (loopback). Used
// to guard migrate-to-kv (audit 0008 N6): pushing the allowlist to a REMOTE NATS
// without TLS would send handles/roles/sign-pubs in cleartext, so a remote target
// must be TLS-pinned (--ca). A url we cannot classify is treated as NON-loopback
// (conservative: it then requires --ca).
func isLoopbackURL(natsURL string) bool {
u, err := url.Parse(natsURL)
if err != nil {
return false
}
host := u.Hostname()
switch host {
case "localhost":
return true
case "":
return false
}
ip := net.ParseIP(host)
return ip != nil && ip.IsLoopback()
}
// isLoopbackBind reports whether the --bind value keeps the service reachable
// only from this host. An empty bind means "all interfaces" (public), and a
// hostname we cannot resolve to a loopback literal is treated as public — the
// conservative choice, so an unusual bind never silently slips past the guard.
func isLoopbackBind(bind string) bool {
switch bind {
case "localhost":
return true
case "":
return false // empty binds every interface
}
ip := net.ParseIP(bind)
if ip == nil {
return false // a hostname we can't classify: assume public
}
return ip.IsLoopback()
}
// validateBootConfig is the fail-open guard (audit H2). It refuses any startup
// configuration that would expose the bus without enforced authentication:
//
// - a non-loopback --bind without --bus-auth enforce (the data plane and
// control plane would both accept anyone),
// - --tls-cert/--tls-key without --bus-auth enforce (TLS encrypts the channel
// but authenticates no one — encrypted access for everybody is still open), and
// - a non-loopback --bind WITHOUT --tls-cert/--tls-key (the control plane would
// serve metadata over plaintext HTTP publicly — audit H5 reappearing, the N4
// gap the re-audit found: TLS was available but not mandatory).
//
// It is a pure function of the parsed flags so the command can fail fast at
// startup and tests can assert the policy without booting a server.
func validateBootConfig(bind string, mode membership.AuthMode, tlsCert, tlsKey string) error {
if !isLoopbackBind(bind) && mode != membership.AuthEnforce {
return fmt.Errorf(
"refusing to start: --bind %q is not loopback but --bus-auth is %q; a public bind requires --bus-auth enforce (or bind 127.0.0.1 for local dev)",
bind, mode)
}
if (tlsCert != "" || tlsKey != "") && mode != membership.AuthEnforce {
return fmt.Errorf(
"refusing to start: --tls-cert/--tls-key set but --bus-auth is %q; TLS without enforced auth is fail-open (encrypted channel, no authentication) — set --bus-auth enforce",
mode)
}
if !isLoopbackBind(bind) && (tlsCert == "" || tlsKey == "") {
return fmt.Errorf(
"refusing to start: --bind %q is not loopback but --tls-cert/--tls-key are not both set; a public control plane must serve HTTPS or its metadata (subjects, pubkeys, sealed keys, the social graph) travels in cleartext to a network MITM (audit H5/N4) — provide a CA-signed --tls-cert/--tls-key, or bind 127.0.0.1 for local dev",
bind)
}
return nil
}
// validateClusterConfig guards the cluster route layer (issue 0003a). The route
// layer is a server-to-server trust boundary distinct from the client data
// plane: leaving it open lets anyone who reaches the route port join the cluster
// or inject messages into the whole bus (audit 0004, "auth of the cluster
// routes"). So on a public (non-loopback) bind, a cluster MUST carry both a
// shared route secret AND mutual route TLS. It is a pure function of the parsed
// flags. An empty clusterName means "no cluster" (standalone) and is always
// allowed.
//
// The three route-TLS paths are all-or-nothing (mutual TLS needs the node cert,
// its key, and the CA together), independent of the bind, so a partial TLS
// config never silently degrades to plaintext routes.
//
// Homogeneous posture (issue 0006d, audit 0008 N1): a cluster is only as secure
// as its weakest node — the data plane forwards every subject between nodes, so a
// single node running without enforced auth lets an unauthenticated peer
// Subscribe(">") on it and harvest the traffic forwarded from the ACL'd nodes.
// This node therefore REFUSES to join a cluster unless it runs --bus-auth enforce,
// regardless of bind: a clustered node is a production node, and there is no safe
// "dev cluster without auth". (A peer running a tampered binary is out of this
// node's control; /healthz exposes each node's posture so a monitor can detect
// one that is not enforce+ACL — see Server.Posture.)
func validateClusterConfig(clusterName, bind, user, pass, rtCert, rtKey, rtCA string, mode membership.AuthMode) error {
rtAny := rtCert != "" || rtKey != "" || rtCA != ""
rtAll := rtCert != "" && rtKey != "" && rtCA != ""
if rtAny && !rtAll {
return fmt.Errorf(
"refusing to start: --route-tls-cert/--route-tls-key/--route-tls-ca must be set together (mutual route TLS needs all three)")
}
if clusterName == "" {
return nil // standalone: no route layer to secure
}
// A clustered node MUST enforce auth (homogeneous posture). Checked before the
// loopback shortcut so even a loopback cluster cannot form without enforce.
if mode != membership.AuthEnforce {
return fmt.Errorf(
"refusing to start: cluster %q requires --bus-auth enforce; a cluster node without enforced auth+ACL lets an unauthenticated peer harvest the traffic forwarded from the other nodes (audit 0008 N1) — every node must run the same enforce+ACL+TLS posture",
clusterName)
}
if isLoopbackBind(bind) {
return nil // loopback cluster is dev-only and unreachable from outside
}
// Public cluster: demand a route secret and mutual route TLS.
if user == "" || pass == "" {
return fmt.Errorf(
"refusing to start: cluster %q on public bind %q requires --cluster-user and --cluster-pass; an unauthenticated route port lets anyone join the cluster",
clusterName, bind)
}
if !rtAll {
return fmt.Errorf(
"refusing to start: cluster %q on public bind %q requires mutual route TLS (--route-tls-cert/--route-tls-key/--route-tls-ca); plaintext routes expose server-to-server traffic and admit unsigned nodes",
clusterName, bind)
}
return nil
}
+188
View File
@@ -0,0 +1,188 @@
package main
import (
"strings"
"testing"
"github.com/enmanuel/unibus/pkg/membership"
)
// TestAudit_FailOpenTLSWithoutAuth ports the auditor's H2 vector. Before the
// guard, booting with TLS on but the authenticator off ("--bind 0.0.0.0
// --tls-cert … " without enforce) produced an encrypted data plane that an
// unregistered, nkey-less client could still connect to — a fail-open config
// wearing the appearance of security. validateBootConfig now refuses it, so the
// insecure server never starts (the client therefore has nothing to connect to).
func TestAudit_FailOpenTLSWithoutAuth(t *testing.T) {
// The exact auditor configuration: public bind, TLS provided, auth off.
err := validateBootConfig("0.0.0.0", membership.AuthOff, "server.crt", "server.key")
if err == nil {
t.Fatalf("TLS without enforce on a public bind must be refused at startup")
}
if !strings.Contains(err.Error(), "enforce") {
t.Fatalf("error should point the operator at --bus-auth enforce, got: %v", err)
}
// And TLS without enforce is rejected even on loopback: TLS implies a
// security posture, so authenticating no one is always a misconfiguration.
if err := validateBootConfig("127.0.0.1", membership.AuthOff, "server.crt", "server.key"); err == nil {
t.Fatalf("TLS flags without enforce must be refused regardless of bind")
}
}
// TestGap_PublicEnforceNoTLS ports the re-auditor's N4 gap: the H2 guard refused
// "public without enforce" and "TLS without enforce", but ALLOWED a public bind
// with enforce and NO --tls-cert, so the control plane served metadata over
// plaintext HTTP publicly (H5 reappearing). The guard now refuses it.
func TestGap_PublicEnforceNoTLS(t *testing.T) {
// The exact auditor configuration: public bind, enforce on, no TLS cert/key.
err := validateBootConfig("0.0.0.0", membership.AuthEnforce, "", "")
if err == nil {
t.Fatalf("public bind + enforce + NO --tls-cert must be refused: the control plane would serve plaintext HTTP publicly (audit N4)")
}
if !strings.Contains(err.Error(), "tls-cert") {
t.Fatalf("error should point the operator at --tls-cert/--tls-key, got: %v", err)
}
// Golden: the same public+enforce config WITH a cert/key is allowed.
if err := validateBootConfig("0.0.0.0", membership.AuthEnforce, "server.crt", "server.key"); err != nil {
t.Fatalf("public + enforce + TLS is the intended production config, got: %v", err)
}
// Edge: loopback without TLS stays allowed (local dev is not a public exposure).
if err := validateBootConfig("127.0.0.1", membership.AuthOff, "", ""); err != nil {
t.Fatalf("loopback dev without TLS must remain allowed, got: %v", err)
}
}
// TestBootConfigPolicy is the full table: the golden secure-public config is
// allowed, dev loopback is allowed, and every fail-open shape is refused.
func TestBootConfigPolicy(t *testing.T) {
cases := []struct {
name string
bind string
mode membership.AuthMode
cert string
key string
wantErr bool
}{
// Golden: the intended public production config — enforce AND TLS.
{"public+enforce+tls", "0.0.0.0", membership.AuthEnforce, "s.crt", "s.key", false},
// Edge: local dev on loopback may stay open (no auth, no TLS).
{"loopback+off", "127.0.0.1", membership.AuthOff, "", "", false},
{"loopback-ipv6+off", "::1", membership.AuthOff, "", "", false},
{"localhost+off", "localhost", membership.AuthOff, "", "", false},
{"loopback+soft", "127.0.0.1", membership.AuthSoft, "", "", false},
// Edge: loopback with full enforce+TLS is also fine.
{"loopback+enforce+tls", "127.0.0.1", membership.AuthEnforce, "s.crt", "s.key", false},
// Error: public bind without enforce.
{"public+off", "0.0.0.0", membership.AuthOff, "", "", true},
{"public+soft", "0.0.0.0", membership.AuthSoft, "", "", true},
{"lan-ip+off", "192.168.1.10", membership.AuthOff, "", "", true},
{"empty-bind+off", "", membership.AuthOff, "", "", true},
// Error (N4): public bind + enforce but NO TLS -> plaintext control plane.
{"public+enforce+notls", "0.0.0.0", membership.AuthEnforce, "", "", true},
{"public+enforce+certonly", "0.0.0.0", membership.AuthEnforce, "s.crt", "", true},
{"public+enforce+keyonly", "0.0.0.0", membership.AuthEnforce, "", "s.key", true},
{"lan-ip+enforce+notls", "192.168.1.10", membership.AuthEnforce, "", "", true},
// Error: TLS flags without enforce (cert or key alone is enough to trip it).
{"loopback+tlscert+off", "127.0.0.1", membership.AuthOff, "s.crt", "", true},
{"loopback+tlskey+soft", "127.0.0.1", membership.AuthSoft, "", "s.key", true},
}
for _, c := range cases {
t.Run(c.name, func(t *testing.T) {
err := validateBootConfig(c.bind, c.mode, c.cert, c.key)
if c.wantErr && err == nil {
t.Fatalf("config %+v should be refused", c)
}
if !c.wantErr && err != nil {
t.Fatalf("config %+v should be allowed, got: %v", c, err)
}
})
}
}
// TestClusterConfigPolicy is the cluster route guard (issue 0003a): a standalone
// server is always fine; a loopback cluster is dev-only and unguarded; a public
// cluster demands both a route secret and complete mutual route TLS; and the
// route-TLS flags are all-or-nothing regardless of bind.
func TestClusterConfigPolicy(t *testing.T) {
const c, k, ca = "node.crt", "node.key", "ca.crt"
en := membership.AuthEnforce
off := membership.AuthOff
soft := membership.AuthSoft
cases := []struct {
name string
clusterName, bind string
user, pass string
rtCert, rtKey, rtCA string
mode membership.AuthMode
wantErr bool
}{
// Standalone (no cluster name) is always allowed, even on a public bind and
// without enforce — the cluster posture rule does not apply to a single node.
{"standalone-public-off", "", "0.0.0.0", "", "", "", "", "", off, false},
// Loopback dev cluster WITH enforce: allowed (unreachable from outside).
{"loopback-cluster-enforce", "unibus", "127.0.0.1", "", "", "", "", "", en, false},
// Golden: full public HA config under enforce.
{"public-full-enforce", "unibus", "0.0.0.0", "u", "p", c, k, ca, en, false},
// N1 (audit 0008): a clustered node WITHOUT enforce is refused — even on
// loopback — so no weak node can join the cluster.
{"cluster-off-refused", "unibus", "127.0.0.1", "", "", "", "", "", off, true},
{"cluster-soft-refused", "unibus", "0.0.0.0", "u", "p", c, k, ca, soft, true},
// Error: public cluster without a route secret (enforce on, fails on secret).
{"public-no-secret", "unibus", "0.0.0.0", "", "", c, k, ca, en, true},
{"public-half-secret", "unibus", "0.0.0.0", "u", "", c, k, ca, en, true},
// Error: public cluster without mutual route TLS.
{"public-no-tls", "unibus", "10.0.0.1", "u", "p", "", "", "", en, true},
// Error: partial route-TLS flags trip regardless of bind/mode.
{"loopback-partial-tls", "unibus", "127.0.0.1", "", "", c, "", "", en, true},
{"standalone-partial-tls", "", "127.0.0.1", "", "", c, k, "", off, true},
}
for _, tc := range cases {
t.Run(tc.name, func(t *testing.T) {
err := validateClusterConfig(tc.clusterName, tc.bind, tc.user, tc.pass, tc.rtCert, tc.rtKey, tc.rtCA, tc.mode)
if tc.wantErr && err == nil {
t.Fatalf("cluster config %+v should be refused", tc)
}
if !tc.wantErr && err != nil {
t.Fatalf("cluster config %+v should be allowed, got: %v", tc, err)
}
})
}
}
// TestAttack0008_N1 is the regression for audit 0008 N1 scenario 2: a node
// configured to join a cluster while NOT enforcing auth (the weak node that lets
// an unauthenticated peer harvest the cluster's forwarded traffic) must be refused
// at startup. The homogeneous-posture rule makes this binary unable to BE that
// weak node.
func TestAttack0008_N1(t *testing.T) {
// Weak node: clustered but --bus-auth off -> refused.
if err := validateClusterConfig("unibus", "0.0.0.0", "u", "p", "n.crt", "n.key", "ca.crt", membership.AuthOff); err == nil {
t.Fatalf("a clustered node without enforce must be refused (audit 0008 N1)")
}
// Same node WITH enforce + full route security -> allowed.
if err := validateClusterConfig("unibus", "0.0.0.0", "u", "p", "n.crt", "n.key", "ca.crt", membership.AuthEnforce); err != nil {
t.Fatalf("a clustered enforce node with full route security must be allowed, got: %v", err)
}
}
func TestSplitRoutes(t *testing.T) {
cases := []struct {
in string
want int
}{
{"", 0},
{"nats://a:1", 1},
{"nats://a:1,nats://b:2", 2},
{" nats://a:1 , nats://b:2 ", 2}, // spaces trimmed
{"nats://a:1,,", 1}, // empty entries dropped
{",", 0},
}
for _, c := range cases {
if got := splitRoutes(c.in); len(got) != c.want {
t.Fatalf("splitRoutes(%q) = %v (len %d), want len %d", c.in, got, len(got), c.want)
}
}
}
+84
View File
@@ -0,0 +1,84 @@
package main
import (
"fmt"
cs "fn-registry/functions/cybersecurity"
"github.com/enmanuel/unibus/pkg/busauth"
"github.com/nats-io/nats.go"
"github.com/nats-io/nats.go/jetstream"
server "github.com/nats-io/nats-server/v2/server"
)
// connectInternalJS opens a privileged JetStream client from membershipd to its
// OWN embedded NATS server. This is the resolution of the "bootstrap cycle"
// (issue 0006a/c): the service needs JetStream to create the replicated nonce
// bucket and the control-plane KV, but under enforce the data plane only accepts
// allowlisted clients confined to their rooms. The connection therefore
// authenticates with the process's ephemeral internal identity — the identity the
// authenticator was built to recognize (NewNkeyAuthenticatorACLInternal) and
// grant full permissions — without ever appearing in the user allowlist.
//
// It uses the in-process transport (nats.InProcessServer), a Go pipe inside the
// process, so it bypasses TLS entirely: no CA wiring is needed for this
// self-connection even when the public data plane is TLS-only. useNkey mirrors
// whether the embedded server enforces auth: under enforce the internal identity
// presents its nkey; without enforce the server accepts an unauthenticated
// in-process client and the nkey is omitted.
//
// The caller owns the returned connection and must Close it on shutdown (after
// the JetStream context is no longer used).
func connectInternalJS(ns *server.Server, internalID cs.Identity, useNkey bool) (*nats.Conn, jetstream.JetStream, error) {
opts := []nats.Option{
nats.Name("membershipd-internal"),
nats.InProcessServer(ns),
}
if useNkey {
pub, sign, err := busauth.ClientNkey(internalID.SignPriv)
if err != nil {
return nil, nil, fmt.Errorf("internal nkey: %w", err)
}
opts = append(opts, nats.Nkey(pub, sign))
}
// The URL is ignored for an in-process connection; the InProcessServer option
// supplies the transport.
nc, err := nats.Connect("", opts...)
if err != nil {
return nil, nil, fmt.Errorf("connect internal nats: %w", err)
}
js, err := jetstream.New(nc)
if err != nil {
nc.Close()
return nil, nil, fmt.Errorf("internal jetstream: %w", err)
}
return nc, js, nil
}
// connectExternalJS opens a JetStream client to an EXTERNAL NATS the operator
// runs (membershipd started with --nats-url). Unlike the embedded path there is
// no in-process transport and no internal identity: the external server enforces
// its own auth, so membershipd connects as a plain client (optionally TLS-pinned
// to the bus CA). It is best-effort and intended for an operator-managed cluster;
// the standard unibus deploy uses the embedded server (connectInternalJS).
func connectExternalJS(natsURL, caPath string) (*nats.Conn, jetstream.JetStream, error) {
opts := []nats.Option{nats.Name("membershipd-internal")}
if caPath != "" {
tlsCfg, err := busauth.LoadCATLSConfig(caPath)
if err != nil {
return nil, nil, fmt.Errorf("load CA %q: %w", caPath, err)
}
opts = append(opts, nats.Secure(tlsCfg))
}
nc, err := nats.Connect(natsURL, opts...)
if err != nil {
return nil, nil, fmt.Errorf("connect external nats %q: %w", natsURL, err)
}
js, err := jetstream.New(nc)
if err != nil {
nc.Close()
return nil, nil, fmt.Errorf("external jetstream: %w", err)
}
return nc, js, nil
}
+119
View File
@@ -0,0 +1,119 @@
package main
// Bootstrap test for issue 0006a/c: under enforce, membershipd must still reach
// JetStream on its OWN embedded server to create the nonce/KV buckets. It does so
// with an ephemeral internal identity the authenticator grants full permissions
// (NewNkeyAuthenticatorACLInternal). These tests prove that privileged
// self-connection works AND that no other identity can claim it.
import (
"context"
"encoding/hex"
"net"
"testing"
"time"
cs "fn-registry/functions/cybersecurity"
"github.com/enmanuel/unibus/pkg/busauth"
"github.com/enmanuel/unibus/pkg/embeddednats"
"github.com/nats-io/nats.go"
"github.com/nats-io/nats.go/jetstream"
)
func icFreePort(t *testing.T) int {
t.Helper()
l, err := net.Listen("tcp", "127.0.0.1:0")
if err != nil {
t.Fatalf("free port: %v", err)
}
defer l.Close()
return l.Addr().(*net.TCPAddr).Port
}
// TestInternalConnPrivilegedUnderEnforce: with an enforce authenticator that
// authorizes NO bus user, the internal identity still connects in-process and has
// full permissions — it creates a KV bucket and round-trips a value. This is the
// resolution of the bootstrap cycle the audit flagged as the reason the KV store
// was never wired.
func TestInternalConnPrivilegedUnderEnforce(t *testing.T) {
internalID, err := cs.GenerateIdentity()
if err != nil {
t.Fatalf("internal identity: %v", err)
}
internalPubHex := hex.EncodeToString(internalID.SignPub)
// Authenticator: no bus user is authorized; only the internal identity passes.
auth := busauth.NewNkeyAuthenticatorACLInternal(
func(string) bool { return false },
busauth.PermissionsFromSubjects(func(string) ([]string, error) { return []string{"_INBOX.>"}, nil }),
internalPubHex,
)
ns, err := embeddednats.StartServer(embeddednats.ServerConfig{
StoreDir: t.TempDir(), Host: "127.0.0.1", Port: icFreePort(t), Auth: auth,
})
if err != nil {
t.Fatalf("nats: %v", err)
}
t.Cleanup(func() { ns.Shutdown(); ns.WaitForShutdown() })
nc, js, err := connectInternalJS(ns, internalID, true /*useNkey*/)
if err != nil {
t.Fatalf("connectInternalJS: %v", err)
}
t.Cleanup(nc.Close)
ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
defer cancel()
kv, err := js.CreateKeyValue(ctx, jetstream.KeyValueConfig{Bucket: "KV_UNIBUS_test", Replicas: 1})
if err != nil {
t.Fatalf("internal conn could not create KV bucket (full perms expected): %v", err)
}
if _, err := kv.Put(ctx, "k", []byte("v")); err != nil {
t.Fatalf("kv put: %v", err)
}
e, err := kv.Get(ctx, "k")
if err != nil || string(e.Value()) != "v" {
t.Fatalf("kv get: val=%q err=%v", e, err)
}
}
// TestInternalConnOutsiderRejected: an identity that is neither the internal one
// nor an allowlisted bus user cannot connect — proving the internal bypass is
// scoped to the exact internal key, not a blanket hole.
func TestInternalConnOutsiderRejected(t *testing.T) {
internalID, err := cs.GenerateIdentity()
if err != nil {
t.Fatalf("internal identity: %v", err)
}
auth := busauth.NewNkeyAuthenticatorACLInternal(
func(string) bool { return false },
busauth.PermissionsFromSubjects(func(string) ([]string, error) { return []string{"_INBOX.>"}, nil }),
hex.EncodeToString(internalID.SignPub),
)
ns, err := embeddednats.StartServer(embeddednats.ServerConfig{
StoreDir: t.TempDir(), Host: "127.0.0.1", Port: icFreePort(t), Auth: auth,
})
if err != nil {
t.Fatalf("nats: %v", err)
}
t.Cleanup(func() { ns.Shutdown(); ns.WaitForShutdown() })
outsider, err := cs.GenerateIdentity()
if err != nil {
t.Fatalf("outsider identity: %v", err)
}
pub, sign, err := busauth.ClientNkey(outsider.SignPriv)
if err != nil {
t.Fatalf("outsider nkey: %v", err)
}
conn, err := nats.Connect(ns.ClientURL(),
nats.Nkey(pub, sign),
nats.MaxReconnects(0),
nats.Timeout(2*time.Second),
)
if err == nil {
conn.Close()
t.Fatalf("outsider (unauthorized, non-internal) must be rejected, but connected")
}
}
+154
View File
@@ -0,0 +1,154 @@
package main
// Wiring tests for issue 0006c: --store kv selects the replicated JetStream KV
// control plane, the authenticator serves from it through the storeHolder, and a
// new node sees state created by another (the divergence that per-node SQLite
// caused — audit 0008 N5 — is gone). Branch-by-abstraction is verified elsewhere
// (the SQLite default path is the unchanged baseline covered by the existing
// suite).
import (
"encoding/hex"
"testing"
"time"
cs "fn-registry/functions/cybersecurity"
"github.com/enmanuel/unibus/pkg/busauth"
"github.com/enmanuel/unibus/pkg/embeddednats"
"github.com/enmanuel/unibus/pkg/membership"
"github.com/nats-io/nats.go"
"github.com/nats-io/nats.go/jetstream"
)
// TestKVStoreBootstrapUnderEnforce drives the exact decentralized boot the binary
// performs: build the authenticator over an empty holder, start NATS, open the
// privileged internal connection, open the KV store, publish it into the holder,
// then a real bus user (seeded into the KV store) authenticates over nkey. This
// proves the bootstrap cycle is broken correctly — the KV-backed control plane
// authorizes live clients under enforce.
func TestKVStoreBootstrapUnderEnforce(t *testing.T) {
internalID, err := cs.GenerateIdentity()
if err != nil {
t.Fatalf("internal identity: %v", err)
}
holder := &storeHolder{}
auth := busauth.NewNkeyAuthenticatorACLInternal(
holder.IsAuthorized,
busauth.PermissionsFromSubjects(holder.subjectACL),
hex.EncodeToString(internalID.SignPub),
)
ns, err := embeddednats.StartServer(embeddednats.ServerConfig{
StoreDir: t.TempDir(), Host: "127.0.0.1", Port: freePort(t), Auth: auth,
})
if err != nil {
t.Fatalf("nats: %v", err)
}
t.Cleanup(func() { ns.Shutdown(); ns.WaitForShutdown() })
// Privileged internal connection opens the KV store while the holder still
// denies every normal client.
intNC, js, err := connectInternalJS(ns, internalID, true)
if err != nil {
t.Fatalf("connectInternalJS: %v", err)
}
t.Cleanup(intNC.Close)
kvStore, err := membership.OpenJetStream(js, membership.JetStreamConfig{Replicas: 1, OpTimeout: 3 * time.Second})
if err != nil {
t.Fatalf("open kv store: %v", err)
}
holder.set(kvStore)
// Seed a bus user into the KV control plane.
alice, err := cs.GenerateIdentity()
if err != nil {
t.Fatalf("alice: %v", err)
}
if err := kvStore.AddUser(hex.EncodeToString(alice.SignPub), "alice", membership.RoleMember); err != nil {
t.Fatalf("seed alice: %v", err)
}
// alice authenticates over nkey — authorized via the KV store through the holder.
pub, sign, err := busauth.ClientNkey(alice.SignPriv)
if err != nil {
t.Fatalf("alice nkey: %v", err)
}
aliceNC, err := nats.Connect(ns.ClientURL(), nats.Nkey(pub, sign), nats.MaxReconnects(0), nats.Timeout(2*time.Second))
if err != nil {
t.Fatalf("alice (KV-authorized) must connect under enforce: %v", err)
}
aliceNC.Close()
// An outsider not in the KV store is denied (fail closed).
outsider, err := cs.GenerateIdentity()
if err != nil {
t.Fatalf("outsider: %v", err)
}
opub, osign, err := busauth.ClientNkey(outsider.SignPriv)
if err != nil {
t.Fatalf("outsider nkey: %v", err)
}
if oc, err := nats.Connect(ns.ClientURL(), nats.Nkey(opub, osign), nats.MaxReconnects(0), nats.Timeout(2*time.Second)); err == nil {
oc.Close()
t.Fatalf("an outsider absent from the KV store must be rejected")
}
}
// TestKVStoreDecentralizedConsistency: a room/user created via one node's KV store
// is immediately visible to another node's KV store over the same JetStream — the
// shared, replicated control plane that ends the per-node SQLite divergence.
func TestKVStoreDecentralizedConsistency(t *testing.T) {
ns, err := embeddednats.StartServer(embeddednats.ServerConfig{
StoreDir: t.TempDir(), Host: "127.0.0.1", Port: freePort(t),
})
if err != nil {
t.Fatalf("nats: %v", err)
}
t.Cleanup(func() { ns.Shutdown(); ns.WaitForShutdown() })
open := func() membership.Store {
nc, err := nats.Connect(ns.ClientURL())
if err != nil {
t.Fatalf("connect: %v", err)
}
t.Cleanup(nc.Close)
js, err := jetstream.New(nc)
if err != nil {
t.Fatalf("jetstream: %v", err)
}
st, err := membership.OpenJetStream(js, membership.JetStreamConfig{Replicas: 1, OpTimeout: 3 * time.Second})
if err != nil {
t.Fatalf("open kv: %v", err)
}
return st
}
nodeA := open()
nodeB := open()
owner, err := cs.GenerateIdentity()
if err != nil {
t.Fatalf("owner: %v", err)
}
ownerPub := hex.EncodeToString(owner.SignPub)
if err := nodeA.AddUser(ownerPub, "owner", membership.RoleAdmin); err != nil {
t.Fatalf("nodeA add user: %v", err)
}
if err := nodeA.CreateRoom(
membership.RoomInfo{RoomID: "ROOMX", Subject: "room.shared.x", OwnerEndpoint: "owner-ep"},
owner.SignPub, owner.KexPub, nil,
); err != nil {
t.Fatalf("nodeA create room: %v", err)
}
// nodeB (a different connection, same buckets) sees both immediately.
if !nodeB.IsAuthorized(ownerPub) {
t.Fatalf("nodeB must see the user created on nodeA (decentralized state divergence)")
}
got, err := nodeB.GetRoom("ROOMX")
if err != nil {
t.Fatalf("nodeB must see the room created on nodeA: %v", err)
}
if got.Subject != "room.shared.x" {
t.Fatalf("nodeB read wrong room subject: %q", got.Subject)
}
}
+152
View File
@@ -0,0 +1,152 @@
package main
// Integration tests for issue 0011 GAP A: `membershipd user add --store kv`
// adds users to a RUNNING cluster's replicated allowlist via the privileged
// internal connection, instead of the stop-seed-restart procedure the 0011
// deploy required. These exercise the real connectKVStore path (load the
// persisted internal identity from a file, present its nkey, open the KV store,
// write the user) against an embedded enforce node, plus the idempotency and
// error semantics the DoD calls for. Multi-node replication and node-down quorum
// are validated against the live cluster (report 0012).
import (
"encoding/hex"
"errors"
"path/filepath"
"testing"
"time"
cs "fn-registry/functions/cybersecurity"
"github.com/enmanuel/unibus/pkg/busauth"
"github.com/enmanuel/unibus/pkg/client"
"github.com/enmanuel/unibus/pkg/embeddednats"
"github.com/enmanuel/unibus/pkg/membership"
)
// startEnforceKVNode boots a single embedded enforce node whose authenticator
// recognizes internalPubHex as the privileged internal identity, bootstraps the
// KV control-plane store over the in-process internal connection, and publishes
// it into the holder — the exact sequence main.go performs for --store kv. It
// returns the client URL the CLI connects to.
func startEnforceKVNode(t *testing.T, internalID cs.Identity) string {
t.Helper()
holder := &storeHolder{}
auth := busauth.NewNkeyAuthenticatorACLInternal(
holder.IsAuthorized,
busauth.PermissionsFromSubjects(holder.subjectACL),
hex.EncodeToString(internalID.SignPub),
)
ns, err := embeddednats.StartServer(embeddednats.ServerConfig{
StoreDir: t.TempDir(), Host: "127.0.0.1", Port: freePort(t), Auth: auth,
})
if err != nil {
t.Fatalf("start enforce node: %v", err)
}
t.Cleanup(func() { ns.Shutdown(); ns.WaitForShutdown() })
intNC, js, err := connectInternalJS(ns, internalID, true)
if err != nil {
t.Fatalf("bootstrap internal connection: %v", err)
}
t.Cleanup(intNC.Close)
kvStore, err := membership.OpenJetStream(js, membership.JetStreamConfig{Replicas: 1, OpTimeout: 3 * time.Second})
if err != nil {
t.Fatalf("bootstrap KV store: %v", err)
}
holder.set(kvStore)
return ns.ClientURL()
}
// TestUserAddStoreKV_GoldenAndIdempotent is the GAP A golden + edge-1: the CLI
// connection (real connectKVStore, loading the internal identity from a file and
// presenting its nkey) writes a user into the live KV allowlist, the user is
// authorized afterward, and re-adding the same key is an explicit ErrUserExists
// with no corruption (the unchanged row is still authorized).
func TestUserAddStoreKV_GoldenAndIdempotent(t *testing.T) {
idFile := filepath.Join(t.TempDir(), "internal.id")
internalID, err := client.LoadOrCreateIdentity(idFile) // persists 0600
if err != nil {
t.Fatalf("persist internal identity: %v", err)
}
url := startEnforceKVNode(t, internalID)
// Golden: connect as the privileged internal identity (loopback, no TLS) and
// add a new user, exactly as `user add --store kv` does.
kv, err := connectKVStore(url, idFile, "", 1)
if err != nil {
t.Fatalf("connectKVStore (privileged): %v", err)
}
defer kv.Close()
newUser, err := cs.GenerateIdentity()
if err != nil {
t.Fatalf("new user identity: %v", err)
}
pub := hex.EncodeToString(newUser.SignPub)
if err := kv.store.AddUser(pub, "gapcheck_user", membership.RoleMember); err != nil {
t.Fatalf("add user to live KV: %v", err)
}
if !kv.store.IsAuthorized(pub) {
t.Fatalf("user added to KV must be authorized")
}
// Edge 1: re-adding the same key is a clean, non-destructive ErrUserExists.
err = kv.store.AddUser(pub, "gapcheck_user", membership.RoleMember)
if !errors.Is(err, membership.ErrUserExists) {
t.Fatalf("re-add must return ErrUserExists (idempotent), got %v", err)
}
// A different handle/role with the SAME key is also rejected — the row is not
// silently overwritten (no role flip).
if err := kv.store.AddUser(pub, "impostor", membership.RoleAdmin); !errors.Is(err, membership.ErrUserExists) {
t.Fatalf("re-add with a different role must NOT overwrite; want ErrUserExists, got %v", err)
}
u, err := kv.store.GetUser(pub)
if err != nil {
t.Fatalf("get user: %v", err)
}
if u.Handle != "gapcheck_user" || u.Role != membership.RoleMember || u.Status != membership.StatusActive {
t.Fatalf("idempotent re-add corrupted the row: %+v", u)
}
}
// TestUserAddStoreKV_RequiresInternalIdentity: --store kv without a usable
// internal identity file fails loudly (missing file, empty path) rather than
// silently connecting unprivileged.
func TestUserAddStoreKV_RequiresInternalIdentity(t *testing.T) {
if _, err := connectKVStore("nats://127.0.0.1:4250", "", "", 1); err == nil {
t.Fatalf("empty --internal-id-file must be an error")
}
missing := filepath.Join(t.TempDir(), "nope.id")
if _, err := connectKVStore("nats://127.0.0.1:4250", missing, "", 1); err == nil {
t.Fatalf("missing internal identity file must be an error")
}
}
// TestUserAddStoreKV_UnreachableKV is the GAP A error case: pointing --store kv
// at a dead endpoint yields a clear, handled error (no crash, no silent success).
func TestUserAddStoreKV_UnreachableKV(t *testing.T) {
idFile := filepath.Join(t.TempDir(), "internal.id")
if _, err := client.LoadOrCreateIdentity(idFile); err != nil {
t.Fatalf("persist internal identity: %v", err)
}
// A loopback port with nothing listening: connect must fail fast and wrapped.
_, err := connectKVStore("nats://127.0.0.1:1/", idFile, "", 1)
if err == nil {
t.Fatalf("connecting to a dead endpoint must error")
}
}
// TestUserAddStoreKV_RemoteWithoutCARefused: a non-loopback target without --ca
// is refused so the allowlist write never travels in cleartext (audit 0008 N6,
// same guard as migrate-to-kv).
func TestUserAddStoreKV_RemoteWithoutCARefused(t *testing.T) {
idFile := filepath.Join(t.TempDir(), "internal.id")
if _, err := client.LoadOrCreateIdentity(idFile); err != nil {
t.Fatalf("persist internal identity: %v", err)
}
_, err := connectKVStore("nats://203.0.113.1:4250", idFile, "", 1)
if err == nil {
t.Fatalf("remote target without --ca must be refused")
}
}
+75
View File
@@ -0,0 +1,75 @@
package main
import (
"os"
"path/filepath"
"strings"
"testing"
)
// TestResolveClusterPass verifies the secret resolution precedence
// (file > env > flag) that keeps the cluster password out of argv (issue 0006f).
func TestResolveClusterPass(t *testing.T) {
// file wins over env and flag, and is trimmed.
f := filepath.Join(t.TempDir(), "pass")
if err := os.WriteFile(f, []byte("filesecret\n"), 0o600); err != nil {
t.Fatalf("write: %v", err)
}
if got, src, err := resolveClusterPass("flagsecret", f, "envsecret"); err != nil || got != "filesecret" || src != "file" {
t.Fatalf("file precedence: got %q src %q err %v", got, src, err)
}
// env wins over flag when no file.
if got, src, err := resolveClusterPass("flagsecret", "", "envsecret"); err != nil || got != "envsecret" || src != "env" {
t.Fatalf("env precedence: got %q src %q err %v", got, src, err)
}
// flag is the last resort.
if got, src, err := resolveClusterPass("flagsecret", "", ""); err != nil || got != "flagsecret" || src != "flag" {
t.Fatalf("flag fallback: got %q src %q err %v", got, src, err)
}
// none set.
if got, src, err := resolveClusterPass("", "", ""); err != nil || got != "" || src != "none" {
t.Fatalf("none: got %q src %q err %v", got, src, err)
}
// missing file is an error.
if _, _, err := resolveClusterPass("", filepath.Join(t.TempDir(), "nope"), ""); err == nil {
t.Fatalf("missing file must error")
}
}
// TestInjectRouteCreds verifies the secret is injected only into routes that omit
// userinfo, so --routes argv need not carry the password (issue 0006f).
func TestInjectRouteCreds(t *testing.T) {
in := []string{"nats://10.0.0.2:6250", "nats://override:pw@10.0.0.3:6250"}
out, err := injectRouteCreds(in, "user", "secret")
if err != nil {
t.Fatalf("inject: %v", err)
}
if !strings.Contains(out[0], "user:secret@10.0.0.2:6250") {
t.Fatalf("creds not injected into bare route: %q", out[0])
}
if !strings.Contains(out[1], "override:pw@10.0.0.3:6250") {
t.Fatalf("existing userinfo must be preserved: %q", out[1])
}
// empty user is a no-op.
noop, err := injectRouteCreds(in, "", "")
if err != nil || noop[0] != in[0] {
t.Fatalf("empty user must be a no-op: %v %q", err, noop[0])
}
}
// TestIsLoopbackURL guards migrate-to-kv against pushing the allowlist cleartext
// to a remote NATS (issue 0006f, audit 0008 N6).
func TestIsLoopbackURL(t *testing.T) {
loop := []string{"nats://127.0.0.1:4250", "nats://localhost:4250", "nats://[::1]:4250"}
for _, u := range loop {
if !isLoopbackURL(u) {
t.Fatalf("%q should be loopback", u)
}
}
remote := []string{"nats://10.0.0.2:4250", "nats://bus.example.com:4250", "::not-a-url"}
for _, u := range remote {
if isLoopbackURL(u) {
t.Fatalf("%q should NOT be loopback", u)
}
}
}
+321 -23
View File
@@ -6,6 +6,8 @@ package main
import (
"context"
"crypto/tls"
"encoding/hex"
"flag"
"log"
"net/http"
@@ -14,14 +16,37 @@ import (
"syscall"
"time"
cs "fn-registry/functions/cybersecurity"
"github.com/nats-io/nats.go"
"github.com/nats-io/nats.go/jetstream"
server "github.com/nats-io/nats-server/v2/server"
"github.com/enmanuel/unibus/pkg/blobstore"
"github.com/enmanuel/unibus/pkg/busauth"
"github.com/enmanuel/unibus/pkg/client"
"github.com/enmanuel/unibus/pkg/embeddednats"
"github.com/enmanuel/unibus/pkg/membership"
)
func main() {
// Subcommand dispatch: `membershipd user ...` is the local administration CLI
// (seed/list/revoke bus users) and must be handled before the server flag set
// parses os.Args. Running the CLI on the bus host is trusted by design (whoever
// has a shell there already controls the service), which is how the first admin
// is seeded without a chicken-egg auth problem.
if len(os.Args) > 1 && os.Args[1] == "user" {
runUserCLI(os.Args[2:])
return
}
// `membershipd migrate-to-kv` is the one-time, idempotent SQLite->JetStream KV
// data move for decentralization (issue 0003c). Like the user CLI it runs on
// the host and is dispatched before the server flag set parses os.Args.
if len(os.Args) > 1 && os.Args[1] == "migrate-to-kv" {
runMigrateCLI(os.Args[2:])
return
}
var (
bind = flag.String("bind", "127.0.0.1", "network interface to bind the HTTP API and the embedded NATS to; use 0.0.0.0 to accept LAN/remote peers")
natsURL = flag.String("nats-url", "", "external NATS url; empty starts an embedded server")
@@ -30,21 +55,219 @@ func main() {
storeDir = flag.String("store-dir", "./local_files/blobs", "blob store directory")
natsPort = flag.Int("nats-port", 4250, "embedded NATS listen port (when --nats-url empty)")
natsStore = flag.String("nats-store", "./local_files/jetstream", "embedded JetStream store dir")
busAuth = flag.String("bus-auth", "off", "control-plane auth rollout: off|soft|enforce (feature flag bus-auth)")
tlsCert = flag.String("tls-cert", "", "PATH to the NATS server certificate (deploy/tls/server.crt); enables TLS on the embedded data plane")
tlsKey = flag.String("tls-key", "", "path to the NATS server private key (deploy/tls/server.key); required with --tls-cert")
// Cluster (issue 0003a): empty --cluster-name keeps the server standalone.
clusterName = flag.String("cluster-name", "", "NATS cluster name (identical on every node); empty = standalone, no HA")
serverName = flag.String("server-name", "", "unique node name within the cluster (required by JetStream RAFT when clustered)")
clusterPort = flag.Int("cluster-port", 6250, "route listener port for server-to-server cluster traffic")
routesCSV = flag.String("routes", "", "comma-separated nats-route URLs of the OTHER nodes, e.g. nats://user:pass@10.0.0.2:6250")
clusterUser = flag.String("cluster-user", "", "shared route secret username (gates the route listener)")
clusterPass = flag.String("cluster-pass", "", "shared route secret password (argv-visible — prefer --cluster-pass-file or UNIBUS_CLUSTER_PASS)")
// Secret out of argv (issue 0006f, audit 0008 N1-low): a password in
// --cluster-pass / --routes is visible in ps/journald. Prefer a file or the
// UNIBUS_CLUSTER_PASS env var; routes may then omit userinfo and the secret
// is injected from here.
clusterPassFile = flag.String("cluster-pass-file", "", "path to a file holding the cluster route password (preferred over --cluster-pass; keeps the secret out of argv)")
routeTLSCert = flag.String("route-tls-cert", "", "this node's route certificate (CA-signed); enables mutual route TLS with --route-tls-key/--route-tls-ca")
routeTLSKey = flag.String("route-tls-key", "", "this node's route private key")
routeTLSCA = flag.String("route-tls-ca", "", "bus CA that signs every node's route certificate (deploy/tls/ca.crt)")
// Replicated control plane (issue 0006a/c): the JetStream replication factor
// for the shared nonce bucket (and, with --store kv, the control-plane KV).
// 1 for a 1-2 node rollout, 3 for real HA quorum (raise in place with
// `nats stream update --replicas 3` when the third node joins).
kvReplicas = flag.Int("kv-replicas", 1, "JetStream replication factor for the shared nonce/KV buckets (1..3)")
caFile = flag.String("ca", "", "bus CA cert; only used to pin TLS on the internal JetStream connection to an EXTERNAL --nats-url (the embedded server uses an in-process connection that needs no CA)")
// Control-plane store backend (issue 0006c, feature flag decentralized):
// "sqlite" (default) keeps the local single-node SQLite control plane;
// "kv" puts rooms/members/keys/users in replicated JetStream KV so any node
// in the cluster serves the same state.
storeBackend = flag.String("store", "sqlite", "control-plane store backend: sqlite (default, single-node) | kv (replicated JetStream, decentralized)")
// Persisted internal service identity (issue 0011 gaps, GAP A): when set, the
// privileged internal identity used to manage JetStream is LOADED from this
// file (generated and persisted on first start) instead of being a fresh
// ephemeral key each boot. Persisting it is what lets `membershipd user add
// --store kv` write the replicated allowlist of a LIVE cluster: that CLI,
// run over loopback on a node, loads the SAME identity and presents the nkey
// this node's authenticator already grants full permissions. Empty keeps the
// ephemeral-per-process behavior (single-node/dev default, unchanged). The
// file holds a private key: it is written 0600 and belongs next to the node's
// TLS keys (deploy keeps it under secrets/, gitignored).
internalIDFile = flag.String("internal-id-file", "", "path to a persisted internal service identity (JSON); enables `membershipd user add --store kv` against the live cluster. Empty = ephemeral per-process identity (dev default)")
)
flag.Parse()
authMode, err := membership.ParseAuthMode(*busAuth)
if err != nil {
log.Fatalf("%v", err)
}
if *storeBackend != "sqlite" && *storeBackend != "kv" {
log.Fatalf("--store must be \"sqlite\" or \"kv\", got %q", *storeBackend)
}
// Resolve the cluster route secret out of argv (file/env preferred). The
// resolved value (not *clusterPass) is what guards the route layer and is
// injected into peer route URLs below.
clusterPassResolved, passSource, err := resolveClusterPass(*clusterPass, *clusterPassFile, os.Getenv("UNIBUS_CLUSTER_PASS"))
if err != nil {
log.Fatalf("%v", err)
}
// Fail-open guard (audit H2): a non-loopback bind, or any TLS flag, demands
// --bus-auth enforce. This makes an insecure public startup impossible rather
// than silently exposing the bus with the appearance of security.
if err := validateBootConfig(*bind, authMode, *tlsCert, *tlsKey); err != nil {
log.Fatalf("%v", err)
}
// Cluster route guard (issue 0003a): a public cluster needs a route secret
// and mutual route TLS, and the route-TLS flags are all-or-nothing.
if err := validateClusterConfig(*clusterName, *bind, *clusterUser, clusterPassResolved, *routeTLSCert, *routeTLSKey, *routeTLSCA, authMode); err != nil {
log.Fatalf("%v", err)
}
log.SetFlags(log.LstdFlags | log.Lmsgprefix)
log.SetPrefix("[membershipd] ")
// Data plane: embedded or external NATS.
// A clustered node shares its control plane with peers, so it needs a JetStream
// client to manage the replicated nonce bucket (issue 0006a). --store kv (issue
// 0006c) also needs JetStream, for the control-plane KV itself. A standalone
// single-node SQLite deployment needs none of this and keeps the in-process,
// in-memory behavior unchanged.
clustered := *clusterName != ""
decentralized := *storeBackend == "kv"
needJS := clustered || decentralized
enforce := authMode == membership.AuthEnforce
// Internal service identity (issue 0006a): when the embedded data plane enforces
// auth, membershipd must still connect to its OWN server to manage JetStream.
// It does so with this ephemeral identity, which the authenticator is built to
// recognize and grant full permissions (it never enters the user allowlist). It
// is only generated when actually needed (JetStream required AND enforce on AND
// the server is embedded), so a standalone or non-enforce node is unchanged.
var internalID cs.Identity
var internalPubHex string
if needJS && enforce && *natsURL == "" {
if *internalIDFile != "" {
// Persisted identity: load it, generating + writing it (0600) on first
// start. A stable internal key is what `user add --store kv` presents to
// add users to a live cluster (GAP A); rotate it by deleting the file and
// restarting.
internalID, err = client.LoadOrCreateIdentity(*internalIDFile)
if err != nil {
log.Fatalf("load internal service identity %q: %v", *internalIDFile, err)
}
log.Printf("internal service identity: persisted (%s)", *internalIDFile)
} else {
internalID, err = cs.GenerateIdentity()
if err != nil {
log.Fatalf("generate internal identity: %v", err)
}
}
internalPubHex = hex.EncodeToString(internalID.SignPub)
}
// The authenticator consults the store through a holder so it can be built
// before the store exists: with --store kv the JetStream KV store opens only
// after NATS is up (the bootstrap cycle). In the default SQLite path the store
// is opened and set into the holder right here, before the server starts, so
// behavior is identical to the pre-0006c baseline. `store` is the final store
// used by the HTTP server (set below for the KV path).
holder := &storeHolder{}
var store membership.Store
if !decentralized {
store, err = membership.Open(*dbPath)
if err != nil {
log.Fatalf("open membership store: %v", err)
}
holder.set(store)
log.Printf("membership store: sqlite %s", *dbPath)
}
// Close whichever store ends up final (SQLite closes its file; the JetStream KV
// store's Close is a no-op — its NATS connection is closed separately).
defer func() {
if store != nil {
store.Close()
}
}()
blobs, err := blobstore.New(*storeDir)
if err != nil {
log.Fatalf("open blob store: %v", err)
}
log.Printf("blob store: %s", *storeDir)
// Data plane: embedded or external NATS. For the embedded server, enforce
// turns on the nkey authenticator (only allowlisted identities may connect)
// and --tls-cert/--tls-key turn on TLS. An external NATS manages its own
// auth/TLS, so those flags do not apply to it.
var ns *server.Server
natsClientURL := *natsURL
if natsClientURL == "" {
var err error
// Bind the embedded NATS to the same interface as the HTTP API so a single
// --bind flag governs reachability: 127.0.0.1 keeps the whole stack
// loopback-only; 0.0.0.0 exposes both planes to the LAN.
ns, err = embeddednats.StartHost(*natsStore, *bind, *natsPort)
cfg := embeddednats.ServerConfig{
// Bind the embedded NATS to the same interface as the HTTP API so a
// single --bind flag governs reachability: 127.0.0.1 keeps the whole
// stack loopback-only; 0.0.0.0 exposes both planes to the LAN.
StoreDir: *natsStore,
Host: *bind,
Port: *natsPort,
ServerName: *serverName,
}
// Cluster (issue 0003a): with a cluster name, join the route layer for HA.
if *clusterName != "" {
// Inject the resolved secret into peer route URLs that omit userinfo, so
// the password need not appear in --routes argv (issue 0006f).
routes, rerr := injectRouteCreds(splitRoutes(*routesCSV), *clusterUser, clusterPassResolved)
if rerr != nil {
log.Fatalf("%v", rerr)
}
cc := &embeddednats.ClusterConfig{
Name: *clusterName,
Host: *bind,
Port: *clusterPort,
Routes: routes,
Username: *clusterUser,
Password: clusterPassResolved,
}
log.Printf("cluster route secret source: %s", passSource)
if *routeTLSCert != "" {
rtls, err := busauth.RouteTLSConfig(*routeTLSCert, *routeTLSKey, *routeTLSCA)
if err != nil {
log.Fatalf("load route TLS: %v", err)
}
cc.TLS = rtls
log.Printf("cluster route TLS: ON (mutual, CA %s)", *routeTLSCA)
}
cfg.Cluster = cc
log.Printf("cluster: %q node %q, route port %d, %d peer route(s)", *clusterName, *serverName, *clusterPort, len(cc.Routes))
}
if authMode == membership.AuthEnforce {
// Per-subject data-plane ACL (audit H4 / N4 residual): the authenticator
// authorizes by the bus allowlist AND confines each connection to the
// subjects of the rooms it belongs to (plus client-infra subjects). This
// closes the wildcard metadata leak where a registered non-member could
// Subscribe(">") and harvest every room's subject and JetStream activity.
// NATS freezes permissions at connect time, so a peer that joins a room
// after connecting must client.RefreshSession to gain that room's subject.
cfg.Auth = busauth.NewNkeyAuthenticatorACLInternal(
holder.IsAuthorized,
busauth.PermissionsFromSubjects(holder.subjectACL),
internalPubHex,
)
log.Printf("NATS nkey authentication: ON (enforce, per-subject ACL)")
}
if *tlsCert != "" || *tlsKey != "" {
if *tlsCert == "" || *tlsKey == "" {
log.Fatalf("--tls-cert and --tls-key must be set together")
}
tlsCfg, err := busauth.ServerTLSConfig(*tlsCert, *tlsKey)
if err != nil {
log.Fatalf("load NATS TLS: %v", err)
}
cfg.TLS = tlsCfg
log.Printf("NATS TLS: ON (%s)", *tlsCert)
}
ns, err = embeddednats.StartServer(cfg)
if err != nil {
log.Fatalf("start embedded nats: %v", err)
}
@@ -54,29 +277,104 @@ func main() {
log.Printf("using external NATS: %s", natsClientURL)
}
// Control plane: SQLite store + blob store + HTTP API.
store, err := membership.Open(*dbPath)
if err != nil {
log.Fatalf("open membership store: %v", err)
}
defer store.Close()
log.Printf("membership store: %s", *dbPath)
// JetStream client + decentralized store (issue 0006a/c). needJS is set for a
// clustered node (shared nonce bucket) and for --store kv (the KV control
// plane). Open the privileged JetStream client first (in-process for the
// embedded server, a plain client for external NATS), then — for --store kv —
// open the replicated KV store and publish it into the holder so the
// authenticator and HTTP server serve from it. The privileged connection is the
// only client that can connect in this window (the holder still denies everyone
// else; the internal identity bypasses the store).
var js jetstream.JetStream
if needJS {
var internalNC *nats.Conn
if *natsURL == "" {
internalNC, js, err = connectInternalJS(ns, internalID, enforce)
} else {
internalNC, js, err = connectExternalJS(natsClientURL, *caFile)
}
if err != nil {
log.Fatalf("internal JetStream connection (required by --cluster-name/--store kv): %v", err)
}
defer internalNC.Close()
blobs, err := blobstore.New(*storeDir)
if err != nil {
log.Fatalf("open blob store: %v", err)
if decentralized {
kvStore, err := membership.OpenJetStream(js, membership.JetStreamConfig{Replicas: *kvReplicas})
if err != nil {
log.Fatalf("open decentralized control-plane KV store: %v", err)
}
store = kvStore
holder.set(store)
log.Printf("membership store: jetstream KV (replicas=%d)", *kvReplicas)
}
}
log.Printf("blob store: %s", *storeDir)
srv := membership.NewServer(store, blobs)
srv := membership.NewServer(store, blobs, authMode)
// On a public (non-loopback) bind, disable cleartext rooms: the embedded NATS
// has no per-subject ACL, so cleartext content would be readable by any
// registered peer. Forcing E2E keeps message content confidential regardless
// (audit H4 minimum defense; see dev/0004d-dataplane-acl.md).
if !isLoopbackBind(*bind) {
srv.RequireEncryptedRooms = true
log.Printf("cleartext rooms: DISABLED (public bind requires end-to-end encryption)")
}
// Publish this node's posture on /healthz so a monitor (or a peer) can detect a
// cluster member not running the homogeneous enforce+ACL+TLS posture (audit
// 0008 N1). enforce implies the per-subject ACL in this binary (they are wired
// together above).
srv.Posture = membership.Posture{
Enforce: enforce,
ACL: enforce,
TLS: *tlsCert != "",
Cluster: clustered,
Store: *storeBackend,
}
// Replicated anti-replay (issue 0006a, audit 0008 N3): a clustered node MUST
// share its nonce store across the cluster, or a request accepted on one node
// can be replayed to another. HARD requirement: if the bucket cannot be created
// the node refuses to start rather than run with a per-process cache that leaves
// the replay hole open.
if needJS {
if err := wireReplicatedNonces(srv, js, clustered, *kvReplicas); err != nil {
log.Fatalf("%v", err)
}
if clustered {
log.Printf("anti-replay: replicated nonce bucket \"KV_UNIBUS_nonces\" (replicas=%d) — cluster-safe", *kvReplicas)
}
}
log.Printf("control-plane auth: %s", authMode)
addr := *bind + ":" + *httpPort
httpSrv := &http.Server{Addr: addr, Handler: srv}
httpSrv := &http.Server{
Addr: addr,
Handler: srv,
// Bound request header size so a peer cannot exhaust memory with huge
// headers before any body limit applies (the body ceilings live in the
// membership middleware).
MaxHeaderBytes: membership.MaxHeaderBytes,
ReadHeaderTimeout: 10 * time.Second,
}
go func() {
log.Printf("HTTP control-plane API: http://%s", addr)
log.Printf(" health: http://%s/healthz", addr)
if err := httpSrv.ListenAndServe(); err != nil && err != http.ErrServerClosed {
log.Fatalf("http server: %v", err)
var serveErr error
if *tlsCert != "" {
// Serve the control plane over TLS with the same CA-signed cert as the
// data plane (audit H5): metadata (subjects, pubkeys, sealed keys, the
// social graph) is no longer readable by a network MITM. The fail-open
// guard already requires --bus-auth enforce alongside these flags.
httpSrv.TLSConfig = &tls.Config{MinVersion: tls.VersionTLS12}
log.Printf("HTTPS control-plane API: https://%s", addr)
log.Printf(" health: https://%s/healthz", addr)
log.Printf("control-plane TLS: ON (%s)", *tlsCert)
serveErr = httpSrv.ListenAndServeTLS(*tlsCert, *tlsKey)
} else {
log.Printf("HTTP control-plane API: http://%s", addr)
log.Printf(" health: http://%s/healthz", addr)
serveErr = httpSrv.ListenAndServe()
}
if serveErr != nil && serveErr != http.ErrServerClosed {
log.Fatalf("http server: %v", serveErr)
}
}()
+95
View File
@@ -0,0 +1,95 @@
package main
import (
"flag"
"fmt"
"os"
"time"
"github.com/enmanuel/unibus/pkg/busauth"
"github.com/enmanuel/unibus/pkg/membership"
"github.com/nats-io/nats.go"
"github.com/nats-io/nats.go/jetstream"
)
// runMigrateCLI implements `membershipd migrate-to-kv`, the idempotent move of
// the control-plane state from the local SQLite database into replicated
// JetStream KV (issue 0003c). It backs up the SQLite file first (VACUUM INTO),
// then connects to the target NATS and copies every room/member/key/user into
// the KV buckets. Re-running it converges to the same state.
//
// It runs on the bus host (no auth on the control-plane side), connecting to the
// cluster's NATS; --ca pins TLS when the data plane is secured.
func runMigrateCLI(args []string) {
fs := flag.NewFlagSet("migrate-to-kv", flag.ExitOnError)
dbPath := fs.String("db", defaultDBPath, "SQLite database path to migrate FROM")
natsURL := fs.String("nats-url", "", "NATS url of the cluster to migrate INTO (required)")
ca := fs.String("ca", "", "CA cert to pin TLS on the NATS connection (optional)")
replicas := fs.Int("replicas", 1, "KV replication factor (1 for a 1-2 node rollout, 3 for HA quorum)")
noBackup := fs.Bool("no-backup", false, "skip the SQLite backup before migrating (NOT recommended)")
_ = fs.Parse(args)
if *natsURL == "" {
fmt.Fprintln(os.Stderr, "membershipd migrate-to-kv: --nats-url is required (the cluster to write the KV buckets into)")
os.Exit(2)
}
// Confidentiality guard (issue 0006f, audit 0008 N6): the migration writes the
// allowlist (handles, roles, signing pubkeys) into the KV. Against a REMOTE NATS
// without TLS that metadata would travel in cleartext, so a remote target MUST
// be TLS-pinned with --ca. A loopback target is local-only and exempt.
if !isLoopbackURL(*natsURL) && *ca == "" {
fmt.Fprintf(os.Stderr, "membershipd migrate-to-kv: refusing to migrate to remote %q without --ca; the allowlist (handles/roles/sign pubs) would travel in cleartext — pin TLS with --ca, or run against a loopback nats-url\n", *natsURL)
os.Exit(2)
}
// Back up the SQLite database first so a botched migration can be undone.
var backupPath string
if !*noBackup {
bak, err := membership.BackupSQLite(*dbPath)
if err != nil {
fmt.Fprintf(os.Stderr, "membershipd migrate-to-kv: backup failed: %v\n", err)
os.Exit(1)
}
backupPath = bak
fmt.Printf("backed up %s -> %s\n", *dbPath, backupPath)
}
// Connect to the target NATS (optionally TLS-pinned to the bus CA).
natsOpts := []nats.Option{nats.Name("unibus-migrate")}
if *ca != "" {
tlsCfg, err := busauth.LoadCATLSConfig(*ca)
if err != nil {
fmt.Fprintf(os.Stderr, "membershipd migrate-to-kv: load CA: %v\n", err)
os.Exit(1)
}
natsOpts = append(natsOpts, nats.Secure(tlsCfg))
}
nc, err := nats.Connect(*natsURL, natsOpts...)
if err != nil {
fmt.Fprintf(os.Stderr, "membershipd migrate-to-kv: connect %q: %v\n", *natsURL, err)
os.Exit(1)
}
defer nc.Close()
js, err := jetstream.New(nc)
if err != nil {
fmt.Fprintf(os.Stderr, "membershipd migrate-to-kv: jetstream: %v\n", err)
os.Exit(1)
}
report, err := membership.MigrateSQLiteToKV(*dbPath, js, membership.JetStreamConfig{
Replicas: *replicas,
OpTimeout: 30 * time.Second,
})
if err != nil {
fmt.Fprintf(os.Stderr, "membershipd migrate-to-kv: %v\n", err)
os.Exit(1)
}
report.BackupPath = backupPath
fmt.Printf("migrated to KV (replicas=%d): %d rooms, %d members, %d keys, %d users\n",
*replicas, report.Rooms, report.Members, report.Keys, report.Users)
if backupPath != "" {
fmt.Printf("rollback: restore %s if needed\n", backupPath)
}
}
+60
View File
@@ -0,0 +1,60 @@
package main
import (
"fmt"
"sync"
"github.com/enmanuel/unibus/pkg/membership"
)
// storeHolder is a concurrency-safe slot for the control-plane store, used to
// break the decentralized bootstrap cycle (issue 0006c): the NATS authenticator
// must be built BEFORE the embedded server starts, but the JetStream KV store can
// only be opened AFTER NATS is up (it needs a JetStream client). The authenticator
// therefore consults the holder instead of a concrete store.
//
// Fail-closed by construction: until the store is set, IsAuthorized denies and
// SubjectACL errors, so any client connecting in the startup window is rejected.
// The only connection expected in that window is membershipd's own internal
// service identity, which the authenticator recognizes by key and lets through
// without consulting the store at all. In the SQLite (default) path the store is
// set before StartServer, so the window does not exist and behavior is identical
// to the pre-0006c baseline.
type storeHolder struct {
mu sync.RWMutex
s membership.Store
}
func (h *storeHolder) set(s membership.Store) {
h.mu.Lock()
h.s = s
h.mu.Unlock()
}
func (h *storeHolder) get() membership.Store {
h.mu.RLock()
defer h.mu.RUnlock()
return h.s
}
// IsAuthorized reports whether signPubHex is an active bus user, denying while the
// store is not yet set (fail closed). It is the predicate the nkey authenticator
// uses for every connecting client.
func (h *storeHolder) IsAuthorized(signPubHex string) bool {
s := h.get()
if s == nil {
return false
}
return s.IsAuthorized(signPubHex)
}
// subjectACL derives the per-subject permissions for signPubHex via the live
// store, erroring (so the caller fails closed and denies the connection) while the
// store is not yet set.
func (h *storeHolder) subjectACL(signPubHex string) ([]string, error) {
s := h.get()
if s == nil {
return nil, fmt.Errorf("control-plane store not ready")
}
return membership.SubjectACLFor(s)(signPubHex)
}
+51
View File
@@ -0,0 +1,51 @@
package main
import (
"encoding/hex"
"path/filepath"
"testing"
cs "fn-registry/functions/cybersecurity"
"github.com/enmanuel/unibus/pkg/membership"
)
// TestStoreHolderFailClosed: an empty holder denies everything (the bootstrap
// window before the store is set), and starts serving once a store is published.
func TestStoreHolderFailClosed(t *testing.T) {
h := &storeHolder{}
// Empty: deny + error (fail closed).
if h.IsAuthorized("anything") {
t.Fatalf("empty holder must deny IsAuthorized")
}
if _, err := h.subjectACL("anything"); err == nil {
t.Fatalf("empty holder must error from subjectACL (fail closed)")
}
// After set: serves from the real store.
store, err := membership.Open(filepath.Join(t.TempDir(), "unibus.db"))
if err != nil {
t.Fatalf("store: %v", err)
}
t.Cleanup(func() { store.Close() })
id, err := cs.GenerateIdentity()
if err != nil {
t.Fatalf("identity: %v", err)
}
pub := hex.EncodeToString(id.SignPub)
if err := store.AddUser(pub, "alice", membership.RoleMember); err != nil {
t.Fatalf("add user: %v", err)
}
h.set(store)
if !h.IsAuthorized(pub) {
t.Fatalf("after set, an active user must be authorized")
}
if _, err := h.subjectACL(pub); err != nil {
t.Fatalf("after set, subjectACL must succeed: %v", err)
}
if h.IsAuthorized("deadbeef") {
t.Fatalf("a non-user must not be authorized")
}
}
+245
View File
@@ -0,0 +1,245 @@
package main
import (
"errors"
"flag"
"fmt"
"os"
"strings"
"text/tabwriter"
"github.com/enmanuel/unibus/pkg/membership"
)
// runUserCLI implements `membershipd user <add|list|revoke> ...`, the local
// administration surface for the bus user allowlist. It opens the SQLite store
// directly (no network, no auth): it is meant to run on the bus host, where
// shell access already implies full control. This is the seam that seeds the
// first admin, breaking the chicken-egg of "you need an admin to add an admin".
//
// The function never returns: it exits the process with a non-zero status on
// error so it composes cleanly in shell scripts and systemd ExecStartPre hooks.
func runUserCLI(args []string) {
if len(args) == 0 {
userUsage()
os.Exit(2)
}
sub, rest := args[0], args[1:]
switch sub {
case "add":
userAdd(rest)
case "list":
userList(rest)
case "revoke":
userRevoke(rest)
case "-h", "--help", "help":
userUsage()
os.Exit(0)
default:
fmt.Fprintf(os.Stderr, "membershipd user: unknown subcommand %q\n\n", sub)
userUsage()
os.Exit(2)
}
}
func userUsage() {
fmt.Fprint(os.Stderr, `usage: membershipd user <command> [flags]
commands:
add Register a bus user from their Ed25519 signing public key
list List all registered users
revoke Revoke a user (denies access on both planes immediately)
store backends (--store):
sqlite local SQLite database (default; seeds the first admin offline)
kv the RUNNING cluster's replicated JetStream KV allowlist, via the
privileged internal connection — add users with the cluster live,
no stop-seed-restart needed (run over loopback/SSH on a node)
examples:
membershipd user add --handle alice --sign-pub <64-hex> --role admin
membershipd user add --store kv --handle bob --sign-pub <64-hex> --role member
membershipd user list --store kv
membershipd user revoke <64-hex>
common flags:
--db <path> SQLite database path (--store sqlite; default ./local_files/unibus.db)
--store kv flags (defaults assume an on-node invocation):
--nats-url <url> cluster NATS (default nats://127.0.0.1:4250)
--internal-id-file <path> persisted internal service identity (default /opt/unibus/secrets/internal.id)
--ca <path> CA cert pinning the data-plane TLS (default /opt/unibus/tls/ca.crt)
--kv-replicas <n> KV replication factor, match the cluster (default 3)
`)
}
const defaultDBPath = "./local_files/unibus.db"
// openStore opens the membership store at path, exiting on failure. Migrations
// (including 002_users.sql) are applied by membership.Open, so a fresh database
// gets the users table on first use of the CLI.
func openStore(path string) membership.Store {
store, err := membership.Open(path)
if err != nil {
fmt.Fprintf(os.Stderr, "membershipd user: open store %q: %v\n", path, err)
os.Exit(1)
}
return store
}
// validateSignPubHex ensures the key is exactly a 32-byte Ed25519 public key in
// hex (64 hex chars). Catching this here turns a silent "authorized nobody" into
// an explicit error at seed time. It delegates to membership.ValidateSignPubHex
// so the CLI and the HTTP user-management handlers share one rule.
func validateSignPubHex(signPub string) error {
return membership.ValidateSignPubHex(signPub)
}
// kvFlags holds the connection flags shared by the --store kv path of the user
// subcommands. registerKVFlags wires them onto a flag set so add and list expose
// an identical interface.
type kvFlags struct {
store *string
natsURL *string
internalID *string
ca *string
replicas *int
}
func registerKVFlags(fs *flag.FlagSet) kvFlags {
return kvFlags{
store: fs.String("store", "sqlite", "user store backend: sqlite (local DB) | kv (the live cluster's replicated allowlist)"),
natsURL: fs.String("nats-url", defaultClusterNatsURL, "cluster NATS url for --store kv"),
internalID: fs.String("internal-id-file", defaultInternalIDFile, "persisted internal service identity for --store kv"),
ca: fs.String("ca", defaultClusterCAFile, "CA cert pinning TLS on the --store kv NATS connection"),
replicas: fs.Int("kv-replicas", 3, "KV replication factor for --store kv (match the cluster)"),
}
}
// resolveStore returns the membership store for the chosen backend plus a cleanup
// func. For --store kv it opens the privileged connection to the live cluster; for
// sqlite it opens the local file. It exits the process with a clear message on any
// failure (a dead NATS, a missing identity file), so a broken --store kv add fails
// loudly instead of silently — Error case of the GAP A DoD. The returned *kvConn
// is non-nil only for the kv backend (so the caller can report replication).
func resolveStore(cmd string, kf kvFlags, dbPath string) (membership.Store, *kvConn, func()) {
switch *kf.store {
case "sqlite":
store := openStore(dbPath)
return store, nil, func() { store.Close() }
case "kv":
kv, err := connectKVStore(*kf.natsURL, *kf.internalID, *kf.ca, *kf.replicas)
if err != nil {
fmt.Fprintf(os.Stderr, "membershipd %s: --store kv: %v\n", cmd, err)
os.Exit(1)
}
return kv.store, kv, kv.Close
default:
fmt.Fprintf(os.Stderr, "membershipd %s: --store must be \"sqlite\" or \"kv\", got %q\n", cmd, *kf.store)
os.Exit(2)
return nil, nil, func() {}
}
}
func userAdd(args []string) {
fs := flag.NewFlagSet("user add", flag.ExitOnError)
handle := fs.String("handle", "", "human-readable user name (required)")
signPub := fs.String("sign-pub", "", "Ed25519 signing public key in hex (required)")
role := fs.String("role", membership.RoleMember, "role: admin or member")
dbPath := fs.String("db", defaultDBPath, "SQLite database path")
kf := registerKVFlags(fs)
_ = fs.Parse(args)
if *handle == "" || *signPub == "" {
fmt.Fprintln(os.Stderr, "membershipd user add: --handle and --sign-pub are required")
os.Exit(2)
}
if err := validateSignPubHex(*signPub); err != nil {
fmt.Fprintf(os.Stderr, "membershipd user add: %v\n", err)
os.Exit(2)
}
store, kv, closeStore := resolveStore("user add", kf, *dbPath)
defer closeStore()
if err := store.AddUser(*signPub, *handle, *role); err != nil {
if errors.Is(err, membership.ErrUserExists) {
// Idempotency contract (GAP A): re-adding the same key is an EXPLICIT,
// non-destructive error — the existing row is left untouched (no silent
// upsert that could flip a role or clobber status, which would corrupt the
// allowlist). To replace a user, `user revoke <key>` then add again.
fmt.Fprintf(os.Stderr, "membershipd user add: user %s already registered (unchanged); revoke it first to replace\n", *signPub)
os.Exit(1)
}
fmt.Fprintf(os.Stderr, "membershipd user add: %v\n", err)
os.Exit(1)
}
fmt.Printf("added user %q (%s) role=%s\n", *handle, *signPub, *role)
if kv != nil {
reportKVReplication(kv.js)
}
}
func userList(args []string) {
fs := flag.NewFlagSet("user list", flag.ExitOnError)
dbPath := fs.String("db", defaultDBPath, "SQLite database path")
kf := registerKVFlags(fs)
_ = fs.Parse(args)
store, _, closeStore := resolveStore("user list", kf, *dbPath)
defer closeStore()
users, err := store.ListUsers()
if err != nil {
fmt.Fprintf(os.Stderr, "membershipd user list: %v\n", err)
os.Exit(1)
}
if len(users) == 0 {
fmt.Println("(no users)")
return
}
w := tabwriter.NewWriter(os.Stdout, 0, 2, 2, ' ', 0)
fmt.Fprintln(w, "HANDLE\tROLE\tSTATUS\tSIGN_PUB\tCREATED")
for _, u := range users {
fmt.Fprintf(w, "%s\t%s\t%s\t%s\t%s\n", u.Handle, u.Role, u.Status, u.SignPub, u.CreatedAt)
}
_ = w.Flush()
}
func userRevoke(args []string) {
fs := flag.NewFlagSet("user revoke", flag.ExitOnError)
dbPath := fs.String("db", defaultDBPath, "SQLite database path")
kf := registerKVFlags(fs)
// Go's flag package stops at the first non-flag argument, so `revoke <key>
// --db path` would otherwise leave --db unparsed. Pull a leading positional
// (the sign-pub) off the front before parsing so both `revoke <key> --db p`
// and `revoke --db p <key>` work for the operator.
var signPub string
if len(args) > 0 && !strings.HasPrefix(args[0], "-") {
signPub, args = args[0], args[1:]
}
_ = fs.Parse(args)
if signPub == "" {
if rest := fs.Args(); len(rest) == 1 {
signPub = rest[0]
}
}
if signPub == "" {
fmt.Fprintln(os.Stderr, "membershipd user revoke: exactly one <sign-pub> argument required")
os.Exit(2)
}
if err := validateSignPubHex(signPub); err != nil {
fmt.Fprintf(os.Stderr, "membershipd user revoke: %v\n", err)
os.Exit(2)
}
store, _, closeStore := resolveStore("user revoke", kf, *dbPath)
defer closeStore()
if err := store.RevokeUser(signPub); err != nil {
fmt.Fprintf(os.Stderr, "membershipd user revoke: %v\n", err)
os.Exit(1)
}
fmt.Printf("revoked user %s\n", signPub)
}
+151
View File
@@ -0,0 +1,151 @@
package main
import (
"context"
"fmt"
"os"
"time"
"github.com/enmanuel/unibus/pkg/busauth"
"github.com/enmanuel/unibus/pkg/client"
"github.com/enmanuel/unibus/pkg/membership"
"github.com/nats-io/nats.go"
"github.com/nats-io/nats.go/jetstream"
)
// users_kv.go is the `--store kv` half of the user administration CLI (issue 0011
// gaps, GAP A): adding and listing bus users directly against the RUNNING
// cluster's replicated JetStream KV allowlist, with no need to stop the cluster,
// seed a standalone node, and restart (the procedure the 0011 deploy required).
//
// The mechanism is the cluster's own privileged internal connection. Under
// enforce every bus user is confined by the per-subject ACL to the JetStream API
// of its own rooms, so no ordinary identity may touch the control-plane buckets
// (KV_UNIBUS_*). The ONLY identity the authenticator grants full JetStream
// permissions is membershipd's internal service identity. By persisting that
// identity to a file (membershipd --internal-id-file) the same key becomes
// available to this CLI, which presents it as its NATS nkey and is therefore
// recognized as the privileged internal client and allowed to read/write the KV.
//
// Intended invocation is over loopback on a cluster node (SSH): the data-plane
// TLS certificate's SAN covers 127.0.0.1/localhost and the internal identity file
// lives 0600 next to the node's TLS keys. Using the file requires root on the
// node, which already implies full control of that node — so co-locating it adds
// no practical exposure beyond what the TLS server key and cluster password
// already represent.
// defaultClusterNatsURL is the node-local NATS listener. The CLI is meant to run
// on a cluster node over SSH, talking to that node's own embedded server.
const defaultClusterNatsURL = "nats://127.0.0.1:4250"
// Deploy-default paths for the privileged identity and the data-plane CA, so an
// on-node invocation needs only --handle/--sign-pub/--role. Override for other
// layouts.
const (
defaultInternalIDFile = "/opt/unibus/secrets/internal.id"
defaultClusterCAFile = "/opt/unibus/tls/ca.crt"
)
// kvConn bundles the privileged NATS connection to a live cluster and the
// KV-backed control-plane store opened over it. Close releases both.
type kvConn struct {
nc *nats.Conn
js jetstream.JetStream
store membership.Store
}
func (k *kvConn) Close() {
if k == nil {
return
}
if k.store != nil {
_ = k.store.Close()
}
if k.nc != nil {
k.nc.Close()
}
}
// connectKVStore opens the privileged internal connection to the cluster's NATS
// and the JetStream KV control-plane store on top of it. internalIDFile is the
// membershipd-persisted internal service identity whose nkey the authenticator
// grants full permissions; caPath pins the data-plane TLS (empty only for a
// non-TLS dev cluster). A non-loopback target without --ca is refused, mirroring
// migrate-to-kv (audit 0008 N6): the allowlist write must not travel in cleartext.
func connectKVStore(natsURL, internalIDFile, caPath string, replicas int) (*kvConn, error) {
if internalIDFile == "" {
return nil, fmt.Errorf("--internal-id-file is required for --store kv (the privileged identity membershipd persists with --internal-id-file)")
}
// Confidentiality guard: a remote NATS without TLS would expose the allowlist
// (handles/roles/sign-pubs) and the privileged nkey handshake in cleartext.
if !isLoopbackURL(natsURL) && caPath == "" {
return nil, fmt.Errorf("refusing to connect to remote %q without --ca: the allowlist write would travel in cleartext — pin TLS with --ca, or run over a loopback --nats-url on a node", natsURL)
}
id, err := client.LoadIdentity(internalIDFile)
if err != nil {
return nil, fmt.Errorf("load internal identity: %w", err)
}
nkeyPub, nkeySign, err := busauth.ClientNkey(id.SignPriv)
if err != nil {
return nil, fmt.Errorf("derive nkey from internal identity: %w", err)
}
opts := []nats.Option{
nats.Name("membershipd-user-cli"),
nats.Nkey(nkeyPub, nkeySign),
}
if caPath != "" {
tlsCfg, err := busauth.LoadCATLSConfig(caPath)
if err != nil {
return nil, fmt.Errorf("load CA %q: %w", caPath, err)
}
opts = append(opts, nats.Secure(tlsCfg))
}
nc, err := nats.Connect(natsURL, opts...)
if err != nil {
return nil, fmt.Errorf("connect cluster NATS %q: %w", natsURL, err)
}
js, err := jetstream.New(nc)
if err != nil {
nc.Close()
return nil, fmt.Errorf("jetstream: %w", err)
}
store, err := membership.OpenJetStream(js, membership.JetStreamConfig{Replicas: replicas})
if err != nil {
nc.Close()
return nil, fmt.Errorf("open KV control-plane store: %w", err)
}
return &kvConn{nc: nc, js: js, store: store}, nil
}
// reportKVReplication prints the replication status of the allowlist bucket
// stream (KV_UNIBUS_users) right after a write, so the operator sees the add
// landed on a quorum and replicated to the followers — executable evidence that
// the live-cluster add is HA, not single-node. Best-effort: a read failure is a
// note, not an error (the write itself already succeeded).
func reportKVReplication(js jetstream.JetStream) {
ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
defer cancel()
st, err := js.Stream(ctx, "KV_UNIBUS_users")
if err != nil {
fmt.Fprintf(os.Stderr, "note: could not read KV_UNIBUS_users stream info: %v\n", err)
return
}
info, err := st.Info(ctx)
if err != nil {
fmt.Fprintf(os.Stderr, "note: could not read KV_UNIBUS_users stream info: %v\n", err)
return
}
if info.Cluster == nil {
fmt.Printf("KV_UNIBUS_users: standalone (R1, no cluster replication); msgs=%d\n", info.State.Msgs)
return
}
current := 0
for _, r := range info.Cluster.Replicas {
if r.Current {
current++
}
}
fmt.Printf("KV_UNIBUS_users: leader=%s followers_current=%d/%d msgs=%d\n",
info.Cluster.Leader, current, len(info.Cluster.Replicas), info.State.Msgs)
}
+40
View File
@@ -0,0 +1,40 @@
package main
import (
"fmt"
"github.com/enmanuel/unibus/pkg/membership"
"github.com/nats-io/nats.go/jetstream"
)
// wireReplicatedNonces applies the cluster anti-replay policy to srv. It is the
// single piece of wiring the binary uses to decide whether a node must share its
// nonce store, extracted so a regression test exercises the EXACT decision the
// running binary makes (issue 0006a, audit 0008 N3).
//
// Policy:
// - A clustered node (clustered == true) MUST use the shared JetStream KV nonce
// bucket. Every node sees the same bucket, so a request accepted on one node
// cannot be replayed to another whose per-process cache never saw the nonce.
// A missing JetStream context, or a failure to create the bucket, is a FATAL
// configuration error returned to the caller — a clustered node running with a
// per-process nonce cache is precisely the replay hole the audit flagged, so
// it must refuse to start rather than serve insecurely.
// - A standalone node (clustered == false) keeps the in-memory cache that
// NewServer installed: there is no second node to replay to, so the shared
// bucket would only add a JetStream dependency for no security gain.
//
// replicas is the nonce bucket's replication factor (R1..R3). Returns nil when no
// action is required (standalone).
func wireReplicatedNonces(srv *membership.Server, js jetstream.JetStream, clustered bool, replicas int) error {
if !clustered {
return nil // standalone: the in-memory nonce cache is sufficient and safe
}
if js == nil {
return fmt.Errorf("clustered node requires JetStream for the shared nonce bucket, but none is available")
}
if err := srv.UseReplicatedNonces(js, replicas); err != nil {
return fmt.Errorf("replicated nonces: %w", err)
}
return nil
}
+246
View File
@@ -0,0 +1,246 @@
package main
import (
"encoding/hex"
"fmt"
"strings"
"sync"
cs "fn-registry/functions/cybersecurity"
"github.com/enmanuel/unibus/pkg/busauth"
"github.com/enmanuel/unibus/pkg/client"
"github.com/enmanuel/unibus/pkg/frame"
"github.com/enmanuel/unibus/pkg/room"
)
// gateway is the live web gateway: it owns the operator's identity and a single
// connected unibus client, and turns the bus's crypto-bearing API into the plain
// REST/SSE surface the browser consumes. The browser never signs, never speaks
// NATS, and never sees a private key — the gateway is the legitimate room member
// that seals/opens payloads on the browser's behalf.
//
// TRUST MODEL: content stays end-to-end encrypted on the wire. The gateway can
// read plaintext because it acts AS the operator's client — a real member of
// each room, holding the room key K like any peer. It is the same trust a native
// desktop client has. In the wallet phase (per-browser WebCrypto identity) the
// decryption can move into the browser; today, for the single-operator MVP, the
// gateway decrypts server-side and pushes cleartext over a loopback/authenticated
// SSE channel.
type gateway struct {
id cs.Identity
endpoint string
cli *client.Client
refreshACL bool // call RefreshSession after a membership change (needed under a per-subject ACL bus)
mu sync.Mutex
hubs map[string]*roomHub // roomID -> live fan-out of decrypted frames to SSE clients
}
// gatewayConfig wires a live gateway.
type gatewayConfig struct {
Identity cs.Identity
NatsURL string
CtrlURL string
CtrlURLs []string
NatsURLs []string
CAPath string // bus CA; empty => plaintext dev connection (matches a loopback membershipd)
}
// newGateway connects the unibus client with the operator identity following the
// same posture seam every peer uses: a non-empty CA path means TLS + nkey, empty
// means plaintext dev. When a CA is configured the bus is assumed to enforce a
// per-subject ACL, so membership changes trigger a session refresh.
func newGateway(cfg gatewayConfig) (*gateway, error) {
opts := client.Options{
CtrlURLs: cfg.CtrlURLs,
NatsServers: cfg.NatsURLs,
}
if cfg.CAPath != "" {
tlsCfg, err := busauth.LoadCATLSConfig(cfg.CAPath)
if err != nil {
return nil, fmt.Errorf("webgw: load bus CA %q: %w", cfg.CAPath, err)
}
opts.UseNkey = true
opts.TLS = tlsCfg
opts.CtrlTLS = tlsCfg
}
cli, err := client.NewWithOptions(cfg.NatsURL, cfg.CtrlURL, cfg.Identity, opts)
if err != nil {
return nil, fmt.Errorf("webgw: connect bus client: %w", err)
}
return &gateway{
id: cfg.Identity,
endpoint: frame.EndpointID(cfg.Identity.SignPub),
cli: cli,
refreshACL: cfg.CAPath != "",
hubs: map[string]*roomHub{},
}, nil
}
// Close stops every hub and releases the bus client connection.
func (g *gateway) Close() error {
g.mu.Lock()
for _, h := range g.hubs {
h.stop()
}
g.hubs = map[string]*roomHub{}
g.mu.Unlock()
if g.cli != nil {
return g.cli.Close()
}
return nil
}
// ---- wire types (browser-facing JSON) ------------------------------------
// meInfo is what GET /api/me returns: the operator identity the gateway acts as.
type meInfo struct {
Endpoint string `json:"endpoint"`
SignPub string `json:"sign_pub"`
}
// roomWire is the browser view of a room. It deliberately omits messages: those
// stream over SSE (GET /api/rooms/{id}/stream), not in the room list.
type roomWire struct {
ID string `json:"id"`
Subject string `json:"subject"`
Name string `json:"name"`
Epoch int `json:"epoch"`
Encrypt bool `json:"encrypt"`
Persist bool `json:"persist"`
SignMsgs bool `json:"sign_msgs"`
Role string `json:"role"`
}
// createRoomReq is the POST /api/rooms body. Encrypt/Persist/SignMsgs are
// pointers so an omitted field falls back to the chat default rather than to the
// Go zero value (false). The common case — the browser sending only {subject,
// encrypted} — maps encrypted onto all three (the Matrix-like chat policy).
type createRoomReq struct {
Subject string `json:"subject"`
Encrypted *bool `json:"encrypted,omitempty"`
Encrypt *bool `json:"encrypt,omitempty"`
Persist *bool `json:"persist,omitempty"`
SignMsgs *bool `json:"sign_msgs,omitempty"`
}
// policy resolves the requested policy. A bare {subject} defaults to the
// Matrix-like chat room (encrypted + persisted + signed) so a created room keeps
// durable, end-to-end-encrypted, authored history. Callers can override any leg.
func (r createRoomReq) policy() room.Policy {
enc, per, sig := true, true, true
if r.Encrypted != nil {
enc, per, sig = *r.Encrypted, *r.Encrypted, *r.Encrypted
}
if r.Encrypt != nil {
enc = *r.Encrypt
}
if r.Persist != nil {
per = *r.Persist
}
if r.SignMsgs != nil {
sig = *r.SignMsgs
}
return room.Policy{Encrypt: enc, Persist: per, SignMsgs: sig}
}
// sendReq is the POST /api/rooms/{id}/send body.
type sendReq struct {
Body string `json:"body"`
}
// msgWire is one decrypted message pushed over SSE.
type msgWire struct {
ID string `json:"id"`
Sender string `json:"sender"`
Body string `json:"body"`
TS int64 `json:"ts"` // epoch ms (decoded from the frame's ULID id)
Mine bool `json:"mine"`
}
// ---- operations -----------------------------------------------------------
func (g *gateway) me() meInfo {
return meInfo{Endpoint: g.endpoint, SignPub: hex.EncodeToString(g.id.SignPub)}
}
// subjectName derives a short, human-friendly room name from its bus subject by
// dropping the leading namespace segment (room., test., proc., agent.). It is a
// display nicety only; the canonical identity stays the subject/room id.
func subjectName(subject string) string {
for _, p := range []string{"room.", "test.", "proc.", "agent.", "rpc."} {
if strings.HasPrefix(subject, p) {
return strings.TrimPrefix(subject, p)
}
}
return subject
}
func (g *gateway) listRooms() ([]roomWire, error) {
rooms, err := g.cli.ListMyRooms()
if err != nil {
return nil, err
}
out := make([]roomWire, 0, len(rooms))
for _, rm := range rooms {
out = append(out, roomWire{
ID: rm.RoomID,
Subject: rm.Subject,
Name: subjectName(rm.Subject),
Epoch: rm.Epoch,
Encrypt: rm.Policy.Encrypt,
Persist: rm.Policy.Persist,
SignMsgs: rm.Policy.SignMsgs,
Role: rm.Role,
})
}
return out, nil
}
func (g *gateway) createRoom(req createRoomReq) (roomWire, error) {
subject := strings.TrimSpace(req.Subject)
if subject == "" {
return roomWire{}, fmt.Errorf("webgw: subject required")
}
p := req.policy()
roomID, err := g.cli.CreateRoom(subject, p)
if err != nil {
return roomWire{}, err
}
// Under a per-subject ACL the operator's frozen NATS permissions do not yet
// cover the new room's subject; refresh so subsequent data-plane use works. On
// a plaintext/non-ACL dev bus this is unnecessary and would needlessly drop any
// live SSE subscriptions, so it is gated on the secured posture.
if g.refreshACL {
_ = g.cli.RefreshSession()
}
return roomWire{
ID: roomID,
Subject: subject,
Name: subjectName(subject),
Epoch: 1,
Encrypt: p.Encrypt,
Persist: p.Persist,
SignMsgs: p.SignMsgs,
Role: "owner",
}, nil
}
// join resolves room metadata and (for encrypted rooms) fetches the room key so
// the gateway can later open payloads. Idempotent.
func (g *gateway) join(roomID string) error {
if err := g.cli.Join(roomID); err != nil {
return err
}
if g.refreshACL {
_ = g.cli.RefreshSession()
}
return nil
}
// send publishes plaintext to a room. The unibus client seals it with the room
// key (encrypted rooms) and signs it (signed rooms) before it leaves the process.
func (g *gateway) send(roomID, body string) error {
return g.cli.Publish(roomID, []byte(body))
}
+140
View File
@@ -0,0 +1,140 @@
package main
import (
"sync"
"github.com/enmanuel/unibus/pkg/client"
"github.com/enmanuel/unibus/pkg/frame"
"github.com/oklog/ulid/v2"
)
// roomHub multiplexes ONE unibus room subscription to MANY SSE clients. The
// unibus client derives a per-(room, endpoint) durable consumer name, so a
// second Subscribe for the same room from the same operator would contend for
// the same durable (load-balanced delivery) rather than each browser receiving
// every message. The hub holds a single subscription per room and fans each
// decrypted frame out to every connected browser, which also means the gateway
// opens at most one bus subscription per room regardless of how many tabs watch
// it.
type roomHub struct {
roomID string
myEndpoint string
sub *client.Sub
mu sync.Mutex
clients map[chan msgWire]struct{}
}
// frameTS decodes the millisecond timestamp embedded in a frame's ULID id. A
// malformed id (should not happen for bus-produced frames) yields 0, which the
// browser renders without crashing.
func frameTS(msgID string) int64 {
id, err := ulid.Parse(msgID)
if err != nil {
return 0
}
return int64(id.Time())
}
// newRoomHub opens the single bus subscription for roomID and starts fanning
// decrypted frames out to registered clients. The room must already be joined
// (so the gateway holds the room key) before this is called.
func newRoomHub(cli *client.Client, roomID, myEndpoint string) (*roomHub, error) {
h := &roomHub{
roomID: roomID,
myEndpoint: myEndpoint,
clients: map[chan msgWire]struct{}{},
}
sub, err := cli.Subscribe(roomID, func(f frame.Frame, plaintext []byte) {
m := msgWire{
ID: f.MsgID,
Sender: f.Sender,
Body: string(plaintext),
TS: frameTS(f.MsgID),
Mine: f.Sender == myEndpoint,
}
h.broadcast(m)
})
if err != nil {
return nil, err
}
h.sub = sub
return h, nil
}
// broadcast delivers a message to every registered client without blocking the
// NATS delivery goroutine: a client whose buffer is full (a stalled browser)
// drops this frame rather than stalling the whole room.
func (h *roomHub) broadcast(m msgWire) {
h.mu.Lock()
defer h.mu.Unlock()
for ch := range h.clients {
select {
case ch <- m:
default:
}
}
}
// add registers a new SSE client channel.
func (h *roomHub) add(ch chan msgWire) {
h.mu.Lock()
defer h.mu.Unlock()
h.clients[ch] = struct{}{}
}
// stop unsubscribes from the bus. Local delivery ends; for a persisted room the
// durable consumer's ack position stays on the server, so a later subscription
// with the same operator resumes from where it left off.
func (h *roomHub) stop() {
if h.sub != nil {
_ = h.sub.Unsubscribe()
}
}
// openStream joins the room (idempotent; fetches the room key for encrypted
// rooms), attaches an SSE client to the room's hub (creating it on first watcher),
// and returns the client's message channel plus a cleanup func. The cleanup
// detaches the client and, when it was the last watcher, tears down the room's
// single bus subscription.
func (g *gateway) openStream(roomID string) (chan msgWire, func(), error) {
if err := g.join(roomID); err != nil {
return nil, nil, err
}
g.mu.Lock()
h := g.hubs[roomID]
if h == nil {
var err error
h, err = newRoomHub(g.cli, roomID, g.endpoint)
if err != nil {
g.mu.Unlock()
return nil, nil, err
}
g.hubs[roomID] = h
}
g.mu.Unlock()
// Buffer so a brief render hitch in the browser does not drop live frames; a
// sustained stall still drops (broadcast is non-blocking) rather than wedging
// the room.
ch := make(chan msgWire, 64)
h.add(ch)
// cleanup takes g.mu before h.mu (the single, consistent lock order) so a
// concurrent openStream that re-creates the hub cannot race the teardown.
cleanup := func() {
g.mu.Lock()
defer g.mu.Unlock()
h.mu.Lock()
delete(h.clients, ch)
empty := len(h.clients) == 0
h.mu.Unlock()
if empty {
if cur := g.hubs[roomID]; cur == h {
delete(g.hubs, roomID)
h.stop()
}
}
}
return ch, cleanup, nil
}
+98
View File
@@ -0,0 +1,98 @@
package main
import (
"encoding/base64"
"encoding/json"
"fmt"
"os"
"os/exec"
cs "fn-registry/functions/cybersecurity"
)
// identityJSON mirrors the on-disk / pass-stored identity format shared across
// the unibus tooling: the four keypair halves, each std-base64. It is the SAME
// shape the bus client persists (pkg/client identity file) and the operator's
// `pass` entry unibus/operator-identity, so the web gateway loads the operator's
// identity without a divergent serialization. Kept in lockstep with
// unibus_admin/internal/admin/identity.go.
type identityJSON struct {
SignPub string `json:"sign_pub"`
SignPriv string `json:"sign_priv"`
KexPub string `json:"kex_pub"`
KexPriv string `json:"kex_priv"`
}
// decodeIdentity turns the JSON identity bytes into a cs.Identity. The private
// halves stay only in memory; this never writes them anywhere.
func decodeIdentity(raw []byte) (cs.Identity, error) {
var f identityJSON
if err := json.Unmarshal(raw, &f); err != nil {
return cs.Identity{}, fmt.Errorf("webgw: parse identity json: %w", err)
}
dec := base64.StdEncoding.DecodeString
signPub, err := dec(f.SignPub)
if err != nil {
return cs.Identity{}, fmt.Errorf("webgw: decode sign_pub: %w", err)
}
signPriv, err := dec(f.SignPriv)
if err != nil {
return cs.Identity{}, fmt.Errorf("webgw: decode sign_priv: %w", err)
}
kexPub, err := dec(f.KexPub)
if err != nil {
return cs.Identity{}, fmt.Errorf("webgw: decode kex_pub: %w", err)
}
kexPriv, err := dec(f.KexPriv)
if err != nil {
return cs.Identity{}, fmt.Errorf("webgw: decode kex_priv: %w", err)
}
if len(signPub) != 32 || len(signPriv) != 64 || len(kexPub) != 32 || len(kexPriv) != 32 {
return cs.Identity{}, fmt.Errorf("webgw: identity has wrong key sizes (sign_pub=%d sign_priv=%d kex_pub=%d kex_priv=%d)",
len(signPub), len(signPriv), len(kexPub), len(kexPriv))
}
return cs.Identity{SignPub: signPub, SignPriv: signPriv, KexPub: kexPub, KexPriv: kexPriv}, nil
}
// loadIdentityFromFile reads a 0600 identity JSON file (the same format the bus
// client writes) and decodes it. Used on a deploy host where `pass` is not
// available and the operator identity is delivered as a protected file.
func loadIdentityFromFile(path string) (cs.Identity, error) {
raw, err := os.ReadFile(path)
if err != nil {
return cs.Identity{}, fmt.Errorf("webgw: read identity file %q: %w", path, err)
}
return decodeIdentity(raw)
}
// loadIdentityFromPass shells out to `pass show <entry>` and decodes the JSON
// identity it returns. The secret is held only in memory; this process never
// writes it to disk or argv. Used in local operator workflows where the GNU
// password store holds unibus/operator-identity.
func loadIdentityFromPass(entry string) (cs.Identity, error) {
out, err := exec.Command("pass", "show", entry).Output()
if err != nil {
return cs.Identity{}, fmt.Errorf("webgw: pass show %q: %w", entry, err)
}
return decodeIdentity(out)
}
// loadPassValue returns the first line of a `pass show <entry>` for non-identity
// secrets (e.g. the unlock passphrase). Empty entry yields an empty string and
// no error, so callers can treat "no pass entry configured" as "not set".
func loadPassValue(entry string) (string, error) {
if entry == "" {
return "", nil
}
out, err := exec.Command("pass", "show", entry).Output()
if err != nil {
return "", fmt.Errorf("webgw: pass show %q: %w", entry, err)
}
s := string(out)
for i := 0; i < len(s); i++ {
if s[i] == '\n' || s[i] == '\r' {
return s[:i], nil
}
}
return s, nil
}
+180
View File
@@ -0,0 +1,180 @@
// Command webgw is the web gateway for the unibus chat SPA. It is a single Go
// binary that holds the operator's bus identity, connects to the bus as a real
// authenticated peer (pkg/client), and exposes a small REST + SSE API the
// browser consumes. The browser never signs, never speaks NATS, and never sees a
// private key: it authenticates to the gateway with a passphrase and thereafter
// holds only an opaque session cookie.
//
// TRUST MODEL (MVP, single operator): room content stays end-to-end encrypted on
// the bus. The gateway can read plaintext because it acts AS the operator's
// client — a legitimate member of each room holding the room key. Decryption
// happens server-side in this process; cleartext then crosses an authenticated
// (loopback or TLS-fronted) SSE channel to the browser. The wallet phase (issue:
// per-browser WebCrypto identity) can move decryption into the browser; see the
// report for the FASE 2 plan.
//
// # local dev against a loopback membershipd (plaintext), operator from pass:
// webgw --identity-pass unibus/operator-identity \
// --ctrl-url http://127.0.0.1:8470 --nats-url nats://127.0.0.1:4250
//
// # secured cluster (TLS + nkey on both planes), identity from a 0600 file:
// webgw --ca ca.crt --identity-file operator.id \
// --ctrl-url https://node-a:8470 --nats-url nats://node-a:4250 \
// --ctrl-urls https://node-b:8470,https://node-c:8470 \
// --nats-urls nats://node-b:4250,nats://node-c:4250
package main
import (
"context"
"flag"
"log"
"net/http"
"os"
"os/signal"
"path/filepath"
"strings"
"syscall"
"time"
cs "fn-registry/functions/cybersecurity"
)
func main() {
var (
bind = flag.String("bind", "127.0.0.1", "interface to bind the gateway HTTP server to (loopback by default)")
port = flag.String("port", "8481", "gateway HTTP port")
ctrlURL = flag.String("ctrl-url", "http://127.0.0.1:8470", "primary unibus control-plane base URL")
ctrlURLs = flag.String("ctrl-urls", "", "comma-separated ADDITIONAL control-plane base URLs (cluster failover)")
natsURL = flag.String("nats-url", "nats://127.0.0.1:4250", "primary NATS URL")
natsURLs = flag.String("nats-urls", "", "comma-separated ADDITIONAL NATS seed URLs (cluster failover)")
caPath = flag.String("ca", "", "bus CA cert path; set to talk TLS+nkey to a secured bus (empty = plaintext dev)")
identityFile = flag.String("identity-file", "", "path to the operator identity JSON file (0600). Mutually exclusive with --identity-pass")
identityPass = flag.String("identity-pass", "", "pass(1) entry holding the operator identity JSON, e.g. unibus/operator-identity")
unlockPass = flag.String("unlock-pass", "", "literal passphrase the browser must send to unlock a session (dev). Prefer --unlock-pass-entry")
unlockEntry = flag.String("unlock-pass-entry", "unibus/admin-panel-password", "pass(1) entry holding the unlock passphrase (used when --unlock-pass is empty)")
webDir = flag.String("web-dir", "", "OPTIONAL path to the built SPA (web/dist) to serve. Empty = API only (use vite dev server)")
)
flag.Parse()
log.SetFlags(log.LstdFlags | log.Lmsgprefix)
log.SetPrefix("[webgw] ")
id, err := loadIdentity(*identityFile, *identityPass)
if err != nil {
log.Fatalf("%v", err)
}
unlock := *unlockPass
if unlock == "" {
unlock, err = loadPassValue(*unlockEntry)
if err != nil {
log.Fatalf("resolve unlock passphrase: %v", err)
}
}
if unlock == "" {
log.Fatalf("an unlock passphrase is required: set --unlock-pass or a non-empty --unlock-pass-entry (default unibus/admin-panel-password)")
}
resolvedWebDir := resolveWebDir(*webDir)
gw, err := newGateway(gatewayConfig{
Identity: id,
NatsURL: *natsURL,
CtrlURL: *ctrlURL,
CtrlURLs: splitCSV(*ctrlURLs),
NatsURLs: splitCSV(*natsURLs),
CAPath: *caPath,
})
if err != nil {
log.Fatalf("%v", err)
}
defer gw.Close()
log.Printf("operator endpoint: %s", gw.endpoint)
log.Printf("control plane: %s (+%d failover)", *ctrlURL, len(splitCSV(*ctrlURLs)))
tls := "OFF (plaintext dev)"
if *caPath != "" {
tls = "ON (CA " + *caPath + ")"
}
log.Printf("bus TLS+nkey: %s", tls)
if resolvedWebDir != "" {
log.Printf("serving SPA from: %s", resolvedWebDir)
} else {
log.Printf("API only (no --web-dir): use the vite dev server with a /api+stream proxy")
}
srv := newServer(gw, unlock, resolvedWebDir)
addr := *bind + ":" + *port
httpSrv := &http.Server{
Addr: addr,
Handler: srv,
// No global write timeout: SSE streams are long-lived. Header timeout still
// bounds slowloris on the request line/headers.
ReadHeaderTimeout: 10 * time.Second,
}
go func() {
log.Printf("web gateway: http://%s", addr)
if err := httpSrv.ListenAndServe(); err != nil && err != http.ErrServerClosed {
log.Fatalf("http server: %v", err)
}
}()
stop := make(chan os.Signal, 1)
signal.Notify(stop, syscall.SIGINT, syscall.SIGTERM)
<-stop
log.Printf("shutting down...")
ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
defer cancel()
_ = httpSrv.Shutdown(ctx)
log.Printf("bye")
}
// loadIdentity resolves the operator identity from exactly one of --identity-file
// or --identity-pass.
func loadIdentity(file, passEntry string) (cs.Identity, error) {
switch {
case file != "" && passEntry != "":
return cs.Identity{}, errFlag("set only one of --identity-file or --identity-pass")
case file != "":
return loadIdentityFromFile(file)
case passEntry != "":
return loadIdentityFromPass(passEntry)
default:
return cs.Identity{}, errFlag("an identity is required: pass --identity-file <path> or --identity-pass <entry>")
}
}
// resolveWebDir validates the --web-dir flag. An empty flag means API-only. A
// non-empty dir is kept only if it actually holds an index.html, so a typo logs
// "API only" rather than serving 404s.
func resolveWebDir(dir string) string {
if dir == "" {
return ""
}
abs, err := filepath.Abs(dir)
if err != nil {
log.Printf("WARN --web-dir %q: %v; serving API only", dir, err)
return ""
}
if !statFile(filepath.Join(abs, "index.html")) {
log.Printf("WARN --web-dir %q has no index.html; serving API only", abs)
return ""
}
return abs
}
type flagErr string
func (e flagErr) Error() string { return string(e) }
func errFlag(s string) error { return flagErr("webgw: " + s) }
func splitCSV(s string) []string {
var out []string
for _, p := range strings.Split(s, ",") {
if p = strings.TrimSpace(p); p != "" {
out = append(out, p)
}
}
return out
}
+301
View File
@@ -0,0 +1,301 @@
package main
import (
"crypto/rand"
"crypto/subtle"
"encoding/hex"
"encoding/json"
"net/http"
"os"
"path/filepath"
"strings"
"sync"
"time"
)
// sessionCookie is the name of the gateway's session cookie. The browser sends
// it automatically on same-origin fetches AND on EventSource (SSE) connections —
// EventSource cannot set custom headers, so a cookie is the only way to
// authenticate the stream. It is HttpOnly so page JS can never read the token.
const sessionCookie = "unibus_session"
// server is the gateway's HTTP surface: a small REST/SSE API under /api gated by
// a session cookie, plus an optional static file server for the built SPA. The
// gateway's privileged operator identity never leaves the process; the browser
// authenticates with a passphrase and thereafter holds only an opaque session
// token.
type server struct {
gw *gateway
unlock string // passphrase that unlocks a session (compared in constant time)
webDir string // optional path to the built SPA (web/dist); empty = API only
mux *http.ServeMux
mu sync.Mutex
sessions map[string]time.Time // token -> issued-at
}
func newServer(gw *gateway, unlock, webDir string) *server {
s := &server{
gw: gw,
unlock: unlock,
webDir: webDir,
mux: http.NewServeMux(),
sessions: map[string]time.Time{},
}
s.routes()
return s
}
func (s *server) ServeHTTP(w http.ResponseWriter, r *http.Request) { s.mux.ServeHTTP(w, r) }
func (s *server) routes() {
// Liveness, unauthenticated (systemd / deploy smoke).
s.mux.HandleFunc("GET /healthz", func(w http.ResponseWriter, _ *http.Request) {
writeJSON(w, http.StatusOK, map[string]string{"status": "ok"})
})
// Auth: login is the only /api route reachable without a session.
s.mux.HandleFunc("POST /api/login", s.handleLogin)
s.mux.HandleFunc("POST /api/logout", s.auth(s.handleLogout))
s.mux.HandleFunc("GET /api/me", s.auth(s.handleMe))
s.mux.HandleFunc("GET /api/rooms", s.auth(s.handleListRooms))
s.mux.HandleFunc("POST /api/rooms", s.auth(s.handleCreateRoom))
s.mux.HandleFunc("POST /api/rooms/{id}/join", s.auth(s.handleJoin))
s.mux.HandleFunc("POST /api/rooms/{id}/send", s.auth(s.handleSend))
s.mux.HandleFunc("GET /api/rooms/{id}/stream", s.auth(s.handleStream))
// Everything else is the SPA (when --web-dir is set). Registered last.
if s.webDir != "" {
s.mux.Handle("/", s.spaHandler())
}
}
// ---- auth -----------------------------------------------------------------
// auth wraps a handler so it runs only with a valid session cookie. A missing or
// unknown token yields 401, which the SPA treats as "show the login screen".
func (s *server) auth(next http.HandlerFunc) http.HandlerFunc {
return func(w http.ResponseWriter, r *http.Request) {
c, err := r.Cookie(sessionCookie)
if err != nil || !s.validSession(c.Value) {
writeErr(w, http.StatusUnauthorized, "not authenticated")
return
}
next(w, r)
}
}
func (s *server) validSession(token string) bool {
if token == "" {
return false
}
s.mu.Lock()
defer s.mu.Unlock()
_, ok := s.sessions[token]
return ok
}
func (s *server) handleLogin(w http.ResponseWriter, r *http.Request) {
var req struct {
Passphrase string `json:"passphrase"`
}
if !decode(w, r, &req) {
return
}
// Constant-time compare so a wrong passphrase cannot be timed character by
// character. An empty configured passphrase never matches (main refuses to
// start without one, so this is defense in depth).
if s.unlock == "" || subtle.ConstantTimeCompare([]byte(req.Passphrase), []byte(s.unlock)) != 1 {
writeErr(w, http.StatusUnauthorized, "wrong passphrase")
return
}
tok := newToken()
s.mu.Lock()
s.sessions[tok] = time.Now()
s.mu.Unlock()
http.SetCookie(w, &http.Cookie{
Name: sessionCookie,
Value: tok,
Path: "/",
HttpOnly: true,
SameSite: http.SameSiteLaxMode,
})
writeJSON(w, http.StatusOK, s.gw.me())
}
func (s *server) handleLogout(w http.ResponseWriter, r *http.Request) {
if c, err := r.Cookie(sessionCookie); err == nil {
s.mu.Lock()
delete(s.sessions, c.Value)
s.mu.Unlock()
}
http.SetCookie(w, &http.Cookie{Name: sessionCookie, Value: "", Path: "/", MaxAge: -1, HttpOnly: true})
writeJSON(w, http.StatusOK, map[string]string{"status": "logged_out"})
}
func (s *server) handleMe(w http.ResponseWriter, _ *http.Request) {
writeJSON(w, http.StatusOK, s.gw.me())
}
// ---- rooms ----------------------------------------------------------------
func (s *server) handleListRooms(w http.ResponseWriter, _ *http.Request) {
rooms, err := s.gw.listRooms()
if err != nil {
writeErr(w, http.StatusBadGateway, err.Error())
return
}
writeJSON(w, http.StatusOK, rooms)
}
func (s *server) handleCreateRoom(w http.ResponseWriter, r *http.Request) {
var req createRoomReq
if !decode(w, r, &req) {
return
}
rv, err := s.gw.createRoom(req)
if err != nil {
writeErr(w, http.StatusBadGateway, err.Error())
return
}
writeJSON(w, http.StatusCreated, rv)
}
func (s *server) handleJoin(w http.ResponseWriter, r *http.Request) {
if err := s.gw.join(r.PathValue("id")); err != nil {
writeErr(w, http.StatusBadGateway, err.Error())
return
}
writeJSON(w, http.StatusOK, map[string]string{"status": "joined"})
}
func (s *server) handleSend(w http.ResponseWriter, r *http.Request) {
var req sendReq
if !decode(w, r, &req) {
return
}
if strings.TrimSpace(req.Body) == "" {
writeErr(w, http.StatusBadRequest, "body required")
return
}
if err := s.gw.send(r.PathValue("id"), req.Body); err != nil {
writeErr(w, http.StatusBadGateway, err.Error())
return
}
writeJSON(w, http.StatusOK, map[string]string{"status": "sent"})
}
// handleStream is the SSE endpoint: it joins the room, attaches to the room's
// fan-out hub, and streams each decrypted message as a `data:` event. For a
// persisted room the hub's underlying subscription delivers history first
// (scrollback) and then live messages; for an ephemeral room only live messages
// flow. The stream ends when the browser disconnects (ctx cancelled).
func (s *server) handleStream(w http.ResponseWriter, r *http.Request) {
flusher, ok := w.(http.Flusher)
if !ok {
writeErr(w, http.StatusInternalServerError, "streaming unsupported")
return
}
ch, cleanup, err := s.gw.openStream(r.PathValue("id"))
if err != nil {
writeErr(w, http.StatusBadGateway, err.Error())
return
}
defer cleanup()
w.Header().Set("Content-Type", "text/event-stream")
w.Header().Set("Cache-Control", "no-cache")
w.Header().Set("Connection", "keep-alive")
w.Header().Set("X-Accel-Buffering", "no") // disable proxy buffering (nginx/caddy)
w.WriteHeader(http.StatusOK)
// An initial comment opens the stream immediately so the browser's
// EventSource fires `onopen` without waiting for the first message.
_, _ = w.Write([]byte(": connected\n\n"))
flusher.Flush()
ctx := r.Context()
ping := time.NewTicker(25 * time.Second)
defer ping.Stop()
for {
select {
case <-ctx.Done():
return
case <-ping.C:
// Comment line keeps idle proxies from closing the connection.
if _, err := w.Write([]byte(": ping\n\n")); err != nil {
return
}
flusher.Flush()
case m := <-ch:
b, err := json.Marshal(m)
if err != nil {
continue
}
if _, err := w.Write([]byte("data: " + string(b) + "\n\n")); err != nil {
return
}
flusher.Flush()
}
}
}
// ---- SPA serving (optional) -----------------------------------------------
// spaHandler serves the built SPA from s.webDir. A request for an existing asset
// is served directly; any other path (a client-side route) falls back to
// index.html so the SPA router can take over. /api and /healthz are matched first.
func (s *server) spaHandler() http.Handler {
root := http.Dir(s.webDir)
fileServer := http.FileServer(root)
index := filepath.Join(s.webDir, "index.html")
return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
p := strings.TrimPrefix(r.URL.Path, "/")
if p == "" {
http.ServeFile(w, r, index)
return
}
if f, err := root.Open(p); err == nil {
_ = f.Close()
fileServer.ServeHTTP(w, r)
return
}
http.ServeFile(w, r, index) // unknown path -> SPA client-side routing
})
}
// ---- helpers --------------------------------------------------------------
func newToken() string {
b := make([]byte, 32)
_, _ = rand.Read(b)
return hex.EncodeToString(b)
}
func writeJSON(w http.ResponseWriter, code int, v any) {
w.Header().Set("Content-Type", "application/json")
w.WriteHeader(code)
_ = json.NewEncoder(w).Encode(v)
}
func writeErr(w http.ResponseWriter, code int, msg string) {
writeJSON(w, code, map[string]string{"error": msg})
}
// decode reads a JSON body into v, writing a 400 and returning false on failure.
func decode(w http.ResponseWriter, r *http.Request, v any) bool {
defer r.Body.Close()
if err := json.NewDecoder(http.MaxBytesReader(w, r.Body, 1<<20)).Decode(v); err != nil {
writeErr(w, http.StatusBadRequest, "bad json: "+err.Error())
return false
}
return true
}
// statFile reports whether path exists and is a regular file (used to validate
// --web-dir at startup so a typo surfaces as a clear log line, not 404s later).
func statFile(path string) bool {
fi, err := os.Stat(path)
return err == nil && !fi.IsDir()
}
+9 -1
View File
@@ -23,6 +23,7 @@ func main() {
ctrlURL = flag.String("ctrl-url", "http://127.0.0.1:8470", "membershipd control-plane url")
roomSub = flag.String("room", "proc.test.ticks", "room subject to publish to")
idFile = flag.String("id-file", "./local_files/worker.id", "identity file path")
caFile = flag.String("ca", "", "path to the bus CA cert (ca.crt); set to connect with TLS + nkey to a secured bus")
)
flag.Parse()
@@ -33,7 +34,7 @@ func main() {
if err != nil {
log.Fatalf("identity: %v", err)
}
c, err := client.New(*natsURL, *ctrlURL, id)
c, err := client.Connect(*natsURL, *ctrlURL, id, *caFile)
if err != nil {
log.Fatalf("connect: %v", err)
}
@@ -46,6 +47,13 @@ func main() {
if err != nil {
log.Fatalf("create room: %v", err)
}
// Membership-change contract (issue 0006e): the bus freezes per-subject
// permissions at connect time, and this room did not exist then. Refresh the
// session so the new room's subject becomes publishable under enforce+ACL. On
// an unsecured/dev bus this is a harmless reconnect.
if err := c.RefreshSession(); err != nil {
log.Fatalf("refresh session after create room: %v", err)
}
log.Printf("room %q -> %s (subject %s, cleartext)", *roomSub, roomID, *roomSub)
stop := make(chan os.Signal, 1)
+31
View File
@@ -65,3 +65,34 @@ curl -fsS http://<host-lan-ip>:8470/healthz
- To run against an external NATS instead of the embedded one, append
`--nats-url nats://<host>:4222` to `ExecStart` and re-run `daemon-reload` +
`restart`.
## Clustering (HA) — see `deploy/cluster/`
The single-node service above is secure on its own. Running unibus as a
multi-node **cluster** has extra hardening rules (issues 0006a0006f); the full
runbook and the generated material live in `deploy/cluster/`. Key points an
operator must know:
- **Homogeneous posture (0006d).** Every node MUST run `--bus-auth enforce` (the
binary refuses to join a cluster otherwise) and present mutual route TLS on a
public bind. `/healthz` publishes each node's `posture` so a monitor can flag a
node that is not `enforce`+`acl`+`tls`.
- **Separate route CA (0006f).** The cluster route layer authenticates *nodes*,
not bus users — sign the route certs with a **dedicated cluster CA**
(`--route-tls-ca`), NOT the client data-plane CA (`--tls-cert`'s CA). Keeping
the two trust roots separate means a client cert can never be presented to the
route port. `deploy/cluster/generate-cluster-certs.sh` builds this CA.
- **Secret out of argv (0006f).** Pass the route password via
`--cluster-pass-file` or the `UNIBUS_CLUSTER_PASS` env var, NOT `--cluster-pass`
or a `nats://user:pass@host` in `--routes` (both are visible in `ps`/journald).
When the secret comes from a file/env, list peers as bare `--routes
nats://<host>:6250` and the binary injects the credentials.
- **`migrate-to-kv` confidentiality (0006f).** The migration writes the allowlist
(handles/roles/sign pubs) into KV. Run it only against a **loopback** nats-url,
or pin TLS with `--ca` for a remote target — otherwise that metadata travels in
cleartext. The binary refuses a remote target without `--ca`.
- **R1 is NOT HA (0006a/N3-DoS).** With `--kv-replicas 1` the control plane
(including the nonce bucket) is a single point of failure: if the node owning
the stream dies, every authenticated request fails closed (auth DoS). Real HA
needs **R3** (quorum 2/3): raise replicas in place with `nats stream update
--replicas 3` once the third node has joined. Do not advertise R1 as HA.
+7
View File
@@ -0,0 +1,7 @@
# Generated TLS material and secrets — NEVER commit (audit 0008: keys/secret).
out/
build/
secrets/
*.key
*.srl
cluster-ca.crt
+285
View File
@@ -0,0 +1,285 @@
# unibus cluster — 3-node deploy runbook (issue 0006g)
This directory holds the material to bring up unibus as a **3-node cluster**
(`magnus` + `homer` + `datardos`) for real HA: with **R3** replication the control
plane (rooms/members/keys/users on JetStream KV + the anti-replay nonce bucket)
survives the loss of any one node (quorum 2/3).
> **Status: this cluster is DEPLOYED in production** (magnus + homer + datardos,
> R3, enforce+ACL+TLS) — see report 0011. The runbook below was authored before any
> VPS existed and has since been **corrected against the real deploy** (report 0012):
> the start ordering, the R1→R3 reality, and the live user-add path were all wrong
> or missing. Steps that change a remote host are marked **HUMAN**; `deploy-cluster.sh`
> still defaults to a dry run.
## Files
| File | What it is |
|---|---|
| `nodes.env` | Topology: cluster name, ports, and the per-node rows (name, ssh host, public IP, WG IP). **HUMAN fills the placeholders.** |
| `generate-cluster-certs.sh` | Mints a **separate cluster route CA** + a route cert per node, and a data-plane server cert per node signed by the **client CA** (`../tls/ca.*`). |
| `membershipd-cluster.service` | One systemd unit, parameterized per node by `/opt/unibus/cluster.env`. enforce + per-subject ACL + TLS + `--store kv`, `Restart=always`. |
| `deploy-cluster.sh` | Cross-builds the linux binary, generates each node's `cluster.env`, and (with `--yes`) rsyncs everything + installs the unit. Staggered start is manual. |
Generated keys/secrets (`out/`, `build/`, `secrets/`) are **gitignored** — they are
secret and never leave the operator's trusted machine except over the secure
rsync channel.
## Topology (as deployed, report 0011)
| Node | SSH | Public IP | Role |
|---|---|---|---|
| magnus | `magnus` (root) | `135.125.201.30` | node — **= organic-machine.com = `om`**, the critical host (caddy + gitea + registry-api + monitoring); the bus runs alongside, untouched |
| homer | `homer` (ubuntu+sudo) | `141.94.69.66` | node |
| datardos | `dd` (ubuntu+sudo) | `51.91.100.142` | node |
`ROUTE_NETWORK=public`, **not `wg`**: there is no WireGuard mesh between the three
nodes (homer and datardos do not even have the `wg` binary; om's only WG peers are
the operator's PCs). The server-to-server routes therefore travel over the public
IPs, protected by the **separate cluster route CA** (mutual route TLS) — a client
data-plane cert can never be presented to the route port. The client data plane and
the HTTP control plane are also reached over the public IPs. There is no fixed
"seed" node: with R3 the three are peers (see "Bring up" for why a lone node cannot
self-serve).
## Prerequisites (HUMAN, once)
1. **Fill `nodes.env`** — replace every `<PLACEHOLDER>` (magnus public IP, all WG
IPs). The scripts refuse to run while any remain.
2. **Client CA exists**`../tls/ca.crt` + `../tls/ca.key`. If not, run
`../tls/generate-certs.sh` on the CA host (om) first. The cluster reuses this CA
for the data plane so existing clients keep trusting the bus.
3. **Mint cluster TLS**:
```bash
./generate-cluster-certs.sh # writes out/<name>/ ; --force to rotate the cluster CA
```
4. **Create the route secret** (out of argv, shared by all nodes):
```bash
mkdir -p secrets && openssl rand -hex 32 > secrets/cluster.pass
```
5. **SSH** to each node's SSH host as `root` works (`ssh magnus true`, `ssh dd true`, ...).
## Stage the nodes
```bash
./deploy-cluster.sh # DRY RUN — prints the full plan, touches nothing
./deploy-cluster.sh --yes # HUMAN: actually rsync + install the unit on all 3 nodes
```
This cross-builds `membershipd` (linux/amd64, `CGO_ENABLED=0`), writes each node's
`cluster.env` (its `NODE_NAME` and the `--routes` to the OTHER two nodes), and
ships the binary, the node's TLS material, the secret, the env file and the unit.
It does **not** start anything.
## Seed the first admin into the KV (HUMAN — loopback bootstrap)
The empty KV control plane has no users, and under `enforce` no external tool can
write the FIRST admin over NATS (it would need to be an admin already — a
chicken-and-egg). The `user` CLI also writes only to a local SQLite file, not the
KV. So the first admin is seeded on the seed node through a **loopback, no-auth
bootstrap** that populates the same JetStream store the cluster unit then reuses:
```bash
ssh root@magnus 'bash -s' <<'SEED'
set -euo pipefail
cd /opt/unibus
# a) Put the first admin into a local SQLite seed file.
./membershipd user add --db ./seed.db --handle root --sign-pub <ADMIN_SIGN_PUB_HEX> --role admin
# b) Bring up a TEMPORARY loopback, no-auth, single-node KV server on the cluster's
# own JetStream store dir (not exposed; bus-auth off is allowed on 127.0.0.1).
./membershipd --store kv --bus-auth off --bind 127.0.0.1 \
--nats-store ./local_files/jetstream --db ./seed.db >/tmp/seed-boot.log 2>&1 &
BOOT=$!; sleep 2
# c) Migrate the admin from SQLite into the replicated KV (loopback — no --ca needed).
./membershipd migrate-to-kv --db ./seed.db --nats-url nats://127.0.0.1:4250 --replicas 1
# d) Stop the bootstrap server. The KV buckets persist in ./local_files/jetstream.
kill "$BOOT"; wait "$BOOT" 2>/dev/null || true
rm -f ./seed.db
SEED
```
> The KV written here lives in `./local_files/jetstream`, which the cluster unit
> reuses (`--nats-store` default), so the admin is present when the enforce cluster
> starts. This loopback bootstrap is needed ONLY for the very first admin (the
> chicken-and-egg). **Every user after that is added with the cluster live** — no
> stop-seed-restart — via `user add --store kv` (see "Add users to the live
> cluster" below, report 0012).
## Bring up (HUMAN)
> **CORRECTION (report 0012).** The original instruction — "start magnus alone and
> verify healthz, then add the others" — is **WRONG and will look like a hung
> deploy.** A 3-node JetStream cluster forms a RAFT meta-group that needs a quorum
> (2 of 3) to elect a leader. A single started node has no quorum, so its JetStream
> meta never becomes current: `--store kv` blocks creating the KV buckets and
> **`/healthz` never returns ok** until a second node joins. Waiting for magnus to
> "go green" before starting the others therefore deadlocks the rollout.
Start the nodes so a quorum forms. On a **clean cluster** the simplest correct
procedure is to start all three close together and let the meta-group converge:
```bash
# Start all three (order does not matter); each blocks on the others until a
# 2/3 quorum elects a JetStream meta leader, then the KV buckets are created.
for h in magnus homer datardos; do ssh "$h" 'sudo systemctl enable --now membershipd-cluster'; done
# Only NOW does healthz return ok — once the meta-group has a leader (give it
# ~10-30s on a cold start). Poll, do not assume the first node is broken.
for h in magnus homer datardos; do
echo "== $h =="; ssh "$h" 'curl -fsS https://127.0.0.1:8470/healthz --cacert /opt/unibus/tls/ca.crt || echo "(not ready yet — needs quorum)"'
done
```
A **staggered** start also works, but only because `membershipd`'s KV open RETRIES
the bucket creation for a 120s bootstrap budget (issue 0006g, fix #3): the first
node sits in that retry loop — NOT serving healthz — until the second node makes a
quorum, then both converge and the third catches up. Either way, a lone node never
self-serves; do not gate the next node's start on the previous one's healthz.
> A cold multi-node start only converges because of **three cold-start fixes**
> (report 0011): route pooling off (`PoolSize=-1`), `NoAdvertise=true` (Docker
> bridge IPs not gossiped), and the KV-open retry loop above. Without them the
> meta-group re-elects leaders forever and bucket creation hangs. If a fresh
> cluster will not form, confirm the running binary contains these fixes before
> touching config.
## Promote an existing single-node (SQLite) deployment (HUMAN, optional)
Instead of seeding fresh, you can migrate an existing single-node `unibus.db` into
the KV — **loopback only** (the allowlist would otherwise travel cleartext; the
command refuses a remote target without `--ca`). Use the same loopback-bootstrap
shape as the seed step (temporary `--bus-auth off` server on 127.0.0.1, then
`migrate-to-kv --db /opt/unibus/local_files/unibus.db`).
## Verify
```bash
# Posture on every node — all must be enforce+acl+tls+cluster, store=kv.
for h in magnus homer datardos; do
echo "== $h =="
ssh root@$h 'curl -fsS https://127.0.0.1:8470/healthz --cacert /opt/unibus/tls/ca.crt'
done
# Cluster + JetStream meta-group health (needs the `nats` CLI on a node):
ssh root@magnus 'nats --server nats://127.0.0.1:4250 server report jetstream'
ssh root@magnus 'nats --server nats://127.0.0.1:4250 server list' # 3 servers, routes up
```
A healthy cluster shows 3 routed servers and a JetStream meta-group with a leader.
## Add users to the live cluster (HUMAN — `user add --store kv`)
With the cluster up, add (and revoke) bus users **without stopping anything**,
directly against the replicated KV allowlist. This replaces the stop-seed-restart
procedure the original runbook implied for every user beyond the first admin.
The mechanism is the cluster's own **privileged internal connection**: under
`enforce` every bus user is confined by the per-subject ACL to its own rooms, so no
ordinary identity may write the control-plane buckets. The only identity the
authenticator grants full JetStream permissions is `membershipd`'s internal service
identity. The unit persists that identity to `${INTERNAL_ID_FILE}`
(`/opt/unibus/secrets/internal.id`, 0600) via `--internal-id-file`, so the same key
is available to the CLI. Run the CLI **on a node, over loopback** (the data-plane
TLS cert SAN covers `127.0.0.1`); reading the identity file requires root on that
node, which already implies full control of it, so this adds no practical exposure.
```bash
# Add a member to the live cluster's replicated allowlist (run on any node).
ssh root@magnus 'sudo /opt/unibus/membershipd user add --store kv \
--handle alice --role member --sign-pub <64-hex-ed25519-pub>'
# -> added user "alice" (...) role=member
# -> KV_UNIBUS_users: leader=<node> followers_current=2/2 msgs=N (replicated, HA)
# List / revoke against the same live KV:
ssh root@magnus 'sudo /opt/unibus/membershipd user list --store kv'
ssh root@magnus 'sudo /opt/unibus/membershipd user revoke --store kv <64-hex-ed25519-pub>'
```
Defaults assume an on-node invocation (`--nats-url nats://127.0.0.1:4250`,
`--internal-id-file /opt/unibus/secrets/internal.id`, `--ca /opt/unibus/tls/ca.crt`,
`--kv-replicas 3`). Semantics:
- **Idempotent / non-destructive**: re-adding the same key is an explicit
`already registered` error (exit 1), never a silent overwrite — a re-add cannot
flip a member to admin. To replace a user, `revoke` then add.
- **HA**: the write commits through the JetStream quorum, so it succeeds even with
one node down (2/3); the printed `followers_current` shows replication.
- **No hard delete**: `revoke` flips status to `revoked` (denied on both planes,
auditable); the KV has no row deletion, matching the SQLite store.
> **Rollout note (report 0012):** the live verification deployed this binary +
> `--internal-id-file` to **datardos only** (the non-critical node). magnus and
> homer still run the 0011 binary. To make the capability available (and the unit)
> on all three — recommended, the posture is identical so there is no urgency — roll
> the new binary with backups, one node at a time, verifying healthz between each:
> ```bash
> for h in homer magnus; do
> ssh "$h" 'sudo cp -a /opt/unibus/membershipd /opt/unibus/membershipd.bak' # backup
> scp build/membershipd "$h:/tmp/m" && ssh "$h" 'sudo install -o ubuntu -g ubuntu -m0775 /tmp/m /opt/unibus/membershipd'
> # add INTERNAL_ID_FILE=/opt/unibus/secrets/internal.id to /opt/unibus/cluster.env
> # add `--internal-id-file ${INTERNAL_ID_FILE} \` to the unit before `--store kv`
> ssh "$h" 'sudo systemctl daemon-reload && sudo systemctl restart membershipd-cluster'
> ssh "$h" 'curl -fsS https://127.0.0.1:8470/healthz --cacert /opt/unibus/tls/ca.crt' # green before next
> done
> ```
> (`deploy-cluster.sh` + the unit template already emit `INTERNAL_ID_FILE` and the
> flag, so a fresh `./deploy-cluster.sh --yes` is correct for all three.)
## Replication: go straight to R3 (HUMAN — real HA)
> **CORRECTION (report 0012).** The original "start at R1, then scale to R3" plan
> assumed R1 is a usable interim state. **It is not, in this cluster.** At R1 all six
> control-plane buckets (`KV_UNIBUS_users/rooms/members/room_keys/rooms_by_member`
> + `KV_UNIBUS_nonces`) live on a SINGLE node — a hard **SPOF for authentication**:
> if that node dies, the nonce/KV control plane is unreachable and EVERY
> authenticated request fails closed (auth DoS). Worse, the cold multi-node start
> only converges at all because of the three cold-start fixes (see "Bring up"); the
> real deploy never ran a healthy R1 and **jumped straight to R3 once the cluster
> formed.** Treat R1 as a transient artifact of bucket creation, not a milestone.
The deployed config already sets `KV_REPLICAS=3` in `nodes.env`. If buckets were
created at R1 (e.g. only one node was up when `--store kv` first opened them), raise
every control-plane stream to R3 IN PLACE (no data loss) once all three nodes are
routed:
```bash
for s in KV_UNIBUS_users KV_UNIBUS_rooms KV_UNIBUS_members KV_UNIBUS_room_keys \
KV_UNIBUS_rooms_by_member KV_UNIBUS_nonces; do
ssh root@magnus "nats --server nats://127.0.0.1:4250 stream update $s --replicas 3 -f"
done
# (also OBJ_UNIBUS_blobs if the object store is in use)
```
After this each bucket shows `followers_current=2/2` (quorum 2/3). The
`user add --store kv` command prints that figure for `KV_UNIBUS_users` on every add,
which is a cheap live HA check.
## Chaos test (HUMAN — requires the 3 live VPS)
Validate quorum tolerance after R3:
```bash
# Kill one node; the cluster keeps serving (quorum 2/3). On ubuntu nodes use sudo.
ssh dd 'sudo systemctl stop membershipd-cluster'
# -> clients fail over (multiple seed URLs); reads/writes still succeed.
ssh dd 'sudo systemctl start membershipd-cluster' # rejoins, catches up
# Kill two nodes; quorum is LOST — the control plane should fail CLOSED (deny),
# never fail open. Verify a request is rejected, not silently served.
```
> **Validated (report 0012).** The 0011 chaos run checked only the control plane
> (healthz + meta/stream-leader failover + KV readable with 2/3). Report 0012 added
> the missing data-plane proofs against the live cluster: a real authenticated
> client (`cmd/clientcheck`, operator identity, nkey+TLS) creating an E2E room and
> publishing/subscribing — including a node stopped mid-stream, where the client
> failed over to a survivor and kept receiving with zero loss (quorum 2/3) — and
> `user add --store kv` committing with one node (the KV leader) down. The kill-2/3
> fail-closed case remains a documented manual step.
## Rollback
`membershipd` does not delete data. To revert a node to standalone SQLite, stop
the unit and start it without `--store kv`/`--cluster-name`; the KV buckets remain
for a later retry. To rotate the cluster CA, re-run `generate-cluster-certs.sh
--force` and re-stage (every node must get the new `cluster-ca.crt` together).
+130
View File
@@ -0,0 +1,130 @@
#!/usr/bin/env bash
#
# deploy-cluster.sh — cross-build membershipd and stage it onto the three cluster
# nodes (issue 0006g). DEFAULT IS DRY-RUN: it prints the plan and touches nothing.
# Pass --yes to actually rsync + run remote commands. Steps that a HUMAN must run
# (or confirm) are marked "HUMAN:".
#
# Prerequisites (HUMAN, once):
# 1. Fill nodes.env (no <PLACEHOLDER> left).
# 2. ./generate-cluster-certs.sh (mints out/<name>/ TLS material)
# 3. Create the route secret locally: mkdir -p secrets && openssl rand -hex 32 > secrets/cluster.pass
# (secrets/ is gitignored; it is rsynced to each node as cluster.pass)
# 4. SSH access to every node's SSH_HOST with sudo-less root (SSH_USER=root).
#
# What it does per node (with --yes):
# - rsync the membershipd binary, the node's TLS material, the unit, the
# generated cluster.env and the route secret into REMOTE_DIR.
# - install + daemon-reload the systemd unit.
# Start is STAGGERED and left to the human (see README): start the seed node,
# seed the admin, then start the rest.
set -euo pipefail
DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
cd "$DIR"
# shellcheck source=/dev/null
source ./nodes.env
APPLY=0
[[ "${1:-}" == "--yes" ]] && APPLY=1
if grep -q '<[A-Z_]\+>' nodes.env; then
echo "ERROR: nodes.env still has <PLACEHOLDER> values — fill them in first." >&2
exit 2
fi
SECRET_FILE="secrets/cluster.pass"
if [[ ! -f "$SECRET_FILE" ]]; then
echo "ERROR: $SECRET_FILE missing. HUMAN: mkdir -p secrets && openssl rand -hex 32 > $SECRET_FILE" >&2
exit 2
fi
run() {
# Echo every action; only execute it under --yes.
echo " + $*"
if [[ $APPLY -eq 1 ]]; then
"$@"
fi
}
echo "==> [1/3] cross-build membershipd (linux/amd64, CGO disabled)"
run env CGO_ENABLED=0 GOOS=linux GOARCH=amd64 go build -o build/membershipd ../../cmd/membershipd
# Build the comma-separated route list for a node = the OTHER nodes' addresses on
# the chosen network, with NO userinfo (the secret is injected by membershipd from
# the file). Echoes nothing; prints the value.
routes_for() {
local self="$1" out=""
local row name _ssh pub wg addr
for row in "${CLUSTER_NODES[@]}"; do
read -r name _ssh pub wg <<<"$row"
[[ "$name" == "$self" ]] && continue
if [[ "$ROUTE_NETWORK" == "public" ]]; then addr="$pub"; else addr="$wg"; fi
out+="nats://${addr}:${NATS_ROUTE_PORT},"
done
echo "${out%,}"
}
echo "==> [2/3] stage each node (REMOTE_DIR=$REMOTE_DIR)"
for row in "${CLUSTER_NODES[@]}"; do
read -r name ssh _pub _wg <<<"$row"
target="${SSH_USER}@${ssh}"
nodedir="out/${name}"
if [[ ! -d "$nodedir" ]]; then
echo "ERROR: $nodedir missing — run ./generate-cluster-certs.sh first." >&2
exit 2
fi
routes="$(routes_for "$name")"
echo "-- node ${name} (ssh ${ssh}) routes=${routes}"
# Generate this node's cluster.env locally, then ship it.
envfile="build/cluster-${name}.env"
mkdir -p build
cat > "$envfile" <<EOF
NODE_NAME=${name}
CLUSTER_NAME=${CLUSTER_NAME}
CLUSTER_USER=${CLUSTER_USER}
KV_REPLICAS=${KV_REPLICAS}
HTTP_PORT=${HTTP_PORT}
NATS_CLIENT_PORT=${NATS_CLIENT_PORT}
NATS_ROUTE_PORT=${NATS_ROUTE_PORT}
ROUTES=${routes}
CLUSTER_PASS_FILE=${REMOTE_DIR}/secrets/cluster.pass
TLS_CERT=${REMOTE_DIR}/tls/server-${name}.crt
TLS_KEY=${REMOTE_DIR}/tls/server-${name}.key
ROUTE_TLS_CERT=${REMOTE_DIR}/tls/route-${name}.crt
ROUTE_TLS_KEY=${REMOTE_DIR}/tls/route-${name}.key
ROUTE_TLS_CA=${REMOTE_DIR}/tls/cluster-ca.crt
INTERNAL_ID_FILE=${REMOTE_DIR}/secrets/internal.id
EOF
run ssh "$target" "mkdir -p ${REMOTE_DIR}/tls ${REMOTE_DIR}/secrets"
run rsync -az build/membershipd "${target}:${REMOTE_DIR}/membershipd"
run rsync -az "${nodedir}/" "${target}:${REMOTE_DIR}/tls/"
run rsync -az "$SECRET_FILE" "${target}:${REMOTE_DIR}/secrets/cluster.pass"
run rsync -az "$envfile" "${target}:${REMOTE_DIR}/cluster.env"
run rsync -az membershipd-cluster.service "${target}:/etc/systemd/system/membershipd-cluster.service"
run ssh "$target" "chmod 600 ${REMOTE_DIR}/secrets/cluster.pass ${REMOTE_DIR}/tls/*.key && systemctl daemon-reload"
done
echo "==> [3/3] staged."
if [[ $APPLY -eq 0 ]]; then
echo " DRY-RUN: nothing was sent. Re-run with --yes to apply."
fi
cat <<'NEXT'
HUMAN — bring up (see README "Bring up" — a LONE node has no quorum and never
serves healthz, so do NOT gate the next node on the previous one going green):
1. Seed the FIRST admin into the KV via the loopback bootstrap (README
"Seed the first admin"); this is needed only for the chicken-and-egg admin.
2. Start all three so a 2/3 quorum forms (order does not matter); healthz
turns ok only once the meta-group elects a leader (~10-30s cold):
for h in magnus homer datardos; do ssh "$h" 'sudo systemctl enable --now membershipd-cluster'; done
3. Verify posture + quorum (README "Verify").
4. Ensure R3 on every control-plane stream (README "Replication: go straight to
R3"); R1 is a SPOF, not a milestone.
5. Add further users with the cluster LIVE — no restart — via
`membershipd user add --store kv` (README "Add users to the live cluster").
NEXT
+120
View File
@@ -0,0 +1,120 @@
#!/usr/bin/env bash
#
# generate-cluster-certs.sh — mint the TLS material for a unibus 3-node cluster
# (issue 0006g). Run ONCE on a trusted machine (e.g. om, which custodies the bus
# CA); distribute the per-node output to each node over a secure channel. This
# script touches NO remote host.
#
# It produces two trust roots, kept SEPARATE on purpose (audit 0008 N1-low):
#
# 1. The CLUSTER route CA (cluster-ca.crt/key, generated here): signs each
# node's ROUTE certificate. The route layer authenticates NODES, not bus
# users, so it must NOT share the client data-plane CA — a client cert can
# then never be presented to the route port.
# 2. The CLIENT data-plane CA (../tls/ca.crt/key, the one clients pin): signs
# each node's DATA-PLANE server certificate. Reused, not regenerated, so
# existing clients keep trusting the bus.
#
# Per node it emits, under out/<name>/:
# route-<name>.crt/key route cert (cluster CA), EKU server+clientAuth
# (each node is BOTH server and dialer to its peers)
# server-<name>.crt/key data-plane cert (client CA), EKU serverAuth
# cluster-ca.crt the route CA cert (for --route-tls-ca)
# ca.crt the client CA cert (for clients / control-plane TLS)
#
# SANs per node = its public IP + its WireGuard IP + its hostname + localhost.
#
# Key material: EC P-256 (Go crypto/tls + nats-server friendly), matching
# ../tls/generate-certs.sh.
set -euo pipefail
DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
cd "$DIR"
# shellcheck source=/dev/null
source ./nodes.env
# Refuse to run while any placeholder remains (HUMAN must fill nodes.env first).
if grep -q '<[A-Z_]\+>' nodes.env; then
echo "ERROR: nodes.env still has <PLACEHOLDER> values — fill them in first." >&2
grep -n '<[A-Z_]\+>' nodes.env >&2
exit 2
fi
CLIENT_CA_CRT="../tls/ca.crt"
CLIENT_CA_KEY="../tls/ca.key"
if [[ ! -f "$CLIENT_CA_CRT" || ! -f "$CLIENT_CA_KEY" ]]; then
echo "ERROR: client data-plane CA not found at ../tls/ca.{crt,key}." >&2
echo " Run ../tls/generate-certs.sh first (it mints the client CA)." >&2
exit 2
fi
DAYS_CA=3650
DAYS_CRT=825
force=0
[[ "${1:-}" == "--force" ]] && force=1
# --- cluster route CA (separate trust root) ---
if [[ ! -f cluster-ca.crt || ! -f cluster-ca.key || $force -eq 1 ]]; then
echo "==> generating cluster route CA (separate from the client CA)"
openssl ecparam -name prime256v1 -genkey -noout -out cluster-ca.key
chmod 600 cluster-ca.key
openssl req -x509 -new -key cluster-ca.key -sha256 -days "$DAYS_CA" \
-subj "/CN=unibus-cluster-ca" -out cluster-ca.crt
else
echo "==> reusing existing cluster route CA (pass --force to regenerate)"
fi
# mint <out_key> <out_crt> <subject_cn> <san> <eku> <ca_crt> <ca_key>
mint_cert() {
local out_key="$1" out_crt="$2" cn="$3" san="$4" eku="$5" ca_crt="$6" ca_key="$7"
local csr ext
csr="$(mktemp)"
ext="$(mktemp)"
openssl ecparam -name prime256v1 -genkey -noout -out "$out_key"
chmod 600 "$out_key"
openssl req -new -key "$out_key" -subj "/CN=${cn}" -out "$csr"
cat > "$ext" <<EOF
subjectAltName=${san}
extendedKeyUsage=${eku}
keyUsage=digitalSignature,keyEncipherment
EOF
openssl x509 -req -in "$csr" -CA "$ca_crt" -CAkey "$ca_key" -CAcreateserial \
-sha256 -days "$DAYS_CRT" -extfile "$ext" -out "$out_crt"
rm -f "$csr" "$ext"
}
for row in "${CLUSTER_NODES[@]}"; do
read -r name _ssh pub wg <<<"$row"
echo "==> node ${name}: SAN IP:${pub}, IP:${wg}, DNS:${name}, localhost, 127.0.0.1"
nodedir="out/${name}"
mkdir -p "$nodedir"
san="IP:${pub},IP:${wg},DNS:${name},DNS:localhost,IP:127.0.0.1"
# Route cert: signed by the cluster CA; server+client auth (mutual routes).
mint_cert "${nodedir}/route-${name}.key" "${nodedir}/route-${name}.crt" \
"unibus-route-${name}" "$san" "serverAuth,clientAuth" \
cluster-ca.crt cluster-ca.key
# Data-plane server cert: signed by the client CA; serverAuth only.
mint_cert "${nodedir}/server-${name}.key" "${nodedir}/server-${name}.crt" \
"unibus-${name}" "$san" "serverAuth" \
"$CLIENT_CA_CRT" "$CLIENT_CA_KEY"
# Co-locate the two CA certs each node needs.
cp cluster-ca.crt "${nodedir}/cluster-ca.crt"
cp "$CLIENT_CA_CRT" "${nodedir}/ca.crt"
done
rm -f cluster-ca.srl ../tls/ca.srl 2>/dev/null || true
echo
echo "==> done. Per-node material under out/<name>/ (KEYS ARE SECRET — never git):"
for row in "${CLUSTER_NODES[@]}"; do
read -r name _rest <<<"$row"
echo " out/${name}/ (route-${name}.*, server-${name}.*, cluster-ca.crt, ca.crt)"
done
echo
echo "verify a SAN with:"
echo " openssl x509 -in out/<name>/server-<name>.crt -noout -text | grep -A1 'Subject Alternative Name'"
@@ -0,0 +1,46 @@
[Unit]
# unibus membershipd — cluster node (issue 0006g).
#
# One unit, parameterized per node by /opt/unibus/cluster.env (generated by
# deploy-cluster.sh): NODE_NAME, ROUTES and the cert paths differ per node, the
# rest of the posture (enforce + per-subject ACL + TLS + --store kv) is identical
# on every node, which is the homogeneous posture a secure cluster requires
# (audit 0008 N1).
Description=unibus membershipd (cluster node)
After=network-online.target
Wants=network-online.target
[Service]
Type=simple
WorkingDirectory=/opt/unibus
EnvironmentFile=/opt/unibus/cluster.env
# The route password comes from a FILE referenced by ${CLUSTER_PASS_FILE}, never
# from argv (audit 0008 N1-low). The peer --routes carry no userinfo; membershipd
# injects the credentials from the file/user.
ExecStart=/opt/unibus/membershipd \
--bind 0.0.0.0 \
--bus-auth enforce \
--http-port ${HTTP_PORT} \
--nats-port ${NATS_CLIENT_PORT} \
--tls-cert ${TLS_CERT} \
--tls-key ${TLS_KEY} \
--cluster-name ${CLUSTER_NAME} \
--server-name ${NODE_NAME} \
--cluster-port ${NATS_ROUTE_PORT} \
--routes ${ROUTES} \
--cluster-user ${CLUSTER_USER} \
--cluster-pass-file ${CLUSTER_PASS_FILE} \
--route-tls-cert ${ROUTE_TLS_CERT} \
--route-tls-key ${ROUTE_TLS_KEY} \
--route-tls-ca ${ROUTE_TLS_CA} \
--internal-id-file ${INTERNAL_ID_FILE} \
--store kv \
--kv-replicas ${KV_REPLICAS}
# Restart=always (NOT on-failure): a clean SIGTERM exits success, and on-failure
# would then NOT restart, leaving the node silently dead (see function_tags.md).
Restart=always
RestartSec=2
LimitNOFILE=65536
[Install]
WantedBy=multi-user.target
+57
View File
@@ -0,0 +1,57 @@
# Cluster topology for the unibus 3-node deployment (issue 0006g).
#
# This file is SOURCED by generate-cluster-certs.sh and deploy-cluster.sh.
#
# HUMAN: fill in every placeholder with the real value before running the
# scripts. The public IPs known at authoring time are pre-filled; the WireGuard
# mesh IPs and magnus's public IP must be supplied. The scripts refuse to run
# while any unfilled placeholder remains.
# Cluster identity (must be identical on every node).
CLUSTER_NAME="unibus"
# Route-secret username; the password is NOT here — it lives in a file (see
# CLUSTER_PASS_FILE in deploy-cluster.sh) so it never lands in argv or git.
CLUSTER_USER="unibus-cluster"
# KV/nonce replication factor. START AT 1 for the initial 1->3 rollout, then raise
# to 3 IN PLACE (see README "Scale to R3") once all three nodes have joined. Only
# set this to 3 here after the third node is up and you re-run the KV update.
KV_REPLICAS=3
# Ports (same on every node; the route port is server-to-server only).
NATS_CLIENT_PORT=4250
NATS_ROUTE_PORT=6250
HTTP_PORT=8470
# Remote install layout and SSH login user.
REMOTE_DIR="/opt/unibus"
SSH_USER="root"
# Which address family the inter-node routes use. "wg" builds --routes from the
# WireGuard mesh IPs (private server-to-server links, preferred); "public" uses
# the public IPs. The route layer is always mutual-TLS regardless.
#
# DEPLOY DECISION (2026-06-07): set to "public". No WireGuard mesh exists between
# the three cluster nodes — homer and datardos do not even have the `wg` binary
# installed, and om's only WG peers are the operator's personal PCs, not the VPS.
# Rather than stand up a fresh mesh blindly, the routes go over the public IPs,
# still protected by the separate cluster route CA (mutual-TLS). On magnus (the
# only node with ufw active) the route port 6250 is restricted to the homer and
# datardos public IPs; homer/datardos run ufw inactive (Docker hosts) and rely on
# the route mutual-TLS for 6250.
ROUTE_NETWORK="public"
# One row per node: NAME SSH_HOST PUBLIC_IP WG_IP
# NAME -> --server-name and the per-node cert filenames (unique).
# SSH_HOST -> the `ssh ALIAS` alias (see ~/.ssh/config).
# PUBLIC_IP -> public address; goes in the cert SANs (client-facing data plane).
# WG_IP -> WireGuard mesh address; cert SAN + route target when ROUTE_NETWORK=wg.
# NOTE: with ROUTE_NETWORK=public and no WireGuard mesh, the WG_IP column is set to
# each node's public IP so the cert SAN covers the address actually used by the
# public routes and no unfilled placeholder remains (scripts refuse to run otherwise).
# magnus == organic-machine.com == om (135.125.201.30); SSH alias `magnus` enters as root.
CLUSTER_NODES=(
"magnus magnus 135.125.201.30 135.125.201.30"
"homer homer 141.94.69.66 141.94.69.66"
"datardos dd 51.91.100.142 51.91.100.142"
)
+6
View File
@@ -0,0 +1,6 @@
# Private keys and the deploy-specific server certificate never go to git.
# Only the public CA certificate (ca.crt) is versioned, because clients embed it.
*.key
*.csr
*.srl
server.crt
+56
View File
@@ -0,0 +1,56 @@
# Bus TLS — self-signed CA and server certificate
The unibus data plane (NATS) is encrypted with TLS using the project's own
self-signed CA. The bus is exposed publicly, protected by auth + TLS, so the CA
is private (not Let's Encrypt) and every client we control embeds the public
`ca.crt`; the server presents `server.crt`/`server.key`.
## Files
| File | Secret? | Goes where |
|---|---|---|
| `ca.crt` | no (public) | versioned in git; embedded/distributed to every client |
| `ca.key` | **yes** | stays on the machine that mints certs; gitignored |
| `server.crt` | no | deployed to the bus host; gitignored (deploy-specific SANs) |
| `server.key` | **yes** | deployed to the bus host over a secure channel; gitignored |
Only `ca.crt` is committed. `ca.key`, `server.key`, `server.crt`, and any
`*.csr`/`*.srl` are gitignored — see `.gitignore`.
## Generate
```bash
cd deploy/tls
./generate-certs.sh # CA (if missing) + server cert with default SANs
./generate-certs.sh --force # also regenerate the CA (invalidates pinned clients)
```
The server certificate's SANs cover the public IP, the WireGuard IP, the om
hostname, plus `localhost`/`127.0.0.1` for on-host smoke tests. Override the
defaults via environment variables:
```bash
UNIBUS_PUBLIC_IP=135.125.201.30 UNIBUS_WG_IP=10.42.0.1 UNIBUS_HOSTNAME=om ./generate-certs.sh
```
Verify the SANs:
```bash
openssl x509 -in server.crt -noout -text | grep -A1 'Subject Alternative Name'
```
## Use
- **Server** (`membershipd`, phase 0001e): point it at `server.crt`/`server.key`
so the embedded NATS presents the certificate and requires TLS. Built with
`busauth.ServerTLSConfig(certPath, keyPath)`.
- **Clients** (Go peers, mobile binding, gateway): pin `ca.crt` with
`busauth.LoadCATLSConfig(caPath)` and pass the result as `client.Options.TLS`.
## Rotation
The CA is long-lived (10 years). Rotate the server certificate (825 days) by
re-running `generate-certs.sh` (without `--force`) and redeploying
`server.crt`/`server.key`; clients are unaffected because they pin the CA, not
the server cert. Rotating the CA (`--force`) requires redistributing `ca.crt` to
every client.
+11
View File
@@ -0,0 +1,11 @@
-----BEGIN CERTIFICATE-----
MIIBfTCCASOgAwIBAgIUW2HZJDDlixxw/DgNP/IDIrJ7MeMwCgYIKoZIzj0EAwIw
FDESMBAGA1UEAwwJdW5pYnVzLWNhMB4XDTI2MDYwNzEwNDIyNloXDTM2MDYwNDEw
NDIyNlowFDESMBAGA1UEAwwJdW5pYnVzLWNhMFkwEwYHKoZIzj0CAQYIKoZIzj0D
AQcDQgAEe2by5l9dcEbqKB11yJtPIH9S/01XNhuFnBB/IpDevO2fWLLV+muqoB8C
ADH1wKleq8jF5D0sSlK2DCuYrjAjPqNTMFEwHQYDVR0OBBYEFABX+UI7bXICRF4l
WmmDR/rUtxnrMB8GA1UdIwQYMBaAFABX+UI7bXICRF4lWmmDR/rUtxnrMA8GA1Ud
EwEB/wQFMAMBAf8wCgYIKoZIzj0EAwIDSAAwRQIgCAeOYTKvA6SBB8xMdMdqNrp1
20OPyi2BwFovW6vTCLMCIQC1qRi8SGRHTui8BVqIvp/DFJaZ/U8ocAg/qedLdy+R
/w==
-----END CERTIFICATE-----
+64
View File
@@ -0,0 +1,64 @@
#!/usr/bin/env bash
#
# generate-certs.sh — mint the unibus bus's self-signed CA and the NATS server
# certificate. Run once on a trusted machine; distribute ca.crt to clients and
# server.crt/server.key to the bus host (server.key by a secure channel, never
# git). Re-running regenerates the server cert; pass --force to also regenerate
# the CA (which invalidates every client that pinned the old ca.crt).
#
# SANs cover the public IP, the WireGuard IP, the om hostname, plus localhost so
# the operator can smoke-test the TLS handshake on the box. Override via env:
# UNIBUS_PUBLIC_IP (default 135.125.201.30)
# UNIBUS_WG_IP (default 10.42.0.1)
# UNIBUS_HOSTNAME (default om)
#
# Key material: EC P-256 (widely supported by Go's crypto/tls and nats-server).
set -euo pipefail
DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
cd "$DIR"
PUBLIC_IP="${UNIBUS_PUBLIC_IP:-135.125.201.30}"
WG_IP="${UNIBUS_WG_IP:-10.42.0.1}"
HOSTNAME_OM="${UNIBUS_HOSTNAME:-om}"
DAYS_CA=3650
DAYS_SRV=825
force=0
[[ "${1:-}" == "--force" ]] && force=1
# --- CA (long-lived; only the cert is public) ---
if [[ ! -f ca.crt || ! -f ca.key || $force -eq 1 ]]; then
echo "==> generating CA"
openssl ecparam -name prime256v1 -genkey -noout -out ca.key
chmod 600 ca.key
openssl req -x509 -new -key ca.key -sha256 -days "$DAYS_CA" \
-subj "/CN=unibus-ca" -out ca.crt
else
echo "==> reusing existing CA (pass --force to regenerate)"
fi
# --- server certificate, signed by the CA, with the bus SANs ---
echo "==> generating server certificate (SAN: $PUBLIC_IP, $WG_IP, $HOSTNAME_OM, localhost, 127.0.0.1)"
openssl ecparam -name prime256v1 -genkey -noout -out server.key
chmod 600 server.key
openssl req -new -key server.key -subj "/CN=unibus-bus" -out server.csr
cat > server.ext <<EOF
subjectAltName=IP:${PUBLIC_IP},IP:${WG_IP},DNS:${HOSTNAME_OM},DNS:localhost,IP:127.0.0.1
extendedKeyUsage=serverAuth
keyUsage=digitalSignature,keyEncipherment
EOF
openssl x509 -req -in server.csr -CA ca.crt -CAkey ca.key -CAcreateserial \
-sha256 -days "$DAYS_SRV" -extfile server.ext -out server.crt
rm -f server.csr server.ext ca.srl
echo "==> done:"
echo " ca.crt -> embed/distribute to every client (public)"
echo " server.crt -> deploy to the bus host"
echo " server.key -> deploy to the bus host over a secure channel (NEVER git)"
echo
echo "verify SANs with:"
echo " openssl x509 -in server.crt -noout -text | grep -A1 'Subject Alternative Name'"
+55
View File
@@ -0,0 +1,55 @@
# Issue 0001e — remaining client migrations (notes, NOT implemented)
Phase 0001e migrated the first-class Go clients and the mobile binding to the
secure connection path (`client.Connect(caPath)` → TLS + nkey; control-plane
requests are always signed). Two consumers are intentionally **left as notes**
because they live outside this sub-repo or need their own coordination:
## 1. Web gateway (`playground/server.go`)
The playground is a local dev gateway that embeds its own membershipd
(`membership.NewServer(..., AuthOff)`) and an open embedded NATS, and connects
browser sessions through an in-process client. To run it against a **secured**
bus it would need:
- Connect its internal client via `client.Connect(natsURL, ctrlURL, id, caPath)`
with the bundled `ca.crt` (it currently builds the client without options).
- If it should itself enforce auth on the browser-facing side, start its
embedded membershipd with an auth mode and its embedded NATS with
`embeddednats.StartServer(ServerConfig{Auth: ..., TLS: ...})` — but a local
dev gateway typically stays open and only the *upstream* bus is secured.
- The gateway's own bus identity must be registered in the upstream allowlist
(`membershipd user add`).
Decision: left at `AuthOff` + plaintext for now (local dev tool). Migrate when
the gateway is pointed at the public bus.
## 2. unibots (`shell/transportunibus`, in the agents repo — NOT this sub-repo)
The bot transport lives in the `agents_and_robots` / message_bus consumer, not
in `dataforge/unibus`. To talk to the secured bus it must, after recompiling
against this `pkg/client`:
- Switch its connect call to `client.Connect(natsURL, ctrlURL, id, caPath)`,
passing the path to the bundled `ca.crt`.
- Ship `ca.crt` alongside the bot binary (read-only) and point `caPath` at it.
- Register each bot's identity (`hex(SignPub)`) in the bus allowlist via
`membershipd user add --handle <bot> --sign-pub <hex>` on the bus host.
- Run as `systemd --user` with `caPath` set, per the deploy plan (0001f).
No code change is possible from this sub-repo; this is the contract the bot
transport consumes.
## Server enablement (operator, phase 0001f)
`membershipd` now accepts:
- `--bus-auth enforce` — verify signed control-plane requests AND turn on the
NATS nkey authenticator (only allowlisted identities connect).
- `--tls-cert deploy/tls/server.crt --tls-key deploy/tls/server.key` — present
the server certificate and require TLS on the embedded NATS.
`dev/feature_flags.json` now declares both `bus-auth: enforce` and
`bus-tls: enabled` as the project's target state. The flags are declarative;
the operator activates them at deploy time with the flags above. The CLI
defaults remain off so local dev and the test suite are unaffected.
+80
View File
@@ -0,0 +1,80 @@
# 0004d — Data-plane access control on NATS (audit H4)
## The finding
The NATS authenticator (`pkg/busauth`) decides one thing per connection:
*is this identity registered on the bus?* It does **not** scope what a connected
client may subscribe to or publish. There is a single NATS account with no
`Permissions`, so any registered peer can subscribe to, or publish on, **any**
subject. Concretely:
- A cleartext room (`ModeNATS`) carries its payload in the clear on its subject.
A registered peer that knows or guesses the subject subscribes and reads the
content directly (the auditor's `TestAudit_NoSubjectACL`: eve, never invited,
receives `"internal: salary numbers"`).
- An encrypted room (`ModeMatrix`) keeps its **content** confidential (the
payload is AEAD ciphertext), but the **metadata of traffic** — that a subject
is active, message sizes and timing, who is publishing — is still observable by
any registered peer that subscribes to the subject.
## Why the "complete" fix does not fit here
The preferred fix is per-subject permissions derived from room membership: when a
client connects, the authenticator looks up the rooms it belongs to and grants
`Sub`/`Pub` only on those subjects. NATS supports this — `CustomClientAuthentication`
can register a `*server.User` carrying `Permissions`.
The blocker is that **NATS evaluates permissions once, at connect time, and never
re-evaluates them on a live connection.** unibus clients routinely *connect → create
or get invited to a room → publish/subscribe* within the **same** connection
(`TestSecureBusEndToEnd` does exactly this: A connects, then creates `room.secure`,
then publishes to it). Permissions frozen at connect time would not include a room
created or joined afterwards, so the legitimate owner could not publish to the room
it just made. Making per-subject ACLs work would therefore require the client to
**reconnect on every membership change**, an invasive change to the client library
and to every peer (worker, chat, mobile) — and the prompt for this issue scopes the
client changes to the minimum.
That dynamic-membership reconnection model is precisely the redesign that issue
**0003** (decentralization) already has to do: it moves the control-plane state to a
replicated JetStream KV and reworks how nodes and clients (re)establish sessions. Per
the issue's own guidance ("if a complete strategy does not fit, implement the minimum
defense and document the rest"), the full subject ACL is deferred to 0003, where the
session/permission model is being rebuilt anyway.
## The strategy implemented here: forbid cleartext rooms in public
`Server.RequireEncryptedRooms` (set by `membershipd` on any non-loopback bind)
refuses to create a cleartext (`ModeNATS`) room. Every room on a public deployment
is therefore end-to-end encrypted, so **message content stays confidential even
though the transport offers no subject isolation**: a peer that sniffs another
room's subject receives only AEAD ciphertext it has no key for.
This composes with the 0004c control-plane authorization: a non-member cannot even
learn a room's subject through the control plane (`GET /rooms/{id}` → 403), so to
sniff it an attacker must already know or guess the subject out of band.
## What this does NOT close (residual exposure, by design)
- **Traffic metadata.** A registered peer that already knows a subject can still
subscribe and observe that the subject is active, the ciphertext sizes, and the
timing/cadence of messages. It cannot read content.
- **Cross-room publish.** A registered peer can still *publish* arbitrary bytes on
any subject. In an encrypted room those bytes fail AEAD open and the signature
check (`SignMsgs`), so receivers drop them — it is a nuisance/spam vector, not a
confidentiality or integrity break.
- **WireGuard-only deployments** may still use cleartext rooms (the guard only trips
on a public bind), because the network already restricts who can reach the bus.
Closing the residual metadata exposure requires the per-subject ACL described above,
tracked for issue 0003.
## Regression evidence
- `pkg/membership``TestRequireEncryptedRoomsRejectsCleartext`: with
`RequireEncryptedRooms` on, `POST /rooms` for a cleartext policy returns 403 while
an encrypted-room create returns 201.
- `pkg/client``TestAudit_NoSubjectACL`: under the public posture, creating a
`ModeNATS` room fails; alice creates an encrypted room and publishes; eve (a
registered non-member) raw-subscribes to the subject and receives only ciphertext —
she never recovers the plaintext.
+26
View File
@@ -0,0 +1,26 @@
{
"flags": {
"bus-auth": {
"enabled": true,
"state": "enforce",
"issue": "0001",
"description": "Signed control-plane auth + NATS nkey auth. Rollout: off -> soft (verify+log, allow) -> enforce (reject). 'enabled' mirrors state!=off. Server opts in via membershipd --bus-auth; clients via client.Connect(caPath).",
"added": "2026-06-07",
"enabled_at": "2026-06-07"
},
"bus-tls": {
"enabled": true,
"issue": "0001",
"description": "TLS on the NATS data plane using the project's self-signed CA (deploy/tls/). Server opts in via membershipd --tls-cert/--tls-key; clients pin ca.crt via client.Connect(caPath).",
"added": "2026-06-07",
"enabled_at": "2026-06-07"
},
"decentralized": {
"enabled": false,
"issue": "0003",
"description": "Control-plane state on replicated JetStream KV instead of local SQLite (branch-by-abstraction membership.Store: sqliteStore default, jetstreamStore opt-in). The route cluster (0003a) and the KV store (0003b) shipped behind this flag; the membershipd boot wiring that selects the store is COMPLETE since issue 0006c and is realized at runtime with the server flag --store kv|sqlite (default sqlite). The internal-identity bootstrap (0006a) lets membershipd open the KV store on its own embedded NATS under enforce. Per-deploy opt-in: a node joins the decentralized control plane by starting with --store kv (and --cluster-name for HA). OFF (--store sqlite) keeps the single-node SQLite control plane unchanged.",
"added": "2026-06-07",
"enabled_at": null
}
}
}
+214
View File
@@ -0,0 +1,214 @@
---
issue: 0001
title: Seguridad del bus — sistema de usuarios, auth firmada del control plane, NATS nkey + TLS
status: spec
created: 2026-06-07
domain: security
scope: unibus (membershipd, pkg/membership, pkg/embeddednats, pkg/client) + clientes (mobile, web gateway, unibots)
---
# Objetivo
Hoy el bus unibus solo está protegido por la red (WireGuard) y por el cifrado E2E
por room (megolm). El **control plane** (HTTP `:8470`) y el **data plane** (NATS
`:4250`) **no tienen autenticación ni TLS**: cualquiera que alcance esos puertos
puede crear rooms, leer metadata, publicar, y hacer DoS. El contenido de las rooms
`ModeMatrix` está cifrado E2E, pero las rooms `ModeNATS` (cleartext), la metadata
de subjects y todo el control plane viajan en claro y sin control de acceso.
Este issue añade tres capas de seguridad al propio bus, de modo que **WireGuard
pase a ser opcional** (defensa en profundidad) y el bus pueda exponerse de forma
segura incluso a un cliente móvil en una red ajena:
1. **Sistema de usuarios** — un registro a nivel bus de las identidades autorizadas
(allowlist de claves públicas Ed25519), con roles y revocación.
2. **Auth del control plane** — cada request HTTP va firmado con la identidad del
peer; el server verifica la firma y que la identidad esté autorizada.
3. **NATS endurecido** — autenticación por nkey (Ed25519) contra el registro de
usuarios + TLS para cifrar todo el transporte del data plane.
# Modelo de amenazas y capas
| Capa | Qué protege | Estado hoy | Tras este issue |
|---|---|---|---|
| WireGuard | Acceso de red; oculta el bus de internet | activo (opcional) | sigue disponible, ya no imprescindible |
| TLS NATS | Confidencialidad/integridad del **canal** (cleartext rooms, metadata, nonces de auth) | ausente | CA propia self-signed |
| Auth (firma Ed25519 / nkey) | **Autenticación**: solo identidades registradas conectan/operan | ausente | control plane + data plane |
| E2E por room (megolm) | Confidencialidad del **contenido** de rooms cifradas | activo | sin cambios |
Principio: cada capa es independiente. TLS cifra el canal, la auth decide quién
entra, el E2E protege el contenido aunque el bus fuera comprometido.
# Diseño
## Pieza 1 — Sistema de usuarios
Registro a nivel bus (no por room) de las identidades autorizadas. Migración
**aditiva** `migrations/002_users.sql` (y su gemela embebida en
`pkg/membership/migrations/`):
```sql
CREATE TABLE IF NOT EXISTS users (
sign_pub TEXT PRIMARY KEY, -- clave pública Ed25519 en hex (identidad del peer)
handle TEXT NOT NULL, -- nombre legible (único recomendado, no PK)
role TEXT NOT NULL DEFAULT 'member', -- 'admin' | 'member'
status TEXT NOT NULL DEFAULT 'active', -- 'active' | 'revoked'
created_at TEXT NOT NULL,
revoked_at TEXT
);
CREATE INDEX IF NOT EXISTS idx_users_status ON users(status);
```
- `sign_pub` es la misma clave que ya deriva el `endpoint` (`frame.EndpointID(SignPub)`).
- CRUD en `pkg/membership/store.go`: `AddUser`, `GetUser`, `ListUsers`,
`RevokeUser`, `IsAuthorized(signPubHex) bool`.
- CLI de administración en `cmd/membershipd`: `membershipd user add --handle h
--sign-pub <hex> [--role admin]`, `user list`, `user revoke <sign-pub>`.
- **Bootstrap (chicken-egg):** el primer `admin` se siembra ejecutando el CLI
localmente en el host del bus (`user add --role admin --sign-pub <tu_pub>`). El
CLI local se considera de confianza (quien tiene shell en el host ya manda). Sin
al menos un admin, los endpoints de gestión de usuarios devuelven 403.
## Pieza 2 — Auth del control plane (HTTP :8470)
Generaliza la firma que ya existe (`pkg/client.signRequest` ↔
`pkg/membership.verifyOwnerSig`) de "solo owner" a "todo request".
**Cliente** (`pkg/client`): cada request añade cabeceras:
```
X-Unibus-Pub: <sign_pub hex>
X-Unibus-Ts: <unix seconds>
X-Unibus-Nonce: <16 bytes aleatorios, base64>
X-Unibus-Sig: Ed25519( canonical ) ; canonical = method "\n" path "\n" ts "\n" nonce "\n" sha256(body)
```
**Server** (middleware en `membershipd`):
1. Parsear cabeceras; reconstruir `canonical`; verificar firma con `X-Unibus-Pub`.
2. Comprobar `IsAuthorized(pub)` (status active). Si no → `401`.
3. **Anti-replay:** rechazar si `|now - ts| > 30s`; cachear `nonce` con TTL 60s y
rechazar repetidos (LRU en memoria, suficiente para un único membershipd).
4. Autorización fina: operaciones de gestión de usuarios exigen `role=admin`;
operaciones de room siguen exigiendo ownership donde ya aplica.
Feature flag `bus-auth` en `dev/feature_flags.json` con tres estados de rollout:
`off` (sin verificar) → `soft` (verifica y **loguea** rechazos pero deja pasar) →
`enforce` (rechaza). Permite migrar clientes sin cortar el servicio.
## Pieza 3 — NATS: nkey auth + TLS
### Auth (nkey sobre la identidad Ed25519)
Los nkeys de NATS **son** claves Ed25519, así que reutilizamos la identidad del
peer sin material nuevo.
- **Server** (`pkg/embeddednats`): `server.Options.CustomClientAuthentication` con
un autenticador que, dado el nonce que NATS presenta al cliente y la firma que el
cliente devuelve, verifica la firma con la pubkey declarada y consulta
`store.IsAuthorized(pub)`. Validar dinámicamente contra la BD permite **revocar
sin reiniciar** el server (ventaja sobre precargar `Options.Nkeys`).
- **Cliente** (`pkg/client`): conectar con `nats.Nkey(pubSeedEncoded, sigCB)` donde
`sigCB` firma el nonce con la Ed25519 del peer. Convertir `cs.Identity` →
formato nkey con `github.com/nats-io/nkeys` (`nkeys.FromRawSeed(PrefixByteUser,
seed)`).
### TLS (CA self-signed propia)
**Exposición DECIDIDA: pública.** El bus se expone a internet protegido por
auth+TLS (WireGuard pasa a ser una vía de acceso más, no la barrera). En
consecuencia: `ufw` en om abre `8470/tcp` y `4250/tcp`, y el server cert incluye en
su SAN la **IP pública de om `135.125.201.30`**, la **IP WG `10.42.0.1`** (los peers
internos siguen funcionando) y el hostname de om. Los clientes son todos
controlados por nosotros (`pkg/client`, binding móvil, gateway web, unibots), así
que **embeben el `ca.crt`** propio — no hace falta Let's Encrypt ni un dominio
público apuntando al NATS.
- Generar una **CA propia** una vez (`deploy/tls/ca.{key,crt}`), y un **server
cert** para el bus con SAN = `135.125.201.30`, `10.42.0.1`, hostname de om.
- `pkg/embeddednats`: `server.Options.TLSConfig` con el server cert. NATS pasa a
`tls://`.
- Cliente: `nats.Secure(&tls.Config{RootCAs: caPool})` cargando la CA propia.
- Las claves privadas (CA key, server key) **nunca** se commitean: van gitignored y
se distribuyen por `pass`/scp. Solo el `ca.crt` (público) viaja con los clientes.
# Decisiones técnicas
| Decisión | Elegido | Alternativa descartada | Razón |
|---|---|---|---|
| Auth NATS | `CustomClientAuthentication` contra tabla `users` | `Options.Nkeys` estático | revocación dinámica sin reinicio |
| TLS | CA self-signed propia | Let's Encrypt | infra privada, sin dependencia de dominio público apuntando al NATS |
| Anti-replay control plane | timestamp ±30s + cache de nonce | nonce emitido por server (round-trip extra) | menos latencia, suficiente con un solo membershipd |
| Material de identidad | reutilizar la Ed25519 del peer (firma + nkey) | claves separadas por capa | una identidad, menos gestión |
| Rollout | feature flag `bus-auth` off→soft→enforce | corte directo | no romper clientes en vuelo |
# Fases (TBD, ramas `issue/0001x-*`, feature flags)
1. **0001a — users store + CLI** — migración `002_users.sql`, CRUD en store,
comandos `membershipd user *`, seed admin. Flag `bus-auth: off`. Tests de store.
2. **0001b — control-plane auth** — firma generalizada en `pkg/client`, middleware
de verificación + anti-replay en `membershipd`. Flag `bus-auth: soft`. Tests:
request firmado OK, no-autorizado 401, replay rechazado, reloj desfasado 401.
3. **0001c — NATS nkey auth** — `CustomClientAuthentication` + cliente con
`nats.Nkey`. Tests: peer no registrado rechazado al conectar; revocado pierde
acceso sin reiniciar.
4. **0001d — TLS NATS** — generación de CA/cert (`deploy/tls/` + script), server
`TLSConfig`, cliente `RootCAs`. Flag `bus-tls`. Test: handshake TLS, cliente sin
CA rechazado.
5. **0001e — migrar clientes** — `mobile/` (binding), gateway web (`playground/`),
`unibots` (`shell/transportunibus`): todos firman requests y conectan con
nkey+TLS. Pasar `bus-auth` a `enforce`.
6. **0001f — deploy** — unibus en om (bind `10.42.0.1` o público con auth+TLS),
unibots como systemd-user en el PC local. Verificación E2E.
# Migración de clientes
Todo el cambio se concentra en `pkg/client` (firma de requests HTTP + conexión
NATS nkey+TLS). `mobile/`, el gateway web y `unibots` lo heredan al recompilar; solo
necesitan **pasar la ruta de la CA** y su identidad (que ya tienen). El binding
gomobile expone un parámetro nuevo `caPath` en `NewSession`.
# Plan de despliegue (fase 0001f)
1. Cross-build `CGO_ENABLED=0 GOOS=linux GOARCH=amd64` del `membershipd`.
2. `scp` binario + `ca.crt` + server cert/key a om (`/opt/unibus/`), dir de datos
persistente para JetStream/db/blobs.
3. systemd-system unit, `--bind 0.0.0.0` (exposición pública), `Restart=always`.
4. `ufw allow 8470/tcp` y `ufw allow 4250/tcp` en om.
5. Seed del admin (tu identidad) por CLI local en om.
6. Verificar **desde fuera de la VPN** (red pública) y desde la WG: handshake TLS,
`curl` firmado a `/healthz` OK, `curl` sin firma → 401, conexión NATS de un peer
no registrado → rechazada.
7. unibots local: systemd-user con `caPath` + identidad registrada.
> **Nota:** la fase de despliegue (0001f: abrir firewall público, scp a om, systemd
> en el VPS) la ejecuta el humano en coordinación, no el agente autónomo — es una
> acción outward sobre infraestructura pública. El agente entrega 0001a0001e
> (código + tests + CA/cert generados) en master de unibus, listos para desplegar.
# Tests (DoD: golden + edge + error path, evidencia ejecutable)
- **Golden:** peer autorizado crea room, publica y recibe por el bus con auth+TLS
activos.
- **Edge:** revocar un usuario activo → su próxima conexión NATS y su próximo
request HTTP son rechazados sin reiniciar el server.
- **Error path:** request con firma válida pero identidad no registrada → 401;
conexión NATS con nkey no autorizado → rechazada; cliente sin la CA → fallo de
handshake TLS; replay de un request firmado → rechazado.
- Suite completa `CGO_ENABLED=0 go test ./...` verde.
# Riesgos y mitigaciones
| Riesgo | Mitigación |
|---|---|
| Chicken-egg del primer admin | seed por CLI local en el host (confianza de shell) |
| Romper clientes en vuelo al activar auth | flag `bus-auth` off→soft→enforce; migrar clientes en soft |
| Rotación/caducidad de certs | CA propia de larga vida; documentar regeneración del server cert en `deploy/tls/README.md` |
| Coste de verificar firma por request | Ed25519 verify ≈ µs; despreciable frente a la latencia de red |
| Conversión Ed25519 → nkey mal hecha | test dedicado de ida y vuelta firma/verify nkey antes de tocar el server |
| Claves privadas filtradas en git | CA key / server key gitignored; distribución por `pass`/scp; solo `ca.crt` versionado |
# Fuera de alcance (futuro)
- Rotación automática de credenciales de usuario.
- Cuentas/multi-tenant de NATS (un solo account basta hoy).
- Federación entre buses.
+146
View File
@@ -0,0 +1,146 @@
---
issue: 0002
title: Media v2 — archivos grandes (chunking), metadata, GC del object store, exponer en clientes
status: spec
created: 2026-06-07
domain: media
scope: unibus (pkg/blobstore, pkg/frame, pkg/client, pkg/membership) + clientes (mobile binding, gateway web, unibots)
depends_on: 0001 (la auth firmada del control plane debe cubrir /blobs antes de exponer media)
---
# Objetivo
El envío de archivos (imágenes, audio, vídeo) ya funciona en v1, pero con límites
que lo hacen inviable para vídeo grande y poco usable para los clientes. Este issue
lleva la media a un estado de producción: archivos grandes por chunks, metadata de
tipo/nombre, recolección de basura del object store, y exposición en los frontends.
# Contexto — cómo funciona media v1 (hoy)
`PublishMedia(roomID, data []byte)` cifra el archivo **entero** con la clave de la
room (`SealAEAD`), lo sube **entero** al object store (`pkg/blobstore`,
content-addressed por hash) vía el control plane (`POST /blobs`), y publica por el
bus solo una referencia `frame.BlobRef{Hash, Nonce, Size}`. `FetchMedia` baja el
ciphertext por hash (`GET /blobs/{hash}`) y lo descifra. El binario nunca viaja por
NATS; el bus solo lleva la referencia. El object store guarda solo ciphertext (E2E
real). Es correcto y simple, pero:
| Limitación v1 | Consecuencia |
|---|---|
| Todo el archivo en RAM (cifra y sube de una vez) | imágenes/audio OK; vídeo grande (cientos MB/GB) revienta memoria |
| `BlobRef` solo lleva hash+nonce+size | el receptor no sabe mimetype/filename; no puede renderizar bien |
| Sin resumable | si falla la subida de un archivo grande, reempezar de cero |
| Object store sin GC | blobs content-addressed crecen indefinidamente, sin refcount ni TTL |
| `mobile/` solo expone `Publish` (texto) | no se puede enviar una foto desde el móvil |
| Gateway web sin endpoints de media | la SPA no sube/baja archivos |
Fuera de alcance de este issue (sería otro): **streaming en vivo** (videollamada,
audio en tiempo real) — eso no es modelo blob, requiere WebRTC señalizado por el bus.
# Diseño
## Pieza 1 — Chunking de archivos grandes
Partir el archivo en chunks de tamaño fijo (propuesta: 4 MB), cifrar **cada chunk**
de forma independiente con la clave de la room (nonce por chunk), y subir cada chunk
como un blob propio (content-addressed). La referencia pasa de un solo blob a un
manifiesto de chunks.
- `frame.BlobRef` evoluciona (de forma compatible) a soportar lista de chunks:
```
BlobRef{
Hash string // hash del manifiesto (o del blob único si no hay chunks)
Nonce []byte // nonce del manifiesto / del blob único
Size int64 // tamaño total en claro
Chunks []ChunkRef // vacío en archivos pequeños (camino v1 intacto)
}
ChunkRef{ Hash string; Nonce []byte; Size int64 } // por chunk cifrado
```
- `PublishMediaStream(roomID string, r io.Reader, meta MediaMeta) (BlobRef, error)`:
lee del `io.Reader` en chunks (no carga el archivo entero en RAM), cifra y sube
cada chunk, y construye el manifiesto. El `PublishMedia([]byte)` v1 se mantiene
como atajo para archivos pequeños (sin chunks).
- `FetchMediaStream(roomID, BlobRef) (io.ReadCloser, error)`: baja y descifra chunks
bajo demanda, exponiendo un `io.Reader` (descarga progresiva, no todo en RAM).
- Subida/descarga de chunks en paralelo acotado (p. ej. 4 a la vez) para throughput.
## Pieza 2 — Metadata (mimetype + filename)
Añadir a `BlobRef` (o a un sidecar cifrado) los campos `Mime string` y `Name
string`, de modo que el receptor sepa renderizar (imagen inline, reproductor de
audio/vídeo, icono de descarga). Como `Name`/`Mime` pueden ser sensibles, viajan
**dentro del campo cifrado** del frame, no en claro. Detección de mimetype por
sniffing del primer chunk + extensión.
## Pieza 3 — Garbage collection del object store
Hoy los blobs no se borran nunca. Introducir refcount o barrido:
- **Refcount por referencia**: una tabla `blob_refs(hash, room_id, msg_id)` en el
control plane; al expirar un mensaje de una room efímera o al purgar historial de
una room persistente, decrementar y borrar el blob cuando llega a cero.
- **Alternativa TTL**: blobs de rooms efímeras con TTL; blobs de rooms persistentes
viven mientras viva el mensaje en JetStream.
- Comando `membershipd blobs gc [--dry-run]` para barrido manual + métrica de
espacio. Debe ser idempotente y seguro (nunca borrar un blob aún referenciado).
## Pieza 4 — Exponer media en los clientes
- **Binding móvil** (`mobile/unibus.go`): `SendFile(roomID, path, mime)` y
`FetchFile(roomID, frameJSON) -> path` (escribe a un archivo local del sandbox de
la app y devuelve la ruta; no pasa []byte grandes por el puente gomobile).
- **Gateway web** (`playground/server.go`): `POST /api/media` (multipart, streaming
al store) y `GET /api/media/{room}/{hash}` (descarga descifrada con los headers
`Content-Type`/`Content-Disposition` derivados de la metadata).
- **unibots**: una tool `send_file` para que un bot pueda adjuntar archivos.
# Decisiones técnicas
| Decisión | Elegido | Alternativa | Razón |
|---|---|---|---|
| Tamaño de chunk | 4 MB | 1 MB / 16 MB | equilibrio RAM vs overhead de manifiesto |
| Cifrado por chunk | nonce independiente por chunk, misma clave de room | re-cifrar todo | permite descarga/borrado parcial y paralelismo |
| Metadata sensible | dentro del frame cifrado | en claro en BlobRef | filename/mime pueden filtrar info |
| GC | refcount en control plane | solo TTL | preciso, no borra lo aún referenciado |
| Compatibilidad v1 | `Chunks` vacío = camino v1 | romper formato | no romper media ya enviada |
# Fases (TBD, ramas `issue/0002x-*`)
1. **0002a — BlobRef con chunks (compatible)** — extender el tipo + tests de
marshalling con `Chunks` vacío (v1) y con chunks (v2). Sin cambiar clientes aún.
2. **0002b — PublishMediaStream / FetchMediaStream** — API de streaming en
`pkg/client` sobre `io.Reader`/`io.ReadCloser`, cifrado por chunk, subida/descarga
paralela acotada. Tests con un archivo > tamaño de chunk.
3. **0002c — metadata mime+name** (en el campo cifrado) + sniffing.
4. **0002d — GC del object store** — refcount + `membershipd blobs gc` + tests de
"no borrar referenciado / borrar huérfano".
5. **0002e — exponer en clientes** — binding móvil (`SendFile`/`FetchFile`), gateway
web (`/api/media`), tool `send_file` en unibots.
# Definition of Done (evidencia ejecutable)
- **Golden:** enviar y recibir una imagen pequeña (camino v1, sin chunks) sigue
funcionando; enviar y recibir un archivo de 50 MB por chunks sin cargar 50 MB en
RAM (medir RSS durante la operación).
- **Edge:** archivo cuyo tamaño es múltiplo exacto del chunk; archivo de 1 byte;
archivo justo por debajo y por encima del umbral de chunking.
- **Error path:** chunk corrupto/no descifrable → error claro, no panic; `blobs gc`
con un blob aún referenciado → NO lo borra (assert).
- `CGO_ENABLED=0 go test ./...` verde.
# Riesgos y mitigaciones
| Riesgo | Mitigación |
|---|---|
| Romper media v1 ya enviada | `Chunks` vacío preserva el camino v1; tests de compatibilidad |
| GC borra un blob aún referenciado | refcount + barrido conservador + `--dry-run` por defecto en CI |
| Puente gomobile con []byte grandes | el binding trabaja con rutas de archivo, no buffers en memoria |
| Paralelismo de chunks satura el control plane | límite de concurrencia (4) + el endurecimiento de auth del issue 0001 |
# Relación con otros issues
- **0001 (seguridad)** — prerequisito: la auth firmada del control plane debe cubrir
`POST/GET /blobs` antes de exponer media públicamente; si no, cualquiera llena el
store o descarga ciphertext ajeno.
- **Streaming en vivo** (futuro, no este issue) — videollamada/audio en tiempo real =
WebRTC con el bus como canal de señalización; modelo distinto al blob.
+195
View File
@@ -0,0 +1,195 @@
---
issue: 0003
title: Descentralización / alta disponibilidad — cluster NATS + JetStream replicado + control plane sin SPOF
status: spec
created: 2026-06-07
domain: infra
scope: unibus (pkg/embeddednats, pkg/membership, pkg/blobstore, pkg/client, cmd/membershipd) + despliegue multi-nodo
depends_on: 0001 (la auth de cluster y de clientes va junto con el endurecimiento)
---
# Objetivo
Que la caída de un servidor **no deje el bus sin servicio**. Hoy unibus es un único
`membershipd` (con NATS embebido + SQLite local): si ese host muere, no hay bus.
Este issue lleva unibus a un modelo **descentralizado / alta disponibilidad** usando
las capacidades nativas de NATS: cluster multi-nodo, JetStream replicado (RAFT), y
el estado del control plane fuera de la SQLite local. **No es federación**
(multi-operador con dominios distintos); es eliminar el punto único de fallo dentro
de un único dominio administrativo controlado por nosotros.
# Requisito clave de quorum (decisión de infraestructura)
JetStream replica con RAFT, que necesita **mayoría (quorum)** para confirmar
escrituras. Las consecuencias son duras y hay que asumirlas desde el diseño:
| Nodos | Réplica | Tolera caída de | Nota |
|---|---|---|---|
| 1 | R1 | 0 | situación actual (SPOF) |
| 2 | R2 | **0** | si cae uno se pierde quorum: las escrituras se bloquean. NO sirve para HA |
| **3** | **R3** | **1** | mínimo real para "si un server cae, seguimos" |
| 5 | R5 | 2 | mayor tolerancia |
**Por tanto el objetivo del usuario ("si mi server falla, no nos quedamos sin
servicio") exige 3 nodos JetStream.** Servers disponibles hoy: **magnus** y
**homer** (ambos VPS OVH). El tercero está pendiente de conseguir.
| Nodo | IP pública | Estado | Notas |
|---|---|---|---|
| magnus | (en pass: `MAGNUS_ovh_ssh_ROOT`) | disponible, **cargado** | corre coolify, minio, postgres, authentik, portainer, dagu — revisar recursos antes |
| homer | `141.94.69.66` | disponible, vivo | creds en pass (`vps_ovhcloud_SSH_SERVER_HOMER_-_root`, `vps_SSH_SERVER_HOMER_dataherrero`); tenía coolify |
| nodo 3 | — | **pendiente** | conseguir un tercer VPS siempre-on, o reusar om/datardos si se liberan |
Preparación previa al deploy de cada nodo: alta del alias SSH + clave, integración en
la WireGuard, y revisar/aligerar la carga existente (coolify, etc.).
## Rollout R1 → R3: funcionar con 2 nodos hoy, HA con 3 mañana
No se "desactiva el quorum"; se controla el **número de réplicas** de cada stream/KV:
| Réplicas | Quorum | Tolera | Sirve con |
|---|---|---|---|
| R1 | ninguno (1 copia) | 0 caídas | 1-2 nodos, sin bloqueo |
| R3 | 2 de 3 | 1 caída | 3 nodos |
- **Fase actual (magnus + homer):** desplegar con streams/KV en **R1** (flag
`decentralized: off`). El bus funciona al 100% para operar, sin tolerancia a fallo
todavía. Opción: streams en **R2** para duplicar los datos en ambos nodos
(durabilidad/backup vivo), asumiendo que la escritura necesita los dos hasta el 3er
nodo.
- **Cuando entre el nodo 3:** escalar en caliente `nats stream update --replicas 3`
(idem KV/Object Store) + añadir el nodo al cluster + flag `decentralized: on`. **HA
real, sin downtime, sin reescritura, sin migrar datos.**
- **Aviso de 2 nodos:** NO montar el meta-group de JetStream con 2 nodos como si
fuera HA — su quorum es 2, y la caída de uno bloquea la gestión de streams. Con 2
servers, modelo recomendado: **magnus principal (R1) + homer 2º nodo/réplica**, y
escalar a R3 al tener el tercero.
Mientras solo haya 2 nodos: el **data plane efímero** (core-NATS, rooms `ModeNATS`)
sí tolera la caída de uno (los clientes reconectan al otro), pero las **rooms
persistentes y el control plane** (que necesitan quorum) no. El issue se despliega
de verdad cuando haya 3 nodos.
# Contexto — por qué hoy es un SPOF
- `pkg/embeddednats` arranca un NATS **standalone** (sin cluster).
- `pkg/membership` guarda rooms/members/room_keys/users en una **SQLite local** al
proceso.
- `pkg/blobstore` guarda los blobs en el **disco local** del proceso.
- El cliente (`pkg/client`) conecta a **una** URL de NATS y **una** de control plane.
Todo vive en un host. Ese host es el punto único de fallo.
# Diseño
## Pieza 1 — Cluster NATS (data plane replicado)
`pkg/embeddednats` gana opciones de cluster: `server.Options.Cluster` (nombre +
host/puerto de routes) y `Routes` (los otros nodos). Cada `membershipd` arranca su
NATS embebido en cluster con los demás. JetStream se habilita con `Replicas: 3` en
streams y KV. Auth entre nodos (routes) con credenciales propias (no las de
clientes), y TLS también en las routes (reusa la CA del issue 0001).
## Pieza 2 — Control plane sin estado local (SQLite → JetStream KV)
Es el corazón del issue. Hoy `pkg/membership.Store` es SQLite. Se introduce, por
**branch-by-abstraction**, una interfaz `Store` con dos implementaciones:
- `sqliteStore` — la actual (sigue siendo el default mientras el flag está off; útil
para un solo nodo / desarrollo).
- `jetstreamStore` — nueva: rooms, members, room_keys y users (la tabla del issue
0001) viven en **JetStream KV** (buckets replicados R3). Cualquier nodo lee/escribe
el mismo estado; RAFT garantiza consistencia. El HTTP control plane pasa a ser
efectivamente **stateless**: cualquier `membershipd` sirve cualquier request
porque el estado está en el KV replicado.
Flag `decentralized` (off → on). Migración inicial de datos SQLite → KV con un
comando `membershipd migrate-to-kv` (idempotente). Las claves de room siguen
selladas igual; solo cambia **dónde se guardan**, no el cifrado.
## Pieza 3 — Blobs replicados (object store → NATS Object Store)
`pkg/blobstore` gana una implementación sobre **NATS Object Store** (encima de
JetStream, replicado R3) además de la de disco local. Los blobs (ya ciphertext, E2E)
quedan disponibles desde cualquier nodo. Encaja con el GC del issue 0002.
## Pieza 4 — Cliente con failover
`pkg/client`: aceptar **lista** de seeds de NATS y **lista** de URLs de control
plane. `nats.go` ya hace reconnect/failover entre servidores del cluster nativamente
(`nats.Servers([...])`, `nats.MaxReconnects(-1)`). El control plane HTTP se prueba en
orden con reintento. Así, si un nodo cae, el cliente reconecta a otro de forma
transparente.
## Pieza 5 — Despliegue multi-nodo
3 nodos `membershipd`, cada uno con su NATS embebido en cluster, JetStream R3, mismo
`ca.crt`/credenciales de routes. systemd en cada VPS. Los clientes reciben la lista
de los 3 endpoints. Health/observabilidad por nodo (`/healthz` + métricas de
JetStream: líder RAFT, lag de réplica).
# Decisiones técnicas
| Decisión | Elegido | Alternativa | Razón |
|---|---|---|---|
| Nº de nodos de quorum | 3 (R3) | 2 (R2) | 2 no tolera caída de uno; 3 es el mínimo real de HA |
| Estado del control plane | JetStream KV replicado | SQLite replicada a mano / Postgres externo | KV ya viene con NATS, mismo RAFT que JetStream, cero infra extra |
| Migración del store | branch-by-abstraction (interfaz `Store`, dos impls, flag) | reescritura directa | master nunca se rompe; sqlite sigue para 1 nodo/dev |
| Blobs | NATS Object Store | disco compartido / S3 | replicado nativamente, sin dependencia externa |
| Failover de cliente | lista de seeds + reconnect nativo nats.go | balanceador externo | menos infra, nats.go ya lo hace |
| Federación multi-operador | **fuera de alcance** | — | no es el objetivo; es otra liga (trust entre dominios) |
# Fases (TBD, ramas `issue/0003x-*`)
1. **0003a — cluster NATS** — opciones de cluster/routes + TLS de routes en
`pkg/embeddednats`; arrancar 2-3 nodos locales en tests e2e y verificar que un
subject publicado en uno llega a un suscriptor en otro.
2. **0003b — interfaz Store + jetstreamStore (KV)** — abstraer `pkg/membership.Store`;
implementar rooms/members/room_keys/users sobre JetStream KV R3; tests de
consistencia. Flag `decentralized: off`.
3. **0003c — migrate-to-kv** — comando idempotente SQLite → KV + test de paridad
(mismo estado antes/después).
4. **0003d — blobs en Object Store** — impl `pkg/blobstore` sobre NATS Object Store
replicado.
5. **0003e — cliente failover** — lista de seeds + lista de ctrl-urls + reconnect;
test que mata el nodo al que está conectado y verifica que sigue operando.
6. **0003f — despliegue 3 nodos** (humano) — 3 VPS en cluster, JetStream R3, flag
`decentralized: on`. Chaos test real: matar un nodo en producción y comprobar que
el servicio sigue.
# Definition of Done (evidencia ejecutable)
- **Golden:** 3 nodos en cluster; un cliente publica en un nodo y otro cliente
suscrito a otro nodo lo recibe; crear room + invitar funciona desde cualquier nodo.
- **Edge:** un cliente conectado al nodo A; se **mata el nodo A**; el cliente
reconecta a B automáticamente y sigue publicando/recibiendo sin perder la sesión.
- **Error path (chaos):** matar 1 de 3 nodos → el control plane sigue aceptando
escrituras (quorum 2/3); matar 2 de 3 → las escrituras se bloquean (quorum perdido,
comportamiento esperado y documentado, no corrupción).
- `CGO_ENABLED=0 go test ./...` verde, incluido un test e2e multi-nodo en proceso.
# Riesgos y mitigaciones
| Riesgo | Mitigación |
|---|---|
| Solo 2 nodos disponibles → sin quorum real | prerequisito explícito de 3 nodos antes de 0003f; hasta entonces, despliegue queda en standalone |
| Latencia inter-VPS afecta RAFT | nodos en la misma región o con buena red; medir; R3 tolera latencias moderadas |
| Migración SQLite→KV pierde datos | comando idempotente + test de paridad + backup de la SQLite antes |
| Partición de red (split-brain) | RAFT lo previene: el lado sin quorum se bloquea para escritura, no diverge |
| Complejidad operativa de 3 nodos | observabilidad de JetStream (líder, lag) + `/healthz` por nodo + runbook en deploy/ |
# Orden recomendado respecto a otros issues
1. **0001 (seguridad)** primero: la auth de clientes (nkey) y la CA/TLS se reutilizan
para las routes del cluster. Desplegar descentralizado sin auth sería abrir varios
puntos públicos sin protección.
2. **0003 (este)** después: una vez el bus es seguro, replicarlo en 3 nodos.
3. **0002 (media v2)** es ortogonal; su object store encaja con la pieza 3 (blobs
replicados) cuando ambos estén.
# Fuera de alcance
- Federación entre operadores/dominios distintos (otra liga; requiere protocolo de
trust entre dominios).
- Multi-tenant / accounts de NATS por organización.
- Auto-escalado dinámico de nodos.
+146
View File
@@ -0,0 +1,146 @@
---
issue: 0004
title: Hardening de seguridad — autorización, anti-DoS y confidencialidad antes de exponer público
status: done
created: 2026-06-07
completed: 2026-06-07
report: projects/message_bus/reports/0005-2026-06-07-unibus-security-hardening.md
domain: security
scope: unibus (pkg/membership/server.go, auth.go, pkg/embeddednats, pkg/client, cmd/membershipd, deploy/tls)
depends_on: 0001 (cierra los gaps que la auditoría 0004 encontró sobre lo entregado en 0001)
blocks: 0001f (deploy público) y 0003f (deploy descentralizado)
source: projects/message_bus/reports/0004-2026-06-07-unibus-security-audit.md
---
# Objetivo
La auditoría red-team (report 0004) concluyó: la **autenticación** del bus es sólida,
pero faltan **autorización, disponibilidad y confidencialidad de metadata** — justo lo
que un bus *público* necesita. Veredicto: **NO exponer público hoy**. Este issue cierra
los hallazgos bloqueantes (1 crítico + 4 altos) y los medios relevantes, de modo que el
deploy 0001f (público) y luego 0003 (descentralizado) sean seguros.
Cada fase corresponde a un hallazgo del report 0004. La **DoD de cada fase es portar el
test adversarial del auditor** (`TestAudit_*`) y verificar que ahora arroja el resultado
SEGURO (lo que antes pasaba el ataque, ahora lo rechaza).
# Fases (TBD, ramas `issue/0004x-*`, una por hallazgo)
## 0004a — H1 (Crítico): límite de cuerpo + anti-DoS pre-auth
**Problema:** `Server.ServeHTTP` hace `io.ReadAll(r.Body)` **sin límite y antes** de
`authenticate()`; `handlePutBlob` repite el `io.ReadAll` sin límite. 400 MB sin
credenciales → 898 MB RSS → OOM con pocas conexiones.
**Fix:**
- `http.MaxBytesReader` en el middleware **antes** del `io.ReadAll` (límite control plane,
p.ej. 1 MB).
- Límite separado y mayor para `/blobs`, con rechazo temprano por `Content-Length` antes
de bufferizar; idealmente stream a disco en vez de RAM.
- `Server.MaxHeaderBytes` ajustado.
- Rate-limit por IP (y por identidad tras auth). Reusar/crear una función del registry si
aplica (delegar a `fn-constructor` si es genérica).
**DoD:** test que envía un cuerpo > límite sin firma → `413`/`401` **sin** que el RSS se
dispare (medir `/proc/self/status` antes/después, delta acotado). Golden (cuerpo normal
pasa) + edge (justo en el límite) + error (excede → rechazo barato).
## 0004b — H2 (Alto): cerrar el fail-open de configuración
**Problema:** default `--bus-auth off`; el nkey de NATS solo se activa en `enforce`; TLS
es flag independiente. `--bind 0.0.0.0 --tls-cert …` **sin** `--bus-auth enforce` deja el
bus abierto con apariencia de seguro.
**Fix:**
- Si `--bind` no es loopback ⇒ exigir `--bus-auth enforce` (si no, `log.Fatal` con mensaje
claro).
- `--tls-cert`/`--tls-key` sin `--bus-auth enforce` ⇒ error de arranque.
- Arranque inseguro imposible o, como mínimo, ruidoso y rechazado.
**DoD:** portar `TestAudit_FailOpenTLSWithoutAuth` → ahora el arranque público-sin-enforce
falla; cliente no registrado NO conecta. Golden (bind loopback dev sigue permitido) + error
(bind público sin enforce aborta).
## 0004c — H3 (Alto): autorización por pertenencia en el control plane
**Problema:** "autorizado" = "registrado", no "miembro". Los GET de room no comprueban
pertenencia: `/rooms/{id}`, `/rooms/{id}/members` (expone `sign_pub`+`kex_pub` de todos),
`/members/{endpoint}/rooms`, y `/rooms/{id}/key?endpoint=X` (devuelve la `sealed_key` ajena).
**Fix:**
- Cada handler de room consulta `members` y exige que el firmante (`X-Unibus-Pub`
endpoint) sea miembro.
- `/rooms/{id}/key` solo sirve la clave sellada **para el propio firmante** (`endpoint ==
signer`), nunca de un tercero.
- `/members/{endpoint}/rooms` solo si `endpoint == signer`.
- No exponer la member-list completa a no-miembros.
**DoD:** portar `TestAudit_HorizontalMetadataLeak` → bob (no miembro) ahora recibe `403`
en todos. Golden (miembro legítimo accede) + edge (owner accede) + error (no-miembro 403).
## 0004d — H4 (Alto): control de acceso en el data plane NATS
**Problema:** el authenticator nkey solo decide "registrado sí/no"; no hay permisos por
subject. Cualquier registrado se suscribe/publica en cualquier subject; las rooms
`ModeNATS` (cleartext) quedan expuestas entre usuarios.
**Fix (elegir y documentar la estrategia):**
- Preferente: NATS `Permissions` por identidad (subjects que el usuario puede sub/pub),
derivadas de su pertenencia a rooms; o
- Subjects impredecibles (no derivables del nombre) + verificación de pertenencia
server-side; o
- Prohibir `ModeNATS` en despliegue público (forzar siempre E2E) como mínimo defensivo.
**DoD:** portar `TestAudit_NoSubjectACL` → eve (no invitada) ya NO recibe el mensaje de la
room ajena. Documentar la estrategia elegida y su límite.
## 0004e — H5 (Alto, público): TLS en el control plane
**Problema:** HTTP `:8470` firmado pero **sin TLS** → metadata (subjects, endpoints,
pubkeys, sealed keys, hashes de blobs, grafo social) legible por un MITM en la red pública.
**Fix:**
- Servir el control plane sobre TLS con la misma CA propia (o documentar un reverse-proxy
TLS delante).
- El cliente exige `https` cuando se le pasa una CA (`client.Connect(caPath)` ⇒ control
plane también TLS).
**DoD:** cliente contra control plane `https` con la CA → OK; contra `http` con CA esperada
→ rechaza; un observador no ve la metadata (argumentado + test de esquema).
## 0004f — medios: owner binding, nonce-cache, error leak
- **H6** `handleCreateRoom`: exigir `Owner.Endpoint == frame.EndpointID(X-Unibus-Pub)` y
`Owner.SignPub == pub`. (Portar `TestAudit_OwnerSpoof` → ahora 403.)
- **H7** mover `IsAuthorized` **antes** de tocar el `nonceCache` (no cachear nonces de
no-autorizados); poda por expiry-bucket/heap en vez de O(n) bajo mutex global; cap de
tamaño. (Portar `TestAudit_NonceCachePoisonPreAuth`.) **Nota:** este fix es prerequisito
del cambio a nonce-cache replicado del issue 0003.
- **H12** mensajes de error genéricos al cliente; detalle solo al log (no filtrar rutas/SQL).
# Fuera de alcance de este issue (encolado en otros)
- **H9** (cuota/GC de blobs) → issue 0002 (media v2) ya lo cubre.
- **H10** (AEAD nonce 12B → XChaCha o rekey por volumen) → bajo, futuro; abrir issue propio
si se necesitan rooms de muy alto volumen.
- **H11** (firma de owner sin nonce/ts) → cubierto en la práctica por el envelope `enforce`;
documentar la dependencia. Reforzar si se relaja `enforce`.
- **H8** (custodia de la CA: generar en om, `ca.key` fuera del PC) → tarea operacional del
deploy 0001f/0003f, no de código.
- **govulncheck** sobre nats-server/nats.go/modernc → paso de CI aparte.
# Definition of Done global
- Las cuatro pruebas adversariales bloqueantes del report 0004 (DoS acotado, fail-open
cerrado, fuga horizontal 403, ACL data plane) portadas como tests de regresión y en verde.
- `CGO_ENABLED=0 go build ./...` + `go vet ./...` + `go test ./...` verdes.
- Re-evaluación: tras el hardening, el veredicto de exposición pública pasa de "NO" a
"sí-con-condiciones operacionales" (CA custodiada, Restart=always). Anotar en un report
nuevo o como addendum al 0004.
# Orden respecto a otros issues
1. **0004 (este)** — primero: hace el bus seguro para exponer.
2. **0003 (descentralización)** — después: absorbe el nonce-cache→KV replicado (apoyado en
0004f-H7), la auth de routes del cluster y el guard de fail-open ×N nodos.
3. **0002 (media v2)** — ortogonal; incluye la cuota/GC de blobs (H9).
+132
View File
@@ -0,0 +1,132 @@
---
issue: 0005
title: Hardening 2 — CVEs, spoof por firma omitida, DoS por concurrencia, TLS forzado (re-auditoría)
status: done
created: 2026-06-07
completed: 2026-06-07
domain: security
scope: unibus (go.mod, pkg/client, pkg/membership/server.go, cmd/membershipd/config.go, pkg/embeddednats, pkg/blobstore)
depends_on: 0001, 0004 (cierra los hallazgos NUEVOS de la re-auditoría sobre lo entregado)
blocks: 0001f (deploy público) y 0003f (deploy descentralizado)
source: projects/message_bus/reports/0006-2026-06-07-unibus-security-reaudit.md
---
# Objetivo
La re-auditoría red-team (report 0006) confirmó que el hardening 0004 cerró H1H7/H12,
pero encontró **hallazgos nuevos** que mantienen el veredicto en **"NO exponer público
aún"**. Este issue los cierra. La re-auditoría se hizo sobre el commit `618f6b6`
(pre-0003); algunos hallazgos pueden haber cambiado con 0003 — **cada fase debe primero
verificar si el hallazgo sigue vivo en el master actual** (post-0003, v0.6.0) antes de
arreglarlo.
Estado verificado al crear este issue (master post-0003):
- **N1 vivo**: `go.mod` sigue en `nats-server v2.10.22` y `go 1.25.0`.
- **N3 vivo**: `pkg/client/client.go:802` tiene `if info.Policy.SignMsgs && f.Sig != nil` (el patrón vulnerable exacto).
- **H4**: 0003 añadió `pkg/membership/acl.go` — hay que evaluar si cierra el wildcard `Subscribe(">")` o si falta la capa de NATS Permissions.
- N2, N4: presumiblemente vivos (0003 no los tocó); verificar.
# Fases (TBD, ramas `issue/0005x-*`)
## 0005a — N1 (Alto): CVEs en dependencias
**Hallazgo:** `govulncheck ./...` → 16 vulnerabilidades alcanzables: 14 en
`github.com/nats-io/nats-server/v2@v2.10.22` (servidor embebido, expuesto público en el
deploy decidido) + 2 en la stdlib de Go (`net/textproto` GO-2026-5039, `crypto/x509`
GO-2026-5037).
**Fix:**
- `go get github.com/nats-io/nats-server/v2@v2.11.15` (o superior que cubra las 14).
- Subir la toolchain a `go1.26.4` (cubre las 2 de stdlib); actualizar la directiva `go`
en `go.mod` si procede.
- Re-correr `govulncheck ./...` hasta **0 affected**.
- **Nota:** este es un cambio de `go.mod`/`go.sum` justificado por CVE; documentarlo en el
commit. Verificar que el bump de nats-server no rompe el cluster/JetStream de 0003
(correr toda la suite, incluido el e2e multi-nodo).
**DoD:** `govulncheck ./...` → "No vulnerabilities found" (o solo no-alcanzables); suite
completa verde tras el bump.
## 0005b — N3 (Alto): spoof por firma omitida en rooms firmadas
**Hallazgo:** `pkg/client/client.go::processFrame` verifica la firma **solo si el frame la
trae**: `if info.Policy.SignMsgs && f.Sig != nil { verify }`. Un atacante con acceso al
data plane publica un frame con `Sig==nil` y `Sender` forjado → el receptor lo acepta como
auténtico en una room que EXIGE firma.
**Fix:** en una room `SignMsgs`, un frame sin firma debe **dropearse**:
```go
if info.Policy.SignMsgs {
if f.Sig == nil { return } // exige firma; sin ella, descarta
if !verify(...) { return }
}
```
**DoD:** portar `TestReaudit_SigNilSpoof` → ahora el frame `Sig==nil` con `Sender` forjado
en una room `SignMsgs` se **descarta** (no se entrega al handler). Golden (frame firmado
válido se entrega) + edge (room sin SignMsgs no se ve afectada) + error (Sig==nil en
SignMsgs → drop).
## 0005c — N2 (Medio-Alto): DoS por concurrencia
**Hallazgo:** el límite por-request (16 MiB) + rate-limit per-IP NO acotan la memoria
agregada. 40 subidas de 16 MiB simultáneas (= el burst per-IP) → 1.42 GB RSS. Multi-IP
escala sin techo.
**Fix (elegir y documentar):**
- Límite global de conexiones concurrentes y/o de bytes-en-vuelo (semáforo con cota de
memoria total), y/o
- Stream del blob a disco en vez de `io.ReadAll` en RAM (encaja con la cuota/GC del issue
0002), y/o
- Bajar `maxBlobBytes` y separar mejor el límite de control (1 MiB) del de blobs.
**DoD:** test que lanza N subidas concurrentes al techo y verifica que el RSS agregado
queda **acotado** (mide `/proc/self/status`, cota declarada) en vez de crecer linealmente
con N. Golden (concurrencia normal pasa) + edge (en la cota) + error (exceso → 429/503 sin
OOM).
## 0005d — N4 (Medio): forzar TLS del control plane en bind público
**Hallazgo:** el guard `validateBootConfig` cierra "público sin enforce" y "TLS sin
enforce", pero **permite** público + enforce **sin** `--tls-cert` → el control plane sirve
HTTP plano públicamente (reaparece H5: metadata en claro).
**Fix:** el guard debe exigir `--tls-cert`/`--tls-key` cuando el bind no es loopback.
`public + enforce + sin TLS``log.Fatal`.
**DoD:** portar `TestGap_PublicEnforceNoTLS` → ahora `validateBootConfig("0.0.0.0",
enforce, "", "")` **rechaza**. Golden (público+enforce+TLS OK) + edge (loopback sin TLS
sigue OK para dev) + error (público sin TLS aborta).
## 0005e — H4 (Medio, residual): evaluar y completar la ACL por subject
**Contexto:** 0003 añadió `pkg/membership/acl.go`. Primero **evaluar** con el ataque del
report 0006 (`TestReaudit_H4_WildcardMetadataLeak`: un registrado no-miembro con
`Subscribe(">")` raw capta subjects + advisories de JetStream de rooms ajenas) si ese
acl.go ya lo cierra.
- Si lo cierra → portar el test como regresión y documentar.
- Si NO (probable: la ACL real necesita NATS `Permissions` por identidad a nivel del
authenticator/cuenta, no solo lógica de membership en el control plane) → implementar las
Permissions por identidad derivadas de pertenencia, o documentar el límite y el plan.
**DoD:** `TestReaudit_H4_WildcardMetadataLeak` → el no-miembro ya NO capta los subjects de
rooms ajenas (o, si queda residual, está documentado con su límite exacto).
# Fuera de alcance (otros issues)
- **H9** (cuota/GC de blobs) → issue 0002; se solapa con 0005c (streaming a disco).
- **H10** (AEAD nonce) / **H11** (nonce/ts en firma de owner) → bajo, futuro.
- **H8** (custodia de la CA: generar en om) → operacional del deploy.
- **Auditoría de la superficie nueva de 0003** (cluster routes auth, jetstreamStore KV
fail-closed, nonce-cache replicado, failover) → el report 0006 NO la cubrió (auditó
pre-0003). Pendiente una re-auditoría dedicada de 0003 (prompt ya preparado).
# Definition of Done global
- `govulncheck ./...` → 0 alcanzables.
- Los tests adversariales de la re-auditoría (`TestReaudit_SigNilSpoof`,
`TestGap_PublicEnforceNoTLS`, `TestReaudit_H4_WildcardMetadataLeak`, DoS-concurrencia)
portados como regresión y en verde (o el residual documentado).
- `CGO_ENABLED=0 go build ./... && go vet ./... && go test ./...` verdes (incluido el e2e
multi-nodo de 0003, para confirmar que el bump de nats-server no lo rompió).
- Re-evaluación: el veredicto de exposición pública pasa de "NO-aún" a "sí-con-condiciones".
@@ -0,0 +1,160 @@
---
issue: 0006
title: Completar y endurecer el cluster — wiring del control plane KV + N1-N6 de la auditoría 0008
status: done
created: 2026-06-07
closed: 2026-06-07
closed_by: fases 0006a0006g (ver report 0009); unibus v0.8.0
domain: security
scope: unibus (cmd/membershipd, pkg/membership, pkg/embeddednats, pkg/busauth, pkg/client)
depends_on: 0003 (completa su wiring), 0005 (hereda el bus single-node ya seguro)
blocks: 0003f (deploy del cluster descentralizado)
source: projects/message_bus/reports/0008-2026-06-07-unibus-decentralization-audit.md
---
# Objetivo
La auditoría dedicada de la superficie de 0003 (report 0008) concluyó: **el bus en
cluster NO es seguro para público** por dos bloqueantes, y además **0003 dejó el
control plane descentralizado SIN cablear** (el binario sigue usando SQLite single-store;
el flag `decentralized` existe pero ningún código Go lo lee). Como nodo único standalone
unibus YA es seguro (report 0008 lo confirma); como cluster, no.
Este issue cierra los bloqueantes de seguridad del cluster Y completa el wiring que 0003
dejó a medias, de modo que el deploy descentralizado (0003f) sea seguro. Cada fase
reproduce el ataque del report 0008 (`TestAttack0008_*`) y verifica que ahora se rechaza.
# Fases (TBD, ramas `issue/0006x-*`)
## 0006a — N3 (BLOQUEANTE): cablear el nonce replicado en el binario
**Hallazgo (ALTA):** `membershipd` **nunca llama** `Server.UseReplicatedNonces`; cada nodo
usa `memNonceCache` por-proceso. Un request firmado aceptado en el nodo A se **replaya con
éxito en el nodo B** (200+200). La API (`kvNonceStore`) y el test
(`TestReplicatedNonceRejectsCrossNodeReplay`) existen, pero el binario no los invoca.
**Fix:** en `cmd/membershipd/main.go`, cuando se arranca con `--cluster-name` (o siempre que
haya JetStream disponible), llamar `srv.UseReplicatedNonces(js, replicas)` y **fail-fast** si
el bucket `KV_UNIBUS_nonces` no se crea. Regla dura: `--cluster-name != ""` ⇒ nonce replicado
**obligatorio** (no arrancar un nodo de cluster con nonce-cache local).
**DoD:** reproducir `TestAttack0008_N3` (2 nodos con el wiring exacto del binario) → el replay
del nonce al nodo B ahora da **401**. Golden (request normal OK en cualquier nodo) + edge
(single-node sin cluster sigue usando cache local OK) + error (replay cross-node → 401).
## 0006b — N2 (BLOQUEANTE): cerrar `$JS.API.>` / aislar el control plane KV
**Hallazgo (ALTA):** el grant ACL `clientInfraSubjects = {"_INBOX.>", "$JS.API.>"}`
(`acl.go:20`) deja a cualquier peer registrado leer los buckets KV del control plane
(`KV_UNIBUS_users/rooms/members/room_keys`) directo por NATS, saltándose `requireMember` y
los chequeos del HTTP. Fuga del allowlist (handles+roles+claves), grafo de rooms y metadata
de sealed-keys. (La ESCRITURA al KV ya está denegada — verificado; la fuga es de lectura.)
**Fix (elegir y documentar):**
- Sustituir el grant amplio `$JS.API.>` por permisos JetStream **mínimos por-room** (solo la
API del stream/consumer de las rooms del peer), y **denegar explícitamente** los streams
`KV_UNIBUS_*` y `OBJ_*`; o
- (Más robusto) aislar el control plane KV en una NATS **account separada**, inaccesible
desde la account de clientes.
**DoD:** reproducir `TestAttack0008_N2` → eve (registrada, no miembro) ya **NO** puede leer
los buckets KV (`Permissions Violation` o equivalente). La JetStream API legítima de las
rooms del peer sigue funcionando.
## 0006c — wiring del control plane KV (completar 0003)
**Hallazgo (MEDIA / raíz):** el binario no activa el store descentralizado. `membership.Open`
(SQLite) está hardcodeado en `main.go:90`; `OpenJetStream` solo lo usa `migrate-to-kv`.
**Fix:** leer el flag `decentralized` (o un `--store kv|sqlite`) y **seleccionar el store** en
el arranque: SQLite (default, single-node/dev) o `jetstreamStore` (cluster). Resolver el
"ciclo bootstrap" del authenticator interno (el authenticator necesita el store para
`IsAuthorized`, y el store KV necesita el NATS arrancado). Mantener branch-by-abstraction:
con el flag off, comportamiento idéntico al actual. `IsAuthorized`/lecturas sobre KV
**fail-closed** ante pérdida de quorum/timeout (ya implementado en `jetstreamStore`
verificar que el wiring lo preserva).
**DoD:** con `decentralized: on` + cluster, el control plane sirve desde el KV replicado y un
nodo nuevo ve las rooms creadas en otro (cierra la divergencia de estado que nota N5).
Fail-closed: simular KV no disponible → deniega. Con flag off, suite idéntica al baseline.
## 0006d — N1 (ALTA): posture homogénea del cluster
**Hallazgo:** el cluster es tan seguro como su nodo más débil; un nodo sin authenticator o
`--bus-auth off` deja a un peer no autenticado `Subscribe(">")` y cosechar el tráfico
reenviado de los nodos con ACL.
**Fix:** garantizar (en arranque/health) que todos los nodos corren `enforce`+ACL+TLS;
rechazar formar cluster con un peer en posture inferior, o como mínimo documentar y exponer
un health que lo detecte. Nunca exponer el puerto de cliente de un nodo sin enforce.
**DoD:** reproducir `TestAttack0008_N1` escenario 2 (cluster con un nodo `withACL=false`) →
el arranque/health lo rechaza o lo señala; documentar la garantía.
## 0006e — N4 (MEDIA): RefreshSession en los clientes
**Hallazgo:** la ACL congela permisos al conectar; un peer que crea/se une a una room debe
llamar `client.RefreshSession()` para pub/sub en su subject. **Ningún cliente lo llama**
(`cmd/chat`, `cmd/worker`, `mobile`, `gateway`). Es fail-closed (deniega), pero rompe la
usabilidad bajo `enforce`+ACL → empuja al operador a desactivar la ACL (regresión de
seguridad a discreción del operador).
**Fix:** llamar `RefreshSession` tras cambios de membresía en `cmd/chat`/`cmd/worker` (y
documentar el contrato para `mobile`/`gateway`), o implementar refresh transparente (rehacer
suscripciones automáticamente al unirse a una room).
**DoD:** test que crea/une room bajo enforce+ACL y publica/recibe SIN intervención manual
(el cliente refresca solo o el demo llama RefreshSession). Documentar el requisito.
## 0006f — bajos: CA de routes, secreto de cluster, migrate-to-kv, R1≠HA
- **N1 (BAJA):** CA **separada** para las routes del cluster (no reusar la CA del data plane
de clientes); pasar el secreto de cluster por **archivo/env**, no por `--routes
nats://user:pass@host` en argv (hoy visible en `ps`/`journald`).
- **N6 (BAJA):** `migrate-to-kv` solo en loopback o con TLS (hoy el allowlist viaja plaintext
si `--nats-url` remoto sin `--ca`).
- **N3-DoS (MEDIA, doc):** documentar que el nonce/control plane en **R1 es SPOF de auth** (su
caída rechaza todos los requests autenticados); R3 (quorum 2/3) es la condición de HA real.
No vender R1 como HA.
## 0006g — preparar el material de deploy del cluster (3 nodos)
Los tres nodos del cluster están decididos: **magnus + homer + datardos** (3 VPS OVH →
quorum R3 real, tolera la caída de uno). Datos: homer `141.94.69.66`; datardos `ssh dd`
`51.91.100.142` (WG `datardos-wg` 10.21.0.x); magnus en `pass` (`MAGNUS_ovh_ssh_ROOT`).
**Preparar (NO ejecutar en los VPS — eso es 0003f, lo hace el humano):** dejar en
`deploy/cluster/` el material parametrizado por nodo:
- `generate-cluster-certs.sh` — CA propia del cluster (separada de la de clientes, ver 0006f)
+ un server cert por nodo con SAN = su IP pública + su IP WG + hostname.
- una plantilla de systemd unit por nodo (`membershipd@.service` o tres units) con
`--bind 0.0.0.0 --bus-auth enforce --tls-cert … --cluster-name unibus --routes
nats://…@<otros-2-nodos> --store kv` y `Restart=always`, secreto de cluster por archivo/env.
- `deploy-cluster.sh` (cross-build linux + rsync por nodo + plan de arranque escalonado).
- un `README.md` con el runbook: orden de arranque, seed del admin, `migrate-to-kv` (loopback/TLS),
escalado de réplicas a R3 (`nats stream update --replicas 3`), verificación de quorum y chaos
test (matar un nodo). Marcar claramente qué pasos toca el humano.
**DoD:** el material existe y es coherente (los certs cubren los 3 nodos; las units referencian
los routes correctos); un `bash -n` de los scripts pasa; el README describe el deploy end-to-end.
NO se toca ningún VPS desde el agente.
# Fuera de alcance (otros issues / operacional)
- **H8** (CA generada/custodiada en om) → operacional del deploy 0003f.
- **H9/H10/H11** → issue 0002 / futuro.
- **Object Store (blobs) vía `$JS.API.>`**: el report 0008 lo marca como "probable misma
clase que N2, no verificado" (impacto menor: blobs son ciphertext E2E). El fix de 0006b
(denegar `OBJ_*`) lo cubre; verificar.
- **Chaos test de red real** (matar 1/3, 2/3, split-brain) → 0003f (requiere 3 VPS).
# Definition of Done global
- `TestAttack0008_N3` → replay cross-node **401**; `TestAttack0008_N2` → eve no lee buckets KV;
`TestAttack0008_N1` → nodo débil rechazado/señalado. Portados como regresión.
- Con `decentralized: on`: control plane sobre KV replicado, fail-closed verificado, estado
consistente entre nodos. Con flag off: baseline idéntico.
- Clientes operan bajo `enforce`+ACL sin intervención manual (RefreshSession resuelto).
- `CGO_ENABLED=0 go build ./... && go vet ./... && go test ./...` verdes + `govulncheck` 0.
- Veredicto re-evaluado: el bus DESCENTRALIZADO pasa de "NO" a "sí-con-condiciones" (3 nodos
R3 para HA real, posture homogénea, CA en om).
@@ -0,0 +1,78 @@
---
issue: 0007
title: Cifrado at-rest del control plane (JetStream KV / SQLite en disco)
status: spec
created: 2026-06-07
domain: security
scope: unibus (pkg/embeddednats, cmd/membershipd, deploy/cluster) + procedimiento de migración del store existente
---
# Objetivo
Cifrar en reposo el almacenamiento del plano de control para que un nodo comprometido
(root en el VPS) o un disco robado no exponga los metadatos de control en claro.
Estado actual (auditado el 07/06/2026, report 0012 y siguientes):
- **Contenido de los mensajes**: cifrado E2E por room (megolm/olm). El servidor nunca ve el
plaintext; no vive en el plano de control. **No es el objeto de este issue.**
- **Claves de room** (`UNIBUS_room_keys`): guardadas **selladas** (sealed box X25519, cifradas
para cada miembro). El servidor las almacena y reparte pero no puede abrirlas. **Ya protegidas.**
- **Metadatos de control** (`UNIBUS_rooms`, `UNIBUS_members`, `UNIBUS_rooms_by_member`,
`UNIBUS_users`): se serializan con `json.Marshal` y se escriben **en claro** en el store. En
cluster ese store es el directorio `local_files/jetstream/` de cada nodo; en single-node es el
archivo SQLite `local_files/unibus.db`. Hoy **no hay cifrado at-rest**: con root en un nodo se
pueden leer subjects de salas, la pertenencia (quién está en qué sala con qué rol), los handles
y roles de los usuarios, y las claves públicas (signPub/kexPub). No se exponen mensajes (E2E) ni
se pueden descifrar salas (claves selladas), pero sí toda la topología.
Tras este issue, los buckets/archivos del control plane quedan cifrados en disco con una clave por
nodo gestionada fuera de git. El modelo de amenaza pasa de "root del nodo ve la topología" a "root
del nodo necesita además la clave at-rest (que puede vivir en un secreto separado / TPM / variable
de entorno inyectada) para leer cualquier cosa".
# Contexto técnico
- NATS Server / JetStream soporta **encryption at-rest** nativo: se configura una cifra
(`aes` o `chacha20`) y una clave; JetStream cifra los ficheros de los streams/KV en disco. El
bus usa un NATS **embebido** (`pkg/embeddednats`), así que la activación es por opciones del
servidor embebido, no por un `nats-server.conf` externo.
- Para el backend SQLite (single-node) el equivalente sería SQLCipher o cifrado a nivel de
archivo/FS; queda como sub-tarea de menor prioridad porque el despliegue real es cluster (KV).
# Tareas
1. Confirmar la API de encryption-at-rest del NATS embebido en la versión usada (opción de
servidor para cipher + clave; cómo se pasa la clave de forma que no quede en argv ni en git).
2. Activar el cifrado en `pkg/embeddednats` detrás de una opción de configuración. La clave se
inyecta por archivo (`--jetstream-encryption-key-file`, 0600, junto a las claves TLS del nodo)
o variable de entorno desde el unit systemd; nunca en argv ni commiteada.
3. `cmd/membershipd`: flag/env para la clave + reflejar el estado en la posture publicada en
`/healthz` (p.ej. `"at_rest":true`) para que el monitor lo verifique.
4. `deploy/cluster`: provisionar la clave at-rest por nodo (generación + `pass`/secrets gitignored)
y cablearla en `cluster.env` + el unit. Documentar en el runbook.
5. **Migración del store existente** (gotcha crítico): JetStream no re-cifra retroactivamente los
datos ya escritos en claro. Diseñar y documentar el procedimiento seguro para el cluster en
producción (probable: backup → exportar snapshot del control plane → parar nodo → recrear el
store con la clave activa → re-importar; o rotación nodo a nodo aprovechando la replicación R3).
Respetar la regla de migraciones (aditivo, sin pérdida de datos).
6. Tests: arrancar un nodo con clave at-rest, escribir un user/room, y verificar que el fichero en
disco **no** contiene en claro un subject/handle conocido (grep negativo), y que el nodo sigue
leyéndolos con la clave. Verificar que sin la clave el store no se abre.
# Definition of Done
- Cifrado at-rest activo en los 3 nodos del cluster; `/healthz` lo refleja en la posture.
- Evidencia ejecutable: un valor conocido (subject de sala / handle de usuario) **no** aparece en
claro al hacer `grep` sobre `local_files/jetstream/`; el nodo lo sigue sirviendo con la clave.
- Procedimiento de migración probado sobre datos reales sin pérdida (snapshot/restore verificado).
- La clave at-rest nunca está en git ni en argv; vive en archivo 0600 / secreto inyectado.
- No baja ninguna otra capa de seguridad (enforce + ACL + TLS + E2E + sealed keys intactas).
# Notas
Aditivo y ortogonal al resto de la seguridad: TLS protege en tránsito, E2E el contenido, las claves
de room van selladas; este issue cierra el último hueco (metadatos de control en claro en disco)
para el modelo de amenaza "VPS comprometido / disco robado". Prioridad media: el despliegue ya es
seguro frente a ataques de red (enforce+TLS+ACL); esto endurece frente a compromiso físico/root del
host. Relacionado con el endurecimiento de los issues 0004/0005/0006.
+10 -9
View File
@@ -1,25 +1,28 @@
module github.com/enmanuel/unibus
go 1.25.0
go 1.26.4
replace fn-registry => ../../../../
require (
fn-registry v0.0.0-00010101000000-000000000000
github.com/nats-io/nats-server/v2 v2.10.22
github.com/nats-io/nats.go v1.37.0
github.com/nats-io/nats-server/v2 v2.11.15
github.com/nats-io/nats.go v1.49.0
github.com/nats-io/nkeys v0.4.15
github.com/oklog/ulid/v2 v2.1.0
golang.org/x/time v0.15.0
modernc.org/sqlite v1.47.0
)
require (
github.com/antithesishq/antithesis-sdk-go v0.6.0-default-no-op // indirect
github.com/dustin/go-humanize v1.0.1 // indirect
github.com/google/go-tpm v0.9.8 // indirect
github.com/google/uuid v1.6.0 // indirect
github.com/klauspost/compress v1.18.3 // indirect
github.com/klauspost/compress v1.18.4 // indirect
github.com/mattn/go-isatty v0.0.20 // indirect
github.com/minio/highwayhash v1.0.3 // indirect
github.com/nats-io/jwt/v2 v2.5.8 // indirect
github.com/nats-io/nkeys v0.4.7 // indirect
github.com/minio/highwayhash v1.0.4-0.20251030100505-070ab1a87a76 // indirect
github.com/nats-io/jwt/v2 v2.8.1 // indirect
github.com/nats-io/nuid v1.0.1 // indirect
github.com/ncruces/go-strftime v1.0.0 // indirect
github.com/remyoudompheng/bigfft v0.0.0-20230129092748-24d4a6f8daec // indirect
@@ -28,8 +31,6 @@ require (
golang.org/x/mod v0.36.0 // indirect
golang.org/x/sync v0.20.0 // indirect
golang.org/x/sys v0.44.0 // indirect
golang.org/x/text v0.37.0 // indirect
golang.org/x/time v0.7.0 // indirect
golang.org/x/tools v0.45.0 // indirect
modernc.org/libc v1.70.0 // indirect
modernc.org/mathutil v1.7.1 // indirect
+24 -16
View File
@@ -1,25 +1,31 @@
github.com/antithesishq/antithesis-sdk-go v0.6.0-default-no-op h1:kpBdlEPbRvff0mDD1gk7o9BhI16b9p5yYAXRlidpqJE=
github.com/antithesishq/antithesis-sdk-go v0.6.0-default-no-op/go.mod h1:IUpT2DPAKh6i/YhSbt6Gl3v2yvUZjmKncl7U91fup7E=
github.com/dustin/go-humanize v1.0.1 h1:GzkhY7T5VNhEkwH0PVJgjz+fX1rhBrR7pRT3mDkpeCY=
github.com/dustin/go-humanize v1.0.1/go.mod h1:Mu1zIs6XwVuF/gI1OepvI0qD18qycQx+mFykh5fBlto=
github.com/google/go-cmp v0.6.0 h1:ofyhxvXcZhMsU5ulbFiLKl/XBFqE1GSq7atu8tAmTRI=
github.com/google/go-cmp v0.6.0/go.mod h1:17dUlkBOakJ0+DkrSSNjCkIjxS6bF9zb3elmeNGIjoY=
github.com/google/go-tpm v0.9.8 h1:slArAR9Ft+1ybZu0lBwpSmpwhRXaa85hWtMinMyRAWo=
github.com/google/go-tpm v0.9.8/go.mod h1:h9jEsEECg7gtLis0upRBQU+GhYVH6jMjrFxI8u6bVUY=
github.com/google/pprof v0.0.0-20250317173921-a4b03ec1a45e h1:ijClszYn+mADRFY17kjQEVQ1XRhq2/JR1M3sGqeJoxs=
github.com/google/pprof v0.0.0-20250317173921-a4b03ec1a45e/go.mod h1:boTsfXsheKC2y+lKOCMpSfarhxDeIzfZG1jqGcPl3cA=
github.com/google/uuid v1.6.0 h1:NIvaJDMOsjHA8n1jAhLSgzrAzy1Hgr+hNrb57e+94F0=
github.com/google/uuid v1.6.0/go.mod h1:TIyPZe4MgqvfeYDBFedMoGGpEw/LqOeaOT+nhxU+yHo=
github.com/hashicorp/golang-lru/v2 v2.0.7 h1:a+bsQ5rvGLjzHuww6tVxozPZFVghXaHOwFs4luLUK2k=
github.com/hashicorp/golang-lru/v2 v2.0.7/go.mod h1:QeFd9opnmA6QUJc5vARoKUSoFhyfM2/ZepoAG6RGpeM=
github.com/klauspost/compress v1.18.3 h1:9PJRvfbmTabkOX8moIpXPbMMbYN60bWImDDU7L+/6zw=
github.com/klauspost/compress v1.18.3/go.mod h1:R0h/fSBs8DE4ENlcrlib3PsXS61voFxhIs2DeRhCvJ4=
github.com/klauspost/compress v1.18.4 h1:RPhnKRAQ4Fh8zU2FY/6ZFDwTVTxgJ/EMydqSTzE9a2c=
github.com/klauspost/compress v1.18.4/go.mod h1:R0h/fSBs8DE4ENlcrlib3PsXS61voFxhIs2DeRhCvJ4=
github.com/mattn/go-isatty v0.0.20 h1:xfD0iDuEKnDkl03q4limB+vH+GxLEtL/jb4xVJSWWEY=
github.com/mattn/go-isatty v0.0.20/go.mod h1:W+V8PltTTMOvKvAeJH7IuucS94S2C6jfK/D7dTCTo3Y=
github.com/minio/highwayhash v1.0.3 h1:kbnuUMoHYyVl7szWjSxJnxw11k2U709jqFPPmIUyD6Q=
github.com/minio/highwayhash v1.0.3/go.mod h1:GGYsuwP/fPD6Y9hMiXuapVvlIUEhFhMTh0rxU3ik1LQ=
github.com/nats-io/jwt/v2 v2.5.8 h1:uvdSzwWiEGWGXf+0Q+70qv6AQdvcvxrv9hPM0RiPamE=
github.com/nats-io/jwt/v2 v2.5.8/go.mod h1:ZdWS1nZa6WMZfFwwgpEaqBV8EPGVgOTDHN/wTbz0Y5A=
github.com/nats-io/nats-server/v2 v2.10.22 h1:Yt63BGu2c3DdMoBZNcR6pjGQwk/asrKU7VX846ibxDA=
github.com/nats-io/nats-server/v2 v2.10.22/go.mod h1:X/m1ye9NYansUXYFrbcDwUi/blHkrgHh2rgCJaakonk=
github.com/nats-io/nats.go v1.37.0 h1:07rauXbVnnJvv1gfIyghFEo6lUcYRY0WXc3x7x0vUxE=
github.com/nats-io/nats.go v1.37.0/go.mod h1:Ubdu4Nh9exXdSz0RVWRFBbRfrbSxOYd26oF0wkWclB8=
github.com/nats-io/nkeys v0.4.7 h1:RwNJbbIdYCoClSDNY7QVKZlyb/wfT6ugvFCiKy6vDvI=
github.com/nats-io/nkeys v0.4.7/go.mod h1:kqXRgRDPlGy7nGaEDMuYzmiJCIAAWDK0IMBtDmGD0nc=
github.com/minio/highwayhash v1.0.4-0.20251030100505-070ab1a87a76 h1:KGuD/pM2JpL9FAYvBrnBBeENKZNh6eNtjqytV6TYjnk=
github.com/minio/highwayhash v1.0.4-0.20251030100505-070ab1a87a76/go.mod h1:GGYsuwP/fPD6Y9hMiXuapVvlIUEhFhMTh0rxU3ik1LQ=
github.com/nats-io/jwt/v2 v2.8.1 h1:V0xpGuD/N8Mi+fQNDynXohVvp7ZztevW5io8CUWlPmU=
github.com/nats-io/jwt/v2 v2.8.1/go.mod h1:nWnOEEiVMiKHQpnAy4eXlizVEtSfzacZ1Q43LIRavZg=
github.com/nats-io/nats-server/v2 v2.11.15 h1:StSf9TINInaZtr4oww2+kXmfwa9SkN//g/LwS19/UJ0=
github.com/nats-io/nats-server/v2 v2.11.15/go.mod h1:zwhv8Y0PE3KHyKgznJc/9Xoai638SaJd83zzJ5GJn74=
github.com/nats-io/nats.go v1.49.0 h1:yh/WvY59gXqYpgl33ZI+XoVPKyut/IcEaqtsiuTJpoE=
github.com/nats-io/nats.go v1.49.0/go.mod h1:fDCn3mN5cY8HooHwE2ukiLb4p4G4ImmzvXyJt+tGwdw=
github.com/nats-io/nkeys v0.4.15 h1:JACV5jRVO9V856KOapQ7x+EY8Jo3qw1vJt/9Jpwzkk4=
github.com/nats-io/nkeys v0.4.15/go.mod h1:CpMchTXC9fxA5zrMo4KpySxNjiDVvr8ANOSZdiNfUrs=
github.com/nats-io/nuid v1.0.1 h1:5iA8DT8V7q8WK2EScv2padNa/rTESc1KdnPw4TC2paw=
github.com/nats-io/nuid v1.0.1/go.mod h1:19wcPz3Ph3q0Jbyiqsd0kePYG7A95tJPxeL+1OSON2c=
github.com/ncruces/go-strftime v1.0.0 h1:HMFp8mLCTPp341M/ZnA4qaf7ZlsbTc+miZjCLOFAw7w=
@@ -41,12 +47,14 @@ golang.org/x/sys v0.6.0/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg=
golang.org/x/sys v0.21.0/go.mod h1:/VUhepiaJMQUp4+oa/7Zr1D23ma6VTLIYjOOTFZPUcA=
golang.org/x/sys v0.44.0 h1:ildZl3J4uzeKP07r2F++Op7E9B29JRUy+a27EibtBTQ=
golang.org/x/sys v0.44.0/go.mod h1:4GL1E5IUh+htKOUEOaiffhrAeqysfVGipDYzABqnCmw=
golang.org/x/text v0.37.0 h1:Cqjiwd9eSg8e0QAkyCaQTNHFIIzWtidPahFWR83rTrc=
golang.org/x/text v0.37.0/go.mod h1:a5sjxXGs9hsn/AJVwuElvCAo9v8QYLzvavO5z2PiM38=
golang.org/x/time v0.7.0 h1:ntUhktv3OPE6TgYxXWv9vKvUSJyIFJlyohwbkEwPrKQ=
golang.org/x/time v0.7.0/go.mod h1:3BpzKBy/shNhVucY/MWOyx10tF3SFh9QdLuxbVysPQM=
golang.org/x/time v0.15.0 h1:bbrp8t3bGUeFOx08pvsMYRTCVSMk89u4tKbNOZbp88U=
golang.org/x/time v0.15.0/go.mod h1:Y4YMaQmXwGQZoFaVFk4YpCt4FLQMYKZe9oeV/f4MSno=
golang.org/x/tools v0.45.0 h1:18qN3FAooORvApf5XjCXgsuayZOEtXf6JK18I3+ONa8=
golang.org/x/tools v0.45.0/go.mod h1:LuUGqqaXcXMEFEruIVJVm5mgDD8vww/z/SR1gQ4uE/0=
golang.org/x/tools/go/expect v0.1.1-deprecated h1:jpBZDwmgPhXsKZC6WhL20P4b/wmnpsEAGHaNy0n/rJM=
golang.org/x/tools/go/expect v0.1.1-deprecated/go.mod h1:eihoPOH+FgIqa3FpoTwguz/bVUSGBlGQU67vpBeOrBY=
golang.org/x/tools/go/packages/packagestest v0.1.1-deprecated h1:1h2MnaIAIXISqTFKdENegdpAgUXz6NrPEsbIeWaBRvM=
golang.org/x/tools/go/packages/packagestest v0.1.1-deprecated/go.mod h1:RVAQXBGNv1ib0J382/DPCRS/BPnsGebyM1Gj5VSDpG8=
modernc.org/cc/v4 v4.27.1 h1:9W30zRlYrefrDV2JE2O8VDtJ1yPGownxciz5rrbQZis=
modernc.org/cc/v4 v4.27.1/go.mod h1:uVtb5OGqUKpoLWhqwNQo/8LwvoiEBLvZXIQ/SmO6mL0=
modernc.org/ccgo/v4 v4.32.0 h1:hjG66bI/kqIPX1b2yT6fr/jt+QedtP2fqojG2VrFuVw=
+22
View File
@@ -0,0 +1,22 @@
-- 002_users.sql — bus-level user directory (issue 0001a).
--
-- The authoritative allowlist of identities permitted to use the bus, independent
-- of room membership. A user is identified by its Ed25519 signing public key (the
-- same key that derives the endpoint via frame.EndpointID); roles gate admin-only
-- control-plane operations; status enables revocation without deleting history.
--
-- Additive and idempotent: safe to apply repeatedly. Never modify this file;
-- further schema changes go in new numbered migrations (see
-- .claude/rules/db_migrations.md). The embedded copy under
-- pkg/membership/migrations/002_users.sql mirrors this file byte-for-byte.
CREATE TABLE IF NOT EXISTS users (
sign_pub TEXT PRIMARY KEY, -- Ed25519 public key in lowercase hex (peer identity)
handle TEXT NOT NULL, -- human-readable name (unique recommended, not enforced as PK)
role TEXT NOT NULL DEFAULT 'member', -- 'admin' | 'member'
status TEXT NOT NULL DEFAULT 'active', -- 'active' | 'revoked'
created_at TEXT NOT NULL,
revoked_at TEXT
);
CREATE INDEX IF NOT EXISTS idx_users_status ON users(status);
-161
View File
@@ -1,161 +0,0 @@
// Package mobile exposes a flat, gomobile-friendly API over the unibus client
// so an Android app can join rooms, publish, and receive messages with the same
// end-to-end encryption as any native Go peer.
//
// gomobile only supports a limited set of types across the binding boundary
// (string, []byte, int, bool, error, named structs, and interfaces). This layer
// translates the richer client API into those primitives and delivers incoming
// frames through a Java/Kotlin-implemented FrameListener callback. No protocol
// or cryptography is reimplemented here: every call delegates to pkg/client,
// which is the single source of truth shared with every other peer on the bus.
package mobile
import (
"encoding/base64"
"encoding/json"
"fmt"
"time"
"github.com/enmanuel/unibus/pkg/client"
"github.com/enmanuel/unibus/pkg/frame"
"github.com/enmanuel/unibus/pkg/room"
)
// FrameListener receives decrypted messages for a subscribed room. The Android
// side implements this interface. Its methods are invoked from a NATS delivery
// goroutine, so implementations must hop back to the UI thread (for example via
// a coroutine on the main dispatcher) before touching Android views.
type FrameListener interface {
OnFrame(roomID string, sender string, msgID string, text string)
}
// Session is a connected unibus peer. Create it with NewSession and close it
// with Close when the app stops.
type Session struct {
c *client.Client
}
// GenerateIdentity creates (or loads) the long-term keypair stored at path.
// Call it once on first launch. The resulting file holds the peer's private
// Ed25519 and X25519 keys and must be kept private to the app sandbox.
func GenerateIdentity(path string) error {
_, err := client.LoadOrCreateIdentity(path)
return err
}
// NewSession loads the identity at idPath and connects to the bus. natsURL is
// the data plane (for example nats://host:4250) and ctrlURL is the control
// plane HTTP endpoint (for example http://host:8470).
func NewSession(idPath, natsURL, ctrlURL string) (*Session, error) {
id, err := client.LoadOrCreateIdentity(idPath)
if err != nil {
return nil, err
}
c, err := client.New(natsURL, ctrlURL, id)
if err != nil {
return nil, err
}
return &Session{c: c}, nil
}
// EndpointID returns this peer's stable endpoint identifier, derived from its
// signing public key. It is the value that appears as the sender of frames.
func (s *Session) EndpointID() string {
return s.c.Endpoint().ID
}
// CreateRoom opens a room on the given subject. mode is "matrix" for the
// encrypted, persisted and signed policy, or "nats" for plain cleartext. It
// returns the room id used by Join, Publish and Subscribe.
func (s *Session) CreateRoom(subject, mode string) (string, error) {
p := room.ModeNATS
if mode == "matrix" {
p = room.ModeMatrix
}
return s.c.CreateRoom(subject, p)
}
// Join fetches the room key when the room is encrypted and prepares the session
// to publish to and receive from the room.
func (s *Session) Join(roomID string) error {
return s.c.Join(roomID)
}
// Publish sends a UTF-8 text message to the room.
func (s *Session) Publish(roomID, text string) error {
return s.c.Publish(roomID, []byte(text))
}
// Subscribe streams decrypted messages of the room to the listener until the
// session is closed.
func (s *Session) Subscribe(roomID string, l FrameListener) error {
_, err := s.c.Subscribe(roomID, func(f frame.Frame, plaintext []byte) {
l.OnFrame(roomID, f.Sender, f.MsgID, string(plaintext))
})
return err
}
// cardJSON is the portable, copy-pasteable public identity a peer shares so a
// room owner can invite it to an encrypted room. It carries no secret: only the
// endpoint id and the two public keys (signing + key-exchange), base64-encoded
// for transport over text or a QR code.
type cardJSON struct {
ID string `json:"id"`
SignPub string `json:"sign_pub"` // base64 std of the Ed25519 public key
KexPub string `json:"kex_pub"` // base64 std of the X25519 public key
}
// Card returns this peer's public identity as a portable JSON string. Share it
// (paste, QR) with a room owner so they can Invite you to an encrypted room. It
// contains no private key and is safe to transmit in the clear.
func (s *Session) Card() string {
ep := s.c.Endpoint()
b, _ := json.Marshal(cardJSON{
ID: ep.ID,
SignPub: base64.StdEncoding.EncodeToString(ep.SignPub),
KexPub: base64.StdEncoding.EncodeToString(ep.KexPub),
})
return string(b)
}
// Invite adds the holder of peerCard to roomID. peerCard is the JSON string the
// invitee produced with Card(). For encrypted rooms this seals the current room
// key to the invitee's X25519 public key and signs the request; the caller must
// be the room owner.
func (s *Session) Invite(roomID, peerCard string) error {
var card cardJSON
if err := json.Unmarshal([]byte(peerCard), &card); err != nil {
return fmt.Errorf("mobile: bad peer card: %w", err)
}
signPub, err := base64.StdEncoding.DecodeString(card.SignPub)
if err != nil {
return fmt.Errorf("mobile: bad sign_pub in card: %w", err)
}
kexPub, err := base64.StdEncoding.DecodeString(card.KexPub)
if err != nil {
return fmt.Errorf("mobile: bad kex_pub in card: %w", err)
}
return s.c.Invite(roomID, client.Endpoint{ID: card.ID, SignPub: signPub, KexPub: kexPub})
}
// Kick removes endpointID from roomID and, for encrypted rooms, rotates the room
// key to a new epoch so the removed peer cannot decrypt messages published after
// the kick (forward secrecy). The caller must be the room owner.
func (s *Session) Kick(roomID, endpointID string) error {
return s.c.Kick(roomID, endpointID)
}
// Request performs an RPC request/reply against subject and returns the reply
// payload as text. timeoutMs bounds the wait in milliseconds.
func (s *Session) Request(subject, text string, timeoutMs int) (string, error) {
out, err := s.c.Request(subject, []byte(text), time.Duration(timeoutMs)*time.Millisecond)
if err != nil {
return "", err
}
return string(out), nil
}
// Close disconnects the peer from the bus.
func (s *Session) Close() error {
return s.c.Close()
}
+27 -10
View File
@@ -1,9 +1,15 @@
// Package blobstore is a content-addressed object store on local disk.
// Package blobstore is a content-addressed object store for media ciphertext.
//
// The bus transports messages, not blobs. Media (images, files, large payloads)
// is encrypted by the client BEFORE being stored here, so the store only ever
// sees ciphertext. Objects are addressed by the sha256 hex of their (encrypted)
// bytes, which makes Put idempotent and deduplicating.
//
// Store is an interface (branch-by-abstraction, issue 0003d) with two backends:
// diskStore (the default, local filesystem) and objectStore (NATS Object Store
// on JetStream, replicated across the cluster so blobs survive a node loss and
// are reachable from any node). The wire contract (sha256-hex addressing) is
// identical, so a client cannot tell which backend a membershipd uses.
package blobstore
import (
@@ -14,27 +20,38 @@ import (
"path/filepath"
)
// Store is a directory-backed content-addressed blob store.
type Store struct {
// Store is a content-addressed blob store: Put returns the sha256-hex address of
// the stored bytes, Get fetches by that address, Has reports presence.
type Store interface {
Put(data []byte) (string, error)
Get(hash string) ([]byte, error)
Has(hash string) bool
}
// diskStore is a directory-backed content-addressed blob store (the default,
// single-node backend).
type diskStore struct {
dir string
}
// New creates a Store rooted at dir, creating the directory if needed.
func New(dir string) (*Store, error) {
// New creates a disk-backed Store rooted at dir, creating the directory if
// needed. It remains the default backend; the replicated NATS Object Store is
// constructed separately (NewObjectStore) when decentralization is enabled.
func New(dir string) (Store, error) {
if err := os.MkdirAll(dir, 0o755); err != nil {
return nil, fmt.Errorf("blobstore: mkdir %q: %w", dir, err)
}
return &Store{dir: dir}, nil
return &diskStore{dir: dir}, nil
}
// path returns the on-disk path for a given content hash.
func (s *Store) path(hash string) string {
func (s *diskStore) path(hash string) string {
return filepath.Join(s.dir, hash)
}
// Put writes data to the store and returns its sha256 hex hash. If an object
// with the same content already exists, Put is a no-op and returns the hash.
func (s *Store) Put(data []byte) (string, error) {
func (s *diskStore) Put(data []byte) (string, error) {
sum := sha256.Sum256(data)
hash := hex.EncodeToString(sum[:])
p := s.path(hash)
@@ -66,7 +83,7 @@ func (s *Store) Put(data []byte) (string, error) {
}
// Get reads the object with the given hash.
func (s *Store) Get(hash string) ([]byte, error) {
func (s *diskStore) Get(hash string) ([]byte, error) {
data, err := os.ReadFile(s.path(hash))
if err != nil {
return nil, fmt.Errorf("blobstore: get %q: %w", hash, err)
@@ -75,7 +92,7 @@ func (s *Store) Get(hash string) ([]byte, error) {
}
// Has reports whether an object with the given hash exists.
func (s *Store) Has(hash string) bool {
func (s *diskStore) Has(hash string) bool {
_, err := os.Stat(s.path(hash))
return err == nil
}
+102
View File
@@ -0,0 +1,102 @@
package blobstore
// objectStore is the NATS Object Store implementation of Store (issue 0003d):
// media ciphertext lives in a JetStream Object Store bucket replicated across
// the cluster, so a blob uploaded to one node is durable against the loss of a
// node and readable from any node. It is selected when decentralization is on;
// diskStore stays the single-node default. The content-addressing (sha256-hex)
// is identical to the disk backend, so the wire contract does not change.
import (
"context"
"crypto/sha256"
"encoding/hex"
"fmt"
"time"
"github.com/nats-io/nats.go/jetstream"
)
const (
defaultObjectBucket = "UNIBUS_blobs"
defaultObjOpTime = 10 * time.Second
)
// ObjectStoreConfig configures the replicated Object Store backend.
type ObjectStoreConfig struct {
// Bucket is the object store bucket name; empty uses UNIBUS_blobs.
Bucket string
// Replicas is the replication factor (R1..R5), matching the KV store's
// R1->R3 rollout.
Replicas int
// OpTimeout bounds each object operation; zero uses defaultObjOpTime.
OpTimeout time.Duration
}
type objectStore struct {
os jetstream.ObjectStore
opTimeout time.Duration
}
// NewObjectStore creates (or opens) the replicated Object Store bucket on js and
// returns it as a Store. The JetStream context belongs to the caller.
func NewObjectStore(js jetstream.JetStream, cfg ObjectStoreConfig) (Store, error) {
if cfg.Bucket == "" {
cfg.Bucket = defaultObjectBucket
}
if cfg.Replicas <= 0 {
cfg.Replicas = 1
}
opTimeout := cfg.OpTimeout
if opTimeout <= 0 {
opTimeout = defaultObjOpTime
}
ctx, cancel := context.WithTimeout(context.Background(), 15*time.Second)
defer cancel()
obj, err := js.CreateOrUpdateObjectStore(ctx, jetstream.ObjectStoreConfig{
Bucket: cfg.Bucket,
Replicas: cfg.Replicas,
Storage: jetstream.FileStorage,
})
if err != nil {
return nil, fmt.Errorf("blobstore: open object store %q (replicas=%d): %w", cfg.Bucket, cfg.Replicas, err)
}
return &objectStore{os: obj, opTimeout: opTimeout}, nil
}
func (s *objectStore) ctx() (context.Context, context.CancelFunc) {
return context.WithTimeout(context.Background(), s.opTimeout)
}
// Put stores data under its sha256-hex address. Re-putting identical bytes is a
// harmless overwrite (same address, same content), preserving the idempotent,
// deduplicating semantics of the disk backend.
func (s *objectStore) Put(data []byte) (string, error) {
sum := sha256.Sum256(data)
hash := hex.EncodeToString(sum[:])
ctx, cancel := s.ctx()
defer cancel()
if _, err := s.os.PutBytes(ctx, hash, data); err != nil {
return "", fmt.Errorf("blobstore: put object %q: %w", hash, err)
}
return hash, nil
}
// Get fetches the object by its hash address.
func (s *objectStore) Get(hash string) ([]byte, error) {
ctx, cancel := s.ctx()
defer cancel()
data, err := s.os.GetBytes(ctx, hash)
if err != nil {
return nil, fmt.Errorf("blobstore: get object %q: %w", hash, err)
}
return data, nil
}
// Has reports whether an object with the given hash exists.
func (s *objectStore) Has(hash string) bool {
ctx, cancel := s.ctx()
defer cancel()
_, err := s.os.GetInfo(ctx, hash)
return err == nil
}
+132
View File
@@ -0,0 +1,132 @@
package blobstore_test
import (
"bytes"
"crypto/sha256"
"encoding/hex"
"net"
"testing"
"time"
"github.com/enmanuel/unibus/pkg/blobstore"
"github.com/enmanuel/unibus/pkg/embeddednats"
"github.com/nats-io/nats.go"
"github.com/nats-io/nats.go/jetstream"
)
func objFreePort(t *testing.T) int {
t.Helper()
l, err := net.Listen("tcp", "127.0.0.1:0")
if err != nil {
t.Fatalf("free port: %v", err)
}
defer l.Close()
return l.Addr().(*net.TCPAddr).Port
}
// newObjectStore boots a single-node embedded NATS with JetStream and returns a
// replicated (R1) Object Store backend over it.
func newObjectStore(t *testing.T) blobstore.Store {
t.Helper()
ns, err := embeddednats.StartServer(embeddednats.ServerConfig{
StoreDir: t.TempDir(),
Host: "127.0.0.1",
Port: objFreePort(t),
})
if err != nil {
t.Fatalf("embedded nats: %v", err)
}
nc, err := nats.Connect(ns.ClientURL())
if err != nil {
ns.Shutdown()
t.Fatalf("nats connect: %v", err)
}
js, err := jetstream.New(nc)
if err != nil {
nc.Close()
ns.Shutdown()
t.Fatalf("jetstream: %v", err)
}
st, err := blobstore.NewObjectStore(js, blobstore.ObjectStoreConfig{Replicas: 1, OpTimeout: 5 * time.Second})
if err != nil {
nc.Close()
ns.Shutdown()
t.Fatalf("new object store: %v", err)
}
t.Cleanup(func() { nc.Close(); ns.Shutdown(); ns.WaitForShutdown() })
return st
}
// TestObjectStoreRoundTrip is the golden path: put ciphertext, get it back by
// its hash, Has reports presence, and re-putting identical bytes returns the
// same address (content-addressed dedup).
func TestObjectStoreRoundTrip(t *testing.T) {
s := newObjectStore(t)
data := []byte("encrypted-media-ciphertext-payload")
hash, err := s.Put(data)
if err != nil {
t.Fatalf("Put: %v", err)
}
want := hex.EncodeToString(sha256Sum(data))
if hash != want {
t.Fatalf("hash = %q, want sha256 hex %q", hash, want)
}
got, err := s.Get(hash)
if err != nil {
t.Fatalf("Get: %v", err)
}
if !bytes.Equal(got, data) {
t.Fatalf("Get returned %q, want %q", got, data)
}
if !s.Has(hash) {
t.Fatalf("Has should be true for a stored blob")
}
// Re-put identical bytes: same address, no error.
hash2, err := s.Put(data)
if err != nil || hash2 != hash {
t.Fatalf("re-Put: hash2=%q err=%v (want %q)", hash2, err, hash)
}
}
// TestObjectStoreMissing is the edge/error path: a hash that was never stored
// is absent and unreadable.
func TestObjectStoreMissing(t *testing.T) {
s := newObjectStore(t)
missing := hex.EncodeToString(sha256Sum([]byte("never stored")))
if s.Has(missing) {
t.Fatalf("Has should be false for an unknown hash")
}
if _, err := s.Get(missing); err == nil {
t.Fatalf("Get of an unknown hash should error")
}
}
// TestObjectStoreAddressMatchesDisk is the contract test: the Object Store and
// the disk backend address identical bytes to the IDENTICAL hash, so a client
// cannot tell which backend a node uses and a blob ref is portable across them.
func TestObjectStoreAddressMatchesDisk(t *testing.T) {
obj := newObjectStore(t)
disk, err := blobstore.New(t.TempDir())
if err != nil {
t.Fatalf("disk store: %v", err)
}
for _, payload := range [][]byte{[]byte("a"), []byte("longer ciphertext blob \x00\x01\x02"), {}} {
oh, err := obj.Put(payload)
if err != nil {
t.Fatalf("object Put: %v", err)
}
dh, err := disk.Put(payload)
if err != nil {
t.Fatalf("disk Put: %v", err)
}
if oh != dh {
t.Fatalf("address mismatch for %q: object=%q disk=%q", payload, oh, dh)
}
}
}
func sha256Sum(b []byte) []byte {
sum := sha256.Sum256(b)
return sum[:]
}
+154
View File
@@ -0,0 +1,154 @@
package busauth
import (
"encoding/base64"
server "github.com/nats-io/nats-server/v2/server"
"github.com/nats-io/nkeys"
)
// nkeyAuthenticator is a NATS server.Authentication that authorizes a client by
// verifying the nkey signature over the server-presented nonce and then
// consulting the bus user allowlist. Authorization is checked on every new
// connection via the injected predicate (not a static Options.Nkeys map), so
// revoking a user denies its next connection without restarting the server.
type nkeyAuthenticator struct {
// isAuthorized reports whether the lowercase-hex Ed25519 public key behind an
// nkey belongs to an active bus user. Injected (membership.Store.IsAuthorized)
// so this package stays free of the store dependency.
isAuthorized func(signPubHex string) bool
}
// NewNkeyAuthenticator builds a NATS custom authenticator backed by isAuthorized.
// Pass it to embeddednats so the data plane only accepts registered identities.
func NewNkeyAuthenticator(isAuthorized func(signPubHex string) bool) server.Authentication {
return &nkeyAuthenticator{isAuthorized: isAuthorized}
}
// Check verifies the client's nkey signature against the nonce the server
// presented, then maps the nkey to its allowlist key and checks authorization.
// Any malformed input or failed verification yields false (fail closed).
func (a *nkeyAuthenticator) Check(c server.ClientAuthentication) bool {
signPubHex, ok := verifyNkey(c)
if !ok {
return false
}
return a.isAuthorized(signPubHex)
}
// verifyNkey performs the shared nkey verification: it checks the client's
// signature against the server-presented nonce and returns the lowercase-hex
// Ed25519 public key behind the nkey. ok is false on any malformed input or
// failed verification (fail closed). The signature decoding mirrors
// nats-server's own (raw-url base64, then std base64 fallback) so genuine
// clients using nats.Nkey are accepted unchanged.
func verifyNkey(c server.ClientAuthentication) (signPubHex string, ok bool) {
opts := c.GetOpts()
if opts.Nkey == "" {
return "", false
}
sig, err := base64.RawURLEncoding.DecodeString(opts.Sig)
if err != nil {
sig, err = base64.StdEncoding.DecodeString(opts.Sig)
if err != nil {
return "", false
}
}
pub, err := nkeys.FromPublicKey(opts.Nkey)
if err != nil {
return "", false
}
if err := pub.Verify(c.GetNonce(), sig); err != nil {
return "", false
}
signPubHex, err = SignPubHexFromNkey(opts.Nkey)
if err != nil {
return "", false
}
return signPubHex, true
}
// PermissionsFunc maps a connecting identity (lowercase-hex Ed25519 signing key)
// to the NATS permissions it should be granted for this connection. Returning an
// error denies the connection (fail closed). It is how the data plane enforces
// per-subject access from room membership (issue 0003e, audit H4 residual).
type PermissionsFunc func(signPubHex string) (*server.Permissions, error)
// nkeyAuthenticatorACL is the nkey authenticator that ALSO scopes the connection
// to per-subject permissions derived from room membership. NATS evaluates
// permissions once, at connect time, so a peer that joins a room after
// connecting must reconnect (client.RefreshSession) to gain that room's subject
// — the dynamic-membership reconnection model the audit deferred to this issue.
type nkeyAuthenticatorACL struct {
isAuthorized func(signPubHex string) bool
perms PermissionsFunc
// internalPubHex is the lowercase-hex Ed25519 public key of membershipd's own
// ephemeral internal service identity. A connection that proves that key is
// granted full permissions WITHOUT consulting the allowlist, so the service can
// bootstrap and manage JetStream (the replicated nonce bucket and, when
// decentralized, the control-plane KV buckets) against its own embedded server
// even while the data plane confines every client to its rooms. Empty disables
// the internal-identity path entirely (behavior identical to a plain ACL
// authenticator).
internalPubHex string
}
// NewNkeyAuthenticatorACL builds an authenticator that authorizes by the bus
// allowlist AND registers per-subject permissions from perms. A registered but
// permission-less peer can no longer subscribe to or publish on arbitrary
// subjects: it is confined to the subjects of the rooms it belongs to (plus the
// client infrastructure subjects perms includes). This is the per-subject ACL
// the 0004 hardening left as a residual.
func NewNkeyAuthenticatorACL(isAuthorized func(signPubHex string) bool, perms PermissionsFunc) server.Authentication {
return &nkeyAuthenticatorACL{isAuthorized: isAuthorized, perms: perms}
}
// NewNkeyAuthenticatorACLInternal is NewNkeyAuthenticatorACL that also recognizes
// membershipd's internal service identity (internalPubHex, the lowercase hex of
// its ephemeral Ed25519 public key): a connection proving that key is granted
// full permissions without an allowlist lookup, so the service can create and
// manage JetStream against its own embedded server under enforce (issue 0006a/c —
// the replicated nonce bucket and the control-plane KV). Every other identity
// goes through the allowlist + per-subject ACL unchanged. An empty internalPubHex
// is identical to NewNkeyAuthenticatorACL, so this is a superset and safe to use
// everywhere the plain constructor was used.
func NewNkeyAuthenticatorACLInternal(isAuthorized func(signPubHex string) bool, perms PermissionsFunc, internalPubHex string) server.Authentication {
return &nkeyAuthenticatorACL{isAuthorized: isAuthorized, perms: perms, internalPubHex: internalPubHex}
}
// fullPermissions grants publish and subscribe on every subject (">"). It is the
// permission set for membershipd's own internal service connection, which must
// manage the JetStream control plane (nonce bucket + KV buckets) over NATS. It is
// NEVER granted to a bus user — only to the process's own ephemeral internal
// identity, recognized by exact public-key match in Check.
func fullPermissions() *server.Permissions {
sp := &server.SubjectPermission{Allow: []string{">"}}
return &server.Permissions{Publish: sp, Subscribe: sp}
}
// Check verifies the nkey, authorizes against the allowlist, then derives and
// registers the connection's subject permissions. A permissions-derivation
// error denies the connection (fail closed) rather than granting open access.
func (a *nkeyAuthenticatorACL) Check(c server.ClientAuthentication) bool {
signPubHex, ok := verifyNkey(c)
if !ok {
return false
}
// membershipd's own internal service identity bypasses the allowlist and is
// granted full permissions so the service can bootstrap JetStream under
// enforce. The key is matched exactly against the cryptographically verified
// connecting key, so no other identity can claim these permissions.
if a.internalPubHex != "" && signPubHex == a.internalPubHex {
c.RegisterUser(&server.User{Permissions: fullPermissions()})
return true
}
if !a.isAuthorized(signPubHex) {
return false
}
perms, err := a.perms(signPubHex)
if err != nil {
return false // fail closed: never grant open access on a derivation error
}
c.RegisterUser(&server.User{Permissions: perms})
return true
}
+76
View File
@@ -0,0 +1,76 @@
// Package busauth bridges a unibus peer's Ed25519 identity to NATS nkey
// authentication. A NATS nkey IS an Ed25519 keypair, so the bus reuses the
// peer's existing signing identity for the data plane instead of minting new
// key material — one identity authenticates both planes (HTTP request signatures
// and NATS connections), keyed in the user allowlist by the same Ed25519 public
// key.
//
// This is transport glue specific to NATS + unibus, not a general-purpose
// registry primitive: it deliberately lives in the app to avoid pulling
// github.com/nats-io/nkeys into the multi-domain registry module. The Ed25519
// signing/verification it relies on comes from the registry cybersecurity
// package; this package never reimplements a primitive.
package busauth
import (
"crypto/ed25519"
"encoding/hex"
"fmt"
"github.com/nats-io/nkeys"
)
// ClientNkey derives, from a peer's Ed25519 private key, the NATS user nkey
// public string ("U...") and a signature callback suitable for
// nats.Nkey(pub, sign). The callback signs the server-presented nonce with the
// same Ed25519 key, so the server can verify it and map it back to the bus user.
//
// signPriv must be a 64-byte Ed25519 private key (as produced by the registry's
// GenerateIdentity). Its first 32 bytes are the seed nkeys needs.
func ClientNkey(signPriv []byte) (pub string, sign func([]byte) ([]byte, error), err error) {
if len(signPriv) != ed25519.PrivateKeySize {
return "", nil, fmt.Errorf("busauth: signPriv must be %d bytes, got %d", ed25519.PrivateKeySize, len(signPriv))
}
seed := ed25519.PrivateKey(signPriv).Seed() // 32-byte Ed25519 seed
kp, err := nkeys.FromRawSeed(nkeys.PrefixByteUser, seed)
if err != nil {
return "", nil, fmt.Errorf("busauth: derive nkey from seed: %w", err)
}
pub, err = kp.PublicKey()
if err != nil {
return "", nil, fmt.Errorf("busauth: nkey public key: %w", err)
}
sign = func(nonce []byte) ([]byte, error) {
return kp.Sign(nonce)
}
return pub, sign, nil
}
// NkeyPublicFromSignPub derives the NATS user nkey public string from a 32-byte
// Ed25519 public key. It is the inverse view of the identity used by callers
// that have only the public key (e.g. to display or pre-register an nkey).
func NkeyPublicFromSignPub(signPub []byte) (string, error) {
if len(signPub) != ed25519.PublicKeySize {
return "", fmt.Errorf("busauth: signPub must be %d bytes, got %d", ed25519.PublicKeySize, len(signPub))
}
pub, err := nkeys.Encode(nkeys.PrefixByteUser, signPub)
if err != nil {
return "", fmt.Errorf("busauth: encode nkey public: %w", err)
}
return string(pub), nil
}
// SignPubHexFromNkey decodes a NATS user nkey public string ("U...") back to the
// lowercase hex of its 32-byte Ed25519 public key — the identity key used to
// look a peer up in the bus user allowlist. The server calls this to map the
// nkey a client presented to the users table.
func SignPubHexFromNkey(nkeyPub string) (string, error) {
raw, err := nkeys.Decode(nkeys.PrefixByteUser, []byte(nkeyPub))
if err != nil {
return "", fmt.Errorf("busauth: decode nkey %q: %w", nkeyPub, err)
}
if len(raw) != ed25519.PublicKeySize {
return "", fmt.Errorf("busauth: decoded nkey is %d bytes, want %d", len(raw), ed25519.PublicKeySize)
}
return hex.EncodeToString(raw), nil
}
+85
View File
@@ -0,0 +1,85 @@
package busauth
import (
"bytes"
"crypto/ed25519"
"encoding/hex"
"testing"
cs "fn-registry/functions/cybersecurity"
"github.com/nats-io/nkeys"
)
// TestNkeyRoundTrip is the dedicated sign/verify round-trip the spec requires
// BEFORE the NATS server depends on this conversion. It proves three things end
// to end: (1) ClientNkey produces a signature callback whose output verifies
// under the derived nkey public key; (2) that signature is exactly the Ed25519
// signature of the same identity (the nkey is the same key, not a new one);
// (3) the nkey public string maps back to the identity's Ed25519 hex, which is
// the key the allowlist is indexed by.
func TestNkeyRoundTrip(t *testing.T) {
id, err := cs.GenerateIdentity()
if err != nil {
t.Fatalf("identity: %v", err)
}
pub, sign, err := ClientNkey(id.SignPriv)
if err != nil {
t.Fatalf("ClientNkey: %v", err)
}
// (1) The callback's signature over a server-style nonce verifies under the
// public nkey, exactly as the NATS server will verify it.
nonce := []byte("server-presented-nonce-1234567890")
sig, err := sign(nonce)
if err != nil {
t.Fatalf("sign: %v", err)
}
kpPub, err := nkeys.FromPublicKey(pub)
if err != nil {
t.Fatalf("FromPublicKey: %v", err)
}
if err := kpPub.Verify(nonce, sig); err != nil {
t.Fatalf("nkey verify failed: %v", err)
}
// (2) The signature is the very same bytes as a raw Ed25519 sign with the
// identity's private key — confirming no separate key material was minted.
want := ed25519.Sign(ed25519.PrivateKey(id.SignPriv), nonce)
if !bytes.Equal(sig, want) {
t.Fatalf("nkey signature differs from Ed25519 signature of the same identity")
}
// (3) The nkey public maps back to the identity's Ed25519 hex (allowlist key).
gotHex, err := SignPubHexFromNkey(pub)
if err != nil {
t.Fatalf("SignPubHexFromNkey: %v", err)
}
if gotHex != hex.EncodeToString(id.SignPub) {
t.Fatalf("nkey->hex mismatch: got %s want %s", gotHex, hex.EncodeToString(id.SignPub))
}
// And NkeyPublicFromSignPub is consistent with ClientNkey's public.
pub2, err := NkeyPublicFromSignPub(id.SignPub)
if err != nil {
t.Fatalf("NkeyPublicFromSignPub: %v", err)
}
if pub2 != pub {
t.Fatalf("public nkey mismatch between derivations: %s vs %s", pub2, pub)
}
}
// Error path: a wrong-length private key is rejected, not silently misused.
func TestClientNkeyBadKey(t *testing.T) {
if _, _, err := ClientNkey([]byte("too-short")); err == nil {
t.Fatalf("expected error for short private key")
}
}
// Error path: a non-nkey string does not decode to an allowlist key.
func TestSignPubHexFromNkeyBad(t *testing.T) {
if _, err := SignPubHexFromNkey("not-a-real-nkey"); err == nil {
t.Fatalf("expected error decoding a bogus nkey")
}
}
+26
View File
@@ -0,0 +1,26 @@
package busauth
import server "github.com/nats-io/nats-server/v2/server"
// PermissionsFromSubjects adapts a subject-deriving function (e.g.
// membership.SubjectACLFor, which maps an identity to the subjects of the rooms
// it belongs to plus the client infrastructure subjects) into the PermissionsFunc
// the ACL authenticator expects. The derived subjects are granted as BOTH the
// publish and subscribe allow set, so a connection can only pub/sub on the
// subjects it is entitled to. A derivation error is propagated so the caller
// fails closed (denies the connection) rather than granting open access.
//
// This is the production wiring for the per-subject data-plane ACL (issue 0003e,
// audit H4): membershipd passes PermissionsFromSubjects(membership.SubjectACLFor(
// store)) to NewNkeyAuthenticatorACL. It lives in busauth (not membership) so the
// membership package stays free of the nats-server dependency.
func PermissionsFromSubjects(derive func(signPubHex string) ([]string, error)) PermissionsFunc {
return func(signPubHex string) (*server.Permissions, error) {
subjects, err := derive(signPubHex)
if err != nil {
return nil, err
}
sp := &server.SubjectPermission{Allow: subjects}
return &server.Permissions{Publish: sp, Subscribe: sp}, nil
}
}
+75
View File
@@ -0,0 +1,75 @@
package busauth
import (
"crypto/tls"
"crypto/x509"
"fmt"
"os"
)
// LoadCATLSConfig builds a *tls.Config that trusts ONLY the given CA certificate
// (PEM file), for a bus client pinning the project's self-signed CA. Because the
// bus uses a private CA rather than a public one, clients must pin it explicitly;
// trusting the system roots would reject the server cert. This is the single
// helper every client (Go peers, the mobile binding, the gateway) uses to turn a
// ca.crt path into a connection config.
func LoadCATLSConfig(caPEMPath string) (*tls.Config, error) {
pem, err := os.ReadFile(caPEMPath)
if err != nil {
return nil, fmt.Errorf("busauth: read CA %q: %w", caPEMPath, err)
}
pool := x509.NewCertPool()
if !pool.AppendCertsFromPEM(pem) {
return nil, fmt.Errorf("busauth: CA %q contains no valid PEM certificate", caPEMPath)
}
return &tls.Config{RootCAs: pool, MinVersion: tls.VersionTLS12}, nil
}
// ServerTLSConfig loads the bus NATS server's certificate and private key (PEM
// files) into a *tls.Config to present to clients. The private key never leaves
// the host; only the CA cert travels to clients.
func ServerTLSConfig(certPEMPath, keyPEMPath string) (*tls.Config, error) {
cert, err := tls.LoadX509KeyPair(certPEMPath, keyPEMPath)
if err != nil {
return nil, fmt.Errorf("busauth: load server keypair: %w", err)
}
return &tls.Config{Certificates: []tls.Certificate{cert}, MinVersion: tls.VersionTLS12}, nil
}
// RouteTLSConfig builds the mutual-TLS config for the NATS CLUSTER route layer
// (issue 0003a). Unlike the client data plane, where the server presents a cert
// and only the client verifies it, routes are server-to-server: each node both
// presents its own node certificate AND verifies the connecting node's
// certificate against the bus CA. So this single config carries:
//
// - Certificates: this node's CA-signed certificate (presented in both the
// server and the client role of a route handshake),
// - RootCAs: the bus CA, to verify the certificate of a node we dial out to,
// - ClientCAs + ClientAuth=RequireAndVerifyClientCert: the bus CA, to verify
// the certificate of a node dialing in.
//
// The effect: a node that lacks a certificate signed by the bus CA cannot
// establish a route in either direction, even if it knows the cluster password.
// Reuse the same CA as the client data plane (deploy/tls) but a per-node cert
// whose SAN covers that node's route address.
func RouteTLSConfig(certPEMPath, keyPEMPath, caPEMPath string) (*tls.Config, error) {
cert, err := tls.LoadX509KeyPair(certPEMPath, keyPEMPath)
if err != nil {
return nil, fmt.Errorf("busauth: load route keypair: %w", err)
}
pem, err := os.ReadFile(caPEMPath)
if err != nil {
return nil, fmt.Errorf("busauth: read route CA %q: %w", caPEMPath, err)
}
pool := x509.NewCertPool()
if !pool.AppendCertsFromPEM(pem) {
return nil, fmt.Errorf("busauth: route CA %q contains no valid PEM certificate", caPEMPath)
}
return &tls.Config{
Certificates: []tls.Certificate{cert},
RootCAs: pool,
ClientCAs: pool,
ClientAuth: tls.RequireAndVerifyClientCert,
MinVersion: tls.VersionTLS12,
}, nil
}
+95
View File
@@ -0,0 +1,95 @@
package busauth
import (
"crypto/ecdsa"
"crypto/elliptic"
"crypto/rand"
"crypto/x509"
"crypto/x509/pkix"
"encoding/pem"
"math/big"
"os"
"path/filepath"
"testing"
"time"
)
// writeSelfSigned writes a self-signed cert + key PEM pair to dir and returns
// their paths. It is enough to exercise both LoadCATLSConfig (reads the cert as
// a CA) and ServerTLSConfig (reads the cert+key as a server keypair).
func writeSelfSigned(t *testing.T, dir string) (certPath, keyPath string) {
t.Helper()
key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
if err != nil {
t.Fatalf("key: %v", err)
}
tmpl := &x509.Certificate{
SerialNumber: big.NewInt(1),
Subject: pkix.Name{CommonName: "unibus-tls-test"},
NotBefore: time.Now().Add(-time.Hour),
NotAfter: time.Now().Add(time.Hour),
IsCA: true,
KeyUsage: x509.KeyUsageCertSign | x509.KeyUsageDigitalSignature,
BasicConstraintsValid: true,
}
der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
if err != nil {
t.Fatalf("cert: %v", err)
}
certPath = filepath.Join(dir, "cert.pem")
keyPath = filepath.Join(dir, "key.pem")
certPEM := pem.EncodeToMemory(&pem.Block{Type: "CERTIFICATE", Bytes: der})
if err := os.WriteFile(certPath, certPEM, 0o644); err != nil {
t.Fatalf("write cert: %v", err)
}
keyDER, err := x509.MarshalECPrivateKey(key)
if err != nil {
t.Fatalf("marshal key: %v", err)
}
keyPEM := pem.EncodeToMemory(&pem.Block{Type: "EC PRIVATE KEY", Bytes: keyDER})
if err := os.WriteFile(keyPath, keyPEM, 0o600); err != nil {
t.Fatalf("write key: %v", err)
}
return certPath, keyPath
}
// Golden: a valid CA PEM loads into a config with a non-empty RootCAs pool, and
// a valid keypair loads into a config presenting one certificate.
func TestLoadTLSConfigsGolden(t *testing.T) {
dir := t.TempDir()
certPath, keyPath := writeSelfSigned(t, dir)
caCfg, err := LoadCATLSConfig(certPath)
if err != nil {
t.Fatalf("LoadCATLSConfig: %v", err)
}
if caCfg.RootCAs == nil {
t.Fatalf("expected a populated RootCAs pool")
}
srvCfg, err := ServerTLSConfig(certPath, keyPath)
if err != nil {
t.Fatalf("ServerTLSConfig: %v", err)
}
if len(srvCfg.Certificates) != 1 {
t.Fatalf("expected exactly one server certificate, got %d", len(srvCfg.Certificates))
}
}
// Error path: missing file, and a file that is not valid PEM.
func TestLoadTLSConfigsErrors(t *testing.T) {
if _, err := LoadCATLSConfig("/no/such/ca.crt"); err == nil {
t.Fatalf("expected error for missing CA file")
}
dir := t.TempDir()
junk := filepath.Join(dir, "junk.crt")
if err := os.WriteFile(junk, []byte("not a pem"), 0o644); err != nil {
t.Fatalf("write junk: %v", err)
}
if _, err := LoadCATLSConfig(junk); err == nil {
t.Fatalf("expected error for non-PEM CA file")
}
if _, err := ServerTLSConfig("/no/such/server.crt", "/no/such/server.key"); err == nil {
t.Fatalf("expected error for missing server keypair")
}
}
+381 -70
View File
@@ -16,16 +16,23 @@ import (
"bytes"
"context"
"crypto/rand"
"crypto/tls"
"encoding/base64"
"encoding/hex"
"encoding/json"
"fmt"
"io"
"net/http"
"strconv"
"strings"
"sync"
"time"
cs "fn-registry/functions/cybersecurity"
"github.com/enmanuel/unibus/pkg/busauth"
"github.com/enmanuel/unibus/pkg/frame"
"github.com/enmanuel/unibus/pkg/membership"
"github.com/enmanuel/unibus/pkg/room"
"github.com/nats-io/nats.go"
"github.com/nats-io/nats.go/jetstream"
@@ -44,20 +51,130 @@ type Client struct {
endpoint string
nc *nats.Conn
js jetstream.JetStream // durable plane for rooms with Policy.Persist
ctrlURL string
ctrlURLs []string // control-plane HTTP endpoints, tried in order (failover)
http *http.Client
// natsServers + natsOpts are retained so RefreshSession can rebuild the
// data-plane connection (re-triggering the server's subject-ACL evaluation).
natsServers []string
natsOpts []nats.Option
mu sync.RWMutex
keyCache map[string]map[int][]byte // roomID -> epoch -> K
signCache map[string][]byte // sender endpoint -> sign pub (for verification)
}
// New connects to NATS and records the control-plane URL. The identity holds
// the peer's long-term keypairs.
// Options configures how a client connects to the bus. The zero value is the
// legacy behavior: a plain NATS connection with no nkey and no TLS — what dev
// stacks and a not-yet-secured server expect. Secured deployments set these.
type Options struct {
// UseNkey authenticates the NATS connection with the peer's Ed25519 identity
// reused as a NATS nkey. It MUST match the server: nats.go refuses to connect
// with an nkey to a server that does not advertise nkey auth ("nkeys not
// supported by the server"), so this is opt-in rather than always-on.
UseNkey bool
// TLS, when non-nil, secures the NATS (data plane) connection and pins the
// server to this config's RootCAs (the bus's self-signed CA). Build it with
// busauth.LoadCATLSConfig(caPath). Nil keeps the data plane plaintext.
TLS *tls.Config
// CtrlTLS, when non-nil, secures the HTTP control-plane connection and pins it
// to this config's RootCAs. It is separate from TLS so the two planes can be
// secured independently (a test may TLS one and not the other); production
// sets both to the same CA via Connect. Nil keeps the control plane plaintext.
CtrlTLS *tls.Config
// NatsServers are ADDITIONAL NATS seed URLs for cluster failover (issue
// 0003e), beyond the primary natsURL passed to the constructor. With more
// than one server nats.go reconnects to a surviving node automatically when
// the one a client is attached to dies, so a node loss is transparent.
NatsServers []string
// CtrlURLs are ADDITIONAL control-plane HTTP endpoints (one per node) beyond
// the primary ctrlURL. Each request is tried against them in order until one
// answers, so the control plane survives a node loss too. With the
// decentralized KV store every node serves the same state, so any of them
// can answer any request.
CtrlURLs []string
}
// dedupNonEmpty returns the input with empty strings dropped and duplicates
// removed, preserving order. Used to build the NATS seed list and control-plane
// list from a primary URL plus optional extras without a redundant entry.
func dedupNonEmpty(in []string) []string {
seen := map[string]bool{}
var out []string
for _, s := range in {
if s == "" || seen[s] {
continue
}
seen[s] = true
out = append(out, s)
}
return out
}
// New connects to NATS and records the control-plane URL with default Options
// (no nkey, no TLS). The identity holds the peer's long-term keypairs.
func New(natsURL, ctrlURL string, id cs.Identity) (*Client, error) {
nc, err := nats.Connect(natsURL, nats.Name("unibus-client"))
return NewWithOptions(natsURL, ctrlURL, id, Options{})
}
// Connect is the single migration seam every peer (worker, chat, mobile,
// gateway) uses to pick its security posture from one input: the CA path. With
// a non-empty caPath it connects securely — TLS pinned to that CA plus nkey
// authentication on the data plane — matching a bus running with bus-auth
// enforce + bus-tls. With an empty caPath it falls back to the legacy plaintext,
// no-nkey connection for local dev against an unsecured bus. The control-plane
// HTTP requests are signed in both cases (that signing is unconditional).
func Connect(natsURL, ctrlURL string, id cs.Identity, caPath string) (*Client, error) {
if caPath == "" {
return New(natsURL, ctrlURL, id)
}
// A CA implies the bus is TLS on BOTH planes. Refuse a plaintext control-plane
// URL: signing gives integrity, not confidentiality, so sending metadata over
// http:// when the operator provisioned a CA would silently leak it to a MITM
// (audit H5). Force https rather than silently downgrade.
if !strings.HasPrefix(ctrlURL, "https://") {
return nil, fmt.Errorf("client: control-plane URL %q must be https:// when a CA is provided", ctrlURL)
}
tlsCfg, err := busauth.LoadCATLSConfig(caPath)
if err != nil {
return nil, fmt.Errorf("client: connect nats %q: %w", natsURL, err)
return nil, fmt.Errorf("client: load CA %q: %w", caPath, err)
}
// Pin the same CA on both planes: nkey+TLS on NATS, TLS on the HTTP control plane.
return NewWithOptions(natsURL, ctrlURL, id, Options{UseNkey: true, TLS: tlsCfg, CtrlTLS: tlsCfg})
}
// NewWithOptions is New with explicit connection options (nkey auth, and, from
// phase 0001d, TLS). It is the single place the data-plane connection is built,
// so every peer (worker, chat, mobile, gateway) gets identical behavior by
// passing the same Options.
func NewWithOptions(natsURL, ctrlURL string, id cs.Identity, opts Options) (*Client, error) {
// Seed list = primary + extras. With more than one seed, nats.go fails over
// to a surviving node on disconnect; MaxReconnects(-1) keeps it retrying
// indefinitely so a node coming back is rejoined rather than given up on.
natsServers := dedupNonEmpty(append([]string{natsURL}, opts.NatsServers...))
natsOpts := []nats.Option{
nats.Name("unibus-client"),
nats.MaxReconnects(-1),
nats.ReconnectWait(250 * time.Millisecond),
}
if len(natsServers) > 1 {
// Try every seed on the initial connect too, so startup tolerates one
// seed being down.
natsOpts = append(natsOpts, nats.RetryOnFailedConnect(true))
}
if opts.UseNkey {
nkeyPub, nkeySign, err := busauth.ClientNkey(id.SignPriv)
if err != nil {
return nil, fmt.Errorf("client: derive nkey: %w", err)
}
natsOpts = append(natsOpts, nats.Nkey(nkeyPub, nkeySign))
}
if opts.TLS != nil {
natsOpts = append(natsOpts, nats.Secure(opts.TLS))
}
nc, err := nats.Connect(strings.Join(natsServers, ","), natsOpts...)
if err != nil {
return nil, fmt.Errorf("client: connect nats %v: %w", natsServers, err)
}
// JetStream context for the durable plane. Obtaining it does not require any
// stream to exist yet and has no effect on cleartext/ephemeral rooms — those
@@ -67,18 +184,58 @@ func New(natsURL, ctrlURL string, id cs.Identity) (*Client, error) {
nc.Close()
return nil, fmt.Errorf("client: init jetstream: %w", err)
}
// The control-plane HTTP client pins the bus CA when CtrlTLS is set, so an
// https:// control plane is verified against the bus's own CA rather than the
// system roots (audit H5). Without it the client stays plaintext for dev.
httpClient := &http.Client{Timeout: 10 * time.Second}
if opts.CtrlTLS != nil {
httpClient.Transport = &http.Transport{TLSClientConfig: opts.CtrlTLS.Clone()}
}
return &Client{
id: id,
endpoint: frame.EndpointID(id.SignPub),
nc: nc,
js: js,
ctrlURL: ctrlURL,
http: &http.Client{Timeout: 10 * time.Second},
keyCache: map[string]map[int][]byte{},
signCache: map[string][]byte{},
id: id,
endpoint: frame.EndpointID(id.SignPub),
nc: nc,
js: js,
ctrlURLs: dedupNonEmpty(append([]string{ctrlURL}, opts.CtrlURLs...)),
http: httpClient,
natsServers: natsServers,
natsOpts: natsOpts,
keyCache: map[string]map[int][]byte{},
signCache: map[string][]byte{},
}, nil
}
// RefreshSession rebuilds the data-plane NATS connection so the server's
// subject-ACL authenticator re-evaluates this peer's room membership (issue
// 0003e, audit H4 residual). Call it after a membership change — a room you
// created, were invited to, or joined — when the bus enforces per-subject
// permissions, so the new room's subject becomes publishable and subscribable
// (NATS freezes permissions at connect time, so the prior connection cannot see
// the new room).
//
// It opens a fresh connection with the same seeds/options and swaps it in.
// IMPORTANT: active subscriptions from the previous connection are dropped —
// re-subscribe (client.Subscribe) to your rooms after calling this. The key and
// signer caches are preserved. On a non-ACL bus this is a no-op-safe reconnect.
func (c *Client) RefreshSession() error {
nc, err := nats.Connect(strings.Join(c.natsServers, ","), c.natsOpts...)
if err != nil {
return fmt.Errorf("client: refresh session: reconnect nats: %w", err)
}
js, err := jetstream.New(nc)
if err != nil {
nc.Close()
return fmt.Errorf("client: refresh session: init jetstream: %w", err)
}
old := c.nc
c.mu.Lock()
c.nc = nc
c.js = js
c.mu.Unlock()
old.Close()
return nil
}
// Endpoint returns this client's public identity.
func (c *Client) Endpoint() Endpoint {
return Endpoint{ID: c.endpoint, SignPub: c.id.SignPub, KexPub: c.id.KexPub}
@@ -90,6 +247,15 @@ func (c *Client) Close() error {
return nil
}
// ConnectedServer returns the URL of the NATS node this client is currently
// attached to (empty when disconnected). It is observability for cluster
// failover: after a node dies, this reports the surviving node nats.go
// reconnected to. IsConnected reports whether the data-plane link is up.
func (c *Client) ConnectedServer() string { return c.nc.ConnectedUrl() }
// IsConnected reports whether the NATS data-plane connection is currently up.
func (c *Client) IsConnected() bool { return c.nc.IsConnected() }
// ---- key cache ------------------------------------------------------------
func (c *Client) cacheKey(roomID string, epoch int, k []byte) {
@@ -116,54 +282,105 @@ func (c *Client) getCachedKey(roomID string, epoch int) ([]byte, bool) {
// ---- control-plane HTTP helpers ------------------------------------------
func (c *Client) doJSON(method, path string, body, out any) error {
var rdr io.Reader
var bodyBytes []byte
if body != nil {
b, err := json.Marshal(body)
if err != nil {
return fmt.Errorf("client: marshal request: %w", err)
}
rdr = bytes.NewReader(b)
bodyBytes = b
}
req, err := http.NewRequest(method, c.ctrlURL+path, rdr)
if err != nil {
return fmt.Errorf("client: new request: %w", err)
}
if body != nil {
req.Header.Set("Content-Type", "application/json")
}
resp, err := c.http.Do(req)
if err != nil {
return fmt.Errorf("client: do %s %s: %w", method, path, err)
}
defer resp.Body.Close()
respBody, _ := io.ReadAll(resp.Body)
if resp.StatusCode >= 300 {
// Surface the server's structured {"error": "..."} message when present,
// instead of leaking the raw HTTP envelope (method, path, status, JSON body).
var er struct {
Error string `json:"error"`
// Try each control-plane endpoint in order. A transport error (a dead node)
// falls over to the next; an HTTP response (any status) is authoritative and
// returned, since every node serves the same state. Each attempt is freshly
// signed (new nonce), so a failed-over retry is never seen as a replay.
var lastErr error
for _, base := range c.ctrlURLs {
req, err := c.newSignedRequestTo(base, method, path, bodyBytes)
if err != nil {
return err
}
if json.Unmarshal(respBody, &er) == nil && er.Error != "" {
return fmt.Errorf("%s (HTTP %d)", er.Error, resp.StatusCode)
if body != nil {
req.Header.Set("Content-Type", "application/json")
}
return fmt.Errorf("client: %s %s -> %d: %s", method, path, resp.StatusCode, string(respBody))
}
if out != nil {
if err := json.Unmarshal(respBody, out); err != nil {
return fmt.Errorf("client: decode response: %w", err)
resp, err := c.http.Do(req)
if err != nil {
lastErr = err
continue // dead node: try the next control plane
}
defer resp.Body.Close()
respBody, _ := io.ReadAll(resp.Body)
if resp.StatusCode >= 300 {
// Surface the server's structured {"error": "..."} message when present,
// instead of leaking the raw HTTP envelope (method, path, status, body).
var er struct {
Error string `json:"error"`
}
if json.Unmarshal(respBody, &er) == nil && er.Error != "" {
return fmt.Errorf("%s (HTTP %d)", er.Error, resp.StatusCode)
}
return fmt.Errorf("client: %s %s -> %d: %s", method, path, resp.StatusCode, string(respBody))
}
if out != nil {
if err := json.Unmarshal(respBody, out); err != nil {
return fmt.Errorf("client: decode response: %w", err)
}
}
return nil
}
return nil
return fmt.Errorf("client: %s %s: all control planes failed: %w", method, path, lastErr)
}
// signRequest signs the canonical bytes of req (req must already have its Sig
// field cleared) with the client's Ed25519 key. It is symmetric with the
// server's verifyOwnerSig.
// server's verifyOwnerSig. This is the PAYLOAD-level owner signature that
// authorizes room operations (invite/rekey) by ownership — distinct from the
// transport-level request signature applied by newSignedRequest below, which
// authenticates the caller's identity on every request.
func (c *Client) signRequest(req any) []byte {
b, _ := json.Marshal(req)
return cs.SignEd25519(c.id.SignPriv, b)
}
// newSignedRequestTo builds an *http.Request to the control-plane endpoint
// `base` and attaches the transport authentication headers
// (X-Unibus-Pub/Ts/Nonce/Sig) signing the canonical request bytes with this
// peer's Ed25519 key. path is the request URI (path plus any query); body is the
// raw request body (nil for GET). The server (membership.authenticate) verifies
// these headers under the bus-auth flag. The signature covers method+path+ts+
// nonce+sha256(body), NOT the host, so the same request can be addressed to any
// node — and each failover attempt mints a fresh nonce so it is never a replay.
//
// Signing happens on every request — including GETs — so that under enforce the
// server can authenticate the caller and reject unregistered or revoked
// identities uniformly. The canonical construction is the single source of truth
// in membership.CanonicalRequest, shared by both sides.
func (c *Client) newSignedRequestTo(base, method, path string, body []byte) (*http.Request, error) {
var rdr io.Reader
if body != nil {
rdr = bytes.NewReader(body)
}
req, err := http.NewRequest(method, base+path, rdr)
if err != nil {
return nil, fmt.Errorf("client: new request: %w", err)
}
ts := strconv.FormatInt(time.Now().Unix(), 10)
nonceRaw := make([]byte, 16)
if _, err := rand.Read(nonceRaw); err != nil {
return nil, fmt.Errorf("client: generate nonce: %w", err)
}
nonce := base64.StdEncoding.EncodeToString(nonceRaw)
canonical := membership.CanonicalRequest(method, path, ts, nonce, body)
sig := cs.SignEd25519(c.id.SignPriv, canonical)
req.Header.Set("X-Unibus-Pub", hex.EncodeToString(c.id.SignPub))
req.Header.Set("X-Unibus-Ts", ts)
req.Header.Set("X-Unibus-Nonce", nonce)
req.Header.Set("X-Unibus-Sig", base64.StdEncoding.EncodeToString(sig))
return req, nil
}
// ---- mirror of server wire types (control plane) -------------------------
type policyJSON struct {
@@ -239,6 +456,23 @@ type memberRoomJSON struct {
Role string `json:"role"`
}
// userJSON mirrors the server's wire type on the admin user-management endpoints.
type userJSON struct {
SignPub string `json:"sign_pub"`
Handle string `json:"handle"`
Role string `json:"role"`
Status string `json:"status"`
CreatedAt string `json:"created_at"`
RevokedAt string `json:"revoked_at,omitempty"`
}
// addUserReq is the POST /users body (mirror of the server type).
type addUserReq struct {
SignPub string `json:"sign_pub"`
Handle string `json:"handle"`
Role string `json:"role"`
}
// ---- room operations ------------------------------------------------------
// RoomRef is a room this peer belongs to, returned by ListMyRooms. It is the
@@ -273,6 +507,59 @@ func (c *Client) ListMyRooms() ([]RoomRef, error) {
return out, nil
}
// ---- user administration (admin-only) ------------------------------------
// UserInfo is a bus user as returned by the admin user-management endpoints. It
// is a flat view (no nested types) for the admin panel: the signing key
// (lowercase hex), handle, role ("admin"|"member"), status ("active"|"revoked"),
// and timestamps. RevokedAt is empty for an active user.
type UserInfo struct {
SignPub string
Handle string
Role string
Status string
CreatedAt string
RevokedAt string
}
// ListUsers returns the full bus allowlist, including revoked users. The caller
// must be signing as an admin: a non-admin signer is rejected by the server with
// 403, surfaced here as an error.
func (c *Client) ListUsers() ([]UserInfo, error) {
var resp []userJSON
if err := c.doJSON("GET", "/users", nil, &resp); err != nil {
return nil, err
}
out := make([]UserInfo, 0, len(resp))
for _, u := range resp {
out = append(out, UserInfo{
SignPub: u.SignPub,
Handle: u.Handle,
Role: u.Role,
Status: u.Status,
CreatedAt: u.CreatedAt,
RevokedAt: u.RevokedAt,
})
}
return out, nil
}
// AddUser registers a bus user from their Ed25519 signing public key (64-hex).
// role is "admin" or "member" (empty defaults to member, matching the server).
// The caller must be signing as an admin. Re-adding an already-registered key
// returns an error (the server replies 409 and leaves the existing row
// untouched — no silent role/status change).
func (c *Client) AddUser(signPub, handle, role string) error {
return c.doJSON("POST", "/users", addUserReq{SignPub: signPub, Handle: handle, Role: role}, nil)
}
// RevokeUser revokes a bus user by their signing public key (64-hex). Revocation
// is a status flip (no hard delete): the identity stays auditable and is denied
// on both planes immediately. The caller must be signing as an admin.
func (c *Client) RevokeUser(signPub string) error {
return c.doJSON("POST", "/users/"+signPub+"/revoke", nil, nil)
}
// newRoomKey returns 32 random bytes for a symmetric room key.
func newRoomKey() ([]byte, error) {
k := make([]byte, 32)
@@ -582,7 +869,17 @@ func (c *Client) processFrame(roomID string, info roomView, data []byte, handler
if err != nil {
return
}
if info.Policy.SignMsgs && f.Sig != nil {
// A room with SignMsgs REQUIRES a signature, so an unsigned frame is
// unauthenticated and must be dropped — not silently accepted. The previous
// `&& f.Sig != nil` guard verified the signature only when one was present, so
// an attacker with data-plane access could publish a frame with Sig==nil and a
// forged Sender and have the receiver accept it as authentic in a room that
// demands signatures (audit N3, report 0006). Requiring the signature first
// closes that spoof.
if info.Policy.SignMsgs {
if f.Sig == nil {
return // signature required by room policy but absent: drop
}
pub, err := c.signerPub(roomID, f.Sender)
if err != nil || !cs.VerifyEd25519(pub, f.SigningBytes(), f.Sig) {
return // unauthenticated frame: drop
@@ -769,36 +1066,50 @@ func (c *Client) FetchMedia(roomID string, f frame.Frame) ([]byte, error) {
}
func (c *Client) putBlob(ciphertext []byte) (string, error) {
req, err := http.NewRequest("POST", c.ctrlURL+"/blobs", bytes.NewReader(ciphertext))
if err != nil {
return "", fmt.Errorf("client: new blob request: %w", err)
var lastErr error
for _, base := range c.ctrlURLs {
req, err := c.newSignedRequestTo(base, "POST", "/blobs", ciphertext)
if err != nil {
return "", err
}
req.Header.Set("Content-Type", "application/octet-stream")
resp, err := c.http.Do(req)
if err != nil {
lastErr = err
continue // dead node: try the next control plane
}
defer resp.Body.Close()
body, _ := io.ReadAll(resp.Body)
if resp.StatusCode >= 300 {
return "", fmt.Errorf("client: put blob -> %d: %s", resp.StatusCode, string(body))
}
var r blobResp
if err := json.Unmarshal(body, &r); err != nil {
return "", fmt.Errorf("client: decode blob resp: %w", err)
}
return r.Hash, nil
}
req.Header.Set("Content-Type", "application/octet-stream")
resp, err := c.http.Do(req)
if err != nil {
return "", fmt.Errorf("client: put blob: %w", err)
}
defer resp.Body.Close()
body, _ := io.ReadAll(resp.Body)
if resp.StatusCode >= 300 {
return "", fmt.Errorf("client: put blob -> %d: %s", resp.StatusCode, string(body))
}
var r blobResp
if err := json.Unmarshal(body, &r); err != nil {
return "", fmt.Errorf("client: decode blob resp: %w", err)
}
return r.Hash, nil
return "", fmt.Errorf("client: put blob: all control planes failed: %w", lastErr)
}
func (c *Client) getBlob(hash string) ([]byte, error) {
resp, err := c.http.Get(c.ctrlURL + "/blobs/" + hash)
if err != nil {
return nil, fmt.Errorf("client: get blob: %w", err)
var lastErr error
for _, base := range c.ctrlURLs {
req, err := c.newSignedRequestTo(base, "GET", "/blobs/"+hash, nil)
if err != nil {
return nil, err
}
resp, err := c.http.Do(req)
if err != nil {
lastErr = err
continue // dead node: try the next control plane
}
defer resp.Body.Close()
if resp.StatusCode >= 300 {
body, _ := io.ReadAll(resp.Body)
return nil, fmt.Errorf("client: get blob -> %d: %s", resp.StatusCode, string(body))
}
return io.ReadAll(resp.Body)
}
defer resp.Body.Close()
if resp.StatusCode >= 300 {
body, _ := io.ReadAll(resp.Body)
return nil, fmt.Errorf("client: get blob -> %d: %s", resp.StatusCode, string(body))
}
return io.ReadAll(resp.Body)
return nil, fmt.Errorf("client: get blob: all control planes failed: %w", lastErr)
}
+145 -9
View File
@@ -1,10 +1,13 @@
package client_test
import (
"crypto/tls"
"encoding/hex"
"net"
"net/http"
"net/http/httptest"
"path/filepath"
"strings"
"sync"
"testing"
"time"
@@ -12,6 +15,7 @@ import (
cs "fn-registry/functions/cybersecurity"
"github.com/enmanuel/unibus/pkg/blobstore"
"github.com/enmanuel/unibus/pkg/busauth"
"github.com/enmanuel/unibus/pkg/client"
"github.com/enmanuel/unibus/pkg/embeddednats"
"github.com/enmanuel/unibus/pkg/frame"
@@ -27,6 +31,8 @@ type testHarness struct {
ctrlURL string
ns *server.Server
httpts *httptest.Server
store membership.Store
srv *membership.Server
}
func freePort(t *testing.T) int {
@@ -39,29 +45,61 @@ func freePort(t *testing.T) int {
return l.Addr().(*net.TCPAddr).Port
}
func newHarness(t *testing.T) *testHarness {
func newHarness(t *testing.T) *testHarness { return newHarnessFull(t, membership.AuthOff, false) }
// newHarnessMode is newHarness with an explicit control-plane auth mode and the
// NATS data plane left open (no nkey auth), so HTTP-auth tests can use a plain
// client.New that does not present an nkey.
func newHarnessMode(t *testing.T, mode membership.AuthMode) *testHarness {
return newHarnessFull(t, mode, false)
}
// newHarnessFull boots the embedded NATS (optionally with the nkey authenticator
// backed by the user allowlist) and the membershipd HTTP server in ctrlMode.
// natsAuth and ctrlMode are independent on purpose: an HTTP-enforce test can
// keep NATS open, and an nkey test can keep HTTP off, mirroring how the rollout
// flags compose. The store is created before NATS so the authenticator can
// consult IsAuthorized for live revocation.
func newHarnessFull(t *testing.T, ctrlMode membership.AuthMode, natsAuth bool) *testHarness {
return bootHarness(t, ctrlMode, natsAuth, nil)
}
// bootHarness is the shared body: a store, an embedded NATS (optionally with the
// nkey authenticator and/or TLS), and the membershipd HTTP server in ctrlMode.
func bootHarness(t *testing.T, ctrlMode membership.AuthMode, natsAuth bool, natsTLS *tls.Config) *testHarness {
t.Helper()
dir := t.TempDir()
ns, err := embeddednats.Start(filepath.Join(dir, "js"), freePort(t))
store, err := membership.Open(filepath.Join(dir, "unibus.db"))
if err != nil {
t.Fatalf("membership store: %v", err)
}
cfg := embeddednats.ServerConfig{
StoreDir: filepath.Join(dir, "js"),
Host: "127.0.0.1",
Port: freePort(t),
TLS: natsTLS,
}
if natsAuth {
cfg.Auth = busauth.NewNkeyAuthenticator(store.IsAuthorized)
}
ns, err := embeddednats.StartServer(cfg)
if err != nil {
store.Close()
t.Fatalf("embedded nats: %v", err)
}
store, err := membership.Open(filepath.Join(dir, "unibus.db"))
if err != nil {
ns.Shutdown()
t.Fatalf("membership store: %v", err)
}
blobs, err := blobstore.New(filepath.Join(dir, "blobs"))
if err != nil {
ns.Shutdown()
store.Close()
t.Fatalf("blob store: %v", err)
}
srv := membership.NewServer(store, blobs)
srv := membership.NewServer(store, blobs, ctrlMode)
httpts := httptest.NewServer(srv)
h := &testHarness{natsURL: embeddednats.ClientURL(ns), ctrlURL: httpts.URL, ns: ns, httpts: httpts}
h := &testHarness{natsURL: embeddednats.ClientURL(ns), ctrlURL: httpts.URL, ns: ns, httpts: httpts, store: store, srv: srv}
t.Cleanup(func() {
httpts.Close()
store.Close()
@@ -71,6 +109,15 @@ func newHarness(t *testing.T) *testHarness {
return h
}
// registerClient adds a peer's signing identity to the bus allowlist so its
// signed control-plane requests pass under enforce.
func registerClient(t *testing.T, h *testHarness, c *client.Client, handle, role string) {
t.Helper()
if err := h.store.AddUser(hex.EncodeToString(c.Endpoint().SignPub), handle, role); err != nil {
t.Fatalf("register %s: %v", handle, err)
}
}
func waitHealth(t *testing.T, ctrlURL string) {
t.Helper()
deadline := time.Now().Add(3 * time.Second)
@@ -455,6 +502,95 @@ func TestListMyRoomsDiscovery(t *testing.T) {
}
}
// TestControlPlaneAuthEnforceE2E closes the loop end to end with the production
// client against a server in enforce mode: a registered peer's signed requests
// are accepted (golden), and an unregistered peer is rejected with 401 on its
// first control-plane call (error path). This proves the client's real
// signature construction matches the server's verification.
func TestControlPlaneAuthEnforceE2E(t *testing.T) {
h := newHarnessMode(t, membership.AuthEnforce)
waitHealth(t, h.ctrlURL)
a, err := client.New(h.natsURL, h.ctrlURL, mustIdentity(t))
if err != nil {
t.Fatalf("connect A: %v", err)
}
defer a.Close()
registerClient(t, h, a, "alice", membership.RoleAdmin)
// Golden: registered peer's signed request is accepted.
if _, err := a.CreateRoom("room.enforced", room.ModeNATS); err != nil {
t.Fatalf("registered peer should create a room under enforce: %v", err)
}
// Error path: an unregistered peer is rejected on its first control-plane call.
b, err := client.New(h.natsURL, h.ctrlURL, mustIdentity(t))
if err != nil {
t.Fatalf("connect B: %v", err)
}
defer b.Close()
_, err = b.CreateRoom("room.denied", room.ModeNATS)
if err == nil {
t.Fatalf("unregistered peer must be rejected under enforce")
}
if !strings.Contains(err.Error(), "401") && !strings.Contains(strings.ToLower(err.Error()), "unauthorized") {
t.Fatalf("expected a 401/unauthorized error, got %v", err)
}
// Revocation takes effect without restart: revoke A, its next request fails.
if err := h.store.RevokeUser(hex.EncodeToString(a.Endpoint().SignPub)); err != nil {
t.Fatalf("revoke A: %v", err)
}
if _, err := a.CreateRoom("room.after-revoke", room.ModeNATS); err == nil {
t.Fatalf("revoked peer must be rejected without a server restart")
}
}
// TestNatsNkeyAuth exercises the data-plane authenticator: with NATS nkey auth
// on, a registered peer connecting with its nkey is accepted and can publish
// (golden); an unregistered peer is refused at connect time (error path); and a
// peer revoked while the server runs is refused on its NEXT connection, proving
// revocation without a restart (edge).
func TestNatsNkeyAuth(t *testing.T) {
h := newHarnessFull(t, membership.AuthOff, true) // NATS auth on; HTTP off to isolate the data plane
waitHealth(t, h.ctrlURL)
idA := mustIdentity(t)
if err := h.store.AddUser(hex.EncodeToString(idA.SignPub), "alice", membership.RoleMember); err != nil {
t.Fatalf("register A: %v", err)
}
// Golden: registered peer connects with its nkey and uses the bus.
a, err := client.NewWithOptions(h.natsURL, h.ctrlURL, idA, client.Options{UseNkey: true})
if err != nil {
t.Fatalf("registered peer should connect with nkey: %v", err)
}
defer a.Close()
if _, err := a.CreateRoom("room.nkey", room.ModeNATS); err != nil {
t.Fatalf("registered peer should operate: %v", err)
}
// Error path: an unregistered identity is refused at connect time.
idB := mustIdentity(t)
if _, err := client.NewWithOptions(h.natsURL, h.ctrlURL, idB, client.Options{UseNkey: true}); err == nil {
t.Fatalf("unregistered peer must be refused by the NATS authenticator")
}
// Error path: presenting no nkey to an auth-required server is refused.
if _, err := client.NewWithOptions(h.natsURL, h.ctrlURL, idB, client.Options{UseNkey: false}); err == nil {
t.Fatalf("a client without an nkey must be refused when the server requires auth")
}
// Edge: revoke A while the server runs; A's NEXT connection is refused even
// though an already-open connection (a) is unaffected. No server restart.
if err := h.store.RevokeUser(hex.EncodeToString(idA.SignPub)); err != nil {
t.Fatalf("revoke A: %v", err)
}
if _, err := client.NewWithOptions(h.natsURL, h.ctrlURL, idA, client.Options{UseNkey: true}); err == nil {
t.Fatalf("revoked peer must be refused on a new connection without a restart")
}
}
// ---- test helpers ---------------------------------------------------------
type collector struct {
+87
View File
@@ -0,0 +1,87 @@
package client_test
import (
"crypto/tls"
"crypto/x509"
"net/http/httptest"
"path/filepath"
"strings"
"testing"
"github.com/enmanuel/unibus/pkg/blobstore"
"github.com/enmanuel/unibus/pkg/client"
"github.com/enmanuel/unibus/pkg/embeddednats"
"github.com/enmanuel/unibus/pkg/membership"
"github.com/enmanuel/unibus/pkg/room"
)
// TestConnectRequiresHTTPSWithCA covers audit H5's client contract: when a CA is
// provided the control-plane URL must be https://. A signed request gives
// integrity but not confidentiality, so silently talking http:// to a bus the
// operator secured with a CA would leak all metadata to a MITM. Connect refuses
// the plaintext URL outright (error path; the scheme is checked before any
// network use, so a bogus CA path is irrelevant).
func TestConnectRequiresHTTPSWithCA(t *testing.T) {
_, err := client.Connect("nats://127.0.0.1:4222", "http://127.0.0.1:8470", mustIdentity(t), "/nonexistent/ca.crt")
if err == nil {
t.Fatalf("Connect with a CA and an http:// control plane must be refused")
}
if !strings.Contains(err.Error(), "https") {
t.Fatalf("error should point the caller at https, got: %v", err)
}
}
// TestControlPlaneOverTLS proves the control plane works over TLS pinned to the
// bus CA (golden) and that a client lacking the CA cannot complete the handshake
// (error path) — so a network observer can neither read nor inject control-plane
// traffic. The data plane is left plaintext here to isolate the HTTP-TLS wiring.
func TestControlPlaneOverTLS(t *testing.T) {
dir := t.TempDir()
store, err := membership.Open(filepath.Join(dir, "unibus.db"))
if err != nil {
t.Fatalf("store: %v", err)
}
t.Cleanup(func() { store.Close() })
blobs, err := blobstore.New(filepath.Join(dir, "blobs"))
if err != nil {
t.Fatalf("blobs: %v", err)
}
ns, err := embeddednats.StartServer(embeddednats.ServerConfig{
StoreDir: filepath.Join(dir, "js"), Host: "127.0.0.1", Port: freePort(t),
})
if err != nil {
t.Fatalf("nats: %v", err)
}
t.Cleanup(func() { ns.Shutdown(); ns.WaitForShutdown() })
natsURL := embeddednats.ClientURL(ns)
// An https control plane wrapping the real membership server.
ts := httptest.NewTLSServer(membership.NewServer(store, blobs, membership.AuthOff))
t.Cleanup(ts.Close)
pool := x509.NewCertPool()
pool.AddCert(ts.Certificate())
// Golden: trusting the control-plane CA, an https control-plane request works.
good, err := client.NewWithOptions(natsURL, ts.URL, mustIdentity(t),
client.Options{CtrlTLS: &tls.Config{RootCAs: pool, MinVersion: tls.VersionTLS12}})
if err != nil {
t.Fatalf("connect with the pinned CA: %v", err)
}
defer good.Close()
if _, err := good.CreateRoom("room.tls.ctrl", room.ModeNATS); err != nil {
t.Fatalf("control plane over TLS should succeed with the pinned CA: %v", err)
}
// Error path: without the CA the https handshake fails, so the request errors.
bad, err := client.NewWithOptions(natsURL, ts.URL, mustIdentity(t),
client.Options{CtrlTLS: &tls.Config{RootCAs: x509.NewCertPool(), MinVersion: tls.VersionTLS12}})
if err != nil {
t.Fatalf("nats connect (bad CA case): %v", err)
}
defer bad.Close()
if _, err := bad.CreateRoom("room.tls.fail", room.ModeNATS); err == nil {
t.Fatalf("a control-plane request without the CA must fail the TLS handshake")
}
}
+124
View File
@@ -0,0 +1,124 @@
package client_test
import (
"bytes"
"sync"
"testing"
"time"
"github.com/enmanuel/unibus/pkg/client"
"github.com/enmanuel/unibus/pkg/frame"
"github.com/enmanuel/unibus/pkg/room"
"github.com/nats-io/nats.go"
)
// TestAudit_NoSubjectACL ports the auditor's H4 (Alto) finding under the minimum
// defense chosen for this issue (forbid cleartext rooms in public; see
// dev/0004d-dataplane-acl.md). The NATS data plane still has no per-subject ACL,
// so the guarantee we make is CONTENT confidentiality, proven three ways:
//
// error : a cleartext (ModeNATS) room cannot be created under the public posture;
// golden: a legitimate member (bob) decrypts the secret;
// edge : eve, sniffing the raw subject off the data plane, receives only
// ciphertext — she never recovers the plaintext the auditor's eve did.
func TestAudit_NoSubjectACL(t *testing.T) {
h := newHarness(t)
h.srv.RequireEncryptedRooms = true // the public posture
waitHealth(t, h.ctrlURL)
alice, err := client.New(h.natsURL, h.ctrlURL, mustIdentity(t))
if err != nil {
t.Fatalf("connect alice: %v", err)
}
defer alice.Close()
// Error path: a cleartext room is refused, so no payload ever rides a subject
// in the clear for a sniffer to read (the exact vector the auditor exploited).
if _, err := alice.CreateRoom("secret.subject.payroll", room.ModeNATS); err == nil {
t.Fatalf("cleartext room must be refused on a public deployment")
}
// alice creates an encrypted room and invites bob (the legitimate reader).
const subject = "secret.subject.payroll.e2e"
const secret = "internal: salary numbers"
roomID, err := alice.CreateRoom(subject, room.ModeMatrix)
if err != nil {
t.Fatalf("alice create encrypted room: %v", err)
}
bob, err := client.New(h.natsURL, h.ctrlURL, mustIdentity(t))
if err != nil {
t.Fatalf("connect bob: %v", err)
}
defer bob.Close()
if err := alice.Invite(roomID, bob.Endpoint()); err != nil {
t.Fatalf("alice invite bob: %v", err)
}
if err := bob.Join(roomID); err != nil {
t.Fatalf("bob join: %v", err)
}
// Golden: bob (a member) subscribes and decrypts the secret.
var bmu sync.Mutex
var bobGot []string
bobSub, err := bob.Subscribe(roomID, func(_ frame.Frame, plaintext []byte) {
bmu.Lock()
bobGot = append(bobGot, string(plaintext))
bmu.Unlock()
})
if err != nil {
t.Fatalf("bob subscribe: %v", err)
}
defer bobSub.Unsubscribe()
// Edge: eve sniffs the raw subject directly off NATS (no membership, no key).
rawEve, err := nats.Connect(h.natsURL)
if err != nil {
t.Fatalf("eve raw connect: %v", err)
}
defer rawEve.Close()
eveGot := make(chan []byte, 8)
if _, err := rawEve.Subscribe(subject, func(m *nats.Msg) { eveGot <- m.Data }); err != nil {
t.Fatalf("eve raw subscribe: %v", err)
}
if err := rawEve.Flush(); err != nil {
t.Fatalf("eve flush: %v", err)
}
time.Sleep(200 * time.Millisecond) // let both subscriptions settle
if err := alice.Publish(roomID, []byte(secret)); err != nil {
t.Fatalf("alice publish: %v", err)
}
// bob must decrypt the secret.
if !waitFor(&bmu, &bobGot, func(rs []string) bool {
for _, r := range rs {
if r == secret {
return true
}
}
return false
}, 2*time.Second) {
t.Fatalf("bob (member) should decrypt the secret; got %v", snapshot(&bmu, &bobGot))
}
// eve must receive only ciphertext — never the plaintext.
select {
case data := <-eveGot:
if bytes.Contains(data, []byte(secret)) {
t.Fatalf("eve sniffed the plaintext off the data plane: %q", data)
}
f, err := frame.Unmarshal(data)
if err != nil {
t.Fatalf("eve received an undecodable frame: %v", err)
}
if string(f.Payload) == secret {
t.Fatalf("eve read the secret from the frame payload")
}
if len(f.Nonce) == 0 {
t.Fatalf("expected an AEAD-encrypted payload (non-empty nonce), got cleartext frame")
}
case <-time.After(2 * time.Second):
// eve receiving nothing is also a safe outcome; the assertion is only that
// she never gets the plaintext, which holds vacuously here.
}
}
+185
View File
@@ -0,0 +1,185 @@
package client_test
import (
"fmt"
"net/http/httptest"
"path/filepath"
"strconv"
"strings"
"sync"
"testing"
"time"
"github.com/enmanuel/unibus/pkg/blobstore"
"github.com/enmanuel/unibus/pkg/client"
"github.com/enmanuel/unibus/pkg/embeddednats"
"github.com/enmanuel/unibus/pkg/frame"
"github.com/enmanuel/unibus/pkg/membership"
"github.com/enmanuel/unibus/pkg/room"
server "github.com/nats-io/nats-server/v2/server"
)
// startClusterNode boots a clustered embedded NATS node (auth off, no route TLS:
// this test exercises client failover, not route security — that is covered in
// pkg/embeddednats).
func startClusterNode(t *testing.T, name string, clientPort, routePort int, peerRoutePorts []int) *server.Server {
t.Helper()
routes := make([]string, 0, len(peerRoutePorts))
for _, p := range peerRoutePorts {
routes = append(routes, fmt.Sprintf("nats://127.0.0.1:%d", p))
}
ns, err := embeddednats.StartServer(embeddednats.ServerConfig{
StoreDir: t.TempDir(),
Host: "127.0.0.1",
Port: clientPort,
ServerName: name,
Cluster: &embeddednats.ClusterConfig{Name: "unibus-failover", Host: "127.0.0.1", Port: routePort, Routes: routes},
})
if err != nil {
t.Fatalf("start node %s: %v", name, err)
}
t.Cleanup(func() { ns.Shutdown(); ns.WaitForShutdown() })
return ns
}
func waitClusterRoutes(t *testing.T, ns *server.Server) {
t.Helper()
deadline := time.Now().Add(8 * time.Second)
for time.Now().Before(deadline) {
if ns.NumRoutes() >= 1 {
return
}
time.Sleep(50 * time.Millisecond)
}
t.Fatalf("node %q never formed a route", ns.Name())
}
// portOf extracts the :port of a nats URL for matching ConnectedServer() (which
// may report a different host spelling than ClientURL()).
func portOf(natsURL string) string {
i := strings.LastIndex(natsURL, ":")
if i < 0 {
return ""
}
return natsURL[i+1:]
}
// TestClientFailoverAcrossNodes is the issue's edge case: a client connected to
// node A keeps its session when A is killed — nats.go reconnects it to node B
// and it keeps receiving messages published on the surviving node.
func TestClientFailoverAcrossNodes(t *testing.T) {
rp0, rp1 := freePort(t), freePort(t)
p0, p1 := freePort(t), freePort(t)
n0 := startClusterNode(t, "n0", p0, rp0, []int{rp1})
n1 := startClusterNode(t, "n1", p1, rp1, []int{rp0})
waitClusterRoutes(t, n0)
waitClusterRoutes(t, n1)
nodes := map[string]*server.Server{strconv.Itoa(p0): n0, strconv.Itoa(p1): n1}
// Control plane: one in-process membershipd (metadata only; the data plane is
// the NATS cluster). Auth off keeps the test focused on data-plane failover.
dir := t.TempDir()
store, err := membership.Open(filepath.Join(dir, "unibus.db"))
if err != nil {
t.Fatalf("store: %v", err)
}
t.Cleanup(func() { store.Close() })
blobs, err := blobstore.New(filepath.Join(dir, "blobs"))
if err != nil {
t.Fatalf("blobs: %v", err)
}
ctrl := httptest.NewServer(membership.NewServer(store, blobs, membership.AuthOff))
t.Cleanup(ctrl.Close)
url0 := n0.ClientURL()
url1 := n1.ClientURL()
// A seeds BOTH nodes (failover list); B connects directly to n1.
a, err := client.NewWithOptions(url0, ctrl.URL, mustIdentity(t), client.Options{NatsServers: []string{url1}})
if err != nil {
t.Fatalf("connect A: %v", err)
}
defer a.Close()
b, err := client.NewWithOptions(url1, ctrl.URL, mustIdentity(t), client.Options{NatsServers: []string{url0}})
if err != nil {
t.Fatalf("connect B: %v", err)
}
defer b.Close()
roomID, err := a.CreateRoom("room.failover", room.ModeNATS)
if err != nil {
t.Fatalf("A create room: %v", err)
}
var mu sync.Mutex
var got []string
sub, err := a.Subscribe(roomID, func(_ frame.Frame, plaintext []byte) {
mu.Lock()
got = append(got, string(plaintext))
mu.Unlock()
})
if err != nil {
t.Fatalf("A subscribe: %v", err)
}
defer sub.Unsubscribe()
time.Sleep(200 * time.Millisecond)
// Pre-kill sanity: B publishes, A receives across the cluster.
if err := b.Publish(roomID, []byte("before-kill")); err != nil {
t.Fatalf("B publish 1: %v", err)
}
if !waitFor(&mu, &got, func(rs []string) bool { return contains(rs, "before-kill") }, 3*time.Second) {
t.Fatalf("A did not receive the pre-kill message; got %v", snapshot(&mu, &got))
}
// Identify and KILL the node A is attached to, forcing a reconnect.
attached := a.ConnectedServer()
killPort := portOf(attached)
victim, ok := nodes[killPort]
if !ok {
t.Fatalf("A is attached to an unknown node %q (port %q)", attached, killPort)
}
survivorURL := url1
if killPort == strconv.Itoa(p1) {
survivorURL = url0
}
victim.Shutdown()
victim.WaitForShutdown()
// A must reconnect to the surviving node.
deadline := time.Now().Add(8 * time.Second)
for time.Now().Before(deadline) {
if a.IsConnected() && portOf(a.ConnectedServer()) == portOf(survivorURL) {
break
}
time.Sleep(100 * time.Millisecond)
}
if !a.IsConnected() || portOf(a.ConnectedServer()) != portOf(survivorURL) {
t.Fatalf("A did not fail over to the surviving node (now on %q, want port %s)", a.ConnectedServer(), portOf(survivorURL))
}
// Make B publish from the surviving node and confirm A still receives —
// the session (its subscription) survived the failover.
if survivorURL == url0 {
// B's primary was n1 (killed); ensure B is on the survivor too.
deadline := time.Now().Add(8 * time.Second)
for time.Now().Before(deadline) && portOf(b.ConnectedServer()) != portOf(survivorURL) {
time.Sleep(100 * time.Millisecond)
}
}
if err := b.Publish(roomID, []byte("after-kill")); err != nil {
t.Fatalf("B publish 2: %v", err)
}
if !waitFor(&mu, &got, func(rs []string) bool { return contains(rs, "after-kill") }, 6*time.Second) {
t.Fatalf("A did not receive a message after failover; got %v", snapshot(&mu, &got))
}
}
func contains(rs []string, want string) bool {
for _, r := range rs {
if r == want {
return true
}
}
return false
}
+27 -11
View File
@@ -33,20 +33,36 @@ type identityFile struct {
KexPriv string `json:"kex_priv"`
}
// LoadIdentity loads an existing identity from path. Unlike LoadOrCreateIdentity
// it NEVER creates one: a missing or unreadable file is an error. It is for
// callers that must consume a specific, pre-provisioned identity rather than mint
// a fresh one — for example membershipd's persisted internal service identity,
// which `membershipd user add --store kv` reads to present the privileged nkey
// the cluster authenticator recognizes.
func LoadIdentity(path string) (cs.Identity, error) {
data, err := os.ReadFile(path)
if err != nil {
return cs.Identity{}, fmt.Errorf("client: read identity %q: %w", path, err)
}
var f identityFile
if err := json.Unmarshal(data, &f); err != nil {
return cs.Identity{}, fmt.Errorf("client: parse identity %q: %w", path, err)
}
id, err := f.toIdentity()
if err != nil {
return cs.Identity{}, fmt.Errorf("client: decode identity %q: %w", path, err)
}
return id, nil
}
// LoadOrCreateIdentity loads the identity at path, or generates and persists a
// new one if the file does not exist. The file is written with 0600
// permissions because it holds private keys.
// permissions because it holds private keys. A file that exists but is
// unreadable or corrupt is an error (NOT silently regenerated), so a damaged
// identity surfaces instead of minting a new key that cannot decrypt old data.
func LoadOrCreateIdentity(path string) (cs.Identity, error) {
if data, err := os.ReadFile(path); err == nil {
var f identityFile
if err := json.Unmarshal(data, &f); err != nil {
return cs.Identity{}, fmt.Errorf("client: parse identity %q: %w", path, err)
}
id, err := f.toIdentity()
if err != nil {
return cs.Identity{}, fmt.Errorf("client: decode identity %q: %w", path, err)
}
return id, nil
if _, statErr := os.Stat(path); statErr == nil {
return LoadIdentity(path)
}
id, err := cs.GenerateIdentity()
+154
View File
@@ -0,0 +1,154 @@
package client_test
import (
"sync"
"testing"
"time"
"github.com/enmanuel/unibus/pkg/client"
"github.com/enmanuel/unibus/pkg/frame"
"github.com/enmanuel/unibus/pkg/room"
"github.com/nats-io/nats.go"
)
// TestReaudit_SigNilSpoof ports the re-auditor's N3 (Alto) finding: in a room
// that REQUIRES per-message signatures, an attacker with data-plane access
// publishes a raw frame with Sig==nil and a forged Sender. Before the fix
// processFrame verified the signature only when one was present
// (`SignMsgs && f.Sig != nil`), so the receiver accepted the unsigned, forged
// frame as authentic. The fix drops any unsigned frame in a SignMsgs room.
//
// Coverage:
// - golden: a properly signed frame from a real member IS delivered;
// - error : an unsigned frame with a forged Sender in a SignMsgs room is DROPPED;
// - edge : a room WITHOUT SignMsgs still delivers an unsigned frame (the drop
// is specific to signed rooms, not a blanket reject of unsigned frames).
func TestReaudit_SigNilSpoof(t *testing.T) {
h := newHarness(t)
waitHealth(t, h.ctrlURL)
alice, err := client.New(h.natsURL, h.ctrlURL, mustIdentity(t))
if err != nil {
t.Fatalf("connect alice: %v", err)
}
defer alice.Close()
bob, err := client.New(h.natsURL, h.ctrlURL, mustIdentity(t))
if err != nil {
t.Fatalf("connect bob: %v", err)
}
defer bob.Close()
// A signed-but-NOT-encrypted room: SignMsgs enforces authorship, and the lack
// of encryption is exactly the case the auditor flagged as Alto (any peer with
// the subject can forge a sender if signatures are not strictly required).
const subject = "room.signed.spoof"
signedPolicy := room.Policy{Encrypt: false, Persist: false, SignMsgs: true}
roomID, err := alice.CreateRoom(subject, signedPolicy)
if err != nil {
t.Fatalf("alice create signed room: %v", err)
}
if err := alice.Invite(roomID, bob.Endpoint()); err != nil {
t.Fatalf("alice invite bob: %v", err)
}
if err := bob.Join(roomID); err != nil {
t.Fatalf("bob join: %v", err)
}
var mu sync.Mutex
var got []string
sub, err := bob.Subscribe(roomID, func(_ frame.Frame, plaintext []byte) {
mu.Lock()
got = append(got, string(plaintext))
mu.Unlock()
})
if err != nil {
t.Fatalf("bob subscribe: %v", err)
}
defer sub.Unsubscribe()
time.Sleep(150 * time.Millisecond)
// Attacker: a raw NATS connection (the dev harness leaves the data plane open),
// no identity, forged Sender, NO signature.
const spoofMsg = "I am totally the victim"
rawAtk, err := nats.Connect(h.natsURL)
if err != nil {
t.Fatalf("attacker raw connect: %v", err)
}
defer rawAtk.Close()
spoof := frame.Frame{
Type: frame.PUB,
Subject: subject,
Sender: "victim-forged-endpoint",
MsgID: "spoof-1",
Epoch: 1,
Payload: []byte(spoofMsg),
// Sig intentionally nil — this is the attack.
}
sb, err := spoof.Marshal()
if err != nil {
t.Fatalf("marshal spoof: %v", err)
}
if err := rawAtk.Publish(subject, sb); err != nil {
t.Fatalf("attacker publish: %v", err)
}
_ = rawAtk.Flush()
// Golden: alice's properly signed frame must be delivered.
const goodMsg = "authentic from alice"
if err := alice.Publish(roomID, []byte(goodMsg)); err != nil {
t.Fatalf("alice publish: %v", err)
}
if !waitFor(&mu, &got, func(rs []string) bool {
for _, r := range rs {
if r == goodMsg {
return true
}
}
return false
}, 2*time.Second) {
t.Fatalf("a properly signed frame should be delivered; got %v", snapshot(&mu, &got))
}
// Error path: the unsigned, forged frame must NEVER reach the handler.
for _, r := range snapshot(&mu, &got) {
if r == spoofMsg {
t.Fatalf("SIG-NIL SPOOF: receiver accepted an unsigned frame with a forged Sender in a SignMsgs room")
}
}
// Edge: a room WITHOUT SignMsgs still delivers an unsigned raw frame, proving
// the drop is scoped to signed rooms and did not break the plain-NATS path.
const subjectOpen = "room.open.nosig"
openRoom, err := alice.CreateRoom(subjectOpen, room.ModeNATS)
if err != nil {
t.Fatalf("alice create open room: %v", err)
}
openCol := subscribeCollect(t, alice, openRoom)
defer openCol.sub.Unsubscribe()
time.Sleep(150 * time.Millisecond)
const openMsg = "unsigned but allowed here"
openFrame := frame.Frame{
Type: frame.PUB,
Subject: subjectOpen,
Sender: "anyone",
MsgID: "open-1",
Payload: []byte(openMsg),
// no Sig — fine in a non-signed room
}
ob, _ := openFrame.Marshal()
if err := rawAtk.Publish(subjectOpen, ob); err != nil {
t.Fatalf("publish open frame: %v", err)
}
_ = rawAtk.Flush()
if !waitFor(&openCol.mu, &openCol.msgs, func(rs []string) bool {
for _, r := range rs {
if r == openMsg {
return true
}
}
return false
}, 2*time.Second) {
t.Fatalf("an unsigned frame in a non-signed room should be delivered; got %v", snapshot(&openCol.mu, &openCol.msgs))
}
}
+185
View File
@@ -0,0 +1,185 @@
package client_test
import (
"crypto/ecdsa"
"crypto/elliptic"
"crypto/rand"
"crypto/tls"
"crypto/x509"
"crypto/x509/pkix"
"encoding/hex"
"encoding/pem"
"math/big"
"net"
"sync"
"testing"
"time"
"github.com/enmanuel/unibus/pkg/client"
"github.com/enmanuel/unibus/pkg/frame"
"github.com/enmanuel/unibus/pkg/membership"
"github.com/enmanuel/unibus/pkg/room"
)
// genTestCA mints a throwaway self-signed CA plus a server certificate (SAN
// 127.0.0.1 / localhost) signed by it, mirroring deploy/tls/generate-certs.sh
// without shelling out to openssl. It returns the server's *tls.Config (cert it
// presents) and the CA pool a client must trust to complete the handshake.
func genTestCA(t *testing.T) (server *tls.Config, caPool *x509.CertPool) {
t.Helper()
// --- CA ---
caKey, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
if err != nil {
t.Fatalf("ca key: %v", err)
}
caTmpl := &x509.Certificate{
SerialNumber: big.NewInt(1),
Subject: pkix.Name{CommonName: "unibus-test-ca"},
NotBefore: time.Now().Add(-time.Hour),
NotAfter: time.Now().Add(24 * time.Hour),
IsCA: true,
KeyUsage: x509.KeyUsageCertSign | x509.KeyUsageDigitalSignature,
BasicConstraintsValid: true,
}
caDER, err := x509.CreateCertificate(rand.Reader, caTmpl, caTmpl, &caKey.PublicKey, caKey)
if err != nil {
t.Fatalf("ca cert: %v", err)
}
caCert, err := x509.ParseCertificate(caDER)
if err != nil {
t.Fatalf("parse ca: %v", err)
}
// --- server cert signed by the CA ---
srvKey, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
if err != nil {
t.Fatalf("server key: %v", err)
}
srvTmpl := &x509.Certificate{
SerialNumber: big.NewInt(2),
Subject: pkix.Name{CommonName: "unibus-test-server"},
NotBefore: time.Now().Add(-time.Hour),
NotAfter: time.Now().Add(24 * time.Hour),
KeyUsage: x509.KeyUsageDigitalSignature,
ExtKeyUsage: []x509.ExtKeyUsage{x509.ExtKeyUsageServerAuth},
DNSNames: []string{"localhost"},
IPAddresses: []net.IP{net.IPv4(127, 0, 0, 1)},
}
srvDER, err := x509.CreateCertificate(rand.Reader, srvTmpl, caCert, &srvKey.PublicKey, caKey)
if err != nil {
t.Fatalf("server cert: %v", err)
}
srvCertPEM := pem.EncodeToMemory(&pem.Block{Type: "CERTIFICATE", Bytes: srvDER})
srvKeyDER, err := x509.MarshalECPrivateKey(srvKey)
if err != nil {
t.Fatalf("marshal server key: %v", err)
}
srvKeyPEM := pem.EncodeToMemory(&pem.Block{Type: "EC PRIVATE KEY", Bytes: srvKeyDER})
srvPair, err := tls.X509KeyPair(srvCertPEM, srvKeyPEM)
if err != nil {
t.Fatalf("server keypair: %v", err)
}
pool := x509.NewCertPool()
pool.AddCert(caCert)
return &tls.Config{Certificates: []tls.Certificate{srvPair}, MinVersion: tls.VersionTLS12}, pool
}
// TestNatsTLS validates the TLS data plane: a client trusting the bus CA
// completes the handshake and uses the bus (golden); a client that does NOT
// trust the CA fails the handshake (error path).
func TestNatsTLS(t *testing.T) {
serverTLS, caPool := genTestCA(t)
h := bootHarness(t, membership.AuthOff, false, serverTLS)
waitHealth(t, h.ctrlURL)
// Golden: client pinning the CA connects over TLS and operates.
clientTLS := &tls.Config{RootCAs: caPool, MinVersion: tls.VersionTLS12}
a, err := client.NewWithOptions(h.natsURL, h.ctrlURL, mustIdentity(t), client.Options{TLS: clientTLS})
if err != nil {
t.Fatalf("client trusting the CA should complete the TLS handshake: %v", err)
}
defer a.Close()
if _, err := a.CreateRoom("room.tls", room.ModeNATS); err != nil {
t.Fatalf("TLS client should operate on the bus: %v", err)
}
// Error path: a client that does not trust the CA fails the handshake. Use an
// empty pool (system roots would also reject this private CA, but an empty
// pool makes the intent explicit and avoids depending on the host's roots).
badTLS := &tls.Config{RootCAs: x509.NewCertPool(), MinVersion: tls.VersionTLS12}
if _, err := client.NewWithOptions(h.natsURL, h.ctrlURL, mustIdentity(t), client.Options{TLS: badTLS}); err == nil {
t.Fatalf("client without the CA must fail the TLS handshake")
}
}
// TestSecureBusEndToEnd is the headline golden of issue 0001: with ALL three
// layers active at once — control-plane request signing (enforce), NATS nkey
// auth, and TLS — two registered peers run an encrypted room end to end. A
// creates a Matrix-policy room, invites B, A publishes and B decrypts. This
// proves the layers compose: signed HTTP control plane + authenticated,
// encrypted data plane + E2E room content.
func TestSecureBusEndToEnd(t *testing.T) {
serverTLS, caPool := genTestCA(t)
h := bootHarness(t, membership.AuthEnforce, true, serverTLS)
waitHealth(t, h.ctrlURL)
clientTLS := &tls.Config{RootCAs: caPool, MinVersion: tls.VersionTLS12}
secure := func(t *testing.T, handle string) (*client.Client, membership.AuthMode) {
id := mustIdentity(t)
if err := h.store.AddUser(hex.EncodeToString(id.SignPub), handle, membership.RoleMember); err != nil {
t.Fatalf("register %s: %v", handle, err)
}
c, err := client.NewWithOptions(h.natsURL, h.ctrlURL, id, client.Options{UseNkey: true, TLS: clientTLS})
if err != nil {
t.Fatalf("connect %s securely: %v", handle, err)
}
return c, 0
}
a, _ := secure(t, "alice")
defer a.Close()
b, _ := secure(t, "bob")
defer b.Close()
roomID, err := a.CreateRoom("room.secure", room.ModeMatrix)
if err != nil {
t.Fatalf("A create encrypted room over secure bus: %v", err)
}
if err := a.Invite(roomID, b.Endpoint()); err != nil {
t.Fatalf("A invite B: %v", err)
}
if err := b.Join(roomID); err != nil {
t.Fatalf("B join: %v", err)
}
var mu sync.Mutex
var got []string
sub, err := b.Subscribe(roomID, func(_ frame.Frame, plaintext []byte) {
mu.Lock()
got = append(got, string(plaintext))
mu.Unlock()
})
if err != nil {
t.Fatalf("B subscribe: %v", err)
}
defer sub.Unsubscribe()
time.Sleep(150 * time.Millisecond)
const msg = "mensaje sobre bus seguro (auth+TLS+E2E)"
if err := a.Publish(roomID, []byte(msg)); err != nil {
t.Fatalf("A publish: %v", err)
}
if !waitFor(&mu, &got, func(rs []string) bool {
for _, r := range rs {
if r == msg {
return true
}
}
return false
}, 2*time.Second) {
t.Fatalf("B did not receive/decrypt the message over the secured bus; got %v", snapshot(&mu, &got))
}
}
+99
View File
@@ -0,0 +1,99 @@
package client_test
import (
"encoding/hex"
"strings"
"testing"
"github.com/enmanuel/unibus/pkg/client"
"github.com/enmanuel/unibus/pkg/membership"
)
// findUserInfo returns the row with the given signing key (case-insensitive).
func findUserInfo(users []client.UserInfo, signPub string) (client.UserInfo, bool) {
want := strings.ToLower(signPub)
for _, u := range users {
if strings.ToLower(u.SignPub) == want {
return u, true
}
}
return client.UserInfo{}, false
}
// TestClientUsersAdminAPI drives the admin user-management API through the real
// pkg/client methods against an in-process membershipd under enforce: an admin
// client adds a user, lists it, revokes it, and sees the status flip — and a
// non-admin client is denied. This is the path the admin panel uses, so it locks
// the client/server contract the panel depends on.
func TestClientUsersAdminAPI(t *testing.T) {
h := newHarnessMode(t, membership.AuthEnforce)
waitHealth(t, h.ctrlURL)
admin, err := client.New(h.natsURL, h.ctrlURL, mustIdentity(t))
if err != nil {
t.Fatalf("connect admin: %v", err)
}
defer admin.Close()
registerClient(t, h, admin, "admin", membership.RoleAdmin)
member, err := client.New(h.natsURL, h.ctrlURL, mustIdentity(t))
if err != nil {
t.Fatalf("connect member: %v", err)
}
defer member.Close()
registerClient(t, h, member, "member", membership.RoleMember)
// A brand-new identity the admin will register over HTTP.
carol := mustIdentity(t)
carolPub := hex.EncodeToString(carol.SignPub)
// Admin adds carol as a member.
if err := admin.AddUser(carolPub, "carol", membership.RoleMember); err != nil {
t.Fatalf("admin AddUser: %v", err)
}
// Admin lists: carol present and active.
users, err := admin.ListUsers()
if err != nil {
t.Fatalf("admin ListUsers: %v", err)
}
row, ok := findUserInfo(users, carolPub)
if !ok {
t.Fatalf("carol missing from list after add: %+v", users)
}
if row.Status != membership.StatusActive || row.Role != membership.RoleMember {
t.Fatalf("carol row wrong after add: %+v", row)
}
// Re-adding the same key is a conflict surfaced as an error (no silent upsert).
if err := admin.AddUser(carolPub, "carol-again", membership.RoleAdmin); err == nil {
t.Fatalf("re-adding carol should error (409), got nil")
}
// Admin revokes carol; list shows the status flip (no hard delete).
if err := admin.RevokeUser(carolPub); err != nil {
t.Fatalf("admin RevokeUser: %v", err)
}
users, err = admin.ListUsers()
if err != nil {
t.Fatalf("admin ListUsers after revoke: %v", err)
}
row, ok = findUserInfo(users, carolPub)
if !ok {
t.Fatalf("carol vanished after revoke (should be a status flip): %+v", users)
}
if row.Status != membership.StatusRevoked {
t.Fatalf("carol should be revoked, got status %q", row.Status)
}
// A non-admin (member) is denied on every user-management method.
if _, err := member.ListUsers(); err == nil {
t.Fatalf("non-admin ListUsers should error (403), got nil")
}
if err := member.AddUser(carolPub, "x", membership.RoleMember); err == nil {
t.Fatalf("non-admin AddUser should error (403), got nil")
}
if err := member.RevokeUser(carolPub); err == nil {
t.Fatalf("non-admin RevokeUser should error (403), got nil")
}
}
+344
View File
@@ -0,0 +1,344 @@
package embeddednats_test
import (
"crypto/ecdsa"
"crypto/elliptic"
"crypto/rand"
"crypto/x509"
"crypto/x509/pkix"
"encoding/pem"
"fmt"
"math/big"
"net"
"os"
"path/filepath"
"testing"
"time"
"github.com/enmanuel/unibus/pkg/busauth"
"github.com/enmanuel/unibus/pkg/embeddednats"
"github.com/nats-io/nats.go"
server "github.com/nats-io/nats-server/v2/server"
)
// freePort returns an OS-assigned free TCP port on loopback.
func freePort(t *testing.T) int {
t.Helper()
l, err := net.Listen("tcp", "127.0.0.1:0")
if err != nil {
t.Fatalf("free port: %v", err)
}
defer l.Close()
return l.Addr().(*net.TCPAddr).Port
}
// startNode boots a clustered embedded NATS node. peerRoutePorts are the route
// ports of the OTHER nodes; user/pass gate the route layer (empty disables it);
// routeTLS, when non-nil, secures the routes with mutual TLS.
func startNode(t *testing.T, name string, clientPort, routePort int, peerRoutePorts []int, user, pass string, routeTLS *clusterTLS) *server.Server {
t.Helper()
routes := make([]string, 0, len(peerRoutePorts))
for _, p := range peerRoutePorts {
// Carry the cluster credentials in the route URL so this node
// authenticates outbound to its peers' route listeners.
if user != "" {
routes = append(routes, fmt.Sprintf("nats://%s:%s@127.0.0.1:%d", user, pass, p))
} else {
routes = append(routes, fmt.Sprintf("nats://127.0.0.1:%d", p))
}
}
cc := &embeddednats.ClusterConfig{
Name: "unibus-test",
Host: "127.0.0.1",
Port: routePort,
Routes: routes,
Username: user,
Password: pass,
}
if routeTLS != nil {
cfg, err := busauth.RouteTLSConfig(routeTLS.cert, routeTLS.key, routeTLS.ca)
if err != nil {
t.Fatalf("route TLS for %s: %v", name, err)
}
cc.TLS = cfg
}
ns, err := embeddednats.StartServer(embeddednats.ServerConfig{
StoreDir: t.TempDir(),
Host: "127.0.0.1",
Port: clientPort,
ServerName: name,
Cluster: cc,
})
if err != nil {
t.Fatalf("start node %s: %v", name, err)
}
t.Cleanup(func() { ns.Shutdown(); ns.WaitForShutdown() })
return ns
}
// waitRoutes waits until ns has at least want established routes, or fails.
func waitRoutes(t *testing.T, ns *server.Server, want int) {
t.Helper()
deadline := time.Now().Add(8 * time.Second)
for time.Now().Before(deadline) {
if ns.NumRoutes() >= want {
return
}
time.Sleep(50 * time.Millisecond)
}
t.Fatalf("node %q never reached %d routes (have %d)", ns.Name(), want, ns.NumRoutes())
}
// stableRouteCount waits for ns's route count to stop changing (the NATS route
// pool opens several connections per peer asynchronously) and returns it, so a
// test can use it as a baseline that an impostor must not increase.
func stableRouteCount(t *testing.T, ns *server.Server) int {
t.Helper()
prev := -1
stableSince := time.Now()
deadline := time.Now().Add(5 * time.Second)
for time.Now().Before(deadline) {
n := ns.NumRoutes()
if n != prev {
prev = n
stableSince = time.Now()
} else if time.Since(stableSince) >= 750*time.Millisecond {
return n
}
time.Sleep(50 * time.Millisecond)
}
return prev
}
// pubSubAcrossNodes connects a subscriber to subURL and a publisher to pubURL,
// publishes one message on subject, and reports whether it arrived within 3s.
// This proves the cluster forwards client subjects between nodes.
func pubSubAcrossNodes(t *testing.T, subURL, pubURL, subject, payload string) bool {
t.Helper()
subConn, err := nats.Connect(subURL)
if err != nil {
t.Fatalf("subscriber connect %s: %v", subURL, err)
}
defer subConn.Close()
got := make(chan string, 1)
if _, err := subConn.Subscribe(subject, func(m *nats.Msg) {
select {
case got <- string(m.Data):
default:
}
}); err != nil {
t.Fatalf("subscribe: %v", err)
}
if err := subConn.Flush(); err != nil {
t.Fatalf("flush sub: %v", err)
}
pubConn, err := nats.Connect(pubURL)
if err != nil {
t.Fatalf("publisher connect %s: %v", pubURL, err)
}
defer pubConn.Close()
// Retry the publish for a moment: route interest propagation across the
// cluster is asynchronous, so the very first publish can race the gossip.
deadline := time.Now().Add(3 * time.Second)
for time.Now().Before(deadline) {
if err := pubConn.Publish(subject, []byte(payload)); err != nil {
t.Fatalf("publish: %v", err)
}
_ = pubConn.Flush()
select {
case v := <-got:
return v == payload
case <-time.After(100 * time.Millisecond):
}
}
return false
}
// --- golden: two-node cluster forwards client subjects across nodes ----------
func TestClusterForwardsAcrossNodes(t *testing.T) {
rp0, rp1 := freePort(t), freePort(t)
n0 := startNode(t, "n0", freePort(t), rp0, []int{rp1}, "clusteruser", "clusterpass", nil)
n1 := startNode(t, "n1", freePort(t), rp1, []int{rp0}, "clusteruser", "clusterpass", nil)
waitRoutes(t, n0, 1)
waitRoutes(t, n1, 1)
if !pubSubAcrossNodes(t, n0.ClientURL(), n1.ClientURL(), "test.cross", "hello-cluster") {
t.Fatalf("subject published on n1 did not reach subscriber on n0")
}
}
// --- edge: three-node cluster (HA shape) forwards between non-adjacent nodes --
func TestClusterThreeNodesForward(t *testing.T) {
rp0, rp1, rp2 := freePort(t), freePort(t), freePort(t)
n0 := startNode(t, "n0", freePort(t), rp0, []int{rp1, rp2}, "u", "p", nil)
n1 := startNode(t, "n1", freePort(t), rp1, []int{rp0, rp2}, "u", "p", nil)
n2 := startNode(t, "n2", freePort(t), rp2, []int{rp0, rp1}, "u", "p", nil)
waitRoutes(t, n0, 2)
waitRoutes(t, n1, 2)
waitRoutes(t, n2, 2)
// Publish on n2, subscribe on n0: a message must traverse the cluster.
if !pubSubAcrossNodes(t, n0.ClientURL(), n2.ClientURL(), "test.ha", "three-node") {
t.Fatalf("subject published on n2 did not reach subscriber on n0")
}
}
// --- error: a node with the wrong cluster password is rejected as a route -----
func TestClusterRejectsBadRouteAuth(t *testing.T) {
rp0, rp1 := freePort(t), freePort(t)
good := startNode(t, "good", freePort(t), rp0, []int{rp1}, "secret", "right", nil)
_ = startNode(t, "peer", freePort(t), rp1, []int{rp0}, "secret", "right", nil)
waitRoutes(t, good, 1)
// Let the route pool settle so the baseline count is stable (NATS opens a
// pool of route connections per peer, so NumRoutes counts connections, not
// distinct peers).
base := stableRouteCount(t, good)
// Impostor knows the addresses but not the cluster password. It tries to
// route to `good`; the route handshake must be rejected, so the impostor
// never establishes a route.
impostor := startNode(t, "impostor", freePort(t), freePort(t), []int{rp0}, "secret", "WRONG", nil)
// Give the route layer ample time to (fail to) connect, then assert it never
// formed: the impostor has zero routes, and `good`'s route count is unchanged
// (it did not accept a route from the impostor).
time.Sleep(2 * time.Second)
if n := impostor.NumRoutes(); n != 0 {
t.Fatalf("impostor with wrong cluster password formed %d routes, want 0", n)
}
if n := good.NumRoutes(); n != base {
t.Fatalf("legit node route count changed from %d to %d after impostor attempt (it accepted the impostor)", base, n)
}
}
// --- golden (TLS): mutual-TLS routes forward across nodes ---------------------
func TestClusterMutualTLSForwards(t *testing.T) {
ca, caKey := genCA(t)
dir := t.TempDir()
tlsA := writeNodeCert(t, dir, "a", ca, caKey)
tlsB := writeNodeCert(t, dir, "b", ca, caKey)
rp0, rp1 := freePort(t), freePort(t)
n0 := startNode(t, "n0", freePort(t), rp0, []int{rp1}, "u", "p", tlsA)
n1 := startNode(t, "n1", freePort(t), rp1, []int{rp0}, "u", "p", tlsB)
waitRoutes(t, n0, 1)
waitRoutes(t, n1, 1)
if !pubSubAcrossNodes(t, n0.ClientURL(), n1.ClientURL(), "test.tls", "mtls-ok") {
t.Fatalf("subject did not cross the mutual-TLS cluster")
}
}
// --- error (TLS): a node whose cert is not signed by the bus CA cannot join ---
func TestClusterRejectsUnsignedNode(t *testing.T) {
ca, caKey := genCA(t)
dir := t.TempDir()
tlsGood := writeNodeCert(t, dir, "good", ca, caKey)
tlsPeer := writeNodeCert(t, dir, "peer", ca, caKey)
// The impostor signs its node cert with a DIFFERENT CA, and pins only that
// CA. The legit nodes' RequireAndVerifyClientCert against the bus CA rejects
// it; the impostor likewise rejects the legit node's cert. No route forms.
otherCA, otherKey := genCA(t)
tlsImpostor := writeNodeCert(t, dir, "impostor", otherCA, otherKey)
rp0, rp1 := freePort(t), freePort(t)
good := startNode(t, "good", freePort(t), rp0, []int{rp1}, "u", "p", tlsGood)
_ = startNode(t, "peer", freePort(t), rp1, []int{rp0}, "u", "p", tlsPeer)
waitRoutes(t, good, 1)
base := stableRouteCount(t, good)
impostor := startNode(t, "impostor", freePort(t), freePort(t), []int{rp0}, "u", "p", tlsImpostor)
time.Sleep(2 * time.Second)
if n := impostor.NumRoutes(); n != 0 {
t.Fatalf("impostor with unsigned cert formed %d routes, want 0", n)
}
if n := good.NumRoutes(); n != base {
t.Fatalf("legit node route count changed from %d to %d after unsigned impostor attempt (it accepted the impostor)", base, n)
}
}
// --- cert helpers ------------------------------------------------------------
type clusterTLS struct{ cert, key, ca string } // PEM file paths
// genCA creates a self-signed ECDSA CA certificate and its key.
func genCA(t *testing.T) (*x509.Certificate, *ecdsa.PrivateKey) {
t.Helper()
key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
if err != nil {
t.Fatalf("gen CA key: %v", err)
}
tmpl := &x509.Certificate{
SerialNumber: big.NewInt(1),
Subject: pkix.Name{CommonName: "unibus-test-CA"},
NotBefore: time.Now().Add(-time.Hour),
NotAfter: time.Now().Add(24 * time.Hour),
KeyUsage: x509.KeyUsageCertSign | x509.KeyUsageDigitalSignature,
BasicConstraintsValid: true,
IsCA: true,
}
der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
if err != nil {
t.Fatalf("create CA cert: %v", err)
}
caCert, err := x509.ParseCertificate(der)
if err != nil {
t.Fatalf("parse CA cert: %v", err)
}
return caCert, key
}
// writeNodeCert issues a node certificate signed by ca (SAN 127.0.0.1/::1,
// usable as both server and client) and writes cert/key/ca PEM files, returning
// their paths for RouteTLSConfig.
func writeNodeCert(t *testing.T, dir, name string, ca *x509.Certificate, caKey *ecdsa.PrivateKey) *clusterTLS {
t.Helper()
key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
if err != nil {
t.Fatalf("gen node key: %v", err)
}
tmpl := &x509.Certificate{
SerialNumber: big.NewInt(time.Now().UnixNano()),
Subject: pkix.Name{CommonName: name},
NotBefore: time.Now().Add(-time.Hour),
NotAfter: time.Now().Add(24 * time.Hour),
KeyUsage: x509.KeyUsageDigitalSignature | x509.KeyUsageKeyEncipherment,
ExtKeyUsage: []x509.ExtKeyUsage{x509.ExtKeyUsageServerAuth, x509.ExtKeyUsageClientAuth},
IPAddresses: []net.IP{net.ParseIP("127.0.0.1"), net.ParseIP("::1")},
DNSNames: []string{"localhost"},
}
der, err := x509.CreateCertificate(rand.Reader, tmpl, ca, &key.PublicKey, caKey)
if err != nil {
t.Fatalf("create node cert: %v", err)
}
certPath := filepath.Join(dir, name+".crt")
keyPath := filepath.Join(dir, name+".key")
caPath := filepath.Join(dir, name+"-ca.crt")
writePEM(t, certPath, "CERTIFICATE", der)
keyDER, err := x509.MarshalECPrivateKey(key)
if err != nil {
t.Fatalf("marshal node key: %v", err)
}
writePEM(t, keyPath, "EC PRIVATE KEY", keyDER)
writePEM(t, caPath, "CERTIFICATE", ca.Raw)
return &clusterTLS{cert: certPath, key: keyPath, ca: caPath}
}
func writePEM(t *testing.T, path, blockType string, der []byte) {
t.Helper()
b := pem.EncodeToMemory(&pem.Block{Type: blockType, Bytes: der})
if err := os.WriteFile(path, b, 0o600); err != nil {
t.Fatalf("write %s: %v", path, err)
}
}
+172 -13
View File
@@ -6,22 +6,85 @@
package embeddednats
import (
"crypto/tls"
"fmt"
"net/url"
"os"
"time"
server "github.com/nats-io/nats-server/v2/server"
)
// Start launches an embedded nats-server with JetStream enabled, listening on
// the given port and persisting JetStream state under storeDir. The listen host
// is left at the nats-server default ("0.0.0.0", all interfaces). It blocks
// until the server is ready to accept connections (up to 5s) and returns the
// running server. The caller is responsible for calling Shutdown on it.
// ClusterConfig configures the route layer that links several embedded NATS
// servers into a single cluster (issue 0003a). It is the data-plane side of
// high availability: with a cluster, a client subject published on one node is
// forwarded to subscribers connected to any other node, and (with JetStream
// replicas > 1) streams/KV are RAFT-replicated across nodes so the loss of one
// node does not lose the bus.
//
// Start is a thin backward-compatible wrapper over StartHost; callers that need
// to control the bind interface (loopback vs LAN) should use StartHost directly.
// The route layer is a SEPARATE trust boundary from the client data plane: it
// carries server-to-server traffic, so it authenticates NODES, not bus users.
// Never reuse the nkey client authenticator here. Routes are secured with their
// own shared secret (Username/Password -> NATS Cluster.Authorization) and their
// own mutual TLS (TLS, built from the bus CA with busauth.RouteTLSConfig): a
// node without the cluster secret and a CA-signed node certificate cannot join
// the cluster nor inject messages into it.
type ClusterConfig struct {
// Name is the cluster name; it MUST be identical on every node or the
// servers refuse to gossip routes to each other.
Name string
// Host and Port are the route listener (server-to-server), distinct from the
// client Host/Port. Use a free, non-client port (e.g. 6250).
Host string
Port int
// Routes are the nats-route URLs of the OTHER nodes, e.g.
// "nats://user:pass@10.0.0.2:6250". When the route layer is password
// protected each URL must carry the same userinfo as the local Username /
// Password so this node authenticates outbound to its peers.
Routes []string
// Username and Password gate the route listener (NATS Cluster.Authorization).
// A peer (or impostor) that connects to this node's route port without these
// credentials is rejected, so it never becomes a route. Empty disables route
// auth (dev / trusted-network only).
Username string
Password string
// TLS, when non-nil, secures the route connections with mutual TLS. Build it
// with busauth.RouteTLSConfig(cert, key, ca): the server presents its node
// certificate AND requires+verifies the connecting node's certificate against
// the bus CA, so an unsigned impostor cannot establish a route even with the
// right password. Nil keeps routes plaintext (dev / WireGuard-only).
TLS *tls.Config
}
// ServerConfig is the full set of knobs for the embedded NATS server. The zero
// value (empty StoreDir aside) yields a dev-friendly server: JetStream on, bound
// to all interfaces, no client auth, no TLS, standalone (no cluster). Secured
// deployments set Auth and TLS; HA deployments set ServerName + Cluster; tests
// set Host to loopback and a free Port.
type ServerConfig struct {
StoreDir string // JetStream store directory
Host string // bind interface; "" = nats-server default ("0.0.0.0")
Port int // listen port
// ServerName is this node's unique name within the cluster. JetStream's RAFT
// layer requires a stable, unique name per node to form its meta-group; leave
// it empty for a standalone server (nats-server then auto-generates one).
ServerName string
// Auth, when non-nil, is installed as CustomClientAuthentication so the data
// plane only accepts approved clients (nkey signature + bus allowlist).
Auth server.Authentication
// TLS, when non-nil, makes the server present a certificate and require TLS
// on the data plane. Clients must trust the issuing CA (see busauth).
TLS *tls.Config
// Cluster, when non-nil, joins this server to a route cluster for high
// availability (issue 0003a). Nil keeps the server standalone (the legacy
// single-node behavior).
Cluster *ClusterConfig
}
// Start is a thin backward-compatible wrapper: embedded JetStream server on the
// default interface, no auth, no TLS.
func Start(storeDir string, port int) (*server.Server, error) {
return StartHost(storeDir, "", port)
return StartServer(ServerConfig{StoreDir: storeDir, Port: port})
}
// StartHost is Start with explicit control over the bind interface. host selects
@@ -30,15 +93,64 @@ func Start(storeDir string, port int) (*server.Server, error) {
// to expose it to the LAN so remote peers (phones, other PCs) can connect. An
// empty host falls back to the nats-server default ("0.0.0.0", all interfaces).
func StartHost(storeDir, host string, port int) (*server.Server, error) {
return StartServer(ServerConfig{StoreDir: storeDir, Host: host, Port: port})
}
// StartHostAuth is StartHost with an optional custom client authenticator. When
// auth is non-nil only clients the authenticator approves may connect; when nil
// the server accepts any client (legacy, network-trusted behavior).
func StartHostAuth(storeDir, host string, port int, auth server.Authentication) (*server.Server, error) {
return StartServer(ServerConfig{StoreDir: storeDir, Host: host, Port: port, Auth: auth})
}
// StartServer launches an embedded nats-server with JetStream from cfg. It
// blocks until the server is ready to accept connections (up to 5s) and returns
// the running server; the caller must Shutdown it.
func StartServer(cfg ServerConfig) (*server.Server, error) {
// Diagnostic toggle: UNIBUS_NATS_DEBUG=1 enables the embedded nats-server's own
// logger (route/RAFT/JetStream errors), which is otherwise silenced. Off by
// default so production behavior is unchanged; only set it when debugging the
// cluster route layer.
debugLevel := os.Getenv("UNIBUS_NATS_DEBUG")
debugNATS := debugLevel == "1" || debugLevel == "2"
traceNATS := debugLevel == "2"
opts := &server.Options{
JetStream: true,
StoreDir: storeDir,
Host: host,
Port: port,
StoreDir: cfg.StoreDir,
Host: cfg.Host,
Port: cfg.Port,
ServerName: cfg.ServerName,
DontListen: false,
// Keep the embedded server quiet by default; the host app logs the URLs.
NoLog: true,
NoSigs: true,
NoLog: !debugNATS,
Debug: debugNATS,
Trace: traceNATS,
Logtime: true,
NoSigs: true,
}
if debugNATS {
// Expose the nats-server monitoring endpoint (loopback) so the operator can
// inspect /jsz, /routez, /varz while debugging the cluster meta-group.
opts.HTTPHost = "127.0.0.1"
opts.HTTPPort = 8222
}
if cfg.Auth != nil {
opts.CustomClientAuthentication = cfg.Auth
// A CustomClientAuthentication alone does not make the server advertise a
// nonce in its INFO line, and nats.go refuses to connect with an nkey to a
// server that does not ("nkeys not supported by the server"). Forcing the
// nonce makes nkey clients sign the challenge our authenticator verifies.
opts.AlwaysEnableNonce = true
}
if cfg.TLS != nil {
opts.TLSConfig = cfg.TLS
opts.TLS = true
}
if cfg.Cluster != nil {
if err := applyClusterOpts(opts, cfg.Cluster); err != nil {
return nil, err
}
}
ns, err := server.NewServer(opts)
@@ -46,6 +158,10 @@ func StartHost(storeDir, host string, port int) (*server.Server, error) {
return nil, fmt.Errorf("embeddednats: new server: %w", err)
}
if debugNATS {
ns.ConfigureLogger()
}
go ns.Start()
if !ns.ReadyForConnections(5 * time.Second) {
@@ -56,6 +172,49 @@ func StartHost(storeDir, host string, port int) (*server.Server, error) {
return ns, nil
}
// applyClusterOpts translates a ClusterConfig into the nats-server route options
// on opts: the cluster listener (name + host/port + shared-secret auth + mutual
// TLS) and the outbound routes to the other nodes. A malformed route URL is a
// configuration error and aborts startup rather than silently dropping a peer.
func applyClusterOpts(opts *server.Options, c *ClusterConfig) error {
opts.Cluster = server.ClusterOpts{
Name: c.Name,
Host: c.Host,
Port: c.Port,
Username: c.Username,
Password: c.Password,
// Disable route connection pooling (nats-server 2.10+ defaults to a pool of
// 3 connections per peer). On a small cluster the pool churns with
// "duplicate route"/"client closed" reconnects that interrupt the meta-group
// RAFT heartbeats, causing perpetual leader re-elections so the JetStream
// meta never becomes current and stream/KV creation hangs (issue 0006g).
// PoolSize=-1 forces the classic single route per peer, which is stable for
// the 3-node unibus cluster.
PoolSize: -1,
// NoAdvertise stops the server from gossiping its locally-discovered IPs to
// peers. The cluster nodes are Docker hosts, so without this NATS advertises
// the docker bridge addresses (172.x / 10.0.x) as reachable routes; peers
// then try to dial those private, mutually-unreachable IPs, churning the
// route layer and destabilizing the JetStream meta-group. With NoAdvertise
// the nodes use ONLY the explicit public-IP routes we configure (issue 0006g).
NoAdvertise: true,
}
if c.TLS != nil {
opts.Cluster.TLSConfig = c.TLS
// A generous handshake budget: route TLS does a mutual handshake and the
// peer may still be booting. The default 2s can flap on a cold cluster.
opts.Cluster.TLSTimeout = 5.0
}
for _, r := range c.Routes {
u, err := url.Parse(r)
if err != nil {
return fmt.Errorf("embeddednats: parse route %q: %w", r, err)
}
opts.Routes = append(opts.Routes, u)
}
return nil
}
// ClientURL returns a NATS connection URL for the running embedded server.
func ClientURL(ns *server.Server) string {
return ns.ClientURL()
+118
View File
@@ -0,0 +1,118 @@
package membership
// Per-subject data-plane access control derived from room membership (issue
// 0003e, audit H4 residual; tightened in issue 0006b for audit 0008 N2). The
// control plane already authorizes metadata by membership; this is the matching
// restriction on the NATS data plane so a registered peer can only
// publish/subscribe on the subjects of the rooms it actually belongs to — and can
// only reach the JetStream API of ITS OWN rooms' streams, never the control-plane
// KV buckets.
import (
"encoding/hex"
"fmt"
"strings"
"github.com/enmanuel/unibus/pkg/frame"
)
// clientInfraSubjects are the subjects every authorized peer needs regardless of
// room membership, kept deliberately MINIMAL (issue 0006b, audit 0008 N2):
//
// - "_INBOX.>" — request/reply plus the JetStream pull-consumer delivery
// and publish-ack inboxes.
// - "$JS.API.INFO" — account-level JetStream info (limits/usage counters). It
// exposes NO room/user/key contents, so granting it leaks nothing.
//
// It NO LONGER contains "$JS.API.>". That broad grant was the N2 leak: it let any
// registered peer drive the whole JetStream API and read the control-plane KV
// buckets (KV_UNIBUS_users/rooms/members/room_keys) and the object store directly
// over NATS, bypassing the HTTP authorization (requireMember and the own-endpoint
// checks). JetStream API access is now granted PER ROOM, scoped to the stream of
// each room the peer belongs to (jsSubjectsFor). Because the control-plane KV
// streams (KV_UNIBUS_*) and the object store (OBJ_UNIBUS_*) are never a room
// stream, they fall outside the closed allow set and are denied by default.
var clientInfraSubjects = []string{"_INBOX.>", "$JS.API.INFO"}
// roomStreamName is the JetStream stream name a persisted room maps to. It MUST
// stay identical to pkg/client.streamName ("UNIBUS_" + sanitized roomID) so the
// per-room ACL grants exactly the subjects the client's JetStream calls use. Room
// ids are ULIDs (no '.'), so the sanitizing is a no-op in practice, but the rule
// is replicated defensively so the producer (client) and the authorizer (this
// ACL) never drift apart.
func roomStreamName(roomID string) string {
var b strings.Builder
b.Grow(len("UNIBUS_") + len(roomID))
b.WriteString("UNIBUS_")
for _, r := range roomID {
switch {
case r >= 'a' && r <= 'z', r >= 'A' && r <= 'Z', r >= '0' && r <= '9', r == '_':
b.WriteRune(r)
default:
b.WriteRune('_')
}
}
return b.String()
}
// jsSubjectsFor returns the MINIMAL JetStream API subjects a peer needs to use the
// durable stream of ONE persisted room: create/update/info the stream, manage and
// pull from its durable consumer, and ack deliveries. Every subject embeds this
// room's stream name, so the grant cannot reach another room's stream nor any
// control-plane stream (KV_UNIBUS_* / OBJ_UNIBUS_*). The wildcard layout matches
// the NATS JetStream API subject grammar (the stream name is the trailing token
// of single-verb requests and follows a two-token verb for MSG.GET / MSG.NEXT /
// DURABLE.CREATE):
//
// $JS.API.STREAM.<verb>.<stream> verb in {CREATE,UPDATE,INFO,DELETE,PURGE,...}
// $JS.API.STREAM.MSG.<op>.<stream> op in {GET,DELETE}
// $JS.API.CONSUMER.<verb>.<stream> verb in {LIST,NAMES,CREATE(ephemeral)}
// $JS.API.CONSUMER.<verb>.<stream>.<consumer>... verb in {CREATE,INFO,DELETE}
// $JS.API.CONSUMER.<v1>.<v2>.<stream>.<cons> {MSG.NEXT, DURABLE.CREATE}
// $JS.ACK.<stream>.> message acknowledgements
func jsSubjectsFor(roomID string) []string {
s := roomStreamName(roomID)
return []string{
"$JS.API.STREAM.*." + s,
"$JS.API.STREAM.*.*." + s,
"$JS.API.CONSUMER.*." + s,
"$JS.API.CONSUMER.*." + s + ".>",
"$JS.API.CONSUMER.*.*." + s + ".>",
"$JS.ACK." + s + ".>",
}
}
// SubjectACLFor returns a function that maps a signing public key (lowercase hex)
// to the data-plane subjects that identity may publish and subscribe to: the
// subject of every room it belongs to, the per-room JetStream API subjects of
// those rooms (so persisted-room history keeps working), plus the minimal client
// infrastructure subjects. It reads the live membership store, so the permissions
// reflect the identity's rooms at the moment it connects. A decode error or a
// store failure is returned as an error so the caller can fail closed (deny the
// connection) rather than grant open access.
//
// Because NATS freezes permissions at connect time, a peer invited to a new room
// after connecting must reconnect (client.RefreshSession) to pick up the new
// room's subject. The bus is the authoritative directory of subjects, so an
// unlisted subject is simply absent from the allow set.
func SubjectACLFor(store Store) func(signPubHex string) ([]string, error) {
return func(signPubHex string) ([]string, error) {
pub, err := hex.DecodeString(signPubHex)
if err != nil || len(pub) != 32 {
return nil, fmt.Errorf("acl: malformed sign pub %q", signPubHex)
}
endpoint := frame.EndpointID(pub)
rooms, err := store.ListRoomsForEndpoint(endpoint)
if err != nil {
return nil, fmt.Errorf("acl: list rooms for %s: %w", endpoint, err)
}
// clientInfra + per room: the room subject + that room's JetStream API.
subjects := make([]string, 0, len(clientInfraSubjects)+len(rooms)*7)
subjects = append(subjects, clientInfraSubjects...)
for _, r := range rooms {
subjects = append(subjects, r.Subject)
subjects = append(subjects, jsSubjectsFor(r.RoomID)...)
}
return subjects, nil
}
}
+379
View File
@@ -0,0 +1,379 @@
package membership_test
import (
"encoding/hex"
"net"
"net/http/httptest"
"path/filepath"
"testing"
"time"
cs "fn-registry/functions/cybersecurity"
"github.com/enmanuel/unibus/pkg/blobstore"
"github.com/enmanuel/unibus/pkg/busauth"
"github.com/enmanuel/unibus/pkg/client"
"github.com/enmanuel/unibus/pkg/embeddednats"
"github.com/enmanuel/unibus/pkg/frame"
"github.com/enmanuel/unibus/pkg/membership"
"github.com/nats-io/nats.go"
server "github.com/nats-io/nats-server/v2/server"
)
func aclFreePort(t *testing.T) int {
t.Helper()
l, err := net.Listen("tcp", "127.0.0.1:0")
if err != nil {
t.Fatalf("free port: %v", err)
}
defer l.Close()
return l.Addr().(*net.TCPAddr).Port
}
func mustID(t *testing.T) cs.Identity {
t.Helper()
id, err := cs.GenerateIdentity()
if err != nil {
t.Fatalf("identity: %v", err)
}
return id
}
// aclPermsFunc builds the per-subject PermissionsFunc the ACL authenticator
// expects. It delegates to the SAME production wiring membershipd uses
// (busauth.PermissionsFromSubjects over membership.SubjectACLFor), so this test
// exercises the real path rather than a test-only reimplementation.
func aclPermsFunc(store membership.Store) busauth.PermissionsFunc {
return busauth.PermissionsFromSubjects(membership.SubjectACLFor(store))
}
// startACLNats boots an embedded NATS whose authenticator confines each peer to
// the subjects of the rooms it belongs to (audit H4 residual).
func startACLNats(t *testing.T, store membership.Store) *server.Server {
t.Helper()
auth := busauth.NewNkeyAuthenticatorACL(store.IsAuthorized, aclPermsFunc(store))
ns, err := embeddednats.StartServer(embeddednats.ServerConfig{
StoreDir: t.TempDir(), Host: "127.0.0.1", Port: aclFreePort(t), Auth: auth,
})
if err != nil {
t.Fatalf("acl nats: %v", err)
}
t.Cleanup(func() { ns.Shutdown(); ns.WaitForShutdown() })
return ns
}
func nkeyConn(t *testing.T, natsURL string, id cs.Identity, errCh chan error) *nats.Conn {
t.Helper()
pub, sign, err := busauth.ClientNkey(id.SignPriv)
if err != nil {
t.Fatalf("nkey: %v", err)
}
nc, err := nats.Connect(natsURL,
nats.Nkey(pub, sign),
nats.ErrorHandler(func(_ *nats.Conn, _ *nats.Subscription, e error) {
select {
case errCh <- e:
default:
}
}),
)
if err != nil {
t.Fatalf("connect nkey: %v", err)
}
t.Cleanup(nc.Close)
return nc
}
func mustAddUser(t *testing.T, store membership.Store, id cs.Identity, handle string) {
t.Helper()
if err := store.AddUser(hex.EncodeToString(id.SignPub), handle, membership.RoleMember); err != nil {
t.Fatalf("add user %s: %v", handle, err)
}
}
func mustCreateRoom(t *testing.T, store membership.Store, roomID, subject, ownerEP string, owner cs.Identity) {
t.Helper()
info := membership.RoomInfo{RoomID: roomID, Subject: subject, OwnerEndpoint: ownerEP}
if err := store.CreateRoom(info, owner.SignPub, owner.KexPub, nil); err != nil {
t.Fatalf("create room %s: %v", roomID, err)
}
}
func newCtrl(t *testing.T, store membership.Store, blobs blobstore.Store) string {
t.Helper()
ts := httptest.NewServer(membership.NewServer(store, blobs, membership.AuthOff))
t.Cleanup(ts.Close)
return ts.URL
}
func waitErr(ch chan error, d time.Duration) error {
select {
case e := <-ch:
return e
case <-time.After(d):
return nil
}
}
func drain(ch chan error) {
for {
select {
case <-ch:
default:
return
}
}
}
// TestSubjectACLIsolation closes the audit H4 residual: a registered peer is
// confined to the subjects of the rooms it belongs to. alice (member of room.A)
// may sub/pub room.A but is DENIED sub/pub on room.B, and never reads what bob
// (member of room.B) publishes there.
func TestSubjectACLIsolation(t *testing.T) {
dir := t.TempDir()
store, err := membership.Open(filepath.Join(dir, "unibus.db"))
if err != nil {
t.Fatalf("store: %v", err)
}
t.Cleanup(func() { store.Close() })
alice, bob := mustID(t), mustID(t)
aliceEP, bobEP := frame.EndpointID(alice.SignPub), frame.EndpointID(bob.SignPub)
mustAddUser(t, store, alice, "alice")
mustAddUser(t, store, bob, "bob")
const subjA, subjB = "room.acl.a", "room.acl.b"
mustCreateRoom(t, store, "ROOMA", subjA, aliceEP, alice)
mustCreateRoom(t, store, "ROOMB", subjB, bobEP, bob)
srv := startACLNats(t, store)
url := srv.ClientURL()
aliceErr := make(chan error, 4)
bobErr := make(chan error, 4)
aliceNC := nkeyConn(t, url, alice, aliceErr)
bobNC := nkeyConn(t, url, bob, bobErr)
// alice may subscribe to her own room (no error).
aliceGot := make(chan string, 4)
if _, err := aliceNC.Subscribe(subjA, func(m *nats.Msg) { aliceGot <- string(m.Data) }); err != nil {
t.Fatalf("alice sub A: %v", err)
}
_ = aliceNC.Flush()
if e := waitErr(aliceErr, 300*time.Millisecond); e != nil {
t.Fatalf("alice sub to her OWN room raised an error: %v", e)
}
// alice subscribing to bob's room is a permissions violation.
if _, err := aliceNC.Subscribe(subjB, func(m *nats.Msg) { aliceGot <- "LEAK:" + string(m.Data) }); err != nil {
t.Fatalf("alice sub B (queue): %v", err)
}
_ = aliceNC.Flush()
if e := waitErr(aliceErr, 1*time.Second); e == nil {
t.Fatalf("alice subscribing to bob's room should raise a permissions violation")
}
// bob publishes in his room; alice (denied) must not receive it.
bobGot := make(chan string, 4)
if _, err := bobNC.Subscribe(subjB, func(m *nats.Msg) { bobGot <- string(m.Data) }); err != nil {
t.Fatalf("bob sub B: %v", err)
}
_ = bobNC.Flush()
if err := bobNC.Publish(subjB, []byte("internal-bob")); err != nil {
t.Fatalf("bob pub B: %v", err)
}
_ = bobNC.Flush()
select {
case got := <-bobGot:
if got != "internal-bob" {
t.Fatalf("bob got %q", got)
}
case <-time.After(2 * time.Second):
t.Fatalf("bob did not receive his own message")
}
select {
case leak := <-aliceGot:
t.Fatalf("alice received bob's room traffic despite the ACL: %q", leak)
case <-time.After(500 * time.Millisecond):
// good: alice never got it
}
// alice publishing into bob's room is denied; bob must not receive it.
drain(aliceErr)
if err := aliceNC.Publish(subjB, []byte("intruder")); err != nil {
t.Fatalf("alice pub B (queue): %v", err)
}
_ = aliceNC.Flush()
if e := waitErr(aliceErr, 1*time.Second); e == nil {
t.Fatalf("alice publishing into bob's room should raise a permissions violation")
}
select {
case got := <-bobGot:
t.Fatalf("bob received alice's cross-room publish despite the ACL: %q", got)
case <-time.After(500 * time.Millisecond):
// good
}
}
// TestReaudit_H4_WildcardMetadataLeak ports the re-auditor's H4 vector. Before
// the per-subject ACL was WIRED into membershipd (it existed in pkg/membership and
// pkg/busauth but the binary used the plain NewNkeyAuthenticator), a registered
// NON-member could open a raw NATS connection, Subscribe(">"), and capture every
// room's subject plus JetStream stream/advisory activity — the payload stayed E2E
// ciphertext, but the metadata leaked. With NewNkeyAuthenticatorACL wired via the
// production path (busauth.PermissionsFromSubjects(membership.SubjectACLFor)), a
// non-member is confined to the client-infra subjects, so the wildcard and any
// foreign room subject are denied.
//
// Coverage:
// - error : a non-member's Subscribe(">") raises a permission violation;
// - edge : a non-member subscribing to another room's exact subject is denied;
// - golden: the member still pub/subs her own room, and the non-member never
// captures that traffic.
//
// Residual now CLOSED (issue 0006b, audit 0008 N2): the client-infra grant no
// longer includes "$JS.API.>". JetStream API access is granted per-room only
// (membership.jsSubjectsFor), so a peer can reach the API of its OWN rooms'
// streams but not the control-plane KV buckets (KV_UNIBUS_*) nor another room's
// stream. See TestAttack0008_N2 for the closed-leak regression.
func TestReaudit_H4_WildcardMetadataLeak(t *testing.T) {
dir := t.TempDir()
store, err := membership.Open(filepath.Join(dir, "unibus.db"))
if err != nil {
t.Fatalf("store: %v", err)
}
t.Cleanup(func() { store.Close() })
alice, eve := mustID(t), mustID(t)
aliceEP := frame.EndpointID(alice.SignPub)
mustAddUser(t, store, alice, "alice")
mustAddUser(t, store, eve, "eve") // eve is REGISTERED but never a member of alice's room
const subject = "room.e2e.confidential"
mustCreateRoom(t, store, "ROOMA", subject, aliceEP, alice)
srv := startACLNats(t, store)
url := srv.ClientURL()
eveErr := make(chan error, 8)
eveNC := nkeyConn(t, url, eve, eveErr)
eveAll := make(chan *nats.Msg, 16)
// Error: eve's wildcard subscription is rejected. nats.go creates the local sub
// object and the server rejects it asynchronously (delivered to ErrorHandler).
if _, err := eveNC.Subscribe(">", func(m *nats.Msg) { eveAll <- m }); err != nil {
t.Fatalf("eve sub >: %v", err)
}
_ = eveNC.Flush()
if e := waitErr(eveErr, 1*time.Second); e == nil {
t.Fatalf("a non-member's Subscribe(\">\") must raise a permissions violation (wildcard metadata leak still open)")
}
// Edge: eve subscribing to the foreign room's EXACT subject is also denied.
drain(eveErr)
if _, err := eveNC.Subscribe(subject, func(m *nats.Msg) { eveAll <- m }); err != nil {
t.Fatalf("eve sub subject: %v", err)
}
_ = eveNC.Flush()
if e := waitErr(eveErr, 1*time.Second); e == nil {
t.Fatalf("a non-member subscribing to another room's subject must be denied")
}
// Golden: alice (the member) pub/subs her own room with no violation, and eve
// never captured the traffic despite her (rejected) wildcard.
aliceErr := make(chan error, 4)
aliceNC := nkeyConn(t, url, alice, aliceErr)
aliceGot := make(chan string, 4)
if _, err := aliceNC.Subscribe(subject, func(m *nats.Msg) { aliceGot <- string(m.Data) }); err != nil {
t.Fatalf("alice sub own room: %v", err)
}
_ = aliceNC.Flush()
if e := waitErr(aliceErr, 300*time.Millisecond); e != nil {
t.Fatalf("alice subscribing to her OWN room raised an error: %v", e)
}
if err := aliceNC.Publish(subject, []byte("members-only metadata")); err != nil {
t.Fatalf("alice publish: %v", err)
}
_ = aliceNC.Flush()
select {
case got := <-aliceGot:
if got != "members-only metadata" {
t.Fatalf("alice got %q", got)
}
case <-time.After(2 * time.Second):
t.Fatalf("alice did not receive her own room's message")
}
select {
case m := <-eveAll:
t.Fatalf("eve captured room traffic despite the ACL: subject=%q data=%q", m.Subject, m.Data)
case <-time.After(500 * time.Millisecond):
// good: eve captured nothing
}
}
// TestRefreshSessionGainsNewRoom is the "permissions refreshed on join" path:
// alice is not in room B, so her connection has no permission for its subject;
// after she is added to room B and calls RefreshSession, the reconnect
// re-derives her permissions and she gains the room's subject.
func TestRefreshSessionGainsNewRoom(t *testing.T) {
dir := t.TempDir()
store, err := membership.Open(filepath.Join(dir, "unibus.db"))
if err != nil {
t.Fatalf("store: %v", err)
}
t.Cleanup(func() { store.Close() })
alice, bob := mustID(t), mustID(t)
aliceEP, bobEP := frame.EndpointID(alice.SignPub), frame.EndpointID(bob.SignPub)
mustAddUser(t, store, alice, "alice")
mustAddUser(t, store, bob, "bob")
const subjB = "room.refresh.b"
mustCreateRoom(t, store, "ROOMB", subjB, bobEP, bob)
srv := startACLNats(t, store)
blobs, _ := blobstore.New(filepath.Join(dir, "blobs"))
ctrl := newCtrl(t, store, blobs)
aliceC, err := client.NewWithOptions(srv.ClientURL(), ctrl, alice, client.Options{UseNkey: true})
if err != nil {
t.Fatalf("connect alice: %v", err)
}
defer aliceC.Close()
// Add alice to room B (as if invited), then RefreshSession so the
// authenticator re-derives her permissions on reconnect.
if _, err := store.GetMember("ROOMB", aliceEP); err == nil {
t.Fatalf("alice should not be a member yet")
}
if err := store.AddMember("ROOMB", membership.Member{Endpoint: aliceEP, Role: "member", SignPub: alice.SignPub, KexPub: alice.KexPub}, 1, nil); err != nil {
t.Fatalf("add alice to room B: %v", err)
}
if err := aliceC.RefreshSession(); err != nil {
t.Fatalf("refresh session: %v", err)
}
bobErr := make(chan error, 2)
bobNC := nkeyConn(t, srv.ClientURL(), bob, bobErr)
got := make(chan string, 2)
sub, err := aliceC.Subscribe("ROOMB", func(_ frame.Frame, plaintext []byte) { got <- string(plaintext) })
if err != nil {
t.Fatalf("alice subscribe room B after refresh: %v", err)
}
defer sub.Unsubscribe()
time.Sleep(200 * time.Millisecond)
// bob publishes a minimal cleartext frame on subjB.
f := frame.Frame{Type: frame.PUB, Subject: subjB, Sender: bobEP, MsgID: "m1", Payload: []byte("hello-after-join")}
b, _ := f.Marshal()
if err := bobNC.Publish(subjB, b); err != nil {
t.Fatalf("bob publish: %v", err)
}
_ = bobNC.Flush()
select {
case msg := <-got:
if msg != "hello-after-join" {
t.Fatalf("alice got %q", msg)
}
case <-time.After(3 * time.Second):
t.Fatalf("alice did not receive room B traffic after RefreshSession (permissions not refreshed)")
}
}
+241
View File
@@ -0,0 +1,241 @@
package membership
import (
"crypto/sha256"
"encoding/base64"
"encoding/hex"
"fmt"
"net/http"
"strconv"
"sync"
"time"
cs "fn-registry/functions/cybersecurity"
"github.com/enmanuel/unibus/pkg/frame"
)
// AuthMode is the control-plane authentication rollout state (feature flag
// bus-auth). It governs how the HTTP middleware treats a request whose signature
// is missing, invalid, replayed, skewed, or from an unregistered identity.
//
// AuthOff — do not verify anything (legacy behavior; default).
// AuthSoft — verify and LOG rejections, but let the request through. Lets
// clients migrate to signing without an outage.
// AuthEnforce — reject unauthenticated requests with 401.
type AuthMode int
const (
AuthOff AuthMode = iota
AuthSoft
AuthEnforce
)
func (m AuthMode) String() string {
switch m {
case AuthOff:
return "off"
case AuthSoft:
return "soft"
case AuthEnforce:
return "enforce"
default:
return "unknown"
}
}
// ParseAuthMode maps the bus-auth flag string to an AuthMode.
func ParseAuthMode(s string) (AuthMode, error) {
switch s {
case "off", "":
return AuthOff, nil
case "soft":
return AuthSoft, nil
case "enforce":
return AuthEnforce, nil
default:
return AuthOff, fmt.Errorf("membership: invalid bus-auth mode %q (want off|soft|enforce)", s)
}
}
// Control-plane signature headers. The client signs the canonical bytes of the
// request and presents these; the server reconstructs the canonical bytes and
// verifies. See canonicalRequest for the exact byte layout.
const (
hdrPub = "X-Unibus-Pub" // signer Ed25519 public key, lowercase hex
hdrTs = "X-Unibus-Ts" // unix seconds (string)
hdrNonce = "X-Unibus-Nonce" // 16 random bytes, std base64
hdrSig = "X-Unibus-Sig" // Ed25519 signature over canonical, std base64
)
// Anti-replay parameters. A request is accepted only if its timestamp is within
// clockSkew of now; nonces are remembered for nonceTTL so a captured request
// cannot be replayed inside its acceptance window. nonceTTL must be >= the full
// acceptance window (2*clockSkew) so a replay can never outlive its memory.
const (
clockSkew = 30 * time.Second
nonceTTL = 60 * time.Second
// maxNonceCacheEntries bounds the replay cache so it cannot grow without limit
// (audit H7). With IsAuthorized now gating insertion, only authorized traffic
// is cached, so this ceiling is only approached under a legitimate burst; at
// the cap the oldest nonce is evicted (its TTL is nearly up anyway).
maxNonceCacheEntries = 100_000
)
// CanonicalRequest returns the exact bytes that are signed and verified for a
// control-plane request:
//
// method "\n" path "\n" ts "\n" nonce "\n" hex(sha256(body))
//
// path is the request URI (path plus raw query) so query parameters (endpoint,
// epoch) are covered by the signature. It is exported so the client library and
// tests sign with the identical construction — the one place this format lives.
func CanonicalRequest(method, path, ts, nonce string, body []byte) []byte {
sum := sha256.Sum256(body)
return []byte(method + "\n" + path + "\n" + ts + "\n" + nonce + "\n" + hex.EncodeToString(sum[:]))
}
// nonceStore is the anti-replay backend: rememberOrReject records a nonce and
// reports whether it was unseen (true -> accept) or already seen (false ->
// reject the replay). It is an interface (issue 0003e) so the single-node
// in-memory cache can be swapped for a replicated KV store: a per-process cache
// is BROKEN under multi-node failover (a request captured and replayed to a
// DIFFERENT node whose cache never saw the nonce would be accepted), so a
// cluster MUST share the nonce state. Every implementation fails CLOSED — a
// backend it cannot reach rejects rather than admits.
type nonceStore interface {
rememberOrReject(nonce string, now time.Time) bool
}
// memNonceCache remembers recently-seen nonces to reject replays. It is an
// in-memory store guarded by a mutex — sufficient for a SINGLE membershipd
// process. A clustered deployment uses kvNonceStore instead (issue 0003e).
//
// Pruning is O(expired), not O(n): because the TTL is constant, insertion order
// equals expiry order, so the oldest entries (front of `order`) are exactly the
// ones that expire first (audit H7 — the previous full-map scan under the mutex
// was a CPU-amplification vector). A size cap bounds memory.
type memNonceCache struct {
mu sync.Mutex
seen map[string]time.Time // nonce -> expiry
order []string // nonces in insertion order == expiry order
ttl time.Duration
cap int
}
func newMemNonceCache(ttl time.Duration, capacity int) *memNonceCache {
return &memNonceCache{seen: make(map[string]time.Time), ttl: ttl, cap: capacity}
}
// rememberOrReject records nonce and returns true if it was unseen, or false if
// it is a replay (still live in the cache).
func (n *memNonceCache) rememberOrReject(nonce string, now time.Time) bool {
n.mu.Lock()
defer n.mu.Unlock()
// Prune expired entries from the front (oldest first). The first live entry
// ends the scan — everything behind it was inserted later and is newer.
cut := 0
for cut < len(n.order) {
exp, ok := n.seen[n.order[cut]]
if !ok {
cut++ // already evicted by the cap path below
continue
}
if !exp.Before(now) {
break
}
delete(n.seen, n.order[cut])
cut++
}
if cut > 0 {
n.order = append(n.order[:0], n.order[cut:]...)
}
if exp, ok := n.seen[nonce]; ok && !exp.Before(now) {
return false // a live replay
}
// Bound memory: at capacity, evict the oldest entry (its TTL is nearly up).
for len(n.seen) >= n.cap && len(n.order) > 0 {
oldest := n.order[0]
n.order = n.order[1:]
delete(n.seen, oldest)
}
n.seen[nonce] = now.Add(n.ttl)
n.order = append(n.order, nonce)
return true
}
// authResult is what a successful authentication yields: the verified signing
// key (hex), the endpoint id derived from it, and the authorized user record.
// Handlers use endpoint for membership authorization (only a member of a room
// may read its metadata/keys); user is available for role checks.
type authResult struct {
pubHex string
endpoint string
user User
}
// authenticate verifies the signature headers on r against body and the user
// allowlist. It returns an error describing the first failing check; the
// middleware decides whether that error blocks (enforce) or only logs (soft).
//
// Order matters: cheap, non-cryptographic checks (header presence, key shape,
// clock skew) run first; the Ed25519 verification runs before the replay cache
// is touched so an attacker cannot poison the cache with unsigned nonces; the
// allowlist lookup runs last.
func (s *Server) authenticate(r *http.Request, body []byte, now time.Time) (authResult, error) {
pubHex := r.Header.Get(hdrPub)
ts := r.Header.Get(hdrTs)
nonce := r.Header.Get(hdrNonce)
sigB64 := r.Header.Get(hdrSig)
if pubHex == "" || ts == "" || nonce == "" || sigB64 == "" {
return authResult{}, fmt.Errorf("missing auth headers")
}
pub, err := hex.DecodeString(pubHex)
if err != nil || len(pub) != 32 {
return authResult{}, fmt.Errorf("malformed %s (want 32-byte Ed25519 hex)", hdrPub)
}
tsInt, err := strconv.ParseInt(ts, 10, 64)
if err != nil {
return authResult{}, fmt.Errorf("malformed %s", hdrTs)
}
if d := now.Unix() - tsInt; d > int64(clockSkew/time.Second) || d < -int64(clockSkew/time.Second) {
return authResult{}, fmt.Errorf("timestamp out of range (skew %ds)", d)
}
sig, err := base64.StdEncoding.DecodeString(sigB64)
if err != nil {
return authResult{}, fmt.Errorf("malformed %s", hdrSig)
}
canonical := CanonicalRequest(r.Method, r.URL.RequestURI(), ts, nonce, body)
if !cs.VerifyEd25519(pub, canonical, sig) {
return authResult{}, fmt.Errorf("invalid signature")
}
// Authorize BEFORE touching the replay cache (audit H7): an unregistered
// identity can mint valid signatures for free, so caching its nonces would let
// it poison/grow the cache pre-auth. Only authorized identities are remembered.
if !s.store.IsAuthorized(pubHex) {
return authResult{}, fmt.Errorf("identity not authorized")
}
user, err := s.store.GetUser(pubHex)
if err != nil {
// IsAuthorized passed but the row vanished (race with revoke): fail closed.
return authResult{}, fmt.Errorf("identity not authorized")
}
// Anti-replay last: a replayed request from an authorized identity is still
// rejected here (the nonce is already live in the cache from its first use).
if !s.nonces.rememberOrReject(nonce, now) {
return authResult{}, fmt.Errorf("replayed nonce")
}
return authResult{pubHex: pubHex, endpoint: frame.EndpointID(pub), user: user}, nil
}
+206
View File
@@ -0,0 +1,206 @@
package membership
import (
"bytes"
"encoding/base64"
"encoding/hex"
"io"
"net/http"
"net/http/httptest"
"path/filepath"
"strconv"
"testing"
"time"
cs "fn-registry/functions/cybersecurity"
"github.com/enmanuel/unibus/pkg/blobstore"
"github.com/enmanuel/unibus/pkg/frame"
)
// authHarness boots an in-process membershipd HTTP server in the given auth mode
// with a fresh store + blob store, and seeds one active admin ("alice").
type authHarness struct {
ts *httptest.Server
store Store
alice cs.Identity
alicePub string // hex
}
func newAuthHarness(t *testing.T, mode AuthMode) *authHarness {
t.Helper()
dir := t.TempDir()
store, err := Open(filepath.Join(dir, "unibus.db"))
if err != nil {
t.Fatalf("open store: %v", err)
}
blobs, err := blobstore.New(filepath.Join(dir, "blobs"))
if err != nil {
t.Fatalf("open blobs: %v", err)
}
alice, err := cs.GenerateIdentity()
if err != nil {
t.Fatalf("identity: %v", err)
}
alicePub := hex.EncodeToString(alice.SignPub)
if err := store.AddUser(alicePub, "alice", RoleAdmin); err != nil {
t.Fatalf("seed admin: %v", err)
}
srv := NewServer(store, blobs, mode)
ts := httptest.NewServer(srv)
t.Cleanup(func() {
ts.Close()
store.Close()
})
return &authHarness{ts: ts, store: store, alice: alice, alicePub: alicePub}
}
// signedReq builds a control-plane request signed by id, with explicit ts/nonce
// so tests can force skew and replay. It signs via the same CanonicalRequest the
// production client uses, so the test verifies the real wire contract.
func signedReq(t *testing.T, base, method, path string, body []byte, id cs.Identity, ts int64, nonce string) *http.Request {
t.Helper()
var rdr io.Reader
if body != nil {
rdr = bytes.NewReader(body)
}
req, err := http.NewRequest(method, base+path, rdr)
if err != nil {
t.Fatalf("new request: %v", err)
}
tss := strconv.FormatInt(ts, 10)
canonical := CanonicalRequest(method, path, tss, nonce, body)
sig := cs.SignEd25519(id.SignPriv, canonical)
req.Header.Set(hdrPub, hex.EncodeToString(id.SignPub))
req.Header.Set(hdrTs, tss)
req.Header.Set(hdrNonce, nonce)
req.Header.Set(hdrSig, base64.StdEncoding.EncodeToString(sig))
return req
}
func do(t *testing.T, req *http.Request) (int, string) {
t.Helper()
resp, err := http.DefaultClient.Do(req)
if err != nil {
t.Fatalf("do request: %v", err)
}
defer resp.Body.Close()
b, _ := io.ReadAll(resp.Body)
return resp.StatusCode, string(b)
}
// okPath is a path that authenticates and returns 200 with an empty list when
// the request carries NO membership-bound signer (AuthOff/soft/missing-headers
// tests). Under enforce, the per-endpoint room directory is now restricted to
// the signer's own endpoint (audit H3), so tests that sign as alice use
// aliceRoomsPath instead.
const okPath = "/members/alice-endpoint/rooms"
// aliceRoomsPath is alice's own room directory — the canonical "authenticated
// and authorized" 200 path under enforce after H3.
func aliceRoomsPath(h *authHarness) string {
return "/members/" + frame.EndpointID(h.alice.SignPub) + "/rooms"
}
// Golden: a request signed by a registered, active identity is accepted.
func TestAuthGoldenAccepted(t *testing.T) {
h := newAuthHarness(t, AuthEnforce)
now := time.Now().Unix()
code, _ := do(t, signedReq(t, h.ts.URL, "GET", aliceRoomsPath(h), nil, h.alice, now, "nonce-golden"))
if code != http.StatusOK {
t.Fatalf("golden signed request should be 200, got %d", code)
}
}
// Error path: a structurally valid signature from an identity that is NOT in the
// allowlist is rejected with 401.
func TestAuthUnregisteredRejected(t *testing.T) {
h := newAuthHarness(t, AuthEnforce)
bob, _ := cs.GenerateIdentity()
now := time.Now().Unix()
code, body := do(t, signedReq(t, h.ts.URL, "GET", okPath, nil, bob, now, "nonce-bob"))
if code != http.StatusUnauthorized {
t.Fatalf("unregistered identity should be 401, got %d (%s)", code, body)
}
}
// Error path: replaying a captured request (same nonce + signature) is rejected.
func TestAuthReplayRejected(t *testing.T) {
h := newAuthHarness(t, AuthEnforce)
now := time.Now().Unix()
first := signedReq(t, h.ts.URL, "GET", aliceRoomsPath(h), nil, h.alice, now, "nonce-replay")
if code, body := do(t, first); code != http.StatusOK {
t.Fatalf("first request should be 200, got %d (%s)", code, body)
}
// Identical ts + nonce + signature: a replay.
second := signedReq(t, h.ts.URL, "GET", aliceRoomsPath(h), nil, h.alice, now, "nonce-replay")
if code, body := do(t, second); code != http.StatusUnauthorized {
t.Fatalf("replayed request should be 401, got %d (%s)", code, body)
}
}
// Error path: a timestamp outside the ±30s window is rejected even with a valid
// signature (defends against long-delayed captured requests).
func TestAuthClockSkewRejected(t *testing.T) {
h := newAuthHarness(t, AuthEnforce)
stale := time.Now().Unix() - 120
code, body := do(t, signedReq(t, h.ts.URL, "GET", okPath, nil, h.alice, stale, "nonce-skew"))
if code != http.StatusUnauthorized {
t.Fatalf("clock-skewed request should be 401, got %d (%s)", code, body)
}
}
// Error path: tampering the body after signing invalidates the signature.
func TestAuthTamperedBodyRejected(t *testing.T) {
h := newAuthHarness(t, AuthEnforce)
now := time.Now().Unix()
req := signedReq(t, h.ts.URL, "POST", "/rooms", []byte(`{"subject":"x"}`), h.alice, now, "nonce-tamper")
// Swap the body for different bytes the signature does not cover.
req.Body = io.NopCloser(bytes.NewReader([]byte(`{"subject":"evil"}`)))
req.ContentLength = int64(len(`{"subject":"evil"}`))
code, body := do(t, req)
if code != http.StatusUnauthorized {
t.Fatalf("tampered body should be 401, got %d (%s)", code, body)
}
}
// Error path: missing auth headers under enforce are rejected.
func TestAuthMissingHeadersRejected(t *testing.T) {
h := newAuthHarness(t, AuthEnforce)
req, _ := http.NewRequest("GET", h.ts.URL+okPath, nil)
code, _ := do(t, req)
if code != http.StatusUnauthorized {
t.Fatalf("unsigned request under enforce should be 401, got %d", code)
}
}
// Exemption: the health probe bypasses auth even under enforce.
func TestAuthHealthExempt(t *testing.T) {
h := newAuthHarness(t, AuthEnforce)
req, _ := http.NewRequest("GET", h.ts.URL+"/healthz", nil)
code, _ := do(t, req)
if code != http.StatusOK {
t.Fatalf("/healthz must be reachable without auth, got %d", code)
}
}
// Soft mode: an unauthenticated request is logged but allowed through, so
// clients can migrate without an outage.
func TestAuthSoftAllowsUnauthenticated(t *testing.T) {
h := newAuthHarness(t, AuthSoft)
req, _ := http.NewRequest("GET", h.ts.URL+okPath, nil)
code, _ := do(t, req)
if code != http.StatusOK {
t.Fatalf("soft mode should allow unsigned request, got %d", code)
}
}
// Off mode (default for legacy callers): no verification at all.
func TestAuthOffNoVerification(t *testing.T) {
h := newAuthHarness(t, AuthOff)
req, _ := http.NewRequest("GET", h.ts.URL+okPath, nil)
code, _ := do(t, req)
if code != http.StatusOK {
t.Fatalf("off mode should allow unsigned request, got %d", code)
}
}
+119
View File
@@ -0,0 +1,119 @@
package membership
import (
"encoding/hex"
"net/http"
"strconv"
"testing"
"time"
cs "fn-registry/functions/cybersecurity"
"github.com/enmanuel/unibus/pkg/frame"
)
// seedRoom inserts an encrypted room owned by alice with a sealed key for her,
// directly through the store so the test controls membership precisely. It
// returns the room id and alice's endpoint.
func seedRoom(t *testing.T, h *authHarness, subject string) (string, string) {
t.Helper()
aliceEp := frame.EndpointID(h.alice.SignPub)
roomID := newULID()
info := RoomInfo{RoomID: roomID, Subject: subject, OwnerEndpoint: aliceEp, Encrypt: true}
if err := h.store.CreateRoom(info, h.alice.SignPub, h.alice.KexPub, []byte("alice-sealed-key")); err != nil {
t.Fatalf("seed room: %v", err)
}
return roomID, aliceEp
}
// register adds id to the bus allowlist so its signed requests clear auth and
// reach the handler, where membership authorization (not mere registration) is
// what the test exercises.
func register(t *testing.T, h *authHarness, id cs.Identity, handle string) {
t.Helper()
if err := h.store.AddUser(hex.EncodeToString(id.SignPub), handle, RoleMember); err != nil {
t.Fatalf("register %s: %v", handle, err)
}
}
// TestAudit_HorizontalMetadataLeak ports the auditor's H3 (Alto) finding: bob is
// REGISTERED on the bus but is NOT a member of alice's room. Before the fix the
// GET endpoints checked registration, not membership, so bob could read the
// room's subject, the full member list (with everyone's public keys), alice's
// room directory, and even alice's sealed key. Now every one of those returns
// 403 to bob, while alice (owner/member) and carol (plain member) get 200.
func TestAudit_HorizontalMetadataLeak(t *testing.T) {
h := newAuthHarness(t, AuthEnforce)
roomID, aliceEp := seedRoom(t, h, "secret.subject.payroll")
// bob: registered, never invited.
bob, _ := cs.GenerateIdentity()
register(t, h, bob, "bob")
// carol: registered AND a plain (non-owner) member — the legitimate-member edge.
carol, _ := cs.GenerateIdentity()
register(t, h, carol, "carol")
carolEp := frame.EndpointID(carol.SignPub)
if err := h.store.AddMember(roomID, Member{Endpoint: carolEp, Role: RoleMember, SignPub: carol.SignPub, KexPub: carol.KexPub}, 1, []byte("carol-sealed")); err != nil {
t.Fatalf("add carol: %v", err)
}
n := 0
get := func(id cs.Identity, path string) int {
n++
code, _ := do(t, signedReq(t, h.ts.URL, "GET", path, nil, id, time.Now().Unix(), nonceN(n)))
return code
}
// Error path: bob (non-member) is forbidden on every room endpoint.
bobChecks := []struct {
name string
path string
}{
{"get room", "/rooms/" + roomID},
{"list members", "/rooms/" + roomID + "/members"},
{"alice room directory", "/members/" + aliceEp + "/rooms"},
{"alice sealed key", "/rooms/" + roomID + "/key?endpoint=" + aliceEp},
{"bob sealed key in alices room", "/rooms/" + roomID + "/key?endpoint=" + frame.EndpointID(bob.SignPub)},
}
for _, c := range bobChecks {
if code := get(bob, c.path); code != http.StatusForbidden {
t.Fatalf("bob (non-member) %s should be 403, got %d", c.name, code)
}
}
// Golden: alice (owner/member) reads her room's metadata, members, directory, key.
aliceChecks := []string{
"/rooms/" + roomID,
"/rooms/" + roomID + "/members",
"/members/" + aliceEp + "/rooms",
"/rooms/" + roomID + "/key?endpoint=" + aliceEp,
}
for _, p := range aliceChecks {
if code := get(h.alice, p); code != http.StatusOK {
t.Fatalf("alice (owner) %s should be 200, got %d", p, code)
}
}
// Edge: carol is a plain member, not the owner — she may still read the room.
if code := get(carol, "/rooms/"+roomID); code != http.StatusOK {
t.Fatalf("carol (member) get room should be 200, got %d", code)
}
if code := get(carol, "/rooms/"+roomID+"/members"); code != http.StatusOK {
t.Fatalf("carol (member) list members should be 200, got %d", code)
}
// Edge: carol may fetch her OWN sealed key but not alice's.
if code := get(carol, "/rooms/"+roomID+"/key?endpoint="+carolEp); code != http.StatusOK {
t.Fatalf("carol fetching her own key should be 200, got %d", code)
}
if code := get(carol, "/rooms/"+roomID+"/key?endpoint="+aliceEp); code != http.StatusForbidden {
t.Fatalf("carol fetching alice's key should be 403, got %d", code)
}
}
// nonceN yields a distinct nonce per request so the anti-replay cache never
// rejects a fresh, legitimately-different request inside one test.
func nonceN(i int) string {
return "authz-nonce-" + strconv.Itoa(i)
}
+148
View File
@@ -0,0 +1,148 @@
package membership
import (
"net/http"
"net/http/httptest"
"os"
"runtime"
"strconv"
"strings"
"sync"
"sync/atomic"
"testing"
"time"
)
// readRSSkBRaw reads VmRSS (kB) from /proc without a *testing.T, so it is safe to
// call from a sampling goroutine (vmRSSkB calls t.Skip, which may only run on the
// test's own goroutine). Returns 0 when unavailable.
func readRSSkBRaw() int64 {
b, err := os.ReadFile("/proc/self/status")
if err != nil {
return 0
}
for _, line := range strings.Split(string(b), "\n") {
if strings.HasPrefix(line, "VmRSS:") {
f := strings.Fields(line)
if len(f) >= 2 {
v, _ := strconv.ParseInt(f[1], 10, 64)
return v
}
}
}
return 0
}
// TestReaudit_DoSConcurrency ports the re-auditor's N2 (Medio-Alto) finding: the
// per-request body ceiling and the per-IP rate limit do not bound the AGGREGATE
// memory of many concurrent uploads. The auditor drove RSS to ~1.42 GB with 40
// concurrent 16 MiB blob uploads. With the global in-flight byte limiter, the
// number of simultaneously-buffered uploads is capped, so the resident set stays
// bounded regardless of how many connections arrive at once.
//
// Coverage:
// - golden: a normal upload succeeds, and the server is still healthy after the
// storm (the limiter did not wedge it);
// - edge : concurrency right at the cap is admitted;
// - error : a concurrent flood far past the cap sheds the excess with 503
// (backpressure) instead of buffering it all, and the RSS spike stays bounded
// and does NOT scale with the number of requests.
func TestReaudit_DoSConcurrency(t *testing.T) {
if runtime.GOOS != "linux" {
t.Skip("RSS probe is Linux-only")
}
srv := dosServer(t, AuthOff)
// Force a small aggregate cap so the bound is observable in a unit test: with
// a 16 MiB blob ceiling, 48 MiB admits ~3 concurrent uploads. Production uses
// maxInflightBytes (128 MiB); the mechanism under test is identical.
const cap = int64(48) << 20
srv.inflight = newInflightLimiter(cap)
const blob = maxBlobBytes // 16 MiB, the per-request ceiling
const n = 40 // the auditor's figure
// A spike bound: with the cap admitting ~3 concurrent 16 MiB uploads and a
// ~2x copy factor (auth buffer + handler buffer) plus Go runtime slack, the
// delta should stay well under this. Without the limiter, 40 concurrent
// uploads admitted at once would add hundreds of MB (the auditor saw ~1.4 GB).
const maxSpikeKB = int64(256) << 10 // 256 MiB
runtime.GC()
before := readRSSkBRaw()
// Sample peak RSS while the storm runs.
var peak int64
atomic.StoreInt64(&peak, before)
stop := make(chan struct{})
var sampler sync.WaitGroup
sampler.Add(1)
go func() {
defer sampler.Done()
for {
select {
case <-stop:
return
default:
if v := readRSSkBRaw(); v > atomic.LoadInt64(&peak) {
atomic.StoreInt64(&peak, v)
}
time.Sleep(2 * time.Millisecond)
}
}
}()
var got503, got200 int64
var wg sync.WaitGroup
for i := 0; i < n; i++ {
wg.Add(1)
go func() {
defer wg.Done()
req := httptest.NewRequest(http.MethodPost, "/blobs", &zeroReader{remaining: blob})
req.ContentLength = blob
// Distinct source IP per request: this is the multi-IP (botnet) shape the
// auditor flagged, where the per-IP rate limit gives no aggregate defense.
// The in-flight byte limiter is the global bound that must hold here.
req.RemoteAddr = "198.51.100." + strconv.Itoa(i%254+1) + ":1234"
rec := httptest.NewRecorder()
srv.ServeHTTP(rec, req)
switch rec.Code {
case http.StatusServiceUnavailable:
atomic.AddInt64(&got503, 1)
case http.StatusOK:
atomic.AddInt64(&got200, 1)
}
}()
}
wg.Wait()
close(stop)
sampler.Wait()
runtime.GC()
delta := atomic.LoadInt64(&peak) - before
// Error path: the flood must have hit the cap and shed the excess with 503.
if got503 == 0 {
t.Fatalf("a concurrent flood of %d uploads past the cap should shed some with 503; got 200=%d 503=%d", n, got200, got503)
}
// The aggregate memory must stay bounded — not scale with n.
if delta > maxSpikeKB {
t.Fatalf("aggregate RSS spiked %d kB under %d concurrent uploads (bound %d kB): in-flight limiter not bounding memory", delta, n, maxSpikeKB)
}
// All reservations released after the storm.
if f := srv.inflight.inFlight(); f != 0 {
t.Fatalf("after the storm inFlight = %d, want 0 (reservations leaked)", f)
}
// Golden: the server is still healthy and serves a normal upload (from a fresh
// IP so the per-IP rate limiter, untouched here, is not what we measure).
rec := httptest.NewRecorder()
gReq := httptest.NewRequest(http.MethodPost, "/blobs", strings.NewReader("hello after storm"))
gReq.RemoteAddr = "203.0.113.9:9999"
srv.ServeHTTP(rec, gReq)
if rec.Code != http.StatusOK {
t.Fatalf("a normal upload after the storm should be 200, got %d (%s)", rec.Code, rec.Body.String())
}
t.Logf("N2 bound: %d uploads -> 200=%d 503=%d, RSS delta %d kB (bound %d kB), cap %d MiB",
n, got200, got503, delta, maxSpikeKB, cap>>20)
}
+206
View File
@@ -0,0 +1,206 @@
package membership
import (
"io"
"net/http"
"net/http/httptest"
"os"
"path/filepath"
"runtime"
"strconv"
"strings"
"testing"
"github.com/enmanuel/unibus/pkg/blobstore"
)
// dosServer builds a Server backed by a fresh store + blob store so a test can
// drive ServeHTTP in-process (white-box) and observe its memory behavior without
// a network round trip — the same in-process technique the auditor used.
func dosServer(t *testing.T, mode AuthMode) *Server {
t.Helper()
dir := t.TempDir()
store, err := Open(filepath.Join(dir, "unibus.db"))
if err != nil {
t.Fatalf("open store: %v", err)
}
blobs, err := blobstore.New(filepath.Join(dir, "blobs"))
if err != nil {
t.Fatalf("open blobs: %v", err)
}
t.Cleanup(func() { store.Close() })
return NewServer(store, blobs, mode)
}
// zeroReader yields up to remaining zero bytes without ever allocating them, so
// the test process itself never materializes a huge buffer (which would taint the
// RSS measurement we are trying to make about the SERVER).
type zeroReader struct{ remaining int64 }
func (z *zeroReader) Read(p []byte) (int, error) {
if z.remaining <= 0 {
return 0, io.EOF
}
n := int64(len(p))
if n > z.remaining {
n = z.remaining
}
for i := int64(0); i < n; i++ {
p[i] = 0
}
z.remaining -= n
return int(n), nil
}
// vmRSSkB reads the resident set size (kB) of this process from /proc. Linux-only;
// the caller skips on other platforms.
func vmRSSkB(t *testing.T) int64 {
t.Helper()
b, err := os.ReadFile("/proc/self/status")
if err != nil {
t.Skipf("cannot read /proc/self/status: %v", err)
}
for _, line := range strings.Split(string(b), "\n") {
if strings.HasPrefix(line, "VmRSS:") {
f := strings.Fields(line)
if len(f) >= 2 {
v, _ := strconv.ParseInt(f[1], 10, 64)
return v
}
}
}
t.Skip("VmRSS not present in /proc/self/status")
return 0
}
// TestAudit_DoSBodyLimitNoAuth ports the auditor's H1 (Critical) vector: a peer
// with NO valid signature posts an oversized body. Before the fix the middleware
// io.ReadAll'd it unbounded (the auditor sent 400 MB and watched RSS jump from
// 18 MB to 898 MB). Now the request is rejected 413 and the resident set does NOT
// spike. Two shapes are covered:
//
// (1) a truthful, over-ceiling Content-Length -> rejected before any byte is read;
// (2) a lying / unknown length (chunked) -> MaxBytesReader trips mid-read,
// capping the buffered bytes at the ceiling instead of the attacker's 400 MB.
func TestAudit_DoSBodyLimitNoAuth(t *testing.T) {
if runtime.GOOS != "linux" {
t.Skip("RSS probe is Linux-only")
}
srv := dosServer(t, AuthEnforce) // enforce: the request carries no signature
const huge = int64(400) << 20 // 400 MiB — the auditor's figure
// A spike threshold an order of magnitude below the attack. The old code would
// add ~400 MB+; the fix keeps the delta to at most one bounded buffer.
const maxSpikeKB = int64(96) << 10 // 96 MiB
// Shape 1: declared Content-Length over the blob ceiling -> early 413, no read.
runtime.GC()
before := vmRSSkB(t)
req := httptest.NewRequest(http.MethodPost, "/blobs", &zeroReader{remaining: huge})
req.ContentLength = huge
rec := httptest.NewRecorder()
srv.ServeHTTP(rec, req)
if rec.Code != http.StatusRequestEntityTooLarge {
t.Fatalf("over-declared body should be 413, got %d", rec.Code)
}
runtime.GC()
if d := vmRSSkB(t) - before; d > maxSpikeKB {
t.Fatalf("RSS spiked %d kB on a pre-declared oversized body (limit %d kB)", d, maxSpikeKB)
}
// Shape 2: unknown length (chunked-style). The middleware cannot reject by
// Content-Length, so MaxBytesReader must cap the read at maxBlobBytes.
runtime.GC()
before = vmRSSkB(t)
req = httptest.NewRequest(http.MethodPost, "/blobs", &zeroReader{remaining: huge})
req.ContentLength = -1
rec = httptest.NewRecorder()
srv.ServeHTTP(rec, req)
if rec.Code != http.StatusRequestEntityTooLarge {
t.Fatalf("unknown-length oversized body should be 413, got %d", rec.Code)
}
runtime.GC()
if d := vmRSSkB(t) - before; d > maxSpikeKB {
t.Fatalf("RSS spiked %d kB on a chunked oversized body (limit %d kB)", d, maxSpikeKB)
}
}
// TestBlobLimitGoldenAndBoundary covers the golden path (a normal blob is stored)
// and the boundary (a body exactly at the ceiling is accepted; one byte over by
// truthful Content-Length is rejected before buffering).
func TestBlobLimitGoldenAndBoundary(t *testing.T) {
srv := dosServer(t, AuthOff) // AuthOff: the limits apply regardless of auth mode
// Golden: a small blob is accepted and hashed.
rec := httptest.NewRecorder()
srv.ServeHTTP(rec, httptest.NewRequest(http.MethodPost, "/blobs", strings.NewReader("hello blob")))
if rec.Code != http.StatusOK {
t.Fatalf("normal blob should be 200, got %d (%s)", rec.Code, rec.Body.String())
}
// Boundary: exactly at the ceiling is allowed (MaxBytesReader permits N bytes).
atLimit := strings.Repeat("a", maxBlobBytes)
rec = httptest.NewRecorder()
req := httptest.NewRequest(http.MethodPost, "/blobs", strings.NewReader(atLimit))
req.ContentLength = int64(len(atLimit))
srv.ServeHTTP(rec, req)
if rec.Code != http.StatusOK {
t.Fatalf("blob exactly at the ceiling should be 200, got %d", rec.Code)
}
// Error: one byte over the ceiling (truthful Content-Length) -> 413 pre-read.
rec = httptest.NewRecorder()
req = httptest.NewRequest(http.MethodPost, "/blobs", &zeroReader{remaining: maxBlobBytes + 1})
req.ContentLength = maxBlobBytes + 1
srv.ServeHTTP(rec, req)
if rec.Code != http.StatusRequestEntityTooLarge {
t.Fatalf("blob one byte over the ceiling should be 413, got %d", rec.Code)
}
}
// TestControlBodyLimit checks the smaller JSON ceiling on a non-blob route: a body
// over maxControlBodyBytes is rejected 413 before the handler runs.
func TestControlBodyLimit(t *testing.T) {
srv := dosServer(t, AuthOff)
rec := httptest.NewRecorder()
req := httptest.NewRequest(http.MethodPost, "/rooms", &zeroReader{remaining: maxControlBodyBytes + 1})
req.ContentLength = maxControlBodyBytes + 1
srv.ServeHTTP(rec, req)
if rec.Code != http.StatusRequestEntityTooLarge {
t.Fatalf("control body over 1 MiB should be 413, got %d", rec.Code)
}
}
// TestRateLimitPerIP exercises the per-IP throttle: a burst from one IP eventually
// gets 429 (error path), while a spread across distinct IPs is never throttled
// (edge — the bucket is keyed per source, not global).
func TestRateLimitPerIP(t *testing.T) {
srv := dosServer(t, AuthOff)
// Same IP: well past the burst -> at least one 429.
got429 := false
for i := 0; i < defaultRateBurst+50; i++ {
rec := httptest.NewRecorder()
req := httptest.NewRequest(http.MethodGet, "/rooms/none", nil)
req.RemoteAddr = "203.0.113.7:5555"
srv.ServeHTTP(rec, req)
if rec.Code == http.StatusTooManyRequests {
got429 = true
break
}
}
if !got429 {
t.Fatalf("a flood from one IP should eventually be rate-limited (429)")
}
// Distinct IPs: each gets a fresh bucket, so none is throttled.
for i := 0; i < 100; i++ {
rec := httptest.NewRecorder()
req := httptest.NewRequest(http.MethodGet, "/rooms/none", nil)
req.RemoteAddr = "198.51.100." + strconv.Itoa(i%254+1) + ":4444"
srv.ServeHTTP(rec, req)
if rec.Code == http.StatusTooManyRequests {
t.Fatalf("distinct IPs must not share a rate bucket; IP #%d got 429", i)
}
}
}
+85
View File
@@ -0,0 +1,85 @@
package membership
import "sync/atomic"
// inflightLimiter is a non-blocking, byte-counting concurrency limiter: a global
// cap on how many bytes of request body the server will buffer simultaneously.
//
// The per-request body ceilings (maxControlBodyBytes / maxBlobBytes) bound a
// single request, and the per-IP rate limiter throttles a single source, but
// neither bounds the AGGREGATE memory across many concurrent uploads: the
// re-audit (report 0006, N2) showed 40 concurrent 16 MiB blob uploads driving
// RSS to ~1.42 GB, and a distributed (multi-IP) flood scales without a ceiling
// because the rate limiter is per-IP. This limiter is the missing aggregate
// bound: ServeHTTP reserves a request's worst-case buffered size before reading
// the body and releases it when the request finishes, so the total bytes in
// flight can never exceed max regardless of how many connections or source IPs
// arrive at once.
//
// It is intentionally NON-blocking: when a reservation does not fit, the caller
// sheds the request with backpressure (503) rather than parking a goroutine,
// which would let an attacker exhaust goroutines/connections instead of RAM. The
// counter is maintained with sync/atomic (a CAS loop), so it is safe for
// concurrent use without a mutex.
//
// Implementation note: this lives inside unibus rather than the fn-registry
// (where a generic concurrency primitive would normally belong) because the
// registry's functions/core package pulls in transitive dependencies that
// require CGO (mattn/go-sqlite3) and external modules, which are incompatible
// with unibus's CGO_ENABLED=0 build, and because this work is scoped to the
// unibus sub-repo.
type inflightLimiter struct {
max int64 // immutable after construction; <= 0 disables the limiter
used int64 // bytes currently reserved; accessed ONLY via sync/atomic
}
// newInflightLimiter builds a limiter with a cap of maxBytes bytes in flight.
// maxBytes <= 0 disables the cap (tryAcquire always grants), which is the
// loopback/dev posture where an aggregate memory ceiling is not wanted.
func newInflightLimiter(maxBytes int64) *inflightLimiter {
return &inflightLimiter{max: maxBytes}
}
// tryAcquire reserves n bytes without blocking. It returns true and reserves the
// bytes when they fit within the cap (used+n <= max), or false (reserving
// nothing) when they do not. n <= 0 is granted without reserving, and a disabled
// limiter (max <= 0) always grants. Safe for concurrent use.
func (l *inflightLimiter) tryAcquire(n int64) bool {
if l.max <= 0 || n <= 0 {
return true
}
for {
cur := atomic.LoadInt64(&l.used)
if cur+n > l.max {
return false
}
if atomic.CompareAndSwapInt64(&l.used, cur, cur+n) {
return true
}
}
}
// release returns n previously reserved bytes. It must be paired with a
// tryAcquire that granted. A disabled limiter or n <= 0 is a no-op. The counter
// never drops below zero (a defensive clamp against an accidental double release).
func (l *inflightLimiter) release(n int64) {
if l.max <= 0 || n <= 0 {
return
}
for {
cur := atomic.LoadInt64(&l.used)
nv := cur - n
if nv < 0 {
nv = 0
}
if atomic.CompareAndSwapInt64(&l.used, cur, nv) {
return
}
}
}
// inFlight returns the bytes currently reserved. It is observability for tests
// and metrics.
func (l *inflightLimiter) inFlight() int64 {
return atomic.LoadInt64(&l.used)
}
+97
View File
@@ -0,0 +1,97 @@
package membership
import (
"sync"
"testing"
)
// TestInflightLimiterBasics covers the limiter contract: granting within the cap
// (golden), the exact boundary (edge), refusal over the cap without mutating the
// counter (error), the disabled mode, and the defensive clamp on over-release.
func TestInflightLimiterBasics(t *testing.T) {
l := newInflightLimiter(100)
// Golden: a reservation within the cap is granted and reflected.
if !l.tryAcquire(60) {
t.Fatalf("acquire 60 within cap 100 should grant")
}
if l.inFlight() != 60 {
t.Fatalf("inFlight = %d, want 60", l.inFlight())
}
// Edge: exactly reaching the cap (60+40 == 100) is granted.
if !l.tryAcquire(40) {
t.Fatalf("acquire to the exact cap should grant")
}
if l.inFlight() != 100 {
t.Fatalf("inFlight = %d, want 100", l.inFlight())
}
// Error: one more byte over the full cap is refused, and the counter is left
// untouched (a refused reservation reserves nothing).
if l.tryAcquire(1) {
t.Fatalf("acquire over a full cap must be refused")
}
if l.inFlight() != 100 {
t.Fatalf("a refused acquire must not change inFlight; got %d", l.inFlight())
}
// Release frees capacity again.
l.release(100)
if l.inFlight() != 0 {
t.Fatalf("inFlight after full release = %d, want 0", l.inFlight())
}
// Defensive: an over-release never drives the counter negative.
l.release(50)
if l.inFlight() != 0 {
t.Fatalf("over-release must clamp at 0; got %d", l.inFlight())
}
}
// TestInflightLimiterDisabled verifies that a non-positive cap disables the
// limiter: every reservation is granted and nothing is tracked (the loopback/dev
// posture).
func TestInflightLimiterDisabled(t *testing.T) {
for _, max := range []int64{0, -1} {
l := newInflightLimiter(max)
if !l.tryAcquire(1 << 30) {
t.Fatalf("disabled limiter (max=%d) must always grant", max)
}
if l.inFlight() != 0 {
t.Fatalf("disabled limiter must not track usage; got %d", l.inFlight())
}
l.release(1 << 30) // no-op, must not panic
}
}
// TestInflightLimiterConcurrent hammers the limiter from many goroutines with
// equal-sized acquire/release pairs and asserts the invariant never breaks: the
// counter returns to 0 and never exceeds the cap. Run with -race for the memory
// model guarantee.
func TestInflightLimiterConcurrent(t *testing.T) {
const cap = 1000
const chunk = 7
l := newInflightLimiter(cap)
var wg sync.WaitGroup
for g := 0; g < 64; g++ {
wg.Add(1)
go func() {
defer wg.Done()
for i := 0; i < 2000; i++ {
if l.tryAcquire(chunk) {
if f := l.inFlight(); f > cap {
t.Errorf("inFlight %d exceeded cap %d", f, cap)
return
}
l.release(chunk)
}
}
}()
}
wg.Wait()
if l.inFlight() != 0 {
t.Fatalf("after all goroutines, inFlight = %d, want 0", l.inFlight())
}
}
+656
View File
@@ -0,0 +1,656 @@
package membership
// jetstreamStore is the JetStream KV implementation of Store (issue 0003b): the
// control-plane state (rooms, members, sealed room keys, the user allowlist)
// lives in replicated JetStream Key/Value buckets instead of a process-local
// SQLite file. Any node in the cluster reads and writes the same buckets, and
// JetStream's RAFT layer keeps them consistent across replicas, so the HTTP
// control plane becomes effectively stateless: any membershipd can serve any
// request. It is selected only when the `decentralized` flag is on; sqliteStore
// stays the default.
//
// Key layout (every path segment is a single KV token — ULIDs, RawURL endpoint
// ids and lowercase-hex keys never contain a '.', so '.' is a safe separator and
// a "<prefix>.*" watch enumerates exactly one trailing token):
//
// rooms roomID -> RoomInfo (JSON)
// members roomID.endpoint -> Member (JSON, carries Role)
// rooms_by_member endpoint.roomID -> role (reverse index for ListRoomsForEndpoint)
// room_keys roomID.endpoint.epoch -> sealed_key bytes
// users signPubHex -> User (JSON)
//
// Consistency caveat: KV has no multi-key transaction, so a multi-write op
// (CreateRoom, AddMember) is a short sequence of single-key writes. The order is
// chosen so a partial failure leaves a recoverable state (the room/member row
// before its reverse index or sealed key), and writes are idempotent (Put
// overwrites), which is also what makes the SQLite->KV migration (0003c) safe to
// re-run.
//
// Fail-closed: every read uses a bounded context, and IsAuthorized/HasAdmin
// return false on ANY backend error (a KV quorum loss or timeout denies access
// rather than admitting it), mirroring the SQLite store's behavior.
import (
"context"
"encoding/json"
"errors"
"fmt"
"sort"
"strconv"
"strings"
"time"
"github.com/nats-io/nats.go/jetstream"
)
// Bucket names (alphanumeric/dash/underscore only — no dots, per KV rules).
const (
bucketRooms = "UNIBUS_rooms"
bucketMembers = "UNIBUS_members"
bucketByMember = "UNIBUS_rooms_by_member"
bucketRoomKeys = "UNIBUS_room_keys"
bucketUsers = "UNIBUS_users"
defaultKVOpTime = 5 * time.Second
)
// JetStreamConfig configures the KV-backed store.
type JetStreamConfig struct {
// Replicas is the per-bucket replication factor (R1..R5). Use 1 for a single
// node or a 1-2 node rollout, 3 for real HA (quorum 2/3). Scaling R1->R3 in
// place is an operational step (nats kv update) done when the third node
// joins; it does not require reopening the store.
Replicas int
// OpTimeout bounds every KV operation so a stalled backend fails closed
// instead of hanging a request. Zero uses defaultKVOpTime.
OpTimeout time.Duration
}
type jetstreamStore struct {
rooms jetstream.KeyValue
members jetstream.KeyValue
byMember jetstream.KeyValue
keys jetstream.KeyValue
users jetstream.KeyValue
opTimeout time.Duration
}
// OpenJetStream creates (or opens) the five KV buckets on js with the configured
// replication factor and returns a Store backed by them. The JetStream context
// belongs to the caller (it owns the NATS connection); Close is a no-op.
func OpenJetStream(js jetstream.JetStream, cfg JetStreamConfig) (Store, error) {
if cfg.Replicas <= 0 {
cfg.Replicas = 1
}
opTimeout := cfg.OpTimeout
if opTimeout <= 0 {
opTimeout = defaultKVOpTime
}
// Bootstrap budget for creating/opening the buckets. On a single node JetStream
// is ready the instant the server starts, so the first attempt succeeds. On a
// COLD multi-node cluster the JetStream meta-group must first elect a leader and
// each node must establish contact with it before its $JS.API responds. A KV
// op is a NATS request/reply: if it is published before the node's JetStream is
// ready the request is dropped (not queued), and a single long-context call then
// just blocks until it times out (issue 0006g). So we RETRY each bucket op with
// short per-attempt contexts until it succeeds or the overall bootstrap budget
// is exhausted; once the cluster is ready the next retry lands and the buckets
// are created, after which they persist and every node opens them quickly.
bootstrapBudget := 120 * time.Second
deadline := time.Now().Add(bootstrapBudget)
s := &jetstreamStore{opTimeout: opTimeout}
for _, b := range []struct {
name string
dst *jetstream.KeyValue
}{
{bucketRooms, &s.rooms},
{bucketMembers, &s.members},
{bucketByMember, &s.byMember},
{bucketRoomKeys, &s.keys},
{bucketUsers, &s.users},
} {
var kv jetstream.KeyValue
var lastErr error
for {
opCtx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
kv, lastErr = js.CreateOrUpdateKeyValue(opCtx, jetstream.KeyValueConfig{
Bucket: b.name,
Replicas: cfg.Replicas,
History: 1,
Storage: jetstream.FileStorage,
})
cancel()
if lastErr == nil {
break
}
if time.Now().After(deadline) {
return nil, fmt.Errorf("membership: open KV bucket %q (replicas=%d) after %s: %w", b.name, cfg.Replicas, bootstrapBudget, lastErr)
}
// JetStream not ready yet (no meta leader / request dropped). Wait and
// re-publish the op; in a cluster cold start this lands once the meta
// group settles.
time.Sleep(1 * time.Second)
}
*b.dst = kv
}
return s, nil
}
// Close releases nothing: the JetStream context and NATS connection are owned by
// the caller, which closes them on shutdown.
func (s *jetstreamStore) Close() error { return nil }
func (s *jetstreamStore) ctx() (context.Context, context.CancelFunc) {
return context.WithTimeout(context.Background(), s.opTimeout)
}
// ---- key helpers ----------------------------------------------------------
func memberKey(roomID, endpoint string) string { return roomID + "." + endpoint }
func byMemberKey(endpoint, roomID string) string { return endpoint + "." + roomID }
func sealedKey(roomID, endpoint string, e int) string {
return roomID + "." + endpoint + "." + strconv.Itoa(e)
}
// watchEntries collects every current entry whose key matches pattern (a KV
// watch with a "<prefix>.*" wildcard), draining the watcher until the nil marker
// that signals "all initial values delivered". Tombstones are skipped.
func (s *jetstreamStore) watchEntries(kv jetstream.KeyValue, pattern string) ([]jetstream.KeyValueEntry, error) {
ctx, cancel := s.ctx()
defer cancel()
w, err := kv.Watch(ctx, pattern, jetstream.IgnoreDeletes())
if err != nil {
return nil, err
}
defer w.Stop()
var out []jetstream.KeyValueEntry
for {
select {
case e := <-w.Updates():
if e == nil {
return out, nil // initial snapshot complete
}
out = append(out, e)
case <-ctx.Done():
return nil, ctx.Err()
}
}
}
// ---- rooms / members / keys ----------------------------------------------
func (s *jetstreamStore) CreateRoom(info RoomInfo, ownerSignPub, ownerKexPub, ownerSealedKey []byte) error {
ctx, cancel := s.ctx()
defer cancel()
info.Epoch = 1
roomJSON, err := json.Marshal(info)
if err != nil {
return fmt.Errorf("membership: marshal room: %w", err)
}
// Create (not Put) so a duplicate room id is rejected, matching SQLite's
// PRIMARY KEY behavior.
if _, err := s.rooms.Create(ctx, info.RoomID, roomJSON); err != nil {
if errors.Is(err, jetstream.ErrKeyExists) {
return fmt.Errorf("membership: room %q already exists", info.RoomID)
}
return fmt.Errorf("membership: create room: %w", err)
}
owner := Member{Endpoint: info.OwnerEndpoint, Role: "owner", SignPub: ownerSignPub, KexPub: ownerKexPub}
if err := s.putMember(ctx, info.RoomID, owner); err != nil {
return err
}
if info.Encrypt {
if _, err := s.keys.Put(ctx, sealedKey(info.RoomID, info.OwnerEndpoint, 1), ownerSealedKey); err != nil {
return fmt.Errorf("membership: put owner key: %w", err)
}
}
return nil
}
// putMember writes the member row and its reverse index together.
func (s *jetstreamStore) putMember(ctx context.Context, roomID string, m Member) error {
mb, err := json.Marshal(m)
if err != nil {
return fmt.Errorf("membership: marshal member: %w", err)
}
if _, err := s.members.Put(ctx, memberKey(roomID, m.Endpoint), mb); err != nil {
return fmt.Errorf("membership: put member: %w", err)
}
if _, err := s.byMember.Put(ctx, byMemberKey(m.Endpoint, roomID), []byte(m.Role)); err != nil {
return fmt.Errorf("membership: put member index: %w", err)
}
return nil
}
func (s *jetstreamStore) GetRoom(roomID string) (RoomInfo, error) {
ctx, cancel := s.ctx()
defer cancel()
e, err := s.rooms.Get(ctx, roomID)
if err != nil {
if errors.Is(err, jetstream.ErrKeyNotFound) {
return RoomInfo{}, fmt.Errorf("membership: get room %q: %w", roomID, ErrNotFound)
}
return RoomInfo{}, fmt.Errorf("membership: get room %q: %w", roomID, err)
}
var info RoomInfo
if err := json.Unmarshal(e.Value(), &info); err != nil {
return RoomInfo{}, fmt.Errorf("membership: unmarshal room %q: %w", roomID, err)
}
return info, nil
}
func (s *jetstreamStore) AddMember(roomID string, m Member, epoch int, sealedKeyBytes []byte) error {
ctx, cancel := s.ctx()
defer cancel()
if err := s.putMember(ctx, roomID, m); err != nil {
return err
}
if len(sealedKeyBytes) > 0 {
if _, err := s.keys.Put(ctx, sealedKey(roomID, m.Endpoint, epoch), sealedKeyBytes); err != nil {
return fmt.Errorf("membership: put member key: %w", err)
}
}
return nil
}
func (s *jetstreamStore) GetMember(roomID, endpoint string) (Member, error) {
ctx, cancel := s.ctx()
defer cancel()
e, err := s.members.Get(ctx, memberKey(roomID, endpoint))
if err != nil {
if errors.Is(err, jetstream.ErrKeyNotFound) {
return Member{}, fmt.Errorf("membership: get member %q/%q: %w", roomID, endpoint, ErrNotFound)
}
return Member{}, fmt.Errorf("membership: get member %q/%q: %w", roomID, endpoint, err)
}
var m Member
if err := json.Unmarshal(e.Value(), &m); err != nil {
return Member{}, fmt.Errorf("membership: unmarshal member: %w", err)
}
return m, nil
}
func (s *jetstreamStore) ListMembers(roomID string) ([]Member, error) {
entries, err := s.watchEntries(s.members, roomID+".*")
if err != nil {
return nil, fmt.Errorf("membership: list members %q: %w", roomID, err)
}
out := make([]Member, 0, len(entries))
for _, e := range entries {
var m Member
if err := json.Unmarshal(e.Value(), &m); err != nil {
return nil, fmt.Errorf("membership: unmarshal member: %w", err)
}
out = append(out, m)
}
sort.Slice(out, func(i, j int) bool { return out[i].Endpoint < out[j].Endpoint })
return out, nil
}
func (s *jetstreamStore) ListRoomsForEndpoint(endpoint string) ([]RoomMembership, error) {
entries, err := s.watchEntries(s.byMember, endpoint+".*")
if err != nil {
return nil, fmt.Errorf("membership: list rooms for endpoint %q: %w", endpoint, err)
}
out := make([]RoomMembership, 0, len(entries))
for _, e := range entries {
// Key is "<endpoint>.<roomID>"; the roomID is everything after the dot.
roomID := e.Key()[len(endpoint)+1:]
info, err := s.GetRoom(roomID)
if err != nil {
if errors.Is(err, ErrNotFound) {
continue // index points at a removed room: skip, stay consistent
}
return nil, err
}
out = append(out, RoomMembership{RoomInfo: info, Role: string(e.Value())})
}
sort.Slice(out, func(i, j int) bool { return out[i].RoomID < out[j].RoomID })
return out, nil
}
func (s *jetstreamStore) GetSealedKey(roomID, endpoint string, epoch int) (int, []byte, error) {
if epoch > 0 {
ctx, cancel := s.ctx()
defer cancel()
e, err := s.keys.Get(ctx, sealedKey(roomID, endpoint, epoch))
if err != nil {
if errors.Is(err, jetstream.ErrKeyNotFound) {
return 0, nil, fmt.Errorf("membership: get sealed key %q/%q@%d: %w", roomID, endpoint, epoch, ErrNotFound)
}
return 0, nil, fmt.Errorf("membership: get sealed key %q/%q@%d: %w", roomID, endpoint, epoch, err)
}
return epoch, e.Value(), nil
}
// epoch <= 0: latest. Enumerate "<roomID>.<endpoint>.*" and take the max.
entries, err := s.watchEntries(s.keys, roomID+"."+endpoint+".*")
if err != nil {
return 0, nil, fmt.Errorf("membership: get latest sealed key %q/%q: %w", roomID, endpoint, err)
}
bestEpoch, bestVal := -1, []byte(nil)
for _, e := range entries {
k := e.Key()
ep, perr := strconv.Atoi(k[len(roomID)+1+len(endpoint)+1:])
if perr != nil {
continue
}
if ep > bestEpoch {
bestEpoch, bestVal = ep, e.Value()
}
}
if bestEpoch < 0 {
return 0, nil, fmt.Errorf("membership: get latest sealed key %q/%q: %w", roomID, endpoint, ErrNotFound)
}
return bestEpoch, bestVal, nil
}
func (s *jetstreamStore) PutSealedKeys(roomID string, epoch int, keys map[string][]byte) error {
ctx, cancel := s.ctx()
defer cancel()
for endpoint, sealed := range keys {
if _, err := s.keys.Put(ctx, sealedKey(roomID, endpoint, epoch), sealed); err != nil {
return fmt.Errorf("membership: put sealed key for %q: %w", endpoint, err)
}
}
return nil
}
func (s *jetstreamStore) BumpEpoch(roomID string, newEpoch int) error {
// Read-modify-write the room's epoch. The control plane serializes rekeys per
// room (owner-signed), so the lost-update window is not exercised in practice.
info, err := s.GetRoom(roomID)
if err != nil {
return fmt.Errorf("membership: bump epoch %q->%d: %w", roomID, newEpoch, err)
}
info.Epoch = newEpoch
b, err := json.Marshal(info)
if err != nil {
return fmt.Errorf("membership: marshal room: %w", err)
}
ctx, cancel := s.ctx()
defer cancel()
if _, err := s.rooms.Put(ctx, roomID, b); err != nil {
return fmt.Errorf("membership: bump epoch %q->%d: %w", roomID, newEpoch, err)
}
return nil
}
func (s *jetstreamStore) RemoveMember(roomID, endpoint string) error {
ctx, cancel := s.ctx()
defer cancel()
// Drop the member row and its reverse index. Past-epoch sealed keys are left
// intact (they only decrypt data the member could already read), matching the
// SQLite store.
if err := s.members.Delete(ctx, memberKey(roomID, endpoint)); err != nil && !errors.Is(err, jetstream.ErrKeyNotFound) {
return fmt.Errorf("membership: remove member %q/%q: %w", roomID, endpoint, err)
}
if err := s.byMember.Delete(ctx, byMemberKey(endpoint, roomID)); err != nil && !errors.Is(err, jetstream.ErrKeyNotFound) {
return fmt.Errorf("membership: remove member index %q/%q: %w", roomID, endpoint, err)
}
return nil
}
// ---- users (the bus allowlist) -------------------------------------------
func (s *jetstreamStore) AddUser(signPub, handle, role string) error {
signPub = normalizeSignPub(signPub)
if signPub == "" || handle == "" {
return fmt.Errorf("membership: AddUser: sign_pub and handle required")
}
if role == "" {
role = RoleMember
}
if role != RoleAdmin && role != RoleMember {
return fmt.Errorf("membership: AddUser: invalid role %q (want %q or %q)", role, RoleAdmin, RoleMember)
}
u := User{SignPub: signPub, Handle: handle, Role: role, Status: StatusActive, CreatedAt: nowRFC3339()}
b, err := json.Marshal(u)
if err != nil {
return fmt.Errorf("membership: marshal user: %w", err)
}
ctx, cancel := s.ctx()
defer cancel()
if _, err := s.users.Create(ctx, signPub, b); err != nil {
if errors.Is(err, jetstream.ErrKeyExists) {
return ErrUserExists
}
return fmt.Errorf("membership: insert user: %w", err)
}
return nil
}
func (s *jetstreamStore) GetUser(signPub string) (User, error) {
signPub = normalizeSignPub(signPub)
ctx, cancel := s.ctx()
defer cancel()
e, err := s.users.Get(ctx, signPub)
if err != nil {
if errors.Is(err, jetstream.ErrKeyNotFound) {
return User{}, fmt.Errorf("membership: get user %q: %w", signPub, ErrNotFound)
}
return User{}, fmt.Errorf("membership: get user %q: %w", signPub, err)
}
var u User
if err := json.Unmarshal(e.Value(), &u); err != nil {
return User{}, fmt.Errorf("membership: unmarshal user: %w", err)
}
return u, nil
}
func (s *jetstreamStore) ListUsers() ([]User, error) {
ctx, cancel := s.ctx()
w, err := s.users.WatchAll(ctx, jetstream.IgnoreDeletes())
if err != nil {
cancel()
return nil, fmt.Errorf("membership: list users: %w", err)
}
defer cancel()
defer w.Stop()
var out []User
for {
select {
case e := <-w.Updates():
if e == nil {
sort.Slice(out, func(i, j int) bool {
if out[i].Handle != out[j].Handle {
return out[i].Handle < out[j].Handle
}
return out[i].SignPub < out[j].SignPub
})
return out, nil
}
var u User
if err := json.Unmarshal(e.Value(), &u); err != nil {
return nil, fmt.Errorf("membership: unmarshal user: %w", err)
}
out = append(out, u)
case <-ctx.Done():
return nil, ctx.Err()
}
}
}
func (s *jetstreamStore) RevokeUser(signPub string) error {
signPub = normalizeSignPub(signPub)
u, err := s.GetUser(signPub)
if err != nil {
if errors.Is(err, ErrNotFound) {
return fmt.Errorf("membership: revoke user %q: no active user with that key", signPub)
}
return fmt.Errorf("membership: revoke user %q: %w", signPub, err)
}
if u.Status != StatusActive {
return fmt.Errorf("membership: revoke user %q: no active user with that key", signPub)
}
u.Status = StatusRevoked
u.RevokedAt = nowRFC3339()
b, err := json.Marshal(u)
if err != nil {
return fmt.Errorf("membership: marshal user: %w", err)
}
ctx, cancel := s.ctx()
defer cancel()
if _, err := s.users.Put(ctx, signPub, b); err != nil {
return fmt.Errorf("membership: revoke user %q: %w", signPub, err)
}
return nil
}
// IsAuthorized reports whether signPub is an active bus user. Any backend error
// (including a KV quorum loss or timeout) yields false: fail closed.
func (s *jetstreamStore) IsAuthorized(signPub string) bool {
signPub = normalizeSignPub(signPub)
if signPub == "" {
return false
}
ctx, cancel := s.ctx()
defer cancel()
e, err := s.users.Get(ctx, signPub)
if err != nil {
return false
}
var u User
if err := json.Unmarshal(e.Value(), &u); err != nil {
return false
}
return u.Status == StatusActive
}
// HasAdmin reports whether at least one active admin exists. On any backend
// error it returns false, keeping the admin-gated endpoints closed (conservative).
func (s *jetstreamStore) HasAdmin() bool {
users, err := s.ListUsers()
if err != nil {
return false
}
for _, u := range users {
if u.Role == RoleAdmin && u.Status == StatusActive {
return true
}
}
return false
}
// ---- snapshot import / export (issue 0003c migration) ---------------------
// importSnapshot writes a full Snapshot into the KV buckets, preserving each
// room's epoch and each user's status (Put, not CreateRoom/AddUser, so the exact
// state is reproduced rather than reset to defaults). Idempotent: every write is
// an overwrite, so re-running the migration converges.
func (s *jetstreamStore) importSnapshot(snap *Snapshot) error {
ctx, cancel := s.ctx()
defer cancel()
for _, r := range snap.Rooms {
b, err := json.Marshal(r)
if err != nil {
return fmt.Errorf("import: marshal room %q: %w", r.RoomID, err)
}
if _, err := s.rooms.Put(ctx, r.RoomID, b); err != nil {
return fmt.Errorf("import: put room %q: %w", r.RoomID, err)
}
}
for roomID, members := range snap.Members {
for _, m := range members {
if err := s.putMember(ctx, roomID, m); err != nil {
return fmt.Errorf("import: %w", err)
}
}
}
for _, rec := range snap.Keys {
if _, err := s.keys.Put(ctx, sealedKey(rec.RoomID, rec.Endpoint, rec.Epoch), rec.Sealed); err != nil {
return fmt.Errorf("import: put key %q/%q@%d: %w", rec.RoomID, rec.Endpoint, rec.Epoch, err)
}
}
for _, u := range snap.Users {
b, err := json.Marshal(u)
if err != nil {
return fmt.Errorf("import: marshal user %q: %w", u.SignPub, err)
}
if _, err := s.users.Put(ctx, normalizeSignPub(u.SignPub), b); err != nil {
return fmt.Errorf("import: put user %q: %w", u.SignPub, err)
}
}
return nil
}
// ExportSnapshot reads the entire KV control-plane state back into a Snapshot,
// so the migration's parity test can compare it against the SQLite source.
func (s *jetstreamStore) ExportSnapshot() (*Snapshot, error) {
snap := &Snapshot{Members: map[string][]Member{}}
roomEntries, err := s.watchAll(s.rooms)
if err != nil {
return nil, fmt.Errorf("export kv: rooms: %w", err)
}
for _, e := range roomEntries {
var r RoomInfo
if err := json.Unmarshal(e.Value(), &r); err != nil {
return nil, fmt.Errorf("export kv: unmarshal room: %w", err)
}
snap.Rooms = append(snap.Rooms, r)
}
memberEntries, err := s.watchAll(s.members)
if err != nil {
return nil, fmt.Errorf("export kv: members: %w", err)
}
for _, e := range memberEntries {
// Key is "<roomID>.<endpoint>"; neither segment contains a dot.
roomID := strings.SplitN(e.Key(), ".", 2)[0]
var m Member
if err := json.Unmarshal(e.Value(), &m); err != nil {
return nil, fmt.Errorf("export kv: unmarshal member: %w", err)
}
snap.Members[roomID] = append(snap.Members[roomID], m)
}
keyEntries, err := s.watchAll(s.keys)
if err != nil {
return nil, fmt.Errorf("export kv: keys: %w", err)
}
for _, e := range keyEntries {
// Key is "<roomID>.<endpoint>.<epoch>".
parts := strings.Split(e.Key(), ".")
if len(parts) != 3 {
continue
}
epoch, err := strconv.Atoi(parts[2])
if err != nil {
continue
}
snap.Keys = append(snap.Keys, SealedKeyRecord{RoomID: parts[0], Endpoint: parts[1], Epoch: epoch, Sealed: e.Value()})
}
users, err := s.ListUsers()
if err != nil {
return nil, fmt.Errorf("export kv: users: %w", err)
}
snap.Users = users
return snap, nil
}
// watchAll collects every current entry of a bucket (no key filter), draining
// the watcher to its initial-snapshot nil marker.
func (s *jetstreamStore) watchAll(kv jetstream.KeyValue) ([]jetstream.KeyValueEntry, error) {
ctx, cancel := s.ctx()
defer cancel()
w, err := kv.WatchAll(ctx, jetstream.IgnoreDeletes())
if err != nil {
return nil, err
}
defer w.Stop()
var out []jetstream.KeyValueEntry
for {
select {
case e := <-w.Updates():
if e == nil {
return out, nil
}
out = append(out, e)
case <-ctx.Done():
return nil, ctx.Err()
}
}
}
+275
View File
@@ -0,0 +1,275 @@
package membership
import (
"bytes"
"errors"
"net"
"testing"
"time"
"github.com/enmanuel/unibus/pkg/embeddednats"
"github.com/nats-io/nats.go"
"github.com/nats-io/nats.go/jetstream"
server "github.com/nats-io/nats-server/v2/server"
)
func kvFreePort(t *testing.T) int {
t.Helper()
l, err := net.Listen("tcp", "127.0.0.1:0")
if err != nil {
t.Fatalf("free port: %v", err)
}
defer l.Close()
return l.Addr().(*net.TCPAddr).Port
}
// newKVStore boots a single-node embedded NATS with JetStream and opens a
// jetstreamStore (R1) over it, returning the store plus the server and
// connection so a test can shut the backend down to exercise fail-closed paths.
func newKVStore(t *testing.T) (*jetstreamStore, *server.Server, *nats.Conn) {
t.Helper()
ns, err := embeddednats.StartServer(embeddednats.ServerConfig{
StoreDir: t.TempDir(),
Host: "127.0.0.1",
Port: kvFreePort(t),
})
if err != nil {
t.Fatalf("embedded nats: %v", err)
}
nc, err := nats.Connect(ns.ClientURL())
if err != nil {
ns.Shutdown()
t.Fatalf("nats connect: %v", err)
}
js, err := jetstream.New(nc)
if err != nil {
nc.Close()
ns.Shutdown()
t.Fatalf("jetstream: %v", err)
}
st, err := OpenJetStream(js, JetStreamConfig{Replicas: 1, OpTimeout: 2 * time.Second})
if err != nil {
nc.Close()
ns.Shutdown()
t.Fatalf("open jetstream store: %v", err)
}
t.Cleanup(func() {
nc.Close()
ns.Shutdown()
ns.WaitForShutdown()
})
return st.(*jetstreamStore), ns, nc
}
// TestJetStreamStoreRoomsCRUD is the golden path: an encrypted room with an owner
// and an invited member round-trips through every room/member/key method.
func TestJetStreamStoreRoomsCRUD(t *testing.T) {
s, _, _ := newKVStore(t)
roomID := newULID()
owner := "owner-ep-1"
info := RoomInfo{RoomID: roomID, Subject: "room.kv", Encrypt: true, Persist: true, SignMsgs: true, OwnerEndpoint: owner}
ownerSealed := []byte("sealed-owner-epoch1")
if err := s.CreateRoom(info, []byte("owner-sign"), []byte("owner-kex"), ownerSealed); err != nil {
t.Fatalf("CreateRoom: %v", err)
}
// GetRoom returns epoch 1 and the policy.
got, err := s.GetRoom(roomID)
if err != nil {
t.Fatalf("GetRoom: %v", err)
}
if got.Epoch != 1 || got.Subject != "room.kv" || !got.Encrypt || got.OwnerEndpoint != owner {
t.Fatalf("GetRoom mismatch: %+v", got)
}
// Owner is a member with role "owner".
om, err := s.GetMember(roomID, owner)
if err != nil {
t.Fatalf("GetMember owner: %v", err)
}
if om.Role != "owner" || !bytes.Equal(om.SignPub, []byte("owner-sign")) {
t.Fatalf("owner member mismatch: %+v", om)
}
// Owner's sealed key at epoch 1.
ep, sealed, err := s.GetSealedKey(roomID, owner, 1)
if err != nil || ep != 1 || !bytes.Equal(sealed, ownerSealed) {
t.Fatalf("GetSealedKey owner: ep=%d sealed=%q err=%v", ep, sealed, err)
}
// Invite a member with a sealed key at epoch 1.
bob := "member-ep-bob"
bobSealed := []byte("sealed-bob-epoch1")
if err := s.AddMember(roomID, Member{Endpoint: bob, Role: "member", SignPub: []byte("bob-sign"), KexPub: []byte("bob-kex")}, 1, bobSealed); err != nil {
t.Fatalf("AddMember: %v", err)
}
// ListMembers returns both, sorted by endpoint.
members, err := s.ListMembers(roomID)
if err != nil {
t.Fatalf("ListMembers: %v", err)
}
if len(members) != 2 {
t.Fatalf("ListMembers want 2, got %d (%+v)", len(members), members)
}
// Bob can find the room via the reverse index.
rooms, err := s.ListRoomsForEndpoint(bob)
if err != nil {
t.Fatalf("ListRoomsForEndpoint: %v", err)
}
if len(rooms) != 1 || rooms[0].RoomID != roomID || rooms[0].Role != "member" {
t.Fatalf("ListRoomsForEndpoint mismatch: %+v", rooms)
}
// Latest sealed key (epoch <= 0) resolves to epoch 1 for bob.
lep, lsealed, err := s.GetSealedKey(roomID, bob, 0)
if err != nil || lep != 1 || !bytes.Equal(lsealed, bobSealed) {
t.Fatalf("GetSealedKey latest bob: ep=%d err=%v", lep, err)
}
// Rekey to epoch 2 (bump + new sealed keys), then latest resolves to 2.
if err := s.BumpEpoch(roomID, 2); err != nil {
t.Fatalf("BumpEpoch: %v", err)
}
if err := s.PutSealedKeys(roomID, 2, map[string][]byte{owner: []byte("owner-epoch2")}); err != nil {
t.Fatalf("PutSealedKeys: %v", err)
}
got2, _ := s.GetRoom(roomID)
if got2.Epoch != 2 {
t.Fatalf("after BumpEpoch want epoch 2, got %d", got2.Epoch)
}
lep2, _, err := s.GetSealedKey(roomID, owner, 0)
if err != nil || lep2 != 2 {
t.Fatalf("latest owner key after rekey: ep=%d err=%v", lep2, err)
}
// Remove bob; he disappears from members and his reverse index.
if err := s.RemoveMember(roomID, bob); err != nil {
t.Fatalf("RemoveMember: %v", err)
}
if _, err := s.GetMember(roomID, bob); !errors.Is(err, ErrNotFound) {
t.Fatalf("GetMember after remove want ErrNotFound, got %v", err)
}
rooms2, _ := s.ListRoomsForEndpoint(bob)
if len(rooms2) != 0 {
t.Fatalf("ListRoomsForEndpoint after remove want 0, got %d", len(rooms2))
}
}
// TestJetStreamStoreUsers exercises the allowlist: add, lookup, authorize,
// revoke (which flips IsAuthorized), and the admin gate.
func TestJetStreamStoreUsers(t *testing.T) {
s, _, _ := newKVStore(t)
const aliceHex = "aa11"
if s.HasAdmin() {
t.Fatalf("fresh store should have no admin")
}
if err := s.AddUser(aliceHex, "alice", RoleAdmin); err != nil {
t.Fatalf("AddUser: %v", err)
}
if !s.HasAdmin() {
t.Fatalf("HasAdmin should be true after adding an admin")
}
if !s.IsAuthorized(aliceHex) {
t.Fatalf("alice should be authorized")
}
// Case-insensitive lookup (keys are normalized lowercase).
if !s.IsAuthorized("AA11") {
t.Fatalf("uppercase hex should normalize and authorize")
}
u, err := s.GetUser(aliceHex)
if err != nil || u.Handle != "alice" || u.Role != RoleAdmin || u.Status != StatusActive {
t.Fatalf("GetUser mismatch: %+v err=%v", u, err)
}
// Duplicate add is rejected with ErrUserExists.
if err := s.AddUser(aliceHex, "alice2", RoleMember); !errors.Is(err, ErrUserExists) {
t.Fatalf("duplicate AddUser want ErrUserExists, got %v", err)
}
if err := s.AddUser("bb22", "bob", RoleMember); err != nil {
t.Fatalf("AddUser bob: %v", err)
}
users, err := s.ListUsers()
if err != nil || len(users) != 2 {
t.Fatalf("ListUsers want 2, got %d err=%v", len(users), err)
}
// Revoke alice: authorization flips off immediately.
if err := s.RevokeUser(aliceHex); err != nil {
t.Fatalf("RevokeUser: %v", err)
}
if s.IsAuthorized(aliceHex) {
t.Fatalf("revoked user must not be authorized")
}
if s.HasAdmin() {
t.Fatalf("after revoking the only admin, HasAdmin must be false")
}
// Revoking again is an error (no active user).
if err := s.RevokeUser(aliceHex); err == nil {
t.Fatalf("re-revoke should error")
}
}
// TestJetStreamStoreNotFound checks the ErrNotFound mapping for misses.
func TestJetStreamStoreNotFound(t *testing.T) {
s, _, _ := newKVStore(t)
if _, err := s.GetRoom("nope"); !errors.Is(err, ErrNotFound) {
t.Fatalf("GetRoom miss want ErrNotFound, got %v", err)
}
if _, err := s.GetMember("nope", "x"); !errors.Is(err, ErrNotFound) {
t.Fatalf("GetMember miss want ErrNotFound, got %v", err)
}
if _, _, err := s.GetSealedKey("nope", "x", 1); !errors.Is(err, ErrNotFound) {
t.Fatalf("GetSealedKey miss want ErrNotFound, got %v", err)
}
if _, _, err := s.GetSealedKey("nope", "x", 0); !errors.Is(err, ErrNotFound) {
t.Fatalf("GetSealedKey latest miss want ErrNotFound, got %v", err)
}
if _, err := s.GetUser("ffff"); !errors.Is(err, ErrNotFound) {
t.Fatalf("GetUser miss want ErrNotFound, got %v", err)
}
}
// TestJetStreamStoreIsAuthorizedFailClosed is the error path mandated by the
// issue: when the KV backend is unavailable (here the NATS server is shut down),
// IsAuthorized must DENY, never admit. A previously-authorized identity flips to
// unauthorized once the backend cannot be reached.
func TestJetStreamStoreIsAuthorizedFailClosed(t *testing.T) {
s, ns, nc := newKVStore(t)
const aliceHex = "abcd"
if err := s.AddUser(aliceHex, "alice", RoleAdmin); err != nil {
t.Fatalf("AddUser: %v", err)
}
if !s.IsAuthorized(aliceHex) {
t.Fatalf("alice should be authorized while the backend is up")
}
// Take the KV backend away: close the client and stop the server. Every
// subsequent KV Get fails, and the store must fail closed.
nc.Close()
ns.Shutdown()
ns.WaitForShutdown()
// Bound the assertion: IsAuthorized internally caps each op at OpTimeout, so
// this returns well before the test deadline.
done := make(chan bool, 1)
go func() { done <- s.IsAuthorized(aliceHex) }()
select {
case authorized := <-done:
if authorized {
t.Fatalf("KV backend down but IsAuthorized returned true: NOT fail-closed")
}
case <-time.After(10 * time.Second):
t.Fatalf("IsAuthorized hung when the backend was down (no bounded timeout)")
}
// HasAdmin is likewise conservative: backend down -> false (gates stay closed).
if s.HasAdmin() {
t.Fatalf("KV backend down but HasAdmin returned true: NOT fail-closed")
}
}
+152
View File
@@ -0,0 +1,152 @@
package membership_test
// Regression for audit report 0008, vector N2: with the broad "$JS.API.>" grant
// removed (issue 0006b), a registered peer that belongs to no room can no longer
// read the control-plane KV buckets over NATS, while the per-room JetStream API of
// a peer's OWN rooms keeps working. The auditor's ephemeral attack populated the
// KV control plane and had a registered non-member harvest the allowlist, the room
// graph and the sealed-key metadata directly through "$JS.API.>".
import (
"context"
"encoding/hex"
"path/filepath"
"testing"
"time"
"github.com/enmanuel/unibus/pkg/busauth"
"github.com/enmanuel/unibus/pkg/embeddednats"
"github.com/enmanuel/unibus/pkg/frame"
"github.com/enmanuel/unibus/pkg/membership"
"github.com/nats-io/nats.go"
"github.com/nats-io/nats.go/jetstream"
server "github.com/nats-io/nats-server/v2/server"
)
// startACLNatsInternal is startACLNats plus a recognized internal service identity
// (so the test can seed the KV control plane with full permissions, exactly as the
// decentralized membershipd does at bootstrap).
func startACLNatsInternal(t *testing.T, store membership.Store, internalPubHex string) *server.Server {
t.Helper()
auth := busauth.NewNkeyAuthenticatorACLInternal(store.IsAuthorized, aclPermsFunc(store), internalPubHex)
ns, err := embeddednats.StartServer(embeddednats.ServerConfig{
StoreDir: t.TempDir(), Host: "127.0.0.1", Port: aclFreePort(t), Auth: auth,
})
if err != nil {
t.Fatalf("acl nats: %v", err)
}
t.Cleanup(func() { ns.Shutdown(); ns.WaitForShutdown() })
return ns
}
// TestAttack0008_N2 reproduces the control-plane KV leak and proves it is closed.
//
// error : eve (registered, member of no room) cannot read the KV buckets — the
// JetStream KV API and the raw $KV subject space are both denied.
// golden: the owner of a persisted room can still drive the JetStream API of HER
// OWN room's stream (so persisted-room history keeps working).
// edge : eve cannot reach another room's stream API either (cross-room JS deny).
func TestAttack0008_N2(t *testing.T) {
dir := t.TempDir()
// The HTTP control-plane store stays SQLite; the KV buckets below stand in for
// the decentralized control plane the attack targets.
store, err := membership.Open(filepath.Join(dir, "unibus.db"))
if err != nil {
t.Fatalf("store: %v", err)
}
t.Cleanup(func() { store.Close() })
ceo, eve, internalID := mustID(t), mustID(t), mustID(t)
ceoEP := frame.EndpointID(ceo.SignPub)
mustAddUser(t, store, ceo, "ceo-root-admin")
mustAddUser(t, store, eve, "eve") // registered, member of nothing
// A persisted room owned by ceo: ceo is a member, so her per-room JS is allowed.
if err := store.CreateRoom(
membership.RoomInfo{RoomID: "PRIVROOM", Subject: "room.board.ma-deal", Encrypt: true, Persist: true, OwnerEndpoint: ceoEP},
ceo.SignPub, ceo.KexPub, []byte("sealed-self"),
); err != nil {
t.Fatalf("create room: %v", err)
}
internalPubHex := hex.EncodeToString(internalID.SignPub)
ns := startACLNatsInternal(t, store, internalPubHex)
url := ns.ClientURL()
// Seed the KV control plane with the privileged internal identity (full perms),
// simulating the decentralized buckets the attack reads.
intErr := make(chan error, 4)
intNC := nkeyConn(t, url, internalID, intErr)
intJS, err := jetstream.New(intNC)
if err != nil {
t.Fatalf("internal jetstream: %v", err)
}
kvStore, err := membership.OpenJetStream(intJS, membership.JetStreamConfig{Replicas: 1, OpTimeout: 3 * time.Second})
if err != nil {
t.Fatalf("open kv buckets: %v", err)
}
if err := kvStore.AddUser(hex.EncodeToString(ceo.SignPub), "ceo-root-admin", membership.RoleAdmin); err != nil {
t.Fatalf("seed kv user: %v", err)
}
// Each JetStream op gets its own short context: a DENIED request never gets a
// reply, so it blocks until its own deadline — a shared context would be
// exhausted by the first denied call and starve the rest.
freshCtx := func(d time.Duration) (context.Context, context.CancelFunc) {
return context.WithTimeout(context.Background(), d)
}
// --- error: eve cannot read the control-plane KV buckets ------------------
eveErr := make(chan error, 8)
eveNC := nkeyConn(t, url, eve, eveErr)
eveJS, err := jetstream.New(eveNC)
if err != nil {
t.Fatalf("eve jetstream: %v", err)
}
// (a) The KV API: binding the bucket requires STREAM.INFO.KV_UNIBUS_users, which
// eve has no permission for, so this must fail (no leak of users).
kvCtx, kvCancel := freshCtx(2 * time.Second)
if kv, err := eveJS.KeyValue(kvCtx, "UNIBUS_users"); err == nil {
if e, gerr := kv.Get(kvCtx, hex.EncodeToString(ceo.SignPub)); gerr == nil {
kvCancel()
t.Fatalf("eve read the control-plane KV users bucket: %q (N2 leak still open)", string(e.Value()))
}
kvCancel()
t.Fatalf("eve was able to BIND the KV users bucket (N2 leak still open)")
}
kvCancel()
// (b) The raw KV subject space: a direct subscribe must be a permissions
// violation (delivered async to the error handler).
drain(eveErr)
if _, err := eveNC.Subscribe("$KV.UNIBUS_users.>", func(*nats.Msg) {}); err != nil {
t.Fatalf("eve sub $KV: %v", err)
}
_ = eveNC.Flush()
if e := waitErr(eveErr, 1*time.Second); e == nil {
t.Fatalf("eve subscribing to $KV.UNIBUS_users.> must raise a permissions violation")
}
// --- edge: eve cannot reach another room's stream API ---------------------
edgeCtx, edgeCancel := freshCtx(2 * time.Second)
if _, err := eveJS.Stream(edgeCtx, "UNIBUS_PRIVROOM"); err == nil {
edgeCancel()
t.Fatalf("eve reached the foreign room stream API (cross-room JS not isolated)")
}
edgeCancel()
// --- golden: ceo can drive the JetStream API of HER OWN room's stream ------
ceoErr := make(chan error, 4)
ceoNC := nkeyConn(t, url, ceo, ceoErr)
ceoJS, err := jetstream.New(ceoNC)
if err != nil {
t.Fatalf("ceo jetstream: %v", err)
}
goldenCtx, goldenCancel := freshCtx(5 * time.Second)
defer goldenCancel()
if _, err := ceoJS.CreateOrUpdateStream(goldenCtx, jetstream.StreamConfig{
Name: "UNIBUS_PRIVROOM",
Subjects: []string{"room.board.ma-deal"},
Storage: jetstream.FileStorage,
}); err != nil {
t.Fatalf("ceo could not manage her OWN room stream (per-room JS broken): %v", err)
}
}
+176
View File
@@ -0,0 +1,176 @@
package membership
// Migration from the local SQLite control plane to replicated JetStream KV
// (issue 0003c). It is the one-time, idempotent data move that decentralization
// needs: read the entire SQLite state, write it into the KV buckets. Re-running
// it is safe (every KV write is an overwrite), so a partial/interrupted run is
// recovered by running again, and a parity test can assert the two stores hold
// the same state before and after.
import (
"database/sql"
"fmt"
"strings"
"time"
"github.com/nats-io/nats.go/jetstream"
)
// SealedKeyRecord is one row of room_keys: the sealed room key for an endpoint
// at a given epoch. It is the unit the snapshot carries so a backend can be
// imported with the exact epoch history (CreateRoom/AddMember alone could not
// reproduce a multi-epoch room).
type SealedKeyRecord struct {
RoomID string
Endpoint string
Epoch int
Sealed []byte
}
// Snapshot is the complete control-plane state, backend-agnostic. It is what
// ExportSnapshot produces and importSnapshot consumes, so the SQLite->KV
// migration and the parity test both work in terms of it.
type Snapshot struct {
Rooms []RoomInfo
Members map[string][]Member // roomID -> members
Keys []SealedKeyRecord
Users []User
}
// MigrateReport summarizes what a migration moved, for the operator log.
type MigrateReport struct {
BackupPath string
Rooms int
Members int
Keys int
Users int
}
// MigrateSQLiteToKV reads the SQLite store at sqlitePath and writes its entire
// state into the JetStream KV buckets on js (created with cfg.Replicas). It is
// idempotent: re-running converges to the same state. The caller is responsible
// for backing up the SQLite file first (BackupSQLite) — this function only
// reads it.
func MigrateSQLiteToKV(sqlitePath string, js jetstream.JetStream, cfg JetStreamConfig) (*MigrateReport, error) {
src, err := openSQLite(sqlitePath)
if err != nil {
return nil, fmt.Errorf("migrate: open sqlite %q: %w", sqlitePath, err)
}
defer src.Close()
snap, err := src.ExportSnapshot()
if err != nil {
return nil, fmt.Errorf("migrate: export sqlite: %w", err)
}
dst, err := OpenJetStream(js, cfg)
if err != nil {
return nil, fmt.Errorf("migrate: open kv: %w", err)
}
kv := dst.(*jetstreamStore)
if err := kv.importSnapshot(snap); err != nil {
return nil, fmt.Errorf("migrate: import to kv: %w", err)
}
members := 0
for _, ms := range snap.Members {
members += len(ms)
}
return &MigrateReport{
Rooms: len(snap.Rooms),
Members: members,
Keys: len(snap.Keys),
Users: len(snap.Users),
}, nil
}
// BackupSQLite makes a consistent copy of the SQLite database next to it,
// named "<path>.bak.<unixnano>", using SQLite's own VACUUM INTO (which writes a
// transactionally-consistent snapshot even with a live WAL). It returns the
// backup path. Always call this before MigrateSQLiteToKV so a botched migration
// can be undone.
func BackupSQLite(path string) (string, error) {
dst := fmt.Sprintf("%s.bak.%d", path, time.Now().UnixNano())
db, err := sql.Open("sqlite", "file:"+path+"?_pragma=busy_timeout(5000)")
if err != nil {
return "", fmt.Errorf("backup: open %q: %w", path, err)
}
defer db.Close()
if err := db.Ping(); err != nil {
return "", fmt.Errorf("backup: ping %q: %w", path, err)
}
// VACUUM INTO writes a fresh, consistent database file; the literal path is
// safely single-quoted (it is operator-supplied, never network input).
if _, err := db.Exec("VACUUM INTO '" + strings.ReplaceAll(dst, "'", "''") + "'"); err != nil {
return "", fmt.Errorf("backup: VACUUM INTO %q: %w", dst, err)
}
return dst, nil
}
// ---- SQLite export --------------------------------------------------------
// ExportSnapshot reads the entire SQLite control-plane state into a Snapshot.
func (s *sqliteStore) ExportSnapshot() (*Snapshot, error) {
snap := &Snapshot{Members: map[string][]Member{}}
rows, err := s.db.Query(`SELECT room_id, subject, key_epoch, encrypt, persist, sign_msgs, owner_endpoint FROM rooms ORDER BY room_id`)
if err != nil {
return nil, fmt.Errorf("export: query rooms: %w", err)
}
for rows.Next() {
var r RoomInfo
var enc, per, sgn int
if err := rows.Scan(&r.RoomID, &r.Subject, &r.Epoch, &enc, &per, &sgn, &r.OwnerEndpoint); err != nil {
rows.Close()
return nil, fmt.Errorf("export: scan room: %w", err)
}
r.Encrypt, r.Persist, r.SignMsgs = enc != 0, per != 0, sgn != 0
snap.Rooms = append(snap.Rooms, r)
}
rows.Close()
if err := rows.Err(); err != nil {
return nil, err
}
mrows, err := s.db.Query(`SELECT room_id, endpoint, role, sign_pub, kex_pub FROM members ORDER BY room_id, endpoint`)
if err != nil {
return nil, fmt.Errorf("export: query members: %w", err)
}
for mrows.Next() {
var roomID string
var m Member
if err := mrows.Scan(&roomID, &m.Endpoint, &m.Role, &m.SignPub, &m.KexPub); err != nil {
mrows.Close()
return nil, fmt.Errorf("export: scan member: %w", err)
}
snap.Members[roomID] = append(snap.Members[roomID], m)
}
mrows.Close()
if err := mrows.Err(); err != nil {
return nil, err
}
krows, err := s.db.Query(`SELECT room_id, epoch, endpoint, sealed_key FROM room_keys ORDER BY room_id, endpoint, epoch`)
if err != nil {
return nil, fmt.Errorf("export: query room_keys: %w", err)
}
for krows.Next() {
var rec SealedKeyRecord
if err := krows.Scan(&rec.RoomID, &rec.Epoch, &rec.Endpoint, &rec.Sealed); err != nil {
krows.Close()
return nil, fmt.Errorf("export: scan room_key: %w", err)
}
snap.Keys = append(snap.Keys, rec)
}
krows.Close()
if err := krows.Err(); err != nil {
return nil, err
}
users, err := s.ListUsers()
if err != nil {
return nil, fmt.Errorf("export: list users: %w", err)
}
snap.Users = users
return snap, nil
}
+195
View File
@@ -0,0 +1,195 @@
package membership
import (
"path/filepath"
"reflect"
"sort"
"testing"
"time"
"github.com/enmanuel/unibus/pkg/embeddednats"
"github.com/nats-io/nats.go"
"github.com/nats-io/nats.go/jetstream"
)
// seedSQLite populates a SQLite store with a representative control plane: two
// rooms (one rekeyed to epoch 2 with a removed member's keys left behind), a few
// members and sealed keys, and a user allowlist with one revoked entry. It
// returns the populated *sqliteStore and its file path.
func seedSQLite(t *testing.T) (*sqliteStore, string) {
t.Helper()
path := filepath.Join(t.TempDir(), "seed.db")
s, err := openSQLite(path)
if err != nil {
t.Fatalf("openSQLite: %v", err)
}
r1 := RoomInfo{RoomID: newULID(), Subject: "room.alpha", Encrypt: true, Persist: true, SignMsgs: true, OwnerEndpoint: "ep-owner1"}
if err := s.CreateRoom(r1, []byte("o1-sign"), []byte("o1-kex"), []byte("o1-sealed-e1")); err != nil {
t.Fatalf("create r1: %v", err)
}
if err := s.AddMember(r1.RoomID, Member{Endpoint: "ep-bob", Role: "member", SignPub: []byte("bob-sign"), KexPub: []byte("bob-kex")}, 1, []byte("bob-sealed-e1")); err != nil {
t.Fatalf("add bob: %v", err)
}
// Rekey r1 to epoch 2 (owner keeps a key at the new epoch).
if err := s.BumpEpoch(r1.RoomID, 2); err != nil {
t.Fatalf("bump: %v", err)
}
if err := s.PutSealedKeys(r1.RoomID, 2, map[string][]byte{"ep-owner1": []byte("o1-sealed-e2")}); err != nil {
t.Fatalf("put keys e2: %v", err)
}
r2 := RoomInfo{RoomID: newULID(), Subject: "room.beta", Encrypt: false, Persist: false, SignMsgs: false, OwnerEndpoint: "ep-owner2"}
if err := s.CreateRoom(r2, []byte("o2-sign"), []byte("o2-kex"), nil); err != nil {
t.Fatalf("create r2: %v", err)
}
if err := s.AddUser("aa11", "alice", RoleAdmin); err != nil {
t.Fatalf("add alice: %v", err)
}
if err := s.AddUser("bb22", "bob", RoleMember); err != nil {
t.Fatalf("add bob user: %v", err)
}
if err := s.AddUser("cc33", "carol", RoleMember); err != nil {
t.Fatalf("add carol: %v", err)
}
if err := s.RevokeUser("cc33"); err != nil {
t.Fatalf("revoke carol: %v", err)
}
return s, path
}
// normalizeSnapshot sorts every slice in a Snapshot so two snapshots from
// different backends can be compared regardless of enumeration order.
func normalizeSnapshot(snap *Snapshot) {
sort.Slice(snap.Rooms, func(i, j int) bool { return snap.Rooms[i].RoomID < snap.Rooms[j].RoomID })
for _, ms := range snap.Members {
sort.Slice(ms, func(i, j int) bool { return ms[i].Endpoint < ms[j].Endpoint })
}
sort.Slice(snap.Keys, func(i, j int) bool {
a, b := snap.Keys[i], snap.Keys[j]
if a.RoomID != b.RoomID {
return a.RoomID < b.RoomID
}
if a.Endpoint != b.Endpoint {
return a.Endpoint < b.Endpoint
}
return a.Epoch < b.Epoch
})
sort.Slice(snap.Users, func(i, j int) bool { return snap.Users[i].SignPub < snap.Users[j].SignPub })
}
func newJS(t *testing.T) jetstream.JetStream {
t.Helper()
ns, err := embeddednats.StartServer(embeddednats.ServerConfig{
StoreDir: t.TempDir(),
Host: "127.0.0.1",
Port: kvFreePort(t),
})
if err != nil {
t.Fatalf("embedded nats: %v", err)
}
nc, err := nats.Connect(ns.ClientURL())
if err != nil {
ns.Shutdown()
t.Fatalf("nats connect: %v", err)
}
js, err := jetstream.New(nc)
if err != nil {
nc.Close()
ns.Shutdown()
t.Fatalf("jetstream: %v", err)
}
t.Cleanup(func() { nc.Close(); ns.Shutdown(); ns.WaitForShutdown() })
return js
}
// TestMigrateSQLiteToKVParity is the parity test the issue mandates: after the
// migration, the KV store holds exactly the SQLite source's state.
func TestMigrateSQLiteToKVParity(t *testing.T) {
src, path := seedSQLite(t)
srcSnap, err := src.ExportSnapshot()
if err != nil {
t.Fatalf("export sqlite: %v", err)
}
src.Close() // release the file before the migration reopens it
js := newJS(t)
report, err := MigrateSQLiteToKV(path, js, JetStreamConfig{Replicas: 1, OpTimeout: 5 * time.Second})
if err != nil {
t.Fatalf("migrate: %v", err)
}
if report.Rooms != 2 || report.Users != 3 {
t.Fatalf("report mismatch: %+v", report)
}
kv, err := OpenJetStream(js, JetStreamConfig{Replicas: 1, OpTimeout: 5 * time.Second})
if err != nil {
t.Fatalf("open kv: %v", err)
}
kvSnap, err := kv.(*jetstreamStore).ExportSnapshot()
if err != nil {
t.Fatalf("export kv: %v", err)
}
normalizeSnapshot(srcSnap)
normalizeSnapshot(kvSnap)
if !reflect.DeepEqual(srcSnap, kvSnap) {
t.Fatalf("parity mismatch after migration:\n sqlite=%+v\n kv= %+v", srcSnap, kvSnap)
}
}
// TestMigrateSQLiteToKVIdempotent: running the migration twice converges to the
// same KV state (every write is an overwrite). A second run must not duplicate
// or corrupt anything.
func TestMigrateSQLiteToKVIdempotent(t *testing.T) {
src, path := seedSQLite(t)
srcSnap, _ := src.ExportSnapshot()
src.Close()
js := newJS(t)
if _, err := MigrateSQLiteToKV(path, js, JetStreamConfig{Replicas: 1}); err != nil {
t.Fatalf("migrate run 1: %v", err)
}
if _, err := MigrateSQLiteToKV(path, js, JetStreamConfig{Replicas: 1}); err != nil {
t.Fatalf("migrate run 2: %v", err)
}
kv, _ := OpenJetStream(js, JetStreamConfig{Replicas: 1})
kvSnap, err := kv.(*jetstreamStore).ExportSnapshot()
if err != nil {
t.Fatalf("export kv: %v", err)
}
normalizeSnapshot(srcSnap)
normalizeSnapshot(kvSnap)
if !reflect.DeepEqual(srcSnap, kvSnap) {
t.Fatalf("idempotency broken: a second migration changed the KV state\n sqlite=%+v\n kv= %+v", srcSnap, kvSnap)
}
}
// TestBackupSQLiteCreatesConsistentCopy verifies the pre-migration backup is a
// real, openable copy holding the same data.
func TestBackupSQLiteCreatesConsistentCopy(t *testing.T) {
src, path := seedSQLite(t)
srcSnap, _ := src.ExportSnapshot()
src.Close()
bak, err := BackupSQLite(path)
if err != nil {
t.Fatalf("backup: %v", err)
}
restored, err := openSQLite(bak)
if err != nil {
t.Fatalf("open backup: %v", err)
}
defer restored.Close()
bakSnap, err := restored.ExportSnapshot()
if err != nil {
t.Fatalf("export backup: %v", err)
}
normalizeSnapshot(srcSnap)
normalizeSnapshot(bakSnap)
if !reflect.DeepEqual(srcSnap, bakSnap) {
t.Fatalf("backup is not a faithful copy")
}
}
+22
View File
@@ -0,0 +1,22 @@
-- 002_users.sql — bus-level user directory (issue 0001a).
--
-- The authoritative allowlist of identities permitted to use the bus, independent
-- of room membership. A user is identified by its Ed25519 signing public key (the
-- same key that derives the endpoint via frame.EndpointID); roles gate admin-only
-- control-plane operations; status enables revocation without deleting history.
--
-- Additive and idempotent: safe to apply repeatedly. Never modify this file;
-- further schema changes go in new numbered migrations (see
-- .claude/rules/db_migrations.md). The embedded copy under
-- pkg/membership/migrations/002_users.sql mirrors this file byte-for-byte.
CREATE TABLE IF NOT EXISTS users (
sign_pub TEXT PRIMARY KEY, -- Ed25519 public key in lowercase hex (peer identity)
handle TEXT NOT NULL, -- human-readable name (unique recommended, not enforced as PK)
role TEXT NOT NULL DEFAULT 'member', -- 'admin' | 'member'
status TEXT NOT NULL DEFAULT 'active', -- 'active' | 'revoked'
created_at TEXT NOT NULL,
revoked_at TEXT
);
CREATE INDEX IF NOT EXISTS idx_users_status ON users(status);

Some files were not shown because too many files have changed in this diff Show More