unibus

Author	SHA1	Message	Date
egutierrez	f31580deec	Merge quick/nats-monitor-flag: UNIBUS_NATS_MONITOR loopback monitoring decoupled from debug log (bump 0.11.0)	2026-06-07 21:18:59 +02:00
Egutierrez	1c9325104c	feat(embeddednats): UNIBUS_NATS_MONITOR flag decoupled from debug log Add a dedicated UNIBUS_NATS_MONITOR=1 toggle that opens the embedded nats-server monitoring HTTP endpoint (127.0.0.1:8222, loopback only) so a local metrics scraper can read /varz, /connz and /jsz for server-level metrics (msgs/s, connections, KV bucket msgs, RAFT leader per stream, restarts). Previously the monitoring endpoint was only reachable via UNIBUS_NATS_DEBUG=1, which is coupled to the verbose nats-server debug log: enabling the endpoint also wrote routes/RAFT/room subjects to journald in clear, which regresses the hardened posture (issue 0007). The two concerns are now decoupled. The toggle computation is extracted to a pure function natsLogOpts(debugEnv, monitorEnv) (noLog, debug, trace, monitor): MONITOR=1 opens the endpoint while keeping the log quiet (NoLog true / Debug false). The inverse coupling is preserved for backward compatibility (DEBUG still implies MONITOR). The 127.0.0.1 bind stays hardcoded — the monitoring endpoint has no auth and must never be reachable from the network. Deploy wiring versioned: additive systemd drop-in membershipd-cluster.service.d/nats-monitor.conf (Environment=UNIBUS_NATS_MONITOR=1) plus a "NATS server metrics" section in the cluster README with the rolling activation runbook (magnus -> homer -> datardos) gated on R3 reconvergence (followers 2/2) between nodes. Tests: pure decoupling table (monitor on => log NOT debug; debug => monitor; default closed) + a real embedded server with MONITOR=1 asserting /varz answers 200 on loopback:8222, and a server without the flag with the endpoint closed. 100% additive: behavior is identical without the flag. Bump app.md 0.10.0 -> 0.11.0. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-07 20:57:46 +02:00
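The decoupling table in the commit above (monitor on => log not debug; debug => monitor; default closed) can be sketched as a pure function. This is only an illustration under stated assumptions: the commit gives the name natsLogOpts and its result order, but the "1" comparison and the handling of trace (always false here) are guesses, not the repository's code.

```go
package main

import "fmt"

// natsLogOpts is a sketch of the pure toggle computation. It maps the raw
// values of UNIBUS_NATS_DEBUG and UNIBUS_NATS_MONITOR to
// (noLog, debug, trace, monitor).
func natsLogOpts(debugEnv, monitorEnv string) (noLog, debug, trace, monitor bool) {
	debugOn := debugEnv == "1"     // assumption: only "1" enables a toggle
	monitorOn := monitorEnv == "1" // assumption: only "1" enables a toggle

	noLog = !debugOn               // the log stays quiet unless debug is requested
	debug = debugOn                // verbose log only with UNIBUS_NATS_DEBUG=1
	trace = false                  // assumption: trace is never enabled in this sketch
	monitor = monitorOn || debugOn // DEBUG still implies MONITOR (backward compatible)
	return noLog, debug, trace, monitor
}

func main() {
	cases := []struct{ debugEnv, monitorEnv string }{
		{"", ""},  // default: endpoint closed, log quiet
		{"", "1"}, // monitor only: endpoint open, log quiet
		{"1", ""}, // debug: verbose log and endpoint open
	}
	for _, c := range cases {
		noLog, debug, trace, monitor := natsLogOpts(c.debugEnv, c.monitorEnv)
		fmt.Printf("DEBUG=%q MONITOR=%q -> noLog=%v debug=%v trace=%v monitor=%v\n",
			c.debugEnv, c.monitorEnv, noLog, debug, trace, monitor)
	}
}
```

Keeping the computation free of os.Getenv calls is what makes the table testable without touching the process environment.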
egutierrez	b4f3118e85	Merge quick/users-http-admin: HTTP admin-only users API + client methods (report 0014)	2026-06-07 20:46:44 +02:00
egutierrez	e9053169da	Merge quick/0011-deploy-gaps: live user-add --store kv + clientcheck E2E + runbook fixes (report 0012)	2026-06-07 20:46:44 +02:00
Egutierrez	b983e43090	docs(0007): spec for encryption-at-rest of the control plane (JetStream/SQLite on disk)	2026-06-07 20:34:35 +02:00
egutierrez	b379730225	docs(app): document users HTTP admin model, bump 0.10.0 Add a gotcha describing the unified-storage model (the server writes users to the same store/KV as rooms), the admin-only HTTP surface, and the CLI-seeds-admin-#0 bootstrap. Bump the version 0.9.0 -> 0.10.0 and add the capability growth log entry for the new HTTP admin users API. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-07 20:32:05 +02:00
egutierrez	450ca01baf	feat(membership,client): HTTP admin-only users API Close the last control-plane asymmetry: rooms had a signed HTTP surface but users were only manageable via the local CLI or direct store access. Add admin-only HTTP endpoints, symmetric with rooms, executed against the same privileged store the server already serves (SQLite single-node, the replicated JetStream KV in cluster); there is no new KV connection and no internal identity, so the admin panel can manage the allowlist by signing as an admin instead of needing --db / direct KV access. Endpoints (all behind requireAdmin, on top of the existing signature+nonce+TLS+enforce middleware): - GET /users list the full allowlist (incl. revoked) - POST /users add {sign_pub, handle, role} - POST /users/{signpub}/revoke revoke (status flip, no hard delete) requireAdmin is default-deny with no dev relaxation: it allows a request only when the authenticated signer is confirmed by the store as an active admin; any other case (no signer, non-admin, revoked, store error) is 403, fail-closed. The request context now also carries the signer's sign_pub hex, because the endpoint id is a one-way hash of the key and cannot be reversed to look the signer up in the allowlist. Validation/idempotency mirror the CLI: sign_pub must be 64-hex, role must be admin|member (empty defaults to member), re-adding an existing key is a 409 that leaves the row untouched. The hex check is unified into membership.ValidateSignPubHex, reused by the CLI and the handlers. pkg/client gains ListUsers/AddUser/RevokeUser (flat UserInfo type) signed via doJSON, so the panel plugs in directly. Tests: non-admin -> 403 on all three endpoints; admin add->list->revoke roundtrip; validation (400 hex, 400 role, 409 re-add, row untouched); plus a client test against an embedded membershipd under enforce. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-07 20:31:57 +02:00
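The validation rules stated in the commit above (sign_pub is 64 hex characters; role is admin or member, with empty defaulting to member) might look roughly like this. The helper names validateSignPubHex and normalizeRole are hypothetical; the real membership.ValidateSignPubHex signature is not shown in the log.

```go
package main

import (
	"encoding/hex"
	"errors"
	"fmt"
)

// validateSignPubHex is a hypothetical stand-in for the shared hex check:
// exactly 64 hexadecimal characters (a 32-byte public key).
func validateSignPubHex(s string) error {
	if len(s) != 64 {
		return errors.New("sign_pub must be 64 hex characters")
	}
	if _, err := hex.DecodeString(s); err != nil {
		return fmt.Errorf("sign_pub is not valid hex: %w", err)
	}
	return nil
}

// normalizeRole applies the "empty defaults to member" rule and rejects
// anything other than admin or member.
func normalizeRole(role string) (string, error) {
	switch role {
	case "":
		return "member", nil
	case "admin", "member":
		return role, nil
	default:
		return "", fmt.Errorf("role must be admin|member, got %q", role)
	}
}

func main() {
	key := "0123456789abcdef0123456789abcdef0123456789abcdef0123456789abcdef"
	fmt.Println("key error:", validateSignPubHex(key))
	role, err := normalizeRole("")
	fmt.Println("role:", role, "error:", err)
}
```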
egutierrez	e1a7402ff1	chore: bump unibus to 0.9.0 (live user-add + clientcheck) New capability membershipd user add --store kv against a live cluster plus cmd/clientcheck end-to-end verification (issue 0011 gaps, report 0012). Adds the capability growth log entry. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-07 19:41:56 +02:00
egutierrez	ce72131ddf	docs(cluster): correct runbook + wire --internal-id-file into deploy Corrections learned from the real 0011 deploy: - Bring up: the "start magnus alone and verify healthz" order deadlocks — a lone node of a 3-node cluster has no meta-group quorum and never serves healthz until a second node joins. Document a quorum-forming start and that a node never self-serves. - Replication: R1 is an unusable SPOF (all six control-plane buckets on one node) and the cold start only converges with the three cold-start fixes; go straight to R3 once the cluster forms. - Add a "user add --store kv" section: the live user-add path that replaces stop-seed-restart, with its security model and idempotency/HA/no-delete semantics. - Topology: real IPs, ROUTE_NETWORK=public (no WireGuard mesh exists). - Chaos test: mark the data-plane client + failover proofs as validated (0012). Deploy machinery now emits the persisted internal identity: the unit gains --internal-id-file ${INTERNAL_ID_FILE} and deploy-cluster.sh writes INTERNAL_ID_FILE into each node's cluster.env, so a fresh deploy enables the live user-add path on every node. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-07 19:41:56 +02:00
egutierrez	3aa5a2c9a9	feat(clientcheck): end-to-end client verification (E2E room + failover) The 0011 chaos test validated only the control plane (healthz + leader failover + KV readable with 2/3); it never connected an authenticated bus client to the data plane. cmd/clientcheck is a reusable verification tool: it connects with a real identity (nkey + TLS on both planes, multi-node seed lists), creates an ephemeral E2E room (encrypted + signed, no durable stream), and either publishes N messages and asserts all come back decrypted (golden) or publishes a counter for a duration while logging the attached node (loop), so stopping a node mid-run shows the client fail over to a survivor and keep receiving with quorum 2/3. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-07 19:41:56 +02:00
egutierrez	02c2004ebd	feat(membershipd): user add/list/revoke --store kv against a live cluster Closes the most valuable 0011 deploy gap: adding users to the running cluster's replicated allowlist with no stop-seed-restart. Under enforce the per-subject ACL confines every bus user to its own rooms, so no ordinary identity may write the control-plane KV buckets; the only identity the authenticator grants full JetStream permissions is membershipd's internal service identity. - main.go: --internal-id-file persists that identity (load-or-create, 0600) instead of a fresh ephemeral key, so the same nkey is available out of process. Empty keeps the ephemeral default (single-node/dev unchanged). - users_kv.go: connectKVStore loads the persisted identity, presents its nkey (recognized as internal -> full perms), opens the KV store and writes. Defaults assume an on-node loopback invocation; a remote target without --ca is refused (allowlist must not travel cleartext, audit N6). Prints KV_UNIBUS_users replication (followers_current) after a write. - users_cli.go: --store kv on add/list/revoke. Re-adding a key is an explicit ErrUserExists (no silent overwrite / role flip); revoke is a status flip. - pkg/client: LoadIdentity (load-only) extracted from LoadOrCreateIdentity, preserving its "corrupt file is an error, not silently regenerated" guard. - kv_useradd_test.go: golden write under enforce, idempotency, unreachable endpoint, and remote-without-CA refusal against an embedded node. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-07 19:41:38 +02:00
egutierrez	ff580ac031	Merge quick/cluster-coldstart-fixes: 3-node cluster cold-start fixes + real topology	2026-06-07 18:56:28 +02:00
egutierrez	9fbff79df4	chore(deploy): fill cluster nodes.env with the real 3-node topology Set magnus's public IP (135.125.201.30) and switch ROUTE_NETWORK to "public": the three nodes have no WireGuard mesh (homer/datardos do not even have wg installed), so server-to-server routes go over the public IPs, still protected by the separate cluster route CA (mutual TLS). KV_REPLICAS is raised to 3 now that the cluster runs at R3. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-07 18:56:28 +02:00
egutierrez	33746d9962	fix(cluster): make the JetStream control-plane survive a cold multi-node start Bringing up the 3-node cluster from clean stores never converged: every node looped on `open KV bucket "UNIBUS_rooms" (replicas=1): context deadline exceeded`. Three independent defects in the clustered bootstrap path, none of which surface on a single node (where JetStream is ready instantly), caused it: 1. embeddednats: route connection pooling (nats-server 2.10 default pool of 3) churned with "duplicate route"/"client closed" reconnects on the small cluster, interrupting the meta-group RAFT heartbeats and forcing perpetual leader re-elections. Set Cluster.PoolSize = -1 (single route per peer). 2. embeddednats: the cluster nodes are Docker hosts, so NATS advertised the docker bridge IPs (172.x / 10.0.x) to peers, which then tried to dial those private, mutually-unreachable addresses. Set Cluster.NoAdvertise = true so only the explicit public-IP routes are used. Also added a UNIBUS_NATS_DEBUG env toggle (off by default) that enables the embedded server's logger and loopback monitoring port for debugging the route/meta layer. 3. membership.OpenJetStream: a KV op is a NATS request/reply; on a cold cluster the op was published once, before the node had contact with the meta leader, so the request was dropped and the single long-context call just blocked until timeout. Retry each bucket op with short per-attempt contexts until it succeeds or an overall bootstrap budget (120s) is exhausted, so it lands once the meta settles. With these the cluster forms cleanly, creates the KV buckets, scales R1->R3 in place, and survives loss of one node (quorum 2/3). Verified on magnus+homer+datardos. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-07 18:56:28 +02:00
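Defect 3 in the commit above is fixed by retrying each bucket op with short per-attempt contexts inside an overall bootstrap budget. A minimal sketch of that pattern follows; retryWithBudget and demoRetry are hypothetical names, and the durations are shortened for illustration (the commit states a 120s budget, and does not give the per-attempt timeout or pause).

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// retryWithBudget calls op with a fresh short-lived context per attempt and
// keeps retrying until op succeeds or the overall budget runs out.
func retryWithBudget(budget, perAttempt, pause time.Duration, op func(ctx context.Context) error) error {
	overall, cancel := context.WithTimeout(context.Background(), budget)
	defer cancel()
	for {
		attempt, cancelAttempt := context.WithTimeout(overall, perAttempt)
		err := op(attempt)
		cancelAttempt()
		if err == nil {
			return nil
		}
		select {
		case <-overall.Done():
			return fmt.Errorf("bootstrap budget exhausted: %w", err)
		case <-time.After(pause): // brief pause, then try again
		}
	}
}

// demoRetry simulates a KV op that fails `failures` times (no meta leader
// yet) and then succeeds; it returns how many attempts were made.
func demoRetry(failures int) (int, error) {
	calls := 0
	op := func(ctx context.Context) error {
		calls++
		if calls <= failures {
			return errors.New("no contact with the meta leader yet")
		}
		return nil
	}
	err := retryWithBudget(200*time.Millisecond, 20*time.Millisecond, 5*time.Millisecond, op)
	return calls, err
}

func main() {
	calls, err := demoRetry(2)
	fmt.Println("attempts:", calls, "error:", err)
}
```

The point of the short per-attempt context is that a request dropped before the meta group settles costs one short timeout instead of the whole budget.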
agent	caf005f04b	feat(web): frontend v1: login (handle + password), rooms sidebar + search, Element-style chat. React 19 + Vite + Mantine v9 SPA in dark mode (indigo accent), with mock data to iterate on the design before wiring the gateway. Login with identity + password (the password will unlock the encrypted Ed25519 identity on the device). Sidebar: user avatar, search (rooms/users/messages) and a room list with E2E padlock / cleartext hash / unread badges. Element-style chat panel (avatar + name + time + text) with an interactive composer.	2026-06-07 17:57:50 +02:00
agent	9787c218ac	chore: remove experimental frontends (web, android, playground, mobile) Cleanup of the trial frontends (React SPA, Kotlin app, gateway playground, gomobile binding) after the exploration phase. The bus (cmd/membershipd + pkg/*) stays intact and green. We are starting a new web frontend from scratch, built incrementally. Everything deleted remains in the git history in case something needs to be recovered.	2026-06-07 17:38:07 +02:00
egutierrez	926b8e96af	chore(0006): bump unibus to 0.8.0, close issue 0006 (cluster hardening + wiring) All seven phases (0006a–0006g) merged: blockers N3 (replicated nonce) and N2 ($JS.API.> KV leak) closed, decentralized KV store wired (--store kv), homogeneous cluster posture enforced (N1), RefreshSession in all clients (N4), the lows (secret out of argv, migrate guard, R1/CA docs), and the 3-node deploy material. Full suite + every audit-0008 attack regression green; govulncheck 0 reachable. See report 0009. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-07 17:33:03 +02:00
egutierrez	ae39e35fb4	Merge issue/0006g-deploy: cluster deploy material (magnus+homer+datardos, R3 HA)	2026-06-07 17:31:13 +02:00
egutierrez	48a3d6be33	docs(0006g): cluster deploy material for magnus+homer+datardos (R3 HA) Parameterized, NO-VPS-touched material to bring up unibus as a 3-node cluster. The authoring agent ran none of it on a host; every remote-changing step is marked HUMAN and deploy-cluster.sh defaults to a dry run. deploy/cluster/: - nodes.env — topology (cluster name, ports, per-node rows). Public IPs known (homer 141.94.69.66, datardos 51.91.100.142) pre-filled; magnus public IP and all WireGuard IPs are <PLACEHOLDER> for the human; scripts refuse to run while any remain. - generate-cluster-certs.sh — mints a SEPARATE cluster route CA + a route cert per node (server+clientAuth, mutual routes) and a data-plane server cert per node signed by the reused client CA (../tls/ca.); SAN = public + WG + hostname. - membershipd-cluster.service — one unit, parameterized per node via /opt/unibus/cluster.env: enforce + per-subject ACL + TLS + --store kv, --cluster-pass-file (secret out of argv), Restart=always. - deploy-cluster.sh — cross-build linux/amd64, generate each node's cluster.env (routes to the other two on the WG mesh, no userinfo), rsync + install (only with --yes); staggered start is manual. - README.md — runbook: prerequisites, loopback bootstrap to seed the first admin into the KV (works around the user-CLI/KV chicken-and-egg), staggered bring-up, verify posture+quorum, scale R1->R3 in place, and the chaos test (left to 0003f on the real VPS). - .gitignore — out/, build/, secrets/, .key never committed. bash -n passes on both scripts; go build/test unchanged. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-07 17:31:13 +02:00
egutierrez	24ff45ca7e	Merge issue/0006f-lows: cluster secret out of argv + migrate guard + docs (audit 0008 lows)	2026-06-07 17:24:46 +02:00
egutierrez	b8201a82cd	fix(0006f): cluster secret out of argv, migrate-to-kv TLS guard, R1/CA docs (audit 0008 lows) Low-severity cluster hardening from audit 0008: - Route secret out of argv (N1-low): --cluster-pass and a nats://user:pass@host in --routes are visible in ps/journald. New --cluster-pass-file and the UNIBUS_CLUSTER_PASS env var (precedence file > env > flag); the resolved secret guards the route layer and is injected into bare --routes entries (injectRouteCreds), so peers can be listed as nats://host:6250 with no secret in argv. The legacy --cluster-pass stays for dev/compat. - migrate-to-kv confidentiality (N6): refuse a remote --nats-url without --ca (the allowlist would travel cleartext); loopback targets are exempt (isLoopbackURL). - Docs (N1 route CA, N3 DoS): deploy/README gains a Clustering section — use a SEPARATE cluster CA for routes (not the client CA), keep the secret out of argv, run migrate-to-kv loopback/TLS only, and R1 is a SPOF of auth (not HA); R3 quorum is real HA. The generated cert material lives in deploy/cluster/ (0006g). Tests: - TestResolveClusterPass (file > env > flag precedence; missing file errors), - TestInjectRouteCreds (injects only into userinfo-less routes; preserves overrides), - TestIsLoopbackURL (loopback vs remote vs malformed). CGO_ENABLED=0 go build/vet/test green; govulncheck 0 reachable. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-07 17:24:46 +02:00
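The secret resolution (file > env > flag) and the injection into bare --routes entries described in the commit above can be sketched as follows. The function names come from the commit's test names, but the signatures, the trimming of the file contents, and the "cluster" user name are assumptions made for this example.

```go
package main

import (
	"fmt"
	"net/url"
	"os"
	"strings"
)

// resolveClusterPass picks the route secret with precedence file > env > flag.
// A file path that cannot be read is an error rather than a silent fallback.
func resolveClusterPass(file, env, flag string) (string, error) {
	if file != "" {
		b, err := os.ReadFile(file)
		if err != nil {
			return "", fmt.Errorf("read cluster pass file: %w", err)
		}
		return strings.TrimSpace(string(b)), nil // assumption: trailing newline is trimmed
	}
	if env != "" {
		return env, nil
	}
	return flag, nil
}

// injectRouteCreds adds user:pass only to routes that carry no userinfo,
// so an explicit per-route override is preserved.
func injectRouteCreds(route, user, pass string) (string, error) {
	u, err := url.Parse(route)
	if err != nil {
		return "", fmt.Errorf("parse route %q: %w", route, err)
	}
	if u.User != nil || pass == "" {
		return route, nil
	}
	u.User = url.UserPassword(user, pass)
	return u.String(), nil
}

func main() {
	pass, err := resolveClusterPass("", "from-env", "from-flag")
	if err != nil {
		panic(err)
	}
	// "cluster" is a hypothetical route user name.
	route, err := injectRouteCreds("nats://host:6250", "cluster", pass)
	if err != nil {
		panic(err)
	}
	fmt.Println(route)
}
```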
egutierrez	3a33656cac	Merge issue/0006e-refresh: RefreshSession in all clients (audit 0008 N4)	2026-06-07 17:21:14 +02:00
egutierrez	2f5b372a80	fix(0006e): call RefreshSession after membership changes in all clients (audit 0008 N4) A secured bus freezes per-subject permissions at connect time, so a peer that creates or joins a room after connecting cannot pub/sub on it until it reconnects (RefreshSession). No client called it, so under enforce+ACL the demos failed closed — pushing the operator to disable the ACL (a security regression at the operator's discretion). Wire the membership-change contract into every client: - cmd/worker: RefreshSession after CreateRoom, before publishing. - cmd/chat (simple): RefreshSession after CreateRoom+Join, before Subscribe. - cmd/chat (encrypted demo): A refreshes after CreateRoom; B refreshes after the invite+join, both before pub/sub. - local_files/bridge (gateway): RefreshSession after CreateRoom+Join, before Subscribe. - mobile: new Session.RefreshSession wrapper + the contract documented for callers. Contract (documented on the wrappers): after ANY membership change, call RefreshSession BEFORE pub/sub on the new room (it drops active subs, so it must precede Subscribe). On an unsecured/dev bus it is a harmless reconnect. Test: - TestClientCreateRoomRefreshPublishFlow: end-to-end under enforce+ACL, a peer creates a room, refreshes, invites a second peer who joins+refreshes+subscribes, and the publish is received — no manual intervention, the ACL stays on. CGO_ENABLED=0 go build/vet/test green; govulncheck 0 reachable. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-07 17:21:14 +02:00
egutierrez	32bec75665	Merge issue/0006d-posture: homogeneous cluster posture + /healthz posture (audit 0008 N1)	2026-06-07 17:17:37 +02:00
egutierrez	9b96537aa6	fix(0006d): enforce homogeneous cluster posture + publish posture on /healthz (audit 0008 N1) A cluster is only as secure as its weakest node: the data plane forwards every subject between nodes, so one node running without enforced auth lets an unauthenticated peer Subscribe(">") on it and harvest the traffic forwarded from the ACL'd nodes. - validateClusterConfig now takes the auth mode and REFUSES to join a cluster unless --bus-auth enforce, regardless of bind (a clustered node is a production node; there is no safe dev cluster without auth). This binary therefore cannot BE the weak node. - Server.Posture {enforce,acl,tls,cluster,store} is published on /healthz (non secret operational metadata, probe stays unauthenticated) so a monitor or peer can detect a cluster member not running enforce+ACL+TLS — covering a peer that runs a tampered/old binary outside this node's control. Tests: - TestAttack0008_N1: a clustered node with --bus-auth off is refused; the same node with enforce + full route security is allowed. - TestClusterConfigPolicy: extended with off/soft clustered cases (refused) and the mode parameter throughout. - TestHealthExposesPosture: /healthz returns the posture booleans + store backend. CGO_ENABLED=0 go build/vet/test green; govulncheck 0 reachable. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-07 17:17:37 +02:00
egutierrez	18ee7c469b	Merge issue/0006c-kv-store: wire decentralized control-plane KV store (--store kv)	2026-06-07 17:14:20 +02:00
egutierrez	e9ad719424	feat(0006c): wire the decentralized control-plane KV store (--store kv) 0003 built the JetStream KV store (jetstreamStore) but the binary never selected it: membership.Open (SQLite) was hardcoded and OpenJetStream was only reached by migrate-to-kv. This completes the wiring so a node actually serves its control plane from the replicated KV. - New flag --store kv|sqlite (default sqlite). kv opens the JetStream KV control plane over the privileged internal connection; sqlite is the unchanged baseline (branch-by-abstraction: the full suite's SQLite paths are untouched). - Bootstrap cycle resolved with storeHolder: the authenticator consults the holder (fail-closed until set), so it can be built before the KV store exists. The KV store opens after NATS is up and is published into the holder. The only client that can connect in that window is the internal identity, which bypasses the store by key. In SQLite mode the store is set before StartServer, so the window does not exist. - needJS now covers --store kv as well as --cluster-name; the JetStream client is shared by the KV store and the replicated nonce bucket. - feature_flags.json: decentralized wiring documented as complete, realized via --store kv (opt-in per deploy; default stays sqlite). Fail-closed preserved: jetstreamStore.IsAuthorized already denies on any backend error; the holder denies while unset. Tests: - TestStoreHolderFailClosed: empty holder denies; serves after set. - TestKVStoreBootstrapUnderEnforce: end-to-end decentralized boot: KV-seeded user authenticates over nkey under enforce; outsider denied. - TestKVStoreDecentralizedConsistency: a room/user created on one node's KV store is visible to another's (ends the per-node SQLite divergence, audit 0008 N5). CGO_ENABLED=0 go build/vet/test green; govulncheck 0 reachable. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-07 17:14:20 +02:00
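The storeHolder idea in the commit above (the authenticator is built first, consults the holder, and is denied everything until the KV store is published) can be illustrated with a small fail-closed wrapper. The Store interface, the mapStore type and the method names below are assumptions for this sketch; only the name storeHolder and its deny-while-unset behavior come from the commit.

```go
package main

import (
	"fmt"
	"sync"
)

// Store is a hypothetical minimal view of the membership store.
type Store interface {
	IsAuthorized(pubKey string) bool
}

// storeHolder lets the authenticator exist before the store does.
// While no store has been set, every lookup is denied (fail-closed).
type storeHolder struct {
	mu    sync.RWMutex
	store Store
}

// Set publishes the store once it has been opened.
func (h *storeHolder) Set(s Store) {
	h.mu.Lock()
	defer h.mu.Unlock()
	h.store = s
}

// IsAuthorized denies while the holder is empty, then delegates.
func (h *storeHolder) IsAuthorized(pubKey string) bool {
	h.mu.RLock()
	s := h.store
	h.mu.RUnlock()
	if s == nil {
		return false
	}
	return s.IsAuthorized(pubKey)
}

// mapStore is a toy allowlist used only for the demonstration.
type mapStore map[string]bool

func (m mapStore) IsAuthorized(pubKey string) bool { return m[pubKey] }

func main() {
	h := &storeHolder{}
	fmt.Println("before set:", h.IsAuthorized("alice"))
	h.Set(mapStore{"alice": true})
	fmt.Println("after set, alice:", h.IsAuthorized("alice"))
	fmt.Println("after set, eve:", h.IsAuthorized("eve"))
}
```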
egutierrez	d1e1a478f8	Merge issue/0006b-kv-acl: scope JetStream ACL per-room (audit 0008 N2)	2026-06-07 17:08:54 +02:00
egutierrez	cacf608fde	fix(0006b): scope JetStream ACL per-room, close $JS.API.> KV leak (audit 0008 N2) The client-infra grant was {"_INBOX.>", "$JS.API.>"}. The broad "$JS.API.>" let any registered peer drive the whole JetStream API and read the control-plane KV buckets (KV_UNIBUS_users/rooms/members/room_keys) and the object store directly over NATS, bypassing the HTTP authorization (requireMember + own-endpoint checks): a full leak of the allowlist, room graph and sealed-key metadata once the decentralized control plane is active. Fix: replace the broad grant with a CLOSED, per-room allow set. - clientInfraSubjects shrinks to {"_INBOX.>", "$JS.API.INFO"} ($JS.API.INFO is account counters only — no room/user/key contents). - SubjectACLFor now grants, per room the peer belongs to, the room subject plus the minimal JetStream API subjects of THAT room's stream (jsSubjectsFor: STREAM., CONSUMER., $JS.ACK scoped to UNIBUS_<roomID>). - Because KV_UNIBUS_* and OBJ_UNIBUS_* are never a room stream, they fall outside the closed allow set and are denied by default. Clients reach blobs over the HTTP control plane, not the NATS object store, so OBJ needs no client grant. roomStreamName mirrors pkg/client.streamName so the authorizer and the producer never drift. Tests: - TestAttack0008_N2: eve (registered, member of no room) cannot bind the KV users bucket nor subscribe $KV.UNIBUS_users.> (permissions violation); golden: the room owner can still drive her OWN room stream's JetStream API; edge: eve cannot reach a foreign room's stream. - TestReaudit_H4 residual note updated: the $JS.API.> leak it deferred is closed. CGO_ENABLED=0 go build/vet/test green; govulncheck 0 reachable. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-07 17:08:54 +02:00
egutierrez	a9c245d468	Merge issue/0006a-replicated-nonce: wire replicated nonce store (audit 0008 N3)	2026-06-07 17:02:19 +02:00
egutierrez	8b6a01d280	fix(0006a): wire replicated nonce store on clustered nodes (audit 0008 N3) membershipd never called Server.UseReplicatedNonces, so every node kept a per-process anti-replay cache and a signed request accepted on node A could be replayed to node B (200+200). This wires the shared JetStream KV nonce bucket on any clustered node, closing the cross-node replay hole. Bootstrap: under enforce the service needs JetStream on its own embedded server, but the data plane only accepts allowlisted clients. Resolved with an ephemeral internal service identity the authenticator recognizes and grants full permissions (NewNkeyAuthenticatorACLInternal), connected over the in-process transport (no TLS/CA needed for the self-connection). Hard rule: --cluster-name != "" means the replicated nonce bucket is mandatory; if it cannot be created the node refuses to start (wireReplicatedNonces returns a fatal error) rather than run insecurely. Standalone nodes keep the in-memory cache unchanged (branch-by-abstraction: no JetStream dependency added). Changes: - busauth: NewNkeyAuthenticatorACLInternal + fullPermissions for the internal id. - cmd/membershipd: connectInternalJS (in-process, privileged) / connectExternalJS; wireReplicatedNonces helper; main wires it when clustered; --kv-replicas flag. Tests (regression of audit 0008 N3): - TestAttack0008_N3: 2 clustered nodes share the bucket, cross-node replay -> 401. - TestAttack0008_N3_StandaloneKeepsLocalCache: standalone needs no JetStream, same-node replay still 401. - TestAttack0008_N3_ClusteredRequiresJetStream: clustered + no JetStream -> fatal. - TestInternalConnPrivilegedUnderEnforce / ...OutsiderRejected: the privileged self-connection works under enforce and no other identity can claim it. CGO_ENABLED=0 go build/vet/test green; govulncheck 0 reachable. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-07 17:02:19 +02:00
agent	5df99fa4c4	docs(issue): 0006 complete + harden cluster: KV wiring + audit 0008 N1-N6 + deploy material for magnus/homer/datardos	2026-06-07 16:48:07 +02:00
egutierrez	df3b62a601	Merge quick/0005-bump-close: unibus 0.7.0 + close issue 0005	2026-06-07 16:17:41 +02:00
egutierrez	6976537842	chore(0005): bump unibus to 0.7.0, close issue 0005 (hardening 2) Hardening 2 (issue 0005, phases 0005a-0005e) closes the new findings from the red-team re-audit (report 0006): nats-server + toolchain bump (16 CVEs -> 0 reachable), dropping unsigned frames in SignMsgs rooms, a global in-flight byte limiter against the concurrency DoS, mandatory TLS on a public bind, and wiring of the per-subject ACL that closes the wildcard metadata leak. Per-phase detail is in the capability growth log in app.md and in report 0007. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-07 16:17:41 +02:00
egutierrez	a4bbe8209b	Merge issue/0005e-acl-wire: wire per-subject ACL into membershipd (audit H4)	2026-06-07 16:15:52 +02:00
egutierrez	87ef52cc80	fix(0005e): wire per-subject ACL into membershipd (close H4 wildcard metadata leak) The per-subject data-plane ACL existed since 0003e (membership.SubjectACLFor + busauth.NewNkeyAuthenticatorACL, unit-tested in TestSubjectACLIsolation) but the binary never used it: cmd/membershipd installed the plain NewNkeyAuthenticator, so in production a registered NON-member could open a raw NATS connection, Subscribe(">"), and harvest every room's subject plus JetStream stream/advisory activity (payload stayed E2E ciphertext, metadata leaked): the re-audit's H4 vector (report 0006). Fix: - New busauth.PermissionsFromSubjects adapts a subject-deriving function into the PermissionsFunc the ACL authenticator expects (subjects granted as both the publish and subscribe allow set; a derivation error fails closed). It lives in busauth so membership stays free of the nats-server dependency. - cmd/membershipd, under enforce, now installs NewNkeyAuthenticatorACL(store.IsAuthorized, PermissionsFromSubjects(membership.SubjectACLFor(store))) so every connection is confined to the subjects of the rooms it belongs to plus the client-infra subjects. - pkg/membership/acl_test.go's helper now delegates to the production wiring (PermissionsFromSubjects) instead of a test-only reimplementation, so the tests exercise the real path. Verification (pkg/membership/acl_test.go): - TestReaudit_H4_WildcardMetadataLeak: a non-member's Subscribe(">") and any foreign-subject subscribe raise permission violations; the member still pub/subs her own room and the non-member captures nothing. With the plain authenticator (the pre-0005e wiring) the test fails ("wildcard metadata leak still open"), confirming the wiring is what closes it. - TestSubjectACLIsolation / TestRefreshSessionGainsNewRoom still green. - CGO_ENABLED=0 go build ./... && go vet ./... && go test -count=1 ./... green. Residual (documented): the client-infra grant includes "$JS.API.>", shared by all peers so per-connection JetStream works; a peer that subscribes specifically to "$JS.API.>" can still observe stream-management requests whose subjects embed the room-derived stream name. Fully closing that needs NATS accounts/permissions per identity (deferred to the 0003 decentralization line). Operational note: NATS freezes permissions at connect time, so clients must client.RefreshSession after a membership change to gain a new room's subject; cmd/chat and cmd/worker do not yet call it, a functional gap to close before an enforce+ACL deployment. Refs: report 0006 H4, issue 0005e. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-07 16:15:52 +02:00
egutierrez	a2ec78c81d	Merge issue/0005d-tls-guard: require TLS on public bind (audit N4)	2026-06-07 16:11:45 +02:00
egutierrez	d01da9d396	fix(0005d): require TLS on a public bind (close N4 plaintext control plane) The H2 guard refused "public bind without enforce" and "TLS flags without enforce", but it still ALLOWED a public bind with enforce and no --tls-cert: the control plane then served metadata (subjects, pubkeys, sealed keys, the social graph) over plaintext HTTP publicly, so audit H5 reappeared as the N4 gap (TLS was a capability, not a requirement; report 0006). Fix: validateBootConfig now also refuses a non-loopback --bind unless both --tls-cert and --tls-key are set. Public deployments must serve HTTPS; loopback dev is unaffected (no TLS still allowed there). Verification (cmd/membershipd/config_test.go): - TestGap_PublicEnforceNoTLS: validateBootConfig("0.0.0.0", enforce, "", "") now returns an error mentioning --tls-cert (golden public+enforce+TLS allowed; edge loopback-without-TLS still allowed). - TestBootConfigPolicy table updated: public+enforce+notls / +certonly / +keyonly and lan-ip+enforce+notls are now refused; public+enforce+tls and loopback+enforce+tls allowed. - CGO_ENABLED=0 go build ./... && go vet ./... && go test -count=1 ./... green. Refs: report 0006 N4, issue 0005d. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-07 16:11:45 +02:00
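The boot policy described in the commit above (no public bind without enforce, no TLS flags without enforce, and now no public bind without both --tls-cert and --tls-key) could be expressed roughly as below. The argument order follows the call shown in the commit, validateBootConfig("0.0.0.0", enforce, "", ""); representing the auth mode as a string and the loopback detection are assumptions of this sketch.

```go
package main

import (
	"errors"
	"fmt"
	"net"
)

// isLoopback reports whether bind is a loopback address.
// Assumption: "localhost" and 127.0.0.0/8 or ::1 count as loopback.
func isLoopback(bind string) bool {
	if bind == "localhost" {
		return true
	}
	ip := net.ParseIP(bind)
	return ip != nil && ip.IsLoopback()
}

// validateBootConfig refuses insecure combinations of bind, auth mode and TLS.
func validateBootConfig(bind, authMode, tlsCert, tlsKey string) error {
	public := !isLoopback(bind)
	enforce := authMode == "enforce"
	anyTLSFlag := tlsCert != "" || tlsKey != ""

	if public && !enforce {
		return errors.New("public bind requires --bus-auth enforce")
	}
	if anyTLSFlag && !enforce {
		return errors.New("TLS flags require --bus-auth enforce")
	}
	if public && (tlsCert == "" || tlsKey == "") {
		return errors.New("public bind requires both --tls-cert and --tls-key")
	}
	return nil // loopback dev without TLS remains allowed
}

func main() {
	fmt.Println(validateBootConfig("0.0.0.0", "enforce", "", ""))
	fmt.Println(validateBootConfig("0.0.0.0", "enforce", "cert.pem", "key.pem"))
	fmt.Println(validateBootConfig("127.0.0.1", "off", "", ""))
}
```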
egutierrez	db8618ddc3	Merge issue/0005c-inflight: global in-flight byte limiter bounds aggregate memory (audit N2)	2026-06-07 16:09:58 +02:00
egutierrez	e7d59fd01d	fix(0005c): bound aggregate buffered memory with a global in-flight byte limiter The H1 fix bounds each request (1 MiB control / 16 MiB blob) and the per-IP rate limiter throttles a single source, but neither bounds the AGGREGATE memory across concurrent requests. The re-audit (report 0006, N2) drove RSS to ~1.42 GB with 40 concurrent 16 MiB uploads, and noted that a multi-IP (botnet) flood scales without a ceiling because the rate limit is per-IP. Fix: a global, non-blocking, byte-counting limiter (pkg/membership/inflight.go). ServeHTTP reserves a POST's worst-case buffered size (its route ceiling) from the limiter before reading the body, and releases it when the request finishes. When the global cap (maxInflightBytes = 128 MiB) is reached, further POSTs are shed with 503 (backpressure) rather than parking goroutines, so total bytes buffered in flight stays bounded regardless of connection count or source-IP spread. GETs carry no body and do not consume the budget. The limiter is implemented inside unibus (not delegated to the fn-registry, where a generic concurrency primitive would normally live) because functions/core pulls transitive deps requiring CGO (mattn/go-sqlite3) and external modules that are incompatible with unibus's CGO_ENABLED=0 build, and because this work is scoped to the unibus sub-repo. The type/method comments document this. Verification: - pkg/membership/inflight_test.go: TestInflightLimiter{Basics,Disabled,Concurrent} cover golden/edge/error/disabled/over-release and a -race concurrency invariant (inFlight returns to 0, never exceeds cap). - pkg/membership/dos_concurrency_test.go: TestReaudit_DoSConcurrency fires 40 concurrent 16 MiB uploads from distinct IPs (the multi-IP shape) against a 48 MiB test cap -> 200=3 503=37, RSS delta ~93 MiB (bound 256 MiB), inFlight()==0, and a fresh upload still 200. With the limiter disabled the test fails (200=40 503=0), confirming it is a real regression guard. 
- CGO_ENABLED=0 go build ./... && go vet ./... && go test -count=1 ./... green; CGO_ENABLED=1 go test -race ./pkg/membership/ green. Residual (documented): under enforce the body is buffered twice (auth verify + handler), so real RSS is ~2x the reserved bytes; closing that fully means streaming blobs to disk (overlaps H9 / issue 0002). Refs: report 0006 N2, issue 0005c. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-07 16:09:58 +02:00
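A minimal sketch of the non-blocking, byte-counting limiter described in the 0005c commit above. The type and method names (`inflightLimiter`, `tryReserve`, `release`) and the "max <= 0 disables" convention are assumptions made for illustration; the actual code in pkg/membership/inflight.go may differ.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// inflightLimiter is a hypothetical, simplified byte-counting limiter.
// max <= 0 disables it (an assumption made for this sketch).
type inflightLimiter struct {
	max      int64
	inFlight atomic.Int64
}

func newInflightLimiter(max int64) *inflightLimiter {
	return &inflightLimiter{max: max}
}

// tryReserve claims n bytes without blocking. A false result means the
// caller should shed the request (503) instead of parking a goroutine.
func (l *inflightLimiter) tryReserve(n int64) bool {
	if l.max <= 0 {
		return true
	}
	for {
		cur := l.inFlight.Load()
		if cur+n > l.max {
			return false
		}
		if l.inFlight.CompareAndSwap(cur, cur+n) {
			return true
		}
	}
}

// release returns n bytes to the budget, clamping at zero on over-release.
func (l *inflightLimiter) release(n int64) {
	if l.max <= 0 {
		return
	}
	for {
		cur := l.inFlight.Load()
		next := cur - n
		if next < 0 {
			next = 0
		}
		if l.inFlight.CompareAndSwap(cur, next) {
			return
		}
	}
}

func main() {
	const blob = 16 << 20 // worst-case size reserved per upload
	l := newInflightLimiter(48 << 20)
	accepted, shed := 0, 0
	for i := 0; i < 40; i++ {
		if l.tryReserve(blob) {
			accepted++
		} else {
			shed++
		}
	}
	fmt.Printf("accepted=%d shed=%d\n", accepted, shed) // accepted=3 shed=37
	for i := 0; i < accepted; i++ {
		l.release(blob)
	}
	fmt.Println("inFlight after release:", l.inFlight.Load()) // 0
}
```

With a 48 MiB cap and 16 MiB reservations, only three requests fit at once, which matches the 200=3 / 503=37 split the commit reports for its test cap.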
egutierrez	0f79708338	Merge issue/0005b-sig-nil: drop unsigned frames in SignMsgs rooms (audit N3)	2026-06-07 15:58:10 +02:00
egutierrez	ef3af6dfd1	fix(0005b): drop unsigned frames in SignMsgs rooms (close sig-nil spoof) client.processFrame verified a frame's signature only when one was present (`info.Policy.SignMsgs && f.Sig != nil`). In a room whose policy REQUIRES per-message signatures, an attacker with data-plane access could publish a raw frame with Sig==nil and a forged Sender, and the receiver accepted it as authentic because the verification block was skipped (audit N3, report 0006). On a signed-but-cleartext room any peer that knows the subject could thus impersonate any sender. Fix: in a SignMsgs room a missing signature is itself a rejection. processFrame now drops any frame with Sig==nil before attempting verification: if info.Policy.SignMsgs { if f.Sig == nil { return } // signature required but absent: drop // verify ... } Non-signed rooms (ModeNATS) are unaffected: unsigned frames there are still delivered, so the plain-NATS path is unchanged. Verification (pkg/client/sig_nil_spoof_test.go, TestReaudit_SigNilSpoof): - golden: a properly signed frame from a member is delivered. - error : an unsigned frame with a forged Sender in a SignMsgs room is dropped (the test fails with "SIG-NIL SPOOF: receiver accepted ..." when the fix is reverted, confirming it is a real regression guard). - edge : a non-signed room still delivers an unsigned frame. - CGO_ENABLED=0 go build ./... && go vet ./... && go test -count=1 ./... green. Refs: report 0006 N3, issue 0005b. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-07 15:58:10 +02:00
egutierrez	88b47912bd	Merge issue/0005a-cve-bump: bump nats-server to v2.11.15 + go1.26.4 (16 CVEs -> 0 reachable)	2026-06-07 15:55:32 +02:00
egutierrez	a3ac58fb70	fix(0005a): bump nats-server v2.10.22->v2.11.15 + toolchain go1.26.4 (close 16 CVEs) govulncheck reported 16 reachable vulnerabilities (re-audit finding N1, report 0006): 14 in github.com/nats-io/nats-server/v2@v2.10.22 -- the embedded NATS server, which is exposed to the internet in the chosen deployment -- and 2 in the Go standard library (GO-2026-5039 net/textproto, GO-2026-5037 crypto/x509). Changes: - go get github.com/nats-io/nats-server/v2@v2.11.15 (covers all 14 server CVEs; pulls nats.go v1.49.0, nkeys v0.4.15, jwt v2.8.1, klauspost/compress v1.18.4 and friends transitively). - go directive 1.25.0 -> 1.26.4 so the toolchain ships the two stdlib fixes. This is a go.mod/go.sum change justified purely by CVE remediation; it is the explicit exception to the "do not touch deps" rule for a CVE bump. Verification: - CGO_ENABLED=0 go build ./... && go vet ./... && go test -count=1 ./... -> green, including the 0003 multi-node cluster/JetStream e2e in pkg/embeddednats, so the server bump did not break the cluster or the durable plane. - govulncheck ./... -> "No vulnerabilities found" (0 reachable; the 13 that remain are in required-but-not-called modules). Refs: report 0006 N1, issue 0005a. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-07 15:55:32 +02:00
agent	fb0291ad8a	docs(issue): 0005 hardening 2: CVEs, sig-nil spoof, DoS concurrency, forced TLS (re-audit 0006)	2026-06-07 15:48:49 +02:00
agent	d821bc1794	chore(0003): bump unibus to 0.6.0 — decentralization / HA (0003a-0003e) Cluster NATS routes (auth + mutual TLS), Store/blobstore interfaces with replicated JetStream KV and Object Store backends, idempotent migrate-to-kv with backup, client failover over seed/control-plane lists, replicated nonce store (closes the multi-node replay hole), and the per-subject membership ACL (audit H4 residual). All behind the `decentralized` flag (off); single-node SQLite+disk behavior unchanged. The multi-node deploy (0003f) is the human's; runbook in report 0006.	2026-06-07 15:31:14 +02:00
egutierrez	da420513b6	Merge issue/0003e-client-failover: client failover + replicated nonce store + subject ACL (H4)	2026-06-07 15:27:45 +02:00
agent	96abb75a2e	feat(0003e/3): per-subject data-plane ACL from room membership (audit H4) Closes the residual the 0004 hardening deferred: the NATS authenticator can now confine a registered peer to the subjects of the rooms it belongs to, instead of letting any registered identity sub/pub on any subject. The dynamic-membership reconnection model the audit named is provided by client.RefreshSession. pkg/busauth: - verifyNkey factors out the shared nkey verification. - NewNkeyAuthenticatorACL + PermissionsFunc: an authenticator that, after authorizing, derives and RegisterUser()s per-subject permissions. A derivation error denies the connection (fail closed). pkg/membership: - SubjectACLFor(store) maps a signing pubkey to the subjects it may use: the subject of every room it belongs to, plus the client infrastructure subjects (_INBOX.>, $JS.API.> for request/reply and the persisted plane). pkg/client: - RefreshSession() rebuilds the data-plane connection so the authenticator re-derives permissions after a membership change (NATS freezes permissions at connect time). It retains the seeds/options to reconnect; active subscriptions are dropped and must be re-made (documented). Tests (DoD: isolation + refresh): - TestSubjectACLIsolation: alice (member of room.A) may sub/pub room.A but is DENIED sub and pub on room.B (permissions violation), and never reads bob's room.B traffic; bob never receives alice's cross-room publish. - TestRefreshSessionGainsNewRoom: alice has no permission for room B until she is added and calls RefreshSession; the reconnect grants the subject and she then receives room B traffic. Scope note: the per-subject ACL authenticator is opt-in (NewServer/ membershipd keep the open authenticator by default) and is wired in with the decentralized boot path; auto-RefreshSession on every membership change (fully transparent) remains for 0003f. Master behavior unchanged.	2026-06-07 15:27:45 +02:00
agent	37c778ca9a	feat(0003e/2): replicated anti-replay nonce store on JetStream KV The per-process nonce cache breaks anti-replay under multi-node failover (audit 0004): a request captured on one node can be replayed to a DIFFERENT node whose local cache never saw the nonce, and is accepted. This makes the nonce state shared so a replay is rejected cluster-wide. pkg/membership: - nonceStore is now an interface. The in-memory cache is renamed memNonceCache (still the default, single-node behavior). - kvNonceStore (new) claims each nonce with an atomic KV Create on a shared bucket: first sight wins (accept), any later sight on any node rejects (replay). A backend error fails CLOSED (reject), so a KV outage never silently disables anti-replay. The bucket carries a TTL = nonceTTL (2*clockSkew) so a key expires exactly when its replay window closes; raw base64 nonces are mapped to KV-safe keys via sha256-hex. - Server.UseReplicatedNonces(js, replicas) swaps the store on a node; every node in a cluster calls it. NewServer still defaults to the in-memory cache (master behavior unchanged). Test (DoD error path — the issue's cross-node replay case): - TestReplicatedNonceRejectsCrossNodeReplay: two membershipd nodes share one KV bucket; a request accepted (200) on node A, replayed with the same ts+nonce to node B, is rejected (401) — and replaying to A again is rejected too.	2026-06-07 15:21:45 +02:00
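The "first sight wins, fail closed" rule from the 0003e/2 commit above, sketched with an in-memory map standing in for the shared JetStream KV bucket. `sharedStore`, `Claim`, `kvKey` and `acceptNonce` are hypothetical names, and the TTL on the bucket is left out for brevity.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"errors"
	"fmt"
	"sync"
)

// nonceStore mirrors the idea of a pluggable store; the method name is an assumption.
type nonceStore interface {
	Claim(nonce string) (first bool, err error)
}

// sharedStore stands in for the replicated bucket that every node uses.
type sharedStore struct {
	mu   sync.Mutex
	seen map[string]struct{}
	down bool // simulates a backend outage
}

// kvKey maps a raw base64 nonce to a KV-safe key via sha256-hex.
func kvKey(nonce string) string {
	sum := sha256.Sum256([]byte(nonce))
	return hex.EncodeToString(sum[:])
}

// Claim behaves like an atomic create: only the first caller gets true.
func (s *sharedStore) Claim(nonce string) (bool, error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.down {
		return false, errors.New("backend unavailable")
	}
	k := kvKey(nonce)
	if _, dup := s.seen[k]; dup {
		return false, nil
	}
	s.seen[k] = struct{}{}
	return true, nil
}

// acceptNonce fails closed: a backend error is treated as a replay.
func acceptNonce(s nonceStore, nonce string) bool {
	first, err := s.Claim(nonce)
	return err == nil && first
}

func main() {
	shared := &sharedStore{seen: map[string]struct{}{}}
	nodeA, nodeB := nonceStore(shared), nonceStore(shared)

	fmt.Println(acceptNonce(nodeA, "bm9uY2U=")) // true: first sight
	fmt.Println(acceptNonce(nodeB, "bm9uY2U=")) // false: cross-node replay
	shared.down = true
	fmt.Println(acceptNonce(nodeA, "ZnJlc2g=")) // false: outage fails closed
}
```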
agent	c6ad63059f	feat(0003e/1): client failover over a list of seeds and control planes The client (issue 0003e, part 1) accepts a LIST of NATS seeds and a LIST of control-plane URLs so a node loss is transparent. pkg/client: - Options.NatsServers: extra NATS seeds beyond the primary. The client connects to the joined seed list with MaxReconnects(-1) + RetryOnFailedConnect, so nats.go fails over to a surviving node when the one a client is attached to dies and rejoins a node that comes back. - Options.CtrlURLs: extra control-plane endpoints. doJSON/putBlob/getBlob now try each endpoint in order, falling over on a transport error to the next (an HTTP response from any node is authoritative — every node serves the same state under the KV store). newSignedRequest becomes newSignedRequestTo(base, ...); each failover attempt mints a fresh nonce (the signature covers method+path+ts+nonce+body, not the host), so a retried request is never seen as a replay. - ConnectedServer()/IsConnected(): observability for which node the data plane is attached to, for ops and failover tests. - New/Connect/NewWithOptions keep their signatures (a single URL = a one-element list), so worker/chat/mobile/playground are unchanged. Test (DoD edge — the issue's "kill node A" case): - TestClientFailoverAcrossNodes: A seeds two clustered nodes, subscribes, receives a cross-node message; the node A is attached to is KILLED; A reconnects to the survivor and still receives messages — session intact.	2026-06-07 15:18:18 +02:00

128 Commits