unibus

Author	SHA1	Message	Date
Egutierrez	1c9325104c	feat(embeddednats): UNIBUS_NATS_MONITOR flag decoupled from debug log Add a dedicated UNIBUS_NATS_MONITOR=1 toggle that opens the embedded nats-server monitoring HTTP endpoint (127.0.0.1:8222, loopback only) so a local metrics scraper can read /varz, /connz and /jsz for server-level metrics (msgs/s, connections, KV bucket msgs, RAFT leader per stream, restarts). Previously the monitoring endpoint was only reachable via UNIBUS_NATS_DEBUG=1, which is coupled to the verbose nats-server debug log: enabling the endpoint also wrote routes/RAFT/room subjects to journald in clear, which regresses the hardened posture (issue 0007). The two concerns are now decoupled. The toggle computation is extracted to a pure function natsLogOpts(debugEnv, monitorEnv) (noLog, debug, trace, monitor): MONITOR=1 opens the endpoint while keeping the log quiet (NoLog true / Debug false). The inverse coupling is preserved for backward compatibility (DEBUG still implies MONITOR). The 127.0.0.1 bind stays hardcoded — the monitoring endpoint has no auth and must never be reachable from the network. Deploy wiring versioned: additive systemd drop-in membershipd-cluster.service.d/nats-monitor.conf (Environment=UNIBUS_NATS_MONITOR=1) plus a "NATS server metrics" section in the cluster README with the rolling activation runbook (magnus -> homer -> datardos) gated on R3 reconvergence (followers 2/2) between nodes. Tests: pure decoupling table (monitor on => log NOT debug; debug => monitor; default closed) + a real embedded server with MONITOR=1 asserting /varz answers 200 on loopback:8222, and a server without the flag with the endpoint closed. 100% additive: behavior is identical without the flag. Bump app.md 0.10.0 -> 0.11.0. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-07 20:57:46 +02:00
egutierrez	ce72131ddf	docs(cluster): correct runbook + wire --internal-id-file into deploy Corrections learned from the real 0011 deploy: - Bring up: the "start magnus alone and verify healthz" order deadlocks — a lone node of a 3-node cluster has no meta-group quorum and never serves healthz until a second node joins. Document a quorum-forming start and that a node never self-serves. - Replication: R1 is an unusable SPOF (all six control-plane buckets on one node) and the cold start only converges with the three cold-start fixes; go straight to R3 once the cluster forms. - Add a "user add --store kv" section: the live user-add path that replaces stop-seed-restart, with its security model and idempotency/HA/no-delete semantics. - Topology: real IPs, ROUTE_NETWORK=public (no WireGuard mesh exists). - Chaos test: mark the data-plane client + failover proofs as validated (0012). Deploy machinery now emits the persisted internal identity: the unit gains --internal-id-file ${INTERNAL_ID_FILE} and deploy-cluster.sh writes INTERNAL_ID_FILE into each node's cluster.env, so a fresh deploy enables the live user-add path on every node. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-07 19:41:56 +02:00
egutierrez	9fbff79df4	chore(deploy): fill cluster nodes.env with the real 3-node topology Set magnus's public IP (135.125.201.30) and switch ROUTE_NETWORK to "public": the three nodes have no WireGuard mesh (homer/datardos do not even have wg installed), so server-to-server routes go over the public IPs, still protected by the separate cluster route CA (mutual TLS). KV_REPLICAS is raised to 3 now that the cluster runs at R3. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-07 18:56:28 +02:00
egutierrez	48a3d6be33	docs(0006g): cluster deploy material for magnus+homer+datardos (R3 HA) Parameterized, NO-VPS-touched material to bring up unibus as a 3-node cluster. The authoring agent ran none of it on a host; every remote-changing step is marked HUMAN and deploy-cluster.sh defaults to a dry run. deploy/cluster/: - nodes.env — topology (cluster name, ports, per-node rows). Public IPs known (homer 141.94.69.66, datardos 51.91.100.142) pre-filled; magnus public IP and all WireGuard IPs are <PLACEHOLDER> for the human; scripts refuse to run while any remain. - generate-cluster-certs.sh — mints a SEPARATE cluster route CA + a route cert per node (server+clientAuth, mutual routes) and a data-plane server cert per node signed by the reused client CA (../tls/ca.); SAN = public + WG + hostname. - membershipd-cluster.service — one unit, parameterized per node via /opt/unibus/cluster.env: enforce + per-subject ACL + TLS + --store kv, --cluster-pass-file (secret out of argv), Restart=always. - deploy-cluster.sh — cross-build linux/amd64, generate each node's cluster.env (routes to the other two on the WG mesh, no userinfo), rsync + install (only with --yes); staggered start is manual. - README.md — runbook: prerequisites, loopback bootstrap to seed the first admin into the KV (works around the user-CLI/KV chicken-and-egg), staggered bring-up, verify posture+quorum, scale R1->R3 in place, and the chaos test (left to 0003f on the real VPS). - .gitignore — out/, build/, secrets/, .key never committed. bash -n passes on both scripts; go build/test unchanged. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-07 17:31:13 +02:00
egutierrez	b8201a82cd	fix(0006f): cluster secret out of argv, migrate-to-kv TLS guard, R1/CA docs (audit 0008 lows) Low-severity cluster hardening from audit 0008: - Route secret out of argv (N1-low): --cluster-pass and a nats://user:pass@host in --routes are visible in ps/journald. New --cluster-pass-file and the UNIBUS_CLUSTER_PASS env var (precedence file > env > flag); the resolved secret guards the route layer and is injected into bare --routes entries (injectRouteCreds), so peers can be listed as nats://host:6250 with no secret in argv. The legacy --cluster-pass stays for dev/compat. - migrate-to-kv confidentiality (N6): refuse a remote --nats-url without --ca (the allowlist would travel cleartext); loopback targets are exempt (isLoopbackURL). - Docs (N1 route CA, N3 DoS): deploy/README gains a Clustering section — use a SEPARATE cluster CA for routes (not the client CA), keep the secret out of argv, run migrate-to-kv loopback/TLS only, and R1 is a SPOF of auth (not HA); R3 quorum is real HA. The generated cert material lives in deploy/cluster/ (0006g). Tests: - TestResolveClusterPass (file > env > flag precedence; missing file errors), - TestInjectRouteCreds (injects only into userinfo-less routes; preserves overrides), - TestIsLoopbackURL (loopback vs remote vs malformed). CGO_ENABLED=0 go build/vet/test green; govulncheck 0 reachable. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-07 17:24:46 +02:00
egutierrez	1b56f14c20	feat(deploy/tls): self-signed CA + server cert generator generate-certs.sh mints the bus CA and a NATS server certificate whose SANs cover the public IP (135.125.201.30), the WireGuard IP (10.42.0.1), the om hostname, and localhost/127.0.0.1 for on-host smoke tests (all overridable via env). Only the public ca.crt is committed; ca.key, server.key and server.crt are gitignored and distributed out of band. README documents generation, use and rotation. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-07 12:44:13 +02:00
egutierrez	f6b53620e9	feat(deploy): systemd user unit + install script for membershipd Add deploy/unibus-membershipd.service (Restart=always, binds both planes to 0.0.0.0 for LAN reachability), an idempotent deploy/install.sh that builds the binary, symlinks the unit, and enables+starts it, plus deploy/README.md with operate/health instructions. Restart=always is deliberate: a clean SIGTERM exits 0 and Restart=on-failure would not restart it, leaving the service silently dead (the sqlite_api gotcha). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-06 18:05:53 +02:00

7 Commits