Add a dedicated UNIBUS_NATS_MONITOR=1 toggle that opens the embedded
nats-server monitoring HTTP endpoint (127.0.0.1:8222, loopback only) so a
local metrics scraper can read /varz, /connz and /jsz for server-level
metrics (msgs/s, connections, KV bucket msgs, RAFT leader per stream,
restarts).
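A local scraper for this endpoint could look roughly like the sketch below. It is a minimal illustration, not the project's scraper: the type and function names (varzSample, parseVarz, fetchVarz, ratePerSec) are hypothetical, and only three well-known /varz counters are decoded. msgs/s is derived from the difference between two samples.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"time"
)

// varzSample holds a small subset of the counters exposed by /varz.
type varzSample struct {
	Connections int   `json:"connections"`
	InMsgs      int64 `json:"in_msgs"`
	OutMsgs     int64 `json:"out_msgs"`
}

// parseVarz decodes the subset of /varz fields this sketch cares about.
func parseVarz(body []byte) (varzSample, error) {
	var v varzSample
	err := json.Unmarshal(body, &v)
	return v, err
}

// ratePerSec turns two cumulative counter readings into a per-second rate.
func ratePerSec(prev, cur int64, seconds float64) float64 {
	if seconds <= 0 {
		return 0
	}
	return float64(cur-prev) / seconds
}

// fetchVarz reads /varz from the monitoring endpoint rooted at base.
func fetchVarz(base string) (varzSample, error) {
	client := &http.Client{Timeout: 2 * time.Second}
	resp, err := client.Get(base + "/varz")
	if err != nil {
		return varzSample{}, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return varzSample{}, fmt.Errorf("varz: unexpected status %d", resp.StatusCode)
	}
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return varzSample{}, err
	}
	return parseVarz(body)
}

func main() {
	// Loopback only: the endpoint is closed unless the toggle is set.
	v, err := fetchVarz("http://127.0.0.1:8222")
	if err != nil {
		fmt.Println("monitoring endpoint not reachable:", err)
		return
	}
	fmt.Printf("connections=%d in_msgs=%d out_msgs=%d\n", v.Connections, v.InMsgs, v.OutMsgs)
}
```

Without UNIBUS_NATS_MONITOR=1 (or DEBUG=1) the fetch is expected to fail with a connection error, which the sketch reports rather than treating as fatal.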
Previously the monitoring endpoint was only reachable via UNIBUS_NATS_DEBUG=1,
which is coupled to the verbose nats-server debug log: enabling the endpoint
also wrote routes/RAFT/room subjects to journald in cleartext, regressing the
hardened posture (issue 0007). The two concerns are now decoupled.
The toggle computation is extracted to a pure function
natsLogOpts(debugEnv, monitorEnv) (noLog, debug, trace, monitor): MONITOR=1
opens the endpoint while keeping the log quiet (NoLog true / Debug false). The
inverse coupling is preserved for backward compatibility (DEBUG still implies
MONITOR). The 127.0.0.1 bind stays hardcoded — the monitoring endpoint has no
auth and must never be reachable from the network.
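The decoupling described above can be sketched as a pure function. This is an illustration reconstructed from the commit text, not the committed code: the string parameter types, the "1" comparison, and the choice that trace mirrors debug are assumptions.

```go
package main

import "fmt"

// natsLogOpts maps the two environment toggles to server options.
// Assumptions: a toggle is "on" only when its value is exactly "1",
// and trace follows debug (the commit text does not say how trace is set).
func natsLogOpts(debugEnv, monitorEnv string) (noLog, debug, trace, monitor bool) {
	debug = debugEnv == "1"
	trace = debug
	noLog = !debug
	// DEBUG still implies MONITOR (backward compatibility);
	// MONITOR alone opens the endpoint and keeps the log quiet.
	monitor = debug || monitorEnv == "1"
	return noLog, debug, trace, monitor
}

func main() {
	cases := [][2]string{{"", ""}, {"", "1"}, {"1", ""}, {"1", "1"}}
	for _, c := range cases {
		noLog, debug, trace, monitor := natsLogOpts(c[0], c[1])
		fmt.Printf("DEBUG=%q MONITOR=%q -> noLog=%t debug=%t trace=%t monitor=%t\n",
			c[0], c[1], noLog, debug, trace, monitor)
	}
}
```

Keeping the function free of os.Getenv calls is what makes the decoupling table testable without touching the process environment.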
Deploy wiring is versioned: an additive systemd drop-in
membershipd-cluster.service.d/nats-monitor.conf (Environment=UNIBUS_NATS_MONITOR=1)
plus a "NATS server metrics" section in the cluster README with the rolling
activation runbook (magnus -> homer -> datardos), gated on R3 reconvergence
(followers 2/2) between nodes.
Tests: pure decoupling table (monitor on => log NOT debug; debug => monitor;
default closed) + a real embedded server with MONITOR=1 asserting /varz answers
200 on loopback:8222, and a server without the flag with the endpoint closed.
100% additive: behavior is identical without the flag. Bump app.md 0.10.0 ->
0.11.0.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Corrections learned from the real 0011 deploy:
- Bring up: the "start magnus alone and verify healthz" order deadlocks — a
lone node of a 3-node cluster has no meta-group quorum and never serves
healthz until a second node joins. Document a quorum-forming start and that
a node never self-serves.
- Replication: R1 is an unusable SPOF (all six control-plane buckets on one
node) and the cold start only converges with the three cold-start fixes;
go straight to R3 once the cluster forms.
- Add a "user add --store kv" section: the live user-add path that replaces
stop-seed-restart, with its security model and idempotency/HA/no-delete
semantics.
- Topology: real IPs, ROUTE_NETWORK=public (no WireGuard mesh exists).
- Chaos test: mark the data-plane client + failover proofs as validated (0012).
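The bring-up deadlock in the first correction follows from majority arithmetic. The sketch below uses the standard majority rule (more than half of the configured nodes); the helper names are hypothetical.

```go
package main

import "fmt"

// quorum returns the majority size for a cluster of n configured nodes.
func quorum(n int) int {
	return n/2 + 1
}

// hasQuorum reports whether `up` running nodes can form a majority of n.
func hasQuorum(up, n int) bool {
	return up >= quorum(n)
}

func main() {
	const clusterSize = 3
	for up := 1; up <= clusterSize; up++ {
		fmt.Printf("%d/%d nodes up: quorum=%t\n", up, clusterSize, hasQuorum(up, clusterSize))
	}
}
```

With three configured nodes the majority is two, so a single started node can never elect a meta-group leader on its own; a health check on that node has to wait for a second node to join.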
Deploy machinery now emits the persisted internal identity: the unit gains
--internal-id-file ${INTERNAL_ID_FILE} and deploy-cluster.sh writes
INTERNAL_ID_FILE into each node's cluster.env, so a fresh deploy enables the
live user-add path on every node.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Set magnus's public IP (135.125.201.30) and switch ROUTE_NETWORK to "public":
the three nodes have no WireGuard mesh (homer/datardos do not even have wg
installed), so server-to-server routes go over the public IPs, still protected
by the separate cluster route CA (mutual TLS). KV_REPLICAS is raised to 3 now
that the cluster runs at R3.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Parameterized material to bring up unibus as a 3-node cluster; no VPS was
touched. The authoring agent ran none of it on a host; every remote-changing
step is marked HUMAN and deploy-cluster.sh defaults to a dry run.
deploy/cluster/:
- nodes.env — topology (cluster name, ports, per-node rows). Public IPs known
(homer 141.94.69.66, datardos 51.91.100.142) pre-filled; magnus public IP and
all WireGuard IPs are <PLACEHOLDER> for the human; scripts refuse to run while
any remain.
- generate-cluster-certs.sh — mints a SEPARATE cluster route CA + a route cert per
node (server+clientAuth, mutual routes) and a data-plane server cert per node
signed by the reused client CA (../tls/ca.*); SAN = public + WG + hostname.
- membershipd-cluster.service — one unit, parameterized per node via
/opt/unibus/cluster.env: enforce + per-subject ACL + TLS + --store kv,
--cluster-pass-file (secret out of argv), Restart=always.
- deploy-cluster.sh — cross-build linux/amd64, generate each node's cluster.env
(routes to the other two on the WG mesh, no userinfo), rsync + install (only
with --yes); staggered start is manual.
- README.md — runbook: prerequisites, loopback bootstrap to seed the first admin
into the KV (works around the user-CLI/KV chicken-and-egg), staggered bring-up,
verify posture+quorum, scale R1->R3 in place, and the chaos test (left to 0003f
on the real VPS).
- .gitignore — out/, build/, secrets/, *.key never committed.
bash -n passes on both scripts; go build/test unchanged.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Low-severity cluster hardening from audit 0008:
- Route secret out of argv (N1-low): --cluster-pass and a nats://user:pass@host in
--routes are visible in ps/journald. New --cluster-pass-file and the
UNIBUS_CLUSTER_PASS env var (precedence file > env > flag); the resolved secret
guards the route layer and is injected into bare --routes entries
(injectRouteCreds), so peers can be listed as nats://host:6250 with no secret in
argv. The legacy --cluster-pass stays for dev/compat.
- migrate-to-kv confidentiality (N6): refuse a remote --nats-url without --ca (the
allowlist would travel cleartext); loopback targets are exempt (isLoopbackURL).
- Docs (N1 route CA, N3 DoS): deploy/README gains a Clustering section — use a
SEPARATE cluster CA for routes (not the client CA), keep the secret out of argv,
run migrate-to-kv loopback/TLS only, and R1 is a SPOF of auth (not HA); R3
quorum is real HA. The generated cert material lives in deploy/cluster/ (0006g).
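The precedence and injection rules from the first bullet can be sketched as follows. The function names come from the commit text, but the signatures, the whitespace trimming of the file contents, and the route user name passed in are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"net/url"
	"os"
	"strings"
)

// resolveClusterPass applies the precedence file > env > flag.
// A configured but unreadable file is an error, not a silent fallback.
func resolveClusterPass(file, env, flag string) (string, error) {
	if file != "" {
		b, err := os.ReadFile(file)
		if err != nil {
			return "", fmt.Errorf("cluster pass file: %w", err)
		}
		return strings.TrimSpace(string(b)), nil
	}
	if env != "" {
		return env, nil
	}
	return flag, nil
}

// injectRouteCreds adds user:pass to routes that carry no userinfo and
// leaves routes that already specify credentials untouched.
func injectRouteCreds(routes []string, user, pass string) ([]string, error) {
	out := make([]string, 0, len(routes))
	for _, r := range routes {
		u, err := url.Parse(r)
		if err != nil {
			return nil, fmt.Errorf("route %q: %w", r, err)
		}
		if u.User != nil || pass == "" {
			out = append(out, r) // explicit override or nothing to inject
			continue
		}
		u.User = url.UserPassword(user, pass)
		out = append(out, u.String())
	}
	return out, nil
}

func main() {
	// Demo values only; a real caller would pass os.Getenv("UNIBUS_CLUSTER_PASS").
	pass, err := resolveClusterPass("", "from-env", "from-flag")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	routes, err := injectRouteCreds([]string{"nats://host:6250"}, "cluster", pass)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(len(routes), "route(s) resolved; secret never appeared in argv")
}
```

The point of the design is that the process command line only ever contains bare nats://host:6250 entries; the secret is joined to them in memory after it has been read from the file or the environment.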
Tests:
- TestResolveClusterPass (file > env > flag precedence; missing file errors),
- TestInjectRouteCreds (injects only into userinfo-less routes; preserves overrides),
- TestIsLoopbackURL (loopback vs remote vs malformed).
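A loopback check of the kind TestIsLoopbackURL exercises might look like the sketch below. isLoopbackURL is named in the commit; its body here is an assumption, and requireCA is a hypothetical helper showing how the N6 refusal could use it.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"net/url"
)

// isLoopbackURL reports whether raw points at the local host.
// Malformed URLs and URLs without a host are treated as not loopback.
func isLoopbackURL(raw string) bool {
	u, err := url.Parse(raw)
	if err != nil {
		return false
	}
	host := u.Hostname()
	if host == "" {
		return false
	}
	if host == "localhost" {
		return true
	}
	ip := net.ParseIP(host)
	return ip != nil && ip.IsLoopback()
}

// requireCA refuses a remote target without a CA, since the allowlist
// would otherwise travel in cleartext. Loopback targets are exempt.
func requireCA(natsURL, caPath string) error {
	if caPath == "" && !isLoopbackURL(natsURL) {
		return errors.New("remote --nats-url requires --ca")
	}
	return nil
}

func main() {
	for _, raw := range []string{
		"nats://127.0.0.1:4222",
		"nats://[::1]:4222",
		"nats://141.94.69.66:4222",
		"://malformed",
	} {
		fmt.Printf("%-28s loopback=%t\n", raw, isLoopbackURL(raw))
	}
}
```

Treating malformed input as not loopback fails closed: an unparseable target is then subject to the CA requirement instead of bypassing it.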
CGO_ENABLED=0 go build/vet/test green; govulncheck 0 reachable.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
generate-certs.sh mints the bus CA and a NATS server certificate whose SANs
cover the public IP (135.125.201.30), the WireGuard IP (10.42.0.1), the om
hostname, and localhost/127.0.0.1 for on-host smoke tests (all overridable via
env). Only the public ca.crt is committed; ca.key, server.key and server.crt
are gitignored and distributed out of band. README documents generation, use
and rotation.
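One way to sanity-check SAN coverage after generation is sketched below. It does not reproduce generate-certs.sh: exampleCert builds a throwaway self-signed certificate in memory purely so the check has something to run against, and all helper names are hypothetical. The hostname SAN is left out of the example.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"fmt"
	"math/big"
	"net"
	"time"
)

// exampleCert returns a short-lived self-signed certificate carrying the
// IP and localhost SANs listed in the commit message (illustration only).
func exampleCert() (*x509.Certificate, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, err
	}
	tmpl := &x509.Certificate{
		SerialNumber: big.NewInt(1),
		Subject:      pkix.Name{CommonName: "example-nats-server"},
		NotBefore:    time.Now().Add(-time.Minute),
		NotAfter:     time.Now().Add(time.Hour),
		KeyUsage:     x509.KeyUsageDigitalSignature,
		ExtKeyUsage:  []x509.ExtKeyUsage{x509.ExtKeyUsageServerAuth},
		DNSNames:     []string{"localhost"},
		IPAddresses: []net.IP{
			net.ParseIP("135.125.201.30"),
			net.ParseIP("10.42.0.1"),
			net.ParseIP("127.0.0.1"),
		},
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		return nil, err
	}
	return x509.ParseCertificate(der)
}

// missingSANs returns the hosts the certificate does not cover.
func missingSANs(cert *x509.Certificate, hosts []string) []string {
	var missing []string
	for _, h := range hosts {
		if err := cert.VerifyHostname(h); err != nil {
			missing = append(missing, h)
		}
	}
	return missing
}

func main() {
	cert, err := exampleCert()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	want := []string{"135.125.201.30", "10.42.0.1", "localhost", "127.0.0.1"}
	fmt.Println("missing SANs:", missingSANs(cert, want))
}
```

Against a real server.crt the same missingSANs check would run on a certificate parsed from the PEM file instead of the in-memory example.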
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add deploy/unibus-membershipd.service (Restart=always, binds both planes to
0.0.0.0 for LAN reachability), an idempotent deploy/install.sh that builds the
binary, symlinks the unit, and enables+starts it, plus deploy/README.md with
operate/health instructions.
Restart=always is deliberate: a clean SIGTERM exits 0 and Restart=on-failure
would not restart it, leaving the service silently dead (the sqlite_api gotcha).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>