unibus

dataforge/unibus

Fork 0

Commit Graph

Author	SHA1	Message	Date
Egutierrez	1c9325104c	feat(embeddednats): UNIBUS_NATS_MONITOR flag decoupled from debug log Add a dedicated UNIBUS_NATS_MONITOR=1 toggle that opens the embedded nats-server monitoring HTTP endpoint (127.0.0.1:8222, loopback only) so a local metrics scraper can read /varz, /connz and /jsz for server-level metrics (msgs/s, connections, KV bucket msgs, RAFT leader per stream, restarts). Previously the monitoring endpoint was only reachable via UNIBUS_NATS_DEBUG=1, which is coupled to the verbose nats-server debug log: enabling the endpoint also wrote routes/RAFT/room subjects to journald in clear, which regresses the hardened posture (issue 0007). The two concerns are now decoupled. The toggle computation is extracted to a pure function natsLogOpts(debugEnv, monitorEnv) (noLog, debug, trace, monitor): MONITOR=1 opens the endpoint while keeping the log quiet (NoLog true / Debug false). The inverse coupling is preserved for backward compatibility (DEBUG still implies MONITOR). The 127.0.0.1 bind stays hardcoded — the monitoring endpoint has no auth and must never be reachable from the network. Deploy wiring versioned: additive systemd drop-in membershipd-cluster.service.d/nats-monitor.conf (Environment=UNIBUS_NATS_MONITOR=1) plus a "NATS server metrics" section in the cluster README with the rolling activation runbook (magnus -> homer -> datardos) gated on R3 reconvergence (followers 2/2) between nodes. Tests: pure decoupling table (monitor on => log NOT debug; debug => monitor; default closed) + a real embedded server with MONITOR=1 asserting /varz answers 200 on loopback:8222, and a server without the flag with the endpoint closed. 100% additive: behavior is identical without the flag. Bump app.md 0.10.0 -> 0.11.0. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-07 20:57:46 +02:00
egutierrez	ce72131ddf	docs(cluster): correct runbook + wire --internal-id-file into deploy Corrections learned from the real 0011 deploy: - Bring up: the "start magnus alone and verify healthz" order deadlocks — a lone node of a 3-node cluster has no meta-group quorum and never serves healthz until a second node joins. Document a quorum-forming start and that a node never self-serves. - Replication: R1 is an unusable SPOF (all six control-plane buckets on one node) and the cold start only converges with the three cold-start fixes; go straight to R3 once the cluster forms. - Add a "user add --store kv" section: the live user-add path that replaces stop-seed-restart, with its security model and idempotency/HA/no-delete semantics. - Topology: real IPs, ROUTE_NETWORK=public (no WireGuard mesh exists). - Chaos test: mark the data-plane client + failover proofs as validated (0012). Deploy machinery now emits the persisted internal identity: the unit gains --internal-id-file ${INTERNAL_ID_FILE} and deploy-cluster.sh writes INTERNAL_ID_FILE into each node's cluster.env, so a fresh deploy enables the live user-add path on every node. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-07 19:41:56 +02:00
egutierrez	48a3d6be33	docs(0006g): cluster deploy material for magnus+homer+datardos (R3 HA) Parameterized, NO-VPS-touched material to bring up unibus as a 3-node cluster. The authoring agent ran none of it on a host; every remote-changing step is marked HUMAN and deploy-cluster.sh defaults to a dry run. deploy/cluster/: - nodes.env — topology (cluster name, ports, per-node rows). Public IPs known (homer 141.94.69.66, datardos 51.91.100.142) pre-filled; magnus public IP and all WireGuard IPs are <PLACEHOLDER> for the human; scripts refuse to run while any remain. - generate-cluster-certs.sh — mints a SEPARATE cluster route CA + a route cert per node (server+clientAuth, mutual routes) and a data-plane server cert per node signed by the reused client CA (../tls/ca.); SAN = public + WG + hostname. - membershipd-cluster.service — one unit, parameterized per node via /opt/unibus/cluster.env: enforce + per-subject ACL + TLS + --store kv, --cluster-pass-file (secret out of argv), Restart=always. - deploy-cluster.sh — cross-build linux/amd64, generate each node's cluster.env (routes to the other two on the WG mesh, no userinfo), rsync + install (only with --yes); staggered start is manual. - README.md — runbook: prerequisites, loopback bootstrap to seed the first admin into the KV (works around the user-CLI/KV chicken-and-egg), staggered bring-up, verify posture+quorum, scale R1->R3 in place, and the chaos test (left to 0003f on the real VPS). - .gitignore — out/, build/, secrets/, .key never committed. bash -n passes on both scripts; go build/test unchanged. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-07 17:31:13 +02:00

Author

SHA1

Message

Date

Egutierrez

1c9325104c

feat(embeddednats): UNIBUS_NATS_MONITOR flag decoupled from debug log

Add a dedicated UNIBUS_NATS_MONITOR=1 toggle that opens the embedded
nats-server monitoring HTTP endpoint (127.0.0.1:8222, loopback only) so a
local metrics scraper can read /varz, /connz and /jsz for server-level
metrics (msgs/s, connections, KV bucket msgs, RAFT leader per stream,
restarts).

Previously the monitoring endpoint was only reachable via UNIBUS_NATS_DEBUG=1,
which is coupled to the verbose nats-server debug log: enabling the endpoint
also wrote routes/RAFT/room subjects to journald in clear, which regresses the
hardened posture (issue 0007). The two concerns are now decoupled.

The toggle computation is extracted to a pure function
natsLogOpts(debugEnv, monitorEnv) (noLog, debug, trace, monitor): MONITOR=1
opens the endpoint while keeping the log quiet (NoLog true / Debug false). The
inverse coupling is preserved for backward compatibility (DEBUG still implies
MONITOR). The 127.0.0.1 bind stays hardcoded — the monitoring endpoint has no
auth and must never be reachable from the network.

Deploy wiring versioned: additive systemd drop-in
membershipd-cluster.service.d/nats-monitor.conf (Environment=UNIBUS_NATS_MONITOR=1)
plus a "NATS server metrics" section in the cluster README with the rolling
activation runbook (magnus -> homer -> datardos) gated on R3 reconvergence
(followers 2/2) between nodes.

Tests: pure decoupling table (monitor on => log NOT debug; debug => monitor;
default closed) + a real embedded server with MONITOR=1 asserting /varz answers
200 on loopback:8222, and a server without the flag with the endpoint closed.
100% additive: behavior is identical without the flag. Bump app.md 0.10.0 ->
0.11.0.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

2026-06-07 20:57:46 +02:00

egutierrez

ce72131ddf

docs(cluster): correct runbook + wire --internal-id-file into deploy

Corrections learned from the real 0011 deploy:
- Bring up: the "start magnus alone and verify healthz" order deadlocks — a
  lone node of a 3-node cluster has no meta-group quorum and never serves
  healthz until a second node joins. Document a quorum-forming start and that
  a node never self-serves.
- Replication: R1 is an unusable SPOF (all six control-plane buckets on one
  node) and the cold start only converges with the three cold-start fixes;
  go straight to R3 once the cluster forms.
- Add a "user add --store kv" section: the live user-add path that replaces
  stop-seed-restart, with its security model and idempotency/HA/no-delete
  semantics.
- Topology: real IPs, ROUTE_NETWORK=public (no WireGuard mesh exists).
- Chaos test: mark the data-plane client + failover proofs as validated (0012).

Deploy machinery now emits the persisted internal identity: the unit gains
--internal-id-file ${INTERNAL_ID_FILE} and deploy-cluster.sh writes
INTERNAL_ID_FILE into each node's cluster.env, so a fresh deploy enables the
live user-add path on every node.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

2026-06-07 19:41:56 +02:00

egutierrez

48a3d6be33

docs(0006g): cluster deploy material for magnus+homer+datardos (R3 HA)

Parameterized, NO-VPS-touched material to bring up unibus as a 3-node cluster.
The authoring agent ran none of it on a host; every remote-changing step is
marked HUMAN and deploy-cluster.sh defaults to a dry run.

deploy/cluster/:
- nodes.env — topology (cluster name, ports, per-node rows). Public IPs known
  (homer 141.94.69.66, datardos 51.91.100.142) pre-filled; magnus public IP and
  all WireGuard IPs are <PLACEHOLDER> for the human; scripts refuse to run while
  any remain.
- generate-cluster-certs.sh — mints a SEPARATE cluster route CA + a route cert per
  node (server+clientAuth, mutual routes) and a data-plane server cert per node
  signed by the reused client CA (../tls/ca.*); SAN = public + WG + hostname.
- membershipd-cluster.service — one unit, parameterized per node via
  /opt/unibus/cluster.env: enforce + per-subject ACL + TLS + --store kv,
  --cluster-pass-file (secret out of argv), Restart=always.
- deploy-cluster.sh — cross-build linux/amd64, generate each node's cluster.env
  (routes to the other two on the WG mesh, no userinfo), rsync + install (only
  with --yes); staggered start is manual.
- README.md — runbook: prerequisites, loopback bootstrap to seed the first admin
  into the KV (works around the user-CLI/KV chicken-and-egg), staggered bring-up,
  verify posture+quorum, scale R1->R3 in place, and the chaos test (left to 0003f
  on the real VPS).
- .gitignore — out/, build/, secrets/, *.key never committed.

bash -n passes on both scripts; go build/test unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

2026-06-07 17:31:13 +02:00

3 Commits