agents_and_robots/pkg/tools/devicemesh
egutierrez bcd246bf85 feat(0144a): tool registry framework for device-mesh
Adds pkg/tools/devicemesh with an HTTP Client for the device_agent + ToolRegistry
with 16 standard tools (exec, fs.*, git.*, docker.*, proc.*, pkg.*, shell.eval).
RegisterBuiltins filters by user/sudo mode via the RequiresApproval flag.
Hooks into pkg/decision with ActionKindDeviceMesh + DeviceMeshAction.
Runner supports dispatch via NewRunnerWithDeviceMesh (NewRunner kept for back-compat).

Tests: 25 new in devicemesh + 4 in runner. Build clean.
2026-05-24 14:07:13 +02:00

pkg/tools/devicemesh

Tool registry framework that lets an LLM agent in agents_and_robots (VPS) call capabilities exposed by a remote device_agent over the WireGuard mesh.

Issue: 0144a (POC for the broader 0144 spec).

What it does

LLM (Claude)
  │  tool_call exec {argv:["ls","/tmp"]}
  ▼
ToolRegistry.Call("exec", input)
  │  1. ValidateInput against tool's InputSchema
  │  2. ArgMapping(input) → device-facing args
  │  3. Client.Call(CapabilityRequest{capability: "shell.exec", args})
  │  4. ResultMapping(resp.Result) → LLM-facing output
  ▼
HTTP POST http://10.42.0.10:7474/capability   (over mesh WG)
  ▼
device_agent on home-wsl runs the binary, returns audit_hash + result

The LLM never sees the HTTP layer; it sees a flat list of named tools with JSON-Schema inputs.
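The four numbered steps can be sketched as a single function. This is a minimal illustration, not the package's code: the trimmed `ToolSpec`, the `Caller` interface, the `Validate` field, and `fakeCaller` are all hypothetical stand-ins, and the error wording is assumed.

```go
package main

import (
	"errors"
	"fmt"
)

// CapabilityRequest mirrors the envelope named in the flow above; the exact
// field set is an assumption for illustration.
type CapabilityRequest struct {
	Capability string
	Args       map[string]any
}

// Caller is a hypothetical seam standing in for Client.Call.
type Caller interface {
	Call(req CapabilityRequest) (map[string]any, error)
}

// ToolSpec is a trimmed-down stand-in for the package's ToolSpec.
type ToolSpec struct {
	Name          string
	Capability    string
	Validate      func(in map[string]any) error // stands in for ValidateInput
	ArgMapping    func(in map[string]any) (map[string]any, error)
	ResultMapping func(r map[string]any) (any, error)
}

// dispatch runs the four steps in order and stops at the first failure.
func dispatch(spec ToolSpec, c Caller, in map[string]any) (any, error) {
	if err := spec.Validate(in); err != nil {
		return nil, fmt.Errorf("%s: invalid input: %w", spec.Name, err)
	}
	args, err := spec.ArgMapping(in)
	if err != nil {
		return nil, fmt.Errorf("%s: arg mapping: %w", spec.Name, err)
	}
	res, err := c.Call(CapabilityRequest{Capability: spec.Capability, Args: args})
	if err != nil {
		return nil, fmt.Errorf("%s: %w", spec.Name, err)
	}
	return spec.ResultMapping(res)
}

// fakeCaller records the request and returns a canned result, so the
// pipeline can be exercised without any HTTP traffic.
type fakeCaller struct{ last CapabilityRequest }

func (f *fakeCaller) Call(req CapabilityRequest) (map[string]any, error) {
	f.last = req
	return map[string]any{"stdout": "a.txt\n", "exit_code": 0}, nil
}

// execSpec builds an illustrative "exec" tool bound to "shell.exec".
func execSpec() ToolSpec {
	return ToolSpec{
		Name:       "exec",
		Capability: "shell.exec",
		Validate: func(in map[string]any) error {
			if _, ok := in["argv"]; !ok {
				return errors.New("argv is required")
			}
			return nil
		},
		ArgMapping:    func(in map[string]any) (map[string]any, error) { return in, nil },
		ResultMapping: func(r map[string]any) (any, error) { return r["stdout"], nil },
	}
}

func main() {
	fc := &fakeCaller{}
	out, err := dispatch(execSpec(), fc, map[string]any{"argv": []any{"ls", "/tmp"}})
	fmt.Println(out, err, fc.last.Capability)
}
```

Because validation runs first, a malformed tool call is rejected before any request leaves the agent.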

Pieces

| File | Purpose |
| --- | --- |
| client.go | HTTP client to POST /capability and GET /health of the remote device_agent. Generates request_id (req_<12bytehex>) and nonce (16 random bytes base64) when missing. |
| types.go | ToolSpec + ToolRegistry. Thread-safe registry, Call is the single dispatch entry point. |
| schema.go | Mini JSON-Schema validator (object/array/string/integer/number/boolean + required + additionalProperties + enum). Enough to reject LLM mistakes without pulling a heavy dep. |
| tools_builtin.go | The standard catalog: exec, shell.eval, fs.read, fs.write, fs.list, fs.stat, git.clone, git.commit, git.push, pkg.install, pkg.search, proc.list, proc.kill, docker.list, docker.exec, docker.logs. `RegisterBuiltins(reg, ModeUser…` |
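As an illustration of the request_id and nonce defaults described for client.go, here is a minimal sketch. It assumes "12bytehex" means 12 random bytes rendered as 24 hex characters and that the nonce uses standard (padded) base64; `Envelope`, `fillMissing`, `newRequestID`, and `newNonce` are hypothetical names, not the client's actual API.

```go
package main

import (
	"crypto/rand"
	"encoding/base64"
	"encoding/hex"
	"fmt"
)

// newRequestID returns "req_" plus 12 random bytes as 24 hex characters.
// Reading "12bytehex" as 12 bytes (not 12 hex digits) is an assumption.
func newRequestID() (string, error) {
	b := make([]byte, 12)
	if _, err := rand.Read(b); err != nil {
		return "", err
	}
	return "req_" + hex.EncodeToString(b), nil
}

// newNonce returns 16 random bytes in standard base64 (24 chars with padding).
func newNonce() (string, error) {
	b := make([]byte, 16)
	if _, err := rand.Read(b); err != nil {
		return "", err
	}
	return base64.StdEncoding.EncodeToString(b), nil
}

// Envelope holds only the two generated fields; the real request carries more.
type Envelope struct {
	RequestID string
	Nonce     string
}

// fillMissing generates request_id and nonce only when the caller left them
// empty, so caller-supplied values are never overwritten.
func fillMissing(e Envelope) (Envelope, error) {
	var err error
	if e.RequestID == "" {
		if e.RequestID, err = newRequestID(); err != nil {
			return e, err
		}
	}
	if e.Nonce == "" {
		if e.Nonce, err = newNonce(); err != nil {
			return e, err
		}
	}
	return e, nil
}

func main() {
	e, err := fillMissing(Envelope{RequestID: "req_fixed"})
	if err != nil {
		panic(err)
	}
	fmt.Println(e.RequestID, len(e.Nonce))
}
```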

How to register a new tool

import "github.com/enmanuel/agents/pkg/tools/devicemesh"

reg.Register(devicemesh.ToolSpec{
    Name:        "screenshot",
    Description: "Capture the display on the remote device. Returns PNG base64.",
    Capability:  "display.capture",
    InputSchema: map[string]any{
        "type":                 "object",
        "additionalProperties": false,
        "properties": map[string]any{
            "format": map[string]any{"type": "string", "enum": []any{"png", "jpeg"}},
        },
    },
    ArgMapping: func(in map[string]any) (map[string]any, error) {
        // pure transform LLM → device
        return in, nil
    },
    ResultMapping: func(r map[string]any) (any, error) {
        // pure transform device → LLM
        return r, nil
    },
    RequiresApproval: false, // user-scope
})

Then add the tool name to cfg.DeviceMesh.ToolsAllowed in the agent's config.yaml.
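To show what the InputSchema in the example buys, the sketch below validates inputs against the same schema using a deliberately tiny subset of JSON-Schema (required, additionalProperties: false, string type, enum). It is an assumption-laden illustration of the idea; `validate`, `contains`, and `screenshotSchema` are hypothetical and are not the functions in schema.go.

```go
package main

import "fmt"

// validate checks an input object against a small JSON-Schema subset:
// "required", "additionalProperties": false, and per-property "type"
// ("string" only) plus "enum". Enum values are assumed to be scalars.
func validate(schema, in map[string]any) error {
	props, _ := schema["properties"].(map[string]any)
	if req, ok := schema["required"].([]any); ok {
		for _, r := range req {
			name, _ := r.(string)
			if _, present := in[name]; !present {
				return fmt.Errorf("missing required property %q", name)
			}
		}
	}
	for key, val := range in {
		raw, known := props[key]
		if !known {
			// Unknown keys are only an error when the schema forbids them.
			if ap, ok := schema["additionalProperties"].(bool); ok && !ap {
				return fmt.Errorf("unknown property %q", key)
			}
			continue
		}
		ps, _ := raw.(map[string]any)
		if ps["type"] == "string" {
			if _, ok := val.(string); !ok {
				return fmt.Errorf("property %q must be a string", key)
			}
		}
		if enum, ok := ps["enum"].([]any); ok && !contains(enum, val) {
			return fmt.Errorf("property %q must be one of %v", key, enum)
		}
	}
	return nil
}

// contains reports whether v equals one of the listed enum values.
func contains(list []any, v any) bool {
	for _, item := range list {
		if item == v {
			return true
		}
	}
	return false
}

// screenshotSchema reproduces the InputSchema from the registration example.
func screenshotSchema() map[string]any {
	return map[string]any{
		"type":                 "object",
		"additionalProperties": false,
		"properties": map[string]any{
			"format": map[string]any{"type": "string", "enum": []any{"png", "jpeg"}},
		},
	}
}

func main() {
	s := screenshotSchema()
	fmt.Println(validate(s, map[string]any{"format": "png"}))
	fmt.Println(validate(s, map[string]any{"format": "gif"}))
	fmt.Println(validate(s, map[string]any{"quality": 90}))
}
```

With this schema, a misspelled or invented argument from the model fails locally instead of reaching the device.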

Wiring (issue 0144c — done)

The launcher now constructs the device mesh registry from cfg.DeviceMesh and surfaces every spec as a regular tools.Tool consumed by the existing LLM tool-use loop. No special LLM path; the LLM does not know (or care) that the tool's Exec ends up making an HTTP call over WireGuard.

config.AgentConfig.DeviceMesh (yaml block)
    │
    ▼  buildDeviceMeshRegistry(cfg, logger)   ← devagents/registry_build.go
    │   1. resolve URL (env var override wins when present + non-empty)
    │   2. NewClient(url) + apply Timeout
    │   3. RegisterBuiltins(reg, mode)        ← user | sudo | all
    │   4. FilterByAllowed(reg, tools_allowed)
    │
    ▼  devicemesh.ToolsForLLM(reg)            ← pkg/tools/devicemesh/adapter.go
    │   1 tools.Tool per spec; Def.Parameters
    │   compressed from JSON-Schema; Exec
    │   closure routes through reg.Call
    │
    ▼  tools.Registry.Register(...)           ← devagents/registry_build.go
    │
    ▼  devagents/llm.go runLLM tool-use loop  ← unchanged

The same *ToolRegistry is also passed to effects.NewRunnerWithDeviceMesh so any rule that emits decision.ActionKindDeviceMesh (orchestrator pipelines, !exec builtin command, etc.) hits the same dispatcher. Both paths produce the same JSON envelope, so audit chains line up regardless of where the call originated.
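The adapter step ("1 tools.Tool per spec; Exec closure routes through reg.Call") might look roughly like the following. `Registry`, `Tool`, `toolsForLLM`, and `demoRegistry` are simplified, hypothetical stand-ins for *ToolRegistry, tools.Tool, and devicemesh.ToolsForLLM; the real types carry more fields, and the unknown-tool error text is assumed.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// Registry is a hypothetical stand-in for *ToolRegistry: name -> handler.
type Registry struct {
	handlers map[string]func(in map[string]any) (any, error)
}

// Call is the single dispatch entry point.
func (r *Registry) Call(name string, in map[string]any) (any, error) {
	h, ok := r.handlers[name]
	if !ok {
		return nil, fmt.Errorf("devicemesh: unknown tool %q", name)
	}
	return h(in)
}

// Tool is a hypothetical stand-in for tools.Tool as seen by the LLM loop.
type Tool struct {
	Name string
	Exec func(in map[string]any) (string, error)
}

// toolsForLLM returns one Tool per registered name. Each Exec closure
// captures its own name, routes through reg.Call, and JSON-encodes the result.
func toolsForLLM(reg *Registry) []Tool {
	names := make([]string, 0, len(reg.handlers))
	for n := range reg.handlers {
		names = append(names, n)
	}
	sort.Strings(names) // deterministic order for the tool list
	out := make([]Tool, 0, len(names))
	for _, n := range names {
		name := n // per-iteration copy captured by the closure
		out = append(out, Tool{Name: name, Exec: func(in map[string]any) (string, error) {
			res, err := reg.Call(name, in)
			if err != nil {
				return "", err
			}
			b, err := json.Marshal(res)
			return string(b), err
		}})
	}
	return out
}

// demoRegistry registers two canned handlers for illustration.
func demoRegistry() *Registry {
	return &Registry{handlers: map[string]func(map[string]any) (any, error){
		"exec":    func(map[string]any) (any, error) { return map[string]any{"stdout": "hi"}, nil },
		"fs.read": func(map[string]any) (any, error) { return "data", nil },
	}}
}

func main() {
	for _, t := range toolsForLLM(demoRegistry()) {
		out, err := t.Exec(nil)
		fmt.Println(t.Name, out, err)
	}
}
```

Since every closure funnels into the same `Call`, the LLM path and the rule-driven path share one dispatcher, which is the property the paragraph above relies on.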

Config block

The agent's config.yaml opts in via:

device_mesh:
  enabled: true
  device_id: home-wsl                # logged as audit context; aliased as "host"
  mode: user                         # user | sudo | all
  device_agent_url: "http://10.42.0.10:7474"
  device_agent_url_env: AGENT_HOME_WSL_DEVICE_MESH_URL  # optional; wins when set + non-empty
  manifest_id: manifest_home-wsl_v1  # metadata only; the device enforces
  client_timeout_s: 60               # aliased as "timeout_seconds"
  tools_allowed:                     # whitelist; empty = keep everything mode allowed
    - exec
    - fs.read
    - fs.list

Names in tools_allowed that the catalog does not provide are dropped and logged at WARN level with the message "device_mesh tools_allowed lists unknown tool". The template ships extras such as project.create and memory.recall that arrive in 0144d/e; until then they degrade gracefully.
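Two of the config rules lend themselves to a short sketch: the env-var override for the URL and the tools_allowed whitelist. The functions below are hypothetical (`resolveURL` and `filterAllowed` are illustrative names, and the environment lookup is injected as a parameter purely to keep the logic testable); they only mirror the behavior described in the comments of the YAML block.

```go
package main

import (
	"fmt"
	"os"
	"sort"
)

// resolveURL prefers the environment variable named by envName when it is
// set and non-empty; otherwise it falls back to the configured URL.
func resolveURL(configured, envName string, getenv func(string) string) string {
	if envName != "" {
		if v := getenv(envName); v != "" {
			return v
		}
	}
	return configured
}

// filterAllowed keeps the catalog tools named in allowed (in that order) and
// returns the names the catalog does not provide so the caller can log them.
// An empty allowed list keeps everything the mode registered.
func filterAllowed(catalog map[string]bool, allowed []string) (kept, unknown []string) {
	if len(allowed) == 0 {
		for name := range catalog {
			kept = append(kept, name)
		}
		sort.Strings(kept)
		return kept, nil
	}
	for _, name := range allowed {
		if catalog[name] {
			kept = append(kept, name)
		} else {
			unknown = append(unknown, name)
		}
	}
	return kept, unknown
}

func main() {
	url := resolveURL("http://10.42.0.10:7474", "AGENT_HOME_WSL_DEVICE_MESH_URL", os.Getenv)
	catalog := map[string]bool{"exec": true, "fs.read": true, "fs.list": true, "shell.eval": true}
	kept, unknown := filterAllowed(catalog, []string{"exec", "fs.read", "project.create"})
	for _, name := range unknown {
		fmt.Printf("WARN device_mesh tools_allowed lists unknown tool: %s\n", name)
	}
	fmt.Println(url, kept)
}
```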

LLM-side view of a device tool

The adapter compresses the device-mesh InputSchema into the flatter tools.Def.Parameters shape (each top-level property becomes one tools.Param). The description is enriched with a stable marker so the model can spot remote tools at a glance:

exec  →  "Execute a command on the remote device. argv is parsed as exec.Command (NO shell). ... [device_mesh: shell.exec]"
pkg.install  →  "Install an OS package ... [device_mesh: pkg.install] (approval required)"

When RequiresApproval=true, the marker also reminds the model the call may be queued, which feeds back into the system prompt rules of agent-<host>-sudo.
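A rough sketch of both adapter transformations, under stated assumptions: `Param` is a hypothetical three-field stand-in for tools.Param, `flattenParams` keeps only each top-level property's own type, and `enrichDescription` reproduces the marker format shown in the two examples above.

```go
package main

import (
	"fmt"
	"sort"
)

// Param is a hypothetical stand-in for tools.Param.
type Param struct {
	Name     string
	Type     string
	Required bool
}

// flattenParams turns each top-level schema property into one Param.
// Nested structure is dropped; only the property's own "type" is kept.
func flattenParams(schema map[string]any) []Param {
	required := map[string]bool{}
	if req, ok := schema["required"].([]any); ok {
		for _, r := range req {
			if name, ok := r.(string); ok {
				required[name] = true
			}
		}
	}
	props, _ := schema["properties"].(map[string]any)
	out := make([]Param, 0, len(props))
	for name, raw := range props {
		p := Param{Name: name, Required: required[name]}
		if ps, ok := raw.(map[string]any); ok {
			p.Type, _ = ps["type"].(string)
		}
		out = append(out, p)
	}
	// Map iteration order is random, so sort for a stable parameter list.
	sort.Slice(out, func(i, j int) bool { return out[i].Name < out[j].Name })
	return out
}

// enrichDescription appends the stable device_mesh marker and, for tools
// that require approval, the reminder that the call may be queued.
func enrichDescription(desc, capability string, requiresApproval bool) string {
	out := fmt.Sprintf("%s [device_mesh: %s]", desc, capability)
	if requiresApproval {
		out += " (approval required)"
	}
	return out
}

func main() {
	schema := map[string]any{
		"type": "object",
		"properties": map[string]any{
			"format": map[string]any{"type": "string", "enum": []any{"png", "jpeg"}},
		},
	}
	fmt.Println(flattenParams(schema))
	fmt.Println(enrichDescription("Install an OS package", "pkg.install", true))
}
```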

Approval flow + LLM tool-result mapping

When the device_agent returns approval_status="queued" and the operator does not click 👍 within the timeout (0134 §6.5), the device returns approval_status="timeout" or ok=false, error="approval_required". The adapter does NOT silence this — it surfaces the error verbatim:

ToolRegistry.Call(...) → returns err = "devicemesh: shell.exec: approval_required"
tools.Result{Err: err}
runLLM → appends `role='tool'` message with `error: devicemesh: shell.exec: approval_required`
LLM next iteration → can apologize to operator and ask for retry.

The actual approval UX (operator clicks 👍 in #operator-approvals) is the device_agent's responsibility (issue 0134 §6, validated end-to-end in flow 0009). Nothing new on the agents_and_robots side.
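The error path above can be sketched as two small functions. `CapabilityResponse`, `callError`, and `toolMessage` are hypothetical names; the "devicemesh: <capability>: <error>" format follows the example line, while the wording used for a bare approval_status="timeout" and for an ok=false response without an error string is an assumption.

```go
package main

import "fmt"

// CapabilityResponse is a simplified, hypothetical view of the device reply.
type CapabilityResponse struct {
	OK             bool
	Error          string
	ApprovalStatus string
	Result         map[string]any
}

// callError converts a device reply into the error surfaced by the registry.
// A nil return means the call succeeded and Result can be mapped.
func callError(capability string, resp CapabilityResponse) error {
	switch {
	case !resp.OK && resp.Error != "":
		// Pass the device's error through verbatim, e.g. approval_required.
		return fmt.Errorf("devicemesh: %s: %s", capability, resp.Error)
	case resp.ApprovalStatus == "timeout":
		// Assumed wording: the source does not specify this message.
		return fmt.Errorf("devicemesh: %s: approval timeout", capability)
	case !resp.OK:
		// Assumed wording for a failure without an error string.
		return fmt.Errorf("devicemesh: %s: unknown error", capability)
	}
	return nil
}

// toolMessage renders the content of the role='tool' message: the error,
// prefixed with "error: ", or the already-encoded result.
func toolMessage(result string, err error) string {
	if err != nil {
		return "error: " + err.Error()
	}
	return result
}

func main() {
	resp := CapabilityResponse{OK: false, Error: "approval_required", ApprovalStatus: "timeout"}
	fmt.Println(toolMessage("", callError("shell.exec", resp)))
}
```

Keeping the error text verbatim lets the model distinguish an approval failure from a command failure on its next turn.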

What this issue does NOT do

  • Matrix-side approval rendering is 0144f — !preapprove, !approve req_id, pre-approval cache.
  • ed25519 manifest signing is 0144h — today the wire format is correct but unsigned.
  • call_monitor telemetry hook that emits function_id = capability_<name>_<lang>_<domain> per call is 0144 §13 (separate plumbing in the audit writer).
  • Cross-room correlation (delegate_sudo posting to #<host>-sudo and the bot copying the reply back) is its own issue (0144 main spec §3.3 + 0144c original plan — left intentionally for the room/bus layer once approval is wired).

shell.eval — the powerful tool

shell.eval is the only built-in tool that lets the LLM execute arbitrary free-form shell text on the device. Every other tool has a tightly-scoped JSON schema (paths, argv lists, container ids); shell.eval accepts a single string that the device hands to bash (Linux/WSL) or PowerShell (Windows) unmodified.

It exists because no structured tool can cover every legal shell idiom: pipes, redirects, here-docs, $() expansions, complex globs, environment-aware composition. Without shell.eval, the LLM resorts to multi-step exec chains that lose fidelity (no shell metacharacters allowed in exec's argv). With it, the LLM can ask for "give me the size of every .log in /var/log sorted desc" in one round-trip.

Guardrails (all device-side)

The flag on ToolSpec.RequiresApproval is metadata only. The real protections live in the device_agent:

  1. Hardcoded blocklist — destructive patterns (rm -rf /, dd if=/dev/..., mkfs, fork-bombs :(){:|:&};:, shutdown, reboot, :>/dev/sda, ...) always reject regardless of agent role or operator. There is no override.
  2. Auto-approve whitelist — read-only / inspection patterns (^git , ^ls , ^cat , ^grep , ^ps , ^uptime, ^df , ...) execute directly without operator prompt. The whitelist lives in the device manifest, not here.
  3. Operator approval — anything that is neither blocked nor auto-approved returns approval_status="queued" in the result. The device sends an approval request to #operator-approvals in Element and waits up to 60s for the operator to confirm; on timeout the call returns approval_status="timeout" and the LLM must reword or !retry.

The fields the LLM gets back from shell.eval: stdout, stderr, exit_code, approval_status, cmd_executed (post-normalization), truncated (true if output was capped), duration_ms.
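The three guardrails apply in a fixed order: blocklist first, then auto-approve, then the operator queue. The sketch below shows that ordering only. It is not the device_agent's implementation: the patterns are a small illustrative subset of those listed above, and `Verdict`, `classify`, `blocklist`, and `autoApprove` are hypothetical names.

```go
package main

import (
	"fmt"
	"regexp"
)

// Verdict is the outcome of triaging one shell.eval command.
type Verdict string

const (
	Blocked      Verdict = "blocked"
	AutoApproved Verdict = "auto_approved"
	Queued       Verdict = "queued"
)

// blocklist holds a few illustrative destructive patterns. The real list is
// hardcoded in the device_agent and is longer.
var blocklist = []*regexp.Regexp{
	regexp.MustCompile(`\brm\s+-rf\s+/(\s|$)`),
	regexp.MustCompile(`\bdd\s+if=/dev/`),
	regexp.MustCompile(`\bmkfs\b`),
	regexp.MustCompile(`\b(shutdown|reboot)\b`),
}

// autoApprove holds a few illustrative read-only prefixes. The real
// whitelist lives in the device manifest.
var autoApprove = []*regexp.Regexp{
	regexp.MustCompile(`^git `),
	regexp.MustCompile(`^ls `),
	regexp.MustCompile(`^cat `),
	regexp.MustCompile(`^uptime`),
	regexp.MustCompile(`^df `),
}

// classify checks the blocklist before the whitelist, so a command that
// starts with an approved prefix but contains a blocked pattern is rejected.
func classify(cmd string) Verdict {
	for _, re := range blocklist {
		if re.MatchString(cmd) {
			return Blocked
		}
	}
	for _, re := range autoApprove {
		if re.MatchString(cmd) {
			return AutoApproved
		}
	}
	return Queued
}

func main() {
	for _, cmd := range []string{"ls /tmp", "git status && reboot", "apt-get install jq", "rm -rf /"} {
		fmt.Printf("%-24s %s\n", cmd, classify(cmd))
	}
}
```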

When the LLM should call shell.eval

Use it as the fallback for cases none of the structured tools cover:

  • Pipes, redirects, sub-shells, here-docs.
  • One-liners that combine find + xargs + awk.
  • Quick sanity checks (uptime && df -h).
  • Composing CLI tools the agent isn't going to call enough to warrant a dedicated tool spec.

Avoid it for things that do have a structured tool: fs.read, fs.list, git.commit, docker.exec, etc. Those have predictable JSON shapes, narrower attack surface, and richer result mapping.

Designing manifests for user vs sudo agents

RegisterBuiltins registers shell.eval in both ModeUser and ModeSudo because the device_agent — not the registry — decides what is safe. Recommended manifest defaults:

| Agent role | RequiresApproval (LLM-facing metadata) | Device manifest |
| --- | --- | --- |
| agent-<host> (user) | false | Auto-approve whitelist + operator approval for anything else. Hardcoded blocklist active. |
| agent-<host>-sudo (sudo) | true (forced via withApprovalRequired) | Every invocation requires explicit operator approval. No auto-approve whitelist. Hardcoded blocklist active. |

The withApprovalRequired helper clones the spec returned by shellEvalSpec() and flips RequiresApproval=true without mutating the source, so ModeUser registries that re-register after a ModeSudo run still get the unmodified spec. See tools_builtin.go::RegisterBuiltins for the special-case wiring.
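A minimal sketch of the clone-and-flip idea, assuming a reduced `ToolSpec` with only value fields. `registerShellEval` and the `Mode` constants here are hypothetical simplifications of the special-case wiring in RegisterBuiltins, and the description string is a placeholder.

```go
package main

import "fmt"

// Mode selects which flavor of the catalog is registered.
type Mode int

const (
	ModeUser Mode = iota
	ModeSudo
)

// ToolSpec is reduced to value fields for this sketch. With reference
// fields (for example a map-typed InputSchema), a by-value copy would still
// share the underlying map, so a real clone would need to copy those too.
type ToolSpec struct {
	Name             string
	Description      string
	RequiresApproval bool
}

// shellEvalSpec returns a fresh, unmodified spec on every call.
func shellEvalSpec() ToolSpec {
	return ToolSpec{Name: "shell.eval", Description: "Run free-form shell text on the device."}
}

// withApprovalRequired works on its own copy (Go passes structs by value),
// so the caller's spec is left untouched.
func withApprovalRequired(spec ToolSpec) ToolSpec {
	spec.RequiresApproval = true
	return spec
}

// registerShellEval registers shell.eval for the given mode, forcing the
// approval flag only for sudo registries.
func registerShellEval(reg map[string]ToolSpec, mode Mode) {
	spec := shellEvalSpec()
	if mode == ModeSudo {
		spec = withApprovalRequired(spec)
	}
	reg[spec.Name] = spec
}

func main() {
	sudo, user := map[string]ToolSpec{}, map[string]ToolSpec{}
	registerShellEval(sudo, ModeSudo)
	registerShellEval(user, ModeUser) // registered after the sudo run
	fmt.Println(sudo["shell.eval"].RequiresApproval, user["shell.eval"].RequiresApproval)
}
```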

See also: apps/device_agent/ (where the blocklist + auto-approve whitelist + approval flow live) and issue 0144 §6.4 for the RBAC design.

POC limitations (intentional)

These are out of scope for 0144a and tracked in sibling issues:

  • No retry. A single Call failure surfaces immediately. The spec accepts this: tool failures go back to the LLM as a role='tool' error message and the LLM decides what to do (issue 0144 §7.1, operating rule 2).
  • No pre-approval cache. RequiresApproval is metadata only; the actual gate lives on the device_agent (0144 §3) and the pre-approvals table (0144f).
  • No streaming. Tools are request/response. Long-running commands (apt-get install of a 200MB package) block until done or timeout. Streaming for logs is its own future issue.
  • No exponential backoff. The Go HTTP client's transport defaults apply (TCP retries on connect, no per-request retry).
  • No output sanitization. The Runner formats the result as JSON; sanitization against prompt-injection payloads is 0144g.
  • No telemetry to call_monitor. The hook for function_id = capability_<name>_<lang>_<domain> is part of the agent runtime wiring (0144c) — this package emits no metrics on its own.
  • No manifest signing on the request side. The Client envelope matches the 0134 §2.1 wire format but does NOT sign; manifest signing arrives in 0144h.

Why these specific design choices

  • Args map[string]any (object) NOT []string (positional). The current device_agent POC uses []string for shell.exec (see apps/device_agent/capability.go). The 0134 protocol and 0144 spec call for object-shaped args because most capabilities (fs.read, git.clone, docker.exec) are not naturally positional. 0144h migrates the device_agent.
  • ResultMapping returns any instead of map[string]any. Some tools (e.g. the echo example in the tests) collapse their output to a string. The Runner JSON-encodes whatever comes back so the LLM always sees a stable representation.
  • Capability is a field on ToolSpec, not derived from Name. The 1:1 mapping is the common case (fs.read → fs.read), but docker.list maps to docker.container.list and the future project.create composes multiple capabilities, so the indirection pays for itself.
  • Pure/impure split inside one package. ToolSpec, schema, mappings, registry are pure data and pure functions. Only Client.Call and Client.Health do I/O. The runtime composes them; tests substitute the Client.
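To illustrate why ResultMapping can return any, the sketch below JSON-encodes both a collapsed string and a pass-through map. The `ResultMapping` type alias and the helpers `collapseToStdout`, `passThrough`, and `encodeForLLM` are hypothetical; only the idea that the result is JSON-encoded before reaching the LLM comes from the text above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ResultMapping is the pure device -> LLM transform; it may return any
// JSON-encodable value, not only a map.
type ResultMapping func(r map[string]any) (any, error)

// collapseToStdout reduces the device result to its stdout string.
func collapseToStdout(r map[string]any) (any, error) {
	s, ok := r["stdout"].(string)
	if !ok {
		return nil, fmt.Errorf("result has no string stdout")
	}
	return s, nil
}

// passThrough returns the device result unchanged.
func passThrough(r map[string]any) (any, error) { return r, nil }

// encodeForLLM applies the mapping and JSON-encodes whatever comes back.
// encoding/json sorts map keys, so the output is stable across runs.
func encodeForLLM(m ResultMapping, r map[string]any) (string, error) {
	v, err := m(r)
	if err != nil {
		return "", err
	}
	b, err := json.Marshal(v)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	r := map[string]any{"stdout": "hello", "exit_code": 0}
	s, _ := encodeForLLM(collapseToStdout, r)
	m, _ := encodeForLLM(passThrough, r)
	fmt.Println(s)
	fmt.Println(m)
}
```

Both mappings are pure functions over plain data, so they can be unit-tested without a Client, in line with the pure/impure split described in the last bullet.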