Synthesis · Agent ops signal · 2026-05-29

Agent OS: build the control room, then bolt an eval spine through it.

The two links converge on one operating model: Shann Holmberg gives the physical and organizational shape of a Hermes agent fleet; Ben Hylak gives the reliability loop that keeps that fleet from becoming theater. Together they imply a practical doctrine for Connor: every agent needs a cockpit, every workflow needs traceable proof, and every production stumble should become a fix, a regression case, or a consciously ignored one-off.

Control room

A side control plane: registry, runbooks, ports, permissions, env map, backup plan. No raw secrets. Not where work happens.

Eval spine

Code-aware golden cases, traces, replay, assertions, issue taxonomy, production monitoring, high-signal regression memory.

Learning loop

User/source/agent signals flow into GBrain and Signal Radar, cluster into stumbles/issues/signals/experiments, then update agents.

The synthesis in one sentence

Don’t build “a smarter assistant.” Build an operating system for agent work: a visible control room for governance, isolated specialist runtimes for execution, and a production eval spine that turns failures into durable institutional memory.

Shann’s contribution

Hermes should be operated like a fleet: one VPS, one control room, specialist containers, optional orchestrator, optional task bus, clear runbooks, and a separation between the “brain” that defines the fleet and the “body” that runs it.

Hylak’s contribution

Agent evaluation should be floor-raising, not benchmark-maxxing: inspect real traces, identify the first true failure, fix failure classes, and keep only high-signal regression cases that prevent embarrassing repeat failures.

What each source teaches

1. Hermes Agent Operator Handbook

Source: Shann Holmberg’s X article/thread and control-room template repo.

“the agent control room is the side control plane. it is not an agent you chat through. it is a folder at /root/vps-agents that documents and governs the whole fleet.”

Key move: separate control from runtime. The control room is docs, rules, runbooks, env maps, architecture, and governance. The live runtime is secrets, memory, skills, sessions, crons, logs, and state.

one VPS · specialist agents · orchestrator later · no raw secrets in control docs

2. How to evaluate AI agents

Source: Ben Hylak’s howtoeval.com guide and Raindrop Workshop material.

“A floor-raising eval suite is a memory of bugs you refuse to reintroduce.”

Key move: stop treating eval as a big abstract score. For production agents, eval is detective work on real traces: last successful step, first real failure, missed retrieval, ignored context, bad tool call, overclaimed final answer.

code-aware evals · production traces · 20 high-signal cases · self-healing loop

The combined operating model

Operator cockpit. Control room: agents, runbooks, permissions, env references, backups, ports, task bus, escalation rules.

→

Specialist runtimes. Isolated Hermes containers/profiles: PO, GBrain, Signal Radar, Open Tabs, research, dev, ops.

→

Eval spine. Trace capture, replay, golden cases, issue taxonomy, experiments, regression memory.

The three planes

1. Control plane. The system’s explicit map. It answers: what agents exist, where do they run, what can they touch, how do they fail, how do we restart/rebuild them, what requires Connor approval?

2. Runtime plane. The actual workers. Each specialist has its own data, env, skills, sessions, crons, logs, state, and backup policy. The orchestrator is only another agent with routing rights, not a god-agent.

3. Learning plane. The eval and memory loop. Production stumbles become issues; recurring issues become signals; signals become experiments; successful experiments become changes and high-signal regression cases.
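
The promotion chain in the learning plane can be sketched as a simple counting rule. This is a minimal illustration, not something either source specifies: the `Stumble` type, the `classify` function, and both thresholds are hypothetical names and values chosen for the example.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Stumble:
    """One production misstep, labeled with a failure class."""
    agent: str
    failure_class: str


# Hypothetical thresholds: occurrences needed to reach each stage.
ISSUE_THRESHOLD = 1
SIGNAL_THRESHOLD = 3


def classify(stumbles: list[Stumble]) -> dict[str, str]:
    """Map each failure class to 'issue' or 'signal' by how often it recurs."""
    counts = Counter(s.failure_class for s in stumbles)
    stages: dict[str, str] = {}
    for failure_class, n in counts.items():
        if n >= SIGNAL_THRESHOLD:
            stages[failure_class] = "signal"  # recurring: candidate for an experiment
        elif n >= ISSUE_THRESHOLD:
            stages[failure_class] = "issue"   # tracked, not yet a pattern
    return stages


stumbles = [
    Stumble("signal-radar", "stale retrieval"),
    Stumble("gbrain", "stale retrieval"),
    Stumble("personal-orchestrator", "stale retrieval"),
    Stumble("open-tabs", "overclaim"),
]
stages = classify(stumbles)
print(stages)
```

Counting by failure class rather than by agent is a deliberate choice here: the same class recurring across specialists is what makes it a signal.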

What this means for Connor’s stack

System	Adopt this	Why it matters
Hermes	Create a control-room registry for every live agent/profile/cron/tool lane: inventory, runbook, env-map, backup, permissions, golden cases, failure modes.	Hermes becomes operable infrastructure, not a mysterious assistant process.
Personal Orchestrator	Define PO critical paths and unacceptable failures before adding more autonomy: decision suggestions, Open Tabs mutation, GBrain writes, message sending, cron actions.	PO can only become more autonomous when “prove done” is replayable and auditable.
GBrain	Store failures as first-class learning objects: trace, diagnosis, failure class, fix, eval status, superseded-by, source session.	GBrain becomes “memory of bugs we refuse to reintroduce,” not just a notes graph.
Signal Radar	Use Connor-sent links as taste/lens seeds, but also ingest agent/product failures as signal: repeated misses, noisy recommendations, source coverage gaps.	Signal Radar should detect both external opportunities and internal system drift.
Open Tabs	Treat it as a specialist workflow with approvals, not a generic todo list: mutations should be traced and replayable.	Attention tooling can become trusted only when every change has provenance.
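
The GBrain row lists the fields a failure object should carry. A minimal sketch of such a record follows, assuming a flat dataclass serialized to JSON; the class name, the `eval_status` values, and the `is_open` rule are hypothetical, and the sample IDs are placeholders.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class FailureRecord:
    """A failure stored as a first-class learning object."""
    trace_id: str
    diagnosis: str
    failure_class: str
    fix: Optional[str] = None
    # Hypothetical status values: "none", "candidate", "regression", "ignored".
    eval_status: str = "none"
    superseded_by: Optional[str] = None
    source_session: Optional[str] = None

    def is_open(self) -> bool:
        """Open means: not fixed, not superseded, not consciously ignored."""
        return (
            self.fix is None
            and self.superseded_by is None
            and self.eval_status != "ignored"
        )


record = FailureRecord(
    trace_id="trace-0001",          # placeholder ID
    diagnosis="final answer claimed a write that never happened",
    failure_class="overclaim",
    source_session="session-0001",  # placeholder ID
)
serialized = json.dumps(asdict(record), sort_keys=True)
print(record.is_open(), serialized)
```

The "ignored" status gives the consciously ignored one-off an explicit home, so it does not linger as an open item.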

Concrete implementation path

Order matters: do not build the orchestrator first. Register one specialist fully, prove its runbook and eval loop, then add more specialists, then add routing.

Level 1

One documented agent. Choose one real agent lane. Write inventory, runbook, env-map, backup policy, permissions, 5 golden cases.

Level 2

Direct specialists. Add 2–3 specialists, still manually invoked. No orchestrator. Measure friction and repeated handoffs.

Level 3

Orchestrator. Add one front door only when specialist lanes are useful and well-documented. It reads the control room.

Level 4

Automated team. Crons, health checks, backup verification, recurring workflows, production monitoring, experiment loop.
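
The gate between Level 2 and Level 3 can be checked mechanically. The sketch below assumes each agent has a folder whose documents use the file names from the control-room layout in this note; `missing_docs` and `ready_for_orchestrator` are hypothetical helpers, and the demo runs in a temporary directory rather than on a real VPS.

```python
import tempfile
from pathlib import Path

# Documents an agent lane needs before it counts as "documented" (assumed set).
REQUIRED_DOCS = (
    "inventory.md", "runbook.md", "env-map.md",
    "permissions.md", "backup.md", "evals.md",
)


def missing_docs(agent_dir: Path) -> list[str]:
    """Return the required documents that are absent from one agent folder."""
    return [name for name in REQUIRED_DOCS if not (agent_dir / name).is_file()]


def ready_for_orchestrator(agents_root: Path) -> bool:
    """True only when at least one agent exists and none has gaps."""
    agent_dirs = [p for p in agents_root.iterdir() if p.is_dir()]
    return bool(agent_dirs) and all(not missing_docs(d) for d in agent_dirs)


with tempfile.TemporaryDirectory() as tmp:
    agents_root = Path(tmp) / "agents"
    documented = agents_root / "signal-radar"
    documented.mkdir(parents=True)
    for name in REQUIRED_DOCS:
        (documented / name).write_text("placeholder\n")
    ready_before = ready_for_orchestrator(agents_root)

    # A second specialist with no documents blocks the gate.
    (agents_root / "open-tabs").mkdir()
    ready_after = ready_for_orchestrator(agents_root)
    gaps = missing_docs(agents_root / "open-tabs")

print(ready_before, ready_after, gaps)
```

File presence is only a proxy: it shows the documents exist, not that the runbook actually works.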

Minimum viable control-room folder

/root/agent-control-room/
  README.md
  agents/
    personal-orchestrator/
      inventory.md
      runbook.md
      env-map.md
      permissions.md
      backup.md
      evals.md
      known-failures.md
    signal-radar/
    gbrain/
    open-tabs/
  shared/
    security.md
    commands.md
    escalation.md
  task-bus/
  experiments/
  failure-taxonomy.md
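
A scaffold script for the tree above might look like the following. It is a sketch under two assumptions that the layout does not state: every agent folder gets the same document set shown for personal-orchestrator, and new files start as TODO stubs. The demo writes to a temporary directory instead of /root.

```python
import tempfile
from pathlib import Path

AGENTS = ("personal-orchestrator", "signal-radar", "gbrain", "open-tabs")
# Assumption: each agent gets the document set listed under personal-orchestrator.
AGENT_DOCS = (
    "inventory.md", "runbook.md", "env-map.md", "permissions.md",
    "backup.md", "evals.md", "known-failures.md",
)
SHARED_DOCS = ("security.md", "commands.md", "escalation.md")


def scaffold(root: Path) -> list[Path]:
    """Create the control-room tree under root without overwriting files."""
    files = [root / "README.md", root / "failure-taxonomy.md"]
    files += [root / "agents" / a / d for a in AGENTS for d in AGENT_DOCS]
    files += [root / "shared" / d for d in SHARED_DOCS]

    for directory in (root / "task-bus", root / "experiments"):
        directory.mkdir(parents=True, exist_ok=True)

    created = []
    for path in files:
        path.parent.mkdir(parents=True, exist_ok=True)
        if not path.exists():
            path.write_text(f"# {path.stem}\n\nTODO\n")
            created.append(path)
    return created


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "agent-control-room"
    first_run = scaffold(root)
    second_run = scaffold(root)  # idempotent: nothing new is created
    has_runbook = (root / "agents" / "gbrain" / "runbook.md").is_file()

print(len(first_run), len(second_run), has_runbook)
```

Skipping existing files keeps the script safe to rerun after documents have been filled in by hand.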

Minimum viable eval spine

Pick 5–10 golden cases for one workflow; each must run the real path, not a prompt-only mock.
Capture trace: user input, retrieved context, tool calls, file/DB writes, messages sent, final response, and verification output.
Label first failure, not just outcome: missing context, stale retrieval, bad tool, overclaim, loop, wrong source, unsafe side effect, bad handoff.
Fix the failure pattern, not just the single instance.
Add only high-signal regression cases. Prune stale edge cases.
Run the production learning loop: Stumbles → Issues → Signals → Experiments.
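
Labeling the first failure rather than the outcome can be sketched with a small trace structure. The `Step` fields and both helper functions are hypothetical; the failure-class strings reuse the labels from the list above, and the sample trace is invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Step:
    """One recorded step in a trace."""
    kind: str                       # e.g. "user_input", "retrieval", "tool_call"
    ok: bool
    failure_class: Optional[str] = None


def first_failure(trace: list[Step]) -> Optional[tuple[int, Step]]:
    """Return (index, step) for the earliest failed step, or None."""
    for index, step in enumerate(trace):
        if not step.ok:
            return index, step
    return None


def last_successful_step(trace: list[Step]) -> Optional[Step]:
    """Return the step just before the first failure (or the final step)."""
    failure = first_failure(trace)
    end = len(trace) if failure is None else failure[0]
    return trace[end - 1] if end > 0 else None


# Invented sample: the overclaim at the end is downstream of stale retrieval.
trace = [
    Step("user_input", ok=True),
    Step("retrieval", ok=False, failure_class="stale retrieval"),
    Step("tool_call", ok=True),
    Step("final_response", ok=False, failure_class="overclaim"),
]
index, step = first_failure(trace)
print(index, step.failure_class, last_successful_step(trace).kind)
```

In this sample the visible symptom is the overclaimed answer, but the label that matters is the earlier stale retrieval.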

How Signal Radar should learn from these links

External source learning

@shannholmberg and @benhylak are not just accounts. They are lens seeds for two lanes: agent-operator/control-room patterns and production agent evaluation patterns. Promote a future post only when its text matches the lens mechanism; do not promote every post by the author.

Internal failure learning

Signal Radar should also watch the system itself: missed exact links, noisy digests, broken collectors, bad source identity, false-positive lens promotions, and failure to surface Connor-marked “signal.” Those are product signals.

New Signal Radar lenses to encode

agent control room · specialist agents · one VPS · code-aware evals · floor raising · self-healing loops · trace replay · golden cases · agent failure taxonomy
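
The rule of promoting on item text rather than on author can be sketched as below. Everything here is an assumption for illustration: the split of lens phrases across the two lanes, the lane names, the `Item` type, and plain substring matching, which stands in for whatever matcher Signal Radar actually uses. The post texts are made-up samples.

```python
from dataclasses import dataclass

# Assumed grouping of the lens phrases into the two lanes named above.
LENSES = {
    "agent-operator": ("agent control room", "specialist agents", "one vps"),
    "agent-eval": (
        "code-aware evals", "floor raising", "self-healing loops",
        "trace replay", "golden cases", "agent failure taxonomy",
    ),
}


@dataclass
class Item:
    author: str
    text: str


def matching_lanes(item_text: str) -> list[str]:
    """Return the lanes whose lens phrases appear in the item text."""
    text = item_text.lower()
    return [
        lane for lane, phrases in LENSES.items()
        if any(phrase in text for phrase in phrases)
    ]


def should_promote(item: Item) -> bool:
    """Promote on text match only; the author handle is deliberately ignored."""
    return bool(matching_lanes(item.text))


on_topic = Item("@benhylak", "Notes on trace replay and golden cases for production agents")
off_topic = Item("@benhylak", "Conference travel photos")
print(should_promote(on_topic), should_promote(off_topic))
```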

Warnings

Do not orchestrate too early. An orchestrator over undocumented specialists creates routing theater.
Do not give the orchestrator every credential. It should route and synthesize; specialists should hold scoped permissions.
Do not turn every bug into an eval. Twenty high-signal cases beat two hundred stale weird cases.
Do not trust pass rate alone. Ask “which 1% failed?” before celebrating 99%.
Do not store raw secrets in the control room. Store references, owners, rotation procedures, and risk notes.
Do not confuse self-diagnostics with truth. Agent-reported failures are useful stumbles, not ground truth.
Do not add automation before recovery. Manual runbooks and rollback need to work first.
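
The raw-secrets warning lends itself to an automated check over control-room documents. The sketch below is a rough heuristic, not a complete secret scanner: both regular expressions and the `find_raw_secrets` helper are hypothetical, and the sample text uses a fake credential.

```python
import re

# Heuristic patterns only; a real scanner would cover many more formats.
SECRET_PATTERNS = {
    "private key block": re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
    "inline credential": re.compile(
        r"(?i)\b(api[_-]?key|secret|token|password)\b\s*[:=]\s*['\"]?[A-Za-z0-9_\-]{16,}"
    ),
}


def find_raw_secrets(text: str) -> list[tuple[int, str]]:
    """Return (line_number, pattern_label) for each suspicious line."""
    findings = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for label, pattern in SECRET_PATTERNS.items():
            if pattern.search(line):
                findings.append((line_number, label))
    return findings


# Line 1 is a reference (allowed); line 2 embeds a fake value (flagged).
env_map = (
    "EXAMPLE_API_KEY: reference only, value lives in the runtime env\n"
    "api_key = abcd1234abcd1234abcd1234\n"
)
findings = find_raw_secrets(env_map)
print(findings)
```

A check like this could run before each control-room commit, with hits treated as blockers until a human confirms they are references rather than values.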

Evidence and source notes

Confidence: high for the synthesis. The X UI was partially gated/truncated, so extraction used public X page snapshots, direct images/OCR, FXTwitter API for the long X article, public repo contents, and howtoeval.com text.