Making Agents Intelligent

North star

Hermes becomes a continuously running personal intelligence control plane: quiet by default, proactive when useful, safe by construction, and increasingly capable because its harness learns from every source, decision, action, and correction.

Outcome

Agent as operating system

Hermes is not the chat box. Hermes is the runtime that routes evidence into attention, memory, proposal, skill, and execution surfaces.

Shape

Co-arising faculties

Signal filtering, prioritisation, recall, synthesis, generation, skill use, and proactive care inform each other in the same tick.

Safety

Evidence first, action gated

Raw evidence can flow automatically. Derived interpretations mutate state only after passing gates on policy, confidence, reversibility, and proof.

Non-simplification rule: this roadmap must not collapse the vision into seven modules. The implementation work is to create a shared substrate where faculties compose recursively while retaining auditability and human control.

Operating principles

These are the constraints that keep the roadmap aligned with the vision rather than drifting into a noisy automation system.

1. Raw evidence flows automatically

Transcripts, events, metadata, and evidence pointers become citable substrate without forcing Connor to triage every fact at intake time.

2. Derived mutation is proposed

Compiled memory, Open Tabs writes, outbound messages, public publishing, code changes, and external mutations require gates unless explicitly allowlisted.

3. No new inbox

Most intelligence compiles quietly into memory, Pulse, rankings, or suppression rules. Only high-leverage choices interrupt.

4. Every action proves done

Applied work records a verifiable handle: slug, file path, URL, Open Tab ID, proposal ID, command output, or test evidence.

5. Feedback changes future behavior

Approvals, rejections, ignored proposals, edits, and corrections become scoring, suppression, policy, skills, templates, and memory updates.

6. Autonomy is graduated

The system earns permission by risk class: observe → draft → propose → apply local reversible change → execute allowlisted task → external action.
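A minimal sketch of principles 4 and 6, assuming the ladder can be modelled as an ordered enum. The names `AutonomyLevel`, `AppliedAction`, and `gate` are hypothetical and are not part of the existing HCP code.

```python
from dataclasses import dataclass
from enum import IntEnum


class AutonomyLevel(IntEnum):
    """Rungs of the ladder, ordered from least to most risk (assumed ordering)."""
    OBSERVE = 0
    DRAFT = 1
    PROPOSE = 2
    APPLY_LOCAL_REVERSIBLE = 3
    EXECUTE_ALLOWLISTED = 4
    EXTERNAL_ACTION = 5


@dataclass(frozen=True)
class AppliedAction:
    """Applied work must carry a verifiable handle (path, URL, proposal ID, ...)."""
    description: str
    proof_handle: str

    def __post_init__(self) -> None:
        if not self.proof_handle.strip():
            raise ValueError("applied work must record a verifiable handle")


def gate(required: AutonomyLevel, granted: AutonomyLevel) -> str:
    """Apply directly only when the action fits within the granted rung."""
    return "apply" if required <= granted else "propose"
```

Anything above the granted rung falls back to a proposal rather than failing, which keeps the default path quiet and reversible.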

Architecture: one field, many organs

The roadmap should make Hermes into a nervous system: sources feed evidence; evidence forms primitives; primitives convert into other primitives; proposals become work; decisions become learning.

01 Sources: Gmail, Calendar, Contacts, Telegram, Open Tabs, Signal Radar, GBrain, browser/SaaS, finance CSVs.

02 Evidence: Immutable/citable raw events, refs, snippets, metadata, provenance, confidence.

03 Primitives: Signal, open loop, memory candidate, pattern, proposal candidate, skill gap, job.

04 Rules: Typed conversions with utility class, risk, suppression, explanation, and fan-out limits.

05 Surfaces: Pulse, Open Tabs, proposal inbox, GBrain pages, docs, local UI.

06 Execution: Goal contracts, skills, delegated agents, allowlisted commands, verification proofs.

07 Learning: Feedback scoring, filters, prioritisation, memory policy, synthesis, templates, skills, autonomy thresholds.
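The evidence, primitive, and rule layers can be sketched as typed records. This is an illustration under assumptions, not the HCP schema: the class names, field names, and the `to_signal` helper are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Evidence:
    """Immutable, citable raw event."""
    ref: str           # pointer back to the raw event, e.g. "gmail:123"
    source: str
    snippet: str
    confidence: float


@dataclass(frozen=True)
class Primitive:
    kind: str                      # e.g. "signal", "open_loop", "proposal_candidate"
    summary: str
    evidence_refs: Tuple[str, ...]


def to_signal(evidence: Evidence) -> Primitive:
    """Form a signal primitive that keeps its evidence pointer."""
    return Primitive("signal", evidence.snippet, (evidence.ref,))


@dataclass(frozen=True)
class ConversionRule:
    """Typed conversion from one primitive kind to another, with a fan-out cap."""
    from_kind: str
    to_kind: str
    risk: str
    max_fan_out: int
    convert: Callable[[Primitive], List[Primitive]]

    def apply(self, primitive: Primitive) -> List[Primitive]:
        if primitive.kind != self.from_kind:
            return []
        # Drop wrongly typed outputs, then enforce the fan-out limit.
        typed = [p for p in self.convert(primitive) if p.kind == self.to_kind]
        return typed[: self.max_fan_out]
```

Capping fan-out inside the rule, rather than at the surface, is one way to stop a single noisy signal from multiplying into many proposals.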

Roadmap phases

The build should advance in capability horizons. Each phase produces a demonstrable faculty improvement and a durable proof artifact.

Phase 0

Stabilise the control plane baseline

Turn the current HCP MVP into a reliable substrate before adding more intelligence.

Now → 1 week

Fix the hanging test path. Resolve `test_execution_runner_processes_ready_goal_contracts` so full pytest is trustworthy again.

Cut over to Postgres. Set `HCP_DATABASE_URL`, run the import plan, soak, and update doctor/runtime checks.

Add primitive inspection. Implement `hcp primitives list|explain` so the graph is human-inspectable, not only rebuildable.

Drain proposal backlog. Resolve or suppress remaining stale/noisy pending proposals to reduce baseline attention debt.

Proof of done

Full test suite passes without deselection.
`hcp doctor --json` reports Postgres primary and ok.
`hcp primitives list --json` shows typed primitive records.
Pending proposal inbox is low-noise and representative.

Phase 1

Explicit faculty model inside the runtime

Make the seven faculties first-class evaluation and routing dimensions.

1–2 weeks

Define faculty packets. Every runtime tick produces a structured assessment: discernment, focus, knowledge, pattern, creativity, competence, conscientiousness.

Map primitives to faculties. Each primitive records which faculty produced it, which faculty it improves, and which substrate should learn.

Add faculty scorecard. Show where the agent is improving or failing: noise, missed urgency, memory entropy, pattern quality, proposal quality, skill gaps, autonomy safety.

Record decision feedback by faculty. Each rejection should record whether the failure was one of discernment, focus, creativity, policy, or competence.

Proof of done

`hcp faculties scorecard --json` exists.
Proposal explanations name the faculty move and failure mode.
Feedback changes future rankings/suppression by faculty.
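One possible shape for a faculty packet and for feedback grouped by failure class is sketched below. The packet fields, the feedback dictionary keys, and the function name are assumptions; the actual schema is still to be defined in the faculty runtime spec.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Mapping

# The seven faculties named in this roadmap.
FACULTIES = (
    "discernment", "focus", "knowledge", "pattern",
    "creativity", "competence", "conscientiousness",
)


@dataclass
class FacultyPacket:
    """Structured per-tick assessment, one note per faculty (hypothetical schema)."""
    tick_id: str
    assessments: Dict[str, str] = field(default_factory=dict)

    def assess(self, faculty: str, note: str) -> None:
        if faculty not in FACULTIES:
            raise ValueError("unknown faculty: " + faculty)
        self.assessments[faculty] = note

    def missing(self) -> List[str]:
        """Faculties that have not reported in this tick."""
        return [f for f in FACULTIES if f not in self.assessments]


def rejections_by_failure(feedback: Iterable[Mapping]) -> Counter:
    """Count rejected decisions by labelled failure class (a faculty or "policy")."""
    return Counter(
        item["failure"] for item in feedback if item["decision"] == "rejected"
    )
```

A scorecard command could aggregate these counts over time; the sketch only shows the grouping step.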

Phase 2

Self-improving discernment and focus

Make the system better at deciding what matters and what deserves attention now.

2–4 weeks

Per-source filter library. Learn source-specific signal/noise patterns from approvals, rejections, ignored items, and explicit corrections.

Attention budget model. Cap daily interruptions by urgency, reversibility, deadline, relationship cost, goal relevance, and cognitive load.

Open Tabs prioritisation engine. Rank unresolved loops by leverage, decay, closability, deadline, and repeated reappearance.

Suppression with escape hatches. Quiet noisy semantic clusters until materially new evidence appears.

Proof of done

Daily Pulse gets shorter while preserving high-value recalls.
Proposal inbox precision improves over a decision-labeled evaluation set.
Open Tabs weekly review shows fewer stale high-value loops.
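As a rough illustration of ranking under an attention budget, the sketch below scores loops with a weighted sum and keeps only the top few. The weights, the 0 to 1 input scales, and every name here are assumptions chosen for the example; decay is left out for brevity.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class OpenLoop:
    title: str
    leverage: float                    # 0..1, assumed scale
    closability: float                 # 0..1, assumed scale
    days_to_deadline: Optional[float]  # None when no deadline evidence exists
    reappearances: int


def score(loop: OpenLoop) -> float:
    """Weighted sum; weights are placeholders to be learned from feedback."""
    if loop.days_to_deadline is None:
        urgency = 0.0
    else:
        urgency = 1.0 / (1.0 + max(loop.days_to_deadline, 0.0))
    recurrence = min(loop.reappearances, 5) / 5
    return (0.4 * loop.leverage + 0.2 * loop.closability
            + 0.3 * urgency + 0.1 * recurrence)


def select_for_today(loops: List[OpenLoop], budget: int) -> List[OpenLoop]:
    """Rank by score and surface at most `budget` loops."""
    ranked = sorted(loops, key=score, reverse=True)
    return ranked[: max(budget, 0)]
```

With a budget of 1, only the single highest-scoring loop is surfaced and the rest stay quiet until the next tick.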

Phase 3

Self-improving knowledge and pattern recognition

Make memory compounding and synthesis reliable enough to drive future work.

1–2 months

GBrain live salience pull. Integrate recent salience, contradictions, trajectory, health, and expert routing into HCP ticks.

Memory policy engine. Decide hot fact vs concept page vs entity page vs proposal vs suppression, with provenance and expiry.

Pattern synthesis jobs. Periodically detect cross-source themes: recurring pain, emerging opportunities, relationship dynamics, investment theses, project bottlenecks.

Contradiction and drift loop. Memory regressions become reconcile/simplify/research proposals with evidence packets.

Proof of done

GBrain-backed synthesis produces cited proposal candidates.
Contradictions/drift create bounded, explainable proposals.
Memory writes are deduped, source-backed, and recoverable.
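The last proof point (deduped, source-backed, recoverable writes) can be illustrated with an in-memory store. This is a sketch under assumptions: `MemoryStore` and `MemoryWrite` are hypothetical names, and dedupe here is exact-match on normalised text, which is a simpler stand-in for whatever semantic dedupe GBrain uses.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, List, Optional


@dataclass(frozen=True)
class MemoryWrite:
    text: str
    source_ref: str                  # provenance pointer, required
    expires_at: Optional[datetime]   # None means no expiry


class MemoryStore:
    def __init__(self) -> None:
        self._records: Dict[str, MemoryWrite] = {}
        self._journal: List[str] = []   # keys in write order, for undo

    @staticmethod
    def _key(text: str) -> str:
        normalised = " ".join(text.lower().split())
        return hashlib.sha256(normalised.encode("utf-8")).hexdigest()

    def write(self, record: MemoryWrite) -> bool:
        """Store the record; return False if an equivalent fact already exists."""
        if not record.source_ref:
            raise ValueError("memory writes must be source-backed")
        key = self._key(record.text)
        if key in self._records:
            return False
        self._records[key] = record
        self._journal.append(key)
        return True

    def undo_last(self) -> Optional[MemoryWrite]:
        """Reverse the most recent write, if any."""
        if not self._journal:
            return None
        return self._records.pop(self._journal.pop())

    def live(self, now: datetime) -> List[MemoryWrite]:
        """Records that have not expired as of `now`."""
        return [r for r in self._records.values()
                if r.expires_at is None or r.expires_at > now]
```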

Phase 4

Self-improving creativity and proposal generation

Move from “notice things” to “generate high-leverage possible futures.”

2–3 months

Proposal template library. Encode reusable forms: research, simplification, reconciling memory, outreach draft, build slice, investment memo, system improvement.

Creative alternatives. For important loops, generate multiple shapes rather than one obvious recommendation.

Novelty and usefulness scoring. Compare generated proposals against memory, goals, cost, reversibility, and past decisions.

Doc/artifact generator. Promote major ideas into polished docs with source links and implementation backlogs.

Proof of done

Proposals have type-specific templates and acceptance criteria.
Important recommendations include alternatives and tradeoffs.
Approved creative proposals improve future proposal templates.
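The roadmap does not specify how novelty is scored. As one simple stand-in, the sketch below uses Jaccard similarity over word sets to compare a candidate against past proposals; the function names are hypothetical and a production scorer would likely use richer signals such as goals, cost, and past decisions.

```python
from typing import Iterable, Set


def tokens(text: str) -> Set[str]:
    """Lowercased whitespace-separated word set."""
    return set(text.lower().split())


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Size of intersection over size of union; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def novelty(candidate: str, past: Iterable[str]) -> float:
    """1 minus the highest similarity to any past proposal (1.0 if none exist)."""
    cand = tokens(candidate)
    similarities = [jaccard(cand, tokens(p)) for p in past]
    return 1.0 - max(similarities) if similarities else 1.0
```

A candidate identical to a past proposal scores 0.0, and one sharing no words with any past proposal scores 1.0.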

Phase 5

Competence: skills, agents, and verified execution

Let Hermes do bounded work safely, then learn from how the work went.

3–4 months

Skill gap detector. When a task repeats or fails, propose creating or patching a Hermes skill.

Goal contract compiler v2. Convert approved proposals into SPEC/PLAN/STATUS/VALIDATION with source refs and edit boundaries.

Allowlisted execution runner. Run low-risk local jobs with verification, logs, and rollback/proof handles.

Two-stage review loop. Spec compliance review, then code quality/safety review, before reporting completion.

Proof of done

`hcp jobs run-ready` can execute allowlisted goal contracts end-to-end.
Every completed job updates status and records validation evidence.
Repeated work improves a skill/template/policy.
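A minimal sketch of an allowlisted runner that returns a proof record, assuming the allowlist is keyed on the executable name. The function `run_allowlisted` and the shape of the returned record are hypothetical; this is not the existing `hcp jobs run-ready` implementation.

```python
import subprocess
import sys
from typing import Dict, List, Set


def run_allowlisted(command: List[str], allowlist: Set[str],
                    timeout_s: float = 30.0) -> Dict[str, object]:
    """Run a command only if its executable is allowlisted; always return a record."""
    if not command or command[0] not in allowlist:
        return {"status": "refused", "returncode": None, "stdout": ""}
    try:
        # No shell is involved, and the timeout prevents a hung job from blocking the runner.
        result = subprocess.run(command, capture_output=True, text=True,
                                timeout=timeout_s, check=False)
    except subprocess.TimeoutExpired:
        return {"status": "timeout", "returncode": None, "stdout": ""}
    status = "done" if result.returncode == 0 else "failed"
    return {"status": status, "returncode": result.returncode,
            "stdout": result.stdout}


if __name__ == "__main__":
    # Example: only the current Python interpreter is allowlisted.
    print(run_allowlisted([sys.executable, "-c", "print('ok')"], {sys.executable}))
```

Returning a record for refused and timed-out jobs, instead of raising, means every outcome leaves validation evidence that a status update can cite.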

Phase 6

Conscientiousness: autonomous care within policy

Make Hermes omnipresent in the background without becoming noisy or unsafe.

4–6 months

Autonomy ladder. Observe, draft, propose, auto-apply low-risk local, execute allowlisted, ask before external/risky.

Background care loops. Quietly watch deadlines, stale commitments, system health, project drift, memory contradictions, and repeated bottlenecks.

Interruption policy. Enforce timing, bundling, urgency, relationship sensitivity, and “do not interrupt unless…” rules.

Conscience audit. Weekly report: what Hermes did, suppressed, proposed, learned, and should change about its own policy.

Proof of done

Most useful background work happens without Telegram chatter.
High-risk actions always require explicit confirmation.
Weekly audit shows concrete policy/skill/filter improvements.
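The interruption policy could be expressed as a small routing function. The sketch below assumes three outcomes (confirm, interrupt, bundle) and a quiet-hours window; the window times, the `Notice` fields, and the choice to bundle urgent items during quiet hours are all assumptions for illustration.

```python
from dataclasses import dataclass
from datetime import time


@dataclass(frozen=True)
class Notice:
    title: str
    urgent: bool
    high_risk: bool


def in_quiet_hours(now: time, start: time, end: time) -> bool:
    """True if `now` falls in [start, end), including windows that wrap midnight."""
    if start <= end:
        return start <= now < end
    return now >= start or now < end


def route(notice: Notice, now: time,
          quiet_start: time = time(21, 0), quiet_end: time = time(8, 0)) -> str:
    """Decide how a notice reaches the user."""
    if notice.high_risk:
        return "confirm"      # high-risk actions always need explicit confirmation
    if notice.urgent and not in_quiet_hours(now, quiet_start, quiet_end):
        return "interrupt"
    return "bundle"           # everything else waits for the next digest
```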

Parallel build lanes

Phases sequence capability maturity, but implementation should run across durable lanes so progress compounds rather than waiting for one giant rewrite.

Data substrate

Postgres primary, context ledger, evidence refs, source registry, primitive graph, provenance, health.

Faculty runtime

Per-tick faculty packets, routing, conversion rules, scoring, suppression, policy, scorecards.

Memory and synthesis

GBrain bridge, hot facts, entity pages, contradictions, salience, trajectories, pattern synthesis jobs.

Proposal system

Templates, explanations, alternatives, risk levels, approval/apply paths, inbox bundling, feedback learning.

Execution and skills

Goal contracts, jobs runner, validation evidence, skill creation/patching, delegated agent orchestration.

Surfaces

Telegram decisions, Pulse, Open Tabs, local UI, docs publishing, weekly conscience audit.

Success metrics by faculty

The system is intelligent only if the harness measurably improves. These metrics should become scorecards in the control plane.

| Faculty | What improves | Measurement |
| --- | --- | --- |
| Discernment | Signal vs noise | Proposal precision, suppression correctness, false-negative recall audits. |
| Focus | Now vs later | Deadline misses, stale high-value tabs, interruption load, time-to-surface urgent loops. |
| Knowledge | Remember vs entropy | Useful recall rate, duplicate memory rate, provenance coverage, contradiction resolution time. |
| Pattern recognition | Connect vs isolate | Cited synthesis quality, cross-source theme detection, repeated-pain detection before Connor says it. |
| Creativity | Generate vs imitate | Accepted proposal novelty/usefulness, alternatives generated for important decisions, doc/artifact usefulness. |
| Competence | Learnt ability vs static ability | Skill coverage, repeated task time reduction, validation pass rate, fewer repeated mistakes. |
| Conscientiousness | Proactive vs prompted | Useful autonomous actions, correctly gated risky actions, quiet helpfulness, weekly audit improvements. |
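Two of these measurements are simple enough to sketch directly. The definitions below are assumptions: proposal precision is taken as approved over approved plus rejected (ignored items excluded), and duplicate memory rate as the share of writes whose dedupe key repeats an earlier one.

```python
from typing import List, Optional


def proposal_precision(decisions: List[str]) -> Optional[float]:
    """Approved / (approved + rejected); None when nothing has been labelled."""
    approved = decisions.count("approved")
    rejected = decisions.count("rejected")
    total = approved + rejected
    return approved / total if total else None


def duplicate_rate(keys: List[str]) -> float:
    """Fraction of writes that repeat an earlier dedupe key."""
    if not keys:
        return 0.0
    return 1 - len(set(keys)) / len(keys)
```

Returning `None` for an unlabelled set keeps a scorecard from reporting a misleading 0% or 100% before any decisions exist.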

Failure modes to engineer away

The roadmap is as much about preventing bad intelligence as creating good intelligence.

Noise amplification

Rails: per-source filters, bundling, feedback suppression, attention budget, explicit ignore rules.

Memory entropy

Rails: provenance, dedupe, expiry, contradiction checks, fact/page separation, reversible writes.

False urgency

Rails: now/later model, deadline evidence, reversibility, opportunity cost, interruption thresholds.

Proposal spam

Rails: top-N inbox, taxonomy, owner/risk/done/proof, semantic bundling, rejected-cluster decay.

Unbounded autonomy

Rails: autonomy ladder, allowlists, dry-runs, confirmation gates, rollback, proof artifacts.

Isolated subsystems

Rails: primitive graph, typed conversions, shared evidence refs, faculty packets, cross-surface proof.
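Two of the rails above, rejected-cluster decay and suppression that lifts when new evidence arrives, can be sketched together. The exponential half-life, the threshold, and the function names are assumptions; the roadmap does not prescribe a decay curve.

```python
def suppression_weight(rejections: int, days_since_last_rejection: float,
                       half_life_days: float = 14.0) -> float:
    """Suppression strength for a cluster, halving every `half_life_days`."""
    return rejections * 0.5 ** (days_since_last_rejection / half_life_days)


def is_suppressed(weight: float, has_new_evidence: bool,
                  threshold: float = 1.0) -> bool:
    """Escape hatch: materially new evidence always lifts suppression."""
    return weight >= threshold and not has_new_evidence
```

For example, a cluster rejected twice stays suppressed for one half-life and then resurfaces unless it is rejected again.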

Next build slice

The immediate slice should not try to build “full intelligence.” It should make the current control plane stable, inspectable, and ready to express the faculty model.

Slice A — Control-plane stabilisation

Fix hanging execution-runner test.
Postgres cutover and doctor verification.
Primitive list/explain CLI.
Backlog drain / suppression cleanup.

Slice B — Faculty runtime specification

Create `FACULTY_RUNTIME_SPEC.md`.
Define faculty packet schema.
Map primitive conversions to faculties.
Define faculty scorecard metrics and CLI contract.

Recommended next action: create `/root/hermes-control-plane/docs/faculty-runtime-roadmap.md` plus an implementation plan for Slice A and Slice B. Then execute Slice A first so the substrate is stable before adding the new faculty layer.