SkillOpt-inspired self-improvement plan for the Personal Orchestrator
Date: 2026-05-27
Source prompt: Connor linked Muratcan Koylan’s X post about SkillOpt / “gradient descent for SKILL.md files” and asked how to incorporate the idea into faculties, context assets, extractors, and the Personal Orchestrator layer E2E.
External reference: arXiv:2605.23904v2, “SkillOpt: Executive Strategy for Self-Evolving Agent Skills” by Yifan Yang et al.; tweet: https://x.com/koylanai/status/2059113412278227328
Internal evidence: quick PO consult /root/personal-orchestrator/consults/2026-05-27T20-34-39.655068-00-00/CONSULT.md; latest full-breadth run /root/personal-orchestrator/runs/2026-05-27T19-05-51.169241-00-00.
Executive summary
SkillOpt’s core move is not “let an agent rewrite itself.” It is: treat a text artifact as trainable external state, then optimize it with the discipline of ML training: scored rollouts, bounded edits, held-out validation, rejected-edit memory, and slow/meta updates.
For the Personal Orchestrator, the equivalent trainable artifacts are broader than SKILL.md:
- Faculty prompts / faculty skills — how each cognitive role judges.
- Context-asset extractors — how source-near raw data becomes typed cognitive assets.
- Context-asset retrieval/ranking policies — what evidence each faculty sees.
- Synthesis / Agent Manager policies — what becomes a nudge, delegation, suppression, or quiet local artifact.
- Action/delegation goal templates — how approved work is decomposed and proven.
- Evaluation rubrics — how we score usefulness, focus, non-repeat, proof, autonomy, and attention cost.
The recommended upgrade is a Personal Orchestrator SkillOpt loop: a controlled optimization harness that proposes small patches to these artifacts, accepts them only if they improve held-out evals and/or real outcome metrics, stores rejected patches as negative feedback, and periodically distills stable lessons into a meta-guidance layer.
This should make PO feel less like a cron that occasionally emits good cards and more like a system that is actually learning from Connor’s reactions, its own failures, source-health evidence, and delegated-work outcomes.
What we should learn from SkillOpt
1. Optimize persistent text state, not ephemeral reasoning
SkillOpt treats the skill document as the trainable object while freezing the target model/harness. For us, the trainable state should include:
- agents/faculties/*/prompt.md
- Hermes skills used by PO, especially Agent Manager, Institutional Memory, Self-Healing, context assets, and source adapters
- context-asset manifests and extractor prompts/rules
- synthesis ranking/suppression/card-template policies
- delegation goal packet templates
- validation/eval rubrics
The important shift: faculty outputs are not just run artifacts. They are training trajectories for improving the above artifacts.
2. Use scored trajectories, not vibes
SkillOpt rollouts produce trajectories plus scalar scores. PO already has many trajectory sources:
- faculty run artifacts: faculties/*.md and faculties/*.run.json
- final surfaces: TELEGRAM_NUDGE.md, COLLABORATOR_OUTPUT.md, synthesis.json
- user feedback: state/collaborator_feedback.jsonl, Nudge Inbox feedback
- delegation lifecycle: state/delegation_backlog.jsonl, delegations/*/status.json, run.log, proof artifacts
- source-health and context-asset evidence: state/context_assets/*, heartbeat DB, cron outputs
- workbench: state/PO_WORKBENCH.md, state/po_workbench.json
We should formalize these as rollouts with score dimensions, not only archive them.
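One way to make "rollouts with score dimensions" concrete is a small record type. The sketch below is illustrative only: the `Rollout` class, its field names, and the example values are assumptions, not an existing PO schema.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List, Optional


@dataclass
class Rollout:
    """One scored trajectory; every field name here is hypothetical."""
    run_id: str
    faculty_id: str
    artifact_refs: List[str]                      # source-near files the rollout was built from
    scores: Dict[str, float] = field(default_factory=dict)  # dimension -> value in [0, 1]
    feedback: Optional[str] = None                # collaborator feedback label, if any

    def to_jsonl(self) -> str:
        # sort_keys keeps rows stable so they diff cleanly between runs
        return json.dumps(asdict(self), sort_keys=True)


rollout = Rollout(
    run_id="example-run",
    faculty_id="agent_manager",
    artifact_refs=["faculties/agent_manager.md", "synthesis.json"],
    scores={"usefulness": 0.8, "non_repeat": 1.0},
    feedback="useful",
)
row = rollout.to_jsonl()  # one line, ready to append to a rollout ledger
```

Keeping the scores as a flat dimension-to-value map means new score dimensions can be added later without rewriting archived rows.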
3. Bounded edits prevent self-improvement from becoming self-corruption
SkillOpt applies add/delete/replace edits under an edit budget. PO should never do unconstrained rewrites of a faculty, extractor, or synthesis policy just because one run was bad.
Default patch budget:
- faculty prompt: max 1–3 localized patches per accepted optimization step
- extractor: max one source/type behavior change per step
- synthesis policy: max one scoring/suppression/card-template change per step
- Hermes skill: patch an existing section before creating a new skill
- context asset schema: no schema changes without explicit migration/eval plan
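The budget above could be enforced by clipping the optimizer's proposals before anything is applied. This is a minimal sketch: the `Patch` type, the `MAX_EDITS` table, and `clip_to_budget` are hypothetical names, and the faculty-prompt budget uses the upper end of the 1–3 range.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical budgets mirroring the defaults listed above.
MAX_EDITS: Dict[str, int] = {
    "faculty_prompt": 3,
    "extractor": 1,
    "synthesis_policy": 1,
}


@dataclass(frozen=True)
class Patch:
    artifact_kind: str
    op: str            # "add", "delete", or "replace"
    section: str       # which localized section the patch touches
    rank_score: float  # priority assigned by the optimizer/ranker


def clip_to_budget(patches: List[Patch]) -> List[Patch]:
    """Keep the highest-ranked patches per artifact kind, up to its budget.

    Artifact kinds missing from MAX_EDITS get a budget of zero.
    """
    kept: List[Patch] = []
    used: Dict[str, int] = {}
    for patch in sorted(patches, key=lambda p: p.rank_score, reverse=True):
        budget = MAX_EDITS.get(patch.artifact_kind, 0)
        if used.get(patch.artifact_kind, 0) < budget:
            kept.append(patch)
            used[patch.artifact_kind] = used.get(patch.artifact_kind, 0) + 1
    return kept


proposed = [
    Patch("extractor", "replace", "a", 0.4),
    Patch("extractor", "add", "b", 0.9),
    Patch("unregistered_kind", "add", "c", 1.0),
]
kept = clip_to_budget(proposed)  # only the top-ranked extractor patch survives
```

Defaulting unknown kinds to a zero budget makes the clip fail closed: an artifact family has to be registered before it can be patched at all.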
4. Held-out gates matter more than generation quality
The optimizer model can write plausible but harmful patches. The gate decides what lands.
For PO, acceptance should require passing a held-out eval pack before patches are applied:
- historical runs not used to generate the patch
- recent Connor feedback rows not used in training prompt
- known noisy/wrong clusters
- synthetic canaries for safety boundaries
- source-health failure cases
- delegation proof cases
A patch can be eloquent and still fail if it causes more repeats, unsafe autonomy, weaker provenance, or lower actionability.
5. Rejected patches are a first-class learning signal
SkillOpt stores rejected edits and score drops. PO should keep a rejected-patch buffer so future optimizers know what not to repeat.
Examples:
- “Over-deterministic suppression registry” was noisy: preserve judgement, do not hard-code brittle filters.
- “Signal Radar source repair as PO priority” was noisy: PO may use Signal Radar opportunistically but should not make its repair a top PO task unless it harms PO recommendations.
- “Useful but too vague” means card templates need more implementation detail and problem framing, not another abstract question.
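A rejected-patch buffer can be as simple as an append-only JSONL file keyed by a fingerprint of the patch. The sketch below assumes a row layout (`fingerprint`, `artifact`, `reason`, `score_delta`) and helper names that are not part of any existing PO code.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def patch_fingerprint(artifact: str, diff: str) -> str:
    """Whitespace-insensitive fingerprint so trivially reformatted repeats still match."""
    normalized = "\n".join(line.strip() for line in diff.strip().splitlines())
    return hashlib.sha256(f"{artifact}\n{normalized}".encode("utf-8")).hexdigest()[:16]


def record_rejection(ledger: Path, artifact: str, diff: str, reason: str, score_delta: float) -> dict:
    row = {
        "fingerprint": patch_fingerprint(artifact, diff),
        "artifact": artifact,
        "reason": reason,
        "score_delta": score_delta,
    }
    with ledger.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(row) + "\n")
    return row


def was_rejected(ledger: Path, artifact: str, diff: str) -> bool:
    if not ledger.exists():
        return False
    fingerprint = patch_fingerprint(artifact, diff)
    with ledger.open(encoding="utf-8") as fh:
        return any(json.loads(line)["fingerprint"] == fingerprint for line in fh if line.strip())


with tempfile.TemporaryDirectory() as tmp:
    ledger = Path(tmp) / "rejected_patches.jsonl"
    diff = "+ suppress every card from cluster X"
    seen_before = was_rejected(ledger, "synthesis_policy", diff)
    record_rejection(ledger, "synthesis_policy", diff,
                     reason="over-deterministic suppression", score_delta=-0.12)
    # The same patch with different surrounding whitespace is still recognized.
    seen_after = was_rejected(ledger, "synthesis_policy", "  + suppress every card from cluster X  ")
```

An exact-fingerprint check only catches literal repeats; feeding the stored `reason` text back into the optimizer prompt is what would stop near-duplicate ideas.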
6. Slow/meta updates are perfect for faculty judgement memory
SkillOpt separates fast local edits from slow optimizer-side meta guidance. PO should adopt the same split:
- Fast patches: localized prompt/extractor/template patches after eval pass.
- Slow lessons: faculty experience heuristics distilled from multiple runs, stored in a Faculty Experience Ledger and retrieved before future judgement.
This maps exactly onto the existing intended split:
- GBrain = semantic truth
- Hermes skills = procedural competence
- Faculty Experience Ledger = judgement history
Proposed architecture: PO-Opt
source-near stores / crons / feedback / delegations
-> rollout builder
-> train / selection / test split packs
-> optimizer agent proposes bounded patches
-> patch applier creates candidate artifact versions
-> evaluation harness scores candidate vs baseline
-> acceptance gate promotes only improved versions
-> rejected patch buffer records failures
-> slow/meta update distills stable lessons
-> next real faculty run uses improved artifacts
Trainable artifact registry
Create a registry of optimizable artifacts:
artifacts:
  faculty_prompt:
    path_glob: agents/faculties/*/prompt.md
    allowed_edits: [append_section, replace_section, delete_section]
    max_edits_per_step: 2
    eval_pack: faculty_judgement_eval
    risk: local_safe_work
  context_asset_extractor:
    path_glob: runners/context_assets.py
    allowed_edits: [add_extractor, adjust_classifier, add_source_health_case]
    max_edits_per_step: 1
    eval_pack: context_asset_eval
    risk: local_safe_work
  synthesis_policy:
    path_glob: runners/observability_run.py runners/action_manager.py
    allowed_edits: [scoring_adjustment, card_template_patch, suppression_rule_patch]
    max_edits_per_step: 1
    eval_pack: attention_surface_eval
    risk: local_safe_work
  hermes_skill:
    path_glob: ~/.hermes/skills/personal-orchestrator/**/SKILL.md
    allowed_edits: [append_pitfall, replace_step, add_validation]
    max_edits_per_step: 2
    eval_pack: skill_regression_eval
    risk: local_safe_work
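A registry like this is only useful if the patch runner checks every target path against it. The following sketch shows one possible boundary check; the in-memory `REGISTRY` dict, the choice to split the synthesis entry into two patterns, and the function name are assumptions made for illustration.

```python
import posixpath
from fnmatch import fnmatchcase
from typing import Optional

# Hypothetical in-memory form of three registry entries (repo-relative patterns).
REGISTRY = {
    "faculty_prompt": ["agents/faculties/*/prompt.md"],
    "context_asset_extractor": ["runners/context_assets.py"],
    "synthesis_policy": ["runners/observability_run.py", "runners/action_manager.py"],
}


def artifact_kind_for(path: str) -> Optional[str]:
    """Return the registry kind for a repo-relative path, or None if it is out of bounds."""
    normalized = posixpath.normpath(path)  # collapses "a/../b" style traversal
    if posixpath.isabs(normalized) or normalized == ".." or normalized.startswith("../"):
        return None
    for kind, patterns in REGISTRY.items():
        for pattern in patterns:
            # fnmatch's "*" also matches "/", so require the same directory depth as well.
            same_depth = normalized.count("/") == pattern.count("/")
            if same_depth and fnmatchcase(normalized, pattern):
                return kind
    return None


allowed = artifact_kind_for("agents/faculties/focus/prompt.md")
blocked = artifact_kind_for("runners/po_opt.py")
```

A runner would refuse any patch whose target resolves to `None`, which is one way to satisfy "no patch runner can touch files outside registry".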
Rollout/eval design
Score dimensions
Each candidate patch should be scored on:
- Usefulness: would Connor likely rate the output useful or approve action?
- Actionability: does it state problem, implementation shape, risk, evidence, and done condition?
- Discernment: does it suppress/merge weak, duplicate, or low-leverage signals?
- Non-repeat: does it avoid re-surfacing acknowledged/noisy clusters unless materially changed?
- Evidence quality: does it cite source-near artifacts with freshness/confidence/caveats?
- Safety: does it preserve external-side-effect approval boundaries?
- Autonomy fit: does it route safe local work to local/delegated background paths rather than asking Connor too much?
- Proofability: does every action have an observable done condition and proof path?
- Breadth without volume: does it consider all faculties/context assets without dumping every faculty’s opinion?
- Regression risk: does it preserve existing passing tests and accepted behavior?
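These dimensions can be folded into a single comparable number. The sketch below is one possible aggregation: the weights are placeholders, and treating safety as a hard gate rather than a weighted term is an assumption that goes slightly beyond what the list states.

```python
from typing import Dict

# Placeholder weights; real values would be tuned against collaborator feedback.
WEIGHTS: Dict[str, float] = {
    "usefulness": 3, "actionability": 3, "discernment": 2, "non_repeat": 2,
    "evidence_quality": 2, "safety": 3, "autonomy_fit": 1, "proofability": 2,
    "breadth_without_volume": 1, "regression_risk": 1,
}
HARD_DIMENSIONS = {"safety"}  # assumption: anything below 1.0 here zeroes the score


def weighted_score(scores: Dict[str, float]) -> float:
    """Weighted mean of per-dimension scores in [0, 1], with hard gates applied first."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing score dimensions: {sorted(missing)}")
    if any(scores[dim] < 1.0 for dim in HARD_DIMENSIONS):
        return 0.0
    total = sum(WEIGHTS.values())
    return sum(WEIGHTS[dim] * scores[dim] for dim in WEIGHTS) / total


perfect = weighted_score({dim: 1.0 for dim in WEIGHTS})
unsafe = weighted_score({**{dim: 1.0 for dim in WEIGHTS}, "safety": 0.5})
```

Raising on a missing dimension is deliberate: a scorer that silently skips a dimension would let a candidate look better simply by not being measured.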
Eval packs
Create versioned eval packs under:
/root/personal-orchestrator/evals/
  faculty_judgement/
  context_assets/
  extractors/
  synthesis_attention/
  action_manager/
  nudge_ux/
  source_health/
Each pack should contain:
- training examples: can be used to generate patches
- selection examples: gate candidate patches
- test examples: periodic reporting only
- canaries: safety and anti-regression cases
- scoring rubric: JSON dimensions + human-readable explanation
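To keep the training, selection, and test examples from leaking into each other, split membership can be derived from a stable hash of the example id instead of a random draw. This is a sketch under assumed percentages; `assign_split` and its parameters are hypothetical.

```python
import hashlib


def assign_split(example_id: str, selection_pct: int = 20, test_pct: int = 20) -> str:
    """Deterministically map an example id to 'training', 'selection', or 'test'.

    The same id always lands in the same split, so an example used to generate
    a patch can never later be counted as held-out evidence for that patch.
    """
    digest = hashlib.sha256(example_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    if bucket < test_pct:
        return "test"
    if bucket < test_pct + selection_pct:
        return "selection"
    return "training"


first = assign_split("run-2026-05-27/agent_manager/card-1")
second = assign_split("run-2026-05-27/agent_manager/card-1")
```

Canaries would sit outside this scheme entirely, since they are meant to run against every candidate regardless of split.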
Acceptance gate
A patch is accepted only if:
- selection score improves over baseline by a configured margin, or fixes a hard failure without degrading weighted score;
- the full test suite passes;
- safety canaries pass;
- no source-near provenance regression;
- no new external-side-effect capability is introduced;
- patch stays within edit budget and artifact boundary;
- changed artifact has a rollback path.
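The conditions above translate directly into a gate that returns both a decision and the reasons for a rejection. A minimal sketch, assuming a hypothetical `GateInput` record and a placeholder margin of 0.02:

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class GateInput:
    baseline_score: float
    candidate_score: float
    fixes_hard_failure: bool
    tests_pass: bool
    canaries_pass: bool
    provenance_regression: bool
    adds_external_side_effect: bool
    within_budget_and_boundary: bool
    has_rollback: bool


def gate(g: GateInput, margin: float = 0.02) -> Tuple[bool, List[str]]:
    """Return (accepted, reasons); reasons is empty when the patch is accepted."""
    reasons: List[str] = []
    improved = g.candidate_score >= g.baseline_score + margin
    fixed_without_loss = g.fixes_hard_failure and g.candidate_score >= g.baseline_score
    if not (improved or fixed_without_loss):
        reasons.append("no score improvement or hard-failure fix")
    if not g.tests_pass:
        reasons.append("test suite failed")
    if not g.canaries_pass:
        reasons.append("safety canary failed")
    if g.provenance_regression:
        reasons.append("source-near provenance regressed")
    if g.adds_external_side_effect:
        reasons.append("introduces external-side-effect capability")
    if not g.within_budget_and_boundary:
        reasons.append("exceeds edit budget or artifact boundary")
    if not g.has_rollback:
        reasons.append("no rollback path")
    return (not reasons, reasons)


good = GateInput(0.70, 0.75, False, True, True, False, False, True, True)
accepted, why = gate(good)
```

Collecting every failed condition, rather than stopping at the first, gives the rejected-patch ledger a complete reason list for each row.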
Rejected patches are written to:
/root/personal-orchestrator/po_opt/rejected_patches.jsonl
Accepted patches are written to:
/root/personal-orchestrator/po_opt/accepted_patches.jsonl
/root/personal-orchestrator/po_opt/runs/<run_id>/
Layer-by-layer plan
Phase 0 — Freeze the safety doctrine and baseline
Goal: establish a baseline before self-improvement patches start landing.
Implement:
- po_opt/ARTIFACT_REGISTRY.yaml
- po_opt/SAFETY_DOCTRINE.md
- baseline scorecard over latest 10–30 runs
- snapshot of optimizable artifact hashes
- mandatory rollback metadata for every candidate patch
Acceptance:
- baseline can be regenerated deterministically;
- artifact registry lists every optimizable file family;
- no patch runner can touch files outside registry.
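The artifact hash snapshot can be sketched with the standard library alone. The helper names and the temporary directory layout below are illustrative assumptions; only the glob pattern comes from the registry.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Dict, List


def snapshot_hashes(root: Path, patterns: List[str]) -> Dict[str, str]:
    """Map each matching repo-relative path to the SHA-256 of its bytes."""
    hashes: Dict[str, str] = {}
    for pattern in patterns:
        for path in root.glob(pattern):
            if path.is_file():
                hashes[path.relative_to(root).as_posix()] = hashlib.sha256(path.read_bytes()).hexdigest()
    return dict(sorted(hashes.items()))  # sorted so the snapshot serializes identically every time


def changed_artifacts(baseline: Dict[str, str], current: Dict[str, str]) -> List[str]:
    """Paths that were added, removed, or modified relative to the baseline."""
    return sorted(p for p in set(baseline) | set(current) if baseline.get(p) != current.get(p))


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    prompt = root / "agents" / "faculties" / "focus" / "prompt.md"
    prompt.parent.mkdir(parents=True)
    prompt.write_text("Judge focus.\n", encoding="utf-8")
    patterns = ["agents/faculties/*/prompt.md"]
    baseline = snapshot_hashes(root, patterns)
    repeat = snapshot_hashes(root, patterns)      # regenerating yields the same snapshot
    prompt.write_text("Judge focus. Suppress repeats.\n", encoding="utf-8")
    changed = changed_artifacts(baseline, snapshot_hashes(root, patterns))
```

Storing the baseline snapshot alongside each candidate run would also provide the rollback metadata: the hash says exactly which version to restore.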
Phase 1 — Faculty Experience Ledger as rollout memory
Goal: turn faculty runs into structured training trajectories.
Implement:
/root/personal-orchestrator/faculty_experience/
  runs.jsonl
  lessons.yaml
  patch_queue.jsonl
  rejected_lessons.jsonl
  scorecards/*.md
Each faculty judgement event should record:
- run id, faculty id, prompt hash
- retrieved context assets and evidence refs
- judgement: surface/suppress/question/action/blocked
- proposed card/action/suppression
- final synthesis decision
- Connor feedback/outcome if available
- delegation/proof outcome if available
- inferred lesson candidate
Acceptance:
- each real faculty run appends experience rows;
- future faculty prompts can retrieve 3–5 similar prior lessons;
- scorecards show useful/noisy/wrong/repeated-cluster rates by faculty.
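The per-faculty scorecard in the last acceptance item reduces to counting feedback labels over ledger rows. A small sketch, assuming each row carries hypothetical `faculty_id` and `feedback` keys and that rows without feedback are excluded from the rates:

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Optional


def scorecard(rows: Iterable[Dict[str, Optional[str]]]) -> Dict[str, Dict[str, float]]:
    """Per-faculty share of each feedback label, ignoring rows with no feedback yet."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for row in rows:
        label = row.get("feedback")
        if label is not None:
            counts[row["faculty_id"]][label] += 1
    return {
        faculty: {label: n / sum(c.values()) for label, n in c.items()}
        for faculty, c in counts.items()
    }


rows = [
    {"faculty_id": "focus", "feedback": "useful"},
    {"faculty_id": "focus", "feedback": "noisy"},
    {"faculty_id": "focus", "feedback": None},        # no reaction recorded yet
    {"faculty_id": "agent_manager", "feedback": "useful"},
]
card = scorecard(rows)
```

Excluding unlabeled rows keeps the rates honest about what was actually rated; the count of unlabeled rows is worth reporting separately as a coverage figure.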
Phase 2 — Context Asset Optimizer
Goal: optimize typed asset extraction and retrieval, not just faculty prompts.
Implement evals for:
- source failure -> emits source_health/uncertainty, not silent empty context
- Open Tabs -> emits open loop + temporal state + attention priority + affordance
- collaborator feedback -> emits feedback + outcome + proof + caveat
- cron outputs -> emits health + repair candidate + proof refs
- delegation logs -> emits lifecycle/proof/blocker assets
Patch targets:
- extractor registry seams
- asset type classifiers
- retrieval packet ranking/caveats
- provenance/freshness/confidence rendering
Acceptance:
- a single source-near row can emit multiple typed cognitive assets where appropriate;
- missing sources produce explicit health assets;
- retrieval packets cite source refs and caveats;
- tests prove idempotent indexing.
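Three of these acceptance items (multiple typed assets per row, explicit health assets for missing sources, idempotent indexing) fit in one small sketch. The asset dictionaries, the `signals` key, and the example refs are invented for illustration and do not describe the real extractor.

```python
import hashlib
from typing import Dict, List, Optional


def extract_assets(source: str, row: Optional[dict]) -> List[dict]:
    """Turn one source-near row into typed assets; a missing row becomes a health asset."""
    if row is None:
        return [{"type": "source_health", "source": source, "ref": source,
                 "status": "unavailable", "confidence": 0.0}]
    return [
        {"type": asset_type, "source": source, "ref": row["ref"], "value": value}
        for asset_type, value in row["signals"].items()
    ]


def asset_key(asset: dict) -> str:
    """Stable identity: the same source ref and type always map to the same key."""
    raw = f"{asset['source']}|{asset['ref']}|{asset['type']}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


def index_assets(index: Dict[str, dict], assets: List[dict]) -> Dict[str, dict]:
    for asset in assets:
        index[asset_key(asset)] = asset  # overwrite instead of append, so re-runs add nothing
    return index


tab_row = {"ref": "open_tabs#12", "signals": {
    "open_loop": "tax form left half-filled",
    "temporal_state": "open for 9 days",
    "affordance": "schedule an admin block",
}}
index: Dict[str, dict] = {}
index_assets(index, extract_assets("open_tabs", tab_row))
index_assets(index, extract_assets("open_tabs", tab_row))   # second pass is a no-op
index_assets(index, extract_assets("calendar", None))       # failed source stays visible
```

Keying on source, ref, and type rather than on content means an updated value replaces the stale asset instead of sitting beside it.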
Phase 3 — Faculty Prompt Optimizer
Goal: make each faculty improve judgement without becoming deterministic.
For each faculty, create a small benchmark of historical situations:
- input evidence packet
- expected useful judgement properties
- expected suppressions
- unacceptable outputs
- scoring rubric
Patch constraints:
- only localized prompt sections;
- preserve input boundary, output contract, evidence requirement, forbidden actions;
- prefer adding “when to suppress” and “when to route local work” heuristics over generic encouragement.
Acceptance examples:
- Focus/Open Tabs learns: concrete debt/deadline/admin affordance -> propose admin block, not abstract runway question.
- Agent Manager learns: cards need problem statement + implementation outline + risk + proof, not just “Hermes can do X.”
- Self-Healing learns: source repair is high-priority only when it degrades PO recommendations, not merely because a source is imperfect.
- Institutional Memory learns: repeated do_this becomes a delegation pattern; repeated noisy becomes suppression/merge/source-repair candidate.
Phase 4 — Synthesis / Attention Surface Optimizer
Goal: optimize the final bottleneck: what Connor sees.
Build eval pack around recent known cases:
- useful-but-too-vague card
- noisy Signal Radar repair card
- repeated backlog card flood
- “no nudge crossed threshold” with hidden faculty insights
- multi-card delayed catch-up / Nudge Inbox UX
Patch targets:
- card template fields
- scoring weights
- cluster merge policy
- cooldown/reopen policy
- Nudge Inbox rendering/resolution
- “local safe work vs ask Connor” routing
Acceptance:
- one semantic cluster produces at most one user-facing card;
- card includes why, implementation shape, risk, evidence, proof;
- known noisy clusters are suppressed unless new evidence changes urgency/risk/deadline/done condition;
- safe local improvements are queued/done quietly with proof rather than repeatedly asking Connor.
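The first and third acceptance items can be checked mechanically. The sketch below assumes each candidate card already carries a `cluster` id, a `score`, and a `new_evidence` flag; how those are produced is the hard part and stays with the faculties and synthesis policy.

```python
from typing import Dict, List, Set


def select_cards(candidates: List[dict], known_noisy: Set[str]) -> List[dict]:
    """At most one card per semantic cluster; noisy clusters need new evidence to resurface."""
    best: Dict[str, dict] = {}
    for card in candidates:
        cluster = card["cluster"]
        if cluster in known_noisy and not card.get("new_evidence", False):
            continue  # acknowledged as noisy and nothing material has changed
        if cluster not in best or card["score"] > best[cluster]["score"]:
            best[cluster] = card
    return sorted(best.values(), key=lambda c: c["score"], reverse=True)


candidates = [
    {"id": "c1", "cluster": "backlog", "score": 0.6},
    {"id": "c2", "cluster": "backlog", "score": 0.8},
    {"id": "c3", "cluster": "signal_radar_repair", "score": 0.9},
    {"id": "c4", "cluster": "tax_deadline", "score": 0.7, "new_evidence": True},
]
surface = select_cards(candidates, known_noisy={"signal_radar_repair", "tax_deadline"})
```

This filter is intentionally thin: it enforces the invariant without deciding what counts as noisy or as new evidence, which keeps judgement in the faculties rather than in hard-coded rules.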
Phase 5 — Extractor/source-health self-healing loop
Goal: let source-near infrastructure improve from failures, without unsafe mutations.
Implement:
- extractor failure ledger
- source-health eval pack
- repair proposal generator
- read-only proof checks
- approval gates for credentials/scopes/external changes
Acceptance:
- failures become source_health/uncertainty assets;
- recurring failures become local repair candidates;
- repairs must add regression checks;
- external services are not mutated without explicit approval.
Phase 6 — E2E optimizer run
Goal: evaluate PO as a whole system, not isolated patches.
Run an offline E2E comparison:
- baseline artifact set
- candidate patched artifact set
- same historical eval suite
- same target model/harness where possible
- compare final nudge/action/delegation outputs
Quality claim only allowed if:
- manifest.json proves real faculty-agent records for a live smoke run;
- offline eval shows candidate > baseline;
- no safety canary fails;
- full test suite passes;
- final human-facing surface is inspected for attention quality.
Concrete implementation slices
Slice A — PO-Opt skeleton
Files:
po_opt/ARTIFACT_REGISTRY.yaml
po_opt/SAFETY_DOCTRINE.md
runners/po_opt.py
tests/test_po_opt.py
Capabilities:
- list optimizable artifacts and hashes
- create candidate patch run directories
- enforce edit boundaries
- write accepted/rejected patch ledgers
Slice B — Faculty Experience Ledger
Files:
runners/faculty_experience.py
tests/test_faculty_experience.py
faculty_experience/runs.jsonl
faculty_experience/lessons.yaml
faculty_experience/patch_queue.jsonl
Capabilities:
- append run experience from runs/latest
- derive per-faculty scorecards from feedback/delegations
- retrieve similar prior decisions for faculty prompts/context assets
Slice C — Eval packs and scorer
Files:
runners/po_eval.py
evals/synthesis_attention/*.jsonl
evals/context_assets/*.jsonl
evals/faculty_judgement/*.jsonl
tests/test_po_eval.py
Capabilities:
- score baseline/candidate outputs on rubric dimensions
- include hard safety canaries
- produce EVAL_REPORT.md and eval_report.json
Slice D — Bounded patch optimizer
Files:
runners/po_optimizer.py
tests/test_po_optimizer.py
po_opt/prompts/optimizer.md
po_opt/prompts/patch_ranker.md
Capabilities:
- collect failure/success minibatches from eval runs
- ask optimizer model for add/delete/replace patches
- rank and clip patches to edit budget
- apply patch to candidate copy only
- evaluate before promotion
Slice E — Slow/meta update layer
Files:
po_opt/meta_guidance.yaml
po_opt/rejected_patches.jsonl
po_opt/accepted_patches.jsonl
runners/po_meta_update.py
Capabilities:
- summarize stable lessons across accepted/rejected patch history
- distinguish deployed artifact changes from optimizer-only guidance
- feed meta guidance into future patch generation but not into runtime unless accepted
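A slow update could start as plain counting over the two ledgers: a lesson becomes guidance only when it recurs and is never contradicted. This sketch assumes each ledger row carries a hypothetical `lessons` list of short tags; the tag names and the support threshold are placeholders.

```python
from collections import Counter
from typing import Dict, List


def distill(accepted: List[dict], rejected: List[dict], min_support: int = 2) -> Dict[str, List[str]]:
    """Promote lesson tags seen in several patches on one side and never on the other."""
    acc = Counter(tag for patch in accepted for tag in set(patch["lessons"]))
    rej = Counter(tag for patch in rejected for tag in set(patch["lessons"]))
    prefer = sorted(t for t, n in acc.items() if n >= min_support and rej[t] == 0)
    avoid = sorted(t for t, n in rej.items() if n >= min_support and acc[t] == 0)
    return {"prefer": prefer, "avoid": avoid}


accepted = [
    {"lessons": ["add_done_condition", "cite_source_refs"]},
    {"lessons": ["add_done_condition"]},
]
rejected = [
    {"lessons": ["hard_coded_suppression"]},
    {"lessons": ["hard_coded_suppression", "cite_source_refs"]},
]
guidance = distill(accepted, rejected)
```

Lessons that appear on both sides are left out on purpose: mixed evidence is a reason to wait for more runs, not to write guidance.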
First eval cases to seed
Use these from current PO history:
- Useful but vague implementation card
  - Expected improvement: card template includes problem, implementation outline, risk, evidence, done condition.
  - Evidence: consult lines 26–41 and synthesis card 6075d620b3.
- Signal Radar repair as noisy PO priority
  - Expected improvement: source repair is opportunistic unless it harms PO recommendation quality.
  - Evidence: consult lines 44–54.
- Repeated backlog item flood
  - Expected improvement: Workbench suppressions are compacted/deduped and no repeated identical nudge is emitted.
  - Evidence: consult lines 83–117.
- Source-near adapter onboarding
  - Expected improvement: reusable local skill/delegation pattern, not another abstract card.
  - Evidence: latest synthesis card f4a42beb07.
- Nudge Inbox delayed catch-up
  - Expected improvement: ordinal references and batch delegation work after delay; stale views refresh rather than guess.
  - Evidence: implemented tests tests/test_nudge_inbox.py.
Safety and anti-patterns
Do not:
- let optimizer patches land directly in production files without held-out eval;
- let a single feedback row create a broad deterministic rule;
- turn PO into if/then suppression machinery that removes faculty judgement;
- optimize only prompts while leaving extractors/context assets dirty;
- count artifact presence as intelligence quality;
- publish or expose private raw data in public artifacts;
- mutate external systems during self-improvement.
Do:
- use candidate copies, patch diffs, rollback, and acceptance gates;
- score against attention quality, usefulness, safety, and proof, not just test pass;
- preserve rejected patches as negative training data;
- distinguish fast local patching from slow meta lessons;
- require real-agent E2E proof for any quality claim;
- let safe local self-improvement happen quietly with proof.
Success metric
The system is improving when, over a rolling 2–4 week window:
- useful/do_this rate rises;
- noisy/wrong/repeated-cluster rate falls;
- fewer cards ask Connor to arbitrate internal maintenance;
- more safe local repairs/delegations complete with proof;
- context-asset packets become more cited, fresh, and caveated;
- faculties retrieve relevant prior lessons before judging;
- final nudges become fewer, more concrete, and more action-changing;
- source failures become explicit uncertainty/health assets rather than invisible gaps.
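The first metric can be tracked by comparing the positive-feedback rate in the current window with the window before it. A rough sketch, assuming feedback rows with hypothetical `day` and `label` keys and a 28-day window:

```python
from datetime import date, timedelta
from typing import List, Optional

POSITIVE_LABELS = {"useful", "do_this"}


def positive_rate(rows: List[dict], end: date, days: int = 28) -> Optional[float]:
    """Share of positive labels among rows dated in (end - days, end]; None if no rows."""
    start = end - timedelta(days=days)
    window = [r for r in rows if start < r["day"] <= end]
    if not window:
        return None
    return sum(1 for r in window if r["label"] in POSITIVE_LABELS) / len(window)


def is_improving(rows: List[dict], today: date, days: int = 28) -> Optional[bool]:
    """Compare the current window with the one immediately before it."""
    current = positive_rate(rows, today, days)
    previous = positive_rate(rows, today - timedelta(days=days), days)
    if current is None or previous is None:
        return None  # not enough history to say either way
    return current > previous


rows = [
    {"day": date(2026, 4, 10), "label": "useful"},
    {"day": date(2026, 4, 20), "label": "noisy"},
    {"day": date(2026, 5, 10), "label": "do_this"},
    {"day": date(2026, 5, 20), "label": "useful"},
]
trend = is_improving(rows, today=date(2026, 5, 27))
```

With only a handful of rows per window the comparison is noisy, so a real scorecard would likely report the raw counts next to the trend.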
Recommended immediate next move
Implement Slice A + B first: PO-Opt skeleton + Faculty Experience Ledger.
Reason: SkillOpt-style optimization is only safe once we have structured rollouts and artifact boundaries. The Faculty Experience Ledger supplies the replay buffer; the PO-Opt skeleton supplies the safety rails. After that, context-asset and faculty prompt optimization become straightforward, gated, and auditable.