SkillOpt-inspired self-improvement plan for the Personal Orchestrator

Date: 2026-05-27
Source prompt: Connor linked Muratcan Koylan’s X post about SkillOpt / “gradient descent for SKILL.md files” and asked how to incorporate the idea into faculties, context assets, extractors, and the Personal Orchestrator layer E2E.
External reference: arXiv:2605.23904v2, “SkillOpt: Executive Strategy for Self-Evolving Agent Skills” by Yifan Yang et al.; tweet: https://x.com/koylanai/status/2059113412278227328
Internal evidence: quick PO consult /root/personal-orchestrator/consults/2026-05-27T20-34-39.655068-00-00/CONSULT.md; latest full-breadth run /root/personal-orchestrator/runs/2026-05-27T19-05-51.169241-00-00.

Executive summary

SkillOpt’s core move is not “let an agent rewrite itself.” It is: treat a text artifact as trainable external state, then optimize it with the discipline of ML training: scored rollouts, bounded edits, held-out validation, rejected-edit memory, and slow/meta updates.

For the Personal Orchestrator, the equivalent trainable artifacts are broader than SKILL.md:

Faculty prompts / faculty skills — how each cognitive role judges.
Context-asset extractors — how source-near raw data becomes typed cognitive assets.
Context-asset retrieval/ranking policies — what evidence each faculty sees.
Synthesis / Agent Manager policies — what becomes a nudge, delegation, suppression, or quiet local artifact.
Action/delegation goal templates — how approved work is decomposed and proven.
Evaluation rubrics — how we score usefulness, focus, non-repeat, proof, autonomy, and attention cost.

The recommended upgrade is a Personal Orchestrator SkillOpt loop: a controlled optimization harness that proposes small patches to these artifacts, accepts them only if they improve held-out evals and/or real outcome metrics, stores rejected patches as negative feedback, and periodically distills stable lessons into a meta-guidance layer.

This should make PO feel less like a cron that occasionally emits good cards and more like a system that is actually learning from Connor’s reactions, its own failures, source-health evidence, and delegated-work outcomes.

What we should learn from SkillOpt

1. Optimize persistent text state, not ephemeral reasoning

SkillOpt treats the skill document as the trainable object while freezing the target model/harness. For us, the trainable state should include:

agents/faculties/*/prompt.md
Hermes skills used by PO, especially Agent Manager, Institutional Memory, Self-Healing, context assets, and source adapters
context-asset manifests and extractor prompts/rules
synthesis ranking/suppression/card-template policies
delegation goal packet templates
validation/eval rubrics

The important shift: faculty outputs are not just run artifacts. They are training trajectories for improving the above artifacts.

2. Use scored trajectories, not vibes

SkillOpt rollouts produce trajectories plus scalar scores. PO already has many trajectory sources:

faculty run artifacts: faculties/*.md and faculties/*.run.json
final surfaces: TELEGRAM_NUDGE.md, COLLABORATOR_OUTPUT.md, synthesis.json
user feedback: state/collaborator_feedback.jsonl, Nudge Inbox feedback
delegation lifecycle: state/delegation_backlog.jsonl, delegations/*/status.json, run.log, proof artifacts
source-health and context-asset evidence: state/context_assets/*, heartbeat DB, cron outputs
workbench: state/PO_WORKBENCH.md, state/po_workbench.json

We should formalize these as rollouts with score dimensions, not only archive them.

3. Bounded edits prevent self-improvement from becoming self-corruption

SkillOpt applies add/delete/replace edits under an edit budget. PO should never do an unconstrained rewrite of a faculty, extractor, or synthesis policy just because one run was bad.

Default patch budget:

faculty prompt: max 1–3 localized patches per accepted optimization step
extractor: max one source/type behavior change per step
synthesis policy: max one scoring/suppression/card-template change per step
Hermes skill: patch an existing section before creating a new skill
context asset schema: no schema changes without explicit migration/eval plan
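
A minimal sketch of how the numeric part of this budget could be enforced. The names `EDIT_BUDGET`, `Patch`, and `clip_to_budget` are hypothetical, the `rank` field assumes some upstream patch ranker, and using 3 as the faculty-prompt cap is just the upper end of the 1–3 range above.

```python
from dataclasses import dataclass

# Hypothetical budget table mirroring the defaults above; keys are illustrative.
EDIT_BUDGET = {
    "faculty_prompt": 3,      # upper bound of the 1-3 range
    "extractor": 1,
    "synthesis_policy": 1,
}


@dataclass(frozen=True)
class Patch:
    artifact_kind: str
    op: str        # "add", "delete", or "replace"
    target: str    # section or behavior being edited
    rank: float    # higher = more promising (assumed to come from a ranker)


def clip_to_budget(patches: list[Patch]) -> list[Patch]:
    """Keep the top-ranked patches per artifact kind, up to that kind's budget."""
    kept: list[Patch] = []
    used: dict[str, int] = {}
    for patch in sorted(patches, key=lambda p: p.rank, reverse=True):
        budget = EDIT_BUDGET.get(patch.artifact_kind, 0)  # unknown kinds get no budget
        if used.get(patch.artifact_kind, 0) < budget:
            kept.append(patch)
            used[patch.artifact_kind] = used.get(patch.artifact_kind, 0) + 1
    return kept
```

Defaulting unknown artifact kinds to a budget of zero keeps the sketch fail-closed: a patch aimed at something outside the table is dropped rather than applied.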

4. Held-out gates matter more than generation quality

The optimizer model can write plausible but harmful patches. The gate decides what lands.

For PO, acceptance should require passing a held-out eval pack before any patch is applied. The pack should include:

historical runs not used to generate the patch
recent Connor feedback rows not used in training prompt
known noisy/wrong clusters
synthetic canaries for safety boundaries
source-health failure cases
delegation proof cases

A patch can be eloquent and still fail if it causes more repeats, unsafe autonomy, weaker provenance, or lower actionability.

5. Rejected patches are a first-class learning signal

SkillOpt stores rejected edits and score drops. PO should keep a rejected-patch buffer so future optimizers know what not to repeat.

Examples:

“Over-deterministic suppression registry” was noisy: preserve judgement, do not hard-code brittle filters.
“Signal Radar source repair as PO priority” was noisy: PO may use Signal Radar opportunistically but should not make its repair a top PO task unless it harms PO recommendations.
“Useful but too vague” means card templates need more implementation detail and problem framing, not another abstract question.

6. Slow/meta updates are perfect for faculty judgement memory

SkillOpt separates fast local edits from slow optimizer-side meta guidance. PO should adopt the same split:

Fast patches: localized prompt/extractor/template patches after eval pass.
Slow lessons: faculty experience heuristics distilled from multiple runs, stored in a Faculty Experience Ledger and retrieved before future judgement.

This maps exactly onto the existing intended split:

GBrain = semantic truth
Hermes skills = procedural competence
Faculty Experience Ledger = judgement history

Proposed architecture: PO-Opt

source-near stores / crons / feedback / delegations
  -> rollout builder
  -> train / selection / test split packs
  -> optimizer agent proposes bounded patches
  -> patch applier creates candidate artifact versions
  -> evaluation harness scores candidate vs baseline
  -> acceptance gate promotes only improved versions
  -> rejected patch buffer records failures
  -> slow/meta update distills stable lessons
  -> next real faculty run uses improved artifacts
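
The core of this pipeline (propose, score against baseline, promote or reject) can be sketched as a single step function. This is an illustrative skeleton only: `optimization_step`, `propose`, and `score` are hypothetical names, artifacts are modeled as plain strings, and the scoring function stands in for the evaluation harness.

```python
from typing import Callable


def optimization_step(
    baseline: str,
    propose: Callable[[str], list[str]],
    score: Callable[[str], float],
    margin: float = 0.0,
) -> tuple[str, list[dict]]:
    """Promote at most one candidate that beats the baseline by more than `margin`.

    Every other candidate is returned as a rejected-patch record.
    """
    base_score = score(baseline)
    scored = sorted(
        ((score(candidate), candidate) for candidate in propose(baseline)),
        key=lambda pair: pair[0],
        reverse=True,
    )
    promoted = baseline
    accepted = False
    rejected: list[dict] = []
    for cand_score, candidate in scored:
        if not accepted and cand_score > base_score + margin:
            promoted, accepted = candidate, True
        else:
            rejected.append(
                {"candidate": candidate, "score": cand_score, "baseline_score": base_score}
            )
    return promoted, rejected
```

The baseline is returned unchanged when no candidate clears the margin, so a step with only bad proposals adds to the rejected buffer without changing the deployed artifact.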

Trainable artifact registry

Create a registry of optimizable artifacts:

artifacts:
  faculty_prompt:
    path_glob: agents/faculties/*/prompt.md
    allowed_edits: [append_section, replace_section, delete_section]
    max_edits_per_step: 2
    eval_pack: faculty_judgement_eval
    risk: local_safe_work
  context_asset_extractor:
    path_glob: runners/context_assets.py
    allowed_edits: [add_extractor, adjust_classifier, add_source_health_case]
    max_edits_per_step: 1
    eval_pack: context_asset_eval
    risk: local_safe_work
  synthesis_policy:
    path_glob: [runners/observability_run.py, runners/action_manager.py]
    allowed_edits: [scoring_adjustment, card_template_patch, suppression_rule_patch]
    max_edits_per_step: 1
    eval_pack: attention_surface_eval
    risk: local_safe_work
  hermes_skill:
    path_glob: ~/.hermes/skills/personal-orchestrator/**/SKILL.md
    allowed_edits: [append_pitfall, replace_step, add_validation]
    max_edits_per_step: 2
    eval_pack: skill_regression_eval
    risk: local_safe_work
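
A sketch of how a patch runner might check a target path against this registry before touching it. The registry is inlined as a Python dict for brevity (a real runner would presumably load the YAML file), the `hermes_skill` entry is omitted because its `~` and `**` pattern needs extra handling, and `artifact_kind_for` is a hypothetical helper name.

```python
import posixpath
from fnmatch import fnmatchcase
from typing import Optional

# Inlined subset of the registry above (assumption: paths are repo-relative).
REGISTRY_GLOBS = {
    "faculty_prompt": ["agents/faculties/*/prompt.md"],
    "context_asset_extractor": ["runners/context_assets.py"],
    "synthesis_policy": ["runners/observability_run.py", "runners/action_manager.py"],
}


def _matches(path: str, pattern: str) -> bool:
    """Match segment by segment so that '*' never crosses a '/' boundary."""
    path_parts, pattern_parts = path.split("/"), pattern.split("/")
    return len(path_parts) == len(pattern_parts) and all(
        fnmatchcase(part, pat) for part, pat in zip(path_parts, pattern_parts)
    )


def artifact_kind_for(rel_path: str) -> Optional[str]:
    """Return the registry kind for a repo-relative path, or None if out of bounds."""
    normalized = posixpath.normpath(rel_path)  # collapses 'a/../b' tricks
    if normalized.startswith(("/", "..")):
        return None
    for kind, patterns in REGISTRY_GLOBS.items():
        if any(_matches(normalized, pattern) for pattern in patterns):
            return kind
    return None
```

Normalizing before matching matters here: without it, a path such as `agents/faculties/x/../../../etc/prompt.md` could slip past a naive glob check.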

Rollout/eval design

Score dimensions

Each candidate patch should be scored on:

Usefulness: would Connor likely rate the output useful or approve action?
Actionability: does it state problem, implementation shape, risk, evidence, and done condition?
Discernment: does it suppress/merge weak, duplicate, or low-leverage signals?
Non-repeat: does it avoid re-surfacing acknowledged/noisy clusters unless materially changed?
Evidence quality: does it cite source-near artifacts with freshness/confidence/caveats?
Safety: does it preserve external-side-effect approval boundaries?
Autonomy fit: does it route safe local work to local/delegated background paths rather than asking Connor too much?
Proofability: does every action have an observable done condition and proof path?
Breadth without volume: does it consider all faculties/context assets without dumping every faculty’s opinion?
Regression risk: does it preserve existing passing tests and accepted behavior?
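
One way to fold these dimensions into the scalar the acceptance gate compares is a weighted mean. The sketch below assumes each dimension is scored in [0, 1] with higher meaning better (so "regression risk" is scored as absence of risk); the dimension keys, equal default weights, and the `weighted_score` name are all illustrative assumptions.

```python
from typing import Optional

# Hypothetical machine-readable keys for the ten dimensions listed above.
DIMENSIONS = (
    "usefulness", "actionability", "discernment", "non_repeat",
    "evidence_quality", "safety", "autonomy_fit", "proofability",
    "breadth_without_volume", "regression_risk",
)


def weighted_score(
    scores: dict[str, float],
    weights: Optional[dict[str, float]] = None,
) -> float:
    """Weighted mean over all dimensions; unspecified weights default to 1.0."""
    weights = weights or {}
    missing = [dim for dim in DIMENSIONS if dim not in scores]
    if missing:
        raise ValueError(f"missing dimensions: {missing}")
    total, total_weight = 0.0, 0.0
    for dim in DIMENSIONS:
        value = scores[dim]
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{dim} must be in [0, 1], got {value}")
        weight = weights.get(dim, 1.0)
        total += weight * value
        total_weight += weight
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return total / total_weight
```

A weighted mean can hide a failure in one dimension behind strength in others, which is one reason to keep safety canaries as separate hard checks rather than relying on the safety weight alone.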

Eval packs

Create versioned eval packs under:

/root/personal-orchestrator/evals/
  faculty_judgement/
  context_assets/
  extractors/
  synthesis_attention/
  action_manager/
  nudge_ux/
  source_health/

Each pack should contain:

training examples: can be used to generate patches
selection examples: gate candidate patches
test examples: periodic reporting only
canaries: safety and anti-regression cases
scoring rubric: JSON dimensions + human-readable explanation
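
To keep the training, selection, and test examples from leaking into each other across regenerations, the split can be derived from a stable hash of each example's id. This is a sketch under assumptions: the 60/20/20 ratios and the `assign_split` name are not from the plan, and canaries would be curated by hand rather than assigned this way.

```python
import hashlib


def assign_split(example_id: str, train: float = 0.6, selection: float = 0.2) -> str:
    """Deterministically map an example id to 'training', 'selection', or 'test'."""
    digest = hashlib.sha256(example_id.encode("utf-8")).hexdigest()
    bucket = (int(digest, 16) % 1000) / 1000  # stable value in [0, 1)
    if bucket < train:
        return "training"
    if bucket < train + selection:
        return "selection"
    return "test"
```

Because the assignment depends only on the id, adding new examples never moves an existing example between splits, so a patch generated from training rows cannot later find those rows in its own gate.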

Acceptance gate

A patch is accepted only if:

selection score improves over baseline by a configured margin, or fixes a hard failure without degrading weighted score;
the full regression test suite passes (the code tests, not the eval pack's reporting-only test examples);
safety canaries pass;
no source-near provenance regression;
no new external-side-effect capability is introduced;
patch stays within edit budget and artifact boundary;
changed artifact has a rollback path.
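
The conditions above can be expressed as one fail-closed function that also reports why a patch was rejected. A minimal sketch: `GateInput`, `accept`, and the default margin of 0.02 are hypothetical, and each boolean is assumed to be computed by an upstream check.

```python
from dataclasses import dataclass


@dataclass
class GateInput:
    baseline_score: float
    candidate_score: float
    fixes_hard_failure: bool
    tests_pass: bool
    canaries_pass: bool
    provenance_regression: bool
    adds_external_side_effect: bool
    within_budget_and_boundary: bool
    has_rollback: bool


def accept(g: GateInput, margin: float = 0.02) -> tuple[bool, list[str]]:
    """Return (accepted, reasons); any single failed condition rejects the patch."""
    reasons: list[str] = []
    improved = g.candidate_score >= g.baseline_score + margin
    fixed = g.fixes_hard_failure and g.candidate_score >= g.baseline_score
    if not (improved or fixed):
        reasons.append("no score improvement and no non-degrading hard-failure fix")
    if not g.tests_pass:
        reasons.append("test suite failed")
    if not g.canaries_pass:
        reasons.append("safety canary failed")
    if g.provenance_regression:
        reasons.append("source-near provenance regressed")
    if g.adds_external_side_effect:
        reasons.append("introduces a new external-side-effect capability")
    if not g.within_budget_and_boundary:
        reasons.append("exceeds edit budget or artifact boundary")
    if not g.has_rollback:
        reasons.append("no rollback path")
    return (not reasons, reasons)
```

Returning the full list of reasons, rather than stopping at the first failure, gives the rejected-patch ledger a more useful record for later optimizer runs.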

Rejected patches are written to:

/root/personal-orchestrator/po_opt/rejected_patches.jsonl

Accepted patches are written to:

/root/personal-orchestrator/po_opt/accepted_patches.jsonl
/root/personal-orchestrator/po_opt/runs/<run_id>/
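
A sketch of append-only JSONL ledgers for these files. The record fields (`patch_id`, `reason`, `score_delta`) and the helper names are assumptions, and the demo writes to a temporary directory rather than the real `/root/personal-orchestrator/po_opt/` paths.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def append_ledger(ledger_path: Path, record: dict) -> dict:
    """Append one timestamped JSON record as a single line."""
    row = {"recorded_at": datetime.now(timezone.utc).isoformat(), **record}
    ledger_path.parent.mkdir(parents=True, exist_ok=True)
    with ledger_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(row, sort_keys=True) + "\n")
    return row


def load_ledger(ledger_path: Path) -> list[dict]:
    """Read all records back in write order; a missing ledger is an empty list."""
    if not ledger_path.exists():
        return []
    with ledger_path.open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


# Usage demo against a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    ledger = Path(tmp) / "po_opt" / "rejected_patches.jsonl"
    append_ledger(ledger, {"patch_id": "demo-1", "reason": "score drop", "score_delta": -0.04})
    append_ledger(ledger, {"patch_id": "demo-2", "reason": "safety canary failed"})
    rows = load_ledger(ledger)
```

Appending one line per record keeps the ledger easy to diff and tail, and means a crash mid-run can at worst truncate the final line rather than corrupt earlier history.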

Layer-by-layer plan

Phase 0 — Freeze the safety doctrine and baseline

Goal: establish a baseline before self-improvement patches start landing.

Implement:

po_opt/ARTIFACT_REGISTRY.yaml
po_opt/SAFETY_DOCTRINE.md
baseline scorecard over latest 10–30 runs
snapshot of optimizable artifact hashes
mandatory rollback metadata for every candidate patch

Acceptance:

baseline can be regenerated deterministically;
artifact registry lists every optimizable file family;
no patch runner can touch files outside registry.
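
The hash snapshot could be sketched as follows. `snapshot_hashes` and `changed_artifacts` are hypothetical names, SHA-256 is an assumed choice of digest, and the demo builds a throwaway directory tree instead of reading the real repository.

```python
import hashlib
import tempfile
from pathlib import Path


def snapshot_hashes(root: Path, patterns: list[str]) -> dict[str, str]:
    """Map each matching file (as a root-relative POSIX path) to its SHA-256 digest."""
    snapshot: dict[str, str] = {}
    for pattern in patterns:
        for path in sorted(root.glob(pattern)):
            if path.is_file():
                rel = path.relative_to(root).as_posix()
                snapshot[rel] = hashlib.sha256(path.read_bytes()).hexdigest()
    return dict(sorted(snapshot.items()))


def changed_artifacts(before: dict[str, str], after: dict[str, str]) -> list[str]:
    """List paths that were added, removed, or modified between two snapshots."""
    return sorted(k for k in before.keys() | after.keys() if before.get(k) != after.get(k))


# Usage demo: one faculty prompt, edited between snapshots.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    prompt = root / "agents" / "faculties" / "focus" / "prompt.md"
    prompt.parent.mkdir(parents=True)
    prompt.write_text("v1", encoding="utf-8")
    before = snapshot_hashes(root, ["agents/faculties/*/prompt.md"])
    again = snapshot_hashes(root, ["agents/faculties/*/prompt.md"])
    prompt.write_text("v2", encoding="utf-8")
    after = snapshot_hashes(root, ["agents/faculties/*/prompt.md"])
    changed = changed_artifacts(before, after)
```

Storing the `before` snapshot alongside each candidate patch gives the rollback metadata something concrete to verify against: a rollback succeeded if the post-rollback snapshot equals it.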

Phase 1 — Faculty Experience Ledger as rollout memory

Goal: turn faculty runs into structured training trajectories.

Implement:

/root/personal-orchestrator/faculty_experience/
  runs.jsonl
  lessons.yaml
  patch_queue.jsonl
  rejected_lessons.jsonl
  scorecards/*.md

Each faculty judgement event should record:

run id, faculty id, prompt hash
retrieved context assets and evidence refs
judgement: surface/suppress/question/action/blocked
proposed card/action/suppression
final synthesis decision
Connor feedback/outcome if available
delegation/proof outcome if available
inferred lesson candidate

Acceptance:

each real faculty run appends experience rows;
future faculty prompts can retrieve 3–5 similar prior lessons;
scorecards show useful/noisy/wrong/repeated-cluster rates by faculty.
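
Retrieval of similar prior lessons could start as plain token overlap before anything heavier is justified. The sketch below uses Jaccard similarity over lowercase word tokens, which is a stand-in technique chosen for illustration, not something the plan specifies; the lesson fields `faculty_id` and `text` and the function name are also assumptions.

```python
import re


def _tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9_]+", text.lower()))


def retrieve_lessons(query: str, lessons: list[dict], faculty_id: str, k: int = 5) -> list[dict]:
    """Return up to k lessons for one faculty, ranked by Jaccard token overlap."""
    query_tokens = _tokens(query)
    scored: list[tuple[float, dict]] = []
    for lesson in lessons:
        if lesson["faculty_id"] != faculty_id:
            continue
        lesson_tokens = _tokens(lesson["text"])
        union = query_tokens | lesson_tokens
        similarity = len(query_tokens & lesson_tokens) / len(union) if union else 0.0
        if similarity > 0:
            scored.append((similarity, lesson))
    scored.sort(key=lambda pair: pair[0], reverse=True)  # stable: ties keep ledger order
    return [lesson for _, lesson in scored[:k]]
```

Filtering by faculty before ranking keeps one faculty's heuristics from leaking into another's judgement, at the cost of missing lessons that would transfer; whether to relax that is a design question for the ledger.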

Phase 2 — Context Asset Optimizer

Goal: optimize typed asset extraction and retrieval, not just faculty prompts.

Implement evals for:

source failure -> emits source_health / uncertainty, not silent empty context
Open Tabs -> emits open loop + temporal state + attention priority + affordance
collaborator feedback -> emits feedback + outcome + proof + caveat
cron outputs -> emits health + repair candidate + proof refs
delegation logs -> emits lifecycle/proof/blocker assets

Patch targets:

extractor registry seams
asset type classifiers
retrieval packet ranking/caveats
provenance/freshness/confidence rendering

Acceptance:

a single source-near row can emit multiple typed cognitive assets where appropriate;
missing sources produce explicit health assets;
retrieval packets cite source refs and caveats;
tests prove idempotent indexing.
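
Three of these acceptance points (multiple assets per row, explicit health assets on failure, idempotent indexing) can be illustrated together. This is a sketch with hypothetical row fields (`source_ref`, `kind`, `title`, `age_days`, `error`) and helper names; content-derived ids are one assumed way to make re-indexing a no-op.

```python
import hashlib
import json


def asset_id(source_ref: str, asset_type: str, payload: dict) -> str:
    """Derive a stable id from content so re-extraction yields the same id."""
    canonical = json.dumps(
        {"source_ref": source_ref, "asset_type": asset_type, "payload": payload},
        sort_keys=True,
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


def extract_assets(row: dict) -> list[dict]:
    """Turn one source-near row into one or more typed assets."""
    ref = row["source_ref"]
    if row.get("error"):
        # A failed source becomes an explicit health asset, never silent empty context.
        specs = [("source_health", {"status": "failed", "error": row["error"]})]
    elif row.get("kind") == "open_tab":
        specs = [
            ("open_loop", {"title": row["title"]}),
            ("attention_priority", {"title": row["title"], "age_days": row["age_days"]}),
        ]
    else:
        specs = [("uncertainty", {"reason": "no extractor for this row kind"})]
    return [
        {"id": asset_id(ref, kind, payload), "source_ref": ref, "asset_type": kind, "payload": payload}
        for kind, payload in specs
    ]


def index_assets(index: dict[str, dict], assets: list[dict]) -> int:
    """Insert assets keyed by id; return how many were new."""
    added = 0
    for asset in assets:
        if asset["id"] not in index:
            index[asset["id"]] = asset
            added += 1
    return added
```

Indexing the same row twice adds nothing the second time, which is the property the idempotency tests would assert.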

Phase 3 — Faculty Prompt Optimizer

Goal: make each faculty improve judgement without becoming deterministic.

For each faculty, create a small benchmark of historical situations:

input evidence packet
expected useful judgement properties
expected suppressions
unacceptable outputs
scoring rubric

Patch constraints:

only localized prompt sections;
preserve input boundary, output contract, evidence requirement, forbidden actions;
prefer adding “when to suppress” and “when to route local work” heuristics over generic encouragement.

Acceptance examples:

Focus/Open Tabs learns: concrete debt/deadline/admin affordance -> propose admin block, not abstract runway question.
Agent Manager learns: cards need problem statement + implementation outline + risk + proof, not just “Hermes can do X.”
Self-Healing learns: source repair is high-priority only when it degrades PO recommendations, not merely because a source is imperfect.
Institutional Memory learns: repeated do_this becomes a delegation pattern; repeated noisy becomes suppression/merge/source-repair candidate.

Phase 4 — Synthesis / Attention Surface Optimizer

Goal: optimize the final bottleneck: what Connor sees.

Build eval pack around recent known cases:

useful-but-too-vague card
noisy Signal Radar repair card
repeated backlog card flood
“no nudge crossed threshold” with hidden faculty insights
multi-card delayed catch-up / Nudge Inbox UX

Patch targets:

card template fields
scoring weights
cluster merge policy
cooldown/reopen policy
Nudge Inbox rendering/resolution
“local safe work vs ask Connor” routing

Acceptance:

one semantic cluster produces at most one user-facing card;
card includes why, implementation shape, risk, evidence, proof;
known noisy clusters are suppressed unless new evidence changes urgency/risk/deadline/done condition;
safe local improvements are queued/done quietly with proof rather than repeatedly asking Connor.
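
The first and third acceptance points can be sketched as a small selection pass over candidate cards. The card fields (`cluster_id`, `score`, `material_change`) and the `select_cards` name are assumptions; in particular, how `material_change` gets set (new urgency, risk, deadline, or done condition) is left to faculty judgement rather than encoded here.

```python
def select_cards(candidates: list[dict], noisy_clusters: set[str]) -> list[dict]:
    """Keep at most one card per cluster; drop known-noisy clusters unless materially changed."""
    best: dict[str, dict] = {}
    for card in candidates:
        cluster = card["cluster_id"]
        if cluster in noisy_clusters and not card.get("material_change", False):
            continue  # suppressed: known noisy and nothing new
        if cluster not in best or card["score"] > best[cluster]["score"]:
            best[cluster] = card
    return sorted(best.values(), key=lambda c: c["score"], reverse=True)
```

For example, two cards in the same backlog cluster collapse to the higher-scoring one, and a Signal Radar repair card in a noisy cluster only reappears when it carries `material_change: True`.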

Phase 5 — Extractor/source-health self-healing loop

Goal: let source-near infrastructure improve from failures, without unsafe mutations.

Implement:

extractor failure ledger
source-health eval pack
repair proposal generator
read-only proof checks
approval gates for credentials/scopes/external changes

Acceptance:

failures become source_health/uncertainty assets;
recurring failures become local repair candidates;
repairs must add regression checks;
external services are not mutated without explicit approval.

Phase 6 — E2E optimizer run

Goal: evaluate PO as a whole system, not isolated patches.

Run an offline E2E comparison:

baseline artifact set
candidate patched artifact set
same historical eval suite
same target model/harness where possible
compare final nudge/action/delegation outputs

A quality claim is allowed only if:

manifest.json proves real faculty-agent records for a live smoke run;
offline eval shows candidate > baseline;
no safety canary fails;
full test suite passes;
final human-facing surface is inspected for attention quality.

Concrete implementation slices

Slice A — PO-Opt skeleton

Files:

po_opt/ARTIFACT_REGISTRY.yaml
po_opt/SAFETY_DOCTRINE.md
runners/po_opt.py
tests/test_po_opt.py

Capabilities:

list optimizable artifacts and hashes
create candidate patch run directories
enforce edit boundaries
write accepted/rejected patch ledgers

Slice B — Faculty Experience Ledger

Files:

runners/faculty_experience.py
tests/test_faculty_experience.py
faculty_experience/runs.jsonl
faculty_experience/lessons.yaml
faculty_experience/patch_queue.jsonl

Capabilities:

append run experience from runs/latest
derive per-faculty scorecards from feedback/delegations
retrieve similar prior decisions for faculty prompts/context assets

Slice C — Eval packs and scorer

Files:

runners/po_eval.py
evals/synthesis_attention/*.jsonl
evals/context_assets/*.jsonl
evals/faculty_judgement/*.jsonl
tests/test_po_eval.py

Capabilities:

score baseline/candidate outputs on rubric dimensions
include hard safety canaries
produce EVAL_REPORT.md and eval_report.json

Slice D — Bounded patch optimizer

Files:

runners/po_optimizer.py
tests/test_po_optimizer.py
po_opt/prompts/optimizer.md
po_opt/prompts/patch_ranker.md

Capabilities:

collect failure/success minibatches from eval runs
ask optimizer model for add/delete/replace patches
rank and clip patches to edit budget
apply patch to candidate copy only
evaluate before promotion

Slice E — Slow/meta update layer

Files:

po_opt/meta_guidance.yaml
po_opt/rejected_patches.jsonl
po_opt/accepted_patches.jsonl
runners/po_meta_update.py

Capabilities:

summarize stable lessons across accepted/rejected patch history
distinguish deployed artifact changes from optimizer-only guidance
feed meta guidance into future patch generation but not into runtime unless accepted

First eval cases to seed

Use these from current PO history:

Useful but vague implementation card
  Expected improvement: card template includes problem, implementation outline, risk, evidence, done condition.
  Evidence: consult lines 26–41 and synthesis card 6075d620b3.
Signal Radar repair as noisy PO priority
  Expected improvement: source repair is opportunistic unless it harms PO recommendation quality.
  Evidence: consult lines 44–54.
Repeated backlog item flood
  Expected improvement: Workbench suppressions are compacted/deduped and no repeated identical nudge is emitted.
  Evidence: consult lines 83–117.
Source-near adapter onboarding
  Expected improvement: reusable local skill/delegation pattern, not another abstract card.
  Evidence: latest synthesis card f4a42beb07.
Nudge Inbox delayed catch-up
  Expected improvement: ordinal references and batch delegation work after delay; stale views refresh rather than guess.
  Evidence: implemented tests tests/test_nudge_inbox.py.

Safety and anti-patterns

Do not:

let optimizer patches land directly in production files without held-out eval;
let a single feedback row create a broad deterministic rule;
turn PO into if/then suppression machinery that removes faculty judgement;
optimize only prompts while leaving extractors/context assets dirty;
count artifact presence as intelligence quality;
publish or expose private raw data in public artifacts;
mutate external systems during self-improvement.

Do:

use candidate copies, patch diffs, rollback, and acceptance gates;
score against attention quality, usefulness, safety, and proof, not just test pass;
preserve rejected patches as negative training data;
distinguish fast local patching from slow meta lessons;
require real-agent E2E proof for any quality claim;
let safe local self-improvement happen quietly with proof.

Success metric

The system is improving when, over a rolling 2–4 week window:

useful/do_this rate rises;
noisy/wrong/repeated-cluster rate falls;
fewer cards ask Connor to arbitrate internal maintenance;
more safe local repairs/delegations complete with proof;
context-asset packets become more cited, fresh, and caveated;
faculties retrieve relevant prior lessons before judging;
final nudges become fewer, more concrete, and more action-changing;
source failures become explicit uncertainty/health assets rather than invisible gaps.
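
The first two indicators could be computed from feedback rows with a rolling window. A sketch under assumptions: each row is taken to carry a `label` (such as useful, do_this, noisy, wrong) and a timezone-aware `at` timestamp, and the 28-day default is the upper end of the 2–4 week window.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone


def label_rates(rows: list[dict], now: datetime, window_days: int = 28) -> dict[str, float]:
    """Share of each feedback label among rows inside the rolling window."""
    cutoff = now - timedelta(days=window_days)
    counts = Counter(row["label"] for row in rows if row["at"] >= cutoff)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: n / total for label, n in counts.items()}


# Usage demo with fixed timestamps.
now = datetime(2026, 5, 27, tzinfo=timezone.utc)
demo_rows = [
    {"label": "useful", "at": now - timedelta(days=1)},
    {"label": "useful", "at": now - timedelta(days=5)},
    {"label": "noisy", "at": now - timedelta(days=2)},
    {"label": "do_this", "at": now - timedelta(days=3)},
    {"label": "wrong", "at": now - timedelta(days=40)},  # outside the window
]
rates = label_rates(demo_rows, now)
```

Comparing `label_rates` for the current window against the previous one gives the rise/fall signal; with small feedback volumes the rates will be noisy, so the raw counts are worth reporting alongside them.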

Recommended immediate next move

Implement Slice A + B first: PO-Opt skeleton + Faculty Experience Ledger.

Reason: SkillOpt-style optimization is only safe once we have structured rollouts and artifact boundaries. The Faculty Experience Ledger supplies the replay buffer; the PO-Opt skeleton supplies the safety rails. After that, context-asset and faculty prompt optimization become straightforward, gated, and auditable.
