Harness updating is not harness benefit

Key takeaways from "Harness Updating Is Not Harness Benefit: Disentangling Evolution Capabilities in Self-Evolving LLM Agents" (arXiv:2605.30621). The paper separates two capabilities that are often conflated in self-improving agents: writing better harness artifacts, and actually using those artifacts during task execution.

17 authors · SWE-bench Verified · MCP-Atlas · SkillsBench · 7 LLM backbones · Prompts / skills / memory / tools
Evolver spread: ≤3.1 pp (percentage points). Across benchmarks, the best-vs-worst harness-updater gap is narrow.

Weak skill-load rate: 25.1%. Qwen3-32B often fails to bring relevant skill artifacts into context.

Adherence drift: 0.52→0.13. Weak models lose harness-following over long trajectories.

The core distinction

Harness-updating

The ability of an evolver model to read execution evidence and write useful persistent artifacts: skills, prompts, memories, tool rules.

Harness-benefit

The ability of the task-solving model to retrieve/load those artifacts and follow them faithfully while solving future tasks.
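
The split between the two capabilities can be made concrete with a minimal sketch. The `Artifact` and `Harness` classes below are hypothetical illustrations, not the paper's implementation: `update` stands in for the evolver's side (harness-updating) and `load` for the solver's side (harness-benefit).

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Artifact:
    """A persistent harness artifact (hypothetical schema)."""
    name: str
    kind: str  # "skill", "prompt", "memory", or "tool_rule"
    body: str


@dataclass
class Harness:
    """Stores artifacts between tasks (illustrative only)."""
    artifacts: Dict[str, Artifact] = field(default_factory=dict)

    def update(self, artifact: Artifact) -> None:
        # Harness-updating: the evolver persists what it wrote.
        self.artifacts[artifact.name] = artifact

    def load(self, name: str) -> Optional[Artifact]:
        # Harness-benefit starts here: the solver has to request the artifact.
        return self.artifacts.get(name)


harness = Harness()
harness.update(
    Artifact("run-tests-first", "skill", "Run the failing test before editing.")
)

loaded = harness.load("run-tests-first")  # correct name: artifact enters context
missed = harness.load("run_tests_first")  # wrong name: nothing is loaded
```

Even in this toy form, a well-written artifact contributes nothing when the solver asks for it incorrectly, which is the gap the paper's distinction is meant to expose.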

Main takeaways

1. Bigger evolvers are not obviously worth it. When the task-solving agent is fixed, Qwen3.5-9B, Qwen3-235B, Claude Haiku/Sonnet/Opus, and GPT-OSS-120B produce surprisingly similar downstream gains. The smallest evolver even wins SkillsBench in the paper's setup.
2. The bottleneck is the acting agent, not the learning/writing agent. Post-evolution performance varies much more by the base capability of the solver than by which model wrote the harness update.
3. Harness benefit is non-monotonic. Mid-tier models often benefit most; frontier models have less room because they already solve many tasks; weak models have room but cannot reliably operationalize the harness.
4. Weak models fail in two separable ways. Activation failure: they do not load the right artifact using the protocol the runner expects. Adherence failure: they load it, then drift, apply it too literally, or abandon its conditional steps.
5. Training target: not just reasoning, but harness operation. Agent training should reward correct artifact invocation, protocol-conformant loading, sustained instruction following, and recovery from tool/runtime failures while preserving the skill’s procedure.
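
The two failure modes in takeaway 4 can be tracked separately in your own logs. The sketch below assumes a hypothetical log format (an episode dict with `relevant_skill` and `loaded`, and a per-step list of 0/1 adherence flags); it is one plausible way to compute a load rate and an early-vs-late adherence comparison, not the paper's exact metric definitions.

```python
def skill_load_rate(episodes):
    """Fraction of episodes in which the relevant skill was loaded."""
    if not episodes:
        return 0.0
    hits = sum(1 for ep in episodes if ep["relevant_skill"] in ep["loaded"])
    return hits / len(episodes)


def adherence_by_segment(step_flags, n_segments=2):
    """Mean adherence per consecutive segment of a trajectory.

    step_flags holds 1 where a step followed the loaded skill, else 0.
    The last segment absorbs any leftover steps.
    """
    size = max(1, len(step_flags) // n_segments)
    segments = [step_flags[i * size:(i + 1) * size] for i in range(n_segments - 1)]
    segments.append(step_flags[(n_segments - 1) * size:])
    return [sum(seg) / len(seg) for seg in segments if seg]


# Hypothetical logs: only one of four episodes loads the relevant skill.
episodes = [
    {"relevant_skill": "git-bisect", "loaded": {"git-bisect"}},
    {"relevant_skill": "git-bisect", "loaded": set()},
    {"relevant_skill": "sql-migrate", "loaded": {"git-bisect"}},
    {"relevant_skill": "sql-migrate", "loaded": set()},
]
load_rate = skill_load_rate(episodes)

# Hypothetical trajectory: adherence is high early and low late.
flags = [1, 1, 0, 1, 0, 0, 0, 1]
early, late = adherence_by_segment(flags)
```

Keeping the two numbers apart matters because they call for different fixes: a low load rate points at retrieval or invocation protocol, while a falling late-segment adherence points at instruction following over long trajectories.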

Selected quantitative results

| Task-solving model | SWE-bench Verified (base / gain) | MCP-Atlas (base / gain) | SkillsBench (base / gain) |
|---|---|---|---|
| Qwen3-32B | 3.6 / +4.4 | 3.6 / +1.0 | 0.0 / +5.8 |
| Qwen3-235B | 20.7 / +19.3 | 25.0 / +4.3 | 4.7 / +1.1 |
| GPT-OSS-120B | 26.2 / +15.8 | 28.0 / +7.0 | 0.0 / +7.0 |
| Haiku 4.5 | 66.0 / +2.4 | 42.4 / +3.6 | 5.8 / +15.1 |
| Sonnet 4.6 | 73.2 / +2.8 | 54.0 / +3.2 | 24.4 / +3.5 |
| Opus 4.6 | 74.2 / +2.6 | 61.0 / +3.6 | 25.6 / +5.8 |
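
To check the non-monotonic pattern against these numbers, the short script below copies the table's values for SWE-bench Verified (SWE), MCP-Atlas (MCP), and SkillsBench (SB) and finds the solver with the largest gain on each benchmark. It assumes the gain is an absolute increase added to the base score, which the summary does not state explicitly.

```python
# (base, gain) per benchmark, copied from the table above.
RESULTS = {
    "Qwen3-32B":    {"SWE": (3.6, 4.4),   "MCP": (3.6, 1.0),  "SB": (0.0, 5.8)},
    "Qwen3-235B":   {"SWE": (20.7, 19.3), "MCP": (25.0, 4.3), "SB": (4.7, 1.1)},
    "GPT-OSS-120B": {"SWE": (26.2, 15.8), "MCP": (28.0, 7.0), "SB": (0.0, 7.0)},
    "Haiku 4.5":    {"SWE": (66.0, 2.4),  "MCP": (42.4, 3.6), "SB": (5.8, 15.1)},
    "Sonnet 4.6":   {"SWE": (73.2, 2.8),  "MCP": (54.0, 3.2), "SB": (24.4, 3.5)},
    "Opus 4.6":     {"SWE": (74.2, 2.6),  "MCP": (61.0, 3.6), "SB": (25.6, 5.8)},
}


def post_evolution(model, bench):
    """Score after evolution, assuming gain is additive over base."""
    base, gain = RESULTS[model][bench]
    return round(base + gain, 1)


def largest_gain(bench):
    """Solver model with the largest gain on the given benchmark."""
    return max(RESULTS, key=lambda m: RESULTS[m][bench][1])


for bench in ("SWE", "MCP", "SB"):
    winner = largest_gain(bench)
    print(bench, winner, post_evolution(winner, bench))
```

The largest gains go to Qwen3-235B on SWE, GPT-OSS-120B on MCP, and Haiku 4.5 on SB. On no benchmark does the largest gain go to Opus 4.6, which has the highest baseline everywhere, or to Qwen3-32B, which has the lowest or tied-lowest baseline.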

Why it matters for agent systems

If you are building Hermes/GBrain-style self-improving agents, the paper argues for a practical architecture: use a cheap, good-enough model to propose or summarize durable harness updates, and spend the expensive capability budget on the agent that must execute with those updates. Do not assume that saved skills and memories help automatically: measure whether the agent loads them and whether it keeps following them across long-horizon work.
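
One way to act on this advice is to assign models by role and gate each saved artifact on measured use. The sketch below is an assumption-laden illustration: the model names are placeholders, and the `ArtifactStats` fields and the 0.5 thresholds are hypothetical choices, not values from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentConfig:
    evolver_model: str  # writes harness updates; a cheaper model may suffice
    solver_model: str   # executes tasks; receives the capability budget


@dataclass(frozen=True)
class ArtifactStats:
    load_rate: float    # fraction of relevant tasks where the artifact was loaded
    adherence: float    # fraction of steps that followed it once loaded
    score_delta: float  # task score with the artifact minus score without it


def keep_artifact(stats, min_load=0.5, min_adherence=0.5):
    """Keep an artifact only if it is loaded, followed, and measurably helps.

    The thresholds are arbitrary placeholders to be tuned per system.
    """
    return (
        stats.load_rate >= min_load
        and stats.adherence >= min_adherence
        and stats.score_delta > 0
    )


# Placeholder model names, not real model identifiers.
config = AgentConfig(evolver_model="small-evolver", solver_model="strong-solver")

used_and_helpful = ArtifactStats(load_rate=0.8, adherence=0.7, score_delta=2.5)
rarely_loaded = ArtifactStats(load_rate=0.25, adherence=0.9, score_delta=0.0)

decisions = [keep_artifact(used_and_helpful), keep_artifact(rarely_loaded)]
```

An artifact that fails the gate is not necessarily badly written; a low load rate may instead indicate that the solver cannot operate the harness, so it is worth checking the solver before rewriting the artifact.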