Harness updating is not harness benefit

Key takeaways from "Harness Updating Is Not Harness Benefit: Disentangling Evolution Capabilities in Self-Evolving LLM Agents" (arXiv:2605.30621). The paper separates two things that are often conflated in self-improving agents: writing better harness artifacts, and actually using those artifacts during task execution.

17 authors · SWE-bench Verified · MCP-Atlas · SkillsBench · 7 LLM backbones · Prompts / skills / memory / tools

Evolver spread: ≤3.1pp. Across benchmarks, the gap between the best and worst harness updater is narrow.

Weak skill-load rate: 25.1%. Qwen3-32B often fails to bring relevant skill artifacts into context.

Adherence drift: 0.52→0.13. Weak models lose harness-following over long trajectories.

The core distinction

Harness-updating

The ability of an evolver model to read execution evidence and write useful persistent artifacts: skills, prompts, memories, tool rules.

Harness-benefit

The ability of the task-solving model to retrieve/load those artifacts and follow them faithfully while solving future tasks.
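The split between the two capabilities can be pictured as two sides of one persistent store. The following is a minimal sketch under stated assumptions: the `Artifact` and `Harness` classes and their method names are hypothetical, chosen for illustration, and are not the paper's implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Artifact:
    """One persistent harness item (hypothetical structure)."""
    kind: str   # "skill", "prompt", "memory", or "tool_rule"
    name: str
    body: str


@dataclass
class Harness:
    """Hypothetical store shared by the evolver and the task-solving model."""
    artifacts: dict = field(default_factory=dict)

    def update(self, artifact: Artifact) -> None:
        # Harness-updating side: the evolver persists what it wrote.
        self.artifacts[artifact.name] = artifact

    def load(self, name: str):
        # Harness-benefit side: the solver only sees what it asks for.
        return self.artifacts.get(name)
```

In this picture, a successful `update` says nothing about whether a later run ever calls `load` with the right name, which is the gap the two definitions are meant to expose.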

Main takeaways

1. Bigger evolvers are not obviously worth it. When the task-solving agent is fixed, Qwen3.5-9B, Qwen3-235B, Claude Haiku/Sonnet/Opus, and GPT-OSS-120B produce surprisingly similar downstream gains. The smallest evolver even wins SkillsBench in their setup.

2. The bottleneck is the acting agent, not the learning/writing agent. Post-evolution performance varies much more by the base capability of the solver than by which model wrote the harness update.

3. Harness benefit is non-monotonic. Mid-tier models often benefit most; frontier models have less room because they already solve many tasks; weak models have room but cannot reliably operationalize the harness.

4. Weak models fail in two separable ways. Activation failure: they do not load the right artifact through the protocol the runner expects. Adherence failure: they load it, then drift, apply it too literally, or abandon its conditional steps.

5. Training target: not just reasoning, but harness operation. Agent training should reward correct artifact invocation, protocol-conformant loading, sustained instruction following, and recovery from tool/runtime failures while preserving the skill’s procedure.
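The two failure modes in takeaway 4 can be told apart mechanically if the runner logs which artifacts were loaded and whether each step followed the loaded procedure. A minimal sketch, assuming a hypothetical episode log (the field names `relevant_skill`, `loaded_skills`, and `step_followed` are illustrative and not from the paper) and an arbitrary adherence threshold of 0.5:

```python
from dataclasses import dataclass, field


@dataclass
class Episode:
    """Hypothetical per-task log; all field names are assumptions."""
    relevant_skill: str                                 # artifact the task calls for
    loaded_skills: set = field(default_factory=set)     # artifacts actually loaded
    step_followed: list = field(default_factory=list)   # per step: followed the skill?


def classify(ep: Episode, min_adherence: float = 0.5) -> str:
    """Return 'activation_failure', 'adherence_failure', or 'ok'."""
    if ep.relevant_skill not in ep.loaded_skills:
        return "activation_failure"      # never brought into context
    if not ep.step_followed:
        return "adherence_failure"       # loaded but never acted on
    adherence = sum(ep.step_followed) / len(ep.step_followed)
    return "ok" if adherence >= min_adherence else "adherence_failure"
```

Keeping the activation check first matters: an episode that never loaded the artifact should not be counted as poor adherence, since there was nothing to adhere to.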

Selected quantitative results

SWE = SWE-bench Verified, MCP = MCP-Atlas, SB = SkillsBench. Each cell is base score / gain after harness evolution.

Task-solving model   SWE base / gain   MCP base / gain   SB base / gain
Qwen3-32B            3.6 / +4.4        3.6 / +1.0        0.0 / +5.8
Qwen3-235B           20.7 / +19.3      25.0 / +4.3       4.7 / +1.1
GPT-OSS-120B         26.2 / +15.8      28.0 / +7.0       0.0 / +7.0
Haiku 4.5            66.0 / +2.4       42.4 / +3.6       5.8 / +15.1
Sonnet 4.6           73.2 / +2.8       54.0 / +3.2       24.4 / +3.5
Opus 4.6             74.2 / +2.6       61.0 / +3.6       25.6 / +5.8
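The non-monotonic pattern from takeaway 3 can be checked directly against the numbers listed above. The sketch below hard-codes those (base, gain) pairs and asks, per benchmark, which solver gains most and which has the highest base score; the dictionary layout and function names are illustrative only.

```python
# (base, gain) per benchmark, copied from the results listed above.
RESULTS = {
    "Qwen3-32B":    {"SWE": (3.6, 4.4),   "MCP": (3.6, 1.0),  "SB": (0.0, 5.8)},
    "Qwen3-235B":   {"SWE": (20.7, 19.3), "MCP": (25.0, 4.3), "SB": (4.7, 1.1)},
    "GPT-OSS-120B": {"SWE": (26.2, 15.8), "MCP": (28.0, 7.0), "SB": (0.0, 7.0)},
    "Haiku 4.5":    {"SWE": (66.0, 2.4),  "MCP": (42.4, 3.6), "SB": (5.8, 15.1)},
    "Sonnet 4.6":   {"SWE": (73.2, 2.8),  "MCP": (54.0, 3.2), "SB": (24.4, 3.5)},
    "Opus 4.6":     {"SWE": (74.2, 2.6),  "MCP": (61.0, 3.6), "SB": (25.6, 5.8)},
}


def largest_gain(benchmark: str) -> str:
    """Solver with the biggest post-evolution gain on a benchmark."""
    return max(RESULTS, key=lambda m: RESULTS[m][benchmark][1])


def strongest_base(benchmark: str) -> str:
    """Solver with the highest pre-evolution score on a benchmark."""
    return max(RESULTS, key=lambda m: RESULTS[m][benchmark][0])


for bench in ("SWE", "MCP", "SB"):
    print(f"{bench}: largest gain = {largest_gain(bench)}, "
          f"strongest base = {strongest_base(bench)}")
# SWE: largest gain = Qwen3-235B, strongest base = Opus 4.6
# MCP: largest gain = GPT-OSS-120B, strongest base = Opus 4.6
# SB: largest gain = Haiku 4.5, strongest base = Opus 4.6
```

On all three benchmarks the largest gain goes to a model that is neither the weakest nor the strongest at baseline.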

Why it matters for agent systems

If you are building Hermes/GBrain-style self-improving agents, the paper argues for a very practical architecture: use a cheap/good-enough model to propose or summarize durable harness updates, then spend your expensive capability budget on the agent that must execute with those updates. But do not assume saved skills/memories help automatically: measure whether the agent loads them and follows them across long-horizon work.
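One rough way to run that measurement is to track two aggregate numbers: how often the relevant artifact is loaded at all, and whether adherence holds up in the later part of a trajectory. The sketch below assumes hypothetical boolean logs, and the first-half versus second-half split is an arbitrary choice for illustration, not the paper's metric.

```python
def _rate(flags) -> float:
    """Fraction of True values in a non-empty list of booleans."""
    if not flags:
        raise ValueError("need at least one observation")
    return sum(flags) / len(flags)


def load_rate(loaded_flags) -> float:
    """Share of episodes in which the relevant artifact was loaded."""
    return _rate(loaded_flags)


def adherence_by_half(step_flags):
    """Adherence in the first and second half of one trajectory.

    step_flags: per-step booleans, True when the step followed the artifact.
    Needs at least two steps so that both halves are non-empty.
    """
    mid = len(step_flags) // 2
    return _rate(step_flags[:mid]), _rate(step_flags[mid:])


# Illustrative inputs, not data from the paper.
print(load_rate([True, False, False, True]))                        # 0.5
print(adherence_by_half([True, True, True, False, False, False]))   # (1.0, 0.0)
```

A low load rate points at activation, while a large drop between the two halves points at adherence over long horizons, so the two numbers suggest different fixes.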