Paper brief · arXiv 2606.10662

DELM: decentralized agents need shared verified state, not a bigger boss.

Yuzhen Mao and Azalia Mirhoseini propose Decentralized Language Models: a multi-agent framework where parallel agents coordinate through a task queue and a compact, verified, unfoldable shared context instead of a central orchestrator.

Stanford · Jun 2026

Key overview

1. The thesis

Existing multi-agent systems parallelize workers but centralize coordination. DELM decentralizes coordination by making verified progress persistent and readable by all agents.

The important move is from prompt-routed collaboration to state-based collaboration: useful findings, failures, constraints, and evidence become shared problem state.

2. The one-line model

DELM = parallel agents + task queue + verified shared context + compact gists + selective unfolding.

The shared context is intentionally not a raw chat transcript. It is a curated working set with backing evidence.

77.4% Pass@4

SWE-bench Verified with Gemini 3 Flash, beating baselines while reducing cost.

+5.7 pts max LongBench gain

Largest reported gain over the strongest baseline on LongBench-v2 Multi-Doc QA.

~50% cost reduction

On SWE-bench with Gemini, DELM cost $0.12 per task versus roughly $0.24–$0.26 for strong baselines.

Concept points

Centralized MAS bottleneck

A main agent decomposes, delegates, waits, merges, and launches another round. This creates a scatter-gather bottleneck and makes every worker depend on the controller’s integration quality.

Shared context as blackboard, but stricter

DELM resembles a blackboard architecture, but with stronger rules: entries are compact, verified, evidence-backed, and selectively unfoldable.

Gists as resident working set

Agents read short gists by default. This keeps the global state cheap enough to include in many calls while preserving pointers to detailed evidence.

Selective unfolding as demand paging

When a gist is insufficient, an agent can unfold to a grounded summary, then to raw evidence. Detail is pulled only when the subtask requires it.
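
The gist/unfold read path described above can be pictured as a small three-level store. This is a minimal sketch under stated assumptions, not the paper's implementation: the names `Entry`, `SharedContext`, `resident_view`, and `unfold` are invented for illustration, and the example entry about `parse_config` is made up.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Entry:
    """One shared-context item at three levels of detail (hypothetical layout)."""
    gist: str      # short, always resident in the shared context
    summary: str   # grounded summary, kept in a backing store
    raw: str       # raw evidence, kept in a backing store

LEVELS = ("gist", "summary", "raw")

class SharedContext:
    def __init__(self) -> None:
        self._entries: dict[str, Entry] = {}

    def add(self, key: str, entry: Entry) -> None:
        self._entries[key] = entry

    def resident_view(self) -> dict[str, str]:
        """What every agent reads by default: gists only."""
        return {k: e.gist for k, e in self._entries.items()}

    def unfold(self, key: str, level: str = "summary") -> str:
        """Pull more detail for one entry, only when a subtask needs it."""
        if level not in LEVELS:
            raise ValueError(f"unknown level: {level}")
        return getattr(self._entries[key], level)

# Made-up example entry.
ctx = SharedContext()
ctx.add("e1", Entry(
    gist="Null check missing in parse_config.",
    summary="parse_config dereferences cfg before checking it for None.",
    raw="def parse_config(cfg):\n    return cfg['name']",
))
```

An agent would normally include only `ctx.resident_view()` in its prompt, call `ctx.unfold("e1")` when the gist is not enough, and go to `ctx.unfold("e1", "raw")` only as a last step.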

Admission-time verification

Outputs do not enter shared state automatically. They are checked against supporting evidence first. Unsupported claims are rejected, regenerated, or returned to the queue.
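
A rough sketch of such an admission gate follows. The check used here (every cited quote must appear verbatim in the backing evidence) is a deliberately simple stand-in and an assumption of this brief; the paper's verifier is not specified here and is presumably model-based. `Candidate`, `Gate`, and `max_attempts` are hypothetical names, and the regeneration path is not modeled.

```python
from dataclasses import dataclass, field

@dataclass
class Candidate:
    claim: str
    quotes: list[str]            # snippets the agent cites as support
    attempts: int = 0

@dataclass
class Gate:
    source: str                  # backing evidence text
    shared: list[str] = field(default_factory=list)
    requeued: list[Candidate] = field(default_factory=list)
    rejected: list[Candidate] = field(default_factory=list)
    max_attempts: int = 2        # assumed retry budget

    def is_supported(self, c: Candidate) -> bool:
        # ASSUMPTION: verbatim-quote check standing in for a real verifier.
        return bool(c.quotes) and all(q in self.source for q in c.quotes)

    def admit(self, c: Candidate) -> str:
        if self.is_supported(c):
            self.shared.append(c.claim)      # only now visible to other agents
            return "admitted"
        c.attempts += 1
        if c.attempts < self.max_attempts:
            self.requeued.append(c)          # returned to the queue for another try
            return "requeued"
        self.rejected.append(c)
        return "rejected"

# Made-up evidence and claims.
gate = Gate(source="test_login fails with KeyError: 'token' at auth.py:42")
ok = Candidate("Login test fails on a missing 'token' key.", ["KeyError: 'token'"])
bad = Candidate("The failure is a network timeout.", ["TimeoutError"])
results = [gate.admit(ok), gate.admit(bad), gate.admit(bad)]
```

The point of the design is that `gate.shared` only ever grows through `admit`, so no agent can rely on a claim that skipped the check.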

Failures become assets

Failed hypotheses, constraints, and patch summaries become reusable. This prevents other agents from rediscovering the same dead ends.

Approach

1. Initialize
Generate initial subtasks from the task and optional source context.

2. Claim
Parallel agents asynchronously claim ready subtasks from the queue.

3. Reason
Each agent reads the current verified shared context and works locally.

4. Admit
Completed outputs are compressed, verified, and appended as compact gists.

5. Iterate/finalize
If more work is needed, generate new subtasks; otherwise answer from verified state.
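
The five steps above can be sketched with a thread pool and a standard queue. This is an illustrative skeleton only: `solve` and `verify` are placeholders standing in for model calls, `run_delm` is a hypothetical name, and splitting the task on ";" is an assumption made to keep the example self-contained.

```python
import queue
import threading

def solve(subtask: str, context: list[str]) -> tuple[str, list[str]]:
    """PLACEHOLDER for local reasoning; returns (gist, follow-up subtasks)."""
    needs_recheck = "flaky" in subtask and "recheck" not in subtask
    follow_ups = [f"{subtask} (recheck)"] if needs_recheck else []
    return f"done: {subtask}", follow_ups

def verify(gist: str) -> bool:
    """PLACEHOLDER admission check; a real one would compare gist and evidence."""
    return gist.startswith("done: ")

def run_delm(task: str, n_agents: int = 3) -> list[str]:
    tasks = queue.Queue()
    shared: list[str] = []                   # verified gists
    lock = threading.Lock()

    for part in task.split(";"):             # 1. Initialize
        tasks.put(part.strip())

    def agent() -> None:
        while True:
            subtask = tasks.get()            # 2. Claim
            if subtask is None:              # sentinel: no work left
                tasks.task_done()
                return
            with lock:
                context = list(shared)       # 3. Reason over a snapshot
            gist, follow_ups = solve(subtask, context)
            if verify(gist):                 # 4. Admit
                with lock:
                    shared.append(gist)
            for f in follow_ups:             # 5. Iterate: enqueue new subtasks
                tasks.put(f)
            tasks.task_done()

    workers = [threading.Thread(target=agent, daemon=True) for _ in range(n_agents)]
    for w in workers:
        w.start()
    tasks.join()                             # waits for follow-ups as well
    for _ in workers:
        tasks.put(None)
    for w in workers:
        w.join()
    return sorted(shared)                    # 5. Finalize from verified state

result = run_delm("reproduce bug; locate flaky test; write patch")
```

Note that no thread plays the role of a controller: follow-up subtasks are enqueued by whichever agent discovers them, and the main thread only waits for the queue to drain.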

For reasoning trajectories

Compress the useful result directly into a gist: finding, failure, feedback, constraint, or patch summary. Verify that the gist faithfully captures the underlying trajectory before admitting it.

For long source units

Use a hierarchy: raw source → reference-grounded summary → compact gist. Store the gist in shared context; keep the summary/raw source in backing stores for unfolding.
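
One way to picture the write path for a long source unit is below. It is a sketch under assumptions: "reference-grounded" is modeled here as each summary sentence carrying character offsets into the raw source, which may differ from the paper's mechanism. The summarizer is a trivial extractive placeholder, and the document text is made up.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class GroundedSentence:
    text: str
    span: tuple[int, int]        # character offsets into the raw source

def summarize(raw: str) -> list[GroundedSentence]:
    """PLACEHOLDER extractive summarizer: first sentence of each paragraph."""
    out, pos = [], 0
    for para in raw.split("\n\n"):
        start = raw.index(para, pos)
        first = para.split(". ")[0]
        out.append(GroundedSentence(first, (start, start + len(first))))
        pos = start + len(para)
    return out

def is_grounded(raw: str, summary: list[GroundedSentence]) -> bool:
    """Every summary sentence must resolve to the span it references."""
    return all(raw[s.span[0]:s.span[1]] == s.text for s in summary)

def make_gist(summary: list[GroundedSentence], max_words: int = 8) -> str:
    """PLACEHOLDER gist: the leading words of the first summary sentence."""
    return " ".join(summary[0].text.split()[:max_words])

# Made-up source text.
raw = ("Revenue rose 12% in Q3. Growth came from cloud.\n\n"
       "Headcount was flat. Hiring resumes in Q1.")
summary = summarize(raw)
gist = make_gist(summary)

backing_store = {"doc1": {"raw": raw, "summary": summary}}   # for unfolding
shared_context = {"doc1": gist} if is_grounded(raw, summary) else {}
```

Only the gist lands in `shared_context`; the summary and raw text stay in `backing_store`, keyed by the same id so they can be unfolded later.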

Key findings

| Benchmark | What DELM did | Why it matters |
| --- | --- | --- |
| SWE-bench Verified (real GitHub issue fixing) | Best Avg@1, Pass@2, and Pass@4 across Gemini 3 Flash and Claude Opus 4.6 settings. | Shared failures, constraints, and patch summaries improve exploration while lowering redundant work. |
| LongBench-v2 Multi-Doc QA (financial, government, news, legal, academic) | Best average accuracy across GPT-5.4, Claude Sonnet 4.6, Gemini 3 Flash, and DeepSeek-V4-Pro. | Verified hierarchical gists give agents a global map before they inspect details. |
| Ablations | No verification: 60.1% → 55.2%. No hierarchy: 60.1% → 57.7%. | The core components are load-bearing; this is not just “more agents”. |
| DELM + RLM | Hybrid beats either method alone on OOLONG and LongBench-v2 using GPT-5. | Natural-language shared state and code-mediated execution are complementary. |

The strongest empirical claim: DELM converts extra test-time compute into reusable shared progress rather than isolated attempts or controller-mediated summaries.
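
For the ablation figures reported above, the size of each drop is easy to compute. The snippet uses only the percentages quoted in this brief; the variable names are illustrative.

```python
FULL = 60.1   # full DELM score quoted in the ablation results (%)
ablations = {"no verification": 55.2, "no hierarchy": 57.7}

# Drop in percentage points when each component is removed.
drops = {name: round(FULL - score, 1) for name, score in ablations.items()}
for name, drop in drops.items():
    print(f"{name}: -{drop} points")
```

Removing verification costs 4.9 points and removing the hierarchy costs 2.4 points, so in this ablation the admission check accounts for roughly twice the loss of the gist hierarchy.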

Innovation points

1. Coordination substrate

The shared context is the medium of collaboration. Agents do not need all communication routed through a main agent.

2. Verified before reusable

The admission gate prevents plausible but unsupported claims from becoming shared truth.

3. Compact + recoverable

Gists are small enough to be globally visible, while summaries/raw evidence remain recoverable.

4. Failure sharing

Negative results are promoted into constraints, reducing repeated dead-end exploration.

5. Dependency-aware queueing

Subtasks can be made eligible only when dependencies complete; blocked queues can generate missing prerequisite tasks.
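
A dependency-aware queue of this kind might look like the sketch below. It is an assumption-laden illustration, not the paper's data structure: `DependencyQueue` and its methods are hypothetical, and "generating a missing prerequisite" is reduced to enqueueing any dependency that was never added as a task.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class DependencyQueue:
    deps: dict[str, set[str]] = field(default_factory=dict)   # task -> prerequisites
    done: set[str] = field(default_factory=set)
    claimed: set[str] = field(default_factory=set)

    def add(self, task: str, requires: tuple[str, ...] = ()) -> None:
        self.deps[task] = set(requires)

    def ready(self) -> list[str]:
        """Unclaimed tasks whose prerequisites have all completed."""
        return sorted(t for t, reqs in self.deps.items()
                      if t not in self.claimed and reqs <= self.done)

    def claim(self) -> Optional[str]:
        ready = self.ready()
        if not ready:
            return None
        self.claimed.add(ready[0])
        return ready[0]

    def complete(self, task: str) -> None:
        self.done.add(task)

    def unblock(self) -> list[str]:
        """Enqueue prerequisites that are referenced but were never added."""
        referenced = {r for reqs in self.deps.values() for r in reqs}
        missing = sorted(referenced - set(self.deps))
        for m in missing:
            self.add(m)
        return missing

q = DependencyQueue()
q.add("write patch", requires=("locate bug",))
q.add("run tests", requires=("write patch",))
blocked = q.claim()              # nothing is ready yet
added = q.unblock()              # generates the missing prerequisite
order = []
while (task := q.claim()) is not None:
    order.append(task)
    q.complete(task)
```

In the example the queue starts blocked because "locate bug" was never enqueued; once `unblock` adds it, the three tasks become eligible one after another in dependency order.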

6. Hybridization with tools

DELM does not replace programmatic agents. It adds decentralized verified state around them.

Practical takeaway points

For agent engineering

Build a shared state layer, not just supervisor/subagent calls.
Store distilled findings, failures, constraints, and evidence pointers.
Make updates pass an admission check before other agents can rely on them.
Keep shared state compact; keep raw evidence recoverable.

For research agents

Use shared context to avoid rereading the same papers or rerunning failed analyses.
Require evidence-backed claims before synthesis.
Separate global navigation from local evidence inspection.
Pair natural-language state with code/REPL for exact aggregation.

A good implementation pattern: turn every agent output into one of a few typed shared entries (FACT, FAIL, CONSTRAINT, PATCH_SUMMARY, EVIDENCE_POINTER, OPEN_QUESTION), each carrying provenance and a verification status.
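
As a concrete reading of that pattern, the entry types can be an enum and each entry a small immutable record. The type names come from the list above; everything else (the `Status` values, the `provenance` string format, `render`, `visible`) is an assumption made for illustration, and the example entries are made up.

```python
from dataclasses import dataclass
from enum import Enum

class EntryType(Enum):
    FACT = "FACT"
    FAIL = "FAIL"
    CONSTRAINT = "CONSTRAINT"
    PATCH_SUMMARY = "PATCH_SUMMARY"
    EVIDENCE_POINTER = "EVIDENCE_POINTER"
    OPEN_QUESTION = "OPEN_QUESTION"

class Status(Enum):              # assumed status values
    PENDING = "pending"
    VERIFIED = "verified"
    REJECTED = "rejected"

@dataclass(frozen=True)
class SharedEntry:
    type: EntryType
    gist: str
    provenance: str              # who produced it and where the evidence lives
    status: Status = Status.PENDING

    def render(self) -> str:
        """One-line form suitable for inclusion in an agent prompt."""
        return f"[{self.type.value}|{self.status.value}] {self.gist} (from {self.provenance})"

def visible(entries: list[SharedEntry]) -> list[str]:
    """Only verified entries are shown to other agents."""
    return [e.render() for e in entries if e.status is Status.VERIFIED]

entries = [
    SharedEntry(EntryType.FAIL, "Caching the session did not fix the timeout.",
                "agent-2, run log 17", Status.VERIFIED),
    SharedEntry(EntryType.FACT, "The timeout is caused by DNS.",
                "agent-3, unverified guess"),
]
lines = visible(entries)
```

Keeping the status on the entry itself means an unverified claim can be stored for later checking without being shown to other agents.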

Limitation points

Verification overhead

Admission-time checking costs extra calls and latency. The paper argues the reliability gain is worth it, but lighter verifiers are future work.

Decomposition quality

DELM inherits the quality of the generated task topology. If subtasks are too coarse, agents are under-specified; if they are too fine, the system pays for unnecessary agents and coordination overhead.

Natural language is weak for exact aggregation

On OOLONG, vanilla DELM underperforms RLM because exact counting/filtering/tie-handling benefits from executable code.
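
To illustrate why code helps here, the snippet below does a filtered count with explicit tie handling, the kind of operation that is easy to get subtly wrong in prose. The records are made up and are not OOLONG data; `most_common_labels` is a hypothetical helper.

```python
from collections import Counter

def most_common_labels(records, field, where=lambda r: True):
    """Return every label tied for the highest count among filtered records."""
    counts = Counter(r[field] for r in records if where(r))
    if not counts:
        return []
    top = max(counts.values())
    # Sort so that ties are reported deterministically.
    return sorted(label for label, n in counts.items() if n == top)

# Made-up records.
records = [
    {"user": "a", "label": "spam"},
    {"user": "b", "label": "ham"},
    {"user": "a", "label": "ham"},
    {"user": "c", "label": "spam"},
    {"user": "a", "label": "spam"},
]
overall = most_common_labels(records, "label")
without_c = most_common_labels(records, "label", where=lambda r: r["user"] != "c")
```

Over all records "spam" wins three to two, but once user "c" is filtered out the counts tie at two each and both labels are returned. A natural-language summary can easily drop exactly that kind of detail.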

Prompt/model sensitivity

The authors note there is no universally optimal prompt across model families. DELM may need prompt adaptation per model.

A production caveat: the paper demonstrates benchmark gains, but it does not solve every hard systems question around conflict resolution, adversarial entries, long-running memory hygiene, permission boundaries, or human-in-the-loop correction.

Hype points vs grounded read

What is genuinely exciting

It points at a real scaling law for agent systems: not “more agents”, but better shared state.
The verification-before-admission frame is exactly the missing safety rail in many agent swarms.
The memory hierarchy feels OS-like: resident gist, backing summary, raw evidence, demand paging.
The RLM hybrid result suggests this can wrap tool-using agents, not just chatty agents.

What not to overclaim

It is not proof that decentralized agents always beat orchestrators.
It is not a complete autonomous research system by itself.
It still relies on LLM decomposition, summarization, and verification quality.
Some domains need code, databases, or formal checks rather than prose state.

Best compression: DELM is a verified blackboard for LLM agents. The blackboard is small enough to read, grounded enough to trust, and expandable enough to recover detail.