year: 2025/05
paper: https://arxiv.org/abs/2505.22954
website: https://sakana.ai/dgm/ || sakana-ai-dgm
code: https://github.com/jennyzzt/dgm
connections: Gödel Machine, SSA, quality diversity, open-endedness, Jenny Zhang, Jeff Clune
Takes the Gödel Machine’s idea (a self-referential, self-improving AI) but drops the impractical requirement of formal proof search.
Instead: empirical evaluation (like SSA) + population-based open-ended search.
A coding agent that modifies its own Python codebase to become a better coding agent. Self-improvement is the downstream task: better coding ability directly means better self-modification ability. The LLM is the mutation operator (the “primitive learning algorithm”): instead of blind parameter tweaks, modifications are proposed by a foundation model that reads the codebase and writes patches.
Algorithm
1. Initialize the archive with a single base coding agent (intentionally minimal: bash + file-edit tools, frozen FM).
2. Select a parent from the archive (probability roughly proportional to performance, inversely proportional to the number of children already explored).
3. The parent analyzes its own benchmark evaluation logs and proposes the next feature to implement.
4. The parent implements the feature in its own codebase → new agent.
5. Evaluate the new agent on coding benchmarks (SWE-bench, Polyglot).
6. If the new agent compiles and retains basic codebase-editing functionality, add it to the archive; otherwise discard it.
7. Repeat from 2.
All agents in the archive retain non-zero selection probability, so any path to improvement remains feasible. The open-ended exploration loop (archive maintenance, parent selection) is currently fixed and not self-modifiable.
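The loop above can be sketched in a few lines. This is a minimal sketch, not the repo's actual code: `propose_patch`, `evaluate`, and `still_functional` are hypothetical stand-ins for the FM-driven self-modification step, the benchmark run, and the basic-functionality check.

```python
import random
from dataclasses import dataclass

@dataclass
class Agent:
    codebase: str          # the agent's own Python source
    score: float = 0.0     # benchmark performance (e.g. SWE-bench resolve rate)
    children: int = 0      # how many descendants were already explored

def select_parent(archive):
    # Weight roughly proportional to performance and inversely proportional
    # to the number of children already explored. The small constant keeps
    # every agent's selection probability non-zero, so no lineage is pruned.
    weights = [(a.score + 1e-3) / (1 + a.children) for a in archive]
    return random.choices(archive, weights=weights, k=1)[0]

def dgm_step(archive, propose_patch, evaluate, still_functional):
    parent = select_parent(archive)
    parent.children += 1
    child = Agent(codebase=propose_patch(parent))  # FM reads logs + code, writes a patch
    child.score = evaluate(child)                  # run the coding benchmarks
    if still_functional(child):                    # compiles + can still edit code
        archive.append(child)                      # kept even if worse than its parent
    return archive
```

Note that the gate is only a functionality check, not an improvement check: worse-scoring children enter the archive too, which is what lets the search cross fitness valleys.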
The archive is the key difference from SSA-style single-system self-improvement. It lets the search traverse fitness valleys: the path to the best SWE-bench agent went through ancestors that performed worse than their parents. A greedy approach (always branch from the best) would never have found them.
What it discovered
Tool improvements:
- Granular file viewing (by lines instead of whole files)
- Precise editing (string replacement instead of full-file overwrites)
Workflow improvements:
- Multiple solution attempts with FM-based ranking
- Incorporating the history of previous failed attempts when generating new ones
- Patch validation steps
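The tool improvements are simple in isolation. A hedged sketch of what granular viewing and string-replacement editing might look like, assuming illustrative names (`view_lines`, `edit_file` are not the repo's actual API):

```python
from pathlib import Path

def view_lines(path: str, start: int, end: int) -> str:
    """Granular viewing: return only lines start..end (1-indexed, inclusive),
    instead of dumping the whole file into the FM's context."""
    lines = Path(path).read_text().splitlines()
    return "\n".join(lines[start - 1:end])

def edit_file(path: str, old: str, new: str) -> None:
    """Precise editing: replace exactly one occurrence of `old` with `new`.
    Failing on 0 or >1 matches forces the FM to quote enough surrounding
    context to make the edit unambiguous, rather than rewriting the file."""
    text = Path(path).read_text()
    if text.count(old) != 1:
        raise ValueError(f"expected exactly one match for {old!r}, "
                         f"found {text.count(old)}")
    Path(path).write_text(text.replace(old, new))
```

The uniqueness check is the design point: a full-file overwrite silently clobbers everything outside the intended change, while a single-match replacement makes unintended edits loud failures.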
Both self-improvement and open-ended exploration are necessary. Without the archive, the system barely improves at all.
| Ablation | SWE-bench | Polyglot |
|---|---|---|
| DGM | 50.0% | 38.0% |
| w/o self-improve (fixed meta-agent, like ADAS) | 39.0% | 28.0% |
| w/o open-ended exploration (no archive, always branch from latest) | 23.0% | 14.0% |
| Greedy (always branch from best) | 39.7% | 30.0% |
Improvements transfer across FMs, benchmarks, and programming languages
Agent optimized with Claude 3.5 Sonnet also improves with o3-mini and Claude 3.7 Sonnet.
SWE-bench → Polyglot and vice versa; optimized on Python, gains on Rust/C++/Go.
The modifications are general agent design improvements, not model-specific or benchmark-specific tricks.