year: 2025/05
paper: https://arxiv.org/abs/2505.22954
website: https://sakana.ai/dgm/ || sakana-ai-dgm
code: https://github.com/jennyzzt/dgm
connections: Gödel Machine, SSA, quality diversity, open-endedness, Jenny Zhang, Jeff Clune
Takes the Gödel Machine’s idea (a self-referential, self-improving AI) but drops the impractical requirement of formal proof search.
Instead: empirical evaluation (like SSA) + population-based open-ended search.
A coding agent that modifies its own Python codebase to become a better coding agent. Self-improvement is the downstream task: better coding ability directly means better self-modification ability. The LLM is the mutation operator (the “primitive learning algorithm”): instead of blind parameter tweaks, modifications are proposed by a foundation model that reads the codebase and writes patches.
Algorithm
1. Initialize the archive with a single base coding agent (intentionally minimal: bash + file-edit tools, frozen FM).
2. Select a parent from the archive (probability roughly proportional to performance, inversely proportional to the number of children already explored).
3. The parent analyzes its own benchmark evaluation logs and proposes the next feature to implement.
4. The parent implements the feature in its own codebase → new agent.
5. Evaluate the new agent on coding benchmarks (SWE-bench, Polyglot).
6. If the new agent compiles and retains basic codebase-editing functionality, add it to the archive; otherwise discard it.
7. Repeat from 2.
All agents in the archive retain non-zero selection probability, so any path to improvement remains feasible. The open-ended exploration loop (archive maintenance, parent selection) is currently fixed and not self-modifiable.
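The loop above can be sketched in a few lines. This is a minimal sketch, not the repo's actual code: `propose_patch`, `evaluate`, and `still_functional` are hypothetical stand-ins for the FM-driven self-modification step, the benchmark run, and the basic-functionality check.

```python
import random
from dataclasses import dataclass

@dataclass
class Agent:
    codebase: str          # the agent's own Python source
    score: float = 0.0     # benchmark performance (e.g. SWE-bench resolve rate)
    children: int = 0      # how many descendants were already explored

def select_parent(archive):
    # Weight roughly proportional to performance and inversely proportional
    # to the number of children already explored. The small constant keeps
    # every agent's selection probability non-zero, so no lineage is pruned.
    weights = [(a.score + 1e-3) / (1 + a.children) for a in archive]
    return random.choices(archive, weights=weights, k=1)[0]

def dgm_step(archive, propose_patch, evaluate, still_functional):
    parent = select_parent(archive)
    parent.children += 1
    child = Agent(codebase=propose_patch(parent))  # FM reads logs + code, writes a patch
    child.score = evaluate(child)                  # run the coding benchmarks
    if still_functional(child):                    # compiles + can still edit code
        archive.append(child)                      # kept even if worse than its parent
    return archive
```

Note that the gate is only a functionality check, not an improvement check: worse-scoring children enter the archive too, which is what lets the search cross fitness valleys.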
The archive is the key difference from SSA-style single-system self-improvement. It lets the search traverse fitness valleys: the path to the best SWE-bench agent went through ancestors that performed worse than their parents. A greedy approach (always branch from the best) would never have found them.
What it discovered
Tool improvements:
- Granular file viewing (by lines instead of whole files)
- Precise editing (string replacement instead of full-file overwrites)
Workflow improvements:
- Multiple solution attempts with FM-based ranking
- Incorporating the history of previous failed attempts when generating new ones
- Patch validation steps
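The tool improvements are simple in isolation. A hedged sketch of what granular viewing and string-replacement editing might look like, assuming illustrative names (`view_lines`, `edit_file` are not the repo's actual API):

```python
from pathlib import Path

def view_lines(path: str, start: int, end: int) -> str:
    """Granular viewing: return only lines start..end (1-indexed, inclusive),
    instead of dumping the whole file into the FM's context."""
    lines = Path(path).read_text().splitlines()
    return "\n".join(lines[start - 1:end])

def edit_file(path: str, old: str, new: str) -> None:
    """Precise editing: replace exactly one occurrence of `old` with `new`.
    Failing on 0 or >1 matches forces the FM to quote enough surrounding
    context to make the edit unambiguous, rather than rewriting the file."""
    text = Path(path).read_text()
    if text.count(old) != 1:
        raise ValueError(f"expected exactly one match for {old!r}, "
                         f"found {text.count(old)}")
    Path(path).write_text(text.replace(old, new))
```

The uniqueness check is the design point: a full-file overwrite silently clobbers everything outside the intended change, while a single-match replacement makes unintended edits loud failures.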
Both self-improvement and open-ended exploration are necessary. Without the archive, the system barely improves at all.
| Ablation | SWE-bench | Polyglot |
|---|---|---|
| DGM | 50.0% | 38.0% |
| w/o self-improve (fixed meta-agent, like ADAS) | 39.0% | 28.0% |
| w/o open-ended exploration (no archive, always branch from latest) | 23.0% | 14.0% |
| Greedy (always branch from best) | 39.7% | 30.0% |
Improvements transfer across FMs, benchmarks, and programming languages
Agent optimized with Claude 3.5 Sonnet also improves with o3-mini and Claude 3.7 Sonnet.
SWE-bench → Polyglot and vice versa; optimized on Python, gains on Rust/C++/Go.
The modifications are general agent design improvements, not model-specific or benchmark-specific tricks.