year: 2025/09
paper: https://arxiv.org/pdf/2509.24372
website:
code:
connections: evolution strategies, LLM, fine-tuning


check out the paper in detail

Tldr

For the first time, ES is scaled to multi-billion-parameter search spaces by searching directly over the full parameter space of LLMs in fine-tuning tasks. The approach is based on a memory-efficient implementation of an algorithmically simplified ES variant, with support for parallelization within and across GPUs (a minimal sketch of the update loop follows the list below). Performance is compared with state-of-the-art (SOTA) RL methods for fine-tuning various LLMs on a standard reasoning benchmark task, and behavioral differences from RL are analyzed in terms of fine-tuning for conciseness. This version of ES was able to search directly over billions of parameters and exhibited surprisingly good fine-tuning performance compared to RL methods in multiple respects:
• ES only needs response-level rewards, making it a perfect fit for fine-tuning on reasoning tasks that have only sparse long-horizon outcome rewards. In particular, ES obtained significantly better fine-tuned models than RL in the Countdown task with such rewards.
• Counterintuitively, even though ES explores in the parameter space with billions of parameters, it is more sample efficient than RL methods that explore in the action space, which is much smaller. Further, ES was able to find good solutions with a population size of only 30. As a comparison, previous ES implementations (Salimans et al., 2017; Zhang et al., 2017; Lehman et al., 2018; Lorenc & Neruda, 2025) used population sizes of 10,000 or more with much smaller models (i.e., millions of parameters or fewer).
• ES is significantly more robust than RL across different LLMs. While RL fine-tuning failed on some LLMs, ES provided good fine-tuning for all of them. ES benefits from its exploration in parameter space, making it less sensitive to the initial states of the LLMs.
• Whereas RL tends to hack the reward function if no other penalty is added, ES consistently maintains reasonable behaviors during fine-tuning. The main reason is that ES optimizes a solution distribution (Lehman et al., 2018), which is more difficult to hack, while RL optimizes a single solution.
• ES’s behavior is more consistent than RL’s across different runs. This property can significantly reduce the expected cost of fine-tuning.
• Fine-tuning with ES is based on inference, so no backpropagation calculations are needed, which saves a significant amount of GPU memory.
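Below is a minimal sketch of the kind of seed-based, memory-efficient ES update the paper describes: perturbations are regenerated from their random seeds instead of being stored, each perturbed model is scored with a response-level reward (inference only, no backprop), and the mean parameters are moved along a reward-weighted combination of the perturbations. The function names, rank-normalization, and hyperparameters (population size 30, sigma, learning rate) are illustrative assumptions, not the paper's exact recipe.

```python
# Sketch of one generation of a simplified, seed-based ES update over a flat
# parameter vector. reward_fn stands in for "run the perturbed LLM and score
# the full response"; all hyperparameters here are assumptions.
import numpy as np

def es_step(theta, reward_fn, pop_size=30, sigma=0.01, lr=0.05, rng=None):
    """One ES generation: perturb, evaluate response-level rewards, update."""
    rng = rng or np.random.default_rng()
    seeds = rng.integers(0, 2**31 - 1, size=pop_size)
    rewards = np.empty(pop_size)
    for i, seed in enumerate(seeds):
        # Regenerate the perturbation from its seed instead of storing it,
        # so memory stays O(|theta|) regardless of population size.
        eps = np.random.default_rng(seed).standard_normal(theta.shape)
        rewards[i] = reward_fn(theta + sigma * eps)  # inference only, no backprop
    # Rank-normalize rewards (a common ES trick; an assumption here).
    ranks = rewards.argsort().argsort()
    weights = ranks / (pop_size - 1) - 0.5
    # Reconstruct each perturbation again to accumulate the update direction.
    update = np.zeros_like(theta)
    for seed, w in zip(seeds, weights):
        eps = np.random.default_rng(seed).standard_normal(theta.shape)
        update += w * eps
    return theta + lr / (pop_size * sigma) * update

# Toy usage with a dummy reward standing in for an LLM rollout score.
if __name__ == "__main__":
    theta = np.zeros(10)
    dummy_reward = lambda p: -np.sum((p - 1.0) ** 2)
    for _ in range(500):
        theta = es_step(theta, dummy_reward)
    print(theta.round(2))
```

In the actual LLM setting, `theta` would be the model's full parameter vector (or per-GPU shards of it), and `reward_fn` would generate a complete response and return its outcome reward; the seed-only bookkeeping is what keeps the memory footprint close to that of plain inference.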