Evolution Strategies at the Hyperscale

Bidipta Sarkar, Mattie Fellows, Juan Agustin Duque,
Alistair Letcher, Antonio León Villares, Anya Sims, Clarisse Wibault,
Dmitry Samsonov, Dylan Cope, Jarek Liesen, Kang Li, Lukas Seier, Theo Wolf,
Uljad Berdica, Valentin Mohl,
Alexander David Goldie, Aaron Courville, Karin Sevegnani,
Shimon Whiteson, Jakob Nicolaus Foerster.

$^1$FLAIR, University of Oxford; $^2$WhiRL, University of Oxford; $^3$MILA – Québec AI Institute; $^4$NVIDIA AI Technology Center; $^5$CIFAR AI Chair; $^6$NormaCore.dev
{bidipta.sarkar, matthew.fellows, jakob.foerster}@eng.ox.ac.uk, juan.duque@mila.quebec, shimon.whiteson@cs.ox.ac.uk

Abstract

Evolution Strategies (ES) is a class of powerful black-box optimisation methods that are highly parallelisable and can handle non-differentiable and noisy objectives. However, naïve ES becomes prohibitively expensive at scale on GPUs due to the low arithmetic intensity of batched matrix multiplications with unstructured random perturbations. We introduce Evolution Guided GeneRal Optimisation via Low-rank Learning (EGGROLL), which improves arithmetic intensity by structuring individual perturbations as rank-$r$ matrices, resulting in a hundredfold increase in training speed for billion-parameter models at large population sizes, achieving up to 91% of the throughput of pure batch inference. We provide a rigorous theoretical analysis of Gaussian ES for high-dimensional parameter objectives, investigating conditions needed for ES updates to converge in high dimensions. Our results reveal a linearising effect and prove consistency between EGGROLL and ES as the parameter dimension increases. Our experiments show that EGGROLL: (1) enables the stable pretraining of nonlinear recurrent language models that operate purely in integer datatypes, (2) is competitive with GRPO for post-training LLMs on reasoning tasks, and (3) does not compromise performance compared to ES in tabula rasa RL settings, despite being faster. Our code is available at eshyperscale.github.io.

Figure 1: Schematic visualisation of EGGROLL using $N$ workers. Initial weights are perturbed by rank-one matrices, fitness is evaluated for each perturbation, and a weighted average of the perturbations forms the final rank-$N$ update.

1 Introduction

Evolution Strategies (ES) (Rechenberg, 1978; Beyer, 1995; Beyer & Schwefel, 2002) is an attractive alternative to first-order methods based on gradient backpropagation for several reasons. First, ES does not require differentiability; it can optimise a broader class of models, like those with discrete parametrisations (cellular

*Equal Contribution. †Core Contributor, sorted alphabetically by first name. ‡Equal Senior Authors.

Figure 2: (a) Relative experience throughput of our method, EGGROLL (91), versus PPO (34) and OpenES (0.41), where 100 is the maximum batch inference throughput; see Appendix E for more details. (b) Test loss over training steps when using EGGROLL to train an int8 RNN language model from scratch, scaling the population size from 2 to 1,048,576 with a fixed data batch size of 16. The dotted line is an fp32 Transformer trained with backprop SGD. EGGROLL reaches a test next-token cross-entropy of 3.40 bits/byte, while backprop reaches only 3.58 bits/byte.

automata) or objectives for which gradients are unavailable or noisy, such as outcome-only rewards in LLM fine-tuning (Qiu et al., 2025). Second, ES can be more robust to noisy and ill-conditioned optimisation landscapes (Wierstra et al., 2011; Xue et al., 2021). Population-based exploration smooths irregularities (Salimans et al., 2017), tolerates discontinuities, and mitigates issues like ill-conditioned curvature or vanishing and exploding gradients in long-range or recurrent settings (Hansen, 2023). Third, ES is highly amenable to parallel scaling, since fitness evaluations are independent across population members and require only the communication of scalar fitnesses, which maps cleanly onto modern inference infrastructure and yields near-linear speedups on large clusters (Salimans et al., 2017). By contrast, backpropagation requires communicating and aggregating gradients across devices, yielding updates with high memory and computational costs. Furthermore, backpropagation requires special care when training models with low-precision datatypes (Fishman et al., 2025), whereas ES can directly optimise any model with the same datatypes used at inference time. Together, these properties position ES as a potentially powerful tool for training large, discrete, or hybrid architectures, and end-to-end systems with non-differentiable components, including LLMs (Brown et al., 2020; Chowdhery et al., 2023; Du et al., 2022; Fedus et al., 2022).

However, there are currently practical obstacles to employing ES at scale. In deep learning architectures (Goodfellow et al., 2016), the majority of trainable parameters form linear mappings represented by matrices (Rosenblatt, 1962; Hochreiter & Schmidhuber, 1996; Bengio et al., 2000; Krizhevsky et al., 2012; Goodfellow et al., 2014; Kingma & Welling, 2014; Vaswani et al., 2017). Naïvely adapting ES therefore requires generating full-rank matrix perturbations that replicate the entire parameter set for every population member. This inflates memory costs and forces frequent movement of large weight tensors. Evaluating these perturbations then requires a separate sequence of matrix multiplications per member, so the total compute and wall-clock time scale roughly with the population size and sequence length since batched matrix multiplication has a low arithmetic intensity, i.e., the ratio of arithmetic operations to memory traffic (Williams, 2008). In billion-parameter regimes, these two costs dominate, limiting ES to small models and small populations (Qiu et al., 2025; Korotyshova et al., 2025).

To mitigate both memory and computational bottlenecks, we introduce Evolution Guided GeneRal Optimisation via Low-rank Learning (EGGROLL), an ES algorithm that allows for the efficient training of neural network architectures with billions of parameters. Analogous to LoRA's low-rank adapters in gradient-based training (Hu et al., 2022), EGGROLL generates low-rank parameter-space perturbations for ES; instead of sampling a full-rank matrix $E \in \mathbb{R}^{m \times n}$, we sample $A \in \mathbb{R}^{m \times r}$ and $B \in \mathbb{R}^{n \times r}$ with $r \ll \min(m, n)$ and form $E = \frac{1}{\sqrt{r}} A B^\top$. This reduces auxiliary perturbation matrix storage from $\mathcal{O}(mn)$ to $\mathcal{O}((m+n)r)$ per layer, and proportionally reduces tensor movement.

Moreover, we use a counter-based deterministic random number generator (RNG) (Salmon et al., 2011; Bradbury et al., 2018) to reconstruct noise on demand, so matrix perturbations need not persist in memory. When evaluating the fitness of multiple population members in parallel, EGGROLL batches a population of low-rank adapters and shares the base activations, enabling a single forward pass that applies all $A B^\top$ updates via specialised batched matrix multiplications with significantly higher arithmetic intensity. This results in over a hundredfold increase in training throughput for large neural networks at large population sizes, as shown in Figure 2a. Crucially, EGGROLL does not restrict updates to be low-rank: the overall update is a weighted average of rank-$r$ matrices across the population, making the matrix parameter update rank $\min(Nr, m, n)$.

To understand ES when applied to large-parameter models, we analyse the convergence properties of general Gaussian ES in high dimensions, showing there exists a critical noise scaling $\sigma_d = o(d^{-1/2})$ under which the update provably linearises and converges to the first-order derivative for a broad class of (possibly discontinuous) objectives. We identify three distinct regimes (linearisation, critical, and divergence) and establish provably tight conditions for stable ES optimisation in large models. Building on this, we extend the analysis to EGGROLL and prove that even fixed low-rank updates (including rank-1) converge to the true ES gradient as dimension grows, despite heavier-tailed perturbations. Our results explain the empirical success of EGGROLL in high-dimensional neural networks and connect its behaviour to neural tangent kernel-style linearisation (Jacot et al., 2018), yielding explicit convergence rates under standard overparameterised regimes. We also provide a rigorous theoretical analysis of the low-rank approximation accuracy, proving that EGGROLL updates converge to the full-rank Gaussian ES updates at a fast $\mathcal{O}(r^{-1})$ rate.

Furthermore, in our extensive empirical evaluation, we test EGGROLL across a wide range of domains. In tabula rasa and multi-agent RL (MARL) settings, we show that EGGROLL does not compromise performance compared to naïve ES despite being faster. We demonstrate the scalability of EGGROLL for LLM fine-tuning with experiments on pretrained RWKV7 (Peng et al., 2025) models, modern recurrent language models that enable large-batch inference due to their constant state size. Finally, we develop a nonlinear RNN language model that operates purely in integer datatypes, and demonstrate that EGGROLL can stably pretrain this language model, a feat that is only feasible due to the large population sizes enabled by EGGROLL.

2 Preliminaries

2.1 Low-Rank Matrix Approximations

When adapting high-dimensional foundation models for specific tasks, updating the parameters using gradient-based methods has high memory requirements. LoRA (Hu et al., 2022) applies low-rank approximations to the matrix multiplications to reduce these costs. For each matrix $M_i \in \mathbb{R}^{m \times n}$ in the model, a low-rank approximation can be made by decomposing each matrix:

$$M_i = M_i^0 + A_i B_i^\top,$$

where $M_i^0$ is the imported matrix from the foundation model with frozen parameters and $A_i \in \mathbb{R}^{m \times r}$ and $B_i \in \mathbb{R}^{n \times r}$ are low-width column matrices (i.e., $r \ll \min(m, n)$) whose parameters are updated through gradient-based optimisation during task-specific adaptation. This reduces the number of optimisation parameters for each matrix from $mn$ to $(m+n)r$. EGGROLL uses a similar low-rank approximation for evolution strategies.
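A minimal sketch of this decomposition (shapes and the zero-initialisation of $B$ are illustrative conventions, not prescribed by the text):

```python
import numpy as np

# LoRA-style decomposition: the base matrix M0 stays frozen; only the
# low-rank factors A and B would be trained.
rng = np.random.default_rng(1)
m, n, r = 64, 32, 2

M0 = rng.standard_normal((m, n))        # frozen pretrained weights
A = rng.standard_normal((m, r))
B = np.zeros((n, r))                    # common init: B = 0, so M starts at M0

M = M0 + A @ B.T                        # adapted weight matrix
trainable_params = A.size + B.size      # (m + n) * r, versus m * n for M itself
```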

2.2 Evolution Strategies

Evolution strategies (ES) (Rechenberg, 1978; Beyer, 1995; Beyer & Schwefel, 2002) is a set of black-box optimisation methods that has emerged as a useful alternative to first-order gradient-based methods like stochastic gradient descent (SGD), particularly for noisy or non-differentiable systems. Let $f : \mathbb{R}^d \to \mathbb{R}$ denote an objective to be optimised, known as the fitness, where the goal is to find an optimising set of parameters $x^\star$. Each set of parameters is collected into a $d$-dimensional vector known as a genotype. We denote the derivative of the fitness evaluated at $x$ as $\nabla_x f(x)$. Unlike first-order gradient-based methods, which query derivatives to update the vector of parameters $x$, evolutionary methods update a parametric population distribution over the fitness parameter space $\pi(x \mid \theta)$, which is smoothly parametrised by a separate set of parameters $\theta$. The population distribution generates perturbations $x \sim \pi(\cdot \mid \theta)$ known as mutations. The problem of optimising the fitness for $x$ reduces to optimising the parameters $\theta$ of the population distribution. This is achieved by solving a secondary optimisation problem to maximise the expected fitness under $\pi(\cdot \mid \theta)$:

$$J(\theta) := \mathbb{E}_{x \sim \pi(\cdot \mid \theta)}\left[f(x)\right].$$

Introducing a population distribution smooths the fitness landscape; since $\pi(x \mid \theta)$ is smooth in $\theta$, the resulting objective $J(\theta)$ is also smooth in $\theta$, provided $f(x)$ is measurable and integrable but not necessarily differentiable. Evolution strategies can therefore optimise black-box problems that may be non-differentiable, as the derivatives of $J(\theta)$ exist even for fitness functions that are discontinuous, yielding a gradient with respect to $\theta$:

$$\nabla_\theta J(\theta) = \mathbb{E}_{x \sim \pi(\cdot \mid \theta)}\left[f(x) \nabla_\theta \log \pi(x \mid \theta)\right],$$

where $\nabla_\theta \log \pi(x \mid \theta)$ is known as the score function. A Monte Carlo estimate is formed by sampling search mutations $x_i \sim \pi(\cdot \mid \theta)$ and computing an average of the score-weighted fitnesses:

$$\hat{\nabla}_\theta J(\theta) = \frac{1}{N} \sum_{i=1}^{N} f(x_i) \nabla_\theta \log \pi(x_i \mid \theta), \tag{1}$$

with which we update $\theta$ via stochastic gradient ascent with a suitable stepsize $\alpha_t$:

$$\theta_{t+1} = \theta_t + \alpha_t \hat{\nabla}_\theta J(\theta_t).$$

ES does not require taking derivatives through the fitness function; instead, the Monte Carlo update in Eq. (1) only requires evaluation of $f(x_i)$ for each mutation $x_i$ to estimate the gradient of $J$. As ES only queries $f(x)$ and not $\nabla_x f(x)$, it is a zeroth-order optimisation method.

In this paper, we study ES using Gaussian population distributions: $\pi(x \mid \theta) = \mathcal{N}(x; \mu, \sigma^2 I_d)$. In addition to its mathematical convenience, the central limit theorem means that the Gaussian distribution emerges naturally from the EGGROLL low-rank approximation as rank increases, even if the matrices $A$ and $B$ are themselves non-Gaussian. Moreover, most widely-used ES algorithms assume Gaussian population distributions (Rechenberg, 1978; Schwefel, 1995; Hansen & Ostermeier, 2001a; Beyer & Schwefel, 2002; Auger & Hansen, 2011; Wierstra et al., 2011; Salimans et al., 2017). In our setting, ES optimises over the population mean $\mu \in \mathbb{R}^d$, which acts as a proxy for the true maximum of the fitness function, and the variance parameter $\sigma^2$ is treated as a hyperparameter to be tuned.

For the Gaussian population distribution we study in this paper, the ES update can be written using an expectation under a standard normal distribution by making a transformation of variables (Wierstra et al., 2011; Salimans et al., 2017):

$$\nabla_\mu J(\mu) = \frac{1}{\sigma} \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I_d)}\left[\epsilon f(\mu + \sigma \epsilon)\right], \tag{2}$$

where $x = \mu + \sigma \epsilon$ and $\mathcal{N}(\epsilon; 0, I_d)$ denotes the density of a standard normal distribution. In this form, Eq. (2) shows that Gaussian ES methods optimise the fitness by generating search vectors from a standard normal distribution around the mean parameter $\mu$.
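This Gaussian ES update can be sketched end to end on a toy problem. The quadratic fitness, hyperparameters, and mean-subtraction baseline below are all illustrative choices, not the paper's setup:

```python
import numpy as np

# Toy Gaussian ES in the form of Eq. (2), maximising an invented
# smooth fitness f(x) = -||x - target||^2 by gradient ascent on mu.
rng = np.random.default_rng(2)
d, sigma, alpha, N = 10, 0.1, 0.05, 256
target = np.ones(d)
fitness = lambda x: -np.sum((x - target) ** 2)

mu = np.zeros(d)
for _ in range(300):
    eps = rng.standard_normal((N, d))              # mutations epsilon_i
    f = np.array([fitness(mu + sigma * e) for e in eps])
    f = f - f.mean()                               # baseline for variance reduction
    grad_est = eps.T @ f / (sigma * N)             # Monte Carlo estimate of grad J
    mu = mu + alpha * grad_est                     # stochastic gradient ascent
```

Note that only fitness evaluations are used; no derivative of `fitness` is ever queried, which is the zeroth-order property described above.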

2.3 Evolution Strategies for Matrix Parameters

A key focus of this paper is to develop efficient methods for evolution strategies that target matrix parameters. When working in matrix space, it is convenient to use the matrix Gaussian distribution (Dawid, 1981), which is defined directly over matrices $X \in \mathbb{R}^{m \times n}$:

$$\mathcal{MN}(X; M, U, V) := \frac{\exp\left(-\frac{1}{2} \operatorname{tr}\left[V^{-1} (X - M)^\top U^{-1} (X - M)\right]\right)}{(2\pi)^{mn/2} \, \lvert V \rvert^{m/2} \, \lvert U \rvert^{n/2}},$$

where $M \in \mathbb{R}^{m \times n}$ is the mean matrix, $U \in \mathbb{R}^{m \times m}$ is the row covariance matrix and $V \in \mathbb{R}^{n \times n}$ is the column covariance matrix. We use $\operatorname{vec}(\cdot)$ to denote the vectorisation operator, which stacks the columns of a matrix into a single vector:

$$\operatorname{vec}(M) := \left[M_{1,1}, \ldots, M_{m,1}, M_{1,2}, \ldots, M_{m,2}, \ldots, M_{1,n}, \ldots, M_{m,n}\right]^\top.$$

The matrix Gaussian distribution is a generalisation of the multivariate Gaussian distribution defined over vector space. Sampling a matrix $X \sim \mathcal{MN}(M, U, V)$ is equivalent to sampling a vector $\operatorname{vec}(X) \sim \mathcal{N}(\mu, \Sigma)$ from a multivariate Gaussian distribution with mean $\mu = \operatorname{vec}(M)$ and covariance matrix $\Sigma = V \otimes U$, where $\otimes$ denotes the Kronecker product. For isotropic matrix Gaussian distributions with covariance matrices $U = \sigma I_m$ and $V = \sigma I_n$, the equivalent multivariate Gaussian distribution is also isotropic with $\Sigma = \sigma^2 I_{mn}$. We denote the $L^2$ vector norm as $\lVert \cdot \rVert_2$ and, to measure distance between matrices, we use the Frobenius norm:

$$\lVert M \rVert_F := \sqrt{\sum_{i=1}^{m} \sum_{j=1}^{n} M_{i,j}^2} = \lVert \operatorname{vec}(M) \rVert_2,$$

which provides an upper bound on the matrix 2-norm (Petersen & Petersen, 2012). Let $W \in \mathbb{R}^{m \times n}$ be a set of matrix parameters where $\operatorname{vec}(W)$ forms a subset of the full parameter vector $x$, typically parametrising the weights of a linear layer in a neural network. As we derive in Section B, the Gaussian ES update associated with the matrix $W$ is:

$$\nabla_M J(M) = \frac{1}{\sigma} \mathbb{E}_{E \sim p(E)}\left[E f(M + \sigma E)\right], \tag{3}$$

where $M$ is the mean matrix associated with $W$, i.e. $\operatorname{vec}(M)$ forms a subset of $\mu$, and $p(E)$ is a zero-mean standard matrix normal distribution: $p(E) = \mathcal{MN}(E; 0, I_m, I_n)$. The gradient in Eq. (3) is estimated using the Monte Carlo estimate:

$$\hat{\nabla}_M J(M) = \frac{1}{\sigma N} \sum_{i=1}^{N} E_i f(M + \sigma E_i),$$

by sampling $N$ search matrices $E_i \sim \mathcal{MN}(0, I_m, I_n)$ from a standard matrix normal distribution around the mean parameter matrix $M$, which is updated via stochastic gradient ascent:

$$M_{t+1} = M_t + \alpha_t \hat{\nabla}_M J(M_t).$$
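The full-rank matrix ES update of Eq. (3) can be sketched on a toy matrix objective. The quadratic fitness, shapes, and hyperparameters below are illustrative, and the mean-subtraction baseline is a common variance-reduction choice rather than part of the equations above:

```python
import numpy as np

# Matrix-parameter Gaussian ES with full-rank perturbations
# E_i ~ MN(0, I_m, I_n), maximising an invented quadratic fitness.
rng = np.random.default_rng(3)
m, n, sigma, alpha, N = 8, 6, 0.1, 0.05, 512
T = rng.standard_normal((m, n))
fitness = lambda W: -np.sum((W - T) ** 2)

M = np.zeros((m, n))
for _ in range(200):
    E = rng.standard_normal((N, m, n))             # full-rank search matrices
    f = np.array([fitness(M + sigma * Ei) for Ei in E])
    f = f - f.mean()                               # baseline for variance reduction
    grad_est = np.einsum('i,imn->mn', f, E) / (sigma * N)
    M = M + alpha * grad_est                       # stochastic gradient ascent
```

Each population member here materialises a dense $m \times n$ matrix, which is exactly the cost EGGROLL avoids in the next section.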

3 Related Work

3.1 Evolutionary Algorithms

Evolutionary algorithms have long been a compelling alternative to backpropagation-based training methods (e.g., genetic algorithms (Such et al., 2018) or symbolic evolution (Koza, 1994)). Much research in evolution has focused on developing algorithms for deep learning that scale well to distributed parallel computation (Jaderberg et al., 2017; Hansen & Ostermeier, 2001b; Salimans et al., 2017). These approaches have increased in popularity following the application of ES to policy learning in deep RL environments (Salimans et al., 2017). Since then, evolution has been widely applied in other domains, such as meta-learning (e.g., (Lu et al., 2022; Metz et al., 2022; Lange et al., 2023; Goldie et al., 2024; 2025)), hyperparameter tuning (e.g., (Parker-Holder et al., 2021; Tani et al., 2021; Vincent & Jidesh, 2023)), and drug discovery (Towers et al., 2025). ES has also enabled the development of neural network architectures that are unsuitable for backpropagation, such as activation-free models that exploit floating point rounding error as an implicit nonlinearity (Foerster, 2017). Here, we consider how to apply ES at a scale beyond the small networks and population sizes of prior work. For example, Salimans et al. (2017) use a maximum population size of 1440, whereas we use over a million.

While low-rank structures have been used in prior evolutionary algorithms, they have been applied to different ends, with different trade-offs, relative to EGGROLL. Choromanski et al. (2019) use a low-rank search space found via principal component analysis, which provides a better search direction to more efficiently use small populations. Garbus & Pollack (2025) optimise a low-rank factorisation instead of the full dense matrix with neuroevolution, achieving similar computational gains to EGGROLL but remaining limited to the low-rank structure regardless of population size.

3.2 Evolution Strategies for LLMs

Although gradient backpropagation is typically used for LLM training and fine-tuning, prior work explores ES variants for fine-tuning. In particular, Zhang et al. (2024)'s two-point zeroth-order gradient estimator, which can be viewed as an ES-inspired method using a single perturbation direction and two function queries per update, is used by Malladi et al. (2023) for memory-efficient LLM fine-tuning. Yu et al. (2025) extend this approach by projecting perturbations to a low-rank subspace, improving convergence. Jin et al. (2024) perform ES directly on LoRA matrices. These works focus on supervised fine-tuning and report performance comparable to full fine-tuning, but do not address whether pretraining is possible with two-point zeroth-order methods; we find that large population sizes are necessary for pretraining, indicating such methods are unsuitable here.

Recent work also explores ES in the context of LLM reasoning. Korotyshova et al. (2025) first train LoRA adapters using supervised fine-tuning (SFT) before decomposing them into fixed SVD bases alongside singular values that are trained using CMA-ES. They achieve comparable performance to GRPO (Shao et al., 2024) in significantly less wall-clock time on maths reasoning benchmarks. Qiu et al. (2025) directly use ES to optimise all LLM parameters for reasoning, with stronger performance than GRPO on the countdown reasoning task. However, both of these approaches use relatively small population sizes, on the order of a hundred unique perturbations per update, and instead collect hundreds of rollouts per perturbation to efficiently use GPUs. By contrast, our approach allows all generations to use different perturbations, such that our maximum population size per update is orders of magnitude larger (equal to the maximum inference batch size), without compromising token generation throughput.

4 EGGROLL

We now introduce EGGROLL (Algorithm 1). A practical issue with using a low-rank matrix approximation is that its distribution and score function have no analytic solution except for degenerate cases, so in Section 4.1 we derive the EGGROLL approximate score function from the limiting high-rank Gaussian. Section 4.2 describes how to efficiently implement EGGROLL on modern hardware.

4.1 Low-Rank Evolution Strategies

Recall the Gaussian matrix ES update from Eq. (3). Our goal is to introduce a tractable approximation to generating full-rank matrices by using low-rank matrices $A B^\top$ as our search matrices instead. Let $p(A)$ and $p(B)$ denote the distributions of $A \in \mathbb{R}^{m \times r}$ and $B \in \mathbb{R}^{n \times r}$.

Assumption 1 (I.I.D. Sampling). Assume all elements of $A$ and $B$ are continuous, identically and independently distributed random variables according to some zero-mean, symmetric, absolutely continuous distribution with finite fourth-order moments and unit variance.

Algorithm 1 EGGROLL()
initialise $M_0$ and $N$ workers with known random seeds
for timesteps $t = 0, 1, \ldots$ do
    for each worker $i = 1, \ldots, N$ in parallel do
        sample $A_{i,t}$, $B_{i,t}$ using the worker's seed
        $E_{i,t} \leftarrow \frac{1}{\sqrt{r}} A_{i,t} B_{i,t}^\top$
        $f_{i,t} \leftarrow f(M_t + \sigma E_{i,t})$
    end for
    workers share scalar fitness $f_{i,t}$ with other workers
    for each worker $i$ in parallel do
        reconstruct $E_{j,t}$ for $j = 1, \ldots, N$ from the known seeds
        $M_{t+1} \leftarrow M_t + \frac{\alpha_t}{N} \sum_{j=1}^{N} f_{j,t} E_{j,t}$
    end for
end for
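The seed-based noise reconstruction in Algorithm 1 means a worker never transmits its perturbation, only its scalar fitness: any worker can regenerate $E_j$ locally. A minimal sketch, where the seeding convention is illustrative rather than the paper's counter-based scheme:

```python
import numpy as np

# Deterministic perturbation reconstruction: given a shared base seed
# plus (worker, step) indices, every worker derives the same stream.
def perturbation(seed, worker, step, m, n, r):
    rng = np.random.default_rng([seed, worker, step])  # deterministic stream
    A = rng.standard_normal((m, r))
    B = rng.standard_normal((n, r))
    return (A @ B.T) / np.sqrt(r)

# Worker 0 and worker 1 independently reconstruct worker 3's noise
# at timestep 7 without any tensor communication.
E_on_worker_0 = perturbation(42, worker=3, step=7, m=16, n=8, r=1)
E_on_worker_1 = perturbation(42, worker=3, step=7, m=16, n=8, r=1)
```

Only the scalar fitnesses cross the network, which is what makes the communication cost of Algorithm 1 independent of model size.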

This assumption is easily satisfied for most perturbation distributions used by ES, including members from the set of generalised Gaussian distributions like Laplace, normal, and uniform distributions. We then form a low-rank search matrix: $E = \frac{1}{\sqrt{r}} A B^\top$. The $\frac{1}{\sqrt{r}}$ scaling ensures the variance of $E$ remains bounded for all $r$. We denote the induced distribution of $E$ as $P(E)$, which maps to the manifold of rank-$r$ matrices. Hence, the density $P(E)$ is defined with respect to a unit volume on the manifold and cannot be defined with respect to the standard unit volume in Euclidean space. For the corresponding score function, gradients of $\log P(E)$ are not defined over the usual Euclidean space. Instead, we use an approximation for the score function, yielding our low-rank update:

$$\nabla_M J(M) \approx \frac{1}{\sigma} \mathbb{E}_{A \sim p(A),\, B \sim p(B)}\left[\hat{s}(E) f(M + \sigma E)\right], \quad E = \frac{1}{\sqrt{r}} A B^\top. \tag{4}$$

In our experiments, analysis and Algorithm 1, we use a Gaussian approximate score function:

$$\hat{s}(E) = E, \tag{5}$$

which is the score function for the Gaussian distribution $\mathcal{MN}(0, I_m, I_n)$. This choice is motivated by two theoretical insights from Section 5. First, the matrix $A B^\top$ can be decomposed as a sum of $r$ independent, zero-mean vector outer products. Under Assumption 1, the central limit theorem applies to this sum of variables, proving that $P(E)$ converges in distribution to a Gaussian $\mathcal{MN}(0, I_m, I_n)$ as the rank $r$ increases, recovering the approximate Gaussian score in the limit. Secondly, we investigate the convergence of ES and EGGROLL as the number of parameters grows, proving both updates converge to a linearised form that is consistent with the EGGROLL update using the Gaussian approximate score function.
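The central limit theorem argument can be checked empirically. In the sketch below (sample sizes and ranks are arbitrary), each entry of $E = \frac{1}{\sqrt{r}} A B^\top$ is a scaled sum of $r$ independent products, so its variance stays at 1 for every $r$ while its excess kurtosis shrinks toward the Gaussian value of 0 as $r$ grows:

```python
import numpy as np

# Each entry of E is sum_k A_ik * B_jk / sqrt(r). Even with Gaussian
# factors, a rank-1 product is heavy-tailed (excess kurtosis ~ 6);
# higher ranks approach a Gaussian entry distribution.
rng = np.random.default_rng(4)

def entry_samples(r, n_samples=100_000):
    a = rng.standard_normal((n_samples, r))
    b = rng.standard_normal((n_samples, r))
    return (a * b).sum(axis=1) / np.sqrt(r)   # one entry of E per sample

def excess_kurtosis(x):
    x = x - x.mean()
    return np.mean(x ** 4) / np.mean(x ** 2) ** 2 - 3.0

low_rank = entry_samples(r=1)      # heavy-tailed
high_rank = entry_samples(r=64)    # near-Gaussian
```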

EGGROLL is not wedded to any particular score function approximator, and we derive and explore a set of mean-field approximators in Appendix D.1 as alternatives. However, our experiments show that the Gaussian approximator has the best overall performance on the tasks we consider. To optimise the ES objective using the EGGROLL update, we adapt the parallelised evolution strategies algorithm from Salimans et al. (2017). We make a Monte Carlo estimate of the expectation in Eq. (4) with $N_{\mathrm{workers}}$ samples to optimise the mean matrix parameters $M$ using (approximate) stochastic gradient ascent. This yields the Gaussian EGGROLL update:

EGGROLL UPDATE: For each worker $i$ (in parallel), sample $A_{i,t} \sim p(A)$, $B_{i,t} \sim p(B)$, and form a low-rank perturbation $E_{i,t} = \frac{1}{\sqrt{r}} A_{i,t} B_{i,t}^\top$. Update matrix parameters using:

$$M_{t+1} = M_t + \frac{\alpha_t}{N_{\mathrm{workers}}} \sum_{i=1}^{N_{\mathrm{workers}}} f(M_t + \sigma E_{i,t}) E_{i,t}. \tag{6}$$

Here we absorb the constant $\frac{1}{\sigma}$ into the tunable learning rate $\alpha_t$. As each random matrix $E_{i,t}$ in Eq. (6) has rank $r$ almost surely and the matrix is updated using a sum of $N_{\mathrm{workers}}$ such matrices, the overall EGGROLL matrix parameter update has rank $\min(Nr, m, n)$ almost surely, i.e., the overall parameter update is not restricted to be low-rank. For all experiments in Section 6, $Nr \geq \min(m, n)$, i.e., EGGROLL parameter updates are full-rank.
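The full EGGROLL update of Eq. (6) can be sketched on a toy matrix objective; shapes, hyperparameters, the quadratic fitness, and the mean-subtraction baseline are all illustrative choices:

```python
import numpy as np

# One EGGROLL training loop: rank-1 perturbations per population member,
# but the aggregated update has rank min(N*r, m, n).
rng = np.random.default_rng(5)
m, n, r, sigma, alpha, N = 8, 6, 1, 0.1, 0.05, 512
T = rng.standard_normal((m, n))
fitness = lambda W: -np.sum((W - T) ** 2)

M = np.zeros((m, n))
for _ in range(200):
    A = rng.standard_normal((N, m, r))
    B = rng.standard_normal((N, n, r))
    E = np.einsum('imr,inr->imn', A, B) / np.sqrt(r)   # E_i = A_i B_i^T / sqrt(r)
    f = np.array([fitness(M + sigma * Ei) for Ei in E])
    f = f - f.mean()                                   # baseline for variance reduction
    M = M + alpha * np.einsum('i,imn->mn', f, E) / (sigma * N)
```

Despite every individual perturbation being rank 1, with $N \geq \min(m, n)$ the summed update is full-rank, matching the discussion above.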

4.2 Hardware-Efficient Implementation

A key reason to use EGGROLL over standard ES is that large populations can be simulated in parallel on a GPU thanks to the low-rank perturbations. For the sake of exposition, we write equations from the perspective of a single worker, $i$, and explain in text how this corresponds to batched GPU operations. Consider the task of computing a batched forward pass over inputs $u_i \in \mathbb{R}^{d_{\mathrm{in}}}$ for a linear layer with mean parameter $M \in \mathbb{R}^{d_{\mathrm{out}} \times d_{\mathrm{in}}}$. The standard forward pass is just a regular matrix multiplication, $u_i M^\top$, since $M$ is constant across all threads. In contrast, naïvely applying ES by trying to compute $u_i (M + \sigma E_i)^\top$ becomes a batched matrix multiplication, which is inefficient on GPUs since every element of $E_i$ is only used in a single multiplication, yielding poor arithmetic intensity.

However, with EGGROLL we know that $u_i (M + \frac{\sigma}{\sqrt{r}} A_i B_i^\top)^\top = u_i M^\top + \frac{\sigma}{\sqrt{r}} (u_i B_i) A_i^\top$, which improves arithmetic intensity since it preserves the efficient general matrix multiplication used in batched inference while adding some additional cheap work per perturbation. In this context, the bulk of compute is spent on the efficient calculation of $u_i M^\top$ using regular matrix multiplication. Meanwhile, when $r = 1$, $u_i B_i$ simply becomes an inexpensive batch of vector-vector dot products of length $d_{\mathrm{in}}$ to get a batch of scalars, which is then processed by a batched scalar-vector multiplication when multiplying by $A_i^\top$. This decomposition is key to efficient batched LoRA inference, such as that used by vLLM (Kwon et al., 2023), which is why EGGROLL achieves the same speeds as batched LoRA inference systems. Batched LoRA inference has high arithmetic intensity, enabling us to saturate compute with many unique perturbations per input. Note that this is impossible with naïve ES, because each perturbation requires a separate matrix-vector multiplication, setting an upper bound of 1 on arithmetic intensity regardless of population size; see Appendix F for a full derivation.
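The identity behind this fast forward pass can be verified numerically; the shapes below are illustrative:

```python
import numpy as np

# Check: u_i (M + (sigma/sqrt(r)) A_i B_i^T)^T
#        = u_i M^T + (sigma/sqrt(r)) (u_i B_i) A_i^T.
# The left side needs a dense per-member matmul; the right side shares
# the single u M^T matmul across the population plus cheap low-rank work.
rng = np.random.default_rng(6)
d_in, d_out, r, sigma, N = 32, 16, 1, 0.1, 4

M = rng.standard_normal((d_out, d_in))
u = rng.standard_normal((N, d_in))                  # one input per member
A = rng.standard_normal((N, d_out, r))
B = rng.standard_normal((N, d_in, r))

# naive ES path: materialise each perturbed matrix, then batched matmul
E = np.einsum('ipr,iqr->ipq', A, B) / np.sqrt(r)    # E_i = A_i B_i^T / sqrt(r)
naive = np.einsum('iq,ipq->ip', u, M[None] + sigma * E)

# EGGROLL path: one shared dense matmul + low-rank correction per member
shared = u @ M.T                                    # reused by every member
uB = np.einsum('iqr,iq->ir', B, u)                  # batch of r-dim dot products
fast = shared + (sigma / np.sqrt(r)) * np.einsum('ir,ipr->ip', uB, A)
```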

We additionally optimise the update by not explicitly materialising the individual $E_i$ in the computation of $\sum_{i=1}^{N} E_i f_i$, the key term in the Gaussian approximate score function. In particular, when the rank is 1, we reconstruct $A \in \mathbb{R}^{N \times d_{\mathrm{out}}}$ and $B \in \mathbb{R}^{N \times d_{\mathrm{in}}}$ and calculate the expression as $(\operatorname{diag}(f) A)^\top B$, a simple matrix multiplication.
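The rank-1 aggregation identity, $\sum_i f_i a_i b_i^\top = (\operatorname{diag}(f) A)^\top B$, can be checked directly (shapes are illustrative):

```python
import numpy as np

# Fused aggregation: the score-weighted sum of rank-1 perturbations
# collapses to a single matmul, without materialising any E_i.
rng = np.random.default_rng(7)
N, d_out, d_in = 64, 16, 32
A = rng.standard_normal((N, d_out))    # row i is a_i
B = rng.standard_normal((N, d_in))     # row i is b_i
f = rng.standard_normal(N)             # centred scalar fitnesses

explicit = sum(f[i] * np.outer(A[i], B[i]) for i in range(N))
fused = (f[:, None] * A).T @ B         # (diag(f) A)^T B as one matmul
```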

5 Analysis

*Proofs for all theorems can be found in Appendices A to D.*

In this section, we investigate the theoretical properties of the ES and EGGROLL updates. In Section 5.1, we study the convergence properties of the general Gaussian ES update as the parameter dimension $d \to \infty$, obtaining the conditions required for convergence to a linearised form. We then extend this analysis to the EGGROLL update in Section 5.2. Finally, in Section 5.3 we analyse the effect of increasing the rank of the EGGROLL approximation, proving convergence to the true ES update in the limit.

5.1 High-Dimensional Gaussian ES

We first analyse the general ES update under Gaussian perturbations from Eq. (2):

$$\nabla_\mu J(\mu) = \frac{1}{\sigma_d} \mathbb{E}_{v \sim \mathcal{N}(0, I_d)}\left[v f(\mu + \sigma_d v)\right],$$

where $v \in \mathbb{R}^d$. In high dimensions, the Gaussian annulus theorem (Vershynin, 2018; Wegner, 2024) proves that the probability mass of standard Gaussian distributions concentrates in thin shells of radius $\sqrt{d}$, which place probability mass further from the origin as dimension $d$ increases. To counter this, we let $\sigma_d$ depend on $d$ and analyse the critical decay rate of $\sigma_d$ that yields convergence of the ES updates. We make the following mild regularity assumptions:

Assumption 2 (Locally Continuous Fitness). With probability 1 with respect to the random initialisation of $\mu$, assume there exists a ball $B_\rho(\mu) := \{x' : \lVert x' - \mu \rVert_2 < \rho\}$ of fixed radius $\rho > 0$ where $f$ is $C^1$-continuous. Within this ball, let $\nabla f$ be $\alpha$-Hölder continuous, i.e., $\lVert \nabla f(x) - \nabla f(y) \rVert_2 \leq L \lVert x - y \rVert_2^\alpha$ for all $x, y \in B_\rho(\mu)$, with $\alpha \in (0, 1]$ and $L = \mathcal{O}(1)$.

Assumption 2 does not restrict the fitness to be globally continuous; with probability one with respect to the initialisation distribution, there must exist an arbitrarily small $C^1$-continuous ball around $\mu$. In particular, discontinuities, kinks, and non-differentiable regions may exist in the domain, provided they are not encountered with nonzero probability in the local region explored by the algorithm. $\alpha$-Hölder continuity is the weakest simple, dimension-robust assumption that guarantees vanishing local gradient variation under Gaussian perturbations; it is weaker than Lipschitz continuity, which is recovered with $\alpha = 1$.

Assumption 3 (Global Polynomial Growth). Assume that there exists some constant $C = \mathcal{O}(1)$ and finite polynomial degree $p$ such that $\lvert f(\mu + \sigma_d v) \rvert \leq C \left(1 + \lVert \mu + \sigma_d v \rVert_2^p\right)$ and $\lVert \nabla f(\mu + \sigma_d v) \rVert_2 \leq C \left(1 + \lVert \mu + \sigma_d v \rVert_2^p\right)$ almost surely under $v \sim \mathcal{N}(0, I_d)$.

Unlike Assumption 2, this is a global assumption. Again, discontinuities can exist. The assumption is weaker than boundedness, is satisfied by essentially all fitness functions used in ES, and ensures that both the objective and its gradient are integrable under Gaussian perturbations; objectives violating this condition typically exhibit super-polynomial growth and derivative growth, which leads to ill-defined or highly unstable ES updates. Moreover, if the condition is not satisfied almost surely, then the function and its gradients are undefined in regions that have nonzero Gaussian measure.

Assumption 4 (Bounded Derivative). With probability 1 with respect to the random initialisation of $\mu$, assume that $\lvert f(\mu) \rvert = \mathcal{O}(1)$ and $\lVert \nabla f(\mu) \rVert_2 = \mathcal{O}(1)$, i.e. $f(\mu)$ and $\nabla f(\mu)$ do not grow with increasing $d$.

This assumption is standard in high-dimensional analyses proving convergence to linearity, as proving convergence to $\nabla f(\mu)$ becomes meaningless if the norm of the gradient approaches infinity. Moreover, the ES update as a whole can diverge if Assumption 4 is not satisfied. It can be ensured by scaling, typically by scaling network parameters by $d^{-1/2}$ or using an appropriately scaled initialisation, commonly the Gaussian initialisation $\mu \sim \mathcal{N}(0, \frac{1}{d} I_d)$. This is precisely the scaling employed in the neural tangent kernel (NTK) regime (Jacot et al., 2018; Lee et al., 2019; Chizat et al., 2019), where it guarantees dimension-independent gradients and stable training dynamics.

These assumptions encompass essentially all objectives encountered in modern machine learning, including networks with finitely many ReLU activations, max- and hinge-based losses, and other piecewise-smooth or discontinuous models. Our first theorem proves convergence of a Gaussian ES update to a linearised form, that is, to the local first-order derivative $\nabla f(\mu)$, with a tight convergence rate for any function satisfying these assumptions:

Theorem 1 (Convergence to Linearity). Let Assumptions 2, 3, and 4 hold and $\sigma_d = o(d^{-1/2})$. Then:

$$\left\|\nabla_\mu J(\mu) - \nabla_\mu f(\mu)\right\|_2 = \Theta\left(\left(\sigma_d \sqrt{d}\right)^\alpha\right) = o(1),$$

almost surely with respect to the distribution over $\mu$.

To understand the effect that exceeding this rate has on the convergence of Gaussian ES, we study the space of functions that can be represented by cubic polynomials of the form:

where $a \in \mathbb{R}^d$, $B \in \mathbb{R}^{d \times d}$ is a symmetric matrix, and $C[\cdot, \cdot, \cdot]$ denotes a symmetric 3-linear map represented by the symmetric 3-tensor $C \in \mathbb{R}^{d \times d \times d}$, which generalises cubic equations of the form $c_1 x + c_2 x^2 + c_3 x^3$ to vector-valued $x$. These are non-pathological, well-behaved, analytic $C^\infty$-continuous functions, and include a rich subclass of convex optimisation problems, for instance, cubic perturbations of strictly convex quadratics. Moreover, any convex $C^3$-continuous objective admits a local third-order Taylor expansion of this form around a minimiser.

Theorem 2 (Exact Divergence for Cubic Objectives). Let $f$ denote the cubic polynomial in Eq. (7). Assume $\|a\|_2$, $\|B\|_{\mathrm{op}}$ and $\|C\|_{\mathrm{op}}$ are $\mathcal{O}(1)$, where $\|\cdot\|_{\mathrm{op}}$ denotes the operator norm for a $k$-tensor. Let Assumption 4 hold, then:

Moreover:

Together, Theorems 1 and 2 prove Gaussian ES has a critical convergence rate of $\sigma_d = \Theta(d^{-1/2})$ in high dimensions, and operates in three regimes:

Regime I (Convergence to Linearity): For $\sigma_d = o(d^{-1/2})$, ES converges to a linearised form, recovering a local first-order gradient update. This result is analogous to neural tangent kernel (NTK) type theorems, which prove that neural networks linearise in high dimensions (Jacot et al., 2018), and results from the concentration of the population distribution as $d \to \infty$, but applies to a more general set of objectives, including discontinuous architectures. Moreover, Theorem 1 proves that the rate at which Gaussian ES converges is tight and cannot in general be improved upon without strengthening continuity or introducing specific structure into the objective to ensure the Hölder constant decays with $d$; for the class of cubic functions we consider in Theorem 2, the faster convergence rate found in Eq. (9) is possible due to the $C^\infty$-continuity of this function class, which means the convergence rate is governed by third-order derivative terms.

Regime II (Critical): For $\sigma_d = \Theta(d^{-1/2})$, Gaussian ES converges to a nonlinear limiting update that may retain higher-order derivative terms when they exist; for our cubic example, Eq. (8) proves that at this critical rate, the second-order term associated with the matrix $B$ vanishes due to symmetry and the third-order term associated with the tensor $C$ remains:

As the polynomial form is representative of general Taylor expansions, this implies that the limiting high-dimensional update retains third-order derivatives (and higher-order odd derivatives) as $d \to \infty$.

Regime III (Divergence): For $d^{-1/2} = o(\sigma_d)$, Theorem 2 shows that there exist smooth cubic objectives with bounded coefficients for which:

$$\left\|\nabla_\mu J(\mu)\right\|_2 = \Theta\left(\sigma_d^2\, d\right) \to \infty.$$

In particular, divergence occurs whenever the cubic tensor has a non-vanishing Gaussian contraction (equivalently, non-zero partial trace), i.e. in non-degenerate cases; only in the exceptional trace-free case does the cubic contribution vanish.

In practice, $\sigma_d$ is often absorbed into the ES update stepsize, and its scale is adjusted automatically as part of the hyperparameter regime to ensure stability.
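The role of $\sigma$ in retaining higher-order terms can be seen even in one dimension. The toy sketch below (our own illustration, not from the paper) uses $f(x) = x^3$, for which $J(\mu) = \mathbb{E}[f(\mu + \sigma v)] = \mu^3 + 3\mu\sigma^2$ and hence $\nabla_\mu J = 3\mu^2 + 3\sigma^2$: the ES gradient carries a $\sigma^2$-scaled third-derivative bias over the local gradient $f'(\mu) = 3\mu^2$, which is precisely the contamination that a fast-decaying $\sigma_d$ suppresses:

```python
import numpy as np

# Monte-Carlo ES gradient for f(x) = x^3 in 1-D: the estimate converges to
# 3 mu^2 + 3 sigma^2, i.e. the local gradient plus a sigma^2-scaled bias
# coming from the third derivative of f.
rng = np.random.default_rng(7)
mu, sigma, N = 0.5, 0.3, 200_000

v = rng.normal(size=N)
g_hat = np.mean(v * (mu + sigma * v) ** 3) / sigma   # vanilla ES estimator

print(g_hat, 3 * mu**2 + 3 * sigma**2, 3 * mu**2)
```

With $\mu = 0.5$ and $\sigma = 0.3$, the estimate lands near the smoothed gradient $1.02$ rather than the local gradient $0.75$, making the higher-order contribution visible.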

5.2 High-Dimensional EGGROLL

We now extend our high-dimensional analysis to study the EGGROLL update using the Gaussian approximate score function $\hat{g}_{LR}$ from Eq. (5). Taking $r$ as fixed, we consider the Gaussian matrix ES setting outlined in Section 2.3. We take $x = \mathrm{vec}(W)$, where $W$ is an $m \times n$ matrix, and analyse the effect of increasing the total number of matrix parameters $d = mn$. Recall the true ES Gaussian matrix update is:

$$\nabla_M J(M) = \frac{1}{\sigma}\, \mathbb{E}_{E}\left[E\, f(M + \sigma E)\right],$$

where $M \in \mathbb{R}^{m \times n}$ is the matrix of mean parameters and $E$ is a zero-mean standard normal matrix with identity covariance matrices, $p(E) = \mathcal{MN}(0, I_m, I_n)$.
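The matrix ES update above can be verified numerically on a fitness where the answer is known in closed form. In the sketch below we assume a linear fitness $f(W) = \langle G, W \rangle$ (our own choice for the illustration), for which the true ES matrix gradient is exactly $G$; a baseline-subtracted Monte-Carlo estimate recovers it:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, sigma, N = 4, 3, 0.1, 100_000

# Hypothetical linear fitness f(W) = <G, W>; for this f the true ES matrix
# gradient (1/sigma) E[E f(M + sigma E)] equals G exactly.
G = rng.normal(size=(m, n))
f = lambda W: np.sum(G * W)

M = rng.normal(size=(m, n))
E = rng.normal(size=(N, m, n))                     # full-rank Gaussian perturbations
fit = np.array([f(M + sigma * Ek) for Ek in E])    # fitness of each population member
g_hat = np.mean(E * (fit - f(M))[:, None, None], axis=0) / sigma  # baseline-subtracted

print(np.max(np.abs(g_hat - G)))  # small Monte-Carlo error
```

Subtracting the baseline $f(M)$ does not change the expectation but removes the dominant variance term, a standard ES variance-reduction trick.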

Two key differences between full-rank Gaussian ES and EGGROLL are that $\hat{g}_{LR}$ is an approximation to the true gradient and may have heavier tails than a Gaussian. To account for these differences, we require a slightly stricter local continuity control assumption:

Assumption 5 (EGGROLL Locally Continuous Fitness). With probability 1 with respect to the random initialisation of $\mu$, assume there exists a ball $B_\rho(\mu) := \{x' : \|x' - \mu\|_2 < \rho\}$ of fixed radius $\rho > 0$ where $f$ is $\alpha$-Hölder continuous and polynomially bounded. Within this ball, let $\nabla f$ be Lipschitz continuous, i.e. $\|\nabla f(x) - \nabla f(x')\|_2 \le L_d \|x - x'\|_2$ for all $x, x' \in B_\rho(\mu)$.

This assumption still permits discontinuous objectives. We also assume that $p$ generates sub-Gaussian elements with uniform tail control:

Assumption 6 (Sub-Gaussian Tails). In addition to Assumption 1, assume that $p$ generates variables that have sub-Gaussian tails, i.e. for $x_i \sim p$:

$$\mathbb{P}\left(|x_i| > t\right) \le 2 e^{-C t^2},$$

for some constant $C > 0$ that does not depend on $d$.

We discuss sub-Gaussian variables and their properties in Section C.3. The assumption is trivially satisfied for Gaussian distributions, and holds more generally, for example for bounded distributions, uniform distributions, and generalised Gaussian distributions with shape parameter greater than two. This flexibility is particularly relevant for the models in Section 6.1, where heavier-shouldered distributions may be preferred over the Gaussian.

Theorem 3 (EGGROLL Convergence to Linearity). Let $d = mn$ with $r$ fixed. Let Assumptions 3, 4, 5 and 6 hold and $\sigma_d = o(d^{-1/2})$. Then there exists some constant such that:

and

almost surely with respect to the distribution over $\mu$.

Our theory explains the success of EGGROLL in high dimensions with rank as small as $r = 1$; Eq. (11) proves EGGROLL converges to the true matrix ES update $\nabla_M J$ as $d \to \infty$, regardless of $r$. In addition, Eq. (10) proves that under the same conditions, the EGGROLL update also linearises like the true Gaussian ES update analysed in Section 5.1, recovering a local first-order derivative as $d \to \infty$. For high-dimensional neural networks, standard parametrisations place training in the NTK regime, in which the network behaves approximately linearly in its parameters and gradient descent converges to a global minimum (Jacot et al., 2018; Lee et al., 2019; Chizat et al., 2019). Recent results show that the spectral norm of the Hessian decays polynomially with width, and that the higher-order derivatives governing the variation of the Hessian also vanish (Liu et al., 2020). Consequently, the Lipschitz constant $L_d$ is $o(1)$, typically at a polynomial rate depending on the network architecture; substituting these rates into our upper bound in Eq. (10) yields correspondingly faster convergence rates.

5.3 Rank Analysis

We now analyse how fast the low-rank update from Eq. (4) with Gaussian score approximation converges to the true Gaussian ES matrix gradient in Eq. (3) as the rank $r$ of the update increases. We make the notation explicit in $r$ in this subsection, for example writing $E_r = \frac{1}{\sqrt{r}} A_r B_r^\top$. We introduce the following formal regularity assumption for the fitness function:
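The low-rank construction $E_r = \frac{1}{\sqrt{r}} A_r B_r^\top$ can be inspected empirically. The sketch below assumes i.i.d. standard normal factors (the Gaussian case of Assumption 1); each entry of $E_r$ then has mean 0 and variance 1 for every $r$, while its excess kurtosis is $6/r$ for this construction, so the marginals approach the standard Gaussian as $r$ grows:

```python
import numpy as np

# Batched low-rank perturbations E_r = (1/sqrt(r)) A_r B_r^T with Gaussian
# factors: unit entry variance at every r, excess kurtosis decaying as 6/r.
rng = np.random.default_rng(2)
m, n, samples = 8, 8, 20_000

results = {}
for r in [1, 4, 16]:
    A = rng.normal(size=(samples, m, r))
    B = rng.normal(size=(samples, n, r))
    E = A @ B.transpose(0, 2, 1) / np.sqrt(r)        # (samples, m, n)
    x = E.ravel()
    results[r] = (x.var(), (x**4).mean() / x.var() ** 2 - 3.0)
    print(r, results[r])
```

This mirrors the marginal convergence discussed in Section 5.3: the variance is already exact at $r = 1$, and only the tail shape needs $r$ to grow.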

Assumption 7 (Bounded Fitness). Assume that $f$ is bounded, i.e. $\sup_W |f(W)| < \infty$.

Our key theoretical result characterises the error rate between the Gaussian score approximator in the low-rank update from Eq. (4) and the true gradient using the matrix Frobenius norm:

Theorem 4 (EGGROLL Rank Convergence). Let Assumptions 1 and 7 hold, then:

$$\left\|\hat{g}_{LR} - \nabla_M J\right\|_F = \mathcal{O}\left(r^{-1}\right).$$

Figure 3: Plot of marginal score multiplied by density for increasing $r$, demonstrating convergence to the Gaussian limit.

The convergence rate in Eq. (12) is faster than the typical $\mathcal{O}(r^{-1/2})$ rate dictated by the general parametric central limit theorem. Our analysis shows that this is due to the symmetry in our problem under Assumption 1. To obtain our results, we make an Edgeworth expansion (Bhattacharya & Ranga Rao, 1976) of the distribution $p(E_r)$, which expands $p(E_r)$ as the limiting Gaussian distribution plus a sum of decaying terms that are controlled by the 3rd- and higher-order cumulants of $p(E_r)$. Each $i$th-order cumulant term is multiplied by a factor that decays at rate $\mathcal{O}\left(r^{-(i-2)/2}\right)$. For symmetric zero-mean distributions, all odd cumulants are zero (for the same reason that all odd moments of a symmetric distribution are zero). Hence, the rate of convergence to the limiting distribution is controlled by the 4th-order term, which has rate $\mathcal{O}(r^{-1})$.

Although the full distribution $p(E_r)$ has no general closed-form solution, the distribution over marginals is more amenable to analysis. We derive the density of the marginal distribution for generalised Gaussian distributed $A_r$ and $B_r$ in Section D.1. To illustrate the fast convergence rate, we plot the negative density score function for the marginal density in Fig. 3 using Gaussian distributed $A_r$ and $B_r$ (see Theorem 6 for a derivation). The figure shows that the function quickly converges to the limiting function $\frac{E_{ij}}{\sqrt{2\pi}} \exp\left(-\frac{E_{ij}^2}{2}\right)$, recovering the Gaussian form from the true Gaussian ES update. Even at $r = 1$, the function is a reasonable approximation. By $r = 10$, the function has nearly converged, and by $r = 50$ it is visually indistinguishable from the limit, providing evidence for the hypothesis that the low-rank approximation is accurate even in very low-rank regimes $r \ll \min(m, n)$.

Figure 4: (a) Comparison of reinforcement learning returns normalised by PPO performance across 16 environments for 10 seeds. The shaded region is the standard error of the mean. (b) Validation score of 3 seeds of EGGROLL vs. 3 seeds of GRPO on the countdown task with an RWKV 7g 1.5B model on a single GPU. EGGROLL allows 1024 parallel generations per GPU (618 updates), whereas GRPO allows only 64 (915 updates).

6 Experiments

In this section we showcase the effectiveness of EGGROLL on a variety of tasks that position it as a strong alternative to back-propagation for the end-to-end training of foundation models.

6.1 Pure Integer Language Model Pretraining

To demonstrate the potential of EGGROLL as a general optimisation method, we apply it to language model pretraining. Since EGGROLL does not rely on gradients, we explicitly design a language model architecture to be efficient and hardware-friendly at inference time. To highlight EGGROLL’s flexibility, we train a nonlinear recurrent neural network (RNN) in pure integer datatypes with no explicit activation functions, relying only on the implicit nonlinearity of clipping in int8 operations. We call the resulting language model EGG, the Evolved Generative GRU, an EGGROLL-friendly architecture with all weights in int8. See Appendix G for more details on the architecture and motivation behind EGG.
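The "implicit nonlinearity of clipping" can be illustrated with a toy integer layer. The sketch below is a hypothetical example, not the actual EGG architecture: the only nonlinear operation is the saturation of accumulated values into the int8 range, and the right-shift requantisation step is our own assumption for the example:

```python
import numpy as np

# Toy int8 "layer": matmul in a wide accumulator, requantise by right shift,
# then saturate into int8 -- the clip is the only nonlinearity.
def int8_layer(x, W, shift=7):
    acc = x.astype(np.int32) @ W.astype(np.int32)    # widen to avoid overflow
    acc = acc >> shift                                # requantise by right shift
    return np.clip(acc, -128, 127).astype(np.int8)    # saturating cast to int8

rng = np.random.default_rng(3)
x = rng.integers(-128, 128, size=(1, 16), dtype=np.int8)
W = rng.integers(-128, 128, size=(16, 16), dtype=np.int8)
y = int8_layer(x, W)
print(y.dtype, int(y.min()), int(y.max()))
```

Because EGGROLL needs no gradients, the non-differentiability of the shift and clip operations poses no obstacle to training such a layer.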

We train an EGG model with 6 layers and hidden dimension 256 (6L-256D) to do character-level prediction on the minipile dataset (Kaddour, 2023). We update parameters after 100 tokens for each population member, applying truncated ES by keeping the hidden state and only resetting at document boundaries. We plot the test loss in Fig. 2b over training steps across a range of population sizes with a fixed data batch size of 16 sequences per step, where the best test loss is 3.40 bits/byte. With a sufficiently large population size, EGG outperforms a dense 6L-256D Transformer trained with backprop SGD using the same data batch size. Note that larger population sizes require more parallel compute for the same amount of data; our largest population size of $2^{20} = 1{,}048{,}576$ requires around 180 times more GPU-hours than the backprop baseline, demonstrating the potential for compute-only scaling in limited-data regimes using EGGROLL.

Moreover, our largest population size of $2^{20}$ is three orders of magnitude larger than the largest experiment done by Salimans et al. (2017), while only requiring a single GPU to train, highlighting EGGROLL's computational efficiency. We note that large population sizes are critical for pretraining; a population size of 2, analogous to MeZO (Malladi et al., 2023), significantly underperforms larger population sizes despite having access to the same data batch. We conduct more ablations in Appendix I, analysing the tradeoff between population size and data batch size.

6.2 Reinforcement Learning Tasks

To verify that low-rank perturbations do not change the optimisation behaviour of ES in standard control settings, we benchmark EGGROLL against OpenES (Salimans et al., 2017) across 16 tabula rasa environments spanning Navix, Craftax, Brax, Kinetix, and Jumanji. We use a fixed 3-layer MLP policy (256 hidden units) and perform per-environment hyperparameter optimisation for each method before evaluating the selected configuration over 10 random seeds, reporting mean performance (normalised by PPO) and uncertainty. Overall, EGGROLL matches OpenES on 7/16 environments, underperforms on 2/16, and outperforms on 7/16, while often delivering substantial wall-clock improvements due to its batched low-rank structure (full environment

Figure 5: (a) Comparison of the validation score of 3 seeds of EGGROLL vs. 3 seeds of GRPO on the GSM8K task with an RWKV 7g 7B model on 8 GPUs. EGGROLL allows 8192 parallel generations (1024 per GPU with 260 updates), whereas GRPO allows only 256 (32 per GPU with 340 updates). (b) Performance of our finetuned RWKV 7G 7B model on hard reasoning tasks using 128 GPUs for 12 hours. The model was trained on the DeepScaleR dataset, and the best checkpoint was chosen by evaluating on AIME24.

list, learning curves, timing comparisons, and complete HPO ranges/settings are provided in Appendix N.4). Figure 4a shows the averaged normalised return across the 16 environments with 10 seeds per environment. We additionally report MARL results in Section N.1.

6.3 Foundation Model Fine-tuning

We apply EGGROLL to finetune an RWKV-7 (Peng et al., 2025) LLM on two reasoning tasks: countdown (Gandhi et al., 2024) and GSM8K (Cobbe et al., 2021). RWKV is a recurrent model that is better suited to parallelisation than transformers because any memory otherwise spent on the KV cache is used to evaluate population members. Figure 4b shows that EGGROLL fine-tuning on an RWKV-7 1.5B model converges to a higher validation accuracy of 35% (vs. 23%) given the same hardware and wall-clock time on the countdown task. Similarly, Figure 5a shows that EGGROLL outperforms GRPO on GSM8K fine-tuning. Our scoring function draws parallels to the group relative advantage of GRPO. In particular, to score a set of noise directions $E = \{E_1, \ldots, E_n\}$, we first compute their accuracies $s_1(q_i), \ldots, s_n(q_i)$ on $m$ questions, creating a matrix of scores $S \in \mathbb{R}^{m \times n}$. We then compute the normalised score per question, with the main difference that we use the global variance $\bar{\sigma}$, and average over all the questions to compute a score for the noise direction $E_i$:

$$f(E_i) = \frac{1}{m} \sum_{j=1}^{m} \frac{s_i(q_j) - \bar{s}(q_j)}{\bar{\sigma}},$$

where $\bar{s}(q_j)$ denotes the mean accuracy on question $q_j$ across the population.
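The scoring recipe described above can be sketched in a few lines of numpy. We read $\bar{\sigma}$ as a global standard deviation, and the epsilon guard against a zero denominator is our own addition:

```python
import numpy as np

# GRPO-style scoring: centre accuracies per question, scale by one global
# standard deviation, then average over questions to get one score per
# noise direction.
def score_noise_directions(S, eps=1e-8):
    per_question_mean = S.mean(axis=1, keepdims=True)   # mean across the population
    global_std = S.std() + eps                           # shared, global scale
    return ((S - per_question_mean) / global_std).mean(axis=0)

# Direction 0 solves both questions, direction 2 solves neither.
S = np.array([[1.0, 0.0, 0.0],
              [1.0, 1.0, 0.0]])
scores = score_noise_directions(S)
print(scores)
```

The scores are centred per question, so they sum to zero across the population and directly serve as the relative fitness weights for the ES update.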

This scoring function weights all questions within the same batch equally across population members. We use this recipe to train a 14-billion-parameter RWKV-7 model on the DeepScaleR dataset and evaluate on more challenging maths reasoning tasks. In this regime, GRPO is infeasible due to the extra memory used by the Adam optimiser (Kingma & Ba, 2014). Using a thinking budget of 5000 tokens for training and evaluation, our fine-tuned 14B model improves from 13% to 30% accuracy on AIME24, from 7% to 33% accuracy on AIME25, and from 11% to 13% accuracy on HMMT25 after training on 32 GPUs for 12 hours (Figure 13b). On 7B models, we outperform GRPO using 128 GPUs for 24 hours (Figure 5b).

In Section L, we achieve similar performance to GRPO when fine-tuning Qwen Transformer models, and additionally demonstrate that EGGROLL can directly optimise for pass@k, a known limitation of GRPO (Yue et al., 2025). Beyond language models, we also fine-tune a finance world model into an agent for high-frequency trading that directly optimises for PnL; see Section M for more details.

6.4 Fine-tuning Integer Quantised LLMs

We follow the same procedure as Jacob et al. (2017) to quantise the RWKV-7 family of models by dividing by the maximum per-channel value of each weight matrix and mapping into the int8 range. We then apply EGGROLL with Adam to perform model distillation from the original, non-quantised RWKV-7 into the resulting int8 quantised model using examples from GSM8K. See Appendix K for full details about the

Figure 6: (a) Average per-token perplexity (during training) of 3 seeds of a quantised (int8) RWKV 7G 7B model on distillation from the non-quantised model using examples from GSM8K. (b) Validation score on unseen examples of GSM8K of 3 seeds of a quantised RWKV 7G 7B model. Initially the model is unable to solve any problems, but it progressively becomes capable of solving more. The baseline indicates the validation score of a quantised model without any further training.

specifics of quantisation and fine-tuning. The distillation is done by matching the distributions of the quantised and non-quantised models on teacher-forced examples (with solutions) from the GSM8K dataset. More specifically, the fitness for a given set of parameters $\mu_i$ is computed as follows:

$$f(\mu_i) = \sum_{t=1}^{T} D_{\mathrm{KL}}\left(p_t \,\|\, q_t(\mu_i)\right),$$

where $x_{1:T}$ is a subsequence of tokens taken from the solutions of GSM8K and $D_{\mathrm{KL}}$ is the Kullback-Leibler divergence between the distribution of the non-quantised model, $p_t$, and the distribution of the quantised model, $q_t$, over the vocabulary at token $t$. Figure 6a shows the average per-token perplexity of 3 seeds of a quantised RWKV 7G 7B model compared to that of the original non-quantised model over the same sequence as a baseline. Progressively, the quantised model recovers the capability to solve a subset of the GSM8K dataset (Figure 6b).
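The distillation fitness above can be sketched with toy distributions. In this sketch, random logits stand in for real model outputs, and since ES maximises fitness, in practice one would maximise the negative of this KL sum:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Sum over a teacher-forced sequence of KL(p_t || q_t) between the
# non-quantised (teacher) and quantised (student) next-token distributions.
def distillation_fitness(teacher_logits, student_logits):
    p = softmax(teacher_logits)             # (T, vocab) teacher distributions
    q = softmax(student_logits)             # (T, vocab) student distributions
    return float(np.sum(p * (np.log(p) - np.log(q))))

rng = np.random.default_rng(4)
T, vocab = 5, 11
teacher = rng.normal(size=(T, vocab))
student = teacher + 0.1 * rng.normal(size=(T, vocab))
print(distillation_fitness(teacher, teacher), distillation_fitness(teacher, student))
```

The objective is zero when the student matches the teacher exactly and strictly positive otherwise, so driving it down recovers the teacher's distributions.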

7 Conclusion

We introduce EGGROLL, a powerful method for black-box optimisation that scales Evolution Strategies to billion-parameter models and beyond using low-rank search matrices. Our experiments demonstrate that EGGROLL is effective with a rank of 1, giving substantial computational and memory savings for a negligible decrease in performance compared to full-rank perturbations. Empirically, EGGROLL delivers large speedups over naïve ES in tabula rasa and multi-agent RL, and can power end-to-end training pipelines for foundation models. Our theoretical analysis shows that the EGGROLL update converges to the Gaussian ES update with increasing rank $r$ and parameter dimension $d = mn$, and we provide a rigorous study of general ES in high dimensions, deriving necessary and sufficient conditions for convergence and linearisation.

Looking forward, we can use EGGROLL for other problems beyond the reach of modern first-order gradient-based techniques. In particular, EGGROLL can enable the training of large-scale end-to-end neurosymbolic systems (Sarker et al., 2021) with non-differentiable components. For instance, we can train neural networks that interface with symbolic modules for specific functions, like memory or calculations. We can also optimise end-to-end systems of language models, training them to be aware of inference-time harnesses and interactions with other agents in complex systems.

Acknowledgements

Compute for this project is graciously provided by the Isambard-AI National AI Research Resource, under the projects “FLAIR 2025 Moonshot Projects” and “Robustness via Self-Play RL.” Some experiments also used compute generously given by JASMIN, the UK’s collaborative data analysis environment (https://www.jasmin.ac.uk).

Bidipta Sarkar is supported by the Clarendon Fund Scholarship in partnership with a Department of Engineering Science Studentship for his Oxford DPhil. Mattie Fellows is funded by a generous grant from the UKRI Engineering and Physical Sciences Research Council EP/Y028481/1. Juan Agustin Duque is supported by the St-Pierre-Larochelle Scholarship at the University of Montreal and by Aaron Courville’s CIFAR AI Chair in Representations that Generalize Systematically. Jarek Liesen and Theo Wolf are supported by the EPSRC Centre for Doctoral Training in Autonomous Intelligent Machines & Systems EP/Y035070/1. Jarek Liesen is also supported by Sony Interactive Entertainment Europe Ltd. Uljad Berdica is supported by the EPSRC Centre for Doctoral Training in Autonomous Intelligent Machines & Systems EP/S024050/1 and the Rhodes Scholarship. Lukas Seier is supported by the Intelligent Earth CDT with funding from the UKRI grant number EP/Y030907/1. Alexander D. Goldie is funded by the EPSRC Centre for Doctoral Training in Autonomous Intelligent Machines and Systems EP/S024050/1. Jakob Nicolaus Foerster is partially funded by the UKRI grant EP/Y028481/1 (originally selected for funding by the ERC). Jakob Nicolaus Foerster is also supported by the JPMC Research Award and the Amazon Research Award.

We thank Andreas Kirsch for discovering an emergent log-linear scaling law for EGG loss with respect to int8 OPs in a tweet, along with other community members for their comments and recommendations during the first arXiv release of this work.

References

[references omitted]

[references omitted]

Jean-Philippe Bouchaud, Julius Bonart, Jonathan Donier, and Martin Gould. Trades, quotes and prices: financial markets under the microscope. Cambridge University Press, 2018.

James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, George Necula, Adam Paszke, Jake VanderPlas, Skye Wanderman-Milne, and Qiao Zhang. JAX: composable transformations of Python+NumPy programs, 2018.

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners, 2020.

Lénaïc Chizat, Edouard Oyallon, and Francis Bach. On lazy training in differentiable programming. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. Fox, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.

Krzysztof M Choromanski, Aldo Pacchiano, Jack Parker-Holder, Yunhao Tang, and Vikas Sindhwani. From complexity to simplicity: Adaptive es-active subspaces for blackbox optimization. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. Fox, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.

Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sashank Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, Ben Hutchinson, Reiner Pope, James Bradbury, Jacob Austin, Michael Isard, Guy Gur-Ari, Pengcheng Yin, Toju Duke, Anselm Levskaya, Sanjay Ghemawat, Sunipa Dev, Henryk Michalewski, Xavier Garcia, Vedant Misra, Kevin Robinson, Liam Fedus, Denny Zhou, Daphne Ippolito, David Luan, Hyeontaek Lim, Barret Zoph, Alexander Spiridonov, Ryan Sepassi, David Dohan, Shivani Agrawal, Mark Omernick, Andrew M. Dai, Thanumalayan Sankaranarayana Pillai, Marie Pellat, Aitor Lewkowycz, Erica Moreira, Rewon Child, Oleksandr Polozov, Katherine Lee, Zongwei Zhou, Xuezhi Wang, Brennan Saeta, Mark Diaz, Orhan Firat, Michele Catasta, Jason Wei, Kathy Meier-Hellstern, Douglas Eck, Jeff Dean, Slav Petrov, and Noah Fiedel. Palm: Scaling language modeling with pathways. J. Mach. Learn. Res., 24(1144), 2023.

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.

A. P. Dawid. Some matrix-variate distribution theory: Notational considerations and a bayesian application. Biometrika, 68(1):265–274, 1981. ISSN 0006-3444.

Nan Du, Yanping Huang, Andrew M. Dai, Simon Tong, Dmitry Lepikhin, Yuanzhong Xu, Maxim Krikun, Yanqi Zhou, Adams Wei Yu, Orhan Firat, Barret Zoph, Liam Fedus, Maarten P. Bosma, Zongwei Zhou, Tao Wang, Emma Wang, Kellie Webster, Marie Pellat, Kevin Robinson, Kathleen Meier-Hellstern, Toju Duke, Lucas Dixon, Kun Zhang, Quoc Le, Yonghui Wu, Zhifeng Chen, and Claire Cui. Glam: Efficient scaling of language models with mixture-of-experts. In Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pp. 5547–5569, Jul 2022.

William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: scaling to trillion parameter models with simple and efficient sparsity. J. Mach. Learn. Res., 23(1):1–39, January 2022. ISSN 1532-4435.

[references omitted]

[references omitted]

[references omitted]

[references omitted]

[references omitted]

[references omitted]

[references omitted]

Appendix

A Notation

In our proofs, we use the integral notation $\int \cdot \, dE$ to denote the integral over the corresponding Euclidean space, for example over $\mathbb{R}^{m \times n}$ for a matrix $E \in \mathbb{R}^{m \times n}$, and over $\mathbb{R}^{mn}$ for a vector $E \in \mathbb{R}^{mn}$. For $f : \mathbb{R}^d \to \mathbb{R}$, we use $\nabla f(x)$ to denote the derivative of $f$ evaluated at $x$. For a vector $v \in \mathbb{R}^{mn}$, we define the $\mathrm{mat}$ operator as:

so $\mathrm{mat}(\mathrm{vec}(W)) = W$. We will use the fact that the Frobenius norm becomes the $\ell_2$ norm in vector space: $\|W\|_F = \|\mathrm{vec}(W)\|_2$.
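These identities are easy to sanity-check numerically. The sketch below uses row-major reshape as its vec convention, which is our own choice for the illustration:

```python
import numpy as np

# mat inverts vec, and the Frobenius norm of W equals the l2 norm of vec(W).
rng = np.random.default_rng(5)
W = rng.normal(size=(3, 4))
v = W.reshape(-1)                        # vec(W)
W_back = v.reshape(3, 4)                 # mat(vec(W))
print(np.linalg.norm(W, "fro"), np.linalg.norm(v), np.allclose(W_back, W))
```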

Our proofs make use of Fourier analysis. For a vector-valued function $g : \mathbb{R}^d \to \mathbb{R}^d$, we define the Fourier transform as:

$$\hat{g}(\omega) = \int g(x)\, e^{-i \omega^\top x}\, dx,$$

and the inverse Fourier transform as:

$$g(x) = \frac{1}{(2\pi)^d} \int \hat{g}(\omega)\, e^{i \omega^\top x}\, d\omega.$$

B ES Matrix Gradient Deviations

Let $\mu_M = \mathrm{vec}(M) \in \mathbb{R}^{mn}$ be the vector of mean parameters associated with the matrix $M$. Let $v_M$ denote the corresponding search vector associated with $\mu_M$. As each element of $E$ is generated independently from a standard normal $\mathcal{N}(0, 1)$, the search vector is generated from the standard multivariate normal: $v_M \sim \mathcal{N}(0, I_{mn})$. From Eq. (2), the update for $\mu_M$ is:

where the expectation is taken over $v_M$, and we have used the fact that sampling $E$ is equivalent to sampling $v_M$ and applying the $\mathrm{mat}$ operator. Now

C High-Dimensional Analysis

C.1 High-Dimensional Gaussian ES and Convergence

We use insights from the Gaussian annulus theorem when investigating the convergence properties of high-dimensional ES: our proof relies on the fact that all probability mass converges to the interior of the ball $B_\epsilon(\mu) := \{x' : \|x' - \mu\|_2 < \epsilon\}$ with $\epsilon = \rho/2$ in the limit $d \to \infty$, where $\rho$ is the radius of the local ball from Assumption 2, meaning we only need to consider the smooth region around $\mu$ in this limit. Our first result proves that the mass outside of the ball for any polynomially bounded function tends to zero at an exponential rate.
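The concentration behind this argument is visible numerically. In the sketch below (the choice $\sigma_d = d^{-0.6}$, a rate strictly faster than $d^{-1/2}$, is our own), $\|v\|_2 / \sqrt{d}$ concentrates around 1 for $v \sim \mathcal{N}(0, I_d)$, so the perturbation $\sigma_d v$ has radius $\sigma_d \sqrt{d} \to 0$ and eventually stays inside any fixed ball around $\mu$:

```python
import numpy as np

# Gaussian annulus: ||v||_2 / sqrt(d) concentrates around 1, so with
# sigma_d = d^{-0.6} the perturbation radius sigma_d * sqrt(d) = d^{-0.1} -> 0.
rng = np.random.default_rng(6)
for d in [100, 1_600, 25_600]:
    v = rng.normal(size=(200, d))
    ratio = np.linalg.norm(v, axis=1) / np.sqrt(d)   # concentrates around 1
    radius = d ** -0.6 * np.sqrt(d)                   # shrinking perturbation radius
    print(d, ratio.mean(), ratio.std(), radius)
```

The spread of the ratio shrinks like $d^{-1/2}$, matching the exponential tail bound established in Lemma 1.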

Lemma 1 (Polynomial Tail Bounds). Let $g(x)$ be polynomially bounded as:

$$\|g(\mu + \sigma_d v)\|_2 \le C\left(1 + \|\mu\|_2^p\right)\left(1 + \|v\|_2^q\right),$$

for some finite polynomial orders $p$ and $q$ and constant $C$. Let $A_d$ denote the event that a mutation lies outside a local ball of radius $\epsilon$ around $\mu$. Assume $\sqrt{d}\,\sigma_d = o(1)$. Then for some constant $K > 0$:

$$\left\|\mathbb{E}\left[g(\mu + \sigma_d v)\, \mathbb{1}_{A_d}\right]\right\|_2 = \mathcal{O}\left(d^{q/2} \exp\left(-K \frac{\epsilon^2}{\sigma_d^2}\right)\right),$$

and in particular the right-hand side is $o(1)$ as $d \to \infty$.

Proof. We start by bounding the integrand using the polynomial bound. Denote . Then, by Jensen’s inequality in the first line, polynomial boundedness in the second and in the third:

where and are constants independent of . Applying the Cauchy-Schwarz inequality to the second expectation gives:

Now, the variable $\|v\|_2$ is $\chi$-distributed. Using the formula for the $q$-th moment of the $\chi$ distribution about the origin (Forbes et al., 2011, Chapter 11.3) yields:

Applying the identity (Askey & Roy, 2020-2026, Eq. 5.11.12):

where $\sim$ denotes asymptotic equivalence. For fixed $q$, this yields the bound:

hence:

We use the Gaussian concentration inequality for the Euclidean norm (Vershynin, 2018, Theorem 3.1.1), which states that for $x \sim \mathcal{N}(0, I_d)$ there exists an absolute constant $c > 0$ such that for all $t \ge 0$,

In our setting, we need to bound:

Setting $t = \frac{\epsilon}{\sigma_d} - \sqrt{d}$, the assumption $\sqrt{d}\,\sigma_d = o(1)$ implies for sufficiently large $d$ that $\sqrt{d} \le \frac{\epsilon}{2\sigma_d}$ and therefore $t \ge \frac{\epsilon}{2\sigma_d}$, so we can apply the concentration bound to obtain:

Now, as , it follows , yielding:

Applying these results to Eq. (15), along with , yields our desired result:

where we have absorbed the factor of into the constant .

Our proof in Lemma 1 reveals the necessity of the condition $\sqrt{d}\,\sigma_d = o(1)$ for convergence, as we can only apply the Gaussian concentration inequality in Eq. (16) when $\epsilon/\sigma_d$ dominates $\sqrt{d}$; this is a direct consequence of the Gaussian annulus theorem, as for slower rates the Gaussian probability mass will exit any local ball around $\mu$ and flood the tail, meaning that the tail probability will grow with increasing $d$. Having bounded the tail, convergence to linearity follows by proving convergence within the ball, which allows us to exploit the local smoothness of $f$:

Theorem 1 (Convergence to Linearity). Let Assumptions 2, 3 and 4 hold and $\sigma_d = o(d^{-1/2})$. Then:

$$\left\|\nabla_\mu J(\mu) - \nabla_\mu f(\mu)\right\|_2 = \Theta\left(\left(\sigma_d \sqrt{d}\right)^\alpha\right) = o(1),$$

almost surely with respect to the distribution over $\mu$.

Proof. We start with the definition of the ES update:

Now let $\epsilon = \rho/2$, where $\rho$ is the radius of the ball from Assumption 2. Consider the hinge function:

which interpolates between 1 and 0 in the region $\epsilon \le \|x - \mu\|_2 \le \rho$. Our first goal is to use $\phi(x)$ to generate a function $\tilde{f}(x)$ that is absolutely continuous and has integrable derivatives outside of $B_\rho(\mu)$, to allow us to apply Stein's lemma (Stein, 1972). We define $\tilde{f}(x)$ as:

Consider the closed ball $\bar{B}_\epsilon(\mu)$. We note that within the ball $\tilde{f}$ remains unchanged:

The derivative of the function with respect to v is:

where the gradient fails to exist only on the spheres $\|x - \mu\|_2 = \epsilon$ and $\|x - \mu\|_2 = \rho$, which have Lebesgue measure zero. We start by using this function to decompose $J(\mu)$ into a smoothed part and a remainder:

Hence:

Consider the smoothed part:

Our goal is to apply Stein's lemma (Stein, 1972) in its multivariate form (Liu, 1994, Lemma 1). The assumptions of (Liu, 1994, Lemma 1) require that the partial derivatives are absolutely continuous almost everywhere and:

These two conditions are satisfied by construction. Indeed, under Assumption 2, $f$ is $C^1$ continuous on , hence from Eq. (17), $\tilde{f}$ coincides with a compactly supported, piecewise $C^1$ function whose gradient (Eq. (18)) exists almost everywhere. Moreover, under Assumption 3, both and are polynomially bounded, and since is supported on , it follows that:

Applying (Liu, 1994, Lemma 1):

Let denote the event that a mutation lies within the ball $B_\epsilon(\mu)$. We now split the integral into two regions, the first within the ball and the second outside:

Consider the region inside the ball, $I_{\mathrm{local}}$. From Eq. (18), within this region. Using the local $\alpha$-Hölder continuity from Assumption 2:

Now, applying the identity $\mathbb{E}\big[\|v\|^i\big] = O\big(d^{i/2}\big)$ from Eq. (14):

We now bound the tail region outside the ball:

Now, as from Assumption 4, and we have established that is polynomially bounded under Assumption 3 when applying Stein's lemma, it follows that is also polynomially bounded; that is, there exists some constant and finite polynomial order such that:

Applying Lemma 1, it follows:

for some constant . Together, this yields:

As for any , we take to obtain a weakened bound matching the first term:

This yields the upper bound:

Returning to Eq. (19), we must bound the remainder term:

Again, from Assumption 3, it follows that is polynomially bounded; that is, there exists some constant $C' > 0$ and finite polynomial order $p'$ such that:

Applying Lemma 1 with $q = 1$:

Now, as $e^{-x} = o(x^{-1})$ as $x \to \infty$, it follows:

where the final line follows from the fact that $\sqrt{d}\,\sigma_d = o(1)$. Assembling our bounds using Ineq. (19) yields our desired result:

We now show that the bound is tight. Consider the function $f(x) = \frac{L}{2} \sum_{i=1}^{d} x_i |x_i| + a^\top x$, where $\|a\| = O(1)$. Taking partial derivatives:
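Explicitly, filling in the elided computation: since $\frac{\mathrm{d}}{\mathrm{d}t}\, t \lvert t \rvert = 2 \lvert t \rvert$, the partial derivatives are

```latex
\frac{\partial f}{\partial x_i}(x) = L\, \lvert x_i \rvert + a_i .
```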

hence:

Applying the reverse triangle inequality $\big( |x_i| - |y_i| \big)^2 \le (x_i - y_i)^2$:

We have thus shown that $f(x)$ is -continuous and its gradient has Lipschitz constant , i.e. with Hölder constant . It is also bounded by a polynomial of order 2. Without loss of generality, we take a deterministic initialisation $\mu = 0$ to simplify the algebra, yielding:

$f(x)$ thus satisfies Assumptions 2, 3 and 4. Using $f(x)$ as the fitness:

Taking expectations element-wise and using Eq. (23):

Applying Eq. (14):

Hence:

thereby attaining the upper bound rate of .

C.2 Critical Convergence Rate

To show that our rate is critical, we investigate the space of functions that can be represented by cubic polynomials of the form:

where $a \in \mathbb{R}^d$, $B \in \mathbb{R}^{d \times d}$ is a symmetric matrix, and denotes a symmetric 3-linear map represented by the 3-tensor $C \in \mathbb{R}^{d \times d \times d}$.

Since our theory depends on analysing the local stability of a smooth ball for a fitness function, stability over this class is necessary for convergence on more general objectives. We show that once $\sigma_d$ decays slower than the critical rate, divergence already occurs within this subclass, establishing the sharpness of the rate.

Theorem 2 (Exact divergence for cubic objectives). Let $f(x)$ denote the cubic polynomial in Eq. (24). Assume $\|a\|$, $\|B\|$ and $\|C\|$ are all $O(1)$, where denotes the operator norm for the -tensor : . Let Assumption 4 hold. Then:

Moreover:

Proof. We start by taking derivatives of $f(x)$:

Substituting this into the definition of $\nabla_\mu J$ and using Eq. (20):

where we have used the symmetry of $C$. As $C(v, \mu, \cdot)$ is linear in $v$, its expectation under zero-mean Gaussian noise is zero, hence:

proving our first result. Now, it follows that $\|C(v, v, \cdot)\| \le \|C\| \, \|v\|^2$, and as $\|C\| = O(1)$:

Now, as $v$ is unit Gaussian, $\mathbb{E}\big[\|v\|^2\big] = d$, hence:

We now show that the bound is tight. Consider the function $f(x) = u^\top x \, \|x\|^2$ for $u = \frac{1}{\sqrt{d}} \mathbf{1}$. The factor of $\frac{1}{\sqrt{d}}$ ensures that the gradient $\nabla f$ is $O(1)$. We can write $\|x\|^2$ as the tensor contraction:

where $I_d$ is the identity matrix and:

hence we can write $f(x)$ as the tensor contraction:

where $C$ is defined as the symmetrisation of the tensor product of $u$ and the identity matrix. Using this function:

hence , achieving the upper bound rate of $O(d)$, which implies:

Our final result follows immediately:

C.3 EGGROLL Linearisation

We now study the effect of EGGROLL in high dimensions. We introduce the notation $v = \operatorname{vec}(E)$ for the vectorisation of the low-rank matrix perturbation $E = \frac{1}{\sqrt{r}} A B^\top$ and work in vector space. The EGGROLL vector update $v$ can thus be written as a sum of independent variables:

with:

where recall that $a_i$ and $b_i$ are the $i$th column vectors of $A$ and $B$. We write $\mu = \operatorname{vec}(M)$. Using Eq. (13), we can convert between results in vector space and matrix space as:

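As a quick sanity check of this decomposition (a NumPy sketch with illustrative sizes, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 4, 3, 2  # illustrative sizes

A = rng.standard_normal((m, r))
B = rng.standard_normal((n, r))

# Low-rank perturbation E = (1/sqrt(r)) A B^T and its vectorisation v = vec(E).
E = A @ B.T / np.sqrt(r)
v = E.reshape(-1)

# Equivalent sum of independent terms: v = (1/sqrt(r)) * sum_i vec(a_i b_i^T),
# where a_i and b_i are the i-th columns of A and B.
v_sum = sum(np.outer(A[:, i], B[:, i]).reshape(-1) for i in range(r)) / np.sqrt(r)

assert np.allclose(v, v_sum)
```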
To extend our analysis, we need to ensure that all polynomial moments of $P(v)$ are finite and grow at most polynomially in the dimension $d = mn$. In particular, such tail bounds are sufficient to dominate polynomial error terms in our analysis. To introduce sub-Gaussian variables, we follow the exposition of Vershynin (2018) and results therein. A random variable $x_i \in \mathbb{R}$ is sub-Gaussian if there exists some finite constant $C > 0$ such that for all $t > 0$:

meaning their tails decay like Gaussians. This is equivalent to any of the following three properties holding (Vershynin, 2018, 2.6.1): there exist constants $C_1, C_2, C_3 > 0$ that differ at most by an absolute constant factor such that:

and if $\mathbb{E}[x_i] = 0$:

A random vector $x \in \mathbb{R}^d$ is sub-Gaussian if all one-dimensional marginals of $x$ are sub-Gaussian, i.e. $x^\top u$ is sub-Gaussian for all $u \in \mathbb{R}^d$. The sub-Gaussian norm is defined as:

which returns the smallest universal sub-Gaussian constant for all marginals.

A key property of sub-Gaussian vectors that we use in our proofs is the sub-Gaussian concentration inequality for the Euclidean norm (Vershynin, 2018, Theorem 3.1.1), which states that if $x \in \mathbb{R}^m$ is a sub-Gaussian vector with $\mathbb{E}[x_i^2] = 1$ and $K = \|x\|_{\psi_2}$, then there exists an absolute constant $C > 0$ such that for all $t \ge 0$,

$$\mathbb{P}\left( \bigl|\, \|x\|_2 - \sqrt{m} \,\bigr| \ge t \right) \le 2 \exp\!\left( - \frac{C t^2}{K^4} \right).$$

We also use a weaker form of control that replaces the Gaussian-like tail decay with an exponential decay; all other properties are defined similarly. In this paper, we say a variable $x$ is sub-exponential if there exists a $K > 0$ such that for all :

Our first result derives a bound on the expected value of the norms and :

Lemma 2. Let Assumption 6 hold. Let denote the distribution over columns of and denote the distribution over columns of . Then:

Proof. It suffices to prove , as follows automatically from the same assumptions. We start by using the 'layer cake' representation of the expectation (Lieb & Loss, 2010, Theorem 1.13):

Let for any . We split the integral into two regions:

For the first integral:

For the second integral, we wish to bound for the region . Setting , the assumption implies in this region, hence

We bound this using the sub-Gaussian concentration inequality from Eq. (25). Under Assumption 6, is a sub-Gaussian vector with a finite sub-Gaussian norm, hence there exists an absolute constant such that for all ,

This implies:

for all . Substituting yields:

Let $x = t - \sqrt{m}$, which implies $\mathrm{d}t = \mathrm{d}x$:

Now, the square root of m is less than or equal to x over C minus one for all , hence:

Combining the two bounds yields:

$$\mathbb{E}\big[ \|a\|^i \big] = O\big( m^{i/2} \big)$$

as required.

Using this result, we now bound the whole vector $v = \frac{1}{\sqrt{r}} \sum_{i=1}^{r} \operatorname{vec}\big( a_i b_i^\top \big)$:

Lemma 3. Let . Under Assumption 6:

$$\mathbb{E}\big[ \|v\|^i \big] = O\big( (rmn)^{i/2} \big)$$

Proof. For any vectors a and b:

Applying Lemma 2 under Assumption 6 for each summand of v:

Applying the triangle inequality:

Now, as $i \ge 1$, we can apply Jensen's inequality:

yielding:

Our proof borrows techniques used to prove linearisation of the ES update in Section C.1, bounding the tail probability of any polynomial under the low-rank distribution outside of the ball $B_\rho(\mu)$. To apply the concentration inequality that generalises Lemma 1, we show that $v$ has an exponentially decaying tail:

Lemma 4 (Exponential Tail Bound). Let $r$ be finite and let Assumption 6 hold. Then all elements of $v$ are sub-exponential, and for $\sqrt{d}\,\sigma_d = o(1)$ there exists some constant $C > 0$ such that:

Proof. In matrix form:

The elements of E are thus:

As $a_{ij}$ and $b_{ik}$ are independent sub-Gaussian random variables with zero mean, it follows from Vershynin (2018, Lemma 2.8.6) that their product $a_{ij} b_{ik}$ is a zero-mean sub-exponential variable with finite $\psi_1$-norm. Finally, a finite sum of sub-exponential variables is sub-exponential (Wainwright, 2019, Eq. (2.18)) with a uniform norm, so all elements of $E$, and hence of $v = \operatorname{vec}(E)$, are zero-mean sub-exponential with a uniform finite $\psi_1$-norm $K$.

We now bound . For the vector v, it follows for :

This is easily proven via the contrapositive: if then

implying . This means for :

As $v_j$ is a sub-exponential variable with finite uniform sub-exponential norm, by definition (Vershynin, 2018, Proposition 2.8.1) there exists a finite $K$ such that for all $j$:

Applying to Eq. (26) yields:

Now, setting $t = \rho / \sigma_d$ and $C = 1/K$ yields:

We now use these results to assemble our key polynomial tail bound:

Lemma 5 (EGGROLL Polynomial Tail Bounds). Let Assumption 6 hold. Let be polynomially bounded as:

for some finite polynomial of order and constant . Consider the ball . Let denote the event that a mutation lies outside the ball. Assume . Then for some constant independent of :

and in particular the right-hand side is as .

Proof. Let

and denote . Our proof proceeds as in Lemma 1 to obtain:

where and are constant in . Applying the Cauchy–Schwarz inequality to the second expectation gives:

Applying Lemma 3 with fixed and :

Now, . From Lemma 4, there exists some such that:

where we have absorbed the factor of into , hence:

Now, as $\sqrt{d}\,\sigma_d = o(1)$, we have $\sigma_d^p d^{p/2} = o(1)$, hence:

Applying our bounds yields our desired result:

where the $o(1)$ bound follows from the fact that the exponential factor dominates $\sqrt{d}$ and $\sqrt{d}\,\sigma_d = o(1)$.

Theorem 3 (EGGROLL Convergence to Linearity). Let Assumptions 3, 4, 5 and 6 hold, $\sigma_d = o\big(d^{-1/2}\big)$ and $L_d (\sigma_d d)^2 = o(1)$. Then there exists some $K > 0$ such that:

almost surely with respect to the distribution over .

Proof. We start with the definition of the vectorised EGGROLL update:

where we have used the fact that the expectation of an odd function under a symmetric, zero mean distribution is always zero, and satisfies this under Assumption 6, hence , and from Lemma 6. Consider the ball . We now split the integral into two regions, the first within the ball and the second outside:

Consider the region inside the ball:

Within this region, $f(\mu + \sigma_d v)$ is $C^2$ continuous under Assumption 5. We can thus write $f(\mu + \sigma_d v)$ using a first-order Taylor expansion about $\mu$ with a Hessian (second-order derivative) remainder within the ball:
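In standard Lagrange-remainder form (a reconstruction consistent with the surrounding argument), the expansion reads:

```latex
f(\mu + \sigma_d v)
  = f(\mu) + \sigma_d\, v^\top \nabla f(\mu)
  + \frac{\sigma_d^2}{2}\, v^\top \nabla^2 f(\xi)\, v ,
  \qquad \xi \in [\mu,\, \mu + \sigma_d v] .
```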

Applying the Lipschitz bound on the Hessian from Assumption 5:

Using this to bound Eq. (27):

Now, (for fixed $r$) we apply the identity $\mathbb{E}\big[\|v\|^4\big] = O\big((mn)^2\big)$ with $mn = d$ from Lemma 3:

We now bound the tail region outside the ball:

Now, under Assumptions 3, 4 and 5, $f(\mu + \sigma_d v)$ is polynomially bounded, $\|\nabla f\| = O(1)$ and the Hessian $\nabla^2 f$ is polynomially bounded, hence there exists some finite constant $C > 0$ and finite polynomial order $p$ such that:

We thus apply Lemma 5:

Now, as $\sigma_d \sqrt{d} = o(1)$, the exponential term dominates the prefactor $d\sqrt{d} / (d \sigma_d)^2$, and we conclude:

Our final result follows from:

We have already shown that $\big\| \operatorname{vec}(\hat{g}_{\mathrm{LR}}) - \nabla f(\mu) \big\| = o(1)$, and under the assumptions of this theorem, Theorem 1 holds, so $\big\| \nabla f(\mu) - \nabla_\mu J(\theta) \big\| = o(1)$.

D Asymptotic Rank Analysis

For convenience, we work with random vectors in our analysis. We analyse the vector $v_r = \operatorname{vec}(E_r)$, the vectorisation of the low-rank matrix $E_r$, and denote by $v = \operatorname{vec}(E)$ the vectorisation of the full-rank matrix $E$. Note that $v \sim \mathcal{N}(0, I)$, which we denote $P(v)$. We write $v_r$ as a standardised sum of $r$ independent, zero-mean random vectors. Let

$$u_i = \operatorname{vec}\big( a_i b_i^\top \big)$$

where recall that $a_i$ and $b_i$ are the $i$th column vectors of $A$ and $B$, and so:

$$v_r = \frac{1}{\sqrt{r}} \sum_{i=1}^{r} u_i$$

Denoting the covariance matrix of $u_i$ as $\Sigma_u$, the central limit theorem shows that $v_r$ converges in distribution to a zero-mean Gaussian with covariance $\Sigma_u$. In Lemma 6, we derive the covariance matrix $\Sigma_u$, which we prove is the identity. Our analysis uses an Edgeworth expansion (Bhattacharya & Ranga Rao, 1976) to characterise precisely the rate at which $P(v_r)$ converges to the limiting Gaussian distribution. In Lemma 7, we make an Edgeworth expansion of $P(v_r)$ to show that it is dominated by terms of order $r^{-1}$ and higher. These are then used to prove Lemma 8, which allows us to bound the integral of the remainder of the Edgeworth expansion, thereby characterising how fast $P(v_r)$ converges to the limiting Gaussian distribution.

Lemma 6. Let Assumption 1 hold and u i be defined in Eq. (28). Then the variable u i has identity covariance matrix:

has finite 4th-order absolute moments:

and the vector v r is zero-mean and has identity covariance matrix:

Proof. Under the vec operator, the vector $u_i$ can be written element-wise as:

We note that all elements of the vector $u_i$ have zero mean, and so the covariance matrix is the expectation of the outer product:

The diagonal elements of $\Sigma_u$ are:

$$\mathbb{E}\big[ (a_i b_j)^2 \big] = \mathbb{E}\big[ a_i^2 \big] \, \mathbb{E}\big[ b_j^2 \big] = 1.$$
As all elements of $a$, $b$ and $\epsilon$ are zero-mean, the off-diagonal elements are zero:

Using Eqs. (29) and (30), our first result follows:

Now, as $u_i$ is a vector of elements that are sums and products of variables which all have finite 4th-order moments under Assumption 1, it immediately follows that $u_i$ has finite 4th-order absolute moments.

For our final result, we can write v r as sum of independent variables:

where $x_i$ is defined as $\frac{1}{\sqrt{r}} u_i$. As $v_r$ is a sum of zero-mean vectors, it is also zero-mean. We use the fact that the covariance of a sum of $r$ i.i.d. random variables is equal to the sum of the individual covariances, hence

as required.
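The identity-covariance claim can also be spot-checked by Monte Carlo (a NumPy sketch with small illustrative sizes; `a` and `b` have standard normal entries, matching the Gaussian case of the assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, N = 2, 2, 200_000  # illustrative matrix sizes and sample count

# Draw N independent pairs (a, b) and form u = vec(a b^T) for each pair;
# Lemma 6 says Cov(u) should be the (mn x mn) identity.
a = rng.standard_normal((N, m))
b = rng.standard_normal((N, n))
u = (a[:, :, None] * b[:, None, :]).reshape(N, m * n)

# u is zero-mean, so u^T u / N estimates the covariance matrix.
cov = u.T @ u / N
assert np.allclose(cov, np.eye(m * n), atol=0.05)
```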

Using Lemma 6, we see the asymptotic Gaussian density of $v_r$ is a standard normal:

$$g(v_r) = \frac{1}{\sqrt{(2\pi)^d}} \exp\!\left( - \frac{\|v_r\|^2}{2} \right)$$

which is the density of $P(v)$, where recall that $v = \operatorname{vec}(E)$ is the vectorisation of the full-rank matrix $E$.

Although $P(v_r)$ does not have a density in the usual sense for low rank $r$, we can still approximate it with a distribution $\hat{p}(v_r)$ by making a Taylor series expansion of its characteristic function, which always exists regardless of whether $P(v_r)$ has a well-defined density. We now derive the 4th-order Edgeworth expansion for $P(v_r)$. Our proof reveals that 3rd-order cumulants control all terms in the expansion that decay at rate $O\big(r^{-1/2}\big)$. As the 3rd-order cumulants are all zero due to symmetry in Assumption 1, the overall decay rate is controlled by the $O\big(r^{-1}\big)$ terms associated with 4th-order cumulants. It is for this reason that we obtain a faster convergence rate than the standard central limit theorem.

Lemma 7. Let Assumption 1 hold and let $v_r = \operatorname{vec}(E_r)$ and $u_i$ be defined in Eq. (28). Let $g(v_r)$ denote the limiting Gaussian density in Eq. (31). Then the 2nd-order Edgeworth expansion of $v_r$ is a distribution $\hat{P}(v_r)$ defined by the approximate density:

where:

is a 4th-order Hermite polynomial associated with $g(v_r)$ (Laplace, 1811; Hall, 1992; Withers, 2000).

Proof. We denote the characteristic function of $P(u_i)$ as:

and the characteristic function of $P(v_r)$ as:

Recall that $v_r = \frac{1}{\sqrt{r}} \sum_{i=1}^{r} u_i$ is the sum of $r$ i.i.d. copies of $\frac{1}{\sqrt{r}} u_i$. Using the scaling property of the Fourier transform, the characteristic function of $\frac{1}{\sqrt{r}} u_i$ is $\phi_U\big( \omega / \sqrt{r} \big)$. The distribution of a sum of $r$ independent random variables is given by the $r$-fold convolution of the individual distributions. As convolution in the spatial domain corresponds to multiplication in the frequency domain, the characteristic function of $v_r$ is (Bhattacharya & Ranga Rao, 1976):

Taking logarithms yields the log-characteristic function:

where $K_U(\omega)$ is defined as $\log \phi_U(\omega)$. The cumulants are defined by

The Edgeworth expansion proceeds by a Taylor expansion of $r\, K_U\big( \omega / \sqrt{r} \big)$ about $\omega = 0$. A 4th-order expansion yields:

where . Under Assumption 1, is symmetric, hence all odd-order cumulants vanish: . The second-order cumulant satisfies

and from Lemma 6 we have . Substituting yields:

Exponentiating and expanding the exponential to first-order in gives:

Taking the inverse Fourier transform (with the convention ) yields:

and using the identity , we recover the stated Edgeworth density.

We now apply key results from Bhattacharya & Ranga Rao (1976) to bound the difference in expectation between the low-rank distribution and the Edgeworth approximation, as well as the difference in expectation between the true ES Gaussian distribution and the Edgeworth approximation.

Lemma 8. Let $f(v)$ be defined as $f(M)$ with $M = \mu + \sigma \operatorname{mat}(v)$, let $P(v)$ follow a standard normal distribution, let $P(v_r)$ be the distribution of $v_r$ and let $\hat{P}(v_r)$ be the 2nd-order Edgeworth expansion of $P(v_r)$. Let Assumptions 1 and 7 hold and let $v_r = \operatorname{vec}(E_r)$ and $u_i$ be defined in Eq. (28). Then:

Proof. In Lemma 7, we have shown that the Edgeworth expansion for $P(v_r)$ is controlled by 4th-order cumulants and higher, that is:

We show that the three assumptions needed to apply Bhattacharya & Ranga Rao (1976, Theorem 20.1) to obtain our result using Eq. (32) hold. Firstly, the boundedness assumption on the integrand holds:

Secondly, the sampling regularity assumption, that $u_i$ (as defined in Eq. (28)) is zero-mean i.i.d. (satisfied under Assumption 1) with finite 4th-order moments (satisfied by Lemma 6), holds. Let $\phi_U(\omega)$ denote the characteristic function of $p(u)$; the final assumption we need to verify is the Cramér condition: , which is satisfied by the Riemann–Lebesgue lemma (Folland, 1999, Theorem 8.22) because $p_0$ is absolutely continuous under Assumption 1, and hence as . Our first result thus follows from applying Bhattacharya & Ranga Rao (1976, Theorem 20.1):

We now derive our second result.

hence

Now, by definition, $H_{ijkl}(v)$ is a 4th-order Hermite polynomial, and under Assumption 7, $|f(v)|$ is bounded, hence has polynomial growth of order 5 and is bounded by:

for some finite $C > 0$. As the expectation of a finite-order polynomial under a standard normal distribution is bounded, it thus follows:

as required.

Using Lemma 8, we have all the ingredients needed to derive our main convergence result, which follows after some simple algebra on the norm:

Theorem 4. Let Assumptions 1 and 7 hold, then:

Proof. We start by converting the Frobenius norm to vector form using Eq. (13):

where and is the vectorisation of variable , which is distributed as . Let be the distribution of the 2nd-order Edgeworth expansion, which we derived in Lemma 7. Since and are identified as the same Edgeworth-expanded distribution on , we may equivalently write:

hence:

Applying Lemma 8 to each bound yields our desired result:

D.1 Mean Field Score Function Approximator

We will use $n$th-order modified Bessel functions of the second kind, $K_n(z)$ (Basset, 1888; Macdonald, 1899; Watson, 1944), which are conveniently represented by the integral equations:

Bessel functions are the solutions to systems of differential equations that occur naturally in phenomena with strong radial symmetry, typically involving the propagation of spherical waves from points, like the ripples formed by water droplets (Whitham, 1999). For our setting, Bessel functions describe the probability density of the product of rotationally invariant random variables, whose solution is analogous to the interference pattern of two spherical wave propagators.

Using this representation, we find that the derivative of the zeroth-order function takes the recursive form:

More generally, the derivative of the $n$th-order Bessel function is (Watson, 1944, Section 3.71, Eq. 4):
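Both derivative identities, $K_0'(z) = -K_1(z)$ and $K_n'(z) = -\tfrac{1}{2}\big(K_{n-1}(z) + K_{n+1}(z)\big)$, can be checked numerically. A stdlib-only sketch using the integral representation $K_n(z) = \int_0^\infty e^{-z \cosh t} \cosh(nt)\,\mathrm{d}t$ (the quadrature settings are illustrative):

```python
import math

def bessel_k(n, z, t_max=20.0, steps=20_000):
    """Modified Bessel function of the second kind via the integral
    representation K_n(z) = int_0^inf exp(-z cosh t) cosh(n t) dt,
    truncated at t_max and evaluated with the trapezoidal rule."""
    h = t_max / steps
    total = 0.0
    for i in range(steps + 1):
        t = i * h
        w = 0.5 if i in (0, steps) else 1.0
        total += w * math.exp(-z * math.cosh(t)) * math.cosh(n * t)
    return h * total

# K_0'(z) = -K_1(z): check with a central finite difference at z = 1.
h = 1e-4
k0_prime = (bessel_k(0, 1 + h) - bessel_k(0, 1 - h)) / (2 * h)
assert abs(k0_prime + bessel_k(1, 1)) < 1e-6

# Watson's recursion K_n'(z) = -(K_{n-1}(z) + K_{n+1}(z)) / 2, checked at n = 1.
k1_prime = (bessel_k(1, 1 + h) - bessel_k(1, 1 - h)) / (2 * h)
assert abs(k1_prime + 0.5 * (bessel_k(0, 1) + bessel_k(2, 1))) < 1e-6
```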

D.2 Derivation of Mean-field Approximators

To derive a mean-field approximation, we assume that the elements of A and B are drawn independently from the set of generalised Gaussian distributions (GGDs):

Assumption 8. Assume each element $a_{ij} \sim \mathcal{GG}(s, p)$ and $b_{ij} \sim \mathcal{GG}(s, p)$ of $A$ and $B$ is independently distributed according to the zero-mean generalised Gaussian distribution with density:

where $s \in (0, \infty)$ is the scale parameter, $p > 0$ is the shape parameter and $\Gamma$ is the gamma function.

We observe that common distributions emerge from the set of GGDs, including the Laplace for $p = 1$, the Gaussian for $p = 2$ and the uniform over in the limit $p \to \infty$.
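These special cases follow directly from the density $f(x) = \frac{p}{2 s \Gamma(1/p)} e^{-(|x|/s)^p}$. A stdlib-only sketch (the evaluation points are illustrative):

```python
import math

def ggd_pdf(x, s, p):
    """Density of the zero-mean generalised Gaussian distribution:
    f(x) = p / (2 s Gamma(1/p)) * exp(-(|x| / s)^p)."""
    return p / (2 * s * math.gamma(1 / p)) * math.exp(-((abs(x) / s) ** p))

# p = 1 recovers the Laplace density with scale s: (1 / (2 s)) exp(-|x| / s).
assert math.isclose(ggd_pdf(0.7, s=1.0, p=1), 0.5 * math.exp(-0.7))

# p = 2 recovers a Gaussian: with s = sqrt(2) the density is standard normal,
# since Gamma(1/2) = sqrt(pi).
std_normal = math.exp(-0.7 ** 2 / 2) / math.sqrt(2 * math.pi)
assert math.isclose(ggd_pdf(0.7, s=math.sqrt(2), p=2), std_normal, rel_tol=1e-9)
```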

If we assume that all elements of $E$ are independent (this becomes true as $r$ grows), then we can write as the product of the marginal distributions. Under this approximation, the score function can be defined element-wise as:

Using this approximation, we apply the score function $\hat{S}$ element-wise to the matrix $E$:

For $r = 1$, $\hat{S}$ has a convenient analytic form for all members of the set of GGDs:

Theorem 5. Let Assumption 8 hold and $r = 1$. Then the distribution over marginals is:

where $K_0$ is the zeroth-order modified Bessel function of the second kind and the marginal score function is defined element-wise as:

Proof. For , we denote the elements of vector as and elements of vector as , then the elements of matrix are: . We now derive the distribution of the unnormalised variables:

using the formula for the distribution of the product of two independent random variables (Rohatgi, 1976; Grimmett & Stirzaker, 1993):

where we have used the symmetry of the integrand about zero to derive the final line. Now, making the substitution $x = (a_i / s)^p$, we have:

hence:

Now, we use the identity (Temme, 1996, Theorem 9.42):

with $z = \frac{2 |E_{ij}|^{p/2}}{s^{p}}$ to yield:

as required for Eq. (35). Now we derive the marginal score function by applying the chain rule:

where we have used the identity from Eq. (33).

For $r > 1$, we can derive $\hat{S}$ for the Gaussian sampling case:

Theorem 6. Let Assumption 8 hold and $p = 2$. Then the distribution over marginals $p(E_{ij})$ is:

and the score function is (for $E_{ij} \neq 0$):

Proof. Each element $E_{ij}$ is the sum of $r$ independent variables $u_{ijl} := a_{il} b_{jl}$, distributed according to Eq. (35) with $p = 2$:

Let $Z_{ij} = \sqrt{r}\, E_{ij}$, hence:

We first find the density $p(Z_{ij})$. From Eq. (35), the distribution of each $u_{ijl}$ is:

We use the fact that the PDF of a sum of $r$ independent random variables (i.e. $Z_{ij}$) is given by the $r$-fold convolution of the individual PDFs. As convolution in the spatial domain is equal to multiplication in the frequency domain, the PDF $p(Z_{ij})$ follows by taking the Fourier transform of $p(u_{ijl})$, raising it to the power $r$, and then taking the inverse Fourier transform:

where recall from Section A with $d = 1$ that $\mathcal{F}[f](\omega) = \int f(x)\, e^{-i \omega x}\, \mathrm{d}x$ denotes the Fourier transform, and the inverse Fourier transform. Taking the Fourier transform of the Bessel function:

where we have used the fact that $K_0$ is an even function of $x$, and so its integral against $\sin(\omega x)$ in the second line is zero. Using a standard result, we can evaluate the integral in Eq. (36) (Gradshteĭn et al., 2015, 6.671 Integral 14):

hence:

where we have used the fact that the integrand is an even function, and so its integral against $\sin(\omega Z_{ij})$ is zero, to derive the penultimate line. To evaluate the integral in Eq. (37), we apply (Gradshteĭn et al., 2015, 3.771 Integral 2):

Using the change of variables $E_{ij} = \frac{1}{\sqrt{r}} Z_{ij}$ yields our desired results:

Now, we derive the score function:

Now, from Eq. (34), for $E_{ij} \neq 0$:

as required.
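The convolution-via-Fourier step used in this proof has a simple discrete analogue, which can be checked directly (a NumPy sketch with an arbitrary illustrative PMF, not the Bessel densities themselves):

```python
import numpy as np

# The PMF of a sum of r i.i.d. variables is the r-fold convolution of the
# individual PMF, i.e. the inverse FFT of the r-th power of its FFT.
p = np.array([0.2, 0.5, 0.3])  # illustrative PMF
r = 3

# Direct r-fold convolution.
direct = p.copy()
for _ in range(r - 1):
    direct = np.convolve(direct, p)

# Via the frequency domain: zero-pad to the support size of the sum first,
# so circular convolution coincides with linear convolution.
n = r * (len(p) - 1) + 1
via_fft = np.fft.ifft(np.fft.fft(p, n) ** r).real

assert np.allclose(direct, via_fft)
```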

E EGGROLL Speed

All timings were performed on a single GH200 GPU (equivalent to a single H100) for a linear model with dimension 8192 in bfloat16, allowing a maximum batch size of 1024. For the graph in Figure 2a, we pre-generate the noise instead of integrating noise generation into the forward pass.

[Bar chart of normalised training speeds for EGGROLL, PPO, and OpenES.] Figure 7: Relative speed of EGGROLL when including JAX noise regeneration.

In Figure 7, we consider the impact of regenerating noise on the fly using the JAX PRNG. The darker area and the value in parentheses for EGGROLL and OpenES indicate the speed when regenerating noise on the fly, while the full bar indicates the speed when the noise is pre-generated.

We regenerate noise on the fly in our primary JAX codebase, but pre-generating the EGGROLL perturbations beforehand is also practical, since low-rank perturbations only require a small amount of memory, proportional to the square root of the size of the original parameter matrices.

F Arithmetic Intensity Analysis

In this section, we derive the arithmetic intensity of standard batched inference, Gaussian matrix ES, and EGGROLL. We calculate arithmetic intensity as the number of operations divided by the total number of bytes read or written. For context, for the (b)float16 datatype on an H100 GPU, there are approximately 1000 teraFLOPS of compute (without sparsity) and 3.35 TB/s of GPU memory bandwidth, meaning that the roofline threshold, defined as the minimum arithmetic intensity needed for computation to be the bottleneck instead of memory movement, is approximately 300 ops/byte.

In the following subsections, we consider a single linear layer with mean parameter $M \in \mathbb{R}^{d_{\mathrm{out}} \times d_{\mathrm{in}}}$ and a batch of inputs $u \in \mathbb{R}^{B \times d_{\mathrm{in}}}$. All operations occur with a precision of $s$ bytes per element.

F.1 Arithmetic Intensity of Standard Batched Inference

In standard batched inference, we simply wish to calculate $u M^\top$. The total bytes read as input are $B d_{\mathrm{in}} s$ (for $u$) and $d_{\mathrm{out}} d_{\mathrm{in}} s$ (for $M$), and the total bytes written as output are $B d_{\mathrm{out}} s$. The total number of operations is $2 B d_{\mathrm{in}} d_{\mathrm{out}}$, since matrix multiplication requires both multiplications and additions for each element of $u$ across all of $d_{\mathrm{out}}$. Therefore, the arithmetic intensity is:

When $s = 2$ (for (b)float16) and $d_{\mathrm{out}} = d_{\mathrm{in}} = m$, the arithmetic intensity simplifies to $\frac{Bm}{2B + m}$.

The batch size needed to achieve a desired arithmetic intensity of A is derived as follows:

Therefore, achieving an arithmetic intensity of 300 ops/byte with $m = 8192$ requires a minimum batch size of 324.
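The accounting above can be reproduced in a few lines (a sketch; the helper names are ours, and the closed-form batch-size formula assumes $s = 2$ and $d_{\mathrm{in}} = d_{\mathrm{out}} = m$):

```python
import math

def ai_standard(B, d_in, d_out, s=2):
    """Arithmetic intensity (ops/byte) of standard batched inference u @ M.T:
    2*B*d_in*d_out ops over the bytes for reading u and M and writing the output."""
    ops = 2 * B * d_in * d_out
    bytes_moved = s * (B * d_in + d_out * d_in + B * d_out)
    return ops / bytes_moved

def min_batch_standard(A, m):
    """Smallest batch size achieving intensity A when s = 2 and
    d_in = d_out = m, solving AI = B*m / (2*B + m) for B."""
    return math.ceil(A * m / (m - 2 * A))

assert min_batch_standard(300, 8192) == 324
assert ai_standard(324, 8192, 8192) >= 300 > ai_standard(323, 8192, 8192)
```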

F.2 Arithmetic Intensity of Gaussian Matrix ES

In Gaussian matrix ES, we assume access to pre-generated perturbations of shape $B \times d_{\mathrm{out}} \times d_{\mathrm{in}}$. The total bytes read as input are $B d_{\mathrm{in}} s$ (for $u$) and $(1 + B)\, d_{\mathrm{out}} d_{\mathrm{in}} s$ (for $M$ and the perturbations), and the total bytes written as output are $B d_{\mathrm{out}} s$. Otherwise, the total number of operations is identical to standard batched inference, giving us an arithmetic intensity of

When $s = 2$ (for (b)float16) and $d_{\mathrm{out}} = d_{\mathrm{in}} = m$, the arithmetic intensity simplifies to $\frac{Bm}{Bm + m + 2B}$.

This means that the arithmetic intensity is always strictly less than 1, regardless of batch size or dimensionality. The common way to increase arithmetic intensity and bring it closer to that of standard batched inference is to reuse the same perturbation across multiple inputs. For instance, when $m = 8192$, achieving an arithmetic intensity of 300 ops/byte requires that each perturbation be reused at least 324 times, and smaller values of $m$ need even more reuse.

F.3 Arithmetic Intensity of EGGROLL

For EGGROLL, we assume access to the pre-generated decomposed perturbations $A \in \mathbb{R}^{B \times d_{\mathrm{out}} \times r}$ and $B \in \mathbb{R}^{B \times d_{\mathrm{in}} \times r}$. Therefore, the bytes read as pure input are $\big( B d_{\mathrm{in}} + d_{\mathrm{out}} d_{\mathrm{in}} + B r (d_{\mathrm{in}} + d_{\mathrm{out}}) \big) s$ and the bytes written as pure output are $B d_{\mathrm{out}} s$. However, the efficient low-rank perturbation calculation requires writing and reading an intermediate matrix of shape $B \times r$, so the total bytes read are

The total number of operations includes the amount for standard batched inference, $2 B d_{\mathrm{in}} d_{\mathrm{out}}$, along with the rank-$r$ perturbations, $2 B r (d_{\mathrm{in}} + d_{\mathrm{out}})$, and the final sum between the main calculation and the perturbation, $B d_{\mathrm{out}}$. Therefore, the arithmetic intensity is

When $s = 2$ (for (b)float16) and $d_{\text{out}} = d_{\text{in}} = m$, the arithmetic intensity simplifies to

$$A_{\text{EGG}} = \frac{Bm\left(2m + 4r + 1\right)}{2\left(2Bm + m^2 + 2Br\left(m + 1\right)\right)}.$$
The batch size needed to achieve a desired arithmetic intensity of $A$ is derived as follows:

$$B = \frac{2Am^2}{m\left(2m + 4r + 1\right) - 4Am - 4Ar\left(m + 1\right)} = \frac{Am}{m - 2A + 2r + \frac{1}{2} - 2Ar\frac{m + 1}{m}}.$$
Note that the only difference with the critical batch size of standard batched inference is the additional $2r + \frac{1}{2} - 2Ar\frac{m + 1}{m}$ in the denominator. Therefore, achieving an arithmetic intensity of 300 ops/byte with $m = 8192$ and $r = 1$ requires a minimum batch size of 352, compared to 324 for standard batched inference. This means that EGGROLL can saturate compute with unique perturbations per input, unlike Gaussian matrix ES.

Note that there is an overhead of $Bm\left(4r + 1\right)$ flops relative to standard batched inference, resulting in an additional relative compute cost of $\frac{4r + 1}{2m}$, which is effectively negligible for large enough matrices.
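The same byte/flop accounting can be repeated for EGGROLL; a minimal sketch that recovers the critical batch size by direct search (names are our own):

```python
def eggroll_intensity(B, m, r, s=2):
    """Arithmetic intensity of u M^T plus per-sample rank-r perturbations."""
    ops = 2 * B * m * m + 2 * B * r * (m + m) + B * m     # base matmul + low-rank terms + final sum
    reads = s * (B * m + m * m + 2 * B * r * m + B * r)   # u, M, the A and B factors, intermediate
    writes = s * (B * m + B * r)                          # output, intermediate
    return ops / (reads + writes)

def eggroll_critical_batch(A, m, r, s=2):
    """Smallest batch size with intensity >= A (linear search for clarity)."""
    B = 1
    while eggroll_intensity(B, m, r, s) < A:
        B += 1
    return B
```

With $m = 8192$ and $r = 1$, `eggroll_critical_batch(300, 8192, 1)` returns the 352 quoted above.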

G EGG Architecture

In the following section, we detail the design of our EGG model, which follows the high-level structure of modern pre-layernorm decoder-only language models, but replaces self-attention with a modified minGRU and standard layernorms with a custom variant to enable pure integer training. See Algorithm 2 for an overview of the forward pass of the EGG architecture.

Algorithm 2 EGG forward pass

Require: Input token $t \in U_8$, input state $s = \left(s_1, \dots, s_l\right)$, network parameters $\theta$
Ensure: Output vector $y$ and output state $s' = \left(s'_1, \dots, s'_l\right)$
$s$ initialised to 0 at the start of each sequence
$u \leftarrow \theta_{\text{emb}}\left[t\right]$

for $i = 1, \dots, l$ do

$s'_i \leftarrow \mathrm{GRU}\left(\mathrm{LN}\left(u; \theta_{\ln 1, i}\right), s_i; \theta_{\mathrm{GRU}, i}\right)$
$u \leftarrow u + s'_i$
$u \leftarrow u + \mathrm{MLP}\left(\mathrm{LN}\left(u; \theta_{\ln 2, i}\right); \theta_{\mathrm{mlp}, i}\right)$

end for

$y \leftarrow \mathrm{LN}\left(u; \theta_{\ln\text{out}}\right) \mathbin{@} \theta_{\text{head}}^\top$

G.1 Motivation

Since EGGROLL does not rely on gradients, we can explicitly design a language model architecture to be efficient and hardware-friendly at inference time. In particular, we design EGG under the following constraints to emphasise the flexibility of EGGROLL:

Pure Integer Training: On H100 systems, int8 is the fastest datatype and int8 matrix multiplication with int32 accumulation is the fastest tensor core operation. Furthermore, integer datatypes are much simpler to implement in hardware, providing massive energy savings for high-throughput systems (Horowitz, 2014). Therefore, we keep all weights in int8 and all activations in integer formats, never casting to floating point at any point during training. This stands in contrast to the standard approach for language model quantisation through “quantisation aware training” with backpropagation, where floating point activations are still necessary (Wang et al., 2023).

Nonlinear RNN: Modern language models use sequence-parallel architectures like Transformers and SSMs, since they enable stable gradients without backpropagation through time. However, most of these architectures cannot handle simple state tracking (Merrill et al., 2024), whereas classic recurrent networks like LSTMs and GRUs can do so with a single layer. Since EGGROLL does not require backpropagation through time, we can train on unbounded sequence lengths (Li et al., 2023) with nonlinear RNNs of broader complexity classes. Specifically, we develop a variant of the minGRU model (Heck & Salem, 2017) that performs all operations in integer formats.

Removal of all Activation Functions: Inspired by prior work (Foerster, 2017), we remove all activation functions, like the rectified linear unit and hyperbolic tangent, relying instead on the nonlinearity present in the int8 datatype. Specifically, the saturated addition of int8 values provides sufficient nonlinearity due to the implicit clipping of values to the int8 dynamic range, which evolution strategies can exploit.

G.2 Notation and Operations

We use the constant $l \in \mathbb{Z}^+$ to denote the number of layers of the model and $D = 4^d$ as the hidden dimension of the model, where $d \in \mathbb{Z}^+$.

We use $I_n$ to denote an $n$-bit signed integer and $U_n$ to denote an $n$-bit unsigned integer. We denote casting a vector $u$ to format $I_n$ as $I_n\left(u\right)$, which implicitly includes clipping to the bounds of the datatype. To ensure symmetry between positive and negative values of each datatype, we consider the value $-2^{n-1}$ to be invalid for datatype $I_n$; for instance, for 8-bit signed integers we only allow values from $-127$ to $127$.

We use the following operations:

  • $u \mathbin{@} M$, indicating scaled vector-matrix multiplication, $\mathbin{@}: I_8^n \times I_8^{n \times m} \to I_8^m$, corresponding to int8 tensor core multiplication with int32 accumulation and scaling. The details of this operation are described in Section G.4.
  • $a \cdot b$ indicates the dot product with int32 accumulation, $\cdot: I_8^n \times I_8^n \to I_{32}$, and $\odot$ indicates the Hadamard (elementwise) product.
  • Standard integer operations: $+$ for addition, $-$ for subtraction, and $\times$ for element-wise multiplication.
  • $\left|u\right|$ indicates taking the element-wise absolute value of $u$.
  • $\mathrm{sign}\left(u\right)$ indicates taking the element-wise sign of $u$, giving 1 for positive values, $-1$ for negative values, and 0 for zero.
  • $\mathrm{sum}\left(u\right)$ indicates taking the sum of all elements in $u$ (casting to $I_{32}$ to prevent overflow): $\mathrm{sum}\left(u\right) = \sum_i I_{32}\left(u\left[i\right]\right)$.
  • $u \gg n$ indicates an elementwise bitwise right shift by $n$, which is typically equivalent to floor division by $2^n$. Similarly, $u \ll n$ indicates a bitwise left shift by $n$, which is typically equivalent to multiplication by $2^n$.
  • Square-bracket indexing. For instance, $M\left[i, j\right]$ extracts the element at index $i$ in axis 0 and index $j$ in axis 1, following the zero-based indexing convention.

G.3 Parameter Initialisation

The standard initialisation for matrix parameters in our model is rounding 16 times a sample from the standard normal distribution and casting to $I_8$. This can be precomputed on a CPU since it is only done once at the start of training.

The EGG model has the following parameters (where an additional subscript of $i$ indicates that there is a version of this parameter for each layer of the model):

  • $\theta_{\text{emb}} \in I_8^{256 \times D}$, following standard initialisation.
  • $\theta_{\text{head}} \in I_8^{256 \times D}$, following standard initialisation.
  • $\theta_{\ln\text{out}} \in I_8^{D}$, initialised to 16 for each element.
  • $\theta_{\ln 1, i}$ and $\theta_{\ln 2, i} \in I_8^{D}$, initialised to 16 for each element.
  • $\theta_{\text{mlp}, i, 1} \in I_8^{4D \times D}$ and $\theta_{\text{mlp}, i, 2} \in I_8^{D \times 4D}$, following standard initialisation.
  • $\theta_{\mathrm{GRU}, i}$ weights in $I_8^{D \times D}$, following standard initialisation.
  • $\theta_{\mathrm{GRU}, i}$ biases in $I_8^{D}$, initialised to 0 for each element.

In total there are $513D + l\left(4D + 12D^2\right)$ parameters in the model.

G.4 Matrix Multiplication

Tensor cores in GPUs are able to calculate fast vector-matrix multiplications with int32 accumulation as $uM \in I_{32}^m$, where $u \in I_8^n$ and $M \in I_8^{n \times m}$. For our purposes, we define $\mathbin{@}$ as a scaled multiplication:

$$u \mathbin{@} M = I_8\left(\frac{uM}{16\sqrt{n}}\right).$$

Note that when $n = 4^d$, the division operation just becomes a right-shift by $d + 4$, which is fast to calculate.

We choose this scaled matrix multiplication because we initialise $M$ to 16 times standard normal samples for each element, so dividing by $16\sqrt{n}$ preserves the magnitude of $u$ for the output. In particular, if all elements of $u$ and $M$ are drawn independently from the standard normal distribution multiplied by 16, the central limit theorem tells us that the standard deviation per element of the output $uM$ will be $16 \cdot 16\sqrt{n} = 256\sqrt{n}$, so dividing by $16\sqrt{n}$ preserves the standard deviation of 16.
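A plain-Python reference for the scaled multiplication; the hardware version uses int8 tensor cores, and `clip8` and the floor-division rounding are our simplifications:

```python
import math

def clip8(x):
    """Saturating cast to the symmetric int8 range used by EGG."""
    return max(-127, min(127, x))

def scaled_matmul(u, M):
    """u @ M with int32 accumulation, divided by 16*sqrt(n), cast to int8.
    u: length-n list of int8 values; M: n x m list of lists of int8 values."""
    n, m = len(M), len(M[0])
    divisor = 16 * math.isqrt(n)  # equals 2**(d+4) when n = 4**d, so this is a right shift
    out = []
    for j in range(m):
        acc = sum(u[i] * M[i][j] for i in range(n))  # int32 accumulation
        out.append(clip8(acc // divisor))            # scale (floor, like a right shift) and saturate
    return out
```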

G.5 Embedding

Our embedding function takes as input an embedding matrix $\theta_{\text{emb}} \in I_8^{256 \times D}$ and an input token $t \in U_8$, and simply outputs the row corresponding to that token: $\theta_{\text{emb}}\left[t\right] \in I_8^{D}$.

G.6 Layer Normalisation (LN)

Our layer normalisation operation involves multiplying our input $u$ with a weight $\theta_{\ln}$ before dividing by the mean absolute value of $u$.

We decide to divide by the mean absolute value of the input instead of the more common root-mean-square since square roots are expensive on integers. Note that the root-mean-square after dividing the input by its mean absolute value (when using real numbers) is $\sqrt{\pi/2} \approx 1.25$ instead of 1, which we intentionally choose to preserve more bits of information given the limited range of the 8-bit integer datatype.

We calculate the mean absolute value of input $u$ as:

$$u_{\text{mav}} = I_8\left(\mathrm{sum}\left(\left|u\right|\right) \gg 2d\right),$$

where the right shift by $2d$ divides by the hidden dimension $D = 4^d$.

Note that we can safely cast the mean absolute value to an 8 bit integer without overflow given the properties of the mean of a set, though we lose precision due to truncating the fractional component.

The output of layernorm is calculated as:

$$\mathrm{LN}\left(u; \theta_{\ln}\right) = \frac{I_{16}\left(u\right) \odot I_{16}\left(\theta_{\ln}\right)}{u_{\text{mav}}}.$$

Since division is an expensive operation, we precompute it using a lookup table. Note that the product of two 8-bit integer values will always remain in the dynamic range of 16-bit integers, so our lookup table will be of shape $2^{16} \times 2^{7}$ (one entry per 16-bit numerator and valid non-negative 8-bit denominator).
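Putting the pieces together, the integer layer norm can be sketched in plain Python, with the division performed directly rather than via the lookup table; the helper names and the guard against a zero denominator are our additions:

```python
def clip8(x):
    """Saturating cast to the symmetric int8 range."""
    return max(-127, min(127, x))

def int_layernorm(u, theta_ln, d):
    """Integer LN for hidden dimension D = 4**d: scale u by theta_ln and divide
    by the mean absolute value of u (computed with a right shift by 2d)."""
    u_mav = clip8(sum(abs(x) for x in u) >> (2 * d))  # mean |u|, cast to int8
    u_mav = max(u_mav, 1)                             # zero guard (our addition)
    # each product of two int8 values fits in int16; divide (table lookup in practice)
    return [clip8((ui * ti) // u_mav) for ui, ti in zip(u, theta_ln)]
```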

G.7 MLP

Each MLP block consists of two weight parameters: $\theta_1$ of shape $4D \times D$, and $\theta_2$ of shape $D \times 4D$. Given an input $u$, we calculate the output as:

$$\mathrm{MLP}\left(u\right) = \left(u \mathbin{@} \theta_1^\top\right) \mathbin{@} \theta_2^\top.$$

Note that we do not use an activation function, because the $\mathbin{@}$ operation is already nonlinear due to the saturated conversion from $I_{32}$ to $I_8$.

G.8 GRU

Each GRU block accepts an input vector $u$ and state $s$, and consists of 6 weight parameters: $\theta_{W_f}$, $\theta_{U_f}$, $\theta_{W_h}$, and $\theta_{U_h}$ of shape $D \times D$, and bias terms $\theta_{b_f}$ and $\theta_{b_h} \in I_8^D$.

Using these weight matrices, we calculate the following vectors:

$$f = u \mathbin{@} \theta_{W_f}^\top + s \mathbin{@} \theta_{U_f}^\top + \theta_{b_f}, \qquad h = u \mathbin{@} \theta_{W_h}^\top + s \mathbin{@} \theta_{U_h}^\top + \theta_{b_h},$$
$$s' = I_8\left(\left(\left(I_{16}\left(f\right) + 127\right) \odot I_{16}\left(s\right) + \left(127 - I_{16}\left(f\right)\right) \odot I_{16}\left(h\right)\right) \gg 8\right),$$

where $s'$ is both the output and the new hidden state. In the typical GRU, $f$ would pass through the sigmoid function and $h$ through the hyperbolic tangent, but we find that setting these as identity operations is sufficient due to the nonlinearity already present in the clipped addition. One can view this clipped addition operation as a scaled and shifted version of the “hard” tanh and sigmoid operators.

To explain why we perform these operations, we can analyse them relative to the original GRU. The gate vector $f$ for the standard GRU has all elements between 0 and 1 due to the sigmoid, but our elements are between $-127$ and $127$. Therefore, to calculate the gated state contribution (which is typically just $f \odot s$), we first add 127 to $f$, bringing the range to between 0 and 254, before multiplying by $s$ and bit-shifting right by 8 to bring our values back to the int8 dynamic range. We apply similar logic to calculate the final state $s'$, which is typically just $f \odot s + \left(1 - f\right) \odot h$ but needs to be rescaled to keep the int8 dynamic range.
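The gate arithmetic can be checked in a few lines of plain Python; under our reading of the description above, the rescaled update always stays within the int8 dynamic range (helper name is ours):

```python
def gated_update(f, s, h):
    """Integer GRU-style state update: the gate f in [-127, 127] mixes old
    state s and candidate h with weights (f + 127) and (127 - f), then
    rescales with a right shift by 8 (dividing by 256, close to 254)."""
    return ((f + 127) * s + (127 - f) * h) >> 8

# The two gate weights always sum to 254, so the output cannot leave int8 range:
extremes = [gated_update(f, s, h)
            for f in (-127, 0, 127) for s in (-127, 127) for h in (-127, 127)]
```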

G.9 Fitness Calculation in Integer Types

The “fitness” used in language model pretraining is the log-likelihood of correctly generating the next token, treating the outputs of the language model as logits (unnormalised log probabilities). If $t' \in U_8$ is the next token to predict and $y \in I_8^{256}$ are the logits, we can calculate the log-likelihood as follows:

$$y' = I_{32}\left(y\right) + 128, \qquad o = y'\left[t'\right] - \mathrm{LOG2}\left(\mathrm{sum}\left(\mathrm{EXP2}\left(y'\right)\right)\right),$$

where $o$ is the loss for one token. We implement $\mathrm{EXP2}$ and $\mathrm{LOG2}$ as lookup tables.

Note that each element of $\mathrm{EXP2}$ for any unsigned 8-bit input requires at most 20 bits, so the sum of exponents across all 256 possible token choices is at most 28 bits, meaning we have to precompute $\mathrm{LOG2}$ for $2^{28}$ values.
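One plausible construction consistent with these bit-width bounds stores $\mathrm{EXP2}\left(x\right) = \mathrm{round}\left(2^4 \cdot 2^{x/16}\right)$ (at most 20 bits for $x \le 255$) and $\mathrm{LOG2}\left(v\right) = \mathrm{round}\left(16 \log_2 v\right)$, making $o$ a fixed-point log-likelihood in units of $1/16$ up to a constant offset. This scaling is our assumption, not the paper's exact table definition; a sketch that emulates the lookups with functions:

```python
import math

def EXP2(x):
    """Emulated lookup: round(2**4 * 2**(x / 16)) for x in 0..255 (fits in 20 bits).
    The 1/16 fixed-point scaling is our assumption."""
    return round(2 ** (x / 16 + 4))

def LOG2(v):
    """Emulated lookup: round(16 * log2(v)); precomputed for v < 2**28 in practice."""
    return round(16 * math.log2(v))

def token_loss(y, t_next):
    """Integer log-likelihood of token t_next given 256 int8 logits y."""
    y_shift = [int(v) + 128 for v in y]                 # move logits into 1..255
    return y_shift[t_next] - LOG2(sum(EXP2(v) for v in y_shift))
```

For uniform logits the result is $-16 \cdot 8 - 64 = -192$, i.e. $\log_2\left(1/256\right)$ in $1/16$ units plus the constant offset introduced by the $2^4$ factor, which cancels when comparing fitnesses.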

H EGG Pretraining with Integer EGGROLL

The core ideas of EGGROLL still apply in this integer-based training setting, but we have to make some modifications to ensure it only uses integer operations.

H.1 Adding EGGROLL Perturbations

For a parameter $\theta \in I_8^{m \times n}$ that represents a matrix multiplication, we first sample rank-1 perturbation vectors for each index in the batch: $A \in I_8^{B \times m}$ and $B \in I_8^{B \times n}$. We sample these vectors from the standard normal distribution multiplied by 16 and rounded to the nearest 8-bit integer (clipping if necessary). To prevent the use of floating-point arithmetic on the accelerator, we pre-generate a large matrix of these random values, randomly indexing into it to get the perturbation vectors.

Given an input $u$, instead of calculating $u \mathbin{@} \theta^\top$, for batch index $i$ we calculate

$$u \mathbin{@} \theta^\top + I_8\left(\left(\left(u \cdot B_i\right) A_i\right) \gg \hat{\sigma}\right).$$

The value of $\hat{\sigma}$ is a hyperparameter, related to the $\sigma$ in the main paper as $\sigma = 2^{-\hat{\sigma}}$. Note that the batched forward pass remains efficient since it still simply performs a batched vector-vector dot product in int8 (with int32 accumulation) and a batched vector-scalar product in int32.

We apply this same logic to the embedding matrix, since we can interpret $\theta_{\text{emb}}\left[t\right]$ as a one-hot encoding of $t$ multiplied by $\theta_{\text{emb}}$, and still apply our rank-1 updates in that context. In practice, this means replacing $u \cdot B_i$ with $B_i\left[t\right]$.
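Setting the integer details aside, the core low-rank perturbed forward pass has this shape; a float sketch (function and variable names are ours):

```python
def perturbed_forward(u, theta, A, B, sigma):
    """For each batch index i, compute u[i] theta^T + sigma * (u[i] . B[i]) * A[i],
    i.e. a forward pass through theta perturbed by the rank-1 matrix A[i] B[i]^T.
    u: B x n inputs, theta: m x n weights, A: B x m factors, B: B x n factors."""
    out = []
    for ui, Ai, Bi in zip(u, A, B):
        base = [sum(w * x for w, x in zip(row, ui)) for row in theta]  # u_i theta^T
        dot = sum(x * b for x, b in zip(ui, Bi))                       # scalar u_i . B_i
        out.append([y + sigma * dot * a for y, a in zip(base, Ai)])    # rank-1 correction
    return out
```

Because the perturbation only ever touches the input through the scalar $u_i \cdot B_i$, the cost per sample is $O\left(r\left(m + n\right)\right)$ on top of the shared matmul, which is what drives the arithmetic-intensity gains in Appendix F.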

H.2 Fitness Shaping

We employ a simple fitness shaping scheme based on antithetical pairs. Specifically, given raw fitnesses $s^+$ and $s^-$ for the positive and negative sample of the antithetical pair respectively, the transformed fitness for the noise is:

$$F = \mathrm{sign}\left(s^+ - s^-\right).$$

Note that the only possible values for the fitness after shaping are $\left\{-1, 0, 1\right\}$.

H.3 Parameter Update

For a parameter $\theta$ that represents a matrix multiplication (or embedding vector), suppose the sampled batch of rank-1 perturbation vectors are $A$ and $B$, and let the fitnesses after shaping be $F$. Then we calculate an intermediate value $E$ as:

$$E = \sum_{i} F_i A_i B_i^\top.$$

We use $E$ to determine if each element of $\theta$ should be increased or decreased. In particular, when the absolute value of $E\left[j, k\right]$ is above a pre-specified threshold, we move $\theta\left[j, k\right]$ by one discrete bin in the direction of the sign of $E\left[j, k\right]$. Since there are only 255 unique values for each element in $I_8$, restricting updates to single bins improves stability without compromising the ability for a parameter to reach any other value within relatively few updates. In particular, we have a real-valued hyperparameter $\alpha \in \left(0, 1\right)$ such that the threshold equals

$$\tau = \Phi^{-1}\left(1 - \frac{\alpha}{2}\right)\sigma_E,$$

where $\Phi$ is the normal cumulative distribution function and $\sigma_E$ is the standard deviation of the elements of $E$ under the null hypothesis that fitnesses are uncorrelated with the noise. Note that this threshold can be precalculated on a CPU. We observe that $\alpha$ approximately equals the fraction of parameters that are updated at each step.
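A sketch of the discretised update under our reading of the rule, with the threshold supplied as a precomputed constant (function and variable names are ours):

```python
def eggroll_int_update(theta, A, B, F, threshold):
    """Move each element of theta by one int8 bin in the direction of
    E = sum_i F_i * outer(A_i, B_i) wherever |E| exceeds the threshold."""
    m, n = len(theta), len(theta[0])
    for j in range(m):
        for k in range(n):
            E_jk = sum(Fi * Ai[j] * Bi[k] for Fi, Ai, Bi in zip(F, A, B))
            if abs(E_jk) > threshold:
                step = 1 if E_jk > 0 else -1
                theta[j][k] = max(-127, min(127, theta[j][k] + step))  # single bin, clipped
    return theta
```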

We currently do not incorporate any momentum or other optimiser states, but this remains critical future work to improve the speed of convergence for pure integer training.

Across model sizes and population sizes, we find that setting $\hat{\sigma} = 4$ and letting $\alpha$ decay over training steps gives consistently strong results.

I EGG Ablations

In our main experiments, we use a fixed data batch size of 16 sequences for population sizes 2 and powers of 4 ranging from 4 to $4^{10} = 1048576$. In this section, we vary the batch size by powers of 4, ranging from 4 to $4^{5} = 1024$, while varying the population size by powers of 4 from 16 to 1048576. When the batch size $b$ is greater than half of the population size $N$, we give each antithetical pair $2b/N$ sequences, functionally giving a cleaner fitness signal to each member of the population. This also means that the number of parallel “inferences” required is $\max\left(N, 2b\right)$.

Figure 8: Test loss when varying data batch size and population size; higher population sizes generally lead to lower final loss.

In Fig. 8, we observe that the final test loss for each population size is relatively constant beyond a specific data batch size threshold. At the top right of the figure, we observe a decrease in loss for small population sizes once $b > N/2$, which is an artifact of the increased compute usage necessary to use the full data batch. Ignoring this artifact, the minimum batch size for near-optimal performance at a given population size appears to be $N/4^6$. We see that large population sizes need larger data batches for improved performance: a batch size of 4 results in nearly identical performance for the largest population sizes, but their performance diverges as the data batch size increases.

J Distributed EGGROLL Framework

To facilitate the large-scale experiments, where we scale population sizes beyond 1M, we develop a lightweight distributed training framework designed to minimise network overhead.

J.1 Base-3 Fitness Packing and Bandwidth Efficiency

A key bottleneck in distributed training is the communication of gradients or results. We address this via a custom base-3 packing scheme for fitness vectors. Since workers evaluate perturbations in antithetic pairs, the raw signal is discretised into ternary values $\left\{-1, 0, 1\right\}$. These are mapped to $\left\{0, 1, 2\right\}$ and packed five at a time into a single byte:

$$\mathrm{byte} = \sum_{k=0}^{4} t_k \, 3^k, \qquad t_k \in \left\{0, 1, 2\right\}.$$

This yields an effective bitrate of 1.6 bits per value (near the $\log_2 3 \approx 1.585$ theoretical limit). Consequently, the network payload for a chunk of $n$ fitness values is approximately $\lceil n/5 \rceil$ bytes, rendering bandwidth usage independent of model size.
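A minimal sketch of the packing scheme (function names are ours):

```python
def pack_ternary(values):
    """Pack ternary fitness values {-1, 0, +1} five per byte in base 3."""
    out = bytearray()
    for i in range(0, len(values), 5):
        byte = 0
        for k, v in enumerate(values[i:i + 5]):
            byte += (v + 1) * 3 ** k      # map {-1,0,1} -> {0,1,2}, base-3 digit k
        out.append(byte)                  # max value 2*(1+3+9+27+81) = 242 < 256
    return bytes(out)

def unpack_ternary(packed, n):
    """Inverse of pack_ternary for n original values."""
    values = []
    for byte in packed:
        for _ in range(5):
            values.append(byte % 3 - 1)
            byte //= 3
    return values[:n]
```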

J.2 System Architecture

The system employs a Coordinator-Worker topology. The Coordinator maintains the global state and assigns population chunks to Workers. Workers calculate fitness on GPU, apply signal shaping (chunk mean filtering, adaptive thresholding), and return only the packed ternary fitness, minimising traffic significantly compared to standard gradient transmission.

K Fine-tuning of Integer Quantised Models

K.1 Quantisation Procedure

To maximise population throughput and reduce device memory during EGGROLL fine-tuning, we represent the large matrix-multiplication parameters of RWKV in an int8 weight format while keeping non-matmul parameters (e.g., small biases and bookkeeping tensors) in floating point (bf16). Following Jacob et al. (2017), for each weight matrix $W \in \mathbb{R}^{d_{\text{in}} \times d_{\text{out}}}$, we use symmetric per-channel int8 quantisation with an absmax scale. For each output channel $i$ we first compute:

$$s_i = \frac{\max_j \left|W\left[j, i\right]\right|}{127} + \epsilon,$$

where $\epsilon$ is some small scalar. Then, we store each $s_i$ in bf16, and quantise weights as

$$Q\left[j, i\right] = I_8\left(\mathrm{round}\left(\frac{W\left[j, i\right]}{s_i}\right)\right).$$

Every matrix parameter is stored as a dictionary containing the quantised weight matrix $Q$, the per-channel scale parameters $s_i$, and an input scale factor $s_x$, all in bf16 precision. At runtime, the forward pass is computed by scaling the input vector by $s_x$ and the columns of the quantised matrix by the per-channel scales:

$$\left(uW\right)\left[i\right] \approx s_x \, s_i \sum_j I_8\left(\mathrm{round}\left(\frac{u\left[j\right]}{s_x}\right)\right) Q\left[j, i\right].$$
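The per-channel absmax quantisation step can be sketched as follows (a minimal reconstruction; function names are ours):

```python
def quantise_per_channel(W, eps=1e-8):
    """Symmetric per-output-channel int8 quantisation with absmax scales.
    W is a d_in x d_out matrix (list of rows); returns (Q, scales)."""
    d_in, d_out = len(W), len(W[0])
    scales = [max(abs(W[j][i]) for j in range(d_in)) / 127 + eps
              for i in range(d_out)]
    Q = [[max(-127, min(127, round(W[j][i] / scales[i]))) for i in range(d_out)]
         for j in range(d_in)]
    return Q, scales

def dequantise(Q, scales):
    """Approximate reconstruction W[j][i] ~= Q[j][i] * scales[i]."""
    return [[q * s for q, s in zip(row, scales)] for row in Q]
```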

K.2 Integrating integer-quantised EGGROLL with Adam

EGGROLL performs black-box (ES) optimisation directly over the parameter representation used in the forward pass, including integer-quantised weights. We integrate this with the Adam optimiser (Kingma & Ba, 2014) by maintaining Adam’s moment estimates in bf16, while enforcing that all quantised tensors remain on the int8 lattice.

ES gradients. EGGROLL estimates gradients via antithetic ES perturbations and score-weighted averaging. This yields a bf16 gradient estimate for: (i) floating-point parameters (when present), (ii) quantised matrix parameters via a low-rank perturbation pathway, and (iii) scale parameters $s_i$ for $i = 1, \dots, d_{\text{out}}$ and $s_x$ via explicit scale perturbations. We then pass these gradients to Adam (Optax), which produces an update tensor $u$ for each parameter leaf.

Adam updates for int8 tensors (discretised). For integer parameters (notably int8), Adam produces a real-valued proposal $u$ (stored in bf16). Since the parameter itself must remain int8, we convert this proposal into a sparse unit-step update using a normalised thresholding rule. Let $Q$ be an int8 tensor and $u$ be Adam’s proposed update. We compute a per-tensor z-score normalisation

$$z = \frac{u - \mathrm{mean}\left(u\right)}{\mathrm{std}\left(u\right)},$$

then apply a threshold $\tau$ to form the integer step

$$\Delta = \mathrm{sign}\left(z\right) \odot \mathbb{1}\left[\left|z\right| > \tau\right].$$

Finally we update by unit increments and clip to the valid int8 range:

$$Q \leftarrow \mathrm{clip}\left(Q + \Delta, -127, 127\right).$$

Intuitively, Adam supplies a magnitude- and history-aware proposal, while the discretisation enforces the integer constraint and yields a stable, sparse update pattern (only entries with sufficiently large normalised updates are modified).
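The thresholding rule above can be sketched in a few lines (our reconstruction; names are ours):

```python
import math

def discretise_update(Q, u, tau):
    """Convert a real-valued update proposal u into sparse unit steps on an
    int8 tensor Q: entries whose per-tensor z-score exceeds tau move by one."""
    n = len(u)
    mean = sum(u) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in u) / n) or 1.0  # guard, our addition
    out = []
    for q, x in zip(Q, u):
        z = (x - mean) / std
        step = (1 if z > 0 else -1) if abs(z) > tau else 0
        out.append(max(-127, min(127, q + step)))  # unit increment, clipped to int8
    return out
```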

Memory considerations. We store Adam’s optimiser state (moments) in bf16 for all array-valued leaves to reduce memory footprint, while keeping scalar bookkeeping in full precision. This keeps the dominant memory cost of optimisation close to that of the parameters themselves, which is particularly important when fine-tuning large models with large ES populations.

Model distillation. We distil a non-quantised model into the quantised RWKV-7 model by matching the two distributions on teacher-forced examples from GSM8K. More specifically, the fitness for a given set of parameters, $\mu_i$, is computed as follows:

$$F\left(\mu_i\right) = \sum_{t=1}^{T} D_{\mathrm{KL}}\left(p_t \,\|\, q_t\right),$$

where $x_{1:T}$ is a subsequence of tokens taken from the solutions of GSM8K and $D_{\mathrm{KL}}\left(p_t \,\|\, q_t\right)$ is the Kullback-Leibler divergence between the distribution of the non-quantised model, $p_t$, and the distribution of the quantised model, $q_t$, over the vocabulary at token $t$.

L Fine-tuning Pretrained Transformer LLMs with Verifiable Rewards

This section compares EGGROLL to standard RL from Verifiable Rewards (RLVR). We first describe our experimental results, before including details of the infrastructure used to run these experiments.

L.1 Results

Here we demonstrate that EGGROLL can be used to fine-tune pre-trained LLMs on verifiable rewards. We use the vLLM library (Kwon et al., 2023) for efficient inference. More infrastructure detail is given in Section L.2.

We first fine-tune the Qwen3-4B-Base model (Yang et al., 2025) on DeepScaleR (Agentica Organization et al., 2025), a dataset of 40k maths questions. As in standard RLVR, the model generates a chain-of-thought (CoT) followed by a final answer. Fitness is then simply calculated by extracting the final answer and comparing it to the ground-truth answer (Shao et al., 2025). We evaluate performance on MATH500 (Hendrycks et al., 2021), OlympiadBench (He et al., 2024), AIME24 (Balunović et al., 2026), AMC, and MinervaMath (Lewkowycz et al., 2022). Training curves are shown in Figure 9. Here we see that fine-tuning with EGGROLL significantly improves performance over the base model. In Table 1 we show final accuracies with EGGROLL and with the equivalent RL experiment. The RL values are taken from Liu et al. (2025), and we match all the relevant shared hyperparameters and setup, such as maximum response length and prompt phrasing. We see that EGGROLL is able to match the RL optimisation with very minimal hyperparameter tuning, a LoRA rank of 1, and a moderately small population size of 2048. Full hyperparameter details are given in Table 3.

Figure 9: Training curves for fine-tuning Qwen3-4B-Base on the DeepScaleR math dataset. Similar to RL from Verifiable Rewards (RLVR), we see that optimising with EGGROLL is able to improve chain-of-thought reasoning performance on a range of math benchmarks.

                 MATH500   OlympiadBench   AIME24   AMC    MinervaMath   Average
Qwen3-4B-Base    50.2      24.4            10.0     33.7   21.7          28.0
+EGGROLL         75.8      37.3            13.3     49.4   31.3          41.4
+RL              67.4      33.5            16.7     49.4   40.1          41.4

Table 1: Final test accuracies when training on the DeepScaleR dataset to optimise verifiable rewards with EGGROLL and RL. We see that EGGROLL significantly boosts performance from the base model and is able to match the equivalent RL experiment.

Since EGGROLL can be used to optimise non-differentiable objectives, we next try optimising for pass@k. While zero-shot accuracy (pass@1) is differentiable, the pass@k objective is not, as it depends on multiple samples from the model. This means it cannot be optimised easily with RL. In Figure 10 we fine-tune the Qwen3-1.7B model on the DeepScaleR dataset with a population size of 256, LoRA rank 1, and $k = 4$. We see that EGGROLL successfully optimises both the pass@1 (differentiable) and pass@k (non-differentiable) objectives. In Figure 10 (right) we plot the number of distinct answers in 4 samples from the model. We see that when optimising for pass@k the answer diversity sampled by the model increases over training, whereas when optimising for zero-shot accuracy (pass@1) the model collapses towards a single final answer.
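For concreteness, the non-differentiable pass@k fitness for a single prompt can be written as (a sketch; the function name is ours):

```python
def pass_at_k_fitness(answers, ground_truth):
    """1.0 if any of the k sampled final answers matches the ground truth.
    The value depends jointly on all k samples, so there is no per-sample
    differentiable surrogate, which is why standard RLVR cannot target it."""
    return 1.0 if any(a == ground_truth for a in answers) else 0.0
```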

L.2 Training Infrastructure for Large-Scale Transformer LLMs

EGGROLL facilitates the fine-tuning of transformer-based LLMs at scale. We achieve this by repurposing the vLLM inference engine, leveraging its high-throughput kernel implementations and native support for multi-LoRA serving. The system utilises vLLM’s native Tensor Parallelism (TP) to shard the model weights across the GPUs within a node, while cross-node parallelisation is employed for the concurrent evaluation of the LoRA population.

To render ES-based optimisation feasible and efficient across a wide range of model sizes, we implement several critical systems-level optimisations:

Custom WorkerExtension and Sharding-Aware Updates By implementing a custom WorkerExtension, we effectively convert the vLLM inference engine into a training-capable runtime. This extension allows the optimisation logic to reside within the GPU process space, enabling direct,

Figure 10: Using EGGROLL to optimise non-differentiable objectives. Left: Fitness curves comparing training with pass@1 (differentiable) versus pass@k (non-differentiable), where $k = 4$. Right: The mean number of unique final answers generated per 4-sample set. We observe that optimising for pass@k increases answer diversity, whereas optimising for zero-shot accuracy (pass@1) reduces it.

in-place manipulation of the model’s weights. A significant complexity of this integration is vLLM’s internal tensor parallelism, which frequently fuses weights (e.g. combining q_proj, k_proj, and v_proj into a single qkv_proj tensor). Our update mechanism is explicitly “sharding-aware”; it constructs a dictionary which maps individual LoRA updates to the specific fused slices held by each local GPU rank. This ensures that the global ES update is mathematically consistent across all distributed shards.

Layer-wise Memory Management To prevent out-of-memory (OOM) errors during the update phase, the WorkerExtension performs the ES weight application in a streaming, layer-wise fashion. By processing one layer at a time and clearing temporary buffers, the memory overhead of the update remains independent of the total model depth. This allows for the fine-tuning of models of very different sizes with a VRAM footprint barely exceeding that of standard inference.

Direct GPU-to-GPU Weight Synchronization After computing the ES update on the primary rank, we broadcast the updated parameters to all model instances using NCCL via PyNcclCommunicator. This approach bypasses CPU-based communication and instead uses hardware interconnects to transfer weights directly between GPUs, preventing synchronization from becoming a bottleneck when scaling to more nodes.

Meta-Device Blueprinting To initialise models that exceed the physical RAM of the control node, we employ Meta-Device Initialisation. Using accelerate’s init_empty_weights, we instantiate a “meta” version of the model to derive the weight shapes and sharding requirements for the LoRA adapters. This allows the system to generate a complete parameter blueprint for models of arbitrary size without ever allocating the full weight tensors in system memory.

vLLM Engine Settings Throughout the different experiments with vLLM, we use the following engine settings. These generally allow for high throughput across model sizes (e.g. at least 800 tokens/second), but we haven’t performed hyperparameter sweeps, so potentially faster, more memory-efficient settings may be used for improved results.

Parameter                  Value
Tensor parallel size       2, 4
Data type                  auto
Enable prefix caching      True
Enforce eager execution    True
Enable LoRA                True
Max LoRAs
GPU memory utilisation     0.90
Max number of sequences    384
Max model length
Max batched tokens
Load format                auto

Table 2: vLLM engine configuration parameters to allow for high throughput EGGROLL training on large-scale transformer LLMs.

Parameter               Value
Population size         256, 2048
Sigma                   0.001
Learning Rate           0.001
Max Response Length     4096
Temperature             0.0, 0.7
Samples Per Prompt      1, 4
Pass at K               True, False
LoRA Rank               1
LoRA Reuse Steps        4
Table 3: Hyperparameters for the verifiable reward transformer fine-tuning experiments in Section L.1.

M Fine-tuning Time Series Foundation Model: High-Frequency Trading

The preceding experiments demonstrate the effectiveness of EGGROLL on natural language reasoning tasks. We now investigate whether EGGROLL can effectively fine-tune pretrained foundation models on a fundamentally different data modality: structured time series. We focus on high-frequency trading (HFT) for two reasons. First, HFT generates data at an unprecedented scale. The S&P 500 constituents alone produced approximately 3.8 trillion tokens of order flow data between 2016 and 2021, comparable to the largest natural language corpora. Second, the domain presents a well-defined downstream task (order execution) with a natural reward signal: the realised profit and loss, also known as PnL, making it amenable to fine-tuning via evolution strategies.

Order execution takes place in limit order books (LOBs), which are the mechanism upon which modern financial exchanges operate (Gould et al., 2013; Bouchaud et al., 2018). They allow market participants to submit limit orders that specify the details of intended transactions. Specifically, each limit order contains the order type, direction, price, and quantity. The continuous stream of these orders is known as the order flow. LOBs aggregate the limit orders that have not been matched yet. Unlike natural language, where tokens are purely symbolic, order flow messages comprise both categorical values (e.g., order type, direction) and numerical values (e.g., price, quantity) in which magnitude carries semantic meaning. This structure provides a distinct test of EGGROLL’s ability to fine-tune foundation models on sequential time series data.

A central objective in this context is order execution, which consists of buying or selling a specified quantity of an asset within a given time window. The goal is to maximise profit by transacting at favourable prices. In prior reinforcement learning approaches to this problem, the action space is usually simplified (Frey et al., 2023; Mohl et al., 2025; Ning et al., 2021). In contrast, we aim to give the model full flexibility in choosing

Figure 11: Training curves for order execution with EGGROLL. Left: Mean PnL over training epochs for the baseline ($\sigma = 0$, orange dashed) and EGGROLL ($\sigma = 0.01$, blue solid). Right: PnL standard deviation over training epochs. Shaded regions indicate the interquartile range across runs.

limit orders, i.e., to freely choose the order type, direction, price, and quantity. We achieve this by tokenising the limit order book messages and providing the model with a token-level action space.

Foundation models have recently been used to generate synthetic order flow (Nagy et al., 2023; Li et al., 2025) and have been shown to replicate realistic market behaviour through next-token prediction (Nagy et al., 2025). We therefore first pretrain a foundation model on tokenised limit order book messages, and then fine-tune it using EGGROLL for the order execution task. The pretraining follows the approach of Nagy et al. (2023): we employ an S5 model architecture (Smith et al., 2023) that generates next-token probabilities, with cross-entropy as the training loss. The pretraining is conducted on the LOBSTER dataset (Huang & Polak, 2011) for the Google stock (GOOG) in 2022, which contains around 25B tokens.

Subsequently, we fine-tune the model using EGGROLL. The training parameters are summarised in Table 4. The task is to execute a sell order of $Q = 30$ shares within a horizon of $T = 10$ steps. In each episode, the LOB is initialised from a LOB snapshot followed by 10 warm-up background messages. In each step, the population members generate their messages, which are then followed by 50 real background data messages. The orders are executed using the Jax-LOB simulator (Frey et al., 2023). We perform the fine-tuning on a fixed time window for GOOG in January 2023. Following Galim et al. (2025), we apply LoRA with rank 4 on all projection matrices while freezing SSM parameters and layer norms. Performance is evaluated using PnL based on the executed prices and the initial mid-price. Specifically, for a sell task of total quantity $Q$, the PnL is computed as

$$\text{PnL} = \sum_{i=1}^{N} q_i \, p_i - Q \, P_{\text{mid}}^{\text{init}},$$

where $q_i$ and $p_i$ denote the quantity and price of the $i$-th executed trade and $P_{\text{mid}}^{\text{init}}$ is the mid-price at the beginning of the execution window. If the agent does not execute the entire quantity by the end of the episode, an automatic market order is submitted selling the remaining quantity. To improve robustness to outliers, fitness is defined as a rank-based transformation of the PnL. Specifically, for a population of size $M$, the PnL values

$$\mathcal{P} = \{\text{PnL}_1, \dots, \text{PnL}_M\}$$

are mapped to the interval $[-\tfrac{1}{2}, \tfrac{1}{2}]$, where $\operatorname{rank}(\text{PnL}_i) \in \{0, \dots, M-1\}$ denotes the rank of the $i$-th individual's PnL:

$$F_i = \frac{1}{2} - \frac{\operatorname{rank}(\text{PnL}_i)}{M-1}.$$
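The PnL and the rank-based fitness transform can be sketched in a few lines of numpy. One assumption here: we assign rank 0 to the highest PnL so that the best individual receives fitness $+\tfrac{1}{2}$; the paper does not spell out the rank ordering.

```python
import numpy as np

def pnl(quantities, prices, total_q, mid_init):
    """PnL of a sell task: executed value minus the value of selling the
    whole quantity Q at the initial mid-price."""
    return float(np.dot(quantities, prices) - total_q * mid_init)

def rank_fitness(pnls):
    """Map the raw PnL values of a population of size M to [-1/2, 1/2] by
    rank: F_i = 1/2 - rank(PnL_i) / (M - 1), with rank(PnL_i) in {0,...,M-1}.
    (Assumption: rank 0 goes to the largest PnL, so higher PnL => higher F.)"""
    pnls = np.asarray(pnls, dtype=float)
    m = pnls.size
    order = np.argsort(-pnls)           # indices sorted by descending PnL
    ranks = np.empty(m, dtype=float)
    ranks[order] = np.arange(m)         # rank 0 = best, rank M-1 = worst
    return 0.5 - ranks / (m - 1)
```

Because the fitnesses depend only on ranks, a single outlier episode cannot dominate the ES update, and the population fitnesses always sum to zero.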

Training curves over 6,500 epochs are shown in Figure 11. The baseline policy ($\sigma = 0$), corresponding to the pretrained model, achieves a mean PnL of approximately 4,700. In contrast, EGGROLL fine-tuning ($\sigma = 0.01$) improves the mean PnL to around 12,000, corresponding to a roughly 155% improvement over the baseline. The right panel of Figure 11 depicts the PnL standard deviation during fine-tuning: it initially increases to around 3,100 during the first 2,500 epochs, corresponding to an exploration phase in which the population tries out diverse strategies, before decreasing to approximately 400 by the end of training, indicating that the population concentrates around a high-performing policy.

Hyperparameter | Value
Model | LOBS5-360M
Parallel generations per GPU | 2,048
Total parallel generations | 65,536
LoRA rank | 4
Sigma | 0.01
Learning rate | 0.001
Epochs | 6,500

Table 4: Model and EGGROLL fine-tuning settings for high-frequency trading.

Figure 12: Training curves and wall clock times for three cooperative Multi Particle Environments, comparing IPPO, OpenES, and EGGROLL performance and efficiency. Hyperparameter optimisation yielded equal batch sizes for all algorithms on the same environment. All EGGROLL runs used rank 1 perturbations. Shaded regions are standard errors of mean values.

N Experimental Details

N.1 Multi Agent Reinforcement Learning Experiments

Table 5: Hyperparameter Ranges Used in MPE Sweeps for EGGROLL and OpenES

Hyperparameter | Values
activation | pqn, tanh
pop_size | 128, 512, 1024, 2048, 4096
learning_rate | 0.01, 0.05, 0.1, 0.5
lr_decay | 0.3, 0.7, 1.0
sigma | 0.1, 0.2, 0.3, 0.4, 0.5
rank_transform | true, false

Table 6: Hyperparameter Ranges Used in MPE Sweeps for IPPO

Hyperparameter | Values
activation | relu, tanh
pop_size | 128, 512, 1024, 2048, 4096
learning_rate | 5e-5, 1e-4, 2.5e-4, 1e-3
entropy_coef | 0.001, 0.005, 0.01

Table 7: MPE Simple Spread v3

Hyperparameter | eggroll | open_es | ippo
activation | tanh | tanh | tanh
deterministic_policy | true | true | false
learning_rate | 0.01 | 0.01 | 0.001
lr_decay | 0.7 | 0.7 | linear
layer_size | 64 | 64 | 64
n_layers | 3 | 3 | 3
pop_size | 128 | 128 | 128
optimizer | adamw | adamw | adam
rank | 1 | 1 | -
rank_transform | false | false | -
sigma | 0.5 | 0.5 | -
n_minibatches | - | - | 4
update_epochs | - | - | 4
gamma | - | - | 0.99
gae_lambda | - | - | 0.95
epsilon_clip | - | - | 0.2
entropy_coef | - | - | 0.01
value_coef | - | - | 0.5
max_grad_norm | - | - | 0.5

Table 8: MPE Simple Speaker Listener v4

Hyperparameter | eggroll | open_es | ippo
activation | tanh | tanh | relu
deterministic_policy | true | true | false
learning_rate | 0.01 | 0.01 | 0.001
lr_decay | 0.7 | 0.3 | linear
layer_size | 64 | 64 | 64
n_layers | 3 | 3 | 64
pop_size | 512 | 512 | 512
optimizer | adamw | adamw | adam
rank | 1 | 1 | -
rank_transform | true | true | -
sigma | 0.5 | 0.5 | -
n_minibatches | - | - | 4
update_epochs | - | - | 4
gamma | - | - | 0.99
gae_lambda | - | - | 0.95
epsilon_clip | - | - | 0.2
entropy_coef | - | - | 0.005
value_coef | - | - | 0.5
max_grad_norm | - | - | 0.5

Table 9: MPE Simple Reference v3

Hyperparameter | eggroll | open_es | ippo
activation | pqn | tanh | relu
deterministic_policy | true | true | false
learning_rate | 0.01 | 0.01 | 0.001
lr_decay | 0.3 | 0.3 | linear
layer_size | 64 | 64 | 64
n_layers | 3 | 3 | 3
pop_size | 4096 | 4096 | 4096
optimizer | adamw | adamw | adam
rank | 1 | 1 | -
rank_transform | false | true | -
sigma | 0.1 | 0.3 | -
n_minibatches | - | - | 4
update_epochs | - | - | 4
gamma | - | - | 0.99
gae_lambda | - | - | 0.95
epsilon_clip | - | - | 0.2
entropy_coef | - | - | 0.01
value_coef | - | - | 0.5
max_grad_norm | - | - | 0.5

We train on three cooperative Multi Particle Environments (MPEs) (Lowe et al., 2017) implemented in JaxMARL (Rutherford et al., 2024) with feed-forward networks of width 64 and depth 3, performing Bayesian hyperparameter optimisation for each environment and algorithm. All runs were executed on NVIDIA A100-SXM4-40GB GPUs. We find that the optimal batch size is consistent across algorithms on the same environment. Figure 12 shows that EGGROLL with rank 1 trains up to 2.4 times faster than OpenES for large batch sizes while remaining competitive in performance.

Figure 13: (a) Comparison of our finetuned RWKV 7g 7-billion-parameter model, trained on the countdown task using 8 GPUs, with the results reported by Qiu et al. (2025) on similarly sized Qwen models. (b) Performance of our finetuned RWKV 7g 14-billion-parameter model on hard reasoning tasks (AIME and HMMT) using 32 GPUs for 12 hours. The model was trained on the DeepScaleR dataset and the best checkpoint was chosen by evaluating on AIME24. Due to the size of the model we were not able to run similar baseline experiments using GRPO.

N.2 Reasoning Fine-tuning Experiments: Countdown

We ran a Bayesian hyperparameter sweep (Snoek et al., 2012) for both GRPO and EGGROLL and used the best set found to run the experiments in Figure 4b. For GRPO we swept over the sampling temperature and learning rate, whereas for EGGROLL we swept over the standard deviation of the ES sampling, $\sigma$, and the learning rate scale. The best hyperparameters found are detailed in Tables 10 (EGGROLL) and 11 (GRPO). All of the experiments run in 8 hours on an NVIDIA H200 GPU.

Hyperparameter | Value
Model | RWKV 7g1.5B
Optimiser | Gradient descent
ES standard deviation |
Rank | 1
Learning-rate scale | 0.125
Population size | 256
Parallel generations per GPU | 1536
Prompts per epoch | 6
Generation / thinking length | 1000 tokens
Train / val temperature | 0 / 0
Parallel validations | 128

Table 10: Key hyperparameters for EGGROLL training on Countdown with FastRWKV-7g1.5B.

We also run an experiment where we increase the number of GPUs to 8 and use a bigger model, RWKV 7g7B, on the Countdown task, allowing for stronger final performance. Notably, we compare to the results reported by Qiu et al. (2025) on Countdown. Figure 13a shows that, starting from our significantly weaker base model (RWKV 7g7B vs. Qwen 2.5-7B), we are able to train to a higher validation accuracy (72.9%) than the ones reported for training with GRPO (52.8%) and OpenES (66.8%). Qiu et al. (2025) do not report the wall clock time or the hardware used for their experiments, which makes it difficult to establish a fair comparison.

Hyperparameter | Value
Model | RWKV 7g1.5B
Optimiser | Radam
Learning rate |
Generations per prompt | 8
Parallel generations per GPU | 64
Prompts per epoch | 8
Generation length | 1000 tokens
Number of minibatches | 4
PPO clip parameter | 0.2
Train / val temperature | 1 / 0
Parallel validations | 128

Table 11: Key hyperparameters for GRPO training on Countdown with AssociativeScanRWKV-7g1.5B.

N.3 Reasoning Fine-tuning Experiments: GSM8K

We used the hyperparameters found for Countdown as a starting point and reduced the learning rates for both GRPO and EGGROLL using linear search until we found the best-performing one on the validation set. Our experiments for GSM8K run on 8 NVIDIA H200 GPUs for 8 hours each. We also increase the ES standard deviation $\sigma$ (from $7 \times 10^{-4}$ to $2 \times 10^{-3}$), as the significantly bigger population sizes (8096 vs. 512) allow for much more stable training and aggressive exploration.

Hyperparameter | Value
Model | RWKV 7g7B
ES standard deviation |
Rank | 1
Learning-rate scale | 0.06
Generations per prompt | 512
Parallel generations per GPU | 1024
Total parallel generations | 8192
Prompts per epoch | 16
Generation length | 1000 tokens
Noise reuse factor | 1
Freeze non-LoRA params | True
Train / val temperature | 0 / 0
Parallel validations | 128

Table 12: Key hyperparameters for multi-GPU EGGROLL training on GSM8K with FastRWKV-7g7B.

N.4 Reinforcement Learning Experiments

Next, we compare the performance of EGGROLL against standard OpenES as implemented in Salimans et al. (2017) on reinforcement learning tasks. Given the small network sizes, we can use OpenES at this scale, but as network sizes increase, the use of vanilla OpenES becomes computationally infeasible. We use the standard formulation of simply optimising for the final return in the environment. For both EGGROLL and OpenES, we perform hyperparameter optimisation (HPO) separately for each environment. For each algorithm–environment pair, we define plausible ranges for all key hyperparameters based on prior work and preliminary experiments. We then perform 20 random search trials, where each trial corresponds to a single training run with a randomly sampled hyperparameter configuration. Each configuration is evaluated based on the final return achieved by the mean policy parameters at the end of training. After all trials, we select the configuration that yields the highest final return. Using this best configuration, we then run 10 independent seeds to evaluate performance and report the mean and standard error of the mean across these seeds.
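The HPO procedure above can be sketched as a short random-search loop. This is a minimal stand-in, assuming `train_fn(config, seed)` abstracts one full training run returning its final return; the names are illustrative, not from our codebase.

```python
import random

def random_search_hpo(train_fn, space, n_trials=20, n_eval_seeds=10, seed=0):
    """Random-search HPO: sample n_trials configs from the per-hyperparameter
    value lists in `space`, train once per config, keep the config with the
    highest final return, then rerun it over n_eval_seeds independent seeds
    and report the mean return and its standard error."""
    rng = random.Random(seed)
    trials = []
    for t in range(n_trials):
        config = {name: rng.choice(values) for name, values in space.items()}
        trials.append((train_fn(config, seed=t), config))
    _, best = max(trials, key=lambda x: x[0])   # highest final return wins
    returns = [train_fn(best, seed=s) for s in range(n_eval_seeds)]
    mean = sum(returns) / len(returns)
    sem = (sum((r - mean) ** 2 for r in returns)
           / (len(returns) - 1)) ** 0.5 / len(returns) ** 0.5
    return best, mean, sem
```

Selecting on a single training run per configuration keeps the search cheap; the 10-seed rerun of the winner is what produces the error bars reported in the figures.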

Figure 14: Comparison of reinforcement learning results across environments including CartPole, Pendulum, Brax Ant, Brax Humanoid, Craftax, Jumanji, Kinetix, and Navix: mean returns for each environment and algorithm (EGGROLL, OpenES, and PPO) across 10 random seeds. The returns are evaluated using the mean of the parameters. HPO was conducted for each algorithm/environment pair. The shaded region is the standard error of the mean.

Hyperparameter | Value
Model | RWKV 7g7B
Learning rate |
Generations per prompt | 8
Parallel generations per GPU | 32
Total parallel generations | 256
Prompts per epoch | 32
Generation length | 1000 tokens
Number of minibatches | 16
Number of workers (processes) | 8
PPO clip parameter | 0.2
Train / val temperature | 1 / 0
Parallel validations | 128

Table 13: Key hyperparameters for multi-GPU GRPO training on GSM8K with AssociativeScanRWKV-7g7B.

Hyperparameter | Value
Model | RWKV 7g7B
Optimiser | EGGROLL (Quantised)
ES standard deviation | 0.4
Rank | 1
Learning-rate scale |
Population size | 8192
Parallel generations per GPU | 256
Prompts per epoch | 1
Generation / thinking length | 256 tokens
Train / val temperature | 0 / 0
Parallel validations | 128

Table 14: Key hyperparameters for quantised EGGROLL training on GSM8K (teacher-forced) with RWKV-7g7B.

We use policy networks with 3 layers of 256 neurons and a range of environments that demonstrate different capabilities. We evaluate across the Navix (Pignatelli et al., 2024), Craftax (Matthews et al., 2024), Brax (Freeman et al., 2021), Kinetix (Matthews et al., 2025), and Jumanji (Bonnet et al., 2024) suites of environments, 16 environments in total. We choose environments that are neither trivial nor impossible for PPO to solve, according to the original papers, and that belong to different categories (e.g., environment size in Kinetix or categories in Jumanji).

We show a subsample of the evaluated environments in Fig. 4a, with the remaining results and hyperparameter details in Appendix N.4. Our findings show that EGGROLL is competitive with OpenES on 7/16 environments, underperforms on 2/16, and outperforms on 7/16. This does not take into account the speed-ups compared to using OpenES with full-rank updates (see Figure 15). We postulate that the reason for this performance increase is that the large networks are difficult for OpenES to optimise and lend themselves well to low-rank updates.
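The contrast between full-rank and low-rank perturbations can be illustrated in a few lines of numpy. This is a sketch of the idea only: the $\sigma/\sqrt{r}$ normalisation and the plain fitness-weighted update below are illustrative assumptions, not the paper's exact EGGROLL update rule.

```python
import numpy as np

def lowrank_perturbation(rng, m, n, r, sigma):
    """Sample a rank-r perturbation (sigma / sqrt(r)) * A @ B.T with Gaussian
    factors A (m x r) and B (n x r). Storing the factors instead of a full
    m x n noise matrix is what makes the batched matmuls cheap; the 1/sqrt(r)
    scaling (an assumption) keeps the per-entry variance at sigma**2."""
    A = rng.standard_normal((m, r))
    B = rng.standard_normal((n, r))
    return sigma / np.sqrt(r) * A @ B.T

def es_update(W, fitnesses, perturbations, lr, sigma):
    """Vanilla ES step: weight each member's perturbation by its fitness."""
    g = sum(f * E for f, E in zip(fitnesses, perturbations)) / (len(fitnesses) * sigma)
    return W + lr * g
```

Each population member only needs its two small factors `A` and `B`, so the memory and bandwidth cost per member is `r * (m + n)` instead of `m * n`.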

We present here the hyperparameter ranges we used for hyperparameter optimisation, as well as all hyperparameter settings for all the experiments. All RL experiments were run on an NVIDIA L40S GPU. For PPO, we use the same methodology to tune the hyperparameters as we did for OpenES and EGGROLL, as described in Section 6.2. We report the ranges and the final hyperparameters here. We train PPO agents using Rejax (Liesen et al., 2024). We use the activation function from Gallici et al. (2025) in our experiments, which we refer to as the "pqn" activation function in our hyperparameter tables.

Figure 15: Comparison of reinforcement learning results: mean and standard deviation of training time for EGGROLL and OpenES across environments. Note that some of the timing difference is due to differences in episode lengths, which is why the total time for EGGROLL sometimes appears longer than OpenES despite EGGROLL being faster on a per-timestep basis.

Table 15: Hyperparameter Ranges for EGGROLL and OpenES

Hyperparameter | Values
pop_size | 512, 1024, 2048, 4096
n_parallel_evaluations | 1, 4, 8
rank | 1, 2, 4
optimizer | adamw, sgd, adam
learning_rate | 1e-3, 1e-2, 1e-1
lr_decay | 0.995, 0.999, 0.9995, 1.0
sigma | 0.05, 0.2, 0.5
sigma_decay | 0.995, 0.999, 0.9995, 1.0
rank_transform | true, false
deterministic_policy | true, false
Table 16: Hyperparameter Ranges for PPO
Hyperparameter | Values
clip_eps | 0.1, 0.2, 0.3
ent_coef | 0, 0.0001, 0.001
gae_lambda | 0.9, 0.95, 0.98
gamma | 0.95, 0.99, 0.995, 0.999
learning_rate | 0.0001, 0.0003, 0.001
max_grad_norm | 0.5, 1, 2
layer_size | 256
n_layers | 3
normalize_observations | true
normalize_rewards | false
num_envs | 64, 128, 256
num_epochs | 4, 8, 16
num_minibatches | 16, 32, 64
num_steps | 64, 128, 256
reward_normalization_discount | 0.99
skip_initial_evaluation | false
vf_coef | 0.5, 0.75, 1

Table 17: CartPole-v1

Hyperparameter | eggroll | open_es
activation | pqn | pqn
deterministic_policy | false | true
learning_rate | 0.1 | 0.1
lr_decay | 0.9995 | 0.9995
layer_size | 256 | 256
n_layers | 3 | 3
n_parallel_evaluations | 1 | 4
pop_size | 2048 | 512
optimizer | sgd | adamw
rank | 4 | /
rank_transform | false | true
sigma | 0.2 | 0.5
sigma_decay | 0.999 | 0.9995

Table 18: Pendulum-v1

Hyperparameter | eggroll | open_es
activation | pqn | pqn
deterministic_policy | false | true
learning_rate | 0.01 | 0.01
lr_decay | 0.995 | 0.995
layer_size | 256 | 256
n_layers | 3 | 3
n_parallel_evaluations | 1 | 4
pop_size | 4096 | 4096
optimizer | adam | adamw
rank | 4 | /
rank_transform | false | false
sigma | 0.05 | 0.05
sigma_decay | 0.995 | 1

Table 19: brax/ant

Hyperparameter | eggroll | open_es
activation | pqn | pqn
deterministic_policy | false | false
learning_rate | 0.01 | 0.1
lr_decay | 0.9995 | 0.995
layer_size | 256 | 256
n_layers | 3 | 3
n_parallel_evaluations | 1 | 8
pop_size | 2048 | 512
optimizer | adam | adam
rank | 1 | /
rank_transform | false | false
sigma | 0.05 | 0.05
sigma_decay | 0.9995 | 0.9995

Table 20: brax/humanoid

Hyperparameter | eggroll | open_es
activation | pqn | pqn
deterministic_policy | true | false
learning_rate | 0.1 | 0.1
lr_decay | 1 | 0.995
layer_size | 256 | 256
n_layers | 3 | 3
n_parallel_evaluations | 8 | 8
pop_size | 4096 | 1024
optimizer | adam | sgd
rank | 1 | /
rank_transform | true | true
sigma | 0.2 | 0.2
sigma_decay | 0.9995 | 0.995
Tables 16 through 20 provide further hyperparameter details: Table 16 lists the hyperparameter ranges for PPO, and Tables 17 through 20 give the tuned settings for EGGROLL and OpenES on CartPole-v1, Pendulum-v1, brax/ant, and brax/humanoid.

Table 21: brax/inverted_double_pendulum

Hyperparameter | eggroll | open_es
activation | pqn | pqn
deterministic_policy | true | true
learning_rate | 0.1 | 0.1
lr_decay | 1 | 0.995
layer_size | 256 | 256
n_layers | 3 | 3
n_parallel_evaluations | 1 | 1
pop_size | 2048 | 4096
optimizer | adam | adam
rank | 2 | /
rank_transform | true | true
sigma | 0.5 | 0.05
sigma_decay | 0.995 | 1

Table 22: craftax/Craftax-Classic-Symbolic-AutoReset-v1

Hyperparameter | eggroll | open_es
activation | pqn | pqn
deterministic_policy | false | false
learning_rate | 0.01 | 0.001
lr_decay | 0.995 | 0.995
layer_size | 256 | 256
n_layers | 3 | 3
n_parallel_evaluations | 4 | 8
pop_size | 2048 | 4096
optimizer | sgd | adamw
rank | 1 | /
rank_transform | false | false
sigma | 0.05 | 0.05
sigma_decay | 1 | 0.995

Table 23: craftax/Craftax-Symbolic-AutoReset-v1

Hyperparameter | eggroll | open_es
activation | pqn | pqn
deterministic_policy | false | false
learning_rate | 0.01 | 0.1
lr_decay | 0.999 | 0.995
layer_size | 256 | 256
n_layers | 3 | 3
n_parallel_evaluations | 1 | 4
pop_size | 512 | 1024
optimizer | sgd | adam
rank | 4 | /
rank_transform | true | false
sigma | 0.05 | 0.5
sigma_decay | 0.999 | 1

Table 24: jumanji/Game2048-v1

Hyperparameter | eggroll | open_es
activation | pqn | pqn
deterministic_policy | false | true
learning_rate | 0.1 | 0.01
lr_decay | 1 | 0.999
layer_size | 256 | 256
n_layers | 3 | 3
n_parallel_evaluations | 4 | 4
pop_size | 1024 | 1024
optimizer | adamw | adamw
rank | 1 | /
rank_transform | false | true
sigma | 0.5 | 0.05
sigma_decay | 0.9995 | 0.9995

Table 25: jumanji/Knapsack-v1

Hyperparameter | eggroll | open_es
activation | pqn | pqn
deterministic_policy | false | false
learning_rate | 0.1 | 0.01
lr_decay | 0.999 | 1
layer_size | 256 | 256
n_layers | 3 | 3
n_parallel_evaluations | 4 | 1
pop_size | 1024 | 2048
optimizer | sgd | adamw
rank | 4 | /
rank_transform | true | true
sigma | 0.05 | 0.5
sigma_decay | 1 | 0.995

Table 26: jumanji/Snake-v1

Hyperparameter | eggroll | open_es
activation | pqn | pqn
deterministic_policy | false | false
learning_rate | 0.001 | 0.001
lr_decay | 0.9995 | 1
layer_size | 256 | 256
n_layers | 3 | 3
n_parallel_evaluations | 8 | 1
pop_size | 4096 | 2048
optimizer | adam | sgd
rank | 1 | /
rank_transform | true | false
sigma | 0.05 | 0.2
sigma_decay | 0.9995 | 1

Table 27: kinetix/l/hard_pinball

Hyperparameter | eggroll | open_es
activation | pqn | pqn
deterministic_policy | true | true
learning_rate | 0.01 | 0.01
lr_decay | 0.995 | 1
layer_size | 256 | 256
n_layers | 3 | 3
n_parallel_evaluations | 8 | 1
pop_size | 2048 | 512
optimizer | sgd | sgd
rank | 4 | /
rank_transform | true | true
sigma | 0.05 | 0.5
sigma_decay | 0.999 | 0.9995

Table 28: kinetix/m/h17_thrustcontrol_left

Hyperparameter | eggroll | open_es
activation | pqn | pqn
deterministic_policy | false | false
learning_rate | 0.1 | 0.001
lr_decay | 0.9995 | 1
layer_size | 256 | 256
n_layers | 3 | 3
n_parallel_evaluations | 4 | 1
pop_size | 512 | 1024
optimizer | sgd | adam
rank | 4 | /
rank_transform | true | true
sigma | 0.5 | 0.5
sigma_decay | 1 | 0.999

Table 29: kinetix/s/h1_thrust_over_ball

Hyperparameter | eggroll | open_es
activation | pqn | pqn
deterministic_policy | false | false
learning_rate | 0.1 | 0.01
lr_decay | 0.995 | 0.995
layer_size | 256 | 256
n_layers | 3 | 3
n_parallel_evaluations | 1 | 1
pop_size | 512 | 2048
optimizer | adamw | sgd
rank | 1 | /
rank_transform | true | true
sigma | 0.5 | 0.05
sigma_decay | 0.9995 | 1

Table 30: navix/Navix-DoorKey-8x8-v0

Hyperparameter | eggroll | open_es
activation | pqn | pqn
deterministic_policy | false | false
learning_rate | 0.01 | 0.01
lr_decay | 0.9995 | 1
layer_size | 256 | 256
n_layers | 3 | 3
n_parallel_evaluations | 1 | 8
pop_size | 1024 | 2048
optimizer | adamw | adam
rank | 1 | /
rank_transform | false | true
sigma | 0.05 | 0.05
sigma_decay | 1 | 1

Table 31: navix/Navix-Dynamic-Obstacles-6x6-Random-v0

Hyperparameter | eggroll | open_es
activation | pqn | pqn
deterministic_policy | false | false
learning_rate | 0.01 | 0.01
lr_decay | 0.999 | 1
layer_size | 256 | 256
n_layers | 3 | 3
n_parallel_evaluations | 4 | 1
pop_size | 512 | 4096
optimizer | adam | adam
rank | 2 | /
rank_transform | false | false
sigma | 0.05 | 0.2
sigma_decay | 1 | 0.995

Table 32: navix/Navix-FourRooms-v0

Hyperparameter | eggroll | open_es
activation | pqn | pqn
deterministic_policy | false | false
learning_rate | 0.01 | 0.001
lr_decay | 0.999 | 0.9995
layer_size | 256 | 256
n_layers | 3 | 3
n_parallel_evaluations | 4 | 4
pop_size | 2048 | 2048
optimizer | sgd | adam
rank | 4 | /
rank_transform | true | false
sigma | 0.05 | 0.05
sigma_decay | 0.9995 | 0.9995

Table 33: PPO Hyperparameters (Set 1)

Hyperparameter | CartPole | Pendulum | Ant | Humanoid | IDP | CraftaxClassic | CraftaxSymbolic | Game2048
activation | pqn | pqn | pqn | pqn | pqn | pqn | pqn | pqn
clip_eps | 0.2 | 0.1 | 0.2 | 0.3 | 0.1 | 0.2 | 0.2 | 0.3
ent_coef | 0.0001 | 0.001 | 0 | 0.0001 | 0.0001 | 0.0001 | 0 | 0.001
gae_lambda | 0.9 | 0.95 | 0.95 | 0.9 | 0.98 | 0.98 | 0.9 | 0.9
gamma | 0.995 | 0.999 | 0.995 | 0.95 | 0.99 | 0.95 | 0.95 | 0.99
learning_rate | 0.0003 | 0.0003 | 0.0003 | 0.0001 | 0.001 | 0.001 | 0.0003 | 0.0003
max_grad_norm | 0.5 | 1 | 0.5 | 2 | 2 | 2 | 2 | 2
layer_size | 256 | 256 | 256 | 256 | 256 | 256 | 256 | 256
n_layers | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3
normalize_obs | true | true | true | true | true | true | true | true
normalize_rew | false | false | false | false | false | false | false | false
num_envs | 256 | 256 | 64 | 256 | 64 | 128 | 256 | 64
num_epochs | 4 | 16 | 8 | 4 | 4 | 4 | 4 | 8
num_minibatches | 32 | 16 | 32 | 64 | 64 | 32 | 32 | 16
num_steps | 128 | 256 | 128 | 64 | 128 | 128 | 64 | 64
rew_norm_discount | 0.99 | 0.99 | 0.99 | 0.99 | 0.99 | 0.99 | 0.99 | 0.99
skip_initial_eval | false | false | false | false | false | false | false | false
vf_coef | 0.5 | 1 | 1 | 0.75 | 1 | 0.5 | 0.75 | 0.75

Table 34: PPO Hyperparameters (Set 2)

Hyperparameter | Knapsack | Snake | HardPinball | ThrustLeft | ThrustBall | DoorKey | DynamicObs | FourRooms
activation | pqn | pqn | pqn | pqn | pqn | pqn | pqn | pqn
clip_eps | 0.1 | 0.3 | 0.1 | 0.2 | 0.2 | 0.1 | 0.1 | 0.1
ent_coef | 0.0001 | 0.001 | 0.0001 | 0.0001 | 0.0001 | 0.0001 | 0.001 | 0.001
gae_lambda | 0.9 | 0.95 | 0.9 | 0.9 | 0.95 | 0.98 | 0.98 | 0.9
gamma | 0.99 | 0.999 | 0.99 | 0.995 | 0.999 | 0.95 | 0.999 | 0.99
learning_rate | 0.0001 | 0.0001 | 0.0001 | 0.0001 | 0.0001 | 0.0003 | 0.001 | 0.001
max_grad_norm | 0.5 | 0.5 | 1 | 2 | 0.5 | 0.5 | 1 | 1
layer_size | 256 | 256 | 256 | 256 | 256 | 256 | 256 | 256
n_layers | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3
normalize_obs | true | true | true | true | true | true | true | true
normalize_rew | false | false | false | false | false | false | false | false
num_envs | 256 | 128 | 256 | 256 | 64 | 64 | 128 | 256
num_epochs | 4 | 4 | 16 | 16 | 16 | 16 | 4 | 8
num_minibatches | 64 | 16 | 16 | 32 | 16 | 64 | 16 | 32
num_steps | 128 | 128 | 64 | 128 | 64 | 256 | 128 | 256
rew_norm_discount | 0.99 | 0.99 | 0.99 | 0.99 | 0.99 | 0.99 | 0.99 | 0.99
skip_initial_eval | false | false | false | false | false | false | false | false
vf_coef | 0.75 | 0.75 | 0.5 | 0.5 | 0.5 | 0.75 | 0.5 | 0.75