Motivation

There are many problems where backpropagation cannot be used or has limitations, e.g. in RL the credit assignment problem makes gradient estimates difficult: rewards are often delayed, sparse, or noisy, and policies can get trapped in local optima.
In such settings, black-box optimization methods like genetic algorithms or evolution strategies provide a gradient-free alternative.

See also: Evolution Strategies as a Scalable Alternative to Reinforcement Learning

Zero-order optimization

Optimization methods that only require function evaluations, not gradients or higher derivatives.
You may approximate derivatives from function values or not use them at all.

Problem. Minimize $f(x)$ over $x \in \mathbb{R}^d$.
Zero-order oracle. On a query $x_t$, an oracle gives us a single (possibly noisy) function value at $x_t$: $y_t = f(x_t) + \varepsilon_t$.

Algorithm. Choose $x_t$ adaptively from past data $(x_1, y_1), \dots, (x_{t-1}, y_{t-1})$, then output an estimate $\hat{x}_T$.
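A minimal sketch of this query loop in Python. The quadratic oracle and the pure random-search query rule are illustrative assumptions, not part of the definition; any rule that picks $x_t$ from past data fits the template.

```python
import random

# Hypothetical noisy zero-order oracle for f(x) = (x - 2)^2:
# each query returns only a (noisy) function value, never a gradient.
def oracle(x, noise=0.01):
    return (x - 2.0) ** 2 + random.gauss(0.0, noise)

def zero_order_minimize(oracle, n_queries=500, lo=-10.0, hi=10.0):
    """Query the oracle at random points and output the best point seen."""
    best_x, best_y = None, float("inf")
    for _ in range(n_queries):
        x = random.uniform(lo, hi)   # query rule: here, uniform sampling
        y = oracle(x)                # zero-order oracle: value only
        if y < best_y:
            best_x, best_y = x, y
    return best_x

random.seed(0)
x_hat = zero_order_minimize(oracle)  # should land near the minimum at x = 2
```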

Is black-box optimization the same as zero-order optimization?

BBO is a modeling assumption about access to information, not a specific algorithm family.
The box might return just values, or values and gradients, but you can’t exploit an explicit formula.
BBO is about what you know (only queries), gradient-free optimization describes what you use.


That said… they often coincide (compare the list below to the one for black-box), but there are also many hybrids/variants that do mix BBO with gradients.

Consistency and finite-sample convergence rates

With the conditions $\mathbb{E}[\varepsilon_t] = 0$ and $\mathrm{Var}(\varepsilon_t) \le \sigma^2 < \infty$, which state that the noise is unbiased and its variance isn't unbounded, we can prove:
Consistency: The error $\|\hat{x}_T - x^\star\| \to 0$ as our sample size $T \to \infty$
Finite-sample convergence rates: Bounds like $\mathbb{E}\,f(\hat{x}_T) - f(x^\star) = O(1/\sqrt{T})$
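A one-line calculation hints at where the $1/\sqrt{T}$ scaling comes from: averaging $T$ unbiased, independent evaluations at a fixed point $x$ gives

```latex
\bar{y}_T = \frac{1}{T}\sum_{t=1}^{T} y_t, \qquad
\mathbb{E}[\bar{y}_T] = f(x), \qquad
\operatorname{Var}(\bar{y}_T) = \frac{\sigma^2}{T},
```

so the standard deviation of the averaged estimate shrinks like $\sigma/\sqrt{T}$. (The full finite-sample bounds are more involved, but this is the basic mechanism.)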

Eventually remove the claude slop below as I replace it with my own / better notes.

Key Trade-offs

Sample efficiency: Gradient methods use $O(1)$ evaluations per step; zero-order needs $O(d)$ or more. But each evaluation can be simpler (no backprop).
Parallelization: Zero-order methods often evaluate many points independently; gradient methods are inherently sequential.
Convergence: First-order methods converge at known rates for convex problems. Zero-order methods often lack convergence guarantees but can escape local minima.
Memory: No need to store computational graphs or intermediate activations - just current parameters and function values.


Methods that perturb all parameters at once (e.g. SPSA, below) have higher-variance gradient estimates than full finite differences (FDSA), but in high dimensions the massive reduction in evaluations often wins.
Some common methods:

Stochastic Search Methods

Random search: Sample points uniformly or from a distribution, keep the best. Simple but surprisingly effective in high dimensions where grid search fails.
simulated annealing: Probabilistic method that accepts worse solutions with decreasing probability over time, allowing escape from local minima through controlled randomness.
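A compact simulated-annealing sketch. The $1/k$ cooling schedule, Gaussian proposal, and multimodal test function are all illustrative assumptions; the essential mechanism is accepting worse moves with probability $\exp(-\Delta/T)$.

```python
import math
import random

def simulated_annealing(f, x0, n_steps=5000, step=0.5, t0=1.0):
    """Minimize f by accepting worse moves with probability exp(-delta/T),
    where the temperature T decays over time."""
    x, fx = x0, f(x0)
    best_x, best_f = x, fx
    for k in range(1, n_steps + 1):
        temp = t0 / k                        # simple 1/k cooling schedule
        cand = x + random.gauss(0.0, step)   # Gaussian proposal
        fc = f(cand)
        delta = fc - fx
        if delta < 0 or random.random() < math.exp(-delta / temp):
            x, fx = cand, fc                 # accept (possibly worse) move
        if fx < best_f:
            best_x, best_f = x, fx           # track the best point seen
    return best_x

random.seed(1)
# A multimodal test function: quadratic bowl plus sinusoidal ripples.
f = lambda x: x * x + 2.0 * math.sin(5.0 * x) + 2.0
x_hat = simulated_annealing(f, x0=4.0)
```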

Population-Based Methods

evolutionary optimization: Maintain a population of solutions that evolve through selection, mutation, and crossover. Includes genetic algorithms, evolution strategies (ES), and differential evolution.
CMA-ES: Covariance Matrix Adaptation Evolution Strategy - adapts a multivariate normal distribution to efficiently explore the search space by learning correlations between parameters.
NEAT: Evolves both neural network weights and topology, solving the competing conventions problem through historical markings.
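To make the selection/mutation/recombination loop concrete, here is a minimal $(\mu, \lambda)$-style evolution strategy on the sphere function (all hyperparameters and the fixed mutation strength are simplifying assumptions; real ES variants like CMA-ES adapt the sampling distribution):

```python
import random

def evolution_strategy(f, dim=3, pop_size=30, n_elite=6,
                       sigma=0.3, n_gens=200):
    """Minimal ES: sample a population around the current mean (mutation),
    keep the best individuals (selection), average them (recombination)."""
    mean = [random.uniform(-3.0, 3.0) for _ in range(dim)]
    for _ in range(n_gens):
        pop = [[m + random.gauss(0.0, sigma) for m in mean]
               for _ in range(pop_size)]               # mutation
        pop.sort(key=f)                                # selection
        elite = pop[:n_elite]
        mean = [sum(ind[i] for ind in elite) / n_elite # recombination
                for i in range(dim)]
    return mean

random.seed(2)
sphere = lambda x: sum(v * v for v in x)   # minimum at the origin
x_hat = evolution_strategy(sphere)
```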

Gradient Approximation

Finite Difference Stochastic Approximation (FDSA): Approximate each partial derivative separately by perturbing one parameter at a time:

$\hat{g}_i(\theta) = \dfrac{f(\theta + c\,e_i) - f(\theta - c\,e_i)}{2c}$

where $e_i$ is the unit vector in the $i$-th direction. This gives an accurate gradient estimate but requires $2d$ function evaluations for $d$ parameters. The approximation error scales as $O(c^2)$, but making $c$ too small leads to numerical instability due to floating-point precision.
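A direct translation of the central-difference estimator (the test function is an illustrative choice):

```python
def fdsa_gradient(f, theta, c=1e-4):
    """Central-difference gradient estimate: 2*d evaluations for d params."""
    grad = []
    for i in range(len(theta)):
        up = list(theta); up[i] += c          # theta + c * e_i
        dn = list(theta); dn[i] -= c          # theta - c * e_i
        grad.append((f(up) - f(dn)) / (2.0 * c))
    return grad

# Example: f(x, y) = x^2 + 3y, whose true gradient at (1, 2) is (2, 3).
f = lambda t: t[0] ** 2 + 3.0 * t[1]
g = fdsa_gradient(f, [1.0, 2.0])
```

Note the central difference is exact for quadratics (the $O(c^2)$ error term vanishes), which is why this toy example recovers the gradient almost perfectly.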

Simultaneous Perturbation Stochastic Approximation (SPSA): Perturb all parameters simultaneously with a random vector $\Delta$:

$\hat{g}_i(\theta) = \dfrac{f(\theta + c\Delta) - f(\theta - c\Delta)}{2c\,\Delta_i}$

where each component $\Delta_i$ is typically $\pm 1$ with equal probability (a symmetric Bernoulli distribution). This estimates the gradient using only 2 evaluations regardless of dimension! In expectation this gives an unbiased gradient estimate because $\mathbb{E}[\Delta_j / \Delta_i] = 0$ for $j \neq i$ and $\Delta_i / \Delta_i = 1$. FDSA needs $2d$ evaluations; SPSA needs 2. This is essentially what NES does when viewed as gradient estimation.
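A sketch of the SPSA estimator, plus a check that averaging many noisy 2-evaluation estimates recovers the true gradient (the test function is an illustrative assumption):

```python
import random

def spsa_gradient(f, theta, c=1e-2):
    """SPSA gradient estimate: only 2 evaluations regardless of dimension."""
    delta = [random.choice([-1.0, 1.0]) for _ in theta]  # +/-1 perturbation
    plus  = [t + c * d for t, d in zip(theta, delta)]
    minus = [t - c * d for t, d in zip(theta, delta)]
    diff = (f(plus) - f(minus)) / (2.0 * c)
    return [diff / d for d in delta]                     # g_i = diff / delta_i

# Each single estimate is noisy, but the average converges to the gradient.
random.seed(3)
f = lambda t: t[0] ** 2 + 3.0 * t[1]   # true gradient at (1, 2) is (2, 3)
n = 2000
avg = [0.0, 0.0]
for _ in range(n):
    g = spsa_gradient(f, [1.0, 2.0])
    avg = [a + gi / n for a, gi in zip(avg, g)]
```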

Model-Based Methods

bayesian optimization: Build a probabilistic surrogate model (often Gaussian Process) of the objective function, then optimize an acquisition function to decide where to sample next. Particularly effective when evaluations are expensive.
Surrogate optimization: Use simpler approximations (polynomials, radial basis functions) to model the objective locally or globally.

Direct Search Methods

Nelder-Mead simplex: Maintains a simplex (triangle in 2D, tetrahedron in 3D, etc.) of $n+1$ points in $n$ dimensions, iteratively reflecting, expanding, or contracting based on function values. No derivatives needed but can converge to non-stationary points.
Pattern search: Explores along a set of directions (often coordinate axes), reducing step size when no improvement is found. Guaranteed convergence for smooth functions.
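A sketch of the simplest pattern-search variant (compass search along coordinate axes; the halving factor and test function are illustrative assumptions):

```python
def pattern_search(f, x0, step=1.0, tol=1e-6, max_iter=10000):
    """Compass search: poll +/- each coordinate direction; halve the step
    when no polled point improves on the current one."""
    x = list(x0)
    fx = f(x)
    for _ in range(max_iter):
        improved = False
        for i in range(len(x)):
            for s in (step, -step):
                cand = list(x); cand[i] += s
                fc = f(cand)
                if fc < fx:
                    x, fx, improved = cand, fc, True
        if not improved:
            step *= 0.5          # no improvement: refine the mesh
            if step < tol:
                break
    return x

f = lambda t: (t[0] - 1.0) ** 2 + (t[1] + 2.0) ** 2   # minimum at (1, -2)
x_hat = pattern_search(f, [5.0, 5.0])
```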


Many zero-order methods can be viewed as estimating gradients in parameter space

  • ES estimates gradients by perturbing in parameter space
  • Finite differences explicitly approximate derivatives
  • Even random search implicitly follows a noisy gradient

This connects to BLUR and NES, which formulate parameter updates as following distributions rather than explicit gradients.
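The ES-as-gradient-estimation view can be sketched directly: perturb $\theta$ with Gaussian noise $\varepsilon$ and average $\frac{f(\theta + \sigma\varepsilon) - f(\theta - \sigma\varepsilon)}{2\sigma}\,\varepsilon$, which approximates $\nabla f(\theta)$ by a first-order Taylor argument (exactly, for quadratics). The mirrored-sampling form and the test function are illustrative choices.

```python
import random

def es_gradient(f, theta, sigma=0.1, n_samples=4000):
    """Estimate grad f(theta) by Gaussian perturbation in parameter space,
    with antithetic (mirrored) sampling to reduce variance."""
    d = len(theta)
    g = [0.0] * d
    for _ in range(n_samples):
        eps = [random.gauss(0.0, 1.0) for _ in range(d)]
        plus  = [t + sigma * e for t, e in zip(theta, eps)]
        minus = [t - sigma * e for t, e in zip(theta, eps)]
        scale = (f(plus) - f(minus)) / (2.0 * sigma * n_samples)
        g = [gi + scale * e for gi, e in zip(g, eps)]
    return g

random.seed(4)
f = lambda t: t[0] ** 2 + 3.0 * t[1]   # true gradient at (1, 2) is (2, 3)
g = es_gradient(f, [1.0, 2.0])
```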

References

optimization
first order optimization
second order optimization