Motivation

There are many problems where backpropagation cannot be used or has limitations, e.g. in RL the credit assignment problem makes gradient estimates difficult: rewards are often delayed, sparse, or noisy, and policies can get trapped in local optima.
In such settings, black-box optimization methods like genetic algorithms or evolution strategies provide a gradient-free alternative.

See also: Evolution Strategies as a Scalable Alternative to Reinforcement Learning

Zero-order optimization

Optimization methods that only require function evaluations, not gradients or higher derivatives.
You may approximate derivatives from function values or not use them at all.

Problem. Minimize $f(x)$ over $x \in \mathbb{R}^d$.
Zero-order oracle. On a query $x_t$, an oracle gives us a single (possibly noisy) function value at $x_t$: $y_t = f(x_t) + \varepsilon_t$.

Algorithm. Choose $x_t$ adaptively from past data $(x_1, y_1), \dots, (x_{t-1}, y_{t-1})$, then output an estimate $\hat{x}_T$.
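A minimal sketch of this query loop in Python. The quadratic oracle and the pure random-search query rule are illustrative assumptions, not part of the definition; any rule that picks $x_t$ from past data fits the template.

```python
import random

# Hypothetical noisy zero-order oracle for f(x) = (x - 2)^2:
# each query returns only a (noisy) function value, never a gradient.
def oracle(x, noise=0.01):
    return (x - 2.0) ** 2 + random.gauss(0.0, noise)

def zero_order_minimize(oracle, n_queries=500, lo=-10.0, hi=10.0):
    """Query the oracle at random points and output the best point seen."""
    best_x, best_y = None, float("inf")
    for _ in range(n_queries):
        x = random.uniform(lo, hi)   # query rule: here, uniform sampling
        y = oracle(x)                # zero-order oracle: value only
        if y < best_y:
            best_x, best_y = x, y
    return best_x

random.seed(0)
x_hat = zero_order_minimize(oracle)  # should land near the minimum at x = 2
```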

Is black-box optimization the same as zero-order optimization?

BBO is a modeling assumption about access to information, not a specific algorithm family.
The box might return just values, or values and gradients, but you can’t exploit an explicit formula.
BBO is about what you know (only queries), gradient-free optimization describes what you use.


That said… they often coincide (compare the list below to the one for black-box), but there are also many hybrids/variants that do mix BBO with gradients.

Consistency and finite-sample convergence rates

With the conditions $\mathbb{E}[\varepsilon_t] = 0$ and $\mathrm{Var}(\varepsilon_t) \le \sigma^2 < \infty$, which state that the noise is unbiased and its variance isn't unbounded, we can prove:
Consistency: The error $\|\hat{x}_T - x^\star\| \to 0$ as our sample size $T \to \infty$
Finite-sample convergence rates: Bounds like $\mathbb{E}\,f(\hat{x}_T) - f(x^\star) = O(1/\sqrt{T})$
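A one-line calculation hints at where the $1/\sqrt{T}$ scaling comes from: averaging $T$ unbiased, independent evaluations at a fixed point $x$ gives

```latex
\bar{y}_T = \frac{1}{T}\sum_{t=1}^{T} y_t, \qquad
\mathbb{E}[\bar{y}_T] = f(x), \qquad
\operatorname{Var}(\bar{y}_T) = \frac{\sigma^2}{T},
```

so the standard deviation of the averaged estimate shrinks like $\sigma/\sqrt{T}$. (The full finite-sample bounds are more involved, but this is the basic mechanism.)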

Eventually remove the claude slop below as I replace it with my own / better notes.

Key Trade-offs

Sample efficiency: Gradient methods use $O(1)$ evaluations per step; zero-order needs $O(d)$ or more. But each evaluation can be simpler (no backprop).
Parallelization: Zero-order methods often evaluate many points independently; gradient methods are inherently sequential.
Convergence: First-order methods converge at known rates for convex problems. Zero-order methods often lack convergence guarantees but can escape local minima.
Memory: No need to store computational graphs or intermediate activations - just current parameters and function values.


Methods that perturb all parameters at once (e.g. SPSA, below) have higher-variance gradient estimates than full finite differences (FDSA), but in high dimensions the massive reduction in evaluations often wins.
Some common methods:

Stochastic Search Methods

Random search: Sample points uniformly or from a distribution, keep the best. Simple but surprisingly effective in high dimensions where grid search fails.
simulated annealing: Probabilistic method that accepts worse solutions with decreasing probability over time, allowing escape from local minima through controlled randomness.
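A compact simulated-annealing sketch. The $1/k$ cooling schedule, Gaussian proposal, and multimodal test function are all illustrative assumptions; the essential mechanism is accepting worse moves with probability $\exp(-\Delta/T)$.

```python
import math
import random

def simulated_annealing(f, x0, n_steps=5000, step=0.5, t0=1.0):
    """Minimize f by accepting worse moves with probability exp(-delta/T),
    where the temperature T decays over time."""
    x, fx = x0, f(x0)
    best_x, best_f = x, fx
    for k in range(1, n_steps + 1):
        temp = t0 / k                        # simple 1/k cooling schedule
        cand = x + random.gauss(0.0, step)   # Gaussian proposal
        fc = f(cand)
        delta = fc - fx
        if delta < 0 or random.random() < math.exp(-delta / temp):
            x, fx = cand, fc                 # accept (possibly worse) move
        if fx < best_f:
            best_x, best_f = x, fx           # track the best point seen
    return best_x

random.seed(1)
# A multimodal test function: quadratic bowl plus sinusoidal ripples.
f = lambda x: x * x + 2.0 * math.sin(5.0 * x) + 2.0
x_hat = simulated_annealing(f, x0=4.0)
```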

Population-Based Methods

evolutionary optimization: Maintain a population of solutions that evolve through selection, mutation, and crossover. Includes genetic algorithms, evolution strategies (ES), and differential evolution.
CMA-ES: Covariance Matrix Adaptation Evolution Strategy - adapts a multivariate normal distribution to efficiently explore the search space by learning correlations between parameters.
NEAT: Evolves both neural network weights and topology, solving the competing conventions problem through historical markings.
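To make the selection/mutation/recombination loop concrete, here is a minimal $(\mu, \lambda)$-style evolution strategy on the sphere function (all hyperparameters and the fixed mutation strength are simplifying assumptions; real ES variants like CMA-ES adapt the sampling distribution):

```python
import random

def evolution_strategy(f, dim=3, pop_size=30, n_elite=6,
                       sigma=0.3, n_gens=200):
    """Minimal ES: sample a population around the current mean (mutation),
    keep the best individuals (selection), average them (recombination)."""
    mean = [random.uniform(-3.0, 3.0) for _ in range(dim)]
    for _ in range(n_gens):
        pop = [[m + random.gauss(0.0, sigma) for m in mean]
               for _ in range(pop_size)]               # mutation
        pop.sort(key=f)                                # selection
        elite = pop[:n_elite]
        mean = [sum(ind[i] for ind in elite) / n_elite # recombination
                for i in range(dim)]
    return mean

random.seed(2)
sphere = lambda x: sum(v * v for v in x)   # minimum at the origin
x_hat = evolution_strategy(sphere)
```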

Gradient Approximation

Finite Difference Stochastic Approximation (FDSA): Approximate each partial derivative separately by perturbing one parameter at a time:

$\hat{g}_i(\theta) = \dfrac{f(\theta + c\,e_i) - f(\theta - c\,e_i)}{2c}$

where $e_i$ is the unit vector in the $i$-th direction. This gives an accurate gradient estimate but requires $2d$ function evaluations for $d$ parameters. The approximation error scales as $O(c^2)$, but making $c$ too small leads to numerical instability due to floating-point precision.
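A direct translation of the central-difference estimator (the test function is an illustrative choice):

```python
def fdsa_gradient(f, theta, c=1e-4):
    """Central-difference gradient estimate: 2*d evaluations for d params."""
    grad = []
    for i in range(len(theta)):
        up = list(theta); up[i] += c          # theta + c * e_i
        dn = list(theta); dn[i] -= c          # theta - c * e_i
        grad.append((f(up) - f(dn)) / (2.0 * c))
    return grad

# Example: f(x, y) = x^2 + 3y, whose true gradient at (1, 2) is (2, 3).
f = lambda t: t[0] ** 2 + 3.0 * t[1]
g = fdsa_gradient(f, [1.0, 2.0])
```

Note the central difference is exact for quadratics (the $O(c^2)$ error term vanishes), which is why this toy example recovers the gradient almost perfectly.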

Simultaneous Perturbation Stochastic Approximation (SPSA): Perturb all parameters simultaneously with a random vector $\Delta$:

$\hat{g}_i(\theta) = \dfrac{f(\theta + c\Delta) - f(\theta - c\Delta)}{2c\,\Delta_i}$

where each component $\Delta_i$ is typically $\pm 1$ with equal probability (a symmetric Bernoulli distribution). This estimates the gradient using only 2 evaluations regardless of dimension! In expectation this gives an unbiased gradient estimate because $\mathbb{E}[\Delta_j / \Delta_i] = 0$ for $j \neq i$ and $\Delta_i / \Delta_i = 1$. FDSA needs $2d$ evaluations; SPSA needs 2. This is essentially what NES does when viewed as gradient estimation.
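A sketch of the SPSA estimator, plus a check that averaging many noisy 2-evaluation estimates recovers the true gradient (the test function is an illustrative assumption):

```python
import random

def spsa_gradient(f, theta, c=1e-2):
    """SPSA gradient estimate: only 2 evaluations regardless of dimension."""
    delta = [random.choice([-1.0, 1.0]) for _ in theta]  # +/-1 perturbation
    plus  = [t + c * d for t, d in zip(theta, delta)]
    minus = [t - c * d for t, d in zip(theta, delta)]
    diff = (f(plus) - f(minus)) / (2.0 * c)
    return [diff / d for d in delta]                     # g_i = diff / delta_i

# Each single estimate is noisy, but the average converges to the gradient.
random.seed(3)
f = lambda t: t[0] ** 2 + 3.0 * t[1]   # true gradient at (1, 2) is (2, 3)
n = 2000
avg = [0.0, 0.0]
for _ in range(n):
    g = spsa_gradient(f, [1.0, 2.0])
    avg = [a + gi / n for a, gi in zip(avg, g)]
```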

Model-Based Methods

bayesian optimization: Build a probabilistic surrogate model (often Gaussian Process) of the objective function, then optimize an acquisition function to decide where to sample next. Particularly effective when evaluations are expensive.
Surrogate optimization: Use simpler approximations (polynomials, radial basis functions) to model the objective locally or globally.

Direct Search Methods

Nelder-Mead simplex: Maintains a simplex (triangle in 2D, tetrahedron in 3D, etc.) of $n+1$ points in $n$ dimensions, iteratively reflecting, expanding, or contracting based on function values. No derivatives needed but can converge to non-stationary points.
Pattern search: Explores along a set of directions (often coordinate axes), reducing step size when no improvement is found. Guaranteed convergence for smooth functions.
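A sketch of the simplest pattern-search variant (compass search along coordinate axes; the halving factor and test function are illustrative assumptions):

```python
def pattern_search(f, x0, step=1.0, tol=1e-6, max_iter=10000):
    """Compass search: poll +/- each coordinate direction; halve the step
    when no polled point improves on the current one."""
    x = list(x0)
    fx = f(x)
    for _ in range(max_iter):
        improved = False
        for i in range(len(x)):
            for s in (step, -step):
                cand = list(x); cand[i] += s
                fc = f(cand)
                if fc < fx:
                    x, fx, improved = cand, fc, True
        if not improved:
            step *= 0.5          # no improvement: refine the mesh
            if step < tol:
                break
    return x

f = lambda t: (t[0] - 1.0) ** 2 + (t[1] + 2.0) ** 2   # minimum at (1, -2)
x_hat = pattern_search(f, [5.0, 5.0])
```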


Many zero-order methods can be viewed as estimating gradients in parameter space

  • ES estimates gradients by perturbing in parameter space
  • Finite differences explicitly approximate derivatives
  • Even random search implicitly follows a noisy gradient

This connects to BLUR and NES, which formulate parameter updates as following distributions rather than explicit gradients.
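The ES-as-gradient-estimation view can be sketched directly: perturb $\theta$ with Gaussian noise $\varepsilon$ and average $\frac{f(\theta + \sigma\varepsilon) - f(\theta - \sigma\varepsilon)}{2\sigma}\,\varepsilon$, which approximates $\nabla f(\theta)$ by a first-order Taylor argument (exactly, for quadratics). The mirrored-sampling form and the test function are illustrative choices.

```python
import random

def es_gradient(f, theta, sigma=0.1, n_samples=4000):
    """Estimate grad f(theta) by Gaussian perturbation in parameter space,
    with antithetic (mirrored) sampling to reduce variance."""
    d = len(theta)
    g = [0.0] * d
    for _ in range(n_samples):
        eps = [random.gauss(0.0, 1.0) for _ in range(d)]
        plus  = [t + sigma * e for t, e in zip(theta, eps)]
        minus = [t - sigma * e for t, e in zip(theta, eps)]
        scale = (f(plus) - f(minus)) / (2.0 * sigma * n_samples)
        g = [gi + scale * e for gi, e in zip(g, eps)]
    return g

random.seed(4)
f = lambda t: t[0] ** 2 + 3.0 * t[1]   # true gradient at (1, 2) is (2, 3)
g = es_gradient(f, [1.0, 2.0])
```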

References

optimization
first order optimization
second order optimization