Motivation
There are many problems where backpropagation cannot be used or has limitations. In RL, for example, the credit assignment problem makes gradient estimates difficult: rewards are often delayed, sparse, or noisy, and policies can get trapped in local optima.
In such settings, black-box optimization methods like genetic algorithms or evolution strategies provide a gradient-free alternative. See also: Evolution Strategies as a Scalable Alternative to Reinforcement Learning.
Zero-order optimization
Optimization methods that only require function evaluations, not gradients or higher derivatives.
You may approximate derivatives from function values or not use them at all.

Problem. Minimize $f(x)$ over $x \in \mathbb{R}^d$.
Zero-order oracle. On a query $x_t$, an oracle gives us a single (possibly noisy) function value at $x_t$:

$$y_t = f(x_t) + \varepsilon_t$$

Algorithm. Choose $x_t$ adaptively from past data $(x_1, y_1), \ldots, (x_{t-1}, y_{t-1})$, then output an estimate $\hat{x}_n$.
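A minimal sketch of this query model in Python (the quadratic objective and the noise level are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def oracle(x, noise_std=0.1):
    """Zero-order oracle: returns a noisy value y = f(x) + eps.
    Here f is a toy quadratic; in practice f is a black box."""
    f = np.sum(x ** 2)
    return f + rng.normal(scale=noise_std)

# An algorithm may only interact with f through queries like this:
y = oracle(np.array([1.0, -2.0]))
```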
Is black-box optimization the same as zero-order optimization?
BBO is a modeling assumption about access to information, not a specific algorithm family.
The box might return just values, or values and gradients, but you can’t exploit an explicit formula.
BBO is about what you know (only queries), gradient-free optimization describes what you use.
That said… they often coincide (compare the list below to the one for black-box optimization), but there are also many hybrids/variants that do mix BBO with gradients.
Variants of zero-order optimization
Consistency and finite-sample convergence rates
With the conditions $\mathbb{E}[\varepsilon_t] = 0$ and $\mathbb{E}[\varepsilon_t^2] \le \sigma^2 < \infty$, which state that the noise is unbiased ($\mathbb{E}[\varepsilon_t] = 0$) and its variance is bounded, we can prove:
Consistency: The error $f(\hat{x}_n) - \min_x f(x) \to 0$ as our sample size $n \to \infty$.
Finite-sample convergence rates: Bounds on the expected optimality gap after $n$ queries, e.g. on the order of $\sqrt{d/n}$.
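For instance, a typical rate statement has this shape (hedged: this is the form of standard two-point bounds for stochastic convex zero-order optimization; the exact assumptions and the constant vary by result):

```latex
% Typical two-point zero-order rate for convex, Lipschitz f
% (assumptions and the constant C vary across results):
\mathbb{E}\big[f(\hat{x}_n)\big] - \min_{x} f(x) \;\le\; C\,\sqrt{\tfrac{d}{n}}
```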
Eventually remove the claude slop below as I replace it with my own / better notes.
Key Trade-offs
Sample efficiency: Gradient methods use $O(1)$ evaluations per step; zero-order needs $O(d)$ or more. But each evaluation can be simpler (no backprop).
Parallelization: Zero-order methods often evaluate many points independently; gradient methods are inherently sequential.
Convergence: First-order methods converge at known rates for convex problems. Zero-order methods often lack convergence guarantees but can escape local minima.
Memory: No need to store computational graphs or intermediate activations - just current parameters and function values.
Some common methods:
Stochastic Search Methods
Random search: Sample points uniformly or from a distribution, keep the best. Simple but surprisingly effective in high dimensions where grid search fails.
Simulated annealing: Probabilistic method that accepts worse solutions with decreasing probability over time, allowing escape from local minima through controlled randomness.
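A minimal simulated-annealing sketch (the Gaussian proposal, geometric cooling schedule, and toy objective are illustrative choices, not canonical ones):

```python
import numpy as np

rng = np.random.default_rng(0)

def simulated_annealing(f, x0, n_iters=5000, step=0.5, t0=1.0, cooling=0.999):
    """Minimize f by accepting worse moves with probability exp(-delta/T)."""
    x, fx = x0, f(x0)
    best_x, best_f = x, fx
    temp = t0
    for _ in range(n_iters):
        cand = x + step * rng.normal(size=x.shape)   # random proposal
        fc = f(cand)
        # Always accept improvements; accept worse moves with prob e^{-delta/T}
        if fc < fx or rng.random() < np.exp(-(fc - fx) / temp):
            x, fx = cand, fc
            if fx < best_f:
                best_x, best_f = x, fx
        temp *= cooling                              # cool down
    return best_x, best_f

# Toy usage: a multimodal 1-D objective
x, fx = simulated_annealing(lambda x: np.sum(x**2 + 2*np.sin(5*x)), np.array([3.0]))
```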
Population-Based Methods
Evolutionary optimization: Maintain a population of solutions that evolve through selection, mutation, and crossover. Includes genetic algorithms, evolution strategies (ES), and differential evolution (a minimal ES sketch follows this list).
CMA-ES: Covariance Matrix Adaptation Evolution Strategy - adapts a multivariate normal distribution to efficiently explore the search space by learning correlations between parameters.
NEAT: Evolves both neural network weights and topology, solving the competing conventions problem through historical markings.
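To make the ES idea concrete, here is a minimal $(\mu, \lambda)$ evolution strategy with fixed Gaussian mutation (no covariance adaptation as in CMA-ES; population sizes and $\sigma$ are arbitrary illustrative values):

```python
import numpy as np

rng = np.random.default_rng(0)

def simple_es(f, x0, n_gens=200, pop=50, parents=10, sigma=0.3):
    """(mu, lambda)-ES: sample a population around the mean,
    keep the best mu, recombine by averaging."""
    mean = x0.copy()
    for _ in range(n_gens):
        # Mutation: pop offspring sampled around the current mean
        offspring = mean + sigma * rng.normal(size=(pop, mean.size))
        fitness = np.array([f(z) for z in offspring])
        # Selection: keep the best; recombination: average them
        elite = offspring[np.argsort(fitness)[:parents]]
        mean = elite.mean(axis=0)
    return mean, f(mean)

x, fx = simple_es(lambda x: np.sum(x**2), np.array([2.0, -1.5]))
```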
Gradient Approximation
Finite Difference Stochastic Approximation (FDSA): Approximate each partial derivative separately by perturbing one parameter at a time:

$$\hat{g}_i(\theta) = \frac{f(\theta + c\, e_i) - f(\theta - c\, e_i)}{2c}$$

where $e_i$ is the unit vector in the $i$-th direction. This gives an accurate gradient estimate but requires $2d$ function evaluations for $d$ parameters. The approximation error scales as $O(c^2)$, but making $c$ too small leads to numerical instability due to floating-point precision.
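A direct NumPy translation of the estimator above (central differences, one coordinate at a time):

```python
import numpy as np

def fdsa_gradient(f, theta, c=1e-4):
    """Central-difference gradient estimate: 2d evaluations of f."""
    d = theta.size
    grad = np.zeros(d)
    for i in range(d):
        e = np.zeros(d)
        e[i] = 1.0   # unit vector in the i-th direction
        grad[i] = (f(theta + c * e) - f(theta - c * e)) / (2 * c)
    return grad
```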
Simultaneous Perturbation Stochastic Approximation (SPSA): Perturb all parameters simultaneously with a random vector $\Delta$:

$$\hat{g}_i(\theta) = \frac{f(\theta + c\,\Delta) - f(\theta - c\,\Delta)}{2c\,\Delta_i}$$

where each component $\Delta_i$ is typically $\pm 1$ with equal probability (Rademacher distribution). This estimates the gradient using only 2 evaluations regardless of dimension! In expectation, this gives an unbiased gradient estimate because $\mathbb{E}[\Delta_j / \Delta_i] = 0$ for $j \neq i$ and $\Delta_i / \Delta_i = 1$. FDSA needs $2d$ evaluations, SPSA needs 2. The variance is higher than FDSA, but in high dimensions the massive reduction in evaluations often wins. This is essentially what NES does when viewed as gradient estimation.
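And the corresponding two-evaluation SPSA estimator (Rademacher perturbations; c is a small illustrative constant):

```python
import numpy as np

rng = np.random.default_rng(0)

def spsa_gradient(f, theta, c=1e-2):
    """SPSA gradient estimate: 2 evaluations of f regardless of dimension."""
    delta = rng.choice([-1.0, 1.0], size=theta.size)  # Rademacher perturbation
    df = f(theta + c * delta) - f(theta - c * delta)
    return df / (2 * c * delta)  # element-wise; delta_i = +/-1, so dividing == multiplying
```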
Model-Based Methods
Bayesian optimization: Build a probabilistic surrogate model (often a Gaussian Process) of the objective function, then optimize an acquisition function to decide where to sample next. Particularly effective when evaluations are expensive (a toy sketch follows this list).
Surrogate optimization: Use simpler approximations (polynomials, radial basis functions) to model the objective locally or globally.
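A deliberately tiny 1-D Bayesian optimization sketch (RBF kernel, grid-based lower-confidence-bound acquisition; the kernel length scale, grid, LCB weight, and toy objective are all arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)

def rbf(a, b, length=0.3):
    """RBF kernel between two sets of 1-D points."""
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / length) ** 2)

def gp_posterior(x_train, y_train, x_query, noise=1e-4):
    """GP posterior mean/std at query points (zero prior mean)."""
    k = rbf(x_train, x_train) + noise * np.eye(len(x_train))
    k_star = rbf(x_query, x_train)
    k_inv = np.linalg.inv(k)
    mean = k_star @ k_inv @ y_train
    var = 1.0 - np.sum((k_star @ k_inv) * k_star, axis=1)
    return mean, np.sqrt(np.maximum(var, 1e-12))

def bayes_opt(f, n_iters=20, grid=np.linspace(0.0, 1.0, 200)):
    """Minimize f on [0, 1] with a GP surrogate and LCB acquisition."""
    xs = list(rng.uniform(0, 1, size=3))   # a few random initial queries
    ys = [f(x) for x in xs]
    for _ in range(n_iters):
        mean, std = gp_posterior(np.array(xs), np.array(ys), grid)
        lcb = mean - 2.0 * std              # lower confidence bound
        x_next = grid[np.argmin(lcb)]       # query the most promising point
        xs.append(x_next)
        ys.append(f(x_next))
    best = int(np.argmin(ys))
    return xs[best], ys[best]

x, fx = bayes_opt(lambda x: (x - 0.7) ** 2 + 0.05 * np.sin(20 * x))
```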
Direct Search Methods
Nelder-Mead simplex: Maintains a simplex (triangle in 2D, tetrahedron in 3D, etc.) of $d+1$ points in $d$ dimensions, iteratively reflecting, expanding, or contracting based on function values. No derivatives needed, but it can converge to non-stationary points (see the usage example after this list).
Pattern search: Explores along a set of directions (often coordinate axes), reducing step size when no improvement is found. Convergence to stationary points is guaranteed for smooth functions.
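Nelder-Mead is available off the shelf, e.g. in SciPy (the Rosenbrock objective here is just a common test function):

```python
import numpy as np
from scipy.optimize import minimize

def rosenbrock(x):
    return np.sum(100.0 * (x[1:] - x[:-1] ** 2) ** 2 + (1 - x[:-1]) ** 2)

# Derivative-free: only function values are ever evaluated
res = minimize(rosenbrock, x0=np.array([-1.2, 1.0]), method="Nelder-Mead")
print(res.x, res.fun)   # should approach [1, 1], 0
```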
Many zero-order methods can be viewed as estimating gradients in parameter space.
References
optimization
first order optimization
second order optimization