year: 2017/12
paper: on-the-relationship-between-the-openai-evolution-strategy-and-stochastic-gradient-descent
website:
code:
connections: OpenAI-ES, SGD, Kenneth O. Stanley
Takeaway
If you have access to gradients/good proxy gradients, i.e. you can do supervised learning, ES’s computational overhead isn’t worth it. But in RL, gradients are noisy anyway, and ES can reduce that noise by adding more offspring (parallelizable). ES is competitive when domain noise degrades SGD’s gradient quality enough that ES’s approximation isn’t much worse.
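For reference, a minimal sketch of the antithetic OpenAI-ES gradient estimator this note is about: more offspring just means averaging more independent perturbation pairs, which is the parallelizable noise reduction mentioned above. Function names and hyperparameters are illustrative, not taken from the paper.

```python
import numpy as np

def es_gradient(f, theta, sigma=0.02, n_offspring=256, rng=None):
    """Antithetic OpenAI-ES estimate of grad E[f(theta + sigma*eps)].

    Each +/- pair is independent, so the loop parallelizes across
    workers; more offspring -> lower-variance (less noisy) estimate.
    """
    rng = rng or np.random.default_rng()
    grad = np.zeros_like(theta)
    for _ in range(n_offspring // 2):
        eps = rng.standard_normal(theta.shape)
        r_plus = f(theta + sigma * eps)   # reward of the +eps offspring
        r_minus = f(theta - sigma * eps)  # reward of the -eps offspring
        grad += (r_plus - r_minus) * eps
    return grad / (n_offspring * sigma)

# Toy usage: climb a quadratic bowl with the ES estimate.
f = lambda w: -np.sum((w - 1.0) ** 2)
theta = np.zeros(5)
for _ in range(200):
    theta += 0.05 * es_gradient(f, theta)
```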
Extend takeaway after reading the companion paper. Then also read the eggroll paper (3e3289fd-20a1-46c8-90f5-bd7ca813c710).
ES is more than just a traditional finite-difference approximator
“now read evolution-strategies-at-the-hyperscale and tell me what you learned”
- ES with low sigma is SGD with extra noise. The gradient it estimates is the same quantity backprop computes, just with worse signal-to-noise. An SGD proxy with calibrated Gaussian noise predicts ES behavior almost perfectly (see the first sketch after this list).
- Surprisingly little correlation goes a long way. 3.9% correlation with the true gradient still gets you 97% on MNIST; 20% correlation matches SGD. The relationship between gradient quality and final performance is far more forgiving than you'd expect.
- The computational cost of closing the gap is steep. Matching SGD's performance requires ~274k pseudo-offspring per step. ES is never worth it in supervised settings where clean gradients are available.
- The real implication is for RL. ES doesn't need to match SGD; it only needs to be competitive when SGD's own signal is degraded by environmental noise, reward sparsity, and credit-assignment problems. The paper gives you the quantitative framework to reason about when that crossover happens.
- Limited perturbation works. You can perturb only a random subset of 5k params per offspring instead of all 3M, use 5x more offspring, and get the same result 64x faster at the master node. The gradient signal is that redundant across dimensions (see the second sketch after this list).
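A sketch of the "SGD plus calibrated noise" proxy idea from the first bullet: degrade the true backprop gradient until its correlation with the clean direction matches what ES achieves, then train with that. The construction below forces an exact cosine similarity instead of calibrating a noise variance, so treat it as an illustration, not the paper's procedure.

```python
import numpy as np

def degraded_gradient(grad, target_corr, rng=None):
    """Return a unit vector whose cosine similarity with the true
    (flattened) gradient is target_corr, as a stand-in for an ES
    estimate of that quality. target_corr in (0, 1]."""
    rng = rng or np.random.default_rng()
    g = grad / (np.linalg.norm(grad) + 1e-12)
    noise = rng.standard_normal(g.shape)
    noise -= noise.dot(g) * g               # orthogonalize against g
    noise /= np.linalg.norm(noise) + 1e-12
    return target_corr * g + np.sqrt(1.0 - target_corr**2) * noise

# E.g. target_corr=0.039 vs. 0.20 to mimic the note's MNIST numbers.
```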
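And a sketch of the limited-perturbation variant from the last bullet: each offspring perturbs only k randomly chosen parameters, so the master node only touches k entries per offspring, and the variance hit is paid for with more offspring. Signature and defaults are illustrative, not the paper's exact setup.

```python
import numpy as np

def limited_perturbation_es(f, theta, sigma=0.02, n_offspring=1280,
                            k=5_000, rng=None):
    """ES gradient estimate where each offspring perturbs only k
    randomly chosen parameters instead of the full vector."""
    rng = rng or np.random.default_rng()
    grad = np.zeros_like(theta)
    k = min(k, theta.size)
    for _ in range(n_offspring):
        idx = rng.choice(theta.size, size=k, replace=False)
        eps_sub = rng.standard_normal(k)
        candidate = theta.copy()
        candidate[idx] += sigma * eps_sub    # sparse perturbation
        grad[idx] += f(candidate) * eps_sub  # master updates only k entries
    return grad / (n_offspring * sigma)
```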