optimization

“s.t.” … subject to

Properties of $f$ that determine which optimization algorithm to employ

Is $f$ known or unknown?
If there’s no mathematical model for $f$ , we compute it by interaction with a simulation or experiment (black-box optimization).
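
As a rough illustration of the black-box setting, the sketch below minimizes a function using nothing but evaluations, via uniform random search. The function `simulation`, the box bounds, and the evaluation budget are all hypothetical stand-ins, not part of any particular experiment.

```python
import random


def random_search(f, bounds, n_evals=2000, seed=0):
    """Minimize a black-box f by sampling uniformly inside box bounds."""
    rng = random.Random(seed)
    best_x, best_val = None, float("inf")
    for _ in range(n_evals):
        x = [rng.uniform(lo, hi) for lo, hi in bounds]
        val = f(x)  # the only access we have to f: evaluate it
        if val < best_val:
            best_x, best_val = x, val
    return best_x, best_val


def simulation(x):
    # Hypothetical stand-in for a simulation or experiment.
    # The optimizer never looks at this formula, only at the returned value.
    return (x[0] - 1.0) ** 2 + (x[1] + 2.0) ** 2


best_x, best_val = random_search(simulation, bounds=[(-5, 5), (-5, 5)])
print(best_x, best_val)
```

Random search is about the simplest black-box method there is. It is shown here only because it makes the "function values only" constraint explicit.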

Can we compute the gradient $\nabla f$ ?
If we can, we can use first-order optimization methods. If not, we must use zeroth-order (derivative-free) methods, which either approximate gradients from function values or work with function values directly.
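
One common way to approximate a gradient from function values is central finite differences. The sketch below is a minimal version. The test function `f`, its analytic gradient `grad_f`, and the step size `h` are assumptions chosen for illustration.

```python
def finite_difference_gradient(f, x, h=1e-5):
    """Approximate the gradient of f at x with central differences."""
    grad = []
    for i in range(len(x)):
        x_plus = list(x)
        x_minus = list(x)
        x_plus[i] += h
        x_minus[i] -= h
        # Two function evaluations per coordinate
        grad.append((f(x_plus) - f(x_minus)) / (2 * h))
    return grad


def f(x):
    return x[0] ** 2 + 3.0 * x[0] * x[1]


def grad_f(x):
    # Analytic gradient, used here only to check the approximation
    return [2 * x[0] + 3 * x[1], 3 * x[0]]


x = [1.0, 2.0]
approx = finite_difference_gradient(f, x)
exact = grad_f(x)
print(approx, exact)
```

Each approximate gradient costs two evaluations of $f$ per coordinate, so this approach becomes expensive as the number of decision variables grows.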

Is $f$ high dimensional?
If $f$ has high-dimensional decision variables $x$ (like millions of parameters), we must use algorithms that scale well with dimension, like SGD.
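
A minimal sketch of mini-batch SGD, here on a two-parameter least-squares fit so that it stays readable. The data, learning rate, batch size, and the helper name `sgd_linear_regression` are assumptions for illustration. The point is that each step needs only a gradient over a small batch plus one update per parameter, so memory and per-step cost grow linearly with the number of parameters.

```python
import random


def sgd_linear_regression(xs, ys, lr=0.05, epochs=200, batch_size=4, seed=0):
    """Fit y = w * x + b by mini-batch SGD on the mean squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradient of the mean squared error over this mini-batch only
            grad_w = sum(2 * (w * xs[i] + b - ys[i]) * xs[i] for i in batch) / len(batch)
            grad_b = sum(2 * (w * xs[i] + b - ys[i]) for i in batch) / len(batch)
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b


# Noise-free synthetic data generated from y = 2x + 1 (an assumption for the demo)
xs = [i / 10 for i in range(-10, 11)]
ys = [2.0 * x + 1.0 for x in xs]
w, b = sgd_linear_regression(xs, ys)
print(w, b)
```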

Is $f$ convex or nonconvex?
If $f$ is convex, every local minimum is also a global minimum, so we can use efficient convex optimization algorithms with global convergence guarantees.
Deep learning objectives are typically nonconvex, so we fall back on methods like SGD with momentum or Adam, which work well in practice but carry no guarantee of finding a global minimum.
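
To make the nonconvex case concrete, the sketch below runs gradient descent with momentum on a one-dimensional function that has two local minima. The function $x^4 - 3x^2 + x$, the step size, and the momentum coefficient are assumptions picked for the demo. Which minimum is reached depends on the starting point.

```python
def f(x):
    # Nonconvex: one local minimum near x = 1.13, the global one near x = -1.30
    return x ** 4 - 3 * x ** 2 + x


def grad_f(x):
    return 4 * x ** 3 - 6 * x + 1


def gd_momentum(grad, x0, lr=0.01, beta=0.9, steps=500):
    """Gradient descent with (heavy-ball) momentum."""
    x, v = x0, 0.0
    for _ in range(steps):
        v = beta * v - lr * grad(x)  # velocity accumulates past gradients
        x = x + v
    return x


x_left = gd_momentum(grad_f, x0=-1.5)   # ends in the global minimum's basin
x_right = gd_momentum(grad_f, x0=1.5)   # ends in the other local minimum
print(x_left, f(x_left))
print(x_right, f(x_right))
```

Both runs converge, but only the one started at $x_0 = -1.5$ finds the lower value of $f$. On a convex function the two runs would agree.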

Is $f$ linear or nonlinear?
If $f$ and all constraints are linear in $x$, we can use linear programming techniques.
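
A small sketch of why linearity helps: when a linear program has a bounded, nonempty feasible region, an optimum sits at a vertex of that region. The brute-force helper below (hypothetical name `solve_2d_lp`) enumerates vertices of a two-variable problem. It is not the simplex or interior-point method that practical solvers use, and it does not detect unbounded problems.

```python
from itertools import combinations


def solve_2d_lp(c, constraints, tol=1e-9):
    """Minimize c[0]*x + c[1]*y subject to a*x + b*y <= r for each (a, b, r).

    Intersects every pair of constraint boundaries and keeps the best
    feasible intersection point (a vertex of the feasible region).
    """
    best_point, best_value = None, float("inf")
    for (a1, b1, r1), (a2, b2, r2) in combinations(constraints, 2):
        det = a1 * b2 - a2 * b1
        if abs(det) < tol:
            continue  # parallel boundaries do not meet in a single point
        # Cramer's rule for the 2x2 system of the two boundary lines
        x = (r1 * b2 - r2 * b1) / det
        y = (a1 * r2 - a2 * r1) / det
        if all(a * x + b * y <= r + tol for a, b, r in constraints):
            value = c[0] * x + c[1] * y
            if value < best_value:
                best_point, best_value = (x, y), value
    return best_point, best_value


# Example problem (made up): maximize 3x + 2y, i.e. minimize -3x - 2y,
# subject to x + y <= 4, x + 3y <= 6, x >= 0, y >= 0.
constraints = [(1, 1, 4), (1, 3, 6), (-1, 0, 0), (0, -1, 0)]
point, value = solve_2d_lp((-3, -2), constraints)
print(point, value)
```

For this example the feasible vertices are $(0,0)$, $(4,0)$, $(3,1)$ and $(0,2)$, and the best one is $(4,0)$ with $3x + 2y = 12$.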

Is $f$ expensive to compute or evaluate?
If $f$ is expensive, we may use surrogate models or Bayesian optimization to minimize the number of evaluations.
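
The surrogate idea can be sketched in one dimension with a deliberately simple model: fit a parabola to the three best samples, evaluate $f$ only at the parabola's vertex, and repeat. This is a deterministic quadratic surrogate, not Bayesian optimization (which would typically use a probabilistic model such as a Gaussian process). The function `expensive_f`, the starting points, and the iteration count are assumptions.

```python
import math


def quadratic_surrogate_minimize(f, xs, n_iters=4):
    """Minimize a 1-D function by refitting a parabola to the best three samples."""
    samples = [(x, f(x)) for x in xs]  # each call to f is assumed to be costly
    for _ in range(n_iters):
        samples.sort(key=lambda s: s[1])
        (b, fb), (a, fa), (c, fc) = samples[:3]
        # Stationary point of the parabola through (a, fa), (b, fb), (c, fc).
        # It is a minimum when the fitted parabola opens upward, as it does
        # for the convex example below.
        num = (b - a) ** 2 * (fb - fc) - (b - c) ** 2 * (fb - fa)
        den = (b - a) * (fb - fc) - (b - c) * (fb - fa)
        if abs(den) < 1e-12:
            break  # degenerate fit (collinear or repeated points)
        x_new = b - 0.5 * num / den
        samples.append((x_new, f(x_new)))  # one new expensive evaluation
    return min(samples, key=lambda s: s[1]), len(samples)


def expensive_f(x):
    # Hypothetical stand-in for a costly objective; its true minimizer is ln(2)
    return math.exp(x) - 2.0 * x


(best_x, best_f), n_evals = quadratic_surrogate_minimize(expensive_f, [0.0, 0.5, 1.5])
print(best_x, best_f, n_evals)
```

The budget here is at most seven evaluations of `expensive_f`: three initial samples plus one per iteration.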

Does $f$ encode dynamics in time?

References

Intro to optimization in deep learning: Momentum, RMSProp and Adam

convex optimization
optimizers

Max Wolf's Second Brain
