year: 2020
paper: https://arxiv.org/pdf/1906.10652
website:
code:
connections: monte carlo method, machine learning
Whereas in typical optimisation we would call the system for a function value, which is deterministic, in stochastic optimisation we call for an estimate of the function value; instead of calling for the gradient, we ask for an estimate of the gradient; instead of calling for the Hessian, we ask for an estimate of the Hessian. All these estimates are random variables.
Problem statement
$\mathcal{F}(\theta)$ is a probabilistic objective of the form:

$$\mathcal{F}(\theta) = \mathbb{E}_{p_{\theta}(x)}\left[f(x; \phi)\right] = \int p_{\theta}(x)\, f(x; \phi)\, dx$$
where $f(x; \phi)$ is a function of an input variable $x$ with “structural parameters” $\phi$, evaluated on average with respect to the input distribution $p_{\theta}(x)$ with distributional parameters $\theta$.
$f$ is referred to as the cost and $p_{\theta}(x)$ as the measure. They call this a “mean-value analysis”.
→ This is a general framework that allows us to cover problems from queueing theory and variational inference to portfolio design and reinforcement learning.
We want to learn $\theta$ by optimising $\mathcal{F}(\theta)$, so we care about its gradient:

$$\eta = \nabla_{\theta} \mathcal{F}(\theta) = \nabla_{\theta} \mathbb{E}_{p_{\theta}(x)}\left[f(x; \phi)\right]$$
They call this the “sensitivity analysis” of $\mathcal{F}(\theta)$, i.e. the gradient of the expectation with respect to the distribution parameters $\theta$. It characterizes the sensitivity of the cost to changes in the distribution parameters.[^1][^2]
→ This gradient lies at the core of tools for model explanation (of probabilistic systems) and optimisation.
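A quick worked example (a toy instance of my own, in the spirit of the paper's Gaussian examples, not taken from it): take a Gaussian measure and a quadratic cost,

$$p_{\theta}(x) = \mathcal{N}(x; \mu, \sigma^{2}), \qquad f(x) = (x - k)^{2}, \qquad \theta = \{\mu, \sigma\}.$$

Here both the expectation and its sensitivity are available in closed form, which makes this a handy sanity check for the estimators below:

$$\mathcal{F}(\theta) = (\mu - k)^{2} + \sigma^{2}, \qquad \nabla_{\mu}\mathcal{F} = 2(\mu - k), \qquad \nabla_{\sigma}\mathcal{F} = 2\sigma.$$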
But this gradient problem is difficult in general: we will often not be able to evaluate the expectation in closed form; the integrals over $x$ are typically high-dimensional, making quadrature ineffective; it is common to request gradients with respect to high-dimensional parameters $\theta$, easily in the order of tens-of-thousands of dimensions; and in the most general cases, the cost function may itself not be differentiable, or be a black-box function whose output is all we are able to observe. In addition, for applications in machine learning, we will need efficient, accurate and parallelisable solutions that minimise the number of evaluations of the cost. These are all challenges we can overcome by developing Monte Carlo estimators of these integrals and gradients.
$\phi$ is not considered in the following, which is why the paper drops it and writes $f(x)$, even though $f$ may depend on it.
Monte Carlo estimator of the expectation
The Monte Carlo method lets us approximate the expectation in cases where we can't (easily) evaluate the integral in closed form, by drawing $N$ i.i.d. samples $\hat{x}^{(n)}$ from the distribution $p_{\theta}(x)$, and taking the average of the function evaluated at those samples:

$$\bar{\mathcal{F}}_{N} = \frac{1}{N} \sum_{n=1}^{N} f\left(\hat{x}^{(n)}\right), \qquad \hat{x}^{(n)} \sim p_{\theta}(x)$$
I.e. we approximate the expectation by the empirical/sample mean of $N$ samples.
All we need is an integral of the form shown above, and a distribution that we can easily sample from. The resulting quantity $\bar{\mathcal{F}}_{N}$ is a random variable, because it depends on the set of random variates $\{\hat{x}^{(n)}\}_{n=1}^{N}$ used.
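A minimal sketch in NumPy, reusing the toy Gaussian/quadratic setup above (the constants are my own choice, only for illustration):

```python
import numpy as np

# Monte Carlo estimate of F(theta) = E_{p_theta(x)}[f(x)] for the toy setup:
# p_theta(x) = N(x; mu, sigma^2), f(x) = (x - k)^2.
rng = np.random.default_rng(0)
mu, sigma, k = 1.0, 2.0, 0.0          # theta = (mu, sigma); k fixed

N = 10_000
x = rng.normal(mu, sigma, size=N)     # x^(n) ~ p_theta(x), i.i.d.
F_bar = np.mean((x - k) ** 2)         # empirical mean of f over the samples

print(F_bar)  # close to the closed form (mu - k)^2 + sigma^2 = 5.0
```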
Desirable properties of Monte Carlo estimators:
Consistency
The estimator should converge to the true value as the number of samples increases:
$$\bar{\mathcal{F}}_{N} \xrightarrow{a.s.} \mathcal{F}(\theta) \text{ as } N \to \infty$$
Following the strong law of large numbers.
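Consistency can also be checked numerically: in the toy setup above, the absolute error of the estimate shrinks (roughly as $1/\sqrt{N}$) as $N$ grows. A self-contained sketch, same assumed constants:

```python
import numpy as np

# Error of the N-sample estimate vs. the closed-form value, for growing N.
rng = np.random.default_rng(0)
mu, sigma, k = 1.0, 2.0, 0.0
true_F = (mu - k) ** 2 + sigma**2     # = 5.0

for N in [10, 1_000, 100_000]:
    x = rng.normal(mu, sigma, size=N)
    print(N, abs(np.mean((x - k) ** 2) - true_F))
```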
Unbiasedness
The estimate of the gradient should be correct on average. Unbiasedness is always preferred because it allows us to guarantee the convergence of a stochastic optimisation procedure.
$$\mathbb{E}_{p_{\theta}(x)}\left[\bar{\mathcal{F}}_{N}\right] = \mathcal{F}(\theta)$$
This follows from the linearity of expectation and the fact that the samples are drawn from $p_{\theta}(x)$:
$$\begin{align*}
\mathbb{E}_{p_{\theta}(x)}\left[\bar{\mathcal{F}}_{N}\right] &= \mathbb{E}_{p_{\theta}(x)}\left[\frac{1}{N} \sum_{n=1}^{N} f\left(\hat{x}^{(n)}\right)\right] \\
&= \frac{1}{N} \sum_{n=1}^{N} \mathbb{E}_{p_{\theta}(x)}\left[f\left(\hat{x}^{(n)}\right)\right] \\
&= \frac{1}{N} \sum_{n=1}^{N} \mathcal{F}(\theta) \\
&= \mathcal{F}(\theta)
\end{align*}$$
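The toy setup also lets us check unbiasedness empirically: the mean over many independent replications of the $N$-sample estimator recovers $\mathcal{F}(\theta)$ even for small $N$ (a sketch, same assumed constants as above):

```python
import numpy as np

# Mean over many independent replications of the N-sample estimator.
rng = np.random.default_rng(0)
mu, sigma, k = 1.0, 2.0, 0.0

N, replications = 100, 10_000
estimates = [np.mean((rng.normal(mu, sigma, size=N) - k) ** 2)
             for _ in range(replications)]
print(np.mean(estimates))  # ~ (mu - k)^2 + sigma^2 = 5.0, even for small N
```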
Minimum variance
Comparing two unbiased estimators, the one with lower variance is better. Evolution Strategies as a Scalable Alternative to Reinforcement Learning also makes good points about variance reduction.
Low-variance gradients allow larger learning rates (→ fewer steps) and more stable convergence.
Computational efficiency
Fewer evaluations/samples (linear cost in the number of parameters) + parallelizable = good.
Making a clear separation between the simulation and optimisation phases allows us to focus our attention on developing the best estimators of gradients we can, knowing that when used with gradient-based optimisation methods available to us, we can guarantee convergence so long as the estimate is unbiased, i.e. the estimate of the gradient is correct on average.
Sources of stochasticity:
- environment
- (optionally) SGD (sampling minibatches)
The Bias-Variance Tradeoff
Different gradient estimators (score function, pathwise, etc.) have different bias-variance properties.
The score-function estimator is unbiased but typically high-variance. The pathwise (reparameterization) estimator generally has lower variance but requires a differentiable $f(x)$.
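A sketch of both estimators for $\nabla_{\mu}\mathcal{F}$ in the toy Gaussian setup above (true value $2(\mu - k)$); the estimator formulas are standard, the constants are assumptions of mine:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, k, N = 1.0, 2.0, 0.0, 100_000

# Score-function (REINFORCE) estimator: E[f(x) * d/dmu log p_theta(x)],
# where d/dmu log N(x; mu, sigma^2) = (x - mu) / sigma^2. Needs no gradient of f.
x = rng.normal(mu, sigma, size=N)
score = (x - k) ** 2 * (x - mu) / sigma**2

# Pathwise (reparameterization) estimator: write x = mu + sigma * eps with
# eps ~ N(0, 1), then differentiate through f: d/dmu f(x) = f'(x) = 2 * (x - k).
eps = rng.standard_normal(N)
path = 2 * (mu + sigma * eps - k)

print("true gradient :", 2 * (mu - k))
print("score function:", score.mean(), " per-sample variance:", score.var())
print("pathwise      :", path.mean(), " per-sample variance:", path.var())
```

Both sample means match the true gradient, but the per-sample variance of the score-function estimator is far larger here, which is exactly the tradeoff this section describes.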
Footnotes

[^1]: Check out this epic visualization of how $p_{\theta}(x)$ and $f(x)$ convolve, and how the sensitivity guides the parameter optimization.

[^2]: Sensitivity here just means the gradient of an expectation, i.e. a measure of how much the expected value changes in response to small changes in parameters. Not really connected to the numerous other uses of the word sensitivity in other contexts, beyond the general idea of responsiveness to change.