Variational inference
Variational inference is a general method for approximating complex and unknown distributions by the closest distribution within a tractable family. Wherever the need to approximate distributions appears in machine learning, whether in supervised, unsupervised, or reinforcement learning, a form of variational inference can be used.
Consider a generic probabilistic model that defines a generative process in which observed data $x$ is generated from a set of unobserved variables $z$ using a data distribution $p(x \mid z)$ and a prior distribution $p(z)$.
In supervised learning, the unobserved variables might correspond to the weights of a regression problem, and in unsupervised learning to latent variables.
The posterior distribution $p(z \mid x)$ of this generative process is unknown, and is approximated by a variational distribution $q_\theta(z)$, which is a parameterized family of distributions with variational parameters $\theta$, e.g., the mean and variance of a Gaussian distribution. Finding the distribution $q_\theta(z)$ that is closest to $p(z \mid x)$ (e.g., in the KL sense) leads to an objective, the variational free energy, that maximizes an expected log-likelihood subject to a regularization constraint that encourages closeness between the variational distribution $q_\theta(z)$ and the prior distribution $p(z)$:

$$\mathcal{F}(\theta) = \mathbb{E}_{q_\theta(z)}\left[\log p(x \mid z)\right] - \mathrm{KL}\left(q_\theta(z) \,\|\, p(z)\right).$$

Optimizing the distribution $q_\theta(z)$ requires the gradient of the free energy with respect to the variational parameters $\theta$:

$$\nabla_\theta \mathcal{F}(\theta) = \nabla_\theta \, \mathbb{E}_{q_\theta(z)}\!\left[\log p(x \mid z) - \log \frac{q_\theta(z)}{p(z)}\right].$$
This is an objective of the form we encounter in Monte Carlo estimation, where the cost function is the term in square brackets and the measure is the variational distribution $q_\theta(z)$.
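To make the Monte Carlo view concrete, here is a minimal sketch on an assumed toy model (not from the source): prior $z \sim \mathcal{N}(0,1)$, likelihood $x \mid z \sim \mathcal{N}(z,1)$, observed $x = 2$, whose exact posterior is $\mathcal{N}(1, 0.5)$. A Gaussian $q_\theta(z)$ is fit by stochastic gradient ascent on the free energy, using the reparameterization trick so the per-sample gradients can be written by hand.

```python
import random
import math

random.seed(0)

# Toy model: prior z ~ N(0, 1), likelihood x | z ~ N(z, 1), observed x = 2.
# The exact posterior is N(1, 0.5), so we can check the fitted q against it.
x = 2.0

# Variational parameters of q(z) = N(mu, sigma^2); optimize log_sigma so
# sigma stays positive.
mu, log_sigma = 0.0, 0.0
lr, n_samples = 0.05, 50

for step in range(3000):
    grad_mu, grad_log_sigma = 0.0, 0.0
    sigma = math.exp(log_sigma)
    for _ in range(n_samples):
        eps = random.gauss(0.0, 1.0)
        z = mu + sigma * eps          # reparameterization trick
        # d/dz [log p(x|z) + log p(z)] = (x - z) - z; with z reparameterized,
        # log q contributes nothing through mu and +1 through log_sigma.
        dlogp = (x - z) - z
        grad_mu += dlogp
        grad_log_sigma += dlogp * sigma * eps + 1.0
    mu += lr * grad_mu / n_samples          # gradient ascent on the free energy
    log_sigma += lr * grad_log_sigma / n_samples

print(mu, math.exp(log_sigma))  # should approach 1.0 and sqrt(0.5) ≈ 0.707
```

The closed-form per-sample gradients are specific to this Gaussian toy problem; in practice an autodiff framework computes them from the sampled objective.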
This problem also appears in other research areas, especially in statistical physics, information theory and utility theory. Many of the solutions that have been developed for scene understanding, representation learning, photo-realistic image generation, or the simulation of molecular structures also rely on variational inference.
Inference as optimization
Variational inference treats inference (the process of deriving an intractable posterior from a prior distribution plus observed data) as an optimization problem:
- $\log p(x \mid z)$: log-likelihood function, the log probability of observing your data given the parameters
- $q_\theta(z)$: variational distribution, your approximation to the true posterior
You usually choose the KL divergence as the measure of closeness between $q_\theta(z)$ and the posterior; minimizing it maximizes the likelihood of the data under a stochastic model with an arbitrary prior.
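When both $q$ and the prior are Gaussian, the KL regularizer has a closed form, which is one reason Gaussian variational families are so common. A small sketch (`kl_gaussians` is a hypothetical helper name, not from the source):

```python
import math

def kl_gaussians(mu_q, sigma_q, mu_p, sigma_p):
    """Closed-form KL(N(mu_q, sigma_q^2) || N(mu_p, sigma_p^2))."""
    return (math.log(sigma_p / sigma_q)
            + (sigma_q ** 2 + (mu_q - mu_p) ** 2) / (2 * sigma_p ** 2)
            - 0.5)

print(kl_gaussians(0, 1, 0, 1))  # identical distributions -> 0.0
print(kl_gaussians(1, 1, 0, 1))  # mean shifted by 1 -> 0.5
```

Note that swapping the arguments changes the value: KL is not symmetric, which is exactly why the forward/reverse distinction below matters.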
Reverse KL
We’re using the reverse KL, $\mathrm{KL}(q \,\|\, p)$, with $p$ and $q$ swapped relative to the forward direction, because the forward KL would require us to sample from the posterior, which we can’t compute and need variational inference for in the first place.
Note: $q$ is always the thing we’re optimizing when using KL.

Behavioral differences:
- $\mathrm{KL}(p \,\|\, q)$ (forward KL) is “mass-covering”: the expectation is weighted by $p$, so it tries to cover everything, penalizing $q$ for not putting mass where $p$ has mass
- $\mathrm{KL}(q \,\|\, p)$ (reverse KL) is “mode-seeking”: the expectation is weighted by $q$, so it focuses on the main modes of $p$, penalizing $q$ for putting mass where $p$ is low
→ reverse KL therefore tends to underfit (concentrate on one mode) rather than overfit
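The mass-covering vs. mode-seeking behavior can be seen numerically. A sketch under assumed toy settings (not from the source): fit a single Gaussian to a two-mode mixture by grid-searching each KL, approximated by numerical integration.

```python
import math

def normal(z, m, s):
    return math.exp(-0.5 * ((z - m) / s) ** 2) / (s * math.sqrt(2 * math.pi))

# Target p(z): a mixture of two well-separated Gaussians at -3 and +3.
def p(z):
    return 0.5 * normal(z, -3.0, 1.0) + 0.5 * normal(z, 3.0, 1.0)

# Approximate each KL by numerical integration on a grid over [-10, 10].
zs = [i * 0.02 for i in range(-500, 501)]

def kl(f, g):  # KL(f || g) ~ sum_z f(z) log(f(z)/g(z)) dz
    return sum(f(z) * math.log(f(z) / g(z)) * 0.02 for z in zs)

# Coarse grid of candidate single-Gaussian fits q = N(mu, sigma^2).
candidates = [(m * 0.5, 0.5 + s * 0.25) for m in range(-8, 9) for s in range(15)]
fwd = min(candidates, key=lambda c: kl(p, lambda z: normal(z, *c)))
rev = min(candidates, key=lambda c: kl(lambda z: normal(z, *c), p))
print("forward KL fit (mu, sigma):", fwd)  # mass-covering: broad, between the modes
print("reverse KL fit (mu, sigma):", rev)  # mode-seeking: narrow, locked onto one mode
```

The forward fit lands near $\mu = 0$ with a large $\sigma$ (moment matching across both modes), while the reverse fit sits on one mode with $\sigma \approx 1$, illustrating the underfitting tendency above.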