Variational inference
Variational inference is a general method for approximating complex and unknown distributions by the closest distribution within a tractable family. Wherever the need to approximate distributions appears in machine learning, whether in supervised, unsupervised, or reinforcement learning, a form of variational inference can be used.
Consider a generic probabilistic model that defines a generative process in which observed data $x$ is generated from a set of unobserved variables $z$ using a data distribution $p(x \mid z)$ and a prior distribution $p(z)$.
In supervised learning, the unobserved variables might correspond to the weights of a regression problem, and in unsupervised learning to latent variables.
The posterior distribution $p(z \mid x)$ of this generative process is unknown, and is approximated by a variational distribution $q_\theta(z)$, which is a parameterized family of distributions with variational parameters $\theta$, e.g., the mean and variance of a Gaussian distribution. Finding the distribution $q_\theta(z)$ that is closest to $p(z \mid x)$ (e.g., in the KL sense) leads to an objective, the variational free energy, that maximizes an expected log-likelihood subject to a regularization constraint that encourages closeness between the variational distribution $q_\theta(z)$ and the prior distribution $p(z)$:

$$\mathcal{F}(\theta) = \mathbb{E}_{q_\theta(z)}\left[\log p(x \mid z)\right] - \mathrm{KL}\left(q_\theta(z) \,\|\, p(z)\right).$$

Optimizing the distribution $q_\theta(z)$ requires the gradient of the free energy with respect to the variational parameters $\theta$:

$$\nabla_\theta \mathcal{F}(\theta) = \nabla_\theta \, \mathbb{E}_{q_\theta(z)}\!\left[\log p(x \mid z) - \log \frac{q_\theta(z)}{p(z)}\right].$$
This is an objective of the form we encounter in Monte Carlo estimation, where the cost function is the term in square brackets and the measure is the variational distribution $q_\theta(z)$.
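To make the Monte Carlo view concrete, here is a minimal sketch on an assumed toy model (not from the source): prior $z \sim \mathcal{N}(0,1)$, likelihood $x \mid z \sim \mathcal{N}(z,1)$, observed $x = 2$, whose exact posterior is $\mathcal{N}(1, 0.5)$. A Gaussian $q_\theta(z)$ is fit by stochastic gradient ascent on the free energy, using the reparameterization trick so the per-sample gradients can be written by hand.

```python
import random
import math

random.seed(0)

# Toy model: prior z ~ N(0, 1), likelihood x | z ~ N(z, 1), observed x = 2.
# The exact posterior is N(1, 0.5), so we can check the fitted q against it.
x = 2.0

# Variational parameters of q(z) = N(mu, sigma^2); optimize log_sigma so
# sigma stays positive.
mu, log_sigma = 0.0, 0.0
lr, n_samples = 0.05, 50

for step in range(3000):
    grad_mu, grad_log_sigma = 0.0, 0.0
    sigma = math.exp(log_sigma)
    for _ in range(n_samples):
        eps = random.gauss(0.0, 1.0)
        z = mu + sigma * eps          # reparameterization trick
        # d/dz [log p(x|z) + log p(z)] = (x - z) - z; with z reparameterized,
        # log q contributes nothing through mu and +1 through log_sigma.
        dlogp = (x - z) - z
        grad_mu += dlogp
        grad_log_sigma += dlogp * sigma * eps + 1.0
    mu += lr * grad_mu / n_samples          # gradient ascent on the free energy
    log_sigma += lr * grad_log_sigma / n_samples

print(mu, math.exp(log_sigma))  # should approach 1.0 and sqrt(0.5) ≈ 0.707
```

The closed-form per-sample gradients are specific to this Gaussian toy problem; in practice an autodiff framework computes them from the sampled objective.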
This problem also appears in other research areas, especially in statistical physics, information theory and utility theory. Many of the solutions that have been developed for scene understanding, representation learning, photo-realistic image generation, or the simulation of molecular structures also rely on variational inference.
Inference as optimization
Variational inference treats inference (the process of deriving an intractable posterior from a prior distribution plus observed data) as an optimization problem:
- $\log p(x \mid z)$: log-likelihood function, the log probability of observing your data given the parameters
- $q_\theta(z)$: variational distribution, your approximation to the true posterior
You usually choose the KL divergence as the measure of closeness between $q_\theta(z)$ and the posterior; minimizing it maximizes the likelihood of the data under a stochastic model with an arbitrary prior.
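When both $q$ and the prior are Gaussian, the KL regularizer has a closed form, which is one reason Gaussian variational families are so common. A small sketch (`kl_gaussians` is a hypothetical helper name, not from the source):

```python
import math

def kl_gaussians(mu_q, sigma_q, mu_p, sigma_p):
    """Closed-form KL(N(mu_q, sigma_q^2) || N(mu_p, sigma_p^2))."""
    return (math.log(sigma_p / sigma_q)
            + (sigma_q ** 2 + (mu_q - mu_p) ** 2) / (2 * sigma_p ** 2)
            - 0.5)

print(kl_gaussians(0, 1, 0, 1))  # identical distributions -> 0.0
print(kl_gaussians(1, 1, 0, 1))  # mean shifted by 1 -> 0.5
```

Note that swapping the arguments changes the value: KL is not symmetric, which is exactly why the forward/reverse distinction below matters.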
Reverse KL
We’re using the reverse KL, $\mathrm{KL}(q \,\|\, p)$, with $p$ and $q$ swapped relative to the forward direction, because the forward KL would require us to sample from the posterior, which we can’t compute and need variational inference for in the first place.
Note: $q$ is always the thing we’re optimizing when using KL.

Behavioral differences:
- $\mathrm{KL}(p \,\|\, q)$ (forward KL) is “mass-covering”: the expectation is weighted by $p$, so it tries to cover everything, penalizing $q$ for not putting mass where $p$ has mass
- $\mathrm{KL}(q \,\|\, p)$ (reverse KL) is “mode-seeking”: the expectation is weighted by $q$, so it focuses on the main modes of $p$, penalizing $q$ for putting mass where $p$ is low
→ reverse KL therefore tends to underfit (concentrate on one mode) rather than overfit
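The mass-covering vs. mode-seeking behavior can be seen numerically. A sketch under assumed toy settings (not from the source): fit a single Gaussian to a two-mode mixture by grid-searching each KL, approximated by numerical integration.

```python
import math

def normal(z, m, s):
    return math.exp(-0.5 * ((z - m) / s) ** 2) / (s * math.sqrt(2 * math.pi))

# Target p(z): a mixture of two well-separated Gaussians at -3 and +3.
def p(z):
    return 0.5 * normal(z, -3.0, 1.0) + 0.5 * normal(z, 3.0, 1.0)

# Approximate each KL by numerical integration on a grid over [-10, 10].
zs = [i * 0.02 for i in range(-500, 501)]

def kl(f, g):  # KL(f || g) ~ sum_z f(z) log(f(z)/g(z)) dz
    return sum(f(z) * math.log(f(z) / g(z)) * 0.02 for z in zs)

# Coarse grid of candidate single-Gaussian fits q = N(mu, sigma^2).
candidates = [(m * 0.5, 0.5 + s * 0.25) for m in range(-8, 9) for s in range(15)]
fwd = min(candidates, key=lambda c: kl(p, lambda z: normal(z, *c)))
rev = min(candidates, key=lambda c: kl(lambda z: normal(z, *c), p))
print("forward KL fit (mu, sigma):", fwd)  # mass-covering: broad, between the modes
print("reverse KL fit (mu, sigma):", rev)  # mode-seeking: narrow, locked onto one mode
```

The forward fit lands near $\mu = 0$ with a large $\sigma$ (moment matching across both modes), while the reverse fit sits on one mode with $\sigma \approx 1$, illustrating the underfitting tendency above.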