Marginal Likelihood / Evidence
The marginal likelihood is the probability of the observed data under a model after marginalizing (integrating out) all unobserved variables (e.g. latent variables; if Bayesian, the model parameters too). Using the chain rule of probability to decompose the joint distribution:

$$p(x, u) = p(x \mid u)\, p(u)$$

→ We get the total probability of the data under the model:

$$p(x) = \int p(x, u)\, du = \int p(x \mid u)\, p(u)\, du$$

where $u$ stands for whatever is unobserved (latents $z$, parameters $\theta$, or both).
Below are some common formulations of the marginal likelihood in different contexts:
Marginal likelihood over latents / "data likelihood"
Here we integrate over the latent variables $z$, given fixed model parameters $\theta$:

$$p(x \mid \theta) = \int p(x, z \mid \theta)\, dz = \int p(x \mid z, \theta)\, p(z \mid \theta)\, dz$$

This is often called the data likelihood in the context of latent variable models, as it gives the likelihood of the observed data under the model, accounting for all possible configurations of the latent variables $z$.
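As a minimal sketch (assuming a toy 2-component Gaussian mixture, where the latent $z$ is the discrete component assignment and $\theta$ collects the weights, means, and standard deviations; all values below are made up for illustration), the integral over $z$ reduces to a weighted sum of component densities:

```python
import numpy as np
from scipy.stats import norm

# Toy latent-variable model: 2-component Gaussian mixture.
# Latent z in {0, 1} is the component assignment; theta = (weights, means, stds).
weights = np.array([0.3, 0.7])
means = np.array([-2.0, 1.0])
stds = np.array([0.5, 1.5])

def data_likelihood(x):
    """p(x | theta) = sum_z p(z | theta) * p(x | z, theta).
    The integral over z becomes a sum because z is discrete."""
    component_densities = norm.pdf(x[:, None], loc=means, scale=stds)  # shape (N, 2)
    return component_densities @ weights  # marginalize out z

x = np.array([-2.1, 0.0, 1.3])
print(data_likelihood(x))  # p(x_i | theta) for each observation
```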
Marginal likelihood over parameters / "model evidence"
Here we additionally integrate over the model parameters $\theta$, weighted by the prior $p(\theta)$:

$$p(x) = \int p(x \mid \theta)\, p(\theta)\, d\theta$$

(In a latent variable model, $p(x \mid \theta)$ is itself the marginal likelihood over latents from above.)
It’s also called evidence, because it quantifies how much the data “evidences” or supports the model.
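A minimal sketch of estimating the evidence by simple Monte Carlo (assuming a toy setup: Gaussian observations with known noise and a Gaussian prior on the unknown mean; the data, prior, and sample count are illustrative choices, and in practice you would prefer more stable estimators than naive prior sampling):

```python
import numpy as np
from scipy.stats import norm
from scipy.special import logsumexp

rng = np.random.default_rng(0)
data = np.array([0.8, 1.2, 0.5, 1.9, 1.1])  # observed data
sigma = 1.0                                  # known observation noise

# Prior over the unknown mean: theta ~ N(0, 2^2)
thetas = rng.normal(0.0, 2.0, size=100_000)

# p(x) = ∫ p(x | theta) p(theta) dtheta ≈ average of p(x | theta) over prior samples
log_lik = norm.logpdf(data[:, None], loc=thetas, scale=sigma).sum(axis=0)
log_evidence = logsumexp(log_lik) - np.log(len(thetas))
print(log_evidence)
```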
The marginal likelihood normalizes the posterior
The marginal likelihood represents the total probability of observing our data, accounting for all possible parameter settings according to our prior beliefs. It serves as the normalizing constant in Bayes’ theorem:

$$p(\theta \mid x) = \frac{p(x \mid \theta)\, p(\theta)}{p(x)}, \qquad \text{where} \quad p(x) = \int p(x \mid \theta)\, p(\theta)\, d\theta$$
Unlike the likelihood, which evaluates parameter settings individually, the marginal likelihood evaluates the model as a whole, integrating over the entire parameter space.
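A minimal sketch of this normalizing role, using a grid approximation over a single parameter (the same toy Gaussian setup as above; grid, prior, and data are illustrative assumptions):

```python
import numpy as np
from scipy.stats import norm

data = np.array([0.8, 1.2, 0.5, 1.9, 1.1])
theta_grid = np.linspace(-5, 5, 2001)       # discretized parameter space
d_theta = theta_grid[1] - theta_grid[0]

prior = norm.pdf(theta_grid, loc=0.0, scale=2.0)                              # p(theta)
likelihood = norm.pdf(data[:, None], loc=theta_grid, scale=1.0).prod(axis=0)  # p(x | theta)

# Marginal likelihood: integrate likelihood * prior over the grid
marginal_likelihood = np.sum(likelihood * prior) * d_theta

# Posterior via Bayes' theorem; dividing by the marginal likelihood makes it integrate to 1
posterior = likelihood * prior / marginal_likelihood
print(np.sum(posterior) * d_theta)  # ≈ 1.0
```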
The marginal likelihood naturally implements Occam’s razor for model comparison.
Complex, more flexible models can fit the observed data better, but they spread their prior probability over a larger parameter space. The model with the highest marginal likelihood balances goodness of fit against model complexity, because it must assign high probability to the observed data averaged over its entire parameter space (weighted by the prior), not just at the best-fitting parameters (i.e. it can’t “cheat” with a few parameter settings that fit extremely well while most others don’t).
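A minimal sketch of this Occam effect (a hypothetical comparison: two models for the same Gaussian data that differ only in how widely they spread their prior over the mean; data, priors, and sample sizes are illustrative):

```python
import numpy as np
from scipy.stats import norm
from scipy.special import logsumexp

rng = np.random.default_rng(0)
data = rng.normal(loc=0.5, scale=1.0, size=20)  # data generated near theta = 0.5

def log_evidence(data, prior_std, n_samples=200_000):
    """Monte Carlo estimate of log p(data | M) with theta ~ N(0, prior_std^2)."""
    thetas = rng.normal(0.0, prior_std, size=n_samples)
    log_lik = norm.logpdf(data[:, None], loc=thetas, scale=1.0).sum(axis=0)
    return logsumexp(log_lik) - np.log(n_samples)

# Both models can fit the data, but the diffuse-prior model spreads its prior
# probability over many parameter values that explain the data poorly,
# so its marginal likelihood ends up lower.
print(log_evidence(data, prior_std=1.0))    # tighter prior
print(log_evidence(data, prior_std=100.0))  # much more diffuse prior, lower evidence
```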