Likelihood function

A likelihood scores how plausible different parameter values $\theta$ make the observed data $x$.
It has the same formula as the probability distribution $p(x \mid \theta)$ in $x$, but viewed as a function of $\theta$, with the data $x$ held fixed.

We vary the parameters $\theta$ to see how likely they make the observed data.
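As a small illustration of "vary $\theta$, keep the data fixed", here is a Python sketch. It assumes a binomial model and hypothetical data (7 successes in 10 trials); neither the model nor the numbers come from the notes above.

```python
from math import comb

def likelihood(theta, k=7, n=10):
    """Binomial probability of the fixed data (k successes in n trials),
    read as a function of theta. The data values are hypothetical."""
    return comb(n, k) * theta**k * (1 - theta)**(n - k)

# Score a handful of candidate parameter values against the same data.
candidates = [0.1, 0.3, 0.5, 0.7, 0.9]
scores = {theta: likelihood(theta) for theta in candidates}
best = max(scores, key=scores.get)

for theta, score in scores.items():
    print(f"theta={theta:.1f}  likelihood={score:.4f}")
print("most plausible candidate:", best)
```

Among these candidates, the one closest to the observed success fraction scores highest, which is all the likelihood is meant to express.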

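Note: While $\int p(x \mid \theta)\,dx = 1$ for each $\theta$, the likelihood need not integrate to 1 over $\theta$: in general $\int p(x \mid \theta)\,d\theta \neq 1$, i.e. it is not a probability distribution (over $\theta$).
Note the equivalent and equally context-dependent notation $L(\theta \mid x)$. Writing $\theta \mapsto p(x \mid \theta)$ avoids this ambiguity by making the variable explicit.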
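The normalization claim can be checked numerically. The sketch below again assumes a binomial model with hypothetical values (n = 10 trials, 7 observed successes): summing over all possible data at fixed $\theta$ gives 1, while integrating over $\theta$ at fixed data gives 1/(n + 1), not 1.

```python
from math import comb

def binom_pmf(k, n, theta):
    """Probability of k successes in n trials given success probability theta."""
    return comb(n, k) * theta**k * (1.0 - theta)**(n - k)

n, k_obs = 10, 7          # hypothetical experiment and observation

# Fixed theta, sum over every possible data value: a proper distribution in k.
theta0 = 0.3
total_over_data = sum(binom_pmf(k, n, theta0) for k in range(n + 1))

# Fixed data, integrate over theta on [0, 1] with the midpoint rule.
m = 100_000
area_over_theta = sum(binom_pmf(k_obs, n, (j + 0.5) / m) for j in range(m)) / m

print("sum over data      :", total_over_data)
print("integral over theta:", area_over_theta, "vs 1/(n+1) =", 1 / (n + 1))
```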

Careful: $p(\theta \mid x)$ (the posterior) is not a mirror image of the probability distribution $p(x \mid \theta)$, regardless of whether the latter is viewed as a function of $x$ or $\theta$. [1] They are related through Bayes’ theorem:

Bayes’ Theorem: Normalized likelihood = posterior

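The likelihood is not normalized over $\theta$. That’s just another way of saying that it’s not a probability distribution in $\theta$, i.e. in general $\int p(x \mid \theta)\,d\theta \neq 1$.
Bayes’ theorem multiplies the likelihood by the prior $p(\theta)$ and normalizes the product to get a proper probability distribution over $\theta$ (the posterior):

$$p(\theta \mid x) = \frac{p(x \mid \theta)\,p(\theta)}{p(x)}, \qquad p(x) = \int p(x \mid \theta)\,p(\theta)\,d\theta$$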

Since $p(x)$ does not depend on $\theta$, we can write this as a proportionality:

$$p(\theta \mid x) \propto p(x \mid \theta)\,p(\theta)$$

→ The posterior is the prior weighted by the likelihood, up to multiplication by a constant $1/p(x)$.
→ Dropping this constant changes nothing for inference about $\theta$, since it rescales every value of $\theta$ by the same factor.
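A grid approximation makes the normalization step concrete. This is a sketch under assumptions that are not in the notes: a binomial likelihood with hypothetical data (7 successes in 10 trials) and a Beta(2, 2) prior, whose density is $6\theta(1-\theta)$.

```python
from math import comb

n, k_obs = 10, 7                       # hypothetical data
m = 1000                               # number of grid cells on [0, 1]
width = 1.0 / m
thetas = [(j + 0.5) / m for j in range(m)]

# Likelihood of the fixed data at each grid point.
likelihood = [comb(n, k_obs) * t**k_obs * (1 - t)**(n - k_obs) for t in thetas]
# Assumed prior: Beta(2, 2) density.
prior = [6 * t * (1 - t) for t in thetas]

# Prior weighted by likelihood, then normalized by the evidence p(x).
unnormalized = [l * p for l, p in zip(likelihood, prior)]
evidence = sum(unnormalized) * width
posterior = [u / evidence for u in unnormalized]

# The constant does not move the mode; it only fixes the total area to 1.
mode_unnorm = thetas[max(range(m), key=unnormalized.__getitem__)]
mode_post = thetas[max(range(m), key=posterior.__getitem__)]
post_area = sum(posterior) * width
post_mean = sum(t * p for t, p in zip(thetas, posterior)) * width

print("mode (unnormalized):", mode_unnorm)
print("mode (posterior)   :", mode_post)
print("posterior area     :", post_area)
print("posterior mean     :", post_mean)
```

With this conjugate choice the exact posterior is Beta(9, 5), so the grid mean can be compared against 9/14.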

iid factorization of the likelihood

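Independence turns the joint probability of the data $x = (x_1, \dots, x_n)$ into a product:

$$p(x_1, \dots, x_n \mid \theta) = \prod_{i=1}^{n} p(x_i \mid \theta)$$

Without it, the chain rule of probability would only give:

$$p(x_1, \dots, x_n \mid \theta) = \prod_{i=1}^{n} p(x_i \mid x_1, \dots, x_{i-1}, \theta)$$

Being identically distributed makes the density $p(\cdot \mid \theta)$ the same in every factor (otherwise each $x_i$ would have its own marginal $p_i(x_i \mid \theta)$).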
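The factorization can be sketched in Python. The example assumes iid Bernoulli observations (hypothetical coin flips, 7 ones in 10) and shows that the product of per-observation probabilities equals the exponential of the summed log-probabilities.

```python
from math import exp, log, prod

# Hypothetical iid Bernoulli observations: 7 ones and 3 zeros.
data = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]

def bernoulli_pmf(x, theta):
    """Same one-observation distribution in every factor (identically distributed)."""
    return theta if x == 1 else 1.0 - theta

def likelihood(theta, xs):
    """Independence: the joint probability is a product over observations."""
    return prod(bernoulli_pmf(x, theta) for x in xs)

def log_likelihood(theta, xs):
    """Taking logs turns the product into a sum."""
    return sum(log(bernoulli_pmf(x, theta)) for x in xs)

theta = 0.6
lik = likelihood(theta, data)
loglik = log_likelihood(theta, data)
closed_form = theta**7 * (1 - theta)**3   # 7 ones, 3 zeros

print("product of factors  :", lik)
print("exp(sum of logs)    :", exp(loglik))
print("theta^7 (1-theta)^3 :", closed_form)
```

For long data sets the product underflows quickly, which is one practical reason to work with the log-likelihood sum instead.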

Properties of the likelihood

Some but not all properties of the probability carry over.
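Nonnegativity: $p(x \mid \theta) \ge 0$ for every $\theta$.
Chain rule of probability still works (in the data): $p(x_1, x_2 \mid \theta) = p(x_1 \mid \theta)\,p(x_2 \mid x_1, \theta)$.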

Some properties don’t carry over:
Not a distribution over $\theta$: $\int p(x \mid \theta)\,d\theta$ need not equal 1.
No countable additivity over disjoint parameter sets: likelihoods don’t add like probabilities; they multiply across independent observations, so it is the log-likelihood that adds.
Marginalization over $\theta$ is not defined without a prior/measure on $\theta$.

Footnotes

  1. I certainly would never confuse these…