Surprise
Surprise is the log of the inverse probability (negative log likelihood).
It is also called “Shannon information” (self-information).
h(s) = log(1/p(s)) = -log p(s), where s is the state and h is the surprise function.
Extreme cases:
p(x) = 1 → 0 surprise
p(x) = 0 → infinite surprise
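A minimal sketch of this definition in Python (the helper name `surprise` is my own, not from any library); it uses the natural log, so surprise is measured in nats:

```python
import math

def surprise(p: float) -> float:
    """Surprise (self-information) of an outcome with probability p, in nats."""
    return -math.log(p)  # equivalently log(1/p)

print(surprise(1.0))   # 0.0  -> a certain event carries no surprise
print(surprise(0.5))   # ~0.69
print(surprise(1e-9))  # ~20.7 -> very unlikely events are very surprising
# surprise(0.0) would be infinitely surprising (math.log(0) raises ValueError)
```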
Log-properties
Because we take the log of the probability, if two unrelated (independent) surprising things happen, we add their surprises instead of multiplying them like we would with probabilities: h(x, y) = -log(p(x)·p(y)) = -log p(x) - log p(y) = h(x) + h(y).
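A quick numeric check of this additivity, using two made-up independent probabilities (0.5 and 0.1):

```python
import math

p_x, p_y = 0.5, 0.1         # two independent events (illustrative values)
h = lambda p: -math.log(p)  # surprise in nats

joint = h(p_x * p_y)        # surprise of both events happening together
summed = h(p_x) + h(p_y)    # sum of the individual surprises
print(joint, summed)        # both ~2.996: -log(p*q) = -log p - log q
```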
Entropy is the expected value of surprise: H(X) = E[h(X)] = E[-log p(X)].
→ Average surprise of a random variable weighted by its probability.
High entropy = high uncertainty → higher avg. surprise. E.g. a uniform distribution has the highest entropy (unpredictable: every outcome is equally surprising).
Low entropy = low uncertainty → lower avg. surprise. A certain outcome (only one possibility) → 0 entropy, no surprise, fully predictable.
→ entropy = uncertainty/unpredictability
In the discrete case we can also write it as: H(X) = -Σ_x p(x) log p(x)
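A small sketch of this discrete formula, comparing a uniform, a skewed, and a deterministic distribution (the distributions are invented for illustration):

```python
import math

def entropy(probs):
    """Discrete entropy H(X) = -sum p(x) log p(x), in nats (0·log 0 treated as 0)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

print(entropy([0.25, 0.25, 0.25, 0.25]))  # ~1.386 (= log 4): uniform -> maximal entropy
print(entropy([0.7, 0.1, 0.1, 0.1]))      # ~0.94: less uncertain -> lower entropy
print(entropy([1.0, 0.0, 0.0, 0.0]))      # 0.0: certain outcome -> no surprise at all
```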
More surprising things are harder to compress (in information theory, an optimal code spends about -log₂ p(x) bits on an outcome with probability p(x)).
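For intuition (the numbers below are my own example), rarer outcomes get longer codewords under such a code:

```python
import math

for p in (0.5, 0.125, 0.001):
    print(p, -math.log2(p))  # 1.0, 3.0, ~9.97 bits: rarer (more surprising) -> longer codeword
```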
In machine learning, for example when predicting how likely a football team is to win, we want the model to be surprised as little as possible by the actual result (low loss), so that it estimates confidently.
More generally, we want the observed target to have a very high probability under the model, i.e. low surprise → negative log-likelihood loss.
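A hedged sketch of the resulting negative log-likelihood loss for the football example (the predicted probabilities are invented for illustration):

```python
import math

# Model's predicted probability for the outcome that actually happened in each match
# (e.g. the probability it assigned to "home team wins" when the home team did win).
predicted_p_true = [0.9, 0.6, 0.2]

nll = [-math.log(p) for p in predicted_p_true]
print(nll)                  # [~0.105, ~0.511, ~1.609]: low surprise when confident and correct
print(sum(nll) / len(nll))  # average NLL over the data -> the loss we minimize in training
```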