Surprise

Surprise is the log of the inverse probability, i.e. the negative log-probability (negative log-likelihood).
It is also called “Shannon information” or self-information.

$$h(s) = \log \frac{1}{p(s)} = -\log p(s)$$

(s is the state, h is the surprise function, and p(s) is the probability of s)
Extreme cases:
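p(x) = 1 → 0 surprise (a certain outcome is not surprising at all)
p(x) = 0 → ∞ surprise (an impossible outcome would be infinitely surprising)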
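
The definition and its two extreme cases can be checked numerically. The following is a minimal sketch, assuming base-2 logarithms (surprise measured in bits); the note itself does not fix a base, and the function name `surprise` is only illustrative.

```python
import math

def surprise(p: float) -> float:
    """Surprise (self-information) in bits of an outcome with probability p."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability in [0, 1]")
    if p == 0.0:
        return math.inf           # impossible outcome: infinite surprise
    return math.log2(1.0 / p)     # log of the inverse probability

print(surprise(1.0))   # 0.0  (certain outcome)
print(surprise(0.5))   # 1.0  (fair coin flip: one bit)
print(surprise(0.0))   # inf  (impossible outcome)
```

Choosing a different base only rescales the values (for example, the natural log gives nats instead of bits).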

Log-properties

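Because we take the log of the probability, if two independent surprising things happen, we add their surprises instead of multiplying, as we would with their probabilities:

$$h(x, y) = -\log\big(p(x)\,p(y)\big) = -\log p(x) - \log p(y) = h(x) + h(y)$$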
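
A quick numeric check of this additivity, as a sketch: the two probabilities below are made-up values for two independent events, and base-2 logs are again an assumption.

```python
import math

def surprise(p: float) -> float:
    """Surprise in bits; assumes 0 < p <= 1."""
    return math.log2(1.0 / p)

p_a = 0.5    # hypothetical event A, e.g. a fair coin lands heads
p_b = 0.25   # hypothetical event B, independent of A

joint = p_a * p_b                       # probabilities multiply
total = surprise(p_a) + surprise(p_b)   # surprises add

print(surprise(joint), total)  # 3.0 3.0
```

The identity only holds when the events are independent; for dependent events the joint probability is not the product of the two.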

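Entropy is the average surprise of a random variable, with each outcome's surprise weighted by its probability:

$$H(X) = \mathbb{E}_{x \sim p}\big[h(x)\big] = \mathbb{E}_{x \sim p}\big[-\log p(x)\big]$$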
High entropy = high uncertainty → higher avg. surprise. E.g.: uniform distribution = highest entropy (least predictable: every outcome is equally likely and therefore equally surprising).
Low entropy = low uncertainty → less avg. surprise. Certain outcome (1 possibility) → 0 entropy, no surprise, fully predictable.
entropy = uncertainty/unpredictability

We can also write entropy like this in the discrete case:

$$H(X) = \sum_x p(x) \log \frac{1}{p(x)} = -\sum_x p(x) \log p(x)$$
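
The discrete form weights each outcome's surprise, log(1/p), by its probability p and sums over all outcomes. Here is a small sketch of that sum, assuming base-2 logs and the usual convention that outcomes with p = 0 contribute nothing; the three distributions are invented for illustration.

```python
import math

def entropy(probs):
    """Entropy in bits: probability-weighted average surprise."""
    # Outcomes with p == 0 are skipped (convention: 0 * log(1/0) = 0).
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

uniform = [0.25, 0.25, 0.25, 0.25]   # every outcome equally likely
skewed  = [0.97, 0.01, 0.01, 0.01]   # one outcome dominates
certain = [1.0, 0.0, 0.0, 0.0]       # only one possible outcome

for name, dist in [("uniform", uniform), ("skewed", skewed), ("certain", certain)]:
    print(name, entropy(dist))
```

The uniform distribution comes out highest (2 bits for four outcomes), the skewed one is much lower, and the certain one is exactly 0.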

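More surprising things are harder to compress: in information theory, an optimal code spends about $\log_2(1/p)$ bits on an outcome with probability p, so rare outcomes get long codewords, and the entropy is a lower bound on the average code length.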
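
To make the compression link concrete, here is a sketch that assigns each symbol a codeword length of ceil(log2(1/p)) bits (the Shannon code length, a standard result that is not taken from this note) and compares the expected length with the entropy. The four-symbol distribution is hypothetical and chosen with power-of-two probabilities so the numbers come out exact.

```python
import math

# Hypothetical source distribution over four symbols.
probs = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}

# Shannon code length: the surprise of each symbol, rounded up to whole bits.
code_lengths = {s: math.ceil(math.log2(1.0 / p)) for s, p in probs.items()}

expected_length = sum(p * code_lengths[s] for s, p in probs.items())
entropy = sum(p * math.log2(1.0 / p) for p in probs.values())

print(code_lengths)              # {'a': 1, 'b': 2, 'c': 3, 'd': 3}
print(expected_length, entropy)  # 1.75 1.75
```

For distributions whose probabilities are not powers of two, the rounding makes the expected length slightly larger than the entropy, but never smaller.
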
In machine learning, for example when predicting how likely a football team is to win, we want low surprise (loss) on the actual results, which means the model assigned a high probability to what really happened.

In machine learning, we want the next target to have a very high probability, i.e. low surprise. Averaging this surprise over the training targets gives the negative log-likelihood loss.
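
As a rough illustration of surprise as a loss, the sketch below computes the mean negative log-likelihood for the football example. The match results and all predicted win probabilities are made up, the helper `nll` is hypothetical, and natural logs (nats) are assumed, as is common for ML losses.

```python
import math

def nll(predicted_win_probs, outcomes):
    """Mean negative log-likelihood (in nats) of the observed results."""
    total = 0.0
    for p_win, won in zip(predicted_win_probs, outcomes):
        # Probability the model assigned to what actually happened.
        p_observed = p_win if won else 1.0 - p_win
        total += -math.log(p_observed)
    return total / len(outcomes)

outcomes = [True, True, False, True]      # hypothetical match results

confident_right = [0.9, 0.8, 0.2, 0.7]    # high probability on what happened
uninformed      = [0.5, 0.5, 0.5, 0.5]    # always predicts 50/50
confident_wrong = [0.1, 0.2, 0.9, 0.3]    # high probability on the wrong result

for name, preds in [("confident_right", confident_right),
                    ("uninformed", uninformed),
                    ("confident_wrong", confident_wrong)]:
    print(name, nll(preds, outcomes))
```

The loss is lowest for the model that put high probability on the actual results and highest for the one that was confidently wrong, which is the behaviour the loss is meant to reward and punish.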

References

The Key Equation Behind Probability