Shannon information / Self-information / Surprise

The surprise (self-information) of an outcome $x$ is $h(x) = -\log p(x)$. The base of the logarithm sets the units: with $\log_2$, we measure in bits.

(Figure: the surprise function $h$ plotted against the state $s$.)

Extreme cases:
p(x) = 1 → 0 surprise
p(x) = 0 → infinite surprise
Information content in bits (see the sketch after this list):
p(x) = 1/2 → 1 bit
p(x) = 1/4 → 2 bits
p(x) = 1/8 → 3 bits
p(x) = 1/10 → ~3.32 bits
p(x) = 1/100 → ~6.64 bits
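
As a quick check of these numbers, here is a minimal Python sketch (the helper name `surprise_bits` is just illustrative) that evaluates $-\log_2 p(x)$ for the probabilities above:

```python
import math

def surprise_bits(p: float) -> float:
    """Self-information -log2(p) in bits; infinite for impossible events."""
    if p == 0:
        return math.inf
    return -math.log2(p)

for p in [1, 1/2, 1/4, 1/8, 1/10, 1/100]:
    print(f"p(x) = {p:<6g} -> {surprise_bits(p):.2f} bits")
```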

The expected value of surprise is entropy:

Entropy (Information Theory)

Entropy is the expected value of the surprise/self-information of a discrete random variable $X$:

$$H(X) = \mathbb{E}\big[-\log_2 p(X)\big] = -\sum_x p(x)\log_2 p(x)$$

Note: the continuous form, "differential entropy", is not analogous and not a good measure of uncertainty or information, as it can be negative (even infinitely so) and is not invariant under continuous coordinate transformations. See Wikipedia.
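
A minimal sketch of the discrete entropy in bits, assuming the distribution is given as a plain list of probabilities:

```python
import math

def entropy_bits(probs: list[float]) -> float:
    """H(X) = -sum_x p(x) * log2 p(x); terms with p(x) = 0 contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy_bits([0.5, 0.5]))    # fair coin: 1.0 bit
print(entropy_bits([0.99, 0.01]))  # rigged coin: ~0.081 bits
print(entropy_bits([0.25] * 4))    # uniform over 4 outcomes: 2.0 bits
```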


Additivity of information

The probabilities of two independent events multiply; intuitively, their information content should add.
The logarithm turns multiplication into addition:

$$h(x, y) = -\log_2\big(p(x)\,p(y)\big) = -\log_2 p(x) - \log_2 p(y) = h(x) + h(y)$$
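
A quick numerical check of this additivity, using two hypothetical independent events with probabilities 1/4 and 1/8:

```python
import math

p_x, p_y = 1/4, 1/8          # two independent events
joint = p_x * p_y            # independence: probabilities multiply

h = lambda p: -math.log2(p)  # surprise in bits
print(h(joint))              # 5.0 bits
print(h(p_x) + h(p_y))       # 2.0 + 3.0 = 5.0 bits, the same value
```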

Cross-entropy $H(p, q) = -\sum_x p(x)\log_2 q(x)$ is not symmetric: the actual distribution $p$ and the believed (model) distribution $q$ are not interchangeable.

E.g.: if you believe a coin is fair (0.5/0.5), but it is actually rigged (0.99/0.01), then the CE is:

$$-(0.99\,\log_2 0.5 + 0.01\,\log_2 0.5) = 1 \text{ bit}$$

If you believe it is rigged when it is actually fair, then the CE is:

$$-(0.5\,\log_2 0.99 + 0.5\,\log_2 0.01) \approx 3.33 \text{ bits}$$

In the second case, the cross-entropy is much larger: half of the time you are extremely surprised to see tails, and this extreme surprise dominates the average surprise.
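
A minimal Python sketch (the function name is illustrative) that reproduces both numbers and shows the asymmetry:

```python
import math

def cross_entropy_bits(p_true, q_believed):
    """H(p, q) = -sum_x p(x) * log2 q(x), in bits."""
    return -sum(p * math.log2(q) for p, q in zip(p_true, q_believed) if p > 0)

fair, rigged = [0.5, 0.5], [0.99, 0.01]
print(cross_entropy_bits(rigged, fair))   # actually rigged, believed fair: 1.0 bit
print(cross_entropy_bits(fair, rigged))   # actually fair, believed rigged: ~3.33 bits
```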


More surprising things are harder to compress.
In machine learning, for example when we predict how likely a football team is to win, we want to achieve a very low surprise (loss), so that we can estimate confidently.

We want the observed target to have a very high probability under the model, i.e. a low surprise. This is the negative log-likelihood loss:

$$\mathcal{L} = -\log q_\theta(y)$$
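
A minimal sketch of the per-example NLL for the football example; the predicted probabilities are made up for illustration, and bits (log base 2) are used to match the rest of the note, whereas ML losses usually use the natural log:

```python
import math

# Hypothetical model output for one match: P(win), P(draw), P(loss).
predicted = {"win": 0.7, "draw": 0.2, "loss": 0.1}

observed = "win"                        # the target that actually happened
print(-math.log2(predicted[observed]))  # low surprise -> low loss (~0.51 bits)

observed = "loss"                       # an outcome the model thought unlikely
print(-math.log2(predicted[observed]))  # high surprise -> high loss (~3.32 bits)
```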

Don’t get confused

| Quantity | Formula | Depends on |
| --- | --- | --- |
| Surprise / self-information | $h(x) = -\log p(x)$ | a single outcome $x$ and a distribution $p$ |
| Per-example NLL (log-likelihood) | $-\log q_\theta(y)$ | the observed target $y$ and the model $q_\theta$ |
| Cross-entropy | $H(p, q) = -\sum_x p(x)\log q(x)$ | two distributions $p$, $q$ |
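
A small sketch with toy numbers tying the rows together: averaging the per-example NLL over outcomes drawn from the true distribution $p$ estimates the cross-entropy $H(p, q)$:

```python
import math, random

p_true = [0.5, 0.5]      # actual coin (fair)
q_model = [0.99, 0.01]   # believed/model distribution (rigged)

# Average per-example NLL over outcomes sampled from the true distribution...
random.seed(0)
samples = random.choices([0, 1], weights=p_true, k=100_000)
avg_nll = sum(-math.log2(q_model[y]) for y in samples) / len(samples)

# ...approaches the cross-entropy H(p, q).
cross_entropy = -sum(p * math.log2(q) for p, q in zip(p_true, q_model))
print(avg_nll, cross_entropy)  # both ~3.33 bits
```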
