The cross-entropy (CE) is the average surprise (entropy) you get observing a random variable governed by probability distribution $p$, while believing in its model $q$:

$$H(p, q) = \sum_x p(x)\,\bigl(-\log q(x)\bigr) = -\sum_x p(x) \log q(x)$$

Where $H(p, q)$ is the (cross-)entropy, $-\log q(x)$ is the surprise of state $x$ under the model, and $p(x)$ is the probability of state $x$, or how common it is.
$p$ represents data/observations/a measured probability distribution; $q$ represents a theory/model/description/approximation of $p$.
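As a minimal sketch of this formula (the function name and the choice of base-2 logarithms are mine, not from the note), the cross-entropy of two discrete distributions can be computed directly:

```python
import math

def cross_entropy(p, q):
    """Average surprise -log2 q(x), weighted by how common each state actually is, p(x)."""
    return -sum(px * math.log2(qx) for px, qx in zip(p, q) if px > 0)

# A model that matches the data exactly gives the entropy of the data itself:
print(cross_entropy([0.5, 0.5], [0.5, 0.5]))  # 1.0 bit, the entropy of a fair coin
```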

CE can tell you how good your model is.
If your model is perfect, i.e. $q = p$, then the cross-entropy is simply equal to the entropy (uncertainty) of the true distribution: $H(p, p) = H(p)$.

The cross-entropy can never be lower than the entropy of the generating distribution:

$$H(p, q) \geq H(p)$$

$p$ and $q$ are not interchangeable: in general, $H(p, q) \neq H(q, p)$.
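One standard way to see the inequality above (a Jensen's-inequality argument not spelled out in the note; for simplicity assume $p(x) > 0$ for every state): since $-\log$ is convex,

$$H(p, q) - H(p) \;=\; \sum_x p(x) \log \frac{p(x)}{q(x)} \;\geq\; -\log \sum_x p(x)\,\frac{q(x)}{p(x)} \;=\; -\log \sum_x q(x) \;=\; -\log 1 \;=\; 0$$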

E.g.: If you believe a coin is fair (0.5/0.5), but it is rigged (0.99/0.01), then the CE is:

$$H(p, q) = -\bigl(0.99 \log_2 0.5 + 0.01 \log_2 0.5\bigr) = 1 \text{ bit}$$

If you believe it is rigged when it is actually fair, then the CE is:

$$H(p, q) = -\bigl(0.5 \log_2 0.99 + 0.5 \log_2 0.01\bigr) \approx 3.33 \text{ bits}$$

In the second case, the cross-entropy is much larger: half of the time you are extremely surprised to see tails, and this extreme surprise dominates the average surprise.
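A quick numeric check of both cases, reusing the cross_entropy sketch from above (base-2 logs, i.e. bits, are my assumption):

```python
import math

def cross_entropy(p, q):
    return -sum(px * math.log2(qx) for px, qx in zip(p, q) if px > 0)

fair, rigged = (0.5, 0.5), (0.99, 0.01)

print(cross_entropy(rigged, fair))   # truth rigged, belief fair:   1.0 bit
print(cross_entropy(fair, rigged))   # truth fair, belief rigged: ~3.33 bits
```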

The KL-divergence subtracts the uncertainty (entropy) about the true distribution from the cross-entropy, leaving us with a measure of how similar the two distributions are:

$$D_{KL}(p \,\|\, q) = H(p, q) - H(p)$$
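Continuing the coin example (the same hedged sketch as above, with base-2 logs assumed), the KL-divergence is the extra surprise you pay for believing the wrong model, on top of the unavoidable uncertainty of the truth:

```python
import math

def cross_entropy(p, q):
    return -sum(px * math.log2(qx) for px, qx in zip(p, q) if px > 0)

fair, rigged = (0.5, 0.5), (0.99, 0.01)

# D_KL(p || q) = H(p, q) - H(p), and H(p) = H(p, p)
kl = cross_entropy(fair, rigged) - cross_entropy(fair, fair)
print(kl)  # ~2.33 extra bits of surprise per toss from using the wrong model
```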