KL-Divergence
The KL divergence measures the difference between the probability distribution $Q$ of a model and the true distribution $P$, i.e. the cross-entropy, disregarding the entropy of the true distribution itself:

$$D_{KL}(P \,\|\, Q) = H(P, Q) - H(P) = \sum_x P(x) \log \frac{P(x)}{Q(x)}$$
It’s the extra average surprise, aka information-theoretic regret, when assuming the model $Q$ instead of the true distribution $P$.
In base 2, we get the inefficiency of using $Q$ to encode $P$, in bits.
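As a quick sanity check of the definitions above, here is a minimal NumPy sketch (the distributions `p` and `q` are made up for illustration) that computes the KL divergence both as cross-entropy minus entropy and via the direct formula:

```python
import numpy as np

def entropy(p):
    """Average surprise under the true distribution, in bits."""
    return -np.sum(p * np.log2(p))

def cross_entropy(p, q):
    """Average surprise when events come from p but we encode with q, in bits."""
    return -np.sum(p * np.log2(q))

def kl_divergence(p, q):
    """Extra average surprise from assuming q instead of the true p."""
    return cross_entropy(p, q) - entropy(p)

p = np.array([0.7, 0.2, 0.1])  # hypothetical true distribution
q = np.array([0.4, 0.4, 0.2])  # hypothetical model

print(kl_divergence(p, q))         # ~0.27 bits of extra surprise per event
print(np.sum(p * np.log2(p / q)))  # same value via the direct formula
```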
The notation $D_{KL}(P \,\|\, Q)$ emphasizes that the KL divergence is not symmetric:[^1] $P$ and $Q$ are not interchangeable.
E.g.: If you believe a coin is fair (0.5/0.5), but it is rigged (0.99/0.01), then the CE is:

$$H(P, Q) = -(0.99 \log_2 0.5 + 0.01 \log_2 0.5) = 1 \text{ bit}$$
If you believe it is rigged when it is actually fair, then the CE is:

$$H(P, Q) = -(0.5 \log_2 0.99 + 0.5 \log_2 0.01) \approx 3.33 \text{ bits}$$
In the second case, the cross-entropy is much larger: half of the time you are extremely surprised to see tails, and this extreme surprise dominates the average surprise.
KL divergence is not a metric (distance) because it is not symmetric!
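A small numeric check of the coin example and of the asymmetry, assuming base-2 logs as above:

```python
import numpy as np

def cross_entropy(p, q):
    return -np.sum(p * np.log2(q))     # bits

def kl(p, q):
    return np.sum(p * np.log2(p / q))  # bits

fair   = np.array([0.5, 0.5])
rigged = np.array([0.99, 0.01])

# Believe fair, actually rigged: every flip costs exactly 1 bit.
print(cross_entropy(rigged, fair), kl(rigged, fair))  # 1.00, ~0.92

# Believe rigged, actually fair: rare-looking tails happen half the time.
print(cross_entropy(fair, rigged), kl(fair, rigged))  # ~3.33, ~2.33

# The two KL values differ, so D_KL(P||Q) != D_KL(Q||P).
```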
→ When using KL divergence as a loss function, the choice of which distribution is the “reference” ($P$) versus the “approximation” ($Q$) matters significantly. For example, minimizing $D_{KL}(P \,\|\, Q)$ tends to produce a $Q$ that covers all modes of $P$ (zero-avoiding), while minimizing $D_{KL}(Q \,\|\, P)$ produces a $Q$ that concentrates on high-probability regions of $P$ (zero-forcing).
Here’s a visualization of the difference / KL in general: https://claude.ai/public/artifacts/56ba13ac-42cc-4328-9fa2-32294ebe0345
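For a rough numeric illustration of the zero-avoiding vs. zero-forcing behavior, here is a sketch that discretizes a bimodal $P$ and two candidate $Q$s on a grid; the particular means and widths are arbitrary choices for illustration:

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def kl(p, q):
    """Discrete KL divergence in nats."""
    return np.sum(p * np.log(p / q))

# Discretize everything on a grid and renormalize, so KL reduces to a sum.
x = np.linspace(-6, 6, 601)

# Bimodal "true" distribution P: two well-separated peaks.
p = 0.5 * normal_pdf(x, -2, 0.5) + 0.5 * normal_pdf(x, 2, 0.5)
p /= p.sum()

# Two candidate approximations Q: one broad blanket, one single mode.
q_broad  = normal_pdf(x, 0, 2.2);  q_broad  /= q_broad.sum()
q_narrow = normal_pdf(x, 2, 0.5);  q_narrow /= q_narrow.sum()

# The broad Q has the lower forward KL (mass-covering / zero-avoiding).
print("forward KL(P||Q):", kl(p, q_broad), kl(p, q_narrow))

# The narrow Q has the lower reverse KL (mode-seeking / zero-forcing).
print("reverse KL(Q||P):", kl(q_broad, p), kl(q_narrow, p))
```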
Reverse KL
We’re using the reverse KL, $D_{KL}(Q \,\|\, P)$, with $P$ and $Q$ swapped, because the forward KL would require us to sample from the posterior, which we can’t compute and which is exactly why we need variational inference in the first place.
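A sketch of why the reverse direction is the tractable one, using the common variational-inference notation $q(z)$ for the approximation and $p(z \mid x)$ for the posterior (my choice of symbols, not the original’s):

$$
D_{KL}\big(q(z) \,\|\, p(z \mid x)\big)
= \mathbb{E}_{z \sim q}\!\left[\log \frac{q(z)}{p(z \mid x)}\right]
= \mathbb{E}_{z \sim q}\big[\log q(z) - \log p(x, z)\big] + \log p(x)
$$

The expectation only needs samples from $q$ (which we design to be easy to sample) and the joint $p(x, z)$ (which we can evaluate); the intractable $\log p(x)$ is a constant w.r.t. $q$. The forward KL would instead put the expectation under $p(z \mid x)$ itself.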
Note: $Q$ is always the thing we’re optimizing when using KL.

Behavioral differences:
- $D_{KL}(P \,\|\, Q)$ (forward KL) is “mass-covering”: weighted by $P$, it tries to cover everything, penalizing $Q$ for not putting mass where $P$ has mass
- $D_{KL}(Q \,\|\, P)$ (reverse KL) is “mode-seeking”: weighted by $Q$, it focuses on the main modes of $P$, penalizing $Q$ for putting mass where $P$ is low
→ The reverse KL tends to underfit rather than overfit.
Why aren’t we using KL-divergence as a loss function, if it seems to have better properties?
Since we cannot change $P$, we can only minimize w.r.t. $Q$, and hence the KL divergence is equivalent to the cross-entropy loss:

$$\arg\min_Q D_{KL}(P \,\|\, Q) = \arg\min_Q \big( H(P, Q) - H(P) \big) = \arg\min_Q H(P, Q)$$
We don’t care about the exact value, so we don’t bother to estimate the entropy $H(P)$ of the training data:

$$D_{KL}(P \,\|\, Q) = \underbrace{H(P, Q)}_{\text{what we minimize}} - \underbrace{H(P)}_{\text{constant in } Q}$$
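A tiny sketch of this equivalence, with a made-up label distribution `p` and random softmax outputs `q`: the gap between cross-entropy and KL is always $H(P)$, independent of $Q$, so both losses have the same minimizer.

```python
import numpy as np

rng = np.random.default_rng(0)

def entropy(p):
    return -np.sum(p * np.log(p))

def cross_entropy(p, q):
    return -np.sum(p * np.log(q))

def kl(p, q):
    return np.sum(p * np.log(p / q))

p = np.array([0.1, 0.7, 0.2])  # hypothetical "true" label distribution

for _ in range(3):
    logits = rng.normal(size=3)
    q = np.exp(logits) / np.exp(logits).sum()  # model's softmax output
    # The gap is always H(p), whatever q is, so minimizing cross-entropy
    # w.r.t. q is the same as minimizing the KL divergence.
    print(cross_entropy(p, q) - kl(p, q), entropy(p))
```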
Footnotes
[^1]: Don’t ask me why cross-entropy doesn’t have the same notation. Maybe it’s because KL-divergence could otherwise be more easily confused with a distance?