The KL divergence measures the difference between the probability distribution of a model and the true distribution, disregarding the uncertainty of the true distribution itself.
The KL divergence subtracts the uncertainty (entropy) of the true distribution from the cross-entropy, leaving a measure of how much the two distributions differ.
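In symbols (a standard decomposition; writing $P$ for the true distribution and $Q$ for the model, over a discrete sample space):

$$
D_{\mathrm{KL}}(P \,\|\, Q)
\;=\; \underbrace{-\sum_x P(x)\log Q(x)}_{\text{cross-entropy } H(P,\,Q)}
\;-\; \underbrace{\Big(-\sum_x P(x)\log P(x)\Big)}_{\text{entropy } H(P)}
\;=\; \sum_x P(x)\,\log\frac{P(x)}{Q(x)}
$$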
KL divergence is not a metric (distance) because it is not symmetric!
→ When using KL divergence as a loss function, the choice of which distribution is the “reference” ($P$) versus the “approximation” ($Q$) matters significantly. For example, minimizing the forward KL $D_{\mathrm{KL}}(P \,\|\, Q)$ tends to produce a $Q$ that covers all modes of $P$ (zero-avoiding), while minimizing the reverse KL $D_{\mathrm{KL}}(Q \,\|\, P)$ produces a $Q$ that concentrates on high-probability regions of $P$ (zero-forcing).
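A minimal numerical sketch of both points (the distributions and the `kl` helper below are made-up toy examples, not part of any library): the two orderings of the arguments give different values (asymmetry), and they rank a mode-covering versus a mode-seeking approximation differently.

```python
import numpy as np

def kl(p, q):
    """Discrete KL divergence D_KL(p || q) = sum_x p(x) log(p(x)/q(x)), in nats."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sum(p * np.log(p / q))

# Bimodal "true" distribution P over four states, and two candidate approximations Q:
p        = np.array([0.45, 0.05, 0.05, 0.45])  # two modes (states 0 and 3)
q_broad  = np.array([0.25, 0.25, 0.25, 0.25])  # spreads mass over both modes
q_narrow = np.array([0.85, 0.05, 0.05, 0.05])  # concentrates on a single mode

# Asymmetry: D_KL(P || Q) != D_KL(Q || P) in general.
print(kl(p, q_broad), kl(q_broad, p))          # ~0.368 vs ~0.511

# Forward KL D_KL(P || Q) prefers the mode-covering Q (zero-avoiding) ...
print(kl(p, q_broad) < kl(p, q_narrow))        # True  (~0.368 < ~0.703)
# ... while reverse KL D_KL(Q || P) prefers the concentrated Q (zero-forcing).
print(kl(q_narrow, p) < kl(q_broad, p))        # True  (~0.431 < ~0.511)
```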
Why don’t we use the KL divergence itself as a loss function, if it seems to have nicer properties than cross-entropy?
Since we cannot change the true distribution $P$, we can only minimize w.r.t. the model $Q$; the entropy $H(P)$ is therefore a constant, and minimizing the KL divergence is equivalent to minimizing the cross-entropy loss:

$$\arg\min_Q D_{\mathrm{KL}}(P \,\|\, Q) \;=\; \arg\min_Q \big[ H(P, Q) - H(P) \big] \;=\; \arg\min_Q H(P, Q)$$

We don’t care about the exact value of the loss, only about its minimizer, so we don’t bother to estimate the entropy of the training data and simply train on the cross-entropy:

$$H(P, Q) \;=\; -\sum_x P(x) \log Q(x)$$
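A quick numerical check of that equivalence (a toy sketch; the fixed $P$ and the candidate $Q$s are arbitrary): for a fixed $P$, cross-entropy and KL divergence differ only by the constant $H(P)$, so they rank every $Q$ identically and share the same minimizer.

```python
import numpy as np

def cross_entropy(p, q):
    """H(P, Q) = -sum_x p(x) log q(x), in nats."""
    return -np.sum(p * np.log(q))

def kl(p, q):
    """D_KL(P || Q) = H(P, Q) - H(P)."""
    return cross_entropy(p, q) - cross_entropy(p, p)

p = np.array([0.7, 0.2, 0.1])                      # fixed "true" distribution
candidates = [np.array([0.6, 0.3, 0.1]),
              np.array([0.5, 0.25, 0.25]),
              np.array([1/3, 1/3, 1/3])]

# The gap between the two losses is the same constant H(P) for every Q,
# so minimizing cross-entropy w.r.t. Q is the same optimization as minimizing KL.
for q in candidates:
    print(cross_entropy(p, q) - kl(p, q))          # always H(P) ≈ 0.8018
```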