The KL divergence measures how far a model's probability distribution is from the true distribution, disregarding the inherent uncertainty of the true distribution itself.

The KL divergence subtracts the uncertainty (entropy) of the true distribution from the cross-entropy, leaving us with a measure of how much the two distributions differ.
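In symbols, with $P$ denoting the true distribution and $Q$ the model (the convention used throughout this note), that relationship reads:

$$D_{KL}(P \,\|\, Q) \;=\; H(P, Q) - H(P) \;=\; \sum_x P(x) \log \frac{P(x)}{Q(x)}$$

where $H(P, Q) = -\sum_x P(x) \log Q(x)$ is the cross-entropy and $H(P) = -\sum_x P(x) \log P(x)$ is the entropy of the true distribution.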

KL divergence is not a metric (distance) because it is not symmetric!

→ When using KL divergence as a loss function, the choice of which distribution is the “reference” (P) versus the “approximation” (Q) matters significantly. For example, minimizing $D_{KL}(P \,\|\, Q)$ tends to produce a $Q$ that covers all modes of $P$ (zero-avoiding), while minimizing $D_{KL}(Q \,\|\, P)$ produces a $Q$ that concentrates on high-probability regions of $P$ (zero-forcing).
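A minimal numeric sketch of the asymmetry, assuming a hand-rolled discrete KL (natural log) and two made-up distributions; the names and numbers are purely illustrative:

```python
import numpy as np

def kl(p, q):
    """Discrete KL divergence D_KL(p || q) = sum_x p(x) * log(p(x) / q(x))."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0  # terms with p(x) = 0 contribute nothing
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# Two distributions over the same three outcomes (illustrative values).
p = [0.7, 0.2, 0.1]
q = [0.4, 0.4, 0.2]

print(kl(p, q))  # D_KL(P || Q) ≈ 0.184
print(kl(q, p))  # D_KL(Q || P) ≈ 0.192 — a different value: KL is not symmetric
```

The two printed values differ, which is exactly why the direction of the divergence, i.e. which distribution is treated as the reference, matters.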

Why aren’t we using KL divergence as a loss function, if it seems to have better properties?

Since we cannot change $P$ (the true data distribution), we can only minimize w.r.t. $Q$ (the model), and hence minimizing the KL divergence is equivalent to minimizing the cross-entropy loss:

$$\min_Q D_{KL}(P \,\|\, Q) \;=\; \min_Q \big[ H(P, Q) - H(P) \big]$$

We don’t care about the exact value of the divergence, only about its minimizer, so we don’t bother to estimate the entropy of the training data:

$$\arg\min_Q D_{KL}(P \,\|\, Q) \;=\; \arg\min_Q H(P, Q)$$
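A small sanity check of this equivalence (illustrative numbers, natural log): for a fixed $P$, the cross-entropy and the KL divergence of every candidate $Q$ differ by the same constant $H(P)$, so both objectives share the same minimizer.

```python
import numpy as np

def entropy(p):
    """H(P) = -sum_x P(x) log P(x)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

def cross_entropy(p, q):
    """H(P, Q) = -sum_x P(x) log Q(x)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    m = p > 0
    return float(-np.sum(p[m] * np.log(q[m])))

def kl(p, q):
    """D_KL(P || Q) = sum_x P(x) log(P(x) / Q(x))."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    m = p > 0
    return float(np.sum(p[m] * np.log(p[m] / q[m])))

p = [0.7, 0.2, 0.1]  # fixed "true" distribution; H(P) is a constant
for q in ([0.4, 0.4, 0.2], [0.6, 0.3, 0.1], [0.1, 0.1, 0.8]):
    # Cross-entropy and KL differ by exactly H(P) for every candidate Q,
    # so minimizing one over Q is the same as minimizing the other.
    print(f"H(P,Q)={cross_entropy(p, q):.4f}  "
          f"D_KL(P||Q)={kl(p, q):.4f}  "
          f"difference={cross_entropy(p, q) - kl(p, q):.4f}  "
          f"H(P)={entropy(p):.4f}")
```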
