The KL divergence measures how much a model's probability distribution differs from the true distribution, disregarding the uncertainty of the true distribution itself.

The KL divergence subtracts the uncertainty (entropy) of the true distribution from the cross-entropy, leaving us with a measure of how different the two distributions are.
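Written out (using $p$ for the true distribution and $q$ for the model's distribution, a conventional choice of symbols rather than one fixed by the note), this is:

$$D_{\mathrm{KL}}(p \,\|\, q) \;=\; \underbrace{H(p, q)}_{\text{cross-entropy}} - \underbrace{H(p)}_{\text{entropy of } p} \;=\; \sum_x p(x)\,\log\frac{p(x)}{q(x)}$$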


KL divergence is not a distance, because it is not symmetric!
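A quick numerical check, as a minimal NumPy sketch (the distributions `p` and `q` are arbitrary illustrative examples, not taken from the text):

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(p || q) = sum_x p(x) * log(p(x) / q(x))."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sum(p * np.log(p / q))

# Two arbitrary distributions over three outcomes.
p = [0.7, 0.2, 0.1]
q = [0.4, 0.4, 0.2]

print(kl_divergence(p, q))  # ~0.184 nats
print(kl_divergence(q, p))  # ~0.192 nats: a different value, so KL is not symmetric
```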

Why aren’t we using the KL divergence as a loss function, if it seems to have better properties?

Since we cannot change the true distribution $p$, we can only minimize with respect to the model $q$, and hence minimizing the KL divergence is equivalent to minimizing the cross-entropy loss:

$$\arg\min_{q} D_{\mathrm{KL}}(p \,\|\, q) \;=\; \arg\min_{q} \big( H(p, q) - H(p) \big) \;=\; \arg\min_{q} H(p, q)$$

We don’t care about the exact value of the loss, only about which model minimizes it, so we don’t bother to estimate the entropy of the training data:

$$H(p, q) \;=\; D_{\mathrm{KL}}(p \,\|\, q) + \underbrace{H(p)}_{\text{constant w.r.t. } q}$$
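To make the equivalence concrete, here is a minimal NumPy sketch (again with arbitrary example distributions): the cross-entropy and the KL divergence differ exactly by $H(p)$, which is fixed by the training data and does not depend on the model $q$.

```python
import numpy as np

def entropy(p):
    """H(p) = -sum_x p(x) * log p(x)."""
    return -np.sum(p * np.log(p))

def cross_entropy(p, q):
    """H(p, q) = -sum_x p(x) * log q(x)."""
    return -np.sum(p * np.log(q))

def kl_divergence(p, q):
    """D_KL(p || q) = H(p, q) - H(p)."""
    return np.sum(p * np.log(p / q))

p = np.array([0.7, 0.2, 0.1])  # "true" / training-data distribution
q = np.array([0.5, 0.3, 0.2])  # model distribution

# The two losses differ only by H(p), which does not depend on q.
assert np.isclose(cross_entropy(p, q), kl_divergence(p, q) + entropy(p))
```

Since the omitted entropy term only shifts the loss value and contributes nothing to the gradients with respect to the model, minimizing the cross-entropy directly is enough.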
