The KL divergence measures the difference between the probability distribution of a model and the true distribution, disregarding the uncertainty of the true distribution itself.
The KL-divergence subtracts the uncertainty (entropy) about the true distribution from the cross-entropy, leaving us with a measure of how much these distributions differ.
KL divergence is not a distance, because it is not symmetric: in general $D_{\mathrm{KL}}(p \,\|\, q) \neq D_{\mathrm{KL}}(q \,\|\, p)$!
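As a quick numerical illustration, here is a small NumPy sketch (the distributions `p` and `q` are made up for this example) that checks the identity $D_{\mathrm{KL}}(p \,\|\, q) = H(p, q) - H(p)$ and the asymmetry:

```python
import numpy as np

# Two made-up discrete distributions over the same 4 outcomes.
p = np.array([0.1, 0.4, 0.4, 0.1])      # "true" distribution
q = np.array([0.25, 0.25, 0.25, 0.25])  # model distribution

def entropy(p):
    """H(p) = -sum_i p_i * log p_i"""
    return -np.sum(p * np.log(p))

def cross_entropy(p, q):
    """H(p, q) = -sum_i p_i * log q_i"""
    return -np.sum(p * np.log(q))

def kl(p, q):
    """D_KL(p || q) = sum_i p_i * log(p_i / q_i)"""
    return np.sum(p * np.log(p / q))

# KL divergence = cross-entropy minus the entropy of the true distribution.
assert np.isclose(kl(p, q), cross_entropy(p, q) - entropy(p))

# KL divergence is not symmetric: the two values differ.
print(kl(p, q), kl(q, p))
```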
Why aren’t we using KL-divergence as a loss function, if it seems to have better properties?
Since we cannot change $p$, we can only minimize w.r.t. $q$, and hence minimizing the KL divergence is equivalent to minimizing the cross-entropy loss:

$$D_{\mathrm{KL}}(p \,\|\, q) = H(p, q) - H(p)$$
We don’t care about the exact value of the loss, only about its minimizer, so we don’t bother to estimate the entropy of the training data:

$$\arg\min_q D_{\mathrm{KL}}(p \,\|\, q) = \arg\min_q \bigl( H(p, q) - H(p) \bigr) = \arg\min_q H(p, q)$$
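A minimal NumPy sketch (again with made-up distributions) of why dropping the entropy term is harmless for optimization: cross-entropy and KL divergence differ only by the constant $H(p)$, so they rank every candidate model identically and share the same minimizer:

```python
import numpy as np

p = np.array([0.1, 0.4, 0.4, 0.1])  # fixed "true" distribution

def cross_entropy(p, q):
    return -np.sum(p * np.log(q))

def kl(p, q):
    return np.sum(p * np.log(p / q))

# A handful of candidate model distributions (made up for illustration).
candidates = [
    np.array([0.25, 0.25, 0.25, 0.25]),
    np.array([0.10, 0.40, 0.40, 0.10]),
    np.array([0.40, 0.10, 0.10, 0.40]),
]

kl_vals = [kl(p, q) for q in candidates]
ce_vals = [cross_entropy(p, q) for q in candidates]

# Both losses pick the same best candidate ...
print(np.argmin(kl_vals) == np.argmin(ce_vals))  # True
# ... because they differ by the same constant H(p) for every q.
print(np.subtract(ce_vals, kl_vals))
```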