See also: cross-entropy
(Categorical) cross-entropy loss is used for classification tasks, in conjunction with softmax:

$$\mathrm{CE} = -\sum_{i} \sum_{j} y_{ij} \log(\hat{y}_{ij})$$

- $i$ … sample
- $j$ … label / output index
- $y$ … target values
- $\hat{y}$ … predicted values
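A tiny numerical sketch of the formula (the one-hot targets and softmax outputs below are made-up toy values):

```python
import numpy as np

# 2 samples, 3 classes: y are one-hot targets, y_hat are softmax outputs (toy values)
y = np.array([[1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0]])
y_hat = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.3, 0.6]])

ce = -np.sum(y * np.log(y_hat))   # sum over samples i and labels j
print(ce)                         # ≈ 0.87
```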
For one-hot encodings, cross-entropy loss simplifies to Categorical Cross-Entropy, aka negative log-likelihood loss:

$$\mathrm{NLL} = -\sum_{i} \log(\hat{y}_{i,c})$$

- $c$ … index for correct class
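Continuing with the same toy softmax outputs: because the targets are one-hot, only the term of the correct class survives, so the loss reduces to summing $-\log(\hat{y}_{i,c})$ over the samples:

```python
import numpy as np

# Same toy softmax outputs as above; targets given as class indices instead of one-hot
y_hat = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.3, 0.6]])
c = np.array([0, 2])              # c … index of the correct class per sample

nll = -np.sum(np.log(y_hat[np.arange(len(c)), c]))
print(nll)                        # ≈ 0.87, same value as the full cross-entropy above
```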
Unlike the negative log-likelihood loss, which doesn’t punish based on prediction confidence, Cross-Entropy punishes incorrect but confident predictions, as well as correct but less confident predictions. [1]
Cross entropy indicates the distance between what the model believes the output distribution should be, and what the original distribution really is.
The cross-entropy measure is a widely used alternative to squared error. It is used when node activations can be interpreted as the probability that each hypothesis is true, i.e. when the output is a probability distribution. It is therefore used as a loss function in neural networks that have softmax activations in the output layer. [2]
nn.CrossEntropyLoss <=> nn.LogSoftmax + nn.NLLLoss
nn.BCEWithLogitsLoss <=> nn.Sigmoid + nn.BCELoss
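A quick way to check these equivalences in PyTorch (the tensors are made-up example inputs):

```python
import torch
import torch.nn as nn

logits = torch.randn(4, 3)            # 4 samples, 3 classes
targets = torch.tensor([0, 2, 1, 2])  # class indices

ce = nn.CrossEntropyLoss()(logits, targets)
nll = nn.NLLLoss()(nn.LogSoftmax(dim=1)(logits), targets)
assert torch.allclose(ce, nll)

bin_logits = torch.randn(4)
bin_targets = torch.tensor([0.0, 1.0, 1.0, 0.0])

bce_logits = nn.BCEWithLogitsLoss()(bin_logits, bin_targets)
bce = nn.BCELoss()(nn.Sigmoid()(bin_logits), bin_targets)
assert torch.allclose(bce_logits, bce)
```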
When does it not perform well?
- strong class imbalance (the model just becomes more confident in predicting the majority class and neglects the minority class)
- it fails to differentiate between easy and hard samples (hard example = the model makes significant errors; easy example = straightforward to classify); CE doesn’t allocate more attention to hard samples
Mitigations:
addressing class imbalance: balanced CE loss
Adding a weighting factor for each class resolves the issue with class imbalance:

$$\mathrm{CE}_{\text{balanced}} = -\sum_{c} w_c \, y_c \log(p_c)$$

- $p_c$ … predicted probability of class $c$ (from softmax, …)
- $w_c$ … weight for class $c$

$w_c$ is usually calculated as the inverse of the class distribution, i.e. $w_c = \frac{N}{N_c}$ with $N$ … total number of samples and $N_c$ … number of samples of class $c$
(if the dataset is very large, take a random sample that fits into memory)
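A minimal sketch of inverse-frequency weights in PyTorch, assuming the class counts come from a toy label tensor (the numbers and the normalization step are illustrative choices):

```python
import torch
import torch.nn as nn

labels = torch.tensor([0, 0, 0, 0, 0, 0, 1, 1, 2])  # imbalanced toy dataset
counts = torch.bincount(labels, minlength=3).float()
weights = counts.sum() / counts                      # w_c = N / N_c (inverse frequency)
weights = weights / weights.sum()                    # optional: normalize the weights

criterion = nn.CrossEntropyLoss(weight=weights)      # per-class weighting built into PyTorch
logits = torch.randn(len(labels), 3)
loss = criterion(logits, labels)
```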
addressing hard negatives: focal loss
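Focal loss down-weights easy examples by scaling the CE term with $(1 - p_t)^\gamma$, so training focuses on hard samples. The sketch below is an illustrative multi-class variant (the function name, default gamma, and optional per-class alpha weights are assumptions, not a reference implementation):

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0, alpha=None):
    """Illustrative multi-class focal loss: cross-entropy scaled by (1 - p_t)^gamma."""
    log_probs = F.log_softmax(logits, dim=1)
    ce = F.nll_loss(log_probs, targets, weight=alpha, reduction="none")  # per-sample CE
    p_t = log_probs.gather(1, targets.unsqueeze(1)).squeeze(1).exp()     # prob of true class
    return ((1.0 - p_t) ** gamma * ce).mean()

# Usage with made-up tensors:
logits = torch.randn(4, 3)
targets = torch.tensor([0, 2, 1, 2])
loss = focal_loss(logits, targets)
```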
References
https://towardsdatascience.com/cross-entropy-negative-log-likelihood-and-all-that-jazz-47a95bd2e81