We need a metric that ignores the easy negatives.
F1-score
F1 = harmonic mean of precision (P) and recall (R): F1 = 2PR / (P + R)
Harmonic mean because you can’t game it by maxing one: predict-everything-positive gives recall 1, but precision drops to the positive rate (near 0 when positives are rare) → F1 near 0.
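As a rough illustration of the point above, here is a minimal sketch that computes F1 from confusion counts. The function name `f1_from_counts`, the 10-positive / 990-negative dataset, and the choice to return 0.0 when a ratio is undefined are all assumptions made for the example, not something taken from these notes.

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 from true positives, false positives, and false negatives.

    Assumption: undefined precision or recall (zero denominator) counts as 0.0.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical imbalanced dataset: 10 positives, 990 negatives.
# A classifier that predicts "positive" for everything finds all 10 positives
# (recall 1.0) but also raises 990 false alarms (precision 0.01).
predict_all_f1 = f1_from_counts(tp=10, fp=990, fn=0)

# The arithmetic mean of the same precision and recall, for comparison.
arithmetic_mean = (0.01 + 1.0) / 2

print(f"F1 = {predict_all_f1:.4f}, arithmetic mean = {arithmetic_mean:.4f}")
```

The harmonic mean stays close to the smaller of the two values, which is why the degenerate classifier scores about 0.02 here while an arithmetic mean would reward it with roughly 0.5.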
Macro F1
Macro F1 is the average of F1-scores across classes, weighting each class equally.
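A small sketch of the macro average, written in plain Python so that each step is visible. The helper name `macro_f1`, the toy labels, and two conventions are assumptions for illustration: the class set is taken as every label seen in either the true or predicted list, and a class with no true or predicted instances scores 0.0.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (one-vs-rest per class)."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    per_class = []
    for label in labels:
        # F1 = 2TP / (2TP + FP + FN), equivalent to 2PR / (P + R).
        denom = 2 * tp[label] + fp[label] + fn[label]
        per_class.append(2 * tp[label] / denom if denom else 0.0)
    return sum(per_class) / len(per_class)


# Hypothetical toy data: 8 examples of class "a", 2 of class "b",
# and a classifier that always predicts the majority class "a".
y_true = ["a"] * 8 + ["b"] * 2
y_pred = ["a"] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = macro_f1(y_true, y_pred)
print(f"accuracy = {accuracy:.2f}, macro F1 = {score:.4f}")
```

Because each class counts equally, the completely missed minority class "b" contributes an F1 of 0 and pulls the macro score well below the accuracy, even though it makes up only 20% of the examples.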