F1-score

We need a metric that ignores the easy negatives.

F1 is built only from the positive class (true negatives never enter the formula):
precision = TP/(TP+FP) … of what I flagged as class c, how much really is c (penalizes false alarms).
recall = TP/(TP+FN) … of the true c segments, how many I caught (penalizes misses).

F1-score

F1 = harmonic mean of precision and recall
$\mathrm{F1} = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} = \frac{2 \cdot TP}{2 \cdot TP + FP + FN}$

Harmonic mean because you can’t game it by maxing one: predict-everything gives recall 1, but precision falls to the share of true c segments in the data (near 0 for a rare class) → F1 near 0.
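
A minimal Python sketch of the predict-everything case, using made-up counts (1000 segments, 10 of them truly class c). The function name `precision_recall_f1` and the convention of returning 0 when a denominator is zero are assumptions for illustration, not part of the definitions above.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) for one class from raw counts."""
    # Assumed convention: an undefined ratio (zero denominator) counts as 0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical data: flag all 1000 segments as c, only 10 really are c.
p, r, f1 = precision_recall_f1(tp=10, fp=990, fn=0)
arithmetic_mean = (p + r) / 2

print(f"precision={p:.3f} recall={r:.3f} F1={f1:.3f} mean={arithmetic_mean:.3f}")
```

With these counts the arithmetic mean of precision and recall is still about 0.5, while F1 stays close to the lower of the two values, which is why the harmonic mean is harder to game.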

Macro F1

Macro F1 is the unweighted average of the per-class F1-scores, so a rare class counts as much as a frequent one.
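
A small sketch of macro averaging in plain Python, assuming labels are given as two equal-length lists. The helper names `f1_from_counts` and `macro_f1` and the toy labels are hypothetical; a class with no true or predicted instances is scored 0 here, which is one possible convention.

```python
def f1_from_counts(tp, fp, fn):
    """F1 for one class; returns 0 when the class never occurs (assumed convention)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true, y_pred, classes):
    """Average the per-class F1-scores, giving every class equal weight."""
    scores = []
    for c in classes:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == c and p == c for t, p in pairs)  # correctly flagged as c
        fp = sum(t != c and p == c for t, p in pairs)  # false alarms for c
        fn = sum(t == c and p != c for t, p in pairs)  # missed c
        scores.append(f1_from_counts(tp, fp, fn))
    return sum(scores) / len(scores)


# Toy labels: class "a" scores F1 = 0.75, class "b" scores F1 = 0.5.
y_true = ["a", "a", "a", "a", "b", "b"]
y_pred = ["a", "a", "a", "b", "b", "a"]
score = macro_f1(y_true, y_pred, classes=["a", "b"])
print(score)
```

Each class contributes one F1 value to the mean regardless of how many examples it has, so the smaller class "b" pulls the score down as much as "a" lifts it.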
