We need a metric that ignores the easy negatives.

F1 for a class c is built only from counts that involve c (TP, FP, FN); true negatives never enter it:
precision = TP/(TP+FP) … of what I flagged as class c, how much really is c (penalizes false alarms).
recall = TP/(TP+FN) … of the true c segments, how many I caught (penalizes misses).
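
A minimal sketch of the two definitions above, in plain Python. The function name `precision_recall`, the label strings, and the convention of returning 0.0 on an empty denominator are assumptions made for illustration. It also shows that adding correctly rejected negatives changes neither number.

```python
def precision_recall(y_true, y_pred, c):
    """Precision and recall for class c; true negatives are never counted."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == c and p == c)  # flagged c, really c
    fp = sum(1 for t, p in pairs if t != c and p == c)  # flagged c, not c (false alarm)
    fn = sum(1 for t, p in pairs if t == c and p != c)  # really c, not flagged (miss)
    # Assumed convention: report 0.0 when a denominator is zero.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical labels: "bg" is the easy negative (background) class.
y_true = ["bg", "bg", "bg", "bg", "c", "c"]
y_pred = ["bg", "bg", "bg", "c", "c", "bg"]
p, r = precision_recall(y_true, y_pred, "c")

# Appending 1000 correctly rejected negatives leaves both values unchanged.
p_more, r_more = precision_recall(y_true + ["bg"] * 1000, y_pred + ["bg"] * 1000, "c")
print(p, r, p_more, r_more)
```

Here TP = FP = FN = 1, so precision and recall are both 0.5 with or without the extra negatives.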

F1-score

F1 = harmonic mean of precision and recall = 2·precision·recall/(precision+recall) = 2TP/(2TP+FP+FN)

Harmonic mean because you can’t game it by maxing one: predicting c everywhere gives recall 1, but precision drops to the fraction of segments that really are c (near 0 for a rare class) → F1 near 0.
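
To make the point concrete, here is a small sketch comparing the harmonic mean with a plain arithmetic mean for the predict-everything strategy. The numbers (100 segments, 5 truly of class c) are invented for illustration, and the helper name `f1` is an assumption.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical data: 100 segments, 5 of which are truly class c.
n_total, n_pos = 100, 5

# Predict c everywhere: every positive is caught, every negative is a false alarm.
tp, fp, fn = n_pos, n_total - n_pos, 0
precision = tp / (tp + fp)  # equals the prevalence of c
recall = tp / (tp + fn)     # nothing is missed

arithmetic = (precision + recall) / 2
harmonic = f1(precision, recall)
print(f"arithmetic mean = {arithmetic:.3f}, F1 = {harmonic:.3f}")
```

With these numbers the arithmetic mean is 0.525, which looks respectable, while F1 is about 0.095: the harmonic mean is pulled toward the smaller of the two values.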

Macro F1

Macro F1 is the unweighted average of the per-class F1-scores: each class counts equally, however rare it is.
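
A rough sketch of macro F1 in plain Python. Two choices here are assumptions, not part of the definition above: the class set is taken as every label seen in either the true or the predicted labels, and a class with no TP, FP, or FN gets F1 = 0. The function names and example labels are hypothetical.

```python
def f1_for_class(y_true, y_pred, c):
    """F1 for class c, using F1 = 2TP / (2TP + FP + FN)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == c and p == c)
    fp = sum(1 for t, p in pairs if t != c and p == c)
    fn = sum(1 for t, p in pairs if t == c and p != c)
    denom = 2 * tp + fp + fn
    # Assumed convention: F1 is 0.0 when the class never appears at all.
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen (assumed class set)."""
    classes = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)


# Hypothetical example: "a" is common, "b" is rare and half of it is missed.
y_true = ["a"] * 8 + ["b"] * 2
y_pred = ["a"] * 8 + ["a", "b"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = macro_f1(y_true, y_pred)
print(f"accuracy = {accuracy:.2f}, macro F1 = {score:.2f}")
```

Per class, F1 is 16/17 for "a" and 2/3 for "b", so macro F1 is about 0.80 while accuracy is 0.90: the weak result on the rare class pulls the macro average down by the same weight as the common class.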