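With respect to the cross-entropy loss $\mathcal{L}_{\text{CE}}$ (measured in nats), the perplexity is defined as:

$$\text{PPL} = \exp\left(\mathcal{L}_{\text{CE}}\right) = \exp\left(-\frac{1}{N}\sum_{i=1}^{N} \log p_i\right) = \left(\prod_{i=1}^{N} \frac{1}{p_i}\right)^{1/N}$$

where $p_i$ is the probability the model assigned to the true label of example $i$.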
It’s the geometric mean of the inverse predicted probabilities.
→ Perplexity is dimensionless and multiplicative: halving perplexity doubles the geometric mean of the probabilities the model assigned to the true labels.
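
As a quick numerical check of the two statements above, here is a minimal Python sketch. The helper names `perplexity` and `geometric_mean` and the sample probabilities are illustrative assumptions, and the loss is assumed to be measured in nats (natural log).

```python
import math

def perplexity(true_label_probs):
    """Perplexity from the probabilities assigned to the true labels (hypothetical helper)."""
    n = len(true_label_probs)
    # Mean cross-entropy in nats: -(1/N) * sum(log p_i)
    cross_entropy = -sum(math.log(p) for p in true_label_probs) / n
    return math.exp(cross_entropy)

def geometric_mean(values):
    """Geometric mean via exp(mean(log v)); requires strictly positive values."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

# Made-up probabilities that a model assigned to the true labels.
probs = [0.5, 0.25, 0.125]

ppl = perplexity(probs)                                 # approximately 4.0
gm_inverse = geometric_mean([1.0 / p for p in probs])   # same value as ppl

# Doubling every true-label probability doubles their geometric mean
# and therefore halves the perplexity.
ppl_doubled = perplexity([2 * p for p in probs])        # approximately 2.0

print(ppl, gm_inverse, ppl_doubled)
```

The geometric mean of `probs` is $(0.5 \cdot 0.25 \cdot 0.125)^{1/3} = 0.25$, so the perplexity is $1 / 0.25 = 4$.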
Effective branching factor
Perplexity $\text{PPL}$: The model is, on average (geometric mean), as confused as if it had to pick uniformly from $\text{PPL}$ candidates.
EXAMPLE
For a uniform distribution over $V$ candidates (e.g. vocabulary size), the cross-entropy is $\mathcal{L}_{\text{CE}} = \log V$, so $\text{PPL} = e^{\log V} = V$.
The lowest possible perplexity is reached at zero loss, $\mathcal{L}_{\text{CE}} = 0$, i.e. $\text{PPL} = e^{0} = 1$.
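
Both boundary cases can be checked numerically. The sketch below is only illustrative: `perplexity_from_loss` is a hypothetical helper, the vocabulary size `V` is an arbitrary assumed value, and the loss is assumed to be in nats.

```python
import math

def perplexity_from_loss(cross_entropy_nats):
    """Perplexity as exp of the mean cross-entropy in nats (hypothetical helper)."""
    return math.exp(cross_entropy_nats)

V = 50_000  # assumed vocabulary size, chosen only for illustration

# Uniform prediction: every candidate gets probability 1/V, so the loss is log V.
uniform_loss = -math.log(1.0 / V)
ppl_uniform = perplexity_from_loss(uniform_loss)   # approximately V

# Perfect prediction: the true label always gets probability 1, so the loss is 0.
perfect_loss = -math.log(1.0)
ppl_perfect = perplexity_from_loss(perfect_loss)   # exactly 1.0

print(ppl_uniform, ppl_perfect)
```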