GLU
…sigmoid
The first half acts as a gate that decides, per-feature, how much of the second half to let through.
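A minimal sketch of the gating for a single token, in plain Python. The names `G` (gate projection) and `V` (value projection) are borrowed from the xG / xV notation used later in these notes; the toy weights are hypothetical and chosen so that one gate is nearly closed and the other nearly open.

```python
import math

def matvec(x, W):
    """Row vector x (length d_in) times matrix W (d_in x d_out)."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def glu(x, G, V):
    """Gated linear unit for one token: sigmoid(xG) * (xV), elementwise."""
    gate = [sigmoid(z) for z in matvec(x, G)]   # each entry lies in (0, 1)
    value = matvec(x, V)                        # plain linear projection
    return [g * v for g, v in zip(gate, value)]

# Hypothetical toy weights: xG = [-10, 10], xV = [1, 1].
x = [1.0, -2.0]
G = [[-10.0, 10.0], [0.0, 0.0]]
V = [[1.0, 1.0], [0.0, 0.0]]

out = glu(x, G, V)
print(out)  # first feature is almost fully blocked, second passes almost unchanged
```

Both value features are equal here, so the difference in the output comes entirely from the gate.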
Compare to a vanilla MLP:
The computation is factorized: it is expressed as a product of simpler functions. This gives it combinatorial coverage of multiplicative interactions between features, which an MLP can only approximate.
But this does not make GLU capable of in-context learning (ICL); compare to self-attention:
In both cases, one part computes fast weights and another applies them to a projection of the input.
But attention’s dynamic weight matrix mixes across tokens, while GLU is pointwise across tokens.
GLU's gating is more like dynamically selecting which computation to apply to each token.
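The contrast between token-mixing attention and pointwise GLU can be checked with a small experiment: change one token and see whether the output at a different position moves. This is a sketch under simplifying assumptions: the attention here is single-head with identity query/key/value projections, and all weights and inputs are hypothetical.

```python
import math

def matvec(x, W):
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

def glu_token(x, G, V):
    gate = [1.0 / (1.0 + math.exp(-z)) for z in matvec(x, G)]
    return [g * v for g, v in zip(gate, matvec(x, V))]

def glu_layer(X, G, V):
    # Each token is processed on its own; no information crosses positions.
    return [glu_token(x, G, V) for x in X]

def attention_layer(X):
    # Simplified self-attention: queries, keys and values are the tokens themselves.
    d = len(X[0])
    out = []
    for q in X:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in X]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]      # depends on every token
        out.append([sum(w * v[j] for w, v in zip(weights, X)) for j in range(d)])
    return out

G = [[0.5, -1.0], [1.0, 0.25]]
V = [[1.0, 0.0], [0.5, 2.0]]
X = [[1.0, 0.0], [0.0, 1.0]]
X_perturbed = [[1.0, 0.0], [3.0, -1.0]]          # only token 1 changes

# Does the output at position 0 stay the same when token 1 changes?
glu_same = glu_layer(X, G, V)[0] == glu_layer(X_perturbed, G, V)[0]
attn_same = attention_layer(X)[0] == attention_layer(X_perturbed)[0]
print(glu_same, attn_same)
```

The GLU output at position 0 is unaffected by the other token, while the attention output at position 0 changes, which is why the gate alone cannot read from the context.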
Since GLU adds an extra weight matrix, we scale down the hidden width of each layer to keep the total parameter count the same:
Modern version: SwiGLU, which delivers consistent improvements across pre-training and downstream benchmark tasks.
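As a rough check of the bookkeeping, the sketch below assumes a standard MLP with two matrices (up and down projection) and a GLU block with three (gate, value, down projection), ignoring biases. Under that assumption, matching 3 · d_model · h' = 2 · d_model · h gives h' = (2/3) · h. The sizes are illustrative, not taken from any particular model.

```python
def mlp_params(d_model, d_ff):
    # Up projection (d_model x d_ff) plus down projection (d_ff x d_model).
    return 2 * d_model * d_ff

def glu_params(d_model, d_ff):
    # Gate and value projections (d_model x d_ff each) plus down projection.
    return 3 * d_model * d_ff

d_model, d_ff = 768, 3072          # illustrative sizes
d_ff_glu = 2 * d_ff // 3           # hidden width scaled by 2/3

print(mlp_params(d_model, d_ff), glu_params(d_model, d_ff_glu))
```

When `d_ff` is not divisible by 3 the match is only approximate, and in practice the width is often rounded to a hardware-friendly multiple.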
Computationally efficient version ( = ; ): squared ReLU
Challenge: over long training runs, the quadratic branching can produce outliers when xG and xV align. This is a problem for FP8 training, and is addressed in scaling-fp8-training-to-trillion-token-llms.
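To illustrate why alignment matters, the sketch below looks at a single SwiGLU output feature, SiLU(g) · v, where g and v stand for one coordinate of xG and xV. It reads "aligned" as both pre-activations being large with the same sign, which is an interpretation for illustration rather than the cited work's exact definition.

```python
import math

def silu(z):
    """SiLU / Swish activation: z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))

def swiglu_feature(g, v):
    """One SwiGLU output feature from pre-activations g = (xG)_j and v = (xV)_j."""
    return silu(g) * v

scales = (1.0, 10.0, 100.0)
aligned = [swiglu_feature(s, s) for s in scales]      # g and v grow together
misaligned = [swiglu_feature(-s, s) for s in scales]  # gate pre-activation is negative

print(aligned)
print(misaligned)
```

For large aligned inputs SiLU(g) is close to g, so the output behaves like g · v and grows roughly quadratically with the scale, whereas a strongly negative gate pre-activation keeps the product near zero. Values that grow this fast are hard to represent in the narrow dynamic range of FP8.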