Simply rescales the input by the root mean square of its values, with a learnable gain:

RMSNorm(x) = (x / RMS(x)) ⊙ g,  where RMS(x) = √((1/n) Σᵢ xᵢ² + ε)
Unlike LayerNorm, RMSNorm does not subtract the mean; the RMSNorm paper argues that this re-centering step is largely dispensable, so dropping it saves computation without hurting stability.
g — learnable scaling parameter (vector of the same dimension as x)
ε — small constant added for numerical stability, to prevent division by zero.
Like layer normalization, it operates at the level of individual data points, i.e. normalizing across the feature dimension of each token rather than across a batch or sequence.
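A minimal numpy sketch of the formula above (function name `rms_norm` is my own; the normalization runs over the last axis, matching the per-token behavior just described):

```python
import numpy as np

def rms_norm(x, g, eps=1e-8):
    # Divide x by its root mean square over the feature axis, then apply gain g.
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return x / rms * g

x = np.array([3.0, -4.0])   # RMS(x) = sqrt((9 + 16) / 2) = sqrt(12.5)
g = np.ones_like(x)
y = rms_norm(x, g)          # output has unit root mean square
```

With g fixed at 1, the output always has RMS ≈ 1; the gain then restores per-feature scale.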
cf. LayerNorm:

LN(x) = ((x − μ) / √(σ² + ε)) ⊙ γ + β

where μ and σ are the mean and standard deviation of the features in x, and γ and β are learnable parameters for scaling and shifting.
cf. BatchNorm:

BN(x) = ((x − μ_B) / √(σ_B² + ε)) ⊙ γ + β

where μ_B and σ_B are the mean and standard deviation computed over a batch of data, and γ and β are learnable parameters for scaling and shifting.
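The difference between the two is only the axis the statistics are taken over — a sketch in numpy (function names are my own; `x` is a toy batch of shape (batch, features)):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Per-row statistics: each data point is normalized over its own features.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps) * gamma + beta

def batch_norm(x, gamma, beta, eps=1e-5):
    # Per-column statistics: each feature is normalized over the batch.
    mu = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return (x - mu) / np.sqrt(var + eps) * gamma + beta

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 4))
gamma, beta = np.ones(4), np.zeros(4)

ln = layer_norm(x, gamma, beta)   # each row: ~zero mean, ~unit variance
bn = batch_norm(x, gamma, beta)   # each column: ~zero mean, ~unit variance
```

So LayerNorm's behavior is independent of batch size, while BatchNorm couples every example to the batch statistics.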