Simply scales the input by the reciprocal of the root mean square of its values; there is no mean-centering and no shift.
The paper in the footnote argues that, in practice, hidden representations are already nearly orthogonal to the uniform vector $\mathbf{1}$ (i.e. nearly zero-mean), so LayerNorm's recentering step is largely redundant and RMSNorm suffices.

$$\mathrm{RMSNorm}(x) = \frac{x}{\mathrm{RMS}(x)} \odot \gamma, \qquad \mathrm{RMS}(x) = \sqrt{\frac{1}{d}\sum_{i=1}^{d} x_i^2 + \epsilon}$$

$\gamma$ … learnable scaling parameter (vector of same dimension as $x$)
$\epsilon$ … small constant added for numerical stability to prevent division by zero.
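As a minimal NumPy sketch of the formula above (the function name and plain-array `gamma` are my own conventions, not any library's API):

```python
import numpy as np

def rms_norm(x, gamma, eps=1e-6):
    # x: (..., d); gamma: learnable scale of shape (d,) (here just an array).
    # Divide by the root mean square over the last axis, then rescale.
    rms = np.sqrt(np.mean(x**2, axis=-1, keepdims=True) + eps)
    return x / rms * gamma
```

Note that the output keeps the direction of `x`; only its magnitude is normalized.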

Like layer normalization, it operates on each data point individually, i.e. it normalizes across the feature dimension of a single token rather than across a batch.

cf. LayerNorm

$$\mathrm{LayerNorm}(x) = \gamma \odot \frac{x - \mu}{\sigma} + \beta$$

Where $\mu$ and $\sigma$ are the mean and standard deviation of the features in $x$, and $\gamma$ and $\beta$ are learnable parameters for scaling and shifting.
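The same sketch for LayerNorm (again a hand-rolled NumPy illustration, with scalar `gamma`/`beta` for brevity and the usual `eps` inside the square root):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Statistics over the last (feature) axis of each data point.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta
```

With `gamma=1, beta=0` each row of the output has (approximately) zero mean and unit variance.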

cf. BatchNorm

$$\mathrm{BatchNorm}(x) = \gamma \odot \frac{x - \mu_B}{\sigma_B} + \beta$$

Where $\mu_B$ and $\sigma_B$ are the mean and standard deviation computed over a batch of data, and $\gamma$ and $\beta$ are learnable parameters for scaling and shifting.

Footnotes

  1. https://arxiv.org/pdf/2409.12951v1