Simply rescales the input by the root mean square of its values, with a learnable gain:

RMSNorm(x) = (x / RMS(x)) ⊙ g,  where RMS(x) = √((1/n) Σᵢ xᵢ² + ε)
Unlike LayerNorm, RMSNorm does not subtract the mean; the RMSNorm paper argues that this re-centering step is largely dispensable, so dropping it saves computation without hurting stability.
g — learnable scaling parameter (vector of the same dimension as x)
ε — small constant added for numerical stability, to prevent division by zero.
Like layer normalization, it operates at the level of individual data points, i.e. normalizing across the feature dimension of each token rather than across a batch or sequence.
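A minimal numpy sketch of the formula above (function name `rms_norm` is my own; the normalization runs over the last axis, matching the per-token behavior just described):

```python
import numpy as np

def rms_norm(x, g, eps=1e-8):
    # Divide x by its root mean square over the feature axis, then apply gain g.
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return x / rms * g

x = np.array([3.0, -4.0])   # RMS(x) = sqrt((9 + 16) / 2) = sqrt(12.5)
g = np.ones_like(x)
y = rms_norm(x, g)          # output has unit root mean square
```

With g fixed at 1, the output always has RMS ≈ 1; the gain then restores per-feature scale.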
cf. LayerNorm:

LN(x) = ((x − μ) / √(σ² + ε)) ⊙ γ + β

where μ and σ are the mean and standard deviation of the features in x, and γ and β are learnable parameters for scaling and shifting.
cf. BatchNorm:

BN(x) = ((x − μ_B) / √(σ_B² + ε)) ⊙ γ + β

where μ_B and σ_B are the mean and standard deviation computed over a batch of data, and γ and β are learnable parameters for scaling and shifting.
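The difference between the two is only the axis the statistics are taken over — a sketch in numpy (function names are my own; `x` is a toy batch of shape (batch, features)):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Per-row statistics: each data point is normalized over its own features.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps) * gamma + beta

def batch_norm(x, gamma, beta, eps=1e-5):
    # Per-column statistics: each feature is normalized over the batch.
    mu = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return (x - mu) / np.sqrt(var + eps) * gamma + beta

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 4))
gamma, beta = np.ones(4), np.zeros(4)

ln = layer_norm(x, gamma, beta)   # each row: ~zero mean, ~unit variance
bn = batch_norm(x, gamma, beta)   # each column: ~zero mean, ~unit variance
```

So LayerNorm's behavior is independent of batch size, while BatchNorm couples every example to the batch statistics.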