https://paperswithcode.com/paper/layer-normalization

Types of normalization techniques in machine learning - comparison
Link to original

In contrast to batch normalization, every datapoint gets normalized individually, but across all of its features (i.e. within a layer). This removes the dependency on the batch, but features might have different scales.
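
For intuition, a minimal sketch (PyTorch; shapes are hypothetical): each datapoint (row) is normalized with its own mean and variance taken across its features.

```python
import torch

x = torch.randn(4, 8)                             # hypothetical batch: 4 datapoints, 8 features
mu = x.mean(dim=1, keepdim=True)                  # per-datapoint mean across features
var = x.var(dim=1, unbiased=False, keepdim=True)  # per-datapoint variance across features
x_ln = (x - mu) / torch.sqrt(var + 1e-5)          # every row now has ~zero mean and unit variance
```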


Essentially BatchNorm:

Input: Values of $x$ over a mini-batch: $\mathcal{B} = \{x_{1 \dots m}\}$

$$\mu_\mathcal{B} = \frac{1}{m}\sum_{i=1}^{m} x_i \quad \text{// mini-batch mean}$$
$$\sigma_\mathcal{B}^2 = \frac{1}{m}\sum_{i=1}^{m} (x_i - \mu_\mathcal{B})^2 \quad \text{// mini-batch variance}$$
$$\hat{x}_i = \frac{x_i - \mu_\mathcal{B}}{\sqrt{\sigma_\mathcal{B}^2 + \epsilon}} \quad \text{// normalize}$$

$\epsilon$ … noise parameter in case variance is 0 (div by 0)

Link to original
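
A minimal sketch of those three steps (PyTorch; function name and shapes are my own, training-mode statistics only):

```python
import torch

def batchnorm_forward(x, eps=1e-5):
    # x: (batch, features); statistics are taken over the batch dimension
    mean = x.mean(dim=0, keepdim=True)                 # mini-batch mean
    var = x.var(dim=0, unbiased=False, keepdim=True)   # mini-batch variance
    return (x - mean) / torch.sqrt(var + eps)          # normalize; eps guards against div by 0
```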

Since mean and variance are heavily dependent on the batch, we introduce learnable parameters (unit gaussian at initialization, but optimization might change that):
$\gamma$ (scale) approximates the true variance of the neuron activation and
$\beta$ (offset) approximates the true mean of the neuron activation.

Link to original
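
As a sketch of the scale and offset (self-contained; names are mine): at initialization $\gamma = 1$ and $\beta = 0$, so the output stays unit gaussian until the optimizer moves them.

```python
import torch

dim = 8                                           # hypothetical feature size
gamma = torch.ones(dim, requires_grad=True)       # scale, starts at 1
beta = torch.zeros(dim, requires_grad=True)       # offset, starts at 0

x = torch.randn(32, dim)
x_hat = (x - x.mean(0)) / torch.sqrt(x.var(0, unbiased=False) + 1e-5)  # normalized activations
y = gamma * x_hat + beta                          # scale and shift; gamma/beta are learned
```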

but better:

$$\mathrm{LN}(\mathbf{x}) = \boldsymbol{\gamma} \odot \frac{\mathbf{x} - \hat{\mu}}{\sqrt{\hat{\sigma}^2 + \epsilon}} + \boldsymbol{\beta}, \qquad \hat{\mu} = \frac{1}{d}\sum_{i=1}^{d} x_i, \qquad \hat{\sigma}^2 = \frac{1}{d}\sum_{i=1}^{d} (x_i - \hat{\mu})^2$$

where $d$ is the dimensionality of $\mathbf{x}$ and $\epsilon$ is a small number used for numerical stability.
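
The same formula as code (a sketch; variable names are mine), checked against PyTorch's built-in `F.layer_norm`:

```python
import torch
import torch.nn.functional as F

x = torch.randn(4, 8)                        # hypothetical (batch, d) activations
d, eps = x.shape[-1], 1e-5
mu = x.mean(dim=-1, keepdim=True)            # mean over the d features of each datapoint
sigma2 = x.var(dim=-1, unbiased=False, keepdim=True)
gamma, beta = torch.ones(d), torch.zeros(d)
ln = gamma * (x - mu) / torch.sqrt(sigma2 + eps) + beta

print(torch.allclose(ln, F.layer_norm(x, (d,), gamma, beta, eps)))  # should print True
```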

Why?

Batch Normalization vs Layer Normalization

Batch normalization is dependent on the batch size, as it normalizes across all the examples in a batch, whereas Layer Normalization normalizes per data point. (Batch Norm only works well with larger batch sizes, e.g. > 8.)
Layer Normalization also performs the same operations during training and testing. (BN doesn’t: it uses running statistics at test time.)
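
The second point can be checked quickly with PyTorch's built-in modules (a sketch; numbers are random):

```python
import torch
import torch.nn as nn

x = torch.randn(16, 8)
bn, ln = nn.BatchNorm1d(8), nn.LayerNorm(8)

bn.train(); bn_train = bn(x)
bn.eval();  bn_eval = bn(x)                # eval mode uses running statistics
print(torch.allclose(bn_train, bn_eval))   # False: output changes between modes

ln.train(); ln_train = ln(x)
ln.eval();  ln_eval = ln(x)                # same computation in both modes
print(torch.allclose(ln_train, ln_eval))   # True
```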

Layer Norm good for:

  • sequences
  • variable batch sizes (e.g. a varying number of workers)
  • parallelization

Not so good for:

  • CNNs
  • Features/Layers with different scales
Link to original

Example

Refer to batch normalization for more explanation.

layer normalization for fully connected layer $l$, neuron $i$ and feature size $H$:

$$\mu^l = \frac{1}{H}\sum_{i=1}^{H} a_i^l, \qquad (\sigma^l)^2 = \frac{1}{H}\sum_{i=1}^{H} (a_i^l - \mu^l)^2$$

Normalize feature $a_i^l$:

$$\hat{a}_i^l = \frac{a_i^l - \mu^l}{\sqrt{(\sigma^l)^2 + \epsilon}}$$

$\epsilon$ … noise parameter in case var is 0, see ^cd1cd1

Scale and shift normalized feature, with learnable params $\gamma_i$, $\beta_i$:

$$h_i^l = \gamma_i \hat{a}_i^l + \beta_i$$
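
A tiny numeric check of these steps (made-up values, a sketch):

```python
import torch

a = torch.tensor([1.0, 2.0, 3.0, 4.0])        # made-up activations of one layer, H = 4
H, eps = a.numel(), 1e-5
mu = a.mean()                                  # 2.5
sigma2 = ((a - mu) ** 2).mean()                # 1.25
a_hat = (a - mu) / torch.sqrt(sigma2 + eps)    # ≈ [-1.342, -0.447, 0.447, 1.342]
gamma, beta = torch.ones(H), torch.zeros(H)
h = gamma * a_hat + beta                       # identical to a_hat at initialization
```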

source

Code

(from Karpathy)
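
The snippet itself isn't reproduced here; below is a minimal sketch in the spirit of Karpathy's `LayerNorm1d` (class name, defaults and details are my assumption, not the verbatim original):

```python
import torch

class LayerNorm1d:
    # layer norm over the feature dimension of a (batch, dim) tensor (sketch)

    def __init__(self, dim, eps=1e-5):
        self.eps = eps
        self.gamma = torch.ones(dim)     # learnable scale
        self.beta = torch.zeros(dim)     # learnable shift

    def __call__(self, x):
        xmean = x.mean(1, keepdim=True)                    # per-example mean over features
        xvar = x.var(1, keepdim=True)                      # per-example variance over features
        xhat = (x - xmean) / torch.sqrt(xvar + self.eps)   # normalize to unit variance
        self.out = self.gamma * xhat + self.beta           # scale and shift
        return self.out

    def parameters(self):
        return [self.gamma, self.beta]
```

Usage: `ln = LayerNorm1d(100); y = ln(torch.randn(32, 100))`; each row of `y` then has roughly zero mean and unit variance.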