https://paperswithcode.com/paper/layer-normalization

Transclude of Normalization---Scaling---Standardization#^f74628

In contrast to batch normalization, every datapoint gets normalized individually, but across all of its features (the whole layer). This removes the dependency on the batch, but the features being averaged over might have different scales.
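A quick way to see the difference in normalization axes (my own sketch in PyTorch; the shapes are illustrative and not from the source):

```python
import torch

x = torch.randn(32, 64)  # (batch, features)

# Batch norm statistics: one mean/variance per feature, computed across the batch
mu_bn = x.mean(dim=0)                                 # shape (64,)
var_bn = x.var(dim=0, unbiased=False)                 # shape (64,)

# Layer norm statistics: one mean/variance per datapoint, computed across its features
mu_ln = x.mean(dim=1, keepdim=True)                   # shape (32, 1)
var_ln = x.var(dim=1, unbiased=False, keepdim=True)   # shape (32, 1)
```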


Essentially BatchNorm:

Input: Values of $x$ over a mini-batch $\mathcal{B} = \{x_1, \ldots, x_m\}$:
$\mu_\mathcal{B} = \frac{1}{m} \sum_{i=1}^{m} x_i$ // mini-batch mean
$\sigma_\mathcal{B}^2 = \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu_\mathcal{B})^2$ // mini-batch variance
$\hat{x}_i = \frac{x_i - \mu_\mathcal{B}}{\sqrt{\sigma_\mathcal{B}^2 + \epsilon}}$ // normalize
$\epsilon$ … noise parameter in case variance is 0 (div by 0)
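As code, the same steps might look like this (a minimal sketch, assuming a 1D tensor of activations for a single neuron over the mini-batch):

```python
import torch

def batchnorm_normalize(x, eps=1e-5):
    # x: activations of one neuron over a mini-batch, shape (m,)
    mu = x.mean()                              # mini-batch mean
    var = x.var(unbiased=False)                # mini-batch variance
    return (x - mu) / torch.sqrt(var + eps)    # normalize; eps guards against var == 0
```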


Since mean and variance are heavily dependent on the batch, we introduce learnable parameters (unit Gaussian at initialization, but optimization might change that):
$\gamma$ (scale) approximates the true variance of the neuron activation and
$\beta$ (offset) approximates the true mean of the neuron activation.
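A minimal sketch of these learnable parameters (names and module layout are my own, in PyTorch):

```python
import torch
import torch.nn as nn

class ScaleShift(nn.Module):
    """Learnable scale (gamma) and offset (beta) applied to normalized activations."""
    def __init__(self, dim):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(dim))   # scale, init 1
        self.beta = nn.Parameter(torch.zeros(dim))   # offset, init 0

    def forward(self, x_hat):
        return self.gamma * x_hat + self.beta
```

At initialization $\gamma = 1$ and $\beta = 0$, so the output is just the (unit Gaussian) normalized activation; training can move it away from that.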


But better, the layer normalization version:

$$\mathrm{LN}(x) = \gamma \odot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta, \qquad \mu = \frac{1}{d} \sum_{i=1}^{d} x_i, \qquad \sigma^2 = \frac{1}{d} \sum_{i=1}^{d} (x_i - \mu)^2$$

where $d$ is the dimensionality of $x$ and $\epsilon$ is a small number used for numerical stability.
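A quick sanity check of this formula (my own sketch; the comparison against torch.nn.functional.layer_norm is an assumption of equivalence, not something the source states):

```python
import torch
import torch.nn.functional as F

x = torch.randn(4, 8)   # 4 datapoints, d = 8 features
eps = 1e-5

mu = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, unbiased=False, keepdim=True)
x_hat = (x - mu) / torch.sqrt(var + eps)   # gamma = 1, beta = 0

# should match PyTorch's built-in layer norm without affine parameters
assert torch.allclose(x_hat, F.layer_norm(x, (8,), eps=eps), atol=1e-5)
```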

Why?

Transclude of Normalization---Scaling---Standardization#batch-normalization-vs-layer-normalization

Example

Refer to batch normalization for more explanation.

Layer normalization for a fully connected layer with activation vector $x \in \mathbb{R}^d$, neuron $x_i$ and feature size $d$:

$$\mu = \frac{1}{d} \sum_{i=1}^{d} x_i, \qquad \sigma^2 = \frac{1}{d} \sum_{i=1}^{d} (x_i - \mu)^2$$

Normalize feature $x_i$:

$$\hat{x}_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}}$$

$\epsilon$ … noise parameter in case var is 0, see ^cd1cd1

Scale and shift the normalized feature, with learnable params $\gamma_i$ and $\beta_i$:

$$y_i = \gamma_i \hat{x}_i + \beta_i$$

source
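A worked instance with concrete numbers (my own example): take $x = (1, 2, 3, 4)$, so $d = 4$, and let $\gamma_i = 1$, $\beta_i = 0$, $\epsilon \approx 0$:

$$\mu = \tfrac{1}{4}(1 + 2 + 3 + 4) = 2.5, \qquad \sigma^2 = \tfrac{1}{4}\left((-1.5)^2 + (-0.5)^2 + (0.5)^2 + (1.5)^2\right) = 1.25$$

$$\hat{x} = \frac{x - 2.5}{\sqrt{1.25}} \approx (-1.342,\ -0.447,\ 0.447,\ 1.342)$$

so the datapoint ends up with zero mean and unit variance over its features before the scale and shift.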

Code

(from Karpathy)
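A minimal from-scratch sketch in the spirit of Karpathy's LayerNorm1d from the makemore series (my reconstruction, not a verbatim copy):

```python
import torch

class LayerNorm1d:
    """Layer norm over the feature dimension of a (batch, dim) tensor."""

    def __init__(self, dim, eps=1e-5):
        self.eps = eps
        self.gamma = torch.ones(dim)    # learnable scale
        self.beta = torch.zeros(dim)    # learnable offset

    def __call__(self, x):
        xmean = x.mean(1, keepdim=True)                    # mean over features, per datapoint
        xvar = x.var(1, keepdim=True)                      # variance over features, per datapoint
        xhat = (x - xmean) / torch.sqrt(xvar + self.eps)   # normalize to unit variance
        self.out = self.gamma * xhat + self.beta           # scale and shift
        return self.out

    def parameters(self):
        return [self.gamma, self.beta]

# usage:
# ln = LayerNorm1d(100)
# x = torch.randn(32, 100)
# out = ln(x)   # out.mean(1) ~ 0, out.std(1) ~ 1
```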