Standardization
Subtract the mean and divide by the standard deviation:

$$z = \frac{x - \mu}{\sigma}$$

The result has mean $0$ (zero mean) and standard deviation $1$ (unit variance).
Standardization also has a normalizing effect: for roughly Gaussian data it produces a standard normal distribution, which puts the overwhelming majority of values into a small range and makes very small / very large values very unlikely.
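A minimal NumPy sketch (illustrative, not from the source) of standardizing a single feature column:

```python
import numpy as np

x = np.array([3.0, 7.0, 5.0, 9.0, 1.0])   # a single feature column

z = (x - x.mean()) / x.std()              # subtract mean, divide by std

print(z.mean())   # ~0  (zero mean)
print(z.std())    # ~1  (unit variance)
```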

Normalization is often used instead of standardization in machine learning (standardization has similar, arguably even more useful, properties, but it does not strictly put the data into the range [0, 1] and it works best if your data follows a Gaussian distribution).


What / When to normalize / standardize?

It’s usual to normalize a feature (column) so that, having done this for each feature, the features will be on more comparable scales.

Normalizing across rows usually doesn’t make any physical sense (Imagine mashing a person’s height, weight, and blood pressure together.).

Fit the normalization on the training set only and apply the same transform to the test set, otherwise you leak information from the test set into training!
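A sketch of how to avoid this leak with scikit-learn (assuming `StandardScaler`; `MinMaxScaler` works the same way):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = np.random.rand(100, 3)                      # toy data: 100 samples, 3 features
X_train, X_test = train_test_split(X, test_size=0.2, random_state=0)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # statistics come from the training set only
X_test_scaled = scaler.transform(X_test)        # test set reuses the training statistics
```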

If the features do not follow a normal distribution before standardization, they will not follow it after the standardization either. 1

e.g. an exponential distribution will stay exponential after standardization (even though the mean will be 0 and the std will be 1).
So normalizing might be better than standardizing if your data is not Gaussian. 2
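A quick NumPy check (illustrative; the scale parameter is arbitrary) that standardization only shifts and rescales the data, so the exponential shape survives:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=100_000)   # heavily right-skewed data

z = (x - x.mean()) / x.std()                   # standardize

print(round(z.mean(), 3), round(z.std(), 3))   # ~0.0, ~1.0
print(round(z.min(), 2), round(z.max(), 2))    # ~-1.0 and ~10+: still heavily right-skewed
```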

Normalization

In algebra, normalizing means dividing a vector by its length (norm), which scales its components into the range [-1, 1] (or [0, 1] for non-negative data).

Transclude of vector#^cf975b

Min-max scaling also normalizes data to the range [0, 1] (or, with a shifted variant, [-1, 1]):

$$x' = \frac{x - \min(x)}{\max(x) - \min(x)}$$

The "normalization" obtained by standardization does not strictly put the values into the range [0, 1] or [-1, 1], but it is much more robust against outliers.
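A small illustration (not from the source) of why standardization is more robust to outliers than min-max scaling:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])       # one large outlier

minmax = (x - x.min()) / (x.max() - x.min())    # [0, 1] scaling
standard = (x - x.mean()) / x.std()             # standardization

print(minmax)    # the outlier squashes the inliers into [0, 0.03]
print(standard)  # inliers are less squashed (spread ~0.08), the outlier shows up as a large z-score
```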

Types of normalization techniques in machine learning - comparison


Figure: the normalization variants compared visually (they all do the same thing, just over different slices of the data); the axes of the cube are the channels, the batch size, and a 1D flattening of the outputs within a channel.
Explained by Yannik.

More rigorously: 3

Batch Norm

Given an input batch $x \in \mathbb{R}^{N \times C \times H \times W}$, BN normalizes the mean and standard deviation for each individual feature channel:

$$\mathrm{BN}(x) = \gamma \left( \frac{x - \mu(x)}{\sigma(x)} \right) + \beta$$

where $\gamma, \beta \in \mathbb{R}^C$ are affine parameters learned from data; $\mu(x), \sigma(x) \in \mathbb{R}^C$ are the mean and standard deviation, computed across batch size and spatial dimensions independently for each feature channel:

$$\mu_c(x) = \frac{1}{NHW} \sum_{n=1}^{N} \sum_{h=1}^{H} \sum_{w=1}^{W} x_{nchw}$$

$$\sigma_c(x) = \sqrt{\frac{1}{NHW} \sum_{n=1}^{N} \sum_{h=1}^{H} \sum_{w=1}^{W} \bigl(x_{nchw} - \mu_c(x)\bigr)^2 + \epsilon}$$

BN uses mini-batch statistics during training and replaces them with population statistics during inference, introducing a discrepancy between training and inference.
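A PyTorch sketch (illustrative; learnable $\gamma$, $\beta$ disabled and `eps` left at its default) of what these formulas compute, checked against `nn.BatchNorm2d`:

```python
import torch
import torch.nn as nn

x = torch.randn(8, 3, 32, 32)                     # batch N=8, channels C=3, H=W=32

# manual BN: statistics over batch and spatial dims (N, H, W), one value per channel
mu = x.mean(dim=(0, 2, 3), keepdim=True)          # shape (1, C, 1, 1)
var = x.var(dim=(0, 2, 3), keepdim=True, unbiased=False)
x_bn_manual = (x - mu) / torch.sqrt(var + 1e-5)   # gamma=1, beta=0

bn = nn.BatchNorm2d(3, affine=False)              # no learnable gamma/beta, training mode
print(torch.allclose(x_bn_manual, bn(x), atol=1e-5))   # True
```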

Instance Norm

In the original feed-forward stylization method, the style transfer network contains a BN layer after each convolutional layer. Surprisingly, Ulyanov et al. found that significant improvement could be achieved simply by replacing BN layers with IN layers:

$$\mathrm{IN}(x) = \gamma \left( \frac{x - \mu(x)}{\sigma(x)} \right) + \beta$$

Different from BN layers, here $\mu(x)$ and $\sigma(x)$ are computed across spatial dimensions independently for each channel and each sample:

$$\mu_{nc}(x) = \frac{1}{HW} \sum_{h=1}^{H} \sum_{w=1}^{W} x_{nchw}$$

$$\sigma_{nc}(x) = \sqrt{\frac{1}{HW} \sum_{h=1}^{H} \sum_{w=1}^{W} \bigl(x_{nchw} - \mu_{nc}(x)\bigr)^2 + \epsilon}$$

Another difference is that IN layers are applied unchanged at test time, whereas BN layers usually replace mini-batch statistics with population statistics (the estimated mean and variance of the entire training dataset).
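A companion sketch for IN under the same assumptions as the BN example; the only change is that the statistics are now per sample and per channel (spatial dims only):

```python
import torch
import torch.nn as nn

x = torch.randn(8, 3, 32, 32)

# manual IN: statistics over spatial dims (H, W) only, one value per sample and channel
mu = x.mean(dim=(2, 3), keepdim=True)                   # shape (N, C, 1, 1)
var = x.var(dim=(2, 3), keepdim=True, unbiased=False)
x_in_manual = (x - mu) / torch.sqrt(var + 1e-5)

inorm = nn.InstanceNorm2d(3)                            # affine=False, no running stats by default
print(torch.allclose(x_in_manual, inorm(x), atol=1e-5)) # True
```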

Adaptive Instance Normalization

Instance normalization normalizes the input to a single style specified by the affine parameters ($\gamma$, $\beta$). AdaIN was introduced to adapt to arbitrarily given styles (in generative AI) by using adaptive affine transformations. AdaIN receives a content input $x$ and a style input $y$, and aligns the channel-wise mean and variance of $x$ to match those of $y$. There are no learnable affine parameters (like in IN or conditional IN). Instead, it "adaptively computes affine parameters from the style input":

$$\mathrm{AdaIN}(x, y) = \sigma(y) \left( \frac{x - \mu(x)}{\sigma(x)} \right) + \mu(y)$$

TL;DR: Style is encoded through scaling and shifting of the normalized values ($\gamma$ and $\beta$ are learned through a linear layer).

Like in IN, the statistics are computed across spatial dimensions. Intuitively, let us consider a feature channel that detects brushstrokes of a certain style. A style image with this kind of strokes will produce a high average activation for this feature. The output produced by AdaIN will have the same high average activation for this feature, while preserving the spatial structure of the content image. In the original paper, $\gamma$ and $\beta$ are simply replaced by the standard deviation $\sigma(y)$ and mean $\mu(y)$ of the style input, normalizing the content to the given style, which is a bit less flexible. 4 5
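A minimal PyTorch sketch of the AdaIN operation as in the formula above (the function name `adain` and the `eps` guard are my own additions; the feature shapes are just an example):

```python
import torch

def adain(content: torch.Tensor, style: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """Align the channel-wise mean/std of `content` to those of `style`.
    Both tensors have shape (N, C, H, W); statistics are per sample and channel."""
    c_mu = content.mean(dim=(2, 3), keepdim=True)
    c_std = content.std(dim=(2, 3), keepdim=True, unbiased=False) + eps
    s_mu = style.mean(dim=(2, 3), keepdim=True)
    s_std = style.std(dim=(2, 3), keepdim=True, unbiased=False)
    return s_std * (content - c_mu) / c_std + s_mu   # sigma(y) * (x - mu(x)) / sigma(x) + mu(y)

content = torch.randn(1, 512, 32, 32)   # e.g. encoder features of the content image
style = torch.randn(1, 512, 32, 32)     # encoder features of the style image
out = adain(content, style)
```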

Scaling


See the Kaggle notebook on scaling and normalization for worked examples. 6

Batch Normalization vs Layer Normalization

Batch normalization is dependent on the batch size, as it normalizes across the samples in a batch, whereas layer normalization normalizes each data point on its own. (Batch norm only works well with larger batch sizes, roughly > 8; see the sketch after the lists below.)
Layer Normalization also performs the same operations during training and testing. (BN doesn’t)

Layer Norm good for:

  • sequences
  • variable batch sizes / number of workers
  • parallelization

Not so good for:

  • CNNs
  • Features/Layers with different scales
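A small PyTorch sketch (illustrative) of the axis difference: `BatchNorm1d` normalizes each feature across the batch, while `LayerNorm` normalizes each sample across its features, so LN gives the same result regardless of batch size.

```python
import torch
import torch.nn as nn

x = torch.randn(4, 10)                                # batch of 4 samples, 10 features

bn = nn.BatchNorm1d(10, affine=False)                 # stats per feature, across the 4 samples
ln = nn.LayerNorm(10, elementwise_affine=False)       # stats per sample, across the 10 features

print(bn(x).mean(dim=0))   # ~0 per feature (depends on the other samples in the batch)
print(ln(x).mean(dim=1))   # ~0 per sample (independent of the batch)

# LN on a single sample matches its result inside the full batch;
# BN on a single sample in training mode is ill-defined (statistics over one element).
print(torch.allclose(ln(x[:1]), ln(x)[:1], atol=1e-6))   # True
```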

References

Footnotes

  1. Standardization of non-normal features

  2. How, When, and Why Should You Normalize / Standardize / Rescale Your Data?

  3. Good summary in section 3-5: Arbitrary Style Transfer in Real-time with Adaptive Instance Normalization

  4. Talk by authors

  5. Coursera expl.

  6. https://www.kaggle.com/code/alexisbcook/scaling-and-normalization