Transclude of Normalization---Scaling---Standardization#^f74628
Input: Values of $x$ over a mini-batch: $\mathcal{B} = \{x_{1 \dots m}\}$
$\mu_\mathcal{B} = \frac{1}{m} \sum_{i=1}^{m} x_i$ // mini-batch mean
$\sigma_\mathcal{B}^2 = \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu_\mathcal{B})^2$ // mini-batch variance
$\hat{x}_i = \frac{x_i - \mu_\mathcal{B}}{\sqrt{\sigma_\mathcal{B}^2 + \epsilon}}$ // normalize
$\epsilon$ … small noise parameter in case the variance is 0 (avoids division by 0)
Since mean and variance are heavily dependent on the batch, we introduce learnable parameters $\gamma$ and $\beta$ (the output is unit Gaussian at initialization, but optimization might change that): $y_i = \gamma \hat{x}_i + \beta$
$\gamma$ (scale) approximates the true variance of the neuron activation and
$\beta$ (offset) approximates the true mean of the neuron activation.
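A minimal sketch of this forward pass (assuming a 2-D activation tensor of shape (batch, features); the names `batch_norm_forward`, `gamma`, `beta` and `eps` are just illustrative):

```python
import torch

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    # x: activations of one layer, shape (batch, features)
    mu = x.mean(dim=0)                        # mini-batch mean, per feature
    var = x.var(dim=0, unbiased=False)        # mini-batch variance, per feature
    x_hat = (x - mu) / torch.sqrt(var + eps)  # normalize; eps avoids division by 0
    return gamma * x_hat + beta               # scale and shift with learnable params

x = torch.randn(32, 64) * 3 + 5               # activations with arbitrary scale/offset
gamma, beta = torch.ones(64), torch.zeros(64) # at init the output is unit Gaussian
y = batch_norm_forward(x, gamma, beta)
```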
Usually placed after layers that have multiplications (Linear, Conv, …)
You want to center your data and normalize it (rescale the axes) so the features are more like Gaussians → the signal propagates better
(+ you could even save a parameter, e.g. if your data can be classified by $wx + b$: if it's centered, you don't need the $b$ anymore)
Batch norm transforms the data back into a normal distribution after each layer, so numbers don't get too big / small over time (continuously re-centering and re-scaling the data).
Drawback: You would need to know all of the data points to determine the true mean and variance. You can still estimate them from smaller mini-batches, but the smaller the batch, the noisier the estimate.
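In practice this is handled with a running estimate. A short sketch with `torch.nn.BatchNorm1d`, which updates an exponential running mean/variance during training and reuses that fixed estimate at inference (batch size and values here are arbitrary):

```python
import torch
import torch.nn as nn

bn = nn.BatchNorm1d(64)   # keeps running_mean / running_var buffers by default

bn.train()                # training mode: normalize with each mini-batch's own stats
for _ in range(200):
    batch = torch.randn(8, 64) * 2 + 1   # true per-feature mean ~1, variance ~4
    bn(batch)             # the forward pass also updates the running estimates

bn.eval()                 # eval mode: normalize with the accumulated estimates
print(bn.running_mean[:3], bn.running_var[:3])   # should be close to 1 and 4
```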
Other benefits
Speeds up training
Small variations in height make a huge difference on the output, due to the varying scales of the features.
After normalization → Loss smoothed
Transclude of Normalization---Scaling---Standardization#^2beba4
The loss function is more symmetric

You could also use an adaptive optimizer like Adam to get a separate effective learning rate for age and for height (3D loss surface), but you should still normalize.
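A rough sketch of that alternative (feature values and learning rate are made up; Adam keeps a per-parameter adaptive step size, SGD a single global one):

```python
import torch
import torch.nn as nn

# two features on very different scales, e.g. [age_in_years, height_in_mm]
x = torch.tensor([[25.0, 1800.0], [40.0, 1650.0]])
y = torch.tensor([[0.0], [1.0]])

model = nn.Linear(2, 1)
opt = torch.optim.Adam(model.parameters(), lr=1e-2)   # per-parameter step sizes
# opt = torch.optim.SGD(model.parameters(), lr=1e-2)  # one global step size

loss = nn.functional.binary_cross_entropy_with_logits(model(x), y)
loss.backward()
opt.step()   # Adam rescales each parameter's update; normalizing x still helps
```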
Makes initial weights less important
… we arrive at the minimum in a similar number of steps on the more symmetric, circular loss surface, but it might take much longer for the unnormalized data, depending on the initial weights.
Regularization
The randomness in the batch-norm statistics also acts as a mild regularizer (it introduces a bit of entropy / jitter, since each example's normalization depends on the other examples in the batch; sort of a data augmentation).
Stability
Also aids model stability → inputs are in similar ranges → more similar weights + helps to prevent vanishing and exploding gradients.
In practice:
You might find better results depending on whether you put it before or after the activation function.
You can save parameters by turning off biases for the preceding layers, since the offset $\beta$ basically takes over the role of the bias.
`torch.nn.Linear(..., ..., bias=False)`
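A small sketch of how this could look (layer sizes are arbitrary; BN is placed right after the linear layer here, before the activation):

```python
import torch.nn as nn

block = nn.Sequential(
    nn.Linear(128, 64, bias=False),  # bias dropped: BatchNorm's beta offsets anyway
    nn.BatchNorm1d(64),              # right after the multiplication layer
    nn.ReLU(),                       # placing BN after the activation is also worth trying
)
```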
Batch Normalization in depth explanation