The variance is the mean squared deviation (distance) of each data point from the overall mean of the dataset.
It measures how spread out the distribution is. Squaring the distance is mathematically nicer than taking absolute values: the square is differentiable everywhere, while the absolute value has a kink at zero.
Because the distances are squared, the variance is in squared units (e.g. m² for data measured in m). The standard deviation, the square root of the variance, corrects this and is back in the original units.
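Variance with expected value notation:
$$\mathrm{Var}(X) = \mathbb{E}\big[(X - \mathbb{E}[X])^2\big] = \mathbb{E}[X^2] - \mathbb{E}[X]^2$$
$\mathbb{E}[X^2]$ is also called the second moment; the variance itself is the second central moment.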
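As a quick numeric check, the sketch below computes the variance of a small made-up dataset in two ways: as the mean squared deviation from the mean, and as the second moment minus the squared mean, $\mathbb{E}[X^2] - \mathbb{E}[X]^2$. The dataset and variable names are illustrative assumptions, and the population (divide by $n$) convention is used.

```python
import math

# Illustrative data, chosen so the numbers come out round.
xs = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(xs)

mean = sum(xs) / n

# Definition: mean squared deviation from the mean, E[(X - mu)^2].
var_def = sum((x - mean) ** 2 for x in xs) / n

# Equivalent form: second moment minus squared mean, E[X^2] - E[X]^2.
second_moment = sum(x * x for x in xs) / n
var_moments = second_moment - mean ** 2

# The standard deviation is back in the units of the data.
std = math.sqrt(var_def)

print(mean, var_def, var_moments, std)  # 5.0 4.0 4.0 2.0
```

Both forms agree, and the standard deviation (2.0) is on the same scale as the data, while the variance (4.0) is on the squared scale.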
With bra-ket notation (angle brackets denoting the expectation):
$$\sigma^2 = \langle x^2 \rangle - \langle x \rangle^2$$
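When scaling a random variable $X$ by a constant $a$:
$$\mathrm{Var}(aX) = a^2 \, \mathrm{Var}(X)$$
This follows directly from the definition of variance:
$$\mathrm{Var}(aX) = \mathbb{E}\big[(aX - a\,\mathbb{E}[X])^2\big] = a^2 \, \mathbb{E}\big[(X - \mathbb{E}[X])^2\big] = a^2 \, \mathrm{Var}(X)$$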
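The scaling rule, $\mathrm{Var}(aX) = a^2\,\mathrm{Var}(X)$, can be checked on a small sample. This is a minimal sketch: the helper `variance`, the data, and the constant `a = 3` are illustrative assumptions, not part of the notes.

```python
def variance(xs):
    """Population variance: mean squared deviation from the mean."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / len(xs)

xs = [2, 4, 4, 4, 5, 5, 7, 9]  # illustrative data
a = 3                          # illustrative scaling constant

var_x = variance(xs)
var_ax = variance([a * x for x in xs])

# Scaling the data by a scales the variance by a^2.
print(var_x, var_ax, a ** 2 * var_x)  # 4.0 36.0 36.0
```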
For independent random variables $X$ and $Y$, variances add:
$$\mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Var}(Y)$$
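One way to see additivity, $\mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Var}(Y)$, without sampling noise is to enumerate a small joint distribution exactly. The sketch below uses two independent fair dice (an illustrative choice, not from the notes) and exact fractions. For contrast it also shows the fully dependent case $Y = X$, where the rule does not hold.

```python
from fractions import Fraction
from itertools import product

def variance(outcomes):
    """Exact variance of a distribution over equally likely outcomes."""
    n = len(outcomes)
    mean = Fraction(sum(outcomes), n)
    return sum((x - mean) ** 2 for x in outcomes) / n

die = [1, 2, 3, 4, 5, 6]

# Independence: all 36 (x, y) pairs are equally likely.
sums = [x + y for x, y in product(die, die)]

var_die = variance(die)   # 35/12
var_sum = variance(sums)  # 35/6 = 2 * 35/12

# Dependent case for contrast: Y = X, so X + Y = 2X.
var_dependent = variance([2 * x for x in die])  # 35/3 = 4 * 35/12

print(var_die, var_sum, var_dependent)  # 35/12 35/6 35/3
```

In the dependent case the variance is $4\,\mathrm{Var}(X)$ rather than $2\,\mathrm{Var}(X)$, which is why the independence condition matters.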
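Consider a neuron receiving $n$ inputs ("fan-in"). Each input $x_i$ has variance 1 (and, as assumed here, mean 0), and each weight $w_i$ is sampled uniformly from $\left[-\sqrt{3/n}, \sqrt{3/n}\right]$, independently of the inputs, so that $\mathbb{E}[w_i] = 0$ and $\mathrm{Var}(w_i) = \frac{(2\sqrt{3/n})^2}{12} = \frac{1}{n}$. The neuron computes:
$$y = \sum_{i=1}^{n} w_i x_i$$
For each term $w_i x_i$:
$$\mathrm{Var}(w_i x_i) = \mathbb{E}[w_i^2]\,\mathbb{E}[x_i^2] - \big(\mathbb{E}[w_i]\,\mathbb{E}[x_i]\big)^2 = \mathrm{Var}(w_i)\,\mathrm{Var}(x_i) = \frac{1}{n}$$
Summing $n$ such terms (they are uncorrelated, so their variances add):
$$\mathrm{Var}(y) = n \cdot \frac{1}{n} = 1$$
This maintains unit variance from layer to layer, which helps prevent activations, and with them gradients, from vanishing or exploding.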
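The fan-in argument can be checked with a small simulation. This is a rough sketch under stated assumptions rather than a reference implementation: inputs are drawn as standard Gaussians (mean 0, variance 1), weights are uniform on $[-\sqrt{3/n}, \sqrt{3/n}]$ so that each weight has variance $1/n$, and the function name `neuron_output_variance`, the fan-in of 256, the trial count, and the seed are all hypothetical choices.

```python
import math
import random

def neuron_output_variance(fan_in, trials=2000, seed=0):
    """Estimate Var(sum_i w_i * x_i) by repeated sampling."""
    rng = random.Random(seed)
    # Var(U(-b, b)) = b^2 / 3, so b = sqrt(3 / fan_in) gives Var(w) = 1 / fan_in.
    bound = math.sqrt(3.0 / fan_in)
    outputs = []
    for _ in range(trials):
        # Fresh weights and inputs on every trial (assumed independent).
        y = sum(
            rng.uniform(-bound, bound) * rng.gauss(0.0, 1.0)
            for _ in range(fan_in)
        )
        outputs.append(y)
    mean = sum(outputs) / trials
    return sum((y - mean) ** 2 for y in outputs) / trials

out_var = neuron_output_variance(fan_in=256)
print(out_var)  # should land close to 1, up to sampling noise
```

Because the estimate comes from a finite number of trials, it will only be approximately 1; a narrower weight interval would shrink the output variance and a wider one would grow it.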