The variance is the mean squared deviation of each datapoint from the overall mean of the dataset.
It measures how spread out the distribution is. Squaring the deviations is nicer - mathematically - than taking absolute values, but it leaves the units squared; the standard deviation corrects this by taking the square root.
Variance with expected value notation:

$$\mathrm{Var}(X) = E\!\left[(X - E[X])^2\right]$$

With bra-ket notation:

$$\mathrm{Var}(X) = \left\langle (X - \langle X \rangle)^2 \right\rangle = \langle X^2 \rangle - \langle X \rangle^2$$
Variance is the second central moment.
$\langle X^2 \rangle$ is the second moment. When data isn’t centered at zero, $\langle X^2 \rangle$ gets artificially inflated by the mean’s distance from zero.
Example: Heights of 175 cm, 180 cm, 185 cm (mean = 180 cm)
- $\langle X^2 \rangle = \frac{175^2 + 180^2 + 185^2}{3} \approx 32416.67$ (huge!)
- $\langle X \rangle^2 = 180^2 = 32400$ (the artificial inflation from being centered at 180)
- $\mathrm{Var}(X) = 32416.67 - 32400 \approx 16.67$ (just the ±5 cm spread)
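The heights example can be checked directly with a few lines of stdlib Python (a sketch; variable names are mine):

```python
# Verify: second moment minus squared mean equals the variance.
heights = [175, 180, 185]
n = len(heights)

mean = sum(heights) / n                          # <X>   = 180
second_moment = sum(h * h for h in heights) / n  # <X^2> ≈ 32416.67
variance = second_moment - mean ** 2             # <X^2> - <X>^2

print(round(second_moment, 2))  # 32416.67
print(mean ** 2)                # 32400.0
print(round(variance, 2))       # 16.67
```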
Mathematically, we can decompose any $x_i$:

$$x_i^2 = \big((x_i - \mu) + \mu\big)^2 = (x_i - \mu)^2 + 2\mu(x_i - \mu) + \mu^2$$

Averaging over all $i$ (the middle term vanishes since $\langle x_i - \mu \rangle = 0$):

$$\langle x^2 \rangle = \langle (x - \mu)^2 \rangle + \mu^2 = \mathrm{Var}(X) + \mu^2$$
So $\langle X^2 \rangle$ contains both the actual spread (variance) and the “artificial inflation” ($\mu^2$) from being far from zero. Subtracting $\langle X \rangle^2 = \mu^2$ removes this inflation, isolating just the variance.
When scaling a random variable by a constant $a$:

$$\mathrm{Var}(aX) = a^2\,\mathrm{Var}(X)$$

This follows directly from the definition of variance:

$$\mathrm{Var}(aX) = \left\langle (aX - \langle aX \rangle)^2 \right\rangle = \left\langle a^2 (X - \langle X \rangle)^2 \right\rangle = a^2\,\mathrm{Var}(X)$$
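The scaling rule holds exactly for any sample, not just in expectation, because the constant factors out of the squared deviations. A small stdlib-Python sketch (sample size and seed are arbitrary):

```python
import random

random.seed(0)
xs = [random.gauss(0.0, 1.0) for _ in range(10_000)]

def var(vs):
    """Population variance: mean squared deviation from the mean."""
    m = sum(vs) / len(vs)
    return sum((v - m) ** 2 for v in vs) / len(vs)

a = 3.0
scaled = [a * x for x in xs]
print(var(scaled) / var(xs))  # ratio is a**2 = 9.0 (up to float error)
```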
For independent random variables, variances add:

$$\mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Var}(Y)$$
Counter-example (when independence fails): take $Y = X$. Then $\mathrm{Var}(X + X) = \mathrm{Var}(2X) = 4\,\mathrm{Var}(X) \neq 2\,\mathrm{Var}(X)$.
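Both cases are easy to see numerically with a Monte Carlo sketch in stdlib Python (sample size, seed, and tolerance are arbitrary choices of mine):

```python
import random

random.seed(1)
N = 100_000

def var(vs):
    """Population variance: mean squared deviation from the mean."""
    m = sum(vs) / len(vs)
    return sum((v - m) ** 2 for v in vs) / len(vs)

# Two independent unit-variance samples.
xs = [random.gauss(0, 1) for _ in range(N)]
ys = [random.gauss(0, 1) for _ in range(N)]

indep_sum = var([x + y for x, y in zip(xs, ys)])  # ≈ 1 + 1 = 2
dep_sum = var([x + x for x in xs])                # = 4 * Var(X), not 2

print(round(indep_sum, 2))  # close to 2.0
print(round(dep_sum / var(xs), 2))  # 4.0 (up to float error)
```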
Consider a neuron receiving $n$ inputs (“fan-in”). Each input $x_i$ has variance 1, and each weight $w_i$ is sampled uniformly from $\left[-\sqrt{3/n},\, \sqrt{3/n}\right]$, so that $\mathrm{Var}(w_i) = \frac{1}{n}$ (a uniform distribution on $[-a, a]$ has variance $a^2/3$). The neuron computes:

$$y = \sum_{i=1}^{n} w_i x_i$$
For each term $w_i x_i$ (with $w_i$ and $x_i$ independent and zero-mean, the variance of the product is the product of the variances):

$$\mathrm{Var}(w_i x_i) = \mathrm{Var}(w_i)\,\mathrm{Var}(x_i) = \frac{1}{n} \cdot 1 = \frac{1}{n}$$

Summing $n$ such independent terms:

$$\mathrm{Var}(y) = \sum_{i=1}^{n} \mathrm{Var}(w_i x_i) = n \cdot \frac{1}{n} = 1$$
This maintains unit variance through the network, preventing vanishing or exploding gradients.
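The fan-in argument can be simulated end to end in stdlib Python. This is a sketch, not a real network layer: the fan-in, trial count, and seed are arbitrary, and the weight bound $\sqrt{3/n}$ follows the uniform-initialization scheme described above.

```python
import math
import random

random.seed(2)
n = 64          # fan-in (illustrative choice)
trials = 5_000  # number of simulated neurons
bound = math.sqrt(3.0 / n)  # uniform on [-bound, bound] => Var(w) = 1/n

ys = []
for _ in range(trials):
    ws = [random.uniform(-bound, bound) for _ in range(n)]
    xs = [random.gauss(0.0, 1.0) for _ in range(n)]  # unit-variance inputs
    ys.append(sum(w * x for w, x in zip(ws, xs)))    # pre-activation

mean_y = sum(ys) / trials
var_y = sum((y - mean_y) ** 2 for y in ys) / trials
print(round(var_y, 2))  # close to 1.0
```

With the bound scaled as $1/\sqrt{n}$, the output variance stays near 1 regardless of how large the fan-in is; dropping the scaling would make the variance grow linearly with $n$.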