The variance is the mean squared deviation of each datapoint from the overall mean of the dataset.

It measures how spread out the distribution is. Squaring the deviations is nicer — mathematically — than taking absolute values (it is differentiable and leads to clean identities), but it leaves the units squared. The standard deviation corrects this by taking the square root.

Variance with expected value notation:

$$\mathrm{Var}(X) = E\!\left[(X - E[X])^2\right] = E[X^2] - E[X]^2$$

With bra-ket notation:

$$\mathrm{Var}(X) = \left\langle \left(X - \langle X \rangle\right)^2 \right\rangle = \langle X^2 \rangle - \langle X \rangle^2$$

Variance is the second central moment.

$E[X^2]$ is the second moment. When the data isn’t centered at zero, $E[X^2]$ gets artificially inflated by the mean’s distance from zero.

Example: Heights of 175cm, 180cm, 185cm (mean = 180cm)

  • $E[X^2] = \frac{175^2 + 180^2 + 185^2}{3} \approx 32416.7$ (huge!)
  • $E[X]^2 = 180^2 = 32400$ (the artificial inflation from being centered at 180)
  • $\mathrm{Var}(X) = E[X^2] - E[X]^2 \approx 16.7$ (just the ±5cm spread)
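The heights example can be checked numerically; a minimal sketch with NumPy, using the three data values from the example above:

```python
import numpy as np

heights = np.array([175.0, 180.0, 185.0])

second_moment = np.mean(heights ** 2)    # E[X^2]
mean_squared = np.mean(heights) ** 2     # E[X]^2
variance = second_moment - mean_squared  # Var(X) = E[X^2] - E[X]^2

print(round(second_moment, 1))  # 32416.7
print(round(mean_squared, 1))   # 32400.0
print(round(variance, 1))       # 16.7
```

The last line matches `np.var(heights)` computed directly from the deviations.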

Mathematically, we can decompose any $x_i$ around the mean $\mu$:

$$x_i^2 = \big((x_i - \mu) + \mu\big)^2 = (x_i - \mu)^2 + 2\mu(x_i - \mu) + \mu^2$$

Averaging over all $x_i$ (the cross term vanishes because deviations from the mean average to zero):

$$E[X^2] = E\!\left[(X - \mu)^2\right] + 2\mu\,E[X - \mu] + \mu^2 = \mathrm{Var}(X) + \mu^2$$

So $E[X^2]$ contains both the actual spread (variance) and the “artificial inflation” ($\mu^2$) from being far from zero. Subtracting $E[X]^2 = \mu^2$ removes this inflation, isolating just the variance.
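The decomposition holds for any dataset, not just the heights example; a quick sanity check sketched with NumPy (using population variance, i.e. `ddof=0`, and arbitrary sample data centered far from zero):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=50.0, scale=2.0, size=10_000)  # data centered far from zero

mu = x.mean()
second_moment = np.mean(x ** 2)

# E[X^2] = Var(X) + mu^2 -- the "artificial inflation" is exactly mu^2
print(np.isclose(second_moment, np.var(x) + mu ** 2))  # True
```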

When scaling a random variable by a constant $a$:

$$\mathrm{Var}(aX) = a^2\,\mathrm{Var}(X)$$

This follows directly from the definition of variance:

$$\mathrm{Var}(aX) = E\!\left[(aX - E[aX])^2\right] = E\!\left[a^2\,(X - E[X])^2\right] = a^2\,\mathrm{Var}(X)$$
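The scaling law can be illustrated empirically; a small sketch (the scale factor 3 is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=100_000)
a = 3.0

# Var(aX) = a^2 Var(X); holds for the sample variance up to float rounding
print(np.isclose(np.var(a * x), a ** 2 * np.var(x)))  # True
```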

For independent random variables, variances add:

$$\mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Var}(Y)$$

Counter-example (independence matters): take $Y = X$, which is perfectly dependent on $X$. Then

$$\mathrm{Var}(X + X) = \mathrm{Var}(2X) = 4\,\mathrm{Var}(X) \neq 2\,\mathrm{Var}(X)$$
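Both the additive case and the dependent counter-example can be checked by simulation; a sketch assuming standard normal samples:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=200_000)
y = rng.normal(size=200_000)  # drawn independently of x

# Independent: variances add (approximately, due to sampling error)
print(np.isclose(np.var(x + y), np.var(x) + np.var(y), rtol=0.02))  # True

# Perfectly dependent (Y = X): Var(X + X) = 4 Var(X), not 2 Var(X)
print(np.isclose(np.var(x + x), 4 * np.var(x)))  # True
```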

Consider a neuron receiving $n$ inputs (“fan-in”). Each input $x_i$ has variance 1, and each weight $w_i$ is sampled uniformly from $\left[-\sqrt{3/n},\ \sqrt{3/n}\right]$, so that $\mathrm{Var}(w_i) = 1/n$ (a uniform on $[-a, a]$ has variance $a^2/3$). The neuron computes:

$$y = \sum_{i=1}^{n} w_i x_i$$

For each term $w_i x_i$ (with $w_i$ and $x_i$ independent and zero-mean):

$$\mathrm{Var}(w_i x_i) = \mathrm{Var}(w_i)\,\mathrm{Var}(x_i) = \frac{1}{n} \cdot 1 = \frac{1}{n}$$

Summing $n$ such independent terms:

$$\mathrm{Var}(y) = \sum_{i=1}^{n} \mathrm{Var}(w_i x_i) = n \cdot \frac{1}{n} = 1$$

This maintains unit variance through the network, preventing vanishing or exploding gradients.
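The fan-in argument can be verified by simulation; a sketch assuming a fan-in of n = 512, unit-variance Gaussian inputs, and fresh weights per simulated neuron:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 512          # fan-in
trials = 50_000  # number of simulated neurons

bound = np.sqrt(3.0 / n)  # Uniform[-bound, bound] has variance bound^2 / 3 = 1/n
w = rng.uniform(-bound, bound, size=(trials, n))
x = rng.normal(size=(trials, n))  # unit-variance inputs

y = (w * x).sum(axis=1)  # each row: one neuron's pre-activation

print(np.var(y))  # approximately 1.0
```

The empirical variance of `y` lands near 1 regardless of the fan-in, which is the point of scaling the weight range with $n$.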