The variance is the mean squared deviation of each datapoint from the overall mean of the dataset.

It measures how spread out the distribution is. Squaring the deviations is nicer — mathematically — than taking absolute values (it is differentiable and leads to clean identities), but it leaves the units squared. The standard deviation corrects this by taking the square root.

Variance with expected value notation:

$$\mathrm{Var}(X) = E\!\left[(X - E[X])^2\right] = E[X^2] - E[X]^2$$

With bra-ket notation:

$$\mathrm{Var}(X) = \left\langle \left(X - \langle X \rangle\right)^2 \right\rangle = \langle X^2 \rangle - \langle X \rangle^2$$

Variance is the second central moment.

$E[X^2]$ is the second moment. When the data isn’t centered at zero, $E[X^2]$ gets artificially inflated by the mean’s distance from zero.

Example: Heights of 175cm, 180cm, 185cm (mean = 180cm)

  • $E[X^2] = \frac{175^2 + 180^2 + 185^2}{3} \approx 32416.7$ (huge!)
  • $E[X]^2 = 180^2 = 32400$ (the artificial inflation from being centered at 180)
  • $\mathrm{Var}(X) = E[X^2] - E[X]^2 \approx 16.7$ (just the ±5cm spread)
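The heights example can be checked numerically; a minimal sketch with NumPy, using the three data values from the example above:

```python
import numpy as np

heights = np.array([175.0, 180.0, 185.0])

second_moment = np.mean(heights ** 2)    # E[X^2]
mean_squared = np.mean(heights) ** 2     # E[X]^2
variance = second_moment - mean_squared  # Var(X) = E[X^2] - E[X]^2

print(round(second_moment, 1))  # 32416.7
print(round(mean_squared, 1))   # 32400.0
print(round(variance, 1))       # 16.7
```

The last line matches `np.var(heights)` computed directly from the deviations.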

Mathematically, we can decompose any $x_i$ around the mean $\mu$:

$$x_i^2 = \big((x_i - \mu) + \mu\big)^2 = (x_i - \mu)^2 + 2\mu(x_i - \mu) + \mu^2$$

Averaging over all $x_i$ (the cross term vanishes because deviations from the mean average to zero):

$$E[X^2] = E\!\left[(X - \mu)^2\right] + 2\mu\,E[X - \mu] + \mu^2 = \mathrm{Var}(X) + \mu^2$$

So $E[X^2]$ contains both the actual spread (variance) and the “artificial inflation” ($\mu^2$) from being far from zero. Subtracting $E[X]^2 = \mu^2$ removes this inflation, isolating just the variance.
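The decomposition holds for any dataset, not just the heights example; a quick sanity check sketched with NumPy (using population variance, i.e. `ddof=0`, and arbitrary sample data centered far from zero):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=50.0, scale=2.0, size=10_000)  # data centered far from zero

mu = x.mean()
second_moment = np.mean(x ** 2)

# E[X^2] = Var(X) + mu^2 -- the "artificial inflation" is exactly mu^2
print(np.isclose(second_moment, np.var(x) + mu ** 2))  # True
```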

When scaling a random variable by a constant $a$:

$$\mathrm{Var}(aX) = a^2\,\mathrm{Var}(X)$$

This follows directly from the definition of variance:

$$\mathrm{Var}(aX) = E\!\left[(aX - E[aX])^2\right] = E\!\left[a^2\,(X - E[X])^2\right] = a^2\,\mathrm{Var}(X)$$
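The scaling law can be illustrated empirically; a small sketch (the scale factor 3 is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=100_000)
a = 3.0

# Var(aX) = a^2 Var(X); holds for the sample variance up to float rounding
print(np.isclose(np.var(a * x), a ** 2 * np.var(x)))  # True
```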

For independent random variables, variances add:

$$\mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Var}(Y)$$

Counter-example (independence matters): take $Y = X$, which is perfectly dependent on $X$. Then

$$\mathrm{Var}(X + X) = \mathrm{Var}(2X) = 4\,\mathrm{Var}(X) \neq 2\,\mathrm{Var}(X)$$
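Both the additive case and the dependent counter-example can be checked by simulation; a sketch assuming standard normal samples:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=200_000)
y = rng.normal(size=200_000)  # drawn independently of x

# Independent: variances add (approximately, due to sampling error)
print(np.isclose(np.var(x + y), np.var(x) + np.var(y), rtol=0.02))  # True

# Perfectly dependent (Y = X): Var(X + X) = 4 Var(X), not 2 Var(X)
print(np.isclose(np.var(x + x), 4 * np.var(x)))  # True
```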

Consider a neuron receiving $n$ inputs (“fan-in”). Each input $x_i$ has variance 1, and each weight $w_i$ is sampled uniformly from $\left[-\sqrt{3/n},\ \sqrt{3/n}\right]$, so that $\mathrm{Var}(w_i) = 1/n$ (a uniform on $[-a, a]$ has variance $a^2/3$). The neuron computes:

$$y = \sum_{i=1}^{n} w_i x_i$$

For each term $w_i x_i$ (with $w_i$ and $x_i$ independent and zero-mean):

$$\mathrm{Var}(w_i x_i) = \mathrm{Var}(w_i)\,\mathrm{Var}(x_i) = \frac{1}{n} \cdot 1 = \frac{1}{n}$$

Summing $n$ such independent terms:

$$\mathrm{Var}(y) = \sum_{i=1}^{n} \mathrm{Var}(w_i x_i) = n \cdot \frac{1}{n} = 1$$

This maintains unit variance through the network, preventing vanishing or exploding gradients.
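The fan-in argument can be verified by simulation; a sketch assuming a fan-in of n = 512, unit-variance Gaussian inputs, and fresh weights per simulated neuron:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 512          # fan-in
trials = 50_000  # number of simulated neurons

bound = np.sqrt(3.0 / n)  # Uniform[-bound, bound] has variance bound^2 / 3 = 1/n
w = rng.uniform(-bound, bound, size=(trials, n))
x = rng.normal(size=(trials, n))  # unit-variance inputs

y = (w * x).sum(axis=1)  # each row: one neuron's pre-activation

print(np.var(y))  # approximately 1.0
```

The empirical variance of `y` lands near 1 regardless of the fan-in, which is the point of scaling the weight range with $n$.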