$\mathbb{E}$ stands for the expectation, a mathematical operator that calculates the average value of a random variable over all of its possible outcomes.
In short, it’s the mean/“center of mass” of a probability distribution.
$X$: the random variable (distribution); it can be discrete or continuous.
$x$: the possible outcomes (values) of the random variable.
The expected value of a random variable is computed by going through all possible outcomes and multiplying each outcome's value by its probability, then summing the results (a weighted average): $\mathbb{E}[X] = \sum_{x} x \, P(X = x)$.
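For a fair six-sided die (an illustrative example, not from the original note), the weighted average works out to:

$$\mathbb{E}[X] = \sum_{x=1}^{6} x \cdot \tfrac{1}{6} = \frac{1 + 2 + 3 + 4 + 5 + 6}{6} = 3.5$$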
For continuous random variables (described by PDFs), the expected value is:
$$\mathbb{E}[X] = \int_{-\infty}^{\infty} x \, f(x) \, dx$$
The integral extends over all possible values of $x$.
In a more general notation:
$$\mathbb{E}_{x \sim p}[x] = \int x \, p(x) \, dx$$
where $p(x)$ is the probability density function, $\mathbb{E}[\cdot]$ is the expectation of the random variable, and $\int \cdot \, dx$ is shorthand for integrating over all possible values of $x$.
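To make the integral concrete, here is a minimal numerical sketch (not from the original note): it approximates $\mathbb{E}[X]$ for a standard normal by a Riemann sum over a truncated grid.

```python
import numpy as np

# Standard normal PDF, written out so the sketch is self-contained.
def p(x):
    return np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)

# Truncate the integration range to [-10, 10]; the tails beyond are negligible.
x = np.linspace(-10, 10, 100_001)
dx = x[1] - x[0]

# E[X] = ∫ x p(x) dx, approximated as a Riemann sum.
expected_value = np.sum(x * p(x)) * dx
print(expected_value)  # ~0.0 for a standard normal
```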
Expectations can be approximated with a sample mean:
$$\mathbb{E}[X] \approx \frac{1}{N} \sum_{i=1}^{N} x_i, \qquad x_i \sim p$$
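A minimal Monte Carlo sketch of this sample-mean approximation (an illustrative example using a standard normal):

```python
import numpy as np

rng = np.random.default_rng(0)

# Draw N samples x_i ~ p (here p is a standard normal).
samples = rng.standard_normal(1_000_000)

# The sample mean approximates E[X]; the error shrinks roughly like 1/sqrt(N).
print(samples.mean())  # ~0.0
```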
If we draw $x$ from a particular distribution $p$, we write it like this: $x \sim p$, and the expectation becomes $\mathbb{E}_{x \sim p}[x]$.
For oddly shaped distributions, the expected value might not actually correspond to a common value in the distribution.
The Gaussian and bimodal distributions below have the same $\mathbb{E}[X]$, but the bimodal distribution almost never takes on that value:
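A small sketch along the same lines (illustrative, not the code behind the original figure): it builds a bimodal mixture with the same mean as a Gaussian and checks how rarely each lands near that mean.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Gaussian centered at 0 vs. an equal mixture of N(-3, 0.5) and N(+3, 0.5).
gaussian = rng.normal(0.0, 1.0, n)
component = rng.integers(0, 2, n)                      # pick left or right bump
bimodal = rng.normal(0.0, 0.5, n) + np.where(component == 0, -3.0, 3.0)

print(gaussian.mean(), bimodal.mean())                 # both ~0.0: same E[X]

# Fraction of samples within 0.5 of the mean: common for the Gaussian,
# essentially never for the bimodal mixture.
print(np.mean(np.abs(gaussian) < 0.5), np.mean(np.abs(bimodal) < 0.5))
```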
Properties
The expected value is linear: $\mathbb{E}[aX + bY] = a \, \mathbb{E}[X] + b \, \mathbb{E}[Y]$.
The expected value is a constant, and the expectation of a constant is that constant: $\mathbb{E}[c] = c$, hence $\mathbb{E}[\mathbb{E}[X]] = \mathbb{E}[X]$.
If the distribution is uniform, the expected value simplifies to the arithmetic mean of the values.
Only if $X$ and $Y$ are independent does $\mathbb{E}[XY] = \mathbb{E}[X] \, \mathbb{E}[Y]$ hold (a quick numerical check follows this list).
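A quick Monte Carlo check of the linearity and independence properties (a sketch with arbitrarily chosen distributions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two independent random variables (hypothetical distributions for the check).
X = rng.normal(2.0, 1.0, 1_000_000)
Y = rng.exponential(3.0, 1_000_000)

# Linearity: E[aX + bY] = a E[X] + b E[Y]
a, b = 2.0, -0.5
print((a * X + b * Y).mean(), a * X.mean() + b * Y.mean())

# Independence: E[XY] = E[X] E[Y] (holds here because X and Y were drawn independently)
print((X * Y).mean(), X.mean() * Y.mean())
```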
Derivation
\begin{align*}
\mathbb{E}[XY] &= \sum_{x} \sum_{y} x y \, \underbrace{P(X=x, Y=y)}_{\text{if independent}} \\
&= \sum_{x} \sum_{y} x y \, P(X=x) \, P(Y=y) \\
&= \left(\sum_{x} x \, P(X=x)\right) \left(\sum_{y} y \, P(Y=y)\right) \\
&= \mathbb{E}[X] \cdot \mathbb{E}[Y]
\end{align*}
Where/why it doesn't hold for dependent variables: the joint distribution no longer factorizes, $P(X=x, Y=y) \neq P(X=x) \, P(Y=y)$. The difference is captured by the covariance:
$$\operatorname{Cov}(X, Y) = \mathbb{E}[XY] - \mathbb{E}[X] \, \mathbb{E}[Y]$$
For dependent variables, $\mathbb{E}[XY] = \mathbb{E}[X] \, \mathbb{E}[Y] + \operatorname{Cov}(X, Y)$.
Parent-child height example: $X$ = parent height, $Y$ = child height.
Parents: 160, 180 → $\mathbb{E}[X] = 170$. Children: 155, 175 → $\mathbb{E}[Y] = 165$.
Joint pairs: (160, 155), (180, 175), each with probability $\tfrac{1}{2}$, so $\mathbb{E}[XY] = \tfrac{160 \cdot 155 + 180 \cdot 175}{2} = 28150$, while $\mathbb{E}[X] \, \mathbb{E}[Y] = 170 \cdot 165 = 28050$.
The positive correlation ($\operatorname{Cov}(X, Y) = 100$) makes tall-tall and short-short pairs more likely, increasing $\mathbb{E}[XY]$ above the independent case.
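A tiny check of these numbers in code (a sketch of the same two-pair example):

```python
import numpy as np

# Two equally likely (parent, child) height pairs from the example above.
parent = np.array([160.0, 180.0])   # X
child = np.array([155.0, 175.0])    # Y

E_X = parent.mean()                  # 170.0
E_Y = child.mean()                   # 165.0
E_XY = (parent * child).mean()       # 28150.0

print(E_XY, E_X * E_Y)               # 28150.0 vs. 28050.0
print(E_XY - E_X * E_Y)              # covariance = 100.0
```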
Law of the Unconscious Statistician (LOTUS)
For $Y = g(X)$, you can compute $\mathbb{E}[g(X)]$ directly using $X$'s distribution:
\begin{align*}
\mathbb{E}[g(X)] =
\begin{cases}
\sum_{x} g(x) \, P(x) & \text{discrete} \\
\int_{-\infty}^{\infty} g(x) \, f(x) \, dx & \text{continuous}
\end{cases}
\end{align*}
Important: $\mathbb{E}[g(X)] \neq g(\mathbb{E}[X])$ in general (they are only equal for linear functions).
For example (an illustrative counterexample): with $X \sim \mathcal{N}(0, 1)$ and $g(x) = x^2$, $g(\mathbb{E}[X]) = 0$, but $\mathbb{E}[g(X)] = \mathbb{E}[X^2] = 1$.
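A minimal Monte Carlo sketch of LOTUS vs. plugging the mean in, using the $g(x) = x^2$, standard normal example above:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000)   # X ~ N(0, 1)

def g(v):
    return v**2

# LOTUS (Monte Carlo version): average g(x) over X's distribution.
E_g_of_X = g(x).mean()               # ~1.0
g_of_E_X = g(x.mean())               # ~0.0

print(E_g_of_X, g_of_E_X)            # not equal: E[g(X)] != g(E[X]) in general
```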
Common notations for expectations
$\mathbb{E}[X]$ — after declaring $X \sim p$. (Very common in stats/prob.)
$\mathbb{E}_X[f(X)]$ or $\mathbb{E}_x[f(x)]$ — variable-centric.
$\mathbb{E}_p[f(X)]$ — measure/distribution-centric. (Favored in info theory; clean when $p$ doesn't name $X$.)
$\mathbb{E}_{x \sim p}[f(x)]$ — explicit sampling form. (Very common in ML.)
$\mathbb{E}_{x \sim p(x)}[f(x)]$ or $\mathbb{E}_{p(x)}[f(x)]$ — when referring to a density $p(x)$.
$\int f \, dP$, $\int f(x) \, dP(x)$, $\int f(x) \, p(x) \, dx$ — measure/integral form. (Probability/info theory texts.)
$\langle f \rangle$ or $\langle f \rangle_p$ — physics/stat-mech style; you'll sometimes see it in evolutionary/EC papers.
Parametric models: $\mathbb{E}_{p_\theta}[f(x)]$ or $\mathbb{E}_{x \sim p_\theta}[f(x)]$, sometimes $\mathbb{E}_\theta[f(x)]$.
Machine learning (incl. VI/RL): $\mathbb{E}_{q_\phi(z \mid x)}[f(z)]$, $\mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)]$; Monte Carlo estimate $\frac{1}{N} \sum_{i=1}^{N} f(x_i)$.
Converting Celsius to Fahrenheit: $\mathbb{E}$ & $\operatorname{Var}$
I measure temperature in °C and want the expected value and variance in °F, where $F = \tfrac{9}{5} C + 32$.
→ Since $F$ is a linear function of $C$, we can directly transform the expectation: $\mathbb{E}[F] = \tfrac{9}{5} \, \mathbb{E}[C] + 32$.
→ Variance changes with a change of units/scale, but is not affected by shifts: $\operatorname{Var}(F) = \left(\tfrac{9}{5}\right)^2 \operatorname{Var}(C)$.
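A quick numerical check of both rules (a sketch with made-up Celsius samples):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical Celsius measurements: mean 20, std 5.
celsius = rng.normal(20.0, 5.0, 1_000_000)
fahrenheit = 9 / 5 * celsius + 32

# E[F] = 9/5 * E[C] + 32  (shift and scale both affect the mean)
print(fahrenheit.mean(), 9 / 5 * celsius.mean() + 32)

# Var(F) = (9/5)^2 * Var(C)  (only the scale matters, the +32 shift drops out)
print(fahrenheit.var(), (9 / 5) ** 2 * celsius.var())
```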
Tail Sum Formula
For a non-negative random variable $X$, the expected value can be computed using the tail sum formula:
$$\mathbb{E}[X] = \int_{0}^{\infty} P(X > x) \, dx$$
(For integer-valued $X$, the discrete analogue is $\mathbb{E}[X] = \sum_{k=1}^{\infty} P(X \geq k)$.)
Note that $P(X > x) = 1 - F(x)$, where $F$ is the cumulative distribution function of $X$.
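A small sketch checking the tail sum formula on an integer-valued example (a geometric distribution, chosen here just for illustration):

```python
import numpy as np
from scipy import stats

# Geometric distribution on {1, 2, ...} with success probability 0.3.
p = 0.3
dist = stats.geom(p)

# Direct expectation: E[X] = 1 / p.
direct = dist.mean()                      # ~3.333

# Tail sum: E[X] = sum_{k>=1} P(X >= k) = sum_{k>=1} (1 - F(k - 1)).
k = np.arange(1, 1000)
tail_sum = np.sum(1 - dist.cdf(k - 1))

print(direct, tail_sum)                   # both ~3.333
```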