See also: Introduction to RL, DQN, value function.

Bellman Equation of the value function

The value of a state equals the best action’s immediate reward plus the discounted value of the state that action leads to:

V*(s) = max_a [ R(s, a) + γ Σ_{s'} P(s' | s, a) V*(s') ]

Without this, evaluating a state requires considering every possible future trajectory to the end of time. The recursive decomposition collapses the entire future into a single number that can be iteratively refined, so you only ever need to look one step ahead.

V* appears on both sides. “Solving” means finding the fixed point where every state’s value is exactly confirmed by its neighbors: the best action’s reward plus the discounted value of where you land equals what you predicted. Iterative methods (value/policy iteration, TD) converge to it by repeatedly correcting these mismatches.
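This mismatch-correction loop can be seen on a toy problem. The sketch below uses a hypothetical 2-state, 2-action deterministic MDP (the transitions and rewards are made up for illustration): repeatedly applying the Bellman backup shrinks the residual between prediction and neighbor-confirmed value until they agree.

```python
import numpy as np

# Hypothetical 2-state, 2-action deterministic MDP:
# P[s, a] is the next-state index, R[s, a] the immediate reward.
P = np.array([[0, 1], [0, 1]])
R = np.array([[1.0, 0.0], [0.0, 2.0]])
gamma = 0.9

V = np.zeros(2)  # arbitrary initial guess
for _ in range(200):
    # Bellman backup: best action's reward + discounted value of where you land
    V_new = np.max(R + gamma * V[P], axis=1)
    residual = np.max(np.abs(V_new - V))  # the mismatch being corrected
    V = V_new
# V converges to [18., 20.]: state 1 loops on itself for reward 2
# (2 / (1 - 0.9) = 20), and state 0's best move is to jump there (0.9 * 20).
```

Because the backup is a γ-contraction, the residual shrinks by at least a factor of γ per sweep, which is why the fixed point is unique and the iteration cannot oscillate.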

This fixed point only exists because of stationarity: P and R don’t change over time. If the rules shifted, V* would be a moving target.

How different algorithm families approach solving it

Exact (dynamic programming, value/policy iteration): sweep over all states, updating each from its neighbors until convergence. Requires a known model¹; cost is O(|S|²|A|) per iteration.
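A minimal value-iteration sketch for the exact case, assuming a known tabular model (the random P and R here are placeholders, not a real task). Each sweep touches every (s, a, s') triple, which is where the O(|S|²|A|) per-iteration cost comes from.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma = 5, 3, 0.95
# Known model: P[s, a, s'] are transition probabilities, R[s, a] expected rewards.
P = rng.dirichlet(np.ones(S), size=(S, A))  # each P[s, a] row sums to 1
R = rng.standard_normal((S, A))

V = np.zeros(S)
while True:
    Q = R + gamma * P @ V       # one full sweep over all (s, a, s') triples
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new
policy = Q.argmax(axis=1)       # greedy policy read off the converged values
```

Policy iteration differs only in alternating a full policy-evaluation solve with a greedy-improvement step; both rely on the same sweep structure.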

Sampled (TD, Q-learning): don’t enumerate states; instead, sample transitions from the environment and update toward the Bellman target r + γ max_a′ Q(s′, a′). Trades exactness for scalability.
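A tabular Q-learning sketch on a hypothetical 5-state chain (states 0–4, reward 1 only for reaching the terminal state 4; the environment is invented for illustration). Note that the model never appears: every update uses only one sampled transition.

```python
import random

N, gamma, alpha, eps = 5, 0.9, 0.1, 0.2
Q = [[0.0, 0.0] for _ in range(N)]  # actions: 0 = left, 1 = right

random.seed(0)
for episode in range(2000):
    s = 0
    while s != N - 1:
        # epsilon-greedy action selection
        if random.random() < eps:
            a = random.randrange(2)
        else:
            a = max((0, 1), key=lambda x: Q[s][x])
        s2 = max(0, s - 1) if a == 0 else s + 1   # sampled transition
        r = 1.0 if s2 == N - 1 else 0.0
        # update toward the sampled Bellman target, zero at the terminal state
        target = r + gamma * (0.0 if s2 == N - 1 else max(Q[s2]))
        Q[s][a] += alpha * (target - Q[s][a])
        s = s2
```

After training, Q ranks “right” above “left” in every state, and Q[0][1] approaches γ³ = 0.729, the discounted value of the four steps to the goal.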

Policy-based (policy gradient family): sidestep the Bellman equation entirely and optimize the policy directly. The equation is still implicitly there (actor-critic methods use it to train the critic), but the policy update doesn’t need to solve it.
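To make “optimize the policy directly” concrete, here is a REINFORCE-style sketch on a hypothetical 2-armed bandit (arm payoffs are made up): the softmax preferences are nudged along the score function ∇ log π, and no value function is ever solved for.

```python
import math
import random

random.seed(0)
prefs = [0.0, 0.0]        # softmax logits, one per arm
true_means = [0.2, 0.8]   # hypothetical payoff probabilities; arm 1 is better
lr = 0.1

for _ in range(3000):
    z = [math.exp(p) for p in prefs]
    probs = [v / sum(z) for v in z]
    a = 0 if random.random() < probs[0] else 1
    r = 1.0 if random.random() < true_means[a] else 0.0
    # policy-gradient step: grad of log pi(a) w.r.t. pref_i is 1[i == a] - probs[i]
    for i in range(2):
        prefs[i] += lr * r * ((1.0 if i == a else 0.0) - probs[i])
```

An actor-critic method would replace the raw return r with a critic’s TD estimate, which is exactly where the Bellman equation re-enters.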

Footnotes

  1. Knowing P and R explicitly lets you compute the expected value of each action without taking it.