See also: Introduction to RL, DQN, value function.

Bellman Equation of the value function

The value of a state equals the best action’s immediate reward plus the discounted value of the state that action leads to:

V*(s) = max_a [ R(s, a) + γ Σ_{s'} P(s' | s, a) V*(s') ]

Without this, evaluating a state requires considering every possible future trajectory to the end of time. The recursive decomposition collapses the entire future into a single number that can be iteratively refined, so you only ever need to look one step ahead.

V* appears on both sides. “Solving” means finding the fixed point where every state’s value is exactly confirmed by its neighbors: the best action’s reward plus the discounted value of where you land equals what you predicted. Iterative methods (value/policy iteration, TD) converge to it by repeatedly correcting these mismatches.
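This mismatch-correction loop can be seen on a toy problem. The sketch below uses a hypothetical 2-state, 2-action deterministic MDP (the transitions and rewards are made up for illustration): repeatedly applying the Bellman backup shrinks the residual between prediction and neighbor-confirmed value until they agree.

```python
import numpy as np

# Hypothetical 2-state, 2-action deterministic MDP:
# P[s, a] is the next-state index, R[s, a] the immediate reward.
P = np.array([[0, 1], [0, 1]])
R = np.array([[1.0, 0.0], [0.0, 2.0]])
gamma = 0.9

V = np.zeros(2)  # arbitrary initial guess
for _ in range(200):
    # Bellman backup: best action's reward + discounted value of where you land
    V_new = np.max(R + gamma * V[P], axis=1)
    residual = np.max(np.abs(V_new - V))  # the mismatch being corrected
    V = V_new
# V converges to [18., 20.]: state 1 loops on itself for reward 2
# (2 / (1 - 0.9) = 20), and state 0's best move is to jump there (0.9 * 20).
```

Because the backup is a γ-contraction, the residual shrinks by at least a factor of γ per sweep, which is why the fixed point is unique and the iteration cannot oscillate.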

This fixed point only exists because of stationarity: P and R don’t change over time. If the rules shifted, V* would be a moving target.

How different algorithm families approach solving it

Exact (dynamic programming, value/policy iteration): sweep over all states, updating each from its neighbors until convergence. Requires a known model¹; cost is O(|S|²|A|) per iteration.
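A minimal value-iteration sketch for the exact case, assuming a known tabular model (the random P and R here are placeholders, not a real task). Each sweep touches every (s, a, s') triple, which is where the O(|S|²|A|) per-iteration cost comes from.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma = 5, 3, 0.95
# Known model: P[s, a, s'] are transition probabilities, R[s, a] expected rewards.
P = rng.dirichlet(np.ones(S), size=(S, A))  # each P[s, a] row sums to 1
R = rng.standard_normal((S, A))

V = np.zeros(S)
while True:
    Q = R + gamma * P @ V       # one full sweep over all (s, a, s') triples
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new
policy = Q.argmax(axis=1)       # greedy policy read off the converged values
```

Policy iteration differs only in alternating a full policy-evaluation solve with a greedy-improvement step; both rely on the same sweep structure.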

Sampled (TD, Q-learning): don’t enumerate states; instead, sample transitions from the environment and update toward the Bellman target r + γ max_a′ Q(s′, a′). Trades exactness for scalability.
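A tabular Q-learning sketch on a hypothetical 5-state chain (states 0–4, reward 1 only for reaching the terminal state 4; the environment is invented for illustration). Note that the model never appears: every update uses only one sampled transition.

```python
import random

N, gamma, alpha, eps = 5, 0.9, 0.1, 0.2
Q = [[0.0, 0.0] for _ in range(N)]  # actions: 0 = left, 1 = right

random.seed(0)
for episode in range(2000):
    s = 0
    while s != N - 1:
        # epsilon-greedy action selection
        if random.random() < eps:
            a = random.randrange(2)
        else:
            a = max((0, 1), key=lambda x: Q[s][x])
        s2 = max(0, s - 1) if a == 0 else s + 1   # sampled transition
        r = 1.0 if s2 == N - 1 else 0.0
        # update toward the sampled Bellman target, zero at the terminal state
        target = r + gamma * (0.0 if s2 == N - 1 else max(Q[s2]))
        Q[s][a] += alpha * (target - Q[s][a])
        s = s2
```

After training, Q ranks “right” above “left” in every state, and Q[0][1] approaches γ³ = 0.729, the discounted value of the four steps to the goal.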

Policy-based (policy gradient family): sidestep the Bellman equation entirely and optimize the policy directly. The equation is still implicitly there (actor-critic methods use it to train the critic), but the policy update doesn’t need to solve it.
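To make “optimize the policy directly” concrete, here is a REINFORCE-style sketch on a hypothetical 2-armed bandit (arm payoffs are made up): the softmax preferences are nudged along the score function ∇ log π, and no value function is ever solved for.

```python
import math
import random

random.seed(0)
prefs = [0.0, 0.0]        # softmax logits, one per arm
true_means = [0.2, 0.8]   # hypothetical payoff probabilities; arm 1 is better
lr = 0.1

for _ in range(3000):
    z = [math.exp(p) for p in prefs]
    probs = [v / sum(z) for v in z]
    a = 0 if random.random() < probs[0] else 1
    r = 1.0 if random.random() < true_means[a] else 0.0
    # policy-gradient step: grad of log pi(a) w.r.t. pref_i is 1[i == a] - probs[i]
    for i in range(2):
        prefs[i] += lr * r * ((1.0 if i == a else 0.0) - probs[i])
```

An actor-critic method would replace the raw return r with a critic’s TD estimate, which is exactly where the Bellman equation re-enters.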

Footnotes

  1. Knowing P and R explicitly lets you compute the expected value of each action without taking it.