We can rewrite the Bellman optimality equation
![[Bellman-Equation#^36a1e7]]
to this (literally the same thing, just making the next state $s'$ and reward $r$ explicit):
$$V(s) = \max_a \sum_{s', r} p(s', r \mid s, a)\,\big[r + \gamma V(s')\big]$$
(If the MDP were deterministic, we could drop the sum.)
We can initialize this value function/table arbitrarily (even poorly), and iteratively refine it by repeatedly applying this update.
Once that’s converged, we can extract the optimal policy by acting greedily with respect to the converged values:
$$\pi^*(s) = \arg\max_a \sum_{s', r} p(s', r \mid s, a)\,\big[r + \gamma V(s')\big]$$
Here we have to know the transition probabilities $p(s', r \mid s, a)$, i.e. the next state $s'$ and reward $r$, so this is model-based: we need an explicit model of the environment that tells us the expected next state and reward given the current state and action.
This is a type of dynamic programming. Given perfect information about the environment, we can find the optimal solution by DP. You can just code this algorithm up to solve low-dimensional problems (e.g. tic-tac-toe), without neural nets.
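As a minimal sketch of that idea (the 3-state toy MDP below is made up purely for illustration; its transition probabilities and rewards are assumptions, not from the source), value iteration plus greedy policy extraction fits in a few lines of Python:

```python
import numpy as np

# Toy MDP (hypothetical): 3 states, 2 actions.
# P[s][a] is a list of (probability, next_state, reward) tuples.
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(0.8, 1, 1.0), (0.2, 0, 0.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(0.9, 2, 2.0), (0.1, 1, 0.0)]},
    2: {0: [(1.0, 2, 0.0)], 1: [(1.0, 2, 0.0)]},  # state 2 is absorbing
}
gamma = 0.9
n_states, n_actions = 3, 2

# Value iteration: start from an arbitrary (here all-zero) value table and
# repeatedly apply the Bellman optimality update until it stops changing.
V = np.zeros(n_states)
for _ in range(1000):
    V_new = np.array([
        max(sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
            for a in range(n_actions))
        for s in range(n_states)
    ])
    if np.max(np.abs(V_new - V)) < 1e-8:  # converged
        V = V_new
        break
    V = V_new

# Extract the optimal policy: act greedily w.r.t. the converged values.
policy = [
    max(range(n_actions),
        key=lambda a: sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a]))
    for s in range(n_states)
]
print("V* =", V)
print("pi* =", policy)
```

Note that the whole "model" here is just the table `P`; with it in hand, no learning from sampled experience (and no neural net) is needed.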