We can rewrite the Bellman optimality equation
![[Bellman-Equation#^36a1e7]]
to this (literally the same thing, just making the next state $s'$ and reward $r$ explicit):
$$V(s) = \max_a \sum_{s', r} p(s', r \mid s, a)\,\big[r + \gamma V(s')\big]$$
(If the MDP were deterministic, we could drop the sum.)
We can initialize this value function/table arbitrarily (even poorly), and iteratively refine it by repeatedly applying this update.
Once that’s converged, we can extract the optimal policy by acting greedily with respect to the converged values:
$$\pi^*(s) = \arg\max_a \sum_{s', r} p(s', r \mid s, a)\,\big[r + \gamma V(s')\big]$$
Here we have to know the transition probabilities $p(s', r \mid s, a)$, i.e. the next state $s'$ and reward $r$, so this is model-based: we need an explicit model of the environment that tells us the expected next state and reward given the current state and action.
This is a type of dynamic programming. Given perfect information about the environment, we can find the optimal solution by DP. You can just code this algorithm up to solve low-dimensional problems (e.g. tic-tac-toe), without neural nets.
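As a minimal sketch of that idea (the 3-state toy MDP below is made up purely for illustration; its transition probabilities and rewards are assumptions, not from the source), value iteration plus greedy policy extraction fits in a few lines of Python:

```python
import numpy as np

# Toy MDP (hypothetical): 3 states, 2 actions.
# P[s][a] is a list of (probability, next_state, reward) tuples.
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(0.8, 1, 1.0), (0.2, 0, 0.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(0.9, 2, 2.0), (0.1, 1, 0.0)]},
    2: {0: [(1.0, 2, 0.0)], 1: [(1.0, 2, 0.0)]},  # state 2 is absorbing
}
gamma = 0.9
n_states, n_actions = 3, 2

# Value iteration: start from an arbitrary (here all-zero) value table and
# repeatedly apply the Bellman optimality update until it stops changing.
V = np.zeros(n_states)
for _ in range(1000):
    V_new = np.array([
        max(sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
            for a in range(n_actions))
        for s in range(n_states)
    ])
    if np.max(np.abs(V_new - V)) < 1e-8:  # converged
        V = V_new
        break
    V = V_new

# Extract the optimal policy: act greedily w.r.t. the converged values.
policy = [
    max(range(n_actions),
        key=lambda a: sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a]))
    for s in range(n_states)
]
print("V* =", V)
print("pi* =", policy)
```

Note that the whole "model" here is just the table `P`; with it in hand, no learning from sampled experience (and no neural net) is needed.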