We evaluate our current best policy $\pi$: we iteratively push our value function towards the true value of that policy, exactly like in value iteration, but instead of backing up the action with the maximum value, we back up the action chosen by $\pi$:

$$Q_{k+1}^{\pi}(s, a) = \sum_{s'} P(s' \mid s, a)\,\Bigl[r(s, a, s') + \gamma\, Q_k^{\pi}\bigl(s', \pi(s')\bigr)\Bigr]$$

Then we lock in the value function and update the policy so that it takes the best action according to it:

$$\pi_{\text{new}}(s) = \arg\max_a Q^{\pi}(s, a)$$
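To make these two steps concrete, here is a minimal NumPy sketch of both updates for a tabular MDP; the array layout `P[s, a, s']`, `R[s, a]`, the deterministic policy array `pi[s]`, and the function names are illustrative assumptions, not notation from the text above.

```python
import numpy as np

def evaluate_policy(P, R, pi, gamma=0.99, tol=1e-8):
    """Iterative policy evaluation: push Q towards Q^pi.
    P[s, a, s'] are transition probabilities, R[s, a] expected rewards,
    pi[s] a deterministic policy (array layout is an assumption)."""
    n_states, n_actions, _ = P.shape
    Q = np.zeros((n_states, n_actions))
    while True:
        # Bellman expectation backup: bootstrap with the action pi takes,
        # not with the max over actions (that would be value iteration).
        V_pi = Q[np.arange(n_states), pi]                  # Q^pi(s', pi(s'))
        Q_new = R + gamma * np.einsum("sat,t->sa", P, V_pi)
        if np.max(np.abs(Q_new - Q)) < tol:
            return Q_new
        Q = Q_new

def improve_policy(Q):
    """Greedy policy improvement with respect to the locked-in Q-function."""
    return Q.argmax(axis=1)
```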

The big difference from value iteration is that we do not need a model of the environment, at least not an explicit one: the information is implicitly captured in the Q-function, i.e. we can do model-free RL.
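As a sketch of what model-free evaluation of a fixed policy can look like, the Q-function can be estimated purely from sampled transitions with a SARSA-style TD update; the Gym-style `reset()`/`step()` interface and the variable names below are assumptions for illustration.

```python
import numpy as np

def td_evaluate_policy(env, pi, n_episodes=500, alpha=0.1, gamma=0.99):
    """Model-free evaluation of a fixed policy pi: estimate Q^pi from
    sampled transitions only, without a transition model P or reward model R.
    Assumes a Gym-style discrete env with reset()/step()."""
    Q = np.zeros((env.observation_space.n, env.action_space.n))
    for _ in range(n_episodes):
        s, _ = env.reset()
        done = False
        while not done:
            a = pi[s]                                     # act according to pi
            s_next, r, terminated, truncated, _ = env.step(a)
            done = terminated or truncated
            # TD target bootstraps with the action pi would take next,
            # not with max_a Q (that would estimate the optimal Q instead).
            target = r + gamma * (0.0 if terminated else Q[s_next, pi[s_next]])
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q
```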

For finite MDPs, policy iteration is guaranteed to converge to an optimal policy.

If we have a policy $\pi$ and generate an improved version $\pi'$ by greedily taking actions $\pi'(s) = \arg\max_a Q^{\pi}(s, a)$, the value of the improved policy is guaranteed to be at least as good, because:

$$Q^{\pi}\bigl(s, \pi'(s)\bigr) = \max_a Q^{\pi}(s, a) \;\ge\; Q^{\pi}\bigl(s, \pi(s)\bigr) = V^{\pi}(s) \quad\Longrightarrow\quad V^{\pi'}(s) \ge V^{\pi}(s) \;\;\text{for all } s.$$

Local greediness yields global optimality: because each policy improvement step is locally greedy with respect to the current value function, it can never decrease performance. If there is any state where the greedy action has a higher Q-value than the old policy's action, the improved policy is strictly better in at least that state (and no worse anywhere else). If no improvement is possible in any state, the greedy policy equals the old one, which means it satisfies the Bellman optimality equation, so it must already be optimal.
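Written out, the standard policy improvement argument behind the implication above is: keep substituting the greedy action and unrolling the expectation, and the value can only go up:

$$
\begin{aligned}
V^{\pi}(s) &\le Q^{\pi}\bigl(s, \pi'(s)\bigr)
  = \mathbb{E}\bigl[r_t + \gamma V^{\pi}(s_{t+1}) \mid s_t = s,\; a_t = \pi'(s)\bigr] \\
&\le \mathbb{E}\bigl[r_t + \gamma Q^{\pi}\bigl(s_{t+1}, \pi'(s_{t+1})\bigr) \mid s_t = s,\; a_t = \pi'(s)\bigr] \\
&\le \mathbb{E}_{\pi'}\bigl[r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots \mid s_t = s\bigr]
  = V^{\pi'}(s).
\end{aligned}
$$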

In practice, this iterative process is often run asynchronously, with multiple actors evaluating potentially older (stale) policies, so the updates effectively become a weighted average over several recent policies.

Like value iteration, this is a form of dynamic programming: given perfect information about the environment, we can find the optimal solution by DP. You can simply code this algorithm up to solve low-dimensional problems (e.g. tic-tac-toe) without neural nets.
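A minimal, self-contained sketch of exactly that for a toy tabular MDP (the two-state MDP, the `P[s, a, s']` / `R[s, a]` layout, and the discount factor are illustrative assumptions); here the evaluation step solves the linear system $V^{\pi} = R_{\pi} + \gamma P_{\pi} V^{\pi}$ exactly instead of sweeping until convergence:

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9):
    """Tabular policy iteration via dynamic programming.
    P[s, a, s'] -- transition probabilities, R[s, a] -- expected rewards
    (array layout is an assumption for this toy sketch)."""
    n_states, n_actions, _ = P.shape
    pi = np.zeros(n_states, dtype=int)           # start from an arbitrary policy
    while True:
        # --- policy evaluation: solve V = R_pi + gamma * P_pi V exactly ---
        P_pi = P[np.arange(n_states), pi]        # (S, S') under current policy
        R_pi = R[np.arange(n_states), pi]        # (S,)
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
        # --- policy improvement: act greedily w.r.t. Q^pi ---
        Q = R + gamma * np.einsum("sat,t->sa", P, V)
        pi_new = Q.argmax(axis=1)
        if np.array_equal(pi_new, pi):           # no state improves -> optimal
            return pi, V
        pi = pi_new

# Tiny 2-state, 2-action MDP purely for illustration
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.0, 1.0], [0.6, 0.4]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
pi_star, V_star = policy_iteration(P, R)
print(pi_star, V_star)
```

The outer loop terminates because there are only finitely many deterministic policies and each improvement step never makes the policy worse.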

Policy iteration (PI) typically converges in fewer iterations than value iteration (VI), but each iteration requires more computation (a full policy evaluation).