See also DQN.

Policy and value function are learned at the same time, using the Q-function:

Take an action using your current Q-function, then update the Q-value for that state–action pair (scaled by a learning rate) toward the reward received plus the predicted return assuming we always choose the best predicted action in the future (greedily):

Q(s, a) ← Q(s, a) + α [r + γ max_a' Q(s', a') − Q(s, a)]
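
A minimal sketch of that update, assuming a tabular Q stored as a 2D NumPy array indexed by (state, action); `alpha` (learning rate) and `gamma` (discount) are illustrative parameter names:

```python
import numpy as np

def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One tabular Q-learning step toward the target r + gamma * max_a' Q[s_next, a']."""
    td_target = r + gamma * np.max(Q[s_next])  # bootstrap on the best predicted future action
    Q[s, a] += alpha * (td_target - Q[s, a])   # learning-rate-sized step toward the target
    return Q
```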

This is simply off-policy TD learning with the Q-function.
If we didn’t take the max in the target while acting off-policy, the Q-estimates would degrade over time. Because the target always uses the max, we can take suboptimal actions (explore; epsilon-greedy) and still learn from off-policy experience, such as following an expert player’s actions (imitation learning) or learning from past transitions (experience replay).
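
A sketch of how this separates the behavior policy from the learned (greedy) target policy: actions can come from an epsilon-greedy policy or from a buffer of stored/expert transitions, while the target still uses the max. The `epsilon_greedy` / `learn_from_replay` names and the (s, a, r, s_next) tuple format are illustrative, not from the source:

```python
import random
import numpy as np

def epsilon_greedy(Q, s, eps=0.1):
    """Behavior policy: explore with probability eps, otherwise act greedily w.r.t. Q."""
    if random.random() < eps:
        return random.randrange(Q.shape[1])
    return int(np.argmax(Q[s]))

def learn_from_replay(Q, transitions, alpha=0.1, gamma=0.99):
    """Update from stored (s, a, r, s_next) tuples; the max in the target means it
    doesn't matter which policy generated them (off-policy)."""
    for s, a, r, s_next in transitions:
        td_target = r + gamma * np.max(Q[s_next])
        Q[s, a] += alpha * (td_target - Q[s, a])
    return Q
```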

SARSA is the on-policy variant of this.
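
For contrast, a sketch of the two TD targets side by side (illustrative names): SARSA bootstraps on the action the behavior policy actually takes next, Q-learning on the greedy max.

```python
import numpy as np

def sarsa_target(Q, r, s_next, a_next, gamma=0.99):
    """SARSA (on-policy): bootstrap on the action actually taken next."""
    return r + gamma * Q[s_next, a_next]

def q_learning_target(Q, r, s_next, gamma=0.99):
    """Q-learning (off-policy): bootstrap on the greedy max action."""
    return r + gamma * np.max(Q[s_next])
```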

References

reinforcement learning
https://lilianweng.github.io/posts/2018-02-19-rl-overview/#q-learning-off-policy-td-control