See also DQN.

Policy and value function are learned at the same time, using the Q-function:

Take an action using your current Q-function, then update the Q-value for that state–action pair (scaled by a learning rate) toward the reward received plus the predicted return assuming we always choose the best predicted action in the future (greedily):

Q(s, a) ← Q(s, a) + α [r + γ max_a' Q(s', a') − Q(s, a)]
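
A minimal sketch of that update, assuming a tabular Q stored as a 2D NumPy array indexed by (state, action); `alpha` (learning rate) and `gamma` (discount) are illustrative parameter names:

```python
import numpy as np

def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One tabular Q-learning step toward the target r + gamma * max_a' Q[s_next, a']."""
    td_target = r + gamma * np.max(Q[s_next])  # bootstrap on the best predicted future action
    Q[s, a] += alpha * (td_target - Q[s, a])   # learning-rate-sized step toward the target
    return Q
```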

This is simply off-policy TD learning with the Q-function.
If we didn’t take the max in the target while acting off-policy, the Q-estimates would degrade over time. Because the target always uses the max, we can take suboptimal actions (explore; epsilon-greedy) and still learn from off-policy experience, such as following an expert player’s actions (imitation learning) or learning from past transitions (experience replay).
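
A sketch of how this separates the behavior policy from the learned (greedy) target policy: actions can come from an epsilon-greedy policy or from a buffer of stored/expert transitions, while the target still uses the max. The `epsilon_greedy` / `learn_from_replay` names and the (s, a, r, s_next) tuple format are illustrative, not from the source:

```python
import random
import numpy as np

def epsilon_greedy(Q, s, eps=0.1):
    """Behavior policy: explore with probability eps, otherwise act greedily w.r.t. Q."""
    if random.random() < eps:
        return random.randrange(Q.shape[1])
    return int(np.argmax(Q[s]))

def learn_from_replay(Q, transitions, alpha=0.1, gamma=0.99):
    """Update from stored (s, a, r, s_next) tuples; the max in the target means it
    doesn't matter which policy generated them (off-policy)."""
    for s, a, r, s_next in transitions:
        td_target = r + gamma * np.max(Q[s_next])
        Q[s, a] += alpha * (td_target - Q[s, a])
    return Q
```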

SARSA is the on-policy variant of this.
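
For contrast, a sketch of the two TD targets side by side (illustrative names): SARSA bootstraps on the action the behavior policy actually takes next, Q-learning on the greedy max.

```python
import numpy as np

def sarsa_target(Q, r, s_next, a_next, gamma=0.99):
    """SARSA (on-policy): bootstrap on the action actually taken next."""
    return r + gamma * Q[s_next, a_next]

def q_learning_target(Q, r, s_next, gamma=0.99):
    """Q-learning (off-policy): bootstrap on the greedy max action."""
    return r + gamma * np.max(Q[s_next])
```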

References

reinforcement learning
https://lilianweng.github.io/posts/2018-02-19-rl-overview/#q-learning-off-policy-td-control