SARSA is on-policy TD learning with the Q-function.
Here the behavior policy and the policy being learned are the same (typically ε-greedy), so the update is made with the action the agent actually takes next rather than the greedy one. Because the Q-values account for the cost of occasional exploratory actions, SARSA is more conservative ("safer") than Q-learning: it may converge to a less aggressive policy and do so more slowly, but it tends to collect more cumulative reward while learning.

In contrast to Q-learning, we update our Q-values using the action we actually take in the next state, not the maximum Q-value over the next state's actions:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \big[ r_{t+1} + \gamma \, Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \big]$$

whereas Q-learning bootstraps from the greedy action instead:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \big[ r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \big]$$
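
A minimal sketch of tabular SARSA with an ε-greedy policy. The tiny "chain" environment (`step`), the hyperparameters, and the state/action sizes are hypothetical, only there to make the example self-contained; the update line is what matters.

```python
import numpy as np

n_states, n_actions = 5, 2        # hypothetical chain: actions 0 = left, 1 = right
alpha, gamma, epsilon = 0.1, 0.99, 0.1

def step(state, action):
    """Hypothetical chain MDP: reaching the rightmost state gives reward +1."""
    next_state = min(state + 1, n_states - 1) if action == 1 else max(state - 1, 0)
    reward = 1.0 if next_state == n_states - 1 else 0.0
    done = next_state == n_states - 1
    return next_state, reward, done

def epsilon_greedy(Q, state, rng):
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))     # explore
    return int(np.argmax(Q[state]))             # exploit

rng = np.random.default_rng(0)
Q = np.zeros((n_states, n_actions))

for episode in range(500):
    state = 0
    action = epsilon_greedy(Q, state, rng)      # a_t, drawn from the current policy
    done = False
    while not done:
        next_state, reward, done = step(state, action)
        next_action = epsilon_greedy(Q, next_state, rng)   # a_{t+1}, also from the policy
        # On-policy TD target: uses Q(s_{t+1}, a_{t+1}), not max_a Q(s_{t+1}, a)
        target = reward + (0.0 if done else gamma * Q[next_state, next_action])
        Q[state, action] += alpha * (target - Q[state, action])
        state, action = next_state, next_action

print(Q)
```

The quintuple used in each update, $(s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1})$, is what gives SARSA its name; swapping the bootstrapped term for `max` over the next state's Q-values would turn this loop into Q-learning.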

reinforcement learning
on-policy, temporal difference learning
https://lilianweng.github.io/posts/2018-02-19-rl-overview/#sarsa-on-policy-td-control