year: 2021/12
paper: https://arxiv.org/pdf/2111.00210.pdf
website:
code: https://github.com/YeWR/EfficientZero/tree/main
connections: model-based, RL


More or less a combination of MuZero (the model-based part) and SPR (the self-supervised consistency loss on the learned representations).

Improvement #1: Off-Policy Correction

The off-policy nature of the training targets is an issue that EfficientZero addresses:
Value targets are usually computed from state-action-reward sequences drawn from a replay buffer.
The states and rewards of such a trajectory depend on the actions that were taken at collection time, but the current (more trained) agent might not take those actions anymore, and it would then also find itself in different states.
So the problem is that the agent would act differently now, yet it is updated towards a target computed as if it had blindly repeated the old behavior.
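
For context, the MuZero-style n-step value target looks roughly like this (following the MuZero paper's notation, with observed rewards $u$ and a bootstrapped value estimate $\nu$):

$$
z_t = u_{t+1} + \gamma\, u_{t+2} + \dots + \gamma^{n-1} u_{t+n} + \gamma^{n}\,\nu_{t+n}
$$

All $n$ rewards $u_{t+1},\dots,u_{t+n}$ were produced by the older policy that collected the trajectory, which is exactly what makes the target off-policy.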

The older an episode is, the shorter the bootstrap sequence.

If the sampled trajectory (the one that produced the reward sequence above) is relatively recent, then you'd probably still make the same decisions now. So you can bootstrap over a relatively long reward sequence (a sketch of this follows the link below).

On the other hand, if the sampled trajectory is relatively old, then you'd likely make different decisions now. So you should bootstrap over a relatively short reward sequence and fall back on the value estimate sooner.

Link to original

https://www.lesswrong.com/posts/mRwJce3npmzbKfxws/efficientzero-how-it-works
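
A minimal sketch of the idea in Python. The staleness schedule, the function names (`bootstrap_horizon`, `n_step_target`), and the constants are hypothetical illustrations, not EfficientZero's actual implementation: recent trajectories keep the full n-step reward sum, while stale ones cut over to the bootstrapped value estimate after fewer steps.

```python
import numpy as np

def bootstrap_horizon(train_step, traj_step, max_horizon=5, total_steps=100_000):
    """Hypothetical schedule: shrink the n-step horizon as the trajectory ages.
    Fresh data keeps the full horizon; the oldest data bootstraps after one step."""
    staleness = float(np.clip((train_step - traj_step) / total_steps, 0.0, 1.0))
    return max(1, int(round(max_horizon * (1.0 - staleness))))

def n_step_target(rewards, states, value_fn, train_step, traj_step, gamma=0.997):
    """z_t = sum_{i<l} gamma^i * u_{t+1+i} + gamma^l * v(s_{t+l}),
    where the horizon l depends on how old the trajectory is."""
    l = min(bootstrap_horizon(train_step, traj_step), len(rewards))
    target = sum(gamma ** i * rewards[i] for i in range(l))   # observed rewards
    target += gamma ** l * value_fn(states[l])                # bootstrap value
    return target

# Example: a 5-step window from a trajectory collected 60k training steps ago.
rewards = [0.0, 1.0, 0.0, 0.0, 1.0]
states = list(range(6))              # placeholder "states"
value_fn = lambda s: 0.5             # stand-in for the current/target value network
print(n_step_target(rewards, states, value_fn, train_step=80_000, traj_step=20_000))
```

Because the example trajectory is stale, the horizon shrinks and only the first couple of observed rewards are used before bootstrapping with the (up-to-date) value estimate.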

MuZero