See also temporal difference learning.
Bootstrapping in RL refers to constructing targets for your value function from a combination of real rewards and the value function's own estimates, so that we don't need to roll out episodes all the way to the end in order to learn the value function.
Long version:
The value function $V^\pi(s)$, defined as the expected return from state $s$ when following policy $\pi$,

$$V^\pi(s) = \mathbb{E}_\pi\!\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k} \;\middle|\; s_t = s\right],$$

could in the simplest case be obtained by letting the policy gather lots of playouts from that state and averaging the discounted sums of future rewards from the environment; by definition, this Monte Carlo average converges towards the state-value of that state under that policy.
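A minimal sketch of this Monte Carlo estimate, assuming a gym-style `step` interface, a hypothetical `env.reset_to(state)` helper, and a `policy(state)` callable (none of these names come from the original text):

```python
import numpy as np

def mc_value_estimate(env, policy, start_state, gamma=0.99, n_playouts=1000):
    """Monte Carlo estimate of V^pi(start_state): average the discounted
    sum of rewards over many complete playouts under the policy."""
    returns = []
    for _ in range(n_playouts):
        # hypothetical interface: reset the environment into the state of interest
        state = env.reset_to(start_state)
        done, discount, episode_return = False, 1.0, 0.0
        while not done:
            action = policy(state)
            state, reward, done = env.step(action)  # assumed gym-like step signature
            episode_return += discount * reward
            discount *= gamma
        returns.append(episode_return)
    # this average converges to V^pi(start_state) as n_playouts grows
    return float(np.mean(returns))
```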
Since this is slow (every estimate needs complete episodes), we improve sample efficiency by bootstrapping:
Instead of calculating the full sampled return, as described above,

$$V(s_t) \leftarrow r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots + \gamma^{T-t} r_T \quad \text{(rewards until the end of the episode)},$$

we set the target for the value function to an $n$-step bootstrapped value:

$$V(s_t) \leftarrow r_t + \gamma r_{t+1} + \dots + \gamma^{n-1} r_{t+n-1} + \gamma^n V(s_{t+n}).$$
So we move towards the target on the right-hand side, consisting of actual rewards plus an estimate of the state-value of a future state. This way we don't have to collect actual rewards all the way to the end of the episode, but the estimate still converges to the right value because it incorporates experienced rewards.
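A small sketch of this $n$-step target, assuming `rewards` holds the real rewards $r_t, \dots, r_{t+n-1}$ from a stored trajectory and `value_fn` is the current value estimator (names are illustrative):

```python
def n_step_target(rewards, bootstrap_state, value_fn, gamma=0.99):
    """Bootstrapped target: discounted sum of the n real rewards, plus the
    discounted value estimate of the state reached after those n steps."""
    n = len(rewards)
    target = sum(gamma**k * r for k, r in enumerate(rewards))
    # bootstrap: plug in the value function's own estimate for the remaining rewards
    target += gamma**n * value_fn(bootstrap_state)
    return target
```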
Off-Policy Correction
The off-policy nature of this method is an issue, which Efficient Zero resolves:
Usually the rewards are drawn as state-action sequences from a replay buffer.
The experiences and states of such a trajectory depend on the actions that were taken at the time, but the current (more trained) agent might not take those actions anymore, and would then also have found itself in different states.
So the problem is that the agent would act differently now, yet it is updated towards a target computed as if it had blindly repeated the old behaviour.
→ The older an episode is, the shorter the bootstrap sequence.
If the sampled trajectory, which gives you the sequence of rewards shown above, is relatively recent, then you would probably still make the same decisions now, so you can bootstrap over a relatively long stretch of it. Your target might look like this:

$$r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \gamma^3 r_{t+3} + \gamma^4 r_{t+4} + \gamma^5 V(s_{t+5})$$

On the other hand, if the sampled trajectory is relatively old, then you would likely make different decisions now, so you should only bootstrap over a short stretch of it. Your target might look like this:

$$r_t + \gamma r_{t+1} + \gamma^2 V(s_{t+2})$$
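Putting the rule into code as a rough sketch: the decay schedule below is made up for illustration and is not the exact Efficient Zero recipe; `states[k]` is assumed to be the state reached after `k` steps of the stored trajectory, and `n_step_target` is the helper sketched earlier.

```python
def bootstrap_horizon(age_in_steps, max_horizon=5, half_life=50_000):
    """Illustrative schedule: the older the trajectory (measured in training
    steps since it was collected), the fewer real rewards we trust."""
    shrink = 0.5 ** (age_in_steps / half_life)
    return max(1, round(max_horizon * shrink))

def corrected_target(rewards, states, value_fn, age_in_steps, gamma=0.99):
    """Truncate the stored trajectory to an age-dependent horizon, then
    bootstrap with the *current* value function."""
    horizon = bootstrap_horizon(age_in_steps)
    return n_step_target(rewards[:horizon], states[horizon], value_fn, gamma)
```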