The question of which $n$ we want to pick, i.e. how soon to start relying on predicted rewards, is handled by TD($\lambda$) with an intuition similar to reward discounting: rather than picking a single best $n$, we take a weighted average of all possible $n$-step TD targets, with weights that decay the further into the future a target bootstraps, since the further ahead we predict, the less confident we are.

TD($\lambda$) decays the $n$-step return $G_t^{(n)}$ by a factor of $\lambda^{n-1}$:

$$G_t^{\lambda} = (1-\lambda)\sum_{n=1}^{\infty} \lambda^{n-1}\, G_t^{(n)}$$
We multiply by $(1-\lambda)$ to normalize the weights, i.e. form a weighted average, which follows from rearranging the geometric series:

$$\sum_{n=1}^{\infty} \lambda^{n-1} = \frac{1}{1-\lambda} \quad\Longrightarrow\quad (1-\lambda)\sum_{n=1}^{\infty} \lambda^{n-1} = 1$$
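To make the weighting concrete, here is a minimal Python sketch (the function names `n_step_return` and `lambda_return` and the episodic reward/value arrays are illustrative, not from any particular library). For a finite episode, all weight beyond the last bootstrapping $n$-step return collapses onto the full Monte Carlo return, so the weights still sum to 1.

```python
def n_step_return(rewards, values, t, n, gamma):
    """n-step TD target from time t: n steps of actual (discounted) reward,
    then bootstrap with the predicted value of the state reached."""
    T = len(rewards)                       # episode length
    steps = min(n, T - t)                  # do not run past the terminal state
    G = sum(gamma**k * rewards[t + k] for k in range(steps))
    if t + steps < T:                      # bootstrap only if we stopped before the end
        G += gamma**steps * values[t + steps]
    return G


def lambda_return(rewards, values, t, gamma, lam):
    """Weighted average of all n-step returns with weight (1 - lam) * lam**(n - 1);
    the leftover weight lam**(T - t - 1) falls on the full Monte Carlo return."""
    T = len(rewards)
    G_lam = 0.0
    for n in range(1, T - t):              # n-step returns that still bootstrap
        G_lam += (1 - lam) * lam**(n - 1) * n_step_return(rewards, values, t, n, gamma)
    G_lam += lam**(T - t - 1) * n_step_return(rewards, values, t, T - t, gamma)
    return G_lam
```

The weights $(1-\lambda)\lambda^{n-1}$ for $n < T-t$ plus the tail weight $\lambda^{T-t-1}$ sum to exactly 1 for any $\lambda \in [0, 1]$, which is the normalization argued above.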
This is different from discounting, where we want future rewards to have less total influence and thus don’t normalize the weights: there the infinite sum represents the maximum possible return rather than a weighted average of complete return estimates.
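As a quick comparison of the two weight sums (writing the discounted return with a discount factor $\gamma$, which is assumed notation here, not defined above):

$$\underbrace{(1-\lambda)\sum_{n=1}^{\infty}\lambda^{n-1}}_{\text{TD}(\lambda)\ \text{weights}} = 1
\qquad\text{whereas}\qquad
\underbrace{\sum_{k=0}^{\infty}\gamma^{k}}_{\text{discount weights}} = \frac{1}{1-\gamma} > 1$$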

TD(0) ($\lambda = 0$) is equivalent to using only a single step of actual reward plus the predicted value of the next state (the one-step TD target).
TD(1) ($\lambda = 1$) is equivalent to Monte Carlo methods (complete trajectories of actual rewards).
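A small numeric check of these two endpoints, reusing the `lambda_return` sketch above (the rewards and values are made up for illustration):

```python
rewards = [1.0, 0.0, 2.0, 1.0]           # R_1 .. R_4 of a 4-step episode
values  = [0.5, 0.4, 0.8, 0.3]           # hypothetical value estimates V(S_0) .. V(S_3)
gamma   = 0.9

# lambda = 0: reduces to the one-step TD target R_{t+1} + gamma * V(S_{t+1})
td0 = lambda_return(rewards, values, t=0, gamma=gamma, lam=0.0)
assert abs(td0 - (rewards[0] + gamma * values[1])) < 1e-12

# lambda = 1: reduces to the Monte Carlo return of the whole episode
mc = sum(gamma**k * r for k, r in enumerate(rewards))
td1 = lambda_return(rewards, values, t=0, gamma=gamma, lam=1.0)
assert abs(td1 - mc) < 1e-12
```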