Eligibility Traces
Eligibility traces are short-term memory vectors that allow multi-step methods to be implemented in a streaming form, achieving better credit assignment than their one-step counterparts. The idea of eligibility traces (Sutton & Barto, 1981) was influenced by the biological model of Klopf (1972).
For a value function $\hat{v}(s, \mathbf{w})$ with parameters $\mathbf{w}$, the eligibility trace vector $\mathbf{z}_t$ is:

$$\mathbf{z}_t = \gamma \lambda \mathbf{z}_{t-1} + \nabla_{\mathbf{w}} \hat{v}(S_t, \mathbf{w}_t),$$

where $\gamma$ is the discount factor, $\lambda \in [0, 1]$ is the trace parameter, and $\mathbf{z}_{-1} = \mathbf{0}$.
The trace accumulates gradients of value estimates, discounted by $\gamma\lambda$. This allows the learning algorithm to credit past states for current rewards more effectively than one-step methods. The trace is reset to zero at the start of each episode since credit assignment doesn’t carry across episodes.
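To make the recursion concrete, here is a minimal Python sketch (my own, not from the source); `update_trace` and `grad_value` are hypothetical names, and a random vector stands in for the gradient of the value estimate:

```python
import numpy as np

# Minimal sketch (not from the source). `update_trace` and `grad_value` are
# hypothetical names; a random vector stands in for the gradient of the
# value estimate with respect to the parameters, grad_w v(S_t, w).

def update_trace(z, grad_value, gamma=0.99, lam=0.9):
    """Accumulating-trace recursion: z_t = gamma * lam * z_{t-1} + grad_value."""
    return gamma * lam * z + grad_value

num_params = 4
for episode in range(3):
    z = np.zeros(num_params)                      # reset: z_{-1} = 0 each episode
    for t in range(5):
        grad_value = np.random.randn(num_params)  # stand-in for grad_w v(S_t, w)
        z = update_trace(z, grad_value)
    print(f"episode {episode}: final trace norm = {np.linalg.norm(z):.3f}")
```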
Since gradients keep accumulating in the trace vector, the traces can become arbitrarily large, potentially causing divergence. A careful update rule must therefore be used to prevent eligibility traces from destabilizing learning.
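As a rough illustration of why this matters (my own example, with assumed values of $\gamma$ and $\lambda$): when $\gamma\lambda$ is close to 1 and the per-step gradients have roughly constant magnitude, the accumulated trace is far larger than any single gradient, so the resulting parameter updates can overshoot unless the step size or the trace is controlled:

```python
# Illustration (my own numbers): with gamma * lam close to 1 and per-step
# gradients of roughly unit magnitude, the trace magnitude approaches
# ~ 1 / (1 - gamma * lam), far larger than any single gradient.

gamma, lam = 0.99, 0.95
z = 0.0
for t in range(2000):
    z = gamma * lam * z + 1.0            # unit-magnitude "gradient" each step
print(z, 1.0 / (1.0 - gamma * lam))      # both are roughly 16.8
```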
While similar to momentum in appearance, eligibility traces serve a fundamentally different purpose:
They maintain a trace of value-function gradients rather than loss gradients, enabling multi-step credit assignment. Unlike momentum, which persists indefinitely, traces reset between episodes, since there is no meaningful credit assignment across different episodes.
Eligibility traces say “remember how each parameter affected our output”: “when I moved my arm this way, the ball went higher.”
Momentum says “remember how each parameter affected our error”: “when I moved my arm this way, I missed the target by more/less.”
And unlike the momentum term, updates with eligibility traces are equivalent to those of multi-step returns (e.g., the λ-return), achieving fixed points superior to those of one-step updates. Lastly, eligibility traces have been found to be effective primarily in tabular settings or with linear function approximation, while none of their deep-learning counterparts are known to perform well.
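A schematic sketch of the distinction (my own, not the source's algorithm), using a linear value estimate $v(s) = \mathbf{w}^\top \mathbf{x}(s)$ so both update rules are easy to write down:

```python
import numpy as np

# Schematic comparison (not the source's algorithm). For a linear value
# estimate v(s) = w . x(s):
#   - momentum accumulates gradients of a LOSS (squared TD error with the
#     target held fixed) and persists across episodes;
#   - the eligibility trace accumulates gradients of the value OUTPUT
#     (just x(s) in the linear case), is scaled by the current TD error at
#     update time, and is reset to zero at episode boundaries.

def momentum_step(w, m, x, delta, beta=0.9, lr=0.1):
    grad_loss = -delta * x        # d/dw of 0.5 * delta^2, target held fixed
    m = beta * m + grad_loss
    return w - lr * m, m          # gradient-descent step on the loss

def trace_step(w, z, x, delta, gamma=0.99, lam=0.9, lr=0.1):
    z = gamma * lam * z + x       # gradient of the value output, not the loss
    return w + lr * delta * z, z  # credit past states via the trace

w = m = z = np.zeros(3)
x, delta = np.array([1.0, 0.0, 2.0]), 0.5
w_mom, m = momentum_step(w, m, x, delta)
w_td, z = trace_step(w, z, x, delta)
```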
Eligibility traces provide an online, incremental way to implement TD(λ).
There are two equivalent ways to understand TD(λ):
The “forward view” (TD(λ)’s original formulation): the update target is the λ-return,

$$G_t^\lambda = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_{t:t+n},$$

which looks ahead at future rewards, combining n-step returns $G_{t:t+n}$ with weights that decay by $\lambda$. While theoretically elegant, it’s not practical for online learning since you need to wait for future rewards.
The “backward view” (eligibility traces): this maintains a trace of past gradients, allowing immediate updates. It’s more practical for implementation.
Think of it like integration vs. summation:
The forward view (TD(λ)) is like calculating an area under a curve by looking at the whole curve.
The backward view (eligibility traces) is like approximating the same area by accumulating small rectangular slices as you go.
→ Rather than having to wait for future rewards to compute the λ-return, the trace maintains an exponentially weighted sum of past gradients that approximates the same multi-step update. This makes traces particularly valuable for continuous learning scenarios where we want to learn from partial trajectories.
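To illustrate the “wait for the future” issue, here is a sketch (my own, assuming an episodic task whose trajectory has already been recorded) that computes the λ-return for each step of a finished episode via the standard backward recursion; it cannot run until the episode is over:

```python
import numpy as np

# Forward-view sketch (not from the source): computing lambda-returns for a
# *completed* episode. rewards[t] is R_{t+1}; next_values[t] is v(S_{t+1}),
# with the terminal state's value taken to be 0. The recursion
#   G_t = R_{t+1} + gamma * ((1 - lam) * v(S_{t+1}) + lam * G_{t+1})
# reproduces the lambda-weighted mixture of n-step returns.

def lambda_returns(rewards, next_values, gamma=0.99, lam=0.9):
    G = np.zeros(len(rewards))
    G_next = 0.0                       # no return beyond the terminal state
    for t in reversed(range(len(rewards))):
        G[t] = rewards[t] + gamma * ((1 - lam) * next_values[t] + lam * G_next)
        G_next = G[t]
    return G

rewards = np.array([0.0, 0.0, 1.0])          # toy 3-step episode
next_values = np.array([0.2, 0.5, 0.0])      # v(S_1), v(S_2), v(S_3)=0 (terminal)
print(lambda_returns(rewards, next_values))
```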
The one-step TD error for value estimates is defined as:

$$\delta_t = R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w}) - \hat{v}(S_t, \mathbf{w}).$$
Each n-step return can be decomposed into a sum of TD errors (holding the value estimate fixed):

$$G_{t:t+n} - \hat{v}(S_t, \mathbf{w}) = \sum_{k=t}^{t+n-1} \gamma^{k-t} \delta_k.$$
Substituting this into the forward view equation and rearranging:

$$G_t^\lambda - \hat{v}(S_t, \mathbf{w}) = \sum_{k=t}^{\infty} (\gamma\lambda)^{k-t} \delta_k.$$
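One way to see this (a standard argument, assuming the value estimate is held fixed while the episode unfolds) is the one-step recursion

$$G_t^\lambda - \hat{v}(S_t, \mathbf{w}) = \delta_t + \gamma\lambda \left( G_{t+1}^\lambda - \hat{v}(S_{t+1}, \mathbf{w}) \right),$$

which, unrolled forward in time, yields the exponentially weighted sum of TD errors above.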
For linear function approximation, where $\hat{v}(s, \mathbf{w}) = \mathbf{w}^\top \mathbf{x}(s)$ and $\mathbf{x}(s)$ is the feature vector representing state $s$, the forward view update (with step size $\alpha$):

$$\mathbf{w}_{t+1} = \mathbf{w}_t + \alpha \left( G_t^\lambda - \hat{v}(S_t, \mathbf{w}_t) \right) \mathbf{x}(S_t)$$
becomes equivalent to the backward view with eligibility traces:

$$\mathbf{w}_{t+1} = \mathbf{w}_t + \alpha \, \delta_t \, \mathbf{z}_t, \qquad \mathbf{z}_t = \gamma\lambda \mathbf{z}_{t-1} + \mathbf{x}(S_t).$$
The trace recursively accumulates exactly the information needed to implement the forward view, but does so incrementally without having to wait for future rewards. This equivalence holds exactly for linear function approximation, which partly explains why eligibility traces have been most successful in linear or tabular settings.
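As a sanity check of this equivalence (my own sketch, assuming linear features, a single episode, and updates accumulated offline with the weights held fixed), the total forward-view update matches the total backward-view update to numerical precision:

```python
import numpy as np

# Sketch (not from the source): verifying the offline forward/backward
# equivalence for linear function approximation on a random toy episode.
# Weights are held fixed while both total updates are accumulated.

rng = np.random.default_rng(0)
T, d, gamma, lam, alpha = 6, 4, 0.99, 0.8, 0.1

X = rng.standard_normal((T + 1, d))   # feature vectors x(S_0), ..., x(S_T)
X[T] = 0.0                            # terminal state: v(S_T) = 0
R = rng.standard_normal(T)            # rewards R_1, ..., R_T
w = rng.standard_normal(d)
v = X @ w                             # fixed value estimates v(S_0), ..., v(S_T)

# TD errors: delta_t = R_{t+1} + gamma * v(S_{t+1}) - v(S_t)
delta = R + gamma * v[1:] - v[:-1]

# Forward view: lambda-returns via the backward recursion, then the total update.
G = np.zeros(T)
G_next = 0.0
for t in reversed(range(T)):
    G[t] = R[t] + gamma * ((1 - lam) * v[t + 1] + lam * G_next)
    G_next = G[t]
forward_update = alpha * ((G - v[:-1])[:, None] * X[:-1]).sum(axis=0)

# Backward view: eligibility trace z_t = gamma * lam * z_{t-1} + x(S_t).
backward_update = np.zeros(d)
z = np.zeros(d)
for t in range(T):
    z = gamma * lam * z + X[t]
    backward_update += alpha * delta[t] * z

print(np.allclose(forward_update, backward_update))   # True
```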