year: 2018
paper: https://arxiv.org/pdf/1506.02438
website:
code:
connections:
Abstract
Policy gradient methods are an appealing approach in reinforcement learning because they directly optimize the cumulative reward and can straightforwardly be used with nonlinear function approximators such as neural networks.
The two main challenges are the large number of samples typically required, and the difficulty of obtaining stable and steady improvement despite the nonstationarity of the incoming data.
We address the first challenge by using value functions to substantially reduce the variance of policy gradient estimates at the cost of some bias, with an exponentially-weighted estimator of the advantage function that is analogous to TD(λ).
We address the second challenge by using a trust region optimization procedure for both the policy and the value function, which are represented by neural networks.
Motivation for GAE
In vanilla policy gradient, we weight the log-probability gradient by the actual (empirical) returns $\sum_{l=0}^{\infty} \gamma^l r_{t+l}$, which is unbiased but high variance: because the returns come from a noisy environment and a stochastic policy, the variance of the gradient estimator scales unfavorably with the time horizon, since the effect of an action is confounded with the effects of past and future actions.
Actor-critic methods alleviate this by estimating a value function instead, at the cost of introducing bias (see bias-variance tradeoff).
→ While variance is annoying because it makes training potentially unstable and requires more samples, bias can be a bigger problem: your model might learn the wrong thing and, even with infinite samples, never find the optimal policy (fail to converge, or converge to a suboptimal solution).
Generalized Advantage Estimation (GAE)
GAE is a method for estimating the advantage in policy gradient algorithms that trades off bias and variance through an exponentially weighted average of TD terms:
$$\hat{A}_t^{\mathrm{GAE}(\gamma,\lambda)} = \sum_{l=0}^{\infty} (\gamma\lambda)^l\, \delta_{t+l}^{V}$$
where $\delta_t^{V} = r_t + \gamma V(s_{t+1}) - V(s_t)$ is the temporal difference residual using value function $V$.
GAE combines multiple n-step advantage estimates through exponential weighting controlled by λ:
$$\hat{A}_t^{\mathrm{GAE}(\gamma,\lambda)} = (1-\lambda)\left(\hat{A}_t^{(1)} + \lambda\,\hat{A}_t^{(2)} + \lambda^2\,\hat{A}_t^{(3)} + \cdots\right)$$
where $\hat{A}_t^{(k)} = \sum_{l=0}^{k-1} \gamma^l \delta_{t+l}^{V} = -V(s_t) + r_t + \gamma r_{t+1} + \cdots + \gamma^{k-1} r_{t+k-1} + \gamma^k V(s_{t+k})$ is the k-step advantage estimate.
When λ = 0, we get just the one-step TD error: $\hat{A}_t = \delta_t^{V} = r_t + \gamma V(s_{t+1}) - V(s_t)$.
When λ = 1, we get the full discounted sum of residuals, which telescopes into the Monte Carlo return minus the value baseline: $\hat{A}_t = \sum_{l=0}^{\infty} \gamma^l \delta_{t+l}^{V} = \sum_{l=0}^{\infty} \gamma^l r_{t+l} - V(s_t)$. Values between 0 and 1 balance short-horizon and long-horizon advantage estimates (see the code sketch after this list):
- Lower λ: relies more on value function bootstrapping, giving lower variance (shorter horizons → fewer random actions/transitions[^1]) but higher bias (from value estimation errors).
- Higher λ: uses longer stretches of actual returns → less biased estimates but higher variance from accumulating stochastic outcomes (from the environment and the policy).
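A minimal sketch of how this is usually computed in practice, via the backward recursion $\hat{A}_t = \delta_t^{V} + \gamma\lambda\,\hat{A}_{t+1}$ over one complete episode (function and variable names are illustrative, not from the paper's code):

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """GAE(gamma, lambda) advantages for a single complete episode.

    rewards: shape [T], the rewards r_0 ... r_{T-1}
    values:  shape [T + 1], V(s_0) ... V(s_T); use V(s_T) = 0 for a terminal state
    """
    T = len(rewards)
    advantages = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        # TD residual: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Backward recursion: A_t = delta_t + gamma * lambda * A_{t+1}
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages
```

With `lam=0` this returns the one-step TD errors, and with `lam=1` (and a terminal value of 0) it reduces to the discounted returns minus the value baseline, matching the two limiting cases above.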
Note
In practice, we truncate the infinite sum at the episode boundary or after a fixed number of steps. Truncating at an episode boundary is exact, since there are no further rewards; cutting a rollout off mid-episode shifts more weight onto the value-function bootstrap for the last few timesteps, adding a small amount of extra bias that is usually negligible because the $(\gamma\lambda)^l$ weights decay quickly. Either way, the variance is significantly lower than with Monte Carlo returns.
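A sketch of this truncated, fixed-length-rollout variant as used in most actor-critic implementations, where a rollout may cut an episode short and the tail is bootstrapped from the value of the last observed state (the `dones` / `last_value` bookkeeping below is illustrative):

```python
import numpy as np

def gae_truncated(rewards, values, dones, last_value, gamma=0.99, lam=0.95):
    """GAE over a fixed-length rollout that may end mid-episode.

    rewards, dones: shape [T]; dones[t] is True if the episode terminated at step t
    values:         shape [T], V(s_0) ... V(s_{T-1})
    last_value:     V(s_T), used to bootstrap the tail if the rollout was truncated
    """
    T = len(rewards)
    advantages = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        next_value = last_value if t == T - 1 else values[t + 1]
        not_done = 1.0 - float(dones[t])  # no bootstrapping across episode boundaries
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        gae = delta + gamma * lam * not_done * gae
        advantages[t] = gae
    return advantages
```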
γ-just estimators
An advantage estimator $\hat{A}_t$ is γ-just if it gives an unbiased estimate of the γ-discounted advantage function when used in the policy gradient. Mathematically:
$$\mathbb{E}\Big[\hat{A}_t(s_{0:\infty}, a_{0:\infty})\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\Big] = \mathbb{E}\Big[A^{\pi,\gamma}(s_t, a_t)\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\Big]$$
A sufficient condition is that $\hat{A}_t$ can be decomposed as $\hat{A}_t = Q_t(s_{t:\infty}, a_{t:\infty}) - b_t(s_{0:t}, a_{0:t-1})$, where $Q_t$ gives an unbiased estimate of the γ-discounted Q-function, $\mathbb{E}\big[Q_t \mid s_t, a_t\big] = Q^{\pi,\gamma}(s_t, a_t)$, and $b_t$ is an arbitrary baseline that depends only on states and actions sampled before $a_t$.
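For example (as shown in the paper), the TD residual computed with the true value function $V = V^{\pi,\gamma}$ is γ-just, because it decomposes in exactly this form:
$$\delta_t^{V^{\pi,\gamma}} = \underbrace{r_t + \gamma V^{\pi,\gamma}(s_{t+1})}_{Q_t:\ \text{unbiased estimate of } Q^{\pi,\gamma}(s_t, a_t)} - \underbrace{V^{\pi,\gamma}(s_t)}_{b_t:\ \text{depends only on } s_t}$$
Consequently, $\hat{A}_t^{\mathrm{GAE}(\gamma,\lambda)}$ is γ-just for any λ when $V = V^{\pi,\gamma}$; with an approximate value function it introduces bias through the value estimation error.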
reward shaping interpretation
GAE can be understood through reward shaping, where we transform the reward function using the value function as the shaping potential, Φ = V:
$$\tilde{r}(s, a, s') = r(s, a, s') + \gamma \Phi(s') - \Phi(s)$$
This shaped reward equals the TD residual: $\tilde{r}(s_t, a_t, s_{t+1}) = r_t + \gamma V(s_{t+1}) - V(s_t) = \delta_t^{V}$.
- First reshape the rewards using Φ = V to reduce the temporal spread of credit assignment
- Then use the steeper discount γλ ≤ γ to cut off noise from long delays
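Putting the two steps together, summing the shaped rewards with the steeper discount γλ recovers exactly the GAE estimator:
$$\sum_{l=0}^{\infty} (\gamma\lambda)^l\, \tilde{r}(s_{t+l}, a_{t+l}, s_{t+l+1}) = \sum_{l=0}^{\infty} (\gamma\lambda)^l\, \delta_{t+l}^{V} = \hat{A}_t^{\mathrm{GAE}(\gamma,\lambda)}$$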
The GAE parameters γ and λ have the following interpretations:
- γ acts as the standard discount factor
- λ acts as an additional discount applied specifically to the shaped rewards
- γ primarily determines the scale of the value function and introduces bias regardless of value function accuracy
- λ < 1 only introduces bias when the value function is inaccurate.
- This explains why empirically good λ values are usually lower than good γ values: λ introduces less bias for a reasonably accurate value function.
- The exponential weighting by λ effectively reduces the temporal spread of the response function, i.e. how long it takes for an action’s effects to be captured in the rewards.
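A quick numeric illustration of the "additional discount" effect, using γ = 0.99 and λ = 0.95 (illustrative values commonly used in practice, not taken from the paper): the geometric sum $\sum_{l} (\gamma\lambda)^l = 1/(1-\gamma\lambda)$ gives the effective horizon over which TD residuals still contribute.

```python
import numpy as np

gamma, lam = 0.99, 0.95              # illustrative, commonly used values

print(1 / (1 - gamma))               # ~100 steps: horizon of the gamma-discounted objective
print(1 / (1 - gamma * lam))         # ~17 steps: effective horizon of the GAE weighting

# Weight GAE assigns to the TD residual l steps in the future
for l in (0, 10, 20, 40):
    print(l, (gamma * lam) ** l)     # decays much faster than gamma ** l alone
```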
References
Footnotes
[^1]: Imagine estimating how much money you’ll have after 10 coin flips vs. 1000 coin flips. The longer sequence will have much more variance (the variances of independent variables add up), even though it might give you a less biased estimate of the expected value. This is also analogous to the difference between a short-term and a long-term investment strategy: short-term strategies rely on quick gains and losses, while long-term strategies focus on the overall trajectory of the investment.