Actor-critic methods are reinforcement learning algorithms that simply add a value-estimation head, or a separate network, that estimates the value function (the critic) in addition to the policy network (the actor).
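A minimal sketch of that idea (not from the original text; architecture, layer sizes, and PyTorch usage are illustrative assumptions): the actor and critic share a trunk, and the critic is just an extra scalar head.

```python
import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        # Shared feature trunk used by both heads
        self.trunk = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
        )
        self.policy_head = nn.Linear(hidden, n_actions)  # actor: action logits
        self.value_head = nn.Linear(hidden, 1)           # critic: V(s) estimate

    def forward(self, obs: torch.Tensor):
        h = self.trunk(obs)
        return self.policy_head(h), self.value_head(h).squeeze(-1)

# Usage: logits parameterize the policy, value is the critic's state-value estimate.
logits, value = ActorCritic(obs_dim=4, n_actions=2)(torch.randn(8, 4))
```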

Motivation for GAE

In pure vanilla policy gradient, we use the actual (empirical) returns, which are unbiased but high-variance: because the returns come from a noisy environment and a stochastic policy, the variance of the gradient estimator scales unfavorably with the time horizon, since the effect of an action is confounded with the effects of past and future actions.
Actor-critic methods alleviate this by estimating a value function instead, at the cost of introducing bias (see the bias-variance tradeoff).
→ While variance is annoying, as it makes training potentially unstable and requires more samples, bias can be a bigger problem: your model might learn the wrong thing and, even with infinite samples, never find the optimal policy (fail to converge, or converge to a suboptimal solution).
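A rough sketch of the two ends of that tradeoff, under assumed inputs (`rewards` from one episode, `values` from a hypothetical critic, discount `gamma`): the empirical Monte Carlo return is unbiased but depends on every noisy future reward, while the one-step bootstrapped (TD) target depends on only one reward plus the critic's estimate, so it has lower variance but inherits any bias in the critic. GAE, which the heading refers to, interpolates between these two extremes.

```python
import numpy as np

def monte_carlo_returns(rewards: np.ndarray, gamma: float) -> np.ndarray:
    """Empirical discounted returns R_t = sum_k gamma^k * r_{t+k}: unbiased, high variance."""
    returns = np.zeros_like(rewards, dtype=float)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def td_targets(rewards: np.ndarray, values: np.ndarray, gamma: float) -> np.ndarray:
    """One-step bootstrapped targets r_t + gamma * V(s_{t+1}): lower variance,
    biased whenever the critic's V is wrong."""
    next_values = np.append(values[1:], 0.0)  # assume the episode ends after the last step
    return rewards + gamma * next_values

rewards = np.array([0.0, 0.0, 1.0])
values = np.array([0.5, 0.6, 0.9])           # hypothetical critic predictions
print(monte_carlo_returns(rewards, 0.99))    # uses every future reward
print(td_targets(rewards, values, 0.99))     # uses one reward + the critic's estimate
```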
