The goal of reinforcement learning is to find an optimal behavior strategy for the agent so as to obtain maximal rewards. Policy gradient methods model and optimize the policy directly (as opposed to value-based methods, which model a value function and derive the policy from it). The policy is usually modeled as a parameterized function $\pi_\theta(a \mid s)$ with respect to parameters $\theta$. The value of the reward (objective) function depends on this policy, and various algorithms can then be applied to optimize $\theta$ for the best reward.
The reward (objective) function is defined as:

$$J(\theta) = \sum_{s \in \mathcal{S}} d^\pi(s) V^\pi(s) = \sum_{s \in \mathcal{S}} d^\pi(s) \sum_{a \in \mathcal{A}} \pi_\theta(a \mid s) Q^\pi(s, a)$$

where $d^\pi(s)$ is the stationary distribution of the Markov chain for $\pi_\theta$ (the on-policy state distribution under $\pi$).
Using gradient ascent, we can move $\theta$ in the direction suggested by the gradient $\nabla_\theta J(\theta)$ to find the $\theta$ that maximizes the return.
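Concretely, each gradient-ascent update with step size $\alpha$ has the form:

$$\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$$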
Via the policy gradient theorem, we can drop the derivative of the state distribution $d^\pi(s)$ from the gradient computation (i.e. we don't need to know or differentiate the environment dynamics).
We end up with the vanilla policy gradient:

$$\nabla_\theta J(\theta) = \mathbb{E}_\pi \left[ Q^\pi(s, a) \, \nabla_\theta \ln \pi_\theta(a \mid s) \right]$$
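As a concrete sketch (not taken from either referenced source), a Monte-Carlo / REINFORCE-style estimate of this gradient can be computed from one sampled trajectory as below, assuming a PyTorch policy that exposes per-step log-probabilities $\log \pi_\theta(a_t \mid s_t)$; the reward-to-go serves as a sample estimate of $Q^\pi(s_t, a_t)$, and the function name and shapes are illustrative:

```python
import torch

def reinforce_loss(log_probs: torch.Tensor, rewards: torch.Tensor,
                   gamma: float = 0.99) -> torch.Tensor:
    """Surrogate loss whose gradient is a sample estimate of the vanilla policy gradient.

    log_probs: log pi_theta(a_t | s_t) for one sampled trajectory, shape [T]
    rewards:   r_t for the same trajectory, shape [T]
    """
    # Discounted reward-to-go as a Monte-Carlo estimate of Q^pi(s_t, a_t).
    returns = torch.zeros_like(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    # Negative sign: optimizers minimize, while we want gradient *ascent* on J(theta).
    return -(log_probs * returns.detach()).sum()
```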
See vanilla policy gradient for a more detailed explanation of the most basic policy gradient algorithm and how the gradient of this objective is derived. Below is a more general formulation of different policy gradients, based on the paper High-Dimensional Continuous Control Using Generalized Advantage Estimation.
Policy gradient methods
Let the initial state $s_0$ be sampled from the initial state distribution $\rho_0$. A trajectory $(s_0, a_0, s_1, a_1, \ldots)$ is generated by:
- Sampling actions: $a_t \sim \pi(a_t \mid s_t)$
- Sampling states: $s_{t+1} \sim P(s_{t+1} \mid s_t, a_t)$
until reaching a terminal state. At each timestep, a reward $r_t = r(s_t, a_t, s_{t+1})$ is received. The goal is to maximize the expected total reward $\sum_{t=0}^{\infty} r_t$, which is assumed to be finite.
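As an illustration of this setup, a minimal rollout loop might look like the sketch below, written against a Gymnasium-style environment interface (`env.reset()` / `env.step()`) and a hypothetical `policy(state)` action sampler:

```python
def rollout(env, policy):
    """Generate one trajectory (s_0, a_0, r_0, s_1, ...) until a terminal state."""
    states, actions, rewards = [], [], []
    state, _ = env.reset()          # s_0 ~ rho_0
    done = False
    while not done:
        action = policy(state)      # a_t ~ pi(a_t | s_t)
        next_state, reward, terminated, truncated, _ = env.step(action)  # s_{t+1} ~ P(. | s_t, a_t)
        states.append(state)
        actions.append(action)
        rewards.append(reward)      # r_t
        done = terminated or truncated
        state = next_state
    return states, actions, rewards
```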
Policy gradient methods maximize the expected total reward by repeatedly estimating the gradient $g := \nabla_\theta \mathbb{E}\left[\sum_{t=0}^{\infty} r_t\right]$, which can be written in the general form

$$g = \mathbb{E}\left[\sum_{t=0}^{\infty} \Psi_t \, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right]$$

where $\Psi_t$ can take various forms, to estimate the contribution of each action to the total reward:
- $\Psi_t = \sum_{t'=0}^{\infty} r_{t'}$ … total trajectory reward
- $\Psi_t = \sum_{t'=t}^{\infty} r_{t'}$ … future reward following action $a_t$
- $\Psi_t = \sum_{t'=t}^{\infty} r_{t'} - b(s_t)$ … baselined future reward following action $a_t$
- $\Psi_t = Q^\pi(s_t, a_t)$ … state-action value
- $\Psi_t = A^\pi(s_t, a_t) = Q^\pi(s_t, a_t) - V^\pi(s_t)$ … advantage function
- $\Psi_t = r_t + V^\pi(s_{t+1}) - V^\pi(s_t)$ … TD residual

where

- $V^\pi(s_t) = \mathbb{E}_{s_{t+1:\infty},\, a_{t:\infty}}\left[\sum_{l=0}^{\infty} r_{t+l}\right]$ … state value; expected total reward from state $s_t$
- $Q^\pi(s_t, a_t) = \mathbb{E}_{s_{t+1:\infty},\, a_{t+1:\infty}}\left[\sum_{l=0}^{\infty} r_{t+l}\right]$ … state-action value; expected total reward from state $s_t$ and action $a_t$
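To make these choices concrete, the sketch below (NumPy; the helper name and array layout are illustrative assumptions) computes the total-reward, reward-to-go, baselined, and TD-residual variants of $\Psi_t$ from one trajectory of rewards and estimated state values:

```python
import numpy as np

def psi_variants(rewards: np.ndarray, values: np.ndarray):
    """Compute a few choices of Psi_t for a finite trajectory.

    rewards: r_0 ... r_{T-1}, shape [T]
    values:  estimates of V^pi(s_0) ... V^pi(s_T), shape [T + 1]
             (values[T] is the value of the state after the last action, 0 if terminal)
    """
    T = len(rewards)
    # Total trajectory reward (the same value for every t).
    total = np.full(T, rewards.sum())
    # Future reward following action a_t (reward-to-go).
    reward_to_go = np.cumsum(rewards[::-1])[::-1]
    # Baselined future reward, using V^pi(s_t) as the baseline b(s_t).
    baselined = reward_to_go - values[:-1]
    # TD residual: r_t + V^pi(s_{t+1}) - V^pi(s_t).
    td_residual = rewards + values[1:] - values[:-1]
    return total, reward_to_go, baselined, td_residual
```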
Choosing the advantage function for $\Psi_t$ yields almost the lowest possible variance, though in practice the advantage function is not known and must be estimated. This statement can be intuitively justified by the following interpretation of the policy gradient: a step in the policy gradient direction should increase the probability of better-than-average actions and decrease the probability of worse-than-average actions. The advantage function, by its definition $A^\pi(s_t, a_t) = Q^\pi(s_t, a_t) - V^\pi(s_t)$, measures whether an action is better or worse than the policy's default behavior. Hence, we should choose $\Psi_t$ to be the advantage function, so that the gradient term $\Psi_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)$ points in the direction of increased $\pi_\theta(a_t \mid s_t)$ if and only if $A^\pi(s_t, a_t) > 0$.
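Since $A^\pi$ must be estimated, the referenced paper proposes the generalized advantage estimator $\hat{A}_t^{\mathrm{GAE}(\gamma, \lambda)} = \sum_{l=0}^{\infty} (\gamma\lambda)^l \delta_{t+l}$, an exponentially weighted sum of TD residuals $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$. A minimal sketch of that estimator (same array layout as the previous sketch, value estimates assumed given) is:

```python
import numpy as np

def gae_advantages(rewards: np.ndarray, values: np.ndarray,
                   gamma: float = 0.99, lam: float = 0.95) -> np.ndarray:
    """Generalized Advantage Estimation:
    A_t = sum_l (gamma * lam)^l * delta_{t+l},
    where delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)."""
    T = len(rewards)
    deltas = rewards + gamma * values[1:] - values[:-1]   # TD residuals
    advantages = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        running = deltas[t] + gamma * lam * running
        advantages[t] = running
    return advantages
```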
References
Note: subscripts such as $\theta$ in $\pi_\theta$, but also inputs like $s_t$ and $a_t$, are often omitted in notation.
https://lilianweng.github.io/posts/2018-04-08-policy-gradient/