The goal of reinforcement learning is to find a behavior strategy for the agent that maximizes the reward it obtains. Policy gradient methods model and optimize the policy directly (as opposed to value-based methods, which model a value function and derive the policy from it). The policy is usually modeled as a function $\pi_\theta(a \mid s)$ parameterized by $\theta$. The value of the reward (objective) function depends on this policy, so various algorithms can be applied to optimize $\theta$ for the best reward.
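The reward function is defined as:
$$J(\theta) = \sum_{s \in \mathcal{S}} d^\pi(s) V^\pi(s) = \sum_{s \in \mathcal{S}} d^\pi(s) \sum_{a \in \mathcal{A}} \pi_\theta(a \mid s) Q^\pi(s, a)$$
where $d^\pi(s)$ is the stationary distribution of the Markov chain for $\pi_\theta$ (the on-policy state distribution under $\pi$).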
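Using gradient ascent, we can move $\theta$ in the direction suggested by the gradient $\nabla_\theta J(\theta)$ to find the $\theta$ that maximizes the return.
$$\nabla_\theta J(\theta) = \nabla_\theta \sum_{s \in \mathcal{S}} d^\pi(s) \sum_{a \in \mathcal{A}} Q^\pi(s, a) \pi_\theta(a \mid s) \propto \sum_{s \in \mathcal{S}} d^\pi(s) \sum_{a \in \mathcal{A}} Q^\pi(s, a) \nabla_\theta \pi_\theta(a \mid s)$$
Via the policy gradient theorem, we can drop the derivative of the state distribution, $\nabla_\theta d^\pi(s)$, from the gradient (i.e. we don't need to know or differentiate the environment).
We end up with the vanilla policy gradient:
$$\nabla_\theta J(\theta) \propto \mathbb{E}_{s \sim d^\pi,\, a \sim \pi_\theta}\left[ Q^\pi(s, a) \nabla_\theta \ln \pi_\theta(a \mid s) \right]$$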
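As a sanity check on the expectation form of the gradient, here is a minimal sketch for the simplest possible case: a single state, a softmax policy over three actions, and fixed action values. The logits `theta` and the values `q` are made-up numbers, and all function names are illustrative. With only one state, $d^\pi$ is trivial and $J(\theta) = \sum_a \pi_\theta(a) Q(a)$, so the score-function expression can be compared directly against a finite-difference gradient of $J$.

```python
import math

def softmax(logits):
    """Softmax policy pi_theta(a) over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def grad_log_pi(logits, action):
    """Gradient of ln pi_theta(action) w.r.t. the logits: one_hot(action) - pi."""
    pi = softmax(logits)
    return [(1.0 if i == action else 0.0) - p for i, p in enumerate(pi)]

def objective(logits, q_values):
    """J(theta) = sum_a pi_theta(a) Q(a) for a single-state problem."""
    return sum(p * q for p, q in zip(softmax(logits), q_values))

def policy_gradient(logits, q_values):
    """E_{a ~ pi}[Q(a) * grad ln pi(a)], computed exactly by summing over actions."""
    pi = softmax(logits)
    grad = [0.0] * len(logits)
    for a, (p, q) in enumerate(zip(pi, q_values)):
        g = grad_log_pi(logits, a)
        for i in range(len(grad)):
            grad[i] += p * q * g[i]
    return grad

def finite_difference(logits, q_values, eps=1e-6):
    """Central-difference approximation of grad J(theta)."""
    grad = []
    for i in range(len(logits)):
        up = list(logits)
        down = list(logits)
        up[i] += eps
        down[i] -= eps
        grad.append((objective(up, q_values) - objective(down, q_values)) / (2 * eps))
    return grad

theta = [0.2, -0.5, 1.0]   # hypothetical logits
q = [1.0, 0.0, 2.0]        # hypothetical fixed action values

pg = policy_gradient(theta, q)
fd = finite_difference(theta, q)
print(pg)
print(fd)
```

The two printed vectors should agree up to finite-difference error. In a real problem the sum over actions is replaced by samples from the policy, and $Q^\pi$ has to be estimated.
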
See vanilla policy gradient for a more detailed explanation of the most basic policy gradient algorithm and of how the gradient of this objective is derived. Below is a more general formulation of different policy gradients, based on the paper "High-Dimensional Continuous Control Using Generalized Advantage Estimation".

Policy gradient methods

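Let $s_0$ be sampled from the initial state distribution $\rho_0$. A trajectory $(s_0, a_0, s_1, a_1, \dots)$ is generated by:

  • Sampling actions: $a_t \sim \pi(a_t \mid s_t)$
  • Sampling states: $s_{t+1} \sim P(s_{t+1} \mid s_t, a_t)$

until reaching a terminal state. At each timestep, a reward $r_t = r(s_t, a_t, s_{t+1})$ is received. The goal is to maximize the expected total reward $\sum_{t=0}^{\infty} r_t$ (assumed to be finite).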
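
The sampling loop described above can be sketched on a toy problem. Everything in this example is an assumption made for illustration: a five-state chain with a terminal state at the right end, a fixed stochastic policy, deterministic dynamics, and a reward of -1 per step. None of it comes from the paper.

```python
import random

N_STATES = 5  # states 0..4; state 4 is terminal (hypothetical toy chain)

def sample_initial_state(rng):
    """rho_0: this toy example always starts in state 0."""
    return 0

def policy(state, rng, p_right=0.7):
    """pi(a | s): move right (+1) with probability p_right, else left (-1)."""
    return 1 if rng.random() < p_right else -1

def dynamics(state, action, rng):
    """P(s' | s, a): deterministic move, clipped at the left wall."""
    return max(0, state + action)

def reward(state, action, next_state):
    """r(s, a, s'): -1 per step, so shorter trajectories score higher."""
    return -1.0

def rollout(rng, max_steps=1000):
    """Sample one trajectory until the terminal state (or a step cap) is reached."""
    s = sample_initial_state(rng)
    states, actions, rewards = [s], [], []
    while s != N_STATES - 1 and len(actions) < max_steps:
        a = policy(s, rng)
        s_next = dynamics(s, a, rng)
        actions.append(a)
        rewards.append(reward(s, a, s_next))
        states.append(s_next)
        s = s_next
    return states, actions, rewards

states, actions, rewards = rollout(random.Random(0))
total_reward = sum(rewards)  # the quantity whose expectation is maximized
print(len(actions), total_reward)
```

The `max_steps` cap is a practical guard only; the text assumes every trajectory terminates and has finite total reward.
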

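Policy gradient methods maximize the expected total reward by estimating the gradient $g := \nabla_\theta \mathbb{E}\left[\sum_{t=0}^{\infty} r_t\right]$:
$$g = \mathbb{E}\left[\sum_{t=0}^{\infty} \Psi_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right]$$
where $\Psi_t$ can take various forms, each estimating the contribution of action $a_t$ to the total reward: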

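$\sum_{t=0}^{\infty} r_t$ … Total trajectory reward
$\sum_{t'=t}^{\infty} r_{t'}$ … Future reward after action $a_t$
$\sum_{t'=t}^{\infty} r_{t'} - b(s_t)$ … Baselined future reward after action $a_t$
$Q^\pi(s_t, a_t)$ … state-action value
$A^\pi(s_t, a_t)$ … advantage function
$r_t + V^\pi(s_{t+1}) - V^\pi(s_t)$ … TD residual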

where
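$V^\pi(s_t) := \mathbb{E}_{s_{t+1:\infty},\, a_{t:\infty}}\left[\sum_{l=0}^{\infty} r_{t+l}\right]$ … state-value; expected total reward from state $s_t$
$Q^\pi(s_t, a_t) := \mathbb{E}_{s_{t+1:\infty},\, a_{t+1:\infty}}\left[\sum_{l=0}^{\infty} r_{t+l}\right]$ … state-action value; expected total reward from state $s_t$ and action $a_t$
$A^\pi(s_t, a_t) := Q^\pi(s_t, a_t) - V^\pi(s_t)$ … advantage function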
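
To make the sample-based choices of $\Psi_t$ concrete, the sketch below computes the total reward, the future reward, a baselined future reward, and the TD residual for one short trajectory. The rewards and the value estimates are made-up numbers, the value function is used as the baseline $b(s_t)$ (one common choice, assumed here), and the helper names are illustrative.

```python
def reward_to_go(rewards):
    """Future reward after each action: sum of r_{t'} for t' >= t."""
    out = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running += rewards[t]
        out[t] = running
    return out

def td_residuals(rewards, values):
    """r_t + V(s_{t+1}) - V(s_t); `values` has one more entry than `rewards`."""
    return [rewards[t] + values[t + 1] - values[t] for t in range(len(rewards))]

rewards = [1.0, 0.0, 2.0]        # hypothetical r_0, r_1, r_2
values = [2.5, 1.5, 2.0, 0.0]    # hypothetical V(s_0..s_3); s_3 is terminal, so V = 0

total = sum(rewards)                                # 3.0, same for every timestep
future = reward_to_go(rewards)                      # [3.0, 2.0, 2.0]
baselined = [g - v for g, v in zip(future, values)] # [0.5, 0.5, 0.0], using b(s_t) = V(s_t)
deltas = td_residuals(rewards, values)              # [0.0, 0.5, 0.0]

# In this undiscounted setting with a terminal value of 0, the TD residuals
# from step t onward telescope to the baselined future reward at step t.
for t in range(len(rewards)):
    assert abs(sum(deltas[t:]) - baselined[t]) < 1e-12
```

The state-action value and advantage forms are not computed here because they require the true $Q^\pi$ and $A^\pi$, which in practice are unknown and have to be estimated.
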

Choosing the advantage function for $\Psi_t$ yields almost the lowest possible variance, though in practice the advantage function is not known and must be estimated. This statement can be intuitively justified by the following interpretation of the policy gradient: a step in the policy gradient direction should increase the probability of better-than-average actions and decrease the probability of worse-than-average actions. The advantage function $A^\pi(s_t, a_t) = Q^\pi(s_t, a_t) - V^\pi(s_t)$, by its definition, measures whether the action is better or worse than the policy's default behavior. Hence, we should choose $\Psi_t$ to be the advantage function $A^\pi(s_t, a_t)$, so that the gradient term $\Psi_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)$ points in the direction of increased $\pi_\theta(a_t \mid s_t)$ if and only if $A^\pi(s_t, a_t) > 0$.
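
The sign argument can be checked numerically. This is a minimal sketch that assumes a softmax policy over three actions in a single state, with made-up advantage values and a hypothetical learning rate. It takes one step along $A \nabla_\theta \log \pi_\theta(a)$ and compares the probability of the chosen action before and after.

```python
import math

def softmax(logits):
    """Softmax policy pi_theta(a) over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def step(logits, action, advantage, lr=0.1):
    """One ascent step along advantage * grad ln pi_theta(action)."""
    pi = softmax(logits)
    grad = [(1.0 if i == action else 0.0) - p for i, p in enumerate(pi)]
    return [x + lr * advantage * g for x, g in zip(logits, grad)]

theta = [0.0, 0.0, 0.0]  # uniform policy to start (hypothetical)
action = 1

before = softmax(theta)[action]
after_pos = softmax(step(theta, action, advantage=2.0))[action]   # better than average
after_neg = softmax(step(theta, action, advantage=-2.0))[action]  # worse than average

print(before, after_pos, after_neg)
```

With a positive advantage the probability of the sampled action rises, and with a negative advantage it falls, which matches the interpretation given above.
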

References

The policy parameters $\theta$, and also function inputs such as the state and action, are often omitted in notation.

PPO

https://lilianweng.github.io/posts/2018-04-08-policy-gradient/