Spinning Up: Intro to Policy Optimization | code

The goal of reinforcement learning is to find an optimal behavior strategy for the agent to obtain optimal rewards. Policy gradient methods model and optimize the policy directly. The policy is usually modeled as a parameterized function of $\theta$, $\pi_\theta(a\mid s)$. The value of the reward (objective) function depends on this policy, and various algorithms can then be applied to optimize $\theta$ for the best reward.

We have a stochastic, parameterized policy $\pi_\theta$ and want to maximize the expected return $J(\pi_\theta) = \mathbb{E}_{\tau\sim\pi_\theta}[R(\tau)]$, so we are going to optimize the policy parameters via gradient ascent:

$$\theta_{k+1} = \theta_k + \alpha \nabla_\theta J(\pi_\theta)\big|_{\theta_k}$$

The $\big|_{\theta_k}$ is just very explicit notation that the gradient is evaluated at the current set of parameters $\theta_k$.
$\nabla_\theta J(\pi_\theta)$, the gradient of the policy performance, is called the policy gradient.
Policy gradient algorithms are e.g. VPG, TRPO, and PPO.
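
As a concrete sketch (my own minimal example, not the Spinning Up code): a categorical policy $\pi_\theta(a\mid s)$ can be a small MLP whose output logits define a distribution over actions, with the network weights playing the role of $\theta$:

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

class CategoricalPolicy(nn.Module):
    """pi_theta(a|s): an MLP mapping observations to action logits; its weights are theta."""
    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, obs):
        # Distribution over actions for the given observation(s).
        return Categorical(logits=self.net(obs))

policy = CategoricalPolicy(obs_dim=4, n_actions=2)
obs = torch.randn(4)               # dummy observation
dist = policy(obs)                 # pi_theta(.|s)
action = dist.sample()             # a ~ pi_theta(.|s)
log_prob = dist.log_prob(action)   # log pi_theta(a|s), differentiable w.r.t. theta
```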

derivation for basic policy gradient

The probability of a trajectory $\tau = (s_0, a_0, \dots, s_{T+1})$, given that actions come from $\pi_\theta$, is

$$P(\tau\mid\theta) = \rho_0(s_0) \prod_{t=0}^{T} P(s_{t+1}\mid s_t, a_t)\,\pi_\theta(a_t\mid s_t)$$

($\rho_0$ is the start-state distribution, see Introduction to RL)

The Log-Derivative Trick:

log-derivative trick

The derivative of a function is the function times the derivative of its log:

$$\nabla_x f(x) = f(x)\,\nabla_x \log f(x)$$

This can be useful when you are dealing with products in the derivative, as the log turns products into a sum, e.g.:

$$\nabla_x \log\big(f(x)\,g(x)\big) = \nabla_x \log f(x) + \nabla_x \log g(x)$$

… as opposed to the product rule for the plain derivative:

$$\nabla_x \big(f(x)\,g(x)\big) = f(x)\,\nabla_x g(x) + g(x)\,\nabla_x f(x)$$

Link to original
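
A quick numerical sanity check of the trick with autograd (my own sketch, using $f(x) = \sigma(x)$ as an arbitrary positive, differentiable function):

```python
import torch

# Verify: grad f(x) == f(x) * grad log f(x)
x = torch.tensor(0.7, requires_grad=True)

(grad_f,) = torch.autograd.grad(torch.sigmoid(x), x)
(grad_log_f,) = torch.autograd.grad(torch.log(torch.sigmoid(x)), x)

assert torch.allclose(grad_f, torch.sigmoid(x).detach() * grad_log_f)
```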

So we can write:

$$\nabla_\theta P(\tau\mid\theta) = P(\tau\mid\theta)\,\nabla_\theta \log P(\tau\mid\theta)$$

For the log-probability of the trajectory, we can then turn the products into sums:

$$\log P(\tau\mid\theta) = \log \rho_0(s_0) + \sum_{t=0}^{T}\Big(\log P(s_{t+1}\mid s_t, a_t) + \log \pi_\theta(a_t\mid s_t)\Big)$$

The environment has no dependence on $\theta$, so gradients of $\rho_0(s_0)$, $P(s_{t+1}\mid s_t, a_t)$, and $R(\tau)$ are zero.
Why? The actions directly influence the environment, but the network-internal parameters do not influence the environment directly. Two parameter sets $\theta_1$ and $\theta_2$ can lead to the same action in some states, or even to the same policy. The environment only depends on its current state and the action it receives to generate the next state: $P(s_{t+1}\mid s_t, a_t)$.

The gradient of the log-probability of a trajectory is thus

$$\nabla_\theta \log P(\tau\mid\theta) = \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t\mid s_t)$$

Putting it together, we derive the following:

$$\begin{aligned}
\nabla_\theta J(\pi_\theta) &= \nabla_\theta \mathbb{E}_{\tau\sim\pi_\theta}[R(\tau)] \\
&= \nabla_\theta \int_\tau P(\tau\mid\theta)\, R(\tau) \\
&= \int_\tau \nabla_\theta P(\tau\mid\theta)\, R(\tau) \\
&= \int_\tau P(\tau\mid\theta)\,\nabla_\theta \log P(\tau\mid\theta)\, R(\tau) \\
&= \mathbb{E}_{\tau\sim\pi_\theta}\big[\nabla_\theta \log P(\tau\mid\theta)\, R(\tau)\big] \\
&= \mathbb{E}_{\tau\sim\pi_\theta}\Big[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t\mid s_t)\, R(\tau)\Big]
\end{aligned}$$

Derivation for Basic Policy Gradient

$\int_\tau$ means that we are integrating over all possible trajectories.
We estimate the expectation of the policy gradient with a sample mean over a set of trajectories $\mathcal{D} = \{\tau_i\}_{i=1,\dots,N}$, where each trajectory is obtained by letting the agent act in the environment using the policy $\pi_\theta$:

$$\hat{g} = \frac{1}{|\mathcal{D}|} \sum_{\tau\in\mathcal{D}} \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t\mid s_t)\, R(\tau)$$

where $|\mathcal{D}| = N$ is the number of trajectories.
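
In code, this estimator usually shows up as a "pseudo-loss" whose autograd gradient equals $-\hat{g}$, so that minimizing it performs gradient ascent on $J$. A minimal sketch in the spirit of the REINFORCE article below; the `policy` object and data layout are my own assumptions, reusing the `CategoricalPolicy` sketch above:

```python
import torch

def policy_gradient_loss(policy, trajectories):
    """Pseudo-loss whose gradient is -g_hat.

    trajectories: list of (obs, acts, ret) tuples, one per tau in D, where
    obs is a [T, obs_dim] tensor, acts a [T] tensor of actions, and
    ret the scalar return R(tau).
    """
    per_traj = []
    for obs, acts, ret in trajectories:
        log_probs = policy(obs).log_prob(acts)      # log pi_theta(a_t|s_t) for all t
        per_traj.append(-(log_probs.sum() * ret))   # -R(tau) * sum_t log pi_theta(a_t|s_t)
    return torch.stack(per_traj).mean()             # mean over the |D| trajectories
```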


The reward (objective) function is defined as:

$$J(\theta) = \sum_{s\in\mathcal{S}} d^\pi(s)\, V^\pi(s) = \sum_{s\in\mathcal{S}} d^\pi(s) \sum_{a\in\mathcal{A}} \pi_\theta(a\mid s)\, Q^\pi(s, a)$$

where $d^\pi(s)$ is the stationary distribution of the Markov chain for $\pi_\theta$ (the on-policy state distribution).

The expected finite-horizon undiscounted return of the policy is $J(\pi_\theta)$, and its gradient is

$$\nabla_\theta J(\pi_\theta) = \mathbb{E}_{\tau\sim\pi_\theta}\Big[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t\mid s_t)\, A^{\pi_\theta}(s_t, a_t)\Big]$$

$\tau$ … trajectory
$A^{\pi_\theta}$ … advantage function of the current policy
(the log simplifies the gradient calculation, as it turns the product of probabilities into a sum)

The weights are updated via stochastic gradient ascent:

$$\theta_{k+1} = \theta_k + \alpha_k\, \hat{g}_k$$
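
In practice this is implemented as a descent step on the negated objective with a standard optimizer (Adam here, as in the Spinning Up VPG implementation; `policy_gradient_loss` and `trajectories` are the hypothetical names from the sketch above):

```python
import torch

optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)

# One update theta_{k+1} = theta_k + alpha_k * g_hat_k,
# implemented as minimizing the pseudo-loss (whose gradient is -g_hat_k).
optimizer.zero_grad()
loss = policy_gradient_loss(policy, trajectories)
loss.backward()
optimizer.step()
```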

Policy gradient implementations typically compute advantage function estimates based on the infinite-horizon discounted return, despite otherwise using the finite-horizon undiscounted policy gradient formula.
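
For reference, the infinite-horizon discounted return used for those advantage estimates is typically computed backwards over each (finite) trajectory as a discounted reward-to-go; a small sketch (the helper name and $\gamma$ value are my choices):

```python
import torch

def discounted_rewards_to_go(rewards, gamma=0.99):
    """R_t = sum_{t' >= t} gamma^(t'-t) * r_t', computed backwards over one trajectory."""
    rtg = torch.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        rtg[t] = running
    return rtg

print(discounted_rewards_to_go([1.0, 1.0, 1.0], gamma=0.5))  # tensor([1.7500, 1.5000, 1.0000])
```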

References

https://spinningup.openai.com/en/latest/algorithms/vpg.html

https://medium.com/@sofeikov/reinforce-algorithm-reinforcement-learning-from-scratch-in-pytorch-41fcccafa107

reinforcement learning