Spinning Up: Intro to Policy Optimization | code
The goal of reinforcement learning is to find an optimal behavior strategy for the agent to obtain optimal rewards. Policy gradient methods model and optimize the policy directly. The policy is usually modeled as a parameterized function of $\theta$, written $\pi_\theta(a|s)$. The value of the reward (objective) function depends on this policy, and various algorithms can then be applied to optimize $\theta$ for the best reward.
- it is on-policy
- can be used for both discrete and continuous action spaces
We have a stochastic, parameterized policy $\pi_\theta$ and want to maximize the expected return $J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)]$, so we are going to optimize the policy parameters via gradient ascent:

$$\theta_{k+1} = \theta_k + \alpha \left.\nabla_\theta J(\pi_\theta)\right|_{\theta_k}$$
The $\left.\right|_{\theta_k}$ is just very explicit notation that the gradient is evaluated at the specific set of parameters $\theta_k$ at that point in time - I think.
$\nabla_\theta J(\pi_\theta)$, the gradient of the policy performance, is called the policy gradient.
Examples of policy gradient algorithms are VPG, TRPO, and PPO.
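As a concrete picture of what a stochastic, parameterized policy looks like in code, here is a minimal sketch assuming PyTorch and a discrete action space; the sizes (`obs_dim`, the hidden width, `n_actions`) are arbitrary choices for illustration, not from the text.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

# A small MLP mapping observations to action logits; theta = the network weights.
obs_dim, n_actions = 4, 2  # illustrative sizes only
policy_net = nn.Sequential(
    nn.Linear(obs_dim, 32), nn.Tanh(),
    nn.Linear(32, n_actions),
)

def get_policy(obs):
    # pi_theta(.|s): a distribution over actions, parameterized by the network
    return Categorical(logits=policy_net(obs))

obs = torch.randn(obs_dim)          # a fake observation
dist = get_policy(obs)
action = dist.sample()              # stochastic: sample an action
log_prob = dist.log_prob(action)    # log pi_theta(a|s), differentiable w.r.t. theta
print(action.item(), log_prob.item())
```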
derivation for basic policy gradient
The Probability of a Trajectory $\tau = (s_0, a_0, \ldots, s_{T+1})$, given that actions come from $\pi_\theta$, is

$$P(\tau|\theta) = \rho_0(s_0) \prod_{t=0}^{T} P(s_{t+1}|s_t, a_t)\, \pi_\theta(a_t|s_t)$$
($\rho_0$ is the start-state distribution, see Introduction to RL)
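To make the factorization concrete, here is a toy numerical sketch with a hand-made 2-state, 2-action MDP (every number is invented for illustration): the trajectory probability is the start-state probability times, at each step, the policy probability times the transition probability.

```python
import numpy as np

rho0 = np.array([0.8, 0.2])                 # start-state distribution rho_0(s)
# P[s, a, s'] : transition probabilities, made up for the example
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.3, 0.7]]])
# pi[s, a] : a fixed stochastic policy pi_theta(a|s)
pi = np.array([[0.6, 0.4],
               [0.1, 0.9]])

# a short trajectory tau = (s0, a0, s1, a1, s2)
states  = [0, 1, 0]
actions = [1, 1]

prob = rho0[states[0]]
for t, a in enumerate(actions):
    s, s_next = states[t], states[t + 1]
    prob *= pi[s, a] * P[s, a, s_next]      # pi_theta(a_t|s_t) * P(s_{t+1}|s_t, a_t)
print(prob)                                 # P(tau | theta)
```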
The Log-Derivative Trick:
The derivative of a function is the function times the derivative of its log:

$$\nabla_x f(x) = f(x)\, \nabla_x \log f(x)$$
This can be useful when you are dealing with products inside the derivative, as the log turns the products into a sum, e.g.:

$$\nabla_x \log\big(f(x)\, g(x)\big) = \nabla_x \log f(x) + \nabla_x \log g(x)$$

… as opposed to this:

$$\nabla_x \big(f(x)\, g(x)\big) = g(x)\, \nabla_x f(x) + f(x)\, \nabla_x g(x)$$
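A quick numerical sanity check of the trick, as a sketch: it uses a Gaussian density and finite differences, nothing RL-specific, and the particular numbers are arbitrary.

```python
import numpy as np

def gauss_pdf(x, theta):
    # N(x; mean=theta, std=1)
    return np.exp(-0.5 * (x - theta) ** 2) / np.sqrt(2 * np.pi)

x, theta, eps = 1.3, 0.4, 1e-6

# left-hand side: d/dtheta p(x; theta), via central finite differences
lhs = (gauss_pdf(x, theta + eps) - gauss_pdf(x, theta - eps)) / (2 * eps)

# right-hand side: p(x; theta) * d/dtheta log p(x; theta)
dlogp = (np.log(gauss_pdf(x, theta + eps)) - np.log(gauss_pdf(x, theta - eps))) / (2 * eps)
rhs = gauss_pdf(x, theta) * dlogp

print(lhs, rhs)   # the two agree up to finite-difference error
```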
So we can write

$$\nabla_\theta P(\tau|\theta) = P(\tau|\theta)\, \nabla_\theta \log P(\tau|\theta)$$
For the Log-Probability of the Trajectory, we can then turn the products into sums:

$$\log P(\tau|\theta) = \log \rho_0(s_0) + \sum_{t=0}^{T} \Big( \log P(s_{t+1}|s_t, a_t) + \log \pi_\theta(a_t|s_t) \Big)$$
The environment has no dependence on $\theta$, so the gradients of $\rho_0(s_0)$, $P(s_{t+1}|s_t, a_t)$, and $R(\tau)$ are zero.
Why? The actions directly influence the environment, but the network-internal parameters do not influence the environment directly. Two different parameter vectors $\theta_1$ and $\theta_2$ can lead to the same action in some states, or even to the same policy. The environment only depends on its current state and the action it receives to generate the next state: $P(s_{t+1}|s_t, a_t)$.
The gradient of the log-probability of a trajectory is thus

$$\nabla_\theta \log P(\tau|\theta) = \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t|s_t)$$
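For a tabular softmax policy each term has a closed form, $\nabla_{\text{logits}} \log \pi(a|s) = \text{onehot}(a) - \pi(\cdot|s)$, so the grad-log-prob of a trajectory is just a sum of such vectors. A toy sketch (states, actions, and logits are made up; note that the transition probabilities never appear):

```python
import numpy as np

n_states, n_actions = 2, 3
logits = np.random.randn(n_states, n_actions)      # theta: one logit per (s, a)

def pi(s):
    z = np.exp(logits[s] - logits[s].max())
    return z / z.sum()                              # softmax over actions of state s

def grad_log_pi(s, a):
    # d/d logits of log pi(a|s): onehot(a) - pi(.|s) in row s, zero elsewhere
    g = np.zeros_like(logits)
    g[s] = -pi(s)
    g[s, a] += 1.0
    return g

# grad_theta log P(tau|theta) = sum_t grad_theta log pi(a_t|s_t)
trajectory = [(0, 2), (1, 0), (0, 1)]               # (s_t, a_t) pairs, made up
grad_log_traj = sum(grad_log_pi(s, a) for s, a in trajectory)
print(grad_log_traj)
```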
Putting it together, we derive the following:
Derivation for Basic Policy Gradient

$$\begin{aligned}
\nabla_\theta J(\pi_\theta) &= \nabla_\theta \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)] \\
&= \nabla_\theta \int_\tau P(\tau|\theta)\, R(\tau) \\
&= \int_\tau \nabla_\theta P(\tau|\theta)\, R(\tau) \\
&= \int_\tau P(\tau|\theta)\, \nabla_\theta \log P(\tau|\theta)\, R(\tau) \\
&= \mathbb{E}_{\tau \sim \pi_\theta}\big[\nabla_\theta \log P(\tau|\theta)\, R(\tau)\big] \\
&= \mathbb{E}_{\tau \sim \pi_\theta}\Big[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t|s_t)\, R(\tau)\Big]
\end{aligned}$$

$\int_\tau$ means that we are integrating over all possible trajectories.
We estimate the expectation of the policy gradient with a set of trajectories $\mathcal{D} = \{\tau_i\}_{i=1,\ldots,N}$, where each trajectory is obtained by letting the agent act in the environment using the policy $\pi_\theta$:

$$\hat{g} = \frac{1}{|\mathcal{D}|} \sum_{\tau \in \mathcal{D}} \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t|s_t)\, R(\tau)$$

where $|\mathcal{D}| = N$ is the number of trajectories.
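In practice $\hat{g}$ is usually not assembled by hand; instead one builds a "pseudo-loss" whose gradient autograd can compute, in the spirit of Spinning Up's simplest policy gradient. A sketch with placeholder batch data standing in for collected trajectories (the network sizes and batch contents are invented):

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

obs_dim, n_actions = 4, 2                       # sizes made up for the sketch
policy_net = nn.Sequential(nn.Linear(obs_dim, 32), nn.Tanh(), nn.Linear(32, n_actions))
optimizer = torch.optim.Adam(policy_net.parameters(), lr=1e-2)

# placeholder batch data standing in for collected trajectories:
obs     = torch.randn(10, obs_dim)              # observations s_t
acts    = torch.randint(n_actions, (10,))       # actions a_t that were taken
returns = torch.randn(10)                       # R(tau) of the trajectory each step belongs to

logp = Categorical(logits=policy_net(obs)).log_prob(acts)   # log pi_theta(a_t|s_t)
pseudo_loss = -(logp * returns).mean()          # gradient of this equals -g_hat

optimizer.zero_grad()
pseudo_loss.backward()                          # autograd computes the sample policy gradient
optimizer.step()                                # one (approximate) gradient ascent step on J
```

The pseudo-loss is not a loss in the supervised-learning sense: its value does not measure performance, it is only constructed so that its gradient is the policy gradient estimate.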
The reward (objective) function is defined as:

$$J(\theta) = \sum_{s \in \mathcal{S}} d^{\pi}(s)\, V^{\pi}(s) = \sum_{s \in \mathcal{S}} d^{\pi}(s) \sum_{a \in \mathcal{A}} \pi_\theta(a|s)\, Q^{\pi}(s,a)$$

where $d^{\pi}(s)$ is the stationary distribution of the Markov chain for $\pi_\theta$ (the on-policy state distribution).
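For a small tabular MDP, the stationary distribution $d^{\pi}$ can be approximated by repeatedly applying the state-transition matrix induced by the policy; a toy power-iteration sketch (the transition and policy tables are invented numbers):

```python
import numpy as np

# P[s, a, s'] and pi[s, a]: a made-up 2-state, 2-action MDP and policy
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.3, 0.7]]])
pi = np.array([[0.6, 0.4],
               [0.1, 0.9]])

# Markov chain over states induced by pi: P_pi[s, s'] = sum_a pi(a|s) P(s'|s, a)
P_pi = np.einsum('sa,sax->sx', pi, P)

d = np.full(2, 0.5)                 # start from a uniform state distribution
for _ in range(1000):
    d = d @ P_pi                    # one step of the chain
print(d)                            # approximately the stationary distribution d^pi
```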
Let $J(\pi_\theta)$ denote the expected finite-horizon undiscounted return of the policy. Its gradient is

$$\nabla_\theta J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\Big[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t|s_t)\, A^{\pi_\theta}(s_t, a_t)\Big]$$

where $\tau$ is a trajectory and $A^{\pi_\theta}$ is the advantage function of the current policy.
(the log simplifies the gradient calculation, as it turns the product of probabilities into a sum)
The weights are updated via stochastic gradient ascent:

$$\theta_{k+1} = \theta_k + \alpha \nabla_\theta J(\pi_{\theta_k})$$
Policy gradient implementations typically compute advantage function estimates based on the infinite-horizon discounted return, despite otherwise using the finite-horizon undiscounted policy gradient formula.
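A common building block for such estimates is the discounted reward-to-go of each timestep; a minimal sketch ($\gamma = 0.99$ is a typical but arbitrary choice here, and this computes plain discounted returns, not a full advantage estimator):

```python
def discounted_rewards_to_go(rewards, gamma=0.99):
    # R_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ...
    out, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        out.append(running)
    return list(reversed(out))

print(discounted_rewards_to_go([1.0, 0.0, 2.0]))
# [1.0 + 0.99*0.0 + 0.99**2 * 2.0, 0.0 + 0.99*2.0, 2.0]
```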