Trust Region Policy Optimization

TRPO is a policy gradient method in RL that keeps the policy parameters from changing too much from one update to the next, using the KL divergence between the old and the new policy as a constraint. Instead of directly maximizing rewards, we maximize a surrogate objective, defined below.
The objective function measures the expected advantage of the policy $\pi_\theta$ relative to a baseline, which in this case is the policy with the old parameters $\theta_\text{old}$. This baseline serves as a reference point for measuring improvement: we want to know whether our new policy performs better than our previous one. The objective can be written as:

$$J(\theta) = \sum_{s} \rho^{\pi_{\theta_\text{old}}}(s) \sum_{a} \pi_\theta(a \mid s)\, \hat{A}_{\theta_\text{old}}(s, a)$$

where $\rho^{\pi_{\theta_\text{old}}}$ is the state visitation distribution under the old policy. The estimated advantage $\hat{A}_{\theta_\text{old}}(s, a)$ measures how much better taking action $a$ in state $s$ is compared to the average value of that state.
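As a hypothetical illustration, one simple advantage estimator is the one-step TD error computed from a learned value function; practical TRPO implementations more often use generalized advantage estimation (GAE), but the idea is the same. A minimal PyTorch sketch with made-up tensor names:

```python
import torch

def one_step_advantage(rewards, values, next_values, dones, gamma=0.99):
    """One-step TD estimate of the advantage: A_hat = r + gamma * V(s') - V(s).

    A simple stand-in for the advantage estimator used in the objective above;
    `dones` masks out bootstrapping at episode boundaries. (Illustrative only:
    real implementations often use GAE.)
    """
    return rewards + gamma * next_values * (1.0 - dones) - values
```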
Since we collect data using a behavior policy $\pi_{\theta_\text{old}}(a \mid s)$ (the old policy that generated the trajectories) that differs from our target policy $\pi_\theta(a \mid s)$, we need to account for this mismatch. We can do this by introducing an importance sampling ratio:

$$J(\theta) = \sum_{s} \rho^{\pi_{\theta_\text{old}}}(s) \sum_{a} \pi_{\theta_\text{old}}(a \mid s)\, \frac{\pi_\theta(a \mid s)}{\pi_{\theta_\text{old}}(a \mid s)}\, \hat{A}_{\theta_\text{old}}(s, a)$$

We can express this as an expectation over state-action pairs sampled according to the behavior policy. Note that when we convert from the sum to an expectation, the terms $\rho^{\pi_{\theta_\text{old}}}(s)$ and $\pi_{\theta_\text{old}}(a \mid s)$ become implicit in our sampling distribution $s \sim \rho^{\pi_{\theta_\text{old}}},\, a \sim \pi_{\theta_\text{old}}$, leaving us with:

$$J(\theta) = \mathbb{E}_{s \sim \rho^{\pi_{\theta_\text{old}}},\, a \sim \pi_{\theta_\text{old}}} \left[ \frac{\pi_\theta(a \mid s)}{\pi_{\theta_\text{old}}(a \mid s)}\, \hat{A}_{\theta_\text{old}}(s, a) \right]$$
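Assuming per-sample log-probabilities and advantage estimates are already available, the Monte-Carlo estimate of this surrogate objective is just the mean of the importance ratio times the advantage. A minimal PyTorch sketch (function and argument names are illustrative):

```python
import torch

def surrogate_objective(new_logp, old_logp, advantages):
    """Monte-Carlo estimate of the importance-weighted surrogate J(theta).

    new_logp:   log pi_theta(a|s) for each sampled (s, a) pair
    old_logp:   log pi_theta_old(a|s) under the policy that collected the data
    advantages: estimated advantages A_hat(s, a)
    """
    ratio = torch.exp(new_logp - old_logp.detach())  # pi_theta(a|s) / pi_theta_old(a|s)
    return (ratio * advantages).mean()               # expectation over sampled pairs
```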
Even in an on-policy setting, where the current policy is used to collect trajectories, the actor policy can sometimes get out of sync with the policy being optimized, e.g., when rollouts are collected asynchronously and the data-generating policy is slightly stale, so the importance ratio still matters.
TRPO introduces a trust region constraint that keeps the distance between the old and the new policy within a parameter $\delta$:

$$\mathbb{E}_{s \sim \rho^{\pi_{\theta_\text{old}}}} \left[ D_\text{KL}\big(\pi_{\theta_\text{old}}(\cdot \mid s) \,\|\, \pi_\theta(\cdot \mid s)\big) \right] \leq \delta$$
We measure this distance using states sampled from the old policy’s distribution, since those represent the actual states where we have data to evaluate policy performance.
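With a discrete action space, for example, this constrained quantity is just the KL divergence between the two categorical action distributions, averaged over a batch of states sampled from the old policy. A possible PyTorch sketch (names are illustrative):

```python
import torch
from torch.distributions import Categorical, kl_divergence

def mean_kl(old_logits, new_logits):
    """Mean KL divergence D_KL(pi_old || pi_new) over a batch of sampled states.

    old_logits, new_logits: [batch_size, num_actions] action logits for the same
    states under the old policy and the candidate (new) policy.
    """
    old_dist = Categorical(logits=old_logits.detach())  # old policy is held fixed
    new_dist = Categorical(logits=new_logits)
    return kl_divergence(old_dist, new_dist).mean()
```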
As a second-order method (it approximates the KL divergence constraint by its Hessian, the Fisher information matrix), TRPO can be computationally expensive, which motivated the development of simpler first-order alternatives like PPO that achieve similar performance.
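To make the second-order machinery concrete, here is a sketch of the two ingredients typical TRPO implementations rely on: a Fisher-vector product computed as a Hessian-vector product of the mean KL (so the Fisher matrix is never formed explicitly), and conjugate gradient to approximately solve $Fx = g$ for the step direction. This is an illustrative PyTorch sketch under those assumptions, not the exact procedure from the paper (which additionally uses a backtracking line search):

```python
import torch

def fisher_vector_product(mean_kl, params, v, damping=0.1):
    """Compute (F + damping * I) @ v without materializing F, using the fact that
    the Fisher information matrix equals the Hessian of the mean KL at theta_old."""
    grads = torch.autograd.grad(mean_kl, params, create_graph=True)
    flat_grad = torch.cat([g.reshape(-1) for g in grads])
    grad_v = (flat_grad * v).sum()                       # grad(KL)^T v
    hvp = torch.autograd.grad(grad_v, params, retain_graph=True)
    flat_hvp = torch.cat([h.reshape(-1) for h in hvp])   # H v = F v
    return flat_hvp + damping * v

def conjugate_gradient(Avp, b, iters=10, tol=1e-10):
    """Approximately solve A x = b using only matrix-vector products Avp(x)."""
    x = torch.zeros_like(b)
    r = b.clone()               # residual b - A x (x starts at 0)
    p = r.clone()               # search direction
    rs_old = r @ r
    for _ in range(iters):
        Ap = Avp(p)
        alpha = rs_old / (p @ Ap)
        x = x + alpha * p
        r = r - alpha * Ap
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x
```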
The theory justifying TRPO actually suggests using a penalty instead of a constraint, i.e., solving the unconstrained optimization problem:

$$\max_\theta \; \mathbb{E}_{s \sim \rho^{\pi_{\theta_\text{old}}},\, a \sim \pi_{\theta_\text{old}}} \left[ \frac{\pi_\theta(a \mid s)}{\pi_{\theta_\text{old}}(a \mid s)}\, \hat{A}_{\theta_\text{old}}(s, a) - \beta\, D_\text{KL}\big(\pi_{\theta_\text{old}}(\cdot \mid s) \,\|\, \pi_\theta(\cdot \mid s)\big) \right]$$
for some coefficient β. This follows from the fact that a certain surrogate objective (which computes the max KL over states instead of the mean) forms a lower bound (i.e., a pessimistic bound) on the performance of the policy π. TRPO uses a hard constraint rather than a penalty because it is hard to choose a single value of β that performs well across different problems, or even within a single problem, where the characteristics change over the course of learning.
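For comparison, a sketch of what the penalized (unconstrained) objective could look like for a discrete-action policy, again in PyTorch with illustrative names (adaptive-KL variants additionally adjust β between updates based on the observed KL):

```python
import torch
from torch.distributions import Categorical, kl_divergence

def penalized_objective(new_logits, old_logits, actions, advantages, beta=1.0):
    """Surrogate objective with a KL penalty in place of the hard constraint."""
    new_dist = Categorical(logits=new_logits)
    old_dist = Categorical(logits=old_logits.detach())    # old policy is fixed
    # Importance ratio pi_theta(a|s) / pi_theta_old(a|s) for the sampled actions.
    ratio = torch.exp(new_dist.log_prob(actions) - old_dist.log_prob(actions))
    kl = kl_divergence(old_dist, new_dist)                # per-state KL(pi_old || pi_new)
    return (ratio * advantages - beta * kl).mean()        # maximize this quantity
```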
The surrogate objective (without the KL constraint) was first introduced in conservative policy iteration (Kakade & Langford, 2002).