Computing the gradient of the objective J(θ) is tricky because it depends both on the action selection (directly determined by the policy π_θ) and on the stationary distribution of states visited while following π_θ (indirectly determined by π_θ). Since the environment is generally unknown and not differentiable, it is difficult to estimate how a policy update affects the state distribution.
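The way around a non-differentiable environment is the score-function (log-derivative) trick: the gradient can be estimated purely from sampled actions and rewards. A minimal sketch on a hypothetical two-armed bandit (the reward vector and θ values here are made up for illustration), comparing the sampled estimate against the analytic gradient:

```python
import math
import random

random.seed(0)

def softmax(theta):
    """Softmax policy over a discrete action set."""
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]

# Hypothetical bandit: arm 0 pays 1.0, arm 1 pays 0.0.
rewards = [1.0, 0.0]
theta = [0.5, -0.5]
pi = softmax(theta)

# Analytic gradient of J(theta) = sum_a pi(a) * r(a),
# using d pi_a / d theta_k = pi_a * (1{a=k} - pi_k).
analytic = [
    sum(rewards[a] * pi[a] * ((1.0 if a == k else 0.0) - pi[k]) for a in range(2))
    for k in range(2)
]

# Score-function estimate: average of r(a) * grad_theta log pi(a)
# over sampled actions -- no derivative of the "environment" needed.
n = 200_000
est = [0.0, 0.0]
for _ in range(n):
    a = 0 if random.random() < pi[0] else 1
    for k in range(2):
        grad_log = (1.0 if a == k else 0.0) - pi[k]  # grad of log softmax
        est[k] += rewards[a] * grad_log / n

print([round(g, 3) for g in analytic], [round(g, 3) for g in est])
```

The sampled estimate converges to the analytic gradient, which is exactly why sampling trajectories suffices even when the environment itself cannot be differentiated.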

Policy gradient theorem to the rescue:

Policy Gradient Theorem
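In the usual notation (following the reference below), with d^π the stationary state distribution under π_θ and Q^π the action-value function, the theorem states:

```latex
\nabla_\theta J(\theta)
  = \nabla_\theta \sum_{s \in \mathcal{S}} d^\pi(s) \sum_{a \in \mathcal{A}} Q^\pi(s, a)\, \pi_\theta(a \mid s)
  \propto \sum_{s \in \mathcal{S}} d^\pi(s) \sum_{a \in \mathcal{A}} Q^\pi(s, a)\, \nabla_\theta \pi_\theta(a \mid s)
```

The key payoff is that the right-hand side contains no derivative of d^π: it can be rewritten as an expectation E_π[Q^π(s, a) ∇_θ ln π_θ(a|s)] and estimated from sampled trajectories alone.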

References

https://lilianweng.github.io/posts/2018-04-08-policy-gradient/