Computing the gradient of the objective J(θ) is tricky because it depends both on the action selection (directly determined by the policy π_θ) and on the stationary distribution of states visited while following π_θ (indirectly determined by π_θ). Since the environment is generally unknown and not differentiable, it is difficult to estimate how a policy update affects the state distribution.
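The way around a non-differentiable environment is the score-function (log-derivative) trick: the gradient can be estimated purely from sampled actions and rewards. A minimal sketch on a hypothetical two-armed bandit (the reward vector and θ values here are made up for illustration), comparing the sampled estimate against the analytic gradient:

```python
import math
import random

random.seed(0)

def softmax(theta):
    """Softmax policy over a discrete action set."""
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]

# Hypothetical bandit: arm 0 pays 1.0, arm 1 pays 0.0.
rewards = [1.0, 0.0]
theta = [0.5, -0.5]
pi = softmax(theta)

# Analytic gradient of J(theta) = sum_a pi(a) * r(a),
# using d pi_a / d theta_k = pi_a * (1{a=k} - pi_k).
analytic = [
    sum(rewards[a] * pi[a] * ((1.0 if a == k else 0.0) - pi[k]) for a in range(2))
    for k in range(2)
]

# Score-function estimate: average of r(a) * grad_theta log pi(a)
# over sampled actions -- no derivative of the "environment" needed.
n = 200_000
est = [0.0, 0.0]
for _ in range(n):
    a = 0 if random.random() < pi[0] else 1
    for k in range(2):
        grad_log = (1.0 if a == k else 0.0) - pi[k]  # grad of log softmax
        est[k] += rewards[a] * grad_log / n

print([round(g, 3) for g in analytic], [round(g, 3) for g in est])
```

The sampled estimate converges to the analytic gradient, which is exactly why sampling trajectories suffices even when the environment itself cannot be differentiated.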

Policy gradient theorem to the rescue:

Policy Gradient Theorem
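In the usual notation (following the reference below), with d^π the stationary state distribution under π_θ and Q^π the action-value function, the theorem states:

```latex
\nabla_\theta J(\theta)
  = \nabla_\theta \sum_{s \in \mathcal{S}} d^\pi(s) \sum_{a \in \mathcal{A}} Q^\pi(s, a)\, \pi_\theta(a \mid s)
  \propto \sum_{s \in \mathcal{S}} d^\pi(s) \sum_{a \in \mathcal{A}} Q^\pi(s, a)\, \nabla_\theta \pi_\theta(a \mid s)
```

The key payoff is that the right-hand side contains no derivative of d^π: it can be rewritten as an expectation E_π[Q^π(s, a) ∇_θ ln π_θ(a|s)] and estimated from sampled trajectories alone.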

References

https://lilianweng.github.io/posts/2018-04-08-policy-gradient/