Types of policies

Deterministic policy: maps a state directly to an action, $a_t = \mu_\theta(s_t)$.

Stochastic policy: maps a state to a probability distribution over actions, from which the action is sampled, $a_t \sim \pi_\theta(\cdot \mid s_t)$.
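
A minimal PyTorch sketch of the contrast, assuming a small MLP `mu_net` and, for the stochastic case, a Gaussian with a fixed unit standard deviation (all names and sizes here are illustrative):

```python
import torch
import torch.nn as nn

obs_dim, act_dim = 4, 2  # assumed dimensions for illustration

# A small MLP standing in for the policy network.
mu_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, act_dim))

obs = torch.randn(obs_dim)

# Deterministic policy: the network output *is* the action.
a_det = mu_net(obs)

# Stochastic policy: the network output parameterizes a distribution,
# and the action is a sample from it (the unit std is an assumption here).
dist = torch.distributions.Normal(mu_net(obs), torch.ones(act_dim))
a_stoch = dist.sample()
```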

Two major types are categorical policies for discrete action spaces and diagonal Gaussian policies for continuous action spaces.

Categorical Policy

A categorical policy is just a classifier over discrete actions.
So, as usual, you have feature extraction → logits → softmax → sample an action.
Get the log-likelihood for an action $a$ at state $s$ by indexing into the log of the output probability vector: $\log \pi_\theta(a \mid s) = \log \left[ P_\theta(s) \right]_a$.
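
A minimal PyTorch sketch of a categorical policy, assuming a small MLP for the logits (names and sizes are illustrative; `torch.distributions.Categorical` handles the softmax and the log-probability indexing):

```python
import torch
import torch.nn as nn

obs_dim, n_actions = 4, 3  # assumed dimensions for illustration

# Feature extraction -> logits: the "classifier over actions".
logits_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, n_actions))

obs = torch.randn(obs_dim)
logits = logits_net(obs)

# Categorical applies the softmax internally, so raw logits can be passed in.
dist = torch.distributions.Categorical(logits=logits)
a = dist.sample()            # sample an action index
logp_a = dist.log_prob(a)    # log-likelihood of that action at this state
```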

Diagonal Gaussian Policy

Diagonal Gaussian policies map from observations to mean actions, $\mu_\theta(s)$.
There are two common ways to represent the covariance matrix:

A single parameter vector of log standard deviations, $\log \sigma$, which is not a function of state.

Network layers mapping from states to log standard deviations, $\log \sigma_\theta(s)$, which may share params with the mean network $\mu_\theta(s)$.

We can then just sample from the resulting distribution to get an action: $a = \mu_\theta(s) + \sigma_\theta(s) \odot z$, where $z \sim \mathcal{N}(0, I)$ and $\odot$ denotes elementwise multiplication.

Log standard deviations are used so we don’t have to constrain the ANN output to be nonnegative; we can simply exponentiate the log outputs to obtain $\sigma = \exp(\log \sigma)$, without losing anything.
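
A minimal PyTorch sketch of a diagonal Gaussian policy, assuming a small MLP for the mean and using the state-independent log-std variant (the state-dependent alternative is shown commented out; sizes and the initial log-std value are illustrative):

```python
import torch
import torch.nn as nn

obs_dim, act_dim = 4, 2  # assumed dimensions for illustration

# Mean network mu_theta(s).
mu_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, act_dim))

# Option 1: state-independent log stds, stored as a learned parameter vector.
log_std = nn.Parameter(-0.5 * torch.ones(act_dim))

# Option 2 (alternative): a head mapping states to log stds, which could
# share earlier layers with mu_net.
# log_std_net = nn.Linear(obs_dim, act_dim)

obs = torch.randn(obs_dim)
mu = mu_net(obs)
std = torch.exp(log_std)  # exponentiating guarantees a positive std

dist = torch.distributions.Normal(mu, std)
a = dist.sample()                  # equivalent to mu + std * z, with z ~ N(0, I)
logp_a = dist.log_prob(a).sum(-1)  # sum over action dims for a diagonal Gaussian
```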