The advantage function $A^\pi(s, a)$ measures how much better taking action $a$ in state $s$ is compared to the average value of that state. Specifically, it is the difference between the Q-value $Q^\pi(s, a)$ (the expected return starting from $s$, taking action $a$) and the state-value $V^\pi(s)$ (the expected return from state $s$):

$$A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)$$
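This definition can be checked numerically. The sketch below uses hypothetical Q-values and action probabilities for a single state with three actions; the key facts are that $V^\pi(s)$ is the policy-weighted average of the Q-values, and the advantage is the deviation from that average.

```python
import numpy as np

# Hypothetical Q-values for one state with three actions, under some policy pi.
q_values = np.array([1.0, 2.0, 4.0])
# Probabilities pi assigns to each action in this state (assumed for illustration).
pi = np.array([0.2, 0.5, 0.3])

# V(s) is the expectation of Q(s, a) over the policy's action distribution.
v = np.dot(pi, q_values)       # 0.2*1.0 + 0.5*2.0 + 0.3*4.0 = 2.4

# A(s, a) = Q(s, a) - V(s): positive for better-than-average actions.
advantages = q_values - v      # [-1.4, -0.4, 1.6]

print(v)           # 2.4
print(advantages)  # [-1.4 -0.4  1.6]
```

Note that the policy-weighted advantages sum to zero, $\mathbb{E}_{a \sim \pi}[A^\pi(s, a)] = 0$: the advantage is centered around the policy's average behavior by construction.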

Theoretically, one could take any baseline $b(s_t)$ that does not depend on the current action $a_t$ being taken (instead of $V^\pi(s_t)$).

Choosing the advantage function for $\Psi_t$ yields almost the lowest possible variance, though in practice, the advantage function is not known and must be estimated. This statement can be intuitively justified by the following interpretation of the policy gradient: a step in the policy gradient direction should increase the probability of better-than-average actions and decrease the probability of worse-than-average actions. The advantage function, by its definition, measures whether an action is better or worse than the policy's default behavior. Hence, we should choose $\Psi_t$ to be the advantage function $A^\pi(s_t, a_t)$, so that the gradient term $\Psi_t \, \nabla_\theta \log \pi_\theta(a_t \mid s_t)$ points in the direction of increased $\pi_\theta(a_t \mid s_t)$ if and only if $A^\pi(s_t, a_t) > 0$.
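The sign behavior described above can be demonstrated directly. The sketch below uses a toy softmax policy over three actions (all parameters and advantage estimates are assumed for illustration); for a softmax policy, $\nabla_\theta \log \pi_\theta(a)$ has the closed form $\mathbf{1}_a - \pi_\theta$, so one gradient step scaled by the advantage raises the probability of a positive-advantage action and lowers that of a negative-advantage one.

```python
import numpy as np

# Toy softmax policy over 3 actions; logits theta are hypothetical.
theta = np.array([0.1, 0.0, -0.1])

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def grad_log_pi(theta, a):
    # Gradient of log pi(a) w.r.t. softmax logits: one_hot(a) - pi.
    pi = softmax(theta)
    one_hot = np.zeros_like(theta)
    one_hot[a] = 1.0
    return one_hot - pi

advantages = np.array([-1.4, -0.4, 1.6])  # assumed advantage estimates
lr = 0.5

# Gradient step for the positive-advantage action (a=2): its probability rises.
a = 2
before = softmax(theta)[a]
after = softmax(theta + lr * advantages[a] * grad_log_pi(theta, a))[a]
assert after > before

# The same update for a negative-advantage action (a=0) lowers its probability.
a = 0
before = softmax(theta)[a]
after = softmax(theta + lr * advantages[a] * grad_log_pi(theta, a))[a]
assert after < before
```

This is exactly the "increase better-than-average, decrease worse-than-average" interpretation: the update direction is $A \cdot \nabla_\theta \log \pi_\theta(a \mid s)$, so the sign of the advantage controls whether the action's probability goes up or down.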
