Good solutions to all chapters: https://github.com/vojtamolda/reinforcement-learning-an-introduction/tree/main

Chapter 2

Constant exploration rates (ε-greedy):

  • No exploration (ε = 0, greedy only) rises most quickly in the very beginning, but then performs the worst.
  • Low exploration (ε = 0.01) has a slower learning trajectory, but in the end performs the best.
  • High exploration (ε = 0.1) learns considerably faster, but the total reward will not go as high as with lower exploration when run indefinitely.

Q_n — estimate of an action's value after it has been selected n − 1 times.
R_i — reward received after the i-th selection of this action.

  Q_n = (R_1 + R_2 + … + R_{n−1}) / (n − 1)

With some rearranging, this turns into:

  Q_{n+1} = Q_n + (1/n) [R_n − Q_n]

Which looks like this in the more general form:

  NewEstimate ← OldEstimate + StepSize [Target − OldEstimate]
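
A minimal Python sketch of this incremental update (the function and variable names are mine, not from the book):

  def update_estimate(q, r, n):
      # Incremental form of the sample average:
      # Q_{n+1} = Q_n + (1/n) * (R_n - Q_n)
      return q + (r - q) / n

  # Example: rewards 1.0, 0.0, 2.0 give the same result as averaging directly.
  q = 0.0
  for n, r in enumerate([1.0, 0.0, 2.0], start=1):
      q = update_estimate(q, r, n)
  print(q)  # 1.0 == (1.0 + 0.0 + 2.0) / 3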

Simple bandit algorithm

Initialize, for a = 1 to k:
    Q(a) ← 0
    N(a) ← 0

Loop forever:
    A ← argmax_a Q(a) with probability 1 − ε (breaking ties randomly),
        or a random action with probability ε
    R ← bandit(A)
    N(A) ← N(A) + 1
    Q(A) ← Q(A) + (1 / N(A)) [R − Q(A)]
For non-stationary problems, a constant step size α is desirable, as opposed to e.g. α_n = 1/n, n being the number of times the action has been chosen (which makes the estimate a sample average), since a constant α keeps new / changing information more relevant.
It doesn't matter much in practice whether the step-size sequence satisfies the theoretical convergence conditions: such sequences take ages to converge, or require considerable tuning effort to find a good/quick one. Fixed step sizes are very common in practice.
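
A minimal sketch of the constant step-size update; the value α = 0.1 is only illustrative:

  ALPHA = 0.1  # constant step size (illustrative value)

  def update_constant_alpha(q, r, alpha=ALPHA):
      # Q_{n+1} = Q_n + alpha * (R_n - Q_n)
      # The newest reward gets weight alpha; older rewards decay by a factor
      # of (1 - alpha) per step, so the estimate tracks a changing target.
      return q + alpha * (r - q)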

There are tricks, e.g. setting the initial reward estimates high for all actions (optimistic initial values), so that the agent explores a lot in the beginning, even if it only ever selects greedily. On stationary problems this works really well. But as soon as the optimal actions change a little, you're out of luck with this trick, as with all tricks / methods that focus on the initial values.

The beginning of time occurs only once, and thus we should not focus on it too much.
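
For illustration, optimistic initialization only changes how the estimates start out; the numbers below (Q = +5 on a problem whose rewards are roughly in [−1, 1]) are my own example, not from the book:

  k = 10
  Q = [5.0] * k   # deliberately too-high initial estimates ("optimistic")
  N = [0] * k
  # A purely greedy agent will still try every arm at least once: whichever
  # arm it picks first returns a reward far below 5.0, that estimate drops,
  # and some other (still optimistic) arm becomes the greedy choice next time.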

Upper-Confidence-Bound Action Selection

  A_t = argmax_a [ Q_t(a) + c √(ln t / N_t(a)) ]

c > 0 controls the degree of exploration.
The part with √(ln t / N_t(a)) is the uncertainty term. As the number of times an action is selected increases (N_t(a) grows), its uncertainty (and degree of exploration) decreases. But every time it doesn't get selected, the likelihood of it being selected increases, as ln t increases while N_t(a) stays the same.

The use of the natural logarithm means that the increases get smaller over time, but are unbounded; all actions will eventually be selected, but actions with lower value estimates, or that have already been selected frequently, will be selected with decreasing frequency over time.
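
A Python sketch of this selection rule; the value c = 2.0 and the handling of not-yet-tried actions are my own choices:

  import math

  def ucb_select(Q, N, t, c=2.0):
      # Actions never tried yet are treated as maximally uncertain and picked first.
      untried = [a for a in range(len(Q)) if N[a] == 0]
      if untried:
          return untried[0]
      # A_t = argmax_a [ Q(a) + c * sqrt(ln t / N(a)) ], t being the time step (>= 1).
      return max(range(len(Q)),
                 key=lambda a: Q[a] + c * math.sqrt(math.log(t) / N[a]))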

References

Richard S. Sutton & Andrew G. Barto, Reinforcement Learning: An Introduction (2nd ed.), MIT Press, 2018.