Good solutions to all chapters: https://github.com/vojtamolda/reinforcement-learning-an-introduction/tree/main

Chapter 2

Constant exploration rates (ε-greedy):

  • No exploration (ε = 0, greedy only) rises most quickly in the very beginning, but then performs the worst.
  • Low exploration (ε = 0.01) has a slower learning trajectory, but in the end performs the best.
  • High exploration (ε = 0.1) learns considerably faster, but the total reward will not go as high as with lower exploration when run indefinitely.

Q_n — estimate of an action's value after it has been selected n − 1 times.
R_i — reward received after the i-th selection of this action.

  Q_n = (R_1 + R_2 + … + R_{n−1}) / (n − 1)

With some rearranging, this turns into:

  Q_{n+1} = Q_n + (1/n) [R_n − Q_n]

Which looks like this in the more general form:

  NewEstimate ← OldEstimate + StepSize [Target − OldEstimate]
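
A minimal Python sketch of this incremental update (the function and variable names are mine, not from the book):

  def update_estimate(q, r, n):
      # Incremental form of the sample average:
      # Q_{n+1} = Q_n + (1/n) * (R_n - Q_n)
      return q + (r - q) / n

  # Example: rewards 1.0, 0.0, 2.0 give the same result as averaging directly.
  q = 0.0
  for n, r in enumerate([1.0, 0.0, 2.0], start=1):
      q = update_estimate(q, r, n)
  print(q)  # 1.0 == (1.0 + 0.0 + 2.0) / 3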

Simple bandit algorithm

Initialize, for a = 1 to k:
    Q(a) ← 0
    N(a) ← 0

Loop forever:
    A ← argmax_a Q(a) with probability 1 − ε (breaking ties randomly),
        or a random action with probability ε
    R ← bandit(A)
    N(A) ← N(A) + 1
    Q(A) ← Q(A) + (1 / N(A)) [R − Q(A)]
For non-stationary problems, a constant step size α is desirable, as opposed to e.g. α_n = 1/n, n being the number of times the action has been chosen (which makes the estimate a sample average), since a constant α keeps new / changing information more relevant.
It doesn't matter much in practice whether the step-size sequence satisfies the theoretical convergence conditions: such sequences take ages to converge, or require considerable tuning effort to find a good/quick one. Fixed step sizes are very common in practice.
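
A minimal sketch of the constant step-size update; the value α = 0.1 is only illustrative:

  ALPHA = 0.1  # constant step size (illustrative value)

  def update_constant_alpha(q, r, alpha=ALPHA):
      # Q_{n+1} = Q_n + alpha * (R_n - Q_n)
      # The newest reward gets weight alpha; older rewards decay by a factor
      # of (1 - alpha) per step, so the estimate tracks a changing target.
      return q + alpha * (r - q)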

There are tricks, e.g. setting the initial reward estimates high for all actions (optimistic initial values), so that the agent explores a lot in the beginning, even if it only ever selects greedily. On stationary problems this works really well. But as soon as the optimal actions change a little, you're out of luck with this trick, as with all tricks / methods that focus on the initial values.

The beginning of time occurs only once, and thus we should not focus on it too much.
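
For illustration, optimistic initialization only changes how the estimates start out; the numbers below (Q = +5 on a problem whose rewards are roughly in [−1, 1]) are my own example, not from the book:

  k = 10
  Q = [5.0] * k   # deliberately too-high initial estimates ("optimistic")
  N = [0] * k
  # A purely greedy agent will still try every arm at least once: whichever
  # arm it picks first returns a reward far below 5.0, that estimate drops,
  # and some other (still optimistic) arm becomes the greedy choice next time.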

Upper-Confidence-Bound Action Selection

  A_t = argmax_a [ Q_t(a) + c √(ln t / N_t(a)) ]

c > 0 controls the degree of exploration.
The part with √(ln t / N_t(a)) is the uncertainty term. As the number of times an action is selected increases (N_t(a) grows), its uncertainty (and degree of exploration) decreases. But every time it doesn't get selected, the likelihood of it being selected increases, as ln t increases while N_t(a) stays the same.

The use of the natural logarithm means that the increases get smaller over time, but are unbounded; all actions will eventually be selected, but actions with lower value estimates, or that have already been selected frequently, will be selected with decreasing frequency over time.
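
A Python sketch of this selection rule; the value c = 2.0 and the handling of not-yet-tried actions are my own choices:

  import math

  def ucb_select(Q, N, t, c=2.0):
      # Actions never tried yet are treated as maximally uncertain and picked first.
      untried = [a for a in range(len(Q)) if N[a] == 0]
      if untried:
          return untried[0]
      # A_t = argmax_a [ Q(a) + c * sqrt(ln t / N(a)) ], t being the time step (>= 1).
      return max(range(len(Q)),
                 key=lambda a: Q[a] + c * math.sqrt(math.log(t) / N[a]))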

References

Richard S. Sutton & Andrew G. Barto, Reinforcement Learning: An Introduction (2nd ed.), MIT Press, 2018.