upper confidence bound

Exploration by optimism: act greedily with respect to the value estimate plus an uncertainty bonus.

$$A_{t} \doteq \underset{a}{\arg\max} \left[ Q_{t}(a) + c \sqrt{\frac{\ln t}{N_{t}(a)}} \right]$$

$Q_{t} (a)$ … current value estimate of action $a$
$N_{t} (a)$ … number of times $a$ has been selected so far
$c$ … controls the degree of exploration

The square-root term measures the uncertainty in $a$'s value estimate. Each time $a$ is selected, $N_{t} (a)$ grows, so its uncertainty (and with it the exploration bonus) shrinks. On each step where $a$ is not selected, $t$ grows while $N_{t} (a)$ stays put, so its bonus slowly rises until $a$ gets tried again.
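A minimal sketch of the selection rule in Python. The function name `ucb_action`, the list-based arguments, and the default `c=2.0` are illustrative choices rather than anything fixed by the formula. The sketch also assumes that an action with $N_{t} (a) = 0$ counts as maximally uncertain and is picked first, which avoids dividing by zero.

```python
import math

def ucb_action(q, n, t, c=2.0):
    """Pick the action maximising q[a] + c * sqrt(ln t / n[a]).

    q: current value estimates, one per action
    n: selection counts, one per action
    t: current time step (t >= 1)
    c: exploration coefficient (assumed default, tune per problem)
    """
    # Assumption: untried actions are maximally uncertain, so try them first.
    for a, count in enumerate(n):
        if count == 0:
            return a

    # Value estimate plus uncertainty bonus for every action.
    scores = [q[a] + c * math.sqrt(math.log(t) / n[a]) for a in range(len(q))]

    # Greedy with respect to the optimistic score (first index wins ties).
    return max(range(len(q)), key=scores.__getitem__)
```

With `c=0` the bonus vanishes and the rule reduces to plain greedy selection; a larger `c` gives rarely tried actions more weight.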

The use of the natural logarithm means that the increases get smaller over time, but are unbounded; all actions will eventually be selected, but actions with lower value estimates, or that have already been selected frequently, will be selected with decreasing frequency over time.
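To see this behaviour in numbers, here is a rough simulation on a small bandit. Everything about the setup is an assumption made for illustration: the helper name `run_ucb`, the three arm means, Gaussian reward noise, sample-average value updates, and trying each arm once before applying the formula.

```python
import math
import random

def run_ucb(means, steps, c=1.0, noise=0.1, seed=0):
    """Run UCB on a toy bandit and return how often each arm was selected."""
    rng = random.Random(seed)
    k = len(means)
    q = [0.0] * k   # value estimates Q_t(a)
    n = [0] * k     # selection counts N_t(a)

    for t in range(1, steps + 1):
        if 0 in n:
            # Assumption: try every arm once before using the bonus formula.
            a = n.index(0)
        else:
            a = max(
                range(k),
                key=lambda i: q[i] + c * math.sqrt(math.log(t) / n[i]),
            )

        # Hypothetical reward model: true mean plus Gaussian noise.
        reward = means[a] + noise * rng.gauss(0.0, 1.0)

        # Sample-average update of the value estimate.
        n[a] += 1
        q[a] += (reward - q[a]) / n[a]

    return n

counts = run_ucb([0.2, 0.5, 0.8], steps=1000)
print(counts)  # selection counts per arm
```

Every arm ends up with a nonzero count, and the low-value arms keep being revisited as $\ln t$ grows, but at ever longer intervals, so most selections go to the arm with the highest estimate.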

exploration vs exploitation
