Upper-Confidence-Bound Action Selection

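UCB selects actions according to

$$A_t = \operatorname*{argmax}_a \left[ Q_t(a) + c \sqrt{\frac{\ln t}{N_t(a)}} \right]$$

where $Q_t(a)$ is the current value estimate of action $a$, $N_t(a)$ is the number of times $a$ has been selected before time $t$, and $c > 0$ controls the degree of exploration.

The square-root term is the uncertainty term. Each time $a$ is selected, $N_t(a)$ increases, so its uncertainty (and the pull toward exploring it) decreases. Each time $a$ is not selected, $t$ increases while $N_t(a)$ stays fixed, so the uncertainty term grows and $a$ becomes more likely to be selected.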
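As a minimal sketch of the selection rule, the function below scores each action by its value estimate plus an exploration bonus and returns the best index. The name `ucb_select`, the list-based inputs, and the default `c=2.0` are illustrative assumptions, not part of the original note. Treating a never-selected action as automatically maximizing is the usual convention for avoiding division by zero.

```python
import math


def ucb_select(q_values, counts, t, c=2.0):
    """Return the index of the action with the highest upper confidence bound.

    q_values: current value estimate for each action
    counts:   number of times each action has been selected so far
    t:        current time step (1-based, so log(t) is defined)
    c:        exploration constant (assumed default, tune per problem)
    """
    # Convention: an action that has never been tried is selected first.
    for action, n in enumerate(counts):
        if n == 0:
            return action

    # Value estimate plus uncertainty bonus for every action.
    scores = [
        q + c * math.sqrt(math.log(t) / n)
        for q, n in zip(q_values, counts)
    ]
    return max(range(len(scores)), key=scores.__getitem__)
```

With `c=0` the rule reduces to greedy selection on the value estimates; larger `c` gives rarely selected actions a bigger advantage.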

Because the uncertainty term grows with the natural logarithm of the time step, its increases get smaller over time but are unbounded. All actions will therefore eventually be selected, but actions with lower value estimates, or that have already been selected frequently, will be selected with decreasing frequency over time.
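The slowing but unbounded growth can be checked numerically. The sketch below assumes the standard form of the bonus, `c * sqrt(ln(t) / n)`, and looks at an action that was selected once and then ignored; the helper name `exploration_bonus`, the chosen time steps, and `c=2.0` are hypothetical choices for illustration.

```python
import math


def exploration_bonus(t, n, c=2.0):
    """Uncertainty bonus for an action selected n times by time step t."""
    return c * math.sqrt(math.log(t) / n)


# An action selected once (n = 1) and then left alone while t keeps growing.
steps = [10, 100, 1000, 10000]
bonuses = [exploration_bonus(t, n=1) for t in steps]

# Growth in the bonus between consecutive checkpoints.
gains = [later - earlier for earlier, later in zip(bonuses, bonuses[1:])]

for t, bonus in zip(steps, bonuses):
    print(f"t={t:>6}  bonus={bonus:.3f}")
print("gains between checkpoints:", [round(g, 3) for g in gains])
```

The bonus keeps rising at every checkpoint, while each tenfold increase in the time step adds less than the previous one. A neglected action therefore always returns to contention eventually, just at longer and longer intervals.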
