The optimal policy for collecting information value is not a sampling policy; it is deterministic greedy maximization. In other words, to minimize uncertainty during exploration, an animal should not introduce additional uncertainty by taking random actions.


Exploration & exploitation policies

The RL policy that seeks to maximize the discounted return in a given MDP is the exploitation policy. Exploration addresses the resulting bootstrap problem (the exploitation policy can only improve on the data it collects itself) by producing a second policy, which we call the exploration policy, to collect training data for the exploitation policy that is unlikely to be generated by simply following the exploitation policy. To find such data, the exploration policy typically either performs random actions in each state or seeks to maximize a measure of novelty.

In general, the exploration policy aims to maximize an intrinsic reward function, such as a novelty or information-gain bonus, in place of (or in addition to) the MDP's extrinsic reward.
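As a sketch, with notation assumed here rather than taken from the referenced papers: writing $r^{e}$ for the MDP's extrinsic reward and $r^{i}$ for an intrinsic reward such as a novelty bonus, the two policies optimize

$$
\pi_{\text{exploit}} = \arg\max_{\pi} \; \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, r^{e}(s_t, a_t)\right],
\qquad
\pi_{\text{explore}} = \arg\max_{\pi} \; \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, r^{i}(s_t, a_t)\right].
$$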

Beyond the shared motivation for seeking informative transitions, methods differ in how exploration is folded into training. One common tactic is to maintain a separate exploration policy, e.g. one that learns to maximize the future novelty of the exploitation policy.[^1]
In this setting, the exploration policy is typically called the behaviour policy, as it is solely responsible for data collection, while the exploitation policy is called the target policy. The target policy then trains on the transitions collected under the behaviour policy using importance sampling.
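As a minimal sketch of the importance-sampling step (a per-decision estimator under assumed per-step action probabilities, not the specific correction used in any of the referenced papers):

```python
def per_decision_is_return(rewards, target_probs, behaviour_probs, gamma=0.99):
    """Per-decision importance-sampling estimate of the target policy's
    discounted return from a trajectory collected by the behaviour policy.

    target_probs[t]    = pi_target(a_t | s_t)
    behaviour_probs[t] = pi_behaviour(a_t | s_t)
    """
    rho = 1.0        # cumulative likelihood ratio pi_target / pi_behaviour
    g = 0.0          # importance-weighted discounted return
    discount = 1.0
    for r, p_t, p_b in zip(rewards, target_probs, behaviour_probs):
        rho *= p_t / p_b          # reweight this step and everything after it
        g += discount * rho * r
        discount *= gamma
    return g


# Example: the behaviour policy explored an action the target policy favours,
# so the rewards that follow that action are up-weighted.
print(per_decision_is_return(rewards=[1.0, 0.0, 1.0],
                             target_probs=[0.9, 0.5, 0.8],
                             behaviour_probs=[0.3, 0.5, 0.4]))
```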

Another increasingly popular approach is to use a single policy that serves as both the exploration and exploitation policy. This single policy takes actions to maximize a weighted sum of extrinsic and intrinsic returns. As exploration reduces the uncertainty in most states of the MDP, the intrinsic reward tends to zero, yielding an approximately pure exploitation policy at convergence, though in general the intrinsic term may need to be annealed. (Embracing curiosity eliminates the exploration-exploitation dilemma argues against this approach, and for a complete separation between exploration and exploitation policies.)
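A minimal sketch of that weighting with an explicit annealing schedule (the schedule and hyperparameters are illustrative assumptions, not values from the papers above):

```python
def combined_reward(r_ext, r_int, step, beta_0=1.0, decay=1e-4):
    """Reward seen by a single exploration/exploitation policy.

    The intrinsic weight beta is annealed toward zero so that, late in
    training, the policy optimizes the extrinsic return almost exclusively.
    """
    beta = beta_0 / (1.0 + decay * step)   # simple hyperbolic annealing
    return r_ext + beta * r_int
```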

A related family of approaches, based on probability matching, samples actions according to the probability that each action is optimal. To encourage exploration, each action may receive some minimum amount of support, or support that increases with the estimated uncertainty of the transition that results from taking it.
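Probability matching is easiest to see in the bandit setting, where Thompson sampling draws a value for each action from its posterior and acts greedily with respect to the draw; the Beta-Bernoulli sketch below is a standard textbook example rather than a method from the referenced papers.

```python
import numpy as np

class BernoulliThompsonSampler:
    """Probability matching for a Bernoulli bandit: each action is selected
    with the probability that it is optimal under the current Beta posterior."""

    def __init__(self, n_actions, seed=0):
        self.rng = np.random.default_rng(seed)
        self.alpha = np.ones(n_actions)   # 1 + observed successes per action
        self.beta = np.ones(n_actions)    # 1 + observed failures per action

    def act(self):
        # Draw a plausible success probability for every action from its
        # posterior, then act greedily with respect to the sampled values.
        samples = self.rng.beta(self.alpha, self.beta)
        return int(np.argmax(samples))

    def update(self, action, reward):
        # reward is 0 or 1; update the Beta posterior of the chosen action.
        self.alpha[action] += reward
        self.beta[action] += 1 - reward
```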

Exploration can be reconceptualized as a search process over the space of MDP programs.

Such a view of exploration highlights direct correspondences between prioritized training methods in SL and exploration in RL, while radically expanding the domain of exploration to include the full space of possible training data.
Crucially, exploration remains the same process of finding the most informative datapoints for learning, regardless of the optimization procedure used, whether based on an SL or RL objective.
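One concrete version of this correspondence (an illustrative sketch, not an algorithm from the referenced papers) is loss-based prioritized sampling in SL, which plays the same role as novelty-seeking data collection in RL: both direct training toward the datapoints the learner currently finds most informative.

```python
import numpy as np

def prioritized_batch(losses, batch_size, alpha=0.6, rng=None):
    """Sample a minibatch of example indices with probability proportional
    to loss**alpha, analogous to prioritized experience replay in RL.

    losses: per-example losses from the most recent evaluation.
    alpha:  0 recovers uniform sampling; larger values concentrate training
            on high-loss (currently most informative) examples.
    """
    rng = rng or np.random.default_rng()
    priorities = np.asarray(losses, dtype=float) ** alpha + 1e-8
    probs = priorities / priorities.sum()
    return rng.choice(len(losses), size=batch_size, replace=False, p=probs)
```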

References

General intelligence requires rethinking exploration
Embracing curiosity eliminates the exploration-exploitation dilemma
Self-Supervised Exploration via Disagreement


Footnotes

[^1]: See Embracing curiosity eliminates the exploration-exploitation dilemma for a related paper, and Cisek 2019 for biological motivation.