year: 2022
paper: https://www.biorxiv.org/content/10.1101/671362v11.full.pdf
website:
code:
connections: curiosity, exploration vs exploitation
Takeaway
We have shown that a competitive union of pure curiosity with pure reward collection leads to a simple solution to the explore-exploit trade-off.
In practice, curiosity matches or exceeds reward-based strategies, especially when rewards are sparse, high-dimensional, or non-stationary.
It uniquely overcomes deception, and it is far more robust than the standard algorithms.
We have derived a measure of information value via learning progress, defined axiomatically so that it can be measured directly.
This measure also properly generalizes many prior efforts to formalize curiosity.
Curiosity axioms.
Should any one role for curiosity dominate the others? We suggest no. We therefore define a new metric of learning progress that is general, practical, simple, and compatible with the normal limits of experimental measurement. That is, we do not assume the learning algorithm or the target goal is known. We only assume we have a memory whose learning dynamics we can measure.
We reason that the value of any information depends entirely on how much memory changes in making an observation. We formalize this idea with the axioms that follow.
These axioms are useful in three ways. First, they give a measure of information value without needing to know the exact learning algorithm(s) in use. This is helpful because we rarely know in detail how an animal is learning. Second, if we want to measure a subject's intrinsic information value, we need only record differences in the dynamics of their memory circuits. Third, there is little need to decode or interpret.
Notation
$E$ … information value / Bayesian information gain ("IG").
$M$ … memory, a vector of size $n$ embedded in a real-valued space that is closed and bounded.
$X$ … observations.
$f$ … learning function mapping observations to memory: $M_{t+1} = f(X_t, M_t)$.
Axiom of Memory.
$E$ is continuous with continuous changes in memory, $\Delta M$, between $M_t$ and $M_{t+1}$.
Axiom of Specificity.
If all elements in $\Delta M$ are equal, then $E$ is minimized.
That is, information that is learned evenly across memory cannot induce any inductive bias; it is nonspecific, and therefore the least valuable learning possible.
Axiom of Scholarship.
Learning has an innately positive value, even if later on it has some negative consequences. Therefore, $E > 0$.
Axiom of Equilibrium.
Given the same observation $X$, learning will reach equilibrium. Therefore $E$ will decrease below some finite threshold $\eta$ in some finite time $T$.
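A minimal sketch of how these axioms can be made concrete, assuming a hypothetical count-based memory (the paper deliberately does not fix the learning algorithm): here $f$ is Bayesian count updating over discrete observations and $E$ is the KL divergence between the memory distribution before and after an observation.

```python
# Hypothetical instantiation of the axioms: count-based memory, KL-based E.
import numpy as np

def f(x, M):
    """Learning function: fold observation x into memory counts M."""
    M = M.copy()
    M[x] += 1.0
    return M

def information_value(M_old, M_new):
    """E: KL divergence between successive memory distributions."""
    p = M_new / M_new.sum()
    q = M_old / M_old.sum()
    return float(np.sum(p * np.log(p / q)))

M = np.ones(4)                 # uniform prior over 4 observation types
for t in range(10):
    M_next = f(0, M)           # the same observation, over and over
    E = information_value(M, M_next)
    print(f"t={t}  E={E:.4f}") # E > 0 (Scholarship), E shrinks (Equilibrium)
    M = M_next
```

Repeating one observation drives $E$ toward zero, matching the Axiom of Equilibrium, while any single observation still yields $E > 0$, matching the Axiom of Scholarship.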
Boredom
Boredom is implemented by $\eta$: stop exploration if, for any $X$, $E < \eta$.
$\eta$ can be dynamically adjusted to fit an environment.
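Continuing the hypothetical sketch above, boredom then becomes a simple stopping rule (the threshold value and memory size are assumptions):

```python
# Boredom as a stopping rule: explore until information value falls below eta.
import numpy as np

def info_value(M_old, M_new):
    p, q = M_new / M_new.sum(), M_old / M_old.sum()
    return float(np.sum(p * np.log(p / q)))

eta = 1e-3                                # boredom threshold (tunable)
M, t, E = np.ones(4), 0, float("inf")
while E >= eta:
    M_next = M.copy(); M_next[0] += 1.0   # the same observation, repeatedly
    E = info_value(M, M_next)
    M, t = M_next, t + 1
print(f"bored after {t} steps (E={E:.2e} < eta)")
```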
Conjectures for marrying reward and information collection:
Hypothesis of equal importance.
Reward and information collection are equally important for survival (or any other goal?).
The curiosity trick.
Curiosity is a sufficient solution for all exploration problems (where learning is possible).
(i.e. curiosity is enough to solve? open-ended environments / problems?)
The paper models this with two separate policy functions, one for the external reward and one for exploration. The policy is selected based on which of the last payouts was higher (the win-stay-lose-switch rule derived below).
Under the above conjectures, we model an animal that interacts with an environment and wishes to maximize both information value and reward value.
Bellman optimal solution for the meta-policy $\pi_\pi$, which chooses between $\pi_E$ and $\pi_R$:

$$V^*_{\pi_\pi}(s) = \max_{\pi \in \{\pi_E, \pi_R\}} V_\pi(s)$$

Substituting the payout functions $E$ and $R$ for the respective value terms to arrive at the Bellman optimal decision policy:

$$\pi_\pi(s) = \begin{cases} \pi_E & \text{if } V_E(s) > V_R(s) \\ \pi_R & \text{otherwise} \end{cases}$$
Issue: what if $V_E$ and $V_R$ are not readily available but must be estimated from the environment? → We wish to use value functions to decide between information and reward value, but we are simultaneously estimating those values. There is a range of methods to handle this (see the RL book; dynamic programming + optimal control), but we opted to simplify further in three ways:
First, we shifted from using the full value functions to using only the last payouts, $E_{t-1}$ and $R_{t-1}$.
Second, we removed all state-dependence, leaving only time-dependence.
Third, we included boredom, $\eta$, to control the duration of exploration.
Final form of the "win-stay-lose-switch" rule:

$$\pi_\pi(t) = \begin{cases} \pi_E & \text{if } E_{t-1} - \eta > R_{t-1} \\ \pi_R & \text{otherwise} \end{cases}$$
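A sketch of the scheduler under the reconstructed rule above (the comparison form is my reading of the text, not pulled verbatim from the paper): follow the exploration policy whenever the last information payout, discounted by boredom, beats the last reward payout.

```python
# Win-stay-lose-switch between a curiosity policy and a reward policy.
def select_policy(E_last: float, R_last: float, eta: float) -> str:
    """Pick which policy acts this step, based only on the last payouts."""
    return "pi_E" if E_last - eta > R_last else "pi_R"

print(select_policy(E_last=0.80, R_last=0.20, eta=0.1))  # pi_E: curiosity wins
print(select_policy(E_last=0.05, R_last=0.20, eta=0.1))  # pi_R: reward wins
```

Because each policy "stays" only while its own payouts keep winning, exploration naturally shuts off as learning equilibrates and $E$ sinks below boredom.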
In soup, we of course would not have separate policies for exploitation and exploration, but one that balances prediction, exploration, and environment values, normalized to be in the same range. Prediction refers to reward prediction (i.e., environment prediction loss / world modelling without explicit observation reconstruction), exploration refers to the information value of the observation, and environment refers to the external reward defined by the environment.
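A hypothetical sketch of that single-policy variant (the normalization bounds and equal weights are assumptions, not from the paper or the note above):

```python
# One scalar objective balancing three normalized signals.
import numpy as np

def normalize(x, lo, hi):
    """Map a raw signal into [0, 1] given assumed bounds."""
    return float(np.clip((x - lo) / (hi - lo + 1e-8), 0.0, 1.0))

def combined_value(pred_loss, info_value, env_reward, bounds):
    terms = (
        normalize(pred_loss, *bounds["prediction"]),
        normalize(info_value, *bounds["exploration"]),
        normalize(env_reward, *bounds["environment"]),
    )
    return sum(terms) / len(terms)   # equal weights, an assumption

bounds = {"prediction": (0, 5), "exploration": (0, 1), "environment": (0, 10)}
print(combined_value(pred_loss=1.2, info_value=0.4, env_reward=3.0, bounds=bounds))
```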
Experiments (conducted on bandits and a 2D foraging environment).
Curiosity alone ("E-explore (IG)") sometimes outperforms the other methods (reward+UCB, reward+entropy, reward+novelty, reward, reward+evidence bound, reward+IG), and in these experiments it is never worse than them.
It is the only method that is robust against deception, e.g., when one arm's reward decreases at first but later becomes much larger.
It is also the most robust in terms of hyperparameter sensitivity.
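To make the deception failure mode concrete, here is a toy sketch with assumed numbers: a two-armed bandit where arm 1 starts worse than arm 0 but later becomes much better. A purely greedy reward-seeker locks onto arm 0 early and never sees the switch; a curious agent that keeps sampling while information value stays above boredom would.

```python
# Deceptive bandit: the greedy reward agent never discovers the switch.
import numpy as np

def deceptive_reward(arm, t):
    if arm == 0:
        return 0.4                    # steady, mediocre payout
    return 0.1 if t < 50 else 0.9     # deceptive: bad early, great later

means, counts = np.zeros(2), np.zeros(2)
for t in range(200):
    # a few forced initial pulls, then pure greed on estimated means
    arm = t % 2 if t < 4 else int(np.argmax(means))
    r = deceptive_reward(arm, t)
    counts[arm] += 1
    means[arm] += (r - means[arm]) / counts[arm]
print(means)  # arm 1's estimate is stuck near 0.1, its pre-switch value
```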
Why curiosity?
→ Curiosity is a primary drive in most, if not all, animal behavior.
→ Curiosity is as strong, if not sometimes stronger, than the drive for reward.
→ Curiosity as an algorithm is highly effective at solving difficult optimization problems. The benefit is an optimal-value solution to the exploration-versus-exploitation trade-off, a solution which seems especially robust to model-environment mismatch. At the same time, curiosity-as-exploration can build a model of the environment useful for later planning, creativity, and imagination, while also building diverse action strategies.
Is this a sleight of hand, theoretically?
Yes. We have taken one problem that cannot be solved and replaced it with another related problem that can be. In this replacement we swap one behavior, extrinsic reward seeking, for another, curiosity.
+Lots of great other FAQs which I’m not pasting here.
But some TLDRs:
The axioms explain why humans consistently seek to learn falsehoods (fiction, conspiracy theories, stories of all kinds, …): under the axioms, learning them still carries positive value.
Information is more than the technical problem of transmitting accurately; it also involves the semantic problem of meaning and the effectiveness problem of using it to change behavior (Shannon's work addresses only the first).
What distinguishes our approach to information value is that we focus on memory dynamics and base value on some general axioms. We try to embody the idea of curiosity "as learning for learning's sake", not done for the sake of some particular goal, or for future utility.
Is information a reward?
If reward is any quantity that motivates behavior, then our definition of information value is a reward: an intrinsic reward. This does not mean that information value and environmental rewards are interchangeable, however. Rewards from the environment are a conserved resource; information is not. For example, if a rat shares a potato chip with a cage-mate, it must break the chip up, leaving less food for itself, while if a student shares an idea with a classmate, that idea is not divided up. It depends, in other words.
But isn’t curiosity impractical?
It does seem curiosity can lead as easily away from a needed solution as towards it. Consider children, who are the prototypical curious explorers. This is why we focus on the derivatives of memory, limit curiosity with boredom, and counter curiosity with a drive for reward collection (i.e., exploitation). Combined, these elements seek to limit curiosity without compromising it.
“Engineering vs. science”
“Punctuated equilibrium management strategy”: Phases of pure curious exploration vs. phases of pure market exploitation.
What is boredom, besides being a tunable parameter?
A more complete version of the theory would let us derive a useful or even optimal value for boredom, given a learning problem or environment. We cannot yet do this; it is the next problem we will work on, and it is important. Does this mean you are hypothesizing that boredom is actively tuned? Yes, we are predicting that. But can animals tune boredom? Geana and Daw showed this in a series of experiments, reporting that altering expectations of future learning in turn alters self-reports of boredom. Others have shown how boredom and self-control interact to drive exploration.
Insights from biology!
Cisek (2019) has traced the evolution of perception, cognition, and action circuits from the early Metazoans to the modern age. The circuits for reward exploitation and observation-driven exploration appear to have evolved separately and to act competitively, exactly the model we suggest. In particular, he notes that exploration circuits in early animals were closely tied to the primary sense organs (i.e., information) and had no input from the homeostatic circuits.
This neural separation into independent circuits has been observed in some animals, including zebrafish and monkeys.
The optimal policy to collect information value is not a sampling policy. It is a deterministic greedy maximization. In other words, to minimize uncertainty during exploration an animal should not introduce additional uncertainty by taking random actions.
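Continuing the count-based sketch, greedy information collection is a one-step-lookahead argmax, with no randomness (the lookahead model, in which action $a$ deterministically yields observation $a$, is an assumption for illustration):

```python
# Deterministic greedy maximization of expected information value.
import numpy as np

def expected_information_value(a, M):
    """Info gain if action a were to yield observation a (assumed model)."""
    M_next = M.copy(); M_next[a] += 1.0
    p, q = M_next / M_next.sum(), M / M.sum()
    return float(np.sum(p * np.log(p / q)))

M = np.array([9.0, 1.0, 1.0, 1.0])    # memory already saturated on obs 0
best = int(np.argmax([expected_information_value(a, M) for a in range(4)]))
print(best)  # a rarely-seen observation wins; no sampling involved
```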