year: 2021
paper: https://www.alphaxiv.org/overview/2109.08603v1
website: https://deepmind.com/research/publications/2021/Is-Curiosity-All-You-Need-On-the-Utility-of-Emergent-Behaviours-from-Curious-Exploration
code:
connections: curiosity, intrinsic motivation, reinforcement learning, exploration vs exploitation, catastrophic forgetting, open-ended


This paper asks whether pure curiosity - optimizing solely for prediction errors without any task rewards - can generate behaviors complex enough to be useful.
Answer: Yes, but the real value isn’t in the final curious policy, but in the diverse behaviors that emerge and disappear throughout training.

SelMo (Self-Motivated exploration)

An off-policy curiosity method where the reward is purely the prediction error of a world model.
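A minimal sketch of this kind of reward, assuming the world model is a simple forward-dynamics predictor (PyTorch; `WorldModel` and `curiosity_reward` are illustrative names, not the paper's implementation):

```python
import torch
import torch.nn as nn

class WorldModel(nn.Module):
    """Forward-dynamics predictor: (s_t, a_t) -> predicted s_{t+1}."""
    def __init__(self, obs_dim, act_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, obs_dim),
        )

    def forward(self, obs, act):
        return self.net(torch.cat([obs, act], dim=-1))

def curiosity_reward(world_model, obs, act, next_obs):
    """The intrinsic reward is purely the world model's prediction error."""
    with torch.no_grad():
        pred_next = world_model(obs, act)
    return ((pred_next - next_obs) ** 2).mean(dim=-1)  # per-transition surprise
```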

The reward function itself is non-stationary because the world model keeps improving.

The architecture decouples the world model from the policy learner.
While the world model continuously improves its predictions, the policy learns from a replay buffer whose trajectories are labeled with slightly outdated curiosity rewards. Behaviors that were once surprising can therefore keep being reinforced even after the model has learned to predict them, which lends some stability to the exploration process.
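A schematic of that decoupling (synchronous here for brevity; SelMo runs data collection, world-model learning, and policy learning off-policy and in parallel). `env`, `policy`, and the two update functions are hypothetical placeholders; `curiosity_reward` is from the sketch above:

```python
import random

def collect(env, policy, world_model, buffer, steps=1000):
    """Label each transition with the curiosity reward computed from the world
    model *as it is at collection time*; the label is stored and never
    recomputed, so once-surprising behavior keeps being reinforced for a while
    even after the model has learned to predict it."""
    obs = env.reset()
    for _ in range(steps):
        act = policy.act(obs)
        next_obs = env.step(act)
        r = curiosity_reward(world_model, obs, act, next_obs)
        buffer.append((obs, act, r, next_obs))
        obs = next_obs

def training_iteration(env, policy, world_model, buffer):
    collect(env, policy, world_model, buffer)
    batch = random.sample(buffer, min(256, len(buffer)))
    update_world_model(world_model, batch)   # hypothetical: predictions keep improving
    update_policy_off_policy(policy, batch)  # hypothetical: trains on (possibly stale) curiosity labels
```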

Emergent behaviors

9-DOF (degrees of freedom) robotic arm: Pushing objects → lifting against walls → reliable grasping → moving cubes long distances → balancing cubes on edges → manipulating two cubes
20-DOF humanoid: Balance → stumbling → arm-assisted walking → safe falling/sitting → backward leaps → complex stretching poses

Death avoidance is a byproduct of maximizing cumulative curiosity: an episode that terminates early forfeits all future prediction-error reward, so staying "alive" is instrumentally useful for a curious agent.

The constantly shifting nature of curiosity creates what most would consider a problem: catastrophic forgetting.

As the world model masters predicting certain behaviors, they become “boring” and the policy abandons them for novel experiences.
But this paper reframes it as a feature: These transient behaviors - emerging from pure exploration, not directed toward any given task - turn out to be excellent stepping stones for solving actual tasks.

Instead of treating curiosity as a means to an end (better exploration for a specific task), curiosity-driven learning becomes a skill-discovery process in its own right.
The agent essentially creates its own open-ended curriculum, progressing from simple to complex behaviors as its world model evolves.
By following prediction errors, the system naturally seeks out learnable-but-not-yet-learned behaviors. Curiosity-driven exploration tends to discover behaviors that build on each other - you can’t balance a cube on its edge until you’ve learned to grasp it reliably.

The world model’s improving predictions create a moving target that naturally guides the agent through a progression of skills. The system follows the frontier of what’s learnable given current capabilities - not random wandering but structured discovery where each skill enables the next.

Rather than just a tool for better exploration during training, curiosity is a mechanism for discovering reusable skills.

The transient behaviors that emerge and disappear aren’t failures of the method - they’re potentially valuable capabilities worth identifying and preserving for future use.
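One simple way to act on this (an assumed preservation mechanism for illustration, not a procedure prescribed by the paper) is to periodically freeze the curious policy and keep the snapshots as a library of candidate skills for later reuse:

```python
import copy

skill_library = []  # frozen snapshots of the curious policy, indexed by training step

def maybe_snapshot(policy, step, every=100_000):
    """Periodically store a frozen copy of the policy; behaviors that later
    vanish from the live policy (once the world model masters them) remain
    available here as candidate skills for downstream tasks."""
    if step % every == 0:
        skill_library.append((step, copy.deepcopy(policy)))
```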

Connection to OMNI-EPIC

Both approaches build repertoires of specialized behaviors: OMNI-EPIC explicitly generates task-specific agents via LLMs, while curiosity implicitly generates behavior-specific policies as the world model evolves. Each curiosity snapshot is essentially a “task-specific agent” where the task was “maximize surprise given the world model at time t”.

Following the meta-learning idea: Rather than just frozen snapshots, we could maintain shared meta-weights capturing general manipulation/locomotion capabilities while preserving each snapshot’s specialized context. The evolving world model acts as an implicit task generator, creating an endless curriculum of “interestingness” objectives.