year: 2021
paper: https://www.alphaxiv.org/overview/2109.08603v1
website: https://deepmind.com/research/publications/2021/Is-Curiosity-All-You-Need-On-the-Utility-of-Emergent-Behaviours-from-Curious-Exploration
code:
connections: curiosity, intrinsic motivation, reinforcement learning, exploration vs exploitation, catastrophic forgetting, open-ended
This paper asks whether pure curiosity - optimizing solely for prediction errors without any task rewards - can generate behaviors complex enough to be useful.
Answer: Yes, though the real value lies not in the final curious policy but in the diverse behaviors that emerge and disappear throughout training.
SelMo (Self-Motivated exploration)
An off-policy curiosity method where the reward is purely the prediction error of a world model:
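In symbols, a minimal sketch (my notation; the paper uses the world model's prediction error, but the exact distance metric here is an assumption):

```latex
r^{\text{curiosity}}_t = \left\lVert f_\theta(s_t, a_t) - s_{t+1} \right\rVert
```

where f_θ is the learned forward/world model predicting the next state.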
The reward function itself is non-stationary because the world model keeps improving.
The architecture decouples the world model from the policy learner.
While the world model continuously improves its predictions, the policy learns from a replay buffer containing trajectories labeled with slightly outdated curiosity rewards → behaviors that were once surprising might still be reinforced even after the model has learned them, providing some stability to the exploration process.
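A toy sketch of this decoupling (illustrative code, not the paper's; it assumes an L2 prediction-error reward and uses a linear world model with dummy dynamics as stand-ins). Transitions are labeled with the model's error at collection time and never relabeled, so the stored rewards grow slightly stale as the model improves:

```python
import numpy as np

rng = np.random.default_rng(0)
STATE_DIM, ACT_DIM = 4, 2

# Linear world model s' ~ W [s; a], fit by SGD (stand-in for a neural net).
W = np.zeros((STATE_DIM, STATE_DIM + ACT_DIM))

def predict(s, a):
    return W @ np.concatenate([s, a])

def curiosity_reward(s, a, s_next):
    # Intrinsic reward = prediction error of the *current* world model.
    return float(np.linalg.norm(predict(s, a) - s_next))

def model_update(batch, lr=0.01):
    global W
    for s, a, _, s_next in batch:
        x = np.concatenate([s, a])
        W += lr * np.outer(s_next - W @ x, x)

def env_step(s, a):
    # Dummy dynamics; the real setup uses a simulated robot.
    return 0.9 * s + 0.1 * np.concatenate([a, a])

replay, s = [], rng.normal(size=STATE_DIM)
for t in range(2000):
    a = rng.normal(size=ACT_DIM)            # stand-in for the curious policy
    s_next = env_step(s, a)
    r = curiosity_reward(s, a, s_next)      # labeled once, never relabeled
    replay.append((s, a, r, s_next))
    if t % 10 == 0 and len(replay) >= 32:
        idx = rng.choice(len(replay), size=32)
        model_update([replay[i] for i in idx])   # world model keeps improving
        # An off-policy learner (e.g. MPO) would also update the policy here,
        # training on the stored -- and by now slightly outdated -- rewards.
    s = s_next

print("mean intrinsic reward, first 100 steps:",
      round(np.mean([tr[2] for tr in replay[:100]]), 3))
print("mean intrinsic reward, last 100 steps: ",
      round(np.mean([tr[2] for tr in replay[-100:]]), 3))
```

Running it shows the stored intrinsic reward shrinking over time, which is exactly the non-stationarity noted above.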
Emergent behaviors
9-DOF (degrees of freedom) robotic arm: Pushing objects → lifting against walls → reliable grasping → moving cubes long distances → balancing cubes on edges → manipulating two cubes
20-DOF humanoid: Balance → stumbling → arm-assisted walking → safe falling/sitting → backward leaps → complex stretching poses
Death avoidance is a byproduct of maximizing cumulative curiosity: a terminated episode yields no further surprises to collect.
The constantly shifting nature of curiosity creates what most would consider a problem: catastrophic forgetting.
As the world model masters predicting certain behaviors, they become “boring” and the policy abandons them for novel experiences.
But this paper reframes this as a feature: these transient behaviors - emerging from pure exploration, not guided by any given task reward - turn out to be excellent stepping stones for solving actual tasks.
Utilizing emergent behaviors via RHPO
They use Regularized Hierarchical Policy Optimization to compose multiple policies hierarchically. The main policy learns to solve a specific task (lift red cube) while having access to frozen snapshots from curiosity training as auxiliary “skills” it can invoke.
A high-level policy decides which low-level policy (main task policy or one of 5 curiosity snapshots) to activate at each timestep. The curiosity snapshots remain frozen - they’re just behavioral primitives the system can use. When they randomly sample snapshots from the 10-20k episode range (where sustained lifting emerged), performance matches hand-designed auxiliary rewards in SAC-X.
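A structural sketch of this composition (names and dimensions are mine, and it omits RHPO's actual optimization machinery): frozen curiosity snapshots sit next to a learnable task policy, and a high-level chooser picks which one acts at each timestep.

```python
import numpy as np

rng = np.random.default_rng(0)

class FrozenSnapshot:
    """A curiosity-training checkpoint used as a fixed behavioral primitive."""
    def __init__(self, params):
        self.params = params                  # never updated
    def act(self, obs):
        return np.tanh(self.params @ obs)

class TaskPolicy:
    """The only low-level policy whose parameters are trained on the task reward."""
    def __init__(self, obs_dim, act_dim):
        self.params = 0.01 * rng.normal(size=(act_dim, obs_dim))
    def act(self, obs):
        return np.tanh(self.params @ obs)

class HighLevelPolicy:
    """Chooses which low-level policy to activate at each timestep.
    In the full method this choice is state-conditioned and trained on task
    reward; here it is a plain softmax over fixed logits for brevity."""
    def __init__(self, n_options):
        self.logits = np.zeros(n_options)
    def choose(self, obs):
        p = np.exp(self.logits - self.logits.max())
        return rng.choice(len(p), p=p / p.sum())

obs_dim, act_dim = 12, 9    # e.g. a 9-DOF arm; dimensions are illustrative
snapshots = [FrozenSnapshot(0.01 * rng.normal(size=(act_dim, obs_dim)))
             for _ in range(5)]               # e.g. from the 10-20k episode range
low_level = [TaskPolicy(obs_dim, act_dim)] + snapshots
scheduler = HighLevelPolicy(n_options=len(low_level))

obs = rng.normal(size=obs_dim)
option = scheduler.choose(obs)                # which policy acts this timestep
action = low_level[option].act(obs)           # task reward trains scheduler + task policy
```

The key design point is that only the scheduler and the task policy receive gradients; the snapshots are behavioral primitives, not things to fine-tune.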
Behaviors that curiosity abandoned (due to becoming predictable) might be exactly what you need for a specific task.
Snapshot timing matters: too early and you get unfocused exploration, too late and behaviors become overspecialized.
Instead of being a means to an end (better exploration for a specific task), curiosity-driven learning becomes a skill discovery process.
The agent essentially creates its own open-ended curriculum, progressing from simple to complex behaviors as its world model evolves.
By following prediction errors, the system naturally seeks out learnable-but-not-yet-learned behaviors. Curiosity-driven exploration tends to discover behaviors that build on each other - you can’t balance a cube on its edge until you’ve learned to grasp it reliably.
Curiosity as automatic curriculum learning
The world model’s improving predictions create a moving target that naturally guides the agent through a progression of skills. The system follows the frontier of what’s learnable given current capabilities - not random wandering but structured discovery where each skill enables the next.
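A toy illustration of this effect (my own, not from the paper): two "behaviors" with dynamics of different difficulty. As the world model masters the easy one, its prediction error - and hence its intrinsic reward - collapses, so a curiosity-greedy agent shifts to the harder one.

```python
import numpy as np

rng = np.random.default_rng(0)

# Ground-truth per-behavior dynamics: easy = linear, hard = nonlinear.
def easy(x): return 3.0 * x
def hard(x): return np.sin(3.0 * x)

# One scalar linear model per behavior (a deliberately weak learner,
# so the hard behavior stays "surprising" for longer).
w = {"easy": 0.0, "hard": 0.0}
choices = []
for step in range(1000):
    errors = {}
    for name, f in (("easy", easy), ("hard", hard)):
        x = rng.uniform(-1, 1)
        err = f(x) - w[name] * x
        errors[name] = abs(err)           # intrinsic reward for this behavior
        w[name] += 0.02 * err * x         # world model keeps improving
    choices.append(max(errors, key=errors.get))  # curiosity-greedy pick

print("preferred early:", max(set(choices[:50]), key=choices[:50].count))
print("preferred late: ", max(set(choices[-50:]), key=choices[-50:].count))
```

The preference flips from the easy behavior to the hard one as the model's errors on the easy dynamics vanish - a miniature version of the frontier-following curriculum.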
Rather than just a tool for better exploration during training, curiosity is a mechanism for discovering reusable skills.
The transient behaviors that emerge and disappear aren’t failures of the method - they’re potentially valuable capabilities worth identifying and preserving for future use.
Connection to OMNI-EPIC
Both approaches build repertoires of specialized behaviors: OMNI-EPIC explicitly generates task-specific agents via LLMs, while curiosity implicitly generates behavior-specific policies as the world model evolves. Each curiosity snapshot is essentially a “task-specific agent” where the task was “maximize surprise given the world model at time t”.
Following the meta-learning idea: Rather than just frozen snapshots, we could maintain shared meta-weights capturing general manipulation/locomotion capabilities while preserving each snapshot’s specialized context. The evolving world model acts as an implicit task generator, creating an endless curriculum of “interestingness” objectives.