year: 2023/09
paper: https://arxiv.org/pdf/2310.00166
website:
code: https://github.com/facebookresearch/motif/tree/main
connections: reward model, LLM, preference-based RL


Problem:
Reinforcement learning (RL) agents struggle in complex, open-ended environments, especially when rewards are sparse. They lack the common sense or prior knowledge humans use to evaluate actions and explore effectively. Manually providing this knowledge is difficult and doesn’t scale. While Large Language Models (LLMs) capture vast amounts of world knowledge, directly using them as policies or interfacing them with low-level agent sensors/actuators is challenging.

Proposed Solution: Motif
The paper proposes Motif (Intrinsic Motivation from Artificial Intelligence Feedback), a method to leverage LLM knowledge to guide RL agents without requiring the LLM to interact with the environment or act as a policy. Motif works in three phases:

  1. Dataset Annotation: An LLM is prompted to compare pairs of text captions describing events or observations from a pre-existing dataset of agent experiences (importantly, this dataset only needs observations, not actions or rewards). The LLM provides a preference, indicating which observation in the pair seems more aligned with progress or good gameplay, based on its internal knowledge and the provided prompt (a rough annotation sketch appears after this list).
  2. Reward Training: A separate reward model (a neural network) is trained to mimic the LLM’s preferences using standard techniques from preference-based RL. This distills the LLM’s high-level judgments into a scalar intrinsic reward function that operates on observations (specifically, their captions); a minimal training sketch also follows this list.
  3. RL Training: A standard RL agent is trained in the environment using the learned intrinsic reward signal, either alone or combined with the environment’s extrinsic reward.
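As a rough illustration of the annotation phase, the sketch below builds a pairwise comparison prompt from two event captions and parses the LLM's answer into a preference label. The `query_llm` helper, the prompt wording, and the label convention are placeholders for illustration, not the paper's exact template.

```python
# Sketch of the annotation phase (assumed helper and prompt wording,
# not the paper's exact template).

def build_comparison_prompt(caption_a: str, caption_b: str) -> str:
    """Ask the LLM which of two NetHack event captions shows better progress."""
    return (
        "You are an expert NetHack player. Two game messages are shown below.\n"
        f"Message 1: {caption_a}\n"
        f"Message 2: {caption_b}\n"
        "Which message suggests better progress toward winning the game? "
        "Answer with '1', '2', or 'unsure'."
    )

def annotate_pair(caption_a: str, caption_b: str, query_llm) -> int | None:
    """Return 0 if the LLM prefers caption_a, 1 for caption_b, None if unsure.

    `query_llm` is a placeholder for whatever chat-completion call is available
    (e.g., a local Llama 2 70B chat endpoint).
    """
    answer = query_llm(build_comparison_prompt(caption_a, caption_b)).strip().lower()
    if answer.startswith("1"):
        return 0
    if answer.startswith("2"):
        return 1
    return None  # tie / "don't know" answers can be kept or dropped downstream
```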
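The reward-training phase follows standard preference-based reward modeling: a small network scores each caption, the probability that one caption is preferred over the other is modeled as a softmax (Bradley-Terry) over the two scores, and the model is trained with cross-entropy against the LLM labels. The PyTorch sketch below assumes captions have already been encoded into fixed-size vectors; the encoder, dimensions, and hyperparameters are placeholders rather than the paper's actual architecture.

```python
import torch
import torch.nn as nn

class CaptionRewardModel(nn.Module):
    """Maps a caption embedding to a scalar reward (placeholder MLP)."""
    def __init__(self, embed_dim: int = 256, hidden: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, caption_embedding: torch.Tensor) -> torch.Tensor:
        return self.net(caption_embedding).squeeze(-1)

def preference_loss(model, emb_a, emb_b, labels):
    """Bradley-Terry cross-entropy: labels are 0 if caption A was preferred, 1 if B."""
    scores = torch.stack([model(emb_a), model(emb_b)], dim=-1)  # shape (batch, 2)
    return nn.functional.cross_entropy(scores, labels)

# One hypothetical training step on a batch of LLM-annotated pairs.
model = CaptionRewardModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
emb_a, emb_b = torch.randn(32, 256), torch.randn(32, 256)  # stand-in caption embeddings
labels = torch.randint(0, 2, (32,))                         # 0/1 preference labels from the LLM
loss = preference_loss(model, emb_a, emb_b, labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

Once trained, the reward model is evaluated on the caption of the agent's current observation at every step, so the LLM itself is never queried during RL training.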

Methodology & Evaluation:

  • Motif was evaluated in the challenging, procedurally generated NetHack Learning Environment (NLE).
  • Llama 2 (70B chat) was used as the preference annotator on a dataset of observations collected from a baseline RL agent.
  • The learned intrinsic reward was used to train a PPO-based agent (a minimal reward-wrapper sketch follows this list).
  • Compared against baselines: extrinsic reward only, RND (Random Network Distillation), E3B, NovelD.
  • Evaluated on score maximization and several sparse-reward tasks (staircase descent, finding the oracle).
  • Analysis included quantitative performance, qualitative behavior analysis, alignment with human intuition, scalability with LLM size, sensitivity to prompts, and steerability via prompt modification.
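To make the reward combination concrete, here is a minimal Gymnasium-style wrapper that adds the learned intrinsic reward to the environment's extrinsic reward. The `reward_model` and `caption_fn` interfaces and the weighting coefficient `alpha` are assumptions for illustration; the paper also post-processes the intrinsic reward (e.g., thresholding and count-based normalization within an episode), which is omitted here.

```python
import gymnasium as gym

class MotifStyleRewardWrapper(gym.Wrapper):
    """Adds a learned intrinsic reward to the extrinsic one (illustrative sketch).

    `reward_model` maps a caption string to a scalar score; `caption_fn` extracts
    the caption (e.g., the NetHack message) from an observation. Both are placeholders.
    """
    def __init__(self, env, reward_model, caption_fn, alpha: float = 0.1):
        super().__init__(env)
        self.reward_model = reward_model
        self.caption_fn = caption_fn
        self.alpha = alpha

    def step(self, action):
        obs, extrinsic, terminated, truncated, info = self.env.step(action)
        intrinsic = self.reward_model(self.caption_fn(obs))
        return obs, extrinsic + self.alpha * intrinsic, terminated, truncated, info
```

Setting `alpha` to zero recovers extrinsic-only training, while dropping the extrinsic term gives the intrinsic-only setting that, per the results below, already surpasses training on the score itself.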

Key Results & Findings:

  • Strong Performance: Motif significantly outperformed baselines, especially on sparse-reward tasks. It enabled progress on the extremely sparse oracle task where prior methods failed without demonstrations.
  • Intrinsic Reward Effectiveness: Surprisingly, training an agent only with Motif’s intrinsic reward led to a higher game score than training directly on the extrinsic score reward itself. Combining intrinsic and extrinsic rewards yielded the best results.
  • Human-Aligned & Anticipatory Behavior: The intrinsic reward encouraged more human-like, survival-oriented behaviors (e.g., less likely to kill its pet) and rewarded exploratory actions (like opening doors) that anticipate future progress, easing credit assignment for the RL agent.
  • Misalignment by Composition: On the oracle task, combining the intrinsic reward (which learned about hallucination effects) with the sparse extrinsic reward led to reward hacking – the agent learned to induce hallucinations to “see” the oracle rather than finding it properly.
  • Scalability & Steerability: Performance scaled positively with the size of the LLM used for annotation. Agent behavior could be effectively steered towards different goals (collecting gold, descending dungeons, fighting monsters) simply by modifying the natural language prompt given to the LLM during the annotation phase.
  • Robustness: Performance was reasonably robust to prompt variations (though sensitive on complex tasks) and reductions in dataset quality/diversity.

→ Motif relies on the principle that LLMs might be better evaluators than generators for complex sequential decision-making tasks.

TODO: delve into the details of how preference-based reward optimization works.