Crafter is a procedurally generated 2D survival game designed as a benchmark for reinforcement learning research. It was created as a simpler, more tractable alternative to Minecraft while retaining similar complexity in tech trees, resource gathering, and long-term planning.
Craftax: https://github.com/MichaelTMatthews/Craftax is a JAX version with a NetHack-like dungeon expansion.
Game Mechanics
Procedurally generated worlds with resources (wood, stone, coal, iron, diamonds)
Crafting system with hierarchical dependencies (e.g., need wood → wooden pickaxe → stone → stone pickaxe → iron)
Survival elements: health, food, water, enemies (zombies, skeletons)
Day/night cycle affecting gameplay dynamics
22 achievement unlocks used as evaluation metrics
Pixel vs. symbolic observations: most work trains on symbolic observations, since they are far faster and learning perception is often not the point.
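The 22 achievements are aggregated into a single benchmark score. Per the Crafter paper, the score is a geometric-mean-style aggregation of the per-achievement success rates (in percent): S = exp((1/N) Σ ln(1 + s_i)) − 1. A minimal sketch:

```python
import math

def crafter_score(success_rates_pct):
    """Crafter's aggregate score over per-achievement success rates
    (each in percent, 0-100). The log transform rewards unlocking rare
    achievements more than a plain average of success rates would."""
    n = len(success_rates_pct)
    return math.exp(sum(math.log(1 + s) for s in success_rates_pct) / n) - 1

# An agent that unlocks half of the 22 achievements every episode
# scores ~9, not 50: the geometric mean punishes missing achievements.
rates = [100.0] * 11 + [0.0] * 11
print(crafter_score(rates))
```

This is why the scores quoted below (PPO ~1.5, DreamerV3 ~10, SPRING ~17.8) are so compressed: each additional point requires consistently unlocking ever-rarer achievements.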
Crafter exposes fundamental challenges for traditional RL:
Sparse rewards - many actions needed before any achievement
Long-term credit assignment - crafting diamond tools requires ~20+ correct sequential decisions
Exploration vs. exploitation - must balance immediate survival with resource gathering
Compositional structure - understanding tool hierarchies and resource dependencies
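The compositional structure can be made concrete as a small dependency DAG. A sketch with illustrative item names (not the env's actual identifiers), showing why a long chain of sub-goals must be completed in order before a diamond is reachable:

```python
# Hypothetical encoding of the tool hierarchy as a dependency DAG.
# Names are illustrative, not Crafter's actual internal identifiers.
TECH_TREE = {
    "wood": [],
    "table": ["wood"],
    "wood_pickaxe": ["wood", "table"],
    "stone": ["wood_pickaxe"],
    "stone_pickaxe": ["stone", "table"],
    "iron": ["stone_pickaxe"],
    "furnace": ["stone"],
    "iron_pickaxe": ["iron", "furnace", "table"],
    "diamond": ["iron_pickaxe"],
}

def unlock_order(goal, tree=TECH_TREE, done=None):
    """Depth-first post-order walk: returns one valid sequence of
    sub-goals an agent must complete before obtaining `goal`."""
    done = [] if done is None else done
    for dep in tree[goal]:
        if dep not in done:
            unlock_order(dep, tree, done)
    done.append(goal)
    return done

print(unlock_order("diamond"))
```

Every item on that chain is itself only reachable after survival needs are met, which is exactly the credit-assignment and exploration problem described above: a random policy almost never stumbles through the full sequence.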
Crafter has become a showcase for how pre-trained world knowledge can outperform task-specific learning on complex planning tasks.
Traditional RL approaches struggle:
PPO: ~1.5 score after 1M environment steps
Rainbow DQN: ~2.0 score after 1M steps
DreamerV3: ~10 score after millions of steps (previous SOTA)

But using an LLM zero-shot, prompted with the LaTeX source of the Crafter paper:
SPRING (GPT-4): ~17.8 score with zero training
→ World knowledge and reasoning can substitute for millions of trial-and-error iterations. An LLM that has read the game documentation outperforms RL agents trained on millions of gameplay steps. Learning world models from scratch (especially for a single, specific environment) when general priors could be learned first is incredibly inefficient.
Strong baselines:
Details of an (unofficial) Transformer-XL PPO implementation on Craftax: https://github.com/Reytuag/transformerXL_PPO_JA
Model size: ~5M parameters

Transformer-XL configuration:
- EMBED_SIZE: 256
- num_layers: 2 transformer layers
- num_heads: 8 attention heads
- qkv_features: 256 (query/key/value dimension)
- hidden_layers: 256 (MLP hidden size for actor/critic heads)

Sequence length & memory:
- WINDOW_MEM: 128 (attends to last 128 steps)
- WINDOW_GRAD: 64 (gradient flows through 64 steps during training)
- NUM_STEPS: 128 (rollout length between updates)

PPO hyperparameters:
- LR: 2e-4 (with linear annealing)
- UPDATE_EPOCHS: 4
- NUM_MINIBATCHES: 8
- GAMMA: 0.999
- GAE_LAMBDA: 0.8
- CLIP_EPS: 0.2
- ENT_COEF: 0.002
- VF_COEF: 0.5
- MAX_GRAD_NORM: 1.0
- NUM_ENVS: 1024
- TOTAL_TIMESTEPS: 1e9

Observation processing:
- Uses CraftaxSymbolicEnvNoAutoReset (symbolic observations, not pixels)
- Observations flattened and fed directly to the transformer encoder (Dense layer)
- No CNN preprocessing - the symbolic state is already a feature vector
- Uses OptimisticResetVecEnvWrapper for efficient batched resets

Rewards:
- Standard Craftax episode rewards (no reward shaping)
- No intrinsic motivation (unlike baselines in the paper)
- No curriculum learning/UED (unlike baselines)

Results:
- Normalized return: 18.3% (vs. 15.3% for the PPO-RNN baseline @ 1e9 steps)
- Reached the 3rd level (The Sewer) - not achieved by any baseline
- High "enter gnomish mine" success rate (much higher than PPO-RNN even at 10e9 steps)
- Several advanced achievements unlocked that no baseline reaches
- At 4e9 steps: 20.6% normalized return

Training: 6h30 on a single A100 for 1e9 steps (1024 envs)
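Collecting the reported hyperparameters into one place also makes the update-size arithmetic they imply explicit. A sketch (the dict layout is mine; the values are as reported above):

```python
# Reported hyperparameters of the Transformer-XL PPO run, gathered into
# a single config dict, plus the batch arithmetic they imply.
config = {
    "EMBED_SIZE": 256, "NUM_LAYERS": 2, "NUM_HEADS": 8, "QKV_FEATURES": 256,
    "WINDOW_MEM": 128, "WINDOW_GRAD": 64, "NUM_STEPS": 128,
    "LR": 2e-4, "UPDATE_EPOCHS": 4, "NUM_MINIBATCHES": 8,
    "GAMMA": 0.999, "GAE_LAMBDA": 0.8, "CLIP_EPS": 0.2,
    "ENT_COEF": 0.002, "VF_COEF": 0.5, "MAX_GRAD_NORM": 1.0,
    "NUM_ENVS": 1024, "TOTAL_TIMESTEPS": 1e9,
}

# One PPO update consumes NUM_ENVS * NUM_STEPS transitions.
transitions_per_update = config["NUM_ENVS"] * config["NUM_STEPS"]      # 131,072
minibatch_size = transitions_per_update // config["NUM_MINIBATCHES"]   # 16,384
num_updates = int(config["TOTAL_TIMESTEPS"] // transitions_per_update) # 7,629

print(transitions_per_update, minibatch_size, num_updates)
```

Note the unusually high GAMMA = 0.999 (an effective horizon of roughly 1/(1-γ) = 1000 steps, matching Craftax's long episodes) paired with a low GAE_LAMBDA = 0.8, which keeps advantage estimates low-variance despite the long discount horizon.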