Stabilizing Transformers for Reinforcement Learning

year: 2019
paper: https://arxiv.org/pdf/1910.06764
website:
code:
connections: TrXL, transformers in RL

Adds a gating mechanism to Transformer-XL and uses PreNorm (layer normalization applied to the input of each sublayer rather than its output), which stabilizes training in RL settings.

The gating weighs the contribution of the residual connection $x$ against the layer output $y$:

$$
\begin{aligned}
z &= \sigma(W_{z} y + U_{z} x - b_{g}) \\
r &= \sigma(W_{r} y + U_{r} x) \\
\tilde{h} &= \tanh(W_{g} y + U_{g} (r \odot x)) \\
\text{GatingLayer}(x, y) &= (1 - z) \odot x + z \odot \tilde{h}
\end{aligned}
$$

This is a GRU-style gating mechanism, with the difference that $b_{g}$ is initialized to $2$. With small random weight matrices, the sigmoid input for $z$ is then close to $-b_{g}$, which biases $z$ to be small at the start of training ($\sigma(-2) \approx 0.12$), s.t. initially the layer behaves approximately like the identity function, $\approx 1 \odot x + 0 \odot \tilde{h} = x$, making it easier to optimize.
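A minimal NumPy sketch of the gate can make the initialization argument concrete. The class name `GRUGate`, the `weight_scale` parameter, and the row-vector convention (`y @ W` instead of $W y$) are choices made for this illustration, not taken from the paper or the linked repositories.

```python
import numpy as np


def sigmoid(a):
    """Logistic function, applied elementwise."""
    return 1.0 / (1.0 + np.exp(-a))


class GRUGate:
    """GRU-style gate g(x, y): x is the residual stream, y the sublayer output."""

    def __init__(self, dim, bias_init=2.0, weight_scale=0.01, seed=0):
        rng = np.random.default_rng(seed)
        # Six square matrices, drawn small so the bias dominates at initialization.
        names = ("W_r", "U_r", "W_z", "U_z", "W_g", "U_g")
        self.params = {n: weight_scale * rng.standard_normal((dim, dim)) for n in names}
        # b_g > 0 pushes the update gate z towards 0, i.e. towards the skip path.
        self.b_g = np.full(dim, bias_init)

    def __call__(self, x, y):
        p = self.params
        z = sigmoid(y @ p["W_z"] + x @ p["U_z"] - self.b_g)    # update gate
        r = sigmoid(y @ p["W_r"] + x @ p["U_r"])               # reset gate
        h_tilde = np.tanh(y @ p["W_g"] + (r * x) @ p["U_g"])   # candidate state
        return (1.0 - z) * x + z * h_tilde


gate = GRUGate(dim=8)
rng = np.random.default_rng(1)
x = rng.standard_normal((4, 8))   # residual stream: batch of 4 vectors
y = rng.standard_normal((4, 8))   # stand-in for a sublayer output
out = gate(x, y)

# Value of z when the weight terms vanish entirely: sigmoid(-b_g).
z0 = sigmoid(-gate.b_g)
print("z with zero weights:", z0[0])
print("max |out - x|:", np.abs(out - x).max())
```

With $b_{g} = 2$ the gate does not start exactly at the identity: about 88% of $x$ passes through unchanged, and the remainder is mixed with $\tilde{h}$. A larger `bias_init` would bring the initial map closer to the identity.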

It performed best among the compared gating types, the others being: sigmoid on the input stream, sigmoid on the output stream, sigmoid on both streams (highway connection), and sigmoid on the output stream with a tanh before the elementwise multiplication.
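For reference, the alternative gates can be written out as follows. This is a transcription from memory of the paper's gating section, so the exact placement and sign of the bias terms should be checked against the original; $W_{g}$, $U_{g}$ and $b_{g}$ are separate learned parameters for each variant.

$$
\begin{aligned}
\text{Input:} \quad & g(x, y) = \sigma(W_{g} x) \odot x + y \\
\text{Output:} \quad & g(x, y) = x + \sigma(W_{g} x - b_{g}) \odot y \\
\text{Highway:} \quad & g(x, y) = \sigma(W_{g} x + b_{g}) \odot x + (1 - \sigma(W_{g} x + b_{g})) \odot y \\
\text{SigTanh:} \quad & g(x, y) = x + \sigma(W_{g} y - b_{g}) \odot \tanh(U_{g} y)
\end{aligned}
$$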

Details of an (unofficial) implementation on Craftax: https://github.com/Reytuag/transformerXL_PPO_JA

 Model Size: ~5M parameters
 Transformer-XL Configuration:
 - EMBED_SIZE: 256
 - num_layers: 2 transformer layers
 - num_heads: 8 attention heads
 - qkv_features: 256 (query/key/value dimension)
 - hidden_layers: 256 (MLP hidden size for actor/critic heads)
 Sequence Length & Memory:
 - WINDOW_MEM: 128 (attends to last 128 steps)
 - WINDOW_GRAD: 64 (gradient flows through 64 steps during training)
 - NUM_STEPS: 128 (rollout length between updates)
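The interplay of the memory window and the gradient window can be sketched as follows. This illustrates the generic Transformer-XL idea (a rolling cache of past hidden states plus a causal mask over the current segment) and is not the linked repository's actual logic; `update_memory` and `attention_mask` are hypothetical helper names, and in a JAX implementation the cached memory would typically be wrapped in a stop-gradient.

```python
import numpy as np

WINDOW_MEM = 128   # cached past steps visible to attention (from the config above)
WINDOW_GRAD = 64   # steps per training segment that receive gradients
EMBED_SIZE = 256


def update_memory(memory, new_states, window_mem=WINDOW_MEM):
    """Append new per-step hidden states and keep only the most recent window_mem."""
    combined = np.concatenate([memory, new_states], axis=0)
    return combined[-window_mem:]


def attention_mask(num_queries, num_memory):
    """Boolean mask of shape (num_queries, num_memory + num_queries).

    Every query may attend to all cached memory slots and, causally, to
    positions at the same or an earlier index within the current segment.
    """
    mem_part = np.ones((num_queries, num_memory), dtype=bool)
    causal_part = np.tril(np.ones((num_queries, num_queries), dtype=bool))
    return np.concatenate([mem_part, causal_part], axis=1)


memory = np.zeros((WINDOW_MEM, EMBED_SIZE))
segment = np.ones((WINDOW_GRAD, EMBED_SIZE))   # stand-in for new hidden states
memory = update_memory(memory, segment)
mask = attention_mask(WINDOW_GRAD, WINDOW_MEM)
print(memory.shape, mask.shape)
```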

 PPO Hyperparameters
 - LR: 2e-4 (with linear annealing)
 - UPDATE_EPOCHS: 4
 - NUM_MINIBATCHES: 8
 - GAMMA: 0.999
 - GAE_LAMBDA: 0.8
 - CLIP_EPS: 0.2
 - ENT_COEF: 0.002
 - VF_COEF: 0.5
 - MAX_GRAD_NORM: 1.0
 - NUM_ENVS: 1024
 - TOTAL_TIMESTEPS: 1e9
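The batch sizes implied by these numbers can be worked out directly. The sketch assumes the common convention that each rollout of `NUM_ENVS * NUM_STEPS` transitions is split evenly into `NUM_MINIBATCHES` parts and that the learning rate decays linearly to zero over all updates; the linked implementation may split or schedule differently. `linear_lr` is a hypothetical helper name.

```python
NUM_ENVS = 1024
NUM_STEPS = 128
NUM_MINIBATCHES = 8
UPDATE_EPOCHS = 4
TOTAL_TIMESTEPS = int(1e9)
LR = 2e-4

# Transitions collected between two PPO updates.
transitions_per_update = NUM_ENVS * NUM_STEPS
# Transitions per gradient step, assuming an even split of the rollout.
minibatch_size = transitions_per_update // NUM_MINIBATCHES
# Number of collect-then-update cycles over the whole run.
num_updates = TOTAL_TIMESTEPS // transitions_per_update
# Gradient steps taken on each rollout.
grad_steps_per_update = NUM_MINIBATCHES * UPDATE_EPOCHS


def linear_lr(update_idx):
    """Learning rate after update_idx completed updates, decaying linearly to 0."""
    frac = 1.0 - update_idx / num_updates
    return LR * frac


print(transitions_per_update, minibatch_size, num_updates, grad_steps_per_update)
print(linear_lr(0), linear_lr(num_updates // 2))
```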

 Observation Processing
 - Uses CraftaxSymbolicEnvNoAutoReset (symbolic observations, not pixels)
 - Observations flattened and fed directly to transformer encoder (Dense layer)
 - No CNN preprocessing - symbolic state is already a feature vector
 - Uses OptimisticResetVecEnvWrapper for efficient batched resets
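The observation path described above amounts to a flatten followed by a single linear projection. A minimal sketch, where `OBS_DIM` is a placeholder value chosen for the example (not the real Craftax observation size) and `encode` is a hypothetical name:

```python
import numpy as np

EMBED_SIZE = 256
OBS_DIM = 512      # hypothetical placeholder, not the real Craftax observation size
NUM_ENVS = 4       # small batch for illustration (the run above uses 1024)

rng = np.random.default_rng(0)
# Dense layer parameters: one weight matrix and one bias vector.
W = rng.standard_normal((OBS_DIM, EMBED_SIZE)) / np.sqrt(OBS_DIM)
b = np.zeros(EMBED_SIZE)


def encode(obs):
    """Flatten each observation and project it linearly to EMBED_SIZE features."""
    flat = obs.reshape(obs.shape[0], -1)
    return flat @ W + b


# Any per-env shape works as long as its dimensions multiply to OBS_DIM.
obs = rng.standard_normal((NUM_ENVS, 16, 32))
tokens = encode(obs)
print(tokens.shape)
```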
 
 Rewards
 
 - Standard Craftax episode rewards (no reward shaping)
 - No intrinsic motivation (unlike baselines in paper)
 - No curriculum learning/UED (unlike baselines)
 
 Results
 
 Normalized Return: 18.3% (vs 15.3% for PPO-RNN baseline @ 1e9 steps)
 
 Key achievements:
 - Reached 3rd level (The Sewer) - not achieved by any baseline
 - High "enter gnomish mine" success rate (much higher than PPO-RNN even at 10e9 steps)
 - Several advanced achievements unlocked - not reached by any baseline
 - At 4e9 steps: 20.6% normalized return
 
 Training: 6h30 on single A100 for 1e9 steps (1024 envs)
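As a quick sanity check on these figures, the throughput and the gain over the baseline follow from simple arithmetic on the numbers quoted above:

```python
TOTAL_STEPS = 1e9
WALL_CLOCK_SECONDS = 6 * 3600 + 30 * 60   # 6h30

steps_per_second = TOTAL_STEPS / WALL_CLOCK_SECONDS

trxl_return = 18.3      # normalized return (%), transformer, 1e9 steps
ppo_rnn_return = 15.3   # normalized return (%), PPO-RNN baseline, 1e9 steps
absolute_gain = trxl_return - ppo_rnn_return
relative_gain = absolute_gain / ppo_rnn_return

print(f"{steps_per_second:,.0f} env steps/s")
print(f"+{absolute_gain:.1f} points, {relative_gain:.1%} relative")
```

That is roughly 43k environment steps per second on the single A100, and a gain of 3.0 percentage points (about 20% relative) over PPO-RNN at the same step budget.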

Cleaner impl, copy-pasteable model: https://github.com/subho406/agalite/blob/main/src_pure/models/gtrxl.py

https://chatgpt.com/c/691a39e5-bde4-8332-b0c3-60238fa53f29

Max Wolf's Second Brain
