year: 2019
paper: https://arxiv.org/pdf/1910.06764
website:
code:
connections: TrXL, transformers in RL


Adds a gating mechanism to Transformer-XL and uses PreNorm (layer norm applied before each sublayer instead of after), which stabilizes training in RL settings.

The gating weighs the contribution of the residual connection $x$ against the sublayer output $y$:

$$g(x, y) = (1 - z) \odot x + z \odot \hat{h}$$

where $z = \sigma(W_z y + U_z x - b_g)$ is the update gate and $\hat{h} = \tanh(W_g y + U_g (r \odot x))$ is the candidate activation, with reset gate $r = \sigma(W_r y + U_r x)$.

This is a GRU-style gating mechanism, with the difference that the gate bias $b_g$ is initialized to a value greater than zero (the paper uses $b_g = 2$), which biases $z$ to be close to zero at the start of training (for random weight matrices paired with a sigmoid), s.t. initially the layer behaves like the identity function, $g(x, y) \approx x$, making it easier to optimize.
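A minimal sketch of this GRU-style gate in PyTorch (illustrative names; the per-matrix layout and the default $b_g = 2$ follow the description above, but this is not the authors' code):

```python
import torch
import torch.nn as nn

class GRUGate(nn.Module):
    """GRU-style gate combining the residual stream x with the sublayer output y."""

    def __init__(self, d_model: int, bg: float = 2.0):
        super().__init__()
        self.Wr = nn.Linear(d_model, d_model, bias=False)
        self.Ur = nn.Linear(d_model, d_model, bias=False)
        self.Wz = nn.Linear(d_model, d_model, bias=False)
        self.Uz = nn.Linear(d_model, d_model, bias=False)
        self.Wg = nn.Linear(d_model, d_model, bias=False)
        self.Ug = nn.Linear(d_model, d_model, bias=False)
        # b_g > 0 pushes z toward 0 at init, so the gate starts close to identity.
        self.bg = nn.Parameter(torch.full((d_model,), bg))

    def forward(self, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        r = torch.sigmoid(self.Wr(y) + self.Ur(x))          # reset gate
        z = torch.sigmoid(self.Wz(y) + self.Uz(x) - self.bg)  # update gate, biased small
        h = torch.tanh(self.Wg(y) + self.Ug(r * x))          # candidate activation
        return (1 - z) * x + z * h
```

In the GTrXL layer this would replace the plain residual addition `x + sublayer(norm(x))` with `gate(x, sublayer(norm(x)))`.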

This gate performed best against the other variants tested: a sigmoid gate on the input stream, a sigmoid gate on the output stream, sigmoid gates on both streams (highway connection), and a sigmoid gate on the output stream with a tanh applied before the elementwise multiplication.

Cleaner impl, copy-pasteable model: https://github.com/subho406/agalite/blob/main/src_pure/models/gtrxl.py


https://chatgpt.com/c/691a39e5-bde4-8332-b0c3-60238fa53f29