transformer-xl

year: 2019
paper: https://arxiv.org/pdf/1901.02860
website:
code:
connections: transformer, long-context, linear attention

XL … extra long (context)

TLDR: Long input sequences are processed as segments. The newest segment is processed in parallel, while the keys and values of past segments are cached so the new segment can cross-attend to them. ¹

In the standard approach, as soon as you’re out of context (OOC), a) you need to re-process the entire context window if you shift by 1 token, b) you lose all context beyond the window, and c) chunks have no information about each other (context fragmentation):

Whereas TrXL keeps KV from past segments (context window chunks), processing the new segment with full attention, while letting it attend to a number of cached segments:

I like this visualization from the RMT paper better:

Each segment can look at the intermediate state from layer $n - 1$ of previous segments.
Since we progress in time, this creates a shifting effect, which allows for much longer effective context lengths, while still being efficient (no re-processing of past segments, no quadratic attention over all past tokens, no gradients, etc.). ²

$h_{τ + 1}^{(n - 1)} \in \mathbb{R}^{L \times d}$ … hidden state of the current (new) segment $τ + 1$ at layer $n - 1$, with segment length $L$
$h_{τ}^{(n - 1)} \in \mathbb{R}^{M \times d}$ … cached hidden state from previous segments at layer $n - 1$, with memory length $M$
$\tilde{h}_{τ + 1}^{(n - 1)} = [\text{stop-gradient}(h_{τ}^{(n - 1)}); h_{τ + 1}^{(n - 1)}] \in \mathbb{R}^{(M + L) \times d}$ … concatenation of both along the sequence dimension

Attention is then computed as usual, but with $\tilde{h}_{τ + 1}^{(n - 1)}$ as input for K and V (Q still comes from the new segment only):

$$
\begin{aligned}
Q_{τ + 1}^{(n)} &= h_{τ + 1}^{(n - 1)} W_{q}^{T} \in \mathbb{R}^{L \times d_{k}} \\
K_{τ + 1}^{(n)} &= \tilde{h}_{τ + 1}^{(n - 1)} W_{k}^{T} \in \mathbb{R}^{(M + L) \times d_{k}} \\
V_{τ + 1}^{(n)} &= \tilde{h}_{τ + 1}^{(n - 1)} W_{v}^{T} \in \mathbb{R}^{(M + L) \times d_{v}} \\
h_{τ + 1}^{(n)} &= \text{transformer-layer}(Q_{τ + 1}^{(n)}, K_{τ + 1}^{(n)}, V_{τ + 1}^{(n)}) \in \mathbb{R}^{L \times d}
\end{aligned}
$$

Note that the resulting attention matrix $\in \mathbb{R}^{L \times (M + L)}$ isn’t square.
Each token in the new segment can attend to all tokens in the memory + the new segment, but not vice versa.
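
To make the shapes concrete, here is a minimal NumPy sketch of one attention step over memory plus the new segment. It assumes a single head, scaled dot-product attention, random weights, and no masking or relative PEs; the function name `attend_with_memory` and all variable names are made up for illustration, and the residual/FFN parts of the layer are left out.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend_with_memory(h_new, h_mem, W_q, W_k, W_v):
    """Single-head attention of a new segment over [memory; new segment].

    h_new: (L, d) hidden states of the current segment at layer n-1
    h_mem: (M, d) cached hidden states of earlier segments at layer n-1
    Returns the (L, d_v) attention output and the (L, M+L) attention weights.
    """
    # In an autodiff framework h_mem would be wrapped in a stop-gradient here;
    # NumPy has no gradients, so only the concatenation remains.
    h_tilde = np.concatenate([h_mem, h_new], axis=0)  # (M+L, d)
    Q = h_new @ W_q.T    # (L, d_k), queries come from the new segment only
    K = h_tilde @ W_k.T  # (M+L, d_k)
    V = h_tilde @ W_v.T  # (M+L, d_v)
    scores = Q @ K.T / np.sqrt(Q.shape[-1])  # (L, M+L), not square
    weights = softmax(scores, axis=-1)
    return weights @ V, weights

rng = np.random.default_rng(0)
L, M, d, d_k, d_v = 3, 2, 8, 4, 4
h_new = rng.normal(size=(L, d))
h_mem = rng.normal(size=(M, d))
W_q = rng.normal(size=(d_k, d))
W_k = rng.normal(size=(d_k, d))
W_v = rng.normal(size=(d_v, d))
out, weights = attend_with_memory(h_new, h_mem, W_q, W_k, W_v)
print(out.shape, weights.shape)
```

The memory rows never produce queries, which is why the cached tokens cannot attend to the new segment.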

With causal masking, it looks like this for $L = 3$ , $M = 2$ :

$$
A_{masked} =
\begin{bmatrix}
a_{0, -2} & a_{0, -1} & a_{0, 0} & -\infty & -\infty \\
a_{1, -2} & a_{1, -1} & a_{1, 0} & a_{1, 1} & -\infty \\
a_{2, -2} & a_{2, -1} & a_{2, 0} & a_{2, 1} & a_{2, 2}
\end{bmatrix}
$$

(rows are queries from new segment, columns are keys from memory + new segment)
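
The mask pattern can be generated with a few lines of plain Python. This is only a sketch: the helper name `causal_mask_with_memory` is made up, and it assumes the memory occupies the first $M$ key columns, so query $i$ sits at absolute position $M + i$.

```python
def causal_mask_with_memory(L, M):
    """Return an L x (M+L) grid of booleans; True means the query may attend to the key."""
    # Key column c sits at absolute position c, query row i at position M + i.
    # A query may look at every key at or before its own position.
    return [[c <= M + i for c in range(M + L)] for i in range(L)]

mask = causal_mask_with_memory(L=3, M=2)
for row in mask:
    # "x" marks an allowed score, "." marks a position set to -inf.
    print(" ".join("x" if allowed else "." for allowed in row))
```

All memory columns are always visible, and only the upper-right triangle of the new-segment block is masked.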

$M$ is a hyperparameter trading off compute/memory and context length.
$M = 0$ is a standard transformer.
$M \to \infty$ is full attention over all past tokens (unbounded, lossless read-only memory). ³
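
A rough sketch of how the cache rolls forward for a given $M$, using token indices as stand-ins for the cached hidden states (in the real model the cache exists per layer). The generator name `segment_views` is hypothetical.

```python
def segment_views(T, L, M):
    """Yield (segment_tokens, memory_tokens) index lists for a length-T sequence."""
    memory = []
    for start in range(0, T, L):
        segment = list(range(start, min(start + L, T)))
        yield segment, list(memory)
        # Append the new segment to the cache and keep only the last M entries.
        # The guard is needed because a [-0:] slice would keep everything.
        memory = (memory + segment)[-M:] if M > 0 else []

for segment, memory in segment_views(T=10, L=3, M=4):
    print("memory", memory, "segment", segment)
```

With `M=0` the memory stays empty and every segment is processed in isolation; with a very large `M` nothing is ever evicted.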

Models trained with smaller $M$ can generalize to bigger $M$.
In the paper, they train with $M = L \approx 512$ and evaluate with $M$ on the order of 1000 to 4000, which still generalizes well.

Bigger $L$ → more parallelism and expressivity, but higher memory per step.
Smaller $L$ → more sequential chunking, less expressivity.
In the extreme case of $L = 1$ we generate one token at a time.

TrXL’s attention is linear with respect to the total sequence length $T$ (for fixed $M$ and $L$):
Compute: $O (T (M + L) d)$ vs. standard attention $O (T^{2} d)$
Memory: $O (T (M + L))$ vs. standard attention $O (T^{2})$
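
As a quick sanity check of these counts, the sketch below tallies query-key pairs per layer and head. The helper name `attention_score_counts` is made up, it assumes $T$ is a multiple of $L$, and it counts a full memory for every segment (so it is an upper bound for the first few segments).

```python
def attention_score_counts(T, L, M):
    """Count query-key pairs scored over a length-T sequence (per layer, per head)."""
    assert T % L == 0, "sketch assumes T is a multiple of L"
    segments = T // L
    # Each segment scores L queries against M + L keys.
    segmented = segments * L * (M + L)  # equals T * (M + L)
    # Full attention scores every token against every token (no causal savings counted).
    full = T * T
    return segmented, full

segmented, full = attention_score_counts(T=4096, L=512, M=512)
print(segmented, full, full / segmented)
```

For these example values the segmented count is a quarter of the full one, and the gap widens as $T$ grows while $M + L$ stays fixed.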

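$M + L$ is the attention span per layer, i.e. the number of tokens (memory plus current segment) each new token can attend to directly. Across layers, the shifting effect extends the effective context beyond that.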

Relative positional encodings are used because they generalize better when reusing past segments (absolute PEs wouldn’t make sense there anyway, since every segment would reuse the same position indices).
They introduce some new relative PE method but you can just use any.

Visualization of attn patterns by gemini https://ai.studio/apps/drive/1ssW9IcHrVQiVhtUeVvnuct6Dw-0T5Eh5?fullscreenApplet=true

¹ “segment-level recurrence” … in other words “processing the current segment based on previous segments”. That’s also why $τ$ is used to index segments, because like trajectories, it’s a sequence of tokens in each segment. Nothing to do with RNNs.
² In comparison to RMT, Yannik says TrXL only learns how to read memory, but that’s just not true? There’s no explicit, separate write mechanism, and due to stop-grad also no direct/long-range credit assignment to past segments, but over training the model has to shape its hidden states such that they are useful for future segments too, so it absolutely can learn to write useful information into the cached KV states.
³ But as context length increases, the attention scores get more diffuse, making it harder to focus on relevant tokens.

Max Wolf's Second Brain
