year: 2022
paper: https://arxiv.org/abs/2207.06881 | https://arxiv.org/abs/2304.11062
website: Yannic Kilcher: Scaling Transformer to 1M tokens and beyond with RMT (Paper Explained)
code:
connections: memory token, long-context, transformer, RNN


The idea is to achieve infinite context length by making memory tokens recurrent:

Gradients flow through the recurrence during training (BPTT across segments), and the memory participates in full attention and is transformed by the network; all of this in contrast to Transformer-XL (TrXL), where past KVs are merely cached without gradient flow.1

For the encoder variant, RMT places memory tokens only at the beginning of each segment; the next segment’s memory is read from those token positions after the Transformer pass.
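A minimal PyTorch sketch of that encoder-style recurrence (all names here are my own; `backbone` is a stand-in for any Transformer encoder, not the actual RMT code):

```python
import torch

def rmt_encoder_pass(backbone, segments, mem_init):
    """One pass of the encoder-style RMT recurrence (sketch).
    backbone: any fn mapping (batch, seq, dim) -> (batch, seq, dim).
    segments: list of (batch, seg_len, dim) tensors.
    mem_init: (num_mem, dim) trainable tensor for the first segment."""
    num_mem = mem_init.shape[0]
    mem = mem_init.unsqueeze(0).expand(segments[0].shape[0], -1, -1)
    outputs = []
    for seg in segments:
        x = torch.cat([mem, seg], dim=1)   # memory tokens prepended only
        h = backbone(x)                    # memory joins full attention
        mem = h[:, :num_mem]               # updated memory -> next segment
        outputs.append(h[:, num_mem:])
    return torch.cat(outputs, dim=1), mem  # gradients span segments (BPTT)
```

Because `mem` is never detached inside the loop, a single `backward()` on the outputs propagates through every segment boundary, which is exactly the BPTT-across-segments property above.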
For decoder-only models, RMT places the same memory tokens at both ends of each segment:

Only the memory of the very first segment is initialized from fixed trainable vectors.
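The decoder-only layout can be sketched the same way (again a hypothetical sketch, assuming a causal `backbone`; the masking details of the real implementation are omitted):

```python
import torch
import torch.nn as nn

class RMTDecoderSketch(nn.Module):
    """Sketch of the decoder-only RMT layout: the same memory tokens sit at
    both ends of each segment; under a causal mask only the *end* copy can
    attend to the whole segment, so it is read out as the next segment's
    memory. `backbone` stands in for any causal Transformer."""

    def __init__(self, backbone, dim, num_mem=4):
        super().__init__()
        self.backbone = backbone
        # fixed trainable vectors initializing the very first segment's memory
        self.mem_init = nn.Parameter(torch.randn(num_mem, dim) * 0.02)
        self.num_mem = num_mem

    def forward(self, segments):  # segments: list of (batch, seg_len, dim)
        mem = self.mem_init.unsqueeze(0).expand(segments[0].shape[0], -1, -1)
        outputs = []
        for seg in segments:
            x = torch.cat([mem, seg, mem], dim=1)  # same memory at both ends
            h = self.backbone(x)
            mem = h[:, -self.num_mem:]             # end copy -> next segment
            outputs.append(h[:, self.num_mem:-self.num_mem])
        # full BPTT by default; detach mem periodically to truncate it
        return torch.cat(outputs, dim=1)
```

Reading the memory from the end positions is what lets it summarize the segment under a causal mask, whereas the front copy only provides the carried-over state for the segment's tokens to attend to.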

The scaling paper for RMT does technically run on 2 million tokens, but it uses synthetic “needle-in-a-haystack” tasks (memorize a fact, recall it up to ~1M tokens later). While impressive as a demonstration of the mechanism, this is very different from reasoning over a 1M-token codebase, and the setup is heavily optimized for algorithmic generalization (Copy, Reverse, Associative Retrieval). They prove the mechanism works.

Footnotes

  1. So in that sense it seems more of a bio-plausible mechanism / better suited for RL; or rather, it encourages the model to learn what to store in memory more directly and separates that from the rest of the processing. But TrXL caching and RMT recurrence can also be used together: extended lossless memory plus some learned compression of it seems like a good combo. Still, for reasoning agents (or just for practical purposes, similar to biological constraints), I would put more emphasis on learned memory. Titans has a mix of persistent frozen, lossy long-term, and normal (attention) short-term memory.