Memory Tokens / Registers / Meta Tokens

These are all the same mechanism with different names:
Memory tokens: Memory Transformer (Burtsev et al., 2020)
Registers: Vision Transformers Need Registers (Darcet et al., 2024)
Meta tokens: Hymba (Nvidia, 2024)

The mechanism: prepend learnable vectors to the input sequence.
They participate in self-attention like any other position: they can attend to sequence tokens, and sequence tokens can attend to them.
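A minimal sketch of the mechanism in numpy, with identity projections and a single head for brevity (all names and sizes here are illustrative, not from any of the papers):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_mem, n_seq = 8, 4, 16

# In a real model these memory vectors are learned parameters;
# random stand-ins here.
memory_tokens = rng.standard_normal((n_mem, d_model))
token_embeds = rng.standard_normal((n_seq, d_model))  # embedded input sequence

# Prepend the memory tokens; they become ordinary positions in the sequence.
x = np.concatenate([memory_tokens, token_embeds], axis=0)  # (n_mem + n_seq, d)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

# Plain self-attention over the combined sequence: memory tokens appear
# on both the query and key axes, so attention flows in both directions.
attn = softmax(x @ x.T / np.sqrt(d_model))
out = attn @ x  # (n_mem + n_seq, d_model)
```

The only change relative to a vanilla transformer is the `concatenate`; nothing in the attention itself treats memory positions specially.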

In these three papers, memory tokens are stateless / discarded after each forward pass.

Vision Transformers Need Registers showed that without registers, ViTs repurpose low-information patch tokens (background, sky, …) as scratch space for global computation, creating artifacts in the attention maps. Even a single register is sufficient to remove these artifacts. While performance on dense prediction tasks (like segmentation) peaks at around 4 registers and then plateaus, ImageNet classification accuracy continues to improve slightly up to 16 registers. They ended up using 4.

The Recurrent Memory Transformer (RMT) feeds the processed memory tokens back as inputs to the next context segment: for sequences longer than the context window, split the sequence into segments, process each segment together with the memory tokens, then take the memory token outputs of one segment and use them as the memory token inputs for the next. Repeat.
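The segment recurrence can be sketched like this (the `process` function is a stand-in for the full transformer stack; shapes and sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_mem, seg_len, n_segments = 8, 4, 16, 3

W = rng.standard_normal((d, d)) / np.sqrt(d)

def process(x):
    """Placeholder transformer: mixes all positions, returns the same shape."""
    attn = x @ x.T / np.sqrt(d)
    attn = np.exp(attn - attn.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)
    return np.tanh(attn @ x @ W)

memory = rng.standard_normal((n_mem, d))  # initial (learned) memory tokens
long_sequence = rng.standard_normal((n_segments * seg_len, d))

for i in range(n_segments):
    segment = long_sequence[i * seg_len:(i + 1) * seg_len]
    out = process(np.concatenate([memory, segment], axis=0))
    # The memory-token outputs of segment i become the memory-token
    # inputs of segment i + 1 -- this is the only state carried across.
    memory = out[:n_mem]
```

Because only `memory` crosses segment boundaries, the context compressed into it is a fixed-size bottleneck regardless of total sequence length.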

RMT: Uses ~10–50 tokens for 512-token segments. That is ~2% to 10%.
Hymba: Uses 128 tokens for context windows of 2k–8k. That is ~1.5% to 6%.
Memory Transformer: Uses 10–20 tokens for shorter translation tasks.
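Sanity-checking the overhead percentages quoted above:

```python
# memory tokens as a fraction of the segment / context length
for name, mem, ctx in [
    ("RMT (low)", 10, 512),
    ("RMT (high)", 50, 512),
    ("Hymba (8k ctx)", 128, 8192),
    ("Hymba (2k ctx)", 128, 2048),
]:
    print(f"{name}: {100 * mem / ctx:.1f}%")
# RMT (low): 2.0%
# RMT (high): 9.8%
# Hymba (8k ctx): 1.6%
# Hymba (2k ctx): 6.2%
```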


Titans also uses classic memory tokens, termed persistent memory:

Titans frames memory as a three-component system analogous to human memory:
Short-term memory – standard attention over current context window
Persistent memory – learnable vectors concatenated to the input, exactly like memory tokens. They call it “persistent” because these parameters are fixed after training and stateless between forward passes (unlike RMT, where memory content changes across segments). Stores task-level knowledge.
Long-term memory – an MLP M trained at inference time to store key–value associations. Each token writes to memory by gradient-stepping on the loss ‖M(kₜ) − vₜ‖² (i.e., M learns to approximate M(kₜ) ≈ vₜ). History gets compressed into the MLP's weights rather than stored as an explicit KV cache.
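A toy version of the write rule, using a single linear map as the memory so the gradient can be written out by hand (the paper uses a small MLP, and the per-key normalisation here is my own choice for a stable step size, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
d, lr = 8, 0.5

M = np.zeros((d, d))  # memory as a linear map: M @ k should approximate v

# Stream of (key, value) pairs derived from incoming tokens (random stand-ins).
keys = rng.standard_normal((32, d))
values = rng.standard_normal((32, d))

for k, v in zip(keys, values):
    err = M @ k - v  # prediction error on this association
    # Gradient of ||M k - v||^2 wrt M is 2 * err * k^T; the 2 is folded
    # into lr, and the step is normalised by ||k||^2 (illustrative choice).
    M -= lr * np.outer(err, k) / (k @ k)

# Each write shrinks the error on the association just seen:
k_new, v_new = rng.standard_normal(d), rng.standard_normal(d)
before = np.linalg.norm(M @ k_new - v_new)
M -= lr * np.outer(M @ k_new - v_new, k_new) / (k_new @ k_new)
after = np.linalg.norm(M @ k_new - v_new)  # exactly (1 - lr) * before here
```

The point of the sketch: there is no KV cache anywhere; recall happens by running keys through the current weights of M.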



All papers in ctx: https://gemini.google.com/u/1/app/d729d43286cff096?pageId=none