- Learned prompt/prefix, aka memory tokens
- Feeding those processed tokens back: Recurrent Memory Transformer
- Extending memory by caching the previous segment's hidden states (reused as extra K/V): Transformer-XL
- RMT extended / applied to RL: Recurrent Action Transformer with Memory
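
A rough sketch of the two memory styles listed above, to make the difference concrete. This is a toy under loud assumptions: single-head attention with no learned projections, no masking, no layers, and NumPy instead of a real framework. The names `rmt_style_step`, `xl_style_step`, and `max_cache` are made up for illustration and do not come from either paper or any library.

```python
import numpy as np

def attention(q, k, v):
    """Single-head scaled dot-product attention (toy: no masking, no projections)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def rmt_style_step(segment, memory):
    """Memory-token recurrence: prepend memory tokens, process, feed them back.

    Assumption: the processed memory positions become the next memory.
    """
    x = np.concatenate([memory, segment], axis=0)
    y = attention(x, x, x)
    n_mem = memory.shape[0]
    return y[n_mem:], y[:n_mem]  # (segment outputs, updated memory)

def xl_style_step(segment, cache, max_cache):
    """Cache-based memory: current segment attends over cached past states too.

    Assumption: the cache is a plain FIFO of raw past states, capped at max_cache.
    """
    kv = np.concatenate([cache, segment], axis=0)
    out = attention(segment, kv, kv)
    new_cache = kv[-max_cache:] if max_cache > 0 else kv[:0]
    return out, new_cache

rng = np.random.default_rng(0)
d, seg_len, n_mem = 8, 4, 2
segments = rng.normal(size=(3, seg_len, d))

memory = np.zeros((n_mem, d))  # stand-in for learned memory embeddings
cache = np.zeros((0, d))       # empty cache before the first segment
for seg in segments:
    rmt_out, memory = rmt_style_step(seg, memory)
    xl_out, cache = xl_style_step(seg, cache, max_cache=6)
```

The thing to notice: the memory-token state stays a fixed size (`n_mem` rows) and is rewritten every segment, while the cache holds verbatim past states and grows until it hits the cap.
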
Things I want to look into and compare in more depth at some point:
Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention
https://github.com/lucidrains/memory-transformer-xl/tree/master (README description seems similar to Infini-attention on a skim)
https://github.com/lucidrains/memorizing-transformers-pytorch