year: 05/2023
paper: https://arxiv.org/pdf/2305.13245.pdf
website: https://paperswithcode.com/method/grouped-query-attention
code: used by Mistral (e.g., Mistral 7B)
connections: MQA


Grouped-query attention (GQA) is an interpolation between multi-query and multi-head attention that achieves quality close to multi-head attention at a speed comparable to multi-query attention.

The motivation is to reduce the number of key and value heads to save on KV cache size.
The extreme version of a single K/V head shared by all query heads (multi-query attention) doesn't make that much sense, but small group sizes like 2-3 give basically the same performance while saving a lot of memory.
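A minimal sketch of the idea in PyTorch (head counts and dimensions below are illustrative, not taken from the paper): each K/V head is shared by a group of query heads, so the cached K/V tensors shrink by the group size.

```python
import torch
import torch.nn.functional as F

def grouped_query_attention(q, k, v):
    """Sketch of grouped-query attention.
    q: (batch, n_q_heads, seq, head_dim)
    k, v: (batch, n_kv_heads, seq, head_dim), with n_q_heads % n_kv_heads == 0.
    """
    b, n_q_heads, s, d = q.shape
    n_kv_heads = k.shape[1]
    group_size = n_q_heads // n_kv_heads  # query heads per shared K/V head

    # Broadcast each K/V head to its group of query heads.
    k = k.repeat_interleave(group_size, dim=1)
    v = v.repeat_interleave(group_size, dim=1)

    scores = q @ k.transpose(-2, -1) / d**0.5      # (b, n_q_heads, s, s)
    attn = F.softmax(scores, dim=-1)
    return attn @ v                                # (b, n_q_heads, s, head_dim)

# Example: 8 query heads sharing 2 K/V heads (group size 4).
# Only the 2 K/V heads need to be stored in the KV cache.
q = torch.randn(1, 8, 16, 64)
k = torch.randn(1, 2, 16, 64)
v = torch.randn(1, 2, 16, 64)
out = grouped_query_attention(q, k, v)  # (1, 8, 16, 64)
```

With n_kv_heads = 1 this reduces to multi-query attention; with n_kv_heads = n_q_heads it is ordinary multi-head attention.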

attention (ML)
ML-Performance Tricks