self-attention, but the keys and values come from a different source than the queries, not from the same input.

Cross-Attention


Queries correspond to the rows of the attention matrix.
Keys correspond to the columns of the attention matrix.

Note: Neither the number of tokens nor the dimension of the two “modalities” needs to match. However, in practice they often do (we use a single d_model, as in the image above).
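
As a quick sketch of both points above (queries index the rows, keys index the columns, and the two sources can have different numbers of tokens), here is a minimal single-head cross-attention in plain PyTorch. All names and sizes are made up for illustration; both sequences are projected into the same d_model here.

```python
# Minimal single-head cross-attention sketch (illustrative names and sizes).
import torch
from torch import nn

d_model = 64
n_query_tokens, n_context_tokens = 10, 7

x = torch.randn(1, n_query_tokens, d_model)          # source of the queries
context = torch.randn(1, n_context_tokens, d_model)  # source of the keys/values

to_q = nn.Linear(d_model, d_model)
to_k = nn.Linear(d_model, d_model)
to_v = nn.Linear(d_model, d_model)

q = to_q(x)        # queries come from one input ...
k = to_k(context)  # ... keys and values from the other source
v = to_v(context)

# Attention matrix: one row per query token, one column per key token.
scores = q @ k.transpose(-2, -1) / d_model ** 0.5
attn = scores.softmax(dim=-1)   # shape (1, 10, 7)
out = attn @ v                  # shape (1, 10, d_model)
print(attn.shape, out.shape)
```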

HuggingFace Diffusers code
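
Below is a simplified sketch of the pattern followed by the cross-attention modules in Diffusers, not the library's actual code; class and argument names are illustrative. The key/value source can have its own width (context_dim), and when no second input is passed the module falls back to ordinary self-attention.

```python
# Sketch of a Diffusers-style cross-attention module (illustrative, not the library API).
import torch
from torch import nn

class CrossAttentionSketch(nn.Module):
    def __init__(self, query_dim, context_dim=None, heads=8, dim_head=64):
        super().__init__()
        inner_dim = heads * dim_head
        context_dim = context_dim or query_dim  # no context dim given -> self-attention setup
        self.heads = heads
        self.scale = dim_head ** -0.5
        self.to_q = nn.Linear(query_dim, inner_dim, bias=False)
        self.to_k = nn.Linear(context_dim, inner_dim, bias=False)
        self.to_v = nn.Linear(context_dim, inner_dim, bias=False)
        self.to_out = nn.Linear(inner_dim, query_dim)

    def forward(self, hidden_states, encoder_hidden_states=None):
        # If no second source is given, keys/values come from the queries' own input
        # and the module degenerates to ordinary self-attention.
        context = encoder_hidden_states if encoder_hidden_states is not None else hidden_states
        b = hidden_states.shape[0]
        q, k, v = self.to_q(hidden_states), self.to_k(context), self.to_v(context)

        def split_heads(t):
            # (batch, tokens, heads * dim_head) -> (batch, heads, tokens, dim_head)
            return t.view(b, t.shape[1], self.heads, -1).transpose(1, 2)

        q, k, v = split_heads(q), split_heads(k), split_heads(v)
        attn = (q @ k.transpose(-2, -1) * self.scale).softmax(dim=-1)  # (b, heads, n_q, n_ctx)
        out = attn @ v                                                 # (b, heads, n_q, dim_head)
        out = out.transpose(1, 2).reshape(b, -1, self.heads * out.shape[-1])
        return self.to_out(out)

# e.g. image tokens attending to text embeddings of a different width and length
module = CrossAttentionSketch(query_dim=320, context_dim=768)
image_tokens = torch.randn(2, 1024, 320)
text_tokens = torch.randn(2, 77, 768)
print(module(image_tokens, text_tokens).shape)  # torch.Size([2, 1024, 320])
```

This is exactly the situation in Stable Diffusion, where the U-Net's image tokens attend to the text encoder's embeddings, which have a different width and a different (fixed) number of tokens.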

Cross-Attention in the Transformer Architecture (also Stable Diffusion, …)