This note is a reference for different kinds of attention.

Attention? Attention! (Blogpost) [Deep intro to attention]

See also:
self-attention (attention mostly refers to this nowadays)
scaled dot product attention
cross-attention


The Hamiltonian of the Hopfield network is identical to that of the Ising model, except that the interaction strength is not a single constant — and it is very similar to attention!

For a Hopfield network, the energy is

$$E = -\frac{1}{2}\sum_{i \neq j} W_{ij}\, s_i s_j$$

where $W$ is the weight / interaction matrix, which encodes how patterns are stored, rather than being a simple interaction constant.
In a simplified form, the update rule of the HFN is:

$$s_i \leftarrow \operatorname{sgn}\Big(\sum_j W_{ij}\, s_j\Big)$$

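A minimal NumPy sketch of this update rule. The ±1 spin encoding and the Hebbian weight matrix $W = \sum_\mu x^\mu (x^\mu)^\top$ are standard choices assumed here (the note does not spell them out), and the function names are illustrative:

```python
import numpy as np

def hopfield_store(patterns):
    """Hebbian weight matrix: sum of outer products of stored patterns,
    with the self-interaction (diagonal) zeroed out."""
    n = patterns.shape[1]
    W = patterns.T @ patterns / n
    np.fill_diagonal(W, 0.0)
    return W

def hopfield_update(W, s):
    """Synchronous update: s <- sgn(W s), mapping 0 to +1."""
    return np.where(W @ s >= 0, 1.0, -1.0)

# Store one +-1 pattern, then recover it from a corrupted copy.
x = np.array([1., -1., 1., 1., -1., -1., 1., -1.])
W = hopfield_store(x[None, :])
noisy = x.copy()
noisy[0] *= -1                      # flip one spin
recovered = hopfield_update(W, noisy)  # converges back to x
```

One update step is enough here because only one spin was flipped; with more stored patterns or heavier corruption the update is iterated until a fixed point.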
And for attention:

$$\operatorname{Attention}(Q, K, V) = \operatorname{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$

Tokens in self-attention play the role of spins in the Ising model, except that instead of binary spins they are vectors in a higher-dimensional space, with all-to-all communication, self-organizing to compute the final representation.
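A self-contained NumPy sketch of scaled dot-product self-attention over a set of token vectors; the projection matrices and dimensions are made up for illustration:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """softmax(Q K^T / sqrt(d_k)) V over all token pairs."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # (n_tokens, n_tokens): all-to-all
    A = softmax(scores, axis=-1)      # each row is a distribution over tokens
    return A @ V, A

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))           # 4 tokens, model dimension 8
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out, A = self_attention(X, Wq, Wk, Wv)
```

The `(n_tokens, n_tokens)` score matrix is the all-to-all "interaction" term: every token's output is a weighted mixture of every token's value vector.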


See also: associative memory.


Causal, non-causal attention and cache: See causal attention.
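The causal variant can be sketched in a few lines: position $i$ may only attend to positions $j \le i$, which is implemented by setting the scores above the diagonal to $-\infty$ before the softmax (uniform scores used here just to show the mask's effect):

```python
import numpy as np

n = 4
scores = np.zeros((n, n))                        # placeholder scores
mask = np.triu(np.ones((n, n), dtype=bool), k=1) # True above the diagonal
scores[mask] = -np.inf                           # block attention to the future

weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
# Row i is uniform over positions 0..i and exactly zero beyond i.
```

This masking is also what makes the KV cache possible: past keys and values never change, so they can be stored and reused at each decoding step.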

Soft vs Hard Attention

Soft Attention:

  • The alignment weights are learned and placed “softly” over all patches in the source image; essentially the same type of attention as in Bahdanau et al., 2015.
  • Pro: the model is smooth and differentiable.
  • Con: expensive when the source input is large.

Hard Attention (not actually used in practice):

  • Only selects one patch of the image to attend to at a time.
  • Pro: less computation at inference time.
  • Con: the model is non-differentiable and requires more complicated techniques such as variance reduction or reinforcement learning to train. (Luong, et al., 2015)
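The contrast above can be shown in a few lines: soft attention blends all patches with softmax weights (a smooth function of the scores), while hard attention picks a single patch (argmax here; in practice it is sampled, which is why REINFORCE-style training is needed). The patch features and scores are random stand-ins:

```python
import numpy as np

rng = np.random.default_rng(1)
patches = rng.normal(size=(5, 3))   # 5 image-patch feature vectors
scores = rng.normal(size=5)         # alignment scores for each patch

# Soft attention: differentiable weighted average over ALL patches.
w = np.exp(scores - scores.max())
w /= w.sum()
soft_context = w @ patches

# Hard attention: a single selected patch; the selection step is
# non-differentiable, so gradients cannot flow through it.
hard_context = patches[np.argmax(scores)]
```

Both produce a context vector of the same shape; the difference is that `soft_context` depends smoothly on every score, while `hard_context` changes discontinuously when the argmax switches.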