Associative memory is a type of memory that can be retrieved by association, i.e. by similarity or relatedness to other memories. Importantly, the full memory can be reconstructed from partial information.
For example, seeing a certain piece of clothing might remind you of a person who frequently wears it.
Memory in the brain is associative
It is stored in reference frames based on grid cells.
This also explains why it is easier to remember things if you assign them to spatial locations, e.g. in a “memory palace” / “method of loci”.
The brain prefers to store things in reference frames.
The method of recalling things (or “thinking”, if you will … not so sure about that yet. I think this is just the interpolative-memory part; where is the “program synthesis” part?) is to mentally move through those reference frames.
One can move through reference frames physically, e.g. by walking through a room or touching an object, or mentally, by thinking about them.
Outer-product memory
The rank-1 matrix $W = v k^\top$ produced by an outer product can be interpreted as storing an association from pattern $k$ (key) to pattern $v$ (value).
$W$ is a linear map that takes a query $q$ and, if $q$ is similar to $k$, returns something similar to $v$:
$$Wq = v k^\top q = (k \cdot q)\, v$$
If $k$ and $q$ are unit norm, $Wq$ is exactly $v$ scaled by the cosine similarity of $q$ and $k$.
→ outer product $v k^\top$ stores an association from key to value
→ inner product $k^\top q$ measures the similarity of query to key
→ Together, they implement content-based addressing, aka associative memory.

For many pairs $(k_i, v_i)$, we stack the keys and values into matrices $K = [k_1, \dots, k_N]$ and $V = [v_1, \dots, v_N]$ and build a weight matrix $W = \sum_i v_i k_i^\top = V K^\top$ that stores all associations. Then
$$Wq = V K^\top q = \sum_i (k_i^\top q)\, v_i$$
So $K^\top q$ computes the similarity of the query to all keys, and $Wq$ returns a similarity-weighted sum of all values.
If the keys are orthonormal and the query matches one of them, this returns exactly the corresponding value, with no crosstalk from the other pairs.
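As a sanity check, here is a minimal NumPy sketch of this (the shapes and the names `K`, `V`, `d` are my own): it superimposes several pairs into one weight matrix via summed outer products and recalls a value exactly from its orthonormal key.

```python
# Minimal outer-product associative memory (illustrative sketch).
import numpy as np

rng = np.random.default_rng(0)
d, N = 64, 8  # key/value dimension, number of stored pairs

# Orthonormal keys (columns of a reduced QR factor) and arbitrary values.
K = np.linalg.qr(rng.standard_normal((d, N)))[0]  # (d, N), K.T @ K = I
V = rng.standard_normal((d, N))                   # (d, N)

# Superimpose all pairs: W = sum_i v_i k_i^T = V K^T.
W = V @ K.T

# Query with a stored key: readout is exactly the paired value (no crosstalk).
q = K[:, 3]
print(np.allclose(W @ q, V[:, 3]))  # True
```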
Too many non-orthogonal keys clutter recall.
Mitigations include:
- normalization, scaling (e.g. NTM, SDP, layer normalization, RMS norm, qk-norm)
- applying a hard/soft threshold after reading, letting the correct item win even if noisy (Hopfield Network, SDP)
- increasing the memory capacity, i.e. the dimensionality $d$ of keys/queries (for each query, each non-matching key adds $\sim 1/\sqrt{d}$ noise to the readout → std of the total noise is $\sim \sqrt{N/d}$ for $N$ items; see the sketch after this list)
- using sparse keys/queries (e.g. locality-sensitive hashing, sparse attention, …)
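A quick empirical check of the capacity point (a throwaway sketch; the sizes are arbitrary): the similarity between a query and each non-matching random unit-norm key has std about $1/\sqrt{d}$, so the summed crosstalk over $N$ stored items has std about $\sqrt{N/d}$.

```python
# Crosstalk of random unit-norm keys shrinks like 1/sqrt(d).
import numpy as np

rng = np.random.default_rng(0)
for d in (64, 256, 1024):
    K = rng.standard_normal((d, 1000))
    K /= np.linalg.norm(K, axis=0)      # 1000 random unit-norm keys
    q = K[:, 0]                         # query with the first key
    crosstalk = K[:, 1:].T @ q          # similarity to each non-matching key
    print(d, crosstalk.std().round(4), round(1 / np.sqrt(d), 4))  # measured ≈ predicted
```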
Attention is an associative memory
In a self-attention head, a query retrieves content by matching against keys and mixing the corresponding values → content-addressable recall aka associative memory in a differentiable key-value memory.
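To make this concrete, a minimal sketch of a single attention read in plain NumPy (one query, no batching, masking, or learned projections; the names and sizes are my own):

```python
# One attention-head read as a soft key-value memory lookup.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(q, K, V):
    """q: (d,), K: (N, d), V: (N, d_v) -> similarity-weighted mix of values."""
    scores = K @ q / np.sqrt(K.shape[-1])  # inner products: match query to keys
    weights = softmax(scores)              # soft winner-take-all over memories
    return weights @ V                     # blend the corresponding values

rng = np.random.default_rng(0)
N, d, d_v = 16, 64, 32
K = rng.standard_normal((N, d))            # stored keys
V = rng.standard_normal((N, d_v))          # stored values

q = K[7] + 0.1 * rng.standard_normal(d)    # noisy / partial cue for item 7
weights = softmax(K @ q / np.sqrt(d))
print(weights.argmax(), weights.max())     # -> 7, ≈ 1.0: item 7 dominates
out = attend(q, K, V)                      # ≈ V[7]
```

With a clean or slightly corrupted cue, the softmax weights concentrate on the matching memory, so the read returns roughly the stored value: reconstruction of the full memory from partial information.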