Associative memory is a type of memory that can be retrieved by association, i.e. by similarity or relatedness to other memories. Importantly, the full memory can be reconstructed from partial information.
For example, seeing a certain piece of clothing might remind you of a person who frequently wears it.
Memory in the brain is associative
It is stored in reference frames based on grid cells.
This also explains why it is easier to remember things if you assign them to spatial locations, e.g. in a “memory palace” / “method of loci”.
The brain prefers to store things in reference frames.
The method of recalling things (or “thinking”, if you will … not so sure about that yet. I think this is just the interpolative-memory part; where is the “program synthesis” part?) is to mentally move through those reference frames.
One can move through reference frames physically, e.g. by walking through a room or touching an object, or mentally, by thinking about them.
Outer-product memory
The rank-1 matrix $W = v k^\top$ produced by an outer product can be interpreted as storing an association from pattern $k$ (key) to pattern $v$ (value).
$W$ is a linear map that takes a query $q$ and, if $q$ is similar to $k$, returns something similar to $v$:
$$Wq = v k^\top q = (k \cdot q)\, v$$
If $k$ and $q$ are unit norm, $Wq$ is exactly $v$ scaled by the cosine similarity of $q$ and $k$.
→ outer product $v k^\top$ stores an association from key to value
→ inner product $k^\top q$ measures the similarity of query to key
→ Together, they implement content-based addressing, aka associative memory.

For many pairs $(k_i, v_i)$, we stack the keys and values into matrices $K = [k_1, \dots, k_N]$ and $V = [v_1, \dots, v_N]$ and build a weight matrix $W = \sum_i v_i k_i^\top = V K^\top$ that stores all associations. Then
$$Wq = V K^\top q = \sum_i (k_i^\top q)\, v_i$$
So $K^\top q$ computes the similarity of the query to all keys, and $Wq$ returns a similarity-weighted sum of all values.
If the keys are orthonormal and the query matches one of them, this returns exactly the corresponding value, with no crosstalk from the other pairs.
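As a sanity check, here is a minimal NumPy sketch of this (the shapes and the names `K`, `V`, `d` are my own): it superimposes several pairs into one weight matrix via summed outer products and recalls a value exactly from its orthonormal key.

```python
# Minimal outer-product associative memory (illustrative sketch).
import numpy as np

rng = np.random.default_rng(0)
d, N = 64, 8  # key/value dimension, number of stored pairs

# Orthonormal keys (columns of a reduced QR factor) and arbitrary values.
K = np.linalg.qr(rng.standard_normal((d, N)))[0]  # (d, N), K.T @ K = I
V = rng.standard_normal((d, N))                   # (d, N)

# Superimpose all pairs: W = sum_i v_i k_i^T = V K^T.
W = V @ K.T

# Query with a stored key: readout is exactly the paired value (no crosstalk).
q = K[:, 3]
print(np.allclose(W @ q, V[:, 3]))  # True
```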
Too many non-orthogonal keys clutter recall.
Mitigations include:
- normalization, scaling (e.g. NTM, SDP, layer normalization, RMS norm, qk-norm)
- applying a hard/soft threshold after reading, letting the correct item win even if noisy (Hopfield Network, SDP)
- increasing the memory capacity, i.e. the dimensionality $d$ of keys/queries (for each query, each non-matching key adds $\sim 1/\sqrt{d}$ noise to the readout → std of the total noise is $\sim \sqrt{N/d}$ for $N$ items; see the sketch after this list)
- using sparse keys/queries (e.g. locality-sensitive hashing, sparse attention, …)
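A quick empirical check of the capacity point (a throwaway sketch; the sizes are arbitrary): the similarity between a query and each non-matching random unit-norm key has std about $1/\sqrt{d}$, so the summed crosstalk over $N$ stored items has std about $\sqrt{N/d}$.

```python
# Crosstalk of random unit-norm keys shrinks like 1/sqrt(d).
import numpy as np

rng = np.random.default_rng(0)
for d in (64, 256, 1024):
    K = rng.standard_normal((d, 1000))
    K /= np.linalg.norm(K, axis=0)      # 1000 random unit-norm keys
    q = K[:, 0]                         # query with the first key
    crosstalk = K[:, 1:].T @ q          # similarity to each non-matching key
    print(d, crosstalk.std().round(4), round(1 / np.sqrt(d), 4))  # measured ≈ predicted
```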
Attention is an associative memory
In a self-attention head, a query retrieves content by matching against keys and mixing the corresponding values → content-addressable recall aka associative memory in a differentiable key-value memory.
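To make this concrete, a minimal sketch of a single attention read in plain NumPy (one query, no batching, masking, or learned projections; the names and sizes are my own):

```python
# One attention-head read as a soft key-value memory lookup.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(q, K, V):
    """q: (d,), K: (N, d), V: (N, d_v) -> similarity-weighted mix of values."""
    scores = K @ q / np.sqrt(K.shape[-1])  # inner products: match query to keys
    weights = softmax(scores)              # soft winner-take-all over memories
    return weights @ V                     # blend the corresponding values

rng = np.random.default_rng(0)
N, d, d_v = 16, 64, 32
K = rng.standard_normal((N, d))            # stored keys
V = rng.standard_normal((N, d_v))          # stored values

q = K[7] + 0.1 * rng.standard_normal(d)    # noisy / partial cue for item 7
weights = softmax(K @ q / np.sqrt(d))
print(weights.argmax(), weights.max())     # -> 7, ≈ 1.0: item 7 dominates
out = attend(q, K, V)                      # ≈ V[7]
```

With a clean or slightly corrupted cue, the softmax weights concentrate on the matching memory, so the read returns roughly the stored value: reconstruction of the full memory from partial information.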