An outer product $\mathbf{a}\mathbf{b}^\top$ results in a matrix whose $i$-th row is the second vector scaled by the $i$-th element of the first vector (entry $(i,j)$ is $a_i b_j$):
```python
torch.einsum("i, j -> ij", [torch.arange(4), torch.arange(4)])
```
```
tensor([[0, 0, 0, 0],
        [0, 1, 2, 3],
        [0, 2, 4, 6],
        [0, 3, 6, 9]])
```
→ an outer product produces a rank-1 matrix
Outer-product memory
The rank-1 matrix $\mathbf{W} = \mathbf{v}\mathbf{k}^\top$ produced by an outer product can be interpreted as storing an association from pattern $\mathbf{k}$ (key) to pattern $\mathbf{v}$ (value).
$\mathbf{W}$ is a linear map that takes a query $\mathbf{q}$ which, if similar to $\mathbf{k}$, returns something similar to $\mathbf{v}$:

$$\mathbf{W}\mathbf{q} = \mathbf{v}\mathbf{k}^\top\mathbf{q} = (\mathbf{k}^\top\mathbf{q})\,\mathbf{v}$$

If the keys are unit norm, $\mathbf{k}^\top\mathbf{q}$ is exactly the cosine similarity of $\mathbf{q}$ and $\mathbf{k}$ (times $\|\mathbf{q}\|$), so $\mathbf{W}\mathbf{q}$ is $\mathbf{v}$ scaled by that similarity.
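As a rough sketch of this single-pair case (the names and dimensions are illustrative, not from the original):

```python
import torch

d = 8
k = torch.randn(d); k = k / k.norm()              # unit-norm key
v = torch.randn(d)                                # value to associate with k

W = torch.einsum("i, j -> ij", v, k)              # rank-1 memory: W = v k^T

q = k + 0.1 * torch.randn(d); q = q / q.norm()    # noisy query, similar to k
readout = W @ q                                   # = (k . q) * v

print(torch.dot(k, q))                                           # cosine similarity of q and k
print(torch.allclose(readout, torch.dot(k, q) * v, atol=1e-6))   # True
```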
→ outer product stores association from key to value
→ inner product measures similarity of query to key
→ Together, they implement content-based addressing, aka associative memory.

For many pairs $(\mathbf{k}_i, \mathbf{v}_i)$, we stack the keys and values and build a weight matrix $\mathbf{W} = \sum_i \mathbf{v}_i\mathbf{k}_i^\top$ that stores all associations. Then

$$\mathbf{W}\mathbf{q} = \sum_i (\mathbf{k}_i^\top\mathbf{q})\,\mathbf{v}_i$$

So $\mathbf{W}\mathbf{q}$ computes the similarity of the query to all keys, and returns a similarity-weighted sum of all values.
If the keys are orthonormal and the query equals one of them, this exactly returns the corresponding value.
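A minimal sketch of the stacked case (orthonormal keys built via QR and the shapes are my assumption, just to make the exact-recall claim concrete):

```python
import torch

N, d = 4, 16
Q, _ = torch.linalg.qr(torch.randn(d, N))    # d x N matrix with orthonormal columns
K = Q.T                                      # N orthonormal keys (rows)
V = torch.randn(N, d)                        # N values (rows)

W = V.T @ K                                  # = sum_i v_i k_i^T

q = K[2]                                     # query exactly one of the stored keys
readout = W @ q                              # = sum_i (k_i . q) v_i
print(torch.allclose(readout, V[2], atol=1e-5))  # True: exact recall of the stored value
```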
Too many non-orthogonal keys clutter recall.
Mitigations include:
- normalization, scaling (e.g. NTM, SDP, layer normalization, RMS norm, qk-norm)
- applying a hard/soft threshold after reading, letting the correct item win even if noisy (Hopfield Network, SDP)
- increasing the memory capacity, i.e. the dimensionality $d$ of keys/queries (for each query, each non-matching key adds $\sim 1/\sqrt{d}$ noise to the readout → std of the total noise is $\sim\sqrt{N/d}$ for $N$ items; see the sketch after this list)
- using sparse keys/queries (e.g. locality-sensitive hashing, sparse attention, …)
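As a sanity check on the capacity argument (a rough sketch assuming random unit-norm keys; all names and numbers are illustrative):

```python
import torch

def crosstalk_std(N: int, d: int, trials: int = 200) -> float:
    """Std of the summed similarity between a query and the N-1 non-matching keys."""
    totals = []
    for _ in range(trials):
        K = torch.nn.functional.normalize(torch.randn(N, d), dim=1)  # random unit-norm keys
        q = K[0]                                 # query the first stored key
        totals.append((K[1:] @ q).sum().item())  # crosstalk from the other keys
    return torch.tensor(totals).std().item()

for d in (64, 256, 1024):
    # measured crosstalk std vs. the sqrt((N-1)/d) prediction
    print(d, round(crosstalk_std(N=32, d=d), 3), round((31 / d) ** 0.5, 3))
```

Increasing $d$ shrinks the crosstalk, which is why larger key/query dimensions (or fewer stored items) keep recall clean.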