Neurons that respond to multiple task variables rather than just one. In neuroscience, most prefrontal cortex neurons are “mixed-selective”.
In mechanistic interpretability, this same phenomenon is called superposition.
Superposition in neural networks
When models represent more features than they have dimensions by encoding features along overlapping, non-orthogonal directions. Features are stored as nearly-orthogonal vectors rather than perfectly orthogonal ones, creating “interference” between features.
Toy Models of Superposition shows this emerges naturally when models need to represent more features than they have capacity for. Features organize into geometric structures (digons, triangles, pentagons) to minimize interference while maximizing the number of features represented.
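A minimal sketch of this kind of setup, assuming a tied-weight linear map with a ReLU readout; the hyperparameters, data distribution, and training loop below are illustrative assumptions rather than the paper’s exact configuration:

```python
# Sketch: compress 5 sparse features into 2 dimensions and reconstruct them.
# Illustrative hyperparameters, not the exact Toy Models of Superposition setup.
import torch

torch.manual_seed(0)
m, n = 5, 2                       # 5 features squeezed into 2 hidden dimensions
feature_prob = 0.05               # each feature is active ~5% of the time (sparse)

W = torch.nn.Parameter(torch.randn(n, m) * 0.1)   # columns = feature directions
b = torch.nn.Parameter(torch.zeros(m))
opt = torch.optim.Adam([W, b], lr=1e-2)

for step in range(3000):
    # Sparse synthetic data: each feature is independently active with
    # probability feature_prob, with magnitude uniform in [0, 1] when active.
    x = torch.rand(1024, m) * (torch.rand(1024, m) < feature_prob)
    h = x @ W.T                        # compress: R^m -> R^n
    x_hat = torch.relu(h @ W + b)      # reconstruct with the same (tied) weights
    loss = ((x - x_hat) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# Columns of W are the learned feature directions in the 2D hidden space.
# With sufficiently sparse features they typically spread into a near-regular
# arrangement (e.g. a pentagon for 5 features in 2D), giving small but
# non-zero interference between features.
dirs = W.detach() / W.detach().norm(dim=0, keepdim=True).clamp_min(1e-8)
print("feature direction norms:", W.detach().norm(dim=0))
print("cosine similarities between feature directions:\n", dirs.T @ dirs)
```

The off-diagonal cosine similarities are the “interference” cost the model pays for packing more features than it has dimensions.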
Read
From Neel Nanda’s mechinterp glossary:
- Superposition is when a model represents more than n features in an n-dimensional activation space. That is, features still correspond to directions, but the set of interpretable directions is larger than the number of dimensions. Intuitively, this is the model simulating a larger model.
- This set of >n directions is sometimes called an overcomplete basis. (Notably, this is not a basis, because it is not linearly independent)
- A key consequence is that if superposition is being used, there cannot be an interpretable basis. In particular, the features-as-neurons hypothesis cannot perfectly hold.
- Sparse coding is a field of maths that develops techniques for finding an overcomplete basis for a set of vectors such that each vector is a sparse linear combination of those basis vectors (a small scikit-learn illustration appears after this list).
- Importantly, if we try to read each feature by projecting onto some direction, these directions cannot all be orthogonal, so we cannot perfectly recover each feature.
- Equivalently, because any set of >n directions is not linearly independent, any activation can be written as infinitely many different linear combinations of those directions, and so can’t be uniquely interpreted as a set of features (see the numeric sketch after this list).
- There are two kinds of superposition worth caring about in a transformer: (these terms are how I think about it, but are not standard notation)
- Bottleneck superposition - this is when a bottleneck dimension experiences superposition. (Eg keys, queries, the residual stream, etc)
- This is not very surprising! If there’s 50,000 tokens in the vocabulary and 768 dimensions in the residual stream, there almost has to be more features than dimensions, and thus superposition.
- Intuitively, bottleneck superposition is just used for “storage”, bottleneck dimensions are intermediate states of linear maps and we do not expect them to be doing significant computation.
- Neuron superposition - this is when neuron activations experience superposition. Ie, there are more features represented in neuron activation space than there are neurons.
- Intuitively, neuron superposition represents doing computation in superposition - using n non-linearities to do some processing that outputs more than n features.
- Intuitively, bottleneck superposition is easier than neuron superposition - the only interference to care about when projecting onto a feature direction comes from other features with non-zero dot product with that direction. In neuron superposition, if one neuron has significant contributions from multiple features, then a change in one of those features will affect all the others in a weird and messy way.
- Is this actually harder to deal with in practice for a model? I have no idea! This is a pretty open question.
- Neuron polysemanticity is the idea that a single neuron activation corresponds to multiple features. Empirically, we might observe that a neuron activates on multiple clusters of seemingly unrelated things, like pictures of dice and pictures of poets.
- Subtlety: Neuron superposition implies polysemanticity (since there are more features than neurons), but not the other way round. There could be an interpretable basis of features, just not the standard basis - this creates polysemanticity but not superposition.
- Alternately, neuron polysemanticity is equivalent to saying that the standard basis is not interpretable.
- Conversely, a neuron is monosemantic if it corresponds to a single feature.
- In practice, the standards for calling a neuron monosemantic are somewhat fuzzy and it’s not a binary - if a neuron activates really strongly for a single feature, but activates a bit on a bunch of other features, I’d probably call it monosemantic.
- We can use polysemanticity to refer either to the neuron layer as a whole or to a specific neuron being polysemantic. A layer of neurons could contain both polysemantic and monosemantic neurons.
- Note: Polysemanticity isn’t used to refer to bottleneck dimensions, because there’s no privileged basis to be polysemantic in.
- High-Level Concepts:
- Intuitively, superposition is a form of lossy compression. The model is able to represent more features, but at the cost of adding noise and interference between features. Models need to find an optimal point balancing between the two, and it’s plausible that the optimal point will not be zero superposition.
- There are two key aspects of a feature:
- Its importance - how useful is it for achieving lower loss? Important features are more useful to represent, and interference with them is more expensive.
- Its sparsity - how frequently does it occur in the input? Controlling for importance, a sparse feature will interfere with other features less.
- In general, problems with many sparse, unimportant features will show significant superposition.
- An underlying concept is that of the feature importance curve. There are, in theory, arbitrarily many features that matter, with a long tail of increasingly niche and unimportant features (like whether text occurs in a glossary about mechanistic interpretability!) which are still better than nothing. We can imagine enumerating all of these features and ordering them in decreasing order of importance. We’ll begin with incredibly important and frequent features (eg, “this is a news article” or “this is Python code”), and steadily drop off. Under this framing, we should expect models to always want to do non-zero superposition, as there will always be some incredibly sparse but useful feature they will want to learn (which may be extremely hard to detect!). A short sketch after this list shows one way importance and sparsity can enter a toy-model loss.
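A small numeric sketch of the points about overcomplete directions above. The specific directions are made up for illustration: three unit vectors in a 2D space cannot be orthogonal, so reading a feature by projection picks up interference from the others, and the same activation admits more than one exact decomposition.

```python
import numpy as np

# Three "feature directions" in a 2-dimensional activation space: an
# overcomplete set (not a basis, since they are linearly dependent).
# They form a triangle (120 degrees apart), which minimises pairwise
# interference for 3 directions in 2D.
angles = np.array([0.0, 2 * np.pi / 3, 4 * np.pi / 3])
D = np.stack([np.cos(angles), np.sin(angles)], axis=1)   # shape (3, 2), rows are unit vectors

# Interference: the directions are not orthogonal, so pairwise dot products
# are -0.5 rather than 0.
print("pairwise dot products:\n", np.round(D @ D.T, 3))

# An activation where only feature 0 is "active" with coefficient 1.0.
a = 1.0 * D[0]

# Reading features by projection: projecting onto the other directions reads
# back spurious values of -0.5 even though those features are inactive.
print("projections of a onto each direction:", np.round(D @ a, 3))

# Non-uniqueness: because D[0] + D[1] + D[2] = 0, adding any multiple of
# (1, 1, 1) to the coefficients gives another exact decomposition of the
# same activation vector.
for c in [np.array([1.0, 0.0, 0.0]), np.array([1.0, 0.0, 0.0]) + 0.7]:
    print("coefficients", c, "-> reconstruction", np.round(c @ D, 6))
```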
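The sparse-coding idea mentioned in the list can be illustrated with scikit-learn. This is a generic sketch over a made-up dictionary, not a claim about how sparse coding is applied to real network activations:

```python
import numpy as np
from sklearn.decomposition import SparseCoder

rng = np.random.default_rng(0)

# A made-up overcomplete "dictionary": 24 unit directions in a 16-dimensional space.
n_dims, n_atoms = 16, 24
D = rng.normal(size=(n_atoms, n_dims))
D /= np.linalg.norm(D, axis=1, keepdims=True)

# Activations built as sparse combinations of those directions (2 active atoms each).
true_codes = np.zeros((5, n_atoms))
for row in true_codes:
    row[rng.choice(n_atoms, size=2, replace=False)] = rng.uniform(0.5, 1.5, size=2)
X = true_codes @ D

# Sparse coding over the (known, fixed) dictionary. With random near-orthogonal
# atoms and very sparse codes this usually recovers the active atoms, though
# exact recovery is not guaranteed in general.
coder = SparseCoder(dictionary=D, transform_algorithm="omp", transform_n_nonzero_coefs=2)
recovered = coder.transform(X)

print("true active atoms:     ", [np.flatnonzero(r).tolist() for r in true_codes])
print("recovered active atoms:", [np.flatnonzero(np.abs(r) > 1e-6).tolist() for r in recovered])
```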
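One way to make the importance/sparsity distinction concrete, in the spirit of the toy-model framing above. The exponential importance curve and the exact weighting scheme here are assumptions for illustration: sparsity controls how often a feature is active in the data, while importance enters as a per-feature weight on the reconstruction loss.

```python
import torch

def sample_features(batch, n_features, sparsity):
    """Each feature is active independently with probability (1 - sparsity)."""
    active = torch.rand(batch, n_features) < (1 - sparsity)
    return torch.rand(batch, n_features) * active

def weighted_reconstruction_loss(x, x_hat, importance):
    """Importance-weighted MSE: errors on important features cost more."""
    return (importance * (x - x_hat) ** 2).mean()

# Illustrative feature-importance curve: a few important features, then a
# long tail of increasingly niche ones.
n_features = 20
importance = 0.8 ** torch.arange(n_features, dtype=torch.float32)

x = sample_features(1024, n_features, sparsity=0.95)
x_hat = torch.zeros_like(x)            # stand-in for a model's reconstruction
print(weighted_reconstruction_loss(x, x_hat, importance))
```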