Superposition

Superposition is a concept from mechanistic interpretability that explains why the individual neurons of a neural network are so often hard to interpret. The idea was developed in Anthropic’s 2022 paper “Toy Models of Superposition,” which studied it in small, controlled networks.

The basic idea is one of compression. A model may want to represent far more distinct features (concepts, properties, patterns) than it has neurons to store them in. If many of those features are rare and seldom occur together, the network can pack them into the same set of neurons as overlapping directions in activation space. It accepts a little interference between features in exchange for representing many more of them than a strict one-feature-per-neuron scheme would allow. As a result, a single neuron can respond to several unrelated concepts, a property called polysemanticity.
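The packing described above can be illustrated with a small numerical sketch. It is not the setup from “Toy Models of Superposition”: the feature directions here are drawn at random rather than learned, the readout is a plain dot product with no nonlinearity, and every function name and parameter is hypothetical, chosen only for this illustration. It assigns 200 features to 50 neurons, switches on one feature, and reads every feature back from the shared activations.

```python
import numpy as np


def make_feature_directions(n_features, n_neurons, seed=0):
    """Give each feature a random unit-length direction in neuron space.

    Assumption: random directions stand in for whatever a trained
    network would actually learn.
    """
    rng = np.random.default_rng(seed)
    directions = rng.normal(size=(n_features, n_neurons))
    return directions / np.linalg.norm(directions, axis=1, keepdims=True)


def encode(features, directions):
    """Write feature values into the shared neurons as a weighted sum of directions."""
    return features @ directions


def decode(activations, directions):
    """Read each feature back by projecting the activations onto its direction."""
    return directions @ activations


def demo(n_features=200, n_neurons=50, n_active=1, seed=0):
    """Activate a few features, store them in fewer neurons, and read all of them back."""
    rng = np.random.default_rng(seed)
    directions = make_feature_directions(n_features, n_neurons, seed)

    # Sparse input: only n_active of the n_features are switched on.
    features = np.zeros(n_features)
    active = rng.choice(n_features, size=n_active, replace=False)
    features[active] = 1.0

    activations = encode(features, directions)   # shape: (n_neurons,)
    readout = decode(activations, directions)    # shape: (n_features,)
    return active, readout, directions


if __name__ == "__main__":
    active, readout, directions = demo()
    inactive = np.delete(readout, active)
    print("active feature readout:", readout[active])
    print("largest interference on an inactive feature:", np.max(np.abs(inactive)))
    # Every neuron carries weight for many features at once (polysemanticity).
    print("features with nonzero weight on neuron 0:", np.count_nonzero(directions[:, 0]))
```

With a single active feature, its own readout is 1 and every other feature reads back a smaller, nonzero value: that leftover is the interference. Raising `n_active` in this sketch makes the interference terms add up, which is why the scheme depends on features being sparse.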

Superposition matters because it is a direct obstacle to understanding models. If features are smeared across neurons rather than cleanly localized, then reading a network neuron by neuron will not reveal what it is doing. This realization motivated the search for better units of analysis. The sparse-autoencoder techniques in “Towards Monosemanticity” and “Scaling Monosemanticity” are explicit attempts to pull a model’s features back out of superposition.

For a business reader, superposition is part of why large models behave like black boxes: their knowledge is stored in a compressed, overlapping code rather than in tidy, labeled slots. Undoing that compression is one of the central technical challenges in making AI systems auditable and trustworthy.
