Sparse Autoencoder (for interpretability)

A sparse autoencoder, in the context of mechanistic interpretability, is a small neural network used as a tool to make a larger model’s internals legible. Its job is to take the dense, overlapping activations inside a model and re-express them as a combination of a much larger set of learned directions, called features, while being forced to use only a few of them at a time.

The technique addresses superposition, the tendency of models to represent more concepts than they have neurons, so that each neuron responds to several unrelated concepts and none has a clean meaning. By learning an overcomplete dictionary of features (more features than the activation has dimensions) and penalizing the use of many features at once, the autoencoder is pushed to find directions that each correspond to a single, coherent concept, turning a tangle of polysemantic neurons into a vocabulary of interpretable features. This is why the approach is also described as dictionary learning.
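The mechanism described above can be illustrated with a minimal sketch. It assumes one common formulation (a ReLU encoder, a linear decoder, and an L1 penalty on feature activations) and is not taken from any particular paper's code. The sizes, the penalty coefficient, and every name (W_enc, W_dec, sae_loss, and so on) are hypothetical, and the random batch stands in for activations that would normally be recorded from a real model.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model = 8        # width of the activations being explained (hypothetical)
n_features = 32    # overcomplete dictionary: more features than dimensions
l1_coeff = 1e-3    # strength of the sparsity penalty (hypothetical value)

# Randomly initialised parameters; a real run would train them by gradient descent.
W_enc = rng.normal(0.0, 0.1, (d_model, n_features))
b_enc = np.zeros(n_features)
W_dec = rng.normal(0.0, 0.1, (n_features, d_model))
b_dec = np.zeros(d_model)

def encode(x):
    """Map dense activations to non-negative feature activations (ReLU)."""
    return np.maximum(0.0, (x - b_dec) @ W_enc + b_enc)

def decode(f):
    """Rebuild activations as a weighted sum of dictionary directions (rows of W_dec)."""
    return f @ W_dec + b_dec

def sae_loss(x):
    """Reconstruction error plus an L1 penalty that discourages using many features."""
    f = encode(x)
    x_hat = decode(f)
    recon = np.mean(np.sum((x - x_hat) ** 2, axis=-1))
    sparsity = np.mean(np.sum(np.abs(f), axis=-1))
    return recon + l1_coeff * sparsity, f, x_hat

# Stand-in for a batch of activations recorded from a larger model.
x = rng.normal(size=(4, d_model))
loss, f, x_hat = sae_loss(x)
print("loss:", float(loss))
print("active features per example:", (f > 0).sum(axis=-1))
```

Training would minimize this loss over many recorded activations. The reconstruction term keeps the features faithful to the original activations, while the L1 term is what pushes each input to be explained by only a few features.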

Anthropic’s 2023 “Towards Monosemanticity” demonstrated the method on a one-layer model, extracting thousands of clean features from a 512-neuron layer. The 2024 “Scaling Monosemanticity” scaled it to the production model Claude 3 Sonnet, recovering millions of features and showing that adjusting them changes the model’s behavior. Sparse autoencoders have since become a standard workhorse of interpretability research.

For a business reader, the sparse autoencoder is the practical instrument behind the recent progress in seeing inside large models. It is the closest thing the field has to a microscope: a way to name the concepts a model is using and, in principle, to detect or adjust safety-relevant ones.
