Capsule networks: a celebrated idea that stayed quiet

In October 2017 Sara Sabour, Nicholas Frosst, and Geoffrey Hinton published “Dynamic Routing Between Capsules” on arXiv. The idea carried unusual weight because Hinton, one of the architects of deep learning, had been talking about “capsules” for years as a way to fix what he saw as a flaw in convolutional neural networks. The paper defines a capsule as “a group of neurons whose activity vector represents the instantiation parameters of a specific type of entity such as an object or an object part,” and proposes a routing-by-agreement mechanism: each lower-level capsule predicts the output of every higher-level capsule, and its output is routed most strongly to the higher-level capsules whose actual output agrees with that prediction.
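
To make the routing idea concrete, here is a minimal sketch of one routing-by-agreement pass in plain Python. It follows the general shape of the procedure in the paper (softmax coupling coefficients, a weighted sum of predictions, a "squash" nonlinearity, and an agreement update via dot product), but the function names, the nested-list layout of `u_hat`, and the default iteration count are illustrative assumptions, not the authors' code. The prediction vectors are taken as given here; in the paper they are produced by learned weight matrices.

```python
import math

def squash(s):
    """Rescale vector s to a length in [0, 1) while keeping its direction."""
    sq_norm = sum(x * x for x in s)
    if sq_norm == 0.0:
        return [0.0 for _ in s]
    scale = sq_norm / (1.0 + sq_norm) / math.sqrt(sq_norm)
    return [scale * x for x in s]

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route(u_hat, num_iterations=3):
    """Routing-by-agreement sketch.

    u_hat[i][j] is lower capsule i's prediction vector for upper capsule j
    (hypothetical layout chosen for this example).
    Returns (v, c): the upper-capsule output vectors and the coupling
    coefficients used in the final iteration.
    """
    if num_iterations < 1:
        raise ValueError("num_iterations must be at least 1")
    num_lower = len(u_hat)
    num_upper = len(u_hat[0])
    dim = len(u_hat[0][0])
    # Routing logits start at zero, so every lower capsule initially
    # splits its output evenly across the upper capsules.
    b = [[0.0] * num_upper for _ in range(num_lower)]
    for _ in range(num_iterations):
        # Coupling coefficients: softmax over upper capsules, per lower capsule.
        c = [softmax(row) for row in b]
        v = []
        for j in range(num_upper):
            # Weighted sum of the predictions made for upper capsule j.
            s_j = [
                sum(c[i][j] * u_hat[i][j][d] for i in range(num_lower))
                for d in range(dim)
            ]
            v.append(squash(s_j))
        # Agreement update: raise the logit where prediction and output align.
        for i in range(num_lower):
            for j in range(num_upper):
                b[i][j] += sum(u_hat[i][j][d] * v[j][d] for d in range(dim))
    return v, c

# Toy input: three lower capsules agree about upper capsule 0
# and make conflicting predictions about upper capsule 1.
u_hat = [
    [[1.0, 0.0], [1.0, 0.0]],
    [[1.0, 0.0], [-1.0, 0.0]],
    [[1.0, 0.0], [0.0, 1.0]],
]
v, c = route(u_hat)
```

In this toy input the predictions for upper capsule 0 reinforce each other while those for upper capsule 1 partly cancel, so the coupling coefficients in `c` shift toward capsule 0 over the iterations. The nested loops also hint at the cost the procedure adds: every iteration touches every lower-upper capsule pair.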

The promise was appealing. CNNs are good at detecting features but weak at preserving the spatial and pose relationships between parts - the classic example being that a face with the eyes and mouth scrambled can still score as a face. Capsules were meant to encode those relationships explicitly, learning from less data and generalizing better to new viewpoints. The paper reported state-of-the-art results on MNIST, a small benchmark, and better accuracy than a convolutional network at recognizing highly overlapping digits.

What did not happen is the takeover. Capsule networks were hard to scale, the routing computation was expensive, and the results on large, realistic image datasets did not clearly beat well-tuned convolutional and, soon after, transformer-based models. The field’s attention stayed with convolutional architectures such as residual networks, which predate the capsule paper, and then shifted to vision transformers, while capsules settled into a niche.

The honest version of this story is modest. The publication record shows a heavily cited, influential idea from a leading researcher that did not become the dominant architecture it was hoped to be. That is a common shape in research: a clean, well-argued alternative that the broader trajectory of the field simply routed around.

Sources

Last verified June 6, 2026