Model distillation is a technique for compressing the knowledge of a large, powerful model (the “teacher”) into a smaller, faster model (the “student”). The student is trained to reproduce the teacher’s outputs, often capturing most of its capability while being far cheaper to run.
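As a rough illustration of what “trained to reproduce the teacher’s outputs” can mean in practice, the sketch below computes a temperature-softened distillation loss in plain Python, in the style described by Hinton et al. The function names, the example scores, and the temperature of 2.0 are assumptions chosen for illustration, not values from any particular system.

```python
import math


def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; a higher temperature gives a softer spread."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence from the teacher's softened distribution to the student's.

    The result is scaled by temperature squared, a common convention that keeps
    the loss magnitude comparable across temperatures.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(
        p * math.log(p / q)
        for p, q in zip(p_teacher, p_student)
        if p > 0  # terms with zero teacher probability contribute nothing
    )
    return kl * temperature ** 2


# Hypothetical scores over three candidate answers (illustrative only).
teacher = [4.0, 1.0, 0.5]
close_student = [3.8, 1.1, 0.4]  # roughly agrees with the teacher
far_student = [0.5, 4.0, 1.0]    # prefers a different answer

print("close student loss:", distillation_loss(close_student, teacher))
print("far student loss:  ", distillation_loss(far_student, teacher))
```

A student whose scores track the teacher’s yields a loss near zero, while one that ranks the options differently yields a larger loss; training adjusts the student to push this number down.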
The method was formalized by Hinton, Vinyals, and Dean in the 2015 paper “Distilling the Knowledge in a Neural Network.” The paper opens by noting that a simple way to improve performance is to train many models on the same data and average their predictions, then shows that the knowledge of this cumbersome ensemble can be distilled into a single small model that is practical to deploy.
Distillation is widely used to produce smaller, lower-latency versions of frontier models for use on phones, on edge devices, and in high-volume services.
Why business readers should care: Distillation is one of the main ways providers offer “mini” or “lite” models that cost far less to run while staying useful. It is a central lever for controlling the cost of running AI at scale.