DeepMind's Chinchilla scaling result

In March 2022, DeepMind published “Training Compute-Optimal Large Language Models,” which introduced a model called Chinchilla and a new rule of thumb for how to spend a fixed training budget. The paper found that the largest language models of the time were substantially undertrained: builders had been making models bigger without giving them proportionally more data to learn from.

The central finding is that for compute-optimal training, model size and the number of training tokens should be scaled in equal proportion: for every doubling of model size, the number of training tokens should also double. To test this, the authors trained Chinchilla, a 70-billion-parameter model, on about four times as much data as DeepMind’s earlier 280-billion-parameter Gopher model, using the same compute budget. Chinchilla uniformly and significantly outperformed Gopher and other much larger models, including GPT-3 (175 billion parameters), Jurassic-1 (178 billion), and Megatron-Turing NLG (530 billion), on a wide range of downstream tasks.
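The arithmetic behind this comparison can be sketched in a few lines of Python. This is a rough illustration, not the paper's fitting procedure. It assumes the common approximation that training compute is about 6 FLOPs per parameter per token (C ≈ 6·N·D) and a rule-of-thumb ratio of about 20 tokens per parameter. The token counts (roughly 300 billion for Gopher and 1.4 trillion for Chinchilla) are the figures reported in the paper rather than in the text above, and the function names are hypothetical.

```python
import math

# Assumption: training compute C ~= 6 * N * D FLOPs, where N is the
# parameter count and D is the number of training tokens.
FLOPS_PER_PARAM_TOKEN = 6.0


def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute in FLOPs under C ~= 6 * N * D."""
    return FLOPS_PER_PARAM_TOKEN * n_params * n_tokens


def compute_optimal_split(flops: float, tokens_per_param: float = 20.0):
    """Split a compute budget into (parameters, tokens).

    Assumes a fixed tokens-per-parameter ratio (about 20 is a common
    rule of thumb, not an exact constant). With D = r * N and
    C = 6 * N * D, it follows that N = sqrt(C / (6 * r)).
    """
    n_params = math.sqrt(flops / (FLOPS_PER_PARAM_TOKEN * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens


# Approximate compute for the two models (token counts as reported in the paper).
gopher_flops = training_flops(280e9, 300e9)
chinchilla_flops = training_flops(70e9, 1.4e12)
print(f"Gopher:     {gopher_flops:.2e} FLOPs")
print(f"Chinchilla: {chinchilla_flops:.2e} FLOPs")

# Quadrupling the budget doubles both parameters and tokens,
# because each grows with the square root of compute.
n1, d1 = compute_optimal_split(chinchilla_flops)
n4, d4 = compute_optimal_split(4 * chinchilla_flops)
print(f"Parameter growth: {n4 / n1:.2f}x, token growth: {d4 / d1:.2f}x")
```

Under these assumptions the two budgets come out within about 20 percent of each other, which is consistent with the "same compute budget" framing. The square-root split is what "scaled equally" means in practice: extra compute is divided evenly, on a log scale, between a larger model and more data.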

The Chinchilla result reshaped how the field thought about scaling. It shifted attention from raw parameter counts toward the balance between model size and data, and the “Chinchilla-optimal” ratio, commonly summarized as roughly 20 training tokens per parameter, became a reference point for later model designers deciding how big to build and how much to train.
