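DeepMind’s 2022 paper “Training Compute-Optimal Large Language Models” (Hoffmann et al.) argued that the largest language models of the day were undertrained for their size. Its central finding is that, for compute-optimal training, model size and training data should grow in equal proportion as the compute budget increases: “for every doubling of model size the number of training tokens should also be doubled.” The 70-billion-parameter Chinchilla model, trained on about 1.4 trillion tokens (roughly four times Gopher’s data at the same compute budget), outperformed much larger models such as the 280-billion-parameter Gopher and the 175-billion-parameter GPT-3.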
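
The scaling rule can be made concrete with a little arithmetic. The sketch below is illustrative only and rests on two assumptions that are not stated in the text above: training compute is approximated as C ≈ 6·N·D FLOPs (N parameters, D tokens), and the compute-optimal ratio is taken to be about 20 tokens per parameter, the ratio implied by Chinchilla’s reported configuration of 70 billion parameters and about 1.4 trillion tokens. The function name `compute_optimal_split` is hypothetical.

```python
import math


def compute_optimal_split(flops, tokens_per_param=20.0):
    """Split a compute budget into (parameters, tokens).

    Assumes C ~= 6 * N * D and a fixed ratio D = tokens_per_param * N.
    Both are rule-of-thumb approximations, not exact results.
    """
    # Substituting D = r * N into C = 6 * N * D gives C = 6 * r * N^2.
    params = math.sqrt(flops / (6.0 * tokens_per_param))
    tokens = tokens_per_param * params
    return params, tokens


# A budget matching Chinchilla's reported size under the 6*N*D approximation.
budget = 6 * 70e9 * 1.4e12
n, d = compute_optimal_split(budget)
print(f"{n:.3g} parameters, {d:.3g} tokens")

# Quadrupling the budget doubles both quantities, as the quoted rule says.
n4, d4 = compute_optimal_split(4 * budget)
print(f"parameters x{n4 / n:.2f}, tokens x{d4 / d:.2f}")
```

Under these assumptions the first call recovers roughly 70 billion parameters and 1.4 trillion tokens. Because both quantities scale with the square root of compute, doubling model size and data together requires about four times the budget.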