MMLU (Massive Multitask Language Understanding) measures how much general knowledge and reasoning ability a language model has across 57 subjects, spanning elementary mathematics, US history, computer science, law, and many professional fields. Each question is multiple-choice with four answer options, and the model’s score is the percentage answered correctly. The goal is to test breadth: a single number that indicates how well a model has absorbed material, from elementary to expert level, across the academic and professional spectrum.
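The scoring described above is simple arithmetic, and a minimal Python sketch may make it concrete. Everything in it is illustrative: the `results` rows are made-up data, and the names `accuracy`, `accuracy_by_subject`, and `CHANCE_BASELINE` are hypothetical helpers, not part of any official MMLU tooling. Published scores may also aggregate across subjects differently than this sketch does.

```python
from collections import defaultdict

# Hypothetical graded results: (subject, model_answer, correct_answer).
# These rows are invented for illustration, not real MMLU data.
results = [
    ("us_history", "B", "B"),
    ("us_history", "C", "A"),
    ("computer_science", "D", "D"),
    ("computer_science", "A", "A"),
    ("law", "B", "C"),
]

# With four answer options, random guessing is expected to score 25%.
CHANCE_BASELINE = 100.0 / 4


def accuracy(rows):
    """Return the percentage of rows where the model's answer matches the key."""
    if not rows:
        raise ValueError("no results to score")
    correct = sum(1 for _, predicted, expected in rows if predicted == expected)
    return 100.0 * correct / len(rows)


def accuracy_by_subject(rows):
    """Group rows by subject and score each group separately."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[0]].append(row)
    return {subject: accuracy(items) for subject, items in grouped.items()}


overall = accuracy(results)
per_subject = accuracy_by_subject(results)

print(f"overall: {overall:.1f}% (chance baseline: {CHANCE_BASELINE:.1f}%)")
for subject, score in sorted(per_subject.items()):
    print(f"{subject}: {score:.1f}%")
```

The per-subject breakdown is included because a single headline number can hide large differences between fields; a model can look strong overall while doing poorly in one subject.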
The benchmark was introduced by Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt in the paper “Measuring Massive Multitask Language Understanding,” first posted in September 2020 and presented at ICLR 2021. At the time, the authors reported that most models scored near the 25% expected from random guessing, while the largest GPT-3 model beat chance by almost 20 percentage points on average but still fell well short of expert performance.
MMLU became an industry standard because it gave researchers and companies one comparable, wide-coverage score to track progress as models improved. For business readers, it is a useful shorthand for “general knowledge” capability, though frontier models now score so high that it no longer separates the best systems well.
Current scores are published on official and community leaderboards and change frequently, so they are not reproduced here.