Humanity’s Last Exam (HLE) measures whether an AI system can answer the hardest closed-ended academic questions experts can devise. It contains 2,500 questions spanning over 100 subjects across mathematics, the humanities, and the natural sciences. Questions are either multiple-choice or short-answer, so they can be graded automatically. It also tracks model calibration, meaning how closely a model's stated confidence matches how often it is actually right.
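To make the calibration idea concrete, here is a minimal sketch of a binned root-mean-square calibration error. It is an illustration under stated assumptions, not the benchmark's official scoring code: the function name `rms_calibration_error`, the ten equal-width bins, and the sample inputs are all hypothetical choices made for this example.

```python
import math


def rms_calibration_error(confidences, correct, n_bins=10):
    """Binned RMS gap between stated confidence and observed accuracy.

    confidences: floats in [0, 1], the model's stated confidence per answer.
    correct: booleans, whether each answer was graded correct.
    n_bins: number of equal-width confidence bins (an assumed default).
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        raise ValueError("at least one prediction is required")

    # Group predictions by confidence; a confidence of 1.0 goes in the last bin.
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, 1.0 if ok else 0.0))

    total = len(confidences)
    weighted_sq_gap = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(a for _, a in bucket) / len(bucket)
        # Weight each bin by the share of predictions it holds.
        weighted_sq_gap += (len(bucket) / total) * (mean_conf - accuracy) ** 2

    return math.sqrt(weighted_sq_gap)
```

A score of 0 means stated confidence matches accuracy in every bin. A model that claims 0.9 confidence on four answers and gets all four wrong scores about 0.9, which is the pattern of confident error that a calibration metric is meant to expose.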
The benchmark was introduced in a January 2025 paper (arXiv:2501.14249) and was developed jointly by the Center for AI Safety and Scale AI, with contributions from roughly a thousand subject-matter experts across hundreds of institutions worldwide. Its creators built it explicitly to address saturation: frontier models now score above 90 percent on established tests such as MMLU, which limits the usefulness of those tests for tracking further progress.
HLE became a notable reference because it aims to be the final closed-ended academic benchmark of its kind, with broad subject coverage and questions deliberately set at the frontier of human knowledge so that it can still separate top models.
Current scores are published on the official site and in the paper. Because they change as models improve, they are not reproduced here.