Benchmarks

How AI is measured: each benchmark is tied to the paper or site that defined it.

102 entries, all primary-sourced

benchmark 1964

Brown Corpus

The Brown Corpus, compiled in 1964, was the first million-word computer-readable sample of American English and seeded modern corpus linguistics.

benchmark

Aider Polyglot Benchmark

Tests whether an LLM can edit existing code correctly across six languages, using 225 hard Exercism exercises.

benchmark

AIME (American Invitational Mathematics Examination)

A hard high-school mathematics competition that has become a standard test of advanced reasoning in AI models.

benchmark

ARC-AGI (Abstraction and Reasoning Corpus)

A puzzle-based benchmark of novel reasoning tasks that are easy for humans but hard for AI, meant to measure fluid intelligence.

benchmark

GLUE and SuperGLUE

GLUE (2018) bundled nine language-understanding tasks into one score and became the BERT-era scoreboard; models saturated it within about a year, prompting SuperGLUE in 2019.
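As a rough illustration of what bundling tasks into one score means, the sketch below macro-averages per-task results: each task's metrics are averaged first, then the tasks are averaged with equal weight. The task names and numbers are hypothetical placeholders, not real GLUE results, and the function name is an assumption made for this example.

```python
from statistics import mean


def overall_score(task_scores: dict[str, list[float]]) -> float:
    """Macro-average: mean of each task's metrics, then mean across tasks."""
    per_task = [mean(metrics) for metrics in task_scores.values()]
    return mean(per_task)


# Hypothetical scores for illustration only (not real benchmark results).
example_scores = {
    "task_a": [80.0],        # single metric, e.g. accuracy
    "task_b": [90.0, 70.0],  # two metrics, averaged before aggregation
    "task_c": [60.0],
}

if __name__ == "__main__":
    print(round(overall_score(example_scores), 2))
```

A single aggregate like this is easy to rank on, but it hides which individual tasks a model has already maxed out.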

benchmark

GPQA (Graduate-Level Google-Proof Q&A)

A set of 448 hard graduate-level science questions designed so they cannot be answered by simple web search.

benchmark

GSM8K (Grade School Math 8K)

A dataset of about 8,500 grade-school math word problems that tests a model's multi-step arithmetic reasoning.

benchmark

HumanEval

A 164-problem test that checks whether a model can write working Python code from a natural-language description, graded by unit tests.
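The idea of grading by unit tests can be sketched as follows: a candidate solution counts as correct only if every assertion in the problem's test code passes. This is a minimal illustration under stated assumptions, not the benchmark's actual harness; the function name and the toy problem are hypothetical, and a real evaluator would run untrusted code in a sandbox with timeouts rather than calling `exec` directly.

```python
def passes_unit_tests(candidate_source: str, test_source: str) -> bool:
    """Return True if the candidate code defines what the tests need
    and every assertion in the test code passes."""
    namespace: dict = {}
    try:
        exec(candidate_source, namespace)  # define the candidate function(s)
        exec(test_source, namespace)       # run assertions against them
    except Exception:
        # Any failure (syntax error, runtime error, failed assert) is a fail.
        return False
    return True


# Hypothetical toy problem: "write add(a, b) that returns the sum".
correct_candidate = "def add(a, b):\n    return a + b\n"
buggy_candidate = "def add(a, b):\n    return a - b\n"
tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0\n"

if __name__ == "__main__":
    print(passes_unit_tests(correct_candidate, tests))
    print(passes_unit_tests(buggy_candidate, tests))
```

Grading on execution rather than text similarity means two very different programs can both receive full credit, as long as they behave correctly on the tests.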

benchmark

Humanity's Last Exam (HLE)

A 2,500-question expert-level exam across many subjects, built to stay hard for frontier AI as easier benchmarks get saturated.

benchmark

ImageNet Challenge (ILSVRC)

The annual image-recognition competition whose 2012 results launched the deep learning era.

benchmark

LMArena (Chatbot Arena)

A live leaderboard that ranks AI chatbots by anonymous head-to-head human preference votes.
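One common way to turn head-to-head votes into a ranking is an Elo-style rating update, sketched below. This is an assumption for illustration only: the leaderboard's actual statistical model is not described here and may differ. The model names, the starting rating of 1000, and the K-factor of 32 are all hypothetical choices.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_ratings(
    ratings: dict[str, float],
    votes: list[tuple[str, str]],
    k: float = 32.0,
    initial: float = 1000.0,
) -> dict[str, float]:
    """Apply one Elo update per (winner, loser) vote, in order."""
    ratings = dict(ratings)  # do not mutate the caller's dict
    for winner, loser in votes:
        r_w = ratings.setdefault(winner, initial)
        r_l = ratings.setdefault(loser, initial)
        gain = k * (1.0 - expected_score(r_w, r_l))
        ratings[winner] = r_w + gain
        ratings[loser] = r_l - gain
    return ratings


# Hypothetical anonymous pairwise votes: (preferred model, other model).
votes = [("model_a", "model_b"), ("model_a", "model_c"), ("model_b", "model_c")]
final_ratings = update_ratings({}, votes)
leaderboard = sorted(final_ratings, key=final_ratings.get, reverse=True)

if __name__ == "__main__":
    for name in leaderboard:
        print(name, round(final_ratings[name], 1))
```

Because each update moves the winner up by exactly what the loser loses, the total rating mass stays constant and only the relative ordering carries meaning.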

benchmark

MATH (Competition Mathematics Dataset)

A dataset of 12,500 challenging competition math problems, each with a full worked solution, used to measure mathematical problem-solving in AI.

benchmark

MLPerf

An industry benchmark suite from MLCommons that measures how fast computing systems can train and run AI models.

benchmark

MMLU (Measuring Massive Multitask Language Understanding)

A 57-subject multiple-choice test measuring how broadly an AI model has acquired world knowledge and problem-solving ability.

benchmark

SWE-bench

A benchmark that tests whether AI systems can resolve real GitHub issues by editing real codebases, graded by the projects' own tests.

benchmark November 1995

WordNet

Princeton's hand-built lexical database of English that organized words into sets of synonyms (synsets) and underpinned decades of NLP, and also supplied the category hierarchy for ImageNet.

benchmark 1998

MNIST (handwritten digit database)

A dataset of 70,000 handwritten digit images that became the default first benchmark for computer vision and machine learning.

benchmark 1999

Penn Treebank

A University of Pennsylvania corpus of syntactically annotated English that became the standard training and test set for parsing and language modeling.

benchmark 2006

WMT (Conference on Machine Translation)

WMT is the annual machine translation conference whose shared tasks and common test sets have set the field's evaluation standards since 2006.

benchmark October 2007

Labeled Faces in the Wild (LFW)

LFW, released by UMass in 2007, holds 13,233 web photos of 5,749 people and became the standard face verification benchmark.

benchmark October 6, 2008

NASA C-MAPSS Turbofan Engine Degradation Dataset

NASA's simulated run-to-failure turbofan dataset became the standard benchmark for predicting an engine's remaining useful life.

benchmark April 8, 2009

CIFAR-10 and CIFAR-100

Two small labeled image datasets collected by Krizhevsky, Nair, and Hinton that became standard testbeds for deep learning research.

benchmark 2012

Winograd Schema Challenge

The Winograd Schema Challenge tests commonsense reasoning with pronoun puzzles, proposed in 2012 as an alternative to the Turing test.

benchmark June 16, 2012

KITTI Vision Benchmark Suite

KITTI is a 2012 real-world driving benchmark with camera and lidar data that became the standard test for autonomous-vehicle perception.

benchmark July 19, 2012

The Arcade Learning Environment (ALE)

ALE turned hundreds of Atari 2600 games into a standard testbed that became the dominant benchmark for deep RL agents.

benchmark May 2014

COCO (Common Objects in Context)

A large image dataset with per-object segmentation that became the standard benchmark for detection, segmentation, and captioning.

benchmark 2015

LibriSpeech: An ASR Corpus from Public Domain Audiobooks

A roughly 1,000-hour corpus of read English speech from LibriVox audiobooks, the standard benchmark for speech recognition research.

benchmark February 23, 2016

Visual Genome

A dataset of 108,077 images with dense crowdsourced annotations of objects, attributes, and relationships, aimed at visual reasoning rather than recognition alone.

benchmark June 16, 2016

SQuAD (Stanford Question Answering Dataset)

A Stanford reading-comprehension benchmark of 100,000-plus questions that helped drive the pre-Transformer era of NLP.

benchmark November 28, 2016

MS MARCO

Microsoft's reading-comprehension and retrieval dataset built from a million real Bing search queries with human-written answers and millions of passages.

benchmark May 5, 2017

ChestX-ray14 (ChestX-ray8)

NIH's 2017 release of 108,948 chest X-rays from 32,717 patients with NLP-mined disease labels became a standard medical-imaging benchmark.

benchmark January 2, 2018

DeepMind Control Suite

The DeepMind Control Suite is a standardized set of MuJoCo-based continuous control tasks widely used to benchmark RL agents.

benchmark March 14, 2018

ARC (AI2 Reasoning Challenge)

A 2018 set of 7,787 grade-school science questions, split into an Easy set and a Challenge set; the Challenge set contains only questions that retrieval-based and word co-occurrence methods answered incorrectly.

benchmark September 5, 2018

Conceptual Captions

Google's dataset of about 3.3 million image-caption pairs harvested and cleaned from web alt-text, far larger than hand-curated caption sets like COCO.

benchmark March 1, 2019

DROP

A reading-comprehension benchmark of 96,000 questions requiring discrete reasoning like counting, addition, and sorting.

benchmark March 26, 2019

nuScenes

nuScenes is a 2019 driving dataset recorded with a full autonomous-vehicle sensor suite (6 cameras, 5 radars, and 1 lidar) across 1,000 annotated scenes.

benchmark May 2, 2019

SuperGLUE

A harder successor to GLUE, created after models surpassed non-expert humans on the original language-understanding benchmark.

benchmark May 19, 2019

HellaSwag

A 2019 commonsense benchmark built by adversarial filtering, on which humans scored over 95% but the top models at release scored under 48%.

benchmark July 24, 2019

WinoGrande

A 2019 benchmark of 44,000 Winograd-style pronoun puzzles, scaled up and debiased to test real commonsense reasoning.

benchmark October 23, 2019

C4 (Colossal Clean Crawled Corpus)

Google's cleaned subset of Common Crawl, built for the T5 model, that became a widely used open pre-training corpus.

benchmark December 10, 2019

Waymo Open Dataset

Waymo's 2019 open dataset released high-quality lidar and camera data from 1,150 driving scenes for autonomous-driving perception research.

benchmark December 13, 2019

Common Voice: A Massively-Multilingual Speech Corpus

Mozilla's crowdsourced, public-domain speech dataset, collecting validated voice clips across dozens of languages for open ASR.

benchmark 2020

The M-Competitions (M4 and M5 Forecasting Competitions)

Spyros Makridakis's open forecasting competitions that empirically test which methods actually predict best on thousands of real time series.

benchmark May 3, 2021

SUPERB: Speech Processing Universal Performance Benchmark

A 2021 benchmark and leaderboard that tests one frozen speech model across many tasks, measuring how general its representations are.

benchmark August 16, 2021

MBPP (Mostly Basic Python Problems)

A set of 974 entry-level Python tasks used alongside HumanEval to measure code-generation ability.

benchmark September 8, 2021

TruthfulQA

A 2021 benchmark of 817 questions on which the largest models were often the least truthful, because they reproduced common human misconceptions.

benchmark October 13, 2021

Ego4D

A 3,670-hour dataset of first-person daily-life video from 9 countries, built to teach machines egocentric perception.

benchmark June 9, 2022

BIG-bench (Beyond the Imitation Game)

A community benchmark of 204 tasks from 450 authors built to probe what large language models can and cannot do.