Benchmarks

How AI is measured: each benchmark is tied to the paper or site that defined it.

102 entries, all primary-sourced
benchmark 1964

Brown Corpus

The Brown Corpus, compiled in 1964, was the first million-word computer-readable sample of American English and seeded modern corpus linguistics.

benchmark

Aider Polyglot Benchmark

Tests whether an LLM can edit existing code correctly across six languages, using 225 hard Exercism exercises.

benchmark

GLUE and SuperGLUE

GLUE (2018) bundled nine language tasks into one score and became the BERT-era scoreboard, then saturated within months, prompting SuperGLUE in 2019.

benchmark

GSM8K (Grade School Math 8K)

A dataset of about 8,500 grade-school math word problems that tests a model's multi-step arithmetic reasoning.

benchmark

HumanEval

A 164-problem test that checks whether a model can write working Python code from a natural-language description, graded by unit tests.

benchmark

Humanity's Last Exam (HLE)

A 2,500-question expert-level exam across many subjects, built to stay hard for frontier AI as easier benchmarks get saturated.

benchmark

LMArena (Chatbot Arena)

A live leaderboard that ranks AI chatbots by anonymous head-to-head human preference votes.

benchmark

MATH (Competition Mathematics Dataset)

A dataset of 12,500 challenging competition math problems, each with a full worked solution, used to measure mathematical problem-solving in AI.

benchmark

MLPerf

An industry benchmark suite from MLCommons that measures how fast computing systems can train and run AI models.

benchmark

SWE-bench

A benchmark that tests whether AI systems can resolve real GitHub issues by editing real codebases, graded by the projects' own tests.

benchmark November 1995

WordNet

Princeton's hand-built lexical database of English that organized words into concept sets, underpinned decades of NLP, and supplied the category hierarchy for ImageNet.

benchmark 1998

MNIST (handwritten digit database)

A dataset of 70,000 handwritten digit images that became the default first benchmark for computer vision and machine learning.

benchmark 1999

Penn Treebank

A corpus of syntactically annotated English, built at the University of Pennsylvania, that became the standard training and test set for parsing and language modeling.

benchmark 2006

WMT (Conference on Machine Translation)

WMT is the annual machine translation conference whose shared tasks and common test sets have set the field's evaluation standards since 2006.

benchmark October 2007

Labeled Faces in the Wild (LFW)

LFW, released by UMass in 2007, holds 13,233 web photos of 5,749 people and became the standard face verification benchmark.

benchmark April 8, 2009

CIFAR-10 and CIFAR-100

Two small labeled image datasets from Krizhevsky and Hinton that became standard testbeds for deep learning research.

benchmark 2012

Winograd Schema Challenge

The Winograd Schema Challenge tests commonsense reasoning with pronoun puzzles, proposed in 2012 as an alternative to the Turing test.

benchmark June 16, 2012

KITTI Vision Benchmark Suite

KITTI is a 2012 real-world driving benchmark with camera and lidar data that became the standard test for autonomous-vehicle perception.

benchmark May 2014

COCO (Common Objects in Context)

A large image dataset with per-object segmentation that became the standard benchmark for detection, segmentation, and captioning.

benchmark February 23, 2016

Visual Genome

A dataset of 108,077 images with dense crowdsourced annotations of objects, attributes, and relationships, aimed at visual reasoning rather than recognition alone.

benchmark November 28, 2016

MS MARCO

Microsoft's reading-comprehension and retrieval dataset built from a million real Bing search queries with human-written answers and millions of passages.

benchmark May 5, 2017

ChestX-ray14 (ChestX-ray8)

NIH's 2017 release of 108,948 chest X-rays from 32,717 patients with NLP-mined disease labels became a standard medical-imaging benchmark.

benchmark January 2, 2018

DeepMind Control Suite

The DeepMind Control Suite is a standardized set of MuJoCo-based continuous control tasks widely used to benchmark RL agents.

benchmark March 14, 2018

ARC (AI2 Reasoning Challenge)

A 2018 set of 7,787 grade-school science questions, split into an Easy set and a Challenge set made up of questions that retrieval and word co-occurrence methods answered incorrectly.

benchmark September 5, 2018

Conceptual Captions

Google's dataset of about 3.3 million image-caption pairs harvested and cleaned from web alt-text, far larger than hand-curated caption sets like COCO.

benchmark March 1, 2019

DROP

A reading-comprehension benchmark of 96,000 questions requiring discrete reasoning like counting, addition, and sorting.

benchmark March 26, 2019

nuScenes

nuScenes is a 2019 driving dataset recorded with a full sensor suite (6 cameras, 5 radars, and 1 lidar) across 1,000 annotated scenes.

benchmark May 2, 2019

SuperGLUE

A harder successor to GLUE, created after models surpassed non-expert humans on the original language-understanding benchmark.

benchmark May 19, 2019

HellaSwag

A 2019 commonsense benchmark built by adversarial filtering, on which humans scored over 95% while top models scored under 48%.

benchmark July 24, 2019

WinoGrande

A 2019 benchmark of 44,000 Winograd-style pronoun puzzles, scaled up and debiased to test real commonsense reasoning.

benchmark October 23, 2019

C4 (Colossal Clean Crawled Corpus)

Google's cleaned subset of Common Crawl, built for the T5 model, that became a widely used open pre-training corpus.

benchmark December 10, 2019

Waymo Open Dataset

Waymo's 2019 open dataset provides high-quality lidar and camera data from 1,150 driving scenes for autonomous-driving perception research.

benchmark September 8, 2021

TruthfulQA

A 2021 benchmark of 817 questions on which the largest models were often the least truthful, because they reproduced common human misconceptions.

benchmark October 13, 2021

Ego4D

A 3,670-hour dataset of first-person daily-life video from 9 countries, built to teach machines egocentric perception.