The major AI model families, documented from the developers' own pages. For rankings we link the live leaderboards - rankings change weekly, and we would rather send you to the source than publish stale numbers.
Durable facts only: who makes it, what it is, how it is distributed. Each entry notes the lineup as of its verification date.
DeepMind's competition-level code generation system that reached the median rank in real programming contests.
Gemini-powered successor to AlphaCode that reached the 85th percentile among human competitive programmers.
Aya is a 2024 open multilingual instruction-tuned model from Cohere for AI covering 101 languages, over half of them lower-resourced.
BLOOM is a 176B-parameter open multilingual language model built by the volunteer BigScience workshop, covering 46 languages plus 13 programming languages.
Amazon's Chronos, a time-series foundation model that tokenizes numeric series and trains a language model on them for zero-shot forecasting.
Anthropic's family of Claude assistant models, organized into Opus, Sonnet, and Haiku tiers and delivered via API and apps.
Anthropic's May 2026 flagship, an Opus 4.7 upgrade with sharper agentic judgment, faster fast mode, and dynamic workflows.
Meta's family of open code models built on Llama 2, with infilling and long-context support up to 100k tokens.
Salesforce's open code model family that framed program synthesis as a multi-turn conversation.
Mistral AI's first code model, a 22B open-weight system trained on more than 80 programming languages.
Salesforce's encoder-decoder code model that uses identifier-aware pretraining for both understanding and generation.
Cohere's enterprise language models, including Command R and Command R+, built for retrieval-augmented generation, tool use, and multilingual business tasks.
Cohere's family of models built for enterprise retrieval-augmented generation, tool use, and long-context multilingual tasks.
OpenAI's line of text-to-image models, from DALL-E 1 and 2 through DALL-E 3, succeeded by native image generation built into GPT-4o.
Databricks' open mixture-of-experts model, released March 2024 with 132B total and 36B active parameters, beating earlier open models on standard benchmarks.
Chinese lab DeepSeek's model lines, including the V-series general models and the R-series reasoning models.
DeepSeek's open code models, trained on project-level code, which beat Codex and GPT-3.5 on coding benchmarks.
Meta's ESM-2 protein language model and ESMFold, which predict structure from a single sequence and were used to build an atlas of 617M proteins.
Arc Institute's family of genomic foundation models that read and generate DNA, RNA, and protein sequences from raw nucleotides.
The UAE Technology Innovation Institute's open-weight language models, including Falcon-40B and the 180-billion-parameter Falcon-180B trained on web data.
Black Forest Labs' 12-billion-parameter text-to-image models, released in August 2024 by much of the original Stable Diffusion team in open and closed tiers.
Google DeepMind's family of natively multimodal Gemini models, spanning Pro, Flash, and specialized variants.
Google's family of open-weight models, built from the same research as Gemini and released for developers to download, fine-tune, and run themselves.
A transformer pretrained on about 30 million single-cell transcriptomes that transfers to network-biology tasks with little extra data.
DeepMind's foundation world models that turn an image or prompt into a playable, action-controllable virtual environment.
OpenAI's family of general-purpose generative pre-trained transformer models, delivered mainly through the OpenAI API and ChatGPT.
IBM's Apache 2.0 open model family for enterprise use, including the Granite code models trained on 116 programming languages and released in May 2024.
xAI's family of Grok models, the AI assistants developed by Elon Musk's xAI and integrated with the X platform.
A generative code model from Meta and collaborators that can fill in missing code using context on both sides.
AI21 Labs' open models built on a hybrid Transformer-Mamba mixture-of-experts architecture, with a 256K-token context window for long-document work.
Kuaishou's text-to-video model, unveiled in June 2024, generating up to two-minute clips at 1080p and 30 fps.
Meta's family of open-weight Llama models that organizations can download, fine-tune, and deploy themselves.
An independent lab's text-to-image generator known for a distinctive painterly aesthetic, run first through Discord and later on its own website.
French lab Mistral AI's models, mixing open-weight releases with commercial API models across general, code, and audio tasks.
Salesforce's Moirai, a universal time-series forecasting transformer trained on a 27-billion-observation archive for strong zero-shot performance.
Meta's NLLB-200 is a single open model that translates directly between 200 languages, including 150 low-resource ones.
NVIDIA's open foundation model for physical AI that unifies vision reasoning, world generation, and action across robots and vehicles.
OpenAI's line of reasoning models, beginning with o1, that think through problems step by step before answering.
A real-time, playable AI-generated Minecraft-like world, produced frame by frame by a neural model with no game engine.
The Allen Institute's family of fully open language models, released in February 2024 with not just weights but the training data, code, and checkpoints.
Microsoft's Phi family of small language models designed to deliver strong capability at sizes that can run locally.
EleutherAI's suite of 16 language models from 70M to 12B parameters, all trained on the same data in the same order with 154 checkpoints each.
Alibaba's Qwen family of language, vision, and image models, many released publicly through Hugging Face and GitHub.
Runway's 2024 video model that improved fidelity and motion over Gen-2 and was framed as a step toward general world models.
A generative foundation model for single-cell biology, pretrained on over 33 million cells for tasks like cell typing and batch integration.
SeamlessM4T is Meta's 2023 single model that translates and transcribes across speech and text for around 100 languages.
Snowflake's Apache 2.0 enterprise LLM, released April 2024 with a dense-MoE design of 480B total and 17B active parameters, trained for under $2 million.
OpenAI's text-to-video model line, from the February 2024 research preview to general availability in December 2024 and the Sora 2 release in 2025.
Stability AI's family of open text-to-image diffusion models, from the original 1.x release through SDXL and Stable Diffusion 3, runnable on consumer hardware.
BigCode's open 15B code model trained on permissively licensed GitHub repositories with an opt-out process.
Nixtla's TimeGPT, presented as the first foundation model for time series, producing zero-shot forecasts on data it never saw in training.
Google's TimesFM, a decoder-only foundation model for forecasting that gives strong zero-shot accuracy across many public time-series datasets.
The Unitree H1 is a full-size general-purpose humanoid robot from Chinese firm Unitree, notable for fast bipedal running and a relatively low price.
Google DeepMind's text-to-video model line, which by Veo 3 generates clips with synchronized native audio.
OpenAI's open-source automatic speech recognition models, trained on large-scale weakly supervised audio and released under a permissive MIT license.
Where current rankings actually live. These are the authoritative sources we point to instead of freezing scores that go stale.
Human-preference rankings from head-to-head votes
Structured data on notable models, compute, and hardware
Open model hub with downloads, evals, and trending models
Annual authoritative report on the state of AI
The benchmarks behind the headlines - what each one actually tests.
A 2023 benchmark spanning eight environments that measured how well LLMs act as agents, exposing a wide commercial-vs-open gap.
A benchmark of malicious agent tasks that tests whether tool-using LLM agents refuse harmful requests and resist jailbreaks.
A benchmark that grades foundation models on real human exams like the SAT, LSAT, and college entrance tests.
Tests whether an LLM can edit existing code correctly across six languages, using 225 hard Exercism exercises.
A hard high-school mathematics competition that has become a standard test of advanced reasoning in AI models.
A 2023 benchmark that runs real API calls to test whether LLMs can plan, retrieve, and correctly invoke tools.
A 2018 set of 7,787 grade-school science questions, split into an Easy set and a Challenge set made up of the questions that retrieval and word co-occurrence methods got wrong.
A puzzle-based benchmark of novel reasoning tasks that are easy for humans but hard for AI, meant to measure fluid intelligence.
The 2025 successor to ARC-AGI, a harder reasoning benchmark on which frontier AI systems initially scored in the single digits.
A community benchmark of 204 tasks from 450 authors built to probe what large language models can and cannot do.
The 23 hardest BIG-Bench tasks where models trailed humans, used to show chain-of-thought prompting unlocks hidden ability.
A code benchmark of 1,140 practical tasks that require calling many real libraries and following complex instructions.
The Brown Corpus, compiled in 1964, was the first million-word computer-readable sample of American English and seeded modern corpus linguistics.
An OpenAI benchmark of 1,266 questions that force a web agent to dig persistently for hard-to-find facts.
Google's cleaned subset of Common Crawl, built for the T5 model, that became a widely used open pre-training corpus.
NIH's 2017 release of 108,948 chest X-rays from 32,717 patients with NLP-mined disease labels became a standard medical-imaging benchmark.
Two small labeled image datasets from Krizhevsky and Hinton that became standard testbeds for deep learning research.
A large image dataset with per-object segmentation that became the standard benchmark for detection, segmentation, and captioning.
Mozilla's crowdsourced, public-domain speech dataset, collecting validated voice clips across dozens of languages for open ASR.
Google's dataset of about 3.3 million image-caption pairs harvested and cleaned from web alt-text, far larger than hand-curated caption sets like COCO.
A benchmark of 40 professional capture-the-flag tasks that measures how well AI agents can perform real cybersecurity work.
A 2023 benchmark that fixes the training code and competes on the data instead, using a pool of 12.8 billion image-text pairs to test dataset curation.
The DeepMind Control Suite is a standardized set of MuJoCo-based continuous control tasks widely used to benchmark RL agents.
A reading-comprehension benchmark of 96,000 questions requiring discrete reasoning like counting, addition, and sorting.
A benchmark of 1,000 real data-science coding problems across seven Python libraries, drawn from StackOverflow.
A 1,286-hour dataset pairing first-person and third-person video of skilled activities like sports, music, and dance.
A 3,670-hour dataset of first-person daily-life video from 9 countries, built to teach machines egocentric perception.
Augmented versions of HumanEval and MBPP with far more tests, exposing wrong code and changing model rankings.
An evaluation that breaks long text into atomic facts and scores the share supported by a reliable source.
Hugging Face's openly documented 15-trillion-token web dataset that set a new bar for transparent large-scale pre-training data.
FLORES-200 is Meta's human-translated benchmark covering 200+ languages and 40,000+ translation directions, built to evaluate low-resource MT.
Epoch AI's benchmark of original research-level math problems, of which leading models initially solved less than two percent.
A 2023 Meta and Hugging Face benchmark of 466 real-world assistant tasks where humans scored 92% and GPT-4 with plugins only 15%.
An OpenAI benchmark scoring AI on real economically valuable work across 44 occupations in nine GDP sectors.
A Cohere-led rebuild of MMLU across 42 languages that flags culturally biased and Western-centric questions.
GLUE (2018) bundled nine language tasks into one score and became the BERT-era scoreboard, then saturated within months, prompting SuperGLUE in 2019.
A set of 448 hard graduate-level science questions designed so they cannot be answered by simple web search.
A template-based variant of GSM8K showing model math accuracy drops when numbers change or irrelevant clauses are added.
A dataset of about 8,500 grade-school math word problems that tests a model's multi-step arithmetic reasoning.
Gymnasium is the maintained successor to OpenAI Gym, providing the de facto standard API for RL environments.
A Meta hallucination benchmark built on a clear taxonomy, with tasks that regenerate to resist leakage.
A standardized framework for automated red teaming, comparing attacks that try to make models produce harmful content.
A 2019 commonsense benchmark built by adversarial filtering, where humans scored over 95% but top models under 48%.
A Stanford framework that scored language models on seven metrics across many scenarios, not just accuracy, for transparent comparison.
A 164-problem test that checks whether a model can write working Python code from a natural-language description, graded by unit tests.
A 2,500-question expert-level exam across many subjects, built to stay hard for frontier AI as easier benchmarks get saturated.
A benchmark of about 500 prompts with verifiable instructions, scored automatically without human or model judges.
The annual image-recognition competition whose 2012 results launched the deep learning era.
KITTI is a 2012 real-world driving benchmark with camera and lidar data that became the standard test for autonomous-vehicle perception.
LFW, released by UMass in 2007, holds 13,233 web photos of 5,749 people and became the standard face verification benchmark.
A roughly 1,000-hour corpus of read English speech from LibriVox audiobooks, the standard benchmark for speech recognition research.
A contamination-limited benchmark with monthly fresh questions and objective scoring across six task categories.
A code benchmark that keeps collecting new contest problems over time so models cannot have memorized the answers.
A live leaderboard that ranks AI chatbots by anonymous head-to-head human preference votes.
A benchmark that separates honesty from accuracy, measuring whether language models lie under pressure rather than just whether they are correct.
A dataset of 12,500 challenging competition math problems, each with a full worked solution, used to measure mathematical problem-solving in AI.
A benchmark of 6,141 problems that test whether models can do math reasoning over images, charts, and diagrams.
A set of 974 entry-level Python tasks used alongside HumanEval to measure code-generation ability.
A multilingual grade-school math benchmark of 250 problems translated into ten diverse languages.
A 2023 benchmark of 2,000+ real tasks across 137 websites for testing agents that follow instructions on any site.
An industry benchmark suite from MLCommons that measures how fast computing systems can train and run AI models.
A 57-subject multiple-choice test measuring how broadly an AI model has acquired world knowledge and problem-solving ability.
A harder version of MMLU with ten answer choices and reasoning-focused questions, where scores dropped 16 to 33 percent.
A 2023 benchmark of 11,500 college-level multimodal questions across 30 subjects, where GPT-4V scored only 56%.
A hardened version of MMMU that strips out questions text-only models can guess and adds a vision-only mode.
A dataset of 70,000 handwritten digit images that became the default first benchmark for computer vision and machine learning.
Microsoft's reading-comprehension and retrieval dataset built from a million real Bing search queries with human-written answers and millions of passages.
A set of multi-turn questions graded by a strong model acting as judge, used to score chat assistants on open-ended replies.
A broad benchmark and leaderboard that tests text embedding models across 8 task types, 58 datasets, and 112 languages.
MultiMedQA, introduced with Med-PaLM in 2023, bundles seven medical question-answering sets to test how well LLMs encode clinical knowledge.
NASA's simulated run-to-failure turbofan dataset became the standard benchmark for predicting an engine's remaining useful life.
Greg Kamradt's simple test that plants a fact in long text and checks if a model can retrieve it.
nuScenes is a 2019 driving dataset with the full sensor suite, 6 cameras, 5 radars, and a lidar, across 1,000 annotated scenes.
Hugging Face's public ranking that scored open models on the same reproducible benchmarks, drawing over two million visitors.
A 2024 benchmark of 369 real computer tasks across Ubuntu, Windows, and macOS where humans scored 72% and the best agent only 12%.
A Penn-built corpus of syntactically annotated English that became the standard training and test set for parsing and language modeling.
A planning benchmark from the automated-planning community that exposes how poorly LLMs generate valid plans.
A METR benchmark of seven open-ended ML research tasks that compares AI agents against human experts on AI R&D work.
A benchmark for repository-level code completion that tests retrieval and prediction across multiple files.
A synthetic long-context benchmark that finds most models hold far less context than they advertise.
An OpenAI benchmark of 4,326 short factual questions that measures whether a model knows what it knows.
A Stanford reading-comprehension benchmark of 100,000-plus questions that helped drive the pre-Transformer era of NLP.
A 2021 benchmark and leaderboard that tests one frozen speech model across many tasks, measuring how general its representations are.
A harder successor to GLUE, created after models surpassed non-expert humans on the original language-understanding benchmark.
A benchmark that tests whether AI systems can resolve real GitHub issues by editing real codebases, graded by the projects' own tests.
A human-validated 500-task subset of SWE-bench, built with OpenAI, that became the headline measure of AI coding agents.
A 2024 Sierra benchmark that tests tool-using agents in simulated customer conversations and measures how consistently they succeed.
ALE turned dozens of Atari 2600 games into a standard testbed that became the dominant benchmark for deep RL agents.
Spyros Makridakis's open forecasting competitions that empirically test which methods actually predict best on thousands of real time series.
A 2024 CMU benchmark of 175 real workplace tasks in a simulated software company, where the best agent finished only about 30% autonomously.
A 2021 benchmark of 817 questions where the largest models were often the least truthful, mimicking human misconceptions.
A dataset of 108,077 images with dense crowdsourced annotations of objects, attributes, and relationships, aimed at visual reasoning rather than recognition alone.
A visual-spatial benchmark testing whether multimodal models can understand and recall spaces from video.
Waymo's 2019 open dataset released high-quality lidar and camera data from 1,150 driving scenes for autonomous-driving perception research.
A 2023 CMU benchmark of realistic, self-hosted websites that exposed how poorly LLM agents perform real web tasks.
Google's web-scale multilingual image-text dataset, around 10 billion images with text in over 100 languages, built to train the PaLI vision-language model.
A 2024 benchmark and agent that uses screenshots and multimodal models to complete tasks on 15 real websites.
The Winograd Schema Challenge tests commonsense reasoning with pronoun puzzles, proposed in 2012 as an alternative to the Turing test.
A 2019 benchmark of 44,000 Winograd-style pronoun puzzles, scaled up and debiased to test real commonsense reasoning.
A benchmark of 3,668 questions that proxies hazardous biology, cyber, and chemistry knowledge in language models, paired with an unlearning method.
WMT is the annual machine translation conference whose shared tasks and common test sets have set the field's evaluation standards since 2006.
Princeton's hand-built lexical database of English that organized words into concept sets, underpinned decades of NLP, and supplied the category hierarchy for ImageNet.