GSM8K (short for Grade School Math 8K) is a benchmark of grade-school-level math word problems used to measure whether a model can carry out multi-step reasoning. It was introduced in the 2021 OpenAI paper “Training Verifiers to Solve Math Word Problems” by Karl Cobbe and colleagues, who describe it as “a dataset of 8.5K high quality linguistically diverse grade school math word problems.” The problems are easy for a person, since they involve only basic arithmetic, but each requires several reasoning steps in sequence. That made them surprisingly hard for early language models, which tended to jump straight to an answer.
The paper’s other contribution helped shape how reasoning is evaluated and improved. Rather than trusting a single model output, the authors trained a separate “verifier” to judge candidate solutions, then had the model “generate many candidate solutions and select the one ranked highest by the verifier.” The authors reported that this generate-and-check approach improved more with additional training data than a fine-tuning baseline did, and it previewed later ideas about getting models to reason more carefully and check their own work.
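The generate-and-check idea can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: `best_of_n`, `toy_generate`, and `toy_verifier_score` are hypothetical names, the sample count is arbitrary, and the toy verifier is a simple arithmetic check standing in for the trained verifier model the paper describes.

```python
import itertools
from typing import Callable, List


def best_of_n(
    problem: str,
    generate: Callable[[str], str],
    verifier_score: Callable[[str, str], float],
    n: int = 4,  # arbitrary default chosen for this sketch
) -> str:
    """Sample n candidate solutions and return the one the verifier ranks highest."""
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates: List[str] = [generate(problem) for _ in range(n)]
    return max(candidates, key=lambda solution: verifier_score(problem, solution))


# Toy stand-ins (assumptions for illustration only).
# The "generator" cycles through canned answers instead of sampling from a model.
_canned = itertools.cycle(["3 + 4 = 8", "3 + 4 = 7", "3 + 4 = 6"])


def toy_generate(problem: str) -> str:
    return next(_canned)


def toy_verifier_score(problem: str, solution: str) -> float:
    """Score 1.0 if the stated sum is correct, else 0.0 (a real verifier is learned)."""
    left, right = solution.split("=")
    a, b = (int(term) for term in left.split("+"))
    return 1.0 if a + b == int(right) else 0.0


best = best_of_n("What is 3 + 4?", toy_generate, toy_verifier_score, n=3)
print(best)  # prints "3 + 4 = 7"
```

The design point is that the generator and the judge are separate: the generator may be wrong most of the time, and the result is still correct as long as at least one candidate is right and the verifier ranks it first.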
GSM8K became a standard entry-level reasoning benchmark and a frequent demonstration of chain-of-thought prompting, in which a model is asked to work through the intermediate steps before answering, a change that sharply improves accuracy on exactly this kind of problem. It sits below competition-level benchmarks like MATH and AIME in difficulty, forming the entry rung of the math-reasoning ladder.
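To make the contrast concrete, here is a small Python sketch of a direct prompt next to a chain-of-thought prompt. Everything in it is an assumption for illustration: the word problem was written for this sketch and is not taken from GSM8K, and the prompt wording and function names are hypothetical rather than a standard template.

```python
# Hypothetical word problem written for this sketch; it is not taken from GSM8K.
PROBLEM = (
    "Maya buys 4 packs of pencils with 6 pencils in each pack. "
    "She gives 5 pencils to a friend. How many pencils does she have left?"
)


def direct_prompt(problem: str) -> str:
    """Ask for the answer alone, which invites skipping the intermediate steps."""
    return f"Question: {problem}\nAnswer with a single number."


def chain_of_thought_prompt(problem: str) -> str:
    """Ask for the reasoning first, then the final answer."""
    return (
        f"Question: {problem}\n"
        "Work through the problem step by step, then state the final number."
    )


# The two reasoning steps a step-by-step response would need to cover.
pencils_bought = 4 * 6             # step 1: total pencils purchased
pencils_left = pencils_bought - 5  # step 2: subtract the pencils given away

print(chain_of_thought_prompt(PROBLEM))
print(pencils_left)  # prints 19
```

Each step is trivial on its own; the difficulty the benchmark probes is doing both, in order, without dropping one.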
For business readers, GSM8K is a clean illustration of why “the model can reason step by step” matters: the failure mode it exposed, confidently skipping steps and getting the wrong answer, is the same one that shows up in real analytical tasks. Frontier models now score very highly on GSM8K, so it no longer separates the strongest systems. Current scores live on leaderboards and change frequently, so they are not reproduced here.