Evaluating Large Language Models Trained on Code (Codex)

“Evaluating Large Language Models Trained on Code,” posted to arXiv in July 2021 by Mark Chen and 57 co-authors at OpenAI, introduced Codex - a GPT language model fine-tuned on publicly available code from GitHub - and, with it, the HumanEval benchmark. The paper made the case that a general language model, given enough code, could write working programs from natural-language descriptions, and it set out a way to measure that rigorously.

HumanEval is the lasting contribution: a set of 164 hand-written Python programming problems, each with a function signature, a docstring, and unit tests that are withheld from the model's prompt. Performance is reported as pass@k - the chance that at least one of k sampled solutions passes all the tests - which scores functional correctness rather than text similarity. Codex solved 28.8 percent of the problems on the first try (pass@1), and 70.2 percent when allowed 100 samples per problem, against 0 percent for the original GPT-3.
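To make the metric concrete, here is a minimal Python sketch of functional-correctness scoring. It assumes the estimator described in the paper: draw n >= k samples per problem, count the c samples that pass every test, and compute 1 - C(n-c, k) / C(n, k). The toy `add` task, the sample completions, and the helper name `passes_tests` are hypothetical illustrations, not part of HumanEval or the paper's released harness.

```python
from math import comb

# Hypothetical toy task in the HumanEval style (not taken from the benchmark):
# the model sees PROMPT and must write the function body; TESTS stay unseen.
PROMPT = 'def add(a, b):\n    """Return the sum of a and b."""\n'
TESTS = "assert add(1, 2) == 3\nassert add(-1, 1) == 0\n"


def passes_tests(completion: str) -> bool:
    """Return True if prompt + completion runs and satisfies every assertion.

    Bare exec is used for brevity only; untrusted model output needs a real
    sandbox and a timeout.
    """
    namespace: dict = {}
    try:
        exec(PROMPT + completion + "\n" + TESTS, namespace)
    except Exception:
        # Syntax errors, runtime errors, and failed assertions all count as failures.
        return False
    return True


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n samples of which c passed: 1 - C(n-c, k) / C(n, k)."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so any k samples include a passing one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Three made-up completions: one correct, one wrong, one that does not parse.
samples = ["    return a + b", "    return a - b", "    return a +"]
n = len(samples)
c = sum(passes_tests(s) for s in samples)

print(f"pass@1 = {pass_at_k(n, c, 1):.3f}")
print(f"pass@2 = {pass_at_k(n, c, 2):.3f}")
```

With one passing sample out of three, the sketch reports pass@1 = 0.333 and pass@2 = 0.667. A benchmark score is then the mean of this per-problem estimate over all problems.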

The paper also disclosed that a distinct production version of Codex powered GitHub Copilot, the AI pair programmer that had launched in preview shortly before. It discussed limitations and risks as well - misalignment, insecure code, and the legal and economic questions raised by training on public code. Together, Codex and HumanEval opened the modern era of AI code generation, which later benchmarks such as SWE-bench would push much further.

Last verified June 7, 2026