After the widely reported claim that GPT-4 scored near the 90th percentile of human test-takers on the Uniform Bar Exam (UBE), MIT researcher Eric Martinez released “Re-evaluating GPT-4’s Bar Exam Performance” in May 2023; the paper was later published in the journal Artificial Intelligence and Law in 2024. It did not dispute that GPT-4 passed. It challenged how high the model’s percentile rank really was, arguing that the 90th-percentile figure was a methodological artifact rather than a measure of standing against practicing lawyers.
Martinez identified several issues. The percentile conversion behind the 90th-percentile claim was anchored to February bar-exam administrations, which are dominated by repeat test-takers who previously failed and who score well below the overall population, so any given score maps to a higher percentile than it would against a typical cohort. Using more representative data, Martinez estimated GPT-4’s overall UBE percentile at roughly the 68th, and its essay percentile at closer to the 48th. Compared only against people who actually passed the exam (the relevant peer group for a tool meant to do lawyers’ work), he estimated GPT-4 fell to around the 48th percentile overall and the 15th percentile on essays.
The episode became a frequently cited example of why benchmark claims about AI deserve scrutiny. The same raw performance can be framed as “top 10%” or “roughly average” depending entirely on the comparison group, and the difference shapes how much trust the public and professionals place in the system.
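To make the comparison-group effect concrete, here is a minimal Python sketch. Every number in it is invented for illustration and is not real bar-exam data: the score lists, the pass mark of 270, and the model score of 280 are all assumptions, chosen only to show how one fixed score can land at very different percentiles depending on who it is ranked against.

```python
from bisect import bisect_left


def percentile_rank(score, reference_scores):
    """Return the percentage of reference scores strictly below `score`."""
    ordered = sorted(reference_scores)
    return 100.0 * bisect_left(ordered, score) / len(ordered)


# Hypothetical scores, invented for illustration only (NOT real exam data).
PASS_MARK = 270  # assumed passing score
all_takers = [230, 245, 255, 262, 268, 272, 278, 285, 292, 305]
# Stand-in for a sitting dominated by lower-scoring repeat test-takers.
retaker_heavy = [220, 232, 240, 248, 255, 260, 266, 271, 279, 290]
# Only the people in the general pool who met the pass mark.
passers = [s for s in all_takers if s >= PASS_MARK]

model_score = 280  # one fixed score, ranked three different ways

for label, group in [
    ("retaker-heavy sitting", retaker_heavy),
    ("all test-takers", all_takers),
    ("passers only", passers),
]:
    print(f"{label}: {percentile_rank(model_score, group):.0f}th percentile")
# prints:
# retaker-heavy sitting: 90th percentile
# all test-takers: 70th percentile
# passers only: 40th percentile
```

The score never changes between the three lines of output; only the reference group does. That is the mechanism at issue in the paper, although the actual estimates rest on real score distributions and a more careful method than this toy ranking.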
Why business readers should care: a vendor’s benchmark headline (“90th percentile,” “passes the bar”) can be technically defensible and still badly mislead about real-world capability. The right question is always: compared to whom, and on which subset of the task? Martinez’s paper is a clean template for interrogating an impressive-sounding AI score.