Elo Rating for Models

Elo rating is a system, originally designed for chess, that assigns each competitor a numerical score based on the outcomes of head-to-head matches. Beating a higher-rated opponent raises your score more than beating a lower-rated one, and losing to a lower-rated opponent costs more than losing to a higher-rated one. The AI field borrowed this idea to rank language models: instead of scoring each model on a fixed test, you give two models the same prompt, let a human pick the better answer, and update both ratings based on which model won.
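
The update rule can be sketched in a few lines of Python. This is a minimal illustration of the classic chess-style Elo formula, not the code of any particular leaderboard; the function names, the starting rating of 1000, and the step size `k=32` are assumptions chosen for the example.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo logistic curve."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return the new (rating_a, rating_b) after one battle.

    score_a is 1.0 if A won, 0.0 if A lost, and 0.5 for a tie.
    k (assumed 32 here) controls how far a single result moves the ratings.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    # Whatever A gains, B loses, so the total rating stays constant.
    return rating_a + delta, rating_b - delta


# Two evenly rated models: the winner gains 16 points, the loser drops 16.
even_a, even_b = update_elo(1000.0, 1000.0, score_a=1.0)

# An upset: a 1000-rated model beats a 1200-rated one and gains more than 16.
upset_a, upset_b = update_elo(1000.0, 1200.0, score_a=1.0)
```

Because the adjustment scales with how surprising the result was, an upset moves both ratings further than an expected win does.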

This approach was popularized by Chatbot Arena, described by Wei-Lin Chiang, Lianmin Zheng, Ion Stoica, and colleagues in “Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference,” posted in March 2024. The platform collects pairwise human votes at scale (the paper reports more than 240,000) and converts those battle outcomes into a leaderboard using statistical ranking methods in the Elo family. The authors showed that crowdsourced questions are diverse and discriminating enough to separate models, and that crowd votes agree well with those of expert raters.
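
One widely used member of that family is the Bradley-Terry model, which fits all ratings at once from the full set of votes instead of updating them one battle at a time. The sketch below is a minimal pure-Python fit using a standard iterative scheme; it is not the platform's actual pipeline. The function names, the toy vote data, and the 1000-point anchor are hypothetical, ties are ignored, and the fit assumes every model has at least one win and one loss.

```python
import math
from collections import defaultdict


def fit_bradley_terry(battles, iterations=200):
    """Estimate a strength per model from (winner, loser) pairs.

    Assumes every model has at least one win and one loss; otherwise the
    estimates drift toward zero or infinity. Ties are not handled.
    """
    models = sorted({m for pair in battles for m in pair})
    wins = {m: 0 for m in models}
    pair_counts = defaultdict(int)  # number of battles per unordered pair
    for winner, loser in battles:
        wins[winner] += 1
        pair_counts[frozenset((winner, loser))] += 1

    strength = {m: 1.0 for m in models}
    for _ in range(iterations):
        updated = {}
        for m in models:
            denom = sum(
                count / (strength[m] + strength[other])
                for pair, count in pair_counts.items() if m in pair
                for other in pair - {m}
            )
            updated[m] = wins[m] / denom
        # Rescale so the strengths keep an average of 1.
        total = sum(updated.values())
        strength = {m: s * len(models) / total for m, s in updated.items()}
    return strength


def to_elo_scale(strength, anchor=1000.0):
    """Map strengths onto an Elo-like scale (400 points per factor of 10)."""
    return {m: anchor + 400.0 * math.log10(s) for m, s in strength.items()}


# Hypothetical votes: each pair of models meets four times.
votes = (
    [("model_a", "model_b")] * 3 + [("model_b", "model_a")]
    + [("model_b", "model_c")] * 3 + [("model_c", "model_b")]
    + [("model_a", "model_c")] * 3 + [("model_c", "model_a")]
)
leaderboard = to_elo_scale(fit_bradley_terry(votes))
```

Fitting all votes jointly means the result does not depend on the order in which battles arrived, which is one reason a batch fit can be preferable to sequential Elo updates for a leaderboard.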

The appeal of rating models this way is that it largely sidesteps two problems of fixed benchmarks: saturation, where top models all score near the ceiling, and contamination, where test questions leak into training data. Because users bring their own fresh, varied prompts and judge the answers themselves, the ranking reflects real preferences rather than performance on a memorizable test set.

For a business reader, an Elo-style leaderboard answers a practical question that a single accuracy number cannot: when two models go head to head on everyday requests, which one do people actually prefer?
