SWE-bench

SWE-bench measures whether an AI system can do real software engineering work: given an actual GitHub issue and the full codebase it belongs to, the model must produce a code change that fixes the problem. Success is judged by running the repository’s own test suite, so the bar is whether the fix actually works in a real project, not whether it looks correct.
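The pass/fail judgment described above can be sketched in a few lines. This is a minimal illustration, not the official evaluation harness: the function names `is_resolved` and `resolve_rate`, the test identifiers, and the dictionary layout are all hypothetical. It assumes, following the paper's description, that each task lists tests that must go from failing to passing once the fix is applied, plus tests that already passed and must keep passing.

```python
from typing import Dict, Iterable


def is_resolved(
    test_results: Dict[str, bool],
    fail_to_pass: Iterable[str],
    pass_to_pass: Iterable[str],
) -> bool:
    """Return True if every required test passed after the patch was applied.

    test_results maps a test name to True (passed) or False (failed).
    A required test that is missing from the results counts as a failure.
    """
    required = list(fail_to_pass) + list(pass_to_pass)
    return all(test_results.get(name, False) for name in required)


def resolve_rate(outcomes: Iterable[bool]) -> float:
    """Percentage of task instances whose patch was judged resolved."""
    outcomes = list(outcomes)
    if not outcomes:
        return 0.0
    return 100.0 * sum(outcomes) / len(outcomes)


# Hypothetical test names, for illustration only.
results = {"test_bug_is_fixed": True, "test_existing_behavior": True}
print(is_resolved(results, ["test_bug_is_fixed"], ["test_existing_behavior"]))
print(resolve_rate([True, False, False, False]))
```

Under these assumptions, a patch that fixes the reported bug but breaks a previously passing test is not counted as resolved, which is why the score reflects working fixes rather than plausible-looking ones.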

The benchmark was introduced by Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan in “SWE-bench: Can Language Models Resolve Real-World GitHub Issues?”, posted in October 2023 and accepted at ICLR 2024. It comprises 2,294 task instances drawn from real issues and their corresponding pull requests in 12 popular Python repositories. When first published, the strongest model evaluated, Claude 2, resolved only 1.96 percent of issues, underscoring how hard the task was.

SWE-bench became a leading yardstick for autonomous coding agents because it tests end-to-end, real-world problem solving rather than isolated puzzles. The official site at swebench.com hosts multiple variants, including Verified, Lite, Full, Multimodal, and Multilingual.

Current resolution rates are published on the official leaderboards and move quickly, so they are not listed here.
