SWE-Bench Verified by OpenAI tests how well a model can solve real bugs in real Python code from GitHub.
These bugs are all public information, so the AI models have almost certainly been trained on the actual text of each bug and on its fix.
In “The SWE-Bench Illusion,” researchers at Purdue and Microsoft check whether various models are memorising rather than reasoning. [arXiv, PDF]
They gave the models the text of the issue and the name of the code repository, and nothing else, then asked for the path to the file that needs fixing.
The models usually gave the correct answer! But the issue text alone doesn’t tell you which file to fix, so the only way to get it right is to have already seen the answer in training.
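Here’s a minimal sketch of what that probe looks like, assuming the OpenAI Python SDK. The prompt wording, the helper function, and the example instance are mine, not the paper’s:

```python
# Sketch of the file-path memorisation probe: give the model only the
# issue text and the repo name, ask which file needs the fix, and check
# the reply against the file touched by the gold patch.
from openai import OpenAI  # assumes the OpenAI Python SDK is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def probe_file_path(repo: str, issue_text: str, gold_file: str,
                    model: str = "o3-mini") -> bool:
    prompt = (
        f"Repository: {repo}\n"
        f"Issue:\n{issue_text}\n\n"
        "Reply with only the path of the single file that must be "
        "edited to fix this issue."
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    predicted = (response.choices[0].message.content or "").strip()
    # Count it as a hit if the gold file path appears in the reply.
    return gold_file in predicted

# Hypothetical usage on a SWE-Bench-style instance:
# hit = probe_file_path("django/django", issue_text, "django/db/models/query.py")
```

Run that over the benchmark and the hit rate tells you how often the model can name the buggy file without ever seeing the code.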
OpenAI’s o3‑mini scored 76% when they gave it just the issue text and the repository name. Now that’s what I call vibe-coding.
All public code on GitHub will be in the training data. When they tested on bugs not in SWE-Bench, the success rate dropped to 57‑71% on random items, and 50‑68% on fresh issues created after the benchmark snapshot. I’m surprised they did that well.
The researchers recommend better benchmarks “to ensure that reported progress reflects genuine advances in software engineering capabilities rather than dataset-specific artifacts.”
I don’t expect that to happen. The vendors will keep selling code assistants to management on the dream of not paying programmers, complete with chatbot-generated LinkedIn posts about how great the AI codebots are. Real-world performance never really mattered.