SWE-bench Failures: When Coding Agents Spiral Into 693 Lines of Hallucinations
When coding models spiral into self-reinforcing hallucinations, small mistakes compound into catastrophic failure. On SWE-bench, we saw SOTA models invent whole classes, methods, and terminal outputs, never realizing they had lost touch with the real codebase. In this case study, we’ll look at how three frontier coding agents tried to solve one particular SWE-bench problem: one spiraled into hallucinations and failed entirely, one spiraled but recovered, and one avoided hallucinations altogether. Our goal: to illustrate how dissecting real-world problems can steer models towards human-ready AGI.