The Solution - swebench

SWE-bench is a benchmark for evaluating large language models on real-world software engineering tasks. It is built from GitHub issues and their corresponding fixes in 12 popular Python repositories and contains 2,294 task instances. For each instance, an AI system generates a patch to resolve the issue, and the patch is verified with two sets of tests: fail-to-pass tests, which fail before the fix and must pass after it, and pass-to-pass tests, which pass before the fix and must still pass after it.
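
The fail-to-pass and pass-to-pass check can be illustrated with a minimal sketch. This is not the benchmark's actual evaluation harness: the function name `is_resolved`, the test names, and the dictionary of results are all hypothetical, and the sketch assumes the tests have already been run against the patched repository and reduced to a pass/fail flag per test.

```python
from typing import Dict, Iterable


def is_resolved(
    fail_to_pass: Iterable[str],
    pass_to_pass: Iterable[str],
    results_after_patch: Dict[str, bool],
) -> bool:
    """Return True if every required test passes once the patch is applied.

    fail_to_pass: tests that failed before the fix and must now pass.
    pass_to_pass: tests that passed before the fix and must keep passing.
    results_after_patch: hypothetical mapping of test name -> passed?
    """
    required = list(fail_to_pass) + list(pass_to_pass)
    # A test that is missing from the results is treated as a failure.
    return all(results_after_patch.get(name, False) for name in required)


# Hypothetical test names and outcomes, for illustration only.
fail_to_pass = ["test_issue_is_fixed"]
pass_to_pass = ["test_existing_behaviour"]

good_patch = {"test_issue_is_fixed": True, "test_existing_behaviour": True}
regressing_patch = {"test_issue_is_fixed": True, "test_existing_behaviour": False}

print(is_resolved(fail_to_pass, pass_to_pass, good_patch))
print(is_resolved(fail_to_pass, pass_to_pass, regressing_patch))
```

Under these assumptions, a patch counts as resolving the issue only when it fixes the target behaviour without breaking anything that previously worked, which is why the second patch is rejected even though it makes the fail-to-pass test succeed.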