SWE-bench
SWE-bench is a benchmark that evaluates large language models on real-world software engineering tasks using GitHub issues and their corresponding fixes from 12 popular Python repositories. It comprises 2,294 instances in which an AI system must generate a patch that resolves the issue; patches are verified with fail-to-pass tests (which the fix must make pass) and pass-to-pass tests (which must keep passing). Released in October 2023, SWE-bench offers several subsets, including Lite, Verified, Multimodal, and Multilingual, to support different evaluation needs.

The project maintains leaderboards that rank models by the percentage of issues resolved, and it ships an evaluation harness that runs Docker-based evaluation environments and grades patches automatically. The benchmark targets developers and researchers who want to assess or train AI models on software engineering tasks, particularly issue resolution and patch generation. The Verified subset contains 500 human-validated, solvable instances for more reliable evaluation; the Multimodal subset adds visual elements such as screenshots and diagrams; and the Multilingual subset spans multiple programming languages across various repositories. Evaluating on the full dataset requires significant compute, so the Lite subset offers a smaller, more accessible option.
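The fail-to-pass / pass-to-pass criterion above can be sketched in a few lines. This is a simplified illustration of the grading rule, not the harness's actual API; the function and test names are hypothetical.

```python
# Simplified sketch of SWE-bench's resolution criterion: an instance is
# resolved only if every fail-to-pass test now passes AND every
# pass-to-pass test still passes after the model's patch is applied.
# (Illustrative only; the real harness derives these results from
# running the repository's test suite inside Docker.)

def is_resolved(fail_to_pass: dict[str, bool], pass_to_pass: dict[str, bool]) -> bool:
    """Each dict maps a test identifier to whether it passed post-patch."""
    return all(fail_to_pass.values()) and all(pass_to_pass.values())

# A patch that fixes the issue without breaking existing behavior:
print(is_resolved({"test_issue_repro": True}, {"test_existing": True}))   # True
# A patch that fixes the issue but regresses an existing test:
print(is_resolved({"test_issue_repro": True}, {"test_existing": False}))  # False
```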
SWE-bench benchmarks AI models on real-world software engineering tasks using GitHub issue-fix pairs from popular Python repositories.
Evaluating AI Models for Bug Fixing
Researchers can use SWE-bench to test how well their large language models generate patches that fix real GitHub issues.
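To evaluate a model, its generated patches are collected into a predictions file that the harness grades. The sketch below writes such a file in the general shape described in the SWE-bench README (records with an instance ID, a model name, and a unified-diff patch); the instance ID, model name, and diff here are fabricated for illustration.

```python
import json

# Hedged sketch: assemble a predictions file for the SWE-bench harness.
# Field names follow the SWE-bench README; the concrete values below
# are made up for illustration.
predictions = [
    {
        "instance_id": "example__repo-1234",   # hypothetical instance ID
        "model_name_or_path": "my-model",      # hypothetical model name
        "model_patch": (                       # fabricated unified diff
            "diff --git a/pkg/mod.py b/pkg/mod.py\n"
            "--- a/pkg/mod.py\n"
            "+++ b/pkg/mod.py\n"
            "@@ -1 +1 @@\n"
            "-buggy = True\n"
            "+buggy = False\n"
        ),
    }
]

with open("predictions.json", "w") as f:
    json.dump(predictions, f, indent=2)
```

At the time of writing, the harness is invoked roughly as `python -m swebench.harness.run_evaluation --predictions_path predictions.json --dataset_name princeton-nlp/SWE-bench_Lite --run_id my-run`; check the SWE-bench repository for the current flags.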
Training AI Agents on Software Engineering Tasks
Developers can leverage the SWE-smith dataset and SWE-bench subsets to train models on realistic issue resolution scenarios.