COR Brief
Code & Development

SWE-bench

SWE-bench is a benchmark that evaluates large language models on real-world software engineering tasks using GitHub issues and their corresponding fixes drawn from 12 popular Python repositories. It comprises 2,294 instances in which an AI system must generate a patch that resolves the issue; candidate patches are verified with fail-to-pass tests (previously failing tests must now pass) and pass-to-pass tests (existing tests must keep passing). Released in October 2023, SWE-bench offers several subsets (Lite, Verified, Multimodal, and Multilingual) to support different evaluation needs.

Leaderboards track model performance by the percentage of issues resolved, and a harness provides Docker-based evaluation environments with automated grading. The benchmark targets developers and researchers who want to assess or train AI models on software engineering tasks, particularly issue resolution and patch generation. The Verified subset contains 500 human-validated, solvable instances for more reliable evaluation; the Multimodal subset adds visual elements such as screenshots and diagrams; and the Multilingual subset spans multiple programming languages and repositories. Because the full dataset requires significant compute, the Lite subset offers a smaller, more accessible starting point.
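The fail-to-pass / pass-to-pass check described above can be sketched in plain Python. This is an illustrative simplification, not the harness's actual API; the function and test names are hypothetical:

```python
def is_resolved(fail_to_pass, pass_to_pass):
    """A patch resolves an instance only if every FAIL_TO_PASS test now
    passes (the issue is fixed) and every PASS_TO_PASS test still passes
    (no regressions were introduced)."""
    return all(fail_to_pass.values()) and all(pass_to_pass.values())

# Hypothetical outcomes for one instance after applying a model patch:
f2p = {"test_issue_repro": True}             # previously failing test now passes
p2p = {"test_core": True, "test_api": True}  # existing tests still pass
print(is_resolved(f2p, p2p))  # True
```

If any pass-to-pass test regresses, the instance counts as unresolved even when the originally failing test is fixed.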

Updated Feb 8, 2026

SWE-bench benchmarks AI models on real-world software engineering tasks using GitHub issue-fix pairs from popular Python repositories.

Pricing
unknown
Category
Code & Development
01
Includes 2,294 GitHub issue-fix pairs from 12 popular Python repositories for realistic software engineering evaluation.
02
Contains 500 human-validated solvable instances with fail-to-pass and pass-to-pass tests to ensure task solvability.
03
Offers Lite (300 instances), Multimodal (517 issues with visuals), and Multilingual (300 instances in 9 languages) subsets for varied testing needs.
04
Tracks model performance using the percentage of issues resolved metric, allowing direct comparison of AI systems.
05
Provides Docker-based environments, automated grading, and test specifications to facilitate reproducible evaluations.
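The % Resolved leaderboard metric in point 04 is simply the share of benchmark instances whose model patch passed the full verification. A minimal sketch (the instance IDs shown are illustrative):

```python
def percent_resolved(results):
    """% Resolved = resolved instances / total instances * 100,
    where an instance is resolved only if its patch passed both the
    fail-to-pass and pass-to-pass checks."""
    resolved = sum(1 for ok in results.values() if ok)
    return 100.0 * resolved / len(results)

# Hypothetical per-instance outcomes (instance_id -> resolved?):
outcomes = {
    "django__django-11099": True,
    "sympy__sympy-20590": False,
    "astropy__astropy-14365": True,
}
print(round(percent_resolved(outcomes), 1))  # 66.7
```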

Evaluating AI Models for Bug Fixing

Researchers can use SWE-bench to test how well their large language models generate patches that fix real GitHub issues.

Training AI Agents on Software Engineering Tasks

Developers can leverage the SWE-smith dataset and SWE-bench subsets to train models on realistic issue resolution scenarios.

1
Access the Website
Visit https://www.swebench.com to explore available datasets and leaderboards.
2
Download a Dataset Subset
Start with SWE-bench Lite (300 instances) for initial evaluation to reduce compute requirements.
3
Set Up Evaluation Environment
Use the Harness API to configure Docker environments, run tests, and generate patches.
4
Submit Results
Submit your predictions.json file with model-generated patches to the leaderboard to obtain % Resolved scores.
5
Request Custom Support
Contact support@swebench.com for custom datasets or to contribute to the benchmark.
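Step 4 above expects a predictions.json file of model-generated patches. A commonly documented shape pairs each instance ID with a model name and a unified diff; the snippet below is a sketch of that format, and the exact schema should be confirmed against the current harness documentation:

```python
import json

# Illustrative prediction entry; the diff content is truncated and the
# field values are placeholders, not real output.
predictions = [
    {
        "instance_id": "django__django-11099",
        "model_name_or_path": "my-model",
        "model_patch": "diff --git a/contrib/auth/validators.py ...",
    },
]

# Write the file that would be submitted to the leaderboard.
with open("predictions.json", "w") as f:
    json.dump(predictions, f, indent=2)

# Reload to confirm the file is valid JSON with the expected keys.
with open("predictions.json") as f:
    loaded = json.load(f)
print(sorted(loaded[0].keys()))
```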
Pricing
Model: unknown

No pricing information is available; datasets and leaderboards appear freely accessible via the official website.

Assessment
Strengths
  • Uses real GitHub issues and fixes from popular repositories for realistic evaluation.
  • Multiple dataset subsets support different evaluation scopes and modalities.
  • Leaderboards enable direct comparison of model performance via a clear % Resolved metric.
  • Harness API offers Docker-based reproducibility and automated grading.
  • Verified subset ensures tasks are confirmed solvable, reducing evaluation noise.
Limitations
  • Full benchmark requires significant compute resources for evaluation.
  • Multimodal test evaluation depends on SWE-bench API, limiting standalone use.
  • No out-of-the-box support for running individual tests without modifying the harness.
Alternatives