SWE-bench
SWE-bench is a benchmark that evaluates large language models on real-world software engineering tasks using GitHub issues and their corresponding fixes from 12 popular Python repositories. It comprises 2,294 instances in which an AI system must generate a patch that resolves the issue; patches are verified with fail-to-pass tests (which the fix must make pass) and pass-to-pass tests (which must keep passing). Released in October 2023, SWE-bench offers several subsets, including Lite, Verified, Multimodal, and Multilingual, to support different evaluation needs.

The project maintains leaderboards that rank models by the percentage of issues resolved, and it ships an evaluation harness that runs Docker-based evaluation environments and grades patches automatically. The benchmark targets developers and researchers who want to assess or train AI models on software engineering tasks, particularly issue resolution and patch generation. The Verified subset contains 500 human-validated, solvable instances for more reliable evaluation; the Multimodal subset adds visual elements such as screenshots and diagrams; and the Multilingual subset spans multiple programming languages across various repositories. Evaluating on the full dataset requires significant compute, so the Lite subset offers a smaller, more accessible option.
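The fail-to-pass / pass-to-pass criterion above can be sketched in a few lines. This is a simplified illustration of the grading rule, not the harness's actual API; the function and test names are hypothetical.

```python
# Simplified sketch of SWE-bench's resolution criterion: an instance is
# resolved only if every fail-to-pass test now passes AND every
# pass-to-pass test still passes after the model's patch is applied.
# (Illustrative only; the real harness derives these results from
# running the repository's test suite inside Docker.)

def is_resolved(fail_to_pass: dict[str, bool], pass_to_pass: dict[str, bool]) -> bool:
    """Each dict maps a test identifier to whether it passed post-patch."""
    return all(fail_to_pass.values()) and all(pass_to_pass.values())

# A patch that fixes the issue without breaking existing behavior:
print(is_resolved({"test_issue_repro": True}, {"test_existing": True}))   # True
# A patch that fixes the issue but regresses an existing test:
print(is_resolved({"test_issue_repro": True}, {"test_existing": False}))  # False
```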
SWE-bench benchmarks AI models on real-world software engineering tasks using GitHub issue-fix pairs from popular Python repositories.
Evaluating AI Models for Bug Fixing
Researchers can use SWE-bench to test how well their large language models generate patches that fix real GitHub issues.
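To evaluate a model, its generated patches are collected into a predictions file that the harness grades. The sketch below writes such a file in the general shape described in the SWE-bench README (records with an instance ID, a model name, and a unified-diff patch); the instance ID, model name, and diff here are fabricated for illustration.

```python
import json

# Hedged sketch: assemble a predictions file for the SWE-bench harness.
# Field names follow the SWE-bench README; the concrete values below
# are made up for illustration.
predictions = [
    {
        "instance_id": "example__repo-1234",   # hypothetical instance ID
        "model_name_or_path": "my-model",      # hypothetical model name
        "model_patch": (                       # fabricated unified diff
            "diff --git a/pkg/mod.py b/pkg/mod.py\n"
            "--- a/pkg/mod.py\n"
            "+++ b/pkg/mod.py\n"
            "@@ -1 +1 @@\n"
            "-buggy = True\n"
            "+buggy = False\n"
        ),
    }
]

with open("predictions.json", "w") as f:
    json.dump(predictions, f, indent=2)
```

At the time of writing, the harness is invoked roughly as `python -m swebench.harness.run_evaluation --predictions_path predictions.json --dataset_name princeton-nlp/SWE-bench_Lite --run_id my-run`; check the SWE-bench repository for the current flags.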
Training AI Agents on Software Engineering Tasks
Developers can leverage the SWE-smith dataset and SWE-bench subsets to train models on realistic issue resolution scenarios.