Strengths
- Uses real GitHub issues and fixes from popular repositories for realistic evaluation.
- Multiple dataset subsets (e.g., the full benchmark, Lite, Verified, and Multimodal) support different evaluation scopes and modalities.
- Public leaderboards enable direct comparison of models via a single % Resolved metric: the percentage of task instances whose tests pass after applying the model's patch.
- Harness API offers Docker-based reproducibility and automated grading.
- Verified subset ensures tasks are confirmed solvable, reducing evaluation noise.
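The % Resolved metric is straightforward to compute from per-instance grades. A minimal illustrative sketch (the `report` mapping below is a simplified, hypothetical stand-in for the harness's actual per-instance output):

```python
# Compute % Resolved: the share of task instances whose model patch
# passed grading. The report structure is a hypothetical simplification.
def percent_resolved(report: dict[str, bool]) -> float:
    """Given a map of instance_id -> resolved flag, return % resolved."""
    if not report:
        return 0.0
    return 100.0 * sum(report.values()) / len(report)

report = {
    "astropy__astropy-12907": True,
    "django__django-11099": True,
    "sympy__sympy-20590": False,
    "flask__flask-4045": False,
}
print(f"{percent_resolved(report):.1f}% resolved")  # prints "50.0% resolved"
```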
Limitations
- Evaluating the full benchmark requires substantial compute and disk, since each task instance is graded inside its own Docker environment.
- Multimodal test evaluation depends on SWE-bench API, limiting standalone use.
- No out-of-the-box support for running individual tests without modifying the harness.
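For context on how the harness is driven: model outputs are supplied as a predictions file, one record per task instance. A hedged sketch of assembling one (field names follow the SWE-bench documentation at the time of writing; verify them against the harness version you have installed):

```python
import json

# Build a SWE-bench predictions file: one record per task instance.
# Field names follow the harness's documented schema; the patch string
# here is a placeholder, not a real diff.
predictions = [
    {
        "instance_id": "astropy__astropy-12907",    # task being patched
        "model_name_or_path": "my-model",           # label shown in the report
        "model_patch": "diff --git a/... b/...\n",  # unified diff to apply
    },
]

with open("preds.json", "w") as f:
    json.dump(predictions, f, indent=2)
```

The file is then passed to the harness entry point (e.g. `python -m swebench.harness.run_evaluation --predictions_path preds.json ...`), with Docker running locally.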