Strengths & Limitations

Balanced assessment

Strengths

  • Uses real GitHub issues and fixes from popular repositories for realistic evaluation.
  • Multiple dataset subsets support different evaluation scopes and modalities.
  • Leaderboards enable direct comparison of model performance via a clear % Resolved metric.
  • Harness API offers Docker-based reproducibility and automated grading.
  • Verified subset contains only human-validated tasks confirmed to be solvable, reducing evaluation noise.
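
As a rough illustration of the % Resolved metric mentioned above, the sketch below computes the share of task instances whose fix passed grading. The function name `percent_resolved`, the `results` mapping, and the instance IDs are hypothetical and are not part of the SWE-bench harness; the sketch assumes each instance has already been graded as resolved or not.

```python
from typing import Mapping, Optional


def percent_resolved(
    results: Mapping[str, bool],
    total_instances: Optional[int] = None,
) -> float:
    """Return the percentage of task instances marked as resolved.

    results maps a (hypothetical) instance ID to True if the model's patch
    passed grading. total_instances is an assumed option for scoring against
    the whole subset, so instances with no submission count as unresolved.
    """
    denominator = len(results) if total_instances is None else total_instances
    if denominator <= 0:
        return 0.0
    resolved = sum(1 for passed in results.values() if passed)
    return 100.0 * resolved / denominator


# Hypothetical graded outcomes for four instances.
example_results = {
    "repo-a__1": True,
    "repo-b__2": False,
    "repo-c__3": True,
    "repo-d__4": False,
}

print(percent_resolved(example_results))                     # scored over submitted instances
print(percent_resolved(example_results, total_instances=8))  # scored over an 8-task subset
```

Passing the subset size explicitly keeps scores comparable between runs that skipped or failed to submit some instances.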

Limitations

  • Full benchmark evaluation requires significant compute and disk resources, since the harness runs tasks in Docker environments.
  • Multimodal test evaluation depends on SWE-bench API, limiting standalone use.
  • No out-of-the-box support for running individual tests without modifying the harness.