Key Features - swebench

✨

Includes 2,294 GitHub issue-fix pairs from 12 popular Python repositories for realistic software engineering evaluation.

✨

Contains 500 human-validated instances (the Verified subset) confirmed to be solvable, each with fail-to-pass and pass-to-pass tests that check a fix works without breaking existing behavior.

✨

Offers Lite (300 instances), Multimodal (517 issues with visuals), and Multilingual (300 instances across 9 programming languages) subsets for varied testing needs.

✨

Tracks model performance by the percentage of issues resolved, a single metric that allows direct comparison of AI systems.

✨

Provides Docker-based environments, automated grading, and test specifications to facilitate reproducible evaluations.
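
As a rough illustration of how the fail-to-pass and pass-to-pass tests relate to the percentage-resolved metric, here is a minimal Python sketch. It is not the swebench grading code: the `InstanceResult` record, its field names, and the helper functions are hypothetical, and the rule that an instance counts as resolved only when every fail-to-pass and every pass-to-pass test passes is an assumption based on the descriptions above.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class InstanceResult:
    """Hypothetical record of test outcomes for one benchmark instance."""
    instance_id: str
    # Test name -> whether it passed after applying the model's patch.
    fail_to_pass: Dict[str, bool] = field(default_factory=dict)
    pass_to_pass: Dict[str, bool] = field(default_factory=dict)


def is_resolved(result: InstanceResult) -> bool:
    """Assumed rule: all fail-to-pass and all pass-to-pass tests must pass.

    An instance with no fail-to-pass tests is treated as unresolved,
    since nothing would demonstrate that the issue was fixed.
    """
    return (
        bool(result.fail_to_pass)
        and all(result.fail_to_pass.values())
        and all(result.pass_to_pass.values())
    )


def percent_resolved(results: List[InstanceResult]) -> float:
    """Return the share of resolved instances as a percentage (0 to 100)."""
    if not results:
        return 0.0
    resolved = sum(1 for r in results if is_resolved(r))
    return 100.0 * resolved / len(results)


# Made-up outcomes for four instances, for illustration only.
results = [
    InstanceResult("a", {"test_fix": True}, {"test_old": True}),
    InstanceResult("b", {"test_fix": False}, {"test_old": True}),
    InstanceResult("c", {"test_fix": True}, {"test_old": False}),
    InstanceResult("d", {"test_fix": True}),
]

print(f"{percent_resolved(results):.1f}% resolved")  # prints "50.0% resolved"
```

Instances `b` and `c` show the two ways a patch can fail under this assumed rule: not fixing the issue, or fixing it while breaking previously passing behavior.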