Comprehensive Dataset
Includes 2,294 GitHub issue-fix pairs from 12 popular Python repositories for realistic software engineering evaluation.
Verified Subset
Contains 500 instances that human reviewers confirmed to be solvable, each with fail-to-pass and pass-to-pass tests for checking a fix.
Multiple Subsets
Offers Lite (300 instances), Multimodal (517 issues with visuals), and Multilingual (300 instances across 9 programming languages) subsets for varied testing needs.
Leaderboards
Tracks model performance by the percentage of issues resolved, allowing direct comparison of AI systems.
Harness API
Provides Docker-based environments, automated grading, and test specifications to facilitate reproducible evaluations.
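
The grading rule and the leaderboard metric described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the harness's actual API: InstanceResult, is_resolved, and percent_resolved are hypothetical names, and the sketch assumes an instance counts as resolved when every fail-to-pass test and every pass-to-pass test passes after the candidate patch is applied.

```python
from dataclasses import dataclass, field


@dataclass
class InstanceResult:
    """Hypothetical record of test outcomes for one task instance."""
    instance_id: str
    # Test name -> True if the test passed after applying the patch.
    fail_to_pass: dict[str, bool] = field(default_factory=dict)
    pass_to_pass: dict[str, bool] = field(default_factory=dict)


def is_resolved(result: InstanceResult) -> bool:
    """Resolved means: all fail-to-pass and all pass-to-pass tests pass.

    Requiring at least one fail-to-pass test is an assumption of this sketch.
    """
    return (
        bool(result.fail_to_pass)
        and all(result.fail_to_pass.values())
        and all(result.pass_to_pass.values())
    )


def percent_resolved(results: list[InstanceResult]) -> float:
    """Percentage of instances resolved; 0.0 for an empty run."""
    if not results:
        return 0.0
    resolved = sum(is_resolved(r) for r in results)
    return 100.0 * resolved / len(results)


# Made-up outcomes for four hypothetical instances.
results = [
    InstanceResult("demo-1", {"test_fix": True}, {"test_old": True}),
    InstanceResult("demo-2", {"test_fix": False}, {"test_old": True}),
    InstanceResult("demo-3", {"test_fix": True}, {"test_old": False}),
    InstanceResult("demo-4", {"test_fix": False}, {"test_old": False}),
]
print(f"{percent_resolved(results):.1f}% resolved")
```

Only demo-1 counts as resolved here: demo-2 never fixes the issue, demo-3 fixes it but breaks existing behavior, and demo-4 does both.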