Key Features

What you can do

Comprehensive Dataset

Includes 2,294 GitHub issue-fix pairs from 12 popular Python repositories for realistic software engineering evaluation.

Verified Subset

Contains 500 instances that human annotators confirmed to be solvable, each with fail-to-pass tests (which the fix must make pass) and pass-to-pass tests (which must keep passing).
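To make the two test categories concrete, here is a minimal sketch of how a single instance could be judged. The function name `is_resolved` and the shape of the `results` mapping are assumptions made for illustration, not the benchmark's actual grading code.

```python
from typing import Dict, List


def is_resolved(
    fail_to_pass: List[str],
    pass_to_pass: List[str],
    results: Dict[str, bool],
) -> bool:
    """Return True if every required test passed after the candidate patch.

    `results` is a hypothetical mapping from test name to True (passed)
    or False (failed). A test missing from `results` counts as failed.
    """
    required = list(fail_to_pass) + list(pass_to_pass)
    return all(results.get(name, False) for name in required)


# Illustrative test names only.
outcome = is_resolved(
    fail_to_pass=["test_bug_is_fixed"],
    pass_to_pass=["test_existing_behavior"],
    results={"test_bug_is_fixed": True, "test_existing_behavior": True},
)
print(outcome)
```

Treating a missing test result as a failure is a deliberately conservative choice in this sketch: a patch that prevents a test from running does not count as passing it.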

Multiple Subsets

Offers Lite (300 instances), Multimodal (517 issues with visuals), and Multilingual (300 instances in 9 languages) subsets for varied testing needs.

Leaderboards

Tracks model performance by the percentage of issues resolved, allowing direct comparison of AI systems.
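The metric itself is simple arithmetic. The sketch below assumes a hypothetical `outcomes` mapping from instance ID to a resolved flag, and assumes that the denominator is every instance in the mapping, so unattempted instances should be recorded as `False`.

```python
from typing import Dict


def percent_resolved(outcomes: Dict[str, bool]) -> float:
    """Percentage of instances marked resolved; 0.0 for an empty mapping."""
    if not outcomes:
        return 0.0
    return 100.0 * sum(outcomes.values()) / len(outcomes)


# Illustrative instance IDs only: 3 of 4 resolved.
example = {
    "repo-a-1": True,
    "repo-a-2": True,
    "repo-b-1": True,
    "repo-b-2": False,
}
print(percent_resolved(example))  # 75.0
```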

Harness API

Provides Docker-based environments, automated grading, and test specifications to facilitate reproducible evaluations.