- Uses real GitHub issues and fixes from popular repositories for realistic evaluation.
- Multiple dataset subsets support different evaluation scopes and modalities.
- Leaderboards enable direct comparison of model performance through a single metric, % Resolved: the share of task instances a model's fix resolves.
- The harness API provides Docker-based reproducibility and automated grading.
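
As a rough illustration of the % Resolved metric mentioned in the list above, the sketch below computes it from per-instance outcomes. It assumes the metric is the number of resolved task instances divided by the total number of instances evaluated. The function name `percent_resolved`, the result format (a mapping from instance ID to a boolean), and the instance IDs are all hypothetical and are not taken from the harness itself.

```python
from typing import Mapping


def percent_resolved(results: Mapping[str, bool]) -> float:
    """Return the share of task instances marked resolved, as a percentage.

    `results` maps a task-instance ID to True (resolved) or False (not resolved).
    This format is an assumption made for illustration only.
    """
    if not results:
        raise ValueError("results must contain at least one task instance")
    resolved = sum(1 for ok in results.values() if ok)
    return 100.0 * resolved / len(results)


# Hypothetical outcomes with made-up IDs; this is not real benchmark data.
example_results = {
    "repo-a__issue-1": True,
    "repo-a__issue-2": False,
    "repo-b__issue-7": True,
    "repo-c__issue-3": False,
}

print(f"{percent_resolved(example_results):.1f}% resolved")  # prints "50.0% resolved"
```

Because the denominator is the full set of evaluated instances, scores are only comparable when they are computed over the same dataset subset.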