Extensive Task Set
Includes 1,865 diverse tasks from 41 repositories, featuring long-horizon problems that can require hours to days of engineering work.
Contamination-Resistant Data
Draws on GPL-licensed public data and proprietary private sets, both of which are less likely to appear in model training corpora, to reduce data contamination and make evaluation results more reliable.
Human-Augmented Problem Specifications
Problem specifications are refined by humans to add context and clarity without altering the technical challenges.
Docker-Based Reproducible Environments
Provides prebuilt Docker images for each task instance to enable reproducible evaluation setups.
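As a rough illustration of how a per-instance image might be used, the sketch below builds a `docker run` command for one task and reports whether its tests passed. The registry prefix, the image naming scheme, and the helper names (`image_for_instance`, `build_run_command`, `run_instance`) are all assumptions made for this example, not the benchmark's actual tooling.

```python
import subprocess

# Hypothetical registry prefix; the benchmark's real image names are not stated here.
REGISTRY = "example.registry.io/benchmark"


def image_for_instance(instance_id: str, registry: str = REGISTRY) -> str:
    """Map a task instance ID to an image reference (assumed naming scheme)."""
    # Docker tags may contain ASCII letters, digits, underscores, periods,
    # and hyphens, so every other character is replaced with an underscore.
    safe = "".join(
        c if (c.isascii() and c.isalnum()) or c in "_.-" else "_"
        for c in instance_id
    )
    return f"{registry}:{safe}"


def build_run_command(image: str, test_command: str, network: bool = False) -> list[str]:
    """Build the argument list for running a test command in a throwaway container."""
    cmd = ["docker", "run", "--rm"]
    if not network:
        # Disabling networking keeps the run independent of external services.
        cmd += ["--network", "none"]
    cmd += [image, "sh", "-c", test_command]
    return cmd


def run_instance(instance_id: str, test_command: str, timeout_s: int = 1800) -> bool:
    """Run the tests for one instance; True means the command exited with status 0."""
    cmd = build_run_command(image_for_instance(instance_id), test_command)
    result = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
    return result.returncode == 0
```

Because every run starts from the same prebuilt image and the container is discarded afterwards (`--rm`), repeated evaluations of one instance begin from an identical filesystem state.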
Dual Leaderboards
Maintains separate public and commercial leaderboards so that model generalization can be measured independently on open-source and proprietary codebases.