Key Features

What you can do

Extensive Task Set

Includes 1,865 diverse tasks from 41 repositories, featuring long-horizon problems that can require hours to days of engineering work.

Contamination-Resistant Data

Draws public tasks from GPL-licensed repositories and private tasks from proprietary codebases, which reduces the risk that tasks appear in model training data and keeps evaluation results reliable.

Human-Augmented Problem Specifications

Problem specifications are refined by humans to add context and clarity without altering the technical challenges.

Docker-Based Reproducible Environments

Provides prebuilt Docker images for each task instance to enable reproducible evaluation setups.
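As a rough illustration of how a per-task image might be used, the sketch below builds a `docker run` command that applies a candidate patch inside a throwaway container and then runs the task's tests. The image name, the patch path inside the container, and the test command are all hypothetical placeholders, not names taken from the benchmark. The sketch also assumes that the image ships `bash` and `git` and that its working directory is the task repository.

```python
import shlex
from pathlib import Path


def build_eval_command(image: str, patch_path: Path, test_command: str) -> list[str]:
    """Return the argv for a throwaway container that evaluates one task.

    Assumption: the image's default working directory is the repository
    checkout, and `git` and `bash` are available inside it.
    """
    return [
        "docker", "run",
        "--rm",                # delete the container afterwards so runs stay independent
        "--network", "none",   # no network access, so results do not depend on external state
        # Mount the candidate patch read-only at a fixed (hypothetical) path.
        "-v", f"{patch_path.resolve()}:/tmp/candidate.patch:ro",
        image,
        "bash", "-c",
        f"git apply /tmp/candidate.patch && {test_command}",
    ]


if __name__ == "__main__":
    cmd = build_eval_command(
        image="example.org/benchmark/task-0001:latest",  # hypothetical image name
        patch_path=Path("candidate.patch"),               # hypothetical patch file
        test_command="pytest -q",                         # hypothetical test command
    )
    # Print the command instead of running it, so the sketch works without Docker.
    print(shlex.join(cmd))
```

To execute the command for real, pass the returned list to `subprocess.run` and inspect the exit code. Because the container starts from the same prebuilt image each time, repeated runs begin from an identical environment.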

Dual Leaderboards

Maintains separate public and commercial leaderboards so that model generalization is measured independently on open-source and proprietary codebases.