Strengths & Limitations

Includes a large and diverse set of 1,865 real-world software engineering tasks from 41 repositories.
Reduces the risk of data contamination by combining GPL-licensed public data with proprietary private sets.
Human-augmented task specifications improve clarity without changing technical difficulty.
Provides Docker environments for reproducible evaluations.
Offers separate public and commercial leaderboards for transparent performance tracking.

The held-out and commercial subsets, 1,134 of the 1,865 instances, are not publicly accessible.
Full-scale evaluation requires setting up Modal and Docker and managing credentials.
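
The setup burden noted in the last limitation can be reduced with a small pre-flight check before launching a run. The following is a minimal sketch, not part of the benchmark's own tooling: the function name `check_prerequisites` and its parameters are hypothetical, and it assumes Modal credentials are supplied either through the `MODAL_TOKEN_ID` and `MODAL_TOKEN_SECRET` environment variables or through a `~/.modal.toml` file. It only confirms that a `docker` executable is on the PATH; it does not verify that the Docker daemon is running or that the credentials are valid.

```python
import os
import shutil
from pathlib import Path


def check_prerequisites(env=None, which=shutil.which, home=None):
    """Return a list of problems found; an empty list means the basics are in place.

    Hypothetical helper for illustration. `env`, `which`, and `home` are
    injectable so the check can be exercised without touching the real system.
    """
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)
    problems = []

    # Docker: look for the CLI on PATH (does not check that the daemon is up).
    if which("docker") is None:
        problems.append("docker CLI not found on PATH")

    # Modal: assumed credential sources are two env vars or ~/.modal.toml.
    has_env_token = bool(env.get("MODAL_TOKEN_ID")) and bool(env.get("MODAL_TOKEN_SECRET"))
    has_config_file = (home / ".modal.toml").is_file()
    if not (has_env_token or has_config_file):
        problems.append("no Modal credentials found (env vars or ~/.modal.toml)")

    return problems


if __name__ == "__main__":
    issues = check_prerequisites()
    if issues:
        for issue in issues:
            print("missing:", issue)
    else:
        print("prerequisites look OK")
```

Running the script directly prints each missing prerequisite, which makes it easy to call from a CI step before starting a long evaluation.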