Strengths
- Includes a large and diverse set of 1,865 real-world software engineering tasks from 41 repositories.
- Reduces the risk of data contamination by combining GPL-licensed public data with proprietary private sets.
- Human-augmented task specifications improve clarity without changing technical difficulty.
- Provides Docker environments for reproducible evaluations.
- Offers separate public and commercial leaderboards for transparent performance tracking.
Limitations
- The held-out and commercial subsets, comprising 1,134 of the 1,865 instances, are not publicly accessible.
- Full-scale evaluation requires setting up Modal and Docker and managing credentials.
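
The counts in the two lists imply how much of the benchmark can be inspected and re-run publicly. A minimal sketch of that arithmetic, assuming the 1,865 total and the 1,134 non-public instances refer to the same task pool (the variable names are illustrative, not taken from the benchmark's tooling):

```python
# Counts quoted in the lists above; names are illustrative only.
TOTAL_TASKS = 1865
NON_PUBLIC_TASKS = 1134  # held-out + commercial subsets combined

# Whatever is not held out or commercial is assumed to be public.
public_tasks = TOTAL_TASKS - NON_PUBLIC_TASKS
public_share = public_tasks / TOTAL_TASKS

print(f"public tasks: {public_tasks}")       # public tasks: 731
print(f"public share: {public_share:.1%}")   # public share: 39.2%
```

Under these assumptions, roughly two fifths of the tasks are available for independent reproduction, which is the practical weight of the first limitation.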