Strengths & Limitations

Balanced assessment

Strengths

  • Includes a large and diverse set of 1,865 real-world software engineering tasks from 41 repositories.
  • Reduces the risk of data contamination by combining GPL-licensed public data with proprietary private sets.
  • Uses human-augmented task specifications that improve clarity without changing technical difficulty.
  • Provides Docker environments for reproducible evaluations.
  • Offers separate public and commercial leaderboards for transparent performance tracking.

Limitations

  • The held-out and commercial subsets, together 1,134 of the 1,865 instances, are not publicly accessible, leaving 731 available for open use.
  • Full-scale evaluation requires setting up Modal and Docker and managing credentials.