Key Features - s-bench-pro

✨

Includes 1,865 diverse tasks from 41 repositories, including long-horizon problems that can take hours to days of engineering work.

✨

Combines GPL-licensed public data with proprietary private sets to reduce data contamination and make evaluation more reliable.

✨

Problem specifications are refined by humans to add context and clarity without altering the technical challenges.

✨

Provides prebuilt Docker images for each task instance to enable reproducible evaluation setups.

✨

Maintains separate public and commercial leaderboards, so that model generalization is measured independently on open-source and proprietary codebases.
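
As a rough illustration of what one prebuilt Docker image per task instance makes possible, the sketch below builds a `docker run` command for a single instance. The registry path `example.org/bench`, the instance ID format, and the `run_tests.sh` entry point are all hypothetical placeholders, not the benchmark's actual naming scheme; only `docker run --rm IMAGE COMMAND` is standard Docker CLI usage.

```python
import shlex


def build_eval_command(instance_id, registry="example.org/bench",
                       test_command="run_tests.sh"):
    """Build a `docker run` argument list for one task instance.

    Assumption: each task instance is published as an image tagged with its
    instance ID. The registry path and the test command are hypothetical.
    """
    image = f"{registry}:{instance_id}"
    # --rm removes the container after it exits, so every run starts
    # from the same pristine image state.
    return ["docker", "run", "--rm", image, *shlex.split(test_command)]


if __name__ == "__main__":
    # Print the command instead of executing it, since the image is made up.
    print(shlex.join(build_eval_command("task-0001")))
```

Because the environment is fixed inside the image, the same command run on two machines starts from an identical state, which is the reproducibility property the feature list refers to.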