S. Bench Pro
S. Bench Pro, also known as SWE-Bench Pro, is a benchmark developed by Scale AI to evaluate AI software engineering agents on real-world coding tasks. It includes 1,865 instances drawn from 41 repositories. Typical tasks, such as resolving GitHub issues, require multi-file code changes and long-horizon planning. The benchmark is split into public, held-out, and commercial subsets to measure how well agents generalize while minimizing data contamination. Tasks are human-augmented for clarity and sourced from diverse codebases, including business applications, B2B services, and developer tools. The benchmark provides reproducible Docker-based environments and maintains separate public and commercial leaderboards to track model performance.
S. Bench Pro is a comprehensive AI benchmark for evaluating software engineering agents on complex, real-world coding tasks.
AI Agent Evaluation
Developers and researchers can use S. Bench Pro to benchmark AI coding agents on realistic software engineering tasks.
Enterprise Testing
Enterprises can test how well AI agents generalize to proprietary codebases using the commercial subset and leaderboard.
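
Because the benchmark keeps separate public and commercial leaderboards, it may help to see how per-subset scores could be tallied from an agent's run. The snippet below is a minimal sketch, not the benchmark's own tooling: it assumes each task yields a simple pass/fail outcome, and the `resolve_rates` helper, the record layout, and the sample outcomes are all hypothetical.

```python
from collections import defaultdict


def resolve_rates(results):
    """Return {subset: fraction of tasks resolved}.

    `results` is an iterable of (subset_name, resolved) pairs. This record
    layout is an assumption made for illustration only.
    """
    totals = defaultdict(int)
    resolved = defaultdict(int)
    for subset, ok in results:
        totals[subset] += 1
        if ok:
            resolved[subset] += 1
    return {subset: resolved[subset] / totals[subset] for subset in totals}


# Hypothetical outcomes, NOT real benchmark data.
sample = [
    ("public", True),
    ("public", False),
    ("public", True),
    ("commercial", False),
    ("commercial", False),
    ("commercial", True),
]

rates = resolve_rates(sample)
for subset, rate in sorted(rates.items()):
    print(f"{subset}: {rate:.1%}")
```

Reporting the subsets separately, rather than pooling them into one number, keeps any gap between public and proprietary code visible. That gap is what the commercial subset is meant to expose.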