
S. Bench Pro

S. Bench Pro, also known as SWE-Bench Pro, is an AI benchmark developed by Scale AI designed to evaluate software engineering agents on real-world coding tasks. It includes 1,865 instances drawn from 41 repositories, covering tasks such as resolving GitHub issues that require multi-file code changes and long-horizon planning. The benchmark uses a combination of public, held-out, and commercial subsets to measure agent generalization while minimizing data contamination. Tasks are human-augmented for clarity and sourced from diverse codebases including business applications, B2B services, and developer tools. The benchmark provides Docker-based reproducible environments and maintains separate public and commercial leaderboards to track model performance.
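
For example, the public subset is distributed via Hugging Face and can be loaded with the datasets library. A minimal sketch follows; the dataset identifier, split name, and field names are assumptions based on SWE-Bench conventions, not details confirmed on this page:

    # Sketch: load the public SWE-Bench Pro subset and inspect one task.
    # Dataset ID, split, and field names are assumed from SWE-Bench conventions.
    from datasets import load_dataset

    ds = load_dataset("ScaleAI/SWE-bench_Pro", split="test")  # assumed ID/split
    print(len(ds))

    task = ds[0]
    # SWE-Bench-style instances typically pair a repository and base commit
    # with a problem statement describing the GitHub issue to resolve.
    print(task.get("repo"), task.get("base_commit"))
    print(task.get("problem_statement", "")[:300])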

Updated Feb 11, 2026 · open-source

S. Bench Pro is a comprehensive AI benchmark for evaluating software engineering agents on complex, real-world coding tasks.

Pricing: open-source
Category: Code & Development
Company: Scale AI
01. Includes 1,865 diverse tasks from 41 repositories, featuring long-horizon problems that can require hours to days of engineering work.
02. Uses GPL-licensed public data and proprietary private sets to reduce data contamination and ensure reliable evaluation.
03. Problem specifications are refined by humans to add context and clarity without altering the technical challenges.
04. Provides prebuilt Docker images for each task instance to enable reproducible evaluation setups.
05. Maintains separate public and commercial leaderboards to accurately measure model generalization on open-source and proprietary codebases.

AI Agent Evaluation

Developers and researchers can use S. Bench Pro to benchmark AI coding agents on realistic software engineering tasks.

Enterprise Testing

Enterprises can test AI agents' ability to generalize on proprietary codebases using the commercial subset and leaderboard.

1. Install Docker
Install Docker on your system and complete any OS-specific post-installation steps (for example, on Linux, configuring Docker to run as a non-root user).
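
Before running anything, it can help to confirm the Docker CLI and daemon are actually reachable. A minimal Python sketch (the docker commands are standard; the script itself is only illustrative):

    # Sketch: verify Docker is installed and the daemon is reachable.
    import shutil
    import subprocess

    if shutil.which("docker") is None:
        raise SystemExit("Docker CLI not found; install Docker first.")

    # `docker info` fails if the daemon is down or the user lacks permission
    # (on Linux, the post-install step of joining the `docker` group fixes that).
    result = subprocess.run(["docker", "info"], capture_output=True, text=True)
    if result.returncode != 0:
        raise SystemExit("Docker daemon not reachable:\n" + result.stderr)
    print("Docker is ready.")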
2. Store Modal Credentials
Use the provided commands to securely store the Modal credentials needed for scaled evaluation.
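
The page does not reproduce the exact commands, so the sketch below assumes they wrap Modal's standard token flow (the MODAL_TOKEN_ID / MODAL_TOKEN_SECRET variables and modal token set); check the repository's own instructions first:

    # Sketch: store Modal credentials from environment variables.
    # The env var names and `modal token set` follow Modal's documented
    # conventions, but the benchmark's "provided commands" may differ.
    import os
    import subprocess

    token_id = os.environ.get("MODAL_TOKEN_ID")
    token_secret = os.environ.get("MODAL_TOKEN_SECRET")
    if not (token_id and token_secret):
        raise SystemExit("Set MODAL_TOKEN_ID and MODAL_TOKEN_SECRET first.")

    subprocess.run(
        ["modal", "token", "set",
         "--token-id", token_id, "--token-secret", token_secret],
        check=True,
    )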
3. Clone the GitHub Repository
Clone the public SWE-Bench Pro GitHub repository to access the benchmark code and resources.
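
A one-line sketch; the repository URL below is a placeholder, since the page does not list it:

    # Sketch: clone the benchmark repository.
    # The URL is an assumption; use the official SWE-Bench Pro repository URL.
    import subprocess

    REPO_URL = "https://github.com/scaleapi/SWE-bench_Pro-os"  # assumed, verify
    subprocess.run(["git", "clone", REPO_URL], check=True)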
4. Pull the Docker Images
Pull the prebuilt Docker image for each task instance from hub.docker.com/r/jefzda/sweap-images.
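
The registry path comes from the page itself; the per-instance tag naming below is an assumption, so check the repository's documentation for the exact mapping:

    # Sketch: pull the prebuilt image for one task instance.
    # Registry path is from the page; the tag scheme (one tag per
    # instance ID) is an assumption.
    import subprocess

    instance_id = "example__repo-1234"  # hypothetical instance ID
    image = f"jefzda/sweap-images:{instance_id}"
    subprocess.run(["docker", "pull", image], check=True)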
5. Run the Evaluation Scripts
Run the evaluation scripts against the public dataset (distributed via Hugging Face) to benchmark your AI agent, then submit results to the official leaderboard.
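
Agent outputs are typically collected into a predictions file before scoring. The schema below follows the original SWE-Bench convention (instance_id plus model_patch per line); whether SWE-Bench Pro's scripts expect exactly this format is an assumption:

    # Sketch: write agent outputs in a SWE-Bench-style predictions file.
    # Field names follow the original SWE-Bench; verify against the Pro scripts.
    import json

    predictions = [
        {
            "instance_id": "example__repo-1234",  # hypothetical instance ID
            "model_name_or_path": "my-agent-v1",  # your agent's name
            "model_patch": "diff --git a/app.py b/app.py\n...",  # unified diff
        }
    ]
    with open("predictions.jsonl", "w") as f:
        for p in predictions:
            f.write(json.dumps(p) + "\n")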
6. View the Leaderboards
Monitor the live leaderboards to compare model performance on public and commercial tasks.
Pricing
Model: open-source

S. Bench Pro is a free public benchmark with open access to the public dataset. No pricing information is available for commercial subsets.

Assessment
Strengths
  • Includes a large and diverse set of 1,865 real-world software engineering tasks from 41 repositories.
  • Reduces data contamination by combining GPL-licensed public data with proprietary private sets.
  • Human-augmented task specifications improve clarity without changing technical difficulty.
  • Provides Docker environments for reproducible evaluations.
  • Offers separate public and commercial leaderboards for transparent performance tracking.
Limitations
  • The held-out and commercial subsets, comprising 1,134 instances, are not publicly accessible.
  • Full scaled evaluation requires setting up Modal and Docker and managing credentials.