
S. Bench Pro

S. Bench Pro, also known as SWE-Bench Pro, is an AI benchmark developed by Scale AI designed to evaluate software engineering agents on real-world coding tasks. It includes 1,865 instances drawn from 41 repositories, covering tasks such as resolving GitHub issues that require multi-file code changes and long-horizon planning. The benchmark uses a combination of public, held-out, and commercial subsets to measure agent generalization while minimizing data contamination. Tasks are human-augmented for clarity and sourced from diverse codebases including business applications, B2B services, and developer tools. The benchmark provides Docker-based reproducible environments and maintains separate public and commercial leaderboards to track model performance.
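
For example, the public subset is distributed via Hugging Face and can be loaded with the datasets library. A minimal sketch follows; the dataset identifier, split name, and field names are assumptions based on SWE-Bench conventions, not details confirmed on this page:

    # Sketch: load the public SWE-Bench Pro subset and inspect one task.
    # Dataset ID, split, and field names are assumed from SWE-Bench conventions.
    from datasets import load_dataset

    ds = load_dataset("ScaleAI/SWE-bench_Pro", split="test")  # assumed ID/split
    print(len(ds))

    task = ds[0]
    # SWE-Bench-style instances typically pair a repository and base commit
    # with a problem statement describing the GitHub issue to resolve.
    print(task.get("repo"), task.get("base_commit"))
    print(task.get("problem_statement", "")[:300])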

Updated Feb 11, 2026 · open-source

S. Bench Pro is a comprehensive AI benchmark for evaluating software engineering agents on complex, real-world coding tasks.

Pricing: open-source
Category: Code & Development
Company: Scale AI
01. Includes 1,865 diverse tasks from 41 repositories, featuring long-horizon problems that can require hours to days of engineering work.
02. Uses GPL-licensed public data and proprietary private sets to reduce data contamination and ensure reliable evaluation.
03. Problem specifications are refined by humans to add context and clarity without altering the technical challenges.
04. Provides prebuilt Docker images for each task instance to enable reproducible evaluation setups.
05. Maintains separate public and commercial leaderboards to accurately measure model generalization on open-source and proprietary codebases.

AI Agent Evaluation

Developers and researchers can use S. Bench Pro to benchmark AI coding agents on realistic software engineering tasks.

Enterprise Testing

Enterprises can test AI agents' ability to generalize on proprietary codebases using the commercial subset and leaderboard.

1. Install Docker
Install Docker on your system and complete any OS-specific post-installation steps (for example, on Linux, configuring Docker to run as a non-root user).
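
Before running anything, it can help to confirm the Docker CLI and daemon are actually reachable. A minimal Python sketch (the docker commands are standard; the script itself is only illustrative):

    # Sketch: verify Docker is installed and the daemon is reachable.
    import shutil
    import subprocess

    if shutil.which("docker") is None:
        raise SystemExit("Docker CLI not found; install Docker first.")

    # `docker info` fails if the daemon is down or the user lacks permission
    # (on Linux, the post-install step of joining the `docker` group fixes that).
    result = subprocess.run(["docker", "info"], capture_output=True, text=True)
    if result.returncode != 0:
        raise SystemExit("Docker daemon not reachable:\n" + result.stderr)
    print("Docker is ready.")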
2. Store Modal Credentials
Use the provided commands to securely store the Modal credentials needed for scaled evaluation.
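
The page does not reproduce the exact commands, so the sketch below assumes they wrap Modal's standard token flow (the MODAL_TOKEN_ID / MODAL_TOKEN_SECRET variables and modal token set); check the repository's own instructions first:

    # Sketch: store Modal credentials from environment variables.
    # The env var names and `modal token set` follow Modal's documented
    # conventions, but the benchmark's "provided commands" may differ.
    import os
    import subprocess

    token_id = os.environ.get("MODAL_TOKEN_ID")
    token_secret = os.environ.get("MODAL_TOKEN_SECRET")
    if not (token_id and token_secret):
        raise SystemExit("Set MODAL_TOKEN_ID and MODAL_TOKEN_SECRET first.")

    subprocess.run(
        ["modal", "token", "set",
         "--token-id", token_id, "--token-secret", token_secret],
        check=True,
    )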
3. Clone the GitHub Repository
Clone the public SWE-Bench Pro GitHub repository to access the benchmark code and resources.
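
A one-line sketch; the repository URL below is a placeholder, since the page does not list it:

    # Sketch: clone the benchmark repository.
    # The URL is an assumption; use the official SWE-Bench Pro repository URL.
    import subprocess

    REPO_URL = "https://github.com/scaleapi/SWE-bench_Pro-os"  # assumed, verify
    subprocess.run(["git", "clone", REPO_URL], check=True)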
4. Pull the Docker Images
Pull the prebuilt Docker image for each task instance from hub.docker.com/r/jefzda/sweap-images.
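
The registry path comes from the page itself; the per-instance tag naming below is an assumption, so check the repository's documentation for the exact mapping:

    # Sketch: pull the prebuilt image for one task instance.
    # Registry path is from the page; the tag scheme (one tag per
    # instance ID) is an assumption.
    import subprocess

    instance_id = "example__repo-1234"  # hypothetical instance ID
    image = f"jefzda/sweap-images:{instance_id}"
    subprocess.run(["docker", "pull", image], check=True)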
5. Run the Evaluation Scripts
Run the evaluation scripts against the public dataset (distributed via Hugging Face) to benchmark your AI agent, then submit results to the official leaderboard.
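
Agent outputs are typically collected into a predictions file before scoring. The schema below follows the original SWE-Bench convention (instance_id plus model_patch per line); whether SWE-Bench Pro's scripts expect exactly this format is an assumption:

    # Sketch: write agent outputs in a SWE-Bench-style predictions file.
    # Field names follow the original SWE-Bench; verify against the Pro scripts.
    import json

    predictions = [
        {
            "instance_id": "example__repo-1234",  # hypothetical instance ID
            "model_name_or_path": "my-agent-v1",  # your agent's name
            "model_patch": "diff --git a/app.py b/app.py\n...",  # unified diff
        }
    ]
    with open("predictions.jsonl", "w") as f:
        for p in predictions:
            f.write(json.dumps(p) + "\n")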
6. View the Leaderboards
Monitor the live leaderboards to compare model performance on public and commercial tasks.
Pricing
Model: open-source

S. Bench Pro is a free public benchmark with open access to the public dataset. No pricing information is available for commercial subsets.

Assessment
Strengths
  • Includes a large and diverse set of 1,865 real-world software engineering tasks from 41 repositories.
  • Reduces data contamination by combining GPL-licensed public data with proprietary private sets.
  • Human-augmented task specifications improve clarity without changing technical difficulty.
  • Provides Docker environments for reproducible evaluations.
  • Offers separate public and commercial leaderboards for transparent performance tracking.
Limitations
  • The held-out and commercial subsets, comprising 1,134 instances, are not publicly accessible.
  • Full scaled evaluation requires setting up Modal and Docker and managing credentials.