Terminal-Bench 2.0
Terminal-Bench 2.0 is an updated benchmark and evaluation harness for assessing AI agents' performance on terminal-based tasks. It provides a dataset of roughly 89 tasks covering real-world software engineering challenges such as compiling code, training models, setting up servers, and fixing security vulnerabilities. Each task includes English instructions, test scripts for verification, and a reference solution, all executed in containerized environments built from Docker images. The update from version 1.0 addresses earlier problems with task reliability through manual and language-model-assisted verification.

The benchmark connects an AI agent or language model to a terminal sandbox and measures its success rate on tasks that test terminal mastery. It ships a command-line interface for running evaluations and supports custom Docker configurations. Terminal-Bench 2.0 also features a public leaderboard that tracks agent performance and an adapter system for adding custom tasks. The project is open source, actively maintained, and supported by a community of contributors and users.
Terminal-Bench 2.0 is an open-source benchmark for evaluating AI agents on terminal-based software engineering tasks using containerized environments.
AI Agent Performance Evaluation
Developers and researchers can benchmark AI agents on terminal-based tasks such as code compilation, server setup, and vulnerability fixing.
Custom Task Integration
Users can extend the benchmark by adding new tasks or modifying existing ones to suit specific evaluation needs.
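Since the description names the components every task carries (an English instruction, verification test scripts, a reference solution, and a Docker image), a custom task plausibly bundles those pieces in one directory. The layout and file names below are assumptions for illustration, not the harness's confirmed schema; check the project's task documentation for the actual format.

```
my-task/                 # hypothetical custom task directory
├── task.yaml            # assumed: English instruction and task metadata
├── Dockerfile           # assumed: container environment the agent works in
├── solution.sh          # assumed: reference solution used to validate the task
└── run-tests.sh         # assumed: test script that verifies the final state
```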
Run uv tool install terminal-bench or pip install terminal-bench to install the package.
Use tb or tb run to execute benchmark tasks and evaluate AI agents.
Set use_prebuilt_image=false in CLI commands or Python evaluation scripts to use custom Docker images.
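A typical install-and-run session might look like the sketch below. Only the install commands, the tb / tb run entry points, and the use_prebuilt_image=false setting come from the text above; the --dataset and --agent flag values are illustrative assumptions, so consult tb run --help for the options your installed version actually supports.

```shell
# Install the CLI with either installer (both come from the text above):
uv tool install terminal-bench
# or: pip install terminal-bench

# Run an evaluation. The --dataset and --agent values are illustrative
# assumptions, not confirmed defaults; see `tb run --help` for real options.
tb run --dataset terminal-bench-core --agent my-agent

# Build task environments locally instead of pulling prebuilt images,
# using the setting named in the text above:
tb run --dataset terminal-bench-core --agent my-agent use_prebuilt_image=false
```

Because each run launches Docker containers for every task, a working Docker installation is a practical prerequisite for the commands above.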